Unstructured data represents any data that does not have a recognizable structure. It is unorganized and raw and can be non-textual or textual. For example, email is a fine illustration of unstructured textual data. It includes time, date, recipient and sender details and subject, etc., but an email body remains unstructured. Unstructured data...
Canonicalization is the process of converting data that has more than one possible representation into a single standard (canonical) form. Once data conforms to canonical rules, different representations can be compared for equivalence, distinct data structures can be counted, a meaningful sort order can be imposed, and algorithms can run more efficiently because repeated calculations are avoided. Canonicalization is used in numerous Internet and computer applications to generate canonical data from noncanonical information. Canonical representations of data are widely used in search engine optimization (SEO), Web servers, Unicode and XML. The process is also known as C14N, standardization or normalization.
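As a rough illustration of comparing representations for equivalence and counting distinct values, the following Python sketch canonicalizes strings before comparing them. The helper name `canonical` and the choice of NFC normalization plus case folding are assumptions made for this example, not a rule prescribed by the definition above.

```python
import unicodedata


def canonical(text: str) -> str:
    """Return an NFC-normalized, case-folded form of text (illustrative rule)."""
    return unicodedata.normalize("NFC", text).casefold()


composed = "caf\u00e9"      # 'é' stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The raw strings differ even though they display identically.
print(composed == decomposed)                        # False

# Their canonical forms are equal, so they can be treated as equivalent.
print(canonical(composed) == canonical(decomposed))  # True

# Counting distinct values is only meaningful on the canonical forms.
words = [composed, decomposed, "Caf\u00e9"]
print(len({canonical(w) for w in words}))            # 1
```

Canonicalizing once, at the point where data enters a system, means later comparisons and lookups can use plain equality instead of repeating the conversion.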
In SEO, URL canonicalization deals with Web content that can be reached through more than one URL. This can create discrepancies in search results because the search engine may not be aware of which URL should be displayed. Canonicalization picks the best URL from the available choices, most often for home pages. Although certain URLs appear to be the same, Web servers can return different results for each of them, so search engines treat only one URL as the canonical form.
File name canonicalization matters for computer security. Some Web servers have a security rule that allows files to be executed only under a particular directory, so a file is executed only if its path contains the specified directory. Because the same file can be named by more than one path (for example, a path containing ".." segments), the check is reliable only when it is applied to the unique, canonical representation of the file name. A server that skips this step is exposed to a vulnerability called directory traversal.
Many characters in the Unicode standard have variable-length encodings, and some can be represented in more than one way. String validation must therefore consider every character, which makes it more complex. If a software implementation does not account for all character encodings, bugs become possible. The problem can be eliminated by using a single encoding for every character. The best alternative available to any software is to check whether a string is canonicalized and to reject strings that are not.
A canonical XML document is an XML document in XML canonical form, as defined by the Canonical XML specification. Canonicalization in XML removes unnecessary white space within tags, sorts namespace references and eliminates redundant ones, and uses a particular character encoding. It also removes XML and DOCTYPE declarations, in addition to transforming relative URLs into absolute URLs.
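Returning to file name canonicalization, the directory check described above can be sketched in Python as follows. The function name `is_inside` and the directory `/srv/www/cgi-bin` are hypothetical, chosen only for illustration; this is a minimal sketch of the idea on a POSIX-style file system, not a complete defense.

```python
import os


def is_inside(base_dir: str, requested: str) -> bool:
    """Return True if `requested` resolves to a path under `base_dir`.

    Both paths are canonicalized first, so '..' segments and symbolic
    links cannot disguise where the file really lives.
    """
    base = os.path.realpath(base_dir)
    target = os.path.realpath(os.path.join(base, requested))
    return os.path.commonpath([base, target]) == base


BASE = "/srv/www/cgi-bin"  # hypothetical directory from which execution is allowed

# A naive check on the raw string is fooled: the path "contains" the directory.
raw = os.path.join(BASE, "../../../etc/passwd")
print(raw.startswith(BASE))                      # True

# The check on the canonical path rejects the same request.
print(is_inside(BASE, "../../../etc/passwd"))    # False
print(is_inside(BASE, "tools/../report.py"))     # True
```

The comparison is made only after both paths are in canonical form, which is the point of the technique: two different spellings of the same location compare equal, and a spelling that escapes the directory is exposed.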