How Structured Is Your Data? Examining Structured, Unstructured and Semi-Structured Data

Historically, data analysts could interpret and extract information from only one type of data: structured data. This type of data was easily searchable because of its clear patterns, but it represented only a small percentage of the total data available.

Unstructured data included video, audio, emails, and data coming from social media and mobile devices. It was, hands down, the largest reserve of raw information available, yet no one was able to tap this resource reliably.

Things have changed, however, as cheaper storage and greater processing power gave rise to unstructured data analytics, a new and therefore still immature technology. Business intelligence efforts are taking full advantage of this opportunity, and substantial investments are being made to combine structured and unstructured data analytics and tap this apparently endless goldmine of information.

Let's have a look at these data formats to understand their differences, and what the future holds for all data analysts.

What Is Structured Data?

Structured data is human- or machine-generated, highly organized information that can be easily stored in the row-and-column tables of relational databases (RDBs). It is anything that exists in a format which can be easily captured, stored and organized in an RDB structure to be analyzed later. (To learn more about databases, check out our Introduction to Databases.)

Examples include ZIP codes, phone numbers, and user demographics such as age or gender. Data found in these databases can be queried with Structured Query Language (SQL), and similar tabular data kept in Excel spreadsheets can be looked up with functions such as VLOOKUP. Algorithms can also quickly search the various fields using their indexes, or their numerical and alphabetical values. However, all data is strictly defined in terms of field type and name, so the ability to store, query and analyze it is restricted to whatever fits that predefined layout.
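To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customers table, its column names and the sample rows are invented purely for illustration; the point is only that fixed field names and types make a query unambiguous.

```python
import sqlite3

# Hypothetical table and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    " id INTEGER PRIMARY KEY,"
    " zip_code TEXT NOT NULL,"
    " phone TEXT NOT NULL,"
    " age INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (zip_code, phone, age) VALUES (?, ?, ?)",
    [
        ("10001", "555-0100", 34),
        ("94105", "555-0101", 27),
        ("10001", "555-0102", 51),
    ],
)

# An index lets the database look up a field without scanning every row.
conn.execute("CREATE INDEX idx_customers_zip ON customers (zip_code)")

# Every field has a fixed name and type, so the question can be asked precisely.
rows = conn.execute(
    "SELECT phone, age FROM customers WHERE zip_code = ? ORDER BY age",
    ("10001",),
).fetchall()
conn.close()

print(rows)
```

The same rigidity that makes the query easy is also the restriction mentioned above: a value that does not fit one of the declared columns has nowhere to go.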

What Is Unstructured Data?

The vast majority of data found in an organization is unstructured, and some estimates put it at up to 80 percent of all data currently available. By definition, unstructured data is everything that has no identifiable internal structure. However, some types of data falling into this category do have a vague internal structure of their own; it simply does not conform to a database or spreadsheet.

Most business data is unstructured, ranging from customer service interactions, text files, web logs, videos and other multimedia content to sales automation records, emails and social media posts. There's no need to explain how valuable this data could be if it could be mined, organized and analyzed.

Most unstructured data is generated by humans, and is thus made to be understood by other humans. This means that computers, which work best with neat and orderly input, struggle to understand this type of information, since it is too distant from the linearity of machine language and structured databases.

Falling in Between: Semi-Structured Data

Semi-structured data is a third type of data that represents a much smaller piece of the whole pie (5-10 percent). Caught in between both worlds, semi-structured data contains internal semantic tags and markings that identify separate elements, but lacks the structure required to fit in a relational database.

For example, emails might seem like structured data since they can be categorized by date, file size or time. However, they are not, since the most valuable information is the text found within them, rather than their relatively simple labels. Emails can't be truly arranged by content and subject, since humans do not write in patterns strict enough for a machine to understand them unequivocally. Other examples of semi-structured data include data stored in NoSQL databases, documents in the open standard JSON and the markup language XML.
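A short sketch can show what "caught in between" looks like in practice. It uses only Python's standard email and json modules; the message, addresses and records are made up for illustration. The labels are easy for a machine to pull out, while the body stays free text, and the two JSON records carry tags without sharing a fixed set of fields.

```python
import json
from email import message_from_string

# A made-up message: tagged headers on top, free-form text underneath.
raw = (
    "From: alice@example.com\n"
    "To: support@example.com\n"
    "Subject: Order arrived damaged\n"
    "Date: Mon, 01 Jan 2024 10:00:00 +0000\n"
    "\n"
    "Hi, the box was crushed and I would like a replacement.\n"
)
msg = message_from_string(raw)

# The labels are trivially machine-readable...
headers = {"sender": msg["From"], "subject": msg["Subject"], "date": msg["Date"]}
# ...but the valuable part is still plain human language.
body = msg.get_payload()

# Made-up JSON records: each one is tagged, yet they do not share one schema.
records = json.loads(
    '[{"name": "Ann", "tags": ["vip"]}, {"name": "Bob", "phone": "555-0100"}]'
)
fields_per_record = [sorted(record) for record in records]

print(headers)
print(body)
print(fields_per_record)
```

Forcing the second part into a relational table would mean deciding up front which of those fields every record must have, which is exactly the structure this kind of data lacks.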

Semi-structured data is usually queried and cataloged for analysis through its metadata. For example, an X-ray scan consists of a huge number of pixels that form the image; these pixels are inherently unstructured data that cannot be queried directly. However, the scan file will still include a metadata portion that provides information about it, such as annotations and a user ID.
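As a rough illustration of the idea, the sketch below treats the image content as an opaque blob and searches only the metadata attached to it. The ScanFile class, the find_by_metadata helper and the field names are hypothetical and do not reflect any real medical imaging format.

```python
from dataclasses import dataclass, field

@dataclass
class ScanFile:
    """Hypothetical scan file: opaque pixels plus a searchable metadata portion."""
    pixels: bytes
    metadata: dict = field(default_factory=dict)

# Made-up files; the pixel bytes are placeholders, not real image data.
scans = [
    ScanFile(b"\x00\x01", {"user_id": "u17", "annotation": "left wrist"}),
    ScanFile(b"\x02\x03", {"user_id": "u42", "annotation": "chest"}),
    ScanFile(b"\x04\x05", {"user_id": "u17", "annotation": "chest"}),
]

def find_by_metadata(files, **criteria):
    """Return the files whose metadata matches every given key/value pair."""
    return [
        f for f in files
        if all(f.metadata.get(key) == value for key, value in criteria.items())
    ]

# The pixels are never inspected; only the metadata is queried.
chest_scans = find_by_metadata(scans, annotation="chest")
print([scan.metadata["user_id"] for scan in chest_scans])
```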

Can Unstructured Data Be Transformed into Structured Data?

The fundamental challenge that every data analyst must face is to organize the information at hand in a neat, orderly way so it can be accessed and understood. Data mining tools are usually not equipped to parse information that is, by definition, so close to natural human language, meaning that often only another human can collect and categorize it.

However, the sheer volume of unstructured data makes any attempt at storing or organizing it extremely laborious and expensive. The pool of information coming from, say, a web-based search engine is so massive that extracting even its most basic elements requires a huge investment of work and resources. Even the most efficient data mining techniques still miss a substantial amount of information found on the web and, even worse, inside the deep web.

But techniques do exist, and they're being developed at an amazing speed. For example, metadata can be used to connect structured and unstructured data. Harvested information can be filtered and indexed by both users and algorithms so that only relevant data is analyzed. Other solutions include "data wrangling," a process through which complex data is progressively organized, step by step, by non-technical users. (For more on ordinary users handling data, see How Big Data Can Help in Self-Service Analytics.)
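The step-by-step flavor of this kind of work can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the notes are invented, and simple regular expressions stand in for whatever extraction tool would be used in practice. Each step adds a little structure to free text, and the final step filters down to the relevant records.

```python
import re

# Made-up free-text notes, standing in for unstructured input.
notes = [
    "Customer called on 2024-03-02 about order #1182, phone 555-0100.",
    "No contact details given, just a general complaint.",
    "Follow-up 2024-03-05 re order #1190.",
]

# Step 1: describe the fragments worth pulling out (illustrative patterns only).
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
ORDER = re.compile(r"#(\d+)")

def first_match(pattern, text):
    """Return the first match (or its first group), or None if nothing matches."""
    match = pattern.search(text)
    if match is None:
        return None
    return match.group(1) if pattern.groups else match.group(0)

# Step 2: turn each note into a row with named fields, keeping the original text.
rows = [
    {"date": first_match(DATE, note), "order_id": first_match(ORDER, note), "text": note}
    for note in notes
]

# Step 3: filter so that only the relevant rows go on to analysis.
relevant = [row for row in rows if row["order_id"] is not None]

for row in relevant:
    print(row["date"], row["order_id"])
```

Patterns like these capture only what they were written to find, which is why the second note yields no fields at all; that gap is the part that still needs a human or a more capable tool.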

At some point, we will be able to efficiently transform these massive amounts of unorganized information into a more organized, structured format. Maybe not today, maybe not tomorrow, but soon we will be able to raid the biggest vault humankind has ever seen: big data.