When it comes to the data that is intended to drive business decisions, you can't afford to just take it at face value.
You have to be assured of its quality, and that process starts with data profiling: examining the data available in a data source and collecting statistics and information about it. Those statistics form the basis for assessing the data’s quality.
What is Data Profiling?
Data profiling is necessary for data warehousing as well as business intelligence projects. The profiling itself entails applying algorithms to the data sets in question to better understand their “qualitative characteristics,” explains Business Intelligence. The goal is “to discover metadata when it is not available and to validate metadata when it is available.” That can alert you to metadata anomalies.
Accordingly, data profiling encompasses not just content but also structure discovery to be sure that the data is consistently formatted. (Read How Structured Is Your Data? Examining Structured, Unstructured and Semi-Structured Data.)
More importantly, for predictive analytics, it allows for identifying relationships between data sets that provide insight into key correlations. (Read Predictive Analytics in the Real World: What Does It Look Like?)
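As a rough illustration of the structure discovery mentioned above, one common way to check for consistent formatting is to reduce every value to a "shape" and count the shapes. The sketch below assumes a hypothetical list of phone numbers; the `value_pattern` helper and the sample values are invented for illustration and do not come from any particular profiling tool.

```python
import re
from collections import Counter


def value_pattern(value):
    """Reduce a string to its shape: digits become 9, letters become A."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)


# Hypothetical sample column; the last entry contains the letter O, not a zero.
phone_numbers = ["555-0100", "555-0101", "5550102", "555-01O3"]

# Count how many values share each shape.
pattern_counts = Counter(value_pattern(v) for v in phone_numbers)

for pattern, count in pattern_counts.most_common():
    print(pattern, count)
```

A dominant shape with a few rare outliers usually points to formatting problems worth a closer look.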
Best Data Profiling Techniques
A data analyst can profile data manually. However, given the huge amount of data that just about all organizations have to contend with, it would be very time-consuming and difficult to manage without software-enabled automation.
Data Source Consulting cites numerous benefits to the automated approach. One is speed: manual data profiling takes 3 to 5 hours for each attribute, whereas automated profiling can handle an attribute in under 30 minutes.
Another is thoroughness: “With a manual approach, generally only a subset of the attributes and the rows are tested; with a data profiling tool, a thorough evaluation of the data can be performed.” The automated approach also lends itself better to centralized information that can be more easily shared by teams.
Three primary ways to approach data profiling are outlined by DZone:
Column profiling counts the number of times every value appears within each column in a table. This method helps to uncover the patterns within your data.
Cross-column profiling looks across columns to perform key and dependency analysis. Key analysis scans collections of values in a table to locate a potential primary key. Dependency analysis determines the dependent relationships within a data set. Together, these analyses determine the relationships and dependencies within a table.
Cross-table profiling looks across tables to identify potential foreign keys. It also attempts to determine the similarities and differences in syntax and data types between tables to determine which data might be redundant and which could be mapped together.
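The three approaches above can be sketched in a few lines of Python. This is a minimal illustration, not how any specific profiling product works: the `customers` and `orders` tables, their columns, and the three helper functions are all hypothetical, and the tables are represented as simple lists of dictionaries.

```python
from collections import Counter

# Hypothetical sample tables: one dict per row.
customers = [
    {"customer_id": 1, "country": "US"},
    {"customer_id": 2, "country": "US"},
    {"customer_id": 3, "country": "CA"},
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 1},
    {"order_id": 12, "customer_id": 3},
]


def column_profile(rows, column):
    """Column profiling: count how often each value appears in one column."""
    return Counter(row[column] for row in rows)


def candidate_keys(rows):
    """Key analysis: columns whose values are all non-null and unique."""
    keys = []
    for column in rows[0]:
        values = [row[column] for row in rows]
        if None not in values and len(set(values)) == len(values):
            keys.append(column)
    return keys


def is_foreign_key_candidate(child_rows, child_col, parent_rows, parent_col):
    """Cross-table profiling: every child value must exist in the parent column."""
    parent_values = {row[parent_col] for row in parent_rows}
    return all(row[child_col] in parent_values for row in child_rows)


country_counts = column_profile(customers, "country")
customer_keys = candidate_keys(customers)
fk_ok = is_foreign_key_candidate(orders, "customer_id", customers, "customer_id")

print(country_counts)
print(customer_keys)
print(fk_ok)
```

On real data these checks only suggest candidates. A column can be unique in a sample by coincidence, so an analyst would still need to confirm any key or foreign key the profile proposes.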
Whichever approach is taken, there is an additional step in the process of data profiling called “rule validation.” The rules offer a way to confirm that the data in the system is correct.
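Rule validation might look something like the following sketch. The rules, field names, and sample records here are assumptions made up for illustration; in practice the rules would come from the business and the data would come from the system being profiled.

```python
# Hypothetical rules: each maps a rule name to a check on a single record.
rules = {
    "age_in_range": lambda r: r["age"] is not None and 0 <= r["age"] <= 120,
    "email_has_at_sign": lambda r: "@" in (r["email"] or ""),
}

# Hypothetical records to validate.
records = [
    {"id": 1, "age": 34, "email": "a@example.com"},
    {"id": 2, "age": -5, "email": "b@example.com"},
    {"id": 3, "age": 51, "email": None},
]


def validate(records, rules):
    """Return a dict mapping each rule name to the ids of records that break it."""
    return {
        name: [r["id"] for r in records if not check(r)]
        for name, check in rules.items()
    }


violations = validate(records, rules)
print(violations)
```

Reporting the offending record ids, rather than a simple pass or fail, makes it easier to trace each problem back to its source.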
Good data is not just the product of gathering as much data as you can. It’s the result of data that is verified for accuracy, completeness, credibility, consistency and timeliness. It's like having your journey mapped out by Waze or Google Maps.
They are most helpful when they alert you to real-time conditions and have accurate information about any delays that would affect your trip. The difference between good quality and poor quality data can be seen in the decisions that are based on it.
Business Analytics for Big Wins or Losses
In a Forbes Insight whitepaper, Anthony Scriffignano, chief data scientist and a senior vice president at Dun & Bradstreet, explained why an error in data can have such a big impact. Data is what empowers business to make “more automated decisions, more global decisions and decisions with greater impact to their enterprise.”
That kind of digital transformation offers huge benefits for scaling up rapidly. But the drawback is that with such a rapid pace, an error will “propagate itself across a business so rapidly that it’s impossible to chase it and correct it.”
Data records are so prone to critical error, according to Harvard Business Review, that fewer than 3% of them meet basic quality standards. Accuracy matters because making decisions based on inaccurate data can translate into serious business losses: as much as $3.1 trillion each year in the US alone, according to IBM.
Aaron Wallace, principal product manager for customer information management at Pitney Bowes, is also quoted in the whitepaper. He observes that when it is “high-quality data” that drives business process, the results are “relevant insights” that can promote better efficiency, targeted customer marketing, and increased revenue streams.
But when the data is not up to that standard, the strategies informed by it will drive businesses down the wrong path. Getting back on track then takes more time and resources than confirming ahead of time that your data is reliable.
It's the ounce of prevention that is worth a pound of cure.