By now, it's no secret that artificial intelligence (AI) can be a vital tool for the enterprise. That's because it can pull hidden gems of information out of vast amounts of seemingly unrelated data.
But early AI adopters are coming to realize that simply throwing random data at AI is a recipe for failure. (Also read: Why Diversity is Essential for Quality Data to Train AI.)
Indeed, data quality is emerging as an important success factor when it comes to training AI models. With quality data, the enterprise can improve its AI strategy's odds of success, lower costs and push more AI-driven applications into production faster.
And as it turns out, AI could also be the solution to ensuring good data quality.
Here's how, and how you can kickstart an effective data quality management strategy:
How AI Can Improve Data Quality
AI is the ideal tool for data quality management (DQM) because, within most business models, it's the only tool that can handle the volume and complexity of data required without busting your IT budget. AI can also directly improve some of the key characteristics of data quality, such as accuracy, completeness, reliability and relevance. Developing each of these areas requires substantial analysis, which AI can carry out at greater scale, at a faster pace and at lower cost than an army of analysts.
But to truly see why AI is the best instrument for DQM, we first need to understand why DQM is a unique multidimensional challenge:
Pradyumna S. Upadrashta, chief science officer at data analytics firm Mastech InfoTrellis, points out the various dimensions of data quality management. These include, for example:
- Data sets contain multiple properties — like accuracy, relevance and validity.
- Each data set is viewed differently by each department that interacts with it.
Thus, improving data quality requires myriad processes, including:
- Setting up data profiling measures encompassing the type of data, where and how data is stored, the applications it serves and the stakeholders who use it.
- Considering the data quality reference store that maintains the metadata and validity rules necessary for external processes.
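To make the profiling step concrete, here is a minimal sketch in Python. It assumes records arrive as a list of dictionaries; the function names (`profile_column`, `profile_records`) and the sample `customers` data are hypothetical and only illustrate the kind of summary a profiling measure might record.

```python
from collections import Counter


def profile_column(values):
    """Summarize one column: size, share of missing values, distinct count, types seen."""
    # Treat None and empty strings as missing (an assumption for this sketch).
    present = [v for v in values if v is not None and v != ""]
    type_counts = Counter(type(v).__name__ for v in present)
    return {
        "count": len(values),
        "missing_ratio": (len(values) - len(present)) / len(values) if values else 0.0,
        "distinct": len(set(present)),
        "types": dict(type_counts),
    }


def profile_records(records):
    """Profile every field that appears in any record."""
    fields = sorted({key for record in records for key in record})
    return {field: profile_column([r.get(field) for r in records]) for field in fields}


# Hypothetical sample data with a blank value, a mistyped value and absent fields.
customers = [
    {"id": 1, "country": "US", "age": 34},
    {"id": 2, "country": "", "age": "41"},
    {"id": 3, "country": "DE", "age": None},
    {"id": 4, "country": "US"},
]
profile = profile_records(customers)
for field, summary in profile.items():
    print(field, summary)
```

A profile like this could be stored alongside the validity rules in the reference store, so later checks have a baseline to compare against.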
Some of these processes are accounted for in data-centric AI, a current hot topic that prioritizes data quality over quantity, especially for business applications of artificial intelligence.
Automation can help ensure the process pipeline can continuously validate data and update the rules that establish its quality. (Also read: Robotic Process Automation: What You Need to Know.)
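As a rough illustration of automated validation, the sketch below keeps validity rules in a plain dictionary and checks a batch of records against them. The `RULES` store, the field names and the thresholds are all assumptions made for the example; a real pipeline would load its rules from wherever the organization keeps them.

```python
import re

# Hypothetical rule store: field name -> (description, predicate).
RULES = {
    "email": (
        "looks like an email address",
        lambda v: isinstance(v, str)
        and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    ),
    "age": (
        "integer between 0 and 120",
        lambda v: isinstance(v, int) and 0 <= v <= 120,
    ),
}


def validate(record, rules):
    """Return (field, description) for every rule the record breaks."""
    failures = []
    for field, (description, check) in rules.items():
        if not check(record.get(field)):
            failures.append((field, description))
    return failures


def validate_batch(records, rules):
    """Map the index of each failing record to its list of failures."""
    report = {}
    for index, record in enumerate(records):
        failures = validate(record, rules)
        if failures:
            report[index] = failures
    return report


batch = [
    {"email": "ana@example.com", "age": 29},
    {"email": "not-an-email", "age": 29},
    {"email": "li@example.org", "age": 430},
]
report = validate_batch(batch, RULES)
print(report)

# Rules can be revised as requirements change, and the same batch re-checked.
RULES["age"] = ("integer between 0 and 150", lambda v: isinstance(v, int) and 0 <= v <= 150)
```

Keeping the rules as data rather than hard-coding them is what lets the pipeline update them without touching the validation logic.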
The Challenges of AI-Driven Data Quality Management
The Data Quality Paradox
Using AI to improve data quality runs into a circular problem: your AI solution needs to be trained on high-quality data before it can identify high-quality data.
So what's the solution?
One potential answer comes from Patrick McDonald, director of data science at Wavicle Data Solutions. McDonald suggests the first step to AI-driven data quality management is to establish a solid foundation of data governance and stewardship, preferably under an in-house manager's leadership, and then link that to a thorough data monitoring program.
The master data store is a good place to start, since it is the easiest to control and often the most critical to the business model.
The Observability Conundrum
The ability to not only “see” data in the pipeline, but to track its movement and evolution, can have a dramatic impact on the resulting AI models' performance, Arize’s Krystal Kirkland explains. This is particularly important for emerging machine learning operations (MLOps) environments.
Enhancing data quality also requires increasing observability as data is created, stored, combined and analyzed.
Sudden changes in various data characteristics, as well as missing and mismatched data, can affect both categorical and numerical data — so it's important to consider both when strategizing ways to improve observability. And when data is unstructured, organizations will have to put even more effort into determining appropriate levels of accuracy, relevance and usability.
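The checks described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each column is a non-empty list, missing values are represented as `None`, and a shift of more than three baseline standard deviations is treated as suspicious. The function names and the threshold are hypothetical, not part of any observability product.

```python
from statistics import mean, pstdev


def missing_ratio(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)


def numeric_shift(baseline, current):
    """Distance between the two means, measured in baseline standard deviations."""
    base = [v for v in baseline if v is not None]
    cur = [v for v in current if v is not None]
    spread = pstdev(base)
    if spread == 0:
        return 0.0 if mean(cur) == mean(base) else float("inf")
    return abs(mean(cur) - mean(base)) / spread


def unseen_categories(baseline, current):
    """Categories present in the current batch but absent from the baseline."""
    known = {v for v in baseline if v is not None}
    return {v for v in current if v is not None} - known


# Hypothetical batches: yesterday's data as the baseline, today's as current.
baseline_prices = [10, 12, 11, 13, 9]
current_prices = [30, 31, None, 29, 32]
baseline_methods = ["card", "cash", "card"]
current_methods = ["card", "crypto", None]

shift = numeric_shift(baseline_prices, current_prices)
new_values = unseen_categories(baseline_methods, current_methods)
if shift > 3 or new_values or missing_ratio(current_prices) > 0.1:
    print("Possible data quality issue:", shift, new_values)
```

Numerical and categorical columns need different checks, which is why the sketch handles them with separate functions.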
But perhaps the biggest challenge to fostering high data quality is that it is a never-ending struggle. For one, “quality” is an indefinable metric. For another, data and the real-world values they represent are in constant flux.
How to Start Improving Data Quality
Don't fret if the prospect of establishing an AI-driven data quality management strategy is making your head spin. In any DQM plan, says tech author George Krasadakis, the first step is understanding where bad data comes from.
In most organizations, the chief culprits of poor data quality tend to be buggy software, system-level issues and the constantly changing formats that make a mess of source and target data stores.
In other words, data quality issues come from the very data ecosystem that the typical enterprise has spent millions of dollars perfecting.
Another key first step is determining what "quality data" means to your enterprise. Data is valuable only in relation to other data, so you need to establish benchmarks to determine what you consider "quality."
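One way to pin down such benchmarks is to write them as explicit target values and measure data against them. The sketch below is an assumption-laden example: the two metrics (completeness and uniqueness), the `BENCHMARKS` targets and the sample `orders` are hypothetical, and each enterprise would choose its own.

```python
# Hypothetical targets; each organization would set its own.
BENCHMARKS = {"completeness": 0.95, "uniqueness": 0.99}


def completeness(records, required):
    """Share of required fields that are actually filled in."""
    total = len(records) * len(required)
    filled = sum(1 for r in records for f in required if r.get(f) not in (None, ""))
    return filled / total


def uniqueness(records, key):
    """Share of records whose key value is distinct."""
    keys = [r.get(key) for r in records]
    return len(set(keys)) / len(keys)


def score(records, required, key):
    """Return each metric's measured value and whether it meets its benchmark."""
    measured = {
        "completeness": completeness(records, required),
        "uniqueness": uniqueness(records, key),
    }
    return {name: (value, value >= BENCHMARKS[name]) for name, value in measured.items()}


orders = [
    {"order_id": 1, "sku": "A1", "qty": 2},
    {"order_id": 2, "sku": "", "qty": 1},
    {"order_id": 2, "sku": "B7", "qty": 5},
    {"order_id": 4, "sku": "C3", "qty": 1},
]
result = score(orders, required=["order_id", "sku", "qty"], key="order_id")
for name, (value, passed) in result.items():
    print(f"{name}: {value:.2f} ({'meets' if passed else 'below'} benchmark)")
```

Writing the targets down this way turns "quality" from a vague aspiration into something that can be tracked over time.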
Conclusion
Going forward, it seems likely that building and maintaining quality data will become a core function in the digitally transformed enterprise. And it’s a job that will keep both AI and the human workforce busy for a long, long time. (Also read: Edge Data Centers: The Key to Digital Transformation?)