Artificial intelligence (AI) is capable of doing things that were previously unimaginable.
It can distinguish between a pedestrian and a road sign to guide a self-driving car, review the tone of an article and provide feedback, provide helpful patient data to a doctor, and perform a thousand other time-saving and thoughtful jobs.
However, to do what it does, AI often depends on structured data, and that dependency can become its Achilles heel.
Sources of Unstructured Data
AI systems draw on data from many sources, and much of that data is unstructured. Examples include:
- Text data from social media, blog posts, tweets, documents, web pages, news articles, and community forums. Text on web pages is usually bound by stylesheets, tags, and scripts. Text from these sources seldom follows any standard guidelines or structure.
- Audio data from recordings, videos, and podcasts. This data is typically converted to text by speech-to-text tools before it is analyzed, and the quality of the output varies with the quality of the converter and of the input.
- Visual data from images, videos, diagrams, screenshots, and infographics that the AI system must parse to understand.
- Sensor data from various IoT devices, for instance temperature changes in a deep freezer in a large hotel's kitchen, which vary with the types of raw food stored.
- Geospatial data obtained from various systems and tools like GPS, smartphones, and compasses.
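As a rough illustration of the web-page case in the list above, the sketch below pulls readable text out of markup while ignoring scripts and stylesheets. It uses only Python's standard `html.parser` module; the `TextExtractor` class, the `extract_text` helper, and the sample page are hypothetical names made up for this example, and real pages would likely need more careful handling.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a script or style element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            # Collapse runs of whitespace inside each text fragment.
            text = " ".join(data.split())
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample page, used only for illustration.
sample = (
    "<html><head><style>p {color: red;}</style>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Hello</h1><p>Unstructured   <b>text</b></p></body></html>"
)
print(extract_text(sample))  # Hello Unstructured text
```

Even after this step the result is still free-form text with no agreed structure, which is the point the list makes.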
Limitations of Unstructured Data
AI systems need a consistent data format, at least for large-scale tasks, but applying uniformity is a challenge when data from different sources are stubbornly varied and difficult to fit into a structure.
Pulling the data into shape requires pre-processing, such as removing errors, unwanted spaces, and outliers, and this is time-consuming.
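A minimal sketch of the three pre-processing steps named above, applied to freezer temperature readings that arrive as strings. The function name `clean_readings`, the sample values, and the outlier rule (distance from the median measured in median absolute deviations, with an assumed cutoff of 3) are illustrative choices, not a prescribed method.

```python
import statistics


def clean_readings(raw_values, max_deviations=3.0):
    """Strip stray spaces, drop unparseable entries, and remove outliers.

    raw_values is an iterable of strings. The outlier rule is an assumption
    made for this sketch: a value is dropped when it lies more than
    max_deviations median absolute deviations (MAD) from the median.
    """
    parsed = []
    for value in raw_values:
        try:
            parsed.append(float(value.strip()))  # remove unwanted spaces
        except ValueError:
            continue  # treat unparseable entries as errors and skip them

    if len(parsed) < 3:
        return parsed  # too few points to judge outliers

    median = statistics.median(parsed)
    mad = statistics.median(abs(v - median) for v in parsed)
    if mad == 0:
        return parsed  # no spread around the median, nothing to filter on

    return [v for v in parsed if abs(v - median) / mad <= max_deviations]


# Hypothetical readings in degrees Celsius, including noise and one outlier.
raw = [" -18.2", "-17.9 ", "n/a", "-18.4", "", "42.0", "-18.0"]
print(clean_readings(raw))  # [-18.2, -17.9, -18.4, -18.0]
```

Each step is simple on its own. The cost comes from deciding, for every source, what counts as an error or an outlier in the first place.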
Data can also arrive in different formats, for example through APIs, JSON files, or spreadsheets, and new formats emerge over time, which complicates the problem further.
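To illustrate the format problem, the sketch below maps a JSON payload and a spreadsheet export (as CSV) onto one common record shape. Every field name here (`sensor`, `temp_c`, `Sensor ID`, `Temperature (C)`, `sensor_id`, `temperature_c`) is invented for the example; a real integration would need one such adapter per source, and a new one each time a format changes.

```python
import csv
import io
import json


def from_json(text):
    """Adapter for a hypothetical JSON payload: a list of objects
    with 'sensor' and 'temp_c' keys."""
    return [
        {"sensor_id": str(item["sensor"]), "temperature_c": float(item["temp_c"])}
        for item in json.loads(text)
    ]


def from_csv(text):
    """Adapter for a hypothetical spreadsheet export with
    'Sensor ID' and 'Temperature (C)' columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "sensor_id": row["Sensor ID"].strip(),
            "temperature_c": float(row["Temperature (C)"]),
        }
        for row in reader
    ]


# Two sources describing the same kind of measurement in different shapes.
json_payload = '[{"sensor": "freezer-1", "temp_c": -18.2}]'
csv_export = "Sensor ID,Temperature (C)\nfreezer-2,-17.5\n"

records = from_json(json_payload) + from_csv(csv_export)
for record in records:
    print(record)
```

After this step, both sources yield records with the same keys and types, which is the kind of consistency large-scale AI tasks depend on.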
Data confidentiality can also add to the complexity, and providers must be extremely cautious to prevent data leaks.
A Case Study: Using AI in Patient Care
Let’s look at AI in medical imaging to understand how unstructured data hinders AI adoption, using X-rays, CT scans, and MRIs as examples.
Ideally, AI should analyze imaging reports and enable radiographers and doctors to accurately and quickly diagnose the illness. However, the following factors severely limit AI’s ability to correctly interpret the imaging outputs:
- Imaging variability
Variability in quality, angle, lighting, and patient positioning makes it difficult for AI to interpret the imaging and can lead to erroneous output.
- Anatomical variation
Anatomy differs from patient to patient, and this variation is a challenge for AI systems. AI works best with uniform data and is still coming to terms with the diversity of human anatomy.
- Lack of annotations
Annotations enable AI to understand the imaging better. Without them, AI is left to interpret the imaging plates on its own, with nothing to guide it, which is a challenge.
- Rare or uncommon cases
AI requires uniformity and consistency of data, but imaging of uncommon or rare medical conditions offers little of either, which severely limits its ability to process the data. Understanding such conditions requires AI systems to learn as they go.
- Noise and artifacts
Imaging can contain noise, artifacts, and distortions caused by factors such as machine problems, non-compliance with imaging protocols, or changes in the patient's body position. These problems make the data less consistent and harder for AI to interpret.
The Bottom Line
Because of its dependence on structured data, AI still has a long way to go before it can solve many use cases. Meanwhile, for organizations, providing structured data is still a costly and time-consuming task.
Data provisioning and parsing need to improve to unlock the full potential of AI and, at the same time, a lot of work is needed to equip AI systems to handle unstructured data.