Combining data sources in Hadoop is a complex business. Some of the reasons for this include:
- Custom, source-specific scripts that combine data sources are problematic.
- Using data integration or data science tools introduces too much uncertainty.
- Adding data from external sources is next to impossible.
Today, I’m going to discuss how Hadoop analytics is enhanced by source-agnostic technologies that make it easy to combine internal and external data sources. In addition to describing how source-agnostic methods work, I’ll also cover why Hadoop analytics needs built-in intelligence and knowledge transfer capabilities, an understanding of relationships and data characteristics, and a scalable, high-performance architecture.
- Source-agnostic methods include a flexible entity resolution model that allows new data sources to be added using statistically sound, repeatable data science processes. These processes use algorithms to gather knowledge from the data, then assess and analyze it to determine the best integration approach.
No matter how fragmented or incomplete the original source records are, Hadoop analytics technologies should be source agnostic and able to unify data without changing or manipulating the source data. These technologies should also build entity indices from data content and from attributes that describe individuals and how they exist in the world. To accomplish this, they must understand data content, context, and structure, as well as how components relate to one another.
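To make the idea concrete, here is a minimal sketch of source-agnostic entity resolution in plain Python. It is an illustration only, not any vendor's method: the two sources, their field names, the 0.6/0.4 weights and the 0.85 threshold are all assumptions made up for the example. Each record is mapped to a normalized view for comparison, the source records themselves are never modified, and the output is an entity index that points back to the original records.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical records from two sources with different schemas.
CRM = [
    {"id": "c1", "full_name": "Jane A. Doe", "dob": "1980-02-14"},
    {"id": "c2", "full_name": "Robert Smith", "dob": "1975-07-01"},
]
WEB = [
    {"uid": "w9", "name": "DOE, JANE", "birth": "1980-02-14"},
    {"uid": "w3", "name": "Bob Smyth", "birth": "1991-11-30"},
]


def normalize_name(raw):
    """Return a comparable form of a name; the source string is left as-is."""
    if "," in raw:  # "LAST, FIRST" -> "FIRST LAST"
        last, first = raw.split(",", 1)
        raw = f"{first} {last}"
    tokens = [t.strip(".").lower() for t in raw.split()]
    # Drop single-letter initials and sort so token order does not matter.
    return " ".join(sorted(t for t in tokens if len(t) > 1))


def to_view(source, rec, id_field, name_field, dob_field):
    """Map a source-specific record to a common, read-only comparison view."""
    return {
        "ref": (source, rec[id_field]),
        "name": normalize_name(rec[name_field]),
        "dob": rec[dob_field],
    }


def score(a, b):
    """Weighted similarity; the weights are illustrative assumptions."""
    name_sim = SequenceMatcher(None, a["name"], b["name"]).ratio()
    dob_sim = 1.0 if a["dob"] == b["dob"] else 0.0
    return 0.6 * name_sim + 0.4 * dob_sim


def resolve(views, threshold=0.85):
    """Cluster matching views with union-find and return the entity index."""
    parent = list(range(len(views)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(views)), 2):
        if score(views[i], views[j]) >= threshold:
            parent[find(i)] = find(j)

    index = {}
    for i, view in enumerate(views):
        index.setdefault(find(i), []).append(view["ref"])
    return sorted(index.values())


views = [to_view("crm", r, "id", "full_name", "dob") for r in CRM]
views += [to_view("web", r, "uid", "name", "birth") for r in WEB]
entities = resolve(views)
```

Adding a third source in this sketch only requires another `to_view` mapping; the scoring and clustering steps stay the same. The pairwise comparison is quadratic, so a real deployment at Hadoop scale would need blocking or a similar candidate-reduction step, which is omitted here.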
- Built-in data science and data integration expertise allows data to be cleansed, standardized and correlated with a high degree of accuracy and precision. Visualization tools and reports help analysts evaluate and learn from data, and perform system tuning based on knowledge gained from different steps within the process.
- Understanding relationships between entities results in more accurate entity resolution processes. As real-world entities are not just the sum of their attributes, but also their connections, relationship knowledge should be used to detect when records are the same. This is especially important for handling corner cases and big data.
- Data characterization improves the analysis, resolution and linking of data by identifying and providing context for information within data sources. It can help to validate the content, density, and distribution of data within columns of structured information. Data characterization can also be used to identify and extract important entity-related data (name, address, date of birth, etc.) from unstructured and semi-structured sources for correlation with structured sources.
- A scalable, parallel architecture performs analytics quickly even when supporting hundreds of structured, semi-structured and unstructured data sources, and tens of billions of records.
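The data characterization step described in the list above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `characterize` and `classify` helpers, the two regular-expression patterns and the sample rows are hypothetical, and a real profiler would recognize many more value types. The function reports the density of a column (the share of non-empty values), the number of distinct values, the dominant kind of content and the most frequent values.

```python
import re
from collections import Counter

# Illustrative content patterns; a real system would cover far more types.
PATTERNS = {
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "integer": re.compile(r"^-?\d+$"),
}


def classify(value):
    """Label a single value by the first pattern it matches."""
    for label, pattern in PATTERNS.items():
        if pattern.match(value):
            return label
    return "text"


def characterize(rows, column):
    """Summarize content, density and distribution for one column."""
    values = [row.get(column) for row in rows]
    # Treat None and blank strings as missing.
    present = [v.strip() for v in values if v is not None and v.strip()]
    kinds = Counter(classify(v) for v in present)
    return {
        "density": len(present) / len(values) if values else 0.0,
        "distinct": len(set(present)),
        "dominant_kind": kinds.most_common(1)[0][0] if kinds else None,
        "top_values": Counter(present).most_common(3),
    }


# Hypothetical sample rows with gaps and one malformed date.
rows = [
    {"name": "Jane Doe", "dob": "1980-02-14"},
    {"name": "Bob Smyth", "dob": ""},
    {"name": "Jane Doe", "dob": "1991-11-30"},
    {"name": None, "dob": "unknown"},
]
dob_profile = characterize(rows, "dob")
name_profile = characterize(rows, "name")
```

A profile like this gives an analyst a quick read on whether a column is populated and consistent enough to be used as a matching attribute before it is fed into entity resolution.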
Hadoop is changing the way the world performs analytics. When new source-agnostic analytics are added to Hadoop ecosystems, organizations can connect the dots across many internal and external data sources and gain insights that weren’t possible before.
This article was originally posted at Novetta.com. It has been reprinted here with permission. Novetta retains all copyrights.