Hadoop Analytics: Not So Easy Across Multiple Data Sources

Combining data from different sources can be problematic, but source-agnostic methods can be a solution.

Hadoop is a great place to offload data for analytics processing or to model larger volumes of a single data source than existing systems can handle. However, as companies bring data from many sources into Hadoop, demand is growing for analysis across those sources, which can be extremely difficult to achieve. This post is the first in a three-part series that explains the issues organizations face as they attempt to analyze different data sources and types within Hadoop, and how to resolve these challenges. Today’s post focuses on the problems that occur when combining multiple internal sources. The next two posts explain why these problems grow more complex as external data sources are added, and how new approaches help to solve them.

Data From Different Sources Hard to Connect and Map

Data from diverse sources is structured differently, which makes it difficult to connect and map data types together, even when all of the data comes from internal sources. Combining data can be especially hard if customers have multiple account numbers or if an organization has acquired or merged with other companies. Over the past few years, some organizations have attempted to use data discovery or data science applications to analyze data from multiple sources stored in Hadoop. This approach is problematic because it involves a lot of guesswork: users have to decide which foreign keys to use to connect various data sources, and they have to make assumptions when creating data model overlays. These guesses are hard to test and often prove incorrect when applied at scale, which leads to faulty data analysis and mistrust of the sources.
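One way to make a foreign-key guess less of a guess is to profile the candidate key before joining on it. The following is a minimal sketch in Python, not something taken from the original post: the `key_profile` helper, the `crm` and `billing` sources, and the `account_no` field are all hypothetical names chosen for illustration.

```python
from collections import Counter

def key_profile(left, right, key):
    """Summarize how well a candidate key links two lists of dict records."""
    # Ignore records where the key is missing or blank.
    left_counts = Counter(r[key] for r in left if r.get(key))
    right_counts = Counter(r[key] for r in right if r.get(key))
    shared = set(left_counts) & set(right_counts)
    return {
        # Share of records on each side that find at least one partner.
        "left_match_rate": sum(left_counts[v] for v in shared) / len(left) if left else 0.0,
        "right_match_rate": sum(right_counts[v] for v in shared) / len(right) if right else 0.0,
        # Shared key values that appear more than once on either side.
        "ambiguous_keys": sorted(
            v for v in shared if left_counts[v] > 1 or right_counts[v] > 1
        ),
    }

# Hypothetical sample data from two internal sources.
crm = [
    {"account_no": "A100", "name": "Ann Lee"},
    {"account_no": "A200", "name": "Bob Ray"},
    {"account_no": "A300", "name": "Cy Dunn"},
    {"account_no": "", "name": "Dee Fox"},
]
billing = [
    {"account_no": "A100", "name": "Ann Lee"},
    {"account_no": "A200", "name": "Robert Ray"},
    {"account_no": "A200", "name": "Bobby Ray"},
    {"account_no": "B900", "name": "Cy Dunn"},
]

profile = key_profile(crm, billing, "account_no")
print(profile)
```

In this made-up sample, only half of the CRM records find a partner, and one account number maps to more than one billing record. Both are signs that the key alone is not a reliable link between the two sources.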

Hadoop Experts Attempt to Merge Data Together

As a result, organizations that want to analyze data across data sources have resorted to hiring Hadoop experts to create custom, source-specific scripts to merge data sets together. These Hadoop experts are usually not data integration or entity resolution experts, but they do the best they can to address the immediate needs of the organization. They typically use Pig or Java to write hard-and-fast rules that determine how to combine structured data from specific sources, e.g., matching records based on an account number. Once a script for two sources has been written, adding a third source means throwing the first script away and designing a new one to combine three specific sources. The same thing happens each time another source is added. Not only is this approach inefficient, but it also fails when applied at scale, handles edge cases poorly, can result in a large number of duplicate records, and often merges many records that should not be combined.
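To make the problem concrete, here is a minimal sketch of what such a hard-coded rule might look like. It is written in Python for brevity rather than in Pig or Java, and the source names, field names (`account_no`, `acct`, `balance`), and sample rows are all hypothetical.

```python
def merge_crm_billing(crm_rows, billing_rows):
    """Hard-coded rule: join CRM and billing rows on an exact account number.

    The field names of both sources are baked into the function, so it
    only works for these two sources.
    """
    billing_by_account = {}
    for row in billing_rows:
        billing_by_account.setdefault(row["acct"], []).append(row)

    merged = []
    for row in crm_rows:
        matches = billing_by_account.get(row["account_no"], [])
        if not matches:
            # No billing row: keep the CRM record with an empty balance.
            merged.append(
                {"account": row["account_no"], "name": row["name"], "balance": None}
            )
        for match in matches:
            merged.append(
                {"account": row["account_no"], "name": row["name"],
                 "balance": match["balance"]}
            )
    return merged

# Hypothetical data: Ann Lee holds two accounts, and the billing system
# (from an acquired company) reuses account number 1001 for someone else.
crm = [
    {"account_no": "1001", "name": "Ann Lee"},
    {"account_no": "2002", "name": "Ann Lee"},
]
billing = [
    {"acct": "1001", "customer": "Omar Haddad", "balance": 75.0},
    {"acct": "2002", "customer": "Ann Lee", "balance": 20.0},
]

merged = merge_crm_billing(crm, billing)
for record in merged:
    print(record)
```

In this sample, the rule produces two separate records for the same customer and attaches another person's balance to one of them. Because the field names of both sources are written into the function, adding a third source would mean rewriting it.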

Source-Agnostic Methods Better for Combining Data

A better approach is to combine internal data sources using a source-agnostic method that includes a flexible entity resolution model, which allows new sources to be added easily using a statistically sound, repeatable process.
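As a rough illustration of the idea, and not a description of any particular product, the sketch below maps each source onto a shared schema through configuration, scores every pair of records with the same similarity function, and groups matching records into entities. The source names, field mappings, weights, and threshold are all made up for the example. A statistically sound model would estimate the weights and threshold from data rather than hard-code them.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical per-source mappings onto a shared schema. Adding a source
# means adding an entry here, not rewriting the matching logic.
FIELD_MAPS = {
    "crm": {"name": "full_name", "email": "email_addr"},
    "billing": {"name": "customer", "email": "contact_email"},
    "support": {"name": "caller", "email": "email"},
}

# Illustrative values only; not estimated from real data.
WEIGHTS = {"name": 0.4, "email": 0.6}
THRESHOLD = 0.85

def normalize(source, row):
    """Project a source-specific row onto the shared schema."""
    mapping = FIELD_MAPS[source]
    return {f: str(row.get(col) or "").strip().lower() for f, col in mapping.items()}

def similarity(a, b):
    """Weighted string similarity; blank fields contribute nothing."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        if a[field] and b[field]:
            score += weight * SequenceMatcher(None, a[field], b[field]).ratio()
    return score

def resolve(records):
    """Group (source, row) pairs into entities; returns lists of indices."""
    normalized = [normalize(source, row) for source, row in records]
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; a large data set would need blocking first.
    for i, j in combinations(range(len(records)), 2):
        if similarity(normalized[i], normalized[j]) >= THRESHOLD:
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical records from three internal sources.
records = [
    ("crm", {"full_name": "Ann Lee", "email_addr": "ann.lee@example.com"}),
    ("billing", {"customer": "Ann M. Lee", "contact_email": "Ann.Lee@example.com"}),
    ("support", {"caller": "A. Lee", "email": "ann.lee@example.com"}),
    ("crm", {"full_name": "Omar Haddad", "email_addr": "omar@example.com"}),
]

entities = resolve(records)
print(entities)
```

The matching logic never refers to a specific source, so the third source here is handled by the same process as the first two.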

This article was originally posted at Novetta.com. It has been reprinted here with permission. Novetta retains all copyrights.


Jenn Reed

Jenn Reed is Director of Product Management at Novetta. In this role, she is responsible for defining and implementing product strategy for Novetta Entity Analytics, establishing and maintaining relationships with clients, partners, and analysts, seeking new market opportunities, and providing oversight of overall strategy, technical and marketing aspects of the product. Jennifer joined Novetta after serving as a Senior Product Manager at IBM for InfoSphere MDM. While at IBM, she was responsible for overseeing MDM strategy for Big Data, including unstructured data correlation, for which she was a co-inventor, and entity resolution on Hadoop. With more than 20 years of…