Hadoop Analytics: Even Harder With External Sources


External data can be a valuable source of information, but it must be properly integrated with internal data.

In my post, Hadoop Analytics: Not So Easy Across Multiple Data Sources, I discussed the issues organizations face when attempting to use Hadoop to store and analyze data from multiple internal sources. In this post, I’ll talk about the challenges and benefits of adding external data to the mix.

Adding External Data Improves Predictive Analytics

Organizations increasingly want to analyze third-party data because these sources increase their visibility into the broader marketplace, help them predict future actions and generate additional sales leads. Analyzing internal data alone provides historical perspective about customers and their purchases, which is useful for trending and pattern analysis, but has limited predictive value. These internal sources provide data often referred to as lagging indicators because they follow past events. Although lagging indicators can confirm a pattern is occurring or about to occur, they cannot easily predict what will occur or detect shifts in the market.

Organizations want to combine leading market indicators from external sources with internal historical data and sales channel information. This combination provides them with better insights about patterns and trends, and helps to improve their confidence in the predictive models they are leveraging for sales and marketing programs, fraud detection, risk analysis and more.

Retail is one industry that can clearly benefit from adding external data to Hadoop to improve business results. For example, a retail chain could combine recent public property record filings from external data sources with internal customer data to identify individuals who have recently purchased a home. They could then use that information to immediately offer these customers targeted advertisements and promotions for items new homeowners are likely to buy from their stores.
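To make the retail example concrete, here is a minimal, hypothetical sketch in plain Python. The field names, the normalization, and the exact-match join are all illustrative assumptions; real external property records rarely share a key with internal customer systems, which is exactly why matching on normalized attributes (rather than an ID) is shown here.

```python
# Hypothetical sketch: flag internal customers who appear in recent
# external public property filings. Field names are illustrative;
# there is no shared key, so the match is on a normalized
# name + address pair.

def normalize(s):
    """Lowercase and collapse whitespace so trivial formatting
    differences don't block a match."""
    return " ".join(s.lower().split())

def new_homeowner_customers(customers, property_filings):
    """Return customers whose normalized name and address match a
    recent property filing."""
    filings = {
        (normalize(f["buyer_name"]), normalize(f["property_address"]))
        for f in property_filings
    }
    return [
        c for c in customers
        if (normalize(c["name"]), normalize(c["address"])) in filings
    ]

customers = [
    {"name": "Ana Ortiz", "address": "12 Elm St"},
    {"name": "Bo Chen",   "address": "9 Oak Ave"},
]
filings = [
    {"buyer_name": "ana  ortiz", "property_address": "12 ELM ST"},
]

print([c["name"] for c in new_homeowner_customers(customers, filings)])
# -> ['Ana Ortiz']
```

At production scale this join would run inside Hadoop (for example as a Spark job over both datasets), and the exact-string match would give way to fuzzier record linkage, but the shape of the problem is the same: correlate two sources that were never designed to be joined.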


Merging External Data Sources Introduces Challenges

However, integrating external data sources introduces even bigger technical challenges than integrating internal sources does. External data is often fragmented or dirty, and it can come from structured, semi-structured or unstructured sources. Adding external data to existing analytical models is difficult because it contains information that cannot be directly correlated with internal sources. External data also covers both an organization’s customers and its competitors’ customers, which makes it hard to determine whether a given record refers to a current customer at all. Finally, although external data is valuable when combined with internal sources for a particular context, most organizations do not want to govern it or load it into operational systems.

Overcoming Challenges With New Technologies

Organizations can overcome the challenges of adding external data sources to Hadoop with new technologies designed specifically to simplify these processes without impacting systems of record, other critical enterprise applications, or workflows. These technologies resolve entities without requiring foreign keys or specific internal identifiers, and they handle all types of data, whether fragmented, dirty, structured, semi-structured or unstructured.
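The idea of resolving entities without foreign keys can be sketched as attribute-level similarity scoring. The snippet below is a simplified illustration, not Novetta's actual algorithm: the weights, threshold, and use of `difflib` are assumptions chosen only to show how two records with no shared identifier can still be judged to describe the same entity.

```python
# Hypothetical sketch of key-less entity resolution: two records are
# scored on the attributes they have in common rather than joined on
# an ID. The threshold is an illustrative assumption.

from difflib import SequenceMatcher

def similarity(a, b):
    """Fuzzy string similarity in [0, 1], tolerant of dirty data."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_entity(rec_a, rec_b, threshold=0.8):
    """Average the similarity over shared attributes; records with
    no attributes in common never match."""
    shared = [k for k in rec_a if k in rec_b]
    if not shared:
        return False
    score = sum(
        similarity(str(rec_a[k]), str(rec_b[k])) for k in shared
    ) / len(shared)
    return score >= threshold

internal = {"name": "Jonathan Smith", "city": "Austin"}
external = {"name": "Jon Smith", "city": "austin", "employer": "Acme"}

print(same_entity(internal, external))
# -> True
```

Note that the external record carries an attribute (`employer`) the internal one lacks; a key-less approach simply ignores it when scoring, which is how fragmented sources with differing schemas can still be correlated.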


Before Hadoop, organizations had no easy, cost-effective place to land and process external or internal data. Hadoop changed that with a platform that made it easy to rapidly ingest and perform analytics on a wide variety of individual data sources. However, the Hadoop ecosystem still lacks tools to easily combine and analyze data from different sources and formats to deliver richer analytics and better business results. Stay tuned for my third post in this series for details on what these new technologies should look like.

This article was originally posted at Novetta.com. It has been reprinted here with permission. Novetta retains all copyrights.



Jenn Reed

Jenn Reed is Director of Product Management at Novetta. In this role, she is responsible for defining and implementing product strategy for Novetta Entity Analytics, establishing and maintaining relationships with clients, partners, and analysts, seeking new market opportunities, and providing oversight of overall strategy, technical and marketing aspects of the product. Jennifer joined Novetta after serving as a Senior Product Manager at IBM for InfoSphere MDM. While at IBM, she was responsible for overseeing MDM strategy for Big Data, including unstructured data correlation, for which she was a co-inventor, and entity resolution on Hadoop. With more than 20 years of…