In my post, Hadoop Analytics: Not So Easy Across Multiple Data Sources, I discussed the issues organizations face when attempting to use Hadoop to store and analyze data from multiple internal sources. In this post, I’ll talk about the challenges and benefits of adding external data to the mix.

Adding External Data Improves Predictive Analytics

Organizations increasingly want to analyze third-party data because it broadens their visibility into the marketplace, helps them predict future customer actions, and generates additional sales leads. Analyzing internal data alone provides a historical perspective on customers and their purchases, which is useful for trend and pattern analysis but has limited predictive value. Internal sources yield what are often called lagging indicators because they follow past events. Lagging indicators can confirm that a pattern is occurring or about to occur, but they cannot easily predict what will happen next or detect shifts in the market.

Organizations want to combine leading market indicators from external sources with internal historical data and sales channel information. This combination gives them better insight into patterns and trends, and it strengthens their confidence in the predictive models they use for sales and marketing programs, fraud detection, risk analysis and more.

Retail is one industry that can clearly benefit from adding external data to Hadoop to improve business results. For example, a retail chain could combine recent public property record filings from external data sources with internal customer data to identify individuals who have recently purchased a home. They could then use that information to immediately offer these customers targeted advertisements and promotions for items new homeowners are likely to buy from their stores.
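To make the retail scenario concrete, here is a minimal sketch of how external property-record filings might be matched against internal customer records to flag likely new homeowners. The field names, records, and normalization rules are all illustrative assumptions, not part of any specific product.

```python
# Hypothetical sketch: match recent property-record filings (external)
# against internal customer records to flag likely new homeowners.
# All data, field names, and matching rules here are illustrative assumptions.

def normalize(name: str, address: str) -> tuple:
    """Lowercase and strip punctuation/extra whitespace so records from
    different sources compare consistently."""
    def clean(s):
        kept = "".join(c for c in s.lower() if c.isalnum() or c.isspace())
        return " ".join(kept.split())
    return (clean(name), clean(address))

def flag_new_homeowners(customers, filings):
    """Return IDs of internal customers whose name and address match
    a recent property filing."""
    filed = {normalize(f["buyer"], f["property_address"]) for f in filings}
    return [c["customer_id"] for c in customers
            if normalize(c["name"], c["address"]) in filed]

customers = [
    {"customer_id": 101, "name": "Jane Doe", "address": "12 Oak St."},
    {"customer_id": 102, "name": "Sam Lee", "address": "99 Elm Ave."},
]
filings = [
    {"buyer": "JANE DOE", "property_address": "12 Oak St"},
]

print(flag_new_homeowners(customers, filings))  # [101]
```

In practice the matching step would be far fuzzier than exact normalized equality, which is precisely the challenge the next section describes.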


Merging External Data Sources Introduces Challenges

However, integrating external data sources introduces even bigger technical challenges than integrating internal data. External data is often fragmented or dirty, and it can arrive in structured, semi-structured or unstructured form. Adding external data to existing analytical models is difficult because it contains information that cannot be directly correlated with internal sources. External data also describes both an organization's customers and its competitors' customers, which makes it hard to determine whether a given record refers to a current customer. Finally, despite its usefulness when combined with internal sources for a particular context, most organizations do not want to govern external data or load it into operational systems.

Overcoming Challenges With New Technologies

Organizations can overcome the challenges of adding external data sources to Hadoop with new technologies designed specifically to simplify these processes without impacting systems of record, other critical enterprise applications or workflows. These technologies resolve entities without requiring foreign keys or specific internal identifiers, and they handle all types of data: fragmented, dirty, structured, semi-structured or unstructured.
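The idea of resolving entities without foreign keys can be sketched as scoring the similarity of record attributes directly, rather than joining on a shared identifier. The sketch below uses Python's standard-library `difflib.SequenceMatcher`; the field weights and threshold are illustrative assumptions, not the method of any particular vendor.

```python
# Hypothetical sketch of key-less entity resolution: instead of joining
# on a foreign key, score the similarity of attribute values directly.
# Weights and threshold are illustrative assumptions.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_entity(internal: dict, external: dict, threshold: float = 0.7) -> bool:
    """Treat two records as the same entity when the weighted similarity
    of their name and address fields clears the threshold."""
    score = (0.6 * similarity(internal["name"], external["name"])
             + 0.4 * similarity(internal["address"], external["address"]))
    return score >= threshold

rec_internal = {"name": "Acme Corporation", "address": "500 Main Street"}
rec_external = {"name": "ACME Corp.", "address": "500 Main St"}
rec_other = {"name": "Globex Inc", "address": "1 Side Rd"}

print(same_entity(rec_internal, rec_external))  # matches despite differing forms
print(same_entity(rec_internal, rec_other))     # no match
```

Production systems layer blocking, normalization and many more features on top of a score like this, but the principle is the same: the link between internal and external records is inferred from the data itself, not from a shared key.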

Before Hadoop, organizations had no place to easily and cost-effectively land and process external or internal data. Hadoop changed that with a platform that made it easy to rapidly ingest and analyze a wide variety of individual data sources. However, the Hadoop ecosystem still lacks tools to easily combine and analyze data from different sources and formats to deliver richer analytics and better business results. Stay tuned for my third post in this series for details on what these new technologies should look like.

This article was originally posted elsewhere and has been reprinted here with permission. Novetta retains all copyrights.