What is Hadoop? It’s a yellow toy elephant. Not what you were expecting? How about this: Doug Cutting – co-creator of this open-source software project – borrowed the name from his son, who happened to call his toy elephant Hadoop. In a nutshell, Hadoop is a software framework developed by the Apache Software Foundation that’s used to build data-intensive, distributed computing applications. And it’s a key component in another buzzword readers can never seem to get enough of: big data. Here are seven things you should know about this unique, freely licensed software.
How did Hadoop get its start?
Twelve years ago, Google built a platform to manipulate the massive amounts of data it was collecting. Like the company often does, Google made its design available to the public in the form of two papers: Google File System and MapReduce.
At the same time, Doug Cutting and Mike Cafarella were working on Nutch, a new search engine. The two were also struggling with how to handle large amounts of data. Then the two researchers got wind of Google’s papers. That fortunate intersection changed everything by introducing Cutting and Cafarella to a better file system and a way to keep track of the data, eventually leading to the creation of Hadoop.
What is so important about Hadoop?
Today, collecting data is easier than ever. Having all this data presents many opportunities, but there are challenges as well:
- Massive amounts of data require new methods of processing.
- Much of the data being captured is in unstructured formats.
To overcome the challenges of manipulating immense quantities of unstructured data, Cutting and Cafarella came up with a two-part solution. To solve the data-quantity problem, Hadoop employs a distributed environment – a network of commodity servers – creating a parallel processing cluster, which brings more processing power to bear on the assigned task.
Next, they had to tackle unstructured data – data in formats that standard relational database systems are unable to handle. Cutting and Cafarella designed Hadoop to work with any type of data: structured or unstructured, text, images, even audio files. A white paper from Cloudera (a Hadoop integrator) explains why this is important:
- "By making all your data usable, not just what’s in your databases, Hadoop lets you uncover hidden relationships and reveals answers that have always been just out of reach. You can start making more decisions based on hard data, instead of hunches, and look at complete data sets, not just samples and summaries."
What is Schema on read?
As was mentioned earlier, one of the advantages of Hadoop is its ability to handle unstructured data. In a sense, that is "kicking the can down the road." Eventually the data needs some kind of structure before it can be analyzed.
That is where schema on read comes into play. Schema on read is the melding of the format the data is in, where to find the data (remember, the data is scattered among several servers) and what needs to be done to the data – not a simple task. It’s been said that manipulating data in a Hadoop system requires the skills of a business analyst, a statistician and a Java programmer. Unfortunately, there aren’t many people with all of those qualifications.
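As a rough illustration of schema on read, here is a hedged sketch of a mapper that imposes structure on raw, comma-delimited log lines only at the moment they are read. The field layout, class name and file contents are all assumptions for illustration; nothing forces a schema onto the files when they are written into HDFS.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: the raw log files carry no schema in HDFS;
// structure is applied only here, when each line is read.
public class PageViewMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);
  private final Text page = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Assumed line layout: timestamp,userId,pageUrl,responseTimeMs
    String[] fields = line.toString().split(",");
    if (fields.length < 4) {
      return; // malformed line: skip it rather than failing the whole job
    }
    page.set(fields[2]);          // the "schema" exists only in this code
    context.write(page, ONE);     // emit (pageUrl, 1) for a later count
  }
}
```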
What is Hive?
If Hadoop was going to succeed, working with the data had to be simplified. So, the open-source crowd got to work and created Hive:
- "Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL."
Hive enables the best of both worlds: database personnel familiar with SQL commands can manipulate the data, and developers familiar with the schema on read process are still able to create customized queries.
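As a sketch of what that simplification looks like in practice, the snippet below connects to a HiveServer2 instance over JDBC and runs a HiveQL query. The host, port, credentials, table and column names are assumptions for illustration; behind the scenes, Hive turns the SQL-like query into distributed work that runs across the Hadoop cluster.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Hive's JDBC driver for HiveServer2.
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Hypothetical connection string: HiveServer2 on localhost, default database.
    try (Connection con = DriverManager.getConnection(
            "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = con.createStatement()) {

      // A SQL-like HiveQL query over an assumed page_views table.
      ResultSet rs = stmt.executeQuery(
          "SELECT page_url, COUNT(*) AS views "
        + "FROM page_views GROUP BY page_url ORDER BY views DESC LIMIT 10");

      while (rs.next()) {
        System.out.println(rs.getString("page_url") + "\t" + rs.getLong("views"));
      }
    }
  }
}
```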
What kind of data does Hadoop analyze?
Web analytics is the first thing that comes to mind, analyzing Web logs and Web traffic in order to optimize websites. Facebook, for example, is definitely into Web analytics, using Hadoop to sort through the terabytes of data the company accumulates.
Companies use Hadoop clusters to perform risk analysis, fraud detection and customer-base segmentation. Utility companies use Hadoop to analyze sensor data from their electrical grids, allowing them to optimize the production of electricity. And major companies such as Target, 3M and Medtronic use Hadoop to optimize product distribution, business risk assessment and customer-base segmentation.
Universities are invested in Hadoop too. Brad Rubin, an associate professor at the University of St. Thomas Graduate Programs in Software, mentioned that his Hadoop expertise is helping sort through the copious amounts of data compiled by research groups at the university.
Can you give a real-world example of Hadoop?
One of the better-known examples is the TimesMachine. The New York Times has a collection of full-page newspaper TIFF images, associated metadata, and article text from 1851 through 1922, amounting to terabytes of data. Using an EC2/S3/Hadoop system and specialized code, NYT’s Derek Gottfrid:
- "Ingested 405,000 very large TIFF images, 3.3 million articles in SGML and 405,000 xml files mapping articles to rectangular regions in the TIFFs. This data was converted to a more web-friendly 810,000 PNG images (thumbnails and full images) and 405,000 JavaScript files."
Gottfrid mentioned that, using servers in the Amazon Web Services cloud, they were able to process all the data required for the TimesMachine in less than 36 hours.
Is Hadoop already obsolete or just morphing?
Hadoop has been around for over a decade now. That has many saying it’s obsolete. One expert, Dr. David Rico, has said that "IT products are short-lived. In dog years, Google’s products are about 70, while Hadoop is 56."
There may be some truth to what Rico says. It appears that Hadoop is going through a major overhaul. To learn more about it, Rubin invited me to a Twin Cities Hadoop User Group meeting, and the topic of discussion was Introduction to YARN:
- "Apache Hadoop 2 includes a new MapReduce engine, which has a number of advantages over the previous implementation, including better scalability and resource utilization. The new implementation is built on a general resource management system for running distributed applications called YARN."
Hadoop gets a lot of buzz in database and content management circles, but there are still many questions around it and how it can best be used. These are just a few. If you have more, send them our way. We’ll answer the best ones on Techopedia.com.