Everyone’s talking about Hadoop, the hot new technology that’s highly prized among developers and just might change the world (again). But just what is it, anyway? Is it a programming language? A database? A processing system? An Indian tea cozy?
The broad answer: Hadoop is all of these things (except the tea cozy), and more. It’s a software library that provides a programming framework for cheap, useful processing of another modern buzzword: big data.
Where did Hadoop come from?
Apache Hadoop is a top-level project of the Apache Software Foundation, a non-profit organization whose mission is to "provide software for the public good." As such, the Hadoop library is free, open-source software available to all developers.
The underlying technology that powers Hadoop was actually invented at Google. Back in the early days, the not-quite-giant search engine needed a way to index the massive amounts of data it was collecting from the Internet and turn it into meaningful, relevant results for its users. With nothing on the market that could meet those requirements, Google built its own platform.
Google described those innovations in published papers, and Doug Cutting and Mike Cafarella implemented the ideas in an open-source project called Nutch; Hadoop was later spun out of Nutch as its own project. Essentially, Hadoop applies Google's approach to big data in a way that's affordable for companies of all sizes.
How does Hadoop work?
As mentioned previously, Hadoop isn’t one thing – it’s many things. The software library that is Hadoop consists of four primary parts (modules), and a number of add-on solutions (like databases and programming languages) that enhance its real-world use. The four modules are:
- Hadoop Common: The collection of shared utilities and libraries that supports the other Hadoop modules.
- Hadoop Distributed File System (HDFS): A robust distributed file system that places no restrictions on stored data: it can hold structured or unstructured, schemaless data, where many distributed file systems store only structured data. HDFS provides high-throughput access with built-in redundancy, replicating data blocks across multiple machines so that if one machine fails, availability is maintained through the others.
- Hadoop YARN: The framework responsible for cluster resource management and job scheduling: it allocates the cluster's CPU and memory among competing applications and decides where each task runs. By letting many workloads share one cluster of inexpensive machines, YARN helps make Hadoop an affordable, cost-efficient way to process big data.
- Hadoop MapReduce: A YARN-based system, built on the programming model Google published, that carries out parallel processing of large data sets (structured and unstructured). The MapReduce model has also been adopted well beyond Hadoop, appearing in NoSQL databases and other big data processing frameworks.
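The MapReduce idea in the list above is easy to see in miniature with the classic word-count job. The sketch below is illustrative only: the function names and the in-memory "shuffle" step are hypothetical stand-ins, not Hadoop's actual API, and everything runs on one machine, whereas a real cluster would run the map tasks in parallel and group keys for the reducers automatically.

```python
import itertools

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one line of input."""
    for word in line.lower().split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce phase: sum all the counts emitted for a single word."""
    return (word, sum(counts))

def run_job(lines):
    """Simulate map -> shuffle/sort -> reduce locally.

    Sorting the intermediate pairs plays the role of Hadoop's shuffle,
    which brings all values for the same key to the same reducer.
    """
    pairs = sorted(pair for line in lines for pair in mapper(line))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in itertools.groupby(pairs, key=lambda kv: kv[0])
    )

counts = run_job(["big data big deal", "data everywhere"])
```

Because each map call only sees one line and each reduce call only sees one key, neither phase depends on where its input lives, and that independence is what lets Hadoop spread the work across thousands of machines.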
Together, these modules deliver distributed processing of large data sets. The Hadoop framework uses simple programming models distributed across clusters of computers, so the system can scale from a single server to thousands of machines, gaining processing power by adding machines rather than relying on ever-larger hardware.
Hardware that can single-handedly supply the processing power big data demands is expensive, to put it mildly. This is the true innovation of Hadoop: the ability to spread massive processing workloads across many smaller machines, each with its own local computation and storage, with redundancy built in at the application level to tolerate failures.
What does Hadoop do?
Stated simply, Hadoop makes big data accessible and usable to everyone.
Before Hadoop, companies that worked with big data did so mostly with relational databases and enterprise data warehouses, which depend on massive amounts of expensive hardware. While these tools are great for processing structured data – data that's already sorted and organized in a manageable way – their capacity for processing unstructured data was so limited as to be practically non-existent. To be usable, data first had to be structured so it would fit neatly into tables.
The Hadoop framework changes that requirement, and does so cheaply. With Hadoop, massive amounts of data, from tens of gigabytes on up, both structured and unstructured, can be processed using ordinary (commodity) servers.
Hadoop brings potential big data applications to businesses of all sizes, in every industry. The open-source framework lets finance companies build sophisticated models for portfolio evaluation and risk analysis, and lets online retailers fine-tune their search results and point customers toward products they're more likely to buy.
With Hadoop, the possibilities are truly limitless.