Big data and the open-source software framework Hadoop are fundamentally different kinds of things. The former is an asset, often a complex and ambiguous one, while the latter is software for storing and processing that asset.
Big data is simply the large sets of data that businesses and other parties put together to serve specific goals and operations. Big data can include many different kinds of data in many different kinds of formats. For example, businesses might put a lot of work into collecting thousands of pieces of data on purchases in currency formats, on customer identifiers like name or Social Security number, or on product information in the form of model numbers, sales numbers or inventory numbers. All of this, or any other large mass of information, can be called big data. As a rule, it’s raw and unsorted until it is put through various kinds of tools and handlers.
Hadoop is one of the tools designed to handle big data. Hadoop and similar software products store large data sets and process them, applying user-defined algorithms to parse and interpret the data. Hadoop is an open-source framework released under the Apache License and maintained by a global community of contributors through the Apache Software Foundation. Its main components include the MapReduce programming model and the Hadoop Distributed File System (HDFS).
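The idea behind MapReduce is that Hadoop first maps a large data set, applying a function to each record to produce key-value pairs, and then reduces those pairs, combining all the values that share a key into a specific result. A map function can act as a filter or transformer for raw data, while a reduce function acts as an aggregator. HDFS, meanwhile, splits files into blocks and distributes and replicates them across the machines in a cluster, re-replicating blocks as necessary when a machine fails.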
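The map and reduce steps described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, run in a single process; it does not use Hadoop or its APIs. The purchase records and the function names (`map_purchase`, `shuffle`, `reduce_sales`) are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical purchase records: (product model number, sale amount in cents).
purchases = [
    ("A100", 1999),
    ("B200", 500),
    ("A100", 2001),
    ("C300", 750),
    ("B200", 500),
]


def map_purchase(record):
    """Map step: emit one (key, value) pair for each input record."""
    model, cents = record
    yield model, cents


def shuffle(pairs):
    """Group values by key, standing in for what a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_sales(key, values):
    """Reduce step: combine all values that share a key into one result."""
    return key, sum(values)


# Run the three phases in order: map, shuffle, reduce.
mapped = [pair for record in purchases for pair in map_purchase(record)]
totals = dict(reduce_sales(key, values) for key, values in shuffle(mapped).items())
print(totals)  # total sales in cents per model number
```

In a real cluster, the map and reduce functions would run in parallel on many machines, each working on the portion of the data stored nearby. The sketch keeps everything in one list only to make the data flow visible.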
Database administrators, developers and others can use the various features of Hadoop to deal with big data in any number of ways. For example, Hadoop can be used to pursue data strategies like clustering and targeting with non-uniform data, that is, data that doesn't fit neatly into a traditional table or respond well to simple queries.