
The 10 Most Important Hadoop Terms You Need to Know and Understand

By Kaushik Pal | Last updated: June 1, 2016
Presented by Bloor Group
Key Takeaways

In order to really understand big data, you need to understand a bit about Hadoop and the language around it.


Big data, the catchy name for massive volumes of structured, unstructured or semi-structured data, is notoriously difficult to capture, store, manage, share, analyze and visualize, at least using traditional database and software applications. That's why dedicated big data technologies have emerged to manage and process massive volumes of data effectively and efficiently. Apache Hadoop, in particular, provides the framework and associated technologies to process large data sets across clusters of computers in a distributed way. So, in order to really understand big data, you need to understand a bit about Hadoop. Here we'll take a look at the top terms you'll hear with regard to Hadoop, and what they mean.


But First, a Look at How Hadoop Works

Before going into the Hadoop ecosystem, you need to understand two fundamental things clearly. The first is how a file is stored in Hadoop; the second is how stored data is processed. All Hadoop-related technologies mainly work in these two areas and make them more user-friendly. (Get the basics of how Hadoop works in How Hadoop Helps Solve the Big Data Problem.)

Now, on to the terms.


Hadoop Common

The Hadoop framework has different modules for different functions, and these modules need to interact with each other. Hadoop Common can be defined as a common utilities library that supports these modules in the Hadoop ecosystem. These utilities are packaged as Java archive (JAR) files and are mainly used by programmers and developers during development.

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a sub-project of Apache Hadoop under the Apache Software Foundation. This is the backbone of storage in the Hadoop framework. It is a distributed, scalable and fault-tolerant file system that spans multiple commodity machines, which together are known as the Hadoop cluster. The objective of HDFS is to store a huge volume of data reliably while giving applications high-throughput access to that data. HDFS follows a master/slave architecture, where the master is known as the NameNode and the slaves are known as DataNodes.
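
As a rough illustration of the master/slave split, the sketch below models a NameNode that tracks only metadata (which blocks make up a file, and which DataNodes hold each replica), while the DataNodes hold the block bytes. The class names `ToyNameNode` and `ToyDataNode`, the 4-byte block size and the round-robin placement are all assumptions made for illustration; real HDFS uses far larger blocks and a rack-aware placement policy.

```python
from itertools import cycle


class ToyDataNode:
    """Holds raw block bytes, keyed by block id (a stand-in for a DataNode)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}


class ToyNameNode:
    """Tracks metadata only: file -> block ids, block id -> DataNode names."""

    def __init__(self, datanodes, block_size=4, replication=2):
        self.datanodes = {dn.name: dn for dn in datanodes}
        self.block_size = block_size
        self.replication = replication
        self.files = {}        # path -> list of block ids
        self.locations = {}    # block id -> names of DataNodes holding a replica
        self._ring = cycle(datanodes)  # hypothetical round-robin placement

    def write(self, path, data):
        """Split data into fixed-size blocks and store each on several nodes."""
        block_ids = []
        for offset in range(0, len(data), self.block_size):
            block_id = f"{path}#{offset // self.block_size}"
            chunk = data[offset:offset + self.block_size]
            targets = [next(self._ring) for _ in range(self.replication)]
            for dn in targets:
                dn.blocks[block_id] = chunk
            self.locations[block_id] = [dn.name for dn in targets]
            block_ids.append(block_id)
        self.files[path] = block_ids

    def read(self, path, failed=()):
        """Reassemble a file, skipping any DataNodes listed as failed."""
        out = b""
        for block_id in self.files[path]:
            live = [n for n in self.locations[block_id] if n not in failed]
            if not live:
                raise IOError(f"all replicas of {block_id} are unavailable")
            out += self.datanodes[live[0]].blocks[block_id]
        return out


nodes = [ToyDataNode(f"dn{i}") for i in range(3)]
namenode = ToyNameNode(nodes)
namenode.write("/logs/a.txt", b"hello hadoop")
# The file is still readable with one DataNode down, because each block
# has a second replica on a different node.
restored = namenode.read("/logs/a.txt", failed={"dn0"})
```

The point of the design is that losing a single DataNode does not lose data, which is the fault tolerance the definition above refers to.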

MapReduce

Hadoop MapReduce is also a sub-project of Apache Hadoop. MapReduce is a software framework written in Java. Its primary objective is to process large data sets in a distributed environment (composed of commodity hardware) in a completely parallel manner. A job is split into map tasks, which turn input records into intermediate key/value pairs, and reduce tasks, which combine the values for each key. The framework manages all activities like job scheduling, monitoring, executing and re-executing (in the case of failed tasks).
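
The programming model can be sketched in a few lines of plain Python. This is not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase` and `run_job` are made up for illustration, and everything runs in a single process, whereas Hadoop would run many mappers and reducers in parallel across the cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    intermediate = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())


counts = run_job(["big data big clusters", "Big data tools"])
# counts == {"big": 3, "data": 2, "clusters": 1, "tools": 1}
```

Because each mapper looks at one line and each reducer at one key, the work can be spread over many machines without the tasks needing to talk to each other.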

HBase

Apache HBase is known as the Hadoop database. It is a column-oriented, distributed and scalable big data store. It is a type of NoSQL database rather than a relational database management system. HBase is written in Java, is built on top of Hadoop and runs on HDFS. HBase is used when you need real-time read/write and random access to big data. HBase is modeled on Google's Bigtable.


Hive

Apache Hive is an open-source data warehouse software system. Hive was originally developed by Facebook before it came under the Apache Software Foundation and became open source. It facilitates the management and querying of large data sets on distributed, Hadoop-compatible storage. Queries are written in an SQL-like language known as HiveQL. (Learn more in A Brief Intro to Apache Hive and Pig.)

Apache Pig

Pig was originally developed at Yahoo for writing and executing MapReduce jobs on large volumes of distributed data. It is now an open source project under the Apache Software Foundation. Apache Pig can be defined as a platform for analyzing very large data sets in an efficient way. Pig's infrastructure layer produces sequences of MapReduce jobs that do the actual processing. Pig's language layer is known as Pig Latin, and it provides SQL-like features for querying distributed data sets.

Apache Spark

Spark was originally developed by the AMPLab at UC Berkeley. It became an Apache top-level project in February 2014. Apache Spark can be defined as an open source, general-purpose, cluster-computing framework that makes data analytics much faster. It can read from and write to the Hadoop Distributed File System, but it does not depend on the MapReduce framework. Spark is often much faster than MapReduce, largely because it can keep intermediate data in memory rather than writing it to disk between steps. It provides high-level APIs in Scala, Python and Java.

Apache Cassandra

Apache Cassandra is another open source NoSQL database. Cassandra is widely used to manage large volumes of structured, semi-structured and unstructured data that span multiple data centers and cloud storage. Cassandra is designed around a "masterless" architecture, which means it does not use the master/slave model. In this architecture, all nodes are the same and the data is distributed automatically and evenly across all the nodes. Cassandra's most important features are continuous availability, linear scalability, built-in/customizable replication, no single point of failure and operational simplicity.
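
One way to picture how data can be spread across identical nodes without a master is a hash ring, sketched below. This is a simplified model and not Cassandra's actual code: the `ToyRing` class, the use of MD5 and the single token per node are assumptions made for illustration (Cassandra's default partitioner is based on Murmur3, and nodes typically own many virtual tokens).

```python
import bisect
import hashlib


def token(key):
    """Map a string to a position on the ring (MD5 is an arbitrary choice here)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class ToyRing:
    """Every node owns one position on the ring; no node is special."""

    def __init__(self, node_names, replication=2):
        self.replication = replication
        self.ring = sorted((token(name), name) for name in node_names)
        self.tokens = [t for t, _ in self.ring]

    def replicas(self, partition_key):
        """Walk clockwise from the key's token and take the next N nodes."""
        start = bisect.bisect_right(self.tokens, token(partition_key))
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replication)]


ring = ToyRing(["node-a", "node-b", "node-c", "node-d"])
owners = ring.replicas("user:42")  # two distinct nodes that hold this key
```

Since the mapping depends only on the key and the list of nodes, any node can work out where a piece of data lives, so no master is needed to route requests.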

Yet Another Resource Negotiator (YARN)

Yet Another Resource Negotiator (YARN) is also known as MapReduce 2.0, but it is actually part of Hadoop 2.0. YARN can be defined as a job scheduling and resource management framework. The basic idea of YARN is to split the functions of the JobTracker into two separate daemons, one responsible for resource management and one for scheduling/monitoring. In this framework, there is a global ResourceManager (RM) and a per-application master known as the ApplicationMaster (AM). The global ResourceManager and the per-node NodeManagers form the data computation framework. Existing MapReduce v1 applications can also be run on YARN, but those applications may need to be recompiled with the Hadoop 2.x JARs.
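
The division of labor between these daemons can be sketched as a toy model. The classes below only borrow YARN's role names; the `allocate` and `run` methods and the "most free memory" policy are hypothetical and are not YARN's API or its scheduler.

```python
class NodeManager:
    """Per-node agent that tracks the memory still free on its machine."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb


class ResourceManager:
    """Global scheduler: decides which node hosts each container."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def allocate(self, memory_mb):
        # Hypothetical policy: pick the node with the most free memory.
        node = max(self.node_managers, key=lambda nm: nm.free_mb)
        if node.free_mb < memory_mb:
            return None  # cluster is full; a real scheduler would queue the request
        node.free_mb -= memory_mb
        return node.name


class ApplicationMaster:
    """Per-application coordinator: requests containers and tracks its tasks."""

    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.rm = resource_manager
        self.placements = {}

    def run(self, tasks, memory_mb):
        """Ask the ResourceManager for one container per task."""
        for task in tasks:
            self.placements[task] = self.rm.allocate(memory_mb)
        return self.placements


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])
am = ApplicationMaster("app-1", rm)
placements = am.run(["map-0", "map-1", "reduce-0"], memory_mb=2048)
# placements == {"map-0": "nm1", "map-1": "nm1", "reduce-0": "nm2"}
```

The ResourceManager knows nothing about map or reduce tasks here; it only hands out capacity, while each ApplicationMaster looks after its own job.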

Impala

Impala can be defined as an SQL query engine with massively parallel processing (MPP) power. It runs natively on the Apache Hadoop framework and is designed as part of the Hadoop ecosystem. It shares the same flexible file system (HDFS), metadata, resource management and security frameworks used by other Hadoop ecosystem components. The most important point to note is that Impala is much faster than Hive at query processing. But we should also remember that Impala is meant for query/analysis on a small set of data, and is mainly designed as an analytics tool that works on processed and structured data.

Hadoop is an important topic in IT, but there are those who are skeptical about its long-term viability. Read more in What Is Hadoop? A Cynic's Theory.



Written by Kaushik Pal | Contributor


Kaushik is a technical architect and software consultant, having over 20 years of experience in software analysis, development, architecture, design, testing and training industry. He has an interest in new technology and innovation areas. He focuses on web architecture, web technologies, Java/J2EE, open source, WebRTC, big data and semantic technologies. Kaushik is also the founder of TechAlpine, a technology blog/consultancy firm based in Kolkata. The team at TechAlpine works for different clients in India and abroad. The team has expertise in Java/J2EE/open source/web/WebRTC/Hadoop/big data technologies and technical writing.
