Since the concept of big data was introduced, it has gone through multiple phases of evolution. Hadoop was introduced in 2005 with an initial set of features, including the MapReduce processing engine, which allowed large-scale data processing workloads to be distributed across clusters. Hadoop itself has changed considerably since then and has gained more advanced frameworks and methods.
YARN is a core component of Hadoop 2.0. It manages the resources in a clustered environment. YARN acts as a broker: it interacts with the compute resources on behalf of the applications and assigns resources to each application according to its scheduling criteria.
In this article, we will look at the top advantages of YARN over Hadoop 1.0.
What is the YARN Framework?
YARN (Yet Another Resource Negotiator) is a core component of Hadoop 2.0 that manages resources in a clustered environment. It replaces the resource management layer of Hadoop 1.0 and improves performance, which benefits the Hadoop ecosystem and the range of technologies built on it. Now that we are a little more familiar with YARN, let’s take a closer look at Hadoop 1.0 and YARN.
Limitations of the Hadoop 1.0 Framework
To understand the advantages of the YARN framework, it is important to understand how Hadoop 1.0 works and what its limitations are.
In Hadoop 1.0, a single master service, the JobTracker, does all the coordination work. It both manages the cluster resources and controls MapReduce job execution. In a nutshell, JobTracker schedules and reserves the task slots, and configures and monitors each running task. If a task fails, it allocates a new slot so that the task can start again. Once a task is finished, JobTracker releases the slot for other tasks and cleans up the temporary resources.
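The slot lifecycle described above can be sketched as a small toy model. This is not the Hadoop API: the class `ToyJobTracker`, its method names, and the `max_attempts` limit are assumptions made up for illustration. It only shows one component doing both the slot bookkeeping and the task lifecycle handling.

```python
from collections import deque


class ToyJobTracker:
    """Toy model of a Hadoop 1.0 style JobTracker (illustrative, not the Hadoop API).

    A single object owns both the slot bookkeeping and the task lifecycle.
    """

    def __init__(self, num_slots, max_attempts=3):
        self.free_slots = deque(range(num_slots))  # slot ids not in use
        self.running = {}                          # task name -> slot id
        self.attempts = {}                         # task name -> attempts so far
        self.max_attempts = max_attempts           # hypothetical retry limit
        self.log = []

    def schedule(self, task):
        """Reserve a free slot for the task, or report that none is free."""
        if not self.free_slots:
            self.log.append(f"{task}: waiting, no free slot")
            return False
        slot = self.free_slots.popleft()
        self.running[task] = slot
        self.attempts[task] = self.attempts.get(task, 0) + 1
        self.log.append(f"{task}: attempt {self.attempts[task]} on slot {slot}")
        return True

    def _release(self, task):
        """Return the task's slot to the free pool."""
        self.free_slots.append(self.running.pop(task))

    def task_finished(self, task):
        self._release(task)
        self.log.append(f"{task}: finished, slot released")

    def task_failed(self, task):
        """Release the failed attempt's slot, then try to schedule the task again."""
        self._release(task)
        if self.attempts[task] < self.max_attempts:
            self.log.append(f"{task}: failed, rescheduling")
            self.schedule(task)
        else:
            self.log.append(f"{task}: failed permanently")


tracker = ToyJobTracker(num_slots=2)
tracker.schedule("map-0")
tracker.schedule("map-1")
tracker.schedule("map-2")       # both slots are busy, so this task waits
tracker.task_failed("map-0")    # slot is released and the task is retried
tracker.task_finished("map-1")  # slot goes back to the free pool
tracker.schedule("map-2")       # a slot is free now
print("\n".join(tracker.log))
```

Because every decision runs through the one `tracker` object, losing that object means losing the state of every task, which is the availability concern raised for Hadoop 1.0.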
Major drawbacks of the above approach:
- Availability – JobTracker is a single point of failure in Hadoop 1.0. If JobTracker fails, all running jobs are affected and their tasks have to be restarted.
- Limited scalability – JobTracker performs multiple duties and runs on a single machine, so its work cannot be spread across the other machines in the cluster. This limits scalability.
- Resource utilization – The numbers of map slots and reduce slots are predefined. It can happen that the slots of one type are full while the slots of the other type are empty. Because the empty slots are reserved for one task type, they sit idle instead of taking over work from the full ones, so cluster resources are underutilized.
- Running non-MapReduce applications – JobTracker is built specifically for the MapReduce framework. A problem arises when a non-MapReduce application needs to run on the cluster: it has to conform to the MapReduce programming model in order to run at all. Workloads that commonly suffer from this include:
  - Ad-hoc queries
  - Real-time analysis
  - Message-passing approaches
- Cascading failure – A major issue appears when a cluster grows beyond roughly 4,000 nodes. At that scale, a failure can cascade and degrade the entire cluster.
These are the major limitations of the Hadoop 1.0 framework; there are other minor ones that are not covered here. The YARN framework was introduced to overcome these limitations.
YARN Framework and its Advantages
The YARN framework, introduced in Hadoop 2.0, takes over the cluster management responsibilities that MapReduce used to carry. This leaves MapReduce responsible for data processing only, which streamlines the process.
YARN introduces central resource management. This allows multiple applications to run on Hadoop and share a common pool of resources.
Some of the major components of the YARN framework are:
- ResourceManager – The ResourceManager is the negotiator for all the resources present in a cluster. It contains a scheduler, which allocates resources, and an ApplicationsManager, which accepts and manages user jobs. From Hadoop 2.0 onward, any MapReduce job is treated as an application.
- ApplicationMaster – Each job or application gets its own ApplicationMaster. It negotiates resources from the ResourceManager, manages the execution of that application (for example, a MapReduce job), and terminates once the job processing is complete.
- NodeManager – The NodeManager is the agent that runs on each node. It launches and monitors the containers in which users’ jobs run on that node and reports the node’s resource usage to the ResourceManager. Information about completed jobs is kept by a separate job history server rather than by the NodeManager itself.
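How the three components divide the work can be illustrated with a toy model. The sketch below is not the YARN API: the class and method names are hypothetical, memory is the only resource tracked, and the ResourceManager places containers directly on nodes, which collapses a step that real YARN splits between the ApplicationMaster and the NodeManager.

```python
class NodeManager:
    """Per-node agent (toy): tracks the node's memory and hosts containers."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []  # (app_id, memory_mb) pairs running on this node

    def launch(self, app_id, memory_mb):
        self.free_mb -= memory_mb
        self.containers.append((app_id, memory_mb))

    def stop_all(self, app_id):
        """Stop every container of one application and reclaim its memory."""
        for owner, memory_mb in self.containers:
            if owner == app_id:
                self.free_mb += memory_mb
        self.containers = [c for c in self.containers if c[0] != app_id]


class ResourceManager:
    """Cluster-wide arbiter (toy): decides which node hosts each container."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        """Grant a container on the first node with enough free memory."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.launch(app_id, memory_mb)
                return node.name
        return None  # the request cannot be satisfied right now

    def release(self, app_id):
        for node in self.nodes:
            node.stop_all(app_id)


class ApplicationMaster:
    """One per application (toy): requests containers, cleans up when done."""

    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm
        self.placements = []

    def run(self, num_tasks, memory_mb):
        for _ in range(num_tasks):
            self.placements.append(self.rm.allocate(self.app_id, memory_mb))
        return self.placements

    def finish(self):
        self.rm.release(self.app_id)


nodes = [NodeManager("node-1", 4096), NodeManager("node-2", 4096)]
rm = ResourceManager(nodes)
mr_job = ApplicationMaster("mapreduce-job", rm)
query = ApplicationMaster("interactive-query", rm)

print(mr_job.run(num_tasks=3, memory_mb=2048))  # ['node-1', 'node-1', 'node-2']
print(query.run(num_tasks=2, memory_mb=1024))   # ['node-2', 'node-2']
mr_job.finish()
print([n.free_mb for n in nodes])               # [4096, 2048]
```

Two different kinds of application share the same nodes here, and neither ApplicationMaster needs to know anything about the other; only the ResourceManager sees the whole cluster.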
Keeping in mind that the YARN framework has different components to manage the different tasks, let’s see how it counters the limitations of Hadoop 1.0.
- Better utilization of resources – The YARN framework does not have any fixed slots for tasks. It provides a central resource manager which allows multiple applications to share a common pool of resources.
- Running non-MapReduce applications – In YARN, the scheduling and resource management capabilities are separated from the data processing component. This allows Hadoop to run varied types of applications that do not conform to the MapReduce programming model. Hadoop clusters are now capable of running independent interactive queries and performing better real-time analysis.
- Backward compatibility – YARN is backward compatible, which means existing MapReduce jobs can be executed in Hadoop 2.0.
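- JobTracker no longer exists – The two major roles of the JobTracker were resource management and job scheduling. With the introduction of the YARN framework, these are now split between two separate components, namely the ResourceManager, which handles cluster-wide resource management, and the per-application ApplicationMaster, which handles the scheduling and monitoring of its own job.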
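To illustrate the resource-utilization point in the list above, here is a minimal sketch that compares fixed map and reduce slots with a shared pool. The function names and the workload numbers are made up for illustration, and the model assumes every task fits in exactly one slot or one equally sized container, which is simpler than real YARN scheduling.

```python
def run_with_fixed_slots(map_tasks, reduce_tasks, map_slots, reduce_slots):
    """Hadoop 1.0 style: map and reduce slots are separate and not interchangeable."""
    running = min(map_tasks, map_slots) + min(reduce_tasks, reduce_slots)
    waiting = (map_tasks + reduce_tasks) - running
    idle = (map_slots + reduce_slots) - running
    return {"running": running, "waiting": waiting, "idle": idle}


def run_with_shared_pool(map_tasks, reduce_tasks, containers):
    """YARN style (simplified): any free container can host any task."""
    total = map_tasks + reduce_tasks
    running = min(total, containers)
    return {"running": running, "waiting": total - running, "idle": containers - running}


# Same cluster capacity (10 units) and same workload (10 map tasks, no reduce tasks).
fixed = run_with_fixed_slots(map_tasks=10, reduce_tasks=0, map_slots=6, reduce_slots=4)
pooled = run_with_shared_pool(map_tasks=10, reduce_tasks=0, containers=10)

print(fixed)   # {'running': 6, 'waiting': 4, 'idle': 4}
print(pooled)  # {'running': 10, 'waiting': 0, 'idle': 0}
```

With fixed slots, four tasks wait while four reduce slots sit idle; with a shared pool of the same size, all ten tasks run at once.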
The introduction of the YARN framework has made it easier for developers to build applications for Hadoop. Applications no longer have to be implemented with third-party tools. YARN is a major change that allows users to build applications on Hadoop 2.0 and manipulate data more effectively. With time, there will be further developments to enhance the usability of Hadoop. For now, the YARN framework plays a crucial role in dealing with the existing problems and in creating a hassle-free environment that is more versatile than the earlier version of the MapReduce model.