Apache Hive is a framework that sits on top of Hadoop for running ad-hoc queries on data in Hadoop. Hive supports HiveQL, which is similar to SQL but does not support all SQL constructs.

Hive converts the HiveQL query into a Java MapReduce program and then submits it to the Hadoop cluster. The same outcome can be achieved with either HiveQL or Java MapReduce, but Java MapReduce requires much more code to be written and debugged than HiveQL does. So, HiveQL increases developer productivity.

To summarize, Hive, through the HiveQL language, provides a higher-level abstraction over Java MapReduce programming. As with any other high-level abstraction, there is some performance overhead in using HiveQL compared to Java MapReduce, but the Hive community is working to narrow this gap for the most commonly used scenarios.
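To illustrate the productivity gain, an aggregation that would take dozens of lines of Java MapReduce code can be expressed in a few lines of HiveQL. The table and column names below are hypothetical, used only for illustration:

```sql
-- Sketch: count orders per customer over a hypothetical 'orders' table.
-- Hive compiles this single statement into one or more MapReduce jobs.
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;
```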

Along the same lines, Pig provides a higher-level abstraction over MapReduce. Pig supports PigLatin constructs, which are converted into Java MapReduce programs and then submitted to the Hadoop cluster.




While HiveQL is a declarative language like SQL, PigLatin is a data flow language. The output of one PigLatin construct can be sent as input to another PigLatin construct and so on.
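The data-flow style can be seen in a short PigLatin sketch, where each named relation is fed as input to the next construct. The file name and field names here are hypothetical:

```pig
-- Sketch: a pipeline of PigLatin constructs, each consuming the previous one's output.
orders  = LOAD 'orders.csv' USING PigStorage(',')
          AS (customer_id:chararray, amount:double);
big     = FILTER orders BY amount > 100.0;           -- keep large orders
grouped = GROUP big BY customer_id;                  -- group by customer
counts  = FOREACH grouped GENERATE group AS customer_id,
                                   COUNT(big) AS order_count;
DUMP counts;                                         -- trigger execution
```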

Some time back, Cloudera published statistics about the workload characteristics of a typical Hadoop cluster, and it can be easily observed that Pig and Hive jobs make up a good part of the jobs in a Hadoop cluster. Because of the higher developer productivity, many companies are opting for higher-level abstractions like Pig and Hive. So, we can bet there will be more job openings around Hive and Pig than around raw MapReduce development.



The Programming Pig book was published in October 2011, while the Programming Hive book was published more recently, in October 2012. For those who have experience working with an RDBMS, getting started with Hive would be a better option than getting started with Pig. That said, the PigLatin language is not very difficult to pick up either.

To the underlying Hadoop cluster, it is transparent whether a Java MapReduce job is submitted directly or a MapReduce job is submitted through Hive or Pig. Because of the batch-oriented nature of MapReduce jobs, the jobs submitted through Hive and Pig are also batch oriented.

For real-time response requirements, Hive and Pig don't fit the bill because of the previously mentioned batch-oriented nature of MapReduce jobs. Cloudera developed Impala, which is based on Dremel (a paper published by Google), for interactive ad-hoc queries on top of Hadoop. Impala supports SQL-like queries and is compatible with HiveQL, so any applications built on top of Hive should work with minimal changes on Impala. The major difference between Hive and Impala is that while HiveQL is converted into Java MapReduce jobs, Impala does not convert the SQL query into Java MapReduce jobs.

Should you go with Pig or Hive for a particular requirement? That's a topic for another blog.

Republished with permission from Praveen Sripati. Original article can be found here: http://www.thecloudavenue.com/2012/12/introduction-to-apache-hive-and-pig.html