How Can SQL on Hadoop Help with Big Data Analysis?
SQL on Hadoop combines these two methods of data management to form a new tool for data analytics.
SQL on Hadoop is a group of analytical application tools that combine SQL-style querying and processing of data with the most recent Hadoop data framework elements. The emergence of SQL on Hadoop is an important development for big data processing because it allows wider groups of people to successfully work with the Hadoop data processing framework by running SQL queries on the enormous volumes of big data that Hadoop processes. Obviously, the Hadoop framework was previously not as accessible to people, especially in terms of its querying capabilities. Based on the development, several tools have been in the works that promise to improve the productivity of enterprises when it comes to processing and analyzing big data with quality and speed. There is also no need to invest a lot in learning the tool, as traditional knowledge of SQL should do.
Definition of SQL on Hadoop
SQL on Hadoop is a group of applications that allows you to run SQL-style queries on big data hosted by the Hadoop data processing framework. Obviously, data querying, retrieving and analysis have become easier with the addition of SQL on Hadoop. Since SQL was originally designed for relational databases, it had to be modified according to the Hadoop 1 model that comprises MapReduce and the Hadoop Distributed File System (HDFS), and the Hadoop 2 model that does not have MapReduce and HDFS.
One of the earliest efforts to combine SQL with Hadoop resulted in the creation of the Hive data warehouse with the HiveQL software which could translate SQL-style queries into MapReduce jobs. After that, several applications were developed which could do similar jobs. Prominent among the later tools are Drill, BigSQL, HAWQ, Impala, Hadapt, Stinger, H-SQL, Splice Machine, Presto, PolyBase, Spark, JethroData, Shark (Hive on Spark), and Tez (Hive on Tez).
How Does SQL on Hadoop Work?
SQL on Hadoop works with Hadoop in the following ways:
- Connectors in the Hadoop environment translate the SQL query into a MapReduce format so that Hadoop understands the query.
- Pushdown systems execute the SQL query within the Hadoop clusters.
- Systems divide the huge volume of SQL queries between MapReduce-HDFS clusters depending on the workloads of the clusters.
It seems that the SQL query does not change its nature; it is Hadoop that adapts the query into a format it understands.
Top Benefits of SQL on Hadoop
As already stated, SQL on Hadoop is an important development in the context of making big data analysis accessible to more people and making data analysis easier and faster. There is no doubt that the Hadoop data framework has been a great tool for big data analysis, but it is still only accessible by a limited group of people, not only because of the huge efforts needed to learn its unique architecture, but also because it has compatibility issues with other technologies. SQL on Hadoop promises to address these issues.
More People Can Now Access Hadoop
It seems that SQL on Hadoop has made Hadoop more egalitarian in the sense that wider groups of people can now use Hadoop to process and analyze data. Earlier, in order to use Hadoop, you needed to have knowledge of the Hadoop architecture — MapReduce, Hadoop Distributed File System or HBase. Now, you can plug in almost any analytical or reporting tool and access and analyze the data. Thanks to SQL on Hadoop, a number of SQL on Hadoop engines such as Cloudera Impala, Concurrent Lingual, Hadapt, CitusDB, InfiniDB, MammothDB, MemSQL, Pivotal HAWQ, Apache Drill, ScleraDB, Progress DataDirect, Simba and Splice Machine are now commercially available for use with big data. Obviously, this has opened Hadoop to a wider audience which can now expect to increase their returns on investment in big data.
Analyzing Big Data with Hadoop is Now Simpler
Now, all you need to do is run the good old SQL query on the big data to retrieve and analyze data. SQL has evolved itself from being just a relational database tool to a big data analysis tool, which is indeed a significant change. You do not need to worry how Hadoop is processing the queries — it has its own way of interpreting the SQL queries and giving you the results. Experts believe that although the Hadoop Distributed File System does have parallel processing commodity clusters for big data, it can improve its processing capabilities if it works with SQL-style interactive querying. Before the HDFS combined with SQL, it would take a long time to process data with the HDFS and the task required specialized data scientists. And the queries were not interactive. With the Apache Tez framework, which comprises the Spark analytical engine and the Stinger interactive query accelerator for the Hive data warehouse, these problems have been addressed. According to Anu Jain, the group manager of strategy and architecture at retailer Target Corporation, “It is very important for us to ensure we are giving users interactive query access. With Tez we are able to provide that capability to the business.”
The popularity of interactive analytics has been growing among Hadoop users, as a Gartner survey revealed. According to the survey, 32% of the respondents use third-party interfaces with the HDFS or HBase, 27% use self-created queries through Hive, while 23% use Hadoop distribution-specific tools such as Cloudera Impala and Pivotal HAWQ.
Another Perspective on SQL on Hadoop
While it seems that SQL on Hadoop is going to solve a lot of issues we have with Hadoop, there is another view that believes that SQL may have a lot of problems, especially when combined with Hadoop. According to this view, SQL may not be that efficient after all as an analytical tool when it comes to big data. According to Hadoop Summit user panelist John Williams, SQL may not be the best analytical tool to work with big data. According to Williams, who is the senior vice president for platform operations of TrueCar, which offers users a car-buying platform online, “SQL execution time on a large data set is slow. Meanwhile, Hadoop on SQL is getting faster with things like YARN and Tez."
And that is not the only issue with SQL. There are a lot of overhead tasks such as data studying, schema conceiving, index and query creation and normalization that you need to take care of when you are combining SQL with Hadoop, and you may be spending a lot of time and effort. After all that effort, there is no guarantee that you have accomplished anything permanent. If anything, with the application changes, you may be required to redo what you've already done. Instead of SQL, big data-focused development should be done based on Java and Python since these languages are better suited for unstructured data processing.
The jury is still out on whether SQL on Hadoop is the answer to the problems of people faced with using Hadoop. But clearly, the industry needs a better alternative to Hadoop’s own data querying capabilities, and that alternative has to be interactive. SQL on Hadoop tools provide interactive analytics, which is useful. Enterprises do not want to waste time on trying to make sense out of complicated, time-consuming analytics. For the time being, enterprises are finding SQL on Hadoop tools to be very useful.