Kudu is a new open-source project that provides updatable storage. It complements HDFS, which is best at sequential, append-only storage, and HBase, which is built for random access rather than large analytic scans. Kudu is better suited to fast analytics on fast-changing data, which is what many businesses currently demand. So Kudu is not just another Hadoop ecosystem project; it has the potential to change the market. (For more on Hadoop, see The 10 Most Important Hadoop Terms You Need to Know and Understand.)
What Is Kudu?
Kudu is a storage system that stores structured data in tables. Each table has a predefined set of columns and a primary key made up of one or more of those columns. The primary key enforces a uniqueness constraint on the rows and also works as an index, which makes updating and deleting rows efficient. Each table is split into a series of data subsets called tablets.
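The table structure described above can be sketched in a few lines of Python. This is only a toy in-memory model, not the Kudu client API: the class name `ToyTable`, the choice of four tablets and the CRC32-based hash partitioning are all assumptions made for illustration.

```python
import zlib


class ToyTable:
    """Illustrative model of a table with a primary key, split into tablets."""

    def __init__(self, columns, key_columns, num_tablets=4):
        self.columns = tuple(columns)          # predefined column names
        self.key_columns = tuple(key_columns)  # columns forming the primary key
        # Each tablet holds a subset of rows, indexed by primary-key tuple.
        self.tablets = [{} for _ in range(num_tablets)]

    def _key(self, row):
        return tuple(row[c] for c in self.key_columns)

    def _tablet_for(self, key):
        # Deterministic hash so a given key always maps to the same tablet.
        index = zlib.crc32(repr(key).encode("utf-8")) % len(self.tablets)
        return self.tablets[index]

    def insert(self, row):
        if set(row) != set(self.columns):
            raise ValueError("row does not match the table schema")
        key = self._key(row)
        tablet = self._tablet_for(key)
        if key in tablet:
            # The primary key acts as a uniqueness constraint.
            raise KeyError(f"duplicate primary key {key}")
        tablet[key] = dict(row)

    def get(self, key):
        return self._tablet_for(key).get(key)

    def update(self, key, **changes):
        if any(c in self.key_columns for c in changes):
            raise ValueError("primary key columns cannot be changed")
        # The key works as an index: the row is found without a full scan.
        self._tablet_for(key)[key].update(changes)

    def delete(self, key):
        del self._tablet_for(key)[key]
```

A row is located by hashing its key to one tablet and then looking it up there, which is why updates and deletes by primary key do not need to scan the whole table in this model.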
What Is Kudu’s Current Status?
Kudu is already well developed and ships with a substantial set of features. However, it still needs some polishing, which will go faster if users suggest and contribute changes.
Kudu is completely open source and is released under the Apache Software License 2.0. It is also intended to be submitted to Apache so that it can be developed as an Apache Incubator project, which should allow its development to progress faster and its audience to grow. Over time, Kudu’s development will be carried out publicly and transparently. Companies such as AtScale, Xiaomi, Intel and Splice Machine have joined together to contribute to Kudu’s development. Kudu also has a large community whose members are already providing suggestions and contributions. So it’s the people who are driving Kudu’s development forward.
How Can Kudu Complement HDFS/HBase?
Kudu isn’t meant to be a replacement for HDFS/HBase. It is designed to run alongside both HBase and HDFS and to complement their features. This is because HBase and HDFS still have many features that make them more powerful than Kudu on certain machines. On the whole, such machines will get more benefit from those systems.
Features of the Kudu Framework
The main features of the Kudu framework are as follows:
- Extremely fast scans of the table’s columns – Like the best columnar data formats, such as Parquet and ORCFile, Kudu is built for fast column scans. Scans this quick are possible only when the columnar data is properly encoded.
- Reliability of performance – The Kudu framework increases Hadoop’s overall reliability by closing many of the gaps present in Hadoop.
- Easy integration with Hadoop – Kudu can be easily integrated with Hadoop and its different components for more efficiency.
- Completely open source – Kudu is an open-source system with the Apache 2.0 license. It has a large community of developers from different companies and backgrounds, who update it regularly and provide suggestions for changes.
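To illustrate why column layout and encoding matter for scan speed, here is a minimal Python sketch of a column store with dictionary encoding. It is a generic illustration of the technique, not Kudu's storage code; the function names and the sample sales data are hypothetical.

```python
def to_columns(rows):
    """Turn a list of row dicts into one list of values per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def dictionary_encode(values):
    """Replace each value with a small integer code into a sorted dictionary."""
    dictionary = sorted(set(values))
    code_of = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [code_of[value] for value in values]


def count_equal(dictionary, codes, wanted):
    """Count matching rows by comparing integer codes instead of raw values."""
    if wanted not in dictionary:
        return 0  # the value never occurs, so no scan is needed
    wanted_code = dictionary.index(wanted)
    return sum(1 for code in codes if code == wanted_code)


# Hypothetical sample data, for illustration only.
sales = [
    {"product": "tea", "store": "north", "units": 3},
    {"product": "coffee", "store": "north", "units": 5},
    {"product": "tea", "store": "south", "units": 2},
]
columns = to_columns(sales)
product_dict, product_codes = dictionary_encode(columns["product"])
tea_rows = count_equal(product_dict, product_codes, "tea")
```

The scan touches only the `product` column and compares small integers, so the `store` and `units` values are never read. Real columnar engines apply the same idea with more elaborate encodings.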
How Can Kudu Change the Hadoop Ecosystem?
Kudu was built to fit into Hadoop’s ecosystem and enhance its features. It can integrate with some of Hadoop’s key components, such as MapReduce, HBase and HDFS. MapReduce jobs can either write data to or read data from Kudu tables. The same is possible in Spark, where a special layer makes Kudu accessible to Spark components such as Spark SQL and DataFrame. Kudu hasn’t yet been developed far enough to replace these components, but it is estimated that after a few years it will be. Until then, the integration between Hadoop and Kudu is very useful and can fill major gaps in Hadoop’s ecosystem. (To learn more about Apache Spark, see How Apache Spark Helps Rapid Application Development.)
Kudu can be used in a variety of scenarios. Some examples are given below:
- Streaming inputs in near-real time – Where inputs need to be received and made available as soon as possible, Kudu can do a remarkable job. An example is a business where large amounts of dynamic data flood in from different sources and need to be made available quickly.
- Time-series applications with varying access patterns – Kudu is well suited to time-series applications because it makes it simple to set up tables and scan them. An example is a department store, where old data has to be found quickly and processed to predict the future popularity of products.
- Legacy systems – Many companies that get data from various sources and store it on different workstations will feel at home with Kudu. Kudu is extremely fast and can integrate effectively with Impala to process data on all the machines.
- Predictive modeling – Data scientists who want a good platform for modeling can use Kudu. Because data in Kudu can be updated in place, the scientist can change the data and run and re-run the model repeatedly to see what happens.
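The streaming and time-series cases above both rely on two operations: writing or correcting a row by its key as data arrives, and scanning a key range afterward. The following Python sketch shows that access pattern on a toy structure. It assumes a composite key of (series name, timestamp) kept in sorted order; the class `ToySeries` and its methods are hypothetical and are not part of any Kudu API.

```python
import bisect


class ToySeries:
    """Toy time-series store keyed by (series, timestamp), kept sorted."""

    def __init__(self):
        self._keys = []    # sorted list of (series, timestamp) keys
        self._values = {}  # key -> latest value for that key

    def upsert(self, series, timestamp, value):
        """Insert a new point, or overwrite the point already at that key."""
        key = (series, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, series, start, end):
        """Return (timestamp, value) pairs with start <= timestamp < end."""
        lo = bisect.bisect_left(self._keys, (series, start))
        hi = bisect.bisect_left(self._keys, (series, end))
        return [(ts, self._values[(s, ts)]) for s, ts in self._keys[lo:hi]]


store = ToySeries()
store.upsert("store-1", 10, 5)
store.upsert("store-1", 20, 7)
store.upsert("store-2", 15, 1)
store.upsert("store-1", 15, 6)  # late-arriving point lands in key order
store.upsert("store-1", 10, 4)  # correction overwrites the earlier value
recent = store.scan("store-1", 10, 20)
```

Because the keys stay sorted, a range scan only visits the rows inside the requested window, and a late or corrected reading is applied in place rather than by rewriting the whole data set.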
Conclusion
Even though Kudu is still in the development stage, it has enough potential to be a good add-in for standard Hadoop components like HDFS and HBase. It could significantly change the Hadoop ecosystem by filling in gaps and adding new features. It is also fast and powerful, and can help in quickly storing and analyzing large tables of data. However, some work remains before it can be used more efficiently.