Why Hadoop Is a Perfect Match for Genome Sequencing

Genome sequencing needs powerful technology tools to handle all of its data, and Hadoop is up to the task.

Clinical genomics is a fascinating field in which people are working with cutting-edge technologies to produce quick and accurate results. Many genome sequencers are available on the market, and together they already produce petabytes of sequence data; continued growth in sequencing is expected to push that to exabytes in the near future. Hadoop is a strong platform for processing complex genomics workflows at this scale: it can store and sort massive amounts of information and support meaningful analysis of it. (To get an idea of just how much data this really entails, read Understanding Bits, Bytes and Their Multiples.)

The Present and Future of Genomics

Today, genome mapping is developing rapidly. Interest across the genomics industry is high, and as new opportunities present themselves, better technology is urgently needed. Genome sequencing is a very repetitive and resource-intensive task. In 2013 alone, about 15 petabytes of data were produced by only 2,000 sequencers. This amount included 300 KB of sequenced human genome data. At this rate of data production, it can be estimated that by 2018 about one exabyte of data will be produced. Two factors drive this growth. First, the number of sequencers is growing, and each one produces more and more data per run. Second, extremely powerful, low-cost genome sequencing machines have arrived: since 2008, the price of these machines has been decreasing steadily as powerful next-generation machines have entered the market.
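
The growth implied by these figures can be checked with a quick calculation. The sketch below assumes decimal units (1 exabyte = 1,000 petabytes) and steady compound growth between 2013 and 2018; both are assumptions made for illustration, not statements from the article.

```python
# Figures quoted in the text above.
PB_2013 = 15.0           # petabytes produced in 2013
SEQUENCERS_2013 = 2000   # sequencers that produced them
PB_2018 = 1000.0         # one exabyte, ASSUMING 1 EB = 1,000 PB (decimal units)
YEARS = 2018 - 2013

# ASSUMPTION: output grows by the same factor every year.
annual_growth = (PB_2018 / PB_2013) ** (1 / YEARS)

# Average output per sequencer in 2013, in terabytes (1 PB = 1,000 TB).
tb_per_sequencer = PB_2013 * 1000 / SEQUENCERS_2013

print(f"Implied annual growth factor: {annual_growth:.2f}x")
print(f"Average output per sequencer in 2013: {tb_per_sequencer:.1f} TB")
```

Under these assumptions, total output would have to slightly more than double every year to reach one exabyte by 2018.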

The Needs of the Genome Mapping Industry

Complex algorithms are used to process the data collected from the human genome. The results then need to be stored, because they may be reviewed in the future and compared with the original data. Processing and storing 100 GB of data is not too difficult, especially with the powerful machines employed at sequencing centers: studies show that this amount of data can be processed in about 1,000 CPU hours. At the current rate of technical advancement, the genome industry may soon process thousands of gigabytes in just a few seconds.
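
To put 1,000 CPU hours in perspective, the sketch below converts that figure into elapsed time for a few core counts. It assumes the work divides evenly across cores with no coordination overhead, which real workloads rarely achieve; the core counts are arbitrary examples.

```python
def wall_clock_hours(cpu_hours: float, cores: int) -> float:
    """Elapsed time if the work splits evenly across cores (ideal scaling)."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cpu_hours / cores


CPU_HOURS_PER_100_GB = 1000  # figure quoted in the text above

# Core counts chosen only for illustration.
for cores in (1, 8, 100, 1000):
    hours = wall_clock_hours(CPU_HOURS_PER_100_GB, cores)
    print(f"{cores:>5} cores -> {hours:8.1f} hours")
```

The same job that would occupy a single core for about six weeks fits into a working day on a 100-core cluster, which is why the task is manageable for a well-equipped sequencing center.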

However, data management and storage techniques aren’t evolving as quickly, so a large loss of precious data can be expected. This is highly undesirable, as it would seriously hinder the progress made in human genomics. There is therefore a pressing need for an efficient data management technique that can be easily updated. This need will only grow in the near future, as genome mapping moves from large labs with powerful computers to small hospitals and labs.

What Is Expected in the Solution?

The pace at which new genomic sequencing techniques are being discovered and developed is extremely high. This pace can be very beneficial to medical science in the form of a powerful step toward eradicating major diseases. However, this pace can be very challenging too.

The challenge lies in managing the large amounts of data produced by sequencing projects. An effective solution is needed to store and process this big data. The solution must be cheap, fast and adaptable, and the analysis it provides must be accurate and consistent. So, what’s the solution to the problem? Undoubtedly, it is Hadoop. (For more info on uses of Hadoop, see 5 Insights About Big Data (Hadoop) as a Service.)


Why Hadoop Is the Best Solution for Genome Sequencing

What the genomics industry needs is a solution that can help it effectively manage its data, process it and store it for future use. Hadoop matches these requirements closely, and as big data management software it can greatly improve the current data storage techniques of the genomics industry.

Hadoop’s distributed storage and parallel batch processing make it possible for sequencing centers to analyze and store large amounts of data at once, and to keep that data available for future use. Hadoop can outperform many legacy systems on workloads of this size, as it scales across many machines and tolerates the failure of individual nodes.

What Else Can Hadoop Do?

Hadoop has opened up a large number of possibilities and opportunities in the field of genomics and gene sequencing. It runs computations in parallel across a cluster, which makes faster sequence analysis possible. Also, using Hadoop’s MapReduce programming model, large numbers of genes can be mapped very easily. Because of this, sequencing with Hadoop will truly become “next-gen” and will be much less complicated.
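
The MapReduce model mentioned above can be sketched in a few lines. The example below does not use Hadoop itself: it is an in-memory imitation of the map, shuffle and reduce steps, applied to counting k-mers (fixed-length substrings) in sequence reads. The function names and the sample reads are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_read(read: str, k: int) -> Iterator[Tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every k-length substring of one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group emitted values by key, as a framework would."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: sum the counts collected for one k-mer."""
    return key, sum(values)


def count_kmers(reads: List[str], k: int) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on a list of reads."""
    emitted = (pair for read in reads for pair in map_read(read, k))
    grouped = shuffle(emitted)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


reads = ["ACGTAC", "GTACGT"]  # made-up reads for illustration
print(count_kmers(reads, k=3))
```

Because each read is mapped independently and each key is reduced independently, a framework such as Hadoop can spread both steps across many machines, which is where the speedup comes from.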

Opportunities for Hadoop

Hadoop has several opportunities in the genome industry, but the best one is suggested by Lynda Chin’s article “Making sense of cancer genomic data,” in the journal Genes & Development. In this article, she discusses how modern genomics has opened new doors, leading to many positive results such as the discovery of genomic information about cancer. This brings us closer to discovering a cure for cancer itself. However, further progress requires more attention and a powerful data management application to improve research capability in the field. This may be the best opportunity for Hadoop to prove its speed, power and accuracy.

Crossbow: The Next-Generation Data Management Platform

Crossbow, a software pipeline for the analysis of genome re-sequencing data, is one of the best solutions. It integrates two tools within Hadoop: Bowtie, a fast algorithm for aligning the sequenced data, and SOAPsnp, a genotyper that compares and examines the aligned data. It is built on Apache Hadoop, an implementation of the MapReduce framework. Crossbow is portable and scalable, and is also suitable as a cloud computing tool.
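
The shape of such a pipeline (align reads in the map step, then call variants per position in the reduce step) can be illustrated with a toy example. This is not Crossbow's code and does not call Bowtie or the genotyper: the aligner below simply picks the offset with the fewest mismatches, the variant caller takes a majority vote, and the reference and reads are made up.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGTTGCAAGCT"  # made-up reference sequence


def align(read, reference=REFERENCE):
    """Toy aligner: return the offset with the fewest mismatches.

    A stand-in for a real aligner such as Bowtie, for illustration only.
    """
    offsets = range(len(reference) - len(read) + 1)
    return min(
        offsets,
        key=lambda o: sum(a != b for a, b in zip(read, reference[o:])),
    )


def map_phase(reads):
    """Map: align each read and emit (reference position, observed base)."""
    for read in reads:
        start = align(read)
        for i, base in enumerate(read):
            yield start + i, base


def reduce_phase(pairs, reference=REFERENCE):
    """Reduce: per position, report the majority base if it differs
    from the reference (a toy stand-in for a real genotyper)."""
    pileup = defaultdict(Counter)
    for pos, base in pairs:
        pileup[pos][base] += 1
    variants = {}
    for pos in sorted(pileup):
        base, _count = pileup[pos].most_common(1)[0]
        if base != reference[pos]:
            variants[pos] = (reference[pos], base)
    return variants


# Three overlapping made-up reads, each carrying an A where the
# reference has a T at position 4.
reads = ["ACGTAG", "GTAGCA", "AGCAAG"]
print(reduce_phase(map_phase(reads)))
```

In a real deployment the map outputs would be partitioned by genomic region and sorted by the framework before reaching the reducers, so that each reducer sees all the aligned bases for its region.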

With this powerful integration, a complete genome can be examined in just one day on a local cluster with 10 nodes. With a 40-node cluster, the process is even faster and completes in just three hours, at a total cost of less than $100. A study conducted to test the accuracy of Crossbow showed that it can compare each genome with 99 percent accuracy. Another helpful feature of Crossbow is that it runs on the cloud. Thus, Crossbow could enable the thousands of future sequencing centers, such as hospitals, to analyze large amounts of genome data without needing powerful, costly computers of their own.
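
The figures above can be turned into node-hours for a rough comparison. The sketch assumes "one day" means 24 hours and treats $100 as an upper bound on the cloud run's cost; the nodes in the two clusters are not necessarily equivalent machines, so the comparison is only indicative.

```python
def node_hours(nodes: int, hours: float) -> float:
    """Total machine time consumed: number of nodes times elapsed hours."""
    return nodes * hours


local_run = node_hours(10, 24)  # 10-node local cluster, ASSUMING one day = 24 h
cloud_run = node_hours(40, 3)   # 40-node cluster, three hours

# Upper bound on the price per node-hour implied by "less than $100".
max_cost_per_node_hour = 100 / cloud_run

print(f"Local run: {local_run:.0f} node-hours")
print(f"Cloud run: {cloud_run:.0f} node-hours")
print(f"Implied cost ceiling: ${max_cost_per_node_hour:.2f} per node-hour")
```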

Other Hadoop-Based Genomics Software

Many companies have recognized the power of Hadoop to change the world of genomics, and have adapted it to tap into its potential for advanced genome sequencing. Some well-known examples of Hadoop-based genome sequencing solutions are given below:

  • Hadoop-BAM: This is a powerful data management tool that uses Hadoop’s MapReduce framework for various genomics activities, such as genotyping. It works with data in the Binary Alignment/Map (BAM) format.
  • CloudBurst: This Hadoop-based solution was created in 2009. It is extremely efficient at comparing genome sequences and mapping sequence reads to a reference genome. It is also one of the first Hadoop-based applications designed for this purpose.


The integration of big data and the genomics industry is proving to be a boon. These platforms are helping in the discovery of treatments for diseases such as cancer, and the data produced by genome mapping can also be used to develop preventive guidance for such diseases. The advent of big data can be regarded as a turning point in the world of genomics and, if the information is used wisely, possibly in the broader field of healthcare too. The only way for this field to advance is through the use of proper data management tools like Hadoop.


Kaushik Pal
Technology writer

Kaushik is a technical architect and software consultant with over 23 years of experience in software analysis, development, architecture, design, testing and training. He has an interest in new technologies and areas of innovation. He focuses on web architecture, web technologies, Java/J2EE, open source software, WebRTC, big data and semantic technologies. He has demonstrated expertise in requirements analysis, architectural design and implementation, technical use cases and software development. His experience has covered various industries such as insurance, banking, airlines, shipping, document management and product development, etc. He has worked on a wide range of technologies ranging from large scale (IBM…