With all the buzz about enormous NSA data centers holding gazillions of data bits about our private lives, there’s one thing that hasn’t been talked about much, at least on CNN. It involves an engineering problem that has emerged along with cloud technology, big data and the impressive physical data storage centers now being built all over the world. So what is it? Well, no matter who’s administering one of the mammoth IT systems that run these facilities, there is a need for software systems that help all of that data get in and out of the pipeline quickly. That need represents one of the most interesting IT puzzles facing professionals today.
As many experts point out, today’s extreme demand for data processing goes far beyond the traditional approaches. Simply put, using simple database structures and tools like a SQL query interface is not going to provide enough processing power or functionality for the proprietary systems that have been developed over the past few years. The archives of today’s big tech companies need extremely scalable technology. They need data processing tools that can input and output results in much higher volume than a single server can facilitate. They need solutions that can be quickly ramped up for growth, solutions that include complex levels of artificial intelligence, and solutions that are designed for easy management by an IT department.
The question is, how do companies and government agencies conquer the limitations of the traditional data handling pathway? Here we'll take a look at one very promising option: software that handles big data and the administration of multiple data centers.
Google File System: A Big Case Study
The proprietary technology that Google uses to access its data centers is one of the best examples of common models for big data handling and multiple data center administration. The Google File System (GFS), described in a 2003 research paper, is designed to support the huge volume of high-speed updates to data systems that are part of getting so much new information into and out of a single platform as millions of users click away at the same time. Experts refer to this as a distributed file system, and use the term "data object storage" to describe these highly complex techniques. In reality, however, these terms don’t even scratch the surface in terms of describing what’s at work.
Individually, the features and components that make up a system like GFS may not be ground-breaking anymore, but they are complex. Many of them have been covered on this site as relatively new innovations that are part of the groundwork for a new, always-on, always-connected global IT system. Collectively, though, a system like GFS is much more than the sum of its parts: it’s a largely invisible but hugely complex network teeming with individual data pieces getting thrown this way and that in a process that would, if fully modeled visually, look like chaos. Understanding where all of the data is going takes a lot of energy and commitment, as those manning the battle stations of these systems will readily admit.
"There are too many details that have a profound impact on areas of usability - including external and internal fragmentation, log-based vs. in-place updates, and levels of transaction consistency - to sum up the way it works in a single succinct sentence," says Momchil Michailov, CEO and co-founder of Sanbolic.
"A distributed file system is either a distributed aggregator of local name spaces and free spaces of participating nodes, or a local file system that runs on multiple nodes accessing shared storage with the aid of a distributed lock manager component," he said.
Kerry Lebel is senior product manager at Automic, a company known for its scalable automation platforms. Lebel says that while it’s accurate to describe a DFS as a system that simply distributes workloads across servers built from low-cost hardware, that doesn’t really tell the whole story.
"What you end up missing is all the 'cool factor' of how they do what they do," Lebel said.
When you step away from the technical details and just think about the basic idea behind the distributed file system, the "cool factor" that Lebel talks about is evident. These big data handling systems replace old file/folder systems with structures that involve not only multiple delivery systems, but an "object oriented" approach, where a vast number of units are shuttled here and there to prevent bottlenecks.
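That object-oriented idea can be sketched in miniature. The following toy Python example, with entirely made-up node names, shows the basic move: each self-contained object carries its own metadata and is hashed to one of several nodes, so no single node becomes a choke point.

```python
import hashlib

class ObjectStore:
    """Toy object store: spreads self-contained objects (data plus
    metadata) across several nodes by hashing their keys, so traffic
    is spread out instead of funneled through one node."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.storage = {n: {} for n in nodes}

    def _node_for(self, key):
        # Hashing the key picks a node deterministically.
        h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
        return self.nodes[h % len(self.nodes)]

    def put(self, key, data, metadata=None):
        node = self._node_for(key)
        self.storage[node][key] = {"data": data, "meta": metadata or {}}
        return node

    def get(self, key):
        node = self._node_for(key)
        return self.storage[node][key]["data"]

# Illustrative usage; node names are hypothetical.
store = ObjectStore(["node-a", "node-b", "node-c"])
store.put("user/42/photo.jpg", b"...", {"owner": "u42"})
store.get("user/42/photo.jpg")
```

Real systems layer replication, failure handling and rebalancing on top of this basic routing idea, but the hash-and-scatter principle is the same.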
Think, for example, of a state-of-the-art highway system, where hundreds of thousands of cars are not just funneled down a multilane straightaway, but scooped up into neat little cloverleaf or oxbow tributaries, which are spun around and sent toward their destinations on a variety of detours. From the sky, everything looks as precisely choreographed as the workings of a Swiss watch. That's the kind of visual model that engineers look at when they dream up new ways to route information around limitations by "kicking" it to different levels of a multi-tiered data containment schema. Leaving aside the specs, this is the top-level goal of a handling system: to keep those self-contained objects with their embedded metadata moving at top speed to where they need to be, whether to reach consistency goals, satisfy an end user, or inform a top-level observation or analysis.
A Look at the Core Technology
An article by Sean Gallagher that appeared on Ars Technica breaks the GFS design down into somewhat more manageable parts, and hints at what’s under the hood at Google.
GFS starts with a redundant and fault-tolerant model for data reads and writes. The idea here is that instead of writing a specific update to a single drive, new systems write chunks of data to multiple destinations. That way, if one write fails, others will remain. To accommodate this, one primary network component farms out data handling to other subordinate units, re-aggregating the data when a client "calls" for it. All of this is made possible by a metadata protocol that helps to identify where certain updates and transmission results are within the greater system.
Another very important aspect of this is how these duplicate-heavy systems enforce data consistency. As Gallagher notes, the GFS design sacrifices some consistency while still "enforcing atomicity," or protecting the principle of how data gets updated across multiple storage units to match up over time. Google’s "relaxed consistency model" seems to follow the essential theory of the BASE model, which provides more flexibility in return for a longer time frame for consistency enforcement.
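A relaxed, BASE-style consistency model can be made tangible with a tiny sketch (purely illustrative, not Google's mechanism): each replica queues an update and applies it on its own schedule, so reads may briefly disagree before every copy converges.

```python
from collections import deque

class Replica:
    """One copy of a value. Updates are accepted immediately but
    applied asynchronously, BASE-style: basically available, soft
    state, eventually consistent."""
    def __init__(self):
        self.value = None
        self.pending = deque()
    def enqueue(self, value):
        self.pending.append(value)   # accepted, not yet applied
    def apply_pending(self):
        while self.pending:
            self.value = self.pending.popleft()

replicas = [Replica() for _ in range(3)]
for r in replicas:
    r.enqueue("v2")                  # update reaches every replica

replicas[0].apply_pending()          # one replica applies right away
# At this instant, reads disagree: replica 0 says "v2", the
# others still return the stale value (None).

for r in replicas:
    r.apply_pending()                # eventually, all copies converge
```

The trade this dramatizes is exactly the one Gallagher describes: the system tolerates a window of inconsistency in exchange for not blocking every write until all copies agree.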
How Do Other Big Systems Achieve This?
"When sufficiently large scale is reached, inconsistencies or corruptions to the data become inevitable," says Michailov. "Therefore, a primary goal of distributed file systems should be the ability to carry out as many operations as possible in the presence of corruption, while providing efficient methods to deal with the corruption simultaneously." Michailov also mentions the need to preserve performance through careful implementation of redundancy.
"For example, creating metadata (data about the data) on each disk enables that disk to rebuild its proper data structure if its mirror copy is corrupted," Michailov said. "Additionally, RAID levels can be used to combat storage failures at either the file system aggregator or the shared volume manager levels."
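The RAID idea Michailov mentions rests on a simple arithmetic trick that is easy to demonstrate. In a RAID-5-style layout, a parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors. The sketch below shows only that core trick, not a full RAID implementation:

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)            # stored alongside the data

# Lose one data block, then rebuild it from the survivors + parity.
lost_index = 1
survivors = [blk for i, blk in enumerate(data) if i != lost_index]
rebuilt = xor_blocks(survivors + [parity])
assert rebuilt == data[lost_index]   # the lost block is recovered
```

Because XOR is its own inverse, combining the surviving blocks with the parity block cancels everything except the missing block, which is why a single-disk failure costs only one extra block of storage per stripe.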
In discussing another consistency model, Lebel focuses on a system called the Hadoop Distributed File System (HDFS), which he calls an "industry de facto standard."
In HDFS, says Lebel, each data block is replicated three times on different nodes, and on two different racks. Data is checked end-to-end. Failures get reported to the NameNode, a data handler that gets rid of corrupt blocks and creates new ones.
All of this supports the kinds of "clean data" that are so important for the integrity of one of these mass data systems.
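The detect-discard-replace loop Lebel describes can be sketched in a few lines. This is a toy, not HDFS code: the class and method names are made up, and a real NameNode tracks block locations across DataNodes rather than holding bytes itself. The checksum-then-heal pattern, though, is the same.

```python
import zlib

class NameNodeSketch:
    """Toy version of the HDFS pattern: every block copy carries a
    checksum, corrupt copies are discarded, and a fresh replica is
    cloned from a healthy one to restore the replication factor."""
    def __init__(self):
        self.replicas = {}   # block_id -> list of (data, checksum)

    def write_block(self, block_id, data, copies=3):
        self.replicas[block_id] = [(data, zlib.crc32(data))] * copies

    def report_and_heal(self, block_id):
        # End-to-end check: drop any copy whose bytes no longer
        # match the stored checksum.
        healthy = [(d, c) for d, c in self.replicas[block_id]
                   if zlib.crc32(d) == c]
        # Re-replicate from a healthy copy until we're back to 3.
        while healthy and len(healthy) < 3:
            healthy.append(healthy[0])
        self.replicas[block_id] = healthy

nn = NameNodeSketch()
nn.write_block("blk_1", b"payload")
# Simulate silent corruption of one copy (bytes change, checksum doesn't).
nn.replicas["blk_1"][0] = (b"XXXXXXX", nn.replicas["blk_1"][0][1])
nn.report_and_heal("blk_1")          # back to three healthy copies
```

This is the sense in which checksums "support clean data": corruption is caught at read time rather than silently propagating into downstream analysis.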
Maintaining a DFS
Another very different look at GFS comes from an October 2012 article by Wired writer Steven Levy. Levy's piece is much briefer, characterizing the software approach Google uses to handle its collective network from the top down.
"Over the years," writes Levy, "Google has also built a software system that allows it to manage its countless servers as if they were one giant entity. Its in-house developers can act like puppet masters, dispatching thousands of computers to perform tasks as easily as running a single machine."
Doing this also involves tons of cyber-based and environmental maintenance, from dedicated test teams trying to "break" server systems, to carefully controlled temperatures across the halls of the data crypt.
Levy also mentions supplementary technologies for GFS, like MapReduce, a programming model for processing large data sets in parallel, and Hadoop, an open-source framework that shares some design principles with GFS. These tools have their own impact on how big data center handling systems get designed, and what’s likely to emerge in the future. (Learn more about these technologies in The Evolution of Big Data.)
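The MapReduce model itself is compact enough to sketch. The canonical example is a word count: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. In a real cluster each phase runs in parallel across many machines; this single-process version just shows the shape of the model:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: aggregate each key's values independently.
    return {key: sum(values) for key, values in grouped.items()}

docs = ["big data big systems", "big pipelines"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["big"] is 3
```

Because the map and reduce functions are side-effect-free and per-key, the framework can split the work across thousands of machines without the programmer writing any distribution logic, which is precisely what makes the model attractive for data center scale.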
Michailov believes that MapReduce has the potential to support ever-greater data center systems, and talks about a "single implementation" of shared and aggregated file systems that could "keep the name nodes of an aggregated file system in a shared cluster with SSDs for storage."
For his part, Lebel sees a move away from batch processing (the Hadoop-supported method) to stream processing, which will bring these data operations closer to real-time.
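The batch-versus-stream distinction Lebel draws can be boiled down to a minimal sketch. A batch job waits for the whole dataset before answering; a streaming job maintains a running result that is usable after every record, which is what brings answers "closer to real-time":

```python
def batch_average(records):
    """Batch style: the answer exists only after all data arrives."""
    return sum(records) / len(records)

class StreamingAverage:
    """Stream style: a running result is available after each record."""
    def __init__(self):
        self.count = 0
        self.total = 0.0
    def update(self, value):
        self.count += 1
        self.total += value
        return self.total / self.count   # usable immediately

stream = StreamingAverage()
latest = None
for v in [10, 20, 30]:
    latest = stream.update(v)   # an answer after every record

# Both approaches converge on the same final result.
assert latest == batch_average([10, 20, 30])
```

Real stream processors add windowing, fault tolerance and out-of-order handling on top of this, but the core appeal is visible even here: decision-makers can act on the partial result instead of waiting for the batch to close.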
"The faster we can process the data and make it available to business decision-makers or to our customers, the more of a competitive advantage there will be," says Lebel, who also suggests replacing the above processing terminology with terms that focus on the end user. By thinking about "synchronous" activities, or activities synced up with end-user actions, and "asynchronous" activities that are more flexible in terms of implementation, Lebel says companies can use SLAs and other resources to define how a given service system will work.
What all of this boils down to, in a sense, is that developers and engineers need to continually work to speed up and improve services over platforms that have grown far beyond their classic, 1990s-era archetypes. That means looking critically at the machinery of data and breaking through bottlenecks in ways that support not only a growing population, but also the exponential change happening at breakneck speed that pundits are calling "the next industrial revolution." It’s likely that those who break the most ground on these fronts will end up dominating the markets and economies of the future.