Although there is no single standard NoSQL database, NoSQL is rapidly rising as a viable alternative to the relational database model that has dominated the industry. NoSQL concepts represent some of the most fundamental rethinking of database design since E.F. Codd’s paper on relational databases burst onto the scene in 1970.
This article digs a little deeper into the more advanced NoSQL concepts. These databases, including CouchDB, MongoDB and SimpleDB, are becoming the database management systems of choice for websites that need to serve up lots of data quickly. (Get an intro to NoSQL in NoSQL 101.)
Does NoSQL Pass the ACID Test?
Now that databases are powering the large websites people use every day, such as Twitter, Facebook, YouTube and even Techopedia, it’s important that they be able to serve up their data quickly.
Traditionally, databases have been engineered to prioritize reliability and consistency over speed. This comes from their mainframe heritage, when they were first employed to handle important jobs like payrolls. If you’re handling money, you want to make sure that every transaction you process is absolutely correct. Besides, you can run jobs that take a long time overnight on a mainframe. Who cares how long it takes as long as everyone gets their paychecks? (Learn more about databases and different types of databases in Introduction to Databases.)
The database industry has defined four key properties that make up a reliable database (known collectively by the acronym ACID):
Atomicity means that a transaction happens completely, or not at all. For example, imagine a database that serves an airline reservation system. A customer books a flight and enters the credit card details, but something goes wrong. Maybe the server handling the website crashes before the database server can report back confirmation. According to atomicity, the transaction would be rejected and the customer’s card wouldn’t be charged. It happens or it doesn’t happen – there is no in between.
Consistency means that all of the data is reliable and valid from one transaction to the next. A transaction that could leave the database in an invalid state would be rejected.
Isolation means that everything that happens in a database, even if the system is running parallel operations, is exactly the same as if the transactions had been executed serially.
Durability means that the data will be intact even in the face of problems like power failures and other disasters.
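To make atomicity concrete, here is a minimal sketch of the airline booking example using Python's built-in sqlite3 module. The table names, the `book_seat` helper and the simulated crash are all assumptions made up for illustration, not part of any real reservation system.

```python
import sqlite3

# Hypothetical schema for illustration: one table of seats, one of card charges.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (seat TEXT PRIMARY KEY, passenger TEXT)")
conn.execute("CREATE TABLE charges (passenger TEXT, amount INTEGER)")
conn.execute("INSERT INTO seats VALUES ('12A', NULL)")
conn.commit()


def book_seat(conn, seat, passenger, amount, fail=False):
    """Reserve a seat and charge the card as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE seats SET passenger = ? WHERE seat = ?", (passenger, seat)
            )
            if fail:
                # Stand-in for the server crashing before confirmation.
                raise RuntimeError("simulated crash")
            conn.execute("INSERT INTO charges VALUES (?, ?)", (passenger, amount))
        return True
    except RuntimeError:
        return False


# First attempt fails halfway through: neither the seat nor the charge is kept.
first_attempt = book_seat(conn, "12A", "alice", 300, fail=True)
owner_after_failure = conn.execute(
    "SELECT passenger FROM seats WHERE seat = '12A'"
).fetchone()[0]
charges_after_failure = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]

# Second attempt succeeds: both changes are committed together.
second_attempt = book_seat(conn, "12A", "alice", 300)
charges_after_success = conn.execute("SELECT COUNT(*) FROM charges").fetchone()[0]
```

After the failed attempt the seat is still unassigned and no charge has been recorded, which is the "it happens or it doesn't" behavior described above.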
NoSQL databases, by contrast, have tended to prioritize speed over integrity. Considering that they’re designed for serving up data like video clips and websites, developers and administrators consider it a good tradeoff.
NoSQL databases are different in that instead of absolute consistency, they aim for eventual consistency. Even if transactions leave parts of the database in a state that’s inconsistent with the rest of the database, the changes will eventually be propagated to the rest of the database, often once the load lets up. Websites aren’t completely swamped all the time, and users can forgive the occasional error. After all, many Twitter users put up with the site’s hiccups quite regularly, and still keep coming back for more.
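The idea of eventual consistency can be sketched with a toy model. The `EventuallyConsistentStore` class below is purely hypothetical: it assumes a write lands on one replica first and reaches the others only when a `sync` step runs. Real systems use far more sophisticated propagation schemes.

```python
class EventuallyConsistentStore:
    """Toy model: writes land on one replica and reach the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # updates not yet copied to every replica

    def write(self, key, value):
        # Only the first replica sees the write immediately.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].get(key)

    def sync(self):
        # Propagate every pending update to the remaining replicas.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("status", "hello")
stale_read = store.read("status", replica_index=2)  # update has not arrived yet
store.sync()
fresh_read = store.read("status", replica_index=2)  # replicas now agree
```

Between the write and the sync, a reader hitting the third replica sees nothing, which is the kind of temporary inconsistency these databases accept in exchange for speed.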
If Henry Ford had gone into the database business instead of cars, he might have said “You can have any database model you like, as long as it’s relational.”
For many years, that was what the industry was like. Even if you went with an open source database management system such as MySQL or PostgreSQL, you still ended up with a relational model.
The NoSQL databases that have proliferated since the late ’90s form a loose affiliation with only one thing in common: they don’t use the traditional relational model.
There are several major models that administrators planning a NoSQL database can choose from:
- Document Store: Instead of tables with fixed rows and columns, these databases use structures based on document standards such as XML or JSON.
- Graph: This database model draws on an area of mathematics known as graph theory. The data points are known as vertices and the connections between them are known as edges. This model is especially useful for showing relationships between nodes. A good example is a social network showing a person’s friends. Edges can be “directed” or “undirected.” A directed edge only goes one way, but an undirected edge goes both ways. Following someone on Twitter is an example of a directed edge if a person doesn’t follow back. Facebook friendship, on the other hand, is undirected, since friendship is mutual.
- Key-Value Store: This is similar to a data structure in several programming languages known as an associative array, a hash or a dictionary. A key-value store matches, as the name says, keys to values. A good example would be a phone directory. A person’s name is the key, and the phone number is the value.
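As a rough illustration, the three models above can be mimicked with ordinary Python data structures. The names and sample data here are invented for the example, and real NoSQL products add storage, indexing and querying on top of these basic shapes.

```python
import json

# Document store: each record is a self-describing, JSON-like document.
user_document = {
    "_id": "user-1",
    "name": "Ada",
    "emails": ["ada@example.com"],
}
serialized = json.dumps(user_document)  # what might be sent to the database

# Graph: people are vertices, and "follows" relationships are directed edges.
follows = {
    "ada": {"grace"},  # ada follows grace
    "grace": set(),    # grace follows nobody
}


def is_mutual(graph, a, b):
    """True if the edge exists in both directions, like a mutual friendship."""
    return b in graph.get(a, set()) and a in graph.get(b, set())


# Key-value store: a name (key) maps to a phone number (value).
phone_directory = {"Ada": "555-0100", "Grace": "555-0101"}
ada_number = phone_directory["Ada"]
```

Here `is_mutual(follows, "ada", "grace")` is false because the edge runs in one direction only, as in the Twitter example.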
With all of these choices, what is an administrator supposed to do? It’s best to look at the kind of data a database is going to store and select the model that makes the most sense.
Sharding vs. Replication
Now that you’ve selected your database model, the next step is to figure out how to physically store data. One solution is sharding. Sharding treats the various nodes in a database system as partitions in a giant hard drive. Just as each partition on a local disk holds its own data, each node in a sharded system stores its own piece of the database. This allows for massively distributed systems, which can speed up database performance.
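One common way to decide which node holds which piece is to hash each key. The sketch below assumes that approach; the `shard_for` function and `ShardedStore` class are hypothetical names, and each "node" is just an in-memory dictionary.

```python
import hashlib


def shard_for(key, shard_count):
    """Pick a shard by hashing the key, so the same key always maps to the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Toy sharded store: each shard stands in for a separate node."""

    def __init__(self, shard_count=4):
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)


sharded = ShardedStore(shard_count=4)
for name in ["ada", "grace", "linus", "margaret"]:
    sharded.put(name, name.upper())

# Each key lives on exactly one shard, so the shard sizes add up to the key count.
total_keys = sum(len(shard) for shard in sharded.shards)
```

Because every lookup goes straight to one shard, adding nodes spreads both the data and the work.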
Replication, on the other hand, is similar to a mirrored RAID scheme. Copies of the same data are stored on several nodes, giving some degree of redundancy if one of them fails.
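A minimal sketch of replication, assuming the simplest form in which every node keeps a full copy of the data, much as a mirrored RAID array does. The `ReplicatedStore` class and its `fail` method are invented for illustration.

```python
class ReplicatedStore:
    """Toy replicated store: every live node receives a copy of each write."""

    def __init__(self, node_count=3):
        self.nodes = [{} for _ in range(node_count)]
        self.alive = [True] * node_count

    def put(self, key, value):
        for node, up in zip(self.nodes, self.alive):
            if up:
                node[key] = value

    def fail(self, index):
        # Simulate a node going offline.
        self.alive[index] = False

    def get(self, key):
        # Read from the first live node that has the key.
        for node, up in zip(self.nodes, self.alive):
            if up and key in node:
                return node[key]
        return None


replicated = ReplicatedStore(node_count=3)
replicated.put("video-42", "cat.mp4")
replicated.fail(0)  # lose one node
value_after_failure = replicated.get("video-42")  # still served by another copy
```

The data survives the loss of a node because the other nodes hold the same copy, which is the redundancy replication is meant to provide.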
One way NoSQL databases get a speed boost is through denormalization. This means that related data is stored together, often with some duplication, so that a query can fetch everything it needs in one read instead of joining several tables. This again comes at the expense of consistency, since every copy has to be kept up to date. The database administrator must take care to ensure that the database does not become overly inconsistent. (Want to be a database administrator? Read Database Administration Careers 101.)
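Denormalization is usually understood as storing related data together, with some duplication, so that a read needs one lookup rather than a join. The sketch below contrasts the two layouts using plain dictionaries. The data, field names and helper functions are assumptions made for the example.

```python
# Normalized layout: the author's name lives in one place only.
users = {1: {"name": "Ada"}}
posts = {10: {"title": "Hello", "author_id": 1}}


def read_post_normalized(post_id):
    """Two lookups: the post, then its author (the equivalent of a join)."""
    post = posts[post_id]
    return {"title": post["title"], "author": users[post["author_id"]]["name"]}


# Denormalized layout: the author's name is copied into every post.
posts_denormalized = {10: {"title": "Hello", "author_id": 1, "author_name": "Ada"}}


def read_post_denormalized(post_id):
    """One lookup: everything needed is already in the post."""
    post = posts_denormalized[post_id]
    return {"title": post["title"], "author": post["author_name"]}


def rename_user(user_id, new_name):
    """The cost of duplication: every copy must be updated to stay consistent."""
    users[user_id]["name"] = new_name
    for post in posts_denormalized.values():
        if post["author_id"] == user_id:
            post["author_name"] = new_name


same_before = read_post_normalized(10) == read_post_denormalized(10)
rename_user(1, "Ada Lovelace")
same_after = read_post_normalized(10) == read_post_denormalized(10)
```

If `rename_user` skipped the loop over the posts, the two reads would disagree, which is the inconsistency risk that comes with the faster read path.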
Related to denormalization, aggregate functions combine several pieces of data into a single result. These operations might include functions to average numbers or to compute the sum of several data points.
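In code, an aggregate function boils a collection of values down to one result. This short sketch computes a sum and an average over a made-up list of order documents.

```python
from statistics import mean

# Hypothetical order documents, invented for illustration.
orders = [
    {"customer": "ada", "total": 30},
    {"customer": "grace", "total": 50},
    {"customer": "ada", "total": 40},
]

totals = [order["total"] for order in orders]
total_revenue = sum(totals)   # sum of several data points
average_order = mean(totals)  # average of the same data points
```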
One of the defining characteristics of NoSQL databases is how easy they are to implement as distributed systems. One of the most popular techniques, developed by Google, is MapReduce. The map step reads the input data and turns it into key-value pairs, which can then be distributed across the various nodes of the database system. The reduce step then combines the values that share a key into a final result.
For serving up large amounts of data in an instant, the various NoSQL databases are providing a serious challenge to the dominant relational databases. This article should help you decide if a NoSQL database is right for you.
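The classic way to illustrate MapReduce is a word count. The sketch below runs everything in a single process, so it shows only the shape of the technique: a map step that emits key-value pairs, a grouping step, and a reduce step that combines the values for each key. The function names are illustrative and not taken from any particular framework.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def map_reduce(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


word_counts = map_reduce(["NoSQL is fast", "NoSQL is flexible"])
```

In a distributed setting, each node would run the map step on its own share of the documents, and the grouped pairs would be routed to the nodes that run the reduce step.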