Other Types of Databases
Databases vs. Spreadsheets
"Excel also stores my data and I can retrieve and manipulate the data using filters, join it to other files and worksheets, perform advanced functions like VLOOKUP, PivotTable, and so on. So why do I need this fancy database thing you are talking about? You IT guys just want to milk money from me. No, my Excel sheets are perfectly adequate!"
Sound familiar? It might be if you’ve ever worked as a database consultant for small businesses. It would seem at first glance that a lot of the functionality offered by databases can be achieved much more easily (and cheaply!) by just using spreadsheets. However, the spreadsheet has a number of limitations that make it unsuitable for managing some data situations:
- Spreadsheets generally cannot handle multi-user access. If you work in an organization that uses shared files, you have more likely than not come across the annoying "file is locked by another user" error message. Even if a specific type of spreadsheet has some multi-user functionality, it is relatively limited. Databases on the other hand, can perfectly handle multi-user access, even for a mix of read-writes and read-only access for the same data item.
- Spreadsheets provide poor data validation and integrity. There’s little stopping one of your users from totally deleting the data in an Excel file. Sure you can employ worksheet passwords, but these provide a very limited level of security, and cannot stop someone from deleting the entire file. Databases can provide a fine-grained level of security and can even protect users from their own human mistakes. For instance, a database application can easily be set up to ensure that when creating a customer you must also enter a Social Security number.
- Querying and reporting is one feature that spreadsheets do not do well. The ability to run queries against a data set is extremely useful. Granted, spreadsheets can offer some rudimentary reporting via filters and graphing, but comparing this to SQL queries in databases that can join multiple tables and perform complex operations is akin to comparing Fred Flinststone’s car to a brand-new Mercedes S-Class, the epitome of motoring technology.
- Computer resources are severely tasked by handling large spreadsheets. Databases can easily hold and provide access to data in the order of millions of rows, organized into hundreds of tables, all running on a basic server, without breaking a sweat. Try increasing the number of rows in your Excel file to hundreds of thousands of rows, and watch what happens to your computer’s response time.
Spreadsheets are much easier to create and maintain. Databases require more investment in terms of both of financial outlay and human training. However, the reward for this is a much more robust and secure storage and retrieval data system. The point at which you need to move away from spreadsheets and into a database-based system may be when you answer "yes" to one or more of the following questions:
- Is the data held in spreadsheets needed over a long-term or recurring basis, as opposed to a one-time work in progress?
- Do multiple people need access to this data?
- Do you need to safeguard against erroneous entries?
- Does the data need to be protected against inadvertent corruption?
Need more convincing? Well an authority no less than the U.S. government has decreed, via Section 404 of the Sarbanes-Oxley Act, that all public companies must move the reporting of key financial data away from spreadsheets.
As we have already established, the reasons for the dominance of relational databases are simplicity, robustness, flexibility, performance, scalability and compatibility in managing generic data.
However, to offer all of this, relational databases have to be incredibly complex internally. For example, a relatively simple SELECT statement could have hundreds of potential query execution paths, which the optimizer would evaluate at run time. All of this is hidden to us as users, but under the hood, RDBMS determines the "execution plan" that best answers requests by using things like cost-based algorithms.
America’s best-selling sedan of the 2000s was the Toyota Camry, and it’s not hard to see why. The Camry is not the best in many of the categories used to judge cars, such as safety, gas mileage, interior volume, reliability and several others. However, it is always near the top in each category, giving it an overall aggregate score that puts it on top. So what, you may ask, does a midsized family sedan have to do with our discussion on databases? Like the Camry, the relational database does not excel or really shine in any one of the qualities of a good database, but it fulfills all sections nicely, enough to make it the default go-to option.
With the advent of the Web 2.0 and especially cloud-based computing, there has been an increased need for Web-capable databases to serve up, store and manage mind-bogglingly large amounts of content. Content such as Facebook’s user profiles and posts for millions around the world; Google’s billions of searches and Web crawls of other websites; Dropbox’s millions of stored user documents and files; eBay’s millions of auction listings, and so on. Broadly speaking, the buzzword for this area is "big data."
For all these Web-centric databases, the main concern is scalability. As more and more applications are launched in environments that have massive workloads (think of the diverse range of Web services available on the Web today), their scalability requirements can change very quickly and grow very large. Relational databases scale well, but usually only when that scaling happens on a single server. When the capacity of that single server is reached, you need to scale out and distribute that load across multiple servers, moving into so-called distributed computing. This is when the complexity of relational databases starts to rub against their potential to scale. Try scaling to hundreds or thousands of servers, and the complexities become overwhelming. The characteristics that make RDBMS so appealing are the very same that also drastically reduce their viability as platforms for large distributed systems.
For cloud services to be viable, vendors have had to address this limitation, because a cloud platform without a rapidly scalable data store is next to useless. So, to provide customers with a scalable place to store application data, vendors have had to implement a new type of database system that focuses obsessively on scalability, at the expense of the other benefits that come with relational databases. Next, we will look at these new, non-relational databases in more detail.
One of the most severe limitations of relational databases is that each item can only contain one attribute. In the bank example we looked at before, each aspect of a customer’s relationship with a bank must (or rather is best) stored as separate row items in separate tables. The customer’s master details are in one table, the account details are in another table, the loan details in yet another, investments in a different table, and so on. All these tables are linked to each other through the use of relations, or primary keys and foreign keys.
Non-relational databases, specifically a database’s key-value stores or key-value pairs, are radically different from this model. Key-value pairs allow you to store several related items in one "row" of data in the same table. We place the word "row" in quotes because a row here is not really the same thing as the row of a relational table or spreadsheet, although it is still useful to call it that for comparison’s sake. For instance, in a non-relational table for the same bank, each row would contain the customer’s details as well as their account, loan and investment details. All data relating to one customer would be conveniently stored together as one record.
This seems an obviously superior method of storing data. So why on earth would we still use relational databases over this model? Well, the problem with key-value stores is that, unlike relational databases, they cannot enforce relationships between data items. For instance, in our key-value database, the customer details (name, social security, address, account number, loan processing number, etc.) would all be stored as one data record (instead of being stored in several tables, as in the relational model). The customer’s transactions (account withdrawals, account deposits, loan repayments, bank charges, etc.) would also be stored as another single data record.
In the relational model, there is an inbuilt and foolproof method of ensuring and enforcing business logic at the database layer, for instance that a withdrawal is charged to the correct bank account. In key-value stores, this responsibility falls squarely on application logic. Many people are very jittery about leaving this crucial responsibility just to the application, which is why relational databases are not going away any time soon.
However, when it comes to Web-based databases, the aspect of rigorously enforcing business rules is often far down the list of priorities. Topmost on the list is the ability to service large numbers of user requests, typically read-only queries. For instance, on a site like the giant online auction house eBay.com, the majority of users simply browse and look through posted items (read-only operations). Only a fraction of these users actually place bids or reserve the items (read-write operations). And remember, we are talking about millions, sometimes billions, of page views per day. Such a site’s owners are more interested in quick response time to ensure faster page loading for the site’s users, rather than the traditional priorities of enforcing business rules or ensuring a balance between reads and writes. You can extrapolate this for other large well-known sites on the Web - Google, Facebook, Amazon, Twitter and so on.
Relational-model databases can be tweaked and set up to run large-scale read-only operations (through data warehousing and data marts), and thus potentially still serve such customers. However, the real challenge is the relational model’s lack of scalability, as we noted earlier. This is where non-relational models can really shine. They can easily distribute their data loads across dozens, hundreds and in extreme cases (think Google search) even thousands of servers. With each server handling only a small percentage of the total requests from users, response time is very good for each individual user. Although this distributed computing model also exists for relational databases, it is a real pain to implement. This is because the relational model insists on data integrity at all levels, and this must be maintained, even as the data is accessed and modified by several different servers. This is the reason for the non-relational model as the architecture of choice for Web 2.0 applications such as cloud-computing and social networking.
Some examples of non-relational databases that power well-known sites are Amazon’s SimpleDB and Google Search’s BigTable. Other non-relational engines, either open-source or commercially available, are CouchDB, Mongo, Drizzle and the unusually named Project Voldemort.
Databases vs. Data Warehouses
A data warehouse is a special type of database optimized for querying, reporting and analysis. The data in the warehouse is almost always read-only and typically originates from the operational database and other systems. It is then set up to upload at periodic intervals to the warehouse using extraction, transformation and loading (ETL) processes to turn it into a form more suitable for reporting and deeper analysis. The exact details of this transformation are complex, so we will not delve into them here. The main benefit of reporting using data warehouses, as opposed to the organization’s transactional databases, is that warehouses allow much better and more fine-grained data analysis for business consumption. Use of a data warehouse also removes the reporting load from the main transactional system, which can get very taxing for large organizations, especially at period-closing times such as end-of-quarter or end-of-year reporting. In addition, data warehouses also keep a clear and complete history of all data history, even though the transactional system may not offer this ability.
A good example of the usefulness of data warehousing is in the financial-services industry. Banks typically exploit their data warehousing capability to carry out in-depth forecasting and planning. They use patterns and trend analysis to answer such questions as:
- What are the client types most likely to default on loans? Is there any relationship between a client’s income, credit repayment pattern and ATM withdrawals, and his or her loan repayment ability?
- What is the absolute best time of year to offer financing to, say, commercial farmers? Do they typically require financial assistance just before or just after harvest time, or halfway through the planting cycle?
- Why do so many college and university students not maintain their accounts with us after graduation and getting jobs? What features or benefits are they getting from other banks that we do not offer?
Data warehouses are also useful in other industries, such as insurance fraud analysis and pattern matching, telecom call record analysis, agriculture weather and climate forecasting, and many others.
A good example of the power of the analysis offered by data warehouses is the well-known story of a large retail store operator. After analyzing data from their data warehouse, the store managers noticed a baffling trend: late-night diaper and baby-food sales also corresponded to a spike in the increase of beverages, especially canned beer. After some head-scratching and hours spent scrutinizing these late-night customers, the answer finally revealed itself. What was happening was that recent parents, especially young fathers, were sent out by their wives or girlfriends to purchase diapers for the baby, and the stressed-out young dads also decided that while they were at it, they may as well grab a drink to help them relax. The retail store then shrewdly placed both classes of items closer together in the aisles to capture even more of this market, allowing them to triple their sales of alcoholic beverages after 9 p.m.
As you can see, data warehouses are distinct from typical databases in that they are used for more complex analysis of data. This differs from the transactional database, whose main use is to support operational systems and offer day-to-day, small scale reporting. Data warehousing sometimes also needs to work closely with market research and other departments.