In a small operation, the network admins or developers double up as database admins (DBAs). In larger businesses, there might be dozens of DBAs specializing in the many different facets from design and architecture, to maintenance, development, etc. No matter what part of IT you work in, you have to store data at one point or another, and it doesn't hurt for just about everybody to have some knowledge of databases and how they work.
The goal of this tutorial is to provide this basic introduction. We're going to explain the basics of what a database actually is, look at the history, understand relational databases, get into some fundamental concepts from columns and rows on up, touch on other types of databases, get versed on some additional concepts to understand, and wrap it all up with a quick review of the major commercial systems on the market today.
For the most part there are no pre-requisites to this tutorial other than basic computing knowledge.
What Is a Database?
A database, in the most general sense, is an organized collection of data. More specifically, a database is an electronic system that allows data to be easily accessed, manipulated and updated.
In other words, a database is used by an organization as an electronic way to store, manage and retrieve information. The database is one of the cornerstones of enterprise IT, and its ability to organize, process and manage information in a structured and controlled manner is the key to many aspects of modern business efficiency.
However, databases go way beyond simply storing data. As we’ll see later, the inherent logic and efficiency in how the data is stored and retrieved can provide an incredibly powerful business tool to an organization. This is especially true when databases are properly exploited for their reporting and business intelligence capabilities.
The Use and Importance of Databases in Today’s World
So what kinds of organization requires a database and can benefit from its use? Well, the short answer is any business or organization that needs to keep track of large numbers of customers or products. By "large" we mean more than can be stored by a human brain - a lot more.
At this point, a skeptic might still argue that there are countless mom-and-pop stores whose owners keep track of inventory and profit/ loss using the trusty ledger and calculator, and are doing OK. That’s true, but the use of an electronic database can still pay off, even for very small businesses. For instance, a ledger cannot run a simulation to extrapolate profits if say, the shop were to increase the price of ballpoint pens by 2 cents. A database can do that. The ledger cannot run a report tracking down re-order levels for all items to show the store owner which items should be restocked at what times during the year. A database can do that too. A database can even automatically alert the business owner via email or text message.
The most significant benefit of databases, however, is still limited to large organizations with customers and products numbering in the hundreds of thousands, millions, or tens of millions, and the need to store large numbers of individual data items for customers. For instance, a commercial bank needs the personal details of all of its millions of customers, such as name, date of birth, address, Social Security number, etc. Each customer in turn spawns another collection of data depending on the products he or she has signed up for, such as account type, account number, account balance, mortgage amount, credit card loan, repayment period and so on. A third collection of data relates to the customer’s specific transactions, such as the time of transaction, amount, balance left, bank charges, loan amount left to repay, etc.
Clearly, a single customer can generate a huge amount of data in a very short time. Multiply this by millions of customers, and it’s easy to see why having an efficient data storage and retrieval medium is not only a good idea, but is also absolutely essential in preventing the bank’s operations from grinding to a halt.
But wait. Organizations managed just fine before the advent of computers and these new-fangled databases, didn’t they? Sure. But if you’re not unconvinced of the power of databases based on the ability to increase efficiency and transaction speed, consider this: Your competitors will be more efficient and offer better service than you do as a result of their use of database systems.
Commercial banks are a prime example of the use of databases in today’s organizations. Other industries whose operations are heavily reliant on databases are insurance companies, hospitals and health care, schools and colleges, manufacturing, telecommunications companies, and hotels and the hospitality industry.
Databases are indeed what the term says - the base or pool of (related) data upon which an organization’s data systems should be built. Indeed, it is not a stretch to say that without databases, the computer in the workplace would be little more than a more efficient typewriter!
History of Databases
It is important to first realize that the organized, systematic methodology of storing records we know and heavily depend on in databases is not a recent invention. What is recent is the computerization of this methodology beginning in the 1960s. Note that even paper-based records, including ledger-based bookkeeping, are (technically) all forms of a database. That is, a database does not necessarily have to be computerized. Computerization only produced a database management system (DBMS), which is obviously several orders of magnitude more powerful, accurate and capable than what a humble ledger or a puny human brain can achieve. And although we are mostly using the term "database" to refer to the DBMS, the two are not the same thing; all thumbs (DBMSs) are fingers (databases), but not all fingers are thumbs.
The ancient Egyptians used elaborate record-keeping systems to keep stock of grain harvests. The Library of Alexandria employed a sophisticated method to keep track of huge numbers of books and scrolls. These were all early examples of databases, although of course their capabilities would be laughable compared to the hugely capable computerized DBMSs of the 21st century.
But even way back in time, back when the entire field of computing was still in its single-celled-organism stage (the 1960s), many people could already visualize that computers would be truly useful if they could provide a way of reliably storing and retrieving data. The development of databases therefore occurred almost in perfect step with the general development and growth of the computing capabilities of the day. As disk capacity and processor speed grew, so did the storage capacity and feature sets of the contemporary database offerings. One important leap that occurred in the mid-1960s was the switch from tape-based storage to direct access storage, or disks. This change allowed multitasking interactive data access, as opposed to the single-operator, batch-type processing necessitated by tapes.
The earliest database systems were navigational in nature. This means that applications processed and read data by using pointers embedded in the data itself. The pointer led to the next data item and could be doubly linked, allowing linkage to both the previous and next data items. This is similar to how hyperlinks work on a Web page by leading the reader to a related Web page from the current one. The two main data models at this time were the hierarchical model epitomized by IBM's IMS system, and the Codasyl, or network, model. But all these were bested and reduced to mere interesting footnotes in history by the emergence of the relational model by a brilliant computer scientist by the name of E.F. Codd.
E.F. Codd and the Relational ModelThe relational model was a radical departure from the reigning hierarchical model in that it focused on the ability to search a database by content rather than by following a linked navigation system. This offered the significant advantage of allowing databases to grow and store more and more data, all without having to change or rewrite the applications that accessed that data. Essentially, Codd single-handedly designed a way to divorce the skeleton or structure of the database from the data records held in the database. So elegant was this model that it is the de facto standard for database design to this day, with such databases termed relational databases. There are a few very important non-relational databases (especially with the advent of big data and Web 2.0), but the relational model is still used for the overwhelming majority of commercial database offerings.
Today, E.F. Codd’s name would mostly evoke a nonchalant "E.F. who?" among most people, even many in the IT industry. However, his work has directly led to the huge benefits and efficiency that relational databases provide. His contribution to the world of computing is comparable in scale to that of Sir Isaac Newton’s to the world of physics.
Codd attended Oxford college, studying mathematics and chemistry, then worked as a pilot in the Royal Air Force during WWII before moving to the U.S. in 1948 to work as a mathematical programmer for IBM. After spending a decade in Canada, he returned to the U.S. in 1963 and received his Ph.D. in 1965.
In 1970, Codd published a paper on data management titled "A Relational Model of Data for Large Shared Data Banks" for IBM. The giant company, however, was heavily invested in the hierarchical model via its Information Management System (IMS), and Big Blue executives were not interested in developing a competitor for one of their own lucrative product lines. Showing guile rarely seen in academic or scientific types, Codd slyly showed his model to select IBM customers, who upon viewing it needed little convincing of its superiority. The influential customers in turn put pressure on the very same IBM executives to develop the model and they reluctantly (and, one imagines, seething quietly with fury at Codd) placed the model under development in IBM’s Future Systems project, with the system itself known as System R.
However, the head honchos were still unwilling to threaten IMS, and sabotaged Codd’s work by placing the System R project in the hands of developers who were unfamiliar with it. The developers thus failed to use Codd’s own Alpha language for development, instead electing to use a much simpler language known as SEQUEL. This turned out to be an accidental masterstroke, however, since SEQUEL is much easier to understand and use. For copyright reasons, the name was changed to SQL, and is very familiar to database developers and administrators today as the language of choice for writing database queries.
A shrewd young businessman who was developing his own database system read about SQL at a conference in 1979. He recognized its superiority and copied the language into a database product by his own small company. The businessman had also previously seen Codd’s work on the relational model, and became convinced that it was the way to go for database systems. He based his own product on it, even though IBM refused to share System R’s code with him. Remember, IBM was not interested in the relational model. That small company has grown quite a bit; today it’s known as Oracle Corp. As for the businessman, his name is Larry Ellison, and his conviction helped him become one of the richest people in the world. It just goes to show how badly IBM miscalculated the potential of Codd’s relational model. In fact, Oracle DB is the most widely used relational database for corporations today.
The Relational Database
As we have already seen, the work of E.F. Codd established the relational model of databases as the clearly superior method of data storage. The elegance of the relational model in storing and manipulating data has been so effective that since the late 1970s, none of the many pretenders to its throne have managed to overthrow it.
So let’s go into some detail on exactly how the relational model works. A relational database is essentially a group of tables or, to use the technical name, entities (refer to rules 0 and 1 in Codd’s 12 Rules of Relational Databases). Each table is made up of rows (tuples) and columns (attributes). The tables have relationships between them that are defined as using a certain column in one table that references a column in another table.
That is the basic definition of a relational database. But as you will soon see, it can get much more elaborate than this. For instance, one of the fundamental concepts of relational databases is that of referential integrity. This rule states that relationships between tables must always remain consistent. In other words, any field located in a foreign key must be in agreement with the primary key that the foreign key references. Therefore, any updates or deletions to a primary key field must either also be applied to all its foreign keys, or must not be allowed to happen. The same restriction also applies to foreign keys; any updates (but not necessarily deletions) must either also be propagated back to the corresponding parent primary key, or not allowed.
The idea of referential integrity is best explained by illustration. Assume that there are two tables in a bank’s database : the CUSTOMER_MASTER table for holding basic customer/account holder data, and the ACCOUNTS_MASTER table for storing basic data about bank accounts. To uniquely identify each customer/account holder in the CUSTOMER_MASTER table, we create a primary key column, which we call CUSTOMER_ID.
You can see that in order to identify the customer to which a certain bank accounts belongs in the ACCOUNTS_MASTER table, we must reference an existing customer in the CUSTOMER_MASTER table. This means we need to create a CUSTOMER_ID column in the ACCOUNTS_MASTER table as well. This is called a foreign key. This column is special because the values it contains are not new values that you can just conjure up. They must reference identical, existing values in the primary-key column of another table, in this case, the CUSTOMER_ID column of the CUSTOMER_MASTER table
Clear, yes? Good. Now, referential integrity simply means that you cannot edit any CUSTOMER_ID value in CUSTOMER_MASTER without also editing the corresponding value in the ACCOUNTS_MASTER table. If you change Andrew Smith’s customer ID in CUSTOMER_MASTER, you must also change it in ACCOUNTS_MASTER. This makes perfect sense if you think about it. Otherwise, how would we link Andrew Smith’s accounts back to him? Remember, the ACCOUNTS_MASTER table does not hold any information about account holders apart from storing their CUSTOMER_IDs as a foreign key. By the same logic, if a customer ID in the bank accounts table were allowed to exist without a corresponding customer ID in the CUSTOMER_MASTER table, this usually means that a bank account can exist without an account holder, which clearly does not make sense.
Now, following this train of thought a bit further, if you delete a customer ID in CUSTOMER_MASTER, you must also delete all corresponding entries in ACCOUNTS_MASTER. This is simply the manifestation of a neat follow-through action in real life, where if someone stops being a customer, you must also do away with their bank accounts. As Star Trek’s Spock would put it, it’s only logical!
The 12 Rules of Relational Databases
As the relational model started to become fashionable for database design in the early 1980s, Codd was at first bemused then angered by the trend by every other database vendor to slap the relational moniker on their product, even when it didn’t apply. He came up with a list of 12 rules that determined whether a database could be called "relational". His difficulties with IBM reached a peak at this time, and he left to form his own consulting company with another lecturer/ consultant called Chris Date.
Below are Codd’s 12 rules or, as some call them, 12 commandments. There are actually 13, but they are numbered from 0 to 12, hence the name. We will look at them in more detail once we delve into the features of relational databases and reference them then so that you can really understand Codd’s rules in context. In raw form, these rules do not make much sense for those who don’t work in the database field, but anyway, here goes:
0. Foundation Rule
A relational database management system must manage its stored data using only its relational capabilities.
1. Information Rule
All information in the database should be represented in one and only one way - as values in a table.
2. Guaranteed Access Rule
Each and every datum (atomic value) is guaranteed to be logically accessible by resorting to a combination of table name, primary key value and column name.
3. Systematic Treatment of Null Values
Null values (which are distinct from empty character strings, strings of blank characters, zeros or any other number) are supported in the fully relational DBMS for representing missing information in a systematic way that is independent of data type.
4. Dynamic Online Catalog Based on the Relational Model
The database description is represented at the logical level in the same way as ordinary data, so authorized users can apply the same relational language to its interrogation as they apply to regular data.
5. Comprehensive Data Sublanguage Rule
A relational system may support several languages and various modes of terminal use. However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and whose ability to support all of the following is comprehensible:
- Data definition
- View definition
- Data manipulation (interactive and by program)
- Integrity constraints
- Transaction boundaries (begin, commit and rollback)
6. View Updating Rule
All views that are theoretically updateable are also updateable by the system.
7. High-Level Insert, Update and Delete
The ability to handle a base relation or a derived relation as a single operand applies not only to the retrieval of data, but also to the insertion, update and deletion of data.
8. Physical Data Independence
Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representation or access methods.
9. Logical Data Independence
Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind - and those that theoretically permit unimpairment - are made to the base tables.
10. Integrity Independence
Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and storable in the catalog, not in the application programs.
11. Distribution Independence
The data manipulation sublanguage of a relational DBMS must enable application programs and terminal activities to remain logically unimpaired whether and whenever data are physically centralized or distributed.
12. Nonsubversion Rule
If a relational system has or supports a low-level (single-record-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity rules or constraints expressed in the higher-level (multiple-records-at-a-time) relational language.
Fundamental Database Concepts
We will now look at some key database concepts and data objects. Any database administrator worth his salt must be absolutely familiar with these. And they aren’t just theoretical; virtually all DBAs will be intimately involved with these concepts and objects on a near-daily basis. They are to database administration what knowledge of the human body is to the field of medicine.
We will only define each term briefly here. To better grasp the concepts, together with real-world examples, follow the term links and visit the "Related Terms" section to understand how each concept relates to and works with others in the realm of database administration.
The table is the basic data-storage unit in a relational database. Tables consist of columns and rows. The columns are the attributes or qualities that we want to express, while the rows hold the actual data, with one (or no) items per row. Think of the layout of a spreadsheet; this is very similar to the logical organization of a relational table.
A simple example of the tables in the database of a commercial bank can be found below.
Relationships are THE reason why relational databases work so well. If you only learn one concept about databases, this is the one to learn. As the name implies, relationships are the very core of relational databases.
In relational databases, a relationship exists between two tables when one of them has a foreign key that references the primary key of the other table. (More on foreign and primary keys in a bit).
In the diagram below, you can see examples of relationships. For instance, the AccountTypeID field (column) in the AccountTypes table references the AccountTypeID column of the Customer table.
A row, also called a record, represents a set of data about a specific item. Every record in a table has exactly the same structure, but of course different data. Think of the rows in an Excel spreadsheet - the concept of rows in a database is very similar. Each row in a table consists of distinct data items, with one (or zero) items for each column of the table. Rows are also called tuples, although this term is not very common.
See an example of a record or row below:
A column is a specific set of values in a table of the same type. It defines a specific attribute of the table or data. For example, we can create a column called CUSTOMER_SURNAME in a table. This is a self-explanatory column whose purpose is to store customer surnames, one value for each row. Again, think of the rows in an Excel spreadsheet and you’ll have a pretty good idea of how columns in a relational table work.
A primary key is a special column or combination of columns that uniquely identifies each record (row) in the table. The primary key column must be unique for each row, and must not contain any nulls (non-values). So, for example, to identify yourself in various databases belonging to various U.S. government departments, a unique identifier, the Social Security number, is assigned and used.
The primary key, together with the closely related foreign key concept, are the main way in which relationships are defined. Primary keys may also be a combination of columns. For instance, for many companies, a calendar month is a short-term financial period. Therefore, to uniquely identify any period, you can combine the month column and the year column, such as May 2011, to form a primary key to uniquely identify each and every financial period.
We cannot talk about the yin of primary keys without the yang of foreign keys. The two go hand-in-hand. A primary key uniquely defines a record, while a foreign key is used to reference the same record from another table.
In a commercial bank’s database, assume that you have a CUSTOMER_ID column as the primary key in the CUSTOMER_MASTER table. This uniquely identifies each customer; let us also assume the same table also contains other relevant customer information in other columns, such as CUSTOMER_SURNAME, CUSTOMER_FIRSTNAME, CUSTOMER_SOCIAL_SEC_NUMBER, CUSTOMER, CUSTOMER_GENDER, and so on.
We also have another table called LOANS_MASTER to keep track of loans given to the same bank’s customers. We now only need a single column in this table to identify which customer has received a particular loan. We can call this column CUSTOMERID, and it will reference the CUSTOMER_ID column of the CUSTOMER_MASTER table. We don’t need to store all the other customer information (name, gender, Social Security number, etc.) in the loans table. This is the elegance of the relational model.
Structured Query Language (SQL) is the de facto language used for the management and manipulation of data in relational databases. SQL can be used to query, insert, update and modify data. All major relational databases support SQL, which makes life much easier for database administrators (DBAs), who have to support databases on several different platforms. Proficiency in SQL is usually one of the very first things any DBA must learn early in his or her career. Note that some people pronounce SQL as one word, "sequel".
Note that the SQL language is different from SQL Server, a relational database platform from Microsoft. It can be confusing for beginners because of the use of the generic term SQL.
Most commercial RDBMS platforms have their own customized SQL implementations, but these tend to be fully compatible with the standard SQL.
Other Types of Databases
Databases vs. Spreadsheets
"Excel also stores my data and I can retrieve and manipulate the data using filters, join it to other files and worksheets, perform advanced functions like VLOOKUP, PivotTable, and so on. So why do I need this fancy database thing you are talking about? You IT guys just want to milk money from me. No, my Excel sheets are perfectly adequate!"
Sound familiar? It might be if you’ve ever worked as a database consultant for small businesses. It would seem at first glance that a lot of the functionality offered by databases can be achieved much more easily (and cheaply!) by just using spreadsheets. However, the spreadsheet has a number of limitations that make it unsuitable for managing some data situations:
- Spreadsheets generally cannot handle multi-user access. If you work in an organization that uses shared files, you have more likely than not come across the annoying "file is locked by another user" error message. Even if a specific type of spreadsheet has some multi-user functionality, it is relatively limited. Databases on the other hand, can perfectly handle multi-user access, even for a mix of read-writes and read-only access for the same data item.
- Spreadsheets provide poor data validation and integrity. There’s little stopping one of your users from totally deleting the data in an Excel file. Sure you can employ worksheet passwords, but these provide a very limited level of security, and cannot stop someone from deleting the entire file. Databases can provide a fine-grained level of security and can even protect users from their own human mistakes. For instance, a database application can easily be set up to ensure that when creating a customer you must also enter a Social Security number.
- Querying and reporting is one feature that spreadsheets do not do well. The ability to run queries against a data set is extremely useful. Granted, spreadsheets can offer some rudimentary reporting via filters and graphing, but comparing this to SQL queries in databases that can join multiple tables and perform complex operations is akin to comparing Fred Flinststone’s car to a brand-new Mercedes S-Class, the epitome of motoring technology.
- Computer resources are severely tasked by handling large spreadsheets. Databases can easily hold and provide access to data in the order of millions of rows, organized into hundreds of tables, all running on a basic server, without breaking a sweat. Try increasing the number of rows in your Excel file to hundreds of thousands of rows, and watch what happens to your computer’s response time.
Spreadsheets are much easier to create and maintain. Databases require more investment in terms of both of financial outlay and human training. However, the reward for this is a much more robust and secure storage and retrieval data system. The point at which you need to move away from spreadsheets and into a database-based system may be when you answer "yes" to one or more of the following questions:
- Is the data held in spreadsheets needed over a long-term or recurring basis, as opposed to a one-time work in progress?
- Do multiple people need access to this data?
- Do you need to safeguard against erroneous entries?
- Does the data need to be protected against inadvertent corruption?
Need more convincing? Well an authority no less than the U.S. government has decreed, via Section 404 of the Sarbanes-Oxley Act, that all public companies must move the reporting of key financial data away from spreadsheets.
As we have already established, the reasons for the dominance of relational databases are simplicity, robustness, flexibility, performance, scalability and compatibility in managing generic data.
However, to offer all of this, relational databases have to be incredibly complex internally. For example, a relatively simple SELECT statement could have hundreds of potential query execution paths, which the optimizer would evaluate at run time. All of this is hidden to us as users, but under the hood, RDBMS determines the "execution plan" that best answers requests by using things like cost-based algorithms.
America’s best-selling sedan of the 2000s was the Toyota Camry, and it’s not hard to see why. The Camry is not the best in many of the categories used to judge cars, such as safety, gas mileage, interior volume, reliability and several others. However, it is always near the top in each category, giving it an overall aggregate score that puts it on top. So what, you may ask, does a midsized family sedan have to do with our discussion on databases? Like the Camry, the relational database does not excel or really shine in any one of the qualities of a good database, but it fulfills all sections nicely, enough to make it the default go-to option.
With the advent of the Web 2.0 and especially cloud-based computing, there has been an increased need for Web-capable databases to serve up, store and manage mind-bogglingly large amounts of content. Content such as Facebook’s user profiles and posts for millions around the world; Google’s billions of searches and Web crawls of other websites; Dropbox’s millions of stored user documents and files; eBay’s millions of auction listings, and so on. Broadly speaking, the buzzword for this area is "big data."
For all these Web-centric databases, the main concern is scalability. As more and more applications are launched in environments that have massive workloads (think of the diverse range of Web services available on the Web today), their scalability requirements can change very quickly and grow very large. Relational databases scale well, but usually only when that scaling happens on a single server. When the capacity of that single server is reached, you need to scale out and distribute that load across multiple servers, moving into so-called distributed computing. This is when the complexity of relational databases starts to rub against their potential to scale. Try scaling to hundreds or thousands of servers, and the complexities become overwhelming. The characteristics that make RDBMS so appealing are the very same that also drastically reduce their viability as platforms for large distributed systems.
For cloud services to be viable, vendors have had to address this limitation, because a cloud platform without a rapidly scalable data store is next to useless. So, to provide customers with a scalable place to store application data, vendors have had to implement a new type of database system that focuses obsessively on scalability, at the expense of the other benefits that come with relational databases. Next, we will look at these new, non-relational databases in more detail.
One of the most severe limitations of relational databases is that each item can only contain one attribute. In the bank example we looked at before, each aspect of a customer’s relationship with a bank must (or rather is best) stored as separate row items in separate tables. The customer’s master details are in one table, the account details are in another table, the loan details in yet another, investments in a different table, and so on. All these tables are linked to each other through the use of relations, or primary keys and foreign keys.
Non-relational databases, specifically a database’s key-value stores or key-value pairs, are radically different from this model. Key-value pairs allow you to store several related items in one "row" of data in the same table. We place the word "row" in quotes because a row here is not really the same thing as the row of a relational table or spreadsheet, although it is still useful to call it that for comparison’s sake. For instance, in a non-relational table for the same bank, each row would contain the customer’s details as well as their account, loan and investment details. All data relating to one customer would be conveniently stored together as one record.
This seems an obviously superior method of storing data. So why on earth would we still use relational databases over this model? Well, the problem with key-value stores is that, unlike relational databases, they cannot enforce relationships between data items. For instance, in our key-value database, the customer details (name, social security, address, account number, loan processing number, etc.) would all be stored as one data record (instead of being stored in several tables, as in the relational model). The customer’s transactions (account withdrawals, account deposits, loan repayments, bank charges, etc.) would also be stored as another single data record.
In the relational model, there is an inbuilt and foolproof method of ensuring and enforcing business logic at the database layer, for instance that a withdrawal is charged to the correct bank account. In key-value stores, this responsibility falls squarely on application logic. Many people are very jittery about leaving this crucial responsibility just to the application, which is why relational databases are not going away any time soon.
However, when it comes to Web-based databases, the aspect of rigorously enforcing business rules is often far down the list of priorities. Topmost on the list is the ability to service large numbers of user requests, typically read-only queries. For instance, on a site like the giant online auction house eBay.com, the majority of users simply browse and look through posted items (read-only operations). Only a fraction of these users actually place bids or reserve the items (read-write operations). And remember, we are talking about millions, sometimes billions, of page views per day. Such a site’s owners are more interested in quick response time to ensure faster page loading for the site’s users, rather than the traditional priorities of enforcing business rules or ensuring a balance between reads and writes. You can extrapolate this for other large well-known sites on the Web - Google, Facebook, Amazon, Twitter and so on.
Relational-model databases can be tweaked and set up to run large-scale read-only operations (through data warehousing and data marts), and thus potentially still serve such customers. However, the real challenge is the relational model’s lack of scalability, as we noted earlier. This is where non-relational models can really shine. They can easily distribute their data loads across dozens, hundreds and in extreme cases (think Google search) even thousands of servers. With each server handling only a small percentage of the total requests from users, response time is very good for each individual user. Although this distributed computing model also exists for relational databases, it is a real pain to implement. This is because the relational model insists on data integrity at all levels, and this must be maintained, even as the data is accessed and modified by several different servers. This is the reason for the non-relational model as the architecture of choice for Web 2.0 applications such as cloud-computing and social networking.
Some examples of non-relational databases that power well-known sites are Amazon’s SimpleDB and Google Search’s BigTable. Other non-relational engines, either open-source or commercially available, are CouchDB, Mongo, Drizzle and the unusually named Project Voldemort.
Databases vs. Data Warehouses
A data warehouse is a special type of database optimized for querying, reporting and analysis. The data in the warehouse is almost always read-only and typically originates from the operational database and other systems. It is then set up to upload at periodic intervals to the warehouse using extraction, transformation and loading (ETL) processes to turn it into a form more suitable for reporting and deeper analysis. The exact details of this transformation are complex, so we will not delve into them here. The main benefit of reporting using data warehouses, as opposed to the organization’s transactional databases, is that warehouses allow much better and more fine-grained data analysis for business consumption. Use of a data warehouse also removes the reporting load from the main transactional system, which can get very taxing for large organizations, especially at period-closing times such as end-of-quarter or end-of-year reporting. In addition, data warehouses also keep a clear and complete history of all data history, even though the transactional system may not offer this ability.
A good example of the usefulness of data warehousing is in the financial-services industry. Banks typically exploit their data warehousing capability to carry out in-depth forecasting and planning. They use patterns and trend analysis to answer such questions as:
- What are the client types most likely to default on loans? Is there any relationship between a client’s income, credit repayment pattern and ATM withdrawals, and his or her loan repayment ability?
- What is the absolute best time of year to offer financing to, say, commercial farmers? Do they typically require financial assistance just before or just after harvest time, or halfway through the planting cycle?
- Why do so many college and university students not maintain their accounts with us after graduation and getting jobs? What features or benefits are they getting from other banks that we do not offer?
Data warehouses are also useful in other industries, such as insurance fraud analysis and pattern matching, telecom call record analysis, agriculture weather and climate forecasting, and many others.
A good example of the power of the analysis offered by data warehouses is the well-known story of a large retail store operator. After analyzing data from their data warehouse, the store managers noticed a baffling trend: late-night diaper and baby-food sales also corresponded to a spike in the increase of beverages, especially canned beer. After some head-scratching and hours spent scrutinizing these late-night customers, the answer finally revealed itself. What was happening was that recent parents, especially young fathers, were sent out by their wives or girlfriends to purchase diapers for the baby, and the stressed-out young dads also decided that while they were at it, they may as well grab a drink to help them relax. The retail store then shrewdly placed both classes of items closer together in the aisles to capture even more of this market, allowing them to triple their sales of alcoholic beverages after 9 p.m.
As you can see, data warehouses are distinct from typical databases in that they are used for more complex analysis of data. This differs from the transactional database, whose main use is to support operational systems and offer day-to-day, small scale reporting. Data warehousing sometimes also needs to work closely with market research and other departments.
Other Important Database Concepts
We have discussed the fundamental database concepts, but these are not the only ones. There are also other important secondary concepts and data structures worth learning about. In this section, we’ll take a brief look at these.
An index in an RDBMS is a data structure that works closely with tables and columns to speed up data retrieval operations. It works a lot like the index at the beginning of a book. In other words, it provides a reference point that allows you to quickly find and access the data you want without having to traverse the entire book (database).
A schema is the structure behind data organization. It is a visual overview of how different tables are related to each other. This serves to map out and implement the underlying business rules for which the database is created.
The Oracle DB has a somewhat different definition of schema. Here, schema refers to a user’s collection of database objects. The schema name and username are the same but function quite distinctly; i.e., a user may be deleted while his collection of objects (schema) within the database remains intact, and can even be reassigned to another user.
See a visual example of a simple database schema below:
Normalization is the process of (re)organizing data in a database so that it meets two basic requirements: there is no data redundancy (all data is stored in only one place), and data dependencies are logical (all related data items are stored together). For instance, for a bank’s database all customer static data, such as name, address and age, should be stored together. All account information, such as account holder, account type, account branch and so on, should also be stored together; it should also be stored separately from the customer static data.
Normalization is important for many reasons, but chiefly because it enables databases to take up as little disk space as possible, resulting in increased performance. There are several incremental types of normalization, and they can get somewhat complex.
In the RDBMS world, constraint refers to the exact same thing as in the real world. A constraint is a restriction on the type of data you can input into a certain column. Constraints are always defined on columns. A common constraint is the not-null constraint. It simply specifies that all rows in a table must have a value in the column defined as not null.
Transactions (Commits and Rollbacks)
Think of a bank building out its database systems. Imagine if it crashed right as a wire transfer was in progress. Big problems, right? This is the basic idea behind a transaction: All items in a series of changes need to be made together. In the case of a simple transfer, if you debit one account, you need to credit anther account.
In relational databases, saving a transaction is known as a commit, and undoing any unsaved changes is known as a rollback. That is the basic definition, but it gets more complicated when you consider that databases typically have to serve several users simultaneously. Before the transactions is saved, what happens when other users query the same data? At what point do the other users see the saved data? All RDBMSs must be capable of satisfactorily answering these questions, and they do this through the commit/rollback features.
Databases must also provide fault tolerance, even in case of disk failure. When data is committed, there must be a good-as-gold guarantee that the data is actually saved. Relational databases have ingenious ways of achieving this, such as two-phase commits and use of log files.
The term ACID here does not refer to psychedelic trip-inducing substances that help DBAs work better. Rather, it is an acronym describing four highly desirable properties of any RDBMS:
- Atomicity: This refers to a database’s ability to either fully process or fully roll back a transaction.
- Consistency: The database should ensure that all data written therein follows all rules and constraints specified in the database.
- Isolation: Transactions must be processed securely and independently, without interfering with each other.
- Durability: The database must ensure that all committed transactions are saved permanently and cannot be accidentally erased, even in a database crash.
As we have seen before, databases allow multiple users to simultaneous access a set of data. So an inevitable issue that arises is how to handle situations in which two or more users want to access or update the same piece of data. RDBMSs solve this issue by the use of locks.
There are different types of locks, such as transactional or data-level locks that simply lock down a single data field; row-level locks that lock down an entire record of data, while table-level locks restrict access to a whole table. You can see that the last lock type is especially restrictive. As such, such it is employed only when absolutely necessary, such as when performing table-wide data loads or table-wide backups.
By the way, spreadsheets can only employ table-level locks. This is a prime example of their inferiority when compared to databases. If you work in an office environment you have most likely encountered it when working on a shared Excel spreadsheet and you get the annoying "file is locked by another user" message. Locks are a critical way of implementing the ACID (Atomicity, Consistency, Isolation, Durability) properties of an RDBMS.
The most common type of lock is the data-level or transactional lock. It usually happens completely invisibly to the user. When a user updates (but not necessarily commits) a particular data item, it is then locked down to prevent other users from making any other changes to it. However, the other user can still read the same data item. This is a neat feature of RDBMSs. At this time, the other users will only see the "old" value of the data item that existed before the changes. Once the main user commits (saves) the change, all users who now query the data will see the new, updated value. Keep in mind that the term "new value" could even include a deletion of the data that was there, in which case the field is now a null. If the user instead abandons the change and performs a rollback, the other users will still see the very same data, but they can now also update the data item.
If a user tries to change an item that is already locked down by a first user, the second user will either receive an error message or the system will just hang for a short time while waiting for the first user to either commit (save) the change or roll it back (undo it).
A curious situation may arise when two users have each locked a data item, then both attempt to access and make a change to the exact data item held by the other user. In this case, there is the potential to lock the item indefinitely while each user waits for the other. This is called a deadlock, and most RDBMSs have a way of detecting and resolving them. The system will usually choose one user at random and roll back his or her changes, thus freeing up the data locked by that user and ending the deadlock.
You can surmise that locking is an expensive activity in terms of database and computer resources. Systems that host write-intensive databases need to be robust enough to support the resources expended by locking, especially the CPU. Applications also need to be well designed so as to lock data as infrequently as possible. One way of doing this is by building data retention capability into the application itself. Once a data item is read from the database, the application should then perform all changes internally without first interacting with the database. Only when the user clicks "Save" does the application quickly write the item back to the database. This minimizes the amount of time that, from the database point of view, a data item will be in a state of limbo of "changed" but not yet "committed" or "rolled back".
Table-level locks, as we have mentioned, are very restrictive because they lock down an entire table. Although they are rarely used, they are invaluable in some operations such as bulk uploads of data or when reorganizing fragmented data for an entire table. In such cases, it is convenient to ensure that no other users can access the table for the duration of the operation, and the lock should be released immediately once the operation is over. As such, these locks are typically only employed by DBAs during these special operations. Table-level locks usually have to be explicitly specified; by default the RDBMS employs row-level or transactional-level locks.
Commercial RDBMS Systems
We have now covered most of the basics of relational databases. They are the most common database type out there, in addition to being one of the most important types of software, right up there with operating systems, office productivity and games. So it will come as no surprise to hear that the large, successful RDBMS vendors are some of the titans of the software industry, and are also household names not only in the software world, but even in mainstream news and culture. Names like Oracle and Microsoft are ubiquitous and very familiar, even to non-IT types.
Many RDBMS vendors also churn out both related and completely unrelated products. But a common thread for all of them is that the RDBMS is one of their most crucial product lines. They listen closely and work to gather feedback from the marketplace, which does not always happen in the software universe.
However, for strategic business reasons, many of their products do not work well with competitors’ offerings, or with other software. For instance SQL Server from Microsoft is only available for the Windows operating system, which is also from Microsoft. There have been complaints that Oracle DB does not mesh as well with the Windows operating system as it does with Linux, and so on.
An increasing trend in the industry is consolidation and bundling of the RDBMS with other software from the same manufacturer, such as a preferred operating system or other complementary software such as:
- Additional data security modules, such as Oracle DataGuard,
- Integration with a development platform, such as Microsoft’s .NET,
- All-in-one platforms combining hardware and software, such as Oracle’s Exadata or Microsoft’s Trefis.
Oracle is one of the behemoths of the RDBMS world. Founded by the charismatic, adventure-loving CEO Larry Ellison in 1977, today the company is a multibillion dollar giant in the world of commercial databases thanks to its flagship product, Oracle DB. Oracle also produces a bewildering array of other products, from middleware to enterprise resource planning systems and customer relationship management (CRM) offerings. Many of these were not developed in-house, but instead came through acquisitions of other software companies such as PeopleSoft, Siebel Systems, Sun Microsystems and BEA Systems. Several of these products were integrated, with mixed success, into the Fusion middleware product.
The Oracle DB is widely used in enterprise-level databases. It comes in different editions to meet different needs. Oracle DB is fully compliant with the SQL language, although it also maintains its proprietary version called SQL*Plus. Oracle DB is the leading RDBMS, with a market share of 48.8 percent of the RDBMS market as at end of 2011.
An interesting point to note about Oracle is that in 2009, it acquired Sun Microsystems, the license holder of MySQL, one of Oracle DB’s key competitors. As a result, Oracle has two RDBMS offerings. However, Oracle DB and MySQL may not interfere destructively with each other, as they play in slightly different market spaces and cater to slightly different needs.
Microsoft is another big boy in the world of RDBMS software with its SQL Server product, although it is better known for its universal Windows operating system and Office suite of office-productivity programs.
SQL Server was developed in conjunction with Sybase systems around 1989, but the two companies parted ways and developed separate products. Microsoft kept the SQL Server name and Sybase opted to rename its offering Adaptive Server Enterprise to avoid confusion with Microsoft’s SQL Server. The RDBMS only runs on the Windows range of operating systems.
SQL Server uses a proprietary query language called T-SQL, which is very similar to and compatible with the standard SQL. The RDBMS commands about 20 percent of market share as of the end of 2011, but has also been increasing its share in recent years.
SQL Server and Oracle DB have a lot in common, ranging from the data structures to transaction processing methods and database objects. Like Oracle DB, SQL Server also supports advanced ETL (Extraction, Transformation, Loading) operations, which help in moving data to data warehouses. Both also offer advanced reporting functionality.
PostgresPostgres, also known as PostgreSQL, is an open-source relational database that can also support database objects. It is not owned by any one person, but is maintained by the PostgreSQL Global Development Group, a dedicated group of volunteers managed and employed by companies in the open-source software-development field, such as RedHat and EnterpriseDB.
Postgres is available on Linux, Windows and MacOS.
MySQLMySQL is another open-source RDBMS. It is a full-featured database system sponsored by Swedish company MySQL AB, which is now owned by Oracle after its parent company Sun Microsystems was bought out by Oracle in 2010.
MySQL is very popular for Web-based, back-end databases, either individually or as part of the Linux, Apache, MySQL, PHP (LAMP) stack used to deliver Web-centric applications.
DB2Like SQL Server and Oracle DB, DB2 from IBM is a full-featured object RDBMS from a major player in the software industry. Originally developed in the early '80s exclusively for IBM’s mainframes, it was later ported to other platforms, such as Linux, Unix, Windows (LUW) and IBM’s own OS/2.
It is a widely used commercial RDBMS, and also has a small free version for developers called Express-C. In 2009, IBM released version 9.7 of DB2, which closely mimics the features of Oracle DB, the market leader. This has helped it capture some sales by making it easy for Oracle-savvy database professionals to easily understand and start working on DB2.
ConclusionCongrats - you've made it this far!
We've literally just scratched the surface. After all, many people devote an entire professional career to databases; there is a lot to learn! Let's recap what we've covered here:
- A database, in the most general sense, is an organized collection of data. More specifically, a database is an electronic system that allows data to be easily accessed, manipulated and updated.
- Any business or organization that needs to keep track of large numbers of customers or products can benefit from a database, but large organizations stand to gain the most.
- The earliest database systems were navigational in nature. This means that applications processed and read data by using pointers embedded in the data itself.
- The early hierarchical and network data models were bested by E.F. Codd's relational model.
- The relational model was a radical departure from the reigning hierarchical model in that it focused on the ability to search a database by content rather than by following a linked navigation system.
- There are a few very important non-relational databases (especially with the advent of big data and Web 2.0), but the relational model is still used for the overwhelming majority of commercial database offerings.
- A relational database is essentially a group of tables or, to use the technical name, entities (refer to rules 0 and 1 in Codd’s 12 Rules of Relational Databases). Each table is made up of rows (tuples) and columns (attributes). The tables have relationships between them that are defined as using a certain column in one table that references a column in another table.
- There are 13 "rules" that determine whether a database can be called "relational".
- The table is the basic data-storage unit in a relational database. Tables consist of columns and rows.
- Relationships are THE reason why relational databases work so well.
- In relational databases, a relationship exists between two tables when one of them has a foreign key that references the primary key of the other table.
- A row, also called a record, represents a set of data about a specific item. Every record in a table has exactly the same structure, but of course different data.
- A column is a specific set of values in a table of the same type. It defines a specific attribute of the table or data.
- A primary key is a special column or combination of columns that uniquely identifies each record (row) in the table. The primary key column must be unique for each row, and must not contain any nulls (non-values).
- The primary key, together with the closely related foreign key concept, are the main way in which relationships are defined. A primary key uniquely defines a record, while a foreign key is used to reference the same record from another table.
- Structured Query Language (SQL) is the de facto language used for the management and manipulation of data in relational databases. SQL can be used to query, insert, update and modify data.
- Spreadsheets and databases have some similar capabilities, but the spreadsheet has a number of limitations that make it unsuitable for managing some data situations.
- One of the most severe limitations of relational databases is that each item can only contain one attribute. Non-relational databases, specifically a database’s key-value stores or key-value pairs, are radically different from this model. Key-value pairs allow you to store several related items in one "row" of data in the same table.
- A data warehouse is a special type of database optimized for querying, reporting and analysis. The main benefit of reporting using data warehouses, as opposed to the organization’s transactional databases, is that warehouses allow much better and more fine-grained data analysis for business consumption.
- An index in an RDBMS is a data structure that works closely with tables and columns to speed up data retrieval operations.
- A schema is the structure behind data organization. It is a visual overview of how different tables are related to each other.
- Normalization is the process of (re)organizing data in a database so that it meets two basic requirements: there is no data redundancy (all data is stored in only one place), and data dependencies are logical (all related data items are stored together).
- In the RDBMS world, constraint refers to the exact same thing as in the real world. A constraint is a restriction on the type of data you can input into a certain column.
- In relational databases, saving a transaction is known as a commit, and undoing any unsaved changes is known as a rollback.
- ACID is an acronym for Atomicity Consistency Isolation Durability, the four highly desirable properties of an RDBMS.
- Oracle is one of the behemoths of the RDBMS world. Its flagship product is Oracle DB.
- Microsoft is another big boy in the world of RDBMS software with its SQL Server product, although it is better known for its universal Windows operating system and Office suite of office-productivity programs.
- Other commercial RDBMS systems include Postgres, MySQL and DB2.