Other Important Database Concepts
We have discussed the fundamental database concepts, but these are not the only ones. There are also other important secondary concepts and data structures worth learning about. In this section, we’ll take a brief look at these.
An index in an RDBMS is a data structure that works closely with tables and columns to speed up data retrieval operations. It works a lot like the index at the beginning of a book. In other words, it provides a reference point that allows you to quickly find and access the data you want without having to traverse the entire book (database).
A schema is the structure behind data organization. It is a visual overview of how different tables are related to each other. This serves to map out and implement the underlying business rules for which the database is created.
The Oracle DB has a somewhat different definition of schema. Here, schema refers to a user’s collection of database objects. The schema name and username are the same but function quite distinctly; i.e., a user may be deleted while his collection of objects (schema) within the database remains intact, and can even be reassigned to another user.
See a visual example of a simple database schema below:
Normalization is the process of (re)organizing data in a database so that it meets two basic requirements: there is no data redundancy (all data is stored in only one place), and data dependencies are logical (all related data items are stored together). For instance, for a bank’s database all customer static data, such as name, address and age, should be stored together. All account information, such as account holder, account type, account branch and so on, should also be stored together; it should also be stored separately from the customer static data.
Normalization is important for many reasons, but chiefly because it enables databases to take up as little disk space as possible, resulting in increased performance. There are several incremental types of normalization, and they can get somewhat complex.
In the RDBMS world, constraint refers to the exact same thing as in the real world. A constraint is a restriction on the type of data you can input into a certain column. Constraints are always defined on columns. A common constraint is the not-null constraint. It simply specifies that all rows in a table must have a value in the column defined as not null.
Transactions (Commits and Rollbacks)
Think of a bank building out its database systems. Imagine if it crashed right as a wire transfer was in progress. Big problems, right? This is the basic idea behind a transaction: All items in a series of changes need to be made together. In the case of a simple transfer, if you debit one account, you need to credit anther account.
In relational databases, saving a transaction is known as a commit, and undoing any unsaved changes is known as a rollback. That is the basic definition, but it gets more complicated when you consider that databases typically have to serve several users simultaneously. Before the transactions is saved, what happens when other users query the same data? At what point do the other users see the saved data? All RDBMSs must be capable of satisfactorily answering these questions, and they do this through the commit/rollback features.
Databases must also provide fault tolerance, even in case of disk failure. When data is committed, there must be a good-as-gold guarantee that the data is actually saved. Relational databases have ingenious ways of achieving this, such as two-phase commits and use of log files.
The term ACID here does not refer to psychedelic trip-inducing substances that help DBAs work better. Rather, it is an acronym describing four highly desirable properties of any RDBMS:
- Atomicity: This refers to a database’s ability to either fully process or fully roll back a transaction.
- Consistency: The database should ensure that all data written therein follows all rules and constraints specified in the database.
- Isolation: Transactions must be processed securely and independently, without interfering with each other.
- Durability: The database must ensure that all committed transactions are saved permanently and cannot be accidentally erased, even in a database crash.
As we have seen before, databases allow multiple users to simultaneous access a set of data. So an inevitable issue that arises is how to handle situations in which two or more users want to access or update the same piece of data. RDBMSs solve this issue by the use of locks.
There are different types of locks, such as transactional or data-level locks that simply lock down a single data field; row-level locks that lock down an entire record of data, while table-level locks restrict access to a whole table. You can see that the last lock type is especially restrictive. As such, such it is employed only when absolutely necessary, such as when performing table-wide data loads or table-wide backups.
By the way, spreadsheets can only employ table-level locks. This is a prime example of their inferiority when compared to databases. If you work in an office environment you have most likely encountered it when working on a shared Excel spreadsheet and you get the annoying "file is locked by another user" message. Locks are a critical way of implementing the ACID (Atomicity, Consistency, Isolation, Durability) properties of an RDBMS.
The most common type of lock is the data-level or transactional lock. It usually happens completely invisibly to the user. When a user updates (but not necessarily commits) a particular data item, it is then locked down to prevent other users from making any other changes to it. However, the other user can still read the same data item. This is a neat feature of RDBMSs. At this time, the other users will only see the "old" value of the data item that existed before the changes. Once the main user commits (saves) the change, all users who now query the data will see the new, updated value. Keep in mind that the term "new value" could even include a deletion of the data that was there, in which case the field is now a null. If the user instead abandons the change and performs a rollback, the other users will still see the very same data, but they can now also update the data item.
If a user tries to change an item that is already locked down by a first user, the second user will either receive an error message or the system will just hang for a short time while waiting for the first user to either commit (save) the change or roll it back (undo it).
A curious situation may arise when two users have each locked a data item, then both attempt to access and make a change to the exact data item held by the other user. In this case, there is the potential to lock the item indefinitely while each user waits for the other. This is called a deadlock, and most RDBMSs have a way of detecting and resolving them. The system will usually choose one user at random and roll back his or her changes, thus freeing up the data locked by that user and ending the deadlock.
You can surmise that locking is an expensive activity in terms of database and computer resources. Systems that host write-intensive databases need to be robust enough to support the resources expended by locking, especially the CPU. Applications also need to be well designed so as to lock data as infrequently as possible. One way of doing this is by building data retention capability into the application itself. Once a data item is read from the database, the application should then perform all changes internally without first interacting with the database. Only when the user clicks "Save" does the application quickly write the item back to the database. This minimizes the amount of time that, from the database point of view, a data item will be in a state of limbo of "changed" but not yet "committed" or "rolled back".
Table-level locks, as we have mentioned, are very restrictive because they lock down an entire table. Although they are rarely used, they are invaluable in some operations such as bulk uploads of data or when reorganizing fragmented data for an entire table. In such cases, it is convenient to ensure that no other users can access the table for the duration of the operation, and the lock should be released immediately once the operation is over. As such, these locks are typically only employed by DBAs during these special operations. Table-level locks usually have to be explicitly specified; by default the RDBMS employs row-level or transactional-level locks.