Eric Kavanagh: Ladies and gentlemen, hello, and welcome back once again. It's a Wednesday, at four o'clock Eastern, and those of you who know the program, know what that means, it's time for another episode of Hot Technologies. Yes, indeed. My name is Eric Kavanagh, I will be your moderator for today's session: "Index Insanity: How to Avoid Database Chaos." Or as I referred to it in the last email blast to go out, “database wrangling.” Hot term these days, “wrangling.” Everybody does it. There’s a slide about yours truly. And enough about me.
So, the Hot Technology series really was designed to define a particular space, as opposed to the Briefing Room which is just one-on-one live analyst briefing, for Hot Tech we get two analysts. Today, it's going to be our very own Doctor Robin Bloor and our data scientist Dez Blanchfield. And we're talking about a topic which I think is really quite emblematic of what's happening in the marketplace today.
The bottom line is that we're in a world of complexity these days. Really, if you think back fifteen years, or twenty years, it was a vastly different world back then, especially with respect to database technology. Databases used to be fairly simple. There were only a handful of them; most of them were relational. Now, we have this whole panoply of database technologies. Literally scores of options on the table for anyone who wants to build an application or do something with data. Everything is changing and that affects the people who try to manage these systems. We're going to talk today with Bert Scalzo, who is a real expert in the field; he's the senior product management for IDERA, about what you can do to get a handle on all that data. With that, I'm going to hand it off to Doctor Robin Bloor to take it away. Robin, the floor is yours.
Robin Bloor: Okay, thanks for that introduction. I think it – because it's a two-handed thing, I think that I would just talk about database optimization in general as an introduction to this Hot Tech show. I started life – in technology and analysis – I started life doing this because I used to write articles on the capabilities of databases on the DEC VAX platform. And for that reason, database spenders used to brief me. And the thing that kind of occurs to me is that, why would you have a database? I mean, in those days an awful lot of people used to create key value files and use those to have a kind of index sequential fallacy as we call them, but to create a kind of database capability, and you know, why would you have anything else?
And the answer to that, I think Michael Stonebraker gave the best answer to that, and he said, "A database can know more about where the data is and how fast to get it, than any program can ever know." And I think that’s interesting; it is the nature of the game. But in the 19 – well about 1989 that I started in technology analysis and you know, in that point in time, databases were very simple and relational databases were super simple. They had so little capability, I mean, they could store data, obviously, and you could back up and they had, they were ACID compliant, but they really had very weak optimizers. In fact, it would be hard to argue that they had the optimizer capability at all.
And later they just got better and better, but, you know, when a database doesn't function – as these kangaroos seem to be in one way or another indicating – there can be an awful lot of reasons why it's going slow. And that brings me to the point: Databases have many functions, but the most important one is query optimization. If they didn't do that, you wouldn't use them. It’s about getting information quickly, it’s about being able to do it when there’s a lot of concurrent users, and that's a hard problem. And when you actually look at the, let's call them mature databases, if you like – but certainly Oracle, to a slightly lesser extent, Microsoft SQL Server, certainly Teradata and DB2 – the optimizers of those databases have got, have been decades in the building. You know, they didn't – somebody didn’t sit down on – six guys on a two-man, year, project and just knock one together. It doesn't work like that. The optimization capability has gradually grown, and it takes a lot of growing. Anyway, let's talk about the background to the database. Well, there's an awful lot that's said about NoSQL database now, and there's even a lot of enthusiasm for graph database. And the use of SQL over Hadoop and things like that. But, the truth of the matter is that if you want a database right now, if you want a fully functional, capable of OLTP and large query traffic, it's a relational database, or it's nothing.
Amongst relational databases, Oracle is dominant in popularity. Microsoft SQL Server, I think, is second. They're both capable of being used for OLTP and query workload, but actually you really can't get away with mixing those workloads. You need different incidents for OLTP workloads and query workloads. There are alternatives to SQL and graph. Most companies standardize on one specific database, which is why – I mean after decades of fighting it out with all the other players, Oracle became the most dominant one. Simply because they ended up being able to sell corporate licenses, and so companies would only use alternative products in exceptional products Oracle simply wouldn't do them. And databases are strategic in that also they evolve. And you know I did a little bit of research for this presentation, and it's kind of – I’ll come to it in a while, but it's kind of interesting how they evolve, in terms of looking at it from a DBA’s position. This is what I call the invisible trend. It's Moore’s law cubed. It's roughly like this: The biggest database is, and new databases, there isn't an old database that got a lot more data to ingest. It's normally a database that's being applied to a new problem. And they actually grow in terms of data volumes. Roughly at the cube of Moore's law. So Moore's law is a factor of ten times every six years. VLDBs tend to grow a factor of one thousand every six years. In 1991, 1992, the big databases are measured in terms of megabytes. In ‘97 and ’98, gigabytes. 2003, ‘4, terabytes. 2009, ‘10, you started to see petabyte databases. I think there was possibly one or two exabyte databases out there right now, but the largest I've heard of is the 200 petabytes on time, and you know, not getting data to a petabyte databases. But, it's most of that will obviously be the new big web 2.0 companies, possibly, you've got Facebook heading in that direction.
But anyway, if you actually look at that, expecting a database to go through that kind of escalation in volume, it's asking a lot. And remarkably, certainly up to the petabyte level, they seem to have done reasonably well. I mean, I'm talking about the older products rather than anything new. They seem to have done extraordinarily well. If we look at database performance, bottlenecks, this takes me back to the time I actually used to care about them, and had to worry about them. You know this is fundamentally the breakdown of the hardware. There are CPU bottlenecks, possibly, there are memory bottlenecks, possibly, there are disk bottlenecks, possibly. It can be the network that causes you grief, and you can also get problems with locking, depending on what you're doing, but normally that's because the program doesn't know who to call lock. So, if you're going to tune a database, you're actually trying to tune it so that it dances between these five possible bottlenecks as well as it can do. And that's no easy matter, because the amount of memory that you could configure on any given server is increased dramatically. Then CPUs have become multicore, disk, well we can now do, I think, even on commodity servers, I think you can do hundreds and hundreds of terabytes, quarter of petabyte, maybe, even on a commodity server. So, of all of these things, you can play with, network of course can go at different speeds, but mostly when you're dealing with databases, you really want to have fiber cables between the servers and nothing else running on that, particular that way.
Database performance factors. I mean, I'm leaving out what this is all going to be about, because I know Dez is going to talk about it, but bad database design means a poorly performing database. Bad programming design can possibly mean throwing very stupid SQL at a database, which will just take an awful lot longer. Concurrency and workload mixing, too much concurrency will cause bottlenecking problems. The workload mixing, when you've got large queries with very small, short, sharp queries, that causes problems. There's a load balancing issue. Most databases take care of that, but if you haven't got a sophisticated product, then you know, just adding a few servers, isn't all you do if you actually want to increase the size of a cluster. You actually have to balance load before you get the optimum performance. You need to do capacity planning. Absolutely. Especially now in these days as when data volumes increases more dramatically than they used to for databases. And there are whole data layer issues to how you ingest the data, how you move data about. Not getting data to a database on time can be a performance issue later on because we've gone from databases working in Windows, to twenty-four by seven by three hundred and seventy-five operation and there are no windows where you can slow the database down or it's unlikely that there will be nowadays.
The Oracle DBA problem. This is what I was thinking about. I've been in Oracle’s DBA with Oracle 7, and I remember how to tune that. And if you actually look at Oracle now, it's way, way – it's got way, way more capability. It's got bitmap indexing and things like that, but I actually took the time to look and see how many tuning parameters there actually are in an Oracle database at the moment. And there are over three hundred and fifty tuning parameters and there's a further one hundred hidden parameters, which specialist DBAs might know about, but normal Oracle DBAs don't know about. And that means that tuning this kind of database is a tough thing. It's not a simple thing at all. You've got to have a feel for it, you've got to have been doing it for a long, long time, and you've got to know exactly what the problem you think you are solving, because the tuning starts when the performance gets poor, but it might not be the performance of everything. It might be the performance of specific queries that matter, and you might be able to fix that by pinning certain data and memory, or you may need to fix it by indexing, or you may need to start doing partitioning in a different way. There's a lot of things you can do, is the point. So, consequently, they're not going to do it in their heads – DBAs need tools. I shall now pass on to Dez who's going to tell you about indexing, I think.
Eric Kavanagh: Alright Dez, take it away.
Dez Blanchfield: Thank you, Robin, and I love the cover page. I think you've thrown the gauntlet down there for me to come even come remotely close to something that exciting. But I've used an image of our little galaxy, as my view of what today's challenge for database administrators has turned into, because this is the mental image that I tend to conjure up when I get into an environment and I'm no longer in the world of administering databases or designing databases at that level anymore. But, like yourself, Robin and I have had many years of being involved in the world of databases, either as administrator or developer, or eventually architect, and then realized that I could do better things to earn a crust. But it does tend to feel like you're staring at this galaxy of data, and more so today, when we go from, as you outlined, we've gone from megabytes to petabytes and exo-scale in a very short period of time, in the grand scheme of things. But the phrase that I have in my mind is, that database indexes are now a black art and they're not really the sort of stuff that mere mortals should sort of dabble in, for enterprise-grade business applications and the type of formulating you were just talking about. But, I wanted to go through a quick rundown of the type of history that I've had with database worlds and bring to context to where we're going to draw a conclusion to, and then run through some material today with our friends at IDERA, because I think there's a lot of different thinking about how to get database performance tuning and one of them is throwing tin at the thing. For a lot of shops that I come across, they invariably don't get to the point of doing performance tuning at the database layer and particularly the index layer until they've got through the hard route of thinking they can throw a tuner at it.
A lot of people just take a big iron approach to it, in my mind, and I’ve got a picture of The Flash here because if you've ever watched any old movies or certainly the latest TV show with The Flash, as in Flash Gordon the old character, and now that he's called “The Flash,” he tends to go very, very fast and invariably his energy runs out. And this is what happens when you throw big iron at database performance. Invariably, in my experience, you can put high performance, hard work in the game, you can optimize your operating systems and tune them to a certain point. You can ensure that you've got fast multicore, multithreading CPUs to make the application run faster, you can throw lots of RAM at it, you can have high-throughput backplanes, you can go from hard drives to caching hard drives to solid state, and high-performance storage array. And even now, people throw in things like flash and NVMe at their database engines, thinking that they're going to get this login times two performance gain. And invariably they do get some gain. But, it all comes back to the same basic performance tuning problems. Lots of low-latency networking connections, so that the clusters work fast. And of clustering database infrastructure, so you've got more than just one machine doing all the work. But you do tend to come back to the same basic performance problem, and that is reading data. Writing data, is for the most part, a fairly linear challenge and unless it's done properly.
And then we have the challenge in today's world: Not all databases are created equal. There's databases and quote-on-quote “database.” And when we think about database engines, people often think about the traditional, usual suspects as they were in the SQL world. You know, we've got Oracle, and Microsoft SQL Server, and there's a couple around it in the open source world with MySQL, which is now owned by Oracle, but it’s still open source. And then we've got the not-so-usual suspects, the NoSQL engines, which still have an issue around indexing and performance management, and I won't go into them in a lot of detail, but there's an increasing number of these things popping up every day and they look and feel like database engines from the developers’ point of view and from a performance point of view, but they are very, very different beasts and they have their own little niche in the world to carve out either in-memory performance or linear scale on disk. But this is what the world looks like in the database world. This is the 2016, this is the version three of the map of, by a range of people who produce this ongoing landscape map of what databases look like, and this is where it – not even a superhuman database architect or database administrator could make sense of it. Literally hundreds, and hundreds, and hundreds of different makes, models, manufacturers of databases, invariably SQL compliant. And the interesting thing is, they all come back to the same challenge. Performance and performance tuning around the database engine, and particularly by how data is indexed.
So let's just quickly cover database indexing, because it's an interesting topic, and you have to get into it in more detail with the demo, I believe. But, I think it's fairly well accepted and standard industry practice that database index performance tuning is where the world starts and ends as far as ensuring your data is accessible on a fast and speedy format. But what is database indexing? If we think about indexing in the form that we're used to as everyday humans, think of an index page in a book. If you want to find something in a book – particularly the likes of an encyclopedia, or something like a reference material of some form – if you're looking for something like this page, where I’m looking for things like the topic of dams in an encyclopedia. I want to find every reference to dams, the catchment of water and a large buildup area, man-made generally. I'll go to the back, I’ll find it in an alphabetized, sorted list, A to Z, left to right, and I'll find D. I'll find the word “dams” and I can see that on pages 16, 38, 41 there's a reference to them, and then I can go to those pages, I can scan down my eyes and I'll find the reference to the word “dam.” It’s essentially the same concept in a database, but it’s now a rocket science in many ways. So much so, that effectively every database administrator I have ever come to know well, considers indexes to be the single most critical tool for performance tuning in any database world, regardless of what their experience might be as far as throwing tin at it, or whatever the case may be.
Generally when we talk about database indexing, there are a number of common approaches. And the more complex database indexes become, the more complex the approach to indexing data. But essentially when you think about indexing data – imagine that we have a file that's got a list of names; they may not be sorted in alphabetical order. Let's imagine there's twenty of them. If we're going to sort – if we're going to search for data in that list, from top to bottom, and let's say it's a list of names. If I choose a random name and I start to scroll down that list, from top to bottom, in a linear format and it's an unordered list, there are two criteria that I think about as my average search time and my maximum search time – and I've got a typo in the second line, should be “maximum search time,” sorry – but my average search time is essentially N plus one, divided by two, and that is on average, it takes me fifty percent of the time to scan from the top of the list, to the bottom of the list to find any random thing in that list. And the second line there, under linear, should be “maximum search time.” But the maximum search time is essentially the number of items, and that is that if I have a list of twenty things, that the most time it can take me to search for something in that database is to go from the top to the bottom, which is let’s say 20 items in this simplified example. And it’s a very slow process and there's really no way to performance tune that. And then, there's other types of ways of taking that data and creating an index, which is effectively a short list of pointers to where the actual data is, such as binary, B-tree, bitmap, hashing, clustered and non-clustered, and then there are different types of data such as spatial, filtered, XML and full text.
Binary is a very common-used one for things where the data lends itself to it. B-tree is probably the single most common in a general sense, historically, in that it's a common way to structure an index to any form of data and allows loggers, selections, and insertions and deletions are relatively easy as you move pointers around the reference to the pointers, the points. There are other types, like bitmap, where data types concern like if we've got an associated range of some form. Hashing works very well for large objects, particularly blogs and images. And you can see that there's a number of different types of scientific approaches, mathematical approaches, to indexing data. For the mere mortal, they’re an interesting challenge to talk about at this level. When you talk about it at performance level for a database administrator, they really do become a rocket scientist and people do degrees in them, and I know that Doctor Robin Bloor has certainly done that, and written books on it for the likes of IBM and other large brands over the last couple of decades. And so, the – my view, is that we've actually passed a time where, you know once upon a time I would personally be able to sit in front of a system and I would be able to pull it apart, and show you exactly where the performance issues were at a command line or at a graphic user interface start tool, and start to delve into the data and tell you where the issues were, and build indexes, or sub-indexes, or primary and secondary indexes into that data and start to use it to find things. But when you think about that landscape I showed you, where we've got hundreds and hundreds of brands, makes and models, and manufacturers and types of databases, we're well and truly past that time now, where a human being can make sense of the types of database engines we've got. Particularly, even if we just come back to the likes of Oracle, predominant brands these days in relational database platforms.
The number of databases they have to deal with either from a proprietary platform like an ERP or HR or finance system, or whether they’re a home-baked platform for various reasons, the number of databases and database tables and records that we end up dealing with are just astronomical and you physically can't do it by hand. And we've had an additional complication now, where once upon a time, a database server might just sit under your desk. You know, as a young kid after school, I used to go and work on database software on the, originally, Apple IIes and then DOS PC-based systems, like dBase II, dBase III, went through an era with mainframes and mid-range and even VAXs and PDPs and log file on that. And the like of Sabre, and then eventually when some of the SQL databases came along. But these days when we're thinking about database engines, they look like the bottom left-hand corner. A database server isn’t just one machine sitting on the floor under a desk anymore; it's hundreds of machines running copies of database engines, and clusters, and they do scale up to hundreds and hundreds of terabytes of data, if not petabytes of data, which is thousands of terabytes. And even to the extreme, as Doctor Robin Bloor mentioned, that some specific use cases – airlines, government agencies in particular – can get to exabytes. They're still fairly niche-y, but hundreds of terabytes and even dozens of petabytes is not unusual anymore, particularly from the dotcom boom to now, sort of what we're calling web 2.0 companies, the likes of Facebook, Google, Yahoo and so forth.
We also have the complication now that things are moving to external service. We've got infrastructure platform and software as a service approach providing infrastructure. And particularly platform service where we cannot just buy for the likes of Oracle and their cloud platform, databases and servers. And so this allows us to do very rapid development of application and just plug a database back into the servers. We don't have to think about what's under the hood. The downside, is that we often don't think about how we design and implement the database back in until it starts hurting and performance becomes an issue and then we end up having to look for the right tool to diagnose why our database is hurting and where the performance issues are. And invariably it brings it back to that common problem of how we've indexed that data and the types of indexes we've used for that data and that then brings us back to superhuman performance requirement. And someone who has access to the right systems and the right tools to performance tune those engines, and start to find a hot spot and look at where the queries are, where the data's moving, the types of queries, how the queries are structured, who’s doing the queries, and whether the queries are being queued, and having to be cached. What replication do you look for?
And so we're well and truly – in my view – at a point now where even the world's best database gurus, essentially our database architects and our database administrator and performance bases, in my view they very much need to start leveraging the right tools to deliver optimal performance index tuning for any database engine. Because the scale that we're dealing with and the speed that things are moving at, we simply can't do it by hand, and attempting to do that invariably can introduce other performance issues, because we may not have experience in that space that we're trying to solve a problem in. And I believe that that's where we're about to hand to Bert, and we're about to talk about how they've solved this varied problem and the type of things that their tool can do, particularly for the Oracle world. And with that there, Bert, I'm going to pass over to you.
Bert Scalzo: Thank you. Welcome everybody, my name is Bert Scalzo, I work for IDERA. I'm the senior product manager for some of our database products. I will be demonstrating some of those today. But I want to talk about indexes, because I agree with everything that everyone has said here, especially the last slide, that indexes are so complex now that you need a tool, and I hope to convince you. So Oracle index design, it's not as easy as it used to be in the old days. A lot of people will be unsure of themselves when they look at the options, and I like this saying that I pulled out from history, “in these matters, the only certainty, is that nothing is certain.” And that's how I kind of feel about indexes these days, because even if you think you know the answer of you should index X, Y or Z, you really can't be certain until you try it, because those optimizers sometimes behave differently to the way you expect. And so there's a lot of trial and error with index design. Now, in the good old days, if you needed an index there generally was just two questions, or one question. Was it unique or was it not unique? And you might have thought of other things like, “How many indexes can I have maximum on a single table?” because too many indexes slows down your inserts, updates and deletes. You also might have been in your database system, had restrictions on how many columns could be in a multi-column index, because sometimes there were limits based on the page or block size of your database engine, but in reality it was pretty simple back in the good old days. You either indexed it or you didn't. And really, everything was in a B-tree. We could allow the duplicates or not, and that was about it. Life was good, life was simple.
Well today, life isn't so good or so simple. I've put the red Ghostbuster sign through the way we used to do it, because now we have B-tree versus bitmap, versus bitmap join. And I'm going to explain what some of these are in a moment. Clustered and non-clustered, unique or duplicates, forward or reverse order, function-based, partitioned or not partitioned. If there is partitioning involved, is it global or local partitioning? I'll explain that as well. And then also there's something called an indexed organized table. And there's actually half a dozen others that I've left off of here, because I think I've got enough here now that should convince you that indexes are a lot tougher than you might have thought. In this particular slide, I’m going to start in the top-left portion of the diagram and I've got a table. And the first thing I have to decide is, depending on your database version and your database vendor, do they allow object tables or are they only relational? I'm going to go down the right-hand side and say that we're building a relational table. Now, the next question I have to ask myself is, is it in a cluster? And a lot of you who have done Oracle for some time will remember that clusters were back for the Oracle 6 days. They're probably not very heavily used anymore today, but let me go down that branch first.
If I was going to put my table in a cluster, I would have to have a clustered index on that table. Now, in Oracle, when you clustered a table, you were basically storing the rows or the rows were close to each other where the values were similar. And so, you have to have a clustered index and that clustered index could be non-partitioned. In other words, there weren't really any partitioning methods for how you would do a clustered table. It was strictly non-partitioned. And because it was non-partitioned, it was global. I will explain what global is in a minute. And it was always B-tree. In other words, when I went down that branch, it was pretty simple, I didn't have many choices. Now, if I did a non-clustered index on a clustered table, which was allowed in some versions, again it was non-partitioned; when it's not partitioned, then your only choice is global. And so, there you have the choice of B-tree or bitmap. Again, it depended on your version of the database. But now, let's go back up to the relational table and start going down the right-hand side again and now we're just going to have a plain, old, regular, heaped table: relational. It's going to be in a table space. I'm kind of going down the right-hand side here first. So it's organization, heap. The next question I have to ask myself is, “Do I want to partition this table or don't I?” Now, sometimes you would partition because you thought, “Hey, the optimizer will be smarter about how it can optimize queries.” But a lot of DBAs will tell you that the reason you do that is for administrative purposes. If you have a hundred-billion-row table, if you break it up into partitions or buckets, when you want to add data to the last bucket, you can drop and index that’s only a few million rows. You can insert that data and then you can rebuild that index on just that bucket.
While it was a good technique for some, optimization techniques like partition elimination, its real value was being able to administer or do administrative tasks on smaller pieces. When I go to the organizational heap, the first question was, “Did I partition it or not?” Let's go to the left, I’m not going to partition the table. Now, it may seem odd when I tell you this, but you could have a non-partitioned table and then you cannot partition the index like you're accustomed to, or you can partition the index. Stop and think. Your table has basically one bucket, like you've always thought, and yet your index is going to have multiple buckets. When that happens, where there's a mismatch between the number of buckets and the table, and the number of buckets in the index, that's what's meant by global. And so, if the table's not partitioned, and if the index is partitioned, it's considered global, because there's a mismatch. Now, let me go back up on my organization heap, and come down instead on the partition side. Now, if I have a partition table, and let's say the table has four buckets, four partitions, my index could have four buckets so that my index matches my table design. And so that's over, way over, on the right-hand side. That would be considered local. A local index means basically that the partitioning of the table and the index is done the same way and has the same number of buckets. And then once I have the local index, it could be a B-tree or a bitmap, and that green arrow that kind of goes up, shows you that even if it's a B-tree, there are still choices that could be made. It could be function-based. And also, if it's a bitmap, there are different types of bitmaps. There's something called a bitmap join index. If you're doing data warehousing, that's a very popular kind of index for star schema or design. What happens is that index has the row IDs for what it points to in the table, but it'll also have row IDs for the parent tables so that when you're – you've got to star schema design and you're looking at a fact table, that index on the fact table points you to the data that you're interested in, and points you to every row in your dimensions, so that you only have to have one index.
And actually, this came into being because of Red Brick, which was a database many years ago – a lot of people may remember that. And so, if you look at this picture – and keep in mind I didn't put everything in this picture because the picture would be a lot bigger – there are still additional issues, which I have in text here over the top-right portion. Is it a reverse-order index? And you might say, “Why would I want a reverse-order index? That makes no sense whatsoever.” Well if you're in a clustered environment in Oracle, if you're doing real application clusters, if you keep your indexes in order, so non-reversed, if you have a lot of processing that’s hitting the same values or the same index values, what would happen is, you would have hot areas of your B-tree. Meaning that you would have contention and possibly locking to try and access that stuff, and you would be doing that across nodes in a network. Well, if you put in a reverse-order index, now you can undo that. You can say, “Well, the similar values are in different parts of the trees, so I don't have my separate nodes competing for hot areas in the tree.” And then notice too that unique does not work with some of the options. If you look, I've numbered three, five, eight and eleven, so there are some cases where I can't have a unique index. Likewise, there are some cases where I can can't have a reverse index, and then there's additional issues like logging or no logging, and parallel and non-parallel. I can assign things to a specific area in memory.
And this leaves out still quite a bit of features in Oracle. I would say that when you look at Oracle 12, there's probably again about another half a dozen things I could add to this picture. Indexing is really complex and I really agree with the prior speaker, in order to navigate through this and make a good choice, you need a tool. You sort of need, maybe, a picture like this, and some kind of a methodology on how you would pick things and hopefully the tool would help you get there. And then it's going to be trial and error. I always tell people on indexing, “look before you leap.” And then you can see the little dog here, he's jumping without looking, he’s going to end up in water with the shark, or the guy getting ready to jump into the water, and he's going to impale himself. You've got to think about your indexing, because creating an index does not always mean things get better. In fact, creating an index can slow things down. And query performance can be an order of magnitude better with one choice over another. And I'll give you a good example. If you're doing a star schema of design, and on your dimension tables you use bitmap indexes in one case, and in another case you say, “I'll use B-tree indexes,” you've got bitmap versus B-tree. I can tell you that one solution will be an order of magnitude or possibly several orders of magnitude faster than the other. But keep in mind what works in one environment, like in a data warehousing environment, probably is not a good choice in an OLTP environment.
For example, if you were to take a transactional table, and put bitmap indexes on a transactional table, it's expensive to calculate and reset bitmaps, these long strings, and so in an OLTP table, you may hit the table so heavily that the bitmap index can become corrupt and slow your system down because they’re just not meant for updates. They're great for fast access, but are not good for updates. I do think index takes trial and error. There really is no golden rule anymore – there's too many different variables in this equation to know – and ultimately you're going to have to look at execution or explain plans in your database to see whether or not you're making good selections. And sometimes, the plan analysis can almost be a science unto itself. I'm not going to cover that today – that's another topic – but don't take index design for granted. There are legitimate reasons why there are all these crazy index types I showed you, in the prior picture, and that the prior speaker spoke about. These were not just created because it was a neat feature to put on a checklist somewhere for a database vendor; there are use cases or scenarios where these indexes are important and will make a significant difference. Now with that, I'm going to show you some examples of different types of indexes in one of our tools. Let me just get my screen up so you can see it. Okay, so here I'm sitting inside of – let me minimize this application. I'm sitting inside of the VMware and I'm running a Windows Server 2012 VM.
And you can see, I've got just about every tool known to man. As a product manager, I have to stay aware of my competition, so it's not just what tools do I have, but what do my competitors do? And we've got this tool here called DBArtisan, which I've already got running, but I'm going – so I'll just bring it up. And what you can see is this is a really nice tool, because instead of having to use, say an enterprise manager for Oracle and a SQL Management Studio for SQL Server, and the MySQL Workbench for MySQL, and twelve other databases that we support, well I've got all my databases built into this one tool. There's DB2, there's MySQL, Oracle, Postgres, SQL Server and Sybase, and that's – I only have six databases in this particular thing because I can't – the tool supports twelve databases but my poor VM, running six databases concurrently, and trying to do a demo, is about as much as my hardware will facilitate. So let me go back up into Oracle now, and if you notice, all of these things are the same. If I want to measure my performance in DB2, it's the same choices I would have in Oracle. Now under the covers we do lots of different stuff so you don't have to know what's going on, but we give you a consistent interface so you can be an expert with multiple database platforms. And that would include working with indexes, the topic of this discussion.
Let me come in here and let me first start by going to look at some tables, and I've got a movies database which just has a few tables. And if I look a particular table, like the customer table, when I bring it up here, I can see my table design, here's my columns in my table, and here's information about each column. I've got properties for the table, but note that I have a tab here for indexes and I can see here are the indexes on the table. Notice that one of these indexes is my PK index, my primary key. These other ones look to be just indexes for improving query access, maybe we query by first name, or last name, or we look at phones and zip codes. And if I pick a particular index, like this zip code here, and I double click on it, now I can see that, hey, it's a non-unique index and here are some of the other types, bitmap, non-unique, unique, whether or not it's sorted, whether or not that logging, whether or not it's reverse order, whether it's a function base. Oh, here's a fun one I didn't cover. You can actually have invisible indexes. And you'd say, “Well, why the heck would I want to do an invisible index?” Well, I'll give you a good example. You're in your production system and you have a performance problem and you're not sure creating the index will fix the problem, so you don't want to create the index and slow down production, but somehow or the other you want to be able to test it. You can create the index in production as invisible, meaning not many application code, calling the optimizer, will use that index. It's been created, it's valid, but it will not be used. Then you can take a query that you think that this index would help with, or a series of queries, and you can stick a hint in and say, “Hey, optimizer, there's an invisible index out there I want you to use and let me know whether I've made things better.” And now I've tested something in production, but I haven’t broken the applications in production that were running. That's the use for an invisible index. It sounds dumb when you first hear about it, but it has a use.
We can also, on indexes, define whether they're parallel, and also how many instances they're parallel across. Now, in a non-clustered or a non-real application cluster environment, so non-rack, parallel would mean how many sub-processes can my query bring up to try, and worker processes, to try and get thing through faster or quicker. And parallel instances would be, if I'm in a real application cluster, say I've got ten nodes, how many of the nodes am I allowed to split the work across? Maybe it's four of the ten, and on each of them, four sub-processes. That's an example. And then we have key compression. You can actually compress indexes? Yes or no. And then of course you have your storage parameters that you can specify on indexes. Now, I didn't cover these because they're really more a storage parameter than an index issue. And then finally, we have whether or not to make these partitioned or non-partitioned. Let me drop that here for a second. I’m going to go to a different schema. This is a star schema and, for example, this period table is a dimension table. If you've ever done star schema design you typically have a dimension for time and so in this database and this star schema, period is a time dimension. Now, I know it'll look funny, you'll say, “Gee, look at all those columns – has the guy ever heard of normalization?” Well, when you're in a data warehouse or a star schema design, you typically have non – you have tables that a typical person would look at and say, “Gee, these aren't very well designed.” But that's the way you do it in a data warehousing environment.
Now, watch what's going to happen because, okay, there's all these columns, look at that, I've got an index on every single column. Now, in an OLTP environment that would be a no-no. It would slow down all my operations. In a data warehousing environment, I would drop them during my batch load cycles. Load without the overhead or the indexes, and I'd recreate the indexes. And if I partitioned my table, then instead of having to drop the index for every bucket in the table, I could just drop the index on the bucket or buckets where data was going to be going into during that batch load cycle. And then recreate just the index portion for those buckets. And so that makes it very manageable. And if I look at – so here's a column called “Holiday Flag” and basically that's a yes or no. Notice that this is a bitmap index, and for most of you you'll say, “Well, that makes sense.” Yes or no, Y or N, there's only two values that makes sense. And because when you read the documentation for bitmap indexes, they always tell you pick something with low cardinality.
Now let me go into one of my fact tables, so here we have my orders. And this is my orders per day. And you're going to see now, that again I have quite a few columns, and again, I’m going to have more than a few indexes. And right here, we have something called the universal price code. This was for a retail store, so you know those little bar codes when you buy something at the store, this is the universal price code. Now, there are millions of universal price codes. Now, for this particular company that was selling stuff, they had probably 1.7 to 2 million universal price codes, so you're going to expect that this is not going to be a bitmap index because 1.7 million distinct values sounds like high cardinality. But in reality, in a data warehousing environment, you want this to be a bitmap. Now, let me explain why. Well, there may be 1.7 million distinct values for this universal price code, the number of rows in this order table is in the hundreds of millions to billions of rows. My index is low cardinality in comparison to the size or cardinality of the table. That makes it low cardinality. That makes the bitmap index useful, even though it's counterintuitive with 1.7 million distinct values that you would choose bitmap here. Now, if I knew that I wanted to use a bitmap join index, currently the product doesn't support that, I'm getting that added in for the next release, but that would be another alternative here. And in a star schema, remember, the bitmap index would be on the fact table and that one index in the B-tree would point to the row in the fact table and then to every row that was apparent in the dimension table for that fact. And so, you have another option there. And so, let's see, I want to come out of tables now and I just want to show you quickly that I have the same information, under indexes, and I'm going to do the same basic thing.
Now, the reason I brought this up is that you may notice, hey there's no primary keys here. Primary keys are done with a key constraint, so they're actually covered by the constraint definitions. These would be indexes that are not part of constraint. Now you might say, “Well, wait a minute, that might look like a foreign key, and a foreign key is a constraint,” but foreign keys and most databases do not automatically create an index on the foreign key column, even though it's advisable, and there you go – I've got all the same choices again. And if I want to change just to be compressed, I can do that.
Now compression only works on a B-tree index. What that allows is, when you look at the various nodes in the B-tree, it allows for compression of some of the values. It really is not compression like table compression, it’s a compression of what's stored in the B-tree in the non-leaf nodes. It doesn't save a ton of space, but it can make a difference. And with that I noticed that, I'm getting pretty close to time, so what I want to do is, I want to go back, and stop my sharing. And, we have our product out there for a fourteen-day trial on idera.com. It's a pretty good product, especially if you work with multiple database platforms. If you work with two or three different databases, this tool will make your life a lot easier. We do have tools to help you with the index design and selection, we have a tool called DB Optimizer. I just couldn't cover that today, that would be too much. And if you want to contact me, there's my email address, it's Bertskozer@idera.com, or you can catch me at my private email, and I've got blogs, I've got a website and blogs, and a LinkedIn profile there. So feel free to reach out to me on anything, even if it's not product related, if you just want to talk databases, I'm a geek at heart and I love to gab about technobabble.
Eric Kavanagh: Alright, well Dez, Robin, I'm sure you've each got a couple questions at least, we've got a few minutes left here. Dez, what do you think?
Dez Blanchfield: I've got one great question I have to ask you, it's been sitting at the back of my mind. What’s the craziest scenario you've seen? I've read your blog, I follow you closely, the – you’re, you're probably one of the few people who's lived in almost every unlikely, and I think Dr. Robin Bloor is the second that I've met in my lifetime. But, you know, you've probably seen every crazy scenario, what are some of the craziest scenarios you've seen, that you've come across, and like human beings who just couldn’t cope, you've managed to walk and perform Jedi mind tricks with this whole DBArtisan?
Bert Scalzo: We had a customer once who, in their database design they thought very much the way they would think in a file layout design, and so, it – when you normalize a database, the first thing you try to do is get rid of repeating groups. Well, they had a column and they made it a long, or a BLOB or CLOB, and in it they would put value, number one, semicolon, value number two, semicolon, value number, semicolon, and they would have thousands of values in there, but they needed to search on that column and they're like, “Why does this thing run so slow?” And I'm like, “Well, you can't create an index on what you did, it's just not allowed.” So we actually showed them, using the plans, that what they needed to do was to normalize that table. Not because normalization is some academic exercise that makes things better, but because they wanted a query on that field, which meant they wanted to be able to index it, and you couldn't index it on an repeating group, or at least not easily. And so that's probably the worst thing I've ever seen.
Dez Blanchfield: Yeah, it's interesting how often you come across, I think the challenge with databases, people forget that it is a science. And there's people who do degrees and PhDs in this whole space, write papers on it, and you've written a whole swag including your TOAD handbooks and other things from memory. The trend towards sort of, quote-on-quote “big data” now – I see a lot of people forgetting the fundamentals of database architecture and database technology, database science, if you like. What are you seeing out in the field as far as the shift away from traditional database platforms and traditional database thinking that we did effectively nail to the ground, and it was just a case of performance tuning and scaling. Are you seeing a lot of people relearn and have an experience where they just sit there and have an “a-ha” moment, like a eureka moment, where they realize, this big data stuff is actually just sort of really big databases? Is that a thing out there and people are answering you back and kind of, “We forgot, what we knew and can you bring us back from the dark side?”
Bert Scalzo: Well, no, and this is horrible to have to sort of admit, but the relational database vendors have drank that Kool-Aid as well. If you remember, I don't know, about a decade ago, we started putting unstructured data into relational databases, which was sort of an odd thing to do, and then the data, the relational databases, are now adding the NoSQL-type stuff. In fact, in Oracle 12, CR2 – I know it's not out yet – but if you look at the beta, if you're in the beta program, it supports sharding. And so, now you've got a relational database that's not added the concept from NoSQL sharding. And so, the “a-ha” moment seems to be more for the people on the relational side who are going “a-ha.” No one's ever going to do it right again, not even the database managers, so we've got to go over and join the dark side.
Dez Blanchfield: Right, so you're saying a shift to a lot of the messy data, if I understand right, being put into the, what we're now calling big data platforms, which is kind of funny, because they're not that old, but doesn't that mean then that they are refocusing on what they're doing with their relational database to get more bang for their buck?
Bert Scalzo: No, usually, if they have a need in the – that would have been quote a “big data-type need,” they're finding that instead of having to go to the other database platform and do something in a non-relational way, the database vendors are now giving them the same non-relational techniques inside of their relational database, to do those things. I mean, a good example would be, if you have unstructured data, like a JSON data type or some other complex data type that has meaning embedded in the data itself, the database vendors not only support that, but they'll give you ACID compliance on unstructured data. The relational databases have embraced the newer techniques and technologies and so, again the “a-ha” seems to be more not that, “Hey we, the application developers, have unlearned something and we need to learn it again,” it's “Hey, we do it this way now, how can I do it that way in your traditionally relational database and do it like I do in this database over here?” and that's becoming more prevalent, and like I said, the database vendors themselves are enabling that.
Dez Blanchfield: Right, who are the traditional suspects in this space for the tool DBArtisan and that? I did some homework on what you'd written recently, and from memory you'd written something, I think it was one of your blogs, on extreme database performance in the Oracle world. I can't remember when it was, I think it was sometime this year from memory, or from late last year, you'd written this thing. And it seemed to me it was the traditional, usual suspect for the type of topic we're talking about today, where people will go to very large-scale database environment and looking for what you're calling extreme gains in it. Who are the usual suspects that you're seeing out there who are taking up DBArtisan and putting it to good use?
Bert Scalzo: Well, we have a lot of customers, in fact, today I was on with a very large government agency who – and they have literally probably close to 1,000 copies of our software, because it allows people to focus on what they're doing, and not how to do it. And it's okay, I mean, everyone should know how to do something, but productivity is getting the “what” done. If the business asks me to do a task, that's all they're interested in. When did I get a checkmark to say when the task was done? Not what technique or what technobabble did I use to get there. And so, our tool lets them focus on the what, and lets them be far more productive, and that's really the huge advantage, and like I said, some databases offer a tool just for their database platform. We offer it for twelve database platforms. I have the same workflow, the same graphical user interface, the same navigations. If you know how to grant a privilege to a user or how to create a table or create an index in a database, you can do it in all twelve because it's the same look and feel and same workflow. That has huge value to our customers.
Dez Blanchfield: Yeah, I guess, people want to get a lot more bang for their buck from their human resources. And the days of having an individual specialist in Oracle, Ingres and DB2 are all gone. People are expected to be the Jack of all trades, so I think this thing has absolutely saved their lives.
Just one last quick thing before I hand it to Doctor Robin Bloor. You mentioned there's a free download for fourteen days, what does – if I'm going to go ahead and I'm going to do that, by the way, I'm going to put it in the Bloor tech lab and spin this thing up and get hands on it myself – I hadn't had a chance to do that prior to today. You mentioned a fourteen-day trial, you said you're running it on a VM on your computer, I'm assuming it’s a laptop. What are the, what is the entry-level setup for someone to get hands on and use the fourteen-day trial look like, just before I hand back to Robin to his questions?
Bert Scalzo: Any Windows environment, so Windows 7, virtual machine with one CPU and four gigs of memory. We're not a really fat or expensive tool. Now if you wanted to run your database server on that same VM under that same Windows, yeah, you would need to add more, but if you're running your database on a database server or on a separate VM, the VM to load and run our product is very lightweight: one CPU, four gigs of memory, pretty much any version of Windows – and we support both thirty-two- and sixty-four-bit installs. But you do have to install your database vendor's client. So if you wanted to connect to Oracle, you have to install the SQL net client, because that's what Oracle requires in order for you to talk to a database.
Dez Blanchfield: It sounds pretty straightforward. I think that one thing from this more than anything that I'm hoping that people are going to take away, other than the realization that this tool is going to save their lives, is that they should go and download it and play with it, given that you're offering a fourteen-day free trial. And it can run on their current laptop without installing anything extra, because if they're already doing database administration, they're already working with databases they've got all those tools in place and whether its running on a local VM or on their local desktop, it sounds like it's painless to install and have a play with. So I highly recommend people do that.
Robin, I'm sure you've got questions and Eric, you've probably got some from the audience, so Robin, how about I pass to you, and then back to Eric?
Robin Bloor: Yes, okay, well I've got things to say, I mean, I've always found this area fascinating because it was – I cut my teeth on it. But the truth is, probably since about 1998, 1999, I've been adrift of what Oracle is actually capable of. And, I knew Sybase and Microsoft SQL Server, both of those are fairly simple compared to what Oracle could do. You made me laugh when you – I mean, I covered my mouth, when you started talking about sharding. Oracle did this before. Oracle introduced at some point in time, they got nervous of the object-relational idea, so they introduced the ability to create a kind of object notation and object storage in Oracle, and I talked to one of their engineers, something like a couple of years after they introduced it and I asked how many people used it, and he said I think two customers had tried it and that was it. And I think the same thing is going to happen if they start to try and do trending NoSQL things. You know, I think it's a mistake, I mean, I'm kind of interested in what your thoughts are. Certainly, the – they drink the Kool-Aid. They feel as though they've got to be able to make claims similar to the big NoSQL databases like Cassandra, but you know, does it make any sense to you?
Bert Scalzo: No, you've hit the nail right on the head. To me, I would, if I'm going to do relational, I'll pick a relational vendor like an Oracle or a SQL Server or a DB2 or a Postgres, but if I'm going to do something that's non-relational, in the big data space, or the NoSQL space, I'm going to pick the right tool for the right job. And I don't think that that would be naturally go to my relational database vendor first. And then, you add the other wrinkle to it, which is, what's available in the cloud? So many people wanting to get their databases off premise. Then you have to look at your cloud provider and say, “Okay, what do you provider, what databases do you have available for me that fit my needs and how saleable are they, and frankly what is the rate or charge for using that database in the cloud per hour, or per day. And per gigabyte or terabyte?” And what you'll find is maybe some of the relatively newer databases like Mongo or Cassandra, maybe their rates are cheaper, so if you're going to do multi-petabyte-type big data, you might have to – just from the cost standpoint – have to consider the NoSQL databases in the cloud because they may be the most cost-effective way to do it.
Robin Bloor: Yes, right. I mean, my kind of – the thing about relational databases in my experience – which is long enough to have scars, that's for sure – there's a lot of common sense that if you start applying it and – you understanding what relational actually is, that, I mean, I remember going to do some consultancy with one customer once, and they led me into a room and they had done a kind of entity diagram and created a third normal form, a model of what the company's primary systems were like. It had two hundred and forty tables about and they said, “Well, what do you think of that? We're going to build a database for this,” and said “What do you think of that?” I said, “I don't think it's going to work.” And it's exactly right, you know, because they were ending up in order to create particular structure within eleven-way joins. And that's the thing to understand about relational. So I'm kind of interested in terms of how much bad design you encounter. I mean, I don't have any problem with DBArtisan – it's doing very sensible things and the fact that you can actually display out on multiple platforms, I think, is wonderful – but how much do you encounter out there where the design is issue where people could have solved themselves all sorts of heartache if they come down to a star schema rather than getting snowflake-y about it, you know?
Bert Scalzo: Well, I don't want to sound like, presumptuous or arrogant, but I would say more often than not. Clearly, the majority of the databases that I get involved with out there, they have issues or problems. Which is good, because our tools, like our database optimizer tool, can help them to solve those problems, and, but what's really funny to me, is that a lot of the problems are the same simple problems over and over again. I was just working with a customer the other day who had an eleven-way join query, and I'm like, “Okay, why didn't you use a with clause?” and they're like, “Well, I didn't know what that is.” And then I said, “And look at your sub-selects here on your correlated and your non-correlated,” I said, “In some cases you have in your where clause at the deepest level, a table reference form the outer.” I said, “That's, move it out to the right level, don't embed it deeper than it has to be, you'll confuse the optimizer.” And with a few couple of tweaks we took something that was running about two hours and got it down to ten minutes and it was just – in that case we didn’t do anything other than improve the SQL that they had written. I think the problem is that a lot of universities and a lot of people who learn programming in a non-academic environment, they learn it as recorded-time processes or row-orientated process and relational is a set orientated by nature, and so you have to think in sets to write good SQL.
Robin Bloor: Yes, I think that's exactly right. And you have to understand, it's things like, people ought to know the ABCs of stuff like this. It doesn't matter. You aren't going to be able to do rational things if you don't realize that even a well-designed, well-modeled database, joins will take time, sorts will take time. They do because the world has never found a way of making those go fast. They've found ways of organizing the data so they go faster than the otherwise, and a lot of the enthusiasm I have to say for the NoSQL databases is simply that they are avoiding doing joins. They just start building the databases with the same spread of data in them, because if you do join in any of the NoSQL databases they suck mightily. Don't you think?
Bert Scalzo: Oh absolutely. And I have to laugh because, I started way back before relational databases and back when Ingres was RTI, Relational Technology Institute, and we didn't have SQL, we had pre-SQL relational languages. I think in Ingres, back then, it was called Quel. So you got from these old database paradigms like network and a higher graphical, or hierarchical, and you go through the relational paradigms after a couple of decades and now to me it feels like we're going back to almost a hierarchical again. It's almost like we've reverted.
Robin Bloor: Yes, right. Better hand you on to Eric, I'm consuming too much time, but have we got any questions from the audience, Eric?
Eric Kavanagh: We do, we've got a few. We're going a bit long here but I'll throw a couple over at you. We had a couple of questions around the invisible indexes. One question was, “Does someone need to use your tool in order to see those?” Another question was, “Well, what if you're blind?”
Bert Scalzo: That's a good one.
Eric Kavanagh: Curious question too, so just FYI.
Bert Scalzo: No, you don't have to have our tools. That's an Oracle feature, the invisibles index. Basically in the data dictionary, Oracle just keeps a piece of metadata that says, “Optimizer, ignore this index. It's here, but unless you're physically instructed via a hint in the, an optimizer hint in the SQL command, do not use this.” And so, no, you don't have to have our tools, and in every respect it is a plain old index, you can see it in any tool, it's just the optimizer will say, “We'll ignore it in normal query processing.” You have to direct it if you want it to get used. It's really handy for the scenario I described which is, if you wanted to build an index in production but not risk breaking the reports, or the things that are already running, but you wanted to test them, you could do it. That's what it's most useful for.
Eric Kavanagh: That's good stuff and then there was another good question here. “What about some of these new in-memory databases? How do in-memory database technology change the game in respect to indexing?”
Bert Scalzo: Boy, well we – now that's a good, I'm glad someone asked that question, we're going to have to go another half hour. No, the in-memory, it depends on the database vendor. Now, normally, I am, I speak nothing but praise of anything that Oracle does because it's amazing the technology they've built, but when you tear back under the covers and you look at what in-memory is in Oracle, in the Oracle database, what it is in reality is it still kept row store on disk, and it will get loaded column-store in-memory, and if there's insufficient memory to hold the whole table, it will revert back to for the portions; it won't fit in memory, to doing it row store, and so you could actually do a select against the table and for half the table, you 're using an indexing hitting traditional rows at the table, and for the other half of the select it's actually going out and just grabbing everything from an in-memory search, and so, it's different in the way that SQL Server, for example, implemented it with their Hekaton technology, you know, and SQL 2014, and it's been improved in SQL 2016, but in some respects, theirs is a more true version of in-memory, and, but each implementation has a pros and cons, but you have to kind of look under the covers and realize. Because, I had a customer who said, “Oh this table's in-memory – I'm just going to draw up all the indexes,” and I'm like, “The table's bigger than the memory that you have on the server, so at some point some of the query’s got to hit disk.”
Eric Kavanagh: That's a good description; that's good stuff. Well, folks, we're going to have a few more webcasts with these guys over the rest of this year, come back anytime you hear of Bert being on a presentation because we know he knows his stuff. It's always fun to talk to the experts. We do archive all these webcasts for later viewing. Here's Bert's contact information once again, and we'll try to dig up that link for the download and send it out as well by email, but you can always email yours truly: email@example.com, we've got a bunch more webcasts lined up for this year and we're doing the ed cal right now, so, folks, if there's any topics you really want to hear about next year, don't be shy: firstname.lastname@example.org. Take care, folks, we'll talk to you next time. Bye-bye.