Eric Kavanagh: Ladies and gentlemen, it is Wednesday, at four Eastern time. I’m in New Orleans, the summer is coming, that means it’s hot! It’s time for Hot Technologies, yes indeed, yes indeed. My name is Eric Kavanagh, I’ll be your host. I’m going to kick the ball back here for Hot Technologies. The topic today is “Forward Momentum: Moving Relational Beyond Traditional.” Folks, we have three database experts on the phone today, so any questions you have, send them the hard ones, don’t be shy. We have a bunch of good content lined up for you today. There is the spot about yours truly, enough about me. Of course, this year is hot. We’re talking all about hot technologies in this show, which is a partnership with our friends from Techopedia. And we’re going all the way down to the foundation of information management today, which of course is the database. We’re going to talk about how we got here, what’s happening today, and what’s happening going forward. Lots of very interesting stuff going on.
Obviously we have some serious innovation in the database space. It was kind of quiet for a while; if you talk to some of the analysts in the business, I would say probably from about 2005 to 2009 or ’10, it didn’t seem like there was too much going on in terms of innovation. And all of a sudden it just broke out, like a jailbreak or something, and now there’s all kinds of interesting stuff happening. A lot of that is because of the scale of the web, and all the cool web properties that are doing different interesting things. That’s where the NoSQL concept came from. And that means two different things: it means no SQL, as in it doesn’t support SQL, and it also means not only SQL. There’s a term “NewSQL” that some people have used. But obviously, SQL – the Structured Query Language – really is the foundation, it’s the base of querying.
And it’s interesting that all these NoSQL engines, what happened? Well, they came out, there was a lot of excitement about it, and then a few years later, what did we all start hearing? Oh, SQL on Hadoop. Well, all these companies started slapping SQL interfaces onto their NoSQL tools, and anyone who is in the programming world knows that’s going to lead to some challenges and some difficulties, and some crossed wires and so forth. So we’re going to find out about a lot of that stuff today.
There are our three presenters: we’ve got Dez Blanchfield calling in from Sydney, our very own Robin Bloor who’s in Texas, and so is Bert Scalzo, he’s in Texas as well. So, first of all we’ll hear from Dez Blanchfield. Folks, we will tweet at the hashtag of #HotTech, so feel free to send your comments, or send your questions through the Q&A component of the webcast console, or even through the chat window. And with that, Dez Blanchfield, take it away.
Dez Blanchfield: Thank you, Eric. Hi, everyone. So I’m going to try and set the scene at a 30,000-foot point of view of kind of what’s happened in the last decade, and the significant shifts we’ve seen – or at least a decade and a half anyway – of the database management systems, and some of the impacts from a commercial or a technical point of view, and some of the trends that we’ve endured of late, and lead us into the conversation we’re about to have today around the topic.
My cover image here is a sand dune, and there’s wind blowing tiny little bits of sand off the top of it. And as a result of that, what happens is that sand dune slowly walks from one space to another. And it’s an amazing phenomenon, where these massive 40- and 50-foot high mountains of sand, effectively, they actually move. And they move very slowly, but they move surely, and as they move, they change the landscape. And it’s quite something to watch if you spend any time at all in an area where sand dunes are a natural thing. Because you can look out the window one day, and realize that this massive mountain of sand has moved, tiny grain by tiny grain, all by itself, in effect, as the wind slowly shifts it from one place to another.
And I think in many ways, that’s been the world of database systems for quite some time, until very, very recently: very small shifts, like sand grains slowly moving a giant sand dune. Little shifts have come into the database platforms over the years, and it’s been a fairly stable and solid environment around database systems and platforms, through the mainframe and the mid-range era. But of late, we’ve had some fairly significant things happen to our commercial needs and our technical drivers. I’m going to walk us through those.
I have a view that the basic concept of a database, as we knew it for many, many years, and as you may have heard in the pre-show banter, our two experts who are on the call with me today had a lifetime in this space and they are quite right in sharing bragging rights of being there when it all started in the early ‘80s. But we’ve seen this massive shift in the last decade and a bit, and I’m going to quickly walk us through before I hand it over to Dr. Robin Bloor.
We’ve been through this what I call, “bigger, better, faster, cheaper” experience. As I said, the definition of a database has changed. The landscape in which the database platforms have had to address performance, and technical and commercial requirements has shifted as well. We’ve seen this increase in demand for solutions to deal with either more complex commercial, or more complex technical requirements. And so a really quick look through what that actually means, in my mind, is that we got to sort of the ‘90s, and we saw database technology impacted by the introduction of the internet, and kind of what we called back then the internet scale. We weren’t just talking about people sitting in front of terminals, originally the likes of teletype terminals with physical printers built into them and 132 columns of text coming out in paper. Then the early green screen terminals, punching with keyboards.
But you know, our world was terminals and serial cables or network cables talking to computers for a long time. Then along came the internet, and this explosive growth of connectivity, so that you didn’t have to be plugged into the computer anymore. To get to a database system you just needed a web browser. So database technology had to dramatically change, to deal with the scale of everything from the basic search engine technologies that were used to index the world, and store an index of information at database scale. And people like Google and others provided a platform to do that. And all new types of database storage and querying and indexing were produced. And then we had music sites and movie sites come along.
And then in the 2000s, we saw the dot-com boom, and that produced an even more dramatic explosion in the number of people using systems that were invariably powered by a database of some form. At this stage, relational databases still coped with most of the load, we just put them on bigger tin, and we kind of went to the very, very, very big mid-range systems running Unix platforms from people like IBM and Sun and so forth. The dot-com boom just made things bigger and faster from a hardware performance point of view, and there were some significant changes in the database engines, but for the better part, it was still the same thing we had kind of seen for a long time.
And then we got this era of web 2.0, as we refer to it. And this was a monstrous shift, because all of a sudden we needed much simpler database platforms, and they had to scale horizontally. And that was such a significant shift in the way that we approached the idea of what a database was. We’re still really catching up now in my view. And now we’re dealing with this whole quagmire, and I say that with a positive spin, not a negative connotation, this quagmire of what we refer to as big data, and an enormous explosion, and I mean explosion. This outrageous shift vertically on the graph of the number of options we have when we talk about a database, and some form of relational querying capability.
And interestingly enough, I’m personally of the view that big data really is just the tip of the iceberg. We do tend to get a little bit excited about what the impact of big data’s been, and the types of choices that we have available now. We’ve got everything from NoSQL engines, we’ve got graph engines, we’ve got all these different types of platforms that we can throw data at and do things with it. Even to the point where in fact, one of the very first conversations I had with Eric Kavanagh, who’s here with us today, was around a thing called Apache Drill, which is an open-source project that allows you to query data inside multiple different data types: everything from raw CSV files sitting on a hard drive, through to HDFS file systems at petabyte scale. And you know, it allows you to do these SQL-style queries of structured and unstructured data of all kinds.
We’re about to see “smart building” become a thing, and we’d like to think we’ve got smart buildings of security and heat management, but I’m talking about smart buildings that know a lot more about who you are and where you are when you walk in, and do all kinds of neat things at that level, through to smart cities – entire ecosystems at city level – that know how to do things intelligently. And beyond that, we’ve got this incredible thing that I don’t think anyone in the world’s fully grasped, and that’s the form of the Internet of Things. There’s been all these different changes through the last decade and a bit, maybe two decades roughly, if we round it up, that have sort of just impacted the world of what we consider databases, in my view.
There’s been a couple of significant things that have made this even possible. The cost of hard drives has come down dramatically, and in many ways that’s what made it possible to drive some of the reference architectures such as the Hadoop model, in that we take lots of data and spread it out on lots of hard drives, and do smart things with it. And in effect, what became sharding, in my view, of the relational database or traditional DB unit model. And RAM got very, very cheap, and that gave us a whole new opportunity to play with different reference architectures such as in-memory, and to do things like partitioning very, very large lumps of data.
And so this gave us this little picture that we’re looking at now, which is a diagram that shows the types of platforms that are available if you’re in the big data landscape. And it’s very, very difficult to read, and the reason for that, there’s just too much information on that. There are so many make, model and manufacture options of ways to put data into database systems of any form, and query it, and do the traditional read-writes. And they’re not all [inaudible] compliant, in fact very few of them even comply to any basic [inaudible] style standard, but they still consider themselves to be a database. And I’m going to show you a couple of screens in a second to give you some context around what I mean by the shift from the ‘90s and the internet scale, to web 2.0, and then the whole growth through big data. If we think that this big data technology landscape graph is exciting because there’s a lot of options on it, let’s just have a look at one key vertical.
Let’s look at marketing technology. Here are the options for database management systems, or data management inside just the mar-tech space, so technology related to marketing. Now this was in 2011, so a few years ago; five years ago, this is what the landscape looked like. If I just go back one slide briefly, this is what today’s data landscape looks like in the various brands and offerings we’ve got in database technologies. This is what one vertical looked like five years ago, just in marketing technology.
Now if I go to today’s view, this is what it looks like, and it’s completely impenetrable. It’s just this wall of brands and options, and it’s thousands and thousands of combinations of software that considers itself to be in the database class, that it can capture, create or store and retrieve data in various forms. And I think we’re entering a very, very interesting and brave time now, where once upon a time you could know the major brands, you could know the five or six different platforms from Oracle and Informix, DB2 and so forth, and be almost an expert on all of the brands that were available some 20 years ago. Ten years ago, it got a little bit easier because some of the brands fell off, and not all the brands could cope with the scale of the dot-com boom, and some companies just went broke.
Today, it’s absolutely impossible to be an expert on all the database technology that exists, whether it’s relational databases, or standard database management platforms that we’ve come to know over the last couple of decades, or, as is likely the case, the more modern engines like Neo4j and those types. And so I think we’re entering into a very brave world where a lot of options are available, and we’ve got platforms that scale on a horizontal basis, either in-memory or on disk now. But I think it’s a challenging time for technology and business decision makers, because they need to make some very big decisions on technology stacks that in some cases have only been around for essentially months. Eighteen months old is not a scary number now for some of the more exciting and new open-source database platforms. And they start to merge platforms and become even newer and more exciting.
I think we’re going to have a great conversation today about how this all has impacted the traditional database platforms and how they’re responding to it, and the types of technologies that are being thrown at that. And with that in mind, I’m going to pass now to Dr. Robin Bloor, and get his insights. Robin, over to you.
Robin Bloor: Okay, thanks for that. Yeah, this is way too large a topic. I mean, if you just took a sliver of one of the illustrations that Dez just showed you, you could have a long conversation just about one of the slivers. But you know, I’ve been looking at databases, I don’t know, since the 1980s, and you can look at a database in different ways. And one of the things that I figured that I would do, just to throw into the conversation today, was to talk about the recent disruptive things that have happened at the level of hardware. And you have to bear in mind, an awful lot of disruptive things have actually happened at the level of software as well, so this is not the full picture of anything, this is just a hardware thing.
I wasn’t going to talk for particularly long either, I just wanted to give you the hardware picture. A database was data retrieval capabilities spanning CPU, memory and disk, and that’s changing dramatically. And the reason I say that, was that I learned to understand database from the perspective of what you actually did. You know, there’s a difference in latency between data actually on the CPU, and data being pulled into the CPU from memory, and data being pulled from disk into memory, and through the CPU. And the old database architectures were just trying to balance that. You know, they were just saying, “Well, this goes very slow, we will cache the data on the disk so it’s in memory. We will try and do that in a really accurate way so that a really good proportion of the data we ask for is already in memory. And we will march the data onto the CPU as fast as we actually can.”
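The balancing act described here, keeping a good proportion of requested data already in memory, can be sketched as a small least-recently-used page cache. This is only an illustration in Python: the `BufferCache` class, the `fake_disk_read` function and the access pattern are all hypothetical, and real database buffer managers are far more elaborate.

```python
from collections import OrderedDict


class BufferCache:
    """Tiny LRU page cache sitting in front of a slow 'disk' read function."""

    def __init__(self, read_from_disk, capacity):
        self.read_from_disk = read_from_disk  # hypothetical slow fetch
        self.capacity = capacity              # max pages held in memory
        self.pages = OrderedDict()            # page_id -> data, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, page_id):
        if page_id in self.pages:
            # Cache hit: mark the page as most recently used.
            self.pages.move_to_end(page_id)
            self.hits += 1
            return self.pages[page_id]
        # Cache miss: go to the slow device, then keep the page in memory.
        self.misses += 1
        data = self.read_from_disk(page_id)
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)    # evict least recently used
        return data

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_disk_read(page_id):
    # Stand-in for a real disk read; assumed for illustration only.
    return f"page-{page_id}"


cache = BufferCache(fake_disk_read, capacity=3)
for page_id in [1, 2, 3, 1, 2, 4, 1, 5, 1]:
    cache.get(page_id)
print(f"hits={cache.hits} misses={cache.misses} ratio={cache.hit_ratio():.2f}")
```

The higher the hit ratio, the less often the engine has to wait on the slowest layer, which is the trade-off the older architectures were tuned around.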
And databases were written in the old days [inaudible] machines are written for small clusters. And now, for the [inaudible] ignorant of parallelism. Because if you’re going to get some performance out of a cluster, you’ll have to do various things in parallel. Parallelism is a part of the game, nothing like the way it is now. I’ll just kind of walk through what happened.
First of all, disk. Well, disk is over, really. It’s pretty much over as regards databases. I think there are a number of contexts, like archiving of data, and even very large data lakes running on Hadoop, where spinning disk is probably viable nowadays. Really, the problem with spinning disk was that the read speeds didn’t improve particularly much. Meanwhile, CPU was going up at Moore’s law speeds, kind of an order of magnitude faster every six years, and memory was kind of following in its wake, so those two were reasonably keeping pace with each other. It wasn’t entirely smooth, but they did.
But the random read to a disk where the head flies about the disk, I mean, apart from anything else, it’s a physical movement. And if you’re doing random reads off a disk, it’s incredibly slow compared to reading from memory, it’s like 100,000 times slower. And fairly recently, most of the database architectures I’ve looked at in any depth have actually just been serially reading from disks. You really want to, in one way or another, just cache as much as you can from the disk, and pull it off that slow device and put it onto a fast device. And there’s a lot of smart things that you can do with that, but it’s kind of over.
And solid-state disks, or flash drives, really, is what they are, is very quickly replacing spinning disk. And that changes again completely, because the way that data is organized on a disk, is it’s organized according to the way that the disk works. It’s actually about a head moving across a spinning surface, actually multiple heads moving across multiple spinning surfaces, and picking up the data as they go. A solid-state drive is just a block of stuff that you can read. I mean, the first thing is all the traditional databases were engineered for spinning disk, and they’re now being re-engineered for SSD. New databases can probably – anybody that’s writing a new database right now can probably ignore spinning disk, not think about it at all. But Samsung, the major manufacturer of SSDs, tells us that SSDs are actually on the Moore’s law curve.
They were already, I think, about three or four times faster than spinning disk, but they’re now going to get a lot faster: basically doubling in speed every 18 months, and 10 times in speed over about six years. And that isn’t all of it, as I will tell you in a moment. Spinning disk of course is becoming an archiving medium.
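The arithmetic behind that growth rate is easy to check. The sketch below assumes a clean doubling every 18 months, which is an idealization; the function names are made up for illustration.

```python
import math

DOUBLING_PERIOD_MONTHS = 18  # assumed Moore's-law-style doubling period


def speedup_after(months):
    """Speed multiple after the given number of months of steady doubling."""
    return 2 ** (months / DOUBLING_PERIOD_MONTHS)


def months_to_reach(factor):
    """Months of steady doubling needed to reach a given speed multiple."""
    return DOUBLING_PERIOD_MONTHS * math.log2(factor)


print(f"after 6 years: {speedup_after(72):.0f}x")
print(f"10x reached after about {months_to_reach(10) / 12:.1f} years")
```

Under that assumption a tenfold gain arrives in roughly five years and six years gives about sixteenfold, so "an order of magnitude in about six years" is, if anything, a slightly conservative reading.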
About memory. First things first, RAM. The ratio of RAM per CPU is just increasing all the time. And that of course, in a way, delivers an awful lot more speed, because the acres of memory that you can have now can store a lot more. What this actually does is, it kind of reduces the pressure on OLTP kind of applications, or random read applications, because it’s easier to cater to those, because you’ve now got a lot of memory, and that way, you can cache anything that’s likely to be read into memory. But you run into problems with a bigger data heap, so big data is actually not that simple, really.
And then we have Intel with 3D XPoint, and IBM with what they call PCM, which is phase-change memory, delivering something that they believe is – well, it’s at least 10 times faster than current SSDs, and they believe it will get very close to being the same speed as RAM. And of course it’s less expensive. So previously, you had this database structure of CPU, memory and disk, and now we’re moving towards a structure that’s got four layers. It’s got CPU, memory or RAM, and then this kind of faster-than-SSD memory, which is actually non-volatile, and then SSD. And these new technologies are non-volatile.
And there’s HP’s memristor, which was announced about seven years ago, but it’s not yet appeared. But the rumors I hear are that HP’s going to change the game a little bit with the memristor as well, so you’ve got just a new memory situation. This isn’t like we’ve got faster stuff, this is like we’ve got a whole new layer. And then we’ve got the fact that SSD access, you can read it in parallel. You can’t read spinning disk in parallel, except by having a lot of different spinning disks. But a block of SSD, you can actually read in parallel. And because you can read that in parallel, it goes way faster than its simple read speeds, if you actually set up multiple processes across the various cores on a single CPU, and just have at it with the SSD.
It’s estimated you can get almost up to RAM speeds by doing that. And all that this is saying is, the future of memory architecture is unclear. I mean, the reality is that the various dominant vendors, whoever they turn out to be, will probably determine the direction of the hardware. But nobody knows where it’s going at this point in time. I have talked to some database engineers who say, “I’m not afraid of what’s happening,” but they don’t know how to optimize it from the get-go. And you always kind of did, so that’s interesting.
And then there’s the CPU. Well, multicore CPUs weren’t just multicore CPUs. We also have significant volumes of L1, L2 and L3 cache, particularly L3, which is up to, I don’t know, tens of megabytes. You can put a lot there, you know. And therefore, you can actually use the chip as a caching medium. So that changed the game. And certainly, vector processing and data compression, a number of vendors have actually done that, dragged that stuff onto the CPU to make it all go a lot faster at the CPU. Then you get the fact that, well, CPUs with GPUs are really good at accelerating analytics. And they’re really quite good at certain kinds of queries, it just depends upon what your query is.
You can either create boards with CPUs and GPUs on, or as AMD are doing right now, you produce something called an APU, which is a kind of marriage of a CPU and a GPU; it’s got both kinds of capability on it. So that’s a different kind of processor. And then the recent announcement by Intel that they’re going to put an FPGA on the chip, that kind of did my head in. I was thinking, “How on earth is it going to happen?” Because if you’ve got the possibility of CPU, GPU, and you’ve got the possibility of CPU, FPGA – and by the way, if you really want to, on the same board you could put a CPU, and a GPU, and an FPGA. I have no idea how you would actually run anything in that way, but I do know of companies that are doing things like this, and they’re getting very, very fast query responses. This isn’t something that’s going to be ignored, this is something that’s going to be used by the established vendors, and by new vendors coming up, perhaps. DBMSs were always parallel, but now the parallel possibilities have just exploded, because this allows you to parallelize this with that, with that, with that in various ways.
Finally, to scale up or scale out? Scaling up is really the best solution, but for one thing. You get far better node performance if you can just absolutely optimize the performance of the CPU and the memory and the disk on one node. And you will use fewer nodes, so it’s going to be cheaper, right? And it’ll be easier to manage. Unfortunately, it’s a hardware-dependent design, and as hardware changes, it becomes less and less possible to do that, unless your engineers are going to be able to run as fast as the hardware is changing. And you do get workload issues, because when you’re scaling up, you’re making various assumptions about what the workload’s going to do.
If you scale out, that is, if your architecture emphasizes scale out before scale up – actually you’ve got to do them both, it’s just that you emphasize one. Then you will get better network performance, because the architecture will deal with it. It will be more expensive in hardware terms because there will be more nodes, but there will be fewer workload issues, and there will be more flexible design.
And I just thought I would throw that in, because if you actually think of all the hardware changes I just pointed my finger at, and then you thought about, how are you going to scale up and scale out on that stuff? Then you realize that database engineers are, in my opinion at least, well underpaid. So if you just contemplate the hardware layer, the database challenges are clear. Now I pass this on to Bert, who’s going to make us all feel educated.
Eric Kavanagh: That’s it! Bert?
Bert Scalzo: Thank you very much. Let me just get straight into these slides. I have a lot of slides to go through, so on quite a few of them I may go rather quickly. We’re going to be talking about this “Forward Momentum: Moving Relational Beyond Traditional.” It’s not your father’s database anymore. Things have changed, and as an earlier speaker said, the last six to seven years, the landscape has changed radically.
Myself, I’ve been doing databases since the mid-'80s. I’ve written books on Oracle, SQL Server, benchmarking and quite a few other things. “The world is changing very fast. Big will not beat small anymore. It will be the fast beating the slow.” I added the “to adapt.” That was from Rupert Murdoch. I really believe this is going to be true. You’re not going to be able to do database stuff the way you did 10, 15, 20 years ago. You’re going to have to do it the way the business wants it now.
I’m going to try to stay a little generic in what I’m presenting, but most of the features I’m talking about you will find in Oracle, you will find in SQL Server, MySQL, MariaDB and some of the other big players. The relational database revolution, I kind of again agree with the earlier speakers. If you look right around 2010, we went from the red race car to the yellow race car. There was a significant change, and come 2020, I believe you’re going to see another radical change. We’re in a very interesting time.
Now, this slide is key, that’s why I put a key up there. There’s all this change going on, and on the left-hand side I’ve got technology, and on the right-hand side I’ve got business. And the question is, which one is causing which, and which one is supporting which? We have all these hardware changes: disks coming down, disk size going up, new types of disks, so that was covered by the earlier speakers. The price of memory dropping, all these newer versions of databases. But on the right-hand side, we’ve got data protection and compliance, data warehousing, business intelligence, analytics, mandatory data retention. Both sides of the equation are driving, and both sides of the equation are going to make use of all these new features.
First of all, we’ve got our typical SAS spinning disk, they’re up to 10 terabytes now. If you’ve not seen, Western Digital, HGST has what they call their helium drive, that goes up to about 10 terabytes right now. The spinning disk costs are getting pretty low. As was mentioned earlier, you can get solid-state disks up to about two terabytes, but Samsung has a 20-terabyte unit coming soon. The costs are becoming reasonable. One thing I am going to talk about that the others didn’t is the concept of flash disks. PCIe, that’s PCI Express, versus NVMe, you may or may not have heard of this, non-volatile memory express. Basically, NVMe is going to be a replacement for SAS and SATA, and it’s really more of a communication protocol than anything else. But those disks are up to about three terabytes now.
You also may have seen that some SAS drives now come with U.2 connectors, which is sort of a different connector than a SAS or SATA, that supports NVMe with a standard disk – the disk has to support it as well, of course. And then SATA with M.2 connectors, and those are starting to get NVMe. In fact, there are notebook vendors now selling notebooks that have an NVMe flash disk in it, and those things will scream compared to the technology you’ve used before.
A lot of people don’t know what all these different flashes are. If you look in the bottom right corner, that’s an example of an M.2. You may say, “Well gee, it looks a lot like the mSATA drive to the left of it.” But as you can see, it’s got two gaps in the pins as opposed to one, and it is a little bit bigger. And also, the M.2 can come in three different sizes.
And then the PCI Express flash, and the NVMe flash. Now, the NVMe flash is also PCI Express, but the PCI Express is typically still a SAS- or SATA-type controller algorithm that was written for spinning disk, and NVMe is the algorithms or techniques that were written specifically for flash. And again, you’re going to be seeing all of these.
NVMe offers quite a few things. I think the two biggest improvements are, up in the top right corner, the latency is reduced by as much as 70 percent. I’ve actually seen even higher than that. In addition, if you look in the bottom right corner, when your operating system talks to the NVMe disk, it goes through far fewer levels of software. Basically, you go through the NVMe driver that’s included now with the operating system, and it talks straight to the media. There’s a lot of reasons why this technology is going to radically change the database world.
And a lot of times, people will say, “Well, how fast is NVMe?” You know, in the good old days, back in 2004 and before, we got excited if we had Ultra-320 SCSI, 320 megabytes per second. Today’s speeds, a lot of you are probably on fiber or InfiniBand, and those kind of top out. NVMe, over there on the right, starts where the current technologies end. What I’m getting at is, PCI Express 3.0 with an eight-lane link starts at almost 8000 megabytes per second, and it will go up as we get newer versions of PCI Express, versions four and so on. NVMe has nowhere to go except up.
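The "almost 8000" figure for an eight-lane PCI Express 3.0 link follows from the per-lane signalling rate and the line encoding. A minimal check in Python, using the published PCIe 3.0 figures of 8 gigatransfers per second per lane and 128b/130b encoding (the function name is made up for illustration, and protocol overhead above the link layer is ignored):

```python
TRANSFERS_PER_SEC_PER_LANE = 8e9   # PCIe 3.0: 8 GT/s, one bit per transfer
ENCODING_EFFICIENCY = 128 / 130    # 128b/130b line encoding
ULTRA_320_SCSI_MB_PER_S = 320      # nominal Ultra-320 SCSI bus rate


def pcie3_bandwidth_mb_per_s(lanes):
    """Raw one-direction PCIe 3.0 link bandwidth in megabytes per second."""
    bits_per_sec = TRANSFERS_PER_SEC_PER_LANE * ENCODING_EFFICIENCY * lanes
    return bits_per_sec / 8 / 1e6


x8 = pcie3_bandwidth_mb_per_s(8)
print(f"PCIe 3.0 x8: {x8:.0f} MB/s")
print(f"vs Ultra-320 SCSI: {x8 / ULTRA_320_SCSI_MB_PER_S:.1f}x")
```

Real drives deliver less than the raw link rate, so this is an upper bound on what the interface allows rather than a measured device speed.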
Now, what are some of the things changing in the database? Now in the top right corners of my slides, I put the business reasons I think the technology showed up. In this case, because of data warehousing and because of regulatory reasons for mandatory data retention, the databases are starting to offer compression in them. Now, some databases offer compression as an add-on, some offer it as built-in to the standard, let’s say enterprise edition of their database, and yet some databases, like in Oracle, could even have an even better version of compression that’s in, say, their Exadata platform, so they’ve actually built hardware that can support a very specialized compression and that one in Exadata, for example, gets a 40x compression rate, and so it’s very significant. And I think it’s the mandatory data retention, people just want data longer. The businesses, in order to do analytics and BI they need the last 5, 10, 15 years’ worth of data.
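To get a feel for why retained business data compresses so well, here is a small Python sketch using the standard-library `zlib` module on made-up, highly repetitive order rows. This is generic compression, not the algorithm any particular database or the Exadata platform uses, and the ratio it prints depends entirely on the sample data.

```python
import zlib

# Hypothetical order rows: most columns repeat, as historical data often does.
rows = [f"{order_id},2016-06-01,SHIPPED,US\n" for order_id in range(10_000)]
raw = "".join(rows).encode("utf-8")

compressed = zlib.compress(raw, 9)   # 9 = highest compression level
ratio = len(raw) / len(compressed)

print(f"raw bytes:        {len(raw)}")
print(f"compressed bytes: {len(compressed)}")
print(f"ratio:            {ratio:.1f}x")

# Compression is lossless: the original bytes come back exactly.
restored = zlib.decompress(compressed)
```

Columns with few distinct values, such as status codes or country codes, are what make the high ratios quoted for warehouse data plausible.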
Now another feature that started showing up right around that 2008, 2009 period was partitioning. Again, you will find this in databases like Oracle, SQL Server, and in both of those you have to pay for it. In Oracle you have to buy the partitioning option and in SQL Server you have to be on the data center edition. It’s your traditional divide-and-conquer technique and what you do is you have the concept of a logical big table at the top there and when it gets put on disk, it actually is broken up into buckets. And you can see that those buckets are organized by some criteria for separating, typically referenced or called your partitioning function, and then likewise, you can also sub-partition in some database platforms and you can go even further.
Again, I think both data warehousing and the mandatory data retention have pushed this, and in some of these databases you can have up to 64,000 partitions, and I believe on some other databases even up to 64,000 sub-partitions. This allows you to break up your data into manageable pieces. You can also partition the indexes; it’s an option, you don’t have to, but you can partition your indexes as well. One of the reasons to do this might be that you have a sliding window of data. You want to keep 10 years’ worth of data, but in order to drop the indexes to run tonight’s batch load, you don’t want to have to drop the indexes on every single row, only on the rows that are in the current bucket. Partitioning is actually a very good administrative tool, even though most people think that its great benefit is for doing partition elimination in your plans and therefore speeding up your queries. That’s really kind of icing on the cake.
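The two benefits mentioned here, partition elimination at query time and a cheap sliding window for administration, can be sketched with a toy range-partitioned table in Python. The `YearPartitionedTable` class, the `order_date` column and the sample rows are hypothetical; this only mimics the idea and is not how Oracle or SQL Server implement partitioning.

```python
from collections import defaultdict
from datetime import date


class YearPartitionedTable:
    """Toy table whose rows are bucketed by the year of 'order_date'."""

    def __init__(self):
        self.partitions = defaultdict(list)   # year -> list of rows

    def insert(self, row):
        # The partitioning function: route each row by its year.
        self.partitions[row["order_date"].year].append(row)

    def query_year_range(self, first_year, last_year):
        # Partition elimination: only touch buckets that can hold matches.
        scanned = sorted(y for y in self.partitions
                         if first_year <= y <= last_year)
        rows = [r for y in scanned for r in self.partitions[y]]
        return rows, scanned

    def drop_oldest(self):
        # Sliding window: age out a whole bucket without touching the rest.
        oldest = min(self.partitions)
        del self.partitions[oldest]
        return oldest


table = YearPartitionedTable()
for year in (2013, 2014, 2015, 2016):
    table.insert({"order_id": year - 2012, "order_date": date(year, 6, 1)})

rows, scanned = table.query_year_range(2015, 2016)
dropped = table.drop_oldest()
print(f"scanned partitions: {scanned}, rows returned: {len(rows)}")
print(f"dropped partition: {dropped}, remaining: {sorted(table.partitions)}")
```

Dropping a bucket is a single operation regardless of how many rows it holds, which is why the administrative benefit tends to matter as much as the query speed-up.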
Now you probably heard about sharding and you probably think, “Well, why did you put this slide in here?” This is one of those NoSQL – this is one of those Hadoop-type environments. Oracle 12c Release 2, which is not GA yet, but which is being shown or previewed, actually has sharding in it. You’re going to have a traditional database system like Oracle and you’re going to be able to shard like you do in the Hadoop model, and so you’re going to have another divide-and-conquer technique that’s going to split your table row-wise into groupings per node and this is going to be – just like what you see in some of your NoSQL databases. And actually MySQL, you can actually accomplish this pretty much using one of their clustering techniques, but it is coming to a traditional database and my guess is Microsoft won’t want to get left behind. These two play leap frog with each other all the time so I would expect to see sharding in maybe the next version of SQL Server.
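The row-wise split per node can be sketched in a few lines of Python. This is a generic hash-sharding illustration, not a description of how Oracle, MySQL, or any NoSQL system actually routes rows; the function names `shard_for` and `route_rows` and the `customer_id` key are hypothetical.

```python
import hashlib


def shard_for(key, num_shards):
    # Hash the shard key so the same key always lands on the same node.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def route_rows(rows, key_field, num_shards):
    # Split a table row-wise into one group of rows per node.
    shards = {node: [] for node in range(num_shards)}
    for row in rows:
        shards[shard_for(str(row[key_field]), num_shards)].append(row)
    return shards


orders = [{"customer_id": i, "total": i * 10} for i in range(100)]
shards = route_rows(orders, "customer_id", num_shards=4)
```

A stable hash is used here instead of Python's built-in `hash()`, which is randomized per process for strings, because routing has to give the same answer on every node and every run.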
Data life-cycle management, again mandatory data retention, but also for business intelligence and analytics. Really, this is a divide-and-conquer technique, and typically DBAs do this manually, and that is, “I’m going to keep this year’s data on fast disks, last year’s data on slightly slower disks, maybe I’m going to keep the last two years before that on even slower disks, and then I’ll have some archival method.” It’s typically not tape anymore, it’s typically – you’ve got some kind of network-attached storage or some device that has lots of storage and is, you know, cost effective but it’s still spinning disk.
And so now you can actually – both on Oracle and on SQL Server – you can purchase an option where you define the rules and this just happens automagically in the background. You don’t have to write scripts anymore, you don’t have to do anything. And if you’ve seen SQL Server 2016, which just came out June first, there’s a new feature that’s called “Stretch Databases” which basically lets you do – in the bottom right corner there – you can move from multiple layers directly into the cloud and again this is a feature that is built into the database, you just say something like, “If the data is more than 365 days old, please move it into the cloud and, you know, do it automagically for me.”
This is going to be a really cool feature, in fact I’m thinking that it may be what we’re going to see in the future, which is you’re going to have hybrid databases where you’re going to keep some local and some in the cloud. Before this, people were thinking, “Oh, I’m either going to do on-premise or I’m going to do on the cloud.” Now we’re seeing the marriage of the two technologies in this hybrid fashion. I think this will be pretty big and Microsoft got there first.
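The kind of age-based rule described here, whether a DBA scripts it by hand or the database applies it automatically, can be sketched as a small Python function. The tier names, the day thresholds, and the function `tier_for` are assumptions chosen to mirror the spoken example; they are not the syntax or behavior of any Oracle or SQL Server feature.

```python
from datetime import date

# Hypothetical tiers, youngest data first: (maximum age in days, tier name).
TIER_RULES = [
    (365, "fast_disk"),       # this year's data
    (730, "slower_disk"),     # last year's data
    (1460, "slowest_disk"),   # the two years before that
]


def tier_for(row_date, today, rules=TIER_RULES, fallback="archive_or_cloud"):
    # Pick the first tier whose age limit covers the row; anything older
    # falls through to the archive (or cloud) tier.
    age_days = (today - row_date).days
    for max_age, tier in rules:
        if age_days <= max_age:
            return tier
    return fallback


today = date(2016, 6, 1)
recent = tier_for(date(2016, 1, 1), today)
last_year = tier_for(date(2015, 3, 1), today)
ancient = tier_for(date(2010, 1, 1), today)
```

With these assumed thresholds, a single rule table replaces the hand-written scripts: changing where data lives means editing the rules, not the jobs that move it.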
Redaction, this is due to data protection and compliance. Now in the good old days we might have said, “Hey, application developer, when you display this in the report, when you display this on the screen here is some security things you should check and please, you know, only show the data they’re supposed to see or mask or redact the data that they’re not supposed to see.” Well, as is usual, when you push it out to the application it’s not done on one place so it gets done differently or it doesn’t get done in some places. And so now you’ve actually got this capability in your database systems.
Now in SQL Server 2016, this feature is built in so it’s not an optional cost item, yet you have to be on the data center edition, I believe; and in Oracle 12 you have to buy their life-cycle management add-on, but this is something new and again it’s being driven by the business. And especially because you’re keeping so much data now, and you’re doing the data mining, so the BI and the analytics, you’ve got to know who’s accessing what data and making sure that they’re only allowed to see what they’re allowed to see.
Likewise, again look at that, data protection and compliance. You will find that a lot of the database systems now are building compression, or I’m sorry, encryption directly into the database and what’s important about this encryption, if you look at the down arrow and the up arrow on the diagram it writes it down to disk encrypted and then it reads it back up into memory and decrypts it. That’s actually one model, there’s another model that would, you know, actually only do it when it communicates that data across the network to the actual client application.
In that case, it would even still on the database server in memory it could be encrypted and only decrypted when it’s sent over to the client application. There’s two different models here and you will find these in the databases, and in fact one of the databases that just added this recently was MariaDB in their version 10.X; I believe they’re on 10.1 or 10.2 now. And I actually did some benchmarking on this encryption, and in order to get this encryption, I only experienced about an 8 percent decrease in throughput or speed. In a benchmarking test, the encryption did not cause that much and so it’s a very useful feature.
Now, we have mentioned earlier about flash memory and SSDs and things like that. One of the features you have in Oracle and SQL Server that a lot of people don’t realize is you can take a flash or SSD that’s on your database server and you can say to the database, “Use this as if they were memory. Treat the RAM as preferential, but pretend like this is slow memory and use that as an extended cache.” Now in SQL Server 2014 this came out and was called “Buffer Pool Extension,” it’s free. In Oracle, it came out in 11g R2 and it was called “Database Flash Cache” and it was also free there.
My advice, though, is to test drive this feature carefully. Every time you make the cache bigger when you go to do a lookup, it takes longer. If you put a three-terabyte flash card and say to the database, “Add that to your memory,” you actually might find that something slowed down because of the time to look in and see is it in flash, is it a dirty or clean? There is a point of diminishing return. My advice is again test drive this, see what works for you, but again, it’s in your database and in case of Oracle’s, in both SQL Server and Oracle, it’s been there for a couple of years now.
And then that brings us to the granddaddy which was the in-memory databases and that’s because the memory prices have dropped. The other reason that you probably would think that this has occurred is a lot of the analytics is requiring having the data be very quickly accessible, and so it needs to be in-memory. Do note that the algorithms that the databases use to access this data, to compress it, to encrypt it, to store it, you know in some cases some databases may continue to store in-memory as a row.
In some cases, some databases may break this into a column-oriented format and the reason they do that is they get a much higher compression level, somewhere around 11 to 12X, by storing it in column order versus row order. This first showed up in SQL Server 2014, it was called “Hekaton.” It’s been radically increased in SQL Server 2016, you’ll see it referenced by some different names, and it came out in Oracle 12c; I say the second release here, not R2. There were two different releases of Oracle 12c, the 12.1.0.1 and the 12.1.0.2. It’s the second release of the R1 version of the database.
And the way that you define an in-memory object is similar in both databases. Here you can see on the right top corner, I’m creating a table in SQL Server and you can see it says with memory optimized and durability being schema only. I’m not going to go over all these syntax meanings, and in Oracle it’s actually even simpler, you just alter a table and say in-memory or not and you can change that. I can say today it’s in-memory and tomorrow it’s not and so it’s very flexible.
I did some tests on Oracle with in-memory tables, I had some tests that took almost 40 minutes to run, up there on the top row. Now what’s important is by the time I got to the bottom two rows, I had increased the runtime or decreased it, I should say, to five minutes approximately, and when I looked at the compression factor, the data in-memory was actually 3.6 to 4.6 times smaller. That’s important because in this case I was using the column-oriented format and its compression. And so guess what? I actually was fitting almost four to five times as much data in my memory. Not only was I getting the advantage of in-memory, the advantage of column oriented, but also the advantage of far more data – up to five times as much data in the memory cache, so this is a pretty powerful technique. Again Oracle and SQL Server, you want to look at these, they’re really cool features. And with that, I think I’ll open it up to questions.
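Why column order tends to compress better than row order can be shown with a toy Python example. Run-length encoding is used here only as a stand-in; the actual compression schemes in Oracle and SQL Server are different and are not described in this talk, and the sample data and the helper `run_length_encode` are made up for the illustration.

```python
from itertools import groupby


def run_length_encode(values):
    # Collapse each run of identical neighbours into (value, run length).
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical table with two low-cardinality columns: state and status.
rows = ([("TX", "open")] * 4) + ([("TX", "closed")] * 2) + ([("CA", "closed")] * 2)

# Row order interleaves the columns, so neighbouring values rarely repeat.
row_order = [field for row in rows for field in row]

# Column order keeps each column's values together, so runs are long.
columns = list(zip(*rows))

row_runs = len(run_length_encode(row_order))
column_runs = sum(len(run_length_encode(column)) for column in columns)
```

On this sample the row layout needs 16 runs for 16 values, while the column layout needs only 4, which is the intuition behind fitting several times more data into the same memory.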
Eric Kavanagh: Well Bert, first of all you’ve been very selfless in all this wonderful education. Could you talk just for a minute about what you guys do? Because you’ve got some enabling technology that can facilitate what you’ve been talking about. Just talk for a minute about what you guys do and then let’s get Dez and Robin down in the equation here.
Bert Scalzo: Yeah, I work for a company called IDERA. We’re in Texas, we’re headquartered in Houston, and I’m actually sitting in Austin right now but I’m based in Dallas. We make database tools and we make database tools to help you solve problems. That problem could be something as simple as productivity in which case we have a tool called DBArtisan that lets you do your database administrative tasks and it’s one tool to let you manage 12 different database platforms. I can manage SQL Server, I can manage Oracle, I can manage MySQL, DB2, Postgres, and I’m using one tool, one executable, one GUI design and one consistent set of workflows. We also make tools to do compliance, we have a tool called SQL Compliance Manager to help you meet your compliance needs. Another tool called SQL Security, so we try to make the tools that will help you be effective and efficient, and what’s really nice if you go to our website, we have a whole bunch of freeware out there, so if nothing else, go download – I think we’ve got like 20 or 25 freeware tools. There’s some really good freeware stuff out there like there’s a SQL Server and a Windows Health Check that will just basically look at what you’ve got and tell you whether you’ve got issues or things and it’s totally free.
Eric Kavanagh: And you really kind of—
Bert Scalzo: Definitely the first stuff—
Eric Kavanagh: You’re speaking to the heterogeneity in the marketplace today. There used to be kind of a one-size-fits-all equation; in fact I remember interviewing Dr. Michael Stonebraker way back when in 2005, as he went on a big push talking about Vertica and the column-oriented database movement, and he was talking all about how the one-size-fits-all relational model dominated for many years, and he was predicting that would all change, and boy was he right about that. Now we have this really diverse and interesting environment with lots of different options and opportunities, but you do need somebody to manage all of that and it seems to me that your company is focused pretty acutely on solving that problem, thus being an enabler of the heterogeneity, right?
Bert Scalzo: Absolutely. I mean there’s always going to be DBAs who say, “I don’t want to use a GUI tool, I do everything with scripts,” you know? They think they’re the superman type of DBA and that’s fine but for most of us people, we want to just get work done and – you know, I use Microsoft Word to write my documents. I use Microsoft Outlook to do my email. I mean, I have tools for doing tasks. We’re building the same kind of concept, we’re building tools for database administrators and developers to help them focus on what they want to do and not how they have to do it.
Eric Kavanagh: That makes sense, but let me turn you over to our experts, and folks feel free to dive in. We’ve got a couple of comments coming in from the audience. Maybe, Dez, a couple of questions and Robin a couple of questions?
Dez Blanchfield: Sure. One of the first questions that I want to throw at you, given the enormous span of experience you got, do you see a point in time soon when any of this is going to slow down? Or do you think we’re really just at the entry point of this continual growth line of change? I think one of the greatest issues that companies are facing, and then invariably the people trying to support the technology being provided those companies to run their businesses, is that the rate of change is so dramatic that they just can’t keep up with all the different features, and software, and systems, and frameworks, and architectures, and new code coming up, and then the hardware underneath that, do you see the current rate of change slowing down at all immediately? I mean, you deal with such a wide range of platforms with the entire IDERA suite, are we going to slow down soon or are we sort of on this crazy runaway freight train for a long time yet?
Bert Scalzo: I think we’re at the first 20 percent of that growth curve and we’ve got a long way to go and there are two things pushing it. The technology keeps evolving. You have mentioned some of the new memory types that are going to be coming out, that’s going to be fantastic. Samsung’s going to have a 20-terabyte flash drive here real soon. That’s going to change things. We’ve got all these NoSQL and cloud databases, this is just going to keep going. The one thing that’s kind of funny, though, is when I look at databases like Oracle and SQL Server and some of the others, they’re really not relational databases anymore. I can put unstructured data into Oracle and yet maintain ACID compliance. If you’d told me that 20 years ago, I’d have said you were on drugs.
Dez Blanchfield: Yes, yes, they’re cool. Well even now those engines that have got quite nice niche verticals like GIS, just better than native capability now. You made some great comments about the challenges that DBAs are facing and the different types of DBAs that we hope to see around the place, but what’s the world looking like with the sort of that layer of the business that you’re dealing with? I mean, these are the people that use the different platforms from your diagnostic manager, to the inventory tools, and all the way down to the bellowing to the defragging, how are DBAs coping with this change and how do they sort of – you know, what are they doing with your tools to kind of deal with this significant shift in their landscape?
Bert Scalzo: Well, I’m going to go back almost 20 years, and I’m going to say that DBAs served a very specific role in an organization. They typically worked with one database platform, maybe two, and they managed a relatively small number of databases. Now fast forward to today and the database administrator, he’s actually going to know 10 database platforms. He’s managing, and this is no joke, in some cases thousands of databases; that’s more on the SQL Server world or the MySQL world. But still in the Oracle world they could be managing hundreds of databases. And so they’ve got all these new features coming out, they’ve got all these new platforms, and they’ve got all these databases they’re responsible for. They’re looking for tools to enable their productivity and also to help them learn some things.
And I’ll give you an example – if I want to partition a table it’s a pretty obscure syntax, and if I want to sub-partition it, the syntax gets even more difficult. I know what I want to do, I want to create buckets. If I’ve got a tool like DBArtisan that says, “Hey, here’s a nice screen that lets you concentrate on what you’re trying to do rather than how you’re trying to do it, and oh by the way, push the Show SQL button when you’re done and we’ll show you what the SQL was so you can start to really learn and master this.”
DBAs are finding tools that help them get the job done but also help teach them all this new stuff that they’re using and the same would be true – let’s say I’m an Oracle guy and I go over to MySQL and say, “Okay, create a database, DBArtisan. Now show me the SQL because I wonder what it is like to create a database on MySQL and I just learned the syntax.” And so we’re not only helping them to work across databases, we’re also educating them across databases.
Dez Blanchfield: It gets even more interesting when you get out to some of the more modern – or not more modern, that’s not a fair thing to say – but once upon a time a database is a database. These days I see everything you’re talking about there with the added challenge that the technology stacks that we traditionally see from vendors and you sort of open source into it and also that they’re good. Not just deal with the database engines and the query languages, but they also deal with the data types, the structured and unstructured, you know, the challenge of having to deal with everything from the far end of the spectrum of a multi-petabyte HDFS environment to little tiny [inaudible] containers, and packet files and various log file formats.
And I think that that’s something now we’re seeing where just no human being, no matter how much of a superman, superwoman, whatever they might think they are, they physically, they just can’t mentally deal with that rate of change and the scale of variations. I think the suite of tools you’re offering now are going to get to a point where they’ll almost be on a default set of [inaudible] in many ways so that we can’t run the database environments we got without them because we just physically can’t throw that many bodies at them. I really enjoyed your presentation. I’m going to pass to Dr. Robin Bloor, I’m sure he’s got plenty of questions to throw at you as well.
Robin Bloor: Okay. Well I certainly have questions. Bert, I don’t know where you’re going – I had a really interesting conversation a couple of days ago where someone started telling me about the latest EU data protection, and it seemed to me from what they were saying that it was incredibly draconian in terms of things they insisted on. I wondered if you’d actually looked at that; is it something you’re familiar with?
Bert Scalzo: Absolutely. Yeah.
Robin Bloor: 2016, Okay, tell us about it.
Bert Scalzo: And I’ve actually—
Robin Bloor: Deeply interesting.
Bert Scalzo: I actually worked for a while for a flash vendor, in their database area helping them build flash products for databases, and I can tell you that the draconian goes all the way down. What I mean is, if you remember my one slide, I said in some databases it will do the encryption but it puts it into the server memory and in some databases the encryption – it’s still encrypted in the server memory, it only gets decrypted when it gets sent to the client. Well what you’ll also find is some of these government standards, especially Department of Defense or military here in the U.S., they also go all the way down to the flash level and they want to know not only that you support encryption and decryption in your hardware, but that if someone stole the chips that – you know, pulled them out of the thing, out of your server, that what’s there is encrypted and so even though they’ve got the storage it can’t be read, and they would go all the way down to the actual – not to the flash part itself but down to the individual chips. They wanted to know that chip by chip, everything was encrypted.
Robin Bloor: Wow. I mean there are a lot of things that – you know, I think it was only one or two slides that you’ve brought up about this, but it was something, a scenario that I think is really interesting. The redacting of information for instance, it’s got to be a little bit cleverer than just masking off various fields because especially with machine learning nowadays, you can do deductive things that allow you to surface information that you couldn’t previously surface.
If you’re trying to protect, let’s say health information, then there are very, very draconian rules in the U.S. with regards to health information, but you can actually, using various machine learning techniques, you can often work out whose medical information it actually is. I just wondered if you’ve got anything to say about that because I think that’s an interesting area.
Bert Scalzo: Yeah, absolutely, and I’m just using this as example, I’m not trying to say one database is better than another, but this is a very good example for what you just asked. In Oracle, if I am not allowed to see a row of data for example, like I’m not allowed to see the John Smith medical record. In Oracle if I say, “Select that record,” I’ll be blocked or I’ll be allowed to see what I’m allowed to see and it will be redacted. And if I say, “Select count star from the table where the name equals John Smith,” I’ll get zero.
In SQL Server, it can do the redaction but it has some holes. If I say, “Select count star from the table where the name equals John Smith,” I’ll actually get back a one, so I know there’s a John Smith. One is more secure than the other. Now I expect them to fix that, they always play leap frog with each other. And again, I’m not trying to differentiate between the databases other than to show an example of – look at what we’re talking about now, something as simple as select count has to also be caught by the redaction, even though, technically speaking, there’s nothing being redacted other than the existence of the row.
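The count leak described here can be reproduced in miniature with Python's built-in `sqlite3` module. This is a toy model using two views in SQLite; it does not reproduce Oracle's or SQL Server's redaction features, and the table, view, and column names are assumptions made for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patients (name TEXT, diagnosis TEXT, restricted INTEGER);
    INSERT INTO patients VALUES ('John Smith', 'flu', 1);
    INSERT INTO patients VALUES ('Jane Doe', 'sprain', 0);

    -- Column masking only: the row is still visible, just redacted.
    CREATE VIEW masked_only AS
        SELECT name,
               CASE WHEN restricted = 1 THEN 'XXXX' ELSE diagnosis END AS diagnosis
        FROM patients;

    -- Row filtering as well: restricted rows do not exist for this user.
    CREATE VIEW row_filtered AS
        SELECT name, diagnosis FROM patients WHERE restricted = 0;
""")


def count_named(view, name):
    # `view` is one of the two fixed view names above, never user input.
    sql = f"SELECT COUNT(*) FROM {view} WHERE name = ?"
    return conn.execute(sql, (name,)).fetchone()[0]


leaky_count = count_named("masked_only", "John Smith")
safe_count = count_named("row_filtered", "John Smith")
```

Masking the column alone still lets the count reveal that a John Smith row exists (the count is 1), while filtering the row out makes the same query return 0, which is the distinction being drawn between the two behaviors.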
Robin Bloor: Yeah, right. That’s kind of interesting. I mean, another general question because I haven’t got a lot of time, is really just about the improvements. I mean you’ve been in one where I know that you’ve been showing us examples of various test results you’ve run – do you think that the traditional databases, let’s call them the dominant databases, SQL Server and Oracle, do you think that they’re going to stay ahead of the competition? Or do you think they’re actually going to get caught by one or another of various kinds of disruptions in the marketplace that really run for them? What’s your opinion?
Bert Scalzo: I have an opinion and it’s – you know, again I’m going to say it’s my opinion – Microsoft for example, in the post-Ballmer era is just impressing the living hell out of me. I mean this stretch database getting SQL Server on Linux, getting .NET over on Linux, getting PowerShell over on Linux; I don’t think that traditional database vendors are going to get left behind. I think they’ve decided, “Hey, let the new guys, the startups define something. Let them figure out what sharding is and how it should be perfected, and once they’ve done all the research and development, we know exactly what users want, now let’s add sharding to Oracle.” I think they’re just getting smart and saying, “Hey, being second or third is not bad when you’re the dominant player because then people won’t migrate off of you.”
Robin Bloor: Yeah, I mean it is a strategy that has been used. I mean IBM used to do that and the whole of the – for the whole of their product ranges and it does work reasonably well until somebody comes up with something that’s just completely off the wall that nobody’s ever thought of, but you can’t plan against that anyway.
Questions from the audience, Eric?
Eric Kavanagh: Yeah, but you’ve got time I think just for one maybe and I know that Bert has to run. There was something in here about – okay, the sharding architecture on Oracle 12c is that an indication of – or what is that an indication of in your opinion, what do you think is happening there?
Bert Scalzo: Well, Oracle is absorbing or/and offering everything that all the other database vendors are. For example, I can put unstructured data in Oracle. I don’t know how you can put unstructured data and then call it a relational database, so it doesn’t make any sense, but you can. And now Oracle is adding sharding, so Oracle is saying, “You know what? Whatever the market wants, we will make our database offer because the market wants what the market wants and we want to deliver the solution, we want them to stay with us.”
I think that you’re going to see additional items. I would not be surprised to see Hadoop-like clustering of database nodes not in an Oracle RAC, or Real Application Clusters, but basically in more of a traditional Hadoop-type clustering doing that sharding. And so I think you’ll be able to deploy a database like Oracle like you would a Hadoop, and these kind of trends are going to continue. These big database vendors, they make billions of dollars and they don’t want to lose their market, so they’re willing to adapt to anything or adopt anything.
Eric Kavanagh: Well, you know, it’s funny because I’ve followed the open-source vendors for quite some time and have wondered all that while how big of an impact it will have on traditional closed-doors technology, and for a while it sure felt like the open-source vendors were making some serious headway, and now as I look at the marketplace I see kind of what you’re saying, that the big guys have done their math, have sharpened their pencils and they figured out how they can weave a lot of that stuff into their architectures. Whether it’s IBM, or Oracle, or SAP – I was just at the SapphireNow Conference last month and Steve Lucas, who heads half of that company, bragged that SAP now incorporates in their HANA cloud platform, more open-source components than any of their competitors. If you do the math on that, it’s a pretty impressive statement and it tells me the big guys aren’t going anywhere anytime soon.
Bert Scalzo: No, I would bet my money on both. I mean if you look, Microsoft’s stock recently was at about $50 and, you know, just a few years ago it was at 25. You don’t double your stock price in a short period unless you’re doing good things and, you know, from doing everything from Windows 10 being free for the first year to all the other smart things they’re doing, this stretch database feature I think is just phenomenal. I think what’s going to happen is a lot of people are going to end up in Azure, not directly, not like they said, “Let’s migrate my database over to Azure.” It’s going to migrate over there magically because it’s going to get archived over there using this new stretch database feature and so the adoption of Azure is going to just skyrocket.
Eric Kavanagh: Well that’s one of the trends in the marketplace that even I can see, even on your Mac. As you go in your Mac to save some documents, they now – and the newer Macs just default to the cloud, right? I mean, there’s a lot of sense in that strategy and I also look at it and go, “Okay guys, you’re trying to lure me piece by piece into your cloud environment, and then someday when I want to watch some movie if my credit card is expired I’m going to be in trouble.”
Bert Scalzo: Yeah, but you do it on Facebook.
Eric Kavanagh: Yeah. That’s true.
Bert Scalzo: You put everything on Facebook.
Eric Kavanagh: Well, not quite everything.
Bert Scalzo: No, I mean—
Eric Kavanagh: Yeah, go ahead.
Bert Scalzo: These social trends are reaching into businesses. Now businesses still have a lot of other things they have to do, but they’re seeing these trends and they’re doing the same kinds of things. I don’t see either Oracle or Microsoft going away. In fact, I’m going to be buying stock on both each time there’s a dip.
Eric Kavanagh: Yes, indeed. Well folks, go to idera.com, I-D-E-R-A dot com. Like Bert said, they have a whole bunch of free stuff up there and it’s one of the new trends in the marketplace – give you some free stuff to play around with, get you hooked, and then you go buy the real stuff.
Folks, this has been another Hot Technology. Thanks for your time today, Bert, Dez of course, and Robin as well. We’ll talk to you next week, folks, lots of stuff going on. If you have any ideas, feel free to email yours truly, firstname.lastname@example.org. We’ll talk to you next time folks, take care. Bye-bye.