Eric Kavanagh: Ladies and gentlemen, hello and welcome back once again. It is four o’clock Eastern Time on a Wednesday, and these days that can mean just about one thing if you are in the world of data: it’s time once again for Hot Technologies! Yes, indeed.
My name is Eric Kavanagh, I will be your host for the show. It’s designed to figure out what’s hot, what’s happening out there, what’s the cool stuff that’s being used in the enterprise, and of course, right at the foundation of everything we do in this whole field is the database. So we are going to talk about protecting your database. The exact topic is, “Protect Your Database: High Availability for High Demand Data.” So, there’s a slide about yours truly. And, enough about me, hit me up on Twitter, @eric_kavanagh.
First, this year is hot, data is hot, big data is very hot, but it’s really still kind of on the edge. More of the cutting-edge companies are leveraging big data these days; most bread-and-butter organizations out there in the world are still using traditional data, and if your data is in high demand, then you want to ensure that it’s available, because when systems go down, when data is inaccessible, that’s when you get unhappy clients, unhappy prospects, customer churn, unhappy partners, all kinds of things. So you don’t want that.
We are going to learn from some of the best in the business today – we will hear from our own Dr. Robin Bloor, database expert of some three decades running; Dez Blanchfield, who has been doing this for about as long, but he started when he was really young; and Bert Scalzo from IDERA, who is really quite the database black belt. So don’t hold back, folks, ask questions – the part of this event that is most valuable to you is when you ask good questions and get good answers, so send them via the chat window or the Q and A component of your console.
And with that I’m going to hand it off to Robin Bloor – take it away.
Dr. Robin Bloor: OK, let me click this and see if it moves – it does. I’m not going to talk about databases particularly. Because I’m doing the first introduction presentation, I’ll talk around expected service levels and, of course, availability, which is the topic of today’s show.
And the question is, you know, “Really, what is availability? And what part does it play in the way that people run data centers nowadays?” One thing that I noticed – I noticed this actually sometime in the ‘90s – I was working on one site and users started complaining because their email was down for 15 minutes.
And it was interesting because the CTO, or whoever was in charge of IT, was at one of the few places where in those days they had actually defined the service levels, and the email being down for 15 minutes wasn’t in violation of anybody’s service level. I think it was allowed to be out for two hours, in actual fact. It wasn’t that the email couldn’t be used; it was just that you couldn’t send and receive because the server was out. And that alerted me to something I have noticed ever since: everything just speeds up, and so do the expectations of the users, and this leads you to the situation where people might have service levels in place, but often they will start complaining even when those service levels aren’t actually violated.
So, a definition of service levels – well, it can depend exactly upon what you’re talking about in terms of service levels; here we’re talking about an IT system or IT application. They are normally defined in terms of performance, availability and metrication – in other words, you can’t really define a service level unless you can measure it, so there is normally some kind of measurement involved, usually of response times for particular transactions and the availability of the systems over a particular period of time. Before about 1994–1995, it was really rare that any system was required to be available for more than normal working hours – let’s say eight in the morning till six in the evening, to give a normal span – and people built systems that way. That meant, in my mind, particularly with the database, that you could configure the database in a particular way. As the batch window started to shrink, the need to think again arose, first in some systems and then in others, and then we got the advent of service-oriented architecture, which started to create dependencies between systems that hadn’t previously depended upon each other, making everything even worse. We got the squeeze in terms of the availability of the systems.
The point I was making is that availability includes backup and recovery – it’s not just availability in the normal sense we talk about; there are a lot of different ways in which an application can fail. You can get hardware failure, you can get database failure, you can get software failure, and there are loads of different species of those things, and when one occurs you need to be able to recover, and therefore you also need to back up the systems. So there needs to be some scheme for backing up the system, and on a lot of sites nowadays you also need a disaster recovery capability in case a whole building blows up. Something worth mentioning here – and I am going to harp on about it in a minute – is that business processes have service levels too, and in actual fact it’s the service levels of the business processes that really matter to the business. IT just has to do its part, according to whatever agreement is in place.
IT service levels are normally subsidiary to business process service levels, but just as it was really quite rare 15 years ago for any organization to have well-defined service levels, it’s still quite rare for organizations to have well-defined service levels for business processes. That’s something that’s kind of happening now; it’s not something that has been going on for a long time.
This is the acceleration, and time barriers – it’s just worth mentioning time barriers. We are gradually moving into an event-processing world, and because of that we are gradually moving into a real-time world, and because of that we are gradually moving towards availability being required 24 by 7, and that’s actually tough for a lot of systems – it’s difficult to achieve. Either it’s very expensive, or in some instances you might actually have to change the systems, even move to a different database, or a different version of the database software you’re using.
Also these time barriers – and I always like to mention these whenever I get a chance – these are time barriers that our applications run into. Applications might want to be as fast as possible; that’s when software speaks to software. There really isn’t any acceptable latency in some situations – you want to be as fast as you can be – and those are situations in business terms like market situations, where the person who comes in with the buy order second gets a worse price than someone who comes in first, and therefore the software speed really matters.
But you know, below that, when you’re actually dealing with – interacting with – human beings, the best response time that can really be demanded of you is one tenth of a second, because that’s about a human being’s response time. You don’t need to go any faster than that because a human being won’t notice anyway. Between a tenth of a second and about four seconds is a wait time that human beings will normally tolerate, but as soon as you go past about four seconds, they are off doing something else, and therefore you are really into a batch activity.
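Those human tolerance thresholds can be sketched as a tiny classifier – a hypothetical illustration of the tenth-of-a-second and four-second boundaries just mentioned, not anyone’s production code:

```python
def interaction_mode(response_seconds: float) -> str:
    """Classify a response time against rough human tolerance thresholds."""
    if response_seconds <= 0.1:
        return "instantaneous"   # below human perception; no need to be faster
    if response_seconds <= 4.0:
        return "tolerable wait"  # the user waits, but stays engaged
    return "batch"               # the user goes off and does something else

print(interaction_mode(0.05))  # a software-to-software-speed response
print(interaction_mode(2.0))   # a typical interactive transaction
print(interaction_mode(30.0))  # effectively a batch activity
```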
So you can see that there are certain time frames – day, week and month – for those things where batch behavior makes sense, and therefore you aren’t in an event-processing world, and therefore availability might actually be quite different in terms of what you need to be able to provide. But as soon as you are in the event world, then you are into something like 24/7 availability, and technology change is a factor: as technology goes faster and faster, the availability demand doesn’t ease; it just stays the way it is.
This is layers of complexity, and I don’t want to go into this in any depth. There are three things to consider here. There is the service level of the infrastructure – this is the vertical axis – then there is the service level of any given application, and then there is the business service level. Those are dependent upon each other, and they all need to be taken into consideration if you’re actually looking at creating a responsive environment where service levels are met, basically.
Then, down at the bottom here – this is represented with databases, but you can do it with anything within the system – you’ve got the nonstop configuration, which means what it says: it will never stop. You’ve got the hot standby situation where, in one way or another – there are different ways of achieving it – if a database fails, it’s switched to a hot standby with very little lag, to the point where users would probably notice, but wouldn’t notice much.
Warm standby is more like the 20-minute switchover, where everybody rings up the help desk and bitches at the help desk while the database is being switched over to a standby. Then there is the reboot situation, which can take a very long period of time. It’s worth noting that any given application or database may be in any one of these situations, depending upon what is actually going on and on what service level is required of the application.
From that, I just want to make a point about the complexity curve. Complexity derives from nodes and connections – the dependencies. In the world that we’re living in, the number of nodes and connections involved in anything just keeps on growing, so you are running into this kind of exponential curve. If you look at the way complexity is increasing and the way that time dimensions are shrinking, then you know that for availability levels, the time targets are likely to keep reducing.
And the natural evolution therefore is towards nonstop operation, which is of course – at least in my experience – the most expensive configuration you can create. In one way or another, any organization that is thinking about this really needs to think not just about what’s happening now, but about what’s going to happen in the future.
Perhaps the last point I want to make is that the management of service levels is an ongoing activity; it isn’t something where you have a project, you do it and it’s over. It isn’t, because things just keep on changing. Having said that, I will pass the ball to Dez.
Dez Blanchfield: Thank you, Robin. I love your opening slide. We just had the rerun of, I think, “Finding Nemo 2,” the movie. You had Nemo searching for availability in the form of nines, which I thought was pretty cute. Always a tough act to follow. When I think about uptime and availability and high performance, the first image that comes to mind – because I grew up in the Solomon Islands, near volcanoes and the equator – is a volcano erupting in my data center; that’s the image I always have in my mind of what could potentially happen if something goes bang. This is a picture of the lovely Mt. Etna, in the northeast corner of Sicily, right next to Catania.
My approach to this is to have a conversation with you and give you a couple of takeaways, at the same level I do on a regular basis in boardrooms with C-suites and heads of lines of business, with a view to discussing what can impact your organization in a commercial or technical sense, the types of engineering we need to be thinking about, what we take away from that, and how we then go about addressing some of the challenges we talk about with high availability and uptime, particularly around automation and platforms.
So, the question we pose initially is: what do we actually mean when we talk about database systems and database platform availability? What does it actually mean to make something available to a given level – as Robin talked about with service-level agreements – and to map out what we actually need and want?
So, the reality of today – and in fact here are a couple of key realities to my mind – is that today everything is effectively database driven. There are very few systems built today in such a way that stuff just gets stored in files or some sort of flat-file log; invariably everything is database driven. As a result, we need to start thinking about the availability of those databases, and of the different systems, applications and tools that depend on them and rely on them to deliver the services we’re looking to deliver, sell or consume – and all the infrastructure around them.
In fact, so much so that when you think about the big disruptors of late – in particular the digital natives or cloud natives, companies that have come along like Uber and Airbnb and so forth, and the slightly older PayPals and eBays of the world – the scale and size of those organizations is only possible because of modern database technology and modern cloud infrastructure. Without that, without the availability it provides, they certainly wouldn’t exist. Imagine a scenario where you could only get to eBay between 9:05 and 9:25 because it was unavailable for the rest of the day while it was trying to do a backup or something like that – it just wouldn’t have worked.
And there are other key areas when you think about our day-to-day lives – retail and banking and finance and the airlines and so forth. The big industry groups like aviation, logistics, transport and shipping; there is government as a whole, there is national security and police and so forth. All of these industries, all of these market segments, all of these bodies and groups depend on their environments being up and running.
So, with that in mind, we also have the other caveat to think about, the other takeaway I want to leave you with, and that is that our world is now what I call “always on.” We’re permanently connected – this is a theme you’ll hear on a regular basis, and I’m going to repeat and reiterate it. We now have smartphones in our hands all day, every day. We don’t turn them off, we put them next to the bed, we invariably use them as alarm clocks, we use them as cameras, we take photos and they push those photos up into the cloud.
It’s an always-on, permanently connected mentality. In fact, there is a phrase I like to use, and that is that we’re now sort of living in the Fitbit generation, where we’re measuring everything, we’re monitoring everything, and it all needs to be logged and it’s all going to go somewhere.
And there’s also another phrase I’m going to leave you with, and that is: it’s nine o’clock somewhere, all the time. It’s a 24/7/365 world we live in. The Earth constantly spins around the Sun, and at any given hour of the day, it’s nine o’clock somewhere. And that means people are getting out of bed and trying to do stuff, buy things, install things, etc.
So, what do we mean when we talk about high availability? Well, it sounds really obvious until you start to dive into the detail. The reality is, there is no silver bullet. It is quite a complex concept, as Robin related with some of the topics he mentioned, such as measuring availability and service-level agreements. We map it to questions like: is it uptime? Do we worry about things like what we call five nines, which I will go into in a minute? Do we concern ourselves with what’s in our service-level agreements? SLAs – the three-letter acronym for service-level agreements – have become more and more critical these days.
We’ve gone through this whole process from on-premises and self-hosted, to outsourced third-party data centers and outsourced managed services, and now we’re going all the way to cloud. And the reality is, when you talk about cloud, it’s really just other people’s computers. That means you’re not running the infrastructure, you’re not running the systems, and invariably you’re not running the cloud. You’re consuming infrastructure as a service or platform as a service, and it’s even more pronounced with software as a service. Take Salesforce, for example: you don’t touch any of that infrastructure, you just log into a web interface.
So, the only mechanism you have in that world of cloud and outsourced infrastructure of any form to control that is service-level agreements – that’s the only mechanism you’ve got – and if people aren’t meeting your SLAs, then they either endure penalties and a reduction in the amount of money you pay them, or you just don’t pay them.
So, this brings back to mind this whole challenge of how we manage high availability. How do we manage availability and uptime if it’s not your infrastructure – where it’s all about SLAs – versus if it is your infrastructure, or even someone else’s infrastructure, from a design point of view? Do we load balance across multiple sites? Is it a fault-tolerant design pattern?
Do you run active-active or active-standby in your architectures? Do you have multiple servers, multiple storage platforms? How do those storage platforms operate – do they replicate each other, do they mirror each other? Are you running RAID? What type of RAID are you running for redundant storage – are you running RAID at the disk level? Are you running an object storage platform that replicates across multiple drives and multiple systems? Is it N plus one for every little piece of infrastructure you’ve got? Do you add another one, and is it in the same data center or another data center? Have you built a design pattern that accounts for no single point of failure, for example?
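One way to see why those redundancy questions matter: if components fail independently, running N copies in parallel multiplies their unavailabilities together. A rough sketch – the 99 percent figure is made up purely for illustration:

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of N independent redundant copies: the system is down
    only when every copy happens to be down at the same time."""
    return 1 - (1 - component_availability) ** copies

# A single server available 99% of the time, versus an N+1 pair of them:
single = parallel_availability(0.99, 1)   # roughly two nines
pair = parallel_availability(0.99, 2)     # roughly four nines
print(single, pair)
```

The independence assumption is the catch: two servers in the same rack share power, network and location, which is exactly why the questions above about separate data centers come up.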
All of these fundamental things sound like simple concepts, but when you get into each one of them they are very, very detailed. When we talk about availability, we invariably end up talking about nines. We have all heard about these, but let’s just think for a minute about what they mean and why they’re important.
So, we talk about one nine, which is just 90 percent availability. I know that sounds very high. But when we talk 24 by 7 by 365 – if we just look at one year, for example – one nine, 90 percent of the time, allows for thirty-six and a half days of downtime a year. Let’s just round that to just over a month.
Now think of any business that we deal with every day – whether it’s online banking, eBay, PayPal or social media platforms like LinkedIn and Twitter, or just a general retailer. Let’s say I wanted to book a flight to come to the U.S. from sunny Australia in a week’s time: would I be happy if my favorite airline was down for thirty-six and a half days because their service provider said, “Look, we are up for 90 percent of the time”? Of course I wouldn’t.
As you go up this model, two nines is 99 percent. That becomes 3.65 days – roughly three and a half days of downtime a year. Is that a big deal? Well, it is if you are running a Black Friday sale special and people can only buy during those couple of days.
Three nines becomes as little as 8.7 hours a year – but even 8.7 hours could mean eight-plus consecutive, non-stop hours of downtime. In banking and finance, or in health – if it’s a hospital – that could cost lives. As you climb up, four nines is about 52 minutes, five nines is about five minutes, and six nines is basically 30 seconds a year. Six nines is extremely high, and as you go up this ladder, as you climb up this Christmas tree of nines, the more nines you add, the harder the design, the environment and the platform become. The harder it is to deliver that service, and think about the reduction in the amount of time you’ve got for things like backups, administration, patching, and maintenance windows for any form of outage – all non-trivial challenges – and it all comes down to percentages of outage, effectively.
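The arithmetic behind those nines is simple enough to sketch. This hypothetical snippet just converts a number of nines into the downtime budget per year, reproducing the figures above:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (non-leap year)

def allowed_downtime_seconds(nines: int) -> float:
    """Maximum downtime per year for a given number of nines,
    e.g. nines=3 means 99.9% availability."""
    availability = 1 - 10 ** (-nines)
    return SECONDS_PER_YEAR * (1 - availability)

for n in range(1, 7):
    secs = allowed_downtime_seconds(n)
    print(f"{n} nine(s): {secs / 86400:8.3f} days = {secs / 3600:9.3f} hours")
```

One nine comes out to 36.5 days, two nines to 3.65 days, three nines to about 8.76 hours, four nines to about 52.6 minutes, five nines to about 5.3 minutes, and six nines to about 31.5 seconds.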
The key point I would like to convey is that there is no silver bullet, as I mentioned before. When it comes to availability, there is no “one size fits all.” You may have a particular type of design pattern that suits key industries. The same challenges are faced by all banks, but some might be retail banks, some might be premium banks. Some banks might focus on trading and investment, or wealth management. Some might be purely consumer. Some might be internet-facing only, not even have tellers, and only deal with ATMs for dispensing cash. So in those scenarios, even within banking and wealth management and the financial services industry as a whole, each of them still has its own particular flavor of what it needs when it comes to availability.
So when we think about availability in plain English – the mix-up between availability and high availability – we think they are the same thing, but they are actually chalk and cheese. Availability, to put it in plain English, is a measure of the time that a server or process functions normally, generally tied to its usage. That just describes how we say whether something is available or not. When we talk about availability, we often fall into the trap of thinking, “I am providing it in an available form,” versus high availability, which is about protecting the continuity of that infrastructure.
High availability, in plain English, is a design where you implement or achieve some outcome of availability – of data in particular – where almost all of the time, 24/7, 365 days a year, that availability gets to some of those nines. Invariably it does not mean 100 percent. One hundred percent is technically not possible in the real world in any one environment. It is very difficult, with one server, an operating system, a database on it, a platform running and on top of that an application, to deliver it and expect it to run 100 percent of the time. So then we start thinking about designs. Do we have redundancies? Do we have multiple sites to replicate to? When you put it in plain English, it is interesting just how different the topics of availability versus high availability become.
I thought I would put it in a really simple graphical form just to give us an idea of what this looks like when you start climbing up the challenge of increasing availability and protecting your service uptime. In the bottom left-hand corner we have a single nine; I have laid out the five nines that we generally talk about – six nines is a little bit outrageous. At one nine in the bottom left-hand corner, with roughly 36 and a half days of outage, it is a low-cost and low-complexity environment you are trying to provide, because a number of things can fail and you can still meet your service-level agreements.
But as you go along the bottom from left to right, and you get to the point where there are more nines in the picture, you get scenarios where you begin to think about replication of systems and platforms. You’ve got to think about clustering and virtualization of various parts of infrastructure, about geolocation of those clusters and multiple data center sites, and about the type of industry and market segment you are aiming for. What type of service level do you need to meet? What service provision are you looking for? Is it real-time, card-based services in telecommunications? Is it military services? This graph goes from bottom left to top right, and as you move through that curve, cost and complexity increase. As you get more complex and more demanding environments, you are going to need more nines.
This graph, for example, does a very similar thing: it describes the story of the cost component versus the desired availability component. In the top left-hand corner we map highly available, complex systems, and the cost incurred if that availability drops, versus the benefit of having availability and zero downtime. So, for example, if we have an environment on the left-hand side where things are down, we can incur losses that are financial, and there are legal implications that can become commercial, business-strategy-level implications.
There are all kinds of – potentially, I guess, even moral – issues around having a service down. If it is the health industry, think through the cost of an outage: the impact on customers, the reduction in customer satisfaction, staff productivity, user productivity, etc. These things are all impacted when we design highly complex, highly dependent, highly risky environments where there is potential for outage and therefore loss.
On the right-hand side we try to aim for a scenario where, if we invest high cost and planning in design, we invest in intelligent implementation, we invest in providing people with skills and resources, and we have a highly regarded network, a highly regarded operational environment, and good hardware and software, we get high availability – but it comes at a high cost. So the magic spot of the swinging pendulum is the optimum position in the middle where the two cross over: slightly reduced cost and increasing availability, juggling between the levels of nines and the high availability that is continuous availability. This is an ever-present challenge for us to meet: how much money are you willing to invest to get the service level you are looking for?
We also have a topic I won’t go into in detail, but I just want you to take this away and think about it: the difference between mean time between failures in your design versus mean time to recover. In other words, are you investing in better quality infrastructure, better quality design, better quality hardware and software, and better quality skilled staff and resources to engineer things and increase the mean time between failures – the average time between breaks – as opposed to investing less in infrastructure, resources and design patterns, but with a high capability to recover? In other words, if something breaks, you’ve got lots of spares to plug in. If someone has a laptop and it dies, you’ve got a spare one; you hand it to them and in 30 seconds they log in. These are very different ends of the pole. The first approach implies you’re engineering with high cost and high investment to avoid failure; the second says, “I am going to accept that failure is going to come, so I am going to engineer around that, be prepared for failure and recover quickly.”
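The trade-off between those two approaches can be expressed with the classic steady-state formula, availability = MTBF / (MTBF + MTTR). A hypothetical sketch – the numbers here are made up purely for illustration:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Two routes to roughly the same "three nines":
# 1. Invest in quality so failures are rare (high MTBF, slow recovery).
rare_failures = availability(mtbf_hours=8760, mttr_hours=8)
# 2. Accept monthly failures but recover in 30 seconds (low MTTR).
fast_recovery = availability(mtbf_hours=720, mttr_hours=1 / 120)

print(rare_failures, fast_recovery)
```

Both strategies can land on a similar number of nines; which one is cheaper depends on whether avoiding the break or recovering from it costs your organization less.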
As I mentioned before, my availability is not your availability. So when it comes to database environments and the supporting infrastructure – running your database, protecting it and ensuring high availability – there is really no one-stop shop. Everyone has their own needs and wants. So you’ve got to ask yourself these fundamental questions that I’ll leave you with. First: what can your organization afford? I am not just talking about dollars and cents; I am talking about what, as an organization, you can afford in terms of resources, time, effort and so forth for the level of availability you provide. And second: what can your business support – the current capabilities, the current skills, the current infrastructure, the current funding you can raise? The juggle between what you can actually afford and what you can support is an interesting balance.
Also, you then have to ask yourself: what skills and technology do you have in-house? Can you outsource some of that challenge? Can you move things to the cloud? As you go further up the stack – from infrastructure as a service through platform as a service to software as a service – you are responsible for less and less of that stack. So should you invest more in platform as a service and not worry about the infrastructure piece, or should you look at a software-as-a-service offering so you don’t have to worry about the platform either?
What type of market and consumer or customer are you servicing? I mean, if you are a telecom, and someone has to pick up the phone and get a dial tone all the time, that’s a very different challenge to opening a small retail store Monday to Friday, nine to five, and closing down for an hour at lunchtime like a corner-store barber. So you’ve got to think very long and hard about how that works and what that means for your organization – what you need to be able to provide.
And then there’s the juggle between what’s on premises, what’s externally hosted and, potentially, what’s in the cloud. As I said before, that comes with time challenges as well. So we are left with that final question, which I look forward to our friends at IDERA telling us how they address: the fine juggle of matching your desired and required availability and performance with what your business needs and what your market and your consumers need.
And the reality is, it’s no mean feat. It is going to take time, effort and money across the board to think about these things. And invariably it’s investment in people and skills capability, and investment in software and tools to automate some of those processes and provide those people with the right tools and right systems to make their lives not just better, but possible – because monitoring very large-scale environments, and protecting and managing those environments, is often beyond individual human capability.
So, with that in mind, hopefully I’ve set the scene for a great conversation with our friends at IDERA about their platform and tools, and I look forward to asking some great questions at the end. And I’ll pass it on over.
Dr. Robin Bloor: Alright. Bert, I just gave you the keys, take it away.
Bert Scalzo: Thank you! Thank you, Dez and Robin. I’m going to continue on with the topic of high availability for your data, and I’m actually going to leverage a lot of what Dez just talked about – the choices, the nines, the trade-offs, the affordability. I’m going to try to put that more in the terms of a database administrator, or someone closer to the trenches: how would they look at it? How would they architect it? And what do those choices mean?
Now, I’m going to try to be database agnostic. I’m not going to draw, for example, an Oracle-specific or SQL-Server-specific solution; I’m going to draw, let’s say, a generic architecture that all the database vendors offer, something along those lines. They all call it by different names, but it’s a type of choice you have in common, and I want to look at it from both the business and technology perspectives, and at how it relates to the business requirements.
And I want to start from the most basic pseudo-high-availability solution and work through the options you have at the storage level, the virtualization level and the database level. And then I also want to introduce you to the fact that all of these choices are available in the cloud as well.
So, again, I’m going to try to stay fairly database agnostic. Now, most of the things I’m going to talk about, I know exist in Oracle, SQL Server, MySQL and PostgreSQL. There are also some third-party vendors who make tools that would give you additional architectures to consider. And, as Dez just said, no one solution is the best; it all depends. But there is one universal fact in what we are going to be looking at: the more moving parts there are, the more complex – and therefore more costly – it’s going to be.
So, we all know data is an important asset. And everybody knows that fast access to the data is always nice. But reliable access to the data is critical. And as Dez was talking about with his nines examples, can you really afford to have 36½ days of downtime? It’s critical that that data is available all the time. And so, downtime can cost a fortune, both in terms of lost revenue, but even more importantly, in lost customers, or in loss of customer goodwill. I’ll give you a good example: if a particular website where I make purchases is slow, I may try to find a new website that sells similar items at a similar cost and isn’t slow. And so, it’s not just the loss of the customer, it’s the goodwill that the customer has towards you.
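The nines math behind that 36½-day figure is easy to make concrete. A quick sketch (just illustrative arithmetic, not from the presentation) shows how each added nine shrinks the downtime budget:

```python
# Annual downtime implied by an availability percentage ("the nines").
# 90% availability leaves roughly the 36.5 days of downtime mentioned above.

def annual_downtime_days(availability_pct: float) -> float:
    """Days of downtime per year allowed by a given availability percentage."""
    return 365.0 * (1.0 - availability_pct / 100.0)

for pct in (90.0, 99.0, 99.9, 99.99, 99.999):
    minutes = annual_downtime_days(pct) * 24 * 60
    print(f"{pct}% available -> {minutes:.1f} minutes of downtime per year")
```

Ninety percent sounds respectable until you see it written as five weeks of outage; five nines leaves barely five minutes a year.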
Now, hardware is a lot cheaper these days, so therefore there’s more and more demand for high availability. And again, I’m going to lead us to the cloud when we look at that. And we have offerings from various levels: the storage vendors, the database vendors, the virtualization vendors, and now even the cloud vendors. And so, what’s really interesting with the cloud is, after I draw all these wonderful pictures of these architectures that you could build, in the cloud a lot of times it’s just some checkboxes you check. And you say, “I want replication across geographic regions.” Checkbox. “I want replication of key hardware components.” Checkbox. And so, if you understand the pictures, sometimes in the cloud it’s just checking a few boxes to build the picture that you’ve got in your mind.
Now, the key thing is, what are the business requirements for high availability? For example, do I only have to worry about failure at a single site, or do I have to have it across multiple sites? In other words, can I have one computing center and I don’t care if that one center goes offline? I’m not making a business requirement that it expands across multiple sites. It’s a business question. And it’s important to know how the business perceives the answers to that question, because that typically defines your budget.
Now, you also want to look at the level of failure protection. Could it be a power failure? Could it be a component failure, like a NIC or an HBA – a host bus adapter – going bad? Is it a hard disk that goes bad? Is it a storage cabinet failure? Is it a computer failure? Or, in some cases, is it a site failure? And site failures differ: in one case the site itself is offline; in another, a substantial portion of the site is offline, but from your perspective that’s the whole site.
And then, as Dez was talking about, what’s the expectation of the time to resume operations? That’s a business question. If the business says you’ve got to be able to resume operations within two minutes, then obviously that’s going to determine which of these pictures I’m going to show you will work, and which will not be options that you can choose.
And another question that comes up during high availability, but often people forget to ask, is, “Hey, business, if something happens while I’m in the middle of processing a transaction, what am I allowed to lose upon resumption of the system?” In other words, if I can bring the system back up in two minutes, and I can lose no more than 10 seconds of, let’s say, transactions that were in flight, is that acceptable, business? And again, that will define what the business is willing to spend for that, and then again, that may define which pictures that I’m going to show you either apply or don’t apply.
So, let’s start with the most basic pseudo-high-availability solution. This is really not high availability, but I like to start with this because it gets people thinking the right way. If I’ve got a server and a storage array, typically I will put multiple NICs, network interface cards, in that server, and bond them so that if one NIC fails, I’m still up. And I’ll do the same thing with my host bus adapters: I’ll multi-path them through different switches, so that I have multiple ways to get to my storage. And I’ve got an uninterruptible power supply, and I’ve got redundant controllers inside my storage array, and maybe I’ve done something like RAID 10 with my disks. In other words, in this picture I’ve prevented single-component failure at multiple levels. So, I am not bound by the NIC, or the HBA, or the controller, or the switch.
But if you notice, the server is in red and the storage array is in red. I still have two areas where, if they fail, I’m down: if my server goes, I’m dead; if my storage array cabinet goes, I’m dead. So, while this is not really high availability, it starts you looking at the picture and saying, “I want a picture where there is no red.” And that’s really the goal of these pictures, to get us pointed in the right direction.
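The reason duplicating components removes the red from a picture comes down to basic probability: two independent redundant parts fail only if both fail, while a chain of required parts is only as available as its weakest link. This is an illustrative model with made-up component availabilities, assuming independent failures:

```python
# Availability of redundant (parallel) vs. chained (series) components.
# The 0.99 figures are invented for illustration only.

def parallel(a1: float, a2: float) -> float:
    """Combined availability of two redundant components; fails only if both fail."""
    return 1.0 - (1.0 - a1) * (1.0 - a2)

def series(*parts: float) -> float:
    """Combined availability of components that must all work (a chain)."""
    out = 1.0
    for a in parts:
        out *= a
    return out

nic_pair = parallel(0.99, 0.99)   # bonded NICs: 0.9999
hba_pair = parallel(0.99, 0.99)   # multi-pathed HBAs: 0.9999
server = 0.99                     # still a single point of failure (the "red" box)

print(series(nic_pair, hba_pair, server))  # the lone server dominates the result
```

Bonding the NICs and multi-pathing the HBAs buys two extra nines on those components, but the whole chain is still dragged down to roughly the availability of the one unduplicated server, which is exactly why the server stays red in the picture.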
So, the first thing to note is, as a DBA, I might always want to implement the high-availability solution in the database, but it might be that it could be done as a storage solution, or as storage-level replication. In the case on the left, I’ve got storage virtualization. What’s happening is I’ve got RAID 0 within each of two different storage cabinets for my disks, but I’ve got RAID 1 across the two cabinets. In other words, I can actually now have a storage cabinet fail, and I’m not dead. So, it’s better than the prior picture – remember, in the prior picture we had both red on the server and red on the storage array – and now we’ve made a small improvement: we no longer have red at the storage level, because storage virtualization solved that problem.
Now, another way you could do it – and not all vendors provide this – is that you may be able to do storage-level replication. I’m not talking database replication, I’m actually talking about replicating your block I/O for your storage. And that can be done at the storage level. And so again, now I have on the right-hand side, another picture where I remove the red from the bottom, because I’m using storage replication.
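As a rough sketch of what that storage-level mirroring does, here is a toy model (the class and function names are invented for illustration, not from any storage product): every block write is duplicated to both cabinets, so an entire cabinet can fail without losing data:

```python
# Toy model of RAID 1 mirroring across two storage cabinets: each block
# write goes to both cabinets, and reads can be served by either survivor.

class Cabinet:
    def __init__(self):
        self.blocks = {}      # logical block address -> data
        self.online = True

    def write(self, lba, data):
        if self.online:
            self.blocks[lba] = data

mirror = [Cabinet(), Cabinet()]

def mirrored_write(lba, data):
    for cab in mirror:        # RAID 1: duplicate the block I/O to both cabinets
        cab.write(lba, data)

def read(lba):
    for cab in mirror:        # read from any surviving cabinet
        if cab.online and lba in cab.blocks:
            return cab.blocks[lba]
    raise IOError("all mirrors lost")

mirrored_write(0, b"customer row")
mirror[0].online = False      # a whole cabinet fails
print(read(0))                # b'customer row' -- still readable from the mirror
```

Storage-level block replication across sites works on the same principle; the duplication just happens over a network link instead of inside one virtualized array.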
And so, this is another picture that may or may not be available. And the person who would manage this may be your storage administrator, rather than your database administrator. I like to bring this up, because sometimes people think of, “Oh! High availability, it must be the DBA that addresses this problem.” That’s not always true; it could in this case be the storage administrator.
Now next, we can do server virtualization as a possible solution. If you remember, in the first picture I had red at the server and red at the storage array. In this case, using virtualization, I might be able to relocate, and in some cases that relocation is sort of a warm relocation, and in some cases it can actually even be a hot relocation. Some hypervisors provide the capability to move a virtual machine in flight. And some databases will accept that movement in flight readily. Now, again, not all hypervisors provide this, but this is one possible level of solution. Now the top servers are no longer red, but I still have the shared storage array, and guess what, this solution may be a joint effort between the database administrator and the virtualization administrator. Or it could even be just the virtualization administrator, depending on what level of relocation is supported by that hypervisor and that database.
If you’re wondering, “Wow, what does he mean by this relocation? Give me a specific example” – for example, in VMware you may use vMotion to move your virtual machine from one host to another and do that without downtime. Now, clearly that prior picture still had some red in it. I still had the storage as a single point of failure. And so we move up to the next solution, which is, well, let me combine the storage and the server virtualization.
Now, in this case, again, it could be the storage administrator and the virtualization administrator who are building this solution, and now look: I have a picture with no red in it. I’ve got high availability because I can relocate the virtual machine, or the running application or database, from one server to another, and I have virtualization in my storage array by having it do RAID 1 across two separate storage arrays. I’ve multi-pathed my switches and my HBAs.
So now I’ve built an HA system and I’ve done it primarily not at the database level. In other words, I’ve used other technologies to accomplish the same thing. So, this is a solution. Then we get into what’s called the shared-storage scalable cluster. It’s really not an HA solution, but again, I like to show it for the picture.
And what happens here is we have two servers running a database and it’s considered to be one database. It’s not two separate databases; it’s not like a master and a slave, or a hot and a cold, or an active and a standby. Both of those nodes work together to present one logical database. And so, what happens is, if a particular node fails, you’re still up. So, it protects you from server-level failure, and it does that basically by, sort of, sharing the node resources, if you will, but you still have the single point of failure at the bottom for the disk. And so, this is a shared-storage scalable cluster, and Oracle calls this Real Application Clusters, or RAC.
Now, another solution is to use a shared-storage failover cluster. So, on the left I’ve got an active node, on the right I’ve got a passive node, I’ve got a heartbeat in between. I’ve got a shared storage array, and this is critical; you have to have that. And basically, what happens is if the active node encounters problems, the passive node can take over. There are licensing issues to this. Some database vendors allow you to have the passive node with a reduced license for a fixed time. In other cases, you have to have complete duplicate licensing. It all depends on your database vendor. But they all support this kind of picture which is, if one node goes down, the other node can take over.
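The heartbeat mechanism in that picture can be sketched in a few lines. This is a toy model, not any vendor’s cluster manager; the class name and the timeout value are invented for illustration:

```python
import time

# Toy active/passive failover: the passive node watches a heartbeat timestamp
# and promotes itself to active when the active node goes quiet.

HEARTBEAT_TIMEOUT = 3.0  # seconds of silence before declaring the active node dead

class PassiveNode:
    def __init__(self):
        self.last_heartbeat = time.monotonic()
        self.role = "passive"

    def on_heartbeat(self):
        """Called each time a heartbeat arrives from the active node."""
        self.last_heartbeat = time.monotonic()

    def check(self, now=None):
        """Promote to active if the heartbeat has been silent too long."""
        now = time.monotonic() if now is None else now
        if self.role == "passive" and now - self.last_heartbeat > HEARTBEAT_TIMEOUT:
            self.role = "active"   # take over; in-flight transactions may be lost
        return self.role

node = PassiveNode()
print(node.check(node.last_heartbeat + 1.0))   # passive: heartbeat still fresh
print(node.check(node.last_heartbeat + 10.0))  # active: timed out, failover
```

A real cluster manager adds fencing and quorum logic on top of this so that a network partition cannot leave two nodes both believing they are active, but the promote-on-silence core is the same.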
And typically, this is one of those scenarios where, when you go from the active node to the passive node, in most databases – not all – you’re probably going to lose some of the in-flight transactions. Then we get into what the database administrator really can look at, which is database replication, and there are two different ways of doing database replication.
There’s physical replication, and what’s important is, in the middle of this picture, you can see with the green star that the replication is being done by the database but, much like the storage-level virtualization, it’s being done at the block level. So, we’re replicating the actual block I/Os from the active node to the read-only or passive node. And this is considered to be physical replication.
Now, let me go to the next slide, because it’s almost identical: it’s logical replication, and the only thing that changes in the picture is that in the middle, instead of sending over the block I/O, we’re essentially sending over the log files with the SQL commands in them. So, in other words, what we’re replicating is not the physical I/O, but the commands that cause the physical I/O.
And so, this is often called log shipping or log-based replication. Some database vendors give you this natively. Other database vendors may not offer this, but then third-party vendors offer it, and so this is a very popular HA solution and it’s considered a complete solution. But this solution is primarily the responsibility of the DBA.
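A minimal sketch of the idea behind logical replication, using SQLite in place of a real server (the log here is just an in-memory list, purely for illustration): the primary records each SQL command, and the standby replays it.

```python
import sqlite3

# Statement-based logical replication in miniature: ship commands, not blocks.
# Real log shipping moves write-ahead-log files, but the principle is the same.

primary = sqlite3.connect(":memory:")
standby = sqlite3.connect(":memory:")
log = []

def execute_on_primary(sql):
    """Run a statement on the primary and append it to the shipping log."""
    primary.execute(sql)
    log.append(sql)

def replay_on_standby():
    """Apply every shipped statement to the standby, then clear the log."""
    for sql in log:
        standby.execute(sql)
    log.clear()

execute_on_primary("CREATE TABLE orders (id INTEGER, amount REAL)")
execute_on_primary("INSERT INTO orders VALUES (1, 9.99)")
replay_on_standby()
print(standby.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 1
```

Notice that the standby ends up with equivalent data but generates its own physical I/O, which is exactly the distinction Bert draws between logical and physical replication.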
So, I’m not using virtualization in order to accomplish this. I could, but I’m not dependent on it. And I’m not using storage virtualization. Again, I could, but I’m not dependent on it. But I’m building a solution with the database being the primary driving feature. So, this is logical replication.
Now, it’s also possible to combine database and storage virtualization. I could have, at my data center – let’s say, on the left in blue – virtualization for the storage, so that I’m not bound to a particular storage array failing. But I may be doing database-level, log-based or logical replication from one data center to the other, so that the commands are executed in data center two, resulting in I/O – but not necessarily the same I/O, because I’m not sending over the block I/O, either by the storage solution or by the database; I’m shipping the logs, and therefore the SQL commands.
And so, this is a very common picture for very large organizations. And I like this picture here because if I have to set this up on premise using a database like Oracle, I can do it; it’s a fair amount of work, it’s pretty complex, there are lots of moving parts. If I do this in the cloud, I can literally just say, checkbox, I want two geographic regions; I want the regions on different continents; I want storage-level virtualization in a particular geographic region. I can even say that I want the ability to do virtualization-type relocation or high-availability definition, and again, it’s another checkbox.
And the other thing I like about the cloud: there’s often another checkbox to say, “I don’t want to deal with patching, just patch it,” you know, just work it into the workflow of everything else you do behind the scenes, keep me patched at all times. And so, while some of these pictures are getting very complex and they might be very hard to do on premise, they’re actually becoming quite easy to do in the cloud.
Now, the interesting thing is, it’s easy to check all the checkboxes, but guess what, that costs more money on a monthly basis. Because if you’re running two data centers, you know, you’ve got two data centers out in the cloud that you’re utilizing, you’re going to pay more than if you were just using one. Likewise, if you’re doing the storage level or the virtualization high availability as an additional layer, again, there may be additional costs.
So, it is interesting that while it’s hard to do on site and you may overthink it, in the cloud it’s so easy to do, you may underthink it. So, always know what the picture looks like and always know what the cost ramifications are for whatever picture it is that you’re building. Now, there are lots more combinations than what I showed here. This is not a complete or exhaustive list. There are new technologies coming at regular intervals, so who knows – I may not have shown one that’s just come up in the last three months. And high availability is a lot more common than it was ten years ago.
In fact, I would not consider it a stretch to say that for most large organizations it’s a mandatory business requirement these days. And I like to go back to this slide because I just said it’s a mandatory business requirement. And I’ve got these two tables on the right. The top one is out of the SQL Server documentation and the bottom one is out of the Oracle documentation. And these are tables to help you pick which replication method you should use.
And notice that you start with some very simple questions. How much data am I allowed to lose? And if the answer is zero, you know that you can only, in that top chart, pick the first or the fourth row. Then you ask another question: well, how long am I allowed to take for the recovery? And if someone says, well, seconds or minutes, then that makes choices for you. And then, does the failover have to be automatic, or does it require someone to do it manually? And that’s another business question. They may say that they want it automatic because they don’t want to rely on, you know, an escalation procedure and then somebody getting assigned a ticket and then solving the problem. They just want it to be fixed.
These are all business questions and it’s the same questions if I go down and do the same for Oracle. And I ask, OK, what kind of failure do I allow, what kind of duration, what can I lose, what’s the recovery procedure? These are all business choices, so if the business tells me the answers to three or four questions, my job’s real easy, I just come in here, I pick whichever of these matches the closest and then I build that. And remember, in the cloud, it may just be a few checkboxes to actually implement those.
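Those vendor tables can be mimicked with a small, vendor-neutral helper. The option names, data-loss figures and recovery times below are invented for illustration – they are not taken from the SQL Server or Oracle documentation – but the filtering logic is the same: answer the three business questions and the candidate architectures fall out.

```python
# A rough, vendor-neutral version of the "which replication method" tables.
# All thresholds are illustrative placeholders, not real product guarantees.

def candidate_solutions(max_data_loss_s: float, max_recovery_s: float,
                        automatic_failover: bool) -> list:
    options = [
        # (architecture, worst-case data loss (s), recovery time (s), auto failover)
        ("synchronous physical replication",   0,     60,    True),
        ("shared-storage failover cluster",    10,    120,   True),
        ("log shipping / logical replication", 300,   1800,  False),
        ("nightly restore from backup",        86400, 14400, False),
    ]
    return [name for name, loss, rec, auto in options
            if loss <= max_data_loss_s
            and rec <= max_recovery_s
            and (auto or not automatic_failover)]

# Bert's earlier example: resume within two minutes, lose at most
# 10 seconds of in-flight transactions, failover must be automatic.
print(candidate_solutions(10, 120, True))
```

With those three answers from the business, the helper narrows the field to two architectures, and the DBA’s job is, as Bert says, just to build the closest match.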
And with that, that brings me to the end of my material and the time to open this up for questions.
Eric Kavanagh: Alright, Dez, maybe you first and then Robin?
Dez Blanchfield: Absolutely. In fact, probably a little unfair for those not on Twitter, but I just tweeted a picture of a graph that I want to visualize in everyone’s mind, and then I wanted to throw the question to our learned friend on the call here. When I think of proprietary versus open source in this space – which is often what we talk about, sort of, proprietary databases from the likes of Oracle and Microsoft and so forth, versus open source – you end up with this challenge where in the proprietary world the independent software vendor or software developer or the company invests in the bodies to build that complexity in. And so, you end up with a scenario where you buy the software and you don’t need to invest in many people, because you’re buying the capability built in, and in open source you don’t pay for the software – or it’s low cost, let’s say – but you’ve got to invest in the bodies.
And I’m keen to get your thoughts on the juggle, particularly now that we’re moving into cloud models where you can get either/or. You can go to AWS or Azure or Rackspace, whatever, and buy as a service something that provides your database platform, or you can do it through open-source code. And given what we’ve just talked about – what’s the juggle between proprietary and open source, how do the design patterns you’re talking about take effect, and what are your general thoughts around this topic as we’re moving forward, particularly around providing availability?
Bert Scalzo: One of the large items that I run into when I’m trying to address that question: I go back to the customer and ask them about their performance requirements. And the reason I do that is, I have found – at least historically and in my own experience – that when it comes to customers who need high throughput on their replication, I’m almost always better off with the replication that’s provided by the database vendor, because it’s more inherently built in, it operates at a lower level, and sometimes it uses mechanisms that are not available to the outside world, even in an open-source solution.
And I’ll give you a good example of one case I had. I had an internet-based company who was using MySQL as their database and they were on an old version of MySQL, like Version 4.0, and the replication between their nodes was the limiting factor on how large they could scale their databases. And they were looking at buying a third-party solution, then they were looking at, “Well, maybe we can use one of the open-source solutions.” And what it really boiled down to was, all they had to do was upgrade their MySQL to Version – I think it was 5.5 we went to – because the difference between those two database versions was that in Version 4.0 MySQL replication was not threaded and in the newer version it was, and that was actually the best path for them.
Now, we looked at the other choices, but the deciding factor was performance and staying with the database vendor solution, and doing the database upgrade actually ended up being our best solution to get the highest probability of getting the performance they needed to go along with the higher availability.
Dez Blanchfield: Yeah, that mirrors my own thinking, to be honest. Just for full disclosure, and I won’t go into brands, but I’ve come from a proprietary background working for OEMs and software vendors and IOCs in general, and that’s definitely been my experience and at the same time I’m very pro-open-source and I’m a code contributor for a bunch of projects that we won’t name, but I agree with you in that if you’re a large organization – let’s say you’re a bank, or whatever you might be – invariably you don’t want to be an IT shop. You know, like, for example, if you’re a newspaper publisher or if you’re a retailer, you don’t want to be an IT shop that publishes newspapers, you want to be a newspaper shop that actually just leverages IT.
And so, investing in the proprietary capabilities where the software developers build all that capability, the load balancing, and so forth, in the tool, makes a hell of a lot more sense versus if you’re, like, a dotcom startup or something like that that can invest in human bodies. Where do you see this going?
Probably my last question before I hand over to Dr. Robin Bloor, because I know we’re running short of time. Where do you see this going from a trend point of view? So, you’re out there all the time, you’re on the bleeding edge of the stuff, are you seeing people have sat up and paid attention and woken up to the need to make this a commercial part of their day-to-day conversation back to the board room? Or are you still seeing it being very much the geek farm, the techies and the hoodies thinking about availability because it makes them wake up at four o’clock in the morning when something goes offline?
Do you think the trend is swinging now to organizations of every size, not the obvious ones like airlines and banking and finance, but just businesses in general? Do you think people have really gotten the value proposition of protecting their database environments and providing high availability, and are investing in that, or do you think we’ve still got a way to go? What’s the general sense in the market out there?
Bert Scalzo: Right now, I think there is still a gap, but it’s not a gap because the business isn’t asking for it, it’s a gap in the communication levels between the two sides of the fence. In other words, the business people are very clearly saying, “These applications require high availability and have these specific requirements when we say high availability.”
And somehow or other that message is not getting clearly across to the tech people. Or the tech people will come back and say, “Oh, well, that’s complicated and it’ll cost you more money,” and this, that or the other. I think that’s finally going to erode away because, honestly, in the cloud, for example, it’s just checking a few boxes here or there to say, “Build me this really complex technology structure.” There’s really no good reason for the technology people to come back and say to the business people, “Oh, it’s expensive,” or, “It’s hard to do,” or this or that, and the business people are starting to know that that’s the fact.
And I’ve even seen environments where, you know, their own IT people will come and say, “Oh, you can’t have what you want. It’s too expensive.” And they’ll bring in a third-party consulting firm who will then say, “No, that’s not correct. Here’s how you could do it. Here’s what it’ll cost you.” So, I think we’ve still got a little bit of time before the communication between the two sides becomes automatic.
Dez Blanchfield: Yeah, that definitely mirrors what I’ve seen here in Australia and around Asia Pacific. I’m sure it’s a global thing. And that is that a lot of the key decision makers from the boardroom down, all the heads of line of business, they’re a lot more technically savvy – they’re reading the blogs, they’re watching webinars, they’re tuned into various articles and podcasts and they’re going to events and forums and meetups, and they now know their options and they know cloud is an option.
They also know that they can bring that capability in-house, as you said, and so I think there’s this interesting challenge now, that conversation that’s got to take place, which is basically what we’ve done today, where people, kind of, start doing things internally and just run brown bag lunches and have an internal briefing on what’s our current state, what’s our ideal state, where do we need to get to? And then, sort of, get that together.
I had a private message which I’m just going to quickly touch on now. Someone asked a question, “Is it realistic that you can get 100 percent availability?” And you might be able to correct me here, but I’m going to say yes. I’ve built a platform for electronic funds transfer, an EFTPOS gateway between SWIFT banking platforms and the EFTPOS terminals. I built this in the early 2000s. It’s actually been online 100 percent of the time for 17 years. In fact, it was built prior to 2000, but it only went into production around 2000/2001.
So, those 17 years run from development to testing and then into production. In those 17 years, very low-cost, commodity, off-the-shelf PCs, running an open-source operating system but a proprietary database, have been doing active/passive swapping every 90 days, with different design patterns applied: replication of disks in each server, replication of data between multiple servers, replication across multiple data centers, and flipping from data center A doing production for 90 days and then flipping to data center B.
And as it flips, it automatically patches and updates. So, to the question I just got privately: yes, it’s possible, but with a lot of investment in that project from a design point of view. The infrastructure was actually not that expensive, but the design and the testing and the implementation were very expensive. So, we didn’t have to spend a lot of money on hardware and infrastructure, but we used very smart tools, back in the day when “cloud” wasn’t even a coinage.
So, the answer’s yes, it can be done, even more so now with cloud, as we just heard that, with the click of a button you can enable that capability. I’m going to throw that to Robin because I’m sure he’s got questions as well. But thank you very much for answering my questions and I really loved hearing your message today. Completely on board with all that because it mirrors everything I’ve been doing for the last nearly 30 years myself.
Dr. Robin Bloor: Well, OK, I shall pick it up. One of the things that fascinated me about your presentation was the number of options that are available now that weren’t available when I used to have to struggle with this stuff. I’m kind of interested in who designs these configurations nowadays. What used to happen, in the world that I’m used to, is that there would be a fairly heavy transactional system and you would be interested in high uptime, high availability. Because, you know, the transactional system, it would be expensive if it went down in any way. And you wouldn’t have all the options that you’ve just presented to me, but in one way or the other, you could find a way, via replication mostly, to create a hot standby that wouldn’t kick in unnoticeably, but it would give you a degraded service until you got back.
And I’m, kind of, looking at what you were showing me and thinking about it, not having done any of that kind of design work for 15 years: who’s doing that work now? Is this, as it was in my day, something that you did at the onset of a project, you know, to get the infrastructure running? Or is this something that is an ongoing activity within an organization? Because there are new technology choices that come along.
Bert Scalzo: In the large companies that are very efficient and effective at all of their operations, including their IT, they typically will have a centralized architecture group – or they’ll have some name for it; I’ve heard it called “the architecture group” a lot of times. And it will be their responsibility to know all these different pictures and what the pros and cons are and what the costs are. And what will happen is, a particular application team will come looking and say, “Hey, I have to meet business requirements X, Y and Z. Hey, architecture team, what are my choices?”
They will give them the answer, like, here’s the two or three that are available, and then at that point, the decision moves back to the lower level to the application team or to the business sponsor of the application. But typically, there’s a centralized group who are staying on top of this and having that information at the ready and pre-built.
Now, it’s the medium-sized companies where it’s not that formal. What will tend to happen is, you will get one or two of your senior DBAs or system administrators and they will informally be quote “the domain expert” for that kind of expertise. So, even in the medium-sized companies it happens, it just happens in a non-formalized structure.
Dr. Robin Bloor: That’s really kind of interesting. In my day, we would never be thinking of high availability except for the transactional systems. Well, nowadays, of course, you’ve got streaming systems that are subject probably to even greater demands in terms of availability. But, in the query-based, back-end, analytics, data warehouse, BI kind of environment, do you ever see requirements for high availability there?
Bert Scalzo: Yeah, and I’m glad you asked that question. I did some work for a retail firm, and their strategic decisions for the business were based in large part on the analysis they would do from the data warehouse. And, in fact, they were interviewed by Forbes Magazine, and the CEO of the company said, “Hey, our stock price grew 250 percent over the last five years, and a very large reason that’s true is because we know how to effectively leverage the data in our data warehouse.” They were so good at making business decisions that, for them, the data warehouse and being able to do those analytics, being able to make decisions on a daily basis against their operational data, was actually a production system.
And I’ll give you a good example of how important it is. With this particular retail vendor, the guy who was responsible for beer sales was, like, the third most important executive in the company, because he brought in, you know, 60, 70 percent of the revenue. And so, in order to stay competitive in that market, he had to be able to know every day, you know, what promotions he should be running. And that could be based on, you know, not just the time of the year, but weather, patterns, and other critical data that can affect the sale of something like beer.
Dr. Robin Bloor: Well I guess there’s bound to be things like that. We’re kind of out of time, I think I should hand on to Eric in case he’s got some questions from the audience. Eric?
Eric Kavanagh: Yeah, this has all been great stuff, Bert. I think you addressed all the questions that we had from the audience in your presentation. But it is fun to watch. I’m glad that you, kind of, talked about storage virtualization and how much of an impact that can have. So, this is all good stuff.
Well, folks, we do archive all these webcasts for later viewing. So, hop online to Techopedia.com to look for the webcast section. All those Hot Techs will be listed there. A big thanks to our friend Bert for his expertise. And of course, to Dez and Robin. And with that we’re going to bid you farewell, folks. Take care. We’ll talk to you next time. Bye, bye.