Eric Kavanagh: Ladies and gentlemen, hello and welcome back once again to Hot Technologies. Yes indeed! My name is Eric Kavanagh, I will be your host for another webcast today in this really fun, exciting series we’ve got as a complement to our Briefing Room series. The title is “Application Acceleration: Faster Performance for End Users.” Come on folks, who doesn’t want that? If I’m the guy out there helping your application run faster, I’m thinking I’m the guy getting beers bought for me at the bar after work. It’s got to be a pretty cool thing to walk in and speed up anyone’s application.
There’s a slide about yours truly, hit me up on Twitter @Eric_Kavanagh. I always try to follow back and I always re-tweet if you mention me, so feel free to mention me.
The whole purpose of this show is to focus on different aspects of enterprise technology and really help define certain disciplines, or certain spaces, if you will. A lot of times vendors will pick up on certain marketing terms and talk about how they do this or that or some other thing. This show was really designed to help our audience understand what a software tool needs to have in order to be a leader in its space. The format is two analysts who each go first – unlike the Briefing Room, where the vendor goes first – and each give their take on what they think is important for you to know about a particular kind of technology.
Today we’re talking about application acceleration. We’re going to hear from Dez Blanchfield and also Doctor Robin Bloor – we’re all over the world today – and then Bill Ellis is dialing in from the greater Virginia area. With that, I’m going to hand it over to our first presenter, Dr. Bloor. We tweeted the hashtag of #podcast by the way, so feel free to tweet. Take it away.
Dr. Robin Bloor: Okay, well thanks for that introduction. Application performance and service levels – this is an area I’ve done a lot of work in over the years, in the sense that I’ve actually done an awful lot of work in monitoring performance and working out, in one way or another, how to try and calculate those levels. It has to be said that we used to have an era, a while ago, where people built systems in silos. Basically, the amount of work you had to do to make a system perform reasonably well in a silo wasn’t too hard, because there were very few variables to take into consideration. As soon as we got properly networked, interactivity and service orientation came into the equation, and it got a little difficult. Performance can be one-dimensional. If you just think of an application executing a particular code path repeatedly, doing it reasonably, in a timely manner, it feels like a one-dimensional thing. As soon as you start talking about service levels, you’re actually talking about multiple things competing for computer resources. It becomes multi-dimensional very quickly. If you start to talk about business processes, business processes can be threaded together from multiple applications. If you’re talking about service-oriented architecture, then a given application can actually be accessing the capabilities of multiple applications. Then it becomes a very complicated thing.
I looked at – a long time ago, I drew this diagram. This diagram is at least 20 years old. Basically, I call it the Diagram of Everything because it’s a way to look at everything that exists in the IT environment. It’s really only four pieces: users, data, software and hardware. Of course they change over time, but you actually realize when you look at this that there is a hierarchical explosion of each one of these pieces. Hardware, yes – hardware can be a server, but a server consists of possibly multiple CPUs, networking technology and memory, and an awful lot of controllers, as it happens. If you actually look at this, it all breaks down into pieces. If you actually think about trying to orchestrate all of that – in respect of data that changes, software whose performance changes because the hardware changes, and so on and so forth – you’re actually looking at an incredibly difficult multi-variate situation. This is the complexity curve. Of course it’s the complexity curve for just about everything, but I’ve seen it drawn time and again when talking about computers. Basically, if you put nodes on one axis and the important connections on the other axis, you end up with a complexity curve. It almost doesn’t matter what the nodes and connections are – that curve will do if you want a representation of the volume growth in the telephone network.
In actual fact, when talking about nodes in the computer environment, you’re talking about individual things that care about each other. Complexity, it turns out, is a matter of variety, structure and the various constraints that you’re trying to obey. Also, the numbers. When the numbers go up, they go crazy. I had an interesting chat yesterday, I was talking to someone – I can’t mention who he was, but it doesn’t really matter – they were talking about a site that had 40,000 – that’s four-zero, 40,000 – instances of databases in the site. Just think about that – 40,000 different databases. Of course, they obviously had many, many thousands of applications as well. We are talking about a very large organization, but I can’t name it. You actually look at that, and you’re actually trying to, in one way or another, get service levels that are going to be adequate across the board for multiple users, with multiple different, if you like, expectations. It’s a complex situation, and all I’m really saying is, this stuff’s complex. The numbers always increase. The constraints are determined by business processes and business goals. You will have noticed the expectations change.
I remember as soon as Gmail, and Yahoo mail, and Hotmail, all of those mail systems came up, people started to have an expectation that their internal mail systems within the organization would match the service levels of these huge operations with vast server farms outside the organization, and IT started to be pressured to make all of that kind of thing happen. Actually, service-level agreements are one thing, but expectations are another, and they fight each other within an organization – an awkward thing. Here’s just a business perspective. In some systems, the optimal response time is one-tenth of a second – the limit of human response time. One-tenth of a second is the time it takes a cobra to bite you. If you’re standing in front of a cobra, and it decides to bite you, it’s too late – it’s going to bite you, because you can’t respond in one-tenth of a second. One-tenth of a second is also about the time it takes for the ball to leave the hand of the pitcher and reach the guy with the bat. Basically, as he sees the ball thrown, he’s got to respond at exactly that point in time. Human response is kind of an interesting thing. Software-to-software can obviously have a higher expectation.
Then you get into some situations, which I think are those market situations, where being first is where the business value is. For example, if you want to sell a particular stock in the stock market because you think it’s going down – and a lot of other people think it’s going down – you get the best price if you get to market first. There are a lot of situations – ad serving and things like that – that are very similar. You’ve got this movement in terms of service-level expectation. You’ve got one thing that’s a kind of glass ceiling for human response. Once it’s software-to-software, you don’t have this ceiling situation, and then there is no best service level. Faster than everybody else is the best.
Okay, this is, I think, my final slide, but it’s just to give you a big picture of the complexity, once you actually look at an organization’s requirements for service. Going up the left-hand side here, you’ve got system management, which is a set of software that serves into service management, which is trying to manage a service level. Above that you’ve got business performance management. Then if you look down at the bottom here, in the service management automation area, you’ve got fragmented services that evolve into standardized services, if you actually care to invest in this kind of thing, which evolve into integrated services, which evolve into optimized services. Mostly, what people have done covers only the bottom left-hand corner of this. Maybe a little bit of service management. Business performance management, very rare. Fragmented, nearly all of it. A perfect world would fill that grid. Instrumentation – I mentioned a sub-optimization problem. You can optimize parts of a system and it’s no good for the whole system. If you make the heart optimal, then your blood might circulate too fast for the rest of your organs. That’s an issue with large organizations and service levels. Clearly nothing is going to be achieved without sophisticated tools, because there are just too many variables to try and optimize.
Having said that, I’ll pass on to Dez who’ll talk about something else entirely, hopefully.
Dez Blanchfield: Thank you, Robin. Like Dr. Robin Bloor, I have spent far too many years thinking about the performance of very complex systems at very large scale. Probably not quite the same scale as Robin, but performance is a daily topic and it’s part of our DNA to want performance, to get the best out of everything. In fact, I’ve used a graphic of one of my favorite things in the world, Formula 1 car racing, where the entire planet sits still for a while and watches cars go round in circles very quickly. There’s no aspect of Formula 1 that is not specifically about getting performance. A lot of people pooh-pooh the sport because they think it’s a waste of money. It turns out the car we drive every single day – to drop the kids at soccer on the weekends and at school the other days – is derived from performance-based development and research. That’s kind of the life of Formula 1 car racing. Everyday technology, everyday science, often comes from something that has been focused purely on high performance.
The reality, though, is that our new “always on” world, which demands 100 percent uptime – as Robin mentioned earlier – with things like the introduction of webmail and other services we take for granted as continuously available, is now what we expect in our enterprise and work environments. The reality is that being up doesn’t always mean you’re meeting your service-level agreement. My take is that the need to manage application performance and availability service-level agreements has undergone a fundamental shift in the last decade. We’re not just trying to worry about the performance of one system anymore. When the world was a bit simpler, we might have had a situation where a single server running multiple services could be monitored live, and it was relatively straightforward to support. We could – and here’s my little list of the things we used to worry about when I was a system administrator, for example, many years ago – we would look around: is the service typically up and responding? Can I log into a terminal, for example? Is the operating system responding and can I type commands? Are the applications up and running? Can I see processes and memory doing things, and I/O across the network, and so forth? In the mainframe days you could hear tapes going zip-zip-zip and paper falling out of them.
Are the apps responding and can we log in and do things on them? Are the users able to connect to some of those servers? It goes on. They’re fairly fundamental checks, you know. Then a few funny ones – is the help desk green? Because if not, then everything’s running fine, and who’s going to get the donuts? Life was really simple in those days. Even in those days – and I’m talking 20–30 years ago – the complexity was still really high. We could, in a relatively straightforward fashion, manage service-level agreements and keep an eye on performance. We can’t do it by hand anymore, as Robin alluded to. The challenge is too great. The fact is, the time when a few good app, system, network and database admins could monitor and meet performance SLAs is so far gone now that I struggled last night, when I was putting my final notes together, to even think of the year when I last managed to look at a very complex stack, make sense of it, and even comprehend what was going on under the hood – and I come from a deeply technical background. I can’t imagine what it’s like facing that on a day-to-day basis now in an administrative role.
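Those old-school checks – is the port answering, is the service accepting connections – really were simple enough to sketch in a few lines. This is a minimal illustration of that era of monitoring, not any particular product; the host and port would be whatever your environment used:

```python
import socket
import time
from typing import Optional

def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Old-school liveness check: can we open a TCP connection at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def response_time_s(host: str, port: int) -> Optional[float]:
    """How long does the service take just to accept a connection?"""
    start = time.monotonic()
    if not port_is_open(host, port):
        return None  # service is down: no response time to report
    return time.monotonic() - start
```

A cron job looping over a handful of hosts with checks like these was, for a long while, a workable substitute for a monitoring platform – which is exactly the approach that no longer scales to today’s stacks.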
What happened? Well, in 1996, database-driven apps were transformed with the internet boom. A lot of us have been through that. Even if you weren’t around for the internet boom, you can easily just look around and realize that in everyday life we hook everything to the internet now. I believe we’ve got a toaster that apparently comes with the option to get on Wi-Fi, which is ridiculous, because I do not need my toaster connected to the internet. In the 2000s, particularly the early 2000s, we had to deal with this massive growth in complexity around delivering service performance in the dot-com boom. Then came another ridiculous upward spike with Web 2.0, where smartphones came about, and now applications were in our hands 24/7, in always-on mode.
It’s 2016 now, and we’re faced with another quagmire in the form of cloud and big data and mobility. These are systems that are just so large that they are often difficult to comprehend and put in plain English. Think about the fact that some of the large unicorns we talk about have tens to hundreds of petabytes of data. That’s entire floors of disk space and storage just to hold your email, images and social media. Or, in some cases, in transport and shipping logistics, or in banking – it’s where your money is, or where your post is, or where the thing you bought on eBay is. The next big wave we’re about to face is this very heavy challenge of the internet of things.
If this wasn’t bad enough, we’re about to build artificial intelligence and cognitive computing into just about everything. We talk to Siri and Google engines these days. I know Amazon’s got one of its own. Baidu have one of these devices where you can speak to it; it converts your speech to text that goes into a normal system, the database makes a query and comes back, and the process reverses. Think about the complexity that goes into that. The reality is that the complexity of today’s standard application stack is far beyond human capabilities. Think about everything that happens when you push a button on your smartphone or your tablet: you speak to it, it converts that to text, runs it all the way over the internet to a back-end system, a front end receives it, converts it to a query, runs the query through an application stack, goes through a database, hits disk, comes back out – and in the middle there’s a carrier network, there’s a local area network, there’s a data center. The complexity is mad.
We effectively refer to this as hyperscale. The complexity and speed of hyperscale is just eye-watering. Applications and databases have become so large and so complex that managing performance is in fact a science in itself. Many refer to it as rocket science. We’ve got onsite technology, we’ve got offsite technology, we’ve got a range of data center options, physical and virtual. We’ve got physical and virtual servers, we’ve got cloud, we have infrastructure as a service and platform as a service, and software as a service is now a thing we take for granted. The latter, software as a service, became scary for a while a few years ago when CFOs and parts of the organization realized that they could pick up their credit card and just buy things themselves, going around the CIO – effectively what we called “shadow IT” – and the CIOs are now trying to wind this back and wrestle control back.
In infrastructure we’ve got software-defined networking and network function virtualization, and above that we’ve now got microservices and apps as active services. When you click on a URL, there’s a bunch of business logic that sits at the end of that URL that describes what it needs to actually deliver it. It doesn’t necessarily have prebuilt logic waiting for it. We’ve got traditional databases on one side that are scaling very, very large. We’ve got the likes of Hadoop infrastructures and ecosystems at the other end of the spectrum that are just so large that, as I said, people are talking about hundreds of petabytes of data now. We’ve got complexity in mobility, as far as the devices we carry around go: laptops and phones and tablets.
We’ve got BYOD in some enclosed environments and increasingly now, since Gen Y arrived, people are bringing their own devices and we just let those devices talk to our systems through web interfaces. Either over the internet or over Wi-Fi – we have free Wi-Fi in the cafe downstairs while they’re having coffee – or our internal Wi-Fi. Machine-to-machine is ever-present now. That’s not directly part of the internet of things, but it’s related. The internet of things is a whole new game of a complexity that’s mind-boggling. Artificial intelligence – and if you think that what we’re playing with now, with all the Siri and other related devices we speak to, is complex – wait till you get to a situation where you see something called the Olli, which is a 3-D printed bus that takes about six people, can drive itself around the city, and you can speak plain English to it, and it will speak back to you. If it hits traffic, it will decide to turn left or right off the main road where there’s traffic. As it turns, and you get worried about why it’s turned left or right off the main road, it will say to you, “Don’t worry, I’m about to turn left. There’s traffic ahead and I’m going to go around it.”
Managing the performance of all the systems in there, and all the complexity, tracking where that data goes, whether it goes into the database, all the interconnects and all the relevant bits, is just mind-boggling. The reality is that managing performance and SLAs at today’s speed and scale requires tools and systems, and by default this is no longer something where you’d just think it would be nice to have a tool – it’s a prerequisite; it’s just absolutely necessary. Here’s a little example: the high-level application design diagram for OpenStack, the open-source software-defined cloud. This is just a big chunk. It’s not just servers and databases. Each little blue blob represents clusters of things – in some cases files and servers, or hundreds of databases, or tens of thousands of little pieces of application logic running. That’s the small version. It really is quite mind-boggling when you start thinking about the complexity that comes about in this. Today, even in just the big data space, I’ll put up some screenshots of just the brands. When you think about all the pieces we’ve got to manage here, we’re not just talking about one brand, necessarily. These are all brands in the big data landscape – and just the top brands, not every little small one or open-source project. You look at it and you think it’s quite a mind-boggling chart.
Let’s just have a look at a couple of verticals. Let’s take marketing, for example. Here’s a similar chart, but from the technology stacks that are available in marketing technology alone. This is the 2011 graph. Here’s the 2016 version. Just think about it – this is just the number of brands of products you can run for marketing technology. Not the complexity of the systems inside there, not the different app and web and development and network and all the other [inaudible]. Just the brands. There’s the before, five years ago, and here’s today. It’s only going to get worse. We’re at this point now where the reality is, humans simply can’t ensure all service-level agreements. We cannot dive into enough detail, fast enough, at the scale we need. Here’s an example of what a monitoring console looks like now. This is nearly twenty-odd screens glued together to form one great, big projected screen monitoring every little piece. Now it’s interesting – I won’t mention the brand, but this monitoring platform is monitoring a single application in a logistics and shipping environment. Just one app. If you think about what Robin was talking about, where organizations can have 40,000 databases in production environments, can you just visualize what 40,000 versions of this collection of screens monitoring one application would look like? It’s a very brave world we live in. As Robin said – and I will absolutely, 100 percent echo it – without the right tools, without the right support and folks at the table using those tools, application performance is a lost game to humans, and it has to be done by tools and software.
With that I will pass over to our friends in IDERA.
Eric Kavanagh: All right, Bill.
Bill Ellis: Thank you. Sharing out my screen here. I guess can somebody confirm that you can see my screen?
Dr. Robin Bloor: Yeah.
Eric Kavanagh: It looks all right.
Bill Ellis: Thank you. The one thing he referred to was, I really can’t wait for was the self-driving car. The one thing that I hadn’t heard anybody talk about is, what happens when it snows? I kind of wonder if the engineers in California realized that in other parts of the country it snows quite a bit.
Dez Blanchfield: I like that, I’m going to remember that one.
Eric Kavanagh: A typical one mile an hour.
Bill Ellis: We’re here to talk about application performance management in a complex environment. One thing I like to point out is that when a lot of people talk about performance, the natural reaction is: hey, more servers, more CPU, more memory, etc. The other side of that coin is processing efficiency. Really, those are two sides of the same coin and we’re going to take a look at both of them. The ultimate goal is to meet the service-level agreements for the business transactions. Ultimately all of this technology exists for the business. We talked about having an industry-first performance management database. The idea of that is to fit the ideal mold of performance and to manage it from the beginning of the application’s life cycle.
The topics really boil down to four pieces. One is the process of managing performance. We talked to everybody, and everybody has tools. If they don’t have tools, they have scripts or commands, but what they’re missing is context. Context is simply connecting the dots across the application stack. These applications are browser based. They are very tightly coupled from tier to tier. How the tiers interact is also vital. Then, we’re talking about the business transaction. We’re going to provide visibility not just to the technical folks, but also to the application owners and the operations managers.
I have a couple of case studies to share with you, showing how customers have put these to use. This is the very practical portion of the presentation. Let’s take a look at what typically happens. I like this diagram – it’s just an incredible collage of technologies. The number of technologies in the data center has just grown, and grown, and grown. Meanwhile, the end user doesn’t care about any of it, and is oblivious to it. They just want to exercise the transaction, have it be available, have it complete rapidly. What typically happens is that the professionals in IT are unaware that the end users even had a problem until they self-report. That kicks off a time-consuming, slow and often frustrating process. What happens is, people open up their tools, and they look at a subset of their application stack. With that subset, it becomes very difficult to answer the simplest questions. Is an end user actually having a problem? What transaction is it? Where in the application stack is the bottleneck? By spending all of this time looking tier by tier, not able to answer these questions, you end up spending a lot of time, staff, funds and energy finding out.
In order to solve this, in order to provide a better way, what Precise does is actually take the end-user transaction, capture metadata about it, and follow the transaction through the network, into the web server, into the business logic tier – and we support .NET and ABAP and PeopleCode and E-Business Suite. In multi-tier applications, ultimately all the transactions interact with the system of record. Whether it’s an inventory lookup or reporting time worked, they always interact with the database. The database becomes the foundation of business performance. The database, in turn, relies on storage. The metadata about the transactions answers who, what transaction, and where in the application stack, and then we have deep code-level visibility to show you what’s executing. This information is captured continuously and put into the performance management database – that becomes a single sheet of music for everybody to see what’s going on. There are different people and organizations that care about what’s happening: the technical experts, the application owners, ultimately the business itself. When a problem comes up, you want to be able to extract information about that transaction.
Before we get to look at the investing transaction, I want to show you how that might appear to different people in the organization. At a management tier, you might want to have an overview of multiple applications. You might want to know about the health that’s calculated by SLA compliance and availability. That health doesn’t mean everything is 100 percent working perfectly – in this case you can see the investing transaction is in the warning status. Now, a little bit deeper, maybe in the line of business, you want some additional detail about individual transactions: when they breach SLAs, transaction counts, etc. The operations team will want to be notified about that through an alert of some sort. We have performance alerts built in. We actually measure the performance in the end user’s browser. Whether it’s Internet Explorer, Chrome, Firefox, etc., we are able to detect it – and this answers the first question: is an end user having a problem?
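To make the idea of SLA compliance concrete, here is a rough sketch of the calculation – this is not Precise’s actual implementation, and the threshold and sample timings are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    name: str
    response_time_s: float  # as measured in the end user's browser

SLA_THRESHOLD_S = 2.0  # hypothetical service-level target

def sla_compliance(transactions: list) -> float:
    """Fraction of transactions that completed within the SLA target."""
    if not transactions:
        return 1.0
    ok = sum(1 for t in transactions if t.response_time_s <= SLA_THRESHOLD_S)
    return ok / len(transactions)

sample = [
    Transaction("investing", 3.4),          # breaches the SLA -> warning status
    Transaction("login", 0.8),
    Transaction("inventory lookup", 1.2),
]
print(f"SLA compliance: {sla_compliance(sample):.0%}")
```

A health rollup at the management tier would then combine a figure like this with availability, and an alert fires when a transaction such as “investing” drops below the target.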
Let’s dive in and see what else we can show about that. The people who are interested in performance would open up Precise. They’d evaluate the transactions. They’d look at the SLA column to identify transactions that were not SLA compliant. They’d be able to see the end users that were impacted, as well as what that transaction did as it flowed across the application [inaudible]. The way that you decipher these hieroglyphics is: this is the browser, the URL – the U is for URL – and that’s the entry point into the JVM. Now this particular JVM makes a web server call out to a second JVM that then executes the SQL statement. This is clearly a database issue, because this SQL statement was responsible for 72 percent of the response time. We are focused on time. Time is the currency of performance. It’s how end users experience whether things are running slowly or not, and it’s a measure of resource consumption. It’s very handy; it’s kind of the single metric that is most important for evaluating performance. When this problem is handed off to the DBA, it’s not just a database problem, it’s this SQL statement. This is the context I was talking about.
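That 72 percent figure is simply the SQL statement’s share of total response time. As a sketch of the arithmetic – the tier names and millisecond values below are invented for illustration, not real Precise output:

```python
def tier_contributions(tier_times_ms):
    """Each tier's share of total response time, as a fraction of the whole."""
    total = sum(tier_times_ms.values())
    return {tier: t / total for tier, t in tier_times_ms.items()}

# Hypothetical breakdown of one slow transaction, in milliseconds
breakdown = {
    "browser": 150.0,
    "web server": 250.0,
    "JVM": 320.0,
    "SQL statement": 1852.0,
}
shares = tier_contributions(breakdown)
# The SQL statement dominates (roughly 72% here), so the problem is handed
# to the DBA along with exactly which statement is responsible.
```

The point of the context is that the DBA receives not “the database is slow” but a single statement with its measured share of the end user’s wait.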
Now armed with this information, I’m able to go in and analyze what’s happened. I can see, first of all, that the y-axis is response time and the x-axis is time across the day. I can see there’s a database issue and there are two occurrences; I go back to that flow, pick up that SQL statement and go into the expert view, where Precise is able to show you what’s happening, what controls it, and how long that code takes to execute. In the database tier, it’s the execution plan. You’ll note that Precise picked out the real execution plan that was used at execution time, which is distinguished from the estimated plan – the estimated plan is generated ahead of execution rather than at execution time, and it may or may not reflect what the database actually did.
Now down here is a response time analysis for the SQL statement. Ninety percent of the time was spent in storage; ten percent was used in the CPU. I can see the text of the SQL statement as well as the findings report. The text of the SQL statement actually starts to reveal some coding problems. It is select star; that returns all columns from the rows that were returned. We’re returning additional columns the application may or may not need. Those columns consume space and resources to process. If you run SAP, one of the big changes – because the HANA database is columnar – is that they basically rewrote SAP to not use select star, so they could greatly reduce resource consumption. This is something that also happens a lot of the time in homegrown applications, whether Java, .NET, etc.
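The select-star anti-pattern is easy to demonstrate. Here is a generic sketch using SQLite – the table and columns are invented, not from any SAP schema – showing how naming only the needed columns avoids dragging back a large column nobody asked for:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (sku TEXT, qty INTEGER, description TEXT, image BLOB)"
)
# One row carrying a 1 KB blob nobody needs for a stock check
conn.execute(
    "INSERT INTO inventory VALUES ('A-1', 5, 'widget', zeroblob(1024))"
)

# Anti-pattern: SELECT * drags back every column, including the large blob
wide = conn.execute("SELECT * FROM inventory").fetchone()

# Better: name only what the application needs
narrow = conn.execute("SELECT sku, qty FROM inventory").fetchone()
```

Multiply that unwanted blob by millions of rows per day and the wasted I/O, memory and network transfer becomes exactly the resource consumption Bill describes; on a columnar store like HANA, select star also defeats the engine’s ability to read only the relevant columns.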
That screen shows you who, what, when, where and why. The why gets you to the SQL statement and the execution plan that allow you to solve problems. Because Precise runs continuously, you can actually measure before and after, at the SQL statement level and at the transaction level, so you can demonstrate for yourself, as well as to the application owners and to management, that you’ve solved the problem. That documentation is really helpful. There’s a lot of complexity in this application stack. In fact, everybody we’ve talked to runs at least a portion of their application stack under VMware. In this case, they’re looking at the customer service application, they’re looking at transaction time, and correlated with the slowdown is a virtualization event. Precise tracks all the virtualization events. We have a plug-in to vCenter to pick that up.
We are also able to detect contention. Contention is different than utilization. It’s actually showing when maybe a noisy neighbor is impacting your guest VM, in the context of the customer service application. Now, I’m able to drill in and get information, and I can actually see the two VMs that are contending, in this case, for CPU resources. This allows me to have visibility so that I can look at scheduling. I can put a guest VM on a different physical server. These are all things you might do in response, and in addition to that, I can actually look at the code efficiency to maybe have it use less CPU. I think I have a pretty good example in this presentation of how somebody was able to reduce CPU consumption by orders of magnitude.
That was VMware. Let’s go into the code itself, the application code. Precise will be able to show you what’s happening within Java, .NET, the ABAP code, E-Business, PeopleCode, etc. These are the entry points into, in this case, WebLogic. Down here, there’s a findings report that tells me it’s these EJBs that you need to look at, and that you’ve also got locking happening on this system. Once again, you can drill down within the business logic tier to show what’s going on. In this case, I’m looking at particular instances; I also support clustering. If you have numerous JVMs running, you can either look at the cluster as a whole, or look at bottlenecks within an individual JVM.
As you get into locking, I can get into exceptions. An exception is a little bit different than a performance problem. Typically, exceptions run very fast, because there’s a logic error and once you hit that logic error, it ends. We are able to capture a stack trace at the time of an exception. This can save a lot of time, because instead of trying to figure out what’s going on, you have the stack trace right there. We’re also able to capture memory leaks as well. The solution also includes the database tier. I can go in and evaluate the database instance. Once again, the y-axis is where the time was spent, the x-axis is time across the day. There’s a findings report that just automatically tells me what’s happening in the system and what I might look at.
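Capturing the trace at the moment the exception fires is the key idea. Here is a minimal sketch of the technique in Python – the function and SKU are invented, and a product like Precise would capture this automatically through instrumentation rather than an explicit try/except:

```python
import traceback

def lookup_inventory(sku):
    # Hypothetical logic error: fails fast on an unknown SKU
    raise KeyError(sku)

captured = None
try:
    lookup_inventory("B-2")
except KeyError:
    # Record the full stack trace right where the exception happened,
    # instead of reconstructing it later from scattered log entries.
    captured = traceback.format_exc()

# 'captured' now holds file names, line numbers and the whole call chain
```

Because the exception path completes in milliseconds, it never shows up as a slow transaction; the stack trace captured at the moment of failure is what tells you where the logic error lives.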
One of the things about Precise’s findings report is that it doesn’t just look at logs or wait states – it looks at all execution states, including CPU, as well as returning information from storage. Storage is a very important part of the application stack, especially with the advent of solid state. Having information along those lines can be very helpful. For certain storage units, we can actually drill down and show what’s happening at the individual device level. That type of information – once again, it’s deep visibility; it’s broad in scope – gives you just enough information to have more levers to pull as an application performance professional, so that you can optimize your applications on an end-to-end basis to meet those business transactions.
I have a couple of case studies I wanted to share with you. We’re cruising along pretty fast; I hope I’m going at an okay pace. Talking about storage: everybody changes hardware over time. There’s a hardware warranty. Did it really deliver what the vendor told you? You can evaluate that with Precise. What happened here is they basically put in a new storage unit, but when the storage administrators looked just at the storage unit level, they saw a lot of contention and thought there might be a problem with the new unit. Looking at it more from an end-to-end perspective, Precise could show where that was actually happening. They started from a throughput of about 400 megabytes per second, where the storage was responsible for 38 percent of response time, so that’s pretty high. With the new storage unit, we actually bumped the throughput up to six, seven hundred megabytes per second – so basically double – and we were able to cut the contribution of the storage tier to transaction time in half. I’m able to actually graph that out: the before, the cutover period, and then the after.
So once again, documentation to prove that the hardware investment was worth it and that the vendor delivered as expected. Because of the complexity, there are all kinds of things that can happen. In this case, they actually had a situation where everybody was kind of blaming the DBA, and the DBA was like, “Well, not so fast.” Here we’re actually looking at an SAP application, and I think this type of scenario is pretty common. What happened was, they were developing a custom transaction for a user. The user is like, “This is so slow.” The ABAP coder – that’s the programming language in SAP – said, “This is a database issue.” They ended up opening up Precise; they measured that end user’s transaction at 60 seconds, so well over a minute. Fifty-three seconds was spent in the back end. They drilled into the back end and were able to reveal the SQL statements, presented in descending order of resource consumption.
The top SQL statement, responsible for 25 percent of the resource consumption, has an average execution time of two milliseconds. You kind of can’t blame the database. You know, hey, not so fast, guy. The question is, why are there so many executions? Well, they bounced it back to the ABAP coder; he went in, looked into the nesting of the loop, and found out they were calling the database in the wrong place. They made the change, tested the change, and now the new response time is five seconds. A little bit slow, but they could live with that. Far better than 60 seconds. Sometimes you’re just ferreting out: is it the application code, is it the database, is it storage? Those are the areas where Precise, by having the context of the end-to-end transaction, comes into play. You can basically put an end to those arguments.
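The pattern described above – a two-millisecond query that’s fast individually but called from inside a nested loop – is the classic “N+1 query” problem, and the usual fix is to hoist the call out of the loop into one set-based statement. A minimal sketch with SQLite (the table and column names are hypothetical, not the customer’s actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (order_id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, 10.0), (1, 5.0), (2, 7.5), (3, 2.5)],
)

def totals_slow(order_ids):
    # One fast query per iteration: cheap individually, expensive in
    # aggregate when the loop runs tens of thousands of times.
    return {
        oid: con.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM items WHERE order_id = ?",
            (oid,),
        ).fetchone()[0]
        for oid in order_ids
    }

def totals_fast(order_ids):
    # One set-based query outside the loop does the same work in a
    # single round trip to the database.
    marks = ",".join("?" * len(order_ids))
    rows = con.execute(
        f"SELECT order_id, SUM(amount) FROM items "
        f"WHERE order_id IN ({marks}) GROUP BY order_id",
        order_ids,
    ).fetchall()
    result = {oid: 0 for oid in order_ids}
    result.update(dict(rows))
    return result

assert totals_slow([1, 2, 3]) == totals_fast([1, 2, 3])
```

Both functions return the same answer; the difference is the number of round trips, which is exactly what made the original transaction slow.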
I’m looking at the time; it looks like we still have a little bit of time to go through a couple more of these. I’m streaming through these. This application was under development for over a year. When they went into QA, they were seeing that the web servers were maxed out at 100 percent, and it looked like the application couldn’t run under VMware. The first thing everybody said was, “Put this on physical; it can’t run under VMware.” Precise actually offered them additional ways to solve the problem. We looked at the transactions; we saw a web server call – it comes in as an ASMX in IIS, .NET. It actually revealed the underlying code. You see this where I’m pointing? This is 23 days, 11 hours. Wow, how is that possible? Well, each invocation takes 9.4 seconds and this thing is invoked 215,000 times. For every invocation, it uses 6 seconds of CPU. This code is the reason why this thing could never scale. In fact, it couldn’t scale even on physical.
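The quoted figures multiply out to roughly the total shown on screen (the on-screen 23 days, 11 hours presumably reflects the exact measured values rather than these rounded ones):

```python
seconds_per_call = 9.4   # quoted time per invocation
invocations = 215_000    # quoted number of invocations

total_seconds = seconds_per_call * invocations        # 2,021,000 s
days, rem = divmod(total_seconds, 86_400)             # 86,400 s per day
hours = rem / 3_600

print(f"{int(days)} days, {hours:.1f} hours")
```

Two million seconds of cumulative time inside a single web method is why no amount of hardware, virtual or physical, could rescue it.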
What they did is they went back to their developers and said, “Can somebody make a change?” They kind of had a contest, and they tested out the different suggestions, and they came up with one that was able to run much more efficiently. The new one completed in a little less than two seconds, with two-hundredths of a second of CPU. Now this could scale, and it could run on the VMware farm. We were able to basically document that at both the code level and the transaction level. This is kind of the before, and then the after. Now you can see here in the stacked bar graph showing web, .NET and database that you’re interacting with the database. This is the profile you would expect to see for an application that was running more normally.
All right, I’m picking and choosing in terms of additional things I can show you. A lot of people like this one, because it bedevils many shops: you’re unable to meet a business SLA, and everybody is like, “Help us out.” This shop had a situation where the business SLA is that orders received by 3 p.m. are shipped that day. It’s really vital that they get the orders out, and the warehouse is very busy. This JD Edwards sales order screen was freezing, and you can get a very good idea that this is a just-in-time retail inventory management system. Empty shelves are unacceptable in retail. You’ve got to have the merchandise there in order to sell it. What we did is we dived in; in this case, we’re looking at the SQL Server database. The look and feel is the same whether it’s SQL Server, Oracle, DB2 or Sybase.
We identified the select from PS_PROD and were able to capture the duration and the fact that it executes so much. The dark blue matched the key that said it’s not waiting on some wait state or some logging or even storage – this thing is bound by CPU. We track this SQL statement by its ID, 34301, so every time it’s executed, we increment our counters to keep track of it. That means we have a detailed history, and I can access it by clicking that tune button. Here’s the history tab. This screen shows average duration versus changes. Wednesday, Thursday, Friday, the average duration was about two-tenths of a second. Very few screen freezes; they’re able to meet the business SLA. Come February 27th, something changes and all of a sudden, execution time is up here, and that’s actually slow enough to cause timeouts, which result in screen freezes. Precise keeps a detailed history, including the execution plan and any changes to the table’s indexes, while that SQL is in use. We were able to pick up that the access plan changed on February 27th. Monday through Friday was a bad week. Come March 5th, the access plan changed again. This is a good week. This pink star tells us the data volume was updated.
You can see here that the number of rows in the underlying tables is growing, and this is typical for a business. You want your tables to grow. The thing is, as the SQL statements come in and are parsed, the optimizer has to decide what to do: it chose one execution plan when things were fast, and another execution plan when things were slow, causing the screen freezes. On a deep technology basis, I need to know what the execution plan is, and Precise captures it for me, complete with the date and time stamp. This is the one that was fast and efficient; this is the one that was slow and inefficient. This filter join simply uses a lot more CPU to reconcile, to do this particular SQL statement. They still have the same ultimate effect, but this one basically has a slower, less efficient recipe for delivering the result set. So, we step through. Hey, do we have time for a couple more?
Eric Kavanagh: Yeah, go for it.
Bill Ellis: Okay, I’ll skip ahead. One thing I want you to take note of: we talked about hardware, we talked about SAP, we talked about .NET, we talked about JD Edwards and the Java-SQL Server environment. This is SAP; over here we’re looking at PeopleSoft. Precise’s support matrix is wide and deep. If you have an application, more than likely we can instrument it to provide this level of visibility. One of the biggest changes happening right now is mobility. PeopleSoft introduced mobility with its Fluid UI. The Fluid UI uses the system very differently. This application is evolving. The Fluid UI – what it does from a management perspective is it allows end users to use their phones, and it greatly increases productivity. If you have hundreds or thousands or even more employees, and you can increase their productivity 1–2 percent, you can have a huge impact on the payroll and everything else. What happened was, this particular shop rolled out the PeopleSoft Fluid UI. Now, talking about complexity, this is the PeopleSoft stack: one application, a minimum of six technologies, numerous end users. Where do you start?
Once again, Precise is going to be able to follow these transactions. What we’re showing you here is a stacked bar graph showing client, web server, Java, Tuxedo and database – the PeopleSoft application stack. The green maps to J2EE, which is kind of a fancy way of saying WebLogic. This is the cutover. The end users start using the Fluid UI and the response time goes from maybe one and a half, two seconds, up to around nine, ten seconds. What this one screen does not show is the number of people who got “not responding.” They actually got screen freezes in the application. Let’s take a look at some of the visibility that Precise is able to provide this customer.
First of all, when I look at the PeopleSoft transactions, we saw this type of thing across the board. All the transactions were impacted, as well as all of the locations. Incidentally, when you look at this, you can actually see locations around the world – from Asia Pacific to Europe as well as North America. The performance problem wasn’t isolated to a particular transaction or a particular geographical location; it was system wide. It’s kind of a way of saying that the Fluid UI change was global in impact. You can see here, from a scalability standpoint, people are trying to do the same amount of activity, but the response time just degraded and degraded. You can see that things are not scaling. Things are going very, very badly. Over here, when I look at the access count and the concurrent connections, you see something very interesting. Here we’re only scaling up to about 5,000, and this tops out at about 100 concurrent connections. This is after; this is before. So my real demand on the system, if this thing could scale, is in the 300,000 range. In the old days, with the classic UI, you were looking at 30 concurrent connections.
Now, what this is telling you is that the Fluid UI uses at least 10x the number of concurrent connections. We start to pull back what’s happening under the covers with PeopleSoft, so you can start to see the impact on the web servers, the fact that SLAs are starting to breach. I’m not going to go into everything, but what ends up happening is that they basically rely on messaging. The messaging basically exercises WebLogic and causes queuing within Tuxedo. There was actually a multitier dependency issue that showed up with the Fluid UI, but Precise was able to show, through a whole bunch of different things, what the problem was. It turns out that there was also a problem in the database itself. There’s actually a messaging log file, and because of all the concurrent users, that log file was locking. There were basically things to tune in every single tier within the application stack. Talk about complexity – here’s actually the Tuxedo tier showing you the queuing, and you can see the performance degrading within this tier as well. I could see the processes; I could see the domains and the servers. In Tuxedo, what you typically do is open up additional queues, domains and servers, just like at the supermarket, to relieve the congestion and minimize the queuing time. The bottom line is, Precise shows a lot of information.
As I mentioned earlier, every significant transaction interacts with the system of record. Visibility into the database is paramount. Precise shows what’s happening within the database, within WebLogic, within Java, .NET, within the browser, but the place where Precise really excels is the database tier. This happens to be the weakness of our competitors. Let me show you one of the ways Precise can help you go through this. I’m not going to spend time on the triangle of database optimization, but we’re basically looking at changes ranging from low-cost, low-risk to wide-scope, high-risk, high-cost. I’ll actually tweet out this slide afterwards if people want to take a look at it. It’s a pretty big guide, I think, for tuning problems. Here’s the Precise for Oracle expert view. Top of the findings report: 60 percent of the impact is this particular SQL statement. If you open up this screen of activity, it shows it up there. I can look at this select statement; there’s one execution plan. Every execution takes a second – 48,000 executions. That adds up to 48,000 seconds of execution time.
The dark blue, once again, is CPU. This thing is CPU bound – not a wait state, not a log. I emphasize that because some of our competitors only look at wait states and logging events, but generally speaking, CPU is the busiest execution state and offers the most buyback. Getting into this expert view – and I’m going very quickly – I looked at the table: 100,000 rows, 37,000 blocks. We’re doing a full-table scan, yet we have six indexes on this thing. What’s going on here? Well, when I look at the where clause, what it’s doing is converting a column to uppercase and saying where that’s equal to an uppercase bind variable. So every time this thing executes, Oracle has to convert this column to uppercase. Rather than do that nearly fifty thousand times, it’s a lot more efficient to build a function-based index on the uppercase of that column – and that’s available not only in Oracle Enterprise Edition but also Standard Edition. When you do that, you can then verify that the execution plan uses that new uppercase index.
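The function-based index idea has analogues outside Oracle; SQLite, for example, supports indexes on expressions, which makes the concept easy to sketch. The table and index names below are invented for illustration, not the customer’s actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

# A plain index on "name" can't help once the column is wrapped in a
# function: WHERE UPPER(name) = ? forces the engine to evaluate
# UPPER() on every row. Indexing the expression itself fixes that.
con.execute("CREATE INDEX idx_name_upper ON customers (UPPER(name))")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM customers WHERE UPPER(name) = ?",
    ("ACME",),
).fetchall()
detail = plan[0][-1]
print(detail)  # the plan mentions idx_name_upper instead of a table scan
```

Same principle as Oracle’s `CREATE INDEX ... ON table (UPPER(column))`: the database precomputes the uppercase values once, at index-maintenance time, instead of on every execution.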
Then, from a before-and-after measurement: you’re looking at a one-second execution time, aggregating up to 9 hours 54 minutes. With that same exact SQL statement, but with the uppercase index built, for 58,000 executions the response time drops to sub-milliseconds; aggregated together, it comes to seven seconds. I basically saved ten hours of CPU on my server. This is huge, because if I’m not due for a server refresh, I’m able to live on that server. I actually dropped that server’s usage by 20 percent, and you can actually see the before and the after. That’s the type of visibility Precise can provide. There are also some additional things we might look at, like: why do you have all of these indexes if they’re not being used? They can follow up on that. There’s architecture too, but I’ll wrap it up, since we’re reaching the top of the hour. I’m a true believer in this solution and we want you to be a true believer. At IDERA we believe that a trial makes a customer, so if you’re interested, we’re able to do evaluations at your site.
With that, I will pass the baton back.
Eric Kavanagh: Yeah, this has been tremendous detail you’ve shown there. It’s really quite fascinating. I think I may have mentioned to you in the past – and I know in some of the other webcasts we’ve done with IDERA, I’ve mentioned – that I’ve actually been tracking Precise since before it was acquired by IDERA, all the way back to 2008 or 2009, I think. I was fascinated by it back then. I’m curious to know how much work goes into staying on top of new releases of applications. You mentioned SAP HANA, and I think it’s pretty impressive that you can actually dig into the HANA architecture and do some troubleshooting there. How many people do you have? How much of an effort is it on your part, and how much of that can be done somewhat dynamically – meaning, when the tool gets deployed, you start crawling around and seeing different things? How much of that can be dynamically, sort of, ascertained by the tool, so that you can help people troubleshoot complex environments?
Bill Ellis: [Laughs] You asked a lot of questions there.
Eric Kavanagh: I know, sorry.
Bill Ellis: I did provide a lot of detail because for these applications, looking at the code, the devil is in the detail. You have to have that level of detail to really have something that’s actionable. Without actionable metrics, you just know about symptoms. You’re not actually solving problems. IDERA is about solving problems. Staying on top of the new releases and so forth is a big challenge. The question of what it takes to do that, that’s really for product management. I don’t have a lot of visibility into the team that basically keeps us up to date on things. In terms of HANA, that’s actually a new addition to the IDERA product line; it’s very exciting. One of the things with HANA is – let me talk about the past for a second. In the past, what SAP shops would do is replicate the database for reporting purposes. Then you’d have to have people reconcile it to what’s actually current. You’d have these different databases and they’d be out of sync by different levels. There’s just a lot of time and effort, plus the hardware, the software, and the people to maintain all that.
The idea of HANA is to have a highly parallel in-memory database and basically avoid the need for duplicate databases. You have one database, one source of truth; it’s always up to date, and that way you avoid the need to do all that reconciliation. The importance of the performance of the HANA database goes up – I’m going to say it’s 10x, or at least more valuable than the sum of all those other databases, hardware and resources. Being able to manage HANA – that component is actually in beta testing right now; it’s something that’s going to go GA soon. So that’s pretty exciting for IDERA and for us to basically support the SAP platform. I’m not sure what other parts of your question I kind of shortchanged, but –
Eric Kavanagh: No that’s all good stuff in there. I threw a whole bunch at you all at once, so sorry about that. I’m just fascinated, really, I mean this is not a very simple application, right? You’re digging deep into these tools and understanding how they’re interacting with each other and to your point, you have to kind of piece the story together in your head. You have to combine bits of information to understand what’s actually happening and what’s causing you the trouble, so you can go in there and solve those problems.
One attendee is asking: how difficult is it to implement Precise? Another person asked: who are the people – obviously DBAs – but what are some of the other roles in the organization that would use this tool?
Bill Ellis: In terms of the transactions, you don’t have to map out the transactions, because they’re tightly coupled. A URL becomes an entry point into the JVM, which then invokes a message, resulting in a JDBC call to the database. We’re able to basically catch those natural connection points and then present them to you in that transaction screen I showed you, where we also calculated how much time, or the percentage of time, was spent in each individual step. All of that is done automatically. Generally speaking, we allocate 90 minutes to basically install the Precise core, and then we start instrumenting the application. Depending upon the knowledge of the application, it may take us some additional sessions to get the entire application instrumented. Many people use just the database component of Precise. That’s fine. You can basically break it up into the components that you feel your site needs. We definitely believe that the context of having the entire application stack instrumented – so you can see that tier-to-tier dependency – actually magnifies the value of monitoring an individual tier. If anybody wants to explore instrumenting their application stack further, please go to our website – that’s the easiest way to request additional information, and we’ll discuss it a little bit further.
Eric Kavanagh: Let me throw one or two quick questions to you. I’m guessing that you are collecting and building up a repository over time, both for individual clients and as a corporate entity overall, of interactions between various applications and various databases. In other words, scenario modeling, I guess, is what I’m alluding to. Is that the case? Do you actually maintain a sort of repository of common scenarios such that you can make suggestions to end users when certain things come into play? Like this version of E-Business Suite, this version of this database, etc. – do you do much of that?
Bill Ellis: Well, that type of information is built into the findings report. The findings report says what the performance bottlenecks are, and it’s based upon execution time. Part of that findings report is “learn more” and “what do you do next.” The information and experience from customers and so forth is basically incorporated into that library of recommendations.
Eric Kavanagh: Okay, that sounds good. Well folks, fantastic presentation today. Bill, I loved how much detail you had in there. I just thought this was really fantastic, gritty, granular information, showing how all this stuff is done. At a certain point it’s almost like black magic, but really, it’s not. It’s very specific technology you guys put together to understand very, very complex environments and make people happy, because no one likes it when applications run slowly.