Rebecca Jozwiak: Ladies and gentlemen, hello and welcome to Hot Technologies of 2016! Today’s title is “Harnessing the Firehose: Getting Business Value from Streaming Analytics.” This is Rebecca Jozwiak. I am the second in command for webcast host whenever our dear Eric Kavanagh cannot be here, so it’s nice to see so many of you out there today.
This episode is a little different from our others. We kind of talked about what is hot and of course this year is hot. The last several years have been hot. There’s always new stuff coming out. Today, we’re talking about streaming analytics. Streaming analytics is kind of new itself. Of course streaming, sensor data, RFID data, those aren’t necessarily new. But in the context of data architectures, we’ve been so focused on data at rest for decades. Databases, file systems, data repositories – all for the purpose mostly of batch processing. But now with the shift to create value from streaming data, data in motion, some call it living streams, you really require a stream-based architecture, not the data-at-rest architectures that we’ve been used to, and it needs to be capable of handling fast ingestion and real-time or near real-time processing. It has to be able to cater not just for the Internet of Things but the Internet of Everything.
Of course, ideally, it would be nice to have the two architectures living side by side, one hand washing the other, so to speak. While the days-old data, weeks-old data, years-old data still of course have value, historical analytics, trend analysis, it’s the live data that’s driving the live intelligence these days and that’s why streaming analytics has become so important.
We’re talking more about that today. We have our data scientist, Dez Blanchfield, calling in from Australia. It’s early in the morning for him right now. We have our chief analyst, Dr. Robin Bloor. We are joined by Anand Venugopal, product head for StreamAnalytix at Impetus Technologies. They’re really focused on the streaming analytics aspect of this space.
With that, I’m going to go ahead and pass it to Dez.
Dez Blanchfield: Thank you. I need to grab control of the screen here and pop forward.
Rebecca Jozwiak: Here you go.
Dez Blanchfield: While we’re grabbing the slides up, let me just cover the core topic.
I’m going to keep it fairly high level and I’ll keep it to roughly 10 minutes. This is a very big topic. I participated in an event where we spent two to three days diving into the details of what stream processing is and the current frameworks that we’re developing and what doing analytics in those high-volume streams should mean.
We’re going to just clarify what we mean by streaming analytics and then delve into whether business value can be derived because that’s really what businesses are looking for. They’re looking to have folks explain to them very quickly and succinctly, where can I derive value by applying some form of analytics to our stream data?
What is streaming analytics?
Streaming analytics gives organizations a way to extract value from high-volume and high-velocity data that they have coming through the business in various forms in motion. The significant difference here is that we’ve had a long history of developing analytics, lenses and views of data which we’ve been processing at rest for decades, since the mainframe was invented. The massive paradigm shift we’ve seen in the last three to five years at what we call “web scale” is tapping into the streams of data coming into us in real time or near real time and not just processing and looking for event correlation or event triggers but performing really detailed, in-depth analytics on those streams. It’s a significant shift from what we’ve been doing before, which is collecting data, putting it into some sort of repository, traditionally big databases and now large big data frameworks such as the Hadoop platform, and performing batch-mode processing on that and getting some sort of insight.
We’ve got very good at doing that very quickly and throwing lots of heavy iron at the stuff, but we’re still really capturing data, storing it and then looking at it and getting some sort of insights or analytics on it. The shift to performing those analytics as the data are streaming in has been a very new and exciting growth area for the types of things happening around big data. It requires a completely different approach from just capturing, storing, processing and then performing analytics.
One of the key drivers for the shift and focus to performing analytics in the stream is that you can gain significant business value from getting those insights faster and more readily as the data is coming to you, as information is being made available to the business. The idea of doing end-of-day processing now is no longer relevant in certain industries. We want to be able to do the analytics on the fly. By the end of the day, we already know what has happened as it has happened rather than getting to the end of the day and doing a 24-hour batch job and getting those insights.
Streaming analytics is about tapping right into those streams of data, usually multiple streams of very high volumes of data coming at us in motion very, very quickly, and getting insights or analytics on those streams as they come to us, as opposed to allowing that data to come to rest and then performing analytics on it.
As I mentioned, we’ve had decades and decades of performing what I call batch analytics. I’ve put a really cool picture here. This is a picture of a gentleman standing in front of a mocked up computer that was created by RAND Corporation a lifetime ago and this is what they viewed a computer in a house to look like. What’s interesting is that even then, they had this concept of all these little dials and these dials represented information coming in from the house and being processed in real time and telling you what’s going on. A simple example is a set of barometric pressure and temperature that we can see where we’re seeing what’s happening in real time. But I imagine that even way back then when RAND Corporation put that little mockup together, they actually were thinking already about processing data and performing analytics on it as it’s coming in in stream format. I’m not quite sure why they put a steering wheel on the computer, but that’s pretty cool.
Since the invention of the printer, we’ve had a view of capturing data and performing batch analytics on it. As I’ve said, there is a big shift now, and we’ve seen this from the likes of the web-scale players who we all know. They are all household brands like Twitter, Facebook and LinkedIn. The interactive behavior that we have with those social platforms requires not just capturing, storing and then processing in batch mode; they actually capture and drive analytics on the fly from the streams of data coming through. When I Tweet something, not only do they need to capture and store it and do something later, but they also need to be able to put it immediately back on my stream and share it with other people that follow me. That is not a batch processing model.
As you can see on that screenshot, it says 25 right now. That’s 25 people who, at the time of that screenshot, were on that page. That’s the first real chance we had to play with a consumer-grade analytics tool. I think a lot of people really got it. They just understood the power of knowing what was going on and how they can respond to it. When we think of the scale of avionics, aircraft flying around, there’s like 18,700 domestic flights a day in the USA alone. I read a paper some time ago – it was about six or seven years ago – that the amount of data that was being produced by those aircraft was about 200 to 300 megabytes in the old engineering model. In today’s designs of aircraft, these aircraft are producing about 500 gigabytes of data or about half a terabyte of data per flight.
When you do the math very quickly off the top of your head, that 18,700 of domestic flights every 24 hours in the US airspace alone, if all the modern aircraft are producing about half a terabyte, that’s 43 to 44 petabytes of data coming through and it’s happening while the planes are in the air. It’s happening when they land and they do data dumps. That’s when they go into the shop and have a full data dump from the engineering teams to look at what’s happening in the bearings, wheels, and inside the engines. Some of that data has to be processed in real time so they can make decisions on whether there is a real issue while the plane is in the air or while it’s on the ground. You just can’t do that in batch mode. In other industries that we see out there around finance, health, manufacturing, and engineering, they’re also looking at how they can get this new insight into what’s happening in real time as opposed to what’s just being stored in the databases over time.
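For readers who want to rerun that back-of-the-envelope estimate, here is a minimal sketch. It assumes decimal units (1 PB = 1,000 TB) and that every one of the 18,700 daily flights produces the full half terabyte; the constant and function names are illustrative only. With exactly these inputs the product comes to a little over 9 PB a day, so a larger total would need inputs beyond the two figures quoted in the talk.

```python
# Back-of-the-envelope daily data volume, using only the two figures quoted above.
FLIGHTS_PER_DAY = 18_700   # US domestic flights per day, as quoted
TB_PER_FLIGHT = 0.5        # about half a terabyte per flight, as quoted
TB_PER_PB = 1_000          # assumption: decimal units


def daily_volume_pb(flights: int, tb_per_flight: float) -> float:
    """Return the total daily volume in petabytes."""
    return flights * tb_per_flight / TB_PER_PB


total_pb = daily_volume_pb(FLIGHTS_PER_DAY, TB_PER_FLIGHT)
print(f"{total_pb:.2f} PB per day")
```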
There’s also this concept of dealing with data as what I call a perishable good or a perishable commodity – that a lot of data loses value over time. This is more and more the case with mobility apps and social media tools because what people are saying and what’s trending now is what you want to respond to. When you think about other parts of our lives with logistics and shipping food around, we understand the concept of perishable commodity in that sense. But think about the data that’s going through your organization and the value it has. If somebody is doing some business with you right now and you can interact with them in real time, you don’t want to wait for an hour so that the data can be captured and put into a system like Hadoop and then press a button. You want to be able to deal with it right now, and you want to be able to do it at the client’s demand immediately. There’s a term you’ll see pop up a lot now where people talk about having this real-time data stream that can give you personalization, and that personalization tunes the system you’re using to your individual experience. So when you hit a tool like Google Search, for example, if I do a query and you do the same query, invariably we’re not getting the exact same data. We get essentially what I refer to as a celebrity experience. I’m treated as a one-off. I get my own personal version of what’s happening in these systems based on the profiles and data that they’ve collected on me, and they are able to do analytics in real time in the stream.
This idea of data being a perishable commodity is a real thing for now and the value of data being diminished over time is something that we have to deal with today. It isn’t a yesterday thing. I love this picture of a bear grabbing a salmon jumping out of the river because it really does paint exactly how I see streaming analytics. It’s this massive river of data coming at us, a firehose if you will, and the bear is sitting in the middle of the creek. It’s going to perform real-time analytics on what’s happening around it such that it can actually engineer its capability of capturing that fish in the air. It isn’t like just dipping in the stream and grabbing one. This thing is jumping in the air and the bear has to be at the right place at the right time to catch that fish. Otherwise, he doesn’t get breakfast or lunch.
An organization wants to do the same thing with their data. They want to extract value from what are now massive volumes of data in motion. They want to perform analytics on that data, and it is high-velocity data, so it isn’t just the amount of data that’s coming at us but the speed at which it’s coming at us. In security for example, it’s all your routers, switches, servers, firewalls and all the events that are coming from those, from tens of thousands if not hundreds of thousands of devices, and in some cases that is perishable data. When we think about it in the Internet of Things and the industrial Internet, we’re talking about millions if not billions of sensors eventually, and as the data comes through and we perform analytics on it, we’re now looking at doing complex event processing at orders of magnitude and speed that we’ve never even seen before, and we’re having to deal with this today. We’re having to build tools and systems around that. It’s a real challenge for organizations because on one hand, we’ve got the very big brands that are doing DIY, bake it yourself, when they’ve got the capacity to do that and the skill set and the engineering. But for the average organization, that isn’t the case. They don’t have the skill sets. They don’t have the capacity or the time or even the money to invest in figuring it out. They are all aiming toward this concept of near-real-time decision making.
Use cases that I’ve come across span every sector that you can imagine; people are sitting up and paying attention and saying, how do we apply some analytics to our stream data? We talk about web-scale online services. There’s the traditional social media platforms and online e-tailing and retailing – [inaudible – 00:14:30] apps for example. They’re all trying to give us this real-time celebrity experience. But when we get down into more of the technology stack services, telephone services, voice and video, I see people walking around doing FaceTime on phones. It’s just exploding. It boggles my mind that people hold the phone out in front of them and talk to a video stream of a friend as opposed to holding it to their ear anymore. But they know they can do it and they adapted and they liked that experience. The developers of these applications and the platforms that are delivering them are having to perform real-time analytics on that traffic and on the profiles of the traffic so they can do simple things like routing that video perfectly, so that the quality of the voice and the video that you get is adequate for a good experience. You can’t batch process that kind of data. It wouldn’t make the real-time video stream a functional service.
There’s a governance challenge in financial transactions. It isn’t okay to get to the end of the day and find out you broke the law moving private data around the place. In Australia, we have a very interesting challenge where moving privacy-related data offshore is a no-no. You can’t take my PID, my private personal identification data, offshore. There are laws in Australia to stop that from happening. Providers of financial services in particular, and certainly government services and agencies, have to be doing real-time analytics on their streams of data and interactions with me to make sure that what they’re providing to me doesn’t leave the shores. All the stuff has to stay local. They’ve got to do it in real time. They can’t break the law and ask forgiveness later. Fraud detection – it’s a pretty obvious one that we hear about with credit card transactions. But as the types of transactions we’re doing in financial services are changing very, very rapidly, there are sorts of things that PayPal is doing up front now in detecting fraud in real time, where money isn’t moving from one thing to another but it’s a financial transaction between systems. On eBay bidding platforms, detecting fraud has to be done in real time in a streaming fashion.
There’s a trend moving now to performing extract, transform and load (ETL) activity in the streams, because we don’t want to capture everything that’s going through the stream. We can’t really do that. People have learned that data lakes get broken really quickly if we capture everything. The trick now is to perform analytics on those streams and do ETL on them and just capture what you need, potentially metadata, and then drive predictive analytics where we can actually tell what’s going to happen a little bit further down the pathway, based on the analytics we performed on what we’ve just seen in the stream.
Energy and utilities providers are experiencing this massive desire from consumers to have demand pricing. I might decide that I want to buy green power at one particular time of the day because I’m just home alone and I’m not using a lot of devices. But if I have a dinner party, I might want to have all my devices on, and I don’t want to be buying cheap power and waiting for it to be delivered; I’m willing to pay more to get that power. This demand pricing, particularly in the utilities and energy space, has already happened. Uber, for example, is a classic example of something you can use every day that is all driven by demand pricing. There are some classic examples of people in Australia getting $10,000 fares because of the massive demand at New Year’s Eve. I’m sure they’ve dealt with that issue, but that is stream analytics being performed in real time, while you’re in the car, telling you how much you should pay.
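The idea of doing ETL inside the stream, keeping only what you need, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular product works: the event fields (`device`, `level`, `code`, `payload`) and the function name `stream_etl` are hypothetical, and a plain Python list stands in for a real event stream.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator


def stream_etl(events: Iterable[Dict]) -> Iterator[Dict]:
    """Filter and slim down events one at a time, without storing the raw stream."""
    for event in events:
        if event["level"] != "error":   # extract: ignore routine events
            continue
        # transform: keep only compact metadata and drop the bulky payload
        yield {"device": event["device"], "code": event["code"]}


# Hypothetical sample events standing in for a live feed.
sample = [
    {"device": "router-1", "level": "info", "code": 0, "payload": "x" * 100},
    {"device": "router-1", "level": "error", "code": 17, "payload": "y" * 100},
    {"device": "switch-9", "level": "error", "code": 4, "payload": "z" * 100},
    {"device": "router-1", "level": "error", "code": 17, "payload": "w" * 100},
]

kept = list(stream_etl(sample))                          # load: only what we need
errors_per_device = Counter(r["device"] for r in kept)   # a simple running summary
print(kept)
print(errors_per_device)
```

Because `stream_etl` is a generator, each event is examined once and then discarded, which is the point being made above: the raw firehose never has to land in storage.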
Internet of Things and sensor streams – we’ve only just scratched the surface on this and we’ve really just had the basic conversation happening on this but we will see an interesting shift in how technology deals with that because when you’re talking not just about thousands or tens of thousands but hundreds of thousands and potentially billions of devices streaming to you, almost none of the technology stacks we’ve got now are engineered to cope with that.
There are some really hot topics we’ll see around the place like security and cyber risk. They are very real challenges for us. There is a really neat tool called Norse on the web where you can sit and watch in a webpage various cyberattacks happening in real time. When you look at it, you think “oh it’s a nice cute little webpage,” but after about five minutes in there, you realize the volume of data that system is analyzing across all the different streams from all the different devices around the world that are being fed into it. It starts to boggle the mind how they’re performing that at the edge of that record essentially and providing you that simple little screen that tells you what to [inaudible – 00:18:43] or something else attacking it in real time and what types of attacks. But it’s a really neat little way to just get a good taste of what stream analytics can potentially do for you in real time by just watching this page and getting a sense of just the volume and the challenge of taking the streams, processing analytics queries on them and representing that in real time.
I think the conversation that we have for the rest of the session is going to address all of those types of things with one interesting view, from my point of view, and that is that the challenge of DIY, bake it yourself, suits some of the classic unicorns who are able to afford to build those types of things. They’ve got the billions of dollars to build these engineering teams and to build their data centers. But 99.9% of the organizations out there who want to drive value in their business from stream analytics need to get an off-the-shelf service. They need to buy a product out of the box and they generally need some consulting service and professional service to help them implement it, so that they gain that value back in the business and sell it back to the business as a working solution.
With that, I’m going to hand back to you, Rebecca, because I believe that’s what we’re about to cover in detail now.
Rebecca Jozwiak: Excellent. Thank you so much, Dez. That’s a great presentation.
Now, I will pass the ball to Robin. Take it away.
Robin Bloor: Okay. Because Dez has gone into the nitty gritty of streams processing, it didn’t seem to make sense to me to cover it again. So I’m just going to take a completely strategic view, looking almost from a very high level down on what the hell is going on and positioning it, because I think it might help people, especially those of us who haven’t been immersed in streams processing in great depth before.
Streams processing has been around for a long time. We used to call it CEP. There were real-time systems before that. The original process control systems were actually processing streams of information – of course nothing was going as fast as it is nowadays. This graphic that you see on the slide here is pointing out a lot of things actually, but above and beyond anything else it’s pointing out the fact that there is a spectrum of latencies that appear in different colors down here. What actually happened since the invention of computing, or commercial computing, that arrived right around 1960 is that everything has just got faster and faster. We used to be able to depend upon the way that this was actually coming out, if you like, in waves, because that’s what it looks like. This actually depends upon it. Because it was all driven by Moore’s law, and Moore’s law would give us a factor of about ten times speed over a period of about six years. Then once we actually got to about 2013, it all broke, and we suddenly started to accelerate at a rate that we’ve never seen, which is utterly unprecedented. We were getting a factor of about ten in terms of increase in speed and therefore a reduction in latency about every six years. In the six years since about 2010, we’ve got a multiple of at least a thousand. Three orders of magnitude rather than one.
That’s what has been going on and that’s why the industry in one way or another appears to be moving at fantastic speeds – because it is. Just going through the meaning of this particular graphic, the response times, by the way, are on a logarithmic scale down the vertical axis. Real time is computer speed, faster than human beings. Interactive times are orange. It’s when you’re interacting with the computer, and that’s where you really want one-tenth to about one second of latency. Above that, there’s transactional, where we actually think about what you’re doing in the computer, but if that goes out beyond about fifteen seconds it becomes intolerable. People actually just won’t wait for the computer. Everything above that was done in batch. A lot of things that were done in batch are now coming down right into the transactional space, right into the interactive space or even into the real-time space. Whereas previously we could do some of this with very small amounts of data, we can now do it with very large amounts of data using hugely scaled-out environments.
So basically, all this is saying is that these are really the transactional and interactive human response times. An awful lot of what’s being done with streams right now is to inform human beings about things. Some of it is going faster than that and is informing stuff as well, so it’s real time. Then we see latencies just drop like a stone, making instant analytics feasible and incidentally quite affordable. It’s not just that the speed has come down; the cost has just collapsed as well. Probably the biggest impact in all of this, amongst all of the various applications, is that you can do all these predictive analytics. I’ll tell you why in a minute.
This is just the hardware story. You’ve got parallel software. We’re talking about 2004. Scale-out architecture, multicore chips, memory increase, configurable CPU. SSDs now go so much faster than spinning disk. You can pretty much wave spinning disk goodbye. SSDs are in multiple cores as well, so again faster and faster. Soon to appear, we’ve got the memristor from HP. We’ve got 3D XPoint from Intel and Micron. The promise of those is that they will make it all go faster and faster anyway. When you’re actually thinking of two new memory technologies, both of which will make the whole of the fundamental small piece, the individual circuit board, go way faster, we haven’t even seen the end of it.
Streams technology, which is the next message really, is here to stay. There is going to have to be a new architecture. I mean Dez has kind of mentioned this in several points in his presentation. For decades we viewed architecture as a combination of data heaps and data pipes. We tended to process the heaps and we tended to pipe the data between the heaps. We’re now moving fundamentally towards what we call the Lambda data architecture that combines the processing of data flows with data heaps. When you are actually processing a stream of events coming in against historical data as a data flow or a data heap, that’s what I mean by Lambda architecture. This is in its infancy. It’s only a part of the picture. If you consider something as complex as Internet of Everything which Dez has also mentioned, you’ll actually realize that there are all sorts of data location issues – decisions as to what you should process in the stream.
The thing that I’m really saying here is that when we were processing in batch, we were actually processing streams. We just couldn’t do it one at a time. We just wait until there is a big heap of stuff and then we process it all at once. We’re moving to a situation where we actually can process stuff in the stream. If we can process stuff in the stream, then the data heaps that we hold are going to be the static data which we need to reference in order to process the data in the stream.
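The pattern described here, processing each event in the stream against static reference data held in a data heap, can be sketched as follows. This is only an illustration under stated assumptions: the reference table `HISTORICAL_AVG_SPEND`, the function `flag_unusual`, the field names and the threshold factor are all hypothetical, and a list stands in for the live stream.

```python
from typing import Dict, Iterable, Iterator

# The "data heap": static reference data, assumed to be computed earlier in batch.
HISTORICAL_AVG_SPEND = {"alice": 40.0, "bob": 250.0}


def flag_unusual(
    transactions: Iterable[Dict],
    reference: Dict[str, float],
    factor: float = 3.0,
) -> Iterator[Dict]:
    """Check each incoming transaction against the stored per-customer baseline."""
    for tx in transactions:
        baseline = reference.get(tx["customer"])
        if baseline is None:
            continue                      # no history yet, so nothing to compare with
        if tx["amount"] > factor * baseline:
            yield {**tx, "baseline": baseline}


# The "data flow": events arriving one at a time.
live = [
    {"customer": "alice", "amount": 35.0},
    {"customer": "alice", "amount": 500.0},
    {"customer": "bob", "amount": 300.0},
    {"customer": "carol", "amount": 9999.0},
]

flagged = list(flag_unusual(live, HISTORICAL_AVG_SPEND))
print(flagged)
```

The stream side does no storage of its own here; the only state it consults is the reference table, which is the division of labour between data flows and data heaps that the talk describes.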
This takes us to this particular thing. I’ve mentioned this before in some presentation with the biological analogy. The way that I’d like you to think about it is that, at the moment, we are human beings. We have three distinct networks for real-time predictive processing. They are called the somatic, autonomic and enteric. The enteric is your stomach. The autonomic nervous system looks after fight or flight. It actually looks after fast reactions to the environment. The somatic looks after the moving of the body. Those are real-time systems. The interesting thing about it – or I think it is kind of interesting – is that a lot of it is more predictive than you would ever imagine. It’s as if you’re actually looking at a screen around about 18 inches from your face. All that you can see clearly, all that your body is capable of seeing clearly, is in actual fact about an 8 × 10 rectangle. Everything outside of that is actually blurred as far as your body is concerned, but your mind is actually filling in the gaps and making it not blurry. You don’t see a blur at all. You see it clearly. Your mind is actually doing predictive modeling of the data stream in order for you to see that clarity. That’s kind of a curious thing, but you can actually look at the way the nervous system operates and the way that we manage to get around and behave reasonably – at least some of us – reasonably sanely, and not bump into things all the time.
It’s all done by a series of neural analytics scale inside here. What’s going to happen is that organizations are going to have the same kind of thing and are going to build the same kind of thing, and it is going to be the processing of streams, including the internal streams of the organization – the things that are happening within it, the things that happen outside it, the instant responses that actually have to be made – and of course feeding the human being to make decisions, to make all of these happen. That’s where we’re going, as far as I can see.
One of the things that is a consequence of that is that the number of streaming applications is going to swell. There’s going to be an awful lot more than we see now. Right now, we’re picking the low-hanging fruit of doing the things that are obvious.
So anyway that’s the conclusion here. Streaming analytics was once a niche but it is becoming mainstream and it will soon be adopted generally.
With that, I will pass it back to Rebecca.
Rebecca Jozwiak: Thank you so much, Robin. Great presentation as usual.
Anand, you’re up next. The floor is yours.
Anand Venugopal: Fantastic. Thank you.
My name is Anand Venugopal and I’m the Head of Product for StreamAnalytix. It’s a product offered by Impetus Technologies, out of Los Gatos, California.
Impetus has actually had a great history in being a big data solutions provider for large enterprises. So we’ve actually done a number of streaming analytics implementations as a services company and we learned a lot of lessons. We also took a shift to becoming a product company and solutions-driven company in the last couple of years, and StreamAnalytix is leading the charge in transforming Impetus into a largely product-driven company. There are some critical, very, very key assets that Impetus created thanks to our exposure to enterprises, and StreamAnalytix is one of them.
We’re 20 years in the business and there’s a great mix of product and services that gives us a huge advantage. And StreamAnalytix was born out of all the lessons learned from our first five or six implementations of streaming.
I will touch upon a few things, but the analysts, Dez and Robin, have done a fantastic job at covering the space overall so I’m going to skip a lot of content that overlaps. I’ll probably go fast. Besides true streaming cases, we see a lot of plain batch acceleration, where there are literally very, very important batch processes in enterprises. As you can see, this whole cycle of sensing an event and analyzing and acting on it could actually take weeks in large enterprises, and they are all trying to shrink it down to minutes and sometimes seconds and milliseconds. So anything faster than all of these batch processes is a candidate for batch acceleration, and it was very well put that the value of data dramatically diminishes with its age, so most of the value is in the initial portion, in the seconds after it just happened. Ideally, if you could predict what was going to happen, that is the highest value. That depends on accuracy, though. The next highest value is when it is right there, when it is happening, and you can analyze it and respond. Of course, the value dramatically reduces after that, into the mainly retrospective BI that we’re in.
It’s interesting. You might expect some dramatically scientific answer to why streaming analytics. With many cases, what we’re seeing is it’s because it is now possible and because everybody knows batch is old, batch is boring and batch is not cool. There’s enough education that everybody has had now on the fact that there’s streaming possible and everybody has Hadoop now. Now Hadoop distributions have a streaming technology embedded in it, whether it’s Storm or Spark streaming and of course message queues, like Kafka, etc.
Enterprises we see are jumping into it and starting experimenting with these cases and we are seeing two broad categories. One has something to do with customer analytics and customer experience and the second operational intelligence. I’ll get into some of the details on that a little later. The whole customer service and customer experience angle and we at Impetus StreamAnalytix have done this over in many different ways is really all about really, truly capturing the multi-channel engagement of the consumer in real time and give them very, very context-sensitive experiences which are not common today. If you are browsing on the web, on the Bank of America website, and you were researching some products and you just call the call center. Would they say, “Hey Joe, I know you were researching some banking products, would you like me to fill you in?” You don’t expect that today, but that’s the kind of experience that’s truly possible with streaming analytics. In many cases, it makes a huge difference, especially if the customer started researching ways to get out of their contract with you by looking on early termination clauses or early termination terms and conditions on your website and then call in and you are able to not directly confront them about it but just indirectly make an offer about some kind of first promotion because the system knows that this person is looking at early termination and you make that offer at that point, you could very well protect that churning customer and protect that asset.
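To make the early-termination scenario concrete, here is a minimal sketch of correlating two channels in one stream: a web visit to an early-termination page followed by a call from the same customer within a time window. Everything in it is an assumption for illustration, including the event fields, the page name, the one-hour window and the function `retention_prompts`; it is not a description of how StreamAnalytix implements this.

```python
from typing import Dict, Iterable, Iterator

WINDOW_SECONDS = 3600  # assumed: a call within an hour of the page view counts


def retention_prompts(
    events: Iterable[Dict], window: int = WINDOW_SECONDS
) -> Iterator[Dict]:
    """Yield a prompt for the agent when a caller recently viewed termination terms.

    Events are assumed to arrive in timestamp order.
    """
    last_termination_view: Dict[str, int] = {}
    for e in events:
        if e["channel"] == "web" and e.get("page") == "early-termination":
            last_termination_view[e["customer"]] = e["ts"]
        elif e["channel"] == "call":
            seen = last_termination_view.get(e["customer"])
            if seen is not None and e["ts"] - seen <= window:
                yield {"customer": e["customer"], "action": "offer_retention_promo"}


# Hypothetical multi-channel events, ordered by timestamp (seconds).
events = [
    {"customer": "sam", "channel": "web", "page": "early-termination", "ts": 0},
    {"customer": "joe", "channel": "web", "page": "early-termination", "ts": 100},
    {"customer": "ann", "channel": "web", "page": "products", "ts": 200},
    {"customer": "joe", "channel": "call", "ts": 900},
    {"customer": "ann", "channel": "call", "ts": 1000},
    {"customer": "sam", "channel": "call", "ts": 5000},
]

prompts = list(retention_prompts(events))
print(prompts)
```

In this sample only Joe triggers a prompt: Ann never viewed the termination page, and Sam called after the assumed window had expired.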
That would be one example, plus a lot of customer services are all very good examples. We are implementing today brings down the cost in the call center as well as provides dramatic delightful customer experiences. Dez did a great job in summarizing some of the use cases. You can stare at this chart for a couple of minutes. I classified it as verticals, horizontals, and combo areas, IoT, mobile app and call center. They are all verticals and horizontals. It depends on how you look at it. Bottom line, we see a good deal of horizontal uses that are fairly common across industry verticals and there is a vertical specific use cases including financial services, healthcare, telecom, manufacturing, etc. If you’re really asking yourself the question or telling yourself that, “oh, I don’t know what use cases there are. I’m not sure if there is really any business value in streaming analytics for my company or for our enterprise,” think hard, think twice. Talk to more people because there are use cases that in your company are relevant today. I’ll get into the business value on how exactly the business value is derived.
At the bottom of the pyramid here, you have predictive maintenance, security, churn protection, etc. Those kinds of use cases constitute protection of revenues and assets. If Target had detected their security breach, which unfolded over hours and weeks, it could have saved the CIO’s job. It could have saved tens or hundreds of millions of dollars, etc. Real-time streaming analytics really helps in protecting those assets and preventing losses. That’s direct business value added right there.
The next category is becoming more profitable: lowering your cost and deriving more revenue from current operations. That’s efficiency of the current enterprise. Those all fall into the category of use cases that we call real-time operational intelligence, where you are getting deep insights into how the network is behaving, how your customer operations are behaving, how your business process is behaving, and you’re able to tweak all of that in real time because you get feedback and you get alerts. You get deviations and variances in real time, and you can quickly act and isolate the process that’s going out of bounds.
You could potentially also save a lot of money on expensive capital upgrades that you think are necessary but may not be if you optimize the network service. We heard of a case where a major telco deferred a $40 million upgrade of their network infrastructure because they found that they had enough capacity to manage their current traffic by optimizing and doing more intelligent routing of that traffic and things like that. Those are all possible only with some real-time analytics and an action mechanism that acts on those insights in real time.
The next level of value add is up-sell and cross-sell, where there are opportunities to make more revenue and profit from current offerings. This is a classic example that many of us have experienced: think about the times in your life when you were willing to buy a product that was not being offered to you. In many, many cases, that actually happens. You have things in your mind that you know you want to buy, that are on a to-do list or something, that your wife told you about, or if you don’t have a wife, that you really wanted to buy yourself, and you go shopping on a website or you’re interacting in a retail store, and the storefront just doesn’t have the context, doesn’t have the intelligence to compute what you might need. Hence, they don’t get that sale. If streaming analytics could be deployed to make accurate predictions, which really are possible, about what would most suit this particular context, this customer at this time at this location, there’s a lot of up-sell and cross-sell to be had, and that again comes from streaming analytics: being able to make a propensity decision about what this customer is likely to buy or respond to in that moment of truth when there’s an opportunity. That’s why I love that picture that Dez showed with the bear just about to eat that fish. That’s pretty much it.
We also think there is a big category out there of dramatic, transformational changes in an enterprise: offering completely new products and services simply based on observation of customer behavior, or on observation of the behavior of another enterprise. If, let’s say, a telco or a cable company is really observing the usage patterns of its customers, what segment of the market is viewing what program at what time, etc., they actually end up creating products and services that are almost being begged for in some way. Take the whole concept of multi-screen behavior, where we now almost take it for granted that we can see TV or cable content on our mobile apps. Some of those examples are coming from those new products and services that are being offered to us.
I’ll get into, “What are the architecture considerations of streaming analytics?” It’s ultimately what we’re trying to do. This is the Lambda architecture, where you’re blending the historical data and the real-time insights and seeing them at the same time. That’s what Lambda enables. We all have the batch architecture in the enterprise picture today, feeding into some kind of BI stack and visualization stack, and the Lambda architecture adds the speed layer. Lambda is all about merging those two sets of insights and seeing them in a combined, rich way.
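As a minimal sketch of the Lambda idea described here, assuming a simple count-per-event-type metric: a batch view is computed over stored history, a speed view over events that arrived since the last batch run, and a serving function merges the two. The function names and the sample events are hypothetical, chosen only for illustration, and do not come from any particular product.

```python
from collections import Counter


def batch_view(historical_events):
    """Batch layer: recompute counts per event type from all stored history."""
    return Counter(event["type"] for event in historical_events)


def speed_view(recent_events):
    """Speed layer: count only events that arrived since the last batch run."""
    return Counter(event["type"] for event in recent_events)


def merged_view(batch, speed):
    """Serving layer: combine both views so a query sees history plus live data."""
    return batch + speed


# Hypothetical sample data: three stored events and two that just arrived.
historical = [{"type": "login"}, {"type": "login"}, {"type": "purchase"}]
recent = [{"type": "login"}, {"type": "refund"}]

combined = merged_view(batch_view(historical), speed_view(recent))
print(dict(combined))
```

In a real deployment the batch view would typically come from a Hadoop or warehouse job and the speed view from a streaming engine; the point of the sketch is only that the query path merges both.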
There’s another paradigm called the Kappa architecture that is being proposed, where the conjecture is that the speed layer is the only input mechanism that is going to persist in the longer term. Everything is going to come through this speed layer. There is not even going to be an offline ETL mechanism. All the ETL will happen on the wire: data cleansing, data quality, ETL, all of that. Keep in mind that all data was born real time. At some point, it was real time. We’ve gotten so used to putting this in lakes, rivers and oceans, then doing static analysis on it, that we forgot that the data was born at some point in real time. All data is actually born as a real-time event that happened at a point in time, and most of the data in the lake today just got put in a database for later analysis. We now have the advantage, in the Lambda and Kappa architectures, of actually seeing it, analyzing it, pre-processing it and reacting to it as it arrives. That is what is enabled by these technologies. When you look at it as an overall picture, it looks something like this, where there’s Hadoop inside, and there are the MPPs and data warehouses that you already have.
We put this up because it’s important not to talk about new technologies as an island. They have to integrate. They have to make sense in the current enterprise context, and as solution providers serving enterprises, we are very sensitive to this. We help enterprises integrate the whole thing. There are data sources on the left side feeding in to both the Hadoop and data warehouse layers as well as to the real-time layer on top, each of those layers has its own stack of components, as you can see, and the data consumption layer is on the right side. There is a constant effort to carry the compliance, governance, security, life cycle management, etc., that is available today over into this new technology.
This is one of the things that StreamAnalytix is trying to address. If you look at the landscape today, there is a lot of stuff going on in streaming technology, and from an enterprise customer point of view, there is so much to understand. There is so much to keep up with. There are data-gathering mechanisms on the left side: NiFi, Logstash, Flume, Sqoop. Obviously, I have put up a disclaimer saying it’s not exhaustive. Those come into the message queues and then into the open-source streaming engines: Storm, Spark Streaming, Samza, Flink, Apex, Heron. Heron is probably not open source yet. I’m not sure if it is, from Twitter. Those streaming engines then lead into or support a set of analytical application components such as complex event processing, machine learning, predictive analytics, alerting modules, streaming ETL, enrichment, statistical operations and filters. Those are all what we now call operators. A set of those operators strung together, potentially with some custom logic included if necessary, becomes a streaming application that runs on a streaming engine.
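The idea of operators strung together into a streaming application can be sketched in plain Python generators, with no streaming engine involved. This is an illustration under stated assumptions only: the operator names (`parse`, `enrich`, `filter_hot`, `alert`), the `device,temperature` record format and the threshold are all hypothetical and do not reflect the API of Storm, Spark Streaming or StreamAnalytix.

```python
def parse(lines):
    """Streaming ETL operator: turn 'device,temperature' text into records."""
    for line in lines:
        device, temp = line.strip().split(",")
        yield {"device": device, "temp": float(temp)}


def enrich(records, locations):
    """Enrichment operator: attach a site name from a static lookup table."""
    for record in records:
        yield {**record, "site": locations.get(record["device"], "unknown")}


def filter_hot(records, threshold):
    """Filter operator: keep only readings above the threshold."""
    for record in records:
        if record["temp"] > threshold:
            yield record


def alert(records):
    """Alerting operator: format an alert message for each remaining record."""
    for record in records:
        yield f"ALERT {record['device']}@{record['site']}: {record['temp']}"


def build_app(source, locations, threshold=80.0):
    """String the operators together into one streaming application."""
    return alert(filter_hot(enrich(parse(source), locations), threshold))


# Hypothetical input: three readings, one device missing from the lookup.
source = ["d1,72.5", "d2,91.0", "d3,85.5"]
locations = {"d1": "plant-a", "d2": "plant-b"}
alerts = list(build_app(source, locations))
for message in alerts:
    print(message)
```

Because each operator consumes and yields one record at a time, the chain processes events as they arrive rather than waiting for a full batch, which is the property a streaming engine provides at scale.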
As part of that chain of components, you also need to store and index the data in your favorite database, your favorite index. You might also have a distributed cache, and again that leads into the data visualization layer on the right side, at the top, whether that is commercial products or open source products; ultimately you need some sort of product to visualize that data in real time. Also, you sometimes need to trigger other applications. We’ve all seen that the value is derived only from the action that you take on the insight. That action is going to be a trigger from the analytical stack into another application stack that maybe changes something on the IVR side or triggers a call center outbound call or something like that. We need to have those systems integrated, with some mechanism for your streaming cluster to trigger other applications or send data downstream.
That’s the overall stack going from left to right. Then you have the service layers: monitoring, security, the general service layer, etc. Coming to the products that are out there in the enterprise space that customers are seeing: there are the Hadoop distributions, which all have streaming like I said, and there are commercial or single-vendor solutions, which are obviously our competitors. There are many more in the landscape that we may not have mentioned here.
What you’re seeing there is broadly what the enterprise user is seeing: a complex and rapidly evolving technology landscape for stream processing. We’ve got to simplify the choice and the user experience. What we think enterprises really need is a functional abstraction of all of that in a one-stop-shop, easy-to-use interface that brings together all of those technologies, makes them really simple to use, and does not expose all of the moving parts and the degradation issues and the performance issues and the life cycle maintenance issues to the enterprise.
The functionality abstraction is one part. The second part is the streaming engine abstraction. New streaming engines in the open-source domain are coming up once every three, four or six months now. It was Storm for a long time. Samza came up and now it’s Spark Streaming. Flink is raising its head, beginning to get attention. Even in the Spark Streaming roadmap, they are making way for potentially using a different engine for pure event processing, because they also realize that Spark was designed for batch. In their architecture vision and roadmap, they are making room for a different engine for stream processing in addition to the current microbatch pattern in Spark Streaming.
It is a reality you have to contend with that there is going to be a lot of evolution. You really need to protect yourself from that technology flux, because by default you’re going to have to pick one engine and then live with it, which is not optimal. Looking at it another way, you’re torn between, “Okay, I’ve got to buy a proprietary platform where there is a lock-in, there’s no leverage of open source, and it could be very high cost with limited flexibility, versus all of this open source stack where you’ve got to do it yourself.” Again, like I said, that’s a lot of cost and delay in getting to market. What we’re saying is that StreamAnalytix is one example of a great platform that pulls together the enterprise-class, reliable, single-vendor, professional-service-supported side, all of that that you really need as an enterprise, and the power and flexibility of the open source ecosystem, where a single platform brings them together: ingest, CEP, analytics, visualization and all of that.
It also does a very unique thing, which is to bring together many different technology engines beneath one single user experience. We really think the future is about being able to use multiple streaming engines, because different use cases really demand different streaming architectures. Like Robin said, there is a whole spectrum of latencies. If you’re really talking about millisecond-level latency, tens or even hundreds of milliseconds, you really need Storm at this time, until there is another equally mature product. For more lenient timeframes, with latencies of maybe a couple of seconds, three, four, five seconds, that range, you can use Spark Streaming. Potentially, there are other engines that could do both. Bottom line, in a large enterprise, there are going to be use cases of all kinds. You really want the access and the generality to have multiple engines with one user experience, and that’s what we’re trying to build in StreamAnalytix.
Just a quick view of the architecture. We’re going to rework this a little bit, but essentially, there are multiple data sources coming in on the left side: Kafka, RabbitMQ, Kinesis, ActiveMQ, all of those data sources and message queues coming in to the stream processing platform, where you get to assemble an app by dragging and dropping operators like the ETLs and all the stuff that we talked about. Underneath, there are multiple engines. Right now, we have Storm and Spark Streaming, as the industry’s first and only enterprise-class streaming platform that has multiple engine support. That’s a very unique flexibility we’re offering, besides all the other flexibility of having real-time dashboards and an embedded CEP engine. We have seamless integration with Hadoop and NoSQL indexes, Solr and Apache indexes. You can land data in your favorite database, no matter what it is, build applications really quickly, get to market really quickly and stay future proof. That’s our whole mantra in StreamAnalytix.
With that, I think I’ll conclude my remarks. Feel free to come to us for more questions. I’d like to keep the floor open for Q&A and panel discussion.
Rebecca, over to you.
Rebecca Jozwiak: Great, okay. Thank you so much. Dez and Robin, do you have some questions before we turn it over to the audience Q&A?
Robin Bloor: I’ve got a question. I’ll put my headphones back on so you can hear me. One of the interesting things, if you could kindly tell me this, a lot of what I’ve been seeing in the open-source space looks what I would say immature to me. In a sense, yes you can do various things. But it looks like we’re looking at software in its first or second release in reality and I was just wondering with your experience as an organization, how much do you see the immaturity of the Hadoop environment as problematic or is it something that doesn’t create too many problems?
Anand Venugopal: It is a reality, Robin. You’re absolutely right. The immaturity is not necessarily in the area of functional stability, although maybe there are some cases of that too. The immaturity is more in readiness for use. The open-source products as they come out, and even as they are offered by the Hadoop distributions, are a lot of different capable products and components just slapped together. They don’t work together seamlessly and are not designed for the smooth, seamless user experience that would let a Bank of America or Verizon or AT&T deploy a streaming analytics application within weeks. They aren’t designed for that, for sure. That’s the reason why we come in. We bring it together and make it really easy to understand, to deploy, etc.
The functional maturity of it, I think to a large extent, is there. Many large enterprises use Storm today, for example. Many large enterprises are playing with Spark Streaming today. Each of these engines has its limitations, which is why it is important to know what you can and can’t do with each engine. There’s no point in banging your head against the wall and saying, “Look, I chose Spark Streaming and it doesn’t work for me in this particular use case.” It’s not going to work. There are going to be use cases where Spark Streaming is the best option, and there are going to be use cases where Spark Streaming may not work at all for you. That’s why you really need multiple options.
Robin Bloor: Well, you need to have expert teams on board for most of this. I mean, I don’t even know where to start on this either. You need a sensible collection of skilled individuals. I’m interested in how you get involved in an engagement and how it happens. Is it because a particular company is after a specific application, or are you seeing what I would call strategic adoption, where they want a whole platform to do a lot of things?
Anand Venugopal: We are seeing examples of both, Robin. Some of the top ten brands that everybody knows are going about it in a very strategic way. They know that they are going to have a variety of use cases, so they are evaluating platforms that will suit that need: a variety of different use cases deployed across the enterprise in a multi-tenant manner. There are single-use-case stories starting as well. There’s a particular business activity monitoring-type use case in a mortgage company that we’re working on, which you would not imagine as a first use case, but that is the business use case they came up with, and then we connected the dots to streaming. We said, “You know what? This is a great case for streaming analytics and this is how we can implement it.” That’s how it started. Then, in that process, they get educated and say, “Oh wow, if we can do this and if this is a generic platform, then we can separate out the applications, layer them onto the platform, and build a lot of different applications on this platform.”
Robin Bloor: Dez, you got any questions?
Anand Venugopal: Dez is probably on mute.
Dez Blanchfield: Apologies, I was on mute. I just had a good conversation with myself. Following on from Robin’s original observation, you’re absolutely correct. I think the challenge now is that enterprises have an ecosystem and a cultural and behavioral environment where free and open-source software is something that is known to them, and they are able to use tools like Firefox as a browser, which has had a decent lifetime to become stable and secure. But some of the very big platforms that they use are enterprise-grade proprietary platforms. So the adoption of what I consider open-source platforms is not always something that is easy for them to get across culturally or emotionally. I’ve seen this even in the adoption of small programs that were local projects just to play with big data and analytics as a fundamental concept. I think one of the key challenges, and I’m sure you’ve seen it across organizations, is their desire to get the outcome while at the same time having one foot stuck in the old camp, where they could just buy this from “insert a big brand”: Oracle, IBM and Microsoft. New and lesser-known brands are coming through with Hadoop platforms and more. More exciting brands are coming through that have leading-edge technology like streaming.
What are the sorts of conversations you’ve had that kind of get or cut through that? I know that we have a massive attendance this morning and one thing that I’m sure is on everyone’s mind is “How do I cut through that whole challenging layer from board down to management level, oh it’s too open source and too bleeding edge?” How do conversations you have with clients go and how do you cut through to that point where you kind of allay those types of fears to consider adopting the likes of StreamAnalytix?
Anand Venugopal: We’re actually finding it fairly easy to sell our value proposition because customers are naturally moving towards open source as a preferred option. They are not just giving up easily and saying, “Okay, I’m now going to go open source.” They actually go through a very committed evaluation of a major product, let’s say an IBM or a similar product, because they have these vendor relationships. They then test us or the open-source engine against that product. They will go through six to eight to twelve weeks of evaluation. They will convince themselves that there is the degree of performance and stability here that they want, and then they make up their minds, saying, “Wow, you know what, I can actually do this.”
Today, for example, we have a major tier one telco that has StreamAnalytix running in production on top of a lot of the stack. They evaluated it against another very large, well-known vendor, and they were convinced only after we proved the performance, the stability and all of those things. They don’t take it for granted. They found out through their evaluations that open source is competent, and they realize that, worst case, “Maybe there are those two use cases that I can’t do, but most of my business acceleration use cases today are eminently possible with the open-source stack.” And we enable usage of it. So that’s the big sweet spot right there. They wanted open source. They are really looking to get out of the vendor lock-in situation they have been used to for many, many years. Then here we come and say, “You know what, we’ll make open source much, much easier and friendlier to use for you.”
Dez Blanchfield: I think the other challenge that enterprises find is that when they bring in the traditional incumbent, they are often a generation behind some of the bleeding-edge, exciting stuff we’re talking about here, and I don’t mean that as a negative slight. The reality is they’ve got a generation and a journey to go through to release what they consider stable platforms: old-school development, UAT and integration cycles, testing and documentation, and marketing and sales. Whereas with the kind of thing you’re doing, and I was looking at some of your latest releases last night while doing some research, you’ve got this mix where you have the competency from an upfront consultancy and implementation point of view, but you also have a stack that you can roll in. I think this is where the incumbents are going to struggle for some time. We’ve seen many of them in the market. They are often in what I call catch-up mode, whereas from what you’re telling us, you’re out there having those conversations and you’re out there implementing.
Can you give us a couple of examples of some of the broader verticals where you’ve seen adoption? For example, there are really niche environments like rocket science, putting satellites in space and collecting data from Mars. There is only a handful of people doing that on the planet. But there are big verticals like health, aeronautics, shipping and logistics, manufacturing and engineering. What are a couple of examples of the larger and broader industry sectors where you’ve seen really good adoption so far?
Anand Venugopal: Telco is a big example.
I’m just going to quickly fix my slides here. Are you able to see the slide here, case study 4?
This is a case of a large telco ingesting set-top box data and doing multiple things with it. They are looking at what customers are really doing in real time. They are looking at where errors are happening in real time in set-top boxes. They are trying to inform the call center so that, if this customer calls in right now, the information coming from this customer’s set-top box and the maintenance ticket information can be quickly correlated to tell whether this particular customer’s set-top box has a problem or not, even before the customer speaks a word. Every cable company, every major telco is trying to do this. They ingest the set-top box data, do real-time analytics, and do campaign analytics so that they can place their ads. That’s a huge use case.
As I said, there’s this mortgage company, which is again a generic pattern where large systems are involved in processing data. The data flows from system A to system B to system C, and these are regulated businesses where everything needs to be consistent. Often, systems go out of sync with each other. One system is saying, “I’m processing a hundred loans with a total value of $10 million.” Another system is saying, “No, I’m processing 110 loans with some other total value.” They have to resolve that really quickly because they are in fact processing the same data and making different interpretations.
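The kind of check described here, two systems reporting different loan counts and totals for what should be the same data, can be sketched as follows. This is a minimal illustration assuming each system exposes its loans as `(loan_id, amount)` pairs with unique IDs; the function names, record layout and sample values are hypothetical.

```python
def summarize(loans):
    """Return (count, total value) for a list of (loan_id, amount) records."""
    return len(loans), sum(amount for _, amount in loans)


def reconcile(system_a, system_b):
    """Compare what two systems believe they are processing."""
    ids_a = {loan_id for loan_id, _ in system_a}
    ids_b = {loan_id for loan_id, _ in system_b}
    return {
        "a_summary": summarize(system_a),
        "b_summary": summarize(system_b),
        "only_in_a": sorted(ids_a - ids_b),  # loans system B has not seen
        "only_in_b": sorted(ids_b - ids_a),  # loans system A has not seen
        "in_sync": sorted(system_a) == sorted(system_b),
    }


# Hypothetical snapshots taken from the two systems at the same moment.
system_a = [("L1", 100_000), ("L2", 250_000)]
system_b = [("L1", 100_000), ("L2", 250_000), ("L3", 50_000)]

report = reconcile(system_a, system_b)
print(report)
```

Run continuously over a sliding window of events rather than once over a snapshot, the same comparison is what lets a mismatch be flagged while the loans are still in flight.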
Whether it is a credit card or loan processing business process, a mortgage business process or something else, we’re helping them do correlation and reconciliation in real time to ensure that those business processes stay in sync. That’s another interesting use case. There is a major US government contractor that is looking at DNS traffic to do anomaly detection. There is an offline training model that they built, and they are scoring real-time traffic against that model. Those are some of the interesting use cases. There is a major airline looking at security queues, and they’re trying to give you that information: “Hey, here’s the gate for your flight. The TSA queue today is about 45 minutes,” versus two hours versus something else. You get that update upfront. They are still working on it. It’s an interesting IoT use case, but also a great case of streaming analytics feeding into the customer experience.
Rebecca Jozwiak: This is Rebecca. While you’re on the subject of use cases, there’s a great question from an audience member who is wondering, “Are these case studies, are these initiatives being driven from the information systems analytic side of the house or are they more being driven from the business who has specific questions or needs in mind?”
Anand Venugopal: I think we see about 60 percent or so, 50 percent to 55 percent, being largely very proactive, enthusiastic technology initiatives by teams who happen to be fairly savvy and understand certain business requirements. They probably have one sponsor that they have identified, but these are technology teams getting ready for the onslaught of business use cases coming through. Once they build the capability, they know that they can do this, and then they go to the business and aggressively sell it. In 30 percent to 40 percent of cases, we see that the business already has a particular use case which is begging for a streaming analytics capability.
Rebecca Jozwiak: That makes sense. I have got another slightly more technical question from an audience member. He is wondering if these systems support both structured and unstructured data streams, like sentiment from Twitter streams or Facebook posts in real time, or does the data need to be filtered first?
Anand Venugopal: The products and technologies that we are talking about eminently support both structured and unstructured data. They can be configured. All data has some kind of structure, whether it’s text or XML or anything at all. There is some structure in that there is a time stamp field. There’s maybe another blob which needs to be parsed, so you can inject parsers into the stream to parse out the data structures. If it is structured, then we just tell the system, “Okay, these are comma-separated values, and the first one is a string, the second is a date.” So we can inject that parsing intelligence into the upstream layers and easily process both structured and unstructured data.
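The “first one is a string, second is a date” idea can be sketched as a small schema-driven parser using only Python’s standard library. This is an illustration under assumptions, not how any particular product does it: the field names, the `%Y-%m-%d` date format and the third `amount` column are hypothetical.

```python
import csv
from datetime import date, datetime


def to_date(text):
    """Convert 'YYYY-MM-DD' text into a date (the format is an assumption)."""
    return datetime.strptime(text, "%Y-%m-%d").date()


# Hypothetical schema: field name plus the converter for that column.
SCHEMA = [("customer", str), ("signup", to_date), ("amount", float)]


def parse_csv_stream(lines, schema):
    """Parser operator: turn comma-separated text lines into typed records."""
    for row in csv.reader(lines):
        yield {name: convert(value) for (name, convert), value in zip(schema, row)}


# Hypothetical incoming lines, as they might arrive from a message queue.
incoming = ["joe,2016-06-01,19.99", "ann,2016-06-02,5.00"]
records = list(parse_csv_stream(incoming, SCHEMA))
for record in records:
    print(record)
```

Swapping in a different schema, or a different parser function for a text or XML blob, changes what the stream understands without touching the operators downstream.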
Rebecca Jozwiak: I’ve got another question from the audience. I know we’ve run a little bit past the top of the hour. This attendee wants to know, it seems like real-time streaming applications may be developing both a need and an opportunity for integrating back into transaction systems, fraud prevention systems they bring up for example. In that case, do transaction systems need to be tweaked to kind of fit with that?
Anand Venugopal: It’s a merge, right? It’s a merge with transaction systems. They sometimes become the source of data, where we’re analyzing transactions in real time. In many cases, let’s say there’s an application flow, and here I’m trying to show a static data lookup: data is streaming in and you’re looking up a static database like HBase or an RDBMS to enrich it, combining the streaming data and the static data to make a decision or derive an analytical insight.
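A minimal sketch of that enrichment pattern, assuming an in-memory dictionary as a stand-in for the static store (an HBase table or RDBMS in practice) and a simple spending-limit rule as the decision. The table contents, field names and rule are hypothetical.

```python
# Stand-in for a static lookup store such as HBase or an RDBMS (assumption).
CUSTOMER_TABLE = {
    "c1": {"segment": "retail", "daily_limit": 500.0},
}


def enrich(transactions, lookup):
    """Join each streaming transaction with static profile data, if any."""
    for tx in transactions:
        profile = lookup.get(tx["customer_id"], {})
        yield {**tx, **profile}


def flag_suspicious(enriched):
    """Decision step: flag transactions that exceed the customer's known limit."""
    for tx in enriched:
        limit = tx.get("daily_limit")
        yield {**tx, "flagged": limit is not None and tx["amount"] > limit}


# Hypothetical stream: one known customer over the limit, one unknown customer.
stream = [
    {"customer_id": "c1", "amount": 750.0},
    {"customer_id": "c2", "amount": 20.0},
]
results = list(flag_suspicious(enrich(stream, CUSTOMER_TABLE)))
for result in results:
    print(result)
```

The flagged record is what would be sent back toward the transaction or fraud prevention system; how that system consumes it is outside the scope of this sketch.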
There’s another big industry trend that we’re also seeing – the convergence of OLAP and OLTP – and that’s why you have databases like Kudu and in-memory databases supporting both transactions and analytical processing at the same time. The stream processing layer would be entirely in memory and we’ll be looking at or interfacing with some of these transactional databases.
Rebecca Jozwiak: Mixed workload has been one of the last hurdles to jump, I think. Dez, Robin, do you two have any more questions?
Dez Blanchfield: I’m going to jump in with one last question and wrap up on that, if you don’t mind. With the organizations that I’ve been dealing with for the last decade or so leading into this exciting challenge of stream analytics, the first thing they tend to put back on the table when we start the conversation is: where do we get the skill set? How do we retrain the skill set and how do we get that capability internally? Having Impetus come in, hand-hold us through the journey and then implement is a great first step, and it makes a lot of sense doing that.
But for medium to large organizations, what are the kinds of things you’re seeing at the moment to prepare for this and to build that capability internally, starting from just a basic vocabulary around it? What kind of messaging can they do within the organization around the transition to this sort of framework, and around retooling their existing technical staff, from IT up to the CEO, so they can run this themselves once you build and implement it? Just very briefly, what sort of challenges have the customers you’re dealing with found, and how do they go about solving that retraining and gaining the experience and knowledge to get ready for this and to be able to run it operationally?
Anand Venugopal: Often, the small set of people that are trying to go out and buy a streaming analytics platform are already reasonably smart, in that they are Hadoop aware, they have already gotten their Hadoop MapReduce skills, and because they are working closely with a Hadoop distribution vendor, they are already familiar with the stack. Everybody is getting Kafka, for example. They are doing something with it, and either Storm or Spark Streaming is in their open-source domain. Definitely, people are familiar with it or building skills around it. But it starts with a small set of people that are skilled enough and smart enough. They are attending conferences. They are learning, they ask intelligent questions of vendors, and in some cases they learn with the vendors. When the vendors come and present at the first meeting, they may not know the material, but they go read up and then they start playing with it.
That small group of people is the nucleus, and then it starts growing, and everybody takes notice once the first business use case gets operationalized. Then a wave begins. We saw at the Spark Summit last week that a large enterprise like Capital One was out there in full strength. They were adopting Spark. They were speaking about it. They are educating a lot of their people in Spark because, in many cases, they are also contributing to it as a user. We see the same with many, many large enterprises. It starts with a small set of very smart people, and then it begins a wave of overall education. Once a senior VP or a senior director is aligned and wants to bet on this thing, the word gets around and they all start picking up these skills.
Dez Blanchfield: I’m sure you have a fantastic time building those champions too.
Anand Venugopal: Yes. We do a lot of education as we work with the initial champions, and we hold training courses. For many of our large customers, we have gone back and had waves and waves of training to bring a lot of the users into the mainstream usage phase, especially on the Hadoop MapReduce side. For a large credit card company that is a customer of ours, we have delivered maybe five to eight different training programs. We also have free community editions of all these products, including ours, and sandboxes that people can download, get used to and educate themselves with that way also.
Dez Blanchfield: That’s all I have this morning for you. Thank you very much. I find it incredibly interesting to see the types of models and use cases you’ve got for us today. Thank you.
Anand Venugopal: Great. Thank you very much folks.
Rebecca Jozwiak: Thanks everyone for joining us in this Hot Technologies webcast. It has been fascinating to hear from Dez Blanchfield, Dr. Robin Bloor and, from Impetus Technologies, Anand Venugopal. Thank you presenters. Thank you speakers and thank you audience. We have another Hot Technologies next month, so look for that. You can always find our content archived at Insideanalysis.com. We also put lots of content up on SlideShare and some interesting bits on YouTube as well.
That’s all folks. Thanks again and have a good day. Bye, bye.