The Key to Quality Big Data Analytics: Understanding 'Different' - TechWise Episode 4 Transcript
Takeaway: Host Eric Kavanagh discusses big data analytics with industry experts.
We have several presenters. As you can see, there’s yours truly at the top. Mike Ferguson is calling in all the way from the UK, where he had to get special privileges to stay in his office building this late. That’s how late it is for him. We’ve got Dr. Robin Bloor, our very own Chief Analyst here at the Bloor Group. And we’ll have George Corugedo, CEO and Co-founder of RedPoint Global, and Keith Renison, Senior Solutions Architect from SAS Institute. These are fantastic companies, folks. These are companies that are really innovating. And we’re going to dig into some of the good stuff of what’s happening out there right now in the whole world of big data. And let’s face it, the small data hasn’t gone away. And to that, let me give my executive summary here.
So, there’s an old French expression: "The more things change, the more they stay the same." And let’s face some facts here: big data is not going to solve the problems of small data. Corporate small data is still out there. It’s still everywhere. It is the fuel of operations for today’s information economy. And big data offers a complement to this so-called small corporate data, but it does not supplant small data. It’s still going to be around. I do like a lot of things about big data, especially stuff like machine-generated data.
And today, we’ll probably talk a little bit about social media data, which is also very powerful stuff. And if you think about, for example, how social has changed business, well, just think about three quick websites here: Facebook, LinkedIn and Twitter. Think about the fact that five years ago, nobody was doing that kind of stuff. Twitter is an absolute juggernaut these days. Facebook, of course, is huge. It’s gargantuan. And then, LinkedIn is the de facto standard for corporate networking and communication. These sites are humongous, and being able to leverage the data that’s in them is going to provide some game-changing functionality. It’s really going to do a lot of good for a lot of organizations, at least the ones that take advantage of it.
So, governance. Governance still matters. Again, big data doesn’t nullify the need for governance. Quite frankly, there’s a whole new need to focus on how to govern the world of big data. How do you make sure that you have your procedures and policies in place; that the right people are getting access to the right data; that you’ve got context, you’ve got lineage involved here? You actually know where the data comes from, what has happened to it. And that’s all changing.
I’m frankly really impressed by some of what I’ve seen out there in this whole new world leveraging the Hadoop [inaudible 00:02:47] ecosystem, which is, of course, a lot more than storage in terms of functionality. Hadoop is a computational engine as well. And the companies that figure out how to harness that computational power, that parallel processing capability, they’re going to do really, really cool things. We’ll learn about that today.
The other thing to mention, this is something that Dr. Bloor has talked about in the recent past, is that the innovation wave is not over. So, we’ve seen a lot of, of course, attention around Hadoop. We’ve seen companies like Cloudera and Hortonworks, you know, really making some waves. And they’re developing partnerships with, well, companies on the call today, quite frankly. And they’re developing partnerships with lots of folks. But the innovation wave is not over. There are more projects spinning out of the Apache Foundation that are changing not just the end point, if you will — the applications that people use — but the infrastructure itself.
So, this whole development of YARN (Yet Another Resource Negotiator) is really like an operating system for big data. And it’s a big, big deal. So, we’re going to learn how that changes things as well. So, just a couple bits of obvious advice here: be wary of long contracts going forward. You know, five-, ten-year contracts are not going to be the wave, the path, it seems to me. You’re going to want to avoid lock-in at all costs. We’re going to learn about all of that today.
So, our first analyst speaking today — our first speaker of the whole program is Mike Ferguson, calling in from the UK. With that, I’m going to hand you the keys, Mike, and let you take it away. Mike Ferguson, the floor is yours.
Mike, you there? You might be on mute. I don’t hear him. We may have to call him back. And we’ll just jump right up to Robin Bloor’s slides. Robin, I’m going to pull rank on poor Mike Ferguson here. I’m going to go [inaudible 00:04:51] for a second.
Is that you, Mike? Can you hear us? Nah. I think we’re going to have to go ahead and go with Robin first. So, hold on one second, folks. I’ll pull some links to the slides here in a couple of minutes as well. So with that, let me hand the keys to Robin Bloor. Robin, you can go first instead of Mike, and I’ll call Mike in a second.
Eric: Hold on, Rob. Let me go ahead and get your slide up here, Rob. It’s going to take a second.
Eric: Yeah. You can kind of talk about what we’re dealing with, though, here in terms of governance. I know you’re going to talk about governance. That is typically thought about in the context of small corporate data. So now, I’ve got the slide up, Robin. Don’t move anything. And here you go. The floor is yours. Take it away.
Robin: Okay. Yeah. I mean, well, what we kind of arranged beforehand was, Mike would talk about the analytical side, and I’ll talk about the governance side. To a certain extent, the governance follows the analytics, in the sense that the reason you are doing the big data stuff, and the reason that you assemble all of the software to do the analytics, is that’s where the value is.
There’s an issue. And the issue is that, you know, the data has to be wrangled. The data has to be marshaled. The data has to be brought together and managed in a way that enables the analytics to take place with full confidence, I guess, is the word. So, what I thought I’d talk about was the governance side of the equation. I guess the thing to say, really, is that, you know, governance was already an issue. Governance was already an issue, and it started to become an issue in the whole of the data warehouse game.
What’s actually happened is it’s turned into a much larger issue. And the reasons it’s turned into a much larger issue, as well as there being more data, are these, really. The number of data sources has expanded dramatically. Previously, the data sources we had were by and large defined by whatever fed the data warehouse. The data warehouse would normally be fed by OLTP systems. Possibly a little external data, not much.
Now, we’ve gone to a world where, you know, a data market is coming into existence right now, and therefore, there will be trading in data. You’ve already got loads and loads of different streaming sources of data that you can actually bring in to the organization. We’ve got social media data, which has taken off on its own account, so to speak. I mean, an awful lot of the value in the social media sites is actually the information they aggregate and can therefore make available to people.
We’ve also got the discovery of, you know, data that already existed. We already had those log files, you know, and with the advent of Splunk, it soon became obvious that there’s value in a log file. So, there was data within the organization which we could call new data sources, as well as external sources. So, that’s one thing. And that really means that, you know, whatever rules for the management of data we had in place before, they’re going to have to be, in one way or another, extended, and will continue to need to be extended, to actually govern the data that we’re now starting to assemble in one way or another.
And going down this list, we have streaming and the speed of arrival of data. One of, I think, the reasons for the popularity of Hadoop is it can pretty much be used to catch a lot of data. It can also ingest data at speed, so that if you don’t actually need to use it immediately, it’s a nice parallel, huge parallel environment. But you’ve also got the fact that there’s a fair amount of streaming analytics going on now. It used to just be the banking sector that was interested in streaming applications, but now it’s gone kind of global. And everybody is looking at streaming applications, in one way or another, as a potential means of deriving value from data and doing analytics for the organization.
We’ve got the unstructured data. The statistic usually quoted was that only 10% of the world’s data was in relational databases. Now, one of the major reasons for that was that most of it was actually unstructured, and a good deal of it was out there on the Web, but pretty much strewn about various websites. That data has proved to be also analyzable, also usable. And with the advent of semantic technology, which is gradually creeping into the situation, it is becoming more and more so. So, there’s a need to actually gather and manage unstructured data, and that need is much greater than it was before. We’ve got the social data that I already mentioned, but the point about that, the main point about that, is it probably needs cleaning.
We’ve got Internet of Things data. That’s kind of a different kind of situation. There’s likely to be so much of that that a lot of it is going to have to stay distributed somewhere near the place it runs. But you’re also going to want to, in one way or another, pull it in to do the analytics within the organization on the data. So, that’s added yet another factor. And that data will be structured in a different way, because it will probably be formatted in JSON or in XML, so that it declares itself. And that means, in one way or another, that we are actually pulling data in and able to do a kind of schema on read on that particular piece of data.
We’ve got the issue of provenance, and this is an analytics issue. The results of any analysis you’re doing on data really cannot be, if you like, approved of, taken to be valid, unless you know the data provenance. I mean, that’s just professionalism in terms of data scientists’ activity. But you know, in order to have data provenance, that means that we actually have to govern data and keep a note of its lineage.
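A minimal sketch of what "keeping a note of lineage" can look like in practice, assuming a small in-memory dataset. The `TrackedDataset` class, the `crm_export` source name and the step names are hypothetical and are not from the talk; real governance tooling records far more than this.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TrackedDataset:
    """Hypothetical helper: rows plus a note of their source and lineage."""
    rows: List[dict]
    source: str
    lineage: List[str] = field(default_factory=list)

    def apply(self, step_name: str,
              fn: Callable[[List[dict]], List[dict]]) -> "TrackedDataset":
        # Run the transformation and append the step name to the lineage trail.
        return TrackedDataset(
            rows=fn(self.rows),
            source=self.source,
            lineage=self.lineage + [step_name],
        )


raw = TrackedDataset(
    rows=[{"customer": "a", "spend": 10}, {"customer": "b", "spend": None}],
    source="crm_export",  # made-up source name for illustration
)
cleaned = raw.apply(
    "drop_null_spend",
    lambda rows: [r for r in rows if r["spend"] is not None],
)
doubled = cleaned.apply(
    "double_spend",
    lambda rows: [{**r, "spend": r["spend"] * 2} for r in rows],
)

# Anyone consuming `doubled` can see where it came from and what was done to it.
print(doubled.source, doubled.lineage, doubled.rows)
```

Each transformation returns a new object instead of mutating the old one, so the raw data and its empty lineage trail stay available for audit.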
We have the issue of computer power and parallelism, and what all that does is make everything go faster. The problem is that obviously, certain processes that we’ve got in place may be too slow for everything else. So, there are possibly mismatches in terms of speed.
We’ve got the advent of machine learning. Machine learning has the effect, really, of making analytics a different game than it was before. But you can only really use it if you’ve got the power.
We’ve got the fact of new analytic workloads. We’ve got a parallel world, and some analytical algorithms need to be executed in parallel for maximum effect. And therefore the problem actually is governing how you actually, in one way or another, push the data around, make the data available, and where you actually execute the analytical workloads, because you may be doing that within the database, or you may be doing it within analytical applications.
So, there’s a whole series of governance challenges. What we did this year — the research we did this year was really around big data architecture. And when we actually try to generalize it, the conclusion that we came to — the diagram that we came up with looked a lot like this.
I’m not going to go into this, especially as Mike is going to do a fair amount on data architecture for analytics. But what I’d actually like people to just focus on is this bottom area where we are, in one way or another, assembling data. We have something that I would like to refer to as the data refinery or data processing hub. And that’s where the governance takes place. So, you know, if we kind of focus in, it looks like that. You know, it’s being fed by data from internal and external sources. The hub should be, in theory, taking all the data that’s being generated. It should either be streamed, and managed as it’s streamed if you need to do analytics on streaming data, and then passed to the hub. Or else, it all comes into the hub. And there are a number of things that are going on in the hub. And you can have a certain amount of analytics and SQL going on in the hub. But you’ve also got the need for data virtualization and ETL to push data to other areas. But before any of that happens, you actually need, in one way or another, to do the refining, the data preparation. You can call it data preparation. It’s much bigger than that. These are the things that I think it includes.
We have system management and service management, in the sense that this is the major portion of the data layer, so we actually have to apply all of the operational system management effort that we traditionally have done to pretty much all operational systems. But we also need, in one way or another, to monitor other things going on to make sure these various service levels are being met, because there are bound to be defined service levels for any kind of analytics being actioned, or BI data being actioned.
We need performance monitoring and management. If nothing else, we need that in order to know what further computer resources we may need to allocate at various points in time. But also, an awful lot of the workloads here are, in actual fact, fairly complex and competing with each other for resources. There’s something quite sophisticated that needs to be done in that area.
We’ve now got data life cycle in a way that we never had it before. The deal here really is, above and beyond anything else, that we didn’t gather data and throw it away before. We tended to gather data that we needed and probably kept it, and then we archived it. But an awful lot of what we will be doing from here on is exploring data. And if you don’t want the data, let’s bury it away. So, the data life cycle is a different thing depending upon the situation, but there will also be an awful lot more aggregation of data. Therefore, you know, knowing where an aggregate came from, what the source of aggregation is, and so on and so forth. That’s all necessary.
Data lineage naturally follows. Without it, you have problems. We have to know the data is valid, but also how reliable it actually is.
We’ve also got data mapping, because a lot of the data is actually going to be, in one way or another… And this, if you like, relates to a certain extent to MDM. It’s just that it’s far more complicated now, because when you’ve got an awful lot of data defined by JSON, or based on XML with schema on read, then you’re going to need to, in one way or another, have very active data mapping activity going on.
There’s a metadata management situation which is more than MDM, because there’s a need, in one way or another, to build what I’d like to think of now as a kind of metadata warehouse of everything that you have an interest in. There’s metadata discovery, because some of the data will not necessarily have its metadata declared, and we want to use it immediately. And then, there’s data cleansing, which is a huge thing, as there’s a whole series of things that one can do there. And there’s data security as well. All of this data has to be secured to an acceptable level, and that might even mean, in certain instances, encrypting a lot of the values.
So, all of this workload is actually the governance empire. All of this, in one way or another, has to be going on at the same time as, or before, all of our analytical activity. This is a large number of coordinated applications. It’s a system in its own right. And then, those that don’t do it at various points in time will suffer from a lack of it as they go forward, because an awful lot of these things aren’t really optional. You end up with just increasing entropy if you don’t do them.
So, in terms of data analytics and governance, the thing that I’d say is that, really, one hand washes the other. Without governance, analytics and BI will flounder in time. And without analytics and BI, there wouldn’t be much need for governing the data anyway. So, the two things really walk hand-in-hand. As they say in the Middle East, "One hand washes the other." And that’s actually all I’ve got to say. Hopefully, we’ve now got Mike back.
Eric: We do. Mike, I presume you’re there. I’m going to push your slide up.
Mike: I am. Okay, can you hear me?
Eric: Yeah, I can hear you. You sound wonderful. So, let me introduce… There you go. And you are now the presenter. Take it away.
Mike: Alright, thank you! Good morning, good afternoon, good evening to all of you out there. Forgive the hiccup at the beginning. For some reason, I got myself muted and could see everybody but they couldn’t hear me.
Alright. So, what I want to do quickly is talk about, you know, the big data analytical ecosystem. If you want to ask me questions, either in this session or later, you can get hold of me on my contact details here. As was said, it’s the middle of the night here in the UK.
Well, let me get to what I want to talk about. Clearly, over the last few years, we’ve seen the emergence of all kinds of new-found types of data that businesses now want to analyze, everything from clickstream data to understand online behaviors, to social media data that Eric was talking about at the beginning of the program here. I think Robin mentioned JSON, BSON, XML, so, semi-structured data that’s self-describing. Of course, we’ve got a whole ton of other stuff as well, everything from unstructured data to IT infrastructure logs and sensor data. All of these are relatively new data sources that businesses have now taken an interest in because they contain valuable insight that could potentially deepen what we know.
So, that basically means the analytical landscape has moved beyond traditional data warehousing. We’ve moved from structured data into the world of a combination of structured and multi-structured data, where the multi-structured data could come from inside or outside of the enterprise in many cases. And as a result of these new data types and new needs to analyze, we’ve seen the emergence of new analytical workloads, everything from analyzing data in motion, which kind of turns the traditional data warehousing architecture on its head, somewhat, where we, in traditional circles, integrated data, cleaned it, transformed it, stored it and analyzed it. But analyzing data in motion, we’re capturing the data, integrating it, preparing it, analyzing it and then storing it. So, there’s analysis going on on data before it’s stored anywhere.
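The reordering described here (analyze first, store afterwards) can be sketched in a few lines, assuming a toy in-process stream. The sensor readings, the threshold and the function names are made up for illustration; a real deployment would use a stream processing engine, not a Python generator.

```python
from typing import Iterable, Iterator, List


def sensor_stream() -> Iterator[dict]:
    """Stand-in for a live feed; the readings are invented for illustration."""
    for i, temp in enumerate([20.5, 21.0, 95.2, 20.8]):
        yield {"reading_id": i, "temperature": temp}


def analyze_in_motion(events: Iterable[dict], threshold: float) -> Iterator[dict]:
    # Analysis happens as each event arrives, before anything is stored.
    for event in events:
        yield {**event, "alert": event["temperature"] > threshold}


store: List[dict] = []   # stand-in for whatever data store sits downstream
alerts: List[int] = []   # reading ids that triggered an immediate action

for scored in analyze_in_motion(sensor_stream(), threshold=90.0):
    if scored["alert"]:
        alerts.append(scored["reading_id"])  # act right away, in real time
    store.append(scored)                     # storing is the last step

print(alerts, len(store))
```

In the traditional warehouse flow the `store.append` step would come first and the scoring would run later over the stored data.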
We [inaudible 00:21:48] complex analysis of structured data, perhaps for model development, statistical and predictive model development; that’s nothing new to some folks in the traditional data warehousing space. We’ve got exploratory analysis of un-modeled data. That’s the multi-structured data there. We’ve got new workloads in the form of graph analysis, which for my clients in financial services includes things like fraud. It also includes cyber security. It includes social networks, of course, understanding influencers and stuff like that there. Even master data management has some uses for graph analysis.
We’ve got the data warehouse optimization or offloading of ETL processing, which is more of a sort of IT use case; the CIO might fund that. And even archiving data from data warehouses to keep it online in things like Hadoop. So, all of these new analytical workloads have added new platforms, new storage platforms, to the analytical landscape. So, rather than just having traditional data warehouses, data marts, what we’ve now got is Hadoop. We’ve got NoSQL databases such as graph databases, which are often used for analytical workloads. Of course, we can do graph analysis now on Hadoop itself as well as in NoSQL graph DBMSs. We’ve got streaming analytics, which Robin mentioned. And we’ve got, if you like, building of models, perhaps on analytical data warehouse appliances as well.
But all of that has complicated the analytical landscape, multiple platforms now being needed. And I guess the challenge for any business with a front office or back office, or finance, procurement, HR and some kind of operations, is to figure out which analytical projects are associated with the traditional data warehousing scene, which analytical projects are associated with these new big data platforms, and where to run, you know, which analytical workload, but not to lose sight of the business, in the sense that, as you’ll now be seeing, it’s a combination of big data analytical projects and traditional data warehousing projects that together are needed to strengthen insight around customers, or around operations, around risk, or finance or sustainability.
And therefore, we want all of these to be aligned with our strategic business priorities, so that we stay on track to, you know, push the needles that need to be pushed, you know, to improve business performance, to reduce cost, to reduce risks, etc., you know, for our company as a whole. So, it’s not that one replaces the other here with big data and traditional. It’s both being used together.
And that dramatically changes the architecture, you know.
So, what I have here is a relatively new architecture that I use with my clients. And so, as you can see now along the bottom, there’s a vast range of data sources, not just structured anymore. Some of those are streaming live data like sensors, like markets data, that kind of thing. It could even be live clickstream data. It could be live video streaming data. So it doesn’t have to be structured. So, we can be doing stream processing on that data to take automatic actions in real time, and any data of interest could be filtered and passed into enterprise information management tools that can be used to populate analytical data stores. And as you can see in the mix here, now we’ve got traditional data warehousing, Hadoop and NoSQL databases. We’ve got master data management in the mix as well. And that puts more pressure on the whole data management tool suite, not only to populate these data stores but to move data between them.
On top of that, we have to simplify access tools. We cannot just turn to the user and say, "Here are all these data stores, all these APIs: your problem." What you’ve got to do is simplify access. And so, kind of in the dotted lines there, you’ll see data virtualization and optimization are kind of hiding the complexity of multiple data stores, to try and make it easier for end users to access this. And of course, there’s a range of tools on the top, you know, everything from traditional BI tools that have kind of started over at the top of data warehousing, gradually moving towards the left of your chart to kind of connect up into the Hadoops and the NoSQL databases of the world.
We’ve got search getting a new lease on life, particularly around the body of unstructured data that’s often stored in Hadoop. We’ve got custom analytic applications to be done on a Hadoop platform with MapReduce, or the Spark framework, for example. We’ve got graph analytics tools to, you know, focus on very specific workloads there. So, there’s a range of tools, and the data flows are also more complex. It’s no longer just a one-way street into the data warehouse. There’s now master data, of course.
We’ve got new data sources coming in, either being captured in NoSQL, you know, data stores like MongoDB, like Cassandra, like HBase. We’ve got data being brought directly into Hadoop for analysis and data preparation there. We’ve got new insights coming out of Hadoop and the data warehouses. We’ve got archive coming off the data warehouses into Hadoop. Now we’ve got data feeds going to, you know, all the NoSQL databases and data marts as well. So, what you can see here is, there’s far more activity going on in data management. And it means it’s putting the data management software under considerable pressure. It’s no longer just a one-way street. It’s two-way data movement. There’s a lot more activity going on, and therefore, scalability is important on the data-management-tool front as well as on the data source.
So, this chart goes back to that architecture I mentioned a moment ago. It shows you the different analytical workloads running in different parts of this architecture. Sort of on the bottom left there, you’ve got real-time streaming, stream processing going on on data coming out of, you know, any kind of live data store. We’ve got graph analysis happening on NoSQL graph databases. It can also happen on Hadoop, with the Spark framework, for example, and GraphX there. We’ve got investigative analysis and the data refinery that Robin was talking about happening on Hadoop. We’ve got traditional workloads still going on in data warehousing, you know, power users building statistical and predictive models, perhaps on data warehouse appliances. And we’re still trying to simplify access to all of this to make it easy for end users.
So, success around this whole setup is more than just the analytical side. You know, we can put the analytical platforms in place, but if we can’t capture and ingest, you know, high-velocity and high-volume data at scale, there’s not much point. You know, I’d have nothing to analyze. And so, success of big data analytics does require operational systems to scale up. That means being able to support new transaction, you know, peaks. You know, any non-transactional data being captured there could have, you know, very, very high arrival rates, on high-velocity data like sensors or any ingest. We have to be able to cater for all of that, to be able to capture this kind of data and bring it in for analysis. We also have to scale the analytics themselves, and simplify access to data, which I mentioned already. And then, tie that back. You know, we have to be able to feed that back into those operational systems to give it a closed loop.
So, scaling the operational side of the house to capture data, you know, takes us into the world of NoSQL databases. I mean, here you see five categories of NoSQL database. The last category, multi-model, is just a combination of the other four above. In general, you know, it’s key-value stores, document stores and column-family databases, the first three there, which are kind of used for the more kind of transactional and non-transactional data.
Some of those databases support ACID properties; some of them do not. But nevertheless, you know, we’re seeing the introduction of those to scale those kinds of applications. And so, for example, as we’ve moved away from just employees entering transactions at keyboards to now customers and the masses using mobile devices to be able to do that, we’ve seen a tremendous increase in the number of transactions being entered into enterprises. And so, we need to scale transactional applications to do that.
Now, generally speaking, that can be done on NewSQL databases, which are relational databases like NuoDB and VoltDB shown here. Or some of the NoSQL databases that perhaps support ACID properties that can guarantee transaction processing may be in play. This also applies to non-transactional data such as shopping cart data before a transaction, you know, before people buy stuff, or sensor data, you know; if I lose a sensor reading amongst hundreds of millions of sensor readings, it’s no big deal. Clicks, you know, in the clickstream world: if I lose a click, it’s no big deal. So, you know, we don’t necessarily need to have ACID properties there, and that’s often where NoSQL databases come into play, with that capability to do very high write processing at scale to capture these new kinds of data.
At the same time, we want the analytics to scale. And so, pulling the data from the data stores to the analytical platforms is no longer going to hack it, because the data is too big. What we really want is to push the analytics the other way, down into the enterprise data warehouse, into Hadoop, into stream processing, to be able to push the analytics to the data. However, just because someone says it’s in-database analytics or in-Hadoop analytics doesn’t necessarily mean the analytics run in parallel. And quite frankly, if you’re going to invest in these new massively parallel scalable technologies like Hadoop, like the data warehouse appliances and whatnot, like the clustered stream processing engines, we need the analytics to run in parallel.
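A minimal sketch of pushing the analytics to the data and running them in parallel, assuming the data is already split into partitions. The partition contents and the `local_aggregate` name are invented for illustration, and a thread pool stands in here for the nodes of a cluster; this is not how any particular platform implements it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Each inner list stands in for the data held on one node (made-up values).
partitions: List[List[float]] = [
    [10.0, 20.0, 30.0],
    [5.0, 15.0],
    [40.0],
]


def local_aggregate(partition: List[float]) -> Tuple[float, int]:
    # Runs next to the data: only a small (sum, count) pair leaves the partition.
    return sum(partition), len(partition)


# One worker per partition, so the local aggregates are computed concurrently.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partials = list(pool.map(local_aggregate, partitions))

# The coordinator combines the partial results; it never sees the raw rows.
total = sum(s for s, _ in partials)
count = sum(n for _, n in partials)
mean = total / count

print(partials, mean)
```

The point of the design is the size of what moves: three small tuples cross the boundary instead of every row, which is what makes the approach viable when the data is too big to pull out.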
So, that’s one thing to check out. You know, if we’ve got analytics to help predict things for customers, for operations, for risk, etc., we want them to run in parallel, not just run in the platform. We want both. And that applies as well to, you know, technologies like these new visual discovery tools, such as those from SAS, who are actually one of our sponsors here.
What people want is at least to exploit those in-Hadoop and in-database analytics. And we want those to run in parallel in order to be able to deliver the performance needed on such high data volumes. At the same time, we’re trying to simplify access to all of this. And so, SQL is now back on the agenda. You know, SQL on Hadoop is hot right now. I’m tracking 19 SQL-on-Hadoop initiatives right now. Plus, you can see, we can get at this data, you know, in a number of ways. As well as directly accessing SQL on Hadoop itself, we can go SQL to a search index, in the way that, you know, some of the search vendors in that space do. We can have SQL access to analytical relational databases which have external tables to Hadoop.
We can now have SQL access to a data virtualization server which itself can then be connected to a data warehouse and Hadoop. I’m even now starting to see the emergence of SQL access to live streaming data. So, SQL access to all of this is growing rapidly. And part of the challenge is, just because SQL access is being marketed out there, the question is, can SQL deal with complex data? And that’s not necessarily straightforward. There’s all kinds of complications here, including the fact that JSON data could be nested. We can have schema-variant records. So, the first record has got one schema. The second record has got a different schema. These things are very different from what happens in a relational world.
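To make the nested and schema-variant problem concrete, here is a small sketch, assuming two made-up JSON records whose fields differ. The `flatten` helper and the dotted column naming are illustrative choices, not a description of how any SQL-on-Hadoop engine handles it.

```python
import json
from typing import Any, Dict, List

# Two records with different schemas; the second lacks "geo" and adds "device".
raw_lines = [
    '{"id": 1, "user": {"name": "Ann", "geo": {"country": "UK"}}}',
    '{"id": 2, "user": {"name": "Bob"}, "device": "mobile"}',
]


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested objects into dotted column names, e.g. user.geo.country."""
    flat: Dict[str, Any] = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


records: List[Dict[str, Any]] = [flatten(json.loads(line)) for line in raw_lines]

# The relational view needs one fixed column list, so take the union of all
# columns seen and fill the gaps with None (NULL in SQL terms).
columns = sorted({col for rec in records for col in rec})
table = [[rec.get(col) for col in columns] for rec in records]

print(columns)
print(table)
```

The column list is only known after reading the data, which is the schema-on-read situation; a relational table would have required it up front.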
So, we need to ask questions about what kind of data it is that we’re trying to analyze, and what the analytical characteristics are. Is it, you know, text analytics that you want to do? Is it machine learning? Is it graph analysis? Can you do that from SQL? You know, is that invocable from SQL? How many concurrent users have we got doing this? You know, if we’ve got hundreds of concurrent users, is that possible on complex data? You know, all of these things are key questions. So, I kind of made a list of a few here that I think you should consider. You know, what kind of file formats? What kind of data types are we talking about? What kind of analytical functions can we invoke from SQL to get at complex data? And can the functions run in parallel? I mean, they’ve got to run in parallel if we’ve got to be able to scale this. And can I join data in Hadoop today to data outside of it, you know, or is that not doable? And what will I do with all these different kinds of query workloads?
And as we’ll see, you know, from what I’ve seen, there’s a lot of differences across the SQL-on-Hadoop distributions. These are all the ones I’m tracking. And by the way, that’s pure SQL on Hadoop. That doesn’t even include data virtualization at this point. And so, there’s a lot out there and lots of room for consolidation, which I think is going to happen over the next year, eighteen months or so. But it also opens up another thing, which is I can have potentially multiple SQL engines on the same data in Hadoop. And that’s something you couldn’t do in relational.
Of course, that means you have to then know, you know, what kind of query workload am I running? Should I run that in batch on a particular SQL on Hadoop initiative? Should I run interactive query workloads through another SQL on Hadoop initiative, etc., so that I know which one to connect to? Ideally, of course, we should not do that. We should just have, you know, asked a question on it. You know, some optimizer figures out the best way to do it. But we’re not fully there yet, in my opinion.
But nevertheless also, data virtualization, I mentioned earlier has a very important role for simplifying access to multiple data stores. And if we do create new insights on Hadoop, it’s certainly plausible for us to join that data-to-data and traditional data warehouses through data virtualization, for example, without necessarily moving the data from Hadoop into traditional data warehouses. Of course, you can do that, too. It’s also plausible if I archive data from traditional data warehouses into Hadoop. I can still get at it and join it back to the stuff that’s in our data warehouse to data virtualization. So, for me, I think data virtualization has got a big future in this overall architecture and simplifying access to all these data stores.
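As a small-scale analogy for joining across stores without first copying data from one into the other, the sketch below uses Python’s built-in `sqlite3` module with two separate in-memory databases attached to one connection. The table names and values are made up, and a real data virtualization server federates across different engines rather than two SQLite databases, so treat this only as an illustration of the single-query idea.

```python
import sqlite3

# The main database stands in for the data warehouse.
conn = sqlite3.connect(":memory:")

# A second, separate database stands in for the archive store (e.g. Hadoop).
conn.execute("ATTACH DATABASE ':memory:' AS archive")

conn.execute("CREATE TABLE main.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE archive.orders (customer_id INTEGER, amount REAL)")

conn.executemany("INSERT INTO main.customers VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bob")])
conn.executemany("INSERT INTO archive.orders VALUES (?, ?)",
                 [(1, 20.0), (1, 5.0), (2, 7.5)])

# One SQL statement joins across both stores; neither table is moved.
rows = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM main.customers AS c
    JOIN archive.orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
```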
And not to forget that when we create these new insights, whether that’s on relational or NoSQL systems, we still want to drive those insights back into our operations, so that we can maximize the value of what we’ve found, so that we can leverage that for more effective, more timely decisions in that environment to optimize our business.
So, to wrap up then, what I’m seeing is, you know, new data sources emerging. We’ve got new platforms on a more complicated architecture, if you like, to handle that. And Hadoop is becoming very, very important, for data preparation, for our analytical sandboxes, for archive query (archive from the data warehouse), with data management spreading its wings to go beyond data warehousing into managing data across all of these platforms, and new tools to be able to analyze and access data in these environments, to be able to have scalable technologies to do better ingesting of data, and scaling the analytics by pushing them down into the platforms to make them more parallel. And then, hopefully, also to simplify access to all of it through the emerging SQL access coming in over the top. So, it gives you an idea of kind of where we’re headed. So, with that, I’ll pass back to, I guess, Eric now, is it?
Eric: Okay, that’s fantastic. And folks, I have to say, between what you just got from Robin and Mike, it is probably about as comprehensive and concise an overview of the entire landscape as you’re going to find anywhere. Let me go ahead and queue up George Corugedo first. And there it is. Let me take this for a quick second. Alright, George, I’m about to hand the keys to you, and take it away. The floor is yours.
George: Great! Thank you very much, Eric, and thank you, Rob and Mike. That was great information and lots that we concur on. So, going back to Robin’s discussion, because, you know, it’s not a coincidence that RedPoint is here and SAS is here. Because RedPoint, we really focus on the data side of it on the governance, on the processing of the data and the preparation for use in analytics. So, let me just barge through these two slides. And really talk about and pick up on Robin’s point about MDM and how important it is, and how useful, I think — and we think — Hadoop can be in the world of MDM and data quality.
You know, Robin was talking a little bit about, you know, how this is related to the enterprise data warehouse world and I come — you know, I’ve spent a number of years at Accenture. And what was interesting there is how many times we had to go into companies and try to figure out what to do with the data warehouse that basically had been abandoned. And a lot of that happened because the data warehouse team did not really align their build to the business users or to the consumers of the data. Or, it just took so darn long that by the time they’ve built the thing, the business use or the business rationale for it had evolved.
And one of the things that I’m so excited about, with the idea of using Hadoop for master data management, for data quality and for data preparation, is the fact that you can always go back to the atomic data in a Hadoop data lake or data reservoir, or data repository, or hub, or whatever buzzword you want to use. But because you always keep that atomic data, then you always have an opportunity to realign with the business users. Because, as an analyst (I actually started my career as a statistician), you know, enterprise data warehouses are wonderful for driving the reports, but if you want to do really predictive analytics, they’re really not that useful, because what you really want is the granular behavioral data that somehow got summarized and aggregated in the data warehouse. So, I think that is really an important feature, and that’s one thing that I think I might disagree with Robin on: I personally would leave data in the data lake or the data hub as long as possible, because as long as the data is there and it’s clean, you can look at it from one direction, another direction. You can merge it with other data. You always have that opportunity to come back to it and restructure, and then realign yourself with a business unit and the need that this unit might have.
One of the other kind of interesting things about this is that because it is such a powerful computational platform, a lot of that workload that we’ve been talking about, we see it all coming straight into Hadoop. And while, I think, Mike was talking about all the different technologies that are out there in this type of big data ecosystem, we think that Hadoop really is the workhorse to do that large-scale and computationally intensive processing that master data and data quality require. Because if you can do it there, you know, just the sheer economics of moving data off of your expensive databases and into economical databases, this is really driving so much of the uptake right now in large enterprises.
Now, of course, there are some challenges, right? There are challenges around the technologies. A lot of them are very immature. I’d say, you know, I don’t know how many, but a number of the technologies that Mike mentioned are still on zero-point-something releases, right? So, these technologies are very young, very immature, still code-based. And that really creates a challenge for enterprises. And we really focus on solving enterprise-level problems. And so, we think that there has to be a different way, and that’s what we propose is a different way of going about some of the stuff in using some of these very nascent technologies.
And then the other interesting issue here, which has been mentioned previously, is that when you have data that you’re capturing in a Hadoop environment of whatever type, you know, it is usually schema on read rather than schema on write, with some exceptions. And that reading, a lot of it is being done by statisticians. And so, the statisticians have to have tools that allow them to properly structure the data for analytic purposes, because at the end of the day, to make data useful, it has to be structured in some form to answer a question or create some type of business value.
So, where we come in is that we have a very broad-based and mature ETL, ELT, data quality and master key management application. It’s been in the market for many, many years. And it has all of the functionality, or much of the functionality, that Robin listed out in that circular graph: everything from just pure raw data capture in a whole variety of formats and XML structures and whatnot, to the ability to do all the cleansing, the completion of the data, the correction of the data, the geospatial coding of the data. That’s something that’s becoming more and more important these days with the Internet of Things. You know, there’s geography associated with much of what we do or much of that data. And so, all of the parsing, the tokenization, the cleansing, the correction, the formatting, the structuring, etc., all of that is done in our platform.
And then, perhaps most importantly, we think, is the idea of deduplication. You know, at the core, if you look at any definition of master data management, the core of it is deduplication. It’s being able to identify entities across different sources of data, and then create a master record for that entity. And that entity could be a person. The entity could be a part of an airplane, for example. The entity could be a food, like we’ve done for one of our health club clients. We’ve created a master food database for them. So, whatever the entities are that we’re working with, and of course, increasingly, they are people and the proxies for their identities, which are things like social handles or email accounts, whatever devices are associated with people, some things like cars and phones, and whatever else you might imagine.
You know, we’re working with a client who’s putting all sorts of sensors into sportswear. So, the data is coming from every direction. And in one way or another, it’s a reflection or representation of a core entity. And increasingly, that’s people. And the ability to identify the relationships between all these sources of data and how they relate to that core entity, and then being able to track that core entity over time so that you can analyze and understand the changes between that entity and all of those other elements that are representations of that entity, is really critical to long-term and longitudinal analysis of people, for example. And that’s really one of the really important benefits that, I think, big data can bring us: a much better understanding of people over the long term, and an understanding of context and how people are behaving, when they’re behaving, through what devices, etc.
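The idea of identifying one entity across sources and building a master record can be sketched in a few lines of Python. This is a deliberately simple exact-key approach with hypothetical field names and sample records; it is not RedPoint’s matching logic, and production MDM tools typically add fuzzy and probabilistic matching on top of this kind of key.

```python
from collections import defaultdict

def match_key(record):
    """Crude match key: normalized email if present, else normalized name plus zip."""
    email = (record.get("email") or "").strip().lower()
    if email:
        return ("email", email)
    name = "".join(ch for ch in (record.get("name") or "").lower() if ch.isalnum())
    return ("name_zip", name, record.get("zip"))

def build_master(records):
    """Group records that share a match key and merge each group into one master record."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec)

    masters = []
    for recs in groups.values():
        master = {"sources": sorted({r["source"] for r in recs})}
        for field in ("name", "email", "zip"):
            # Survivorship rule (an assumption): keep the first non-empty value seen.
            master[field] = next((r[field] for r in recs if r.get(field)), None)
        masters.append(master)
    return masters

# Hypothetical records for the same person arriving from different systems.
records = [
    {"source": "crm", "name": "Ann Lee", "email": "Ann.Lee@Example.com", "zip": None},
    {"source": "web", "name": "A. Lee", "email": "ann.lee@example.com ", "zip": "02482"},
    {"source": "pos", "name": "Bob Ray", "email": None, "zip": "10001"},
]

masters = build_master(records)
```

Here the CRM and web records collapse into one master that carries the zip code only the web source knew, while the point-of-sale record stays separate.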
So, let me move through here quickly. Eric mentioned YARN. You know, I throw this in just for a sec, because people talk about YARN, but there’s still a lot of ignorance, I think, about YARN. There’s still a lot of misunderstanding about YARN. And the fact is that if your application has been architected in the right way, and you have the proper level of parallelization in your application architecture, then you can take advantage of YARN to use Hadoop as your scaling platform. And that’s exactly what we’ve done.
You know, again, just to point out some of the definitions around YARN. To us, really what YARN has done is enable us and other organizations to become peers to MapReduce and Spark, and all the other tools that are out there. But the fact is that our applications drive optimized code directly into YARN, into Hadoop. And there’s a really interesting comment that Mike mentioned, because, you know, the question about analytics is: just because they’re in the cluster, are they really running in parallel? You can ask the same question about a lot of the data quality tools that are out there.
Most of the data quality tools that are out there either have to take the data out or they’re pushing code in. And in a lot of cases, it’s a single stream of data that’s getting processed because of the way you have to compare records, sometimes, in data-quality type of activities. And the fact is that because we’re utilizing YARN, we’ve been able to really take advantage of the parallelization.
And just to give you a quick overview, because another comment was made about the importance of being able to span traditional databases, new databases, etc.: we implement, or we install, outside of the cluster. And we push our binaries directly into the resource manager, YARN. And then YARN distributes them across the nodes in the cluster. And what that does is, we allow YARN to manage and do its job, which is to figure out where the data is and take the work to the data, the code to the data, and not move the data around. When you hear data quality tools and they’re telling you best practice is to move the data out of Hadoop, run for your life, because that’s just not the way it is. You want to take the work to the data. And that’s what YARN does first. It takes our binaries out to the nodes where the data resides.
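The "take the work to the data" principle can be illustrated on a single machine. The sketch below is an analogy only: it does not use any YARN API, the node names and address rows are hypothetical, and threads stand in for cluster nodes. What it shows is the shape of the idea: a small function is sent to each partition, and only a small summary comes back, rather than the rows themselves.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions standing in for data blocks held on three cluster nodes.
partitions = {
    "node-1": ["10 Main St", "", "22 Oak Ave"],
    "node-2": ["5 Elm Rd", "77 Pine Ln"],
    "node-3": ["", "", "1 Lake Dr"],
}

def quality_task(rows):
    """The work shipped to each partition: count rows and blank addresses."""
    blank = sum(1 for r in rows if not r.strip())
    return {"rows": len(rows), "blank": blank}

def run_on_partitions(task, parts):
    """Run the task once per partition; only the small summaries travel back."""
    with ThreadPoolExecutor(max_workers=len(parts)) as pool:
        futures = {node: pool.submit(task, rows) for node, rows in parts.items()}
        return {node: fut.result() for node, fut in futures.items()}

summary = run_on_partitions(quality_task, partitions)
total_rows = sum(s["rows"] for s in summary.values())
total_blank = sum(s["blank"] for s in summary.values())
```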
And also because we’re outside of the cluster, we can also access all of the traditional and relational databases so we can have jobs that are 100% client server on a traditional database, 100% Hadoop or hybrid jobs that go across Hadoop client server, Oracle, Teradata — whatever it is you want and all in the same job, because that one implementation can access both sides of the world.
And then, going back to the whole idea of the nascency of the tools, you see here, this is just a simple representation. And what we’re trying to do is simplify the world. And the way we do that is by bringing a very broad set of functionality around HDFS to make it… And it’s not because we’re trying to eliminate all the innovative technologies out there. It’s just enterprises need stability, and they don’t like code-based solutions. And so, what we’re trying to do is give enterprises a familiar, repeatable, consistent application environment that gives them the ability to build and process data in a very predictable way.
Quickly, this is the kind of impact we get with our application. You see MapReduce vs. Pig vs. RedPoint — no lines of code in RedPoint. Six hours of development at MapReduce, three hours of development in Pig, and 15 minutes of development in RedPoint. And that’s where we really have a huge impact. The processing time is also faster, but the people time, the people productivity time, is significantly increased.
And my final slide here, I want to go back to this idea, because this is our take on using a data lake or a data hub, or a data refinery as the central point of ingestion. Couldn’t agree more with that idea. And we’re currently in discussions with lots of the chief data officers of major global banks, and this is the architecture of choice. Ingest data from all sources, do the data quality processing and master data management inside the data lake, and then push data where it needs to go to support applications, to support BI, whatever that might be. And then, if you do have analytics and BI that can run directly inside the data lake, well, all the better, that can start right away. But we’re very much on board with this idea. This topology here is one that we’re finding is gaining a lot of traction out in the market. And, that’s it.
Eric: Okay, good. Let’s move right along here. I’ll go ahead and hand it over to Keith. And, Keith, you’ve got about 10, 12 minutes to rock the house here. We tend to go a little bit long in these shows. And we advertised 70 minutes for this one. So, just go ahead and click anywhere on that slide and use the down arrow and take it away.
Keith: Sure. No problem, Eric. I appreciate it. I’m going to go ahead and hit just a couple of pieces about SAS, then I’ll move into, right into technology architectures of where SAS intersects with the big data world. There’s a lot to explain in all of this stuff. We could spend hours going through it in great detail, but ten minutes — you should be able to walk away with just a brief understanding of where SAS has taken analytics, data management and business intelligence technologies into this big data world.
First, just a little bit about SAS. If you’re not familiar with this organization, for the last 38 years we’ve been doing advanced analytics, business intelligence and data management with not just big data, but small data as well. We have an enormous existing customer footprint, about 75,000 sites all across the world, working with some of the top organizations out there. We are a private organization with about 13,000 employees and $3 billion of revenue. And really, I guess, the important part is we’ve traditionally had a long-standing history of reinvesting significant amounts of our revenue back into our R&D organization, which has really brought to bear a lot of these amazing technologies and platforms you’re going to see today.
So, I’m going to jump right into these really scary architecture diagrams. We’ll work from left to right in my slides. So, there’s familiar things that you’re going to see inside this platform. On the left-hand side, all those data sources that we’re talking about ingesting into these big data platforms. And then, you’ve got this big data platform.
I haven’t just put the word Hadoop there at the top, because ultimately, the examples I’m going to give today are specifically around all the technologies where we intersect with these big data platforms. Hadoop just happens to be one of the ones where we have some of the most robust deployment options, but we also intersect quite a bit and have developed a lot of these technologies for some time with some of our other enterprise data warehouse partners like Teradata, Oracle, Pivotal and the like. So, I can’t go into great detail as to which technologies are supported on which platform, but just rest assured that the ones I describe today are mostly all on Hadoop, and a vast amount of them intersect with other technology partners that we have. So, we’ve got that big data platform sitting there.
The next one just to the right, we have our SAS LASR Analytic Server. Now, that essentially is a massively parallel in-memory analytic application server. To be clear, it’s not an in-memory database. It’s really designed from the ground up. It’s not a query engine, but designed to service analytic requests at massive scale in a massively parallel way. So, that services the key applications you see there on the right-hand side.
We’ll get a little bit more into, you know, how people deploy these things. But essentially, of the applications you see there, the first one is our SAS High-Performance Analytics. That’s going to be using a lot of our existing technology and platforms like Enterprise Miner or just SAS, and not just doing multithreading with some of those algorithms that we’ve got built into those tools that we’ve done for years, but also making those massively parallel. So, we move the data from that big data platform into the memory space of that LASR Analytic Server, so that we can execute analytic algorithms (you know, a lot of the new machine learning, neural nets, random forests, regressions, those kinds of things), again with the data sitting in memory. So, we’re getting rid of that MapReduce paradigm bottleneck where we get filed down to those platforms, because that’s not the way you want to do analytic work. So, we want to be able to lift the data one time into the memory space and iterate through it, you know, sometimes thousands of times. So, that’s the concept of using that high-performance LASR Analytic Server.
The other applications below it, the visual analytics, allow us to persist that data in memory and serve up a larger population on the same data. So, allowing people to do big data exploration. So, prior to doing our model development work, we’re exploring data, getting to understand it, running correlations, doing forecasting or trending, decision trees, those kinds of things, but in a very visual, interactive way on data that’s sitting in the in-memory platform. That also services our BI community as far as having a very broad base of users that can hit that platform to do the standard kinds of reporting that you’d see with pretty much any, you know, BI vendor out there.
The next step, we move then into service. And to help our statisticians and our analytics folks to be able to do that kind of ad-hoc modeling with data sitting in memory, we move from visual analytics and exploration into our visual statistics application. This is an opportunity for people to not run statistics in batches, where they used to kind of iterate through: run the model, see the results, run the model, see the results. This is to visually drag and drop into interactive statistical modeling. So, this services our statisticians and our data scientists to do a lot of that early exploratory visual statistics work.
And then, we haven’t forgotten our coders, the folks that really do want to be able to peel the layers of interface off, to write applications, and to write their own code base in SAS. And that’s our in-memory statistics for Hadoop. And that is essentially the code layer that allows us to interact with that LASR Analytic Server to issue commands directly and customize those applications based on our request. That’s the analytic piece.
How these things get set up… Oops, I’m sorry guys. There we go.
So, there’s really a couple of ways in which we do this. One is to do it with big data, in this case, with Hadoop. And that’s where we have that SAS LASR Analytic Server running in a separate cluster of machines that are optimized for hardcore analytics. This is nestled nice and close up to the big data platform, allowing us to scale it separately from the big data platform. So, we see people doing this when they don’t want to have sort of what I characterize as vampire software eating away at each of the nodes of their Hadoop cluster. And they don’t necessarily scale that big data platform appropriately for doing heavy lifting in-memory analytics. So, they might have 120 nodes in their Hadoop cluster, but they might have 16 nodes of analytic servers that are designed to do that kind of work.
We are still able to maintain that parallelism from the big data platform to pull the data into memory. So, it really is using SAS with the Hadoop platform. A different deployment model then is to say, well, we can use that commodity platform as well and essentially run the LASR Analytic Server on the Hadoop platform. So, that’s where you’re operating inside the big data platform. That also applies to some of our other appliance vendors. So, that’s allowed us to essentially use that commodity platform to do that work.
We see that more often with things like high-performance analytics, where it’s a single-serving or single-use kind of analytic run, more kind of batch oriented, where you don’t want to necessarily consume the memory space on the Hadoop platform. We’re very flexible on this kind of deployment model, and we’re definitely working with YARN in a lot of these cases to make sure that we’re playing nice in clusters.
Okay, so that’s the analytic world, just to be clear there with the analytic application. But I mentioned in the very beginning that SAS is also a data management platform. And there are things that are appropriate to push logic into that platform. So, there are a couple of ways in which we do that. One is in the data integration world: when doing data transformation work on data, it may not make sense to pull it back out, as we’ve heard before. Running data quality routines, that’s a big one. We want to definitely push things like data quality routines down into that platform. And then, things like model scoring. So, I’ve got my model developed. I don’t want to rewrite that thing in MapReduce and make it difficult and time consuming for me to redo that work in the native database platform.
So, if you look at, for example, our scoring accelerator for Hadoop, that allows us to essentially take a model and push the SAS mathematical logic down into that Hadoop platform and execute it there, using the parallelism that’s inside that big data platform. We then have our code accelerator for various platforms including Hadoop, and that allows us to essentially run SAS data step code inside the platform in a massively parallel way, so, doing data transformation kinds of work in the platform. And then our SAS data quality accelerator, which allows us to have a quality knowledge base sitting there that can do things like gender matching, standardization, match codes, all the different data quality things that you’ve heard already today.
And then, last piece, there’s Data Loader. We know our business users are going to have to be able to do data transformation work in these big data platforms without having to write code. Data Loader is a nice WYSIWYG GUI that allows us to wrap up those other technologies together. It’s like a walk-through wizard to, say, run a Hive query or run a data quality routine and not have to write code in that case.
The last thing I’ll mention is this front piece. We have, as I mentioned before, a massive SAS footprint out there in the world. And we can’t just necessarily move all of those platforms that are out there to be in this space immediately. So, we definitely have an existing footprint of users that need to get at data sitting in these big data platforms, such as getting data out of Teradata and putting it back into Hadoop, and vice versa. Running the models I already know how to run on my SAS servers, but I need to get at data that’s now being placed in the Hadoop platform. So, there’s this other little icon there called "from," and that allows us to connect using our SAS access engines: access engines to Hadoop, to Cloudera Impala, to Teradata, to Greenplum… And the list goes on. This allows us to use our existing mature SAS platforms that are already in place to get data from these platforms, do the work that we need to get done, and push results back into these areas.
The last thing I’ll make mention of is that all of these technologies you see are governed by the same standard common metadata. So, we talk about getting the transformation work, the data quality rule work, moving it into memory to be able to do analytics, model development and scoring. We’ve got the entire analytic lifestyle, lifecycle, being governed by common metadata, by governance, by security, by all of the things that we talked about earlier today.
So, just a recap, there’s really those three big things to take away there. One is, we can treat the data platform just like any other data source, pulling from it, pushing to it when it’s appropriate and convenient. We can work with those big data platforms, lifting data into a purpose-built advanced analytic in-memory platform. So, that’s the LASR server.
And then, last, we can work directly in those big data platforms, leveraging their distributed processing capabilities without moving the data around.
Eric: Well, that is fantastic stuff, folks. Yeah, this is great! So, let’s dive right in to some questions. We typically go about 70 minutes or a little bit longer on these events. So, I see we still have a great audience sitting out there. George, I guess I’ll throw our first question over to you. If you talk about pushing your binaries down into Hadoop, that sounds to me like you have really optimized the computational workflow. And that’s the whole key in order to be able to do these kinds of real-time data governance, data quality style achievements, because that’s the value you want to get, right? You don’t want to go back to the old world of MDM where it’s very cumbersome and it’s very time-consuming, and you really have to force people to act in certain ways, which almost never works. And so, what you’ve done is, you’ve condensed the cycle of what was, let’s call it, days, weeks, sometimes even months down to seconds, right? Is that what’s going on?
George: That’s exactly right, because the scale we get and the performance we get out of a cluster is really staggering in terms of, just, you know, I’m always a little hesitant about benchmarks. But just for the order of magnitude, when we would run a billion, 1.2 billion records and do a complete address standardization — I’m saying mid-range HP machine — it would take, like, you know, eight processor machines, you know, 2 gigs of RAM per core, you know, that would take 20 hours to run. We can do that in about eight minutes now on a, you know, 12-node cluster. And so, the scale of the processing that we can do now is so dramatically different that — and it goes very nicely with the idea that you have all this data at your disposal. So, it’s not as risky to do the processing. If you did it wrong, you can redo it. You’ve got time, you know. It really changed the scale of this where, you know, those kinds of risks really became real business problems for people when they were trying to operate MDM solutions. You have to have 30 people offshore doing data governance and everything. And so, you still have to have some of that, but the speed and the scale at which you can process it now, really gives you a lot more breathing room.
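To put George’s order-of-magnitude figures in one place, here is the arithmetic as a short Python sketch. It uses only the numbers quoted above (1.2 billion records, roughly 20 hours before, about eight minutes on a 12-node cluster) and assumes, as a simplification, that work is spread evenly across the nodes; as he says, these are rough figures rather than a formal benchmark.

```python
records = 1_200_000_000      # records in the address standardization run
old_seconds = 20 * 3600      # about 20 hours on the single mid-range machine
new_seconds = 8 * 60         # about 8 minutes on the cluster
nodes = 12                   # cluster size quoted in the talk

speedup = old_seconds / new_seconds        # how many times faster the cluster run is
old_rate = records / old_seconds           # records per second, before
new_rate = records / new_seconds           # records per second, after
per_node_rate = new_rate / nodes           # assumes an even spread across nodes

print(f"speedup: {speedup:.0f}x")
print(f"before: {old_rate:,.0f} records/s, after: {new_rate:,.0f} records/s")
print(f"per node: {per_node_rate:,.0f} records/s")
```

That works out to roughly a 150x reduction in elapsed time, which is what makes rerunning a job after a mistake a matter of minutes rather than a lost day.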
Eric: Yeah, that’s a really, really good point. I love that comment. So, you have the time to redo it again. That’s fantastic.
Eric: Well, it changes the dynamics, right? It changes how you think about what you’re going to try. I mean, I remember this 18 years ago in the industry of doing special effects, because I had a client that was in that space. And you would push the buttons to render it and you’d go home. And you’d come back, maybe on Saturday afternoon, to see how it was going. But if you got it wrong, that was very, very, very painful. And now, it’s not nearly — it’s not even close to being that painful so you have the opportunity to try more stuff. I have to say, I think that’s a really, really good point.
George: That’s exactly right. Yeah, and you blow your extra leg. You know, you get halfway through a job in the old days and it fails, you’ve blown your SOS. That’s it.
Eric: Right. And you’re in big trouble, yeah. That’s right.
George: That’s right. That’s right.
Eric: Keith, let me throw one over to you. I remember doing an interview with your CIL, Keith Collins, I believe, back in, I think, 2011 maybe. And he talked a great deal about the direction SAS was taking specifically with respect to working with customers to embed the analytics derived from SAS into operational systems. And of course, we heard Mike Ferguson talk about the importance of remembering. The whole idea here is you want to be able to tie this stuff into your operations. You don’t want analysis in a vacuum, disconnected from the enterprise. That’s no value whatsoever.
If you want analysis that can directly impact and optimize operations. And if I look back — and I have to say, I thought it’s a good idea back then — it seems like a really, really smart idea in retrospect. And I’m guessing, that’s a real advantage that you guys have. And of course, this great legacy, this huge install base, and the fact that you’ve been focused on embedding these analytics in operational systems, which means now — and granted, it’s going to take some working — I’m sure you’ve been working on it quite hard. But now, you can leverage all of these new innovations and are really in the [inaudible 01:10:11] in terms of being able to operationalize all that stuff with your customers. Is that a fair assessment?
Keith: Yeah, absolutely. The concept is, you get this idea of decision design or decision sciences which is, you know, to some degree that’s exploratory, science-y kind of thing. Unless you can do engineering on the process to really… If you think about developing a car, you’ve got designers that make this beautiful car, but it isn’t until engineers put that plan in place and make an actual viable product before you can actually put things in place, and that’s essentially what SAS has done. It has merged the decisions — decision-designing process with the decision-engineering process together, so that when you talk about the accelerators, the scoring accelerators specifically, you know, if you take a model that you developed and be able to push it out to Teradata, or push it out to Oracle or to Hadoop, with zero downtime for model development, to model deployment. That’s key, because models degrade over time, the accuracy of those models. So, the longer it takes for you to take that and put it into production, that’s model accuracy loss.
And then, the other piece is, you want to be able to monitor and manage that process over time. You want to deprecate models when they get old and inaccurate. You want to check their accuracy over time and rebuild them. And so, we’ve got model management tools that sit on top of that, too, that really track the metadata around the modeling process. And people have said that modeling, you know, that kind of concept is like a model factory, or whatever you want to call it. The thing is, it’s putting metadata and management into the process, and that’s where we hit the three big things: we help people make money, save money and keep them out of jail.
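[Editor’s note: The monitoring idea Keith describes, tracking a model’s accuracy over time and flagging it for a rebuild once it degrades, can be sketched roughly as follows. This is a minimal illustration only, not SAS’s model management tooling; the `ModelRecord` class, its fields and the 0.05 drop threshold are all hypothetical.]

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    """Metadata kept for one deployed model (hypothetical structure)."""
    name: str
    baseline_accuracy: float          # accuracy measured at deployment time
    max_drop: float = 0.05            # assumed tolerance before a rebuild is advised
    history: list = field(default_factory=list)

    def log_accuracy(self, accuracy: float) -> None:
        """Record the accuracy observed in the latest monitoring period."""
        self.history.append(accuracy)

    def status(self) -> str:
        """Return 'unmonitored', 'healthy' or 'rebuild' from the latest reading."""
        if not self.history:
            return "unmonitored"
        drop = self.baseline_accuracy - self.history[-1]
        return "rebuild" if drop > self.max_drop else "healthy"


if __name__ == "__main__":
    record = ModelRecord(name="churn_model", baseline_accuracy=0.90)
    record.log_accuracy(0.89)
    print(record.name, record.status())
    record.log_accuracy(0.82)
    print(record.name, record.status())
```

[A real deployment would presumably also track when each reading was taken and which data it was scored on; the sketch keeps only the latest accuracy comparison.]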
Eric: That last one is pretty big, too. I’m looking to avoid all that. So, let’s talk about ... I’m giving one final question, and maybe you can both kind of jump on this. The heterogeneity of our world will only increase, it seems to me. I think we’re definitely going to see some crystallization around hybrid cloud environments. But nonetheless, you’re going to see lots of the major players sticking around. IBM is not going anywhere. Oracle is not going anywhere. SAP is not going anywhere. And there are so many other vendors that are involved in this game.
Also, on the operational side, you’ve got literally thousands and thousands of different kinds of applications. And I heard most of you talk about this, but I think you both would agree with what I’ve been saying. We’ve seen this trend now in terms of just computational power in analytical engines, architecture. Companies have been talking for years now about being able to tap into the other engines out there and serve as a sort of orchestration point. And I guess, George, I’ll throw it to you first. It seems to me that’s something that’s not going to change. We’re going to have this heterogeneous environment, which means there’s stuff like real-time CRM and data quality and data governance. You will need, as a vendor, to interface with all those different tools. And that’s what the customers are going to want. They’re not going to want something that does it okay with these tools and not so okay with those tools. They’re going to want the Switzerland of MDM and CRM, right?
George: [laughs] That’s right. And it’s interesting, because we very much have embraced that. Part of it is the history we had in the space. And obviously, we were already working on all of the other databases, the Teradatas and the pieces of the world. And then, in the implementation process, we built it specifically the way we did so that you have that span across all of these various databases. One of the things that I find interesting is that we do have some clients that are just hell-bent on eliminating all relational databases. And that’s interesting. You know, I mean, it’s fine. It’s interesting. But I just don’t see it really happening at a large enterprise scale. I don’t see it happening for a long time. So, I think hybrid is here for a good long time. And on the other side of our application, where we have our messaging platform and our campaign management platform, we’ve actually specifically designed for it. Now, we’ve released a version that does that and that can connect to the hybrid data environment and query Hadoop, or query any database, any analytic database. So, I think that’s just the wave of the future. And I do agree that virtualization will certainly play a big role in this, but we’re going right out to the data on all our applications.
Eric: Okay, great. And, Keith, I’ll throw it over to you. What do you think about the heterogeneous world we’re facing in acting as a footprint of sorts?
Keith: Yeah, it’s really fascinating. I think what we find more, not just in the data management side of things, but what’s really fascinating right now is the open-source nature of the analytics space. So, we see technologies like Spark coming on board, and people using Python and R and all these other open-source technologies. I think it could be interpreted as sort of a conflict or a threat to some degree. But the reality is, we have some really wonderful complements with all those open-source technologies. I mean, for one, we’re operating on top of open-source platforms, for God’s sake.
But also, being able to integrate, for example, an R model into a SAS paradigm allows you to use the best of both worlds, right? So, we know that some of the experimental things in the academic world and some of the model development work is extraordinary and super helpful in the model development process. But also, if you could pair that with a production-class kind of tool, it does a lot of the cleansing and quality and checking, and makes sure the data going into the model has been prepped properly so it doesn’t fail on execution. And then, being able to do things like champion/challenger models with open-source models. Those are the things that we’re looking to enable as part of this really heterogeneous ecosystem of all these technologies. Yeah, so for us, it’s more about embracing those technologies and looking for the complements.
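[Editor’s note: A champion/challenger comparison, as mentioned above, pits the model currently in production (the champion) against a candidate (the challenger) on the same held-out data and promotes the candidate only if it is clearly better. The sketch below is a generic illustration under stated assumptions, not SAS functionality; the function names, the toy threshold models and the 0.02 promotion margin are hypothetical.]

```python
from typing import Callable, Sequence

# A "model" here is just any function mapping one feature value to a 0/1 label.
Model = Callable[[float], int]


def accuracy(model: Model, features: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of held-out examples the model labels correctly."""
    if not labels or len(features) != len(labels):
        raise ValueError("features and labels must be non-empty and the same length")
    correct = sum(1 for x, y in zip(features, labels) if model(x) == y)
    return correct / len(labels)


def pick_winner(champion: Model, challenger: Model,
                features: Sequence[float], labels: Sequence[int],
                margin: float = 0.02) -> str:
    """Promote the challenger only if it beats the champion by more than `margin`."""
    champion_score = accuracy(champion, features, labels)
    challenger_score = accuracy(challenger, features, labels)
    return "challenger" if challenger_score - champion_score > margin else "champion"


if __name__ == "__main__":
    holdout_x = [0.1, 0.4, 0.6, 0.9]
    holdout_y = [0, 0, 1, 1]
    current = lambda x: int(x > 0.3)     # toy champion: threshold at 0.3
    candidate = lambda x: int(x > 0.5)   # toy challenger: threshold at 0.5
    print(pick_winner(current, candidate, holdout_x, holdout_y))
```

[Requiring a margin rather than any improvement at all is one way to avoid swapping models over differences that are just noise; the right margin depends on the size of the holdout set.]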
Eric: Well, this has been fantastic stuff, folks. We went a little bit long here, but we’d like to get to as many questions as possible. We will forward the Q&A file to our presenters today. So, if any question you asked was not answered, we’ll make sure that it gets answered. And folks, this wraps it up for 2014. Yours truly will be on DM Radio tomorrow and next week, and then it’s all done and it’s a holiday break.
So, many thanks to all of you for your time and attention, for sticking through all these wonderful webcasts. We’ve got a great year lined up for 2015. And we’ll talk to you soon, folks. Thanks again. Take care. Bye-bye.