Eric Kavanagh: Okay ladies and gentlemen, it is four o’clock Eastern on a Thursday, and these days that means it’s of course time for Hot Technologies. Yes indeed, my name is Eric Kavanagh. I will be your moderator for today’s web seminar. It’s good stuff, folks, “Big Iron, Meet Big Data” – I just love that headline – “Liberating Mainframe Data with Hadoop and Spark.” We’re going to talk about old meets new. Wow! We’re covering the spectrum of everything we’ve talked about in the last 50 years of enterprise IT. Spark meets mainframe, I love it.
There’s a spot about yours truly and enough about me. The year is hot. We talk about hot topics in this series because we’re really trying to help folks understand certain disciplines, certain spaces. What does it mean to, for example, have an analytic platform? What does it mean to liberate big data from mainframes? What does all this stuff mean? We’re trying to help you understand specific kinds of technologies, where they fit into the mix and how you can make use of them.
We have two analysts today and then of course Tendü Yogurtçu of Syncsort. She’s a visionary in our space, very pleased to have her online today, with our own Dez Blanchfield and Dr. Robin Bloor. I’ll say just a couple of quick words. One is that, folks, you play a big part in this process, so please don’t be shy asking some good questions. We’d like to get to them during the Q&A component of the webcast, which is usually at the end of the show. And all I have to say is we’ve got a lot of good content, so I’m excited to hear what these boys have to say. And with that, I’m going to hand it over to Dez Blanchfield. Dez, the floor is yours, take it away.
Dez Blanchfield: Thank you, Eric, and thank you everyone for attending today. So I get quite excited when I get a chance to talk about one of my favorite things in the world, mainframes. They don’t get much love these days. My view is the mainframe was the original big data platform. Some would argue that they were the only computer at the time and that’s a fair point to make, but for over 60 years now they have really been the engine room of what has lately become popular as big data. And I’m going to take you on a little journey on why I believe that is the case.
We’ve seen a journey in the technology hardware stacks in the context of mainframes shift from the image that you see on the screen now. This is an old FACOM mainframe, one of my favorites. We’ve moved ourselves through into the big iron phase, the late nineties and the dot-com boom. This is the Sun Microsystems E10000. This thing was an absolute monster at 96 CPUs. Originally 64 but it could be upgraded to 96 CPUs. Each CPU could run 1,024 threads. Each thread could be an application running at the same time. It was just monstrous and it actually powered the dot-com boom. This is what all the big unicorns, as we call them now, were running on, and not just the big enterprises but some of the big websites.
And then we ended up with this common off-the-shelf commodity PC model. We just strapped lots of cheap machines together and we created a cluster and we approached the big iron challenge and what became big data, particularly in the form of the Hadoop project that stemmed out of the open-source search engine, Nutch. And we essentially recreated the mainframe with lots of little CPUs being glued together and being able to act like LPARs, in the form of running separate jobs or parts of jobs, and they were quite effective in many ways. Cheaper if you started out smaller, but invariably many of these big clusters have gotten more expensive than a mainframe.
My view on these things is that in the rush from the dot-com boom through to what became Web 2.0 and now chasing unicorns, we’ve forgotten that there’s this platform still powering many of our biggest mission-critical systems out there. When we think about what’s running on the mainframe platforms out there, it is very much the big data, particularly the data workhorse, but certainly big data. Traditional enterprise and government systems like banking and wealth management and insurance in particular, which we all use every day.
Airline booking and flight management systems, particularly flight management where real-time is critical. Almost every state and federal government at some time has had a mainframe and invariably many do still have them. Retail and manufacturing. Some of the old software that’s just been around and has never gone away. Just continues to power manufacturing environments and certainly retail at scale. Medical systems. Defense systems, certainly defense systems.
This last couple of weeks I’ve read many articles about the fact that some of the missile control systems are all still running on old mainframes they’re struggling to find parts for. They’re figuring out how to upgrade into new mainframes. Transport and logistics systems. These may not sound like the sexy topics but these are the topics that we deal with on a daily basis across the lines. And some very large telecommunications environments are still run on mainframe platforms.
When you think about the types of data that are in there, they’re all mission critical. They’re really important platforms and platforms we take for granted every day and in many ways make life possible. So who’s still using a mainframe and who are all these people that are holding on to these big platforms and holding all this data? Well, as I said here I believe it’s easy to be fooled by the media’s shift from big iron to racks of common off-the-shelf clusters or cheap PCs or x86 machines, into thinking that the mainframe died and went away. But the data says the mainframe never went away and in fact it’s here to stay.
The research I’ve put together here in the last couple of weeks has shown that 70 percent of enterprise, particularly large enterprise, data still actually resides on a mainframe of some form. Seventy-one percent of Fortune 500s still run core business systems on mainframes somewhere. In fact, here in Australia, we have a number of organizations that have a data center in the middle of a city. It’s an actual underground computer effectively, with a number of mainframes just running there, ticking and happily doing their job. And very few people walking around the streets know that right under their feet, in one particular part of the city, there’s this huge data center filled with mainframes. Ninety-two out of 100 of the banks around the world, the top 100 banks that is, still run banking systems on mainframes. Twenty-three of the top 25 retail chains around the world still use mainframes to run their retail management systems in ERP and BI platforms.
Interestingly enough, 10 out of the top 10 insurers still run their platforms on mainframe, and they actually power their cloud services on mainframe. If you’re using a web interface or a mobile app somewhere, there’s middleware and an interface that actually talks to something really heavy and big at the back end.
I found over 225 state and local government agencies worldwide running on mainframe platforms still. I’m sure there’s a lot of reason for that. Maybe they don’t have the budget to consider new iron but that’s a huge footprint of very large environments running on mainframe with some very critical data. And as I mentioned earlier, most nations still run their key defense systems on mainframe. I’m sure in many ways they’re trying to get off there but there you go.
In 2015 IDC ran a survey and 350 of the CIOs surveyed reported they still owned and managed big iron in the form of mainframes. And it struck me that it’s likely that it’s more than the number of large-scale Hadoop clusters currently running worldwide in production – an interesting little stat there. I’m going to go ahead and validate that, but it was a big number. Three hundred fifty CIOs reported they have one or more mainframes still in production.
Last year, 2015, IBM gave us the mighty Z13, the 13th iteration of their mainframe platform. The media went wild about this thing because they were astounded that IBM was still making mainframes. When they lifted the hood and had a look at what was under the thing, they realized that it was actually on par with almost every modern platform that we’d gotten excited about in the form of big data, Hadoop and certainly the clusters. This thing ran Spark and now Hadoop natively. You could run thousands and thousands of Linux machines on it and it looked and felt like any other cluster. It was quite an astounding machine.
A number of organizations took these things up and in fact I dug up some data on just how many of these machines are being taken up. Now I’ve had the view that the 3270 text terminal has been replaced by web browsers and mobile apps for some time and there’s plenty of data that supports that. I think now we’re entering an era where we’ve realized that these mainframes are not going away and there’s a substantial amount of data on them. And so what we’re doing now is simply adding what I call off-the-shelf analytics tools. These are not custom-built apps. These are not things that are bespoke one-offs. These are things that you can quite literally just buy in a packaged box per se and plug into your mainframe and do some analytics.
As I said before, the mainframe’s been around for over 60 years, in fact. When we think about how long that is, that’s longer than most living IT professionals’ careers actually span. And in fact probably some of their lives, even. In 2002 IBM sold 2,300 mainframes. In 2013 that grew to 2,700 mainframes. That’s 2,700 sales of mainframes in one year in 2013. I couldn’t get accurate data on 2015 but I imagine it’s rapidly getting close to 3,000 units sold a year in 2015. And I look forward to being able to confirm that.
With the release of the Z13, the 13th iteration of a mainframe platform, which I think cost them around about 1.2 or 1.3 billion dollars to develop from scratch, IBM that is, here’s a machine that looks and feels just like any other cluster that we have today, and natively runs Hadoop and Spark. And can certainly be connected to from other analytics and big data tools or invariably be connected to one of your existing or new Hadoop clusters. I have this view that including the mainframe platform in your big data strategy is a must. Obviously, if you have one, you’ve got a lot of data and you want to figure out how to get it off there. And they’re being left to collect dust in many ways, mentally and emotionally as far as the business world goes, but they’re here to stay.
Connectivity and interfaces for all of your analytics tools to mainframe-hosted data should be a key part of your enterprise and particularly government big data plans. And invariably now software vendors are noticing them, taking a good long look at them, realizing what’s inside these things and connecting to them to start getting a bit of insight and a bit of a feel for what’s actually under the hood. And with that I’m going to hand over to my dear colleague, Dr. Robin Bloor and he’ll add to that little journey. Robin, take it away.
Robin Bloor: Well, thank you. Okay, well since Dez has sung the song of the mainframe, I shall go into what I think is happening in terms of the old mainframe world and the new Hadoop world. I guess the big question here is, how do you manage all that data? It’s not my opinion that the mainframe is being challenged in respect of its big data capability – as Dez has pointed out, it’s extremely capable. In actual fact you can put Hadoop clusters on it. Where it’s being challenged is in terms of its ecosystem and I’ll kind of elaborate on that.
Here’s some mainframe positioning. It has a high entry cost and what has actually happened in the past, since the mid-’90s when the popularity of the mainframes started to dip, is that it has tended to lose its low end, those people that had bought cheap mainframes, because it wasn’t really particularly economic for those people. But higher up actually in the mid-range and high-range of the mainframe it still actually was, and demonstrably actually is, incredibly inexpensive computing.
It was, it has to be said, rescued by Linux because Linux implemented on a mainframe made it possible of course to run all the Linux applications. A lot of Linux applications went there before big data was even a word, or two words I suppose. It’s actually a fairly excellent platform for private cloud. Because of that it can participate in hybrid cloud deployments. One of the problems is that mainframe skills are in short supply. The mainframe skills that exist are actually aging in the sense that people leave the industry for retirement year after year and they’re only just being replaced in terms of the number of people. So that’s an issue. But it still is inexpensive computing.
The area where it’s been challenged of course is this whole Hadoop thing. That’s a picture of Doug Cutting with the original Hadoop elephant. The Hadoop ecosystem is – and it’s going to remain – the dominant big data ecosystem. It offers better scale out than the mainframe can actually achieve and it’s lower cost as a data store by a long way. The Hadoop ecosystem is evolving. The best way to kind of think about this is once a particular hardware platform and the operating environment with it becomes dominant, then the ecosystem just comes alive. And that happened with the IBM mainframe. Well, later happened with the Digital VAX, happened with Sun’s servers, happened with Windows, happened with Linux.
And what’s happened is that Hadoop, which I always think of, or like to think of, as a kind of distributed environment for data, the ecosystem is evolving at an incredible rate. I mean if you just mention the various impressive contributions that are open source, Spark, Flink, Kafka, Presto, and then you add into that some of the databases, the NoSQL and SQL capabilities that are now sitting on Hadoop. Hadoop is the most active ecosystem that actually exists out there, certainly in corporate computing. But if you want to treat it as a database, it just doesn’t bear any comparison at the moment to what I tend to think of as real databases, especially in the data warehouse space. And that explains to a certain extent the success of a number of the big NoSQL databases that don’t run on Hadoop like CouchDB and so on.
As a data lake it has a far richer ecosystem than any other platform and it’s not going to be displaced from that. Its ecosystem isn’t just the open-source ecosystem. There’s now a dramatic number of software vendors that have products that are fundamentally built for Hadoop or have been ported to Hadoop. And they’ve just created an ecosystem that nothing else can compete with in terms of its breadth. And that means really it’s become the platform for big data innovation. But in my opinion it’s still immature and we could have long discussions about what is and isn’t, let’s say, operationally mature with Hadoop but I think most people that are looking at this particular area are well aware that Hadoop is decades behind the mainframe in terms of operational capability.
The evolving data lake. The data lake is a platform by any definition and if you think of there being a data layer in corporate computing now it’s very easy to think of it in terms of the fixed databases plus the data lake making up the data layer. Data lake applications are many and varied. I’ve got a diagram here that just goes through the various data wrangling things that need to be done if you use Hadoop as a staging area or Hadoop and Spark as a staging area. And you’ve got the whole thing – data lineage, data cleansing, metadata management, metadata discovery – it can be used for ETL itself but often requires ETL to bring the data in. Master data management, business definitions of data, service management of what’s happening in Hadoop, life cycle management of data, and ETL out of Hadoop, and also you’ve got direct analytics applications that you can run on Hadoop.
And that’s why it’s become very powerful and where it’s been implemented and implemented successfully, normally it has at least a collection of these kinds of applications running on top of it. And most of those applications, particularly the ones that I’ve been briefed about, they’re just not available on the mainframe right now. But you could run them on the mainframe, on a Hadoop cluster that was running in a partition of the mainframe.
The data lake is becoming, in my opinion, the natural staging area for fast database analytics and for BI. It becomes the place where you take in the data, whether it’s corporate data or external data, mess with it until it’s, let’s say, clean enough to use and well-structured to use and then you pass it on. And all of this is still in its infancy.
The idea, in my opinion, of mainframe/Hadoop coexistence, the first thing is that large companies are unlikely to abandon the mainframe. In fact, the indications that I’ve seen recently imply that there’s a rising investment in the mainframe. But they’re not going to ignore the Hadoop ecosystem either. I’m seeing figures of 60 percent of large companies using Hadoop even if a lot of them are actually just prototyping and experimenting.
The conundrum then is, “How do you make these two things coexist?” because they’re going to need to share data. Data that is brought into the data lake may need to be transferred to the mainframe. Data that’s on the mainframe may need to go to the data lake or through the data lake in order to be joined to other data. And that’s going to happen. And that means it requires fast data transfer/ETL capability. It’s unlikely that workloads are going to be shared dynamically between, let’s say, a mainframe environment and something in a Hadoop environment. It’s going to be data that’s shared. And the majority of data is inevitably going to reside on Hadoop simply because it’s the lowest-cost platform for it. And the end-to-end analytical processing will probably reside there too.
In summary, ultimately we need to think in terms of a corporate data layer, which for many companies will include the mainframe. And that data layer needs to be proactively managed. Otherwise the two won’t coexist well. I can pass the ball back to you Eric.
Eric Kavanagh: Again, Tendü I just made you the presenter, so take it away.
Tendü Yogurtçu: Thank you, Eric. Thank you for having me. Hi, everybody. I will be talking about the Syncsort experience with the customers in relation to how we see data as an asset in the organization being leveraged from mainframe to big data analytics platforms. And I hope that we will also have time at the end of the session to have questions from the audience because that’s really the most valuable part of these webcasts.
Just for people who do not know what Syncsort does, Syncsort is a software company. We have been around actually over 40 years. Started on the mainframe side and our products span from mainframe to Unix to big data platforms, including Hadoop, Spark, Splunk, both on premise and in the cloud. Our focus always has been on data products, data processing and data integration products.
Our strategy with respect to big data and Hadoop has been really to become part of the ecosystem from day one. As a proprietary vendor that has been really focused on data processing with very lightweight engines, we thought that there was a big opportunity to participate in Hadoop becoming a data processing platform and be part of this next generation data warehouse architecture for the organization. We have been a contributor to the open-source Apache projects since 2011, starting with MapReduce. We have been in the top ten for Hadoop Version 2, and participated actually in multiple projects also including Spark packages; some of our connectors are published in Spark packages.
We leverage our very lightweight data processing engine which is completely flat-file-based metadata, and sits very well with the distributed file systems like Hadoop Distributed File System. And we leverage our heritage on the mainframe, our expertise with algorithms as we put out our big data products. And we partner very closely with the major vendors, major players here including Hortonworks, Cloudera, MapR, Splunk. Hortonworks recently announced that they will be reselling our product for ETL onboarding with Hadoop. With Dell and Cloudera we have a very close partnership that is also reselling our ETL product as part of their big data appliance. And with Splunk actually, we publish mainframe telemetry and security data in Splunk dashboards. We have a close partnership.
What is in the mind of every C-level executive? It’s really, “How do I tap into my data assets?” Everybody is talking about big data. Everybody is talking about Hadoop, Spark, the next compute platform that may help me create business agility and open up new transformative applications. New go-to-market opportunities. Every single executive is thinking, “What is my data strategy, what is my data initiative, and how do I make sure that I don’t stay behind my competition, and I still am in this market in the next three years?” We see this as we speak to our customers, as we speak to our global customer base, which is quite large, as you can imagine, since we have been around for a while.
As we speak with all of these organizations we also see this in the technology stack in the disruption that happened with Hadoop. It’s really in order to satisfy this demand about data as an asset. Leveraging all the data assets an organization has. And we have seen the enterprise data warehouse architecture evolve such that Hadoop now is the new centerpiece of the modern data architecture. And for most of our customers, whether it’s financial services, whether it’s insurance, telco or retail, we find that the initiatives are usually either Hadoop as a service or data as a service. Because everybody is trying to make the data assets available for either their external clients or internal clients. And in some of the organizations we see initiatives like almost a data marketplace for their clients.
And one of the first steps in achieving that is creating an enterprise data hub. Sometimes people will call it a data lake. Creating this enterprise data hub actually is not as easy as it sounds because it really requires accessing and collecting virtually any data in the enterprise. And that data is now from all of the new sources like mobile and sensors as well as legacy databases, and it is in batch mode and in streaming mode. Data integration has always been a challenge; however, with the number and variety of data sources and the different delivery styles, whether it’s batch or streaming real-time, it is even more challenging now compared to five years ago, ten years ago. We sometimes refer to it as, “It’s not your father’s ETL anymore.”
So we talk about the different data assets. As enterprises are trying to make sense of the new data, the data they collect from the mobile devices, whether it’s the sensors at a car manufacturer or the user data for a mobile gaming company, they often need to reference the most critical data assets in the enterprise, which is customer information, for example. These most critical data assets often live on the mainframe. Companies correlating mainframe data with these emerging new sources, collected in the cloud, collected through mobile, collected on the manufacturing line of a Japanese car company, or in internet of things applications, have to make sense of this new data by referencing their legacy data sets. And those legacy data sets are often on the mainframe.
And if these companies are not able to do that, are not able to tap into the mainframe data then there’s a missed opportunity. Then the data as a service, or leveraging all of the enterprise data is not really tapping into the most critical assets in the organization. There’s also the telemetry and security data part because pretty much all transactional data lives on the mainframe.
Imagine you are going to an ATM (I think one of the attendees sent a message to participants here about protecting the banking system): when you are swiping your card, that transactional data is pretty much globally on the mainframe. And securing and collecting the security data and telemetry data from mainframes and making those available through either Splunk dashboards or others, Spark, SQL, becomes more critical now than ever, because of the volume of the data and the variety of the data.
Skill sets is one of the biggest challenges. Because on one hand you have a rapidly changing big data stack, you don’t know which project is going to survive, which project is not going to survive, should I hire Hive or Pig developers? Should I invest in MapReduce or Spark? Or the next thing, Flink, somebody said. Should I be investing in one of these compute platforms? On one hand, keeping up with the rapidly changing ecosystem is a challenge, and on the other hand you have these legacy data sources. The new skill sets do not really match and you might have an issue because those resources might be actually retiring. There’s a big gap in terms of the skill sets of people who understand those legacy data stacks and who understand the emerging technology stack.
The second challenge is the governance. When you are really accessing all the enterprise data across platforms, we have customers who raised concerns that, “I don’t want my data to land. I don’t want my data to be copied in multiple places because I want to avoid the multiple copies as much as possible. I want to have end-to-end access without landing it in the middle there.” Governing this data becomes a challenge. And the other piece is data access bottlenecks: if you are collecting most of your data in the cloud and accessing and referencing legacy data, the network bandwidth becomes an issue across cluster platforms. There are many challenges in terms of having this big data initiative and advanced analytics platforms and yet leveraging all of the enterprise data.
What Syncsort offers is, we are referred to as “simply the best” not because we are simply the best but our customers really refer to us as simply the best at accessing and integrating mainframe data. We support all of the data formats from mainframe and make it available for the big data analytics. Whether that’s on Hadoop or Spark or the next compute platform. Because our products really insulate you from the complexities of the compute platform. You are, as a developer, potentially developing on a laptop, focusing on the data pipeline and what the data preparations are, the steps to make this data curated for the analytics, the next phase, and then you run that same application in MapReduce or run that same application in Spark.
We helped our customers do that when YARN became available and they had to move their applications from MapReduce version 1 to YARN. We are helping them do the same with Apache Spark. Our product’s new release, 9, runs with Spark as well and ships with a dynamic optimization that will insulate these applications for future compute frameworks.
So we have accessing mainframe data, whether it’s VSAM files, whether it’s DB2, [inaudible] or whether it’s telemetry data, like SMF records or Log4j or syslogs, that needs to be visualized through Splunk dashboards. And while doing that, because the organization can leverage their existing data engineer or ETL skill sets, the development time is significantly reduced. In fact with Dell and Cloudera, there was an independent benchmark sponsored, and that benchmark focused on the development time it takes if you are doing hand coding versus using tools such as Syncsort, and it was about a 60 to 70 percent reduction in the development time. That is bridging the skill sets gap across groups, across those data silos, and also those silos in terms of the people.
Usually the big data team, or the data ingest team, or the team that’s tasked to develop this data as a service architecture, do not necessarily speak with the mainframe team. They want to minimize that interaction almost in many of the organizations. By closing that gap we have advanced. And the most important part is really securing the entire process. Because in the enterprise when you are dealing with this kind of sensitive data there are many requirements.
In highly regulated industries like insurance and banking our customers ask, they said, “You offer this mainframe data access and that’s great. Can you also offer me a way to keep this EBCDIC-encoded record in its original format so I can satisfy my audit requirements?” So we make Hadoop and Apache Spark understand mainframe data. You can keep the [inaudible] data in its original record format, do your processing at the distributed compute platform level and, if you need to put that back, you can show the record is not changed and the record format is not changed, so you can comply with the regulatory requirements.
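As a rough illustration of what "understanding mainframe data while keeping the original record" can involve, here is a minimal Python sketch. It is not Syncsort's implementation. The record layout, the field names and the choice of EBCDIC code page 037 are all assumptions invented for the example; real layouts come from a COBOL copybook. The sketch decodes an EBCDIC text field and a packed-decimal (COMP-3) field into a readable view, and carries the untouched original bytes plus a fingerprint alongside so the record can later be shown to be unchanged.

```python
import hashlib
from decimal import Decimal

# Hypothetical fixed-width layout, invented for illustration only:
#   bytes 0-9    customer name, EBCDIC text (code page 037 assumed)
#   bytes 10-12  balance, packed decimal (COMP-3), 2 implied decimal places
NAME = slice(0, 10)
BALANCE = slice(10, 13)


def unpack_comp3(raw: bytes, scale: int = 0) -> Decimal:
    """Decode a packed-decimal field: two digits per byte, last nibble is the sign."""
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()  # this sketch treats 0xD as negative, anything else as positive
    value = 0
    for digit in nibbles:
        if digit > 9:
            raise ValueError("invalid packed-decimal digit")
        value = value * 10 + digit
    if sign == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)


def decode_record(raw: bytes) -> dict:
    """Build a readable view of the record without modifying the original bytes."""
    return {
        "name": raw[NAME].decode("cp037").rstrip(),
        "balance": unpack_comp3(raw[BALANCE], scale=2),
        "raw": raw,  # original EBCDIC bytes, kept for audit
        "sha256": hashlib.sha256(raw).hexdigest(),  # fingerprint of the original bytes
    }


if __name__ == "__main__":
    # Build a sample record the way it might arrive from the mainframe.
    record = "JANE DOE  ".encode("cp037") + bytes([0x12, 0x34, 0x5C])
    view = decode_record(record)
    print(view["name"], view["balance"])
    # The original bytes are still available, byte for byte, for writing back.
    print(view["raw"] == record, view["sha256"])
```

The design choice here is that decoding produces a separate view and never rewrites the source bytes, which is what lets an auditor compare the stored record with the original.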
And most of the organizations, as they are creating the data hub or data lake, are also trying to do this at a single click. Being able to map metadata from hundreds of schemas in an Oracle database to Hive tables or ORC or Parquet files becomes necessary. We ship tools and we provide tools to make this a one-step data access, auto-generating jobs for the data movement, and auto-generating jobs to make the data mapping.
We talked about the connectivity part, the compliance, the governance and the data processing. And our products are available both on premise and in the cloud, which makes it really very simple because the companies don’t need to think about what’s going to happen in the next year or two if I decide to go completely in public cloud versus hybrid environment, as some of the clusters might be running on premise or in the cloud. And our products are available both on Amazon Marketplace, on EC2, Elastic MapReduce and also as a Docker container.
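To give a feel for what mapping Oracle metadata to Hive tables means in practice, the following Python sketch generates Hive DDL from a list of Oracle column definitions. It is an illustration only and says nothing about how Syncsort's tools work. The type mapping is a simplified assumption (for example, an unconstrained NUMBER is mapped to DECIMAL(38,10) here purely as a placeholder), and the table and column names are hypothetical.

```python
import re

# Simplified, assumed mapping from Oracle base types to Hive types.
ORACLE_TO_HIVE = {
    "VARCHAR2": "STRING",
    "CHAR": "STRING",
    "CLOB": "STRING",
    "DATE": "TIMESTAMP",  # Oracle DATE carries a time component
    "TIMESTAMP": "TIMESTAMP",
    "FLOAT": "DOUBLE",
}

TYPE_PATTERN = re.compile(r"\s*([A-Z0-9]+)\s*(?:\(\s*(\d+)\s*(?:,\s*(\d+)\s*)?\))?\s*")


def hive_type(oracle_type: str) -> str:
    """Translate one Oracle column type, e.g. 'NUMBER(10,2)', to a Hive type."""
    match = TYPE_PATTERN.fullmatch(oracle_type.upper())
    if match is None:
        raise ValueError(f"unrecognized type: {oracle_type}")
    base, precision, scale = match.groups()
    if base == "NUMBER":
        if precision is None:
            return "DECIMAL(38,10)"  # placeholder choice for unconstrained NUMBER
        return f"DECIMAL({precision},{scale or 0})"
    if base not in ORACLE_TO_HIVE:
        raise ValueError(f"no mapping defined for: {base}")
    return ORACLE_TO_HIVE[base]


def hive_ddl(table: str, columns: list, file_format: str = "ORC") -> str:
    """Generate a CREATE TABLE statement for a list of (name, oracle_type) pairs."""
    cols = ",\n".join(f"  `{name.lower()}` {hive_type(t)}" for name, t in columns)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n{cols}\n) STORED AS {file_format}"


if __name__ == "__main__":
    # Hypothetical source table definition.
    customer = [
        ("CUSTOMER_ID", "NUMBER(10)"),
        ("FULL_NAME", "VARCHAR2(100)"),
        ("BALANCE", "NUMBER(12,2)"),
        ("OPENED_ON", "DATE"),
    ]
    print(hive_ddl("lake.customer", customer, file_format="PARQUET"))
```

Looping a function like `hive_ddl` over a catalog of source tables is the kind of repetitive work that a one-step, auto-generating tool removes when there are hundreds of schemas.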
Just to kind of wrap up, so we have enough time for Q&A, it’s really about accessing, integrating and complying with the data governance, yet making all of this simpler. And while making this simpler, “design once and deploy anywhere” in a true sense because of our open-source contributions our product runs natively in Hadoop data flow and natively with Spark, insulating the organizations from the rapidly changing ecosystem. And providing a single data pipeline, a single interface, both for batch and streaming.
And this also helps organizations sometimes evaluate these frameworks, because you may want to actually create applications and just run on MapReduce versus Spark and see for yourself: yes, Spark has this promise and provides all of the advances, iterative algorithms work best, machine learning and predictive analytics applications work with Spark, but can I also have my streaming and batch workloads done on this compute framework? You can test different compute platforms using our products. And the dynamic optimization, whether you are running on a standalone server, on your laptop, in Google Cloud versus Apache Spark, is really a big value proposition for our customers. And it was truly driven by the challenges that they had.
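The "design once, deploy anywhere" idea can be sketched in a few lines of Python. This is a toy model, not Syncsort's engine: the `Pipeline` class and the two runner functions are hypothetical names made up for the example. The point it tries to show is that the pipeline is declared once as engine-neutral steps, and separate runners decide how to execute it, here one that materializes each stage batch-style and one that pushes records through one at a time, stream-style.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass
class Pipeline:
    """Engine-neutral description of a data flow: an ordered list of steps."""
    steps: List[Tuple[str, Callable]] = field(default_factory=list)

    def filter(self, predicate: Callable) -> "Pipeline":
        self.steps.append(("filter", predicate))
        return self

    def map(self, fn: Callable) -> "Pipeline":
        self.steps.append(("map", fn))
        return self


def run_batch(pipeline: Pipeline, records: Iterable) -> list:
    """Batch-style runner: apply each step to the whole data set in turn."""
    data = list(records)
    for kind, fn in pipeline.steps:
        if kind == "filter":
            data = [r for r in data if fn(r)]
        else:
            data = [fn(r) for r in data]
    return data


def run_streaming(pipeline: Pipeline, records: Iterable) -> Iterator:
    """Stream-style runner: push one record at a time through every step."""
    for record in records:
        keep = True
        for kind, fn in pipeline.steps:
            if kind == "filter":
                if not fn(record):
                    keep = False
                    break
            else:
                record = fn(record)
        if keep:
            yield record


if __name__ == "__main__":
    # The pipeline is designed once...
    pipeline = (
        Pipeline()
        .filter(lambda r: r["amount"] > 0)
        .map(lambda r: {**r, "cents": r["amount"] * 100})
    )
    rows = [{"id": 1, "amount": 5}, {"id": 2, "amount": -3}, {"id": 3, "amount": 2}]
    # ...and executed by either runner with the same result.
    print(run_batch(pipeline, rows) == list(run_streaming(pipeline, rows)))
```

Because the definition holds no engine-specific code, trying a different execution framework means swapping the runner, not rewriting the pipeline, which is the evaluation scenario described above.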
I will just cover one of the case studies. This is Guardian Life Insurance Company. And Guardian’s initiative was really to centralize their data assets and make it available for their clients, reduce the data preparation time and they said that everybody talks about data preparation taking 80 percent of the overall data processing pipeline and they said it was in fact taking about 75 to 80 percent for them and they wanted to reduce that data preparation, transformation times, time-to-market for analytics projects. Create that agility as they add new data sources. And make that centralized data access available for all of their clients.
Their solution, including Syncsort products, is right now they have an Amazon Marketplace lookalike data marketplace supported by a data lake, which is basically Hadoop, and NoSQL database. And they use our products to bring all the data assets to the data lake, including DB2 on mainframe, including VSAM files on mainframe, and the database legacy data sources as well as the new data sources. And as a result of that they have centralized the reusable data assets which are searchable, accessible and available to their clients. And they are really able to add the new data sources and service their clients much faster and more efficient than before. And the analytics initiatives are even progressing more on the predictive side as well. So I will pause and I hope this was useful and if you have any questions for me of any of the related topics please, you are welcome.
Eric Kavanagh: Sure, and Tendü, I’ll just throw one in. I got a comment from an audience member just saying, “I like this ‘design once, deploy anywhere.’” Can you kind of dig into how that’s true? I mean, what have you done to enable that kind of agility and is there any tax? Like when we talk about virtualization, for example, there’s always a bit of a tax on performance. Some people say two percent, five percent, 10 percent. What you’ve done in order to enable the design once, deploy anywhere – how do you do it and is there any tax associated with it in terms of performance?
Tendü Yogurtçu: Sure, thank you. No, because unlike some of the other vendors we do not really generate Hive or Pig or some other code that is not native to our engines. This is where our open-source contributions played a huge role, because we have been working with the Hadoop vendors, Cloudera, Hortonworks and MapR, very closely, and due to our open-source contributions our engine in fact is running natively as part of the flow, as part of the Hadoop flow, as part of the Spark flow.
What that translates to also is that we have this dynamic optimization. This was something that came as a result of our customers being challenged with compute frameworks. As they were going into production with some of the applications, they came back and said, “I am just stabilizing my Hadoop cluster, stabilizing on MapReduce YARN Version 2, MapReduce Version 2, and people are saying that MapReduce is dead, Spark is the next thing, and some people are saying Flink will be the next thing, how am I going to cope with this?”
And those challenges really became so obvious to us, we invested in having this dynamic optimization we refer to as intelligent execution. At run time, when the job, when this data pipeline is submitted, based on the cluster, whether it’s Spark, whether it’s MapReduce or a Linux standalone server, we decide how to run this job, natively in our engine, as part of that Hadoop or Spark data flow. There is no overhead because everything is done through this dynamic optimization we have and everything is also done because our engine is so natively integrated because of our open-source contributions. Does that answer your question?
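To make the idea of choosing the execution target at run time concrete, here is a minimal sketch of “design once, deploy anywhere.” It is an illustration only and not Syncsort’s implementation: `Pipeline`, `choose_backend`, `submit` and the `cluster` dictionary keys are all hypothetical names, and every backend here is a local stand-in rather than a real Spark or MapReduce submission.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


@dataclass(frozen=True)
class Pipeline:
    """Framework-neutral job definition: an ordered list of record transforms."""
    steps: Tuple[Callable[[Record], Record], ...]


def choose_backend(cluster: Dict[str, bool]) -> str:
    """Pick an execution target from what the environment reports (hypothetical keys)."""
    if cluster.get("spark_available"):
        return "spark"
    if cluster.get("yarn_available"):
        return "mapreduce"
    return "standalone"


def _run_locally(pipeline: Pipeline, records: List[Record]) -> List[Record]:
    """Stand-in executor: applies every step to every record in order."""
    out = []
    for record in records:
        for step in pipeline.steps:
            record = step(record)
        out.append(record)
    return out


# In a real system each entry would hand the same pipeline to a different engine.
# Here all three are the same local stand-in, so only the routing is illustrated.
RUNNERS: Dict[str, Callable[[Pipeline, List[Record]], List[Record]]] = {
    "spark": _run_locally,
    "mapreduce": _run_locally,
    "standalone": _run_locally,
}


def submit(pipeline: Pipeline, records: List[Record],
           cluster: Dict[str, bool]) -> Tuple[str, List[Record]]:
    """Decide at submission time where the unchanged pipeline definition runs."""
    backend = choose_backend(cluster)
    return backend, RUNNERS[backend](pipeline, records)
```

The point of the sketch is that the pipeline definition never mentions a framework; only `submit` looks at the environment, so the same definition can be re-targeted when the cluster changes.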
Eric Kavanagh: Yeah, that’s good. And I want to throw up one more question over there, and then Dez, maybe we’ll pull you and Robin in as well. I just got a hilarious comment from one of our attendees. I’ll read it because it really is quite pithy. He writes, “It seems that in the history of things H-O-T" – get it? Like I-o-T – "is that the more you try to ‘simplify’ something that’s really complex, more often than not the simpler it appears to do things, the more hanging rope is supplied. Think database query, explosion, multi-threading, etc.” Can you kind of comment on this paradox that he’s referencing? Simplicity versus complexity, and basically what’s really going on underneath the covers?
Tendü Yogurtçu: Sure. I think that’s a very valid point. When you are simplifying things and doing these optimizations, in a way under the covers, somebody needs to take on that complexity of what needs to happen, right? If you are parallelizing something or if you are deciding how to run a particular job with respect to the compute framework, obviously there’s some part of the job that’s being pushed somewhere, whether it’s at the user end, with manual coding, or it’s at the engine optimization. Part of that is that by simplifying the user experience there’s a huge benefit in terms of being able to leverage skill sets that exist in the enterprise.
And you can kind of mitigate that paradox, mitigate that challenge of, “Yeah, but I don’t have control over everything that’s happening under the cover, under the hood in that engine,” by exposing things to the more advanced users if they want to have that kind of control. By also investing in some of the serviceability types of things. Being able to offer more operational metadata, more operational data, as in the example that this attendee gave, for a SQL query as well as with the engine running. I hope that answers.
Eric Kavanagh: Yeah that sounds good. Dez, take it away.
Dez Blanchfield: I’m really keen to get a bit more insight into your footprint in the open-source contributions and the journey that you’ve taken from your traditional, long-running experience in mainframe and the proprietary world and then the shift into contributing to open source and how that took place. And the other thing I’m keen to understand is the view you’re seeing that businesses, not just IT departments, but businesses are now taking with regard to data hubs or data lakes as people are saying now and whether they see this trend of just one single, consolidated data lake or whether we’re seeing distributed data lakes and people are using tools to put them together?
Tendü Yogurtçu: Sure. For the first one, that was a very interesting journey, as a proprietary software company, one of the first ones after IBM. However, again, everything started with our evangelist customers looking at Hadoop. We had data companies like ComScore, they were one of the first ones adopting Hadoop because they were collecting digital data across the globe and weren’t able to keep 90 days of data unless they invested in a ten-million-dollar data warehouse box in their environment. They started looking at Hadoop. With that we started also looking at Hadoop.
And when we made a decision and acknowledged that Hadoop is really going to be the data platform of the future, we also came to the understanding that we won’t be able to have a play in this, a successful play in this, unless we were a part of the ecosystem. And we were working very closely with Hadoop vendors, with Cloudera, Hortonworks, MapR, etc. We started really talking with them because partnership becomes very important to validate the value a vendor can bring and also makes sure that we can jointly go to the enterprise and offer something more meaningful. It required a lot of relation building because we were not known to the Apache open-source projects, however we had great support from these Hadoop vendors, I must say.
We started working together and looking at the hub, at how we can bring value even without our proprietary software in the space. That was important. It’s not just about putting in some APIs that your product can run on, it’s to be able to say that I will invest in this because I believe Hadoop is going to be a platform of the future, so by investing in the open source we wanted to make sure it matures and becomes enterprise ready. We can actually enable some of the use cases that were not available before our contributions. That will benefit the entire ecosystem and we can develop those partnerships very closely.
It took quite a lot of time. We started contributing in 2011, and 2013, January 21st – I remember the date because that was the date our largest contribution was committed, which meant that we could have our products generally available from that point on – it took quite some time to develop those relationships, show the value, and become design partners with the vendors and with the committers in the open-source community. But it was a lot of fun. It was very rewarding for us as a company to be part of that ecosystem and develop a great partnership.
On the second question, about the data hub/data lake, I think when we see this data-as-a-service implementation, in most of the cases, yes, it might be clusters, physically single or multiple clusters, but it’s more conceptual than becoming that single place for all the data. Because in some organizations we see large cluster deployments on premise, however they also have clusters, for example, in the public cloud, because some of the data that’s collected from online sessions is really kept in the cloud. Being able to have a single data pipeline with which you can actually leverage both of these, and use them as a single data hub, a single data lake, becomes important. Not necessarily just the physical place, but having that data hub and data lake across clusters, across geographies and maybe on premise and cloud is going to be very critical, I think. Especially moving forward. This year we started seeing more and more cloud deployments. It’s amazing. The first half of this year so far we have seen a lot of cloud deployments.
Eric Kavanagh: Okay, cool. And Robin, do you have any questions? I know we just have a couple of minutes left.
Robin Bloor: Okay, well I can ask her a question. The first thing that occurred to me is that there has been a lot of excitement about Kafka and I was interested in your opinion about Kafka and how you integrate with the way that people are using Kafka?
Tendü Yogurtçu: Sure. Yes, Kafka is becoming quite popular. Among our customers we see it being kind of the data transport layer, viewed as a data bus, pretty much. For example, one of our customers was consuming data that’s pushed into Kafka from multiple, like thousands of, online users, and classifying that and pushing it through.
Again, Kafka is a data bus to the different consumers of this data. Classify some advanced users versus not-so-advanced users and do something different moving forward in that data pipeline. How we integrate with Kafka is basically, our product DMX-h becomes a reliable consumer, a highly efficient, reliable consumer for Kafka. It can read the data and this is not any different than reading data from any other data source for us. We give users the ability to control the window either in terms of the time requirement that they have or the number of messages that they might be consuming from the Kafka bus. And then we can also do enrichment of that data as it’s going through our product and pushed back into Kafka. We have tested this. We have benchmarked it at the customer site. Also certified by Confluent. We work closely with the Confluent guys and it’s very high performing and easy to use. Again, there the APIs change but you don’t have to worry because the product really treats that as just another data source, a streaming data source. It’s quite fun to work with our product and Kafka, actually.
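The windowing control mentioned above, by elapsed time or by number of messages, can be sketched in a few lines. This is a minimal illustration and not DMX-h or any Kafka client API: `windows`, `max_count` and `max_seconds` are hypothetical names, and the input is a plain iterable of `(timestamp, payload)` pairs standing in for messages read from a topic.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def windows(messages: Iterable[Tuple[float, str]],
            max_count: Optional[int] = None,
            max_seconds: Optional[float] = None) -> Iterator[List[str]]:
    """Group a message stream into windows bounded by count, by time, or by both.

    Timestamps are taken from the messages themselves (event time), so the
    sketch is deterministic and needs no broker or wall clock.
    """
    if max_count is None and max_seconds is None:
        raise ValueError("set max_count, max_seconds, or both")

    batch: List[str] = []
    window_start: Optional[float] = None

    for timestamp, payload in messages:
        if window_start is None:
            window_start = timestamp
        # Time bound: a message arriving max_seconds or more after the
        # window opened closes the current window and starts a new one.
        if max_seconds is not None and batch and timestamp - window_start >= max_seconds:
            yield batch
            batch = []
            window_start = timestamp
        batch.append(payload)
        # Count bound: emit as soon as the window holds max_count messages.
        if max_count is not None and len(batch) >= max_count:
            yield batch
            batch = []
            window_start = None

    if batch:
        yield batch  # flush whatever remains when the stream ends
```

Each emitted window could then be enriched and written back out; with a real Kafka client the loop body would stay the same and only the source of `messages` would change.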
Robin Bloor: Okay I have another question which is just kind of a general business question but I’ve known Syncsort for a long time and you always had the reputation and delivered extraordinarily fast software for ETL and the mainframe world. Is it the case that most of your business is now being transferred to Hadoop? Is it the case that in one way or another you’ve kind of spread your business out quite dramatically from the mainframe world?
Tendü Yogurtçu: Our mainframe products are still running on 50 percent of the mainframes globally. So we have a very strong mainframe product line in addition to what we are doing on the big data and the Hadoop end. And we are still in most of the IT simplification or optimization projects, because on one end you want to be able to tap into your mainframe data in the big data Multex platforms and leverage all of the enterprise data, however there are also very critical transactional workloads that still continue to run on the mainframe, and we offer those customers ways to really make those applications more efficient, to run in the zIIP engine so they don’t consume as many processing cycles and MIPS, to make them cost effective.
We continue to invest in the mainframe products and actually play into this space where people go from mainframe big iron to big data, and span the product line across those platforms as well. So we don’t necessarily shift the entire business to one side, we continue to have very successful business on both sides. And acquisitions are a big focus for us as well. As this data management and data processing space for the big data platforms evolves, we are also committed to making quite a few complementary acquisitions.
Robin Bloor: Well I guess I can’t ask you what they are because you wouldn’t be allowed to tell me. I’m interested in whether you’ve seen many implementations of Hadoop or Spark actually on the mainframe or whether that’s a very rare thing.
Tendü Yogurtçu: We have not seen any. There are more questions about that. I think Hadoop on mainframe did not make a lot of sense because of the kind of core structure. However, Spark on mainframe is quite meaningful, and Spark really is very good with machine learning and predictive analytics, and being able to have some of those applications with mainframe data really is, I think, quite meaningful. We haven’t seen anybody doing that yet, however it’s really the use case driving these things. If your use case as a company is more about bringing that mainframe data over and integrating it with the rest of the data sets in the big data platform, that’s one story. It requires accessing the mainframe data from the big data Multex platform, because you are unlikely to bring your data sets from open systems back to the mainframe. However, if you have some mainframe data that you want to just explore, do a little bit of data exploration and discovery, and apply some advanced AI and advanced analytics, then Spark might be a good way to go, running on the mainframe for that.
Eric Kavanagh: And here’s one more question from the audience, actually two more. I’ll give you a tag-team question, then we’ll wrap up. One attendee is asking, “Is IBM integrating your open-source contributions on its public cloud ecosystem, in other words, the Bluemix?” And another attendee made a really good point, noting that Syncsort is great for keeping big iron alive for those who already have it, but if companies forego new mainframes in favor of what he calls CE, cloud everything, that it will likely decline, but notes that you guys are really good at moving data by bypassing operating systems up to a gigabyte per second. Can you kind of talk about your core strength, as he mentioned, and whether or not IBM is integrating your stuff into Bluemix?
Tendü Yogurtçu: With IBM, we are already partners with IBM and we had discussions for their data cloud services offering the product. Our open-source contributions are open to everybody who wants to leverage them. Some of the mainframe connectivity is also available in Spark packages, so not just IBM. Anybody can leverage those. In the Bluemix we haven’t done anything specifically on that yet. And do you mind repeating the second question?
Eric Kavanagh: Yeah, the second question was about your core area of functionality over the years, which was really handling bottlenecks of ETL, and obviously that’s something that you guys are still going to be doing as mainframes, well, theoretically go away, although Dez’s point is still kind of rocking and rolling out there. But the attendee just noted that Syncsort is very good at moving data by bypassing operating systems at up to a gigabyte a second. Can you just comment on that?
Tendü Yogurtçu: Yes, overall resource efficiency has really been our strength, and scalability and performance have been our strength. We are not compromising on those; “simplify” has many meanings, but we don’t compromise on those. When people started talking about Hadoop in 2014, for example, many of the organizations were not really looking at performance initially. They were saying, “Oh, if something happens I can add another couple of nodes and I’ll be fine, performance is not my requirement.”
While we were talking about having the best performance because we were already running natively, we were not even having some of the initial hiccups that Hive had with multiple MapReduce jobs and overheads with starting them. People were telling us, “Oh, that’s not my worry, don’t worry about that at the moment.”
When we came to 2015 that landscape had changed, because some of our customers had already exceeded the storage that they had in their production clusters. It became very critical for them to see what Syncsort can offer. If you are taking some data from a database or mainframe and writing it into Parquet format in the clusters, whether you land and stage it and do another transformation, or just do the transformation in flight and land the data in the target file format, makes a difference, because you are saving on storage, you are saving on network bandwidth, and you are saving on the workload on the cluster because you are not running extra jobs. Those are the strengths we play to, in terms of being very conscious of resources; we feel resource efficiency under our skin, it seems.
That’s how we describe it. It is critical for us. We don’t take it for granted. We never took it for granted, so we will continue to be strong with that leverage in Apache Spark or the next compute framework. That will continue to be our focus. And in terms of the data movement piece and the data access piece, definitely it’s one of our strengths, and we are accessing DB2 or VSAM data on the mainframes in the context of Hadoop or Spark.
Eric Kavanagh: Well, that’s a great way to end the webcast, folks. Thank you so much for your time and attention. Thanks to you, Tendü and Syncsort, for coming into the briefing room and stepping into the round, as they say. A lot of great questions from the audience. It’s an ever-moving environment out there, folks. We will archive this Hot Tech as we do with all of the others. You can find us at insideanalysis.com and at techopedia.com. Usually it goes up in about a day. And with that, we’re going to bid you farewell, folks. Thank you so much. We’ll talk to you soon. Take care. Bye, bye.