By believing passionately in something that still does not exist, we create it. The nonexistent is whatever we have not sufficiently desired.
~ Franz Kafka
Necessity remains the mother of invention. As an astute consultant once told me, “If there’s something that needs to happen in an organization, it is happening.” His point was twofold: 1) some people will always find a way to get things done; and 2) senior management, or even middle management, may well be unaware of exactly how things are getting done within their own establishment.
If we extend that observation to the entire universe of data management, we can see a transformation taking place right now. The sheer weight of big data, combined with the rising influx of streaming data, puts so much pressure on legacy systems that they are fraying at the edges, if not collapsing altogether. Nonetheless, countless professionals are going about their jobs right this moment, largely unaware of this reality.
The data-borne, data-driven enterprises have a front-row seat, and are in many ways driving this change. Consider how powerhouses like Yahoo!, Facebook and LinkedIn have turned the enterprise software industry sideways with their prodigious donations to open source: Hadoop, Cassandra and now Kafka, all of which have been shepherded by the Apache Software Foundation, itself a central player in this metamorphosis.
What’s the upshot of all this change? What we’re witnessing today is the categorical reclassification and restructuring of data management itself. This is not to say that legacy systems will now be ripped out and replaced. Any industry veteran will tell you that wholesale dissolution of legacy systems happens about as often as the Chicago Cubs win the World Series. It’s a rare event, to say the very least.
What’s really happening is that a super-structure is being built all around the old-world systems. Consider the analogy of interstate highways, which often rise above the cities and towns they serve, designed to deliver people and cargo into these population centers, and provide egress to anyone and anything within them. They don’t replace existing roads so much as augment them with high-speed alternatives.
That’s exactly what Apache Kafka does: it provides high-speed routes for moving data among information systems. To follow the highway analogy, many companies still rely on linear message queues, or the old standard of ETL (extract, transform, load); but these pathways have low speed limits and plenty of potholes, maintenance costs are often exorbitant, and the signage is poor.
Kafka offers an alternative method for delivering data, one that is decidedly real-time, scalable and durable. This means that Kafka is not only a data movement vehicle but also a data replicator and, to a certain extent, a distributed database technology. We should be careful about taking the analogy too far, as there are characteristics of ACID-compliant databases that Kafka does not yet sport. Still, the change is real.
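To make the vehicle metaphor concrete, here is a minimal sketch of publishing a single event with the standard Kafka Java producer client. The broker address, the “page-views” topic and the sample payload are hypothetical placeholders, not anything prescribed by the project:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PageViewProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all waits for the in-sync replicas to confirm the write,
        // which is where the durability and replication claims come from.
        props.put("acks", "all");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // A hypothetical event: user 42 just viewed the pricing page.
            producer.send(new ProducerRecord<>("page-views", "user-42", "/pricing"));
        } // closing the producer flushes any buffered records
    }
}
```

Any number of systems can then read that same event independently, each at its own pace, which is what separates Kafka’s log-based model from a linear queue that deletes a message once it is consumed.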
This is great news for the information landscape, because data is now free to move about the country – and the world, for that matter. What was once a painful constraint, namely hitting batch windows for ETL processes, is now dissipating, much as fog gives way to clear skies under the glare of a hot sun. When moving data from one system to another becomes borderline seamless, an era of new opportunities dawns.
Human beings will likely represent the greatest friction on the road to data’s new future. Old habits die hard, and nary a CIO gets excited about making wholesale changes to enterprise systems. As one savvy senior executive said of the role: “Get ready to be lonely.” Within a year of that comment, he was a consultant. It’s not an easy path, trying to manage the remarkably unwieldy world of enterprise data.
The good news is that Kafka provides an on-ramp to the future. Because it serves as a high-powered, multi-faceted message bus, it creates bridges between legacy systems and their forward-looking counterparts. Thus, organizations that embrace this new opportunity with open minds and sufficient budget will be able to step into the new world, without leaving behind the old. That’s a seriously big deal.
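To illustrate that bridging role, a downstream process, perhaps one feeding a legacy system of record, can simply subscribe to the stream. Here is a minimal sketch using the standard Kafka Java consumer client; the topic, the consumer group and the forwarding step are assumptions for illustration:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class LegacyBridgeConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker address
        props.put("group.id", "legacy-bridge");           // hypothetical consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                // Records arrive continuously; there is no batch window to hit.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Stand-in for loading each event into a legacy system.
                    System.out.printf("user=%s page=%s offset=%d%n",
                            record.key(), record.value(), record.offset());
                }
            }
        }
    }
}
```

Because each consumer group tracks its own position in the log, this bridge can fall behind and catch up without disturbing any other subscriber, which is precisely what lets the old world and the new one coexist.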
Down to Business
While Apache Kafka is an open-source technology, free for anyone to download and use, the folks who created this software at LinkedIn have spun off a separate entity called Confluent, which focuses on hardening the offering for enterprise use. Much as Cloudera, Hortonworks and MapR have built their businesses around the open-source Apache Hadoop project, Confluent seeks to monetize Kafka.
In a recent InsideAnalysis interview, Confluent CEO and co-founder Jay Kreps explained Kafka’s origin at LinkedIn:
“We were trying to solve a couple different problems there. One was, we had all these different data systems with different kinds of data. We had databases and we had log files and we had metrics about servers and we had users clicking on things. Getting all that data around – as it got big – was really hard. The power of the data was only there if you could get it to the applications, or the processing, or the systems that needed it. That was a big problem.
“The other problem we had was we had adopted Hadoop, and that was something I was involved in. We had this fantastic offline processing platform that we could scale and we could put all our data in. For LinkedIn, all of our data happened in real time. There was continuous generation of data. There was always this mismatch, as we tried to actually build key parts of the business off of our data, between something that ran once a day, maybe at night, and generated results by the next day, and this kind of continuous data – short interaction times – that you had to catch up with. We wanted to be able to do something that had been around in academia for a while, but wasn’t really a mainstream thing, which is to be able to tap into and process streams of data as they were generated, rather than as they sat.”
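Kreps’s closing point, tapping into and processing streams of data as they are generated rather than as they sit, can be sketched with the Kafka Streams API that later grew out of the project. The topic names and the per-user counting logic below are illustrative assumptions, not LinkedIn’s actual pipeline:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class PageViewCounter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-counter"); // hypothetical
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // hypothetical
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> views = builder.stream("page-views");

        // Maintain a running count of views per user, updated as each
        // event arrives; no nightly batch job, no waiting until tomorrow.
        views.groupByKey()
             .count()
             .toStream()
             .to("page-view-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

The output topic is itself just another stream, so yet another system can subscribe to the running counts the moment they change.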
Well. That, in essence, is what Confluent now seeks to do with enterprise data of all shapes and sizes. The opportunity in play? Greenfield. Frankly, in the entire history of enterprise software, one could argue that the addressable market for this technology absolutely takes the cake. There is not a single large organization, or even data-heavy small business, that could not benefit hugely from it.
This is especially true because of the neurological aspect of the technology: not just the minds involved, but the nature of what Kafka does for information systems. Because Kafka can be used to manage the movement of data throughout an organization, it can be viewed not merely as a traffic cop, but as the brains of the operation itself. We’re in the early stages of that vision, but rest assured, it’s real.
How Kafka Will Change Data Management
To understand how Kafka will change the nature of data management, just think about the ways in which LinkedIn has changed professional networking. Finding colleagues became so much easier; staying in touch with people is now a snap. Kafka will do for information systems what LinkedIn does for business people: keep them connected across the widest reaches of the earth.
The spinoff of Confluent is emblematic of something we might call the New Innovation: a movement driven by the decoupling of software development from the closed-source mentality, guided by the creators of open-source technology, fueled by large amounts of venture capital, and monetized by for-profit companies that seek to revolutionize how organizations and people create, gather, analyze and leverage data.
To quote Franz Kafka, “From a certain point onward, there is no longer any turning back. That is the point that must be reached.”
We have crossed the Rubicon. There’s no turning back now.