Rebecca Jozwiak: Ladies and gentlemen, hello and welcome to Hot Technologies of 2016. Today we've got, “The Power of Suggestion: How a Data Catalog Empowers Analysts.” I am your host Rebecca Jozwiak, filling in for our usual host Eric Kavanagh today, while he is traveling the world, so thank you for joining us. This year is hot, it’s not just hot in Texas where I am, but it’s hot all over the place. There’s an explosion of all kinds of new technologies coming out. We've got IoT, streaming data, cloud adoption, Hadoop continues to mature and be adopted. We have automation, machine learning, and all this stuff is of course underlined by data. And enterprises are becoming more and more data driven by the day. And of course, the point of that’s to lead to knowledge, and discovery and, you know, make better decisions. But to really get the most value from data, it’s got to be easy to get to. If you keep it locked away, or buried, or in the brain of a few people within the enterprise, it’s not going to do much good for the enterprise as a whole.
And I was kind of thinking about data cataloging and thinking of course of libraries, where long ago that’s where you went if you needed to find something out, if you needed to research a topic, or look up some information, you went to the library, and of course you went to the card catalog, or the crabby lady who worked there. But it was also fun to kind of wander around, if you just wanted to look, and sure you might just discover something neat, you might find out some interesting facts that you didn't know, but if you really needed to find something out, and you knew what you were looking for, you needed the card catalog, and of course the enterprise equivalent is a data catalog, which can help shine light on all the data for our users to enrich, discover, share, consume and really help people get to data faster and easier.
So today we've got Dez Blanchfield, our own data scientist, and we have Doctor Robin Bloor, our own chief analyst, we've got David Crawford from Alation, who is going to be talking about his company’s data cataloging story, but first we’re going to lead off with Dez. Dez, I am passing the ball to you and the floor is yours.
Dez Blanchfield: Thank you, thanks for having me today. This is a matter I’m extremely interested in, because in almost every organization I come across in my day-to-day work, I find exactly the same issue that we spoke very briefly about in the pre-show banter, and that is that most organizations who have been in business for more than a few years have a plethora of data buried around the organization, in different formats, and in fact I have clients that have data sets that go back to Lotus Notes, databases that are still running in some cases as their pseudo intranets, and they are all running into this challenge of actually finding where their data is, and how to get access to it, who to provide access to, when to provide access to them, and how to just catalog it, and how to get it to a place where everyone can: A) be aware of what’s there and what’s in it, and B) know how to get access to it and use it. And one of the biggest challenges of course is finding it, the other big challenge is knowing what’s in there and how to access it.
I may well know that I’ve got dozens of databases, but I don't actually know what’s in there or how to find out what’s in there, and so invariably, as we were discussing just now in the pre-show banter, you tend to walk around the office and ask questions, and yell across the cubicle walls and try and figure it out. Often, in my experience, you may even find you’re wandering off to the front desk, the reception, and asking if anyone knows who you should go and talk to. Quite often, it’s not the IT folk, because they are unaware of the data set because someone’s just created it, and it could be something as simple as a – quite often we’ll find a project of some sort that’s standing up in an IT environment and the project manager’s used a spreadsheet of all things, and it’s got some massive amount of valuable information around assets and context and names, and unless you know that project and you know that person, you just can't find that information. It’s just not available, and you’ve got to get hold of that original file.
There’s a phrase that’s been bandied around with regard to data and I don't necessarily agree with it, but I think it’s a cute little throwaway, and that is that a certain number of people think that data is the new oil, and I’m sure we’re going to cover that in some aspect as well, later today. But what I have noticed, certainly being part of that transformation, is that organizations or businesses that have learned to value their data have gained significant advantage over their competitors.
There was an interesting paper by IBM, about five or six years ago, and they surveyed about 4,000 companies here in Australia, and they took all the information, all the performance data, all the finance data and put it together in a boiling pot and then sent it to the Australian School of Economics, and they actually spotted a common trend here, and that was that companies who leveraged technology invariably gained such a competitive advantage over their peers and competitors per se that their competitors almost never catch up. I think that’s very much the case now with data: we’ve seen what people call a digital transformation, where organizations that have clearly figured out how to find the data they’ve got, to make that data available, and make it available in some very easily consumable fashion to the organization, without necessarily always knowing why the organization might need it, have gained significant advantage over competitors.
I’ve got a couple of examples on this slide, which you can see. My one-liner is that the large-scale disruption across almost every industry sector, in my view, is being driven by data, and if the current trends are anything to go by, my view is we’ve only really just gotten started, because when the long-standing brands finally wake up to what this means and enter the game, they’re going to enter the game wholesale. When the major retailers who have mountains of data start applying some historical analysis to that data, if they even know it exists, then some of the online players are going to get a bit of a wakeup call.
But with most of these brands, I mean we’ve got Uber, who are the largest taxi company in the world. They don't own any taxis, so what is it that makes them magic? It’s their data. Airbnb, the largest accommodation provider, we’ve got WeChat, the largest phone company in the world, but they’ve got no actual infrastructure, and no handsets, no phone lines. Alibaba, the largest retailer on the planet, but they don't own any of the inventory. Facebook, the largest media company in the world. I think at the last count they had 1.4 billion active daily users now, which is a mind-boggling number. It’s not anywhere near – I think someone claimed that a quarter of the planet is actually on there every day, and yet here’s a content provider that actually doesn't create the content, all the data they serve is not created by them, it’s created by their subscribers, and we all know this model.
SocietyOne, which you may or may not have heard of, it’s a local brand, I think in a couple of countries. It’s a bank that actually does peer-to-peer lending, so in other words, it has no money. All it does is manage the transactions, and the data sits underneath it. Netflix, we’re all very, very familiar with that. There’s an interesting one-liner here. When Netflix was legally able to be used in Australia, when it was officially announced, you did not have to use a VPN to get to it, many people around the world tend to – if you can't get to it in your local area – when Netflix was launched in Australia, it increased the international bandwidth on our internet links by 40 percent, so it almost doubled the internet usage in Australia overnight, by just one application, one cloud-hosted application that does nothing but play with data. It’s just a mind-boggling statistic.
And of course, we’re all familiar with Apple and Google, but these are the largest software businesses on the planet, yet they don't actually write the apps. What’s the consistent thing with all these organizations? Well, it’s data, and they didn't get there because they didn't know where their data was, and they didn't know how to catalog it.
What we’re finding now is that there’s this whole new asset class referred to as data, and companies are waking up to it. But they don't always have the tools and the know-how and the wherefore to map all that data, to catalog all that data and make it available, but we have found that companies with almost no physical assets have gained high market value in record time via this new data asset class. As I’ve said, some of the old players are now waking up to this and certainly bringing it out.
I’m a big fan of taking folk on a little bit of a journey, so in the eighteen hundreds, late eighteen hundreds, and you’ll be more than familiar with this in the U.S. market, it turned out that to run a census each year or so, I think they ran them every ten years at that point, but if you’re going to run a census every year, it could take you up to eight or nine years to actually do the data analysis. It turned out that that data set then got left in boxes in places, on paper, and almost no one could find it. They just kept pumping out these reports, but the actual data was very hard to get to. We have a similar situation with another world-significant moment, around the 1940s, with the Second World War, and this thing is the Bletchley Park Bombe, spelled B-O-M-B-E, and it was a massive number-crunching analytical tool that would go through small data sets and find signals in them, and be used to help crack codes through the Enigma.
This thing again, was essentially a device designed, not much to catalog, but to tag and map data, and make it possible to take patterns and find it inside the data sets, in this case, break codes, find keys and phrases and find them regularly in the data sets, and so we’ve been through this journey of finding things in data, and leading towards cataloging data.
And then these things came along, these massive low-cost racks of machines, just off-the-shelf machines. And we did some very interesting things, and one of the things we did with them is we built very low cost clusters that could start indexing the planet, and very famously these big brands that have come and gone, but probably Google’s the most common home brand that we’ve all heard of – it’s become an actual verb, and you know you’re successful when your brand becomes a verb. But what Google taught us, without realizing it, possibly in the business world, is that they were able to index the entire planet to a certain level, and catalog the data that was around the world, and make it available in a very easy, convenient form in a little tiny one-line formula, a web page with almost nothing on it, and you type in your query, it goes and finds it because they had already crawled the planet, indexed it and made it easily available.
And what we noticed was, “Well hang on, we aren't doing this in organizations – why is that? Why is that we’ve got an organization that can map the entire planet and index it, crawl and index it, and make it available, we can search for it, and then click on the thing to go and find it, how come we haven't done that internally?” So there are lots of these little racks of machines around the world now that do that for intranets and find things, but they’re still really just coming to grips with the idea of going beyond the traditional web page, or a file server.
We’re now entering this next generation of data catalog in many ways. Discovering data access via post-it notes and water cooler conversations is not really an appropriate method for data discovery and cataloging anymore, and in fact, I don't think it ever really was. We can no longer leave that whole challenge to people just passing notes, and posting notes, and chatting about it. We’re well and truly beyond the era now where this next-gen approach to data cataloging has come and gone. We have to get our arms around it. If this was an easy issue, we would’ve already solved it in many ways earlier, but I think that it isn't an easy issue. Just indexing and crawling the data is only one part of it; there is also knowing what’s in the data and building metadata around what we discover, and then making it available in an easy, consumable form, particularly for self-service and analytics. It’s still a problem being solved, but many parts of the puzzle in five years are well and truly solved and available.
As we know, humans cataloging data is a recipe for failure because human error is one of the greatest nightmares we deal with in data processing, and I regularly talk about this topic where in my view, humans filling in paper forms is probably the greatest nightmare we deal with in big data and analytics, to constantly having to fix things that they do, even down to simple things like the dates and fields, people putting it in the wrong format.
But as I’ve said, we’ve seen internet search engines index the world every day, so now we’re making it to the idea that that can be done on business data sets in the discovery process, and tools and systems are now readily available as you are about to learn today. So the trick, really in my view, is selecting the right tools, the best tools for the job. And more appropriately on top of that, finding the right part of it to help you get started down this path. And I believe we’re going to hear about that today, but before we do that, I’m going to pass over to my colleague, Robin Bloor, and hear his take on the topic. Robin, can I pass over to you?
Robin Bloor: Yes, certainly you can. Let's see if this works, oh yes it does. Okay, I’m coming from a different direction than Dez really, but I’ll end up in the same place. This is about connecting to data, so I just thought I’d walk through the reality of connecting to data, point by point really.
There’s a fact that data is more fragmented than it has ever been. The volume of data is growing phenomenally, but in actual fact, the different sources of data are also growing at an incredible rate, and therefore data is becoming increasingly fragmented all the time. But because of analytics applications in particular – but those are not the only applications – we have got a really good reason to connect to all of this data, so we are stuck in a difficult place, we are stuck in a world of fragmented data, and there’s opportunity in the data as Dez was calling it, the new oil.
About data, well, it used to live on spinning disk, either in file systems or databases. Now it lives in a much more varied environment, it lives in file systems but it also lives in Hadoop instances nowadays, or even Spark instances. It lives in multiple species of database. Not so long ago, we kind of standardized on relational databases, well you know that went out the window in the past five years, because there’s a need for document databases, and there’s a need for graph databases, so you know, the game has changed. So it lived on spinning disk, but it now lives on SSD. The latest amount of SSD – definitely the latest SSD unit is coming out from Samsung – twenty gigabytes, which is huge. Now it lives in memory, in the sense that the prime copy of data can be in memory, rather than on disk, we didn't used to build systems like that; we do now. And it lives in the cloud. Which means it can live in any of these things, in cloud, you won't necessarily know where it is in a cloud, you will only have its address.
Just to ram home the point, Hadoop has, so far, failed as an extensible data store. We had hoped it would become an extensible scale-out data store, and it would just become one file system for everything, and it would – rainbows would appear in the sky, basically, and unicorns would dance around, and none of that happened. Which means we end up with a problem of data transport, and there is a necessity for data transport, at times, but it is also a difficulty. Data really does have gravity nowadays; once you have got into the multi-terabytes of data, picking it up and throwing it around kind of causes latencies to appear on your network, or to appear in various places. If you want to transport data around, timing is a factor. There are nearly always, nowadays, some limits on how much time you have got to get one piece of data from one place to another place. There used to be what we used to think of as batch windows, when the machine was kind of idle, and no matter how much data you had, you could just throw it around and it would all work out. Well that’s gone, we are living in much more of a real-time world. Therefore timing is a factor as soon as you want to move data around, so if the data has gravity, you probably can't move it.
Data management is a factor in the sense that you have actually got to manage all this data, you don't get that for free, and replication may be necessary in order to actually get the data to do the job it needs to do, because it may not be wherever you have put it. It may not have sufficient resources in order to do the normal processing of the data. So data gets replicated, and data gets replicated more than you would imagine. I think somebody told me a long time ago that the average piece of data is replicated at least two and a half times. ESBs or Kafka present an option for data flow, but nowadays it demands architecture. Nowadays you really need to think, in one way or another, about what it actually means to throw the data around. Therefore, to access data where it is, is usually preferable, as long as, of course, you can get the performance you need when you actually go for the data, and that depends on context. So it is a difficult situation, anyway. In terms of data queries, we used to be able to think in terms of SQL; we've come up really now with, you know, different forms of queries: SQL yes, but also JSON, also graph queries, Spark is only one example of doing graph, and also we need to do text search, more than we ever did, also regex types of searches, which are really complicated searches for patterns, and genuine pattern matching, all of these things are actually bubbling up. And all of them are useful because they get you what you are looking for, or they can get you what you are looking for.
Queries nowadays span multiple data sources, which they didn't always do, and often the performance is appalling if you do that. So, it depends upon the circumstances, but people expect to be able to query data from multiple data sources, so data federation of one sort or another is becoming more and more current. Data virtualization, which is a different way of doing it, depending on performance, is also very common. Data queries are actually a part of a process, not the whole process. It is just worth pointing out that if you are actually looking at analytics performance, the actual analytics can take an awful lot longer than the data gathering, though that depends upon the circumstances. But data queries are an absolute necessity if you want to do any kind of analytics on multiple data sources, and you really actually have to have capabilities that span them.
So about catalogs. Catalogs exist for a reason, at least we are saying that, you know, we have directories, and we have schemas in databases, and we have HCatalog, and wherever you go you will find one place and then you will actually find that there’s some kind of catalog, and the unified global catalog is such an obviously good idea. But very few companies have such a thing. I do remember, back in the year two thousand – the year two thousand panic – I do remember that companies couldn't even pin down how many executables they had, never mind how many different data stores they had, and it’s probably the case now, you know, that most companies do not actively know, in the global sense, what data they’ve got. But it is obviously becoming increasingly necessary to actually have a global catalog, or at least to have a global picture of what is going on, because of the growth of data sources, and the continued growth of applications, and it is particularly necessary for analytics, because you also in one way, and there are other issues here like lineage and problems with the data, and it is necessary for security and many aspects of data governance. If you really don't know what data you have got, the idea that you are going to govern it is just absurd. So, that all data is cataloged in some way is just a fact. The question is whether the catalog is coherent, and actually what you can do with it. So I shall pass back to Rebecca.
Rebecca Jozwiak: Okay, thanks Robin. Up next we have got David Crawford from Alation, David I am going to go ahead and pass the ball to you, and you can take it away.
David Crawford: Thank you so much. I really appreciate you guys having me on this show. I think I am going to get this started, so I think my role here is to take some of that theory and see how it is actually being applied, and the results that we are able to drive at real customers, and so you can see a few on the slide. I want to talk about what results we have been able to see in analyst productivity improvements. So to motivate the discussion, we are going to talk about how they got there. So I am lucky to get to work pretty closely with a lot of really smart people, these customers, and I just want to point out a few who have been able to actually measure, and talk about, how having a data catalog has impacted their analyst workflow. And just to briefly state it up front, I think one of the things that we see change with data catalogs versus previous metadata solutions, and one of the ways that Alation really thinks about the solutions that we put together, is to start from the analysts and work backwards. To say, let's make this about enabling analysts’ productivity. As opposed to just compliance, or as opposed to just having an inventory, we are making a tool that makes analysts more productive.
So, when I talked to a data scientist at the financial services company Square, there’s a guy, Nick, who was telling us about how he used to take several hours to find the right data set to start a report; now he can do it in a matter of seconds using search. At MarketShare, we talked to their CTO, who polled his analysts who were using Square, excuse me, were using Alation, to find out what benefits they saw, and they reported a 50 percent productivity boost. And at one of the world’s top retailers, eBay, they’ve got over a thousand people who are doing SQL analysis on a regular basis, and I work pretty closely with Deb Says over there, who is the project manager in their data tools team, and she found that when queriers adopt Alation, adopt a catalog, they are seeing double the speed of writing new queries against the database.
So these are real results, these are people actually applying the catalog in their organization, and I want to take you through what it takes to get set up. How a catalog gets established in a company, and maybe the most important thing to say, is that a lot of it happens automatically, so Dez talked about systems, learning about systems, and that’s exactly what a modern data catalog does. So they install Alation in their data center and then they connect it to various sources of metadata in their data environment. I’ll focus a little bit on the databases and the BI tools – from both of these we are going to extract technical metadata, about basically what exists. Right, so what tables? What reports? What are the report definitions? So they extract that technical metadata, and a catalog page is automatically created for every object inside of those systems, and then, on top of that technical metadata, they also extract and layer the usage data. That’s primarily done by reading query logs from the database, and this is a really interesting source of information. So, whenever an analyst writes a query, whenever a reporting tool, whether it is home grown or off the shelf, runs a query in order to update a dashboard, when an application runs a query to insert data or to operate on a data set – all of those things are captured in database query logs. Whether you have a catalog or not, they are captured in the query log of the database. What a data catalog can do, and especially what Alation's catalog can do, is read those logs, parse the queries inside of them, and create a really interesting usage graph based on those logs, and we bring that into play to inform future users of the data about how past users of the data have used it.
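[Editor's illustration] To make the query-log idea concrete, here is a minimal Python sketch of turning logged SQL into per-table usage counts. It is not Alation's implementation: the log format, the table names, and the regex shortcut for spotting table references are all assumptions made for illustration, and a real catalog would use a proper SQL parser and vendor-specific log readers.

```python
import re
from collections import Counter

# Hypothetical query-log entries as (user, SQL text) pairs.
# Real database query logs differ by vendor and carry far more detail.
QUERY_LOG = [
    ("nick", "SELECT name, tuition FROM edu.institution WHERE year = 2015"),
    ("deb", "SELECT i.name, s.avg_faculty_salary FROM edu.institution i "
            "JOIN edu.salary s ON i.id = s.institution_id"),
    ("nick", "INSERT INTO edu.report_cache SELECT * FROM edu.salary"),
]

# Simplification: treat whatever follows FROM, JOIN or INTO as a table name.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN|INTO)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_in(sql):
    """Return the lower-cased table names referenced by one SQL statement."""
    return [name.lower() for name in TABLE_REF.findall(sql)]


def usage_stats(log):
    """Count how many queries touch each table, and which users ran them."""
    table_counts = Counter()
    users_by_table = {}
    for user, sql in log:
        # set() so a table mentioned twice in one query is counted once
        for table in set(tables_in(sql)):
            table_counts[table] += 1
            users_by_table.setdefault(table, set()).add(user)
    return table_counts, users_by_table


counts, users = usage_stats(QUERY_LOG)
for table, n in counts.most_common():
    print(table, n, sorted(users[table]))
```

Counts like these are the raw material for the "who used this before" and popularity signals described above.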
So, we bring all of that knowledge together into a catalog, and just to kind of make this real, these are the integrations that are already deployed at customers, so, we have seen Oracle, Teradata, Redshift, Vertica and a bunch of other relational databases. In the Hadoop world, there’s a range of SQL on Hadoop, sort of relational, meta stores on top of the Hadoop file system, Impala, Tez, Presto and Hive, we have also seen success with cloud Hadoop private providers like Altiscale, and we have also been able to connect to Tableau servers, MicroStrategy servers and index the dashboards there, as well as integrations with data science charting tools like Plotly.
So, we connect to all of these systems, we have connected these systems to customers, we have pulled in the technical metadata, we have pulled in the usage data, and we have sort of automatically primed the data catalog. In that way, we centralize the knowledge, but just centralizing things into a data catalog doesn't by itself provide those really wonderful productivity boosts that we talked about with eBay, Square and MarketShare. In order to do that, we actually need to change the way that we think about delivering knowledge to analysts. One of the questions that they asked me to prepare for this was, “How does the catalog actually impact an analyst’s workflow?”
That’s what we spend all day thinking about, and in order to talk about this change in thinking, of a push versus a pull model, I wanted to make a quick analogy to what the world was like before and after reading on a Kindle. So it’s just an experience some of you might have had: when you are reading a physical book, you come across a word, you are not sure you know that word’s definition super well, you can maybe guess it from context. It’s not that likely that you are going to get up off the couch, walk to your bookshelf, find your dictionary, dust it off, and flip to the right place in the alphabetical listing of words to make sure that, yes, you had that definition just right, and you know the nuances of it. So it doesn't really happen. So you buy a Kindle app and you start to read books there, and you see a word you are not totally sure about and you touch the word. All of a sudden, right in that same screen, is the dictionary definition of the word, with all of its nuances, different example usages, and you swipe a little bit, and you get a Wikipedia article on that topic, you swipe again, you get a translation tool that can translate it into other languages or from other languages, and all of a sudden your knowledge of the language is that much richer, and it just happens an astounding number of times, compared to when you had to go and pull that resource for yourself.
And so what I am going to argue is that the workflow for an analyst, and the way that an analyst will deal with data documentation, is actually very similar to how a reader will interact with the dictionary, whether a physical one or through the Kindle, and so the way that we really saw this productivity boost is not by building the catalog, but by connecting it to the workflow of the analyst. They’ve asked me to do a demo here, and I want to make that the focus of this presentation. But I just want to set up the context for the demo. When we think about pushing the data knowledge to the users when they need it, we think the right place to do that, the place where they spend their time and where they're doing the analysis, is a SQL query tool. A place where you write and run SQL queries. And so we built one, and the thing that's really different about it from other query tools is its deep integration with the data catalog.
So our query tool is called Alation Compose. It's a web-based query tool and I'll show it to you in a second. A web-based query tool that works across all of those database logos that you saw on the previous slide. What I'm going to try to demo in particular is the way that the catalog information comes to users. And it does it through these kind of three different ways. It does it through interventions, and that's where somebody who's a data governor, or a data steward, or sort of an administrator of some way, or a manager, can say, “I want to sort of interject with a note or a warning in the workflow and make sure that it's delivered to users at the right time.” So that's an intervention and we'll show that.
Smart suggestions is a way where the tool uses all of its aggregated knowledge of the catalog to suggest objects and parts of a query as you're writing it. The most important thing to know there is that it really takes advantage of the query log to do that, to suggest things based on usage and also to find even parts of queries that have been written before. And we'll show that.
And then previews. Previews are, as you're typing in the name of an object, we show you everything that the catalog knows, or at least the most relevant things that the catalog knows about that object. So samples of the data, who had used it before, the logical name and description of that object, all come up to you while you're writing it without having to go ask for it.
So without any more talking, I'll get to the demo, and I'm just going to wait for it to appear. What I'm going to show you here is the query tool. It's a dedicated SQL writing interface. It's a separate interface from the catalog, in a certain sense. Dez and Robin talked about the catalog, and I'm jumping a little bit over the catalog interface straight to how it's brought directly in to service the workflow.
I'm just showing here a place where I can type SQL, and at the bottom you'll see that we sort of have some information appearing about the objects that we're referencing. So I'm just going to start typing a query and I'll stop when I get to one of these interventions. So I'll type “select,” and I want the year. I want the name. And I'm going to look up some salary data. So this is an education data set. It has information about higher education institutions, and I'm looking at the average faculty salary that's in one of these tables.
So I've actually typed the word “salary.” It's not exactly in the name of the column that way. We use both the logical metadata and the physical metadata to do suggestions. And what I want to point out here is this yellow box that's appearing here. It says there's a warning on this column. I didn't go looking for that, I didn't take a class on how to use this data properly. It came to me, and it happens to be a warning about a confidentiality agreement that has to do with this data. So there are some disclosure rules. If I'm going to query this data, if I'm going to take data out of this table, I should be careful about how I disclose it. So you have a governance policy here. There are some compliance challenges, and it's so much easier to comply with this policy when I know about it at the time that I'm looking at the data.
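[Editor's illustration] The intervention described here, a steward's note that surfaces when a query touches a flagged column, can be sketched in a few lines of Python. Everything in it is hypothetical: the column name `avg_faculty_salary`, the warning text, and the simple identifier scan are stand-ins for illustration, not how Alation Compose actually works.

```python
import re

# Hypothetical steward-authored notes, keyed by column name.
COLUMN_WARNINGS = {
    "avg_faculty_salary": (
        "Confidentiality agreement applies: check disclosure rules before sharing."
    ),
}


def warnings_for(sql, warnings=COLUMN_WARNINGS):
    """Return the warnings for any flagged column mentioned in the SQL text."""
    # Simplification: collect every identifier-like token in the query.
    identifiers = {token.lower() for token in re.findall(r"[A-Za-z_]\w*", sql)}
    return [
        f"{column}: {note}"
        for column, note in sorted(warnings.items())
        if column in identifiers
    ]


draft = "SELECT year, name, avg_faculty_salary FROM salary"
for message in warnings_for(draft):
    print("WARNING", message)
```

The point of the design is the push model from the talk: the check runs on the draft query as it is written, so the analyst never has to go looking for the policy.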
So I've got that coming up to me, and then I'm also going to look at tuition. And here we see the previews come into play. On this tuition column, I see – there's a tuition column on the institution table, and I'm seeing a profile of that. Alation goes and pulls sample data from the tables, and in this case, it's showing me something that's pretty interesting. It's showing me the distribution of the values, and it's showing me that the zero value showed up 45 times in the sample, and more than any other value. So I've got some sense that we might be missing some data.
If I'm an advanced analyst, then this might be part of my workflow already. Especially if I'm a particularly meticulous one, where I would do a bunch of profiling queries ahead of time. Whenever I'm approaching a new piece of data, I always think about what our data coverage is. But if I'm new to data analysis, if I'm new to this data set, I might assume that if there's a column, it's filled in all the time. Or I might assume that if it's not filled in, it's not zero, it's null or something like that. But in this case, we have a lot of zeroes, and if I did an average, it would probably be wrong, if I just assumed that those zeroes were actually zero instead of missing data.
But Alation, by bringing this preview into your workflow, kind of asks you to take a look at this information and gives even sort of novice analysts a chance to see that there's something to notice here about that data. So we have that preview.
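To make the point about zeroes concrete, here is a minimal Python sketch of the kind of profiling a preview summarizes. The sample values are invented for illustration, and treating 0 as "not reported" is an assumption about this hypothetical column, not something the catalog decides for you.

```python
from statistics import mean

# Hypothetical sample of a tuition column. Treating 0 as "not reported"
# is an assumption made for this illustration only.
tuition_sample = [0, 0, 0, 12000, 18000, 0, 24000, 30000]

def value_counts(values):
    """Return (value, count) pairs, most frequent value first."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# Average that trusts every zero as a real tuition of zero.
naive_avg = mean(tuition_sample)

# Average over only the rows that look like reported values.
reported = [v for v in tuition_sample if v != 0]
reported_avg = mean(reported)

print(value_counts(tuition_sample)[0])  # the most common value and its count
print(naive_avg, reported_avg)
```

In this made-up sample the two averages differ by a factor of two, which is the kind of gap a quick look at the value distribution is meant to flag before the query is written.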
The next thing that I'm going to do is I'm going to try to find out what tables to get this information from. So here we see the smart suggestions. It's been going all the time, but in particular here, I haven't even typed anything but it's going to suggest to me which tables I might want to be using for this query. And the most important thing to know about this is that it takes advantage of the usage stats. So in an environment like, for instance, eBay, where you have hundreds of thousands of tables in a single database, having a tool that can kind of sift the wheat from the chaff, using those usage stats, is really important for making these suggestions worth something.
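As a rough illustration of ranking suggestions by usage, the sketch below counts table references in a query log and returns the most-used tables first. It is not Alation's algorithm: the log entries and table names are hypothetical, and a regular expression stands in for real SQL parsing.

```python
import re
from collections import Counter

# Naive pattern: the identifier that follows FROM or JOIN.
# A real tool would parse the SQL; this is only a sketch.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_.]*)", re.IGNORECASE)

def table_usage(log):
    """Count how often each table is referenced across logged queries."""
    counts = Counter()
    for sql in log:
        counts.update(name.lower() for name in TABLE_REF.findall(sql))
    return counts

def suggest_tables(log, prefix="", limit=3):
    """Return table names matching the typed prefix, most used first."""
    usage = table_usage(log)
    matches = [(t, n) for t, n in usage.items() if t.startswith(prefix.lower())]
    matches.sort(key=lambda tn: (-tn[1], tn[0]))  # by count, then by name
    return [t for t, _ in matches[:limit]]

# Hypothetical query log.
query_log = [
    "SELECT year, tuition FROM institution_finance",
    "SELECT name FROM institution JOIN institution_finance "
    "ON institution.ope_id = institution_finance.ope_id",
    "SELECT * FROM institution_finance WHERE year = 2014",
    "SELECT * FROM tmp_institution_backup",
]

print(suggest_tables(query_log))
print(suggest_tables(query_log, prefix="inst"))
```

With hundreds of thousands of tables, the same idea pushes rarely touched tables such as the hypothetical backup copy to the bottom of the list.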
So it's going to suggest this table. When I look at the preview, we actually highlight three of the columns that I have mentioned already in my query. So I know that it's got three, but it doesn't have the name. I need to get the name, so I'm going to do a join. When I do a join, now again I have these previews to help me find, where is the table with the name. So I see that this one has a nicely formatted, kind of properly capitalized name. It seems to have one row with a name for each institution, so I'm going to grab that, and now I need a join condition.
And so, here what Alation is doing is again looking back at the query logs, seeing previous times that these two tables have been joined, and suggesting different ways to join them. Once again, there's some intervention. If I look at one of these, it's got a warning that shows me that this should only be used for aggregate analysis. It'll probably produce the wrong thing if you're trying to do something institution by institution. Whereas this one, with the OPE ID, is endorsed as the proper way of joining these two tables if you want university-level data. So I do that, and it's a short query, but I've written my query without really necessarily having any insight into what the data is. I've never actually looked at an ER diagram of this data set, but I know quite a lot about this data already because the relevant information is coming to me.
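The join suggestions described here can be sketched the same way: collect the join conditions that earlier queries used for a pair of tables, count them, and attach any steward annotation. Everything in this example is hypothetical (the `finance` table, the log, the annotation text), and the regular expression only handles the simplest `FROM a JOIN b ON x.c = y.c` shape.

```python
import re
from collections import Counter

# Matches only the simplest shape: FROM a JOIN b ON a.col = b.col
JOIN_ON = re.compile(
    r"\bFROM\s+(\w+)\s+JOIN\s+(\w+)\s+ON\s+(\w+\.\w+\s*=\s*\w+\.\w+)",
    re.IGNORECASE,
)

def normalize(cond):
    """Lowercase both sides and sort them so a = b and b = a compare equal."""
    sides = [s.strip().lower() for s in cond.split("=")]
    return " = ".join(sorted(sides))

def join_suggestions(log, left, right, annotations=None):
    """Return (condition, times_used, note) for joins between two tables."""
    annotations = annotations or {}
    counts = Counter()
    for sql in log:
        for a, b, cond in JOIN_ON.findall(sql):
            if {a.lower(), b.lower()} == {left, right}:
                counts[normalize(cond)] += 1
    return [(c, n, annotations.get(c, "")) for c, n in counts.most_common()]

# Hypothetical query log and steward annotations.
query_log = [
    "SELECT name FROM institution JOIN finance ON institution.ope_id = finance.ope_id",
    "SELECT name FROM finance JOIN institution ON finance.ope_id = institution.ope_id",
    "SELECT state, AVG(tuition) FROM institution JOIN finance "
    "ON institution.state = finance.state",
]
notes = {
    "finance.ope_id = institution.ope_id": "endorsed",
    "finance.state = institution.state": "warning: aggregate analysis only",
}

suggestions = join_suggestions(query_log, "institution", "finance", notes)
for cond, used, note in suggestions:
    print(cond, used, note)
```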
So those are kind of the three ways that a catalog can, through an integrated query tool, directly impact the workflow as you're writing queries. But one of the other benefits of having a query tool integrated with a catalog is that, when I finish my query and I save it, I can put a title like “Institution Tuition and Faculty Salary,” and then I have a button here that allows me to just publish it to the catalog. It becomes very easy for me to feed this back. Even if I don't publish it, it's being captured as part of the query log, but when I publish it, it actually becomes part of the centralized place where all data knowledge is living.
So if I click Search for all queries in Alation, I'm going to be taken – and here you'll see some more of the catalog interface – I'm taken to a dedicated query search that shows me a way to find queries across the entire organization. And you see that my newly published query is at the top. And some might notice here that, as we capture the queries, we also capture the authors, and we sort of establish this relationship between me as an author and these data objects that I now know something about. And I'm being established as an expert on this query and on these data objects. That's really helpful when people need to go learn about data, because then they can go find the right person to learn from. And whether I'm new to data or I'm an advanced analyst, this helps: as an advanced analyst, I might look at this and see a bunch of examples that would get me started on a new data set. As someone who might not feel super savvy with SQL, I can find pre-made queries that are reports that I can take advantage of.
Here's one by Phil Mazanett about median SAT scores. Click on this, and I get sort of a catalog page for the query itself. It talks about an article that was written that references this query, so there's some documentation for me to read if I want to learn how to use it. And I can open it up in the query tool by clicking the Compose button, and I can just run it myself here without even editing it. And actually, you get to see a little bit of our lightweight reporting capabilities, where, when you're writing a query, you can drop in a template variable like this and it creates a simple way to create a form to execute a query based on a couple of parameters.
So that's what I have for the demo. I'm going to switch back to the slides. Just to kind of recap, we showed how an administrator, a data governor, can intervene by placing warnings on objects that show up in the query tool, how Alation uses its knowledge of the usage of data objects to do smart suggestions, how it brings in profiling and other tips to improve the workflows of analysts when they're touching particular objects, and how all of that kind of feeds back into the catalog when new queries are written.
Obviously I'm a spokesperson on behalf of the company. I'm going to say nice things about data catalogs. If you want to hear directly from one of our customers, Kristie Allen at Safeway runs a team of analysts and has a really cool story about a time when she needed to really beat the clock in order to deliver a marketing experiment, and how her whole team used Alation to collaborate and turn around really quickly on that project. So you can follow this bit.ly link to check that story out, or if you want to hear a little bit about how Alation could bring a data catalog into your organization, we are happy to set up a personalized demo. Thanks a lot.
Rebecca Jozwiak: Thanks so much, David. I'm sure that Dez and Robin have a few questions before I turn over to the audience Q&A. Dez, do you want to go first?
Dez Blanchfield: Absolutely. I love the idea of this concept of published queries and linking it back to the source of the authoring. I've been a longtime champion of this idea of an in-house app store and I think this is a really great foundation to build on that.
I'm keen to kind of get some insight into some of the organizations that you're seeing doing this, and some of the success stories that they might have had with this whole journey of not only leveraging your tool and platform to discover the data, but also then transforming their internal cultural and behavioral traits around now having this sort of in-house app store where you sort of just download, the concept where they can not only just find it, but they can actually start developing little communities with the keepers of that knowledge.
David Crawford: Yeah, I think we've been surprised. We believe in the value of sharing queries, both from my past as a product manager in Adtech and from all the customers that we've talked to, but I've still been surprised at how often it's one of the very first things that customers talk about as the value that they get out of Alation.
I was doing some user testing of the query tool at one of our customers called Invoice2go, and they had a product manager who was relatively new, and they said – he actually told me, unprompted during the user test, “I actually wouldn't be writing SQL at all except that it's made easy by Alation.” And of course, as the PM, I kind of go, “What do you mean, how did we do that?” And he said, “Well, really it's just because I can log in and I can see all of these existing queries.” Starting with a blank slate with SQL is an incredibly hard thing to do, but modifying an existing query where you can see the result that's put out and you can say, “Oh, I just need this extra column,” or, “I need to filter it to a particular range of dates,” that's a much easier thing to do.
We've seen kind of these ancillary roles, like product managers, maybe folks in sales ops, who always wanted to learn SQL and start to pick it up by using this catalog. We've also seen that a lot of companies have tried to build these kinds of things internally, where they track the queries and make them available, and there are some really kind of tricky design challenges to making them useful. Facebook has had an internal tool that they called HiPal that sort of captured all the queries written on Hive, but what you find out is that if you don't kind of nudge the users in the right way, you just end up with a very long list of select statements. And as a user who's trying to figure out if a query is useful to me or if it's any good, if I just go look through a long list of select statements, it will take me a lot longer to get something of value out of there than starting from scratch. We thought pretty carefully about how to make a query catalog that brings the right stuff to the front and provides it in a useful way.
Dez Blanchfield: I think we all go through this journey from a very young age, through to adulthood, in many ways. A bunch of technologies. I, personally myself, I've gone through that very same genuine thing, like, learning to cut code. I would go through magazines and then books, and I would study to a certain level, and then I needed to go and actually get some more training and education on it.
But inadvertently I found that even when I was going from teaching myself and reading magazines and reading books and chopping other people's programs and then going to courses on it, I still ended up learning as much from doing the courses as I did just talking to other people who had some experiences. And I think that it's an interesting discovery that, now that you bring that to data analytics, we're basically seeing that same parallel, that human beings are invariably quite smart.
The other thing I'm really keen to understand is, at a very high level, many organizations are going to ask, “How long does it take to get to that point?” What's the tipping point time-frame-wise when people get your platform installed and they start to discover the types of tools? How quickly are people just sort of seeing this thing turn into a really immediate “a-ha” moment where they realize they're not even worrying about the ROI anymore because it's there, but now they're actually changing the way they do business? And they've discovered a lost art and they expect they can do something really, really fun with it.
David Crawford: Yeah, I can touch on it a little bit. I think that when we get installed, that one of the nice things, one of the things that people like about a catalog that’s directly connected into the data systems, is that you don't start blank where you have to kind of fill it in page by page. And this is kind of true of previous data solutions where you'd start with an empty tool and you have to start creating a page for everything you want to document.
Since we document so many things automatically by extracting the metadata, essentially within a few days of having the software installed, you can have a picture of your data environment that's at least 80 percent there in the tool. And then I think as soon as people start writing queries with the tool, they're saved automatically back into the catalog, and so they'll start to show up as well.
I don't want to be over-eager in stating it. I think two weeks is a pretty good conservative estimate, to a month. Two weeks to a month, conservative estimate of really turning around and feeling like you're getting value out of it, like you're starting to share some knowledge and being able to go there and find out things about your data.
Dez Blanchfield: It's quite astonishing, really, when you think about it. The fact that some of the large data platforms that you're effectively indexing and cataloging will take sometimes up to a year to implement and deploy and stand up properly.
The last question I've got for you before I hand off to Robin Bloor, is connectors. One of the things that immediately jumps out at me is you've obviously got that whole challenge sorted out. So there's a couple questions just really quickly. One, how rapidly do connectors get implemented? Obviously you start with the biggest platform, like the Oracles and the Teradatas and so forth and DB2s. But how regularly are you seeing new connectors come through, and what turnaround time do they take? I imagine you have a standard framework for them. And how deep do you go into those? For example, the Oracles and IBMs of the world, and even Teradata, and then some of the more popular of late open-source platforms. Are they working directly with you? Are you discovering it yourselves? Do you have to have inside knowledge on those platforms?
What does it look like to sort of develop a connector, and how deep do you get involved to those partnerships to ensure those connectors are discovering everything you possibly can?
David Crawford: Yeah, sure, it's a great question. I think that for the most part, we can develop the connectors. We certainly did when we were a younger startup and had no customers. We can develop the connections certainly without needing any internal access. We never get any special access to the data systems that isn't publicly available, and often without needing any inside information. We take advantage of the metadata services made available by the data systems themselves. Often those can be pretty complex and hard to work with. I know SQL Server in particular, the way that they manage the query log, there's several different configurations and it's something that you really have to work at. You have to understand the nuances and the knobs and dials on it to set it up properly, and that's something that we work with customers on since we've done it several times before.
But to a certain extent, it's kind of public APIs that are available or public interfaces that are available that we leverage. We do have partnerships with several of these companies, that's mostly a grounds for certification, so that they feel comfortable saying that we work and also they can provide us resources for testing, sometimes early access maybe to a platform that's coming out to make sure that we work on the new versions.
To turn around a new connection, I would say again, trying to be conservative, let's say six weeks to two months. It depends on how similar it is. So some of the Postgres forks kind of look very similar to Redshift. Redshift and Vertica share a lot of their details. So we can take advantage of those things. But yeah, six weeks to two months would be fair.
We also have APIs, so that – we think of Alation as a metadata platform as well, so if anything's not available for us to reach out and automatically grab, there are ways that you can write the connector yourself and push it into our system so that everything still gets centralized in a single search engine.
Dez Blanchfield: Fantastic. I appreciate that. So we're going to hand it over to Robin, because I'm sure he has a plethora of questions as well. Robin?
Rebecca Jozwiak: Robin may be on mute.
Dez Blanchfield: You've got yourself on mute.
Robin Bloor: Yeah, right. Sorry, I muted myself. When you implement this, what's the process? I'm kind of curious because there can be a lot of data in many places. So how does that work?
David Crawford: Yeah, sure. We go in, and first it's sort of an IT process of making sure our server's provisioned, making sure that network connections are available, that the ports are open so we can actually access the systems. They often know which systems they want to start with. Knowing inside of a data system, which – and sometimes we actually will help them. We'll help them go do an initial look at their query log to understand who's using what and how many users they have on a system. So we'll help find out where – they often, if they've got hundreds or thousands of people who might be logging into databases, they actually don't know where they're logging in, so we can go find out from the query logs how many unique user accounts you actually have logging in and executing queries here in a month or so.
So we can take advantage of that, but often only on the most important ones. We get them set up and then there's a process of saying, "Let's prioritize." There's a range of activities that can happen in parallel. I would focus in onto the training for using the query tool. Once people start using the query tool, first of all, a lot of people love the fact that it's just a single interface to all of their different systems. They also love the fact that it's web-based, doesn't involve any installs if they don't want to. From a security standpoint, they like having sort of a single entry point, from a network standpoint, between sort of a corp IT network and the data center where the production data sources live. And so, they'll set up Alation as a query tool and start to use Compose as a point of access for all of these systems.
So once that happens, what we focus on there in training is understanding what are some of the differences between a web-based or a server-based query tool versus one you'd have on your desktop, and some of the nuances of using that. And at the same time what we'll try to do is identify the most valuable data, again taking advantage of the query log information, and saying, “Hey, you might want to go in and help people understand these. Let's start publishing representative queries on these tables.” That's sometimes the most effective way to very quickly get people spun up. Let's look at your own query history, publish these things so that they show up as the first queries. When people look at a table page, they can see all queries that touched that table, and they can start from there. And then let's start adding titles and descriptions to these objects so that they're easier to find and search, so that you know some of the nuances of how to use it.
We make sure that we get a thorough look at the query log so that we can generate lineage. One of the things we do is we look through the query log at times when data moves from one table to another, and that allows us to answer one of the most frequently asked questions about a table of data: where did this come from? How do I trust it? And so what we can show is not only which other tables it came from, but how it was transformed along the way. Again, this is kind of powered by the query log.
So we make sure that those things are set up and that we're getting lineage into the system, and we're targeting the most highly valuable and the most highly leveraged pieces of metadata that we can get established on the table pages, so that when you search, you find something useful.
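Deriving lineage from a query log can be sketched in a few lines: whenever a logged statement writes one table while reading others, record an edge from each source to the destination. This is a simplified illustration, not Alation's implementation. The table names are hypothetical and a regular expression stands in for a SQL parser, so it would miss subqueries, CTEs and forms like `CREATE TABLE IF NOT EXISTS`.

```python
import re

# Naive patterns; a real lineage extractor would parse the SQL.
WRITE = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+(\w+)", re.IGNORECASE)
READ = re.compile(r"\b(?:FROM|JOIN)\s+(\w+)", re.IGNORECASE)

def lineage_edges(log):
    """Map each written table to the set of tables it was populated from."""
    edges = {}
    for sql in log:
        target = WRITE.search(sql)
        if not target:
            continue  # plain SELECTs move no data between tables
        dest = target.group(1).lower()
        sources = {s.lower() for s in READ.findall(sql)} - {dest}
        edges.setdefault(dest, set()).update(sources)
    return edges

def upstream(edges, table, seen=None):
    """Return every table that feeds `table`, directly or indirectly."""
    seen = set() if seen is None else seen
    for src in edges.get(table, ()):
        if src not in seen:
            seen.add(src)
            upstream(edges, src, seen)
    return seen

# Hypothetical query log.
query_log = [
    "INSERT INTO clean_events SELECT * FROM raw_events WHERE valid = 1",
    "CREATE TABLE daily_summary AS SELECT day, COUNT(*) FROM clean_events "
    "JOIN users ON clean_events.user_id = users.user_id GROUP BY day",
    "SELECT * FROM daily_summary",
]

edges = lineage_edges(query_log)
print(sorted(upstream(edges, "daily_summary")))
```

Keeping the original statement alongside each edge would also cover the "how was it transformed along the way" part, since the transformation is the SQL itself.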
Robin Bloor: Okay. The other question – there's a lot of questions from the audience, so I don't want to take up too much of the time here – the other question that kind of comes to mind is, just the pain points. A lot of software's bought because people are, in one way or another, having difficulties with something. So what's the common pain point that leads people to Alation?
David Crawford: Yeah. I think there are a few, but I think one of the ones that we hear pretty often is analyst onboarding. “I'm going to need to hire 10, 20, 30 people in the near term who are going to have to produce new insights from this data, how are they going to get up to speed?” So analyst onboarding is something we certainly tackle. There's also just relieving the senior analysts from spending all of their time answering questions from other people about data. That's a very frequent one as well. And both of those are essentially education problems.
And then I would say another place that we see people adopting Alation is when they want to set up a brand new data environment for someone to work in. They want to advertise and market this internally for people to take advantage of. Then making Alation the front-end to that new analytic environment is very appealing. It's got the documentation, it's got a single point of introduction to the – a single point of access to the systems, and so that's another place where people will come to us.
Robin Bloor: Okay, I’ll pass you on to Rebecca because the audience is trying to get to you.
Rebecca Jozwiak: Yes, we do have a lot of really good audience questions here. And David, this one was posed specifically to you. It’s from somebody who apparently has some experience with people kind of misusing queries, and he kind of says that the more we empower users, the harder it is to govern responsible use of compute resources. So can you defend against the propagation of misguided but common query phrases?
David Crawford: Yeah, I see this question. It's a great question – one we get pretty frequently. I've seen the pain myself at previous companies, where you need to train users. For instance, “This is a log table, it's got logs going back for years. If you're going to write a query on this table, you really have to limit by date.” So, for instance, that's a training I went through at a previous company before I was given access to the database.
We have a couple of ways that we try to address this. I would say that I think query log data is really uniquely valuable to address it. It gives another insight versus what the database does internally with its query planner. And what we do is, one of those interventions – we have the manual interventions that I showed, and that's useful, right? So on a particular join, for instance, you can say, "Let's deprecate this." It'll have a big red flag when it shows up in smart suggest. So that's one way of trying to get to people.
Another thing that we do is automated execution-time interventions. That'll actually use the parse tree of the query before we run it to see, does it include a certain filter or a couple of other things that we do there as well. But one of the most valuable ones and the simplest one to explain is, does it include a filter? So like that example I just gave of this log table, where if you're going to query it, you have to have a date range, you can specify in the table page there that you mandate that date range filter to be applied. If someone tries to run a query that doesn't include that filter, it actually will stop them with a big warning, and it will say, “You should probably add some SQL that looks like this to your query.” They can continue if they want. We're not going to actually completely ban them from using it – it's a query tool, it's got to, at the end of the day, run queries. But we put a pretty big barrier in front of them and we give them a suggestion, a concrete applicable suggestion to modify the query to improve their performance.
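A mandated-filter check of this kind might look roughly like the sketch below. Note the swap: the talk describes inspecting the query's parse tree, whereas this example uses regular expressions purely to stay short. The policy table, the `event_log` and `event_date` names and the suggested snippet are all hypothetical.

```python
import re

# Hypothetical policy: table name -> column that must appear in the WHERE clause.
REQUIRED_FILTERS = {"event_log": "event_date"}

def check_required_filters(sql, policies=REQUIRED_FILTERS):
    """Return warnings for tables queried without their mandated filter.

    The caller decides what to do with the warnings; nothing is blocked here,
    which mirrors the "they can continue if they want" behavior.
    """
    warnings = []
    lowered = sql.lower()
    # Regex stand-in for a parse tree: everything after the first WHERE.
    parts = re.split(r"\bwhere\b", lowered, maxsplit=1)
    where_clause = parts[1] if len(parts) > 1 else ""
    for table, column in policies.items():
        uses_table = re.search(rf"\b(?:from|join)\s+{re.escape(table)}\b", lowered)
        has_filter = re.search(rf"\b{re.escape(column)}\b", where_clause)
        if uses_table and not has_filter:
            warnings.append(
                f"Queries on {table} should filter on {column}, "
                f"e.g. WHERE {column} >= '2016-01-01'"
            )
    return warnings

print(check_required_filters("SELECT * FROM event_log"))
print(check_required_filters(
    "SELECT * FROM event_log WHERE event_date >= '2016-06-01'"
))
```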
We actually also do that automatically in some cases, again by observing the query log. If we see that some really large percentage of queries on this table take advantage of a particular filter or a particular join clause, then we'll actually pop that up. We'll promote that to an intervention. Actually, it happened to me on an internal data set. We have customer data and we have user IDs, but the user ID set, since it's kind of – we have user IDs at every customer. It's not unique, so you have to pair it with a client ID in order to get a unique join key. And I was writing a query and I tried to analyze something and it popped up and said, “Hey, everyone else seems to join these tables with both the client ID and the user ID. Are you sure you don't want to do that?” And it actually stopped me from doing some incorrect analysis. So it works for both the accuracy of the analysis as well as the performance. So that's kind of how we take that problem on.
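The user ID example is easy to reproduce. The sketch below uses Python's built-in `sqlite3` module with two invented tables in which the same `user_id` exists under two different clients; joining on `user_id` alone pairs each login with both users, while adding `client_id` gives one match per login.

```python
import sqlite3

# Hypothetical tables: user_id is only unique within a client.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (client_id INTEGER, user_id INTEGER, name TEXT);
    CREATE TABLE logins (client_id INTEGER, user_id INTEGER, login_day TEXT);
    INSERT INTO users VALUES (1, 1, 'alice'), (2, 1, 'bob');
    INSERT INTO logins VALUES (1, 1, '2016-06-01'), (2, 1, '2016-06-02');
""")

# Join on user_id alone: every login matches both users with user_id = 1.
wrong = conn.execute(
    "SELECT COUNT(*) FROM logins l JOIN users u ON l.user_id = u.user_id"
).fetchone()[0]

# Join on the full key: each login matches exactly one user.
right = conn.execute(
    "SELECT COUNT(*) FROM logins l JOIN users u "
    "ON l.client_id = u.client_id AND l.user_id = u.user_id"
).fetchone()[0]

print(wrong, right)  # 4 2
```

Two logins turning into four joined rows is the silent kind of error that an automatic "everyone else joins on both keys" prompt is meant to catch.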
Rebecca Jozwiak: That would seem to me to be effective. You said you won't necessarily block people from hogging up resources, but sort of teach them that what they're doing might not be the best, right?
David Crawford: We always assume that the users are not malicious – we assume the best intent – and we try to be pretty open in that way.
Rebecca Jozwiak: Okay. Here's another question: “What's the difference between a catalog manager, like with your solution, and an MDM tool? Or does it actually rely on a different principle by widening the choice of the query tables, whereas MDM would do it automatically, but with the same underlying principle of collecting metadata.”
David Crawford: Yeah, I think that when I look at traditional MDM solutions, the primary difference is a philosophical one. It's all about who the user is. Kind of like I said at the beginning of my presentation, Alation, I think, when we were founded, we were founded with an aim to enable analysts to produce more insights, to produce them faster, to be more accurate in the insights that they produce. I don't think that has ever been the goal of a traditional MDM solution. Those solutions tend to be targeted toward people who need to produce reports of what data has been captured to the SEC or internally for some other kind of auditing purpose. It can sometimes enable analysts, but more often, if it is going to enable a practitioner in their work, it's more likely to enable a data architect like a DBA.
When you think about things from the standpoint of an analyst, that's when you start to build a query tool that an MDM tool would never do. That's when you start to think about performance as well as accuracy, as well as understanding what data relates to my business need. All of those things are things that sort of pop in our minds when we design the tool. It goes into our search algorithms, it goes into the layout of the catalog pages and the ability to contribute knowledge from all around the organization. It goes into the fact that we built the query tool and that we built the catalog directly into it, so I think it really comes from that. What user do you have first in mind?
Rebecca Jozwiak: Okay, good. That really helped explain it. [Inaudible] who was dying to get a hold of the archives because he had to leave, but he really wanted his question answered. He said it was mentioned in the beginning that there are multiple languages, but is SQL the only language leveraged within the Compose component?
David Crawford: Yes, that's true. And one of the things that I've noticed, as I kind of witnessed the explosion of the different types of databases, of document databases, of graph databases, of key value stores, is that they are really powerful for application development. They can serve particular needs there really well, in better ways than relational databases can.
But when you bring it back to data analysis, when you bring it back to – when you want to provide that information to people who are going to do ad hoc reporting or ad hoc digging into the data, they always come back to a relational interface, at least for the humans. Part of that’s just because SQL is the lingua franca of data analysis, and that's true not just for the humans but also for the tools that integrate. I think this is the reason that SQL on Hadoop is so popular and there are so many attempts at solving it: at the end of the day, that's what people know. There are probably millions of people who know how to write SQL, and I would venture not millions who know how to write a Mongo aggregation pipeline framework query. And it's a standard language that’s used for integration across a really wide variety of platforms. So all that’s to say, we're very seldom asked to go outside of it because this is the interface that most analysts use, and it is a place where we focused, especially in Compose, on writing SQL.
I would say data science is the place where they venture outside the most, and so we do get occasional questions about using Pig or SAS. These are things that we definitely don't handle in Compose, and that we would like to capture in the catalog. And I'm seeing also R and Python. We have a couple of ways that we've made interfaces that you can use the queries written in Alation inside of R and Python scripts, so, since often when you're a data scientist and you're working in a scripting language, your source data is in a relational database. You start with a SQL query and then you process it further and create graphs inside of R and Python. And we have made packages that you can import into those scripts that pull the queries or the query results from Alation so you can kind of have a blended workflow there.
Rebecca Jozwiak: Okay, great. I know we've run a little bit past the top of the hour, I'm just going to ask one or two more questions. I know you talked about all the different [inaudible] systems that you can connect to, but as far as externally hosted data and internally hosted data, can that together be searched into your single view, into your one platform?
David Crawford: Sure. There are a few ways to do that. I mean, externally hosted, I would imagine, I'm trying to think about exactly what that might mean. It could mean a database that someone is hosting in AWS for you. It could mean a public data source from data.gov. We connect directly to databases by logging in just like another application, with a database account, and that's how we extract the metadata. So if we have an account and we have a network port open, we can get to it. And then when we don't have those things, we have something called a virtual data source, that allows you to essentially push documentation, whether automatically, by writing your own connector, or by filling it in by doing even like a CSV upload, to document the data alongside your internal data. That gets all placed into the search engine. It becomes referenceable inside of articles and other documentation and conversations inside the system. So that's how we handle when we can't directly connect to a system.
Rebecca Jozwiak: Okay, that makes sense. I'll just shoot out one more question to you. One attendee is asking, “How should the content of a data catalog be validated, verified or maintained, as source data is updated, as source data is modified, etc.”
David Crawford: Yeah, it's a question we get a lot, and I think one of the things that we – one of our philosophies, like I said, we don't believe the users are malicious. We assume that they are trying to contribute the best knowledge. They're not going to come in and deliberately mislead people about the data. If that’s a problem at your organization, maybe Alation's not the right tool for you. But if you assume good intentions by the users, then, we think about it as something where, the updates come in, and then usually what we do is we put a steward in charge of each data object or each section of the data. And we can notify those stewards when changes to the metadata are made and they can handle it in that way. They see updates come in, they validate them. If they're not right, they can go back and modify them and inform, and hopefully even reach out to the user who contributed the information and help them learn.
So that's the primary way we think about doing it. This sort of suggestion by the crowd and management by the stewards, so we have some capabilities around that.
Rebecca Jozwiak: Okay, good. And if you could just let the folks know how they can best get started with Alation, and where can they go specifically to get more info. I know you shared that one bit.ly. Is that the best place?
David Crawford: Alation.com/learnmore I think is a great way to go to sign up for a demo. The Alation.com site has a lot of great resources, customer white papers, and news about our solution. So I think that's a great place to start. You can also email [email protected].
Rebecca Jozwiak: Okay, great. And I know, attendees, sorry if I didn't get to all of the questions today, but if not, they will be forwarded to David or his sales team or somebody at Alation, so they can definitely help answer your questions and help understand what Alation does or what they do best.
And with that, folks, I’ll go ahead and sign us off. You can always find the archives at InsideAnalysis.com. You can also find it at Techopedia.com. They tend to update a little bit quicker, so definitely check that out. And thanks so much to David Crawford, Dez Blanchfield and Robin Bloor today. It's been a great webcast. And with that, I'll bid you farewell. Thanks, folks. Bye bye.
David Crawford: Thank you.