Episode 28

The modern data stack

As CEO of dbt Labs, Tristan Handy finds himself at the centre of the "modern data stack" used by many leading businesses. But what defines the modern approach to data, what came before, and where is it likely to go next? Tristan speaks to Hg's own data expert, Tim Harrison.



Episode Transcript

Tim Harrison

Welcome to Orbit, the Hg podcast series where we speak to leaders and innovators from across the software and tech ecosystem to discuss the key trends that change how we all do business. My name is Tim Harrison. I'm part of the data and analytics team at Hg, working with our portfolio companies on their end-to-end data journey, across commercial and operational optimization and data products. We're delighted to be joined today by Tristan, CEO and founder of dbt Labs, a leading B2B SaaS company in the data transformation space used by over 11,000 companies, including many Hg portfolio companies as well. Tristan recently spoke at Hg's 2023 Digital Forum in London, and it's great to be catching up again a couple of weeks later. Thanks for joining us, Tristan. Welcome.

Tristan Handy

Hey Tim, thanks for having me.

Tim Harrison

Well to start, Tristan, I think it would be great if you could tell us a little bit about yourself and dbt Labs and the journey starting with analytics there as well.

Tristan Handy

Yeah, I have been a data practitioner in some way, shape, or form for right about 20 years now. And I won't bore you with my entire story, but maybe starting back in 2013, I worked at a company called RJMetrics. And RJMetrics was a BI tool that was positioned to be very successful at the time, but was built in a technology architecture that predated what we now have come to know as the modern data stack. So I joined that company in 2013. And it just so happens that right about when I joined, Amazon released a product called Redshift, which was really the first cloud native analytical database. You could swipe a credit card and pay by usage. And this really changed the industry. It was the first time that you could see the kind of performance that you got from Redshift without spending 100 grand on an appliance that you had to store in a server room somewhere.

And very quickly over the course of the next couple of years, lots of small startups saw this opportunity and emerged to create full-stack solutions, originally around Amazon Redshift. Companies like Mode and Looker and Fivetran were all born in this environment, and they all saw the same opportunity. And over the subsequent years, this impacted me personally, because with this architectural shift RJMetrics quickly became less competitive as a product, and that's its own story, and maybe we can talk about that at some point if it's interesting.

But I got to see first-hand the emergence of this stack of products, and I was excited about it as a practitioner. For the first time I could stop worrying about performance in the way that I used to, whether you're talking about Excel or a traditional Oracle database or MySQL. I've done analytics in all these different environments, and you always ran into performance limitations that aren't your highest-value work as an analytics professional. And so I was really excited to use this new technology. And I started an analytics consultancy in 2016 called Fishtown Analytics, really just to hang out a shingle and do work in this new tech stack. And I can get to the dbt story as well, but let me stop there, and tell me if I'm boring you so far.

Tim Harrison

No, I think that's really interesting. And it ties so nicely into one of the first topics I want to discuss, which is: what is that modern data stack that we all talk about now? I often get asked that question by our portfolio company CEOs and CTOs. And even though I've been working in data quite a while now, it's actually quite a hard question to answer, because it's so entangled with the history; it's got that word modern. But one thing that's always discussed when you talk about the modern data stack is dbt; it's a core component of what people see as the modern data stack. So where better to ask than to hear your perspective on what the modern data stack really is, how we got here, and what you've seen of that journey?

Tristan Handy

So a lot of people get wrapped around the axle on the term modern. I live in a house that is of a style called mid-century modern. And this style was mostly popular in the 1940s and 1950s, maybe up through the '60s. But there's an era that is associated with mid-century modern, and the Eames chair and everything. We still call it that today even though it's now of a period that has passed. The reason it was called modern at the time, and why we still use the term, is because it was in reaction to architectural patterns that had existed beforehand. And so it was a statement of: we're doing something new here architecturally.

And the modern data stack is the same way. It was something that was in reaction to that which came before. And really there were two things that came before it. One was the era of the enterprise data appliance. These were products where, if you needed high-performance analytical computing, you literally called up a vendor and they shipped you a box and you installed it. And these products were screaming fast and really represented the state of the art for that era.

The problem was they completely and totally lacked any sense of elasticity. And of course they were very expensive, so they were inaccessible to most companies. And for the large enterprises that could afford them, the name of the game was this constant governance of who got to run workloads on the thing, because there was only so much horsepower to allocate. So you were always fighting over when you were going to run things, and people stayed late at night to kick off their jobs in the middle of the night when no one else was using the resource. That's obviously not good, and we know from the cloud today that it's an anti-pattern.

The other thing that the modern data stack is in opposition to was the Hadoop ecosystem. So there are a lot of good things about Hadoop, but at the end of the day, the architectural complexity was too high. And it just needed to be owned and operated by a very experienced set of folks internally. And it just never was able to be made turnkey and user friendly. So you have the modern data stack. Well, it's elastic, it's kind of all based around the cloud data warehouse. It speaks SQL, so it's very accessible to large numbers of people, and it doesn't really require any real performance tuning or operational complexity in order to make it all work.

Tim Harrison

And that's really interesting, that piece there around seeing those large enterprise data architectures, but then also the Hadoop movement. One piece you're describing there is really the barrier to entry, in terms of how easy it is to get set up on the modern data stack now. When I look at some of our portfolio companies, you can really get that platform operating in days or weeks. With Hadoop, that just wasn't possible once you got down into the nitty-gritty details. There were just more barriers to entry, more technical complexity there.

Tristan Handy

You can think about this from the perspective of project failure. It was not uncommon for "big data" projects to actually result in failure; they never produced business value. And that just doesn't really happen with the modern data stack. It is so turnkey, and so much of the value comes out of the box with pre-built integrations and all. Yes, if you have an existing set of SQL stored procedure workloads in a data warehouse appliance, does it take work to migrate them over to a modern data stack native approach? It does take work. There are real projects to be done there. But in my entire, now almost seven-year, history with the modern data stack, I've never seen a project just totally go sideways and fail to produce any business value.

Tim Harrison

Which is a good thing. I guess a core part now is bringing more data value to these businesses: one, reducing that risk, but also, I guess, shortening that iteration cycle. Is that where it's coming from?

Tristan Handy

Yes.

Tim Harrison

Is that why that reduction in risk and reduction in failure rate is happening?

Tristan Handy

I think it's that we're operating at a higher level of abstraction, and so the platforms are able to take care of more of the complexity for you. And at this point the main data platforms, whether you're talking about Snowflake or BigQuery or Redshift or Databricks or others, are all pretty mature products. When I hang out with my co-founders, Drew and Connor, we like to reminisce about how it was actually not that hard to produce fatal errors in Redshift back in 2016. You could do it using a sequence of SQL statements that should have operated normally, but in fact you would get a response that said something like "cluster restarting". But these products are now very mature. They've had a long time to bake. They're very widely used. So you can show up to them and expect that you're going to get a consistent, turnkey experience.

Tim Harrison

That's really interesting to hear, how much maturity has improved there. But coming back to that piece around what is the modern data stack, I think that piece around modern really resonates: there's been a paradigm shift, something new has come along, and that's why we refer to it as modern. With that said, when you're explaining it to potential customers or others, how do you describe the modern data stack now? Is it that set of turnkey components? And I guess that modularity is also a big part of it, right? It's no longer one vendor that serves you the whole solution; there are lots of different components that have quite precise functions and connect together.

Tristan Handy

Yeah. And this is a thing that's being debated inside the ecosystem right now. There's a modularity pendulum. On one side you have the ecosystem that existed back when I began my career in 2003, where you had these two big ecosystems: the Oracle ecosystem and the Microsoft ecosystem. And Oracle and Microsoft at the time were these very vertically integrated enterprise software suites. You kind of built your skillset in one or the other, and it was very hard to cross between those ecosystems. There are a lot of reasons I think that's not the best place to be; I think it eliminates practitioner choice and all this stuff. But it does create a certain level of clarity around how you construct a best-of-breed application. If you're in the Oracle ecosystem, well, here are the components that you purchase, whatever.

Today we have gone very far to the other end of the pendulum, which is to be very best-of-breed. And certainly the data platforms are at the center of this, and maybe there are four to six that are really dominant today. But then you have these other major categories. The first one that many people think about is data ingestion: you've got to get data into the cloud data platform in the first place. Then there's the place that we play, data transformation; dbt has become a pretty big standard in this layer, but there are other ways to do it too. And then there are at least a few other categories: data observability, data lineage, data cataloging. And certainly analytics and BI on the rightmost part of the diagram. But if you were motivated to do so, you certainly could put together a "modern" data stack that had 20 different components to it. And that is probably not the right answer either. It's just a lot to manage.

Tim Harrison

And do you feel that we are at the maximum of the pendulum in terms of the unbundling? Do you think there's going to be a tendency, one, toward choosing a smaller number of core components? So, okay, for that data stack you need the data ingestion, you need the transformation, you need the analytics. And that's something I've personally seen quite a bit: once you've gone to three or four vendors there, you start to say, okay, well, what else do we need? And as that list gets to 15, 20 vendors, it becomes quite hard to manage, and you need to focus on what the core is. Or do you think there's going to be a bit of a bundling, in terms of some of the bigger players bringing some of those tool suites together?

Tristan Handy

I don't know if your listeners tend to listen to the same set of podcasts that I do, but I find myself listening to many different podcasts right now that talk a lot about soft landings. When you're talking about inflation rates and interest rates and all of this stuff. And so the question there is are we able to pull inflation back to a target zone without causing a large recession in the global economy? And so I think that what I really want to see happen in the modern data stack is not for the pendulum to swing hard back in the other direction and see us back at this place where you're either in vendor A ecosystem or in vendor B ecosystem, but instead to have this soft landing in the middle where I don't know what the right answer is. Maybe it's four, maybe it's six sets of modular components that you wire together.
But I think that's on the infrastructure side. So I think that one of the really important things, there's a difference between data infrastructure and data applications. And I really think that the modern data stack is mostly an infrastructure phenomenon. And that infrastructure can feed data to a very wide variety of data applications. And I don't actually think that there is a "correct" number of data applications. I think that there could be as many as you want or need. Now, I think there should be a fairly standard set of pipes that supply all of that data.
But you could imagine, just for internal analytics, having a very notebook-based product, a spreadsheety product, and a kind of classic drag-and-droppy BI product. And all three would coexist nicely depending on the user personas and what you're trying to create. And then there's a whole set of questions around how you deliver analytical experiences to your customers. There's also the question of whether there are more integrated offerings, not generic horizontal analytics, but more vertical... Maybe there's CRM analytics that you buy that is plugged into the modern data stack but is purpose-built for CRM. Anyway, I think there's the potential for many of those different kinds of applications, but only a reasonably constrained number of data infrastructure products.

Tim Harrison

So really thinking of it as the core components of the modern data stack and those infrastructure components versus the whole suite of applications that have end use cases or particular end users that are driving value from that underlying modern data stack.

Tristan Handy

100%.

Tim Harrison

And maybe touching on that piece there around how the modern data stack is being used: I think a really interesting piece you brought up at the forum last week was how at the moment the modern data stack is primarily being used for cross-business analytics. It's driving that primarily batch, but not always batch, reporting and commercial and operational optimization. And one piece that you've been looking at is how that is shifting now towards more operational use cases, automating workflows and different functions in the business. Where did you see that coming from, and where did that originate from a dbt perspective?

Tristan Handy

Those of us who have built technologies in the modern data stack, I don't think we originally got into this to automate revenue recognition, for example. Most products are created by people who are trying to solve their own problems, and most of us created these products because we were trying to solve data analysis use cases. But over time, as you build up this infrastructure that services analytics use cases, you realize that these pipes contain effectively all of the important information in your entire company.

And there's already this commercial motivation on the vendor side and on the customer side to make these pipes more and more reliable and more real-time. And so over time you start asking, well gosh, we've got this increasingly sophisticated set of infrastructure, why don't we do more stuff with it? And it turns out that the stuff you can do with it is kind of limitless: all the data-driven things that a business does. So in the past we have used CDPs, customer data platforms, that kind of have their own data stores. They have to collect and process their own data.

But increasingly CDPs are seeing themselves as operating on top of the modern data stack. You've got all of your customer data in one place. Why wouldn't you use that to trigger customer communications? In the past, I mentioned rev rec before, you would do that purely inside of a system purpose-built for that, whether that's a system like a Zuora or whether it's more like a NetSuite. But when you already have all the data together inside of the modern data stack, it becomes possible to actually do it better, more effectively in that context.

Tim Harrison

And when you say better and more effectively, is that because at the moment those tasks are essentially done by teams of people manually working through them: extracting data from one system, uploading it to another, flicking through Google Sheets and Excel trackers? And it's those operational pieces that you see can now start to be automated better with the modern data stack?

Tristan Handy

If you stay purely in the realm of analytics, this holds true as well. It used to be that, because the demands of web analytics were so significant (page view data is very big data), you used purpose-built web analytics tools just to do web analytics. And Google Analytics is kind of the 800-pound gorilla there, but there are other products too.
But then what you find yourself doing as you get more and more sophisticated in your use of these tools, is you realize, oh gosh, I want to be able to break down the people that visited this page based on some attribute that my CRM knows about, but my web analytics tool doesn't know about. And so, oh my gosh, I'm going to go build a pipeline to get this data from my CRM over into Google Analytics. And the minute that you build that first pipeline, you end up realizing, holy crap, I need 20 more of those things.

And so very quickly, just talking about web analytics, you find yourself integrating data from all over the place to get at it. And these pipelines become very fragile and often not well maintained; there's no clear function of the business that's responsible for making sure they continue to be high quality. And so we've moved web analytics over into the modern data stack, and you use the same set of tools to do web analytics that you do for anything else. I think the same thing has started to, and will increasingly, play out inside operational use cases like rev rec or customer communications too. You just find that, for any one given system, if it doesn't have all of the data related to that workflow, you either need to push all the data to it, at which point it becomes a copy of your data warehouse, or you have to treat your data warehouse as the source of truth.
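The join Tristan describes, breaking down page views by an attribute that only the CRM knows, becomes a single SQL statement once both data sets land in one warehouse. A minimal sketch of the idea, with hypothetical table and column names and SQLite standing in for the cloud data platform:

```python
import sqlite3

# SQLite stands in for the cloud warehouse; the table and column
# names here are made up for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE page_views (user_id TEXT, page TEXT);
    CREATE TABLE crm_accounts (user_id TEXT, industry TEXT);

    INSERT INTO page_views VALUES
        ('u1', '/pricing'), ('u1', '/docs'), ('u2', '/pricing');
    INSERT INTO crm_accounts VALUES
        ('u1', 'fintech'), ('u2', 'healthcare');
""")

# One join replaces a bespoke CRM-to-web-analytics-tool pipeline:
# segment page views by a CRM attribute the web tool never sees.
rows = con.execute("""
    SELECT c.industry, COUNT(*) AS views
    FROM page_views p
    JOIN crm_accounts c ON c.user_id = p.user_id
    GROUP BY c.industry
    ORDER BY views DESC
""").fetchall()

for industry, views in rows:
    print(industry, views)
```

Each new question is another query against the same tables, rather than the twentieth fragile point-to-point pipeline.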

Tim Harrison

So it really is that piece of having one single source of truth, where all the data exists in the same system and you need fewer connections between systems. And I think it's really interesting you bring up web analytics, because on the product telemetry side we actually had Yali Sassoon from Snowplow on our podcast.

Tristan Handy

We use Snowplow internally. I'm a big fan.

Tim Harrison

Excellent. Yeah. And one thing we've really liked, as we've seen it used at portfolio companies, is that integration with dbt. And it's exactly as you said there: we're now collecting this granular customer behavior product telemetry data, but hey, we can also join that up to firmographic signals in the CRM, or to different data sets, in that single customer 360 view. And that's where the power of bringing those data sets together really comes from.

Tristan Handy

I think there's an efficiency play here, where when you construct these pipelines well, you eliminate a lot of manual work. And that's great; efficiency is always a priority. But you used the word power, and I think that's kind of the unrealized aspect of this transition as well. When you talk to people who have spent their careers in biz ops or finance or marketing ops, they have this filter that they go through where you say, "Hey, I'd really like to send an email to all of our customers that have the following..."

And the first thing that this thought passes through in these professionals' brains, and rightly so, is: can my tool support that? I want to recognize revenue in the following way, can my tool support that? You used to ask this question all the time with Google Analytics. And it's one of the reasons I was so excited to start an analytics consultancy in the modern data stack, because I knew that when you moved all of these things over into the modern data stack, a customer could ask you a question and you would always be able to answer, "Sure, of course I can do that." You don't have to know the limitations of GA; Snowplow will let you do anything, because you can just write the code to do it, et cetera.

Tim Harrison

That's really powerful. And I guess that's the piece about enabling and empowering those teams to have, I guess, more autonomy in the questions they're asking. So we've seen that transition from the modern data stack being used for analytics towards more operational uses, from web analytics to commercial and operational optimization. One other area that I know was a topic at the forum was how the role of the modern data stack is forming for customer-facing data products themselves: customer-facing BI solutions that might sit on top of the sort of unique and proprietary data that software companies are developing. Where do you see that playing out? Have you started to see within dbt that your customers are using it for customer-facing data products already, or is that a new area?

Tristan Handy

It's an area, honestly, where the tech is not quite as mature, but real customers are putting real applications into production today. So the existing BI tools that you would use to accomplish internal use cases often also have an embedded version that you can use to serve customer use cases. But again, these tools were primarily created to service internal use cases originally, and so the embedded use cases are probably not quite as good a fit. Now, I'm not saying that you can't create good things there, but they're not quite as good a fit. So I think there's interesting work here, and I'm not going to be able to remember company names off the top of my head, but I've seen some new startups that are purpose-built for this.

But then there's also the data processing requirements. And you can, again, build customer facing applications on top of the same data platforms that I've mentioned earlier. But by and large, these data platforms, the reason that they're so magical is that they're optimized for large scale processing but lower concurrency. And so when you turn around and want to allow your customers to query their data, all of a sudden the data set shrinks. And so the large scale processing isn't quite as relevant, and the concurrency spikes because now you've got all your customers hitting it.

And so the design considerations for these databases are not quite the same as you might want for that use case. There are solutions, though. Some companies actually use cloud data platforms to do the large-scale processing in the pipeline, and then they dump it out to, honestly, a Postgres or something. There are also caching solutions, early stage, but being built to act as this layer for these types of applications; they plug right into a cloud data platform and act as a concurrency or caching layer.
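The pattern Tristan outlines, heavy low-concurrency aggregation in the analytical engine followed by a small, fast serving layer for customer queries, can be sketched in miniature. The event data and customer IDs below are invented, and a plain dictionary stands in for the Postgres or cache tier:

```python
# Hypothetical raw usage events: (customer_id, amount). In reality
# this would be billions of rows sitting in the cloud warehouse.
raw_events = [
    ("cust_a", 120), ("cust_a", 80), ("cust_b", 300),
    ("cust_b", 50), ("cust_a", 10),
]

# "Warehouse" step: one large batch aggregation, run infrequently.
# This is the work the analytical database is optimized for.
aggregated: dict[str, int] = {}
for customer_id, amount in raw_events:
    aggregated[customer_id] = aggregated.get(customer_id, 0) + amount

# "Serving layer" step: customer-facing requests do cheap keyed
# lookups against the small pre-computed result set, never the
# warehouse, so high concurrency stays inexpensive.
def customer_total(customer_id: str) -> int:
    return aggregated.get(customer_id, 0)

print(customer_total("cust_a"))  # 210
```

The design choice is the same one Tristan names: the warehouse does what it is good at (big scans, low concurrency), and the per-customer slice, which is tiny by comparison, moves to a store built for many concurrent point reads.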

Tim Harrison

That's really interesting, because I guess we're seeing this as a very important area, those customer-facing data products. One piece that I've certainly noticed is there's still maturity that needs to come on what you call the right-hand side of the diagram, with those actual BI and analytics tools, in getting them ready for embedding. A tool that may be a fantastic BI tool for internal use cases and gets the job done can actually be quite different once you have product managers and teams and real customers with specific requirements. But there are also some of those key components in the middle of the modern data stack, on the infrastructure side of the database, that potentially also need to be tuned and tweaked as those customer-facing use cases increase.

Tristan Handy

But this is going to happen. It's just a matter of more people poking around at it. And as always, early adopters are going to get there first, and they're going to drive some startups through the product market fit process. So whether or not this is on your 2023 roadmap, I'm highly confident that everybody's going to be doing this in the next few years.

Tim Harrison

Well then, conscious of the time: given how much I listen to your fantastic Analytics Engineering Podcast, I know you ask a very similar question at the end. I'll maybe give it a slight twist based on what we've talked about: where do you see the modern data stack going in the next five to 10 years, both from a dbt perspective, but also maybe more broadly across the whole data stack?

Tristan Handy

As an industry, we're exiting from the period of early experimentation into a period of maturity, but we're probably still early in that period of maturity. So here's what I mean by that. Back in 2016, '17, '18, companies that I would work with on a consulting basis would just be so excited by the capabilities that this new technology provided. And it was just like, hell yeah, let's go there.

And now many of those same companies have three-, four-, five-year-long investments in this technology. And the technology itself is fundamentally working very well, even at that stage, and it's scaling with... But what I think we are grappling with as an industry is complexity. Once you've spent years operating in the modern data stack, you've rolled out more and more workflows, you've gotten more and more developers involved in the process, and you find yourself needing to keep track of and govern all of these things that you're now supporting in production, whether that's production for internally facing use cases or externally facing ones.

So I think that the entire ecosystem is going through this process of how do we allow our users to scale complexity over time? And honestly, I think that as we develop better and better answers for that, it is going to enable more and more mission critical uses for this technology. So I'm excited about it.

Tim Harrison

It's very exciting. And one piece I think is important there: I was talking with our data team, and talking about the modern data stack and the role of dbt is actually what we do when we sit around and have lunch together. And one thing we asked is: why do we think dbt's done so well? Why do we like using it compared to different tools? We have a whole range of experience in the team. And one thing we said is that what it's doing is bringing software engineering best practices to what is, frankly, a less mature area of data. I worked as a software engineer for a short time, and if you go and tell software engineers, "Oh yeah, we needed to put it all in code and use version control and have traceable lineage and build jobs," they'd go, "You weren't doing that already?"

Tristan Handy

Right, right. And so dbt has been a part of the story of this increasing ability to handle complexity because of the software engineering best practices that we brought. But there's stuff that isn't yet possible, or you can kind of get at it, but it's not a good experience yet inside of dbt that we want to push the boundaries on too.

So if you look at, and I'll keep this short, because if you get me going on this I'll talk about it for an hour, but if you look at the history of software engineering, originally all work was done by heroic individuals. Then that was no longer enough, and we threw a bunch of software engineers at a problem and just said, "Make it work." And what ended up coming out of that, and this is in the late '90s, early 2000s, were these monolithic software code bases that were very hard to maintain because of their complexity. It became very hard to make changes to them and know what was going on in there.

And so the entire industry moved towards a service-oriented approach: the classic two-pizza team owns its own code and exposes interfaces to other teams to make use of its functionality. And this is mostly a foreign concept in data still. At the end of the day, building software is about people; it's about enabling people to do great work. And we think that's going to be the next evolution of dbt: enabling this more service-oriented approach to designing data workflows.

Tim Harrison

That really resonates, and also equally very exciting to see, not only where dbt goes, but the modern data stack. And I think also very nice to see that when answering that question about where we're going in the next five to 10 years, the answer wasn't all about generative AI and ChatGPT.

Tristan Handy

You can get a lot smarter people to talk about that than me.

Tim Harrison

We really appreciate having you on the podcast, Tristan. It's been an absolute pleasure. Thank you very much.

Tristan Handy

Thanks so much. It's been a lot of fun.
