As software engineers, we talk a lot about technical debt. As a project or product progresses, it is almost inevitable that the trade-offs and compromises made along the way eventually become problematic.

We all know how to deal with technical debt, so surely we should be able to manage it just fine when it comes to data? Maybe, maybe not. I think we need to take a step back and see where we’re at, and where we’re going.

In this article, I will argue that data engineering is still very immature compared to software engineering, and that simply adopting tools and frameworks is, at best, a shallow way of establishing best practices.

I will use ideas and concepts from software engineering to illustrate what I think is needed for data engineering to evolve. Is there a path out of this mess? How do we start paying down data debt?

The most prominent change in the data space over the last 5 years has been the adoption of software engineering practices, such as:

  • Data as code (and code as data if you will);
  • Version control;
  • Automated testing;
  • Continuous Integration and Deployment;
  • Opinionated frameworks for managing one or more parts of the data lifecycle.

In many ways, the rise of the modern data stack has exacerbated existing problems by putting very powerful (and expensive) tools in the hands of people who lack the full knowledge of how to harness them. As Lauren Balik states, the problem is the explosion of tables and dashboards built to feed an apparently ever-growing need for data at all levels of the organization:

What’s actually going on here is that analytics engineering headcount (mostly using dbt as ~75% of their jobs) is increasing. There is an explosion of tables and objects in the warehouse, and of dashboards, eventually more people get hired to make and manage more tables, tables, and more tables and dashboards, and these people add costs to Snowflake, making more and more tables and objects to feed increased dashboards and analyses which decay in incremental value to the business.

The tangible financial cost grows as the number of tables and dashboards increases. Of course, not all of this is technical debt, but from what I've observed, most data teams using dbt feel like they're losing control over their environments. They have no clue what these 10 similar-looking models are meant to do, or which of the 5 "Customer Retention" dashboards represents the source of truth.

Clearly, working with data as code isn't the silver bullet we were hoping for. Chad makes a similarly salient point here: no tool alone can help you manage your technical debt, and there are no quick fixes:

[Embedded LinkedIn post. Source: LinkedIn.]

What are the problems causing this mess?

  • 🛠 We have power tools: We have data warehouses that swallow inefficient queries without a hiccup — and tools to generate new tables in seconds. We’re using power tools to efficiently produce piles of garbage. You can make an even more expensive and more confusing pile of garbage even faster — automated and with CI/CD and tests. But at the end of the day, it’s still a pile of garbage.
  • 🌋🤑 A mountain of data: Remember those piles of garbage from the previous point? That's data no one needs or uses, data that drives no value. It drives compute cost, drives storage cost, and creates a headache of potential compliance and security issues. Of course, we follow all the "best practices" with tests on every column for distributions, freshness, and quality, further racking up unneeded queries 🤯
[Embedded LinkedIn post. Source: LinkedIn.]
  • 👎 Worst of both worlds: a "quick request for data" can no longer be fulfilled and then forgotten. Data engineers like to make fun of the business user asking for data "real quick". These requests used to be handled ad hoc and find their way into a spreadsheet; now it's so fast to create and execute a new dbt model (maybe even a dashboard?) that they end up in the pipeline instead. Often, this new model is essentially a duplicate of another model, maybe with a filter or an aggregation added. So now we have the power to expedite these "one-off" requests into version control, where they will stay forever, since no one knows whether they are used or not. The "best practices" of version control and data as code are now working against us: the code is piling up and there is no good way of managing its lifecycle. So much for "self-serve" data. We did not get rid of the ad-hoc requests, but we can expedite them into eternity 😩
  • 🧐 Lack of best practices: We have selectively taken the hippest and shiniest stuff from software engineering, like CI/CD, automation, and whatever else the cool kids are doing. We have tools that allow us to produce and consume data at an unprecedented rate, but we very rarely talk about the lifecycle of the data itself. When ownership is not established, it's hard to get anyone to maintain what is being created. The "hard stuff" like maintenance, refactoring, and architectural reviews is not done. There is no culture of "less is more". This is in stark contrast to high-performing software teams; in fact, many senior engineers count their best days as the ones where they deleted and simplified code, leaving the world slightly better than they found it that morning. This seems a far cry from how most data teams operate these days. Further, we have re-invented a lot of existing, long-standing best practices within data engineering that were there for a reason, or worse, simply chosen to ignore them. Probably the biggest elephant in the room is data modeling: fact tables, dimension tables, and slowly changing dimensions are becoming relics of the past in the rush for more and more data:
[Embedded LinkedIn posts. Source: LinkedIn.]

Is there a path out of this mess?

As many data practitioners are realizing, data (like everything else) is about culture and people, not just about the tools. I still think that tools play a vital role, but the path forward is a mix of technical and organizational considerations that must be taken seriously. The list below summarizes my views on where we need to go and what we should do:

  • 🌅 Realize that these are early days and that the frameworks we currently have are all about producing data. We are not yet talking cohesively about ownership and about managing the lifecycle of data, which also includes deprecation and removal. Tools need to take this into account. Some people think this omission is intentional, even ill-intended, as a strategy to inflate cost across the whole stack (for instance Fivetran-dbt-Snowflake-Looker). I think it's probably not all ill intent; it's just that more data does not mean better data or better decisions, and people have not yet realized they may not need all this data everywhere, all of the time.
  • 💼 Align the data team with business goals and outcomes, i.e. get closer to the business and work to provide tangible value. Essentially, the data team needs to actively work to be seen as a valuable partner that helps other departments hit their goals: close more business, reduce churn, grow existing accounts, provide better support. The list goes on. At the end of the day, every item on the balance sheet needs to justify its existence and ROI, and as data teams have been responsible for a large spend, it's only natural that they are put under scrutiny. As stressful as it can be to get bombarded with urgent requests from across the business, there are usually clear reasons for the ask. The best data teams are experts at tracking these questions, just like the best product teams continuously adapt their products to the explicit and implicit needs of their users. As Chad says, start simple, but clear goals and outcomes are key for data (as, I might add, for any initiative).
  • 🧑‍💻 Improved static analysis: Surely we can do better than Jinja templates and {{ ref("order_items") }}. SQL itself can be parsed to automatically infer the DAG of dependencies and the lineage between tables (like we do at Alvin); there is a minimal sketch of the idea after this list. In this regard, I'm also happy to see initiatives like SQLMesh. It is really exciting: it gets away from those annoying refs and lets us write our data pipelines and dependencies in plain ol' SQL.
  • Linting and syntax rules: Enforce linting and standards in SQL (no more SELECT *) to get uniform formatting and style. SQLFluff is a great tool for catching annoyances like unqualified columns and a lot of other things that make our SQL less readable and maintainable (see the linting sketch after this list).
  • Manage data like a product by utilizing metadata: The best product teams talk to their users to understand their needs, but they also lean heavily on product analytics to prioritize and de-prioritize features. This is a very natural part of the lifecycle of a feature and the code powering it. By doing analytics on their analytics, data teams can likewise get a better understanding of their business impact. Knowing which dashboards are actually used, and which tables power them, gives insight into where to prioritize performance improvements; conversely, unused tables and dashboards should be swiftly deleted (a rough sketch of spotting them from query history follows this list). Of course, it's important to have a proper process around deprecation and deletion, which ties into the point about aligning with business needs. But the secret weapon here is proactivity. Instead of dropping tables and deleting dashboards and then waiting for the onslaught, data teams can notify consumers about upcoming deprecations and changes. Even better, based on usage patterns, they can point users to newer or improved versions of the tables or dashboards they currently depend on! Talk about being a valuable business partner 🤝
  • Managing cost at the right level: This point, like the previous one, also revolves around metadata. There are different levels of granularity, and depending on your goals for the data practice, you need to choose the right one at which to approach cost. For instance, dbt can tell you how long your models run, and that can help you make them faster. But is that the right level at which to ask questions? Of course it's nice to shave time (and cost) off our dbt models, but how much impact will that really have? There are already tons of approaches, and even entire companies, focused solely on optimizing queries by adding filters, partitions, indexes, and whatnot. But I think this is a potential rabbit hole, and I'm not sure it tackles the fundamental problem. I have a strong conviction that, for data engineering workloads, many of the problems are far more easily solved: by holistically understanding the cost, usage, and lineage of data assets, it's a lot easier to ask (and answer) the right questions. Questions like "Do we need this expensive query to run every hour?". If the answer is no and we switch to daily, we go from 24 runs a day to 1, which is a far greater (and easier) saving than any query optimization. Once we are done with these questions, we can allow ourselves to dive into optimizing the queries and tables themselves. It's a lot easier and faster to remove something than to understand and optimize it. (A back-of-the-envelope sketch of this scheduling math also follows this list.)
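
To make the static-analysis point a bit more concrete, here is a minimal sketch in Python using the open-source sqlglot parser. Given a model written in plain SQL (the query and table names below are invented for illustration), we can recover the tables it depends on straight from the syntax tree, with no {{ ref() }} annotations; it's a toy version of what lineage tools do at much greater depth.

```python
import sqlglot
from sqlglot import exp

# A hypothetical model written in plain SQL (no Jinja, no {{ ref(...) }}).
sql = """
SELECT o.customer_id,
       COUNT(*)       AS orders,
       SUM(oi.amount) AS revenue
FROM raw.orders AS o
JOIN raw.order_items AS oi
  ON oi.order_id = o.order_id
GROUP BY o.customer_id
"""

# Parse the statement into an AST and collect every table it reads from.
tree = sqlglot.parse_one(sql)
sources = sorted(
    {
        ".".join(part for part in (table.db, table.name) if part)
        for table in tree.find_all(exp.Table)
    }
)

print(sources)  # ['raw.order_items', 'raw.orders']
```

Run the same walk over every model in a repository and you get the full dependency DAG without any templating.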
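
For the linting point, SQLFluff is usually wired into CI or pre-commit, but it also exposes a simple Python API. The sketch below assumes that API (lint/fix) and an invented query, just to show the shape of the feedback you get.

```python
import sqlfluff

# An invented query that breaks the "no SELECT *" rule and mixes keyword casing.
bad_sql = "select * from raw.orders o join raw.order_items oi on oi.order_id = o.order_id"

# lint() returns one violation per broken rule, each with a code and description.
for violation in sqlfluff.lint(bad_sql, dialect="ansi"):
    print(violation["code"], violation["description"])

# fix() applies the auto-fixable rules (casing, whitespace, and so on);
# something like SELECT * still needs a human to pick the actual columns.
print(sqlfluff.fix(bad_sql, dialect="ansi"))
```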
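
On the "analytics on your analytics" idea, the mechanics can be as simple as joining the warehouse's query history against the list of tables you own and flagging anything nobody has read for a while. The Python sketch below uses invented in-memory inputs; in practice `all_tables` and `last_read_at` would be pulled from the warehouse's metadata views (for example Snowflake's ACCESS_HISTORY or BigQuery's INFORMATION_SCHEMA.JOBS).

```python
from datetime import datetime, timedelta

# Invented inputs; in reality these come from the warehouse's metadata views.
all_tables = [
    "analytics.customer_retention",
    "analytics.orders_daily",
    "analytics.tmp_backfill_2021",
]
last_read_at = {
    "analytics.customer_retention": datetime(2023, 5, 30),
    "analytics.orders_daily": datetime(2023, 6, 1),
    # analytics.tmp_backfill_2021 never appears in the query history
}

STALE_AFTER = timedelta(days=90)
today = datetime(2023, 6, 2)

# Deprecation candidates: never read, or not read within the window.
candidates = [
    table
    for table in all_tables
    if (last_read := last_read_at.get(table)) is None
    or today - last_read > STALE_AFTER
]

print(candidates)  # ['analytics.tmp_backfill_2021']
```

The proactive part is what happens next: the list becomes the input to deprecation notices sent to owners and consumers, not a silent DROP TABLE.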
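
Finally, the scheduling math behind "does this really need to run every hour?", as a back-of-the-envelope sketch. The credit price and per-run cost are invented; the only point is how much run frequency dominates the bill.

```python
# Invented numbers: an hourly model costing ~0.5 warehouse credits per run.
credits_per_run = 0.5
usd_per_credit = 3.0
runs_per_day_hourly = 24
runs_per_day_daily = 1

def monthly_cost(runs_per_day: int) -> float:
    """Rough monthly compute cost for a scheduled model."""
    return runs_per_day * 30 * credits_per_run * usd_per_credit

hourly = monthly_cost(runs_per_day_hourly)  # 1080.0 USD / month
daily = monthly_cost(runs_per_day_daily)    #   45.0 USD / month

print(f"hourly: ${hourly:,.0f}  daily: ${daily:,.0f}  saved: {1 - daily / hourly:.0%}")
# hourly: $1,080  daily: $45  saved: 96%
```

A single scheduling decision saves roughly 96% of that pipeline's compute before anyone touches a line of SQL.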

Is there a path forward from where we’re at now? I’m hopeful!

What are your thoughts? Let me know in the comments or connect with me on LinkedIn and let's discuss!