Data lineage is a hot topic, mainly because it has the potential to answer many business questions and solve data engineering pains.
But it’s also a widely misunderstood term. Let’s clear up some of the most common misconceptions I’ve seen about this technology, across community discussions, articles, and social media posts.
Data lineage is just a diagram
When you encounter a lineage diagram for the first time, with all those nodes and edges, it can be difficult to see how it can actually help you in your day job. But believe me: with this kind of tool in their utility belt, data professionals feel they can move mountains (or at least see where the mountains originated).
Let’s look at an example. See that node on the right where it says Stock Analysis V2? That represents a Looker dashboard. Imagine waking up one day to a message from your CEO asking why a metric on that dashboard isn’t working. They need it for a meeting later that day (don’t they always?). So you turn on your computer and start investigating:
- First, you log into Looker and check if there have been any changes to that dashboard. There haven’t.
- So you go to Slack and say something like “Hey team, has anyone made any changes that could impact the Stock Analysis dashboard?” And then you wait. But people are busy with their own stuff, so after waiting 2 hours and getting no response, you decide to act.
- Acting in this case means manually tracing where the data for that dashboard is coming from. This could take minutes, or it could take hours. All that to find out that this is happening because someone changed the name of a column somewhere upstream.
This example scenario is very simple, and it probably wouldn’t take that long to find out what went wrong. In more complex environments, this kind of task might be a nightmare for engineers who may find themselves wondering why they studied so much deep tech if all they do is fight fires and fix broken stuff all day. I think you’ll agree that this is a waste of everyone’s time.
So a diagram is not just a diagram: It’s a very useful tool to track the root cause of a lot of engineering problems. But it’s also more than that.
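To make the root-cause hunt from the example a bit more concrete, here is a minimal sketch of what a lineage tool does for you under the hood: it walks the graph upstream from the broken dashboard and checks which of those assets changed recently. The asset names, the `UPSTREAM` mapping, and the `RECENTLY_CHANGED` set are all made up for illustration; a real tool would build them from query logs and metadata.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets it reads from.
UPSTREAM = {
    "looker.stock_analysis_v2": ["warehouse.stock_metrics"],
    "warehouse.stock_metrics": ["warehouse.stock_daily", "warehouse.products"],
    "warehouse.stock_daily": ["raw.inventory_events"],
    "warehouse.products": ["raw.products"],
}

# Hypothetical change log: assets that were modified recently.
RECENTLY_CHANGED = {"raw.inventory_events"}


def upstream_assets(graph, start):
    """Return every asset upstream of `start`, nearest first (breadth-first)."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def root_cause_candidates(graph, start, changed):
    """Upstream assets of `start` that also appear in the change log."""
    return [asset for asset in upstream_assets(graph, start) if asset in changed]


print(root_cause_candidates(UPSTREAM, "looker.stock_analysis_v2", RECENTLY_CHANGED))
```

Instead of waiting two hours on Slack, you get a short list of suspects to check first.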
Here’s what you can achieve with lineage:
- Impact analysis. It can be hard to predict what’s going to happen when you make infrastructure changes. And, as we saw in the example above, the most innocent SQL query can impact parts of your system that sometimes you had no idea existed. If you have lineage, you know where data is coming from and where it’s going to. And when you know that, you can perform regression testing before actually making changes to your data, avoiding breaking changes and major issues.
- Data assets cleanup. Without constant attention, unused tables and dashboards can pile up fast: single-use tables for ad-hoc analysis, dashboards for one-off campaigns and so on. That can have an impact on cost and be a source of confusion for engineers, especially newcomers. With lineage, you can understand what is not being used and do periodic clean-ups.
- PII tracing and compliance. Much of the work needed to comply with privacy regulations, such as GDPR, CCPA and LGPD, falls on engineers’ shoulders. And just like all the other data, sensitive data flows from its source into many different tables, dashboards, notebooks, spreadsheets, machine learning models and more. Manually keeping track of where PII ends up with 100% accuracy is a very hard task, but lineage can help with that: it can track sensitive data wherever it flows.
- Data trust. How is this field calculated? Should I use the VALUE or the PRICE field? Where does this data come from? These are common questions that data consumers ask engineers, usually when working with new datasets. Lineage can help trust become more self-serve and reduce the burden on data engineers.
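Several of the use cases above boil down to the same operation: walking the lineage graph downstream. The sketch below shows that idea on a tiny, made-up graph. The `EDGES` list, the `dashboard.` prefix used to mark consumer-facing assets, and the `PII_SOURCES` set are assumptions for illustration only. Note that this is table-level, so the PII result is an upper bound: it flags every asset that could contain sensitive data, not every asset that does.

```python
# Hypothetical lineage edges: (source asset, downstream asset).
EDGES = [
    ("raw.customers", "staging.customers"),
    ("staging.customers", "marts.customer_ltv"),
    ("marts.customer_ltv", "dashboard.retention"),
    ("raw.orders", "staging.orders"),
    ("staging.orders", "marts.customer_ltv"),
    ("raw.orders", "tmp.adhoc_orders_2021"),
]
CONSUMER_PREFIXES = ("dashboard.",)  # assumed naming convention for end-user assets
PII_SOURCES = {"raw.customers"}      # assumed to be tagged as containing PII


def downstream(edges, start):
    """Impact analysis: every asset that directly or indirectly reads from `start`."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, set()).add(dst)
    seen, stack = set(), [start]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def unused_assets(edges, consumer_prefixes):
    """Cleanup candidates: non-consumer assets that feed no consumer-facing asset."""
    assets = {asset for edge in edges for asset in edge}
    return {
        asset
        for asset in assets
        if not asset.startswith(consumer_prefixes)
        and not any(d.startswith(consumer_prefixes) for d in downstream(edges, asset))
    }


def pii_exposure(edges, pii_sources):
    """PII tracing: the PII sources plus everything downstream of them."""
    exposed = set(pii_sources)
    for source in pii_sources:
        exposed |= downstream(edges, source)
    return exposed


print(sorted(downstream(EDGES, "staging.orders")))
print(sorted(unused_assets(EDGES, CONSUMER_PREFIXES)))
print(sorted(pii_exposure(EDGES, PII_SOURCES)))
```

In practice you would combine the cleanup check with usage data (for example, when an asset was last queried) before deleting anything.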
dbt/Airflow lineage is enough for complex use cases
Tools like dbt and Airflow by design come with some basic lineage, which is cool and very useful in many different scenarios.
But they’re limited in terms of granularity (typically table-level rather than column-level), and the lineage stops at the boundaries of that specific tool: if you’re looking at your dbt models’ lineage, you have no visibility into what happens to the data once it leaves dbt’s domain.
If the question is, "Will I break any dashboards if I rename this field?", we could say that this kind of lineage ain’t gonna cut it and cross-system lineage is what you’re looking for.
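Here is a rough sketch of why that question needs column-level, cross-system lineage. Each node is a hypothetical `(system, asset, column)` tuple, and the edges are invented for illustration; they are not output from dbt, Airflow, or Looker. The `within_tool` helper simulates what you see when the lineage stops at one tool’s boundary.

```python
from collections import defaultdict

# Hypothetical column-level edges across tools: (system, asset, column) pairs.
COLUMN_EDGES = [
    (("postgres", "orders", "amount"), ("dbt", "stg_orders", "amount")),
    (("dbt", "stg_orders", "amount"), ("dbt", "fct_revenue", "revenue")),
    (("dbt", "fct_revenue", "revenue"), ("looker", "Revenue Overview", "Total Revenue")),
    (("dbt", "stg_orders", "status"), ("dbt", "fct_orders", "status")),
]


def dashboards_affected_by(edges, column, bi_systems=("looker",)):
    """Return the BI fields reachable downstream of `column`, sorted."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
    seen, stack = set(), [column]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return sorted(node for node in seen if node[0] in bi_systems)


def within_tool(edges, system):
    """Keep only edges that start and end inside one system (tool-scoped lineage)."""
    return [(src, dst) for src, dst in edges if src[0] == system and dst[0] == system]


field = ("dbt", "stg_orders", "amount")
print(dashboards_affected_by(COLUMN_EDGES, field))                      # cross-system view
print(dashboards_affected_by(within_tool(COLUMN_EDGES, "dbt"), field))  # dbt-only view
```

With the full graph, renaming `amount` surfaces the affected dashboard field. With the dbt-only edges, the same query comes back empty, which is exactly the false sense of safety you want to avoid.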
Implementing data lineage isn’t worth the time investment
Knock knock. Who’s there? Technical debt. Technical debt who?
Technical debt is the implied cost of future reworking required when choosing an easy but limited solution instead of a better approach that could take more time.
When you’re in the trenches, with a never-ending to-do list, anything outside of that workflow can feel like a distraction. Why take on another project when we have more than we can handle right now?
But this can be a false dichotomy. With most technical debt, you’ll only feel the consequences of past decisions after months or even years. With data, though, the firefighting and manual work are happening right now, and much of it could be prevented if you had data lineage in place.
So, like any tooling choice, it should be a cost-benefit analysis. If it can permanently make your team more efficient and effective, then the sooner the better! Also, the modern wave of tooling in the space focuses much more on ease of setup, so rather than months of implementation, you can have automated column-level data lineage up and running in a few weeks with minimal effort.
Data lineage is only useful when your environment is a mess
I wanted to get an idea of when companies usually start implementing some sort of lineage, so I went digging. One of the answers I got, from the great Juan Sequeda (who also hosts an awesome podcast) was:
If you focus on having well-defined data models and data transformations, lineage would be a straight boring line. Invest in your foundation. Lineage in my experience is a bandaid/crutch to help you get out of a mess. Avoid bandaids/crutches by being careful and staying healthy.
Juan is not alone in this belief, but I disagree. He’s not wrong about the foundation: you should absolutely invest in it and have well-defined data models and transformations (which unfortunately is not the case at most companies, for many different reasons, but that’s a conversation for another time). But lineage is more than a band-aid solution.
Some benefits it can bring in a controlled, properly-modeled, non-messy environment:
- Onboarding improvements. It takes time for new hires to properly understand a codebase and it’s not uncommon to make mistakes from lack of knowledge. If you’re using lineage to document pipelines, knowledge transfer doesn’t depend only on people.
- Discoverability and collaboration. With lineage (and the associated metadata), your team can have enough understanding of your data ecosystem without having to constantly ping each other. Who doesn’t want to live in a world with less pinging (engineers will love this one crazy trick!)?
- Pipeline optimization. Beyond cleaning up unused assets, data lineage can provide visibility into what is actually being used a lot, allowing you to understand how users are consuming data and giving you the opportunity to optimize pipelines based on usage and specific use cases.
I need dedicated engineering resources to implement data lineage
A while ago, I spoke with two folks who had built in-house data lineage solutions at their companies from scratch. One of the reasons they did that was because data lineage was expensive back then: you would pay for a tool and have a dedicated team of engineers working on it to have proper lineage in a matter of months (or even years, eek).
For them, building from scratch made sense at the time. But having spent the past three years building out data lineage, and having seen the technical complexity, I’d probably never recommend building it in-house today! That said, for certain specific use cases, it’s doable: if you have well-defined processes and strong SQL practitioners on your team, there will be fewer edge cases to solve.
But in the vast majority of cases, it’s way easier and cheaper to invest in an automated data lineage tool that can bring you all the benefits quickly and painlessly.
I’ll leave a few more articles and videos here in case you want to dig deeper into the subject:
- If data lineage is the answer, what is the question?
- The Future Of Data Lineage — Beyond A Diagram
- Modern Data Lineage 101
- How To Leverage Column Level Lineage On Airflow
- How Grupo Boticario Keep Their Pipelines Running With Automated Data Lineage
- Understanding the ROI of your data spend
- The ROI of Data Lineage