The Modern Data Stack

What does modern mean to you? The data ecosystem has grown exponentially in recent years, and the Modern Data Stack has emerged: a collection of tools that help you work with and manipulate your data in many ways. Ultimately, those tools try to make the lives of data engineers and data consumers easier, most of them by solving one specific problem.

  • Where do you store data for analytical use cases? Something like BigQuery or Snowflake?
  • How do you model and transform your data? dbt?
  • What tool do you use in your BI layer? Looker? Tableau?
  • How do you map and orchestrate the tasks that glue together all those tools? Airflow?
Source

Those tools have solved a set of important problems, but they have also introduced the need for better data governance and reignited the discussion on how to actually trust the data flowing between them. That’s not to say those needs didn’t exist before, but when you are dealing with a mix of best-of-breed solutions, SaaS providers, open-source tools, and in-house builds… how do you deal with something like data lineage?

With the modern data stack, the number of data use cases grew. Reporting that started out simple became more complex. The number of dashboards went up as stakeholders chased more data insights. And the most challenging part? Data changes constantly.

In an average business, we have so much complexity around data that when we need to do something as simple as troubleshooting a broken dashboard, we quickly find ourselves drowning in a never-ending analysis.

In this blog post, we are going to talk about how we at Alvin added the missing contextual data to Apache Airflow, so we can help our customers find their way around the modern data stack.

A Data Engineer’s Tale

Over the past ten years, I’ve spent quite some time working across both software and data roles. Having worked at a consulting company before joining Alvin as a founding engineer, I’ve seen that the “modern software stack” is years ahead when it comes to investigating and troubleshooting issues. The modern data stack is not there yet…

Here at Alvin, we have been laying the groundwork to help customers navigate their data stack for quite some time, and you really get to see how complex some data environments can get! It is a complicated maze of tables, models, pipelines, and dashboards with interconnected dependencies, which data engineers and data consumers need to find their way through.

Talking to data engineers who use our solution, we constantly hear how painful it is to manage a pipeline and troubleshoot a data quality issue when something is wrong with a downstream dashboard.

Data pipelines crash, and it’s not always clear why. It could be an upstream change in the data warehouse, a misconfigured Airflow setting, or simply bad data…

When you are dealing with a complex data environment built from many of those modern data stack tools, plain and simple… you need more context! And that context just isn’t there yet…

For example, here is how bad DAG dependencies can look in Airflow.

Source

So I ask you, have you ever thought about how much time the data engineers in your organization spend troubleshooting those scenarios… only to find out that a DAG didn’t run?

The Rabbit Hole

Photo by Sincerely Media on Unsplash

For some context, the rabbit hole is an expression we use here at Alvin for when we get lost in a really complex, never-ending analysis and just can’t get out of it.

Dashboard with no data for the past 7 days

I know you have been there! Someone from the BI team or some important stakeholder messages you on Slack, asking why the data on that dashboard is missing… and you start your data troubleshooting journey… looking for dependencies between pipelines, trying to figure out the data flow from logs… jumping from one analysis to another into this never-ending rabbit hole, only to find out that a DAG didn’t run successfully in the past 7 days…

Failing Task in Airflow UI

And then, after fixing the issue and getting a successful pipeline run, you are ready to go home, only to realize that another DAG upstream is also failing… and you start the analysis all over again…

This is the curse of complexity that came with the modern data stack… Is there a better way around it?

Alvin + Airflow ❤️

At last, let’s talk about how we extended the OpenLineage Airflow integration and how we added the missing data needed for troubleshooting those issues: automated column-level lineage!

First and foremost, we have spent a lot of time over the past few years iterating on and improving our own parser.

At a high level, we just needed to match the Airflow DAGs and tasks with the SQL statements they executed, and our parser would take care of the rest.

So how do we actually extract that kind of context? We love to get the best out of the open-source world, and the OpenLineage project caught our attention as the way to help with it:

“Data lineage is the foundation for a new generation of powerful, context-aware data tools and best practices. OpenLineage enables consistent collection of lineage metadata, creating a deeper understanding of how data is produced and used” Source
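For readers who haven’t worked with OpenLineage before, a lineage event is just a small, structured payload describing a job run and the datasets it touches. As a rough illustration (the backend URL, namespace, and job name below are placeholders, not our actual setup), emitting one event with the openlineage-python client looks something like this:

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Job, Run, RunEvent, RunState

# Point the client at whatever OpenLineage-compatible backend you run (placeholder URL).
client = OpenLineageClient(url="http://localhost:5000")

# Report that a run of the job "my_dag.my_task" completed.
# The run id is what ties START/COMPLETE/FAIL events for the same run together.
event = RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=Run(runId=str(uuid4())),
    job=Job(namespace="my_airflow_instance", name="my_dag.my_task"),
    producer="https://example.com/my-lineage-producer",  # identifies the tool emitting the event
)
client.emit(event)
```

The OpenLineage Airflow integration emits events like this automatically for each task run, and that is the mechanism we built on.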

We ended up extending the OpenLineage integration library and making the following changes:

  • A PyPI package that runs the proper extraction and mapping code according to the Airflow version, be it 1.10+, 2.1–2.2, or 2.3.
  • A custom facet with additional metadata attributes, based on the OpenLineage specification:

job_id is the identifier of the SQL statement executed in the data warehouse, so with it we can match the Airflow task run with the column lineage processed by our own parser.
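To give a rough idea of what such a facet can look like (a simplified sketch, not our production code; the facet class, facet key, and job id value are made up for illustration), OpenLineage lets you attach custom attrs-based facets to a run:

```python
import attr

from openlineage.client.facet import BaseFacet
from openlineage.client.run import Run


@attr.s
class QueryJobIdRunFacet(BaseFacet):
    """Hypothetical custom run facet carrying the warehouse-side query/job identifier."""

    jobId: str = attr.ib()


# Attaching the facet to the run means the lineage event for this Airflow task run
# carries the id of the SQL statement it executed in the warehouse. That id is what
# lets us join the event with the column-level lineage our parser extracted from the query.
run = Run(
    runId="11111111-2222-3333-4444-555555555555",
    facets={"queryJobId": QueryJobIdRunFacet(jobId="bquxjob_example_123")},  # e.g. a BigQuery job id
)
```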

  • Lastly, we needed more metadata related to the DAGs and Tasks running in the Airflow environment, so we could build the lineage graph with additional context for our users.

“The Alvin package on the Airflow platform will create an additional DAG to collect static metadata on a regular basis, e.g. DAG definitions, Airflow operators, source code, etc.” Alvin docs

From Alvin Docs
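To make that concrete, here is a minimal sketch of what such a metadata-collection DAG could look like. This is not the actual Alvin collector: the DAG id and schedule are illustrative, and the print call stands in for shipping the payload to a metadata backend.

```python
from datetime import datetime

from airflow import DAG
from airflow.models import DagBag
from airflow.operators.python import PythonOperator


def collect_dag_metadata():
    # Parse the local DAG files and gather static metadata about each DAG and its tasks.
    dag_bag = DagBag(read_dags_from_db=False)
    metadata = []
    for dag_id, dag_obj in dag_bag.dags.items():
        metadata.append(
            {
                "dag_id": dag_id,
                "fileloc": dag_obj.fileloc,  # path to the DAG's source file
                "tasks": [
                    {"task_id": task.task_id, "operator": type(task).__name__}
                    for task in dag_obj.tasks
                ],
            }
        )
    # In a real integration this payload would be sent to the metadata backend.
    print(metadata)


with DAG(
    dag_id="metadata_collector_example",  # illustrative name, not the Alvin DAG id
    start_date=datetime(2022, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    PythonOperator(task_id="collect", python_callable=collect_dag_metadata)
```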

Here you may find our PyPI package and the docs on how to use it. But how does it look in Alvin?

Column Level Lineage in Action

Let’s look again at the failing DAG scenario. On the last execution, the Airflow Task 06_refresh_a_curated didn't run successfully:

Airflow DAG view with failing Task

OK, how do we start the analysis? We can go into the Alvin DAG lineage view and look at all the input/output tables:

Alvin DAG Lineage UI

Next, we can jump to the failing task’s lineage view and look at all the downstream assets that will become stale until we fix that task. We can see that there’s an important Looker dashboard downstream, so we quickly have all the context we need to assess the impact!

Alvin Task Lineage UI showing a downstream dashboard

And in case you haven’t noticed, in the Airflow task view we have column-level lineage expanded by default :) so we can easily debug and trace those issues down to the column level.

Another Task with column-level lineage expanded

Also bear in mind that if you were a data consumer of that BI dashboard and noticed a data quality issue in it, you could navigate the lineage graph in the upstream direction and find the failing task.

That wraps up how we added column-level lineage to Airflow! We would love to hear more from the data engineering community about what the best experience for debugging those data quality issues would look like. So if you are interested in joining Alvin’s beta and having the Airflow integration in your data stack, you can sign up here and we’ll get in touch.

Closing thoughts

Debugging and troubleshooting issues in the modern data stack is a constant challenge. Exploring, understanding, and mapping the context that surrounds our data is time-consuming, and doing it manually is impossible!

That is why the Airflow integration is a key one, and there’s a lot more room to improve it! So we are excited about the next steps in our journey to make data engineers’ lives easier through Alvin!