The first time I heard about Data Lineage was in early 2019, when a Product Manager I worked with mentioned it as one of the most wished-for features by modern data teams.

"Data lineage includes the data origin, what happens to it, and where it moves over time. Data lineage gives visibility while greatly simplifying the ability to trace errors back to the root cause in a data analytics process." (source: Wikipedia)

As far as I know, some providers were already claiming Data Lineage at that time. But most of those tools were designed to fit specific technologies, which meant they had no real impact on data teams’ productivity if their landscape was not restricted to those technologies.

Things have changed quite a bit since then. The need for automation increases as data environments get more complex. Data Catalogs, for example, can now sync metadata in real time and tag PII columns with no manual effort. Plug-and-play Data Lineage seems to be one of the next frontiers in this space, bringing new players to the market and making data engineers rely on lineage graphs that connect assets across a broader range of tools.

This blog post is an introduction to how we're addressing Data Lineage at Alvin, using a public demo from Snowflake Labs as a reference.

Challenges

The use cases for Data Lineage vary significantly (see Dan Mashiter's The Future of Data Lineage — Beyond a Diagram). The same is true of the challenges it brings, especially the technical ones:

  • Extract meaning from SQL. Capturing the relationships expressed through SQL is critical for lineage tools. SQL is the glue between tables belonging to different governance layers in a data warehouse or between BI dashboards and the data sources that feed them. But given the variety of SQL dialects available nowadays (look at BigQuery and Snowflake query syntaxes, for instance), extracting meaningful metadata from SQL statements is easier said than done.
  • Handle pipeline history. Data pipelines change: through new views, performance optimizations, and bug fixes. You’ll probably agree that lineage information should be time-sensitive and Data Lineage tools should register and display these changes in a way that can be used for pipeline observability and troubleshooting.
  • Manage metadata at scale. Take these first two challenges and multiply them by the number of tools available in The Modern Data Stack. Lineage tools first need connectivity with each tool to extract metadata from them. They also need data models flexible enough for processing, storing, and querying such metadata at scale. And that’s no walk in the park.

This (incomplete) list of challenges might help explain why automated lineage has taken so long to become a reality. On the other hand, storage and computing scalability in the cloud have recently enabled many use cases.

Next, let’s see how data warehouse users can leverage automated lineage, taking Alvin and Snowflake as an example. Bear in mind that similar results can be achieved for other DWs such as BigQuery and Redshift.

Examples with Snowflake

Just to recap, these examples were built on top of the Financial Services Asset Management on Snowflake Demo. I’ve changed the queries slightly to demonstrate lineage features at lower costs.

Here’s the pipeline we’ll be talking about:

Image 1. Financial Services Asset Management on Snowflake, pipeline overview (image by author)

Table-level lineage

The first use case consists of showing the relationships among tables and views. STOCK_HISTORY, TRADE, POSITION, and SHARE_NOW illustrate it:

Image 2. Table-level Data Lineage (image by author)

Simple enough at first glance. But looking at the query that creates the POSITION view, you realize it’s not that straightforward for a computer program (built by a company other than Snowflake) to understand.

Code snippet 1. Query used to create the POSITION view (source: Snowflake-Labs)

Although small, the query uses a temporary table (cte), aggregation functions (sum), inner join, math operations, and so forth.

Column-level lineage

Results become even more interesting, and relationships are more evident if we expand the columns.

Image 2. Column-level Data Lineage (image by author)

The arrows on the left prove that the query used to create the POSITIONview is actually fuzzy. Great tools are expected to make their users’ lives easier, so they should allow users to understand exactly how data flows from/to specific columns with minimal effort.

Image 3. Column-level Data Lineage — column highlight (image by author)

Impact Analysis

The metadata used to evaluate Data Lineage can also power Impact Analysis. Impact Analysis tools allow data engineers to understand what would happen if they dropped a table or some of its columns. Would a downstream table become stale or a scheduled query break? The next image shows the impacts of dropping COMPANY_PROFILE.

Image 4. Impact Analysis (image by author)

In summary:

  1. Ten direct impacts on the view itself, which means dropping all of its columns.
  2. Ten stale columns in the WATCHLIST table, which pretty much depends on COMPANY_PROFILE. This means none of the columns in WATCHLIST will be updated after droppingCOMPANY_PROFILE.
  3. One job (Snowflake Task, in this case) would break: the one used to create the WATCHLIST table.
  4. Top users of COMPANY_PROFILE and WATCHLIST might be impacted (the circles on the right-hand side show who needs a heads-up).

A similar analysis can be performed for each column.

Asset Usage Analytics

As I mentioned before, lineage data is time-sensitive. Having this information in a database enables one more use case: calculating the number of queries and users that fetched data from a given table in a specific time frame.

Image 5. Table usage statistics (image by author)

Usage analytics might be used to support the decision of keeping or removing data assets from the warehouse, or changing their storage class for cost optimization.

To sum up

By now, you might be wondering about other use cases that might derive from Data Lineage. So are we: once a good foundation is set, it’s time to build on top of it.

DataOps and Data Observability teams will likely benefit from automated lineage soon. APIs, CLIs, and plugins are all over the place, and I bet it will be no different in this space.

Although I covered only Snowflake for the sake of simplicity, mapping lineage beyond data warehouses — connecting the dots between DWs and BI tools, for example — is also doable, at least with some top providers. The same applies to orchestrators such as Airflow and dbt. I'm planning to bring something in this sense as follow-up articles.

Thanks for reading; excited to hear your thoughts!