My name is Oisin (pronounced O-sheen!) — and I’m a new employee at Alvin here in Tallinn, Estonia, starting my career as a software engineer at my first job in the data space.
I’m super excited to tell you about my first project at Alvin, where I worked on getting our automated, column-level, cross-system lineage data into Amundsen, a data discovery tool developed at Lyft. Details on how to join our beta are at the end, so you can enable Alvin-powered lineage in your own Amundsen environment. Let’s go!
Alvin meets Amundsen 🤝
While Amundsen has some pretty awesome features, you need to bring your own lineage and usage data. As the original Alvin team (who have been developing our core tech for the past 2+ years) will tell you, this information can be really hard to generate automatically. At Alvin, we’re pretty agnostic in terms of where our lineage data is consumed — it’s a fundamental product philosophy that we integrate as seamlessly as possible into existing workflows. (If you want to learn more about Alvin, check out Martin and Dan’s wonderful blog posts about Alvin’s data lineage use cases and the motivations behind Alvin’s creation.)
So, after watching the impressive progress of Amundsen from the sidelines, we felt that it was time to introduce Alvin to Amundsen, and make our lineage data available to their growing community. With the mission set, it was time to embark on my journey. Actually, let’s fast forward through the week of me struggling to get Amundsen running locally — I could dedicate a whole blog post to Docker versioning issues — and start from enabling lineage in the Amundsen frontend.
Enabling Table, Column, and Dashboard Lineage
One caveat with table, column, and dashboard lineage in Amundsen is that it isn’t enabled by default, so we’ll need to enable it in the frontend configuration files. In your IDE/vim/text editor of choice, open the amundsen/frontend/amundsen_application/static/js/config/config-default.ts TypeScript file and head down to the tableLineage key. Here, you’ll see two important flags: inAppListEnabled and inAppPageEnabled. Set both of them to true. These flags enable the lineage diagrams and lineage tables in the UI.
Next, scroll down to the columnLineage key, which has the same two flags. You’ll want to set these to true as well. Just like tableLineage, Amundsen will only show column lineage data if these flags are enabled and the data is provided to Amundsen through a transformer such as ours.
One other thing you may notice within these setting keys is the ability to generate a custom URL with lineage information. This URL can point to an external service, such as Alvin, where one can see more information about the lineage data that Amundsen is presenting.
We also need to enable lineage between our tables and Tableau dashboards. Amundsen doesn’t consider this association to be the same as table or column lineage, but we will still be able to see how dashboards and tables are associated once we ingest our data. For this case, set the single flag in indexDashboards to true.
Now, you should be able to ingest TableLineage, ColumnLineage, and DashboardTable Amundsen models into your Amundsen instance.
Don’t forget to do a clean rebuild of your frontend environment after doing this; otherwise, your changes will not be reflected.
Writing the Alvin Transformer
Now that we have that part out of the way, let’s write some code. Amundsen Transformer objects have three main methods, as we see in the NoopTransformer, one of the simplest Transformer classes in Amundsen:
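As a minimal sketch of that three-method shape: the base class below is a stand-in written for this post so the snippet runs without Amundsen installed (in a real project you would subclass the Transformer base class from the databuilder package instead), and PassThroughTransformer and its scope string are hypothetical names, not Amundsen’s own.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class Transformer(ABC):
    """Stand-in for Amundsen's Transformer base class (an assumption for this sketch)."""

    @abstractmethod
    def init(self, conf: Dict[str, Any]) -> None:
        """Receive configuration from the job before any records flow through."""

    @abstractmethod
    def transform(self, record: Any) -> Any:
        """Turn one incoming record into the record(s) handed to the loader."""

    @abstractmethod
    def get_scope(self) -> str:
        """Return the prefix under which this transformer's config keys live."""


class PassThroughTransformer(Transformer):
    """Hypothetical do-nothing transformer: every record is returned unchanged."""

    def init(self, conf: Dict[str, Any]) -> None:
        pass  # nothing to configure

    def transform(self, record: Any) -> Any:
        return record

    def get_scope(self) -> str:
        return "transformer.passthrough"
```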
So, to transform a piece of data in Amundsen, we need to make sure we have these three methods. Let’s talk about how I learned about and wrote these methods individually:
The init method is not Python’s constructor (def __init__). Rather, Amundsen calls it when a transformer is used within the context of a task, and passes the configuration to the transformer. Here, you can set configuration variables from the config tree, including an API key or any other variables you will need in the transformation.
Configuration keys should be set when defining the job. For example, in the BigQuery ingestion job that Amundsen provides in their GitHub repo, we see that they use a ConfigFactory to create a ConfigTree dictionary object that the Job will pass to each Extractor, Transformer, Loader, and Publisher.
So, for example, in our Alvin Transformer, we will want:
- An API key, provided by Alvin
- Alvin Platform ID (which is user-defined when you add your credentials to Alvin)
- Alvin Platform Type (such as bigquery or tableau)
- Alvin Instance URL (when running locally, this is generally 127.0.0.1:8080)
We can specify these variables in our class as static variables, which we can then reference in our job configuration:
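Here is a rough sketch of that idea, not the connector’s actual code. The key names and the default URL scheme are assumptions, and a plain dict stands in for the ConfigTree that Amundsen would pass in.

```python
from typing import Any, Dict


class AlvinTransformer:
    # Hypothetical key names; the real connector may name these differently.
    ALVIN_API_KEY = "alvin_api_key"
    ALVIN_PLATFORM_ID = "alvin_platform_id"
    ALVIN_PLATFORM_TYPE = "alvin_platform_type"
    ALVIN_INSTANCE_URL = "alvin_instance_url"

    def init(self, conf: Dict[str, Any]) -> None:
        # Required settings: a missing key raises KeyError right away,
        # before any record reaches transform().
        self.api_key = conf[self.ALVIN_API_KEY]
        self.platform_id = conf[self.ALVIN_PLATFORM_ID]
        self.platform_type = conf[self.ALVIN_PLATFORM_TYPE]
        # Optional setting: fall back to a local instance (assumed default).
        self.instance_url = conf.get(self.ALVIN_INSTANCE_URL, "http://127.0.0.1:8080")


# Usage: the job definition references the same constants, so a renamed
# key only has to change in one place.
transformer = AlvinTransformer()
transformer.init({
    AlvinTransformer.ALVIN_API_KEY: "example-key",
    AlvinTransformer.ALVIN_PLATFORM_ID: "my_bigquery",
    AlvinTransformer.ALVIN_PLATFORM_TYPE: "bigquery",
})
```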
You can check out more configuration examples on Amundsen’s website here.
The transform method takes in a record (an object that corresponds to an Amundsen model, such as TableMetadata or Usage), which can be transformed any way you like, as long as you return a valid Amundsen model. Read up on all of the Amundsen Models here and check out the code here. There are a ton of possibilities, so I would definitely recommend reading up on what you need for your use case.
One good way to figure out what data you’re working with is to run the transformer with the sample data loader script in the databuilder submodule. This way, you can understand how the transformation function works, and how Amundsen models can be manipulated. Here’s some example code that just prints all of the data that pass through the transformer:
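A minimal sketch of such a debugging transformer might look like this. The class name and scope string are hypothetical, and in a real job it would subclass Amundsen’s Transformer base class rather than stand alone.

```python
from typing import Any, Dict


class PrintTransformer:
    """Hypothetical debugging aid: print each record, then pass it on untouched."""

    def init(self, conf: Dict[str, Any]) -> None:
        pass  # no configuration needed

    def transform(self, record: Any) -> Any:
        # Show the model type first, since that decides what a real
        # transformer can do with the record.
        print(f"[{type(record).__name__}] {record!r}")
        return record

    def get_scope(self) -> str:
        return "transformer.print"
```

Because the record is returned unchanged, dropping this into an existing job should not alter what ends up in Amundsen.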
You can then use this Transformer within the context of an ETL task and publishing job, like the example below, which extracts, prints, and loads BigQuery data. With a few adjustments to the BigQuery credentials and dependency installation, you should be able to just run it and see data pop up on Amundsen!
We can also do this process with Tableau. Navigate to the amundsen/databuilder/example/scripts/sample_tableau_data_loader.py script file. After putting in your credentials, this script will explore your Tableau instance metadata and add it to Amundsen. After running the script, you should then be able to see your Tableau dashboards on Neo4j and Amundsen!
One quirk that tripped me up for a few hours was the use of yield within the transform function. Amundsen generally uses a pull/generator pattern in its ETL pipeline. Instead of returning an object, we can yield as many objects as we like from a single transformer.
So, not only are we able to yield the original table data, but we can also use the record’s entity ID and configuration variables to make the correct calls to the Alvin API. I could then receive lineage data from Alvin and turn the responses into TableLineage, ColumnLineage, and DashboardTable objects. I could then yield these objects back to the task, which would pass them along to the loader and publisher. Here’s some of the code to explain what I mean.
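As a rough sketch of this pattern (not the actual connector code): the two dataclasses are stand-ins for Amundsen’s models, key() is a hypothetical helper, and fetch_downstream stands in for the HTTP call to the Alvin API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterator, List


@dataclass
class TableMetadata:
    """Stand-in for Amundsen's TableMetadata model."""
    database: str
    cluster: str
    schema: str
    name: str

    def key(self) -> str:
        # Hypothetical helper that builds the table key.
        return f"{self.database}://{self.cluster}.{self.schema}/{self.name}"


@dataclass
class TableLineage:
    """Stand-in for Amundsen's TableLineage model."""
    table_key: str
    downstream_deps: List[str]


class LineageTransformer:
    def __init__(self, fetch_downstream: Callable[[str], List[str]]) -> None:
        # Injected lookup; in practice this would wrap a call to the lineage API.
        self._fetch_downstream = fetch_downstream

    def transform(self, record: Any) -> Iterator[Any]:
        # Always pass the original record on, so the table itself gets loaded.
        yield record
        if not isinstance(record, TableMetadata):
            return  # only tables are enriched with lineage here
        downstream = self._fetch_downstream(record.key())
        if downstream:
            yield TableLineage(table_key=record.key(), downstream_deps=downstream)


# Usage with a fake lookup table instead of a live API.
fake_api = {"bigquery://proj.hr/employees_us": ["bigquery://proj.hr/employees_offices_all"]}
transformer = LineageTransformer(lambda key: fake_api.get(key, []))
records = list(transformer.transform(TableMetadata("bigquery", "proj", "hr", "employees_us")))
```

One incoming table produces two outgoing records here: the table itself, followed by its lineage.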
During the development process, I also noticed that the function did not provide error messages if the Amundsen model had invalid data, such as a non-existent table or column. If data isn’t showing up, make sure you have the right table or column keys, and that they are formatted correctly. You can find the TableMetadata reference and key format here, and the ColumnMetadata format here.
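Since bad keys fail silently, a small sanity check before yielding can save some debugging time. The sketch below assumes the key layouts are db://cluster.schema/table for tables and the same with /column appended for columns, which is my understanding of the Amundsen formats. The helper names are hypothetical, and the regular expressions are deliberately loose checks rather than a full specification.

```python
import re

# Assumed key layouts; double-check them against the model references.
TABLE_KEY_FORMAT = "{db}://{cluster}.{schema}/{table}"
COLUMN_KEY_FORMAT = "{db}://{cluster}.{schema}/{table}/{column}"

_TABLE_KEY_RE = re.compile(r"^[^:/]+://[^./]+\.[^/]+/[^/]+$")
_COLUMN_KEY_RE = re.compile(r"^[^:/]+://[^./]+\.[^/]+/[^/]+/[^/]+$")


def table_key(db: str, cluster: str, schema: str, table: str) -> str:
    return TABLE_KEY_FORMAT.format(db=db, cluster=cluster, schema=schema, table=table)


def column_key(db: str, cluster: str, schema: str, table: str, column: str) -> str:
    return COLUMN_KEY_FORMAT.format(
        db=db, cluster=cluster, schema=schema, table=table, column=column
    )


def looks_like_table_key(key: str) -> bool:
    """Loose shape check: catches a missing scheme, cluster, or schema."""
    return _TABLE_KEY_RE.match(key) is not None


def looks_like_column_key(key: str) -> bool:
    """Same check with one extra path segment for the column name."""
    return _COLUMN_KEY_RE.match(key) is not None
```

A check like this only validates the shape of a key. It cannot tell you whether the table or column actually exists in Amundsen.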
The get_scope method is a much simpler function, which returns a string that identifies the Transformer to the job configuration, in the form of:
within the ConfigFactory declaration that was mentioned above. For example, for our Alvin Transformer, we might have a get_scope function like this:
and set up a configuration key that looks something like this, similar to the configuration keys that we talked about in the def init() section:
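To illustrate how the scope and the configuration keys fit together, here is a hedged sketch. The scope string "transformer.alvin" and the key names are hypothetical, a plain dict stands in for the ConfigTree, and get_scoped_conf is a simplified stand-in for the scoping that Amundsen performs before calling init.

```python
from typing import Any, Dict


def get_scoped_conf(conf: Dict[str, Any], scope: str) -> Dict[str, Any]:
    """Keep only the keys under `scope` and strip the prefix from them."""
    prefix = scope + "."
    return {
        key[len(prefix):]: value
        for key, value in conf.items()
        if key.startswith(prefix)
    }


class AlvinTransformer:
    def get_scope(self) -> str:
        # Hypothetical scope string for this sketch.
        return "transformer.alvin"

    def init(self, conf: Dict[str, Any]) -> None:
        # By the time init runs, the keys no longer carry the scope prefix.
        self.api_key = conf["alvin_api_key"]


# The job-level configuration holds keys for every component.
job_conf = {
    "extractor.example.project_id": "my-project",
    "transformer.alvin.alvin_api_key": "example-key",
}

transformer = AlvinTransformer()
scoped = get_scoped_conf(job_conf, transformer.get_scope())
transformer.init(scoped)
```

The scope keeps each component’s settings separate, so the transformer never sees the extractor’s keys.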
If you need more information and help, check out amundsen/databuilder/example/scripts for ETL Task and Job samples, including ones that extract Tableau and BigQuery metadata and usage stats. Also, check out the amundsen/databuilder/databuilder folder, which contains loader, transformer, and extractor folders with a variety of modules that you can plug into your own custom tasks.
And, of course, you can check out our GitHub repository here, where you can see the code we have written for the connector.
Extracting and Transforming Amundsen Model Data
One tip for working with Amundsen Transformers would be to become intimately familiar with the Amundsen models that will be passing through your transformer. In general, you’ll be receiving TableMetadata if you’re transforming data directly from a data warehouse extractor, such as the BigQuery extractor, so you’ll want to make sure you’re checking for that before transforming.
In my case, I took the TableMetadata objects and configuration values to create the correct URLs for calling the Alvin API. I would then receive objects that corresponded directly to an Amundsen TableLineage, ColumnLineage, or DashboardTable model object, and yield those objects once they were created.
Let’s look at the return values of the Alvin API that I created for this project, as well as the TableLineage object from Amundsen:
In this example, I created a corresponding object that matched our API return values; we use FastAPI so this liaison object was necessary to return a valid value that matched our API model. We can see that the constructor of the Amundsen TableLineage object is exactly the same as our AmundsenTableLineage object (apart from the model tag), and thus we can simply match the values when we construct those Amundsen objects in our transformer.
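A sketch of that matching step, under stated assumptions: the TableLineage dataclass is a stand-in for the Amundsen model, and the payload field names, including the "model" tag, are hypothetical rather than the real Alvin API schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class TableLineage:
    """Stand-in for Amundsen's TableLineage model."""
    table_key: str
    downstream_deps: List[str] = field(default_factory=list)


def to_table_lineage(payload: Dict[str, Any]) -> TableLineage:
    """Build a lineage object from an API response with matching field names."""
    # The model tag is only used to route the payload, so it is not copied over.
    return TableLineage(
        table_key=payload["table_key"],
        downstream_deps=list(payload.get("downstream_deps", [])),
    )


# Hypothetical response body from the lineage API.
response = {
    "model": "table_lineage",
    "table_key": "bigquery://proj.hr/employees_us",
    "downstream_deps": ["bigquery://proj.hr/employees_offices_all"],
}
lineage = to_table_lineage(response)
```

Keeping the API fields identical to the constructor arguments means the mapping stays a one-to-one copy, with no renaming logic to maintain.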
Also, note the comment in the Amundsen TableLineage class definition, claiming that the object will not create nodes (which correspond to the TableMetadata, ColumnMetadata, and DashboardMetadata models in Amundsen). This comment means that you need to make sure both tables will be present in the system before accessing any lineage, otherwise you will not receive any of the lineage data. This paradigm is also present with Dashboards and Columns — ensure that those entities are ingested before attempting to associate entities with each other.
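One way to guard against this is to check both ends of each relationship against the keys you have already ingested, and set aside the ones that cannot be resolved yet. The function below is a hypothetical helper of my own, not part of Amundsen.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str]  # (upstream key, downstream key)


def split_resolvable(
    edges: Iterable[Edge], known_keys: Set[str]
) -> Tuple[List[Edge], List[Edge]]:
    """Separate edges whose endpoints both exist from those with a missing node."""
    resolvable: List[Edge] = []
    missing: List[Edge] = []
    for upstream, downstream in edges:
        if upstream in known_keys and downstream in known_keys:
            resolvable.append((upstream, downstream))
        else:
            missing.append((upstream, downstream))
    return resolvable, missing


# Usage: only the first edge connects two tables that have been ingested.
known = {"bigquery://proj.hr/employees_us", "bigquery://proj.hr/employees_offices_all"}
edges = [
    ("bigquery://proj.hr/employees_us", "bigquery://proj.hr/employees_offices_all"),
    ("bigquery://proj.hr/employees_eu", "bigquery://proj.hr/employees_offices_all"),
]
ok, skipped = split_resolvable(edges, known)
```

Logging the skipped edges gives you the feedback that the silent failure otherwise hides.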
The Final Result (drumroll please…)
Through some trial and error, I finally managed to get Alvin connected to Amundsen! As someone coming from a more academic environment, I needed to learn a lot about enterprise coding, code decoupling on a massive scale, generator patterns, and a bunch of other things that I would never have learned so quickly had I not been thrown into the deep end of the metaphorical data pool. But, after all of that, I was able to get Alvin’s table, column, and BI lineage showing up in Amundsen.
We can make sure everything is working correctly by comparing the lineage visualizations in Alvin with those in Amundsen. Let’s look at an example. We created a (fake) data pipeline, where the ssn column goes through various steps (source → quarantine → safe), into an analytics table, and finally into a Tableau sheet. Here’s how that looks on the column level in Alvin’s UI:
In Amundsen, there are a few different ways to represent lineage in the UI. Table level lineage within the data warehouse can be represented with a diagram. As you can see, it matches up with the tables in Alvin!
In Amundsen, column lineage is displayed in list form for one level of lineage. We can see below, as in the Alvin UI, that the column employees_offices_all.ssn has two columns directly upstream: employees_us.ssn and employees_eu.ssn. Excellent!
We’re also able to leverage Alvin’s cross-system lineage data by showing all of the Tableau workbooks that use the employees_offices_all table, in this case BigQueryWorkbook (this contains the SSN by Office sheet from the Alvin UI above).
Finally, sticking with the cross-system lineage, we can now check out the page in Amundsen for the Tableau workbook, where we can see all the tables that it uses.
Pretty awesome, right? 😎
Alvin ❤️ Amundsen
Hopefully you’ve learned a little bit about Amundsen’s front-end lineage features, and what Alvin can bring to the (literal and metaphorical) table when it comes to generating the data to power them. It’s been a ride getting to this point, but now that I’ve brought Alvin and Amundsen together, I want to see the relationship blossom!
If you want to try Alvin, request access here and a crew member will reach out to set up a demo and walk you through the onboarding :)