The release of Alvin 2.0 came with an important realization: no UI can serve all of the varied and complex use cases of our customers. Luckily, it also came with a solution: the metadata warehouse. Whenever one of our users has a question about their data environment that we haven’t yet anticipated, or that is simply too niche to justify supporting in the UI, the metadata warehouse gives them direct access to their metadata via SQL. Everything you ever wanted to know about your data environment is now just a SQL statement away.

I will also use this opportunity to anticipate a potential pushback: can’t I just query the logs in my data warehouse? Of course, some of the metadata is available in data warehouse logs or via APIs. Internally we call this raw metadata: SQL logs, billing data, schema information and so on. Alvin’s true differentiator is that this raw metadata is parsed, structured and correlated to give a deeper level of insight. In a sense, our product is this derived metadata; it is the foundation of everything we build.

A near-universal truth across our user base is that they know their way around a SQL statement. It’s therefore only logical that they can leverage that ability in Alvin to tackle even the most complex use cases. We’ve already been surprised and delighted by how some of our customers have been using it. One example is a customer who needed a list of all root (source data) and destination (final downstream level) lineage for a given table. A single query in the metadata warehouse got the answer, saving days of manual mapping and log inspection. Needless to say, these types of interactions also help us improve the product, as we learn where we fall short of providing immediate insight or value. But the big difference is that the customer gets real results up front, not after we have scoped, built and released a feature!
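To give a flavour of what such a lineage query could look like, here is a minimal sketch of walking a lineage graph upstream to its roots with a recursive CTE. The table and column names (`lineage_edges`, `source_fqn_id`, `target_fqn_id`) are assumptions for illustration, not the actual metadata warehouse schema:

```sql
-- Sketch: walk a hypothetical lineage edge table upstream from one table
WITH RECURSIVE upstream AS (
  -- Direct upstream dependencies of the table of interest
  SELECT source_fqn_id, target_fqn_id, 1 AS depth
  FROM lineage_edges
  WHERE target_fqn_id = 'my_project.my_dataset.my_table'
  UNION ALL
  -- Keep following edges further upstream
  SELECT le.source_fqn_id, le.target_fqn_id, u.depth + 1
  FROM lineage_edges AS le
  JOIN upstream AS u ON le.target_fqn_id = u.source_fqn_id
)
-- Roots are upstream nodes that have no incoming edges of their own
SELECT DISTINCT u.source_fqn_id AS root_table
FROM upstream AS u
WHERE NOT EXISTS (
  SELECT 1 FROM lineage_edges AS le
  WHERE le.target_fqn_id = u.source_fqn_id
)
```

The same shape of query, with the join direction reversed, would find the final downstream destinations.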

How does it work?

Looking at general application observability tools like Datadog, you find a “layered architecture”: logs and process data are continuously ingested, developers instrument and trace their code, and within the platform logs are processed and correlated. On top of this data, developers get high-level aggregated insights, but they can always dive into an individual log entry or adjust the configuration. We realized that this way of building an observability product made a lot of sense for us too, at a conceptual level. We let customers use tags, labels and comments to annotate queries, which Alvin automatically picks up and uses for e.g. ownership and cost attribution. This is very much akin to the instrumentation process in a software application.
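As a sketch of what such instrumentation could look like, a team might tag a scheduled query in a SQL comment. The exact annotation syntax Alvin parses is not specified here, so treat the comment format, table name and team name below as illustrative assumptions:

```sql
-- team: marketing-analytics
-- owner: jane.doe@example.com
SELECT
  campaign_id,
  SUM(spend) AS total_spend
FROM ads.daily_spend  -- hypothetical table
GROUP BY campaign_id
```

A processor that retains query comments can then attribute this query’s cost and ownership to the marketing-analytics team, with no change to the warehouse itself.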

From its raw format, metadata must be correlated: dbt model runs and tests need to be connected to the underlying data warehouse jobs, and the same goes for dashboard views. Additionally, metadata generated by our parser, such as lineage, usage and AST information, also goes into the mix. Alvin’s architecture is depicted below:
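The correlation step can be pictured as a join on a shared identifier. Here is a hedged sketch (the real pipeline is more involved, and `dbt_invocation_id` is an assumed column name, chosen by analogy with the `lk_query_history_slug` used for Looker later in this post):

```sql
-- Connect dbt model runs to the warehouse jobs they triggered,
-- assuming both event types carry a shared dbt invocation id
SELECT
  m.entity_name AS dbt_model,
  j.cost        AS job_cost,
  j.end_time    AS job_end_time
FROM events AS m
JOIN events AS j
  ON m.dbt_invocation_id = j.dbt_invocation_id
WHERE m.event_type = 'DBT_MODEL_RUN'
  AND j.event_type = 'JOB_RUN'
```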

We ingest data from our integrations and process, enrich and correlate it into a common model. Then we build our features, such as the lineage explorer, workloads and anomaly detection, on top of it. And the metadata warehouse, the topic of this piece, is at the heart of it all!

Example

Enough talking: seeing is believing! To show the power of the metadata warehouse, let’s work through an example. A company is using Looker and BigQuery. There have been a number of complaints from business users about slow dashboards. We suspect this is caused by the recent move from on-demand to slot-based pricing in BigQuery, given that compute resources are now constrained.

Our plan is to:

  1. Look at which dashboards are impacted
  2. Reach out to the users that are currently experiencing slow loading to explain the situation and understand if we need to take action.

While Alvin currently can show dashboard and dashboard element performance and cost, it doesn't have a direct feature to show all the dashboard load times per view in the UI. But as is generally the case, all the data is there. So let's get started.

-- Get all dashboard view events at the element level
WITH dashboard_views AS (
  SELECT
    e.entity_fqn_id,
    e.entity_name,
    e.event_user,
    e.created_time,
    e.lk_query_history_slug
  FROM
    events AS e
  WHERE
    e.event_type = 'LOOKER_HISTORY_VIEW'
    AND e.entity_type = 'DASHBOARD'
    AND EXTRACT(DATE FROM e.created_time) = DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)
),
-- Get all BigQuery jobs that are related to Looker
dashboard_queries AS (
  SELECT
    e.lk_query_history_slug,
    e.cost,
    e.created_time,
    e.end_time
  FROM
    events AS e
  WHERE
    e.event_type = 'JOB_RUN'
    AND e.lk_query_history_slug IS NOT NULL
),
-- Correlate Looker dashboard element view events with BigQuery jobs
joined_views AS (
  SELECT
    dv.*,
    dq.cost,
    dq.created_time AS query_created_time,
    dq.end_time AS query_end_time
  FROM
    dashboard_views AS dv
    JOIN dashboard_queries AS dq
      ON dv.lk_query_history_slug = dq.lk_query_history_slug
),
-- Aggregate and calculate load time per dashboard view as the time from
-- when the dashboard view started to when the last dashboard element query finished
view_performance AS (
  SELECT
    entity_fqn_id,
    entity_name,
    event_user,
    MIN(created_time) AS start_time,
    MAX(query_end_time) AS end_time,
    TIMESTAMP_DIFF(MAX(query_end_time), MIN(created_time), SECOND) AS load_time_s,
    SUM(cost) AS cost
  FROM
    joined_views
  GROUP BY
    entity_fqn_id,
    entity_name,
    created_time,
    event_user
)
-- Get all dashboard views where the user waited more than 2 minutes
SELECT
  * EXCEPT (entity_fqn_id, cost) -- entity_fqn_id and cost are not needed here
FROM
  view_performance
WHERE
  load_time_s > 120
ORDER BY
  start_time DESC

Running this query, we get back a list of the unfortunate users who had to wait more than two minutes for their dashboards to load.
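To address the first step of the plan, identifying which dashboards are impacted overall, the same `view_performance` CTE can be reused with one more aggregation. This is a sketch that reuses the column names from the query above:

```sql
-- Rank dashboards by how often, and for how many users, they loaded slowly
SELECT
  entity_name,
  COUNT(*) AS slow_views,
  AVG(load_time_s) AS avg_load_time_s,
  COUNT(DISTINCT event_user) AS affected_users
FROM
  view_performance
WHERE
  load_time_s > 120
GROUP BY
  entity_name
ORDER BY
  slow_views DESC
```

Dashboards at the top of this list are the natural candidates for optimization or extra slot capacity.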

As a first step, we shoot off an email to these users to explain the issue and let them know we are aware of it and working on it. We also include a feedback form to better understand their priorities.

As we can see, this particular query was not trivial, but once you look at the breakdown it’s fairly straightforward. Now, we don’t believe that writing SQL should be the barrier to entry for using Alvin, but the metadata warehouse removes any barrier between users and their metadata, opening up new and exciting use cases.

Plus, our team is always on hand to help customers wire up queries when needed, and we hope that our community will begin to share these queries so others can gain value from them.

What’s next?

As data teams shift their mindset toward data products, it’s essential that they have the insights to manage data like a product. As the example above shows, it’s simply not enough to focus on technical data quality: the full experience of the data stakeholders must be taken into account, something I recently wrote about on LinkedIn. Managing a data product, like any product, requires insight into usage, adoption and users, all of which the metadata warehouse provides out of the box! And of course, expect to see data products front and center as we continue to build Alvin for data teams and AI use cases.
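As one last sketch of what product-style insight could look like, adoption of a table can be tracked as distinct users per week. The event type `'QUERY'` and the use of `entity_name` for a table are assumptions here; the real event taxonomy may differ:

```sql
-- Weekly adoption of a given table: how many distinct users queried it?
SELECT
  DATE_TRUNC(EXTRACT(DATE FROM created_time), WEEK) AS week,
  COUNT(DISTINCT event_user) AS active_users,
  COUNT(*) AS queries
FROM
  events
WHERE
  event_type = 'QUERY'
  AND entity_name = 'my_dataset.my_table' -- hypothetical table
GROUP BY
  week
ORDER BY
  week
```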

You can go to https://console.alvin.ai/auth/signin and try this out today. Just sign up and you will immediately be able to access a demo environment. Copy-paste the query and see the metadata warehouse in action for yourself!