It’s easy to understand why data-driven companies are embracing self-serve analytics: when individuals have autonomous access to insights, better decisions can be made faster, at least in theory. It falls on BI teams to act as the mediators between raw data and the insights generated from it. From our conversations with these teams, this comes with some significant challenges.

Organizations are now packed full of demanding data ‘customers’, who have come to rely on data to do their jobs. Keeping them happy whilst balancing other considerations such as privacy and security can prove tricky, particularly in the face of tightening privacy regulation (GDPR in Europe, CCPA in California) and an increasing threat from cyber attacks.

“Moving fast and breaking things” isn’t all that cool when it comes to leaking sensitive data or misreporting numbers to the board. The answer? Companies of all sizes, including startups and scaleups, are looking to implement some form of data governance.

Data Governance: maximizing access to the high-quality data stakeholders need to drive business value, whilst minimizing risks.

Currently, tools that try to address these pains trend towards the enterprise: bundled features, a high price point, lengthy setup and plenty of maintenance. These tools are not really fit for purpose in rapidly scaling, time-poor companies, and are often overkill when it comes to solving the real ‘hair on fire’ problems.

Internally, we define data governance as maximizing access to the high-quality data stakeholders need to drive business value, whilst minimizing risks. This encompasses very specific challenges, with different personas and use cases. It is our belief that these need to be met with tools that respect the workflows of today’s data professionals, with use-case-driven features powered by automation.

Data democracy without data governance — data anarchy?

The shift towards self-serve analytics is often referred to as ‘democratizing data’. Some companies have adopted data catalogs to promote this culture, making data assets searchable across teams and departments, and providing additional context such as sensitivity and quality. Enterprise data catalogs have been around for a while, but we’ve also seen data-driven companies such as Lyft (Amundsen), LinkedIn (DataHub) and Netflix (Metacat) build their own open-source solutions, having found nothing on the market to meet their needs.

So what is the relationship between data governance and data democracy? To stretch the analogy, in any democracy it is the role of the government to put in place the rules and structures needed to ensure it survives, and you could say the same about data governance. With no governance you get ‘data anarchy’: a free-for-all with little to no control.

Even with a simple setup consisting of a data warehouse (e.g. Snowflake) and a data visualization tool (e.g. Looker), things can get out of hand quickly without good governance. Unused tables and near-duplicate versions of tables and dashboards clutter the environment, eroding trust and leaving no single source of truth, while also driving up storage and compute costs. Keeping track of which columns and dashboards contain sensitive data, and who has access to it, becomes a major challenge. And when a change breaks an important dashboard, BI engineers can expect a flood of messages from stressed-out colleagues.

A typical Slack back-and-forth when something goes wrong

Where to start in automating data governance? Know the flows.

For BI teams to work effectively, they need to understand how data flows. For instance, they may want to delete a column that is no longer relevant. However, this has the potential to break an ETL pipeline or a C-level dashboard. Without a tool that can perform impact analysis, BI teams have to choose between painstakingly mapping out data flows manually, making the change and hoping for the best, or simply avoiding changes altogether!
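To make that concrete, here is a minimal sketch of what usage-based impact analysis can look like once lineage has been extracted into a directed graph. We use the open-source networkx library and made-up entity names purely for illustration; this is not Alvin’s actual implementation:

```python
# A toy lineage graph: edges point from upstream to downstream entities.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("warehouse.orders.customer_id", "warehouse.orders_daily.customer_id"),
    ("warehouse.orders_daily.customer_id", "looker.dashboards.retention"),
    ("warehouse.orders.amount", "looker.dashboards.revenue"),
])

def impact_of_dropping(entity):
    """Everything downstream that could break if `entity` is deleted."""
    return nx.descendants(lineage, entity)

print(impact_of_dropping("warehouse.orders.customer_id"))
# {'warehouse.orders_daily.customer_id', 'looker.dashboards.retention'}
```

The traversal itself is the easy part; the hard part is building and maintaining an accurate graph across systems in the first place.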

In most cases, the BI team doesn’t need a company-wide enterprise data catalog with all the bells and whistles: they need to understand how data is flowing, where it is flowing and who is using it. In enterprise data catalogs, essential features like cross-system and column-level data lineage either don’t exist or are only included in the top-tier pricing. Open-source data catalogs are working towards this, but it is a technically challenging problem (as we know from experience!).

There is, of course, lots to learn from data catalogs and the open-source community; in fact, any useful metadata product needs a catalog at its core. Since data flows can be seen as a graph with edges between nodes, you cannot talk about flows without an index of data entities such as tables, columns, and dashboards. The flow of data occurs on a timeline, as statements or jobs are issued that cause data to move between entities. This becomes truly powerful when data entities have been tagged, for example with location or sensitivity: suddenly we can automatically track where personal data is trickling into dashboards and ad-hoc queries.
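As a rough sketch of that idea, tagging can be layered on top of the same kind of lineage graph: tag a source column as PII, then follow the flows to see which dashboards it reaches. Again, the library choice and entity names are illustrative assumptions:

```python
# Tag source columns, then propagate tags along data flows.
import networkx as nx

lineage = nx.DiGraph()
lineage.add_edges_from([
    ("warehouse.users.email", "warehouse.users_clean.email"),
    ("warehouse.users_clean.email", "looker.dashboards.signups"),
    ("warehouse.events.page_url", "looker.dashboards.traffic"),
])
tags = {"warehouse.users.email": {"pii"}}

def entities_reached_by(tag):
    """Every downstream entity that a column carrying `tag` flows into."""
    reached = set()
    for entity, entity_tags in tags.items():
        if tag in entity_tags:
            reached |= nx.descendants(lineage, entity)
    return reached

pii_dashboards = {e for e in entities_reached_by("pii")
                  if e.startswith("looker.dashboards.")}
print(pii_dashboards)  # {'looker.dashboards.signups'}
```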

To truly understand data flows, usage data must be treated as a first-class citizen: queries, jobs, and dashboards must be parsed and structured. Usage data is what makes the product opinionated, much as Amazon gives you suggestions based on your search and purchase history.
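As one illustration of what ‘parsed and structured’ can mean, here is a sketch that classifies column usage by clause using the open-source sqlglot parser. This is one possible tool choice, not a description of our parser, and the query and table names are made up:

```python
# Parse a query from a usage log and record which clause each column
# appears in (the basis for JOIN/WHERE/GROUP BY usage classification).
import sqlglot
from sqlglot import exp

sql = """
    SELECT o.customer_id, SUM(o.amount)
    FROM orders AS o JOIN customers AS c ON o.customer_id = c.id
    WHERE o.created_at > '2021-01-01'
    GROUP BY o.customer_id
"""

tree = sqlglot.parse_one(sql)
usage = {}
for clause_type, label in [(exp.Join, "JOIN"),
                           (exp.Where, "WHERE"),
                           (exp.Group, "GROUP BY")]:
    for clause in tree.find_all(clause_type):
        for column in clause.find_all(exp.Column):
            usage.setdefault(column.sql(), set()).add(label)

print(usage)
# e.g. {'o.customer_id': {'JOIN', 'GROUP BY'}, 'c.id': {'JOIN'},
#       'o.created_at': {'WHERE'}}
```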

What is Alvin — and what is our mission?

In our past lives at data-focused startups, we experienced first-hand some of the big challenges facing big data. Alvin was started with the belief that companies are no longer competing on how much data they collect and store, but on its quality and how well they leverage it. We’re currently two founders (Martin and Dan), recently backed by a top Nordic VC, and are now growing the team (shameless hiring plug!).

Alvin automates key aspects of data governance, freeing up time and headspace for BI professionals to focus on the true value-creating activities. Our core technology analyzes data flows and interactions between systems and people. This will power a range of modules, each laser-focused on a specific challenge, including data lineage, access management, privacy and financial compliance, cost control and logging.

Our first module delivers plug-and-play data lineage to BI teams, answering questions such as:

What are the upstream and downstream dependencies of a column? (Column-level lineage and transformations)

What dashboards are using this table or column? (Cross-system lineage)

What columns are in use, and what are they being used for? (Column usage classification: JOIN/WHERE/GROUP BY)

Will deleting this table break dashboards that are actively used? (Usage-based impact analysis)

How has this flow changed since last week? (Lineage time travel)
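As a toy illustration of that last question, ‘lineage time travel’ can be thought of as diffing snapshots of the lineage graph’s edge set taken at different points in time. The edges here are hypothetical:

```python
# Diff two lineage snapshots to see how flows changed over a week.
last_week = {
    ("warehouse.orders.amount", "looker.dashboards.revenue"),
    ("warehouse.orders.customer_id", "looker.dashboards.retention"),
}
today = {
    ("warehouse.orders.amount", "looker.dashboards.revenue"),
    ("warehouse.orders_v2.customer_id", "looker.dashboards.retention"),
}

print("edges added:  ", today - last_week)
print("edges removed:", last_week - today)
```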

Column-level data lineage between BigQuery, Tableau, Looker, Treasure Data and Mode

We have spent the last few years fully immersed in the space: building our core tech, talking to BI teams and conducting pilots. We’ve seen how easy it is for highly skilled BI professionals to get stuck doing tedious tasks that could be automated, in some cases as much as 40% of their time. With so many competing priorities in scaling companies, implementing a fully fledged enterprise solution is unlikely to make it to the top of the list, so BI teams continue to operate below their true capacity.

By building Alvin to be zero-setup and modular, we aim to break this cycle and take data governance beyond the enterprise. With a BI team free to focus on innovating and inventing, scaling companies give themselves the best chance of data-driven success.