Today, we're thrilled to announce the launch of Alvin 2.0, a major upgrade in how data teams can leverage metadata to perform better. For us, the journey began four years ago, pioneering column-level lineage for the modern data stack and pushing the boundaries of SQL parsing. Having now analyzed billions of SQL statements, our parser has reached a level of accuracy and performance I never imagined.

However, technical achievements are only the beginning. We believe the true power lies in helping data teams unlock the potential of this incredible dataset, and Alvin 2.0 is the first step towards that.

To introduce Alvin 2.0, I wanted to go through five learnings that underpin our product vision:

Lineage is a Foundation, Not a Feature

We've said it from the start: data lineage is fundamental. It's the foundation upon which we built Alvin. While we were the first to focus solely on lineage within the modern data stack (MDS), the importance of lineage has become undeniable. As a result, many companies now offer some degree of lineage as part of observability or catalog solutions.

Whilst the accuracy of these solutions can vary significantly, it’s also fair to say that even we haven’t ‘solved’ lineage. New features and functionality are constantly added to SQL, requiring a continually evolving parser to maintain accuracy. At Alvin, this means comprehensive coverage of procedure calls, multi-statement queries, and even complex 2,000-line fraud analysis queries, all backed by an ever-growing test suite.
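
To make the column-level part concrete, here is a simplified, hypothetical multi-statement example of the kind of SQL a lineage parser has to resolve (the table and column names are illustrative only); the comments sketch the lineage we would expect to derive:

    -- Statement 1: stage completed orders into a temporary table
    CREATE TEMP TABLE stg_orders AS
    SELECT order_id, customer_id, amount_usd
    FROM raw.orders
    WHERE status = 'COMPLETE';

    -- Statement 2: aggregate the staged rows into a reporting table
    INSERT INTO reporting.customer_revenue (customer_id, total_revenue)
    SELECT customer_id, SUM(amount_usd)
    FROM stg_orders
    GROUP BY customer_id;

    -- Expected column-level lineage (the temporary table must be resolved away):
    --   reporting.customer_revenue.customer_id   <- raw.orders.customer_id
    --   reporting.customer_revenue.total_revenue <- raw.orders.amount_usd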

For in-house solutions, or for other SaaS providers in the space, 70-80% accuracy may be acceptable. But this was never going to be enough for us to realize our vision, given that this highly accurate dataset was to form the foundation of our product. We’d like to think this focus on complete accuracy has paid off in the breadth and depth of features we can now offer. This meticulousness fuels Alvin 2.0, allowing us to deliver exceptional value to data teams.

FinOps: A Priority for Data Teams

For too long, data teams have escaped the scrutiny of FinOps practices. Data spend has been seen as a variable cost in the pursuit of growth, often not separated from general cloud spend. This lack of focus on data costs, coupled with a culture that valued ad-hoc request fulfillment over ROI, has left many data teams struggling to demonstrate their value.

However, the tide is turning. Data teams are increasingly taking pride in delivering value for money, just like any other team within the company. The challenge lies in how to achieve this: the return on data investments can be difficult to quantify, and the cost is often spread across pipelines with diverse outputs and functionalities.

Stay on top of trends and dive deep into workloads to see how they are evolving over time

The dataset we are generating allows us to address this challenge head-on. We not only provide data teams with the information they need to analyze and allocate costs across teams, individuals, specific tools, or query workloads, but we also recommend actions for optimizing costs at multiple levels. We prioritize the biggest wins first, then delve into granular details, helping teams decide whether workloads represent value for money, or how they can be optimized for better performance.

Embracing the Product Mindset

Data quality and observability are crucial, but there's more to the equation. Control over data is table stakes for any successful data team. While data contracts are a valuable tool for aligning data teams with other technical teams and fostering ownership and accountability, they are ultimately technical solutions. They will only be successful if accompanied by a mindset change.

Data teams operating at a higher level understand that "quality" in a product context means something entirely different. Just like a software product, the experience delivered to data consumers is paramount.  Dashboard viewers don’t think in terms of timeliness, freshness, volume, and null values. What matters is a product that loads quickly and delivers accurate insights.

Workloads let you easily group, slice, and dice data products along important dimensions such as cost and runtime

Ironically, many data teams lack access to analytics data on the data products they build. Alvin’s granular usage insights, cost data, and performance metrics, combined with full upstream lineage, fill this gap. But we felt that, to be truly valuable to the diverse data teams we work with, this needed to go beyond a UI.

And so the Metadata Warehouse was born: all of this data is available to query within Alvin, or can even be pushed back into your data warehouse. Armed with this insight, data teams can begin to understand their business impact and make data-driven decisions to improve.

The architecture behind Alvin is very similar to Datadog's. At its core lies an index containing all ingested, parsed, and correlated metadata and logs from across the data stack. Customers can query this directly using SQL, not a cumbersome search language. We are happy consumers of this data ourselves, leveraging it internally to improve the performance and cost of our own product features.

If an insight needs tweaking, jump out of the UI to compose an ad-hoc query and share the results with the team!
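
As a rough sketch of what such an ad-hoc query could look like (the table and column names below are hypothetical, not Alvin's actual schema), a 30-day cost roll-up by workload might read:

    -- Hypothetical metadata warehouse table: one row per executed query,
    -- with parsed and correlated metadata attached
    SELECT
      workload,                              -- e.g. dbt model, dashboard, ad-hoc
      destination_table,
      COUNT(*)                AS query_runs,
      SUM(bytes_scanned)      AS total_bytes_scanned,
      SUM(estimated_cost_usd) AS total_cost_usd
    FROM metadata_warehouse.query_log
    WHERE run_date >= CURRENT_DATE - INTERVAL '30' DAY
    GROUP BY workload, destination_table
    ORDER BY total_cost_usd DESC
    LIMIT 20;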

Data Warehouse Size and Complexity Are Stifling Value

Almost all companies are ingesting and processing low-value data into their data warehouses. This not only has obvious cost implications, but also hidden costs that are often overlooked. As pipeline complexity increases, errors become more likely, troubleshooting takes longer, and fixes become more difficult to implement.

The pipeline pruner helps quickly identify wasted compute

Data observability tools are helping data teams detect and fix errors faster than ever, but reducing the frequency and significance of errors requires a different approach. When something fails, and it will, how do you know whether it matters? Does that pipeline we are investing significant resources in maintaining need to run at all?

Similarly, data catalogs, while helpful in navigating complex data environments, can be expensive to maintain and often struggle to gain wide adoption. How many of the data assets we are indexing and governing are actually valuable? It’s an interesting duality: the better curated a data warehouse is, the less necessary a data catalog becomes.

More data and tables lead to a lack of clear definitions, duplicate definitions, and drift in how metrics are defined. More data also increases compliance risks, with sensitive data potentially going unnoticed. Companies need better insight into their data warehouses to be able to measure and limit this growth in complexity.

Search by table or column, and filter impact analysis results by entity type
See results by source, destination, or each specific link

Alvin provides a single view of your data environment, with full column-level lineage and granular usage and cost data from source to consumption. This allows data teams to make informed decisions about what should run, what data can be deleted, how to simplify and deduplicate pipelines, and how to safely refactor data products.
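
As an example of how such decisions could be grounded in data, here is a sketch against the same hypothetical metadata schema as above, looking for tables that still cost money to maintain but have had no downstream reads in the last 90 days:

    -- Hypothetical example: candidate tables for deprecation
    SELECT t.table_name,
           SUM(q.estimated_cost_usd) AS maintenance_cost_90d
    FROM metadata_warehouse.tables t
    JOIN metadata_warehouse.query_log q
      ON q.destination_table = t.table_name
    WHERE q.run_date >= CURRENT_DATE - INTERVAL '90' DAY
      AND NOT EXISTS (
        SELECT 1
        FROM metadata_warehouse.table_reads r
        WHERE r.table_name = t.table_name
          AND r.read_date >= CURRENT_DATE - INTERVAL '90' DAY
      )
    GROUP BY t.table_name
    ORDER BY maintenance_cost_90d DESC;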

Most Data Warehouses Are Not AI Ready

AI readiness isn't just a buzzword; it's a necessity for companies looking to leverage AI for competitive advantage. But building high-quality AI products for internal use, let alone customer-facing applications, is incredibly difficult without a solid foundation in place. Beyond providing high-quality data, AI readiness means meeting stringent compliance requirements that often mirror those faced by fintech companies. Key elements like data flows, definitions, touchpoints, and origins must be meticulously tracked and understood.

Alvin lays the groundwork for AI readiness. Our granular usage and auditing capabilities provide essential insights, while our comprehensive lineage allows you to build AI/ML models that leverage the same definitions and metrics used across the company.  This approach ensures consistency, transparency, and compliance – crucial for building trust in your AI initiatives.

What’s Next? The Road Ahead

We're thrilled to unveil Alvin 2.0, the culmination of years of collaboration between our passionate engineering team and dedicated customers. Because it is built on a robust technical foundation, we are able to iterate and deliver new features fast, enabling our customers to leverage the high-value dataset that underpins everything. But for us, this is just the beginning. We have a clear roadmap for the future of Alvin, with even greater possibilities on the horizon.

We believe data teams need and deserve robust insights about the data products they are building. Alvin 2.0 is designed to transform data teams into strategic powerhouses, enabling them to build data products that drive measurable business impact.

Ready to unlock the true potential of your data? Let's talk!