Everyone in Brazil knows Grupo Boticário. As one of the top 15 beauty companies in the world, the brand is huge, with a long history and more than 4,000 physical stores across the globe. 

They own the brands O Boticário, Eudora, Quem Disse, Berenice?, BeautyBox, Multi B, Vult, Beauty on the Web, O.u.i and Dr. JONES, which work together to transform the world through beauty. There are 14,000 direct employees, in addition to another 40,000 people who work in the franchise network, which is now considered the largest beauty franchise in the world, with more than 4,000 points of sale in 1,780 Brazilian cities.

What may be less obvious to the casual observer is that there is a massive technology infrastructure behind this company, existing as it does in the (still very traditional) cosmetics and beauty industry. More specifically, Boticário invests heavily in the data side of the business, to make decisions about the beauty behemoth’s direction and offer better products.

And with numbers like these, it all adds up to (doing the math real quick) a lot of data.

This is the story of Boticário’s great data challenge and how it was solved.

An observability problem

At the moment, there are many different brands in the group. These brands are like separate companies within the company, each with its own way of working but focused on the same goal: building the best and biggest beauty ecosystem for the world.

Despite this separation, however, all these people and brands consume a lot of the same data: approximately 12,000 different tables, separated into different categories in a data lakehouse.

Working with this much data and this many people is almost incomprehensible to those of us used to scrappier, more startuppy environments:

  • How do you know which tables are being used and which are not?
  • Who owns and is responsible for certain data?
  • What is up to date, and what is not?
  • What data is being used the most and needs extra attention from analysts?

A level of data observability is essential for companies that want to have quality data. Decisions based on poor-quality data can be catastrophic, and without observability engineers waste precious hours (manually, and daily) running queries to check that the data is up to date and reliable.

Because if you can’t trust your data, what's the point of collecting it?
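To make that daily manual chore concrete, here is a minimal Python sketch of the kind of freshness check an engineer might otherwise run by hand. It is an illustration only, not how Grupo Boticário or Alvin do it: the table names, timestamps, the 24-hour threshold and the stale_tables helper are all hypothetical.

```python
from datetime import datetime, timedelta


def stale_tables(last_updated, now, max_age=timedelta(hours=24)):
    """Return the names of tables whose last update is older than max_age."""
    return sorted(
        name for name, updated_at in last_updated.items()
        if now - updated_at > max_age
    )


# Hypothetical metadata: table name -> time of its last successful load.
now = datetime(2023, 1, 2, 12, 0)
last_updated = {
    "sales.orders": datetime(2023, 1, 2, 9, 0),      # 3 hours old
    "crm.customers": datetime(2022, 12, 30, 9, 0),   # more than 3 days old
}

print(stale_tables(last_updated, now))  # ['crm.customers']
```

Now picture keeping that dictionary accurate by hand for 12,000 tables.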

Data lineage to the rescue

Thiago, who joined the Data Governance team at Grupo Boticário in 2021, arrived at the company to find their data catalog already well implemented. But something was still missing: given the number of people and tables, it was very difficult to predict the impact of changes in the environment, and engineers spent a lot of time putting out fires.

Thiago believed that data lineage was the missing link.

After Thiago shared his concerns with the data governance and architecture team, they decided to survey internal customers to find out whether this was a pain others were feeling. Lo and behold, 43% of people pointed out that data lineage was something they saw as a necessity.

That's when Thiago remembered Lucas, one of Alvin's engineers.

The two had worked together before, using Databricks and AWS as cloud providers and Airflow for orchestration. The closest thing they’d had to data lineage back then was getting Airflow logs at table level. Imagine the pain.

He wasn’t just thinking of Lucas because he felt like reminiscing over their shared data-related trauma. Thiago remembered that Lucas was now with a startup developing a data lineage product. 

Keen to avoid reinventing the wheel, Thiago and the data team talked to Lucas and it wasn’t long before they started a proof of concept for Alvin within Grupo Boticário.

How Boticário uses Alvin

It is absolutely essential for the company that its data lineage is automated and always up to date. Imagine manually mapping the lineage of over 12,000 tables (again, some quick math will show that’s a yikes x 12,000).

In Thiago's words:

"For such a large and complex company, the main thing is to have visibility into the huge amount of tables and people using them. Lineage plays an extremely important role and Alvin gives us lineage from end to end, with its column level helping out a lot."

When they say "large and complex", they mean it. One particularly challenging example involved data assembled from eleven different tables, which is then in turn used in another eighteen.

I wonder what it was like to work with this data before they had this visibility. (Actually, I don’t want to know.)

Another feature saving innocent engineers' lives within Grupo Boticário is impact analysis. This is a report that shows everything in a company's environment that would be affected if a specific table or column were dropped: which other tables, columns, dashboards and users.
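For the curious, the idea behind an impact analysis can be sketched as a walk over a lineage graph. The Python below is a toy illustration under our own assumptions, not Alvin's actual implementation: the LINEAGE dictionary, the asset names and the impacted_by function are all made up for the example.

```python
from collections import deque

# Hypothetical lineage graph: each asset maps to the assets that read from it.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "raw.customers": ["staging.customers"],
    "staging.orders": ["marts.sales"],
    "staging.customers": ["marts.sales", "marts.crm"],
    "marts.sales": ["dashboard.revenue"],
    "marts.crm": [],
    "dashboard.revenue": [],
}


def impacted_by(asset, lineage):
    """Return every downstream asset reachable from `asset` (breadth-first)."""
    seen = set()
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# What breaks if staging.customers is dropped?
print(impacted_by("staging.customers", LINEAGE))
# ['dashboard.revenue', 'marts.crm', 'marts.sales']
```

A real report would also have to resolve columns and the users behind each dashboard, which this sketch leaves out.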

Having our product running within a company the size of Grupo Boticário has been incredible for Alvin: we’ve been getting fantastic feedback about the product, helping us to create new features and integrations, in addition to fixing bugs and making usability improvements.

So we’re logging this one as a massive win-win.