Enterprise Data Catalogs

Most teams know they need a data catalog, especially with the rise of AI, where managing context, monitoring, and CI/CD are all part of the conversation. Far fewer know how to get one running across an organisation. Pre-implementation is one stage and implementation is another, but in my experience the biggest challenge is adoption. With AI, people are building so fast that they would rather get results quickly than spend time maintaining a catalog. System owners know their own data so well that they see no apparent issues, and would rather focus on their day-to-day work than document it in a catalog.

So far I have led two end-to-end catalog implementations: the delivery of Alation (SaaS) and the deployment of OpenMetadata (open-source, in-house). The tools differ in features and purpose, but leading both taught me many of the same lessons.

Why a data catalog matters in the first place

Put simply, a data catalog is the central inventory of data in an organisation. In my presentations, I often use the analogy of a library: without a catalog, books are everywhere, hard to find, and hard to trust. With one, discovery becomes fast and reliable. A catalog tells you where data lives, what it means, who owns it, and how it connects.

Traditionally, this was already important. Imagine an organisation with many departments, each generating and consuming data with different context and meaning. Cross-department data sharing is limited when there is no standardised place to surface information about data, and without governance workflows, people are not inclined to share or maintain documentation on top of their day-to-day work. For an organisation to scale, an enterprise data catalog is essential for enabling cross-department collaboration and allowing use cases to develop freely when information is readily available.

But with AI in the picture, this becomes even more important. When teams start building AI agents, one key measure is how reliable the agent is. Agents act on whatever they are given, and if data is undescribed or ambiguous, they fill in the gaps themselves, often incorrectly. Having a catalog means closing those gaps before data ever reaches the agent, so it can work on solid ground.
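As a minimal sketch of what closing those gaps can look like, the check below flags columns that have no description before a table is exposed to an agent. The `Column` class, the `undescribed` helper, and the sample `orders` table are hypothetical names for illustration, not part of any catalog's API.

```python
from dataclasses import dataclass


@dataclass
class Column:
    # Hypothetical, simplified view of a catalog column entry.
    name: str
    description: str = ""


def undescribed(columns: list[Column]) -> list[str]:
    """Return names of columns whose description is missing or blank."""
    return [c.name for c in columns if not c.description.strip()]


orders = [
    Column("order_id", "Unique identifier of the order"),
    Column("amt"),            # no description: an agent would have to guess unit and currency
    Column("status", "   "),  # whitespace only counts as undescribed
]

gaps = undescribed(orders)
if gaps:
    print("Document these columns before exposing the table to an agent:", gaps)
```

A check like this could run as a gate in whatever process publishes tables to agents, so that ambiguity is caught by people who know the data rather than guessed at by the model.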

Challenges I faced during my project implementations

In leading the implementation of the SaaS platform, the delivery itself was not the hard part. Getting approval to connect to 13 different systems within the enterprise, each with their own system owner, was the main challenge. I had to attend multiple discussion calls, show the business value to the system owner, get their approval, then hold technical discussions with their tech teams, get their sign-off, and work with their vendor to establish connectivity.

The other hurdle was cloud security. There were many questions: what permissions will the catalog's agent have, which databases is it connecting to, and what happens with sensitive data. Most data catalogs have a masking feature that helps answer the last one. One thing I always prepared for these meetings was an architecture diagram. Technical people understand faster when there is a diagram to refer to, and it signals that you have thought things through.

On the open-source side, the challenges were different. Deploying OpenMetadata gave more flexibility over technical decisions, but the deployment burden falls entirely on the team running it. We ran into version compatibility issues with AWS managed OpenSearch: certain stable versions of OpenMetadata required a lower OpenSearch version rather than the latest, so future upgrades need a pipeline that reconciles and upgrades all versions together. Ingestion is also maintained manually using Airflow scripts, one per resource type per account, and managing those scripts along with their environment variables adds maintenance overhead that would not exist with a SaaS deployment.
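To give a feel for how the "one script per resource type per account" overhead grows, here is a rough sketch that enumerates the jobs and the environment variables each one would need. The account names, resource types, and variable naming scheme are all assumptions made up for illustration; they are not what we actually ran and not an OpenMetadata or Airflow convention.

```python
from itertools import product

# Hypothetical values, for illustration only.
ACCOUNTS = ["account-a", "account-b"]
RESOURCE_TYPES = ["glue", "redshift", "s3"]


def ingestion_jobs(accounts: list[str], resource_types: list[str]) -> list[dict]:
    """Build one job definition per (account, resource type) pair."""
    jobs = []
    for account, resource in product(accounts, resource_types):
        # Assumed naming scheme: ACCOUNT_A_GLUE_ROLE_ARN, ACCOUNT_A_GLUE_REGION, ...
        prefix = f"{account}_{resource}".upper().replace("-", "_")
        jobs.append(
            {
                "job_id": f"ingest_{resource}_{account}",
                "env_vars": [f"{prefix}_ROLE_ARN", f"{prefix}_REGION"],
            }
        )
    return jobs


jobs = ingestion_jobs(ACCOUNTS, RESOURCE_TYPES)
print(f"{len(jobs)} jobs, {sum(len(j['env_vars']) for j in jobs)} environment variables")
```

Even this small example yields six jobs and twelve variables, and every new account or resource type multiplies the count. Generating the definitions from one list, rather than hand-writing each script, is one way to keep that manageable.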

Features that make it worth the effort

Data lineage is the feature I find easiest to demonstrate value with. You can see exactly how data flows between medallion layers, from bronze to silver to gold, and attach the actual transformation queries at each step. An analyst building a dashboard can see how a column was derived, reuse logic that already works, and produce consistent results. When a schema change is planned upstream, lineage also shows you what is affected downstream before the change is deployed.
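The downstream impact check can be sketched as a simple graph traversal. The lineage graph below is a made-up example across bronze, silver, and gold layers, and `downstream` is a hypothetical helper; a real catalog exposes lineage through its own UI and API, so treat this only as an illustration of the idea.

```python
from collections import deque

# Hypothetical lineage: each key feeds the tables or dashboards in its list.
LINEAGE = {
    "bronze.orders_raw": ["silver.orders"],
    "silver.orders": ["gold.daily_revenue", "gold.customer_ltv"],
    "silver.customers": ["gold.customer_ltv"],
    "gold.daily_revenue": ["dashboard.revenue"],
}


def downstream(graph: dict[str, list[str]], start: str) -> list[str]:
    """Return every asset reachable from `start`, using breadth-first search."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Everything that could be affected by a schema change to the raw orders table.
print(downstream(LINEAGE, "bronze.orders_raw"))
```

Running the same query for `silver.customers` returns only `gold.customer_ltv`, which shows how lineage narrows the list of people who need to be told about a change.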

Data contracts close the loop on governance. A contract enforces how data should arrive: schema, types, what is allowed to change. If a producer quietly alters a field, the contract catches it before downstream dashboards break silently. This addresses one of the most common pain points with ETL pipelines, where unannounced upstream changes break downstream systems that were never designed to handle them.
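At its simplest, a contract check compares the schema a producer promised with the schema that actually arrived. The sketch below assumes a contract is just a mapping of field name to type; the `CONTRACT` contents and the `contract_violations` helper are hypothetical, and real contract tooling covers far more (nullability, constraints, allowed changes).

```python
# Hypothetical contract: field name -> expected type.
CONTRACT = {"order_id": "string", "amount": "decimal", "created_at": "timestamp"}


def contract_violations(contract: dict[str, str], observed: dict[str, str]) -> list[str]:
    """Compare an observed schema against the contract and list every mismatch."""
    violations = []
    for field, expected in contract.items():
        if field not in observed:
            violations.append(f"missing field: {field}")
        elif observed[field] != expected:
            violations.append(f"type changed: {field} {expected} -> {observed[field]}")
    # Fields the producer added without updating the contract.
    for field in sorted(observed.keys() - contract.keys()):
        violations.append(f"unexpected field: {field}")
    return violations


# A producer quietly changed a type, dropped a field, and added a new one.
observed = {"order_id": "string", "amount": "float", "note": "string"}
for problem in contract_violations(CONTRACT, observed):
    print(problem)
```

Running a check like this at the point of ingestion, and failing the load when the list is non-empty, is what turns a silent dashboard break into a visible pipeline failure.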

Key learnings: which tool to pick

It really comes down to the direction of the team or organisation.

If the goal is to support AI agents or technical pipelines, a technical catalog like AWS Glue or Snowflake Horizon is often sufficient. The feature set is simpler, implementation is faster, and agents or scripts can reference the catalog easily. These tools are already integrated with the existing infrastructure, so there is less to set up.
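As an illustration of a script referencing a technical catalog, the sketch below turns an AWS Glue `get_table` response into a short text block that could be passed to an agent as context. The `table_context` helper and the sample table are hypothetical, and the sample dictionary is hand-written so the example runs without AWS access; it mirrors only the parts of the response shape used here (`Table`, `StorageDescriptor`, `Columns`).

```python
def table_context(response: dict) -> str:
    """Render a Glue get_table response as plain-text context for an agent or script."""
    table = response["Table"]
    lines = [f"Table {table['Name']}: {table.get('Description') or 'no description'}"]
    for col in table["StorageDescriptor"]["Columns"]:
        comment = col.get("Comment") or "no description"
        lines.append(f"- {col['Name']} ({col['Type']}): {comment}")
    return "\n".join(lines)


# In a real script the response would come from boto3, for example:
#   import boto3
#   response = boto3.client("glue").get_table(DatabaseName="sales", Name="orders")
# The database and table names above are placeholders.
sample = {
    "Table": {
        "Name": "orders",
        "Description": "One row per customer order",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string", "Comment": "Unique order identifier"},
                {"Name": "amt", "Type": "double"},  # no Comment: shows up as a gap
            ]
        },
    }
}

print(table_context(sample))
```

The undescribed column surfaces explicitly as "no description", which keeps the gap visible instead of leaving the agent to infer meaning from a column name.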

If the goal is to build an enterprise data platform where data is discoverable and available across teams, then a business-facing catalog is the right call. That could mean adopting a SaaS tool like Alation, engaging a vendor, or building an in-house team to run something like OpenMetadata or DataHub. These tools are more user-friendly, support enterprise SSO, store business glossaries and context alongside the data, and expose APIs to populate descriptions programmatically. The trade-off is that they take more time to implement, since they sit outside the existing infrastructure as a separate tool.

There is a lot more to cover on this topic, but I have tried to keep this concise. If you have questions about the article or my experience with either implementation, reach out directly.