Data catalog services

Writing data processing jobs is a fascinating task. But the work is worthless if the users can't find and use the generated data. Fortunately, we can count on data catalogs and leverage the power of metadata to overcome this discoverability issue.

Data catalog

Before going to the cloud services part, I'd like to focus on the data catalog definition. Since it has been a hot topic in the data space for several years, instead of giving a one-sentence explanation, I prefer to list the keywords describing a data catalog:

Centralized. A data catalog is a single place where users can find information about the available datasets.
Searchable. Users can access the information directly or through the data catalog's search feature.
Self-service. Thanks to this availability, a data catalog is one of the enablers of a self-service data system. Even if some tribal knowledge remains, all data team members can find the information they need or, at worst, easily identify the people who can provide it.
Collaboration. A data catalog is also a tool that enhances collaboration between data team members. This applies not only to read access but also to writing. Even though each dataset can have a designated steward, everybody can propose changes.
Data quality. In addition to the description of the data sources, a modern data catalog can include some technical information, like data volume, the distribution of values in a column, or data freshness.
Data lineage. Quality is not the only thing that matters to data users. Often, they also want to know where a dataset comes from. A data catalog helps here too by providing data lineage information.
Metadata. Metadata is the key to bringing all this information together. The operational metadata provides insight into dataset generation and access, such as the job producing the data or the users who accessed it recently. The business metadata provides a domain-related view, such as the meaning of the dataset, its possible use cases, or the privacy regulations that apply to it. Finally, the technical metadata describes the dataset itself, so it defines things like the column types, the allowed values, or their nullability.
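
To make these keywords more concrete, here is a minimal, vendor-neutral sketch of a catalog that stores the three metadata families, supports keyword search, and resolves lineage. All class, field, and dataset names (`DataCatalog`, `CatalogEntry`, `raw_orders`, `daily_revenue`, and so on) are hypothetical and only illustrate the concepts; real data catalogs are far richer.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TechnicalMetadata:
    # Describes the dataset itself: column name -> (type, nullable).
    columns: dict[str, tuple[str, bool]]


@dataclass
class OperationalMetadata:
    # Describes how the dataset is generated and accessed.
    producing_job: str
    recent_users: list[str] = field(default_factory=list)


@dataclass
class BusinessMetadata:
    # Describes the domain view of the dataset.
    description: str
    steward: str
    tags: list[str] = field(default_factory=list)


@dataclass
class CatalogEntry:
    name: str
    technical: TechnicalMetadata
    operational: OperationalMetadata
    business: BusinessMetadata
    # Lineage: names of the datasets this one is built from.
    upstream: list[str] = field(default_factory=list)


class DataCatalog:
    """Single, centralized registry of dataset entries (illustrative only)."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        """Return names of datasets whose name, description, tags, or columns mention the keyword."""
        needle = keyword.lower()
        matches = []
        for entry in self._entries.values():
            haystack = [
                entry.name,
                entry.business.description,
                *entry.business.tags,
                *entry.technical.columns,  # iterating a dict yields column names
            ]
            if any(needle in text.lower() for text in haystack):
                matches.append(entry.name)
        return sorted(matches)

    def lineage(self, name: str) -> list[str]:
        """Return all direct and indirect upstream datasets of `name`."""
        seen: set[str] = set()
        stack = list(self._entries[name].upstream)
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in self._entries:
                stack.extend(self._entries[current].upstream)
        return sorted(seen)


# Made-up sample datasets, for illustration only.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="raw_orders",
    technical=TechnicalMetadata({"order_id": ("string", False), "amount": ("double", True)}),
    operational=OperationalMetadata(producing_job="shop_export"),
    business=BusinessMetadata("Raw orders exported from the shop database", "alice", ["sales"]),
))
catalog.register(CatalogEntry(
    name="daily_revenue",
    technical=TechnicalMetadata({"day": ("date", False), "revenue": ("double", False)}),
    operational=OperationalMetadata(producing_job="revenue_aggregator"),
    business=BusinessMetadata("Revenue aggregated per day", "bob", ["sales", "finance"]),
    upstream=["raw_orders"],
))

print(catalog.search("finance"))
print(catalog.lineage("daily_revenue"))
```

The search here is a plain substring match over the business and technical metadata. It only shows why having all metadata in one place makes the datasets discoverable.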

Cloud services

To present the cloud data catalog offerings, I'll reuse the technique from the data wrangling on the cloud article and use a comparison table.

	AWS Glue Data Catalog	Azure Purview	GCP Data Catalog
Metadata	Technical and operational. The technical metadata stores attributes like the schema, the number of records, and the average size of a record. The operational metadata has information about the crawler updating the table and the last update action.	Besides the operational and technical metadata, the catalog stores the business metadata, such as classifications, glossary term references, or owner information.	Technical and business. The technical metadata stores dataset parameters, like the schema, type, or location. The business metadata covers the description and tags.
Ingestion	A serverless Glue component called Crawler can automatically index new metadata in the catalog. It's also possible to create tables manually.	The service analyzes the data sources in the scan step and ingests the generated metadata into the catalog automatically. Adding custom data sources is not supported yet.	The catalog natively integrates Pub/Sub and BigQuery metadata. Additionally, it can discover GCS files and integrate on-premises data sources.
Self-service	Supports search.	Supports search and browsing assets by data type.	Supports search.
Data lineage	Not supported.	Built-in for several Azure services. Can also integrate with Apache Spark jobs through the Spark-Atlas connector.	Not supported.
Data quality	Not supported.	Partially supported with the Insights feature. The feature doesn't apply to the https://docs.microsoft.com/en-us/azure/purview/asset-insights	Not supported.

Although AWS, Azure, and GCP all have a data catalog in their offerings, you can see that they conceptualize it differently. AWS and GCP offer more of a technical data catalog, whereas Azure's implementation is closer to a collaborative environment that makes data discovery easier.
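
To illustrate the "technical and operational" metadata from the AWS column, here is a hedged sketch that summarizes a dictionary shaped like the response of the boto3 Glue `get_table` call. The function name `summarize_glue_table` and the sample values are my own. I assume the table was populated by a crawler that wrote the `recordCount`, `averageRecordSize`, and `UPDATED_BY_CRAWLER` parameters, which may not be present on every table.

```python
from __future__ import annotations

from typing import Any


def summarize_glue_table(response: dict[str, Any]) -> dict[str, Any]:
    """Extract technical and operational metadata from a Glue get_table-like response."""
    table = response["Table"]
    params = table.get("Parameters", {})
    columns = table.get("StorageDescriptor", {}).get("Columns", [])

    def as_int(key: str) -> int | None:
        # Glue stores table parameters as strings; they may also be missing.
        return int(params[key]) if key in params else None

    return {
        # Technical metadata
        "name": f'{table["DatabaseName"]}.{table["Name"]}',
        "schema": {column["Name"]: column["Type"] for column in columns},
        "record_count": as_int("recordCount"),
        "average_record_size": as_int("averageRecordSize"),
        # Operational metadata
        "updated_by_crawler": params.get("UPDATED_BY_CRAWLER"),
        "last_update": table.get("UpdateTime"),
    }


# Hand-written sample mimicking the response shape; all values are made up.
# With real credentials, a similar dictionary could come from:
#   boto3.client("glue").get_table(DatabaseName="sales", Name="orders")
SAMPLE_RESPONSE = {
    "Table": {
        "Name": "orders",
        "DatabaseName": "sales",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "amount", "Type": "double"},
            ],
            "Location": "s3://example-bucket/orders/",
        },
        "Parameters": {
            "recordCount": "1200",
            "averageRecordSize": "85",
            "UPDATED_BY_CRAWLER": "orders_crawler",
        },
    }
}

summary = summarize_glue_table(SAMPLE_RESPONSE)
print(summary)
```

The sketch deliberately works on a plain dictionary so that it can run without an AWS account. Nothing in it covers business metadata, lineage, or data quality, which is consistent with the comparison above.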
