Data catalog services

Writing data processing jobs is a fascinating task. But the work can be worthless if users can't find and use the generated data. Fortunately, we can count on data catalogs and leverage the power of metadata to overcome this discoverability issue.


Data catalog

Before going to the cloud services part, I'd like to focus on the data catalog definition. Since it has been a hot topic in the data space for several years, instead of giving a one-sentence explanation, I prefer to list the keywords describing a data catalog: metadata, ingestion, self-service discovery, data lineage, and data quality. These are also the dimensions used in the comparison below.

Cloud services

To present the cloud offerings for data cataloging, I'll repeat the technique from the data wrangling on the cloud article and use a comparison table.

| Feature | AWS Glue Data Catalog | Azure Purview | GCP Data Catalog |
|---|---|---|---|
| Metadata | Technical and operational. Technical metadata stores attributes like the schema, the number of records, and the average record size. Operational metadata holds information about the crawler updating the table and the last update action. | Besides operational and technical metadata, the catalog stores business metadata, such as classifications, glossary term references, or owner information. | Technical and business. Technical metadata stores dataset parameters, like the schema, type, or location. Business metadata covers the description and tags. |
| Ingestion | A serverless Glue component called Crawler can automatically index new metadata in the catalog. It's also possible to create tables manually. | The service analyzes the data sources in the scan step and ingests the generated metadata into the catalog automatically. Adding custom data sources is not supported yet. | The catalog natively integrates Pub/Sub and BigQuery metadata. Additionally, it can discover GCS files and integrate on-premises data sources. |
| Self-service | Supports search. | Supports search and browsing assets by data type. | Supports search. |
| Data lineage | Not supported. | Built-in for several Azure services. Can also integrate Apache Spark jobs through the Spark-Atlas connector. | Not supported. |
| Data quality | Not supported. | Partially supported with the Insights feature. The feature doesn't apply to the … | Not supported. |
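To make the AWS ingestion row more concrete, here is a minimal boto3 sketch of registering and starting a Glue crawler over an S3 prefix. The crawler name, IAM role, database, and path are hypothetical placeholders, and the boto3 import is deferred so the configuration helper can be exercised without AWS credentials; treat this as a sketch, not a production setup.

```python
def build_crawler_config(name, role_arn, database, s3_path):
    """Assemble the keyword arguments for glue.create_crawler().

    The crawler scans the S3 path and writes the discovered technical
    metadata (schema, record counts, ...) into the given catalog database.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Left without a Schedule, the crawler runs on demand; a cron
        # expression could be added here for periodic re-indexing.
    }


def register_and_run_crawler(config):
    # boto3 is imported lazily so building the config above stays
    # testable on a machine without AWS access.
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])
```

Calling `register_and_run_crawler(build_crawler_config("events-crawler", "arn:aws:iam::123456789012:role/GlueCrawlerRole", "analytics", "s3://my-bucket/events/"))` would then populate the `analytics` database with the tables discovered under the prefix.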

Although AWS, Azure, and GCP all have a data catalog in their offerings, you can see that the conceptualization differs. AWS and GCP provide more of a technical data catalog, whereas Azure's implementation is closer to a collaborative environment enabling easier data discovery.