Data catalog services

Writing data processing jobs is a fascinating task. But the work can be worthless if users can't find and use the generated data. Fortunately, we can count on data catalogs and leverage the power of metadata to overcome this discoverability issue.


Data catalog

Before going to the cloud services part, I'd like to focus on the data catalog definition. Since it has been a hot topic in the data space for several years, instead of giving a one-sentence explanation, I prefer to list the keywords describing a data catalog: metadata, ingestion, self-service discovery, data lineage, and data quality. These are also the dimensions used in the comparison below.

Cloud services

To present the cloud offerings for data cataloging, I'll repeat the technique from the data wrangling on the cloud article and use a comparison table.

| Feature | AWS Glue Data Catalog | Azure Purview | GCP Data Catalog |
|---|---|---|---|
| Metadata | Technical and operational. Technical metadata stores attributes like the schema, the number of records, and the average record size. Operational metadata holds information about the crawler updating the table and the last update action. | Besides operational and technical metadata, the catalog stores business metadata, such as classifications, glossary term references, or owner information. | Technical and business. Technical metadata stores dataset parameters, like the schema, type, or location. Business metadata covers the description and tags. |
| Ingestion | A serverless Glue component called Crawler can automatically index new metadata in the catalog. It's also possible to create tables manually. | The service analyzes the data sources in the scan step and ingests the generated metadata into the catalog automatically. Adding custom data sources is not supported yet. | The catalog natively integrates Pub/Sub and BigQuery metadata. Additionally, it can discover GCS files and integrate on-premises data sources. |
| Self-service | Supports search. | Supports search and browsing assets by data type. | Supports search. |
| Data lineage | Not supported. | Built-in for several Azure services. Can also integrate Apache Spark jobs through the Spark-Atlas connector. | Not supported. |
| Data quality | Not supported. | Partially supported with the Insights feature. The feature doesn't apply to the … | Not supported. |
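To make the AWS ingestion row more concrete, here is a minimal boto3 sketch of registering and starting a Glue crawler over an S3 prefix. The crawler name, IAM role, database, and path are hypothetical placeholders, and the boto3 import is deferred so the configuration helper can be exercised without AWS credentials; treat this as a sketch, not a production setup.

```python
def build_crawler_config(name, role_arn, database, s3_path):
    """Assemble the keyword arguments for glue.create_crawler().

    The crawler scans the S3 path and writes the discovered technical
    metadata (schema, record counts, ...) into the given catalog database.
    """
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Left without a Schedule, the crawler runs on demand; a cron
        # expression could be added here for periodic re-indexing.
    }


def register_and_run_crawler(config):
    # boto3 is imported lazily so building the config above stays
    # testable on a machine without AWS access.
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**config)
    glue.start_crawler(Name=config["Name"])
```

Calling `register_and_run_crawler(build_crawler_config("events-crawler", "arn:aws:iam::123456789012:role/GlueCrawlerRole", "analytics", "s3://my-bucket/events/"))` would then populate the `analytics` database with the tables discovered under the prefix.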

Although AWS, Azure, and GCP all have a data catalog in their offerings, you can see that the conceptualization differs. AWS and GCP provide more of a technical data catalog, whereas Azure's implementation is closer to a collaborative environment enabling easier data discovery.