Writing data processing jobs is a fascinating task. But it can be worthless if the users can't find and use the generated data. Fortunately, we can count on data catalogs and leverage the power of metadata to overcome this discoverability issue.
Before moving to the cloud services, I'd like to focus on the definition of a data catalog. Since it has been a hot topic in the data space for several years, instead of giving a one-sentence explanation, I prefer to list the keywords that describe a data catalog:
- Centralized. A data catalog is a single place where users can find information about the available datasets.
- Searchable. Users can access the information directly or find it with the data catalog's search feature.
- Self-service. Thanks to this data availability, a data catalog is one of the enablers of a self-service data system. It means that instead of relying on tribal knowledge, all data team members can find the information they need, or at worst, easily identify the people who can provide it.
- Collaboration. A data catalog is also a tool enhancing collaboration between data team members. This applies not only to read access but also to writing. Even though each dataset can have a designated steward, everybody can propose changes.
- Data quality. In addition to the description of the data sources, a modern data catalog can include technical information like the data volume, the distribution of values in a column, or the data freshness.
- Data lineage. Quality is not the only thing that matters to data users. Often, they also want to know where a dataset comes from. The data catalog helps here too by providing data lineage information.
- Metadata. Metadata is the key to bringing all this information together. The operational metadata provides insight into dataset generation and access, such as the job producing the data or the users who accessed it recently. The business metadata provides a domain-related view, such as the meaning of the dataset, its possible use cases, or the applicable privacy regulations. Finally, the technical metadata describes the dataset itself, defining things like the column types, the allowed values, or their nullability.
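To make the three metadata families more concrete, here is a minimal sketch of how a catalog entry could group them. The class and field names are my own illustration, not taken from any real catalog's API:

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical model, for illustration only: the names below
# are not part of any real data catalog API.

@dataclass
class OperationalMetadata:
    # How the dataset is produced and accessed.
    producing_job: str
    last_update: str
    recent_readers: list[str] = field(default_factory=list)

@dataclass
class BusinessMetadata:
    # Domain-related view of the dataset.
    description: str
    use_cases: list[str] = field(default_factory=list)
    privacy_regulations: list[str] = field(default_factory=list)

@dataclass
class ColumnDefinition:
    # Technical description of a single column.
    name: str
    data_type: str
    nullable: bool = True
    allowed_values: Optional[list[str]] = None

@dataclass
class CatalogEntry:
    dataset_name: str
    operational: OperationalMetadata
    business: BusinessMetadata
    technical: list[ColumnDefinition]

entry = CatalogEntry(
    dataset_name="orders",
    operational=OperationalMetadata(
        producing_job="daily_orders_batch", last_update="2021-05-01"),
    business=BusinessMetadata(
        description="Orders placed on the e-commerce store",
        privacy_regulations=["GDPR"]),
    technical=[
        ColumnDefinition("order_id", "string", nullable=False),
        ColumnDefinition("status", "string",
                         allowed_values=["NEW", "SHIPPED"]),
    ],
)
```

A catalog user looking at such an entry can answer the discoverability questions from the list above: who produces the data, what it means, and what shape it has.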
To present the cloud offerings for data cataloging, I'll repeat the technique from the data wrangling on the cloud article and use a comparison table.
| | AWS Glue Data Catalog | Azure Purview | GCP Data Catalog |
|---|---|---|---|
| Metadata types | Technical and operational. The technical metadata stores attributes like the schema, the number of records, or the average size of a record. The operational metadata has the information about the crawler updating the table and the last update action. | Besides the operational and technical metadata, the catalog stores the business metadata, such as classifications, glossary term references, or owner information. | Technical and business. The technical metadata stores dataset parameters like the schema, the type, or the location. The business metadata applies to the description and tags. |
| Metadata ingestion | A serverless Glue component called Crawler can automatically index new metadata in the catalog. It's also possible to create tables manually. | The service analyzes the data sources in the scan step and ingests the generated metadata into the catalog automatically. Adding custom data sources is not supported yet. | The catalog natively integrates Pub/Sub and BigQuery metadata. Additionally, it can discover GCS files and integrate on-premise data sources. |
| Search | Supports search and browsing assets by data type. | | |
| Data lineage | | Built-in for several Azure services. Can also integrate with Apache Spark jobs through the Spark-Atlas connector. | |
| Data quality | | Partially supported with the Insights feature (https://docs.microsoft.com/en-us/azure/purview/asset-insights). | |
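To give a taste of the technical metadata exposed by one of these catalogs, the sketch below reads a table's column definitions from the AWS Glue Data Catalog through boto3's `get_table` call. The database and table names are placeholders, and the client is passed in as a parameter so the function can be exercised without AWS credentials:

```python
def get_table_columns(glue_client, database_name: str, table_name: str):
    """Return (column name, column type) pairs for a table
    registered in the AWS Glue Data Catalog."""
    response = glue_client.get_table(DatabaseName=database_name,
                                     Name=table_name)
    # The schema lives under Table.StorageDescriptor.Columns
    # in the get_table response.
    columns = response["Table"]["StorageDescriptor"]["Columns"]
    return [(column["Name"], column["Type"]) for column in columns]

# With real AWS credentials, you would build the client like this
# ("sales_db" and "orders" are hypothetical names):
#   import boto3
#   glue_client = boto3.client("glue")
#   get_table_columns(glue_client, "sales_db", "orders")
```

Injecting the client also makes the lookup easy to unit test with a stub returning a canned `get_table` response.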
Although AWS, Azure, and GCP all have a data catalog in their offerings, you can see that the conceptualization is different. AWS and GCP offer more of a technical data catalog, whereas Azure's implementation is closer to a collaborative environment enabling easier data discovery.