Data curation concept

There are many data engineering terms starting with "data", and they can be confusing. In this post I will focus on the data curation concept and, among other things, show how it differs from other "data-" terms.


The post starts with a definition of data curation. In that part, I will explain the differences with 2 other "data-" terms, data cleansing and data enrichment. In the next section, I will illustrate these points with some examples.

Definition and differences

Data curation is the process of acquiring and taking care of data. Sounds familiar? Yes, but it's not just another term for data cleansing. Data cleansing deals with data attributes, whereas data curation is more related to data organization. Data curation is often compared to the work of a museum curator who, exactly like the data curation process, is responsible for the acquisition of new objects (datasets in data curation), transport organization (dataset combination) and documentation (metadata annotation). Data cleansing, on the other hand, focuses on the dataset attributes, and its responsibilities consist of identifying and fixing data issues.
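The contrast can be sketched in a few lines of Python. This is only an illustration of the two responsibilities, with invented record and metadata shapes, not a real implementation:

```python
def cleanse(record: dict) -> dict:
    """Data cleansing: fixes the record's attributes (trim whitespace, normalize case)."""
    return {k: v.strip().lower() if isinstance(v, str) else v
            for k, v in record.items()}

def curate(dataset: list, metadata: dict) -> dict:
    """Data curation: leaves records untouched; organizes the dataset and annotates it."""
    return {"records": dataset, "metadata": metadata}

raw = [{"city": "  Paris ", "country": "  FR "}]
# Cleansing changes the attribute values themselves
cleansed = [cleanse(r) for r in raw]
# Curation keeps the records as-is and adds organizational metadata around them
curated = curate(raw, {"source": "cities.csv", "tags": ["geo"]})
```

The key point visible here is where each process writes: cleansing inside the records, curation around them.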

The second term that may be confused with data curation is data enrichment. Just to recall, data enrichment adds extra value to already existing data. And this is the principal difference with data curation, which doesn't modify the records but rather adds some extra metadata to make them findable. The difference is subtle, though. You will see this in the next section with the example of the New York Times data curation pipeline, where published articles are tagged by algorithms and journalists, so they are somehow enriched. The subtle difference is that in this context the tags should be considered more like metadata than part of the article itself, which doesn't change.
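The same kind of sketch works for this second contrast. The field names below are invented for illustration; the point is only where the new information lands:

```python
def enrich(article: dict) -> dict:
    """Data enrichment: adds value inside the record itself (here, a word count)."""
    return {**article, "word_count": len(article["text"].split())}

def curate(article: dict, tags: list) -> dict:
    """Data curation: the article doesn't change; tags live next to it as metadata."""
    return {"article": article, "metadata": {"tags": tags}}

article = {"text": "Markets rallied on Monday"}
enriched = enrich(article)              # the record gained a new attribute
curated = curate(article, ["finance"])  # the record is untouched, but findable via tags
```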

To sum up, we could say that data curation is more like a data classification pipeline. You will understand it pretty quickly with unstructured data examples like a picture captcha where you must select the boxes with road signs, bridges or cars. Unconsciously, you did the work of a data curator because you made a decision about the images that will probably be reused for other goals, like building a training dataset for supervised ML algorithms. By your action, you also added some extra information to the dataset because your choices may be considered an act of tagging.
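The captcha scenario can be sketched as implicit curation: each click attaches a label to an image without modifying the image itself, and a training dataset can later be assembled from the collected labels. Image names, labels and the majority-vote rule below are all invented for illustration:

```python
def record_selection(image_id: str, label: str, selected: bool,
                     annotations: dict) -> None:
    """Each captcha click adds a vote (metadata) for a label on an image."""
    annotations.setdefault(image_id, {}).setdefault(label, []).append(selected)

annotations: dict = {}
record_selection("img_001.png", "road_sign", True, annotations)
record_selection("img_001.png", "road_sign", True, annotations)
record_selection("img_002.png", "road_sign", False, annotations)

# A training dataset assembled from labels where most users agreed
training_set = [(image, label)
                for image, labels in annotations.items()
                for label, votes in labels.items()
                if sum(votes) > len(votes) / 2]
```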

Data curation example

To illustrate the data curation concept, I will use the examples I found in the "Big Data Curation" chapter of "New Horizons for a Data-Driven Economy" (link in the "Read more" section).

The first use case concerns the "New York Times" newspaper. The pipeline starts with an automatic step followed by 3 data curation levels:

  1. automatic tag generation - after a new article is written by a journalist, it's automatically annotated with tags generated from its content.
  2. the 1st data curation level - after the automatic tag generation step, all of the generated elements are reviewed by the editorial staff. At this moment the journalists may invalidate the created tags or add new ones.
  3. the 2nd data curation level - the already curated tags are reviewed by taxonomy managers who are responsible for publishing the article and providing continuous feedback to the editorial staff.
  4. the 3rd data curation level - at the end of the pipeline, the index department adds additional tags and prepares a summary of the published article before writing it to the index storage.
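The 4 steps above can be sketched as a chain of functions. The book doesn't give implementation details, so the taxonomy, tag-matching logic and summary rule here are all placeholders of my own:

```python
def generate_tags(article_text: str) -> set:
    """Step 1: automatic tag generation from the content (naive keyword match)."""
    vocabulary = {"politics", "economy", "sport"}  # hypothetical taxonomy
    return {word for word in article_text.lower().split() if word in vocabulary}

def editorial_review(tags: set, to_remove: set, to_add: set) -> set:
    """Step 2: journalists invalidate some generated tags and add new ones."""
    return (tags - to_remove) | to_add

def taxonomy_review(tags: set, approved_taxonomy: set) -> set:
    """Step 3: taxonomy managers keep only taxonomy-compliant tags before publishing."""
    return tags & approved_taxonomy

def index_article(article_text: str, tags: set, extra_tags: set) -> dict:
    """Step 4: the index department adds tags and a summary before indexing."""
    return {"summary": article_text[:50], "tags": sorted(tags | extra_tags)}

text = "The economy grows while politics stalls"
tags = generate_tags(text)
tags = editorial_review(tags, to_remove=set(), to_add={"elections"})
tags = taxonomy_review(tags, approved_taxonomy={"economy", "politics", "elections"})
indexed = index_article(text, tags, extra_tags={"front-page"})
```

Notice that the article text itself never changes across the 4 steps; only the tags (the metadata) evolve.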

The second example I would like to share with you concerns eBay, which (at least in the example from the book) uses human power to manage the products taxonomy and find product identifiers in product descriptions. That human power is called in the article crowdsourced workers. Since I hadn't met this term before, it deserves a few words of explanation. Crowdsourcing is the process of outsourcing human-intelligence tasks to a large group of unspecified people via the Internet. It can be explicit or implicit. I already gave you an example of implicit crowdsourcing when I talked about the captcha in the previous section.
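A common concern with crowdsourced tasks is that individual workers disagree, so answers are usually aggregated. The book doesn't describe eBay's actual mechanism; the majority vote below is just one standard aggregation strategy, with invented categories:

```python
from collections import Counter

def majority_label(answers: list) -> str:
    """Aggregate crowdsourced answers for one product by majority vote."""
    return Counter(answers).most_common(1)[0][0]

# Three hypothetical workers classify the same product in the taxonomy
answers = ["Electronics > Phones", "Electronics > Phones", "Electronics > Tablets"]
category = majority_label(answers)
```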

More globally - I didn't find the implementation details at eBay - every stage of a data pipeline can be crowdsourced, but the skills required for each stage are different.

Understanding data curation is not easy because it can be easily confused with the data cleansing and data enrichment steps that can be part of the data curation process. However, the goal of data curation is different. Exactly like a museum curator, a data curation pipeline's goal is to classify the data rather than to clean the attributes.