Many data engineering terms start with "data", and they can be confusing. In this post I will focus on the concept of data curation and, among other things, show how it differs from other "data-like" terms.
The post starts with a definition of data curation. In this part, I explain how it differs from two other "data-like" terms, data cleansing and data enrichment. In the next section, I illustrate these points with some examples.
Definition and differences
Data curation is the process of acquiring and taking care of data. Sounds familiar? Yes, but it's not just another term for data cleansing. Data cleansing deals with data attributes, whereas data curation is more about data organization. A data curator is often compared to a museum curator who, exactly like the data curation process, is responsible for acquiring new objects (datasets in data curation), organizing their transport (combining datasets), and documenting them (metadata annotation). Data cleansing, on the other hand, focuses on dataset attributes, and its responsibility is to identify and fix data issues.
The second term that may be confused with data curation is data enrichment. As a reminder, data enrichment adds extra value to already existing data. And this is the principal difference from data curation, which doesn't modify the records but rather adds extra metadata to make them findable. The difference is subtle, though. You will see this in the next section with the example of the New York Times data curation pipeline, where published articles are tagged by algorithms and journalists, so they are enriched in a way. The subtle difference is that in this context the tags should be considered metadata rather than part of the article, which itself doesn't change.
To sum up, we could say that data curation is more like a data classification pipeline. You will understand it pretty quickly with unstructured data examples like an image captcha where you must select the boxes containing road signs, bridges, or cars. Unconsciously, you did the work of a data curator because you made a decision about images that will probably be collected for other purposes, such as building a training dataset for supervised ML algorithms. By your action, you also added extra information to the dataset, because your choices can be considered an act of tagging.
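The distinction between the three terms can be sketched in a few lines of code. The record, field names, and helper functions below are my own hypothetical illustration: cleansing fixes the record's attributes, while curation leaves the record untouched and wraps it with metadata.

```python
record = {"text": "Subway  fares to rise in 2024", "author": None}

def cleanse(rec):
    """Data cleansing: identify and fix issues in the record's attributes."""
    fixed = dict(rec)
    fixed["text"] = " ".join(rec["text"].split())   # normalize whitespace
    fixed["author"] = rec["author"] or "unknown"    # fill a missing attribute
    return fixed

def curate(rec, tags):
    """Data curation: don't modify the record, attach metadata to make it findable."""
    return {"record": rec, "metadata": {"tags": tags}}

cleansed = cleanse(record)
curated = curate(cleansed, tags=["transport", "economy"])
```

Note that `curate` returns the record byte-for-byte identical inside a new envelope; only the metadata is new, which is exactly the subtlety discussed above.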
Data curation example
To illustrate the data curation concept, I will use the examples I found in the "Big Data Curation" chapter of "New Horizons for a Data-Driven Economy" (link in the "Read more" section).
The first use case concerns the "New York Times" newspaper, whose pipeline consists of an automatic tagging step followed by 3 data curation levels:
- automatic tag generation - after a journalist writes a new article, it's automatically annotated with tags generated from its content.
- the 1st data curation level - after the automatic tag generation step, all of the generated tags are reviewed by the editorial staff. At this point the journalists may invalidate the created tags or add new ones.
- the 2nd data curation level - the already curated tags are reviewed by taxonomy managers, who are responsible for publishing the article and providing continuous feedback to the editorial staff.
- the 3rd data curation level - at the end of the pipeline, the index department adds additional tags and prepares a summary of the published article before indexing it in the index storage.
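The levels above can be sketched as a chain of functions. This is a simplified illustration of such a multi-level pipeline, not the New York Times implementation: the keyword-based tagger, the taxonomy, and all function names are my own assumptions.

```python
def generate_tags(article_text):
    """Automatic step: generate tags from the article content."""
    vocabulary = {"election", "subway", "budget"}   # hypothetical tag vocabulary
    return {word for word in article_text.lower().split() if word in vocabulary}

def editorial_review(tags, to_remove=frozenset(), to_add=frozenset()):
    """1st level: journalists invalidate generated tags or add new ones."""
    return (tags - to_remove) | to_add

def taxonomy_review(tags, approved_taxonomy):
    """2nd level: taxonomy managers keep only tags from the managed taxonomy."""
    return tags & approved_taxonomy

def index_article(article_text, tags, extra_tags, summary):
    """3rd level: the index department adds tags and a summary before indexing."""
    return {"article": article_text,
            "tags": sorted(tags | extra_tags),
            "summary": summary}

article = "The city budget funds a new subway line"
tags = generate_tags(article)
tags = editorial_review(tags, to_add={"transport"})
tags = taxonomy_review(tags, {"budget", "subway", "transport"})
entry = index_article(article, tags, {"new-york"}, "Budget funds subway line.")
```

The key design point is that the article text flows through unchanged; every level only refines the metadata attached to it.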
The second example I would like to share with you concerns eBay, which (at least in the example from the book) uses human power to manage the product taxonomy and find product identifiers in product descriptions. That human power is called in the article crowdsourced workers. Since I hadn't met this term before, it deserves a few words of explanation. Crowdsourcing is the process of outsourcing human-intelligence tasks to a large group of unspecified people via the Internet. It can be explicit or implicit. I already gave you an example of implicit crowdsourcing when I was talking about captcha in the previous section.
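One common way to exploit answers coming from many crowdsourced workers (my illustration, not eBay's documented approach) is to aggregate several workers' labels per item with a majority vote:

```python
from collections import Counter

def majority_vote(labels):
    """Return the label chosen by the largest number of workers for one item."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical captcha-style answers from 3 workers per image
answers = {"img-1": ["bridge", "bridge", "car"],
           "img-2": ["road sign", "road sign", "road sign"]}
consensus = {item: majority_vote(votes) for item, votes in answers.items()}
```

Redundancy is what makes implicit crowdsourcing usable: a single worker may be wrong or careless, but the consensus of several independent answers is much more reliable.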
More globally - I didn't find the implementation details at eBay - every stage of a data pipeline can be crowdsourced. But the skills required for each stage are different:
- business needs stage - at this very first step the key factor is understanding the business needs, so mostly knowing how to answer the question of why we are doing what we are doing.
- data collection - data collection in the context of crowdsourcing is a little bit different from the data collection we know from classical data pipelines. In the context of crowdsourcing, this term describes data generation more than a technology used to collect the data. Examples of this stage include writing a new Wikipedia article or a restaurant review on Tripadvisor.
- data cleansing and curation - domain knowledge is the most important skill at this stage. The goal here is to validate the data and make it usable further in the pipeline without any extra normalization needed.
- modeling and visualization - here the data must be modeled and exposed to the rest of the users.
- evaluation - once again domain knowledge is the key skill here. The goal of this stage is to evaluate the whole crowdsourced data pipeline. If you go back to the previous example, you can consider the feedback loop as the evaluation stage.
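The stages above can be sketched as plain functions chained together. Only the stage names come from the list; the restaurant-review data, the validation rule, and the evaluation threshold are hypothetical (the business needs stage is a human activity, so it stays out of the code).

```python
def collect(contributors):
    """Data collection: in crowdsourcing, the crowd generates the data itself."""
    return [contribute() for contribute in contributors]

def cleanse_and_curate(records, is_valid):
    """Cleansing and curation: domain experts validate what enters the pipeline."""
    return [record for record in records if is_valid(record)]

def model(records):
    """Modeling and visualization: expose the data to the rest of the users."""
    return {"items": records, "count": len(records)}

def evaluate(exposed, expected_min):
    """Evaluation: feedback on the whole pipeline's output, closing the loop."""
    return exposed["count"] >= expected_min

# Hypothetical restaurant reviews written by 3 crowdsourced contributors
reviews = collect([lambda: "Great pasta", lambda: "", lambda: "Nice view"])
valid = cleanse_and_curate(reviews, is_valid=lambda review: len(review) > 0)
exposed = model(valid)
pipeline_ok = evaluate(exposed, expected_min=2)
```

If the evaluation fails, the feedback goes back to the earlier stages, exactly like the continuous feedback loop in the New York Times example.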
Understanding data curation is not easy because it can easily be confused with the data cleansing and data enrichment steps that can be part of the data curation process. However, the goal of data curation is different. Exactly like a museum curator, a data curation pipeline's goal is to classify the data rather than to clean its attributes.