Introduction to data quality

Dealing with a lot of data is time-consuming, but dealing with a lot of data while keeping its value high is even more complicated. That is one of the reasons why data quality should never be neglected. After all, it is one of the components that provide accurate business insights and facilitate strategic decisions.

This post focuses on the data quality aspect of data-driven systems. The first part defines data quality and explains its main axes. The second part shows which tools can help provide data quality and at what moment. The last section contains some important questions that could help ensure good data quality in data-driven systems.

Definition

Data quality defines how well the data meets the user's requirements. Data of good quality can be used easily and respects the business requirements. One of its main consequences is more accurate business decisions.

Data quality can be characterized by the following metrics:

Guaranteeing data quality

As you can see, the list is quite long, even though it is not described in detail. All of the above points can be guaranteed with different methods, applied at different stages of the processing pipeline, for instance:
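
As a minimal sketch of one such method, the snippet below validates records at ingestion time, before they reach the rest of the pipeline. It is only an illustration under stated assumptions: the `QualityReport` class, the `validate_records` function, the `order_id` and `amount` fields and the "amount must not be negative" rule are all hypothetical and do not come from any specific tool. The idea is simply that the business unit supplies the rules and the engineering team applies them and keeps track of what was rejected.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    """Summary of one validation run (hypothetical structure)."""
    total: int
    valid: int
    rejected: list  # list of (record, [error labels]) tuples

    @property
    def valid_ratio(self) -> float:
        # An empty batch is treated as fully valid to avoid dividing by zero.
        return self.valid / self.total if self.total else 1.0


def validate_records(records, required_fields, rules):
    """Check every record for missing required fields and rule violations.

    `rules` maps a field name to a predicate returning True for valid values.
    Both the field names and the predicates are assumptions of this sketch.
    """
    valid_count = 0
    rejected = []
    for record in records:
        errors = []
        # Completeness check: every required field must be present and not None.
        for field in required_fields:
            if record.get(field) is None:
                errors.append(f"missing:{field}")
        # Validity check: present values must satisfy the business rule.
        for field, predicate in rules.items():
            value = record.get(field)
            if value is not None and not predicate(value):
                errors.append(f"invalid:{field}")
        if errors:
            rejected.append((record, errors))
        else:
            valid_count += 1
    return QualityReport(total=len(records), valid=valid_count, rejected=rejected)


# Hypothetical input batch with one valid, one invalid and one incomplete record.
orders = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": -3.0},
    {"order_id": None, "amount": 10.0},
]
report = validate_records(
    orders,
    required_fields=["order_id", "amount"],
    rules={"amount": lambda amount: amount >= 0},
)
print(report.valid, "valid out of", report.total)
```

Keeping the rejected records together with the reason for rejection, instead of silently dropping them, makes it possible to monitor the valid ratio over time and to send the bad records back to whoever can fix them.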

Data quality checklist

Data quality is not a young domain, which is why we can find interesting resources pointing out its most important concepts. The list below contains some of the more frequent questions to answer in order to improve data quality, especially from the engineering point of view:

This post gave some indicators emphasizing the importance of data quality. We could summarize data quality as a process involving business units and the engineering team. The business unit is needed to define the rules that help to understand the data and its important values. The engineering team is there to ensure the reliability, the implementation of business rules, the scaling and the monitoring of the processing pipeline. The last section provided a non-exhaustive list of questions that should help in the data quality management process.