Introduction to data quality

Dealing with a lot of data is time-consuming, and ensuring that this data stays valuable is even harder. That is one of the reasons why data quality should never be neglected. After all, it is one of the components that provide accurate business insights and facilitate strategic decisions.

Bartosz Konieczny

This post focuses on the data quality aspect of data-driven systems. The first part defines data quality and explains its main axes. The second part shows which tools can be used to guarantee data quality, and at what stage. The last section lists some important questions that can help ensure good data quality in data-driven systems.

Definition

Data quality describes how well the data meets the users' requirements. Data of good quality is easy to use and respects the business requirements. One of its main consequences is more accurate business decisions.

Data quality can be characterized by the following metrics:

Guaranteeing data quality

As you can see, the list is quite long, even though it is not described in detail. All of the above points can be guaranteed with different methods, at different stages of the processing pipeline, for instance:
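One such method can be sketched as validating each record against business rules when it enters the pipeline. The snippet below is only a minimal illustration under stated assumptions: the `Rule` class, the `validate` function, and the `order_id` and `amount` fields are hypothetical names chosen for the example, not part of any specific tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """A business rule: a readable name plus a predicate applied to one record."""
    name: str
    check: Callable[[dict], bool]


# Hypothetical rules for an order record; the field names are illustrative only.
RULES = [
    Rule("order_id is present", lambda r: bool(r.get("order_id"))),
    Rule(
        "amount is a non-negative number",
        lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
    ),
]


def validate(records, rules=RULES):
    """Split records into valid ones and rejected ones.

    Each rejected entry is a (record, failed_rule_names) pair, so the
    rejections can be monitored or routed to a separate storage.
    """
    valid, rejected = [], []
    for record in records:
        failed = [rule.name for rule in rules if not rule.check(record)]
        if failed:
            rejected.append((record, failed))
        else:
            valid.append(record)
    return valid, rejected


if __name__ == "__main__":
    sample = [
        {"order_id": "A1", "amount": 10.5},
        {"order_id": "", "amount": -1},
    ]
    valid, rejected = validate(sample)
    print(len(valid), "valid,", len(rejected), "rejected")
```

Keeping the rules as named data rather than inline conditions is a design choice made for this sketch: the business side can review the rule names, and the engineering side can count failures per rule for monitoring.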

Data quality checklist

Data quality is not a young domain, which is why we can find some interesting resources pointing out its most important concepts. The list below contains some of the more frequent questions to answer in order to improve data quality, especially from the engineering point of view:

This post gave some indicators emphasizing the importance of data quality. We could summarize data quality as a process involving business units and the engineering team. The business unit is needed to define the rules that help to understand the data and its important values. The engineering team is there to ensure the reliability, the implementation of business rules, the scaling and the monitoring of the processing pipeline. The last section provided a non-exhaustive list of questions that should help in the data quality management process.
