So far I have played with Great Expectations and discovered its main classes. Today it's time to see how to automate our data validation pipeline.
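Before diving in, here is a minimal sketch of the kind of validation we want to automate. It uses the Pandas-backed entry point of Great Expectations; the file name and column names are made up for illustration, and the exact API surface varies between framework versions:

```python
import great_expectations as ge

# Load a CSV as a Great Expectations dataset so that expectation
# methods become available on it ("orders.csv" is a hypothetical file).
df = ge.read_csv("orders.csv")

# Declare a few expectations; each call checks the data immediately
# and registers itself in the dataset's expectation suite.
df.expect_column_to_exist("order_id")
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10000)

# Validate the whole suite at once and inspect the overall outcome.
results = df.validate()
print(results["success"])
```

Automating the pipeline essentially means running this validation step on a schedule or inside an orchestrator instead of by hand.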
In my previous post I presented a very simplified version of a Great Expectations data validation pipeline. Today, before going further and integrating the pipeline with a data orchestration tool, it's a good moment to look at what's inside the framework.
When I published my blog post about Deequ and Apache Griffin in March 2020, I thought there was nothing more to explore among data validation frameworks. Fortunately, Alexander Wagner pointed me to another framework, Great Expectations, which I will discover in a series of 3 blog posts.
Poor data quality is a major source of pain for data workers. Data engineers often need to deal with inconsistent JSON schemas, data analysts have to track down dataset issues to avoid biased reporting, and data scientists have to spend a large amount of time preparing data for training instead of dedicating that time to model optimization. That's why having a good tool to control data quality is so important.
Last November I spoke at the Paris.py meetup about integrating Cerberus with PySpark to enhance JSON validation. During the talk I covered some points that I would like to share with you in this blog post, mostly about error definition and normalized validation.
At one of the recent meetups I heard that one of the most difficult data engineering tasks is ensuring good data quality. I more than agree with that statement, and that's the reason why in this post I will share one of the solutions for detecting data issues with PySpark (my first PySpark code!) and the Python library called Cerberus.
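To give an idea of how the two pieces fit together, here is a minimal sketch of running Cerberus inside a PySpark job. The schema, field names and input path are hypothetical, and it assumes flat JSON documents; it is not the exact code from the post, just the general pattern of validating each row and collecting the detected errors:

```python
from cerberus import Validator
from pyspark.sql import SparkSession

# Hypothetical schema: every document must carry a non-empty "user_id"
# string and a non-negative integer "amount".
SCHEMA = {
    "user_id": {"type": "string", "required": True, "empty": False},
    "amount": {"type": "integer", "min": 0},
}

def validate_partition(rows):
    # Build one Validator per partition instead of one per document;
    # allow_unknown lets extra fields pass without being flagged.
    validator = Validator(SCHEMA, allow_unknown=True)
    for row in rows:
        document = row.asDict()
        if not validator.validate(document):
            # Emit the offending document together with Cerberus' errors.
            yield (document, dict(validator.errors))

spark = SparkSession.builder.appName("cerberus-validation").getOrCreate()
df = spark.read.json("events.json")  # hypothetical input path
for document, errors in df.rdd.mapPartitions(validate_partition).collect():
    print(document, errors)
```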
Dealing with a lot of data is a time-consuming activity, but dealing with a lot of data while ensuring its high value is even more complicated. That's one of the reasons why data quality should never be neglected. After all, it's one of the components that provide accurate business insights and facilitate strategic decisions.