data quality articles

Articles tagged with data quality. There are 4 article(s) corresponding to the tag data quality. If you don't find what you're looking for, please check related tags: Ad-hoc polymorphism, Akka Distributed Data, Akka examples, Apache Spark 2.4.0 features, Apache Spark data sources, Apache Spark elasticity, Apache Spark internals, Apache Spark scalability, Apache Spark SQL subquery, Apache Spark Structured Streaming execution.

Check out my new course on Data Engineering!

Are you a data scientist who wants to extend his data engineering skills? Or a software engineer who wants to work with Big Data? If not, maybe a BI developer who wants to evolve to engineering position? My course will help you to achieve your goal! Join the class →

Data validation frameworks - Deequ and Apache Griffin overview

Poor data quality is the reason for big pains of data workers. Data engineers need often to deal with JSON inconsistent schemes, data analysts have to figure out dataset issues to avoid biased reportings whereas data scientists have to spend a big amount of time preparing data for training instead of dedicating this time on model optimization. That's why having a good tool to control data quality is very important. Continue Reading →

Validating JSON with Apache Spark and Cerberus

In one of recent Meetups I heard that one of the most difficult data engineering tasks is ensuring good data quality. I'm more than agree with that statement and that's the reason why in this post I will share one of solutions to detect data issues with PySpark (my first PySpark code !) and Python library called Cerberus. Continue Reading →

Introduction to data quality

Dealing with a lot of data is a time consuming activity but dealing with a lot of data and ensuring its high value is even more complicated. It's one of the reasons why the data quality should never be neglected. After all, it's one of components providing accurate business insights and facilitating strategic decisions. Continue Reading →