Lambda architecture is one of the first officially defined Big Data architectures. However, after a short time it was superseded by simpler approaches like Kappa. Despite that, you can still find projects built on Lambda, and one that grabbed my attention is Project Oryx.
So far I have played with Great Expectations and explored its main classes. Today it's time to see how to automate our data validation pipeline.
In my previous post I presented a very simplified version of a Great Expectations data validation pipeline. Today, before going further and integrating the pipeline with a data orchestration tool, it's a good moment to see what's inside the framework.
When I published my blog post about Deequ and Apache Griffin in March 2020, I thought that there was nothing more to do with data validation frameworks. Fortunately, Alexander Wagner pointed me to another framework, Great Expectations, which I will explore in a series of 3 blog posts.
One of the most powerful features of an orchestration system is the ability to ... yes, orchestrate different and apparently unrelated pipelines. But how do you do that? By directly triggering a task or by using the data?
Preparing for an AWS exam is a good way to discover not only AWS services but also more general concepts. It happened to me when I first heard about dark data during a talk presenting AWS Glue.
When I started to work with data on my very first PHP and Java projects, I used only the DELETE statement to remove data. When I switched to (big) data engineering, I found more efficient ways to deal with this operation through the TRUNCATE or DROP operations.
Poor data quality is a major source of pain for data workers. Data engineers often need to deal with inconsistent JSON schemas, data analysts have to figure out dataset issues to avoid biased reports, and data scientists have to spend a lot of time preparing data for training instead of dedicating this time to model optimization. That's why having a good tool to control data quality is very important.
Last November I spoke at the Paris.py meetup about integrating Cerberus with PySpark to enhance JSON validation. During the talk, I covered some points that I would like to share with you in this blog post, mostly about error definition and normalized validation.
Even data distribution is one of the guarantees of performant data processing. However, it's not a given and sometimes you can encounter an uneven distribution, known as data skew.
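To make the idea of skew more concrete, here is a minimal sketch in plain Python. It assumes a simple hash partitioner and a made-up dataset where a single "hot" key produces 90% of the records; the function names (`partition_for`, `partition_sizes`, `skew_ratio`) and the key values are hypothetical and don't come from any specific framework.

```python
import zlib
from collections import Counter


def partition_for(key: str, partitions: int) -> int:
    """Deterministic hash partitioner (crc32 keeps results stable across runs)."""
    return zlib.crc32(key.encode("utf-8")) % partitions


def partition_sizes(keys, partitions: int) -> list:
    """Count how many records land in each partition."""
    counts = Counter(partition_for(key, partitions) for key in keys)
    return [counts.get(p, 0) for p in range(partitions)]


def skew_ratio(sizes) -> float:
    """Largest partition divided by the mean size; 1.0 means a perfectly even split."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical datasets: 1000 distinct keys vs. one hot key with 900 records
even_keys = [f"user_{i}" for i in range(1000)]
skewed_keys = ["user_0"] * 900 + [f"user_{i}" for i in range(1, 101)]

even_sizes = partition_sizes(even_keys, 4)
skewed_sizes = partition_sizes(skewed_keys, 4)

print("even:  ", even_sizes, round(skew_ratio(even_sizes), 2))
print("skewed:", skewed_sizes, round(skew_ratio(skewed_sizes), 2))
```

Since all records of the hot key hash to the same partition, one task ends up processing at least 900 of the 1000 records, while the others stay almost idle.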
Change Data Capture (CDC) is a technique that helps to move smoothly from a classical, static data warehouse solution to a modern streaming-centric architecture. To do that you can use solutions like Debezium, which connects an RDBMS or MongoDB to Apache Kafka. In this post, I will try to check whether CDC can also apply to other data stores like Apache Cassandra, Elasticsearch and AWS DynamoDB.
Some time ago I found an interesting article describing 2 approaches to synchronizing data pipelines: orchestration and choreography. The article ended with an interesting proposal to use both of them as a hybrid solution. In this post, I will try to implement that idea.
Stream processing is great because it guarantees low latency and quite fresh insights. On the other hand, we won't always need such low latency, and for these situations batch processing will often be a better fit because of its apparently simpler semantics. In data architectures, the batch layer is perceived differently. Kappa, which is a streaming-based model, makes it optional when the streaming broker can guarantee long data retention. But if that's not the case, the data must be copied into some more persistent storage like a distributed file system.
Key-value stores have the advantage of being a kind of distributed and highly available memory cache. But even though they're quite easy to manipulate thanks to the key-based access, they also come with some complicated tasks. One of them is choosing a good key.
Big Data favors denormalized storage. Joins are costly and it's often much more efficient to store all related information in a single row. Such rows with a lot of columns are called wide rows and they'll be explained in the sections below.
Distributed computing opened a lot of possibilities and horizontal scaling is only one of them. But at the same time it brought some new problems that we need to address during application design. Writing data to different data stores inside one transaction is one of them.
Some time ago I started a series of posts about immutability in data-oriented applications. One of the approaches helping to deal with it was based on version flags. But fortunately it's not the only solution - especially for those who don't like to mix valid and invalid data in a single place.
Keeping different databases synchronized is not an easy task. Thankfully some techniques exist to facilitate it and one of them is called the Change Data Capture pattern.
Immutability is a valuable property of systems dealing with a lot of data. It's especially true when something goes wrong and we must recover quickly. Since the data is immutable, no cleaning step is needed and, with some additional computing power, the data can be regenerated efficiently.
Good write throughput and horizontal scalability are maybe the most visible advantages of NoSQL storage systems. However, people with a solid RDBMS background very often fall into the trap of expecting indexes, which can't be created so easily. Fortunately, a lot of patterns exist to help deal with this problem. One of them is the index table pattern.
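As a rough illustration of the index table pattern, the sketch below uses plain Python dictionaries as a stand-in for a key-value store. Everything in it is an assumption made for the example: the `UserStore` class, the `city` attribute and the method names are hypothetical and don't reflect the API of any real NoSQL database. The idea is simply that the application maintains a second table, keyed by the attribute it wants to query, that points back to the primary keys.

```python
from collections import defaultdict


class UserStore:
    """In-memory stand-in for a key-value store with a manually maintained index table."""

    def __init__(self):
        self.users = {}                         # primary table: user_id -> row
        self.users_by_city = defaultdict(set)   # index table: city -> set of user_ids

    def put(self, user_id, row):
        """Write the row and keep the index table in sync with it."""
        old_row = self.users.get(user_id)
        if old_row is not None:
            # Remove the stale index entry before adding the new one
            self.users_by_city[old_row["city"]].discard(user_id)
        self.users[user_id] = row
        self.users_by_city[row["city"]].add(user_id)

    def find_by_city(self, city):
        """Query by the secondary attribute: index lookup first, then primary lookups."""
        user_ids = sorted(self.users_by_city.get(city, ()))
        return [self.users[user_id] for user_id in user_ids]


store = UserStore()
store.put("u1", {"name": "Ann", "city": "Paris"})
store.put("u2", {"name": "Bob", "city": "Paris"})
store.put("u3", {"name": "Cid", "city": "Berlin"})
print([row["name"] for row in store.find_by_city("Paris")])
```

In a real distributed store the two writes are usually not atomic, so the index table may be temporarily out of date. That trade-off is the price paid for being able to query by something other than the primary key.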