Waiting for code

on waitingforcode.com

Range partitioning in Apache Spark SQL

The most popular partitioning strategy divides the dataset by a hash computed from one or more values of each record. However, other partitioning strategies exist as well, and one of them is range partitioning, implemented in Apache Spark SQL with the repartitionByRange method and described in this post. Continue Reading →
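
To give an idea of how it is used, here is a minimal sketch (the dataset and column names are invented, not taken from the post) contrasting hash-based repartition with repartitionByRange on a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object RangePartitioningExample extends App {
  val spark = SparkSession.builder()
    .appName("repartitionByRange example").master("local[*]").getOrCreate()
  import spark.implicits._

  // A small dataset of (id, amount) rows; names are illustrative only
  val orders = (1 to 100).map(id => (id, id % 10)).toDF("id", "amount")

  // Hash partitioning: rows with the same hash of "id" land in the same partition
  val hashed = orders.repartition(4, col("id"))

  // Range partitioning: Spark samples "id", computes range boundaries
  // and assigns each row to the partition covering its value
  val ranged = orders.repartitionByRange(4, col("id"))

  println(s"hash partitions: ${hashed.rdd.getNumPartitions}")
  println(s"range partitions: ${ranged.rdd.getNumPartitions}")
}
```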

Change Data Capture and NoSQL

Change Data Capture (CDC) is a technique that helps move smoothly from a classical, static data warehouse solution to a modern streaming-centric architecture. To do that, you can use solutions like Debezium, which connects an RDBMS or MongoDB to Apache Kafka. In this post, I will check whether CDC can also apply to other data stores like Apache Cassandra, Elasticsearch and AWS DynamoDB. Continue Reading →
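
To show what the consumer side of such a pipeline can look like, here is a minimal sketch that reads a Debezium change events topic with Spark Structured Streaming; the broker address and the topic name (dbserver1.inventory.customers) are assumptions for illustration, not taken from the post:

```scala
import org.apache.spark.sql.SparkSession

object DebeziumTopicReader extends App {
  val spark = SparkSession.builder()
    .appName("CDC events consumer").master("local[*]").getOrCreate()

  // Debezium publishes one change event per modified row to a Kafka topic;
  // the topic name below follows the <server>.<database>.<table> convention
  // and is purely illustrative
  val changeEvents = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "dbserver1.inventory.customers")
    .option("startingOffsets", "earliest")
    .load()

  // The event payload (before/after images, operation type) is in the value column
  val query = changeEvents.selectExpr("CAST(value AS STRING) AS change_event")
    .writeStream
    .format("console")
    .start()

  query.awaitTermination()
}
```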

Bipartite graph recommendation example

When I was analyzing the API of Gelly, I was quite surprised by its support for bipartite graphs: first, because I didn't know that data structure, and second, because it wasn't supported in the other frameworks I analyzed. Hence, I added that graph structure to my backlog and, some time later, wrote a post to explain it better. Continue Reading →
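
To make the idea more concrete without going into the Gelly API itself, here is a plain Scala sketch of a user-item bipartite graph and a naive recommendation obtained by projecting it onto the item side; all the data is invented for illustration:

```scala
object BipartiteRecommendationSketch extends App {
  // Edges of a bipartite graph: users on one side, items on the other
  // (purely illustrative data)
  val purchases: Seq[(String, String)] = Seq(
    ("user1", "itemA"), ("user1", "itemB"),
    ("user2", "itemA"), ("user2", "itemC"),
    ("user3", "itemB"), ("user3", "itemC")
  )

  // Group the edges by user to know which items every user is connected to
  val itemsByUser: Map[String, Set[String]] =
    purchases.groupBy(_._1).map { case (user, edges) => user -> edges.map(_._2).toSet }

  // Projection-style recommendation: items are related when at least one
  // user is connected to both of them
  def recommend(user: String): Seq[String] = {
    val owned = itemsByUser.getOrElse(user, Set.empty)
    val coOccurrences = itemsByUser.values
      .filter(items => items.intersect(owned).nonEmpty) // users sharing an item with `user`
      .flatMap(items => items -- owned)                  // items `user` does not have yet
      .groupBy(identity)
      .map { case (item, occurrences) => item -> occurrences.size }
    coOccurrences.toSeq.sortBy(-_._2).map(_._1)
  }

  println(recommend("user1")) // itemC, reached through user2 and user3
}
```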

Regression tests with Apache Spark SQL joins

Regressions are one of the risks of our profession. Fortunately, we can limit that risk thanks to different testing strategies. One of them is regression testing, which we can use to check whether modified data processing logic introduced a regression, simply by comparing two datasets. Continue Reading →
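
As a rough sketch of what such a comparison could look like (the column names and data are invented), a full outer join on the record key makes removed, added and modified rows all visible at once:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DatasetComparisonSketch extends App {
  val spark = SparkSession.builder()
    .appName("regression test sketch").master("local[*]").getOrCreate()
  import spark.implicits._

  // Output of the previous version of the processing logic (illustrative data)
  val before = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
  // Output of the modified version
  val after = Seq((1, "a"), (2, "B"), (4, "d")).toDF("id", "value")

  // A full outer join keeps rows missing on either side, so removed,
  // added and modified records all show up in the differences
  val differences = before.as("before")
    .join(after.as("after"), Seq("id"), "full_outer")
    .where(col("before.value") =!= col("after.value") ||
      col("before.value").isNull || col("after.value").isNull)

  // In a real regression test we would assert that `differences` is empty;
  // with this illustrative data it contains the ids 2, 3 and 4
  differences.show()
}
```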

Batch layer in streaming-based architectures - approaches

Stream processing is great because it guarantees low latency and quite fresh insights. On the other hand, we won't always need such latency, and in those situations batch processing will often be a better fit because of its apparently simpler semantics. Data architectures perceive the batch layer differently. Kappa, which is a streaming-based model, makes it optional as long as the streaming broker can guarantee long data retention. If that's not the case, the data must be copied into some more persistent storage like a distributed file system. Continue Reading →
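
One possible way to materialize that copy, sketched here with Spark Structured Streaming (the broker address, topic and paths are assumptions), is to continuously dump the broker's topic to files on a distributed file system:

```scala
import org.apache.spark.sql.SparkSession

object StreamToBatchLayerSketch extends App {
  val spark = SparkSession.builder()
    .appName("batch layer copy sketch").master("local[*]").getOrCreate()

  // Read the events from the streaming broker (illustrative topic and broker)
  val events = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()

  // Persist them to a distributed file system so they outlive the broker's retention
  val copy = events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("parquet")
    .option("path", "/data/events")                  // e.g. an HDFS or S3 location
    .option("checkpointLocation", "/checkpoints/events")
    .start()

  copy.awaitTermination()
}
```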

Listening to EMR events with AWS Lambda

I really appreciate AWS services, and one of the main reasons for that is how easy they make it to implement event-driven systems. One of the interesting use cases for these events is related to the EMR service, responsible for running Apache Spark pipelines. In this post I will try to associate an action with every EMR step that completes successfully. Continue Reading →
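
To give a feeling for what the Lambda side could look like, here is a minimal sketch of a handler, written in Scala against the aws-lambda-java-core RequestHandler interface, that receives an EMR step state change event and reacts only when the state is COMPLETED; the triggering rule, the field access and the reaction are assumptions for illustration:

```scala
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import java.util.{Map => JMap}

// Triggered by a CloudWatch Events / EventBridge rule matching the
// "EMR Step Status Change" detail-type (rule configuration not shown)
class EmrStepCompletionHandler extends RequestHandler[JMap[String, Object], String] {

  override def handleRequest(event: JMap[String, Object], context: Context): String = {
    // The step id and its new state live in the "detail" section of the event
    val detail = event.get("detail").asInstanceOf[JMap[String, Object]]
    val stepId = String.valueOf(detail.get("stepId"))
    val state = String.valueOf(detail.get("state"))

    if (state == "COMPLETED") {
      // Place the action to invoke on successful completion here,
      // e.g. notifying another service or launching a downstream job
      context.getLogger.log(s"EMR step $stepId completed successfully")
      s"handled $stepId"
    } else {
      context.getLogger.log(s"Ignoring EMR step $stepId in state $state")
      s"ignored $stepId"
    }
  }
}
```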

Big Data patterns implemented - automated processing metadata insertion

Metadata is sometimes disregarded, but very often it helps to retrieve information more easily and quickly. One such use case is the headers of Apache Parquet files, where stats about each column's content are stored. The reader can, without parsing all the rows, know whether what it is looking for is in the file or not. This metadata is also part of one of the Big Data patterns, called automated processing metadata insertion. Continue Reading →
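
As a small illustration of how those stats are exploited (the paths and columns are invented), the physical plan of a filtered read over Parquet shows the predicate pushed down to the data source, which can then compare it against the stored min/max values instead of parsing every row:

```scala
import org.apache.spark.sql.SparkSession

object ParquetMetadataSketch extends App {
  val spark = SparkSession.builder()
    .appName("Parquet stats sketch").master("local[*]").getOrCreate()
  import spark.implicits._

  // Write a small Parquet file; statistics (min/max, null count) are stored
  // alongside the data for every column chunk
  (1 to 1000).map(nr => (nr, s"user$nr")).toDF("id", "login")
    .write.mode("overwrite").parquet("/tmp/parquet_stats_example")

  // The filter is pushed down to the Parquet reader, which can skip whole
  // chunks of the file thanks to the stored stats
  val filtered = spark.read.parquet("/tmp/parquet_stats_example")
    .where($"id" > 990)

  filtered.explain() // the FileScan node in the plan lists the pushed down filters
  filtered.show()
}
```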

Introduction to horizontal scalability

Two great features that I experienced when working with Dataflow were its serverless character and its auto-scalability. That's why, when I first saw the Apache Spark on Kubernetes initiative, I was more than happy at the idea of one day writing pipelines that automatically adapt to the workload. That also encouraged me to explore horizontal scalability, and this post is the first result of my recent research on that topic. Continue Reading →

FAIR jobs scheduling in Apache Spark

During my exploration of Apache Spark configuration options, I found an entry called spark.scheduler.mode. After looking into its possible values, I came across a pretty intriguing concept called FAIR scheduling, which I will detail in this post. Continue Reading →
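
As a quick, hedged illustration of the setting (the pool name and the job are made up), FAIR mode is enabled on the SparkSession and jobs submitted from a thread can then be assigned to a scheduling pool:

```scala
import org.apache.spark.sql.SparkSession

object FairSchedulingSketch extends App {
  val spark = SparkSession.builder()
    .appName("FAIR scheduling sketch")
    .master("local[*]")
    // Switch from the default FIFO scheduler to FAIR
    .config("spark.scheduler.mode", "FAIR")
    .getOrCreate()

  val sc = spark.sparkContext

  // Jobs submitted from this thread go to the (illustrative) "reports" pool,
  // so they share the executors fairly with jobs from other pools
  sc.setLocalProperty("spark.scheduler.pool", "reports")
  val count = sc.parallelize(1 to 1000).map(_ * 2).count()
  println(s"count=$count")

  // Reset so subsequent jobs fall back to the default pool
  sc.setLocalProperty("spark.scheduler.pool", null)
}
```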