General Big Data articles

Data Vault 2.0 and Big Data

In the previous blog post you discovered the first version of Data Vault methodology. But since the very first iteration, the specification evolved and a few years ago a version 2 was proposed. More adapted to the Big Data world, with several deprecation notes, and more examples adapted to the constantly evolving data world.

Continue Reading β†’

Data modeling with Data Vault - part 1

If you hear "agile", "adapted to the changes", you certainly think about Scrum, Kanban and generally the Agile methodology. And you're correct but it's worth knowing that the agile term also applies to the data. More exactly, to the data modeling with the approach called Data Vault.

Continue Reading β†’

Optimistic concurrency control - a little bit of theory and a little bit more examples

It has been a while since I didn't write about general distributed systems topics. That's the reason for this article where I will focus on the topic of an optimistic concurrency control.

Continue Reading β†’

Data curation concept

There are a lot of data engineering ideas starting with "data" and sometimes they may be confusing. In this post I will focus on the data curation concept and, among others, show some differences with other "data-like" terms.

Continue Reading β†’

Message queues and streaming brokers

Before I came to data engineering, I was working a lot with web services and messaging technologies like RabbitMQ and Spring Integration. The day when I started with streaming brokers I was a little bit confused since everything seemed the same but in reality, was slightly different. There were and still are some subtle differences between queues and streaming brokers. In this post, I will focus on them and try to give a better definition for queues and streams.

Continue Reading β†’

Introduction to horizontal scalability

Two great features whose I experienced when I have been working with Dataflow were the serverless character and the auto-scalability. That's why when I first saw the Apache Spark on Kubernetes initiative, I was more than happy to write one day the pipelines automatically adapting to the workload. That also encouraged me to discover the horizontal scalability and this post is the first result of my recent research on that topic.

Continue Reading β†’

Epidemic protocols

A lot of names in IT come from the real world phenomena. One of them are epidemic protocols, aka gossip protocols, covered in this post.

Continue Reading β†’

Data processing locality and cloud-based data processing

Having data close to the computation has a lot of advantages. This idea called data locality is not new since it was popularized with Hadoop MapReduce. Despite of that, it's worth recalling some of its main points and trying to adapt it to modern data pipelines most of the time based on cloud services.

Continue Reading β†’

Massively Parallel Processing

Querying big amounts of data has never been so simple as nowadays. Amazon Redshift and Azure SQL Data Warehouse are one of the solutions. But using them wouldn't be possible without a more global concept known as MPP.

Continue Reading β†’

ACID 2.0

ACID is a well-known acronym for almost all developers growing with the RDBMS as the main storage. However with the popularization of NoSQL and distributed computing, another ACID acronym appeared - ACID 2.0.

Continue Reading β†’

Immutability in Big Data

The interest of immutability in Big Data is often difficult to understand at the first glance. After all it introduces some complexity - especially at the reading path. But when the first problems appear and some of data need to be recomputed in order, the immutability comes to the rescue.

Continue Reading β†’

Data processing frameworks concepts

Modern data processing frameworks offer a wide range of features. At first glance this number can scary. Fortunately they can be discovered sequentially and often are common for the most popular frameworks.

Continue Reading β†’

Secondary index in NoSQL data stores

The data organization in key-value oriented NoSQL databases is very often based on a pair of keys: partition and sorting. However they also offer other feature called secondary index that can be a good alternative to previously described index table pattern.

Continue Reading β†’

Introduction to data quality

Dealing with a lot of data is a time consuming activity but dealing with a lot of data and ensuring its high value is even more complicated. It's one of the reasons why the data quality should never be neglected. After all, it's one of components providing accurate business insights and facilitating strategic decisions.

Continue Reading β†’

Polyglot persistence - definition and examples

The popularization of NoSQL data stores brought a new concept in data management called polyglot persistence. This term is very similar to polyglot programming and it'll be presented below.

Continue Reading β†’

Log-structured file system

Sequential writes made their proofs in distributed data-driven systems. Usually they perform better than random writes, especially in systems with intensive writes. Beside the link to the Big Data, the sequential writes are also related to another type of systems called log-structured file systems that were defined late 1980's.

Continue Reading β†’

Dynamo paper and consistent hashing

One of previous posts presented partitioning strategies. Among described techniques we could find hashing partitioning based on the number of servers. The drawback of this method was the lack of flexibility. With the add of new server we have to remap all data. Fortunately an alternative to this "primitive" hashing exists and it's called consistent hashing.

Continue Reading β†’

Data partitioning strategies

Every data processing pipeline can have a source of contention. One of them can be the data localization. When all entries are read from single place by dozens or hundreds of workers, the data source can respond slower. One of solutions to this problem can be the partitioning.

Continue Reading β†’

Enforcing consistency in stateful serverless processing with idempotence

Since the gain of popularity of cloud operators, serverless processing became one of serious alternatives to the cluster-based data pipelines. It's often cheaper to have event-based applications than different processings in the clusters. However, using serverless (and not only) in distributed and stateful computing can sometimes be difficult. But often one property can help in a lot of problems - idempotence.

Continue Reading β†’

Zeta architecture

Almost every year new concept of data-centric architecture appears. In 2014 Kappa conception was published by Jay Kreps. One year after another concept emerged - the architecture called Zeta.

Continue Reading β†’