Data engineering articles

Looking for something else? Check the categories of Data engineering:

Apache Airflow Big Data algorithms Big Data problems - solutions Data engineering patterns General Big Data Graphs SQL

If not, below you can find all articles belonging to Data engineering.

Data processing locality and cloud-based data processing

Having data close to the computation has a lot of advantages. This idea called data locality is not new since it was popularized with Hadoop MapReduce. Despite of that, it's worth recalling some of its main points and trying to adapt it to modern data pipelines most of the time based on cloud services.

Continue Reading →

Big Data immutability approaches - aliasing

Some time ago I've started the series of posts about immutability in data-oriented applications. One of approaches helping to deal with it was based on version flags. But fortunately it's not the only solution - especially for the ones who don't like to mix valid and invalid data in a single place.

Continue Reading →

Massively Parallel Processing

Querying big amounts of data has never been so simple as nowadays. Amazon Redshift and Azure SQL Data Warehouse are one of the solutions. But using them wouldn't be possible without a more global concept known as MPP.

Continue Reading →

Change Data Capture pattern

Keeping different database synchronized is not an easy task. Thankfully some techniques exist to facilitate it and one of them is called the Changed Data Capture pattern.

Continue Reading →

Stable Bloom filter

Even though Bloom filter perfectly suits to bounded data it also has some interesting implementations for unbounded sources too. One of them is Stable Bloom filter.

Continue Reading →

Scalable Bloom filter

Bloom filter has a lot of versions addressing its main drawbacks - bounded source and add-only character. One of them is Scalable Bloom filter that fixes the first issue.

Continue Reading →

Bloom filter

After HyperLogLog and Count-min sketch it's time to cover another popular probabilistic algorithm - Bloom filter.

Continue Reading →

Cardinality estimation with linear probabilistic counting

This post follows the series about approximation algorithms. But unlike before, this time we'll focus on simpler solution, the linear probabilistic counting.

Continue Reading →

Consensus problem and the Raft consensus algorithm

During long years the Paxos protocol was one of most serious solutions for consensus problems. However in 2013 Diego Ongaro and John Ousterhout from Stanford University proposed an alternative called Raft.

Continue Reading →

Window functions in SQL

Window functions are one of another SQL features that we'll probably discover during the work with data-oriented application. They can be also used in more classical programs though.

Continue Reading →

Immutability and key-value storage

The immutability is a precious property of systems dealing with a lot of data. It's especially true when something goes wrong and we must recover quickly. Since the data is immutable, the cleaning step is not executed and with some additional computation power, the data can be regenerated efficiently.

Continue Reading →

ACID 2.0

ACID is a well-known acronym for almost all developers growing with the RDBMS as the main storage. However with the popularization of NoSQL and distributed computing, another ACID acronym appeared - ACID 2.0.

Continue Reading →

Conflict-Free Replicated Data Types - flags, graphs and maps

Previous post about the Conflict-free Replicated Data Types presented some of basic structures of this type. This one will describe some of recently uncovered types such as: flags, graphs and arrays.

Continue Reading →

Immutability in Big Data

The interest of immutability in Big Data is often difficult to understand at the first glance. After all it introduces some complexity - especially at the reading path. But when the first problems appear and some of data need to be recomputed in order, the immutability comes to the rescue.

Continue Reading →

Conflict-Free Replicated Data Type

Pessimistic replication requires a synchronous communication between the main node writing the data and the replicas. However in some cases the optimistic replication can be more efficient and still guarantee the same final result. One of solutions from this category are conflict-free replicated data types.

Continue Reading →

Hierarchical queries

SQL hides for the most of its daily users a lot of interesting and powerful functions that though are not used very frequently. One of them are hierarchical queries.

Continue Reading →

Frequency estimation with Count-min sketch

HyperLogLog algorithm described some weeks ago is not the single one approximate solution in the world of Big Data applications. Another one is Count-min sketch.

Continue Reading →

Correlated subqueries

Even though the RDBMS is more and more completed (replaced?) by NoSQL solutions, it still remains an important piece of the data processing. It's even more true with the distributed databases as BigQuery supporting SQL standards so the correlated subqueries. But they're also implemented in other Big Data engines as Spark SQL or more classical ones as PostgreSQL.

Continue Reading →

Data processing frameworks concepts

Modern data processing frameworks offer a wide range of features. At first glance this number can scary. Fortunately they can be discovered sequentially and often are common for the most popular frameworks.

Continue Reading →

Conflict resolution in distributed applications - vector clocks

Dynamo paper, already quoted here in other posts, was published in 2007. It's 10 years ago. Even though the time passed, it still proposes interesting concepts to know for data-driven applications. And one of them are vector clocks used to conflict resolution.

Continue Reading →