Data engineering articles

Looking for something else? Check the categories of Data engineering:

Apache Airflow Big Data algorithms Big Data problems - solutions Data engineering patterns General Big Data General data engineering Graphs SQL

If not, below you can find all articles belonging to Data engineering.

Graph partitioning

As told many times in previous posts, one of the most challenging tasks in distributed graph processing is the partitioning. Connected nature of the graph components makes the partitioning hard. Hopefully, the researchers continue to propose the solutions.

Continue Reading →

Graph storage

Until now we've discovered exclusively the concepts devoted to computing distributed graphs. But the compute part can't go without storage. And since for the latter in the context of graph we can't talk about the storage, it requires its own detailed explanation.

Continue Reading →

Graph algorithms in distributed world - part 1

During last weeks we've discovered a lot about graph data processing in distributed world. However we haven't learned yet about the problems the graphs can solve. And it's as important as the knowledge about the processing techniques. Hopefully, this post will try to catch up this late.

Continue Reading →

Graph-centric graph processing

Previously described vertex-centric model is not the single one used to process graph data. Another one uses subgraphs as the processing unit.

Continue Reading →

Streaming and graph processing

Use cases of streaming surprise me more and more. In my recent research about graph processing in Big Data era I found a paper presenting the graph framework working on vertices and edges directly from a stream. In case you've missed that paper I'll try to present this idea to you.

Continue Reading →

Vertex-centric graph processing

Graph data processing, even though seems to be less popular than streaming or files processing, is an important member of data-oriented systems. And as its "colleagues", it also has some different processing logics. The first described in this blog is called vertex-centric.

Continue Reading →

Graphs and data processing

In this blog I've covered the topics about relational databases, key-value stores, search engines or log systems. There are still some storage systems deserving some learning effort and one of them are graphs considered here in the context of data processing.

Continue Reading →

Transaction compensation - aka Sagas

Distributed computing opened a lot of possibilities and horizontal scaling is only one of them. But at the same time it brought some new problems that we need to address during applications conception. And writing a data on different data stores inside one transaction is the one of them.

Continue Reading →

Epidemic protocols

A lot of names in IT come from the real world phenomena. One of them are epidemic protocols, aka gossip protocols, covered in this post.

Continue Reading →

Data processing locality and cloud-based data processing

Having data close to the computation has a lot of advantages. This idea called data locality is not new since it was popularized with Hadoop MapReduce. Despite of that, it's worth recalling some of its main points and trying to adapt it to modern data pipelines most of the time based on cloud services.

Continue Reading →

Big Data immutability approaches - aliasing

Some time ago I've started the series of posts about immutability in data-oriented applications. One of approaches helping to deal with it was based on version flags. But fortunately it's not the only solution - especially for the ones who don't like to mix valid and invalid data in a single place.

Continue Reading →

Massively Parallel Processing

Querying big amounts of data has never been so simple as nowadays. Amazon Redshift and Azure SQL Data Warehouse are one of the solutions. But using them wouldn't be possible without a more global concept known as MPP.

Continue Reading →

Change Data Capture pattern

Keeping different database synchronized is not an easy task. Thankfully some techniques exist to facilitate it and one of them is called the Changed Data Capture pattern.

Continue Reading →

Stable Bloom filter

Even though Bloom filter perfectly suits to bounded data it also has some interesting implementations for unbounded sources too. One of them is Stable Bloom filter.

Continue Reading →

Scalable Bloom filter

Bloom filter has a lot of versions addressing its main drawbacks - bounded source and add-only character. One of them is Scalable Bloom filter that fixes the first issue.

Continue Reading →

Bloom filter

After HyperLogLog and Count-min sketch it's time to cover another popular probabilistic algorithm - Bloom filter.

Continue Reading →

Cardinality estimation with linear probabilistic counting

This post follows the series about approximation algorithms. But unlike before, this time we'll focus on simpler solution, the linear probabilistic counting.

Continue Reading →

Consensus problem and the Raft consensus algorithm

During long years the Paxos protocol was one of most serious solutions for consensus problems. However in 2013 Diego Ongaro and John Ousterhout from Stanford University proposed an alternative called Raft.

Continue Reading →

Window functions in SQL

Window functions are one of another SQL features that we'll probably discover during the work with data-oriented application. They can be also used in more classical programs though.

Continue Reading →

Immutability and key-value storage

The immutability is a precious property of systems dealing with a lot of data. It's especially true when something goes wrong and we must recover quickly. Since the data is immutable, the cleaning step is not executed and with some additional computation power, the data can be regenerated efficiently.

Continue Reading →