Data engineering articles



April 10, 2019 • General Big Data

Introduction to horizontal scalability

Two great features that I experienced while working with Dataflow were its serverless character and its auto-scalability. That's why, when I first saw the Apache Spark on Kubernetes initiative, I was more than happy at the prospect of one day writing pipelines that automatically adapt to the workload. That also encouraged me to explore horizontal scalability, and this post is the first result of my recent research on that topic.


March 29, 2019 • SQL

SQL GROUPING SETS operator

I have already described the grouping sets feature in the context of Apache Spark. But natively they are part of the SQL standard, and that's why I would like to extend the previous post here. After all, you don't need Big Data to use them, even though nowadays it's difficult not to deal with it.


February 28, 2019 • SQL

Minus/except operator in SQL

Last time we discovered the INTERSECT operator. As a quick reminder, it returns the rows that are present in both of the combined datasets. Today we'll discover another operator that does roughly the opposite: it returns the rows of the first dataset that are absent from the second. Depending on the vendor, it is called MINUS or EXCEPT.
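As a quick illustration, here is a minimal sketch of both operators using Python's built-in sqlite3 module. The tables `orders_2018` and `orders_2019` and their values are hypothetical and not taken from the post. SQLite spells the operator EXCEPT, whereas a vendor like Oracle uses MINUS.

```python
import sqlite3

# Hypothetical in-memory tables, created only for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_2018 (customer TEXT);
    CREATE TABLE orders_2019 (customer TEXT);
    INSERT INTO orders_2018 VALUES ('a'), ('b'), ('c');
    INSERT INTO orders_2019 VALUES ('b'), ('c'), ('d');
""")


def run(query):
    """Execute a single-column query and return its values sorted."""
    return sorted(row[0] for row in conn.execute(query))


# INTERSECT keeps the rows present in both datasets.
intersect_rows = run(
    "SELECT customer FROM orders_2018 "
    "INTERSECT SELECT customer FROM orders_2019"
)

# EXCEPT keeps the rows of the first dataset that are absent from the second.
except_rows = run(
    "SELECT customer FROM orders_2018 "
    "EXCEPT SELECT customer FROM orders_2019"
)

print(intersect_rows)  # ['b', 'c']
print(except_rows)     # ['a']
```

Note that EXCEPT is not symmetric: swapping the two SELECT statements would return `['d']` instead.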


February 22, 2019 • Big Data problems - solutions

Key-value distribution patterns

Key-value stores have the advantage of being a kind of distributed and highly available memory cache. But even though they're quite easy to manipulate thanks to key-based access, they also come with some complicated tasks. One of them is picking a good key.


February 20, 2019 • Graphs

Chaos in streaming graph processing

Some time ago I wrote a post about graph data processing with streams. That article was based on the X-Stream framework proposed by researchers from the EPFL research institute. On that occasion, I also mentioned the existence of a newer alternative to X-Stream, adapted for distributed workloads, called Chaos. I deliberately omitted the explanation of Chaos in the previous post, since putting it alongside X-Stream would have introduced too many new concepts. But now, after some weeks of graph processing discoveries, I would like to return to the successor of X-Stream and present it in more detail.


January 31, 2019 • SQL

SQL and intersect operation

Thanks to modern Big Data solutions like BigQuery or Apache Spark SQL, knowledge of advanced SQL concepts is important. After covering operations like window functions and grouping sets, it's time to show another interesting SQL feature, the INTERSECT operator.


November 21, 2018 • Graphs

Graph processing frameworks survey

The series about graph processing continues. Today it's time to analyze some major graph processing frameworks and choose the one that I'll present in more detail in upcoming posts.


November 21, 2018 • Big Data problems - solutions

Wide rows in column-oriented stores

Big Data often favors denormalized storage. Joins are costly and it's often much more efficient to store all related information in a single row. Such rows with a lot of columns are called wide rows, and they're explained in the sections below.
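To make the idea concrete, here is a minimal sketch in plain Python, not tied to any particular store. The `users` and `orders` collections, the `build_wide_rows` helper and the `order:<id>:amount` column naming are all hypothetical choices made for illustration.

```python
# Hypothetical normalized data: one list per "table".
users = [
    {"user_id": 1, "name": "Ann"},
    {"user_id": 2, "name": "Bob"},
]
orders = [
    {"user_id": 1, "order_id": "o1", "amount": 10.0},
    {"user_id": 1, "order_id": "o2", "amount": 25.5},
    {"user_id": 2, "order_id": "o3", "amount": 7.0},
]


def build_wide_rows(users, orders):
    """Denormalize the data: one row per user, one extra column per order."""
    rows = {user["user_id"]: {"name": user["name"]} for user in users}
    for order in orders:
        # The column name embeds the order id, so a row grows "horizontally"
        # with every new order instead of requiring a join at read time.
        column = f"order:{order['order_id']}:amount"
        rows[order["user_id"]][column] = order["amount"]
    return rows


wide_rows = build_wide_rows(users, orders)
print(wide_rows[1])
# {'name': 'Ann', 'order:o1:amount': 10.0, 'order:o2:amount': 25.5}
```

Reading everything about one user is then a single key lookup, at the price of rows whose number of columns is not known in advance.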


November 14, 2018 • Graphs

Graph mining

Because of its connected nature, the graph structure has its own branch in data mining. Thanks to this branch, we can get insight into the relationships and dependencies between vertices.


November 7, 2018 • Graphs

Graph partitioning

As mentioned many times in previous posts, one of the most challenging tasks in distributed graph processing is partitioning. The connected nature of graph components makes partitioning hard. Fortunately, researchers continue to propose solutions.


November 1, 2018 • Graphs

Graph storage

Until now we've exclusively discovered concepts devoted to computing on distributed graphs. But the compute part can't go without storage. And since in the context of graphs we can't talk about a single kind of storage, it requires its own detailed explanation.


November 1, 2018 • Graphs

Graph algorithms in distributed world - part 1

During the last weeks we've discovered a lot about graph data processing in the distributed world. However, we haven't yet learned about the problems that graphs can solve, and that's as important as knowing the processing techniques. Hopefully, this post will make up for that delay.


October 24, 2018 • Graphs

Graph-centric graph processing

The previously described vertex-centric model is not the only one used to process graph data. Another one uses subgraphs as the processing unit.


October 24, 2018 • Graphs

Streaming and graph processing

Use cases of streaming surprise me more and more. In my recent research about graph processing in the Big Data era, I found a paper presenting a graph framework that works on vertices and edges directly from a stream. In case you've missed that paper, I'll try to present this idea to you.


October 18, 2018 • Graphs

Vertex-centric graph processing

Graph data processing, even though it seems to be less popular than streaming or file processing, is an important member of data-oriented systems. And like its "colleagues", it also has several different processing models. The first one described on this blog is called vertex-centric.


October 11, 2018 • Graphs

Graphs and data processing

On this blog I've covered topics about relational databases, key-value stores, search engines and log systems. There are still some storage systems deserving some learning effort, and one of them is graphs, considered here in the context of data processing.


October 3, 2018 • Big Data problems - solutions

Transaction compensation - aka Sagas

Distributed computing opened up a lot of possibilities, and horizontal scaling is only one of them. But at the same time it brought some new problems that we need to address during application design. Writing data to different data stores inside one transaction is one of them.


September 30, 2018 • General Big Data

Epidemic protocols

A lot of names in IT come from real-world phenomena. One example is epidemic protocols, aka gossip protocols, covered in this post.


September 9, 2018 • General Big Data

Data processing locality and cloud-based data processing

Having data close to the computation has a lot of advantages. This idea, called data locality, is not new since it was popularized by Hadoop MapReduce. Despite that, it's worth recalling some of its main points and trying to adapt it to modern data pipelines, which most of the time are based on cloud services.


September 2, 2018 • Big Data problems - solutions

Big Data immutability approaches - aliasing

Some time ago I started a series of posts about immutability in data-oriented applications. One of the approaches helping to deal with it was based on version flags. But fortunately it's not the only solution, especially for those who don't like to mix valid and invalid data in a single place.
