Data processing frameworks concepts

Versions: Apache Beam 2.2.0, Apache Spark 2.3.0

Modern data processing frameworks offer a wide range of features. At first glance this number can scary. Fortunately they can be discovered sequentially and often are common for the most popular frameworks.

In this post we'll discover some of basic concepts of batch and streaming data processing. These concepts are rather global and it means that for example they won't represent specific transformations as mapping, filtering - instead of that they'll be represented under a common part called "transformation". This post won't contain code samples neither. But in some places it'll link to another posts in order to clearly show given use case. The post is organized in one big section describing the data processing concepts in an ordered list. It's inspired from the personal experience and Beam's capability matrix, quoted just after the conclusion.

The main ideas of data processing can be resumed in the following list:

Data processing has a lot of concepts and names good to know. They help not only to write efficient data processing pipelines (as shuffle problems, delivery semantics) but also are very helpful in the discovery of new data processing framework. As you can see in the above list, we can retrieve the implementation of almost every point in Apache Beam and Apache Spark. Thanks to that we know the point we'd focus on when we start to learn new framework. However, the list is not exhaustive and if you're more concepts to add, please let me know about them.


If you liked it, you should read:

📚 Newsletter Get new posts, recommended reading and other exclusive information every week. SPAM free - no 3rd party ads, only the information about waitingforcode!