Data processing frameworks concepts

Versions: Apache Beam 2.2.0, Apache Spark 2.3.0

Modern data processing frameworks offer a wide range of features. At first glance this number can scary. Fortunately they can be discovered sequentially and often are common for the most popular frameworks.

Looking for a better data engineering position and skills?

You have been working as a data engineer but feel stuck? You don't have any new challenges and are still writing the same jobs all over again? You have now different options. You can try to look for a new job, now or later, or learn from the others! "Become a Better Data Engineer" initiative is one of these places where you can find online learning resources where the theory meets the practice. They will help you prepare maybe for the next job, or at least, improve your current skillset without looking for something else.

👉 I'm interested in improving my data engineering skillset

See you there, Bartosz

In this post we'll discover some of basic concepts of batch and streaming data processing. These concepts are rather global and it means that for example they won't represent specific transformations as mapping, filtering - instead of that they'll be represented under a common part called "transformation". This post won't contain code samples neither. But in some places it'll link to another posts in order to clearly show given use case. The post is organized in one big section describing the data processing concepts in an ordered list. It's inspired from the personal experience and Beam's capability matrix, quoted just after the conclusion.

The main ideas of data processing can be resumed in the following list:

Data processing has a lot of concepts and names good to know. They help not only to write efficient data processing pipelines (as shuffle problems, delivery semantics) but also are very helpful in the discovery of new data processing framework. As you can see in the above list, we can retrieve the implementation of almost every point in Apache Beam and Apache Spark. Thanks to that we know the point we'd focus on when we start to learn new framework. However, the list is not exhaustive and if you're more concepts to add, please let me know about them.


If you liked it, you should read:

📚 Newsletter Get new posts, recommended reading and other exclusive information every week. SPAM free - no 3rd party ads, only the information about waitingforcode!