Data processing frameworks concepts

Modern data processing frameworks offer a wide range of features. At first glance, their number can be scary. Fortunately, they can be discovered one at a time, and many of them are shared by the most popular frameworks.

Bartosz Konieczny

In this post we'll discover some of the basic concepts of batch and streaming data processing. These concepts are rather general, which means that, for example, specific transformations such as mapping or filtering won't be described individually. Instead, they'll be grouped under a common item called "transformation". This post won't contain code samples either, but in some places it links to other posts that illustrate a given use case. The post is organized as one big section describing the data processing concepts in an ordered list. It's inspired by personal experience and Beam's capability matrix, quoted just after the conclusion.

The main ideas of data processing can be summarized in the following list:

Data processing comes with a lot of concepts and names that are good to know. They not only help to write efficient data processing pipelines (for example by being aware of shuffle problems and delivery semantics) but are also very helpful when discovering a new data processing framework. As you can see in the above list, an implementation of almost every point can be found in Apache Beam and Apache Spark. Thanks to that, we know which points to focus on when we start learning a new framework. However, the list is not exhaustive, and if you have more concepts to add, please let me know about them.
