Data processing frameworks concepts
In this post we'll discover some of basic concepts of batch and streaming data processing. These concepts are rather global and it means that for example they won't represent specific transformations as mapping, filtering - instead of that they'll be represented under a common part called "transformation". This post won't contain code samples neither. But in some places it'll link to another posts in order to clearly show given use case. The post is organized in one big section describing the data processing concepts in an ordered list. It's inspired from the personal experience and Beam's capability matrix, quoted just after the conclusion.
The main ideas of data processing can be resumed in the following list:
- Additional inputs - often the data processed by the pipeline comes from a single source. However sometimes the processing task can need an additional resource, as for instance a small dictionary of data fitting in memory or other values to combine with the main dataset. In Apache Spark this problem can be solved in different ways but the simplest one seems to be the use of broadcast variables. You can learn more about them in the post zoom at broadcast variables. If you use Apache Beam, the solution will be side inputs described more in Side input in Apache Beam
- Checkpointing - this operation is used to persist the state of the pipeline. So persisted state can be used for different purposes but the most popular is the recovery. It's because the checkpoint can store different metadata related to the pipeline, as: JARs, last processed entries, planned but not processed entries and so on. An example of checkpoint recovery is Spark's metadata checkpoint.
- Delivery semantic - the data processing pipeline implementation may be driven by the message delivery semantic: exactly once, at most once or at least once. If given message is supposed to be delivered only once, we don't have any additional work to do. But if given message can be deliver more than that or not delivered at all, we'll probably need to implement some additional logic to deal with potential duplicates or to prevent data loss. The delivery semantic was shortly described in the post about consumers in Apache Kafka.
- Directed Acyclic Graph - data transformations are often represented as Directed Acyclic Graph, i.e. a directed graph without cycles (= "graph going forward"). This data structure is used to represent transformation dependencies in Apache Spark, Apache Beam or in Airflow. Some post were written about this topic and can be retrieved in the page distributed processing DAG.
- Fault tolerance - in distributed data processing the risk that something goes wrong is elevated. After all not only one node risks to fail but much more. It's why it's important for data processing framework to support fault tolerance, i.e. to have an ability to recover after some punctual problems. In Apache Spark the fault tolerance is implemented as automatic retry of failed tasks. You can find more information about it in the page about Spark fault tolerance.
- Hot keys - in order to guarantee previsible processing time it's important to distribute data evenly. However it's difficult to guarantee in a lot of transformations and the clearest example of that is grouping per key manipulation where some keys can have few data (let's say 80% of processed data) and the remaining ones a lot more of it (remaining 20%). In this situation the processing time will be slow down by the 20% of tasks processing 20% more data than the remaining 80%. Apache Beam comes with the solution to this problem called combine with fanout covered in the post Fanouts in Apache Beam's combine transform.
- Idempotence - sometimes the data processing logic can partially fail from time to time. For instance, in the session computation, the session can be construted correctly but saving it in database can fail (e.g. because of timeout). In such case the failing task will be retried, probably on another node. It's important that this retry generates the same session as the initial one. And it's called idempotence and one of example of its use was given in the post Enforcing consistency in stateful serverless processing with idempotence.
- Join - data processing often needs to combine different distributed datasets into single one dataset. This operations is a little bit different than the additional inputs since the combining considers 2 or more datasets of similar potentially big size. The joining strategies can be different. The most obvious one is the strategy putting all entries for given joining key on the same node (shuffle join). The other one can be based on sending one dataset to each node where only its subset will be joined with appropriated data. The list of join strategies is defined in the page Spark SQL joins.
- Lateness - often data can arrive at late according to the event time. For instance, the log indicating an event of IoT device generated at 10:00 can be injected in the pipeline only 2 hours later. Spark, as well as Beam, has the notion of watermark, i.e. the definition of the time during which the late data is allowed. More can be found in the posts about late data
- Multiple outputs - one input dataset can produce different outputs. A simple use case of that is the situation where our pipeline makes some data validation and, depending on the final result (valid or invalid data), sends it to different data sink (e.g. different streams). Apache Beams deals pretty well with this feature thanks to side outputs feature.
- Partitioning - the data partitioning is an important concept. Since we work in distributed mode, we need to ensure that the workers will process exclusive subset of data (= 2 workers won't process the same subset of dataset). Data source partitioning is one of simple solutions to deal with it. For instance, in the case of files batch processing, the data partitioning can be the number of files processed by given worker. For the case of streaming applications, the partitioning can rely on streaming source partitioning logic (partitions in Kafka or shards in Kinesis). The posts describing this aspect are in the page about parallelization unit.
- Per-key operations - some of data transformations are done per-key. An obvious example of thins category of processing is all problems related to the sessions. For instance, if we want to reconstruct user's activity in a website, we'll probably identify him by the cookie and group his activity on some node for the activity reconstruction. Per-key operations can be represented in 2 different ways: partial and total. The first method applies for the transformations that can be computed partially on the nodes and these partial results can be merged together at the final step (e.g. Tree aggregations in Spark). The second option first brings all data of given key into a single node and only after that applies the final transformation. As you can deduce, the latter approach involves more shuffle and can potentially be slower than the first one.
- Pushdown - sometimes the data processing is expressed at the framework level but its execution is delegated to the data source. This operation is called a pushdown. A great example of it is the execution of filtering at database level and one of its examples is Predicate pushdown in Spark SQL.
- Serialization - it's an intrinsic part of distributed data processing. In order to move data between nodes or even deliver processing logic to the workers, the code must be serializable. It means that it must be able to be translated to the format transferable through the network and convertible to the language-specific code executable by the worker. However, the serialization brings a lot of problems, especially on the beginning. Some problems and solutions are described in the posts in distributed data serialization page.
- Shuffle - as already mentioned, per-key operations can move some data between nodes. This operation is called shuffle and, if overused in the processing pipeline, it can lead to serious performance problems. The process is described more in details in the Shuffling in Spark and Spark shuffle - complementary notes posts.
- Sink - this name is used to designate the place where the processed data is sent.
- Source - a data source can either be bounder or unbounded. For the first one we know all data to process before starting the processing. In the second case the source has the data arriving constantly and we don't know its exact size (boundaries).
- Stateful processing - the distributed processing can require to be stateful, i.e. to generate, maintain and update some state at every transformation. An example of it can be once again a sessionization problem where a session can be represented by an updatable object holding for instance the total session time, the number of actions and so on. The posts explaining such processing are defined in distributed stateful processing tag page.
- Transformation - with this name we often represents the operations on the data, such as maps, filters, aggregations and so on. The data transformations are listed in the posts grouped in distributed data manipulation page.
- Triggers - the triggers define when the processing is started. For instance it can start after accumulating sufficient number of elements or after some time elapsed. They are explained in the posts about Triggers in Apache Spark Structured Streaming and Triggers in Apache Beam.
- Windowing - the data, especially in the case of unbounded source, can be grouped and processed in the windows. Depending on the window type, the most recent window can either contain: exclusively the data for given interval or a part of data of previous intervals. Streaming processing windows page contains a list of posts about window types in Apache Beam and Apache Spark.
Data processing has a lot of concepts and names good to know. They help not only to write efficient data processing pipelines (as shuffle problems, delivery semantics) but also are very helpful in the discovery of new data processing framework. As you can see in the above list, we can retrieve the implementation of almost every point in Apache Beam and Apache Spark. Thanks to that we know the point we'd focus on when we start to learn new framework. However, the list is not exhaustive and if you're more concepts to add, please let me know about them.
If you liked it, you should read: Epidemic protocols Data processing locality and cloud-based data processing Massively Parallel Processing