Data transformations in Apache Beam

Versions: Apache Beam 2.2.0

Transformations are an intrinsic part of every data processing framework. Apache Beam is no exception: it provides a set of built-in transformations that can be freely extended with custom ones.

After the first post explaining PCollection in Apache Beam, this one focuses on the operations we can do with this data abstraction. The first section describes the API of data transformations in Apache Beam. The next one lists the main available transformations and illustrates each of them with one or more code samples.

Transforming PCollections

In order to understand the transformation API, it's worth first recalling the previous data processing framework developed at Google - FlumeJava, the internal successor of MapReduce. Its transformation API was based on explicitly called operations doing different processing work: mapping, filtering, counting and so on. However, this approach had some drawbacks in terms of code readability, modularity, extensibility and type safety. That's why Apache Beam's programmers decided to adopt a more universal approach - expose a single apply(PTransform<? super PCollection<T>, OutputT> t) method on the PCollection class.

This method takes the executed transformation (map, filter, count...) as a parameter. The transformation is defined as an instance of a PTransform<InputT, OutputT> implementation. Simply speaking, a PTransform takes some data of a certain type and returns other data that may be of a different type.
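To make the single-apply-method idea concrete, here is a deliberately simplified, self-contained sketch of the pattern. The class and method names (Transform, Collection, expand) are hypothetical stand-ins for Beam's PTransform and PCollection, not the real Beam API; the point is only to show how one universal apply method can host any transformation expressed as an object:

```java
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

// Simplified, hypothetical re-creation of Beam's pattern: the collection
// exposes a single apply(...) method and every operation is a transform object.
public class ApplyPattern {

    // Stand-in for Beam's PTransform<InputT, OutputT>: turns one value into another.
    interface Transform<I, O> {
        O expand(I input);
    }

    // Stand-in for PCollection<T>: the only processing entry point is apply().
    static class Collection<T> {
        final List<T> elements;

        Collection(List<T> elements) {
            this.elements = elements;
        }

        // The single, universal method: every transformation goes through here.
        <O> O apply(Transform<Collection<T>, O> transform) {
            return transform.expand(this);
        }
    }

    // A mapping operation expressed as a Transform implementation, so it can be
    // passed to apply() like any other transformation.
    static <T, R> Transform<Collection<T>, Collection<R>> map(Function<T, R> fn) {
        return input -> new Collection<>(
                input.elements.stream().map(fn).collect(Collectors.toList()));
    }

    public static void main(String[] args) {
        Collection<Integer> numbers = new Collection<>(List.of(1, 2, 3));
        Collection<Integer> doubled = numbers.apply(map(n -> n * 2));
        System.out.println(doubled.elements); // [2, 4, 6]
    }
}
```

Notice how adding a new operation never changes the Collection class itself: any new transformation is just another Transform implementation, which is exactly the extensibility argument made above.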

The transformations are identified by a unique name that can either be assigned explicitly in the apply call or generated by the system. Many popular data processing transformations (count, sum, min, max, map, filter...) are already included in Apache Beam. The missing ones can simply be provided as custom implementations of PTransform.

Transformation examples

Apache Beam ships with a number of built-in transformations. Some of them are quite obvious; others are less so and require more explanation.
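To illustrate what a few of the common built-ins mentioned in this post do (counting, filtering, mapping), here is a self-contained sketch that reproduces their behavior on plain Java lists. The method names countPerElement, filterBy and mapElements are hypothetical analogues of Beam's Count, Filter and MapElements transforms, not Beam's actual implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustrative analogues of a few Beam built-in transformations, written
// against plain Java lists so the behavior can be seen without a pipeline.
public class BuiltInTransforms {

    // Analogue of counting per element: how many times each element occurs.
    static <T> Map<T, Long> countPerElement(List<T> input) {
        return input.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    // Analogue of filtering: keep only the elements matching the predicate.
    static <T> List<T> filterBy(List<T> input, Predicate<T> predicate) {
        return input.stream().filter(predicate).collect(Collectors.toList());
    }

    // Analogue of mapping: apply a function to every element.
    static <T, R> List<R> mapElements(List<T> input, Function<T, R> fn) {
        return input.stream().map(fn).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> letters = List.of("a", "b", "a", "c", "a");
        System.out.println(countPerElement(letters));
        System.out.println(filterBy(letters, l -> !l.equals("a"))); // [b, c]
        System.out.println(mapElements(letters, String::toUpperCase)); // [A, B, A, C, A]
    }
}
```

In a real Beam pipeline these operations would be applied to a PCollection through the apply method described in the previous section, and would execute lazily and in a distributed fashion rather than eagerly on a local list.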

This post continues the series of articles about Apache Beam. This time it presents what we can do with our distributed data. The first section described the API of data processing. We could learn that Apache Beam defines a universal processing API by exposing a single method called apply. The second part showed that the framework nevertheless provides some common built-in transformations, such as counting, min/max, mapping, filtering or grouping.