Articles about Apache Beam on waitingforcode.com - articles for the pleasure of learning and discovery

February 4, 2018 • Apache Beam

Joins in Apache Beam

Dealing with joins in relational databases is quite straightforward thanks to the underlying data structures (e.g. trees). However, joins are less convenient to work with in the data processing world, where schemaless and denormalized data rule.

February 4, 2018 • Apache Beam

Fanouts in Apache Beam's combine transform

Uneven load is one of the problems in distributed data processing. How can we ensure that none of the nodes becomes a straggler? Apache Beam proposes a solution in the form of a fanout mechanism applicable to the Combine transform.

January 28, 2018 • Apache Beam

Side output in Apache Beam

The ability to define several additional inputs for a ParDo transform is not the only feature of this type in Apache Beam. The framework also makes it possible to define one or more extra outputs through structures called side outputs.

January 28, 2018 • Apache Beam

Side input in Apache Beam

Very often, dealing with a single PCollection in the pipeline is sufficient. However, in some cases, for instance when one dataset complements another, several different distributed collections must be joined in order to produce meaningful results. Apache Spark deals with this through broadcast variables. Apache Beam has a similar mechanism called side input.

January 21, 2018 • Apache Beam

Dealing with state lifecycle in Apache Beam

As we saw in the previous post, Apache Beam makes it possible to deal with state. However, as we learned there, the state itself only allows us to keep something in memory for the duration of the window. After that, the state is removed. But thanks to another Beam feature called timers, we can deal with the expiring state just before its removal from the state store.

January 21, 2018 • Apache Beam

Stateful processing in Apache Beam

Real-time processing is, most of the time, related in some way to stateful processing: we may need to solve a sessionization problem, count the number of visitors per minute, and so on. Not surprisingly, Apache Beam comes with an API suited to implementing such solutions.

January 14, 2018 • Apache Beam

Triggers in Apache Beam

Another important aspect of windowing in Apache Beam concerns triggers. Thanks to them, we can freely control when the window results are computed.

January 14, 2018 • Apache Beam

Late data in Apache Beam

Data, especially in streaming applications, can very often arrive late at the processing pipeline. Despite that, Apache Beam is able to handle this case quite easily thanks to its watermark mechanism.

January 6, 2018 • Apache Beam

Windows in Apache Beam

As mentioned in one of the first posts about Apache Beam, the concept of a window is a key element in its data processing logic. Even for bounded data, a default window called global is defined. For unbounded data, the variety of windows is much greater.

January 6, 2018 • Apache Beam

Coders in Apache Beam

Since in distributed computing the data moves either locally (within a single worker) or remotely (between several different workers), it must have a format understandable by the machine. This format is guaranteed by serialization, which is also present in Apache Beam.

December 31, 2017 • Apache Beam

TransformHierarchy in Apache Beam

Apache Beam has some similarities with Apache Spark. One of them is the definition of the processing pipeline as a Directed Acyclic Graph.

December 31, 2017 • Apache Beam

Apache Beam pipeline configuration

Despite the serverless nature of Apache Beam's popular runners (e.g. Dataflow), configuration is still an important point. This post, through some of the provided runners, tries to show why.

December 22, 2017 • Apache Beam

Data partitioning in Apache Beam

The power of Big Data processing platforms lies mainly in the ability to parallelize processing across different nodes. Each framework has its own unit of parallelism. In Spark it's called a partition. Apache Beam calls it a bundle.

December 22, 2017 • Apache Beam

ParDo transformation in Apache Beam

The previous post introduced the built-in transformations available in Apache Beam. Most of them were presented, except ParDo, which is described now.

December 16, 2017 • Apache Beam

Data transformations in Apache Beam

Transformations are an intrinsic part of every data processing framework. Apache Beam is no exception: it provides some built-in transformations that can be freely extended with appropriate structures.

December 16, 2017 • Apache Beam

PCollection - data representation in Apache Beam

One of the problems with data processing frameworks released in the past few years was the use of different abstractions for batch and streaming tasks. Apache Beam is an exception to this rule because it offers a uniform data representation called PCollection.
