Below you can find all articles belonging to Data processing.
Very often, dealing with a single PCollection in the pipeline is sufficient. However, there are cases, for instance when one dataset complements another, where several different distributed collections must be joined in order to produce meaningful results. Apache Spark deals with this through broadcast variables. Apache Beam has a similar mechanism called side input.
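As a quick illustration, here is a minimal sketch of a side input with the Beam Python SDK; the currency codes and rates are made-up values for the example:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    rates = p | 'rates' >> beam.Create([('EUR', 1.14), ('GBP', 1.31)])
    orders = p | 'orders' >> beam.Create([('EUR', 100.0), ('GBP', 20.0)])

    # rates is materialized and made available to every worker as a dict
    in_usd = orders | 'convert' >> beam.Map(
        lambda order, fx: (order[0], order[1] * fx[order[0]]),
        fx=beam.pvalue.AsDict(rates))

    in_usd | beam.Map(print)
```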
As we saw in the previous post, Apache Beam brings the possibility to deal with state. However, as we learned there, the state only allows keeping something in memory for the duration of the window. After that, the state is removed. But thanks to another Beam feature called timers, we can deal with the expiring state just before its removal from the state store.
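Here is a minimal sketch of that idea with the Python SDK, assuming the DoFn is applied to a keyed PCollection of integers; the class and state names are made up for the example:

```python
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

class BufferWithExpiry(beam.DoFn):
    """Buffers values per key and flushes them when the event-time timer fires."""
    BUFFER = BagStateSpec('buffer', VarIntCoder())
    EXPIRY = TimerSpec('expiry', TimeDomain.WATERMARK)

    def process(self, element,
                window=beam.DoFn.WindowParam,
                buffer=beam.DoFn.StateParam(BUFFER),
                expiry=beam.DoFn.TimerParam(EXPIRY)):
        _, value = element
        buffer.add(value)
        # fire when the watermark reaches the end of the window, i.e. just
        # before the window's state is garbage collected
        expiry.set(window.end)

    @on_timer(EXPIRY)
    def flush(self, buffer=beam.DoFn.StateParam(BUFFER)):
        yield sum(buffer.read())
        buffer.clear()
```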
Real-time processing is most of the time related to stateful processing, whether we need to solve a sessionization problem, count the number of visitors per minute, and so on. Not surprisingly, Apache Beam comes with an API adapted to put in place solutions to these problems.
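A small sketch of that API in the Python SDK, assuming a keyed PCollection as input; CountPerKey is a hypothetical name:

```python
import apache_beam as beam
from apache_beam.transforms.combiners import CountCombineFn
from apache_beam.transforms.userstate import CombiningValueStateSpec

class CountPerKey(beam.DoFn):
    """Keeps a running per-key counter with Beam's state API."""
    COUNT = CombiningValueStateSpec('count', CountCombineFn())

    def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        key, _ = element
        count.add(1)          # increment the counter stored for this key
        yield key, count.read()
```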
Another important point of windowing in Apache Beam concerns triggers. Thanks to them, we can freely control when the window results are computed.
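For instance, a trigger can be declared directly on the windowing transform; a sketch with the Python SDK, where `events` stands for an assumed timestamped PCollection:

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import (AccumulationMode, AfterProcessingTime,
                                            AfterWatermark)

# emit speculative (early) results every 30s of processing time,
# then the final result when the watermark passes the end of the window
early_results = events | beam.WindowInto(
    window.FixedWindows(60),
    trigger=AfterWatermark(early=AfterProcessingTime(30)),
    accumulation_mode=AccumulationMode.ACCUMULATING)
```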
Data, especially in streaming applications, can very often arrive late to the processing pipeline. Despite that, Apache Beam is able to handle this case pretty easily thanks to the watermark mechanism.
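A sketch of one way to tolerate such late data with the Python SDK, again with `events` as an assumed timestamped PCollection:

```python
import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterCount, AfterWatermark

# accept elements arriving up to 10 minutes after the watermark has passed
# the end of the window and re-fire the pane once for each of them
late_tolerant = events | beam.WindowInto(
    window.FixedWindows(60),
    trigger=AfterWatermark(late=AfterCount(1)),
    accumulation_mode=AccumulationMode.ACCUMULATING,
    allowed_lateness=600)
```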
As mentioned in one of the first posts about Apache Beam, the concept of window is a key element of its data processing logic. Even for bounded data a default window, called global, is defined. For unbounded data the variety of windows is much wider.
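The three most common window types can be applied like this in the Python SDK (`events` is an assumed timestamped PCollection; durations are in seconds):

```python
import apache_beam as beam
from apache_beam.transforms import window

fixed    = events | 'fixed'    >> beam.WindowInto(window.FixedWindows(60))
sliding  = events | 'sliding'  >> beam.WindowInto(window.SlidingWindows(size=300, period=60))
sessions = events | 'sessions' >> beam.WindowInto(window.Sessions(gap_size=600))
```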
Since in distributed computing the data moves either locally (within a single worker) or remotely (between several different workers), it must have a format understandable by the machine. This format is guaranteed by the operation of serialization, also present in Apache Beam.
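In the Python SDK that serialization is handled by coders; a sketch of a custom coder for a hypothetical Visit class, registered so Beam can use it whenever a PCollection carries Visit elements:

```python
import apache_beam as beam
from apache_beam import coders

class Visit(object):
    def __init__(self, user_id, page):
        self.user_id = user_id
        self.page = page

class VisitCoder(coders.Coder):
    """Turns Visit objects into bytes and back so they can move between workers."""
    def encode(self, visit):
        return ('%s|%s' % (visit.user_id, visit.page)).encode('utf-8')

    def decode(self, encoded):
        user_id, page = encoded.decode('utf-8').split('|')
        return Visit(user_id, page)

    def is_deterministic(self):
        return True

# register the coder for the Visit type
coders.registry.register_coder(Visit, VisitCoder)
```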
Apache Beam has some similarities with Apache Spark. One of them is the definition of the processing pipeline as a Directed Acyclic Graph.
Despite the serverless nature of some of Apache Beam's popular runners (e.g. Dataflow), configuration is still an important point. This post, through some of the provided runners, tries to show why.
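As a taste of that configuration, a sketch of pipeline options targeting the Dataflow runner with the Python SDK; project, region and bucket values are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',        # placeholder GCP project
    region='europe-west1',
    temp_location='gs://my-bucket/tmp')

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(print)
```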
The power of Big Data processing platforms resides mainly in the ability to parallelize processing on different nodes. Each framework has its own unit of parallelism. In Spark it's called a partition. Apache Beam calls it a bundle.
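The bundle shows up in the Python SDK through the per-bundle lifecycle hooks of a DoFn; a small sketch with a hypothetical BundleAwareFn:

```python
import logging
import apache_beam as beam

class BundleAwareFn(beam.DoFn):
    """Illustrates the per-bundle lifecycle hooks exposed by a DoFn."""
    def start_bundle(self):
        # called once before the runner hands the bundle's elements to process()
        self._seen = 0

    def process(self, element):
        self._seen += 1
        yield element

    def finish_bundle(self):
        # called once per bundle, a natural place to flush batched external writes
        logging.info('Finished a bundle of %d elements', self._seen)
```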
The previous post introduced the built-in transformations available in Apache Beam. Most of them were presented - except ParDo, which will be described now.
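A minimal ParDo sketch with the Python SDK; the SplitWords DoFn is a made-up example emitting zero or more outputs per input element:

```python
import apache_beam as beam

class SplitWords(beam.DoFn):
    """Emits one output per word found in the input line."""
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    (p
     | beam.Create(['to be or not to be'])
     | beam.ParDo(SplitWords())
     | beam.Map(print))
```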
Transformations are an intrinsic part of each data processing framework. Apache Beam is not an exception and it also provides some built-in transformations that can be freely extended with appropriate structures.
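A few of those built-in transformations chained together in a Python SDK sketch (the key/value pairs are made-up data):

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([('EUR', 100), ('EUR', 50), ('GBP', 20)])
     | beam.Filter(lambda kv: kv[1] > 30)   # keep only amounts above 30
     | beam.CombinePerKey(sum)              # sum the remaining amounts per key
     | beam.Map(print))
```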
One of the problems with data processing frameworks released in the past few years was the use of different abstractions for batch and streaming tasks. Apache Beam is an exception to this rule because it proposes a uniform data representation called PCollection.
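A sketch of that uniformity with the Python SDK: a bounded and an unbounded source both produce a PCollection, so downstream transforms are written the same way (the path and topic below are placeholders):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

p = beam.Pipeline(options=PipelineOptions(streaming=True))

# bounded PCollection from a batch source
lines = p | 'batch' >> beam.io.ReadFromText('gs://my-bucket/input/*.txt')

# unbounded PCollection from a streaming source - same abstraction downstream
events = p | 'stream' >> beam.io.ReadFromPubSub(topic='projects/my-project/topics/events')
```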
Prior to the Spark 2.2.0 release, data processing was based on a set of heuristic rules that ignored the characteristics of the data. But the most recent release brought a tool well known from the RDBMS world - a Cost-Based Optimizer.
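A sketch of enabling it in PySpark; the table and column names in the ANALYZE statement are placeholders:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('cbo-demo')
         .config('spark.sql.cbo.enabled', 'true')             # CBO is off by default
         .config('spark.sql.cbo.joinReorder.enabled', 'true')
         .getOrCreate())

# the optimizer needs statistics to estimate costs
spark.sql('ANALYZE TABLE orders COMPUTE STATISTICS FOR COLUMNS customer_id, amount')
```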
One of the problems in distributed computing is failure detection. How can a master node know that one of its workers went down just a minute ago? A popular and quite simple solution uses heartbeats sent at regular intervals by the workers. Spark also implements this technique.
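The behaviour is driven by two configuration properties; a PySpark sketch using their default values:

```python
from pyspark.sql import SparkSession

# executors report to the driver every heartbeatInterval; if nothing is received
# within spark.network.timeout, the executor is considered lost
spark = (SparkSession.builder
         .config('spark.executor.heartbeatInterval', '10s')
         .config('spark.network.timeout', '120s')
         .getOrCreate())
```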
If you've ever analyzed Spark UI, you've certainly seen the Locality level column in the tasks table. Even if this concept is less exposed than topics such as shuffle, it remains quite important for efficient data processing.
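Locality can be tuned through the scheduler's wait times; a PySpark sketch with the default values:

```python
from pyspark.sql import SparkSession

# how long the scheduler waits for a data-local slot before falling back
# to a less local level (process-local -> node-local -> rack-local -> any)
spark = (SparkSession.builder
         .config('spark.locality.wait', '3s')
         .config('spark.locality.wait.node', '3s')
         .getOrCreate())
```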
A DataFrame can be either loaded or saved. And Spark SQL provides, as for a lot of other points, different strategies to deal with data persistence.
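A sketch of both directions in PySpark; the paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# loading: a format plus a path
orders = spark.read.format('json').load('/data/input/orders')

# saving: the save mode decides what happens when the target already exists
orders.write.format('parquet').mode('overwrite').save('/data/output/orders')
```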
It's time to continue the exploration of operator optimizations of logical plans in Spark SQL. After the first part describing optimizations from A to L, this post covers the remaining letters.
Predicate pushdown is one of the most popular optimizations in Spark SQL. But it's not the only one, and the main list of them is defined in the org.apache.spark.sql.catalyst.optimizer.Optimizer abstract class.
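Predicate pushdown is easy to observe from PySpark; a sketch with a placeholder Parquet path:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

orders = spark.read.parquet('/data/orders')
# the physical plan printed by explain() should list the predicate under
# PushedFilters for the Parquet scan, showing the pushdown at work
orders.filter(col('amount') > 100).explain(True)
```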
User Defined Functions are not the only way to extend Spark SQL. A second solution is offered by User Defined Aggregate Functions.
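The post discusses the Scala UserDefinedAggregateFunction API; as a loose PySpark counterpart, here is a sketch of a grouped aggregate pandas UDF (requires pandas and pyarrow; names and data are made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('A', 10.0), ('A', 30.0), ('B', 5.0)], ['customer', 'amount'])

@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def mean_amount(amounts):
    # amounts arrives as a pandas Series holding all values of the group
    return amounts.mean()

df.groupBy('customer').agg(mean_amount(df['amount'])).show()
```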