Looking for something else? Check the categories of Data processing:
Apache Beam, Apache Flink, Apache Spark, Apache Spark GraphFrames, Apache Spark GraphX, Apache Spark SQL, Apache Spark Streaming, Apache Spark Structured Streaming, PySpark
If not, below you can find all articles belonging to Data processing.
Recent years have been marked by the popularization of Kubernetes. Thanks to its replication and scalability properties it's more and more often used in distributed architectures. Apache Spark, through a dedicated working group, integrates Kubernetes steadily. In the current (2.3.1) version this new way to schedule jobs ships with the project as an experimental feature.
Some months ago I presented save modes in Spark SQL. However, that post was limited to their use with files. I was quite surprised to observe their specific behavior for RDBMS sinks, especially for SaveMode.Overwrite.
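To give an idea of the case discussed in the post, here is a minimal sketch of writing a DataFrame to an RDBMS with SaveMode.Overwrite (the MySQL URL and credentials are hypothetical):

```scala
import java.util.Properties

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("savemode-overwrite").getOrCreate()
import spark.implicits._

val users = Seq((1, "user_1"), (2, "user_2")).toDF("id", "login")

val connectionProperties = new Properties()
connectionProperties.setProperty("user", "mysql_user")         // hypothetical credentials
connectionProperties.setProperty("password", "mysql_password")

// By default SaveMode.Overwrite drops and recreates the target table; adding
// .option("truncate", "true") keeps the existing table definition and truncates it instead.
users.write
  .mode(SaveMode.Overwrite)
  .jdbc("jdbc:mysql://localhost:3306/tests", "users", connectionProperties)
```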
To scale Spark applications automatically we need to enable dynamic resource allocation. But to make it work we also need another feature, the external shuffle service, which is covered here.
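As an illustration, a minimal sketch of the application-side configuration; the shuffle service itself must additionally be started on the cluster nodes (e.g. on the YARN NodeManagers):

```scala
import org.apache.spark.sql.SparkSession

// Dynamic allocation lets executors come and go; the external shuffle service keeps
// serving their shuffle files so that removing an executor doesn't lose the data.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .getOrCreate()
```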
The commercial version of Apache Spark distributed by Databricks offers a serverless and auto-scalable approach for applications written in this framework. Over time some other companies tried to provide similar alternatives, going as far as putting Apache Spark pipelines into AWS Lambda functions. But with version 2.3.0 another answer to the scalability and elasticity overhead appears - Kubernetes.
Some months ago I wrote down my notes from building a Docker image for a Spark on YARN cluster. Recently I decided to improve the project and convert it to the Docker Compose format.
Some weeks ago I presented correlated scalar subqueries using the example of PostgreSQL. However, they can also be found in Big Data processing systems such as BigQuery or Apache Spark SQL.
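A small sketch of a correlated scalar subquery in Apache Spark SQL, on hypothetical teams and scores tables:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("correlated-scalar-subquery").getOrCreate()
import spark.implicits._

Seq((1, "team_1"), (2, "team_2")).toDF("id", "name").createOrReplaceTempView("teams")
Seq((1, 10), (1, 20), (2, 5)).toDF("team_id", "points").createOrReplaceTempView("scores")

// The inner query is correlated: it references teams.id from the outer query and
// must return at most one row per outer row (here guaranteed by the MAX aggregation).
val bestScores = spark.sql(
  """SELECT t.name,
    |  (SELECT MAX(s.points) FROM scores s WHERE s.team_id = t.id) AS best_score
    |FROM teams t""".stripMargin)
bestScores.show()
```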
In programming, simple is often synonymous with understandable and maintainable. However, it doesn't always mean efficient. One example of this thesis is the nested loop join, which is also present in Apache Spark SQL.
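A quick sketch showing how a non-equi join condition leads Spark SQL to a nested loop join strategy; the datasets are hypothetical and explain() reveals the BroadcastNestedLoopJoin node in the physical plan:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("nested-loop-join").getOrCreate()
import spark.implicits._

val orders = Seq((1, 100), (2, 250)).toDF("order_id", "amount")
val thresholds = Seq((200, "big"), (50, "small")).toDF("min_amount", "label")

// A non-equi condition cannot use a hash- or sort-based join, so Spark falls back
// to comparing every pair of rows - the nested loop strategy.
val labelled = orders.join(thresholds, orders("amount") >= thresholds("min_amount"))
labelled.explain()
```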
One important point for long-living queries is monitoring: it's always important to know how the query performs. In Structured Streaming we can follow the execution thanks to a special object called ProgressReporter.
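The measurements made by ProgressReporter are exposed publicly as StreamingQueryProgress objects, reachable through query.lastProgress or a listener. A minimal sketch with a StreamingQueryListener (the printed fields are only an example):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder().master("local[*]").appName("progress-reporting").getOrCreate()

// Each completed micro-batch produces a progress event built from ProgressReporter's data.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"Batch ${event.progress.batchId} processed ${event.progress.numInputRows} rows")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
})
```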
Structured Streaming guarantees end-to-end exactly-once delivery (in micro-batch mode) through the semantics applied to state management, the data source and the data sink. The state was covered in the post about the state store, but the 2 other parts still remain to be discovered.
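A minimal sketch of the pieces involved in that guarantee - a replayable source, checkpointed offsets and state, and a fault-tolerant file sink; the broker, topic and paths are hypothetical, and the Kafka source requires the spark-sql-kafka-0-10 dependency:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("exactly-once-sketch").getOrCreate()

// A replayable source: offsets of every micro-batch are written to the checkpoint
// before processing, so the same range can be re-read after a failure.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // hypothetical broker
  .option("subscribe", "events")                          // hypothetical topic
  .load()

// The file sink commits files atomically per batch, so replaying a batch doesn't duplicate data.
val query = input.selectExpr("CAST(value AS STRING) AS value")
  .writeStream
  .format("parquet")
  .option("path", "/tmp/events-sink")                     // hypothetical paths
  .option("checkpointLocation", "/tmp/events-checkpoint")
  .start()
query.awaitTermination()
```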
For years Apache Spark's streaming was perceived as working only with micro-batches. However, release 2.3.0 tries to change this and proposes a new execution model called continuous processing. Even though it's still experimental, it's worth learning more about it.
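A minimal sketch of a query running in continuous mode, using the built-in rate source and console sink; the trigger argument is the checkpointing interval, not a micro-batch interval:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().master("local[*]").appName("continuous-processing").getOrCreate()

// Continuous execution in 2.3.0 supports only map-like operations, so no aggregations here.
val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
query.awaitTermination()
```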
Stateful streaming processing in Apache Spark has evolved a lot since the first versions of the framework. At the beginning there was updateStateByKey but, judged inefficient, it was later replaced by mapWithState. With the arrival of Structured Streaming, that method was in turn replaced by mapGroupsWithState.
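A minimal sketch of mapGroupsWithState counting hypothetical click events per user; the socket source and the comma-separated line format are assumptions made only for the example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical click event keyed by user.
case class Click(userId: String, page: String)

// The state is a per-user counter, kept by the engine between micro-batches.
def countClicks(userId: String, clicks: Iterator[Click], state: GroupState[Long]): (String, Long) = {
  val newTotal = state.getOption.getOrElse(0L) + clicks.size
  state.update(newTotal)
  (userId, newTotal)
}

val spark = SparkSession.builder().master("local[*]").appName("map-groups-with-state").getOrCreate()
import spark.implicits._

val clicks = spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", 9999)
  .load()
  .as[String]
  .map(line => Click(line.split(",")(0), line.split(",")(1)))

val query = clicks
  .groupByKey(_.userId)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout())(countClicks)
  .writeStream
  .outputMode(OutputMode.Update())
  .format("console")
  .start()
query.awaitTermination()
```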
Recently we discovered the concept of state stores, used to handle stateful aggregations in Structured Streaming. But at that moment we didn't spend time on the aggregations themselves. As promised, they're described now.
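A minimal sketch of such a stateful aggregation - a streaming word count over a hypothetical socket source; every micro-batch merges the new rows with the counts kept in the state store:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("streaming-aggregation").getOrCreate()
import spark.implicits._

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost").option("port", 9999)   // hypothetical source
  .load()
  .as[String]

// The groupBy/count aggregation is stateful: its partial results survive across batches.
val wordCounts = lines.flatMap(_.split(" ")).groupBy("value").count()

wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()
```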
Structured Streaming introduced a lot of new concepts compared to DStream-based streaming. One of them is the output mode.
During my last exploration of Spark's RPC implementation one class caught my attention: StateStoreCoordinator, used by the state store, which is an important piece of Structured Streaming pipelines.
Over the last weeks I was focused on the Apache Beam project. After some reading, I discovered a lot of concepts shared between Beam and Spark Structured Streaming (or the other way around?). One of these similarities is triggers.
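For the Apache Spark side, a minimal sketch of defining a trigger on a streaming query (rate source and console sink, 10-second micro-batches; Trigger.Once() and Trigger.Continuous(...) are the other available factories):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().master("local[*]").appName("triggers").getOrCreate()

// Trigger.ProcessingTime fires a new micro-batch at the given interval.
val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
query.awaitTermination()
```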
The idea of watermark was first presented on the occasion of discovering the Apache Beam project. However, it's also implemented in Apache Spark to address the same problem - late data.
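A minimal sketch of a watermark combined with a windowed aggregation, on hypothetical events read from a socket source with a "timestamp,userId" line format:

```scala
import java.sql.Timestamp

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical event with an event-time column.
case class Event(eventTime: Timestamp, userId: String)

val spark = SparkSession.builder().master("local[*]").appName("watermark").getOrCreate()
import spark.implicits._

val events = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load()
  .as[String]
  .map { line =>
    val fields = line.split(",")
    Event(Timestamp.valueOf(fields(0)), fields(1))
  }

// Rows arriving more than 10 minutes behind the max observed event time are dropped,
// and the corresponding windows can be finalized and evicted from the state store.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"), $"userId")
  .count()

counts.writeStream.outputMode("update").format("console").start().awaitTermination()
```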
Communication is an important element of distributed systems. The cluster members rarely share hardware components and the only way to communicate is the exchange of messages in the client-server model.
Dealing with joins in relational databases is quite straightforward thanks to the underlying data structures (e.g. trees). However, it's less convenient to work with them in the data processing world, where schemaless data and denormalization rule.
Uneven load is one of the problems in distributed data processing. How to ensure that none of the nodes becomes a straggler? Apache Beam proposes a solution in the form of the fanout mechanism, applicable to the Combine transform.
The possibility to define several additional inputs for the ParDo transform is not the only feature of this kind in Apache Beam. The framework also provides the possibility to define one or more extra outputs through structures called side outputs.