Looking for something else? Check the categories of Data processing:
Apache Beam Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming PySpark
If not, below you can find all articles belonging to Data processing.
Last time I wrote about different hints present in RDBMS and Hive. Today it's the moment to implement one of them.
After 2 previous posts dedicated to custom optimization in Apache Spark SQL, it's a good moment to start to write the code. As Jacek Laskowski suggested on Twitter (link in Read more), I will try to implement one extra optimization hint. But first things first and let's start with hints definition.
Regressions are one of the risks of our profession. Fortunately, we can limit the risk thanks to different testing strategies. One of them are regression tests that we can use to check whether the modified data processing logic didn't introduce the regressions simply by comparing two datasets.
During my exploration of Apache Spark configuration options, I found an entry called spark.scheduler.mode. After looking for its possible values, I ended up with a pretty intriguing concept called FAIR scheduling that I will detail in this post.
In one of my previous posts I presented how to add a custom optimization to Apache Spark SQL. It was not a good moment to deep delve into the topic because of its complexity. That's why I will try to do a better job here by showing the API of native optimizations.
One of data governance goals is to ensure data consistency across different producers. Unfortunately, very often it's only a theory and especially when the data format is schemaless. It's why the data exploration is an important step in the process of data pipeline definition. In this post I wanted to do a small exercise and check how Apache Spark SQL behaves with inconsistent data.
Compression has a lot of benefits in the data context. It reduces the size of stored data, so you will save some space and also have less data to transfer across the network in the case of a data processing pipeline. And if you use Bzip2, you can process the compressed data in parallel. In this post, I will try to explain how does it happen.
Apache Spark 2.4.0 brought a lot of internal changes but also some new features exposed to the end users, as already presented high-order functions. In this post, I will present another new feature, or rather 2 actually, because I will talk about 2 new SQL functions.
Some time ago I was thinking whether Apache Spark provides the support for auto-incremented values, so hard to implement in distributed environments After some research, I almost found what I was looking for - monotonically increasing id. In this post, I will try to explain why "almost".
In November 2018 bithw1 pointed out to me a feature that I haven't used yet in Apache Spark - custom optimization. After some months consacred to learning Apache Spark GraphX, I finally found a moment to explore it. This post begins a new series about Apache Spark customization and it covers the basics, i.e. the 2 available methods to add the custom optimizations.
In the previous post in GraphFrames category I mentioned the motifs finding feature and promised to write a separate post about it. After several weeks dedicated to the Apache Spark 2.4.0 features, I finally managed to find some time to explore the motifs finding in GraphFrames.
Some time ago I was asked by Sunil whether it was possible to load the initial state in Apache Spark Structured Streaming like in DStream-based API. Since the response was not obvious, I decided to investigate and share the findings through this post.
In previous posts about memory in Apache Spark, I've been exploring memory behavior of Apache Spark when the input files are much bigger than the allocated memory. After that it's a good moment to sum up that in the post dedicated to classes involved in memory using tasks.
When I first heard about the foreachBatch feature, I thought that it was the implementation of foreachPartition in the Structured Streaming module. However, after some analysis I saw how I was wrong because this new feature addresses other but also important problems. You will find more .
Data-driven systems continuously change. We moved from static, batch-oriented daily processing jobs to real-time streaming-based pipelines running all the time. Nowadays, the workflows have more and more AI compontents. Apache Spark tries to stay in the movement and in the new release proposes the implementation of the barrier execution mode as a new way to schedule tasks.
The Project Tungsten revolutionized Apache Spark ecosystem. Thanks to the new row-based data structure the jobs became more performant and easier to create. This revolution first affected the batch processing and later the streaming one. As of writing the following article, the graph processing is still not impacted but hopefully GraphFrames project can change this.
Apache Avro became one of the serialization standards, among others because of its use in Apache Kafka's schema registry. Previously to work with Avro files with Apache Spark we needed Databrick's external package. But it's no longer the case starting from 2.4.0 release where Avro became first-class citizen data source.
One of important characteristics of distributed graph processing which makes it different from classical Map/Reduce approach is the iterative nature of many algorithms. Pregel is one of the computation models that supports such kind of processing very well, while Apache Spark GraphX comes with its own Pregel implementation.
The series about the features introduced in Apache Spark 2.4.0 continues. Today's post will cover higher-order functions that you may know from elsewhere.