Articles about Data processing on waitingforcode.com - articles for the pleasure of learning and discovery

Looking for something else? Check the categories of Data processing:

Apache Beam Apache Flink Apache Spark Apache Spark GraphFrames Apache Spark GraphX Apache Spark SQL Apache Spark Streaming Apache Spark Structured Streaming PySpark

If not, below you can find all articles belonging to Data processing.

May 21, 2017 • Apache Spark

Code execution on driver and executors

Keeping in mind which parts of Spark code are executed on driver and which ones on workers is important and can help to avoid some of annoying errors, as the ones related to serialization.

Continue Reading →

May 13, 2017 • Apache Spark

Tree aggregations in Spark

As every library, Spark has methods than are used more often than the others. As often used methods we could certainly define map or filter. In the other side of less popular transformations we could place, among others, tree-like methods that will be presented in this post.

Continue Reading →

May 13, 2017 • Apache Spark

isEmpty() trap in Spark

In general Spark's actions reflects logic implemented in a lot of equivalent methods in programming languages. As an example we can consider isEmpty() that in Spark checks the existence of only 1 element and similarly in Java's List. But it can often lead to troubles, especially when more than 1 action is invoked.

Continue Reading →

May 7, 2017 • Apache Spark

Testing strategies in Spark

After writing a post about testing Spark applications, I decided to take a look at Spark project tests and see which patterns they use to verify framework features.

Continue Reading →

May 7, 2017 • Apache Spark

Testing Spark applications

It's difficult to contest the importance of testing in programming. Tests help to avoid regressions (a lot of regressions) and also to better understand developed code. Spark (and other data processing frameworks by the way) is not an exception of this rule. But, obviously, testing applications working in distributed mode is more tricky than in the case of standalone programs.

Continue Reading →

April 30, 2017 • Apache Spark Streaming

SparkException: org.apache. spark. streaming. dstream. MappedDStream@7a388990 has not been initialized

Metadata checkpoint is useful in quickly restoring failing jobs. However, it won't work if the context creation and processing parts aren't declared correctly.

Continue Reading →

April 30, 2017 • Apache Spark Structured Streaming

Structured streaming

Project Tungsten, explained in one of previous posts, brought a lot of optimizations - especially in terms of memory use. Until now it was essentially used by Spark SQL and Spark MLib projects. However, since 2.0.0, some work was done to integrate DataFrame/Dataset in streaming processing (Spark Streaming).

Continue Reading →

April 15, 2017 • Apache Spark

Jobs, stages and tasks

Every distributed computation is divided in small parts called jobs, stages and tasks. It's useful to know them especially during monitoring because it helps to detect bottlenecks.

Continue Reading →

April 2, 2017 • Apache Spark SQL

User Defined Type

Spark SQL schema is very flexible. It supports global data types, as booleans, integers, strings, but it also supports custom data types called User Defined Type (UDT).

Continue Reading →

April 2, 2017 • Apache Spark SQL

Schemas

Spark SQL - even if the SQL suffix makes automatically think about RDBMS - works well with other data sources, as even plain CSVs or JSON files. This relation would be difficult to achieve without the concept of schema.

Continue Reading →

March 25, 2017 • Apache Spark SQL

Generated code in Spark SQL

One of powerful features of Spark SQL is dynamic generation of code. Several different layers are generated and this post explains some of them.

Continue Reading →

March 25, 2017 • Apache Spark SQL

Spark Project Tungsten

Even if Project Tungsten was started in Spark 1.5 and Spark's current version is 2.1 at the time of writing, it's good to know what precious this Project brought to Spark.

Continue Reading →

February 26, 2017 • Apache Spark SQL

Catalyst Optimizer in Spark SQL

The use of Dataset abstraction is not a single difference between structured and unstructured data processing in Spark. Apart of that, Spark SQL uses a technique helping to get results faster.

Continue Reading →

February 26, 2017 • Apache Spark SQL

Dataset in Spark SQL

Spark 2.0 brought some changes at API level. One of them was the merge of DataFrame with Dataset. Thanks to that the 3rd data abstraction, present yet in 1.6, was finally removed.

Continue Reading →

February 19, 2017 • Apache Spark

Configuration of Spark architecture members

Often a misconfiguration is the reason of all kinds of issues - performance, security or functional. Spark isn't an exception for this rule and it's the reason why this article focuses on configuration properties available for driver and executors.

Continue Reading →

February 19, 2017 • Apache Spark

Spark shuffle - complementary notes

This small post is the complement for previous article describing big lines of shuffle. It focuses more in details on writing part.

Continue Reading →

February 12, 2017 • Apache Spark

Memory management in Spark

Memory management in Spark went through some changes. In the first versions, the allocation had a fix size. Only the 1.6 release changed it to more dynamic behavior. This change will be the main topic of the post.

Continue Reading →

February 4, 2017 • Apache Spark

Shuffling in Spark

As already told in one of previous posts about Spark, shuffle is a process which moves data between nodes. It's orchestrated by a specific manager and it will be the topic of this post.

Continue Reading →

January 29, 2017 • Apache Spark

Cache in Spark

Cache is an appreciable tool when we have a greedy computation generating a lot of data. Spark also uses this feature to better handle the case of RDD which generation is heavy (for example necessities database connection or data retrieval from external web services).

Continue Reading →

January 29, 2017 • Apache Spark

Checkpointing in Spark

Checkpointing is, alongside caching, a method allowing to make a RDD persist. But there are some subtle differences between cache and checkpoint.

Continue Reading →

Data processing articles