Articles about Data processing on waitingforcode.com - articles for the pleasure of learning and discovery


January 9, 2019 • Apache Spark Structured Streaming

Apache Spark 2.4.0 features - watermark configuration

The series about Apache Spark 2.4.0 features continues. After last week's discovery of bucket pruning, it's time to switch to the Structured Streaming module and look at its major evolution.

January 3, 2019 • Apache Spark GraphX

Edge partitioning strategies

Previously we learned about the vertex and edge representations in Apache Spark GraphX. To avoid introducing too many new concepts at once, we deliberately left out edge partitioning at that time. This week's post covers it.

January 2, 2019 • Apache Spark SQL

Apache Spark 2.4.0 features - bucket pruning

This post begins a new series dedicated to Apache Spark 2.4.0 features. The first topic covered is bucket pruning.

December 27, 2018 • Apache Spark SQL

Defining schemas in Apache Spark SQL with builder design pattern

Schemas are one of the key parts of Apache Spark SQL and what distinguishes it from the old RDD-based API. When we deal with data coming from a structured data source, such as a relational database or a schema-based file format, we can let the framework resolve the schema for us. But things get complicated when we're working with semi-structured data such as JSON and must define the schema by hand.

December 26, 2018 • Apache Spark GraphX

Edge representation in Apache Spark GraphX

After last week's discovery of VertexRDD, we still have one graph-composing item to explain: EdgeRDD. After all, a graph is about relationships, and this RDD represents the links between vertices.

December 20, 2018 • Apache Spark SQL

The who, when, how and what of Apache Spark SQL code generation

The code generated by Apache Spark for queries defined with higher-level concepts such as SQL is key to understanding the performance of the processing logic. This post, started after a discussion on my GitHub, tries to explain some of the basics of the code generation workflow.

December 19, 2018 • Apache Spark GraphX

Vertex representation in Apache Spark GraphX

After last week's global overview of graph representation in the GraphX module, it's time to go a little deeper and analyze the 2 main components of graphs: vertices and edges. We'll begin here with the former.

December 13, 2018 • Apache Spark SQL

DataFrame and file bigger than available memory

Some weeks ago I wrote a post about the impact of files with long lines on RDD-based processing. The post showed how difficult it is to process such files because of OOM errors. In this post I wanted to check how this applies to Datasets.

December 12, 2018 • Apache Spark GraphX

GraphX and fault-tolerance

Bad things happen in distributed data processing, and it's better to be prepared for them. To protect against such issues, Apache Spark is able to recompute a failed partition and also to store a computation snapshot as a checkpoint. Both properties apply to the GraphX module's fault-tolerance mechanism.

December 6, 2018 • Apache Spark

Apache Spark and off-heap memory

With data-intensive applications such as streaming ones, bad memory management can add long GC pauses. Luckily, we can reduce this impact by writing memory-optimized code and by using storage outside the heap, called off-heap.

December 5, 2018 • Apache Spark GraphX

Graphs representation in Apache Spark GraphX

Apache Spark uses a common data abstraction for all its higher-level data structures. GraphX is no exception to this implementation rule: it is represented by a set of specialized versions of RDDs.

November 30, 2018 • Apache Spark SQL

Multiple SparkSession for one SparkContext

Some months ago bithw1 posted an interesting question on my GitHub about multiple SparkSessions sharing the same SparkContext. If you have similar questions, feel free to ask; maybe one of them will give birth to a more detailed post that adds some value for the community. This post, at least, tries to do so by answering the question.

November 28, 2018 • Apache Spark GraphX

Introduction to Apache Spark GraphX

Every time we learn a new topic, it's important to start from the basics. We couldn't learn a new language without knowing the order of subject and verb in a sentence. The same rule applies to Apache Spark's GraphX module, which will be covered in this category. But before going into details, we'll focus on its basics.

November 14, 2018 • Apache Spark SQL

Apache Spark SQL, Hive and insertInto command

Some time ago on my GitHub, bithw1 pointed out an interesting behavior of the Hive integration in Apache Spark SQL. Without delving too much into details now, I can say that the behavior was about the DataFrame schema not being respected. Our quick exchange ended with an explanation, but it also encouraged me to go into much more detail to understand the hows and whys.

November 7, 2018 • Apache Spark SQL

Grouping sets in Apache Spark SQL

Apache Spark SQL provides advanced analytics features that are usually found in more classical OLAP workloads. Below I'll explain one of them: grouping sets.

October 18, 2018 • Apache Spark

Neo4j scalability and Apache Spark

Even though Apache Spark provides the GraphX module, it's still possible to use the framework with other graph-based engines. One of them is Neo4j. But before using its Spark connector, it's good to know some internal execution details, especially the ones related to scalability.

October 11, 2018 • Apache Spark

Apache Spark and data bigger than the memory

Unlike Hadoop MapReduce, Apache Spark uses the power of memory to speed up data processing. But does that mean we can't process datasets bigger than the memory limits? The short survey below tries to answer that question.

October 3, 2018 • Apache Spark SQL

Custom projection pushdown in Apache Spark SQL for JSON columns

Most RDBMSs are able to store JSON documents in columns of a JSON-like type. One of them is PostgreSQL, which can keep JSON documents in one of 2 column types (JSON or JSONB) and natively supports querying JSON document attributes. As we'll see below, with a little effort we can implement similar behavior in Apache Spark SQL.

September 30, 2018 • Apache Spark

Apache Spark and data compression

Compressed data takes less space and thus may be sent faster across the network. However, these advantages turn into drawbacks in parallel distributed data processing when the engine doesn't know how to split the data for better parallelization. Fortunately, some compression formats can be split.

September 23, 2018 • Apache Spark SQL

Dealing with nested data in Apache Spark SQL

Nested data structures are very useful for data denormalization in Big Data workloads. They avoid the joins we would otherwise need across several related, fully normalized datasets. But processing such structures is not always simple. Fortunately, Apache Spark SQL provides various utility functions that help to work with them.

