Waiting for code

on waitingforcode.com

GraphX and fault-tolerance

Bad things happen in distributed data processing and if we're prepared for them, it's better. To prevent against such issues Apache Spark is able to recompute failed partition but also to store the computation snapshot as a checkpoint. Both properties apply to GraphX module's fault-tolerance mechanism. Continue Reading →

Apache Spark and off-heap memory

With data-intensive applications as the streaming ones, bad memory management can add long pauses for GC. Luckily, we can reduce this impact by writing memory-optimized code and using the storage outside the heap called off-heap. Continue Reading →

Scala rich data types

Have you ever wondered why in Scala we can directly reverse a String and in Java we must use a StringBuilder especially for it? If yes, this post provides a little bit more explanation by focusing on Scala's data types equivalents to Java's primitives (+ String) called rich wrappers. Continue Reading →

Multiple SparkSession for one SparkContext

Some months ago bithw1 posted an interesting question on my Github about multiple SparkSessions sharing the same SparkContext. If you have similar interrogations, feel free to ask - maybe it will give a birth to more detailed post adding some more value to the community. This post, at least, tries to do so by answering the question. Continue Reading →

Introduction to Apache Spark GraphX

Every time when we learn a new topic, it's important to start from the basics. We couldn't learn a new language without knowing the order of subject and verbs in a sentence. The same rule applies to Apache Spark's GraphX module that will be covered in this category. But before going into details, we'll focus on its basics. Continue Reading →

Scala Futures

With increasing number of computation power, the parallel computing gained the popularity during the last years. Java's concurrent package is one of the proofs for that. But Scala, even though it's able to work with Java's concurrent features, comes also with its own mechanisms. Futures are one of them. Continue Reading →

Self-types in Scala

The series about Scala types system continues. After last week's article about path-dependent types, it's time to discover another type-related feature - self-types. Continue Reading →

Apache Spark SQL, Hive and insertInto command

Some time ago on my Github bithw1 pointed out an interesting behavior of Hive integration on Apache Spark SQL. To not delve too much into details now, I can tell that the behavior was about not respected DataFrame schema. Our quick exchange ended up with an explanation but it also encouraged me to go much more into details to understand the hows and whys. Continue Reading →

Graph mining

Because of its connected nature, graph structure has its own branch in data mining. Thanks to this branch we can get insight into relationships and dependencies between vertices. Continue Reading →

Path-dependent types in Scala

Before I went to Scala I had never imagined that we could do such many things nothing but with a types system. Aside of higher-kinded types or type boundaries that we can easily find in other languages, Scala offers more advanced type features as path-dependent types covered below. Continue Reading →