Waiting for code

on waitingforcode.com

Chaos in streaming graph processing

Some time ago I wrote a post about graph data processing with streams. That article was based on the X-Stream framework proposed by researchers at EPFL. On that occasion, I also mentioned a newer alternative to X-Stream, adapted for distributed workloads, called Chaos. I deliberately left Chaos out of the previous post, since presenting it alongside X-Stream would have introduced too many new concepts. But now, after some weeks of graph processing discoveries, I would like to return to the successor of X-Stream and present it in more detail. Continue Reading →

Annotations in Scala

When I was working with Java and the Spring framework, annotations were my daily companions. When I started to work with Scala, I didn't see them much, which was quite surprising at first. But with every new line of code I wrote, I started to see them more and more. Now is a good moment to summarize that experience and focus on Scala annotations. Continue Reading →

Memory and Apache Spark classes

In previous posts about memory in Apache Spark, I explored the memory behavior of Apache Spark when the input files are much bigger than the allocated memory. Now is a good moment to sum that up in a post dedicated to the classes involved in the memory use of tasks. Continue Reading →

Visualizing Apache Spark GraphX data processing with websockets and cytoscape.js

For a long time, I've wanted to build a small real-time data visualization application using websockets and some fancy JavaScript visualization framework. The moment came when I was preparing the execution diagrams to illustrate the distributed graph algorithms covered in the Graph algorithms in distributed world - part 1 post. There I used static images combined together, but it was quite painful. Because of that, I decided to check whether it's possible to do it in a more programmatic way. Continue Reading →

Work-stealing in Scala

When I was reading about the Await implementation in Scala, I found a method called blocking. At that time I read some articles to understand it, but I didn't have a chance to play with it. Now I have, and I will share my findings with you. Continue Reading →

Doing data on AWS - overview

Open Source provides a lot of interesting tools to deal with Big Data: Apache Spark, Apache Kafka, Parquet, to name only a few. However, nowadays data platforms without cloud support are increasingly rare. That's why this topic merits its own category and posts on this blog. To avoid going too quickly, the first article covers the services you can use to work with data on AWS. Continue Reading →

Type specialization in Scala

When I was analyzing one of the Apache Spark GraphX functions, I came across a class annotated with @specialized for the first time. I then decided to find out more about it and share what I learned with you in this post. Continue Reading →

SQL and intersect operation

With modern Big Data solutions like BigQuery or Apache Spark SQL, knowledge of advanced SQL concepts is important. After covering operations like window functions or grouping sets, it's time to show another interesting SQL feature, the INTERSECT operator. Continue Reading →

Apache Spark 2.4.0 features - barrier execution mode

Data-driven systems change continuously. We moved from static, batch-oriented daily processing jobs to real-time, streaming-based pipelines running all the time. Nowadays, workflows have more and more AI components. Apache Spark tries to keep up with this trend and, in the new release, proposes an implementation of the barrier execution mode as a new way to schedule tasks. Continue Reading →

Creating graphs in GraphFrames

Project Tungsten revolutionized the Apache Spark ecosystem. Thanks to the new row-based data structure, jobs became more performant and easier to create. This revolution first affected batch processing and later streaming. At the time of writing this article, graph processing is still not impacted, but hopefully the GraphFrames project can change this. Continue Reading →

Apache Spark 2.4.0 features - Avro data source

Apache Avro became one of the serialization standards, in part because of its use in Apache Kafka's schema registry. Previously, to work with Avro files in Apache Spark we needed Databricks' external package. That is no longer the case starting from the 2.4.0 release, where Avro became a first-class data source. Continue Reading →

Stream safely in Scala

Scala Stream offers something we are not used to seeing in other languages: lazy computation of values combined with memoization. However, this is sometimes misleading, and some people think of Streams as iterators, i.e. a data structure that computes results and then forgets them. Such thinking can often lead to memory problems, especially with infinite streams. Continue Reading →