Waiting for code

on waitingforcode.com

Apache Spark SQL, Hive and insertInto command

Some time ago on my Github bithw1 pointed out an interesting behavior of Hive integration on Apache Spark SQL. To not delve too much into details now, I can tell that the behavior was about not respected DataFrame schema. Our quick exchange ended up with an explanation but it also encouraged me to go much more into details to understand the hows and whys. Continue Reading →

Graph mining

Because of its connected nature, graph structure has its own branch in data mining. Thanks to this branch we can get insight into relationships and dependencies between vertices. Continue Reading →

Path-dependent types in Scala

Before I went to Scala I had never imagined that we could do such many things nothing but with a types system. Aside of higher-kinded types or type boundaries that we can easily find in other languages, Scala offers more advanced type features as path-dependent types covered below. Continue Reading →

Graph partitioning

As told many times in previous posts, one of the most challenging tasks in distributed graph processing is the partitioning. Connected nature of the graph components makes the partitioning hard. Hopefully, the researchers continue to propose the solutions. Continue Reading →

Structural types

Duck typing and Scala aren't the words going well together. However with a special kind of types we can achieve similar effect that in dynamically typed languages. Continue Reading →

Graph storage

Until now we've discovered exclusively the concepts devoted to computing distributed graphs. But the compute part can't go without storage. And since for the latter in the context of graph we can't talk about the storage, it requires its own detailed explanation. Continue Reading →

Graph algorithms in distributed world - part 1

During last weeks we've discovered a lot about graph data processing in distributed world. However we haven't learned yet about the problems the graphs can solve. And it's as important as the knowledge about the processing techniques. Hopefully, this post will try to catch up this late. Continue Reading →

Companion objects

Scala doesn't come with static keyword as Java does. However, with object singleton type it allows to define static properties and methods. It goes even further and thanks to object "class" lets us to build an instruction called companion objects. Continue Reading →

Streaming and graph processing

Use cases of streaming surprise me more and more. In my recent research about graph processing in Big Data era I found a paper presenting the graph framework working on vertices and edges directly from a stream. In case you've missed that paper I'll try to present this idea to you. Continue Reading →

Vertex-centric graph processing

Graph data processing, even though seems to be less popular than streaming or files processing, is an important member of data-oriented systems. And as its "colleagues", it also has some different processing logics. The first described in this blog is called vertex-centric. Continue Reading →

Neo4j scalability and Apache Spark

Even though Apache Spark provides GraphX module, it's still possible to use the framework with other graph-based engines. One of them is Neo4j. But before using its Spark connector, it's good to know some internal execution details - especially the ones related to scalability. Continue Reading →

Handling exceptions in Scala

At first glance, the try-catch block seems to be the preferred approach to deal with exceptions for the people coming to Scala from Java. However, in reality, this approach is not a single one available in Scala. Continue Reading →