posts from Github articles

on waitingforcode.com
Articles tagged with posts from Github. There are 5 article(s) corresponding to the tag posts from Github. If you don't find what you're looking for, please check related tags: Apache Spark 2.4.0 features, Apache Spark data sources, completable future, data patterns, graph partitioning, POC.

Introduction to custom optimization in Apache Spark SQL

In November 2018 bithw1 pointed out to me a feature that I haven't used yet in Apache Spark - custom optimization. After some months consacred to learning Apache Spark GraphX, I finally found a moment to explore it. This post begins a new series about Apache Spark customization and it covers the basics, i.e. the 2 available methods to add the custom optimizations. Continue Reading →

Apache Spark and off-heap memory

With data-intensive applications as the streaming ones, bad memory management can add long pauses for GC. Luckily, we can reduce this impact by writing memory-optimized code and using the storage outside the heap called off-heap. Continue Reading →

Multiple SparkSession for one SparkContext

Some months ago bithw1 posted an interesting question on my Github about multiple SparkSessions sharing the same SparkContext. If you have similar interrogations, feel free to ask - maybe it will give a birth to more detailed post adding some more value to the community. This post, at least, tries to do so by answering the question. Continue Reading →

Apache Spark SQL, Hive and insertInto command

Some time ago on my Github bithw1 pointed out an interesting behavior of Hive integration on Apache Spark SQL. To not delve too much into details now, I can tell that the behavior was about not respected DataFrame schema. Our quick exchange ended up with an explanation but it also encouraged me to go much more into details to understand the hows and whys. Continue Reading →