Waiting for code

on waitingforcode.com

JVM JIT optimizations

Last time I was writing a post about the "whys" of code generation. And I found a lot of points related to JVM JIT optimizations. This time I will show what kind of optimization the JVM is capable of at runtime. Continue Reading →

Tips to discover internals of an Open Source framework internals - Apache Spark use case

Apache Spark is a special library for me because it helped me a lot at the beginning of my data engineering adventure to learn Scala and data-oriented concept. This "learn-from-existent-lib" approach helped me also to discover some tips & tricks about reading others code. Even though I used them mostly to discover Apache Spark, I believe that they are applicable to other JVM-based projects and will help you at least a little bit to understand other Open Source frameworks. Continue Reading →

Less popular aggregation functions in Apache Spark SQL

There are 2 popular ways to come to the data engineering field. Either you were a software engineer and you were fascinated by the data domain and its problems (I did). Or simply you evolved from a BI Developer. The big advantage of the latter path is that these people spent a lot of time on writing SQL queries and their knowledge of its functions is much better than for the people from the first category. This post is written by a data-from-software engineer who discovered that aggregation is not only about simple arithmetic values but also about distributions and collections. Continue Reading →

Vectorized operations in Apache Spark SQL

When I was preparing my talk about Apache Spark customization, I wanted to talk about User Defined Types. After some digging, I saw that there are some UDT in the source code and one of them was VectorUDT. And it led me to the topic of this post which is the vectorization. Continue Reading →

Apache Airflow and sequential execution

One of patterns that you may implement in batch ETL is sequential execution. It means that the output of one job execution is a part of the input for the next job execution. Even though Apache Airflow comes with 3 properties to deal with the concurrence, you may need another one to avoid bad surprises. Continue Reading →

CASE - SQL if-else

CASE operator is maybe one of the most unknown by the beginner users of SQL. Often when I see a question how to write an if-else condition in a SQL query, some people advise to write a UDF and use if-else directly inside. As you will see in this post, this solution is a little bit overkill though. Continue Reading →