Waiting for code

on waitingforcode.com

Tips to discover the internals of an Open Source framework - Apache Spark use case

Apache Spark is a special library for me because, at the beginning of my data engineering adventure, it helped me a lot to learn Scala and data-oriented concepts. This "learn-from-an-existing-library" approach also helped me discover some tips and tricks for reading other people's code. Even though I used them mostly to explore Apache Spark, I believe they apply to other JVM-based projects and will help you, at least a little, to understand other Open Source frameworks.

Less popular aggregation functions in Apache Spark SQL

There are 2 popular ways to come to the data engineering field. Either you were a software engineer fascinated by the data domain and its problems (as I was), or you evolved from a BI developer role. The big advantage of the latter path is that these people have spent a lot of time writing SQL queries, so their knowledge of SQL functions is much better than that of people from the first category. This post is written by a software-engineer-turned-data-engineer who discovered that aggregation is not only about simple arithmetic values but also about distributions and collections.

Vectorized operations in Apache Spark SQL

When I was preparing my talk about Apache Spark customization, I wanted to cover User Defined Types. After some digging, I saw that the source code contains some UDTs, and one of them was VectorUDT. That led me to the topic of this post: vectorization.

Apache Airflow and sequential execution

One of the patterns that you may implement in batch ETL is sequential execution. It means that the output of one job execution is part of the input for the next job execution. Even though Apache Airflow comes with 3 properties to deal with concurrency, you may need another one to avoid bad surprises.

CASE - SQL if-else

The CASE operator is perhaps one of the least known among beginner SQL users. Often, when I see a question about how to write an if-else condition in a SQL query, some people advise writing a UDF and using if-else directly inside it. As you will see in this post, that solution is a bit overkill.
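To make the if-else idea concrete, here is a minimal sketch of CASE used directly in a query, with no UDF involved. It runs on an in-memory SQLite database from Python; the `orders` table, its columns, and the size thresholds are made up for illustration and do not come from the post.

```python
import sqlite3

# In-memory database with a hypothetical "orders" table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 5), (2, 50), (3, 500)])

# CASE evaluates its WHEN branches in order and returns the first match,
# falling back to ELSE, which is the SQL counterpart of if / else if / else.
rows = conn.execute(
    """
    SELECT id,
           CASE
               WHEN amount < 10 THEN 'small'
               WHEN amount < 100 THEN 'medium'
               ELSE 'large'
           END AS size
    FROM orders
    ORDER BY id
    """
).fetchall()

print(rows)
```

The same expression should work in most SQL dialects, since CASE is part of standard SQL.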

EXISTS operator in SQL

Years ago, when I started working as a software engineer, I overused the IN/NOT IN operator. One day, one of my colleagues suggested that I replace it with EXISTS/NOT EXISTS in some queries, and it helped improve the performance of these queries. If some of you are like "me years ago", I prepared this short post introducing the EXISTS/NOT EXISTS operator by comparing it to IN/NOT IN.
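As a rough illustration of the comparison, the sketch below writes the same two filters with IN/NOT IN and with EXISTS/NOT EXISTS. It uses an in-memory SQLite database from Python; the `customers` and `orders` tables are hypothetical. It only shows that both forms return the same rows on this data. Whether one is faster depends on the database engine and its optimizer, so it makes no performance claim.

```python
import sqlite3

# Hypothetical tables, created only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE orders (customer_id INTEGER)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob"), (3, "Cid")])
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (1,), (3,)])


def names(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


# Customers with at least one order, written with IN and with EXISTS.
with_in = names(
    "SELECT name FROM customers c "
    "WHERE c.id IN (SELECT customer_id FROM orders) ORDER BY name"
)
with_exists = names(
    "SELECT name FROM customers c "
    "WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id) ORDER BY name"
)

# Customers without any order, written with NOT IN and with NOT EXISTS.
without_not_in = names(
    "SELECT name FROM customers c "
    "WHERE c.id NOT IN (SELECT customer_id FROM orders) ORDER BY name"
)
without_not_exists = names(
    "SELECT name FROM customers c "
    "WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id) ORDER BY name"
)

print(with_in, with_exists, without_not_in, without_not_exists)
```

The EXISTS form uses a correlated subquery: the inner query refers to the outer row through `c.id`, and the engine only needs to know whether at least one matching row exists.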

Testing sensors in Apache Airflow

Unit tests are the backbone of any software, data-oriented software included. However, testing some parts that way may be difficult, especially when they interact with the external world. An Apache Airflow sensor is one example from that category. Fortunately, thanks to Python's dynamic nature, testing sensors can be simplified a lot.
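One way to picture this is the sketch below, which patches the part of a sensor that talks to the outside world. It deliberately does not import Airflow: `FileLandedSensor`, its `_file_exists` helper, and the file path are hypothetical stand-ins, and the only thing borrowed from Airflow is the convention that a sensor exposes a `poke(context)` method returning a boolean. The approach described in the post may differ.

```python
from unittest import mock


class FileLandedSensor:
    """Hypothetical stand-in for an Airflow sensor; only poke() is modelled."""

    def __init__(self, path):
        self.path = path

    def _file_exists(self):
        # In a real sensor this would query an external system (file store, API...).
        raise RuntimeError("external call, not available in a unit test")

    def poke(self, context):
        # A sensor reports True once the awaited condition is met.
        return self._file_exists()


sensor = FileLandedSensor("/data/input.csv")  # path is illustrative only

# Because Python attributes can be replaced at runtime, the external call is
# swapped for a mock that answers False on the first poke and True on the second.
with mock.patch.object(FileLandedSensor, "_file_exists", side_effect=[False, True]):
    results = [sensor.poke(context={}), sensor.poke(context={})]

print(results)
```

Keeping the external call in a small helper method makes it the only thing a test has to replace, so the polling logic itself is exercised unchanged.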