Waiting for code

on waitingforcode.com


ALTER DEFAULT PRIVILEGES in PostgreSQL

At first glance, managing user access in PostgreSQL is easy: you execute CREATE USER, grant some privileges, assign a role, and often that's all. However, after some time, "permission denied" errors can appear as new objects are created that the user does not own. To reduce the maintenance burden in that case, PostgreSQL provides the ALTER DEFAULT PRIVILEGES command. Continue Reading →

NIO Selector in Apache Kafka

It's rare that I need to cover more than 3 other topics in order to write a blog post. But that's the case with the Apache Kafka idempotent producer post that I will publish soon. Before that, I need to understand and explain NIO Selector, its role in Apache Kafka, and finally the in-flight requests. Since the first topic was already covered, I will move on to the second one. Continue Reading →

Schema case sensitivity for JSON source in Apache Spark SQL

On the one hand, I appreciate JSON for its flexibility, but on the other, I hate it for exactly the same thing. It's particularly painful when you work on a project without good data governance. The most common pain point is an inconsistent field type; Spark can manage that by getting the most common type. Unfortunately, it's a little bit trickier for less common problems, for instance when the same field appears with different letter cases. Continue Reading →

From Apache Spark connector to Apache Pulsar basic concepts

Some time ago I saw an interesting presentation about Apache Pulsar and it intrigued me. Compute separated from storage in a streaming system? Sounds great! In this series of posts, I will try to understand how different challenges were solved, but I will start with an exercise: trying to figure out Apache Pulsar's architecture from its Structured Streaming connector. Continue Reading →

Implicit datetime conversion in Apache Spark SQL

Have you ever wondered why, when you write "2019-05-10T20:00", Apache Spark considers it as a timestamp field? Defining the field as a TimestampType is one of the reasons, but another question is how Apache Spark converts a string into the timestamp type. I will give you some hints in this blog post. Continue Reading →