Waiting for code

on waitingforcode.com

Check out my new course on Data Engineering!

Are you a data scientist who wants to extend his data engineering skills? Or a software engineer who wants to work with Big Data? If not, maybe a BI developer who wants to evolve to engineering position? My course will help you to achieve your goal! Join the class →

DAG evolution - using start_date and end_date?

One of the greatest properties in data engineering is idempotency. No matters how many times you will run your pipeline, you will always end up with the same outcome (= 1 file, 1 new table, ...). However, this property may be easily broken when you need to evolve your pipeline. In this blog post, I will verify one possible way to manage it in Apache Airflow. Continue Reading →

Idempotent file generation in Apache Spark SQL

Some time ago I was thinking how to partition the data and ensure that we can reprocess it easily. Overwrite mode was not an option since the data of one partition could be generated by 2 different batch executions. That's why I started to think about implementing an idempotent file output generator and, therefore, discover file sink internals in practice. Continue Reading →

Apache Pulsar - global architecture and local setup

I'm really happy to start a whole new chapter on the blog and include Apache Pulsar to my monitored topics! Even though I already wrote about this technology in December 2019, I still feel hungry because the topic was more an analysis of Apache Spark Structured Streaming connector than the analysis of the tool per se. That's why I'm starting right now with the presentation of the basic concepts, like global architecture and local setup, both needed to go further. Continue Reading →