Worth reading for data engineers - part 1

Hi and welcome to a new series. This time I won't blog about my own discoveries. Instead, I'm going to review other blog posts from the data engineering space and share some key takeaways with you. I don't know how regular it will be yet, but hopefully I will be able to share some notes every month.


Why Data Quality Is Harder than Code Quality by Ari Bajo

Ari Bajo starts by pointing out an interesting observation: he tends to be more confident about the code quality than about the data itself. The former is a component we can control, whereas for the latter we often depend on external parties, like data providers. I totally agree with the statement, especially when processing semi-structured data sources (but not only then).

Key takeaways from this blog post:

Link to the blog post: https://towardsdatascience.com/why-data-quality-is-harder-than-code-quality-a7ab78c9d9e.

You can reach out to the author on LinkedIn: Ari Bajo.

Real greenfield - Conventional Commits and how to enforce the use of standards? by Bartek Kuczynski

The original blog post is in Polish, so you probably won't understand it. I'll try to summarize the key parts here, but feel free to use an automatic translation tool to read it in full!

The blog post talks about greenfield projects and one of the most important parts of software engineering projects: conventions. The author shares his way of defining a Git hook that validates the commit format. According to the rules, each commit in the project should respect the following template:

(Type)[component][!]: (Title)

[Long description]

[Footer]
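
To make the template concrete, here is a minimal commit-msg hook sketch in Python that could be versioned inside the project; the allowed types and the regex are my assumptions, not the exact rules from the article:

#!/usr/bin/env python3
# Minimal commit-msg hook sketch validating a Conventional Commits-like template.
# The allowed types and the pattern are illustrative assumptions.
import re
import sys

ALLOWED_TYPES = "build|chore|ci|docs|feat|fix|perf|refactor|test"
# (Type)[component][!]: (Title), e.g. "feat(api)!: drop the v1 endpoints"
PATTERN = re.compile(rf"^({ALLOWED_TYPES})(\([\w-]+\))?!?: .+")


def main() -> int:
    # Git passes the path of the file with the commit message as the first argument.
    with open(sys.argv[1], encoding="utf-8") as commit_message_file:
        title = commit_message_file.readline().strip()
    if PATTERN.match(title):
        return 0
    print(f"Invalid commit title '{title}', expected '(Type)[component][!]: (Title)'")
    return 1


if __name__ == "__main__":
    sys.exit(main())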

The template is self-explanatory, but more importantly, the author shares how to enforce these hooks in a project. As you probably know, a Git hook is a client-side feature, so a team member can disable it or simply forget to set it up. A CI/CD validation can help mitigate this issue, but there is also an interesting feature of the Maven exec plugin that runs the hook setup as a part of the build script:

<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>Git setup</id>
      <phase>generate-sources</phase>
      <goals><goal>exec</goal></goals>
      <configuration>
        <executable>${basedir}/.hooks/setup.sh</executable>
      </configuration>
    </execution>
  </executions>
</plugin>

The author also mentions something I wasn't aware of: Git trailers, which decorate the commit messages with predefined attributes, such as co-authors.
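
For illustration, trailers are simple "Key: value" lines in the commit footer that Git knows how to parse; the message and identities below are made up:

feat(search): add fuzzy matching for user queries

Longer description of what changed and why.

Co-authored-by: Jane Doe <jane.doe@example.com>
Reviewed-by: John Smith <john.smith@example.com>

Git can then read or add such lines with the git interpret-trailers command, or display them in git log via the %(trailers) format placeholder.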

Finally, the article also gives an interesting definition of greenfield. To sum it up, not every new project is a greenfield one where you can do literally anything. If you start working on a new project and the organization already has some standards (architecture, naming conventions, tools, ...), then you can't really talk about a pure greenfield because your choices depend on the existing context.

Link to the blog post: https://koziolekweb.pl/2020/10/28/prawdziwy-greenfield-conventional-commits-i-jak-zmusic-do-uzywania-standardow/.

You can reach out to the author on LinkedIn: Bartek Kuczynski.

Cooling down hot data: From Kafka to Athena by Nicolas Goll-Perrier

The third blog post describes the journey of cooling down hot data at leboncoin.fr. The author, Nicolas Goll-Perrier, presents the evolution of a system that synchronizes data from Apache Kafka to S3.

The first implementation was an hourly Apache Avro to Apache Parquet conversion job orchestrated from Apache Airflow. Additionally, the data producers had to declare their topics and schemas in a dedicated, shared repository, which looks like a data contracts implementation. However, this batch-oriented approach didn't scale very well with the increasing number of producers and consumers. Among the pain points, Nicolas shares the following:

In a second phase, Nicolas and his team challenged the initial idea. They ended up with a streaming-based solution relying on Kafka Connect and the S3 Sink Connector. The implementation consisted of:

The lessons learned from the Kafka Connector sound interesting too:

But using Kafka Connect is not the last step in the pipeline. There is also a need to index the synchronized data in the Glue metastore. Basically, this task can be implemented with Glue Crawlers, but Nicolas lists a few shortcomings of an exclusively Crawler-based solution:

So instead of running Crawlers continuously, Nicolas and his team decided to trigger the crawler only when necessary, i.e. when a new table is added or there is no schema associated with it in the metastore. Other operations, such as partition addition, don't involve crawling, which reduces the cost. The whole mechanism relies on S3 events delivered to an SQS queue and consumed by an AWS Lambda function.
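
For illustration, the decision logic of such a Lambda function could look like the rough sketch below. All names and the assumed key layout (table/dt=.../file.parquet) are mine, not the exact implementation from the article:

import json

import boto3

glue = boto3.client("glue")

DATABASE = "datalake"              # hypothetical
CRAWLER_NAME = "datalake-crawler"  # hypothetical


def handler(event, context):
    # S3 notifications arrive through SQS; each SQS record body wraps an S3 event.
    for sqs_record in event["Records"]:
        s3_event = json.loads(sqs_record["body"])
        for s3_record in s3_event.get("Records", []):
            # Assumed key layout: "orders/dt=2021-01-15/part-0000.parquet"
            key = s3_record["s3"]["object"]["key"]
            table_name, partition_folder, _ = key.split("/", 2)
            try:
                table = glue.get_table(DatabaseName=DATABASE, Name=table_name)["Table"]
            except glue.exceptions.EntityNotFoundException:
                # Unknown table: let the crawler infer the schema and create it.
                glue.start_crawler(Name=CRAWLER_NAME)
                continue
            # Known table: register the new partition directly, no crawling involved.
            partition_value = partition_folder.split("=", 1)[1]
            storage = dict(
                table["StorageDescriptor"],
                Location=f'{table["StorageDescriptor"]["Location"]}/{partition_folder}',
            )
            glue.create_partition(
                DatabaseName=DATABASE,
                TableName=table_name,
                PartitionInput={"Values": [partition_value], "StorageDescriptor": storage},
            )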

Link to the blog post: https://medium.com/leboncoin-engineering-blog/cooling-down-hot-data-from-kafka-to-athena-5918a628bd98.

You can reach out to the author on LinkedIn: Nicolas Goll-Perrier.

On Spark, Hive, and Small Files: An In-Depth Look at Spark Partitioning Strategies by Zachary Ennenga

Although the author talks about Hive and HDFS, which are not on my learning radar, I really liked the engineering way of approaching the problem! Let's take a quick look. First, Zachary recalls some basics about HDFS and especially why it's problematic to write many small files:

HDFS does not support large amounts of small files well. Each file has a 150 byte cost in NameNode memory, and HDFS has a limited number of overall IOPS. Spikes in file writes can absolutely take down, or otherwise render unusably slow, pieces of your HDFS infrastructure.

Next, the author presents how Apache Spark writes data with the partitionBy statement. No, it doesn't write 1 file per output partition! Instead, it writes 1 file per output partition in each data writing task. So if you have 3 tasks and 3 partitions, you may end up with 9 files written. This logic explains how one of Airbnb's pipelines ended up with 1.1 million files generated for a year of data.
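
To visualize this behavior, here is a small PySpark sketch (the paths, column names and values are made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[3]").getOrCreate()

# 90 rows spread over 3 tasks, with 3 distinct values in the partitioning column.
events = (spark.range(0, 90)
    .withColumn("event_date",
                F.concat(F.lit("2021-01-0"), (F.col("id") % 3 + 1).cast("string")))
    .repartition(3))

# Each of the 3 writing tasks produces 1 file per event_date value it sees,
# so up to 3 x 3 = 9 files land under /tmp/events.
events.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events")

# Repartitioning by the output partition column first aligns tasks with partitions,
# so every event_date directory ends up with a single file.
(events.repartition("event_date")
    .write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events_single"))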

Although Zachary later describes several Spark partitioning approaches, I'll focus here only on 2 that are probably less popular than the others, repartition with a random factor and repartition by range (see the sketch below):
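
Here is a quick PySpark sketch of both ideas, reusing the events dataframe from the previous snippet; the numbers are illustrative, not the article's values:

from pyspark.sql import functions as F

# Repartition by columns with a random factor ("salting"): instead of squeezing a whole
# output partition into a single task, spread it over a fixed number of files.
files_per_partition = 5  # a value to maintain per dataset
salted = (events
    .withColumn("salt", (F.rand() * files_per_partition).cast("int"))
    .repartition("event_date", "salt")
    .drop("salt"))
salted.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events_salted")

# Repartition by range: rows are sampled and sorted into contiguous ranges of the
# partitioning expressions, which keeps the output files balanced even with skew.
ranged = (events
    .withColumn("rand", F.rand())
    .repartitionByRange(15, "event_date", "rand")
    .drop("rand"))
ranged.write.mode("overwrite").partitionBy("event_date").parquet("/tmp/events_ranged")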

In the last paragraph, Zachary lists all the partitioning strategies and shares a simple guide on which strategy to use per use case:

Use coalesce if:
* You're writing fewer files than your sPartition count
* You can bear to perform a cache and count operation before your coalesce
* You're writing exactly 1 hPartition

Use simple repartition if:
* You're writing exactly 1 hPartition
* You can't use coalesce

Use a simple repartition by columns if:
* You're writing multiple hPartitions, but each hPartition needs exactly 1 file
* Your hPartitions are roughly equally sized

Use a repartition by columns with a random factor if:
* Your hPartitions are roughly equally sized
* You feel comfortable maintaining a files-per-hPartition variable for this dataset
* You can estimate the number of output hPartitions at runtime, or you can guarantee your default parallelism will always be much larger (~3x) than your output file count for any dataset you're writing

Use a repartition by range (with the hash/rand columns) in every other case.

Link to the blog post: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908.

You can reach out to the author on LinkedIn: Zachary Ennenga.

Incremental datasets by Dataform

To finish this blog post, a great definition of incremental datasets from Dataform's documentation. According to the documentation:

Incremental datasets aren't rebuilt from scratch every time they run. Instead, only new rows are inserted (or merged) into the dataset according to the conditions you provide when configuring the dataset. Dataform takes care of managing state, creating datasets, and generating INSERT (or MERGE) statements for you.
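
To make the mechanism concrete, here is a tiny conceptual sketch of the kind of statement such a run boils down to. It is not Dataform's actual code generation, and all table and column names are made up:

# Conceptual sketch only: the incremental condition restricts the load to rows newer
# than the target's current high-water mark.
target_table = "analytics.page_views"   # hypothetical
source_table = "raw.tracking_events"    # hypothetical
update_column = "event_timestamp"       # hypothetical

incremental_insert = f"""
INSERT INTO {target_table}
SELECT *
FROM {source_table}
WHERE {update_column} > (SELECT MAX({update_column}) FROM {target_table})
""".strip()

print(incremental_insert)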

Link to the blog post: https://docs.dataform.co/guides/datasets/incremental.

I was hoping to write a short summary of the blog posts, but it turns out to be even longer than the blog posts I usually write 😮 I probably need to improve at this knowledge-sharing format, so if you have some suggestions, they're more than welcome!

