Hi and welcome to the new series. This time I won't blog about my own discoveries. Instead, I'm going to review other blog posts from the data engineering space and share some key takeaways with you. I don't know yet how regular it will be, but hopefully I'll be able to share some notes every month.
Why Data Quality Is Harder than Code Quality by Ari Bajo
Ari Bajo starts with an interesting observation: he tends to be more confident about code quality than about the data itself. The former is a component we can control, whereas for the latter we often depend on external parties, like data providers. I totally agree with that statement, especially (but not only) when processing semi-structured data sources.
Key takeaways from this blog post:
- A data quality issue can be detected by a business user, by a failing data test, or by data monitoring raising an alert.
- Ari's data quality definition: "data quality is the process of ensuring data meets expectations". A data expectation is a "rule running on production data". The examples of expectations include tests against missing values, values within a range, or column values of one dataset matching column values of another dataset (a minimal sketch of such checks follows this list).
- Ari's definition for data monitoring (aka data observability) is: "process of continuously collecting metrics about your data". An example of a data quality issue could be a decreased volume of data.
- Another interesting insight is about data repositories. They're the places storing data assets, such as schema evolutions. Ari says they're still a "nice to have and hard to do" compared to version control in software development. Because of this gap, it's difficult to track schema evolutions and adapt our data engineering jobs before data quality breaks.
- A data quality issue resolution loop consists of the following steps:
- Finding the source of the error. You can rely on the data lineage layer for that.
- Understanding the difference between datasets. You can rely on a data reconciliation tool like data-diff.
- Fixing the issue. In this step you can fix the data producer code or quarantine the invalid data. If the error comes from an external provider, understanding and fixing it can be hard.
- Backfilling the pipelines. Finally, you also might need to reprocess the pipelines that depend on the bad dataset.
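To make the expectation examples more concrete, here is a minimal PySpark sketch of the three rule types mentioned above; the orders and customers datasets, their paths, and their column names are all hypothetical:

# A minimal sketch of the three expectation types: missing values, value ranges,
# and cross-dataset column matching. Datasets and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("expectations-sketch").getOrCreate()

orders = spark.read.parquet("s3://bucket/orders/")        # hypothetical input
customers = spark.read.parquet("s3://bucket/customers/")  # hypothetical input

# 1. No missing values in a mandatory column.
missing_ids = orders.filter(F.col("order_id").isNull()).count()
assert missing_ids == 0, f"{missing_ids} orders without an id"

# 2. Values within an expected range.
out_of_range = orders.filter((F.col("amount") < 0) | (F.col("amount") > 100000)).count()
assert out_of_range == 0, f"{out_of_range} orders with an amount outside [0, 100000]"

# 3. Column values of one dataset matching column values of another dataset.
orphans = orders.join(customers, on="customer_id", how="left_anti").count()
assert orphans == 0, f"{orphans} orders referencing an unknown customer"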
Link to the blog post: https://towardsdatascience.com/why-data-quality-is-harder-than-code-quality-a7ab78c9d9e.
You can reach out to the author on LinkedIn: Ari Bajo.
Real greenfield - Conventional Commits and how to enforce the use of standards? by Bartek Kuczynski
The original blog post is in Polish, so you probably won't understand it. I'll try to summarize the key parts here, but feel free to use an automatic translation tool to read it fully!
The blog post talks about greenfield projects and one of the most important parts of software engineering projects: conventions. The author shares his way of defining a Git hook that validates the commit format. According to the rules, each commit in the project should respect the following template:
(Type)[component][!]: (Title) [Long description] [Footer]
The template is self-explanatory but, more importantly, the author shares how to enforce these hooks in a project. As you probably know, a Git hook is a client-side feature, so a team member can disable it or simply forget to set it up. CI/CD validation can help mitigate this issue, but there is also an interesting feature of the Maven exec plugin that will run the hook setup as part of the build script:
<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>exec-maven-plugin</artifactId>
  <executions>
    <execution>
      <id>Git setup</id>
      <phase>generate-sources</phase>
      <goals><goal>exec</goal></goals>
      <configuration>
        <executable>${basedir}/.hooks/setup.sh</executable>
      </configuration>
    </execution>
  </executions>
</plugin>
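For illustration only, here is a minimal sketch of what such a commit-msg hook could look like in Python; the list of accepted types and the regular expression are my simplified, hypothetical reading of the template above, not the author's actual script:

#!/usr/bin/env python3
# Minimal commit-msg hook sketch: validates the first line of the commit message
# against a simplified version of the (Type)[component][!]: (Title) template.
import re
import sys

ALLOWED_TYPES = "build|chore|ci|docs|feat|fix|refactor|test"  # hypothetical list
PATTERN = re.compile(rf"^({ALLOWED_TYPES})(\([\w-]+\))?!?: .+")

def main(message_file: str) -> int:
    with open(message_file, encoding="utf-8") as commit_message:
        first_line = commit_message.readline().strip()
    if PATTERN.match(first_line):
        return 0
    print(f"Invalid commit title: '{first_line}'. Expected '(type)[component][!]: title'.")
    return 1

if __name__ == "__main__":
    # Git passes the path of the commit message file as the first argument.
    sys.exit(main(sys.argv[1]))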
The author also mentions something I wasn't aware of: Git trailers, which decorate commit messages with predefined attributes, such as co-authors (for example, a Co-authored-by: Name <email> line at the end of the commit message).
Finally, the article also gives an interesting definition of greenfield. To sum it up, not every new project is a greenfield one where you can do literally anything. If you start working on a new project and the organization already has some standards (architecture, naming conventions, tools, ...), then you can't really talk about a pure greenfield because your choices depend on the existing context.
Link to the blog post: https://koziolekweb.pl/2020/10/28/prawdziwy-greenfield-conventional-commits-i-jak-zmusic-do-uzywania-standardow/.
You can reach out to the author on LinkedIn: Bartek Kuczynski.
Cooling down hot data: From Kafka to Athena by Nicolas Goll-Perrier
The 3rd blog post describes a hot-data cooling journey at leboncoin.fr. The author, Nicolas Goll-Perrier, presents the evolution of a system synchronizing data from Apache Kafka to S3.
The first implementation was an hourly Apache Avro to Apache Parquet conversion job orchestrated from Apache Airflow. Additionally, the data producers had to declare their topics and schemas in a dedicated, shared repository, which looks like a data contracts implementation. However, this batch-oriented approach didn't scale very well with the increasing number of producers and consumers. Among the pain points, Nicolas shares the following:
- Maintenance of a custom Apache Spark job for the data conversion from Avro to Parquet (a rough sketch of what such a job could look like follows this list).
- Involvement in the topic configuration, including specifying the output path, partitioning scheme, or the event time column(s).
- Table schema translation in the Glue metastore in all environments.
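For illustration only, a custom conversion job of that kind could boil down to something like the following PySpark sketch; the paths, the hourly layout, and the spark-avro dependency are assumptions, not details from the article:

# Hypothetical sketch of an hourly Avro -> Parquet conversion job that an
# Airflow DAG could trigger; it requires the spark-avro package on the classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

hour_to_process = "2020/11/15/10"  # passed by the orchestrator in a real job

(spark.read.format("avro")
    .load(f"s3://raw-events/my_topic/{hour_to_process}/")
    .write.mode("overwrite")
    .parquet(f"s3://cold-data/my_topic/{hour_to_process}/"))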
In a second phase, Nicolas and his team challenged the initial idea. They ended up with a streaming-based solution relying on Kafka Connect and the S3 Sink Connector. The implementation consisted of:
- A Docker image extended from the official Confluent platform image with additional custom plugins, such as the S3 Sink plugin and Single Message Transforms.
- An EKS Kubernetes cluster with the official Helm chart.
- An automated CI/CD workflow on top of the schemas repository. The automation includes schema and Kafka Connect connector creation directly from the schema files (see the sketch below). Doesn't it sound like data contracts!?
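To give an idea of what automating the connector creation could look like, here is a minimal, hypothetical sketch that registers an S3 Sink connector through the Kafka Connect REST API; the connector name, topic, bucket, and flush settings are assumptions, not values from the article:

# Hypothetical sketch: create or update an S3 Sink connector via the Kafka Connect REST API.
import requests

connector_config = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my_topic",                       # hypothetical topic
    "s3.bucket.name": "cold-data",              # hypothetical bucket
    "s3.region": "eu-west-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "100000",
    "rotate.interval.ms": str(20 * 60 * 1000),  # the 20-minute buffering mentioned below
    "tasks.max": "2",
}

response = requests.put(
    "http://kafka-connect:8083/connectors/my_topic-s3-sink/config",  # hypothetical endpoint
    json=connector_config,
)
response.raise_for_status()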
The lessons learned from the Kafka Connector sound interesting too:
- High memory consumption. The buffering happens in memory, and during 20 minutes the consumer can accumulate a lot of data.
- File count. Due to the buffer size (20 minutes) and the number of partitions (> 70), each day generates approximately 5040 files. A lot of them are small, which may require running a compaction job.
- Load balancing. Especially for skewed partitions, it can lead to uneven consumption across the nodes of the cluster.
- Lack of automatic retry for the failed tasks.
- The time to rebalance the work among the consumers after a node failure or restart is unpredictable and difficult to tune.
But using Kafka Connect is not the last step of the pipeline. There is also a need to index the synchronized data in the Glue metastore. Basically, this task can be implemented with Glue Crawlers, but Nicolas lists a few shortcomings of an exclusively Crawler-based solution:
- Performance. Crawlers list the S3 subtree to detect new partitions or tables.
- Cost.
- Scaling, with an AWS hard limit of 100 simultaneous crawlers. For 160 topics it's far from ideal.
- Closed source, which doesn't help understand what's going on.
So instead of running Crawlers continuously, Nicolas and his team decided to trigger the crawler only when necessary, i.e. when a new table is added or there is no schema associated with it in the metastore. Other operations, such as partition addition, don't involve crawling, which reduces the cost. The whole mechanism relies on S3 events delivered to an SQS queue and consumed by an AWS Lambda function.
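As an illustration of that event-driven part, here is a hypothetical sketch of a Lambda function consuming S3 notifications from SQS and registering new partitions in the Glue Data Catalog; the bucket layout, database and table naming are my assumptions, not details from the article:

# Hypothetical sketch: register new partitions in Glue from S3 events delivered through SQS.
import json
import boto3

glue = boto3.client("glue")

def handler(event, context):
    for sqs_record in event["Records"]:
        s3_event = json.loads(sqs_record["body"])
        for s3_record in s3_event.get("Records", []):
            # Assumed key layout: <table>/date=<yyyy-MM-dd>/<file>.parquet
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            table, partition_dir, _ = key.split("/", 2)
            partition_value = partition_dir.split("=", 1)[1]

            # Reuse the table's storage descriptor, only changing the location.
            table_meta = glue.get_table(DatabaseName="cold_data", Name=table)
            storage = dict(table_meta["Table"]["StorageDescriptor"])
            storage["Location"] = f"s3://{bucket}/{table}/{partition_dir}/"
            try:
                glue.create_partition(
                    DatabaseName="cold_data",
                    TableName=table,
                    PartitionInput={"Values": [partition_value], "StorageDescriptor": storage},
                )
            except glue.exceptions.AlreadyExistsException:
                pass  # the partition was already registered by a previous file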
Link to the blog post: https://medium.com/leboncoin-engineering-blog/cooling-down-hot-data-from-kafka-to-athena-5918a628bd98.
You can reach out to the author on LinkedIn: Nicolas Goll-Perrier.
On Spark, Hive, and Small Files: An In-Depth Look at Spark Partitioning Strategies by Zachary Ennenga
Although the author talks about Hive and HDFS, which are not on my learning radar, I really liked the engineering way of approaching the problem! Let's take a quick look. First, Zachary recalls some basics about HDFS and especially why writing many small files is problematic:
HDFS does not support large amounts of small files well. Each file has a 150 byte cost in NameNode memory, and HDFS has a limited number of overall IOPS. Spikes in file writes can absolutely take down, or otherwise render unusably slow, pieces of your HDFS infrastructure.
Next, the author presents how Apache Spark writes data with the partition-by statement. No, it doesn't write 1 file per output partition! Instead, it writes 1 file per output partition in each data writing task. So if you have 3 tasks and 3 partitions, you may end up with 9 files written. This logic explains how one of Airbnb's pipelines ended up with 1.1 million files generated for a year of data.
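A quick PySpark sketch of that behavior, on hypothetical data, to make it concrete:

# Minimal sketch: each of the 3 writing tasks can hold rows for every output date,
# so partitionBy produces up to (tasks x dates) = 9 files.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[3]").appName("small-files").getOrCreate()

events = (spark.range(0, 90)
          .withColumn("date", F.expr("date_add(to_date('2020-11-01'), CAST(id % 3 AS INT))")))

events.repartition(3).write.mode("overwrite").partitionBy("date").parquet("/tmp/small_files_demo")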
Although Zachary later describes several Spark partitioning approaches, I'll focus here only on the 2 that are probably less popular than the others: repartition with a random factor and repartition by range.
- Repartition by columns with a random factor. A simple repartition by columns works for writing small datasets because they end up with one output Hive partition. It's a bit trickier for bigger datasets, where the fix consists of adding a random factor column, for example defined as withColumn("rand", rand() % filesPerPartitionKey). It has some issues though, including hash collisions and the filesPerPartitionKey computation. To address them, you can use an approach called partition scaling and set a scaling factor as .repartition(5*365*SCALING_FACTOR, $"date", $"rand") (see the sketch after this list).
- Repartition by range. The operation requires sampling, which means increasing the driver memory. It also has the drawback of putting all rows related to the partition column in the same Spark partition. You can overcome that issue by adding an extra repartition column with a random value to the dataset.
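As a rough PySpark transcription of the random-factor idea, under my own assumptions (a daily "date" partition column, hand-picked constants, and the random column rendered as a multiply-and-cast instead of the article's rand() % filesPerPartitionKey expression):

# Hypothetical sketch of repartitioning by columns with a random factor:
# each Hive partition (date) is split into at most FILES_PER_PARTITION_KEY files.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("random-factor-repartition").getOrCreate()

FILES_PER_PARTITION_KEY = 5   # assumed target number of files per date
SCALING_FACTOR = 2            # assumed value for the partition scaling trick

events = spark.read.parquet("s3://bucket/events/")  # hypothetical input

(events
    .withColumn("rand", (F.rand() * FILES_PER_PARTITION_KEY).cast("int"))
    .repartition(5 * 365 * SCALING_FACTOR, F.col("date"), F.col("rand"))
    .write.mode("overwrite")
    .partitionBy("date")
    .parquet("s3://bucket/events_by_date/"))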
In the last paragraph, Zachary lists all the partitioning strategies and shares a simple guide with the strategy to implement per use case (I put a small sketch of the coalesce case right after the guide):
Use coalesce if:
* You're writing fewer files than your sPartition count
* You can bear to perform a cache and count operation before your coalesce
* You're writing exactly 1 hPartition
Use simple repartition if:
* You're writing exactly 1 hPartition
* You can't use coalesce
Use a simple repartition by columns if:
* You're writing multiple hPartitions, but each hPartition needs exactly 1 file
* Your hPartitions are roughly equally sized
Use a repartition by columns with a random factor if:
* Your hPartitions are roughly equally sized
* You feel comfortable maintaining a files-per-hPartition variable for this dataset
* You can estimate the number of output hPartitions at runtime or, you can guarantee your default parallelism will always be much larger (~3x) your output file count for any dataset you're writing
Use a repartition by range (with the hash/rand columns) in every other case.
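For the first branch of the guide, a minimal sketch of the cache-and-count-before-coalesce pattern; the dataset, the filter, and the target file count are hypothetical:

# Hypothetical sketch of "cache and count before coalesce": the count materializes the
# cached dataset at full parallelism, so only the final write is narrowed to a few files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-sketch").getOrCreate()

TARGET_FILES = 4  # assumed, must stay below the current sPartition count

events = spark.read.parquet("s3://bucket/events/")  # hypothetical input
report = events.filter("amount > 0").cache()
report.count()  # triggers the computation with full parallelism and fills the cache

report.coalesce(TARGET_FILES).write.mode("overwrite").parquet("s3://bucket/events_report/")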
Link to the blog post: https://medium.com/airbnb-engineering/on-spark-hive-and-small-files-an-in-depth-look-at-spark-partitioning-strategies-a9a364f908.
You can reach out to the author on LinkedIn: Zachary Ennenga.
Incremental datasets by Dataform
To finish this blog post, a great definition of incremental datasets from Dataform's documentation:
Incremental datasets aren't rebuilt from scratch every time they run. Instead, only new rows are inserted (or merged) into the dataset according to the conditions you provide when configuring the dataset. Dataform takes care of managing state, creating datasets, and generating INSERT (or MERGE) statements for you.
Link to the documentation: https://docs.dataform.co/guides/datasets/incremental.
I was hoping to write a short summary of the blog posts, but it turns out to be even longer than the blog posts I usually write 😮 I probably need to improve in this knowledge-sharing format, and if you have any suggestions, they're more than welcome!