Welcome to the 3rd part of the series, with summaries of some great blog posts about streaming and project organization!
State Rebalancing in Structured Streaming by Alex Balikov and Tristen Wentling
The first blog post of the series presents how to deal with infrastructure scaling for stateful streaming applications in Apache Spark Structured Streaming. The authors, Tristen Wentling and Alex Balikov, share the state rebalancing implementation available in the Apache Spark version running on Databricks.
The key takeaways are:
- The default implementation prefers state locality, i.e. the task scheduler always prefers assigning a stateful partition to the same executor, because the state is cached locally on disk.
- It's problematic for scaling the workloads, because the state remains local and new executors don't bring the expected positive impact.
- The solution seems obvious: relaxing the locality constraint. How? Whenever a new executor joins the cluster, the task scheduler starts assigning it stateful partitions, which results in loading the corresponding state files from the checkpoint location.
- The approach impacts the latency with the state rebalancing action. However, this cost should be amortized over the pipeline execution by the reduced execution time of the subsequent micro-batches.
- The feature can be enabled with the spark.sql.streaming.statefulOperator.stateRebalancing.enabled property on Databricks Runtime 11 or later.
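Enabling the feature boils down to a single configuration switch. A minimal sketch: the property name comes from the blog post, while the rest of the session setup is purely illustrative:

```python
from pyspark.sql import SparkSession

# Illustrative session setup; the property below is the one named in the
# blog post and takes effect on Databricks Runtime 11+.
spark = (SparkSession.builder
    .appName('stateful-pipeline')
    .config('spark.sql.streaming.statefulOperator.stateRebalancing.enabled', 'true')
    .getOrCreate())
```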
Link to the blog post: https://www.databricks.com/blog/2022/10/04/state-rebalancing-structured-streaming.html.
Building a Data Lake on PB scale with Apache Spark by David Vrba
Even though my current focus is on table file formats and I'm trying to keep away from the previous era of data lakes built on top of columnar formats such as Apache Parquet or ORC, I appreciated David Vrba's article about building a scalable data lake on top of Apache Spark and the aforementioned columnar formats.
As for many big data projects, David and his team had to deal with several challenges. I'll focus here only on a few of them but invite you to see the rest in the blog post directly!
- Schema evolution. The data lake stores semi-structured data from various social media platforms. Keeping the schema consistent is probably one of the toughest challenges for that kind of data.
- Changes detection. David and his team developed an internal tool that consists of inferring the schema automatically with Apache Spark and comparing the result with the last schema present in the schema registry component. If there are any differences, the system sends a notification to one of the communication channels.
- Schema merge. If the schema changes in the tables exposed to the end users, a human verifies whether the added fields can be shared. If so, another internal tool...simply switches the location of the tables, without moving the data! Let me quote the code from the blog post to illustrate the idea:
```python
(spark.createDataFrame([], new_schema)
    .write
    .mode('overwrite')
    .option('path', table_location_temp)
    .saveAsTable(table_name_temp))

spark.sql(f'ALTER TABLE {table_name_temp} SET LOCATION "{table_location}"')
spark.sql(f'MSCK REPAIR TABLE {table_name_temp}')
spark.sql(f'DROP TABLE {table_name}')
spark.sql(f'ALTER TABLE {table_name_temp} RENAME TO {table_name}')
```
- Storage optimizations. The data lake uses 2 storage optimizations, partitioning and bucketing. The former optimizes analytics queries, which most of the time target only the most recent values of the dataset. The latter is better suited to combining datasets, e.g. in joins, because it mitigates the shuffle problem.
- Data optimizations. The user-facing tables rely on denormalization and may contain pre-calculated values, such as aggregates, to provide a better user experience.
- Data quality. That's almost a must-have nowadays, and the Great Expectations framework is often the default choice. It was no different for David's data lake. The data quality checks run on the data before it lands in the final table. If some rules are broken, the export simply stops there.
- Software engineering. IMO it's still a missing part of most data engineering blog posts, i.e. how to write and maintain the code base. David shares some solutions implemented for the data processing jobs on top of PySpark. All the shared logic is defined as modules and classes, and is packaged into a wheel for maximal reusability. Besides, there are also some automated processes, as David explains:
All steps in the lake are fairly automated with just a few manual steps related to ingesting a new datasource. In that situation, we just add some configuration parameters into a config file and add the schema of the new datasource to the schema registry. The git pipeline creates the jobs in Databricks and they are launched using Airflow according to predefined scheduling. The other manual steps are related to monitoring and approving new fields in incoming data and possible debugging of changed datatypes that cannot be safely cast to versioned ones.
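The two storage optimizations mentioned above translate into a few writer options in PySpark. A hedged sketch, where the table, column names, and bucket count are made up for illustration (the blog post doesn't share these details):

```python
# Hypothetical daily export: partition by event date so time-based queries
# prune old partitions, and bucket by user id so joins on that key can
# avoid a full shuffle. bucketBy requires writing with saveAsTable.
(events_df.write
    .mode('overwrite')
    .partitionBy('event_date')
    .bucketBy(50, 'user_id')
    .sortBy('user_id')
    .saveAsTable('social_media_events'))
```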
Link to the blog post: https://towardsdatascience.com/building-a-data-lake-on-pb-scale-with-apache-spark-1622d7073d46.
You can reach out to the author on LinkedIn: David Vrba.
Apache Kafka: 8 things to check before going live by Aris Koliopoulos
In his 3-year-old blog post, Aris shares several interesting notes about Apache Kafka. I particularly enjoyed the last 3, covering some low-level details:
- Memory maps. Aris proves that the default number of memory maps (65 536) may not be enough for big topics:
- Each log segment has an index and a time index attached, so it requires 2 memory maps.
- There are multiple log segments per partition. By default, a segment is closed after 1GB or 7 days. With daily segments and a 1-year retention, a partition accumulates 365 segments, so 365 * 2 (index, time index) memory maps.
- If such a topic has 1000 partitions, it will need 1000 * 365 * 2 = 730 000 memory maps.
- File descriptors. Kafka relies on them for log segments and open connections. The starting point of 100 000 allowed file descriptors for the broker process may not be enough. Aris experienced production workloads with 400 000 file descriptors.
- Log compaction. The log compaction process can die without killing the broker. If you care about compaction, you should monitor the errors in the log-cleaner.log file.
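The memory-map arithmetic above can be sketched as a quick back-of-the-envelope helper (a hypothetical function for illustration, not part of Kafka):

```python
def required_memory_maps(partitions: int, segments_per_partition: int) -> int:
    # Each log segment keeps an index and a time index mapped,
    # i.e. 2 memory maps per segment.
    return partitions * segments_per_partition * 2

# Aris' example: 1000 partitions, daily segments kept for a year,
# far above the default vm.max_map_count of 65 536.
print(required_memory_maps(1000, 365))  # → 730000
```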
Link to the blog post: https://ariskk.com/kafka-8-things.
You can reach out to the author on LinkedIn: Aris Kyriakos Koliopoulos.
Designing a Production-Ready Kappa Architecture for Timely Data Stream Processing by Amey Chaugule
Reprocessing (aka backfilling) is not an easy task in streaming systems. It's even more challenging for stateful applications. The blog post I'm sharing here was written by Amey Chaugule almost 3 years ago, but despite its age, the solution is very smart and deserves its place in the "Worth a read" series!
To introduce the problem, let me quote Amey from his blog post:
The data which the streaming pipeline produced serves use cases that span dramatically different needs in terms of correctness and latency. Some teams use our sessionizing system on analytics that require second-level latency and prioritize fast calculations. At the other end of the spectrum, teams also leverage this pipeline for use cases that value correctness and completeness of data over a much longer time horizon for month-over-month business analyses as opposed to short-term coverage. We discovered that a stateful streaming pipeline without a robust backfilling strategy is ill-suited for covering such disparate use cases.
At first, I thought of backfilling as fixing invalid data, for example data generated by buggy code. Amey's definition extends that purpose; it turns out that "A backfill pipeline is thus not only useful to counter delays, but also to fill minor inconsistencies and holes in data caused by the streaming pipeline".
So, how to backfill those pipelines? Before introducing the implemented solution, Amey recalls 2 well-known approaches:
- Replaying data from the data at-rest storage (Hive in the article) to the streaming broker (Apache Kafka in the article) and launching the real-time streaming job. It has some drawbacks:
- The creation of dedicated infrastructure, which was problematic in the context of a multi-tenant stack.
- The data ingestion, which would have to replay the events to the topic in the exact same order.
- Using the Apache Spark unified API (DataFrame) to run the streaming job in a batch manner. Here too, Amey notes several drawbacks:
- Ignored streaming semantics (windowing, watermarks) in the batch mode.
- Resource usage, because replaying the data from at-rest storage requires much more compute power than incremental stream processing.
- Overloading downstream consumers. Processing a big volume of data requires much more compute not only for the job itself but also for the downstream consumers writing the backfilled data to Elasticsearch or Hive.
The solution? Combining both approaches! The implementation treats Hive as an unbounded data source. It addresses 2 main issues: first, it avoids overloading the infrastructure with one huge backfill batch job; second, it avoids the overhead of creating a temporary Kafka topic to ingest the reprocessed data.
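The article describes Uber's internal implementation, but as a rough analogy in open-source Spark, reading the at-rest data as a rate-limited file stream gives a similar "data at rest as an unbounded source" effect. A sketch under that assumption; the schema and path below are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema of the archived events.
events_schema = StructType([
    StructField('user_id', StringType()),
    StructField('event_time', TimestampType()),
    StructField('payload', StringType()),
])

# Treat the at-rest dataset as an unbounded source; maxFilesPerTrigger
# throttles each micro-batch so neither the backfill job nor its
# downstream consumers get overloaded by one huge batch.
events = (spark.readStream
    .schema(events_schema)
    .option('maxFilesPerTrigger', 10)
    .parquet('/warehouse/events/'))
```

The streaming semantics (windows, watermarks) are preserved because the job still runs through the micro-batch engine, only with a file source instead of Kafka.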
Link to the blog post: https://www.uber.com/en-FR/blog/kappa-architecture-data-stream-processing/.
You can reach out to the author on LinkedIn: Amey Chaugule.
See you next month for the 4th part of the series!