A few weeks ago I got a comment asking me about recommended data engineering books. I mentioned a few of them in my "Becoming a data engineer - a feedback of my journey" blog post, but without explaining why. I will try to fill that gap in this post.
Since I wanted to avoid a "long tail" article, I organized the books into 3 categories. The first one, "Data concepts", groups the books presenting general data engineering concepts like fault tolerance, replication, scalability, ... The second category, called "Data tools", presents the books about specific data tools like frameworks or databases. Finally, the last category, called "Software engineering", lists all the good ... software engineering reads that can help in your daily data engineering work.
The first book from this category is "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" by Martin Kleppmann. It's a reference book that explains most of the key data concepts. Do you want to discover the data structures implemented in databases? Or maybe you want to see what the important aspects of every distributed system are? Those are just 2 of the many questions the book answers. I liked the technology-agnostic presentation of the concepts, which aims for the same thing I'm trying to achieve right now with multi-cloud learning, i.e. find the patterns and explain them to others [to show that you know them ;)].
Another book I highly recommend from this general data category is "The Enterprise Big Data Lake" by Alex Gorelik. As a person who came to the data world from software engineering, I didn't follow the evolution of data systems in organizations. I knew about data warehouses and data marts, but since I joined the data domain during the rise of data lakes, I never experienced the data warehouse era. Even though the book mostly covers the data lake side, it helped me understand the transition from data warehouses to data lakes. But don't get me wrong, even as someone already familiar with the data lake concept, I learned a lot in this field too!
The next 2 books are about stream processing. The first of them is my reference in this specific domain, a little bit like "Designing Data-Intensive Applications..." is for data concepts in general. I'm talking here about "Streaming Systems" by Reuven Lax, Slava Chernyak and Tyler Akidau. Through generic examples and a lot of meaningful pictures, it helped me extend what I already knew about streaming from the micro-batch perspective (Structured Streaming).
The second book that helped me assimilate streaming concepts is "Streaming Data" by Andrew G. Psaltis. It covers more general concepts than "Streaming Systems", but knowing them makes it much easier to go further. Since it was the first purely streaming book I read, it helped me learn the basics (event time, processing time, windows, etc.) and understand how to integrate some more complicated aspects, like approximate algorithms, into streaming architectures. And "Streaming Systems" extended all that perfectly!
To complete the pattern-oriented books from the previous section, I have 3 more tool-specific titles to share. First, "Learning Spark: Lightning-Fast Data Analytics" by Jules S. Damji, Brooke Wenig, Tathagata Das and Denny Lee. As an Apache Spark enthusiast, I didn't expect it, but all 12 chapters taught me something new, about the execution engine as well as about the use cases (ML, MLOps, data lake, lakehouse, ...). As a result, I added a few extra items to my learning backlog!
The second book from the family of data processing frameworks is "Stream Processing with Apache Flink" by Fabian Hueske and Vasiliki Kalavri. Weird? After all, I should have known everything about stream processing after reading all the books mentioned above. But no :) It was very insightful to see a streaming approach other than micro-batch in action. Moreover, several concepts, like the ones related to fault tolerance, work differently than in Structured Streaming, and it was fascinating to discover them. So if, like me, you use Apache Spark daily and are looking for something to complete your data processing vision, think about "Stream Processing with Apache Flink"!
And finally, the book I bought a bit by chance: "Google BigQuery: The Definitive Guide: Data Warehousing, Analytics, and Machine Learning at Scale" by Valliappa Lakshmanan and Jordan Tigani. At the time, I wanted to understand how it's even possible to implement a serverless data warehouse and offer it as a managed service. The book didn't disappoint me. Not only did it answer that question, but it also taught me some new SQL operations and optimizations for distributed computing.
To finish, because there is an "engineering" in "data engineering", here are 2 books to improve software engineering skills. The first one is, of course, "Clean Code" by Robert Cecil Martin. In the past, I learned a lot of clean code concepts by doing (by the way, thank you Frank 🙏) and didn't know they were written down somewhere. "Clean Code" raised my awareness of coding best practices by confirming what I already knew and adding some new ideas.
And to accompany "Clean Code", think about "The Software Craftsman" by Sandro Mancuso. I read it in a very optimistic spirit, expecting it to confirm many points of my data engineer attitude. In the end, that was only partially the case, since I also identified a few bad habits to eliminate. Maybe because of this personal concern, it was one of the fastest books I've ever read.
Those are the books I enjoyed the most over the last few years of my data engineering journey. I certainly missed some excellent ones and will be happy to add them to my reading backlog. Just let me know in the comments ✍