Recently a reader asked me in a PM about the things to know and learn before starting to work as a data engineer. Since I think that my point of view may be interesting for more than one person (if not, I'm really sorry), I decided to write a few words about it.
The post is the result of my observations, and since I don't consider myself an "I know everything" person, please weigh it against other resources you find. In the post, I will first share my story, the whys and hows of the data engineering adventure. Later on, I will describe what I should have done better and what the takeaways are.
My story: from a software engineer to a data engineer
I started my adventure with software a long time ago, in 2003. At that time I was a French football lover and created my first website about it (Thank you Wayback Machine 🙏). From that point, I knew that web applications would be something I would work on for many years. And it was the case: I enjoyed a transition from PHP to Java. Along the way, I also learned software engineering best practices, Agile methodology and ops basics.
Back in 2016, after a very interesting project with Neo4j and Elasticsearch, I asked myself, "do you think you will be able to do this for the next 10 years?". I was talking about Java and web services stuff. The answer was "no", even though I liked what I was doing. And that was the moment of choice. Machine Learning was becoming more and more popular, "data scientist" was "the sexiest job of the 21st century", and I was seriously thinking about learning data science.
Luckily for me (and maybe for you if you enjoy my posts 😉), I saw that we could do ML with Apache Spark. After playing a little with the examples from the documentation, I didn't understand a lot and wanted to discover how it worked. I started to read a lot because, at that time, I wasn't comfortable with the domain, so analyzing the code as I do today was not the easiest way to learn. At the same time, I also wanted to learn Scala, so I started to analyze available code snippets and to write my own.
So I started to learn more and more about Spark, RDDs, DStreams, the difference between transformations and actions... Meanwhile, I discovered Apache Kafka, and to learn Kafka I also began to explore Apache ZooKeeper. After a few months, all this showed me that I enjoyed data engineering and distributed computing problems much more than data science ones... And since even today, in 2020, I'm still curious about data engineering concepts and techniques, I can say that I made a good choice 4 years ago.
But my learning process was not perfect. At that time it was technology-driven, since I preferred to discover frameworks and data stores instead of pure data engineering concepts. I also underestimated the importance of SQL and the rise of cloud computing.
Let's start with what I mean by "technology-driven". At that time I liked to play with Spark and Kafka, send and process messages, but I would describe it more as library exploration than domain learning. I knew that Spark could work on batch and streaming, and that for streaming I could use Kafka, but I ignored that, for example, we could use partitioning to process data more easily, or that we could use an orchestration system to assemble the components of a data system. All this to say that I missed the use-case-driven approach, with architecture examples, the trade-offs that drive choosing one solution over another, alternative solutions, and so on.
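To make the partitioning point concrete, here is a minimal sketch, in plain Python with made-up click events, of why splitting a dataset by a key (here, the event date) makes it easier to process, or to reprocess after a failure, one slice at a time:

```python
from collections import defaultdict

# Hypothetical click events; in a real pipeline they would come
# from files, a database, or a Kafka topic.
events = [
    {"date": "2020-05-01", "user": "a", "clicks": 3},
    {"date": "2020-05-01", "user": "b", "clicks": 1},
    {"date": "2020-05-02", "user": "a", "clicks": 7},
]

# Partition the dataset by date: each partition can now be processed
# independently of the others (and in parallel, on a real cluster).
partitions = defaultdict(list)
for event in events:
    partitions[event["date"]].append(event)

# Aggregate one partition at a time.
daily_clicks = {
    date: sum(e["clicks"] for e in rows) for date, rows in partitions.items()
}
```

Systems like Spark or Kafka apply the same idea at scale: data is split by a partitioning key, and each partition is an independent unit of processing.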
The second point is about cloud computing. I always thought that Open Source is the best and all proprietary tools are bad, and that's why I ignored the cloud at the beginning of my journey. I always liked, and still like, to see how things are implemented, not only read the documentation. But afterward, I saw that I had missed the point and started to improve my cloud computing skills. After all, I also experienced that cloud services are either based on Open Source solutions (e.g. EMR to run Spark, granted, with a few extra optimizations), or share the same principles with them (e.g. a Kinesis shard == a Kafka partition). So in other words, we can somehow live in both worlds at the same time, or almost.
But cloud computing was not the only thing I ignored in the first months of my learning adventure. I also underestimated the importance of SQL and Python. First, SQL is the data query language per se, and you can do a lot just by knowing it, going from simple SQL queries on your on-premise data warehouse to building ELT pipelines on top of cloud data warehouses like Redshift or BigQuery. Of course, knowing a programming language helps a lot, especially for more complex projects where the logic cannot be expressed with a simple SQL query. Nonetheless, I would have liked to know more about ANTI/SEMI JOINs, GROUPING SETS or window functions at the beginning of my journey.
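Two of those features are easy to try at home. Here is a small sketch with made-up tables, using Python's built-in sqlite3 (window functions need SQLite 3.25 or newer, which recent Python versions bundle): an anti join expressed with `NOT EXISTS`, and a running total computed with a window function:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES ('alice', 10), ('alice', 30), ('bob', 20);
    CREATE TABLE blacklist (customer TEXT);
    INSERT INTO blacklist VALUES ('bob');
""")

# An "anti join": keep only rows that have NO match in the other table.
anti = conn.execute("""
    SELECT o.customer, o.amount FROM orders o
    WHERE NOT EXISTS (SELECT 1 FROM blacklist b WHERE b.customer = o.customer)
""").fetchall()

# A window function: a per-customer running total that keeps every row,
# instead of collapsing them like GROUP BY would.
running = conn.execute("""
    SELECT customer, amount,
           SUM(amount) OVER (PARTITION BY customer ORDER BY amount) AS running_total
    FROM orders
""").fetchall()
```

The same constructs exist, with richer syntax, in Redshift, BigQuery, Spark SQL and most other engines, so the time invested in them transfers well.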
And why Python? After all, I didn't want to become a data scientist. Well, from my perspective, Python is related more to all the *-as-code things. If you want to orchestrate your pipelines with Apache Airflow, you will define them as Python code. You will also find libraries to deal with Infrastructure as Code in Python, like Brume. Apart from that, Python is also a good bridge between data scientists and data engineers, so if you care about having a single language to express everything, you can use Python.
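To show what "pipelines as code" means, here is a toy sketch of the underlying idea, not real Airflow code (Airflow has its own `DAG` and operator classes, and the task names below are invented): a pipeline is just a dependency graph expressed in Python, from which an execution order can be derived:

```python
# A toy pipeline: each task maps to the list of tasks it depends on.
dag = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load", "transform"],
}

def topological_order(graph):
    """Return the tasks so that every task comes after its dependencies."""
    order, seen = [], set()

    def visit(task):
        if task in seen:
            return
        for dep in graph[task]:
            visit(dep)   # schedule dependencies first
        seen.add(task)
        order.append(task)

    for task in graph:
        visit(task)
    return order
```

Because the pipeline is ordinary code, it can be versioned, reviewed, tested and generated programmatically, which is exactly what makes the *-as-code approach attractive.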
To sum up this quite long article, here are a few recommendations to start or go further in your data engineering career:
- learn a data processing framework - no comment needed; without one, going further will be complicated
- see different real use cases of data systems - you can find some inspiration in recordings from data-driven conferences or local meetups. It will help you discover what you know and what you don't.
- SQL and Python are your friends - even if you enjoy JVM languages, knowing them will help a lot to move forward or to talk with your new friends (data scientists and data analysts).
- learn well how to work with one cloud provider - try to get a specialized certification. Afterward, switching to another provider should be easier.
- be curious - this is a more global recommendation. Don't only do things, but try to understand why they work the way they work. You will see, with every new technology, you will operate more in "Indeed, I already saw that" mode than in "Oh really, it's possible 😮" mode.
- respect software engineering best practices - not mentioned before, but I had the chance to learn them from very good engineers, and they later made it easier to switch to the data world, especially at the beginning when I had to test different solutions very quickly. Thanks to correct code decomposition and unit test coverage, I was able to change backends quite fast.
- read - not necessarily my blog, but if you do, I appreciate it 😉 Joking aside, you can find really good data books, like "Streaming Systems" (Tyler Akidau, Slava Chernyak, and Reuven Lax), "Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" (Martin Kleppmann), "The Software Craftsman" (Sandro Mancuso) or "Streaming Data" (Andrew G. Psaltis). I read and appreciated them all, so feel free to take a look.
- ops - even though you will probably not do as good a job as the ops people on your team, being able to create a cloud resource from code and to prepare your data engineering project for production delivery with a CI/CD pipeline definition will be helpful and guarantee a greater level of independence. For some organizations, being able to do so is a must-have skill, so don't underestimate it.
I arrived at data engineering a little bit by chance. I simply wanted to see another face of engineering after several years spent on web apps and web services. There are some things I should have done better, and I hope that my feedback (I rarely write "personal" posts) will help you in your learning process. And if you prefer online training, you can take a look at my Become a Data Engineer course, where I cover most of the points from this article.