DataOps - DevOps in the data world or a bit more than that?

"DataOps" has been sitting in my backlog for a while already, and I postponed it multiple times. But I finally found some time to learn more about it and to share my thoughts with you.

Looking for a better data engineering position and skills?

You have been working as a data engineer but feel stuck? You don't have any new challenges and keep writing the same jobs over and over again? You now have different options. You can look for a new job, now or later, or learn from others! The "Become a Better Data Engineer" initiative is one of the places where you can find online learning resources where theory meets practice. They will help you prepare for your next job or, at least, improve your current skillset without looking for something else.

👉 I'm interested in improving my data engineering skillset

See you there, Bartosz

The post is organized as follows. The first section presents the "why" of the concept. It's an important question to answer since we already have data engineers, data scientists, and data analysts in the data world. Why do we need another role? The second section defines the concept. Finally, you will see the 7 steps to implement the DataOps framework inside a data team. You can find all the elements presented here in the DataOps Cookbook by DataKitchen. I will not reinvent the wheel here but rather try to summarize the book's 133 pages and interpret some of its points. If you are an experienced DataOps practitioner and see something missing, please leave a comment. It will help improve my and other readers' understanding.

DataOps, the "why"

In the first part of the book, you will find multiple data teams' pain points. The first is the lack of understanding that the single constant in our domain is change. I associate this with the trap of "hope and heroism", also covered in the book. Since the teams are not adapted to change, they often work day and night to overcome issues they could have solved with a proper data management strategy or software craftsmanship principles (TDD, Continuous Integration, ...). Because of that, the delivery is often "heroic" (working day and night, even during the weekend) and merely hopeful that it will provide a meaningful insight (hard to predict without good data quality monitoring).

The second pain point is the lack of automation. It's somehow related to the first point. If the process of testing a new pipeline, validating the code, and promoting it to production is not automated, then - fingers crossed - let's hope it works and doesn't break any other parts of the system, probably managed by different teams. This lack of automation can be explained by the absence of automation tools like a deployment pipeline, as well as by the lack of visibility and of deployment practices.

This lack of visibility is related to two other points: data errors and poor data quality. The former is inevitable unless your data suppliers work closely with you and guarantee that every required field will always be there, in the right format, and that the data will always be delivered on time. It's then more important to be prepared for these errors and to be aware of them than to try to prevent them at all costs. Regarding poor data quality, the book contains a great sentence summarizing the problem: "Bad Data Ruins Good Reports". If you can't say whether the data you expose to the end users is of good quality, even the best data analysts won't be able to make it valuable in the business users' eyes.
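To illustrate the kind of lightweight monitoring that makes such errors visible, here is a minimal sketch of a record-level validation step in plain Python. The field names and rules are hypothetical, not from the book; the point is that the pipeline flags problems instead of crashing on them, so the team is alerted rather than surprised:

```python
from datetime import datetime

# Hypothetical contract with the data supplier: required fields and formats.
REQUIRED_FIELDS = {"user_id", "event_time", "amount"}

def validate_record(record: dict) -> list[str]:
    """Return the list of data quality issues found in a single record."""
    issues = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        issues.append(f"missing fields: {sorted(missing)}")
    if "event_time" in record:
        try:
            datetime.fromisoformat(record["event_time"])
        except (TypeError, ValueError):
            issues.append(f"bad event_time format: {record['event_time']!r}")
    if "amount" in record and not isinstance(record["amount"], (int, float)):
        issues.append(f"amount is not numeric: {record['amount']!r}")
    return issues

# Instead of failing hard, the pipeline can route bad records aside
# and emit a metric the team monitors.
records = [
    {"user_id": 1, "event_time": "2020-01-01T10:00:00", "amount": 9.99},
    {"user_id": 2, "event_time": "yesterday", "amount": "free"},
]
bad = {i: validate_record(r) for i, r in enumerate(records) if validate_record(r)}
print(bad)  # only the second record is reported
```

In a real system, the counts of rejected records would feed a dashboard or an alert, which is exactly the visibility the book says is missing.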

The last group of problems is related to the architecture. First, the datasets are not adapted to their final use. For example, why use JSON files for an ad-hoc analysis from a serverless tool like AWS Athena, which is billed by the volume of queried data? Parquet would be much more cost- and performance-efficient. The second problem is the siloed architecture, where every team keeps a part of the main dataset. It makes data access much more complicated, especially without an automated (manual work, you see it again!) data management process involving data validation, lineage, discovery, and cataloging.

That's a summarized vision of the problems encountered by data teams. In the book, you will find them in the "Eight Challenges of Data Analytics" chapter.

DataOps definition

Finding "the" definition of DataOps in the book is not easy. The DataOps concept addresses many aspects, so let's first go through them to find "our" definition. The first aspect covers the tooling and methodology. In the book, you can read that:

DataOps is a combination of tools and methods, which streamline the development of new analytics while ensuring impeccable data quality. DataOps helps shorten the cycle time for producing analytic value and innovation, while avoiding the trap of "hope, heroism and caution."

The second aspect you'll find in the DataOps definitions is its relationship with DevOps:

It communicates that data analytics can achieve what software development attained with DevOps. That is to say, DataOps can yield an order of magnitude improvement in quality and cycle time when data teams utilize new tools and methodologies.

And to summarize:

The special sauce behind DataOps is automated orchestration, continuous deployment and testing/monitoring of the data pipeline. DataOps reduces manual effort, enforces data quality and streamlines the orchestration of the data pipeline.

From these 3 quotes - and of course the points found in the book - I would define DataOps as a set of automated data management and deployment practices that help provide meaningful data insights and promote seamless data-driven innovation. I think it covers the essentials, but I will be happy to hear your vision of the concept!

Implementing DataOps

Let's focus now on the 7 steps to follow when implementing the DataOps approach:

  1. Add tests - you got it; tests are the key component of DataOps. They enable the fast and safe evolution of data pipelines. In this category, you will find 2 test types: the usual unit/integration tests for the business logic, and the less known data quality tests to assess the quality of the generated reports. The tests should then be present not only in your application code but also in the pipeline. This idea of pipeline tests was greatly summarized by Cédric Hourcade and Germain Tanguy at Open R&Day in 2018 (in French, but if you check the slides, you will get the idea; if not, feel free to ask).
  2. Use Git - the book talks about a "Version Control System", but what's better nowadays than Git? If you know any better alternative, I will be happy to learn about it! Anyway, versioned code brings many advantages, especially in deployment and code quality management (not to mention collaborative work, which is quite obvious).
  3. Use Git correctly - create branches and merge only if your teammates approve the changes. This principle may seem obvious to a person coming from a software engineering background, but it isn't always the case. I'm thinking here about teams used to working with drag & drop-based tools, which are hard to version and deploy automatically.
  4. Leverage cloud promises to get an optimal working environment - in the past, it was quite complicated to isolate dev environments. Ordering a new server to let a new team member work in isolation on their feature? Not as easy as it seems. Fortunately, thanks to cloud flexibility and Infrastructure as Code, it's no longer a big deal to deploy a dedicated, feature-oriented environment before performing the final tests in an environment containing multiple features to be deployed.
  5. Containers and reuse - containers foster the creation of small pieces of software doing a single job and doing it well. If you take a look at Kubernetes, you will see that nowadays it's very easy to containerize and run not only data stores but also data processing code. And thanks to this technology, the code is less likely to run differently on local, dev, or production environments. It's also worth adding that containerization helps to go fast, even for the setup of a local environment, because the container image is built only once and then reused by the engineers, eventually being modified during its lifecycle.
  6. Be flexible - keep in mind that the pipeline and the executed code will always change. Make your code and ETL/ELT pipelines flexible, but also take care of consistency. An example that comes to my mind is branching in Apache Airflow, which you can use to control the pipeline's business rules.
  7. Lose your hope - with DataOps, data teams shouldn't have "hope" or "heroism" in their vocabulary. Developing and delivering a new feature to the end user should be an obvious and stress-free activity thanks to data management, tests ensuring fewer regressions in the code and data quality, and automated deployment of the evolutions.
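To make the first step more concrete, here is a minimal sketch of the two test types applied to a hypothetical transformation function (the function and its rules are mine, not from the book). Note how the data quality test checks the job's output rather than its logic:

```python
# A hypothetical business transformation: aggregate revenue per user.
def revenue_per_user(events: list[dict]) -> dict:
    totals = {}
    for event in events:
        totals[event["user_id"]] = totals.get(event["user_id"], 0) + event["amount"]
    return totals

# 1) Unit test for the business logic.
def test_revenue_per_user():
    events = [
        {"user_id": "a", "amount": 10},
        {"user_id": "a", "amount": 5},
        {"user_id": "b", "amount": 3},
    ]
    assert revenue_per_user(events) == {"a": 15, "b": 3}

# 2) Data quality test on the generated output: the report must not be
# empty when the input isn't, and revenue must never be negative.
def test_output_quality():
    result = revenue_per_user([{"user_id": "a", "amount": 10}])
    assert result, "aggregation produced an empty report"
    assert all(total >= 0 for total in result.values()), "negative revenue found"

test_revenue_per_user()
test_output_quality()
print("all checks passed")
```

In practice, the unit tests would run in the deployment pipeline (e.g. with pytest), while the data quality checks could additionally run after every production load, on real data.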

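To illustrate the Airflow branching mentioned in step 6, here is a sketch of the decision logic as a plain Python function. In Apache Airflow, such a callable would be passed to a BranchPythonOperator, and the string it returns would be the id of the downstream task to run; the task names and threshold below are hypothetical:

```python
# Decide which branch of the pipeline to execute, based on a business rule.
# In Apache Airflow, this callable would be given to a BranchPythonOperator,
# which runs only the task whose id the function returns.
def choose_processing_branch(record_count: int,
                             full_load_threshold: int = 1_000_000) -> str:
    if record_count == 0:
        return "skip_processing"        # nothing arrived today
    if record_count >= full_load_threshold:
        return "full_reprocessing"      # heavy branch, e.g. on a bigger cluster
    return "incremental_processing"     # the default, cheap branch

print(choose_processing_branch(0))           # skip_processing
print(choose_processing_branch(500))         # incremental_processing
print(choose_processing_branch(2_000_000))   # full_reprocessing
```

Keeping the rule in one small, testable function is also what makes the pipeline flexible without losing consistency: the branch names stay stable while the business rule can evolve.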
If you are still hungry after reading the above paragraphs, feel free to read the cookbook. It will help you understand the ideas better and provide examples other than those presented here, which are my interpretation of the concept. But it's not over yet. Next time I will try to look at the DataOps concept with a critical eye.

TAGS: #DataOps
