KISS principle is valid not only for software engineering but also for data pipelines. The pattern called Complex Logic Decomposition illustrates this pretty well.
New ebook 🔥
Learn 84 ways to solve common data engineering problems with cloud services.
This post begins with a short description of the pattern. Next, it shows an example of the implementation with Apache Airflow.
The goal of Complex Logic Decomposition pattern is to simplify data processing logic. Hence, instead of writing all the logic inside one application and deploy all of it with a distributed data processing framework, you will rather opt for a divide-and-conquer approach. The logic will be then split into smaller and much easier to understand applications.
A good use case of this pattern is the situation where you have to process the data in different paradigms, like with iterative and not iterative algorithms. Instead of combining the logic of them inside one program, you can locate them in 2 different steps. It will allow you to run these steps concurrently and do whatever you want with generated results in a kind of merge step.
Fortunately, implementation of Complex Logic Decomposition pattern is quite easy with modern orchestration tools. With Apache Airflow you can define it like this:
with dag: graph_data_update = PythonOperator( task_id='update_graph_data', python_callable=update_graph_data) pregel_computation = PythonOperator( task_id='pregel_computation', python_callable=execute_pregel) sql_aggregator = PythonOperator( task_id='sql_aggregator', python_callable=aggregate_with_sql) graph_sql_results_reducer = PythonOperator( task_id='graph_sql_results_reducer', python_callable=merge_results) graph_data_update >> pregel_computation >> graph_sql_results_reducer sql_aggregator >> graph_sql_results_reducer
The pattern increases readability but it can also have a bad impact on the performance. Let's take the example of an Apache Spark SQL pipeline. If your 2 "branches" use the same input dataset, keeping it inside Apache Spark you can benefit from faster data access since it can be kept in memory between the jobs. Otherwise, you will need to read them from their primary store and very often this input loading step is one of the longest ones. So like always, it's up to you to make a decision about trade-offs performance vs isolation.
To wrap-up, the pattern works well in the cases where you have separate datasets that must be merged or when the data must be computed with different paradigms (e.g. iterative and not iterative). In such a case, decomposing the workflow should help not only to understand it better but very likely to execute it in parallel and optimize it. On the other side, when the input dataset is shared by the logic, decomposing it may have a negative impact on the overall performance.