I am a freelance Data Engineer, author of Data Engineering Design Patterns and a Databricks MVP. I help companies build scalable, maintainable, and cost-effective data platforms.
I specialize in the "You build it, you run it" philosophy. I don't just deliver code; I deliver automated, tested, and observable systems.
Core Expertise
I focus on solving complex architectural puzzles, not just connecting tools:
| Category | Problem | Tech stack | Result |
|---|---|---|---|
| Data cleansing | A stateful streaming job preparing data for the Silver layer, including transformations such as standardization, reformatting, and deduplication (see the deduplication sketch below the table). | Apache Spark Structured Streaming, AWS, Jenkins, Scala | |
| Data dispatching | A real-time serverless job classifying data as valid or invalid and dispatching it to dedicated streams. | AWS, Scala | Resolved cluster maintenance friction by migrating the dispatching logic to serverless functions. |
| Data dispatching | A real-time serverless job classifying data as valid or invalid and dispatching it to dedicated streams. | Apache Beam, GCP, Java, Cloud Build | Simplified the overall architecture by clearly classifying data as "ready for processing" or "requiring investigation". Centralized the dispatching logic in a single, scalable, and fully serverless component instead of many isolated jobs doing the same work. |
| Data migration | Batch migration from Hadoop Hive to GCP BigQuery. | Apache Spark, GCP, Scala | Enabled decommissioning of the old and expensive Hadoop stack. |
| Data migration | Migrating PowerBI Dataflows to Databricks with PySpark. | Azure, PowerBI, Databricks, Python | Resolved a knowledge gap in the team responsible for this data transformation work. With an as-code solution, fully versioned on GitHub, and a clear review process, team members gained a better understanding of the overall work done with the data. |
| Data preparation | Data cleansing, normalization, and enrichment for a predictive ML use case. | Apache Airflow, Apache Spark SQL, Azure, Azure DevOps, Python | |
| Data privacy | GDPR right-to-be-forgotten system in the data warehouse based on GCP BigQuery. | Apache Airflow, GCP, SQL, Python | Ensured regulatory data erasure through scalable, automated, and auditable SQL-based deletion logic. |
| Data validation | Real-time stateful data quality pipeline validating the ordering of transformed events. | Apache Spark Structured Streaming, AWS, Scala | Improved the quality of the feedback provided by the customer support team. |
| Data visualization | Delivering sensor data to PowerBI in near real time. | Azure, Azure DevOps, Python | Helped the IoT monitoring team gain better insight into key health indicators of the IoT devices. |
| ELT | Various business-oriented, batch, SQL-only pipelines, including session generation, data cleansing, and data preparation for data marts used by data analysts (see the idempotent load sketch below the table). | Apache Airflow, GCP, GitHub Actions, Python, SQL | Improved the resilience of the pipelines by implementing idempotency patterns, reducing maintenance pressure on the team. Previously, team members had to perform manual actions on the tables, such as DELETEs or TRUNCATEs, before starting any reprocessing. |
| Reverse ETL | Real-time streaming pipeline integrating relevant events into an external CRM tool. | Apache Beam, GCP, Java, Cloud Build | Enabled real-time data insights for the marketing team, based on the data present in the data mart. |
| Ordered streaming data delivery | Streaming pipeline delivering multi-tenant data under ordering and maximum-latency constraints. | Apache Spark Structured Streaming, AWS, GitLab CI, Scala | |
| Scaling automation | Automating project deployment for new regions. | AWS, Databricks, Python | Reduced deployment time from 2 weeks to 1 day by replacing manual configuration with automated Jinja-powered templates deployed from a configuration file. |
| Sessionization | Hourly batch jobs generating sessions, followed by data ingestion into the data warehouse layer. I also gave a talk on that topic. | Apache Airflow, AWS, Apache Spark SQL, Scala, Python, Jenkins | Implemented the backbone for future data use cases within the organization by rewriting an unsuccessful SQL-on-Teradata PoC into a more scalable cloud environment leveraging Apache Spark for distributed processing. |
| Sessionization | Real-time user session generation with session windows (see the session-window sketch below the table). | Apache Beam, GCP, GCP Cloud Build, Java, Terraform | Improved the precision of geolocated recommendations by providing additional context to the backend service with session windows generated in near real time (< 2 min latency). |
| Sessionization | Serverless sessionization pipeline relying on Change Data Capture to provide user activity insight in near real time. | AWS, Scala, Terraform | Established a single source of truth for audience analytics by integrating with an external audience measurement system, providing the organization with certified viewership validation for all broadcast channels. |
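Technique Sketches

To make the table more concrete, here are a few short sketches in Scala. First, the stateful deduplication pattern from the data cleansing row: the watermark bounds the deduplication state, so the job can run indefinitely without unbounded memory growth. This is a minimal sketch; the Kafka topic, JSON layout, and paths are illustrative assumptions, not the original project's setup.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SilverCleansingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("silver-cleansing-sketch").getOrCreate()
    import spark.implicits._

    // Illustrative Bronze source; topic name and JSON fields are assumptions.
    val bronze = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "bronze-events")
      .load()
      .select(
        get_json_object($"value".cast("string"), "$.event_id").as("event_id"),
        // Assumes ISO-8601 timestamps in the payload.
        get_json_object($"value".cast("string"), "$.event_time").cast("timestamp").as("event_time"),
        get_json_object($"value".cast("string"), "$.country").as("country")
      )

    val silver = bronze
      // Standardization/reformatting step (illustrative): normalize the country code.
      .withColumn("country", upper(trim($"country")))
      // Stateful deduplication: the watermark lets Spark purge state for events
      // older than 1 hour, keeping one record per (event_id, event_time).
      .withWatermark("event_time", "1 hour")
      .dropDuplicates("event_id", "event_time")

    silver.writeStream
      .format("parquet")
      .option("path", "/data/silver/events")                   // illustrative path
      .option("checkpointLocation", "/data/checkpoints/silver") // illustrative path
      .start()
      .awaitTermination()
  }
}
```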
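Next, the idempotency pattern from the ELT row. The original pipelines were SQL-based on GCP; this sketch transposes the same idea to Spark's dynamic partition overwrite, which replaces only the partitions being rewritten, so re-running a day requires no manual DELETE or TRUNCATE. The paths and execution date are illustrative.

```scala
import org.apache.spark.sql.SparkSession

object IdempotentDailyLoadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("idempotent-daily-load-sketch")
      // Only the partitions present in the written DataFrame are replaced;
      // all other partitions of the target stay untouched.
      .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
      .getOrCreate()

    // In practice the orchestrator (e.g. Apache Airflow) injects this date.
    val executionDate = "2024-01-15"

    val daily = spark.read.parquet("/data/raw/orders")          // illustrative input
      .where(s"event_date = '$executionDate'")

    // Re-running this job for the same date simply rewrites the same partition,
    // which makes the load idempotent by construction.
    daily.write
      .mode("overwrite")
      .partitionBy("event_date")
      .parquet("/data/marts/orders")                            // illustrative output
  }
}
```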
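Finally, the session windows from the real-time sessionization row. That project used Apache Beam on GCP; to keep a single language across these examples, the sketch below shows the analogous construct with Spark's session_window function (available since Spark 3.2). The 10-minute gap, the watermark, and the rate source are illustrative stand-ins for the real event stream.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object SessionizationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sessionization-sketch").getOrCreate()
    import spark.implicits._

    // Built-in test source standing in for the real event stream.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .withColumn("user_id", ($"value" % 5).cast("string"))
      .withColumnRenamed("timestamp", "event_time")

    // A session closes after 10 minutes of inactivity for a given user;
    // the watermark lets Spark finalize and emit closed sessions.
    val sessions = events
      .withWatermark("event_time", "15 minutes")
      .groupBy($"user_id", session_window($"event_time", "10 minutes"))
      .agg(count("*").as("events_in_session"))

    sessions.writeStream
      .outputMode("append")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```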
To ensure the best results, I typically work under a clearly defined collaboration model.
If you have a data engineering challenge, especially involving Databricks, Spark, or Streaming, I'd love to hear about it.
To help me give you a quick "Yes/No" on fit, please include a short description of your challenge and your current stack.
Email me at contact@waitingforcode.com.