I'm a freelance data engineer and the author of the Data Engineering Design Patterns book.
My passion for knowledge sharing has led me to speak at various conferences such as Data+AI Summit, Spark Summit, NDC, and Big Data Summit, where I discuss topics ranging from data engineering to cloud-based solutions.
Recognized as a Databricks MVP for my contributions to technology sharing, I have built a reputation for my expertise in designing and implementing high-performance data systems on public clouds using Databricks and open-source technologies.
As a freelancer, I specialize in crafting scalable and efficient data solutions that empower organizations to harness the full potential of their data.
In addition to my speaking engagements, I actively contribute to the tech community by sharing my knowledge through blogs, open-source projects, and hands-on workshops.
Whether you're looking to optimize your data infrastructure or leverage the latest advancements in data engineering, I'm here to help you achieve your goals.
It depends - you already know, I'm a consultant ;-) More seriously, it depends on the problem you have! I can help with any data engineering issue, but I won't be the best person to implement a Machine Learning algorithm or a backend service.
My areas of expertise are:
Besides, I'm also a follower of the "you build it, you run it" principle, so CI/CD pipelines (Gitlab CI, Jenkins) and IaC (Terraform) are part of my daily work routine.
I didn't include all the technologies I've worked with so far in the list above, since I don't consider myself an expert in them. But to complete the picture, I've also worked with other data technologies, including Apache Airflow, Apache Beam (Dataflow runner), ETL- and ELT-based batch processing, and serverless functions for event-driven workflows.
You already know my hard skills. But in addition to them, I'm fully committed to the teams and people I work with. As a lifelong learner, I'm always looking to bring innovation to both the code and the people. If you are here, you know I like to share my discoveries on the waitingforcode.com blog, in the hope of spreading knowledge throughout the data community. I also do it privately with my teammates by leading internal workshops, preparing POCs, and improving code quality and team skills through code reviews.
I don't like staying idle and am continuously looking for new data challenges. Below is a list of the projects I've worked on:
Category | Problem | Tech stack |
---|---|---|
Data cleansing | A stateful streaming job preparing data for the Silver layer, including modifications like standardization, reformatting, and deduplication. | Apache Spark Structured Streaming, AWS, Jenkins, Scala |
Data dispatching | A real-time serverless job classifying data as valid or invalid and dispatching it to dedicated streams. | AWS, Scala |
Data dispatching | A real-time serverless job classifying data as valid or invalid and dispatching it to dedicated streams. | Apache Beam, GCP, Java, Cloud Build |
Data migration | Batch migration from Hadoop Hive to GCP BigQuery. | Apache Spark, GCP, Scala |
Data migration | Migrating PowerBI Dataflows to Databricks with PySpark. | Azure, PowerBI, Databricks, Python |
Data preparation | Data cleansing, normalization, and enrichment for a predictive ML use case. | Apache Airflow, Apache Spark SQL, Azure, Azure DevOps, Python |
Data privacy | GDPR right-to-be-forgotten system in the data warehouse based on GCP BigQuery. | Apache Airflow, GCP, SQL, Python |
Data validation | Data quality real-time and stateful pipeline to validate the order of the transformed events. | Apache Spark Structured Streaming, AWS, Scala |
Data visualization | Delivering sensors data for PowerBI in near-real time. | Azure, Azure DevOps, Python |
ELT | Various business-related batch and SQL-based only pipelines, including sessions generation, data cleansing, data preparation for data marts used by data analysts. | Apache Airflow, GCP, Github Actions, Python, SQL |
Reverse ETL | Real-time streaming pipeline integrating relevant events into an external CRM tool. | Apache Beam, GCP, Java, Cloud Build |
Ordered streaming data delivery | Streaming pipeline delivering multi-tenant data under ordering and maximum-latency constraints. | Apache Spark Structured Streaming, AWS, Gitlab CI, Scala |
Scaling automation | Automating a project deployment for new regions. Reduced deployment time from 2 weeks to 1 day. | AWS, Databricks, Python |
Sessionization | Hourly batch jobs generating sessions, with a follow-up part ingesting the data into the data warehouse layer. I also gave a talk on that topic. | Apache Airflow, AWS, Apache Spark SQL, Scala, Python, Jenkins |
Sessionization | Real-time user session generation with the help of session windows. | Apache Beam, GCP, GCP Cloud Build, Java, Terraform |
Sessionization | Serverless sessionization pipeline relying on Change Data Capture to provide user activity insights in near real time. | AWS, Scala, Terraform |
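If you're curious what the sessionization projects above have in common: they all revolve around the same core idea of grouping a user's events into sessions that close after an inactivity gap, which is exactly what session windows do in Spark Structured Streaming or Apache Beam. Here's a minimal, framework-free sketch of that rule (the 30-minute gap and the helper name are illustrative, not taken from any of the projects listed):

```python
SESSION_GAP_SECONDS = 30 * 60  # assumed 30-minute inactivity gap


def sessionize(event_times: list[int], gap: int = SESSION_GAP_SECONDS) -> list[list[int]]:
    """Group a single user's event timestamps (epoch seconds) into sessions.

    A new session starts whenever the gap between two consecutive events
    exceeds the inactivity threshold - the same rule session windows apply
    in streaming frameworks, here applied to an in-memory batch of events.
    """
    sessions: list[list[int]] = []
    for ts in sorted(event_times):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # still within the current session
        else:
            sessions.append([ts])  # gap exceeded: open a new session
    return sessions


# Two events 1 minute apart, then one 2 hours later -> two sessions
print(sessionize([0, 60, 7200], gap=1800))  # [[0, 60], [7200]]
```

In a real streaming job the same logic becomes stateful: the framework keeps the open session in its state store and emits it once the watermark passes the gap boundary.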
Can't find your problem in the list? Well, I haven't had a chance to work on it yet, but I'm excited to help as long as it stays in the data engineering landscape!
Send me an email with your project at contact@waitingforcode.com. No worries if it's not very detailed, but please include the nature of the problem and the technology stack.