Data engineering on AWS articles

Don't sleep when you code...about sleep issue in KPL

Lessons learned on why it's always worth checking the code implementation to avoid surprises later, even for vendor-supported solutions.

Kinesis sequence number is not an Apache Kafka offset

I used to say "Kinesis Data Streams is like Apache Kafka, an append-only streaming broker with partitions and offsets". Although that's often true, unfortunately it's not that simple.

Amazon Kinesis is not Apache Kafka

Open Source tools helped me a lot when I switched to the cloud world. Managed cloud services often share the same fundamentals as their Open Source alternatives. However, there is always something different. Today I'll focus on the differences between the Amazon Kinesis service and the Apache Kafka ecosystem.

Serverless MapReduce?

Is it possible to implement the MapReduce paradigm on top of cloud serverless functions? Technically yes, and there are some reference architectures I'm going to discuss in this blog post. Is it a good idea? It depends on the context, and hopefully you'll be able to figure out the answer after reading my thoughts.

Serverless streaming on AWS - an overview

If you have already worked on AWS and tried to implement streaming applications, you have certainly noticed one thing: there is no single way to do it! And if you didn't notice that, I hope this blog post will convince you and, by the way, help you get a better understanding of the available solutions.

DataFrames for analytics - Glue DynamicFrame

When I came to the data world, I had no idea what data governance was. One of the tools that helped me understand it was AWS Glue. I had a chance to work with it again during my AWS Big Data specialty exam preparation, and that's when I asked myself - "DynamicFrames?! What's the difference from DataFrames?" In this post I'll try to shed light on it.

AWS data services security - encryption, authentication and exposition

When I was preparing for my AWS Big Data specialty certification, I was not comfortable with 2 categories: visualization and security. Because of that I decided to work on them, starting with the latter, which can have a more direct impact. I start that work with this post about data security practices on AWS data services.

My journey to AWS Certified Big Data specialty

On January 10, 2020, I successfully passed my AWS Certified Big Data specialty exam with an overall score of 82%. Even though it will soon be replaced (April 2020) by AWS Certified Data Analytics - Specialty, I'd like to share with you my learning process and some interesting resources.

My journey to AWS Cloud Practitioner

One of the goals in my 3-Levels List was to get 3 certificates: AWS Cloud Practitioner, AWS Big Data and GCP Data Engineer. I've already passed the first one and that's the reason I'm writing this blog post.

Loading data into Redshift with COPY command

One of the approaches to loading big volumes of data efficiently is to use bulk operations. The idea is to take all the records and put them into the data store at once. For this purpose, AWS Redshift exposes an operation called COPY.

AWS Kinesis Firehose, event time and batch layer

Last time I wrote about sending Apache Kafka data to the batch layer. This time I would like to do the same but with AWS technologies, namely Kinesis, Firehose and S3.

Listening EMR events with AWS Lambda

I really appreciate AWS services, and one of the main reasons is how easy they make it to implement event-driven systems. One interesting use case for these events is related to the EMR service, responsible for running Apache Spark pipelines. In this post I will try to trigger an action every time an EMR step completes successfully.

AWS Lambda - does it fit in with data processing ?

Despite the recent criticism (cf. the "Serverless Computing: One Step Forward, Two Steps Back" link in the Read also section), the serverless movement is gaining popularity. Databricks offers a serverless platform for running Apache Spark workflows, Google Cloud Platform comes with a similar service reserved for Dataflow pipelines, and Amazon Web Services, ... In this post, I will summarize the good and bad sides of my recent experiences with AWS Lambda applied to data processing.

Doing data on AWS - overview

Open Source provides a lot of interesting tools to deal with Big Data: Apache Spark, Apache Kafka, Parquet - to name only a few. However, nowadays data platforms without cloud support are rarer and rarer. That's why this topic merits its own category and posts on this blog. To not go too quickly, the first article covers the services you can use to work with data on AWS.
