When I first encountered the term Complex Event Processing (CEP), I was scared. Event stream processing was complex enough on its own, so why this extra complex-specific stuff? It turns out the complexity is real, but in this post I'll focus on a different aspect: what services support CEP on the cloud?
AWS was the first cloud provider I worked with. That's why, when I did my first Azure and GCP projects, I kept asking myself, "Hey, how would I implement that on AWS?". Answering that question was easy most of the time, but sometimes I got stuck. One of my biggest issues was the identity and permissions management component. I'll try to share some related answers in this blog post.
Data is not perfect, and in each project, you'll probably need to do some cleaning to prepare it for business use cases. To make this task easier, cloud providers have dedicated data wrangling services, and they'll be the topic of this blog post.
Is it possible to implement the MapReduce paradigm on top of cloud serverless functions? Technically yes, and there are some reference architectures I'm going to discuss in this blog post. Is it a good idea? That depends on the context, and hopefully you'll be able to figure out the answer after reading my thoughts.
It's time for the 5th part of the "What's new on the cloud for data engineers" series. This time I will cover the changes between September and December.
When I tell you "schema management" and "streaming", you'll certainly think of Apache Kafka's schema registry. That's true, but cloud streaming services also manage schemas, and in this blog post we'll see how.
That's one of the biggest problems I've faced in my whole career: the development environment! I'm not talking here about creating cloud resources in a different subscription, but about an environment sharing similar characteristics with production. In this blog post I'll share different strategies to put in place in the context of the cloud and streaming applications.
Processing static datasets is easier than processing dynamic ones that may change over time. Fortunately, cloud services offer various features, more or less manual, to scale the data processing logic. We'll see some of them in this blog post.
How to manage secrets is probably one of the first problems you'll encounter while deploying resources from a CI/CD pipeline. The simple answer is: don't manage them at all! Let the cloud services do it for you.
One of the big announcements at the previous Data+AI Summit was Delta Sharing, a protocol for exchanging live data with internal and external users. The question I asked myself at that moment was, "Does something like this exist on the cloud?". Let's see.
When it comes to executing one isolated job, there are many options, and using a data orchestrator is not always necessary. However, that doesn't apply to the opposite scenario, where a data orchestrator not only schedules the workload but also provides a monitoring layer. So the question arises: what to use on the cloud?
I first heard about the time travel feature with Delta Lake. But after digging a bit, I found out that it's not a purely Delta Lake concept! In this blog post I'll show you which cloud services implement it too.
When I was writing my previous blog post about losing data on the cloud, I wanted to call it "data loss prevention". It turns out that this term is already reserved for a different problem, the problem I will cover just below.
Data is a valuable asset and nobody wants to lose it. Unfortunately, it's possible, even with cloud services. Fortunately, thanks to their features, we can reduce this risk!
You've all certainly heard about EMR, Databricks, Dataflow, DynamoDB, BigQuery or Cosmos DB. These are well-known data services of AWS, Azure and GCP, but besides them, the cloud providers offer some, often lesser-known, services worth considering in data projects. Let's see some of them in this blog post!
When I first heard about Durable Functions, my reaction was: "So cool! We can now build fully serverless, stateful streaming pipelines!". Indeed, we can, but that's not their only feature!
It's time for the 4th part of the "What's new on the cloud for data engineers" series. This time I will cover the changes between May and August.
Guess which topic I was afraid of at the beginning of my cloud journey as a data engineer? Networking! VPCs, VPNs, firewalls, ... I thought I'd be able to live without the networking lessons from school, but how wrong I was! IMO, as a data engineer you should know a bit about networking, since it's often related to the security part of the architectures you'll design. And in this article, I'll share some networking points I wish I had known before starting to work on the cloud.
Almost 2 years ago (already!), I wrote a blog post about data pipeline patterns in Apache Airflow (link in the "Read also" section). Since then, I have worked with other data orchestrators, which is why I'd like to repeat the same exercise, this time for Azure Data Factory.
The easiest way to learn is by doing, but what if it requires handing over your credit card number beforehand? I've never been comfortable with that, but there's no other way to get hands-on experience on the cloud. Fortunately, it doesn't mean you can't control your expenses. In this article, we'll see how.