What's new on the cloud for data engineers - part 4 (05-08.2021)

It's time for the 4th part of the "What's new on the cloud for data engineers" series. This time I will cover the changes between May and August.

Let me know what you think about the new format. This time, I tried to organize the changes into categories specific to each service type. I also tried to avoid listing version upgrades except for some big changes like Apache Airflow 2 support. Hopefully, this new organization will make the article, which is usually quite long, more readable!

AWS

Athena

Data reading:

Security updates:

Aurora

A lot of added PostgreSQL extensions:

Availability:

CloudFront

Even though CloudFront is not a pure data service, it got an interesting lightweight data processing feature, mostly limited to the HTTP communication layer:

Data Exchange

Two service changes:

Database Migration Service

A small security-related change:

DocumentDB

The document database also got some new features:

DynamoDB

An interesting feature if you're using NoSQL Workbench:

ElastiCache for Redis

ElastiCache should scale better with a new feature:

Elasticsearch service

Some changes in the storage part:

EMR

Let's start the EMR part with cluster changes:

A dedicated list for the new features:

And for the integration with other AWS services:

EventBridge

Two interesting changes on EventBridge:

FinSpace

FinSpace is one of the new services in this update list. It's a fully managed data management and analytics service dedicated to the financial services industry. In addition to data management capabilities, the service also includes a library of functions commonly used in this domain.

Glue

As in the previous updates, Glue is probably the most updated service. Let's start with the DataBrew component and the newly supported transformations:

In addition to the transformations, there are some changes regarding data types:

Besides, DataBrew also got some updates in the supported data sources and data sinks:

Among the other DataBrew announcements, you'll find these 2:

Another Glue visual component updated in the last few months is Glue Studio:

And since we're talking about jobs:

Also, the streaming component got a change:

Kendra

Two search-related changes in Kendra:

Keyspaces

I didn't present this service in the previous releases, so let me do it right now. In a nutshell, Keyspaces is an Apache Cassandra-compatible, highly and easily scalable database service. Among the most recent features:

Besides, Keyspaces also got some security improvements:

Kinesis Data Analytics

This streaming analytics service got a new component in the last few months:

In addition to this new feature, the service also got some other evolutions:

KMS

The service is not a pure data service but it's often the encryption component for many of the data services. In June, KMS got an interesting feature from this encryption standpoint:

Lake Formation

The data lake service had an interesting security improvement:

Lambda

Two streaming/messaging related changes in AWS Lambda:

Managed Workflows for Airflow

I promised to not include the new releases in my update blog posts but this news deserves to be shared!

MemoryDB

August brought not only new features for the already existing services but also a completely new service called MemoryDB for Redis. It's a Redis-compatible, durable, and in-memory database service, advertised as a data store for modern applications with microservices architectures.

Its API is compatible with Redis, so it should be relatively easy to give it a try for existing applications. The service also relies on multiple Availability Zones for data durability and on a Multi-AZ transaction log for fast failover and recovery.
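To illustrate what the Redis compatibility means in practice, here is a minimal sketch of pointing an existing Redis client at a MemoryDB endpoint. It assumes the third-party redis-py library (version 4.1 or later, which ships `redis.cluster.RedisCluster`), a cluster with in-transit encryption enabled, and the default 6379 port. The `memorydb_client_kwargs` and `connect` helpers and the endpoint in the comment are hypothetical names introduced only for this example.

```python
from typing import Any, Dict


def memorydb_client_kwargs(endpoint: str, port: int = 6379,
                           use_tls: bool = True) -> Dict[str, Any]:
    """Build the keyword arguments for a cluster-aware Redis client.

    Nothing here is MemoryDB-specific: these are the same settings an
    application would pass for any TLS-enabled Redis cluster.
    """
    return {
        "host": endpoint,
        "port": port,
        "ssl": use_tls,            # assumption: in-transit encryption is on
        "decode_responses": True,  # return str instead of bytes
    }


def connect(endpoint: str):
    """Open a cluster connection (requires `pip install redis`)."""
    # Imported lazily so the helper above stays usable without redis-py.
    from redis.cluster import RedisCluster
    return RedisCluster(**memorydb_client_kwargs(endpoint))


# Hypothetical usage against a placeholder endpoint:
#   client = connect("clustercfg.my-cluster.example.memorydb.eu-west-1.amazonaws.com")
#   client.set("greeting", "hello")
#   client.get("greeting")
```

If the application already uses a cluster-aware Redis client, the only change in this sketch is the endpoint passed to it. Whether every command used by the application is supported should still be verified against the service documentation.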

MSK

Two security features for this managed Apache Kafka service:

MQ

This AWS messaging broker implements a new feature for RabbitMQ:

Neptune

As for the AWS graph database, it got the following updates:

Redshift

Let's start Redshift updates with the querying part:

And what about security updates?

Among other features, you'll find:

RDS

To present the RDS updates, I will cover one database type in each list. To start, Oracle with 2 updates:

One new feature is shared by PostgreSQL and Oracle:

And to finish, SQL Server:

S3

Let's start with 2 security updates for S3:

Besides, a few things also changed for Amazon S3 on Outposts:

Snow family

Four interesting improvements for this data migration service:

SNS

SNS has an interesting filtering feature:

SQS

One of the preview features became Generally Available in May:

QuickSight

If you're a data engineer, you won't necessarily use QuickSight on a daily basis. However, it has an interesting feature for operationalizing your pipelines:

Azure

Cosmos DB

Let's start the Azure part with Cosmos DB security improvements:

In addition to the client-side encryption, you will also find the following client features:

Regarding more service-oriented features:

Data Explorer

One Data Explorer network feature became Generally Available:

Data Lake

Some important announcements regarding Data Lake Gen1:

Among other features, you will find some interesting data capabilities:

Database

Let's start with 2 interesting Azure SQL Database features:

Besides, a lot of things also happened for PostgreSQL - Hyperscale (Citus), with plenty of General Availability announcements:

Presenting Citus is a good occasion to introduce the Azure Database for PostgreSQL features:

To finish, let's see what's new in Azure Database for MySQL:

Databricks

If you're looking to decrease your costs, the following update may interest you:

Functions

A few changes in the Durable Functions framework:

Purview

Two important integrations for the data cataloguing solution:

Security features:

Synapse

One interesting integration with an ACID-compatible file format:

Service Bus

Service Bus got a quota limit increase:

Storage

Quite a few things happened recently on Azure Storage. To start, a completely new component is available:

Besides, the blobs have the following new features:

When it comes to the security announcements:

Finally, among other features, you will find:

GCP

Cloud Composer

Exactly as for AWS Managed Workflows for Apache Airflow, let's start with some big news:

The service should also have better error management thanks to:

Besides, there are also 3 new features:

BigQuery

It's time to see what's new on BigQuery. Let's start with general service features:

Regarding querying features:

In addition to the new functions, BigQuery also got new supported types:

Besides the aforementioned features, BigQuery also got some new Data Control Language (DCL) and Data Definition Language (DDL) statements:

And finally, 2 new features for the BigQuery Geographic Information Systems module:

BigQuery Transfer Service

It was relatively calm in the previous update but since then, the BigQuery Transfer Service got some new features:

Cloud SQL

Let's first see the changes common to all Cloud SQL databases:

When it comes to the particular databases, let's start with PostgreSQL:

And what about MySQL?

To finish, the Cloud SQL for SQL Server news is just below:

Cloud Storage

Some changes from the upload category first:

Among other features, you will find:

Data Fusion

Two replication changes for this code-free ETL solution:

Among other features, you will find:

Dataflow

Three new Generally Available features are:

Besides, the service also has some pipeline-related updates:

Dataproc

Most of the Dataproc changes concern the clusters:

And when it comes to scaling:

  • Dataproc Enhanced Flexibility Mode is finally Generally Available - this mode manages shuffle data to minimize job progress delays that might be caused by the removal of nodes from a running cluster. With this mode, the shuffle data can be written either to the primary workers (recommended for Spark jobs) or to a Hadoop Compatible File System (HCFS; only primary workers participate in this mode).
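The two shuffle destinations described above are selected with a cluster property at creation time. The following is a minimal sketch that only builds the corresponding `gcloud` command line. It assumes the `dataproc:efm.spark.shuffle` and `dataproc:efm.mapreduce.shuffle` property names and the `primary-worker` and `hcfs` values from the Dataproc documentation at the time of writing, so please verify them before use. The `efm_cluster_create_command` helper is a hypothetical name introduced for this example.

```python
from typing import List

# Assumed from the Dataproc EFM documentation; verify against the current docs.
SUPPORTED_ENGINES = ("spark", "mapreduce")
SUPPORTED_SHUFFLE_MODES = ("primary-worker", "hcfs")


def efm_cluster_create_command(cluster_name: str, region: str,
                               engine: str = "spark",
                               shuffle_mode: str = "primary-worker") -> List[str]:
    """Return the gcloud arguments creating a cluster with EFM enabled.

    The command is only built, never executed, so the sketch has no
    side effects and needs no GCP credentials.
    """
    if engine not in SUPPORTED_ENGINES:
        raise ValueError(f"Unsupported engine: {engine}")
    if shuffle_mode not in SUPPORTED_SHUFFLE_MODES:
        raise ValueError(f"Unsupported shuffle mode: {shuffle_mode}")
    efm_property = f"dataproc:efm.{engine}.shuffle={shuffle_mode}"
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        f"--region={region}",
        f"--properties={efm_property}",
    ]


# Example: primary-worker shuffle, the mode recommended for Spark jobs.
print(" ".join(efm_cluster_create_command("efm-demo", "europe-west1")))
```

The returned list can be passed to `subprocess.run` as-is. Keeping the command construction separate from its execution makes the property choice easy to review and test.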
Datastore

Two evolutions for Datastore:

Datastream

Not only AWS had new services in the last few months. Datastream is a new GCP service. It's a serverless Change Data Capture and replication service that enables streaming low-latency data from Oracle and MySQL databases. It also integrates with other GCP services like BigQuery, Cloud Spanner, Dataflow and Data Fusion.

Functions

A new interesting feature is available for Cloud Functions:

IAM

Even though IAM is not a pure data product, some of its new features may be interesting for a data engineer:

Pub/Sub

Three GA announcements:

Spanner

Spanner is another heavily updated GCP service. To start, the types-related changes:

When it comes to the querying and data storage part:

And below you can find the news from the operationalization category:

Storage Transfer Service

This data transfer service has some multi-cloud integration features:

When it comes to the features for on-premises projects:

Please let me know what you think about this new presentation format. If you have any ideas for improving this type of blog post, I will be happy to hear them. Otherwise, a lot of good things happened in the last few months on the cloud. To sum up, AWS and GCP came with new services, MemoryDB and Datastream. AWS continued to innovate on Glue, GCP put a significant effort into BigQuery and Spanner, whereas Azure introduced some important improvements to the Storage and Data Lake Gen2 services. Of course, it's a very simplified vision, but since every article has to end somehow, I think it gives a good picture of the most recent evolutions on the cloud for data engineers!

