It's time for the 4th part of the "What's new on the cloud for data engineers" series. This time I will cover the changes between May and August.
Let me know what you think about the new format. This time, I tried to organize the changes into categories specific to each service type. I also tried to avoid listing version upgrades except for some big changes like Apache Airflow 2 support. Hopefully, this new organization will make the article, which is usually quite long, more readable!
AWS
Athena
Data reading:
- added support for Apache Hudi 0.8.0 - thanks to this release and Athena's snapshot query feature, you can use Athena to get a near real-time view of the Hudi tables! Moreover, the release includes a more optimized way to create Hudi datasets from already existing Parquet files
- Microsoft-certified Power BI connector - starting from July you can connect Power BI to Athena via the Microsoft-certified official connector. The connector relies on the existing ODBC driver and supports major Power BI features like a scheduled dataset refresh or imported data source.
- Parquet, Avro, ORC and JSON formats supported in the UNLOAD statement (sketched after this list)
- with the cross-account feature you can register a Glue Data Catalog from a different account and run cross-account queries from Athena
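The UNLOAD statement mentioned above writes query results directly to S3 in the chosen format. A minimal sketch, assuming a hypothetical orders table and destination bucket:

```sql
-- Export one day of orders as Parquet files; the table and bucket names are
-- illustrative assumptions.
UNLOAD (SELECT order_id, customer_id, total_amount
        FROM orders
        WHERE order_date = DATE '2021-08-01')
TO 's3://my-results-bucket/unload/orders/2021-08-01/'
WITH (format = 'PARQUET')
```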
Security updates:
- the JDBC and ODBC drivers support Microsoft Azure Active Directory and Ping Identity's PingFederate authentication methods
- support of parameterized queries - if a query is reused with small changes like a different filter value, you can use a prepared statement to create a parameterized query. To do so, define the parameterized query with PREPARE ${statement name} FROM ${statement} and call it by passing different parameters with EXECUTE ${statement name} USING value1, value2, ... (see the example below)
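A minimal sketch of the parameterized query workflow, assuming a hypothetical orders table:

```sql
-- Create the reusable statement once; ? marks the positional parameters.
PREPARE daily_orders FROM
SELECT order_id, total_amount
FROM orders
WHERE order_date = ? AND status = ?;

-- ...then execute it with different parameter values.
EXECUTE daily_orders USING DATE '2021-08-01', 'SHIPPED';
EXECUTE daily_orders USING DATE '2021-08-02', 'RETURNED';
```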
Aurora
A lot of added PostgreSQL extensions:
- pg_bigm to optimize full text search
- pg_partman to facilitate partition management thanks to automatic partition creation and runtime maintenance.
- pg_proctab to access OS statistics like I/O statistics on SQL queries.
- oracle_fdw to connect from your PostgreSQL database to Oracle databases
Availability:
- the JDBC driver got better monitoring of the database cluster status - it should be more reactive and minimize the failover time
CloudFront
Even though CloudFront is not a pure data service, it got an interesting lightweight data processing feature, mostly focused on the HTTP communication layer:
- serverless CloudFront Functions - you can use them to run JavaScript functions across more than 255 CloudFront physical locations. An example quoted in the release notes is the analysis of the Accept-Language header to rewrite the request to a language-specific version of your site.
Data Exchange
Two service changes:
- easier data product publication - revision publication was reduced from two steps to a single one!
- data subscribers can pay in installments with custom payment schedules - previously, they had to pay the whole price upfront. Now, the data providers can create custom payment plans for the subscriptions.
Database Migration Service
A small change on the security side:
- support of TLS authentication and encryption in Apache Kafka endpoints (both self-managed and Amazon MSK)
DocumentDB
The document database also got some new features:
- better compatibility with MongoDB by the added support for the following operations: renameCollection, $natural, $indexOfArray, $reverseArray, $zip
- improved index support - ability to use an index in distinct() operation for non-multikey indexes
- Global Clusters support for protection against region-wide outages, better disaster recovery and low-latency global reads, with read operations redirected to the DocumentDB cluster closest to the client
DynamoDB
An interesting feature if you're using NoSQL Workbench:
- possibility to save frequently used data-plane operations with the Save Operations menu
ElastiCache for Redis
ElastiCache should scale better with a new feature:
- support for automatic scaling actions defined as auto scaling policies - it's a classical approach where you define thresholds like memory usage. Whenever one of them is reached, the service scales the cluster to handle the extra load. The opposite action happens if the added resources remain idle.
Elasticsearch service
Some changes in the storage part:
- lower cost storage for infrequently used data - the solution is based on the UltraWarm storage tier, where one part is reserved for hot storage (frequently used data) and another for cold storage (less frequently used data). The difference with classical archival solutions is the ability to query both layers in the same interactive analysis experience.
EMR
Let's start the EMR part with cluster changes:
- On-Demand Capacity Reservations - by default, EMR uses the lowest-priced instances first to meet the capacity requirements. However, if you do care about the instance type, platform or its Availability Zone, you can define the Capacity Reservations to take precedence over this price-based strategy.
- increased limit on EC2 instance types for master, core and task groups when using EMR Instance Fleets - you can now configure twice as many different node types as before (15 → 30)
A dedicated thread for new features:
- SQL-based ETL with Apache Spark on Amazon EKS - the solution is more a recipe relying on already existing AWS services. It provides a Jupyter notebook running on top of Amazon EKS. The solution also uses Apache Spark SQL for declarative data processing and Argo Workflows library for no-code jobs orchestration.
- native availability of Apache Ranger, providing fine-grained data access control through authorization policies applied to Hive Metastore and to S3 via EMRFS
And for the integration with other AWS services:
- S3 Access Points support - Access Points facilitate data access at scale. Instead of managing multiple access policies at the bucket level, you can define different Access Points, associate them with the bucket and use them later in EMR jobs to provide access only to the authorized data.
- EMR on EKS supports the Apache Spark pod templates feature - I introduced this feature in the What's new in Apache Spark 3.0 - Kubernetes article but in a nutshell, pod templates are a great way to use Kubernetes configurations that are not managed by Apache Spark but are available in Kubernetes
EventBridge
Two interesting changes on EventBridge:
- integration with Step Functions to send events to EventBridge directly from Step Functions workflows, without the need to write any extra custom code
- fan-in and fan-out patterns support, so the possibility to aggregate events from different event buses or dispatch events from a single bus to other ones
FinSpace
FinSpace is one of the new services in this update list. It's a fully managed data management and analytics service dedicated to the financial services industry. In addition to data management capabilities, the service also includes a library of functions commonly used in this domain.
Glue
As previously, Glue is probably the most updated service. Let's start with the Data Brew component and the new supported transformations:
- NESTED and UNNEST transformations to work with nested JSON fields
- IF, AND, OR and CASE conditions to create custom values or reference other columns within the expressions
- numerical format transformations, including the features like setting decimal precision, customizing thousands separators, and abbreviating large values
In addition to the transformations, there are some changes in the data types field:
- 14 new advanced data types - Data Brew can automatically identify and normalize them. Starting from June you can use the following additional types: Social Security Number (SSN), email address, phone number, gender, credit card, URL, IP address, date and time, currency, zip code, country, region, state, and city
Besides, Data Brew also got some updates in the supported data sources and data sinks:
- native integration with Amazon AppFlow, which in turn opens the connection to data from SaaS services such as Salesforce, Zendesk, Slack and ServiceNow; the feature is currently in Preview
- backslash delimiter (\) support in CSV datasets
- cleaned and normalized data can now be written to JDBC-supported databases and data warehouses without having to move the data into intermediary data stores
- support for Tableau Hyper as data output format; you should now be able to upload the cleaned datasets directly into Tableau
- also, cleaned data can be directly written into AWS Lake Formation-based tables on S3 managed by Glue Data Catalog; Data Brew will automatically adopt the existing access permissions and security defined in the table and its associated S3 location
- finally, the cleaned datasets can be directly written to tables managed by Glue Data Catalog so that you can rely on the catalog to manage the metadata
Among 2 other Data Brew announcements, you'll find:
- AWS Health Insurance Portability and Accountability Act (HIPAA) eligibility meaning that healthcare and life science customers can include Data Brew in their workloads
- control over the data quality statistics generated when running a profile job, such as outliers, duplicates or correlations
Another Glue visual component updated in the last few months is Glue Studio:
- support for defining streaming ETL job settings in the editor such as window size, schema detection for each record vs. schema from Data Catalog
- an added code editor for customizing the job scripts; you no longer need to download and modify the scripts, you can customize them directly from the visual editor
- data preview for each step of the visual job definition process; Studio automatically samples the transformed data and applies the transformation on this sample so that you can test and debug it more easily, without the need to deploy the job
And since we're talking about jobs:
- custom connectors are now bidirectional, i.e., you can use them as data sources and data sinks
- integration with Amazon EventBridge which gives the possibility to trigger Glue workflows from an event generated by EventBridge, like S3 object generation
Also, the streaming component got some changes:
- support for SSL client certificate authentication with Apache Kafka
- support for JSON Schema data format in addition to the already supported Apache Avro; if you use Schema Registry to validate the schemas, you can now manage your records with JSON Schema vocabulary
Kendra
Two search-related changes in Kendra:
- Dynamic Relevance Tuning adapted to the person executing the query - an example from the release note explains the purpose pretty well. An engineer in an IT department may prefer to see results boosted from internal Wiki-like documentation systems, whereas an HR employee might prefer results coming from SharePoint documents
- Query Suggestion - the feature provides an auto-completion to the search terms typed by the users. Kendra administrators can configure these suggestions to shorten the time needed by their users to find the searched terms.
Keyspaces
I didn't present this service in the previous releases, so let me do it right now. In a nutshell, Keyspaces is an Apache Cassandra-compatible, highly and easily scalable database service. Among the most recent features:
- easier quota management from the Service Quotas console
- new CloudWatch metrics that should help detect unbalanced workload traffic across the partitions or spot a need to increase the number of client connections for a better throughput
- the service is HIPAA eligible for running healthcare workloads; in addition, you can run workloads requiring SOC (System and Organization Controls) 1, 2, or 3 reports
Besides, Keyspaces also got some security improvements:
- customer-managed master keys (CMKs) support for data at rest encryption - so far, the service only used the AWS-managed keys but starting from June, you can also create and manage your own keys stored in AWS KMS
- optimized throughput for the private connections made through AWS PrivateLink
Kinesis Data Analytics
This streaming analytics service got a new component in the last few months:
- Kinesis Data Analytics Studio to interactively query data streams in real time and in SQL, Python or Scala - the component is based on 2 Open Source technologies, Apache Zeppelin for the notebook and Apache Flink for the processing engine.
In addition to this new feature, the service also got some other evolutions:
- the RollbackApplication API is available in preview to restore the application to its previous stable state - it can be useful when the application becomes unresponsive and you need to maintain its real-time character. Additionally, 2 other actions were added at the same time: ListApplicationVersions to get all application versions with the associated configurations, and DescribeApplicationVersion to get a more complete view of the configuration for a given version.
- another new action to configure the maintenance window, so the period when AWS updates the underlying infrastructure of your Kinesis Data Analytics application
KMS
The service is not a pure data service but it's often the encryption component for many of the data services. In June, KMS got an interesting feature from this encryption standpoint:
- multi-region keys, so the possibility to replicate keys from one AWS region into another; now, if you move the encrypted data between regions, you won't need to decrypt and re-encrypt it with different keys in each region
Lake Formation
The data lake service had an interesting security improvement:
- Tag-Based Access Control (TBAC) - an access control model based on tags: you tag data lake resources such as databases, tables, or columns, and use these tags to create logical access control policies. These policies are also called tag grants because their association to the users passes through a GRANT TAGS tag1 TO Principal 1 operation.
Lambda
Two streaming/messaging related changes in AWS Lambda:
- support for SASL/PLAIN authentication for functions triggered by a self-managed Apache Kafka deployment - when defining the event source, you can choose the SASL/PLAIN authentication method (a simple username/password pair) and, for example, reference the credentials in Secrets Manager
- support for Amazon MQ for RabbitMQ as an event source
Managed Workflows for Airflow
I promised not to include version upgrades in these update blog posts, but this news deserves to be shared!
- the managed Apache Airflow service supports Apache Airflow 2, so the most recent version of the data orchestrator with improved scalability, security and visual user experience
MemoryDB
August brought not only new features for the already existing services but also a completely new service called MemoryDB for Redis. It's a Redis-compatible, durable, and in-memory database service, advertised as a data store for modern applications with microservices architectures.
Its API is compatible with Redis, so it should be relatively easy to give it a try with existing applications. The service also relies on multiple Availability Zones for data durability and a Multi-AZ transaction log for fast failover and recovery.
MSK
Two security features for this managed Apache Kafka service:
- support of IAM access control - thanks to the IAM service, you won't need to manage cluster credentials. You can now rely on the IAM roles or policies to control access to the service.
- support of Secrets Manager-backed login/password credentials for AWS Lambda functions triggered by MSK
MQ
This AWS messaging broker implements a new feature for RabbitMQ:
- consistent hash exchange type to route all messages having the same key to the same queue, and so maintain the order of dependent messages
Neptune
When it comes to AWS' graph database, it got the following updates:
- openCypher support - you can now use this query language to work with Amazon Neptune graphs
- SPARQL 1.1 Graph Store HTTP Protocol (GSP) support - it provides convenient endpoints to manipulate an entire named graph in one HTTP request instead of multiple SPARQL 1.1 queries
- General Availability of Neptune ML with support for edge predictions and model selection automation
- resource tags applied to database snapshots - they can be used with IAM policies to manage access to Neptune resources and the scope of allowed actions. The tags will be automatically copied to manual or automated database snapshots.
Redshift
Let's start Redshift updates with the querying part:
- the data warehouse service supports recursive CTEs - a convenient way to query hierarchical data structures like graphs or trees (see the sketch after this list); I blogged about hierarchical queries 3 years ago (I have to stop writing that, I'm feeling old every time :P)
- semi-structured and JSON data supported with SUPER data type
- spatial query performance enhancements, 3D/4D geometries and new spatial functions; all this to improve your work with spatial data; some queries can see up to a 100x improvement in their execution time, without any extra action on your side
- multiple statements within one transaction support in Redshift Data API
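A minimal recursive CTE sketch for the hierarchical query use case mentioned above; the employees table and its columns are assumptions made for illustration:

```sql
-- Walk an employee hierarchy from the top-level manager down.
WITH RECURSIVE reporting_line (employee_id, full_name, manager_id, level) AS (
    SELECT employee_id, full_name, manager_id, 1
    FROM employees
    WHERE manager_id IS NULL      -- anchor member: the top of the hierarchy
    UNION ALL
    SELECT e.employee_id, e.full_name, e.manager_id, r.level + 1
    FROM employees e
    JOIN reporting_line r ON e.manager_id = r.employee_id   -- recursive member
)
SELECT * FROM reporting_line ORDER BY level, full_name;
```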
And what about security updates?
- simplified authentication profile support in JDBC/ODBC connectors - to create an authentication profile, you specify a JSON document with the configuration options of each user's profile. The profile name is then passed as the AuthProfile option in the JDBC connection string.
Among other features, you'll find:
- cross-account data sharing Generally Available - thanks to this feature, you can share your live data across different Redshift clusters without the necessity to create any data synchronization pipeline. The feature passed from Preview to GA state in only 4 months!
- another GA feature is Amazon Redshift ML; you should be able to create, train and deploy ML models using SQL commands
- you can now set the default collation for all CHAR and VARCHAR columns in the database as case-sensitive or case-insensitive using the COLLATE clause; the clause also works at the finer-grained level of a table, and the COLLATE() function can override the collation of a string or expression
- automatic management of column compression - initially, you had to load a data sample and ask Redshift to resolve the most optimized compression strategy for each column. However, with each added record, the initially chosen strategy may no longer be appropriate. If you enable automatic compression management by calling ALTER TABLE ${table name} ENCODE AUTO (sketched below), the data warehouse will continuously monitor column compression and update it on an ongoing basis
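The collation and compression features boil down to short statements; a sketch with hypothetical database and table names:

```sql
-- Case-insensitive collation as the database default; it can be overridden
-- at the table/column level or with the COLLATE() function.
CREATE DATABASE analytics COLLATE CASE_INSENSITIVE;

-- Let Redshift continuously pick and adjust the compression encoding of the
-- events table's columns.
ALTER TABLE events ENCODE AUTO;
```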
RDS
To present the RDS updates, I will cover one database type in each list. To start, Oracle with 2 updates:
- an RDS instance can create a single pluggable database instance, enabling it to operate as a multitenant container database; one container database can hold multiple pluggable databases to facilitate agility, shared resource management and enhanced security
- support for Oracle Time Zone File auto upgrade; the users can now enable the upgrade with TIMEZONE_FILE_AUTOUPGRADE option. Previously, they had to create a new instance and manually migrate the data
One new feature is shared by PostgreSQL and Oracle:
- support for the Encrypted Cross-Region Automated Backups feature; it enables automatic replication of system snapshots and transaction logs from a primary to a secondary AWS Region. The files are encrypted with an AWS KMS customer master key in the destination region.
And finally, SQL Server:
- Cross-Region Automated Backup, pretty similar to the Encrypted version mentioned above. The difference is the lack of encryption in the replicated snapshot and transaction log files.
- two new parameters for full-text search: max full-text crawl range (4 by default; you can increase it if the instance has a lot of CPUs to get better search performance) and transform noise words (to remove noise words from the query)
- DescribePendingMaintenanceActions API exposes the information about the scheduled upgrade of a minor SQL Server version; you can also upgrade it immediately from the console or by calling the ApplyPendingMaintenanceAction API
S3
Let's start with 2 security updates for S3:
- use S3 Bucket Keys while creating encrypted copies of the existing objects - relying on the S3 Bucket Keys reduces the number of requests sent to AWS KMS for the encryption keys. It can optimize the cost while copying millions of existing objects.
- use S3 Access Points instead of bucket names anywhere you access S3 data, such as EMR, Storage Gateway, or Athena - each access point has a dedicated access policy, simplifying the access management for shared datasets. Previously, a single bucket policy had to manage these multiple access scenarios.
Besides, a few things also changed for Amazon S3 on Outposts:
- the service offers 2 larger storage tiers of 240 TB and 380 TB
- access your S3 objects directly from the on-premise network through the Outposts Local Gateway
- support for sharing S3 capacity across multiple accounts within an organization using AWS Resource Access Manager
Snow family
Four interesting improvements for this data migration service:
- console-based request for a Snow job with long-term pricing - the new method should be much easier than the previous one, which required reaching out to the AWS sales team
- instances on a Snowball device can have a direct access to an external network - this feature increases the flexibility over the network configuration and enables some new use cases like connections requiring multiple physical network interfaces
- AWS Snowcone users can remotely monitor and operate their devices, for example from the AWS CLI - they can then easily manage even geographically distributed Snowcone devices, including viewing the state of each device (online/unlocked) and monitoring its metrics (storage, compute capacity). The users can also operate their devices, for example by unlocking or rebooting them
- support of fast Network File System data transfer operations on AWS Snowball Edge Storage Optimized devices - users can transfer up to 80 TB of data through both file and object interfaces. Prior to this improvement, the data transfer rate for the file interface was at most 40 MB/s, up to 10 times less than for objects. Now, both interfaces deliver similar performance without any extra adaptation.
SNS
SNS has an interesting filtering feature:
- added 3 new filtering operators: exists to match messages based on the presence or absence of an attribute, anything-but to match everything except the filter value, and cidr for IP address matching.
SQS
One of the preview features became Generally Available in May:
- SQS high throughput mode is Generally Available, after 6 months spent as a preview feature - with it, you can process up to 3,000 messages per second per API action. It's 10x the previously available throughput!
QuickSight
If you're a data engineer, you won't necessarily use QuickSight on a daily basis. However, it has an interesting feature for operationalizing your pipelines:
- threshold-based alerts - they work very similarly to threshold-based alerts in CloudWatch. You define a threshold and whenever the dashboard values reach it, the service sends an alert. You've certainly used this for metrics related to your pipelines like memory or CPU usage, or task failures. With QuickSight, you can offer the same solution to business users, basing the alerts on the data you're processing
Azure
Cosmos DB
Let's start the Azure part with CosmosDB security improvements:
- role-based access (RBAC) is Generally Available for Core API - you can now define allowed actions inside a role definition and assign it to the Azure Active Directory identities
- client-side encryption with Always Encrypted in public preview - if you can't reveal some sensitive information to the Cosmos DB service, you can keep it encrypted in the client application with the help of Key Vault-managed encryption keys.
In addition to the client-side encryption, you will also find the following client features:
- partial updates in preview - if you want to update only specific fields in a single document, you can do it using partial updates. Thanks to this feature, you won't need to perform a read-replace operation of the full document.
- improved debugging by viewing the query in full text within diagnostic logs
- ASP.NET can use Cosmos DB as remote session state or caching provider; to do so, use AddCosmosCache or AddSession methods.
Regarding more service-oriented features:
- General Availability for continuous backup with point in time restore - allows granular restore to any point in time in the past 30 days.
- serverless mode is Generally Available - real pay-as-you-go pricing model where Azure charges you only for the Cosmos DB resources being used. It's available on all Cosmos DB APIs.
Data Explorer
One Data Explorer network feature became Generally Available:
- subnet delegation - no more need to configure Network Security Groups rules manually in your VNet. This feature was previously in preview but became Generally Available in June.
Data Lake
Some important announcements regarding Data Lake Gen1:
- removed possibility to create a Gen1 account - Gen1 will be retired on February 29, 2024 and it's no longer possible to create new accounts.
- migration guide - to facilitate the switch to Gen2, Azure prepared a 6-step migration guide that can be executed directly from the portal.
Among other features, you will find some interesting data capabilities:
- append blobs support - they're ideal for any append-only use cases like logging. Before that, only block blobs were supported in Data Lake. Starting from May, you can also use append blobs.
- soft delete - soft delete is a protection against accidental deletes. It retains deleted files for a configured period of time. So far available only for Blob Storage, it's now available in a limited public preview for Data Lake Storage.
- static websites hosting - you can now use Data Lake Storage to host static website data and view its content by using the public URL of the website.
Database
Let's start with 2 interesting Azure SQL Database features:
- Change Data Capture available in Preview for Azure SQL Database; the feature was previously available in Azure SQL Managed Instance
- Query Store hints are in Preview - the goal of this feature is to provide a method for shaping query plans without changing application code. So instead of writing the hint as a part of the query - something which is not always possible to do - you can annotate the query with a hint if it was previously captured in the Query Store.
- backup storage redundancy configuration: geo-redundancy, zone-redundancy or local-redundancy
- Always Encrypted with secure enclaves - a secure enclave is a protected region in the database memory where Always Encrypted data can safely access cryptographic keys and sensitive data in plaintext. It's not accessible from the outside, even by a debugger, so the sensitive data never leaves the database. It can be a convenient way to apply operations other than equality (e.g. pattern matching) on Always Encrypted data because, in a normal workflow, these operations must be executed on the client side.
Besides, a lot of things also happened for PostgreSQL - Hyperscale (Citus), with plenty of General Availability announcements:
- server group restart is GA - thanks to this feature, you can restart all nodes in the Hyperscale (Citus) server group at once
- scheduled maintenance is also GA - the feature enables a system-managed or custom schedule for the maintenance window, with a preferred day and a 30-minute time period
- columnar compression is GA - to use columnar storage, you specify USING columnar when creating the table (see the sketch after this list). The engine will then compress the stored columns.
- asynchronous replication of data from one server group to another server group in the same region is GA; you can use this feature to optimize read-heavy workloads
- GA of PgBouncer, a connection pooling tool to use when your application requires more connections than the limit allowed by Hyperscale (Citus) coordinator node
- a single-node tier called Basic is available; it can be a starting point before adding more worker nodes to the server group
- more performant PostgreSQL 12 and 13 are Generally Available
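The columnar storage option mentioned above is a single clause in the table definition; the table below is an illustrative assumption:

```sql
-- Columnar storage on Hyperscale (Citus): data is stored column by column
-- and compressed by the engine.
CREATE TABLE page_views (
    view_time  timestamptz NOT NULL,
    page_id    bigint      NOT NULL,
    user_id    bigint
) USING columnar;
```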
Presenting Citus is a good occasion to introduce Azure Database for PostgreSQL features:
- planned maintenance notifications are GA - the service should notify you about a planned maintenance event 72 hours in advance.
- Flexible Server offers reserved instance pricing in Preview - by reserving your instances, you can get a better price than paying for them in the pay-as-you-go mode.
- forced failover for Flexible Server is in Preview - the feature provides a manually started failover which can be good to simulate and test your database in the context of an unplanned outage scenario. You can also use it to force the failover if the primary server becomes unavailable for any reason.
- extra metrics with CPU Credits Consumed and Remaining are available for the B-Series (Burstable) compute tier; a B-Series VM is a Virtual Machine that can save unused CPU cycles for later usage. It can then be a good way to handle load peaks if the database didn't consume the allocated CPU before.
- Azure Defender for Single Server is Generally Available; Azure Defender can detect anomalous database access and query patterns, including suspicious database activities, and notify you about them by email
Finally, let's see what's new in Azure Database for MySQL:
- as for PostgreSQL, the planned maintenance notifications are Generally Available
- still as for PostgreSQL, Burstable credit metrics are also in Preview
- in Flexible Server you can now choose the standby server zone location - for example, you may choose the standby server to be placed in the same zone to reduce the replication lag or put it in a different one to improve the redundancy. The feature is in public preview
Databricks
If you're looking for decreasing your costs, the following update can interest you very much:
- Spot VMs are Generally Available - Spot VMs can bring up to 90% discount compared to the public VM prices, and you can use them not only in the cluster configuration but also in cluster pools for more reactive and less expensive auto scaling
Functions
A few changes in the Durable Functions framework:
- PowerShell support - PowerShell is a new language supported by Durable Functions workflows.
- new storage providers in public preview - the framework uses Azure Storage as the default storage provider. The first new one is Netherite, an Event Hubs and Page Blobs-powered storage provider with a higher throughput than the other providers. The second one is SQL Server, added to take advantage of its data management operations (backup, failover, restore, ...).
Purview
Two important integrations for the data cataloguing solution:
- Hive Metastore Database added as a source - Purview can now fully scan the Metastore and fetch the lineage between the datasets. The connector works on Metastores running on Apache Hadoop, Cloudera, Databricks and Hortonworks.
- Erwin, Google BigQuery and Looker supported - the 3 are new Purview data sources supporting full scans and lineage fetching.
Security features:
- private endpoints - users on a private network can securely access Purview catalog over a Private Link. Also, the service can scan the data sources located in private networks or VNets to ensure an end-to-end network isolation.
Synapse
One interesting integration with an ACID-compatible file format:
- Delta Lake format support in public preview - you can now query Delta Lake files from Serverless SQL Pool using T-SQL.
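A hedged T-SQL sketch of such a query from a Serverless SQL Pool; the storage account and the Delta Lake folder path are assumptions made for illustration:

```sql
-- Query a Delta Lake folder directly from a Synapse Serverless SQL pool.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/lake/sales/',
    FORMAT = 'DELTA'
) AS sales;
```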
Service Bus
Service Bus got a quota limit increase:
- messages can be up to 100MB - the former 1MB limit is gone, so you can now work with much larger messages!
Storage
Quite a few things happened recently on Azure Storage. To start, a completely new component is available:
- Storage Blob inventory is Generally Available - thanks to this feature you can get an overview of your containers, blobs, snapshots and blob versions. Every day, the inventory component will generate a report corresponding to the defined rules. In the report you will find things like expiration time, access tier, permissions or creation time for every catalogued item.
Besides, the blobs have the following new features:
- last access time tracking is now generally available - the information integrates with lifecycle management policies or can be used aside to know the last access date of the object
- another GA feature is index tags - the tags are useful in filtering scenarios where, for example, you may want to find all objects having a given tag.
- still for GA features, container soft delete is now available - thanks to it, you can recover from any accidental container deletion. The retention period can be configured between 1 and 365 days, impacting the final storage price because the retained data is billed at the same rate as active data.
- unlike the previous features, this one is available only in Preview - you can apply an immutable policy on all past and current versions of any blob. The feature extends the already available Write Once Read Many (WORM) storage pattern.
When it comes to the security announcements:
- Attribute-Based Access Control (ABAC) in Preview - the access levels are based on attributes associated with the security principals, resources, requests, and the environment. The feature provides more fine-grained access control, for example, by giving read access only to the blobs with a specific tag.
- the option disabling Azure Storage access via Shared Key authorization is now Generally Available; thanks to it, you can enforce access through Azure Active Directory credentials only
- key rotation and expiration policies for Azure Storage - with the feature, you can set the Storage Account key expiration period and be able to monitor the keys that are about to expire
Finally, among other features, you will find:
- General Availability of the operational backup for Azure Blobs - even though the solution integrates with Backup Center, it's a local backup because the data stays within the source Storage Account. It enables the point-in-time restore recovery in case of different data loss scenarios such as blob corruptions or blob deletions.
- Azure Event Grid supports 2 new events related to blob rehydration - if you store some offline data in Azure Archive Storage, you can take advantage of its low storage costs. However, to make this data available, you have to rehydrate it, i.e. move it to a hot or cool tier. Once this operation completes, Azure Event Grid raises one of the 2 new events: Microsoft.Storage.BlobCreated when the archived blob was copied to an online tier, or Microsoft.Storage.BlobTierChanged when the archived blob's tier changed without copying it
- Azure Blob Storage natively supports Network File System 3.0 protocol; the feature is now Generally Available
- Azure Storage can now be mounted as a local share for web apps deployed on App Service for Linux; the feature is now Generally Available
GCP
Cloud Composer
Exactly as for AWS Managed Workflows for Apache Airflow, let's start with big news:
- Cloud Composer supports Airflow 2!
The service should also have better error management thanks to:
- links to the corresponding cluster build logs in error messages about PyPI package conflicts
- added troubleshooting information to the web server deployment failure error messages
- redacted database passwords in error messages from Composer Agent logs
- correctly reported error messages about dependency conflicts encountered while installing Python packages
- the logs for long-running tasks should be periodically updated in the Airflow UI; before that change, the service displayed the logs only for the completed tasks
Besides, there are also 3 new features:
- increased timeout for environment upgrade operations
- user labels assigned to Cloud Composer environments now appear in billing reports
- Apache Airflow smtp_password configuration can be stored in Secret Manager
BigQuery
It's time to see what's new on BigQuery. Let's start with general service features:
- General Availability of Cloud Spanner federated queries
- General Availability of row-level security, so the ability to control access to specific rows in a table based on qualifying user conditions
- table snapshots are in Preview; a snapshot is a copy of the table at a particular time. They're read-only, can be queried as regular tables and, more importantly, can also be used as a restore mechanism
- table functions are in Preview; a table function is a User Defined Function (UDF) returning a table
- support for materialized views without aggregations and materialized views with inner join is in Preview
- support for multi-statement transactions is in Preview; you can now atomically execute multiple DML operations, for example delete and insert some data within the same atomic transaction
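A minimal sketch of a multi-statement transaction, with hypothetical dataset and table names:

```sql
-- Atomically replace one day of data: either both DML statements succeed or
-- neither is applied.
BEGIN TRANSACTION;

DELETE FROM mydataset.daily_sales WHERE sale_date = '2021-08-01';

INSERT INTO mydataset.daily_sales (sale_date, product_id, amount)
SELECT sale_date, product_id, amount
FROM mydataset.staging_sales
WHERE sale_date = '2021-08-01';

COMMIT TRANSACTION;
```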
Regarding querying features:
- CONTAINS_SUBSTR function to check the existence of a string in another string, is now Generally Available
- PIVOT and UNPIVOT functions are Generally Available to rotate rows into columns and the columns back to rows (see the sketch below)
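A quick PIVOT sketch, assuming a hypothetical sales table with one row per product and quarter:

```sql
-- Rotate quarterly rows into one column per quarter.
SELECT *
FROM (
  SELECT product, quarter, amount
  FROM mydataset.sales
)
PIVOT (SUM(amount) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4'));
```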
In addition to the new functions, BigQuery also got new supported types:
- INTERVAL type to represent a duration or an amount of time; the feature is in preview
- parameterized types: STRING(L), BYTES(L), NUMERIC(P) / NUMERIC(P, S), BIGNUMERIC(P) / BIGNUMERIC(P, S); parameterized types are a feature in preview. You can use them to enforce type constraints. For example, if you declare a variable as STRING(10) and you try to assign a string longer than 10 characters, you will get an OUT_OF_RANGE error (see the sketch below)
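A short sketch of parameterized types in a table definition; the table and its columns are illustrative assumptions:

```sql
-- Enforce length and precision constraints directly in the schema.
CREATE TABLE mydataset.customers (
  customer_id STRING(36),     -- assigning a longer value raises OUT_OF_RANGE
  country     STRING(2),
  balance     NUMERIC(10, 2)  -- 10 digits of precision, 2 of scale
);
```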
Besides the aforementioned features, BigQuery also got some new Data Control Language (DCL) and Data Definition Language (DDL) statements:
- CREATE CAPACITY and DROP CAPACITY to manage capacity slots purchase
- CREATE RESERVATION and DROP RESERVATION to manage reservations, so the assignment of BigQuery slots capacity to a kind of reservation pools
- CREATE ASSIGNMENT and DROP ASSIGNMENT to manage GCP projects assignment to the reservations
- ALTER TABLE RENAME TO statement to rename a table is Generally Available
- ALTER COLUMN SET DATA TYPE statement changing the data type of a column to a less restrictive one is Generally Available
- ALTER COLUMN SET OPTIONS statement to set column options, such as the description, is Generally Available
- CREATE TABLE LIKE statement to create a new table with the same columns but possibly different partitioning and clustering than the source table
- CREATE TABLE COPY statement to create a new table with the same columns, partitioning and clustering as the source table (both statements are sketched after this list)
- the view_column_name_list from CREATE VIEW ${view name} (${view_column_name_list}) is now Generally Available; you can use it to create the view with the list of column names
- ALTER COLUMN DROP NOT NULL statement to remove the NOT NULL constraints from a column is Generally Available
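The two table creation statements sketched below, with hypothetical dataset and table names:

```sql
-- New empty table with the same column definitions as the source table;
-- partitioning and clustering can be redefined if needed.
CREATE TABLE mydataset.events_2021
LIKE mydataset.events;

-- New table with the same columns, partitioning, clustering and data as the
-- source table.
CREATE TABLE mydataset.events_backup
COPY mydataset.events;
```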
And finally, 2 new features for the BigQuery Geographic Information Systems module:
- support for loading geography data from newline-delimited GeoJSON files
- ST_STARTPOINT, ST_POINTN and ST_ENDPOINT supported to return a point of a linestring geography as a point geography
BigQuery Transfer Service
It was relatively calm in the previous update but since then, the BigQuery Transfer Service got some new features:
- Google Merchant Center (GMC) support - GMC is a service to manage in-store and online product inventories and their appearance on Google. You can now integrate it with BigQuery Transfer Service
- General Availability for Audit logging, Cloud Logging and Cloud Monitoring
Cloud SQL
Let's see first common changes for all Cloud SQL databases:
- increased storage limit, now up to 64 TB
- a faster maintenance with the connectivity dropping for less than 30 seconds to 2 minutes on average, depending on the database type
- in preview, you get recommendations to help reduce the risk of downtime that might be caused by instances running out of disk space
- support for IAM Conditions to define and enforce attribute-based access control for Cloud SQL instances; it supports attributes like date/time (e.g. to configure temporary access) or resource attributes like their names
When it comes to the particular databases, let's start with PostgreSQL:
- Query Insights is now available for read replicas; thanks to this you can monitor performance at an application level and trace the source of problematic queries
- Java and Python connectors support IAM authentication
- support for the pg_similarity extension providing text similarity detection in the queries
- 3 new flags are supported: tcp_keepalives_count, tcp_keepalives_idle and tcp_keepalives_interval
- in preview you can set up logical replication and decoding to enable logical replication workflows and Change Data Capture workflows; technically speaking, thanks to the logical replication, the replicated PostgreSQL databases don't need to be of the same version
And what about MySQL?
- support for new flags: expire_logs_days (MySQL 5.6 and 5.7) and binlog_expire_logs_seconds (MySQL 8.0) to define the binary log expiration period; and innodb_flush_log_at_trx_commit to define the interval of flushing transaction logs
- General Availability of the IAM database authentication
- support for stored procedures like mysql.addSecondaryIdxOnReplica to add a secondary index on the database and mysql.dropSecondaryIdxOnReplica to drop it.
Finally, some Cloud SQL for SQL Server news:
- General Availability of the integration with Managed Service for Microsoft Active Directory; thanks to this feature you can log in to SQL Server instances using Windows Authentication
- preview feature for the replication, including cross-region replicas
Cloud Storage
Some changes from the upload category first:
- support for the compose objects API with objects encrypted with Cloud KMS keys
- faster uploading and downloading with gcloud alpha storage commands
- XML API multipart upload with a POST request - you can upload an object with multiple HTTP calls. The first request is a POST and returns an upload id. The subsequent ones are PUTs using this upload id to transfer the data chunks associated with the operation. To complete the upload, a final POST request is sent.
Among other features, you will find:
- Public access prevention in preview - you can use it when you know that the stored data should never be exposed publicly to the internet.
- bandwidth quota of 200 Gbps/project/region for egress to other Google services
Data Fusion
Two replication changes for this code-free ETL solution:
- datetime data type supported in BigQuery tables
- Oracle by Datastream plugin supported to continuously replicate data from operational stores in Oracle into BigQuery
Among other features, you will find:
- better bootstrap of the order-to-cash process with the SAP accelerator - it provides sample pipelines to build an end-to-end order-to-cash process and analytics, performing tasks like: SAP data source connection, data transformations in Data Fusion, data storage in BigQuery, and finally analytics set up in Looker.
Dataflow
Three new Generally Available features are:
- custom Docker containers - pass the custom Dataflow runtime containers with the --sdk_container_image pipeline option.
- GPU support for Dataflow jobs
- Dataflow snapshots - a useful backup/recovery feature. It lets you save the state of your streaming pipeline and recover it later. An important point to know is that any restored snapshot creates a job using Streaming Engine feature.
Besides, the service also has some pipeline-related updates:
- Dataflow Shuffle is the default mode for batch jobs - meaning that the shuffle operations are moved out of the worker into the Dataflow service backend, by default.
- Dataflow SQL supports User Defined Functions written in Java - the feature covers scalar and aggregate UDFs.
- you can turn on the hotKeyLoggingEnabled debugging option to log the hot key in a human-readable format
Dataproc
The major part of Dataproc evolutions concerns the clusters:
- Compute Engine Confidential VMs support - based on the AMD Rome processor family, this kind of VM is secure and performant at the same time. It offers high performance while keeping all memory encrypted with a dedicated per-VM instance key generated by hardware. In addition to encryption at rest and in motion, it's the 3rd encryption pillar, applying to data in use
- clusters having the same project ID, region and name will have the same URLs unless you enable Dataproc Personal Cluster Authentication
- to customize the Conda environment you can use Conda-related cluster properties: dataproc:conda.env.config.uri (activates/creates a new Conda environment on the cluster, points to a Conda environment YAML file on GCS) and dataproc:conda.packages (adds Conda packages on the cluster)
- a single job ThreadPool is shared by all jobs - the number of threads in the pool is configured with agent.process.threads.job.min and agent.process.threads.job.max
- an extra ERROR_DUE_TO_UPDATE cluster state; it indicates an irrecoverable error while scaling the cluster. Such clusters can't be scaled but can accept jobs
Datastore
Two evolutions for Datastore:
- access recent import and export operations from the GCP Console
- custom IAM roles support
Datastream
Not only AWS had new services in the last few months. Datastream is a new GCP service. It's a serverless Change Data Capture and replication service that enables streaming low-latency data from Oracle and MySQL databases. It also integrates with other GCP services like BigQuery, Cloud Spanner, Dataflow and Data Fusion.
Functions
A new interesting feature is available for Cloud Functions:
- private worker pools that you can use to limit Cloud Function's connectivity to the perimeter delimited by the VPC Service Controls. By default, a function has unlimited internet access during the build process.
IAM
Even though IAM is not a pure data product, some of its new features may be interesting for a data engineer:
- General Availability for attaching a service account to resources in other GCP projects
- lateral movement insights generation - we talk about a lateral movement when a service account from a project A can impersonate a service account in a project B. It can then increase insider risk because the attacker can move laterally through projects. The insight generation feature is currently in preview.
- Activity Analyzer shows the last uses of service accounts and keys to call GCP APIs. For example, it can help to identify the SA or keys that are no longer used. The feature is currently in preview.
Pub/Sub
Three GA announcements:
- Python and Go client libraries for Pub/Sub Lite are GA
- message schemas are also GA
Spanner
Spanner is another heavily updated GCP service. To start, the types-related changes:
- you can use the NUMERIC data type column as a primary key, a foreign key or a secondary index
When it comes to the querying and data storage part:
- GCP released version 3 of the Query Optimizer, with new join algorithms, distributed merge union, and performance improvements for some queries using the GROUP BY, LIMIT, and JOIN operations
- the Time To Live feature is available in public preview - each table can have its own row deletion policy. The feature consists of defining a table column of TIMESTAMP type and a retention period to apply (see the sketch after this list). A background process scans the rows on a daily basis and removes the expired ones automatically.
- views are supported in all Cloud Spanner databases
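A hedged sketch of such a row deletion policy; the Sessions table and its CreatedAt column are assumptions made for illustration:

```sql
-- Automatically delete rows whose CreatedAt timestamp is older than 30 days.
ALTER TABLE Sessions
  ADD ROW DELETION POLICY (OLDER_THAN(CreatedAt, INTERVAL 30 DAY));
```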
And below you can find the news from the operationalization category:
- backup operations don't affect the performance of the database anymore - Cloud Spanner uses dedicated backup jobs not relying on the instance server resources
- zero downtime instance configuration switch - for example, you can change an instance from a regional to a multi-regional configuration without compromising the strong consistency guarantee. The feature is currently in preview.
- fine-grained instance sizing - previously, the most granular unit for provisioning compute capacity was a node. Now, you can also increment the capacity in batches of 100 Processing Units
- availability of Key Visualizer to analyze usage patterns in Spanner databases
- GCP console improvements such as finding common queries and pre-populated DML query template instead of Insert a row and Edit a row data forms
- support for Cloud External Key Manager to manage customer-managed encryption keys from external places (i.e. other than Cloud KMS)
Storage Transfer Service
This data transfer service got some multi-cloud integration features:
- support for Azure Data Lake Storage Gen 2 source - in preview, you can move your data from ADLS Gen 2 to GCS
- AWS Security Token Service integration - thanks to it, you don't need to pass any S3 credentials to copy the data from S3 and can use AWS STS to request temporary credentials for AWS IAM.
When it comes to the on-premise projects features:
- General Availability for delete-objects-from-source feature
- public preview for RESTful API to automate on-premise to GCS transfer workflows
Please let me know what you think about this new presentation format! If you have any improvement ideas for this type of blog post, I will be happy to hear them. Otherwise, a lot of good things happened in the last few months on the cloud. To sum up, AWS and GCP came with new services, MemoryDB and Datastream. AWS continued to innovate on Glue, GCP put a significant effort into BigQuery and Spanner, whereas Azure introduced some important improvements to the Storage and Data Lake Gen2 services. Of course, it's a very simplified vision, but since every article has to end somehow, I think it gives a good picture of the most recent cloud evolutions for data engineers!