What's new on the cloud for data engineers - part 7 (05-08.2022) on waitingforcode.com - articles about Data engineering on the cloud

Four months is a huge period of time in cloud history, even when 2 of the 4 are the usual "holiday" months. As you can guess from the title, it's time to see what changed recently on the cloud from a data engineering perspective!

AWS

Athena

Four updates for this AWS ad-hoc querying service:

Support for views stored in a self-managed Apache Hive metastore. Starting from May, Athena can handle the HiveQL view definition language correctly and can therefore access Hive views stored in self-managed metastores.
Athena connector for Amazon Lookout for Metrics. You can use Athena as the data source for this anomaly detection service without the need of setting up an ETL pipeline specifically preparing the data for the service.
Parameterized queries support in the console. Starting from July 11, it's possible to run parameterized queries directly from the Amazon Athena console. Parameterized queries are a convenient way to use the same query multiple times with different parameters each time, without the need to prepare the query before each execution.
Visual query analysis and tuning tools. These new components should help you understand how your query will run by providing an interactive view of the query plan. Additionally, the query execution provides new query-level metrics with the information about the time spent in queuing, planning, and execution stages as well as the rows and size of data processed and output.
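
To make the parameterized queries more concrete, here is a minimal sketch of the arguments that could be sent to Athena's StartQueryExecution API from Python. The table, workgroup and S3 location are hypothetical, and the way string values are quoted is an assumption to verify against the Athena documentation. The boto3 call itself stays in a comment because it needs AWS credentials.

```python
def build_athena_request(query, parameters, workgroup, output_location):
    """Build keyword arguments for Athena's StartQueryExecution call.

    The `?` placeholders in `query` are bound, in order, to `parameters`.
    Counting `?` characters is a naive check: it would miscount a question
    mark that appears inside a string literal.
    """
    placeholders = query.count("?")
    if placeholders != len(parameters):
        raise ValueError(f"expected {placeholders} parameters, got {len(parameters)}")
    return {
        "QueryString": query,
        # Values are sent as strings; string literals are assumed to need
        # their own single quotes.
        "ExecutionParameters": [str(p) for p in parameters],
        "WorkGroup": workgroup,
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical table, workgroup and bucket names.
request = build_athena_request(
    "SELECT * FROM visits WHERE country = ? AND visit_year = ?",
    ["'PL'", 2022],
    workgroup="primary",
    output_location="s3://my-athena-results/",
)
print(request["ExecutionParameters"])

# With credentials configured, the request could be submitted with:
# import boto3
# boto3.client("athena").start_query_execution(**request)
```

The same query text can then be reused with a different parameter list on each call.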

Aurora

Large Objects (LO) module supported in Amazon Aurora PostgreSQL-Compatible Edition. The extension adds support for Large Objects.
Zero Downtime Patching is available in Amazon Aurora PostgreSQL-Compatible Edition. The feature makes it possible to upgrade the cluster to a new PostgreSQL version without any downtime.
R6i instances support. Amazon Aurora PostgreSQL-compatible edition can use the R6i instances optimized for memory-intensive workloads.

Backup

AWS Backup Audit Manager supports Amazon S3 and AWS Storage Gateway. You can use this new feature to continuously evaluate the backup activity of these services and generate audit reports.
Cross-region and cross-account copy for Amazon S3 backups. This change adds support for copying S3 backups across AWS Regions and accounts. The copies can be used to improve data protection and strengthen disaster recovery.

Data Exchange

Open Data datasets are available on AWS Data Exchange. It includes more than 100 petabytes of high-value and cloud-optimized public data sets.
10x bigger asset size. AWS increased the asset limit from 10GB to 100GB for the 3rd party data providers. The change should open a wider range of possibilities including the genomics data, high volume financial data, and satellite imagery, which are often bigger than the previous limit.

Data Sync

AWS DataSync support for Google Cloud Storage (GCS) and Azure Files.
Support for copying data to/from FSx for NetApp ONTAP.

Database Migration Service

IBM Db2 z/OS is available as a new source.
Babelfish for Aurora PostgreSQL is available as a new target.
VPC source and target endpoints are available as new sources and targets. Database Migration Service can now freely connect to other AWS services with VPC endpoint.

DocumentDB

Dynamic resizing for storage space. The storage space of Amazon DocumentDB (with MongoDB compatibility) automatically adapts to the real storage needs. It increases when there is no space left for new records, and decreases when you remove existing data.
Database cloning support to quickly create a clone of an existing database that shares the same storage volume.
Query auditing support for database events. In addition to the existing DDL events, DocumentDB will now log extra DML events for actions like insert(), insertMany(), update(), updateMany(), delete(), deleteMany(), bulkWrite(), find(), count(), distinct(), replaceOne(), and aggregates.
Decimal128 data type support. It's a BSON data type with a 128-bit decimal representation supporting 34 decimal digits of precision.
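
To see what 34 decimal digits of precision means in practice, the sketch below emulates a decimal128-like number with Python's standard decimal module. It only illustrates the precision and is not tied to any DocumentDB driver; the exponent limits are the ones I assume for the IEEE 754 decimal128 format.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Context mirroring the assumed decimal128 limits: 34 significant digits.
DECIMAL128 = Context(prec=34, rounding=ROUND_HALF_EVEN, Emin=-6143, Emax=6144)


def to_decimal128(text):
    """Parse a decimal string, rounding it to 34 significant digits."""
    return DECIMAL128.create_decimal(text)


# Decimal arithmetic keeps 0.1 + 0.2 exact, unlike binary floats.
exact_sum = DECIMAL128.add(to_decimal128("0.1"), to_decimal128("0.2"))
float_sum = 0.1 + 0.2

# A 40-digit input is rounded to the 34 digits the type can hold.
forty_digits = "1.234567890123456789012345678901234567890"
rounded = to_decimal128(forty_digits)

print(exact_sum, float_sum, rounded)
```

This is why the type is a good fit for monetary or scientific values where binary floating point rounding is not acceptable.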

DynamoDB

Service Quotas integration. With the feature you can continuously monitor the consumed quotas and set a CloudWatch alarm to proactively request an increase.
Bulk import from Amazon S3! That's huge news for any of you who want to bootstrap a new DynamoDB table. Now, it's possible without writing any extra processing logic, simply by importing CSV, DynamoDB JSON, or Amazon Ion files from S3. It's worth noting that the import doesn't consume the table's write capacity, so there is no upfront capacity planning required.
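
As an illustration of the DynamoDB JSON input format, here is a minimal sketch that turns plain Python dictionaries into newline-delimited DynamoDB JSON. The `{"Item": ...}` wrapper and the one-item-per-line layout reflect my understanding of the import format and should be checked against the AWS documentation; the attribute names are hypothetical.

```python
import json


def to_attribute_value(value):
    """Convert a plain Python value to DynamoDB's typed JSON representation."""
    if value is None:
        return {"NULL": True}
    if isinstance(value, bool):  # checked before int because bool subclasses int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers travel as strings
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, list):
        return {"L": [to_attribute_value(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_attribute_value(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def to_import_line(item):
    """Serialize one item as a single line of DynamoDB JSON."""
    return json.dumps({"Item": {k: to_attribute_value(v) for k, v in item.items()}})


# Hypothetical rows to bootstrap a table with.
users = [
    {"user_id": "u1", "visits": 3, "premium": True},
    {"user_id": "u2", "visits": 0, "premium": False, "tags": ["new"]},
]
lines = [to_import_line(user) for user in users]
print("\n".join(lines))
```

The resulting lines could be written to a file and uploaded to the S3 prefix used by the import.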

EC2

Although it's not a pure data service, EC2 got an interesting auto-scaling update:

Auto Scaling backfills Predictive Scaling. Starting from May it's possible to create a predictive scaling policy based on the past 14 days to validate the accuracy of the scaling forecast. The Predictive Scaling itself is an interesting feature to proactively auto-scale the capacity based on predictions from past actions.

ElastiCache

Data in transit encryption support. The feature increases the security for the exchanged data between the client and the cluster by using Transport Layer Security (TLS) version 1.2.
Native JSON support for Amazon ElastiCache for Redis and Amazon MemoryDB for Redis. Thanks to this new feature you won't need to write a custom serialization/deserialization layer to manipulate JSON data on these services. Instead, you can query the JSON documents directly in the database.

EMR

Amazon EMR Serverless is GA!
Result Fragment Caching with EMR Runtime. According to AWS' measurements, this feature can improve the query performance of Apache Spark workloads by up to 15x! How? By caching, in a dedicated S3 bucket, the results computed from repeatedly processed data that hasn't changed.

EventBridge

Support for GitHub, Stripe, and Twilio. The consumers can subscribe to get the events produced by these services via webhooks.

Glue

Streaming:

Auto-decompression support. AWS Glue Streaming ETL can automatically decompress BZIP, GZIP, SNAPPY, XZ, ZSTD, and DEFLATE compressed Avro, JSON, or CSV data from Amazon Kinesis, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and self-managed Apache Kafka.
Auto Scaling is GA. Streaming jobs can automatically scale up and down depending on the input data volume.
Smaller instance types for streaming. With the feature you can use G.025X, a new quarter DPU worker type for streaming jobs.
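
To show what auto-decompression saves you from writing, here is a minimal sketch that detects a codec from the leading magic bytes and decompresses the payload with the Python standard library. It covers only GZIP, BZIP and XZ and is an illustration of the idea, not the way AWS Glue implements it.

```python
import bz2
import gzip
import lzma

# Leading bytes identifying each codec, paired with its decompressor.
MAGIC_BYTES = [
    (b"\x1f\x8b", gzip.decompress),
    (b"BZh", bz2.decompress),
    (b"\xfd7zXZ\x00", lzma.decompress),
]


def auto_decompress(payload: bytes) -> bytes:
    """Decompress `payload` if a known codec is detected, else return it as is.

    An uncompressed record that happens to start with one of the magic
    byte sequences would be misdetected; a real implementation needs more care.
    """
    for magic, decompress in MAGIC_BYTES:
        if payload.startswith(magic):
            return decompress(payload)
    return payload


record = b'{"event": "click", "user": 42}'
for compressed in (gzip.compress(record), bz2.compress(record), lzma.compress(record), record):
    print(auto_decompress(compressed))
```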

Jobs:

SASL authentication support for Apache Kafka.
The API for authoring and managing AWS Glue Studio visual jobs is GA. You can use the GetJob and CreateJob actions to copy the job definition between Glue environments without losing its visual representation.
Flex execution option to reduce job cost. This option can help you reduce the job costs by up to 34%. It's not a good candidate for all types of jobs, though. It's well adapted to pre-production, test, or non-urgent workloads because the Flex execution mode doesn't guarantee a fast job start.

Notebooks:

AWS Glue Interactive Sessions and Job Notebooks are available in Preview. Both are great candidates for interactive data exploration on AWS Glue.

Lambda

Processing:

Consumer Group IDs for Amazon Managed Streaming for Apache Kafka (MSK) or Self-Managed Kafka as an event source. Previously, whenever a Lambda started to consume an Apache Kafka topic, the service assigned an auto-generated and unique consumer group id. Starting from August, it's possible to define the consumer group id. The feature improves fault tolerance because the consumers will resume where they left off.
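
The benefit of a stable consumer group id can be sketched with a tiny in-memory stand-in for a topic partition. The class and group names are hypothetical, and the sketch assumes a brand-new group starts reading from the first record.

```python
class TinyBroker:
    """In-memory stand-in for one topic partition with per-group offsets."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = {}  # consumer group id -> next offset to read

    def poll(self, group_id, max_records):
        """Return the next batch for the group and commit the new offset."""
        start = self.committed.get(group_id, 0)  # unknown groups start at 0
        batch = self.records[start:start + max_records]
        self.committed[group_id] = start + len(batch)
        return batch


broker = TinyBroker(["e0", "e1", "e2", "e3", "e4"])

first = broker.poll("orders-consumer", 2)
# The consumer is redeployed but keeps the same group id: it resumes.
resumed = broker.poll("orders-consumer", 2)
# A consumer with a freshly generated group id starts from the beginning.
fresh = broker.poll("auto-generated-123", 2)

print(first, resumed, fresh)
```

With an auto-generated id, every new consumer behaves like the `fresh` one, which is why a user-defined id helps with recovery.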

Security:

lambda:SourceFunctionArn is supported as an IAM condition key.
Attribute-Based Access Control (ABAC) support. The feature enables using attributes, such as tags, to provide fine-grained access to Lambda API actions in the IAM policies.
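
A tag-based policy could look like the sketch below, built as a Python dictionary and rendered to JSON. The `team` tag key and the selected actions are assumptions made for the example; check the list of Lambda actions supporting tag conditions before reusing it.

```python
import json

# Allow calls only on functions whose "team" tag matches the caller's "team" tag.
abac_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["lambda:InvokeFunction", "lambda:GetFunction"],
            "Resource": "arn:aws:lambda:*:*:function:*",
            "Condition": {
                "StringEquals": {
                    # Hypothetical tag key shared by principals and functions.
                    "aws:ResourceTag/team": "${aws:PrincipalTag/team}"
                }
            },
        }
    ],
}

policy_document = json.dumps(abac_policy, indent=2)
print(policy_document)
```

The advantage over listing function ARNs is that newly tagged functions are covered without editing the policy.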

Ops/Others:

Tiered pricing for monthly Lambda function duration, i.e. GB-seconds of usage. The new tiered pricing can help save up to 20% of the costs of Lambda function duration, depending on the monthly usage.

Macie

New review and validation method. It's a one-click, temporary retrieval of up to 10 examples of sensitive data found in S3. Before that, you had to get the location of the object with sensitive data and read the file by yourself.

MSK

Amazon MSK Serverless is GA!
MSK Serverless full integration with AWS CloudFormation and HashiCorp Terraform.

Neptune

Fine-grained access control with IAM. This new feature enables creating dedicated permission scopes, such as read-only or read+write, depending on the role of the user.
Global Database availability. This type spans multiple AWS Regions to provide disaster recovery in case of a regional outage, and a better throughput for the regional consumers.

RDS

Oracle:

Support for promotion of managed in-region read replica. You can now promote a managed in-region read replica created with the replica function.
Scale Compute operation support. The feature helps scale your Amazon RDS Custom for Oracle up and down.

SQL Server:

TDE enabled database migrations using Native Backup/Restore for Microsoft SQL Server. Previously, you had to disable TDE on your on-premises TDE enabled SQL Server database in order to migrate to Amazon RDS.

MySQL:

SSL/TLS connections to the database instances support. You can enforce SSL/TLS client connections by enabling the require_secure_transport parameter.

PostgreSQL:

Cascaded read replicas for up to 30x more read capacity. Workloads using PostgreSQL 14 can use up to 155 cascaded read replicas that can improve the read capacity by 30x!
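
The 155 figure can be reproduced with a short calculation, assuming up to 5 replicas per instance and 3 levels of cascading. Both numbers are my assumptions about the limits behind the announcement and are not stated above.

```python
def total_cascaded_replicas(replicas_per_instance, levels):
    """Sum the replicas at every cascading level: r + r^2 + ... + r^levels."""
    return sum(replicas_per_instance ** level for level in range(1, levels + 1))


# Assumed limits: 5 replicas per instance, 3 cascading levels.
total = total_cascaded_replicas(5, 3)  # 5 + 25 + 125
gain = total / 5  # compared to 5 replicas attached directly to the primary

print(total, gain)
```

The ratio is 31, which matches the "roughly 30x" claim.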

Global:

RDS usage metrics on CloudWatch. You can now monitor the usage metrics against account-wide service limits.
IPv6 support for new instances created in your VPC.
Flexible Performance Insights. To analyze the performance you can select any time range in the Performance Insights. Previously, you could only use relative values, such as last hour or last day.
Encrypted SNS topics. RDS supports publishing events to encrypted SNS topics.
New limit for concurrent copy per destination region. The previous limit of 5 concurrent operations was increased to 20 snapshots.
PostgreSQL and MySQL support M6i and R6i instances with new instance sizes up to 128 vCPUs and 1,024 GiB RAM.

Redshift

Others:

Snapshot isolation level for concurrent transactions is available.
Linear learner algorithm with Redshift ML. It's ideal for scenarios like predicting the sales of a product or determining marketing effectiveness.
Automated Materialized Views are GA.
Serverless is GA. You can now seamlessly run and scale analytics workloads without provisioning and managing the infrastructure.

Performance:

Cluster resize performance and flexibility of cluster restore improved. The cluster unavailability during the resize went from several hours to minutes!
Open Source ODBC driver with binary protocol support and enhanced performance is available.

Security:

Row-Level Security (RLS) support. You can restrict access to a subset of rows within a table based on the users’ job role or permissions and level of data sensitivity with SQL commands.
Federated SSO support for Query Editor v2. As an administrator you can integrate your Identity Provider (IdP) with the AWS console to access the Query Editor v2 as a federated user.

S3

Outposts:

AWS PrivateLink support for buckets and access points management. The feature simplifies the internal network architecture because you can manage S3 on Outposts via a private endpoint within your virtual private network, without public IPs or proxy servers.
Presigned URLs available for time-limited data sharing.

Other features:

Access Points changes. The maximum number of Access Points increased to 10 000. Additionally, the feature got support for new AWS services, such as Redshift, CloudFront, and SageMaker Feature Store.

OpenSearch

Tag-based authorization for data read and write operations.
Quota information available through Service Quotas.
Application-centric view. Logs, traces and visualizations are now available in an application-centric view. This view simplifies finding correlations between the events of a specific application.

Step Functions

Improved console. The changes facilitate the navigation through the details of the workflow executions to identify issues and dive deeper into the context of a failure.
Additional AWS Services and API Actions available. Starting from July, it's possible to connect to Amazon Pinpoint, AWS Billing Conductor, and Amazon GameSparks. Additionally, there are 195 new API Actions available for the already existing connectors.

QuickSight

Drag controller for rows and columns for table and pivot table. The feature simplifies altering a column by simply dragging from a cell, row header or column header.
Show/hide fields on pivot table. You can show or hide any column, row or value fields from the field well context menu on pivot table visuals.
The API for account creation is ready. The feature is available for QuickSight Enterprise and Enterprise + Q editions to simplify the deployment automation.
Look and feel customization of the maps in geospatial visualization. The change includes location details in Streets maps, light or dark canvas, and Imagery base map that adds more visuals to the map.
Redesigned dashboard experience. The new experience improves the discoverability, predictability, and the overall polish of the dashboards.
API-based domain listing for embedded analytics. The feature exposes a new endpoint that you can use to configure the list of domains where the dashboards can be embedded.

Azure

Backup

Backup support for Write Accelerator-enabled disks is GA. This category of disks is popular in M-Series Virtual Machines (VMs) to improve the I/O latency of writes against Azure Premium Storage.
Support for trusted launch Azure Virtual Machines is GA.

Cache for Redis

The Enterprise and Enterprise Flash tiers of Azure Cache for Redis now support the popular RedisJSON module. This module adds native functionality to store, query, and search JSON formatted data, which allows you to store data more easily in a document-style format in Redis. This simplifies common Redis use cases like storing product catalog or user profile data. https://azure.microsoft.com/en-ca/updates/public-preview-redisjson-available-in-azure-cache-for-redis-enterprise/

Cosmos DB

Synapse Link:

Azure Synapse Link support for your existing Azure Cosmos DB containers. It's now possible to use the Synapse Link for the existing containers. The feature triggers an initial sync from the transactional store to analytical store without impacting the former one's workload.
Power BI connector GA. You can use DirectQuery to visualize the Cosmos DB data in real-time on Power BI.

MongoDB:

16MB limit per document in MongoDB API. It's 8x more than previously. The feature is currently in Preview.
Linux emulator with Azure Cosmos DB API for MongoDB in Preview. It's a great way to test the development locally before deploying it to the cloud.
Data plane RBAC. This new access control allows you to authorize your data requests with a fine-grained, role-based permission model.
Azure Data Studio MongoDB extension. With the tool you can manage and query your MongoDB resources using mongo shell.
Unique partial indexes GA. The feature allows you to define the index fields from the documents and also enforce the uniqueness of their values.

Misc:

New features for elasticity. Several things related to the elasticity changed: serverless container size increased to 1 TB, hierarchical partition keys, burst capacity from unused throughput, partition merge for an optimized layout, and throughput distribution across partitions.
Continuous backup enhancements. The continuous backup now has a free version with 7 days of retention.
Azure Cosmos DB Core (SQL) API Query Engine improvements GA. The list includes the DateTimeBin function for an optimized GROUP BY operation on dates, and improved index usage on the aggregations.
Cosmos DB SQL sandbox migration anytime within 30-days period.
Audit log for continuous mode with Azure Cosmos DB is GA. It exposes the details of restore action on source account and destination account without needing to switch on special diagnostic logs.
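
The idea behind binning dates for a GROUP BY, mentioned above for DateTimeBin, can be sketched in Python. This is an emulation of the concept under the assumption that bins are counted from the Unix epoch; it does not reproduce the exact signature or defaults of the Cosmos DB function.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bin_datetime(value, bin_size, origin=EPOCH):
    """Floor `value` to the start of its `bin_size`-wide bin, counted from `origin`."""
    elapsed = value - origin
    return origin + (elapsed // bin_size) * bin_size


# Hypothetical event timestamps.
events = [
    datetime(2022, 8, 15, 13, 47, 12, tzinfo=timezone.utc),
    datetime(2022, 8, 15, 13, 5, 0, tzinfo=timezone.utc),
    datetime(2022, 8, 15, 14, 1, 30, tzinfo=timezone.utc),
]

# Equivalent of a GROUP BY on 1-hour bins.
events_per_hour = Counter(bin_datetime(event, timedelta(hours=1)) for event in events)
for bin_start, count in sorted(events_per_hour.items()):
    print(bin_start.isoformat(), count)
```

Grouping on the bin start instead of the raw timestamp is what makes the aggregation cheap: many distinct timestamps collapse into a few group keys.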

Data Explorer

Connector for Power Automate, Logic Apps, and Power Apps is GA. It can be used in various automation scenarios (alerting, notifications, workflows).
Native ingestion from Amazon S3 support is GA. You don't need to set up ETL processes in that case.

Databricks

Serverless SQL available in Preview. The feature provides a serverless compute environment for users who don't want to manage the cluster setup. It's especially useful in ad-hoc data analysis, BI, and SQL workloads.

Event Grid

User authorization for partner topics. With this change you must grant your consent to the partner to create partner topics in the given resource group.
Azure Event Grid partner events are GA. The availability of this feature lets external systems, like SaaS providers and platforms, publish events on Azure.

Event Hubs

Support for Apache Parquet output in Preview.
Resource governance with application groups. You can use application groups to apply throttling and data access policies per group and associate them with a uniquely identifiable condition, such as the security context.

Functions

Some news for the serverless offering:

Increased scale-out limits for the Linux Elastic Premium. The new limit is 60 for the East US region, and 40 for the remaining ones.
Kafka extension available in the Premium plan. The library lets you connect to the Apache Kafka broker and consume the data in real-time.
New extension defaults. They're now set to use the latest extensions.
Dynamic concurrency is GA. The feature is currently supported in Service Bus, Azure Blob, and Azure Queue triggers.
Retry policy for Event Hubs and timer triggers is GA. The policy executes a function until successful execution or exceeding the maximum number of retries.
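
The retry behavior described above, executing until success or until the maximum number of retries is exceeded, can be sketched in a few lines. The sketch uses a fixed delay between attempts and hypothetical function names; it illustrates the logic only and is not the Azure Functions runtime code.

```python
import time


def run_with_retries(func, max_retries, delay_seconds=0.0):
    """Call `func` until it succeeds or `max_retries` retries are exhausted."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception:
            if attempt >= max_retries:
                raise  # give up and surface the last error
            attempt += 1
            time.sleep(delay_seconds)  # fixed delay between attempts


calls = {"count": 0}


def flaky_handler():
    """Hypothetical handler failing twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "processed"


result = run_with_retries(flaky_handler, max_retries=3)
print(result, calls["count"])
```

A handler used this way should be idempotent, since the same event can be processed more than once.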

Durable functions:

Support for managed identity for Azure Storage. Instead of using the connection string embedded within the app, you can use managed identity of the Function app.
Support for Java and NodeJS.

Key Vault

Although it's not a pure data service, it got an important update:

Automated key rotation is GA.

Purview

DevOps policy in Microsoft Purview Data Policy. With this extension to the Data Policy, an administrator can grant access to DBA or DevOps users on system metadata in one or more Azure SQL DB instances and in SQL Server 2022, without having to manage separate grant statements for each system view.
Microsoft Purview Data Estate Insights. The new metrics give a bird's eye view on the data estate health. A Chief Data Officer can see the adoption rate of the data catalog (Health), assets distribution (Inventory and Ownership) and any poorly annotated entities (Curation and Governance).
Microsoft Purview access policies for Azure SQL Database. The feature gives the possibility to create access policies through Microsoft Purview and apply them to SQL Database at scale, without needing to connect to each database individually.
Rich text editor for assets description is GA.
Managed attributes in preview. Managed attributes are user-defined attributes providing an extra business or organization level context to an asset.

SQL Database

Hyperscale:

PgBouncer 1.17 is GA. This popular Postgres connection pooling tool is now a native part of the service.
99.99% availability in SLA for Azure SQL Database Hyperscale tier.
Point-in-time restore is possible for up to 35 days. It's far more than the 7 days guaranteed previously.
Named replicas are GA. They're a great way to implement near-real time analytical solutions.
Cross-region failover using active geo-replication and auto-failover groups. Both help perform quick disaster recovery of databases in case of a regional disaster or a large scale outage.

PostgreSQL:

New extensions are GA. The list includes PLV8 and PgRouting.
Same-zone high availability for Flexible Server. You can now place your standby replica in the same zone as the primary server.
Migration tool. The new tool facilitates migrating from a Single Server to Flexible Server instance with several automated steps.

SQL Managed Instance:

Resumable database restore. The feature lets a backup restore continue when an impactful system update happens during the maintenance window.
Windows Authentication for Azure AD principals is GA.

MySQL:

Higher burstable compute for Flexible Server. Burstable instances can use the accumulated unused capacity at demand peaks.
Memory Optimized service tier is now called Business Critical.
A new 80 vCore Business Critical compute option is available. It offers up to 80 vCores and 504 GiB of memory for the Business Critical tier.
Data encryption for Flexible Server with customer-managed keys in Preview.
Server logs for Flexible Server are GA. The feature saves the server activity logs to a file that you can download and use for troubleshooting later.

Misc:

Storage limits increase for selected compute sizes. The update concerns single databases and elastic pools configured with 8 and 10 vCores and increases the capacity from 1.5 TB to 2 TB.
Change Data Capture for record changes in Azure SQL Database.
Ledger in Azure SQL Database is GA. It enables a cryptographic proof that a database has not been tampered with.
New features including binding updates, JSON enhancements, and a new local development experience. They include updated Python and JavaScript bindings, the JSON_PATH_EXISTS function, the JSON_OBJECT and JSON_ARRAY constructors, and a local development emulator.
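
The cryptographic proof mentioned for the ledger can be pictured as a hash chain: each row's digest depends on the previous digest, so changing any past row changes the final digest. The sketch below is a conceptual illustration with hypothetical rows and helper names; it is not the data structure Azure SQL Database actually uses.

```python
import hashlib
import json


def chain_hashes(rows, seed="genesis"):
    """Return one digest per row, each one covering all the previous rows."""
    previous = hashlib.sha256(seed.encode("utf-8")).hexdigest()
    digests = []
    for row in rows:
        payload = previous + json.dumps(row, sort_keys=True)
        previous = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        digests.append(previous)
    return digests


def verify(rows, trusted_digest):
    """Check that the rows still produce the digest stored in a trusted place."""
    digests = chain_hashes(rows)
    return bool(digests) and digests[-1] == trusted_digest


# Hypothetical ledger rows.
ledger = [
    {"account": "A", "amount": 100},
    {"account": "B", "amount": -40},
    {"account": "A", "amount": 15},
]
trusted = chain_hashes(ledger)[-1]

tampered = [dict(row) for row in ledger]
tampered[1]["amount"] = -4  # someone rewrites history

print(verify(ledger, trusted), verify(tampered, trusted))
```

The proof only holds if the trusted digest is kept outside of the database that is being verified.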

Storage Account

Azure Files can be mounted as a local share in Windows Code in App Service.
Blob storage object replication. Object replication supports premium block blobs.
Azure Data Lake Storage Gen1 to Gen2 migration. The new feature uses Azure Portal to seamlessly migrate the Gen1 to Gen2 storage.
More Azure Storage Accounts. It's now possible to create up to 5000 Azure Storage accounts per subscription per region. It's 20x more than previously (250).

Stream Analytics

Managed identities authentication for Azure Cosmos DB and Azure Service Bus. The new feature lets you connect to these 2 services with System-Assigned Managed Identity or your own User-Assigned Managed Identity.
Autoscaling for the jobs. This native autoscaling capability adapts the number of streaming units to the incoming data. It's currently in Preview.
Bigger size for jobs and cluster. The maximum size of jobs and clusters increased from 192 SUs to 396 SUs.
No-code editor in Preview. You can now define your jobs in a drag-and-drop manner with the use of the available templates and the UI.
Support for non-existing tables in SQL Database output. The service will create the table from the schema defined in the Stream Analytics query.

Synapse

Synapse Link for SQL in Preview. The Link provides a real-time replication from SQL Database to Azure Synapse SQL pool without any extra ETL/ELT pipeline.
Elastic pool storage for Azure Synapse Analytics Spark. The feature enables monitoring of the temporary storage on the worker node and increasing its capacity if needed to reduce the risk of no space left on disk errors.

Virtual Network

Although it's not a pure data service, it has 2 new features good to know as a data engineer:

Network security groups (NSGs) support for private endpoints is GA. The feature enables advanced security controls on traffic destined to a private endpoint. Using it requires enabling the PrivateEndpointNetworkPolicies property in the subnet with the private endpoint.
User-defined routes (UDRs) for private endpoints are GA. Defining custom routes doesn't require creating a /32 address prefix anymore. Instead, you can use a wider address prefix in the user defined route tables for traffic destined to a private endpoint. The feature works only for the PrivateEndpointNetworkPolicies enabled at the subnet level.
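
The difference between a /32 route and a wider prefix can be sketched with a simplified longest-prefix-match lookup. The IP ranges and the next hop name are hypothetical, and the real Azure route selection involves more than this rule (system routes, for example).

```python
import ipaddress


def pick_route(routes, destination):
    """Return the next hop of the most specific route matching `destination`."""
    ip = ipaddress.ip_address(destination)
    matching = []
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if ip in network:
            matching.append((network.prefixlen, next_hop))
    if not matching:
        return None
    return max(matching)[1]  # the longest prefix wins


# Before: one /32 route per private endpoint.
routes_per_endpoint = {"10.1.2.4/32": "firewall"}
# After: a single wider prefix covers every endpoint of the subnet.
routes_wide_prefix = {"10.1.2.0/24": "firewall"}

new_endpoint = "10.1.2.5"
print(pick_route(routes_per_endpoint, new_endpoint))
print(pick_route(routes_wide_prefix, new_endpoint))
```

With the wider prefix, a newly created private endpoint is routed through the firewall without touching the route table.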

VM

Although it's not a pure data offering, one update is worth noting from a data workloads perspective:

Storage-optimized Azure Virtual Machines are GA. They provide faster processors, increased networking, and higher remote disk throughput, making them great candidates for data analytics workloads.

GCP

BigQuery

Administration:

Admin Resource Charts for on-demand users are GA. They simplify troubleshooting issues and monitoring key metrics across the entire organization. The previous version only exposed them to the reservation users.

IO:

Informatica Data Loader supports loading data into BigQuery.
Storage Read API quotas changed. Data plane requests per user and per minute increased from 5 000 to 25 000. There is also a new limit of 2 000 concurrent ReadRows calls per project in the US and EU multi-regions, and of 400 concurrent calls in other regions.
Storage Write API concurrent connections limit for non-multi-regions increased from 100 to 1000.

SQL:

The %J format element is GA for DATE, TIME, DATETIME, and TIMESTAMP functions. It represents the ISO 8601 1-based day of the year.
PARSE_DATE, PARSE_TIME, PARSE_DATETIME, and PARSE_TIMESTAMP support new date and time format elements: %a, %A, %g, %G, %j, %u, %U, %V, %w, and %W.
Time-travel window configuration at the dataset level is in Preview. If configured, it'll apply to all tables in the dataset.
New INFORMATION_SCHEMA views for storage are in Preview. TABLE_STORAGE shows a snapshot of the total current storage usage, TABLE_STORAGE_TIMELINE_BY_PROJECT does that for the project, and TABLE_STORAGE_TIMELINE_BY_ORGANIZATION for the organization.
The @@dataset_project_id variable is GA. It defines the default project in case it's missing in the query.
Deterministic encryption functions are GA. They include DETERMINISTIC_ENCRYPT, DETERMINISTIC_DECRYPT_BYTES, and DETERMINISTIC_DECRYPT_STRING.
APPENDS change history TVF is in Preview. It provides a history of table appends over a window of time.
Inverse trigonometric SQL functions are GA. For a given angle, you can use COT to compute the cotangent, COTH for the hyperbolic cotangent, CSC for the cosecant, CSCH for the hyperbolic cosecant, SEC for the secant, and SECH for the hyperbolic secant.
ALTER TABLE RENAME COLUMN is in Preview. You can use it to rename columns of a table.
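
The %J element from the list above is easy to confuse with %j. As far as I understand the format elements, %j is the calendar day of the year while %J counts days inside the ISO 8601 week-numbering year. The Python sketch below emulates both to show where they diverge; it doesn't call BigQuery.

```python
from datetime import date


def iso_day_of_year(day):
    """1-based day number within the ISO 8601 week-numbering year."""
    _, iso_week, iso_weekday = day.isocalendar()
    return (iso_week - 1) * 7 + iso_weekday


def calendar_day_of_year(day):
    """1-based day number within the calendar year."""
    return day.timetuple().tm_yday


for day in (date(2022, 1, 1), date(2022, 1, 3), date(2022, 8, 15)):
    print(day, calendar_day_of_year(day), iso_day_of_year(day))
```

January 1st, 2022 was a Saturday that still belongs to the last ISO week of 2021, hence the big gap between both values for that date.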

Security:

Column-level data masking in Preview. With the feature you can obscure the content of the columns that some users shouldn't be able to read.
Resource Manager tags on datasets are in Preview. You can rely on this feature to conditionally apply IAM policies to resources.
Cloud console supports VPC service control perimeters to control access from BigQuery Omni to external clouds. The feature is GA.
Workload identity federation is supported in Preview for BigQuery resources.
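
To give an idea of what column-level masking does at read time, here is a minimal sketch applying a masking rule per column for users without raw access. The rule names (hash, nullify, default value) and the row are hypothetical and only mimic the kind of rules such a feature typically offers.

```python
import hashlib


def mask_hash(value):
    """Replace the value with its SHA-256 hex digest."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def mask_nullify(value):
    """Hide the value completely."""
    return None


def mask_default(value):
    """Replace the value with a neutral default of a matching type."""
    if isinstance(value, bool):
        return False
    if isinstance(value, (int, float)):
        return 0
    return ""


def read_row(row, masking_rules, user_can_read_raw):
    """Return the row as the user is allowed to see it."""
    if user_can_read_raw:
        return dict(row)
    return {
        column: masking_rules[column](value) if column in masking_rules else value
        for column, value in row.items()
    }


row = {"email": "jane@example.com", "ssn": "123-45-6789", "age": 34, "country": "PL"}
rules = {"email": mask_hash, "ssn": mask_nullify, "age": mask_default}

masked = read_row(row, rules, user_can_read_raw=False)
print(masked)
```

Hashing keeps the column usable for joins and distinct counts, while nullifying removes the information entirely.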

Omni:

Reservation and Access Control DCL are supported.
Azure workload identity federation available in Preview.

Collation:

Case-insensitive collation is in Preview. If used, the case is ignored in comparison and sorting string operations.
COLLATE function is available in Preview. It takes a string column and returns a string value with the collation specified in the parameter.
DEFAULT_COLLATE function is available in Preview. It'll be applied to all columns supporting collation. The function is supported in CREATE/ALTER SCHEMA and CREATE/ALTER TABLE statements.
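
The effect of a case-insensitive collation on comparison and sorting can be approximated in Python with `str.casefold()`. It's only an approximation: BigQuery relies on its own collation rules, which may order some characters differently.

```python
def ci_key(text):
    """Comparison key ignoring the letter case."""
    return text.casefold()


names = ["banana", "Apple", "Cherry", "apple"]

# Default (binary) ordering puts the uppercase letters first.
binary_sorted = sorted(names)
# Case-insensitive ordering compares the casefolded values.
ci_sorted = sorted(names, key=ci_key)

print(binary_sorted)
print(ci_sorted)
print("Apple" == "APPLE", ci_key("Apple") == ci_key("APPLE"))
```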

Other features:

Direct integration between Pub/Sub and BigQuery. You can now write your Pub/Sub messages to BigQuery without any intermediary layer, such as Dataflow or Cloud Function.
The query/statement_scanned_bytes and query/statement_scanned_bytes_billed metrics are no longer delayed for 6 hours. They're reported every 3 minutes.
Batch and interactive translation services are GA. The feature lets you translate most major SQL dialects to the dialect supported by BigQuery.
Explore data in Data Studio link in the BigQuery query results is GA.
Query queues are in Preview. When enabled, BigQuery automatically determines the query concurrency instead of setting a fixed limit. Moreover, queries beyond the concurrency target are queued until processing resources become available.
BI Engine acceleration can be limited to a set of tables with the BI Engine Preferred tables feature. It's in Preview.
New limit of 250 GB for maximum reservation size per project and per location for BigQuery BI Engine projects. Previously the limit was set to 100GB.
Default values are supported on columns in BigQuery tables. The feature is in Preview.
Default configurations at a project or organization level are GA.
BigLake is GA.
A new option for materialized views to control costs and performance is in Preview. This new option is called max_staleness and defines the data freshness of the results.
Bigtable external data source is now GA.
Query execution priority management for Cloud Spanner federated queries is GA. You can assign one of the 3 priorities (high, medium, or low) to the external queries reading Cloud Spanner tables.
Job type can be selected when assigning a folder, organization, or project to a reservation. The feature is GA and the supported types are QUERY, PIPELINE, and ML_EXTERNAL.
View field can be set in the tables.get() API method to control the returned information about the table. The possible values are BASIC, STORAGE_STATS, and FULL.
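
The max_staleness option for materialized views mentioned in the list above boils down to a freshness check. The sketch below shows that decision with hypothetical timestamps; it's a simplification of the idea and says nothing about how BigQuery implements it internally.

```python
from datetime import datetime, timedelta


def can_serve_from_view(last_refresh, now, max_staleness):
    """True when the materialized results are fresh enough to be returned as is."""
    return now - last_refresh <= max_staleness


last_refresh = datetime(2022, 8, 15, 8, 0)
max_staleness = timedelta(hours=4)

# Within the staleness interval: the stored results are good enough.
fresh_enough = can_serve_from_view(last_refresh, datetime(2022, 8, 15, 11, 30), max_staleness)
# Beyond the interval: the base tables have to be read.
too_stale = can_serve_from_view(last_refresh, datetime(2022, 8, 15, 13, 0), max_staleness)

print(fresh_enough, too_stale)
```

A larger interval lowers the cost because more queries avoid the base tables, at the price of less fresh results.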

Cloud Composer

Cloud Composer 2:

Increased memory limit for Redis queue. The queue also scales with the environment's size.
Warning messages when storage usage is close to the limit.
Web server restarting in Preview.
User Stats Chart view is enabled for Admins.
Fixed refresh and display for incremental task logs in Airflow UI.
Some network changes:
- Private Service Connect support is GA. You can create a Private IP environment instead of VPC peering.
- Privately used public IP addresses are GA. You can use them if you don't have enough private IP addresses in your pool. Some of the public addresses are excluded from the list, though.
- The service no longer checks for network range conflicts that are not relevant for Private Service Connect.
- Starting from July, new environments created in the console use Private Service Connect configuration by default.
- Size update for Private Service Connect-based environments is supported.
- IP Masquerade agent is GA. You can use it to save network ranges in the networking configuration by translating Pod IP addresses to the node IP addresses.

Security:

Authorized networks support is GA. The feature allows you to specify CIDR ranges that can access your environment's cluster control plane using HTTPS.
Starting from July, the service enforces the "Act as" organization policy in all projects. The service accounts that create, update, and delete Cloud Composer environments must have the iam.serviceAccounts.actAs permission granted.
Improved DAG UI reliability in Private IP environments.
For Cloud Composer 2 you can assign permissions for an environment's service account on the service account level. Previously, it was only possible to do this at the project level.
Per-folder Roles Registration support. It's an automated way of configuring roles and their DAG-level permissions. The feature automatically creates a custom Airflow role for each subfolder directly inside the /dags folder and grants this role DAG-level access to all DAGs stored there.

Others:

DAG UI is GA. It's a section of Google Cloud console interface dedicated to viewing and monitoring DAGs, DAG runs, and individual tasks.
New Airflow metrics for pools, smart sensor, and SLA email notifications are available.
Several deprecated operators will be removed in one of the future versions of operators for Airflow 2. The list includes: BigQueryExecuteQueryOperator, BigQueryPatchDatasetOperator, DataflowCreateJavaJobOperator, DataflowCreatePythonJobOperator, DataprocScaleClusterOperator, DataprocSubmitPigJobOperator, DataprocSubmitSparkSqlJobOperator, DataprocSubmitSparkJobOperator, DataprocSubmitHadoopJobOperator, DataprocSubmitPySparkJobOperator, MLEngineManageModelOperator, MLEngineManageVersionOperator and GCSObjectsWtihPrefixExistenceSensor.
Two new database metrics are available: one shows the total limit of database connections and the other shows the number of active database connections.
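If you want to check whether your DAGs still rely on the deprecated operators listed above, a plain-text scan can be a starting point. The snippet below is only a sketch: the helper name and the sample DAG are hypothetical, and because it matches raw text it will also flag names that appear in comments or strings.

```python
import re

# Names copied from the deprecation list above (including the original
# "Wtih" spelling of the GCS sensor)
DEPRECATED_OPERATORS = [
    "BigQueryExecuteQueryOperator",
    "BigQueryPatchDatasetOperator",
    "DataflowCreateJavaJobOperator",
    "DataflowCreatePythonJobOperator",
    "DataprocScaleClusterOperator",
    "DataprocSubmitPigJobOperator",
    "DataprocSubmitSparkSqlJobOperator",
    "DataprocSubmitSparkJobOperator",
    "DataprocSubmitHadoopJobOperator",
    "DataprocSubmitPySparkJobOperator",
    "MLEngineManageModelOperator",
    "MLEngineManageVersionOperator",
    "GCSObjectsWtihPrefixExistenceSensor",
]

_PATTERN = re.compile(r"\b(" + "|".join(DEPRECATED_OPERATORS) + r")\b")


def find_deprecated_operators(dag_source):
    """Return (line number, operator name) pairs found in a DAG source text."""
    findings = []
    for line_number, line in enumerate(dag_source.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            findings.append((line_number, match.group(1)))
    return findings


# Hypothetical DAG file content used only to exercise the scan
sample_dag = '''from airflow.providers.google.cloud.operators.bigquery import BigQueryExecuteQueryOperator

run_query = BigQueryExecuteQueryOperator(task_id="run_query", sql="SELECT 1")
'''

print(find_deprecated_operators(sample_dag))
```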

Cloud Functions

Cloud Functions 2nd gen is GA. This new generation has an advanced feature set, including infrastructure, performance, scalability, and trigger improvements.

Cloud SQL

SQL Server:

External replica. Cloud SQL for SQL Server can act as a publisher to a subscriber that is either external or internal to Cloud SQL.
Server Audit is GA. The feature helps track and log server- and database-level events.
The max server memory flag is supported on instances.

MySQL:

The in-place major version upgrades feature is in Preview.
Time zone names are now supported as values for the time_zone parameters.

PostgreSQL:

Access to the pg_shadow view. This view shows the properties of all roles that are marked as rolcanlogin in pg_authid.
Four new extensions (pg_bigm, refint, decoderbufs, pg_wait_sampling) are GA.
New Monitoring dashboard and System insights dashboard. The former helps monitor overall health and performance while the latter helps detect performance problems.

PostgreSQL and MySQL:

External replicas with High Availability.
High Availability support for read replicas.

Global:

Faster machine type changes. Connectivity now drops for less than 60 seconds.
Password policies at the instance level. You can define things like password length or complexity.
Instance deletion protection is GA. You can use this feature to prevent accidental removal of the instance.

Cloud Storage

Security:

The restrict authentication types organization policy constraint is GA. It configures the authentication types allowed in the GCS requests.
Customer-managed encryption key (CMEK) organization policy constraints are GA. Two new constraints, constraints/gcp.restrictNonCmekServices and constraints/gcp.restrictCmekCryptoKeyProjects, give fine-grained control over the resources that must use CMEK.
Bucket tags are in Preview. They help implement fine-grained access control.
A default Cloud KMS key can be defined for buckets created with the XML API.

Misc:

Behavior change for JSON and XML copy requests. They return a permanent error on timeouts for objects larger than 2.5 GiB and a retryable error otherwise.
New conditions and actions for the Object Lifecycle Management. This GCS component supports MatchesPrefix and MatchesSuffix conditions, and an AbortIncompleteMultipartUpload action to remove abandoned multipart uploads.
Turbo replication is GA. This new replication mode guarantees that objects are replicated to a separate region within 15 minutes.
Dual-region storage is GA. This new configuration creates a bucket replicated in 2 regions on the same continent.
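The new Object Lifecycle Management conditions and action mentioned in this list can be sketched as a lifecycle configuration. The example below builds it as a Python dictionary and prints the JSON. In the JSON form the conditions are spelled matchesPrefix and matchesSuffix. The prefix, suffix, and age values are hypothetical and chosen only for illustration.

```python
import json

lifecycle_config = {
    "rule": [
        {
            # Delete objects under tmp/ ending with .csv once they are
            # older than 30 days (hypothetical values)
            "action": {"type": "Delete"},
            "condition": {
                "age": 30,
                "matchesPrefix": ["tmp/"],
                "matchesSuffix": [".csv"],
            },
        },
        {
            # Clean up multipart uploads abandoned for more than 7 days
            "action": {"type": "AbortIncompleteMultipartUpload"},
            "condition": {"age": 7},
        },
    ]
}

lifecycle_json = json.dumps(lifecycle_config, indent=2)
print(lifecycle_json)
```

The printed document could then be saved to a file and applied to a bucket, for example with the gsutil lifecycle set command.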

Data Catalog

Entry list section with the entries included in the dataset. This information was not available previously in the UI.
Dataplex integration. Data Catalog is now a part of the Dataplex service. The goal is to provide an automated and complete data management and governance solution.

Data Fusion

Two new releases (6.7.0 and 6.7.1) and some major changes in them:

Default Master Machine Type is now set to n2.
SAP Ariba Batch Source plugin in Preview. You can connect your data pipeline to an SAP Ariba Source and a BigQuery Sink.
Connection Management is GA.
Transformation Pushdown for JOINs is GA.
Dataplex source and sink plugins are in Preview as system plugins. Thanks to that, you don't need to install them by yourself anymore.
DNS Resolution in Preview. It's now possible to use domains and hostnames for sources instead of IP addresses.
Plus many other changes available in the Product's changelog page.

Data Loss Prevention

The LOCATION_COORDINATES infoType detector is now available in all regions.
Improved detection quality with a new detection model for the PERSON_NAME infoType.
De-identification of sensitive data from GCS is GA.
Built-in infoTypes now include infoType categories.

Dataflow

Dataflow Prime is GA. This serverless data processing layer was previously available in Preview.
Regional Managed Instance Groups are used in Dataflow. They replace the previously used zonal Managed Instance Groups.

Dataproc

Behavior change for canceling a job that is already in CANCEL_PENDING, CANCEL_STARTED, or CANCELLED state. The request will return the job instead of initiating the cancellation.
Behavior change for submitting a job or workflow and selecting the cluster matching specified labels. Dataproc will choose among clusters in one of the following states: RUNNING, UPDATING, CREATING, or ERROR_DUE_TO_UPDATE.
Custom OSS Metrics are GA. The feature collects and integrates Dataproc cluster OSS component metrics into Cloud Monitoring.
Ranger Cloud Storage plugin is GA. The plugin evaluates requests from the GCS connector against Ranger policies. For any allowed request, it returns an access token for the cluster VM service account.
Dataproc Persistent History Server is GA. The server exposes a web interface to view job history, even for deleted Dataproc clusters.
Dataproc custom constraints are in Preview. They can be used to allow or deny specific operations on Dataproc clusters, such as restricting the number of workers for a created or updated cluster, or preventing the application master from running on preemptible nodes.
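The two behavior changes opening this list (job cancellation and label-based cluster selection) can be illustrated with a small in-memory model. This is only a sketch: the Job and Cluster classes and both functions are hypothetical and don't use the Dataproc client library. The transition to CANCEL_PENDING for other job states is also an assumption.

```python
from dataclasses import dataclass, field

# States taken from the release notes summarized above
CANCEL_IN_PROGRESS_OR_DONE = {"CANCEL_PENDING", "CANCEL_STARTED", "CANCELLED"}
SELECTABLE_CLUSTER_STATES = {"RUNNING", "UPDATING", "CREATING", "ERROR_DUE_TO_UPDATE"}


@dataclass
class Job:
    job_id: str
    state: str


@dataclass
class Cluster:
    name: str
    state: str
    labels: dict = field(default_factory=dict)


def cancel_job(job):
    """Return the job as-is if a cancellation is already pending, started or done."""
    if job.state in CANCEL_IN_PROGRESS_OR_DONE:
        return job
    # Assumption: any other state triggers a new cancellation request
    job.state = "CANCEL_PENDING"
    return job


def candidate_clusters(clusters, required_labels):
    """Keep clusters in a selectable state that carry all the required labels."""
    return [
        cluster for cluster in clusters
        if cluster.state in SELECTABLE_CLUSTER_STATES
        and all(cluster.labels.get(key) == value
                for key, value in required_labels.items())
    ]


clusters = [
    Cluster("etl-1", "RUNNING", {"env": "prod"}),
    Cluster("etl-2", "DELETING", {"env": "prod"}),
    Cluster("etl-3", "CREATING", {"env": "dev"}),
]
selected = [cluster.name for cluster in candidate_clusters(clusters, {"env": "prod"})]
print(selected)
print(cancel_job(Job("job-1", "CANCELLED")).state)
```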

Datastream

Backfilling for Oracle database tables with more than 100 million rows. This is a new scenario supported by the service.
Integration with tags. Datastream supports tags on its resources for fine-grained access control. The tagged resources can include configurations, connection profiles, and streams.

Firestore

Firebase App Check support for Firestore is GA. You can use this feature to ensure that only a given mobile or web app has access to the Firestore data.
Custom IAM roles are available for the datastore.databases.getMetadata permission.
Time-to-live policies are in Private Preview. The feature automatically removes stale data from the database.
VPC Service Controls support is GA. This networking configuration helps reduce data exfiltration risks by defining a security perimeter in the network.

IAM

Documentation change. IAM policies are now called allow policies. It doesn't affect the APIs.
Workforce identity federation is in Preview. It lets you use an external identity provider to access supported GCP services.

Pub/Sub

gRPC compression available for the Java client. This new feature helps save networking costs by reducing the size of the publish request.

Spanner

Change Data Capture implementation with change streams. You can now get all changes made on Cloud Spanner tables in real-time.
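A change stream is declared with a DDL statement. The helper below is a small sketch that only builds the CREATE CHANGE STREAM statement as a string. The function name, the stream names, and the table names are hypothetical.

```python
def create_change_stream_ddl(stream_name, tables=None, retention_period=None):
    """Build a CREATE CHANGE STREAM statement.

    Without explicit tables, the stream watches the whole database (FOR ALL).
    """
    watched = ", ".join(tables) if tables else "ALL"
    ddl = f"CREATE CHANGE STREAM {stream_name} FOR {watched}"
    if retention_period:
        ddl += f" OPTIONS (retention_period = '{retention_period}')"
    return ddl


# Hypothetical stream and table names
orders_ddl = create_change_stream_ddl("OrdersStream", tables=["Orders", "OrderItems"])
everything_ddl = create_change_stream_ddl("EverythingStream", retention_period="7d")
print(orders_ddl)
print(everything_ddl)
```

The generated statements could then be submitted like any other schema change, for instance with the update_ddl method of the google-cloud-spanner Python client.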

Performance:

Two new Query Optimizers, version 4 and version 5. The former is still the default optimizer in production.
Commit timestamps can improve the query performance while retrieving data written after a particular time.
Query insights is GA. It helps visually detect and identify query performance issues.

Other features:

Transactions, reads, queries, and lock contentions metrics in Cloud Monitoring are GA.
The PostgreSQL interface is GA. Cloud Spanner becomes accessible from the PostgreSQL ecosystem, including the SQL dialect and psql CLI.
Granular instance sizing is GA. It supports creating production instances of fewer than 1000 processing units.

Querying:

Query statistics package can be updated manually with the ANALYZE command. It completes the existing automatic management by providing faster feedback for frequently changed data or indexes.
The DISABLE_INLINE hint for function calls is available. If used, it asks the query engine to compute the referenced function only once, even if it's present in other parts of a query.

Storage Transfer Service

Detailed logging for objects copied between AWS S3, Azure Blob Storage, ADLS Gen 2, and Cloud Storage is GA. The feature enables additional data integrity checks on the copied objects.
Expanded overwrite options are GA. They help solve the conflicts when the copied files already exist in the destination.
Metadata preservation options are GA. This configuration helps preserve the metadata of the copied objects, such as POSIX attributes and symlinks for POSIX filesystems, or ACLs, CMEK, temporary holds, and object creation time for GCS buckets.
Unified console experience for cloud and file system transfers. Both types are now available from a single interface.
Self-hosted transfer agents can be used to transfer AWS S3 data.

If I had to pick my Top 5 changes? Once again, serverless releases (Redshift, MSK, Dataflow, EMR) are my favorites. And in addition to them the flex option for Glue, so the mix of time- and cost-saving features :) What are yours?
