What's new on the cloud for data engineers - part 6 (01-04.2022) on waitingforcode.com - articles about Data engineering on the cloud

It's time for the first cloud news blog post this year. The update summary lists all changes of data or data-related services between January 1 and April 25.

I'm covering here the data services with some exceptions most often related to the security services. For the updates, I'm omitting the version upgrades which are quite frequent changes especially for the managed RDBMS services. This time, I'm also trying to highlight the most important features. The view is subjective, though.

AWS

Athena

ACID transactions built on Apache Iceberg are now Generally Available. The feature adds insert, update, delete, and time travel operations to Athena's SQL data manipulation language.
Support for Amazon Ion binary format, used so far by internal Amazon teams or AWS services. Athena can interact with the data stored in this format thanks to the Amazon Ion Hive Serializer/Deserializer.
Extended list of data connectors. Athena can query the data located outside AWS, including the databases like Snowflake, Microsoft SQL Server, Oracle, Azure Data Lake Storage (ADLS) Gen2, Azure Synapse, and Google BigQuery.

Aurora

New version of Aurora Serverless is Generally Available. The fine-grained scaling mode helps reduce costs by up to 90%.

Batch

VPC connectivity. You can manage your AWS Batch resources in your VPC with AWS PrivateLink connection to meet the security and compliance requirements.

Data Exchange

General Availability for Redshift. The feature enables seamless querying of third-party data available in the Data Exchange catalog from a Redshift cluster. The process doesn't involve any data movement because it remains in the source Redshift cluster.

DocumentDB

A trial offer. The service got a new 1-month free trial offer with: 750 hours of a t3.medium instance, 30M IOs, 5GB of storage, and 5GB of backup storage for 30 days.
$mergeObjects and $reduce support. The service improved its compatibility with MongoDB by adding support for the $mergeObjects and $reduce operators.
Performance Insights in preview. This monitoring feature helps tune the database performance by providing an easy-to-understand dashboard with the database load.

DynamoDB

New parameter in PartiQL APIs. The API supports a new optional parameter, the ReturnConsumedCapacity. If defined, it'll return the request execution context containing the total read and write capacity, and some statistics for the table and indexes involved in the operation.
Limit operator in PartiQL. In addition to the ReturnConsumedCapacity attribute, PartiQL also got a new operator, the Limit. You can use it to limit the number of items processed in PartiQL operations.
Default service quotas increased. With this change, the number of DynamoDB tables per account and region increases from 256 to 2,500. Consequently, the number of concurrent table management operations grew from 50 to 500.
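The two PartiQL additions above can be sketched as a small request builder. It only prepares the keyword arguments for the ExecuteStatement operation; the table name, the key value, and the helper name are hypothetical, and nothing is sent to AWS.

```python
def build_partiql_request(statement, parameters=None, limit=None,
                          return_consumed_capacity=None):
    """Build keyword arguments for a DynamoDB ExecuteStatement call."""
    allowed_capacity_modes = {"INDEXES", "TOTAL", "NONE"}
    request = {"Statement": statement}
    if parameters:
        request["Parameters"] = parameters
    if limit is not None:
        if limit < 1:
            raise ValueError("Limit must be a positive integer")
        # Caps the number of items evaluated by the operation.
        request["Limit"] = limit
    if return_consumed_capacity is not None:
        if return_consumed_capacity not in allowed_capacity_modes:
            raise ValueError(f"Unsupported mode: {return_consumed_capacity}")
        # Asks the service to return the consumed read/write capacity.
        request["ReturnConsumedCapacity"] = return_consumed_capacity
    return request


# Hypothetical table and key, used only for illustration.
request = build_partiql_request(
    'SELECT * FROM "orders" WHERE customer_id = ?',
    parameters=[{"S": "c-42"}],
    limit=10,
    return_consumed_capacity="TOTAL",
)
# With boto3 installed and credentials configured, the request could be sent as:
# boto3.client("dynamodb").execute_statement(**request)
```

Keeping the builder separate from the client call makes it easy to check the parameters in a unit test before touching a real table.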

EC2

Although it's not a pure data service, EC2 got a few interesting auto-scaling updates:

Warm Pools with new features. Amazon EC2 Auto Scaling Warm Pools got 2 new features, Warm Pool instances hibernation and returning the running instances to a Warm Pool on scale-in action.
Default instance warm-up time. The warm-up time controls the delay before the created instance is considered ready to deliver metrics to CloudWatch. In the Auto Scaling instance lifecycle, you can now define the default instance warm-up time for all scaling activities, health check replacements, and other replacement events.
Auto Scaling instance lifecycle states in the Amazon EC2 Instance Metadata Service (IMDS). Amazon EC2 Auto Scaling publishes these new states. You can use them to easily determine the lifecycle state of the scaled instances without having to set up Amazon CloudWatch Events. Instead, you can use Amazon EC2 Auto Scaling Lifecycle hooks and, for example, start installing extra software after the instance passes to the ready state.
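As an illustration of the last point, here is a minimal polling sketch. The metadata path and the state names ("InService", "Warmed:Stopped") reflect my understanding of the EC2 Auto Scaling documentation and should be treated as assumptions; the helper names are hypothetical. The fetch function is injectable, so the loop can be tried outside of EC2.

```python
import time
import urllib.request

# Assumed IMDS path for the target lifecycle state; verify it against the AWS docs.
STATE_URL = ("http://169.254.169.254/latest/meta-data/"
             "autoscaling/target-lifecycle-state")


def fetch_state_from_imds(timeout=2):
    """Read the target lifecycle state; this call only works on an EC2 instance."""
    with urllib.request.urlopen(STATE_URL, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def wait_for_state(target, fetch=fetch_state_from_imds, attempts=30,
                   delay_seconds=5, sleep=time.sleep):
    """Poll until the instance reports the target state or attempts run out."""
    for _ in range(attempts):
        if fetch() == target:
            return True
        sleep(delay_seconds)
    return False


# Local dry run with a fake metadata source instead of the real IMDS.
fake_states = iter(["Warmed:Stopped", "InService"])
ready = wait_for_state("InService", fetch=lambda: next(fake_states),
                       sleep=lambda seconds: None)
```

On a real instance, the software installation step would start once wait_for_state returns True.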

Warm Pools

It's a pool of pre-initialized EC2 instances sitting alongside an Auto Scaling group. It helps decrease the latency for applications with exceptionally long boot times, for example, because they write massive amounts of data to disk.

Besides, it also supports new instance types:

Compute-optimized C6a instances are Generally Available. They're powered by 3rd generation AMD EPYC processors and can perform up to 15% better than C5a, for 10% less than comparable x86-based EC2 instances.

EMR

Quite a lot of updates for Kubernetes:

Custom Image Validation Tool open-sourced. The tool validates the custom Docker image with Apache Spark application against a set of predefined rules containing, among others: UserName, WorkingDir, or EntryPoint. It can also perform a validation of the Apache Spark setup by running a basic job, such as SparkPi, in local mode.
Custom images for interactive endpoints. With this feature, those of you who have custom Docker images can run them from an interactive workload, such as EMR Studio.
Custom images for AWS Graviton-based instances. In January, AWS also added support for custom Docker images deployed on the Graviton-based instances.
Additional error messages in the DescribeJobRun API. The state details of the DescribeJobRun API provide a detailed error message explaining the job failure. You don't need to download and view logs anymore.

ACID file formats:

Apache Iceberg 0.12 support. Exceptionally, I'm sharing a version of support news. New versions of EMR can work with Apache Iceberg 0.12. As you can see, not only Athena gets some updates for this ACID file format.

Lake Formation:

Write support. You can now use the INSERT INTO, INSERT OVERWRITE, and ALTER TABLE against Glue Data Catalog-managed tables. Previously, Apache Spark could only send SHOW DATABASES and DESCRIBE TABLE commands.

Studio:

Real-time in EMR Studio. The new release of the service provides a real-time view of the notebook activity. The users invited to collaborate on the same notebook can see each other's activity.

Scaling:

Shuffle-aware scaling. EMR Managed Scaling automatically resizes EMR cluster for best performance and resource utilization. Recently, AWS added shuffle-awareness to the feature, meaning that the nodes storing shuffle files won't be removed. It prevents job re-attempts and re-computations.

EventBridge

Enhanced Rule Filtering and Event Transformation capabilities in the Management Console. EventBridge Rules help filter the events that should be delivered to the sink, whereas Event Transformation converts the input event. Now, you can try both features in the new Sandbox section of the AWS console before implementing them in the code base.
Global endpoints support. With that feature, EventBridge improved its fault tolerance. In case of a regional outage, the Global endpoint will automatically fail over the events ingestion to the secondary region. The primary and secondary regions are defined during the Global endpoints setup step. It's worth noticing they apply only to the EventBridge events. If there is no available data sink in the secondary region, the delivery won't happen.
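To give an idea of what rule filtering does, below is a deliberately simplified matcher. It assumes only exact-value patterns, where each leaf of the pattern lists the accepted values; prefix, numeric, and other content filters are left out. The function name, bucket name, and event shape are illustrative, and this is not EventBridge's implementation.

```python
def matches(pattern, event):
    """Return True when every field of the pattern is satisfied by the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: descend into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of accepted values.
            return False
    return True


# Hypothetical pattern: keep only S3 events for one bucket.
pattern = {"source": ["aws.s3"], "detail": {"bucket": {"name": ["raw-data"]}}}

delivered = {"source": "aws.s3", "detail": {"bucket": {"name": "raw-data"}}}
filtered_out = {"source": "aws.s3", "detail": {"bucket": {"name": "tmp"}}}

print(matches(pattern, delivered))     # True
print(matches(pattern, filtered_out))  # False
```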

FSx

Although it's not a pure data service, one of its recent updates brings something good for databases:

Custom ZFS record size. If your application or database writes/reads data in consistent chunks, you can set a custom ZFS record size to optimize file system throughput, latency, and IOPS. Previously the setting used a value tuned for the vast majority of workloads.

Glue

Data Brew:

E.164 standard phone number format. You can format phone numbers in the dataset to the E.164 international format standardizing the numbers as [+][country code][subscriber number including area code].
Sorting support. The service also supports custom sort of one or multiple columns in the DataBrew dataset. The operation doesn't require writing any line of code.
Cross-account access. The service recently added support for accessing the Glue Data Catalog-backed tables on S3 located in a different AWS account. The operation requires an appropriate resource policy in AWS Glue.
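The E.164 layout mentioned above is easy to illustrate outside of DataBrew. The sketch below assumes the input is a national number, and that a leading 0 is a trunk prefix that can be dropped (true for many countries but not all). The function name is hypothetical and the logic is not DataBrew's.

```python
import re


def to_e164(raw_number, country_code):
    """Format a national phone number as +[country code][subscriber number]."""
    digits = re.sub(r"\D", "", raw_number)
    code = re.sub(r"\D", "", country_code)
    # Assumption: a leading 0 is a national trunk prefix and is not dialed
    # from abroad. Some countries keep it, so adapt this step if needed.
    digits = digits.lstrip("0")
    # E.164 allows at most 15 digits after the plus sign.
    if not code or not digits or len(code) + len(digits) > 15:
        raise ValueError(f"Cannot format {raw_number!r} as E.164")
    return f"+{code}{digits}"


print(to_e164("06 12 34 56 78", "33"))  # +33612345678
print(to_e164("(206) 555-0100", "1"))   # +12065550100
```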

FindMatches:

AWS Glue FindMatches in AWS Glue 2.0. Version 2.0 of the service supports fuzzy matching and deduplication of similar records in the analyzed datasets. The feature relies on an ML-based transform called FindMatches.

Jobs:

Autoscaling for streaming and batch jobs. The feature automatically fine-tunes the resources available for the job execution. For the batch workloads, it analyzes each stage of the job and adds or removes workers according to the workload. For the streaming applications, the feature adapts the workers to the activity of the stream data source.
Personally Identifiable Information (PII) detection. The service brings in Preview the feature of PII detection and remediation. It detects PII data with the help of pattern matching and machine learning and automatically logs any detected PII data at the column and cell level.
Job Run Insights. This new feature facilitates debugging Glue jobs by providing extra information on errors and performance bottlenecks. The feature requires starting the job with the --enable-job-insights flag enabled.
General Availability for Interactive Sessions. If you need to perform an ad-hoc data analysis without worrying about the underlying infrastructure to set up, you can use the Interactive Sessions component.

Schema registry:

Protocol Buffers. The AWS Glue Schema Registry added Protocol Buffers to the list of the supported file formats. Two previously present formats were Apache Avro and JSON Schema.

Kendra

There are 3 new features for this search engine:

Query language support. You can use the expressions familiar from other languages, such as SQL, as predicates in the Kendra queries. The feature includes various scenarios, like negations, exact matches, or range predicates.
Misspelled queries suggestion. With the new Amazon Kendra’s Spell Checker feature, you can fix any misspelling errors in the queries to reduce the number of 0-results searches.
FSx Connector release. The service released an FSx Connector to enable intelligent search on top of Amazon FSx for Windows File Server. The feature supports indexing documents of different formats (HTML, PDF, MS Word, MS PowerPoint, and plain text), and querying them from Kendra.

Keyspaces

Two changes for the clients:

AWS SDK support. You can now manage Keyspaces resources from the AWS SDK. Previously, the service supported only CloudFormation.
Apache Spark compatibility. If you have a data processing job written in Apache Spark, you can use the open-source Spark Cassandra Connector to query Keyspaces.

Kinesis

Two new sinks are available in Firehose:

Honeycomb, which is the data observability service.
Coralogix, which is the log analytics platform.

Lambda

Processing:

Max Batching Window. It's a new way to delay the function execution by up to 300 seconds. If applied, the service will not invoke the function immediately but accumulate the records to process in the invocation for up to the Max Batching Window time. Naturally, it works with the streaming data sources, such as Amazon MSK, Apache Kafka, Amazon MQ for Apache ActiveMQ, and RabbitMQ.
Improved auto-scaling for Kafka. After this improvement, the service starts with one consumer. Every minute, it checks the measure of the backlog (the OffsetLag metric) and scales up or down every 3 minutes to reduce the lag. The previous scaling algorithm started with 50% of the maximum number of consumers and later scaled them every 15 minutes.
10GB of ephemeral storage. If you've ever needed more than 512 MB to store temporary data generated inside your function, you're safe now. The service supports up to 10,240 MB of this temporary storage.
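The effect of the Max Batching Window can be approximated with a toy model. It assumes that a batch is flushed either when it reaches the batch size or when the window, counted from the first buffered record, has elapsed. The real service has additional triggers, such as the payload size, that are ignored here, and the function name is hypothetical.

```python
def group_into_invocations(arrival_times, batch_size, max_batching_window):
    """Group record arrival times (seconds, ascending) into invocation batches."""
    batches = []
    current = []
    for arrival in arrival_times:
        # The window is counted from the first record of the open batch.
        if current and arrival - current[0] >= max_batching_window:
            batches.append(current)
            current = []
        current.append(arrival)
        # A full batch triggers the invocation without waiting for the window.
        if len(current) == batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# 300 seconds is the upper bound mentioned above.
print(group_into_invocations([0, 100, 350, 360], batch_size=10,
                             max_batching_window=300))
# [[0, 100], [350, 360]]
```

A longer window means fewer and bigger invocations, at the price of a higher processing latency.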

Security:

aws:PrincipalOrgID support. You can now use the aws:PrincipalOrgID condition key in Lambda function resource-based policies.
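A resource-based policy statement using this condition key could look like the sketch below. The function ARN, the organization ID, and the statement ID are placeholders, and the helper only builds the JSON document without calling AWS.

```python
import json


def org_invoke_statement(function_arn, organization_id,
                         statement_id="AllowOrgInvoke"):
    """Build a statement letting any principal of one organization invoke a function."""
    return {
        "Sid": statement_id,
        "Effect": "Allow",
        "Principal": "*",
        "Action": "lambda:InvokeFunction",
        "Resource": function_arn,
        # Only the principals belonging to this AWS Organization pass the check.
        "Condition": {"StringEquals": {"aws:PrincipalOrgID": organization_id}},
    }


# Placeholder identifiers, used only for illustration.
statement = org_invoke_statement(
    "arn:aws:lambda:eu-west-1:123456789012:function:my-function",
    "o-exampleorgid",
)
policy = {"Version": "2012-10-17", "Statement": [statement]}
print(json.dumps(policy, indent=2))
```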

Ops/Others:

Test events sharing. The developers can now create a test event in the console and share it with their colleagues from the same AWS account. Previously the created events were only available to their authors.
Lambda Function URLs. This new feature simplifies invoking the functions through an HTTPS endpoint. It automatically generates a function URL endpoint in the form of https://${url id}.lambda-url.${AWS region}.on.aws.
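The URL template above translates into a few lines of code. The URL id, the region, and the payload are made up, and the request object is only built, never sent.

```python
import json
import urllib.request


def function_url(url_id, region):
    """Build the endpoint following the documented URL template."""
    return f"https://{url_id}.lambda-url.{region}.on.aws"


def build_invocation(url_id, region, payload):
    """Prepare (without sending) a JSON POST request for a function URL."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        function_url(url_id, region),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical URL id and payload.
request = build_invocation("abc123example", "eu-west-1", {"order_id": 42})
print(request.full_url)
# urllib.request.urlopen(request) would perform the real call.
```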

Macie

New managed data identifiers. The new identifiers help discover and identify the HTTP Basic Authentication Headers, HTTP Cookies, and JSON Web Tokens in S3 objects.

Managed data identifier

The service supports different managed data identifiers. Each of them is designed to detect a specific type of sensitive data, such as credit card numbers or passport numbers.

MSK

Performance:

Storage throughput provision. The option enables scaling the I/O without having to provision additional brokers. The throughput of the storage volumes is limited to 1000 MiB/s.

Security:

Custom configuration providers. If you need to store and use secrets from MSK Connect, you can use the new feature of custom configuration providers. It enables connection with 3rd party secret stores, such as AWS Secrets Manager.

Ops:

CloudFormation support for cluster configuration and SASL/SCRAM secrets.

Redshift

Spectrum:

Custom data validation rules. Spectrum already provides data validation rules to handle invalid records during the query. Previously, these rules had been internal to the system, and the users couldn't overwrite them. Now, customization is possible, and you can decide what to do with invalid data. The invalid data means here things like an unexpected character or numeric overflow in the input.

Data sharing:

General Availability for data sharing. Data sharing across Redshift clusters in different AWS Regions is a Generally Available feature. It's worth noting that the feature is available on all Amazon Redshift RA3 node types.

Data types and language:

PIVOT and UNPIVOT operators. The new operations help transpose rows into columns and vice versa.
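To visualize the transposition, here is a plain Python illustration of both directions. It is not Redshift syntax; the column names and figures are invented, and the aggregation is hard-coded to a sum.

```python
from collections import defaultdict


def pivot(rows, index, column, value):
    """Turn the distinct values of one column into columns, summing the values."""
    result = defaultdict(dict)
    for row in rows:
        cells = result[row[index]]
        cells[row[column]] = cells.get(row[column], 0) + row[value]
    return dict(result)


def unpivot(pivoted, index, column, value):
    """Turn columns back into one row per (index, column) pair."""
    return [
        {index: key, column: name, value: cell}
        for key, cells in pivoted.items()
        for name, cell in cells.items()
    ]


# Invented sample data.
sales = [
    {"region": "EU", "quarter": "Q1", "amount": 10},
    {"region": "EU", "quarter": "Q2", "amount": 15},
    {"region": "US", "quarter": "Q1", "amount": 7},
    {"region": "EU", "quarter": "Q1", "amount": 5},
]
wide = pivot(sales, "region", "quarter", "amount")
print(wide)  # {'EU': {'Q1': 15, 'Q2': 15}, 'US': {'Q1': 7}}
long_rows = unpivot(wide, "region", "quarter", "amount")
```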

I/O:

Unload for SQL query results in JSON. JSON is the new format supported in the UNLOAD command. Previously, you could use CSV, delimited text, and Apache Parquet only.
Streaming Ingestion support for Kinesis Data Streams (KDS). That's great news if you worry about online data ingestion. Previously, you had to use intermediate S3 storage before ingesting the data to Redshift. Now, it's possible to do it directly, from Kinesis to Redshift. The feature is currently in preview.

Security:

Role-Based Access Control (RBAC). Redshift now supports RBAC, so you can grant privileges at a role level without having to apply the same grant operation to the individual users or groups. For example, you could create a read-only role for some schema and grant it to different users instead of granting these read-only permissions individually to each user.
Microsoft Azure Active Directory (AD) integration. Redshift now natively integrates with AD, enabling simplified authentication and authorization with tools like Microsoft Power BI. The feature also lets you use AD to authenticate access to your Redshift database.

Others:

Audit Logging improvements. Two major changes on this. AWS minimized the delivery latency of the logs and also added CloudWatch service as the new log destination.

S3

Security:

AWS Backup General Availability. S3 support in AWS Backup is now Generally Available. You can use this service to create and manage immutable backups of S3 data. Additionally, it also provides a restore capability, including restoring individual objects from the backup vault.
Integrity check acceleration. S3 can validate the integrity of the downloaded or uploaded objects with the checksum provided by the client, or with the checksum calculated dynamically during the process. You can now choose one of 4 checksum algorithms (SHA-1, SHA-256, CRC32, or CRC32C) and accelerate the integrity check by up to 90%.
Account-level block public extension. The public access block configuration defined at the S3 level is now supported in Amazon Lightsail Object Storage.
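For the integrity check, the client-side part can be sketched with the standard library only. The example assumes that the checksum travels as the base64-encoded digest, which is my understanding of the S3 convention; the function names are hypothetical and only two of the four algorithms are covered.

```python
import base64
import hashlib
import zlib


def client_side_checksums(payload):
    """Compute base64-encoded CRC32 and SHA-256 checksums of a payload."""
    crc32_bytes = zlib.crc32(payload).to_bytes(4, "big")
    sha256_bytes = hashlib.sha256(payload).digest()
    return {
        "CRC32": base64.b64encode(crc32_bytes).decode("ascii"),
        "SHA256": base64.b64encode(sha256_bytes).decode("ascii"),
    }


def is_intact(payload, algorithm, expected_checksum):
    """Recompute the checksum after a transfer and compare it with the expected one."""
    return client_side_checksums(payload)[algorithm] == expected_checksum


uploaded = b"hello world"
checksums = client_side_checksums(uploaded)
print(is_intact(b"hello world", "CRC32", checksums["CRC32"]))   # True
print(is_intact(b"hello w0rld", "SHA256", checksums["SHA256"]))  # False
```

CRC32 is much cheaper to compute than SHA-256, which matters for big objects, whereas SHA-256 gives stronger guarantees against intentional tampering.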

MemoryDB

AWS PrivateLink support. You can use AWS PrivateLink to access the Amazon MemoryDB for Redis instance from your VPC. This access method keeps the network traffic inside the AWS network.

Neptune

Increased cluster storage. The graph database service supports up to 128 TiB of storage instead of 64 TiB previously.
Custom ML models and SPARQL query language. You can now bring your own Graph Neural Networks (GNNs) Machine Learning models to Neptune ML. Additionally, the service supports SPARQL on W3C’s Resource Description Framework (RDF) data model for the inference queries on property graphs.

RDS

Oracle:

Query execution plans tracking in Amazon RDS Performance Insights. The new tracked item helps identify query performance degradation caused by an execution plan change.
ALLOW_WEAK_CRYPTO* parameters support. If you want to block older ciphers and algorithms from being used by the SQL*Net encryption and checksum parameters, you can use two new parameters, the SQLNET.ALLOW_WEAK_CRYPTO_CLIENTS and SQLNET.ALLOW_WEAK_CRYPTO.

SQL Server:

Support of SQL Server Analysis Services (SSAS) in Multidimensional mode.
Always On Availability Groups (AGs) for the Multi-AZ configuration in all AWS Regions on Standard Edition.
Server Agent job replication. Any action (creation, modification or deletion) of the Agent job on the primary instance will be automatically synchronized to the secondary instance if the server uses Multi-AZ configuration.

MySQL:

General Availability of the Amazon Web Services (AWS) Java (JDBC) Driver for MySQL. The new driver works for Amazon RDS or Amazon Aurora MySQL-compatible edition database clusters. It has an improved failover implementation relying on the cached cluster topology. It helps reduce the failover time from minutes to seconds.

PostgreSQL:

mysql_fdw support. The extension can connect and retrieve data stored in separate Amazon Aurora MySQL-compatible, MySQL, and MariaDB databases directly from your PostgreSQL instance.

Global:

Fully managed performance monitoring. Amazon Aurora and Amazon RDS support a new fully managed performance monitoring dashboard called RDS Performance Insights. It's an easy way to detect and solve any database performance issues.
Internet Protocol version 6 (IPv6) can be used to access the RDS Service APIs.
Only for PostgreSQL and MySQL: new Multi-AZ deployment option with 2 readable standby database instances across 3 Availability Zones. This deployment mode optimizes the write transactions with the automated failover support and more read capacity.

SNS

Attribute-based access control (ABAC) support. ABAC is an authorization strategy that can manage permissions based on user attributes, such as department, job role, and team name. This feature is now available for SNS API actions including Publish and PublishBatch.

Step Functions

Mocking support in local runtime for debugging. AWS Step Functions Local provides a local environment for debugging and testing state machine workflows. Now it's possible to use it without calling the real services. Instead, the solution can mock these calls.
Extended integration. AWS Step Functions added support for over 20 new AWS SDK integrations (+1000 API actions), including the ML services like AWS Panorama.

Storage Gateway

Volume Shadow Copy Service (VSS) integration. If you use Amazon FSx File Gateway, you can browse and restore versions of files stored on Amazon FSx for Windows File Server file systems with the VSS. The feature provides a self-service recovery for the accidentally deleted or altered files.

QuickSight

New supported time period functions. QuickSight added support for comparative and cumulative time period functions to facilitate datetime-aware comparisons. Comparative functions compare measures at different time periods, e.g. years or quarters. The cumulative functions calculate metrics within a given period-to-date window, for example from the beginning of the year up to a specific date in the timeline.
Rich text formatting on visual titles and subtitles.
Auto-refresh for direct query. The direct query mode automatically refreshes the opened dataset. Recently QuickSight added an auto-refresh for direct query controls. It'll happen every 24 hours.
Native groups management with an interactive user interface. The QuickSight administrators can manage user groups directly from the QuickSight admin console with this feature. It's available in the Enterprise Edition.
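The difference between the comparative and cumulative families can be shown with a small, QuickSight-independent sketch. The function names and the sales figures are invented and they don't correspond to the QuickSight function names.

```python
from datetime import date


def year_total(values_by_day, year):
    """Sum the measure over one calendar year."""
    return sum(v for day, v in values_by_day.items() if day.year == year)


def year_over_year_difference(values_by_day, year):
    """Comparative: the measure of a year minus the measure of the previous one."""
    return year_total(values_by_day, year) - year_total(values_by_day, year - 1)


def year_to_date(values_by_day, as_of):
    """Cumulative: the measure from January 1 up to the given date, inclusive."""
    return sum(
        v for day, v in values_by_day.items()
        if day.year == as_of.year and day <= as_of
    )


# Invented daily sales.
sales = {
    date(2021, 6, 1): 40,
    date(2021, 12, 31): 5,
    date(2022, 1, 10): 10,
    date(2022, 2, 5): 20,
    date(2022, 3, 1): 30,
}
print(year_over_year_difference(sales, 2022))  # 60 - 45 = 15
print(year_to_date(sales, date(2022, 2, 28)))  # 10 + 20 = 30
```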

Azure

Backup

Soft deletes. The soft-deleted backups incur no storage costs. Additionally, the backup policy is not enforced on the data retained in the soft delete state.
Multi-user authorization integration. The operation of disabling security features is now classified as critical, so you can protect it with a Resource Guard.
Protected server deletion protection. A protected server cannot be unregistered if the security features are enabled for the vault and there are associated backup items in the active or soft delete state.
Long term retention for Azure PostgreSQL backup Generally Available.
Lease feature integration on the snapshots. The feature acquires a lease (lock) on the snapshot to protect it against accidental deletion.
Multiple backups per day for Azure Files General Availability. You can take multiple daily backups via backup policy.
Restore health metrics for Azure Blobs in Preview. The feature enables monitoring restore job health, writing custom alerting rules based on these metrics, or routing them to various notification channels.

Batch

Spot Virtual Machines. They are available as single-instance VMs or VM Scale Sets in user-subscription Batch accounts. You can use the Spot instances to reduce the overall cost of your computation.
Simplified compute node communication. It reduces the complexity and scope of inbound and outbound networking connections required in baseline operations. The old communication protocol requires 1 inbound and 2 outbound rules in the Network Security Groups. The simplified mode has only 1 requirement for the outbound rule and 0 for the inbound.

Cache for Redis

Active geo-replication is Generally Available. The feature available in the enterprise tiers reflects all writes to one region automatically in all other linked regions with consistency.

Cosmos DB

Synapse Link:

Azure Synapse Link for Azure Cosmos DB partitioning. You can partition your Cosmos DB data and use Apache Spark partition pruning to reduce the data scanned during the processing.

Security:

Always Encrypted Generally Available. Always Encrypted brings client-side encryption capabilities to Azure Cosmos DB. It allows clients to encrypt sensitive data without revealing the encryption keys to the service.

Misc:

Partition key advisor notebook. Azure shared a notebook that recommends the optimal partition key for the workload. The notebook analyzes new or existing workloads including basic details, query information, and candidate partition keys.
Lower autoscale RU. Previously, the lowest scale range was 400 - 4000 RU/s. Now, this number decreased to 100 - 1000 RU/s.
Unique index feature in Preview. Azure Cosmos DB API for MongoDB can create unique indexes on the non-empty collections.
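Both autoscale ranges quoted above follow the same rule: the lower bound is 10% of the configured maximum. A one-function sketch of that arithmetic, where the function name is hypothetical and the 1000 RU/s floor is taken from the new lowest range:

```python
def autoscale_range(max_ru_per_second):
    """Return the (minimum, maximum) RU/s range for an autoscale maximum."""
    if max_ru_per_second < 1000:
        raise ValueError("The lowest supported autoscale maximum is 1000 RU/s")
    # Autoscale scales between 10% of the maximum and the maximum itself.
    return max_ru_per_second // 10, max_ru_per_second


print(autoscale_range(4000))  # (400, 4000), the previous lowest range
print(autoscale_range(1000))  # (100, 1000), the new lowest range
```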

Data Explorer

Kibana dashboards and visualization on top of Azure Data Explorer are Generally Available. You can connect your Kibana stack to Azure Data Explorer by using an Open Source K2Bridge connector.
Ingestion properties Generally Available. You can enable Ingestion properties to customize the target table of the ingested database. Previously, the target was always the database associated with the data connection.
Azure Data Explorer supported as an application in Azure Active Directory (AAD) Conditional Access. Thanks to this integration you can enforce various condition-based access policies, such as user's location or device platform.
Azure private endpoints Generally Available. The clients from a VNet can now securely connect to Azure Data Explorer without leaving the Microsoft backbone network.

Data Box

Archive tier storage. You can order Data Box and copy the data to the Archive tier in Azure Storage directly.

Data Factory

Databricks

Delta Live Tables. The framework is Generally Available on Azure Databricks. Delta Live Tables replaces Apache Spark-based job definition by a set of processing steps that the service will orchestrate and manage for you.

Functions

Some news for the serverless offering:

Cosmos DB API. Historically, Azure Functions table bindings supported only Storage Account. Recently, Azure added support for the Cosmos DB API, available in the Azure Table extension.

Monitor

Although it's not a pure data service, you might find some changes indirectly related to the data services:

You can export multiple tables to different Event Hubs in the same Event Hub namespace.
One-minute frequency log alerts are Generally Available. It means the service will check the alert conditions every minute.
Public preview for processing logs during the ingestion with KQL.
"Basic logs" are a new flavor of logs to reduce the cost of storing high-volume verbose logs you use for debugging, troubleshooting and auditing, but not for analytics and alerts. The feature is in a Public Preview.
Logs archival for up to 7 years at significant price reduction. The feature is in a Public Preview.
Still in Preview, you can use a Search job to query petabytes of the logs data.
Custom logs API facilitates sending logs data from any REST API client. It's also in Preview.
Export rules are also available for unsupported tables. The exports will automatically work as soon as the tables get supported.
Private virtual network configurations via private links are Generally Available. The new Azure Monitor agent and data collection rules can now upload data via private links only, without accessing the public internet directly.
Grafana integrations with Azure Monitor. You can easily visualize the Azure monitoring data in your Grafana dashboards.

Unsupported tables

Azure Monitor represents collected logs as tables. The supported tables are the ones currently available for exploration and export rules. An example of that type is DatabricksWorkspace, which is an audit log for Databricks Workspace.

The unsupported tables, albeit coming from real Azure resources, are not ready yet for querying. For example, a Perf table stores performance counters from Windows and Linux agents. However, only Windows data can be used in the export rules.

Key Vault

Although it's not a pure data service, it has an important quotas update:

Increased service limits. The subscription wide limit and per vault limit doubled. For example, the secrets, managed storage account keys, and vault transactions doubled from 2000 to 4000 per region.

Purview

Localization is Generally Available. You can now choose one of 18 languages to navigate through Purview pages.
Assets certification. Data stewards can endorse resources from the Purview data catalog to indicate their readiness to be used in the organization.
Workflows. It's a new process to manage the changes. Purview has 2 types of workflows, the Data catalog for CUD (create, update, delete) operations for glossary terms and the Data governance for data policy, access governance, and loss prevention. For example, instead of deleting a term in the business glossary directly, the workflow might generate an approval request that has to be accepted to realize the action. The feature is currently in preview.
Glossary terms in the search results. Purview data catalog search now includes the glossary terms in the search results. The feature is Generally Available.

SQL Database

Hyperscale:

Preview for auto-failover groups. They simplify failover management by including the ability to fail over the entire group of databases and to maintain the same read/write and read-only endpoints after failover.
Private access support. It's now possible to connect the Hyperscale (Citus) nodes to the Azure Virtual Network securely and privately with Private Link.
Zone redundancy in preview.
Reverse migration support. If you migrated your SQL Database into the Hyperscale tier, you can now make the opposite operation.

PostgreSQL:

General Availability of the timescaleDB extension to provide time-series functionality on top of PostgreSQL.

SQL Managed Instance:

T-SQL queries support on Azure Data Lake Gen 2 and Azure Blob files.
Near real-time database replication between SQL Server and SQL Managed Instance, implemented with the link feature.
Notifications are now sent up to 24 hours before planned events with the Azure SQL Managed instance.
Messaging and queuing support with Exchange Service broker. The broker is an internal queueing system created with CREATE QUEUE and CREATE SERVICE statements.
Import-Export feature can now use Private Link. It avoids opening too broad privileges for all Azure services.
Azure Active Directory (Azure AD) enables Windows Authentication access to Azure SQL Managed Instance. It facilitates migrating existing services to the cloud.

SQL Server on VM:

Storage configuration from the SQL Server for Azure Virtual Machine blade in the Azure portal available for SQL Server deployed using Azure Marketplace.
Deployment enhancements. You can now configure multiple settings at deployment time, such as tempdb, the default collation, or the min/max server memory.
Automated backups improvements. Their Storage Account retention increased from 30 to 90 days. It's also possible to select a specific container per instance.

Security:

An RSA key stored in Azure Key Vault Managed HSM can now be used for customer-managed Transparent Data Encryption (TDE). The feature is Generally Available.
Additionally, User-Assigned Managed Identity support for TDE BYOK is in Preview.
New backup storage redundancy options, the Zone and Local, are available with Azure SQL Hyperscale Database.
You can monitor Database Restore progress at a more granular level, instead of the previous 0-50-100% thresholds.
Azure Active Directory (Azure AD) server principals (logins) are currently in Public Preview for Azure SQL Database.
Backup history management view is available in Azure SQL Database.

Misc:

Serverless Azure SQL Database elastic pools support zone redundant configuration.

Storage Account

Monitoring:

General Availability for Azure Monitor Diagnostic settings for Azure Storage. You can use detailed information about successful and failed requests to the storage service to better diagnose the storage service issues.

Misc:

Retirement date for classic storage accounts. The classic storage accounts must be migrated to Azure Resource Manager by August 31, 2024.
Access time tracking for objects in Azure Data Lake Storage Gen2. The feature stores the last access time and facilitates the lifecycle management policies based on this attribute.

Table Storage:

Azure Active Directory (Azure AD) support Generally Available. You can use Azure Active Directory (Azure AD) to authorize requests for Azure Table Storage. Therefore, you can rely on the standard role-based access control to manage the permissions.

Stream Analytics

Azure Machine Learning (ML) integration. The User-Defined Functions of Azure ML are Generally Available in Stream Analytics. Put another way, you can invoke your Machine Learning models directly from the streaming service.
User-assigned managed identity. Stream Analytics can use it as an authentication mode for input and output connections that support Azure AD authentication. There is no need to store the connection credentials.

Synapse

New intelligent cache for Apache Spark in Azure Synapse. Unlike the vanilla Spark cache, the new Synapse cache mechanism not only automatically stores each read block to avoid reloading it from ADLS Gen2, but also detects underlying file changes and refreshes the cache to provide the most recent data.

GCP

BigQuery

Two major announcements for BigQuery:

Analytics Hub available in Preview. This new service in BigQuery lets you create and share analytics assets within and across organizations.
BigLake available in Preview. It's a new storage engine unifying cross-cloud data lakes and warehouses. It also supports fine-grained access controls to the tables with the column and row-based security.

SQL:

JSON data type is in Preview to store JSON data.
WITH RECURSIVE is in Preview. The operator allows a query in a WITH clause to refer to itself or the subsequent queries from the WITH clause.
INFORMATION_SCHEMA.JOBS_* and INFORMATION_SCHEMA.RESERVATION* views are available in Preview for BigQuery Omni.
The INFORMATION_SCHEMA.STREAMING_TIMELINE_* is Generally Available. The views contain aggregated streaming statistics for the project, folder, or organization.
The QUALIFY clause is Generally Available. It can be used to filter on the results of analytic (window) functions.
Search indexes can be created for BigQuery and called with the SEARCH function. The feature helps find elements in unstructured text and semi-structured data formats.
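
To make the WITH RECURSIVE item from the list above more concrete, here is a minimal sketch of a self-referencing query. It's only an illustration: it runs against an in-memory SQLite database through Python's standard sqlite3 module, not against BigQuery. The counter CTE and its n column are hypothetical names, and the BigQuery syntax may differ in details.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for a SQL engine
# supporting recursive common table expressions.
conn = sqlite3.connect(":memory:")

# The "counter" CTE refers to itself: the first SELECT is the starting row,
# the second SELECT is re-evaluated on the rows produced so far
# until the WHERE condition stops producing new rows.
query = """
WITH RECURSIVE counter(n) AS (
  SELECT 1
  UNION ALL
  SELECT n + 1 FROM counter WHERE n < 5
)
SELECT n FROM counter
"""

numbers = [row[0] for row in conn.execute(query)]
print(numbers)
```

The same pattern is typically used to walk hierarchies (for example, an employee-manager tree) without knowing the depth in advance.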

Other features:

Materialized views without aggregation and materialized views with inner joins are Generally Available.
The BigQuery migration assessment is in Preview. You can use it to assess the complexity of migrating the current data warehouse to BigQuery.
Table clone is in Preview. A table clone is a lightweight and cost-optimal writable copy of the main table. You pay only for the storage of the data that is different from the source table.
Remote functions available in Preview. You can now call Cloud Functions from BigQuery, as if they were User-Defined Functions.
Cross-cloud transfer supported for BigQuery Omni is in Preview.
In July 2022, the projects.list API will return results in unsorted order.

Cloud Composer

Cloud Composer 2:

The environments with a user-managed service account correctly use the SA to get the Cloud Composer images and export workload metrics.
Maintenance operations are less disruptive for tasks that take less than 25 minutes. Cloud Composer waits until they finish before starting the maintenance operation.
Customer Managed Encryption Keys (CMEK) are now supported in Cloud Composer 2.

Bug fixes for:

0 workers for an environment after a failed upgrade.
Task logs not being exported to Cloud Logging.
Log levels in Cloud Composer 2 environments. It includes fixing a problem of logging some info messages as errors during environment operations.
Unhealthy web server not started in Cloud Composer 2.
"Environment health" and "Worker Pod eviction" metrics occasionally not reporting new time-series points.
Failures when creating environments with Private Service Connect in a Shared VPC configuration.
Deployment or insufficient quota errors generated when creating an environment now fail the operation immediately.

Others:

Logs in Cloud Console are Generally Available.
Snapshots are in Preview. The feature saves and loads the environment state.
Support for CMEK encryption using keys from Cloud External Key Manager.
Logs from SQL proxy are now correctly passed to the customer project in environments with enabled Private Service Connect support.
Environment labels propagated to the environment's bucket.

Cloud Functions

Secret Manager connection Generally Available. Thanks to this feature, you don't need to worry about storing the secrets. Cloud Functions can easily connect to Secret Manager.
Cloud Functions 2nd gen is in Preview. The next-generation FaaS offering brings a more powerful infrastructure, control over performance and scalability, and over 90 event triggers.
Terraform support for Cloud Functions 2nd gen.
Google-managed Artifact Registry supported by Cloud Functions (1st gen) to store function images in addition to the customer-managed Artifact Registry.
Serverless VPC Access connectors in Shared VPC are Generally Available. You can use this method to connect the function to the resources deployed in a shared VPC network.

Cloud SQL

SQL Server:

Cross-region replication is Generally Available.
In-place upgrades are in Preview.

PostgreSQL:

New flags are available, mostly to configure the parallelism: max_parallel_maintenance_workers, max_parallel_workers, and max_parallel_workers_per_gather, plus max_pred_locks_per_transaction for predicate locks.
Two other new flags, wal_receiver_timeout and wal_sender_timeout, configure the receiver and sender timeouts for Write Ahead Logs.
Query Insights supports query sampling rate configuration.
In-place major version upgrades are in Preview.

Global:

It's possible to select an allocated IP range for clones and replicas created from a primary instance using a private IP address.
Tags are supported on Cloud SQL instances. They can help to define a fine-grained access control.
Key Access Justifications (KAJ) is Generally Available. The feature provides the reason for each Cloud EKM request. You can also automatically approve or deny these requests.
Customer-managed encryption key (CMEK) organization policy constraints are in Preview. The 2 new constraints define the resources requiring the usage of CMEK and also the projects authorized to use Cloud KMS keys to validate requests.

Cloud Storage

Security:

storage.multipartUploads permission included in the Storage Object Admin IAM role.
New organizational constraints. You can now use the restrict authentication types constraint to control the authentication types allowed to access GCS resources. Additionally, you can define constraints for the Customer-managed encryption key (CMEK) and, for example, require all objects in GCS to be encrypted.

Others:

Pricing changes planned for October 1, 2022. Among the changes you'll find an increased "Always Free" usage from 1GB to 100GB for egress, and increased or decreased storage costs, depending on the tier used.
Dual-region storage supports 2 regions within the same continent. The feature consists of replicating the writes made on the bucket in one region to the same bucket located in a different geographical area.

Data Catalog

Adding rich-text overview and data stewards to data entries is in Preview.
Public tags in Preview. They have less strict access permissions than the private tags. The user with a view permission for a data entry can also access all the tags associated with it.
Dataplex integration. Data Catalog can now catalog and search data entries from Dataplex lakes, zones, tables, and filesets.
An additional schema and column tags section available in the Data Catalog table details page.
Integration with Analytics Hub in Public Preview.

Data Fusion

Cluster reuse is Generally Available. The feature avoids recreating the Dataproc cluster for each run. It's no longer required to set the system.profile.properties.clusterReuseEnabled property to enable this feature.
Predefined autoscaling in Preview. In addition to the custom autoscaling policy, you can use the one predefined by Data Fusion.
Flow control in Preview. The feature defines a threshold for the number of outstanding start requests in the service.
Fetch Size supported for RDBMS and Teradata batch data sources.
Max idle time property for Dataproc has a default value of 30 minutes. The configuration will remove the cluster if it has been idle for longer than this value.
Pipeline Studio limit for data preview is now 5000 records.
Pagination and ordering added to the Lifecycle Microservices List applications API endpoint.
Limit in the published lineage messages to avoid OOM errors due to large lineages.

Data Loss Prevention

New detectors and connections:

New infoType detector. The service got a new infoType detector for the South Africa ID number.
BigQuery data profiler Generally Available. The profiler is a managed offering scanning all the data in the organization and providing a general awareness of the type of the stored data.

Dataflow

Full support for custom IAM roles. You can create a custom role and assign it to a user-managed service account. The feature facilitates fine-grained access control.
Cloud Profiler for Dataflow is Generally Available. The tool monitors pipeline performance.
24 new Google-provided templates for Pub/Sub, GCS, and BigQuery as sources and destinations.
Support for Go in Apache Beam SDK. The feature is in Preview.
Dataflow Runner V2 is Generally Available for all languages of the SDK. The new runner improves scalability, generality, extensibility, and efficiency by moving from a language-specific to a service-oriented architecture. The new system includes a more efficient and portable worker architecture packaged together with the Shuffle Service and Streaming Engine.

Dataplex

Dataplex is Generally Available. The service helps to centrally manage, monitor, and govern the data across different storage systems (data lakes, data warehouses).
Dataplex source and sink are available in Cloud Data Fusion. The feature is in Alpha.
Dataplex Explore is available in Preview. The component provides a fully-managed and serverless data exploration environment with Apache Spark SQL and Jupyter notebooks.

Dataproc

Dataproc Serverless is Generally Available. With this new runtime environment you can execute your Dataproc jobs without having to manage the cluster.
Dataproc on GKE is Generally Available. You can run your jobs directly on the Kubernetes cluster instead of YARN.
Autoscaler limit. If a scale-down involves more than one thousand nodes, each scaling action removes at most one thousand nodes at a time.
Cloud Monitoring has a new operation metric, the cluster_type.
Enabled the Resource Manager UI and HA capable UIs in HA cluster mode.
Some bug fixes. The first solves the problem of JARs added in the --jar flag of gcloud dataproc jobs submit spark-sql missing at runtime. The second solves the issue that delayed Dataproc marking a job as canceled.

Firestore

VPC Service Controls in Preview. VPC Service Controls is a GCP solution to mitigate data exfiltration risk.

IAM

Some interesting changes for this security service:

You can now set an expiry time for all newly created service account keys in your project, folder, or organization. This feature is in Preview. To use this feature, request access to the Preview release.
You can now use deny policies to prevent principals from using certain permissions, regardless of the roles they're granted. This feature is in Preview.
IAM Conditions now provides resource attributes for Cloud SQL backup sets. You can use these resource attributes to grant access to a subset of your Cloud SQL resources.
IAM Conditions now provides resource attributes for Apigee X. You can use these resource attributes to grant access to a subset of your Apigee X resources.

Pub/Sub

Exactly-once delivery. This new Preview feature enables exactly-once delivery for the messages sharing the same message_id attribute.

Spanner

Performance:

CPU utilization metrics provide grouping by task priority (low, medium, or high).
The database processes groups of similar statements in DML batches more efficiently. Batch writes should perform better under certain conditions after this change.
Cloud Spanner statistics related to transactions, reads, queries, and lock contentions in Cloud Monitoring can be aggregated.
The retention period for transactions, reads, queries, and lock contentions metrics at one-minute intervals increased from 6 hours to 6 weeks.

Other features:

Data type of the COLUMN_DEFAULT column changed from BYTES to STRING.
Query statistics apply to the DML statements (insert, update, delete).
Committed use discounts are available if you commit to use Cloud Spanner compute capacity for at least one year.
Google Cloud Console supports view management.
A non-key table column can now have a default value when the insert or update statement doesn't set it explicitly. Use the DEFAULT keyword in the schema definition to set this value.
It's now possible to export a subset of tables to GCS in Apache Avro format.

Storage Transfer Service

Preview support for moving data between two filesystems and keeping them in sync on a regular schedule. It's a fully managed solution to migrate from a self-managed filesystem to Filestore.
POSIX attributes and symlinks preservation in the migration between POSIX filesystems is in Preview.
Agent pools are Generally Available. You can use them to create an isolated group of agents as sources or sinks in the transfer job. They open the possibility to perform data transfers concurrently, without having to create multiple projects.
For some scenarios, transfers using Storage Transfer Service will not result in GCS charges. Check the pricing page.
You can now opt for preserving metadata while transferring data between GCS buckets.
Cloud Client Libraries are a recommended way for accessing Cloud APIs programmatically. They're now supported by Storage Transfer Service.
Better control for overwriting the existing files. You can set one of the 3 values in the new overwriteWhen option to never overwrite the file (NEVER), overwrite only files with different ETags and checksum (DIFFERENT), or always write the new file (ALWAYS).
A new roles/storagetransfer.transferAgent predefined role is available to simplify permission assignment to transfer agents.
Managing data transfers with the gcloud CLI is Generally Available.
Resource Location Restriction enforced by the service. The policy defines the regions in which transfer jobs and other location-based GCP resources can be created.
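
The three overwriteWhen values listed above (NEVER, DIFFERENT, ALWAYS) can be read as a small decision rule. The sketch below is only my interpretation of that rule, not the service's implementation. The should_write function and the (ETag, checksum) tuple are hypothetical names introduced for the illustration. I'm also assuming that a file missing in the sink is always written, and that a file counts as "different" when either the ETag or the checksum differs.

```python
from enum import Enum
from typing import Optional, Tuple

# Hypothetical representation of a file's identity: (etag, checksum).
Fingerprint = Tuple[str, str]


class OverwriteWhen(Enum):
    NEVER = "NEVER"
    DIFFERENT = "DIFFERENT"
    ALWAYS = "ALWAYS"


def should_write(mode: OverwriteWhen,
                 source: Fingerprint,
                 sink: Optional[Fingerprint]) -> bool:
    """Decide whether the source file should be written to the sink."""
    if sink is None:
        # Assumption: nothing exists in the sink, so nothing gets overwritten.
        return True
    if mode is OverwriteWhen.ALWAYS:
        return True
    if mode is OverwriteWhen.NEVER:
        return False
    # DIFFERENT: overwrite only if the ETag or the checksum changed (assumption).
    return source != sink


existing = ("etag-1", "crc-1")
changed = ("etag-2", "crc-2")
for mode in OverwriteWhen:
    print(mode.value, should_write(mode, changed, existing))
```

With an unchanged file, only ALWAYS would trigger a write, which is why DIFFERENT looks like a sensible option for recurring synchronization jobs.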

The data innovation on the cloud is in progress. My top news items are Apache Iceberg in AWS, Kubernetes support on EMR, new data services (Dataplex, BigLake, Analytics Hub) on GCP, optimized runtime environments (Dataflow Runner V2, Dataproc Serverless), and integrated ML capabilities (Stream Analytics). What are yours?
