It's time for another part of "What's new on the cloud for data engineers". Let's see what happened in the last 3 months.
This 10th part covers all that happened between 10.03.2023 and 27.05.2023. As previously, I highlighted the most interesting news.
- Dynamic filtering and new predicate push down optimizations available for the connectors of the data sources other than S3. The new features should improve the performance and reduce the cost.
- Support for views in external data sources, such as RDBMS, cloud object stores, or streaming sources.
- Amazon VPC IPv6 endpoints support for inbound connections. They complement the previously available connectivity for the public IPv6 endpoints.
- A new configuration to ensure the minimum encryption level for the query results within a workgroup is available.
- I/O-Optimized Amazon Aurora. It's a new Aurora configuration available and ready for applications with high I/Os.
- Graviton3 based R7g instance family supported for MySQL and PostgreSQL.
- Better availability for read replicas in Amazon Aurora PostgreSQL. They remain available through writer node restarts.
- Support for EC2-based SAP HANA databases for backup and restore.
- Support for restoring resources with tags copied from protected resources. You can use tag-based policies for access, cost allocation, compliance, and automation workflows for the restored resources.
- Ephemeral storage for the jobs running on AWS Fargate becomes configurable. It's now possible to extend it from the default 20GiB to 200GiB if needed.
- Support for user-defined pod labels for jobs running on Amazon Elastic Kubernetes Service (Amazon EKS) clusters.
- A new customizable dashboard on the console is available to simplify the resources monitoring.
- The service supports copying data from Azure Blob Storage to AWS Storage services such as S3. The feature is currently in preview.
Database Migration Service
- It's now possible to generate an AWS Glue Data Catalog for the files migrated to an Amazon S3 bucket.
- Support for S3 data validation. The feature compares all rows in the input with all rows copied to S3 and reports any mismatches.
- An Open Source ODBC driver for the service is available. You can use it to connect the database to popular BI Tools, including PowerBI.
- The limit of the concurrent table restores per AWS account increased from 4 to 50.
- Cluster Mode configuration support on existing clusters. Previously the feature required recreating existing clusters.
- Data tiering on Graviton2-based R6gd nodes supports auto-scaling.
- Support for customer-level metrics when running interactive Spark workloads via managed endpoints. The feature enables monitoring for the kernel lifecycle operations such as request counts, request latency, request errors and kernel launch failures via CloudWatch.
- Support for defining where the Jupyter Enterprise Gateway pod must be deployed for interactive workflows using the managed endpoints. The feature includes the ability to specify an on-demand instance via a managed or self-managed node group.
- Managed and self-managed node groups support for managed endpoints.
- Self-hosted notebooks support for managed endpoints.
- Vertical autoscaling. The feature automatically tunes the memory and CPU resources of EMR Spark Applications to adapt them to the real workload.
- Fine-grained access controls with AWS Lake Formation and Apache Hive on Amazon EMR. You can now apply AWS Lake Formation-based table and column level permissions on Amazon S3 data lake for write operations with Apache Hive jobs.
- Job-level billed resources to improve the cost management. The new exposed metrics are vCPU-hours, memoryGB-hours, and storageGB-hours consumed by a completed EMR serverless job.
- Support for Amazon EC2 C7g (Graviton3) instances.
- Enhanced error details to simplify the troubleshooting.
- Support for automatic partition index creation.
- Custom JDBC drivers support for data discovery.
- 10 new visual transforms, including Concatenate, Split string, Array to columns, Add current timestamp, Pivot rows to columns, Unpivot columns to rows, Lookup, Explode, Derived column, and Autobalance processing.
- Added new Redshift capabilities, such as browsing Redshift tables directly, supporting native Redshift SQL, executing common operations (drop, truncate, upsert, create or merge) while writing to Amazon Redshift.
- Continuous logs for Job Monitoring to track the job while it's running.
- Simplified permissions setup on the console, including default role for new AWS Glue jobs and notebooks.
- Monitoring of account-limited Glue resources in CloudWatch. Among the items you'll find the number of Glue workflows, triggers, jobs, concurrent job runs, Blueprints, and number of Interactive Sessions.
- AWS Glue G.4X and G.8X instance types are Generally Available. They provide the most Data Processing Units (DPU).
- New connectors are available: Microsoft SharePoint Cloud, Confluence Server Connector, SharePoint OnPrem, Confluence Cloud, Microsoft OneDrive, Gmail, Adobe Experience Manager On-Premise, Adobe Experience Manager Cloud, Alfresco Enterprise, Alfresco PaaS.
- Featured Results, where administrators can define search queries and associate a set of featured documents with each query.
- Content-based query suggestions.
- Client-side timestamps support. One of the use cases is the conflict resolution in case of many concurrent writes.
- Support for the IN operator in SELECT queries.
- Support for auto-generated document ID for Amazon OpenSearch Service destination. You don't need to define a document id in the ingestion process. Instead, you can rely on the fully automated generation by OpenSearch.
- Extended data cataloging, data sharing and fine-grained access control support for customers using a self-managed Apache Hive Metastore (HMS) as their data catalog.
- Response payload streaming. Functions can stream responses to clients, including payloads exceeding 6MB.
- AWS X-Ray tracing support for SnapStart-enabled functions.
- Multi-VPC private connectivity and cross-account access support.
- CloudFormation support for Neptune Serverless.
- Simplified cluster creation in the console with 3 pre-defined modes (Production, Dev/Test, and Demo).
- IAM Authentication support.
- R6i instances are available for the service.
- Graph summary API that exposes metadata about the property graphs (PG) and Resource Description Framework (RDF) graphs.
- Slow Query Logs support to identify slow queries requiring performance tuning.
- New observability features such as log patterns, metrics analytics and support for Jaeger traces.
- Amazon OpenSearch Ingestion availability. It's a fully managed data ingestion tier that allows you to ingest and process petabyte-scale data before indexing it.
- RDS Custom for SQL Server supports Multi-AZ deployments.
- Support for up to 15 read replicas and inbound replication, for Multi-AZ deployment option with 2 readable standby database instances.
- Optimized Reads with up to 2X faster queries. The feature relies on placing temporary tables generated by PostgreSQL on the local NVMe-based SSD block-level storage, thereby reducing your traffic to Elastic Block Storage (EBS) over the network.
- pgvector support for simplified ML model integration.
- Rust is supported to create User-Defined Functions.
- Support for up to 15 read replicas for Multi-AZ deployment option with 2 readable standby database instances.
- RDS events include tags to enable filtering and routing scenarios.
- AWS Graviton3-based M7g and R7g database instances are Generally Available.
- Performance Insights consolidates information from Amazon CloudWatch and Amazon RDS Performance Insights to provide a comprehensive view of your database’s health.
- String query performance improved by 5x to 63x. The feature comes with enabling the Automatic Table Optimization (ATO) flag.
- MERGE SQL command is Generally Available.
- Dynamic Data Masking is Generally Available.
- Centralized access control for data sharing with AWS Lake Formation.
- Auto-commit statements support in stored procedures.
- Simplified private connectivity from on-premises networks with Virtual Private Cloud (VPC) interface endpoints for Amazon S3.
- Two security best practices applied by default to all new buckets, namely the S3 Block Public Access enabled and S3 access control lists (ACLs) disabled.
- Cross-account support for Multi-Region Access Points.
- A new OperationFailedReplication metric available on CloudWatch to simplify diagnosing S3 Replication issues.
- Support for setting content-type request headers for HTTP/S notifications.
- Faster automatic deletion of unconfirmed subscriptions.
- Support for unloading data to S3 via the UNLOAD statement.
- "Hide collapsed columns" control for Pivot table.
- Row-Level Security tags with OR condition support.
- Ingestion schedule APIs and Incremental Refresh Configuration APIs for data ingestion.
- State Persistence and Bookmarks support for embedded dashboards.
- VPC Connections support via public APIs with Multi-AZ.
- Dataset Parameters can help optimize slicing and dicing operations.
- New scatterplot options, such as None aggregate.
- Common Sub-expression Elimination feature that can improve the SPICE performance.
- QuickSight dashboards available for seller reporting and insights in AWS Marketplace.
- Immutable vaults for Azure Backup are Generally Available.
- Azure VMs with Ultra disks support in Public Preview.
- Azure Backup Reports includes new workloads (Azure Database for PostgreSQL Servers, Azure Blobs and Azure Disks).
- Private preview support for confidential VMs using Customer-Managed Keys.
- Soft-delete of Recovery Points is in Public Preview.
- Retirements for:
- Select Batch Pool autoscale service-defined variables, planned on 31 March 2024.
- Azure Batch classic compute node communication model planned on 31 March 2026.
- Batch Service in select regions planned on 31 March 2026.
Cache for Redis
- JSON support for Active Geo-Replication on the Enterprise and Enterprise Flash tiers in Preview.
- Customer-managed encryption key for Enterprise tier in Preview.
- 99th percentile latency metric in Public Preview.
- Azure Container Apps support for jobs to run serverless containers that perform a task and exit when complete. The feature is in Public Preview.
- General availability for the document expiration based on any field of the document.
- Vector search is in Public Preview. The feature simplifies vector similarity searches.
- General availability of the REST APIs that you can use to manage your Cosmos DB for PostgreSQL instances.
- Cluster compute start and stop to save time and costs is Generally Available.
- Data encryption with Customer-Managed Keys is in Public Preview.
- Materialized views are in Public Preview.
- Public Preview for Change Data Capture on top of the analytical store. As the feature is based on the analytical store, it doesn't consume provisioned RUs and doesn't affect the transactional workloads, providing lower latency with lower TCO.
- Improved restore process from backup that doesn't require moving data between accounts.
- Burst capacity is Generally Available. The feature uses the database or container's idle throughput capacity to handle temporary, future spikes of traffic.
- General Availability of the transformations on Log Analytics. They can reduce the volume of logged data and hence, the costs.
- More storage in a single container for the Azure Cosmos DB serverless. It accepts up to 1 TB.
- All versions and deletes change feed mode in Public Preview. It returns all the changes that occurred to items, even if they happened between change feed reads. Additionally, it covers physical deletes (the alternative latest mode only supports soft deletes).
- Hierarchical partition keys are Generally Available. The feature exposes up to 3 keys to subpartition the data and enable more optimal data distribution.
- OpenTelemetry and Application Insights integration for .NET and Java SDKs.
- Support for ingesting data from .NET Applications via the Serilog sink.
- New geospatial functions (geo_point_buffer, geo_line_buffer, and geo_polygon_buffer) are Generally Available.
- Several changes for Oracle, including a Service Pack of 4 extensions covering end-to-end shallow and deep assessments, right-sizing of the Azure target, code conversion, remediation planning, and near real-time data migration to Azure.
- Offline and online migration extension for Oracle is in Public Preview.
- Serverless SQL is Generally Available. BI users and SQL workloads can now leverage this instant elastic compute feature.
- AMD Confidential VM option for driver and worker nodes. This type of node provides hardware-based encryption generated by the underlying chipset and inaccessible to Azure operators.
- MQTT protocol support in Public Preview.
- HTTP pull-based message delivery in Public Preview. It complements the existing push delivery.
- Dedicated self-serve scalable clusters for mission critical workloads are Generally Available.
- Support for Apache Kafka MirrorMaker 2 for data replication between on-premises clusters and Azure Event Hubs.
- JSON Schema support in Schema Registry for Kafka applications is in Public Preview.
- Kafka Connect integration is Generally Available.
- Support for Managed Identities in Event Hubs Capture.
- Apache Kafka compaction support is Generally Available.
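To make the compaction item above more concrete, here is a minimal Python sketch of the general log compaction semantics known from Apache Kafka: only the latest record per key survives, and a record with an empty value (a tombstone) removes the key. The `compact` function and the in-memory list are illustrative assumptions, not an Event Hubs or Kafka API, and the real brokers compact in the background on log segments rather than in one pass.

```python
from typing import Optional

# A record is a (key, value) pair; a None value plays the role of a tombstone.
Record = tuple[str, Optional[str]]


def compact(log: list[Record]) -> list[Record]:
    """Keep only the latest record for each key, preserving log order.

    Keys whose latest record is a tombstone (value None) are dropped.
    """
    latest_offset: dict[str, int] = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later offsets overwrite earlier ones

    return [
        (key, value)
        for offset, (key, value) in enumerate(log)
        if latest_offset[key] == offset and value is not None
    ]


if __name__ == "__main__":
    log = [("a", "1"), ("b", "1"), ("a", "2"), ("b", None), ("c", "5")]
    print(compact(log))
```

In this example the key `a` keeps only its second value, `b` disappears because of the tombstone, and `c` is untouched.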
Microsoft Fabric is a new end-to-end, unified analytics service on Azure. It integrates other Azure technologies, including Azure Data Factory, Azure Synapse Analytics, and Power BI, into a single unified product. It has 7 different workloads that you can use for various use cases, such as real-time analytics, data orchestration, or data science.
- Durable Functions support for managed identity of Azure Storage is Generally Available.
- Target-based scaling is Generally Available. The feature provides a faster and more intuitive scaling model for the Service Bus Queues and Topics, Storage Queues, Event Hubs, and Cosmos DB extensions.
- General Availability for SQL Bindings.
- Public preview of Azure Functions on Container Apps Environment for simplified integration of the serverless functions to the cloud-native microservices workloads.
- Alert rules duplication is Generally Available.
- Azure Metrics Dataplane API is in Public Preview. The new API exposes among others a batch API endpoint to retrieve metric data for multiple data points.
- Configuration autocompletion and a reduced list of signals for the alerts configuration are now Generally Available.
- Managed services for Prometheus are now Generally Available.
- Microsoft Purview DevOps policies for Azure SQL Database are Generally Available.
- Elastic pools, a new shared resource model, are in Public preview.
- Increased high availability for the instance with zone redundancy.
- pg_hint_plan and semver_extensions are new extensions supported in the Flexible Server.
- 6 new metrics, covering among others active connections, idle connections, total pooled connections, and the number of connection pools, are available for better PgBouncer monitoring.
- 5 new burstable SKUs are Generally Available. They provide a low-cost solution for flexible CPU usage to accommodate workloads with fluctuating usage patterns.
- Query performance insight is in Public Preview.
- Read replicas are Generally Available.
- The Single Server will be retired by 28 March 2025.
- Public preview of database-is-alive metric to monitor the database availability status.
SQL Managed Instance:
- Link feature for hybrid connectivity with SQL Server 2016 and 2019.
- Create External Table As Select command support. It exports data from local database tables into Parquet and CSV files located in Azure Storage and references them from external tables.
- Connector for PowerApps, Logic Apps and Power Platform is in Public Preview.
SQL Server on VM:
- SQL Server connector’s V1 actions and triggers will be retired on 31 March 2024.
- Auditing configuration using managed identity.
- Database-level transparent data encryption with Customer-managed keys.
- Cross-tenant transparent data encryption with Customer-managed keys.
- Azure Private Link for Azure SQL Managed Instance is in Preview.
- Azure SQL Database offline migrations in Azure SQL Migration extension are Generally Available.
- DBCC SHRINKDATABASE operation to shrink the size of the data and log files in the specified database.
- End for Gen 4 hardware support.
- General Availability of the Encryption scopes for REST, HDFS, NFSv3 and SFTP protocols in an Azure Blob / Data Lake Gen2 storage account. With the feature you can apply the encryption either at the container level (as the default scope for blobs in that container) or at the blob level.
- Cross-region service endpoints providing secure and direct connectivity to Azure services over an optimized route over the Azure backbone network are Generally Available.
- Public preview of Azure Cold Storage, a new pricing tier positioned between cool and archive, with a 90-day early deletion policy.
- The new service is Generally Available. It covers the data migration scenarios from on-premises file shares to Storage Account.
- Managed private endpoint support for connecting a Stream Analytics to Data Explorer.
- Integration with Application Insights for telemetry data processing.
- Dynamic Blob container name is in Public Preview. With the feature you can customize container names and simplify data partitioning based on data characteristics.
- Azure Stream Analytics integration with Event Hub Schema Registry is in Public Preview.
- Exactly-once delivery to ADLS Gen2 is Generally Available.
- Exactly-once delivery to Event Hub is Generally Available.
- Apache Kafka adapters for reading and writing are in Private Preview.
- Good FinOps news. There will be a new pricing model with up to 80% cost reduction.
- Multi-Column Distribution for Dedicated SQL pools is Generally Available. It allows you to distribute data on multiple columns to balance the data distribution in your tables and reduce data movement during query execution.
- Time-travel for Azure Synapse Link for Cosmos DB is in Public Preview. With the feature you can access past versions of your Cosmos DB databases.
- Autoscaling slots are Generally Available. They are also part of 3 new BigQuery editions: Standard, Enterprise and Enterprise Plus.
- Interactive and batch queue timeouts in default configurations for query queues. The feature is in preview.
- INFORMATION_SCHEMA.MATERIALIZED_VIEW view and enhanced job statistics for better materialized views usage monitoring are in preview.
- Partitioning and clustering recommender feature is in preview.
- DML statements are no longer included in the number of table or partitioned table modifications per day.
- Faster propagation of the preferred tables for BI Engine reservations. It now takes up to 10 seconds compared to the 5 minutes previously. It's Generally Available.
- The CREATE TABLE AS SELECT and INSERT INTO SELECT statements support filtering data from files stored on Amazon S3 and Azure Blob Storage before they're transferred into BigQuery tables. The feature is in preview.
- Cache support for query results from table snapshots.
- Maximum result size limit of 20 GiB logical bytes for queries executed against Azure or Amazon S3 data is GA.
- BigLake and non-BigLake external tables support GCS custom dual-regions.
- Table cloning is Generally Available.
- EXTERNAL_QUERY SQL pushdown applying to SELECT * FROM T queries is Generally Available.
- BI Engine Top Tables Cached Bytes, BI Engine Query Fallback Count, and Query Execution Count can be viewed as dashboard metrics for BigQuery.
- Object tables are Generally Available.
- New ML functions, such as: ML.DECODE_IMAGE, ML.CONVERT_COLOR_SPACE, ML.CONVERT_IMAGE_TYPE, ML.RESIZE_IMAGE, ML.DISTANCE, ML.LP_NORM.
- Non-incremental materialized views support extra SQL queries in preview. The list contains among others, the OUTER JOIN, UNION, and HAVING clauses, as well as analytic functions.
- Rounding mode setting support. The mode can be defined as ROUND_HALF_EVEN or ROUND_HALF_AWAY_FROM_ZERO for parameterized NUMERIC or BIGNUMERIC columns.
- JSON data type mapping is Generally Available for Cloud Spanner federated queries.
- CREATE VIEW or ALTER COLUMN support adding descriptions to the columns of a view. The feature is in preview.
- Dynamic data masking update to allow masking on RECORD columns that have been set to REPEATED mode.
- Preview for differential privacy. It includes 4 privacy aggregate functions (AVG, COUNT, SUM, and PERCENTILE_CONT) that you can use to anonymize the data.
- The VPC Service Controls perimeter also protects the BigQuery Reservation API.
- A new Lineage tab in the table properties to track the data movements and transformations through BigQuery is Generally Available.
- Support for translation configurations in the BigQuery Interactive SQL Translator.
- Unicode column naming support.
- Change Data Capture support relying on the BigQuery Storage Write API in preview.
- Support for tf_version and xgboost_version training options.
- Import of model artifacts saved in ONNX, XGBoost, and TensorFlow Lite formats for inference.
- Inference support for models stored remotely on Vertex AI Prediction.
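The two rounding modes mentioned in the list above for parameterized NUMERIC or BIGNUMERIC columns differ only in how they treat exact ties. The sketch below illustrates the difference with Python's `decimal` module rather than BigQuery itself. The `round_numeric` helper is a hypothetical name, and mapping ROUND_HALF_AWAY_FROM_ZERO to Python's `ROUND_HALF_UP` is an assumption based on the `decimal` documentation, which describes that mode as rounding ties away from zero.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# Assumed mapping from the BigQuery mode names to Python's decimal modes.
MODES = {
    "ROUND_HALF_EVEN": ROUND_HALF_EVEN,
    "ROUND_HALF_AWAY_FROM_ZERO": ROUND_HALF_UP,
}


def round_numeric(value: str, scale: int, mode: str) -> Decimal:
    """Round a decimal literal to `scale` fractional digits with the given mode."""
    quantum = Decimal(10) ** -scale  # e.g. scale=1 gives Decimal('0.1')
    return Decimal(value).quantize(quantum, rounding=MODES[mode])


if __name__ == "__main__":
    for literal in ("2.25", "2.35", "-2.25"):
        even = round_numeric(literal, 1, "ROUND_HALF_EVEN")
        away = round_numeric(literal, 1, "ROUND_HALF_AWAY_FROM_ZERO")
        print(f"{literal}: half even = {even}, half away from zero = {away}")
```

For 2.25 rounded to one fractional digit, half even gives 2.2 (the even neighbor) while half away from zero gives 2.3. For 2.35 both modes agree on 2.4.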
BigQuery Transfer Service
- Support for the new Google Ads API is Generally Available.
Cloud Composer 2:
- Celery logs split into stdout/stderr supported with the [logging]celery_stdout_stderr_separation option.
- API for Highly resilient environments is available.
- Per-folder Roles Registration correctly reassigns permissions for a re-added DAG file.
- Access Approval is Generally Available.
- Support for access with external identities through workforce identity federation.
- BigQuery deferrable mode tasks shouldn't fail anymore with the data lineage enabled.
- Fixed a problem when a worker scheduled for deletion started a new task before it was deleted.
- 5 new metrics are available in Cloud Monitoring.
- Support for customer-managed encryption keys for 2nd gen functions is in Preview.
- Exponential backoff retry with a minimum backoff of 10 seconds and a maximum backoff of 600 seconds for the newly created 1st gen functions for Pub/Sub subscription.
- Preview support for accepting requests from the Shared VPC network that a function is connected to, including when the ingress configuration is "Internal" or "Internal and Cloud Load Balancing".
- Uppercase letters and underscores are supported in the 2nd gen function names.
- New deployments can be restricted by the generation (1st or 2nd).
- Support for 2nd gen Firestore triggers through Eventarc is in Preview.
- SqlPackage and bcp utilities support for importing and exporting data.
- 38 new metrics are available.
- Linked Servers support to integrate data from multiple sources and distribute queries.
- Active Directory Diagnosis tool to support Active Directory integration and simplify the troubleshooting.
- Point-in-time recovery and read replica features can be used on the same primary instance.
- Simultaneous multithreading (SMT) can be disabled to reduce the licensing fees.
- Over 150 new database flags are supported, including innodb_flush_log_at_trx_commit and sync_binlog that impact the SLA.
- New extensions, view, utilities, and flags are Generally Available, including postgresql_anonymizer, pgtt, and pg_dumpall.
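The exponential backoff retry mentioned earlier in this list for 1st gen functions (minimum backoff of 10 seconds, maximum of 600 seconds) can be pictured with a short Python sketch. Only the two bounds come from the announcement. The doubling factor, the absence of jitter, and the `backoff_delays` helper are assumptions made for illustration, not the documented Pub/Sub behavior.

```python
def backoff_delays(
    attempts: int,
    minimum: float = 10.0,   # minimum backoff from the announcement
    maximum: float = 600.0,  # maximum backoff from the announcement
    factor: float = 2.0,     # assumed growth factor, not stated in the source
) -> list[float]:
    """Return the wait time in seconds before each retry attempt."""
    delays = []
    delay = minimum
    for _ in range(attempts):
        delays.append(min(delay, maximum))
        delay = min(delay * factor, maximum)  # grow, but never past the cap
    return delays


if __name__ == "__main__":
    print(backoff_delays(8))
```

With these assumptions the waits grow as 10, 20, 40, 80, 160 and 320 seconds, then stay capped at 600 seconds for every later retry.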
PostgreSQL and MySQL:
- Cascading Replicas feature is Generally Available when migrating from external servers.
- Proxy Operator is Generally Available.
- You can now leverage Fast migration for Cloud SQL to improve the performance of migrating data from an external data source.
- Improved monitoring dashboard including metrics, bucket-based filtering, and customization, is Generally Available.
- The Storage insights inventory providing a metadata overview for all objects in a bucket is Generally Available.
- Cloud Storage FUSE is in Preview. The feature lets you mount and access GCS buckets as local file systems.
- Pricing changes are effective from April 1.
- Autoclass doesn't manage objects smaller than 128KiB.
- Custom object metadata support for the final request of the resumable upload with the X-Goog-Meta header.
- Custom audit logging is available in Preview.
Data Loss Prevention
New detectors and connections:
- MARITAL_STATUS is available in all regions.
- STREET_ADDRESS becomes the default detection model.
- COUNTRY_DEMOGRAPHIC identifying countries used for place of birth, residency, or citizenship is available in all regions.
- The discovery service can generate Data sensitivity and Data risk finding types in Security Command Center.
- Support for profiling up to 25 tables at no additional charge. It applies to tables smaller than or equal to 1TB.
- Support for assigning a sensitivity level to built-in or custom infoTypes.
- Vertical autoscaling available for batch jobs.
- Cost monitoring with the estimated job cost is available in preview.
- Automatic Model Refresh support for Dataflow ML. It enables updating the ML models without stopping the jobs.
- Auto data quality (AutoDQ) and data profiling supported for any BigQuery table, including those that aren't part of the Dataplex lake.
- AutoDQ and data profiling support BigQuery views, BigLake tables, and BigQuery external tables.
- AutoDQ and data profiling support sampling the data. It can reduce the service cost.
- Cross-project service account support.
- Details for the Autoscaler recommendations are available in Cloud Logging logs.
- Customer-Managed Encryption Key organization policy is Generally Available.
- Key Access Justifications adds a reason for each key usage request. It's now Generally Available.
- BigQuery destination is Generally Available.
- PostgreSQL source is Generally Available.
- Support for backfilling of PostgreSQL tables of any size.
- Multi-tenant Oracle architecture support.
- General Availability support for count() queries.
- Eventarc events and Firestore events for Cloud Functions (2nd gen) available in preview.
- The limit of 500 writes passed to a Commit operation doesn't exist anymore.
- OR queries are available in Preview.
- Workforce identity federation support for the browser-based sign-in is Generally Available.
- Workforce identity federation and workload identity federation both accept SAML assertions.
- General availability for schema updates.
- Support for NUMERIC and BIGNUMERIC data types in BigQuery subscriptions.
- Generated column as the primary key support.
- Database deletion protection is in Preview.
- Automatic increase of the degree of parallelism on a query when the instance size allows it.
- Logging the processing duration for read and write requests in Cloud Audit Logs.
- Sampled query plans are available in Preview.
- Google Cloud tags support for grouping and organizing Spanner instances and for conditional IAM policies.
- Increased number of indexes per table from 32 to 128.
- THEN RETURN and RETURNING clauses support is Generally Available.
- New functions in the GoogleSQL dialect: ARRAY_INCLUDES_ALL, ARRAY_INCLUDES_ANY, ARRAY_MIN, ARRAY_MAX.
- New query capabilities for PostgreSQL dialect databases: parameterized LIMIT and OFFSET operations, statement hints such as optimizer_version and optimizer_statistics_package, and set operations with ORDER BY, LIMIT, or OFFSET in subqueries.
Storage Transfer Service
- Published IP ranges from which the service makes requests to AWS or Azure storage resources. You can use them to restrict the access by IP for AWS and Azure data sources.
- Optional preservation of UID, GID and mode metadata for folders for the transfers between file systems.
- General Availability for the transfers from S3-compatible storage to GCS. The feature supports Multipart upload and List Object V2 which simplifies the integration between GCS and applications written for the S3 API.
- Manifest support is Generally Available. The Manifest file specifies a list of objects, object versions, and files to transfer.
Data Fabric is the most impactful change on the list. A unified set of services to simplify the cloud data stack sounds great! Besides, there are other smaller but still interesting changes, such as vertical autoscaling for EMR on Kubernetes, Kafka Connect GA in Event Hubs, or GA lineage and CDC support in BigQuery!