Have you missed any cloud data engineering-related news in the last 3 months? No worries, I've got you covered with the new part of the "What's new on the cloud for data engineers..." series.
This 9th part covers all that happened between 17.12.2022 and 10.03.2023. You'll see that despite the Christmas period, the data services got a lot of exciting updates! As previously, I highlighted the most interesting news.
AWS
Athena
- GCS support. It's now possible to use Athena to query data stored on Google Cloud Storage!
- Delta Lake format enhancements. This update simplifies reading Delta Lake tables. Previously, the setup required generating additional metadata files. Now, Athena can read Delta Lake table files directly.
Aurora
- Microsoft Active Directory (AD) authentication for the MySQL-Compatible Edition. The authentication is supported via the AWS Directory Service for Microsoft Active Directory or an on-premises Active Directory by establishing a trusted domain relationship.
- Custom maintenance windows support for Aurora Serverless v1. If your cluster requires a version upgrade or any other maintenance operation, you can now define a time window when this operation can be performed by AWS.
Batch
- Improved monitoring for terminated or canceled jobs. When you terminate or cancel a job, the corresponding flag (isTerminated, isCancelled) is available in the job payload throughout the queue.
CloudWatch
- Cross-account Metric Streams support. The feature enables metrics consolidation to a single data observability endpoint that might be located in a different AWS account.
Database Migration Service
- Target recommendations support in the Fleet Advisor component. It provides recommendations for the on-premise-to-AWS migration scenarios by gathering the performance metrics and usage patterns.
DocumentDB
- Some major improvements related to the MongoDB 5.0 upgrade:
- Client-side field-level encryption.
- 128 TiB storage limit, double the previous limit.
- Support for the $dateAdd and $dateSubtract MongoDB API operators on date fields.
DynamoDB
- Table deletion protection. The feature protects against accidental table deletion. It can be enabled/disabled by authorized administrators with adapted IAM policies.
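To make the deletion protection item more concrete, here is a minimal sketch of how the flag could be toggled from Python. It assumes the `DeletionProtectionEnabled` parameter of the DynamoDB UpdateTable API; the table name `orders` and both helper functions are hypothetical, and the client is passed in so the request can be inspected without calling AWS.

```python
from typing import Any, Dict, Optional


def deletion_protection_request(table_name: str, enabled: bool = True) -> Dict[str, Any]:
    """Build the keyword arguments for a DynamoDB UpdateTable call."""
    return {"TableName": table_name, "DeletionProtectionEnabled": enabled}


def set_deletion_protection(
    table_name: str, enabled: bool = True, client: Optional[Any] = None
) -> Dict[str, Any]:
    """Send the request if a client is given and return it either way.

    With boto3 installed, the client would typically come from
    boto3.client("dynamodb"); that part is an assumption of this sketch.
    """
    request = deletion_protection_request(table_name, enabled)
    if client is not None:
        client.update_table(**request)
    return request


if __name__ == "__main__":
    # "orders" is a hypothetical table name used only for illustration.
    print(set_deletion_protection("orders"))
```

Keeping the request builder separate from the call makes it easy to review what would be sent before touching a real table.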
ElastiCache
- Encryption in transit can now be enabled for existing clusters.
- Enhanced I/O multiplexing. This feature can provide significant improvement for the throughput-bound workloads with multiple client connections.
- The 99.99% SLA for ElastiCache for Redis. The new availability concerns the Multi-Availability Zone configuration. The previous SLA was 99.9%.
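The move from 99.9% to 99.99% looks small on paper, so here is a quick back-of-the-envelope calculation of what it means in allowed downtime. This is only a sketch: it assumes a 30-day month, and the way AWS actually measures availability for the SLA may differ.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability SLA over a period."""
    total_minutes = days * 24 * 60
    return round(total_minutes * (100 - sla_percent) / 100, 2)


if __name__ == "__main__":
    for sla in (99.9, 99.99):
        print(f"{sla}% -> {allowed_downtime_minutes(sla)} minutes per 30 days")
```

Under these assumptions the budget shrinks from roughly 43 minutes to a bit more than 4 minutes per month.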
EMR
Serverless:
- Custom images support. EMR Serverless supports custom images that can be stored in Amazon Elastic Container Registry (ECR).
- Account-level vCPU-based service quota introduction. The feature improves cost management. It sets the maximum number of aggregate vCPUs your applications are able to scale up to within a Region.
- HIPAA, HITRUST, SOC, and PCI DSS workloads support.
- Large worker sizes. The limits increased from 4 vCPUs / 30 GB to 16 vCPUs / 120 GB, which makes it possible to process compute- or memory-intensive workloads.
- Application log encryption support with Customer Managed Keys.
Others:
- Improved launch time in private subnets. New clusters should start up to 30% faster now.
Glue
Crawlers:
- Lake Formation integration. The feature improves permissions management. Crawlers can now use Lake Formation permissions to access the managed tables.
Streaming:
- Streaming ETL support with AWS Glue 4.0. This new release improves the data observability for duplicated records, MSK authentication (IAM support), and stream-based aggregations with an optimized state store.
Studio:
- 5 new transforms are available. You'll find among them: Flatten, Format timestamp, To timestamp, Add identifier, and Add UUID.
Others:
- Continuous logs in AWS Glue Job Monitoring. The feature exposes consolidated driver and executor logs in CloudWatch.
- Improved permissions setup with a default role for jobs and notebooks.
Kendra
- Intelligent Ranking for self-managed OpenSearch. The new extension leverages Kendra’s ML-powered semantic ranking technology to quickly improve the quality of search results.
- S3 connector with VPC support. The connector can now search content on S3 in secure environments, such as VPC.
- Google Drive Connector release. It's now possible to index and search Google Drive documents.
Kinesis
Data Streams:
- AWS CloudFormation support for DynamoDB streams Global Tables.
- On-Demand write throughput limit increased to 1GB/s.
Firehose:
- Elastic is a new target supported in Firehose.
Lambda
Processing:
- Maximum Concurrency for Amazon SQS event source. You now have better control over concurrent SQS data processing.
- New CloudWatch metrics for asynchronous invocations, namely AsyncEventsReceived, AsyncEventAge and AsyncEventsDropped.
- Amazon DocumentDB change stream is a new supported event source.
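For the SQS maximum concurrency item in the list above, the setting lives on the event source mapping. The sketch below builds the arguments as I understand them for the Lambda CreateEventSourceMapping API (`ScalingConfig` with `MaximumConcurrency`). The 2 to 1000 range check reflects my reading of the documented limits, and the ARN and function name are hypothetical.

```python
from typing import Any, Dict


def sqs_event_source_mapping(
    queue_arn: str, function_name: str, max_concurrency: int
) -> Dict[str, Any]:
    """Build the arguments for an SQS event source mapping with a concurrency cap."""
    # Assumed valid range for MaximumConcurrency; check the current docs.
    if not 2 <= max_concurrency <= 1000:
        raise ValueError("max_concurrency must be between 2 and 1000")
    return {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
    }


if __name__ == "__main__":
    # Hypothetical identifiers, used only for illustration.
    mapping = sqs_event_source_mapping(
        "arn:aws:sqs:eu-west-1:123456789012:orders-queue", "process-orders", 5
    )
    # With boto3: boto3.client("lambda").create_event_source_mapping(**mapping)
    print(mapping)
```

The cap applies per event source, so a noisy queue can be limited without reserving concurrency for the whole function.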
Ops/Others:
- Runtime management control. The new feature gives you control over when Lambda updates functions to a new runtime version.
- Lambda functions code scanning in Amazon Inspector.
Managed Workflows for Apache Airflow
- Amazon MWAA is now PCI DSS compliant.
Neptune
- Low-code users can now explore Amazon Neptune graph-explorer to visualize either labeled property graphs (LPG) or Resource Description Framework (RDF).
Redshift
Serverless:
- Lower base capacity configuration of 8 Redshift Processing Units (RPU). This makes the starting point 4 times lower than before.
Others:
- You can now create up to 200k tables in a single cluster.
- General availability of ROLLUP, CUBE, and GROUPING SETS in the GROUP BY clause.
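If you haven't used these GROUP BY extensions before, ROLLUP and CUBE are essentially shorthand for a list of GROUPING SETS. The sketch below only illustrates that standard SQL expansion in plain Python and does not talk to Redshift; the column names are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def rollup(columns: Sequence[str]) -> List[Tuple[str, ...]]:
    """Expand ROLLUP(a, b, ...) into grouping sets: every prefix, down to the grand total."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube(columns: Sequence[str]) -> List[Tuple[str, ...]]:
    """Expand CUBE(a, b, ...) into grouping sets: every subset of the columns."""
    return [
        combo
        for n in range(len(columns), -1, -1)
        for combo in combinations(columns, n)
    ]


if __name__ == "__main__":
    cols = ["region", "product"]  # hypothetical column names
    print("ROLLUP:", rollup(cols))
    print("CUBE:  ", cube(cols))
```

For two columns, ROLLUP yields three grouping sets and CUBE yields four; the empty tuple stands for the grand total row.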
S3
Storage Lens:
- Tiered pricing for cost-effective monitoring at scale. The new mode can lead to up to 40% savings for the users with hundreds of billions of objects.
Security:
- All new objects are now automatically encrypted with managed server-side encryption (SSE-S3).
Other features:
- Maximum file shares per gateway increased from 10 to 50 for Amazon S3 File Gateway.
Storage Gateway
- A simplified, one-click, file share creation for Amazon S3 File Gateway from the AWS console.
MemoryDB
- Reserved Nodes are available for Amazon MemoryDB for Redis clusters. Using them can lead to significant (up to 55%) savings compared to the on-demand nodes.
- The Availability SLA for Multi-AZ configurations increased from 99.9% to 99.99%.
OpenSearch
- Security Assertion Markup Language (SAML) authentication can now be enabled during domain creation. SAML enables users to integrate directly with identity providers (IDPs) such as Okta, Ping Identity, OneLogin, Auth0, Active Directory Federation Services (ADFS) and Azure Active Directory.
- Service software updates can now be scheduled during off-peak hours to reduce the risk of service disruption during the updates.
RDS
Oracle:
- New cipher suites for OEM Agent and SSL option.
SQL Server:
- Custom Engine Version supported to improve resiliency of customizations. The feature consists of creating a base golden image from an Amazon Machine Image (AMI) with the required Windows operating system (OS) and database customizations.
PostgreSQL:
- The seg extension is now supported.
MariaDB:
- RDS Optimized Writes feature (x2 higher writing throughput) is supported on MariaDB.
MySQL + PostgreSQL:
- Read replicas supported for Amazon RDS on AWS Outposts.
Global:
- Support for renaming and restoring database snapshots from Multi-AZ with two readable standbys to simplify deployments.
- Integration with AWS Secrets Manager for master user password.
- Support of the new SSL/TLS certificates and certificate controls.
- Support for increased storage size for read replicas and database restoration from snapshots.
Snow Family
- Support for software updates on AWS Snowcone. It's no longer necessary to send the device back to install updates. Users can install them on their own.
- AWS Snowcone and AWS Snowball Edge devices support Instance Metadata Service Version 2 (IMDSv2).
Step Functions
- The service integrated the SDK of 35 new services, including EMR Serverless.
Timestream
- A new batch load feature to ingest CSV files from S3 is available.
QuickSight
- Pivot table improvements: only the visible fields are loaded, and the maximum number of fields has increased.
- A new chart type called Radar is available. You can use it to visualize multivariate data.
- Role-based access control to data sources connecting to S3 and Athena is available.
Azure
Backup
- Vaulted backups for Azure Files. A vault is a logical entity that stores the backups and recovery points created over time.
- Vaulted backups for Azure Blob.
Cache for Redis
- Improved geo-replication is GA. It's now much simpler to initiate a failover than before, with a single action.
Cosmos DB
PostgreSQL
- 4TiB, 8TiB, and 16TiB storage per node is GA.
- Burstable compute for single node configurations is GA. The feature adds new 1-vCore with 2 GiB RAM and 2-vCore with 4 GiB RAM burstable compute options.
Misc:
- Azure Cosmos DB connector V2 for Power BI is in preview. The connector introduces support for a new Direct Query mode that enables real-time reporting on an always-fresh dataset.
Data Explorer
- Apache Log4J2 sink is GA. Thanks to the sink you can stream your logs to Data Explorer.
- Managed ingestion for Cosmos DB is in Preview.
- Dashboards feature is GA. The feature lets you create different dashboards in a single place.
Database Migration
- Simplified migration for Login and Transparent Data Encryption-enabled databases. The new process provides extra assistance for migrating database objects (login, permissions, server roles, ...), and encrypted databases to Azure SQL.
Databricks
- Model Serving is GA. The feature exposes ML models through a REST API running on serverless, fully managed compute resources.
Event Grid
- Tribal Group added as a new partner publisher.
Functions
- Microsoft Netherite and MSSQL are 2 new storage backends for the Azure Durable Functions extension.
- Linux Elastic Premium plan with increased maximum scale-out limits. Depending on the region, the number increased by x2 or x3.
Purview
- Access policies for SQL Server 2022 in preview.
- Data sharing lineage and search for Azure Storage.
SQL Database
Hyperscale:
- Serverless Hyperscale is in preview. The billing is based on the used resources.
PostgreSQL:
- Customer managed keys (CMK) encryption is GA for Flexible Server.
- Azure Active Directory authentication for Flexible Server is GA.
- 12 new metrics are available to monitor the auto vacuum process, including information on dead rows, vacuum cost limit, frequency of auto vacuum, number of tables vacuumed.
- Azure Data Studio has now an Azure PostgreSQL migration extension.
SQL Managed Instance:
- The APPROX_PERCENTILE function is GA.
MySQL:
- Logic Apps and Power Automate integrations are in preview.
- Power BI integration is GA for Flexible Server.
Security:
- Azure Active Directory authentication for SQL Server 2022 is GA.
Misc:
- Automatic key rotation is GA.
- TempDB maximum size becomes configurable.
- Optimized locking. The feature helps to reduce lock memory as very few locks are held for large transactions. As a consequence, more users can access the table at the same time.
Storage Account
- A new limit of 5GB for blobs uploaded with the Put Blob action. It's almost 20x more than previously.
- Simplified Storage Account Conversion from non-zonal redundancy (LRS/GRS) to zonal redundancy (ZRS/GZRS).
- General Availability for mounting Azure Storage File as a network share in Windows code (non-container) in App Service.
Stream Analytics
- No-code editor changes, including Delta Lake capture (preview) and Power BI connector (GA).
GCP
BigQuery
Administration/OPS:
- The bq commands support for service account impersonation feature is GA.
- A dataset and tables can be created as case-insensitive. The feature is GA.
IO:
- Autoscaling slot reservation is in preview. With this change you don't need to purchase slot commitments before creating auto scaling reservations. Instead, you must only define the maximum number of slots.
SQL:
- HAVING MAX and HAVING MIN are available for ANY_VALUE function in preview.
- ALTER TABLE RENAME COLUMN and ALTER TABLE DROP COLUMN are GA.
- Primary and foreign key table constraints are available in preview.
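To make the HAVING MAX / HAVING MIN item concrete, here is a small pure-Python emulation of what `ANY_VALUE(x HAVING MAX y)` returns for one group of rows. It is a sketch of my understanding of the semantics, not BigQuery code: the NULL handling is simplified (rows with a NULL in the HAVING column are ignored unless all of them are NULL), and the sample rows are made up.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def any_value_having(
    rows: List[Row], value_col: str, having_col: str, mode: str = "MAX"
) -> Optional[Any]:
    """Return value_col from a row whose having_col is the group's maximum (or minimum)."""
    if mode not in ("MAX", "MIN"):
        raise ValueError("mode must be 'MAX' or 'MIN'")
    if not rows:
        return None
    # Simplified NULL handling: skip None unless every row has None.
    candidates = [r for r in rows if r[having_col] is not None] or rows
    if candidates[0][having_col] is None:
        return candidates[0][value_col]
    pick = max if mode == "MAX" else min
    return pick(candidates, key=lambda r: r[having_col])[value_col]


if __name__ == "__main__":
    # Hypothetical sample data for a single group.
    readings = [
        {"city": "Paris", "temp": 21},
        {"city": "Lisbon", "temp": 29},
        {"city": "Oslo", "temp": 14},
    ]
    print(any_value_having(readings, "city", "temp", "MAX"))
    print(any_value_having(readings, "city", "temp", "MIN"))
```

In SQL terms, the first call corresponds to something like `ANY_VALUE(city HAVING MAX temp)`: it answers "which city had the highest temperature" without a self-join or a window function.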
Security:
- It's now impossible to save query results to Google Drive from projects inside a VPC Service Controls protected perimeter.
- Resource Manager tags can be attached to datasets. The feature enables conditional application of the IAM policies to the resources.
- The Authorized stored procedures feature is in preview. It shares stored procedures without giving access to the underlying tables.
Omni:
- Azure workload identity federation is GA.
Console:
- Two changes in the Explorer pane. The first selects the resources corresponding to the focused tab while the second shows all resources in the searched resource's level.
Other features:
- A new Lineage tab in the table properties to track data transformations through BigQuery.
- Temporary functions are now maintained until the session ends.
- Session statements including the TEMP keyword can also contain REPLACE and IF NOT EXISTS keywords.
- The query_info column in INFORMATION_SCHEMA.JOBS, JOBS_BY_FOLDER and JOBS_BY_ORGANIZATION views exposes the information related to query processing.
BigQuery Transfer Service
- Support for transferring data from Azure Blob Storage to BigQuery is in preview.
BigLake
- The materialized views creation over BigLake metadata cache-enabled tables referencing structured data from GCS is in preview.
Cloud Composer
Bug fixes:
- After an environment's cluster update, the number of active workers shouldn't be reported as 0.
- Intermittent issues with database connections shouldn't happen anymore while upgrading a Private IP environment with VPC peerings to Cloud Composer 2.0.31 and later versions.
Cloud Functions
- Preview availability for the user-specified concurrency and vCPU configuration for 2nd gen functions.
- It's possible now to update a Serverless VPC Access connector, including the instance type and minimum/maximum number of instances.
Cloud SQL
SQL Server:
- Point-in-time recovery to recover an instance to a specific point in time is GA.
- Striped import and striped export are available to reduce the time needed for BAK file operations and for other purposes.
MySQL:
- The lower_case_table_names flag is supported.
PostgreSQL:
- The Write-Ahead-Logs for the instances with the point-in-time recovery enabled are now stored in GCS.
PostgreSQL and MySQL:
- Support for viewing an audit log for an automated backup of an instance. The feature lets you verify the completeness and outcome of the backup.
Global:
- The V2 release for the Auth proxy with improved performance, stability, and telemetry.
- Underprovisioned instance recommender preview version is available for Cloud SQL. It provides recommendations for resizing the instances to better suit the current workload.
Cloud Storage
- The number of tag bindings that can be associated with a storage bucket is now 50.
- Autoclass feature doesn't manage GCS objects smaller than 128KiB. It involves transitioning affected objects to the Standard storage at no cost.
Data Fusion
- Preview availability for the SAP SuccessFactors Batch Source.
Data Loss Prevention
New detectors and connections:
- The current default PERSON_NAME infoType detection model is also used when the infoType.version is set to legacy. The old detection model is no longer available.
- The PORTUGAL_NIB_NUMBER is available in all regions.
- The US_MEDICARE_BENEFICIARY_ID_NUMBER and are available in all regions.
- The SSL_CERTIFICATE infoType detector is available in all regions.
- The VAT_NUMBER infoType detector can identify Belgium VAT numbers.
Others:
- Estimation helps understand the size and shape of the BigQuery data, including table count, data size, and profiling cost.
- The estimated null proportion and estimated uniqueness are 2 metrics available for data profiles generated at the column level.
Dataflow
- Workers running in a different region from the Dataflow regional endpoint are not supported starting with Beam SDK version 2.44.0.
- Support for ES6 syntax for JavaScript User-Defined Functions.
Dataplex
- Preview availability for a business glossary to manage business-related terminologies and definitions.
- Preview availability for the Attribute Store that helps associate attributes with tables and columns.
Dataproc
- Dataproc Hudi Optional Component is GA.
- Dataproc driver node groups feature is GA. It makes it possible to scale the driver resources horizontally for concurrent job execution.
- Hive Metastore OSS metrics are supported if you define hivemetastore in the --metric-sources property during the cluster creation.
- Support for Dataproc Metastore integration with Trino.
Datastream
- New parameters, validate_only and force, are available in the projects.locations.connectionProfiles resource in the Datastream API.
- Added a configuration for the number of maximum concurrent backfill tasks.
Firestore
- The __name__ field is displayed for each composite index definition in the Google Cloud and Firebase consoles.
IAM
- Workforce identity federation is GA. With the feature you can use an external identity provider for authentication and authorization for supported GCP products.
Spanner
- A Kafka connector to publish change stream records to an Apache Kafka topic.
- Status and progress of copy backup long-running operations are displayed in the Google Cloud console.
- The table size statistics feature is GA.
- ALTER INDEX statement supports adding columns into an index or dropping non-key columns from it.
- Autocomplete and syntax validation for DDL statements in the Google Cloud console.
Storage Transfer Service
- Preview support for tracking progress of a Transfer Job in Cloud Monitoring.
- UID, GID, and other metadata can be preserved for transfers involving file systems.
This time I noticed fewer targeted changes. In the previous updates, I had a feeling that the cloud providers were working on a particular topic each time, such as table file formats or streaming. Here, the changes are more varied, but the new autoscaling on BigQuery, the DocumentDB change stream, and cross-cloud connectivity are definitely interesting features. What are yours?