What's new on the cloud for data engineers - part 3 (02-04.2021)

It's time for the 3rd part of the "What's new on the cloud for data engineers" series. This time I will cover the changes between February and April.

AWS

Athena

Good news if you are chasing performance! The EXPLAIN statement in AWS Athena is now Generally Available! Thanks to it, you can analyze the query execution plan and possibly find some interesting hints for tuning the query.
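To make it concrete, here is a minimal sketch that only builds the statement you would submit to Athena. The table and column names are hypothetical, and the actual submission (console, JDBC driver, or SDK) is left out on purpose.

```python
def explain(query: str) -> str:
    """Prefix a SQL statement with EXPLAIN so that the engine returns the plan
    instead of running the query."""
    statement = query.strip().rstrip(";")
    return f"EXPLAIN {statement}"


# Hypothetical table and columns, used only for illustration.
sql = "SELECT user_id, COUNT(*) FROM visits WHERE year = 2021 GROUP BY user_id;"
explained_sql = explain(sql)
print(explained_sql)
```

The returned plan is where you would look for full scans or expensive joins before touching the query itself.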

Batch

The first change is about Service-Linked Roles which are now available for AWS Batch. In other words, AWS published a predefined role that has all required permissions to run batch jobs in the service.

In addition to this IAM change, recent releases also included some more job-specific changes. To start, AWS Batch supports Amazon Elastic File System mounts. What does it mean for you? First, an elastic storage capacity that grows and shrinks with the data you write. Second, storage that you can share with other jobs.

Finally, the job scheduling also changed a lot. AWS Batch is now up to 5x faster when scaling a managed EC2 Compute Environment. The scaling algorithm makes its decisions much faster, increasing the throughput for scale-up and reducing the costs for scale-down operations.

Data Exchange

Before writing this "What's new…", I hadn't heard of Data Exchange and, by the way, discovering new services is another thing I like while preparing this update summary! Thanks to Data Exchange, you, as a subscriber, can easily copy the datasets made available by other people (publishers) to your S3 buckets and start exploring them. No more need to build your own data discovery or billing technology. Recently, AWS added support for setting up export jobs upon subscribing to products. Thanks to that, the subscribers can define the export jobs on a single screen, right when they complete the subscription process.

Database Migration Service

The next service is less mysterious to me because it's there to migrate datasets between various databases. Database Migration Service recently got support for IBM's database called Db2. You can now use it as a source to move the data to any other supported database.

DynamoDB

A great feature for those of you who also have ops responsibilities. NoSQL Workbench supports CloudFormation-managed DynamoDB data models. As a reminder, NoSQL Workbench is a GUI tool that helps you model, visualize and query DynamoDB.

Apart from that, there is a big change related to the Global Secondary Indexes (GSI) managed by AWS Amplify CLI. Amplify is also a new service for me, but this time, it's not a data one. It's a toolset that can facilitate building mobile and front-end apps on top of AWS. Thanks to the recent changes, you can now invoke amplify push and update 1 or more GSI at once!

Elasticsearch Service

Another NoSQL database managed by AWS, Elasticsearch, also got some new features. The first of them is the support for index rollups. The reason? When you work with time-series data, the interest in keeping very granular data decreases with time. The index rollups let you define new indexes for this data and store only the relevant fields, aggregated in a coarser time bucket.
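The idea behind a rollup can be sketched in a few lines of plain Python. It's not the Elasticsearch rollup API, only an illustration of the principle with hypothetical sensor documents: fine-grained records are aggregated into hourly buckets and the irrelevant fields are dropped.

```python
from collections import defaultdict
from datetime import datetime


def rollup_hourly(docs):
    """Aggregate fine-grained documents into one summary per sensor and hour."""
    buckets = defaultdict(list)
    for doc in docs:
        ts = datetime.fromisoformat(doc["timestamp"])
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(doc["sensor"], hour.isoformat())].append(doc["temperature"])
    return [
        {
            "sensor": sensor,
            "hour": hour,
            "avg_temperature": sum(values) / len(values),
            "max_temperature": max(values),
            "doc_count": len(values),
        }
        for (sensor, hour), values in sorted(buckets.items())
    ]


# Hypothetical raw documents; "firmware" is a field we don't want to keep.
raw_docs = [
    {"timestamp": "2021-04-01T10:05:00", "sensor": "s1", "temperature": 20.0, "firmware": "1.2"},
    {"timestamp": "2021-04-01T10:35:00", "sensor": "s1", "temperature": 22.0, "firmware": "1.2"},
    {"timestamp": "2021-04-01T11:10:00", "sensor": "s1", "temperature": 25.0, "firmware": "1.2"},
]
rolled_up = rollup_hourly(raw_docs)
for summary in rolled_up:
    print(summary)
```

Three raw documents become two summaries here, and the saving grows with the granularity of the source data.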

Elasticsearch Service also became more event-driven since the last "What's new on…" blog post. In the list of published events, you will find the ones related to the service software updates (start, availability, completion). In addition to these new events, you can now debug distributed applications more easily thanks to a feature called Trace Analytics. Compatible with the Open Source OpenTelemetry Collector, it supports various AWS sources like CloudWatch or X-Ray.

Another feature is called "Reporting" and it's addressed to Kibana users who can now generate and download reports from the Dashboard, Visualize and Discover panels.

EventBridge

There are two changes related to the EventBridge service. As a reminder, EventBridge is a serverless event bus service very often used to route various events to different endpoints. Starting from March, you can push the events to any HTTP endpoint. All you have to do is set up the authorization context and define the target URL.

Also, you can include an X-Ray trace context in your requests to view the end-to-end flow of events through your applications, like AWS Lambda or EC2. Among the supported EventBridge targets, you will find Amazon SQS, SNS, API Gateway, AWS Lambda, and AWS Step Functions.

EMR

On the EMR side, the big announcement is the General Availability of EMR Studio, a fully managed Jupyter Notebooks environment for collaborative work between data engineers and data scientists.

Apart from that, the instance fleets are now supported in AWS GovCloud regions. Instance fleets define the instance types that can be used in the auto-scaling to meet the cluster compute power expectations.

And by the way, April's 5.33 release of the service supports 10 new instance types! All of them are built on the Nitro System, which is the new EC2 virtualization platform providing better performance and price, and enhanced security.

Glue

But EMR is not the most updated AWS service. It's Glue! If you check the list of updates, you could easily write a separate blog post. However, I will try to summarize them below.

To start, Glue jobs can update the Glue Data Catalog at runtime. It simplifies the previous workflow, which required executing a Glue Crawler on top of the processed data. The second change, which also makes the work much easier, is the dynamic schema inference for the data stored on S3. It integrates pretty well with the Glue Studio interface for interactive authoring of the jobs.

By the way, you can now write your processing logic with SQL in Glue Studio! No more need to know Apache Spark, even though it can be helpful to write more complex logic in Glue Jobs that also got some new features!

To start, you can use Glue Workflows and orchestrate complex data integration workflows much more easily with the help of the preview feature called custom blueprints. It looks similar to Dataflow templates because you define your job and make some parts configurable. For example, you could use the blueprints for a repetitive job converting CSV files to Parquet files, where the single variable is the input path.

In addition to the blueprints, you can also extend your Glue Streaming jobs and read data produced in Kinesis Data Streams living in a different AWS account!

Finally, if you use the ML part of Glue in your jobs, you should notice 2 new features. In the first one, the FindMatches ML transform exposes how much each column in the dataset contributed to determining whether the records matched. The second one is a completely new ML transform called Fill Missing Values, and you can use it to fill missing values in the column(s) you specify. The transform learns patterns from the complete rows to predict and set the missing ones.

To finish this section about Glue, let's focus on Glue DataBrew that got some new features. There are so many changes compared to the previously presented components that the best way to present them is a list:

Kinesis Data Analytics

Small changes for Kinesis Data Analytics. The March release announced Python 3.7 support for Apache Flink 1.11.

Kinesis Data Streams

Starting from April, you can configure your DynamoDB to Kinesis Data Streams synchronization with CloudFormation templates! Thanks to that, all DynamoDB operations will be available in Kinesis Data Streams and therefore, you will be able to process them with Kinesis-based streaming services like Firehose or Data Analytics.
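As an illustration, a minimal sketch of such a template, built as a Python dictionary and rendered to JSON. The resource names (OrdersStream, OrdersTable) and the key schema are hypothetical; the part that matters is the KinesisStreamSpecification property pointing to the stream ARN.

```python
import json

# Hypothetical logical names and key schema, for illustration only.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "OrdersStream": {
            "Type": "AWS::Kinesis::Stream",
            "Properties": {"ShardCount": 1},
        },
        "OrdersTable": {
            "Type": "AWS::DynamoDB::Table",
            "Properties": {
                "BillingMode": "PAY_PER_REQUEST",
                "AttributeDefinitions": [
                    {"AttributeName": "order_id", "AttributeType": "S"}
                ],
                "KeySchema": [{"AttributeName": "order_id", "KeyType": "HASH"}],
                # Sends the item-level changes of the table to the stream.
                "KinesisStreamSpecification": {
                    "StreamArn": {"Fn::GetAtt": ["OrdersStream", "Arn"]}
                },
            },
        },
    },
}

rendered = json.dumps(template, indent=2)
print(rendered)
```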

Lambda

For AWS Lambda, Trusted Advisor supports 4 extra checks:

And if you use the AWS console, you will certainly notice some visual changes making the console more user-friendly (less scrolling, dedicated config tab)!

Lookout for Metrics

Lookout is another new service on the list. In March, it became Generally Available. If you are an AWS user, you can use it to detect anomalies or unexpected changes in your metrics. The service uses ML, and thanks to the implemented logic, it can help proactively monitor the health of your resources and detect the issues earlier.

Managed Streaming for Apache Kafka (MSK)

An important change for those of you who use MSK in the context of IoT devices. Starting from March, you can connect to MSK with the help of a username and password. For enhanced security, you can store them in AWS Secrets Manager.

RDS

You will also find some interesting features in RDS. For PostgreSQL, you can now use version 13 of the database. In addition, it also supports the pg_bigm extension, which can speed up full-text search in languages that require multi-byte character sets, such as Japanese, Chinese, and Korean.

SQL Server has a better availability guarantee thanks to the Always On Availability Groups. When you use this feature, the service will create the primary node in one zone and a stand-by replica node in another zone of the same region. In case of service disruption, it will fail over automatically. The second SQL Server feature added recently is the support for Extended Events. You can use them to capture debugging and troubleshooting information about your server.

For MySQL and MariaDB, when you set up the read replicas, you can define the replication filters, which will help to eliminate the databases and tables that shouldn't be replicated.

Redshift

After RDS, it's time to see what's new in the data warehouse service. To start, 3 Redshift features reached General Availability. The first of them is data sharing, which you can use to share data in read-only mode across different Redshift clusters. The second one is somewhat related to data sharing because it lets you query different databases, regardless of which database you are connected to. Both features are supported on RA3 clusters. That's not the case for the last GA change, the native console integration with 3rd party applications. Put another way, you can integrate data from AWS partners like Salesforce, Google Analytics, Facebook Ads, Slack, JIRA, or Splunk directly into Redshift.

Data sharing, apart from becoming Generally Available, also got an update for the paused producer clusters. A producer cluster is the one that writes the shared data. Starting from April, you can pause this cluster and still be able to share its data with other consumers.

Among other changes, you can notice the availability of better cold query performance in new regions. When you write a SQL query, Redshift compiles it into machine code and distributes it to the cluster nodes. However, when you create a new cluster or upgrade an existing one, the queries have to be compiled again, which penalizes their first ("cold") execution. Thanks to the feature, the query is compiled by a more efficient serverless compilation service and stored in an unlimited cache, which increases cache hits for the compiled objects from 99.60% to 99.95%!
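The jump from 99.60% to 99.95% looks tiny, but what counts are the misses, i.e. the queries that still have to be compiled. A quick back-of-the-envelope computation, assuming a sample of 10,000 queries:

```python
QUERIES = 10_000  # arbitrary sample size, chosen to get whole numbers

hits_before = 9_960  # 99.60% of 10,000
hits_after = 9_995   # 99.95% of 10,000

misses_before = QUERIES - hits_before
misses_after = QUERIES - hits_after
reduction_factor = misses_before // misses_after

print(f"{misses_before} -> {misses_after} compilations per {QUERIES:,} queries "
      f"({reduction_factor}x fewer)")
```

In other words, 8 times fewer queries pay the compilation cost.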

And to finish, Redshift Query Editor also has some improvements. It supports clusters with enhanced VPC routing, and queries running for up to 24 hours. It can also seamlessly connect to AWS Secrets Manager to get the connection and authentication information securely.

S3

Even though S3 is one of the oldest AWS services, it's still evolving! The first piece of news is the support for AWS PrivateLink. Thanks to it, you can access S3 via a private endpoint within your private network. In other words, the communication between your applications and S3 doesn't leave the Amazon network. It's a great privacy improvement, announced as Generally Available in February!

Also, starting from February you can delete tags of multiple S3 objects with a single batch request. It can be a great way to reduce the communication overhead between your application and the S3 service because now you can perform the same delete tag action with fewer requests.

Moreover, you can also transform your data at the data source level. Thanks to this exciting feature, you can send an S3 GET request and include a data transformation logic that the service will apply before returning the object data to your application! In other words, you can implement smart things like a predicate pushdown that will filter out the rows not matching the predicate. Or you can also perform some data reduction to limit the size of the response if you only need a small part of it. Your transformation will be executed as a Lambda function.
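To give an idea of the transformation logic, here is a minimal sketch of a row filter for a CSV object. It's only the pure transformation part, with hypothetical column names. The Lambda wiring (fetching the original object and returning the transformed response to S3) is deliberately left out.

```python
import csv
import io


def filter_csv(body: str, column: str, expected: str) -> str:
    """Keep the header and only the rows where `column` equals `expected`."""
    reader = csv.DictReader(io.StringIO(body))
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[column] == expected:
            writer.writerow(row)
    return output.getvalue()


# Hypothetical object content, for illustration only.
original_object = "id,country,amount\n1,FR,10\n2,PL,20\n3,FR,30\n"
filtered_object = filter_csv(original_object, "country", "FR")
print(filtered_object)
```

The application asking for the object receives only the matching rows, so less data travels over the network.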

To finish, there are 2 changes related to the capacity and pricing. First, PUT requests to Glacier are cheaper, which can reduce the charges by 40%! Second, S3 on Outposts, an S3 object store running on-premises, offers a 26 TB storage tier.

Step Functions

To finish the AWS part, Step Functions. The first feature lets you define EMR on EKS job workflows. You can then build fully serverless orchestration workflows that deploy your Apache Spark jobs to EKS in response to a schedule or an event (e.g., S3 object upload).

In addition to this EMR-related change, Step Functions supports a data flow simulator in the console. Thanks to it, you can simulate and verify the input and output of each state of your state machine before pushing a possible mistake to production.

Azure

CosmosDB

Let's start the Azure updates with CosmosDB. The first great news is the GA of composite indexes that can not only return the results faster but also consume fewer Request Units! You can use them for queries with 2 or more columns in the ORDER BY clause, in the WHERE clause, or in aggregates. According to the benchmark presented on the Azure blog, you can observe up to an 80x performance improvement for complex queries involving composite indexes in the predicate and sorting parts of the query!

Among the preview features, you can try a new role-based access control (RBAC) with Azure AD. You can then use CosmosDB SQL API permissions like item creation, or item replacement to create custom roles and assign them to Azure AD principals.

In addition to that, you can also improve your data loss prevention strategy with the continuous backup and its point-in-time restore feature. Thanks to it, you can restore the data to any point in time within the past 30 days.

Data Explorer

Just a quick note to say that you should notice a performance improvement starting from March 17. I didn't find what changed, though.

Data Lake Storage

Regarding Data Lake Storage, if you miss the HDFS append feature, you should be happy now because it's available in limited preview! It's a good fit for files receiving data continuously, like hourly logs. To test this new feature, you will need to fill in a form because the preview is limited.

Databricks

Regarding Databricks, in the last months, the Power BI connector became Generally Available. Thanks to it, you should get the results much faster (optimized ODBC connector), easier (simplified connection configuration, Azure AD support for authentication), and from new places (Data Lake via DirectQuery mode).

Event Hubs

With Databricks, you can process data coming from Event Hubs, which also recently got a few changes. To start, it supports disconnected scenarios via the integration with Azure Stack Hub. Azure Stack Hub is a solution providing a way to run the applications based on Azure services on-premises or at an edge location. Now, you can also run Event Hubs in this mode! Keep in mind, though, that the disconnected Event Hubs doesn't support the Azure AD authentication and Event Hubs Capture.

Functions

But Databricks is not the only way to process Event Hubs data. You can also use Azure Functions which, starting from March, supports Durable Functions in Python. What does it mean? It means that you can write serverless and stateful functions - 2 years ago I would have said that it's an oxymoron - in Python!

You will find the list of supported stateful patterns in the Azure Documentation but as a data engineer, you will probably either use the fan-in/fan-out to dispatch the data or aggregator to build stateful entities.
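If the fan-out/fan-in pattern doesn't ring a bell, here is a minimal sketch of the idea with the standard library only. It's not the Durable Functions API, just an illustration with hypothetical data: the work is dispatched to parallel workers (fan-out) and the partial results are combined once all of them have finished (fan-in).

```python
from concurrent.futures import ThreadPoolExecutor


def count_lines(chunk: str) -> int:
    """The unit of work executed for every chunk (the 'activity')."""
    return len(chunk.splitlines())


def fan_out_fan_in(chunks):
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Fan-out: every chunk is processed by a separate task.
        partial_counts = list(pool.map(count_lines, chunks))
    # Fan-in: the partial results are aggregated into a single value.
    return sum(partial_counts)


# Hypothetical input chunks, for illustration only.
chunks = ["a\nb\nc", "d\ne", "f"]
total_lines = fan_out_fan_in(chunks)
print(total_lines)
```

With Durable Functions, the orchestrator plays the role of fan_out_fan_in and the runtime persists its state between the steps.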

HDInsight

I would have started this part with "but you can also use something else than Azure Functions to process the data" but I don't want to be repetitive. Well, I just did it anyway, so if you process your data with HDInsight, you can now use a feature called Apache Kafka REST proxy.

As the name indicates, it lets you interact with Apache Kafka via a REST API. So if you are using a consumer or producer written with a language not supported by the Apache Kafka SDK, you can simply communicate thanks to the HTTP protocol!

Purview

After the changes related to the compute part of Azure, it's a good moment to see what happened in the data governance service called Purview. To start, you can now bulk edit a list of data assets. It can be quite useful to modify the classifications or contacts in a single action. Apart from that, you can also define a parent-child relationship between terms within the business glossary. Thanks to this feature, you can reuse the same term name in the context of different parent terms. An example of that is a Customer that can be associated with a Finance or Sales entity and have different templates and attributes in each of them.

The next announcement will make you happy if you are a multi-cloud user. Purview can now scan and classify data located in an AWS S3 bucket. In addition to that, you can also connect to on-premises Teradata, SAP ERP, or Oracle DB solutions. All this in a fully automated scanning manner!

Among other scanning features, Purview got support for resource set pattern rules. A resource set is a logical view of a dataset stored in multiple files, not necessarily located in the same directory. For example, it can help classify time-based partitioned data and represent it as a single object in the data catalog. Starting from April, you can define pattern rules that Purview will use to define these logical views of your data.
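To illustrate what a resource set does, here is a minimal sketch that groups partitioned files into logical datasets. It relies on a plain regular expression and hypothetical paths, not on Purview's own pattern rule syntax.

```python
import re
from collections import defaultdict

# Matches the time-based partition directories and the file name.
PARTITION_PATTERN = re.compile(r"/year=\d{4}/month=\d{2}/day=\d{2}/[^/]+$")


def logical_name(path: str) -> str:
    """Replace the partition part of the path with a wildcard."""
    return PARTITION_PATTERN.sub("/*", path)


def group_into_resource_sets(paths):
    resource_sets = defaultdict(list)
    for path in paths:
        resource_sets[logical_name(path)].append(path)
    return dict(resource_sets)


# Hypothetical files of 2 partitioned datasets.
paths = [
    "/raw/orders/year=2021/month=03/day=30/part-0.parquet",
    "/raw/orders/year=2021/month=03/day=31/part-0.parquet",
    "/raw/orders/year=2021/month=04/day=01/part-1.parquet",
    "/raw/customers/year=2021/month=04/day=01/part-0.parquet",
]
resource_sets = group_into_resource_sets(paths)
for name, files in resource_sets.items():
    print(name, len(files))
```

Four physical files end up as two entries, which is what you want to see in a data catalog.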

Finally, also in preview, you can try the search feature from Synapse workspace. It can help you find assets by simply typing some keywords.

Storage

Apart from Purview, another service benefiting from a lot of changes is Azure Storage. First, you can back up your Azure Blobs. The feature is currently in preview and lets you continuously create backups to enable point-in-time recovery. Another data loss protection change you may observe while working with File Shares is that the soft-deletes are enabled by default.

Good news for you if you need to store really big objects. Azure Blob Storage or ADLS Gen 2 supports objects up to 200 TB! So, in other words, you can have 50 000 blocks of 4GB.
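The arithmetic behind this limit is easy to verify. A small sketch, assuming decimal units (1 TB = 1,000 GB) and the numbers quoted above:

```python
import math

MAX_BLOCKS = 50_000  # maximum number of blocks per blob, as quoted above
BLOCK_SIZE_GB = 4    # maximum size of a single block, as quoted above

max_blob_size_tb = MAX_BLOCKS * BLOCK_SIZE_GB / 1_000


def blocks_needed(object_size_gb: float) -> int:
    """Number of maximum-size blocks required to store an object."""
    return math.ceil(object_size_gb / BLOCK_SIZE_GB)


print(max_blob_size_tb)       # maximum blob size in TB
print(blocks_needed(10_000))  # blocks for a hypothetical 10 TB object
```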

Another piece of good news is for your DevSecOps people, or for you, if you are in charge of the security. The encryption scopes are now Generally Available, meaning that you can use different encryption keys for different containers or blobs located inside the same Storage Account!

The last GA feature is the network preference configuration. Thanks to it, you can decide how your clients will access Azure Storage. Choosing the Microsoft global network will often be a better choice from the performance standpoint, whereas opting for the public Internet will lower your costs.

Among other preview features, you will find the support for GCS import in AzCopy. You can now use it to move the data stored on GCP to an Azure Storage Account.

Stream Analytics

Apart from Event Hubs, Stack Hub has another new streaming component with Stream Analytics. You can now integrate Azure Stream Analytics jobs to Event Hubs and IoT Hubs running on Azure Stack.

The second major change of the service concerns resource isolation. With the Dedicated Tenant mode you can now run your Stream Analytics jobs in a completely isolated environment, with the capability to scale up to 216 SU!

SQL Database

Two interesting changes for another data store solution in Azure, the Azure Database. For the MySQL version, the start/stop functionality is now GA. It means you can stop your server anytime you don't use it to save costs.

Regarding PostgreSQL, and more exactly Hyperscale (Citus), you can get a better view of the audit events thanks to the pgAudit extension, which is available in public preview. In addition to that, Citus got other in-preview features. The first of them is Citus Basic, which is a single node version of the tier. You can use it to start small and, as the data load increases, switch to the Standard tier with distributed nodes. And if you encounter some heavy load issues, you can use another new feature and add read replicas within the same region. Finally, you can also define the maintenance schedule, i.e., ask Azure to upgrade the nodes at your preferred time.

Finally, you can configure a Long-Term Backup Retention to retain the SQL Managed Instance backups up to 10 years for regulatory or business purposes.

Synapse

To finish the Azure updates, let's see what changed for Synapse. The first change concerns Synapse and Azure Data Factory because it's about the instances used in the Data Flow activity. You can now purchase 1-year or 3-year Reserved Instances and get a discount compared to the pay-as-you-go model. In addition to this feature, you can also observe better performance for Data Flows. You can now select a Quick Reuse option while defining the flow. If enabled, the cluster will be reused by the next Data Flow activity until it reaches the configured TTL. It results in much faster startup times.

Another mixed feature shared this time by Synapse with Azure SQL Database and Azure SQL Managed Instance is the UNMASK support at the schema-level, table-level, and the column-level. Put another way, you can configure unmask permission for Dynamic Data Masking for a schema, table or column. It's not a new feature but now it's Generally Available.

Aside from the security enhancement, Synapse also got some new features for the data migration part called Azure Synapse Pathway. You can now migrate your data to Synapse from IBM Netezza, SQL Server, and Snowflake. The migration assistant will produce an assessment report of the database objects and translate the code to T-SQL.

To finish this part, Serverless Pool support for Azure Synapse Link is Generally Available. As a reminder, Synapse Link transforms CosmosDB into an OLAP data store. Thanks to it, the analytical queries executed on top of the database don't impact the transactional workload.

GCP

BigQuery

Honestly, I don't remember so many updates for BigQuery! Let's start with the news about the General Availability of materialized views. These precomputed views can be a good performance accelerator, and they come without the maintenance headache: if one of the underlying tables changes, the engine will either reload the modified part (partitioned tables) or the entire table.

The second update concerns clustered tables, or rather non-clustered tables that can be transformed into clustered ones with the help of the tables.update or tables.patch API methods. One thing to note, though: the change applies only to the new data!
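As an illustration, a minimal sketch building the request body you could send to tables.patch. The column names are hypothetical, and the authenticated HTTP call itself is left out. The body relies on the clustering.fields property of the table resource.

```python
import json


def clustering_patch_body(fields):
    """Build the partial table resource that turns a table into a clustered one."""
    if not 1 <= len(fields) <= 4:
        # BigQuery accepts up to 4 clustering columns.
        raise ValueError("expected between 1 and 4 clustering columns")
    return {"clustering": {"fields": list(fields)}}


# Hypothetical clustering columns, for illustration only.
patch_body = clustering_patch_body(["customer_id", "country"])
print(json.dumps(patch_body))
```

The order of the fields matters because it defines the sort order of the data inside the table.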

The next big announcement is the new API for streaming, called BigQuery Storage Write API. It's currently in Preview and provides exactly-once delivery semantics at low cost for high-throughput data. It also supports stream-level and across-streams transactions.

Apart from that, BigQuery SQL support also changed:

In addition, you can retrieve the DDL statement used to create the table from the INFORMATION_SCHEMA.TABLES view.

From the data loading perspective, ENUM and LIST types in Apache Parquet files got better support with conversion to STRING or BYTES for ENUM, and schema inference for LIST logical types.

Bad news if you're using BigQuery Storage Read API to get the data from Cloud Storage in HTTP: Beginning in early Q3 2021, the API will start charging for network egress.

BigQuery Transfer Service

BigQuery Transfer Service helps automate bringing data to BigQuery. In its February release, it became even more reactive. First, there is no more minimum file age limit. Previously only files created at least one hour ago were integrated. Second, the import job execution interval changed from 1 hour to 15 minutes.

Cloud Composer

The next service, Cloud Composer, is also one of the most updated ones. In addition to the new images supporting new Apache Airflow versions, you will find some important feature updates. Five Preview features became Generally Available:

Moreover, DAG serialization is enabled by default, starting from version 1.15.0. However, it won't work if the asynchronous DAG loading is enabled at the same time. With this asynchronous DAG loading feature, the webserver creates a background process that will periodically load and send DAGs.

You should also get better support for failed update operations. The service should now list the possible causes of the problem. Also, when the upgrade fails, the created Cloud SQL instance will be correctly rolled back.

Cloud SQL

Let's start the list of updates with the changes made for Cloud SQL for PostgreSQL. The first of them is the General Availability of the pg_partman extension that you can use to create time-based and serial-based partition sets. In addition, you can use IAM database authentication with the Cloud SQL Auth proxy. The proxy is the recommended way to connect to the database with IAM database authentication. It ensures a stable connection despite the OAuth tokens being short-lived and valid only for 1 hour.

Regarding MySQL, you can configure the innodb_buffer_pool_size and better control the memory allocated to the table and index cache. In addition to that, you can also configure your instances in flexible mode, that is, choose the amount of memory and the number of CPUs really needed by your database.

Finally, for the SQL Server distribution, the first new feature is the possibility to set up the Change Data Capture. The second one is the possibility to connect to the tempdb database where you will find temporary objects created by the user or by the database engine. To finish, you can integrate your instance with Managed Microsoft AD and log in to it using Windows Authentication.

Data Fusion

This service is a big surprise for me because it got many more updates in March than 2 months before! To start, in preview, you can perform a continuous data replication from operational data stores like SQL Server and MySQL into BigQuery. Still in preview, you can also access Transparency logs. Thanks to them, you will be able to see any actions taken by the Google personnel accessing customer content. I hadn't heard about this previously, but in a few words, these Transparency logs are a part of the Access Transparency commitment. Thanks to it, you can check why Google personnel had to access your data, for example to solve an outage or to handle your support requests.

Among other features you will find:

Dataflow

Among the changes announced for Dataflow, you will find the GA of Dataflow Shuffle, Streaming Engine, and FlexRS in new regional endpoints but also something new like the support for User-Defined Functions (UDF) in Dataflow SQL. As you can guess by the name, thanks to this feature, you can customize Dataflow SQL queries with the operations that don't exist natively in the solution. It's currently in Preview.

The second Preview feature is the Execution details tab in the Dataflow job page. It provides some extra execution details that should help you optimize your jobs. To use it, you have to enable it explicitly with the --experiments=enable_execution_details_collection,use_monitoring_state_manager flag.
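If you launch your pipelines from Python, a minimal sketch of the arguments could look like this. The project and region values are hypothetical, and the snippet only builds the argument list that you would pass to your pipeline options.

```python
def with_experiments(pipeline_args, experiments):
    """Return a copy of the arguments with the --experiments flag appended."""
    return list(pipeline_args) + ["--experiments=" + ",".join(experiments)]


# Hypothetical project and region, for illustration only.
base_args = [
    "--runner=DataflowRunner",
    "--project=my-project",
    "--region=europe-west1",
]
pipeline_args = with_experiments(
    base_args,
    ["enable_execution_details_collection", "use_monitoring_state_manager"],
)
print(pipeline_args)
```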

Dataproc

Dataflow is not the only data processing solution on GCP. Another one is Dataproc, and it also got some important improvements. The first one is the General Availability of the start/stop cluster feature. Thanks to it, you can stop the cluster - even a running one - and start it later, exactly like you can do now for a lot of RDBMS on the cloud. The stop/start feature was extended recently to High Availability clusters.

Still regarding the cluster, when you submit a new job to the Dataproc service, you can define a --cluster-labels flag with one or multiple cluster labels. A cluster label is a key-value annotation that you can put on any of the deployed clusters. And thanks to the flag, Dataproc will submit the job to any of the clusters annotated with one of the labels.

The 3rd change is the General Availability of Dataproc Metastore, the fully managed, highly available, auto-healing, and interoperable metastore service. It's also called a serverless Apache Hive metastore, so you can use it to manage metadata of relational entities. It can be shared among many Open Source tools like Apache Spark, Presto, or Apache Hive.

Finally, you can also use Apache Spark 3.1.1 starting from Dataproc's 2.0 image.

Pub/Sub

And what's new for GCP's streaming service? To start, Message schemas are available in the Preview launch stage! The feature is a kind of Kafka Schema Registry on GCP, so it will help enforce schema consistency in your applications and avoid runtime issues related to the schema evolution.

And to finish, an Apache Spark connector for Pub/Sub Lite is available. If you're working with Dataproc or Databricks, which recently became available on GCP, you can connect to Pub/Sub Lite from a Structured Streaming application.

Spanner

The last service on the list is Cloud Spanner. Here too, a lot happened! Spanner got some interesting improvements in the monitoring part, like the visualization of the CPU utilization by operation type (read, write, commit, ...), the monitoring of the database storage utilization, or the monitoring of the number of commit retries in the transaction statistics part.

Among the changes not related to the monitoring, you will find the support for point-in-time recovery. Thanks to this feature, you can easily recover your Spanner database to any point in time within the past 7 days. If you need longer retention, you should still use the Export/Import or Backup/Restore feature.

Also, the support for Customer-Managed Encryption Keys is now Generally Available. You can then store your encryption keys in Cloud KMS and use them to encrypt Spanner database instead of relying on the Google-managed keys.

For the data access part of the service, you can use the request options to define the priority of your queries to one of the available levels: PRIORITY_HIGH, PRIORITY_MEDIUM, PRIORITY_LOW.

And to finish, you can track the progress of long-running index backfill operations with the CLI, REST API or RPC API. "Backfill"? Yes, even though it sounds complicated, it means "adding a secondary index to an already populated database". When it happens, Spanner needs to generate the index for the already present data.

And that's all for the 3rd part of the "What's new on the cloud for data engineers" series. See you next time in Summer!
