When I was preparing for my AWS Big Data Specialty certification, I was not comfortable with 2 categories: visualization and security. Because of that, I decided to work on them, starting with the latter, which can have a more direct impact. I start this work with this post about data security practices on AWS data services.
First, by "data services" I mean storage, processing and analytics services. In this post, I will cover different techniques provided by the AWS platform to secure data on the cloud. I divided the post into 3 sections. Each of them is dedicated to one aspect of data security, in order: encryption, authorization and data exposition.
Since the post is quite long, here are a few key takeaways about securing data services on AWS: SSE encryption is easier to manage, beware of encryption changes (old data not encrypted, unavailability of some services), apply the least privilege principle, use fine-grained access (IAM conditions, column-level access on Lake Formation, row-level access on QuickSight), rely on federated access within a corporate AD and control the data flow (public vs private). For more details, keep reading and check the links from the read more section at the end of the post.
Encryption helps to ensure that only authorized parties can see given data. AWS comes with a fairly standardized way to apply this technique to its data services thanks to AWS Key Management Service (KMS). It helps to create and manage cryptographic keys on the cloud that you can later use in many other services, like Kinesis, Redshift, RDS, S3 or DynamoDB. But it's not the only option. If you need to manage the keys on your own, you can turn to AWS CloudHSM. Remember that this scenario will require much more maintenance work on your part. The third encryption option is LUKS and you can use it for EBS volumes attached to your EMR cluster.
Apart from this distinction, you will find 2 different encryption types on AWS services: server-side, called SSE, and client-side, called CSE. What's the difference? With SSE, data is encrypted and decrypted by the server, so in general you only need to grant the appropriate KMS permissions to your IAM role to benefit from it. CSE, on the other hand, handles encryption and decryption on the client side. If we take the example of a Kinesis data stream, we encrypt the records before sending them to the stream. In that case, your encryption keys can be managed on your own or by KMS.
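To make the SSE part a bit more concrete, here is a minimal sketch of an identity policy granting the KMS actions that reads and writes on SSE-KMS encrypted data typically need. The helper name and the key ARN are placeholders of mine, and the exact list of actions can vary with the service you use.

```python
import json


def sse_kms_policy(key_arn: str) -> dict:
    """Build an identity policy letting a principal use one KMS key for SSE."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "UseKeyForServerSideEncryption",
                "Effect": "Allow",
                # GenerateDataKey is used when writing encrypted data,
                # Decrypt when reading it back.
                "Action": ["kms:GenerateDataKey", "kms:Decrypt"],
                "Resource": key_arn,
            }
        ],
    }


# Placeholder ARN, not a real key.
policy = sse_kms_policy("arn:aws:kms:eu-west-1:111122223333:key/EXAMPLE-KEY-ID")
print(json.dumps(policy, indent=2))
```

You would attach this document to the IAM role used by your producers and consumers.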
When you deal with encryption, you must remember a few important things. First, not all services let you change the encryption on the fly. One of them is Redshift. If you want to enable encryption on an unencrypted cluster, you will have to plan for a period during which writes are unavailable. This change is a regular migration, and while it runs, the clients can only read data from the cluster. Some other services let you switch the encryption on the fly. One of them is Kinesis Data Streams. The thing you need to remember here is that only the data stored after the encryption change will be encrypted! Data that arrived before this action won't be changed. The same applies to static data stores like S3 or DynamoDB.
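As an illustration of the on-the-fly switch for Kinesis Data Streams, the sketch below wraps the `start_stream_encryption` call of a boto3 Kinesis client. The function name and the stream name are hypothetical, and the client is passed as a parameter so that you can try the snippet without AWS credentials.

```python
def enable_stream_encryption(kinesis_client, stream_name: str, key_id: str) -> None:
    """Turn on KMS server-side encryption for an existing stream.

    Only the records written after this call are encrypted; the records
    already stored in the stream stay as they were.
    """
    kinesis_client.start_stream_encryption(
        StreamName=stream_name,
        EncryptionType="KMS",
        KeyId=key_id,
    )


# Usage against a real account (assumes boto3 is installed and configured):
#   import boto3
#   enable_stream_encryption(boto3.client("kinesis"), "orders", "alias/aws/kinesis")
```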
Also, remember to check if the encryption applies to any "companion" services. For example, if you use DynamoDB Accelerator (DAX) in front of an encrypted DynamoDB table, you will need to explicitly enable encryption on your DAX cluster at creation time. The configuration won't be automatically inherited from the table.
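For the DAX case, a small sketch of the request you could pass to the boto3 `create_cluster` call of the DAX client could look like below. The helper, the cluster name, the node type and the role ARN are only examples.

```python
def encrypted_dax_cluster_request(name: str, node_type: str, iam_role_arn: str,
                                  replication_factor: int = 3) -> dict:
    """Build create_cluster arguments with encryption at rest enabled."""
    return {
        "ClusterName": name,
        "NodeType": node_type,
        "ReplicationFactor": replication_factor,
        "IamRoleArn": iam_role_arn,
        # Must be set at creation time, it is not inherited from the table.
        "SSESpecification": {"Enabled": True},
    }


request = encrypted_dax_cluster_request(
    "orders-cache", "dax.r4.large", "arn:aws:iam::111122223333:role/DaxServiceRole"
)
# With boto3 installed and configured:
#   boto3.client("dax").create_cluster(**request)
```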
Encryption at rest (stored data) works not only on the data but also on the metadata. If you enable encryption on Glue, it will apply to the jobs data but also to the data catalog. The same is valid for Redshift where, aside from data blocks, system metadata is also encrypted for the cluster and its snapshots.
That's all for encryption at rest, but not for encryption as a whole. AWS also supports encryption in transit, i.e. when the data moves, for instance from the data producer to the data store. To do so, the services use the HTTPS protocol (SSL/TLS). It's valid for DynamoDB, S3, Glue and Athena. Redshift also uses SSL but it has 2 different modes. For the client's communication (e.g. a JDBC connection), it uses standard SSL. For internal operations inside AWS (backup, COPY or UNLOAD commands communicating with S3 or DynamoDB), it uses hardware-accelerated SSL.
And finally, EMR. It also supports SSL/TLS but the explanation is a little bit longer. Mainly, it's all about the shuffle, i.e. exchanging data between compute nodes, for instance during group by key operations. It's encrypted for Hadoop MapReduce, Presto, and Tez with SSL and/or TLS. In Apache Spark, the internal RPC communication for block exchange and the external shuffle service are encrypted with the AES-256 cipher (EMR 5.9.0 or later) or DIGEST-MD5 (prior to 5.9.0).
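On EMR, in-transit encryption is enabled through a security configuration. Below is a minimal sketch generating such a JSON document with PEM certificates stored on S3. The function name and the S3 path are placeholders, and at-rest encryption is left disabled only to keep the example short.

```python
import json


def in_transit_security_configuration(certificates_zip_s3_uri: str) -> str:
    """Return an EMR security configuration enabling in-transit encryption."""
    configuration = {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": True,
            "EnableAtRestEncryption": False,
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    "CertificateProviderType": "PEM",
                    # Zip archive with the certificates, placeholder location.
                    "S3Object": certificates_zip_s3_uri,
                }
            },
        }
    }
    return json.dumps(configuration)


security_configuration = in_transit_security_configuration("s3://my-bucket/certs/my-certs.zip")
# With boto3 installed and configured:
#   boto3.client("emr").create_security_configuration(
#       Name="in-transit-only", SecurityConfiguration=security_configuration)
```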
In addition to the encryption part, there is also a quite similar section about authorization. By authorization I mean controlling who can do what. There are 2 main ways to configure that: identity-based and resource-based policies. In the former, you define a policy document with the list of allowed and denied actions and the resources they apply to. Later you can associate this document with an IAM group, user or role. For example, if you associate it with a user, you can say that this user can read and write objects in bucket 1 and only read from bucket 2. The resource-based policy works the other way round because you define, at the resource level, what actions can be performed by specific users. For example, you can say that user 1 can only read data and user 2 can read and write data in a specific S3 bucket. You can also apply this to more structured data stores, for instance with policies limiting access to specific tables or databases in the Glue Data Catalog.
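The 2 examples from this paragraph can be sketched as policy documents. The bucket names, the account id and the user names are placeholders, and I limit the actions to object reads and writes to keep things short.

```python
import json

# Identity-based policy: attached to a user, group or role.
# "This user can read and write in bucket-1 and only read from bucket-2."
identity_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::bucket-1/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::bucket-2/*",
        },
    ],
}

# Resource-based policy: attached to the bucket, so it names the principals.
# "user-1 can only read, user-2 can read and write."
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:user/user-1"},
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::bucket-1/*",
        },
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:user/user-2"},
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::bucket-1/*",
        },
    ],
}

print(json.dumps(bucket_policy, indent=2))
```

The visible difference is the `Principal` element: it's present only in the resource-based policy, since an identity-based policy already belongs to its principal.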
IAM policies are also quite a powerful way to provide fine-grained access control. If you use the Condition element, you can, for example, define what items of a DynamoDB table can be accessed by the connected user. It's very useful to keep data private in multi-tenant environments, like a social media app.
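Here is a sketch of such a condition for DynamoDB. It assumes that the users sign in through Amazon Cognito and that the partition key of the table stores the Cognito identity id; the helper name and the table ARN are hypothetical.

```python
def own_items_only_policy(table_arn: str) -> dict:
    """Allow a user to touch only the items whose partition key is their id."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem"],
                "Resource": table_arn,
                "Condition": {
                    "ForAllValues:StringEquals": {
                        # Policy variable resolved by IAM at request time.
                        "dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"]
                    }
                },
            }
        ],
    }


policy = own_items_only_policy("arn:aws:dynamodb:eu-west-1:111122223333:table/UserPosts")
```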
A quite popular authorization topic I met in different AWS data services tests is cross-account access. Imagine that you're working with a partner and want to let them read some data from your bucket. How can you achieve that? There are 3 different ways to do so, at least for S3 (accountB wants to access accountA data):
- cross-account IAM roles, where a user from accountB assumes a role created in accountA
- IAM and resource policies, where you create an IAM policy on accountB letting a user from this account interact with the bucket from accountA. Then you add a resource policy on the bucket from accountA authorizing the user from accountB to perform the actions on the bucket.
- IAM policy and resource ACL, which is quite similar to the previous approach except for a few details. Instead of creating a resource policy, you need to create ACL permissions on the bucket.
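The first option from the list can be sketched with 2 documents: the trust policy of the role created in accountA and the identity policy that accountB gives to its users. The account ids and the role name are placeholders, and the helper names are mine.

```python
def trust_policy_for_account(trusted_account_id: str) -> dict:
    """Trust policy of the accountA role: who is allowed to assume it."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def assume_role_policy(role_arn: str) -> dict:
    """Identity policy for accountB users: which role they may assume."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": "sts:AssumeRole", "Resource": role_arn}
        ],
    }


# accountA = 111111111111 (owns the bucket), accountB = 222222222222 (the partner)
role_trust = trust_policy_for_account("222222222222")
partner_policy = assume_role_policy("arn:aws:iam::111111111111:role/PartnerBucketReader")
```

The permissions on the bucket itself are then defined in the policies attached to the role in accountA.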
Another subtlety concerns services like EMR, where you associate roles at creation time and may need to apply the permissions of the real users of the service. Let's say that user A uses EMR's Spark job to process data. In that context, you should ensure that the user has enough permissions not only to add a new EMR step but also to access the data from this step. In other words, you need to take the user's policy and apply it to the executed job. By default, the role used will be the EC2 service role and it will work fine for single-tenant clusters. But if you have to manage multi-tenant clusters, you will need some extra work and use IAM Roles for EMRFS with an authentication mechanism like Kerberos. Unfortunately, this approach works only for S3 data, so it's great if you're doing some exploratory job. If your cluster should be able to run multiple jobs, for example, different Apache Spark applications using other services like DynamoDB, multi-tenancy will be more challenging. I didn't find any easy way to set up multi-tenant authentication for other services. If you have any clue, please leave a comment!
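To give an idea of what IAM Roles for EMRFS look like, here is a sketch of the authorization part of an EMR security configuration mapping users and groups to roles. All the ARNs, user names and group names are invented for the example.

```python
import json


def emrfs_role_mappings(user_roles: dict, group_roles: dict) -> str:
    """Build the EMRFS part of a security configuration.

    user_roles and group_roles map a role ARN to the users or groups
    whose S3 requests through EMRFS should be made with that role.
    """
    mappings = []
    for role_arn, users in user_roles.items():
        mappings.append({"Role": role_arn, "IdentifierType": "User", "Identifiers": list(users)})
    for role_arn, groups in group_roles.items():
        mappings.append({"Role": role_arn, "IdentifierType": "Group", "Identifiers": list(groups)})
    return json.dumps(
        {"AuthorizationConfiguration": {"EmrFsConfiguration": {"RoleMappings": mappings}}}
    )


configuration = emrfs_role_mappings(
    user_roles={"arn:aws:iam::111122223333:role/emrfs-user-a": ["userA"]},
    group_roles={"arn:aws:iam::111122223333:role/emrfs-analysts": ["analysts"]},
)
```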
The final part about authorization is federated access, i.e. when you use an external provider like Active Directory to manage users. For example, Redshift has a special IAM permission called redshift:GetClusterCredentials that lets users generate temporary credentials for a Redshift connection after a successful login with their corporate credentials. If you want to know more details, look for the "Amazon Redshift Federated Authentication with Single Sign-On" feature.
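Once the user is authenticated and allowed to call redshift:GetClusterCredentials, getting the temporary credentials could look like the sketch below. It relies on the `get_cluster_credentials` call of a boto3 Redshift client passed as a parameter; the function name and the identifiers in the usage comment are hypothetical.

```python
def temporary_redshift_credentials(redshift_client, cluster_id: str, db_user: str,
                                   db_name: str, duration_seconds: int = 900) -> tuple:
    """Return a (user, password) pair valid for a limited time."""
    response = redshift_client.get_cluster_credentials(
        ClusterIdentifier=cluster_id,
        DbUser=db_user,
        DbName=db_name,
        DurationSeconds=duration_seconds,
        # Do not create the database user if it doesn't exist yet.
        AutoCreate=False,
    )
    return response["DbUser"], response["DbPassword"]


# With boto3 installed and configured:
#   import boto3
#   user, password = temporary_redshift_credentials(
#       boto3.client("redshift"), "analytics-cluster", "report_reader", "dev")
```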
In this last part I will focus on something similar to user authorization, but to better highlight the features provided by AWS, I decided to put them here. By data exposition I mean what is exposed to the user, and I will analyze 3 services. The first of them is Lake Formation and its column-level access. When you grant a SELECT action to the user on a specific table, you can limit the access scope to some columns either by using whitelisting (inclusion) or blacklisting (exclusion). Under the hood, Lake Formation uses the Glue Data Catalog and that's the reason why it's able to provide such fine-grained access control. But you can achieve the same thing in a more classical way on Redshift Spectrum with GRANT SELECT (column1, column2) ON EXTERNAL TABLE ....
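The Lake Formation inclusion and exclusion modes can be sketched as arguments for the boto3 `grant_permissions` call. The helper is hypothetical, as are the principal, database, table and column names.

```python
def column_level_grant(principal_arn: str, database: str, table: str,
                       included=None, excluded=None) -> dict:
    """Build grant_permissions arguments for a column-limited SELECT."""
    if (included is None) == (excluded is None):
        raise ValueError("Pass exactly one of included or excluded")
    table_with_columns = {"DatabaseName": database, "Name": table}
    if included is not None:
        # Inclusion: only these columns are visible.
        table_with_columns["ColumnNames"] = list(included)
    else:
        # Exclusion: every column except these ones is visible.
        table_with_columns["ColumnWildcard"] = {"ExcludedColumnNames": list(excluded)}
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {"TableWithColumns": table_with_columns},
        "Permissions": ["SELECT"],
    }


grant = column_level_grant(
    "arn:aws:iam::111122223333:user/analyst", "sales", "orders", excluded=["credit_card"]
)
# With boto3 installed and configured:
#   boto3.client("lakeformation").grant_permissions(**grant)
```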
A similar "data hiding" feature comes from QuickSight. In this data visualization service, the first level of security is an explicit sharing of your dataset with other users. At this point, they can access all of its fields. But you can fine-tune this by adding row-level security and therefore define what rows can be seen by the users. To do so, you need to create a SQL query or a file where you define the user or the group to which you want to apply this extra security layer. Then you need to define the fields of your dataset and, for every user or group, the set of values the user is allowed or disallowed to see.
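The file variant of the row-level rules can be generated with a few lines of code. In this sketch I assume a CSV where the first column identifies the user and the other columns are dataset fields; an empty value is meant as "no restriction on this field", which you should double-check against the QuickSight documentation. The user names, field names and values are invented.

```python
import csv
import io


def row_level_rules_csv(field_names: list, rules: dict) -> str:
    """Render permission rules as CSV: one line per user, one column per field."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["UserName", *field_names])
    for user, allowed_values in rules.items():
        # A field missing from the user's rules is left empty.
        writer.writerow([user, *(allowed_values.get(field, "") for field in field_names)])
    return buffer.getvalue()


rules_file = row_level_rules_csv(
    ["Region", "Segment"],
    {"alice": {"Region": "EMEA"}, "bob": {"Region": "US", "Segment": "SMB"}},
)
print(rules_file)
```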
The last point of this category is more about resources than about users. Sometimes you may want to keep your resources, like a Redshift cluster, private, i.e. not publicly accessible from the Internet. Doing that is quite straightforward: you need to create the cluster in your VPC and place it in a private subnet. In addition to that, you can use another feature to secure the communication between Redshift and S3 (as of this writing, it's the only supported service) by enabling Enhanced VPC Routing. When you execute a COPY or UNLOAD command, all the traffic will go through your VPC instead of the Internet.
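Both settings are simple flags at cluster creation. Here is a sketch of the arguments for the boto3 `create_cluster` call of the Redshift client; the helper name, the identifiers, the node type and the credentials are placeholders, and the subnet group is assumed to contain only private subnets.

```python
def private_cluster_request(cluster_id: str, subnet_group: str, security_group_ids: list,
                            master_user: str, master_password: str) -> dict:
    """Build create_cluster arguments for a cluster unreachable from the Internet."""
    return {
        "ClusterIdentifier": cluster_id,
        "NodeType": "dc2.large",
        "MasterUsername": master_user,
        "MasterUserPassword": master_password,
        "ClusterSubnetGroupName": subnet_group,
        "VpcSecurityGroupIds": list(security_group_ids),
        # No public IP address for the cluster.
        "PubliclyAccessible": False,
        # COPY and UNLOAD traffic goes through the VPC.
        "EnhancedVpcRouting": True,
    }


request = private_cluster_request(
    "analytics-cluster", "private-subnets", ["sg-0123456789abcdef0"], "admin", "Change-Me-1"
)
# With boto3 installed and configured:
#   boto3.client("redshift").create_cluster(**request)
```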
Security on AWS is a wide topic. AWS prepared a whole certification for it, so covering everything in a single blog post is very hard. I hope, though, that this summary of the security features applied to AWS data services can help you to understand them and prepare for an AWS certification.