Data Loss Prevention on the cloud

When I was writing my previous blog post about losing data on the cloud, I wanted to call it "data loss prevention". It turns out that this term is already reserved for a different problem, the one I will cover just below.



Maybe you were like me: when you first heard about Data Loss Prevention, you thought about backups, replication, and generally all the techniques that ensure data availability. After all, the term mentions a "data loss" and a "prevention". But in reality, it refers to something else!

Data Loss Prevention defines the protection and monitoring methods that avoid violations of data privacy, aka data breaches. From that angle, the data might be "lost" through data exfiltration attacks (an unauthorized user copies your private dataset to their environment) or through the destruction of sensitive information.

Moreover, the term doesn't apply only to data at rest! It's also valid for data in use, like files or information exchanged in emails or chats.

DLP solution

A DLP solution has different layers. The first of them is classification; in other words, the service helps you find potentially sensitive information in a dataset. Another layer is data redaction, i.e. the removal of sensitive information from the datasets. It may not be an appropriate strategy for all use cases, though. After all, you might need some sensitive information somewhere in your data products. If that's the case, you can encrypt the data or expose it only to authorized users.
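To make the classification and redaction layers concrete, here is a minimal, illustrative sketch. It is not a real DLP service: the regexes and infoType labels are my own simplification, and production classifiers use much richer detectors (NLP models, checksums, contextual rules).

```python
import re

# Toy detection patterns standing in for a real classifier.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify(text):
    """Classification layer: return the infoTypes found in the text."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))

def redact(text):
    """Redaction layer: replace each finding with its infoType label."""
    for name, rx in PATTERNS.items():
        text = rx.sub(f"[{name}]", text)
    return text

record = "Contact john.doe@example.com, SSN 123-45-6789"
print(classify(record))  # → ['EMAIL_ADDRESS', 'US_SSN']
print(redact(record))    # → Contact [EMAIL_ADDRESS], SSN [US_SSN]
```

The two functions mirror the two layers: you could run only `classify` for auditing purposes, or chain it with `redact` before publishing a data product.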

Among other components, you will find a monitoring layer with an auditing capability to discover any suspicious activity as quickly as possible. You can also use the network isolation strategies that I briefly covered in Cloud networking aspects for data engineers to reduce the risk of data exfiltration.

Since this blog post is about data loss, let me focus on the services exclusively dedicated to it, and more specifically, to the identification part.

Cloud services

Two of the 3 cloud providers in my current focus have a dedicated Data Loss Prevention (DLP) service, and both launched it at nearly the same moment. AWS launched Amazon Macie in August 2017, whereas GCP released Data Loss Prevention Beta in March 2017. But does a similar launch date mean similar features? Not completely.

Both services support textual data analysis for data stored on an object store. They can read files in formats like Apache Avro, Apache Parquet, or plain text (CSV, HTML, ...). But despite this similarity, you will already notice the first difference. GCP DLP supports unstructured data like images, whereas Amazon Macie doesn't. GCP's solution also works with other data stores like Datastore or BigQuery.

Since we've just talked about the input, what about the output? Amazon Macie writes the analysis results to S3, but you can also subscribe via Amazon EventBridge to get real-time notifications about the findings. GCP DLP can write the output to BigQuery, Pub/Sub, Security Command Center, or Data Catalog, or send you an email notification.
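The real-time notification side can be sketched as an EventBridge rule. The event pattern below uses the `aws.macie` source and "Macie Finding" detail-type that Macie publishes (worth double-checking against the current AWS documentation); the rule name is a placeholder of mine, and the actual `boto3` call is commented out so the snippet runs without credentials.

```python
import json

# EventBridge event pattern matching Amazon Macie findings.
macie_pattern = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
}

# With boto3 and valid credentials, the rule could be created like this:
# import boto3
# events = boto3.client("events")
# events.put_rule(Name="macie-findings",  # hypothetical rule name
#                 EventPattern=json.dumps(macie_pattern))

print(json.dumps(macie_pattern))
```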

When it comes to the features of both services, Amazon Macie uses Natural Language Processing (NLP) algorithms to classify your data as personally identifiable information (PII) or sensitive personal information. It's therefore primarily a detection tool. On the other hand, GCP DLP has an extra redaction feature. It can still detect sensitive data and assign a probability (likelihood) to each finding, but it can also automatically redact sensitive information, even in images!
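On the GCP side, detection and redaction are two calls against the same request shape. The sketch below only builds the request payloads, following the google-cloud-dlp v2 API; the project id and text are placeholders, and the client calls are commented out so the snippet stays runnable without credentials.

```python
parent = "projects/my-project"  # hypothetical project id

item = {"value": "My email is jane@example.com"}
inspect_config = {
    "info_types": [{"name": "EMAIL_ADDRESS"}],  # built-in infoType
    "min_likelihood": "LIKELY",  # DLP reports a likelihood per finding
}
# Replace each finding with its infoType name, e.g. [EMAIL_ADDRESS].
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}

# from google.cloud import dlp_v2
# client = dlp_v2.DlpServiceClient()
# detection = client.inspect_content(request={
#     "parent": parent, "inspect_config": inspect_config, "item": item})
# redaction = client.deidentify_content(request={
#     "parent": parent, "inspect_config": inspect_config,
#     "deidentify_config": deidentify_config, "item": item})
```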

GCP DLP also has an extra feature of evaluating the re-identification risk with Risk Analysis Jobs. The process consists of analyzing the dataset and assigning a risk of concretely identifying the data subjects behind the sensitive data. For example, it can measure whether columns like age or job title can identify a specific person.
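One of the metrics behind this kind of risk analysis is k-anonymity, which can be illustrated with a few lines of plain Python (a toy computation of my own, not the DLP service's implementation): group the rows by the quasi-identifier columns and take the smallest group size.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Smallest group size over the quasi-identifier combination.

    k = 1 means at least one person is uniquely identifiable
    from those columns alone.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

people = [
    {"age": 34, "job": "engineer", "city": "Paris"},
    {"age": 34, "job": "engineer", "city": "Paris"},
    {"age": 51, "job": "baker",    "city": "Lyon"},
]
print(k_anonymity(people, ["age", "job"]))  # → 1: the baker is unique
```

A high k means individuals hide in large groups; a k of 1, as here, means the (age, job) pair alone re-identifies someone.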

To finish, let's quickly look at the rules and patterns. GCP DLP has a list of built-in patterns, and the nice thing here is that they adapt to countries! You can use a rule to detect things like an Australian driver's license number or a French passport number! Amazon Macie also supports regional variants for concepts like bank accounts or driver's license numbers. Both services support custom identification too: Amazon Macie uses RegEx-based input, whereas GCP DLP additionally supports dictionaries of sensitive data terms.
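The two custom-identification styles can be sketched side by side. All names and values below are placeholders of mine; the Macie call is commented out since it needs credentials, while the GCP part only builds the `custom_info_types` payload accepted by an inspect configuration.

```python
# Amazon Macie: a RegEx-based custom data identifier, e.g. via boto3:
# import boto3
# macie = boto3.client("macie2")
# macie.create_custom_data_identifier(
#     name="internal-employee-id",   # hypothetical name
#     regex=r"EMP-\d{6}")

# GCP DLP: a custom infoType can be a regex OR a dictionary of terms.
custom_info_types = [
    {"info_type": {"name": "EMPLOYEE_ID"},
     "regex": {"pattern": r"EMP-\d{6}"}},
    {"info_type": {"name": "PROJECT_CODENAME"},
     "dictionary": {"word_list": {"words": ["aquila", "borealis"]}}},
]
inspect_config = {"custom_info_types": custom_info_types}
print(inspect_config)
```

The dictionary variant is the one Macie lacks in this comparison: it lets you flag exact internal terms (project codenames, client names) that no regex could describe.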

Data Loss Prevention isn't a synonym for "backups" or "replication". Sure, these two can be part of a Data Loss Prevention strategy, avoiding for example a total loss of sensitive information, but they won't tell you whether your dataset contains sensitive data. Fortunately, GCP DLP and Amazon Macie will, and I hope the second part of the article explained their main components.