Open Source provides a lot of interesting tools to deal with Big Data: Apache Spark, Apache Kafka, Parquet, to name only a few. However, data platforms without cloud support are nowadays rarer and rarer. That's why this topic merits its own category and posts on this blog. To start slowly, the first article presents the services you can use to work with data on AWS.
The post is divided into 4 parts, each covering one topic specific to data-centric systems. The first one is about the storage options. The second one covers data processing, whereas the last two deal with job orchestration in the classical ETL style and with ad-hoc data analysis.
Storage
The most primitive storage is the one using files. AWS lets us store the data in an object store called Amazon S3. Among others, the service comes with fine-grained access control, the built-in possibility to analyze the data with analytics services like Athena, and cross-region replication. Moreover, it protects the data against unauthorized use through encryption and against human errors through a versioning system. Long-term and cheap storage for archived data is provided by the Amazon Glacier service.
AWS also enables data streaming with Amazon Kinesis. This service can be compared to Apache Kafka since it also provides partitioned, real-time data streaming. The events are stored in streams (Kafka's topics) and distributed by partition key across different shards (Kafka's partitions). The number of shards determines how much the consumers can be parallelized. Kinesis data is retained only for a limited time (up to 7 days at the time of writing). That's why it's important to plan extra storage to persist the records longer than that. Aside from its purely streaming character, Amazon Kinesis comes with different companion services like Kinesis Analytics (comparable to SQL on Kafka), Kinesis Fan-out and Kinesis Firehose (real-time data loader).
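The distribution of records by partition key can be sketched in a few lines. Kinesis maps each partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The sketch below is only an illustration under one assumption: the shards split the hash space evenly, which is not guaranteed once a stream has been resharded. The function names `build_shard_ranges` and `shard_for_key` are hypothetical and not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces 128-bit values


def build_shard_ranges(shard_count):
    """Split the 128-bit hash space into contiguous, evenly sized ranges.

    Assumption: an even split; a resharded stream may have uneven ranges.
    """
    step = HASH_SPACE // shard_count
    ranges = []
    for index in range(shard_count):
        start = index * step
        # The last shard absorbs the remainder so the whole space is covered.
        end = HASH_SPACE - 1 if index == shard_count - 1 else start + step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside of every shard range")


ranges = build_shard_ranges(4)
for key in ["user-1", "user-2", "user-3", "user-1"]:
    # The same partition key always lands on the same shard.
    print(key, "-> shard", shard_for_key(key, ranges))
```

Since all records sharing a partition key end up on one shard, a badly chosen key (for instance a constant) would send the whole traffic to a single shard and cancel the benefit of adding more shards.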
In addition to Kinesis, AWS has different messaging services like SQS and SNS. The former is a poll-based messaging service whereas the latter is a push-based one. SNS is able to deliver messages to different subscribers like Lambda functions, SQS queues and HTTP endpoints, while the messages stored in SQS wait until a consumer polls them.
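The difference between poll-based and push-based delivery can be illustrated with a toy in-memory model. It does not use the AWS SDK and `ToyQueue` and `ToyTopic` are hypothetical classes written only for this illustration. It also simplifies a lot: a real SQS consumer must explicitly delete a received message, otherwise the message becomes visible again after a timeout.

```python
from collections import deque


class ToyQueue:
    """Poll-based: messages wait until a consumer asks for them."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self, max_messages=1):
        # Hand over at most max_messages, oldest first.
        batch = []
        while self._messages and len(batch) < max_messages:
            batch.append(self._messages.popleft())
        return batch


class ToyTopic:
    """Push-based: every subscriber is called as soon as a message is published."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        for callback in self._subscribers:
            callback(message)


queue = ToyQueue()
topic = ToyTopic()
pushed = []
topic.subscribe(pushed.append)  # stands for a function reacting to each message
topic.subscribe(queue.send)     # fan-out of the topic's messages into a queue

topic.publish("order-created")
print(pushed)           # delivered without the subscriber asking for it
print(queue.receive())  # delivered only because the consumer polled
```

Subscribing a queue to a topic, as in the last lines, mirrors a common way of combining both services: the topic fans the message out and every queue lets its consumer read at its own pace.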
Among NoSQL databases you can find DynamoDB and Neptune. The first one is a fully managed key-value data store. Its advantage is the ability to scale automatically according to the defined scaling policy. The scaling happens when the consumed read or write capacity stays above the configured target for several minutes in a row, a situation that would otherwise end up with throttled requests. Neptune, in turn, is a managed graph database. Aside from them, AWS also brings a managed version of the Elasticsearch search engine and the ElastiCache in-memory data store.
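To give an idea of what a DynamoDB scaling policy looks like, here is a sketch that only builds the request parameters for the Application Auto Scaling operations `register_scalable_target` and `put_scaling_policy`. It assumes a target-tracking policy on the read capacity. The table name `events`, the capacity bounds, the 70% target and the helper name `build_read_scaling_requests` are hypothetical values chosen for the example. Nothing is sent to AWS.

```python
def build_read_scaling_requests(table_name, min_units, max_units, target_utilization):
    """Build the parameters describing read-capacity auto scaling for a table."""
    resource_id = f"table/{table_name}"
    dimension = "dynamodb:table:ReadCapacityUnits"
    # Declares which capacity can be scaled and within which bounds.
    scalable_target = {
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "MinCapacity": min_units,
        "MaxCapacity": max_units,
    }
    # Asks to keep the consumed/provisioned ratio around the target value.
    scaling_policy = {
        "PolicyName": f"{table_name}-read-scaling",  # hypothetical name
        "ServiceNamespace": "dynamodb",
        "ResourceId": resource_id,
        "ScalableDimension": dimension,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_utilization,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
            },
        },
    }
    return scalable_target, scaling_policy


target, policy = build_read_scaling_requests("events", 5, 100, 70.0)
print(target)
print(policy)

# With boto3 installed and credentials configured, they could be sent with:
#   client = boto3.client("application-autoscaling")
#   client.register_scalable_target(**target)
#   client.put_scaling_policy(**policy)
```

The write capacity would need its own pair of requests, with the `dynamodb:table:WriteCapacityUnits` dimension and the `DynamoDBWriteCapacityUtilization` metric.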
But to store and query the data at scale, you will need another technology than key-value or graph stores. For that AWS comes with Redshift, a horizontally scalable data warehouse built on a Massively Parallel Processing architecture. Redshift is based on PostgreSQL 8.0.2, so you will be able to query the data with SQL.
A pure PostgreSQL instance can be created independently with RDS (Relational Database Service). PostgreSQL is only one of the many relational databases provided there. Other popular Open Source RDBMS available are MySQL and MariaDB. AWS brings an easy way to manage and monitor them. It also provides an event-driven alerting feature that can be used to monitor the state of the machines.
Data processing
A nice thing about data processing on AWS is that you can implement classical data processing pipelines as well as more modern ones, based for instance on an event-driven architecture. The latter is possible thanks to the AWS Lambda service, which can be used either to process the data or to trigger more complex jobs as soon as the data arrives in one of the data storage services.
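As an illustration of the event-driven approach, here is a minimal sketch of a Lambda function reacting to new objects in S3. The handler only reads the bucket name and the object key from the notification event. The bucket `raw-data`, the key and the returned structure are assumptions made for the example, and the actual processing is left out.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Entry point invoked with the S3 notification event."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        objects.append(f"s3://{bucket}/{key}")
    # A real function would process the objects or start a heavier job here.
    return {"objects_to_process": objects}


# Local invocation with a trimmed-down, hypothetical event.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "raw-data"},
                "object": {"key": "logs/2018/visits+part+1.json"},
            }
        }
    ]
}
print(handler(sample_event, None))
```

Keeping the handler free of side effects until the last step makes it easy to test locally with a handcrafted event, as done in the last lines.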
For data processing you can also execute Apache Spark on top of the Elastic MapReduce (EMR) service. It's a very convenient way to build a cluster on demand and destroy it as soon as the processing terminates. Such a cluster is called a transient cluster. With EMR it's also possible to change the cluster size after its creation. All this can be industrialized with CloudFormation templates. AWS provides not only the compute instances but also the automatic installation of all needed dependencies, such as HDFS, YARN and so forth. Moreover, with spot instances you can optimize the costs, for instance by bidding on idle AWS machines.
Spot instances
A spot instance is an EC2 machine cheaper than the same machine at the On-Demand price. The discount comes from the fact that spot instances use EC2 capacity that is idle at a given moment. Their drawback is that they can go down at any moment: AWS doesn't guarantee the same availability as for On-Demand EC2 instances. AWS notifies the user of a given spot instance 2 minutes before taking it back.
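The interruption notice is exposed to the instance through its metadata (the `spot/instance-action` entry) as a small JSON document holding the action and the time at which it will happen. The sketch below only parses such a document and computes the remaining time. The sample body and the helper name `seconds_until_interruption` are assumptions made for the example, and fetching the metadata itself is left out.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_body, now):
    """Parse an interruption notice; return the action and the seconds left."""
    notice = json.loads(notice_body)
    # The time is expressed in UTC, e.g. 2018-09-18T08:22:00Z.
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return notice["action"], (deadline - now).total_seconds()


# Hypothetical notice, read 2 minutes before the deadline.
sample = '{"action": "terminate", "time": "2018-09-18T08:22:00Z"}'
now = datetime(2018, 9, 18, 8, 20, 0, tzinfo=timezone.utc)
action, remaining = seconds_until_interruption(sample, now)
print(action, remaining)
```

A job running on spot instances could poll this information regularly and use the remaining time to checkpoint its state or to stop accepting new work.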
You can also build data pipelines with the AWS Batch service. Thanks to a flexible system of triggers and the possibility to launch the jobs through one of the available APIs, a job can execute at a regular interval (CloudWatch rule) or after a specific event (Lambda trigger). Aside from this execution flexibility, AWS Batch also provides reliability, as long as the retry strategy is properly defined.
AWS Batch uses another service you could employ for data processing: Amazon Elastic Compute Cloud (EC2). It lets us launch applications similarly to EMR, i.e. without the need to worry about the physical servers (by the way, EMR and AWS Batch use EC2 instances under the hood). On the other hand, it only creates a new machine, so all the logic and tools must be provided separately. Its advantage is that you can launch almost everything on the cloud: Apache Cassandra, Apache Kafka or a customized Apache Spark pipeline. The drawback is that, unlike the previous solutions, it requires some operational effort.
Orchestrating jobs
I could have included the services from this part in the previous category, but to better highlight the differences I preferred to put them here. The first service from this category is AWS Glue. It's a serverless ETL service based on Apache Spark that lets us define when the jobs are triggered. You can use for that a cron expression, an explicit on-demand call or the result of another job. Glue also offers an interesting feature called crawler that may be used to build the schema from the data stored in different data sources (S3, RDS, DynamoDB).
Another service similar to Glue is AWS Data Pipeline. It facilitates data transformation and movement. It uses predefined job templates executed on EC2 instances. Such a template can for instance execute Redshift's COPY command, run any SQL query on one of the supported databases, or launch a job on EMR.
You can also use AWS Step Functions to orchestrate different jobs. Step Functions are in fact state machines, i.e. an abstraction able, among others, to connect tasks and hence to use the output of one task as the input of another one. In AWS the tasks of a state machine can execute either as Lambda functions or as code hosted elsewhere, for instance on an EC2 instance.
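The idea of chaining tasks through their inputs and outputs can be shown with a toy runner. The definition below follows the general shape of a Step Functions state machine (`StartAt`, `States`, `Type`, `Resource`, `Next`, `End`), but everything else is an assumption made for the illustration: only `Task` states are supported, the resources are plain names resolved to local Python functions instead of ARNs, and `run_state_machine` is a hypothetical helper, not an AWS API.

```python
def run_state_machine(definition, task_handlers, initial_input):
    """Run Task states one after another, feeding each output to the next task."""
    state_name = definition["StartAt"]
    data = initial_input
    while True:
        state = definition["States"][state_name]
        if state["Type"] != "Task":
            raise ValueError(f"unsupported state type: {state['Type']}")
        # The output of the current task becomes the input of the next one.
        data = task_handlers[state["Resource"]](data)
        if state.get("End"):
            return data
        state_name = state["Next"]


definition = {
    "StartAt": "Extract",
    "States": {
        "Extract": {"Type": "Task", "Resource": "extract", "Next": "Transform"},
        "Transform": {"Type": "Task", "Resource": "transform", "Next": "Load"},
        "Load": {"Type": "Task", "Resource": "load", "End": True},
    },
}

# Local stand-ins for the functions that would execute the tasks.
task_handlers = {
    "extract": lambda _input: [1, 2, 3],
    "transform": lambda rows: [row * 10 for row in rows],
    "load": lambda rows: {"loaded_rows": len(rows), "total": sum(rows)},
}

print(run_state_machine(definition, task_handlers, {}))
```

The real service adds much more on top of this, such as branching, parallel execution, retries and error handling, but the data flow between the tasks follows the same principle.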
Reading data
Athena is a good choice for ad-hoc analysis. It's a pay-per-query service able to execute SQL queries on the files stored on S3. It works well with different file formats (ORC, JSON, Parquet, CSV) and is fully serverless. Moreover, it integrates pretty well with classical BI reporting tools like Tableau. As a pure BI tool Amazon offers QuickSight, a data visualization service that connects easily to other AWS sources (S3, Redshift, ...).
Another way to explore the data is the already mentioned Kinesis Data Analytics. This service is able to run SQL queries against data streams coming from the Kinesis or Kinesis Firehose services. It lets us get insight into the data in near real-time with classical SQL queries.
Data is everywhere and AWS offers a lot of managed services to work with it efficiently. As shown in the first section, we have a wide range of solutions to persist the data in a structured or unstructured manner, for real-time or non-real-time processing. The processing can be done in a fully serverless architecture with AWS Lambda or in a more classical way with Apache Spark clusters or AWS Batch. Everything can be orchestrated with more data-oriented services like Glue or Data Pipeline. AWS also comes with interesting services for data exploration, like Athena, QuickSight or Kinesis Data Analytics for real-time querying.