HTTP-based data ingestion to streaming brokers

Data ingestion is the starting point for all data systems. It can work in batch or streaming mode. I've recently covered the batch ingestion pretty much already with previous blog posts but I haven't done anything for the streaming, yet. Until today when you can read a few words about HTTP-based data ingestion to cloud streaming brokers.

New ebook ๐Ÿ”ฅ

Learn 84 ways to solve common data engineering problems with cloud services.

๐Ÿ‘‰ I want my Early Access edition

Data ingestion aspects

Ingesting data from an HTTP endpoint is not only a matter of exposing a web service accepting POST requests. The architecture brings several challenging points with regard to the endpoint and the destination store, such as:

Reliability and fault-tolerance

Let me start with the first 2 points that can be fully addressed on the application side. The first solution uses synchronous communication between the client and the endpoint. In the first step, the client sends one or multiple events to the ingestion API and waits for the confirmation of the successful delivery which is also synchronous. You can find the example in the following schema:

This solution is relatively easy to implement, but the synchronous character brings extra latency. The client is obliged to buffer the new events as long as the endpoint doesn't confirm the delivery. Additionally, the client buffers the data and in case of a failure, it can lose it. Can we do better? Yes but let me solve one problem at a time:

Unfortunately, these improvements also negatively impact the interaction. First, writing things on a disk is slower than keeping them in memory. Secondly, the solution has an at-least-once delivery guarantee. Despite a correct delivery to the streaming broker, the endpoint can still fail before notifying the client about it.

However, as with many other solutions, the one designed here is also based on trade-offs. If you have to ensure all messages are delivered only once, the synchronous approach looks best. On the other hand, if you favor the performance over the exactly-once delivery, the asynchronous code without the local checkpoint sounds better. And if you do care about reducing the data loss, you can think about extending the approach by checkpointing.

Data ingestion infrastructure

AWS, Azure and GCP have a dedicated API management service. It doesn't provide the compute capacity, though. The service acts more as a proxy between the client and the cloud resources responsible for executing the HTTP request:

Each component from the schema has its own implementation and responsibilities:

I've already covered pretty much the Request handler in the previous section by presenting the synchronous and asynchronous data writing. The remaining component to explain is the gateway. I summarized the answers in the following table:

AWS
API Gateway
Azure
API Management
GCP
API Gateway
Backpressure (throttling) Yes. The service supports rate limit configuration. The configuration works at various levels: client (different limit for each API key), method (different limit for each API method), or API (common for all API methods). Yes. The service supports the feature with Inbound processing policies. The mechanism has a fine-grained configuration that can define special throttling rules for a key, like the request IP address, or a header. Yes. However, it's not easy to find in the documentation. I've only discovered how to configure this part in this Beranger Natanelic's blog post. The rate limits implementation uses Open API specification passed to the gateway configuration during the setup.
Authentication The service implements various authentication mechanisms: Lambda Authorizers (token-based or request-basted), resource policies (defines access rules for IPs and VPCs), IAM (roles, policies and tags), endpoint policies for interface VPC endpoints, and Amazon Cognito user pools. The authentication uses one of the available policies: HTTP Basic (Authorization header in the request), client certificate, managed identity. The service has 3 main authentication mechanisms: API keys, service accounts, and JSON Web Tokens (JWT), also called Google ID tokens.
Networking API Gateway can be public and open to internet access, or private, hence accessible only from a VPC. Resource policies can provide even finer-grained access and reduce the accessibility from select VPC, not necessarily located in the same AWS account. An API Management instance can operate inside a VNet. It makes the API private and enables the communication with other resources present in this hidden network space. I didn't find a way to implement the private connectivity inside a VPC.
Data encryption in transit The APIs exposed by the service work only with HTTPS protocol. The gateway supports 2 versions of the TLS protocol, 1.0 and 1.2. You can customize them in the security policy. The API Management supports multiple TLS protocols for the client and backend sides. Moreover, you can enforce the HTTPS access with an appropriate Azure Policy definition. The service supports HTTPS access with the SSL certifications managed by GCP.
Scaling If the gateway doesn't have a throttling configuration, it passes the incoming requests with the respect of service quotas, to the backend service. For example, a Lambda backend will use the auto-scaling capabilities of the Lambda service, still limited by quotas, though. The scaling consists of manipulating units. They represent the compute capability of the service. Except for changing the tier from/to Developer, units change doesn't involve downtime. Additionally, the service supports auto-scaling. The Standard and Premium tiers use Azure Monitor rules for that while the Consumption tier scales automatically, without any configuration required. The GCP implementation is described as a scale-to-zero service. It means that the service will adapt the number of instances to the traffic, respecting the service quotas. As for AWS, the backend also impacts the scaling capabilities.

Even though your data engineering responsibility should end with the reliability and fault-tolerance, it's good to be aware of what happens with the runtime environment presented in the last section. It has a significant impact on the concurrency and can drive some of your implementation decisions.