Almost every year a new data-centric architecture concept appears. In 2014 Jay Kreps published the Kappa architecture. A year later another concept emerged: the Zeta architecture.
This post presents the Zeta architecture. The first part explains its basics. The second and last part shows an implementation of the Zeta architecture for an advertising platform.
Zeta architecture presentation
The Zeta architecture was published in spring 2015 by Jim Scott. This new concept for data-centric systems aims to solve many problems of previously used architectures, especially those related to business continuity and efficient use of resources.
According to the quoted post, Zeta seems to be used by the biggest data-driven companies, such as Google. For Jim Scott, even though Google hasn't published any formal documentation about its architecture, Zeta concepts can be recognized in some of Google's services, such as Gmail.
The proposed architecture is built on pluggable components. Together they form a holistic architecture. This term is important because it describes one of the main points of Zeta: a hierarchy of 7 small but coupled high-level components (high-level meaning that they don't rely on particular software and only describe the general idea):
- Distributed File System (DFS) - it's the common data location for all applications. This part should be reliable and scalable. A great example of DFS is HDFS.
- Real-Time Data Storage - this component is responsible for delivering responses quickly. It's based on real-time technologies, especially NoSQL solutions such as HBase or Couchbase, or NewSQL databases such as Spanner, used internally by Google.
- Enterprise Applications (EA) - in the past this layer dictated the rest of the architecture. In Zeta it's only one member with the same rights and duties as the others. EA components are required to achieve the business goals of the system. Examples of this layer are web servers or business applications (e.g. Gmail in the case of Google's implementation).
- Solution Architecture - focuses on a specific business problem. Unlike enterprise architecture, it addresses a narrower problem, e.g. ads recommendation in a mail system. Different solutions can be combined to build the solution to a more global problem.
- Pluggable Compute Model/Execution Engine - this is the place to implement all analytic computations, so it's where technologies like Apache Spark, Hadoop MapReduce or Apache Drill belong. The important point is that they must be pluggable because the business applications of the system can have different needs. Pluggable compute models or execution engines composed together are able to match these needs.
- Dynamic and Global Resource Management - fixed-size resource allocations are often not fully used, e.g. out of 2GB of memory reserved for a web application, only 40% is used. The answer to this problem is a global resource manager, one of the Zeta components. Thanks to it, unused resources can be dynamically allocated to the components that need them more. This manager is considered a key member of the Zeta architecture because it can influence every other component. Examples of global resource managers come from the Big Data world: Apache Mesos or Apache Hadoop YARN.
- Deployment/Container Management System - the goal of this member is to guarantee a single, standardized method of deployment. It also implies that the deployed resources are isolated containers that are not affected by environment changes, i.e. they are deployed in the same manner in the local environment as in production. Thanks to this isolation, the containers can be freely moved between machines with a guarantee of repeatability (the results on a local server will be the same as on the production one). A well-known example of isolated containers is Docker, but Kubernetes or Mesos can also be used.
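The idea behind dynamic and global resource management can be illustrated with a toy sketch. It reuses the figures from the list above (2GB reserved, about 40% used). The `Component` class and the `unused_mb` and `grant_unused` functions are hypothetical names introduced for this illustration only; they don't reflect how Apache Mesos or Hadoop YARN actually schedule resources.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    reserved_mb: int  # memory reserved by a fixed-size allocation
    used_mb: int      # memory really used by the component


def unused_mb(component: Component) -> int:
    """Memory reserved by the component but not used."""
    return max(0, component.reserved_mb - component.used_mb)


def grant_unused(components, demands):
    """Pool the unused memory and grant it to the most demanding components first.

    `demands` maps a component name to the extra memory (in MB) it asks for.
    """
    pool = sum(unused_mb(c) for c in components)
    grants = {}
    for name, wanted in sorted(demands.items(), key=lambda kv: kv[1], reverse=True):
        granted = min(wanted, pool)
        grants[name] = granted
        pool -= granted
    return grants


# A web application using about 40% of its 2GB, next to a saturated batch job.
web_app = Component("web-app", reserved_mb=2048, used_mb=819)
spark_job = Component("spark-job", reserved_mb=4096, used_mb=4096)
grants = grant_unused([web_app, spark_job], {"spark-job": 2000})
```

Here the batch job asks for 2000 MB more and receives only what the web application leaves unused. A real resource manager also has to deal with priorities, preemption and giving the memory back when the original owner needs it.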
Using Zeta has some advantages, especially in big and complex data-centric systems:
- the abstraction of containers reduces the time and cost of deployment; it also facilitates maintenance
- thanks to the distributed file system (DFS) the workflows are often simpler. The data is written to and read from DFS directly, without any additional logic to implement, for example in the form of message queues
- isolated containers also make testing and debugging easier
- optimized resource management improves system throughput
- reinforced business continuity thanks to:
  - resilience - guarantees high availability of the system, e.g. backup node or replication in HDFS
  - contingency - deployed applications behave in the same manner in all environments thanks to containers
Zeta implementation example
The Zeta architecture is well suited to different cases: complex data-centric web applications, machine learning systems, Big Data or analytic solutions and so on. The example below shows the use of Zeta for an advertising platform. It uses a simplified schema from the Zeta architecture white paper.
The first part focuses on the storage layer. Ads are generated from logged user behavior. In a classical architecture, this behavior lands in a local directory on the server and is then queued to DFS. In the Zeta-oriented approach, the logs are sent directly to DFS. It removes the complexity of an additional transmission layer.
The second part concerns the transfer of the collected logs to the execution engine. With the traditional approach, the logs are moved to this engine by a log streaming service (e.g. Apache Flume). In Zeta, thanks to DFS, the execution engine loads the logs directly. The same workflow applies to the advertising engine, whose logs are written directly to DFS and processed afterwards by the execution engine.
After that, the role of the execution engine doesn't change. It puts the computed results into databases (real-time data storage) containing user profiles, optimized advertising configuration and billing information. The computed data is then read by the advertising engine and sent back to the web server. The difference is that in the case of Zeta the data is sent locally between containers, while in the classical case it's moved between different machines.
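The log flow described above can be sketched as follows. It's only an illustration under strong assumptions: a temporary local directory stands in for DFS, a plain dictionary stands in for the real-time data storage, and the function names (`log_event`, `compute_profiles`) and the event fields (`user`, `category`) are invented for the example.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def log_event(dfs_root: Path, source: str, event: dict) -> None:
    """Append one event as a JSON line directly to the shared location.

    There is no local spool directory and no queue between the producer and DFS.
    """
    path = dfs_root / f"{source}.log"
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")


def compute_profiles(dfs_root: Path) -> dict:
    """Execution engine stand-in: read every log directly and count categories per user."""
    profiles = {}
    for path in sorted(dfs_root.glob("*.log")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                event = json.loads(line)
                profiles.setdefault(event["user"], Counter())[event["category"]] += 1
    return profiles


with tempfile.TemporaryDirectory() as tmp:
    dfs = Path(tmp)  # hypothetical stand-in for a DFS mount point
    # The web server and the advertising engine both write to the same location.
    log_event(dfs, "web-server", {"user": "u1", "category": "sports"})
    log_event(dfs, "web-server", {"user": "u1", "category": "sports"})
    log_event(dfs, "ad-engine", {"user": "u2", "category": "travel"})
    # The execution engine loads the logs without any streaming service in between.
    profiles = compute_profiles(dfs)
```

Both producers and the consumer only share a location, which is the point of the Zeta workflow: the intermediate transport layer disappears. In a real deployment the directory would be replaced by a DFS client and the dictionary by a database such as the ones listed in the real-time data storage component.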
Another difference lies in the nature of the components. Zeta officially divides them into 2 groups: those offering resources (resource manager, DFS) and those consuming resources (enterprise application, execution engine). The members of the first group, unlike the members of the second group, should never be containerized. Containerizing the members of the second group helps to provide deployment repeatability, which can facilitate monitoring and debugging.
The Zeta architecture brings a fresh view on data-centric architectures. It eliminates a lot of intermediaries in data processing and helps to optimize dynamic resource management. It also improves business continuity by providing a single way of deployment and reducing the risk of unnoticed regressions that depend on the deployment environment.