Choosing a time-series database for study

To learn something new, nothing beats trying it. However, in some cases choosing the tool to study is not easy. That's especially true for data storage, and therefore also for the time-series databases introduced in one of the previous posts.

The goal of this post is to make a small comparison of some available time-series databases without going too deep into the details. Thus, the analysis is mostly based on third-party opinions. The first section lists and compares the 3 databases from my initial shortlist. The second part names the winner and explains why it's not one of the initial 3 engines.

Time-series database benchmark

The comparison is done on the following axes: licence, implementation language, community and learning materials, project activity, querying facility, plugins and connectivity, production readiness and long-term storage.

The analysis made in this section is a "bird's eye analysis", i.e. it's based on analyses made by other users. The pages involved are listed in the "Read also" section. As my starting point I decided to focus on 3 solutions: InfluxDB, Graphite and Prometheus. Why these 3? InfluxDB because it's a pure time-series database, apparently respected in the market. I also added Graphite and Prometheus because they're 2 major monitoring tools that I've already seen in action (but without working with them directly, so it would be a good occasion to do so). The goal of the study is to choose one database that will be used to show time-series database features in the next posts.

The comparison is divided into these points. Each point has one or more winners that receive 3 points. The runner-up receives 2 points and the last-placed database 1 point.
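
The scoring rule can be sketched in a few lines of Python. This is only an illustration: the function name `score_criterion` and the example ranking are hypothetical, and I assume that databases are grouped into tiers ordered from best to worst, so that two tied winners are followed by a runner-up with 2 points.

```python
from typing import Dict, List


def score_criterion(tiers: List[List[str]]) -> Dict[str, int]:
    """Score one comparison point: 3 points for the first tier,
    2 for the second, 1 for the third (assumed tie handling)."""
    points_per_tier = [3, 2, 1]
    scores: Dict[str, int] = {}
    for points, tier in zip(points_per_tier, tiers):
        for database in tier:
            scores[database] = points
    return scores


# Hypothetical ranking for a single point, NOT taken from the real comparison.
example = score_criterion([["InfluxDB", "Prometheus"], ["Graphite"]])
print(example)
```

The final score of a database is then simply the sum of its points over all comparison points.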

Why didn't I choose them?

The comparison winners are InfluxDB and Prometheus (18 points each). Graphite gathered 16 points, so it wouldn't be a bad choice either. However, after this quick analysis, none of the candidates convinced me. Initially I had expected a lot from InfluxDB. Unfortunately, the commercial extension required for horizontal scalability scared me off, even though the database performed well in the other areas of my small comparison. The same scalability point also discouraged me from Graphite and Prometheus, which seemed less easy to scale than I had thought. Fortunately, during my research I found by chance a post comparing Prometheus with Gnocchi. Gnocchi is an open source, purely time-series database, apparently horizontally scalable and highly available. Despite its apparent lack of popularity, I decided to explore this solution.

After some digging, I decided to evaluate it against the same points:

  • licence - unsurprisingly, it's Apache 2 (3 pts)
  • implementation language - great, it's Python. Moreover, the project is led by Julien Danjou, the author of the Python books "The Hacker's Guide to Python" and "The Hacker's Guide to Scaling Python". That should guarantee discovering a lot of best practices while analyzing the code base. (3 pts)
  • community and learning materials - this is where it performs worse than its rivals. The community doesn't seem to pay much attention to Gnocchi (few blog posts or conference talks). The official documentation, however, looks like a pretty good starting point. (2 pts)
  • project activity - it's similar to the one of InfluxDB and Prometheus. (3 pts)
  • querying facility - it supports HTTP access, which is less universal than InfluxQL but on the same level as Prometheus and Graphite. (2 pts)
  • plugins and connectivity - apparently clients for languages other than Python don't exist. But since the goal of the study is to learn time-series, it's not a blocking point. (1 pt)
  • production-ready - it's a big surprise. It seems to be well suited to cloud computing, offering distributed storage and horizontal scalability. The question mark comes from the lack of user stories involving Gnocchi. But since the goal is to learn time-series concepts, it's not a blocking point either. (3 pts)
  • long-term storage - data can be stored on Ceph but also on the more popular AWS S3 or on Redis. (3 pts)

Gnocchi gathered 20 points. Even though it's not as popular as Graphite or Prometheus, it's the clear winner of this subjective analysis. As already said, the goal of this series of posts about time-series is to learn common concepts (archive policy, storage strategy, cleaning, scalability...) and discover their implementation, rather than to choose a production-ready database that must support 500k writes per second and 300 concurrent reads right now. Gnocchi, since it natively supports horizontal scaling, seems to be the most complete solution to study.
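
As a quick sanity check of the arithmetic, the per-point scores given to Gnocchi above can be summed and compared with the totals of the 3 initial candidates. The snippet below is a minimal sketch; the variable names are mine and the per-point scores of InfluxDB, Prometheus and Graphite are not reproduced, only their totals quoted in this post.

```python
# Points given to Gnocchi for each comparison point (from the list above).
gnocchi_points = {
    "licence": 3,
    "implementation language": 3,
    "community and learning materials": 2,
    "project activity": 3,
    "querying facility": 2,
    "plugins and connectivity": 1,
    "production-ready": 3,
    "long-term storage": 3,
}

# Totals of the initial candidates, as quoted in the text.
totals = {"InfluxDB": 18, "Prometheus": 18, "Graphite": 16}
totals["Gnocchi"] = sum(gnocchi_points.values())

# The database with the highest total wins the comparison.
winner = max(totals, key=totals.get)
print(totals["Gnocchi"], winner)  # 20 Gnocchi
```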

The next posts tagged with Gnocchi will cover the database chosen in this post. They'll explain the architecture, the querying and the data ingestion, and also show some time-series-related points such as retention policy, granularity and storage.