Choosing time-series database for study

on waitingforcode.com

To learn something new, nothing is better than trying it. However, in some cases choosing which tool to study is not easy. This is especially true for data storage, and therefore also for the time-series databases introduced in one of the previous posts.

The goal of this post is to make a small comparison of some available time-series databases without going too deep into the details. Thus, the analysis is mostly based on third-party opinions. The first section lists and compares the 3 databases from my initial shortlist. The second part names the winner and explains why it's not one of the initial 3 engines.

Time-series database benchmark

The comparison is done along the following axes:

  • licence - as you may have noticed, this blog focuses mostly on Open Source solutions. Hence only Open Source databases are taken into account in the analysis.
  • implementation language - since I like to see what happens under the hood, the implementation language is quite important. The following languages are taken into account, in order of importance: JVM languages (Scala, Java, Groovy, ...), Python and Golang.
  • community and learning materials - a special focus is put on the documentation, hands-on blog posts and ops facilities (Docker images etc.)
  • project activity - the number of GitHub or JIRA issues and the commit frequency are taken into account here
  • querying facility - how can the data be queried? Is it a SQL-like language, an API with a custom DSL, and so on? SQL-like interfaces are rated higher than DSLs.
  • plugins and connectivity - this point focuses on how easily the database can be used from a given language. The order of preference is: JVM languages, Python, Golang
  • specialization - ideally I'd like to spend the next weeks with a database designed purely to store time-series (= time-series databases and monitoring tools). So I deliberately eliminated Cassandra, Elasticsearch and HBase, along with all time-series solutions based on them (even though overall they're valid choices)
  • production-ready - this term mainly covers horizontal scalability and High Availability. It also includes other aspects such as architectural simplicity. That's why Druid, which seems to be a powerful analytics tool but is pretty complex in terms of ops, was not taken into account.
  • long-term storage - projects relying mainly on in-memory storage (e.g. Beringei) weren't included in this analysis. I wanted to be able to query several days of past data and also study how the archive policy concept is implemented.

The analysis made in this section is a "bird's eye analysis", i.e. it's based on the analyses made by other users. The pages consulted are listed in the "Read also" section. As my starting point I decided to focus on these 3 solutions: InfluxDB, Graphite and Prometheus. Why these 3? InfluxDB because it's a pure time-series database, apparently well respected in the market. I also added Graphite and Prometheus because they're 2 major monitoring tools that I've already seen in action (but without working with them directly, so it would be a good occasion to do so). The goal of the study is to choose one database that will be used to show time-series database features in the next posts.

The comparison is divided into several points. For each point, the winner(s) receive(s) 3 points, the runner-up 2 points and the last-placed database 1 point. The comparison points are listed below:

  • licence
    InfluxDB: MIT
    Graphite: Apache2
    Prometheus: Apache2
    All three tools are good Open Source candidates, so each receives 3 points.
  • implementation language
    InfluxDB: Golang
    Graphite: Python
    Prometheus: Golang
    Since my personal affinity with Python is much greater than with Golang, Graphite earns 3 points, InfluxDB and Prometheus 2.
  • community and learning materials
    InfluxDB: good
    Graphite: good
    Prometheus: good
    No clear winner here. All the compared databases seem to provide good learning materials, both official and unofficial (community) ones. 3 points for everyone.
  • project activity
    InfluxDB: good
    Graphite: normal
    Prometheus: good
    InfluxDB and Prometheus are updated more often. Depending on the moment, their latest updates were pushed "x hours ago", while for Graphite it's rather a matter of days. Regarding releases, all 3 tools publish a new version approximately every month (with some exceptions for minor releases). Thus, InfluxDB and Prometheus win (3 pts) and Graphite is the runner-up (2 pts).
  • querying facility
    InfluxDB: InfluxQL, very similar to SQL
    Graphite: the easiest access is through the Render API
    Prometheus: custom DSL
    InfluxDB dominates in this field. It's not surprising since it's a time-series database, while its rivals are, among other things, monitoring tools supporting time-series. InfluxDB - 3 points, Graphite and Prometheus - 2.
  • plugins and connectivity
    InfluxDB: clients exist for almost all of the important languages (Java, Python, PHP, Ruby, JavaScript, Golang), all of them official projects
    Graphite: it seems that only unofficial clients exist
    Prometheus: official clients exist for a wide range of languages (Java, Python, Golang...)
    InfluxDB is the most convincing and earns 3 points. Second place goes to Prometheus (2 points) and the last to Graphite (1 point), which seems to be supported by the community, but that always carries a risk of projects being abandoned.
  • production-ready
    InfluxDB: horizontal scalability seems to be provided only in the commercial Enterprise version.
    Graphite: a lot of opinions say that Graphite doesn't scale. But I found some blog posts showing how to scale it, so my final opinion is mixed. If it scales, it's not easy to do.
    Prometheus: here the scalability point is similar to Graphite's. Apparently it's possible to scale, but only through time-series splitting or federation, which don't seem so easy at first glance. However, I found a Prometheus service with built-in scaling called Weave Cortex that could work.
    InfluxDB, even though it has horizontal scaling, is placed last. The scaling feature is a commercial one, so it goes against the Open Source character of the comparison. Scaling Graphite is apparently possible, but at first glance it seems more difficult to achieve than in Prometheus, where we can use Weave Cortex. That's why Prometheus wins (3 pts), followed by Graphite (2 pts) and InfluxDB (1 pt). High Availability doesn't make any difference since it's possible for all 3 tools by duplicating the servers.
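The points awarded above can be summed with a few lines of Python. This is only a small sketch of the arithmetic: the names `SCORES`, `DATABASES` and `total_scores` are mine, and the points for "plugins and connectivity" are derived from the ranking rule (3 for the winner, 2 for the runner-up, 1 for the last place).

```python
# Points per criterion, transcribed from the comparison above.
# Each tuple follows the order given in DATABASES.
DATABASES = ("InfluxDB", "Graphite", "Prometheus")
SCORES = {
    "licence": (3, 3, 3),
    "implementation language": (2, 3, 2),
    "community and learning materials": (3, 3, 3),
    "project activity": (3, 2, 3),
    "querying facility": (3, 2, 2),
    # Derived from the ranking: InfluxDB first, Prometheus second, Graphite last.
    "plugins and connectivity": (3, 1, 2),
    "production-ready": (1, 2, 3),
}


def total_scores(scores=SCORES, databases=DATABASES):
    """Sum the points of every criterion for each database."""
    totals = dict.fromkeys(databases, 0)
    for points in scores.values():
        for database, value in zip(databases, points):
            totals[database] += value
    return totals


if __name__ == "__main__":
    for database, total in total_scores().items():
        print(f"{database}: {total}")
```

Keeping the points in one table makes it easy to change a single mark and see how the ranking moves.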

Why didn't I choose them?

The comparison winners are InfluxDB and Prometheus (18 points each). Graphite gathered 16 points, so it wouldn't be a bad choice either. However, after this quick analysis, none of the candidates convinced me. Initially I had expected a lot from InfluxDB. Unfortunately the commercial extension for horizontal scalability scared me off, even though the database performed well in the other fields of my small comparison. The same scalability point also discouraged me from Graphite and Prometheus, which seemed less easy to scale than I had thought. Fortunately, during my research I found by chance a post comparing Prometheus with Gnocchi. Gnocchi is an Open Source, purely time-series database, apparently horizontally scalable and highly available. Despite its apparent lack of popularity, I've decided to explore this solution.

And after some digging I've decided to evaluate it against the same points:

  • licence - unsurprisingly, it's Apache 2 (3 pts)
  • implementation language - great, it's Python. Moreover, the project is led by Julien Danjou, the author of the Python books "The Hacker's Guide to Python" and "The Hacker's Guide to Scaling Python". That should guarantee discovering a lot of best practices during the code base analysis. (3 pts)
  • community and learning materials - this is where it performs worse than its rivals. The community doesn't seem to pay much attention to Gnocchi (few blog posts or conference talks). The official documentation looks pretty good as a starting point, though. (2 pts)
  • project activity - it's similar to that of InfluxDB and Prometheus. (3 pts)
  • querying facility - it supports HTTP access, which is less universal than InfluxQL but still on the same level as Prometheus and Graphite. (2 pts)
  • plugins and connectivity - apparently clients for languages other than Python don't exist. But since the goal of the study is to learn time-series, it's not a blocking point. (1 pt)
  • production-ready - it's a big surprise. It seems to be well suited for cloud computing, offering distributed storage and horizontal scalability. The question mark comes from the lack of user stories involving Gnocchi. But since the goal is to learn time-series concepts, it's not a blocking point either. (3 pts)
  • long-term storage - the data can be stored on Ceph but also on the more popular AWS S3 or Redis. (3 pts)

Gnocchi gathered 20 points. Even though it's not as popular as Graphite or Prometheus, it's the clear winner of this subjective analysis. As already said, the goal of this series of posts about time-series is to learn common concepts (archive policy, storage strategy, cleaning, scalability...) and discover their implementation rather than to choose a production-ready database that should support 500k writes per second and 300 concurrent reads right now. Gnocchi, since it natively supports horizontal scaling, seems to be the most complete solution to study.
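
The final ranking can be reproduced with a short Python sketch. The names `GNOCCHI_SCORES`, `INITIAL_TOTALS` and `ranking` are hypothetical helpers of mine. The Gnocchi points are transcribed from the list above and the totals of the three initial candidates are taken as stated in the text.

```python
# Points given to Gnocchi, transcribed from the list above.
GNOCCHI_SCORES = {
    "licence": 3,
    "implementation language": 3,
    "community and learning materials": 2,
    "project activity": 3,
    "querying facility": 2,
    "plugins and connectivity": 1,
    "production-ready": 3,
    "long-term storage": 3,
}

# Totals of the three initial candidates, as stated in the text.
INITIAL_TOTALS = {"InfluxDB": 18, "Graphite": 16, "Prometheus": 18}


def ranking(gnocchi_scores=GNOCCHI_SCORES, initial_totals=INITIAL_TOTALS):
    """Return (database, total) pairs sorted from the highest total to the lowest."""
    totals = dict(initial_totals)
    totals["Gnocchi"] = sum(gnocchi_scores.values())
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for database, total in ranking():
        print(f"{database}: {total}")
```

Note that the sum counts eight criteria for Gnocchi, since long-term storage is scored here, against the seven scored for the initial candidates.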

The next posts tagged with Gnocchi will cover the database chosen in this post. They'll explain the architecture, the querying and the data ingestion, and also show some time-series-related points such as retention policy, granularity and storage.
