The best way to learn something new is to try it. However, choosing which tool to study is not always easy. This is especially true for data storage, and therefore also for the time-series databases introduced in one of the previous posts.
The goal of this post is to make a small comparison of some available time-series databases without going into much detail. Thus, the analysis is mostly based on third-party opinions. The first section lists and compares the 3 databases from my initial shortlist. The second part names the winner and explains why it's not one of the initial 3 engines.
Time-series database benchmark
The comparison is done along the following axes:
- licence - as you may have noticed, this blog focuses mostly on Open Source solutions. Hence only Open Source databases are taken into account in the analysis.
- implementation language - since I like to see what happens under the hood, the implementation language is quite important. The following languages are taken into account, in order of importance: JVM-based (Scala, Java, Groovy, ...), Python and Golang.
- community and learning materials - the focus here is on the documentation, hands-on blog posts and ops facilities (Docker images etc.)
- project activity - the number of GitHub or JIRA issues and the commit frequency are considered here
- querying facility - how can the data be queried? Is it a SQL-like language, an API with a custom DSL, and so on? SQL-like interfaces are rated higher than DSLs.
- plugins and connectivity - this point focuses on how easily the database can be used from a given language. The order of preference is: JVM-based, Python, Golang
- specialization - ideally I'd like to spend the next weeks with a database designed purely to store time-series (= time-series databases and monitoring tools). So I deliberately eliminated Cassandra, Elasticsearch and HBase, along with all time-series solutions based on them (even though, overall, they're valid choices)
- production-ready - this term mainly covers horizontal scalability and High Availability. It also includes other aspects, such as architecture simplicity. That's why Druid, which seems to be a powerful analytics tool but is pretty complex in terms of ops, was not taken into account.
- long-term storage - projects that rely mainly on in-memory storage (e.g. Beringei) weren't included in this analysis. I wanted to be able to query several days of past data and also study how the archive policy concept is implemented.
The analysis made in this section is a "bird's eye analysis", i.e. it's based on the analyses made by other users. The list of the pages involved here is included in the "Read also" section. As my starting point I decided to focus on these 3 solutions: InfluxDB, Graphite and Prometheus. Why these 3? InfluxDB because it's a pure time-series database, apparently respected in the market. I also added Graphite and Prometheus because they're 2 major monitoring tools that I've already seen in action (but never worked with directly, so this is a good occasion to do so). The goal of the study is to choose one database that will be used to show time-series database features in the next posts.
The comparison is divided into several criteria. For each criterion, the winner (or winners, in case of a tie) receives 3 points, the runner-up receives 2 points and the last-placed database receives 1 point. The criteria are listed below:
- licence
  - InfluxDB: MIT
  - Graphite: Apache2
  - Prometheus: Apache2
- implementation language
  - InfluxDB: Golang
  - Graphite: Python
  - Prometheus: Golang
- community and learning materials
  - InfluxDB: good
  - Graphite: good
  - Prometheus: good
- project activity
  - InfluxDB: good
  - Graphite: normal
  - Prometheus: good
- querying facility
  - InfluxDB: InfluxQL, very similar to SQL
  - Graphite: the easiest way to access the data is through the Render API
  - Prometheus: custom DSL
- plugins and connectivity
  - InfluxDB: clients exist for almost all of the important languages (Java, Python, PHP, Ruby, JavaScript, Golang), all of them official projects
  - Graphite: it seems that only unofficial clients exist
  - Prometheus: official clients exist for a wide range of languages (Java, Python, Golang...)
- production-ready
  - InfluxDB: horizontal scalability seems to be provided only in the commercial Enterprise version
  - Graphite: a lot of opinions say that Graphite doesn't scale. But I found some blog posts showing how to scale it, so my final opinion is mixed. If it scales, it's not easy to do.
  - Prometheus: the scalability situation is similar to Graphite's. Apparently it's possible to scale, but only through time-series splitting or federation, neither of which seems easy at first glance. However, I found a Prometheus service with built-in scaling, called Weave Cortex, that could work.
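The scoring rule described before the list (3 points for the winner or winners, 2 for the runner-up, 1 for the last place) can be sketched in a few lines of Python. This is only a minimal illustration: the function name `score`, the database names `db_a`, `db_b`, `db_c` and the places in `example` are hypothetical and are not the per-criterion results behind the totals of this post. The post also does not say whether, after a tie for first place, the remaining database counts as runner-up or as last, so the sketch takes the places as explicit input and leaves that decision to the caller.

```python
from collections import defaultdict

# Points per place, following the rule in the post:
# winner(s) get 3 points, the runner-up 2, the last-placed database 1.
POINTS_BY_PLACE = {1: 3, 2: 2, 3: 1}


def score(rankings):
    """Sum the points of every database over all criteria.

    `rankings` maps a criterion name to a {database: place} dict,
    where place 1 is the best. Tied databases share the same place.
    """
    totals = defaultdict(int)
    for places in rankings.values():
        for database, place in places.items():
            totals[database] += POINTS_BY_PLACE[place]
    return dict(totals)


# Hypothetical places for two made-up criteria, for illustration only.
example = {
    "criterion_1": {"db_a": 1, "db_b": 2, "db_c": 1},  # db_a and db_c tie
    "criterion_2": {"db_a": 2, "db_b": 1, "db_c": 3},
}

print(score(example))
```

With 7 criteria, as in the list above, the maximum a single database can reach under this rule is 21 points.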
Why didn't I choose them?
The comparison winners are InfluxDB and Prometheus (18 points each). Graphite gathered 16 points, so it wouldn't be a bad choice either. However, after this quick analysis, none of the candidates convinced me. Initially I had expected a lot from InfluxDB. Unfortunately, the fact that horizontal scalability requires a commercial extension scared me off, even though the database performed well on the other criteria of my small comparison. The same scalability concern also put me off Graphite and Prometheus, which turned out to be harder to scale than I had thought. Fortunately, during my research I stumbled upon a post comparing Prometheus with Gnocchi. Gnocchi is an Open Source, purely time-series database, apparently horizontally scalable and highly available. Despite its apparent lack of popularity, I decided to explore this solution.
After some digging, I decided to evaluate it against the same criteria as well:
Gnocchi gathered 20 points. Although it's not as popular as Graphite or Prometheus, it's the clear winner of this subjective analysis. As already mentioned, the goal of this series of posts about time-series is to learn the common concepts (archive policy, storage strategy, cleaning, scalability...) and discover their implementation, rather than to choose a production-ready database that must support 500k writes per second and 300 concurrent reads right now. Gnocchi, since it natively supports horizontal scaling, seems to be the most complete solution to study.
The next posts tagged with Gnocchi will cover the database chosen in this post. They'll explain the architecture, the querying and the data ingestion, and also show some time-series-related points such as retention policy, granularity and storage.