Even though Elasticsearch is not a relational system, it can aggregate results. This operation is very helpful when we want to group sets of documents.
This article describes another interesting feature of Elasticsearch: aggregations. The first part presents the main components of aggregations, namely buckets and metrics. The next part describes the principal types of aggregations available in Elasticsearch. The last part shows how to implement aggregations through the Java API.
Buckets and metrics in Elasticsearch aggregation
Aggregations in Elasticsearch are based on 2 main concepts: buckets and metrics. A bucket is a set of documents matching an aggregation criterion. When an aggregation is executed, Elasticsearch checks which bucket each found document belongs to. So, if we aggregate on the "age" field, documents with the age "30" will be grouped together, those with "40" will be grouped together, and so on. Buckets can be nested. In our example, this means we can go further and split the groups of 30 and 40 year-old people into subgroups of men and women.
Metrics are, very often, simple mathematical operations that help to analyze grouped sets of documents. Thanks to metrics we can, for example, count the number of documents in each bucket or sum the values of a specific field. In the example of grouping people, metrics would let us compute the average salary of 30-year-old men living in San Francisco or count how many 40-year-old women live in Boston.
It follows that aggregations are nothing other than a mix of buckets and metrics. They can appear separately, together or even nested, as in the previous example of nested buckets. Compared to SQL, buckets play the role of the GROUP BY clause while metrics correspond to COUNT(field), SUM(field), AVG(field) and so on.
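To make the bucket/metric split more concrete, here is a minimal plain-Java sketch of the people example. It does not use Elasticsearch at all: the Person and BucketsAndMetrics classes are hypothetical names introduced only for this illustration, and the grouping is done in memory with Java streams.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical document type used only for this illustration.
final class Person {
    private final int age;
    private final String gender;
    private final double salary;

    Person(int age, String gender, double salary) {
        this.age = age;
        this.gender = gender;
        this.salary = salary;
    }

    int getAge() { return age; }
    String getGender() { return gender; }
    double getSalary() { return salary; }
}

final class BucketsAndMetrics {

    private BucketsAndMetrics() { }

    // Outer buckets: one per age. Inner (nested) buckets: one per gender.
    // Metric: average salary of the people in each innermost bucket.
    // Rough SQL analogy: SELECT age, gender, AVG(salary) ... GROUP BY age, gender
    static Map<Integer, Map<String, Double>> averageSalaryByAgeAndGender(List<Person> people) {
        return people.stream().collect(
            Collectors.groupingBy(Person::getAge,
                Collectors.groupingBy(Person::getGender,
                    Collectors.averagingDouble(Person::getSalary))));
    }
}
```

Each groupingBy call plays the role of a bucket definition, while averagingDouble plays the role of the metric applied to the documents that ended up in the bucket.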
Types of aggregations in Elasticsearch
Elasticsearch implements different types of aggregations. Some of them also exist in RDBMS, some do not. Let's begin with several popular mathematical aggregations that can be found in the majority of popular RDBMS:
- min: returns the minimal value from the grouped documents.
- max: returns the maximal value from the grouped documents.
- sum: as the name indicates, it sums the values of a specific field.
- avg: computes the average value of a specific field.
The value used by these aggregations can be read directly from a specific field, or read from a field and then transformed by a script (for example by applying a ratio to the value).
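The four metrics above, together with the optional script step, can be sketched in plain Java as follows. This is only an in-memory analogy, not Elasticsearch code: MetricsSketch is a hypothetical class, and the script parameter is an assumption standing in for a script that transforms each field value before it is aggregated.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.function.DoubleUnaryOperator;

final class MetricsSketch {

    private MetricsSketch() { }

    // Computes min, max, sum and avg in a single pass over the values of one "field".
    // The script is applied to every value first; pass v -> v to use raw field values.
    static DoubleSummaryStatistics metrics(List<Double> fieldValues, DoubleUnaryOperator script) {
        return fieldValues.stream()
            .mapToDouble(Double::doubleValue)
            .map(script)
            .summaryStatistics();
    }
}
```

Calling metrics(salaries, v -> v) mimics aggregating the raw field, while metrics(salaries, v -> v * 1.1) mimics applying a ratio to each value; the results are then available through getMin(), getMax(), getSum() and getAverage().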
Another family of Elasticsearch aggregations deals with analyzing documents rather than computing values. Among them we can find:
- filter: applies a filter to the aggregated documents. Thanks to it we can aggregate only a part of the documents found by the search. The filter aggregation is very useful with nested aggregations.
- range: groups documents into specific range(s). Taking the example from the previous part, we could use a range aggregation to group people by age: 0-18, 18-25, 25-30, 30-40 etc. Each range includes its lower bound and excludes its upper one, so a person aged 18 falls into 18-25.
- terms: groups documents by the value of a specific field. This value can be read directly from the field or computed dynamically by a script.
- missing: an interesting aggregation that finds documents which don't have a specific field (or whose value for it is NULL).
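The behaviour of the range and missing aggregations can be illustrated with a small plain-Java sketch. It is only an analogy: RangeAndMissingSketch is a hypothetical class, a null age stands for a document without the field, and the two aggregations (separate ones in Elasticsearch) are merged into a single method for brevity. The ranges follow the usual convention of an inclusive lower bound and an exclusive upper bound.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RangeAndMissingSketch {

    // Boundaries of the consecutive ranges [0,18), [18,25), [25,30), [30,40).
    private static final int[] BOUNDS = {0, 18, 25, 30, 40};

    private RangeAndMissingSketch() { }

    // Returns the bucket key (for example "18-25") or null if the age fits no range.
    static String rangeKey(int age) {
        for (int i = 0; i < BOUNDS.length - 1; i++) {
            if (age >= BOUNDS[i] && age < BOUNDS[i + 1]) {
                return BOUNDS[i] + "-" + BOUNDS[i + 1];
            }
        }
        return null;
    }

    // Counts documents per range bucket; documents without an age go to "missing".
    // Ages outside every range are not counted in any bucket.
    static Map<String, Long> countByRange(List<Integer> ages) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (Integer age : ages) {
            String key = (age == null) ? "missing" : rangeKey(age);
            if (key != null) {
                counts.merge(key, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```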
The list of aggregations also includes ones related to document structure:
- children: lets an aggregation on parent documents continue on their child documents.
- nested: lets us aggregate nested documents.
Among the remaining aggregations, we can find ones based on geographical or temporal criteria.
Aggregations example in Java API
In our project about French football stats, we use aggregations very often. Some of them are reused in many places, so they were all defined in a single utility class, QueryAggs. It looks like this (only distinct aggregations are displayed):
public final class QueryAggs {

    private QueryAggs() {
        throw new ConstructorNotInvokableException();
    }

    public static SumBuilder concededHome() {
        return AggregationBuilders.sum("goals_conceded_home").field("guestGoals");
    }

    public static MaxBuilder maxHostGoals() {
        return AggregationBuilders.max("goals").field("hostGoals");
    }

    public static TermsBuilder season() {
        return AggregationBuilders.terms("group_by_season").field("season");
    }

    public static TermsBuilder score() {
        return AggregationBuilders.terms("scores")
            .script("doc['hostGoals'].value.toString() + ':' + doc['guestGoals'].value.toString()");
    }
}
As you can see in these examples, aggregations are constructed with the utility class org.elasticsearch.search.aggregations.AggregationBuilders. It contains factory methods used to initialize specific types of aggregations, such as sum, avg, max or terms. These specific aggregations extend the abstract org.elasticsearch.search.aggregations.ValuesSourceAggregationBuilder. This class is the parent of all aggregations based on field values. Other builders extend other abstract builders: AggregationBuilder or AbstractRangeBuilder.
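To show what the score() and season() aggregations of QueryAggs conceptually compute, here is a plain-Java sketch working on an in-memory list. It is not the Elasticsearch API: Match and FootballAggsSketch are hypothetical classes, and pairing season() with maxHostGoals() as a nested metric is an assumption made for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical in-memory counterpart of an indexed match document.
final class Match {
    private final String season;
    private final int hostGoals;
    private final int guestGoals;

    Match(String season, int hostGoals, int guestGoals) {
        this.season = season;
        this.hostGoals = hostGoals;
        this.guestGoals = guestGoals;
    }

    String getSeason() { return season; }
    int getHostGoals() { return hostGoals; }
    int getGuestGoals() { return guestGoals; }
}

final class FootballAggsSketch {

    private FootballAggsSketch() { }

    // Like a scripted terms aggregation: the bucket key is computed per document
    // as "hostGoals:guestGoals", and each bucket holds the number of matches.
    static Map<String, Long> countByScore(List<Match> matches) {
        return matches.stream().collect(Collectors.groupingBy(
            m -> m.getHostGoals() + ":" + m.getGuestGoals(),
            Collectors.counting()));
    }

    // Like a terms aggregation on "season" with a max metric on "hostGoals".
    static Map<String, Integer> maxHostGoalsBySeason(List<Match> matches) {
        return matches.stream().collect(Collectors.toMap(
            Match::getSeason, Match::getHostGoals, Integer::max));
    }
}
```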
As mentioned in the first part of this article, aggregations can be nested. Programmatically, this is achieved with the subAggregation(AbstractAggregationBuilder aggregation) method of AggregationBuilder. This builder method appends the subaggregation to a list of nested aggregations:
public B subAggregation(AbstractAggregationBuilder aggregation) {
    if (aggregations == null) {
        aggregations = Lists.newArrayList();
    }
    aggregations.add(aggregation);
    return (B) this;
}
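The return (B) this idiom in the method above is what keeps the chain fluent: the parent class returns the concrete builder type, so methods of the subclass remain available after subAggregation(). The sketch below reproduces this pattern with simplified stand-in classes. All Sketch* names are hypothetical and are not the Elasticsearch classes; only the self-typed generic parameter and the lazily created list mirror the original method.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a self-typed aggregation builder.
abstract class SketchAggBuilder<B extends SketchAggBuilder<B>> {
    private final String name;
    private List<SketchAggBuilder<?>> subAggregations; // created on first use

    SketchAggBuilder(String name) {
        this.name = name;
    }

    // Appends a nested aggregation and returns the concrete builder type.
    @SuppressWarnings("unchecked")
    B subAggregation(SketchAggBuilder<?> aggregation) {
        if (subAggregations == null) {
            subAggregations = new ArrayList<>();
        }
        subAggregations.add(aggregation);
        return (B) this;
    }

    abstract String type();

    // Renders the builder tree, for example "terms(group_by_season)[max(goals)]".
    String describe() {
        StringBuilder out = new StringBuilder(type() + "(" + name + ")");
        if (subAggregations != null) {
            List<String> parts = new ArrayList<>();
            for (SketchAggBuilder<?> sub : subAggregations) {
                parts.add(sub.describe());
            }
            out.append("[").append(String.join(", ", parts)).append("]");
        }
        return out.toString();
    }
}

final class SketchTermsBuilder extends SketchAggBuilder<SketchTermsBuilder> {
    private String field;

    SketchTermsBuilder(String name) { super(name); }

    SketchTermsBuilder field(String field) {
        this.field = field;
        return this;
    }

    String getField() { return field; }

    @Override
    String type() { return "terms"; }
}

final class SketchMaxBuilder extends SketchAggBuilder<SketchMaxBuilder> {
    SketchMaxBuilder(String name) { super(name); }

    @Override
    String type() { return "max"; }
}
```

With these classes, new SketchTermsBuilder("group_by_season").subAggregation(new SketchMaxBuilder("goals")).field("season") compiles, because subAggregation() returns SketchTermsBuilder rather than the abstract parent type.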
Buckets can be retrieved from the response through the getBuckets() method defined in the abstract class org.elasticsearch.search.aggregations.bucket.terms.InternalTerms. We can also directly access a specific bucket by calling public Terms.Bucket getBucketByKey(String term) of the same class. We only need to know the key (term) of the given bucket.
Aggregations are a combination of buckets and metrics, presented at the beginning of this article. We also discovered the families of available aggregations, which cover, among others, mathematical operations, document fields and document structures. At the end we saw the fluent way of defining aggregations with the Elasticsearch Java API.