Aggregations in Elasticsearch

Even though Elasticsearch is not a relational system, it can aggregate results. This operation is very helpful when we want to group a set of documents.

This article describes another interesting feature of Elasticsearch: aggregations. The first part presents the main components of aggregations, namely buckets and metrics. The next part describes some of the principal types of aggregations available in Elasticsearch. The last part shows how to implement aggregations through the Java API.

Buckets and metrics in Elasticsearch aggregation

Aggregations in Elasticsearch are based on two main concepts: buckets and metrics. A bucket is a set of documents matching an aggregation criterion. When an aggregation is executed, Elasticsearch checks which bucket each found document belongs to. So, if we aggregate on the "age" field, documents with the age "30" are grouped together, those with "40" are grouped together, and so on. Buckets can be nested. In our example, this means we can go further and split the buckets of 30- and 40-year-old people into sub-buckets of men and women.
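To make the idea of nested buckets more concrete, here is a minimal in-memory sketch written with plain Java streams. It is only a conceptual model and does not use the Elasticsearch API; the Person class and the bucketByAgeThenGender method are hypothetical names introduced for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class NestedBucketsSketch {

  // Hypothetical document type, introduced only for this illustration.
  public static final class Person {
    private final String name;
    private final int age;
    private final String gender;

    public Person(String name, int age, String gender) {
      this.name = name;
      this.age = age;
      this.gender = gender;
    }

    public String getName() { return name; }
    public int getAge() { return age; }
    public String getGender() { return gender; }
  }

  // Outer buckets are keyed by age, inner buckets by gender.
  // Each inner bucket holds the names of the people that fell into it.
  public static Map<Integer, Map<String, List<String>>> bucketByAgeThenGender(List<Person> people) {
    return people.stream().collect(
        Collectors.groupingBy(Person::getAge,
            Collectors.groupingBy(Person::getGender,
                Collectors.mapping(Person::getName, Collectors.toList()))));
  }

  public static void main(String[] args) {
    List<Person> people = Arrays.asList(
        new Person("Ann", 30, "F"),
        new Person("Bob", 30, "M"),
        new Person("Cid", 40, "M"),
        new Person("Dee", 30, "F"));
    // Prints the nested age -> gender -> names structure.
    System.out.println(bucketByAgeThenGender(people));
  }
}
```

Every document lands in exactly one inner bucket here, which mirrors how a terms-like grouping on "age" followed by a grouping on gender would split the data.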

Metrics are, very often, simple mathematical operations that help to analyze grouped sets of documents. Thanks to metrics we can, for example, count the number of documents in each bucket or sum the values of a specific field. In the example of grouping people, metrics let us easily compute the average salary of 30-year-old men living in San Francisco or count how many 40-year-old women live in Boston.
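The relationship between a bucket and its metrics can be sketched in plain Java as well. The snippet below is a conceptual model only, not Elasticsearch code; the Employee class and both method names are assumptions made for the example. It groups people by city (the bucket) and computes a document count and an average salary (the metrics) for each group.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class BucketMetricsSketch {

  // Hypothetical document type, introduced only for this illustration.
  public static final class Employee {
    private final String city;
    private final double salary;

    public Employee(String city, double salary) {
      this.city = city;
      this.salary = salary;
    }

    public String getCity() { return city; }
    public double getSalary() { return salary; }
  }

  // Metric 1: number of documents in each city bucket.
  public static Map<String, Long> countByCity(List<Employee> employees) {
    return employees.stream().collect(
        Collectors.groupingBy(Employee::getCity, Collectors.counting()));
  }

  // Metric 2: average salary in each city bucket.
  public static Map<String, Double> averageSalaryByCity(List<Employee> employees) {
    return employees.stream().collect(
        Collectors.groupingBy(Employee::getCity,
            Collectors.averagingDouble(Employee::getSalary)));
  }

  public static void main(String[] args) {
    List<Employee> employees = Arrays.asList(
        new Employee("San Francisco", 1000.0),
        new Employee("San Francisco", 3000.0),
        new Employee("Boston", 2500.0));
    System.out.println(countByCity(employees));
    System.out.println(averageSalaryByCity(employees));
  }
}
```

The grouping function plays the role of the bucket definition, and the downstream collector plays the role of the metric, which is close to the GROUP BY plus COUNT or AVG pairing in SQL.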

It follows that aggregations are nothing more than a mix of buckets and metrics. They can appear separately, together, or nested, as in the previous example of nested buckets. Compared to SQL, buckets can be thought of as the GROUP BY clause and metrics as COUNT(field), SUM(field), AVG(field) and so on.

Types of aggregations in Elasticsearch

Elasticsearch implements different types of aggregations. Some of them also exist in RDBMS, some do not. Let's begin with several popular mathematical aggregations, most of which can be found in the majority of popular RDBMS:

The value of these aggregations can be extracted directly from a specific field, or extracted from a field and then processed by a script (for example, by applying a ratio to the computed value).
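The difference between a value taken directly from a field and a value processed by a script can be modeled with a small plain-Java sketch. This is not Elasticsearch scripting; the "script" is represented here by a DoubleUnaryOperator, and the class name, method name and the 1.5 ratio are assumptions chosen for the illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.DoubleUnaryOperator;

public class ScriptedValueSketch {

  // Sums the field values after passing each one through the "script".
  // Passing DoubleUnaryOperator.identity() models the plain field case.
  public static double sum(List<Double> fieldValues, DoubleUnaryOperator script) {
    return fieldValues.stream()
        .mapToDouble(Double::doubleValue)
        .map(script)
        .sum();
  }

  public static void main(String[] args) {
    List<Double> salaries = Arrays.asList(1000.0, 2000.0);

    // Value extracted directly from the field.
    double raw = sum(salaries, DoubleUnaryOperator.identity());

    // Value extracted from the field, then adjusted by a ratio.
    double adjusted = sum(salaries, value -> value * 1.5);

    System.out.println(raw + " " + adjusted);
  }
}
```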

Another family of Elasticsearch aggregations concerns document analysis rather than computation. Among them we can find:

In the list of aggregations we can also distinguish those about document structure:

Among the remaining aggregations, we can find those based on geographical or temporal criteria.

Aggregations example in Java API

In our project about French football stats, we use aggregations very often. Some of them are reused frequently, so they were all defined in a single utility class, QueryAggs. It looks like this (only distinct aggregations are displayed):

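import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.TermsBuilder;
import org.elasticsearch.search.aggregations.metrics.max.MaxBuilder;
import org.elasticsearch.search.aggregations.metrics.sum.SumBuilder;

public final class QueryAggs {

  // Utility class: instantiation is forbidden.
  // ConstructorNotInvokableException is a project-specific exception.
  private QueryAggs() {
    throw new ConstructorNotInvokableException();
  }

  // Metric: goals conceded at home, i.e. the sum of goals scored by guests.
  public static SumBuilder concededHome() {
    return AggregationBuilders.sum("goals_conceded_home").field("guestGoals");
  }

  // Metric: the highest number of goals scored by a host team.
  public static MaxBuilder maxHostGoals() {
    return AggregationBuilders.max("goals").field("hostGoals");
  }

  // Bucket: one bucket per distinct value of the "season" field.
  public static TermsBuilder season() {
    return AggregationBuilders.terms("group_by_season").field("season");
  }

  // Bucket: one bucket per final score, built by a script as "hostGoals:guestGoals".
  public static TermsBuilder score() {
    return AggregationBuilders.terms("scores")
      .script("doc['hostGoals'].value.toString() + ':' + doc['guestGoals'].value.toString()");
  }

}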

As these examples show, aggregations are constructed with the utility class org.elasticsearch.search.aggregations.AggregationBuilders. It contains factory methods used to initialize specific types of aggregations, such as sum, avg, script or terms. These specific aggregations extend the abstract org.elasticsearch.search.aggregations.ValuesSourceAggregationBuilder, the parent class for all aggregations based on field values. Other builders extend other abstract builders: AggregationBuilder or AbstractRangeBuilder.

As mentioned in the first part of this article, aggregations can be nested. Programmatically, this is achieved with the subAggregation(AbstractAggregationBuilder aggregation) method of AggregationBuilder. This builder method appends the sub-aggregation to a list of nested aggregations:

public B subAggregation(AbstractAggregationBuilder aggregation) {
  if (aggregations == null) {
    aggregations = Lists.newArrayList();
  }
  aggregations.add(aggregation);
  return (B) this;
}
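The "return (B) this" line is what keeps the fluent chain typed: the method returns the concrete builder type rather than the abstract parent. The sketch below reproduces that pattern with simplified stand-in classes. AggBuilder, TermsAgg, MetricAgg, size and describe are hypothetical names for this illustration and are not the real Elasticsearch classes or methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class SubAggregationSketch {

  // Simplified stand-in for an abstract aggregation builder.
  // B is the concrete builder type, so fluent calls keep that type.
  public abstract static class AggBuilder<B extends AggBuilder<B>> {
    private final String name;
    private List<AggBuilder<?>> subAggregations; // created lazily, as in the snippet above

    protected AggBuilder(String name) {
      this.name = name;
    }

    @SuppressWarnings("unchecked")
    public B subAggregation(AggBuilder<?> aggregation) {
      if (subAggregations == null) {
        subAggregations = new ArrayList<>();
      }
      subAggregations.add(aggregation);
      return (B) this;
    }

    // Renders the nesting as name[child, child, ...] for inspection.
    public String describe() {
      if (subAggregations == null) {
        return name;
      }
      return name + subAggregations.stream()
          .map(child -> child.describe())
          .collect(Collectors.joining(", ", "[", "]"));
    }
  }

  // Bucket-like builder with one method of its own.
  public static final class TermsAgg extends AggBuilder<TermsAgg> {
    private int size;

    public TermsAgg(String name) {
      super(name);
    }

    public TermsAgg size(int size) {
      this.size = size;
      return this;
    }

    public int getSize() {
      return size;
    }
  }

  // Metric-like builder.
  public static final class MetricAgg extends AggBuilder<MetricAgg> {
    public MetricAgg(String name) {
      super(name);
    }
  }

  public static void main(String[] args) {
    // size(10) compiles after subAggregation(...) only because
    // subAggregation returns TermsAgg and not the abstract parent type.
    TermsAgg bySeason = new TermsAgg("group_by_season")
        .subAggregation(new MetricAgg("goals_conceded_home"))
        .subAggregation(new MetricAgg("goals"))
        .size(10);
    System.out.println(bySeason.describe());
  }
}
```

The unchecked cast is safe as long as every subclass passes itself as the type parameter, which is the convention this kind of self-typed builder relies on.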

Buckets can be retrieved from the response through the getBuckets() method defined in the abstract class org.elasticsearch.search.aggregations.bucket.terms.InternalTerms. We can also access a specific bucket directly by calling public Terms.Bucket getBucketByKey(String term) of the same class. We only need to know the key (term) associated with the given bucket.

Aggregations are a combination of buckets and metrics, presented at the beginning of this article. We also discovered the families of available aggregations, which concern, among others, mathematical operations, document fields and document structures. At the end we saw the fluent way of defining aggregations with the Elasticsearch Java API.

