Reverse nested aggregation in Elasticsearch on waitingforcode.com

Aggregations are a really powerful Elasticsearch feature. Besides aggregations known from RDBMS, such as sum, min, max, count, they offer the possibility to apply aggregation on different levels. It's particularly useful with nested documents.

Data Engineering Design Patterns

Looking for a book that defines and solves most common data engineering problems? I wrote one on that topic! You can read it online on the O'Reilly platform, or get a print copy on Amazon.

I also help solve your data engineering problems 👉 contact@waitingforcode.com 📩

This article shows an example of aggregation on nested documents in two scenarios - applied to nested documents and applied to parent documents. But it will begin by setup part.

Test data

Our theme will be remain French football. To show aggregation on nested document we'll index players playing for given team. Below ou can find appropriate mapping:

// PUT on http://localhost:9200/league
{
   "settings":{
      "number_of_shards":1,
      "number_of_replicas":0
   },
   "mappings":{
      "club":{
         "properties":{
            "name":{
               "type":"string",
               "index":"not_analyzed"
            },
            "players":{
               "type":"nested",
               "properties":{
                  "identity":{
                     "type":"string",
                     "index":"not_analyzed"
                  },
                  "games":{
                     "type":"integer",
                     "index":"not_analyzed"
                  },
                  "nationality":{
                     "type":"string",
                     "index":"not_analyzed"
                  }
               }
            }
         }
      }
   }
}

Now, we can index some data with given bulk request:

// POST on http://localhost:9200/league/_bulk
{"index": {"_index": "league", "_type": "club"}}
{"name": "Team 1", "players": [{"identity": "Player_1", "games": 30, "nationality": "FR"},{"identity": "Player_2", "games": 15, "nationality": "DE"},{"identity": "Player_3", "games": 34, "nationality": "FR"},{"identity": "Player_4", "games": 11, "nationality": "BR"},{"identity": "Player_5", "games": 4, "nationality": "BE"},{"identity": "Player_6", "games": 11, "nationality": "FR"}]} 
{"index": {"_index": "league", "_type": "club"}}
{"name": "Team 2", "players": [{"identity": "Player_20", "games": 11, "nationality": "FR"},{"identity": "Player_21", "games": 15, "nationality": "FR"},{"identity": "Player_22", "games": 34, "nationality": "FR"},{"identity": "Player_23", "games": 30, "nationality": "FR"},{"identity": "Player_24", "games": 4, "nationality": "FR"},{"identity": "Player_25", "games": 11, "nationality": "FR"}]} 
{"index": {"_index": "league", "_type": "club"}}
{"name": "Team 3", "players": [{"identity": "Player_30", "games": 11, "nationality": "FR"},{"identity": "Player_31", "games": 15, "nationality": "FR"},{"identity": "Player_32", "games": 12, "nationality": "FR"},{"identity": "Player_33", "games": 15, "nationality": "FR"},{"identity": "Player_34", "games": 4, "nationality": "FR"},{"identity": "Player_35", "games": 11, "nationality": "FR"}]} 
{"index": {"_index": "league", "_type": "club"}}
{"name": "Team 3", "players": [{"identity": "Player_30", "games": 11, "nationality": "FR"},{"identity": "Player_31", "games": 15, "nationality": "FR"},{"identity": "Player_32", "games": 12, "nationality": "FR"},{"identity": "Player_33", "games": 15, "nationality": "FR"},{"identity": "Player_34", "games": 4, "nationality": "FR"},{"identity": "Player_35", "games": 11, "nationality": "FR"}]}

Aggregations on nested documents

We start with simple question - how many players in each team played in at least 30 games. Firstly, the query uses terms filter to group documents by teams. After it starts nested aggregation with filter aggregation to count the number of players with at least 30 games. The query JSON looks like below:

// POST on http://localhost:9200/league/club/_search
{
  "query":{
    "match_all":{}
  },
  "aggs":{
    "by_team":{
      "terms":{
        "field":"name"
      },
      "aggs":{
        "at_least_30_games":{
          "nested":{
            "path":"players"
          },
          "aggs":{
            "count_players":{
              "filter":{
                "range":{
                  "players.games":{
                    "gte":30
                  }
                }
              }
            }
          }
        }
      }
    }
  },
  "size":0
}

The result of this query is:

{
  "took":5,
  "timed_out":false,
  "_shards":{
    "total":1,
    "successful":1,
    "failed":0
  },
  "hits":{
    "total":4,
    "max_score":0.0,
    "hits":[]
  },
  "aggregations":{
    "by_team":{
      "doc_count_error_upper_bound":0,
      "sum_other_doc_count":0,
      "buckets":[
        {
          "key":"Team 3",
          "doc_count":2,
          "at_least_30_games":{
            "doc_count":12,
            "count_players":{
              "doc_count":0
            }
          }
        },
        {
          "key":"Team 1",
          "doc_count":1,
          "at_least_30_games":{
            "doc_count":6,
            "count_players":{
              "doc_count":2
            }
          }
        },
        {
          "key":"Team 2",
          "doc_count":1,
          "at_least_30_games":{
            "doc_count":6,
            "count_players":{
              "doc_count":2
            }
          }
        }
      ]
    }
  }
}

The results are the same as expected - only two teams are playes with at least 30 games. But it was a simple part of the use case.

Nested aggregation applied to root document

Now, we would like to know how many teams have players who played in at least 30 games. If you analyze well response from previous part, you could notice that we can know that by comparing count_players to 0. But we can do this in another way, by allowing Elasticsearch to put 0 if given team doesn't have this kind of players, or 1 if it has them. It's achieved thanks to reverse_nested aggregation.

Below you can find used query:

{
  "query":{
    "match_all":{}
  },
  "aggs":{
    "by_team":{
      "terms":{"field":"name"},
      "aggs":{
        "at_least_30_games":{
          "nested":{"path":"players"},
          "aggs":{
            "count_players":{
              "filter":{
                "range":{
                  "players.games":{"gte":30}
                }
              },
              "aggs":{
                "team_has_players_at_least_30_games":{
                  "reverse_nested":{}
                }
              }
            }
          }
        }
      }
    }
  },
  "size":0
}

Returned response contains 0/1 flag to indicate if given team contains or not players with at least 30 games:

{
   "took":6,
   "timed_out":false,
   "_shards":{
      "total":1,
      "successful":1,
      "failed":0
   },
   "hits":{
      "total":3,
      "max_score":0.0,
      "hits":[

      ]
   },
   "aggregations":{
      "by_team":{
         "doc_count_error_upper_bound":0,
         "sum_other_doc_count":0,
         "buckets":[
            {
               "key":"Team 1",
               "doc_count":1,
               "at_least_30_games":{
                  "doc_count":6,
                  "count_players":{
                     "doc_count":2,
                     "team_has_players_at_least_30_games":{
                        "doc_count":1
                     }
                  }
               }
            },
            {
               "key":"Team 2",
               "doc_count":1,
               "at_least_30_games":{
                  "doc_count":6,
                  "count_players":{
                     "doc_count":2,
                     "team_has_players_at_least_30_games":{
                        "doc_count":1
                     }
                  }
               }
            },
            {
               "key":"Team 3",
               "doc_count":2,
               "at_least_30_games":{
                  "doc_count":6,
                  "count_players":{
                     "doc_count":0,
                     "team_has_players_at_least_30_games":{
                        "doc_count":0
                     }
                  }
               }
            }
         ]
      }
   }
}

As we can deduce, Elasticsearch applies previously defined filter aggregation on nested documents. But unlike in previous part, it doesn't count nested documents. Instead of that, it make a count on parent documents (teams). And because we've indexed only 2 documents for 2 different teams matching the query, Elasticsearch logically returns 0 or 1.

But beware - let's try to decrease the number of matches to 5. Because we have indexed 2 different documents for "Team 3", its team_has_players_at_least_5_games documents counter will be equal to 2:

{
  "key":"Team 3",
  "doc_count":2,
  "at_least_5_games":{
    "doc_count":12,
    "count_players":{
      "doc_count":10,
      "team_has_players_at_least_5_games":{
        "doc_count":2
      }
    }
  }
}

Now we could also imagine that players are nested documents not for teams but, for trainers which are nested documents of teams. We could want to apply the filtered aggregation not to teams but to coaches. To do so, we should configure path attribute on reverse_nested aggregation in this way: "reverse_nested":{"path" : "coaches"}.

Reverse nested aggregation is an useful thing to know in work with Elasticsearch aggregations. It allows to apply some aggregations initially made on nested documents, to parent documents.

Consulting

With nearly 16 years of experience, including 8 as data engineer, I offer expert consulting to design and optimize scalable data solutions. As an O’Reilly author, Data+AI Summit speaker, and blogger, I bring cutting-edge insights to modernize infrastructure, build robust pipelines, and drive data-driven decision-making. Let's transform your data challenges into opportunities—reach out to elevate your data engineering game today!

👉 contact@waitingforcode.com
🔗 past projects