Parent-children relationship in Elasticsearch on waitingforcode.com

Linking entities is quite easy in relational databases. It's a less trivial task in document databases, which are designed for less normalized data storage. Elasticsearch is no exception to this rule, but it defines some mechanisms to support parent-children relationships between documents.

This time we'll discover how to handle parent-children relationships in Elasticsearch. Because there are several ways to do that, we'll explain each of them in a separate part. The first one describes the simplest solution: denormalization. The second one presents the mapping of nested objects. The last one shows how to use parent-child mapping.

To illustrate the cases we'll take the example of a football club and its players. Each club has several players who represent it in the league competition. The squad can change every season. The sample will contain players who represented the SC Bastia team in 2000/2001: Eric Durand, Piotr Swierczewski, Frederic Nee and Lilian Nalis.

Denormalization

The first method of organizing relationships in Elasticsearch is data denormalization. In other words, instead of joining documents through something like the foreign key concept of relational databases, we put everything together in single documents. Obviously, data is duplicated and more difficult to maintain. But we gain other benefits, such as:

read simplicity: querying denormalized, related documents doesn't differ from querying normal flat documents without relations
execution speed: each queried document already contains all the needed information, so the search engine doesn't need to fetch other documents before returning the final results

Among the drawbacks of denormalization, we can list:

index size: data is duplicated, so the index is bigger.
maintenance: instead of changing the "parent" document once, we need to apply the change to all the documents that embed it.
concurrency: because Elasticsearch doesn't provide ACID guarantees across several written documents (only for each document separately), concurrent changes can leave the documents in an inconsistent state, for example with some documents containing the changes from user A and the others the changes made by user B. To avoid this kind of problem, we can manage locks manually on all modified documents. But it's one more thing for the application to handle (and so, an additional source of potential bugs).
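The maintenance drawback can be sketched in a few lines of Python. This is only an in-memory illustration: the `rename_team` helper and the new team name are hypothetical, and in a real index every rewritten document would also have to be sent back to Elasticsearch.

```python
# Denormalized documents: the team name is copied into every player document.
# Player and team names come from the article's sample data.
players = ["Eric Durand", "Piotr Swierczewski", "Frederic Nee", "Lilian Nalis"]
denormalized_docs = [{"name": name, "team": "SC Bastia"} for name in players]


def rename_team(docs, old_name, new_name):
    """Return updated copies of the docs and the number of rewritten docs.

    Hypothetical helper, not an Elasticsearch API: it only shows that a
    single "parent" change touches every document embedding that parent.
    """
    updated, rewritten = [], 0
    for doc in docs:
        if doc["team"] == old_name:
            doc = {**doc, "team": new_name}  # copy, the input stays untouched
            rewritten += 1
        updated.append(doc)
    return updated, rewritten


# "Sporting Club de Bastia" is just an example value for the new name.
renamed_docs, rewritten_count = rename_team(
    denormalized_docs, "SC Bastia", "Sporting Club de Bastia")
print(rewritten_count)  # one rewrite per player document
```

With a separate parent document, the same rename would be a single write, whatever the number of players.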

Let's now illustrate this method by creating a simple denormalized index containing SC Bastia players:

{
"settings": {"index": {"analysis": { 
  "analyzer": {"lowercase_analyzer": {"tokenizer" : "standard", "filter": ["lowercase" ]}}
}}},
"mappings":
  {"denormalized":{"properties": {
    "name": {"type": "string", "analyzer": "lowercase_analyzer"},
    "team": {"type": "string", "analyzer": "lowercase_analyzer"}
  }}}
}

Querying doesn't differ from the usual flat-document case, so we can skip it for this method.

Nested objects

Another approach to managing relationships in Elasticsearch is based on nested objects. It consists of creating a parent entity in each document, containing several child entities. In our case, SC Bastia will be our parent entity while its players will be the children.

This idea corresponds better to the relational world. No data is duplicated and no concurrency problems exist. However, this solution has one major drawback when it comes to updates: when we want to add, edit or delete some data in nested documents, the entire document must be reindexed. Apart from that, it's quite a neat solution for managing relationships in Elasticsearch. Let's begin to explore it by defining the mapping:

{  
  "settings":{  
    "index":{  
      "analysis":{  
        "analyzer":{  
          "lowercase_analyzer":{  
            "tokenizer":"standard",
            "filter":[  
              "lowercase"
            ]
          }
        }
      }
    }
  },
  "mappings":{  
    "nested_team":{  
      "properties":{  
        "team":{  
          "type":"string"
        },
        "players":{
          "type": "nested",  
          "properties":{  
            "name":{  
              "type":"string"
            }
          }
        }
      }
    }
  }
}

Now we can add some players to our team by sending this bulk request to http://localhost:9200/nested_object/nested_team/_bulk:

{"index": {"_index": "nested_object", "_type": "nested_team"}}
{"team": "SC Bastia", "players": [
  {"name": "Eric Durand"}, 
  {"name": "Piotr Swierczewski"}, 
  {"name": "Frederic Nee"},  
  {"name": "Lilian Nalis"}
]}

To see how this document was indexed, let's get it through its id (http://localhost:9200/nested_object/nested_team/AU8HiHaBzr3iDNPLpHgq):

{"_index":"nested_object","_type":"nested_team",
  "_id":"AU8HiHaBzr3iDNPLpHgq","_version":1,
  "found":true, 
  "_source":{"team": "SC Bastia", "players": [
    {"name": "Eric Durand"}, 
    {"name": "Piotr Swierczewski"}, 
    {"name": "Frederic Nee"}, 
    {"name": "Lilian Nalis"}
]}}

As you can see, it doesn't differ from the array type supported by Elasticsearch. Querying is not complicated either. To get the player called Eric Durand who played for SC Bastia, we can write a specific kind of query for nested objects, called... nested:

{"query": {
  "bool": {
    "must": [
      {"match": {"team": "sc bastia"}},
      {"nested": {
        "path": "players",
        "query": {
          "bool": {
            "must": [
              {"match": {"players.name": "Eric Durand"}}
            ]
          }
        }
      }}
    ]
  }
}}

However, the response contains the whole document and not only the player's information. This may require some client-side filtering and can be considered a small drawback of nested objects:

{"took":1,"timed_out":false,
 "_shards":{"total":5,"successful":5,"failed":0},
  "hits":{"total":1,"max_score":2.3953633,"hits":[ 
    {"_index":"nested_object","_type":"nested_team",
     "_id":"AU8HiHaBzr3iDNPLpHgq","_score":2.3953633,
     "_source":{"team": "SC Bastia", "players": [
       {"name": "Eric Durand"}, {"name": "Piotr Swierczewski"}, 
       {"name": "Frederic Nee"}, {"name": "Lilian Nalis"}
    ]
}}]}}
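The client-side filtering mentioned above could look like the following Python sketch. The `matching_players` helper is hypothetical, and it compares whole names in a case-insensitive way, which is a simplification: the real match query works on analyzed tokens, so it can also match on a single term of the name.

```python
def matching_players(response, wanted_name):
    """Collect nested players whose name equals wanted_name (case-insensitive).

    Hypothetical helper working on an already decoded search response.
    """
    wanted = wanted_name.lower()
    found = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        for player in source.get("players", []):
            if player["name"].lower() == wanted:
                found.append({"team": source["team"], "name": player["name"]})
    return found


# Trimmed copy of the response shown above, as a Python dict.
response = {
    "hits": {"total": 1, "hits": [
        {"_index": "nested_object", "_type": "nested_team",
         "_source": {"team": "SC Bastia", "players": [
             {"name": "Eric Durand"}, {"name": "Piotr Swierczewski"},
             {"name": "Frederic Nee"}, {"name": "Lilian Nalis"}]}}
    ]}
}

result = matching_players(response, "eric durand")
print(result)
```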

Parent-child mapping

The last method of associating parent and child documents is parent-child mapping. It best illustrates the idea implemented by JOIN clauses in relational databases because all the data, parent and children, lives in different documents, exactly like SQL database tables joined through a primary-foreign key relationship.

The advantages of parent-child mapping over nested objects are significant:

atomicity: because joined documents are real, separate documents, there is no need to reindex the whole relationship after a change. Only the impacted document must be modified.
more query possibilities: child documents can be returned on their own as query results.
caching: even if this construction looks more complicated than nested objects or denormalized data, Elasticsearch tries to simplify it by handling an internal mapping between parent and child documents. Thanks to it, the lookup is fast.

But as usual, there are some limitations:

parent and child documents must live on the same shard
even though Elasticsearch keeps trying to reduce the memory use of these relations, the cost is still there and grows with each added parent. By the way, that's why this method is better suited to the case of few parents and many children than to the opposite one. Joins use the global ordinals technique, which consists of replacing memory-consuming data (such as strings) with more economical types (such as ints). Thanks to that, joins execute faster, but every change to the index triggers a rebuild of the ordinals. By default for joins, the rebuild is lazy and is triggered by the first parent-child query or aggregation. That's why it's faster when there are fewer parent ordinals to build.
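The ordinals idea can be pictured with a toy Python example. It is only an illustration of replacing strings by ints and of why a change forces a rebuild, not Elasticsearch's actual data structure; the `build_ordinals` helper and the parent ids are made up for the occasion.

```python
def build_ordinals(parent_ids):
    """Assign a small integer to every distinct parent id, in sorted order."""
    distinct_ids = sorted(set(parent_ids))
    return {parent_id: ordinal for ordinal, parent_id in enumerate(distinct_ids)}


# Parent id stored with each child document (illustrative values).
child_parent_ids = ["scb", "scb", "psg", "scb", "om"]

ordinals = build_ordinals(child_parent_ids)
# Children now reference their parent through a cheap int instead of a string.
encoded = [ordinals[parent_id] for parent_id in child_parent_ids]

# A new parent can shift the existing ordinals, so the whole mapping is rebuilt.
rebuilt = build_ordinals(child_parent_ids + ["asm"])
print(ordinals, encoded, rebuilt)
```

The more distinct parents there are, the bigger the mapping to rebuild, which matches the advice of preferring few parents with many children.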

Let's try to index our sample data with the parent-child mapping method:

{
"settings": {"index": {"analysis": { 
  "analyzer": {"lowercase_analyzer": {"tokenizer" : "standard", "filter": ["lowercase" ]}}
}}},
"mappings": { 
  "team": {"properties": {"name": {"type": "string"}}},
  "player": {"_parent": {"type": "team"}, "properties": {"name": {"type": "string"}}}
}
}

When we inspect the created index (http://localhost:9200/parent_child), we can observe the presence of an already known parameter, _routing, which is mandatory for parent-child mapping:

{"parent_child":{"aliases":{},"mappings":
  {"team":{"properties":{"name":{"type":"string"}}},
   "player":{"_parent":{"type":"team"},"_routing":{"required":true},
     "properties":{"name":{"type":"string"}}}},
   "settings":{"index":{"creation_date":"1438938412899","uuid":"bwOS_N-bR6mhHv1xBBHtaA",
     "analysis":{"analyzer":{"lowercase_analyzer":{"filter":
       ["lowercase"],"tokenizer":"standard"}}},
   "number_of_replicas":"1","number_of_shards":"5","version":{"created":"1050099"}}},"warmers":{}}}

We'll begin indexing the data by defining the parent document (http://localhost:9200/parent_child/team/_bulk):

{"index": {"_id": "scb", "_type": "team"}}
{"name": "SC Bastia"}

In the bulk request for child documents we must define the id of the corresponding parent (http://localhost:9200/parent_child/player/_bulk):

{"index": {"_type": "player", "parent": "scb"}}
{"name": "Eric Durand"}
{"index": {"_type": "player", "parent": "scb"}}
{"name": "Piotr Swierczewski"}
{"index": {"_type": "player", "parent": "scb"}}
{"name": "Frederic Nee"}
{"index": {"_type": "player", "parent": "scb"}}
{"name": "Lilian Nalis"}
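When the list of children grows, such a bulk body is usually generated rather than written by hand. Here is a minimal Python sketch, assuming the body is posted to the typed bulk URL above so that the index and the type come from the URL; the `child_bulk_body` helper is hypothetical and the HTTP call itself is left out.

```python
import json


def child_bulk_body(parent_id, names):
    """Build a newline-delimited bulk body indexing one child per name.

    Hypothetical helper: each child gets an action line carrying the parent
    id, followed by a line with the document source.
    """
    lines = []
    for name in names:
        lines.append(json.dumps({"index": {"parent": parent_id}}))
        lines.append(json.dumps({"name": name}))
    # The bulk API expects every line, the last one included, to end with \n.
    return "\n".join(lines) + "\n"


body = child_bulk_body(
    "scb",
    ["Eric Durand", "Piotr Swierczewski", "Frederic Nee", "Lilian Nalis"])
print(body)
```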

Before checking search queries, we'll see how to read an indexed document directly. For example, to read the player whose id is AU8HbcKbzr3iDNPLpGR4, we must call the usual document read URL with the parent parameter, as in http://localhost:9200/parent_child/player/AU8HbcKbzr3iDNPLpGR4?parent=scb:

{"_index":"parent_child","_type":"player","_id":"AU8HbcKbzr3iDNPLpGR4","_version":1,"found":true,"_source":{"name": "Lilian Nalis"}}

If this request is made without the parent parameter, a RoutingMissingException is thrown:

{"error":"RoutingMissingException[routing is required for [parent_child]/[player]/[AU8HbcKbzr3iDNPLpGR4]]","status":400}
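The reason behind this error is routing: a child is stored on the shard chosen from its parent's id, so the child's own id is not enough to locate it. The Python sketch below illustrates the principle of picking a shard as a hash of the routing value modulo the number of shards. It assumes the 5 shards of the index above, and `zlib.crc32` is only a stand-in, not the hash function Elasticsearch really uses.

```python
import zlib


def shard_for(routing_value, number_of_shards=5):
    """Pick a shard number from a routing value.

    crc32 is a stand-in hash chosen for this sketch only.
    """
    return zlib.crc32(routing_value.encode("utf-8")) % number_of_shards


parent_id = "scb"
child_ids = ["AU8HbcKbzr3iDNPLpGR1", "AU8HbcKbzr3iDNPLpGR2",
             "AU8HbcKbzr3iDNPLpGR3", "AU8HbcKbzr3iDNPLpGR4"]

parent_shard = shard_for(parent_id)
# Children are routed with the parent's id, so they all share its shard.
child_shards = [shard_for(parent_id) for _ in child_ids]
# Routing on the child's own id gives no such guarantee.
shards_by_own_id = {child_id: shard_for(child_id) for child_id in child_ids}
print(parent_shard, child_shards, shards_by_own_id)
```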

Search queries for this method are based on the has_child and has_parent queries. The first one finds parents through their children, as in this request sent to http://localhost:9200/parent_child/team/_search:

{  
  "query":{  
    "has_child":{  
      "type":"player",
      "query":{  
        "match":{  
          "name":"Eric Durand"
        }
      }
    }
  }
}

{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},
"hits":{"total":1,"max_score":1.0,"hits":[
  {"_index":"parent_child","_type":"team","_id":"scb","_score":1.0,
   "_source":{"name": "SC Bastia"}
}]}}

As you can imagine, has_parent finds children through the specified parent. For example, this request sent to http://localhost:9200/parent_child/player/_search will return all SC Bastia players:

{  
  "query":{  
    "has_parent":{  
      "type":"team",
      "query":{  
        "match":{  
          "name":"SC Bastia"
        }
      }
    }
  }
}

{"took":4,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},
"hits":{"total":4,"max_score":1.0,"hits":[
 {"_index":"parent_child","_type":"player",
  "_id":"AU8HbcKbzr3iDNPLpGR1","_score":1.0,"_source":{"name": "Eric Durand"}},  
 {"_index":"parent_child","_type":"player",
  "_id":"AU8HbcKbzr3iDNPLpGR2","_score":1.0,"_source":{"name": "Piotr Swierczewski"}},
 {"_index":"parent_child","_type":"player",
  "_id":"AU8HbcKbzr3iDNPLpGR3","_score":1.0,"_source":{"name": "Frederic Nee"}},
 {"_index":"parent_child","_type":"player",
  "_id":"AU8HbcKbzr3iDNPLpGR4","_score":1.0,"_source":{"name": "Lilian Nalis"}}
]}}
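To make the semantics of both queries more concrete, here is a small in-memory simulation in Python. It only mimics the selection logic on the sample data: the `has_child` and `has_parent` functions are hypothetical helpers taking plain predicates, with no scoring and no text analysis involved.

```python
# Parents keyed by id and children holding a reference to their parent id.
teams = {"scb": {"name": "SC Bastia"}}
players = [
    {"name": "Eric Durand", "parent": "scb"},
    {"name": "Piotr Swierczewski", "parent": "scb"},
    {"name": "Frederic Nee", "parent": "scb"},
    {"name": "Lilian Nalis", "parent": "scb"},
]


def has_child(teams, players, child_predicate):
    """Return the parents having at least one child accepted by the predicate."""
    parent_ids = {p["parent"] for p in players if child_predicate(p)}
    return [teams[team_id] for team_id in sorted(parent_ids) if team_id in teams]


def has_parent(teams, players, parent_predicate):
    """Return the children whose parent is accepted by the predicate."""
    parent_ids = {team_id for team_id, team in teams.items()
                  if parent_predicate(team)}
    return [p for p in players if p["parent"] in parent_ids]


parents_found = has_child(teams, players, lambda p: p["name"] == "Eric Durand")
children_found = has_parent(teams, players, lambda t: t["name"] == "SC Bastia")
print(parents_found, [child["name"] for child in children_found])
```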

This article described 3 methods of managing relationships in Elasticsearch. Each of them has its advantages and drawbacks. Data denormalization, presented in the first part, is quite easy to define and query, but it takes a lot of disk space and even more maintenance effort. The second method, based on nested objects, looks like the basic array datatype in Elasticsearch documents. However, it requires reindexing the whole document even after a small change to a single child element. The last one, the parent-child relationship, doesn't need that because parent and child documents are separate. But its problem is the memory cost of the mapping between parent and child documents held by Elasticsearch.