Parent-children relationship in Elasticsearch

Linking entities is quite easy in relational databases. It is a less trivial task in document databases, which are designed for less normalized data storage. Elasticsearch is no exception to this rule, but it defines some mechanisms to support parent-child relationships between documents.

This time we'll discover how to handle parent-child relationships in Elasticsearch. Because there are several ways to do that, we'll explain each of them in a separate part. The first part describes the simplest solution, denormalization. The second one presents the mapping of nested objects. The last one shows how to use parent-child mapping.

To illustrate the cases we'll take the example of a football club and its players. Each club has several players who represent it in league competition, and the squad can change every season. The sample contains players who represented SC Bastia in the 2000/2001 season: Eric Durand, Piotr Swierczewski, Frederic Nee and Lilian Nalis.

Denormalization

The first method of organizing relationships in Elasticsearch is data denormalization. In other words, instead of joining documents together through something like the foreign keys of relational databases, we put everything together in single documents. The benefit is that such documents are easy to define and to query. The drawbacks are that data is duplicated, so it takes more disk space and is more difficult to maintain.

Let's now illustrate this method by creating a simple denormalized index containing SC Bastia players:

{
  "settings": {"index": {"analysis": {
    "analyzer": {"lowercase_analyzer": {"tokenizer": "standard", "filter": ["lowercase"]}}
  }}},
  "mappings": {"denormalized": {"properties": {
    "name": {"type": "string", "analyzer": "lowercase_analyzer"},
    "team": {"type": "string", "analyzer": "lowercase_analyzer"}
  }}}
}

Querying doesn't differ from querying any other index, so we can skip it for this method.
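The article doesn't list the documents stored in this index. As a minimal sketch, assuming one document per player with the team name repeated in each of them, the bulk body could be built like this in Python. The index name `denormalized_players` and the helper `build_denormalized_bulk` are hypothetical; the type name `denormalized` and the two fields come from the mapping above.

```python
import json

def build_denormalized_bulk(team, players,
                            index="denormalized_players",  # hypothetical index name
                            doc_type="denormalized"):
    """Return a newline-delimited bulk body with one document per player.

    The team name is copied into every document, which is the duplication
    that denormalization accepts.
    """
    lines = []
    for name in players:
        # Action line: which index and type receive the document.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the player together with a copy of the team name.
        lines.append(json.dumps({"name": name, "team": team}))
    # The bulk body has to end with a newline.
    return "\n".join(lines) + "\n"

body = build_denormalized_bulk(
    "SC Bastia",
    ["Eric Durand", "Piotr Swierczewski", "Frederic Nee", "Lilian Nalis"],
)
print(body)
```

Renaming the team would mean rewriting every one of these documents, which is the maintenance cost of this method.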

Nested objects

Another approach to managing relationships in Elasticsearch is based on nested objects. It consists of creating a parent entity in each document that contains several child entities. In our case, SC Bastia is the parent entity while its players are the children.

This idea is closer to the relational world. No data is duplicated and no concurrency problems exist. However, this solution has one major drawback when it comes to updates: when we want to add, edit or delete some data in nested documents, the entire document must be reindexed. Apart from that, it's quite a neat solution to manage relationships in Elasticsearch. Let's begin to explore it by defining the mapping:

{  
  "settings":{  
    "index":{  
      "analysis":{  
        "analyzer":{  
          "lowercase_analyzer":{  
            "tokenizer":"standard",
            "filter":[  
              "lowercase"
            ]
          }
        }
      }
    }
  },
  "mappings":{  
    "nested_team":{  
      "properties":{  
        "team":{  
          "type":"string"
        },
        "players":{
          "type": "nested",  
          "properties":{  
            "name":{  
              "type":"string"
            }
          }
        }
      }
    }
  }
}

Now we can add some players to our team by sending this bulk request to http://localhost:9200/nested_object/nested_team/_bulk:

{"index": {"_index": "nested_object", "_type": "nested_team"}}
{"team": "SC Bastia", "players": [
  {"name": "Eric Durand"}, 
  {"name": "Piotr Swierczewski"}, 
  {"name": "Frederic Nee"},  
  {"name": "Lilian Nalis"}
]}
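The update drawback mentioned earlier can be sketched in Python. Because the players live inside the team document, adding one later means sending the complete document again. The helper `with_new_player` is hypothetical and "New Player" is only a placeholder name, not a real squad member.

```python
import copy
import json

def with_new_player(team_source, player_name):
    """Return a full copy of the team document with one more nested player."""
    updated = copy.deepcopy(team_source)
    updated["players"].append({"name": player_name})
    return updated

team_source = {
    "team": "SC Bastia",
    "players": [
        {"name": "Eric Durand"},
        {"name": "Piotr Swierczewski"},
        {"name": "Frederic Nee"},
        {"name": "Lilian Nalis"},
    ],
}

# "New Player" is a placeholder used only for this illustration.
reindexed_source = with_new_player(team_source, "New Player")

# The whole document, existing players included, becomes the body
# of the next index request for this team.
print(json.dumps(reindexed_source))
```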

To see how this document was indexed, let's get it through its id (http://localhost:9200/nested_object/nested_team/AU8HiHaBzr3iDNPLpHgq):

{"_index":"nested_object","_type":"nested_team",
  "_id":"AU8HiHaBzr3iDNPLpHgq","_version":1,
  "found":true, 
  "_source":{"team": "SC Bastia", "players": [
    {"name": "Eric Durand"}, 
    {"name": "Piotr Swierczewski"}, 
    {"name": "Frederic Nee"}, 
    {"name": "Lilian Nalis"}
]}}

As you can see, it doesn't differ from the array type supported by Elasticsearch. Querying isn't complicated either. To get the player called Eric Durand who played for SC Bastia, we can use the kind of query dedicated to nested objects, called... nested:

{"query": {
  "bool": {
    "must": [
      {"match": {"team": "sc bastia"}},
      {"nested": {
        "path": "players",
        "query": {
          "bool": {
            "must": [
              {"match": {"name": "Eric Durand"}}
            ]
          }
        }
      }}
    ]
  }
}}

However, the response contains the whole document and not only the player's information. Some client-side filtering may be needed, which can be considered a small drawback of nested objects:

{"took":1,"timed_out":false,
 "_shards":{"total":5,"successful":5,"failed":0},
  "hits":{"total":1,"max_score":2.3953633,"hits":[ 
    {"_index":"nested_object","_type":"nested_team",
     "_id":"AU8HiHaBzr3iDNPLpHgq","_score":2.3953633,
     "_source":{"team": "SC Bastia", "players": [
       {"name": "Eric Durand"}, {"name": "Piotr Swierczewski"}, 
       {"name": "Frederic Nee"}, {"name": "Lilian Nalis"}
    ]
}}]}}
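The client-side filtering mentioned above could look like the following Python sketch. It assumes the hit has already been parsed into a dictionary; `matching_players` is a hypothetical helper, and the case-insensitive equality check is a simplification of what the match query does on the server.

```python
def matching_players(hit, wanted_name):
    """Keep only the nested players whose name equals wanted_name (case-insensitive)."""
    players = hit["_source"].get("players", [])
    return [p for p in players if p["name"].lower() == wanted_name.lower()]

# One hit copied from the response above.
hit = {
    "_index": "nested_object",
    "_type": "nested_team",
    "_id": "AU8HiHaBzr3iDNPLpHgq",
    "_score": 2.3953633,
    "_source": {
        "team": "SC Bastia",
        "players": [
            {"name": "Eric Durand"},
            {"name": "Piotr Swierczewski"},
            {"name": "Frederic Nee"},
            {"name": "Lilian Nalis"},
        ],
    },
}

print(matching_players(hit, "Eric Durand"))  # [{'name': 'Eric Durand'}]
```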

Parent-child mapping

The last method to associate parent and child documents is parent-child mapping. It is the closest to the idea behind JOIN clauses in relational databases because all data, parent and children, lives in different documents, exactly like SQL tables joined by a primary-foreign key relationship.

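The main advantage of parent-child mapping over nested objects is that parents and children are separate documents, so a child can be added, changed or deleted without reindexing the parent or its other children.

But as usual, there are some limitations: routing is mandatory for child documents, and Elasticsearch holds a map between parent and child documents in memory, which has a cost.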

Let's index our sample data according to the parent-child mapping method:

{
"settings": {"index": {"analysis": { 
  "analyzer": {"lowercase_analyzer": {"tokenizer" : "standard", "filter": ["lowercase" ]}}
}}},
"mappings": { 
  "team": {"properties": {"name": {"type": "string"}}},
  "player": {"_parent": {"type": "team"}, "properties": {"name": {"type": "string"}}}
}
}

When we inspect the created index (http://localhost:9200/parent_child), we can observe the presence of an already known parameter, _routing, which is mandatory for parent-child mapping:

{"parent_child":{"aliases":{},"mappings":
  {"team":{"properties":{"name":{"type":"string"}}},
   "player":{"_parent":{"type":"team"},"_routing":{"required":true},
     "properties":{"name":{"type":"string"}}}},
   "settings":{"index":{"creation_date":"1438938412899","uuid":"bwOS_N-bR6mhHv1xBBHtaA",
     "analysis":{"analyzer":{"lowercase_analyzer":{"filter":
       ["lowercase"],"tokenizer":"standard"}}},
   "number_of_replicas":"1","number_of_shards":"5","version":{"created":"1050099"}}},"warmers":{}}}

We'll begin indexing the data by defining the parent document (http://localhost:9200/parent_child/team/_bulk):

{"index": {"_id": "scb", "_type": "team"}}
{"name": "SC Bastia"}

In the bulk request for child documents we must define the id of the corresponding parent (http://localhost:9200/parent_child/player/_bulk):

{"index": {"_type": "player", "parent": "scb"}}
{"name": "Eric Durand"}
{"index": {"_type": "player", "parent": "scb"}}
{"name": "Piotr Swierczewski"}
{"index": {"_type": "player", "parent": "scb"}}
{"name": "Frederic Nee"}
{"index": {"_type": "player", "parent": "scb"}}
{"name": "Lilian Nalis"}

Before checking search queries, we'll see how to read one indexed document directly. For example, to read the player whose id is AU8HbcKbzr3iDNPLpGR4, we must call the usual document read URL with the parent parameter, as in http://localhost:9200/parent_child/player/AU8HbcKbzr3iDNPLpGR4?parent=scb:

{"_index":"parent_child","_type":"player","_id":"AU8HbcKbzr3iDNPLpGR4","_version":1,"found":true,"_source":{"name": "Lilian Nalis"}}

If this request is made without the parent parameter, a RoutingMissingException is thrown:

{"error":"RoutingMissingException[routing is required for [parent_child]/[player]/[AU8HbcKbzr3iDNPLpGR4]]","status":400}
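The exception follows from the way documents are placed on shards: a child is routed by its parent's id so that both end up on the same shard, which means the child's own id is not enough to locate it. The Python sketch below illustrates the idea with a stand-in hash (`zlib.crc32`). It is not the hash function Elasticsearch uses, and `pick_shard` is a hypothetical helper; the 5 shards match the index settings shown earlier.

```python
import zlib

NUMBER_OF_SHARDS = 5  # same value as in the index settings above

def pick_shard(routing_value, shards=NUMBER_OF_SHARDS):
    """Map a routing value to a shard number (stand-in hash, illustration only)."""
    return zlib.crc32(routing_value.encode("utf-8")) % shards

parent_id = "scb"
child_id = "AU8HbcKbzr3iDNPLpGR4"

parent_shard = pick_shard(parent_id)
# The child is indexed with routing equal to the parent id, so it follows its parent.
child_shard = pick_shard(parent_id)
# Without the parent parameter only the child's id is known, and hashing it
# is not guaranteed to point to the shard that holds the document.
shard_guessed_from_child_id = pick_shard(child_id)

print(parent_shard, child_shard, shard_guessed_from_child_id)
```

Passing `parent=scb` in the read URL gives Elasticsearch the routing value it needs to find the right shard.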

Search queries for this method are based on the has_child and has_parent queries. The first one finds parents through their children, as in this request sent to http://localhost:9200/parent_child/team/_search:

{  
  "query":{  
    "has_child":{  
      "type":"player",
      "query":{  
        "match":{  
          "name":"Eric Durand"
        }
      }
    }
  }
}

The response contains the matching parent document:

{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},
"hits":{"total":1,"max_score":1.0,"hits":[
  {"_index":"parent_child","_type":"team","_id":"scb","_score":1.0,
   "_source":{"name": "SC Bastia"}
}]}}

As you can imagine, has_parent finds children through the specified parent. For example, this request sent to http://localhost:9200/parent_child/player/_search returns all SC Bastia players:

{  
  "query":{  
    "has_parent":{  
      "type":"team",
      "query":{  
        "match":{  
          "name":"SC Bastia"
        }
      }
    }
  }
}

The response lists the four players:

{"took":4,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},
"hits":{"total":4,"max_score":1.0,"hits":[
  {"_index":"parent_child","_type":"player",
   "_id":"AU8HbcKbzr3iDNPLpGR1","_score":1.0,"_source":{"name": "Eric Durand"}},
  {"_index":"parent_child","_type":"player",
   "_id":"AU8HbcKbzr3iDNPLpGR2","_score":1.0,"_source":{"name": "Piotr Swierczewski"}},
  {"_index":"parent_child","_type":"player",
   "_id":"AU8HbcKbzr3iDNPLpGR3","_score":1.0,"_source":{"name": "Frederic Nee"}},
  {"_index":"parent_child","_type":"player",
   "_id":"AU8HbcKbzr3iDNPLpGR4","_score":1.0,"_source":{"name": "Lilian Nalis"}}
]}}
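To make the direction of the two queries explicit, here is a small in-memory sketch in Python. It is not how Elasticsearch executes them: the helpers `has_child` and `has_parent`, the plain dictionaries and the exact-equality predicates are simplifications made up for illustration.

```python
# Parent documents keyed by id, and children carrying the id of their parent.
teams = {"scb": {"name": "SC Bastia"}}
players = [
    {"name": "Eric Durand", "parent": "scb"},
    {"name": "Piotr Swierczewski", "parent": "scb"},
    {"name": "Frederic Nee", "parent": "scb"},
    {"name": "Lilian Nalis", "parent": "scb"},
]

def has_child(teams, players, predicate):
    """Return the teams that have at least one player satisfying predicate."""
    parent_ids = {p["parent"] for p in players if predicate(p)}
    return [teams[team_id] for team_id in sorted(parent_ids) if team_id in teams]

def has_parent(teams, players, predicate):
    """Return the players whose team satisfies predicate."""
    matching_ids = {team_id for team_id, team in teams.items() if predicate(team)}
    return [p for p in players if p["parent"] in matching_ids]

# Children are the criterion, parents are the result.
print(has_child(teams, players, lambda p: p["name"] == "Eric Durand"))
# The parent is the criterion, children are the result.
print([p["name"] for p in has_parent(teams, players, lambda t: t["name"] == "SC Bastia")])
```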

This article describes 3 methods to manage relationships in Elasticsearch. Each of them has its advantages and drawbacks. Data denormalization, presented in the first part, is quite easy to define and query, but it takes a lot of disk space and even more maintenance effort. The second method, based on nested objects, looks like the basic array datatype in Elasticsearch documents. However, it requires reindexing the whole document even after a small change to one child element. The last one, the parent-child relationship, doesn't need that because parent and child documents are separate. Its problem is the memory cost of the map between parent and child documents held by Elasticsearch.

