Partial matching and ngrams in Elasticsearch

Elasticsearch search matches only terms defined in inverted index. So even if we are looking for only two first letters of given term, we won't be able to do it with standard match query. Instead of it we should use partial matching, provided by Elasticsearch in different forms.

In this article we'll explore partial matching provided with ngram concept. At the begin, we'll explain ngram idea. After that, we'll implement it to make some full-text queries to show how it works.

Ngram and partial matching

The way of working of ngram analyzer is quite simple. By the way, we mentioned it in the article about Elasticsearch and some concepts of document-oriented database. Very often, Elasticsearch is configured to generate terms based on some common rules, such as: whitespace separator, coma, point separator etc. With ngram we can subdivide generated tokens according to the number of minimal and maximal characters specified in its configuration. In consequence, Elasticsearch creates additional terms in inverted index.

To understand that, let's take an example of word "house". If we want to find documents related to "house", there are no problems because it's stored as 'house' in indexed terms. However, if we wan to find documents matching "hous", so probably containing "house" term, we need to use ngram analyzer to split the word on multiple partial terms: "h", "ho", "hou", "hous", "house", if we start from 1 character term. It's the reason why the feature of this kind of searching is called partial matching.

Ngram solution for partial matching should be more efficient than wildcards or RegEx queries. RegEx queries need to iterate through index terms, find the matching ones, and return the documents - all that in the fly. In the other side, ngram search works exactly as normal search on index because it searches corresponding term in index and returns corresponding documents directly, without any additional computation.

Example of ngram

To see how we can implement ngrams, let's create simple type storing names of imaginary football clubs:

{
"settings": {
  "index": {
    "analysis": {"filter": {"autocomplete_filter": {
      "type": "edge_ngram",
      "min_gram": 3,
      "max_gram": 20
    }}, "analyzer": {
      "ngram_analyzer": {
        "tokenizer" : "standard",
        "filter": ["lowercase", "autocomplete_filter"]
      }
    }}
}},
"mappings": {"team": { "properties": { "name": {"type": "string", "index_analyzer": "ngram_analyzer"}}}}
}

Now, let's index some documents:

{"index": {"_index": "teams", "_type": "team"}}
{"name": "RC Lens"}
{"index": {"_index": "teams", "_type": "team"}}
{"name": "RC Lensoillois"}
{"index": {"_index": "teams", "_type": "team"}}
{"name": "Lens Racing Club"}
{"index": {"_index": "teams", "_type": "team"}}
{"name": "MetzLens"}
{"index": {"_index": "teams", "_type": "team"}}
{"name": "MetzLensLensMetz"}
{"index": {"_index": "teams", "_type": "team"}}
{"name": "Metz LensLens Metz"}
{"index": {"_index": "teams", "_type": "team"}}
{"name": "Metz Lens Lens Metz"}

Each of these documents was indexed with ngram analyzer. This operation made following terms in inversed index:

Now, if we search one of these terms, we should find matching documents. Let's take "metzle", for which we should get below hits:

{"took":14,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":0.30685282,"hits":[{"_index":"teams","_type":"team","_id":"AU78ruMfrGROWlYMcX-O","_score":0.30685282,"_source":{"name": "MetzLensLensMetz"}},{"_index":"teams","_type":"team","_id":"AU78ruMfrGROWlYMcX-N","_score":0.30685282,"_source":{"name": "MetzLens"}}]}}

This article presents ngram analyzer which is one of possibilities to deal with partial matching in Elasticsearch. In the first part we can learn that ngram consists on dividing main term to a lot of smaller terms. They are all indexed, so the lookup is pretty quick. In other side, indexing step is longer because of this additionnal work. The second part shows how ngram analyzer can be used to make some autocomplete-like queries.

If you liked it, you should read: