Proximity matching in Elasticsearch

Elasticsearch and its inverted index are like a magic, infinitely deep hat in which we can hide millions of terms. Sometimes, however, these terms need to be analyzed with some logic rather than as isolated words. This is where proximity matching comes to help.

In this article we'll discover the idea of proximity matching, also known as phrase matching. We'll begin with some theory, such as the concept of a phrase in documents and term positions in the index. After that, we'll use proximity matching to find some sample phrases in indexed newspaper articles.

What is proximity matching?

Proximity matching means searching for terms that appear in the indexed documents in the same order as in the query. For example, if we search for "in my home", we expect matching documents to have "in" followed by "my", and "my" followed by "home", somewhere in the given document field. To understand how Elasticsearch knows the order of the terms, we can simply call the _analyze endpoint with the sample text 'This is my home':

{
  "tokens" : [ {
    "token" : "this",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "",
    "position" : 1
  }, {
    "token" : "is",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "",
    "position" : 2
  }, {
    "token" : "my",
    "start_offset" : 8,
    "end_offset" : 10,
    "type" : "",
    "position" : 3
  }, {
    "token" : "home",
    "start_offset" : 11,
    "end_offset" : 15,
    "type" : "",
    "position" : 4
  } ]
}

As we can see, each analyzed and indexed token has a position. Elasticsearch uses these positions to verify whether a given document contains the searched terms in the correct order.
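To make the idea of positions more concrete, here is a minimal Python sketch that reproduces the shape of the output above. It is only an approximation: the analyze function and its regular expression are assumptions made for illustration, not the real standard analyzer, and positions start at 1 only to mirror the response shown above.

```python
import re

def analyze(text):
    """Rough stand-in for an analyzer: split on word characters, lowercase,
    and record the offsets and the position of every token."""
    tokens = []
    # Positions start at 1 to mirror the _analyze response shown in the article.
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens

tokens = analyze("This is my home")
for token in tokens:
    print(token["position"], token["token"])
```

The important part is the position field: it is the piece of information a phrase query can compare between terms.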

In the query DSL, proximity matching queries are defined with the match_phrase type. Besides the searched terms and the field that should contain them, they accept an attribute called slop. It relaxes the word order by allowing some extra tokens between the searched terms. For example, if slop is equal to 1, both documents 'a b c' and 'a D b c' match the search for the phrase 'a b c'. To make a document such as 'a D b E c' match, we have to increase the slop value to 2, because 2 extra positions separate the terms of the phrase.
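The slop arithmetic can be checked with a small Python sketch. This is a simplified model and not Elasticsearch's actual implementation: the helper names (positions, required_slop, matches) are hypothetical, tokens are split on whitespace only, and repeated query terms are not handled specially. For every way of picking one position per query term, it shifts each document position back by the term's position in the query and measures how far apart the shifted values are.

```python
from itertools import product

def positions(text):
    """Map every lowercased token to the list of positions where it occurs."""
    index = {}
    for pos, token in enumerate(text.lower().split()):
        index.setdefault(token, []).append(pos)
    return index

def required_slop(phrase, text):
    """Smallest slop under which the phrase matches the text (simplified model).
    Returns None when at least one query term is missing from the text."""
    query = phrase.lower().split()
    index = positions(text)
    if any(term not in index for term in query):
        return None
    best = None
    # Try every combination of one document position per query term.
    for combo in product(*(index[term] for term in query)):
        # An exact phrase gives the same shifted value for every term.
        shifted = [doc_pos - query_pos for query_pos, doc_pos in enumerate(combo)]
        width = max(shifted) - min(shifted)
        if best is None or width < best:
            best = width
    return best

def matches(phrase, text, slop=0):
    needed = required_slop(phrase, text)
    return needed is not None and needed <= slop

print(required_slop("a b c", "a b c"))      # 0
print(required_slop("a b c", "a D b c"))    # 1
print(required_slop("a b c", "a D b E c"))  # 2
```

With this model, 'a D b c' needs a slop of 1 and 'a D b E c' needs a slop of 2, which agrees with the examples above.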

Even if proximity matching is an interesting feature for some contextual searches, it has some drawbacks and limitations.

Proximity matching example

For our example, we'll use an index containing newspaper articles, created with the following body (http://localhost:9200/articles):

{
"settings": {"index": {"analysis": { 
  "analyzer": {"lowercase_analyzer": {"tokenizer" : "standard", "filter": ["lowercase" ]}}
}}},
"mappings":
  {"article":{"properties": {
    "title": {"type": "string", "index_analyzer": "lowercase_analyzer"}
  }}}
}

The following documents will be pushed to this index (http://localhost:9200/articles/article/_bulk):

{"index": {"_index": "articles", "_type": "article"}}
{"title": "For years, France has been developing and growing some excellent young players."}
{"index": {"_index": "articles", "_type": "article"}}
{"title": "Players in France are excellent and young."}
{"index": {"_index": "articles", "_type": "article"}}
{"title": "Players in France and Belgium are excellent and young."}
{"index": {"_index": "articles", "_type": "article"}}
{"title": "France Football confirms the interests for Ligue 1 excellent young player"}
{"index": {"_index": "articles", "_type": "article"}}
{"title": "France excellent young players are coming to Premiership"}
{"index": {"_index": "articles", "_type": "article"}}
{"title": "For years, France has been developing and growing some excellent, sometimes old, sometimes young, as well players as coaches."}
{"index": {"_index": "articles", "_type": "article"}}
{"title": "Players from countries like France and Belgium are excellent and young."}

Now we can run some test queries using the match_phrase query type:
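As an illustration only, the body of such a query could be built as in the Python sketch below. The phrase 'excellent young players', the slop value of 1 and the match_phrase_body helper are assumptions chosen for this example. The resulting JSON would typically be sent to the _search endpoint of the articles index.

```python
import json

def match_phrase_body(field, phrase, slop=0):
    """Build a match_phrase query body for the given field (hypothetical helper)."""
    return {"query": {"match_phrase": {field: {"query": phrase, "slop": slop}}}}

# Phrase and slop are illustrative values, not taken from the article.
body = match_phrase_body("title", "excellent young players", slop=1)
print(json.dumps(body, indent=2))
```

Changing the slop argument and comparing the returned titles is a simple way to see how much freedom each value gives to the word order.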

Proximity matching with match_phrase queries lets us find phrases containing the searched terms either in the exact written order or with some other terms between them. This powerful feature is also a trap, because it can return results that don't exactly match the context of the query. We could observe that in the last query executed in the second part of this article. An alternative to phrase matching is shingles, which are indexed tokens composed of two or more adjacent terms. They are faster than match_phrase queries at search time because the cost is paid at index time. However, they take more disk space.
