Elasticsearch and its inverted index work like a magic, infinitely deep hat in which we can hide millions of terms. Sometimes, however, these terms need to be analyzed with some logic rather than as isolated words. This is where proximity matching helps.
In this article we'll explore proximity matching, also known as phrase matching. We'll start with some theory: the concept of a phrase in documents and term positions in the index. Then we'll use proximity matching to find sample phrases in indexed newspaper articles.
What is proximity matching?
Proximity matching means searching for terms that appear in the indexed documents in exactly the same order as in the query. For example, if we search for "in my home", we expect matching documents to contain "in" followed by "my", and "my" followed by "home", somewhere in the specified field. To see how Elasticsearch knows the order of terms, we can call the _analyze endpoint with the sample text 'This is my home':
{ "tokens" : [
  { "token" : "this", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 1 },
  { "token" : "is", "start_offset" : 5, "end_offset" : 7, "type" : "<ALPHANUM>", "position" : 2 },
  { "token" : "my", "start_offset" : 8, "end_offset" : 10, "type" : "<ALPHANUM>", "position" : 3 },
  { "token" : "home", "start_offset" : 11, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 4 }
] }
As we can see, each analyzed token carries a position. Elasticsearch uses these positions to verify whether a given document contains the searched terms in the correct order.
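To make the idea of positions more tangible, here is a minimal Python sketch that produces a similar token list. The analyze function is a hypothetical helper written for this article, not the Elasticsearch analyzer: it only lowercases and splits on word characters, and it numbers positions from 1 to mirror the output shown above.

```python
import re


def analyze(text):
    """Split text into lowercase word tokens with offsets and 1-based positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text), start=1):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),  # index of the first character
            "end_offset": match.end(),      # index just after the last character
            "position": position,           # order of the token in the text
        })
    return tokens


for token in analyze("This is my home"):
    print(token)
```

The position is what matters for phrase queries; the offsets point back to the original text and are mostly useful for features such as highlighting.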
In the query DSL, proximity matching queries use the match_phrase type. Besides the searched terms and the field that should contain them, they accept an attribute called slop. It loosens the word order requirement by allowing a number of extra tokens between the terms. For example, with slop equal to 1, both 'a b c' and 'a D b c' match the phrase 'a b c'. For a document such as 'a D b E c' to match, we have to increase the slop value to 2, because 2 extra positions separate the terms of the phrase.
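The slop arithmetic can be sketched in a few lines of Python. This is a simplified model, not the actual Lucene implementation: for each query term we compute the shift between its position in the document and its position in the query, and we take the spread of these shifts as the slop the document needs. The required_slop function and its behavior for repeated terms (trying every combination of occurrences) are assumptions made for illustration.

```python
from itertools import product


def required_slop(query_terms, doc_terms):
    """Return the smallest slop that lets the phrase match, or None if a term is missing."""
    shifts_per_term = []
    for query_pos, term in enumerate(query_terms):
        # Shift = where the term sits in the document minus where the query expects it.
        shifts = [doc_pos - query_pos
                  for doc_pos, doc_term in enumerate(doc_terms)
                  if doc_term == term]
        if not shifts:
            return None  # a query term is absent, so no slop value can help
        shifts_per_term.append(shifts)
    # In an exact phrase all shifts are equal; their spread is the needed slop.
    return min(max(combo) - min(combo) for combo in product(*shifts_per_term))


query = "a b c".split()
for document in ("a b c", "a D b c", "a D b E c"):
    print(document, "->", required_slop(query, document.split()))
```

In this model a swapped pair such as 'b a' for the query 'a b' needs a slop of 2, which shows that reordering terms costs more than inserting a single word between them.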
Even if proximity matching is an interesting feature for contextual searches, it has some drawbacks and limitations:
- performance - term positions are compared at query time, so the cost is paid directly by the end user
- too much flexibility - slop lets us return "almost matching" documents by applying a kind of Levenshtein edit distance to term positions. However, it can also produce inconsistent results whose meaning doesn't exactly match the search context.
Proximity matching example
For our example, we'll use an index of newspaper articles, created with the following body (http://localhost:9200/articles):
{ "settings": {"index": {"analysis": { "analyzer": {"lowercase_analyzer": {"tokenizer" : "standard", "filter": ["lowercase" ]}} }}}, "mappings": {"article":{"properties": { "title": {"type": "string", "index_analyzer": "lowercase_analyzer"} }}} }
The following documents will be pushed to this index (http://localhost:9200/articles/article/_bulk):
{"index": {"_index": "articles", "_type": "article"}}
{"title": "For years, France has been developing and growing some excellent young players."}
{"index": {"_index": "articles", "_type": "article"}}
{"title": "Players in France are excellent and young."}
{"index": {"_index": "articles", "_type": "article"}}
{"title": "Players in France and Belgium are excellent and young."}
{"index": {"_index": "articles", "_type": "article"}}
{"title": "France Football confirms the interests for Ligue 1 excellent young player"}
{"index": {"_index": "articles", "_type": "article"}}
{"title": "France excellent young players are coming to Premiership"}
{"index": {"_index": "articles", "_type": "article"}}
{"title": "For years, France has been developing and growing some excellent, sometimes old, sometimes young, as well players as coaches."}
{"index": {"_index": "articles", "_type": "article"}}
{"title": "Players from countries like France and Belgium are excellent and young."}
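The bulk endpoint expects newline-delimited JSON: one action line, then one document line, each terminated by a newline, including the last one. The sketch below builds such a body in Python. The build_bulk_body helper is hypothetical, and the index and type names are simply the ones used in this article.

```python
import json


def build_bulk_body(titles, index="articles", doc_type="article"):
    """Build a newline-delimited bulk body that indexes one document per title."""
    lines = []
    for title in titles:
        # Action line: tells Elasticsearch where the next document goes.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself.
        lines.append(json.dumps({"title": title}))
    # The body must end with a newline, otherwise the last line may be rejected.
    return "\n".join(lines) + "\n"


body = build_bulk_body([
    "Players in France are excellent and young.",
    "France excellent young players are coming to Premiership",
])
print(body)
```

The resulting string can be sent as the request body with any HTTP client.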
Now we can run some test queries using the match_phrase query type:
- we want to get documents where "france excellent young players" appears:
{"query": {"match_phrase": {"title": "france excellent young players"} } }
As expected, only one document is returned:
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.38356602,"hits":[{"_index":"articles","_type":"article","_id":"AU8Cq_zZ4Vz5YmYvVPUE","_score":0.38356602,"_source":{"title": "France excellent young players are coming to Premiership"}}]}}
- we want to find the same phrase, but this time allowing a slop of 6:
{"query": {"match_phrase": {"title": {"query": "france excellent young players", "slop": 6}} } }
This time 2 documents match. The second one has 6 non-matching positions between "France" and "excellent":
{"took":4,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":2,"max_score":0.38356602,"hits":[{"_index":"articles","_type":"article","_id":"AU8CsOpf4Vz5YmYvVPhd","_score":0.38356602,"_source":{"title": "France excellent young players are coming to Premiership"}},{"_index":"articles","_type":"article","_id":"AU8CsOpf4Vz5YmYvVPha","_score":0.11597946,"_source":{"title": "For years, France has been developing and growing some excellent young players."}}]}}
- we still want to match the "france excellent young players" phrase, but this time with a slop of 7:
{"query": {"match_phrase": {"title": {"query": "france excellent young players", "slop": 7}} } }
Surprisingly, 3 documents match, and one of them doesn't even contain the terms in the searched order. Because slop works backward as well as forward, the document "Players in France are excellent and young." matches:
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":3,"max_score":0.38356602,"hits":[{"_index":"articles","_type":"article","_id":"AU8CsOpf4Vz5YmYvVPhd","_score":0.38356602,"_source":{"title": "France excellent young players are coming to Premiership"}},{"_index":"articles","_type":"article","_id":"AU8CsOpf4Vz5YmYvVPhb","_score":0.16273327,"_source":{"title": "Players in France are excellent and young."}},{"_index":"articles","_type":"article","_id":"AU8CsOpf4Vz5YmYvVPha","_score":0.11597946,"_source":{"title": "For years, France has been developing and growing some excellent young players."}}]}}
- we want to find documents matching the same phrase as previously, but this time with a slop of 9:
{"query": {"match_phrase": {"title": {"query": "france excellent young players", "slop": 9}} } }
Our query explicitly says that we only want excellent young players from France. However, because the slop value is too big, we'll also retrieve excellent young players from Belgium:
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":4,"max_score":0.38356602,"hits":[{"_index":"articles","_type":"article","_id":"AU8CvCdi4Vz5YmYvVP_H","_score":0.38356602,"_source":{"title": "France excellent young players are coming to Premiership"}},{"_index":"articles","_type":"article","_id":"AU8CvCdi4Vz5YmYvVP_D","_score":0.22471303,"_source":{"title": "For years, France has been developing and growing some excellent young players."}},{"_index":"articles","_type":"article","_id":"AU8CvCdi4Vz5YmYvVP_E","_score":0.16273327,"_source":{"title": "Players in France are excellent and young."}},{"_index":"articles","_type":"article","_id":"AU8CvCdi4Vz5YmYvVP_F","_score":0.12129422,"_source":{"title": "Players in France and Belgium are excellent and young."}}]}}
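The results of the four queries can be checked by hand with a small Python sketch. It uses a simplified model of sloppy phrase matching, not the real Lucene scorer: each query term gets a shift (its position in the title minus its position in the query), and the spread of the shifts is the slop the title needs. The tokenize and required_slop helpers are hypothetical, and the sketch assumes each query term occurs at most once per title, which holds for our sample data.

```python
import re

TITLES = [
    "For years, France has been developing and growing some excellent young players.",
    "Players in France are excellent and young.",
    "Players in France and Belgium are excellent and young.",
    "France Football confirms the interests for Ligue 1 excellent young player",
    "France excellent young players are coming to Premiership",
    "For years, France has been developing and growing some excellent, "
    "sometimes old, sometimes young, as well players as coaches.",
    "Players from countries like France and Belgium are excellent and young.",
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def required_slop(phrase, title):
    """Return the slop the title needs to match the phrase, or None if a term is missing."""
    query, doc = tokenize(phrase), tokenize(title)
    if any(term not in doc for term in query):
        return None
    # doc.index() takes the first occurrence, hence the one-occurrence assumption.
    shifts = [doc.index(term) - query_pos for query_pos, term in enumerate(query)]
    return max(shifts) - min(shifts)


for title in TITLES:
    print(required_slop("france excellent young players", title), "-", title)
```

Under this model the seven titles need slops of 6, 7, 9, no match (the title contains "player", not "players"), 0, 11 and 11. This is consistent with the responses above: slop 6 returns two documents, slop 7 adds the reversed title about France, and slop 9 adds the title mentioning Belgium.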
Proximity matching with match_phrase queries lets us find phrases containing the searched terms, either in exactly the written order or with some other terms between them. This powerful feature is also a trap, because it can produce results that don't exactly match the intent of the query, as we saw in the last query of the second part of this article. An alternative to phrase matching is shingles, which are indexed terms composed of two or more consecutive words. They are faster than phrase queries because the cost is paid at index time. However, they take more disk space.
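To give an idea of what shingles look like, here is a minimal Python sketch that turns a list of terms into word n-grams. The shingles function and its min_size and max_size parameters are hypothetical names chosen for this illustration; this is not the Elasticsearch shingle token filter, only the idea behind it.

```python
def shingles(terms, min_size=2, max_size=2):
    """Return all runs of consecutive terms whose length is between min_size and max_size."""
    result = []
    for size in range(min_size, max_size + 1):
        for start in range(len(terms) - size + 1):
            # Join neighbouring terms into a single indexed term.
            result.append(" ".join(terms[start:start + size]))
    return result


print(shingles("france excellent young players".split()))
```

Each shingle is stored as its own term, so a two-word phrase becomes a single term lookup. That is also why the index grows: every pair of neighbouring words is stored in addition to the words themselves.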