Using Elasticsearch without querying it would be a little strange. After all, the name of this document-oriented database ends with "search".
This time we'll focus on the most useful aspect of Elasticsearch - searching. The first part of this article describes the response returned by a search. After that we'll explain what happens when a search query is sent to Elasticsearch. The last part shows how to compose simple queries with the Elasticsearch Java API.
Response meaning
The documents returned for a search query are called hits. You can find an example of an Elasticsearch response below:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 2.5383382,
    "hits" : [
      {
        "_index" : "waitingforcode",
        "_type" : "teams",
        "_id" : "AU3Dc7qdxdyNjb1M0l8-",
        "_score" : 2.5383382,
        "_source" : {"name": "RC Paris"}
      },
      {
        "_index" : "waitingforcode",
        "_type" : "teams",
        "_id" : "AU3Dc7rLxdyNjb1M0l9X",
        "_score" : 0.45677248,
        "_source" : {"name": "RC Roubaix"}
      },
      {
        "_index" : "waitingforcode",
        "_type" : "teams",
        "_id" : "AU3Dc7rYxdyNjb1M0l9a",
        "_score" : 0.42763886,
        "_source" : {"name": "CA Paris"}
      }
    ]
  }
}
Let's decompose this JSON output and explain each entry of the search result:
- took - execution time of the request, in milliseconds.
- timed_out - boolean flag indicating whether the request timed out. The timeout parameter tells the node handling the request to stop and return the results found so far.
- _shards - this block contains information about the shards used to process the query. As the field names indicate, it holds the total number of shards, the number of shards that responded successfully and the number that failed. A shard can be marked as failed when both its primary and its replica copies are lost.
- hits/total - total number of documents matching the query.
- hits/max_score - the highest value of the _score field among all matching documents.
- hits/hits - the list of documents matching the executed query. Each hit is composed of several other fields:
  - _index: index in which the document was found.
  - _type: type of the document within the index.
  - _id: id of the matching document.
  - _score: relevance of the document to the executed query.
  - _source: data of the found document. All stored information is returned in this field. Thanks to that, there is no need to fetch matching documents in separate calls using the value of the _id field.
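To make the relation between these fields more concrete, here is a minimal sketch that models the hits block with plain Java records. The names Hit, Hits and ResponseExample are hypothetical and only mirror the JSON above; they are not classes of the Elasticsearch client.

```java
import java.util.List;

// Hypothetical, simplified model of one entry of hits/hits (not an Elasticsearch class).
record Hit(String index, String type, String id, double score, String name) {}

// Hypothetical model of the "hits" block: total count plus the returned documents.
record Hits(long total, List<Hit> hits) {
    // Highest _score among the hits held by this object, NaN when there are none.
    double maxScore() {
        return hits.stream().mapToDouble(Hit::score).max().orElse(Double.NaN);
    }
}

final class ResponseExample {
    // Rebuilds the three hits of the sample response shown earlier.
    static Hits sample() {
        return new Hits(3, List.of(
            new Hit("waitingforcode", "teams", "AU3Dc7qdxdyNjb1M0l8-", 2.5383382, "RC Paris"),
            new Hit("waitingforcode", "teams", "AU3Dc7rLxdyNjb1M0l9X", 0.45677248, "RC Roubaix"),
            new Hit("waitingforcode", "teams", "AU3Dc7rYxdyNjb1M0l9a", 0.42763886, "CA Paris")));
    }
}
```

In the sample every matching document is also returned, so the computed maximum equals max_score. With pagination, total can be larger than the number of entries in the hits list.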
How does a query work?
In the previous part we came across a very interesting feature of Elasticsearch - scoring. Thanks to it, Elasticsearch not only returns matching documents, but also tells us how well each returned document matches the user's query. Scoring is based on concepts coming from term frequency/inverse document frequency and the vector space model, enriched with additional features such as query boosting. This is not the right moment to go into the details, though. For now, it's enough to know that query terms have their own weights and that mathematical operations on these weights (among other things) determine the final relevance score.
A more important concept to grasp right now is searching itself. Search results combine the responses of all shards of the queried index. When a user sends a query, it arrives at one node of the cluster. This node becomes the coordinating node. Its main role is to pass the user's query to one copy (primary or replica) of every shard in the given index. Each shard executes the query locally and returns its results (by the way, this explains the presence of the _shards entry in the response). Each shard returns as many documents as the user requested in the query. So if the user wants to see only 10 documents and the index has 5 shards, the coordinating node will receive up to 50 documents and must reduce them to 10. When the from parameter of the query is greater than 0, each shard returns from + size documents instead.
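The arithmetic of this phase can be sketched in a few lines. QueryWindow and its method names are hypothetical helpers written for this article, not part of the Elasticsearch API, and the result is an upper bound because a shard may hold fewer matching documents.

```java
final class QueryWindow {
    // Number of top entries each shard has to return for the requested page.
    static int perShard(int from, int size) {
        return from + size;
    }

    // Upper bound on the number of entries the coordinating node has to merge.
    static int atCoordinator(int shards, int from, int size) {
        return shards * perShard(from, size);
    }
}
```

For the example above, QueryWindow.atCoordinator(5, 0, 10) gives 50. Asking for the tenth page of the same query, QueryWindow.atCoordinator(5, 90, 10), already gives 500, which is why deep pagination gets more expensive with every page.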
At this point the coordinating node has received the results from all queried shards. These results already contain the score of each document, so the coordinating node only has to merge them into a single sorted queue. However, one element is still missing - _source. A second phase, the fetch phase, must be invoked. Its goal is to send multi-get requests to the shards containing the selected documents. The responses to these requests are then merged together and returned to the user. Only the documents which will be returned to the user are fetched.
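The two phases can be simulated with plain Java collections. This is only a sketch of the idea under simplifying assumptions: TwoPhaseSearch, ScoredId and the in-memory maps standing in for shards are hypothetical, and the real implementation relies on a priority queue and network calls rather than a full sort of in-memory lists.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

final class TwoPhaseSearch {
    // Lightweight entry produced by the first phase: shard number, id and score, no _source yet.
    record ScoredId(int shard, String id, double score) {}

    // First phase on the coordinating node: merge the per-shard results
    // by descending score and keep only the requested page.
    static List<ScoredId> merge(List<List<ScoredId>> shardResults, int from, int size) {
        List<ScoredId> all = new ArrayList<>();
        shardResults.forEach(all::addAll);
        all.sort(Comparator.comparingDouble(ScoredId::score).reversed());
        int start = Math.min(from, all.size());
        int end = Math.min(from + size, all.size());
        return new ArrayList<>(all.subList(start, end));
    }

    // Fetch phase: load the source only for the documents that made it to the page.
    // sourcesByShard.get(n) plays the role of shard n and maps a document id to its data.
    static List<String> fetch(List<ScoredId> page, List<Map<String, String>> sourcesByShard) {
        List<String> sources = new ArrayList<>();
        for (ScoredId hit : page) {
            sources.add(sourcesByShard.get(hit.shard()).get(hit.id()));
        }
        return sources;
    }
}
```

The point of the split is visible in fetch: documents that were scored but did not make it to the final page never have their data loaded or sent over the network.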
Search example with Java API
To see search in action, let's take as an example the methods of TeamServiceImpl used to retrieve teams by their names:
@Override
public SearchResponse findAllTeamsByName(String name) {
    QueryBuilder queryBuilder = QueryBuilders.queryStringQuery("name:"+name)
        .analyzer("team_synonym_analyzer")
        .queryName("teamSynonymName");
    SearchResponse response = index.teams()
        .setQuery(queryBuilder)
        .setFrom(0)
        .setSize(200)
        .addSort("name", SortOrder.ASC)
        .get();
    return response;
}

@Override
public SearchResponse findAllTeamsByNameAndFuzziness(String name) {
    QueryBuilder queryBuilder = QueryBuilders.fuzzyQuery("name", name)
        .fuzziness(Fuzziness.ONE)
        .queryName("fuzzyTeamName");
    SearchResponse response = index.teams()
        .setQuery(queryBuilder)
        .setFrom(0)
        .setSize(200)
        .addSort("name", SortOrder.ASC)
        .get();
    return response;
}
You can observe that creating a query is quite easy. We use another builder, org.elasticsearch.index.query.QueryBuilder. This interface is the base for all objects representing the query types available in Elasticsearch. In our example we use FuzzyQueryBuilder and QueryStringQueryBuilder, representing, respectively, fuzzy search and query string search.
Our query string is composed of only two elements. The first is the query fragment name:teamName, which means that we want to get all teams whose name attribute matches the teamName value. The second is the analyzer. It indicates the analyzer to use to find matching documents. Simply speaking, an analyzer determines how documents are indexed and searched. It can specify, for example, that from the sentence "This is a cat" only the word "cat" is indexed and searchable. Because different analyzers can be used at index time and at search time, one can be passed to the QueryStringQueryBuilder object.
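The "This is a cat" example can be reproduced with a toy analyzer. The sketch below is an assumption-laden simplification: ToyAnalyzer and its stop-word list are hypothetical, real analyzers are configured in the index settings, and team_synonym_analyzer (whose definition is not shown here) presumably does more than lowercasing and stop-word removal.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

final class ToyAnalyzer {
    // Hypothetical stop-word list, chosen so that the sentence from the text reduces to "cat".
    private static final Set<String> STOP_WORDS = Set.of("this", "is", "a");

    // Lowercases the text, splits it on non-word characters and drops the stop words.
    static List<String> analyze(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\W+"))
            .filter(token -> !token.isEmpty())
            .filter(token -> !STOP_WORDS.contains(token))
            .collect(Collectors.toList());
    }
}
```

Calling ToyAnalyzer.analyze("This is a cat") keeps only the token cat. Applying the same function to the indexed text and to the query text is what makes a search for "A CAT" match that sentence.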
Another query object is FuzzyQueryBuilder. The concept of fuzziness was explained in the article about Some basic concepts of document-oriented databases. To simplify, it's approximate string matching, so it returns documents that match the expected query terms almost exactly. FuzzyQueryBuilder specifies which field the fuzzy matching applies to and with which fuzziness (fault tolerance) value.
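Fuzziness is usually expressed as an edit distance, so Fuzziness.ONE can be read as "at most one character change". The sketch below computes the classical Levenshtein distance (insertions, deletions, substitutions) to illustrate the idea. EditDistance is a hypothetical helper, not the code Elasticsearch runs, and Elasticsearch's fuzzy query can additionally count a swap of two adjacent characters as a single edit, which this sketch does not do.

```java
final class EditDistance {
    // Classical Levenshtein distance computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // True when the candidate term is within the allowed number of edits of the searched term.
    static boolean matches(String term, String candidate, int fuzziness) {
        return levenshtein(term, candidate) <= fuzziness;
    }
}
```

With a fuzziness of 1, EditDistance.matches("paris", "parts", 1) is true because a single substitution separates the two terms, while EditDistance.matches("paris", "roubaix", 1) is false.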
Both queries are also sorted and paginated. They return the results sorted by team name in ascending order, and only the first 200 documents are returned. Search hits are then converted to the expected objects thanks to Spring's Converter objects, for example:
private enum ToTeamConverter implements Converter<SearchHit, TeamDto> {
    INSTANCE {
        @Override
        public TeamDto convert(SearchHit searchHit) {
            Map<String, Object> data = searchHit.sourceAsMap();
            return TeamDto.valueOf((String) data.get("name"));
        }
    }
}
sourceAsMap is not the only method of org.elasticsearch.search.SearchHit objects. Other methods allow us to retrieve the same values as in the returned document, i.e. score, index, type and id. We can also get more information than in the usual response, such as the shard containing the document or an explanation of why the given document appears in the query results.
In this article we saw how search works in Elasticsearch. First, we learned that the documents matching a search query are called hits and that they carry relevance score information. Next, we explained the search logic under the hood. We saw that a search query is executed in two steps: search and fetch. The first finds the matching documents and the second gets their data. At the end we briefly saw how to execute simple search queries with the Elasticsearch Java API.