Indexing documents in Elasticsearch

Retrieving documents in Elasticsearch wouldn't be possible without indexing. Indices are the intermediate layer between the user and the shards which store document data.

This time we focus on the starting point of Elasticsearch, document indexing. The first 3 parts of this article describe the principal concepts in indexing: shards, analyzers and types with mappings. The last part presents what happens when a document is submitted for indexing.

Index configuration - shards

Index configuration consists of 3 main parts: shards, analyzers and mapping. If you remember well (if not, you can read an article about Elasticsearch architecture and vocabulary), an index points to several shards which store document data. So it's pretty natural that it needs to know how many shards of each type (primary, replica) will be needed. Index properties are passed inside the settings entry at the root of the JSON document body:

{"settings": { "number_of_shards": 5, "number_of_replicas": 10}}

This entry defines an index with 5 primary shards and 10 replicas. But these are not the only available configuration entries. We can also (see the sketch after this list):
- make an index read-only (no changes allowed on already indexed documents) with the index.blocks.read_only key, or write-only (index.blocks.read set to true)
- configure the maximal size of the filter cache (index.cache.filter.max_size)
- or configure slow query logging with entries beginning with the index.search.slowlog prefix
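
As a quick sketch, such entries can be applied through the index settings API. The "blog" index name below is purely illustrative and the exact setting keys depend on the Elasticsearch version:

# hedged sketch: 'blog' is a hypothetical index name; the exact keys depend on the Elasticsearch version
curl -XPUT 'localhost:9200/blog/_settings' -d '{
  "index.blocks.read_only": true,
  "index.search.slowlog.threshold.query.warn": "5s"
}'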

Index configuration - analyzers

Analyzers are another key part of index configuration. They can be found under the analysis section of the index configuration. The main role of analyzers is to transform text into indexable and searchable terms. Analyzers are composed of:
- character filters: these filters clean the incoming text before it's split and transformed into terms. This can consist, for example, of removing all HTML entities from the text to index.
- tokenizer: transforms the words of a text into indexable terms according to the specified configuration. This operation can consist, for example, of splitting the text on punctuation signs such as ".", "," or ";".
- token filters: the purpose of these filters is to clean not the raw text (as character filters do) but the terms produced by the tokenizer. This can consist, for example, of removing non-indexable stop words (such as "and", "or") or normalizing terms (for example lowercasing them).

Elasticsearch defines some basic analyzers. Their building blocks can also easily be combined into custom analyzers defined in the "analysis" section of the index configuration.
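
To illustrate, here is a minimal sketch of a custom analyzer combining the built-in html_strip character filter, the standard tokenizer and the lowercase and stop token filters (the index and analyzer names are hypothetical):

# hedged sketch: 'blog' and 'html_text_analyzer' are hypothetical names
curl -XPUT 'localhost:9200/blog' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_text_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}'

Once defined, such an analyzer can be referenced by name in the mapping of any field of this index.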

Index configuration - types and mapping

Index configuration also covers another important point, types and mapping. An index can be created without a specific mapping. However, in this case Elasticsearch will guess the type of each field. And this can lead to wrong results in searches - the integer 21 will match only 21, while the text "21" will potentially match "21", "221" or "121".

Types are a little like tables in SQL databases because they group similar documents together. Each type can have its own mapping defining the elements to index and the way of indexing and searching them. In a mapping, we can define (see the sketch after this list):
- the field type (type): can be text (string type), a number (integer, double, byte, float), a date or even an array
- the field storage (store): if false, Elasticsearch won't store the given field in the index
- whether a given field is indexable and searchable after the tokenization process (index set to analyzed), searchable without tokenization (not_analyzed) or not searchable at all
- the analyzers to apply for searching (search_analyzer) and indexing (index_analyzer), or for both (analyzer)
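
To make it more concrete, below is a hedged sketch of a mapping using the legacy string-type syntax matching the options above; the type and field names are purely illustrative:

# hedged sketch: 'blog', 'article' and the fields are hypothetical; syntax follows older (pre-5.x) Elasticsearch mappings
curl -XPUT 'localhost:9200/blog/_mapping/article' -d '{
  "article": {
    "properties": {
      "title": {"type": "string", "index": "analyzed", "analyzer": "standard"},
      "category": {"type": "string", "index": "not_analyzed"},
      "publishDate": {"type": "date", "store": true},
      "wordsCount": {"type": "integer"}
    }
  }
}'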

How is a document indexed?

Indexing a document is a process based on 4 steps. First, based on the document id, Elasticsearch finds the shard where the document should be stored. If the document doesn't exist yet, it's created on the chosen shard. Next, the document fields are validated. If some fields don't exist in the initial index mapping, they are added automatically; in this situation, Elasticsearch guesses the type of each new field. After that, the document is indexed on the shard.
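
For illustration, submitting a document for indexing can look like below; the index, type and field values are hypothetical, and by default the target shard is derived from the document id (roughly hash(id) % number_of_primary_shards):

# hedged sketch: 'blog', 'article' and the fields are hypothetical
curl -XPUT 'localhost:9200/blog/article/1' -d '{
  "title": "Indexing documents in Elasticsearch",
  "publishDate": "2016-01-10"
}'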

Note that the document is available for search only after a refresh. By default, every shard is refreshed automatically every second, so an indexed document should be visible for search after this delay. However, in some cases (such as the indexing of millions of huge documents) a 1-second refresh can be overkill and it's better to increase this value. An indexed document is first moved to the filesystem cache and only after this operation is it flushed to disk. The order is not accidental because sending data to the cache is cheaper than flushing it to disk. Thanks to that, search performance doesn't decrease because of the indexing phase.
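
For example, the refresh delay could be relaxed during a heavy indexing phase with the dynamic index.refresh_interval setting; the 30s value below is just an illustration:

# hedged sketch: 'blog' is a hypothetical index name, 30s is an arbitrary example value
curl -XPUT 'localhost:9200/blog/_settings' -d '{"index": {"refresh_interval": "30s"}}'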

Elasticsearch is based on an inverted index. The terms put inside it are determined by the analyzers defined in the index mapping. That's why it's important to be careful about the analyzers used in the indexing and search steps. Otherwise, search results can be inconsistent.
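
A simple way to see which terms would land in the inverted index is the _analyze API; the sketch below uses the older query-parameter form of the request, which varies between versions:

# hedged sketch: the _analyze request format differs between Elasticsearch versions
curl -XGET 'localhost:9200/_analyze?analyzer=standard' -d 'Indexing documents in Elasticsearch'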

We discovered the main concepts of indexing in Elasticsearch. The first 3 parts presented the principal players participating in indexing: shards, analyzers and types with mapping. The last part explained some of the operations executed by Elasticsearch when a document is indexed.

