Before learning a new technology, we need to get familiar with its specific vocabulary. In the case of Elasticsearch, this vocabulary mostly relates to architecture.
Because Elasticsearch is a layer built on top of the Lucene search engine, we'll start by recalling some Lucene terms. Then we'll move on to Elasticsearch-specific definitions, beginning with the architecture vocabulary. The last part presents the terms that concern Elasticsearch documents. All terms are listed in a logical order, so that each definition leads easily to the next.
Lucene terms in Elasticsearch
Below you can find a list of terms present in Lucene and, consequently, in Elasticsearch:
- inverted index - the core structure Lucene uses to manage indexes. Indexing in Lucene is based on words called 'terms'. To simplify, we'll consider that a term is a single word, for example 'house'. Lucene finds all documents containing this term and links them to it. We could end up with a structure like ['house': [1, 2, 3]], where 1, 2 and 3 are the documents matching the term 'house'.
It's called "inverted" because it lists words and indicates where they can be found (in which documents). To use a book analogy, an inverted index is like the index of terms placed at the end of a book, where each word lists the pages on which it appears. The other type of index, the forward index, is more like a table of contents in which every chapter lists the words it contains. For our 'house' example, a forward index would look like: [1: ['house', 'dog'], 2: ['house', 'car', 'cat', 'dog'], 3: ['house', 'family', 'internet']]
- term - as explained above, a token to which all the matching documents are mapped.
- field - a simple key-value pair.
- document - a unit of indexing and search, composed of a set of fields. One or more documents form an index.
- segment - a building block of the index. New segments may be created when new documents are indexed, and segments can be merged periodically.
- score - a value describing how well a document matches the submitted query. It helps to determine the relevance of documents for the user's search.
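The relationship between the forward and the inverted index described above can be sketched in a few lines of Python. This is only an illustration built on the 'house' example from the text, not Lucene's actual data structures; the function name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict

# Forward index: document id -> terms it contains (the example from the text).
forward_index = {
    1: ["house", "dog"],
    2: ["house", "car", "cat", "dog"],
    3: ["house", "family", "internet"],
}


def build_inverted_index(forward):
    """Invert a forward index: term -> list of document ids containing it."""
    inverted = defaultdict(list)
    # Visiting documents in id order keeps every list of ids sorted.
    for doc_id in sorted(forward):
        # set() ensures a document is listed once per term, even if the
        # term appears several times in it.
        for term in set(forward[doc_id]):
            inverted[term].append(doc_id)
    return dict(inverted)


inverted_index = build_inverted_index(forward_index)
print(inverted_index["house"])  # [1, 2, 3]
print(inverted_index["dog"])    # [1, 2]
```

Looking up a term is then a single dictionary access, which is the reason why this layout suits full-text search.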
Architecture terms in Elasticsearch
One of the powerful features of Elasticsearch is its architecture oriented toward horizontal scalability. It means that we can improve search and indexing performance simply by adding new servers to the cluster. That's why a large part of the architecture terms relates to this aspect. The following list contains the terms related to Elasticsearch architecture:
- cluster - represents all servers available for searching and indexing work. Each cluster has an automatically elected master server.
- node - a server of the cluster. In other words, we can consider it as a single running Elasticsearch instance.
- master node - the main node in the cluster. It has more responsibilities than a regular node because it manages the other nodes (adds new ones or removes old ones).
- shard - a container for indexed data. It doesn't contain a copy of the indexed elements but the indexed elements themselves.
Shards are allocated to nodes, but not in a fixed manner: they can move between nodes to keep the cluster balanced.
- primary shard - contains the originally indexed document. Every time a new document is indexed, the operation is first executed on the primary shard.
- replica shard - a copy of a primary shard. It's very useful when a primary shard fails, because a replica shard can then replace it. Replica shards also bring performance gains: thanks to them, read requests can be handled by different nodes of the cluster.
Unlike a primary shard, a replica shard doesn't accept indexing requests directly: it only receives the changes already applied on its primary shard. The only client requests it serves are reads, such as searches or document retrieval.
Another difference is that the number of replica shards can be modified at runtime, while the number of primary shards can be specified only when the index is created.
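The write and read paths described above can be illustrated with a toy model. This is a minimal sketch under several assumptions, not Elasticsearch's implementation: the class `ToyIndex` and its methods are hypothetical, shards are plain dictionaries, and `zlib.crc32` stands in for whatever hash function the real routing uses. Picking the shard with a hash modulo the number of primary shards also hints at why that number is hard to change afterwards: a different modulo would send existing document ids to other shards.

```python
import zlib


class ToyIndex:
    """Toy model of an index split into primary shards with replica copies."""

    def __init__(self, primary_shards, replicas):
        self.primary_shards = primary_shards  # fixed once the index exists
        self.replicas = replicas              # can be changed later
        self.primaries = [{} for _ in range(primary_shards)]
        # One list of replica copies per primary shard.
        self.replica_copies = [
            [{} for _ in range(replicas)] for _ in range(primary_shards)
        ]

    def shard_for(self, doc_id):
        # Assumption: a stable hash modulo the primary count picks the shard.
        return zlib.crc32(doc_id.encode("utf-8")) % self.primary_shards

    def index(self, doc_id, document):
        shard = self.shard_for(doc_id)
        self.primaries[shard][doc_id] = document   # primary shard first
        for copy in self.replica_copies[shard]:    # then its replicas
            copy[doc_id] = document

    def get(self, doc_id, copy_number=0):
        # Reads may be served by the primary (0) or by any replica (1..n).
        shard = self.shard_for(doc_id)
        copies = [self.primaries[shard]] + self.replica_copies[shard]
        return copies[copy_number % len(copies)].get(doc_id)

    def set_replicas(self, replicas):
        # New replicas start as copies of their primary; extra ones are dropped.
        for shard, copies in enumerate(self.replica_copies):
            while len(copies) < replicas:
                copies.append(dict(self.primaries[shard]))
            del copies[replicas:]
        self.replicas = replicas


idx = ToyIndex(primary_shards=3, replicas=1)
idx.index("doc-1", {"title": "house"})
print(idx.get("doc-1", copy_number=0))  # served by the primary shard
print(idx.get("doc-1", copy_number=1))  # served by the replica shard
```

Calling `idx.set_replicas(2)` afterwards adds a second replica per primary shard without touching the primaries, which mirrors the runtime flexibility of replicas mentioned above.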
Document terms in Elasticsearch
Another useful family of terms relates to the indexing and search actions. This time we'll use an analogy with relational databases to grasp some of the concepts more quickly:
- document - as in Lucene, it represents an indexable and searchable unit. It can be compared to a row in a relational database.
- field - as in Lucene, a part composing each document. Compared to a relational database, a field is like a column. However, Elasticsearch is schema-less, so two documents can contain different fields.
- index - can be thought of as a database in relational databases. It maps to 1 or more primary shards and 0 or more replica shards. It's also the endpoint exposed to applications consuming Elasticsearch data. An index doesn't store data itself.
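- type - is like a table in a relational database, so it's a part of an index. Note that types were deprecated starting with Elasticsearch 6 and removed in later versions.
- mapping - defines the fields of the documents stored in an index and how they are indexed. It's like a schema definition in a relational database.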
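To make these terms more concrete, here is a sketch of what the body of an index creation request could look like, written as a Python dictionary. It assumes a recent Elasticsearch version where the mapping is declared without a type name and dynamic mapping is left enabled. The field names (`title`, `price`, `color`) and the values are purely illustrative, and no request is actually sent.

```python
import json

# Hypothetical body for creating an index; the field names are illustrative.
create_index_body = {
    "settings": {
        "number_of_shards": 3,    # primary shards, chosen at creation time
        "number_of_replicas": 1,  # replica copies per primary shard
    },
    "mappings": {
        "properties": {
            "title": {"type": "text"},
            "price": {"type": "float"},
        }
    },
}

# Two documents of the same index that do not share all their fields.
documents = [
    {"title": "house with garden", "price": 250000.0},
    {"title": "family car", "color": "red"},
]

# List the fields of each document to show that they differ.
fields_per_document = [sorted(doc) for doc in documents]

print(json.dumps(create_index_body, indent=2))
print(fields_per_document)  # [['price', 'title'], ['color', 'title']]
```

The settings part ties the index to its shards, while the mappings part plays the role of the schema definition from the relational analogy.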
This article introduced some basic but very important concepts needed to work well with Elasticsearch. The first part presented the ideas coming from Lucene, essentially related to index construction. The next part described the architecture concepts specific to Elasticsearch. The last part presented, thanks to analogies with relational databases, the main components of the Elasticsearch indexing process. The last two parts also explained the difference between two similar concepts, shards and indexes. Shards hold the data and can be either primary or replica. Indexes only link to shards, don't store any data themselves and are exposed to the applications consuming the data.