Elasticsearch and some concepts of document-oriented databases

Every NoSQL solution comes with its own set of basic concepts. For example, "node" means something different in a graph database than in a document-oriented, clustered system such as Elasticsearch. This article presents some of the concepts specific to the Elasticsearch search engine.

This article begins with an attempt to explain the idea behind document-oriented systems. The second part is dedicated to some of the concepts found in Elasticsearch and in other document-oriented storage systems.

Document-oriented database

According to Wikipedia's definition, a document-oriented database is a system "designed for storing, retrieving, and managing document-oriented information, also known as semi-structured data". This definition reflects a different way of looking at data which, in some NoSQL systems, is treated as a document.

The idea of documents brings some changes to the data structure. First of all, the strict structure imposed by relational databases is replaced by a more flexible schema. This approach is very often called schema-less because there are no fixed constraints on how data is organized. For example, one document can represent a person with only an age attribute while a second document represents another person with age and some other attributes. This flexibility is very useful for write operations. Because data is treated as almost independent units (documents), you can modify only one of them without locking the rest of the documents.
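
The schema-less idea can be illustrated with a minimal in-memory sketch. It only shows the shape of the data, not how a real engine stores or locks documents. The names (person_1, person_2, collection) and the field values are hypothetical.

```python
import json

# Two hypothetical "person" documents stored side by side.
# Neither of them has to follow a schema shared with the other.
person_1 = {"name": "Alice", "age": 30}
person_2 = {"name": "Bob", "age": 41, "city": "Paris", "hobbies": ["chess"]}

# A toy "collection": document id -> document.
collection = {"1": person_1, "2": person_2}

# Updating one document touches only that document.
collection["1"]["age"] = 31

for doc_id, doc in collection.items():
    print(doc_id, json.dumps(doc))
```

The second document carries attributes the first one does not have, and nothing had to be declared beforehand to allow it.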

In addition, this absence of locking helps to improve scalability by adding supplementary hardware (horizontal scalability). To simplify, we could say that documents are like separate files which we can easily duplicate across machines, for example by putting "Document 1" on computer A in Cluster A and on computer B in Cluster B, but "Document 2" only on computer A, etc. In the world of relational databases this technique is more difficult to achieve. It usually consists of replicating data from one server accepting write operations to other servers accepting only read operations. That looks more like a "load-balancing" method than horizontal scalability, where data can be read from different machines simultaneously.

Basic concepts of document-oriented databases

After this introduction to the world of document-oriented systems, we can look at some of the ideas associated with the systems that build on them: search engines like Elasticsearch or Solr.

  1. ngram

    The first concept is the ngram. To keep it simple and avoid complex mathematical terms, an ngram is a contiguous sequence of n characters taken from a text. Indexing ngrams lets a search behave a bit like Java's String contains() method, which checks whether a String contains the searched sequence of characters. The n before "gram" tells how many characters each sequence holds. For example, sequences of 3 characters are called trigrams, sequences of 5 characters five-grams, etc.
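
    As a minimal sketch of the idea, the hypothetical helper below (char_ngrams is not an Elasticsearch function) lists every character ngram of a word. Looking up a fragment among the generated ngrams gives a contains()-like partial match.

```python
from typing import List


def char_ngrams(text: str, n: int) -> List[str]:
    """Return every contiguous run of n characters found in text."""
    if n <= 0:
        raise ValueError("n must be positive")
    # A text shorter than n yields an empty list.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


trigrams = char_ngrams("house", 3)
print(trigrams)           # ['hou', 'ous', 'use']
print("ous" in trigrams)  # True: the fragment matches part of the word
```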

  2. Tokenizer

    To understand the role of tokenizers, we need to remember that document data is indexed by "tokens". It means that if we have a sentence like "This is a house", the search engine won't index the whole sentence as a single entry (unless we decide otherwise). Instead, most of the time, it will create index entries for the following words: "This", "is", "a", "house". These words are the tokens produced by tokenizers.

    As you can see, some of these tokens carry little meaning on their own. That's why several kinds of tokenizers exist. For example, in Elasticsearch we can distinguish the following main tokenizers:
    - standard: good for most European languages, implements the rules from Unicode Standard Annex #29
    - classic: adapted to English documents
    - lowercase: divides text at non-letter characters and lowercases the tokens
    - keyword: keeps the whole input as a single token
    - whitespace: divides text at whitespace
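
    The behavior of three of these tokenizers can be approximated in a few lines. The functions below are hypothetical stand-ins written for illustration only. They are not Elasticsearch's implementation and ignore many of its edge cases.

```python
import re
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    # Split on runs of whitespace, keep case and punctuation untouched.
    return text.split()


def lowercase_tokenize(text: str) -> List[str]:
    # Split on anything that is not a letter, then lowercase each token.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def keyword_tokenize(text: str) -> List[str]:
    # Keep the whole input as one single token.
    return [text]


sentence = "This is a house"
print(whitespace_tokenize(sentence))  # ['This', 'is', 'a', 'house']
print(lowercase_tokenize(sentence))   # ['this', 'is', 'a', 'house']
print(keyword_tokenize(sentence))     # ['This is a house']
```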

  3. Stemming

    Stemming is another strange word with a very important meaning in search engines. This process consists of reducing words to their root forms. For example, it will reduce "fishing", "fisher", "fished", "fishes" to the base word "fish". However, this technique also has its pitfalls. It's not very easy to implement and, if the wrong method is adopted, the results of stemming can be invalid. These problems are known as:
    - understemming: words are not stemmed enough, so relevant documents are not found
    - overstemming: words are stemmed too much, so unrelated words end up under the same root and the results are not accurate
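
    Both pitfalls can be reproduced with a deliberately naive suffix-stripping stemmer. This is only a sketch under simple assumptions, not the algorithm used by Elasticsearch. The SUFFIXES tuple and the naive_stem function are hypothetical.

```python
# Suffixes tried in order; the first one that matches is removed.
SUFFIXES = ("ing", "ed", "es", "er", "s")


def naive_stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip the suffix only if at least 3 characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


for word in ("fishing", "fisher", "fished", "fishes"):
    print(word, "->", naive_stem(word))  # every word is reduced to "fish"

# Overstemming: "news" is cut down to "new", an unrelated word.
print(naive_stem("news"))

# Understemming: "ran" and "run" keep different roots, so a search
# for one of them would not find the other.
print(naive_stem("ran"), naive_stem("run"))
```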

  4. Stopwords

    In text searching, stopwords are words of lower importance than the really meaningful indexed words. You can consider as stopwords the articles (a/an/the), prepositions (on/in/at) and other words that carry little meaning on their own.

    These words can still be indexed and searchable. However, if they are, we risk higher search latency (more items to analyze) and more disk space used. On the other hand, stopwords are useful in some kinds of searches, for example when you search for negations ("certain" doesn't mean the same as "not certain"). Fortunately, stopwords can be configured and, for example, we can exclude the negation word "not" from them.
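
    The effect of a configurable stopword list can be sketched as follows. The word list and the remove_stopwords helper are hypothetical and much smaller than what a real engine ships with.

```python
from typing import Iterable, List, Set

# A tiny, hypothetical stopword list that includes the negation "not".
DEFAULT_STOPWORDS = {"a", "an", "the", "on", "in", "at", "is", "not"}


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    # Keep only the tokens that are absent from the stopword list.
    return [token for token in tokens if token.lower() not in stopwords]


tokens = ["this", "is", "not", "certain"]

# With the default list the negation disappears from the index.
print(remove_stopwords(tokens, DEFAULT_STOPWORDS))  # ['this', 'certain']

# Excluding "not" from the stopwords keeps the negation searchable.
custom_stopwords = DEFAULT_STOPWORDS - {"not"}
print(remove_stopwords(tokens, custom_stopwords))   # ['this', 'not', 'certain']
```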

  5. Fuzzy search

    This idea is also known as "approximate string matching". Fuzzy search consists of finding words that are similar to the expected pattern without being identical to it. For example, we could match our previous sentence ("This is a house") even with a word that doesn't appear in it exactly, such as "hose".

    In searching, this feature is useful for misspelled tokens. As our example shows, "hose" looks really similar to "house" and may simply be the result of typing too quickly.
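
    One common way to measure how similar two words are is the Levenshtein edit distance: the number of single-character insertions, deletions or substitutions needed to turn one word into the other. The sketch below is a plain textbook version. The fuzzy_match helper and its max_edits parameter are hypothetical names chosen for this example.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query: str, tokens: Iterable[str], max_edits: int = 1) -> List[str]:
    # Keep the tokens that are at most max_edits edits away from the query.
    return [t for t in tokens if levenshtein(query.lower(), t.lower()) <= max_edits]


print(levenshtein("hose", "house"))                      # 1
print(fuzzy_match("hose", ["This", "is", "a", "house"]))  # ['house']
```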

  6. Scoring

    Scoring is also known as content relevancy ranking. It can be based on very basic criteria, such as the number of occurrences of the searched word in the text. But it can also rely on a more complex algorithm.

    Elasticsearch lets you influence scoring through boosting. Boosting increases the relevance of documents matching some search criteria. It can be applied at index time as well as at query time.
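
    The most basic criterion, counting occurrences, combined with a boost on one field, can be sketched as below. This toy scoring is an assumption made for illustration and is not the formula Elasticsearch uses. The documents, field names and the title_boost parameter are hypothetical.

```python
from typing import Dict, List


def term_frequency(term: str, tokens: List[str]) -> int:
    # Count how many times the searched term occurs among the tokens.
    return sum(1 for token in tokens if token.lower() == term.lower())


def score(doc: Dict[str, List[str]], term: str, title_boost: float = 1.0) -> float:
    # Occurrences in the title are multiplied by the boost factor.
    return term_frequency(term, doc["title"]) * title_boost + term_frequency(term, doc["body"])


documents = {
    "doc1": {"title": ["house"], "body": ["a", "small", "house"]},
    "doc2": {"title": ["garden"], "body": ["house", "house", "house"]},
}


def rank(term: str, title_boost: float = 1.0) -> List[str]:
    # Sort document ids from the highest score to the lowest.
    return sorted(documents, key=lambda d: score(documents[d], term, title_boost), reverse=True)


print(rank("house"))                   # ['doc2', 'doc1']
print(rank("house", title_boost=3.0))  # ['doc1', 'doc2']
```

    Without a boost the document repeating the word most often wins. Once the title is boosted, the document mentioning the word in its title is ranked first.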

In this article we discovered some of the basic concepts behind the Elasticsearch search engine. The first part briefly explained the type of storage and the differences between document-oriented systems (on which Elasticsearch is based) and relational databases. The second part explained some concepts with a linguistic background which are found very often in search engines. We discovered the methods to handle misspelled words (ngram, fuzzy search) and to promote some kinds of results (scoring, stemming). This part also explained which techniques exist to control the way documents are indexed (tokenizer, stopwords).
