Elasticsearch migration from 1.6 to 2.2 on waitingforcode.com

Elasticsearch 2.2.0 was released at the beginning of February 2016. Because my POC project was stuck on 1.6, I decided to upgrade. It didn't go without surprises and some code rework.

Bartosz Konieczny

In this article we migrate code written with the Elasticsearch Java API from 1.6.0 to 2.2.0. The first part lists some major breaking changes between these two versions. The second and third parts are purely practical: they describe the problems encountered while migrating queries and configuration.

Changes between Elasticsearch 1.6 and 2.2

One of the major changes arrived in the 2.0 release: queries and filters were merged. As you may remember, the difference between them was the presence of scoring in queries and its absence in filters. In Elasticsearch 2.0 we can configure queries to keep or skip the score. The distinction between queries and filters is now based on context. When a clause is executed in filter context, the score is not calculated; in query context, it is. Several things introduce filter context: the constant_score query, the must_not and filter parameters of the bool query, the filter and filters parameters of the function_score query, or any API called filter, such as post_filter.

Mappings also underwent some changes. If some mappings are in a conflicting state, Elasticsearch throws an error. Conflicting means here that several fields, in different types of the same index, have the same name but different mappings. Another change concerns type names. They may now contain dots, provided that the name doesn't begin with a dot. In addition, the name can't be longer than 255 characters. The configuration of several meta-fields can't be changed anymore, among others: _id, _type, _index, _boost and _analyzer (the last two were removed). Some other meta-fields have limited configuration options: _timestamp, _field_names, _routing and _size.
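
The mapping and type-name rules above can be pictured with a small, library-free sketch. It is only an illustration: MappingRules, findConflicts and isValidTypeName are hypothetical names, and a mapping is reduced here to a field name and a datatype string, whereas real Elasticsearch mappings carry more options that can conflict as well.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: none of these names belong to the Elasticsearch API.
final class MappingRules {
    private MappingRules() {}

    /**
     * @param index type name -> (field name -> datatype), a simplified stand-in for mappings
     * @return names of the fields declared with different datatypes in different types
     */
    static Set<String> findConflicts(Map<String, Map<String, String>> index) {
        Map<String, String> firstSeen = new HashMap<>();
        Set<String> conflicts = new TreeSet<>();
        for (Map<String, String> fields : index.values()) {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                // Remember the first datatype seen for a field name, compare the next ones to it
                String previous = firstSeen.putIfAbsent(field.getKey(), field.getValue());
                if (previous != null && !previous.equals(field.getValue())) {
                    conflicts.add(field.getKey());
                }
            }
        }
        return conflicts;
    }

    /** Type-name rules listed above: no leading dot, at most 255 characters. */
    static boolean isValidTypeName(String name) {
        return !name.isEmpty() && !name.startsWith(".") && name.length() <= 255;
    }
}
```

For an index where the type scores declares round as integer while the type teams declares round as string, findConflicts reports round, which is the kind of situation rejected since 2.0.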

There are also some new deprecations. First of all, count queries shouldn't be used anymore; a search with size=0 should be used instead. The optimize API is also deprecated and should be replaced by the Force Merge API. Facets, deprecated since 1.0, were finally removed in 2.0.

Elasticsearch migration from 1.6 to 2.2 - queries

The first problem we meet is the removal of org.elasticsearch.index.query.FilterBuilders and the related filter builders (RangeFilterBuilder, TermFilterBuilder, TermsFilterBuilder). As explained in the first part of the article, since 2.0 filters must be replaced by queries. After some tests, in our case simply replacing a bool filter by a bool query is enough. We can observe that by analyzing the returned hits, where the _score field is null. It means that Elasticsearch doesn't compute the score, so it behaves as it did with filters in 1.6 (note that these hits are sorted by season, and sorting on a field also leaves _score null unless track_scores is enabled):

"hits" : {
  "total" : 7,
  "max_score" : null,
  "hits" : [ {
    "_index" : "french_football",
    "_type" : "scores",
    "_id" : "AVMc7sRw57KQLOYOi-33",
    "_score" : null,
    "_source" : {
      "season" : "1980/1981",
      "hostTeam" : "Paris-SG",
      "guestTeam" : "RC Lens",
      "hostGoals" : 2,
      "guestGoals" : 0,
      "allGoals" : 2,
      "round" : 1
    },
    "sort" : [ "1980/1981" ]
  }, {
    "_index" : "french_football",
    "_type" : "scores",
    "_id" : "AVMc7sSU57KQLOYOi-34",
    "_score" : null,
    "_source" : {
      "season" : "1980/1981",
      "hostTeam" : "Paris-SG",
      "guestTeam" : "RC Lens",
      "hostGoals" : 4,
      "guestGoals" : 5,
      "allGoals" : 9,
      "round" : 1
    },
    "sort" : [ "1980/1981" ]
  },

Replacing the bool filter by a bool query in the Java API looks like this:

// Code in 1.6
FilterBuilder filterBuilder = FilterBuilders.boolFilter()
  .should(
    FilterBuilders.boolFilter().must(
      QueryFilters.hostTeam(teamName),
      QueryFilters.hostGoals(scoredGoals, RangeModes.GTE),
      QueryFilters.guestGoals(scoredGoals, RangeModes.LT)
    ),
    FilterBuilders.boolFilter().must(
      QueryFilters.guestTeam(teamName),
      QueryFilters.guestGoals(scoredGoals, RangeModes.GTE),
      QueryFilters.hostGoals(scoredGoals, RangeModes.LT)
    )
);

// Code in 2.2
QueryBuilder filterBuilder = QueryBuilders.boolQuery()
  .should(
    QueryBuilders.boolQuery().must(QueryFilters.hostTeam(teamName))
      .must(QueryFilters.hostGoals(scoredGoals, RangeModes.GTE))
      .must(QueryFilters.guestGoals(scoredGoals, RangeModes.LT))
  )
  .should(
    QueryBuilders.boolQuery().must(QueryFilters.guestTeam(teamName))
      .must(QueryFilters.guestGoals(scoredGoals, RangeModes.GTE))
      .must(QueryFilters.hostGoals(scoredGoals, RangeModes.LT))
);

One consequence of the merge of queries and filters was the deprecation of filtered queries. In the Javadoc of org.elasticsearch.index.query.QueryBuilders#filteredQuery() we can read:

Use {@link #boolQuery()} instead with a {@code must} clause for the query and a {@code filter} clause for the filter.

Below you can find the code for 1.6 and 2.2 illustrating this deprecation:

// Code for 1.6
return index.tables()
  .setQuery(QueryBuilders.filteredQuery(
    QueryBuilders.matchAllQuery(),
    filterBuilder
  ))
  .addSort("season", SortOrder.ASC)
  .setFrom(0)
  .setSize(ElasticSearchConfig.DEFAULT_MAX_RESULTS)
  .get();

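// Code for 2.2
// must() returns the same hits here; BoolQueryBuilder#filter(filterBuilder)
// would follow the Javadoc advice and run the clause in filter context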
return index.tables()
  .setQuery(QueryBuilders.boolQuery().must(filterBuilder))
  .addSort("season", SortOrder.ASC)
  .setFrom(0)
  .setSize(ElasticSearchConfig.DEFAULT_MAX_RESULTS)
  .get();

To use scripts in 2.2, we must go through a dedicated object, org.elasticsearch.script.Script. Previously it was sufficient to define the script body as a String. An example follows:

// Code for 1.6
public static TermsBuilder score() {
  return AggregationBuilders.terms(SearchDictionary.SCORES)
    // the script body is split into two concatenated literals to keep lines short
    .script("doc['hostGoals'].value.toString()"
      + " + ':' + doc['guestGoals'].value.toString()");
}

// Code for 2.2
private static final Script SCORE_SCRIPT =
  new Script("doc['hostGoals'].value.toString()"
    + " + ':' + doc['guestGoals'].value.toString()");

public static TermsBuilder score() {
  return AggregationBuilders.terms(SearchDictionary.SCORES)
    .script(SCORE_SCRIPT);
}

Count queries are also deprecated and are supposed to be removed in the future. Instead of using classes from org.elasticsearch.action.count, we should use a normal search query with size equal to 0. The code below illustrates that:

// For 1.6
private CountResponse getTestQuery(Client client) {
  CountRequestBuilder countRequestBuilder = 
    new CountRequestBuilder(client, CountAction.INSTANCE).setTypes("teams");
  ActionFuture<CountResponse> responseFuture = 
    client.count(countRequestBuilder.request());
  return responseFuture.actionGet();
}

CountResponse response = getTestQuery(client);

assertThat(response.getCount()).isGreaterThan(0L);

// For 2.2
private SearchResponse getTestQuery(Client client) {
  SearchRequestBuilder countRequestBuilder = 
    new SearchRequestBuilder(client, SearchAction.INSTANCE)
    .setIndices("teams").setTypes("team").setSize(0)
    .setQuery(QueryBuilders.matchAllQuery());
  ActionFuture<SearchResponse> responseFuture = 
    client.search(countRequestBuilder.request());
  return responseFuture.actionGet();
}

SearchResponse response = getTestQuery(client);

assertThat(response.getHits().getTotalHits()).isGreaterThan(0L);

Another change impacting querying concerns the size and from parameters. Their sum must be less than or equal to the value specified in the index.max_result_window configuration entry. To solve this issue we can either set both parameters explicitly in each query or define a big enough value for this entry. The error corresponding to this situation can look like:

org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
// (...)
Caused by: org.elasticsearch.search.query.QueryPhaseExecutionException: 
  Result window is too large, from + size must be less than or equal to: [10000] 
  but was [2147483647]. See the scroll api for a more efficient way to request 
  large data sets. This limit can be set by changing the 
  [index.max_result_window] index level parameter.
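
The first option, setting from and size explicitly for each query, can be sketched with a small library-free helper. ResultWindow, fits and clampSize are hypothetical names and not part of the Elasticsearch API; 10000 is the limit quoted in the error message above. The arithmetic is done on long values because from + Integer.MAX_VALUE overflows an int.

```java
// Illustrative helper only: it mirrors the "from + size <= max_result_window" rule
// and is not part of the Elasticsearch API.
final class ResultWindow {
    /** Limit quoted in the error message above. */
    static final int DEFAULT_MAX_RESULT_WINDOW = 10_000;

    private ResultWindow() {}

    /** True when from + size fits into the window; long arithmetic avoids int overflow. */
    static boolean fits(int from, int size, int maxResultWindow) {
        return (long) from + (long) size <= maxResultWindow;
    }

    /** Largest size, not greater than the requested one, that still fits into the window. */
    static int clampSize(int from, int requestedSize, int maxResultWindow) {
        if (from < 0 || requestedSize < 0) {
            throw new IllegalArgumentException("from and size must not be negative");
        }
        long remaining = (long) maxResultWindow - from;
        if (remaining <= 0) {
            return 0; // the window is already exhausted by the offset
        }
        return (int) Math.min((long) requestedSize, remaining);
    }
}
```

The clamped value can then be passed to setSize(...) instead of Integer.MAX_VALUE. For result sets larger than the window, the error message itself points to the scroll API.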

Elasticsearch migration from 1.6 to 2.2 - embedded

If you use an embedded Elasticsearch engine, you'll certainly encounter some changes. First of all, the path.home entry must be specified in the configuration of the embedded Elasticsearch node. It indicates the home directory of the Elasticsearch installation.

Another important difference comes from the way scripts are evaluated. In 1.6 it was enough to include Groovy as a Maven dependency. It's not enough in 2.2, and the following exception is thrown when launching a scripted query or aggregation:

Failed to execute phase [query], all shards failed; shardFailure (...)
nested: IllegalArgumentException[script_lang not supported [groovy]]

It's because in Elasticsearch 2.2 script engines are externalized as plugins. So, to enable Groovy we can either install the plugin in the plugins directory or register the Groovy plugin manually when org.elasticsearch.node.Node is created. Below you can find the code illustrating the second option:

<!-- must add this dependency -->
<dependency>
  <groupId>org.elasticsearch.module</groupId>
  <artifactId>lang-groovy</artifactId>
  <version>2.2.0</version>
</dependency>

Settings.Builder elasticsearchSettings = Settings.settingsBuilder()
  .put("path.home", "target")
  .put("http.enabled", false)
  .put("path.data", System.getProperty("env.testBase", "target") + "/test-es-data")
  .put("script.engine.groovy.inline.aggs", "true")
  .put("script.engine.groovy.inline.search", "true")
  .put("index.max_result_window", 2147483647)
  .put("node.name", "integration_test")
  // because we use customized Node object, put properties inline
  .put("cluster.name", "test")
  .put("node.data", true)
  .put("node.local", true)
  .put("client.type", "node");
  
node = new ConfigurableNode(elasticsearchSettings.build(),
  Collections.<Class<? extends Plugin>>singleton(GroovyPlugin.class));

// Must start the node explicitly, otherwise an NPE is thrown
// by the following health check
// This issue should be resolved in 2.3 by
// https://github.com/elastic/elasticsearch/pull/16746
node.start();

// wait for yellow/green status before continuing
node.client().admin().cluster()
  .prepareHealth().setWaitForYellowStatus().execute().actionGet();

// ...
private static class ConfigurableNode extends Node {
  public ConfigurableNode(Settings settings, 
        Collection<Class<? extends Plugin>> classpathPlugins) {
    super(InternalSettingsPreparer.prepareEnvironment(settings, null),
      Version.CURRENT,
      classpathPlugins);
  }
}

If you read the previous code carefully, you can see that the node is started manually. Without that, the following exception is thrown:

Caused by: java.lang.NullPointerException
  at org.elasticsearch.cluster.service.InternalClusterService
    .add(InternalClusterService.java:281)
  at org.elasticsearch.cluster.ClusterStateObserver
    .waitForNextChange(ClusterStateObserver.java:154)
  at org.elasticsearch.cluster.ClusterStateObserver
    .waitForNextChange(ClusterStateObserver.java:99)
  at org.elasticsearch.action.support.master
    .TransportMasterNodeAction$AsyncSingleAction
    .retry(TransportMasterNodeAction.java:190)
  at org.elasticsearch.action.support.master
    .TransportMasterNodeAction$AsyncSingleAction
    .doStart(TransportMasterNodeAction.java:164)
  at org.elasticsearch.action.support.master
    .TransportMasterNodeAction$AsyncSingleAction
    .start(TransportMasterNodeAction.java:121)

Another difference concerns the construction of server settings. In 1.6 we used an explicit ImmutableSettings, while 2.2, as shown previously, uses a classic builder which, under the hood, turns the settings into immutable entries:

// It's done under-the-hood by Settings builder
public Settings build() {
  return new Settings(Collections.unmodifiableMap(map));
}
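
The effect of wrapping the builder's map this way can be reproduced without Elasticsearch. The sketch below is a toy stand-in: ToySettings and its Builder are hypothetical classes that only mimic the pattern. Unlike the snippet above, it copies the map before wrapping it, a deliberate difference so that later put() calls on the builder can't show through the unmodifiable view.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the builder pattern described above, not the Elasticsearch Settings class.
final class ToySettings {
    private final Map<String, String> entries;

    private ToySettings(Map<String, String> entries) {
        this.entries = entries;
    }

    String get(String key) {
        return entries.get(key);
    }

    /** Read-only view: any attempt to modify it throws UnsupportedOperationException. */
    Map<String, String> asMap() {
        return entries;
    }

    static Builder builder() {
        return new Builder();
    }

    static final class Builder {
        private final Map<String, String> map = new HashMap<>();

        Builder put(String key, String value) {
            map.put(key, value);
            return this;
        }

        ToySettings build() {
            // Defensive copy first (an addition of this sketch), then the unmodifiable wrapper
            return new ToySettings(Collections.unmodifiableMap(new HashMap<>(map)));
        }
    }
}
```

Calling ToySettings.builder().put("path.home", "target").build() gives an object whose entries can be read but not changed, which is the guarantee ImmutableSettings expressed through its name in 1.6.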

This article shows how to migrate from Elasticsearch 1.6 to 2.2. It shows that rewriting filter-based code as queries takes a fair amount of work. It also shows that there are some problems, as in the case of local node construction, where the same code doesn't behave the same way in both versions.
