
Using Luke with ElasticSearch

If you’ve used Lucene or Solr before, you might be familiar with Luke, a Lucene tool for viewing and modifying indexes.  Luke was originally written by Andrzej Bialecki as a side project, and is an indispensable debugging tool for digging into the guts of an index (for example, to see exactly which tokens were stored on a particular field).

[Screenshot: Luke's Overview tab]

Unfortunately, Luke hasn't been particularly well maintained.  The official project at Google Code hasn't been updated since Lucene 4.0.0-ALPHA (ca. 2012).  The good news is that there is some great community support for Luke at the moment, including the Mavenized fork from DmitryKey used in the steps below.

These community versions should work out of the box with standard Lucene indexes, but you need to do a little extra work to read an ElasticSearch index.

 

If you try to open an index created by ElasticSearch with a stock copy of Luke, you’ll see the following error:

A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'es090' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [Lucene40, Lucene41]

ElasticSearch uses custom postings formats (the postings format defines how the inverted index is represented in memory / on disk), and Luke doesn't know about them. To tell Luke about the ES postings formats, register the corresponding SPI classes by following the steps below.

 

1. Clone Dmitry’s Mavenized repo:

$ git clone https://github.com/DmitryKey/luke/

 

2. Add a dependency on your required version of ElasticSearch to the Luke project’s pom file:

<!-- ElasticSearch -->
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>1.1.1</version>
</dependency>

 

3. Compile the Luke jar file (creates target/luke-with-deps.jar):

$ mvn package

 

4. Unpack Luke’s list of known postings formats to a temporary file:

$ unzip target/luke-with-deps.jar META-INF/services/org.apache.lucene.codecs.PostingsFormat -d ./tmp/
Archive:  target/luke-with-deps.jar
  inflating: ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
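
At this point you can peek at the extracted file.  The exact entries depend on the Lucene version bundled with Luke, but they should correspond to the names listed in the error message above (e.g. Lucene40 and Lucene41):

$ cat ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat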

 

5. Add the ElasticSearch postings formats to the temp file:

$ echo "org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat" 
    >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
$ echo "org.elasticsearch.index.codec.postingsformat.Elasticsearch090PostingsFormat" 
    >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
$ echo "org.elasticsearch.search.suggest.completion.Completion090PostingsFormat" 
    >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat

 

6. Repack the modified file back into the jar:

$ jar -uf target/luke-with-deps.jar -C tmp/ META-INF/services/org.apache.lucene.codecs.PostingsFormat
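
It's worth confirming that the jar now advertises the ElasticSearch formats (the grep simply looks for the lines appended above):

$ unzip -p target/luke-with-deps.jar META-INF/services/org.apache.lucene.codecs.PostingsFormat | grep elasticsearch
org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat
org.elasticsearch.index.codec.postingsformat.Elasticsearch090PostingsFormat
org.elasticsearch.search.suggest.completion.Completion090PostingsFormat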

 

7. Run Luke:

$ ./luke.sh

 

You can now open indexes created by ElasticSearch; search for, view, and edit documents; and perform all the other operations Luke allows.
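
To find an index to open, point Luke at a shard's Lucene directory inside the ElasticSearch data path.  The exact location depends on your configuration, but on a default 1.x layout it is typically something like this (angle brackets mark placeholders):

# Lucene directory for one shard of one index:
# <path.data>/<cluster.name>/nodes/0/indices/<index_name>/<shard_number>/index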

A good Luke tutorial can be found on LingPipe’s blog:
http://lingpipe-blog.com/2012/07/24/using-luke-the-lucene-index-browser-to-develop-search-queries/

Happy debugging!


Customizing ElasticSearch’s english Analyzer

ElasticSearch has some pretty nice linguistic capability built-in — it comes with a whole host of language-specific text analyzers (currently 30 of them).  These analyzers control how the search engine parses text into usable parts (tokens).  However, the docs are pretty sparse.  Reading about language-specific analyzers, this is about all you get:

 

A set of analyzers aimed at analyzing specific language text. The following types are supported:  […] 

 

If you want more detail, you need to dig into the codebase (and into Lucene, as it turns out).  Why might you want more detail?  One reason might be to customize some aspect of the analyzer’s behavior, without changing anything else.  That was my motivation, so I’ll share what I found and focus on the english analyzer here.

 

Analyzers are made up of two main components: a Tokenizer and a set of TokenFilters.  The tokenizer splits text into tokens according to some set of rules, and the token filters each perform operations on those tokens.  The result is a stream of processed tokens, which are either stored in the index or used to query it.  See the ElasticSearch analysis documentation for more details.

 

The default analyzer is the Standard analyzer, which contains these components:

  • Tokenizer: Standard tokenizer
  • TokenFilters:
    • Standard token filter, which does nothing!
    • Lowercase token filter, which downcases all tokens
    • Stop token filter, which removes tokens identified as stop words
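
You can reproduce this pipeline with the _analyze endpoint by naming the tokenizer and token filters explicitly (the sample text is arbitrary; "The" should be dropped by the stop filter and the remaining tokens lowercased to quick, brown, foxes):

$ curl 'localhost:9200/_analyze?tokenizer=standard&filters=standard,lowercase,stop&text=The+Quick+Brown+Foxes&pretty'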

 

Digging into the ElasticSearch codebase, you'll find that the english analyzer is really just a thin wrapper around Lucene's EnglishAnalyzer.  Dig into that class and you'll finally find the components it uses:

  • Standard tokenizer
  • Standard token filter
  • English possessive token filter (strips trailing 's)
  • Lowercase token filter
  • Stop token filter (English stop words)
  • Keyword marker token filter
  • Porter stem token filter

 

There are a couple of caveats here:

  1. The English possessive filter doesn’t appear to be exposed in current versions of ElasticSearch (1.1.x).
  2. The keyword marker filter will only be activated if you’ve passed rules for it (either via the stemmer override token filter or the keyword marker filter itself).
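
As a sketch of how the second caveat could be addressed: a keyword marker filter can be defined as a custom filter and placed in the analyzer's filter list just before porter_stem, so that the listed terms are never stemmed.  The filter name and keyword list here are purely illustrative:

{
  "analysis": {
    "filter": {
      "no_stem": {
        "type": "keyword_marker",
        "keywords": ["elasticsearch", "lucene"]
      }
    }
  }
}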

 

Keeping the above details in mind, we can now construct a custom analyzer which is based on the built-in english analyzer.  Add the following settings to your ES node:

{
  "analysis": {
    "analyzer": {
      "english_custom_stems": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "standard",
          "lowercase",
          "stop",
          "filter_stemmer_override",
          "porter_stem"
        ]
      }
    },
    "filter": {
      "filter_stemmer_override": {
        "type": "stemmer_override",
        "rules_path": "analysis/stems.txt"
      }
    }
  }
}

This analyzer approximates the built-in english analyzer, but adds the ability to override the stemmer with custom rules.

Once the analyzer is in place, the mapping needs to be updated: any field that should get the new behavior needs its analyzer set to the english_custom_stems analyzer created above.
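
As a sketch of what that might look like (the index name idx, type doc, and field title are placeholders), the settings and mapping can be supplied together when the index is created:

$ curl -XPUT 'localhost:9200/idx' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_custom_stems": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "stop", "filter_stemmer_override", "porter_stem"]
        }
      },
      "filter": {
        "filter_stemmer_override": {
          "type": "stemmer_override",
          "rules_path": "analysis/stems.txt"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "title": { "type": "string", "analyzer": "english_custom_stems" }
      }
    }
  }
}'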

 

Before that will work, the override file needs to be created and placed on each ES node in the cluster.  The format of this file is:

word => stemmed form

 

A good example to illustrate this problem is universe / university.  The Porter stemmer stems all the following words to univers:

  • universe, universes
  • university, universities

Since stemming is algorithmic (rather than dictionary-based), it doesn't know that these two pairs of words have quite different meanings.  A user searching for university would likely not expect results containing universe.  So, let's make them stem differently by adding stemmer override rules for each of these words.

 

Here’s the file containing just these entries:

universe => universe
universes => universe
university => university
universities => university

On the left is the original term; on the right is the new value that the stemmer should return.  This file needs to be placed on each ES node at analysis/stems.txt, the rules_path defined in the settings above (relative paths like this are typically resolved against the node's config directory).

 

When complete, it’s worth testing.  The _analyze endpoint lets you pass arbitrary text through the analyzer of your choice and see the processed output.  Here’s a negative test: let’s see the english analyzer reduce universities to the “wrong” form:

$ curl 'localhost:9200/idx/_analyze?analyzer=english&text=university+universities&pretty'
{
  "tokens" : [ {
    "token" : "univers",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "univers",
    "start_offset" : 11,
    "end_offset" : 23,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

As expected, both forms were reduced to univers — not what we want.  If you try the other forms, you’ll see that they all reduce identically.

 

Now for the positive test:

$ curl 'localhost:9200/idx/_analyze?analyzer=english_custom_stems&text=university+universities&pretty'
{
  "tokens" : [ {
    "token" : "university",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "university",
    "start_offset" : 11,
    "end_offset" : 23,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

It works!  The custom analyzer stemmed both forms to our requested value, university.  Try universe and you’ll see similar behavior.

 

One final step is required: dropping your index, recreating it, and reindexing all your data.  Since this analyzer is used both at query time (on search terms) and at index time (on documents as they are fed), the documents need to be re-fed in order for the new values to be populated in the index.
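
In practice that cycle looks roughly like the following (idx, the saved settings file, and the sample document are placeholders; re-feeding would normally come from your source system):

$ curl -XDELETE 'localhost:9200/idx'
$ curl -XPUT 'localhost:9200/idx' -d @settings-and-mappings.json
$ curl -XPOST 'localhost:9200/idx/doc/1' -d '{"title": "University rankings"}'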

 

ElasticSearch exposes a lot of power in its analyzers, but it can be tricky to understand how to configure them. Dig into the sources when you really need to know how they work!