Using Luke with ElasticSearch

If you’ve used Lucene or Solr before, you might be familiar with Luke, a Lucene tool for viewing and modifying indexes.  Luke was originally written by Andrzej Bialecki as a side project, and is an indispensable debugging tool for digging into the guts of an index (for example, to see exactly which tokens were stored on a particular field).

[Screenshot: Luke’s Overview tab]

Unfortunately, it hasn’t been particularly well maintained.  The official project at Google Code hasn’t been updated since Lucene 4.0.0-ALPHA (ca. 2012).  The good news is that there is some great community support for Luke at the moment, notably Dmitry Kan’s Mavenized fork on GitHub (https://github.com/DmitryKey/luke).

It should work out of the box with standard Lucene indexes, but you need to do a little extra work to read an ElasticSearch index.

 

If you try to open an index created by ElasticSearch with a stock copy of Luke, you’ll see the following error:

A SPI class of type org.apache.lucene.codecs.PostingsFormat with name 'es090' does not exist. You need to add the corresponding JAR file supporting this SPI to your classpath. The current classpath supports the following names: [Lucene40, Lucene41]

ElasticSearch uses a custom postings format (the postings format defines how the inverted index is represented in memory and on disk), and Luke doesn’t know about it.  To tell Luke about the ES postings formats, register the corresponding SPI classes by following the steps below.

 

1. Clone Dmitry’s Mavenized repo:

$ git clone https://github.com/DmitryKey/luke/

 

2. Add a dependency on your required version of ElasticSearch to the Luke project’s pom file (use the same version that created the index you want to inspect):

<!-- ElasticSearch -->
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch</artifactId>
    <version>1.1.1</version>
</dependency>

 

3. Compile the Luke jar file (creates target/luke-with-deps.jar):

$ mvn package

 

4. Unpack Luke’s list of known postings formats to a temporary file:

$ unzip target/luke-with-deps.jar META-INF/services/org.apache.lucene.codecs.PostingsFormat -d ./tmp/
Archive:  target/luke-with-deps.jar
  inflating: ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat

 

5. Add the ElasticSearch postings formats to the temp file:

$ echo "org.elasticsearch.index.codec.postingsformat.BloomFilterPostingsFormat" 
    >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
$ echo "org.elasticsearch.index.codec.postingsformat.Elasticsearch090PostingsFormat" 
    >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat
$ echo "org.elasticsearch.search.suggest.completion.Completion090PostingsFormat" 
    >> ./tmp/META-INF/services/org.apache.lucene.codecs.PostingsFormat

 

6. Repack the modified file back into the jar:

$ jar -uf target/luke-with-deps.jar -C tmp/ META-INF/services/org.apache.lucene.codecs.PostingsFormat

 

7. Run Luke:

$ ./luke.sh

 

You can now open indexes created by ElasticSearch, search them, view and edit documents, and perform all the other operations Luke allows.

A good Luke tutorial can be found on LingPipe’s blog:
http://lingpipe-blog.com/2012/07/24/using-luke-the-lucene-index-browser-to-develop-search-queries/

Happy debugging!

Customizing ElasticSearch’s english Analyzer

ElasticSearch has some pretty nice linguistic capability built-in — it comes with a whole host of language-specific text analyzers (currently 30 of them).  These analyzers control how the search engine parses text into usable parts (tokens).  However, the docs are pretty sparse.  Reading about language-specific analyzers, this is about all you get:

 

A set of analyzers aimed at analyzing specific language text. The following types are supported:  […] 

 

If you want more detail, you need to dig into the codebase (and into Lucene, as it turns out).  Why might you want more detail?  One reason might be to customize some aspect of the analyzer’s behavior, without changing anything else.  That was my motivation, so I’ll share what I found and focus on the english analyzer here.

 

Analyzers are made up of two main components: a Tokenizer and a set of TokenFilters.  The tokenizer splits text into tokens according to some set of rules, and the token filters each perform operations on those tokens.  The result is a stream of processed tokens, which are either stored in the index or used to query it.  See the ElasticSearch analysis documentation for more details.

 

The default analyzer is the Standard analyzer, which contains these components:

  • Tokenizer: Standard tokenizer
  • TokenFilters:
    • Standard token filter, which does nothing!
    • Lowercase token filter, which downcases all tokens
    • Stop token filter, which removes tokens identified as stop words
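
You can see this pipeline in action with the _analyze API, which runs arbitrary text through a named analyzer (this assumes a node listening on localhost:9200):

$ curl 'localhost:9200/_analyze?analyzer=standard&text=Quick+Brown+Foxes&pretty'

This should return the tokens quick, brown, and foxes: the standard tokenizer splits the text, and the lowercase filter downcases each token.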

 

Digging into the ElasticSearch codebase, it turns out that the english analyzer is really just Lucene’s EnglishAnalyzer (ES’s EnglishAnalyzerProvider does little more than instantiate it).  Dig into that class and you’ll finally find the components it uses:

  • Tokenizer: Standard tokenizer
  • TokenFilters:
    • Standard token filter
    • English possessive token filter, which strips trailing ’s
    • Lowercase token filter
    • Stop token filter, with the English stop word set
    • Keyword marker token filter, which protects listed words from stemming
    • Porter stem token filter

 

There are a couple of caveats here:

  1. The English possessive filter doesn’t appear to be exposed in current versions of ElasticSearch (1.1.x).
  2. The keyword marker filter will only be activated if you’ve passed rules for it (either via the stemmer override token filter or the keyword marker filter itself).

 

Keeping the above details in mind, we can now construct a custom analyzer which is based on the built-in english analyzer.  Add the following settings to your ES node:

{
  "analysis": {
    "analyzer": {
      "english_custom_stems": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": [
          "standard",
          "lowercase",
          "stop",
          "filter_stemmer_override",
          "porter_stem"
        ]
      }
    },
    "filter": {
      "filter_stemmer_override": {
        "type": "stemmer_override",
        "rules_path": "analysis/stems.txt"
      }
    }
  }
}

This analyzer approximates the built-in english analyzer, but adds the ability to override the stemmer with custom rules.

Once the analyzer is in place, the mapping needs to be updated: any field that should get the new behavior should have its analyzer set to the english_custom_stems analyzer created above.
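
As a sketch, the settings and mapping can be applied together when creating the index.  The index name idx, type article, and field title here are all hypothetical:

$ curl -XPUT 'localhost:9200/idx' -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_custom_stems": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "standard", "lowercase", "stop",
                      "filter_stemmer_override", "porter_stem" ]
        }
      },
      "filter": {
        "filter_stemmer_override": {
          "type": "stemmer_override",
          "rules_path": "analysis/stems.txt"
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "title": { "type": "string", "analyzer": "english_custom_stems" }
      }
    }
  }
}'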

 

Before that will work, the override file needs to be created and placed on each ES node in the cluster.  The format of this file is:

word => stemmed form

 

A good example to illustrate this problem is universe / university.  The Porter stemmer stems all the following words to univers:

  • universe, universes
  • university, universities

Since stemming is algorithmic (rather than dictionary-based), it doesn’t know that these two pairs of words have quite different meanings.  A user searching for university would not likely expect results containing universe.  So, let’s fix it so they stem differently.  The fix is to add an override rule for each of these words, bypassing the stemmer.

 

Here’s the file containing just these entries:

universe => universe
universes => universe
university => university
universities => university

On the left is the original term; on the right is the value the stemmer should return.  This file should be placed on each ES node, in the analysis/ directory under the node’s config directory (as referenced by rules_path in the settings above).

 

When complete, it’s worth testing.  The _analyze endpoint lets you pass arbitrary text through the analyzer of your choice and see the processed output.  Here’s a negative test: let’s see the english analyzer reduce universities to the “wrong” form:

$ curl 'localhost:9200/idx/_analyze?analyzer=english&text=university+universities&pretty'
{
  "tokens" : [ {
    "token" : "univers",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "univers",
    "start_offset" : 11,
    "end_offset" : 23,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

As expected, both forms were reduced to univers — not what we want.  If you try the other forms, you’ll see that they all reduce identically.

 

Now for the positive test:

$ curl 'localhost:9200/idx/_analyze?analyzer=english_custom_stems&text=university+universities&pretty'
{
  "tokens" : [ {
    "token" : "university",
    "start_offset" : 0,
    "end_offset" : 10,
    "type" : "<ALPHANUM>",
    "position" : 1
  }, {
    "token" : "university",
    "start_offset" : 11,
    "end_offset" : 23,
    "type" : "<ALPHANUM>",
    "position" : 2
  } ]
}

It works!  The custom analyzer stemmed both forms to our requested value, university.  Try universe and you’ll see similar behavior.

 

One final step is required: dropping your index, recreating it, and reindexing all your data.  Since this analyzer is used both at query time (on search terms) and at index time (on documents as they are fed), the documents need to be re-fed in order to populate the index with the new values.
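
A minimal sketch of that cycle, assuming the index is named idx and the settings and mapping above are saved in index-settings.json (both names hypothetical):

$ curl -XDELETE 'localhost:9200/idx'
$ curl -XPUT 'localhost:9200/idx' -d @index-settings.json
$ curl -XPUT 'localhost:9200/idx/article/1' -d '{"title": "University of the Universe"}'

The last step stands in for re-feeding every document from your source of record.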

 

ElasticSearch exposes a lot of power in its analyzers, but it can be tricky to understand how to configure them.  Dig into the sources when you really need to know how things work!

Recovering from an external drive failure on Mac

A few weeks ago I plugged the external USB hard drive that holds my photo library into my Mac, and was greeted with the following message:

"The disk you inserted was not readable by this computer."

“The disk you inserted was not readable by this computer.”

 

Bummer. As it happened, I didn’t have an up-to-date backup of the drive, having made some recent updates. I really needed to recover the contents of the drive. I tried the drive in another Mac, and got the same message. I tried it in a Windows machine, and got a message about an error with the drive.

I plugged it back into the Mac and had a look at Disk Utility. It showed the drive in the left pane, but with no associated volumes. On the First Aid tab, the Verify Disk and Repair Disk items were grayed out. In the disk info section at the bottom, the following telltale appeared: “Partition Map Scheme: Unformatted”. It was clear that the partition table had been corrupted somehow.

I gave up on the GUI tools and switched to diskutil, OS X’s command-line tool for managing disks, volumes, and partitions.

 

The list command shows the available drives:

[21:07:16] mini-me ~ $ diskutil  list
/dev/disk0
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *120.0 GB   disk0
   1:                        EFI                         209.7 MB   disk0s1
   2:                  Apple_HFS MacintoshHD             119.2 GB   disk0s2
   3:                 Apple_Boot Recovery HD             650.0 MB   disk0s3
/dev/disk1
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *1.5 TB     disk1
   1:                        EFI                         209.7 MB   disk1s1
   2:                  Apple_HFS BigMedia                1.5 TB     disk1s2
/dev/disk2
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:                                                   *1.0 TB     disk2

 

The last block (disk2) represents the drive in question. Note that there are no partition types or volume names, nor are there any partitions. diskutil has several repair verbs. One of them, repairDisk, exists specifically to fix a busted partition map:

Repair the components of a partition map of a disk

 

I am certain that this drive used to be partitioned using the GUID partition scheme (the default under OS X), so I answered yes to the confirmation question:

[21:08:23] mini-me ~ $ diskutil repairDisk /dev/disk2
Nonexistent, unknown, or damaged partition map scheme
If you are sure this disk contains a (damaged) APM, MBR, or GPT partition map,
you can hereby try to repair it enough to be recognized as a map; another
"diskutil repairDisk /dev/disk2" might then be necessary for further repairs
Proceed? (y/N) y
Partition map repair complete; you might now want to repeat the
verifyDisk or repairDisk verbs to perform further checks and repairs

 

That seems to have fixed it; diskutil now recognized the two partitions from the external drive:

[21:08:52] mini-me ~ $ diskutil list
/dev/disk0
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *120.0 GB   disk0
   1:                        EFI                         209.7 MB   disk0s1
   2:                  Apple_HFS MacintoshHD             119.2 GB   disk0s2
   3:                 Apple_Boot Recovery HD             650.0 MB   disk0s3
/dev/disk1
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *1.5 TB     disk1
   1:                        EFI                         209.7 MB   disk1s1
   2:                  Apple_HFS BigMedia                1.5 TB     disk1s2
/dev/disk2
   #:                       TYPE NAME                    SIZE       IDENTIFIER
   0:      GUID_partition_scheme                        *1.0 TB     disk2
   1:                        EFI                         209.7 MB   disk2s1
   2:                  Apple_HFS TimeMachine             150.0 GB   disk2s2
   3:                  Apple_HFS Mercury                 849.7 GB   disk2s3

 

To confirm that the partitions really were restored, mount the disk:

[21:09:47] mini-me ~ $ diskutil mountDisk /dev/disk2
Volume(s) mounted successfully

[21:09:53] mini-me ~ $ mount
/dev/disk0s2 on / (hfs, local, journaled)
devfs on /dev (devfs, local, nobrowse)
map -hosts on /net (autofs, nosuid, automounted, nobrowse)
map auto_home on /home (autofs, automounted, nobrowse)
/dev/disk2s2 on /Volumes/TimeMachine (hfs, local, nodev, nosuid, journaled, noowners)
/dev/disk2s3 on /Volumes/Mercury (hfs, local, nodev, nosuid, journaled, noowners)

 

I manually checked the contents of the two partitions, and they appeared to be correct. All the files were readable and I didn’t notice anything missing. Disaster averted!

If diskutil can’t recover your files, there are some alternatives:

  • TestDisk: data recovery software to rebuild or recover from partition/filesystem problems
  • PhotoRec: block-by-block recovery software for known file types
  • GParted: a graphical partition editor

In addition, there are several commercial disk-recovery applications available.

 

A note on backups: as I mentioned at the beginning of this post, my backup of the data on this disk was out of date.  As I tried to recover the data, I realized just how important it was.  To avoid a recurrence, I implemented a backup strategy for irreplaceable data:

  1. Master data stored on my machine’s internal drive
  2. Frequent backups, using rsync, to a local external backup drive (a sample command follows this list)
  3. Less frequent backups to an EBS volume in Amazon EC2-land (again using rsync)
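
As a sketch, the rsync steps might look like the following (the paths and the backup host are hypothetical):

$ rsync -avh --delete ~/Photos/ /Volumes/Backup/Photos/
$ rsync -avhz --delete -e ssh ~/Photos/ backup-host:/mnt/ebs-backup/Photos/

The --delete flag makes the destination an exact mirror by removing files that no longer exist in the source; leave it off if you prefer an append-only backup.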

A future move will be to switch to RAID1 or RAID5 storage locally, to tolerate a single drive failure.

Essential tools for web developers

I’m a firm believer in the principle of using the right tool for the job.  It applies just as well to building software as to building a house or fixing a car.  There are literally thousands of tools available to do the job, and it can be hard to find the right one.  This post focuses on web development tools, specifically Firefox extensions.

  1. Firefox: while it’s not an extension, the browser itself deserves mention.  Primary web development should happen in a standards-compliant environment, and Firefox is a great platform for that.  There are certainly other browsers that fit the bill (Safari and Opera are candidates), but neither has the wealth of extensions that Firefox offers, nor the developer community behind them.

  2. Firebug: the tagline at the Firebug site is “web development evolved”, and that is a true statement.  Firebug may well be the largest single innovation in web development, ever.  That’s a bold claim, but Firebug provides so much functionality that you hardly need other tools.  Some of its best features include:
    • Visual DOM exploration: mouse over nodes in the DOM tree, see them highlighted in the browser window.  Conversely, inspect an element in the browser window and access it in the DOM tree
    • Realtime CSS editing: changes to styles take effect immediately
    • Javascript console: execute arbitrary javascript in the context of the page
    • Network monitor: see full details on every request and response the browser makes, as they happen
    • Javascript debugging: a full-featured debugger and profiler for javascript code
    • Realtime reporting of Javascript and CSS errors: you’ll know when something’s not right

    If you don’t have Firebug yet, go get it.  Now.

  3. HTML Validator: invalid markup is a web developer’s nightmare — if your markup isn’t right, how can anything else be expected to function or look right?  Yet, unless you’re extremely well disciplined (and even then, sometimes), you’ll make mistakes.  This extension adds on-the-fly validation to Firefox, letting you know right away when there’s an error, indicating where it is, and even offering suggestions on how to fix it.  The validator has multiple validation engines; for best effect choose the serial algorithm (W3C’s SGML parser first, then HTML Tidy).  There’s no longer an excuse for invalid markup.

  4. Web Developer Toolbar: Chris Pederick’s toolbar was one of the first Firefox extensions targeted at web developers, and it still offers plenty of functionality.  Besides offering shortcuts to oft-used Firefox features, it works with cookies, forms, window resizing, HTML validation and more.  It also provides lots of information about the page and its elements.

  5. Tamper Data: when you need detailed information about HTTP requests and responses, the Net tab in Firebug is one place to look.  Another is Tamper Data, which provides an easily-filtered interface for inspecting HTTP traffic.  However, Tamper Data also allows the user to “tamper” with the request before it’s dispatched to the server, an incredibly useful trick when debugging misbehaving web applications.

Using the right tools will not only make your job easier, it’ll make you a better developer.  These tools can alert you to mistakes in your code, and you’ll learn to avoid repeating them in the future.

Everyone has their own favorite tools in their toolbox.  These are my favorites, and the ones I believe no professional web developer should be without.

Introducing Omnibug

Omnibug is a tool for web developers.  I wrote it because debugging an Omniture implementation is… painful.

The idea is that web metrics (or web analytics, if you’re so inclined) systems generally make an HTTP request (usually for an image) in order to pass along tracking information.  The URLs contain lots of parameters, conveniently URL-encoded so you can’t read them easily.

Omnibug is an extension to Firebug (without a doubt the best Firefox add-on available, driving a revolution in web UI development).  It adds a new panel with the decoded output of each such HTTP request, making it a breeze to see exactly what values were sent.

Though it was designed with Omniture in mind, it will work with other systems (also tested with Moniforce).  The patterns it looks for are fully configurable, so in theory it should work with any similar system.

An additional feature is the ability to log matching requests to the local filesystem.  While this feature was intended to support automated testing of metrics implementations, it may have other uses.

See the Omnibug page for downloads and full documentation.

Breaking the sudoers file on OS X

Not having anything better to do, I horked my sudoers file while modifying an entry, thereby preventing myself from sudoing at all, and then had to figure out how to fix it. Weak.

Turns out there’s a pretty easy fix for this using the Finder, so long as your account has administrator privileges:

Steps:

  1. In the Finder, go to /etc (or do <cmd>-<shift>-g and enter /etc, if you don’t see it in the Finder)
  2. Get Info on sudoers
  3. In the Info window’s Sharing & Permissions section, add an entry for your local user with read and write access
  4. With a text editor, edit /etc/sudoers and fix the problem.
  5. Fix the permissions on /etc/sudoers back to 0440 (see the commands after this list).
  6. Enjoy your newly-regained sudo power!
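
Assuming sudo works again once the file is valid, steps 5 and 6 can also be sketched from Terminal (chmod -N removes the ACL entry that Get Info added; visudo -c checks that the file parses cleanly):

$ sudo chmod -N /etc/sudoers
$ sudo chmod 0440 /etc/sudoers
$ sudo visudo -c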

Of course, this can be used to overcome other permissions problems too.

If only visudo had a validator to go along with its syntax checker…