Inverted Index Language Shootout

The idea behind this project started with this entry in my blog. Taking inspiration from Doug Bagley's Computer Language Shootout project, I decided to write a sample program in haskell, erlang, ruby, ocaml, and scheme, in order to evaluate each of these languages for their suitability in rewriting the main program I work on. The sample program would be one that is relatively central to what that program does: an inverted index.

Anybody who's ever worked on a search engine will know that the inverted index is the central data structure. A google search for inverted index turns up 227,000 documents (the most interesting of which is the paid advertisement from google itself, looking for search engine developers).

The idea behind an inverted index is simple: you start with a bunch of documents containing words, and you want to "invert" that, to create a bunch of words each of which lists all the documents that contain that word. Or, in code-speak, you start with a MapDocToWords object, and want to create a MapWordToDocs object (except that I won't be limiting myself to objects).

In this shootout, the first criterion will be how easy it is to use the same code for inverted indices of different types (e.g. indices of words, indices of numeric ranges, etc).

Another criterion will be how easy it is to write the inverted index to disk, and read it back in. This is important because building the inverted index usually takes a while, so it's good to do it offline and store the results for faster loading later. Also, it's not uncommon to use different data structures for building the inverted index and for querying it. This requirement has the advantage of making the test a bit more realistic.

And the last criterion will be performing intersection queries. For example, your typical multi-word query will involve looking up the document list for each word, and then intersecting the resulting lists to produce the documents that contain all the words in the query.

The results will be judged on code size, memory usage, speed, and (highly subjective) cool-ness.


Last updated October 1st, 2003
Back to Kimberley's Code.
Back to Kimberley's Home Page.