Metaphysical Developer

Peter Norvig’s Spelling Corrector in 21 Lines of Coffeescript

Posted in Languages by Daniel Ribeiro on March 31, 2011

Coffeescript is a very nice (and relatively new) language that compiles down to javascript, making web programming (and making firefox plugins, nodejs apps, and so forth) much more joyful. Its object model is the same as javascript (one of coffeescript’s motto is Unfancy JavaScript), and its compiled js form is quite easy to read and debug. It has many niceties, including classes (effectively making the prototype chain a first class citizen of the language) and array/object comprehensions (heavily influenced by python’s list comprehensions).

Ruby also has a influence on the language, such as optional parenthesis on method/function invocation. In fact, the original version of Coffeescript compiler was written in Ruby (but nowadays coffeescript is a self-hosting language).

Coffeescript has been used by several projects, including a mobile framework written by Rail’s creator 37 signals. I’ve been using for about one year (including some open source work using a HTML 5 Canvas framework called EaselJs, a port of ruby functionalities and even a Firefox plugin).

Because of all the Ruby and Python influence on the language, and the fact that Coffeescript can convey beautiful and concise code, I had a hunch that it could get a really good position on Peter Norvig’s Spelling Corrector implementation collection (Javascript’s version currently has 53 lines, which is a lot more than python‘s 21). With some work, I managed to implement it in 21 lines as well:

words = (text) -> (t for t in text.toLowerCase().split(/[^a-z]+/) when t.length > 0)
Array::or = (arrayFunc) -> if @length > 0 then @ else arrayFunc()
Array::flat = -> if @length == 0 then @ else @[0].concat(@[1..].flat())
train = (features) ->
 model = {}
 (model[f] = if model[f] then model[f] +1 else 2) for f in features
 return model
NWORDS = train(words(require('fs').readFileSync('./lib/big.txt', 'utf8')))
alphabet = 'abcdefghijklmnopqrstuvwxyz'.split ""
edits1 = (word) ->
 s = ([word.substring(0, i), word.substring(i)] for i in [0..word.length])
 deletes = (a.concat b[1..] for [a, b] in s when b.length > 0)
 transposes = (a + b[1] + b[0] + b.substring(2) for [a, b] in s when b.length > 1)
 replaces = (a + c + b.substring(1) for c in alphabet for [a, b] in s when b.length > 0)
 inserts = (a + c + b for c in alphabet for [a, b] in s)
 return deletes.concat transposes.concat replaces.flat().concat inserts.flat()
known_edits2 = (word) -> ((e2 for e2 in edits1(e1) when NWORDS[e2]? for e1 in edits1(word)).flat())
known = (words) -> (w for w in words when NWORDS[w])
correct = (word) ->
 candidates = known([word]).or -> known(edits1(word)).or -> known_edits2(word).or -> [word]
 ({k: w, v: NWORDS[w] or 1} for w in candidates).sort((a, b)-> b.v  - a.v)[0].k

All the code is hosted on github. The code above can be seen in a more readable version here (after line 21 it also contains a full test, using a fixture generated by a slightly modified version of Peter Norvig’s original implementation). There is a more testable version, along with Jasmine BDD tests (which look a lot like Rspec’s in Coffeescript), which run headless on NodeJs, but work just fine in the browsers.


Findall regex doesn’t exists natively in Javascript, however it is equivalent to spiting by the complementary regex (line 1)

Array::or (line 2) was needed to be implemented, because Python’s truthfulness allows a collection to be true (actually, any iterable) as long as it is not empty. Array::flat (line 3) has to be implemented because Coffeescript’s loop comprehension is a bit different from python’s: double loops (example: x + y for y in col1 for z in col2) return array of arrays instead of a single array.

Also note that loop comprehension’s order is inverted (x + y for x for y in python is translated as x + y for y for x).

Granted, this version runs really fast on NodeJs 0.4.1, and I was quite happy with how the resulting code looked. I was even happier that I did not have to write the compiled Javascript file and its whooping 147 lines of Spelling Corrector.

To see the reason why this work, check out Peter Norvig’s original post.