In order for Aspell to find suggestions for a misspelled word Aspell 1)
creates a list of candidate words, 2) scores them, and 3) returns the most
likely candidates. One of the ways Aspell finds candidate words is to look
for all words with a soundslike which is of a small edit distance from the
soundslike of the original word. The edit distance is the total number of
deletions, insertions, exchanges, or adjacent swaps needed to make one
string equivalent to the other. The exact distance chosen is either 1 or 2
depending on a number of factors. In this part I will focus on how Aspell
find all such soundslike efficiently and how the jump tables play a key
role.
In order two find all possible soundslike within a fixed edit distance one
of two things can be tried. 1) Use trial an error by trying all possible
edits and then seeing if they are in the dictionary, and 2) Scan the
dictionary for possible soundslikes. Before Aspell 0.60 Aspell used the
first method when the edit distance was one and the second otherwise.
Aspell now uses the second method for both methods as I was able to make the
scan very efficient, therefore saving the space of having to store a
separate hash table for the soundslike.
The naive way to scan the list for all possible soundslike is to compute the
edit-distance of every soundslike in the dictionary and keep the ones within
the threshold. This is exactly what Aspell did prior to 0.60. When a fast
enough edit distance function is used this method turns out not to be
unbearably slow, at least for English. For other languages, with large word
lists and no soundslike, it can be slow due to the number of items scanned.
Aspell uses a special edit distance function which gives up if the distance
is larger than the threshold, thus making it very fast. The basic algorithm
is as follows:
limit_edit_distance(A,B,limit) = ed(A,B,0)
where ed(A,B,d) = d if A & B is empty.
= infinity if d > limit
= ed(A[2..],B[2..], d) if A[1] == B[1]
= min ( ed(A[2..],B[2..], d+1),
ed(A, B[2..], d+1),
ed(A[2..],B, d+1) ) otherwise
However the algorithm used also allows for swaps and is not recursive.
Specialized version are provided for an edit distance of one and two. The
running time is asymptotically bounded above by (3^l)*n where l is the limit
and n is the maximum of strlen(A),strlen(B). Based on my informal tests,
however, the n does not really matter and the running time is more like
(3^l). For complete details on this algorithm see the files leditdist.hpp
and leditdist.cpp in the source distribution under modules/speller/default.
By exploiting the properties of limit_edit_distance is possible to avoid
having to look many of the soundslike in the dictionary. Limit_edit_distance
is effecent because in many cases it doesn't have to look at the entire word
before determine that it isn't within the given threshold. By having it
return the last position looked at, "p", it is possible to avoid having to
look ta similar soundslike which are not within in threshold. That is if
two soundslike are the same up to the position "p" than nether of them are
within the given threshold.
Aspell 0.60 exploits this property by using jump tables. Each entry in the
jump table contains two fields: the first N letters of a soundslike, and an
offset. The entries are sorted in lexicographic order based on the raw byte
value. Aspell maintains two jump tables. The first one contains the first
2 letter of a soundslike and the offset points into the second jump table.
The second one contains the first 3 letters of a soundslike where the offset
points to the location of the soundslike in the data block. The soundslike
in the datablock are sorted so that a linear scan can be used to find all
soundslike with the same prefix. If limit_edit_distance stops before
reaching the end of a "soundslike" in the one of the jump tables than it is
possible to skip all the soundslike in the data block with the same prefix.
Thus, the scan for all soundslike within a given edit distance goes
something like this:
1) Compare the entry in the first jump table using limit_edit_distance.
If limit_edit_distance scanned passed the end of the word than go the
first entry in the second jump table with the same prefix.
Otherwise go to the next entry in the first jump table and repeat.
2) Compare the entry in the second jump table. If limit_edit_distance
passed the end of the word than go the first soundslike in the data
block with this prefix. Otherwise if the first two letters of the
next entry are the same as the current one go it and repeat. If the
first two letters are not the same than go to the next entry in the
first jump table and repeat step 1.
3) Compare the soundslike in the data block. If the edit distance
is within the target distance add to the candidate list, otherwise
don't. Let N be the position where limit_edit_distance stopped
(starting at 0). If N is less than 6 skip over any soundslike that
have the same first N + 1 letters. If, after skipping over
any similar soundslike, the next soundslike does not have the same
first three letters go to the next entry in the second jump table
and repeat step 2. Otherwise repeat this step with the next
soundslike.
The part of skipping over soundslike with the first N + 1 letters in step 3
was added in Aspell 0.60.3. The function responsible for most of this is
ReadOnlyDict::SoundslikeElements::next found in readonly_ws.cpp
In the next part I will describe how Aspell deals with soundslike lookup
when affix compression is involved.