Re: [aspell-devel] Mis-spellings database?

From: Mike C. Fletcher
Subject: Re: [aspell-devel] Mis-spellings database?
Date: Wed, 06 Nov 2002 16:33:41 -0500
Hmm, okay, I have a long way to go. _85_ of those don't show up at all in my suggestion lists even with distance=3 scans (19 with distance=5). I don't have rankers built yet, so I can't see how I'm doing wrt where in the suggestion lists the targets show up.
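
For anyone wanting to reproduce that kind of count, here is a minimal sketch of the measurement, not the actual checker code. It assumes a plain Levenshtein distance and a brute-force scan of the dictionary; the names `edit_distance`, `suggest` and `count_misses` are made up for illustration.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def suggest(word, dictionary, max_distance):
    """Brute-force scan: every dictionary word within max_distance edits."""
    return [w for w in dictionary if edit_distance(word, w) <= max_distance]


def count_misses(pairs, dictionary, max_distance):
    """Count (misspelling, target) pairs whose target is never suggested."""
    return sum(1 for wrong, right in pairs
               if right not in suggest(wrong, dictionary, max_distance))


# Hypothetical three-entry test set, just to show the shape of the data.
dictionary = ["psychology", "separate", "receive"]
pairs = [("sicology", "psychology"),
         ("seperate", "separate"),
         ("recieve", "receive")]
for distance in (1, 2, 3):
    print(distance, count_misses(pairs, dictionary, distance))
```

A ranker would then look at the position of each target inside the list that `suggest` returns, rather than only whether it is present.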

A considerable number of the problems seem to be with what may be a faulty phonetic compressor (psy doesn't seem to go to sy, sy doesn't go to si, stand-alone y, i and e are not considered equal, etc.). A few of them are with my optimised search for distance > 1, which takes the tack that you can mis-spell/mis-pronounce either the start or the end of a word, but not both, and still get a decent response.
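
To make those two ideas concrete, here is a rough sketch under my own assumptions, not the code in the checker. `phonetic_key` applies only the three rules named above, and it folds y, i and e together everywhere rather than only when they stand alone. `shares_start_or_end` is one possible reading of the "start or end, but not both" shortcut; the `keep=3` window is an arbitrary illustrative value.

```python
def phonetic_key(word):
    """Collapse a word to a rough sound key using three illustrative rules."""
    key = word.lower()
    key = key.replace("psy", "sy")   # silent p: psy -> sy
    key = key.replace("sy", "si")    # sy -> si
    # Treat y, i and e as one vowel. Simplification: applied to every
    # occurrence, not only to stand-alone vowels.
    return key.translate(str.maketrans("ye", "ii"))


def shares_start_or_end(word, candidate, keep=3):
    """True if the words agree on their first or their last `keep` letters.

    Candidates that fail this test are skipped before any distance > 1
    comparison, on the assumption that only one end of the word is wrong.
    """
    return (word[:keep] == candidate[:keep]
            or word[-keep:] == candidate[-keep:])


print(phonetic_key("psy"), phonetic_key("sy"), phonetic_key("si"))
print(shares_start_or_end("sicology", "psychology"))  # same ending "ogy"
```

The filter is cheap, but it drops any target whose misspelling is damaged at both ends, which may account for some of the misses.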

Anyway, for the task at hand (building a common-error-tracker for suggestions and rankings), the 1100 or so I have with yours + the mentioned corpus should give enough for everything but stress-testing and distributable "starter" dictionaries.

Will let you know if I find a large database some day (though, as a note, if developers actually use this feature, it should be possible to simply ask users to export their common-errors tables and send them to us to get very large corpi (corpuses?) :) ).
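
As a sketch of what such a common-errors table and its export might look like (purely illustrative; the class name, methods and tab-separated format are assumptions, not an existing feature):

```python
from collections import Counter, defaultdict


class CommonErrors:
    """Hypothetical per-user table: misspelling -> counts of chosen fixes."""

    def __init__(self):
        self.table = defaultdict(Counter)

    def record(self, wrong, right):
        """Remember that the user replaced `wrong` with `right`."""
        self.table[wrong][right] += 1

    def rank(self, wrong, suggestions):
        """Move previously chosen corrections to the front of the list.

        sorted() is stable, so suggestions with equal counts keep the
        order the checker gave them.
        """
        seen = self.table.get(wrong, Counter())
        return sorted(suggestions, key=lambda s: -seen[s])

    def export(self):
        """Dump the table as 'wrong<TAB>right<TAB>count' lines."""
        lines = []
        for wrong in sorted(self.table):
            for right, count in sorted(self.table[wrong].items()):
                lines.append(f"{wrong}\t{right}\t{count}")
        return "\n".join(lines)


errors = CommonErrors()
errors.record("teh", "the")
errors.record("teh", "the")
errors.record("teh", "ten")
print(errors.rank("teh", ["tea", "ten", "the"]))
print(errors.export())
```

Exports in a flat format like this could be concatenated across users to build the larger corpus.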


Kevin Atkinson wrote:

On Wed, 6 Nov 2002, Mike C. Fletcher wrote:

I'm wondering if anyone knows of a decent mis-spellings database anywhere? That is, a mapping from mis-spelling to correct spelling (or vice-versa)? I'm currently using a 550-item set adapted from:
and it's fine for testing, but I'm looking for something that might have a few tens of thousands of entries. Basically, I want to build a "common error tracking" system into my spell-checker, and would like a corpus of (real-world (English)) data so that I can judge the effectiveness of the new feature when built.

The best I can offer you is a small set I used for testing Aspell. It contains a lot of my own misspellings plus a few others from other sources. It has a lot of the more difficult misspellings that many spell checkers are unable to get. You can find it at

If you do find such a list I will certainly be interested in it also.


