
[aspell-devel] Experiments with spell-checking...


From: Mike C. Fletcher
Subject: [aspell-devel] Experiments with spell-checking...
Date: Mon, 28 Oct 2002 16:19:16 -0500
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.1) Gecko/20020826

I've been working on and off over the last week on a contextualisable spell-check engine for Python. So far I have a basically functional system, but I'm running into a number of issues, and I'm wondering if others have suggestions about how to fix these problems (or better approaches altogether). I'll summarise the approach I've taken so the problems have some useful context:

Overview of the system

   I'm providing two major services with the spell-checker:

The first is the standard "is this word in the dictionary?" check, which is fast and very basic regardless of the underlying technology. For any given word you do one lookup in the database, which returns the word-set object in which the word was found (normally just the first, but optionally all of them) and any variants on the normalised form of the word. For example, if the word checked was "Form" and only "form" was in the database, you'd get back a list containing only "form", and your app could decide whether that constituted an error or not (e.g. by looking at whether you are at the start of a sentence, in a title, etc.).
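To make that interface concrete, here's a minimal sketch of the check, assuming a dict-like word database keyed by the normalised (lower-cased) form and mapping it to the list of stored variants; the names are purely illustrative, not the API in the current code:

    def check(db, word):
        """Return the stored variants for `word`, or None if it's unknown."""
        variants = db.get(word.lower())
        if variants is None:
            return None       # not in the dictionary at all
        if word in variants:
            return [word]     # exact form present; definitely not an error
        return variants       # e.g. "Form" -> ["form"]; the caller decides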

The second service is the "what words are similar to this?" suggestion, which is comparatively very slow (from 0.03s at edit-distance=1 to 2.5s at edit-distance=2, depending on the chosen "fuzziness"). The algorithm is currently based on a phonetic compression mechanism (with the rules read from Aspell's files), with edit-distance calculations determining which records are considered 'hits'. At the moment this uses an approach based on the documentation for the aspell engine: a heavily-trimmed query for single-edit-distance fuzziness (~2,000 queries on a 500,000-word dictionary) and a (slightly-trimmed) straight scan through the database for more lax searches (which scans an average of ~200,000 records in a 500,000-word dictionary).
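As a rough illustration of the suggestion side (not the project's actual code), the core is just a Levenshtein edit-distance comparison over phonetic keys. The sketch below is the un-trimmed, lax-search case, assuming phonetic_db maps each phonetic key to the list of words sharing it and compress() applies the Aspell phonetic rules:

    def edit_distance(a, b):
        """Classic dynamic-programming Levenshtein distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def suggest(phonetic_db, compress, word, max_distance=2):
        """Scan every phonetic key, keeping words within max_distance edits."""
        target = compress(word)
        hits = []
        for key, words in phonetic_db.items():
            if edit_distance(target, key) <= max_distance:
                hits.extend(words)
        return hits

The real code trims that scan for edit-distance=1; the full scan is what makes the larger distances so expensive (see below).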

   Eventually I intend to provide a few other services:

Rank words according to their likely relevance to a context (assumes the words are in the dictionary). This will be built on a number of mechanisms, including whole-word-list rankings by dictionary (e.g. common English words having a higher ranking than Shakespearean English words in most dictionaries, but the reverse in a Shakespearean dictionary), per-user statistical recording (i.e. you always use the word "ripping", so it's more likely to be what you mean), and potentially algorithmic mechanisms based on the context of the word (for example, noun-verb agreement). Beyond the obvious use-case of sorting suggestions, this would let technical writers take a Standard-English dictionary split by commonality of word (as is available) and colourise each word in a corpus according to its commonality.

Word-completion (lookup of all words in the current dictionary that complete the current prefix; see the sketch after this list).

Correction and usage tracking (would feed some of the ranking mechanisms above). This would likely be per-user (and optional). It would allow you to store per-app/per-doc/whatever usage databases to provide both more efficient and more personal/contextualised services to the user. Correction tracking is a sub-set of usage tracking, used solely by the spelling system. Usage tracking in the more general approach allows you to make educated guesses, based on the current input and past history, about what the user probably intended to say.
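For the word-completion service, one possible approach (a sketch only, nothing is implemented yet) is a range scan over sorted keys; the in-memory version with bisect looks like this, and an on-disk btree would do the equivalent cursor walk:

    import bisect

    def completions(sorted_words, prefix):
        """Yield every word in the sorted list that starts with prefix."""
        start = bisect.bisect_left(sorted_words, prefix)
        for word in sorted_words[start:]:
            if not word.startswith(prefix):
                break
            yield word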


The primary focus of the system's current design is on modularity and flexibility, so the code allows for loading both in-memory and on-disk word lists and mixing and matching word-lists within multiple loaded dictionaries (note: aspell also offers mix-and-match mechanisms for word-lists). The idea is that you load word-lists per-document, per-project, and per-application and add them to the current user's default dictionaries/correction-lists/usage-statistics to give you a contextually-relevant dictionary.
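Purely as an illustration of the mix-and-match idea (the real class names differ), a loaded dictionary can be little more than an ordered stack of word-list objects, with the most specific list consulted first:

    class Dictionary:
        """An ordered stack of word-list objects, most specific first."""
        def __init__(self, *word_lists):
            self.word_lists = list(word_lists)    # e.g. [per_doc, per_user, system]

        def add(self, word_list):
            self.word_lists.insert(0, word_list)  # new lists shadow/extend older ones

        def check(self, word):
            for wl in self.word_lists:
                variants = wl.check(word)
                if variants is not None:
                    return variants               # first word-list that knows the word
            return None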


Current Status and Problems

I'm currently using Python's bsddb wrappers for storing the word-lists. I'm storing the phonetic database as "phonetic-compressed": word [+ '\000'+word] and the non-phonetic database as "normalised": word [+ '\000'+word]. That means the database requires (loosely) 4 times the storage space of the words themselves, plus the overhead of the database. In practice the 500,000-word phonetic database (7.1MB of plain-text words) takes around 29MB when compiled. 500,000 is a fairly large dictionary (my "official scrabble dictionary" (dead-trees) is 500,000 words), but I'm thinking a real-world installation will want ~1,000,000 words to be available in system word-sets for each language, so we'd be talking around 120MB * (number of languages) just for the system dictionaries. So, some thoughts on how to compress the word-sets without dramatically overloading the system (processing overhead) would be appreciated.
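For reference, building the phonetic database with the standard bsddb wrappers is essentially the following (a simplified sketch; the file name and the compress() function are placeholders):

    import bsddb

    def build_phonetic_db(path, compress, words):
        """Pack words under their phonetic key, null-byte separated."""
        db = bsddb.hashopen(path, 'c')
        for word in words:
            key = compress(word)                   # phonetic compression of the word
            if db.has_key(key):
                db[key] = db[key] + '\000' + word  # append to the packed word list
            else:
                db[key] = word
        db.close()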

bsddb will, if a program using it crashes with a db open for writing, corrupt the database and be unable to open it again. Is there a better embeddable database system for this kind of work? I'm considering testing meta-kit (which I used a few years ago), but the last time I worked with it the same problem plagued it (it would corrupt files if not properly closed).

The scanning speed is just not usable under the current bsddb system (even if I use a non-thread-safe, naive approach), so the aspell-style searches for larger edit distances (i.e. more mistakes allowed) are prohibitively expensive (a single suggestion takes 2.5 seconds). The current code goes through around 200,000 iterations for a distance=2 search (even after optimising out the 2-deletion cases). That would likely be unnoticeably fast in C, but it's hideously slow in Python.

I am considering creating a phonetic-feature-set database instead of continuing work on the scanning phonetic database. This would store (each 2-character sub-string in the phonetic compression): (each word having that feature). That would give n+1 entries for each word (start and end are considered characters in the feature-sets). To do that without creating huge databases I would need a mechanism for storing pointers to words, rather than the words themselves, within the database. I'm sure there must be some useful way to do that, but everything I've seen relies on storing integers and then doing a query to resolve each integer back to a string, which seems like it would still be expensive (especially using bsddb). Any better mechanisms people know of? As a note, this is a toolkit for building spell-checkers, so I'll likely provide both mechanisms, but it's unlikely anyone would use both at the same time (it would mean ~180MB/language for system dictionaries).
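To clarify what I mean by a feature-set database, here's an in-memory sketch (word ids stand in for the "pointers to words" I'd need on disk, and '^'/'$' are arbitrary start/end markers; nothing here is implemented yet):

    def feature_index(phonetic_keys):
        """phonetic_keys: {word_id: phonetic_key} -> {bigram: set of word_ids}"""
        index = {}
        for word_id, key in phonetic_keys.items():
            padded = '^' + key + '$'              # start/end count as characters
            for i in range(len(padded) - 1):      # n+1 bigrams for an n-character key
                index.setdefault(padded[i:i + 2], set()).add(word_id)
        return index

    def candidates(index, phonetic_key, min_shared=2):
        """Word ids sharing at least min_shared bigrams with the query key."""
        counts = {}
        padded = '^' + phonetic_key + '$'
        for i in range(len(padded) - 1):
            for word_id in index.get(padded[i:i + 2], ()):
                counts[word_id] = counts.get(word_id, 0) + 1
        return [w for w, n in counts.items() if n >= min_shared]

The candidate set would then go through the edit-distance filter, trading storage for far fewer distance calculations per suggestion.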

For those interested in playing with the current code, the project is on sourceforge at:

   http://sourceforge.net/projects/pyspelling/

Enjoy yourselves,
Mike

_______________________________________
 Mike C. Fletcher
 Designer, VR Plumber, Coder
 http://members.rogers.com/mcfletch/
