aspell-devel

Re: [aspell-devel] Aspell Status Update as of February 12, 2004


From: Kevin Atkinson
Subject: Re: [aspell-devel] Aspell Status Update as of February 12, 2004
Date: Fri, 13 Feb 2004 22:00:44 -0500 (EST)

On Sat, 14 Feb 2004, Lars Aronsson wrote:

> Kevin Atkinson wrote:
> > The biggest change in Aspell 0.51 is support for Affix Compression.
> > Affix compression is the act of combining several words with a common base
> > word into one word which consists of the base word and a list of affixes
> > to apply.  (Affix is the generic term for prefix, suffix or infix).  For
> > example "alarm alarms alarmed alarming" will become "alarm/SDG" where SDG
> > stands for the suffixes of alarm.  This can make a huge difference in
> > space for languages which have extensive affixation such as German.
> 
> While I welcome this improvement, I object to the term "affix
> compression".  Making the dictionary file smaller (compression) might
> be one effect of using affix flags, but more important to languages
> such as German and Swedish is a guarantee that every grammatically
> legal ending for each word is covered by the dictionary.  "Grammatical
> completeness" is the desired effect.  It would be sad if this couldn't
> be combined with the highly appreciated sounds-alike function of
> Aspell.

Well, Aspell CAN accept word lists with affix flags and use them to make a
dictionary.  It will expand the words as they are read in.  Hence the
dictionary will be a lot larger if it is not "affix compressed".  Also,
when checking a document Aspell will have the affix information available
to it.  This can be used to inform the user if a word is not in the
dictionary but can be formed from a root word that is.
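
To make the expansion step concrete, here is a minimal sketch in Python.
The three suffix rules are simplified assumptions chosen to match the
"alarm/SDG" example; they are not taken from any real affix file, and
the function names expand and find_root are hypothetical, not part of
Aspell.

```python
# Hypothetical, simplified suffix rules: flag -> function building the derived form.
# Illustrative only; a real affix file has conditions and stripping rules.
SUFFIX_RULES = {
    "S": lambda w: w + "s",
    "D": lambda w: w + "ed",
    "G": lambda w: w + "ing",
}

def expand(entry):
    """Expand 'base/FLAGS' into the base word plus one derived form per flag."""
    base, _, flags = entry.partition("/")
    return [base] + [SUFFIX_RULES[f](base) for f in flags]

def find_root(word, roots):
    """Return (root, flag) pairs for known roots that could form `word`."""
    return [
        (root, flag)
        for root in sorted(roots)
        for flag, rule in SUFFIX_RULES.items()
        if rule(root) == word
    ]

print(expand("alarm/SDG"))               # ['alarm', 'alarms', 'alarmed', 'alarming']
print(find_root("alarming", {"alarm"}))  # [('alarm', 'G')]
```

The second call illustrates the last point: "alarming" may be missing
from the dictionary, yet the affix information shows it can be formed
from the root "alarm".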

> Ultimately, every time a new word is added to the dictionary, the
> correct affix flags should also be added.  There is little point in
> adding "alarming" to the dictionary unless "alarm" and "alarms" are
> added at the same time.

You don't really need affix flags to do this.  Just add all the expanded 
forms at once.

> The OCR software FineReader version 6 and later, at least in its
> English version, contains an example of how a user interface for
> adding words to a dictionary with affix patterns can be designed.
> This is try-and-buy software (for Microsoft Windows), so you can have
> a free look at it at http://www.finereader.com/
> 
> Roughly speaking, when the user wants to add a word to the dictionary,
> she is asked for the word's basic form (alarming -> alarm) and then
> all possible endings resulting from the available affix flags are
> listed with check boxes.  The user can check the flexions that apply
> and submit the new word.
> 
> An example: You want to add "going" to the dictionary.  The system
> asks what the basic form is.  You enter "go".  The system asks which
> endings are legal: goes, goed, going.  You mark goes and going, and
> submit.  The system stores go/SG.  (Assuming that /S adds -es to words
> that end in a vowel.)

That's an interesting idea.  Aspell could do this without the dictionary
being affix compressed, since it will still have the affix information
available to it.  All it needs to do is insert all the expanded forms of
the word.
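
The check-box flow described above might look roughly like the following
Python sketch.  The suffix rules (including /S adding -es after a vowel,
as assumed in the quoted example) and the names candidate_forms and
accept are hypothetical illustrations, not Aspell or FineReader code.

```python
# Hypothetical rules; /S adds "-es" after a vowel, as assumed in the quoted example.
SUFFIX_RULES = {
    "S": lambda w: w + ("es" if w[-1] in "aeiou" else "s"),
    "D": lambda w: w + "ed",
    "G": lambda w: w + "ing",
}

def candidate_forms(base):
    """List every form the available flags would produce, keyed by flag."""
    return {flag: rule(base) for flag, rule in SUFFIX_RULES.items()}

def accept(base, candidates, accepted):
    """Build the flagged entry and the expanded words for the forms the user checked."""
    flags = "".join(f for f, form in candidates.items() if form in accepted)
    words = [base] + [form for form in candidates.values() if form in accepted]
    entry = base + "/" + flags if flags else base
    return entry, words

candidates = candidate_forms("go")  # {'S': 'goes', 'D': 'goed', 'G': 'going'}
entry, words = accept("go", candidates, {"goes", "going"})
print(entry)  # go/SG
print(words)  # ['go', 'goes', 'going']
```

Either result can be stored: the flagged entry for an affix-compressed
dictionary, or the list of expanded words for one that is not.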

> Will the affix definition file follow the ispell or myspell format,
> or use its own format?

MySpell.

> I personally maintain a Swedish dictionary in ispell format from which I
> generate my Aspell dictionary, using "ispell -e" for expansion.
> Currently I have no good way to add new words interactively, when using
> Aspell.  I usually open my source dictionary file in Emacs, edit it,
> then run "make" to rebuild my dictionaries, all batch oriented.

How is ispell any different?  It can't add a word with affix flags (or can 
it?).

> Ispell comes with the "munchlist" utility that can be helpful in
> developing good dictionaries.  If munchlist fails to apply an affix
> flag, it is because the expanded dictionary (current aspell format)
> didn't contain one form of the word.  My Swedish expanded
> Aspell dictionary has 5.34 times more words (264K words) than my
> source file in ispell format (49K).  Munchlist is able to compress
> this to marginally smaller (48K), because my source is grammatically
> correct and not mathematically optimized for list compression.

I have ported the munchlist script to use Aspell.  You can find it in the
latest snapshot (although it's not 100% bug free).  However, the munch
program that came with MySpell does a better job, at least for the English
word list, which is also distributed with the latest snapshot.
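
For readers unfamiliar with what "munching" does, here is a naive greedy
sketch in Python.  It is not the algorithm used by munchlist or by the
MySpell munch program; the suffix table and the function name munch are
assumptions made only for illustration.

```python
# Hypothetical flag -> suffix table; real affix files are far richer.
SUFFIXES = {"S": "s", "D": "ed", "G": "ing"}

def munch(words):
    """Greedily fold an expanded word list into base/FLAGS entries."""
    words = set(words)
    covered = set()   # derived forms already represented by a flag on some base
    entries = []
    # Visit shorter words first so a base is always seen before its derived forms.
    for base in sorted(words, key=lambda w: (len(w), w)):
        if base in covered:
            continue
        flags = ""
        for flag, suffix in SUFFIXES.items():
            if base + suffix in words:
                flags += flag
                covered.add(base + suffix)
        entries.append(base + "/" + flags if flags else base)
    return sorted(entries)

print(munch(["alarm", "alarms", "alarmed", "alarming", "bell", "bells"]))
# ['alarm/SDG', 'bell/S']
print(munch(["alarm", "alarming"]))
# ['alarm/G']
```

The second call shows the behaviour mentioned in the quoted text: a flag
is only applied when the corresponding form is actually present in the
expanded list.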

-- 
http://kevin.atkinson.dhs.org





