Re: [aspell] Affix or something else
From: Kevin Atkinson
Subject: Re: [aspell] Affix or something else
Date: Wed, 31 Jan 2001 10:28:58 -0500 (EST)
What follows are relevant notes I posted on affix compression to the list.
My view on this issue has not changed and is basically as follows: I will
gladly work with anyone on implementing affix compression for Aspell;
however, I do not have time to implement it myself at this point in time.
Subject: [aspell] The Deal on Affix Compression
Date: Fri, 12 Mar 1999 19:39:29 -0500
I realize that affix compression is important for languages with a
lot of affixes; however, it is not vital. The reason is that
without affix compression all you have to do is list all of the
possible combinations. I realize that this wastes space, but it
is doable.
For example, the word list that comes with Aspell has
70,598 words
After running it through the munchlist script it has
30,953 words
Which leads to a ratio of
2.3
Now, a Polish word list has these numbers:
1,041,430
146,626
7.1
Which means that for the Polish language affix compression saves about 3.1
times more space than it would for the English dictionary. Not that
big of a deal.
Also, notice that this dictionary is mighty large, especially
considering that the largest English word list I have has these numbers:
120,361
73,358
1.6
So my question is: do you really need that large of a dictionary? The
original poster who sent me these figures agrees that a lot of those
146,626 base words are not needed. So, let's say that we reduce the
base list down to 35,000; then the numbers are...
248,000
35,000
7.1
Which has about 3.5 times more words than Aspell's English
dictionary. True, this is large and slightly wasteful; however, it is
certainly manageable.
So once again, affix compression WILL save space; however, the expanded
word lists are manageable without affix compression. Nevertheless, I
do plan on implementing affix compression. It is just being put on
hold until the rest of the international code is done.
If you really care about affix compression, then implement it yourself.
But first talk to me so I can make sure you are doing it in a
manner that is consistent with the rest of the Aspell library.
Subject: [aspell] Affix Compression
Date: Wed, 14 Jun 2000 05:41:22 -0400 (EDT)
If anyone was interested in implementing affix compression here is a nice
write up by the author of Ispell.
But before Aspell can even use affix compression it will need an alternate
strategy for suggesting words that is not based on
"soundslike" words. I plan on implementing a special dictionary that, when
asked for all words with soundslike X, will simply return all words that
are equal to X after being stripped of capitalization and
accents. This will be possible to do without using an extra lookup table
because the words will be indexed by their lowercase, accent-stripped
equivalents.
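To make that concrete, here is a minimal sketch of such an index, assuming
a Latin-1 style 8-bit character set; the normalize function and its little
fold table are illustrative stand-ins, not Aspell's actual code.

    #include <cctype>
    #include <iostream>
    #include <map>
    #include <string>

    // Hypothetical normalization: lowercase the letter and fold a few
    // Latin-1 accented characters; a real table would cover the full set.
    char normalize_char(unsigned char c) {
        switch (c) {
        case 0xE9: case 0xE8: case 0xEA: return 'e'; // e-acute, e-grave, e-circumflex
        case 0xE0: case 0xE2:            return 'a'; // a-grave, a-circumflex
        default: return std::tolower(c);
        }
    }

    std::string normalize(const std::string & word) {
        std::string out;
        for (unsigned char c : word) out += normalize_char(c);
        return out;
    }

    int main() {
        // Store each word keyed by its normalized form; asking for all
        // words with soundslike X is then a single equal_range lookup.
        std::multimap<std::string, std::string> index;
        for (const char * w : {"Resume", "resume", "r\xE9sum\xE9"})
            index.insert({normalize(w), w});

        auto [lo, hi] = index.equal_range("resume");
        for (; lo != hi; ++lo) std::cout << lo->second << '\n';
    }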
---------- Forwarded message ----------
Date: Wed, 14 Jun 2000 00:12:20 -0700
From: Geoff Kuenning <address@hidden>
To: address@hidden
Subject: Re: Affix Compression Again
> I realize that you are rather busy however if you have the time I would
> appreciate it if you could give me an English overview of how affix
> compression works. For example how does looking the words up in the
> dictionary work, does it strip the word to its base name, if so what does
> it do if there could be more than one base name?
As it turns out, I had to implement it in both directions. The "add
affixes" direction, which is the way people think of it when they
write affix files, was originally created for generating suggestions
and for use by munchlist. However, it's also turned out to be very
useful for all of those people who want to turn an ispell dictionary
back into a raw word list. I'll describe that one first.
For simplicity, I'll describe everything in terms of suffixes, which
is what existed first. It should be obvious how to extend things to
prefixes, with one exception that I'll cover below.
Adding an affix to a word is fairly easy. First, you check the
contexts. That's easy and quick because the contexts are encoded as
flag arrays, so checking them is a simple lookup. For example,
consider flag T from english.aff:
flag T:
E > ST # As in late > latest
[^AEIOU]Y > -Y,IEST # As in dirty > dirtiest
[AEIOU]Y > EST # As in gray > grayest
[^EY] > EST # As in small > smallest
Suppose we have "dirty/T". First I check to make sure the word is at
least 2 characters long (1 more than the "strip" length -- you can't
wind up with a null root after stripping!). Then I check the
contexts; I think I go backwards for convenience.
The context characters are encoded as bits within an array of bytes,
named "conds". There are 256 entries in the array, one for each 8-bit
character (in some cases there can be more than 256 entries, but I'm
simplifying a bit). In any entry, bit 0 is set if the corresponding
character is a legal context for the last character in the word.
(Actually, only the entries for uppercase characters matter, but I
maintain them all so that the code will run fast.) Similarly, bit 1
is set if the character is legal for the next-to-last character, and
so forth. That limits me to 8 context characters, but 8 seems to be
enough for everybody.
So, for example, the second flag-table entry for "T" will state that
there are two context characters. I use the "y" (converted to
uppercase) to index into the 256-byte "conds" array and pick up an
8-bit value. Since bit 0 is set, I know that "Y" is legal as the
final character of a string for this suffix entry. I then do the same
thing for the "t" of "dirty"; this time I look at bit 1. Note that in
this case, bit 0 will be set only in the entry for "y", while bit 1
will be set in every entry except for the vowels.
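Here is that check as a minimal sketch. The struct below is not ispell's
real flagent (and the separate strip-length test is omitted), but the bit
manipulation follows the description above:

    #include <cctype>
    #include <cstring>
    #include <iostream>

    struct FlagEntry {
        int numconds;               // number of context characters to check
        unsigned char conds[256];   // bit i set => char is legal i places from the end
    };

    // True if the tail of `word` satisfies the entry's context conditions.
    bool context_ok(const FlagEntry & e, const char * word) {
        int len = (int)std::strlen(word);
        if (len < e.numconds) return false;   // not enough characters to check
        for (int i = 0; i < e.numconds; ++i) {
            unsigned char c = std::toupper((unsigned char)word[len - 1 - i]);
            if (!(e.conds[c] & (1u << i))) return false;
        }
        return true;
    }

    int main() {
        // Encode the "[^AEIOU]Y" entry of flag T: bit 0 is set only for Y,
        // bit 1 is set for every letter except the vowels.
        FlagEntry e = {2, {0}};
        e.conds['Y'] |= 1;
        for (int c = 'A'; c <= 'Z'; ++c)
            if (!std::strchr("AEIOU", c)) e.conds[c] |= 2;

        std::cout << context_ok(e, "dirty") << ' '   // 1: dirty -> dirtiest
                  << context_ok(e, "gray")  << '\n'; // 0: matched by another entry
    }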
Having checked the context, the rest is easy. Since I know the length
of what I need to strip, and the context matches, I just delete that
number of characters. Then I tack on the "append" string (it's called
"affix" in the flagent struct) and I'm done--except for
capitalization, which is a horrible mess I'll discuss later.
Actually, the delete/tack on part can be done with a single strcpy for
suffixes; prefixes are a bit trickier but really just take a bit of
careful programming.
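A sketch of the strip-and-append step, with an illustrative struct; as
noted above, C can do the splice with a single strcpy:

    #include <iostream>
    #include <string>

    struct SuffixEntry {
        std::string strip;    // characters removed from the root, e.g. "Y"
        std::string append;   // characters added in their place, e.g. "IEST"
    };

    // Assumes the context has already been checked and that
    // root.size() > strip.size(), so the result is never a null root.
    std::string apply_suffix(const std::string & root, const SuffixEntry & e) {
        return root.substr(0, root.size() - e.strip.size()) + e.append;
    }

    int main() {
        SuffixEntry iest = {"Y", "IEST"};
        std::cout << apply_suffix("DIRTY", iest) << '\n'; // DIRTIEST
    }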
Whew! The "remove affixes" direction, which is used when you need to
verify the spelling of a word, works in almost the same way. The big
difference is that since the context rules apply to the root, you
can't check them until after you've made the transformation. Again,
the first thing I do is to check the length of the word against the
affix lengths to make sure that what I'm about to do is sensible.
Then I string-compare the "append" string with the tail of the word.
If they match, I strip off the "append" string (by just trimming the
right number of characters) and add on the "strip" string. Again,
that can be done with a single strcpy for suffixes.
Finally, after doing the suffix transformation, I can check the
context using the same rules as before. Again, I have to correct
capitalization in a separate step.
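In code, the remove direction might look like the sketch below (again with
illustrative types); the candidate root it returns must still pass the
context check and a dictionary lookup:

    #include <iostream>
    #include <optional>
    #include <string>

    struct SuffixEntry {
        std::string strip;
        std::string append;
    };

    // Returns the candidate root if `word` could have been built with this
    // entry; the caller must still check the context and the dictionary.
    std::optional<std::string> remove_suffix(const std::string & word,
                                             const SuffixEntry & e) {
        if (word.size() <= e.append.size()) return std::nullopt;
        if (word.compare(word.size() - e.append.size(),
                         e.append.size(), e.append) != 0)
            return std::nullopt;   // tail does not match the append string
        return word.substr(0, word.size() - e.append.size()) + e.strip;
    }

    int main() {
        SuffixEntry iest = {"Y", "IEST"};
        if (auto root = remove_suffix("DIRTIEST", iest))
            std::cout << *root << '\n'; // DIRTY
    }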
Prefixes work the same way except that bit 0 of the "conds" array
corresponds to the first character of the string.
Oh, I missed one of your questions: "what does it do if there could be
more than one base name?" The answer is that I try all
possibilities. I just loop through the flag table, looking for a
hit. If I recall correctly, there's an argument to the function that
tells it whether to stop on the first success, or return all matches.
That's used in suggestion generation, I think.
If you've been thinking about performance as you read this, you may
have spotted a nasty little problem: when you're working in the
"remove affixes" direction, you have to spend a lot of time in strcmp.
Yup. In fact, that turned out to be a killer, and I had to come up
with a clever data structure to help out. Unfortunately, I don't
remember what I did! Lemme look...hmmm, looks like there's a
tree-like table of pointers to flag entries, indexed by the characters
of the word. For example, the entry for "T" would point to a list of
flag entries describing words that could end in "T", including
"dirtiest". If there are a lot of suffixes ending in T, the table
would contain another level of indexing, so you'd wind up following
the "S" entry to find all words that end in "ST", and so forth. This
tree is built dynamically in lookup.c when the affix table is read in,
and is used in tgood.c (search for "flagptr", which is the struct used
for the tree). Man, I have a convoluted mind when I need to. Anyway,
you can look at the table by turning on the DUMPINDEX compile-time
option.
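From that description, the index is essentially a small trie keyed on the
word's final characters. Here is a rough reconstruction of the idea, not
ispell's actual flagptr code:

    #include <iostream>
    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    struct SuffixEntry { std::string strip, append; };

    struct Node {
        std::vector<const SuffixEntry *> entries;    // entries whose append ends here
        std::map<char, std::unique_ptr<Node>> next;  // keyed by the preceding character
    };

    // File each entry under its append string read backwards, so that
    // "EST" lives at root -> T -> S -> E.
    void insert(Node & root, const SuffixEntry & e) {
        Node * n = &root;
        for (auto it = e.append.rbegin(); it != e.append.rend(); ++it) {
            auto & child = n->next[*it];
            if (!child) child = std::make_unique<Node>();
            n = child.get();
        }
        n->entries.push_back(&e);
    }

    // Walk the word from its last character; every node passed on the way
    // lists exactly the entries whose append string matches the tail.
    void candidates(const Node & root, const std::string & word) {
        const Node * n = &root;
        for (auto it = word.rbegin(); it != word.rend(); ++it) {
            auto f = n->next.find(*it);
            if (f == n->next.end()) break;
            n = f->second.get();
            for (const SuffixEntry * e : n->entries)
                std::cout << "try strip=\"" << e->strip
                          << "\" append=\"" << e->append << "\"\n";
        }
    }

    int main() {
        SuffixEntry est = {"", "EST"}, iest = {"Y", "IEST"};
        Node root;
        insert(root, est);
        insert(root, iest);
        candidates(root, "DIRTIEST"); // finds both the EST and the IEST entries
    }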
The one thing I've been postponing talking about is capitalization,
because it's so tricky. You probably know that ispell divides the
world up into four categories: all-caps (IBM), capitalized (Kevin),
followcase (practically every dot-com nowadays), and lowercase. To
make things fun, there's a "promotion" sequence: lowercase words can
be capitalized, and any word can be all-caps (as in titles for
example). The result is that you need to maintain the capitalization
when you manipulate affixes. Here, from memory, is how it's done.
The first thing you need is the root word, because that's where the
information is kept about which category the word falls into. If it's
all-caps, life is simple: just force everything into uppercase.
Otherwise, we need to do what the user wants. All of this only
matters in "add affixes" mode, because the "remove" mode always wants
to look up an uppercase version of the word.
To properly add affixes, you need to know what the original word
looked like. There are two cases: generation from a raw dictionary
(the -e switch) and generation of suggestions for a misspelled word.
Working from a raw dictionary, I use the following rules:
Lowercase word: the root and all affixes become lowercase
Capitalized word: suffixes become lowercase. Prefixes are
capitalized, and the first letter of the root is forced to
lowercase.
All-caps word: the root and all affixes become uppercase
Followcase word: the prefix is forced to the case of the first
character of the root, and the suffix is forced to the case of the
last character of the root. That doesn't always do the right
thing, but munchlist makes sure that a followcase word doesn't get
an affix flag unless the capitalization works out OK; otherwise
the "expanded" versions will get listed separately in the
dictionary. Also, with followcase words there can be more than
one legal case choice, so I might have to handle several options
and produce the word multiple times.
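Here is a toy rendering of those raw-dictionary rules, assuming plain
ASCII; the enum and helpers are invented for illustration, it glosses over
the multiple-choice followcase situations, and it implements the rules
exactly as stated above (so it prints "UNITcorp" in the case discussed
just below):

    #include <cctype>
    #include <iostream>
    #include <string>

    enum class Case { Lower, Capitalized, AllCaps, FollowCase };

    std::string lower(std::string s) { for (char & c : s) c = std::tolower((unsigned char)c); return s; }
    std::string upper(std::string s) { for (char & c : s) c = std::toupper((unsigned char)c); return s; }

    std::string expand(Case cat, std::string root,
                       std::string prefix, std::string suffix) {
        switch (cat) {
        case Case::Lower:        // root and all affixes become lowercase
            return lower(prefix) + lower(root) + lower(suffix);
        case Case::Capitalized:  // prefix takes the capital; suffix is lowercase
            if (!prefix.empty()) {
                prefix = lower(prefix);
                prefix[0] = std::toupper((unsigned char)prefix[0]);
                root[0] = std::tolower((unsigned char)root[0]);
            }
            return prefix + root + lower(suffix);
        case Case::AllCaps:      // root and all affixes become uppercase
            return upper(prefix) + upper(root) + upper(suffix);
        case Case::FollowCase:   // affixes copy the case of the adjacent root char
            if (!prefix.empty())
                prefix = std::isupper((unsigned char)root.front()) ? upper(prefix) : lower(prefix);
            if (!suffix.empty())
                suffix = std::isupper((unsigned char)root.back()) ? upper(suffix) : lower(suffix);
            return prefix + root + suffix;
        }
        return root;
    }

    int main() {
        std::cout << expand(Case::Capitalized, "Kevin", "", "S") << '\n'   // Kevins
                  << expand(Case::FollowCase, "ITcorp", "un", "") << '\n'; // UNITcorp
    }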
Incidentally, you can get a good idea of what ispell will do in these
situations by typing into the standard input of "ispell -ee"...hmm, in
fact I see that the rules above are incorrect, because giving it
"ITcorp/U" gives "Unitcorp" instead of "UNITcorp" as I clamed -- but I
don't know if that's a bug or a feature! I think it's a bug.
Finally, for generating suggestions, the rules above are followed, but
the original misspelled word is used as a prototype. If it was
all-caps, then the suggestion is generated in the same form. If the
prototype was capitalized, the suggestion is generated capitalized
unless the root was marked as all-caps or followcase -- in those
situations, the rules listed with the root are followed just as if we
had been working from a raw dictionary. Finally, if the prototype was
all lowercase, then the raw-dictionary rules are followed.
My code isn't nearly as clean as the above description. In
particular, I should have written the raw-dictionary code separately,
and then called it from the prototype code when appropriate. But
that's not the way it grew; you know how *that* goes.
> It would help me out a lot so that when implementing it myself I don't
> make the same mistakes that you make.
Me? Mistakes? Surely you jest... :-)
--
Geoff Kuenning address@hidden http://www.cs.hmc.edu/~geoff/
"Du kannst dem Leben nicht mehr Tage geben, aber den Tag mehr Leben."
-- source unknown, attribution requested
(You can't give your life more days, but you can give your days more life.)
Subject: [aspell] More Affix Compression Notes
Date: Wed, 14 Jun 2000 17:11:25 -0400 (EDT)
---------- Forwarded message ----------
Date: Wed, 14 Jun 2000 13:32:42 -0700
From: Geoff Kuenning <address@hidden>
To: address@hidden
Subject: Re: Affix Compression Again
> How do you store the list of valid rules in the hash table? Do you use a
> bit array or a list of the flags, and why did you choose that method:
> space, speed, convenience, etc.?
If I understand your question, you want to know how trouble/DGS
represents the DGS part in the hash table. It's a bit array. That
part of the code dates back to Pace's original ispell, and probably
back to the PDP-10 assembly-language version. It was chosen for a
combination of space and speed. I'd stick with it nowadays for the
speed part; space has become kind of a moot point, but speed still
matters a lot when checking large documents in some languages.
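For concreteness, a bit-array flag set can be as small as the sketch
below, assuming single-letter flags; real ispell flag tables are more
general:

    #include <cstdint>
    #include <iostream>

    // One bit per affix flag A..Z; membership is a single AND.
    using FlagSet = std::uint32_t;

    constexpr FlagSet flag_bit(char f) { return FlagSet(1) << (f - 'A'); }

    int main() {
        FlagSet trouble = flag_bit('D') | flag_bit('G') | flag_bit('S'); // trouble/DGS
        std::cout << bool(trouble & flag_bit('G')) << ' '   // 1
                  << bool(trouble & flag_bit('T')) << '\n'; // 0
    }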
--
Geoff Kuenning address@hidden http://www.cs.hmc.edu/~geoff/
"Du kannst dem Leben nicht mehr Tage geben, aber den Tag mehr Leben."
-- source unknown, attribution requested
(You can't give your life more days, but you can give your days more life.)
Subject: [aspell] Compiled Word List Compression Notes
Date: Sat, 24 Jun 2000 11:11:28 -0400 (EDT)
I have given a lot of thought to compressing Aspell's compiled word lists,
and here is what I have come up with.
1. Affix Compression)
It would still be possible to use affix compression with the soundslike
lookup; however, the compiled word list will still take up more space than
Ispell's. The data would be laid out roughly as follows:
<base word><affix flags><base word><affix flags>....
<word hash> (size * base words * 1.25 * 4)
<soundslike base><affix flag 1-2 bytes><word list><affix flag><word list>
<soundslike hash> (size * base soundslike * 1.25 * 4)
Where the word list will consist of
<num of words, 2 bytes>
<base word offset 3 bytes, affix flag 1 byte>
<base word>...
The approximate size would be between
(size of ispell hash table) + (num of expanded words)*4 and
(size of ispell hash table) * 2 + (num of expanded words)*4
depending on the roughness of the soundslike data.
A little more space can be saved by using 3-byte words instead of
4-byte integers in the hash tables.
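As a back-of-the-envelope check, plugging the Polish figures from the
first note into this layout gives roughly the following; the per-word byte
counts come from the layout above, and the one-soundslike-per-base-word
guess is my own assumption:

    #include <iostream>

    int main() {
        double base_words = 146626;          // Polish base words, from above
        double expanded   = 1041430;         // expanded words, from above
        double soundslike = base_words;      // ASSUMPTION: ~one soundslike per base word

        double word_hash       = base_words * 1.25 * 4;  // <word hash>
        double soundslike_hash = soundslike * 1.25 * 4;  // <soundslike hash>
        double word_lists      = expanded * 4;           // 3-byte offset + 1-byte flag each

        std::cout << (word_hash + soundslike_hash + word_lists) / (1 << 20)
                  << " MB of tables, plus the base word and soundslike strings\n";
    }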
2. Huffman)
The other idea I had was to count the frequencies with which all single
letters appear, all strings of length 2, all strings of length 3, etc.,
and then encode the words via a Huffman code.
There are some good notes on Huffman coding at
http://www.compressconsult.com/huffman/. If you don't know what Huffman
coding is there is a good introduction at
http://www.faqs.org/faqs/compression-faq/part2/section-1.html.
If I can compress most words to under 4 bytes then I can store the words
directly in the hash table so the space would become.
# of words * (1.25*4+4 = 9) + # of soundslike * (1.25 * 4 = 5)
+ size of prefix code lookup tables.
The tricky part would be
1.
Identifying when a string is common enough to justify a prefix code.
2.
Determining where to split the string. For example, if the following
codes are available:
freq 1, equency 2, ency 3, ue 4
The word frequency can be encoded in several different ways:
14ncy
1u3
fr2
Which one is optimal depends on the lengths of the prefix codes for each
string and for the individual letters. This problem should be solvable
efficiently using dynamic programming, as it seems closely related to
the matrix chain multiplication problem.
However, the choice of code words to use will also affect the frequency of
the strings and thus the optimal prefix codes....
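Ignoring that feedback loop, the split-selection subproblem itself is a
straightforward dynamic program. The sketch below uses invented
prefix-code lengths for the example codes, with a flat 8-bit cost per
single letter purely for illustration:

    #include <climits>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Cheapest encoding of `word` given a bit cost for each code string.
    int min_bits(const std::string & word, const std::map<std::string,int> & cost) {
        int n = (int)word.size();
        std::vector<int> dp(n + 1, INT_MAX);   // dp[i] = cheapest encoding of word[0..i)
        dp[0] = 0;
        for (int i = 0; i < n; ++i) {
            if (dp[i] == INT_MAX) continue;
            for (const auto & [code, bits] : cost)
                if (word.compare(i, code.size(), code) == 0)
                    dp[i + code.size()] = std::min(dp[i + code.size()], dp[i] + bits);
        }
        return dp[n];
    }

    int main() {
        // Invented bit lengths for the example codes in the text.
        std::map<std::string,int> cost = {
            {"freq", 6}, {"equency", 7}, {"ency", 5}, {"ue", 4},
        };
        for (char c = 'a'; c <= 'z'; ++c) cost[std::string(1, c)] = 8;
        // Prints 19: freq + u + ency, i.e. "1u3" in the notation above.
        std::cout << min_bits("frequency", cost) << " bits\n";
    }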
Additional Notes)
Both of these methods will create problems with the new code I added in
the latest snapshot to control compound formation. The affix compression
method should still work, provided that if the base word can be used in a
compound, so can all of the expanded forms. For the Huffman method the
compound information would either need to be stored separately or merged
in with the word. If it is stored separately, another (# of words) bytes
will need to be used. The existing method only takes up extra space if
the word has compound information attached to it.
A. Implementation)
I don't plan on implementing either of these methods any time soon, as
I have lots of other things I want to get done first in Aspell. However,
if someone plans on implementing either of these methods, I will gladly
work with you.
B. Feedback)
As always, feedback is more than welcome. I am especially interested in
feedback on the Huffman method.
Subject: [aspell] Affix Compression Assistance Wanted
Date: Tue, 25 Jul 2000 14:10:28 -0400 (EDT)
I would greatly appreciate some assistance with Affix Compression.
I imagine that it would take a good week to do, which is why I don't want
to do it right now (I have many other things I want to get done first).
If anyone is willing, I will be able to tell you almost exactly what
needs to be done to support affix compression, as I have given it a lot of
thought. The only advantage I have over someone else who knows C++
well is that I have an intimate knowledge of the Aspell source. When
someone decides to do it I will give them tremendous help on the exact
areas of Aspell that need to be modified. I will handle most of the
interface stuff so that the only thing they have to worry about is the
actual implementation. That is, they won't have to spend several days
studying the Aspell source to figure out how to integrate X feature with
Aspell. Even if you don't know C++ very well but do know C, I can still
use your effort.
I know Trond Eivind Glomsrød, the guy packaging Aspell for the next
version of RedHat, will appreciate it.
---
Kevin Atkinson
kevina at users sourceforge net
http://metalab.unc.edu/kevina/