[aspell-devel] Big wordlist and affix lexicons
From: Børre Gaup
Subject: [aspell-devel] Big wordlist and affix lexicons
Date: Fri, 24 Nov 2006 15:53:17 +0100
User-agent: KMail/1.9.5
Hello!
I work on a project that is building spell checkers for Northern and Lule
Sami, including a high-quality Aspell spell checker.
We use Xerox two-level morphological tools to make fullform word lists. The
Northern Sami fullform word list is now about 24 GB. The word list can be
broken down into word forms covering a single stem + inflection and other
endings. Each word can have up to 16000 unique endings, and the set of
inflectional endings a word can have varies. We thus have several such sets
of inflectional endings. The exact number needed for Aspell is not yet known,
but the present Xerox-based lexicons have more than 150 such sets.
We made an affix file containing the 16000 unique endings of one of our
words, and that file alone came to 1.5 MB. By our calculations, if we
continue in this vein for all our words, we will end up with an affix file
that could be as big as 50 MB.
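
As a rough check of the figures above, here is a minimal Python sketch of the
arithmetic. It assumes 1 MB = 1,000,000 bytes; the variable names are made up
for illustration, and the 50 MB projection is simply taken as given rather
than re-derived.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the message.
# Assumption: 1 MB = 1,000,000 bytes (the message does not say which unit).
ENDINGS_IN_TEST_FILE = 16_000      # unique endings of the one test word
TEST_FILE_BYTES = 1_500_000        # size of the resulting affix file
PROJECTED_FILE_BYTES = 50_000_000  # projected size for all words

# Average space one ending takes in the affix file.
bytes_per_ending = TEST_FILE_BYTES / ENDINGS_IN_TEST_FILE

# How many affix sets of the test file's size the projection corresponds to.
equivalent_full_sets = PROJECTED_FILE_BYTES / TEST_FILE_BYTES

print(f"{bytes_per_ending:.2f} bytes per ending")
print(f"about {equivalent_full_sets:.1f} sets of the test file's size")
```

So each ending costs a little under 100 bytes on average, and the projected
file corresponds to roughly 33 sets as large as the one we tested.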
As far as we understand, the affix file allows only 52 affix classes. We
will probably need more than that. Is it possible to increase this number?
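
To make the question concrete, here is a minimal Python sketch of how the
number of affix classes needed could be counted: stems that take exactly the
same set of endings can share one class. The stems, endings, and the function
name are hypothetical toy data, and the limit of 52 is only the figure as we
understand it, not something verified against the Aspell sources.

```python
from collections import defaultdict

# Limit as understood in this message; an assumption, not a verified value.
MAX_AFFIX_CLASSES = 52


def group_stems_by_ending_set(paradigms):
    """Group stems that share exactly the same set of endings.

    `paradigms` maps a stem to an iterable of its endings. Each distinct
    set of endings would need its own affix class.
    """
    groups = defaultdict(list)
    for stem, endings in paradigms.items():
        groups[frozenset(endings)].append(stem)
    return groups


# Hypothetical toy data: stem_a and stem_b share one set of endings.
paradigms = {
    "stem_a": ["", "s", "n"],
    "stem_b": ["n", "s", ""],
    "stem_c": ["", "t"],
}

groups = group_stems_by_ending_set(paradigms)
classes_needed = len(groups)
fits_in_limit = classes_needed <= MAX_AFFIX_CLASSES

print(f"{classes_needed} classes needed, fits in limit: {fits_in_limit}")
```

Run over the real lexicon, a count like this would show how far beyond 52 we
would actually end up.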
If that is not possible, we will probably end up with a very big word list,
in the gigabyte range. How well will Aspell handle a word list of that
size?
regards,
--
Børre Gaup
Prošeaktamielbargi - Project worker
tel(W): +47 77 64 59 64
tel(GSM): +47 41 08 03 64
e-mail:address@hidden
http://divvun.no/english.html