
[aspell-devel] Big wordlist and affix lexicons


From: Børre Gaup
Subject: [aspell-devel] Big wordlist and affix lexicons
Date: Fri, 24 Nov 2006 15:53:17 +0100
User-agent: KMail/1.9.5

Hello!

I work on a project that is building spell checkers for Northern and Lule
Sami, including a high-quality Aspell spell checker.

We use Xerox two-level morphological tools to generate full-form word lists.
The Northern Sami full-form word list is now about 24 GB. The word list can be
broken down into word forms, each consisting of a single stem plus inflectional
and other endings. A single word can have up to 16,000 unique endings, and the
set of inflectional endings a word can take varies from word to word. We
therefore have several such sets of endings. The exact number needed for
Aspell is not yet known, but the present Xerox-based lexicons have more than
150 such sets.
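
To illustrate the structure described above, here is a minimal Python sketch
of how stems that share sets of endings expand into a full-form list. The
stems, endings and set names are hypothetical placeholders, not real Sami
morphology, and plain concatenation is a simplification of what the Xerox
tools or Aspell affix rules actually do.

```python
# Hypothetical toy data: placeholder stems and endings, not real Sami forms.
ending_sets = {
    "A": ["", "s", "t"],   # a set of three endings shared by several stems
    "B": ["", "n"],        # a second, smaller set
}
# Each stem points at the one set of endings it may take.
stems = {"stem1": "A", "stem2": "A", "stem3": "A", "stem4": "B"}


def expand(stems, ending_sets):
    """Yield every full form: the stem followed by each ending in its set."""
    for stem, set_name in stems.items():
        for ending in ending_sets[set_name]:
            yield stem + ending


full_forms = list(expand(stems, ending_sets))

# Entries needed when stems and ending sets are stored separately.
compact_entries = len(stems) + sum(len(e) for e in ending_sets.values())

print(len(full_forms), "full forms versus", compact_entries, "compact entries")
```

The gap between the two counts grows with the number of stems sharing a set,
which is why a stem list plus affix sets is so much smaller than the full-form
list.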

We made an affix file containing the 16,000 unique endings of one of our
words, and that file alone came to 1.5 MB. Our calculations tell us that if we
continue in this vein for all our words, we will end up with an affix file
that may be as large as 50 MB.
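
As a rough check on the figures above, the following Python sketch derives the
average size of one ending in the sample file and the number of endings a
50 MB file would hold at that rate. It assumes decimal megabytes and a uniform
size per ending. It is only one possible reading of the numbers, not
necessarily the calculation the project used.

```python
# Figures taken from the message; the MB = 10**6 bytes convention is an assumption.
ENDINGS_IN_SAMPLE = 16_000        # unique endings in the sample affix file
SAMPLE_FILE_BYTES = 1.5e6         # 1.5 MB sample affix file
PROJECTED_FILE_BYTES = 50e6       # projected 50 MB affix file

# Average number of bytes one ending occupies in the sample file.
bytes_per_ending = SAMPLE_FILE_BYTES / ENDINGS_IN_SAMPLE

# Number of endings a 50 MB file would hold at the same average size.
projected_endings = PROJECTED_FILE_BYTES / bytes_per_ending

# The same projection expressed as a multiple of the 16,000-ending sample.
equivalent_samples = PROJECTED_FILE_BYTES / SAMPLE_FILE_BYTES

print(f"{bytes_per_ending:.2f} bytes per ending")
print(f"{projected_endings:.0f} endings in the projected file")
print(f"{equivalent_samples:.1f} sample-sized sets")
```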

As far as we understand, there are 52 affix classes available in the affix
file. We will probably need more than these 52. Is it possible to increase
this number?

If that is not possible, we will probably end up with a very big word list,
running to several gigabytes. How well will Aspell handle a word list of that
size?

regards,
--
Børre Gaup
Prošeaktamielbargi - Project worker
tel(W): +47 77 64 59 64
tel(GSM): +47 41 08 03 64
e-mail:address@hidden
http://divvun.no/english.html



