aspell-devel

Re: [aspell-devel] Language Info Needed for Aspell


From: Kevin Atkinson
Subject: Re: [aspell-devel] Language Info Needed for Aspell
Date: Tue, 23 Mar 2004 17:21:09 -0500 (EST)

On Tue, 23 Mar 2004, Kevin Patrick Scannell wrote:

>    I'm pleased to hear you are trying to extend aspell
>  support as widely as possible.  I'm hoping I can contribute
>  in a substantial way here.   I have some web crawling software
>  available that targets particular languages:
> 
>  http://borel.slu.edu/crubadan/
> 
>  It "bootstraps" a model of the target language based on 
>  previously seen texts and rarely makes mistakes if provided
>  with sufficient "seed" texts.
> 
>  As you can see on the status page I've built up text corpora for quite a few
>  languages.     Part of the crawler is a module that ranks words in terms
>  of the likelihood that they are actually correctly spelled words in the
>  target language.    The highest frequency words make it of course --
>  also n-gram statistics are calculated which are a good way of
>  disqualifying the foreign (mostly English) words that sneak in.
>  In the cases where
>  I can find a dictionary I can check any suspect words manually.
>  This is also, I should say, an excellent way of improving 
>  existing word lists.  I've been in contact with the Breton 
>  and Welsh maintainers already.
> 
>  The upshot is that I should be able to package up reasonably
>  clean wordlists for Manx Gaelic (gv), Scottish Gaelic (gd),
>  Cebuano (ceb-- though I think "proc" chokes on 3-letter
>  ISO-639 codes), and Setswana (tn).       
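
[Editor's illustration] The n-gram filtering described in the quoted message can be sketched roughly as follows. This is not the crawler's actual code, which is not shown in this thread. It assumes a character-trigram model with add-one smoothing, trained on a list of seed words; the function names and the padding scheme are made up for the example.

```python
import math
from collections import Counter

def trigrams(word):
    # Pad so that word-initial and word-final letters also form trigrams.
    padded = "^^" + word.lower() + "$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(seed_words):
    # Count every character trigram seen in the seed vocabulary.
    counts = Counter()
    for w in seed_words:
        counts.update(trigrams(w))
    return counts

def score(word, counts):
    # Average log-probability per trigram, with add-one smoothing so
    # that unseen trigrams get a small but non-zero probability.
    total = sum(counts.values())
    vocab = len(counts) + 1
    grams = trigrams(word)
    logp = sum(math.log((counts[g] + 1) / (total + vocab)) for g in grams)
    return logp / len(grams)
```

Words whose score falls well below that of typical seed words would be candidates for manual checking. Where to put the cutoff is a judgment call that depends on the corpus.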

Great.  The latest version of the "proc" script handles it (I _really_ need 
a better name ;).  Although it will choke on two-word languages.  That is 
already fixed.  I will upload a new version soon.  If you know Perl you 
can fix it yourself.  All you need to do is change the regular expression.
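
[Editor's illustration] The regular expression in "proc" itself is not shown in this thread, so the following is only a sketch of the kind of pattern involved, written in Python rather than Perl. It assumes a language code is a 2- or 3-letter lowercase ISO 639 code, optionally followed by an underscore and a 2-letter uppercase region; the name parse_lang_code is hypothetical.

```python
import re

# Hypothetical pattern, NOT the one used in "proc": a 2- or 3-letter
# lowercase language code, optionally followed by "_" and a 2-letter
# uppercase region code (as in "pt_BR").
LANG_CODE = re.compile(r"([a-z]{2,3})(?:_([A-Z]{2}))?")

def parse_lang_code(code):
    """Return (language, region) or None if the code does not match."""
    m = LANG_CODE.fullmatch(code)
    if m is None:
        return None
    return m.group(1), m.group(2)
```

Under these assumptions "gv", "ceb" and "pt_BR" are all accepted, while a pattern restricted to exactly two letters would reject "ceb".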

>  The Walloon ispell dictionary has a Makefile target that
>  builds and installs an aspell dictionary, so that should
>  be easy enough.
> 
>  Perhaps in future if you have speakers of small languages
>  contacting you about creating spellcheckers from scratch you can direct
>  them to me.

No problem.

>   I should mention that it works out of the box for ISO-8859 character sets
>   but takes some effort for utf8... 

Oh well.  Many of those languages need to be parsed carefully anyway to 
properly get the words.


BTW: Are you good at Perl?  If so, would you be willing to help extend my 
"proc" script to make it more flexible?  Ideally I want to expand it so 
that a word list author can use it to create Aspell, MySpell, and Ispell 
dictionaries.  It also needs to be enhanced to allow more than one official 
dictionary to coexist for a given language, such as pt and pt_BR, and to 
provide support for Aspell 0.60 specific features.

-- 
http://kevin.atkinson.dhs.org





