[Aspell-user] Aspell Now Has Full UTF-8 Support
From: Kevin Atkinson
Subject: [Aspell-user] Aspell Now Has Full UTF-8 Support
Date: Thu, 18 Mar 2004 02:42:16 -0500 (EST)
Aspell now fully supports spell checking documents in UTF-8. In addition,
Aspell can now accept all input and print all output in UTF-8 or any other
encoding that Aspell supports. The fact that Aspell is still 8-bit
internally can now be made completely transparent to the end user.
Previous versions of Aspell supported Unicode to some extent; however,
word lists still had to be in an 8-bit character set. Furthermore, spell
checking documents in an encoding different from the internal encoding
was problematic. This has all changed now.
With this change, Aspell can now support any language that uses no more
than 220 distinct characters, including different capitalizations and
accents, _even if_ there is no existing 8-bit encoding that supports the
language. All one has to do is create a new character data file, which
is a fairly simple task. The internal encoding never has to be seen by
the end user, including the word list author, since not even the word list
has to be in the same encoding that Aspell uses.
Full UTF-8 support was added in 0.51-20040219; the next snapshot,
0.51-20040227, fixed a few bugs; and the latest, 0.60-20040317, uses a
new, simpler format for the character data files.
Aspell snapshots can be downloaded from ftp://alpha.gnu.org/gnu/aspell/.
Notes on 8-bit Characters
*************************
There is a very good reason I use 8-bit characters in Aspell: speed and
simplicity. While many parts of my code could fairly easily be converted
to some sort of wide character, since my code is clean, other parts
could not be.
One of the reasons is that in many, many places I use a direct lookup
to find out various information about characters. With 8-bit characters
this is very feasible because there are only 256 of them. With 16-bit
wide characters this would waste a LOT of space. With 32-bit characters
it is just plain impossible. Converting the lookup tables to some
other form, while certainly possible, would degrade performance
significantly.
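To illustrate the direct-lookup idea (this is only a sketch, not Aspell's
actual code, and it assumes Latin-1 as a stand-in for the internal 8-bit
encoding), a per-character property needs only a 256-entry table:

```python
# Illustrative sketch: 256-entry direct-lookup tables for character
# properties, assuming Latin-1 as the internal 8-bit encoding.
is_letter = [False] * 256
to_lower = list(range(256))

for code in range(256):
    ch = bytes([code]).decode("latin-1")
    if ch.isalpha():
        is_letter[code] = True
        low = ch.lower()
        if len(low) == 1 and ord(low) < 256:
            to_lower[code] = ord(low)

# Any property query is then a single array index:
assert is_letter[0xC9]          # 0xC9 is 'É' in Latin-1
assert to_lower[0xC9] == 0xE9   # which lowercases to 'é' (0xE9)
```

With 32-bit characters the same trick would need tables with over four
billion entries per property, which is exactly the problem described above.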
Furthermore, some of my algorithms rely on words consisting of only
a small number of distinct characters (often around 30 when case and
accents are not considered). When a character can be any Unicode
character, this number becomes several thousand, if not more. In
order for these algorithms to still be used, some sort of limit would
need to be placed on the characters a word can contain. If I impose
that limit, I might as well use some sort of 8-bit character set,
which automatically places a limit on what the characters can be.
There is also the issue of how I should store the word lists in
memory. As strings of 32-bit wide characters? That would use four
times the memory that 8-bit characters would, and for languages that fit
within an 8-bit character set that is, in my view, a gross waste of memory.
So maybe I should store them in some variable-width format such as
UTF-8. Unfortunately, way, way too many of my algorithms simply will
not work with variable-width characters without significant
modification, which would very likely degrade performance. So the
solution is to work with the characters as 32-bit wide characters and
then convert them to a shorter representation when storing them in the
lookup tables. Now that can lead to an inefficiency. I could also use
16-bit wide characters; however, that may not be enough to hold all
of future versions of Unicode, and it has the same problems.
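That conversion path can be sketched roughly as follows (hypothetical
names and a toy character table; Aspell's real character-data machinery
differs): decode the input into code points, then map each code point
into the fixed-width 8-bit internal encoding through a per-language table.

```python
# Hypothetical sketch: map Unicode code points into an 8-bit internal
# encoding via a per-language table. Here the toy table covers ASCII
# plus a few accented letters, with internal bytes equal to Latin-1.
internal_of = {cp: cp for cp in range(0x80)}               # ASCII maps to itself
internal_of.update({0xE9: 0xE9, 0xE8: 0xE8, 0xE7: 0xE7})   # é, è, ç

def to_internal(word: str) -> bytes:
    """Convert a Unicode word to the fixed-width 8-bit internal form."""
    # Python's str already yields code points, standing in for the
    # 32-bit wide characters described above. A code point outside the
    # table means the language's character set cannot represent it.
    return bytes(internal_of[ord(ch)] for ch in word)

assert to_internal("café") == b"caf\xe9"
```

The dictionary lookup here stands in for the table built from a
language's character data file; the point is that every internal string
is one byte per character, whatever the external encoding was.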
In response to the space wasted by storing word lists in some sort of
wide format, someone asked:
Since hard drives are cheaper and cheaper, you could store the
dictionary in a usable (uncompressed) form and use it directly
with memory mapping. Then the efficiency would directly depend on
the disk caching method, and only the used part of the
dictionaries would really be loaded into memory. You would no longer
have to load plain dictionaries into main memory; you'd just want
to compute some indexes (or something like that) after mapping.
However, the fact of the matter is that most of the dictionary will
be read into memory anyway if memory is available. If it is not
available, then there would be a good deal of disk swapping. Making
characters 32 bits wide increases the chance of more disk swaps. So the
bottom line is that it is cheaper to convert the characters from
something like UTF-8 into some sort of wide character. I could also use
some sort of on-disk lookup table such as the Berkeley Database;
however, this would *definitely* degrade performance.
The bottom line is that keeping Aspell 8-bit internally is a very
well thought out decision that is not likely to change any time soon.
Feel free to challenge me on it, but don't expect me to change my mind
unless you can bring up some point that I have not thought of before,
and quite possibly a patch that cleanly converts Aspell to Unicode
internally without a serious performance loss OR a serious increase in
memory usage.
Languages Which Aspell can Support
**********************************
Even though Aspell will remain 8-bit internally, it should still be
able to support any written language not based on a logographic
script. The only logographic writing systems in current use are those
based on hànzi, which include Chinese, Japanese, and sometimes Korean.
Languages with 220 or Fewer Unique Symbols
==========================================
Aspell 0.60 should be able to support the following languages, as, to
the best of my knowledge, they all contain 220 or fewer symbols and can
thus fit within an 8-bit character set. If no suitable character set
exists, a new one can be invented. This is true even if
the script is not yet supported by Unicode, since the private use area
can be used.
Code  Language Name  Script  Dictionary Available  Gettext Translation
ab Abkhazian Cyrillic - -
ae Avestan Avestan - -
af Afrikaans Latin Yes -
an Aragonese Latin - -
ar Arabic Arabic - -
as Assamese Bengali - -
ay Aymara Latin - -
az Azerbaijani Arabic - -
az Cyrillic - -
az Latin - -
ba Bashkir Cyrillic - -
be Belarusian Cyrillic - Yes
bg Bulgarian Cyrillic Yes -
bh Bihari Devanagari - -
bn Bengali Bengali - -
bo Tibetan Tibetan - -
br Breton Latin Yes -
bs Bosnian Latin - -
ca Catalan/Valencian Latin Yes -
ce Chechen Cyrillic - -
ch Chamorro Latin - -
co Corsican Latin - -
cr Cree Canadian Syllabics - -
cr Latin - -
cs Czech Latin Yes -
cv Chuvash Cyrillic - -
cy Welsh Latin Yes -
da Danish Latin Yes -
de German Latin Yes -
dv Divehi Dhives Akuru - -
dz Dzongkha Tibetan - -
el Greek Greek Yes -
en English Latin Yes -
eo Esperanto Latin Yes -
es Spanish Latin Yes Incomplete
et Estonian Latin - -
eu Basque Latin - -
fa Persian Arabic - -
fi Finnish Latin - -
fj Fijian Latin - -
fo Faroese Latin Yes -
fr French Latin Yes Yes
fy Frisian Latin - -
ga Irish Latin Yes Yes
gd Scottish Gaelic Latin - -
gl Gallegan Latin Yes -
gn Guarani Latin - -
gu Gujarati Gujarati - -
gv Manx Latin - -
ha Hausa Latin - -
he Hebrew Hebrew - -
hi Hindi Devanagari - -
hr Croatian Latin Yes -
hu Hungarian Latin - -
hy Armenian Armenian - -
ia Interlingua (IALA) Latin - -
id Indonesian Arabic - -
id Latin Yes -
io Ido Latin - -
is Icelandic Latin Yes -
it Italian Latin Yes -
iu Inuktitut Canadian Syllabics - -
iu Latin - -
ja Japanese Latin - -
jv Javanese Javanese - -
jv Latin - -
ka Georgian Georgian - -
kk Kazakh Cyrillic - -
kl Kalaallisut/Greenlandic Latin - -
km Khmer Khmer - -
kn Kannada Kannada - -
ko Korean Hangeul - -
kr Kanuri Latin - -
ks Kashmiri Arabic - -
ks Devanagari - -
ku Kurdish Arabic - -
ku Cyrillic - -
ku Latin - -
kv Komi Cyrillic - -
kw Cornish Latin - -
ky Kirghiz Arabic - -
ky Cyrillic - -
ky Latin - -
la Latin Latin - -
lo Lao Lao - -
lt Lithuanian Latin - -
lv Latvian Latin - -
mi Maori Latin Yes -
mk Macedonian Cyrillic - -
ml Malayalam Latin - -
ml Malayalam - -
mn Mongolian Cyrillic - -
mn Mongolian - -
mo Moldavian Cyrillic - -
mr Marathi Devanagari - -
ms Malay Arabic - -
ms Latin Yes -
mt Maltese Latin - -
my Burmese Myanmar - -
nb Norwegian Bokmal Latin - -
ne Nepali Devanagari - -
nl Dutch Latin Yes Yes
nn Norwegian Nynorsk Latin - -
no Norwegian Latin Yes -
nv Navajo Latin - -
oc Occitan/Provencal Latin - -
oj Ojibwa Ojibwe - -
or Oriya Oriya - -
os Ossetic Cyrillic - -
pa Punjabi Gurmukhi - -
pi Pali Devanagari - -
pi Sinhala - -
pl Polish Latin Yes -
ps Pushto Arabic - -
pt Portuguese Latin Yes Yes
qu Quechua Latin - -
rm Raeto-Romance Latin - -
ro Romanian Latin Yes -
ru Russian Cyrillic Yes -
sa Sanskrit Devanagari - -
sa Sinhala - -
sd Sindhi Arabic - -
se Northern Sami Latin - -
sk Slovak Latin Yes -
sl Slovenian Latin Yes -
sn Shona Latin - -
so Somali Latin - -
sq Albanian Latin - -
sr Serbian Cyrillic - Yes
sr Latin - -
su Sundanese Latin - -
sv Swedish Latin Yes -
sw Swahili Latin - -
ta Tamil Tamil - -
te Telugu Telugu - -
tg Tajik Arabic - -
tg Cyrillic - -
tg Latin - -
tk Turkmen Arabic - -
tk Cyrillic - -
tk Latin - -
tl Tagalog Latin - -
tl Tagalog - -
tr Turkish Arabic - -
tr Latin - -
tt Tatar Cyrillic - -
ty Tahitian Latin - -
ug Uighur Arabic - -
ug Cyrillic - -
ug Latin - -
ug Uyghur - -
uk Ukrainian Cyrillic Yes -
ur Urdu Arabic - -
uz Uzbek Cyrillic - -
uz Latin - -
vi Vietnamese Latin - -
vo Volapuk Latin - -
wa Walloon Latin - Incomplete
yi Yiddish Hebrew - -
yo Yoruba Latin - -
zu Zulu Latin - -
Languages in Which the Exact Script Used Is Unknown
===================================================
Aspell can most likely support any of the following languages; however,
I am unsure what script they are written in. Most of them are probably
written in Latin but I am not sure. If you have any information about
these languages please email me at <address@hidden>.
Code Language Name
aa Afar
ak Akan
av Avaric
bi Bislama
bm Bambara
cu Old Slavonic
ee Ewe
ff Fulah
ho Hiri Motu
ht Haitian Creole
hz Herero
ie Interlingue
ig Igbo
ii Sichuan Yi
ik Inupiaq
kg Kongo
ki Kikuyu/Gikuyu
kj Kwanyama
lb Luxembourgish
lg Ganda
li Limburgan
ln Lingala
lu Luba-Katanga
mg Malagasy
mh Marshallese
na Nauru
nd North Ndebele
ng Ndonga
nr South Ndebele
ny Nyanja
rn Rundi
rw Kinyarwanda
sc Sardinian
sg Sango
si Sinhalese
sm Samoan
ss Swati
st Southern Sotho
tn Tswana
to Tonga
ts Tsonga
tw Twi
ve Venda
wo Wolof
xh Xhosa
za Zhuang
The Ethiopic Script
===================
Even though the Ethiopic script has more than 220 distinct characters,
with a little work Aspell can still handle it. The idea is to split
each character into two parts based on the matrix representation. The
first 3 bits will be the first part and could be mapped to `10000???'.
The next 6 bits will be the second part and could be mapped to
`11??????'. The combined character will then be mapped with the upper
bits coming first. Thus each Ethiopic syllable will have the form
`11?????? 10000???'. By mapping the first and second parts to separate
8-bit characters it is easy to tell which part represents the consonant
and which part represents the vowel of the syllable. This encoding of
the syllables is far more useful to Aspell than if they were stored in
UTF-8 or UTF-16. In fact, the existing suggestion strategy of Aspell
will work well with this encoding without any additional
modifications. However, additional improvements may be possible by
taking advantage of the consonant-vowel structure of this encoding.
In fact, the split consonant-vowel representation may prove to be so
useful that it may be beneficial to encode other syllabaries in this
fashion, even if they have fewer than 220 characters.
The code to break a syllable up into its consonant and vowel parts does
not exist as of Aspell 0.60. However, it will be fairly easy to add
as part of the Unicode normalization process once that is written.
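A sketch of the split described above (not shipping Aspell code; the
byte patterns follow the text, and the only added assumption is that
the input code points lie in Unicode's Ethiopic block, which starts at
U+1200 and arranges each consonant as a row of up to eight vowel orders):

```python
ETHIOPIC_BASE = 0x1200   # start of the Unicode Ethiopic block

def split_syllable(cp: int) -> bytes:
    """Encode one Ethiopic syllable as two internal 8-bit characters."""
    offset = cp - ETHIOPIC_BASE        # fits in 9 bits for U+1200..U+137F
    consonant = 0xC0 | (offset >> 3)   # upper 6 bits -> 11??????
    vowel = 0x80 | (offset & 0x07)     # lower 3 bits -> 10000???
    return bytes([consonant, vowel])   # upper bits come first

# Syllables in the same 8-wide row share a consonant byte; the low
# 3 bits of the offset select the vowel order within the row.
assert split_syllable(0x1200) == bytes([0xC0, 0x80])
assert split_syllable(0x1207) == bytes([0xC0, 0x87])
```

Any algorithm that compares the consonant bytes alone can then treat
two forms of the same consonant as near matches, which is the structure
the suggestion strategy could exploit.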
The Thai Script
===============
The Thai script presents a different problem for Aspell. The problem
is not that there are more than 220 unique symbols, but that there are
no spaces between words. This means there is no easy way to split
a sentence into individual words. However, it is still possible to
spell check Thai; it is just a lot more difficult. I will be happy to
work with someone who is interested in adding Thai support to Aspell,
but it is not likely something I will do in the foreseeable future.
Languages which use Hànzi Characters
====================================
Hànzi characters are used to write Chinese, Japanese, and Korean, and were
once used to write Vietnamese. Each hànzi character represents a
syllable of a spoken word and also has a meaning. Since there are
around 3,000 of them in common usage, it is unlikely that Aspell will
ever be able to support spell checking languages written using hànzi.
However, I am not even sure these languages need spell checking, since
hànzi characters are generally not entered directly. Furthermore,
even if Aspell could spell check hànzi, the existing suggestion strategy
would not work well at all, and thus a completely new strategy would need
to be developed.
Japanese
========
Modern Japanese is written in a mixture of hiragana, katakana, kanji,
and sometimes romaji. Hiragana and katakana are both syllabaries unique
to Japan, kanji is a modified form of hànzi, and romaji uses the Latin
alphabet. With some work, Aspell should be able to check the non-kanji
parts of Japanese text. However, based on my limited understanding of
Japanese, hiragana is often used at the end of kanji. Thus, if Aspell
were simply to separate out the hiragana from the kanji, it would end up
with a lot of word endings that are not proper words and would thus be
flagged as misspellings.
Languages Written in Multiple Scripts
=====================================
With some work, Aspell should be able to check text written in the same
language but in multiple scripts. If the number of unique symbols in
both scripts is fewer than 220, then a special character set can be used
to allow both scripts to be encoded in the same dictionary. However,
this may not be the most efficient solution. An alternative solution is
to store each script in its own dictionary and allow Aspell to choose
the correct dictionary based on which script the given word is written
in. Aspell does not currently support this mode of spell checking;
however, it is something I hope to eventually support.
--
http://kevin.atkinson.dhs.org