Hi, Mohammed et al. Gokalp Yapici and I are also working on Arabic support for Aspell. I thought we could share our plans to see if anybody wants to offer us helpful feedback.
For character-set data, we started with the Farsi implementation in Aspell, which uses UTF-8 as the word-list encoding and Windows Arabic as the internal encoding.
For a word list, our plan is to use the data from Buckwalter's Arabic morphological analyzer -- the same data used in the Duali attempt at Arabic spell checking. This data has a complex specification of the structure of an Arabic word, which we'll need to translate into the simpler format required by Aspell.
In Buckwalter's format, each stem, prefix, or suffix is a member of a stem, prefix, or suffix class. Three auxiliary files specify which prefix classes can connect to which stem classes; which stem classes can connect to which suffix classes; and which prefix classes are compatible with which suffix classes.
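To make the three-way check concrete, here is a minimal sketch in Python (not the Perl we plan to use). It assumes each auxiliary file has already been loaded into a set of class-name pairs; the table names, the function name, and the class names in the sample data are all hypothetical placeholders, not taken from Buckwalter's data.

```python
# Sketch only: table and class names are made up for illustration.

def is_valid(prefix_cls, stem_cls, suffix_cls, table_ab, table_bc, table_ac):
    """Return True if the prefix/stem/suffix class triple passes all three tables."""
    return (
        (prefix_cls, stem_cls) in table_ab        # prefix class can attach to stem class
        and (stem_cls, suffix_cls) in table_bc    # stem class can attach to suffix class
        and (prefix_cls, suffix_cls) in table_ac  # prefix and suffix classes are compatible
    )

# Hypothetical sample tables, one set of pairs per auxiliary file.
table_ab = {("P1", "N"), ("P2", "N")}
table_bc = {("N", "S1"), ("N", "S2")}
table_ac = {("P1", "S1"), ("P2", "S1"), ("P2", "S2")}

ok = is_valid("P2", "N", "S2", table_ab, table_bc, table_ac)
blocked = is_valid("P1", "N", "S2", table_ab, table_bc, table_ac)
```

In the sample data, P1 and S2 can each attach to the stem class N on their own, but the pair is rejected because the third table does not list it. That third check is the part with no direct equivalent in a simple affix-flag scheme.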
If it weren't for that last file, this would be an easy problem: it would just be a matter of translating code names. Instead, we'll write Perl scripts to recognize the easy translations (when no prefix/suffix combination is allowed, or all combinations are allowed), and do the easy thing. For the harder combinations (where some of the prefixes go with some of the suffixes) we'll expand out the prefixes or the suffixes (whichever there are fewer of), combining them with the stems as new "stem" entries.
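A rough sketch of that strategy, again in Python rather than Perl, might look like the following. It is deliberately simplified: it assumes each affix class has a single surface string, uses Latin placeholder strings instead of Arabic, and all function, table, and class names are hypothetical.

```python
# Sketch only: one surface string per affix class, made-up names throughout.

def classify(prefix_classes, suffix_classes, table_ac):
    """Label a stem's affix sets as 'none', 'all', or 'mixed' by prefix/suffix compatibility."""
    pairs = [(p, s) for p in prefix_classes for s in suffix_classes]
    allowed = [pair for pair in pairs if pair in table_ac]
    if not allowed:
        return "none"    # easy case: no prefix may co-occur with any suffix
    if len(allowed) == len(pairs):
        return "all"     # easy case: every prefix may co-occur with every suffix
    return "mixed"       # hard case: only some combinations are allowed

def expand_mixed(stem, prefixes, suffixes, table_ac):
    """Fold the smaller affix set into the stem.

    prefixes and suffixes map class name -> surface string. Returns a list of
    (new_stem, remaining_affix_classes) pairs, where the remaining classes are
    those on the other side that are compatible with the folded-in affix.
    """
    if len(prefixes) <= len(suffixes):
        return [
            (prefixes[p] + stem, sorted(s for s in suffixes if (p, s) in table_ac))
            for p in sorted(prefixes)
        ]
    return [
        (stem + suffixes[s], sorted(p for p in prefixes if (p, s) in table_ac))
        for s in sorted(suffixes)
    ]

# Hypothetical sample data.
table_ac = {("P1", "S1"), ("P2", "S1"), ("P2", "S2")}
prefixes = {"P1": "a", "P2": "b"}
suffixes = {"S1": "x", "S2": "y", "S3": "z"}

kind = classify(prefixes, suffixes, table_ac)
new_stems = expand_mixed("stem", prefixes, suffixes, table_ac)
```

Here there are fewer prefixes than suffixes, so the prefixes are folded in: the result is two new "stem" entries, each carrying only the suffix classes compatible with its prefix. The unprefixed and unsuffixed forms would still need their own entries, which this sketch leaves out.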
There are a total of 170 affix (suffix and prefix) classes to start with. With the new classes we're creating, we'll probably exceed the number of class codes Aspell allows (they're limited to 255). If the overflow is severe, I'll see whether we can get Aspell updated to allow more classes. If it's minor, we'll just explicitly expand out the combinations that add the fewest new entries to the stem list.
What are some of the issues we haven't thought of? Any feedback is welcome!