Subject: Re: [aspell-devel] Thoughts on using aspell for Indian language ing
Date: Mon, 13 Nov 2006 09:37:15 -0800
On 2:03:08 pm 11/13/06 Kevin Atkinson <address@hidden> wrote:
> On Mon, 13 Nov 2006, address@hidden wrote:
> > Linguistically, this consists
> > of the consonant "ca" (U091A), and the conjunct "kra", क्र
> > (U0915 + U094D +
> > U0930), and the stored code point sequence would be U091A U0915 U094D U0930.
> So how many "letters"? Is that 3 or 4? Is U094D considered a
That is 4 letters, including the initial consonant that is separate.
The conjunct itself is three.
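
To make the count concrete, here is a small Python sketch (Python is used only for illustration and has nothing to do with Aspell's internals). It lists the code points of the example word and compares the code point count with the number of bytes in UTF-8. The variable names are my own.

```python
import unicodedata

# ca, followed by the conjunct kra (ka + halant + ra)
word = "\u091A\u0915\u094D\u0930"

# One line per code point: number and official Unicode character name.
names = [unicodedata.name(ch) for ch in word]
for ch, name in zip(word, names):
    print(f"U+{ord(ch):04X}  {name}")

code_points = len(word)                 # 4 code points
utf8_bytes = len(word.encode("utf-8"))  # 12 bytes, 3 per code point
print(code_points, utf8_bytes)
```

Note that Unicode names the halant "DEVANAGARI SIGN VIRAMA", so it is classed as a sign rather than a letter, even though it occupies a position in the sequence like the other three.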
> > Now, any
> > calculations of edit distance, such as swap, etc., should use the
> > consonant "ca" and the conjunct "kra", not the individual Unicode
> > characters. If for example, we operated on the individual
> > characters, a swap might move the "halant" (U094D) ahead of the
> > "ka" (U0915), making the character sequence U091A U094D U0915
> > U0930. As the "halant" is what is used to construct conjuncts,
> > this makes a new conjunct, "chka", च्क (U091A
> > + U094D + U0915), followed by the consonant "ra", र (U0930).
> > This is not desirable, as a confusion of spelling would never
> > arise between "chka" and "kra".
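
The swap described above can be reproduced in a few lines of Python. This is only an illustration of the problem, not of anything Aspell does. The helper `transpose` and the hand-written segmentation in `units` are assumptions made for the example.

```python
word = "\u091A\u0915\u094D\u0930"  # ca, then kra (ka + halant + ra)

def transpose(items, i):
    """Return a copy of items with positions i and i+1 swapped."""
    items = list(items)
    items[i], items[i + 1] = items[i + 1], items[i]
    return items

# Swap on raw code points: ka and the halant trade places,
# giving U091A U094D U0915 U0930, i.e. "chka" followed by "ra".
by_code_point = "".join(transpose(word, 1))

# Swap on linguistic units: the conjunct moves as a whole and stays intact.
units = ["\u091A", "\u0915\u094D\u0930"]  # segmentation written out by hand
by_unit = "".join(transpose(units, 0))

print([f"U+{ord(c):04X}" for c in by_code_point])
print([f"U+{ord(c):04X}" for c in by_unit])
```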
> So it is never the case you might want to substitute a letter in the
> conjunct with another letter? I assume you would. I would also
> assume that you would want to consider two conjuncts which are the
> same except for one letter as closer than two completely different
Yes, it is desirable to substitute a letter in the conjunct with
another letter, but the above example, where moving the halant changes
the structure of the word, is not a likely mistake. I have
to think this through further, but maybe an edit distance mechanism
that keeps the position of the halant immutable might be the way to go.
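
As a rough sketch of that idea (untested against real data, and not based on anything in Aspell), a substitution cost between two units could treat the halant positions as fixed and count only the letters that differ. The function names and the cost values below are assumptions chosen for illustration.

```python
HALANT = "\u094D"

def halant_positions(unit):
    """Indexes at which the halant occurs inside a unit."""
    return [i for i, ch in enumerate(unit) if ch == HALANT]

def substitution_cost(a, b):
    """Cost of replacing unit a with unit b, between 0.0 and 1.0.

    Units whose halants sit at different positions (or whose lengths
    differ) are treated as unrelated and get the full cost. Otherwise
    the cost is the fraction of non-halant letters that differ.
    """
    if a == b:
        return 0.0
    if len(a) != len(b) or halant_positions(a) != halant_positions(b):
        return 1.0
    letters = [i for i, ch in enumerate(a) if ch != HALANT]
    differing = sum(1 for i in letters if a[i] != b[i])
    return differing / len(letters)

kra = "\u0915\u094D\u0930"   # ka + halant + ra
tra = "\u0924\u094D\u0930"   # ta + halant + ra
chka = "\u091A\u094D\u0915"  # ca + halant + ka

print(substitution_cost(kra, tra))   # 0.5: one of two letters differs
print(substitution_cost(kra, chka))  # 1.0: both letters differ
```

With a cost like this, two conjuncts that differ in one letter come out closer than two completely different ones, while the halant itself is never moved.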
> Also how likely is it that the user will swap two glyphs?
Not very likely as a typing error. However, it is quite likely that one
syllable might be substituted mentally for another while thinking about
what to write.
> Also if you ever want to implement any sort of true soundslike I
> would think you would want to work with letters not syllables.
I will need more advice from you on this, but I would have thought
that syllables are better to work with, especially as most Indian
languages are spelt phonetically.
> > Hope this makes more sense. I will come up with a more detailed
> > write-up including a description of conjuncts, and why one should
> > use syllables, rather than characters, as the basic units for
> > Indian language spellchecking. Some of these issues, maybe most of
> > them, can be made up for by appropriate soundslike rules. I really
> > should try out some quantitative tests first.
> Possible but you really need a "looks like" rather than a
> "soundslike". I agree if you want to uniquely represent each syllable
> you may run out of symbols to use.
> However, it may be better to just use a syllable-aware edit distance.
That is a very good suggestion, and I have to try it out.
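
A first experiment could look something like the Python sketch below. It is a simplification under stated assumptions, not a proposal for Aspell's code: `split_units` is a hypothetical helper that attaches combining marks (including the halant) and any letter following a halant to the current unit, and it ignores ZWJ/ZWNJ and other details a real segmenter would need. The distance is the standard optimal-string-alignment variant of Damerau-Levenshtein, applied to units instead of characters.

```python
import unicodedata

HALANT = "\u094D"

def split_units(word):
    """Very rough syllable-like segmentation of a Devanagari string."""
    units = []
    prev = ""
    for ch in word:
        # Combining marks and letters right after a halant extend the unit.
        joins = unicodedata.category(ch) in ("Mn", "Mc") or prev == HALANT
        if units and joins:
            units[-1] += ch
        else:
            units.append(ch)
        prev = ch
    return units

def unit_edit_distance(a, b):
    """Edit distance (insert, delete, substitute, adjacent swap) over units."""
    s, t = split_units(a), split_units(b)
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[len(s)][len(t)]

cakra = "\u091A\u0915\u094D\u0930"    # ca + kra
chkara = "\u091A\u094D\u0915\u0930"   # chka + ra (halant moved)
print(split_units(cakra))
print(unit_edit_distance(cakra, chkara))
```

At the code point level, the second word is a single adjacent swap away from the first. At the unit level, both units differ, so the distance is 2, which matches the intuition that "chka" and "kra" are not easily confused.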
> I now understand the issue. However, I think that the fact that
> Aspell is 8-bit internally is a very small factor. Converting Aspell
> to be 16-bit internally will not magically fix this issue. I don't
> even think it will make it significantly easier to solve.
Yes, the 8-bit size is not so much the issue. It is more that if the
internal representation were Unicode, it would be easier to use
existing libraries to parse syllables. However, a workaround is
probably not too difficult.
> I do believe to truly handle this situation well some modifications
> will need to be made to Aspell. I suggest you start studying
> readonly_ws.cpp and suggest.cpp. A while ago I wrote some docs on
> how Aspell works: http://lists.gnu.org/archive/html/aspell-devel/20
> which may be helpful.
Thanks. These look useful.
> I will get back to you later with some ideas on how to approach this
> issue. If you have already thought of some, please share them.
I am realising that linguistically I am probably out of my depth with
Hindi. However, we are meeting this Sat., along with some literary
Hindi folk, and I am talking to experts in other Indian languages, to
plan out an approach. I will certainly make these available, probably
on a Wiki page.
Thanks for all the interest that you have shown in this.
Aspell-devel mailing list