Re: [aspell-devel] Thoughts on using aspell for Indian language ing

From: Ethan Bradford
Subject: Re: [aspell-devel] Thoughts on using aspell for Indian language ing
Date: Mon, 13 Nov 2006 09:37:15 -0800

Kevin, one small fact on Indic graphology that may help: consonants have an intrinsic vowel (an "a"), so "ka" is one Unicode character.  There are then combining vowels, so to write "ko", you use "ka" plus the combining "o".  To get pairs of consonants, you need to suppress the inherent vowel, which is what the halant does.  Thus, "kra" is "ka + drop-the-a + ra".
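
The composition described above can be sketched in a few lines of Python. This is only an illustration: the constant names are made up for this example, and the code points are the standard Devanagari ones (Unicode calls the halant "VIRAMA").

```python
import unicodedata

# Illustrative names; the code points are standard Devanagari.
KA = "\u0915"            # consonant "ka", inherent "a" included
RA = "\u0930"            # consonant "ra"
VOWEL_SIGN_O = "\u094B"  # combining vowel sign "o"
HALANT = "\u094D"        # suppresses the inherent vowel

ko = KA + VOWEL_SIGN_O   # "ka" plus the combining "o"
kra = KA + HALANT + RA   # "ka" + drop-the-a + "ra"

for label, text in (("ka", KA), ("ko", ko), ("kra", kra)):
    names = [unicodedata.name(ch) for ch in text]
    print(label, text, names)
```

Each printed row lists the Unicode character names that make up the syllable, which makes it easy to see that "kra" is three stored characters even though it displays as one conjunct.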

Gora, how do Hindi keyboards support the entry of halant?  If entering the halant is just another keystroke (so the codes are entered as they are stored in Unicode), then why wouldn't the transposition of a halant with another keystroke be just as likely as any other transposition?  "Teh" makes no sense whatever in English, but I type it often enough.  Or are there separate keystrokes for the half-width (i.e. vowel-less) consonants, which automatically add a halant?  If that's the case, then we want to treat "ka+halant" specially, not "kra".

On 11/13/06, address@hidden <address@hidden> wrote:
On 2:03:08 pm 11/13/06 Kevin Atkinson <address@hidden> wrote:
> On Mon, 13 Nov 2006, address@hidden wrote:
> >  Linguistically, this consists
> >  of the consonant "ca" (U091A), and the conjunct "kra", क्र
> >  (U0915 + U094D + U0930), and the stored code point sequence
> >  would be U091A U0915 U094D U0930.
> So how many "letters"?  Is that 3 or 4?  Is U094D considered a
> "letter"?

That is 4 letters, including the initial consonant that is separate.
The conjunct itself is three.
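
As a quick check of the count, here is a small Python sketch (variable names are only illustrative). It also separates code points from UTF-8 bytes, since the word above is four code points but each Devanagari code point takes three bytes in UTF-8.

```python
# "ca" followed by the conjunct "kra" (ka + halant + ra)
word = "\u091A\u0915\u094D\u0930"

code_points = [f"U+{ord(ch):04X}" for ch in word]
utf8_bytes = word.encode("utf-8")

print(code_points)      # the individual code points
print(len(word))        # number of code points
print(len(utf8_bytes))  # number of bytes when stored as UTF-8
```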
> >  Now, any
> >  calculations of edit distance, such as swap, etc., should use the
> >  consonant "ca" and the conjunct "kra", not the individual Unicode
> >  characters. If for example, we operated on the individual
> >  characters, a swap might move the "halant" (U094D) ahead of the
> >  "ka" (U0915), making the character sequence U091A U094D U0915
> >  U0930. As the "halant" is what is used to construct conjuncts,
> >  this makes a new conjunct, "chka", च्क (U091A
> >  + U094D + U0915), followed by the consonant "ra", र (U0930).
> >  This is not desirable, as a confusion of spelling would never
> >  arise between "chka" and "kra".
> So it is never the case you might want to substitute a letter in the
> conjunct with another letter?  I assume you would.  I would also
> assume that you would want to consider two conjuncts which are the
> same except for one letter as closer than two completely different
> conjuncts?

Yes, it is desirable to substitute a letter in the conjunct with
another letter, but the above example, where moving the halant changes
the structure of the word, is unlikely to be a common mistake. I have
to think this through further, but maybe an edit distance mechanism
that keeps the position of the halant immutable might be the way to go.
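
To make the halant example concrete, here is a rough Python sketch. The clusters() and swap() helpers are hypothetical, not anything in Aspell, and the grouping rule is a deliberate simplification: a combining mark joins the preceding cluster, and so does any character that follows a halant. It ignores cases such as a word-final halant, ZWJ/ZWNJ, and other scripts' rules.

```python
import unicodedata

HALANT = "\u094D"

def clusters(text):
    """Group code points into rough syllable clusters (simplified rule)."""
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if out and (is_mark or out[-1].endswith(HALANT)):
            out[-1] += ch      # extend the current cluster
        else:
            out.append(ch)     # start a new cluster
    return out

def swap(text, i):
    """Transpose the code points at positions i and i + 1."""
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

original = "\u091A\u0915\u094D\u0930"  # ca + kra
swapped = swap(original, 1)            # halant moves ahead of ka

print(clusters(original))  # "ca", then the conjunct "kra"
print(clusters(swapped))   # the conjunct "chka", then "ra"
```

A single code point transposition therefore changes both clusters at once, which is the structural change described above. An edit distance that keeps the halant fixed, or that works on clusters, would not generate this candidate.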

> Also how likely is it that the user will swap two glyphs?

Not very likely as a typing error. However, it is quite likely that one
syllable might be substituted mentally for another while thinking about
what to write.

> Also if you ever want to implement any sort of true soundslike I
> would think you would want to work with letters not syllables.

I will need more advice from you on this, but I would have thought
that syllables are better to work with, especially as most Indian
languages are spelt phonetically.

> >   Hope this makes more sense. I will come up with a more detailed
> >  write-up including a description of conjuncts, and why one should
> >  use syllables, rather than characters, as the basic units for
> >  Indian language spellchecking. Some of these issues, maybe most of
> >  them, can be made up for by appropriate soundslike rules. I really
> >  should try out some quantitative tests first.
> Possible, but you really need a "looks like" rather than a
> "soundslike". I agree that if you want to uniquely represent each
> syllable you may run out of symbols to use.
> However, it may be better to just use a syllable-aware edit distance.

That is a very good suggestion, and I have to try it out.
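
One possible starting point for such a test, as a sketch only: a plain Levenshtein distance run over syllable clusters instead of code points. Nothing here is Aspell code. The clusters() helper is a hypothetical, simplified segmenter (marks and post-halant characters join the previous cluster), and all edits cost 1.

```python
import unicodedata

HALANT = "\u094D"

def clusters(text):
    """Rough syllable clusters: marks and post-halant characters attach."""
    out = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if out and (is_mark or out[-1].endswith(HALANT)):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def edit_distance(a, b):
    """Unit-cost Levenshtein distance over any two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

kra = "\u0915\u094D\u0930"  # ka + halant + ra
pta = "\u092A\u094D\u0924"  # pa + halant + ta

by_code_point = edit_distance(kra, pta)
by_cluster = edit_distance(clusters(kra), clusters(pta))
print(by_code_point, by_cluster)
```

With unit costs, any two different conjuncts are one substitution apart at the cluster level. To make conjuncts that differ in a single letter count as closer than unrelated ones, the substitution cost could instead be derived from a code point distance between the two clusters; that refinement is left out here.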

> I now understand the issue.  However, I think that the fact that
> Aspell is 8-bit internally is a very small factor.  Converting Aspell
> to be 16-bit internally will not magically fix this issue.  I don't
> even think it will make it significantly easier to solve.

Yes, the 8-bit size is not so much the issue. It is more that if the
internal representation were Unicode, it would be easier to use
existing libraries to parse syllables. However, a workaround is
probably not too difficult.

> I do believe to truly handle this situation well some modifications
> will need to be made to Aspell.  I suggest you start studying
> readonly_ws.cpp and suggest.cpp.  A while ago I wrote some docs on
> how Aspell works:
> 05-09/msg00007.html
> l
> which may be helpful.

Thanks. These look useful.

> I will get back to you later with some ideas on how to approach this
> issue.  If you already thought of some please share them.

I am realising that linguistically I am probably in over my depth with
Hindi. However, we are meeting this Sat., along with some literary
Hindi folk, and I am talking to experts in other Indian languages, to
plan out an approach. I will certainly make these available, probably
on a Wiki page.
  Thanks for all the interest that you have shown in this.

