Re: [aspell-devel] Thoughts on using aspell for Indian language checking

From: gora
Subject: Re: [aspell-devel] Thoughts on using aspell for Indian language checking
Date: Mon, 13 Nov 2006 09:19:10 +0100

On 6:32:36 am 11/13/06 Kevin Atkinson <address@hidden> wrote:
> On Sun, 12 Nov 2006, address@hidden wrote:
> >  That is what I am doing at present, and there is no real problem
> >  with it. The only advantage to a C++ interface would be that it is
> >  then easier to ensure consistency of classes across different
> >  object-oriented programming languages. I guess that it would still
> >  be OK to wrap a C++ class around the C interface, and then use
> >  that for the SWIG bindings.
> I really don't see a point in adding this extra layer of indirection
> just to "ensure consistency of classes across different
> object-oriented programming languages".

Well, the idea was that a class-based interface is more natural in
languages that support object orientation. With the low-level C API,
I would have to write separate class wrappers in each of the languages,
which makes it a pain if additional functionality becomes needed. If
I use SWIG with a C++ class, I believe that it builds proper class
interfaces for other languages, though I have to check this more
thoroughly. As an aside, bindings to the C interface are also available
in each language, as that makes it quite easy to translate an existing C
program using aspell.

> >  The problem there would be that both the base characters, and the
> >  syllables are needed, and the total number of these might be more
> >  than 256 in many Indian languages.
> I have done a systematic survey of all languages and the conclusion is
> that any written language not based on hanzi (Chinese, Japanese,
> Korean) will fit in an 8-bit character set.  See
>  If I am missing something
> let me know.

The base characters themselves certainly fit. However, if one wishes to
operate on syllables (made by combining consonants in the base
character set), the number of these syllables can exceed 256.
Here is a short example of just one of the issues that come up when
treating characters, rather than syllables, as the base unit in Hindi.
Take, for example, the conjunct "kra", क्र. This is represented
linguistically, and in Unicode, as क + ् + र (U+0915 + U+094D + U+0930).
It makes no sense to swap the "halant" (U+094D) with the "ka" or the
"ra", as that creates a completely different conjunct, and is not a
mistake that would typically be made. As you suggest, I could just
include "kra" in the encoding, but, in many Indian languages, the
256 available slots are not sufficient for all such conjuncts.
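
To make the point concrete, here is a minimal Python sketch (not aspell
code) that compares adjacent swaps on raw code points with adjacent
swaps on syllable-like units. The helpers units() and adjacent_swaps()
are hypothetical names made up for this illustration, and the grouping
rule is a simplifying assumption, not a complete Indic syllable
segmenter. The last lines give a rough upper bound on two-consonant
conjuncts, assuming the consonant range U+0915 to U+0939 of the Unicode
Devanagari block; not every pair occurs in practice.

```python
import unicodedata

HALANT = "\u094D"  # DEVANAGARI SIGN VIRAMA, the "halant" discussed above


def units(word):
    """Group code points into rough syllable-like units.

    Assumption (not a full Indic segmenter): a combining mark joins the
    unit before it, and so does any code point that follows a halant.
    """
    out = []
    for ch in word:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc")
        if out and (is_mark or out[-1].endswith(HALANT)):
            out[-1] += ch
        else:
            out.append(ch)
    return out


def adjacent_swaps(seq):
    """Return every string obtained by swapping two neighbouring items."""
    results = []
    for i in range(len(seq) - 1):
        swapped = list(seq)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        results.append("".join(swapped))
    return results


word = "\u0915\u094D\u0930\u092E"  # ka + halant + ra + ma ("krama")

# Swapping raw code points also moves the halant around.
by_code_point = adjacent_swaps(list(word))

# Swapping whole units keeps the conjunct "kra" intact.
by_unit = adjacent_swaps(units(word))

# Rough upper bound: every ordered pair of consonants in U+0915..U+0939.
consonants = [chr(cp) for cp in range(0x0915, 0x093A)]
max_two_consonant_conjuncts = len(consonants) ** 2
```

With code points as the base unit, the swap candidates include ones
that detach the halant from its consonant; with units as the base,
only whole-syllable transpositions are generated. The upper bound is
deliberately crude, but it shows why a 256-slot encoding gets tight
once conjuncts are given slots of their own.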

> BTW: I assume you know that there is a very basic Hindi
> dictionary available for Aspell.

Yes, I do. Most of the existing Indian language aspell dictionaries
are there because I have pushed people into producing them. They are
still quite inadequate, though. We have added various aspell rules to
the Hindi dictionary, and are in the process of testing their effect.

