silpa-discuss

Re: [silpa-discuss] Machine transliteration for Indic languages


From: Santhosh Thottingal
Subject: Re: [silpa-discuss] Machine transliteration for Indic languages
Date: Mon, 14 Mar 2016 14:34:33 +0530

(edited subject)

Irshad, I read the document. You have written the project concept very well. I think the project has a lot of potential use cases.

Do you have any mentor in mind for the project? How about https://researchweb.iiit.ac.in/~riyaz.bhat/ ?

Santhosh


On Wed, Mar 9, 2016 at 3:42 PM, Irshad Ahmad <address@hidden> wrote:
Hello Sir,

Please find attached my Project Idea for the Transliteration Module of Libindic.

I agree that a mapping-table approach alone would not suffice. I've proposed two approaches in my project idea. First, as you have already mentioned, we need another set of rules to take care of special language characteristics. Second, rather than developing exhaustive rules, we use a machine learning (ML) approach. Machine learning algorithms give computers the ability to capture such patterns automatically, without being explicitly programmed.

I suggest we keep both a rule-based and an ML system for Indic-Indic transliteration, and only an ML system for Indic-Roman transliteration. The rest of the details are in the attached PDF.

I hope you like the idea.

Thanks
--
Irshad Ahmad


----- Original Message -----
From: "Santhosh Thottingal" <address@hidden>
To: "Irshad Ahmad" <address@hidden>
Cc: "silpa-discuss" <address@hidden>, "Riyaz Ahmed" <address@hidden>
Sent: Saturday, March 5, 2016 5:10:54 PM
Subject: Re: [silpa-discuss] (no subject)

Thanks Irshad for introducing your work.

> echo 'आम आदमी से आजादी आज भी कोसों दूर है' | converter-indic --l hin |
> converter-indic --l mal --s wx
> ആമ ആദമീ സേ ആജാദീ ആജ ഭീ കോസോം ദൂര ഹൈ
[.. and other examples ..]

This example illustrates one key challenge in transliteration. The output is
wrong, even though it is correct if you consider only the letter-by-letter
transliteration. The Hindi word आम should be ആം in Malayalam; ആമ means
tortoise. This difference is because of
https://en.wikipedia.org/wiki/Schwa_deletion_in_Indo-Aryan_languages
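
To make the difference concrete, here is a minimal Python sketch. It is not
how converter-indic or libindic works; it only assumes that the Devanagari
and Malayalam Unicode blocks line up closely enough that shifting code points
gives a letter-by-letter mapping for common letters. The function names and
the single word-final rule are hypothetical, and the rule covers only this
one word shape, not schwa deletion in general.

```python
# Devanagari occupies U+0900..U+097F and Malayalam U+0D00..U+0D7F.
# Assumption: most letters sit at the same position inside each block,
# so a constant shift works for common letters. A real table needs
# exceptions (for example, punctuation such as the danda).
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F
MALAYALAM_OFFSET = 0x0D00 - 0x0900

DEVANAGARI_MA = "\u092E"       # म
MALAYALAM_ANUSVARA = "\u0D02"  # ം


def map_letters(word: str) -> str:
    """Letter-by-letter mapping: shift each Devanagari code point."""
    out = []
    for ch in word:
        cp = ord(ch)
        if DEVANAGARI_START <= cp <= DEVANAGARI_END:
            out.append(chr(cp + MALAYALAM_OFFSET))
        else:
            out.append(ch)  # leave spaces and other characters untouched
    return "".join(out)


def transliterate_word(word: str) -> str:
    """Mapping plus ONE illustrative rule (hypothetical, not exhaustive).

    A word-final bare म loses its inherent vowel in Hindi, so it is
    written here as the Malayalam anusvara instead of the full letter മ.
    """
    if len(word) > 1 and word.endswith(DEVANAGARI_MA):
        return map_letters(word[:-1]) + MALAYALAM_ANUSVARA
    return map_letters(word)


if __name__ == "__main__":
    hindi = "\u0906\u092E"  # आम
    print(map_letters(hindi))         # letter-by-letter result
    print(transliterate_word(hindi))  # result with the extra rule
```

For आम the plain mapping gives ആമ, while the version with the extra rule
gives ആം. Every other language feature would need its own rule of this kind.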

So, along with a mapping table approach, we need another set of rules to
take care of these special language characteristics. Tamil has fewer
consonants, so one Tamil consonant can map to more than one consonant in
other Indic languages depending on the context. You will face the same
difference in the other direction, while converting to Tamil. Malayalam has
chillu letters - the vowel-less forms of consonants. There is a long list of
such language features. I believe this category can be rule based.
There is another set of characteristics that cannot be rule based. In the
past years, people using the existing transliteration library in libindic
mailed me asking about name transliterations, especially from English. Names
are one specific set, but this can be generalized to any noun. Spellings such
as pradeep, prathip, pratheep, pradip, pradeeb, pratib, prateep should all
transliterate to the same form in Indic languages. This is also a case where
you lose the one-to-one correspondence of letters and mapping rules fail. I
think you have already thought about using machine learning to get this part
done. I think solving this and making the transliteration library smart
enough is a good project. I can think of various use cases.
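
As a rough illustration of why this is hard to do with rules, the Python
sketch below collapses the spellings listed above to one key using a few
hand-written substitutions. The rules and the function name are hypothetical
and were chosen only to fit this one name. They would wrongly merge unrelated
words (any d becomes t, any b becomes p), which is the kind of brittleness
that makes a learned model attractive here.

```python
# Hypothetical substitutions, applied in order. They are tuned to the
# spellings from this mail only and are NOT a general solution.
COLLAPSE_RULES = [
    ("th", "t"),  # aspirated and plain spellings treated alike
    ("ee", "i"),  # long-vowel spelling folded into the short one
    ("d", "t"),   # voiced/unvoiced confusion in romanized names
    ("b", "p"),
]


def collapse_key(name: str) -> str:
    """Return a crude key that is identical for the listed spellings."""
    key = name.lower()
    for old, new in COLLAPSE_RULES:
        key = key.replace(old, new)
    return key


if __name__ == "__main__":
    variants = [
        "pradeep", "prathip", "pratheep", "pradip",
        "pradeeb", "pratib", "prateep",
    ]
    for name in variants:
        print(name, "->", collapse_key(name))
```

All seven spellings end up with the same key, but only because the rules were
written after looking at the list.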

I must add that I have been a bit disconnected from this project and library
for many months, or even a couple of years, because of my busy job and other
pet projects. So I might be unaware of some progress made in this area by
researchers or developers. I also want to make clear that I am not
committing to mentoring this, unless you really impress me and cannot find
anybody else :)

I would suggest that you write down your project idea, including the
expected outcome, with brief notes on the existing tools and their
limitations (this will help you understand what is really missing, instead
of doing something for the sake of doing it). If you do this exercise you
will get a better understanding of the planning, timeline and challenges.
Whether or not you get this into GSoC, it will help you to realize the
project by other means.

Santhosh



--
Santhosh Thottingal
http://thottingal.in
