Re: [silpa-discuss] Fwd: Re: Regarding GSoC in Indic

Hi,

Since this particular problem of the Sandhi splitter has remained unsolved for many years, I recently assessed the problem in a wider scope and discussed it with computational linguistics experts working on other languages.

Sandhi splitting means finding the possible splits of an agglutinated or inflected word. We will get lost in this problem unless we think about the real-world problems we want to solve with this kind of tool. Since our primary objective is not academic but application software that is usable in solving real-world use cases, we need to think in that way.

Given a word, finding out what constitutes it and identifying all of its semantic units is required in every application that has to interpret the word for semantic purposes, such as translation, grammar and spelling contexts.

The reverse process of this is also equally important for the same set of use cases. Given a set of semantic units in a language, construct a valid compound or inflected word. For example, apply negation to this verb, combine this adjective and noun etc.

I don't see this part given importance in the academic work related to Sandhi splitters (at least in the papers I have read).

Now, coming back to the splitting process: what output do we need from splitting? Is it just the string fragments? Or the stems of those fragments? Do we need to know the POS of each fragment of the compound word? Do we need some disambiguation process? Also, can the split be lossless? Meaning, given that a word W gives w1, w2, w3 as output, is it possible to get W back in a later process by using just w1, w2, w3? IMO, we need all of these features in place.
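To make these questions concrete, here is a minimal sketch of what such an output could look like. It is only an illustration under assumptions: the names `Segment`, `rejoin` and `is_lossless`, the tag names, and the English example are all hypothetical and not taken from any existing splitter. Each fragment keeps its surface form, its stem and a POS tag, and the split counts as lossless when W can be rebuilt from the output alone.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    surface: str  # exact substring as it appears inside the word
    stem: str     # dictionary form of the fragment
    pos: str      # part-of-speech tag (tag names here are made up)


def rejoin(segments: List[Segment]) -> str:
    """Rebuild the original word from the stored surface forms."""
    return "".join(seg.surface for seg in segments)


def is_lossless(word: str, segments: List[Segment]) -> bool:
    """A split is lossless if the word can be recovered from its output alone."""
    return rejoin(segments) == word


# Hypothetical hand-written analysis of an English word, for illustration only.
EXAMPLE = [
    Segment(surface="un", stem="un", pos="PREFIX"),
    Segment(surface="happi", stem="happy", pos="ADJ"),
    Segment(surface="ness", stem="ness", pos="SUFFIX"),
]
```

Keeping only the stems ("un", "happy", "ness") would not be enough to get the word back by concatenation, which is why the sketch stores the surface form next to each stem.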

I would like to see this problem as a foundation problem for higher-level language processing: morphological analysis. We need to solve it in its totality. A reductionist approach like word splitting is not enough. And morphological analysis is a long, continuous and iterative process.

I am skeptical about using an RNN to solve this kind of fundamental problem, whose nature is close to the rules explained in classical grammar books like Panineeyam (or Keralapanineeyam for Malayalam), of course with a bundle of exceptions. A few months back I tried to tackle the inflection problem using an ML approach and did not see much success (maybe because I don't have enough expertise or lacked data to train on), but I would be happy to see it proven otherwise. Since morphological analysis is a foundation problem and many high-level applications are to be built on top of it, I look for mathematical precision in its results.

During the last GSoC mentor summit, there was a meeting of mentors and students who work on NLP. I raised this problem of Dravidian languages, and one conclusion we reached was that we need a morphology analyser, with Finite State Transducer as the recommended technology. Finnish and Turkish, two languages with complex agglutination and inflection, use that technology and have succeeded with it.
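For readers new to the idea, here is a toy sketch of why a transducer fits both the splitting problem and its reverse: the same set of arcs can be read in one direction to analyse a surface word and in the other to generate it. This is a hand-rolled illustration in plain Python, not how a real analyser is built; the lexicon, the tag names (`+V`, `+PROG`, `+PAST`) and the function names are all made up for the example, and real systems use dedicated FST toolkits with far richer rules.

```python
from typing import List, Set, Tuple

# (source state, lexical side, surface side, target state)
Arc = Tuple[int, str, str, int]

# Toy English lexicon and suffix rules, purely illustrative.
ARCS: List[Arc] = [
    (0, "walk", "walk", 1),
    (0, "text", "text", 1),
    (1, "+V", "", 2),         # the tag has no surface realisation
    (2, "+PROG", "ing", 3),
    (2, "+PAST", "ed", 3),
    (2, "", "", 3),           # bare verb, nothing added
]
FINAL: Set[int] = {3}


def transduce(text: str, invert: bool = False) -> List[str]:
    """Return every output string the arcs allow for the given input string."""
    results: List[str] = []

    def walk(state: int, rest: str, out: str) -> None:
        if not rest and state in FINAL:
            results.append(out)
        for src, lexical, surface, dst in ARCS:
            # Analysis reads the surface side; generation reads the lexical side.
            read, write = (lexical, surface) if invert else (surface, lexical)
            if src == state and rest.startswith(read):
                walk(dst, rest[len(read):], out + write)

    walk(0, text, "")
    return results


def analyse(surface_word: str) -> List[str]:
    return transduce(surface_word)


def generate(lexical_form: str) -> List[str]:
    return transduce(lexical_form, invert=True)
```

With these arcs, `analyse("walking")` yields `["walk+V+PROG"]` and `generate("text+V+PAST")` yields `["texted"]`; a word outside the toy lexicon yields an empty list.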

Since November 2016, I have been experimenting with such a system for Malayalam: https://github.com/santhoshtr/mlmorph So far the results are very encouraging.

Have a look at https://github.com/flammie/omorfi for the Finnish morphology analyser project. It is a massive project, developed over years with support from the University of Helsinki.

I will not stop anybody who wants to try out an RNN; maybe it can also solve the problem. But before attempting it, I would advise looking at the problem more deeply and then thinking about a solution, rather than trying to fit a solution to this problem. Non-Dravidian Indian languages have relatively simple inflection and agglutination; maybe you can pick such a language to start with.

On Sat, Feb 25, 2017 at 9:55 PM Jerin Philip <address@hidden> wrote:

I expect an RNN to be superior, if you have enough data. Deep learning will also require enough samples so that it can capture patterns. I think you can improve on it by using a bidirectional RNN to capture both the forward and backward contexts, since we'll be giving it the entire input at once.

We don't have any annotated data for other languages. Will you be able to get this on your own? The community can help to some extent, I hope.

Also, if you're willing to take up the challenge, can you attempt devising an algorithm to capture the patterns unsupervised from raw text? For example, if you look at words like 'texting' and 'walking', we should easily be able to realize from prose that 'ing' is an ending pattern and *-ing is a split point of sorts.
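As a rough starting point for the 'ing' example, here is a minimal sketch of one possible heuristic (an assumption for illustration, not an established algorithm from this project): treat an ending as a candidate suffix whenever stripping it from a word leaves something that is itself a word in the raw text, and rank endings by how often that happens. The function name, the `max_len` parameter and the sample word list are hypothetical.

```python
from collections import Counter
from typing import Iterable


def candidate_suffixes(words: Iterable[str], max_len: int = 4) -> Counter:
    """Count endings whose removal leaves another word seen in the text."""
    vocab = {w.lower() for w in words}
    counts: Counter = Counter()
    for word in vocab:
        for k in range(1, max_len + 1):
            if len(word) <= k:
                break  # nothing would be left as a stem
            stem, suffix = word[:-k], word[-k:]
            if stem in vocab:
                counts[suffix] += 1
    return counts


# Tiny made-up sample standing in for raw text.
SAMPLE = ["text", "texting", "walk", "walking", "talk", "talking", "walks", "cat"]
COUNTS = candidate_suffixes(SAMPLE)
```

On this sample 'ing' comes out on top with three supporting words. On real prose the heuristic will also pick up noise, so it would need a frequency threshold, and for languages where the stem changes at the join it would miss many cases.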

I also suggest you emphasize the development side of things more in your proposal, since you have to produce a usable API for other developers to build on.

Please send mail to the mailing list starting now, if you want to increase your chances of a faster response.

Best,

On Feb 25, 2017 8:03 PM, "vasudev" <address@hidden> wrote:

Jerin please suggest

---------- Forwarded message ----------
From: Anivar Aravind <address@hidden>
Date: 25-Feb-2017 6:00 PM
Subject: Fwd: Re: Regarding GSoC in Indic
To: Santhosh Thottingal <address@hidden>, Santhosh Thottingal <address@hidden>, Vasudev Kamath <address@hidden>, Jishnu Mohan <address@hidden>
Cc: Anivar Aravind <address@hidden>

Dear Santhosh, Vasudev & Jishnu,

What do you think about the idea proposed?
Please respond to the student.
---------- Forwarded message ----------
From: "Rohan Saxena" <address@hidden>
Date: 25 Feb 2017 5:47 p.m.
Subject: Re: Regarding GSoC in Indic
To: "Akshay S Dinesh" <address@hidden>
Cc: <address@hidden>

Hello Sir,

I did not hear from you regarding my previous mail. I just wanted to know what you think about the idea of using RNNs in sandhi splitting. Would the community be interested in such a project being implemented?

If not, I am also open to the idea of porting the LibIndic code to Python 3.

Thanks,
Rohan Saxena

On Thu, Feb 23, 2017 at 9:28 PM, Rohan Saxena <address@hidden> wrote:
Hello Sir,

I went through the approach used in the present version of Sandhi-splitter (the documentation mentions that this approach is employed).

While it is an interesting algorithm, I think it relies a lot on human judgement (see below on how and why this can be improved). The entire model has been constructed by making certain decisions at crucial stages of the architecture design based on the designers' understanding of the structure and semantics of the dataset. For example, the technique of skipping an initial part of the word to a particular position, and moving equal characters on either side. This is the designers' choice, and does not reflect whether this is one of the best strategies for this task. The model has not arrived at this approach on its own.
(On a side note, we calculate the probability of a substring being at the start or end of a sandhi by coursing through the entire dataset and counting all occurrences of the substring in such a position. This relies strongly on a uniform distribution of our dataset, with respect to the substrings occurring before and after the sandhi.)
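To illustrate the counting described in the side note, here is a minimal sketch. It is my own simplified reading, not the actual Sandhi-splitter code: the function name, the `(word, split_at)` sample format, the fixed context length `k` and the English toy words are all assumptions. It estimates how likely a split is right after a given substring by dividing the number of times the substring ends at an annotated split point by the number of times it occurs at any internal position.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def split_point_probabilities(
    samples: Iterable[Tuple[str, int]], k: int = 2
) -> Dict[str, float]:
    """Estimate P(split follows | preceding k characters) by counting."""
    at_split: Counter = Counter()
    anywhere: Counter = Counter()
    for word, split_at in samples:
        # Visit every internal boundary that has k characters before it.
        for end in range(k, len(word)):
            gram = word[end - k:end]
            anywhere[gram] += 1
            if end == split_at:
                at_split[gram] += 1
    return {gram: at_split[gram] / total for gram, total in anywhere.items()}


# Made-up annotated samples: (word, index where the split occurs).
SAMPLES = [("notebook", 4), ("textbook", 4), ("footnote", 4)]
PROBS = split_point_probabilities(SAMPLES)
```

Here "te" precedes a split in one of its two occurrences, so its estimate is 0.5. The estimate depends entirely on which substrings the dataset happens to contain, which is the distribution concern raised above.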

From above, "How and why can this be improved?" While this (strong human architectural decisions) need not necessarily be bad, recent advancements in deep learning have shown that allowing the model to learn the features on its own, in increasing levels of abstraction gives stellar results (see 'evidence paper'). Such techniques of automatic learning have reached amazing accuracies in computer vision and pattern recognition. This is one of the important factors which has made deep learning so effective and popular.

From above, 'evidence paper': See this paper by Y. LeCun et al. "The main message of this paper is that better pattern recognition systems can be built by relying more on automatic learning, and less on hand-designed heuristics... We show that hand-crafted feature extraction can be advantageously replaced by carefully designed learning machines that operate directly on pixel images."

I think it would be interesting to see if we could apply this philosophy of deep learning to try to achieve better results on sandhi splitting.

RNNs are popular models that are showing good performance on NLP tasks. If you are unsure whether feature learning can be incorporated into RNNs, see this.

I have my mid-semester exams coming up soon so unfortunately I have little time on my hands. However, if you wish me to go deeper on a particular aspect of this argument, let me know!

Thank you,
Rohan Saxena

On Wed, Feb 22, 2017 at 1:22 PM, Rohan Saxena <address@hidden> wrote:
Sure. Let me get back to you on this in a couple of days.

Thanks,
Rohan

On Wed, Feb 22, 2017 at 1:59 AM, Akshay S Dinesh <address@hidden> wrote:
Can you go through http://jerinphilip.github.io/posts/2016-08-22-gsoc-final-report.html which is the final report by jerin from last year and think of how that approach differs from RNN and compare them?

On Tue, 21 Feb 2017 at 21:17, Rohan Saxena <address@hidden> wrote:
Hello sir,

Thank you for sharing the wonderful links. The essay on free software was backed by some pretty deep psychology, and really changed how I view open source software. I have also shared it with some of my friends and fellow programmers :)

I went through the ideas list you mentioned and am interested in working on improving the Sandhi Splitter to include RNN strategies. This project is in line with my interests (deep learning and neural networks) and my skills (machine learning and Python).

I apologise for giving a very brief introduction about myself in my first mail. Here is some more information about me:
I am a second year computer science student from BITS Pilani. Here I have been a member of the Embedded Systems and Robotics lab since my freshman year itself. I work on robotics and artificial intelligence (specifically computer vision and deep learning).
I am also a student in the Udacity self-driving car nanodegree, and as part of the programme I have implemented various deep learning architectures which have achieved, for example, accuracies over 99% on the MNIST dataset (classification of handwritten digits) and 95% on the GTS dataset (classification of German Traffic Signs). I want to work on a similar project for GSoC 2017.

Kindly advise me on how to proceed.

Thank you,
Rohan Saxena.

On Tue, Feb 21, 2017 at 4:34 PM, Akshay S Dinesh <address@hidden> wrote:
Hey Rohan,
If this is your first time contributing to free software, read http://asd.learnlearn.in/gsoc-handbook/
Then, go through the gsoc repo at https://gitlab.com/indicproject/gsoc-2017/

See how and where you can contribute and try to come up with an idea.

Let us know.

Akshay

On Tue, 21 Feb 2017 at 16:31, Rohan Saxena <address@hidden> wrote:
Hello,

I am a second year Computer Science student at BITS Pilani, Pilani campus interested in machine learning (especially deep learning).

I wish to contribute to the Indic project as part of GSoC 2017. How can I go about doing this? It would be helpful if you could point me in the right direction.

Thank you,
Rohan Saxena.


From:	Santhosh Thottingal
Subject:	Re: [silpa-discuss] Fwd: Re: Regarding GSoC in Indic
Date:	Sun, 26 Feb 2017 04:52:56 +0000