Hi,
Since the Sandhi splitter problem has remained unsolved for many years, I recently assessed it in a wider scope and discussed it with computational linguistics experts working on other languages.
Sandhi splitting means finding the possible splits of an agglutinated or inflected word. We will get lost in this problem unless we think about the real-world problems we want to solve with this kind of tool. Since our primary objective is not academic work but application software that is usable for real-world use cases, we need to think that way.
Given a word, finding out what constitutes it, that is, identifying all the semantic units in it, is required in every application that has to interpret the word for semantic purposes, such as translation, grammar and spelling contexts.
The reverse process is equally important for the same set of use cases: given a set of semantic units in a language, construct a valid compound or inflected word. For example, apply negation to this verb, combine this adjective and this noun, and so on.
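To make the reverse direction concrete, here is a minimal sketch in Python. It uses English only because the rules are easy to read; the feature names (PROG, PAST), the suffix table and the single juncture rule are all assumptions made up for illustration, nowhere near a real grammar.

```python
# Hypothetical feature-to-suffix table (illustration only).
SUFFIXES = {"PROG": "ing", "PAST": "ed"}

def generate(stem: str, feature: str) -> str:
    """Build an inflected word from a stem and one grammatical feature."""
    suffix = SUFFIXES[feature]
    # Toy juncture rule: drop a single final 'e' before a vowel-initial suffix
    # (make + ing -> making), but keep a double 'ee' (see + ing -> seeing).
    if stem.endswith("e") and not stem.endswith("ee") and suffix[0] in "aeiou":
        stem = stem[:-1]
    return stem + suffix

print(generate("walk", "PROG"))  # walking
print(generate("make", "PROG"))  # making
```

Even this toy needs a rule at the joining point, which is the same thing sandhi rules do on a much larger scale.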
I don't see this part given importance in Sandhi splitter related academic work (at least in the papers I have read).
Now, coming back to the splitting process: what output do we need from splitting? Is it just the string fragments? Or the stems of those fragments? Do we need to know the POS of each fragment of the compound word? Do we need some disambiguation process? Also, can the split be lossless? Meaning, given that a word W gives w1, w2, w3 as output, is it possible to get W back in a later process using just w1, w2, w3? IMO, we need all of these features in place.
I would like to see this problem as a foundation problem for higher-level language processing: morphological analysis. We need to solve it in its totality. A reductionist approach like word splitting alone is not enough. And morphological analysis is a long, continuous, iterative process.
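As a rough illustration of what such an output could look like, here is a small Python sketch. The Segment type, its field names and the POS tags are my assumptions, not an existing API. It keeps the surface fragment alongside the stem, which is one simple way to make the split lossless.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Segment:
    surface: str  # fragment exactly as it appears inside the word
    stem: str     # dictionary form of that fragment
    pos: str      # part-of-speech tag (hypothetical tag set)

def join(segments: List[Segment]) -> str:
    """Rebuild the original word W from its segments w1, w2, ..."""
    return "".join(seg.surface for seg in segments)

# Hand-written analysis of an English word, for illustration only.
analysis = [
    Segment(surface="happi", stem="happy", pos="ADJ"),
    Segment(surface="ness", stem="ness", pos="SUFFIX"),
]

print([seg.stem for seg in analysis])  # ['happy', 'ness']
print(join(analysis))                  # happiness
```

If only the stems were kept, getting W back would need the joining rules applied again; storing the surface form avoids that at the cost of a bigger output.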
I am skeptical about using an RNN to solve this kind of fundamental problem, whose nature is close to the rules explained in classical grammar books like Panineeyam (or Keralapanineeyam for Malayalam), of course with a bundle of exceptions. A few months back I tried to tackle the inflection problem using an ML approach and did not see much success (maybe because I don't have enough expertise, or lacked data to train on), but I would be happy to see it proven otherwise. Since morphological analysis is a foundation problem and many high-level applications are to be built on top of it, I look for mathematical precision in its results.
During the last GSoC mentor summit, there was a meeting of mentors and students who work on NLP. I raised this problem of Dravidian languages, and one conclusion we reached was that we need a morphology analyser, and Finite State Transducer was the recommended technology. Finnish and Turkish, two languages with complex agglutination and inflection, use that technology and have succeeded with it.
Have a look at
https://github.com/flammie/omorfi for the Finnish morphology analyser project. It is a massive project developed over years with support from the University of Helsinki.
I will not stop anybody who wants to try out an RNN; maybe it can also solve the problem. But before attempting it, I would advise looking at the problem more deeply and then thinking about a solution, rather than trying to fit a solution to the problem. Non-Dravidian Indian languages have relatively simple inflection and agglutination; maybe you can pick such a language to start with.
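For anyone who has not met finite state transducers before, here is a toy sketch in plain Python of the idea: read the surface word character by character and emit an analysis along the way. The lexicon, the tags (+PROG, +PAST) and the state names are assumptions for illustration. Real analysers are built with dedicated FST toolkits and far larger rule sets, so this only shows the mechanism.

```python
def build_transitions(stems, suffixes):
    """Return {state: [(input_char, output_string, next_state), ...]}."""
    transitions = {}

    def add(src, char, out, dst):
        transitions.setdefault(src, []).append((char, out, dst))

    # Stem paths copy each character to the output and end in "stem_end".
    for stem in stems:
        state = "start"
        for i, char in enumerate(stem):
            last = i == len(stem) - 1
            nxt = "stem_end" if last else f"stem/{stem}:{i}"
            add(state, char, char, nxt)
            state = nxt

    # Suffix paths emit the tag on their first character, then nothing.
    for suffix, tag in suffixes.items():
        state = "stem_end"
        for i, char in enumerate(suffix):
            last = i == len(suffix) - 1
            nxt = "final" if last else f"suf/{suffix}:{i}"
            add(state, char, tag if i == 0 else "", nxt)
            state = nxt
    return transitions

def analyze(word, transitions, finals=("stem_end", "final")):
    """Collect every output produced on a path that consumes the whole word."""
    results = []

    def walk(state, pos, output):
        if pos == len(word):
            if state in finals:
                results.append(output)
            return
        for char, out, nxt in transitions.get(state, []):
            if char == word[pos]:
                walk(nxt, pos + 1, output + out)

    walk("start", 0, "")
    return results

fst = build_transitions(["walk", "talk"], {"ing": "+PROG", "ed": "+PAST"})
print(analyze("walking", fst))  # ['walk+PROG']
print(analyze("talked", fst))   # ['talk+PAST']
print(analyze("walkin", fst))   # []
```

The appeal for our case is that the result is exact: a word is either accepted with a definite analysis or rejected, and the same transition table can in principle be run in the other direction for generation.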
I expect an RNN to be superior, if you have enough data. Deep learning also requires enough samples to capture the patterns. I think you can improve on it by using a bidirectional RNN to capture both the forward and backward contexts, since we'll be giving it the entire input at once.
We don't have any annotated data for other languages. Will you be able to get this on your own? The community can help to some extent, I hope.
Also, if you're willing to take up the challenge, can you attempt devising an algorithm to capture the patterns unsupervised from raw text? For example, given words like 'texting' and 'walking', we should easily be able to work out from prose that 'ing' is an ending pattern and that *-ing is a split point of sorts.
I also suggest you put more emphasis on the development side of things in your proposal, since you have to produce a usable API for other developers to build on.
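A very rough sketch of that idea in Python, just to show the starting point: count the endings that, when stripped, leave another word already seen in the text. The function name, the thresholds and the tiny word list are made up for illustration; a real attempt would need a large corpus and proper scoring.

```python
from collections import Counter

def candidate_suffixes(words, min_stem=3, max_suffix=4):
    """Count endings whose removal leaves a word that is also in the vocabulary."""
    vocab = set(words)
    counts = Counter()
    for word in vocab:
        for k in range(1, max_suffix + 1):
            stem, suffix = word[:-k], word[-k:]
            if len(stem) >= min_stem and stem in vocab:
                counts[suffix] += 1
    return counts

# Stand-in for words collected from raw text.
words = "text texting walk walking walked talk talking talked".split()
counts = candidate_suffixes(words)
print(counts.most_common())  # [('ing', 3), ('ed', 2)]
```

This only works when the stem survives unchanged, so it will miss exactly the cases where sandhi alters the junction. It is a baseline to improve on, not a solution.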
Please send mail to the mailing list starting now, if you want to increase your chances of a faster response.