Re: Language identification

From: Juri Linkov
Subject: Re: Language identification
Date: Fri, 28 Aug 2009 22:08:28 +0300
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1.50 (x86_64-pc-linux-gnu)

>>> In `auto-mode-alist' you can see that with the exception of
>>> `archive-mode', `doc-view-mode' and `image-mode', all remaining
>>> modes are programming text modes.  It would be more useful
>>> to identify file types for these modes that libmagic can't do.
>>> Do you know a library that identifies programming languages?
>>> Such a library might be implemented using a Bayesian classifier
>>> trained on a sufficiently large corpus of different programming
>>> languages.
>> N-gram algorithms could be used to identify languages; they are simpler
>> than Bayes and require a smaller database
> Sorry, I overlooked that this was about programming languages, not
> natural languages.

It would be interesting to try N-gram algorithms on programming
languages and see how well they perform.  For example, the most frequently
used bigram "/*" belongs to C, the most frequently used trigram ";;;"
belongs to Lisp, etc.
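
A minimal sketch of what such an experiment might look like, in Python.
Everything in it is an assumption made for illustration: the two tiny
training samples, the function names, the profile size, and the
rank-based "out-of-place" distance borrowed from natural-language
N-gram classifiers.  A real test would need a much larger corpus per
language.

```python
from collections import Counter

# Toy training samples (hypothetical); a real corpus would be far larger.
C_SAMPLE = """
/* Return the larger of two ints. */
int max(int a, int b)
{
    /* Compare and return. */
    return a > b ? a : b;
}
"""

LISP_SAMPLE = """
;;; Return the larger of two numbers.
(defun my-max (a b)
  ;; Compare and return.
  (if (> a b) a b))
"""

SAMPLES = {"c": C_SAMPLE, "lisp": LISP_SAMPLE}


def ngram_counts(text, n):
    """Count every overlapping character n-gram of length n in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def profile(text, sizes=(2, 3), top=300):
    """Return up to `top` bigrams and trigrams, most frequent first."""
    counts = Counter()
    for n in sizes:
        counts.update(ngram_counts(text, n))
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences between two profiles.

    An n-gram absent from lang_profile costs the maximum penalty,
    which is the length of lang_profile.
    """
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[gram]) if gram in rank else penalty
               for i, gram in enumerate(doc_profile))


def guess_language(text, samples):
    """Return the name of the sample whose profile is closest to text."""
    doc = profile(text)
    return min(samples,
               key=lambda name: out_of_place(doc, profile(samples[name])))


if __name__ == "__main__":
    print(guess_language("/* counter */ int i = 0;", SAMPLES))
    print(guess_language(";;; counter\n(setq i 0)", SAMPLES))
```

With samples this small the guesses on unseen text are not reliable;
the point is only to show how little machinery the approach needs
compared to a trained Bayesian classifier.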

Juri Linkov
