Re: Language identification

From: Alex Ott
Subject: Re: Language identification
Date: Fri, 28 Aug 2009 08:45:05 +0200



N-gram algorithms could be used to identify languages. They are simpler
than a Bayesian classifier and require a smaller database.
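
To make the idea concrete, here is a rough sketch in Python (not an
existing library, and not something Emacs ships). It assumes the
rank-order "out-of-place" comparison of character n-gram profiles
described by Cavnar and Trenkle, which is only one of several possible
n-gram schemes. The function names, the tiny SAMPLES corpus, and the
limits (n-grams up to length 3, top 300 per profile) are all made up for
illustration; a usable database would be trained on much larger samples.

```python
from collections import Counter

def ngram_profile(text, n_max=3, top=300):
    """Return the `top` most frequent character n-grams (lengths 1..n_max),
    ordered from most to least frequent."""
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the
    # ordering is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_profile, lang_profile, penalty=300):
    """Sum of rank differences between two profiles. An n-gram that is
    missing from lang_profile costs a fixed `penalty`."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

def identify(text, profiles):
    """Return the key of the profile closest to `text`."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Hypothetical toy training data: one short snippet per language.
SAMPLES = {
    "elisp": '(defun greet (name)\n  (message "Hello, %s" name))\n'
             '(setq x (car (cdr lst)))\n',
    "c": '#include <stdio.h>\nint main(void) {\n'
         '    printf("Hello\\n");\n    return 0;\n}\n',
    "python": "import sys\ndef greet(name):\n"
              "    print('Hello', name)\n    return None\n",
}

# The "database" is just one ranked n-gram list per language.
PROFILES = {lang: ngram_profile(src) for lang, src in SAMPLES.items()}
```

With this, identify(buffer_text, PROFILES) returns the name of the
closest profile. The whole database is a few hundred short strings per
language, which is why it stays small compared to a per-token
probability table.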

Juri Linkov at "Fri, 28 Aug 2009 03:27:35 +0300" wrote:
 >> I often wish that files would open in Emacs with correct mode
 >> more often when there is no file extension.

 JL> In `auto-mode-alist' you can see that with the exception of
 JL> `archive-mode', `doc-view-mode' and `image-mode', all remaining
 JL> modes are programming text modes.  It would be more useful
 JL> to identify file types for these modes that libmagic can't do.
 JL> Do you know a library that identifies programming languages?
 JL> Such a library might be implemented using a Bayesian classifier
 JL> trained on a sufficiently large corpus of different programming
 JL> languages.

With best wishes, Alex Ott, MBA


