RE: Automatic recognition of some specific coding systems

From: Jürgen Hartmann
Subject: RE: Automatic recognition of some specific coding systems
Date: Thu, 26 Feb 2015 23:34:05 +0100

@Eli Zaretskii: Thank you very much for your profound assessment:

> It looks like what you want is beyond the current capabilities of
> Emacs's auto-detection of encoding.  See below for some alternatives.
> Having said that...
>> By the way, could you verify, that this is possible with Emacs 22.3
>> with the customization described in my previous post?
> No, it doesn't work for me.  The latin-9 file is decoded using my
> locale's encoding (which isn't latin-9), and cp850 file is still
> raw-text.

Oops, this is an important finding indeed.

> So I think some other factor(s) is/are at work on your system.  Your
> locale's encoding is certainly one of them, but I think there should
> be something else, either in your customizations or somewhere else.

I just repeated the tests with Emacs 22.3 using the POSIX locale,

   LC_ALL=C ./emacs -q

and you are right: the cp850 file was recognized as raw-text now. The
locale I used before was


The more I get involved in this topic, the more I see that it is much
more complex than I thought at first glance.

> In general, even if Emacs 22.3 was capable to do the job, I think it
> was by sheer luck, and is anyway fragile, since the same
> customizations don't work for me (and AFAIU, aren't supposed to work).
> So I would suggest to explore alternative ways of doing this in Emacs
> 24 reliably.

This sounds reasonable to me. Besides the aspect of reliability, which
is of course the most important one, doing so might also yield a
solution that is likely to survive future updates.

> Some possibilities you may wish to explore:
>   . Put a 'coding: cp850' cookie in the cp850 files

I would rather avoid altering the files' content for this technical reason.

>   . If the names of the cp850 files all match some common pattern, you
>     can use modify-coding-system-alist to tell Emacs to decode them by
>     cp850

Unfortunately, in my case there is no such pattern in the file names
that would indicate which coding the respective file uses.

>   . Similarly, if the cp850 files' contents match some common regexp,
>     you can customize auto-coding-regexp-alist to force their decoding
>     by cp850

That one might do the trick: In my case the only files (at least in
the big picture) that use the DOS EOL variant are those encoded with
cp850 and vice versa. So one could think about a regular expression
that matches this unique EOL pattern.
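
A minimal, untested sketch of what such an entry could look like. It
assumes that the part of the file Emacs examines for
auto-coding-regexp-alist contains at least one line break, and that no
file with DOS line endings needs any decoding other than cp850. The
regexp itself is only an illustration, not a worked-out solution:

```elisp
;; Sketch only: decode any file whose examined bytes contain CR LF
;; as cp850 with DOS line endings.  The regexp is an assumption made
;; for illustration; it would also match binary files that happen to
;; contain the byte sequence 0x0D 0x0A.
(add-to-list 'auto-coding-regexp-alist
             '("\r\n" . cp850-dos))
```

A more restrictive regexp would probably be safer, so this form would
need to be checked against real files before relying on it.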

> Of course, you can always turn the table, and do the above for
> latin-9, while keeping cp850 in set-coding-system-priority call.  It
> all depends which one of these 2 lends itself better to one of these
> methods.
> I believe that if one of these alternatives can do the job for you,
> the result will be much more reliable.

I also think so.

So, I have to play around a little bit to get acquainted with the
construction of regular expressions for Emacs. I will be back when I
have gained a deeper insight or, at best, a concrete solution.

Meanwhile, I would like to thank you, Eli Zaretskii, very much for the
time and effort you spent providing me with this thorough
analysis and your valuable suggestions.



