Re: Automatic recognition of some specific coding systems

From: Eli Zaretskii
Subject: Re: Automatic recognition of some specific coding systems
Date: Thu, 26 Feb 2015 18:36:04 +0200

> From: Jürgen Hartmann <address@hidden>
> Date: Thu, 26 Feb 2015 00:23:50 +0100
> > Try this:
> > 
> >   (set-coding-system-priority 'utf-8 'cp850)
> After doing this, the coding systems
>    utf-8
>    cp850
> get correctly recognized, but
>    latin-9-unix
> gets wrongly recognized as cp850-unix encoded.
> If I modify the lisp expression to
>    (set-coding-system-priority 'utf-8 'latin-9)
> it is utf-8 and latin-9 that are properly recognized while the test
> file
>    cp850-dos
> gets detected as iso-latin-9-dos encoded.

I feared that might be the result.

> If I pass all three coding systems to set-coding-system-priority,
>    (set-coding-system-priority 'utf-8 'latin-9 'cp850)   or
>    (set-coding-system-priority 'utf-8 'cp850 'latin-9)
> it turns out that the function set-coding-system-priority ignores the third
> coding system in these cases, because it belongs to the same coding
> category as the coding system named in the second place. The source
> code src/coding.c comments this in the lines 9972 and 9973 like this:
>     /* Ignore this coding system because a coding system of the
>        same category already had a higher priority.  */

Yes, I know.  That's why I only mentioned 2 of them.
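
If you want to see this for yourself, something along the following
lines should show it.  This is only a sketch: I'm assuming that
coding-system-category is available in your Emacs version, and I'm
deliberately not showing return values, since they can differ between
versions and setups.

```elisp
;; Ask Emacs which detection category each coding system belongs to.
;; If both forms return the same symbol, only the first of the two
;; passed to set-coding-system-priority can take effect.
(list (coding-system-category 'cp850)
      (coding-system-category 'latin-9))

;; Raise utf-8 and cp850, then inspect the resulting priority order.
(set-coding-system-priority 'utf-8 'cp850)
(coding-system-priority-list)
```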

It looks like what you want is beyond the current capabilities of
Emacs's auto-detection of encoding.  See below for some alternatives.

Having said that...

> By the way, could you verify, that this is possible with Emacs 22.3
> with the customization described in my previous post?

It doesn't work for me.  The latin-9 file is decoded using my
locale's encoding (which isn't latin-9), and cp850 file is still [...]
So I think some other factor(s) is/are at work on your system.  Your
locale's encoding is certainly one of them, but I think there should
be something else, either in your customizations or somewhere else.

In general, even if Emacs 22.3 was capable of doing the job, I think it
was by sheer luck, and is fragile anyway, since the same
customizations don't work for me (and AFAIU, aren't supposed to work).
So I would suggest exploring alternative ways of doing this reliably
in Emacs 24.  Some possibilities you may wish to explore:

  . Put a 'coding: cp850' cookie in the cp850 files

  . If the names of the cp850 files all match some common pattern, you
    can use modify-coding-system-alist to tell Emacs to decode them by
    cp850

  . Similarly, if the cp850 files' contents match some common regexp,
    you can customize auto-coding-regexp-alist to force their decoding
    by cp850
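
As a rough sketch, the three possibilities above could look like
this.  The file-name suffix and the header text are made up for
illustration; substitute whatever actually distinguishes your cp850
files.

```elisp
;; 1. Cookie: put this on the first line of each cp850 file, inside
;;    whatever comment syntax the file uses (shown here for Lisp):
;;
;;      ;; -*- coding: cp850 -*-

;; 2. File-name pattern: decode every file whose name matches the
;;    regexp as cp850.  The ".dos.txt" suffix is a hypothetical example.
(modify-coding-system-alist 'file "\\.dos\\.txt\\'" 'cp850)

;; 3. Content pattern: decode as cp850 any file whose beginning
;;    matches the regexp.  The "REM DOS-EXPORT" header is a
;;    hypothetical example.
(add-to-list 'auto-coding-regexp-alist
             '("\\`REM DOS-EXPORT" . cp850))
```

The last two forms would go into your init file; the cookie goes into
the files themselves.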

Of course, you can always turn the tables, and do the above for
latin-9, while keeping cp850 in the set-coding-system-priority call.  It
all depends on which one of these 2 lends itself better to one of these
techniques.
I believe that if one of these alternatives can do the job for you,
the result will be much more reliable.

