Re: Detecting the coding system of a file programmatically
From: Eli Zaretskii
Subject: Re: Detecting the coding system of a file programmatically
Date: Fri, 10 Aug 2018 10:28:07 +0300
> From: Andrea Cardaci <address@hidden>
> Date: Fri, 10 Aug 2018 03:02:55 +0200
>
> (with-temp-buffer
>   (insert-file-contents-literally path)
>   (decode-coding-region (point-min) (point-max) 'utf-8)
>   (... do stuff with the buffer ...))
>
> I use `insert-file-contents-literally' because the non-literal
> counterpart is too slow (apparently about twice as slow), as it does a
> bunch of stuff in addition to simply populating the buffer.
> Unfortunately, one of these things is to decode the buffer.
>
> Now instead of hardcoding 'utf-8 I'd like to detect the correct
> encoding where possible, so I tried experimenting with
> `find-operation-coding-system'.
That's the wrong function to use in this case; you want
decode-coding-inserted-region instead. Alternatively, you could use
detect-coding-region and then decode-coding-region with the value it
returns. I suggest a good read of the "Explicit Encoding" and "Lisp
and Coding Systems" nodes of the ELisp manual.
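
A minimal sketch of the second alternative (detect-coding-region
followed by decode-coding-region) might look like the following. The
helper name my-read-file-detecting is hypothetical, introduced only for
illustration; which coding system gets detected depends on your coding
system priorities.

```elisp
;; Sketch only: `my-read-file-detecting' is a hypothetical helper name,
;; not something provided by Emacs.
(defun my-read-file-detecting (path)
  "Return the contents of PATH, decoded with a detected coding system."
  (with-temp-buffer
    ;; Insert the raw bytes without any decoding.
    (insert-file-contents-literally path)
    ;; A non-nil third argument makes `detect-coding-region' return only
    ;; the most likely coding system instead of a list of candidates.
    (let ((coding (detect-coding-region (point-min) (point-max) t)))
      (decode-coding-region (point-min) (point-max) coding)
      (buffer-string))))
```

Calling (my-read-file-detecting "~/tmp/latin-1") should then return the
decoded text as a string.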
> I created a latin-1 file (which gets
> recognised properly when I visit it) and tried the following:
>
> (with-temp-buffer
>   (setq path "~/tmp/latin-1")
>   (insert-file-contents-literally path)
>   (find-operation-coding-system
>    'insert-file-contents
>    (cons path (current-buffer))))
>
> But all I get is (undecided).
That's expected: find-operation-coding-system returns the _default_ to
use for the named operation. It doesn't consider the contents of the
buffer.
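
To make the shape of that return value concrete, here is a small
sketch. The helper name my-default-read-coding is hypothetical; the
result depends on your file-coding-system-alist, so no particular
value is promised.

```elisp
;; Sketch only: `my-default-read-coding' is a hypothetical helper name.
(defun my-default-read-coding (path)
  "Return the default decoding system configured for reading PATH, or nil."
  ;; `find-operation-coding-system' returns (DECODING . ENCODING), or nil
  ;; when nothing matches; the car is the part relevant to reading.
  (car (find-operation-coding-system 'insert-file-contents path)))
```

With the setup described in your message, I would expect this to give
the symbol undecided rather than the list (undecided).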
> Now my question is twofold: is this the best approach for what I'm
> trying to achieve? And in any case, why does the latter example not
> work as expected? (And hence, how can I detect the coding system
> programmatically?)
I hope I answered all of those questions; if not, please ask more.
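In any case, it is definitely OK to call decode-coding-region with the
value 'undecided' that find-operation-coding-system reports, because
'undecided' is a special value which signals to decode-coding-region
that detection of the actual encoding is necessary. Note that
find-operation-coding-system returns a cons cell of the form
(DECODING . ENCODING), which is why you see (undecided), so you need
to pass its car. Thus, I expect this to work for you:
  (with-temp-buffer
    (insert-file-contents-literally path)
    (decode-coding-region (point-min) (point-max)
                          ;; The return value is (DECODING . ENCODING);
                          ;; decode-coding-region wants only DECODING.
                          (car (find-operation-coding-system
                                'insert-file-contents
                                (cons path (current-buffer))))))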
But I still recommend using decode-coding-inserted-region, because it
will do all of the above (and slightly more) for you internally.
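
As a rough sketch of that recommendation (the helper name
my-read-file-auto-decoded is hypothetical, and error handling is left
out):

```elisp
;; Sketch only: `my-read-file-auto-decoded' is a hypothetical helper name.
(defun my-read-file-auto-decoded (path)
  "Return the contents of PATH, decoded as if PATH had been read normally."
  (with-temp-buffer
    ;; Insert the raw bytes without any decoding.
    (insert-file-contents-literally path)
    ;; Decode the inserted bytes as if they were read from file PATH.
    (decode-coding-inserted-region (point-min) (point-max) path)
    (buffer-string)))
```

After the decoding call, the variable last-coding-system-used should
hold the coding system that was actually used, which is one way to
learn the detected encoding programmatically.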