bug-gnu-emacs

bug#34350: 27.0.50; ediff-revision broken with SVN backend + non ascii c


From: Vincent Belaïche
Subject: bug#34350: 27.0.50; ediff-revision broken with SVN backend + non ascii chars both in directory and in filename
Date: Sun, 10 Feb 2019 20:52:40 +0100

On 09/02/2019 at 08:58, Dmitry Gutov wrote:
> On 09.02.2019 00:45, Eli Zaretskii wrote:
>> How is vc-annotate relevant to vc-git-find-revision and the issue at
>> hand?
>
> It's an investigation tool.
>
>> coding-system-for-write affects both I/O from and to a subprocess and
>> the encoding of the command-line arguments we pass to that
>> subprocess.
>
> Aren't there any programs that expect one encoding for their
> arguments, yet output contents in other encodings sometimes?

Yes, there are. For instance, if you edit a document with LaTeX and use
pdflatex with koi8 as the encoding, then the compilation log shows the
error messages interleaved with pieces of the source text, and those
pieces are a byte stream corresponding to characters encoded in koi8.

But the messages are also interleaved with filenames, which are likewise
a byte stream, encoded in a way that depends on the filesystem. On an
MSW-10 machine that is utf8, as it seems that my pdflatex port (MiKTeX)
writes them in this format. For filenames entirely in ASCII this makes
no difference, because both utf8 and koi8 are supersets of ASCII.
Internally, AFAIK, MSW10 uses UTF16; I speculate that my pdflatex port
to MSW uses UTF16 wchar_t to access the files and does some utf8/utf16
encoding/decoding for input/output with the tex source code and tex log.

Of course users will want the log decoded as koi8, so that the pieces
of text in the log are readable, but that will mess up the non-ASCII
filenames. For instance, if the filename is « Макрон—Узурпатор.tex », it
will show up as « п°п╟п╨я─п╬п╫Б─■пёп╥я┐я─п©п╟я┌п╬я─.tex » in the log, and
AUCTeX won't be able to jump to the error line when you do « M-g n ».
Maybe someday I will submit a contribution to make AUCTeX configurable
to decode the filenames, or maybe by that time I will simply use xelatex
and everything will be in utf8 :-/
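
To make the filename garbling concrete, here is a minimal Python sketch.
It assumes that the filename bytes in the log are utf8 and that « koi8 »
means KOI8-R (that is my assumption, and the helper name is made up for
illustration). Decoding the utf8 bytes as KOI8-R should give the kind of
garbage quoted above, while an all-ASCII name comes out unchanged.

```python
def decode_log_bytes_as_koi8(raw: bytes) -> str:
    """Decode raw log bytes as KOI8-R, as a koi8-configured viewer would."""
    return raw.decode("koi8_r")


# Filename as it would be written to the log in utf8 (assumption).
filename = "Макрон—Узурпатор.tex"
raw = filename.encode("utf-8")

# What the user sees when the whole log is decoded as koi8.
garbled = decode_log_bytes_as_koi8(raw)

# KOI8-R maps every byte to a character, so the original bytes can be
# recovered and decoded again with the right encoding.
recovered = garbled.encode("koi8_r").decode("utf-8")

# An all-ASCII name has the same bytes in both encodings.
ascii_name = "report.tex"
same_bytes = ascii_name.encode("utf-8") == ascii_name.encode("koi8_r")

print(garbled)
print(recovered)
print(same_bytes)
```

The recovery step is roughly what a configurable filename decoding in
AUCTeX would have to do: re-encode the piece of text known to be a
filename and decode it with the filesystem encoding instead.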


On the other hand the pdflatex command will accept arguments encoded in
whatever the OS requires — I don't know what my pdflatex port internally
does, probably it has some « int wmain(int ac, wchar_t const* av[]) »
prototype and does some utf16/utf8 encoding of the command line args in
order to get a byte stream.

So, still under MSW, if my pdflatex is launched from a powershell
window, filenames and any other arguments are passed to pdflatex as wide
chars with utf16 encoding.

So if the filename is №1.tex, and I type from a powershell prompt:

  pdflatex №1.tex

that will work fine, provided that the file exists, because, although
the character « № » does not exist in windows1252, both powershell and
my pdflatex port use utf16.

Now, if I try the same command from a cmd prompt, funnily enough that
will also work; however, a DOS .bat file encoded in utf-8 with the same
command, even if you launch cmd with the /U option…

But if I now try this command from Emacs (with « M-x compile », or
« M-& », or directly by evaluating « (call-process "pdflatex" nil t nil
"№1.tex") »), that will not work either. Nor will it work from a
*shell* buffer using cmdproxy.exe. This is probably because Emacs uses
(I speculate) the standard « system(…) » call, which accepts a
windows1252 encoded command line, rather than the Microsoft-specific
_wsystem one.

BTW, I don't know whether _wsystem is ported to MinGW; I would not be
surprised if it is not. For instance, « int wmain(int ac, wchar_t
const*av[]) » prototypes aren't ported to MinGW (they cause a linker
error), and one has to resort to the GetCommandLineW & CommandLineToArgvW
windows API functions. An alternative is 1) to use the powershell quote
syntax for the command line + utf16 encoding, 2) to base-64 encode the
command line, and 3) to pass the encoded block to powershell with the
-EncodedCommand option through the usual system(…) call.
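
The three steps of that alternative can be sketched as follows, in
Python rather than C just to keep it short. The function name is
hypothetical and nothing is actually launched; the point is only that
the resulting argument list is pure ASCII, so it survives a narrow
(windows1252) command line. It relies on powershell's -EncodedCommand
taking a base-64 encoded utf16 (little-endian) string.

```python
import base64
from typing import List


def powershell_encoded_argv(command: str) -> List[str]:
    """Build an ASCII-only argv that asks powershell to run `command`."""
    # Steps 1 and 2: utf16 (little-endian) encode the command line,
    # then base-64 encode the resulting bytes.
    encoded = base64.b64encode(command.encode("utf-16-le")).decode("ascii")
    # Step 3: hand the encoded block to powershell.
    return ["powershell", "-NoProfile", "-EncodedCommand", encoded]


# Single quotes are the powershell quote syntax for a literal string.
argv = powershell_encoded_argv("pdflatex '№1.tex'")
print(argv)
```

Whether this is acceptable performance-wise (one extra powershell
process per command) is a separate question.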


So, to summarise :

1. encoding of filenames, encoding of command line, and encoding of file
   content are 3 different things. So, it is a bit surprising if
   coding-system-for-write affects all of them in the same way.

2. filename encoding may be different on the command line and on the
   input/output streams (e.g. pdflatex called from powershell has utf16
   on the command line, and utf8 in the files for filenames).

3. filename/argument encoding depends on the command. Under MSW, some
   commands have « int main(int ac, char const*av[]) » under the hood
   and as such expect arguments to be windows1252 encoded (in Western
   Europe); others have « int wmain(int ac, wchar_t const*av[]) » under
   the hood and as such expect utf16 encoded arguments. So with the
   first type of command you can have « œ » but not « № » in the
   arguments, while with the second type you can have both of them.
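
The « œ » versus « № » remark in point 3 can be checked with a few
lines of Python; this is only an illustration of the two encodings
(the helper name is made up), not of what any particular command does
internally.

```python
def fits_in(text: str, codec: str) -> bool:
    """Return True if every character of `text` can be encoded in `codec`."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


for ch in ("œ", "№"):
    # cp1252 is Python's name for windows1252.
    print(ch, fits_in(ch, "cp1252"), fits_in(ch, "utf-16-le"))
# œ True True
# № False True
```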


  V.





