
Re: Passing unicode filenames to start-process on Windows?

From: Klaus-Dieter Bauer
Subject: Re: Passing unicode filenames to start-process on Windows?
Date: Wed, 6 Jan 2016 22:19:39 +0100

2016-01-06 17:13 GMT+01:00 Eli Zaretskii <address@hidden>:
> From: Klaus-Dieter Bauer <address@hidden>
> Date: Wed, 6 Jan 2016 16:20:29 +0100
> Is there a reliable way to pass unicode file names as
> arguments through `start-process'?

No, not at the moment, not in the native Windows build of Emacs.
Arguments to subprocesses are forced to be encoded in the current
system codepage.  This commentary in w32.c tells a few more details:

   . Running subprocesses in non-ASCII directories and with non-ASCII
     file arguments is limited to the current codepage (even though
     Emacs is perfectly capable of finding an executable program file
     in a directory whose name cannot be encoded in the current
     codepage).  This is because the command-line arguments are
     encoded _before_ they get to the w32-specific level, and the
     encoding is not known in advance (it doesn't have to be the
     current ANSI codepage), so w32proc.c functions cannot re-encode
     them in UTF-16.  This should be fixed, but will also require
     changes in cmdproxy.  The current limitation is not terribly bad
     anyway, since very few, if any, Windows console programs that are
     likely to be invoked by Emacs support UTF-16 encoded command
     lines.

   . For similar reasons, server.el and emacsclient are also limited
     to the current ANSI codepage for now.

   . Emacs itself can only handle command-line arguments encoded in
     the current codepage.

The main reason for this being a low-priority problem is that the
absolute majority of console programs Emacs might invoke don't support
UTF-16 encoded command-line arguments anyway, so the efforts to enable
this would yield very little gains.  However, patches to do that will
be welcome.  (Note that, as the comment above says, the changes will
also need to touch cmdproxy, since we invoke all the programs through
it.)

> I realized two limitations:
> 1. Using `prefer-coding-system' with anything other than
> `locale-default-encoding', e.g.
> (prefer-coding-system 'utf-8),
> causes a file name "Ö.txt" to be misdecoded by
> subprocesses -- notably including "emacs.exe", but also
> all other executables I tried (both Windows builtins like
> where.exe and third party executables like ffmpeg.exe or
> GnuWin32 utilities).
> In my case (German locale, 'utf-8 preferred coding
> system) it is mis-decoded as "Ã–.txt", i.e. emacs encodes
> the process argument as 'utf-8 but the subprocess decodes
> it as 'latin-1 (in my case).
> While this can be fixed by an explicit encoding
> (start-process ...
> (encode-coding-string filename locale-coding-system))
> such code will probably not be used in most projects, as
> the issue occurs only on Windows, dependent on the user
> configuration (-> hard-to-find bug?). I have added some
> elisp for demonstration at the end of the mail.
> 2. When a file-name contains characters that cannot be
> encoded in the locale's encoding, e.g. Japanese
> characters in a German locale, I cannot find any way to
> pass the file name through the `start-process' interface.
> Unlike for characters that are supported by the locale,
> it fails even in a clean "emacs -Q" session.
> Curiously the file name can still be used in cmd.exe,
> though entering it may require TAB-completion (even
> though the active codepage shouldn't support them).
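
The mis-decoding described in point 1 can be reproduced outside Emacs. The following is a minimal Python sketch, not Emacs code. It assumes the German locale's ANSI codepage is windows-1252 (the mail says 'latin-1; the two differ for byte 0x96). The function name `misdecode` is hypothetical.

```python
def misdecode(name: str, sent_as: str = "utf-8", read_as: str = "cp1252") -> str:
    """Encode NAME the way the parent process would, then decode the
    resulting bytes the way the child process would."""
    return name.encode(sent_as).decode(read_as, errors="replace")


original = "\u00d6.txt"  # "Ö.txt"

# Parent encodes as UTF-8, child decodes with the ANSI codepage:
# the two UTF-8 bytes of "Ö" turn into two unrelated characters.
garbled = misdecode(original)

# Parent encodes with the codepage the child expects: clean round trip.
intact = misdecode(original, sent_as="cp1252")

print(ascii(garbled), ascii(intact))
```

This corresponds to the explicit `encode-coding-string` workaround quoted above: the argument survives only when both sides agree on the codepage.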

Does the program which you invoke support UTF-16 encoded command-line
arguments?  It would need to either use '_wmain' instead of 'main', or
access the command-line arguments via GetCommandLineW or the like,
and process them as wchar_t strings.

If the program doesn't have these capabilities, it won't help that
Emacs passes it UTF-16 encoded arguments, because Windows will attempt
to convert them to strings encoded in the current codepage, and will
replace any un-encodable characters with question marks or blanks.
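
The effect of that conversion can be illustrated without Windows. This is a rough Python sketch, assuming codepage 1252 is the current ANSI codepage and using Python's "replace" error handler as a stand-in for the conversion Windows performs (the real conversion may behave differently, for example by substituting similar-looking characters). `fits_codepage` and `as_seen_by_ansi_program` are hypothetical helper names.

```python
def fits_codepage(name: str, codepage: str = "cp1252") -> bool:
    """Return True if every character of NAME can be encoded in CODEPAGE."""
    try:
        name.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


def as_seen_by_ansi_program(name: str, codepage: str = "cp1252") -> str:
    """Approximate what a program using 'main' (not '_wmain') receives:
    characters outside CODEPAGE are replaced with question marks."""
    return name.encode(codepage, errors="replace").decode(codepage)


german = "\u00d6.txt"          # "Ö.txt", encodable in cp1252
japanese = "\u65e5\u672c.txt"  # two kanji, not encodable in cp1252

print(fits_codepage(german), fits_codepage(japanese))
print(ascii(as_seen_by_ansi_program(japanese)))
```

A check like `fits_codepage` could be used to detect in advance that a file name cannot survive the trip to a subprocess that only reads codepage-encoded arguments.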

> ;; Set the preferred coding system.
> (prefer-coding-system 'utf-8)

You cannot use UTF-8 to encode command-line arguments on Windows, not
in general, even if the program you invoke does understand UTF-8
strings as its command-line arguments.  (I can explain if you want.)

> ;; On Unix (tested with cygwin), it works fine; Presumably because
> ;; the file name is decoded (in `directory-files') and encoded (in
> ;; `start-process') with the same preferred coding system.

It works with Cygwin because Cygwin does support UTF-8 for passing
strings to subprograms.  That support lives inside the Cygwin DLL,
which replaces most of the Windows runtime with Posix-compatible
APIs.  The native Windows build of Emacs doesn't have that luxury.

I checked again and found that some of the utilities I tested before (specifically the GnuWin32 tools) indeed can't handle Japanese characters when called from cmd.exe.

ffmpeg, on the other hand, supports Unicode file names in cmd.exe, but I agree that this is quite a niche usage.

I thought up some workarounds, but they all run into limitations:

Would you happen to know any other possible workaround?

Thanks for the explanations,
- Klaus
