emacs-devel


Re: "args-out-of-range" error when using data from external process on W


From: Alexis
Subject: Re: "args-out-of-range" error when using data from external process on Windows
Date: Thu, 18 Apr 2024 21:20:55 +1000
User-agent: mu4e 1.12.4; emacs 29.3


Thanks again for your assistance!

As some additional context: i haven't actively used a Windows system in more than a decade - it was Windows 7 - and even then, i was running it in a VM in order to run some other software. i've also never used Windows outside of an "Australian English" context, and have never done any dev work on the Windows platform. So i've got only a minimal idea of how Windows does various things nowadays, and have never needed to become familiar with sysadmin-/dev-level Windows documentation. Until now. :-)

Specific responses inline below.

I don't think I understand the setting of LC_ALL part. First, AFAIK Windows programs generally ignore LC_* environment variables. If you read the Microsoft documentation of 'setlocale', here:

  
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-170

you will not see any reference to environment variables there.

Thanks for this link; it gives me a good starting point to explore the Win docs on this issue.

The Windows 'setlocale' supports only LC_* _categories_ in direct calls to the function, and doesn't consider the corresponding environment variables. The Emacs source code doesn't reference LC_* environment variables on MS-Windows, either. So how did the user set LC_ALL, and why did it have any effect whatsoever on the issue?

They didn't say; all they wrote (https://github.com/flexibeast/ebuku/issues/31#issuecomment-2058171986) was:

I ... changed my LC_ALL to zh_CN.UTF-8. Ebuku can find the db now.

i'll ask them.

Second, the user sets a UTF-8 locale, which as I wrote up-thread is not a good idea on MS-Windows. It could well cause failures in invoking external programs from Emacs, if the arguments to those programs include non-ASCII characters. In general, on MS-Windows Emacs can only safely invoke programs with non-ASCII characters in the command-line arguments if those characters can be encoded by the system codepage, in this case codepage-936 AFAIU.

Thanks, i'll add that to the information i pass back to the user on that GitHub issue.

Regarding the "invalid string for collation: Invalid argument" error: how does ebuku determine the LOCALE argument with which it calls string-collate-lessp? It is important to understand which locale w32_compare_strings was called with in that case.

The single use of `string-collate-lessp` doesn't pass any LOCALE argument, as i just wanted it to use the user's current locale for sorting a given bookmark's tags into the appropriate lexicographical order.

Finally, the issues with Windows-style file names with drive letters and with file names that begin with "~" lead me to believe that perhaps the underlying program 'buku' is not a native Windows program, but a Cygwin or MSYS program, in which case there could be incompatibilities both regarding file names and regarding handling of non-ASCII characters (Cygwin and MSYS use UTF-8 by default, whereas the native Windows build of Emacs does not).

Sorry; i mentioned in my first email, but didn't reiterate in my second, that `buku` is Python-based.

You need to take a good look at whether non-ASCII characters are passed to 'buku' in this case, and how the output from 'buku' is decoded.

👍

Also, ebuku-buku-path and ebuku-database-path should both be quoted with shell-quote-argument (but I don't think this is a problem in this case). Can ARGS include whitespace or characters special for the Windows shell? if so, each argument should be quoted with shell-quote-argument as well.

Thanks, noted.

How output is decoded when it is put into the temporary buffer is also of interest -- what is the value of buffer-file-coding-system in the temporary buffer after reading output, in the OP's case?

*nod*

Emacs on MS-Windows cannot use UTF-8 when encoding command-line arguments for sub-programs; it can only use the system codepage. Using set-language-environment as above will force Emacs to encode command-line arguments in UTF-8, which could very well be the reason for some of these problems.

Ah okay.

No.

The issue is complicated by several factors and will take a long post to explain. The upshot is that for passing non-ASCII characters safely to subprograms on their command lines, Emacs should use the system codepage, not UTF-8 or anything else (and definitely not UTF-16). This might require some tricky juggling with coding-system related settings when you call call-process, because coding-system-for-write is used for encoding both the command-line arguments and the stuff we send to the sub-program, so if they both can include non-ASCII characters, some care is in order. (By contrast, coding-system-for-read can always be bound to UTF-8 to decode the output correctly -- assuming 'buku' outputs UTF-8 encoded text on MS-Windows.)

That's very helpful, thank you.

The more important question is: can CRAB emoji be safely encoded by codepage 936, the system codepage of the OP? If not, and if that emoji can appear in the command-line arguments of a 'buku' invocation (as opposed to in the text we write to or read from 'buku'), then this character cannot be used at all with this package on MS-Windows.
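
One quick way to probe that question outside Emacs is to ask Python (the language 'buku' is written in) whether a string survives encoding to codepage 936. This is only a sketch: the helper name `encodable_in` is made up for illustration, and Python's "cp936" codec is an alias for its GBK codec, which is assumed here to be close enough to the Windows system codepage for this check.

```python
def encodable_in(text, codepage="cp936"):
    """Return True if every character of TEXT can be encoded in CODEPAGE.

    Hypothetical helper for illustration only; "cp936" is Python's alias
    for its GBK codec, assumed to approximate Windows codepage 936.
    """
    try:
        text.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    # U+1F980 is the CRAB emoji mentioned in the thread.
    for sample in ("bookmark", "\u4e66\u7b7e", "\U0001F980"):
        print(ascii(sample), encodable_in(sample))
```

If the check returns False for a string, that string cannot be passed as a command-line argument under that codepage, whatever else is done with coding systems.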

(And please note that Emacs now has a native SQLite support, which should make many of these complications simply disappear.)

It would certainly make many things easier to just interact with the db directly. That said, doing so would involve a substantial rewrite, and i've got many things on my plate nowadays, including supporting disabled loved ones while having chronic health issues myself. But maybe i can open an issue requesting help to start and develop a branch doing such a rewrite.

As for why the problems disappear when the CRAB emoji is removed: as I wrote elsewhere, that's probably because all the other characters are plain ASCII, so all the encoding-related issues don't matter.

*nod*

They don't have any effect on Emacs on MS-Windows, that's for sure. Whether they have effect on 'buku' depends on whether it's a native MS-Windows program or Cygwin/MSYS program, and also on its code (a program could potentially augment the MS 'setlocale' function with its own code which looks at the LC_* environment variables, and does TRT in the application code).

*nod*

But what should i do to handle the more general case of an arbitrary encoding? Do i need to have a defcustom, with 'reasonable defaults', that the user can set if necessary, which i use as the value to pass to coding-system-for-read?

That depends on what encoding 'buku' expects on input and what encoding it uses on output. If it always uses UTF-8, you just need to make sure Emacs uses UTF-8 when encoding and decoding text passed to and from 'buku' (but note the caveat about encoding the command-line arguments -- these _must_ be encoded in the system codepage). If, OTOH, the encoding used by 'buku' can be changed dynamically, and Emacs cannot know what it is (for example, if it is determined by the encoding of the text put in the SQL database by the user), then a user option is in order.
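
For the "always UTF-8" case, the decoding half can be illustrated with a small Python sketch. It assumes the child program writes UTF-8 bytes to stdout, as suggested above for 'buku'; the helper name `run_and_decode` and the stand-in child command are hypothetical. The child's command line is deliberately pure ASCII, so the separate caveat about encoding arguments in the system codepage does not come into play.

```python
import subprocess
import sys


def run_and_decode(argv, encoding="utf-8"):
    """Run ARGV, capture raw stdout bytes, and decode them explicitly.

    Hypothetical helper: decoding with a fixed encoding avoids depending
    on the locale or system codepage of the calling process.
    """
    completed = subprocess.run(argv, capture_output=True, check=True)
    return completed.stdout.decode(encoding)


# Stand-in for a program that emits UTF-8 (an assumption about 'buku').
# The argument text is ASCII only; the emoji exists only in the output.
CHILD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write('\\U0001F980 rust'.encode('utf-8'))",
]

if __name__ == "__main__":
    print(ascii(run_and_decode(CHILD)))
```

The Emacs counterpart of the explicit `decode` step is binding coding-system-for-read to UTF-8 around the call-process invocation, as described earlier in this message.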

Great, thank you.

As i interpret their comments in the above discussions so far, yes, they had themselves set LANG to "zh_CN.UTF-8" (and yes, as described above, had definitely called `set-language-environment` with "UTF-8").

NOT RECOMMENDED!

*chuckle* i'll be sure to pass this on. :-)

Thanks again!


Alexis.


