From: Manuel Collado
Subject: Re: gawk: Wrong behavior in binary mode
Date: Tue, 16 Dec 2008 10:27:27 +0100
User-agent: Thunderbird 2.0.0.14 (Windows/20080421)

John Cowan wrote:
> Eli Zaretskii scripsit:
>
> > I think you are missing the fact that LC_ALL=C has broad effects other than just disabling multibyte characters. For example, it also causes Gawk to speak US English when displaying messages, and use US format for dates and currency. What do I do if I want my error messages in Hebrew, but need to work with raw binary data that is not a character string?
>
> Quite so. There should be some way to specify the encoding of Gawk's input and output files independent of the locale (IMHO, encoding the character encoding into the locale identifier was just a botch).

I strongly agree! In today's worldwide environment, text-processing utilities should be able to cope with files from different sources with different encodings, and combine them in a single run. This implies having independent encodings for:

- each source file
- each input data file
- each output data file

SGML/XML utilities already do that. For AWK, a possible approach could be:

- Use a fixed, implementation-chosen encoding for internal processing (covering Unicode).
- Convert each source file or input data file to the internal encoding on the fly before processing.
- Convert output data to the external encoding on the fly before printing.

This approach requires a method for specifying file encodings. Examples:

- source files: an explicit @encoding directive, with the locale used only as the default.
- input and output data: the value of a predefined ENCODING variable at open time, with the locale as the default.

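To make the idea concrete, here is a minimal sketch of the decode/process/encode scheme described above, written in Python rather than AWK because gawk has no such facility. Everything in it is an assumption for illustration: the function name `combine`, its parameters, and the per-file encoding labels are hypothetical stand-ins for the proposed ENCODING variable, and Python's `str` plays the role of the fixed internal Unicode representation.

```python
import io


def combine(inputs, output_encoding, transform=lambda line: line):
    """Merge byte streams that use different encodings into one output.

    inputs          -- list of (raw_bytes, encoding_name) pairs; the encoding
                       name stands in for a per-file ENCODING setting
                       (hypothetical, not an existing gawk feature)
    output_encoding -- external encoding used when writing the result
    transform       -- per-line processing done on decoded Unicode text
    """
    out = io.BytesIO()
    # newline="" disables newline translation, so line endings pass through.
    writer = io.TextIOWrapper(out, encoding=output_encoding, newline="")
    for raw, encoding in inputs:
        # Decode on the fly: bytes in the external encoding become str.
        reader = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="")
        for line in reader:
            # Processing only ever sees the internal Unicode representation.
            writer.write(transform(line))
    writer.flush()  # push the encoded bytes into the underlying buffer
    return out.getvalue()


if __name__ == "__main__":
    # Two inputs in different encodings, combined in a single run.
    latin1_file = "señor\n".encode("latin-1")
    utf8_file = "año\n".encode("utf-8")
    merged = combine(
        [(latin1_file, "latin-1"), (utf8_file, "utf-8")],
        "utf-8",
        transform=str.upper,
    )
    print(merged.decode("utf-8"), end="")
```

The point of the sketch is only that the locale plays no part in it: each file carries its own encoding, so message language and data encoding can be chosen separately.
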
Is it OK to discuss this topic in this forum?

-- 
Manuel Collado - http://lml.ls.fi.upm.es/~mcollado