
Re: [lmi] What encoding does wx_test console output use?


From: Vadim Zeitlin
Subject: Re: [lmi] What encoding does wx_test console output use?
Date: Sat, 8 Sep 2018 17:36:58 +0200

On Sat, 8 Sep 2018 14:28:38 +0000 Greg Chicares <address@hidden> wrote:

GC> On 2018-09-08 12:36, Vadim Zeitlin wrote:
GC> > On Sat, 8 Sep 2018 09:57:24 +0000 Greg Chicares <address@hidden> wrote:
GC> > 
GC> > GC> $wine ./wx_test --ash_nazg --data_path=/opt/lmi/data --pyx=only_new_pdf >../src/lmi/wx_test_output
GC> > GC> $file -bi wx_test_output
GC> > GC> application/octet-stream; charset=binary
GC> > GC> 
GC> > GC> I'd like to filter this, removing expected lines and leaving only
GC> > GC> unexpected--much as 'nychthemeral_test.sh' does for other tests
GC> > GC> with its '_clutter' sed scripts. I suppose
GC> > GC>   iconv -t UTF-8 -f SOME_ENCODING wx_test_output
GC> > GC> might work, for some value (what?) of SOME_ENCODING.
GC> > 
GC> >  Under "genuine" MSW it would be UTF-16, but I didn't test if it was the
GC> > same thing under Wine. I'd expect it to be...
GC> 
GC> Yes, thanks, this converts it:
GC>   iconv -t UTF-8 -f UTF-16
GC> I should have thought to try that, but I figured that if it was
GC> UTF-16, 'file' should have detected it.
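
 The round trip can be sketched on synthetic data (the sample text and
filename here are illustrative, not taken from the actual test output):

```shell
# Fabricate a line of BOM-less UTF-16LE, the shape that redirected
# MSW console output typically has, then recover UTF-8 with iconv.
printf 'NOTE: sample\n' | iconv -f UTF-8 -t UTF-16LE >wx_sample.raw
iconv -f UTF-16LE -t UTF-8 wx_sample.raw
# → NOTE: sample
```

Naming the endianness explicitly (UTF-16LE) avoids relying on iconv's
BOM-sniffing when, as here, the input has no BOM.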

 I'm surprised it doesn't detect it either. I guess the authors of 'file'
decided that the risk of false positives was too high, but it's still
strange that they didn't think to reuse the same heuristics they already
use to determine whether a file contains text or just some random 7-bit
data for this case too.

GC> 'file' does detect UTF-16 if I iconv it back to UTF-16, though:
GC> 
GC> $iconv -t UTF-8 -f UTF-16 gui_test_output.raw >gui_test_output.txt 
GC> $iconv -f UTF-8 -t UTF-16 gui_test_output.txt >gui_test_output.16  
GC> $file gui_test_output.*
GC> gui_test_output.16:  Little-endian UTF-16 Unicode text
GC> gui_test_output.raw: data
GC> gui_test_output.txt: ASCII text
GC> 
GC> The file grows by two bytes when I convert it back to UTF-16.
GC> 
GC> $od -t x1 gui_test_output.16 |head -1
GC> 0000000 ff fe 4e 00 4f 00 54 00 45 00 3a 00 20 00 73 00
GC> 
GC> The initial U+FEFF BOM is apparently what 'file' needs. But it's
GC> optional, and I guess it's not customary to include it when
GC> stdout is written to in msw.
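
 Those two extra bytes can be seen directly: when the generic "UTF-16"
target is requested, glibc's iconv emits a BOM itself (this describes
glibc's converter on a little-endian host; other iconv implementations
may behave differently):

```shell
# glibc iconv's generic UTF-16 target prepends a BOM (ff fe on a
# little-endian host), which is what lets 'file' classify the result.
printf 'NOTE:\n' | iconv -f UTF-8 -t UTF-16 | od -An -t x1 | head -1
# first bytes: ff fe, then the UTF-16LE code units for "NOTE:"
```

Requesting "UTF-16LE" instead suppresses the BOM, reproducing the
BOM-less form that 'file' fails to recognize.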

 I have two explanations for why the original output doesn't contain a
BOM. The first is conceptual: a BOM is for documents, and an individual
program's output is not really a document. The second is practical: the
shell doing the redirection knows absolutely nothing about the encoding
used by the program, so it can't know whether it needs to prepend a
UTF-16 BOM, a UTF-32 BOM, or something else; it just passes the bytes
through. And for the CRT it would be, I think, pretty difficult if not
impossible to ensure that the BOM is output exactly once for any program
writing to stdout, especially considering that the same process might use
multiple copies of the CRT or even different CRTs.

 Regards,
VZ

