bug-coreutils

bug#13947: bug report for core-utils command : OD


From: Eric Blake
Subject: bug#13947: bug report for core-utils command : OD
Date: Wed, 13 Mar 2013 15:34:14 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130219 Thunderbird/17.0.3

On 03/13/2013 02:16 PM, Marc Grondin wrote:
> Good Afternoon, 

Hello, and thanks for the report.

> 
> My client was attempting to run the command : od -c on this xml file (sample 
> only) 
> ------------------------------------------------------------------------------
> <?xml version = '1.0' encoding = 'UTF-8'?>
> <top>
>    <x>丸</x>

Here, you are representing a character in UTF-8

> He was getting this output : 
> ------------------------------------------------------------------------------
> 0000000   <   ?   x   m   l       v   e   r   s   i   o   n       =    
> 0000020   '   1   .   0   '       e   n   c   o   d   i   n   g       =
> 0000040       '   U   T   F   -   8   '   ?   >  \n   <   t   o   p   >
> 0000060  \n               <   x   >   �   �   �   <   /   x   >  \n    

and here, you were running od in a different character set:

> This all based on the LANG env.  He was using : 
> LANG=en_US.iso88591, instead of
> LANG=en_US.UTF-8 

In ISO-8859-1, every byte is a character, and those particular bytes
happen to be printable, so od faithfully passed each byte through as a
printable character, only for your UTF-8 terminal to then display each
one as an invalid UTF-8 sequence.  Mismatching character sets between
your program and your terminal is always a recipe for confusion.
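
As a rough illustration of that mismatch (a sketch only, not something
from the original report: it assumes a POSIX shell with iconv installed,
and the variable name is made up for the example), the three bytes that
UTF-8 uses for 丸 can be reinterpreted as ISO-8859-1 to reveal the three
separate printable characters od believed it was emitting:

```shell
# The UTF-8 encoding of U+4E38 is the three bytes E4 B8 B8, written here
# as octal escapes so the script itself stays pure ASCII.
# iconv treats those bytes as ISO-8859-1 input and re-encodes the result
# as UTF-8 so that a UTF-8 terminal can display it.
latin1_view=$(printf '\344\270\270' | iconv -f ISO-8859-1 -t UTF-8)
printf '%s\n' "$latin1_view"
```

On a UTF-8 terminal this should print an a-umlaut followed by two
cedillas, one ISO-8859-1 character per input byte.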

However, you HAVE identified a bug in our documentation.

> 
> ------------------------------------------------------------------------------
> 
> Question : 
> Since the output is based on the ASCII character set, should it not, in both 
> cases give a numerical output (as it did in scenario #2) 
> for a symbol outside the ascii/extended-ascii character set ? 

Our documentation is lying.  Here's what POSIX says about od -c:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
"Interpret bytes as characters specified by the current setting of the
LC_CTYPE category. Certain non-graphic characters appear as C escapes:
"NUL=\0" , "BS=\b" , "FF=\f" , "NL=\n" , "CR=\r" , "HT=\t" ; others
appear as 3-digit octal numbers."

Nothing in there restricts the output to ASCII only.  The bytes that are
showing up as � are graphic characters in your current choice of
LC_CTYPE, so there is no escaping performed (since escaping is permitted
only on non-graphic characters).  If your terminal were using the same
character set that you ran od under, you would see proper graphical
characters from the ISO-8859-1 set (but then again, you wouldn't see the
nice 丸 character that the UTF-8 was representing).
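
The escaping rule in the POSIX text quoted above can be checked with a
small experiment.  This is only a sketch: the input bytes and the
variable name are chosen for illustration, and LC_ALL=C is used on the
assumption that it makes the result independent of the caller's locale.

```shell
# Feed od one ordinary letter, three bytes that have C escapes of their
# own (NUL, HT, NL), and one control byte (0x01) that has none.
escaped=$(printf 'a\000\t\n\001' | LC_ALL=C od -An -c)
printf '%s\n' "$escaped"
```

The letter should come back unchanged, the next three bytes as \0, \t
and \n, and the last byte as the 3-digit octal number 001.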

Coreutils is properly obeying the locale; what is wrong is the info
documentation, which stated:

`-c'
     Output as ASCII characters or backslash escapes.

In reality, that should state something like:

     Output as characters in the current locale, using octal sequences
     or backslash escapes for all non-graphic bytes.

Meanwhile, if you want to guarantee ASCII-only output from od, you have
to use a different format, such as -b or -tx1, or use LC_ALL=C on a
system where the C locale does not treat non-ASCII bytes as graphical
characters (most glibc systems, including the one you are using, fit
this bill).
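
For example, a minimal sketch of those three options, applied to the
same three bytes that encode 丸 (the variable names are made up for the
example, and the LC_ALL=C result is what I would expect on glibc rather
than something every system guarantees):

```shell
# -tx1 prints each byte as two hex digits and -b as three octal digits;
# both are numeric formats, so the output is ASCII whatever the locale.
hex_dump=$(printf '\344\270\270' | od -An -tx1)
octal_dump=$(printf '\344\270\270' | od -An -b)
# With -c under the C locale, bytes above 0x7f should not count as
# graphic characters (assumption: true on glibc), so od falls back to
# 3-digit octal for them.
c_locale_dump=$(printf '\344\270\270' | LC_ALL=C od -An -c)
printf '%s\n' "$hex_dump" "$octal_dump" "$c_locale_dump"
```

The first line should read e4 b8 b8 and the second 344 270 270; on a
glibc system the third should match the second.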

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
