bug#13947: bug report for core-utils command : OD
From: Eric Blake
Subject: bug#13947: bug report for core-utils command : OD
Date: Wed, 13 Mar 2013 15:34:14 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130219 Thunderbird/17.0.3
On 03/13/2013 02:16 PM, Marc Grondin wrote:
> Good Afternoon,
Hello, and thanks for the report.
>
> My client was attempting to run the command : od -c on this xml file (sample
> only)
> ------------------------------------------------------------------------------
> <?xml version = '1.0' encoding = 'UTF-8'?>
> <top>
> <x>丸</x>
Here, you are representing a character in UTF-8
> He was getting this output :
> ------------------------------------------------------------------------------
> 0000000 < ? x m l v e r s i o n =
> 0000020 ' 1 . 0 ' e n c o d i n g =
> 0000040 ' U T F - 8 ' ? > \n < t o p >
> 0000060 \n < x > � � � < / x > \n
and here, you were running od in a different character set:
> This all based on the LANG env. He was using :
> LANG=en_US.iso88591, instead of
> LANG=en_US.UTF-8
In ISO-8859-1, every byte is a character, and those particular bytes
happen to be printable, so od faithfully passed each one through as a
printable character, only for your UTF-8 terminal to then show them as
an invalid UTF-8 sequence. Mismatching character sets between your
program and your terminal is always a recipe for confusion.
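To make the mismatch concrete, here is a small Python sketch (not od itself, and str.isprintable() is only a stand-in for the locale's notion of printable) that takes the UTF-8 bytes of the character from the sample file and reinterprets them as ISO-8859-1:

```python
# UTF-8 encoding of U+4E38, the character inside <x>...</x> in the sample.
data = "丸".encode("utf-8")
print(data.hex(" "))  # the three raw bytes, in hex

# Reinterpret the same bytes as ISO-8859-1, where every byte is one character.
as_latin1 = data.decode("iso-8859-1")
for byte, ch in zip(data, as_latin1):
    # isprintable() approximates what the locale would call a printable character.
    print(f"0x{byte:02x} -> U+{ord(ch):04X} printable={ch.isprintable()}")
```

All three bytes come out as printable ISO-8859-1 characters, which is why od did not escape them.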
However, you HAVE identified a bug, in our documentation.
>
> ------------------------------------------------------------------------------
>
> Question :
> Since the output is based on the ASCII character set, should it not, in both
> cases give a numerical output (as it did in scenario #2)
> for a symbol outside the ascii/extended-ascii character set ?
Our documentation is lying. Here's what POSIX says about od -c:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/od.html
"Interpret bytes as characters specified by the current setting of the
LC_CTYPE category. Certain non-graphic characters appear as C escapes:
"NUL=\0" , "BS=\b" , "FF=\f" , "NL=\n" , "CR=\r" , "HT=\t" ; others
appear as 3-digit octal numbers."
Nothing in there restricts the output to ASCII only. The bytes that are
showing up as � are graphic characters in your current choice of
LC_CTYPE, so there is no escaping performed (since escaping is permitted
only on non-graphic characters). If your terminal were using the same
character set as you ran od under, you would see proper graphical
characters in the ISO-8859-1 set (but then again, you wouldn't see the
nice 丸 character that the UTF-8 was representing).
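As a rough model of the POSIX rule quoted above, the following Python sketch renders one byte at a time for a single-byte character set. It is not how od is implemented: render_byte is a hypothetical helper, the charset name stands in for LC_CTYPE, and str.isprintable() only approximates the locale's test for graphic characters.

```python
# The C escapes that the POSIX text lists for od -c.
C_ESCAPES = {0x00: r"\0", 0x08: r"\b", 0x0C: r"\f",
             0x0A: r"\n", 0x0D: r"\r", 0x09: r"\t"}

def render_byte(byte, charset):
    """Render one byte per the quoted rule (hypothetical helper, single-byte charsets only)."""
    if byte in C_ESCAPES:
        return C_ESCAPES[byte]
    try:
        ch = bytes([byte]).decode(charset)
    except UnicodeDecodeError:
        return f"{byte:03o}"  # not a character in this charset: 3-digit octal
    # Printable characters pass through unchanged; everything else is octal.
    return ch if ch.isprintable() else f"{byte:03o}"

sample = "<x>丸</x>\n".encode("utf-8")
print(" ".join(render_byte(b, "ascii") for b in sample))       # models an ASCII-only locale
print(" ".join(render_byte(b, "iso-8859-1") for b in sample))  # models the reporter's locale
```

With "ascii" the three bytes of 丸 are rendered as octal numbers; with "iso-8859-1" they pass through as characters, matching the two scenarios in the report.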
Coreutils is properly obeying the locale; what is wrong is the info
documentation, which stated:
`-c'
Output as ASCII characters or backslash escapes.
In reality, that should state something like:
Output as characters in the current locale, using octal sequences
or backslash escapes for all non-graphic bytes.
Meanwhile, if you want to guarantee ASCII-only output from od, you have
to use a different format, such as -b or -tx1, or use LC_ALL=C on a
system where the C locale does not treat non-ASCII bytes as graphical
characters (most glibc systems, including the one you are using, fit
this bill).
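For comparison, a byte-oriented dump never depends on the character set. This Python sketch only imitates the per-byte layout of -b (octal) and -tx1 (hex) for the bytes of 丸; it does not call od, and it leaves out od's offset column:

```python
data = "丸".encode("utf-8")

# One 3-digit octal number per byte, in the style of od -b.
octal_bytes = " ".join(f"{b:03o}" for b in data)
# One 2-digit hex number per byte, in the style of od -tx1.
hex_bytes = " ".join(f"{b:02x}" for b in data)

print(octal_bytes)
print(hex_bytes)
```

Both lines contain only ASCII digits and letters, whatever locale the terminal is using.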
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org