bug-coreutils

Re: /usr/bin/printf: invalid universal character name


From: Hermann Peifer
Subject: Re: /usr/bin/printf: invalid universal character name
Date: Sun, 11 May 2008 21:15:01 +0200
User-agent: Thunderbird 2.0.0.12 (X11/20080227)

Jim wrote:
Hermann Peifer <address@hidden> wrote:
Jim wrote:
Hermann Peifer <address@hidden> wrote:

printf \uHHHH is expected to print Unicode chars. This works fine in
most cases, but some legal code points are reported as errors: values
in the ASCII range, C1 control chars, and values in the range
U+D800..U+DFFF.

I would say that this behaviour is rather a bug than a feature.

Thanks for the report, but this is not some arbitrary restriction,
but rather conformance to the standard (C99, ISO/IEC 10646) for
"universal character name" syntax:

  http://www.open-std.org/jtc1/sc22/wg14/www/docs/n717.htm

Here's part of printf.c, with a comment that probably came from
a version of N717:

      /* A universal character name shall not specify a character short
         identifier in the range 00000000 through 00000020, 0000007F through
         0000009F, or 0000D800 through 0000DFFF inclusive. A universal
         character name shall not designate a character in the required
         character set.  */
      if ((uni_value <= 0x9f
           && uni_value != 0x24 && uni_value != 0x40 && uni_value != 0x60)
          || (uni_value >= 0xd800 && uni_value <= 0xdfff))
        error (EXIT_FAILURE, 0, _("invalid universal character name \\%c%0*x"),
               esc_char, (esc_char == 'u' ? 4 : 8), uni_value);
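
For readers who want to experiment with the rule, here is a minimal sketch of the same condition as a standalone predicate. The function name ucn_is_accepted is hypothetical and does not exist in coreutils; only the numeric ranges are taken from the snippet above. Note that the condition appears to follow the final C99 wording (section 6.4.3), which excludes every value below 00A0 other than 0024, 0040 and 0060, rather than the narrower N717 ranges quoted in the comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of coreutils: returns true when VALUE
   would pass the check quoted from printf.c, false when printf would
   report "invalid universal character name".  */
bool
ucn_is_accepted (uint32_t value)
{
  /* Below U+00A0 only DOLLAR SIGN, COMMERCIAL AT and GRAVE ACCENT pass.  */
  if (value <= 0x9f)
    return value == 0x24 || value == 0x40 || value == 0x60;

  /* The surrogate range U+D800..U+DFFF is always rejected.  */
  if (value >= 0xd800 && value <= 0xdfff)
    return false;

  return true;
}
```

Under this sketch, ucn_is_accepted (0x41) is false (LATIN CAPITAL LETTER A is rejected) while ucn_is_accepted (0xa0) is true.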


/usr/bin/printf: invalid universal character name \u0000
/usr/bin/printf: invalid universal character name \u0001

...

I can understand that you'd find the restriction surprising,
but I wouldn't call it a bug.

Thanks for your swift reply. (BTW: are mails to address@hidden
not copied to gnu.utils.bug?)

No.  That's a separate list.

I do acknowledge that the C0 and C1 control chars are something of a
borderline case. It is true that the Unicode standard does not assign
*normative names* to them but instead uses the placeholder "<control>"
as a dummy name (btw, this was different in earlier versions of
Unicode). However, all C0 and C1 *code points* are at least included
in:

http://www.unicode.org/charts/PDF/U0000.pdf
http://www.unicode.org/charts/PDF/U0080.pdf
http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt

And I didn't expect /usr/bin/printf to worry about normative or
non-normative names of Unicode chars, but rather print the chars
themselves.

If we leave the control chars question aside, it is still hard to
believe that it is not a bug that almost all ASCII chars 0020..007e
lead to EXIT_FAILURE. This rule is more than peculiar, to say the
least, and it is also inconsistent with its own comment:

     if ((uni_value <= 0x9f
           && uni_value != 0x24 && uni_value != 0x40 && uni_value != 0x60)


Only DOLLAR SIGN, COMMERCIAL AT and GRAVE ACCENT are legal in the
range 0x00..0x9f?

I still think that these 92 cases are bugs, rather than anything else:

/usr/bin/printf: invalid universal character name \u0020
/usr/bin/printf: invalid universal character name \u0021
...
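
The figure of 92 can be checked mechanically. The sketch below counts how many values in a given range fail the low-range part of the quoted condition; both function names are hypothetical and exist only for this illustration. For the printable ASCII range 0x20..0x7e there are 95 code points, of which 3 are exempted, leaving 92.

```c
#include <stdbool.h>

/* Hypothetical helper mirroring the low-range part of the quoted
   condition: true when printf would reject the value.  */
bool
low_range_rejected (unsigned int v)
{
  return v <= 0x9f && v != 0x24 && v != 0x40 && v != 0x60;
}

/* Count the rejected values in the inclusive range FIRST..LAST.  */
int
count_rejected (unsigned int first, unsigned int last)
{
  int n = 0;
  for (unsigned int v = first; v <= last; v++)
    if (low_range_rejected (v))
      n++;
  return n;
}
```

With these definitions, count_rejected (0x20, 0x7e) evaluates to 92, matching the number of printable ASCII characters that trigger the error.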

I don't know the motivation for those exceptions.
Paul Eggert added this feature 8 years ago, so things may have changed.

FYI, there are plenty of odd-looking exceptions in this domain.
For a taste, see the function, ucn_valid_in_identifier, in gcc's
libcpp/charset.c

That code determines that this is valid C99 code (with -fextended-identifiers):

    int ok\u09CB = 1;

but this is not:

    int not_ok\u09FF = 1;

Just an addition concerning the borderline case, i.e. the control chars. From the Unicode FAQ:

> Unicode: 0000..007F; Basic Latin
> 10646: 0020-007E BASIC LATIN

> Unicode: 0080..00FF; Latin-1 Supplement
> 10646: 00A0-00FF LATIN-1 SUPPLEMENT

see: http://www.unicode.org/faq/blocks_ranges.html#21

Since the /usr/bin/printf man page talks about Unicode characters (rather than ISO 10646 characters), I would say the control chars should be included.

Hermann



