bug-bison
Re: [PATCH 10/11] quote consistently and make tests pass with new quotin


From: Akim Demaille
Subject: Re: [PATCH 10/11] quote consistently and make tests pass with new quoting from gnulib
Date: Wed, 25 Jan 2012 14:04:55 +0100

Hi Paul,

I'm sending this message to you as the main author of
the quotearg module.  I am not sure which component should
be considered guilty here, but the problem is:

- independently of any LC_*, localcharset.c returns UTF-8
  on OS X.

- If I instrument localcharset.c, I can see that the OS
  returns "US-ASCII" as locale_codeset.

- localcharset's get_charset_aliases then maps US-ASCII
  to UTF-8 (this is where it looks wrong to me, but...).
  See the excerpt below.  FWIW, I have also attached the
  charset.alias file.

- so quotearg decides to use nice UTF-8 quotes (since
  quote.c asks for locale-dependent quotes).  See
  gettext_quote below.

- so the test suite fails since it expects plain old "'".
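
To make the alias step concrete, here is a minimal sketch of how a
lookup over such a NUL-separated table behaves once it contains a
catch-all "*" entry.  It is an illustration, not the actual
localcharset.c code: the table is abbreviated, and the names
darwin_aliases and resolve_codeset are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Abbreviated copy of the inlined Darwin table quoted in this thread:
   NUL-separated (alias, canonical) pairs.  The implicit NUL that ends
   the string literal acts as the empty-string terminator.  */
static const char darwin_aliases[] =
  "ISO8859-1" "\0" "ISO-8859-1" "\0"
  "eucJP" "\0" "EUC-JP" "\0"
  "*" "\0" "UTF-8" "\0";

/* Hypothetical helper: return the canonical name for CODESET, or
   CODESET itself when no entry matches.  */
static const char *
resolve_codeset (const char *codeset)
{
  const char *p = darwin_aliases;

  while (*p != '\0')
    {
      const char *alias = p;
      const char *canonical = alias + strlen (alias) + 1;

      /* An exact match wins; the "*" entry matches any codeset.  */
      if (strcmp (codeset, alias) == 0 || strcmp (alias, "*") == 0)
        return canonical;

      /* Skip to the next (alias, canonical) pair.  */
      p = canonical + strlen (canonical) + 1;
    }
  return codeset;
}
```

With this table, resolve_codeset ("US-ASCII") falls through to the
"*" entry and yields "UTF-8", which is the behavior described above.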

What module would be considered faulty here?  I can provide
a patch, but I would first like to know for which part :)

Thanks!

        Akim

On 23 Jan 2012, at 16:06, Akim Demaille wrote:

> 
> On 23 Jan 2012, at 15:34, Jim Meyering wrote:
> 
>>> I had never realized that the tests are not specifying LC_ALL=C
>>> and they should.  But even when I do, I still have nice quotes.
>> 
>> Hi Akim,
>> 
>> Maybe you need to set LANG to empty or to C?
>> glibc honors LANG (erroneously, imho)
> 
> My tests were on OS X.  LANG=C, or unset, does not
> change anything.
> 
> Some digging led me into this:
> 
>> # if defined DARWIN7
>>      /* To avoid the trouble of installing a file that is shared by many
>>         GNU packages -- many packaging systems have problems with this --,
>>         simply inline the aliases here.  */
>>      cp = "ISO8859-1" "\0" "ISO-8859-1" "\0"
>>           "ISO8859-2" "\0" "ISO-8859-2" "\0"
>>           "ISO8859-4" "\0" "ISO-8859-4" "\0"
>>           "ISO8859-5" "\0" "ISO-8859-5" "\0"
>>           "ISO8859-7" "\0" "ISO-8859-7" "\0"
>>           "ISO8859-9" "\0" "ISO-8859-9" "\0"
>>           "ISO8859-13" "\0" "ISO-8859-13" "\0"
>>           "ISO8859-15" "\0" "ISO-8859-15" "\0"
>>           "KOI8-R" "\0" "KOI8-R" "\0"
>>           "KOI8-U" "\0" "KOI8-U" "\0"
>>           "CP866" "\0" "CP866" "\0"
>>           "CP949" "\0" "CP949" "\0"
>>           "CP1131" "\0" "CP1131" "\0"
>>           "CP1251" "\0" "CP1251" "\0"
>>           "eucCN" "\0" "GB2312" "\0"
>>           "GB2312" "\0" "GB2312" "\0"
>>           "eucJP" "\0" "EUC-JP" "\0"
>>           "eucKR" "\0" "EUC-KR" "\0"
>>           "Big5" "\0" "BIG5" "\0"
>>           "Big5HKSCS" "\0" "BIG5-HKSCS" "\0"
>>           "GBK" "\0" "GBK" "\0"
>>           "GB18030" "\0" "GB18030" "\0"
>>           "SJIS" "\0" "SHIFT_JIS" "\0"
>>           "ARMSCII-8" "\0" "ARMSCII-8" "\0"
>>           "PT154" "\0" "PT154" "\0"
>>         /*"ISCII-DEV" "\0" "?" "\0"*/
>>           "*" "\0" "UTF-8" "\0";
>> # endif
> 
> which, IIUC, maps my "US-ASCII" (which is the
> answer on my system for locale_codeset in locale_charset)
> to UTF-8.  And then, it seems to be hard-coded to use UTF-8
> quotes in quotearg.
> 
>> /* MSGID approximates a quotation mark.  Return its translation if it
>>   has one; otherwise, return either it or "\"", depending on S.
>> 
>>   S is either clocale_quoting_style or locale_quoting_style.  */
>> static char const *
>> gettext_quote (char const *msgid, enum quoting_style s)
>> {
>>  char const *translation = _(msgid);
>>  char const *locale_code;
>> 
>>  if (translation != msgid)
>>    return translation;
>> 
>>  /* For UTF-8 and GB-18030, use single quotes U+2018 and U+2019.
>>     Here is a list of other locales that include U+2018 and U+2019:
>> 
>>        ISO-8859-7   0xA1                 KOI8-T       0x91
>>        CP869        0x8B                 CP874        0x91
>>        CP932        0x81 0x65            CP936        0xA1 0xAE
>>        CP949        0xA1 0xAE            CP950        0xA1 0xA5
>>        CP1250       0x91                 CP1251       0x91
>>        CP1252       0x91                 CP1253       0x91
>>        CP1254       0x91                 CP1255       0x91
>>        CP1256       0x91                 CP1257       0x91
>>        EUC-JP       0xA1 0xC6            EUC-KR       0xA1 0xAE
>>        EUC-TW       0xA1 0xE4            BIG5         0xA1 0xA5
>>        BIG5-HKSCS   0xA1 0xA5            EUC-CN       0xA1 0xAE
>>        GBK          0xA1 0xAE            Georgian-PS  0x91
>>        PT154        0x91
>> 
>>     None of these is still in wide use; using iconv is overkill.  */
>>  locale_code = locale_charset ();
>>  fprintf (stderr, "charset: %s\n", locale_code);
> 
> I get "charset: UTF-8".
> 
>>  if (STRCASEEQ (locale_code, "UTF-8", 'U','T','F','-','8',0,0,0,0))
>>    return msgid[0] == '`' ? "\xe2\x80\x98": "\xe2\x80\x99";
>>  if (STRCASEEQ (locale_code, "GB18030", 'G','B','1','8','0','3','0',0,0))
>>    return msgid[0] == '`' ? "\xa1\xae": "\xa1\xaf";
>> 
>>  return (s == clocale_quoting_style ? "\"" : "'");
>> }
> 
> 
> My understanding is that there is nothing prepared for me to override
> this, since bison is using:
> 
>> /* Return an unambiguous printable representation of NAME,
>>   allocated in slot N, suitable for diagnostics.  */
>> char const *
>> quote_n (int n, char const *name)
>> {
>>  return quotearg_n_style (n, locale_quoting_style, name);
>> }
> 
> I could add some dependency on LC_ALL here, but it looks wrong.
> It feels wrong that even with LC_CTYPE=C, I get UTF-8.
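
For reference, the quote selection in the quoted gettext_quote can
be reduced to the following sketch, with the gettext lookup left
out and the charset passed in explicitly so the effect of each
value is easy to check.  The names pick_quote and
quoting_style_sketch are hypothetical, and POSIX strcasecmp stands
in for gnulib's STRCASEEQ macro.

```c
#include <string.h>
#include <strings.h>

/* Stand-in for the two styles that gettext_quote distinguishes.  */
enum quoting_style_sketch { locale_style, clocale_style };

/* Hypothetical reduction of gettext_quote: choose the opening or
   closing quote for MSGID given the CHARSET name.  */
static const char *
pick_quote (const char *msgid, const char *charset,
            enum quoting_style_sketch s)
{
  /* UTF-8: U+2018 (opening) or U+2019 (closing).  */
  if (strcasecmp (charset, "UTF-8") == 0)
    return msgid[0] == '`' ? "\xe2\x80\x98" : "\xe2\x80\x99";

  /* GB18030 encodes the same two characters as 0xA1 0xAE / 0xA1 0xAF.  */
  if (strcasecmp (charset, "GB18030") == 0)
    return msgid[0] == '`' ? "\xa1\xae" : "\xa1\xaf";

  /* Any other charset falls back to plain ASCII quotes.  */
  return s == clocale_style ? "\"" : "'";
}
```

With charset "US-ASCII" and locale_style, pick_quote returns "'",
which is what the test suite expects; with "UTF-8" it returns the
curly quotes.  So the output depends entirely on what
locale_charset reports.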


Attachment: charset.alias.txt
Description: Text document


