bug-texinfo

Re: texi2dvi: locale-dependent error in egrep [A-z]


From: Martin von Gagern
Subject: Re: texi2dvi: locale-dependent error in egrep [A-z]
Date: Wed, 31 Mar 2010 09:11:08 +0200
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.8) Gecko/20100322 Thunderbird/3.0.3

On 31.03.2010 01:20, Karl Berry wrote:
> Thanks for the report.  I changed it to [A-Za-z] but I don't actually
> understand why [A-z] is invalid in UTF-8, including American English UTF-8.
> 
> $ env LC_ALL=en_US.utf8 grep '[A-z]' /etc/issue
> grep: Invalid range end
> 
> UTF-8 is the same as ASCII in this area, so where's the beef?
> Can you (or anyone) here explain?
> 
> Of course I know that [A-z] includes the ASCII characters between Z
> and a, namely  [\]^_`  which technically aren't allowed as DOS drive
> letters, so the range has always been incorrect in that sense, but I
> don't see why it's an "invalid range end" in UTF8 (and not "C").

I don't know the full answer, but I did some investigation.

For locales other than "C", characters are collated according to
different rules, which of course affects range expressions as well.
It seems that glibc in our locales collates upper case letters AFTER
lower case letters, so "A" sorts after "z" and [A-z] runs backwards.
If you try [z-A] instead, you get a valid regular expression in the
"de_DE.utf8" locale. Not one useful here, but valid nevertheless.
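
A minimal shell sketch for probing this from the command line. The
helper name range_status is made up for illustration; it only wraps
grep and reports its exit status. The calls shown use the "C" locale,
where the result is predictable. For any other locale name the outcome
depends on the installed locale data and the libc version, and a locale
that is not installed typically falls back to "C" behaviour.

```shell
# Hypothetical helper: report how grep treats a bracket range in a locale.
# Prints "match", "no-match", or "error" (grep exit status 0, 1, or >= 2).
range_status() {
    locale_name=$1   # e.g. C, en_US.utf8, de_DE.utf8
    pattern=$2       # bracket expression to test
    input=$3         # single line of text to match against
    printf '%s\n' "$input" | LC_ALL=$locale_name grep -q -- "$pattern" 2>/dev/null
    case $? in
        0) echo match ;;
        1) echo no-match ;;
        *) echo error ;;
    esac
}

range_status C '[A-z]' 'q'      # plain letter
range_status C '[A-z]' '_'      # "_" lies between Z and a in ASCII
range_status C '[A-Za-z]' '_'   # the corrected range excludes it
range_status C '[z-A]' 'q'      # reversed range in the "C" locale
```

Replacing C with de_DE.utf8 or en_US.utf8 in the first and last calls
is the experiment described above, assuming those locales are installed.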

Turning to the source, I searched for REG_ERANGE in
http://sourceware.org/git/?p=glibc.git;a=blob;f=posix/regcomp.c#l2867
There are several occurrences in that file, none of them obviously to
blame. Some debugging with gdb points to the one around line 2867 as the
most likely candidate:

> start_collseq = lookup_collation_sequence_value (start_elem);
> end_collseq = lookup_collation_sequence_value (end_elem);
> /* Check start/end collation sequence values.  */
> if (BE (start_collseq == UINT_MAX || end_collseq == UINT_MAX, 0))
>   return REG_ECOLLATE;
> if (BE ((syntax & RE_NO_EMPTY_RANGES) && start_collseq > end_collseq, 0))
>   return REG_ERANGE;

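With [A-z], and if my debug symbols are right, we have start_collseq=580
and end_collseq=568 here, again for de_DE.utf8. Since start_collseq is
greater than end_collseq, the second check fires and returns REG_ERANGE,
which grep reports as "Invalid range end".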
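
The comparison can be replayed as a toy model in shell. This is not
glibc code: check_range is a made-up name, the model assumes
RE_NO_EMPTY_RANGES is set, and the only real inputs are the two
collation sequence values I saw in gdb.

```shell
# Toy model of the quoted range check; assumes RE_NO_EMPTY_RANGES is set.
UINT_MAX=4294967295   # assumes a 32-bit unsigned int, as on Linux x86_64

check_range() {
    start_collseq=$1
    end_collseq=$2
    # An element with no collation sequence value is a collation error.
    if [ "$start_collseq" -eq "$UINT_MAX" ] || [ "$end_collseq" -eq "$UINT_MAX" ]; then
        echo REG_ECOLLATE
    # A range whose start collates after its end is rejected.
    elif [ "$start_collseq" -gt "$end_collseq" ]; then
        echo REG_ERANGE
    else
        echo REG_NOERROR
    fi
}

check_range 580 568   # [A-z] with the de_DE.utf8 values: REG_ERANGE
check_range 568 580   # [z-A] with the same values: REG_NOERROR
```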

Greetings,
 Martin
