Re: documentation bug re character range expressions

From: Aharon Robbins
Subject: Re: documentation bug re character range expressions
Date: Fri, 03 Jun 2011 11:03:23 +0300

This is a thorny issue that plagues all POSIX-compliant utilities,
not just Bash.  (POSIX locales are just a blight.)

For gawk 4.0, I have said "to heck with it" and changed gawk so that
ranges act like they are in the C locale (unless --posix is used).
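
To make the difference concrete, here is a rough sketch in Python (not gawk's actual code). It assumes a made-up collation order in which lower- and uppercase letters interleave (aAbBcC...), which is how many non-C locales behave; real locale tables are more complicated. The names COLLATION, range_by_codepoint and range_by_collation are invented for this illustration.

```python
import string

# Hypothetical collation sequence: aAbBcC...zZ (an assumption for illustration,
# not taken from any real locale definition).
COLLATION = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def range_by_codepoint(start, end):
    """Expand start-end by numeric character code, as in the C locale."""
    return [chr(c) for c in range(ord(start), ord(end) + 1)]


def range_by_collation(start, end, order=COLLATION):
    """Expand start-end by position in the given collation sequence."""
    i, j = order.index(start), order.index(end)
    return list(order[i:j + 1])


print(range_by_codepoint("a", "c"))   # ['a', 'b', 'c']
print(range_by_collation("a", "c"))   # ['a', 'A', 'b', 'B', 'c']
```

Under the assumed ordering, range_by_collation("a", "M") yields a through m in both cases, which is why [a-M] reads as a case-insensitive range in such a locale.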

I and some other people are campaigning to make similar fixes to the
other prominent GNU utilities, since this has to be one of the most
frequent of the FAQs.

Please pray for us!



In article <address@hidden> you write:
>Is it really a programmer mistake, though, to assume that [A-Z] is only 
>capital letters? A through Z are a contiguous range in every 
>representation system except EBCDIC, and they are even contiguous in 
>modern Unicode.
>In the world of programming characters are numbers, and programmers know 
>this (especially if they've ever learned any C). For the example of 
>[a-c], programmers are treating letters the way the machine treats them, 
>as numbers.
>How is the person typing [a-c] the one making the mistake when it 
>results in matching against values outside of that range? To make it 
>plainer, type it as [\0x61-\0x63] -- if you saw that in a program, you 
>would expect that to cover 0x61, 0x62, 0x63, wouldn't you? If you were 
>designing a programming language, wouldn't you make it do that?
>If person A types [\0x61-\0x63] on software written by person B and it 
>comes out matching 0x61, 0x41, 0x62, 0x42, 0x63, and perhaps something 
>completely different when the same code is run on a computer in Russia, 
>who would you say made the programming mistake? Surely not person A.
>This is something that wasn't a "bad programming habit" until somewhere, 
>someone made a decision that removed meaning from a sensible, 
>logical-looking syntax.
>Let's compare the syntaxes:
>Under the old notation, there was:
>- a succinct way to specify lowercase letters: [a-z]
>- likewise for uppercase: [A-Z]
>- likewise for case-insensitive: [A-Za-z]
>- an easy way to specify ranges of letters of a particular case: [a-m]
>- case-insensitive ranges: [A-Ma-m]
>Under the new notation, those things are written as:
>- lowercase letters: [[:lower:]] (over twice as long to type)
>- uppercase letters: [[:upper:]] (likewise)
>- case-insensitive: [[:alpha:]] (not as bad, but still longer)
>- how *are* you supposed to specify case-sensitive ranges? 
>[abcdefghijklm] looks ridiculous.
>- case-insensitive ranges: [a-M] (looks like an error at first glance: 
>"why is the M uppercase?" you need to know something about the system 
>internals to see why that's not wrong. And that something is a lot more 
>complicated to explain than "computers represent letters as numbers")
>Bash is a shell. Shells should have a quick, brief, plain language so 
>that one can get things done in them. Shells should also be quite 
>portable: syntax that works on one system should work on any other as 
>much as possible.
>[[:alpha:]] is too difficult to type to make it useful for the kind of 
>quick pattern-matching that character ranges are used for on the 
>interactive shell. Try it. Open-bracket, colon is an awkward sequence 
>compared to something like "[a-z]".
>But usually one doesn't want all of the alphabet, nor case 
>insensitivity. I have actually never had occasion to say [A-Za-z] on the 
>command line, or even [A-Ca-c]. I have, however, very often wanted to 
>grab everything with a lowercase 'a' through lowercase 'k', for instance.
>Previously, that would have been [a-k]. Now I have no way to specify it 
>except [abcdefghijk], and I'm not typing that. A useful feature is gone.
>You say this is not only a "bash problem" because it's a programmer's 
>mistake to assume that [a-c] means the same thing in bash as it does in 
>Perl, Python, Java, C/C++ (POSIX regex.h, with system locale set!), 
>JavaScript, PHP, sed, grep, and on and on -- you can see why one might 
>make this "mistake".
>And these aren't historical examples, these are modern implementations 
>of these languages that I just tested this on to double-check, on a 
>system with its locale set to something that collates 
>case-insensitively. Bash is the *only* thing I know of that treats 
>character ranges this way, so I would say that does make it "only a bash 
>problem".
>Even grep, whose man page says it obeys LC_COLLATE and the locale, 
>actually has [a-c] equivalent to [abc] on all locales. Someone must have 
>snuck in and fixed it. I'm guessing that if grep were to start using 
>locale-aware character ranges, a heck of a lot more people would 
>complain than do about bash. This is a seldom-used feature in bash but 
>many, many people rely on grep being predictable and standard.
>On 2011-06-02 22:32, Jan Schampera wrote:
>> Hi,
>> just as side note, not meant to touch the maintainer discussion.
>> This is not only a "Bash problem". The programmer/user mistake to use
>> [A-Z] for "only capital letters, capital A to capital Z" is a very
>> common one.
>> But I'm not sure if every official application-level documentation
>> should cover those kinds of pitfalls. There would be many topics around
>> "bad programming habits" that should be documented.
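
For anyone who wants to check one of the languages mentioned in the quoted message, the following is a minimal sketch using Python's standard re module. The sample string is made up for the test. With no flags given, a bracket range is taken by character code, so this says nothing about how the other listed tools behave.

```python
import re

sample = "aAbBcCdD"  # made-up test input

# No flags: [a-c] covers only the codes 0x61 through 0x63.
matched = re.findall(r"[a-c]", sample)
print(matched)  # ['a', 'b', 'c']

# Writing the endpoints as hex escapes selects the same set.
matched_hex = re.findall(r"[\x61-\x63]", sample)
print(matched_hex == matched)  # True
```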

Aharon (Arnold) Robbins                         arnold AT skeeve DOT com
P.O. Box 354            Home Phone: +972  8 979-0381
Nof Ayalon              Cell Phone: +972 50 729-7545
D.N. Shimshon 99785     ISRAEL
