Re: documentation bug re character range expressions

From: Marcel (Felix) Giannelia
Subject: Re: documentation bug re character range expressions
Date: Fri, 03 Jun 2011 00:06:32 -0700
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20100330 Shredder/3.0.4

Is it really a programmer mistake, though, to assume that [A-Z] is only capital letters? A through Z form a contiguous range in every character encoding except EBCDIC, and they are contiguous even in modern Unicode.

In the world of programming, characters are numbers, and programmers know this (especially if they've ever learned any C). In the example of [a-c], programmers are treating letters the way the machine treats them: as numbers.

How is the person typing [a-c] the one making the mistake when it results in matching against values outside of that range? To make it plainer, type it as [\0x61-\0x63] -- if you saw that in a program, you would expect that to cover 0x61, 0x62, 0x63, wouldn't you? If you were designing a programming language, wouldn't you make it do that?

If person A types [\0x61-\0x63] on software written by person B and it comes out matching 0x61, 0x41, 0x62, 0x42, 0x63, and perhaps something completely different when the same code is run on a computer in Russia, who would you say made the programming mistake? Surely not person A.
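To make that expectation concrete, here is a minimal sketch of what "characters as numbers" looks like in the shell. It assumes that forcing the C locale makes the shell compare range endpoints by byte value; the in_range helper is a hypothetical name used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch: test single characters against the range [a-c] in the C locale.
# Assumption: under LC_ALL=C, a range covers the byte values 0x61..0x63 only.
LC_ALL=C

# Hypothetical helper: succeeds if $1 matches the pattern [a-c].
in_range() {
  case "$1" in
    [a-c]) return 0 ;;
    *)     return 1 ;;
  esac
}

matched=""
for ch in a A b B c C d; do
  if in_range "$ch"; then
    matched="$matched$ch"
  fi
done

echo "$matched"   # prints "abc" in the C locale
```

Commenting out the LC_ALL=C line and rerunning under a locale that collates case-insensitively is a quick way to see whether a given bash build shows the behaviour complained about here.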

This is something that wasn't a "bad programming habit" until somewhere, someone made a decision that removed meaning from a sensible, logical-looking syntax.

Let's compare the syntaxes:

Under the old notation, there was:

- a succinct way to specify lowercase letters: [a-z]

- likewise for uppercase: [A-Z]

- likewise for case-insensitive: [A-Za-z]

- an easy way to specify ranges of letters of a particular case: [a-m], [A-M]

- case-insensitive ranges: [A-Ma-m]

Under the new notation, those things are written as:

- lowercase letters: [[:lower:]] (over twice as long to type)

- uppercase letters: [[:upper:]] (likewise)

- case-insensitive: [[:alpha:]] (not as bad, but still longer)

- how *are* you supposed to specify case-sensitive ranges? [abcdefghijklm] looks ridiculous.

- case-insensitive ranges: [a-M] (looks like an error at first glance: "why is the M uppercase?" You need to know something about the system internals to see why that's not wrong, and that something is a lot more complicated to explain than "computers represent letters as numbers")
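For reference, the new-notation forms listed above can be sketched as shell patterns like this. The classify and is_lower_a_to_m helpers are hypothetical names for illustration; the character classes and the explicit list do not depend on collation order, which is what the longer spelling buys.

```shell
#!/usr/bin/env bash
# Sketch: the locale-independent spellings from the list above.

# Hypothetical helper: report the case of a single character.
classify() {
  case "$1" in
    [[:lower:]]) echo lower ;;
    [[:upper:]]) echo upper ;;
    *)           echo other ;;
  esac
}

# Hypothetical helper: lowercase a through m, written out in full
# because a range like [a-m] may also admit uppercase letters.
is_lower_a_to_m() {
  case "$1" in
    [abcdefghijklm]) return 0 ;;
    *)               return 1 ;;
  esac
}

classify a
classify M
classify 5
```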

Bash is a shell. Shells should have a quick, brief, plain language so that one can get things done in them. Shells should also be quite portable: syntax that works on one system should work on any other as much as possible.

[[:alpha:]] is too difficult to type to make it useful for the kind of quick pattern-matching that character ranges are used for on the interactive shell. Try it. Open-bracket, colon is an awkward sequence compared to something like "[a-z]".

But usually one doesn't want all of the alphabet, nor case insensitivity. I have actually never had occasion to say [A-Za-z] on the command line, or even [A-Ca-c]. I have, however, very often wanted to grab everything with a lowercase 'a' through lowercase 'k', for instance.

Previously, that would have been [a-k]. Now I have no way to specify it except [abcdefghijk], and I'm not typing that. A useful feature is gone.
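For anyone who needs the old [a-k] behaviour anyway, one possible workaround is sketched below. It is an assumption on my part that switching to the C locale for the duration of the expansion restores byte-order ranges; list_a_to_k is a hypothetical helper name, and the subshell keeps the locale change from leaking into the rest of the session.

```shell
#!/usr/bin/env bash
# Sketch: list names in the current directory that start with lowercase a..k.
# The function body is a subshell, so LC_ALL=C applies only inside it.
# Assumption: in the C locale, [a-k] covers only the bytes 'a' through 'k'.
list_a_to_k() (
  LC_ALL=C
  # If nothing matches, the pattern itself is printed unchanged.
  printf '%s\n' [a-k]*
)
```

It is hardly as quick to type as [a-k] on its own, which is rather the point of this message, but it does show the feature is recoverable.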

You say this is not only a "bash problem", and that it is the programmer's mistake to assume that [a-c] means the same thing in bash as it does in Perl, Python, Java, C/C++ (POSIX regex.h, with system locale set!), JavaScript, PHP, sed, grep, and on and on. With a list like that, you can see why one might make this "mistake".

And these aren't historical examples; these are modern implementations of these languages that I just tested to double-check, on a system with its locale set to something that collates case-insensitively. Bash is the *only* thing I know of that treats character ranges this way, so I would say that does make it "only a bash problem".

Even grep, whose man page says it obeys LC_COLLATE and the locale, actually has [a-c] equivalent to [abc] on all locales. Someone must have snuck in and fixed it. I'm guessing that if grep were to start using locale-aware character ranges, a heck of a lot more people would complain than do about bash. This is a seldom-used feature in bash but many, many people rely on grep being predictable and standard.


On 2011-06-02 22:32, Jan Schampera wrote:

Just as a side note, not meant to touch the maintainer discussion.

This is not only a "Bash problem". The programmer/user mistake to use
[A-Z] for "only capital letters, capital A to capital Z" is a very
common one.

But I'm not sure if every official application-level documentation
should cover those kinds of pitfalls. There would be many topics around
"bad programming habits" that should be documented.
