Re: Scan of regexps in Emacs (March 17)

From: Paul Eggert
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Wed, 20 Mar 2019 15:01:51 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.5.3

On 3/19/19 7:20 PM, Stefan Monnier wrote:
> I wonder why the doc doesn't just say that `-` should be the last
> character and not mention the other possibilities which just make the
> rule unnecessarily complex.

'-' can also be the first character in a bracket expression; this is
pretty common and is standard. POSIX also says '-' can be the upper
bound of a range, which is a bit weird (but hey! it's standard).

I went through the documentation and attempted to fix the doc to
describe this mess better by installing the attached patch into the
emacs-26 branch. The basic ideas are:

* The doc already says that regular expressions like "*foo" and "+foo"
are problematic (they're confusing, and POSIX says the behavior is
undefined) and should be avoided. REs like "[a-m-z]" and "[!-[:alpha:]]"
and "[[:alpha:]-~]" are problematic in the same way and also should be
avoided.

* The doc doesn't clearly say when the Emacs range behavior is an
extension to POSIX; saying this will help people know better when they
can export Emacs regular expressions to other programs.

* The doc is confused (and there's a comment about this) about what
happens when one end of a range is unibyte and the other is multibyte. I
added something saying that if one bound is a raw 8-bit byte then the
other should be a unibyte character (either ASCII, or a raw 8-bit byte).
I don't see any good way to specify the behavior when one bound is a raw
8-bit byte and the other bound is a multibyte character, in such a way
that it's a natural extension of the documented behavior, so the
documentation now recommends against that.

* We might as well go ahead and say that [b-a] matches nothing, as
enough code (ab)uses regexps in that way, and there is value in having a
simple regular expression that always fails to match. However, I expect
that we should say that users should avoid wilder examples like [~-!] so
that the trawler can catch them as typos.

These new recommendations ("should"s in the attached patch) will give
the trawler license to diagnose questionable REs like "[a-m-z]",
"[!-[:alpha:]]", "[~-!]", and (my favorite) "[\u00FF-\xFF]". There is no
change to actual Emacs behavior.
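
As a rough illustration of the kind of check these recommendations make
possible, here is a minimal Python sketch. It is not the trawler's code and
not part of the patch: the names tokenize and questionable_ranges, the
diagnostic strings, and the allow_reversed parameter (which tolerates only
"[b-a]" by default) are all hypothetical. It looks at a single bracket
expression and does not model raw 8-bit bytes, so it cannot flag
"[\u00FF-\xFF]".

```python
import re

# Matches a character class such as [:alpha:] inside a bracket expression.
_CLASS = re.compile(r"\[:[a-z]+:\]")


def tokenize(body):
    """Split the inside of a bracket expression into items.

    Each item is either a single character or a whole class like '[:alpha:]'.
    """
    items, i = [], 0
    while i < len(body):
        m = _CLASS.match(body, i)
        if m:
            items.append(m.group())
            i = m.end()
        else:
            items.append(body[i])
            i += 1
    return items


def questionable_ranges(bracket, allow_reversed=(("b", "a"),)):
    """Return diagnostics for one bracket expression such as '[a-m-z]'.

    allow_reversed is an assumed policy knob: reversed ranges listed there
    are accepted as deliberate "match nothing" idioms.
    """
    if not (bracket.startswith("[") and bracket.endswith("]")):
        raise ValueError("expected a complete bracket expression")
    body = bracket[1:-1]
    if body.startswith("^"):
        body = body[1:]  # negation does not affect how ranges are parsed
    items = tokenize(body)

    problems = []
    i, prev_was_range = 0, False
    while i < len(items):
        # X-Y is a range only if the '-' is followed by another item.
        if i + 2 < len(items) and items[i + 1] == "-":
            lo, hi = items[i], items[i + 2]
            if len(lo) > 1 or len(hi) > 1:
                problems.append(f"character class as range bound: {lo}-{hi}")
            elif lo > hi and (lo, hi) not in allow_reversed:
                problems.append(f"reversed range: {lo}-{hi}")
            prev_was_range = True
            i += 3
        else:
            # A '-' right after a range is suspect unless it is the last item.
            if items[i] == "-" and prev_was_range and i != len(items) - 1:
                problems.append("'-' directly after a range")
            prev_was_range = False
            i += 1
    return problems


for rx in ["[a-m-z]", "[!-[:alpha:]]", "[[:alpha:]-~]", "[~-!]", "[b-a]", "[a-z-]"]:
    print(rx, questionable_ranges(rx))
```

With these assumptions, the first four expressions each produce one
diagnostic, while "[b-a]" and "[a-z-]" (where '-' is the last character)
produce none.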

Attachment: 0001-Say-which-regexp-ranges-should-be-avoided.patch
Description: Text Data
