Re: Scan of regexps in Emacs (March 17)

From: Mattias Engdegård
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Thu, 21 Mar 2019 12:15:57 +0100

On 20 March 2019 at 23:01, Paul Eggert <address@hidden> wrote:
> On 3/19/19 7:20 PM, Stefan Monnier wrote:
>> I wonder why the doc doesn't just say that `-` should be the last
>> character and not mention the other possibilities which just make the
>> rule unnecessarily complex.

Agreed, that is what the 'how to write regexps' part of the docs should say. 
But don't we also need a precise description of exactly how they are 
interpreted by the engine? Otherwise, a user cannot read and understand 
existing code. (Unless he or she uses xr!) Perhaps there needs to be a separate 
'gritty details' section.

> * The doc already says that regular expressions like "*foo" and "+foo"
> are problematic (they're confusing, and POSIX says the behavior is
> undefined) and should be avoided. REs like "[a-m-z]" and "[!-[:alpha:]]"
> and "[[:alpha:]-~]" are problematic in the same way and also should be
> avoided.

I'm with Stefan here; `-' should go last. Anything else is a gritty detail.

> * The doc doesn't clearly say when the Emacs range behavior is an
> extension to POSIX; saying this will help people know better when they
> can export Emacs regular expressions to other programs.

Documenting differences from POSIX regexps is useful. Would you prefer to have 
those differences spread out, or all concentrated in one section?

These days, a user may be more familiar with the various PCRE dialects than 
traditional or extended POSIX. Should that be taken into account?

> * The doc is confused (and there's a comment about this) about what
> happens when one end of a range is unibyte and the other is multibyte. I
> added something saying that if one bound is a raw 8-bit byte then the
> other should be a unibyte character (either ASCII, or a raw 8-bit byte).
> I don't see any good way to specify the behavior when one bound is a raw
> 8-bit byte and the other bound is a multibyte character, in such a way
> that it's a natural extension of the documented behavior, so the
> documentation now recommends against that.

The terminology is a bit confusing. Is 'raw 8-bit byte' included in 'unibyte'? 
Is \x7f ever a raw 8-bit byte?
I agree that [å-\xff], say, should be invalid but I've never seen such 

> * We might as well go ahead and say that [b-a] matches nothing, as
> enough code (ab)uses regexps in that way, and there is value in having a
> simple regular expression that always fails to match. However, I expect
> that we should say that users should avoid wilder examples like [~-!] so
> that the trawler can catch them as typos.

It already does, and some bugs were found that way. As a special case, it no 
longer complains about z-a because that is unlikely to be an accident and 
occurs in some code on purpose.
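
To make the idea concrete, here is a minimal Python sketch of a reversed-range 
check with an exemption for z-a. It is not xr's code; the function names, the 
`allowed` parameter, and the simplified parsing (only the text between '[' and 
']', no character classes, no leading '^') are assumptions for illustration.

```python
def ranges(body):
    """Yield (lo, hi) for each X-Y range in the text between '[' and ']'.

    Simplified on purpose: character classes such as [:alpha:] and a
    leading '^' are not handled.
    """
    i = 0
    while i < len(body):
        # A '-' counts as a range operator only if a character follows it.
        if i + 2 < len(body) and body[i + 1] == "-":
            yield body[i], body[i + 2]
            i += 3
        else:
            i += 1


def reversed_ranges(body, allowed=(("z", "a"),)):
    """Return ranges whose start sorts after their end, minus exemptions.

    `allowed` is a hypothetical exemption list; here it holds only z-a.
    """
    return [(lo, hi) for lo, hi in ranges(body)
            if ord(lo) > ord(hi) and (lo, hi) not in allowed]
```

With these definitions, `reversed_ranges("~-!")` reports the range while 
`reversed_ranges("z-a")` reports nothing.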

I'm not sure it's a good idea to document reversed ranges as a recommended way 
to match any or no character (although the description of the semantics would 
belong in a 'gritty details' section), and only to use [Y-X] where Y=X+1. More 
about that in a separate post.

> These new recommendations ("should"s in the attached patch) will give
> the trawler license to diagnose questionable REs like "[a-m-z]",
> "[!-[:alpha:]]", "[~-!]", and (my favorite) "[\u00FF-\xFF]". There is no
> change to actual Emacs behavior.

As an experiment, I added detection of 'chained' ranges like [a-m-z] to xr and 
found a handful in both Emacs and GNU ELPA, but none of them carried a freeload 
of bugs. Keeping that check didn't seem worthwhile; the regexps may be a bit 
odd-looking, but aren't wrong.
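
For readers who want to try something similar, a chained-range check could 
look roughly like the Python sketch below. This is not what was added to xr; 
the function name is hypothetical, and the input is assumed to be just the 
text between '[' and ']' with no character classes and no leading '^'.

```python
def has_chained_range(body):
    """Return True if a range X-Y is directly followed by '-' and more text.

    `body` is the text between '[' and ']'. A '-' that is the last
    character is treated as a literal, so "a-z-" is not flagged.
    """
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # The range occupies positions i .. i+2. A '-' at i+3 with at
            # least one character after it makes the text look like X-Y-Z.
            if i + 4 < len(body) and body[i + 3] == "-":
                return True
            i += 3
        else:
            i += 1
    return False
```

Under these assumptions, `has_chained_range("a-m-z")` is true, while 
`has_chained_range("a-z-")` and `has_chained_range("a-z0-9")` are false.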

[!-[:alpha:]] is already detected since xr parses it correctly and will 
complain about the duplication of ':'. The reverse, [[:digit:]-z], is seen 
occasionally but again does not seem to be a serious bug proxy.

Much as I would like to outlaw ranges where a typical programmer has to consult 
an ASCII table to understand what's included, they just seem too common, with 
too many false positives, to merit inclusion in xr.
Nevertheless I had a quick look and extracted a few that might merit attention; 
see attachment.

Similarly, a rule finding [X-Y] where Y=X+1 found one or two questionable cases 
in a sea of false positives (also in the attachment).
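
A rule of that kind can be sketched in a few lines of Python. Again, this is 
an illustration under assumptions, not the rule as implemented: the function 
name is made up, and the input is taken to be only the text between '[' and 
']', without character classes or a leading '^'.

```python
def adjacent_ranges(body):
    """Return ranges X-Y in a bracket body where Y is X's successor."""
    found = []
    i = 0
    while i < len(body):
        # A '-' counts as a range operator only if a character follows it.
        if i + 2 < len(body) and body[i + 1] == "-":
            lo, hi = body[i], body[i + 2]
            if ord(hi) == ord(lo) + 1:
                found.append(lo + "-" + hi)
            i += 3
        else:
            i += 1
    return found
```

Such a range covers exactly two characters, so [a-b] matches the same set as 
[ab]; whether that was intended is what a human has to judge.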

Attachment: possibly-broken-regexps.log
Description: Binary data
