Re: Scan of regexps in Emacs (March 17)

From: Paul Eggert
Subject: Re: Scan of regexps in Emacs (March 17)
Date: Tue, 2 Apr 2019 00:33:28 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.6.1

Mattias Engdegård wrote:
> don't we also need a precise description of exactly how they are interpreted by
> the engine?

In other parts of Emacs, we are typically OK with specs that don't completely specify behavior. This gives us more freedom to make changes in the undocumented behavior later. I think it makes sense to do that here too, for regular expressions like "[z-a-m]" that most readers would find confusing.

> I'm with Stefan here; `-' should go last. Anything else is a gritty detail.

Stefan already changed the doc in master to say that. The attached patch tightens up the wording (and still says that "-" should go last).
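To make the "-" placement rule concrete, here is a minimal Python sketch of a checker. It is not taken from the patch or from any existing tool; the function name misplaced_hyphens is hypothetical. It assumes it is handed only the text between "[" and "]" (with any leading "^" already removed), that no character classes such as [:alpha:] occur in it, and that every character is literal.

```python
def misplaced_hyphens(body: str) -> list[int]:
    """Return indices of literal '-' characters that are not in last position.

    `body` is the text between '[' and ']' with any leading '^' removed.
    Assumption: no character classes such as [:alpha:] appear in `body`.
    """
    found = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            i += 3  # X-Y is a range; its '-' is not a literal
        else:
            if body[i] == "-" and i != len(body) - 1:
                found.append(i)
            i += 1
    return found


print(misplaced_hyphens("a-z-"))  # literal '-' is last
print(misplaced_hyphens("-a-z"))  # literal '-' is first
```

Under these assumptions the first call reports nothing and the second reports the leading "-".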

> Documenting differences from POSIX regexps is useful. Do you prefer having
> those differences being spread out, or all concentrated into one section?

I don't have a strong preference. I wrote it concentrated originally, and that form seems to work well.

> These days, a user may be more familiar with the various PCRE dialects than
> traditional or extended POSIX. Should that be taken into account?

It might be helpful. However, PCRE is further away from Emacs regexps than POSIX is, and a comparison of PCRE and POSIX regexps is probably best put into a different section. It's not a section I'd like to write, to be honest; PCRE is pretty hairy.

> The terminology is a bit confusing. Is 'raw 8-bit byte' included in 'unibyte'?
> Is \x7f ever a raw 8-bit byte?
> I agree that [å-\xff], say, should be invalid but I've never seen such

After looking into it I realized that I don't really know the semantics here (the text I recently added there seems to be wrong, in some cases), and I have my doubts that anyone else knows the semantics either. The attached patch simply gets rid of that section, leaving the area undocumented. User beware!

> It already does, and some bugs were found that way. As a special case, it no
> longer complains about z-a because that is unlikely to be an accident and
> occurs in some code on purpose.

OK, then we should document z-a as the preferred syntax (best go with the flow...). Done in the attached patch.
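A reversed-range check with a z-a exemption could look roughly like the following Python sketch. It is only an illustration, not how xr implements it; the function name reversed_ranges and its allowed parameter are hypothetical. It takes the text between "[" and "]" and compares range endpoints by code point.

```python
def reversed_ranges(body: str, allowed: tuple[str, ...] = ("z-a",)) -> list[str]:
    """Return ranges X-Y in `body` whose start sorts after their end.

    Ranges listed in `allowed` are treated as deliberate and not reported.
    Assumption: `body` holds no character classes such as [:alpha:].
    """
    found = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            text = body[i:i + 3]
            if body[i] > body[i + 2] and text not in allowed:
                found.append(text)
            i += 3
        else:
            i += 1
    return found


print(reversed_ranges("z-a"))   # exempted
print(reversed_ranges("+-*/"))  # '+' (43) sorts after '*' (42)
```

The second call shows the accidental kind of case: someone listing arithmetic operators writes a range from "+" to "*" without meaning to.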

> As an experiment, I added detection of 'chained' ranges like [a-m-z] to xr and
> found a handful in both Emacs and GNU ELPA, but none of them carried a freeload
> of bugs. Keeping that check didn't seem worthwhile; the regexps may be a bit
> odd-looking, but aren't wrong.

It depends on what one means by "wrong". If one wants to use the ranges in both Emacs and grep they are "wrong", so it's reasonable for the manual to recommend against them.
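For readers who want to look for chained ranges in their own code, a detector might be sketched in Python as below. This is an assumption about how such a check could work, not the experiment that was added to xr; the function name chained_ranges is hypothetical, and the input is again the text between "[" and "]" with no character classes in it.

```python
def chained_ranges(body: str) -> list[str]:
    """Return chained ranges such as 'a-m-z' found in `body`.

    A range is reported as chained when a complete range X-Y is followed
    directly by '-' and at least one more character.
    """
    found = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            i += 3  # step over the complete range X-Y
            if i + 1 < len(body) and body[i] == "-":
                found.append(body[i - 3:i + 2])
        else:
            i += 1
    return found


print(chained_ranges("a-m-z"))   # chained
print(chained_ranges("a-mx-z"))  # two ordinary ranges
print(chained_ranges("a-z-"))    # trailing literal '-'
```

A trailing "-" after a range is left alone, since that is the recommended place for a literal "-".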
> a rule finding [X-Y] where Y=X+1 found one or two questionable cases in a sea
> of false positives (also in the attachment).

It might also help for the trawler to warn about [X-Z] where Z = X+2. [XYZ] is clearer and less error-prone than [X-Z]. I shoehorned that into the attached patch too.
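The short-range rule can be sketched in Python along these lines. The function name short_ranges and the max_span parameter are hypothetical, and the sketch assumes the input is the text between "[" and "]" with no character classes. It reports ranges spanning at most max_span code points beyond the first, together with the spelled-out form.

```python
def short_ranges(body: str, max_span: int = 2) -> list[tuple[str, str]]:
    """Return (range, spelled_out) pairs for ranges covering few characters.

    With max_span=2 this reports both [X-Y] where Y = X+1 and
    [X-Z] where Z = X+2.
    Caveat: the spelled-out form is a suggestion only; if it contains
    '-', ']' or '^', those characters still need careful placement.
    """
    found = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            lo, hi = ord(body[i]), ord(body[i + 2])
            if 0 < hi - lo <= max_span:
                spelled = "".join(chr(c) for c in range(lo, hi + 1))
                found.append((body[i:i + 3], spelled))
            i += 3
        else:
            i += 1
    return found


print(short_ranges("a-cx-z0-1"))
print(short_ranges("a-z"))
```

As the quoted experience with the Y = X+1 rule suggests, a check like this is likely to produce many false positives, so its output is best treated as a list of candidates to review.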

Attachment: 0001-More-regexp-advice-and-clarifications.patch
Description: Text Data
