[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Questionable code in handling of wordend in the regexp engine in reg

From: Alan Mackenzie
Subject: Re: Questionable code in handling of wordend in the regexp engine in regex-emacs.c
Date: Fri, 1 Mar 2019 11:10:18 +0000
User-agent: Mutt/1.10.1 (2018-07-13)

Hello, Stefan.

On Mon, Feb 25, 2019 at 14:18:10 -0500, Stefan Monnier wrote:
> >> > Surely the argument to the second occurrence should be (charpos + 1)?
> >> I believe it's instead the other one that needs to use "charpos - 1"
> >> because the UPDATE_SYNTAX_TABLE is called just before reading the char
> >> *before* charpos (see patch below).
> > I don't think this is right.  offset is calculated from d, and then
> > decremented, before calculating charpos.

> Hmm... I think you're right (and the symend code does like you suggest).
> This said, I find it odd that the code does:

>             ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
>             ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
>             UPDATE_SYNTAX_TABLE (charpos);

> Supposedly `d` is a char* pointing to the beginning of a potentially
> multibyte char, In that case `d - 1` will point "somewhere before the
> end of the previous multibyte char" but not necessarily at
> its beginning.  Maybe the patch below would be preferable to avoid
> this situation?

SYNTAX_TABLE_BYTE_TO_CHAR ends up calling buf_bytepos_to_charpos (in
marker.c).  This latter function doesn't handle well the case of `d'
being in the middle of a multibyte character; sometimes it "rounds it
down", other times it "rounds it up" to a character position.  I think
it should be defined as rounding it down.  It would be a relatively
simple correction (at least, technically ;-).

I think your patch "always do the arithmetic on a charpos, not a
bytepos" is a very good idea indeed.

But I'm still a little worried about buf_bytepos_to_charpos.  Perhaps it
should state that the result is undefined when the bytepos is "invalid".
How many other places in Emacs call it with "invalid" byte positions?
For that matter, how many charpos <-> bytepos functions are there in
Emacs?  Just this one?

> Worse, in notwordbound we do:

>               ptrdiff_t offset = PTR_TO_OFFSET (d - 1);
>               ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
>               UPDATE_SYNTAX_TABLE (charpos);

> which seems even more broken because `d` might point to the first byte
> after the gap, so `d - 1` will point in the middle of the gap, so it's
> simply an invalid argument to PTR_TO_OFFSET.

I don't think this is right.  Both `d' and `offset' are byte
measurements, not character measurements, so it shouldn't matter whether
the "- 1" is inside or outside the parens.  However, it would be less
confusing if they were both (?all) the same.

> According to the definition of PTR_TO_OFFSET and POINTER_TO_OFFSET,
> the result may be the same as if we did the decrement after the fact,
> but it still looks fishy.  WDYT?

I think it is suboptimal to have both PTR_TO_OFFSET and
POINTER_TO_OFFSET meaning different things in the same source file.  ;-)

>         Stefan

> diff --git a/src/regex-emacs.c b/src/regex-emacs.c
> index b667a43a37..b21cba0e46 100644
> --- a/src/regex-emacs.c
> +++ b/src/regex-emacs.c
> @@ -4811,9 +4811,9 @@ re_match_2_internal (struct re_pattern_buffer *bufp, 
> re_char *string1,
>             int c1, c2;
>             int s1, s2;
>             int dummy;
> -           ptrdiff_t offset = PTR_TO_OFFSET (d) - 1;
> +           ptrdiff_t offset = PTR_TO_OFFSET (d);
>             ptrdiff_t charpos = SYNTAX_TABLE_BYTE_TO_CHAR (offset);
> -           UPDATE_SYNTAX_TABLE (charpos);
> +           UPDATE_SYNTAX_TABLE (charpos - 1);
>             GET_CHAR_BEFORE_2 (c1, d, string1, end1, string2, end2);
>             s1 = SYNTAX (c1);
There are eight occurrences of SYNTAX_TABLE_BYTE_TO_CHAR in
regex-emacs.c.  I think I will check them all, amending them as in your

What do you say?

Alan Mackenzie (Nuremberg, Germany).

reply via email to

[Prev in Thread] Current Thread [Next in Thread]