bug#33205: 26.1; unibyte/multibyte missing in rx.el

bug-gnu-emacs

From:	Mattias Engdegård
Subject:	bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date:	Wed, 7 Nov 2018 19:08:43 +0100

On 5 Nov 2018, at 17:49, Eli Zaretskii <eliz@gnu.org> wrote:
> After looking into this, my conclusion is that what I wrote above was
> not too wrong.  Indeed, currently [:ascii:]/[:nonascii:] cannot be
> distinguished from [:unibyte:]/[:multibyte:].  In a nutshell, it turns
> out [:unibyte:] is not what one might think it is, you can see that in
> re_wctype_to_bit, for example.

Thank you very much for taking the time to look at this, and for the
detailed answer. My apologies for severely complicating what I initially
thought was quite a trifle!

> That ^[:ascii:] is not the same as [:nonascii:], and the same with
> [:unibyte:] vs ^[:multibyte:], is arguably a bug.  The reason for that
> becomes clear if you look at how we generate the fastmap in each of
> these cases and how we set the bits in the work-area of the range
> table, but I don't know enough to say how easy would it be to fix
> that.
> 
> An alternative is to use an explicit character class, as in \000-\377,
> that works as you'd expect.

I'm not sure what I expected [\000-\377] to mean in a multibyte string; one 
endpoint is ASCII and the other is a raw byte. It does work, as you noted, 
because two ranges are generated, as if written [\000-\177\200-\377].
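
As a rough illustration of that splitting (a model in Python, not Emacs
code): it assumes that a raw byte 0x80..0xFF in multibyte text is
represented by the character code 0x3FFF00 plus the byte value, so the
byte range 0..0xFF cannot be one contiguous range of character codes.
The function names are hypothetical; the real logic lives in the regexp
compiler.

```python
# Assumption: in multibyte text, raw byte B (0x80..0xFF) has the
# character code RAW_BYTE_OFFSET + B; ASCII bytes keep their own value.
RAW_BYTE_OFFSET = 0x3FFF00


def byte_to_char_code(b):
    """Return the modelled multibyte character code for byte value b."""
    if not 0 <= b <= 0xFF:
        raise ValueError("byte value out of range")
    return b if b < 0x80 else b + RAW_BYTE_OFFSET


def split_byte_range(lo, hi):
    """Split the byte range lo..hi into contiguous character-code ranges."""
    if not 0 <= lo <= hi <= 0xFF:
        raise ValueError("expected 0 <= lo <= hi <= 0xFF")
    ranges = []
    if lo <= 0x7F:
        # ASCII part: character codes equal the byte values.
        ranges.append((lo, min(hi, 0x7F)))
    if hi >= 0x80:
        # Raw-byte part: shifted up to the raw-byte code space.
        ranges.append((byte_to_char_code(max(lo, 0x80)),
                       byte_to_char_code(hi)))
    return ranges


def in_ranges(code, ranges):
    """Return True if character code `code` falls in any of the ranges."""
    return any(first <= code <= last for first, last in ranges)


ranges = split_byte_range(0x00, 0xFF)
```

Under this model, `ranges` holds two entries, one for ASCII and one for
raw bytes, so a character such as U+00E9 falls in neither, while the raw
byte 0xE9 does.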

In old Emacs versions (I tried 22.1.1), [:unibyte:] appears to match raw
bytes in multibyte strings/buffers and everything in unibyte
strings/buffers (that is, [\000-\377] in both cases), and [:multibyte:]
matches the complement of that. Thus the behaviour changed at some
point, but I cannot find any NEWS entry for it. It could have been an
accident; perhaps those character classes didn't see much use.

The old behaviour seems a little more intuitive, but it must be rare to
need regex matching of rubbish bytes in multibyte strings. If you would
argue that the status quo is fine, I wouldn't necessarily object, but I
would suggest that at least the code be made explicit about it (and the
documentation as well).

> Well, what do you think now?  Is it worth adding those to rx.el? I'm
> not sure.  How important is it to find unibyte characters in a string,
> anyway?

Unless we manage to make [:unibyte:]/[:multibyte:] more useful in their
own right, it's fine to leave rx.el as is, as far as I'm concerned.
There is no loss of expressivity.
