bug-gnu-emacs
bug#33205: 26.1; unibyte/multibyte missing in rx.el


From: Mattias Engdegård
Subject: bug#33205: 26.1; unibyte/multibyte missing in rx.el
Date: Wed, 7 Nov 2018 19:08:43 +0100

On 5 Nov 2018, at 17:49, Eli Zaretskii <eliz@gnu.org> wrote:
> After looking into this, my conclusion is that what I wrote above was
> not too wrong.  Indeed, currently [:ascii:]/[:nonascii:] cannot be
> distinguished from [:unibyte:]/[:multibyte:].  In a nutshell, it turns
> out [:unibyte:] is not what one might think it is, you can see that in
> re_wctype_to_bit, for example.

Thank you very much for taking the time to look at this, and for the
detailed answer. My apologies for severely complicating what I initially
thought was quite a trifle!

> That ^[:ascii:] is not the same as [:nonascii:], and the same with
> [:unibyte:] vs ^[:multibyte:], is arguably a bug.  The reason for that
> becomes clear if you look at how we generate the fastmap in each of
> these cases and how we set the bits in the work-area of the range
> table, but I don't know enough to say how easy would it be to fix
> that.
> 
> An alternative is to use an explicit character class, as in \000-\377,
> that works as you'd expect.

I'm not sure what I expected [\000-\377] to mean in a multibyte string;
one endpoint is ASCII and the other is a raw byte. It does work, as you
noted, because two ranges are generated, as if written
[\000-\177\200-\377].

In old Emacs versions (I tried 22.1.1), [:unibyte:] appears to match raw
bytes in multibyte strings/buffers and everything in unibyte
strings/buffers (that is, [\000-\377] in both cases), while [:multibyte:]
matches the complement of that. Thus, at some point the behaviour changed,
but I cannot find any mention of it in NEWS. It could have been an
accident. Perhaps those character classes didn't see much use.
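
A quick way to compare the classes on a given Emacs version is a probe
along the following lines. This is only a minimal sketch, not taken from
the Emacs sources: the sample strings and the choice of regexps are
assumptions made for illustration, and the results are deliberately not
listed here since they depend on the version being tested.

```elisp
;; Hypothetical probe: report which bracket expressions match each
;; sample string.  `string-match-p' returns the match position or nil.
(let ((samples (list "a"                            ; ASCII character
                     "é"                            ; non-ASCII multibyte character
                     (string-to-multibyte "\377"))) ; raw byte in a multibyte string
      (regexps '("[[:ascii:]]" "[[:nonascii:]]"
                 "[[:unibyte:]]" "[[:multibyte:]]"
                 "[\000-\377]")))
  (dolist (s samples)
    (dolist (re regexps)
      ;; One line per (string, regexp) pair goes to *Messages*.
      (message "%S %S -> %S" s re (string-match-p re s)))))
```

Evaluating it in both an old and a current Emacs should make any
difference in how [:unibyte:] and [:multibyte:] treat raw bytes visible
in *Messages*.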

The old behaviour seems a little more intuitive, but it must be rare to need 
regex matching of rubbish bytes in multibyte strings. If you could argue that 
the status quo is fine then I wouldn't necessarily object, but would suggest 
that at least the code be made explicit about it (and the documentation, as 
well).

> Well, what do you think now?  Is it worth adding those to rx.el? I'm
> not sure.  How important is it to find unibyte characters in a string,
> anyway?

Unless we manage to make [:unibyte:]/[:multibyte:] more useful in their own 
right, it's fine to leave rx.el as is, as far as I'm concerned. There is no 
loss of expressivity.
