From: Eric Blake
Subject: Re: built-in regex matches wrong character
Date: Thu, 6 Sep 2018 12:58:17 -0500
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.9.1

On 09/06/2018 12:39 PM, Aharon Robbins wrote:
> In article <mailman.444.1536243821.1284.bug-bash@gnu.org>, Eric Blake <eblake@redhat.com> wrote:
>> But bash could be taught to convert any regex that contains a range with both endpoints ASCII into a different bracket expression before handing things over to regcomp(). That is, if the user is matching against [a-d], bash hands [abcd] to regcomp() instead. You don't need a flag in regcomp() to get RRI, just merely some pre-processing (and often memory allocation, as the expansion of a range into a non-range tends to require more characters).
>
> This is easy and inexpensive for ASCII only. Full RRI does the same thing for wide character sets as well, though, and there the possibility for using very large amounts of memory makes the rewrite-the-range idea less palatable.

Indeed. But the bash option is named 'globasciiranges', and I find far more use in having ranges with both endpoints in single-byte ASCII behaving sanely than I do for ranges with one or more ends resulting in a multibyte character (by the time my regex involves multibyte characters, I am already admitting that I am in locale-dependent territory, and RRI may no longer be the best action anyway). That is, RRI makes the most sense when dealing with ASCII characters (< 128) in the first place, and that's a reasonable stopgap for immediate implementation, even if we don't get full RRI across all of Unicode (assuming that such might later become available via a new regcomp() flag).
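
To make the pre-processing idea concrete, here is a minimal sketch of the range rewrite, written in Python for brevity (bash itself is C, and this is not bash's code). The names expand_ascii_ranges and _expandable are hypothetical. As a simplifying assumption, it only expands a range when every character between the endpoints is an ASCII letter or digit, so [a-d] becomes [abcd] while something like [A-z] (which spans punctuation such as ']' and '^') or a multibyte range is passed through untouched.

```python
def _expandable(lo: str, hi: str) -> bool:
    """True if lo-hi is an ascending, purely alphanumeric ASCII range."""
    if not (lo.isascii() and hi.isascii()) or lo > hi:
        return False
    return all(chr(c).isalnum() for c in range(ord(lo), ord(hi) + 1))


def expand_ascii_ranges(regex: str) -> str:
    """Rewrite [a-d] as [abcd] inside POSIX bracket expressions (sketch)."""
    out = []
    i, n = 0, len(regex)
    while i < n:
        ch = regex[i]
        if ch == "\\" and i + 1 < n:
            # Escaped character outside a bracket: copy both bytes verbatim.
            out.append(regex[i:i + 2])
            i += 2
            continue
        if ch != "[":
            out.append(ch)
            i += 1
            continue
        # Start of a bracket expression: optional '^', then a literal ']'.
        j = i + 1
        if j < n and regex[j] == "^":
            j += 1
        if j < n and regex[j] == "]":
            j += 1
        out.append(regex[i:j])
        i = j
        while i < n and regex[i] != "]":
            # Copy [:class:], [.coll.] and [=equiv=] elements unchanged.
            if regex[i] == "[" and i + 1 < n and regex[i + 1] in ":.=":
                end = regex.find(regex[i + 1] + "]", i + 2)
                if end != -1:
                    out.append(regex[i:end + 2])
                    i = end + 2
                    continue
            # A range is lo '-' hi, where the '-' is not the last member.
            if i + 2 < n and regex[i + 1] == "-" and regex[i + 2] != "]":
                lo, hi = regex[i], regex[i + 2]
                # Leave things like [a-c-e] alone: expanding the first
                # range would turn "c-e" into a new range.
                ambiguous = (regex[i + 3:i + 4] == "-"
                             and regex[i + 4:i + 5] not in ("", "]"))
                if _expandable(lo, hi) and not ambiguous:
                    out.append("".join(
                        chr(c) for c in range(ord(lo), ord(hi) + 1)))
                else:
                    out.append(regex[i:i + 3])
                i += 3
                continue
            out.append(regex[i])
            i += 1
        if i < n:
            out.append("]")  # closing bracket
            i += 1
    return "".join(out)
```

With this sketch, expand_ascii_ranges("x[0-3]+y") yields "x[0123]+y", and the result is what would be handed to regcomp(). Handling ranges that cross punctuation would additionally require reordering members so that ']', '^' and '-' keep their literal meaning, which is omitted here.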
-- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org