One more bug:
The call to pcre2_compile_32 should be changed from:
    code =
        pcre2_compile_32(pattern_ucs, pattern.size(),
                         PCRE2_NO_UTF_CHECK | flags, &error_code,
                         &error_offset, 0);
to:
    code =
        pcre2_compile_32(pattern_ucs, pattern.size(),
                         PCRE2_UTF | PCRE2_UCP | flags,
                         &error_code,
                         &error_offset, 0);
Without PCRE2_UTF, Unicode semantics will not be applied
(for example, case-insensitive matching of non-ASCII
characters will not work correctly).
PCRE2_UCP is a little less obvious. I think it
would make sense to enable it, since we care more about
correctness than performance. Here's what the documentation
has to say about it:
“This option changes the way PCRE2 processes \B, \b,
\D, \d, \S, \s, \W, \w, and some of the POSIX character
classes. By default, only ASCII characters are recognized,
but if PCRE2_UCP is set, Unicode properties are used
instead to classify characters. More details are given in
the section on generic character types in the pcre2pattern
page. If you set PCRE2_UCP, matching one of the items it
affects takes much longer.”
Finally, I don't think it makes sense to use PCRE2_NO_UTF_CHECK:
at best it's a no-op (since we're using UTF-32), and at worst
it can cause a crash when trying to match an invalid string.
The small performance gain isn't worth that risk.
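
To give a sense of how little work the validity check involves for UTF-32, here is a minimal sketch. As far as I read the pcre2unicode page, a UTF-32 string is valid when every code point is at most 0x10FFFF and is not a surrogate (0xD800 to 0xDFFF). The helper name is_valid_utf32 is my own invention for illustration; it is not part of PCRE2 or of the patch.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper, not part of PCRE2 or the patch under review.
// Returns true if every code unit is a valid UTF-32 code point:
// at most 0x10FFFF and outside the surrogate range 0xD800..0xDFFF.
bool is_valid_utf32(const std::u32string &s)
{
    for (char32_t c : s) {
        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return false;
    }
    return true;
}
```

It's a single linear pass with two comparisons per code point, so leaving the check enabled should cost very little compared with the matching itself.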
Regards,
Elias