[Octave-patch-tracker] [patch #10295] Add test for regexp match on UTF-8

From: Markus Mützel
Subject: [Octave-patch-tracker] [patch #10295] Add test for regexp match on UTF-8 string
Date: Wed, 16 Nov 2022 12:59:54 -0500 (EST)

Follow-up Comment #1, patch #10295 (project octave):

Thank you for the test. Imho, a test like this is probably good to have. It
might also make sense to include a character from outside the BMP.

However, I've always regarded it as an implementation detail that character
arrays are encoded in UTF-8 in Octave. That implementation detail might change
in the future. (Maybe they'll be UTF-16 encoded like in Matlab, though I'm not
sure how likely that is.)
I guess we could just adapt that BIST if that turns out to be the case.

Another thing I'm not certain about: are we sure that the test sources are
always interpreted as UTF-8 (independent of, e.g., the system locale)?
It would probably be good if they were, but I don't know whether that is the
case. One way around that might be to replace the literal string with the
corresponding UTF-8 encoded byte sequence:

diff -r 9f4a9dd4a6ee -r a6a427632ab1 libinterp/corefcn/regexp.cc
--- a/libinterp/corefcn/regexp.cc       Sun Nov 13 13:00:16 2022 -0500
+++ b/libinterp/corefcn/regexp.cc       Tue Nov 15 12:25:40 2022 -0300
@@ -919,6 +919,9 @@
 %!assert (regexp ('abcabc', 'abc$'), 4)
 %!assert (regexp ('abcabc', '^abc$'), zeros (1,0))
+## UTF-8 test with character array "âé🙂ïõù"
+%!assert (regexp (char ([195, 162, 195, 169, 240, 159, 153, 130, 195, 175, 195, 181, 195, 185]), "."), [1, 3, 5, 9, 11, 13])
 %! [s, e, te, m, t] = regexp (' No Match ', 'f(.*)uck');
 %! assert (s, zeros (1,0));

That relies even more heavily on the fact that character arrays in Octave are
currently UTF-8 encoded. But the test relies on that anyway...
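
For what it's worth, the byte values and the expected start indices in that
assert can be cross-checked outside Octave. Below is a minimal Python sketch;
the helper name code_unit_starts is made up for illustration, and it only
counts code units per character. It says nothing about what regexp in Octave
or Matlab would actually return after a change of encoding.

```python
def code_unit_starts(text, encoding="utf-8", unit_size=1):
    """Return the 1-based index of the first code unit of each code point.

    unit_size is the number of bytes per code unit in the given encoding
    (1 for UTF-8, 2 for UTF-16).
    """
    starts = []
    position = 1
    for ch in text:
        starts.append(position)
        # Advance by the number of code units this code point occupies.
        position += len(ch.encode(encoding)) // unit_size
    return starts


# The test string "âé🙂ïõù", written with escapes so that this file does not
# depend on how its own source encoding is interpreted.
text = "\u00e2\u00e9\U0001F642\u00ef\u00f5\u00f9"

utf8_bytes = list(text.encode("utf-8"))
print(utf8_bytes)                                # byte values passed to char ()
print(code_unit_starts(text))                    # start indices in UTF-8
print(code_unit_starts(text, "utf-16-le", 2))    # start indices in UTF-16
```

With UTF-8 the character from outside the BMP takes four code units, which is
why the indices jump from 5 to 9. With UTF-16 it would take two (a surrogate
pair), so the expected vector in the BIST would have to change if the internal
encoding ever did.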

What do you think?

