[Octave-patch-tracker] [patch #10295] Add test for regexp match on UTF-8
From: Markus Mützel
Subject: [Octave-patch-tracker] [patch #10295] Add test for regexp match on UTF-8 string
Date: Wed, 16 Nov 2022 12:59:54 -0500 (EST)
Follow-up Comment #1, patch #10295 (project octave):
Thank you for the test. IMHO, a test like this is probably good to have. It
might also make sense to include a character from outside the BMP.
However, I've always seen it as an implementation detail that character
arrays are encoded in UTF-8 in Octave. That implementation detail might
change in the future. (Maybe they'll be UTF-16 encoded, as in Matlab. I'm
not sure how likely that is, though.)
If that happens, I guess we could just change that BIST.
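To make the encoding dependence concrete, here is a rough sketch in Python
(used only as an Octave-independent cross-check; the helper name
start_indices is made up for this illustration). It computes the 1-based
offset of the first code unit of each character under UTF-8 and under
UTF-16. It assumes that regexp would keep reporting code-unit offsets if the
internal encoding ever changed, which is a guess and not something Octave
promises.

```python
def start_indices(text, encoding):
    """Return the 1-based code-unit index at which each character starts."""
    # UTF-16 code units are 2 bytes wide, UTF-8 code units are 1 byte wide.
    unit_size = 2 if encoding.startswith("utf-16") else 1
    starts = []
    pos = 1
    for ch in text:
        starts.append(pos)
        pos += len(ch.encode(encoding)) // unit_size
    return starts


# "âé🙂ïõù", written with escapes so the encoding of this file does not matter.
TEXT = "\u00e2\u00e9\U0001F642\u00ef\u00f5\u00f9"

utf8_starts = start_indices(TEXT, "utf-8")       # [1, 3, 5, 9, 11, 13]
utf16_starts = start_indices(TEXT, "utf-16-le")  # [1, 2, 3, 5, 6, 7]
print(utf8_starts)
print(utf16_starts)
```

The non-BMP character takes four code units in UTF-8 and a surrogate pair
(two code units) in UTF-16, so the expected result of the BIST would have to
change together with the encoding.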
Next thing I'm not certain about: Are we sure that the test sources are always
interpreted as UTF-8 (independent of, e.g., the system locale)?
It would probably be good if they were. But I don't know if that is the case
currently.
One way around that might be to replace the literal string with the
corresponding UTF-8 encoded byte sequence:
diff -r 9f4a9dd4a6ee -r a6a427632ab1 libinterp/corefcn/regexp.cc
--- a/libinterp/corefcn/regexp.cc Sun Nov 13 13:00:16 2022 -0500
+++ b/libinterp/corefcn/regexp.cc Tue Nov 15 12:25:40 2022 -0300
@@ -919,6 +919,9 @@
%!assert (regexp ('abcabc', 'abc$'), 4)
%!assert (regexp ('abcabc', '^abc$'), zeros (1,0))
+## UTF-8 test with character array "âé🙂ïõù"
+%!assert (regexp (char ([195, 162, 195, 169, 240, 159, 153, 130, 195, 175, 195, 181, 195, 185]), "."), [1, 3, 5, 9, 11, 13])
+
%!test
%! [s, e, te, m, t] = regexp (' No Match ', 'f(.*)uck');
%! assert (s, zeros (1,0));
That relies even more heavily on the fact that character arrays in Octave
are currently UTF-8 encoded. But the test relies on that anyway...
What do you think?
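For anyone who wants to double-check the byte sequence in the patch without
involving Octave, a minimal Python sketch follows. The variable names are
invented for this illustration. It decodes the bytes as UTF-8, derives the
expected start indices from the lead bytes, and shows what happens when the
same bytes are read under a single-byte encoding (Latin-1 is picked here
only as an example of a non-UTF-8 locale).

```python
# Byte values copied from the proposed BIST.
PATCH_BYTES = bytes([195, 162, 195, 169, 240, 159, 153, 130,
                     195, 175, 195, 181, 195, 185])

# Decoding as UTF-8 should give back the six intended characters.
decoded = PATCH_BYTES.decode("utf-8")

# A UTF-8 character starts at every byte that is not a continuation byte
# (continuation bytes have the bit pattern 10xxxxxx).
starts = [i + 1 for i, b in enumerate(PATCH_BYTES) if (b & 0xC0) != 0x80]

# Read as Latin-1, every byte becomes its own character, so the same data
# turns into 14 characters instead of 6.
misread = PATCH_BYTES.decode("latin-1")

print(len(decoded), starts, len(misread))
```

The last step is the reason the question about how the test sources are
interpreted matters for a literal string: the same bytes only mean
"âé🙂ïõù" if they are read as UTF-8.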
_______________________________________________________
Reply to this item at:
<https://savannah.gnu.org/patch/?10295>
_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/