bug#29871: 25.3; ZWJ word-boundaries in regexps
From: Eli Zaretskii
Subject: bug#29871: 25.3; ZWJ word-boundaries in regexps
Date: Wed, 27 Dec 2017 22:33:22 +0200

> From: "Mark Shoulson" <mark@nagas.meson.org>
> Date: Wed, 27 Dec 2017 14:07:40 -0500
>
> According to http://unicode.org/reports/tr29/#Word_Boundaries rule WB4,
> it would seem that a ZWJ character (U+200D ZERO WIDTH JOINER) between
> two "word" characters should not constitute a word boundary. And yet:
>
> (string-match "\\<" "foo\u200Dfbar" 1)
>
> evaluates to 4 (the 1 is to skip the word-beginning at the start of the
> string). Or you can search for "\\b" or "\\>" and get 3. Either way,
> indicative of a word-break at the ZWJ character. Is this correct?

Emacs considers a change of script as a word break, and U+200D's
script is 'symbol', which is different from 'latin', the script of the
ASCII characters.
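
The script assignments mentioned above can be inspected directly. The following is a minimal sketch, assuming the standard `char-script-table' char-table is what determines the script here; the helper name `my-char-script' is made up for illustration and is not part of Emacs.

```elisp
;; Hypothetical helper, for illustration only (not an Emacs function).
(defun my-char-script (char)
  "Return the script symbol that `char-script-table' assigns to CHAR."
  (aref char-script-table char))

;; ASCII letters belong to the `latin' script.
(my-char-script ?f)

;; U+200D ZERO WIDTH JOINER; per the explanation above, this is
;; reported as `symbol' on Emacs 25.3.
(my-char-script ?\u200D)

;; The expression from the report: the script changes at the ZWJ,
;; so a word boundary is found there.
(string-match "\\<" "foo\u200Dfbar" 1)
```

Evaluating each form with C-x C-e (or in M-: / ielm) shows the script of each character, which makes it easier to see where the script changes and hence where a word boundary is reported.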