From: Paul Eggert
Subject: bug#18777: [PATCH] dfa: improvement for checking of multibyte character boundary
Date: Mon, 15 Dec 2014 09:43:54 -0800
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.3.0

On 12/15/2014 06:59 AM, Norihiro Tanaka wrote:
+/* True if each byte can not occur inside a multibyte character */
+static bool always_single_byte[NOTCHAR];
+
+static void
+dfaalwayssb (void)
+{
+  size_t i;
+  unsigned char const uc[] = { '\0', '\n', '\r', '.', '/' };
+  for (i = 0; i < sizeof uc / sizeof uc[0]; ++i)
+    always_single_byte[uc[i]] = true;
+}

Can't we improve this when using_utf8 () is true? In that case, every ASCII character is always single byte. Also, the bytes 0xc0, 0xc1, and 0xf5 through 0xff can be added to the table: they are not single-byte characters but they are always encoding errors so they will be a character boundary as far as skip_remains_mb is concerned. This suggests that the table 'always_single_byte' should be renamed to something like 'always_character_boundary'.

   wint_t wc = WEOF;
+  if (always_single_byte[*p])
+    return p;

This won't assign anything to *WCP, contrary to the documented API for skip_remains_mb. This is OK (as callers don't care), but the API documentation should be changed to reflect the actual behavior.