emacs-devel

Re: [PATCH] add 'string-distance' to calculate Levenshtein distance


From: chen bin
Subject: Re: [PATCH] add 'string-distance' to calculate Levenshtein distance
Date: Tue, 17 Apr 2018 12:43:33 +1000

'bytecompare' only applies the original Levenshtein distance algorithm
to ASCII strings. It's not enabled by default.
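
For reference, here is a minimal sketch of the classic dynamic-programming
Levenshtein distance computed over raw bytes, which is roughly what the
bytecompare path amounts to. It is illustrative only; the function name and
buffer handling are not the patch's code:

  #include <stddef.h>
  #include <string.h>

  /* Classic O(m*n) Levenshtein distance over raw bytes, keeping only
     a single row of the DP matrix.  Illustrative sketch only.  */
  static size_t
  byte_levenshtein (const char *a, const char *b)
  {
    size_t m = strlen (a), n = strlen (b);
    size_t row[n + 1];

    for (size_t j = 0; j <= n; j++)
      row[j] = j;
    for (size_t i = 1; i <= m; i++)
      {
        size_t diag = row[0];       /* row[j-1] of the previous row */
        row[0] = i;
        for (size_t j = 1; j <= n; j++)
          {
            size_t above = row[j];
            size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            size_t del = above + 1, ins = row[j - 1] + 1, sub = diag + cost;
            row[j] = del < ins ? del : ins;
            if (sub < row[j])
              row[j] = sub;
            diag = above;
          }
      }
    return row[n];
  }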

As we agreed, it has a pleasant side effect for East Asian languages. It's not
a matter of right or wrong, just a side effect of UTF-8 encoding.

Weighted distance is NOT doable, for only one reason:
you can't increase the character weight of a certain race or religion in Emacs.
It would be politically wrong.

So I suggest keeping both the Unicode and ASCII algorithms.

I will study the macros you mentioned tonight and figure out a new solution
by the end of the day.


On Tue, Apr 17, 2018 at 5:41 AM, Eli Zaretskii <address@hidden> wrote:
>> From: chen bin <address@hidden>
>> Date: Tue, 17 Apr 2018 00:29:39 +1000
>>
>> To be honest I'm 100% confident about my current solution. I'm
>> actually an expert in dealing with Unicode-related C programming.
>
> I appreciate your expertise, that's not the issue here.  The main
> issue is how to support all the range of characters that Emacs wants
> to support.
>
>> It's standard practice to convert UTF-8 characters to UTF-16
>> characters before any further string operations.
>>
>> To answer your questions, I have to introduce East Asian culture first.
>>
>> China/Japan/Korea use the same Hanzi (Chinese characters). China's
>> national standard GBK defines
>> 21886 characters used by East Asians.
>>
>> UTF-16 is compatible with GBK.
>>
>> Even the most educated Chinese know only 6000~8000 characters.
>>
>> UTF-32 exists because there are another 50000 ancient Hanzi which no
>> Chinese person uses.
>>
>> Among the 99089 characters of Unicode v5.0, 71226 are Hanzi.
>>
>> In short, you've got about 22000 non-Hanzi characters and 21000 Hanzi
>> characters. UTF-16 is totally fine
>> to deal with them.
>
> UTF-16 is fine for characters within the BMP, yes.  But Emacs supports
> much more, see below.
>
>> Say my name is 陈斌 and my colleague's name is 张斌. Only one Hanzi of
>> difference, right?
>>
>> While reading a file named "陈斌's document0", I want to list other
>> documents with similar file names.
>> Using my byte-compare algorithm, "陈斌's document1" and "陈斌's document2" are
>> at the top of the sorted
>> document list. Makes sense.
>>
>> Without byte comparing, my family name "陈" is no more important than
>> a Latin character,
>> so files like "张斌's document1" and "张斌's document2" are at the top of the list
>> now.
>
> You are saying that a difference of 1 character has more "weight" for
> some characters than for others.  That might be so, but using the
> number of bytes in the UTF-8 representation as that weight makes very
> little sense to me, because the length of the UTF-8 sequence for a
> given character is determined by reasons that are largely historical:
> it depends on when a particular character was introduced into Unicode.
> It has nothing to do with "importance" of a character.
>
> Thus, your example might produce a sensible result when comparing
> ASCII characters vs non-ASCII characters, but will make much less
> sense when comparing one non-ASCII character with another non-ASCII
> character.  For example, emoji characters, which need 4 bytes in UTF-8
> sequences, will be considered twice as "important" as the above Hanzi
> characters of your example (and 4 times as important as ASCII).  The
> results will be totally unexpected and unpredictable (unless the user
> knows the entire Unicode database by heart).
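>
> For illustration only, a tiny standalone snippet (assuming the source
> file is saved as UTF-8) showing the per-character byte counts that the
> byte-compare approach implicitly treats as weights; the sample strings
> are arbitrary:
>
>   #include <stdio.h>
>   #include <string.h>
>
>   int
>   main (void)
>   {
>     /* strlen counts bytes, not characters, in a UTF-8 literal.  */
>     printf ("%zu\n", strlen ("a"));    /* ASCII letter:  1 byte   */
>     printf ("%zu\n", strlen ("陈"));   /* BMP Hanzi:     3 bytes  */
>     printf ("%zu\n", strlen ("😀"));   /* non-BMP emoji: 4 bytes  */
>     return 0;
>   }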
>
> So I think features such as what you want to achieve with the above
> example will have to use some other additional feature, like "weighted
> difference", and using the byte length as a kind of surrogate "weight"
> for this purpose is IMO not a good idea, even though it could look
> like that at first glance.
>
>> 2. As mentioned, China has a national standard, GBK, which is compatible
>> with UTF-16.
>> Every character is a 16-bit integer. It's common practice to convert a
>> UTF-8 string (or other encoded string)
>> to UTF-16 format before any further string operation. Otherwise, it's
>> impossible to display Japanese, Korean, and Chinese
>> in one document (considering typical user operations, text
>> searching/replacing ...)
>>
>> I've been doing this since day one of my career because I'm Chinese.
>
> As I said, this is okay for characters in the BMP.  But Emacs must
> support much more.
>
> First, we support the latest Unicode 10.0, which added a lot of
> characters outside the BMP.  You have the emoji there, you have many
> new symbols there (the Alchemical Symbols, Geometric Shapes, and
> Supplemental Arrows-C blocks), Domino Tiles, Mathematical
> Alphanumerics, Musical symbols, and many scripts that there's no
> reason to tell people we don't support in Emacs.
>
> Moreover, Emacs's internal representation of raw 8-bit bytes uses the
> area #x3FFF00..#x3FFFFF, which need 5 bytes in their UTF-8 sequences.  We
> cannot exclude the possibility that strings we are comparing have
> embedded raw bytes in them, this happens in practice and must work as
> expected.
>
>> The integer is called a wide character (wchar_t in standard C, or WCHAR
>> in the Windows GUI SDK),
>
> Windows indeed uses a 16-bit wchar_t type, which is why its support
> for characters beyond the BMP is problematic and in some cases simply
> non-existent, because standard C library functions that accept a
> single wide character argument are unable to cope with UTF-16 encoding
> that requires 2 16-bit words to represent a character.  So, for
> example, you don't have a way of looking for a character outside the
> BMP in a wide character string by calling the standard function
> wcschr, because its 2nd argument is a single wchar_t wide character,
> which can only express characters inside the BMP.
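>
> A minimal sketch of that limitation, assuming a Windows-style 16-bit
> wchar_t (the string and character values are just examples):
>
>   #include <stddef.h>
>   #include <stdio.h>
>   #include <wchar.h>
>
>   int
>   main (void)
>   {
>     /* U+1F600, an emoji outside the BMP, must be spelled as the
>        surrogate pair D83D DE00 when wchar_t is 16 bits wide; no
>        single wchar_t value names the character itself.  */
>     const wchar_t *s = L"a\xD83D\xDE00" L"b";
>     /* The best wcschr can do is look for one surrogate half.  */
>     const wchar_t *p = wcschr (s, 0xD83D);
>     printf ("high surrogate found at index %td\n",
>             p ? p - s : (ptrdiff_t) -1);
>     return 0;
>   }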
>
> Emacs cannot tolerate such limitations, not in the year 2018.
>
> So using implementations that only support the BMP is really a
> non-starter for Emacs.  We cannot accept such code.
>
>> Now allow me to explain why I can't use CHAR_STRING_ADVANCE. The first
>> parameter 'c' is ALREADY a wide character.
>
> I apologize for my stupid typo.  I meant STRING_CHAR_ADVANCE, not
> CHAR_STRING_ADVANCE.  At least I didn't make such a mistake in the 2nd
> macro I mentioned, FETCH_STRING_CHAR_ADVANCE...
>
>> It's efficient to extract characters into one unsigned short array in
>> one shot.
>
> It is indeed quite efficient, but it requires one more pass of the
> entire string, which is unnecessary.  And anyway, the main reason not
> to convert to UTF-16 is what I wrote above about supporting characters
> beyond the BMP, not the extra loop.
>
>> Otherwise, I need to convert the character index to a byte offset first. So I
>> need to get the byte offset by calling 'string_char_to_byte'
>> twice to get the offsets of the two elements to compare. Each element
>> could be 1, 2, or 3 bytes. So I still need to convert the bytes
>> into a 4-byte integer for comparing.
>
> The macros I mentioned already do that for you, they manage both
> character offsets and byte offsets efficiently (and they are much more
> efficient than string_char_to_byte because they advance sequentially,
> whereas string_char_to_byte is for random access to a string).
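>
> A minimal sketch of the kind of loop these macros make possible, assuming
> the Emacs C environment of the time (lisp.h, character.h); `str' and the
> destination buffer `buf' (an int array of at least SCHARS (str) elements)
> are hypothetical caller-supplied values, not code from the patch:
>
>   /* Walk the multibyte Lisp string STR one character at a time.  The
>      macro maintains both the character index and the byte index, so
>      no call to string_char_to_byte is needed, and the values stored
>      in C are full Emacs character codes, including those beyond the
>      BMP.  */
>   ptrdiff_t nchars = SCHARS (str);
>   ptrdiff_t i = 0, i_byte = 0;
>   while (i < nchars)
>     {
>       int c;
>       FETCH_STRING_CHAR_ADVANCE (c, str, i, i_byte);
>       buf[i - 1] = c;   /* the macro has already advanced i */
>     }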
>
> Thanks.
>
> P.S.  Please, let's return this discussion to the list, where it
> belongs.  I'm disappointed that this was again a private email, and
> the list isn't CC'ed.  This discussion should be public, not private.



-- 
help me, help you.


