bug#12803: 24.3.50; accented Thai Unicode characters are turned into dec
From: Kenichi Handa
Subject: bug#12803: 24.3.50; accented Thai Unicode characters are turned into decomposed ones on Mac OS X by replace-regexp
Date: Mon, 05 Nov 2012 23:41:58 +0900
In article <DF4C7EEF-CE55-4363-A91A-0577DD28AEED@freenet.de>, Peter Dyballa
<peter_dyballa@freenet.de> writes:
> I wanted to get the unique Thai characters from such an eMail subject:
> FW:grcthai สร้างรายได้แบบไร้ขีดจำกัด กับการทำงานแบบไร้ขอบเขต..
> So I marked the Thai text and invoked replace-regexp with "\(.\)" -> "\1 " to
> later do replace-string " " -> "C-qC-j" and then [g]sort -u the result. I had
> in buffer *Shell Command Output* decomposed Thai Unicode characters…
> But actually it is already the function replace-regexp which produces the
> decomposed characters (originally 41 characters, after replace-regexp not 82
> but 89 according to column-number-mode).
There is no such thing as an "accented Thai Unicode
character". Your example was not originally 41 characters;
it was 41 columns wide on display.
For Thai, Unicode does not assign a single character code
to, for instance, "ร้". It is a two-character sequence that
is composed on display into one grapheme cluster occupying
one column.
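To check this outside Emacs, here is a minimal Python sketch
using the standard unicodedata module. The helper names
code_points and approx_columns are made up for illustration,
and the column estimate is only a rough assumption (it skips
nonspacing marks); it is not how Emacs computes columns.

```python
import unicodedata

def code_points(text):
    """Return (code point, Unicode name, general category) per character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
            for ch in text]

def approx_columns(text):
    """Rough column estimate: do not count nonspacing marks (category Mn)."""
    return sum(1 for ch in text if unicodedata.category(ch) != "Mn")

# RO RUA followed by MAI THO, shown as one cluster on display.
cluster = "\u0e23\u0e49"

for row in code_points(cluster):
    print(row)

# Number of characters versus estimated display columns.
print(len(cluster), approx_columns(cluster))
```

The same two helpers can be applied to the whole marked
region to compare its character count with its estimated
column count.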
A stranger-looking example is "จำ". It is also a
two-character sequence: the first character is จ and the
second is ำ. Unicode has no character "จ with
small-circle-above".
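As a hedged illustration, the following Python sketch (again
using only the standard unicodedata module) lists the two
characters of this sequence and checks that NFC
normalization, which composes characters wherever a
precomposed form exists, leaves the sequence as it is. The
variable names are arbitrary.

```python
import unicodedata

# CHO CHAN followed by SARA AM.
seq = "\u0e08\u0e33"

names = [unicodedata.name(ch) for ch in seq]
print(names)

# NFC would merge the pair if Unicode had a precomposed character for it.
nfc = unicodedata.normalize("NFC", seq)
print(nfc == seq, len(nfc))
```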
---
Kenichi Handa
handa@gnu.org