Wed, 24 Aug 2011 18:51:32 -0700
Chet Ramey wrote:
> In yours, however, it is 0x65 0xcc 0x81 which is U+0065 LATIN SMALL
> LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
> That's not valid UTF-8, since UTF-8 requires that the shortest sequence
> be used to encode a character.
This isn't exactly true...
Valid UTF-8 is anything that decodes to VALID code points.
It may not be the "NFC" form, but it is a valid NFD and NFKD form.
NFC format is only *recommended* (not required) by W3C (which
is separate from the Unicode standards body).
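For anyone who wants to check this themselves, here is a minimal Python sketch (my own illustration, using only the standard unicodedata module; the helper names are made up). It decodes the three bytes strictly, names the code points, and reports which normalization forms the string already satisfies:

```python
import unicodedata

raw = bytes([0x65, 0xCC, 0x81])   # the byte sequence under discussion
text = raw.decode("utf-8")        # strict decode: raises on invalid UTF-8


def describe(s):
    """Return the Unicode name of every code point in s."""
    return [unicodedata.name(ch) for ch in s]


def forms(s):
    """Map each normalization form to True if s is already in that form."""
    return {f: unicodedata.normalize(f, s) == s
            for f in ("NFC", "NFD", "NFKC", "NFKD")}


print(describe(text))
print(forms(text))
# The composed equivalent, re-encoded as UTF-8:
print(unicodedata.normalize("NFC", text).encode("utf-8"))
```

The decode succeeding is the "valid UTF-8" part; the normalization checks are a separate question layered on top of it.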
The Unicode Consortium makes available a "torture test" that can be used to
evaluate bash's compliance in correctly normalizing alternate forms.
It's in a file called NormalizationTest.txt in the directory of each
released version of Unicode.
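As a rough idea of how such a test could be driven, here is a Python sketch. It assumes the file's layout of five semicolon-separated columns of hex code points (source, NFC, NFD, NFKC, NFKD) followed by a comment; the SAMPLE row is written by hand in that layout for illustration, and parse_fields / check_line are hypothetical helper names. It checks only the source column against the other four, which is a subset of what the file's header asks for:

```python
import unicodedata

# One data row in the assumed layout: source;NFC;NFD;NFKC;NFKD; # comment
SAMPLE = "1E0A;1E0A;0044 0307;1E0A;0044 0307; # sample row"


def parse_fields(line):
    """Turn one data row into five strings (source, NFC, NFD, NFKC, NFKD)."""
    data = line.split("#", 1)[0]          # drop the trailing comment
    cols = data.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in cols]


def check_line(line):
    """Compare the library's normalizers with one row of expected results."""
    c1, c2, c3, c4, c5 = parse_fields(line)
    norm = unicodedata.normalize
    return (norm("NFC", c1) == c2 and norm("NFD", c1) == c3 and
            norm("NFKC", c1) == c4 and norm("NFKD", c1) == c5)


print(check_line(SAMPLE))
```

A real driver would loop over the whole file and skip comment and section-marker lines before calling check_line.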
NFC is "Normalization Form C" (canonical composition) and NFD is
"Normalization Form D" (canonical decomposition); NFKC and NFKD are the
compatibility variants, which also fold alternate expressions of the
same character (ones that were written differently) into a single form.
I.e., take the simple case of a letter with an accent and an underscore
written as one "character": both of the following chars are combining. But
the order is not fixed, and is often locale dependent. (Humans are humans,
not computers; it's up to computers to normalize them. Ideally humans would
learn NFC, as it's a modern convention, but the W3C feels the author should
be the one to use it, as they know the original intent when writing it.)
Example: the letter ǟ is the NFC form of the letter 'a',
a combining diaeresis, and a combining macron. Note that the order
in which combining marks are entered isn't necessarily important: marks of
different combining classes (say, one above the letter and one below)
display the same either way, although two marks of the same class, like
these two, stack in the order entered. To humans reading it, these are
all valid characters. Computers are the things that get confused, but
computers don't prescribe language use. UTF-8 is an encoding. (Like
POSIX, which was supposed to be descriptive, not prescriptive, but failed.)
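A small Python sketch of the ǟ case, using the standard unicodedata module (the helper names are my own). It shows the combining class of each mark and what NFC produces for each entry order, so you can see when the order is preserved by normalization:

```python
import unicodedata

A, DIAERESIS, MACRON = "a", "\u0308", "\u0304"


def classes(s):
    """Canonical combining class of each code point in s."""
    return [unicodedata.combining(ch) for ch in s]


def nfc_codepoints(s):
    """Normalize s to NFC and list the result as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", s)]


print(classes(DIAERESIS + MACRON))               # class of each mark
print(nfc_codepoints(A + DIAERESIS + MACRON))    # diaeresis entered first
print(nfc_codepoints(A + MACRON + DIAERESIS))    # macron entered first
```

Marks that share a combining class are never reordered by normalization, so the two entry orders stay distinct strings.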
When one speaks of valid / invalid UTF-8 sequences, it means it is possible
to have hex strings that decode (under some decoder) to a valid code point,
but unless they are in minimal form, they are not guaranteed to work on
other decoders. It's in the encoding of a single valid 'code point' that
UTF-8 must be minimal. HOWEVER, the base note writer used
2 code points, a letter & a combining accent; both are valid code points,
and legal under UTF-8. They aren't minimal nor in the Normalized
format, but they are legal/valid UTF-8 strings that should (not must,
but ideally) be *semantically* handled like their pre-composed form.
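To illustrate the difference between "not the shortest encoding of one code point" and "two code points", here is a hedged Python sketch (try_decode is a made-up helper; Python's built-in strict UTF-8 decoder stands in for "some decoder"). The first input pads 'e' out to two bytes; the second is the sequence from the original report:

```python
def try_decode(raw):
    """Strictly decode raw as UTF-8; return its code points, or None if invalid."""
    try:
        return [f"U+{ord(ch):04X}" for ch in raw.decode("utf-8")]
    except UnicodeDecodeError:
        return None


overlong_e = bytes([0xC1, 0xA5])         # 'e' (0x65) padded to two bytes: not minimal
decomposed = bytes([0x65, 0xCC, 0x81])   # 'e' then U+0301, each in shortest form

print(try_decode(overlong_e))
print(try_decode(decomposed))
```

Only the first input runs into the shortest-form rule; the second is simply two well-formed characters in a row.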
Syntactically is a different matter, as they can be broken down
deliberately, for some reason unknown to a computer, though I suspect many
word processors of tomorrow will auto-correct non-canonical forms by default
and you'll have to jump through hoops to keep them from doing so.
Example: take the string (separated by vertical bars for legibility):
"a | ̧ | ̣ | ̊", i.e. an 'a' followed by a cedilla (class 202), an
underdot accent (class 220), and a ring accent (class 230). The
glyph obtained in the end is: "ạ̧̊". [NOTE: it looks correct in a
variable-width font, but wrong in TB's fixed font. It does look correct on
a terminal emulator like 'ScreenCRT' (windows), uxterm and 'terminal'
(linux/x), but is blank on xterm (maybe it doesn't have the right charset),
and causes a write error under the windows cmd console in
cp 65001 (what else is new!).] The NFD of this string will be the
same, because the string is already canonical (the classes are in
increasing order). On the other hand, the NFC is "ạ | ̧ | ̊": the
rules of NFC enabled combining of the underdot accent, despite its
distance from the base character.
So in that character, you still have 2 combining characters following
a base character -- which I suspect would lead to the same problem -- even
though the above is in NFC form.
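The same example as a Python sketch, again with the standard unicodedata module and helper names of my own choosing. It lists the combining classes, checks that NFD leaves the string alone, and prints the names of whatever code points remain after NFC:

```python
import unicodedata

CEDILLA, DOT_BELOW, RING = "\u0327", "\u0323", "\u030a"
s = "a" + CEDILLA + DOT_BELOW + RING


def classes(text):
    """Canonical combining class of each code point (0 means a base character)."""
    return [unicodedata.combining(ch) for ch in text]


nfd = unicodedata.normalize("NFD", s)
nfc = unicodedata.normalize("NFC", s)

print(classes(s))                            # classes in entry order
print(nfd == s)                              # is the input already in NFD?
print([unicodedata.name(ch) for ch in nfc])  # what is left after composition
print(classes(nfc))                          # non-zero entries are still combining
```

Whichever mark gets absorbed into the base letter, the NFC result still carries combining characters, so software that assumes one code point per displayed letter will trip over it either way.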
The W3C is a strong proponent of the NFC form -- so that web software won't
have to deal with normalization. But bash isn't web SW, so none of the
'desires' would even apply in bash.
> I'm not intimately familiar with this stuff myself, but it looks like
> a real bastard to me... I thought the point of UTF-8 was that you could
> read it a byte at a time, and know when you encountered a byte that
> signified the start of a multi-byte character. But apparently not!
> If I'm interpreting this COMBINING ACUTE ACCENT thing properly, the
> only indicator that you are in a multi-byte character comes with the
> *second* byte, so you have to backtrack. What idiot thought this up?
A multi-byte character represents 1 code point. You can have
several code points that yield 1 letter.
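At the byte level nothing requires backtracking: each byte's high bits say what role it plays. A minimal Python sketch of that (classify is a hypothetical helper; it looks at bit patterns only and does not validate whole sequences):

```python
def classify(byte):
    """Role of a single byte in a UTF-8 stream, judged from its high bits alone."""
    if byte < 0x80:
        return "single"         # 0xxxxxxx: a complete one-byte character
    if byte < 0xC0:
        return "continuation"   # 10xxxxxx: never starts a character
    if byte < 0xE0:
        return "lead-2"         # 110xxxxx: starts a two-byte character
    if byte < 0xF0:
        return "lead-3"         # 1110xxxx: starts a three-byte character
    if byte < 0xF8:
        return "lead-4"         # 11110xxx: starts a four-byte character
    return "invalid"            # 11111xxx: never appears in UTF-8


# Pattern check only: 0xC0, 0xC1 and 0xF5..0xF7 match a lead pattern
# but cannot occur in valid UTF-8.
print([classify(b) for b in bytes([0x65, 0xCC, 0x81])])
```

So 0x65 is complete on its own, and 0xCC announces a two-byte character before 0x81 is ever read. Deciding that the accent belongs with the preceding 'e' happens one level up, at the code point level.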
TB (on win) has some quirks with spacing control regarding those
characters...as well, but they work fine in a character based window
like the term mode or even Gvim... (which isn't to say the gvim window is
exactly legible...but hey, it still takes 1 space!)...
BTW, Thomas -- what is the character that comes after 'De' in your
name? I read it as hex '0xc282c2', which doesn't seem to be valid Unicode.
BTW -- nothing I've said makes any comment on things I didn't mention..
- Re: accents, Thomas De Contes, 2011/08/24
- Re: accents,
Linda Walsh <=