Re: [Bug-gnubg] updated po files

From: Jim Segrave
Subject: Re: [Bug-gnubg] updated po files
Date: Tue, 15 Jun 2004 21:50:41 +0200
User-agent: Mutt/1.4.1i

On Tue 15 Jun 2004 (09:03 +0000), Joern Thyssen wrote:
> On Tue, Jun 15, 2004 at 10:33:45AM +0200, Petr Kadlec wrote


> > Well, my file has a different checksum, but if I convert the line ends
> > using fromdos, I get exactly the same hash, which proves that CVS
> > converts line ends during transfer (which is good and well).
> Well, yes and no. For single byte character sets this is good. However,
> for multiple byte characters sets this is problematic. For example, the
> unicode character sequence for a c with a dot above is 0x01 0x0A. I
> think cvs would convert this to 0x01 0x0A 0x0D inserting a line feed in
> the text. Consider this imaginary UTF-8 sequence: 0x0A 0x56 being
> converted by cvs to 0x0A 0x0D 0x56, which is probably an illegal UTF-8
> sequence. 
> Anyway, I can see that Kaoru has committed a fix.

I think the Unicode character set is encoded as UTF-8 here, where the
representations are somewhat different - see


The length of each character is flagged by the first byte of its
sequence: a first byte < 0x80 is a 1-byte character, 0xc0..0xdf is the
first byte of a 2-byte sequence, 0xe0..0xef the first byte of a 3-byte
sequence, etc. So the 0x0a 0x56 becoming 0x0a 0x0d 0x56 would not
create a new UTF-8 character. Further, each byte of the UTF-8 encoding
of a multibyte character will have bit 7 set, so no multibyte character
will ever contain 0x0a, and converting 0x0a to 0x0d 0x0a won't change
any multibyte character.
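
A rough Python sketch of the point above, not taken from cvs itself:
the helper lf_to_crlf and the sample string are made up for
illustration, and the conversion is only assumed to behave like a
plain byte-level replacement of 0x0a by 0x0d 0x0a. It checks that
UTF-8 text survives such a conversion, while the two-byte form
0x01 0x0a quoted earlier (which is the UTF-16 big-endian encoding of
that character, not UTF-8) does not.

```python
def lf_to_crlf(data: bytes) -> bytes:
    """Hypothetical stand-in for a byte-level line-end conversion."""
    return data.replace(b"\n", b"\r\n")


# Sample text (an assumption): C with dot above, e-acute, two line ends.
text = "\u010a caf\u00e9\nnext line\n"
utf8 = text.encode("utf-8")

# Every byte belonging to a multibyte UTF-8 character has bit 7 set,
# so none of them can be 0x0a.
multibyte_bytes = [
    b
    for ch in text
    if len(ch.encode("utf-8")) > 1
    for b in ch.encode("utf-8")
]
assert all(b & 0x80 for b in multibyte_bytes)

# After conversion the data is still valid UTF-8 and only the line
# ends differ.
utf8_roundtrip = lf_to_crlf(utf8).decode("utf-8")
assert utf8_roundtrip == text.replace("\n", "\r\n")

# The same character in UTF-16 big-endian is 0x01 0x0a, and there the
# conversion does damage the data.
utf16 = "\u010a".encode("utf-16-be")
try:
    lf_to_crlf(utf16).decode("utf-16-be")
    utf16_survives = True
except UnicodeDecodeError:
    utf16_survives = False
```

So the concern raised in the quoted text applies to encodings such as
UTF-16, where 0x0a can appear inside a character, but not to UTF-8.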

What can be a problem is writing 8-bit single-byte characters and then
reading the result expecting UTF-8, since any characters in the input
with bit 7 set will be mistakenly read as parts of (possibly illegal)
multi-byte sequences.
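
A small Python sketch of that failure mode, assuming the single-byte
character set is Latin-1 (the thread does not say which one is
involved; the helper decodes_as_utf8 and the sample strings are
invented for illustration):

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# "cafe" with e-acute in Latin-1 ends in the single byte 0xe9. Read as
# UTF-8, 0xe9 announces a 3-byte sequence that never arrives.
latin1_word = "caf\u00e9".encode("latin-1")
illegal_case = decodes_as_utf8(latin1_word)

# Two Latin-1 characters (A-tilde, copyright sign) happen to form the
# bytes 0xc3 0xa9, which is a legal UTF-8 sequence for a different
# character, so the text is silently misread instead of rejected.
latin1_pair = "\u00c3\u00a9".encode("latin-1")
misread = latin1_pair.decode("utf-8")
```

The first case is rejected outright; the second is accepted but comes
out as a single e-acute rather than the two characters that were
written.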

Jim Segrave           address@hidden
