Re: UTF-16 and (ice-9 rdelim)

From: Neil Jerram
Subject: Re: UTF-16 and (ice-9 rdelim)
Date: Mon, 18 Jan 2010 20:13:31 +0000
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1 (gnu/linux)

Hi Mike,

Many thanks for your quick response.  I'll hopefully work on these fixes

A few comments...

Mike Gran <address@hidden> writes:

> This should work.  BOMs are either 2, 3, or 4 bytes for UTF-16, UTF-8,
> and UTF-32 respectively.  And if the port encoding is expected to be
> set correctly in the first place, a BOM should always be the first
> code point returned by read-char.

Thanks.  For the moment, I am assuming that the encoding will have
previously been declared correctly, by `set-port-encoding' or by a
`coding:' comment.
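
To make the byte counts above concrete, here is a minimal sketch in Python rather than Guile, so it illustrates the encodings and not Guile's port code.  It uses the BOM constants from Python's standard `codecs' module; the helper names `bom_length' and `first_code_point' are made up for this example.  It assumes, as in the discussion, that the encoding is declared explicitly with its endianness.

```python
import codecs

# BOM byte sequences, taken from Python's standard library.
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}


def bom_length(encoding):
    """Return the BOM size in bytes for an explicitly named encoding."""
    return len(BOMS[encoding])


def first_code_point(data, encoding):
    """Decode DATA with ENCODING and return its first code point.

    Python's explicit-endianness codecs (and plain "utf-8") do not
    strip a BOM, so a leading BOM decodes to U+FEFF.
    """
    return ord(data.decode(encoding)[0])


# A UTF-16BE byte stream that starts with a BOM.
sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
print(bom_length("utf-16-be"), hex(first_code_point(sample, "utf-16-be")))
```

With a correctly declared encoding, the BOM shows up as the single code point U+FEFF whatever its size in bytes, which is what makes a check at the character level attractive.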

> If you already have to go to the trouble of converting to u32, it might
> be simplest to reimplement the non-Latin-1 case in Scheme,
> since read-char and unread-char should work even for UTF-16.
> That might do bad things to speed, though.

I'll have a look; it's nice to prototype that way, at least.

> There are a couple of issues here.  If you want a port to automatically
> identify a Unicode encoding by checking its first four bytes for a BOM, 
> then you would need some sort of association table.  It wouldn't be that
> hard to do.

I'm not thinking of that yet.  (For the future, clearly it must be
possible, as Emacs is doing it all the time.)

> But, if you just want to get rid of a BOM, you can cut it down to 
> a rule.  If the first code point that a port reads is U+FEFF and if the
> encoding has the string "utf" in it, ignore it.  If the first code point
> is U+FFFE and the encoding has "utf" in it, flag an error.


Out of interest, does that mean that iconv will auto-detect the
endianness if the encoding does not explicitly say "le" or "be"?
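
For reference, the BOM-dropping rule quoted above can be sketched as follows.  This is again Python rather than Guile, and it works on text that has already been decoded, so it says nothing about what iconv does.  The function name `strip_leading_bom' is hypothetical, and the substring test on "utf" is just the heuristic from the quoted rule.

```python
def strip_leading_bom(text, encoding):
    """Apply the quoted rule to already-decoded TEXT.

    - Encoding name lacks "utf": return TEXT unchanged.
    - First code point is U+FEFF: drop it.
    - First code point is U+FFFE: the data was most likely decoded
      with the wrong byte order, so raise an error.
    """
    if not text or "utf" not in encoding.lower():
        return text
    first = text[0]
    if first == "\ufeff":
        return text[1:]
    if first == "\ufffe":
        raise ValueError("byte-swapped BOM (U+FFFE) under " + encoding)
    return text


print(strip_leading_bom("\ufeffhello", "UTF-16LE"))
```

The U+FFFE case arises when, for example, little-endian data is read through a port declared as big-endian: the BOM bytes FF FE then decode to U+FFFE instead of U+FEFF.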

