Re: UTF-16 and (ice-9 rdelim)
From: Neil Jerram
Subject: Re: UTF-16 and (ice-9 rdelim)
Date: Mon, 18 Jan 2010 20:13:31 +0000
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1 (gnu/linux)
Hi Mike,
Many thanks for your quick response. I hope to work on these fixes
shortly.
A few comments...
Mike Gran <address@hidden> writes:
> This should work. BOMs are either 2, 3, or 4 bytes for UTF-16, UTF-8,
> and UTF-32 respectively. And if the port encoding is expected to be
> set correctly in the first place, a BOM should always be the first
> code point returned by read-char.
Thanks. For the moment, I am assuming that the encoding will have
previously been declared correctly, by `set-port-encoding' or by a
`coding:' comment.
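
As a quick sanity check on the BOM sizes and on the "first code point"
claim, here is a small sketch. It is written in Python only because it is
easy to run; nothing in it is Guile API, and the names `bom_lengths`,
`data` and `first_char` are just illustrative.

```python
import codecs

# BOM byte lengths for the three encodings mentioned above.
bom_lengths = {
    "UTF-16": len(codecs.BOM_UTF16_LE),
    "UTF-8": len(codecs.BOM_UTF8),
    "UTF-32": len(codecs.BOM_UTF32_LE),
}

# With the encoding declared up front (here explicitly little-endian),
# the decoder does not consume the BOM, so it comes back as U+FEFF,
# the very first code point of the decoded text.
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
first_char = data.decode("utf-16-le")[0]
```

This mirrors the assumption that the port encoding is already correct
before the first character is read.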
> If you already have to go to the trouble of converting to u32, it might
> be simplest to reimplement the non-Latin-1 case in Scheme,
> since read-char and unread-char should work even for UTF-16.
> That might do bad things to speed, though.
I'll have a look; it's nice to prototype that way, at least.
> There are a couple of issues here. If you want a port to automatically
> identify a Unicode encoding by checking its first four bytes for a BOM,
> then you would need some sort of association table. It wouldn't be that
> hard to do.
I'm not thinking of that yet. (For the future, clearly it must be
possible, as Emacs does it all the time.)
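
For reference, the kind of association table Mike describes might look
roughly like the sketch below. It is Python rather than Guile, purely for
illustration; `BOM_TABLE` and `sniff_encoding` are hypothetical names, and
the encoding labels are Python codec names, not iconv ones.

```python
import codecs

# Longest BOMs first: the UTF-32-LE BOM (FF FE 00 00) begins with the
# UTF-16-LE BOM (FF FE), so the order of the checks matters.
BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(head):
    """Return (encoding, bom_length) for the first bytes of a stream.

    Returns (None, 0) when no known BOM is present.
    """
    for bom, name in BOM_TABLE:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0
```

One caveat with any such table: UTF-16-LE text that starts with a BOM
followed by U+0000 is indistinguishable from a UTF-32-LE BOM, so the guess
can be wrong in that corner case.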
> But, if you just want to get rid of a BOM, you can cut it down to
> a rule. If the first code point that a port reads is U+FEFF and if the
> encoding has the string "utf" in it, ignore it. If the first code point
> is U+FFFE and the encoding has "utf" in it, flag an error.
Agreed.
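
To make sure I have understood the rule, here it is as a minimal sketch
(Python for illustration only, not the eventual Guile code; `strip_bom` is
a hypothetical helper operating on text that has already been decoded):

```python
def strip_bom(text, encoding):
    """Apply the BOM rule to text decoded under `encoding`."""
    # The rule only applies to encodings whose name contains "utf".
    if "utf" not in encoding.lower() or not text:
        return text
    if text[0] == "\ufeff":
        # A correctly decoded BOM: silently drop it.
        return text[1:]
    if text[0] == "\ufffe":
        # A byte-swapped BOM: the declared endianness must be wrong.
        raise ValueError("byte-swapped BOM, wrong endianness for " + encoding)
    return text
```

For a non-UTF encoding the text is returned untouched, even if it happens
to start with U+FEFF.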
Out of interest, does that mean that iconv will auto-detect the
endianness if the encoding does not explicitly say "le" or "be"?
Regards,
Neil