Re: UTF-16 and (ice-9 rdelim)
Sun, 17 Jan 2010 16:11:44 -0800 (PST)
> From: Neil Jerram <address@hidden>
> 1. It seems that most (all?) UTF-16 files begin with a byte order marker
> (BOM), \ufeff, which readers are conventionally supposed to discard -
> but Guile doesn't. So first-line becomes "\ufeffhello"
> 2. The internals of (read-line) just search for a '\n' char to determine
> the end of the first line, which means they're assuming that
> - '\n' never occurs as part of some other multibyte sequence
> - when '\n' occurs as part of the newline sequence, it occupies a single byte
> This causes the second line to be read wrong, because newline in
> little-endian UTF-16 is actually 2 bytes - \n \0 - and the first
> (read-line) leaves the \0
> byte unconsumed.
> I think the fixes for these are roughly as follows.
> For 1:
> - Add a flag to the representation of a file port to say whether we're
> still at the start of the file. This flag starts off true, and
> becomes false once we've read enough bytes to get past a possible BOM.
> - Define a static map from encodings to possible BOMs.
> - When reading bytes, and the flag is true, and the port has an
> encoding, and that encoding has a possible BOM, check for and consume
> the BOM.
This should work. BOMs are 2, 3, or 4 bytes for UTF-16, UTF-8, and
UTF-32 respectively. And assuming the port's encoding is set correctly
in the first place, a BOM should always be the first code point
returned by read-char.
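To make the shape of that static map concrete, here is a minimal C
sketch. The table layout, entry names, and skip_bom function are all
hypothetical, not Guile's actual internals; the BOM byte sequences
themselves are the standard ones.

```c
#include <string.h>

/* Hypothetical static map from encoding names to BOM byte sequences. */
struct bom_entry {
  const char *encoding;
  const unsigned char bom[4];
  size_t len;
};

static const struct bom_entry bom_table[] = {
  { "UTF-8",    { 0xEF, 0xBB, 0xBF },       3 },
  { "UTF-16LE", { 0xFF, 0xFE },             2 },
  { "UTF-16BE", { 0xFE, 0xFF },             2 },
  { "UTF-32LE", { 0xFF, 0xFE, 0x00, 0x00 }, 4 },
  { "UTF-32BE", { 0x00, 0x00, 0xFE, 0xFF }, 4 },
};

/* If BUF (N bytes) begins with the BOM for ENCODING, return the
   number of bytes to consume; otherwise return 0.  A port would call
   this once, while its at-start-of-file flag is still true.  */
size_t
skip_bom (const char *encoding, const unsigned char *buf, size_t n)
{
  size_t i;
  for (i = 0; i < sizeof bom_table / sizeof bom_table[0]; i++)
    if (strcmp (bom_table[i].encoding, encoding) == 0)
      {
        if (n >= bom_table[i].len
            && memcmp (buf, bom_table[i].bom, bom_table[i].len) == 0)
          return bom_table[i].len;
        return 0;
      }
  return 0;
}
```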
> For 2:
> - In scm_do_read_line(), keep the current (fast) code for the case where
> the port has no encoding.
> - When the port has an encoding, use a modified implementation that
> copies raw bytes into an intermediate buffer, calls
> u32_conv_from_encoding to convert those to u32*, and uses u32_strchr
> to look for a newline.
> Does that sound about right? Are there any possible optimizations?
If you already have to go to the trouble of converting to u32, it might
be simplest to reimplement the non-Latin-1 case in Scheme,
since read-char and unread-char should work even for UTF-16.
That might do bad things to speed, though.
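For illustration, a stripped-down sketch of the per-code-unit scan for
the UTF-16LE case (function name and interface are made up for this
example; the real fix would go through u32_conv_from_encoding as
described above). It also shows why the byte-oriented search goes
wrong: searching for the single byte 0x0A would stop between the two
bytes of the newline.

```c
#include <stddef.h>

/* Scan a UTF-16LE buffer one 16-bit code unit at a time and return
   the byte offset just past the first newline, or N if none is found.
   Consuming both bytes of the newline is the point: a byte-wise
   search for '\n' would leave the trailing 0x00 unconsumed.  */
size_t
utf16le_line_end (const unsigned char *buf, size_t n)
{
  size_t i;
  for (i = 0; i + 1 < n; i += 2)
    {
      unsigned int cu = buf[i] | (buf[i + 1] << 8);  /* little-endian */
      if (cu == 0x000A)                              /* '\n' */
        return i + 2;
    }
  return n;
}
```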
> For the static map, is there a canonical set of possible encoding
> strings, or a way to get a single canonical ID for all the strings that
> are allowed to mean the same encoding? For UTF-16, for example, it
> seems to me that many of the following encoding strings will work
> + the same with different case
> and we don't want a map entry for each one.
> I suppose one pseudo-canonical method would be to upcase and remove all
> punctuation. Then we're only left with "UTF16" and "UTF16LE", which
> makes sense.
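That pseudo-canonical method is small enough to sketch directly (the
function name here is hypothetical): upcase and drop everything that
is not alphanumeric, so "utf-16", "UTF_16" and "Utf16" all collapse to
"UTF16".

```c
#include <ctype.h>

/* Canonicalize an encoding name by upcasing and dropping punctuation.
   OUT must have room for strlen (in) + 1 bytes.  */
void
canonicalize_encoding (const char *in, char *out)
{
  while (*in)
    {
      if (isalnum ((unsigned char) *in))
        *out++ = (char) toupper ((unsigned char) *in);
      in++;
    }
  *out = '\0';
}
```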
There are a couple of issues here. If you want a port to automatically
identify a Unicode encoding by checking its first four bytes for a BOM,
then you would need some sort of association table. It wouldn't be that
hard to do.
But, if you just want to get rid of a BOM, you can cut it down to a
rule: if the first code point that a port reads is U+FEFF and the
encoding name contains "utf", discard that code point. If the first
code point is U+FFFE and the encoding contains "utf", flag an error.
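That rule fits in a few lines of C. This is only a sketch of the rule
as stated above (names are made up, and it assumes the encoding string
has already been lower-cased); U+FFFE as the first code point means
the BOM was decoded with the wrong byte order.

```c
#include <string.h>

enum bom_action { BOM_KEEP, BOM_SKIP, BOM_ERROR };

/* Decide what to do with the first code point CP read from a port
   whose encoding name (lower-cased) is ENCODING.  */
enum bom_action
classify_first_cp (unsigned int cp, const char *encoding)
{
  if (strstr (encoding, "utf") == NULL)
    return BOM_KEEP;              /* not a Unicode encoding: pass through */
  if (cp == 0xFEFF)
    return BOM_SKIP;              /* BOM: silently discard */
  if (cp == 0xFFFE)
    return BOM_ERROR;             /* byte-swapped BOM: wrong byte order */
  return BOM_KEEP;
}
```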