Re: UTF-16 and (ice-9 rdelim)

From: Mike Gran
Subject: Re: UTF-16 and (ice-9 rdelim)
Date: Mon, 18 Jan 2010 13:29:19 -0800 (PST)

> From: Neil Jerram
> Hi Mike,

> > But, if you just want to get rid of a BOM, you can cut it down to 
> > a rule.  If the first code point that a port reads is U+FEFF and if the
> > encoding has the string "utf" in it, ignore it.  If the first code point
> > is U+FFFE and the encoding has "utf" in it, flag an error.
> Agreed.
> Out of interest, does that mean that iconv will auto-detect the
> endianness if the encoding does not explicitly say "le" or "be"?

The Unicode FAQ says that "the unmarked form (UTF-16, UTF-32)
uses big-endian byte serialization by default, but may include a byte order
mark at the beginning to indicate the actual byte serialization used."  So,
I guess the strictly correct thing to do for UTF-16 would be to

* check for a BOM
* if it exists
  * if it is U+FFFE, change the port encoding to UTF-16-LE
  * if it is U+FEFF, leave the port encoding as UTF-16
  * discard the BOM
* else, leave the port encoding as UTF-16

and similarly for UTF-32.
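
For illustration only, here is a rough sketch of those steps in Python
rather than Guile.  The names resolve_utf16 and decode_utf16 are made up
for this example and are not part of Guile or iconv.  It assumes the port
was opened as plain "UTF-16", that the first bytes of the stream are
available for inspection, and that unmarked UTF-16 is read as big-endian,
so the byte pair FF FE is what shows up as U+FFFE.

```python
import codecs

# Hypothetical mapping from the port encoding names used in the list above
# to Python codec names.  Unmarked "UTF-16" is treated as big-endian here,
# following the Unicode FAQ quote (an assumption of this sketch).
_CODECS = {"UTF-16": "utf-16-be", "UTF-16-LE": "utf-16-le"}


def resolve_utf16(head):
    """Return (port_encoding, bom_length) for a stream declared as UTF-16.

    `head` holds the first bytes read from the stream.
    """
    if head[:2] == codecs.BOM_UTF16_LE:
        # Bytes FF FE: read big-endian this is U+FFFE, so the data is
        # really little-endian.  Switch the encoding and drop the BOM.
        return "UTF-16-LE", 2
    if head[:2] == codecs.BOM_UTF16_BE:
        # Bytes FE FF: U+FEFF, big-endian as expected.  Drop the BOM.
        return "UTF-16", 2
    # No BOM: leave the encoding alone and consume nothing.
    return "UTF-16", 0


def decode_utf16(data):
    """Decode a whole byte string using the rule above."""
    encoding, skip = resolve_utf16(data)
    return data[skip:].decode(_CODECS[encoding])
```

With this, decode_utf16(b"\xff\xfeh\x00i\x00") and
decode_utf16(b"\xfe\xff\x00h\x00i") give the same string.  A UTF-32
version would need its own four-byte check, since the UTF-32
little-endian BOM (FF FE 00 00) also starts with FF FE.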

- Mike
