Re: 23.0.60; [nxml] BOM and utf-8

From: tomas
Subject: Re: 23.0.60; [nxml] BOM and utf-8
Date: Thu, 22 May 2008 06:17:45 +0200
User-agent: Mutt/1.5.15+20070412 (2007-04-11)

On Thu, May 22, 2008 at 12:37:11AM +0200, Patrick Drechsler wrote:
> Patrick Drechsler <address@hidden> writes:

This would rather be a question for w3.org, but...

> > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#charencoding ]
> > | Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY
> > | begin with the Byte Order Mark [...]
> > |        [...]  XML processors MUST be able to use this character to
> > | differentiate between UTF-8 and UTF-16 encoded documents.
> > `----

...and how are the XML processors supposed to achieve that? Is there a
second variant of BOM, indicating UTF-8?
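
For what it's worth, U+FEFF is a single character, but it comes out as
different bytes in each encoding form: EF BB BF in UTF-8, FE FF in
big-endian UTF-16 and FF FE in little-endian UTF-16. A processor could
presumably look at those first bytes. Below is a minimal sketch of that
idea in Python, not how nxml or any real XML processor does it; the
name sniff_bom is hypothetical, and UTF-32 is ignored for simplicity.

```python
import codecs

def sniff_bom(data: bytes):
    """Guess the encoding from a leading BOM.

    Hypothetical helper for illustration only.
    Returns (encoding_name, bom_length), or (None, 0) if no BOM is found.
    """
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be", 2
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le", 2
    # No BOM: the caller has to fall back to the encoding declaration
    # (or to external information).
    return None, 0
```

With no BOM at all the sketch returns (None, 0), so the absence of a BOM
by itself decides nothing here; the caller still has to consult the
encoding declaration.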

> > and
> >
> > ,----[ http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing-with-ext-info ]
> > | If an XML entity is in a file, the Byte-Order Mark and encoding
> > | declaration are used (if present) to determine the character encoding.
> > `----

...or is it rather the absence of a BOM that indicates UTF-8?

Am I completely whacko, or are they?

Sorry. I am confused.

Ah, and BTW: interpreting the BOM as whitespace is not that far off --
as stated in <http://unicode.org/faq/utf_bom.html#38>:

 | Q: What should I do with U+FEFF in the middle of a file?
 | A: In the absence of a protocol supporting its use as a BOM and when not
 | at the beginning of a text stream, U+FEFF should normally not occur. For
 | backwards compatibility it should be treated as ZERO WIDTH NON-BREAKING
 | SPACE (ZWNBSP), and is then part of the content of the file or string.

But that applies "in the middle of a file", not at the beginning, as in
our case.
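
To make the distinction concrete, here is a small Python sketch that
drops U+FEFF only when it is the very first character and leaves any
later occurrence alone as content (ZWNBSP). The helper name
strip_leading_bom is hypothetical; the "utf-8-sig" codec is Python's
own and, as far as I can tell, behaves the same way when decoding.

```python
BOM = "\ufeff"  # U+FEFF: BOM at the start, ZWNBSP anywhere else

def strip_leading_bom(text: str) -> str:
    """Remove one leading U+FEFF; keep any later ones as content.

    Hypothetical helper for illustration only.
    """
    return text[1:] if text.startswith(BOM) else text

# The standard "utf-8-sig" codec also skips only a leading BOM on decode.
raw = b"\xef\xbb\xbf<a>\xef\xbb\xbf</a>"
decoded = raw.decode("utf-8-sig")
```

Here decoded still contains the second U+FEFF inside the element, which
is the "part of the content" case from the FAQ.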

I'd appreciate any insights.

-- tomás