Re: byte-order marks

From: Mark H Weaver
Subject: Re: byte-order marks
Date: Tue, 29 Jan 2013 14:09:16 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.2 (gnu/linux)

I wrote:
> Having slept on this, I think I agree that 'open-input-file' should
> auto-consume BOMs.

On the other hand, there's a nasty complication.  Of course
(open-input-file FILENAME) is just (open-file FILENAME "r"), so the
auto-consuming logic should be in 'open-file'.

So what should (open-file FILENAME "r+") do?  The problem is that we
don't know if the user will read or write first.  If they write first,
then they may reasonably assume that what they write will be put at the
very beginning of the file, no?
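
To make the "r+" dilemma concrete, here is a minimal sketch in Python rather than Guile. It is not how 'open-file' works; the helpers `write_first` and `demo` are hypothetical names invented for this illustration. The sketch opens a UTF-8 file that starts with a BOM for update and writes before reading, once at offset 0 and once after skipping the BOM.

```python
import codecs
import os
import tempfile


def write_first(path, payload, skip_bom):
    """Open PATH for update and write PAYLOAD before any read.

    If skip_bom is true, a leading UTF-8 BOM is consumed first, so the
    write lands after it; otherwise the write starts at offset 0.
    """
    with open(path, "r+b") as f:
        start = 0
        if skip_bom and f.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8:
            start = len(codecs.BOM_UTF8)
        f.seek(start)
        f.write(payload)


def demo(skip_bom):
    """Create a BOM-prefixed file, write b"HE" first, return the bytes."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(codecs.BOM_UTF8 + b"hello")
        write_first(path, b"HE", skip_bom)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


print(demo(skip_bom=False))  # the write overwrites part of the BOM
print(demo(skip_bom=True))   # the BOM survives, but the write is not at offset 0
```

Neither outcome is obviously right: writing at offset 0 damages the BOM, while skipping it means the data does not land at the very beginning of the file.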

Also, Unicode 6.2 section 2.6 table 2-4 says that BOMs are only allowed
for the encoding schemes UTF-8, UTF-16, and UTF-32.  They are *not*
allowed for UTF-16BE, UTF-16LE, UTF-32BE, or UTF-32LE.
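
The rule can be sketched as a small function, written here in Python as an illustration only and not as Guile's implementation. The name `consume_bom` and its return convention are assumptions made for this example. It strips a BOM only for the schemes UTF-8, UTF-16, and UTF-32, leaves the bytes untouched for the explicit-byte-order schemes, and falls back to big-endian for unmarked UTF-16 and UTF-32, which is the default the Unicode Standard gives for those schemes.

```python
import codecs

# BOM byte sequences, keyed by the encoding schemes that permit a BOM.
# Each entry maps a BOM to the byte-order-specific codec it selects.
_BOMS = {
    "utf-8": [(codecs.BOM_UTF8, "utf-8")],
    "utf-16": [(codecs.BOM_UTF16_BE, "utf-16-be"),
               (codecs.BOM_UTF16_LE, "utf-16-le")],
    "utf-32": [(codecs.BOM_UTF32_BE, "utf-32-be"),
               (codecs.BOM_UTF32_LE, "utf-32-le")],
}

# Codec to use when a BOM-permitting scheme has no BOM (big-endian default).
_DEFAULT = {"utf-8": "utf-8", "utf-16": "utf-16-be", "utf-32": "utf-32-be"}


def consume_bom(data, scheme):
    """Return (remaining_bytes, resolved_codec) for DATA in SCHEME."""
    key = scheme.lower()
    if key not in _BOMS:
        # Explicit byte order (e.g. UTF-16BE): an initial U+FEFF is text.
        return data, key
    for bom, resolved in _BOMS[key]:
        if data.startswith(bom):
            return data[len(bom):], resolved
    return data, _DEFAULT[key]


data = b"\xfe\xff\x00A"
print(consume_bom(data, "UTF-16"))    # BOM removed, big-endian selected
print(consume_bom(data, "UTF-16BE"))  # bytes returned unchanged
```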

Unicode 6.2 section 16.8 goes into more detail:

   For compatibility with versions of the Unicode Standard prior to
   Version 3.2, the code point U+FEFF has the word-joining semantics of
   zero width no-break space when it is not used as a BOM.  [...]

   Where the byte order is explicitly specified, such as in UTF-16BE or
   UTF-16LE, then all U+FEFF characters -- even at the very beginning of
   the text -- are to be interpreted as zero width no-break spaces.
   Similarly, where Unicode text has known byte order, initial U+FEFF
   characters are not required, but for backward compatibility are to be
   interpreted as zero width no-break spaces.  [...]

   Systems that use the byte order mark must recognize when an initial
   U+FEFF signals the byte order. In those cases, it is not part of the
   textual content and should be removed before processing, because
   otherwise it may be mistaken for a legitimate zero width no-break
   space.  To represent an initial U+FEFF zero width no-break space in a
   UTF-16 file, use U+FEFF twice in a row. The first one is a byte order
   mark; the second one is the initial zero width no-break space.  [...]
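
As an illustration of the doubled U+FEFF case, Python's built-in codecs happen to follow the same distinction: the "utf-16" codec consumes one leading BOM, while "utf-16-be" treats every U+FEFF as text. This only shows the behavior of Python's codecs and says nothing about what Guile does or should do.

```python
# Big-endian bytes for: U+FEFF, U+FEFF, "A"
raw = b"\xfe\xff\xfe\xff\x00A"

# Scheme UTF-16: the first U+FEFF is the byte order mark and is removed;
# the second survives as an initial zero width no-break space.
as_utf16 = raw.decode("utf-16")

# Scheme UTF-16BE: byte order is explicit, so both U+FEFF are content.
as_utf16be = raw.decode("utf-16-be")

print([hex(ord(c)) for c in as_utf16])
print([hex(ord(c)) for c in as_utf16be])
```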

This will require some more research and thought.

