Re: byte-order marks

From: Mark H Weaver
Subject: Re: byte-order marks
Date: Tue, 29 Jan 2013 12:09:44 -0500
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/24.2 (gnu/linux)


address@hidden (Ludovic Courtès) writes:
>> For textual files, it doesn’t seem unreasonable for ‘open-input-file’ to
>> consume the BOM, IMO.  It’s not much different from the ‘eol-style’
>> transcoders.

Andy Wingo <address@hidden> writes:
> I could go either way.  I would prefer for open-input-file to consume a
> BOM on textual files.

Having slept on this, I think I agree that 'open-input-file' should
auto-consume BOMs.  As you say, textual transcoders are already somewhat
lossy anyway.  If the user wants to preserve such details, they should
use binary I/O.

However, 'open-input-file' should not auto-detect the encoding by
default, and should only consume BOMs that match the specified encoding.
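
To make that rule concrete, here is a minimal sketch in Python rather than
Guile.  The helper name `open_text_stream` and the `BOMS` table are
hypothetical and are not part of any Guile API; the sketch assumes a seekable
binary stream and an encoding chosen explicitly by the caller.  It does no
auto-detection: a leading BOM is skipped only when it is the BOM of the
encoding that was asked for.

```python
import codecs
import io

# Hypothetical table: the BOM byte sequence for each encoding a caller may specify.
# Keys are the normalized codec names returned by codecs.lookup().
BOMS = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}


def open_text_stream(raw, encoding):
    """Wrap a seekable binary stream as text.

    A leading BOM is consumed only if it matches `encoding`; the encoding
    itself is never guessed from the stream contents.
    """
    bom = BOMS.get(codecs.lookup(encoding).name)
    if bom is not None:
        start = raw.tell()
        if raw.read(len(bom)) != bom:
            raw.seek(start)  # no matching BOM: leave the stream untouched
    return io.TextIOWrapper(raw, encoding=encoding)
```

With this sketch, a UTF-8 BOM at the start of a stream opened as UTF-8 is
dropped, while the same three bytes in a stream opened as Latin-1 are passed
through as ordinary characters.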

'scm_i_scan_for_file_encoding' should look for (but not consume) BOMs as
a last resort, but only if no coding declaration is found.
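
The lookup order described above could be sketched roughly as follows, again
in Python rather than in Guile's C code.  The function name
`scan_for_encoding`, the regular expression, and the BOM table are
assumptions made for illustration; in particular the regex is only an
approximation of an Emacs-style `coding:` declaration and is not how
'scm_i_scan_for_file_encoding' actually parses.  The function takes the first
bytes of the file as a value, so nothing is consumed from any port.

```python
import codecs
import re

# Rough approximation of a "coding: NAME" declaration (an assumption,
# not Guile's real scanner).
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")

# Longer BOMs come first so that UTF-32-LE (FF FE 00 00) is not
# mistaken for UTF-16-LE (FF FE).
BOM_TABLE = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def scan_for_encoding(head):
    """Return an encoding name for the leading bytes `head`, or None.

    A coding declaration always wins; a BOM is consulted only as a
    last resort when no declaration is found.
    """
    match = CODING_RE.search(head)
    if match:
        return match.group(1).decode("ascii")
    for bom, name in BOM_TABLE:
        if head.startswith(bom):
            return name
    return None
```

Because the declaration is checked first, a file that starts with a UTF-8 BOM
but declares `coding: iso-8859-1` is reported as ISO-8859-1 in this sketch.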

> But I have another patch that fixes the (sxml simple) problem, so I'm
> also OK with punting on this issue for now.

IMO, BOMs should probably also be consumed by (sxml simple), but again
only if the BOM is already in the previously specified encoding.  This
is to handle the case where the XML is read from a non-file stream whose
contents originally come from a file containing a BOM, e.g. from a web
server that losslessly copies a static file to the socket.

> [Ludo and Mark and I scribas]:
>>>> * 'open-input-file' could perhaps auto-consume a BOM at the beginning of
>>>>   the stream, but *only* if the BOM is already in the encoding specified
>>>>   by the user (possibly via an explicit call to 'file-encoding').
>>> The problem is that we have no way of knowing what file encoding the
>>> user specifies.  The encoding could come from the environment, or from
>>> some fluid that some other piece of code binds.  We are really missing
>>> an encoding argument to open-file.
>> Well, ‘%default-port-encoding’ is really an argument to ‘open-file’,
>> though admittedly not a convenient one.
> Dunno :)  In the end this reduces to saying "the user always specifies a
> port encoding".

A common case, hopefully soon to be nearly ubiquitous, is a modern OS
that uses a UTF-8 locale by default, where virtually all textual data
on the system is encoded in UTF-8.  I'd like this to be robust, and
not broken by files that contain strings that look like coding
declarations.

>> However, there’s no way to open a file in binary mode when using
>> ‘open-input-file’, ‘call-with-input-file’, etc.
> We can add keyword or optional arguments of course.  (Not suggesting
> that we do so at this time though.)

This has been on my TODO list for a while, and I agree that it would be
a good thing.

What do you think?
