Re: plists in UTF8

From: Richard Frith-Macdonald
Subject: Re: plists in UTF8
Date: Wed, 14 Jun 2006 13:31:21 +0100

On 14 Jun 2006, at 13:12, David Ayers wrote:

The issue is whether a UTF-8 plist without a BOM is a valid plist (i.e.
whether it should be considered non-portable).

Well, if it has no BOM then how do you know it's UTF-8? For an XML plist you can theoretically use the initial header to determine character encoding (we don't have support for that and it's not in the OpenStep/MacOS-X spec/documentation that we should), but other than that the only standard we have is to use the encoding for the locale we are working in ... which is non-portable by definition.
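To make the options above concrete, here is a minimal sketch of how a reader might decide what encoding a plist declares. It is illustrative only and is not GNUstep's code; the function name `sniff_encoding` and the regular expression are assumptions made for this example. It looks for a BOM first, then for an encoding in an XML declaration (the approach described above as theoretically possible but unsupported), and otherwise reports that nothing in the file says what the encoding is.

```python
import codecs
import re

# Check the 4-byte UTF-32 signatures before the 2-byte UTF-16 ones,
# because the UTF-32 LE BOM begins with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Hypothetical pattern for an ASCII-compatible XML declaration.
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)


def sniff_encoding(data):
    """Return (encoding, source), or (None, None) if nothing declares it."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, "bom"
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower(), "xml-declaration"
    # No BOM and no declaration: only the locale default is left,
    # which is exactly the non-portable case.
    return None, None
```

For an old-style plist containing non-ASCII bytes and no BOM, this returns `(None, None)`, so the caller has to fall back to the locale's encoding.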

I've often read that BOMs in UTF-8 files cause issues (e.g.:
http://en.wikipedia.org/wiki/Byte_Order_Mark).  It becomes a problem
when multiple text files are concatenated, and someone (I think it was
you) told me that BOMs within files have been deprecated. (I wonder if
cat(1) or its underlying facilities would be patched to handle this).

I think a BOM within (ie not at the start of) a file is actually illegal. It's the use of U+FEFF as a zero-width no-break space (the same character that acts as a BOM at the start of a file) which is deprecated.

I guess you just can't really use 'cat' to join UTF-8 (or UTF-16) files ... depends whether you consider 'cat' to be a binary data utility or a text utility ... probably some people would argue it works correctly if it just concatenates the data streams. Historically, we are used to having the same tools work with binary data and with text, but in a world with different locales and different text coding schemes that's no longer the case. I don't believe that BOMs cause special problems ... they only cause problems if you join text files improperly ... which is really no worse (perhaps better because it's more easily detected) than if you concatenated files containing text in different encodings.
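The concatenation problem, and why it is easy to detect, can be sketched as follows. This is an assumed illustration rather than anything from GNUstep or cat(1); the helper names `with_bom`, `stray_bom_offsets` and `join_text` are made up for the example. Joining the raw bytes leaves the second file's BOM in the middle of the result, while joining as text drops each leading BOM and writes a single one.

```python
import codecs

BOM_CHAR = "\ufeff"  # U+FEFF, the character a UTF-8 BOM decodes to


def with_bom(text):
    """Encode text as UTF-8 with a leading BOM."""
    return codecs.BOM_UTF8 + text.encode("utf-8")


def stray_bom_offsets(data):
    """Character offsets of U+FEFF anywhere other than the very start."""
    text = data.decode("utf-8")  # plain 'utf-8' keeps the BOM as U+FEFF
    return [i for i, ch in enumerate(text) if ch == BOM_CHAR and i > 0]


def join_text(parts):
    """Join UTF-8 files as text: strip each leading BOM, emit just one."""
    body = "".join(part.decode("utf-8-sig") for part in parts)
    return codecs.BOM_UTF8 + body.encode("utf-8")


first = with_bom("{ a = 1; }\n")
second = with_bom("{ b = 2; }\n")

naive = first + second          # what a byte-level 'cat' produces
print(stray_bom_offsets(naive))                       # one stray U+FEFF
print(stray_bom_offsets(join_text([first, second])))  # none
```

A stray U+FEFF is a specific, searchable character, which is what makes this mistake easier to spot than a silent mix of two different encodings.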

I think that one could argue that a plain UTF-8 file should be
considered valid/portable by plparse... But for that to be of any value
would also mean that UTF-8 files would be parsed correctly in non-UTF-8
locales, which I suppose is the reason that UTF-8 without BOM is
currently considered non-portable.

Well yes ... if there is no means of telling that a file is UTF-8 ... then for practical purposes it isn't UTF-8 ... it's just a bunch of bytes with no known meaning. You can guess what encoding it is, but that guess is going to vary depending on the locale you are in. Guessing may be reasonable for an editor (debatable), but is inappropriate for a checker.
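A small sketch of why guessing depends on the locale: the same two bytes are valid in several encodings and mean different things in each. The function names and the verdict strings are hypothetical and do not reflect plparse's actual behaviour; treating pure ASCII as portable is also an assumption made for the example.

```python
def interpretations(data, candidates=("utf-8", "latin-1", "cp1252")):
    """Decode the same bytes under several encodings (None if invalid)."""
    results = {}
    for encoding in candidates:
        try:
            results[encoding] = data.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


def checker_verdict(data):
    """A strict checker reports what the file declares and never guesses."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8 (declared by BOM)"
    if all(byte < 0x80 for byte in data):
        # Assumption: ASCII-only content reads the same in any
        # ASCII-compatible locale, so it is treated as portable here.
        return "ascii"
    return "unknown encoding: non-portable"


raw = b"\xc3\xa9"  # 'e acute' in UTF-8, two characters in Latin-1
for encoding, text in interpretations(raw).items():
    print(encoding, repr(text))
print(checker_verdict(raw))
```

An editor might pick one of these readings and let the user correct it; a checker that did the same would pass or fail the same file depending on where it was run.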
