Re: [Gnu-arch-users] How does arch/tla handle encodings?

From: Jan Hudec
Subject: Re: [Gnu-arch-users] How does arch/tla handle encodings?
Date: Sat, 28 Aug 2004 17:22:43 +0200
User-agent: Mutt/1.5.6+20040818i

On Sat, Aug 28, 2004 at 13:46:40 +0300, Marcus Sundman wrote:
> On Saturday 28 August 2004 12:53, Jan Hudec wrote:
> > On Fri, Aug 27, 2004 at 21:38:06 +0300, Marcus Sundman wrote:
> > > On Friday 27 August 2004 21:23, Andrew Suffield wrote:
> > > > On Fri, Aug 27, 2004 at 08:20:00PM +0300, Marcus Sundman wrote:
> > > > > On Friday 27 August 2004 19:52, Andrew Suffield wrote:
> > > > > > On Fri, Aug 27, 2004 at 06:50:23PM +0200, Vaclav Haisman wrote:
> > > > > > > File's encoding is imho metadata as much as permissions are.
> > > > > >
> > > > > > It's not. Encoding is data.
> > > > >
> > > > > Oh, get a clue. And a dictionary. The encoding info is data about
> > > > > the data that is the content of the file. "Data about data" is
> > > > > called "metadata". "Encoding" is an attribute of the file, just as
> > > > > "filename" and "permissions" are.
> > > >
> > > > And I repeat: encoding is data.
> > >
> > > Yes, but it's also metadata. You said it isn't, but it is. Don't
> > > pretend to be more stupid than you are.
> >
> > It is **NOT** metadata in the sense of filename, permissions, timestamp,
> > ie. file attributes. It is metadata in the general sense "data about
> > data".
> >
> > So while *calling* it metadata is ok, *treating* it as file attributes
> > is not. The encoding is needed to understand the file, so it better be
> > deduced from its contents. The attributes do not bind that tightly and
> > they can be lost at any moment. Especially since applications don't know
> > how to handle them.
> Are you seriously suggesting that metadata is not actually metadata if it is 

No. I am actually suggesting that it is a *DIFFERENT KIND* of metadata
than file name, permissions and timestamp, and thus should be handled
differently: if at all possible, by the file format itself.

> mandatory? Only optional metadata is actually metadata? Both a file's name 
> and its encoding are properties of the file. The former can be changed 
> without modifying the contents of the file, the latter can't necessarily. 
> This is irrelevant. Both are equally metadata.

Yes. They are equally metadata. That does not mean they are best
treated the same. There may be many ways of treating metadata and
different ways are appropriate for different metadata.

> You just don't make sense. Is the "description" attribute metadata? Let's 
> say you have a picture that is displaying a particular shade of red, and 
> has the attribute "description: the color of my car". You use this picture

Then the comment field (most graphics formats provide room for one) is
the most appropriate place for that -- and that is within the actual file.

> to find the correct shade when shopping for car paint. If you lose the
> description attribute the picture is meaningless. The description is an
> essential part of the picture and can't be deduced from it. Does this make
> the attribute not metadata? Or how is this different from the encoding of a
> text file? (And please don't say something stupid like "it's different 
> because the color of characters are irrelevant".)

It makes the attribute metadata. But it is metadata of the contents, as
opposed to metadata of the filesystem object.

While the metadata of the filesystem object is best stored in the
inode, perhaps as extended attributes, the content metadata is much
better stored in the file itself, provided the file format has room
for it. If it does not, extended attributes are surely better than
nothing. But they are not good.

> Also, the encoding can *not* be deduced from the file's contents. I have 
> already told why this is. E.g. if a file is in ISO-8859-2 there is no way 
> that the editor could know that it's not ISO-8859-1 or ISO-8859-4 or 
> ISO-8859-5 or ISO-8859-8 or ISO-8859-9 or ISO-8859-10 or ISO-8859-13 or 
> ISO-8859-14 or ISO-8859-15 or some other of the 30+ encodings for which the 
> given byte sequence is valid.

It definitely CAN -- if its format supports it. If you say that a file
starting with a comment containing:
# encoding: iso-8859-15
is in said encoding (e.g. Python sources have this rule), then the
encoding is deduced from the file contents. There is just no standard
for this. But there is no standard for extended attributes either.
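
As a rough illustration of deducing the encoding from the contents, here
is a minimal Python sketch that looks for such a comment in the first two
lines of a file. The function name `declared_encoding`, the "ascii"
fallback and the exact regular expression are assumptions made for this
example, loosely modelled on the Python source-file convention; they are
not a standard of any kind.

```python
import re

# Assumed pattern: a comment containing "coding:" or "coding=" followed
# by an encoding name. "# encoding: iso-8859-15" matches it as well.
_DECL = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(data: bytes, default: str = "ascii") -> str:
    """Return the encoding named in a leading comment, or `default`."""
    # Only the first two lines are inspected, roughly as Python does
    # for its own source files.
    for line in data.splitlines()[:2]:
        match = _DECL.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default


# Byte 0xA4 is the euro sign in ISO-8859-15 but the generic currency
# sign in ISO-8859-1; the declaration is what tells them apart.
sample = b"# encoding: iso-8859-15\nprice = '10 \xa4'\n"
enc = declared_encoding(sample)
text = sample.decode(enc)
```

The declaration itself has to be readable before the encoding is known,
which works here only because the comment is plain ASCII and the
ISO-8859 family agrees with ASCII on those bytes.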

> > After all, that's what the byte-order-mark is for.  In most editors, the 
> > sequence 0xfe 0xff indicates utf-16be, 0xff 0xfe indicates utf-16le and
> > 0xef 0xbb 0xbf indicates utf-8 encoding.
> No, the BOM is for specifying endianness of the encoding. (All unicode 
> formats support a BOM, it's just that it's not needed for single byte based 
> ones, such as UTF-8. That said, I fully support using BOMs also in UTF-8 
> files to more often detect badly behaving programs.) If you don't know 
> which encoding (or group of encodings) a file is in then you can't possibly 
> know how to interpret the first bytes of the file. There is no way of 
> knowing if a file beginning with the bytes 0xFE and 0xFF is a big-endian 
> UTF-16 file or an ISO-8859-1 file starting with "thorn yuml" or something 
> completely different in some other encoding.

No, you really don't know that. You don't even know that it is a text file.
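
To make the ambiguity concrete, here is a small Python sketch of the BOM
sniffing an editor might do, limited to the three signatures quoted
above. The helper name `guess_from_bom` is made up for this example, and
its result is only a guess: the same leading bytes also decode without
error as ISO-8859-1.

```python
import codecs

# The three signatures mentioned in the thread; UTF-8's three-byte
# signature is listed first, though none of these is a prefix of another.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # 0xEF 0xBB 0xBF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # 0xFE 0xFF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # 0xFF 0xFE
]


def guess_from_bom(data: bytes):
    """Return the encoding a leading BOM suggests, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


data = b"\xfe\xff\x00A"
guess = guess_from_bom(data)

# Both readings are valid; nothing in the bytes says which was intended.
as_utf16 = data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
as_latin1 = data.decode("iso-8859-1")
```

Read as big-endian UTF-16 the bytes are the single letter "A" after the
BOM; read as ISO-8859-1 they are thorn, y with diaeresis, a NUL and "A".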

Actually, arch could support external attributes on contents, because the
contents are tagged with the id tag. Perhaps when the file-as-directory
interface is finalized. (There is currently a flamewar on
linux-filesystem about reiser4 concerning, among other things, this
extended attributes topic.)

                                                 Jan 'Bulb' Hudec 
