[Gnu-arch-users] Re: How does arch/tla handle encodings?

From: Stefan Monnier
Subject: [Gnu-arch-users] Re: How does arch/tla handle encodings?
Date: 29 Aug 2004 20:53:54 -0400
User-agent: Gnus/5.09 (Gnus v5.9.0) Emacs/21.3.50

>> Of course, in theory it's all a really serious problem.
>> In practice, it's a very minor problem which bites rarely and when it does
>> it's usually obvious and easy to fix.

> No, it's not a minor problem.  No, it's most certainly not obvious.
> And it can be quite tedious to fix.  Have you only worked with source
> code with English comments?  Or perhaps only with teams where all members
> are using very similar setups?  Well, many haven't.

That's definitely not the general consensus around here, so if you want to
convince us, you'll need more than to say "no, you're wrong".

>> this problem is so much larger than Arch that it just feels wrong for tla
>> to try and fix it.
> I don't believe this! That is the worst attitude anyone could have.

Maybe you don't like this attitude, but I happen to think it's a very
wise one: fix the problem where it can be fixed once and for all.

But in any case, I don't think anyone knows what a real solution can
look like.  Some people think utf-8 is it.

> The problem is only larger than Arch in that Arch isn't the only badly 
> behaving program.

You think that Arch behaves badly because it "throws away info", but the
fact is that it doesn't have this info in the first place.  Arch really only
deals with files composed of a sequence of bytes.  It does have some added
heuristics used for the special case of "text files" which are composed of
sequences of lines themselves composed of sequences of bytes.
Nowhere does the notion of a character or an encoding appear.
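
To make the byte-oriented view concrete, here is a minimal Python sketch.
It is not Arch's code: the sample bytes and the helper name `byte_lines`
are assumptions made up for illustration.  It shows that splitting into
lines needs no encoding, while the characters you get depend entirely on
which encoding you assume.

```python
# Illustrative only: the sample bytes and helper name are assumptions.
raw = b"caf\xe9 one\ncaf\xe9 two\n"


def byte_lines(data):
    """Split on line terminators without decoding anything."""
    return data.splitlines(keepends=True)


lines = byte_lines(raw)

# Line-level operations round-trip without any notion of a character.
assert b"".join(lines) == raw

# The same bytes give different characters under different assumptions.
as_latin1 = raw.decode("iso-8859-1")
as_koi8u = raw.decode("koi8-u")

# Under utf-8 these bytes are not valid text at all.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
```

Nothing in `raw` itself says which of these readings is the intended one.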

Of course, Arch could assume that all the files it handles are encoded in
the "system's standard encoding", basically the encoding specified in the
default locale.  But that's not a reliable assumption.

So you first have to come up with a way to tell Arch which files use which
encoding (or no encoding for non-text files).  I.e. you're back to
square one.  So we first need to come up with a standard way to bundle this
meta-data with the data, so that apps like Arch, Emacs, etc... can correctly
preserve it.

And note that the meta-data might be pretty complex: how is Arch supposed to
handle a file which is partly encoded in utf-8, partly in iso-2022, partly
in utf-16, partly in koi8-u?

My take on it is that the best way is to encode this meta-data directly in
the data.  I.e. have the data be self-describing.  This way, you don't need
to change everything that manipulates the data but just the end points.

> Besides, a basic fix should be done anyway, namely adding support for 
> arbitrary file metadata.

Adding meta-data to Arch might be a good idea, but if it's an Arch-specific
standard, then it can only really be used for Arch-specific meta-data.
I.e. not for encoding.

> Anyway, it shouldn't even be necessary for me to tell you this. If you'd 
> just sit down and think for a few seconds you'd no doubt come to the 
> conclusion that it's a Really Bad Thing(tm) to throw away the encoding 
> metadata of the data.

It shouldn't be that hard for you to see that this info is not really there
to start with.  You're just using a guess (just like Emacs does).
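
As a rough illustration of what such a guess amounts to, here is a
minimal sketch.  It is not how Emacs actually detects encodings: the
function name `guess_encoding` and the candidate list are assumptions
chosen for the example.

```python
def guess_encoding(data, candidates=("utf-8", "iso-8859-1")):
    """Return the first candidate that decodes without error, else None."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


# Bytes that were really written as koi8-u ...
cyrillic = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("koi8-u")

# ... still get a confident-looking answer, because iso-8859-1 accepts
# every possible byte sequence.
wrong_guess = guess_encoding(cyrillic)
```

The guess succeeds, but the result says nothing about what the author of
the file intended.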

