Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?

From: Marcus Sundman
Subject: Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?
Date: Mon, 30 Aug 2004 05:14:01 +0300
User-agent: KMail/1.7

On Monday 30 August 2004 03:53, Stefan Monnier wrote:
> >> Of course, in theory it's all a really serious problem.
> >> In practice, it's a very minor problem which bites rarely and when it
> >> does it's usually obvious and easy to fix.
> >
> > No, it's not a minor problem.  No, it's most certainly not obvious.
> > And it can be quite tedious to fix.  Have you only worked with source
> > code with English comments?  Or perhaps only with teams where all
> > members are using very similar setups?  Well, many haven't.
> That's definitely not the general consensus around here, so if you want
> to convince us, you'll need more than to say "no, you're wrong".

I have already told you more than that, and you know it. Implying that I've 
only said "no, you're wrong" is dishonest and insidious.

> >> this problem is so much larger than Arch that it just feels wrong for
> >> tla to try and fix it.
> >
> > I don't believe this! That is the worst attitude anyone could have.
> Maybe you don't like this attitude, but I happen to think it's a very
> wise one: fix the problem where it can be fixed once and for all.

That was not the attitude I was referring to. It'd have been apparent if 
you'd included the next line when quoting me.

Anyway, if there were one single place where this could be fixed once
and for all, it'd be great. If you know of such a thing, please present your
ideas. I don't believe it can be fixed in one place. I think it needs to be
fixed all over the place, including in RC-/CM-systems.

> But in any case, I don't think anyone knows what a real solution can
> look like.  Some people think utf-8 is it.

So instead of doing what we can now we should just wait for Someone Else(tm) 
to come up with The Ultimate Solution(tm)? Wow, you are just full of good 
ideas, aren't you?

> > The problem is only larger than Arch in that Arch isn't the only badly
> > behaving program.
> You think that Arch behaves badly because it "throws away info", but the
> fact is that it doesn't have this info in the first place.

Hence I've been saying that there should be a way to tell arch this info. 
Hello? (The wheel is spinning but the hamster is dead?)

> [Arch] does have some added heuristics used for the special case of "text
> files" which are composed of sequences of lines themselves composed of
> sequences of bytes.

Yuck! I dislike both such heuristics (they are too unreliable) and that 
artificial distinction of "text files" and "binary files".

> Of course, Arch could assume that all the files it handles are encoded in
> the "system's standard encoding", basically the encoding specified in the
> default locale.  But that's not a reliable assumption.

I agree. It's completely unrealistic, and hence needs to be accompanied by
other options, as I've already explained.

> So you first have to come up with a way to tell Arch which files use
> which encoding (or no encoding for non-text files).  I.e. you're back to
> square one.

This I have done. I.e. I'm not back anywhere.

> So we first need to come up with a standard way to bundle 
> this meta-data with the data, so that apps like Arch, Emacs, etc... can
> correctly preserve it.

I think your obsession with this wonderful, non-existent "ultimate solution"
is blinding you to what can be done now.

> And note that the meta-data might be pretty complex: how is Arch supposed
> to handle a file which is partly encoded in utf-8, partly in iso-2022,
> partly in utf-16, partly in koi8-u?

In my proposal that would depend on the case. If it were an XML file you'd
use the top-level encoding (sub-elements may define their own encodings).
In some other case you might want to use multipart messages (see RFC 2046). 
Or you might want to use something else.
In any case such files must obviously have the "Auto-Filter" attribute set 
to "false".
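
As a rough illustration of the XML case, here is a minimal Python sketch that
reads the top-level encoding from the XML declaration. The function name
`declared_xml_encoding` is made up for this example, and the sketch assumes
the file starts with an ASCII-compatible declaration (so it would not handle
UTF-16 input); it is not anything tla does today.

```python
import re
from typing import Optional

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding pseudo-attribute, e.g. encoding="ISO-8859-1".
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_xml_encoding(data: bytes) -> Optional[str]:
    """Return the encoding named in the XML declaration, or None if absent."""
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else None
```

If no encoding is declared, the function returns None and the caller would
have to fall back on whatever other metadata is available.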

> My take on it is that the best way is to encode this meta-data directly
> in the data.  I.e. have the data be self-describing.  This way, you don't
> need to change everything that manipulates the data but just the end
> points.

Unfortunately this can't always be done.

> > Besides, a basic fix should be done anyway, namely adding support for
> > arbitrary file metadata.
> Adding meta-data to Arch might be a good idea, but if it's an
> Arch-specific standard, then it can only really be used for Arch-specific
> meta-data. I.e. not for encoding.

Why do you think it can't be used for the encoding info even if it's 

Sometime in the future when there is a standard way to do it then arch can 
do it that way, but until such a thing exists it /by definition/ has to be 
a non-standard solution.

> > Anyway, it shouldn't even be necessary for me to tell you this. If
> > you'd just sit down and think for a few seconds you'd no doubt come to
> > the conclusion that it's a Really Bad Thing(tm) to throw away the
> > encoding metadata of the data.
> It shouldn't be that hard for you to see that this info is not really
> there to start with.  You're just using a guess (just like Emacs does).

Huh? Of course it's there. Obviously whichever tool does the encoding knows 
what encoding it's using. I can tell my editor to use UTF-8 for all files 
in a particular module and then tell arch that the default encoding for 
that module is UTF-8 (so unless I tell it otherwise all new files in that 
module will be in UTF-8). There is no guessing involved.
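
To make the lookup concrete, here is a minimal Python sketch of a per-module
default with per-file overrides. All names (`MODULE_DEFAULTS`,
`FILE_OVERRIDES`, `encoding_for`) and the example paths are hypothetical, and
keeping the metadata in plain dicts is only an assumption for illustration;
tla has no such feature.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical metadata: what the user has explicitly told the system.
MODULE_DEFAULTS = {"src/ui": "utf-8"}
FILE_OVERRIDES = {"src/ui/legacy.txt": "iso-8859-1"}


def encoding_for(path: str) -> Optional[str]:
    """Return the declared encoding for path, or None if none was declared."""
    # An explicit per-file declaration always wins.
    if path in FILE_OVERRIDES:
        return FILE_OVERRIDES[path]
    # Otherwise walk up the directory tree to the nearest module default.
    for parent in PurePosixPath(path).parents:
        default = MODULE_DEFAULTS.get(str(parent))
        if default is not None:
            return default
    # Nothing declared: report that rather than guessing.
    return None
```

With this data, a new file such as `src/ui/new.c` resolves to the module
default, `src/ui/legacy.txt` resolves to its own override, and a file outside
any declared module resolves to None.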

How come I have to repeat everything I say over and over again? Don't you
read what I write? Or am I just really, really bad at explaining things? Or 
is my English really, really bad? Why is it that I'm not getting through to 
you? Or am I, but you just like to screw around with me?

- Marcus Sundman
