Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?

From: Marcus Sundman
Subject: Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?
Date: Sat, 28 Aug 2004 01:56:20 +0300
User-agent: KMail/1.7

> >> > An editor cannot possibly know which encoding a file has.
> >>
> >> Looks like you began with an empirically false statement. Mine does.
> >
> > You are kidding, right?
> Go read the definition of "empirically".

I know the definition of the term quite well. However, it doesn't matter 
whether or not some editor happened to guess the right encoding. It still 
doesn't _know_ that it's the correct one. It can't know which encoding the 
file has unless someone tells the editor.
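
To make the guessing problem concrete, here is a minimal Python sketch. It is 
only an illustration, not how any particular editor works; the function name 
candidate_decodings and the list of encodings tried are my own assumptions. 
It shows that one and the same byte sequence decodes without error under 
several 8-bit encodings, so "it decoded fine" proves nothing.

```python
# Bytes as a Latin-1 editor would have written them (sample text is made up).
data = "på svenska".encode("iso-8859-1")

def candidate_decodings(raw, encodings=("utf-8", "iso-8859-1", "iso-8859-15", "koi8-r")):
    """Return {encoding: text} for every listed encoding that decodes raw without error."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            # This encoding can be ruled out; the others remain equally plausible.
            pass
    return results

if __name__ == "__main__":
    for name, text in candidate_decodings(data).items():
        print(f"{name}: {text!r}")
```

Only UTF-8 gets ruled out here. The remaining encodings all accept the bytes, 
and at least one of them yields different text, so the program can only guess 
which reading the author meant.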

Now, you may argue that this isn't a big problem, since such guesses are 
correct more often than not. Or that incorrect guesses aren't really that 
important. However, if this can be fixed easily, then why on earth would 
you not? I just can't believe the mentality of you people.

> Of course, in theory it's all a really serious problem.
> In practice, it's a very minor problem which bites rarely and when it does
> it's usually obvious and easy to fix.

No, it's not a minor problem. No, it's most certainly not obvious. And it 
can be quite tedious to fix. Have you only worked with source code with 
English comments? Or perhaps only with teams where all members are using 
very similar setups? Well, many haven't.

If you haven't had these kinds of problems then good for you. I have. A lot. 
Last problem was today, when I couldn't do a cvs update because someone had 
committed a file in some 8-bits/char encoding into the repository which 
otherwise contains only UTF-8 encoded files. Luckily the byte sequence 
wasn't a valid UTF-8 one, and luckily my cvs client actually checked this 
so it noticed the error (SmartCVS rules!). Thus I was able to fix it before 
it went into production code where it would have taken *considerably* more 
effort to fix. Usually errors like this aren't noticed. Still, fixing it 
wasn't very easy. First I had to tell my cvs client to pretend that all 
files were in some 8-bits/char encoding, so that I could get my hands on 
the file. Then I had to find _all_ those garbled characters and find out 
what they were supposed to be (God bless IM-systems, although I wish my IM 
client's jabber-plugin wouldn't display "å" chars as "Í"... 
another annoying encoding problem).
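
A check of the kind that saved me can be sketched in a few lines of Python. 
I don't know how SmartCVS actually implements it; the function name 
invalid_utf8_offset is hypothetical, and this is just a strict decode 
attempt, shown under that assumption.

```python
def invalid_utf8_offset(raw: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None if raw decodes cleanly."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start
    return None

if __name__ == "__main__":
    print(invalid_utf8_offset("på".encode("utf-8")))       # well-formed UTF-8
    print(invalid_utf8_offset("på".encode("iso-8859-1")))  # a lone 0xE5 byte
```

Note that this only catches the lucky cases. A file in an 8-bit encoding 
whose bytes happen to form valid UTF-8 sequences passes the check without 
complaint, which is exactly why the encoding has to be recorded rather than 
inferred.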

> this problem is so much larger than Arch that it just feels wrong for tla
> to try and fix it.

I don't believe this! That is the worst attitude anyone could have. "Why 
should /we/ do anything about it? Let someone else fix it."
The problem is only larger than Arch in that Arch isn't the only badly 
behaving program. However, for this problem to go away completely it needs 
to be fixed in _all_ systems, including Arch. When a piece of text is sent 
around as bytes _no_ link in the chain may throw away the encoding 
metadata. (It's not like some global pollution problem where it'd be OK if 
the majority fixed their systems, and then some minority could pollute all 
they want since they are so few anyway. No, this needs to be fixed 
everywhere. It's a typical "weakest link" scenario.)

Besides, a basic fix should be done anyway, namely adding support for 
arbitrary file metadata.

Anyway, it shouldn't even be necessary for me to tell you this. If you'd 
just sit down and think for a few seconds you'd no doubt come to the 
conclusion that it's a Really Bad Thing(tm) to throw away the encoding 
metadata of the data.

Of course complete, 100% automatic solutions will only be possible when 
everyone has file systems that store the encoding info, and all tools 
actually use that info. However, this doesn't mean that we can't make 
something that would work today.

- Marcus Sundman
