[Gnu-arch-users] How does arch/tla handle encodings?

From: Marcus Sundman
Subject: [Gnu-arch-users] How does arch/tla handle encodings?
Date: Fri, 27 Aug 2004 18:25:10 +0300
User-agent: KMail/1.7

Members of quite a few programming teams, especially OSS ones, use different 
encodings. Some people aren't even aware of this, even though it 
causes quite a lot of trouble every now and then. For example, although 
Windows-1252 and ISO-8859-1 are quite similar, they are not the same: a 
long dash in Windows-1252 is a control character in ISO-8859-1. And 
when the Windows guys use command-line utilities they are suddenly using 
ibm850, which is completely different. (You can of course "chcp 1252", but 
that leads to a plethora of other problems.) Then we have many people 
working in different languages who want to use UTF-8 or a similar 
multi-byte encoding, which isn't well, if at all, supported on some 
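
To make the Windows-1252 / ISO-8859-1 / ibm850 point concrete, here is a 
small Python sketch (Python is used only for illustration; it is not 
anything tla provides). It decodes the same single byte, 0x97, under the 
three encodings, using the codec names Python happens to give them.

```python
import unicodedata

raw = b"\x97"  # one byte on disk, three possible readings

as_cp1252 = raw.decode("cp1252")   # Windows-1252
as_latin1 = raw.decode("latin-1")  # ISO-8859-1
as_cp850 = raw.decode("cp850")     # the "ibm850" console code page

for label, ch in [("cp1252", as_cp1252),
                  ("iso-8859-1", as_latin1),
                  ("ibm850", as_cp850)]:
    # Control characters have no Unicode name, hence the default.
    name = unicodedata.name(ch, "<no name: control character>")
    print(f"{label:11} U+{ord(ch):04X}  {unicodedata.category(ch)}  {name}")
```

The byte is a long dash under Windows-1252, a C1 control character under 
ISO-8859-1, and an accented letter under ibm850.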

The vast majority of programs assume input text files to be in the local 
system's default encoding. Therefore the files on disk should preferably 
use whatever happens to be the local system's default encoding. (In some 
files the encoding is part of the file's semantics, though, so such files 
should be left as they are.)

Since we can't get people to agree on one single encoding we obviously have 
to transcode files. For this to be possible we need to always know which 
encoding a particular text file is written in. After all, a text file is 
basically just a binary blob combined with an encoding metadata attribute. 
If we lose the encoding info then the file is no longer text, but "raw 
data". Thus arch needs to keep track of this very important piece of 
metadata. (This goes for files other than text files, too.)
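
The "binary blob plus encoding attribute" view can be sketched in a few 
lines of Python. This is only an illustration under my own assumptions: 
the TextFile class and its field names are hypothetical and are not part 
of arch/tla.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextFile:
    """Hypothetical model: raw bytes plus the encoding they are written in."""
    data: bytes
    encoding: str  # the metadata attribute that must not get lost

    def transcode(self, target: str) -> "TextFile":
        # Strict decoding/encoding: fail loudly instead of silently
        # corrupting characters the target encoding cannot represent.
        text = self.data.decode(self.encoding)
        return TextFile(text.encode(target), target)


# A file as stored by a Windows-1252 user ...
stored = TextFile("na\u00efve \u2014 caf\u00e9".encode("cp1252"), "cp1252")
# ... checked out on a system whose default encoding is UTF-8.
local = stored.transcode("utf-8")
```

Without the encoding field, stored.data is just bytes and nobody can 
tell which transcoding would be correct. Note also that 
stored.transcode("latin-1") raises UnicodeEncodeError here, because 
ISO-8859-1 has no long dash, so a real tool would need a policy for 
characters the target encoding cannot represent.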

We also have some other textual metadata, such as file names and paths, 
commit comments etc. Some of these *must* be transcoded on some systems.

All this leaves us with two options. Either everyone has transcoding 
wrappers around their arch client, or the arch client does the transcoding. 
(Obviously you can't implement transcoders for all current and future 
encodings/formats so there would have to be some kind of plug-in system for 
this. In tla one would probably use xl for such plugins.)
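
As a rough idea of what such a plug-in mechanism could look like, here is 
a Python sketch of a decoder registry. Everything in it (register, 
to_text, the "base64-utf8" format name) is made up for illustration; it 
says nothing about how xl or tla would actually do this.

```python
import base64
from typing import Callable, Dict

Decoder = Callable[[bytes], str]

# Hypothetical plug-in table: format/encoding name -> decoder function.
_PLUGINS: Dict[str, Decoder] = {}


def register(name: str) -> Callable[[Decoder], Decoder]:
    """Decorator that installs a decoder plug-in under the given name."""
    def wrap(func: Decoder) -> Decoder:
        _PLUGINS[name.lower()] = func
        return func
    return wrap


def to_text(data: bytes, encoding: str) -> str:
    """Decode with a plug-in if one is registered, else a built-in codec."""
    plugin = _PLUGINS.get(encoding.lower())
    if plugin is not None:
        return plugin(data)
    # bytes.decode raises LookupError for encoding names it does not know.
    return data.decode(encoding)


@register("base64-utf8")  # made-up format: UTF-8 text wrapped in base64
def _decode_base64_utf8(data: bytes) -> str:
    return base64.b64decode(data).decode("utf-8")
```

The client core only ever calls to_text(); support for a new encoding or 
format is added by registering another function, without touching the 
core.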

So, my questions are these:
1) Does arch/tla keep track of the type/encoding of each file?
2) How does arch/tla handle file names and paths that are incompatible with 
some system?
3) Diff/merge/annotate et al. have to understand the encoding of files for 
them to be able to present something sensible to the user. Do they and are 

- Marcus Sundman
