
Re: [Gnu-arch-users] Encoding handling proposal

From: John Meinel
Subject: Re: [Gnu-arch-users] Encoding handling proposal
Date: Sun, 29 Aug 2004 13:13:09 -0500
User-agent: Mozilla Thunderbird 0.7 (Windows/20040616)

Marcus Sundman wrote:
D) There should be a filter/plugin architecture to enable a transcoding of files on input and output based on their content-types and user settings and user-provided parameters.

E) Utilities such as "diff", "merge" and "annotate" (aka "blame") should be provided by plugins mapped to content-types.

You definitely have some interesting proposals here. One thing to watch out for, though: once we stop having a single type of diff (say, an xdelta diff for binary files, another type for XML files, etc.), how do we make sure (or at least help ensure) that everyone has all of these programs?

Maybe that is something that happens outside of tla, but one of the nice things about tla is that it uses diff, patch, and tar, which are reasonably simple programs that everyone is likely to have.

If I *don't* have the xmldiff/xmlpatch program, then it is likely that I won't be able to check out a project that used them, since I doubt the format of the .patch file will be the same as that of diff/patch. Also, what about versions? Is xmldiff 1.0 compatible with xmlpatch 2.0? (I checked the file in a year ago, and now I'm trying to get it back.)

Will there be "blessed" diff/transcode programs? Will it only be the ones that are bundled inside of tla?

I'm not sure about your statement that files are typically stored in the "local" encoding. The editors I use (gvim, scintilla) allow me to specify the encoding. (Admittedly it's mostly latin-1, or utf-8, or utf-16). So in that situation, when I write out a file, if I try to check it into arch, then I have to worry about telling arch *not* to use the local encoding.

I know one of your reasons for wanting encoding to be included is so you can keep the "official" repository in the official encoding. One way to do that is to put a person in the loop: people are allowed to work on any repository they want, but only a few people commit to the "official" one, and they are all knowledgeable about watching out for file encoding issues.

F) Commit comments and other string attributes should use UTF-8.

G) Filenames and paths should use UTF-8 in the repository, and be transcoded to the proper encoding when a client accesses the local file system.

This I do agree with. But I seem to recall that Tom's position is that people will probably want the files in the local encoding, so that

        cat <patch-log>

is readable on that system.

I remember a big discussion about this in the past, but I don't think it was thoroughly resolved.

I think Tom designed hackerlab such that you deal with characters, and never know how many bytes/codepoints/etc. are used underneath.


D) Since editors and other programmers' tools tend to use whatever the local system encoding happens to be and a project might include people with different systems there needs to be some transcoding of most text files. The contents of files whose "Auto-Filter" attribute is set to "true" will be stored UTF-8 encoded with U+2028 newlines in the repository and transcoded from/to the local encoding and local newlines on input/output. The contents of files whose "Auto-Filter" attribute is set to "false" will not be transcoded on input/output. Often the proper local encoding and line breaks can be detected automatically, but the user should be able to override the auto-detection in his settings and/or by a parameter to the cm client.
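To make the "Auto-Filter" rule above concrete, here is a minimal sketch in Python. It is not tla code: the function names (`to_repository`, `from_repository`, `filter_on_commit`) and their parameters are invented for illustration, and it assumes the local encoding and newline convention are already known (whether auto-detected or overridden by the user).

```python
# Hypothetical sketch of the proposed "Auto-Filter" transcoding.
# None of these names exist in tla; they are assumptions for illustration.

LINE_SEPARATOR = "\u2028"  # the repository newline named in the proposal


def to_repository(raw: bytes, local_encoding: str, local_newline: str = "\n") -> bytes:
    """Transcode local bytes to UTF-8 with U+2028 line breaks."""
    text = raw.decode(local_encoding)
    text = text.replace(local_newline, LINE_SEPARATOR)
    return text.encode("utf-8")


def from_repository(stored: bytes, local_encoding: str, local_newline: str = "\n") -> bytes:
    """Transcode repository bytes back to the local encoding and newlines."""
    text = stored.decode("utf-8")
    text = text.replace(LINE_SEPARATOR, local_newline)
    return text.encode(local_encoding)


def filter_on_commit(raw: bytes, auto_filter: bool,
                     local_encoding: str, local_newline: str = "\n") -> bytes:
    """Files with Auto-Filter set to false are stored byte-for-byte."""
    if not auto_filter:
        return raw
    return to_repository(raw, local_encoding, local_newline)
```

Note that a file which already contains U+2028 would not survive the round trip unchanged, and `from_repository` raises `UnicodeEncodeError` when the stored text has characters the local encoding cannot represent. Both cases would need a policy decision that the proposal does not spell out.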

This is where I feel "use the local system encoding" may not be perfectly true. But it is possible that "Auto-Filter" will handle this.

E) E.g. if two files with the content-type "application/vnd.sun.xml.writer" are diffed the system should use a diff plugin that knows how to interpret Writer documents. If no such plugin is found it defaults to the standard diff which regards the files as byte blobs.
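The lookup-with-fallback rule in point E can be sketched as follows. This is only an illustration in Python, not part of tla: `DIFF_PLUGINS`, `register_diff`, `diff_for`, `blob_diff`, and `text_diff` are all hypothetical names, and the "standard diff" is reduced to a trivial byte-blob comparison.

```python
# Hypothetical sketch of content-type based diff dispatch.
# All names here are assumptions for illustration, not tla APIs.
import difflib
from typing import Callable, Dict

DiffFn = Callable[[bytes, bytes], str]

DIFF_PLUGINS: Dict[str, DiffFn] = {}


def blob_diff(old: bytes, new: bytes) -> str:
    """Fallback: treat both files as opaque byte blobs."""
    if old == new:
        return "blobs identical"
    return f"blobs differ ({len(old)} -> {len(new)} bytes)"


def text_diff(old: bytes, new: bytes) -> str:
    """Example plugin: a unified diff of UTF-8 text."""
    lines = difflib.unified_diff(
        old.decode("utf-8").splitlines(),
        new.decode("utf-8").splitlines(),
        lineterm="",
    )
    return "\n".join(lines)


def register_diff(content_type: str, plugin: DiffFn) -> None:
    """Map a content-type to the plugin that knows how to diff it."""
    DIFF_PLUGINS[content_type] = plugin


def diff_for(content_type: str) -> DiffFn:
    """Return the plugin for this content-type, or the blob fallback."""
    return DIFF_PLUGINS.get(content_type, blob_diff)
```

The fallback only covers *producing* a diff locally. It does not help a client that needs to *apply* a stored patch written in a plugin-specific format it cannot read.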

This is where the problem with plugins exists. On *my* machine, I have the application/vnd.sun.xml.writer diff program. You don't have it on *your* machine. You can no longer read my archive.

If you just treat everything as blobs, at least you can get version 1 and version 10, and create your own diff, and manually patch so that you get nice context-sensitive diffs.

My personal feeling is that we could do this in two ways. One is to have tla generate both the standard diff and the special one. Clients that understand the special format use it; everyone else can rely on the standard one. (This was proposed for xdelta use with pure binary files.)

The other way is to have tla start to incorporate more diff/patch programs. Keep in mind that adding a new diff/patch effectively changes the archive format, which is not something to do lightly.

I favor the former, though it doesn't allow for compact archive size.
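As a rough illustration of the first option (store both diffs and let each client pick), here is a small Python sketch. The `StoredChange` record and the `choose_patch` function are hypothetical, and the format names are only labels; nothing here reflects the actual tla archive layout.

```python
# Hypothetical sketch: an archive entry carrying both patch formats.
# The record layout and names are assumptions, not the tla archive format.
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class StoredChange:
    standard_patch: bytes                  # always present, plain diff/patch
    special_patch: Optional[bytes] = None  # e.g. an xdelta or xmldiff patch
    special_format: Optional[str] = None   # label for the special format


def choose_patch(change: StoredChange,
                 supported_formats: Set[str]) -> Tuple[str, bytes]:
    """Prefer the special patch if this client understands its format."""
    if (change.special_patch is not None
            and change.special_format in supported_formats):
        return change.special_format, change.special_patch
    return "standard", change.standard_patch
```

Every client can always read the archive, at the cost of storing each change twice, which is the archive-size trade-off mentioned above.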


Notice that there is no distinction between "text files" and "binary files". The same system that converts between different text encodings might just as well be used to convert between different "raw" audio formats. Just add the appropriate plugin/filter and you're set.

Interesting idea, but I have to wonder if it is what you would really want.

- Marcus Sundman

Overall, I think you raise some good points. It just takes a lot of care to add something that could potentially fragment repositories.


