Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?

From: Jan Hudec
Subject: Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?
Date: Sat, 28 Aug 2004 12:23:24 +0200
User-agent: Mutt/1.5.6+20040818i

On Sat, Aug 28, 2004 at 04:18:54 +0300, Marcus Sundman wrote:
> On Saturday 28 August 2004 03:35, Robin Green wrote:
> > On Sat, Aug 28, 2004 at 01:56:20AM +0300, Marcus Sundman wrote:
> > > However, for this problem to go away completely it needs
> > > to be fixed in _all_ systems, including arch. When a piece of text is
> > > sent around as bytes _no_ link in the chain may throw away the encoding
> > > metadata.
> >
> > If you want that property
> Umm.. what property? That text files remain text files instead of turning 
> into raw byte blobs? Yeah, I really do want that property.

UTF-16 will not work with 99.9999% of standard tools. That's because
UTF-16 is not compatible with how the standard C library handles strings:
UTF-16 text is full of zero bytes, which C treats as string terminators.
It's far easier to forget that UTF-16 was ever invented than to rewrite
all those tools.
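
The mismatch with C strings can be seen in a few lines. This is only a
minimal Python sketch: `c_strlen` is a hypothetical helper that mimics what
a NUL-terminated string function would see, not a real library call.

```python
def c_strlen(buf: bytes) -> int:
    """Hypothetical stand-in for C's strlen: count bytes before the first NUL."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end

text = "hello"
utf8 = text.encode("utf-8")        # b'hello'
utf16 = text.encode("utf-16-le")   # b'h\x00e\x00l\x00l\x00o\x00'

print(c_strlen(utf8))    # the whole string is visible
print(c_strlen(utf16))   # cut off at the first zero byte
```

UTF-8 never produces a zero byte unless the text itself contains U+0000,
so a tool that stops at the first NUL still sees the whole string.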

UTF-8 works in 99.99999% of standard tools right out of the box. Yes,
that does include diff and patch.
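
A byte-oriented, line-based diff stays correct on UTF-8 because every byte
of a multi-byte character has its high bit set, so a newline byte is always
a real newline. The sketch below uses Python's `difflib.diff_bytes` as a
stand-in for diff(1); the sample text and the `old`/`new` labels are made up
for illustration.

```python
import difflib

old = "naïve café\nsecond line\n".encode("utf-8")
new = "naïve café\nsecond line, edited\n".encode("utf-8")

# Split on newline bytes only; nothing is decoded here.
old_lines = old.splitlines(keepends=True)
new_lines = new.splitlines(keepends=True)

# diff_bytes runs the ordinary unified diff over raw byte lines.
patch = b"".join(
    difflib.diff_bytes(difflib.unified_diff, old_lines, new_lines, b"old", b"new")
)

# The patch is itself valid UTF-8, so it can be read or applied as usual.
print(patch.decode("utf-8"))
```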

Note: in both cases, compilers and interpreters of about anything are
part of the "tools".

> > isn't the most sensible solution to put the encoding metadata _inside_ the
> > file, like xml does?
> Purists generally hate this solution of xml. Theoretically speaking it's 
> wrong because you would have to interpret at least the beginning of the 
> file to get information on how to interpret the file, thus creating a 
> circular dependency paradox. Practically speaking it's wrong since it 
> severely limits what encodings can be used, since the file would have to 
> contain a byte sequence equivalent to a string like '<?xml version="1.0" 
> encoding="utf-8"?>' encoded in ANSI X3.4-1986.

Which is the *RIGHT* thing to do. You need to standardize the encoding at
least a bit, lest you create an utter mess.
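
Requiring an ASCII-compatible prefix is what breaks the circular dependency:
the declaration can be read as plain bytes before the real encoding is known.
A rough Python sketch follows. `sniff_xml_encoding` is a hypothetical helper,
it assumes UTF-8 when no declaration is found, and it ignores the UTF-16 and
BOM cases that a real XML parser also has to handle.

```python
import re

# Match the XML declaration on raw bytes; only ASCII characters are involved.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the declared encoding, or `default` if there is no declaration."""
    match = _DECL.match(data[:200])
    return match.group(1).decode("ascii") if match else default

doc = '<?xml version="1.0" encoding="iso-8859-1"?><p>café</p>'.encode("iso-8859-1")

encoding = sniff_xml_encoding(doc)   # read from the bytes themselves
text = doc.decode(encoding)          # now the whole file can be decoded
print(encoding, text)
```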

> That said, I'm personally not completely against this approach, but I 
> haven't given it much thought. However, only few formats (anything besides 
> sgml?) support this system. E.g., if you want a text file to contain only 
> the string "hello world" then there is no way for you to use this approach.

And there is no other way that is transparent to the transport.

> > Transcoding need not be a goal of a revision control system, since you
> > can just transcode files to and from the working directory with a
> > separate utility.
> I have never said that transcoding has to be done by a CMS/RCS. However, the 
> system has to support this, at least by not throwing away the encoding 
> info.

In any sane setup, the encoding info is part of the data, and thus
never thrown away...

> After giving it a lot of thought (quite a while ago), I concluded that I 
> would personally prefer a general filter plug-in system in the CMS/RCS. 
> This way the logic can be standardized and centralized, moving the burden 
> (and the responsibility) of setting up the filters from each developer to 
> the project leader. This way you also won't have issues with different 
> people using different platforms and/or clients. (Anyhow, this is only my 
> personal opinion, and I wouldn't want to impose it on others.)

That's getting somewhere else entirely... It would be a nice idea, though
it's pretty tricky to get right.

                                                 Jan 'Bulb' Hudec 
