
Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?

From: Marcus Sundman
Subject: Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?
Date: Sat, 28 Aug 2004 14:24:11 +0300
User-agent: KMail/1.7

On Saturday 28 August 2004 13:23, Jan Hudec wrote:
> On Sat, Aug 28, 2004 at 04:18:54 +0300, Marcus Sundman wrote:
> > On Saturday 28 August 2004 03:35, Robin Green wrote:
> > > On Sat, Aug 28, 2004 at 01:56:20AM +0300, Marcus Sundman wrote:
> > > > However, for this problem to go away completely it needs
> > > > to be fixed in _all_ systems, including arch. When a piece of text
> > > > is sent around as bytes _no_ link in the chain may throw away the
> > > > encoding metadata.
> > >
> > > If you want that property
> >
> > Umm.. what property? That text files remain text files instead of
> > turning into raw byte blobs? Yeah, I really do want that property.
> UTF-16 will not work with 99.9999% of standard tools. That's because
> UTF-16 is not compatible with how the standard C library handles strings.
> It's far easier to forget that UTF-16 was ever invented than to rewrite
> all those tools.
> UTF-8 works in 99.99999% of standard tools right out of the box. Yes,
> that does include diff and patch.

That is incorrect. How many Windows tools that parse text can handle UTF-8? 
A few percent? AFAIK UTF-16 has better support on Windows, since MS decided 
early on that Unicode = UTF-16. And there are probably over a hundred text 
editors for Linux. According to you, all of them should support UTF-8 (only 
one in ten million programs should not, you say), but I bet fewer than 5% 
do. Or would you care to name ten command-line editors that do? C'mon, you 
even have to do tricks like 'unicode_start'/'unicode_stop' to get the Linux 
console to support UTF-8.

That said, I agree that UTF-8 is much better than UTF-16 in most cases, and 
that there is much better support for UTF-8 in general, especially among 
programs that just pass along text without parsing it.
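To illustrate the pass-through point: ASCII text is byte-identical in UTF-8, while UTF-16 interleaves NUL bytes that break anything byte-oriented (C-style strings, tools splitting on a raw '\n'). A small Python sketch of my own, not anything from the thread:

```python
# ASCII text encoded as UTF-8 is byte-identical to the ASCII original,
# so byte-oriented tools keep working without changes.
ascii_text = "hello world\n"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# The same text in UTF-16 interleaves NUL bytes, which truncates C-style
# string handling and confuses tools that split on a raw b'\n'.
utf16 = ascii_text.encode("utf-16-le")
assert b"\x00" in utf16   # NUL bytes between every ASCII character
assert utf16.split(b"\n") != ascii_text.encode("utf-8").split(b"\n")
```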

So, if diff supports UTF-8, then surely it must support the only "correct" 
line break in Unicode, namely U+2028 (LINE SEPARATOR). Does it?
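For anyone who wants to check the U+2028 point themselves, here is what the mismatch looks like in Python: Unicode-aware line splitting recognizes U+2028, while a plain '\n' split (roughly what byte-oriented tools like diff do) does not see it at all:

```python
text = "first line\u2028second line"

# Unicode-aware splitting treats U+2028 as a line boundary...
assert text.splitlines() == ["first line", "second line"]

# ...but a plain '\n' split sees a single line, and the UTF-8 encoding
# of U+2028 (0xE2 0x80 0xA8) contains no 0x0A byte for tools to find.
assert text.split("\n") == [text]
assert b"\n" not in text.encode("utf-8")
```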

> > > isn't the most sensible solution to put the encoding metadata
> > > _inside_ the file, like xml does?
> >
> > Purists generally hate this solution of XML. Theoretically speaking
> > it's wrong because you would have to interpret at least the beginning
> > of the file to get information on how to interpret the file, thus
> > creating a circular dependency paradox. Practically speaking it's wrong
> > because it severely limits what encodings can be used: the file would
> > have to contain a byte sequence equivalent to a string like
> > '<?xml version="1.0" encoding="utf-8"?>' encoded in ANSI X3.4-1986.
> Which is the *RIGHT* thing to do. You need to standardize the encoding at
> least a bit, lest you create an utter mess.

Many do not agree that this approach would be the right thing to do. It's 
far from indisputable. The alternative is not necessarily a mess, even 
though the implementations aren't uniform.
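In practice XML resolves the "paradox" by restricting the declaration (or a byte-order mark) to an ASCII-compatible prefix that a reader can sniff before decoding anything else. A rough Python sketch of that bootstrap; `sniff_encoding` and its regex are my own illustration, not any real parser's API:

```python
import codecs
import re

def sniff_encoding(raw: bytes) -> str:
    """Illustrative sketch of XML-style encoding detection:
    check for a BOM first, then look for an ASCII-compatible
    declaration in the first bytes, else fall back to UTF-8."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # The declaration itself must sit in an ASCII-compatible prefix,
    # which is exactly the restriction the "purists" object to.
    m = re.search(rb'encoding="([A-Za-z0-9._-]+)"', raw[:100])
    if m:
        return m.group(1).decode("ascii").lower()
    return "utf-8"  # XML's default with no BOM and no declaration

assert sniff_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?>') == "iso-8859-1"
```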

> > That said, I'm personally not completely against this approach, but I
> > haven't given it much thought. However, only few formats (anything
> > besides sgml?) support this system. E.g., if you want a text file to
> > contain only the string "hello world" then there is no way for you to
> > use this approach.
> And there is no other way transparent for transport.

Not exactly true. You can use any transport mechanism you want, as long as 
you wrap the data in something that supports MIME types or similar. Either 
way you have to get the end applications to agree on some basic rules.
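For illustration, here is how a MIME wrapper keeps the charset attached to the bytes all the way through the transport, using Python's standard email library as one example of such a wrapper:

```python
# Sketch: wrap plain text in a MIME container so the charset label
# travels with the bytes instead of being thrown away in transit.
from email.message import EmailMessage
from email import message_from_bytes, policy

msg = EmailMessage()
msg.set_content("héllo wörld")   # the stdlib labels this as utf-8

wire = msg.as_bytes()            # what actually crosses the transport

# The receiver recovers both the bytes and the charset label.
received = message_from_bytes(wire, policy=policy.default)
assert received.get_content_charset() == "utf-8"
assert received.get_content().rstrip("\n") == "héllo wörld"
```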

> > > Transcoding need not be a goal of a revision control system, since
> > > you can just transcode files to and from the working directory with a
> > > separate utility.
> >
> > I have never said that transcoding has to be done by a CMS/RCS.
> > However, the system has to support this, at least by not throwing away
> > the encoding info.
> For all sane things, the encoding info shall be part of the data. And
> thus not thrown away...

Many argue that it isn't even remotely sane to have such a circular 
dependency paradox and such restrictions. I myself wouldn't call it "sane", 
but I do think it is at least much better than throwing away the encoding 
information entirely.

> > After giving it a lot of thought (quite a while ago), I concluded that
> > I would personally prefer a general filter plug-in system in the
> > CMS/RCS. This way the logic can be standardized and centralized, moving
> > the burden (and the responsibility) of setting up the filters from each
> > developer to the project leader. This way you also won't have issues
> > with different people using different platforms and/or clients.
> > (Anyhow, this is only my personal opinion, and I wouldn't want to
> > impose it on others.)
> That's getting somewhere else entirely... It would be a nice idea, though
> it's pretty tricky to get right.

I don't see why it would be particularly tricky. Many filtering systems have 
been designed over the years, probably quite a few of them by very 
inexperienced developers. And if arch/tla gets an integrated VM, it'd be a 
piece of cake.
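For what it's worth, here is a rough Python sketch of the shape such a filter plug-in system might take. Everything here (`FilterChain`, `register`, `run`) is invented for illustration; the point is only that the project registers the filters once and every client runs files through the same chain:

```python
from typing import Callable, List

Filter = Callable[[bytes], bytes]

class FilterChain:
    """Hypothetical per-project filter registry: the project leader
    configures it once; the RCS applies it on checkin and checkout."""
    def __init__(self) -> None:
        self.checkin: List[Filter] = []   # working copy -> repository
        self.checkout: List[Filter] = []  # repository -> working copy

    def register(self, checkin: Filter, checkout: Filter) -> None:
        self.checkin.append(checkin)
        self.checkout.append(checkout)

    def run(self, filters: List[Filter], data: bytes) -> bytes:
        for f in filters:
            data = f(data)
        return data

# Example filter pair: store everything as UTF-8 in the repository,
# hand it back to this platform's working copy as Latin-1.
chain = FilterChain()
chain.register(
    checkin=lambda b: b.decode("latin-1").encode("utf-8"),
    checkout=lambda b: b.decode("utf-8").encode("latin-1"),
)

stored = chain.run(chain.checkin, "café".encode("latin-1"))
assert stored == "café".encode("utf-8")
assert chain.run(chain.checkout, stored) == "café".encode("latin-1")
```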

- Marcus Sundman
