Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?

From: Jan Hudec
Subject: Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?
Date: Sat, 28 Aug 2004 17:47:11 +0200
User-agent: Mutt/1.5.6+20040818i

On Sat, Aug 28, 2004 at 14:24:11 +0300, Marcus Sundman wrote:
> On Saturday 28 August 2004 13:23, Jan Hudec wrote:
> > On Sat, Aug 28, 2004 at 04:18:54 +0300, Marcus Sundman wrote:
> > > On Saturday 28 August 2004 03:35, Robin Green wrote:
> > > > On Sat, Aug 28, 2004 at 01:56:20AM +0300, Marcus Sundman wrote:
> > > > > However, for this problem to go away completely it needs
> > > > > to be fixed in _all_ systems, including arch. When a piece of text
> > > > > is sent around as bytes _no_ link in the chain may throw away the
> > > > > encoding metadata.
> > > >
> > > > If you want that property
> > >
> > > Umm.. what property? That text files remain text files instead of
> > > turning into raw byte blobs? Yeah, I really do want that property.
> >
> > UTF-16 will not work with 99.9999% of standard tools. That's because
> > UTF-16 is not compatible with how the standard C library handles
> > strings. It's far easier to forget that UTF-16 was ever invented than
> > to rewrite all those tools.
> >
> > UTF-8 works in 99.99999% of standard tools right out of the box. Yes,
> > that does include diff and patch.
> That is incorrect. How many Windows tools that parse text can handle UTF-8? 
> A few %? AFAIK UTF-16 has better support on Windows since MS decided early 
> on that unicode=UTF-16. And there are probably over a hundred text editors 
> for Linux. All of them should support UTF-8, according to you (only one in 
> ten million programs should not, you say), but I bet less than 5% do. Or 
> would you care to name ten command line editors that do? C'mon, you even 
> have to do tricks like 'unicode_start/unicode_stop' to get the Linux 
> console to support UTF-8.

I was exaggerating, I agree.
1) "Tools" means, among other things, all of textutils, shellutils and
   fileutils. These don't care about the difference between UTF-8 and
   other 8-bit ASCII extensions, but they fail with UTF-16, since they
   can't accept NUL bytes.
2) Even if a program should, but can't, properly display Unicode, a
   reader can still make at least some sense of UTF-8 output. That's
   not true of UTF-16.
3) Yes, Windows uses UTF-16 and UCS-4 a lot, since it adopted Unicode
   early and UTF-16 is the original encoding (IIRC UTF-8 came with
   ISO 10646). That only adds to the mess.
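The NUL-byte point in (1) is easy to see directly. This is a hypothetical
demonstration (not part of the original discussion) of why byte-oriented C
tools pass UTF-8 through unchanged but choke on UTF-16:

```python
# ASCII text encoded as UTF-8 is byte-identical to ASCII, so it contains
# no NUL bytes; UTF-16 interleaves a NUL after every ASCII character, and
# C code that treats NUL as a string terminator sees a truncated string.
text = "hello\n"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

print(b"\x00" in utf8)   # False: safe for byte-oriented tools
print(b"\x00" in utf16)  # True: every other byte is NUL
print(utf16[:6])         # b'h\x00e\x00l\x00'
```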

> That said, I agree that UTF-8 is much better than UTF-16 in most cases, and 
> that there is much better support for UTF-8 in general. Especially among 
> programs that just pass along text without parsing it.
> So, if diff supports UTF-8 then it must surely support the only "correct" 
> line break in unicode, namely U+2028. Does it?

No, it does not. But I doubt any programming language does either.
Seems I have overestimated the sanity of unicode people...
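For what it's worth, the split is between byte-oriented and Unicode-aware
line handling. A hypothetical Python illustration (Python's str.splitlines
is one string API that does honour U+2028, while byte-level splitting, which
is what diff and patch effectively do, does not):

```python
# U+2028 is LINE SEPARATOR, Unicode's "correct" line break.
s = "first\u2028second\nthird"

# Unicode-aware splitting recognizes U+2028 as a line boundary...
print(s.splitlines())                  # ['first', 'second', 'third']

# ...but byte-level splitting only recognizes LF/CR, so a tool working
# on raw bytes sees U+2028 as three opaque bytes inside one long line.
print(s.encode("utf-8").splitlines())  # [b'first\xe2\x80\xa8second', b'third']
```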

> > > > Transcoding need not be a goal of a revision control system, since
> > > > you can just transcode files to and from the working directory with a
> > > > separate utility.
> > >
> > > I have never said that transcoding has to be done by a CMS/RCS.
> > > However, the system has to support this, at least by not throwing away
> > > the encoding info.
> >
> > For all sane things, the encoding info shall be part of the data. And
> > thus not thrown away...
> Many argue that it isn't even remotely sane to have such a circular 
> dependency paradox and such restrictions. I myself wouldn't call it "sane", 
> but I do think it at least is much better than throwing away the encoding 
> info.

After all, I think this question is more general than just encodings.
It is really about attaching external metadata to files.

I really think arch should learn to handle file-as-directory hybrids
once that interface is finalized in Linux. Actually, tar should learn
it (which it quickly will once reiser4 gets into mainline Linux) and
tla should start using that tar. I further assume tar will support
translating this format to extended attributes, Sun's "*at" variant
(basically the same, but with a special syscall), Microsoft's forks,
etc. We will need a diff for extended attributes, too.

> > > After giving it a lot of thought (quite a while ago), I concluded that
> > > I would personally prefer a general filter plug-in system in the
> > > CMS/RCS. This way the logic can be standardized and centralized, moving
> > > the burden (and the responsibility) of setting up the filters from each
> > > developer to the project leader. This way you also won't have issues
> > > with different people using different platforms and/or clients.
> > > (Anyhow, this is only my personal opinion, and I wouldn't want to
> > > impose it on others.)
> >
> > Getting quite somewhere else... Would be a nice idea. Though it's pretty
> > tricky to get that right.
> I don't see why it would be particularly tricky. Many filtering systems have 
> been designed over the years, and probably quite a few of them by very 
> inexperienced developers. Especially if arch/tla gets an integrated VM then 
> it'd be a piece of cake.

No, it's not. E.g. CVS got it wrong: you must make sure that
filter-caused differences don't produce conflicts on merge and don't
clutter up diff output. That means running the filters in quite a few
commands, and being careful to choose the right place to run them in
each of those commands.
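The placement issue can be sketched in a few lines. This is a hypothetical
illustration (all names invented here, using line-ending normalization as
the example filter): diff and merge must see the canonical repository form
of a file, not the filtered working-tree form.

```python
# If diff compared raw working-tree bytes, every line would differ by
# line-ending alone, cluttering output and causing spurious conflicts.

def clean(working_bytes: bytes) -> bytes:
    """Normalize working-tree bytes to the canonical repository form."""
    return working_bytes.replace(b"\r\n", b"\n")

def smudge(repo_bytes: bytes) -> bytes:
    """Convert canonical repository bytes back to the working-tree form."""
    return repo_bytes.replace(b"\n", b"\r\n")

def diff_inputs(a_working: bytes, b_working: bytes):
    """Run the clean filter before diffing, so only real changes show up."""
    return clean(a_working), clean(b_working)

a, b = diff_inputs(b"one\r\ntwo\r\n", b"one\ntwo\n")
print(a == b)  # True: the line-ending difference no longer pollutes diff
```

This is essentially the clean/smudge split that git later popularized:
normalize on the way into the repository, localize on the way out, and keep
all content comparison on the canonical side of the boundary.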

                                                 Jan 'Bulb' Hudec 

