From: Tom Lord
Subject: Re: [Gnu-arch-users] Re: How does arch/tla handle encodings?
Date: Sat, 28 Aug 2004 12:20:05 -0700 (PDT)

    > From: Jan Hudec <address@hidden>

    > UTF-16 will not work with 99.9999% of standard tools. That's because
    > UTF-16 is not compatible with how the standard C library handles strings.
    > It's far easier to forget that UTF-16 was ever invented than to rewrite
    > all those tools.
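
(To make the quoted incompatibility concrete: every ASCII character in
UTF-16 carries a zero byte, so any NUL-terminated C string routine gives
up almost immediately.  A minimal illustration, not taken from any real
tool:)

    #include <stdio.h>
    #include <string.h>

    int main (void)
    {
      /* "Hi" in UTF-16LE: each ASCII code point drags a 0x00 byte along.  */
      const char utf16le[] = { 'H', 0x00, 'i', 0x00, 0x00, 0x00 };

      /* strlen () stops at the first zero byte, so the C library sees
         a one-byte string here.  */
      printf ("strlen reports %zu byte(s)\n", strlen (utf16le));
      return 0;
    }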

It depends on the time scale you are thinking about and the goals you
are trying to achieve.

You want fast upward compatibility for those tools?  Sure, UTF-8
everywhere.

You want something which, in the long term, has good space and time
performance for the widest range of users?  Easiest to maintain?
Between the 8-bit tools and UTF-16, it ain't UTF-16 you ought to
forget was ever invented.....
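
(The space half of that claim is easy to check for the scripts UTF-16
was designed with in mind: a common CJK code point such as U+4E2D costs
three bytes in UTF-8 but a single 16-bit code unit in UTF-16, while
plain ASCII goes the other way.  A throwaway check, assuming a UTF-8
string literal:)

    #include <stdio.h>
    #include <string.h>

    int main (void)
    {
      /* U+4E2D, encoded in UTF-8, takes three bytes...  */
      const char *utf8 = "\xe4\xb8\xad";

      /* ...versus a single 16-bit code unit (two bytes) in UTF-16.  */
      printf ("UTF-8: %zu bytes; UTF-16: 2 bytes\n", strlen (utf8));
      return 0;
    }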

I *exaggerate* more than a little bit but not without point:

    > UTF-8 works in 99.99999% of standard tools right out of the box. Yes,
    > that does include diff and patch.

Many tools work, or close-enough-work, if they are reinterpreted as
being about bytes rather than characters.  Many still work "by
coincidence" as character-oriented tools.

So, yes, you can very quickly turn an ASCII program into a crude UTF-8
program, and then perhaps touch it up a bit to make it less crude.
That's a fantastic thing to be doing and to have done, my comments
notwithstanding.

Converting those same programs to a "universally UTF-16"
representation would have been doable as well --- not *that* much
harder than what was done moving to UTF-8.  Would that have been
better?

Alas, from everything I've seen, what Unicode (or really, any
similarly large character set) requires is a better factoring out of
(or layering upon of) "characters" and "strings" from "integer
scalars" and "homogenous arays of integers".  Our low-level language
features that conflate those things aren't helping us right now.
Probably what we want is exactly what we've got (at a *very*
early/crude stage) in hackerlab: a string library that is
encoding-independent, allowing the source and target encodings of
string operations to be chosen in context-specific ways, so as to
optimize for time or space (e.g., by avoiding conversions).
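
(Roughly the shape of that idea --- the names below are an illustrative
sketch only, not hackerlab's actual interface:)

    #include <stddef.h>

    /* A string carries its encoding with it; nothing here forces
       every string through one canonical representation.  */
    enum str_encoding { STR_UTF8, STR_UTF16, STR_LATIN1 };

    struct enc_string
    {
      enum str_encoding enc;   /* how the bytes are to be read     */
      size_t            size;  /* length in bytes, not characters  */
      unsigned char    *data;  /* raw code units in that encoding  */
    };

    /* Operations pick their working encoding from context: compare
       directly when the encodings already match, convert (once)
       only when they do not.  */
    extern int enc_string_equal (const struct enc_string *a,
                                 const struct enc_string *b);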

Is it worth it?  Is it worth a brisk but gentle revolution away from
UTF-8-everywhere in order to lay down a foundation of software that is
good for all humans?  I find it to be a (quite solvable) intellectual
challenge and a task about which, rather than asking "is it worth
doing?", we ought to be remarking "why, that's one of the most intense,
important, and exciting challenges for my generation of hackers!".

-t




