Re: [Gnu-arch-users] Patch Logs vs. character sets
Stephen J. Turnbull
Mon, 31 May 2004 19:53:40 +0900
Gnus/5.1006 (Gnus v5.10.6) XEmacs/21.5 (chayote, linux)
------------------------ THE BOTTOM LINE -------------------------------
Be a hardass about it---that's the only "graceful" way to go. Right
now, nobody can put anything but ASCII in their logs if they conform
to the arch spec, right? Even if not, you have very few legacy logs
to worry about. Just do the extension right the first time; if you
throw away this golden opportunity to strike a blow for truth,
justice, and the internationalist way, I'll never forgive you. ;-)
>>>>> "Tom" == Tom Lord <address@hidden> writes:
Tom> The solution I proposed for patch logs is aimed at handling
Tom> that kind of situation, and many similar situations,
Please don't do this; you're just saving up trouble for the near
future when a big minority of people will use Unicode by default
(because that's what Word for Programmers uses :), and most of the
rest will have software that silently autogroks and DTRTs. The
exceptions will be people using software whose programmers have
decided that only Unicode matters, at least for release 1.0, and which
panics on any input that is not valid UTF-something. That's not
graceful, and now that I've put my prophecy into the g-a-u archive,
"that's not arch's fault" is no longer an excuse. _You have been
warned._
Oh, I'm sure those you have coddled in 2004 will remember that small
favor, and thank you profusely in 2005, or whenever you finally decide
to mandate Unicode. ¡Not!
Tom> Too bad. Welcome to string processing in the 21st century.
Tom> Get used to it.
You mean "last quarter of the 20th century". In _this_ century, sane
people will use Unicode nearly exclusively for the internals of new
I18N software, and mostly for I18N external use, too. By 2010,
"legacy encoding" will mean "of technical interest only to those who
fondly recall open reel 16-track tape as a _convenient_ mass storage
medium". Only people confronted with a few unsuspected legacy-encoded
files will care at all pragmatically. Why are you proposing to create
a whole new unnecessary class of randomly legacy-encoded files?
The "save-the-tatami" programming technique works. Make those text
inputs take off their dirty legacy charsets at the door, and give them
clean Unicode codepoints to wear in the house. Emacs/Mule enforces it
(with its own legacy charset, true, but that's temporary). All the
I18Nized software I know of written in Java, Python, and Ruby uses it.
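
For concreteness, here is a minimal sketch of what "save-the-tatami"
looks like in Python. The function names are hypothetical and are not
part of arch or tla; the only assumption is that the program is told
(or has settled on) the encoding of each input. Bytes are decoded once
at the boundary, everything inside works on Unicode strings, and the
output is always UTF-8.

```python
def read_log_text(raw: bytes, declared_encoding: str = "utf-8") -> str:
    """Decode at the door: bytes in, Unicode str out.

    Raises UnicodeDecodeError if the bytes are not valid in the
    declared encoding, instead of letting bad data into the house.
    """
    return raw.decode(declared_encoding)


def write_log_text(text: str) -> bytes:
    """Encode on the way out: the stored form is always UTF-8."""
    return text.encode("utf-8")


# A legacy-encoded input whose encoding is known up front.
legacy = "Zusammenführung €".encode("iso-8859-15")

text = read_log_text(legacy, "iso-8859-15")  # clean Unicode inside
stored = write_log_text(text)                # UTF-8 on disk
```

Feeding the same legacy bytes to read_log_text() without declaring
their encoding fails immediately rather than storing garbage, which is
the point of checking at the door.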
And yes, arch patchlogs are "internal"; arch spends a lot of effort
providing appropriate APIs to get at them. Being able to use cat(1)
on log files is a _bonus_, not the main point, and if your terminal is
any good, you can use cat(1) to view UTF-8 logs anyway.
The only places where Unicode string processing is painful are in
editor redisplay, DTP, and Buddhist scholarship, none of which is an
issue for arch. (Note that I can imagine Pika Scheme dealing with all
those issues, but they are not issues for arch patchlogs. Don't
confuse the two applications.) Oh, and maybe it's painful in
hackerlab, if that library doesn't wrap iconv(3) yet, but rx already
does Unicode, so what's the BFD?
Tom> So that the on-disk format is one that makes the archive
Tom> owner happy.
I can't believe the guy who writes "vi is broken because its use of
'+' creates problems for file naming conventions" is writing that.
Don't you see that permitting random ASCII-compatible coding causes
exactly the same type of coding collision problem (see (2) below)?
Windows, Mac, Plan 9, and BeOS users will have no problem with UTF-8.
Legacy *nix users (most prominently, GNU/Linux and FreeBSD) will have
iconv(1), recode(1), and/or xterm(1). What is the problem here?
Sure, it'll be a FAQ. So write it up as 'FAQ 4.5.3 Why does "cat
patchlog" spew garbage?', and bind "That's FAQ #4.5.3" to F8.
OTOH (1) why make it hard for people who know how to use iconv by
making them guess WTF legacy coding system(s) the archive logs are in,
when there's no way to guarantee even within-archive consistency? The
policy you propose is pandering to monolingual lusers, not enabling for
(2) How do you plan to deal with the project in Tel Aviv where you've
got a bunch of expat Russians doing KOI8-R in their logs and native
Israelis with ISO 8859-8 in their logs? Once they're in the archive,
there's no way for cat(1) to tell the difference, and there's no way
to validate on the way in, either. People can always fake out the
system with -l, 'cause tla sez "it's all just octets to me, brother."
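
A small Python sketch of the collision, as an illustration only (the
helper name decodes_as is made up for this example). A Hebrew word
encoded in ISO 8859-8 is also a perfectly valid KOI8-R byte string, so
a validator that only sees octets cannot tell which one the author
meant. The same octets are not valid UTF-8, so a UTF-8-only rule can
actually be checked on the way in.

```python
def decodes_as(raw: bytes, encoding: str) -> bool:
    """Return True if raw is a valid byte sequence in the given encoding."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# What one committer writes into a log, as octets on disk.
hebrew_log = "שלום".encode("iso8859_8")

# The octets pass as either legacy encoding...
valid_as_hebrew = decodes_as(hebrew_log, "iso8859_8")
valid_as_koi8 = decodes_as(hebrew_log, "koi8_r")

# ...but read as KOI8-R they come out as unrelated Cyrillic letters.
misread = hebrew_log.decode("koi8_r")

# UTF-8 has enough structure to reject them outright.
valid_as_utf8 = decodes_as(hebrew_log, "utf-8")
```

The reverse direction is less tidy: some KOI8-R bytes are unassigned
in ISO 8859-8, so a strict decoder may reject some Russian text as
Hebrew. That is luck, not validation, and it does nothing for the
strings that happen to be valid in both.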
It's not "a management problem" either; consider Messrs Kamihira,
Dode, and Goikhman who (if I guess correctly) are going to want to use
three different ASCII supersets in logs for their own tla-related
projects. Sure, _they_ all have near-native or better proficiency in
English, so it's not a real problem for arch devo, but I've heard of
several real multilingual teams where that common language was not
available. So now they start pulling changesets from each other. Hm.
BTW, in a multilingual context, guessing personal preference from LANG
is very risky (equally likely that's colleagues' preference), and
personal preference is likely to differ from project rule. No-win.
You don't have to open yourself up to any of that, so don't.
(3) Note that you can change your mind about the restriction to UTF-8
later if people get too pissed off. Open it up to ASCII-compatible
8-bit encodings, and you're stuck supporting them forever---people
will immediately start writing logs in ISO 8859-15 and KOI8-R, and
they will be very vocal if you announce you are thinking about
retracting your mistake. I know---I advocated "save-the-tatami
programming" on python-devel, and the BDFL pronounced "You are right
in theory, but Martin von Löwis is right in practice---Python is
already supporting many millions of lines of code that assume that it
makes sense to operate on blocks of bytes assuming they are encoded in
KOI8-R and ISO 8859-15 and stuff like that; we can't make it invalid,
we can only try to make it more reliable." Lesser of two evils. You
don't have to open yourself up to that, so don't.
And, oh, yeah
(4) I can't believe the guy who writes "vi is broken because its use
of '+' creates problems for file naming conventions" wrote that "All
header data which arch wants to be parsable will be in ASCII, using
Pika escaping and Unicode for non-ASCII character data." Who says
that just because _arch_ doesn't parse log data, somebody else's tool
won't? How is the "概要" header (ie, the Japanese translation of the
Summary header) at all useful to users of cat(1) if it appears in
bloody Pika-escaped Unicode?! Do you really mean it should be OK to
put arbitrary binary crap in the patchlogs, but make all natural
languages except "Yankish" be completely unreadable in the headers
unless you have special software (that doesn't even exist yet)?
_Right now_ everybody in the world can read Unicode; at the worst they
have to install one of several utilities that are all free software.
You don't have to open yourself up to those flames, so don't.
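
To make the readability point concrete, a short Python sketch. Pika's
actual escape syntax is not given in this message, so Python's
"unicode_escape" codec is used here purely as a stand-in for "some
ASCII escaping of Unicode"; the header text is an invented example.

```python
header = "概要: ログの文字コードを修正"

# Stand-in for an ASCII-escaped header (NOT Pika's real syntax).
escaped = header.encode("unicode_escape").decode("ascii")

# The same header stored as plain UTF-8 octets.
utf8_bytes = header.encode("utf-8")

print(escaped)                     # pure ASCII: backslash escapes only
print(utf8_bytes.decode("utf-8"))  # what cat(1) shows on a UTF-8 terminal
```

The escaped form survives any ASCII-only tool, but nobody can read it
without running it back through a decoder; the UTF-8 form is readable
as it stands on any terminal that understands UTF-8.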
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN