From: Justin Patrin
Subject: Re: [Monotone-devel] Re: Why is utf8 type _NOVERIFY, and other vocab stuff.
Date: Fri, 16 Feb 2007 18:32:58 -0800

On 2/15/07, Lapo Luchini <address@hidden> wrote:
Zack Weinberg wrote:
> The //IGNORE and //TRANSLIT features are glibc / GNU libiconv
> specific, but I would have thought that they were available in recent
> Gentoo (they've been around since 2001 give or take).

I thought they would be present on *most* BSD and Linux available today...

Uh. I know nothing about Gentoo, but I would have thought it was in
Portage, but this doesn't seem to be it at all:

> The real problem, though, is that an awful lot of non-GNUish systems
> have iconv implementations that are useless.  I mean _useless_.  They
> implement hardly any conversions at all.  We have to have the "(list
> of names for ASCII) <-> UTF8" shortcut for _correctness_, not just for
> speed; real live systems don't support conversion between their own
> locale's name for ASCII and UTF-8.   *headdesk*

Well, an iconv that can't even convert *to* UTF-8 is no good for us:
we simply can't use it.
An iconv that doesn't know about //IGNORE//TRANSLIT, OTOH, is fine for
the strict sanity conversion, but not for the "best effort"
print-to-the-terminal that I wired into "mtn log" (other places
would need that, too).
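
Whether a given iconv copes with a suffixed target such as
ASCII//TRANSLIT can be checked by simply attempting a conversion, which
is the kind of check a configure script could run. A rough sketch in C,
using only the POSIX iconv_open/iconv/iconv_close calls; the name
iconv_probe is made up for illustration and is not monotone code. Note
that a successful return only shows the conversion did not fail, not
that the transliteration result is a good one.

```c
#include <iconv.h>
#include <string.h>

/* Hypothetical helper: returns 1 if converting `sample` from `fromcode`
   to `tocode` succeeds and consumes all input, 0 otherwise.
   Some older systems declare iconv()'s second argument as
   const char **; this sketch assumes the POSIX char ** form. */
static int iconv_probe(const char *tocode, const char *fromcode,
                       const char *sample)
{
    char inbuf[64], outbuf[64];
    size_t inleft = strlen(sample);
    size_t outleft = sizeof outbuf;
    char *inp = inbuf, *outp = outbuf;
    iconv_t cd;
    size_t rc;

    if (inleft > sizeof inbuf)
        return 0;                     /* sample too long for this sketch */
    memcpy(inbuf, sample, inleft);

    cd = iconv_open(tocode, fromcode);
    if (cd == (iconv_t)-1)
        return 0;                     /* conversion or suffix not known */

    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);

    return rc != (size_t)-1 && inleft == 0;
}
```

For example, iconv_probe("ASCII//TRANSLIT", "UTF-8", "\xC3\x80") asks
whether U+00C0 can be pushed through to ASCII at all; what comes out
is up to the implementation.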

I guess the "solution" could be to add an autoconf test for support of
//IGNORE//TRANSLIT and, when not available, we can easily write a
"quick&dirty" lossy conversion from UTF8 to either Latin1 or ASCII:

/* u is an already-decoded Unicode code point, not a raw UTF-8 byte */
#define UTF8_to_Latin1(u) (((u) >= 256) ? '?' : (char)(u))
#define UTF8_to_ASCII(u)  (((u) >= 128) ? '?' : (char)(u))
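
The macros above operate on a code point, so the UTF-8 bytes have to be
decoded first. A minimal sketch of the whole path in C, assuming
malformed or overlong sequences should also come out as '?'. The names
utf8_decode and utf8_to_lossy are hypothetical, not anything in
monotone, and the limit parameter plays the role of the 256/128
thresholds.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (n bytes available, n > 0).
   Stores the code point in *cp and returns the bytes consumed.
   Malformed input yields U+FFFD; a bad lead byte, a bad continuation
   byte, or a truncated sequence consumes one byte. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t min;

    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; min = 0x80;    *cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; min = 0x800;   *cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; min = 0x10000; *cp = s[0] & 0x07; }
    else                            { *cp = 0xFFFD; return 1; }

    if (len > n) { *cp = 0xFFFD; return 1; }      /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    /* reject overlong forms, surrogates, and out-of-range values */
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        *cp = 0xFFFD;
    return len;
}

/* Lossy conversion: code points below `limit` (256 for Latin-1, 128 for
   ASCII) are copied as single bytes, everything else becomes '?'.
   Output is NUL-terminated if out_cap > 0; returns bytes written. */
static size_t utf8_to_lossy(const char *in, size_t in_len,
                            char *out, size_t out_cap, uint32_t limit)
{
    size_t i = 0, o = 0;

    while (i < in_len && o + 1 < out_cap) {
        uint32_t cp;
        i += utf8_decode((const unsigned char *)in + i, in_len - i, &cp);
        out[o++] = (cp < limit) ? (char)(unsigned char)cp : '?';
    }
    if (out_cap > 0)
        out[o] = '\0';
    return o;
}
```

With limit 256 the two bytes C2 B7 become the single byte B7, and with
limit 128 they become '?'. This does no transliteration at all (no
"À" to "A"), which is the part iconv's own tables would be needed for.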

Or maybe we could get the "transliteration table" right out of iconv...

> It might be possible to bundle GNU libiconv, but I hesitate to
> recommend that because I recall its being another Haible/Drepper build
> system monstrosity like intl.

IMHO we bundle already too much =)

> Many systems have an iconv(1) command line utility that may be helpful
> here.

Uh, right, but writing a "known good UTF-8 string" escaped on the
command line seems a bit trickier to me... no, not really.

% echo "\xC2\xB7" | iconv -f UTF-8 -t CP1252//IGNORE//TRANSLIT
· (that is, the correct and converted U+00B7 MIDDLE DOT)
% echo "\xC2\xB7" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
% echo "\xC3\x80" | iconv -f UTF-8 -t CP1252//IGNORE//TRANSLIT
% echo "\xC3\x80" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT

Derek (or anyone else with Gentoo), what do you get with these?

Running these commands over PuTTY on my Gentoo system (from Windows) gives me:
address@hidden ~ $ echo "\xC2\xB7" | iconv -f UTF-8 -t
address@hidden ~ $ echo "\xC2\xB7" | iconv -f UTF-8 -t
address@hidden ~ $ echo "\xC3\x80" | iconv -f UTF-8 -t
address@hidden ~ $ echo "\xC3\x80" | iconv -f UTF-8 -t

Which I assume means that my shell's echo is passing those strings
through literally instead of turning the \x escapes into UTF-8 bytes.

Justin Patrin
