[Monotone-devel] Re: Why is utf8 type _NOVERIFY, and other vocab stuff.

From: Lapo Luchini
Subject: [Monotone-devel] Re: Why is utf8 type _NOVERIFY, and other vocab stuff.
Date: Thu, 15 Feb 2007 20:25:19 +0100
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv: Gecko/20061207 Thunderbird/ Mnenhy/

Zack Weinberg wrote:
> The //IGNORE and //TRANSLIT features are glibc / GNU libiconv
> specific, but I would have thought that they were available in recent
> Gentoo (they've been around since 2001 give or take).

I thought they would be present on *most* BSD and Linux systems available today...

Uh. I know nothing about Gentoo, and I would have thought it was in
Portage, but this doesn't seem to be it at all.

> The real problem, though, is that an awful lot of non-GNUish systems
> have iconv implementations that are useless.  I mean _useless_.  They
> implement hardly any conversions at all.  We have to have the "(list
> of names for ASCII) <-> UTF8" shortcut for _correctness_, not just for
> speed; real live systems don't support conversion between their own
> locale's name for ASCII and UTF-8.   *headdesk*

Well, an iconv that doesn't even know how to convert *to* UTF-8 is no
good for us: we simply can't use it.
An iconv that doesn't know about //IGNORE//TRANSLIT, OTOH, is good
enough for the strict sanity conversion, but not for the "best effort"
print-to-the-terminal that I wired into "mtn log" (and other places
would need that, too).
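
Telling the two cases apart at run time is cheap. The following is only a minimal sketch in C: `iconv_accepts_target` is a name made up for this illustration, not anything in monotone. Note that a successful `iconv_open` only shows that the target name was accepted. Whether transliteration is really performed would still need an actual conversion to check.

```c
#include <iconv.h>

/* Hypothetical helper (not part of monotone): returns 1 if this iconv
   accepts a conversion from UTF-8 to the given target name, for example
   "ASCII//TRANSLIT", and 0 if iconv_open rejects it. */
static int
iconv_accepts_target(const char *tocode)
{
  iconv_t cd = iconv_open(tocode, "UTF-8");

  if (cd == (iconv_t) -1)
    return 0;               /* target (or suffix) not supported */

  iconv_close(cd);
  return 1;
}
```

A caller could try `iconv_accepts_target("ASCII//TRANSLIT")` first and fall back to a hand-written lossy conversion when it returns 0.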

I guess the "solution" could be to add an autoconf test for support of
//IGNORE//TRANSLIT and, when it is not available, write a
"quick&dirty" lossy conversion from UTF-8 to either Latin1 or ASCII:

/* u is an already-decoded Unicode code point, not a raw UTF-8 byte */
#define UTF8_to_Latin1(u) (((u) >= 256) ? '?' : (char)(u))
#define UTF8_to_ASCII(u)  (((u) >= 128) ? '?' : (char)(u))
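
The two macros above only make sense once the UTF-8 bytes have been decoded into code points. A minimal sketch of the whole fallback in C could look like the following. `utf8_to_lossy` and its `limit` parameter are names made up for this illustration, not anything in monotone, and the decoder is deliberately simple: malformed or overlong sequences just become '?'.

```c
#include <stddef.h>

/* Hypothetical helper (not part of monotone): lossy UTF-8 to single-byte
   conversion.  limit is 128 for ASCII or 256 for Latin1; every code point
   at or above limit, and every malformed sequence, becomes '?'.
   Writes at most outsize-1 bytes plus a terminating NUL and returns the
   number of bytes written, not counting the NUL. */
static size_t
utf8_to_lossy(const char *in, char *out, size_t outsize, unsigned long limit)
{
  const unsigned char *p = (const unsigned char *) in;
  size_t n = 0;

  if (outsize == 0)
    return 0;

  while (*p && n + 1 < outsize)
    {
      unsigned long cp;     /* decoded code point */
      unsigned long min;    /* smallest value allowed for this length */
      int extra;            /* continuation bytes still expected */

      if (*p < 0x80)                { cp = *p;        extra = 0; min = 0; }
      else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; min = 0x80; }
      else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; min = 0x800; }
      else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; min = 0x10000; }
      else                          { cp = limit;     extra = 0; min = 0; }
      p++;

      while (extra > 0)
        {
          if ((*p & 0xC0) != 0x80)
            {
              cp = limit;   /* truncated sequence: force a '?' */
              break;
            }
          cp = (cp << 6) | (*p & 0x3F);
          p++;
          extra--;
        }

      if (cp < min)
        cp = limit;         /* overlong encoding: force a '?' */

      out[n++] = (cp >= limit) ? '?' : (char) cp;
    }

  out[n] = '\0';
  return n;
}
```

Passing 256 as `limit` gives the Latin1 behaviour and 128 the ASCII one, so U+00B7 survives in the first case and turns into '?' in the second.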

Or maybe we could get the "transliteration table" right out of iconv...

> It might be possible to bundle GNU libiconv, but I hesitate to
> recommend that because I recall its being another Haible/Drepper build
> system monstrosity like intl.

IMHO we already bundle too much =)

> Many systems have an iconv(1) command line utility that may be helpful
> here.

Uh, right, but writing a "known good UTF-8 string" escaped on the
command line seems a bit trickier to me... no, not really.

% echo "\xC2\xB7" | iconv -f UTF-8 -t CP1252//IGNORE//TRANSLIT
· (that is, the correctly converted U+00B7 MIDDLE DOT)
% echo "\xC2\xB7" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
% echo "\xC3\x80" | iconv -f UTF-8 -t CP1252//IGNORE//TRANSLIT
% echo "\xC3\x80" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT

Derek (or anyone else with Gentoo), what do you get with these?

