[Monotone-devel] iconv diffs [Was: Why is utf8...]


From: Lapo Luchini
Subject: [Monotone-devel] iconv diffs [Was: Why is utf8...]
Date: Fri, 16 Feb 2007 16:26:57 +0100
User-agent: Thunderbird 1.5.0.9 (X11/20070129)

Lapo Luchini wrote:
> Zack Weinberg wrote:
>> The //IGNORE and //TRANSLIT features are glibc / GNU libiconv
>> specific, but I would have thought that they were available in recent
>> Gentoo (they've been around since 2001 give or take).
> 
>> Many systems have an iconv(1) command line utility that may be helpful
>> here.
> 
> Uh, right, but writing a "known good UTF-8 string" escaped on the
> command line seems a bit trickier to me... no, not really.
> 
> % echo "\xC2\xB7" | iconv -f UTF-8 -t CP1252//IGNORE//TRANSLIT
> · (that is, the correct and converted U+00B7 MIDDLE DOT)
> % echo "\xC2\xB7" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
> .
> % echo "\xC3\x80" | iconv -f UTF-8 -t CP1252//IGNORE//TRANSLIT
> À (that is, correct U+00C0 LATIN CAPITAL LETTER A WITH GRAVE)
> % echo "\xC3\x80" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
> `A
> 
> Derek (or anyone else with Gentoo), what do you get with these?
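
A side note on reproducing the commands above: whether echo expands "\xC2\xB7" depends on the shell, and it always appends a newline. A minimal sketch that sidesteps both, assuming only a POSIX printf and od (the function name a_grave is made up for illustration):

```shell
# Emit U+00C0 as its UTF-8 bytes C3 80 using octal escapes, which POSIX
# printf is required to understand (\x escapes are not guaranteed).
# No trailing newline is written, so a consumer sees exactly two bytes.
a_grave() {
    printf '\303\200'
}

# Dump the bytes in hex to confirm what would be piped into iconv.
a_grave | od -An -tx1
```

The output of a_grave can then be piped into any of the iconv invocations from this thread in place of the echo.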

OK, I managed to reproduce it here at work on a Fedora box; it has a
really braindead iconv:

% echo "\xC3\x80" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
iconv: illegal input sequence at position 3
% echo "\xC3\x80" | iconv -f UTF-8 -t ASCII//IGNORE
iconv: illegal input sequence at position 3
% echo "\xC3\x80" | iconv -f UTF-8 -t ASCII//TRANSLIT
?

So the "solution" on those hosts would be to use only //TRANSLIT, but
that's only a partial solution anyway, as not everything can be
transliterated. E.g. the Japanese katakana "po" (U+30DD):

on FreeBSD, with libiconv 1.9.2:
% echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
% echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//IGNORE
% echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//TRANSLIT
iconv: (stdin): cannot convert

on Fedora, with the iconv implementation bundled inside glibc:
% echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
iconv: illegal input sequence at position 4
% echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//IGNORE
iconv: illegal input sequence at position 4
% echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//TRANSLIT
?

There isn't any form that does something useful on both. =(
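
One possible workaround, purely as a sketch: try several suffix forms in turn and keep the first one iconv accepts. The helper name to_ascii and the order of the suffixes are my own choices, and the approach assumes that iconv's exit status reliably signals failure, which I have not verified on either of the hosts above.

```shell
# Hypothetical helper: convert a UTF-8 string to ASCII, trying each
# target-suffix form until one iconv invocation exits successfully.
# ASSUMPTION: a non-zero exit status means the conversion failed.
to_ascii() {
    input=$1
    for suffix in //TRANSLIT //IGNORE ''; do
        # The status tested by "if" is that of iconv, the last command
        # in the pipeline inside the command substitution.
        if out=$(printf '%s' "$input" | iconv -f UTF-8 -t "ASCII$suffix" 2>/dev/null); then
            printf '%s\n' "$out"
            return 0
        fi
    done
    return 1  # no form worked on this host
}

# Plain ASCII input should pass through unchanged with any of the forms.
to_ascii "plain text"
```

Going by the results above, the katakana case would still end up as "?" on one host and as an empty string on the other, so this only hides the errors; it does not make the output consistent.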

I'll probably take a closer look at the problem this evening.
