
Re: [Monotone-devel] iconv diffs [Was: Why is utf8...]


From: Nathaniel Smith
Subject: Re: [Monotone-devel] iconv diffs [Was: Why is utf8...]
Date: Fri, 16 Feb 2007 14:44:07 -0800
User-agent: Mutt/1.5.13 (2006-08-11)

On Fri, Feb 16, 2007 at 05:23:01PM +0100, Thomas Moschny wrote:
> On Freitag, 16. Februar 2007, Lapo Luchini wrote:
> > on Fedora, with libiconv bundled inside libc:
> > % echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
> > iconv: illegal input sequence at position 4
> > % echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//IGNORE
> > iconv: illegal input sequence at position 4
> > % echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//TRANSLIT
> > ?
> 
> Order of the modifiers seems to matter.
> 
> On Fedora Core 6:
> % echo "\xE3\x83\x9D" | iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE
> ?

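There's something seriously odd going on with //IGNORE as well.
Notice the "position 4" there: the input is only 4 bytes long (three
bytes plus echo's newline), so the error is reported at the very end
of the input.  On FC1, I get: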

fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//IGNORE
ab
iconv: illegal input sequence at position 6

i.e., it seems to actually translate everything correctly, then throw
a bogus error upon reaching end-of-string.

For completeness:
fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//TRANSLIT
?ab
fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE
?ab
fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
ab
iconv: illegal input sequence at position 6

So in all the //foo//bar cases, it actually acts like the second //bar
isn't even there.

fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII
iconv: illegal input sequence at position 0
fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//IGNORE,TRANSLIT
iconv: illegal input sequence at position 0
fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//TRANSLIT,IGNORE
iconv: illegal input sequence at position 0

With comma, the outward behavior is the same as if the //foo isn't
there at _all_... given that the iconv manual actually just documents
that you can use //IGNORE or //TRANSLIT, it's possible that once upon
a time there was no comma parsing at all?  Dunno.  It doesn't give an
error on other unrecognized modifiers, either:

fc1$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//ASDF
iconv: illegal input sequence at position 0

On mostly-current debian sid, the comma stuff and TRANSLIT seem to
work:

sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII        
iconv: illegal input sequence at position 0
sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//TRANSLIT
?ab
sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//TRANSLIT,IGNORE
?ab
sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//IGNORE,TRANSLIT 
?ab

But the weird //IGNORE error is still there:

sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//IGNORE
ab
iconv: illegal input sequence at position 6

Not sure if this is a bug, or just something odd in the iconv command
line tool -- perhaps it is perfectly expected that if you use
//IGNORE, iconv will work correctly and then set errno to something to
say "hey, I totally had errors that I ignored, just so you know".

Again, if you use //foo//bar, then it acts the same as if you had only
passed //foo and left off //bar:

sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//IGNORE//TRANSLIT
ab
iconv: illegal input sequence at position 6
sid$ echo -e "\xE3\x83\x9Dab" | iconv -f UTF-8 -t ASCII//TRANSLIT//IGNORE
?ab

I'm not really sure why it works this way; looking at gconv_open.c in
the glibc sources, AFAICT it should simply fail to understand this
bizarre "TRANSLIT//IGNORE" error-handling specification entirely and
ignore it.  But my skills at reading string-parsing-in-C code are
pretty rusty.

So, ummm... in conclusion.  //IGNORE actually seems like it is working
correctly and usefully, just with an unexpected API.  //TRANSLIT works
pretty okay too.  But mostly we've only tested with GNU iconv -- I
have no idea what's going to happen on, say, OSX or *BSD or Solaris.

One option is just to write our own "//IGNORE"-style iconv wrapper.
iconv's normal API is that it does as much work as it can, then it
tells you where it bombed out.  It's perfectly possible at that point
to skip ahead a byte or more on the input, stick a question mark in
the output string, and then try again from there.  Not the most
efficient thing in the world, but probably a lot easier than trying to
ship iconv conversion tables.
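
To make that concrete, here is a rough sketch of what such a wrapper
could look like in C.  This is not monotone code; the function name
lossy_iconv, its signature, and the choice to skip exactly one byte
per failure are all assumptions made for illustration.  It uses only
the standard iconv_open/iconv/iconv_close calls and does not rely on
any //IGNORE or //TRANSLIT suffix.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert in_len bytes of `in` from encoding `from`
 * to encoding `to`, writing a NUL-terminated result into `out` (capacity
 * out_cap bytes).  Whenever iconv() stops on input it cannot handle, one
 * input byte is skipped and a '?' is written instead.
 * Returns the number of bytes written (excluding the NUL), or (size_t)-1
 * if the converter cannot be opened or the output buffer is too small. */
size_t lossy_iconv(const char *to, const char *from,
                   const char *in, size_t in_len,
                   char *out, size_t out_cap)
{
    iconv_t cd;
    char *src = (char *)in;     /* iconv() wants char **, input is not modified */
    char *dst = out;
    size_t src_left = in_len;
    size_t dst_left;

    if (out_cap == 0)
        return (size_t)-1;
    dst_left = out_cap - 1;     /* keep one byte for the terminating NUL */

    cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    while (src_left > 0) {
        /* Do as much work as possible; src/dst advance past what succeeded. */
        if (iconv(cd, &src, &src_left, &dst, &dst_left) != (size_t)-1)
            continue;           /* all remaining input was converted */

        if (errno == E2BIG || dst_left == 0) {
            iconv_close(cd);    /* no room left in the output buffer */
            return (size_t)-1;
        }

        /* EILSEQ or EINVAL: skip one offending byte and emit a marker. */
        src++;
        src_left--;
        *dst++ = '?';
        dst_left--;

        /* Return the converter to its initial state before retrying. */
        iconv(cd, NULL, NULL, NULL, NULL);
    }

    /* Let stateful encodings write any closing shift sequence. */
    if (iconv(cd, NULL, NULL, &dst, &dst_left) == (size_t)-1) {
        iconv_close(cd);
        return (size_t)-1;
    }

    iconv_close(cd);
    *dst = '\0';
    return (size_t)(dst - out);
}
```

With this sketch, converting the three bytes 'a', 0xFF, 'b' from UTF-8
to ASCII should yield "a?b".  Because it skips a single byte at a
time, a multi-byte character that is valid but has no ASCII
equivalent would probably come out as several question marks rather
than one; skipping a whole character would need some knowledge of the
source encoding.  How non-GNU iconv implementations behave here is
untested.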

-- Nathaniel

-- 
Electrons find their paths in subtle ways.



