From: | dethrophes |
Subject: | Re: Can somebody explain to me what u32tochar in /lib/sh/unicode.c is trying to do? |
Date: | Sun, 11 Mar 2012 00:02:24 +0100 |
User-agent: | Mozilla/5.0 (Windows NT 6.1; WOW64; rv:10.0.2) Gecko/20120216 Thunderbird/10.0.2 |
On 10.03.2012 23:17, Chet Ramey wrote:
> On 3/7/12 12:07 AM, John Kearney wrote:
> > You really should stop using this function. It is just plain wrong and is not predictable. It may encode BIG5 and SJIS, but that is more by accident than intent. If you want to do something like this, then do it properly. Basically all of the multibyte systems have to have a detection method for multibyte characters; most of them rely on bit 7 to indicate a multibyte sequence or use VT100 SS3 escape sequences. You really can't just inject random data into a text buffer. Even returning UTF-8 as a fallback is a bug. The most that should be done is to return ASCII in the error case, and I mean U+0-U+7F only, and ignore or warn about any unsupported characters. Using this function is dangerous and pointless. I mean, seriously, in what world does it make sense to inject UTF-8 into a BIG5 string? Or indeed into an ASCII string?
> >
> > Code should behave like an adult, not like a frightened kid. By which I mean it shouldn't pretend it knows what it's doing when it doesn't; it should admit the problem so that the problem can be fixed.
>
> Wow. Do you really think that personal insults are a good way to advance an argument?
>
> Listen: bottom line. It's a fallback function. It's called in the unlikely event that iconv isn't available at all and we're not in a UTF-8 locale. Any fallback is as good as another, though maybe the best one would be to return \uNNNN or \UNNNNNNNN (before you ask, Posix leaves the \u/\U failure cases unspecified).
>
> The real question is what to do with invalid input data, since any transformation is going to "inject random data" into the buffer. Maybe the identity function would be better after all. But then you'd ask whether or not it makes sense to inject a C-style escape sequence into a big5 string.
>
> Chet

I guess I was a bit terse; I wouldn't call it a personal insult, though. Then again, I guess I do have pretty thick skin. Sorry if you felt it was meant as one.
My point is that the fallback function/handler should report an error or warning, do nothing else, and move on. Trying to recover from an irrecoverable error just makes it more difficult to figure out what is going on. Basically this is a script/environment error, so report the error; don't hide it.
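To make that concrete, here is a rough sketch of the kind of fallback I mean. This is not the code in lib/sh/unicode.c; the name `u32tochar_strict`, its signature, and the warning text are all made up for illustration. It passes plain ASCII (U+0000 to U+007F) through and reports everything else instead of guessing at an encoding.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical strict fallback (an assumption, not bash's u32tochar).
   Writes C into S as a NUL-terminated string if it is plain ASCII and
   returns the number of bytes written.  For anything else it warns on
   stderr, leaves S empty, and returns -1 so the caller sees the failure. */
int
u32tochar_strict (unsigned long c, char *s, size_t size)
{
  if (s == NULL || size < 2)
    return -1;                  /* no room for one byte plus the NUL */

  if (c > 0x7f)
    {
      fprintf (stderr, "warning: cannot encode U+%04lX in the current charset\n", c);
      s[0] = '\0';              /* emit nothing rather than foreign bytes */
      return -1;
    }

  s[0] = (char) c;
  s[1] = '\0';
  return 1;
}
```

With something like this, `u32tochar_strict (0x41, buf, sizeof buf)` stores "A" and returns 1, while `u32tochar_strict (0xF000, buf, sizeof buf)` prints a warning, leaves `buf` empty, and returns -1. Whether the caller then drops the character or aborts is a separate decision, but at least the problem is visible.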
It's a similar problem with the iconv fallback of returning UTF-8. If iconv says it can't encode the Unicode value in the destination charset, do we really know better? Again, it is better to report the error and move on, because injecting UTF-8 into BIG5 or whatever is also wrong. If UTF-8 were the destination charset, it would already have been detected or iconv would have worked, so contextually we know this is wrong.

    if (iconv (localconv, (ICONV_CONST char **)&iptr, &sn, &optr, &obytesleft) == (size_t)-1)
      return n; /* You get utf-8 if iconv fails */

Now don't forget that at this point we know iconv knows the source and destination charsets, so what we have is a Unicode character that is unsupported in the destination charset.

Or here:

    n = u32toutf8 (c, s);
    if (utf8locale || localconv == (iconv_t)-1)
      return n;

In other words: if the destination charset is UTF-8, OR the destination charset is NOT UTF-8 and iconv didn't recognise the destination charset, encode it as UTF-8.
Let's say CTYPE=BIG5 and you try to encode the Unicode char U+F000, which is an invalid BIG5 char (at least I think it is).
So iconv returns an error. Now the code inserts the UTF-8 encoding of U+F000, which is an invalid byte sequence in a BIG5 string.
This isn't helping anybody.
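To show what actually lands in the buffer in that case, here is a small standalone UTF-8 encoder. It is only a sketch: `utf8_encode_sketch` is a name I made up, it is not bash's u32toutf8, and it does not reject surrogates. For U+F000 it produces the three bytes 0xEF 0x80 0x80. As far as I understand Big5, 0xEF would be read as a lead byte and 0x80 is not a valid trail byte, so the result is garbage in that charset.

```c
#include <stddef.h>

/* Hypothetical helper (an assumption, not bash's u32toutf8): encode the
   code point C as UTF-8 into S, which must have room for 4 bytes.
   Returns the number of bytes written, or 0 if C is above U+10FFFF.
   Surrogate code points are not rejected in this sketch. */
int
utf8_encode_sketch (unsigned long c, unsigned char *s)
{
  if (c <= 0x7f)
    {
      s[0] = (unsigned char) c;
      return 1;
    }
  if (c <= 0x7ff)
    {
      s[0] = (unsigned char) (0xc0 | (c >> 6));
      s[1] = (unsigned char) (0x80 | (c & 0x3f));
      return 2;
    }
  if (c <= 0xffff)
    {
      s[0] = (unsigned char) (0xe0 | (c >> 12));
      s[1] = (unsigned char) (0x80 | ((c >> 6) & 0x3f));
      s[2] = (unsigned char) (0x80 | (c & 0x3f));
      return 3;
    }
  if (c <= 0x10ffff)
    {
      s[0] = (unsigned char) (0xf0 | (c >> 18));
      s[1] = (unsigned char) (0x80 | ((c >> 12) & 0x3f));
      s[2] = (unsigned char) (0x80 | ((c >> 6) & 0x3f));
      s[3] = (unsigned char) (0x80 | (c & 0x3f));
      return 4;
    }
  return 0;                     /* outside the Unicode range */
}
```

Calling `utf8_encode_sketch (0xF000, buf)` returns 3 and fills `buf` with 0xEF, 0x80, 0x80. Those are the bytes a BIG5 consumer would then have to make sense of.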