Re: [bug-gnu-libiconv] problem with iso-8859-8 encoding

On Tue, Feb 26, 2008 at 4:25 AM, Bruno Haible <address@hidden> wrote:

Hello,

Alexander Sirotkin wrote:
> I find it hard to believe, but apparently iconv has a problem converting
> iso-8859-8 (hebrew) to any other encoding, for instance UTF-8. Hebrew
> letters in the result appear in the reverse order.

As you can read in [1], [2], text in ISO 8859-8 is "sometimes in logical,
sometimes in visual order". Therefore the request to convert ISO-8859-8 to
UTF-8 is already ambiguous per se. Some others [3] say that ISO-8859-8 is always
visual... Oh well.

Additionally, conversion between visual and logical order requires an
arbitrary amount of memory (whose size depends on the input); this
does not fit into the way iconv is implemented in GNU libc and in GNU libiconv.

For these reasons, GNU libc and GNU libiconv don't implement this reordering.
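
To illustrate the point, here is a minimal Python sketch (not iconv code). It assumes a single pure-Hebrew word stored in visual order, and it uses Python's standard "iso8859_8" codec as a stand-in for any plain charset converter. The helper naive_visual_to_logical is hypothetical and only handles lines that contain nothing but Hebrew letters.

```python
# Sketch only: charset decoding vs. a naive visual-to-logical reversal.

# "shalom" in logical (reading) order: shin, lamed, vav, final mem.
LOGICAL = "\u05e9\u05dc\u05d5\u05dd"

# The same word as visual-order ISO-8859-8 bytes (leftmost glyph first).
visual_bytes = LOGICAL[::-1].encode("iso8859_8")

# A charset conversion maps each byte to one code point and keeps the
# byte order, so visual-order input yields visual-order output.
decoded = visual_bytes.decode("iso8859_8")


def naive_visual_to_logical(line: str) -> str:
    """Reverse one whole line (hypothetical helper, pure-Hebrew lines only).

    The complete line must be buffered before the first output character
    is known, so the memory needed grows with the input.
    """
    return line[::-1]


print(decoded == LOGICAL)                           # order is not fixed
print(naive_visual_to_logical(decoded) == LOGICAL)  # reversal restores it
```

Real text that mixes Hebrew with digits, Latin words, or punctuation needs far more than a plain reversal, which is the reordering problem described above.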

Fribidi implements reordering from logical to visual order.

The only free software (that I know of) that does reordering of ISO-8859-8
from visual to logical is ICU, and its documentation [4] says:

"Legacy systems frequently stored text in visual order to avoid
reordering for display. When exchanging data with such systems for
processing in Unicode it is necessary to reorder the data from visual
order to logical order and back. Such not-for-display transformations
are sometimes referred to as "storage layout" transformations.

There are two problems with an "inverse reordering" from visual to
logical order: There may be more than one logical order of text that
results in the same display (logical-to-visual reordering is a many-to-one
function), and there is no standard algorithm for it. ICU's BiDi API
provides a setting for "inverse" operation that modifies the standard
Unicode Bidi algorithm. However, it may not always produce the expected
results. Bidirectional data should be converted to Unicode and reordered
to logical order only once to avoid roundtrip losses. Just as it is best
to never convert to non-Unicode charsets, data should not be reordered
from logical to visual order except for display and printing."
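
The many-to-one property can be seen even in a toy model. The sketch below is an assumption for illustration only and is not the Unicode Bidirectional Algorithm or ICU's API: toy_visual is a hypothetical function that drops the invisible directional marks (U+200E, U+200F) and reverses each run of Hebrew letters, roughly what a left-to-right display would show.

```python
import re

# Invisible directional marks: LEFT-TO-RIGHT MARK and RIGHT-TO-LEFT MARK.
MARKS = {"\u200e", "\u200f"}


def toy_visual(logical: str) -> str:
    """Toy display model (assumption, not the Unicode Bidi Algorithm)."""
    # Marks influence ordering but are never drawn, so drop them.
    visible = "".join(ch for ch in logical if ch not in MARKS)
    # Reverse every maximal run of Hebrew letters (U+05D0..U+05EA).
    return re.sub(r"[\u05d0-\u05ea]+", lambda m: m.group(0)[::-1], visible)


# Two different logical strings: the second ends with an invisible mark.
a = "abc \u05d0\u05d1"
b = "abc \u05d0\u05d1\u200e"

print(a != b)                          # different stored text
print(toy_visual(a) == toy_visual(b))  # same displayed text
```

Since both inputs display identically, no inverse function can tell which logical string was meant from the visual form alone.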

Well, I was under the impression that ISO-8859-8 always means visual and ISO-8859-8-I is logical, but apparently not everybody thinks the same.

ICU looks like what I need. Indeed, the uconv utility from that package seems to support both ISO-8859-8 and ISO-8859-8-I, at least according to its help text. Unfortunately, it does not seem to fix the direction either. I guess I will have to write to the ICU people about that.

Thanks a lot.

Bruno

[1] http://en.wikipedia.org/wiki/ISO_8859-8
[2] http://en.wikipedia.org/wiki/ISO-8859-8-I
[3] http://www.w3.org/TR/2002/WD-xhtml2-20021211/mod-bidi.html
[4] http://www.icu-project.org/userguide/icu.pdf

From:	Alexander (Sasha) Sirotkin
Subject:	Re: [bug-gnu-libiconv] problem with iso-8859-8 encoding
Date:	Thu, 28 Feb 2008 23:34:07 +0200