bug-coreutils

Re: new modules for Unicode normalization


From: Bruno Haible
Subject: Re: new modules for Unicode normalization
Date: Sun, 22 Feb 2009 03:31:28 +0100
User-agent: KMail/1.9.9

Hi Pádraig,

> So I'm wondering now why normalization functionality isn't in iconv?
> Seems like a big omission to me.

1) Not every functionality that is a filter should become part of iconv.
   Unicode normalization forms? Removal of accents? Case conversions?
   Transliteration from one script to another (e.g. the recode-sr-latin
   program)? Logical to visual orientation for bidi? Where do you put
   the limit?

2) The specification of the iconv() function assumes that one character on
   input corresponds to one character on output. This leads to contortions
   in converters like EUC-JISX0213 or CP1258, where the notion of "one
   character" is context dependent. When you deal with decomposition and
   canonical reordering, these specification problems are aggravated.

3) iconv() cannot be extended portably. Some vendor iconv implementations
   can only be extended through tables, GNU libiconv only by changing the
   source code, and GNU libc only by creating specially crafted shared
   objects (I don't know if anyone has ever done this). A filter written
   as a separate program, by contrast, is immediately available on all
   systems.

4) In glibc, IIRC, the conversion state inside an iconv_t is of limited
   size. But Unicode normalization, as well as bidi reordering, requires
   an unbounded amount of temporary space.
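
To make the decomposition, reordering, and buffering issues in the list
above concrete, here is a minimal sketch. It assumes Python's standard
unicodedata module as a stand-in normalizer; it is not iconv and not the
proposed modules, and the helper names are made up for illustration.

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220


def decomposed_length(ch):
    """Number of code points one input character becomes under NFD."""
    return len(unicodedata.normalize("NFD", ch))


def reorder_marks(text):
    """NFD sorts each run of combining marks by combining class."""
    return unicodedata.normalize("NFD", text)


def first_mark_after_reordering(n):
    """Build 'a' + n acute accents + one dot below and return the mark
    that ends up directly after the base letter.

    The dot below has the lower combining class, so it moves in front
    of all n acute accents. A streaming converter therefore cannot emit
    any of the accents until it has seen the end of the whole run,
    however long that run is.
    """
    text = "a" + ACUTE * n + DOT_BELOW
    return unicodedata.normalize("NFD", text)[1]
```

For example, decomposed_length("\u01d5") is 3 (U with diaeresis and
macron becomes U, diaeresis, macron), and
reorder_marks("a\u0301\u0323") returns "a\u0323\u0301".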

> There is a mention of it here: 
> http://www.archivum.info/address@hidden/2006-08/msg00004.html

This page mentions that some vendor iconv implementations don't even get
iconv_open ("UTF-8", "UTF-8") implemented right. You see how little you
can portably expect from iconv (unless you consider installing GNU libiconv).

> Then I also noticed `uconv` which is in the "icu" package of fedora at least.
> To normalize text the following worked for me:
>   uconv -x NFC < test.utf8
> 
> So ... uconv already has it.
> Do we really need another util in coreutils for this?
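
For comparison, a normalizing filter of the kind quoted above can be
sketched in a few lines. This is only an illustration, assuming Python's
standard unicodedata module; normalize_lines is a hypothetical name, and
this is neither uconv nor the proposed coreutils module. It works line
by line on the assumption that a line feed is a starter that never
composes with its neighbours, so normalizing each line separately gives
the same result as normalizing the whole input.

```python
import unicodedata


def normalize_lines(lines, form="NFC"):
    """Yield each input line converted to the given normalization form.

    `lines` is any iterable of str, for example an open text file.
    """
    for line in lines:
        yield unicodedata.normalize(form, line)
```

As a filter it could be driven with
sys.stdout.writelines(normalize_lines(sys.stdin)), provided both streams
are opened as UTF-8.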

ICU is certainly seminal, because it served as a testbed for the development
of Unicode. But I shudder when I see these library sizes (ICU 3.6 on x86):

$ size libicu*.so.*.0
    text    data     bss      dec     hex filename
10152037     116       0 10152153  9ae8d9 libicudata.so.36.0
 1215645   21760    1396  1238801  12e711 libicui18n.so.36.0
   34402    2524      36    36962    9062 libicuio.so.36.0
  245797    4644      88   250529   3d2a1 libicule.so.36.0
   34011    1232       4    35247    89af libiculx.so.36.0
  101228    1264       8   102500   19064 libicutu.so.36.0
 1093450   28360    6364  1128174  1136ee libicuuc.so.36.0

I cannot estimate how much of these 10 MB actually gets loaded into a
process' working set. 10 MB - this is 11 times the size of GNU libiconv
with all its conversion tables!

The benefit of a reimplementation is that
  - It implements only the required specifications and does not carry the
    historical baggage of 10 years of ICU, hence smaller code and table
    sizes.
  - When you find a bug or limitation, you have a better chance of getting
    it fixed.

Bruno



