bug-coreutils
Re: new modules for Unicode normalization


From: Pádraig Brady
Subject: Re: new modules for Unicode normalization
Date: Sun, 22 Feb 2009 11:22:48 +0000
User-agent: Thunderbird 2.0.0.6 (X11/20071008)

Bruno Haible wrote:
> Hi Pádraig,
> 
>> So I'm wondering now why normalization functionality isn't in iconv?
>> Seems like a big omission to me.
> 

[snip valid points on iconv limitations]

>> There is a mention of it here: 
>> http://www.archivum.info/address@hidden/2006-08/msg00004.html
> 
> This page mentions that some vendor iconv implementations don't even get
> iconv_open ("UTF-8", "UTF-8") implemented right. You see how little you
> can portably expect from iconv (unless you consider installing GNU libiconv).
> 
>> Then I also noticed `uconv` which is in the "icu" package of fedora at least.
>> To normalize text the following worked for me:
>>   uconv -x NFC < test.utf8
>>
>> So ... uconv already has it.
>> Do we really need another util in coreutils for this?
> 
> ICU is certainly seminal, because it served as a testbed for the development
> of Unicode. But I shudder when I see these library sizes (ICU 3.6 on x86):
> 
> $ size libicu*.so.*.0
>     text    data     bss      dec     hex filename
> 10152037     116       0 10152153  9ae8d9 libicudata.so.36.0
>  1215645   21760    1396  1238801  12e711 libicui18n.so.36.0
>    34402    2524      36    36962    9062 libicuio.so.36.0
>   245797    4644      88   250529   3d2a1 libicule.so.36.0
>    34011    1232       4    35247    89af libiculx.so.36.0
>   101228    1264       8   102500   19064 libicutu.so.36.0
>  1093450   28360    6364  1128174  1136ee libicuuc.so.36.0
> 
> I cannot estimate how much of these 10 MB actually gets loaded into a
> process' working set. 10 MB - this is 11 times the size of GNU libiconv
> with all its conversion tables!
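
As a quick sanity check on the figures quoted above, here is a minimal
Python sketch that re-adds the columns of the `size` output. The numbers
are copied by hand from the table; the names `ROWS`, `rows_consistent`
and `total_dec` are made up for illustration. It only confirms that each
`dec` value equals text + data + bss, and shows how much of the total
comes from libicudata alone. It says nothing about what ends up resident
in memory.

```python
# (text, data, bss, dec) copied from the quoted `size libicu*.so.*.0` output.
ROWS = {
    "libicudata.so.36.0": (10152037, 116, 0, 10152153),
    "libicui18n.so.36.0": (1215645, 21760, 1396, 1238801),
    "libicuio.so.36.0": (34402, 2524, 36, 36962),
    "libicule.so.36.0": (245797, 4644, 88, 250529),
    "libiculx.so.36.0": (34011, 1232, 4, 35247),
    "libicutu.so.36.0": (101228, 1264, 8, 102500),
    "libicuuc.so.36.0": (1093450, 28360, 6364, 1128174),
}


def rows_consistent(rows):
    """Return True if dec == text + data + bss for every library."""
    return all(text + data + bss == dec for text, data, bss, dec in rows.values())


def total_dec(rows):
    """Sum of the `dec` column over all libraries, in bytes."""
    return sum(dec for _, _, _, dec in rows.values())


if __name__ == "__main__":
    total = total_dec(ROWS)
    data_only = ROWS["libicudata.so.36.0"][3]
    print("columns consistent:", rows_consistent(ROWS))
    print("total bytes:", total)
    print("libicudata share: %.0f%%" % (100.0 * data_only / total))
```

So the 10 MB figure is essentially the data library by itself; the code
libraries add roughly another 2.8 MB on top.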

$ uconv -x NFC&
$ sudo bin/ps_mem.py | grep uconv
 Private  +   Shared  =  RAM used       Program
  1.9 MiB + 788.0 KiB =   2.7 MiB       uconv
$ uconv -x NFC&
$ sudo bin/ps_mem.py | grep uconv
912.0 KiB +   2.2 MiB =   3.1 MiB       uconv (2)

> The benefit of a reimplementation is that
>   - It implements only the required specifications, does not carry the
>     historical baggage of 10 years of ICU, hence smaller code and table
>     sizes.
>   - When you find a bug or limitation, you have higher chances of getting it
>     fixed.

I don't doubt the usefulness of libiconv, though I'm still not sure
another "normalization util" is required when uconv is available.
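
For anyone wanting to see what NFC normalization does to a string
without installing ICU, here is a minimal sketch using Python's standard
`unicodedata` module. It is not how uconv works internally, and the
helper name `to_nfc` is just something I made up for illustration; it
only shows the effect on a decomposed character sequence.

```python
import unicodedata


def to_nfc(text):
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    # 'a' followed by U+0301 COMBINING ACUTE ACCENT (decomposed form).
    decomposed = "Pa\u0301draig"
    composed = to_nfc(decomposed)

    # NFC merges the base letter and the combining accent into U+00E1.
    print(len(decomposed), len(composed))
    print(composed == "P\u00e1draig")
```

Text that is already in NFC passes through unchanged, so applying it
twice is harmless.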

thanks again for all the info,
Pádraig.



