coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[coreutils] Re: [PATCH] join: support multi-byte character encodings


From: Pádraig Brady
Subject: [coreutils] Re: [PATCH] join: support multi-byte character encodings
Date: Mon, 20 Sep 2010 09:57:19 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3

On 20/09/10 01:55, Bruno Haible wrote:
> [CCing bug-libunistring. This is a reply to
>  <http://lists.gnu.org/archive/html/coreutils/2010-09/msg00032.html>.]
> 
> Hi Pádraig,
> 
>> I was doing some performance analysis of the above patch
>> and noticed it performed very well usually but not when
>> ulc_casecoll or ulc_casecmp were called, when not in a
>> UTF-8 locale. In fact it seemed to slow down everything
>> by about 5 times?
>>
>> $ seq -f%010.0f 30000 | tee f1 > f2
>>
>> # with ulc_casecmp
>> $ time LANG=en_US join -i f1 f2 >/dev/null
>> real    0m1.281s
>>
>> # with ulc_casecoll
>> $ time LANG=en_US join -i f1 f2 >/dev/null
>> real    0m2.120s
>>
>> # with ulc_casecmp in UTF8 locale
>> $ time LANG=en_US.utf8 join -i f1 f2 >/dev/null
>> real    0m0.260s
>>
>> # with ulc_casecoll in UTF8 locale
>> $ time LANG=en_US.utf8 join -i f1 f2 >/dev/null
>> real    0m0.437s
> 
> Thanks for the test case. I've produced a similar test case that operates
> on the same data, in a way similar to 'join', and profiled it:
>   $ valgrind --tool=callgrind ./a.out
>   $ kcachegrind callgrind.out.*
> 
>> # Doing encoding outside gives an indication
>> # what the performance should be like:
>> # with ulc_casecoll in UTF8 locale
>> $ time LANG=en_US.utf8 join -i <(iconv -fiso-8859-1 -tutf8 f1) \
>>   <(iconv -fiso-8859-1 -t utf8 f2) >/dev/null
>> real    0m0.462s
> 
> This is an unfair comparison, because it throws overboard the requirement
> that 'join' has to be able to deal with input that 'iconv' would not accept.
> 
>> A quick callgraph of ulc_casecoll gives:
>>
>> ulc_casecoll(s1,s2)
>>   ulc_casexfrm(s1)
>>     u8_conv_from_encoding()
>>       mem_iconveha(from,"UTF8",translit=true)
>>         mem_iconveh()
>>           iconveh_open()
>>           mem_cd_iconveh()
>>             mem_cd_iconveh_internal()
>>               iconv()
>>           iconveh_close()
>>     u8_casexfrm
>>       u8_ct_casefold()
>>         u8_casemap()
>>       u8_conv_to_encoding()
>>          ...
>>       memxfrm()
>>   ulc_casexfrm(s2)
>>   memcmp(s1,s2)
> 
> Unfortunately such a callgraph is not quantitative. I needed a
> profiler to really make sense out of it.
> 
> The profiling showed me three things:
>   - The same string is being converted to UTF-8, then case-folded
>     according to Unicode and locale rules, then converted back to locale
>     encoding, and then strxfrm'ed, more than once. CPU time would be
>     minimized (at the expense of more memory use) if this was done only
>     once for each input line. It is not clear at this point how much
>     it must be done in join.c and how much help for this can be added
>     to libunistring.

Well my first thought is that libunistring might expose
a version of these "conv" functions with a context parameter
to reference the state.

>   - There are way too many iconv_open calls. In particular, while
>     ulc_casecmp and ulc_casecoll have to convert two string arguments
>     and could use the same iconv_t objects for both arguments, they
>     open fresh iconv_t objects for each argument.
>   - Additionally, every call to iconveh_open with target encoding UTF-8
>     calls iconv_open twice, where once would be sufficient.

Good stuff.

>> So one can see the extra overhead involved when not in UTF-8.
>> Seems like I'll have to efficiently convert fields to utf8
>> internally first before calling u8_casecoll()?
> 
> If there was a guarantee that all strings can be converted to UTF-8, yes.
> But there is no such guarantee for 'join'. The keycmp function in
> <http://lists.gnu.org/archive/html/coreutils/2010-09/msg00029.html> is
> therefore actually wrong: It does not always satisfy the requirement
>    keycmp(A,B) < 0  && keycmp(B,C) < 0  ==>  keycmp(A,C) < 0
> The required handling of input strings that are not valid in the locale
> encoding is more elaborate that I thought. I need to think more about
> this issue.

These are the sort of edge cases I was worrying about
when looking at normalization and punted until later.

cheers,
Pádraig.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]