[coreutils] Re: [PATCH] join: support multi-byte character encodings
From: Pádraig Brady
Subject: [coreutils] Re: [PATCH] join: support multi-byte character encodings
Date: Mon, 20 Sep 2010 09:57:19 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3
On 20/09/10 01:55, Bruno Haible wrote:
> [CCing bug-libunistring. This is a reply to
> <http://lists.gnu.org/archive/html/coreutils/2010-09/msg00032.html>.]
>
> Hi Pádraig,
>
>> I was doing some performance analysis of the above patch
>> and noticed that it usually performed very well, but not when
>> ulc_casecoll or ulc_casecmp were called outside a UTF-8
>> locale. In that case everything slowed down by roughly
>> 5 times:
>>
>> $ seq -f%010.0f 30000 | tee f1 > f2
>>
>> # with ulc_casecmp
>> $ time LANG=en_US join -i f1 f2 >/dev/null
>> real 0m1.281s
>>
>> # with ulc_casecoll
>> $ time LANG=en_US join -i f1 f2 >/dev/null
>> real 0m2.120s
>>
>> # with ulc_casecmp in UTF8 locale
>> $ time LANG=en_US.utf8 join -i f1 f2 >/dev/null
>> real 0m0.260s
>>
>> # with ulc_casecoll in UTF8 locale
>> $ time LANG=en_US.utf8 join -i f1 f2 >/dev/null
>> real 0m0.437s
>
> Thanks for the test case. I've produced a similar test case that operates
> on the same data, in a way similar to 'join', and profiled it:
> $ valgrind --tool=callgrind ./a.out
> $ kcachegrind callgrind.out.*
>
>> # Doing encoding outside gives an indication
>> # what the performance should be like:
>> # with ulc_casecoll in UTF8 locale
>> $ time LANG=en_US.utf8 join -i <(iconv -fiso-8859-1 -tutf8 f1) \
>> <(iconv -fiso-8859-1 -t utf8 f2) >/dev/null
>> real 0m0.462s
>
> This is an unfair comparison, because it throws overboard the requirement
> that 'join' has to be able to deal with input that 'iconv' would not accept.
>
>> A quick callgraph of ulc_casecoll gives:
>>
>> ulc_casecoll(s1,s2)
>>   ulc_casexfrm(s1)
>>     u8_conv_from_encoding()
>>       mem_iconveha(from,"UTF8",translit=true)
>>         mem_iconveh()
>>           iconveh_open()
>>           mem_cd_iconveh()
>>             mem_cd_iconveh_internal()
>>               iconv()
>>           iconveh_close()
>>     u8_casexfrm
>>       u8_ct_casefold()
>>         u8_casemap()
>>       u8_conv_to_encoding()
>>         ...
>>       memxfrm()
>>   ulc_casexfrm(s2)
>>   memcmp(s1,s2)
>
> Unfortunately such a callgraph is not quantitative. I needed a
> profiler to really make sense out of it.
>
> The profiling showed me three things:
> - The same string is being converted to UTF-8, then case-folded
> according to Unicode and locale rules, then converted back to locale
> encoding, and then strxfrm'ed, more than once. CPU time would be
> minimized (at the expense of more memory use) if this was done only
> once for each input line. It is not clear at this point how much
> it must be done in join.c and how much help for this can be added
> to libunistring.
Well, my first thought is that libunistring might expose
versions of these "conv" functions that take a context
parameter referencing the conversion state.
> - There are way too many iconv_open calls. In particular, while
> ulc_casecmp and ulc_casecoll have to convert two string arguments
> and could use the same iconv_t objects for both arguments, they
> open fresh iconv_t objects for each argument.
> - Additionally, every call to iconveh_open with target encoding UTF-8
> calls iconv_open twice, where once would be sufficient.
Good stuff.
>> So one can see the extra overhead involved when not in UTF-8.
>> Seems like I'll have to efficiently convert fields to utf8
>> internally first before calling u8_casecoll()?
>
> If there was a guarantee that all strings can be converted to UTF-8, yes.
> But there is no such guarantee for 'join'. The keycmp function in
> <http://lists.gnu.org/archive/html/coreutils/2010-09/msg00029.html> is
> therefore actually wrong: It does not always satisfy the requirement
> keycmp(A,B) < 0 && keycmp(B,C) < 0 ==> keycmp(A,C) < 0
> The required handling of input strings that are not valid in the locale
> encoding is more elaborate than I thought. I need to think more about
> this issue.
These are the sort of edge cases I was worried about
when looking at normalization, and which I punted on until later.
cheers,
Pádraig.