Re: [coreutils] [PATCH] join: support multi-byte character encodings
From: Pádraig Brady
Subject: Re: [coreutils] [PATCH] join: support multi-byte character encodings
Date: Wed, 15 Sep 2010 01:21:29 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3
I was doing some performance analysis of the above patch
and noticed that it usually performs very well, except when
ulc_casecoll or ulc_casecmp is called outside a UTF-8
locale. In that case everything seemed to slow down
by about 5 times:
$ seq -f%010.0f 30000 | tee f1 > f2
# with ulc_casecmp
$ time LANG=en_US join -i f1 f2 >/dev/null
real 0m1.281s
# with ulc_casecoll
$ time LANG=en_US join -i f1 f2 >/dev/null
real 0m2.120s
# with ulc_casecmp in UTF8 locale
$ time LANG=en_US.utf8 join -i f1 f2 >/dev/null
real 0m0.260s
# with ulc_casecoll in UTF8 locale
$ time LANG=en_US.utf8 join -i f1 f2 >/dev/null
real 0m0.437s
# Doing the encoding conversion outside join gives an
# indication of what the performance should be like:
# with ulc_casecoll in UTF8 locale
$ time LANG=en_US.utf8 join -i <(iconv -f iso-8859-1 -t utf8 f1) \
    <(iconv -f iso-8859-1 -t utf8 f2) >/dev/null
real 0m0.462s
A quick callgraph of ulc_casecoll gives:
ulc_casecoll(s1,s2)
  ulc_casexfrm(s1)
    u8_conv_from_encoding()
      mem_iconveha(from,"UTF8",translit=true)
        mem_iconveh()
          iconveh_open()
          mem_cd_iconveh()
            mem_cd_iconveh_internal()
              iconv()
          iconveh_close()
    u8_casexfrm()
      u8_ct_casefold()
        u8_casemap()
      u8_conv_to_encoding()
        ...
      memxfrm()
  ulc_casexfrm(s2)
  memcmp(s1,s2)
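So one can see the extra overhead involved when not in UTF-8:
every comparison goes through iconveh_open(), iconv() and
iconveh_close(), plus a conversion back via u8_conv_to_encoding().
It seems I'll have to efficiently convert fields to UTF-8
internally first, before calling u8_casecoll()?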
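
To make the idea concrete, here is a rough sketch of what "convert fields to UTF-8 internally first" might look like. It is not code from the patch. The names utf8_conv_init() and field_to_utf8() are hypothetical, and it assumes plain iconv(3) rather than the gnulib wrappers. The conversion descriptor is opened once and reused for every field, so the open/close pair drops out of the per-comparison path:

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: a single conversion descriptor, opened once and reused
   for every field instead of being opened and closed per comparison.  */
static iconv_t to_utf8 = (iconv_t) -1;

/* Open the descriptor for the locale's encoding.  Returns 0 on success.  */
static int
utf8_conv_init (const char *from_encoding)
{
  to_utf8 = iconv_open ("UTF-8", from_encoding);
  return to_utf8 == (iconv_t) -1 ? -1 : 0;
}

/* Convert SRC (LEN bytes) to a malloc'd, NUL-terminated UTF-8 string.
   Returns NULL on failure; the caller frees the result.
   Assumption: 4 output bytes per input byte is enough.  A real
   implementation would grow the buffer on E2BIG instead of failing.  */
static char *
field_to_utf8 (const char *src, size_t len, size_t *out_len)
{
  size_t cap = 4 * len + 1;
  char *buf = malloc (cap);
  if (!buf)
    return NULL;

  char *in = (char *) src;      /* iconv() takes a non-const pointer */
  size_t in_left = len;
  char *out = buf;
  size_t out_left = cap - 1;    /* keep one byte for the terminator */

  /* Reset the descriptor to its initial shift state before reuse.  */
  iconv (to_utf8, NULL, NULL, NULL, NULL);

  if (iconv (to_utf8, &in, &in_left, &out, &out_left) == (size_t) -1)
    {
      free (buf);
      return NULL;
    }

  *out = '\0';
  if (out_len)
    *out_len = out - buf;
  return buf;
}
```

With something like this, join would call utf8_conv_init() once at startup with the locale's charset and convert each key field as it is read. The comparison could then work on UTF-8 directly and skip the per-call conversion shown in the callgraph above. Whether that is actually a net win would need measuring.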
cheers,
Pádraig.