Re: [coreutils] [PATCH] join: support multi-byte character encodings


From: Pádraig Brady
Subject: Re: [coreutils] [PATCH] join: support multi-byte character encodings
Date: Wed, 15 Sep 2010 01:21:29 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3

I was doing some performance analysis of the above patch
and noticed that it usually performs very well, but not
when ulc_casecoll or ulc_casecmp is called outside a
UTF-8 locale. In that case everything slows down by
about 5 times:

$ seq -f%010.0f 30000 | tee f1 > f2

# with ulc_casecmp
$ time LANG=en_US join -i f1 f2 >/dev/null
real    0m1.281s

# with ulc_casecoll
$ time LANG=en_US join -i f1 f2 >/dev/null
real    0m2.120s

# with ulc_casecmp in a UTF-8 locale
$ time LANG=en_US.utf8 join -i f1 f2 >/dev/null
real    0m0.260s

# with ulc_casecoll in a UTF-8 locale
$ time LANG=en_US.utf8 join -i f1 f2 >/dev/null
real    0m0.437s

# Converting the encoding outside of join gives an
# indication of what the performance should be like
# (with ulc_casecoll in a UTF-8 locale):
$ time LANG=en_US.utf8 join -i <(iconv -fiso-8859-1 -tutf8 f1) \
  <(iconv -fiso-8859-1 -t utf8 f2) >/dev/null
real    0m0.462s

A quick callgraph of ulc_casecoll gives:

ulc_casecoll(s1,s2)
  ulc_casexfrm(s1)
    u8_conv_from_encoding()
      mem_iconveha(from,"UTF8",translit=true)
        mem_iconveh()
          iconveh_open()
          mem_cd_iconveh()
            mem_cd_iconveh_internal()
              iconv()
          iconveh_close()
    u8_casexfrm()
      u8_ct_casefold()
        u8_casemap()
      u8_conv_to_encoding()
         ...
      memxfrm()
  ulc_casexfrm(s2)
  memcmp(s1,s2)

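So one can see the extra overhead involved when not in UTF-8:
every ulc_casexfrm() call opens and closes an iconv descriptor
to convert the field to UTF-8, and then converts the case-folded
result back to the locale encoding before memxfrm().
It seems I'll have to efficiently convert fields to UTF-8
internally first, and then call u8_casecoll() directly?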
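
To make the idea concrete, here is a minimal sketch of converting
fields with a conversion descriptor that is opened once and then
reused, so the open/close pair in the callgraph is not paid per
comparison. It is not taken from the patch: the field_conv_* names
are hypothetical, and it uses plain POSIX iconv() rather than
gnulib's iconveh wrappers, so there is no transliteration fallback.

```c
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: holds a descriptor that is opened once
   and reused for every field.  */
struct field_conv
{
  iconv_t cd;
};

/* Open a converter from FROM_CODE to UTF-8.  Return 0 on success.  */
static int
field_conv_open (struct field_conv *fc, const char *from_code)
{
  fc->cd = iconv_open ("UTF-8", from_code);
  return fc->cd == (iconv_t) -1 ? -1 : 0;
}

/* Convert the LEN bytes at SRC to UTF-8.  Return a malloc'd,
   NUL-terminated buffer and store its length in *OUT_LEN,
   or return NULL on invalid input or memory exhaustion.  */
static char *
field_conv_to_utf8 (struct field_conv *fc, const char *src, size_t len,
                    size_t *out_len)
{
  size_t cap = len * 2 + 16;    /* initial guess, grown on E2BIG */
  size_t used = 0;
  char *in = (char *) src;      /* iconv() takes a non-const pointer */
  size_t in_left = len;
  char *buf = malloc (cap);
  if (!buf)
    return NULL;

  /* Return the descriptor to its initial state before each field.  */
  iconv (fc->cd, NULL, NULL, NULL, NULL);

  while (in_left > 0)
    {
      char *out = buf + used;
      size_t out_left = cap - used - 1;   /* keep room for the NUL */
      size_t r = iconv (fc->cd, &in, &in_left, &out, &out_left);
      used = out - buf;
      if (r != (size_t) -1)
        break;                            /* all input consumed */
      if (errno != E2BIG)
        {
          free (buf);                     /* EILSEQ or EINVAL */
          return NULL;
        }
      cap *= 2;
      char *tmp = realloc (buf, cap);
      if (!tmp)
        {
          free (buf);
          return NULL;
        }
      buf = tmp;
    }

  buf[used] = '\0';
  *out_len = used;
  return buf;
}

/* Release the descriptor once all input has been processed.  */
static void
field_conv_close (struct field_conv *fc)
{
  iconv_close (fc->cd);
}
```

The intended use would be to call field_conv_open() once at startup
with the locale's encoding, convert each join field as it is read,
compare the UTF-8 results with u8_casecoll(), and call
field_conv_close() at exit. Whether this recovers the UTF-8 locale
timings above would still need to be measured.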

cheers,
Pádraig.


