Re: [coreutils] [PATCH] join: support multi-byte character encodings

From:	Pádraig Brady
Subject:	Re: [coreutils] [PATCH] join: support multi-byte character encodings
Date:	Wed, 15 Sep 2010 01:21:29 +0100
User-agent:	Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3

I was doing some performance analysis of the above patch
and noticed that it usually performs very well, but not when
ulc_casecoll or ulc_casecmp are called outside a UTF-8
locale. In that case everything seems to slow down
by about 5 times:

$ seq -f%010.0f 30000 | tee f1 > f2

# with ulc_casecmp
$ time LANG=en_US join -i f1 f2 >/dev/null
real    0m1.281s

# with ulc_casecoll
$ time LANG=en_US join -i f1 f2 >/dev/null
real    0m2.120s

# with ulc_casecmp in UTF8 locale
$ time LANG=en_US.utf8 join -i f1 f2 >/dev/null
real    0m0.260s

# with ulc_casecoll in UTF8 locale
$ time LANG=en_US.utf8 join -i f1 f2 >/dev/null
real    0m0.437s

# Doing the encoding conversion outside join gives an
# indication of what the performance should be like
# (with ulc_casecoll in UTF8 locale):
$ time LANG=en_US.utf8 join -i <(iconv -f iso-8859-1 -t utf8 f1) \
  <(iconv -f iso-8859-1 -t utf8 f2) >/dev/null
real    0m0.462s

A quick callgraph of ulc_casecoll gives:

ulc_casecoll(s1,s2)
  ulc_casexfrm(s1)
    u8_conv_from_encoding()
      mem_iconveha(from,"UTF8",translit=true)
        mem_iconveh()
          iconveh_open()
          mem_cd_iconveh()
            mem_cd_iconveh_internal()
              iconv()
          iconveh_close()
    u8_casexfrm
      u8_ct_casefold()
        u8_casemap()
      u8_conv_to_encoding()
         ...
      memxfrm()
  ulc_casexfrm(s2)
  memcmp(s1,s2)

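So one can see the extra overhead involved when not in UTF-8:
every call converts its argument to UTF-8 through iconv, opening
and closing a conversion descriptor each time, and converts back
again after case folding.
It seems I'll have to efficiently convert fields to UTF-8
internally first, before calling u8_casecoll()?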
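
Until join does this internally, the same idea can be applied from
the outside, generalising the iconv run shown earlier: convert each
input to UTF-8 once, then join in a UTF-8 locale. A minimal sketch,
assuming GNU coreutils and iconv are available and that en_US.utf8 is
an installed locale; join_via_utf8 is a hypothetical helper name, not
part of the patch:

```shell
# Sketch: case-insensitive join of two files in a legacy encoding.
# Both inputs are converted to UTF-8 once, up front, so join never has
# to convert individual fields.
# join_via_utf8 is a hypothetical name; en_US.utf8 is an assumption.
join_via_utf8() {
  from=$1 left=$2 right=$3
  tmpdir=$(mktemp -d) || return 1
  # One iconv pass per input file.
  iconv -f "$from" -t UTF-8 "$left"  > "$tmpdir/left"  &&
  iconv -f "$from" -t UTF-8 "$right" > "$tmpdir/right" &&
  # LC_ALL rather than LANG, so an inherited LC_ALL cannot override it.
  LC_ALL=en_US.utf8 join -i "$tmpdir/left" "$tmpdir/right"
  status=$?
  rm -rf "$tmpdir"
  return $status
}
```

For example: join_via_utf8 ISO-8859-1 f1 f2 >/dev/null
The output stays in UTF-8, so it needs another iconv pass if the
original encoding is wanted back.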

cheers,
Pádraig.
