Re: bug in join: case comparisons don't work in multibyte locales

From: Pádraig Brady
Subject: Re: bug in join: case comparisons don't work in multibyte locales
Date: Wed, 11 Mar 2009 02:29:20 +0000
User-agent: Thunderbird (X11/20071008)

Bruno Haible wrote:
> Hi Jim,

Thanks for looking at this, Bruno.

> In coreutils/src/join.c, there is a FIXME mentioning that the -i option for
> case insensitive comparison of the input lines does not work in multibyte
> locales.

Utils that have this issue are:
join -i, uniq -i, sort -f, ptx -f
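
To make the failure concrete, here is a minimal sketch of a byte-wise
case-insensitive comparison in the style of memcasecmp. It is not the
gnulib code: bytewise_casecmp is a hypothetical name, and the sample
strings assume UTF-8 input.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not the gnulib implementation: compare two
   memory areas byte by byte after mapping each byte through toupper().  */
static int
bytewise_casecmp (const char *a, size_t alen, const char *b, size_t blen)
{
  size_t n = alen < blen ? alen : blen;
  for (size_t i = 0; i < n; i++)
    {
      int ca = toupper ((unsigned char) a[i]);
      int cb = toupper ((unsigned char) b[i]);
      if (ca != cb)
        return ca - cb;
    }
  /* Common prefix is equal: the shorter area sorts first.  */
  return (alen > blen) - (alen < blen);
}

/* UTF-8 encodings: "é" is C3 A9, "É" is C3 89.  */
static const char lower_e_acute[] = "\xC3\xA9";
static const char upper_e_acute[] = "\xC3\x89";
```

ASCII input such as "join" and "JOIN" compares equal, but the two
accented letters do not: toupper() only ever sees one byte of a
two-byte character, so it cannot map A9 to 89.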

> Before going on, let me summarize the case comparison functions for strings
> that we have available with gnulib:
>                       | on NUL terminated    | on memory areas or
>                       | strings              | strings with embedded NULs
> ----------------------+----------------------+---------------------------
> For ASCII strings     | c_strcasecmp,        |
> only                  | STRCASEEQ            |
> ----------------------+----------------------+---------------------------
> For unibyte locales   | strcasecmp           | memcasecmp
> only                  |                      |
> ----------------------+----------------------+---------------------------
> Support for multibyte | mbscasecmp           | mbmemcasecmp
> locales               |                      |
>     ------------------+----------------------+---------------------------
>   + German, Greek etc.|                      | ulc_casecmp
> ----------------------+----------------------+---------------------------
> Support for multibyte |                      | mbmemcasecoll
> locales and locale    |                      |
> collation order       |                      |
>     ------------------+----------------------+---------------------------
>   + German, Greek etc.|                      | ulc_casecoll
> ----------------------+----------------------+---------------------------

That's _really_ helpful.

> Find attached a draft patch for the 'join' program, that fixes the bug
> mentioned above by use of the mbmemcasecmp or ulc_casecmp functions. It
> is not ready to apply, because there are three big questions:
> 1) Which functions to use for case comparison in coreutils?


>    I think ulc_casecmp is "correct", whereas mbmemcasecmp is only "half
>    correct".
>    The reason is that mbmemcasecmp is based on the POSIX APIs, but these APIs
>    have some assumptions built-in that are not valid in some languages:


I think if we're going to do it, we should do it right,
i.e. use ulc_casecmp.
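
A sketch of one such assumption, as I understand it: towlower() and
towupper() map one wide character to exactly one wide character, so a
comparison built on them cannot treat German "ß" as equal to "SS".
per_char_casecmp is a hypothetical helper for illustration, not
mbmemcasecmp itself.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: case-insensitive comparison that folds one wide
   character at a time, which is all the towlower() interface allows.  */
static int
per_char_casecmp (const wchar_t *a, const wchar_t *b)
{
  for (;; a++, b++)
    {
      wint_t ca = towlower ((wint_t) *a);
      wint_t cb = towlower ((wint_t) *b);
      if (ca != cb)
        return ca < cb ? -1 : 1;
      if (ca == L'\0')
        return 0;   /* both strings ended at the same position */
    }
}
```

With this helper, L"JOIN" and L"join" compare equal, while L"STRASSE"
and L"stra\u00dfe" differ at the fifth character, because no
single-character mapping can expand "ß" into two letters.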

> 2) There is a problem with the case comparison in "sort -f": POSIX specifies
>    how this option should behave, in terms of the old POSIX terms
>    ("all lowercase characters that have uppercase equivalents").
>    How to deal with that?
>      a) Use mbmemcasecmp for the option -f, and introduce a long option that
>         works with ulc_casecmp?
>      b) Use mbmemcasecmp if the environment variable POSIXLY_CORRECT is set,
>         and ulc_casecmp otherwise?

b) would be preferable. We could also do:
c) Always use ulc_casecmp for consistency and assume that's POSIX's intent.
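
Option b) could be sketched roughly as follows. The enum and function
names are hypothetical, and the sketch only picks a strategy; the
actual calls to mbmemcasecmp or ulc_casecmp are left to the caller.

```c
#include <stddef.h>

/* Hypothetical tag for the two comparison strategies discussed above.  */
enum casecmp_kind
{
  CASECMP_POSIX,    /* per-character mapping, e.g. via mbmemcasecmp */
  CASECMP_UNICODE   /* full Unicode case handling, e.g. via ulc_casecmp */
};

/* Pick the strategy from the value of getenv ("POSIXLY_CORRECT").
   The variable only needs to be set; its value is ignored.  */
static enum casecmp_kind
select_casecmp_kind (const char *posixly_correct)
{
  return posixly_correct != NULL ? CASECMP_POSIX : CASECMP_UNICODE;
}
```

A program would call select_casecmp_kind (getenv ("POSIXLY_CORRECT"))
once at startup and dispatch on the result in its comparison routine.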

> 3) There is also a problem with the executable size: the ulc_casecmp (and
>    ulc_casecoll) functions are implemented using a couple of tables. I
>    squeezed them already, while still guaranteeing O(1) time for each
>    access. Most of the tables are about 10 KB large, the largest one
>    ca. 45 KB.
>    But it sums up:
>             join executable              size (decimal)
>        coreutils-7.1 unmodified             35436
>        with mbmemcasecmp                    36473
>        with ulc_casecmp                    174336
>        with ulc_casecmp and mbmemcasecmp   176521
>        (switched at runtime)

That's a little painful.

>    When an executable grows from 35 KB to 175 KB, just for correct string
>    comparisons, some people will certainly complain. Especially embedded
>    developers, like the busybox guys, try to reduce total executable size.

>    And that's not only about 'join', it's ultimately about every coreutils
>    program that has an option to perform case-insensitive comparisons on
>    user's data.

Only the 4 mentioned above, I think.

>    How to deal with that?
>      a) Add a configure option --disable-extra-i18n, that will refrain from
>         using the ulc_casecmp function?

That might help embedded stuff, though I don't think there's a real need.
If disk space is tight, busybox etc. will be used anyway.

>      b) Let coreutils build and install a shared library for these large
>         modules?

That would help with memory usage, which is what worries me the most.
Though I suppose if only 4 utils had the tables linked in, then
there would be at most 4 copies in memory? I'm not sure it's
worth the effort of dealing with shared libs for this memory.
Could you run the ps_mem.py util mentioned previously on join
with the tables? One without them is taking here: 88.0 KiB + 416.0 KiB = 504.0 KiB
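
For a rough upper bound on the "4 copies" case, here is the arithmetic
as a small sketch. It assumes the whole size increase is table data,
that all four binaries are running, and that every table page is
resident. Read-only pages are shared between processes running the
same executable, so the copies are per binary, not per process.

```c
/* Sizes in bytes, taken from the join size table quoted above.  */
enum { SIZE_UNMODIFIED = 35436, SIZE_WITH_ULC_CASECMP = 174336 };

/* Utilities listed earlier: join, uniq, sort, ptx.  */
enum { AFFECTED_UTILS = 4 };

/* Extra bytes one executable carries when linked with ulc_casecmp.  */
static long
extra_bytes_per_util (void)
{
  return (long) SIZE_WITH_ULC_CASECMP - SIZE_UNMODIFIED;
}

/* Upper bound if each affected utility maps its own copy.  */
static long
worst_case_extra_bytes (void)
{
  return AFFECTED_UTILS * extra_bytes_per_util ();
}
```

That is 138900 bytes per utility and 555600 bytes (about 543 KiB) for
all four, before counting any sharing a library would allow.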

>      c) Should these Unicode string functions be packaged externally to
>         coreutils, and coreutils can link to it as an external dependency
>         (like it does for libiconv, libintl, libacl, etc.)?

That would allow any other non-coreutils programs using them to also
share the memory used for the tables. That would be best, I think,
but again it's a lot of effort.

>      d) any other idea?

No, unfortunately.

