bug in join: case comparisons don't work in multibyte locales

From: Bruno Haible
Subject: bug in join: case comparisons don't work in multibyte locales
Date: Wed, 11 Mar 2009 01:40:50 +0100
User-agent: KMail/1.9.9

Hi Jim,

In coreutils/src/join.c, there is a FIXME mentioning that the -i option for
case-insensitive comparison of the input lines does not work in multibyte
locales. And indeed, in a UTF-8 locale, I see this:

  $ cat > in1 <<EOF
  $ cat > in2 <<EOF
  $ join -i in1 in2
  [empty result]

The expected result is:

  $ join -i in1 in2

Similarly, with a German word in lower and upper case:

  $ cat > in1 <<EOF
  $ cat > in2 <<EOF
  $ join -i in1 in2
  [empty result]

The expected result is:

  $ join -i in1 in2

Before going on, let me summarize the case comparison functions for strings
that we have available with gnulib:

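                      | on NUL terminated    | on memory areas or
                      | strings              | strings with embedded NULs
----------------------+----------------------+---------------------------
For ASCII strings     | c_strcasecmp,        |
only                  | STRCASEEQ            |
----------------------+----------------------+---------------------------
For unibyte locales   | strcasecmp           | memcasecmp
only                  |                      |
----------------------+----------------------+---------------------------
Support for multibyte | mbscasecmp           | mbmemcasecmp
locales               |                      |
  + German, Greek etc.|                      | ulc_casecmp
----------------------+----------------------+---------------------------
Support for multibyte |                      | mbmemcasecoll
locales and locale    |                      |
collation order       |                      |
  + German, Greek etc.|                      | ulc_casecoll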
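
To illustrate why the unibyte rows of this table cannot work in a UTF-8
locale, here is a minimal, locale-independent sketch. It does not use the
gnulib functions themselves: bytewise_casecmp and charwise_casecmp are
hypothetical stand-ins, the UTF-8 decoder only handles 1- and 2-byte
sequences of valid input, and the lowercase mapping only covers ASCII and the
Latin-1 Supplement block.

```c
#include <stddef.h>
#include <stdint.h>

/* ASCII-only lowercase, independent of the current locale. */
static unsigned char ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical stand-in for a byte-wise comparison: every byte is folded
   on its own, so the bytes of a multibyte character are never combined. */
static int bytewise_casecmp(const char *a, size_t alen,
                            const char *b, size_t blen)
{
    size_t n = alen < blen ? alen : blen;
    for (size_t i = 0; i < n; i++) {
        unsigned char ca = ascii_lower((unsigned char)a[i]);
        unsigned char cb = ascii_lower((unsigned char)b[i]);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    return (alen > blen) - (alen < blen);
}

/* Decode one UTF-8 sequence of 1 or 2 bytes (assumes valid input; longer
   sequences are outside the scope of this demo). Returns bytes consumed. */
static size_t decode_utf8(const char *s, size_t len, uint32_t *cp)
{
    unsigned char c0 = (unsigned char)s[0];
    if (c0 >= 0xC2 && c0 <= 0xDF && len >= 2
        && ((unsigned char)s[1] & 0xC0) == 0x80) {
        *cp = ((uint32_t)(c0 & 0x1F) << 6) | ((unsigned char)s[1] & 0x3F);
        return 2;
    }
    *cp = c0;   /* ASCII, or an unsupported byte passed through unchanged */
    return 1;
}

/* 1:1 lowercase mapping for ASCII and the Latin-1 Supplement only. */
static uint32_t simple_lower(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7)   /* U+00D7 is the times sign */
        return cp + 0x20;
    return cp;
}

/* Hypothetical stand-in for a multibyte-aware comparison: decode whole
   characters first, then fold and compare code points. */
static int charwise_casecmp(const char *a, size_t alen,
                            const char *b, size_t blen)
{
    size_t i = 0, j = 0;
    while (i < alen && j < blen) {
        uint32_t ca, cb;
        i += decode_utf8(a + i, alen - i, &ca);
        j += decode_utf8(b + j, blen - j, &cb);
        ca = simple_lower(ca);
        cb = simple_lower(cb);
        if (ca != cb)
            return ca < cb ? -1 : 1;
    }
    return (i < alen) - (j < blen);
}
```

With the UTF-8 encodings of "ÄPFEL" ("\xC3\x84" "PFEL") and "äpfel"
("\xC3\xA4" "pfel"), the byte-wise variant reports a difference at the second
byte, while the character-wise variant compares U+00C4 with U+00E4 and treats
the two strings as equal.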

Find attached a draft patch for the 'join' program, that fixes the bug
mentioned above by use of the mbmemcasecmp or ulc_casecmp functions. It
is not ready to apply, because there are three big questions:

1) Which functions to use for case comparison in coreutils?

   The difference between mbmemcasecmp and ulc_casecmp (and likewise between
   mbmemcasecoll and ulc_casecoll) is this:
   mbmemcasecmp handles only English and a few other European languages
   correctly,
     - Turkish i / I is handled halfway, but not fully,
   whereas ulc_casecmp handles all known language-specific special cases:
     - Turkish i / I is fully correct,
     - German ß is equivalent to ss,
     - Croatian and Bosnian: Characters with 3 forms, such as DZ dz Dz, are
       considered equivalent,
     - Greek final sigma (lowercase) is considered equivalent to uppercase
       sigma, (There is no difference between final and non-final sigma in the
       upper case.)
     - Lithuanian soft-dot,
     - etc.

   I think ulc_casecmp is "correct", whereas mbmemcasecmp is only "half
   correct".

   The reason is that mbmemcasecmp is based on the POSIX APIs, but these APIs
   have some assumptions built-in that are not valid in some languages:
     - It assumes that there is only uppercase and lowercase - not true for
       DZ dz Dz.
     - It assumes that uppercasing of 1 character leads to 1 character - not
       true for German ß.
     - It assumes that there is 1:1 mapping between uppercase and lowercase
       forms - not true for Greek sigma.
     - It assumes that the upper/lowercase mappings are position independent -
       not true for Greek sigma and Lithuanian i.
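
The effect of the 1:1 assumption can be shown with a small sketch that works
on arrays of Unicode code points. Everything here is illustrative: lower_1to1
mimics a towlower-style mapping, fold_full is a tiny hand-written subset of
full case folding (only ß and final sigma), and fold_casecmp is a hypothetical
helper, not the algorithm used by ulc_casecmp.

```c
#include <stddef.h>
#include <stdint.h>

/* towlower-style 1:1 mapping for ASCII and basic Greek capitals only. */
static uint32_t lower_1to1(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return cp + 0x20;
    if (cp >= 0x0391 && cp <= 0x03A9 && cp != 0x03A2)  /* U+03A2 is unassigned */
        return cp + 0x20;
    return cp;
}

/* Full folding: one input code point may produce up to two output code
   points. Returns the number of code points written to out. */
static size_t fold_full(uint32_t cp, uint32_t out[2])
{
    if (cp == 0x00DF) {            /* ß folds to "ss" */
        out[0] = 's';
        out[1] = 's';
        return 2;
    }
    if (cp == 0x03C2) {            /* final sigma folds to ordinary sigma */
        out[0] = 0x03C3;
        return 1;
    }
    out[0] = lower_1to1(cp);
    return 1;
}

/* Cursor that yields the folded form of a string one code point at a time. */
struct fold_cursor {
    const uint32_t *src;
    size_t len, pos;
    uint32_t pending[2];
    size_t npending, ipending;
    int full;                      /* nonzero: use fold_full */
};

static int next_folded(struct fold_cursor *c, uint32_t *out)
{
    if (c->ipending == c->npending) {
        if (c->pos == c->len)
            return 0;              /* input exhausted */
        if (c->full) {
            c->npending = fold_full(c->src[c->pos], c->pending);
        } else {
            c->pending[0] = lower_1to1(c->src[c->pos]);
            c->npending = 1;
        }
        c->pos++;
        c->ipending = 0;
    }
    *out = c->pending[c->ipending++];
    return 1;
}

/* Compare two code point strings after folding; returns <0, 0 or >0. */
static int fold_casecmp(const uint32_t *a, size_t alen,
                        const uint32_t *b, size_t blen, int full)
{
    struct fold_cursor ca = { a, alen, 0, { 0, 0 }, 0, 0, full };
    struct fold_cursor cb = { b, blen, 0, { 0, 0 }, 0, 0, full };
    for (;;) {
        uint32_t x, y;
        int ha = next_folded(&ca, &x);
        int hb = next_folded(&cb, &y);
        if (!ha || !hb)
            return ha - hb;
        if (x != y)
            return x < y ? -1 : 1;
    }
}
```

With full == 0, "Straße" and "STRASSE" compare as different, and so do a Greek
word ending in Σ and the same word ending in ς. With full == 1, both pairs
compare as equal. Position-dependent rules, such as the Lithuanian ones, are
not modelled here.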

2) There is a problem with the case comparison in "sort -f": POSIX specifies
   how this option should behave, using its older terminology
   ("all lowercase characters that have uppercase equivalents").

   How to deal with that?
     a) Use mbmemcasecmp for the option -f, and introduce a long option that
        works with ulc_casecmp?
     b) Use mbmemcasecmp if the environment variable POSIXLY_CORRECT is set,
        and ulc_casecmp otherwise?

3) There is also a problem with the executable size: the ulc_casecmp (and
   ulc_casecoll) functions are implemented using a couple of tables. I
   squeezed them already, while still guaranteeing O(1) time for each
   access. Most of the tables are about 10 KB each, the largest one ca. 45 KB.
   But it adds up:

            join executable              size (decimal)

       coreutils-7.1 unmodified             35436

       with mbmemcasecmp                    36473

       with ulc_casecmp                    174336

       with ulc_casecmp and mbmemcasecmp   176521
       (switched at runtime)

   When an executable grows from 35 KB to 175 KB, just for correct string
   comparisons, some people will certainly complain. Especially embedded
   developers, like the busybox guys, try to reduce total executable size.
   And that's not only about 'join', it's ultimately about every coreutils
   program that has an option to perform case-insensitive comparisons on
   user's data.
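
   As an illustration of how such tables can be squeezed while keeping O(1)
   access, here is a sketch of a two-level lookup table in which identical
   blocks are stored only once. It is an assumption for illustration, not
   necessarily the layout used for the ulc_casecmp tables: the demo covers
   only U+0000..U+03FF, reference_delta is a deliberately incomplete mapping,
   and the tables are built at runtime, whereas real tables would be
   generated ahead of time as constant data.

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_BITS 7
#define BLOCK_SIZE (1u << BLOCK_BITS)
#define TABLE_LIMIT 0x0400u                      /* demo range only */
#define NUM_BLOCKS (TABLE_LIMIT >> BLOCK_BITS)

static int16_t blocks[NUM_BLOCKS][BLOCK_SIZE];   /* level 2: unique blocks */
static uint8_t block_index[NUM_BLOCKS];          /* level 1: block numbers */
static unsigned nblocks_used;

/* Slow reference mapping (incomplete on purpose): offset to add to a code
   point to obtain its lowercase form, or 0 if it has none here. */
static int16_t reference_delta(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z')
        return 0x20;
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7)
        return 0x20;
    if (cp >= 0x0391 && cp <= 0x03A9 && cp != 0x03A2)
        return 0x20;
    return 0;
}

/* Fill both levels, reusing a stored block whenever an identical one
   already exists. */
static void build_tables(void)
{
    nblocks_used = 0;
    for (unsigned b = 0; b < NUM_BLOCKS; b++) {
        int16_t tmp[BLOCK_SIZE];
        unsigned k;
        for (unsigned i = 0; i < BLOCK_SIZE; i++)
            tmp[i] = reference_delta(b * BLOCK_SIZE + i);
        for (k = 0; k < nblocks_used; k++)
            if (memcmp(blocks[k], tmp, sizeof tmp) == 0)
                break;
        if (k == nblocks_used)
            memcpy(blocks[nblocks_used++], tmp, sizeof tmp);
        block_index[b] = (uint8_t)k;
    }
}

/* O(1) lookup: two array accesses, no searching. */
static uint32_t table_lower(uint32_t cp)
{
    if (cp >= TABLE_LIMIT)
        return cp;                               /* outside the demo range */
    int16_t delta = blocks[block_index[cp >> BLOCK_BITS]]
                          [cp & (BLOCK_SIZE - 1)];
    return cp + (uint32_t)delta;
}
```

   In this demo the eight 128-entry blocks collapse to four stored blocks,
   because all blocks without any mapping share a single all-zero block. The
   lookup cost stays constant regardless of how many blocks are shared.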

   How to deal with that?
     a) Add a configure option --disable-extra-i18n, that will refrain from
        using the ulc_casecmp function?
     b) Let coreutils build and install a shared library for these large
     c) Should these Unicode string functions be packaged externally to
        coreutils, and coreutils can link to it as an external dependency
        (like it does for libiconv, libintl, libacl, etc.)?
     d) any other idea?


Attachment: coreutils-7.1-join-i18n-fix.diff
Description: Text Data
