uniq i18n implementation

bug-coreutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

uniq i18n implementation

From:	Pádraig Brady
Subject:	uniq i18n implementation
Date:	Mon, 31 Jul 2006 10:08:00 +0100
User-agent:	Mozilla Thunderbird 1.0.8 (X11/20060502)

Jim Meyering wrote:
> Pádraig Brady <address@hidden> wrote:
> 
>>However I notice v5.2.1 at least only seems to handle ascii:
>>
>>$ LANG=ga_IE.utf8 uniq -i < Pádraig
>>Pádraig
>>PÁdraig
> 
> 
> Yes, that's still a problem.
> Would you like to work on it?
> 

You should have warned me :)

Anyway I've had a look at implementing this.

The following was done on my glibc-2.3.5-10, Fedora Core 4, laptop.

I decided to implement a completely seperate uniq (attached)
without looking at any other implementations, so
as not to be tarnished with possibly bad ideas.
After doing this I went back to see why
the existing ones are so slow ;)

As I see it coreutils' uniq currently does:

    if ignorecase
        memcasecmp() //always C locale!
    else
        if !C locale
            memcoll() //wraps strcoll
        else
            memcmp()

memcoll does 2 errno accesses per call, which shows up significantly
in profiles. Does strcoll even set errno? I gave it iso-8859-15 data
while running in a UTF-8 locale, and it just ignored the iso-8859-15
characters in the lines. Note it didn't compare them using C even, it
just ignored the invalid UTF8 chars.
Using strcoll is inefficient anyway as essentially for every input line
you're converting 2 lines into the internal wide character set for
comparison, albeit on a char by char basis.
Note, perfomance results are presented below.

I noticed coreutils doesn't shortcut the string comparisons
by checking lengths before doing memcoll if !C locale,
which is fair enough, but maybe a bit restrictive?
Can't one just check lengths when MB_CUR_MAX == 1 ?

In general can someone give a non theoretical example
of 2 different byte sequences (even of the same length),
that compare equal with strcoll() and/or transform to the same
wide character with mbstowcs() in any locale.
For example I expected the 2 UTF8 sequences 0x65 0xCC 0x81 and 0xC3 0xA9
to result in the same char (é) from mbstowcs().
In the former case you get 2 distinct characters.
Also wcscoll() doesn't handle comparing these either.
If this is not supported can't we just do the simple C compare
(when we're not transforming case etc.)

Another thing I was wondering about was how
to specify that 'a' and 'á' for example, should
be treated as equal which seems like useful functionality to have.
I.E. how to get strcoll &/or wcscoll to only compare the primary weights.
I don't think this functionality is in glibc, but
it probably is possible in ICU? I notice the unac package does this
and also msort has StripDiacritics(), ConvertEnclosed(),
ConvertStylistic(), TurkicFoldCase() and UnicodeFoldCase().

Talking about unicode case folding, POSIX' recommendation
doesn't seem sensible as described here:
http://www.gnu.org/software/grep/devel.html
I suggest we do the towupper();towlower(); method described there.

I notice that both redhat's uniq and the equivalent
openi18n version, are horrendously inefficient:
I think this is because they do char by char processing,
so that only parts of a line with invalid multibyte
sequences are excluded from the comparison (like strcoll does above).
My test version of uniq treats the whole line as "C"
if it isn't all a valid multibyte sequence, and I can't
think of any situation where this is not better.
I thought maybe they processed each char so that
embedded NULs were handled correctly. But in
fact they do not handle embedded NULs correctly,
and just output all lines containing them!

Right so on to performance testing.

Input for test:
    1 = short lines with ascii text
    2 = long lines with many invalid utf8 lines
    3 = long lines with all valid utf8 lines
    4 = ascii long lines, with all same length (85 chars),
        and 26 identical lines for every 27

Programs under test:
    a=fedora    uniq 5.2.1
    b=openi18n  uniq 5.1.1
    c=pixelbeat uniq 1.0       <-- test prog attached
    d=coreutils uniq 5.97
    e=ubuntu    uniq 5.2.1

Results:
    LANG=C
    \  1       2       3       4
    -------------------------------
    a| 0.130   0.197   0.200   0.475
    b| 0.088   0.146   0.147   0.274
    c| 0.096   0.123   0.124   0.193
    d| 0.080   0.137   0.137   0.266
    e| 0.092   0.159   0.160   0.308

    LANG=en_IE.utf8
    \  1       2       3       4
    -------------------------------
    a| 1.003   3.420   3.434  12.500
    b| 1.022   3.567   3.587  12.671
    c| 0.173   0.266   0.273   0.412
    d| 0.306   0.324   0.324   5.290
    e| 0.317   0.345   0.346   5.477

    LANG=en_IE.utf8 and ignorecase
    \  1       2       3       4
    -------------------------------
    a| 1.475   5.137   5.144  17.120
    b| 1.456   5.154   5.214  16.953
    c| 0.349   0.925   0.952   1.926
    d| 0.081   0.137   0.138   0.288
    e| 0.091   0.156   0.158   0.315

Note my test program is slightly slower with short lines,
as it uses glibc's getline rather than coretuils' getc.
glibc's getline is faster for lines >= 10 chars.

Pádraig.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <wchar.h>
#include <wctype.h>
#include <string.h>
#include <locale.h>

int ignorecase=0;
int ascii_input=1;
int invalid_input=0;

typedef struct {
    char* line;           //char data
    int len;              //# of chars
    unsigned int size;    //# of bytes
    unsigned int lastsize;//# of bytes from previous alloc
} line_t;

typedef struct {
    wchar_t* wline;        //wchar data
    size_t* nuls;          //indexes of NULs in line_t.line
    int chars;             //# of wchars
    int valid;             //valid mb data
} wline_t;

line_t last_line={NULL,0,0,0};
line_t curr_line={NULL,0,0,0};

wline_t wlast_line={NULL,0,0};
wline_t wcurr_line={NULL,0,0};

void swaplines(line_t* prev, line_t* this)
{
    line_t temp=*prev;
    *prev = *this;
    *this = temp;
}

void swapwlines(wline_t* prev, wline_t* this)
{
    wline_t temp=*prev;
    *prev = *this;
    *this = temp;
}

int linecasecmp(line_t* prev, line_t* this)
{
    if (!prev->line) {
        return -1;
    } else if (prev->len == this->len) {
        return strcasecmp(prev->line,this->line); //TODO: use memcasecmp
    } else {
        return -1;
    }
}

int linecmp(line_t* prev, line_t* this)
{
    if (!prev->line) {
        return -1;
    } else if (prev->len == this->len) {
        return memcmp(prev->line,this->line,prev->len); //TODO: use memcoll?
    } else {
        return -1;
    }
}

//wcscoll merges locale equivalents, but I can't
//find any that actually merge to be the same character?
//TODO: Also how do you get wcscoll to only look at primary weights?
int wlinecmp(wline_t* prev, wline_t* this)
{
    if (!prev->valid) {
        return -1;
    } else if (prev->chars == this->chars) {
        //return wmemcmp(prev->wline,this->wline,prev->chars);
        return wcscoll(prev->wline,this->wline);
    } else {
        return -1;
    }
}

//convert in place
wchar_t * wcsupper(wchar_t* wcs)
{
    wchar_t* WC=wcs;
    while(*WC) {
        *WC = towupper(*WC);
        WC++;
    }
    return wcs;
}

//convert in place
wchar_t * wcslower(wchar_t* wcs)
{
    wchar_t* WC=wcs;
    while(*WC) {
        *WC = towlower(*WC);
        WC++;
    }
    return wcs;
}

void wline_free(wline_t* wline)
{
    free(wline->wline);
    free(wline->nuls);
    wline->wline=NULL;
    wline->nuls=NULL;
}

void wcurr_line_find_nuls(void)
{
    int len_to_check=curr_line.len;
    int substrs=0;
    wcurr_line.nuls[0]=0;
    size_t str_len=0;
    const char* str_pos=curr_line.line;
    while ((str_len=strlen(str_pos)) < len_to_check) {
        wcurr_line.nuls[substrs++]=str_len;
        str_pos+=str_len+1;
        len_to_check-=str_len+1;
    }
    /* Note position of final NUL not recorded */
    wcurr_line.nuls[substrs]=0;
}

void wcurr_line_init(void)
{
    size_t mblen=curr_line.len;
    if (curr_line.lastsize != curr_line.size) {
        wline_free(&wcurr_line);
        wcurr_line.wline=malloc(curr_line.size * sizeof (wchar_t));
        wcurr_line.nuls=malloc(curr_line.size * sizeof (size_t));
        curr_line.lastsize = curr_line.size;
    }

    /*
     * Note mbstowcs doesn't handle embedded NULs in the string.
     * (length is max chars to store rather than len of source).
     * The GNU extension mbsnrtowcs() which takes a length doesn't
     * work either as it only specifies a max len :(
     * Therefore here we find the NULs and then
     * process each substring seperately.
     */
    wcurr_line_find_nuls();

    int i=0;
    const char* str_pos=curr_line.line;
    wchar_t* wstr_pos=wcurr_line.wline;
    wcurr_line.chars=0;
    do {
        /* Replace NULs in internal wide string as they
           preclude the use of the wide string functions.
           WEOF is guaranteed not to be in the wide char set.
           Alternatively we could remove this line and
           ensure wcurr_line.wline ends in NULs */
        if (i) *wstr_pos++=WEOF;

        /* TODO: See why mbstowcs does not canonicalise the UTF8 sequences.
           I.E. 0x65 + 0xCC 0x81 and 0xC3 0xA9 result in the same char?
           Note without this one might as well just do memcmp of the
           multibyte strings (if !ignorecase)
           NOTE: We leave mblen constant as we know there will always
           be enough space in the destination buffer. */
        size_t chars = mbstowcs (wstr_pos, str_pos, mblen+1);
        if (chars == (size_t) -1) {
            //Invalid MB string => treat whole line as C,
            invalid_input=1;
            wcurr_line.valid=0;
            return;
        }
        //TODO check whether it's valid to assume ascii if chars == mblen
        wcurr_line.chars+=chars+1;
        str_pos+=wcurr_line.nuls[i]+1;
        wstr_pos+=chars;
    //} while (i++<substrs);
    } while (wcurr_line.nuls[i++]);

    wcurr_line.valid=1;
    ascii_input=0;
}

/*
 * Note can transform once in place for wide internal string
 * as it is never required untransformed for output etc.
 */
void transform(wchar_t* wcs)
{
    if (ignorecase) {
        /* Note this handles Turkic Case folding */
        (void) wcsupper(wcs);
        (void) wcslower(wcs);
    }
    /* Other possible transformations one could do here are:
           StripDiacritics:  Ã -> A
           ConvertEnclosed:  â¶ -> A
           ConvertStylistic: ï¼¡-> A
           TurkicFoldCase:   Ä° -> i
       Note the above are transformations done in msort.
    */
}

/* compare each line on char by char basis.
   Note possibility of last_line being lowered multiple times */
int cuniq(const char* new_line)
{
    if (!last_line.line ||
        (ignorecase && linecasecmp(&last_line,&curr_line)) ||
        (!ignorecase && linecmp(&last_line,&curr_line))
       ) {
        fwrite(curr_line.line, 1, curr_line.len, stdout);
        swaplines(&last_line,&curr_line);
        return 0;
    } else {
        return 1;
    }
}

/* Note this actually redundant unless transformations happen on multibyte 
string.
    Currently this is only ignore case as far as I can see. */
int wuniq(const char* line)
{
    wcurr_line_init();
    if (!wcurr_line.valid) {
        if (wlast_line.valid) {
            fwrite(line, 1, curr_line.len, stdout);
            swapwlines(&wlast_line,&wcurr_line);
            swaplines(&last_line,&curr_line);
        } else if (!cuniq(line)) {
            swapwlines(&wlast_line,&wcurr_line);
        }
    } else {
        transform(wcurr_line.wline);
        if (wlinecmp(&wlast_line,&wcurr_line)) {
            fwrite(line, 1, curr_line.len, stdout);
            swapwlines(&wlast_line,&wcurr_line);
            swaplines(&last_line,&curr_line);
        }
    }
    return 0;
}


int main(int argc, char** argv) {

    /* This is a single threaded app, so mark as such for performance. */
    #include <stdio_ext.h>
    __fsetlocking(stdin,FSETLOCKING_BYCALLER);
    __fsetlocking(stdout,FSETLOCKING_BYCALLER);

    if (!setlocale(LC_CTYPE, "")) { //TODO: What about LC_COLLATE?
       fprintf(stderr,"Warning locale not supported by glibc, using 'C' 
locale\n");
    }

    if (argc==2 && (!strcmp(argv[1],"-i") || !strcmp(argv[1],"--ignorecase"))) {
        ignorecase=1;
    }

    int (*uniq)(const char* line) = (MB_CUR_MAX > 1) ? wuniq : cuniq;

    last_line.line=malloc(120); //must malloc one as getline mallocs only 1 
internal buffer.
    last_line.size=120;
    last_line.len=0;
    while ((curr_line.len=getline(&curr_line.line,&curr_line.size,stdin))>=0) {
        (void) uniq(curr_line.line);
    }
    free(last_line.line);
    free(curr_line.line);
    wline_free(&wcurr_line);
    wline_free(&wlast_line);

    if (MB_CUR_MAX>1) {
        if (invalid_input)
            fprintf(stderr,"Warning: Invalid multi byte string(s) received for 
LC_CTYPE=%s\n",setlocale(LC_CTYPE,NULL));
        if (ascii_input /*&& verbose*/)
            fprintf(stderr,"Warning: inefficiently processing ascii only data. 
Consider LC_CTYPE=C\n");
    }
    return EXIT_SUCCESS;
}

[Prev in Thread]

Current Thread

[Next in Thread]

uniq i18n implementation, Pádraig Brady <=

Prev by Date: Coreutils testsuite failures
Next by Date: Re: Providing support for the Win32 API.
Previous by thread: Coreutils testsuite failures
Next by thread: bug in date utillity
Index(es):
- Date
- Thread