bug-coreutils

Re: join bug


From: Bob Proulx
Subject: Re: join bug
Date: Wed, 5 Mar 2008 22:25:01 -0700
User-agent: Mutt/1.5.13 (2006-08-11)

Martin,

Martin Schmeing wrote:
> Hi Bob,
> Join works fine with my test smaller files, giving an appropriate
> output.  When both files are 1000 (short) lines long, it outputs
> maybe one or two of the joined lines, but there should be hundreds
> output. The files are sorted, and there is no error message given.
> Here are my test files:

  pcmodel.list
  pcmodel1000.list
  radmodel.list
  radmodel1000.list

This one is tricky.  At first pass it would seem that everything is in
good shape for join.  For example, the input files to join must be
sorted, and unsorted input is a common problem.  But these are
obviously sorted.  The first thing I did was to check this.

  for f in *.list; do sort -c $f; done

No errors from sort.  All of the files were sorted.  So I tried
joining the larger files.

  join pcmodel1000.list radmodel1000.list
  992 16023 239 3915 2793 43472.2226562 257.2904053
  993 16023 240 4134 2889 44867.9531250 393.2121582

Two lines.  What is in these files?  The first 15 lines of the first
file show the problem.  But it is tricky.  In fact I missed it until
this point.

     1  16021     1          834    6525
     2  16021     2         1005    6699
     3  16021     3         1296    6651
     4  16021     4         1380    6594
     5  16021     5         1188    6534
     6  16021     6         1044    6363
     7  16021     7          498    6240
     8  16021     8          357    6405
     9  16021     9          270    5886
    10  16021    10          957    5436
    11  16021    11         1122    6096
    12  16021    12         1506    5865
    13  16021    13         1407    6030
    14  16021    14         1383    5922
    15  16021    15         1533    6045

The first field is right-aligned, so it is preceded by a variable
number of spaces.  That is the root of the issue here.  By default
sort compares entire lines using the character collating sequence
specified by the LC_COLLATE locale.  Join uses the same collating
sequence but compares only the join field, ignoring blanks at the
start of the field.  Because of the variable number of leading blanks,
sort and join see a different sort order for the first field.
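
The mismatch can be reproduced on a tiny file.  This is only a sketch:
it assumes GNU sort and forces the C locale so the result does not
depend on the local LC_COLLATE setting, and the file name demo.list and
its contents are made up for illustration.

```shell
# Hypothetical demo data: right-aligned keys 8, 9, 10, 11 (not the poster's files).
tmpdir=$(mktemp -d)
printf '%5d  x\n' 8 9 10 11 > "$tmpdir/demo.list"

# Whole-line comparison: the leading spaces take part in the comparison,
# so "    9" sorts before "   10" and the file counts as sorted.
if LC_ALL=C sort -c "$tmpdir/demo.list" 2>/dev/null; then
    plain_check=sorted
else
    plain_check=unsorted
fi

# Key comparison on field 1 with leading blanks ignored, which is how
# join looks at the field: "9" sorts after "10", so the check fails.
if LC_ALL=C sort -c -k 1b,1 "$tmpdir/demo.list" 2>/dev/null; then
    key_check=sorted
else
    key_check=unsorted
fi

echo "sort -c:          $plain_check"
echo "sort -c -k 1b,1:  $key_check"
rm -r "$tmpdir"
```

The same file passes one check and fails the other, which is the
situation with the 1000-line files here.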

Just last month (Feb 19 2008) James Youngman added a new feature to
join that warns about this case.  Using this very recent join the
following diagnostic is printed.  In time this will make the problem
much easier to notice than it is with older versions of join.

  join: File 1 is not in sorted order
  join: File 2 is not in sorted order

Knowing this makes it obvious that I used the wrong sort check.  What
I should have done was use -b to skip blanks to match what join is
doing.  Or more precisely 'sort -k 1b,1'.

  for f in *.list; do sort -c -k 1b,1 $f; done
  sort: pcmodel1000.list:10: disorder:     10     16021    10          957    5436
  sort: radmodel1000.list:116: disorder:   1001     44867.9531250     393.2121582

Now the problem is much more apparent.  The file needs to be sorted in
the order that join expects.  Not numerically but lexically, using
'sort -k 1b,1'.

  sort -k 1b,1 -o pcmodel1000.list pcmodel1000.list
  sort -k 1b,1 -o radmodel1000.list radmodel1000.list

  head -n10 pcmodel1000.list
     1  16021     1          834    6525
    10  16021    10          957    5436
   100  16021   100         1764     714
  1000  16023   247         4833    3609
   101  16021   101         1932     588
   102  16021   102         2058     501
   103  16021   103         2418     399
   104  16021   104         2256     447
   105  16021   105         1644     849

Looks better for join even if it looks worse for humans.  That is the
ordering that is needed for character sorting.

  join pcmodel1000.list radmodel1000.list | wc -l
  115

That looks a little more reasonable.
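
The whole fix can be sketched end to end on made-up data.  The file
names a.list and b.list and their contents are hypothetical, and
LC_ALL=C is an assumption added so that sort and join are guaranteed to
use the same collating sequence.

```shell
# Hypothetical inputs with right-aligned keys 8, 9, 10, 11 in both files.
tmpdir=$(mktemp -d)
printf '%5d  left%d\n'  8 8 9 9 10 10 11 11 > "$tmpdir/a.list"
printf '%5d  right%d\n' 8 8 9 9 10 10 11 11 > "$tmpdir/b.list"

# Re-sort each file in place on field 1, ignoring leading blanks,
# so the order matches what join expects.
LC_ALL=C sort -k 1b,1 -o "$tmpdir/a.list" "$tmpdir/a.list"
LC_ALL=C sort -k 1b,1 -o "$tmpdir/b.list" "$tmpdir/b.list"

# Join on the first field (the default) and count the joined lines.
joined=$(LC_ALL=C join "$tmpdir/a.list" "$tmpdir/b.list")
joined_count=$(printf '%s\n' "$joined" | grep -c .)

echo "$joined"
echo "joined lines: $joined_count"
rm -r "$tmpdir"
```

All four keys pair up, and the output comes out in lexical key order
(10, 11, 8, 9) rather than numeric order.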

Hope that explanation helped.
Bob



