UNIX join command bug

From: Guillaume Smits
Date: Thu, 21 Aug 2008 16:45:12 +0100

Dear GNU,

I have two files with exactly the same format:

Six tab-separated fields per line, each line terminated by a newline
(\n), sorted numerically on the key identifier (field #2).

Here is the head of each file (file 1 first, then file 2):


CHR     SNP     A1      A2      MAF     NCHROBS
13      rs4     G       A       0.0648148       216
7       rs8     T       C       0.166667        216
7       rs16    T       C       0.475962        208


CHR     SNP     A1      A2      MAF     NCHROBS
7       rs8     A       G       0.215674        9876
7       rs16    G       A       0.477102        9870
7       rs19    G       A       0.385628        9880

The first file is ~ 1,400,000 lines long

The second file is ~ 330,000 lines long

There should be ~322,000 lines in common (i.e., with the same SNP
identifier - field #2).

When I perform a very simple join command as follows:

join -1 2 -2 2 file1.txt file2.txt > joinedfile.txt

I obtain a joined file of ~213,000 lines instead of the expected
~322,000 lines (about 66% of them).
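
For reference, here is how the same join could be tried on a toy copy of
the rows shown above. This is only a sketch, assuming a POSIX shell and
GNU coreutils; the scratch directory and the toy*.txt names are made up
for illustration. The join documentation asks for both inputs to be
sorted on the join field in the order that sort produces, so the sketch
sorts on field 2 as text in the C locale before joining; whether that
has any bearing on the lines missing from the real files is not
established here.

```shell
#!/bin/sh
# Toy run on rows copied from the file heads above (header lines left out).
# The scratch directory and the toy*.txt names are hypothetical.
export LC_ALL=C                 # byte-wise collation for both sort and join
tab=$(printf '\t')
work=$(mktemp -d)

printf '13\trs4\tG\tA\t0.0648148\t216\n7\trs8\tT\tC\t0.166667\t216\n7\trs16\tT\tC\t0.475962\t208\n' > "$work/toy1.txt"
printf '7\trs8\tA\tG\t0.215674\t9876\n7\trs16\tG\tA\t0.477102\t9870\n7\trs19\tG\tA\t0.385628\t9880\n' > "$work/toy2.txt"

# Sort each file on field 2 only, as text (not numerically).
sort -t "$tab" -k 2,2 "$work/toy1.txt" > "$work/toy1.sorted"
sort -t "$tab" -k 2,2 "$work/toy2.txt" > "$work/toy2.sorted"

# Same join as above, plus -t so that tabs are kept in the output.
join -t "$tab" -1 2 -2 2 "$work/toy1.sorted" "$work/toy2.sorted" > "$work/toy_joined.txt"

joined_count=$(wc -l < "$work/toy_joined.txt" | tr -d ' ')
echo "joined lines: $joined_count"
```

On these six rows the joined file has two lines (rs16 and rs8), which
are the two identifiers present in both toy files.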

The missing lines are scattered throughout the original files (at the
beginning, middle and end). I also cannot find any pattern in the SNP
identifiers of the missing lines.

For example, the following line is missing:

File 1

11      rs1535  G       A       0.348624        218

File 2

11      rs1535  G       A       0.440218        9886

As one can see, the key field is identical in both files (rs1535), so
this line should be printed in the output.

I can't find any difference between the files (e.g., no hidden
characters) or between the key identifiers. The files are sorted in the
same way and tab-delimited in the same way.
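
A minimal sketch of how both claims could be checked from the shell,
assuming a POSIX shell with the usual grep and sort. The toy file and
the variable names are hypothetical; its three rows use the keys rs4,
rs8 and rs16 in numeric order, as in the head of file 1.

```shell
#!/bin/sh
# Two sanity checks on a toy file; the path and variable names are hypothetical.
export LC_ALL=C
tab=$(printf '\t')
cr=$(printf '\r')
check=$(mktemp -d)

# Keys in numeric order: rs4, rs8, rs16.
printf '13\trs4\tG\tA\n7\trs8\tT\tC\n7\trs16\tT\tC\n' > "$check/toy.txt"

# 1. Count lines containing a carriage return (e.g. DOS line endings).
#    grep -c exits non-zero when the count is 0, hence the "|| true".
cr_lines=$(grep -c "$cr" "$check/toy.txt" || true)

# 2. Ask sort whether field 2 is already in textual (not numeric) order.
if sort -c -t "$tab" -k 2,2 "$check/toy.txt" 2>/dev/null; then
    order_ok=yes
else
    order_ok=no
fi

echo "lines with CR: $cr_lines; field 2 in textual order: $order_ok"
```

For the toy rows, no line contains a carriage return, and sort -c
reports that field 2 is not in textual order, since rs16 compares lower
than rs8 character by character.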

The only difference is the number of lines (~1.4 million in file 1;
~330,000 in file 2). While large, these line counts should not be a
limiting factor for the join command (and why would the missing lines
be scattered throughout the files?).

Using a Perl script to print the lines that have the same field 2
identifier, I obtain the expected ~322,000 lines, which suggests that
this is almost certainly a bug in the join command.
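
The Perl script itself is not reproduced here. A comparable cross-check
that does not depend on sort order could be sketched with awk as
follows. The toy files and variable names are hypothetical, and on the
real files the header line would also be counted as a match, because
both headers carry SNP in field 2.

```shell
#!/bin/sh
# Order-independent count of file 2 lines whose field 2 also occurs in file 1.
# The toy files below are hypothetical and hold rows from the heads above.
tmp=$(mktemp -d)

printf '13\trs4\tG\tA\t0.0648148\t216\n7\trs8\tT\tC\t0.166667\t216\n7\trs16\tT\tC\t0.475962\t208\n' > "$tmp/toy1.txt"
printf '7\trs8\tA\tG\t0.215674\t9876\n7\trs16\tG\tA\t0.477102\t9870\n7\trs19\tG\tA\t0.385628\t9880\n' > "$tmp/toy2.txt"

# First pass (NR == FNR): remember every key of file 1.
# Second pass: count the lines of file 2 whose key was remembered.
common=$(awk -F '\t' '
    NR == FNR    { seen[$2] = 1; next }
    ($2 in seen) { n++ }
    END          { print n + 0 }
' "$tmp/toy1.txt" "$tmp/toy2.txt")

echo "lines in common: $common"
```

Because the keys are held in an array, the order of the lines in either
file does not affect the count; for the toy rows it is 2 (rs8 and rs16).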

Question: Is there any trivial (or less trivial) explanation for this
behaviour of the join command?

Thanks for your help,


