RE: GNU join command bug


From: Guillaume Smits
Subject: RE: GNU join command bug
Date: Wed, 27 Aug 2008 18:23:04 +0100

Dear James,

There is no need for strong language like that in the answer I received
below. Some of us are occasional users (thanks, Bob, for the apologetic
email).

Furthermore, you had all the information needed to solve the issue; see
below.



Clue:


I found a message on the GNU mailing list from another user, Kevin, who
encountered exactly the same problem as me (April 2008) and received the
following answer from Bob:


kevin wrote:
> I want to use join command with this 2 files :

> test1:
> 1 a
> 2 a
> 3 a
> 45 a
> 78 a
> 152 a
> 1896 a

The input files to join must be sorted.  The above is not.
Please see this reference for more information.
http://www.gnu.org/software/coreutils/faq/coreutils-faq.html#join-requires-sorted-input-files

Bob




So why are those data considered not sorted? (The link given is of no
help in understanding why the data are not sorted.)


Answer:

Like mine, these data were NUMERICALLY sorted (sort -n).


But, as I found experimentally while trying to solve this issue, the join
command needs the files to be sorted alphanumerically on the join field
(= the default sort order), and definitely not numerically.
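
To illustrate the difference, here is a minimal sketch using the keys from kevin's test1 file. The C locale (LC_ALL=C) and the scratch directory created with mktemp are my own assumptions for reproducibility; neither appears in the original messages.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Keys copied from kevin's test1 example.
printf '%s\n' '1 a' '2 a' '3 a' '45 a' '78 a' '152 a' '1896 a' > test1

# Numeric order, as produced by sort -n.
numeric_order=$(LC_ALL=C sort -n test1 | cut -d' ' -f1 | paste -s -d ' ' -)

# Default (character-by-character) order on field 1, which is what join expects.
lexical_order=$(LC_ALL=C sort -k1,1 test1 | cut -d' ' -f1 | paste -s -d ' ' -)

echo "sort -n : $numeric_order"
echo "sort    : $lexical_order"
```

In the default order "152" comes before "2" because the keys are compared as text, character by character, not as numbers.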



Hence:

1. Because the sort command is so versatile, GNU could be more precise
in its answers.

2. Suggestion: add the line 'Default (alpha-numerical) sort required
(avoid sort -n)' to the join command manual and --help output to help
future users.
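
In practice, the fix for my case can be sketched as follows, using only the rows from the heads of File1 and File2 quoted below (header lines dropped). The explicit tab separator (-t), the LC_ALL=C setting, the scratch directory and the .sorted file names are assumptions of this sketch, not part of my original command.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

TAB=$(printf '\t')

# Rows taken from the quoted heads of File1 and File2, without the header line.
printf '13\trs4\tG\tA\t0.0648148\t216\n7\trs8\tT\tC\t0.166667\t216\n7\trs16\tT\tC\t0.475962\t208\n' > file1.txt
printf '7\trs8\tA\tG\t0.215674\t9876\n7\trs16\tG\tA\t0.477102\t9870\n7\trs19\tG\tA\t0.385628\t9880\n' > file2.txt

# Sort both files on field 2 in plain text order (no -n), in the same locale join will use.
LC_ALL=C sort -t "$TAB" -k2,2 file1.txt > file1.sorted
LC_ALL=C sort -t "$TAB" -k2,2 file2.txt > file2.sorted

# Join on field 2 of each file; the join field is printed first on each output line.
LC_ALL=C join -t "$TAB" -1 2 -2 2 file1.sorted file2.sorted > joinedfile.txt
cat joinedfile.txt
```

With these six rows, the SNPs common to both files (rs8 and rs16) should both appear in joinedfile.txt.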


Sincerely yours,

Guillaume




-----Original Message-----
From: address@hidden [mailto:address@hidden On Behalf Of James Youngman
Sent: 21 August 2008 21:57
To: Guillaume 
Cc: address@hidden
Subject: Re: UNIX join command bug


> Dear GNU,
>
>
> I have two files exactly identical composed of:
>
> 6 Fields, tab separated, with a /n

That would be \n - I assume you mean ASCII LF.

> at the end of the line, sorted
> numerically on the key identifier (field #2).
>
>
> Here is the head of the files:
>
>
> File1
>
> CHR     SNP     A1      A2      MAF     NCHROBS
> 13      rs4     G       A       0.0648148       216
> 7       rs8     T       C       0.166667        216
> 7       rs16    T       C       0.475962        208
> ...
>
>
> File2
>
> CHR     SNP     A1      A2      MAF     NCHROBS
> 7       rs8     A       G       0.215674        9876
> 7       rs16    G       A       0.477102        9870
> 7       rs19    G       A       0.385628        9880
> ...
>
>
>
> The first file is ~ 1,400,000 lines long
>
> The second file is ~ 330,000 lines long

You're not making it easy for people to help you.    You don't
indicate what version of coreutils you are using.    You don't provide
a minimal example.   You just tell us you have two vast inputs you
won't show us that don't join in the way you expect.



> When I perform a very simple join command as follows:
>
> join -1 2 -2 2 file1.txt file2.txt > joinedfile.txt
>
>
> I obtain a joinedfile of ~213,000 lines instead of the expected
> ~322,000 lines (65% of the lines).
>
> The lines missing are scattered everywhere in the original files (at the
> beginning, middle or end). I also cannot find any pattern in the SNP
> identifiers of the missing lines.
>
>
>
> For example a line which is missing is the following one:

This is not a helpful example; 99% of join problems are caused by
out-of-order input and you haven't provided a complete example that
demonstrates the problem so that we can eliminate that possibility.


> I can't find any difference between the files (e.g., no hidden
> characters) or the key identifiers. The files are sorted in the same
> way, tabulated in the same way,...

My guess is that this is not actually the case.

> The only difference is the number of lines (1.4 million in file 1; 300
> thousand in file 2). While big, these line numbers should not be a
> limiting factor to the join command... (and why would the missing
> lines be scattered all along the files?)
>
>
> Using a Perl script to print lines having the same field 2 identifier, I
> obtain the ~322,000 lines expected, proving that it is almost certainly
> a join command bug.
>
>
>
> Question: Is there any trivial (or less trivial) explanation to this
> join command bug?

Operator error?      Try coreutils 6.11, which should notify you if
the input is out of order - see the Info documentation for details.
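
Independently of the join version, one way to test a file beforehand is sort -c, which checks the order without sorting. The sketch below is only an illustration: it reuses the keys from kevin's test1 example and assumes the C locale and a scratch directory, none of which come from the original report.

```shell
# Work in a scratch directory so nothing existing is overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Numerically sorted keys, as in kevin's test1 example.
printf '%s\n' '1 a' '2 a' '3 a' '45 a' '78 a' '152 a' '1896 a' > test1

# sort -c exits 0 when the file is already in the requested order;
# otherwise it reports the first out-of-order line on stderr and exits non-zero.
if LC_ALL=C sort -c -k1,1 test1 2>/dev/null; then
    status="sorted for join"
else
    status="NOT sorted for join"
fi
echo "test1: $status"
```

The check fails here because "152" follows "78", which is out of order when the keys are compared as text.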

James.


--
 The Wellcome Trust Sanger Institute is operated by Genome Research
 Limited, a charity registered in England with number 1021457 and a
 company registered in England with number 2742969, whose registered
 office is 215 Euston Road, London, NW1 2BE.



