bug#26029: Problems with join
From: Reuti
Subject: bug#26029: Problems with join
Date: Thu, 9 Mar 2017 19:24:40 +0100
Hi,
> On 09.03.2017 at 18:20, Assaf Gordon <address@hidden> wrote:
>
>> […]
>> Aha, I didn't check this. Then the "-j" option should be moved to a new
>> section "Deprecated" in the man/info page of the coreutils version too. (And
>> mention the special handling of -j1 and -j2, while -j3 … works as one
>> expects.)
>
> I would humbly suggest other wording: I'm not sure '-j' is deprecated.
> It is useful, and does work as expected in most cases.
It's only mentioned in the addendum here:
http://pubs.opengroup.org/onlinepubs/9699919799//utilities/join.html
"Earlier versions of this standard allowed -j, -j1, -j2 options, and a
form of the -o option that allowed the list option-argument to be multiple
arguments. These forms are no longer specified by POSIX.1-2008 but may be
present in some implementations.
…
The obsolescent -j options and the multi-argument -o option are removed in this
version."
Therefore I still favor moving "-j" to the end of the man page, into a separate
section, also taking:
Q15: http://www.opengroup.org/austin/papers/posix_faq.html
into account.
>
> But, it should be better documented to warn against this edge-case.
>
> Reuti wrote:
>> -j FIELD equivalent to '-1 FIELD -2 FIELD'
>> does not work in all cases essentially.
>
> It 'just works' in most cases, but indeed we should improve the documentation
> about edge cases.
>
> First,
> this is the relevant section that handles the '-j' parameter:
> https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/join.c#n1079
Yep, this I checked in the source too.
>
> Second,
> Let's ensure '-jN' works in the common cases,
> when it is *not* followed by a number:
>
> Two input files:
>
> $ cat a.txt
> 1 2 3 aaa
> 2 3 4 bbb
>
> $ cat b.txt
> 1 2 3 XXX
> 2 3 4 YYY
>
> '-j1' alone is equivalent to '-1 1 -2 1':
>
> $ join -1 1 -2 1 a.txt b.txt
> 1 2 3 aaa 2 3 XXX
> 2 3 4 bbb 3 4 YYY
>
> $ join -j1 a.txt b.txt
> 1 2 3 aaa 2 3 XXX
> 2 3 4 bbb 3 4 YYY
>
> '-j2' alone is equivalent to '-1 2 -2 2':
>
> $ join -1 2 -2 2 a.txt b.txt
> 2 1 3 aaa 1 3 XXX
> 3 2 4 bbb 2 4 YYY
>
> $ join -j2 a.txt b.txt
> 2 1 3 aaa 1 3 XXX
> 3 2 4 bbb 2 4 YYY
>
> '-j3' alone is equivalent to '-1 3 -2 3':
>
> $ join -1 3 -2 3 a.txt b.txt
> 3 1 2 aaa 1 2 XXX
> 4 2 3 bbb 2 3 YYY
>
> $ join -j3 a.txt b.txt
> 3 1 2 aaa 1 2 XXX
> 4 2 3 bbb 2 3 YYY
>
> So, in the most common cases, '-jN' works for all Ns
> (for "all" being 1,2,3 but really, who needs more than 3 numbers? :) ).
> This is perhaps not like BSD's join.
>
>
> Now comes the tricky part:
> If the '-j1' or '-j2' is followed by another parameter,
> and that parameter turns out *not* to be a valid field number,
> it is treated like '-j 1' or '-j 2' (that is, '-1 1 -2 1' or '-1 2 -2 2'),
> and join just "does the right thing":
>
> $ join -j2 -i a.txt b.txt
> 2 1 3 aaa 1 3 XXX
> 3 2 4 bbb 2 4 YYY
>
> This is implemented here:
> https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/join.c#n1171
Aha, I didn't spot this. That's really tricky. I only observed that the error
message complaining about the remaining arguments changes depending on whether
an additional field number is added or removed. And in case the filename is
just a number it gets even more convoluted, as the overall number of arguments
then comes into play too.
$ join -j1 1 2
generates no error, although -j1 got a 1: join infers that it must be the
name of a file, as otherwise one argument would be missing on the command line,
AFAICS.
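To make this case reproducible, here is a minimal sketch. It assumes GNU
coreutils join (other implementations may parse -j1 differently) and uses a
scratch directory holding two files that are literally named 1 and 2, with the
same contents as a.txt and b.txt quoted below. The variable names are only for
illustration.

```shell
#!/bin/sh
# Sketch only: assumes GNU coreutils join is the 'join' found in PATH.
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp"

# Two input files whose names happen to be the numbers 1 and 2.
printf '1 2 3 aaa\n2 3 4 bbb\n' > 1
printf '1 2 3 XXX\n2 3 4 YYY\n' > 2

# Only two operands follow -j1, so both are taken as file names
# and -j1 itself acts like '-1 1 -2 1'.
two_operands=$(join -j1 1 2)

# Spelling the key fields with -1/-2 and the files with a ./ prefix
# leaves nothing for join to guess.
explicit=$(join -1 1 -2 1 ./1 ./2)

printf '%s\n' "$two_operands"
```

Both invocations should give the same two joined lines, which is why no error
shows up in this case.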
> And the result is that most of the time, join "just works" (IMHO, but
> other opinions welcomed).
>
>
> If the '-j1' or '-j2' is followed by a number, this is where the unexpected
> behaviour occurs, as it sets the key field for that file alone. E.g. '-j1 2'
> is equivalent to '-1 2' (and the key for the second
> file is not set, thus defaults to 1):
>
> $ join -j1 2 a.txt b.txt
> 2 1 3 aaa 3 4 YYY
>
> $ join -1 2 a.txt b.txt
> 2 1 3 aaa 3 4 YYY
>
>
> Is the above a satisfactory explanation?
Yes, absolutely.
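For the record, a small sketch that puts the edge case next to the unambiguous
spelling. It assumes GNU coreutils join and recreates the a.txt and b.txt
quoted above in a scratch directory; the variable names are only for
illustration.

```shell
#!/bin/sh
# Sketch only: assumes GNU coreutils join is the 'join' found in PATH.
set -eu
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
cd "$tmp"

printf '1 2 3 aaa\n2 3 4 bbb\n' > a.txt
printf '1 2 3 XXX\n2 3 4 YYY\n' > b.txt

# Obsolete form: '-j1 2' sets the key of the first file only,
# so the second file keeps its default key (field 1).
obsolete=$(join -j1 2 a.txt b.txt)
first_only=$(join -1 2 a.txt b.txt)

# Unambiguous form: set the key field of both files explicitly.
both=$(join -1 2 -2 2 a.txt b.txt)

printf 'join -j1 2:\n%s\n' "$obsolete"
printf 'join -1 2 -2 2:\n%s\n' "$both"
```

Writing '-1 FIELD -2 FIELD' in scripts sidesteps the special parsing of -j1
and -j2 altogether.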
> If so, it'll be more-or-less what I'll add to the manual.
>
> I see that this has been implemented back in 2005, here:
> https://git.savannah.gnu.org/cgit/coreutils.git/commit/src/join.c?id=f9118c1c2e35b
> with the comment:
> "Parse obsolete options -j1 and -j2
> so that it is a pure extension to POSIX 1003.1-2001."
>
> I can perhaps guesstimate that since this usage is never
> mentioned anywhere, it is considered undocumented and discouraged usage
> (and indeed, I don't think I've ever encountered it, or previously
> saw a bug-report or question about it - so it's rather rare).
>
> We could add a warning to the man page - what do others think?
+1
-- Reuti