Re: [coreutils] join feature: auto-format
From: Pádraig Brady
Subject: Re: [coreutils] join feature: auto-format
Date: Thu, 07 Oct 2010 11:22:13 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3
On 07/10/10 01:03, Pádraig Brady wrote:
> On 06/10/10 21:41, Assaf Gordon wrote:
>> Hello,
>>
>> I'd like to (re)suggest a feature for the join program: the ability to
>> automatically build an output format line (similar to, but easier than,
>> using "-o").
>>
>> I've previously mentioned it here (but got no favorable responses):
>> http://lists.gnu.org/archive/html/bug-coreutils/2009-11/msg00151.html
>>
>> Several people have been using this option for a year now (on our local
>> servers), so I thought I might try to suggest it again.
>>
>> The full patch is attached, and also available here:
>> http://cancan.cshl.edu/labmembers/gordon/files/join_auto_format_2010_10_06.patch
>>
>> Here's the common use case:
>>
>> Given two tabular files with a common key in the first column and many
>> numeric (or other) values in the other columns, the user wants to join
>> them together easily.
>> One requirement is that empty/missing values should be populated with "00".
>>
>> File 1
>> ======
>> bar 10 13 15 16 11 32
>> foo 10 10 11 12 13 14
>>
>>
>> File 2
>> ======
>> bar 99 91 90 93 91 93
>> baz 90 91 99 96 97 95
>>
>>
>> Desired joined output
>> ==============
>> bar 10 13 15 16 11 32 99 91 90 93 91 93
>> baz 00 00 00 00 00 00 90 91 99 96 97 95
>> foo 10 10 11 12 13 14 00 00 00 00 00 00
>>
>> There is no technical problem in achieving this; the parameters would be:
>> "-a1 -a2 -e 00 -o 0,1.2,1.3,1.4,1.5,1.6,1.7,2.2,2.3,2.4,2.5,2.6,2.7"
>>
>> But building the "-o" parameter is cumbersome and error-prone (imagine
>> files with dozens of columns, which is very common in my case).
>>
>> The "--auto-format" feature simply builds the "-o" format line
>> automatically, based on the number of columns from both input files.
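
For illustration only, here is a rough sketch of how such a format line could be derived from the two column counts. This is not the code from the attached patch; the function name build_join_format is hypothetical, and it assumes the join key is field 1 in both files (join's default).

```python
def build_join_format(fields_file1, fields_file2):
    """Build a join -o FORMAT string listing every non-key field.

    Hypothetical helper: assumes the join key is field 1 in both files.
    """
    spec = ["0"]  # "0" is join's notation for the join field itself
    # Non-key fields of file 1, then of file 2, in FILENUM.FIELD notation.
    spec += [f"1.{i}" for i in range(2, fields_file1 + 1)]
    spec += [f"2.{i}" for i in range(2, fields_file2 + 1)]
    return ",".join(spec)


# Both example files have 7 columns (key plus 6 values).
print(build_join_format(7, 7))
```

For the two 7-column files in the example, this reproduces the "-o" list quoted above.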
>
> Thanks for persisting with this and presenting a concise example.
> I agree that this is useful and can't think of a simple workaround.
> Perhaps the interface would be better as:
>
> -o {all (default), padded, FORMAT}
>
> where padded is the functionality you're suggesting?
Thinking more about it, we mightn't need any new options at all.
Currently -e is redundant if -o is not specified.
So how about changing that so that if -e is specified
we operate as above by auto inserting empty fields?
Also, I wouldn't base the padding on the number of fields in the first line;
instead, I'd auto-pad to the biggest number of fields
on the current lines under consideration.
cheers,
Pádraig.