bug-coreutils

join suggestion: auto-output-format


From: Assaf Gordon
Subject: join suggestion: auto-output-format
Date: Wed, 04 Nov 2009 20:36:43 -0500
User-agent: Mozilla-Thunderbird 2.0.0.22 (X11/20090707)

Hello,

I'd like to suggest another small feature for join (not related to the 
'--header' feature I previously sent).

This feature lets join guess the output format automatically when '-o' is not given, which makes "-e" easier to use (IMHO). It is mostly a convenience, DWIM kind of feature.
Here is a simple use case:

$ cat 1.txt
1 alice
2 bob
4 dave

$ cat 2.txt
1 red
2 green
3 blue

Joining with "-a 1 -a 2" will display the third and fourth items without proper 
field 'fillers':

$ join -j1 -a1 -a2   1.txt 2.txt
1 alice red
2 bob   green
3 blue
4 dave

This behavior is of course by design.
Filling the empty columns requires both "-e" and "-o", and to use "-o" properly 
one needs to know the columns of the input files beforehand:

$ join -j1 -a1 -a2 -e FOO -o 0,1.2,2.2   1.txt   2.txt
1 alice red
2 bob   green
3 FOO   blue
4 dave  FOO

If the input files have many columns, writing the proper "-o" format 
string is cumbersome.

I suggest a simple feature:
When the "--auto-format" argument is given, join automatically generates an output 
format (simulating "-o"): the join field first, followed by the remaining fields of 
file1, followed by the remaining fields of file2.
(This feature assumes that the number of columns in the first line of each file 
represents the number of columns in all of its lines.)
This allows using "-e" without specifying "-o", like so:

$ join -j1 -a1 -a2 -e FOO --auto-format   1.txt   2.txt
1 alice red
2 bob   green
3 FOO   blue
4 dave  FOO


Attached is a first draft of this feature (also available here: 
http://cancan.cshl.edu/labmembers/gordon/coreutils8/join_auto_format.patch ).
Comments are welcome.
Please tell me if you're willing to consider adding this feature to coreutils.

Thanks,
 gordon



src/join.c |   36 +++++++++++++++++++++++++++++++++++-
1 files changed, 35 insertions(+), 1 deletions(-)

diff --git a/src/join.c b/src/join.c
index d734a91..71219f9 100644
--- a/src/join.c
+++ b/src/join.c
@@ -146,6 +146,7 @@ static struct option const longopts[] =
  {"ignore-case", no_argument, NULL, 'i'},
  {"check-order", no_argument, NULL, CHECK_ORDER_OPTION},
  {"nocheck-order", no_argument, NULL, NOCHECK_ORDER_OPTION},
+  {"auto-format", no_argument, NULL, 'F'},
  {GETOPT_HELP_OPTION_DECL},
  {GETOPT_VERSION_OPTION_DECL},
  {NULL, 0, NULL, 0}
@@ -157,6 +158,12 @@ static struct line uni_blank;
/* If nonzero, ignore case when comparing join fields.  */
static bool ignore_case;

+/* if nonzero, automatically build a specific output field list,
+   based on the first line of each input file */
+static bool auto_output_format;
+
+static void build_output_format(const struct line const *line1, const struct line const* line2);
+
void
usage (int status)
{
@@ -191,6 +198,8 @@ by whitespace.  When FILE1 or FILE2 (not both) is -, read standard input.\n\
  --check-order     check that the input is correctly sorted, even\n\
                      if all input lines are pairable\n\
  --nocheck-order   do not check that the input is correctly sorted\n\
+  -F, --auto-format  Automatically build output format, based on the first\n\
+                    line of each input file. Allows '-e' without using '-o'.\n\
"), stdout);
      fputs (HELP_OPTION_DESCRIPTION, stdout);
      fputs (VERSION_OPTION_DESCRIPTION, stdout);
@@ -616,6 +625,9 @@ join (FILE *fp1, FILE *fp2)
  initseq (&seq2);
  getseq (fp2, &seq2, 2);

+  if (auto_output_format && seq1.count && seq2.count)
+    build_output_format(seq1.lines[0],seq2.lines[0]);
+
  while (seq1.count && seq2.count)
    {
      size_t i;
@@ -926,6 +938,24 @@ add_file_name (char *name, char *names[2],
    *optc_status = MIGHT_BE_O_ARG;
}

+static void
+build_output_format(const struct line const *line1, const struct line const* line2)
+{
+  int i ;
+  if (outlist_head.next)
+    return;
+
+  add_field(0,0);
+  for (i = 0; i < join_field_1 && i < line1->nfields; ++i)
+    add_field(1,i);
+  for (i = join_field_1 + 1; i < line1->nfields; ++i)
+    add_field(1,i);
+  for (i = 0; i < join_field_2 && i < line2->nfields; ++i)
+    add_field(2,i);
+  for (i = join_field_2 + 1; i < line2->nfields; ++i)
+    add_field(2,i);
+}
+
int
main (int argc, char **argv)
{
@@ -954,7 +984,7 @@ main (int argc, char **argv)
  issued_disorder_warning[0] = issued_disorder_warning[1] = false;
  check_input_order = CHECK_ORDER_DEFAULT;

-  while ((optc = getopt_long (argc, argv, "-a:e:i1:2:j:o:t:v:",
+  while ((optc = getopt_long (argc, argv, "-a:e:i1:2:j:o:t:v:F",
                              longopts, NULL))
         != -1)
    {
@@ -1052,6 +1082,10 @@ main (int argc, char **argv)
                         &nfiles, &prev_optc_status, &optc_status);
          break;

+        case 'F':
+          auto_output_format = true;
+          break;
+
        case_GETOPT_HELP_CHAR;

        case_GETOPT_VERSION_CHAR (PROGRAM_NAME, AUTHORS);





