Re: [PATCH]: uniq: add "--group" option

From: Pádraig Brady
Subject: Re: [PATCH]: uniq: add "--group" option
Date: Thu, 21 Feb 2013 16:11:26 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120615 Thunderbird/13.0.1

On 02/21/2013 03:42 PM, Assaf Gordon wrote:
Hello Pádraig,

Pádraig Brady wrote, On 02/20/2013 08:47 PM:
On 02/20/2013 06:44 PM, Assaf Gordon wrote:

Attached is a suggestion for "--group" option in uniq, as discussed here:

The patch adds two parameters:
        --group=[method]  separate each unique line (whether duplicated or not)
                          with a marker.
        --group-separator=SEP   with --group, separate groups using SEP
                          (default: empty line)

--group-sep is probably overkill.
I'd just use \n or \0 if -z specified.


As for separation methods I'd just go with what we have for
--all-repeated (but remove 'none' which wouldn't be useful with --group),
as we've never had requests for anything else. So:
--group={prepend, separate(default)}

I'd like to have at least "append" or "both", for the added convenience of 
downstream analysis.
It's obviously a "nice-to-have" and not "must-have" feature, and can be 
implemented in other ways, but knowing that there will always be a terminating marker *after* a 
group (even the last group) makes downstream processing code simpler.

Typical example:
  $ cat INPUT | uniq --group=append | \
       awk '$0!="" { }   # item in the group, collect it
            $0=="" { }   # end of group, do something
           '

Without the final group marker, any downstream code will require two points of 
"group processing": when a marker is found, and at EOF.
Something like:

  $ cat INPUT | uniq --group=separate | \
       awk '$0!="" { }   # item in the group, collect it
            $0=="" { }   # end of group, do something
            END    { }   # end of last group, do something, duplicated code
           '
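A minimal sketch of that difference, assuming the group marker is an empty line. The `appended` and `separated` functions are hypothetical stand-ins that print what the proposed `--group=append` and `--group=separate` modes would be expected to emit for the same input, and `summarize` is an illustrative consumer that acts only when it sees a marker:

```shell
# Stand-in for the proposed `uniq --group=append` output (assumption):
# every group, including the last one, is followed by an empty line.
appended() { printf 'a\na\na\n\nb\n\nc\nc\n\n'; }

# Stand-in for `--group=separate` (assumption): empty lines appear
# only between groups, so the last group has no terminating marker.
separated() { printf 'a\na\na\n\nb\n\nc\nc\n'; }

# Consumer with a single point of group processing: the marker line.
summarize() {
  awk '$0 != "" { n++; last = $0 }         # item in the group: count it
       $0 == "" { print last, n; n = 0 }   # end of group: report and reset
      '
}

appended  | summarize   # reports all three groups
separated | summarize   # the final group is never reported
```

With the `separated` input, the consumer would need an extra `END` block repeating the reporting code in order to see the last group.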

Similar reason for having "both": it ensures I can put any special 
initialization code in the group-marker case, and don't need to duplicate it in a 
separate 'BEGIN{}' clause (of course, this doesn't have to be awk; it can be 
perl/python/ruby/whatever does the downstream processing).

I realize it's not a "make-or-break" feature - but if we're trying to make text 
processing easier, I believe "append/both" makes it even easier.

OK good arguments. Thanks.
Let's keep all apart from 'none' so.

So on to operation...

And it behaves "as expected":
$ printf "a\na\na\nb\nc\nc\n" | ./src/uniq --group-sep="--" --group=separate

The above isn't that useful and could be done with sed.

I assume you're specifically referring to the "group-sep" part - then OK.

Actually I was referring to the fact that in your example
--group didn't output all entries by default.
If it only output unique entries then you can separate with:

uniq | sed 'G'     # (note sed also supports -z)
uniq | sed '$q;G'

So `uniq --group` should output all items by default I think.
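As an illustration of the sed equivalence, here is a small sketch using only plain `uniq` and `sed` on made-up sample input. The function names are hypothetical; the sed scripts are the two quoted above:

```shell
# Sample input with runs of duplicates (illustrative data only).
sample() { printf 'a\na\nb\nc\nc\n'; }

# `sed G` appends an empty line after every line, so each unique
# entry is followed by a marker, including the last one.
marker_after_each() { sample | uniq | sed 'G'; }

# `sed '$q;G'` quits at the last line before G runs, so empty lines
# appear only between entries.
marker_between() { sample | uniq | sed '$q;G'; }

marker_after_each | wc -l   # 3 entries + 3 empty lines
marker_between    | wc -l   # 3 entries + 2 empty lines
```

Since this already works when only the unique entries are printed, a grouping option adds something only if it keeps every input line.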

Supporting -u or -d with --group wouldn't be useful either really.
It's probably most consistent to just disallow those combinations.

Just to be clear on the reasoning: because with "-u" and "-d", each *line* is 
implicitly a separate group, there's no apparent utility for an end-of-group marker.


I guess it's true from a technical POV - but again, for downstream analysis 
convenience it's nice to have a fixed end-of-group marker.
I could use the same downstream script (which expects end-of-group markers) with uniq, whether I 
used "-d" or "-u" or nothing at all.

But what's the point in such processing if there is only ever going
to be a single line in each group?

