Re: Extend uniq to support unsorted list based on hashtable
From: Yair Lenga
Subject: Re: Extend uniq to support unsorted list based on hashtable
Date: Sat, 30 May 2020 08:47:46 +0300
Hi Assaf,
Thanks for the prompt reply.
You bring up good points about POSIX compliance and the availability
of the datamash tool.
On the first point, I would note that most coreutils programs go well
beyond POSIX. Consider "cp", which has many useful additions beyond the
POSIX features.
The second point is about the availability of other tools that achieve
a similar task. Where this functionality belongs is a judgment call;
there is no single right answer here. Such an implementation can
be done with a few lines of code in any scripting language. My main point
is that, given that 'uniq' is very commonly combined
with other standard text tools (sort, cut, sed), it makes sense to have
an efficient implementation for "counting unique values" available
within "coreutils", instead of sending users to look for a solution
elsewhere or to implement their own.
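To illustrate the "few lines of code" point, here is a minimal sketch of hash-based de-duplication and counting on unsorted input, assuming a POSIX awk is available. The function names uniq_unsorted and count_unsorted are hypothetical and are not part of coreutils; the count format only imitates the layout of 'uniq -c'.

```shell
# Hypothetical helpers; neither exists in coreutils.

# Print each distinct line once, in order of first appearance.
# awk's associative array "seen" plays the role of the hash table.
uniq_unsorted() {
    awk '!seen[$0]++' "$@"
}

# Print each distinct line with its number of occurrences,
# in order of first appearance (layout similar to "uniq -c").
count_unsorted() {
    awk '{ if (!($0 in count)) order[++n] = $0; count[$0]++ }
         END { for (i = 1; i <= n; i++)
                   printf "%7d %s\n", count[order[i]], order[i] }' "$@"
}
```

For example, "printf '%s\n' b a b c a b | uniq_unsorted" prints b, a, c on separate lines, with no sort step. Unlike "sort | uniq -c", this keeps every distinct line in memory, which is the usual trade-off of a hash-based approach.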
Hope this makes sense.
Yair
On Sat, May 30, 2020 at 7:47 AM Assaf Gordon <assafgordon@gmail.com> wrote:
>
> Hello,
>
> On 2020-05-29 10:16 p.m., Yair Lenga wrote:
> > I wanted to suggest that the team look (again) at implementing
> > an --unsorted option for 'uniq'.
> >
> > The idea was proposed (and rejected) about 10 years ago
> > (https://lists.gnu.org/archive/html/coreutils/2011-11/msg00016.html).
> > A lot has changed since then.
> >
> [...]
> >
> > Can you advise or provide feedback? I'm sure that there will be many
> > volunteers (me included) to contribute to such an important improvement.
>
> "uniq" is standardized by POSIX to work by "comparing adjacent lines"
> (from:
> https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html ) -
> hence the requirement to pre-sort the input.
>
> While it could be extended with a completely different hash-based
> implementation, I don't think this is likely to happen.
>
> As an alternative (and a shameless plug), allow me to point to
> GNU Datamash ( https://www.gnu.org/software/datamash/ ).
> On one hand, it already has a hash-based implementation to
> remove duplicated fields (called "rmdup").
> Consider the following contrived example:
>
> $ (printf "%s\t%s\n" 9 B 3 A ; seq 10 | paste - -) | datamash rmdup 1
> 9 B
> 3 A
> 1 2
> 5 6
> 7 8
>
> And on the other hand, because 'datamash' is non-standard,
> there's less of a problem in adding new functionality (i.e. "bloat" is
> not as big a concern as it is for coreutils).
>
> Hope this helps.
>
> regards,
> - assaf
>
>