Re: Extend uniq to support unsorted list based on hashtable
From: Assaf Gordon
Subject: Re: Extend uniq to support unsorted list based on hashtable
Date: Fri, 29 May 2020 22:47:26 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0

Hello,
On 2020-05-29 10:16 p.m., Yair Lenga wrote:
> Wanted to suggest that the team will look (again) at implementing
> --unsorted option for 'uniq'.
> The idea was proposed (and rejected) about 10 years ago
> (https://lists.gnu.org/archive/html/coreutils/2011-11/msg00016.html).
> Lot of things have changed from the past.
> [...]
> Can you advise/provide feedback. I'm sure that there will be many
> volunteers (me included) to contribute to such important improvement.

"uniq" is standardized by POSIX to work by "comparing adjacent lines"
(from:
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html ) -
hence the requirement to pre-sort the input.
While it could be extended with a completely different hash-based
implementation, I don't think this is likely to happen.
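
To make the difference concrete, here is a minimal Python sketch contrasting
the two approaches. It is only an illustration, not the coreutils code; the
function names uniq_adjacent and uniq_hashed are made up for this example.

```python
from typing import Iterable, Iterator


def uniq_adjacent(lines: Iterable[str]) -> Iterator[str]:
    """Drop a line only when it equals the line immediately before it."""
    previous = None
    for line in lines:
        if line != previous:
            yield line
        previous = line


def uniq_hashed(lines: Iterable[str]) -> Iterator[str]:
    """Keep the first occurrence of each line, regardless of position."""
    seen = set()  # grows with the number of distinct lines
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line


data = ["b", "a", "b", "b", "a"]
adjacent_result = list(uniq_adjacent(data))
hashed_result = list(uniq_hashed(data))
print(adjacent_result)  # ['b', 'a', 'b', 'a']
print(hashed_result)    # ['b', 'a']
```

The adjacent version needs to remember only one line, which is why it depends
on sorted input; the hashed version works on unsorted input but has to keep
every distinct line in memory.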
As an alternative (and a shameless plug), allow me to point to
GNU Datamash ( https://www.gnu.org/software/datamash/ ).
On one hand, it already has a hash-based implementation to
remove lines with a duplicated key field (called "rmdup").
Consider the following contrived example:
$ (printf "%s\t%s\n" 9 B 3 A ; seq 10 | paste - -) | datamash rmdup 1
9 B
3 A
1 2
5 6
7 8
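
The same first-occurrence-wins behaviour can be sketched in a few lines of
Python. This is an assumption-laden illustration of the idea, not datamash's
actual implementation; rmdup_by_field is a hypothetical name, and the input
rows mirror the tab-separated lines produced by the command above.

```python
def rmdup_by_field(lines, field, sep="\t"):
    """Yield lines whose 1-based `field` value has not been seen before."""
    seen = set()
    for line in lines:
        key = line.split(sep)[field - 1]
        if key not in seen:
            seen.add(key)
            yield line


# Same input as the example: two printf lines, then `seq 10 | paste - -`.
rows = ["9\tB", "3\tA", "1\t2", "3\t4", "5\t6", "7\t8", "9\t10"]
kept = list(rmdup_by_field(rows, 1))
for row in kept:
    print(row)
```

The lines "3\t4" and "9\t10" are dropped because their first field was
already seen, and input order is otherwise preserved.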
And on the other hand, because 'datamash' is non-standard,
there's less of a problem in adding new functionality (i.e. "bloat" is
not as big a concern as it is for coreutils).
Hope this helps.
regards,
- assaf