[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
uniq for unsorted input
From: |
Bruno Haible |
Subject: |
uniq for unsorted input |
Date: |
Tue, 26 Mar 2024 17:40:44 +0100 |
The documentation of 'uniq' [1] says:
"The input need not be sorted, but repeated input lines are
detected only if they are adjacent. If you want to discard
non-adjacent duplicate lines, perhaps you want to use 'sort -u'."
When I wrote gnulib-tool many years ago, as a shell script, I naturally
used
- 'sort -u' to remove duplicates,
- 'sort' + 'join' to compute set intersections, and
- 'sort' + 'join -v' to compute set differences.
In a couple of places I only wanted to remove duplicates. Since the coreutils
don't offer a way to remove duplicates without sorting, I accepted the sorting
as a side effect. This has led to inconsistencies: In some places, gnulib-tool
sorts file names, and in other places it doesn't. [2]
In order to avoid mistakes of this kind, I propose that coreutils at least
gives a hint that removing duplicate lines from an unsorted list of lines
is possible, in O(N log N) run time and O(1) space.
The way to do so is described in [3]:
$ cat data.txt | nl | sort -k 2 | uniq -f 1 | sort -n | sed -e 's/^[^ ]*
//'
or
$ cat -n data.txt | sort --key=2.1 -b -u | sort -n | sed -e 's/^[^ ]*
//'
Other methods described in [3], such as counters maintained in an 'awk'
or 'perl' process, or the 'unique' program that is part of the 'john' package
[4], can be ignored, because they need O(N) space and are thus not usable for
40 GB large inputs [5].
* Would you accept a documentation change to the 'uniq' page that explains
the alternative with the line numbering trick mentioned above?
* What do you think about adding
- a 'uniq' option -k / --keep-unsorted
that implies unsorted input and removal of occurrences 2,...,n of a line
that occurs n times,
- a 'uniq' option --keep-unsorted-last
that implies unsorted input and removal of occurrences 1,...,n-1 of a line
that occurs n times?
These options would be implemented by spawning a pipe of 'cat', 'sort',
'sort', 'sed' programs as shown above, optionally resorting to all-in-memory
processing if the input's size is below a certain threshold (like 'sort'
does).
Bruno
[1] https://www.gnu.org/software/coreutils/manual/html_node/uniq-invocation.html
[2] https://lists.gnu.org/archive/html/bug-gnulib/2024-03/msg00334.html
[3] https://unix.stackexchange.com/questions/11939/
[4] https://www.openwall.com/john/
[5] https://stackoverflow.com/questions/30906191/
- uniq for unsorted input,
Bruno Haible <=