Re: uniq for unsorted input

coreutils

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: uniq for unsorted input

From:	Pádraig Brady
Subject:	Re: uniq for unsorted input
Date:	Tue, 26 Mar 2024 18:54:48 +0000
User-agent:	Mozilla Thunderbird

On 26/03/2024 16:40, Bruno Haible wrote:

The documentation of 'uniq' [1] says:

   "The input need not be sorted, but repeated input lines are
    detected only if they are adjacent. If you want to discard
    non-adjacent duplicate lines, perhaps you want to use 'sort -u'."

When I wrote gnulib-tool many years ago, as a shell script, I naturally
used
   - 'sort -u' to remove duplicates,
   - 'sort' + 'join' to compute set intersections, and
   - 'sort' + 'join -v' to compute set differences.

In a couple of places I only wanted to remove duplicates. Since the coreutils
don't offer a way to remove duplicates without sorting, I accepted the sorting
as a side effect. This has led to inconsistencies: In some places, gnulib-tool
sorts file names, and in other places it doesn't. [2]

In order to avoid mistakes of this kind, I propose that coreutils at least
gives a hint that removing duplicate lines from an unsorted list of lines
is possible, in O(N log N) run time and O(1) space.

The way to do so is described in [3]:
$ cat data.txt | nl | sort -k 2 | uniq -f 1 | sort -n | sed -e 's/^[^   ]*      
//'
or
$ cat -n data.txt | sort --key=2.1 -b -u | sort -n | sed -e 's/^[^      ]*      
//'

Other methods described in [3], such as counters maintained in an 'awk'
or 'perl' process, or the 'unique' program that is part of the 'john' package
[4], can be ignored, because they need O(N) space and are thus not usable for
40 GB large inputs [5].

* Would you accept a documentation change to the 'uniq' page that explains
   the alternative with the line numbering trick mentioned above?

* What do you think about adding
   - a 'uniq' option -k / --keep-unsorted
     that implies unsorted input and removal of occurrences 2,...,n of a line
     that occurs n times,
   - a 'uniq' option --keep-unsorted-last
     that implies unsorted input and removal of occurrences 1,...,n-1 of a line
     that occurs n times?
   These options would be implemented by spawning a pipe of 'cat', 'sort',
   'sort', 'sed' programs as shown above, optionally resorting to all-in-memory
   processing if the input's size is below a certain threshold (like 'sort'
   does).

Bruno

[1] https://www.gnu.org/software/coreutils/manual/html_node/uniq-invocation.html
[2] https://lists.gnu.org/archive/html/bug-gnulib/2024-03/msg00334.html
[3] https://unix.stackexchange.com/questions/11939/
[4] https://www.openwall.com/john/
[5] https://stackoverflow.com/questions/30906191/


A documentation patch would be very useful.
If you search for "Decorate Sort Undecorate" in coreutils.texi
you can see an existing example of this DSU pattern.
I would also mention DSU if adding another example.

Re adding new --keep-unsorted options I'm a bit reluctant
as the scalability is not improved by bringing within the command.

thanks!
Pádraig.

[Prev in Thread]

Current Thread

[Next in Thread]

uniq for unsorted input, Bruno Haible, 2024/03/26
- Re: uniq for unsorted input, Pádraig Brady <=
  - Re: uniq for unsorted input, Bruno Haible, 2024/03/26
- Re: uniq for unsorted input, Kaz Kylheku, 2024/03/26

Prev by Date: Re: coreutils-9.4.170-7b206 ls/removed-directory test failure
Next by Date: Re: coreutils-9.4.170-7b206 misc/numfmt test failure
Previous by thread: uniq for unsorted input
Next by thread: Re: uniq for unsorted input
Index(es):
- Date
- Thread