[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#9780: sort -u throws out non-duplicates

From: Jim Meyering
Subject: bug#9780: sort -u throws out non-duplicates
Date: Tue, 14 Aug 2012 23:09:11 +0200

Paul Eggert wrote:
> Thanks very much for that test case; I've confirmed the bug on
> my platform with the latest 'sort'.  If nobody else gets to it
> I will try to take a look at it when I find the time
> (most likely in a week or so).

Yes, thanks again!
That is a serious bug.
It has been around for a long time.

This small reproducer (on a 2-core i686 system)
  perl -e 'printf "33\n"x2 ."7\n"x31 ."1\n"' | src/sort -S1 -u
prints this:
but should print this:

On a multi-core x86_64, I need slightly different input to trigger
the failure.  Note how this uses only 22 '7's and --parallel=1.

  $ perl -e 'printf "33\n"x2 ."7\n"x22 ."1\n"'|src/sort --para=1 -S1 -u

The problem starts with write_unique's static variable:

static void
write_unique (struct line const *line, FILE *tfp, char const *temp_output)
  static struct line saved;
  if (unique)
      if (saved.text && ! compare (line, &saved))
      saved = *line;

Note how that merely makes a shallow copy of "*line".
I.e., it merely copies line's 4 members, 3 of which are pointers.

  (gdb) p *line
  $1 = {
    text = 0x806221e "1",
    length = 2,
    keybeg = 0x62626262 <Address 0x62626262 out of bounds>,
    keylim = 0x62626262 <Address 0x62626262 out of bounds>

In that example, the two key* variables are not even initialized,
which is not a problem, since they're not used in this example.
The one that causes trouble is the .text pointer.
The line buffer storage into which it points ends
up being overwritten when new data is read in (via fread),
and so if you are unlucky, you'll get an accidental match
and mistakenly skip the sole (in reduced temp files) occurrence
of a line, resulting in incorrect output.

The solution may be to make a deep copy, and store it in
an extensible (probably never-freed) buffer.
However, that looks like it will be comparatively expensive,
since determining whether keybeg and keylim must
also be copied depends on many global options,
the same ones used by sort's compare function.

A slower-yet-correct "sort -u" is obviously preferable to
our currently faster-yet-buggy one.

reply via email to

[Prev in Thread] Current Thread [Next in Thread]