bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#9780: sort -u throws out non-duplicates


From: Jim Meyering
Subject: bug#9780: sort -u throws out non-duplicates
Date: Thu, 16 Aug 2012 23:03:02 +0200

Jim Meyering wrote:
...
> In case anyone is chomping at the bit, here's a preliminary patch:
>
> Here's a smaller test case that appears to be host/nproc-independent:
> It should print two lines: 1, then 7.
> Without this patch, it prints only "7".
>
>     (yes 7|head -11; echo 1)|sort --parallel=1 -S32b -u
>
> Of course, it needs more/better comments, NEWS and
> tests -- and not just the one above, but also one that
> demonstrates the need for the key* adjustments below.
>
> This solution doesn't incur much of a performance penalty
> because it copies the line only rarely: just before an fread
> call that might modify the currently-saved text.
>
...
> Subject: [PATCH] sort: fix bug with --unique
>
> ---
>  src/sort.c | 37 ++++++++++++++++++++++++++++++++++---
>  1 file changed, 34 insertions(+), 3 deletions(-)

Here's a complete patch:

>From 431102766cbf7c360ee6fa1f157ebcd7d8b9ca0e Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Wed, 15 Aug 2012 12:30:44 +0200
Subject: [PATCH] sort: sort --unique (-u) could cause data loss

sort -u could omit one or more lines of expected output.
This bug arose because sort recorded the most recently printed line via
reference, and if you were unlucky, the storage for that line would be
reused (overwritten) as additional input was read into memory.  If you
were doubly unlucky, the new value of the "saved" line would not only
match the very next line, but if that next line were also the first in
a series of identical, not-yet-printed lines, then the corrupted "saved"
line value would result in the omission of all matching lines.

* src/sort.c (saved_line): New static/global, renamed and moved from...
(write_unique): ...here.  Old name was "saved", which was too generic
for its new role as file-scoped global.
(fillbuf): With --unique, when we're about to read into a buffer that
overlaps the saved "preceding" line (saved_line), copy the line's .text
member to a realloc'd-as-needed temporary buffer and adjust the line's
key-defining members if they're set.
(overlap): New function.
* tests/misc/sort: New tests.
* NEWS (Bug fixes): Mention it.
* THANKS.in: Update.
Bug introduced via commit v8.5-89-g9face83.
Reported by Rasmus Borup Hansen in
http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/23173/focus=24647
---
 NEWS            |  5 +++++
 THANKS.in       |  1 +
 src/sort.c      | 44 ++++++++++++++++++++++++++++++++++++++++----
 tests/misc/sort |  9 +++++++++
 4 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/NEWS b/NEWS
index 012a633..f39a76a 100644
--- a/NEWS
+++ b/NEWS
@@ -9,6 +9,11 @@ GNU coreutils NEWS                                    -*- 
outline -*-
   certain options like -a, -l, -t and -x.
   [This bug was present in "the beginning".]

+  sort -u could fail to output one or more result lines.
+  For example, this command would fail to print "1":
+  (yes 7 | head -11; echo 1) | sort --p=1 -S32b -u
+  [bug introduced in coreutils-8.6]
+
 ** New features

   rm now accepts the --dir (-d) option which makes it remove empty directories.
diff --git a/THANKS.in b/THANKS.in
index 5db443b..a736201 100644
--- a/THANKS.in
+++ b/THANKS.in
@@ -508,6 +508,7 @@ Primoz PETERLIN                     address@hidden
 Rainer Orth                         address@hidden
 Ralf W. Stephan                     address@hidden
 Ralph Loader                        address@hidden
+Rasmus Borup Hansen                 address@hidden
 Raul Miller                         address@hidden
 Raúl Núñez de Arenas Coronado       address@hidden
 Richard A Downing                   address@hidden
diff --git a/src/sort.c b/src/sort.c
index d362dc5..c2d2d49 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -262,6 +262,9 @@ struct merge_node_queue
                                    when popping. */
 };

+/* Used to implement --unique (-u).  */
+static struct line saved_line;
+
 /* FIXME: None of these tables work with multibyte character sets.
    Also, there are many other bugs when handling multibyte characters.
    One way to fix this is to rewrite 'sort' to use wide characters
@@ -1702,6 +1705,14 @@ limfield (struct line const *line, struct keyfield const 
*key)
   return ptr;
 }

+/* Return true if LINE and the buffer BUF of length LEN overlap.  */
+static inline bool
+overlap (char const *buf, size_t len, struct line const *line)
+{
+  char const *line_end = line->text + line->length;
+  return !(line_end <= buf || buf + len <= line->text);
+}
+
 /* Fill BUF reading from FP, moving buf->left bytes from the end
    of buf->buf to the beginning first.  If EOF is reached and the
    file wasn't terminated by a newline, supply one.  Set up BUF's line
@@ -1742,6 +1753,33 @@ fillbuf (struct buffer *buf, FILE *fp, char const *file)
              rest of the input file consists entirely of newlines,
              except that the last byte is not a newline.  */
           size_t readsize = (avail - 1) / (line_bytes + 1);
+
+          /* With --unique, when we're about to read into a buffer that
+             overlaps the saved "preceding" line (saved_line), copy the line's
+             .text member to a realloc'd-as-needed temporary buffer and adjust
+             the line's key-defining members if they're set.  */
+          if (unique && overlap (ptr, readsize, &saved_line))
+            {
+              /* Copy saved_line.text into a buffer where it won't be clobbered
+                 and if KEY is non-NULL, adjust saved_line.key* to match.  */
+              static char *safe_text;
+              static size_t safe_text_n_alloc;
+              if (safe_text_n_alloc < saved_line.length)
+                {
+                  safe_text_n_alloc = saved_line.length;
+                  safe_text = x2nrealloc (safe_text, &safe_text_n_alloc, 1);
+                }
+              memcpy (safe_text, saved_line.text, saved_line.length);
+              if (key)
+                {
+                  #define s saved_line
+                  s.keybeg = safe_text + (s.keybeg - s.text);
+                  s.keylim = safe_text + (s.keylim - s.text);
+                  #undef s
+                }
+              saved_line.text = safe_text;
+            }
+
           size_t bytes_read = fread (ptr, 1, readsize, fp);
           char *ptrlim = ptr + bytes_read;
           char *p;
@@ -3348,13 +3386,11 @@ queue_pop (struct merge_node_queue *queue)
 static void
 write_unique (struct line const *line, FILE *tfp, char const *temp_output)
 {
-  static struct line saved;
-
   if (unique)
     {
-      if (saved.text && ! compare (line, &saved))
+      if (saved_line.text && ! compare (line, &saved_line))
         return;
-      saved = *line;
+      saved_line = *line;
     }

   write_line (line, tfp, temp_output);
diff --git a/tests/misc/sort b/tests/misc/sort
index 5d15d75..050d2f8 100755
--- a/tests/misc/sort
+++ b/tests/misc/sort
@@ -227,6 +227,15 @@ my @Tests =
 ["15d", '-i -u', {IN=>"\1a\na\n"}, {OUT=>"\1a\n"}],
 ["15e", '-i -u', {IN=>"a\n\1\1\1\1\1a\1\1\1\1\n"}, {OUT=>"a\n"}],

+# This would fail (printing only the 7) for 8.6..8.18.
+["unique-1", '--p=1 -S32b -u', {IN=>"7\n"x11 . "1\n"}, {OUT=>"1\n7\n"}],
+# Demonstrate that 8.19's key-spec-adjusting code is required.
+# These are more finicky in that they are arch-dependent.
+["unique-key-i686", '-k2,2 --p=1 -S32b -u',
+  {IN=>"a 7\n"x10 . "b 1\n"}, {OUT=>"b 1\na 7\n"}],
+["unique-key-x86_64", '-k2,2 --p=1 -S1k -u',
+  {IN=>"a 7\n"x20 . "b 1\n"}, {OUT=>"b 1\na 7\n"}],
+
 # From Erick Branderhorst -- fixed around 1.19e
 ["16a", '-f',
  {IN=>"éminence\nüberhaupt\n's-Gravenhage\naëroclub\nAag\naagtappels\n"},
--
1.7.12.rc2





reply via email to

[Prev in Thread] Current Thread [Next in Thread]