bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#9780: sort -u throws out non-duplicates


From: Jim Meyering
Subject: bug#9780: sort -u throws out non-duplicates
Date: Wed, 15 Aug 2012 19:47:33 +0200

Jim Meyering wrote:
> Paul Eggert wrote:
>> Thanks very much for that test case; I've confirmed the bug on
>> my platform with the latest 'sort'.  If nobody else gets to it
>> I will try to take a look at it when I find the time
>> (most likely in a week or so).
>
> Yes, thanks again!
> That is a serious bug.
> It has been around for a long time.
>
> This small reproducer (on a 2-core i686 system)
>   perl -e 'printf "33\n"x2 ."7\n"x31 ."1\n"' | src/sort -S1 -u
> prints this:
>   33
>   7
> but should print this:
>   1
>   33
>   7
>
> On a multi-core x86_64, I need slightly different input to trigger
> the failure.  Note how this uses only 22 '7's and --parallel=1.
>
>   $ perl -e 'printf "33\n"x2 ."7\n"x22 ."1\n"'|src/sort --para=1 -S1 -u
>   33
>   7
>
> The problem starts with write_unique's static variable:
>
> static void
> write_unique (struct line const *line, FILE *tfp, char const *temp_output)
> {
>   static struct line saved;
>   if (unique)
>     {
>       if (saved.text && ! compare (line, &saved))
>         return;
>       saved = *line;
>     }
>     ...
>
> Note how that merely makes a shallow copy of "*line".
> I.e., it merely copies line's 4 members, 3 of which are pointers.
>
>   (gdb) p *line
>   $1 = {
>     text = 0x806221e "1",
>     length = 2,
>     keybeg = 0x62626262 <Address 0x62626262 out of bounds>,
>     keylim = 0x62626262 <Address 0x62626262 out of bounds>
>
> In that example, the two key* variables are not even initialized,
> which is not a problem, since they're not used in this example.
> The one that causes trouble is the .text pointer.
> The line buffer storage into which it points ends
> up being overwritten when new data is read in (via fread),
> and so if you are unlucky, you'll get an accidental match
> and mistakenly skip the sole (in reduced temp files) occurrence
> of a line, resulting in incorrect output.
>
> The solution may be to make a deep copy, and store it in
> an extensible (probably never-freed) buffer.
> However, that looks like it will be comparatively expensive,
> since determining whether keybeg and keylim must
> also be copied depends on many global options,
> the same ones used by sort's compare function.
>
> A slower-yet-correct "sort -u" is obviously preferable to
> our currently faster-yet-buggy one.

I'm technically "off" today, so have had little time.
In case anyone is chomping at the bit, here's a preliminary patch:

Here's a smaller test case that appears to be host/nproc-independent:
It should print two lines: 1, then 7.
Without this patch, it prints only "7".

    (yes 7|head -11; echo 1)|sort --parallel=1 -S32b -u

Of course, it needs more/better comments, NEWS and
tests -- and not just the one above, but also one that
demonstrates the need for the key* adjustments below.

This solution doesn't incur much of a performance penalty
because it copies the line only rarely: just before an fread
call that might modify the currently-saved text.


>From 24f9646e3b954b7c914c0f3139054dfce466d314 Mon Sep 17 00:00:00 2001
From: Jim Meyering <address@hidden>
Date: Wed, 15 Aug 2012 12:30:44 +0200
Subject: [PATCH] sort: fix bug with --unique

---
 src/sort.c | 37 ++++++++++++++++++++++++++++++++++---
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index d362dc5..6b07c22 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -262,6 +262,9 @@ struct merge_node_queue
                                    when popping. */
 };

+/* Used to implement --unique (-u).  */
+static struct line saved_line;
+
 /* FIXME: None of these tables work with multibyte character sets.
    Also, there are many other bugs when handling multibyte characters.
    One way to fix this is to rewrite 'sort' to use wide characters
@@ -1702,6 +1705,14 @@ limfield (struct line const *line, struct keyfield const 
*key)
   return ptr;
 }

+/* Return true if LINE and the buffer BUF of length LEN overlap.  */
+static inline bool
+overlap (char const *buf, size_t len, struct line const *line)
+{
+  char const *line_end = line->text + line->length;
+  return !(line_end <= buf || buf + len <= line->text);
+}
+
 /* Fill BUF reading from FP, moving buf->left bytes from the end
    of buf->buf to the beginning first.  If EOF is reached and the
    file wasn't terminated by a newline, supply one.  Set up BUF's line
@@ -1742,6 +1753,27 @@ fillbuf (struct buffer *buf, FILE *fp, char const *file)
              rest of the input file consists entirely of newlines,
              except that the last byte is not a newline.  */
           size_t readsize = (avail - 1) / (line_bytes + 1);
+          if (unique && overlap (ptr, readsize, &saved_line))
+            {
+              /* Copy saved_line.text into a buffer where it won't be clobbered
+                 and if KEY is non-NULL, adjust saved_line.key* to match.  */
+              static char *safe_text;
+              static size_t safe_text_n_alloc;
+              if (safe_text_n_alloc < saved_line.length)
+                {
+                  safe_text_n_alloc = saved_line.length;
+                  safe_text = x2nrealloc (safe_text, &safe_text_n_alloc, 1);
+                }
+              memcpy (safe_text, saved_line.text, saved_line.length);
+              if (key)
+                {
+                  #define s saved_line
+                  s.keybeg = safe_text + (s.keybeg - s.text);
+                  s.keylim = safe_text + (s.keylim - s.text);
+                  #undef s
+                }
+              saved_line.text = safe_text;
+            }
           size_t bytes_read = fread (ptr, 1, readsize, fp);
           char *ptrlim = ptr + bytes_read;
           char *p;
@@ -3348,13 +3380,12 @@ queue_pop (struct merge_node_queue *queue)
 static void
 write_unique (struct line const *line, FILE *tfp, char const *temp_output)
 {
-  static struct line saved;

   if (unique)
     {
-      if (saved.text && ! compare (line, &saved))
+      if (saved_line.text && ! compare (line, &saved_line))
         return;
-      saved = *line;
+      saved_line = *line;
     }

   write_line (line, tfp, temp_output);
--
1.7.12.rc2.16.g034161a





reply via email to

[Prev in Thread] Current Thread [Next in Thread]