coreutils

[coreutils] Bug (?) in sort -R


From: Jason
Subject: [coreutils] Bug (?) in sort -R
Date: Mon, 16 Aug 2010 12:22:51 -0700

I can't decide if this is a bug or not. Apologies if this has already been discussed; I am pretty new to the list. I'm using the latest git version, 8.5.136-6d78c.

If you do

sort -R -k 4,4 a > b

the relative ordering of column 4 is different than if you do

sort -R -k 4,5 a > b.

(Obviously, the actual order in the output file is different on every run unless you pass in the same random data to get the same ordering.)

It seems that the individual columns should be hashed and sorted independently in order to maintain the normal grouping of the primary sort column. Instead, the sort appears to be on the hash of the concatenated key list, so equal values of the primary sort column do not appear next to each other when sorting on multiple columns. For example, given an input file called "a":

a b c d e
a b c d f
a b c d g
a b c e e
a b c e f

The output file should always contain all the "a b c d" lines contiguously, and all the "a b c e" lines contiguously. As it is, the output might be

~/coreutils/coreutils> src/sort -R -k 4,5 a
a b c d e
a b c d g
a b c e e
a b c e f
a b c d f

~/coreutils/coreutils> src/sort --version
sort (GNU coreutils) 8.5.136-6d78c
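
To make the difference concrete, here is a rough Python model of the two strategies described above. It is only a sketch, not sort's actual code: salted_hash, sort_concatenated, sort_per_column and is_grouped are hypothetical names, and the salted MD5 merely stands in for whatever keyed hash sort -R uses internally.

```python
import hashlib

# The five lines of the sample input file "a".
LINES = [
    "a b c d e",
    "a b c d f",
    "a b c d g",
    "a b c e e",
    "a b c e f",
]


def salted_hash(text, salt):
    """Hash text with a per-run salt (an assumed stand-in for sort -R's hash)."""
    return hashlib.md5(salt + text.encode()).digest()


def sort_concatenated(lines, salt):
    """Model of '-k 4,5': fields 4 through 5 form ONE key, hashed as a whole."""
    return sorted(lines, key=lambda l: salted_hash(" ".join(l.split()[3:5]), salt))


def sort_per_column(lines, salt):
    """Model of the proposal: hash field 4 and field 5 independently."""
    def key(line):
        fields = line.split()
        return (salted_hash(fields[3], salt), salted_hash(fields[4], salt))
    return sorted(lines, key=key)


def is_grouped(lines, col=3):
    """Return True if equal values in column `col` (0-based) are contiguous."""
    seen = set()
    prev = None
    for line in lines:
        value = line.split()[col]
        if value != prev:
            if value in seen:
                return False
            seen.add(value)
            prev = value
    return True


if __name__ == "__main__":
    salt = b"example-salt"
    print(is_grouped(sort_per_column(LINES, salt)))    # always True
    print(is_grouped(sort_concatenated(LINES, salt)))  # depends on the salt
```

With independent hashes, every "d" line shares the same primary hash, so the "a b c d" lines always stay together no matter which salt is used. With one hash over the concatenated key, "d e" and "d f" hash to unrelated values and the groups can interleave.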

This is also true if you use the -s flag with only one field specified, which is a slightly different flavor of the same bug.

~/coreutils/coreutils> src/sort -s -R -k 4 a
a b c d g
a b c e f
a b c d f
a b c d e
a b c e e

Whereas

src/sort -s -R -k 4,4 a
a b c e e
a b c e f
a b c d e
a b c d f
a b c d g

src/sort -s -R -k 4,4 a
a b c d e
a b c d f
a b c d g
a b c e e
a b c e f

yields expected results.
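
The difference between -k 4 and -k 4,4 comes down to how much of the line each key covers: a key given without an end field runs to the end of the line. A small Python sketch of that rule follows. key_span is a hypothetical helper, and it simplifies the real behavior by splitting on whitespace, whereas sort keeps a field's leading blanks unless -b is given.

```python
def key_span(line, start, end=None):
    """Return the text covered by a '-k start[,end]' key (1-based fields).

    Simplified model: fields are split on whitespace and rejoined with
    single spaces. With no end field, the key extends to the end of the line.
    """
    fields = line.split()
    stop = len(fields) if end is None else end
    return " ".join(fields[start - 1:stop])


if __name__ == "__main__":
    line = "a b c d e"
    print(key_span(line, 4))     # d e  (like -k 4: field 4 to end of line)
    print(key_span(line, 4, 4))  # d    (like -k 4,4: field 4 only)
    print(key_span(line, 4, 5))  # d e  (like -k 4,5: fields 4 and 5 as one key)
```

Under this model, -k 4 and -k 4,5 give the same key text for the sample file, which matches the two cases above behaving alike.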

The real-world use case is to prevent sequential scanning of sharded databases by using the -R flag when grouping data from multiple sources.

Jason






