bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: new coreutil? shuffle - randomize file contents


From: Paul Eggert
Subject: Re: new coreutil? shuffle - randomize file contents
Date: Fri, 15 Jul 2005 09:25:46 -0700
User-agent: Gnus/5.1007 (Gnus v5.10.7) Emacs/21.4 (gnu/linux)

Thanks for working on this.  You've gotten further than anyone else
has!  Some quick comments:

Frederik Eaton <address@hidden> writes:

> Is there a script for making a patch with all the right files excluded
> by the way?

Not yet.  That's on the list of things to do.  The fix will be to remove
those files from the repository, and have a bootstrap script that generates
them.  Then "cvs diff" will do the right thing.

>> shred also tries to obtain a random seed.
>> It'd be nice (eventually) to have both programs
>> use the same mechanism.
>
> I did not realize this. Yes, perhaps it would.

That sounds right to me as well.

> I do not have any idea, but it would be good to have random seed
> functionality somewhere standard-ish. It would be nice if it were in
> libc but that's an area I don't want to tackle.

It's a reasonable thing to put in coreutils/lib.  We can then put it
into gnulib.

> Also, I didn't personally think it was necessary to do anything but
> look at the time and process ID for a default seed - I only added the
> "/dev/random" stuff because Paul Eggert seems to think that security
> is very important here

If "sort -R" is used for anything security-relevant, then it will be
important, yes.

> - and once "/dev/random" is referenced then I figured compatibility
> with other kinds of systems would become an issue

No, the method used by "shred" is good enough.  We don't need to worry
about EGD any more.  It's an obsolete hack.  You might be better off
starting with the code in "shred".

> I would like not to spend much more time on this, by the way,

Alas, more work needs to be done.  Perhaps we can recruit someone else
to look into it?  I will put it on my list of things to do as well,
but it's a pretty long list....

+       * src/sort.c: add functions init_salt, get_hash; 'init_salt' is
+       used to provide seeding, and 'get_hash' calculates the hash value
+       by which a key is sorted

Please use the ChangeLog style recommended by the GNU coding standards
<http://www.gnu.org/prep/standards/html_node/Change-Logs.html>.
For example, the first line should start with
"* src/sort.c (init_salt, get_hash):".
It's important to log every externally-visible identifier whose meaning
has changed.

Also, each external symbol (function, macro, variable) should have a comment
explaining what it does.  Currently I'm at a bit of a loss trying to
figure out what things do, so my comments will be limited.

+#ifndef _CHECKSUM_H
+#define _CHECKSUM_H 1
+
+#include <sys/types.h>
+#include <config.h>
+#include "sha1.h"
+#include "md5.h"

This seems overkill to me.  sort is just using md5, right?

+  int len = strlen(str);

There should be a space before the "(".  There are several other instances
of that.

+  return (len*2) >= ops->bits;

Please put spaces around the "*".  The parentheses aren't needed here.

Also, we prefer <= to >=.  This is a programming style rule that I
learned from D. Val Schorre.  It's an application of Leibnitz's
criterion for notation: textual order should reflect numeric order.

+  --seed                    use specified seed for random number generator\n\

--seed has an operand, so it should be mentioned here.

+  void *ctx = alloca(digops.ctx_size);

What are the bounds on digops.ctx_size?  If it can be large (more than
a few kilobytes) then we shouldn't rely on alloca, due to stack-overflow
detection issues.

+      else if (key->random_hash)
+       {
+         int dig_bytes = digops.bits/8;
+         char diga[dig_bytes];
+         char digb[dig_bytes];
+         get_hash(texta, lena, diga);
+         get_hash(textb, lenb, digb);
+         diff = memcmp(diga, digb, sizeof(diga));
+       }

It should be possible to combine -R with -b, -d, -f, -g, -i, -M, -n,
and -r.  For example, sort -nR should compute the same hash for 1.0
and 01.0 that it computes for 1.00.  Currently, though, the -R is
silently ignored in this case.  Conversely, sort -MR should compute
the same hash for " Jan" that it does for "Jan"; but here, the "-M" is
silently ignored.


       else if (key->month)
        diff = getmonth (texta, lena) - getmonth (textb, lenb);
       /* Sorting like this may become slow, so in a simple locale the user

@@ -1986,7 +2038,7 @@ badfieldspec (char const *spec, char con
 {
   error (SORT_FAILURE, 0, _("%s: invalid field specification %s"),
         _(msgid), quote (spec));
-  abort ();
+  abort (); // inserted to avoid a compiler warning
 }

Please don't use //-style comments, as we can't assume C99 yet.
Also, please prefer declarative sentences in comments, and put
the comments on separate lines before the code they describe,
e,g.:

  /* Avoid a compiler warning.  */
  abort ();


+  char *randseed = 0;

This should be type "char const *", not "char *".  Note that we prefer
the const after the char.  Also, the 0 should be NULL.
 
+         randseed = strdupa(optarg);

We can't assume strdupa exists.  But you don't need to copy optarg; just
use it without copying it.

+  -R, --random                sort by random hash of keys\n\

I would prefer the long option name --random-hash, since it's not a
truly random sort.

There is a key problem here: one of documentation.  doc/coreutils.texi
and NEWS need to be modified to say exactly what's going on, and why
sorting based on a random hash is not the same as generating a random
permutation, so that people who are doing security-relevant stuff
won't rely on something that they shouldn't.

Also, I can't easily tell from the code how many random bits are
currently acquired from the environment, and whether there are enough
bits to actually satisfy the claim that we are using a random hash.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]