[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] split: --chunks option

From: Pádraig Brady
Subject: Re: [PATCH] split: --chunks option
Date: Fri, 05 Feb 2010 12:40:05 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20100111 Thunderbird/3.0.1

I got a bit of time for the review last night...

This was your last interface change for this:

   -b, --bytes=SIZE        put SIZE bytes per output file\n\
+  -b, --bytes=/N          generate N output files\n\
+  -b, --bytes=K/N         print Kth of N chunks of file\n\
   -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file\n\
   -d, --numeric-suffixes  use numeric suffixes instead of alphabetic\n\
   -l, --lines=NUMBER      put NUMBER lines per output file\n\
+  -l, --lines=/N          generate N eol delineated output files\n\
+  -l, --lines=K/N         print Kth of N eol delineated chunks\n\
+  -n, --number=N          same as --bytes=/N\n\
+  -n, --number=K/N        same as --bytes=K/N\n\
+  -r, --round-robin=N     generate N eol delineated output files using\n\
+                            round-robin style distribution.\n\
+  -r. --round-robin=K/N   print Kth of N eol delineated chunk as -rN would\n\
+                            have generated.\n\
+  -t, --term=CHAR         specify CHAR as eol. This will also convert\n\
+                            -b to its line delineated equivalent (-C if\n\
+                            splitting normally, -l if splitting by\n\
+                            chunks). C escape sequences are accepted.\n\

Thinking more about it, I think adding 2 modes of operation to the
already slightly complicated -bCl options is too confusing.
Since this is a separate mode of operation; i.e. one would be
specifying a particular number of files for a different reason
than a particular size, it would be better as a separate option.
  So I changed -n to operate as follows. This is more general if
we want to add new split methods in future, and also compatible with
the existing BSD -n without needing a redundant option.

-n N       split into N files based on size of input
-n K/N     output K of N to stdout
-n l/N     split into N files while maintaining lines
-n l/K/N   output K of N to stdout while maintaining lines
-n r/N     like `l' but use round robin distribution instead of size
-n r/K/N   likewise but only output K of N to stdout

Other changes I made in the attached version are:

Removed the -t option as that's separate.
Removed erroneous 'c' from getopt() parameters.
Use K/N in code rather than M/N to match user instructions.
Added suffix len setter/checker based on N so that
  we fail immediately if the wrong -a is specified, or
  if -a is not specified we auto set it.
Flagged 0/N as an error, rather than treating like /N.
Changed r/K/N to buffer using stdio for much better performance (see below)
Fixed up the errno on some errors()
Normalized all "write error" messages so that all of these commands
output a single translated error message, of the form:
"split: write error: No space left on device"
  split -n 1/10 $(which split) >/dev/full
  stdbuf -o0 split -n 1/10 $(which split) >/dev/full
  seq 10 | split -n r/1/10 >/dev/full
  seq 10 | stdbuf -o0 split -n r/1/10 >/dev/full

Re the performance of the round robin implementation;
using stdio helps a LOT as can be seen with:
$ time yes | head -n10000000 | ./split-fwrite -n r/1/1 | wc -l

real    0m1.568s
user    0m1.486s
sys     0m0.072s

$ time yes | head -n10000000 | ./split-write -n r/1/1 | wc -l

real    0m50.988s
user    0m7.548s
sys     0m43.250s

I still need to look at the round robin implementation when
outputting to file rather than stdout. I may default to using
stdio, but give an option to flush each line. I'm testing
with this currently which is performing badly just doing write()
#create fifos
yes | head -n4 | ../split -n r/4 fifo
for f in x*; do rm $f && mkfifo $f; done

(for f in x*; do md5sum $f& done) > md5sum.out

seq 100000 | split -n r/4

BTW, other modes perform well with write()
$ yes | head -n10000000 > 10m.txt
$ time ./split -n l/1/1 <10m.txt | wc -l

real    0m0.201s
user    0m0.145s
sys     0m0.043s

$ time ./split -n 1/1 <10m.txt | wc -l

real    0m0.199s
user    0m0.154s
sys     0m0.041s

$ time ./split -n 1 <10m.txt

real    0m0.088s
user    0m0.000s
sys     0m0.081s

Here is stuff I intend TODO before checking in:
 s/pread()/dd::skip()/ or at least add pread to bootstrap.conf
 fix info docs for reworked interface
 try to refactor duplicated code


Attachment: split-chunks.diff
Description: Text Data

reply via email to

[Prev in Thread] Current Thread [Next in Thread]