[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] split: --chunks option
From: |
Pádraig Brady |
Subject: |
Re: [PATCH] split: --chunks option |
Date: |
Fri, 05 Feb 2010 12:40:05 +0000 |
User-agent: |
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.7) Gecko/20100111 Thunderbird/3.0.1 |
I got a bit of time for the review last night...
This was your last interface change for this:
-b, --bytes=SIZE put SIZE bytes per output file\n\
+ -b, --bytes=/N generate N output files\n\
+ -b, --bytes=K/N print Kth of N chunks of file\n\
-C, --line-bytes=SIZE put at most SIZE bytes of lines per output file\n\
-d, --numeric-suffixes use numeric suffixes instead of alphabetic\n\
-l, --lines=NUMBER put NUMBER lines per output file\n\
+ -l, --lines=/N generate N eol delineated output files\n\
+ -l, --lines=K/N print Kth of N eol delineated chunks\n\
+ -n, --number=N same as --bytes=/N\n\
+ -n, --number=K/N same as --bytes=K/N\n\
+ -r, --round-robin=N generate N eol delineated output files using\n\
+ round-robin style distribution.\n\
+ -r. --round-robin=K/N print Kth of N eol delineated chunk as -rN would\n\
+ have generated.\n\
+ -t, --term=CHAR specify CHAR as eol. This will also convert\n\
+ -b to its line delineated equivalent (-C if\n\
+ splitting normally, -l if splitting by\n\
+ chunks). C escape sequences are accepted.\n\
Thinking more about it, I think adding 2 modes of operation to the
already slightly complicated -bCl options is too confusing.
Since this is a separate mode of operation; i.e. one would be
specifying a particular number of files for a different reason
than a particular size, it would be better as a separate option.
So I changed -n to operate as follows. This is more general if
we want to add new split methods in future, and also compatible with
the existing BSD -n without needing a redundant option.
-n N split into N files based on size of input
-n K/N output K of N to stdout
-n l/N split into N files while maintaining lines
-n l/K/N output K of N to stdout while maintaining lines
-n r/N like `l' but use round robin distribution instead of size
-n r/K/N likewise but only output K of N to stdout
Other changes I made in the attached version are:
Removed the -t option as that's separate.
Removed erroneous 'c' from getopt() parameters.
Use K/N in code rather than M/N to match user instructions.
Added suffix len setter/checker based on N so that
we fail immediately if the wrong -a is specified, or
if -a is not specified we auto set it.
Flagged 0/N as an error, rather than treating like /N.
Changed r/K/N to buffer using stdio for much better performance (see below)
Fixed up the errno on some errors()
Normalized all "write error" messages so that all of these commands
output a single translated error message, of the form:
"split: write error: No space left on device"
split -n 1/10 $(which split) >/dev/full
stdbuf -o0 split -n 1/10 $(which split) >/dev/full
seq 10 | split -n r/1/10 >/dev/full
seq 10 | stdbuf -o0 split -n r/1/10 >/dev/full
Re the performance of the round robin implementation;
using stdio helps a LOT as can be seen with:
-------------------------------------------------------
$ time yes | head -n10000000 | ./split-fwrite -n r/1/1 | wc -l
10000000
real 0m1.568s
user 0m1.486s
sys 0m0.072s
$ time yes | head -n10000000 | ./split-write -n r/1/1 | wc -l
10000000
real 0m50.988s
user 0m7.548s
sys 0m43.250s
-------------------------------------------------------
I still need to look at the round robin implementation when
outputting to file rather than stdout. I may default to using
stdio, but give an option to flush each line. I'm testing
with this currently which is performing badly just doing write()
-------------------------------------------------------
#create fifos
yes | head -n4 | ../split -n r/4 fifo
for f in x*; do rm $f && mkfifo $f; done
#consumer
(for f in x*; do md5sum $f& done) > md5sum.out
#producer
seq 100000 | split -n r/4
-------------------------------------------------------
BTW, other modes perform well with write()
-------------------------------------------------------
$ yes | head -n10000000 > 10m.txt
$ time ./split -n l/1/1 <10m.txt | wc -l
10000000
real 0m0.201s
user 0m0.145s
sys 0m0.043s
$ time ./split -n 1/1 <10m.txt | wc -l
10000000
real 0m0.199s
user 0m0.154s
sys 0m0.041s
$ time ./split -n 1 <10m.txt
real 0m0.088s
user 0m0.000s
sys 0m0.081s
-------------------------------------------------------
Here is stuff I intend TODO before checking in:
s/pread()/dd::skip()/ or at least add pread to bootstrap.conf
fix info docs for reworked interface
try to refactor duplicated code
cheers,
Pádraig.
split-chunks.diff
Description: Text Data
- Re: [PATCH] split: --chunks option,
Pádraig Brady <=