bug#7401: [PATCH] split: --chunks option
From: Pádraig Brady
Subject: bug#7401: [PATCH] split: --chunks option
Date: Sun, 14 Nov 2010 20:54:09 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3
On 05/02/10 12:40, Pádraig Brady wrote:
> I got a bit of time for the review last night...
>
> This was your last interface change for this:
>
>    -b, --bytes=SIZE        put SIZE bytes per output file\n\
>  + -b, --bytes=/N          generate N output files\n\
>  + -b, --bytes=K/N         print Kth of N chunks of file\n\
>    -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file\n\
>    -d, --numeric-suffixes  use numeric suffixes instead of alphabetic\n\
>    -l, --lines=NUMBER      put NUMBER lines per output file\n\
>  + -l, --lines=/N          generate N eol-delimited output files\n\
>  + -l, --lines=K/N         print Kth of N eol-delimited chunks\n\
>  + -n, --number=N          same as --bytes=/N\n\
>  + -n, --number=K/N        same as --bytes=K/N\n\
>  + -r, --round-robin=N     generate N eol-delimited output files using\n\
>  +                         round-robin style distribution.\n\
>  + -r, --round-robin=K/N   print Kth of N eol-delimited chunks as -rN\n\
>  +                         would have generated.\n\
>  + -t, --term=CHAR         specify CHAR as eol.  This will also convert\n\
>  +                         -b to its line-delimited equivalent (-C if\n\
>  +                         splitting normally, -l if splitting by\n\
>  +                         chunks).  C escape sequences are accepted.\n\
>
> Thinking more about it, I think adding 2 modes of operation to the
> already slightly complicated -bCl options is too confusing.
> Since this is a separate mode of operation; i.e. one would be
> specifying a particular number of files for a different reason
> than a particular size, it would be better as a separate option.
> So I changed -n to operate as follows. This is more general if
> we want to add new split methods in future, and also compatible with
> the existing BSD -n without needing a redundant option.
>
> -n N split into N files based on size of input
> -n K/N output K of N to stdout
> -n l/N split into N files while maintaining lines
> -n l/K/N output K of N to stdout while maintaining lines
> -n r/N like `l' but use round robin distribution instead of size
> -n r/K/N likewise but only output K of N to stdout
>
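As a sketch of the size-based modes above, assuming each of the N chunks gets file_size/N bytes with the last chunk absorbing the remainder (my assumption about the rounding, not something stated in the patch), the -n K/N boundaries could be derived like this:

```python
def chunk_bounds(file_size, n):
    """Return (start, end) byte offsets for the N chunks of '-n N':
    each chunk gets file_size // n bytes, and the final chunk
    also absorbs the remainder (assumed rounding rule)."""
    base = file_size // n
    return [((k - 1) * base, file_size if k == n else k * base)
            for k in range(1, n + 1)]
```

Under this model, -n K/N just writes the byte range at index K-1 to stdout.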
> Other changes I made in the attached version are:
>
> Removed the -t option as that's separate.
> Removed erroneous 'c' from getopt() parameters.
> Use K/N in code rather than M/N to match user instructions.
> Added suffix len setter/checker based on N so that
> we fail immediately if the wrong -a is specified, or
> if -a is not specified we auto set it.
> Flagged 0/N as an error, rather than treating like /N.
> Changed r/K/N to buffer using stdio for much better performance (see below)
> Fixed up the errno passed to some error() calls
> Normalized all "write error" messages so that all of these commands
> output a single translated error message, of the form:
> "split: write error: No space left on device"
> split -n 1/10 $(which split) >/dev/full
> stdbuf -o0 split -n 1/10 $(which split) >/dev/full
> seq 10 | split -n r/1/10 >/dev/full
> seq 10 | stdbuf -o0 split -n r/1/10 >/dev/full
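The suffix-length setter/checker mentioned above reduces to finding the smallest length L with 26**L >= N. A minimal Python sketch (function names are mine, not the patch's, and this ignores any suffix reservation scheme split may use):

```python
def suffix_length_needed(n_files, alphabet=26):
    """Smallest suffix length whose alphabet**length can name
    n_files outputs, i.e. the minimum workable -a value."""
    length, capacity = 1, alphabet
    while capacity < n_files:
        length += 1
        capacity *= alphabet
    return length

def check_suffix_length(a, n_files, alphabet=26):
    """Fail immediately when the given -a is too small for N files."""
    if alphabet ** a < n_files:
        raise ValueError(
            "suffix length %d cannot name %d files" % (a, n_files))
```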
>
>
> Re the performance of the round robin implementation;
> using stdio helps a LOT as can be seen with:
> -------------------------------------------------------
> $ time yes | head -n10000000 | ./split-fwrite -n r/1/1 | wc -l
> 10000000
>
> real 0m1.568s
> user 0m1.486s
> sys 0m0.072s
>
> $ time yes | head -n10000000 | ./split-write -n r/1/1 | wc -l
> 10000000
>
> real 0m50.988s
> user 0m7.548s
> sys 0m43.250s
> -------------------------------------------------------
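The gap is plausible given what r/K/N does: lines are dealt out one at a time, so an unbuffered implementation issues one write(2) per line, while stdio coalesces many lines per syscall. A tiny sketch of the distribution itself (illustrative only, not the C code):

```python
def round_robin(lines, n):
    """Deal lines cyclically into n buckets, the way '-n r/N'
    deals input lines to its N output files; '-n r/K/N' would
    then emit bucket K-1 only."""
    buckets = [[] for _ in range(n)]
    for i, line in enumerate(lines):
        buckets[i % n].append(line)
    return buckets
```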
>
>
> I still need to look at the round robin implementation when
> outputting to file rather than stdout. I may default to using
> stdio, but give an option to flush each line. I'm testing
> with this currently which is performing badly just doing write()
> -------------------------------------------------------
> #create fifos
> yes | head -n4 | ../split -n r/4 fifo
> for f in x*; do rm $f && mkfifo $f; done
>
> #consumer
> (for f in x*; do md5sum $f& done) > md5sum.out
>
> #producer
> seq 100000 | split -n r/4
> -------------------------------------------------------
>
> BTW, other modes perform well with write()
> -------------------------------------------------------
> $ yes | head -n10000000 > 10m.txt
> $ time ./split -n l/1/1 <10m.txt | wc -l
> 10000000
>
> real 0m0.201s
> user 0m0.145s
> sys 0m0.043s
>
> $ time ./split -n 1/1 <10m.txt | wc -l
> 10000000
>
> real 0m0.199s
> user 0m0.154s
> sys 0m0.041s
>
> $ time ./split -n 1 <10m.txt
>
> real 0m0.088s
> user 0m0.000s
> sys 0m0.081s
> -------------------------------------------------------
>
> Here is what I still intend to do before checking in:
> s/pread()/dd::skip()/ or at least add pread to bootstrap.conf
> fix info docs for reworked interface
> try to refactor duplicated code
Attached is the finished split --number review.
It has these additional changes:
change all chunk function param names from n/tot to k/n
report -n 1/3/2 as an error with "3" not "2"
allow num chunks > file size, with empty files created for the rest
don't restrict chunk extraction to stdout to size_t limits
fix chunk function param types to uintmax_t, off_t, size_t etc.
reorganise limit checks so there are no integer overflows
rewrite ofd_check
rename to ofile_open
fix off-by-one errors
when rotating files, append rather than truncate
only try to rotate open files for E[NM]FILE errors
if we run out of open files, give ENFILE or EMFILE as appropriate
if we run out of open files, close all files and resort
to open,write,close, as we'll effectively be doing that anyway.
rewrite lines_chunk_split to use the same style as the other funcs
fix off-by-one errors
fix chunk_size calculation
merged lines_chunk_extract() into lines_chunk_split().
merged lines_rr_extract() into lines_rr().
don't use pread as it resets the offset so only valid if called once
also it's not available on some systems
support input offsets in the chunk extraction modes, i.e.
(dd skip=1 count=0; split ...) < file
For all -n modes, create all files if any data is read
in case there are existing files,
and to signal any waiting fifo consumers
Handle file size increases and decreases in {bytes,lines}_chunk_*
don't loop endlessly if file is truncated while reading
don't include buffer slop in output if file grows while reading
added an --unbuffered option for use in the round robin cases
added an --elide-empty-files option to suppress empty files
Also use safe_read rather than full_read in round robin cases
so we can immediately copy input to output
Beefed up the tests a lot
Explained what each mode does exactly in the info docs
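The input-offset item above can be pictured with a small sketch: compute the chunk over the bytes remaining past the current file offset, rather than over the whole file (Python, with a hypothetical helper name; the rounding rule is my assumption, not taken from the C code):

```python
import os

def extract_chunk(fd, k, n):
    """Read the Kth of N byte chunks of the data remaining on fd,
    honouring whatever offset a previous reader already consumed --
    the behaviour exercised by:
        (dd skip=1 count=0; split -n K/N) < file
    """
    start = os.lseek(fd, 0, os.SEEK_CUR)   # offset dd left behind
    size = os.fstat(fd).st_size - start    # bytes still unread
    base = size // n
    offset = start + (k - 1) * base
    length = size - (k - 1) * base if k == n else base
    os.lseek(fd, offset, os.SEEK_SET)
    return os.read(fd, length)
```

Note the sketch reads the current offset with lseek(SEEK_CUR) and seeks explicitly, consistent with the point above about avoiding pread when the offset matters across calls.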
cheers,
Pádraig.
Attachment: split-number.diff (text/plain)