[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#7401: [PATCH] split: --chunks option

From: Pádraig Brady
Subject: bug#7401: [PATCH] split: --chunks option
Date: Sun, 14 Nov 2010 20:54:09 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv: Gecko/20100227 Thunderbird/3.0.3

On 05/02/10 12:40, Pádraig Brady wrote:
> I got a bit of time for the review last night...
> This was your last interface change for this:
>    -b, --bytes=SIZE        put SIZE bytes per output file\n\
> +  -b, --bytes=/N          generate N output files\n\
> +  -b, --bytes=K/N         print Kth of N chunks of file\n\
>    -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output
> file\n\
>    -d, --numeric-suffixes  use numeric suffixes instead of alphabetic\n\
>    -l, --lines=NUMBER      put NUMBER lines per output file\n\
> +  -l, --lines=/N          generate N eol delineated output files\n\
> +  -l, --lines=K/N         print Kth of N eol delineated chunks\n\
> +  -n, --number=N          same as --bytes=/N\n\
> +  -n, --number=K/N        same as --bytes=K/N\n\
> +  -r, --round-robin=N     generate N eol delineated output files using\n\
> +                            round-robin style distribution.\n\
> +  -r. --round-robin=K/N   print Kth of N eol delineated chunk as -rN
> would\n\
> +                            have generated.\n\
> +  -t, --term=CHAR         specify CHAR as eol. This will also convert\n\
> +                            -b to its line delineated equivalent (-C if\n\
> +                            splitting normally, -l if splitting by\n\
> +                            chunks). C escape sequences are accepted.\n\
> Thinking more about it, I think adding 2 modes of operation to the
> already slightly complicated -bCl options is too confusing.
> Since this is a separate mode of operation; i.e. one would be
> specifying a particular number of files for a different reason
> than a particular size, it would be better as a separate option.
>   So I changed -n to operate as follows. This is more general if
> we want to add new split methods in future, and also compatible with
> the existing BSD -n without needing a redundant option.
> -n N       split into N files based on size of input
> -n K/N     output K of N to stdout
> -n l/N     split into N files while maintaining lines
> -n l/K/N   output K of N to stdout while maintaining lines
> -n r/N     like `l' but use round robin distribution instead of size
> -n r/K/N   likewise but only output K of N to stdout
> Other changes I made in the attached version are:
> Removed the -t option as that's separate.
> Removed erroneous 'c' from getopt() parameters.
> Use K/N in code rather than M/N to match user instructions.
> Added suffix len setter/checker based on N so that
>   we fail immediately if the wrong -a is specified, or
>   if -a is not specified we auto set it.
> Flagged 0/N as an error, rather than treating like /N.
> Changed r/K/N to buffer using stdio for much better performance (see below)
> Fixed up the errno on some errors()
> Normalized all "write error" messages so that all of these commands
> output a single translated error message, of the form:
> "split: write error: No space left on device"
>   split -n 1/10 $(which split) >/dev/full
>   stdbuf -o0 split -n 1/10 $(which split) >/dev/full
>   seq 10 | split -n r/1/10 >/dev/full
>   seq 10 | stdbuf -o0 split -n r/1/10 >/dev/full
> Re the performance of the round robin implementation;
> using stdio helps a LOT as can be seen with:
> -------------------------------------------------------
> $ time yes | head -n10000000 | ./split-fwrite -n r/1/1 | wc -l
> 10000000
> real    0m1.568s
> user    0m1.486s
> sys     0m0.072s
> $ time yes | head -n10000000 | ./split-write -n r/1/1 | wc -l
> 10000000
> real    0m50.988s
> user    0m7.548s
> sys     0m43.250s
> -------------------------------------------------------
> I still need to look at the round robin implementation when
> outputting to file rather than stdout. I may default to using
> stdio, but give an option to flush each line. I'm testing
> with this currently which is performing badly just doing write()
> -------------------------------------------------------
> #create fifos
> yes | head -n4 | ../split -n r/4 fifo
> for f in x*; do rm $f && mkfifo $f; done
> #consumer
> (for f in x*; do md5sum $f& done) > md5sum.out
> #producer
> seq 100000 | split -n r/4
> -------------------------------------------------------
> BTW, other modes perform well with write()
> -------------------------------------------------------
> $ yes | head -n10000000 > 10m.txt
> $ time ./split -n l/1/1 <10m.txt | wc -l
> 10000000
> real    0m0.201s
> user    0m0.145s
> sys     0m0.043s
> $ time ./split -n 1/1 <10m.txt | wc -l
> 10000000
> real    0m0.199s
> user    0m0.154s
> sys     0m0.041s
> $ time ./split -n 1 <10m.txt
> real    0m0.088s
> user    0m0.000s
> sys     0m0.081s
> -------------------------------------------------------
> Here is stuff I intend TODO before checking in:
>  s/pread()/dd::skip()/ or at least add pread to bootstrap.conf
>  fix info docs for reworked interface
>  try to refactor duplicated code

Attached is the finished split --number review.
It has these additional changes:

change all chunk function param names from n/tot to k/n
report -n 1/3/2 as an error with "3" not "2"
allow num chunks < file size with empty files created for the rest
don't restrict chunk extraction to stdout to size_t limits
fix chunk function param types to uintmax_t, off_t, size_t etc.
reorganise limit checks so there are no integer overflows
rewrite ofd_check
  rename to ofile_open
  fix off by one errors
  when rotating files, append rather than truncate
  only try to rotate open files for E[NM]FILE errors
  if we run out of open files, give ENFILE or EMFILE as appropriate
  if we run out of open files, close all files and resort
   to open,write,close, as we'll effectively be doing that anyway.
rewrite lines_chunk_split to use the same style as the other funcs
  fix off by ones errors
  fix chunk_size calculation
merged lines_chunk_extract() into lines_chunk_split().
merged lines_rr_extract() into lines_rr().
don't use pread as it resets the offset so only valid if called once
 also it's not available on some systems
support input offsets in the chunk extraction modes. I.E.
  (dd skip=1 count=0; split ...) < file
For all -n modes, create all files if any data read
  in case there are existing files,
  and to signal any waiting fifo consumers
Handle file size increases and decreases in {bytes,lines}_chunk_*
  don't loop endlessly if file is truncated while reading
  don't include buffer slop in output if file grows while reading
added an --unbuffered option for use in the round robin cases
added an --elide-empty-files option to suppress empty files
Also use safe_read rather than full_read in round robin cases
  so we can immediately copy input to output
Beefed up the tests a lot
Explained what each mode does exactly in the info docs


Attachment: split-number.diff
Description: Text Data

reply via email to

[Prev in Thread] Current Thread [Next in Thread]