bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] split: --chunks option


From: Jim Meyering
Subject: Re: [PATCH] split: --chunks option
Date: Sat, 06 Feb 2010 08:56:36 +0100

Pádraig Brady wrote:

> I got a bit of time for the review last night...
>
> This was your last interface change for this:
>
>    -b, --bytes=SIZE        put SIZE bytes per output file\n\
> +  -b, --bytes=/N          generate N output files\n\
> +  -b, --bytes=K/N         print Kth of N chunks of file\n\
>    -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file\n\
>    -d, --numeric-suffixes  use numeric suffixes instead of alphabetic\n\
>    -l, --lines=NUMBER      put NUMBER lines per output file\n\
> +  -l, --lines=/N          generate N eol delineated output files\n\
> +  -l, --lines=K/N         print Kth of N eol delineated chunks\n\
> +  -n, --number=N          same as --bytes=/N\n\
> +  -n, --number=K/N        same as --bytes=K/N\n\
> +  -r, --round-robin=N     generate N eol delineated output files using\n\
> +                            round-robin style distribution.\n\
> +  -r. --round-robin=K/N   print Kth of N eol delineated chunk as -rN would\n\
> +                            have generated.\n\
> +  -t, --term=CHAR         specify CHAR as eol. This will also convert\n\
> +                            -b to its line delineated equivalent (-C if\n\
> +                            splitting normally, -l if splitting by\n\
> +                            chunks). C escape sequences are accepted.\n\
>
> Thinking more about it, I think adding 2 modes of operation to the
> already slightly complicated -bCl options is too confusing.
> Since this is a separate mode of operation; i.e. one would be
> specifying a particular number of files for a different reason
> than a particular size, it would be better as a separate option.
>   So I changed -n to operate as follows. This is more general if
> we want to add new split methods in future, and also compatible with
> the existing BSD -n without needing a redundant option.

I like it.

> -n N       split into N files based on size of input
> -n K/N     output K of N to stdout

s/K/Kth/ removes an ambiguity (and two more, below)

> -n l/N     split into N files while maintaining lines

s/while maintaining/without splitting/ (and below)

> -n l/K/N   output K of N to stdout while maintaining lines
> -n r/N     like `l' but use round robin distribution instead of size
> -n r/K/N   likewise but only output K of N to stdout
>
> Other changes I made in the attached version are:
>
> Removed the -t option as that's separate.
> Removed erroneous 'c' from getopt() parameters.
> Use K/N in code rather than M/N to match user instructions.
> Added suffix len setter/checker based on N so that
>   we fail immediately if the wrong -a is specified, or
>   if -a is not specified we auto set it.
> Flagged 0/N as an error, rather than treating like /N.
> Changed r/K/N to buffer using stdio for much better performance (see below)
> Fixed up the errno on some errors()
> Normalized all "write error" messages so that all of these commands
> output a single translated error message, of the form:
> "split: write error: No space left on device"
>   split -n 1/10 $(which split) >/dev/full
>   stdbuf -o0 split -n 1/10 $(which split) >/dev/full
>   seq 10 | split -n r/1/10 >/dev/full
>   seq 10 | stdbuf -o0 split -n r/1/10 >/dev/full

Nice.

> Re the performance of the round robin implementation;
> using stdio helps a LOT as can be seen with:
> -------------------------------------------------------
> $ time yes | head -n10000000 | ./split-fwrite -n r/1/1 | wc -l
> 10000000
>
> real    0m1.568s
> user    0m1.486s
> sys     0m0.072s
>
> $ time yes | head -n10000000 | ./split-write -n r/1/1 | wc -l
> 10000000
>
> real    0m50.988s
> user    0m7.548s
> sys     0m43.250s

Indeed.  We don't see a 32-x speed-up (esp. non-algorithmic) very often.

> I still need to look at the round robin implementation when
> outputting to file rather than stdout. I may default to using
> stdio, but give an option to flush each line. I'm testing
> with this currently which is performing badly just doing write()
> -------------------------------------------------------
> #create fifos
> yes | head -n4 | ../split -n r/4 fifo
> for f in x*; do rm $f && mkfifo $f; done
>
> #consumer
> (for f in x*; do md5sum $f& done) > md5sum.out
>
> #producer
> seq 100000 | split -n r/4
> -------------------------------------------------------
>
> BTW, other modes perform well with write()
> -------------------------------------------------------
> $ yes | head -n10000000 > 10m.txt
> $ time ./split -n l/1/1 <10m.txt | wc -l
> 10000000
>
> real    0m0.201s
> user    0m0.145s
> sys     0m0.043s
>
> $ time ./split -n 1/1 <10m.txt | wc -l
> 10000000
>
> real    0m0.199s
> user    0m0.154s
> sys     0m0.041s
>
> $ time ./split -n 1 <10m.txt
>
> real    0m0.088s
> user    0m0.000s
> sys     0m0.081s
> -------------------------------------------------------
>
> Here is stuff I intend TODO before checking in:
>  s/pread()/dd::skip()/ or at least add pread to bootstrap.conf
>  fix info docs for reworked interface
>  try to refactor duplicated code

Thanks for doing all that!




reply via email to

[Prev in Thread] Current Thread [Next in Thread]