bug-coreutils

Re: split behavior


From: Pádraig Brady
Subject: Re: split behavior
Date: Sat, 12 Sep 2009 04:51:41 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)

Roger McNichols wrote:
> 
> Currently using version 5.2.1 of coreutils 'split' command produces files
> with 'intelligent' suffixes.  That is, the number of letters (or digits)
> required is based on the known number of output files that will be
> required.

Actually coreutils does not employ 'intelligent' suffixes, as the
size of the input is not taken into account and the suffix length
defaults to 2. One could set it 'intelligently' outside of split using
something like the following, though this should really be done within
split itself:

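size=$(wc -c < "$file")                   # input size in bytes
chunk=4096                                # bytes per output file
chunks=$(( (size + chunk - 1) / chunk ))  # number of output files, rounded up
suffix_len=1                              # smallest n with 26^n >= chunks
capacity=26
while [ "$capacity" -lt "$chunks" ]; do
  capacity=$((capacity * 26))
  suffix_len=$((suffix_len + 1))
done
split -b "$chunk" -a "$suffix_len" "$file"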

> An OLD version of split (and I don't know which one because I don't have
> it anymore) used 'dumb' suffixes.  That is, it would start with aa, ab,
> ac, ..., ba, bb, bc, ... until it got to zz and then would jump to zzaa,
> zzab, zzac, ... etc. and then on to zzaaaa, zzaaab, zzaaac, etc...

I think I've seen this method before, but it's not in Solaris,
FreeBSD or alexautils. Grr, that's bugging me now.
Whatever implementation of split that was, it seems like a
good way to split arbitrarily sized input while keeping the
file names in lexical sort order.
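
To make the scheme concrete, here is a rough sketch of how such suffixes
could be generated. It is not taken from any split implementation. The
function name zz_suffix is made up for illustration, and it assumes that
every exhausted block of 676 two-letter names prepends one more "zz", so
the third block starts at zzzzaa rather than the zzaaaa quoted above
(zzaaaa would sort before zzab, which would break the ordering).

```shell
#!/bin/sh
# Hypothetical helper, not part of split: print the suffix for output
# file number $1 (counting from 0) under the assumed scheme.
zz_suffix() {
  n=$1
  letters=abcdefghijklmnopqrstuvwxyz
  prefix=
  # Each full block of 676 (26*26) names adds one "zz" to the prefix.
  while [ "$n" -ge 676 ]; do
    prefix="${prefix}zz"
    n=$((n - 676))
  done
  # The remainder selects the final two letters (cut counts from 1).
  first=$(printf '%s' "$letters" | cut -c $((n / 26 + 1)))
  second=$(printf '%s' "$letters" | cut -c $((n % 26 + 1)))
  printf '%s%s%s\n' "$prefix" "$first" "$second"
}
```

Under that assumption, file 675 is named zz, file 676 is zzaa and
file 1352 is zzzzaa, so a plain byte-wise sort (LC_ALL=C) of the names
keeps the pieces in order no matter how many there are.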

Also if the file size _is_ known but a suffix length that's too short
is specified, one could use this algorithm to ensure that you don't
get the "suffixes exhausted" error.
In fact, for consistency it would probably be better to always default
to a suffix length of 2, and fall back to this zzaa suffix scheme rather
than "intelligently" select the suffix length as described above.

I'll look at doing this soon.

thanks,
Pádraig.



