Re: split: support unlimited number of split files
From: Pádraig Brady
Subject: Re: split: support unlimited number of split files
Date: Fri, 24 Feb 2012 23:12:29 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0

On 02/24/2012 10:08 PM, Jérémy Compostella wrote:
> All,
>
> I'm interested in implementing this feature. In fact, I already made a
> quick implementation to play with.
>
> I refer to the original thread : "split behavior"
> http://lists.gnu.org/archive/html/bug-coreutils/2009-09/msg00217.html
>
> To summarise it (quick version), in the past the split command provided
> this unlimited number of split files as its default behavior. But it did
> not conform to POSIX, so it has been removed (see
> http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commit;h=65cbf7d1).
Just to consider POSIX again, it is fairly explicit:
"By default, the names of the output files shall be 'x', followed by a
two-character suffix from the character set as described above, starting
with "aa", "ab", "ac", and so on, and continuing until the suffix "zz",
for a maximum of 676 files."
However I think it's incorrect to impose the arbitrary limit. Note the spec
also says:
"The -a option was added to overcome the limitation of being able to create
only 676 files."
So there doesn't seem to be an intention to limit the number of output files.
I think it's just that alternative solutions were not considered.
Also with numeric suffixes the limit is only 100 files.
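To spell out where those two limits come from, here is a minimal sketch of
the arithmetic, assuming a fixed suffix length as selected by -a (default 2).
The helper name max_files is purely illustrative and is not part of split.

```python
def max_files(alphabet_size: int, suffix_length: int = 2) -> int:
    """Number of distinct fixed-length suffixes (hypothetical helper)."""
    return alphabet_size ** suffix_length


print(max_files(26))     # 676: default alphabetic suffixes, aa..zz
print(max_files(10))     # 100: numeric suffixes, 00..99
print(max_files(26, 3))  # 17576: alphabetic suffixes with -a 3
```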
> This old behavior was:
> $ cat /var/log/messages | split -2 - /tmp/x.
> x.aa
> x.ab
> ...
> x.yz
> x.zaaa
> x.zaab
> ...
> x.zyzz
> x.zzaaaa
> x.zzaaab
>
> But, others in the "split behavior" thread propose something like:
> x.aa
> ...
> x.zz
> x.zzaa
> ...
> x.zzzz
> x.zzzzaa
>
> These two possibilities serve the same goal: the split files, once
> alphabetically sorted, are in the correct order.
>
> However, the second possibility does not satisfy me, since using the
> --additional-suffix option breaks that ordering:
> $ cat /var/log/messages | split --additional-suffix=.txt -2 - /tmp/x. && \
>   ls /tmp/x.* | sort
> x.aa.txt
> ...
> x.zy.txt
> x.zzaa.txt
> ...
> x.zztw.txt
> x.zz.txt <---- :(
> x.zztx.txt
> ...
>
> Therefore, my opinion is: the old behavior is better suited to the
> current split option set.
Good. That's what I'd prefer anyway so as to be compatible
with old data sets. Note '.' sorts before digits (-d) too,
so there should be no ordering issues with --additional-suffix=... either.
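For reference, the old scheme can be sketched roughly as follows. This is
only an illustration in Python, not the coreutils C code: the names
old_style_suffixes and first_names are hypothetical, and the ordering check
assumes plain code point comparison (roughly what sort does under LC_ALL=C).
Applying the same rule to digits is an assumption made by analogy.

```python
import itertools
import string


def old_style_suffixes(alphabet=string.ascii_lowercase, start_len=2):
    """Yield suffixes in the old, unbounded order: aa..yz, zaaa..zyzz, zzaaaa.."""
    last = alphabet[-1]   # 'z' for letters
    head = alphabet[:-1]  # every character except the last one
    for generation in itertools.count():
        prefix = last * generation      # "", "z", "zz", ...
        width = start_len + generation  # 2, 3, 4, ...
        # The first character after the prefix is never `last`; that is
        # what keeps each generation sorting before the next one.
        for first in head:
            for rest in itertools.product(alphabet, repeat=width - 1):
                yield prefix + first + "".join(rest)


def first_names(count, prefix="x.", additional_suffix="", **kwargs):
    """Return the first `count` output names in creation order."""
    suffixes = itertools.islice(old_style_suffixes(**kwargs), count)
    return [prefix + s + additional_suffix for s in suffixes]


names = first_names(17552, additional_suffix=".txt")
print(names[649], names[650])      # x.yz.txt x.zaaa.txt
print(names[17549], names[17550])  # x.zyzz.txt x.zzaaaa.txt
print(names == sorted(names))      # True
```

Two names produced this way always differ at some position before the
shorter one ends, so appending any fixed string leaves their relative order
unchanged under such a comparison.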
> In the "split behavior" thread it was proposed to look at the
> POSIXLY_CORRECT environment variable to activate or not the unlimited
> split files behavior. But, I think it's dangerous. Indeed, it breaks the
> usual files list: x.aa ... x.zz ... vs. x.aa ... x.yz x.zaa .. (the x.zz
> file does not exist anymore). Users may be surprised and older scripts
> may fail.
We could key the new behavior on POSIXLY_CORRECT, but there is no need IMHO.
Keying behavior on POSIXLY_CORRECT is undesirable and should only be a very last resort.
> Maybe adding a new option or a new argument would be fine. I was
> thinking of the following:
> * --unlimited-suffixes
> * --suffix-length=unlimited or --suffix-length=auto
If we were to add an option, --suffix-length=auto is the best IMHO.
But I don't think we even need that. Just do it by default.
> With this new option (or argument), the user would keep the ability to
> select the starting suffix length. For example:
> $ cat /var/log/messages | split --suffix-length=auto --suffix-length 3 -2 - \
>   /tmp/x.
> x.aaa <--- start with suffix length = 3
No need for that functionality I think.
cheers,
Pádraig.