Re: split: support unlimited number of split files

coreutils

From:	Pádraig Brady
Subject:	Re: split: support unlimited number of split files
Date:	Fri, 24 Feb 2012 23:12:29 +0000
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:6.0) Gecko/20110816 Thunderbird/6.0

On 02/24/2012 10:08 PM, Jérémy Compostella wrote:
> All,
> 
> I'm interested in implementing this feature. In fact, I already made a
> quick implementation to play with.
> 
> I refer to the original thread: "split behavior"
> http://lists.gnu.org/archive/html/bug-coreutils/2009-09/msg00217.html
> 
> To summarise (quick version): in the past the split command provided
> an unlimited number of split files as its default behavior. But that did
> not conform to POSIX, so it was removed (see
> http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commit;h=65cbf7d1).


Just to consider POSIX again, it is fairly explicit:

"By default, the names of the output files shall be 'x', followed by a
two-character suffix from the character set as described above, starting
with "aa", "ab", "ac", and so on, and continuing until the suffix "zz",
for a maximum of 676 files."

However I think it's incorrect to impose the arbitrary limit. Note the
spec also says:

"The -a option was added to overcome the limitation of being able to
create only 676 files."

So there doesn't seem to be an intention to limit the number of output files.
I think it's just that alternative solutions were not considered.
Also with numeric suffixes the limit is only 100 files.
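
For reference, both limits are just the alphabet size raised to the
suffix length. A quick Python sketch of the arithmetic (max_files is a
made-up helper for illustration, not anything in split itself):

```python
import string


def max_files(alphabet_size: int, suffix_length: int = 2) -> int:
    """Count the distinct fixed-length suffixes (hypothetical helper)."""
    return alphabet_size ** suffix_length


# Default alphabetic suffixes: 26 letters, 2 characters.
print(max_files(len(string.ascii_lowercase)))     # 676
# Numeric suffixes (-d): 10 digits, 2 characters.
print(max_files(len(string.digits)))              # 100
# A longer suffix, as with -a 3, only raises the ceiling.
print(max_files(len(string.ascii_lowercase), 3))  # 17576
```

So -a moves the limit but never removes it, which is the gap being
discussed here.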

> This old behavior was:
> $ cat /var/log/messages | split -2 - /tmp/x.
> x.aa
> x.ab
> ...
> x.yz
> x.zaaa
> x.zaab
> ...
> x.zyzz
> x.zzaaaa
> x.zzaaab
> 
> But, others in the "split behavior" thread propose something like:
> x.aa
> ...
> x.zz
> x.zzaa
> ...
> x.zzzz
> x.zzzzaa
> 
> These two possibilities serve the same goal: the split files, once
> sorted alphabetically, come out in the correct order.
> 
> However, the second possibility does not satisfy me, since the
> --additional-suffix option breaks that ordering:
> $ cat /var/log/messages | split --additional-suffix=.txt -2 - /tmp/x. && ls /tmp/x.* | sort
> x.aa.txt
> ...
> x.zy.txt
> x.zzaa.txt
> ...
> x.zztw.txt
> x.zz.txt      <---- :(
> x.zztx.txt
> ...
> 
> Therefore, in my opinion the old behavior is better suited to the
> current split option set.

Good. That's what I'd prefer anyway so as to be compatible
with old data sets. Note '.' sorts before digits (-d) too,
so there should be no ordering issues with --additional-suffix=... either.
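
To make the ordering argument concrete, here is a rough Python sketch of
both naming schemes as I read them from the listings quoted above. It is
not the split implementation. The names old_scheme, appended_scheme,
names and ignore_punct are invented for illustration, and ignore_punct
is only a crude stand-in for a locale whose collation skips punctuation,
which is an assumption about how the listing above was sorted.

```python
import itertools
import string

LETTERS = string.ascii_lowercase


def old_scheme():
    """Yield aa..yz, zaaa..zyzz, zzaaaa.. as in the quoted old behavior."""
    for level in itertools.count():
        width = 2 + level
        # The first varying letter stops at 'y'; a leading 'z' marks a
        # wider suffix.
        for first in LETTERS[:-1]:
            for rest in itertools.product(LETTERS, repeat=width - 1):
                yield "z" * level + first + "".join(rest)


def appended_scheme():
    """Yield aa..zz, zzaa..zzzz, zzzzaa.. (the alternative proposal)."""
    for level in itertools.count():
        for pair in itertools.product(LETTERS, repeat=2):
            yield "zz" * level + "".join(pair)


def names(suffixes, count, extra=".txt"):
    """Build the first `count` output names with an additional suffix."""
    return ["x." + s + extra for s in itertools.islice(suffixes, count)]


def ignore_punct(name):
    """Crude sort key: compare names with the '.' characters dropped."""
    return name.replace(".", "")


old = names(old_scheme(), 700)
alt = names(appended_scheme(), 700)

print(old[649], old[650])                     # x.yz.txt x.zaaa.txt
print(sorted(old) == old)                     # True
print(sorted(old, key=ignore_punct) == old)   # True
print(sorted(alt, key=ignore_punct) == alt)   # False
```

With the old scheme, two names of different widths always differ inside
the generated suffix, before the additional suffix is reached, so the
sort order matches the creation order under either key. Python compares
strings by code point here, which should correspond to sort under
LC_ALL=C; other locales may collate differently.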

> In the "split behavior" thread it was proposed to look at the
> POSIXLY_CORRECT environment variable to enable or disable the unlimited
> split files behavior. But I think that's dangerous. Indeed, it changes the
> usual file list: x.aa ... x.zz ... vs. x.aa ... x.yz x.zaa .. (the x.zz
> file does not exist anymore). Users may be surprised and older scripts
> may fail.

We could key the new behavior on POSIXLY_CORRECT, but there is no need IMHO.
Keying behavior on POSIXLY_CORRECT is undesirable and should only be a
very last resort.

> Maybe adding a new option or a new argument would be fine. I was
> thinking of the following:
> * --unlimited-suffixes
> * --suffix-length=unlimited or --suffix-length=auto

If we were to add an option, --suffix-length=auto is the best IMHO.
But I don't think we even need that. Just do it by default.

> With this new option (or argument), the user would keep the ability to
> select the initial suffix length. For example:
> $ cat /var/log/messages | split --suffix-length=auto --suffix-length 3 -2 - /tmp/x.
> x.aaa  <--- start with suffix length = 3

No need for that functionality I think.

cheers,
Pádraig.
