findutils-patches

Re: [Findutils-patches] new predicate


From: Konrad Eisele
Subject: Re: [Findutils-patches] new predicate
Date: Thu, 27 May 2010 23:49:46 +0200

-------- Original Message --------
> Date: Thu, 27 May 2010 15:12:12 -0600
> From: Eric Blake <address@hidden>
> To: Konrad Eisele <address@hidden>
> CC: address@hidden
> Subject: Re: [Findutils-patches] new predicate

> On 05/27/2010 02:04 PM, Konrad Eisele wrote:
> > I wanted to submit a patch that is quite short and
> > meant more as a feature request. It adds the predicate
> > "-dtype <regex>" (dtype meaning datatype). The dtype
> > predicate uses libmagic from the "file" command to get
> > the *content datatype* of the file in view, then matches a regex
> > against it. E.g. "echo abc>f.txt; file f.txt" yields "ASCII text".
> > Therefore "find f.txt -dtype '.*text.*'" would match the regex
> > ".*text.*" against "ASCII text" (and succeed).
> 
> Personally, I'm a bit reluctant to add this patch, because you can
> achieve the same effect with more efficient use of existing predicates:
> 
> > 
> > The problem this patch addresses is this:
> > I have several source project directories with several million
> > files in them. I want to make a backup, but I want
> > to back up only text files (Makefiles, shell scripts, .c and
> > .h files, etc.). Currently I do something like this:
> > (for f in `find <srcdir> -type f`; do if (file $f | cut -d: -f2 | grep text &> /dev/null ); then echo $f; fi; done) > file.list
> 
> find <srcdir> -type f -exec sh -c \
>   'file "$@" | sed -n "s/:.*text.*//p"' sh {} + > file.list

Thanks, I wasn't aware of (or able to come up with)
such an expression. For me this works well: my previous
version would run forever, while this one is usable. I guess
that even though my patch would be faster and simpler to
type, it would introduce a dependency on libmagic that
might not be worth the effort.

Here are the results of running both on the Linux
source tree:

time /usr/bin/find /usr/src/linux-2.6.29.6/ -type f -exec sh -c 'file "$@" | sed -n "s/:.*text.*//p"' sh {} + | xargs file $1
real    3m17.519s
user    5m0.162s
sys     0m6.233s

time /usr/bin/find /usr/src/linux-2.6.29.6/ -dtype .*text.*  | xargs file $1
real    1m56.629s
user    3m9.618s
sys     0m3.565s

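For a rough sense of the gap between the two runs above, the wall-clock ("real") times can be converted to seconds and divided. This is only a sketch: the helper names to_seconds and speedup are made up for illustration, and a single run of each command says nothing about cache effects or run-to-run variance.

```shell
#!/bin/sh
# Hypothetical helpers, not part of findutils or of the thread.

# Convert a `time`-style value such as "3m17.519s" to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print how many times longer the first duration is than the second.
speedup() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", a / b }'
}

# "real" times reported above: -exec/sed pipeline vs. the -dtype patch.
speedup 3m17.519s 1m56.629s   # prints 1.69
```

So on this one measurement the patched find finished in a bit under 60% of the pipeline's wall-clock time.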


> 
> Remember, the reason your version was so slow is that it was spawning a
> subshell, file, cut, and grep command per file; my version uses exec {}
> + to cram as many files as possible per file(1) invocation, then uses
> sed instead of cut|grep for a further reduction in processes.
> 
> Meanwhile, be aware that this solution assumes that none of the files
> found will contain : or newline; you may want to add some defensive
> programming into your find expression to reject file names matching
> those patterns.
> 
> -- 
> Eric Blake   address@hidden    +1-801-349-2682
> Libvirt virtualization library http://libvirt.org
> 
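
The batching point above can be checked directly. The following is a minimal sketch, not part of the original exchange: count_invocations is a hypothetical helper that counts how many times find starts the -exec command when it is terminated with ';' versus '+'.

```shell
#!/bin/sh
# Hypothetical helper: count how often find starts the -exec command.
# $1: directory to scan, $2: terminator, either ';' or '+'
count_invocations() {
    # Each started command prints exactly one line, so wc -l counts starts.
    find "$1" -type f -exec sh -c 'echo run' sh {} "$2" | wc -l | tr -d ' '
}

# Small demo on a scratch directory holding three files.
demo_dir=$(mktemp -d)
touch "$demo_dir/a" "$demo_dir/b" "$demo_dir/c"

count_invocations "$demo_dir" ';'   # one process per file: prints 3
count_invocations "$demo_dir" '+'   # one batched process:  prints 1

rm -rf "$demo_dir"
```

With millions of files, '+' still needs more than one invocation, because each batch is limited by the system's maximum command-line length, but the process count stays far below one per file.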

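The caveat about ':' and newline in file names could be handled along the following lines. This is a sketch under assumptions, not a command from the thread: list_text_files is a hypothetical wrapper, and it uses -path rather than -name so that a ':' or newline in a directory name is rejected too. Skipped files are silently left out of the list, which may or may not be acceptable for a backup.

```shell
#!/bin/sh
# Hypothetical wrapper around the find expression quoted above.
# $1: directory to scan
list_text_files() {
    # Skip any path containing ':' or a newline, since the sed step
    # below splits file(1) output at the first ':' on each line.
    find "$1" -type f ! -path '*:*' ! -path '*
*' -exec sh -c 'file "$@" | sed -n "s/:.*text.*//p"' sh {} +
}

# Small demo: one plain text file, one text file with ':' in its
# name, and one binary file.
demo_dir=$(mktemp -d)
printf 'abc\n' > "$demo_dir/f.txt"
printf 'abc\n' > "$demo_dir/odd:name.txt"
printf '\000\001\002' > "$demo_dir/blob.bin"

list_text_files "$demo_dir"   # lists only .../f.txt

rm -rf "$demo_dir"
```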