[Bug-wget] [bug #45803] More URI filters (regex etc) from commandline, f

From: grarpamp
Subject: [Bug-wget] [bug #45803] More URI filters (regex etc) from commandline, file, and program
Date: Wed, 02 Sep 2015 06:11:38 +0000
User-agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:39.0) Gecko/20100101 Firefox/39.0

Follow-up Comment #2, bug #45803 (project wget):

# Parallel

If wget only fetches things serially...
I deferred any parallelism to the sole filter program, in case it
wanted to spread out and recombine its decision process into a
single logical answer.
If wget does fetching in parallel...
Yes, it could spawn checks in parallel, but it would have to be the
same program, not prog1 prog2 prog3, else there could be three
different results.
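
As a rough illustration of the serial case, here is a sketch of a single filter program that fans its own checks out in parallel and recombines them into one answer. The three check functions and their rules are invented for the example, and the 0 = accept / 1 = reject mapping is an assumption, not something this comment specifies.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit

# Hypothetical checks, invented for illustration; each returns True to accept.
def scheme_ok(uri):
    return urlsplit(uri).scheme in ("http", "https", "ftp")

def not_archive(uri):
    return not urlsplit(uri).path.endswith((".zip", ".iso"))

def path_depth_ok(uri):
    return urlsplit(uri).path.count("/") <= 8

CHECKS = (scheme_ok, not_archive, path_depth_ok)

def decide(uri):
    """Run every check concurrently, then merge into one logical answer."""
    with ThreadPoolExecutor(max_workers=len(CHECKS)) as pool:
        results = list(pool.map(lambda check: check(uri), CHECKS))
    # The URI is accepted only if all checks agree.
    return all(results)

def main():
    uri = os.environ.get("WGET_FILTER_URI", "")
    # Assumed exit status mapping: 0 = accept, 1 = reject.
    return 0 if decide(uri) else 1

# A real filter script would finish with: sys.exit(main())
```

However the work is split internally, wget sees only one program and one exit status, so there is never more than one result to reconcile.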


WGET_FILTER_URI. This variable shall contain exactly what is passed
to the current commandline regexes today, ie:

I wanted wget to have a base mode where each of the three methods
would be fed the same exact string by default, such that the user
can test and swap regexes between them equally.
Of course wget could feed other things (such as CSV) via enhanced
modes to the filter program, which could in turn do anything it
wants with them.

# Referer, etc

the following optional set
of variables should also be passed to the program if readily
implementable today (each of them can result in different serving
hierarchy contexts)

What would WGET_FILTER_REFERER be used for? Yes, referers, time of
day, agent, dynamic pages, etc can all serve up different content...
those are typically *logical* differences within the same service.

The variables I put in this section are the typical set used in
server side configs to present entirely different *physical*
hierarchies of data or server [virtual] instances and apply to both
HTTP and FTP. (Wget is currently dumb about that regarding its on
disk storage... it doesn't encode such info into the basedir pathname
and thus will clobber itself by physically merging multiple contexts
on disk during recursive spidering. That's a wget design failure
to fix.)

Thus if logical things like referer are felt needed, I'd rather see
the entire set of client request headers to the server be stuffed
into this CSV you speak of as WGET_FILTER_CLIREQ_CSV.

I also wanted the input passing mechanism to be via environment
variables, since novice scripters and coders can use those but may
not yet know how to process standard input (or the filesystem), and
that would prevent them from using --uri-filter-prog.
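
To show how little a filter needs when the input arrives in the environment, here is a minimal sketch of such a program. The two regexes are made up for the example. Status 2 is the reserved fall-through described further down in this comment; using 0 for accept and 1 for reject is an assumption on my part here.

```python
import os
import re

# 0 and 1 are assumed meanings; 2 is the reserved fall-through status.
ACCEPT, REJECT, FALLTHROUGH = 0, 1, 2

# Hypothetical rules, invented for illustration.
ACCEPT_RE = re.compile(r"\.(html?|txt)$")
REJECT_RE = re.compile(r"/private/")

def filter_status(environ=os.environ):
    """Return the exit status for the URI found in WGET_FILTER_URI."""
    uri = environ.get("WGET_FILTER_URI")
    if uri is None:
        # Nothing to judge, so leave the decision to the caller's default.
        return FALLTHROUGH
    if REJECT_RE.search(uri):
        return REJECT
    if ACCEPT_RE.search(uri):
        return ACCEPT
    return FALLTHROUGH

# A real filter script would finish with: sys.exit(filter_status())
```

No stdin parsing and no file handling is involved: one environment lookup and a couple of regex tests are the whole program.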

I'm not keen on passing more things via the filesystem unless wget's
other metafile handling (such as cookies, logs, and even a future
"resume full prior state of crawl") is also cleaned up in the
process. By this I mean that there should probably be some control
flags such as --statefile-basedir and --statefile-basedir-auto that
will put all these statefiles under one dir (optionally auto mkstemp),
and under default filenames.

The filesystem is also slower, but could be useful in other ways.

It would be possible to support multiple input methods with:

env-basic: WGET_FILTER_URI
env-phys: my full set of vars
stdin-req: the entire client request via stdin
fs-csv: client request via filesystem
...: and other permutations

The idea was to keep it simple enough to get the three feature
enhancements out to people quickly.

For the first, I put setenv() and system() at utils.c:949
rev: 52228516b5d00c1dcf3623c4e3250490d1eb1d60

I added exit status 2 to the spec as reserved. The program may
utilize it as an exit catchall for URIs that fall through its
explicit accept / reject checks, to whatever the default sense is,
as set with
--uri-filter-prog-default=accept:reject, default reject.
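
The calling side could look roughly like the sketch below. It is Python rather than wget's C, and it uses subprocess.run() in place of the setenv() and system() pair mentioned above, so it is only an illustration of the flow. The function names are hypothetical, 0 = accept and 1 = reject are assumed, and sending every status other than 0 or 1 to the default is a simplification of mine; the spec only reserves 2 for that purpose.

```python
import os
import subprocess

def resolve(status, default="reject"):
    """Map a filter exit status to a decision."""
    if status == 0:      # assumed: accept
        return "accept"
    if status == 1:      # assumed: reject
        return "reject"
    # Status 2 is the reserved fall-through. Treating every other
    # status the same way is an assumption of this sketch.
    return default

def run_filter(argv, uri, default="reject"):
    """Run the filter program with WGET_FILTER_URI set and resolve its answer."""
    env = dict(os.environ, WGET_FILTER_URI=uri)
    status = subprocess.run(argv, env=env).returncode
    return resolve(status, default)
```

With default="reject", a filter that exits 2 for a URI it has no opinion on causes that URI to be skipped; passing default="accept" flips that without touching the filter itself.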

WGET_FILTER_HOST should be as in the original, no DNS conversion.

Feel free to run with it as desired, it should be readily expandable
to anyone's needs.

