Re: [Bug-wget] [PATCH] New option: --rename-output: modify output filename with perl

From: Ángel González
Subject: Re: [Bug-wget] [PATCH] New option: --rename-output: modify output filename with perl
Date: Sat, 03 Aug 2013 23:50:48 +0200
User-agent: Thunderbird

On 03/08/13 21:07, Micah Cowan wrote:
On Sat, Aug 03, 2013 at 04:11:59PM +0200, Tim Rühsen wrote:
As a second option, we could introduce (now or later)
        --name-filter-program="program REGEX"

The 'program' answers each line it gets (the original filename) with exactly one
output line (the new filename) as long as Wget does not close the pipe.
The 'program' needs to be started only once...
Given the difficulty for novice users to ensure that the program is
line-buffered (unless, again, we do something like allocate a pty), I
still feel that spawn-once will pose too much "surprise" (as in
"principle of least surprise") to non-expert shell folks to be the
default. And I still feel that it doesn't necessarily even pose any
realistic advantage, given that we're likely to wait on network reads
long enough for the transform to take place in the meantime.
If stdbuf(1) were installed, wget could use it to disable the standard buffering.
But that would add yet more variation between systems...
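
To make the spawn-once protocol concrete, here is a minimal Python sketch of the Wget side: the filter is started once, each filename is written as one line, and exactly one line is read back. Everything named here (NameFilter, FILTER_SOURCE, the upper-casing filter) is hypothetical and only for illustration; it is not taken from the patch. The stand-in filter flushes after every line, which is exactly the property a real filter program would have to guarantee.

```python
import subprocess
import sys

# Hypothetical stand-in for the user's filter program: it upper-cases each
# input line and flushes after every line, so it behaves as if line-buffered.
FILTER_SOURCE = (
    "import sys\n"
    "for line in iter(sys.stdin.readline, ''):\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)


class NameFilter:
    """Spawn the filter once and exchange one line per filename."""

    def __init__(self, argv):
        # text=True gives str pipes; bufsize=1 makes our side line-buffered.
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def rename(self, filename):
        # One line in (the original name), exactly one line out (the new name).
        self.proc.stdin.write(filename + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        # Closing the pipe tells the filter that no more names will follow.
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


if __name__ == "__main__":
    name_filter = NameFilter([sys.executable, "-c", FILTER_SOURCE])
    print(name_filter.rename("index.html"))
    name_filter.close()
```

If the filter did not flush after each line, rename() would block forever in readline(), which is the "surprise" being discussed. On systems with GNU coreutils, prefixing the filter command with "stdbuf -oL" is one way to request line-buffered output.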

That said, I'd be in favor of supporting line buffering as an option,
to be made available to those who know what they're getting themselves
into... but not as the default. But even there, if it were me, I'd wait
until there was a clear benefit in doing so.

I remember once fielding a support complaint (it may have been here on
the Wget list?) from someone complaining that something was spawning too
many quick processes and raising the "next process id" count too high.
I never understood why that's supposed to be a problem (especially on
Unix, where it's pretty well expected as a matter of course - that's
what pretty much any reasonably-sized shell script will do), but for
such people, at least, having the spawned-once process would be an
advantage, apparently. Provided they know how to force their program to
be line-buffered.
I don't think wget should care about “not using too many pids”.
However, when continuing a recursive download where most files are already
downloaded, it will need to rewrite a lot of filenames in rapid succession,
so I wonder if it could trigger some forking rate limit (intended to prevent
fork bombs, presumably).

I admit that I am not a regex expert (neither PCRE nor POSIX) and I do not
know what a proper match/replace pattern would look like (e.g. what syntax or
separation character should we use?). Experts please....
I'd imagine it should be like Perl syntax (which is the same as sed
syntax, except sed only uses the crippled BRE syntax for the actual
regex), which lets you choose any arbitrary separation character to
place after the s. If we know it's always going to be a substitution,
the s is really unnecessary, and possibly should be optional (but
perhaps not disallowed... principle of least surprise). Making it
optional but allowed would mean you couldn't use "s" itself, as a
separation character... but that would be rather perverse anyway,
wouldn't it? :)

...I don't know anything about PCRE, but I'm hoping it has its own
parser for the common "s///" idiom, so Wget wouldn't have to write/debug
our own.
I don't think we should allow letters as the separation character, which should
fix the issue (inspired by PHP's behavior for the preg_* functions: “Delimiter
must not be alphanumeric or backslash”).

Oh yeah, while we're still on the subject, it might be worth pointing
out that Niwt also has a "unique name" protocol that works as follows
(Wget might find it handy, especially in combination with a name
transform). When Niwt can't save the file name it wants, it feeds the
"name-uniquer" program the intended file name as an argument, and the
uniquer is expected to print an infinite series of incremented names;
Niwt reads file names until it finds the first one that it can create
exclusively, and then closes the pipe.
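
The consumer side of that protocol can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-process generator stands in for the external "name-uniquer" program, and the ".1", ".2", ... suffix scheme is a hypothetical choice, not what Niwt's script actually prints. The point is the loop: keep reading candidate names until one can be created exclusively, then stop reading.

```python
import itertools
import os
import tempfile


def unique_names(name):
    """Yield name.1, name.2, ... without end (hypothetical numbering scheme)."""
    for n in itertools.count(1):
        yield f"{name}.{n}"


def create_exclusively(name):
    """Try the intended name, then each candidate, until one can be created.

    Returns (path, fd). O_EXCL makes creation fail if the file already
    exists, so an existing file is never overwritten.
    """
    for candidate in itertools.chain([name], unique_names(name)):
        try:
            fd = os.open(candidate, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
        except FileExistsError:
            continue  # name is taken, read the next one from the "uniquer"
        return candidate, fd


def demo():
    """Create the same name three times in a scratch directory."""
    created = []
    with tempfile.TemporaryDirectory() as scratch:
        target = os.path.join(scratch, "index.html")
        for _ in range(3):
            path, fd = create_exclusively(target)
            os.close(fd)
            created.append(os.path.basename(path))
    return created


if __name__ == "__main__":
    print(demo())
```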

The shell script Niwt uses to do this by default can be viewed at:

It's just a few short lines of shell, with some documentation describing
how its choice of unique names differs from wget, and what the
shortcomings of that choice are. ...actually, now that I look at it, it
lies about how Wget names things. Ah well.

The wget-1.10.2.tar.gz example isn't the worst versioned-program transformation.
If you had program-2.0.tgz, it would become program-2.1.0.tgz :(
