bug-wget



From: Micah Cowan
Subject: Re: [Bug-wget] [PATCH] New option: --rename-output: modify output filename with perl
Date: Sat, 3 Aug 2013 12:07:17 -0700
User-agent: Mutt/1.5.21 (2010-09-15)

On Sat, Aug 03, 2013 at 04:11:59PM +0200, Tim Rühsen wrote:
> As a second option, we could introduce (now or later)
>       --name-filter-program="program REGEX"
> 
> The 'program' answers each line it gets (the original filename) with exactly
> one output line (the new filename) as long as Wget does not close the pipe.
> The 'program' needs to be started only once...

Given the difficulty for novice users to ensure that the program is
line-buffered (unless, again, we do something like allocate a pty), I
still feel that spawn-once will pose too much "surprise" (as in
"principle of least surprise") to non-expert shell folks to be the
default. And I still feel that it doesn't necessarily even pose any
realistic advantage, given that we're likely to wait on network reads 
long enough for the transform to take place in the meantime.

That said, I'd be in favor of supporting line buffering as an option,
to be made available to those who know what they're getting themselves
into... but not as the default. But even there, if it were me, I'd wait
until there was a clear benefit in doing so.

I remember once fielding a support complaint (it may have been here on
the Wget list?) from someone complaining that something was spawning too
many quick processes and raising the "next process id" count too high.
I never understood why that's supposed to be a problem (especially on
Unix, where it's pretty well expected as a matter of course - that's
what pretty much any reasonably-sized shell script will do), but for
such people, at least, having the spawned-once process would be an
advantage, apparently. Provided they know how to force their program to
be line-buffered.
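
To make the buffering concern concrete, here is a minimal sketch of a
spawn-once filter, assuming a one-line-in, one-line-out protocol like the
one Tim describes. The names NameFilter and FILTER_SOURCE are hypothetical
and not part of Wget or of any patch in this thread. The explicit flush()
in the child is the part novice users tend to miss: without it, a child
writing to a pipe is typically block-buffered, and the parent hangs waiting
for a reply that is still sitting in the child's buffer.

```python
import subprocess
import sys

# Hypothetical filter program: lowercases each file name it is given.
FILTER_SOURCE = r"""
import sys
for line in iter(sys.stdin.readline, ""):
    sys.stdout.write(line.lower())
    sys.stdout.flush()  # stdout is a pipe here, so flush after every line
"""


class NameFilter:
    """Talks to a long-lived filter process, one line per file name."""

    def __init__(self, argv):
        # text=True with bufsize=1 makes the parent's side line-buffered.
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def rename(self, name):
        # Send one name, then block until the filter answers with one line.
        self.proc.stdin.write(name + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        # Closing stdin is the signal for the filter to exit.
        self.proc.stdin.close()
        self.proc.stdout.close()
        self.proc.wait()
```

Something like NameFilter([sys.executable, "-c", FILTER_SOURCE]) starts
the filter once, and each rename() call is then a single round trip. Drop
the flush() from the child and the first rename() call will most likely
block forever, which is exactly the "surprise" in question.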

> I admit that I am not a regex expert (neither PCRE nor POSIX) and I do not
> know what a proper match/replace pattern would look like (e.g. what syntax or
> separation character should we use?). Experts please....

I'd imagine it should be like Perl syntax (which is the same as sed
syntax, except sed only uses the crippled BRE syntax for the actual
regex), which lets you choose any arbitrary separation character to
place after the s. If we know it's always going to be a substitution,
the s is really unnecessary, and possibly should be optional (but
perhaps not disallowed... principle of least surprise). Making it
optional but allowed would mean you couldn't use "s" itself as a
separation character... but that would be rather perverse anyway,
wouldn't it? :)

...I don't know anything about PCRE, but I'm hoping it has its own
parser for the common "s///" idiom, so Wget wouldn't have to write/debug
its own.
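
In case no such parser is available, a hand-rolled one need not be large.
The following is only a sketch under stated assumptions: the function names
parse_substitution and rename are hypothetical, the only flags handled are
"g" and "i", and Python's re module stands in for PCRE (so backreferences
in the replacement are written \1 rather than $1). It accepts any
non-alphanumeric separator and treats the leading "s" as optional.

```python
import re


def parse_substitution(expr):
    """Split '[s]<sep>pattern<sep>replacement<sep>flags' into its parts."""
    # The leading "s" is optional, so "s" can never be the separator.
    if expr.startswith("s") and len(expr) > 1:
        expr = expr[1:]
    if not expr:
        raise ValueError("empty expression")
    sep = expr[0]
    if sep.isalnum() or sep.isspace() or sep == "\\":
        raise ValueError(f"unsuitable separator: {sep!r}")

    parts, current = [], []
    i = 1
    while i < len(expr):
        ch = expr[i]
        if ch == "\\" and i + 1 < len(expr):
            nxt = expr[i + 1]
            # An escaped separator becomes a literal separator character;
            # every other escape is passed through to the regex engine.
            current.append(sep if nxt == sep else ch + nxt)
            i += 2
            continue
        if ch == sep and len(parts) < 2:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1

    if len(parts) != 2:
        raise ValueError("expected <sep>pattern<sep>replacement<sep>")
    return parts[0], parts[1], "".join(current)


def rename(expr, filename):
    """Apply a substitution expression to a file name."""
    pattern, replacement, flags = parse_substitution(expr)
    unknown = set(flags) - {"g", "i"}
    if unknown:
        raise ValueError(f"unsupported flags: {sorted(unknown)}")
    re_flags = re.IGNORECASE if "i" in flags else 0
    count = 0 if "g" in flags else 1  # count=0 means "replace all" in re.sub
    return re.sub(pattern, replacement, filename, count=count, flags=re_flags)
```

With this, rename("s/\\.jpeg$/.jpg/", "photo.jpeg") and the s-less form
rename(",/,_,g", "a/b/c") both go through the same parser. A separator
that is also a regex metacharacter (such as "|") is an edge case this
sketch does not try to handle cleverly.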


Oh yeah, while we're still on the subject, it might be worth pointing
out that Niwt also has a "unique name" protocol that works as follows
(Wget might find it handy, especially in combination with a name
transform). When Niwt can't save the file name it wants, it feeds the
"name-uniquer" program the intended file name as an argument, and the
uniquer is expected to print an infinite series of incremented names;
Niwt reads file names until it finds the first one that it can create
exclusively, and then closes the pipe.

The shell script Niwt uses to do this by default can be viewed at:
http://micah.cowan.name/hg/niwt/file/tip/script/niwt-uniquer

It's just a few short lines of shell, with some documentation describing
how its choice of unique names differs from Wget's, and what the
shortcomings of that choice are. ...actually, now that I look at it, it
lies about how Wget names things. Ah well.
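
For anyone who'd rather see the shape of the protocol than read shell, here
is a rough sketch of the idea. It is not the Niwt script: the generator
stands in for the external uniquer process and its pipe, the function names
are hypothetical, and the "name.1, name.2, ..." scheme is just an
assumption chosen for illustration. The important part is that the
consumer tests each candidate by trying to create it exclusively, and stops
reading as soon as one succeeds.

```python
import itertools
import os


def unique_names(name):
    """Yield an endless series of candidates (assumed scheme: name.1, name.2, ...)."""
    for n in itertools.count(1):
        yield f"{name}.{n}"


def create_exclusively(path):
    """Return an open file descriptor, or None if the path already exists."""
    try:
        # O_EXCL makes creation fail if the file exists, so the check
        # and the creation happen as a single atomic step.
        return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except FileExistsError:
        return None


def open_unique(name, candidates=unique_names):
    """Try the intended name first, then fall back to the uniquer's output."""
    fd = create_exclusively(name)
    if fd is not None:
        return name, fd
    for candidate in candidates(name):
        fd = create_exclusively(candidate)
        if fd is not None:
            # Returning here abandons the generator, which plays the
            # role of closing the pipe to the uniquer process.
            return candidate, fd
```

Because the uniquer only proposes names and never touches the filesystem
itself, swapping in a different naming scheme means replacing
unique_names and nothing else.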

-mjc



