From: Gabriel L. Somlo
Subject: Re: [Bug-wget] [RFE / project idea]: convert-links for "transparent proxy" mode
Date: Mon, 31 Aug 2015 20:02:41 -0400
User-agent: Mutt/1.5.23 (2014-03-12)

On Mon, Aug 31, 2015 at 03:14:42AM +0200, Ander Juaristi wrote:
> Hi,
> 
> Since no one expressed either interest or refusal to this idea (and I found
> myself in an unexpected situation of having more free time than usual :D), I
> decided to work on it a bit, which I've been doing during this week.
> 
> After hacking some code over your inline comments, I did several test runs
> over your provided test servers (www.contrib.andrew.cmu.edu/...) and still
> Wget was processing net paths automatically by prefixing the protocol
> ("http://"). So I thought the problem could be tackled down by just not
> converting net paths ("//") into schemes (i.e. "http://"), when transforming
> the downloaded HTML/CSS files.
> 
> Sorry if I'm still unable to see through your use case but I think it all
> could be solved by simply introducing a new switch that prevents that
> conversion. For example:
> 
>     $ wget --keep-net-paths ...
> 
> So that "//mirror.cmu.edu/..." would not be converted into
> "http://mirror.cmu.edu/...". The rest of the job (such as #1 in your
> previous answer) would be done by the other switches, such as
> '--convert-links' itself.
> 
> You've got a broader overview than me. You think this is enough?

I started by looking at

        char *newname = construct_relative(file, link->localname);

That function uses the disk file name of the file containing the link,
and the disk file name of the file the link is pointing at.

All it needs is the two file names, so it can build a relative file
system reference (e.g. backing out of the current dir. of 'file'
enough to then be able to descend into the current dir of
'link->localname').

It returns a freshly allocated string (newname) which then gets
quoted, and used to replace the value of the original link in the
referencing file.
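
For illustration, the relative-reference computation described above
can be sketched roughly as follows. This is a simplified stand-in
written for this message, not a copy of wget's construct_relative;
the name relative_ref_sketch is hypothetical, and it assumes both
arguments are '/'-separated paths under the same download root.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build a relative reference from 'basefile'
   (the file containing the link) to 'linkfile' (the link target).
   Returns a malloc'd string the caller must free, or NULL. */
char *relative_ref_sketch(const char *basefile, const char *linkfile)
{
    assert(basefile != NULL && linkfile != NULL);

    /* Skip the leading directory components common to both names. */
    size_t start = 0;
    for (size_t i = 0; basefile[i] != '\0' && basefile[i] == linkfile[i]; i++)
        if (basefile[i] == '/')
            start = i + 1;
    basefile += start;
    linkfile += start;

    /* Each remaining directory in basefile needs one "../" to back out. */
    size_t updirs = 0;
    for (const char *b = basefile; *b != '\0'; b++)
        if (*b == '/')
            updirs++;

    size_t link_len = strlen(linkfile);
    char *newname = malloc(3 * updirs + link_len + 1);
    if (newname == NULL)
        return NULL;
    for (size_t i = 0; i < updirs; i++)
        memcpy(newname + 3 * i, "../", 3);
    /* Copy the rest of the target path, including the terminating NUL. */
    memcpy(newname + 3 * updirs, linkfile, link_len + 1);
    return newname;
}
```

With "a/b/index.html" as the referencing file and "a/c/img.png" as
the target, this backs out of "b/" once and descends into "c/".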



What we want here is similar, but not quite the same: we still
want a newly allocated 'newname', except it should not be derived
from a local-disk file name.

We need the original value of the link from the downloaded document
(may start with '[http[s]:]//...', depending on whatever the author
of the web page used in their original html), and we need the
extension-adjusted name of the saved link target (that's still
link->localname, BTW).

The original value of the link starts at 'p' (or 'url_start'), and
its size is given by link->size.

So we could call a function

        char *newname = construct_tpu(p, link);

p points at a string which looks like this:

        "original_dirname/original_basename"

or

        'original_dirname/original_basename'

or simply

        original_dirname/original_basename

link->size includes the surrounding single or double quotes, if
present in the original file.

So, if *p=='"' or *p=='\'', the real link size is shorter by
two characters than the value of link->size :), and the actual link
text starts at p+1.
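
A minimal sketch of that quote handling, assuming 'size' is the
value of link->size and that an opening quote always has a matching
closing one. The helper name unquoted_link is hypothetical and does
not exist in wget:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the raw link text at 'p' spanning 'size'
   bytes (possibly wrapped in single or double quotes), return a
   pointer to the first byte of the actual link and store its length
   in '*len'. Nothing is copied or modified. */
const char *unquoted_link(const char *p, size_t size, size_t *len)
{
    assert(p != NULL && len != NULL);

    if (size >= 2 && (*p == '"' || *p == '\'')) {
        *len = size - 2;   /* drop the opening and closing quote */
        return p + 1;      /* link text starts right after the quote */
    }
    *len = size;           /* unquoted: the whole span is the link */
    return p;
}
```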

link->localname will be something like

        "/local_fs_dirname/extension_adjusted_basename"

All we need to do is calculate dirname(p) and basename(link->localname),
concatenate them together, and we've ended up with a "transparent
proxy URL" link, which uses the "online" (i.e. NOT file://...)
protocol to request the *adjusted* filename scraped and saved by wget.
In other words,

        "original_dirname/extension_adjusted_basename"
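
Putting the pieces together, a rough sketch of such a function could
look like the code below. To keep it self-contained it takes the raw
link text, its size, and the local name as plain arguments instead of
wget's link structure; the name construct_tpu_sketch and that
signature are assumptions made for illustration. It also ignores
query strings and fragments, and leaves re-quoting to the caller.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: combine the directory part of the original link
   (at 'p', 'size' bytes, possibly quoted) with the basename of the
   saved file 'localname'. Returns a malloc'd string or NULL. */
char *construct_tpu_sketch(const char *p, size_t size, const char *localname)
{
    assert(p != NULL && localname != NULL);

    /* Strip the surrounding quotes, if the original link had them. */
    if (size >= 2 && (*p == '"' || *p == '\'')) {
        p += 1;
        size -= 2;
    }

    /* "dirname" of the original link: everything up to and including
       the last '/'. If there is no '/', the directory part is empty. */
    size_t dir_len = size;
    while (dir_len > 0 && p[dir_len - 1] != '/')
        dir_len--;

    /* "basename" of the local file: everything after its last '/'. */
    const char *base = strrchr(localname, '/');
    base = (base != NULL) ? base + 1 : localname;

    size_t base_len = strlen(base);
    char *newname = malloc(dir_len + base_len + 1);
    if (newname == NULL)
        return NULL;
    memcpy(newname, p, dir_len);
    /* Copy the basename together with its terminating NUL. */
    memcpy(newname + dir_len, base, base_len + 1);
    return newname;
}
```

For a quoted original link of "//mirror.example/docs/page" saved
locally as "/tmp/mirror/docs/page.html" (both made-up names), this
yields //mirror.example/docs/page.html, keeping whatever scheme or
net-path prefix the page author used.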

Does that make sense?

Please feel free to grab me on IRC some time during "work hours" (I'm on
US Eastern time, hope there's some useful overlap with your active
hours :) and we can chat about it in some more detail, if you'd like.


Thanks much,
--Gabriel


