bug-wget

[Bug-wget] wget "referer" development idea


From: hysterix
Subject: [Bug-wget] wget "referer" development idea
Date: Thu, 22 Jan 2009 19:18:02 -0800

Hello everyone,

This is not a bug, but it is something I think should probably be developed, as it does not seem that difficult to accomplish.  I was about to give the patch a go myself, but upon looking at the source code I thought it might be smarter to simply suggest the idea and see whether it is possible or easy to do.

Basically there are sites out there, specifically vb forums, that require the referrer to actually be the page that you came from! (imagine that!)  My project is to mirror an entire vb forum, and I got pretty far along doing it: storing cookies, simulating POST logins, everything. After many hours I finally got in and am able to do it, but there is a problem.  On the majority of the pages, if the referrer is not set (in the .wgetrc file, as referer = http://somepage.com), the forum kicks the request to the log-in screen, and what I am left with is hundreds of pages that are all 15 KB and are just the log-in screen of the forum!

Now, if I manually change the referer to a certain directory within the domain, I can see the page instead of a log-in page, but when I try to follow the links on it and save them, it throws me back to the log-in screen.  After many hours of tedious and careful study, I realized that when I changed the referrer manually, I was able to see the page I couldn't see before, but only in that directory. The second I tried to traverse one directory deeper, it would kick me out because the referrer was then wrong.  I studied the headers with Live HTTP Headers, and sure enough the referrer value changes with each request, so I assume their vb software is programmed to pick it up and check it on every page load!

So my question (or comment, or statement) is: how hard would it be to implement a switch, say for example --recursive-referrer? When this switch is used, wget would actively change the 'referer' value to whatever page it just came from while traversing all directories. That would enable full mirroring of sites that check the referrer and kick you out if it is wrong (in this case vb forums).
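To make the idea concrete, here is a rough sketch in Python (not wget code) of a crawl that sends, for every request, the page on which the link was found as the Referer. The names crawl, http_fetch and LinkParser are made up for this illustration, and the sketch leaves out cookies, logins and saving files to disk.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def http_fetch(url, referer):
    """Fetch url, sending a Referer header when one is known."""
    headers = {"Referer": referer} if referer else {}
    with urlopen(Request(url, headers=headers)) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=http_fetch, max_pages=100):
    """Breadth-first crawl that uses each page's parent as its Referer.

    Returns a dict mapping every fetched URL to the referer sent for it.
    """
    host = urlparse(start_url).netloc
    queue = deque([(start_url, None)])  # (url, page the link was found on)
    seen = {start_url}
    used = {}
    while queue and len(used) < max_pages:
        url, referer = queue.popleft()
        html = fetch(url, referer)
        used[url] = referer
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute, no #fragment
            # Stay on the starting host and visit each URL only once.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, url))
    return used
```

The point is only that the referer travels in the queue together with each URL, so it is always the parent page rather than one fixed value taken from a config file.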

Thanks a lot for the otherwise great program, and I hope I adequately described the problem I ran into!

I was wondering if there is a quickie fix, such as piping the header output to a file, using perl or something to grab the referrer out of it, and then feeding that back into the next wget execution? (seems hodgepodge to me and probably not a good solution)
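For what it's worth, the one-wget-run-per-page workaround could look something like the Python sketch below. It only builds the command line using wget's existing --referer and --load-cookies options. The helper name wget_command, the cookies.txt file name and the forum.example URLs are made up for the example.

```python
import shlex


def wget_command(url, referer, cookie_file="cookies.txt"):
    """Build a wget argument list that sends an explicit Referer."""
    return ["wget", "--referer=" + referer, "--load-cookies", cookie_file, url]


# Hypothetical pair: the page to fetch and the page that linked to it.
cmd = wget_command(
    "http://forum.example/sub/thread1.html",
    "http://forum.example/sub/",
)
print(shlex.join(cmd))
```

Each list could then be handed to subprocess.run, one page at a time. It is still a hodgepodge, since the caller has to work out the parent page for every URL.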
