Re: [Bug-wget] wget mirror site failing due to file / directory name cla


From: Paul Beckett (ITCS)
Subject: Re: [Bug-wget] wget mirror site failing due to file / directory name clashes
Date: Mon, 15 Oct 2012 09:43:51 +0000

Thanks for the suggestions.

Micah, unfortunately the CMS we're using doesn't seem to allow people to 
create links with a trailing slash (although it still serves the correct page 
if the slash is added). 

Ángel, I agree this would work, but our management do not want .html 
extensions on the URLs. I previously experimented with adjust-extension to 
add '/index.html'. As far as I recall, I was able to do this with a command 
line option, but it meant all the links were adjusted to include 
/index.html, which I didn't want. I then tried hacking the C code a little 
to add it without adjusting the links, but that broke all the links to CSS, 
JS and other HTML pages: I was moving the HTML file into a sub-directory, 
which changed its relative location, while the CSS/JS and other HTML links 
weren't being adjusted to match.
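
To make the relative-link breakage concrete, here is a minimal Python sketch using the standard urllib.parse.urljoin. The link css/main.css is a made-up example, not taken from the real site; the point is only that the same relative link resolves differently once the HTML is served from inside a sub-directory.

```python
from urllib.parse import urljoin

# Hypothetical relative stylesheet link, as it might appear in a CMS page.
link = "css/main.css"

# Resolved against the page's original URL.
original = urljoin("http://www.example.com/about", link)

# Resolved against the URL of the same HTML moved into a sub-directory.
moved = urljoin("http://www.example.com/about/index.html", link)

print(original)  # http://www.example.com/css/main.css
print(moved)     # http://www.example.com/about/css/main.css
```

So unless the relative links are rewritten as well, they point one level too deep after the move.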

Thanks,
Paul


>-----Original Message-----
>From: Ángel González [mailto:address@hidden]
>Sent: Saturday, October 13, 2012 2:45 PM
>To: Paul Beckett (ITCS)
>Cc: address@hidden
>Subject: Re: [Bug-wget] wget mirror site failing due to file / directory name
>clashes
>
>On 12/10/12 15:38, Paul Beckett (ITCS) wrote:
>> I am attempting to use wget to create a mirrored copy of a CMS (Liferay)
>> website. I want to be able to fail over to this static copy in case the
>> application server goes offline. I therefore need the URLs to remain
>> absolutely identical. The problem I have is that I cannot figure out how
>> to configure wget in a way that will cope with:
>> http://www.example.com/about
>> http://www.example.com/about/something
>>
>> In this case either the file or the directory 'about' already exists and
>> prevents the second from being created.
>>
>> Initially I thought the most obvious solution was to rely on Apache's
>> DirectoryIndex, and save the files as:
>> /about/index.html
>> /about/something/index.html
>>
>> But currently I can't figure out how to do this in a way that neither
>> breaks the relative paths to other pages nor creates links to index.html
>> rather than the original location. I need the links (a href etc.) to
>> still go to /about and not explicitly call /index.html, as otherwise
>> people may bookmark things that won't exist when the CMS comes back.
>>
>> If anyone can offer me any advice on how I can achieve this, either with
>> the correct options or by patching the source code, I would be extremely
>> grateful.
>>
>> Thanks,
>> Paul
>>
>>
>>
>> /usr/local/bin/wget --background --append-output=/tmp/wget-log
>> --no-verbose --tries=20 --waitretry=10 --retry-connrefused
>> --limit-rate=100m --quota=10000m --timestamping
>> --directory-prefix=/usr/local/apache2/content/uk.ac.uea.www_flat2
>> --protocol-directories --user-agent="UEA WebSite Flattener"
>> --backup-converted -e robots=off --page-requisites --convert-links
>> --recursive --level=inf --trust-server-names --domains example.com
>> www.example.com
>Download with --adjust-extension
>This way, you will get:
>
>/about.html
>/about/something.html
>
>
>Then configure the root of the static copy:
>RewriteEngine On
>RewriteCond %{SCRIPT_FILENAME} !\.html$
>RewriteRule ^(.*[^/])/?$ $1.html
>
>to append the .html extension to the requested URLs.
>If your CMS returns non-HTML content on some URLs, you will need to adjust
>this to exclude them from the rewrite.
>
>Also, I'd remove --convert-links from the command line, since you want the
>same page contents as the real pages.
>
>
>
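
As a rough check of what the quoted rewrite rule does, the following Python sketch applies the same regular expression to a few request paths. It assumes per-directory (.htaccess) context, where the leading slash has already been stripped, and it tests the request path itself rather than SCRIPT_FILENAME, so it only approximates mod_rewrite. The function name rewrite is hypothetical.

```python
import re

# Same pattern as the quoted RewriteRule: capture everything up to the last
# non-slash character, then allow one optional trailing slash.
RULE = re.compile(r"^(.*[^/])/?$")

def rewrite(path: str) -> str:
    """Approximate the quoted RewriteCond/RewriteRule pair on a request path."""
    # Stand-in for the RewriteCond: leave paths that already end in .html.
    if path.endswith(".html"):
        return path
    match = RULE.match(path)
    # Paths with no non-slash character (e.g. the site root) do not match.
    return match.group(1) + ".html" if match else path

for path in ["about", "about/", "about/something", "about.html"]:
    print(path, "->", rewrite(path))
```

Under these assumptions both about and about/ map to about.html, and about/something maps to about/something.html, which matches the file layout produced by --adjust-extension.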



