bug-wget

[bug #64714] --no-clobber not working with --mirror??


From: anonymous
Subject: [bug #64714] --no-clobber not working with --mirror??
Date: Sun, 24 Sep 2023 12:17:57 -0400 (EDT)

URL:
  <https://savannah.gnu.org/bugs/?64714>

                 Summary: --no-clobber not working with --mirror??
                   Group: GNU Wget
               Submitter: None
               Submitted: Sun 24 Sep 2023 04:17:55 PM UTC
                Category: None
                Severity: 3 - Normal
                Priority: 5 - Normal
                  Status: None
                 Privacy: Public
             Assigned to: None
         Originator Name: 
        Originator Email: 
             Open/Closed: Open
                 Release: trunk
         Discussion Lock: Any
        Operating System: GNU/Linux
         Reproducibility: None
           Fixed Release: None
         Planned Release: None
              Regression: None
           Work Required: None
          Patch Included: None


    _______________________________________________________

Follow-up Comments:


-------------------------------------------------------
Date: Sun 24 Sep 2023 04:17:55 PM UTC By: Anonymous
Hi, I'm not entirely certain that the behavior I'm seeing is a bug rather than
me using wget incorrectly, but it is definitely not intuitive.

I tried to mirror multiple sites into the same folder, because I wanted them
to reference each other while going "deeper" on some of the referenced pages
and not as deep on others. So I thought I could "just" delete the index.html
file of such a page and invoke wget with --mirror again to mirror that page as
well and write it into the same place (so that the reference from the page
mirrored earlier would still work).

However, even though I specified --no-clobber, wget sometimes overwrote pages
that had already been downloaded and adjusted with a non-adjusted version from
the server. It looks like some kind of recursion issue.


When pages are cross-linked like this:

a.com/index.html links to a.com/page2.html, which has an out-link to
b.com/page3.html. The first invocation will download all three pages but none
of the out-links of b.com/page3.html (so the created page3.html file will
still have the original links in it).

b.com/page3.html in turn has a backlink to a.com/index.html. The first
invocation of wget, tasked to download a.com/index.html, doesn't care about
this and may even correctly adjust the backlink to a.com/index.html. HOWEVER,
when the local copy of "b.com/page3.html" is deleted and wget is invoked a
second time, now tasked to download only "b.com/page3.html", potentially with
different arguments (such as a specified recursion depth, an option to not
download any out-links, or a domain restriction), it will sometimes overwrite
a.com/index.html with a version that no longer has relative (adjusted) URLs
but the original non-adjusted ones, effectively breaking the local copy of the
page, even though "--no-clobber" was specified.
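
To make the link structure above easier to follow, here is a small toy model
of it in Python. It is only an illustration of which pages a depth-limited
crawl would reach, not wget's actual traversal code. The function name
`reachable`, the `LINKS` table and the depth values 2 and 1 are assumptions
made for this sketch; the page names are the ones from the description.

```python
from collections import deque

# Link graph from the scenario above (each page lists its out-links).
LINKS = {
    "a.com/index.html": ["a.com/page2.html"],
    "a.com/page2.html": ["b.com/page3.html"],
    "b.com/page3.html": ["a.com/index.html"],  # the backlink
}


def reachable(start, max_depth):
    """Return {url: depth} for every page within max_depth hops of start."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        if depth[url] == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for target in LINKS.get(url, []):
            if target not in depth:
                depth[target] = depth[url] + 1
                queue.append(target)
    return depth


# First run (assumed depth 2): all three pages are fetched, but the
# out-links of b.com/page3.html are not followed.
first_run = reachable("a.com/index.html", max_depth=2)

# Second run (assumed depth 1), after deleting the local b.com/page3.html.
deleted = "b.com/page3.html"
second_run = reachable(deleted, max_depth=1)

# Pages that already exist locally and are reached again by the second run.
revisited = sorted((set(first_run) & set(second_run)) - {deleted})
print(revisited)  # ['a.com/index.html']
```

In this model the second run reaches a.com/index.html again through the
backlink, which is the file I would expect --no-clobber to leave untouched.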


The full commands I used were:
* `wget --mirror --recursive=on --level=1 --convert-links --adjust-extension
--page-requisites --span-hosts --no-clobber -e robots=off`
* `wget --mirror --http-user=user --http-password=pass --recursive=on
--level=1 --convert-links --adjust-extension --page-requisites --span-hosts
--no-clobber -e robots=off`
* `wget --mirror --recursive=on --level=0 --convert-links --adjust-extension
--page-requisites --no-parent --no-clobber -e robots=off`

(All three were run with different target URLs, but each time I first deleted
the local index.html to get a "deeper" copy of that subtree of linked web
pages.)







    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?64714>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
