bug-wget

Re: wget2: How to understand log output


From: Tim Rühsen
Subject: Re: wget2: How to understand log output
Date: Sat, 1 Jun 2024 18:37:23 +0200
User-agent: Mozilla Thunderbird

Hey David,

> It appears that wget2 is getting files outside of what my regex-es allow, but on closer inspection, the files don't exist on my FS.

Indeed, wget2 behaves slightly differently from wget.
In this case, wget2 fetches pages outside your regex to extract URLs from them, but will only store the files matching your regex. The idea is to fetch more of the stuff that is interesting to you. I can see why this is debatable. What is your opinion on this, apart from "I want to keep the old behavior"?

> Normally, you'd get HTTP response 200, or 404, or something, but wget2 says that it's 0. What does that mean?

Hm, I thought we fixed this issue already. Did you try with the latest version from trunk/master?

"[x] Checking <URL> ..." means that a HEAD request is made to the URL to determine whether that page content may contain more URLs. E.g. HTML, CSS and RSS pages are downloaded and parsed for yet unknown URLs.

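As a rough illustration of that check (a sketch only, not wget2's actual code): a HEAD request returns headers without a body, and the Content-Type header is enough to decide whether the body is worth downloading and parsing for links. The function names and the set of MIME types below are assumptions made for this example; wget2's real list may differ.

```python
import urllib.request

# Assumed set of parsable types for this sketch; wget2's real list may differ.
PARSABLE_TYPES = {
    "text/html",
    "application/xhtml+xml",
    "text/css",
    "application/rss+xml",
}

def may_contain_urls(content_type):
    """Return True if a Content-Type header value names a type worth parsing for links."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in PARSABLE_TYPES

def check(url, timeout=45):
    """Send a HEAD request (headers only, no body) and classify the response."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        content_type = response.headers.get("Content-Type")
        return response.status, may_contain_urls(content_type)
```

Only `check()` touches the network; `may_contain_urls()` can be tried on header values directly, for example with "text/html; charset=UTF-8" or "font/woff2".
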
> What does "Adding URL: $URL" mean?

It means that a URL has been found and is now being checked to decide whether it will be enqueued into the list of to-be-downloaded URLs. These checks include whether the URL is parsable/valid, has a known scheme (HTTP or HTTPS), isn't already known, matches the filters, etc. One of the next lines will tell you whether the URL was actually enqueued or whether it was rejected (the reason is given as well).
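
A minimal sketch of such a checking step, assuming the checks run in the order listed above. The function name `try_enqueue`, the reason strings, and the filter semantics are hypothetical and chosen for illustration; they are not wget2's actual code or log messages.

```python
import re
from urllib.parse import urlsplit

def try_enqueue(url, known, accept=None, reject=None):
    """Return (enqueued, reason) for a newly found URL.

    known  -- set of URLs seen so far (updated when a URL is enqueued)
    accept -- optional compiled regex the URL must match
    reject -- optional compiled regex the URL must not match
    """
    try:
        parts = urlsplit(url)
    except ValueError:
        return False, "not parsable"
    if parts.scheme not in ("http", "https"):
        return False, "unknown scheme"
    if not parts.netloc:
        return False, "not parsable"
    if url in known:
        return False, "already known"
    if reject is not None and reject.search(url):
        return False, "matches reject filter"
    if accept is not None and not accept.search(url):
        return False, "does not match accept filter"
    known.add(url)
    return True, "enqueued"
```

Calling it twice with the same URL and the same `known` set enqueues the URL the first time and reports "already known" the second time, which corresponds to an "Adding URL" line that is not followed by a download.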

If you still run into an issue with the latest wget2, it would be good if you could provide information on how to reproduce it. Ideally, a command line that everybody here can execute. If you have concerns about making that public, you may email one of the maintainers directly (but don't expect a fast response, we are just volunteers).

Regards, Tim

On 5/30/24 05:00, David Niklas wrote:
Hello,
I don't think that the log output should be that complex of a question.
Would someone kindly get back to me about the matter?

Thanks,
David


On Wed, 15 May 2024 19:46:18 -0400
David Niklas <deference@null.net> wrote:
Hello,

I'm a long-term user of wget, and I'm trying to make the switch to
wget2. I'm having a problem understanding what exactly is going on. It
appears that wget2 is getting files outside of what my regex-es allow,
but on closer inspection, the files don't exist on my FS.

Aside: I would attach the complete wget2 log output to this email, but
it's 27MB in size uncompressed and, even using xz, it still comes out to
1MB in size.
I'm uncertain what your particular email list recommends. Normally I
have to get special permission from the list admin.


If there's some fine documentation which explains all this, I haven't
found it, so feel free to point me to it.

Normally, you'd get HTTP response 200, or 404, or something, but wget2
says that it's 0. What does that mean?

When you check something, it's normally because you have it, but wget2
doesn't appear to have downloaded the files it then says that it's
checking (although I may have forgotten to retain them for the purpose
of this email).
So what does '[3] Checking $URL ...' mean?

When you add a URL, one would normally think that it's going to be
downloaded, but that doesn't appear to be the case with wget2. What does
"Adding URL: $URL" mean?

As you probably noticed, I'm rather confused. Here's a portion of
wget2's output followed by the command that I used.

Thanks,
David


#############################################################################

Adding URL:
https://web.archive.org/web/20220305001008js_/https:/americasfrontlinedoctors.org/_next/static/bBQU-7wbyVqBHhpUeRiRF/_middlewareManifest.js
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUHjIg1_i6t8kCHKm4532VJOt5-QNFgpCtr6Uw9.woff
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUHjIg1_i6t8kCHKm4532VJOt5-QNFgpCvr6Ew9.woff
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUHjIg1_i6t8kCHKm4532VJOt5-QNFgpCs16Ew9.woff
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUHjIg1_i6t8kCHKm4532VJOt5-QNFgpCtr6Ew9.woff
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUHjIg1_i6t8kCHKm4532VJOt5-QNFgpCtZ6Ew9.woff
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUHjIg1_i6t8kCHKm4532VJOt5-QNFgpCu170w9.woff
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUHjIg1_i6t8kCHKm4532VJOt5-QNFgpCuM70w9.woff
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUHjIg1_i6t8kCHKm4532VJOt5-QNFgpCvr70w9.woff
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUHjIg1_i6t8kCHKm4532VJOt5-QNFgpCvC70w9.woff
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUSjIg1_i6t8kCHKm459WRhyyTh89ZNpQ.woff2
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUSjIg1_i6t8kCHKm459W1hyyTh89ZNpQ.woff2
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUSjIg1_i6t8kCHKm459WZhyyTh89ZNpQ.woff2
Adding URL:
https://web.archive.org/web/20220305001008im_/https://fonts.gstatic.com/s/montserrat/v23/JTUSjIg1_i6t8kCHKm459WdhyyTh89ZNpQ.woff2

###################...#######################################################

[3] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/modernwisdompodcast'
... [2] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/IsaacArthur'
... HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/MLChristiansen]
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/BlacktipH]
[1] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/TomAntosFilms'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/DonaldJTrumpJr]
[2] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/KimIversen'
...
[3] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/Homesteadonomics'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/MariaBartiromo]
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/StevenCrowder]
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/LifeStories]
[1] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/TheAdventureAgents'
...
[2] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/NDWoodworkingArt'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/Styxhexenhammer666]
[3] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/EarthTitan'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/NYPost]
HTTP response 0
[https://web.archive.org/web/20220130164746js_/https://www.americasfrontlinedoctors.org/_next/static/chunks/d0447323-9a7a3aa3a90e5cd2.js]
[1] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/KenDBerryMD'
...
[3] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/HeresyFinancial'
...
[2] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/RepJimBanks'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/PageSix]
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/SamuelEarpArtist]
[3] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/HOTDANGSHOW'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/Decider]
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/JohnStossel]
[1] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/ATRestoration'
...
[2] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/ThisSouthernGirlCan'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/MikhailaPeterson]
[3] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/RockFeed'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/Locals]
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/Timcast]
[1] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/CountryCast'
...
[2] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/ShaunAttwood'
...
[3] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/diywife'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/CWLemoine]
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/TimcastIRL]
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/Entrepreneur]
[1] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/RekietaLaw'
...
[2] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/MontyFranklin'
...
[3] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/GeeksandGamers'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/TheBodyLanguageGuy]
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/Yarnhub]
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/DrDrew]
[1] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/SportsWars'
...
[2] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/nfldaily'
...
[3] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/nbanow'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/TulsiGabbard]
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/MattKohrs]
[1] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/GamingWithGeeks'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/HabibiPowerHour]
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/FactsChannel]
[3] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/ParkHoppin'
...
[2] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/GeeksAndGamersClips'
...
HTTP response 0
[https://web.archive.org/web/20220404154147/https:/rumble.com/c/phetasy]
[1] Checking
'https://web.archive.org/web/20220404154147/https:/rumble.com/c/chiefstv'
...
#############################################################################



The wget2 command is as follows. I had to wrap it.

wget2 -NEkrl9 -t 13 --regex-type=posix --timeout 45 --reject-regex

'http.*http.*http|\.html?.*\.html?.*\.html?|www\..*www\..*www\.|\{|url[^/]+query|data[^/]+\.(url|image)|
/activity|/members|/groups|/%5C$|xmlrpc\.php|/phpBB/|/socialauth/|/googlebooks/|xmlrpc\.php|/admin\.php|
/rsc/|/htsrv/|/skins/|/activate/|blogger.com/.*(profile|share-post|delete|comment|post-edit)|delete-comment.g|
from=|target=|/public\.api/|/mshots/v1/|/public.api/|/(remote-login|press-this|wp-signup|wp-login)\.php|
Translations:|Sandbox:|Template:|title=User_talk|/pricecompare|\?[rs]=[x0-9]+&|/likers|/following|new\?user=|
/discussion-|\?(resize|w|h)&|signin\?|/signup|/messages|/followers|/likesandfollows|/add\?|/destroy/|
/create\?|/Layout|/Selected_page|redirect=no|Template:|_talk:|User:|User_talk:|sign_up|sign_in|
\.img(\.(xz|gz|bz2))?|/secured_requests|/usenet/|/rss\.php|/design-tools|/supportLink|eesimUrl|
/reliabilityLink|/markets|/storefront.html|distributorData|/mymaxim|/samplecart|/comment-subscriptions|
/walkthroughs|like_comment=|screenToRender=|/UserAccount|/myprofile|layout=siteinfo|/Subscribe|\?cid=|
\+url\+|captcha|utm_(medium|source)=|bc(lid|tid)|pubdate|HQS|tid|eid|kcid|pid=|screenToView=login|
[^[:alnum:]]search/|companies/|directory/|cat/(news|reviews|previews-unboxing)|PrintView|contentItemId|
/[Aa]uth|comment_mail|replytocom=|[^[:alnum:]]search\?|amp$|\.(rss|atom|json)$|/maintenance|
/lib/exe/indexer.php|dataflt|datasrt|\.iso|(show|focused)Comment(Area|s|Id)|decoration|(bookmarks|
browsespace|changes|diffpages(byversion)?|listattachmentsforspace|login|peopledirectory|recentlyupdated|
replycomment|report|space-bookmarks|tinyurl|view(follow|info|mailarchive|page(attachments|src)|
previousversions|recentblogposts|spacesummary|userprofile))\.action|edit$|recentchanges|revisions|
/WantedPages|/forum|cgi-bin/|(do|sectok|mode|action|oldid|diff|showComment|share|replyto)=|Talk:|
Special:|wp-admin/|feed|login|/(EU|FR-FR|anp|ar|az|bg|bgn|bn|ca|cn|cs|da|de|de-de|diq|el|en-au|en-ca|
en-gb|en-sg|en-za|eo|es|es-co|es-mx|es-es|eu|fa|fr|fr-fr|he|hi|hr|hu|hy|ia|id|it|it-it|ja|jbo|jp|kk|ko|
lb|lt|map-bms|ml|mni|nb|ne|nl|nl-nl|no|oc|pa|pl|pl-pl|pt|pt-br|ro|ru|sco|sd|sl|si|sq|sr|sr-ec|ta|te|th|
tr|ua|udm|ug-arab|uk|ur|vi|zh|zh_CN|zh-cn|zh_cn|zh_tw)(:|$)'
--accept-regex

'(.*\.(css|gif|png|jpe?g)$|https?://web\.archive\.org/web/[^ *]+/
https?://?(i0.wp.com|i[0-9].wp.com|s[0-9].wp.com|([0-9]\.)?bp.blogspot.com|
www.blogger.com|www.blogblog.com|lh[0-9]\.googleusercontent.com|
fonts.googleapies.com|(ssl|www|fonts).gstatic.com|(www[0-9]*?\.)?americasfrontlinedoctors.org))'

https://web.archive.org/web/20220305001008/https://americasfrontlinedoctors.org/
|& tee -a 2wget.log



Attachment: OpenPGP_signature.asc
Description: OpenPGP digital signature

