
Re: [Bug-wget] wget -crNl inf --- filenames mangled


From: Tim Rühsen
Subject: Re: [Bug-wget] wget -crNl inf --- filenames mangled
Date: Thu, 14 Feb 2019 13:03:32 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.5.0

On 2/14/19 12:25 PM, Andres Valloud wrote:
> Tim,
> 
> On 2/14/19 02:03, Tim Rühsen wrote:
>>> I looked at the downloaded html files with grep.  They do contain the
>>> substring "1f43", seemingly after a ^M character (I did not check every
>>> single occurrence).  Sometimes, the ^M character is within a file name
>>> such as this:
>>>
>>> <tr><td valign="top"><img src="https://some.url/icons/mp3ogg.png^M
>>> 1f43^M
>>> "
>>
>> If this is contained in the HTML file, then 'mp3ogg.png1f43' seems
>> correct. ^M is a Carriage Return (Microsoft uses ^M plus linefeed for
>> End-Of-Line (EOL)). In an HTML file, EOL has no meaning - parsers simply
>> ignore it. This is nothing that can be addressed with
>> --restrict-file-names.
>>
>> But to make sure, look at the original file by downloading it with 'wget
>> <URL>'. Does the file have the above '1f43'/^M stuff in it as well? If
>> so, we can't do much about it.
>>
>> If all looks ok in there, please attach both files so we can compare and
>> possibly reproduce.
>>
>> If you set the 'User-Agent' header to e.g. "Mozilla/5.0 (X11; Linux
>> x86_64; rv:65.0) Gecko/20100101 Firefox/65.0", the server thinks the
>> request is coming via Firefox.
>> Both curl and wget have the --user-agent option for this.
>>
>> Do you get a different file when using that option?
> 
> There was one additional detail to make this work.  Instead of placing a
> request for index.html, I had to ask curl to get just the directory name
> ending with a slash.  Then the server responded with (essentially)
> index.html.

A web server might serve different content for 'dir', 'dir/' and
'dir/index.html'. This is sometimes puzzling. As you can see, 'dir/'
can't be used as a filename, so we use 'dir/index.html' for that, which
is not correct if the server serves 'dir/index.php' when we request 'dir/'.
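
The naming problem can be sketched in a few lines of Python. This is only
an illustration and not wget's actual code; the function name
local_filename and the default_page parameter are made up for the example.
It shows that 'dir/' and 'dir/index.html' end up as the same local file,
even though the server may treat them as different resources.

```python
from urllib.parse import urlsplit


def local_filename(url, default_page="index.html"):
    """Map a URL to a relative local path (hypothetical helper).

    A path ending in '/' cannot be a filename, so default_page is
    appended. This is an assumption about the general approach, not a
    copy of wget's logic.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += default_page
    return parts.netloc + path


for url in (
    "https://some.url/dir",
    "https://some.url/dir/",
    "https://some.url/dir/index.html",
):
    print(url, "->", local_filename(url))
```

The last two URLs map to the same local path, so whatever the server
sends for 'dir/' is stored under the name 'index.html'.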

> 
> Both curl and wget retrieve index.html contents without '1f43' when
> asking for just that URL.  vimdiff says the retrieved files are identical.

Try starting with this URL using your original wget command line. You
could add a quota (-Q) to limit the amount of data, in the hope of
reproducing your issue with far fewer files and much less data downloaded.
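
To check a smaller test download quickly, something like the following
Python sketch might help. It looks for a carriage return followed by a
short hex-looking token and another carriage return, which is the shape
of the '^M 1f43 ^M' residue quoted above. The names find_stray_tokens and
scan_tree, and the 1-8 hex digit limit, are assumptions made for this
example. Files with regular CRLF line endings can produce false positives
when a line consists only of a few hex-like characters.

```python
import re
from pathlib import Path

# CR, optional LF, 1-8 hex digits, CR (assumed shape of the residue).
STRAY = re.compile(rb"\r\n?([0-9a-fA-F]{1,8})\r")


def find_stray_tokens(data):
    """Return (offset, token) pairs for each match in a bytes object."""
    return [(m.start(), m.group(1).decode("ascii")) for m in STRAY.finditer(data)]


def scan_tree(root):
    """Scan all *.html files below root; return {path: [(offset, token), ...]}."""
    hits = {}
    for path in Path(root).rglob("*.html"):
        found = find_stray_tokens(path.read_bytes())
        if found:
            hits[str(path)] = found
    return hits


# Demo on the fragment quoted earlier in this thread.
sample = b'<img src="https://some.url/icons/mp3ogg.png\r\n1f43\r\n"'
for offset, token in find_stray_tokens(sample):
    print(f"offset {offset}: token {token!r}")
```

Calling scan_tree() on the mirror directory lists the affected files, so
the run with a quota can be compared against the full mirror.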

> I am at a loss as to how to explain how the '1f43' problem appears when
> asking wget to update the mirror of the site (rather than downloading a
> single file).  I'll look at the log file tomorrow and see if I get more
> ideas.

Try to reduce the amount of data needed to reproduce it.

Regards, Tim
