bug-wget

RE: CRITICAL BUG: wget -N is leaving corrupted files


From: Romain Morotti (London)
Subject: RE: CRITICAL BUG: wget -N is leaving corrupted files
Date: Wed, 19 Jun 2024 11:32:56 +0000

Hello,

Sorry for the delay in getting back to you.

I think the appropriate solution is to send the HEAD request by default, as 
wget did before.
From the previous PRs that changed the behavior, it's not clear to me why it 
was changed in the first place. I think they missed the edge case where an 
interrupted download leaves a corrupted file.
A few threads reported the issue afterwards, but the buggy behavior was not 
reverted. I think people didn't manage to identify the root cause: the new 
behavior itself is buggy.
I wonder whether the purpose of the change was a micro-optimization to save a 
HEAD request.

The whole point of "wget -N" is to avoid redownloading large files if they are 
unchanged (in my case, gigabytes of files to deploy). Workarounds that require 
downloading the file and then comparing are not viable. ^^
Personally, I don't think the HEAD request needs to be optimized away. The 
"wget -N" flag is meant to avoid downloading large files, so it's very 
reasonable to send one HEAD request to save MB or GB of downloads.
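
To illustrate why the HEAD request helps, here is a minimal Python sketch of 
the decision it makes possible. This is not wget's code; the function name 
`needs_download` and its parameters are made up for illustration. The point is 
that a HEAD response can carry both Last-Modified and Content-Length, so a 
partial file with a newer date can still be caught by the size check, whereas 
an If-Modified-Since request only gives the server the local timestamp.

```python
from email.utils import parsedate_to_datetime


def needs_download(local_mtime, local_size, last_modified, content_length):
    """Decide from HEAD response headers whether the body must be fetched.

    local_mtime, local_size: None when the local file does not exist.
    last_modified: raw Last-Modified header value, or None if absent.
    content_length: Content-Length as an int, or None if absent.
    """
    if local_mtime is None or last_modified is None:
        return True  # nothing to compare against, so fetch
    if content_length is not None and content_length != local_size:
        return True  # size mismatch, e.g. an earlier interrupted download
    remote_mtime = parsedate_to_datetime(last_modified).timestamp()
    return remote_mtime > local_mtime
```

With a truncated local file whose date is newer than the server's, the size 
check still returns True, which is the case the timestamp alone misses.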


An alternative would be for wget to use a temporary file, as suggested in the 
last two emails. Obviously this would need to be implemented in wget itself:

1) send the --if-modified-since request with the local file's timestamp, when 
the file already exists.
2) download to a temporary file name (it must be in the same directory or you 
will have issues with rename across volumes)
3) set the timestamp on the temporary file upon completion
4) rename the temporary file over the original
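
Steps 2 to 4 can be sketched in Python as follows. This is only an 
illustration of the idea, not a patch for wget: the function name 
`save_atomically`, the `chunks` iterable standing in for the network stream, 
and the temporary file prefix are all assumptions. Step 1 (the conditional 
request) is left out.

```python
import os
import tempfile


def save_atomically(dest, chunks, remote_mtime):
    """Write chunks to a temp file next to dest, stamp it, then rename."""
    directory = os.path.dirname(os.path.abspath(dest))
    # Step 2: same directory as dest, so the rename never crosses volumes.
    fd, tmp = tempfile.mkstemp(prefix=os.path.basename(dest) + "_tmp.",
                               dir=directory)
    try:
        with os.fdopen(fd, "wb") as out:
            for chunk in chunks:
                out.write(chunk)
        # Step 3: set the upstream timestamp only after the last write.
        os.utime(tmp, (remote_mtime, remote_mtime))
        # Step 4: replace dest in one step (atomic on POSIX filesystems).
        os.replace(tmp, dest)
    except BaseException:
        # Interrupted or failed: discard the partial data; dest is untouched.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

If the transfer breaks midway, the existing file keeps both its content and 
its old timestamp, so the next "wget -N" run would see it as unchanged rather 
than as a newer, corrupted file.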



I can think of another workaround, if it's possible to set the timestamp 
initially.
Wget could create the file, set its timestamp to the "oldest timestamp", write 
the content gradually, and finally set the real timestamp when the download is 
complete.
However, that doesn't work if every write sets the file timestamp to now. 
I don't know how the filesystem operates.
You mentioned an option "c) to write the timestamp after every write operation 
if needs be". Unfortunately that doesn't fix the issue: the download can be 
interrupted between the write and the timestamp update, leaving a corrupted 
file with a newer date.
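
On the question of whether every write resets the timestamp, here is a small 
Python experiment that can be run locally. It is a sketch; the helper name 
`mtime_after_write` is made up. As far as I know, POSIX requires a successful 
write to mark the file's modification time for update, so I would expect the 
early timestamp to be overwritten, but results may differ on unusual 
filesystems.

```python
import os
import tempfile


def mtime_after_write(path):
    """Set mtime to the epoch, append data, and return the resulting mtime."""
    os.utime(path, (0, 0))  # the "oldest timestamp"
    with open(path, "ab") as out:
        out.write(b"more data")
    return os.stat(path).st_mtime


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    # A value far from 0 means the write replaced the early timestamp.
    print(mtime_after_write(path))
    os.unlink(path)
```

If the printed value is close to the current time, the in-place scheme still 
needs a final timestamp update after the last write, and an interruption 
before that update leaves the same newer-dated partial file.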

Regards.


-----Original Message-----
From: Derek Martin <demartin@akamai.com>
Sent: Wednesday, June 12, 2024 7:14 PM
To: Tim Rühsen <tim.ruehsen@gmx.de>
Cc: Romain Morotti (London) <Romain.Morotti@man.com>; wget-dev@gnu.org; 
bug-wget <bug-wget@gnu.org>
Subject: Re: CRITICAL BUG: wget -N is leaving corrupted files

On Sat, Jun 08, 2024 at 07:21:14PM +0200, Tim Rühsen wrote:
> What other options do we have to make --if-modified-since workable in
> your scenario? (Apart from switching --if-modified-since off)
>
> a) When you download a file, use a temporary file name. After wget
> exits, check the return status and if it is 0, rename the file.
>
> The downside is that you always have to download the file, even if it
> didn't change.

I think this is probably the right solution, except:

1. ALWAYS rename the file, even if the download fails / is
  interrupted.

2. BEFORE the rename, set the timestamp appropriately:

  - set it to the original local file's timestamp if the transfer did
    not complete successfully

  - set it to the upstream file's timestamp if it did complete
    successfully.

3. To successfully do that for the most possible cases, you'll need to
   catch signals and delay their handling until the above is done, in
   addition to whatever other error handling is already required.

And probably:

4. Document that in cases where clean-up procedures can't catch every last 
case, temporary files may be left behind, so the user can expect them on errors 
and manually clean them up.  Probably also name the temporary file something 
like ${original_file_name}_tmp.XXXXXX so that the user can, if they so choose, 
rename it to ${original_file_name} and manually reset the time stamp to get 
wget to resume/redownload or whatever.
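
Points 1 to 3 above could look roughly like this in Python. It is a sketch 
under assumptions, not wget code: `finalize` and its parameters are 
hypothetical, signal blocking via `signal.pthread_sigmask` is POSIX-only, and 
using the epoch when no old local file exists is my own choice, since the list 
does not cover that case.

```python
import os
import signal


def finalize(tmp, dest, completed, remote_mtime):
    """Stamp tmp according to the outcome, then always rename it over dest."""
    # Point 3: hold back SIGINT/SIGTERM until the rename is done.
    blocked = {signal.SIGINT, signal.SIGTERM}
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, blocked)
    try:
        if completed:
            stamp = remote_mtime               # upstream file's timestamp
        elif os.path.exists(dest):
            stamp = os.stat(dest).st_mtime     # original local timestamp
        else:
            stamp = 0                          # assumption: no old file
        os.utime(tmp, (stamp, stamp))          # point 2: stamp BEFORE rename
        os.replace(tmp, dest)                  # point 1: always rename
    finally:
        # Pending signals are delivered once the old mask is restored.
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Unlike discarding the temporary file, this keeps the partial data in place but 
with an old timestamp, so a later run sees the upstream file as newer.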


