RE: CRITICAL BUG: wget -N is leaving corrupted files
From: Romain Morotti (London)
Subject: RE: CRITICAL BUG: wget -N is leaving corrupted files
Date: Wed, 19 Jun 2024 11:32:56 +0000
Hello,
Sorry for the delay in getting back to you.
I think the appropriate solution is to send the HEAD request by default, as
wget was doing before.
From the previous PRs that changed the behavior, it's not clear to me why it
was changed in the first place. I think they missed the edge case where an
interrupted download leaves a corrupted file.
There were actually a few threads that reported the issue afterwards, but the
buggy behavior was not reverted. I think people didn't manage to identify the
root cause: the new behavior itself is buggy.
I wonder if the purpose of the change was a micro-optimization to save a HEAD
request?
The whole point of "wget -N" is to avoid redownloading large files that are
unchanged (in my case, GBs of files to deploy). Workarounds that require
downloading the file and then comparing are not viable. ^^
Personally, I don't think the HEAD request needs to be optimized away. The
"wget -N" flag is meant to avoid downloading large files, so it's very
reasonable to send one HEAD request to save MBs or GBs of downloads.
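
To illustrate the HEAD-first check described above, here is a minimal Python
sketch (not wget's actual code). It assumes the server returns a Last-Modified
header, and it compares timestamps only. The function names
`remote_is_newer` and `head_last_modified` are hypothetical.

```python
import os
from email.utils import parsedate_to_datetime
from urllib.request import Request, urlopen


def remote_is_newer(local_path, last_modified_header):
    """Return True if the file should be downloaded (timestamp check only)."""
    if not os.path.exists(local_path):
        return True  # nothing local yet
    if not last_modified_header:
        return True  # assumption: no server timestamp means "download"
    remote_ts = parsedate_to_datetime(last_modified_header).timestamp()
    return remote_ts > os.path.getmtime(local_path)


def head_last_modified(url):
    """Send one HEAD request and return the Last-Modified header, or None."""
    with urlopen(Request(url, method="HEAD")) as resp:
        return resp.headers.get("Last-Modified")
```

A caller would run `remote_is_newer(path, head_last_modified(url))` and only
start the GET when it returns True, so an unchanged multi-GB file costs one
small request.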
Alternatively, wget could use a temporary file, as suggested in the last two
emails. Obviously this would need to be implemented in wget itself:
1) do the --if-modified-since with the file timestamp, when the file already
exists.
2) download to a temporary file name (it must be in the same directory or you
will have issues with rename across volumes)
3) set the timestamp on the temporary file upon completion
4) rename the temporary file
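
Steps 2 to 4 can be sketched in Python as follows. This is only an
illustration under assumptions, not wget code: `save_atomically` is a
hypothetical helper, `chunks` stands in for the bytes arriving from the
network, and `remote_mtime` is the server's timestamp as seconds since the
epoch. Unlike the "always rename" variant discussed later in the thread, this
version deletes the temporary file on failure.

```python
import os
import tempfile


def save_atomically(dest_path, chunks, remote_mtime):
    """Write chunks to a temp file beside dest_path, stamp it, then rename."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    prefix = os.path.basename(dest_path) + "_tmp."
    # Same directory as the destination, so the rename never crosses volumes.
    fd, tmp_path = tempfile.mkstemp(prefix=prefix, dir=dest_dir)
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())
        # Step 3: set the timestamp only once the content is complete.
        os.utime(tmp_path, (remote_mtime, remote_mtime))
        # Step 4: replace the destination in a single rename.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Interrupted or failed: the old destination file is untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

With this shape, the destination name only ever holds either the old complete
file with its old timestamp or the new complete file with the new timestamp.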
I can think of another workaround, if it's possible to set the timestamp
initially.
Wget can create the file, set its timestamp to the "oldest timestamp", write
the content gradually, and finally set the real timestamp once the download is
complete.
However, that doesn't work if every write resets the file timestamp to now.
I don't know how the filesystem behaves here.
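
One way to check the question above is a small experiment. On POSIX systems a
successful write marks the file's modification time for update, so the
placeholder timestamp is expected not to survive the writes. The Python sketch
below is an assumption-laden illustration (the helper name `write_then_stamp`
and the choice of 0 as "oldest timestamp" are hypothetical). It returns the
modification time observed after the writes and before the final stamp.

```python
import os

OLDEST = 0  # assumption: the epoch stands in for the "oldest timestamp"


def write_then_stamp(path, chunks, final_mtime):
    """Create path with an old mtime, append chunks, then set final_mtime.

    Returns the mtime seen after the writes but before the final stamp.
    """
    with open(path, "wb"):
        pass  # create (or truncate) the file
    os.utime(path, (OLDEST, OLDEST))
    with open(path, "ab") as f:
        for chunk in chunks:
            f.write(chunk)
    mtime_after_writes = os.path.getmtime(path)
    os.utime(path, (final_mtime, final_mtime))
    return mtime_after_writes
```

If the returned value is close to the current time rather than `OLDEST`, the
writes did reset the timestamp, and a download interrupted before the final
`os.utime` call would leave a partial file that looks new.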
You mentioned an option "c) to write the timestamp after every write operation
if needs be". Unfortunately that doesn't fix the issue: the download can be
interrupted between the write and the call that sets the timestamp, leaving a
corrupted file with a newer date.
Regards.
-----Original Message-----
From: Derek Martin <demartin@akamai.com>
Sent: Wednesday, June 12, 2024 7:14 PM
To: Tim Rühsen <tim.ruehsen@gmx.de>
Cc: Romain Morotti (London) <Romain.Morotti@man.com>; wget-dev@gnu.org;
bug-wget <bug-wget@gnu.org>
Subject: Re: CRITICAL BUG: wget -N is leaving corrupted files
On Sat, Jun 08, 2024 at 07:21:14PM +0200, Tim Rühsen wrote:
> What other options do we have to make --if-modified-since workable in
> your scenario? (Apart from switching --if-modified-since off)
>
> a) When you download a file, use a temporary file name. After wget
> exits, check the return status and if it is 0, rename the file.
>
> The downside is that you always have to download the file, even if it
> didn't change.
I think this is probably the right solution, except:
1. ALWAYS rename the file, even if the download fails / is
interrupted.
2. BEFORE the rename, set the timestamp appropriately:
- set it to the original local file's timestamp if the transfer did
not complete successfully
- set it to the upstream file's timestamp if it did complete
successfully.
3. To successfully do that for the most possible cases, you'll need to
catch signals and delay their handling until the above is done, in
addition to whatever other error handling is already required.
And probably:
4. Document that in cases where clean-up procedures can't catch every last
case, temporary files may be left behind, so the user can expect them on errors
and manually clean them up. Probably also name the temporary file something
like ${original_file_name}_tmp.XXXXXX so that the user can, if they so choose,
rename it to ${original_file_name} and manually reset the time stamp to get
wget to resume/redownload or whatever.
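
The variant in points 1 to 3 above could look roughly like this in Python.
It is a sketch under assumptions, not a proposal for wget's C code:
`fetch_with_fallback_stamp` is a hypothetical name, `chunks` stands in for the
network stream, a missing local file is treated as having timestamp 0, and
`signal.pthread_sigmask` is Unix-only and is called here from the main thread.

```python
import os
import signal
import tempfile


def fetch_with_fallback_stamp(dest_path, chunks, remote_mtime):
    """Download to a temp file, then ALWAYS stamp and rename it."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    try:
        original_mtime = os.path.getmtime(dest_path)
    except FileNotFoundError:
        original_mtime = 0  # assumption: no local file counts as "very old"
    prefix = os.path.basename(dest_path) + "_tmp."
    fd, tmp_path = tempfile.mkstemp(prefix=prefix, dir=dest_dir)

    error = None
    try:
        with os.fdopen(fd, "wb") as tmp:
            for chunk in chunks:
                tmp.write(chunk)
    except BaseException as exc:  # includes KeyboardInterrupt (Ctrl-C)
        error = exc

    # Point 3: hold back SIGINT/SIGTERM until stamping and renaming are done.
    deferred = {signal.SIGINT, signal.SIGTERM}
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, deferred)
    try:
        # Point 2: upstream timestamp on success, original one on failure.
        stamp = remote_mtime if error is None else original_mtime
        os.utime(tmp_path, (stamp, stamp))
        # Point 1: rename even when the transfer did not complete.
        os.replace(tmp_path, dest_path)
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)

    if error is not None:
        raise error
```

After a failed run the destination holds partial content but keeps the old
timestamp, so the next timestamp comparison still sees the remote file as
newer and fetches it again.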