bug-wget

Re: CRITICAL BUG: wget -N is leaving corrupted files


From: Derek Martin
Subject: Re: CRITICAL BUG: wget -N is leaving corrupted files
Date: Fri, 21 Jun 2024 16:19:52 -0400

On Wed, Jun 19, 2024 at 11:32:56AM +0000, Romain Morotti (London) wrote:
> I think the appropriate solution is to send the HEAD request by
> default, as wget was doing before.

I came to that conclusion as well.

> I wonder if the purpose of the change was a micro-optimization to
> save a HEAD request?

That's what I intuited.  Note that it isn't necessarily a
micro-optimization:  The HEAD request isn't a big deal when you're
transferring large files (overhead is practically negligible, as a
percentage), but when your use case involves a large number of small
files, it becomes quite significant.  It would be a good optimization
for that use case...

However, in this case, it is probably equivalent to simply NOT using
-N in the first place.  Caveat user. :)  (This might be worth a note
in the man page--assuming the behavior is reverted.)
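As a stopgap for anyone hitting this today: recent wget releases
document a switch to suppress the If-Modified-Since header in -N mode,
which restores the older HEAD-based freshness check (check your
build's man page to confirm it's available; the URL below is just an
example):

```shell
# Check freshness with a HEAD request instead of a conditional GET
# (see wget(1), --no-if-modified-since); example.com is a placeholder.
wget -N --no-if-modified-since https://example.com/file.tar.gz
```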

> I think there could be an alternative behavior for wget by using a
> temporary file, as suggested in the last 2 emails: Obviously this
> would need to be corrected in wget itself.

In practice, though, this would still leave behind turds in at least
some cases.  You can eliminate SOME of those cases by adding some
signal handling and relatively complicated plumbing to detect and
clean up those temp files, but you couldn't eliminate all of them,
e.g. power failure.  There's also the question of what to do on
SIGSEGV: you would probably have to skip cleanup entirely, since you
have no way to know what caused the fault--it could have been the
cleanup code itself, or literally anything else.  So cleaning up after
a SIGSEGV might be unsafe, and should probably be assumed to be.

All in all, this can't save you in every case, and the complexity is
likely not worthwhile in that context.

> I can think of another workaround if it's possible to set the
> timestamp initially.  Wget can create the file, set the timestamp to
> "oldest timestamp", write the content gradually, and finally set the
> timestamp when the download is completed.  However that doesn't work
> if every write is setting the file timestamp to now?

They do, at least on some systems.  I don't know if it's universal,
but I believe at least most Unixen will behave that way.  [Individual
file system implementations can have their own semantics, so it's hard
to say conclusively anyway.]  So that solution wouldn't work,
generally.
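This is easy to check locally.  The sketch below (assumes GNU or BSD
stat/touch) pre-dates a file's mtime and then performs a single
append, standing in for one write of a download:

```shell
# On typical Unix filesystems every write resets the file's mtime,
# so a pre-set "oldest timestamp" does not survive the download's
# writes.  Demonstrate with one append to a freshly back-dated file.
f=$(mktemp)
touch -t 200001010000 "$f"                            # pretend "oldest timestamp"
before=$(stat -c %Y "$f" 2>/dev/null || stat -f %m "$f")
printf 'some downloaded bytes\n' >> "$f"              # one write of the transfer
after=$(stat -c %Y "$f" 2>/dev/null || stat -f %m "$f")
echo "before=$before after=$after"
rm -f "$f"
```

On any system where this prints a larger "after" value, the
"set-the-timestamp-first" workaround cannot work.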

> You mentioned an option "c) to write the timestamp after every write
> operation if needs be".  Unfortunately that doesn't fix the issue.
> The download can be interrupted between the write and the
> writetimestamp calls, leaving a corrupted file with a newer date. It
> doesn't resolve the issue.

[Not me, but Tim did mention that.]

Quite so.

So, for all those reasons, I concluded that the original behavior was
the right one.  I don't think there is a workable alternative that
would be reliable, regardless of complexity.

Well, maybe there is one:  Do the GET regardless, and look at the
headers.  If you don't need the file, close the connection.  But this
surely would be considered rude in some circles, and you'd have to set
up a new connection to continue.  Probably saves very little if
anything, in practice, as compared to just doing the HEAD request,
particularly when wget is able to use persistent connections.  


> -----Original Message-----
> From: Derek Martin <demartin@akamai.com>
> Sent: Wednesday, June 12, 2024 7:14 PM
> To: Tim Rühsen <tim.ruehsen@gmx.de>
> Cc: Romain Morotti (London) <Romain.Morotti@man.com>; wget-dev@gnu.org; 
> bug-wget <bug-wget@gnu.org>
> Subject: Re: CRITICAL BUG: wget -N is leaving corrupted files
> 
> 
> On Sat, Jun 08, 2024 at 07:21:14PM +0200, Tim Rühsen wrote:
> > What other options do we have to make --if-modified-since workable in
> > your scenario? (Apart from switching --if-modified-since off)
> >
> > a) When you download a file, use a temporary file name. After wget
> > exits, check the return status and if it is 0, rename the file.
> >
> > The downside is that you always have to download the file, even if
> > it didn't change.
> 
> I think this is probably the right solution, except:
> 
> 1. ALWAYS rename the file, even if the download fails / is
>   interrupted.
> 
> 2. BEFORE the rename, set the timestamp appropriately:
> 
>   - set it to the original local file's timestamp if the transfer did
>     not complete successfully
> 
>   - set it to the upstream file's timestamp if it did complete
>     successfully.
> 
> 3. To successfully do that for the most possible cases, you'll need to
>    catch signals and delay their handling until the above is done, in
>    addition to whatever other error handling is already required.
> 
> And probably:
> 
> 4. Document that in cases where clean-up procedures can't catch every last 
> case, temporary files may be left behind, so the user can expect them on 
> errors and manually clean them up.  Probably also name the temporary file 
> something like ${original_file_name}_tmp.XXXXXX so that the user can, if they 
> so choose, rename it to ${original_file_name} and manually reset the time 
> stamp to get wget to resume/redownload or whatever.
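The four points above can be sketched as a small wrapper.  This is a
hedged sketch, not wget's actual implementation: `do_transfer` is a
hypothetical stand-in for the real download (here it just copies a
local file, whose mtime also stands in for the upstream timestamp), so
the rename/timestamp logic can be exercised locally:

```shell
# Sketch of points 1-4: download to a temp file, set the timestamp,
# and ALWAYS rename.  "do_transfer" is a placeholder for the real
# wget invocation.
do_transfer() { cp "$2" "$1"; }

fetch_atomically() {
    dest=$1 src=$2
    tmp=$(mktemp "${dest}_tmp.XXXXXX") || return 1
    trap '' INT TERM                     # point 3 (approximated): hold off
                                         # INT/TERM across the critical section
    if do_transfer "$tmp" "$src"; then
        touch -r "$src" "$tmp"           # point 2: upstream timestamp on success
    elif [ -e "$dest" ]; then
        touch -r "$dest" "$tmp"          # point 2: original local timestamp on failure
    fi
    mv -f "$tmp" "$dest"                 # point 1: ALWAYS rename (atomic on one fs)
    trap - INT TERM
}
```

Of course, a shell trap can only hold off catchable signals; as
discussed above, SIGKILL and power failure still leave a window, which
is point 4.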
> 
> 
> 

-- 
Derek Martin
Principal System Software Engineer
Akamai Technologies
demartin@akamai.com


