Re: CRITICAL BUG: wget -N is leaving corrupted files


From: Tim Rühsen
Subject: Re: CRITICAL BUG: wget -N is leaving corrupted files
Date: Sat, 8 Jun 2024 19:21:14 +0200
User-agent: Mozilla Thunderbird

Hey,

just repeating what you already wrote:

The --if-modified-since behaviour (enabled by default with -N) relies on the local file having the proper timestamp when it already exists on your disk.

When stopping wget in the middle of a download, the partial file's timestamp is the current timestamp. Restarting the download with -N will very likely get a 304 from the server because it is unlikely that the file content changed on the server side in the meantime.
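To make the timestamp problem concrete, here is a minimal sketch of the comparison a server effectively performs for If-Modified-Since. It assumes GNU coreutils (stat -c, date -d); the function name local_is_newer is hypothetical and not part of wget.

```shell
# local_is_newer FILE HTTP_DATE
# Succeeds when FILE's mtime is at or after HTTP_DATE (the server's
# Last-Modified value), i.e. the situation in which a conditional GET
# would typically be answered with 304.
# Assumption: GNU stat and GNU date are available.
local_is_newer() {
    file=$1
    last_modified=$2
    local_ts=$(stat -c %Y "$file") || return 2
    server_ts=$(date -u -d "$last_modified" +%s) || return 2
    [ "$local_ts" -ge "$server_ts" ]
}
```

A freshly interrupted download has "now" as its mtime, so local_is_newer partial.file "Thu, 23 May 2024 10:46:13 GMT" succeeds, although the content is incomplete.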

Using the HEAD request instead of --if-modified-since helps in your special case. But I recently learned that not all servers even allow HEAD requests. And the HEAD request always adds another round trip (extra request + response), so this is a sub-optimal solution.

What other options do we have to make --if-modified-since workable in your scenario? (Apart from switching --if-modified-since off)

a) When you download a file, use a temporary file name. After wget exits, check the return status and if it is 0, rename the file.

The downside is that you always have to download the file, even if it didn't change.

b) Wget could make sure that the file timestamp is set properly when exiting. This doesn't always work, e.g. when the whole system is switched off, no exit function in wget will be executed.

c) After every write, the file's timestamp is set to the server's timestamp. This doubles the number of syscalls and will increase CPU usage.

d) Use the metalink protocol. It provides checksums for your files and if the checksum of the local file differs, only the corrupted parts are re-downloaded. It's a great protocol (working via HTTP/HTTPS) that sadly never got real traction. So most distributions don't compile it in and you have to build your own version of wget.

e) Use FTP(S)... the downside is that your admin likely isn't happy about it.
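Option a) could be sketched as a small wrapper script. This is only an illustration under assumptions: the function name fetch_atomic and the ".part" suffix are made up here, and the WGET variable exists only so the downloader can be swapped out. It does not use -N, so it always downloads the file.

```shell
# WGET can be overridden; it defaults to the wget found in PATH.
WGET=${WGET:-wget}

# fetch_atomic URL DEST
# Downloads URL into DEST.part and renames it to DEST only when the
# downloader exits with status 0. On failure the partial file is removed
# and the downloader's exit status is returned.
fetch_atomic() {
    url=$1
    dest=$2
    tmp="$dest.part"
    rm -f "$tmp"
    if "$WGET" -q -O "$tmp" "$url"; then
        mv -f "$tmp" "$dest"
    else
        status=$?
        rm -f "$tmp"
        return "$status"
    fi
}
```

With this, an interrupted run never leaves a truncated file under the final name, at the cost of transferring the file on every run.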

Maybe b) is a compromise, even if it's not perfect!?

Again, I am not saying that switching back to HEAD requests is out of the race. First, I'd like to see more ideas / suggestions on this topic.

Regards, Tim


On 5/24/24 13:53, Romain Morotti (London) via Public discussion list for GNU Wget development wrote:
Hello,

Apologies for the long email; the issue was quite difficult to debug. 
I hope you can roll a fix.
There are previous bug reports related to this issue, but they never reached a 
repro or an explanation.


TL;DR critical bug in wget, wget is leaving corrupted files when using the -N 
flag.

ROOT CAUSE: A change of behaviour in or around version v1.17. The wget -N code 
was rewritten and a new flag, --no-if-modified-since, was added (off by 
default). Unfortunately, the new code and behaviour are incorrect and leave 
corrupted downloads.

FIX: -N must always be used together with --no-if-modified-since behavior, 
otherwise wget will leave corrupted files.
The flag --no-if-modified-since should be set by default when -N is used.

WORKAROUND: As a workaround, you can set together “-N --no-if-modified-since” 
in the command line, however the flag does not exist on older versions of wget 
and will fail. You may have to detect wget versions and pass relevant flags if 
you plan to deploy on multiple systems with various wget versions.
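A minimal sketch of such a detection, assuming that "wget --help" lists --no-if-modified-since on versions that support it. The function name timestamping_flags and the WGET variable are hypothetical helpers, not part of wget.

```shell
# WGET can be overridden; it defaults to the wget found in PATH.
WGET=${WGET:-wget}

# timestamping_flags
# Prints "-N --no-if-modified-since" when the installed wget advertises
# the flag in its help text, and plain "-N" otherwise.
timestamping_flags() {
    if "$WGET" --help 2>/dev/null | grep -q -e '--no-if-modified-since'; then
        printf '%s\n' '-N --no-if-modified-since'
    else
        printf '%s\n' '-N'
    fi
}
```

It would be used as: wget $(timestamping_flags) https://mycompany.com/myarchive.tar.gz (the substitution is left unquoted on purpose so the flags are split into separate arguments). Checking for the feature avoids hard-coding a version number.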


CONTEXT:

We use wget to download archives and large files to deploy. We started getting 
regular issues with corrupted archives after moving to ubuntu 22 and latest 
version of wget.


```
$ wget -N https://mycompany.com/myarchive.tar.gz
$ tar -xf myarchive.tar.gz
(stdin): File ends unexpectedly at pos 94479367
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
```

It took me forever to get to the bottom of it. It is an issue with wget leaving 
partial, corrupted downloads, and it is a bug in wget itself.
The wget -N flag is meant to (re)download a file only when the timestamp of the 
file or the file size has changed. It simply stopped working as expected in 
recent versions, such as the one shipped in ubuntu 22.


We see the issue happening regularly in production.
It triggers after wget is interrupted once. Interruptions can happen for any 
reason: the user can Ctrl+C a script, a deployment can be cancelled, the 
process can be killed or the machine rebooted at any moment.
When wget is interrupted, it leaves a partially downloaded file. The timestamp is 
newer but the size doesn't match the expected file size.

* In older versions of wget, wget was sending a HEAD request to get the 
file size and the timestamp, then it downloaded the file if the date changed or 
the size changed. wget worked as expected.
* In recent versions of wget, wget does not detect the file size is incorrect. 
wget is stuck with a bad file and can never recover.

Recovery requires intervention from a developer or SRE to go onto the affected 
machine and delete the bad files left over by wget.


REPRO:

You can Ctrl+C to interrupt wget or you can run “truncate” to simulate a 
partial download.

```
wget --version
wget -N https://mycompany.com/myarchive.tar.lz --debug --server-response
truncate --size 1 myarchive.tar.lz
wget -N https://mycompany.com/myarchive.tar.lz --debug --server-response
```


DEBUGGING: see logs below for the last call to wget, after truncate

Notice that in recent versions, wget sends a single GET request with an 
If-Modified-Since header, and the server replies with a 304 response to say 
that the content did not change.
The 304 response has no Content-Length header and no content.

This is an edge case of the HTTP spec. The Content-Length header is not required 
on a 304 response. The header may be set but it is not required.
Looking at the web server response (artifactory/tomcat), Content-Length 
is not set.
See HTTP RFC https://datatracker.ietf.org/doc/html/rfc7232#section-4.1

This is a very interesting side effect of the HTTP spec and the real world. It 
prevents wget from knowing about the file size or getting the content.
Turns out, detecting the file size is critical for "wget -N" to operate as 
expected. Otherwise it will get into a bad state where a file on disk is bad but wget 
can’t detect the issue and can’t redownload.

I think wget must always send a HEAD request first.
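As an illustration of the size check that older versions performed, here is a minimal sketch that compares a local file against the Content-Length found in a set of response headers. The function names expected_length and size_matches are hypothetical; how the headers are obtained is left to the caller.

```shell
# expected_length
# Reads response headers on stdin and prints the value of the last
# Content-Length header, or nothing when the header is absent
# (as in the 304 response).
expected_length() {
    tr -d '\r' |
        awk 'tolower($1) == "content-length:" { len = $2 }
             END { if (len != "") print len }'
}

# size_matches FILE BYTES
# Succeeds only when FILE exists and is exactly BYTES bytes long.
size_matches() {
    [ -f "$1" ] || return 1
    actual=$(($(wc -c < "$1")))
    [ "$actual" -eq "$2" ]
}
```

The headers could be captured with something like: wget --spider --server-response URL 2>&1, which issues a HEAD request. A script can then delete the local file before calling wget -N whenever size_matches fails.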


```
wget 1.14 on centos 7
works as expected, send a HEAD request, detect the size has changed, then 
redownload

wget -N --server-response https://mycompany.com/myarchive.tar.lz
--2024-05-24 10:42:52--  https://mycompany.com/myarchive.tar.lz
Resolving mycompany.com (mycompany.com)... 10.192.10.20
Connecting to mycompany.com (mycompany.com)|10.192.10.20|:443... connected.
HTTP request sent, awaiting response...
   HTTP/1.1 200 OK
   Date: Fri, 24 May 2024 09:42:52 GMT
   Content-Type: application/octet-stream
   Content-Length: 185751081
   Connection: keep-alive
   Server: Artifactory
   X-Artifactory-Id: 5e06b1f8f8c7e195:5afd0284:18d702c2085:-8000
   X-Artifactory-Node-Id: dc09bebb5d42
   Last-Modified: Thu, 23 May 2024 10:46:13 GMT
   ETag: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
   X-Checksum-Sha1: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
   X-Checksum-Sha256: 12a219e5c632629f11cfcd954069c1bc5e2273c1684d0877fdea0cf60b2e0d78
   X-Checksum-Md5: 8b8a1d9db73eb2fbb635b45317320f19
   Accept-Ranges: bytes
   X-Artifactory-Filename: myarchive.tar.lz
   Content-Disposition: attachment; filename="myarchive.tar.lz"; filename*=UTF-8''myarchive.tar.lz
Length: 185751081 (177M) [application/octet-stream]
The sizes do not match (local 1) -- retrieving.

--2024-05-24 10:42:52--  https://mycompany.com/myarchive.tar.lz
Reusing existing connection to mycompany.com:443.
HTTP request sent, awaiting response...
   HTTP/1.1 200 OK
   Date: Fri, 24 May 2024 09:42:52 GMT
   Content-Type: application/octet-stream
   Content-Length: 185751081
   Connection: keep-alive
   Server: Artifactory
   X-Artifactory-Id: 5e06b1f8f8c7e195:5afd0284:18d702c2085:-8000
   X-Artifactory-Node-Id: dc09bebb5d42
   Last-Modified: Thu, 23 May 2024 10:46:13 GMT
   ETag: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
   X-Checksum-Sha1: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
   X-Checksum-Sha256: 12a219e5c632629f11cfcd954069c1bc5e2273c1684d0877fdea0cf60b2e0d78
   X-Checksum-Md5: 8b8a1d9db73eb2fbb635b45317320f19
   Accept-Ranges: bytes
   X-Artifactory-Filename: myarchive.tar.lz
   Content-Disposition: attachment; filename="myarchive.tar.lz"; filename*=UTF-8''myarchive.tar.lz
Length: 185751081 (177M) [application/octet-stream]
Saving to: ‘myarchive.tar.lz’

100%[==============================================================================>] 185,751,081  277MB/s   in 0.6s

2024-05-24 10:42:53 (277 MB/s) - ‘myarchive.tar.lz’ saved [185751081/185751081]
```


```
wget 1.21 on ubuntu 22
doesn’t work. wget incorrectly thinks there is nothing to download.

wget -N --server-response https://mycompany.com/myarchive.tar.lz
--2024-05-24 10:42:11--  https://mycompany.com/myarchive.tar.lz
Resolving mycompany.com (mycompany.com)... 10.192.10.20
Connecting to mycompany.com (mycompany.com)|10.192.10.20|:443... connected.
HTTP request sent, awaiting response...
   HTTP/1.1 304 Not Modified
   Date: Fri, 24 May 2024 09:42:11 GMT
   Connection: keep-alive
   Server: Artifactory
   X-Artifactory-Id: 5e06b1f8f8c7e195:5afd0284:18d702c2085:-8000
   X-Artifactory-Node-Id: dc09bebb5d42
   Last-Modified: Thu, 23 May 2024 10:46:13 GMT
   ETag: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
   X-Checksum-Sha1: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
   X-Checksum-Sha256: 12a219e5c632629f11cfcd954069c1bc5e2273c1684d0877fdea0cf60b2e0d78
   X-Checksum-Md5: 8b8a1d9db73eb2fbb635b45317320f19
   Accept-Ranges: bytes
   X-Artifactory-Filename: myarchive.tar.lz
   Content-Disposition: attachment; filename="myarchive.tar.lz"; filename*=UTF-8''myarchive.tar.lz
File ‘myarchive.tar.lz’ not modified on server. Omitting download.
```



Regards.





