Hello,
Apologies for the long email; this issue was quite difficult to debug. I hope
you can roll out a fix.
There are previous bug reports related to this issue, but they never reached a
repro or an explanation.
TL;DR: critical bug in wget. wget leaves corrupted files in place when the -N
flag is used.
ROOT CAUSE: Change of behaviour in or around version 1.17. The wget -N code was
rewritten and a new flag, --no-if-modified-since, was added (off by default).
Unfortunately the new default behaviour is incorrect and leaves corrupted
downloads in place.
FIX: -N must always behave as if --no-if-modified-since were set, otherwise
wget will leave corrupted files.
The flag --no-if-modified-since should be set by default when -N is used.
WORKAROUND: Pass "-N --no-if-modified-since" together on the command line.
However, the flag does not exist in older versions of wget, which will fail on
it. You may have to detect the wget version and pass the relevant flags if you
plan to deploy on multiple systems with different wget versions.
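The detection could be sketched roughly as follows. This is only a sketch: it
assumes that "wget --help" lists --no-if-modified-since on builds that support
it, and the function names help_has_flag and wget_timestamp_flags are made up
for illustration.

```shell
#!/bin/sh
# Return 0 if the given help text mentions --no-if-modified-since.
# Assumption: wget builds that support the flag list it in "wget --help".
help_has_flag() {
    printf '%s\n' "$1" | grep -q -e '--no-if-modified-since'
}

# Print the flags to use for timestamping mode, given the help text.
wget_timestamp_flags() {
    if help_has_flag "$1"; then
        echo "-N --no-if-modified-since"
    else
        echo "-N"
    fi
}

# Usage (commented out because it needs wget and network access):
# flags=$(wget_timestamp_flags "$(wget --help 2>&1)")
# wget $flags https://mycompany.com/myarchive.tar.gz
```

Checking the help text rather than parsing the version number avoids having to
know exactly which release introduced the flag.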
CONTEXT:
We use wget to download archives and large files for deployment. We started
getting regular issues with corrupted archives after moving to Ubuntu 22 and a
more recent version of wget.
```
$ wget -N https://mycompany.com/myarchive.tar.gz
$ tar -xf myarchive.tar.gz
(stdin): File ends unexpectedly at pos 94479367
tar: Unexpected EOF in archive
tar: Unexpected EOF in archive
tar: Error is not recoverable: exiting now
```
It took me a long time to get to the bottom of it: wget leaves partial,
corrupted downloads in place. It is a bug in wget itself.
The wget -N flag is meant to (re)download a file only when the timestamp or the
size of the file has changed. It stopped working as expected in recent
versions, such as the one shipped with Ubuntu 22.
We see the issue happening regularly in production.
It is triggered once wget has been interrupted. Interruptions can happen for
any reason: a user can Ctrl+C a script, a deployment can be cancelled, the
process can be killed, or the machine can be rebooted at any moment.
When wget is interrupted, it leaves a partially downloaded file. Its timestamp
is newer than the server's, but its size doesn't match the expected file size.
* In older versions, wget sent a HEAD request to get the file size and the
timestamp, then downloaded the file if the date or the size had changed. wget
worked as expected.
* In recent versions, wget does not detect that the file size is incorrect.
wget is stuck with a bad file and can never recover.
Recovery requires a developer or SRE to log in to the affected machine and
delete the bad files left over by wget.
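To make the difference concrete, the decision made by older versions, as
described above, can be sketched like this. It is an illustration of the
described logic, not wget's actual code; the function name needs_download and
the use of epoch seconds for timestamps are assumptions.

```shell
#!/bin/sh
# Return 0 (download needed) if the sizes differ or the remote file is newer.
# Arguments: local_size remote_size local_mtime remote_mtime (epoch seconds).
needs_download() {
    local_size=$1
    remote_size=$2
    local_mtime=$3
    remote_mtime=$4
    # A size mismatch catches truncated files even when the local copy
    # has a newer timestamp than the server copy.
    if [ "$local_size" -ne "$remote_size" ]; then
        return 0
    fi
    if [ "$remote_mtime" -gt "$local_mtime" ]; then
        return 0
    fi
    return 1
}

# A 1-byte leftover with a newer local timestamp still triggers a download:
if needs_download 1 185751081 1716543772 1716461173; then
    echo "retrieving"
fi
```

With only the timestamp comparison, the same leftover file would be skipped,
which is the behaviour we see in recent versions.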
REPRO:
You can press Ctrl+C to interrupt wget, or you can run "truncate" to simulate a
partial download.
```
wget --version
wget -N https://mycompany.com/myarchive.tar.lz --debug --server-response
truncate --size 1 myarchive.tar.lz
wget -N https://mycompany.com/myarchive.tar.lz --debug --server-response
```
DEBUGGING: see the logs below for the last call to wget, after truncate.
Notice that recent versions of wget send a single GET request with an
If-Modified-Since header, and the server replies with a 304 response to say
that the content did not change.
The 304 response has no Content-Length header and no content.
This is an edge case of the HTTP spec: the Content-Length header is not
required on a 304 response. The header may be set, but it is not required.
Looking at the web server response (Artifactory/Tomcat), Content-Length is not
set.
See HTTP RFC https://datatracker.ietf.org/doc/html/rfc7232#section-4.1
This is a very interesting side effect of the HTTP spec meeting the real world:
it prevents wget from learning the file size or getting the content.
It turns out that detecting the file size is critical for "wget -N" to operate
as expected. Otherwise wget gets into a bad state where the file on disk is bad
but wget can't detect the issue and can't redownload.
I think wget must always send a HEAD request first.
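Until then, a script can do the size check itself before calling wget -N. The
sketch below is one possible approach, not a tested production script: it
assumes the server returns Content-Length on a HEAD request (as Artifactory
does in the logs below), and the function names content_length_from_headers and
size_mismatch are made up for illustration.

```shell
#!/bin/sh
# Extract the last Content-Length value from a block of response headers.
# Prints nothing if the header is absent (as in the 304 response).
content_length_from_headers() {
    printf '%s\n' "$1" | awk '
        tolower($1) == "content-length:" { v = $2; sub(/\r$/, "", v); len = v }
        END { if (len != "") print len }'
}

# Return 0 if the local file is missing or its size differs from the
# expected size; return 1 if the sizes match.
size_mismatch() {
    file=$1
    expected=$2
    [ -f "$file" ] || return 0
    actual=$(wc -c < "$file" | tr -d ' ')
    [ "$actual" != "$expected" ]
}

# Usage (commented out because it needs wget and network access):
# url=https://mycompany.com/myarchive.tar.lz
# headers=$(wget --spider --server-response "$url" 2>&1)
# expected=$(content_length_from_headers "$headers")
# if [ -n "$expected" ] && size_mismatch myarchive.tar.lz "$expected"; then
#     rm -f myarchive.tar.lz   # drop the partial file so wget -N refetches it
# fi
# wget -N "$url"
```

Deleting the partial file first means the subsequent wget -N has no local copy
to compare against and downloads the file in full.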
wget 1.14 on CentOS 7 works as expected: it sends a HEAD request, detects that
the size has changed, then redownloads.
```
wget -N --server-response https://mycompany.com/myarchive.tar.lz
--2024-05-24 10:42:52-- https://mycompany.com/myarchive.tar.lz
Resolving mycompany.com (mycompany.com)... 10.192.10.20
Connecting to mycompany.com (mycompany.com)|10.192.10.20|:443... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Fri, 24 May 2024 09:42:52 GMT
Content-Type: application/octet-stream
Content-Length: 185751081
Connection: keep-alive
Server: Artifactory
X-Artifactory-Id: 5e06b1f8f8c7e195:5afd0284:18d702c2085:-8000
X-Artifactory-Node-Id: dc09bebb5d42
Last-Modified: Thu, 23 May 2024 10:46:13 GMT
ETag: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
X-Checksum-Sha1: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
X-Checksum-Sha256: 12a219e5c632629f11cfcd954069c1bc5e2273c1684d0877fdea0cf60b2e0d78
X-Checksum-Md5: 8b8a1d9db73eb2fbb635b45317320f19
Accept-Ranges: bytes
X-Artifactory-Filename: myarchive.tar.lz
Content-Disposition: attachment; filename="myarchive.tar.lz"; filename*=UTF-8''myarchive.tar.lz
Length: 185751081 (177M) [application/octet-stream]
The sizes do not match (local 1) -- retrieving.
--2024-05-24 10:42:52-- https://mycompany.com/myarchive.tar.lz
Reusing existing connection to mycompany.com:443.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Fri, 24 May 2024 09:42:52 GMT
Content-Type: application/octet-stream
Content-Length: 185751081
Connection: keep-alive
Server: Artifactory
X-Artifactory-Id: 5e06b1f8f8c7e195:5afd0284:18d702c2085:-8000
X-Artifactory-Node-Id: dc09bebb5d42
Last-Modified: Thu, 23 May 2024 10:46:13 GMT
ETag: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
X-Checksum-Sha1: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
X-Checksum-Sha256: 12a219e5c632629f11cfcd954069c1bc5e2273c1684d0877fdea0cf60b2e0d78
X-Checksum-Md5: 8b8a1d9db73eb2fbb635b45317320f19
Accept-Ranges: bytes
X-Artifactory-Filename: myarchive.tar.lz
Content-Disposition: attachment; filename="myarchive.tar.lz"; filename*=UTF-8''myarchive.tar.lz
Length: 185751081 (177M) [application/octet-stream]
Saving to: ‘myarchive.tar.lz’
100%[==============================================================================>] 185,751,081 277MB/s in 0.6s
2024-05-24 10:42:53 (277 MB/s) - ‘myarchive.tar.lz’ saved [185751081/185751081]
```
wget 1.21 on Ubuntu 22 does not work: wget incorrectly thinks there is nothing
to download.
```
wget -N --server-response https://mycompany.com/myarchive.tar.lz
--2024-05-24 10:42:11-- https://mycompany.com/myarchive.tar.lz
Resolving mycompany.com (mycompany.com)... 10.192.10.20
Connecting to mycompany.com (mycompany.com)|10.192.10.20|:443... connected.
HTTP request sent, awaiting response...
HTTP/1.1 304 Not Modified
Date: Fri, 24 May 2024 09:42:11 GMT
Connection: keep-alive
Server: Artifactory
X-Artifactory-Id: 5e06b1f8f8c7e195:5afd0284:18d702c2085:-8000
X-Artifactory-Node-Id: dc09bebb5d42
Last-Modified: Thu, 23 May 2024 10:46:13 GMT
ETag: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
X-Checksum-Sha1: 25f65d47dde6ae2015c0fb7fe8fb895ec988ceb0
X-Checksum-Sha256: 12a219e5c632629f11cfcd954069c1bc5e2273c1684d0877fdea0cf60b2e0d78
X-Checksum-Md5: 8b8a1d9db73eb2fbb635b45317320f19
Accept-Ranges: bytes
X-Artifactory-Filename: myarchive.tar.lz
Content-Disposition: attachment; filename="myarchive.tar.lz"; filename*=UTF-8''myarchive.tar.lz
File ‘myarchive.tar.lz’ not modified on server. Omitting download.
```
Regards.