bug-wget

From: Tim Ruehsen
Subject: Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
Date: Wed, 04 Sep 2013 09:38:15 +0200
User-agent: KMail/4.10.5 (Linux/3.10-2-amd64; KDE/4.10.5; x86_64; ; )

On Tuesday 03 September 2013 23:17:09 Ángel González wrote:
> On 03/09/13 11:16, Tim Ruehsen wrote:
> > What should it say then?
> > My ideas are limited to something like
> > "There was an unexpected signal SIGBUS. It may be a bug, a misuse of
> > Wget, or your hardware is broken. Please think about it."
> > 
> > This does not give more information than a plain "SIGBUS".
> > Ideas welcome.
> 
> Well, if it shall provide more information...
> 
> Error reading links.html. I was expecting it to have 23K, but it now
> suddenly has only 420 bytes. It seems another program has changed it
> behind my back. It is unacceptable to do my job under these conditions.
> *wget exited*

Very well, if that were possible. Right now I have no idea how to print 
something like the above. I ran Tomas Hozza's test under valgrind, with wget 
built with debug info. I got a SIGBUS 18 times out of 20, but at completely 
different places in the code. In this misuse scenario, SIGBUS can occur at 
any place where memory allocated by wget_read_file() is accessed (read or 
write). It is completely random and unpredictable when an outside process 
changes the file size and/or content at the same time.

And SIGBUS could also occur for entirely different reasons (e.g. real bugs in Wget).

As was already said, replacing mmap by read would avoid the crash 
(wget_read_file() would read as many bytes as there are, without first 
checking the length of the file). But without additional logic, it might still 
read inconsistent data (several processes writing to the file at the same 
time, not necessarily the same data). Wget would try to parse / convert (-k) 
it, the result would be broken, but no error would be printed. So replacing 
mmap is not a solution by itself, but maybe part of one.

Now to the possible solutions that come to mind:
1. While downloading / writing data, Wget could compute a checksum of the file.
That allows verifying it later when re-reading the file. In that case we could 
really tell the user: hey, someone trashed our file while we were working...
To get this working, we must remove the mmap code.

2. Using tempfiles / tempdirs only and moving them into place afterwards. That 
would bring in some kind of atomicity, though there are still conflicts to 
solve (e.g. if a second Wget instance is faster, should we overwrite existing 
files / directories?).

3. Keeping html/css files in memory after downloading. These are the ones we 
later re-read to parse for links/URLs. After parsing, write them to disk 
using a tempfile and a move/rename to get atomicity.

4. Using (advisory) file locks would only help against other Wget instances (is 
that enough?). And with -k you would have to keep the descriptor open for each 
file until Wget is done downloading everything. That is not practical, since 
there could be (tens or hundreds of) thousands of files to download.

If someone would like to work on a patch, here is my opinion: I would implement 
1., as it is the least complex to code (but it needs more CPU). Point 4 would 
not work in all cases.

Regards, Tim


