From: Tomas Hozza
Subject: Re: [Bug-wget] Race condition on downloaded files among multiple wget instances
Date: Mon, 9 Sep 2013 04:49:52 -0400 (EDT)

----- Original Message -----
> Very well, if that were possible. Right now I have no idea how to
> print something like the above. I ran Tomas Hozza's test under
> valgrind, with a wget build carrying debug info. I got SIGBUS in 18
> out of 20 runs, but at completely different places in the code. In
> this misuse scenario, SIGBUS can occur at any place that accesses
> memory (read or write) mapped by wget_read_file(). It is absolutely
> random / unpredictable when an outside process changes the file size
> and/or content at the same time.
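> 
> For illustration, here is a minimal, self-contained sketch (not Wget
> code; file name and setup are made up, error checking omitted) of the
> failure mode: map a file, let a second process truncate it, and the
> next access to a page beyond the new end of file raises SIGBUS:
> 
>     /* sigbus.c - assumes victim.html exists and is non-empty */
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <sys/mman.h>
>     #include <sys/stat.h>
>     #include <unistd.h>
> 
>     int main(void)
>     {
>         int fd = open("victim.html", O_RDWR);
>         struct stat st;
>         fstat(fd, &st);
> 
>         char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
> 
>         /* simulate the "other" wget instance shrinking the file */
>         ftruncate(fd, 0);
> 
>         /* the pages behind the mapping are gone now; any access
>          * beyond the new end of file delivers SIGBUS, at whatever
>          * spot in the code happens to touch the mapping first */
>         printf("%c\n", p[st.st_size - 1]);
> 
>         munmap(p, st.st_size);
>         close(fd);
>         return 0;
>     }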
> 
> And SIGBUS could also occur for entirely different reasons (e.g. real
> bugs in Wget).
> 
> As was already said, replacing mmap with read would not crash
> (wget_read_file() reads as many bytes as are available, without
> checking the length of the file first). But without additional logic
> it might read inconsistent data (many processes writing into the file
> at the same time, not necessarily the same data). Wget would try to
> parse / convert (-k) it; the result would be broken, but no error
> would be printed. So replacing mmap is not a solution on its own, but
> maybe part of a solution.
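> 
> For clarity, this is roughly what the read-based variant would do (a
> sketch of the idea only; slurp_file is a made-up name, error handling
> is shortened, and this is not the real wget_read_file()):
> 
>     #include <fcntl.h>
>     #include <stdlib.h>
>     #include <unistd.h>
> 
>     char *slurp_file(const char *name, size_t *length)
>     {
>         int fd = open(name, O_RDONLY);
>         if (fd < 0)
>             return NULL;
> 
>         size_t size = 0, allocated = 65536;
>         char *data = malloc(allocated);
>         ssize_t n;
> 
>         /* read until EOF instead of trusting a stat()ed size; the
>          * file may still change underneath us, but the worst case is
>          * inconsistent content, never a SIGBUS */
>         while ((n = read(fd, data + size, allocated - size)) > 0) {
>             size += (size_t) n;
>             if (size == allocated)
>                 data = realloc(data, allocated *= 2);
>         }
> 
>         close(fd);
>         *length = size;
>         return data;   /* caller frees */
>     }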
> 
> Now to the possible solutions that come to my mind:
> 1. While downloading / writing the data, Wget could compute a checksum
> of the file. That allows verifying the file later when re-reading it.
> In this case we could really tell the user: hey, someone trashed our
> file while we were working... To get this working, we must remove the
> mmap code. A sketch follows after this list.
> 
> 2. Using tempfiles / tempdirs only and moving them to the right place.
> That would bring in some kind of atomicity, though there are still
> conflicts to solve (e.g. if a second Wget instance is faster, should
> we overwrite existing files / directories?). See the rename sketch
> after this list.
> 
> 3. Keeping HTML/CSS files in memory after downloading; these are the
> ones we later re-read to parse for links/URLs. After parsing, write
> them to disk using a tempfile and a move/rename to get atomicity.
> 
> 4. Using (advisory) file locks only helps against other Wget instances
> (is that enough?). And with -k you would have to keep the descriptor
> open for each file until Wget is done downloading everything. This is
> not practical, since there could be (tens or hundreds of) thousands of
> files to be downloaded.
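> 
> The checksum idea from 1. could look like this (a sketch only; FNV-1a
> just stands in for whatever digest we would really pick, and the names
> are made up):
> 
>     #include <stdint.h>
>     #include <stddef.h>
> 
>     #define FNV_INIT 0xcbf29ce484222325ULL
> 
>     static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
>     {
>         const unsigned char *p = data;
>         while (len--) {
>             h ^= *p++;
>             h *= 0x100000001b3ULL;   /* 64-bit FNV prime */
>         }
>         return h;
>     }
> 
>     /* while downloading, after each chunk is written:
>      *     hash = fnv1a(hash, chunk, chunk_len);
>      * when re-reading the file for -k / link extraction: */
>     int file_unchanged(const char *data, size_t len, uint64_t expected)
>     {
>         /* returns 0 if someone trashed the file while we were
>          * working, so we can report it instead of parsing garbage */
>         return fnv1a(FNV_INIT, data, len) == expected;
>     }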
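> 
> And the tempfile + rename trick from 2./3. is essentially this (again
> just a sketch with made-up names; mode and error handling shortened):
> 
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <unistd.h>
> 
>     int write_atomically(const char *final_name,
>                          const char *data, size_t len)
>     {
>         char tmp[4096];
>         snprintf(tmp, sizeof tmp, "%s.XXXXXX", final_name);
> 
>         int fd = mkstemp(tmp);      /* tempfile on the same fs */
>         if (fd < 0)
>             return -1;
> 
>         if (write(fd, data, len) != (ssize_t) len) {
>             close(fd);
>             unlink(tmp);
>             return -1;
>         }
>         close(fd);
> 
>         /* rename() is atomic on POSIX: other processes see either
>          * the old file or the complete new one, never a mix */
>         return rename(tmp, final_name);
>     }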
> 
> If someone would like to work on a patch, here is my opinion: I would
> implement 1., as it is the least complex to code (but it needs more
> CPU). Point 4 would not work in all cases.
> 
> Regards, Tim

Thanks for the brainstorming. Solution #1 seems the most reasonable
to me. I was also thinking about 2. and 4., but they have the possible
issues you've already mentioned.

I had a look at the source, but unfortunately the changes needed to
create and verify checksums of downloaded files are not trivial.

Regards,

Tomas


