Re: My experience with using cp to copy a lot of files (432 millions, 39 TB)
Thu, 21 Aug 2014 09:10:23 +0200
On 08/11/2014 03:55 PM, Rasmus Borup Hansen wrote:
> Trusting that resizing the hash table would eventually finish, the cp
> command was allowed to continue, and after a while it started copying
> again. It stopped again and resized the hash table a couple of times,
> each taking more and more time. Finally, after 10 days of copying and
> hash table resizing, the new file system used as many blocks and inodes
> as the old one according to df, but to my surprise the cp command didn't
> exit. Looking at the source again, I found that cp disassembles its hash
> table data structures nicely after copying (the forget_all call). Since
> the virtual size of the cp process was now more than 17 GB and the
> server only had 10 GB of RAM, it did a lot of swapping.
Thinking about this case again, I find two things very surprising:
a) that cp(1) uses 17 GB of memory when copying 39 TB of data.
That would mean roughly 2300 bytes per hashed file:

  $ bc <<<'39 * 1024 / 17'
  2349
... although the hashed structure only has these members:
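(struct Src_to_dest in src/cp-hash.c, roughly; comments paraphrased)

  struct Src_to_dest
  {
    ino_t st_ino;   /* inode number of the source file */
    dev_t st_dev;   /* device number of the source file */
    char *name;     /* name of the already-created destination file */
  };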
I think either the file names were rather long (on average!),
or there is something wrong in the code.
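For a rough sense of what one hashed entry ought to cost, here is a small
back-of-the-envelope sketch (standalone, not coreutils code; the example
path and the overhead constants are assumptions, not measurements):

  /* entry_cost.c - rough per-entry cost estimate for cp's hash table.
     The struct mirrors Src_to_dest; the malloc and hash-entry overhead
     values are ballpark assumptions, not measured figures.  */
  #include <stdio.h>
  #include <string.h>
  #include <sys/types.h>

  struct Src_to_dest { ino_t st_ino; dev_t st_dev; char *name; };

  int main (void)
  {
    /* Hypothetical destination name of ordinary length.  */
    const char *example_name = "backup/2014-08-01/home/user/some/file";

    size_t struct_bytes = sizeof (struct Src_to_dest);  /* 24 on x86_64 */
    size_t name_bytes = strlen (example_name) + 1;      /* strdup'd copy */
    size_t overhead = 3 * 16;  /* assumed: 2 malloc headers + 1 bucket entry */

    printf ("approx. bytes per hashed file: %zu\n",
            struct_bytes + name_bytes + overhead);
    return 0;
  }

Even with allocator and hash-table overhead included, that comes out on the
order of a hundred bytes per entry, nowhere near 2300; the dominant cost is
the stored destination name, not the struct itself.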
b) that cp(1) grows the hash table that often.
This is because it uses the default Hash_tuning (gnulib's hash.c):
  /* [...] The growth threshold defaults to 0.8, and the growth factor
     defaults to 1.414, meaning that the table will have doubled its size
     every second time 80% of the buckets get used. */

  #define DEFAULT_GROWTH_THRESHOLD 0.8f
  #define DEFAULT_GROWTH_FACTOR 1.414f
It has been like this since hashing was introduced, and
I wonder whether cp(1) couldn't use better values for this.
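Just to illustrate how many resizes that implies, here is a tiny standalone
simulation (not coreutils code). It assumes the default tuning quoted above
and an initial capacity of 61 buckets, which is what I believe
DEST_INFO_INITIAL_CAPACITY is in copy.c, and it uses the entry count as a
rough stand-in for the number of buckets in use:

  /* growth_sim.c - count hash table growths under the default tuning.
     Standalone sketch; initial capacity and tuning values as noted above,
     and the number of entries is used as a proxy for buckets in use.  */
  #include <stdio.h>

  int main (void)
  {
    const double growth_threshold = 0.8;  /* DEFAULT_GROWTH_THRESHOLD */
    const double growth_factor = 1.414;   /* DEFAULT_GROWTH_FACTOR */
    double buckets = 61;                  /* assumed initial capacity */
    const double entries = 432e6;         /* files in the reported copy */
    int resizes = 0;

    /* Grow whenever the table is more than 80% full.  */
    while (entries > growth_threshold * buckets)
      {
        buckets *= growth_factor;
        resizes++;
      }

    printf ("%d resizes, final table of about %.0f buckets\n",
            resizes, buckets);
    return 0;
  }

That comes out at well over forty growths, and each of the later ones has to
rehash hundreds of millions of entries, which matches the "stopped and
resized, each time taking longer" behaviour described above.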
Have a nice day,