Re: My experience with using cp to copy a lot of files (432 millions, 39 TB)
From: Pádraig Brady
Subject: Re: My experience with using cp to copy a lot of files (432 millions, 39 TB)
Date: Thu, 21 Aug 2014 10:31:13 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2
On 08/21/2014 08:10 AM, Bernhard Voelker wrote:
> On 08/11/2014 03:55 PM, Rasmus Borup Hansen wrote:
>> Trusting that resizing the hash table would eventually finish, the cp
>> command was allowed to continue, and after a while it started copying
>> again. It stopped again and resized the hash table a couple of times,
>> each taking more and more time. Finally, after 10 days of copying and
>> hash table resizing, the new file system used as many blocks and inodes
>> as the old one according to df, but to my surprise the cp command didn't
>> exit. Looking at the source again, I found that cp disassembles its hash
>> table data structures nicely after copying (the forget_all call). Since
>> the virtual size of the cp process was now more than 17 GB and the
>> server only had 10 GB of RAM, it did a lot of swapping.
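
The teardown described above is slow because freeing a chained hash table
means walking every bucket and every overflow entry, touching each of the
table's pages one last time even though the copy itself is finished. A
minimal sketch of that general shape (not cp's actual forget_all code):

    #include <stdlib.h>

    /* Illustrative entry type: one heap allocation per copied file,
       plus a separately allocated destination name.  */
    struct entry { struct entry *next; char *name; };

    /* Free every entry in every bucket.  With hundreds of millions of
       entries spread over 17 GB, this revisits pages that may long
       since have been swapped out.  */
    static void
    free_table (struct entry **buckets, size_t n_buckets)
    {
      for (size_t i = 0; i < n_buckets; i++)
        for (struct entry *e = buckets[i]; e != NULL; )
          {
            struct entry *next = e->next;
            free (e->name);
            free (e);
            e = next;
          }
      free (buckets);
    }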
>
> Thinking about this case again, I find this very surprising:
>
> a) that cp(1) uses 17 GB of memory when copying 39 TB of data.
> That means roughly 2300 bytes per file:
>
> $ bc <<<'39 * 1024 / 17'
> 2349
>
> ... although the hashed structure only has these members:
>
>     struct Src_to_dest
>     {
>       ino_t st_ino;
>       dev_t st_dev;
>       char *name;
>     };
>
> I think either the file names were rather long (on average!),
> or there is something wrong in the code.
>
> b) that cp(1) is increasing the hash table that often.
> This is because it uses the default Hash_tuning (hash.c):
>
> /* [...] The growth threshold defaults to 0.8, and the growth factor
> defaults to 1.414, meaning that the table will have doubled its size
> every second time 80% of the buckets get used. */
> #define DEFAULT_GROWTH_THRESHOLD 0.8f
> #define DEFAULT_GROWTH_FACTOR 1.414f
>
> It has been like this since the introduction of hashing, and
> I wonder if cp(1) couldn't use better values for this.
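
A rough sketch of what the default tuning quoted above implies for a table
that ends up holding about 432 million entries. The starting size of 100
buckets is an assumed placeholder, not taken from cp's source, and the real
threshold applies to used buckets rather than entries, so this is only an
approximation:

    #include <stdio.h>

    int
    main (void)
    {
      double n_entries = 432e6;   /* files copied, from the subject line */
      double buckets = 100.0;     /* assumed initial size, not cp's actual value */
      int resizes = 0;

      /* Grow by DEFAULT_GROWTH_FACTOR whenever the load would exceed
         DEFAULT_GROWTH_THRESHOLD, and count how many rehashes that is.  */
      while (buckets * 0.8 < n_entries)
        {
          buckets *= 1.414;
          resizes++;
        }
      printf ("%d resizes, each one rehashing every entry inserted so far\n",
              resizes);
      return 0;
    }

With these numbers it comes out to several dozen growths, each of which
rehashes the entire table, which matches the repeated long pauses described
at the top of the thread.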
>
> Have a nice day,
> Berny
>
The number of files, rather than the amount of data, is pertinent here.
So 17 GB / 432 M is about 40 bytes per entry, which is about right.
cheers,
Pádraig.
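
That estimate is easy to check on a 64-bit Linux system, where ino_t, dev_t
and a pointer are 8 bytes each, so the struct quoted above is 24 bytes before
the name string and the hash table's own bookkeeping. The 16-byte allowance
for those extras in the sketch below is an illustrative guess, not a measured
figure:

    #include <stdio.h>
    #include <sys/types.h>

    /* The hashed structure quoted above.  */
    struct Src_to_dest
    {
      ino_t st_ino;
      dev_t st_dev;
      char *name;
    };

    int
    main (void)
    {
      /* ~24 bytes of struct plus an assumed ~16 bytes for the copied
         name and per-entry hash table overhead.  */
      size_t per_entry = sizeof (struct Src_to_dest) + 16;
      double total = 432e6 * per_entry;

      printf ("sizeof (struct Src_to_dest) = %zu bytes\n",
              sizeof (struct Src_to_dest));
      printf ("approx. total for 432 million entries: %.1f GB\n",
              total / 1e9);
      return 0;
    }

Which lands right around the 17 GB virtual size reported in the original
message.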