From: Pádraig Brady
Subject: Re: My experience with using cp to copy a lot of files (432 millions, 39 TB)
Date: Thu, 18 Sep 2014 11:57:42 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2
On 09/12/2014 03:11 PM, Pádraig Brady wrote:
> On 09/12/2014 04:42 AM, address@hidden wrote:
>> This post has made it to Hacker News[1].
>>
>> We have discussed optimization possibilities a bit, and I made the
>> suggestion to replace the usage of a hash table in cp with sorting a
>> list.
>>
>> For example: walk the source tree and write a list of ino/dev/path to
>> a temporary file, then sort that file according to ino/dev (e.g. using
>> GNU sort, which I seem to remember is already using a memory-efficient
>> algorithm (i.e. works well with files much bigger than RAM)?), then
>> parse the file back and copy the first path of every group with the
>> same ino/dev and link the rest.
>>
>> The assumption is that sorting a list requires much less RAM to be
>> efficient than the hash table. (I can't find my copy of TAOCP right
>> now, I think it describes solutions.)
>>
>> Christian.
>>
>> [1] https://news.ycombinator.com/item?id=8305283
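[Editorial sketch: the sort-based pipeline suggested above could look roughly like this in Python. The `copy_file`/`link_file` callbacks are hypothetical names, and an in-memory sort stands in for the proposed external sort via GNU sort(1); this is not how GNU cp works.]

```python
import os

def copy_tree_sorted(src_root, dst_root, copy_file, link_file):
    """Walk src_root, sort all paths by (st_dev, st_ino), then copy the
    first path of each inode group and hardlink the rest.  After sorting,
    duplicate inodes are adjacent, so only the previous group's key needs
    to be remembered -- no large hash table is kept."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            entries.append((st.st_dev, st.st_ino, path))
    # The proposal writes these tuples to a temporary file and sorts it
    # with GNU sort; an in-memory sort stands in for that here.
    entries.sort()
    prev_key = prev_dst = None
    for dev, ino, path in entries:
        dst = os.path.join(dst_root, os.path.relpath(path, src_root))
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        if (dev, ino) == prev_key:
            link_file(prev_dst, dst)      # later member of a group: hardlink
        else:
            copy_file(path, dst)          # first member of a group: copy
            prev_key, prev_dst = (dev, ino), dst
```

For example, `copy_tree_sorted(src, dst, shutil.copy2, os.link)` copies regular files and recreates hardlink groups in the destination.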
>
> The main problem here is that you need the full set somewhere
> at some stage to identify the hardlinks among the full set.
>
> Using an external sorted list would result in a larger disk usage
> (than having cp swap out its mem), but probably would benefit
> from better mem locality. I.E. while more expensive CPU wise,
> sort(1) would use a divide and conquer approach to achieve better
> memory locality, while cp updates its hash as it copies, which is the
> best approach assuming no swapping, but that results in essentially
> random parts of mem (swap) being updated as it runs.
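[Editorial sketch: for contrast, the hash-as-you-copy scheme described here can be approximated as follows. This is a rough Python illustration of the idea, not cp's actual implementation.]

```python
import os
import shutil

def copy_tree_hashed(src_root, dst_root):
    """Copy a tree while remembering each multi-link (st_dev, st_ino)
    already copied, so later paths to the same inode become hardlinks.
    The `seen` dict plays the role of cp's hash table: it is updated in
    copy order, touching essentially random parts of its memory."""
    seen = {}  # (dev, ino) -> destination path of the first copy
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        out_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(out_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            st = os.lstat(src)
            dst = os.path.join(out_dir, name)
            key = (st.st_dev, st.st_ino)
            if st.st_nlink > 1 and key in seen:
                os.link(seen[key], dst)    # inode seen before: hardlink
            else:
                shutil.copy2(src, dst)     # first sighting: real copy
                if st.st_nlink > 1:
                    seen[key] = dst        # only multi-link files are tracked
```

Note that only entries with st_nlink > 1 are tracked at all, which is the same observation the hash-size reduction below relies on.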
>
> On the 10G RAM system, that resulted in only around 30 inodes / 420 direntries
> being copied per second on average over the 10 day run.
>
> It's worth mentioning that mechanical disks with their orders of magnitude
> slower random access latencies are going away, so we must consider that
> along with the frequency of this use case when making any changes to
> the complexity of the code.
>
> If we were to address this, the two approaches would be to
> 1. reduce the size of the hash
> 2. employ locality-sensitive hashing
>
> Looking at 1, I see there was an attempt to reduce the size
> of the hash by excluding traversed source directories:
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=3ece035
>
> Now Rasmus' analysis suggests that this didn't happen for him,
> resulting in a 350% mem increase for his case, which would have been
> more than enough to move the working set from RAM to swap.
>
> The patch above didn't cater for the fact that dirs
> generally have st_nlink > 1, which is fixed up in the following patch:
>
> Note there is a caveat where this will now not detect "hardlinked dirs"
> as reported (intermittently?) on netapp file systems, and thus not
> exit with an error, and instead just create new directories in that case.
> Perhaps that's what we should be doing anyway?
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/copy.c;h=a9561c6#l2173
>
> diff --git a/src/copy.c b/src/copy.c
> index a9561c6..a2e297f 100644
> --- a/src/copy.c
> +++ b/src/copy.c
> @@ -2152,7 +2152,7 @@ copy_internal (char const *src_name, char const *dst_name,
>          }
>        else if (x->preserve_links
>                 && !x->hard_link
> -               && (1 < src_sb.st_nlink
> +               && ((1 < src_sb.st_nlink && ! S_ISDIR (src_mode))
>                    || (command_line_arg
>                        && x->dereference == DEREF_COMMAND_LINE_ARGUMENTS)
>                    || x->dereference == DEREF_ALWAYS))
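[Editorial aside: the S_ISDIR exclusion matters because on most Unix filesystems nearly every directory has st_nlink > 1 — a directory's link count is typically 2 + the number of subdirectories (its name entry in the parent, its own ".", plus each child's ".."), so keeping directories in the hash inflates it for every directory in the tree. A quick, filesystem-dependent illustration (e.g. btrfs reports 1 for directories):]

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    os.mkdir(os.path.join(d, "sub1"))
    os.mkdir(os.path.join(d, "sub2"))
    # Typically 4 on ext4 and similar: parent entry + "." + two "..";
    # btrfs and some others report 1 for directories regardless.
    print("dir st_nlink:", os.stat(d).st_nlink)

    # For regular files st_nlink is portable: a second hardlink makes it 2.
    f = os.path.join(d, "file")
    open(f, "w").close()
    os.link(f, f + ".ln")
    print("file st_nlink:", os.stat(f).st_nlink)
```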
To get an idea of the impact of removing the dirs from the hash as per the
above commit, I did a search for the corresponding
"cp: will not create hardlink" error:
https://www.google.com/search?q="cp:+will+not+create+hardlink"
There were a few correct hits but also a few cases where
the error should not have been emitted!
The two patches attached fix some of these and change cp to behave as follows:
$ mkdir d1 dest
$ cp -a d1 d1 dest
cp: warning: source directory ‘d1’ specified more than once
$ ln -s d1 d_sl
$ cp -aH d1 d_sl dest
$ ls dest
d1 d_sl
As for the patch above, I'm still holding off, due to wariness about
directory hardlinks (which OS X apparently supports:
http://stackoverflow.com/q/80875/4421) and other cases I still need to
check, like bind-mounted dirs causing cycles.
thanks,
Pádraig.
cp-dir-hardlink-warning.diff
Description: Text Data