From: Pádraig Brady
Subject: Re: My experience with using cp to copy a lot of files (432 millions, 39 TB)
Date: Thu, 18 Sep 2014 11:57:42 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2
On 09/12/2014 03:11 PM, Pádraig Brady wrote:
> On 09/12/2014 04:42 AM, address@hidden wrote:
>> This post has made it to Hacker News[1].
>>
>> We have discussed optimization possibilities a bit, and I made the
>> suggestion to replace the usage of a hash table in cp with sorting a
>> list.
>>
>> For example: walk the source tree and write a list of ino/dev/path to
>> a temporary file, then sort that file according to ino/dev (e.g. using
>> GNU sort, which I seem to remember is already using a memory-efficient
>> algorithm (i.e. works well with files much bigger than RAM)?), then
>> parse the file back and copy the first path of every group with the
>> same ino/dev and link the rest.
>>
>> The assumption is that sorting a list requires much less RAM to be
>> efficient than the hash table. (I can't find my copy of TAOCP right
>> now, I think it describes solutions.)
>>
>> Christian.
>>
>> [1] https://news.ycombinator.com/item?id=8305283
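[Editorial sketch: the sort-based pipeline suggested above could look roughly like this in Python. The `copy_file`/`link_file` callbacks are hypothetical names, and an in-memory sort stands in for the proposed external sort via GNU sort(1); this is not how GNU cp works.]

```python
import os

def copy_tree_sorted(src_root, dst_root, copy_file, link_file):
    """Walk src_root, sort all paths by (st_dev, st_ino), then copy the
    first path of each inode group and hardlink the rest.  After sorting,
    duplicate inodes are adjacent, so only the previous group's key needs
    to be remembered -- no large hash table is kept."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            entries.append((st.st_dev, st.st_ino, path))
    # The proposal writes these tuples to a temporary file and sorts it
    # with GNU sort; an in-memory sort stands in for that here.
    entries.sort()
    prev_key = prev_dst = None
    for dev, ino, path in entries:
        dst = os.path.join(dst_root, os.path.relpath(path, src_root))
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        if (dev, ino) == prev_key:
            link_file(prev_dst, dst)      # later member of a group: hardlink
        else:
            copy_file(path, dst)          # first member of a group: copy
            prev_key, prev_dst = (dev, ino), dst
```

For example, `copy_tree_sorted(src, dst, shutil.copy2, os.link)` copies regular files and recreates hardlink groups in the destination.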
>
> The main problem here is that you need the full set somewhere
> at some stage to identify the hardlinks among the full set.
>
> Using an external sorted list would result in a larger disk usage
> (than having cp swap out its mem), but probably would benefit
> from better mem locality. I.E. while more expensive CPU wise,
> sort(1) would use a divide and conquer approach to achieve better
> memory locality, while cp updates its hash as it copies, which is the
> best approach assuming no swapping, but that results in essentially
> random parts of mem (swap) being updated as it runs.
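[Editorial sketch: for contrast, the hash-as-you-copy scheme described here can be approximated as follows. This is a rough Python illustration of the idea, not cp's actual implementation.]

```python
import os
import shutil

def copy_tree_hashed(src_root, dst_root):
    """Copy a tree while remembering each multi-link (st_dev, st_ino)
    already copied, so later paths to the same inode become hardlinks.
    The `seen` dict plays the role of cp's hash table: it is updated in
    copy order, touching essentially random parts of its memory."""
    seen = {}  # (dev, ino) -> destination path of the first copy
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        out_dir = os.path.normpath(os.path.join(dst_root, rel))
        os.makedirs(out_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            st = os.lstat(src)
            dst = os.path.join(out_dir, name)
            key = (st.st_dev, st.st_ino)
            if st.st_nlink > 1 and key in seen:
                os.link(seen[key], dst)    # inode seen before: hardlink
            else:
                shutil.copy2(src, dst)     # first sighting: real copy
                if st.st_nlink > 1:
                    seen[key] = dst        # only multi-link files are tracked
```

Note that only entries with st_nlink > 1 are tracked at all, which is the same observation the hash-size reduction below relies on.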
>
> On the 10G RAM system, that resulted in only around 30 inodes / 420 direntries
> being copied per second on average over the 10 day run.
>
> It's worth mentioning that mechanical disks with their orders of magnitude
> slower random access latencies are going away, so we must consider that
> along with the frequency of this use case when making any changes to
> the complexity of the code.
>
> If we were to address this, the two approaches would be to
> 1. reduce the size of the hash
> 2. employ locality-sensitive hashing
>
> Looking at 1, I see there was an attempt to reduce the size
> of the hash by excluding traversed source directories:
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=commitdiff;h=3ece035
>
> Now Rasmus' analysis suggests that this didn't happen for him,
> resulting in a 350% mem increase for his case, which would have been
> more than enough to move the working set from RAM to swap.
>
> The patch above didn't cater for the fact that dirs
> generally have st_nlink > 1, which is fixed up in the following patch:
>
> Note there is a caveat where this will now not detect "hardlinked dirs"
> as reported (intermittently?) on netapp file systems, and thus not
> exit with an error, and instead just create new directories in that case.
> Perhaps that's what we should be doing anyway?
> http://git.sv.gnu.org/gitweb/?p=coreutils.git;a=blob;f=src/copy.c;h=a9561c6#l2173
>
> diff --git a/src/copy.c b/src/copy.c
> index a9561c6..a2e297f 100644
> --- a/src/copy.c
> +++ b/src/copy.c
> @@ -2152,7 +2152,7 @@ copy_internal (char const *src_name, char const *dst_name,
>          }
>        else if (x->preserve_links
>                 && !x->hard_link
> -               && (1 < src_sb.st_nlink
> +               && ((1 < src_sb.st_nlink && ! S_ISDIR (src_mode))
>                    || (command_line_arg
>                        && x->dereference == DEREF_COMMAND_LINE_ARGUMENTS)
>                    || x->dereference == DEREF_ALWAYS))
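[Editorial aside: the S_ISDIR exclusion matters because on most Unix filesystems nearly every directory has st_nlink > 1 — a directory's link count is typically 2 + the number of subdirectories (its name entry in the parent, its own ".", plus each child's ".."), so keeping directories in the hash inflates it for every directory in the tree. A quick, filesystem-dependent illustration (e.g. btrfs reports 1 for directories):]

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    os.mkdir(os.path.join(d, "sub1"))
    os.mkdir(os.path.join(d, "sub2"))
    # Typically 4 on ext4 and similar: parent entry + "." + two "..";
    # btrfs and some others report 1 for directories regardless.
    print("dir st_nlink:", os.stat(d).st_nlink)

    # For regular files st_nlink is portable: a second hardlink makes it 2.
    f = os.path.join(d, "file")
    open(f, "w").close()
    os.link(f, f + ".ln")
    print("file st_nlink:", os.stat(f).st_nlink)
```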
To get an idea of the impact of removing the dirs from the hash as per the
above commit, I did a search for the corresponding
"cp: will not create hardlink" error:
https://www.google.com/search?q="cp:+will+not+create+hardlink"
There were a few correct hits but also a few cases where
the error should not have been emitted!
The two patches attached fix some of these and change cp to behave as follows:
$ mkdir d1 dest
$ cp -a d1 d1 dest
cp: warning: source directory ‘d1’ specified more than once
$ ln -s d1 d_sl
$ cp -aH d1 d_sl dest
$ ls dest
d1 d_sl
As for the patch above, I'm still holding off, due to wariness about
directory hardlinks (which OS X apparently supports:
http://stackoverflow.com/q/80875/4421) and other cases I still need to
check, like bind-mounted dirs causing cycles.
thanks,
Pádraig.
cp-dir-hardlink-warning.diff
Description: Text Data