My experience with using cp to copy a lot of files (432 million, 39 TB)


From: Rasmus Borup Hansen
Subject: My experience with using cp to copy a lot of files (432 million, 39 TB)
Date: Mon, 11 Aug 2014 15:55:20 +0200

Hi! I recently had to copy a lot of files, and even though I've 20 years of
experience with various Unix variants I was still surprised by the behaviour of
cp, so I think my observations should be shared with the community.

The setup: an old Dell server (2 cores, 2 GB RAM initially, 10 GB later, running
Ubuntu Trusty) with a new Dell storage enclosure (MD 1200) containing twelve
4 TB disks configured as RAID 6 for a total capacity of 40 TB, allowing two
drives to fail simultaneously. The server is used for our off-site backup, and
the only thing it does is write stuff to the disks. We use rsnapshot for that,
so most of the files have a high link count (30+).

One morning I was notified that a disk had failed. No big deal, this happens
now and then. I called Dell, and the next day I had a replacement disk. While
rebuilding, the replacement disk failed, and in the meantime another disk had
also failed. Now Dell's support wisely suggested that I not just replace the
failed disks, as the array might have been punctured. Apparently, and as I
understand it, disks are only reported as failed once they have accumulated
sufficiently many bad blocks, so if you're unlucky you can lose data when three
corresponding blocks on different disks go bad within a short time, before the
RAID controller has had a chance to detect the failures, recalculate the data
from the parity, and store it somewhere else. So even though only two drives
flashed red, data might have been lost.

Having almost used up the capacity, we decided to order another storage
enclosure, copy the files from the old one to the new one, and then get the old
one into a trustworthy state and use it to extend the total capacity. Normally
I'd have copied/moved the files at block level (e.g. using dd or pvmove), but
suspecting bad blocks, I went for a file-level copy because then I'd know which
files contained the bad blocks. I browsed the net for other people's experience
with copying many files and quickly decided that cp would do the job nicely.
Knowing that preserving the hard links would require bookkeeping of which files
had already been copied, I also ordered 8 GB more RAM for the server and
configured more swap space.
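
To give an idea of why that bookkeeping needs memory at all, here is a minimal
sketch (not coreutils code; the helper names are made up) of what preserving
hard links at file level boils down to: remember every multiply-linked inode by
its (device, inode) pair together with the path it was first copied to, and
turn later occurrences into link() calls instead of copying the data again.

    #include <stdlib.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Remember (device, inode) -> first destination path for every
       multiply-linked file already copied.  Real cp uses a hash table;
       a growing array with a linear scan keeps the sketch short.
       (Error handling is omitted.) */
    struct seen { dev_t dev; ino_t ino; char *dst; };
    static struct seen *table;
    static size_t table_len;

    static const char *seen_lookup (dev_t dev, ino_t ino)
    {
      for (size_t i = 0; i < table_len; i++)
        if (table[i].dev == dev && table[i].ino == ino)
          return table[i].dst;
      return NULL;
    }

    static void seen_insert (dev_t dev, ino_t ino, const char *dst)
    {
      table = realloc (table, (table_len + 1) * sizeof *table);
      table[table_len].dev = dev;
      table[table_len].ino = ino;
      table[table_len].dst = strdup (dst);
      table_len++;
    }

    /* Copy src to dst, turning later occurrences of an already copied
       multiply-linked inode into hard links instead of data copies. */
    static int copy_preserving_links (const char *src, const char *dst)
    {
      struct stat st;
      if (lstat (src, &st) != 0)
        return -1;
      if (S_ISREG (st.st_mode) && st.st_nlink > 1)
        {
          const char *earlier = seen_lookup (st.st_dev, st.st_ino);
          if (earlier)
            return link (earlier, dst);
          seen_insert (st.st_dev, st.st_ino, dst);
        }
      /* ...copy the file data itself here (omitted)... */
      return 0;
    }

Whatever the exact representation, a table like this grows with the number of
multiply-linked files, which is why the extra RAM and swap seemed prudent.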

When the new hardware had arrived I started the copying, and at first it
proceeded nicely at around 300-400 MB/s as measured with iotop. After a while
the speed decreased considerably, because most of the time was now spent
creating hard links, and it takes time to ensure that the filesystem is always
in a consistent state. We use XFS, and we were probably suffering from not
having disabled write barriers, which can safely be done when the RAID
controller has a write cache with a trustworthy battery backup. As expected,
the memory usage of the cp command increased steadily and was soon in the
gigabytes.
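
The slowdown makes sense when you consider what recreating a hard link actually
involves (a trivial standalone example, not cp code): a single metadata-only
system call that transfers no file data, so the copy gradually turns from a
streaming workload into one bounded by how fast the filesystem can commit small
metadata transactions and, with barriers enabled, flush the drive caches.

    #include <stdio.h>
    #include <unistd.h>

    int main (int argc, char **argv)
    {
      if (argc != 3)
        {
          fprintf (stderr, "usage: %s existing-file new-link\n", argv[0]);
          return 1;
        }
      /* One metadata-only system call; no file data is read or written.
         The real cost is the filesystem committing the change durably. */
      if (link (argv[1], argv[2]) != 0)
        {
          perror ("link");
          return 1;
        }
      return 0;
    }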

After some days of copying, the first real surprise came: I noticed that the
copying had stopped, and that cp was making no system calls at all according to
strace. Reading the source code revealed that cp keeps track of which files
have been copied in a hash table that now and then has to be resized to avoid
too many collisions. Once the RAM has been used up, resizing becomes a slow
operation.
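
The resize itself is conceptually simple but has to touch every entry; a sketch
of the general technique (not the actual coreutils code) looks something like
this:

    #include <stddef.h>

    struct entry { struct entry *next; unsigned long key; };
    struct bucket { struct entry *head; };

    /* Move every entry from the old bucket array into a larger one.  It is
       one long pass over all entries with no file copying in between, and
       once the table no longer fits in RAM, following each pointer can mean
       reading a page back in from swap. */
    static void rehash (struct bucket *old_tab, size_t old_n,
                        struct bucket *new_tab, size_t new_n)
    {
      for (size_t b = 0; b < old_n; b++)
        for (struct entry *e = old_tab[b].head; e; )
          {
            struct entry *next = e->next;
            size_t nb = e->key % new_n;
            e->next = new_tab[nb].head;
            new_tab[nb].head = e;
            e = next;
          }
    }

With hundreds of millions of entries that pass alone takes a long time, and it
also explains why strace showed nothing: pointer chasing in user space and
major page faults don't involve any system calls.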

Trusting that the resizing would eventually finish, I let the cp command
continue, and after a while it started copying again. It stopped to resize the
hash table a couple more times, each resize taking longer than the last.
Finally, after 10 days of copying and hash table resizing, the new filesystem
used as many blocks and inodes as the old one according to df, but to my
surprise the cp command didn't exit. Looking at the source again, I found that
cp disassembles its hash table data structures nicely after copying (the
forget_all call). Since the virtual size of the cp process was by now more than
17 GB and the server only had 10 GB of RAM, this meant a lot of swapping.
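
The teardown is essentially the mirror image of the resize (again just a sketch
of the general shape, not the real code): freeing the entries one by one has to
follow every pointer, so every page that was pushed out to swap is read back in
just to be thrown away, whereas exiting without freeing would let the kernel
drop the whole address space without reading anything back.

    #include <stdlib.h>

    struct entry { struct entry *next; };

    /* A tidy teardown touches every node once more, which with a mostly
       swapped-out table means roughly one random swap read per node. */
    static void free_all (struct entry *head)
    {
      while (head)
        {
          struct entry *next = head->next;
          free (head);
          head = next;
        }
    }

    /* The brutal alternative is to just exit: the kernel reclaims the whole
       address space without reading anything back from swap. */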

I had started cp with the "-v" option and piped its output (both stdout and
stderr) to a tee command to capture it in a (big!) logfile. Somewhere along
that path the output from cp was being buffered, because my logfile ended in
the middle of a line. Wanting the buffers to be flushed so that I would have a
complete logfile, I gave cp more than a day to finish disassembling its hash
table before giving up and killing the process.
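
My guess (I haven't confirmed exactly where the data was buffered) is ordinary
stdio behaviour: when stdout is a pipe rather than a terminal it is fully
buffered, so flushes happen whenever the buffer fills, which can be in the
middle of a line, and whatever has accumulated since the last flush is lost if
the process is killed. A tiny demonstration:

    #include <stdio.h>
    #include <unistd.h>

    int main (void)
    {
      printf ("line 1: ends with a newline\n");
      printf ("line 2: no newline yet");
      _exit (0);   /* like a killed process: stdio buffers are never flushed */
    }

Run on a terminal, the first line appears immediately (line buffering flushes
at the newline) and the second is lost; run through a pipe, nothing appears at
all.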

As I write this, I'm running an "ls -laR" on both filesystems to make sure that
everything has been copied. But unless the last missing part of the output from
cp contained more error messages, it appears that only a single file had I/O
errors (luckily we had another copy of it).

I know this is not going to happen right away, but it would be nice if cp
somehow used a data structure where the bookkeeping could be done while waiting
for I/O instead of piling it up. And unless old systems without working memory
management must be supported, I don't see any harm in simply removing the call
to the forget_all function towards the end of cp.c.
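
One well-known way of spreading the bookkeeping out (this is a sketch of
incremental rehashing in general, the technique used by e.g. Redis for its
dictionaries, not a proposed coreutils patch) is to keep both the old and the
new bucket arrays alive during a resize and migrate a few buckets on every
insert or lookup, so that the O(n) resize cost is paid in small slices that can
hide behind the copy I/O:

    #include <stddef.h>

    struct entry { struct entry *next; unsigned long key; };

    struct dict
    {
      struct entry **old_tab, **new_tab;
      size_t old_n, new_n;
      size_t migrate_pos;   /* next old bucket to migrate; old_n when done */
    };

    /* Move at most `steps` buckets from the old table into the new one.
       Calling this from every insert and lookup (and searching both tables
       until migrate_pos reaches old_n) spreads the resize cost over many
       operations instead of paying it in one long stall. */
    static void migrate_some (struct dict *d, size_t steps)
    {
      while (steps-- > 0 && d->migrate_pos < d->old_n)
        {
          struct entry *e = d->old_tab[d->migrate_pos];
          while (e)
            {
              struct entry *next = e->next;
              size_t b = e->key % d->new_n;
              e->next = d->new_tab[b];
              d->new_tab[b] = e;
              e = next;
            }
          d->old_tab[d->migrate_pos++] = NULL;
        }
    }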

To summarise the lessons I learned:

If you trust that your hardware and your filesystem are OK, use block-level
copying when you're copying an entire filesystem. It'll be faster, unless the
filesystem has lots of free space, and in any case it will require less memory.

If you copy many files at file level and want to preserve hard links, make sure
you have enough memory.

Disassembling data structures nicely can take much more time than just tearing 
them down brutally when the process exits.

The number of hard drives flashing red is not the same as the number of hard
drives with bad blocks. With RAID 6 you don't need three drives flashing red to
lose data if you're unlucky; fewer can do. The same is true for RAID 5, where
you can lose data with only one or even no drives flashing red, if you're
really unlucky.


I hope this can help or at least be interesting for someone.

Best,

Rasmus Borup Hansen

Intomics is a contract research organization specialized in deriving core 
biological insight from large scale data. We help our clients in the 
pharmaceutical industry develop tomorrow's medicines better, faster, and 
cheaper through optimized use of biomedical data.
-----------------------------------------------------------------
Hansen, Rasmus Borup              Intomics - from data to biology
System Administrator              Diplomvej 377
Scientific Programmer             DK-2800 Kgs. Lyngby
                                  Denmark
E: address@hidden               W: http://www.intomics.com/
P: +45 5167 7972                  P: +45 8880 7979


