From: Sylvain Beucler
Subject: [Savannah-hackers-public] Re: [gnu.org #498996] Hard-disk failures on colonialone
Date: Thu, 12 Nov 2009 12:33:17 +0100
User-agent: Mutt/1.5.20 (2009-06-14)

On Sat, Oct 31, 2009 at 11:13:51AM +0100, Sylvain Beucler wrote:
> > On Thu, Oct 29, 2009 at 01:20:55PM -0400, Daniel Clark via RT wrote:
> > > Ah I see, I was waiting for comments on this - should be able to go out
> > > this weekend to do replacements / reshuffles / etc, but I need to know
> > > if savannah-hackers has a strong opinion on how to proceed:
> > > 
> > > (1) Do we keep the 1TB disks?
> > > > - Now that the cause of the failure is known to be a software failure,
> > > > do we forget about this, or still pursue the plan to remove 1.0TB
> > > > disks that are used nowhere else at the FSF?
> > > 
> > > That was mostly a "this makes no sense, but that's the only thing
> > > that's different about that system" type of response; it is true they
> > > are not used elsewhere, but if they are actually working fine I am fine
> > > with doing whatever savannah-hackers wants to do.
> > > 
> > > (2) Do we keep the 2 eSATA drives connected?
> > > > - If not, do you recommend moving everything (but '/') onto the 1.5TB
> > > >   disks?
> > > 
> > > Again if they are working fine it's your call; however the bigger issue
> > > is whether you want to keep the 2 eSATA / external drives connected,
> > > since that is a legitimate extra point of failure, and there are some
> > > cases where errors in the external enclosure can bring a system down
> > > (although it's been up and running fine for several months now).
> > > 
> > > (3) Do we make the switch to UUIDs now?
> > > > - About UUIDs, everything in fstab is using mdX, which I'd rather not
> > > >   mess with.
> > > 
> > > IMHO it would be better to mess with this when the system is less
> > > critical; not using UUIDs everywhere tends to screw you during recovery
> > > from hardware failures.
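
For reference, the kind of change this would involve once we do touch
fstab -- just a sketch, the UUID, filesystem type and mount point below
are made up:

  # print the filesystem UUID of an array
  blkid /dev/md2
  #   /dev/md2: UUID="00000000-1111-2222-3333-444444444444" TYPE="ext3"

  # then in /etc/fstab, replace the device path with that UUID, e.g.
  #   /dev/md2   /example   ext3   defaults   0 2
  # becomes
  #   UUID=00000000-1111-2222-3333-444444444444   /example   ext3   defaults   0 2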
> > > 
> > > And BTW totally off-topic, but eth1 on colonialone is now connected via
> > > a crossover ethernet cable to eth1 on savannah (and colonialone is no
> > > longer on the fsf 10. management network, which I believe we confirmed
> > > no one cared about).
> > > 
> > > (4) We need to change to some technique that will give us RAID1
> > > redundancy even if one drive dies. I think the safest solution would be
> > > to not use eSATA, and use 4 1.5TB drives all inside the computer in a
> > > 1.5TB quad RAID1 array, so all 4 drives would need to fail to bring
> > > savannah down. The other option would be 2 triple RAID1s using eSATA,
> > > each with 2 disks inside the computer and the 3rd disk in the external
> > > enclosure.
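
For what it's worth, creating such a 4-way mirror from scratch would look
roughly like this -- untested sketch, device names are only illustrative,
and in practice we would rather grow the existing arrays (see below) than
recreate them:

  # 4-way RAID1: data survives as long as at least one member is alive
  mdadm --create /dev/md0 --level=1 --raid-devices=4 \
        /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1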
> 
> On Thu, Oct 29, 2009 at 07:29:50PM +0100, Sylvain Beucler wrote:
> > Hi,
> > 
> > As far as the hardware is concerned, I think it is best that we do
> > what the FSF sysadmins think is best.
> > 
> > We don't have access to the computer, don't really know anything about
> > what it's made of, and don't understand the eSATA/internal differences.
> > We're even using Xen, as you do, to ease this kind of interaction. In
> > short, you're more often than not in a better position to judge the
> > hardware issues.
> > 
> > 
> > So:
> > 
> > If you think it's safer to use 4x1.5TB RAID-1, then let's do that.
> > 
> > However, we need to discuss how to migrate the current data, since
> > colonialone is already in production.
> > 
> > In particular, fixing the DNS issues I reported would help if
> > temporary relocation is needed.
> 
> 
> I see that there are currently 4x 1.5TB disks.
> 
> 
> sda 1TB   inside
> sdb 1TB   inside
> sdc 1.5TB inside?
> sdd 1.5TB inside?
> sde 1.5TB external/eSATA?
> sdf 1.5TB external/eSATA?
> 
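In case it helps double-check that mapping (sizes, and which disks sit
behind the eSATA enclosure), something along these lines should tell --
assuming smartmontools is installed:

  # sizes, in 1K blocks
  cat /proc/partitions

  # which controller each disk hangs off (internal SATA vs the eSATA card)
  ls -l /sys/block/sd?/device

  # model and serial number, to match against the physical drives
  smartctl -i /dev/sda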
> 
> Here's what I started doing:
> 
> - recreated 4 partitions on sdc and sdd (2 of them in an extended
>   partition)
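
Side note: once one of the 1.5TB disks is partitioned, the same layout can
be copied to the other same-size disks with sfdisk -- sketch only,
double-check the target device before running anything like this:

  # dump sdc's partition table and replay it onto sdd
  sfdisk -d /dev/sdc | sfdisk /dev/sdd
  # then verify, and make sure the RAID members are type "fd"
  # (Linux raid autodetect)
  fdisk -l /dev/sdd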
> 
> - added sdc and sdd to the current RAID-1 arrays
> 
>   mdadm /dev/md0 --add /dev/sdc1
>   mdadm /dev/md0 --add /dev/sdd1
>   mdadm /dev/md1 --add /dev/sdc2
>   mdadm /dev/md1 --add /dev/sdd2
>   mdadm /dev/md2 --add /dev/sdc5
>   mdadm /dev/md2 --add /dev/sdd5
>   mdadm /dev/md3 --add /dev/sdc6
>   mdadm /dev/md3 --add /dev/sdd6
>   mdadm /dev/md0 --grow -n 4
>   mdadm /dev/md1 --grow -n 4
>   mdadm /dev/md2 --grow -n 4
>   mdadm /dev/md3 --grow -n 4
> 
> colonialone:~# cat /proc/mdstat 
> Personalities : [raid1] 
> md3 : active raid1 sdd6[4] sdc6[5] sdb4[1] sda4[0]
>       955128384 blocks [4/2] [UU__]
>       [>....................]  recovery =  0.0% (43520/955128384) finish=730.1min speed=21760K/sec
>       
> md2 : active raid1 sdc5[2] sdd5[3] sdb3[1] sda3[0]
>       19534976 blocks [4/4] [UUUU]
>       
> md1 : active raid1 sdd2[2] sdc2[3] sda2[0] sdb2[1]
>       2000000 blocks [4/4] [UUUU]
>       
> md0 : active raid1 sdd1[2] sdc1[3] sda1[0] sdb1[1]
>       96256 blocks [4/4] [UUUU]
> 
> - installed GRUB on sdc and sdd
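
That is, something like the following, so that any surviving disk can
still boot the box -- sketch, assuming MBR boot and that grub-install is
available:

  grub-install /dev/sdc
  grub-install /dev/sdd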
> 
> 
> With this setup, the data is on both the 1TB and the 1.5TB disks.
> 
> If you confirm that this is OK, we can:
> 
> * extend this to sde and sdf,
> 
> * unplug sda+sdb and plug all the 1.5TB disks internally
> 
> * reboot while you are at the colo, and ensure that there's no device
>   renaming mess
> 
> * add the #7 partitions in sdc/d/e/f as a new RAID device / LVM
>   Physical Volume and get the remaining 500GB
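
For that last step, roughly -- untested sketch, /dev/md4 is just the next
free md device and "vg_colonialone" is a placeholder for the actual volume
group name:

  # new 4-way mirror over the remaining ~500GB (#7) partitions
  mdadm --create /dev/md4 --level=1 --raid-devices=4 \
        /dev/sdc7 /dev/sdd7 /dev/sde7 /dev/sdf7

  # hand it over to LVM
  pvcreate /dev/md4
  vgextend vg_colonialone /dev/md4

  # record all arrays by UUID so the reboot / device-renaming step above
  # is a non-issue (Debian; review mdadm.conf for duplicates afterwards)
  mdadm --detail --scan >> /etc/mdadm/mdadm.conf
  update-initramfs -u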
> 
> 
> Can you let me know if this sounds reasonable?

Ping!

-- 
Sylvain



