
From: Sylvain Beucler
Subject: [Savannah-hackers-public] Re: [gnu.org #649274] colonialone ailing
Date: Mon, 20 Dec 2010 22:49:06 +0100
User-agent: Mutt/1.5.20 (2009-06-14)

Hi,

I don't have a recommendation for the hardware issues.

However:

- Note that you back up savannah-backup.gnu.org (a 1TB VM that I think
  runs on cloud9).  That VM in turn backs up colonialone at 12:00 UTC
  daily, and that backup usually takes 30 minutes to an hour.

  So the time at which monolith starts is not relevant to
  colonialone's disk load.

- ward recommended using smaller RAID partitions as separate LVM PVs
  combined into a bigger VG, to limit how much has to resync when a
  disk is replaced -- so if there's a way to do that during the
  downtime, please do :)
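A rough sketch of the layout ward suggests (device names and partition sizes here are hypothetical, and the commands are untested on this box): several small RAID1 arrays, each a PV in a single VG, so a disk swap only resyncs the small arrays rather than one 955M-block mirror.

```shell
# Hypothetical devices: sda7/sdb7 and sda8/sdb8 are small matching partitions.
mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/sda7 /dev/sdb7
mdadm --create /dev/md11 --level=1 --raid-devices=2 /dev/sda8 /dev/sdb8

# Each small array becomes an LVM physical volume...
pvcreate /dev/md10 /dev/md11

# ...and they are pooled into one big volume group for the guests.
vgcreate vg_guests /dev/md10 /dev/md11
```

The VG then presents the same total capacity as one big array, but a resync after a disk failure proceeds array by array instead of in one multi-hour pass.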

- Sylvain

On Mon, Dec 20, 2010 at 01:21:12PM -0500, Peter Olson via RT wrote:
> Here is a timeline of the savannah disk problem.
> 
> 2010-12-18T04:25 /dev/sdd was added to /dev/md0,1,2,3
> 
> The first three were small and resynched very quickly.
> 
> 2010-12-18T04:31 /dev/md3 starts to resynch /dev/sdd6 (955 million blocks).
> 
> Everything is fine for a while.
> 
> Dec 18 06:52:45 colonialone mdadm[5026]: Rebuild20 event detected on md device /dev/md3
> 
> This means resynch is 20% complete.
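Extrapolating linearly from that event (times taken from the log above: resync started at 04:31, the 20% mark was logged at 06:52:45), the full resync of the 955-million-block md3 would be expected to take roughly 12 hours:

```shell
# Convert both timestamps to seconds-since-midnight.
start=$(( 4*3600 + 31*60 ))        # 04:31:00, resync of sdd6 begins
at20=$((  6*3600 + 52*60 + 45 ))   # 06:52:45, Rebuild20 event

elapsed=$(( at20 - start ))        # seconds spent on the first 20%
full=$(( elapsed * 5 ))            # naive linear extrapolation to 100%

echo "$elapsed $full"              # 8505 42525  (about 11.8 hours total)
```

That is a best case; the hangs described below meant the resync never ran that long uninterrupted until Dec 19.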
> 
> I'm not sure what this is all about:
> 
> Dec 18 07:00:30 colonialone kernel: [1106742.561695] EXT3-fs: INFO: recovery required on readonly filesystem.
> Dec 18 07:00:30 colonialone kernel: [1106742.568646] EXT3-fs: write access will be enabled during recovery.
> Dec 18 07:00:53 colonialone kernel: [1106765.725031] kjournald starting.  Commit interval 5 seconds
> Dec 18 07:00:53 colonialone kernel: [1106765.732416] EXT3-fs: recovery complete.
> Dec 18 07:00:53 colonialone kernel: [1106765.897655] EXT3-fs: mounted filesystem with ordered data mode.
> 
> I had erroneously thought that the monolith backup starts at 07:20, but it starts at 20:07 and typically runs until midnight or one AM.  Sometimes it runs longer; for example, it ran until 1 PM on Monday, Dec 13.  Perhaps this is caused by log rotation or some other massive change in the file system.  We are considering standardizing on datestamped logs for our other machines, so the log filename doesn't change at every rotation.  This morning the backup ran until 8 AM:
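For illustration, the datestamped-logs idea corresponds to logrotate's `dateext` option: rotated files get fixed, dated names instead of being renamed `.1`, `.2`, ... at every rotation, so rsync does not see every old log as a changed file. A hypothetical stanza (paths made up):

```conf
# Sketch only; paths and retention are placeholders.
/var/log/apache2/*.log {
    daily
    rotate 30
    dateext          # rotated copy becomes e.g. access.log-20101218
    compress
    delaycompress
}
```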
> 
> rsync of savannah-backup.gnu.org started at Sun Dec 12 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Mon Dec 13 13:00:55 EST 2010
> rsync of savannah-backup.gnu.org started at Mon Dec 13 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Mon Dec 13 22:07:03 EST 2010
> rsync of savannah-backup.gnu.org started at Tue Dec 14 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Wed Dec 15 00:36:57 EST 2010
> rsync of savannah-backup.gnu.org started at Wed Dec 15 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Thu Dec 16 00:59:14 EST 2010
> rsync of savannah-backup.gnu.org started at Thu Dec 16 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Fri Dec 17 00:36:40 EST 2010
> rsync of savannah-backup.gnu.org started at Fri Dec 17 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Sat Dec 18 00:04:56 EST 2010
> rsync of savannah-backup.gnu.org started at Sat Dec 18 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Sun Dec 19 00:01:14 EST 2010
> rsync of savannah-backup.gnu.org started at Sun Dec 19 20:07:01 EST 2010
> rsync of savannah-backup.gnu.org finished at Mon Dec 20 08:17:39 EST 2010
> 
> On Saturday the hung task notifications started around 07:20 AM, but my theory of why this happened is faulty (it's not because of load from monolith).
> 
> Dec 18 07:21:50 colonialone kernel: [1108026.400926] INFO: task blkback.5.sda2:5001 blocked for more than 120 seconds.
> Dec 18 07:21:51 colonialone kernel: [1108026.406840] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 18 07:21:51 colonialone kernel: [1108026.418761] blkback.5.sda D ffffffff8044af00     0  5001      2
> Dec 18 07:21:51 colonialone kernel: [1108026.424213]  ffff8802bbc439c0 0000000000000246 0000000000000000 ffff8805df47a740
> Dec 18 07:21:51 colonialone kernel: [1108026.431214]  ffff88053cfad440 ffffffff804ff460 ffff88053cfad6c0 0000000080266ba3
> Dec 18 07:21:51 colonialone kernel: [1108026.441373]  00000000ffffffff 0000000000000000 0000000000000010 ffff8805e25d4c40
> Dec 18 07:21:51 colonialone kernel: [1108026.447733] Call Trace:
> Dec 18 07:21:51 colonialone kernel: [1108026.450819]  [<ffffffffa00ef395>] :raid1:wait_barrier+0xb8/0x12c
> Dec 18 07:21:51 colonialone kernel: [1108026.457985]  [<ffffffff80224c8a>] default_wake_function+0x0/0xe
> Dec 18 07:21:51 colonialone kernel: [1108026.464714]  [<ffffffffa00f83f8>] :dm_mod:__split_bio+0x363/0x374
> Dec 18 07:21:51 colonialone kernel: [1108026.471151]  [<ffffffffa00f0ff9>] :raid1:make_request+0x73/0x5b6
> Dec 18 07:21:51 colonialone kernel: [1108026.477958]  [<ffffffff802fdcc1>] elv_merged_request+0x30/0x39
> Dec 18 07:21:51 colonialone kernel: [1108026.484758]  [<ffffffff80311b13>] __up_read+0x13/0x8a
> Dec 18 07:21:51 colonialone kernel: [1108026.490338]  [<ffffffff802ffb1e>] generic_make_request+0x2fe/0x339
> Dec 18 07:21:51 colonialone kernel: [1108026.497154]  [<ffffffff80266ba3>] mempool_alloc+0x24/0xda
> Dec 18 07:21:51 colonialone kernel: [1108026.503219]  [<ffffffff80300df8>] submit_bio+0xdb/0xe2
> Dec 18 07:21:51 colonialone kernel: [1108026.509646]  [<ffffffff80387254>] dispatch_rw_block_io+0x5b9/0x65d
> Dec 18 07:21:51 colonialone kernel: [1108026.515434]  [<ffffffff802ffbb3>] blk_unplug+0x5a/0x60
> Dec 18 07:21:51 colonialone kernel: [1108026.522210]  [<ffffffffa00f9b19>] :dm_mod:dm_table_unplug_all+0x2a/0x3d
> Dec 18 07:21:51 colonialone kernel: [1108026.529606]  [<ffffffff80387747>] blkif_schedule+0x3c7/0x4dc
> Dec 18 07:21:51 colonialone kernel: [1108026.535638]  [<ffffffff80387380>] blkif_schedule+0x0/0x4dc
> Dec 18 07:21:51 colonialone kernel: [1108026.541776]  [<ffffffff8023f57f>] kthread+0x47/0x74
> Dec 18 07:21:51 colonialone kernel: [1108026.547366]  [<ffffffff802282ec>] schedule_tail+0x27/0x5c
> Dec 18 07:21:51 colonialone kernel: [1108026.553793]  [<ffffffff8020be28>] child_rip+0xa/0x12
> Dec 18 07:21:51 colonialone kernel: [1108026.559343]  [<ffffffff8023f538>] kthread+0x0/0x74
> Dec 18 07:21:51 colonialone kernel: [1108026.564794]  [<ffffffff8020be1e>] child_rip+0x0/0x12
> 
> Note that this may not be a fatal problem; it may just be that the machine is very overloaded.
> 
> Practically speaking it is a show-stopper, because the resynchronization process makes very little progress once this happens.
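One mitigation worth noting (not mentioned in the thread, and untested on this host): md exposes sysctls that throttle resync bandwidth, which can keep the rebuild from saturating the disks while the DomUs are running, at the cost of a longer resync.

```shell
# Cap the per-device resync rate (values are KB/s; 50000 is a
# hypothetical figure, tune to what the DomUs can tolerate).
echo 50000 > /proc/sys/dev/raid/speed_limit_max

# The current resync speed and ETA are visible here:
cat /proc/mdstat
```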
> 
> Dec 18 07:21:51 colonialone kernel: [1108026.572483] INFO: task md3_resync:713 blocked for more than 120 seconds.
> Dec 18 07:21:51 colonialone kernel: [1108026.580017] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 18 07:21:51 colonialone kernel: [1108026.588702] md3_resync    D ffffffff8044af00     0   713      2
> Dec 18 07:21:51 colonialone kernel: [1108026.595326]  ffff8802bbfffce0 0000000000000246 0000000000000000 ffff8805df963ee0
> Dec 18 07:21:51 colonialone kernel: [1108026.603863]  ffff8805e25e7040 ffff8805e245e040 ffff8805e25e72c0 0000000200000000
> Dec 18 07:21:51 colonialone kernel: [1108026.612292]  00000000ffffffff ffffffff8023f6b6 ffff8805dfa705c0 ffffffff80221555
> 
> For the DomUs it causes functional problems, such as sshd not being able to complete the protocol negotiation for a successful login to a VM.
> 
> This may have failed on the Dom0 as well, since I wound up using servent to contact colonialone (perhaps because I could not ssh to it).
> 
> At 9 AM Bernie rebooted colonialone:
> 
> Dec 18 09:03:54 colonialone kernel: [1114157.641661] md: cannot remove active disk sdd6 from md3 ...
> Dec 18 09:04:00 colonialone kernel: [1114163.556083] raid1: Disk failure on sdd6, disabling device.
> Dec 18 09:04:00 colonialone kernel: [1114163.556086] raid1: Operation continuing on 2 devices.
> Dec 18 09:04:00 colonialone mdadm[5026]: FailSpare event detected on md device /dev/md3, component device /dev/sdd6
> Dec 18 09:04:08 colonialone shutdown[8193]: shutting down for system reboot
> Dec 18 09:04:08 colonialone init: Switching to runlevel: 6
> Dec 18 09:04:08 colonialone rpc.statd[2395]: Caught signal 15, un-registering and exiting.
> Dec 18 09:04:08 colonialone snmpd[2839]: Received TERM or STOP signal...  shutting down...
> 
> 
> Dec 18 09:10:10 colonialone kernel: [1114537.764182]  [<ffffffff8020b528>] system_call+0x68/0x6d
> Dec 18 09:10:10 colonialone kernel: [1114537.770695]  [<ffffffff8020b4c0>] system_call+0x0/0x6d
> Dec 18 09:10:10 colonialone kernel: [1114537.776042]
> 
> Dec 18 09:12:00 colonialone kernel: imklog 3.18.6, log source = /proc/kmsg started.
> Dec 18 09:12:00 colonialone rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="2509" x-info="http://www.rsyslog.com"] restart
> Dec 18 09:12:00 colonialone kernel: [    0.000000] Initializing cgroup subsys cpuset
> Dec 18 09:12:00 colonialone kernel: [    0.000000] Initializing cgroup subsys cpu
> Dec 18 09:12:00 colonialone kernel: [    0.000000] Linux version 2.6.26-2-xen-amd64 (Debian 2.6.26-26lenny1) (address@hidden) (gcc version 4.1.3 20080704 (prerelease) (Debian 4.1.2-25)) #1 SMP Thu Nov 25 06:39:26 UTC 2010
> Dec 18 09:12:00 colonialone kernel: [    0.000000] Command line: root=/dev/md2 ro console=ttyS0,115200n8 console=tty0 acpi=off
> 
> Here's where /dev/sda6 and /dev/sdd6 are rejected:
> 
> Dec 18 09:12:00 colonialone kernel: [   10.978740] md: md3 stopped.
> Dec 18 09:12:00 colonialone kernel: [   11.050788] md: bind<sda6>
> Dec 18 09:12:00 colonialone kernel: [   11.053846] md: bind<sdb6>
> Dec 18 09:12:00 colonialone kernel: [   11.057885] md: bind<sdd6>
> Dec 18 09:12:00 colonialone kernel: [   11.061587] md: bind<sdc6>
> Dec 18 09:12:00 colonialone kernel: [   11.064449] md: kicking non-fresh sdd6 from array!
> Dec 18 09:12:00 colonialone kernel: [   11.069346] md: unbind<sdd6>
> Dec 18 09:12:00 colonialone kernel: [   11.072410] md: export_rdev(sdd6)
> Dec 18 09:12:00 colonialone kernel: [   11.075384] md: kicking non-fresh sda6 from array!
> Dec 18 09:12:00 colonialone kernel: [   11.080268] md: unbind<sda6>
> Dec 18 09:12:00 colonialone kernel: [   11.083244] md: export_rdev(sda6)
> Dec 18 09:12:00 colonialone kernel: [   11.086655] md: md3: raid array is not clean -- starting background reconstruction
> Dec 18 09:12:00 colonialone kernel: [   11.106510] raid1: raid set md3 active with 2 out of 3 mirrors
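Getting the kicked partitions back into md3 after such a reboot would look roughly like the sketch below (device names taken from the log; the commands are untested here, and on this pre-bitmap metadata a plain `--add` will trigger a full resync):

```shell
# Confirm which mirrors are actually active before touching anything.
mdadm --detail /dev/md3

# Re-add the members that were kicked as non-fresh.
mdadm /dev/md3 --add /dev/sda6
mdadm /dev/md3 --add /dev/sdd6

# Watch the resync progress.
cat /proc/mdstat
```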
> 
> Here is some information from SMART (Bernie started smartd Dec 18 19:46, so we have no data prior to that to compare):
> 
>  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 112  sdb
>  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 111 to 112  sdc
>  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 112 to 114  sdc
>  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 112 to 115  sdb
>  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 114 to 115  sdc
>  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 116  sdb
>  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 115 to 116  sdc
>  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 117  sdb
>  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 116 to 117  sdc
>  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 118  sdb
>  SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 119 to 111  sdc
>  SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 70 to 71 sdd
>  SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 70 sdd
>  SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 71 to 72 sdc
>  SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 71 sdc
>  SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 72 to 73 sda
>  SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 73 to 72 sda
>  SMART Usage Attribute: 194 Temperature_Celsius changed from 27 to 28 sda
>  SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 27 sda
>  SMART Usage Attribute: 194 Temperature_Celsius changed from 28 to 29 sdc
>  SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 28 sdc
>  SMART Usage Attribute: 194 Temperature_Celsius changed from 29 to 30 sdd
>  SMART Usage Attribute: 194 Temperature_Celsius changed from 30 to 29 sdd
>  SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 41 to 48  sdc
>  SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 43 to 44  sdc
>  SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 44 to 43  sdc
>  SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 45 to 44  sdc
>  SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 45  sdc
>  SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 48 to 49  sdb
>  SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 49 to 50  sdb
>  SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 50 to 49  sdb
>  SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 50 to 51  sdb
>  SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 51 to 50  sdb
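These attribute values can also be read on demand rather than waiting for smartd to log a change, e.g. (sketch, untested on this host; the grep pattern just picks out the attributes discussed above):

```shell
# Current raw and normalized values for the attributes of interest.
smartctl -A /dev/sdc | egrep 'Raw_Read_Error_Rate|Hardware_ECC_Recovered|Temperature'

# Overall drive self-assessment.
smartctl -H /dev/sdc
```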
> 
> Saturday evening we power cycled at 18:28:
> 
> Dec 18 18:28:05 colonialone kernel: [26770.798179]  [<ffffffff8023f538>] kthread+0x0/0x74
> Dec 18 18:28:05 colonialone kernel: [26770.803143]  [<ffffffff8020be1e>] child_rip+0x0/0x12
> Dec 18 18:28:05 colonialone kernel: [26770.811011]
> Dec 18 18:38:00 colonialone kernel: imklog 3.18.6, log source = /proc/kmsg started.
> Dec 18 18:38:00 colonialone rsyslogd: [origin software="rsyslogd" swVersion="3.18.6" x-pid="2562" x-info="http://www.rsyslog.com"] restart
> Dec 18 18:38:00 colonialone kernel: [    0.000000] Initializing cgroup subsys cpuset
> Dec 18 18:38:00 colonialone kernel: [    0.000000] Initializing cgroup subsys cpu
> 
> Sunday morning, the resynchronization reached 81% before things went sour.
> 
> Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdb, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 102 to 113
> Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdb, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 51 to 52
> Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdc, SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 117 to 120
> Dec 19 07:16:36 colonialone smartd[7514]: Device: /dev/sdc, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 47 to 48
> 
> Dec 19 07:17:01 colonialone /USR/SBIN/CRON[3261]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
> Dec 19 07:17:19 colonialone kernel: [45888.504962] INFO: task md3_resync:3307 blocked for more than 120 seconds.
> Dec 19 07:17:19 colonialone kernel: [45888.510214] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 19 07:17:19 colonialone kernel: [45888.519215] md3_resync    D ffff8805bef5be9c     0  3307      2
> Dec 19 07:17:19 colonialone kernel: [45888.527748]  ffff8805bef5bce0 0000000000000246 0000000000000001 ffff8805df67dee0
> Dec 19 07:17:19 colonialone kernel: [45888.535827]  ffff8805e25a2480 ffff8805e2704e80 ffff8805e25a2700 0000000300000000
> Dec 19 07:17:19 colonialone kernel: [45888.543885]  ffff8805bef5bca0 ffffffff8023f6b6 ffff8805df50a140 ffffffff80221555
> 
> (It was this event, starting at 07:17, that made me realize my theory about the monolith backups was wrong.)
> 
> Perhaps something else happens shortly after 7 AM each day?  But what about the timing of the other failures?
> 
> For planning to resynchronize two more drives into the arrays, here is the timing of things:
> 
> Dec 18 06:52:45 colonialone mdadm[5026]: Rebuild20 event detected on md device /dev/md3
> hung tasks at 07:21
> rebooted at 9 AM due to overload
> hung tasks at 09:41
> Bernie takes down the VMs at 10:21
> rebooted at 11 AM
> 
> Dec 18 13:34:39 colonialone mdadm[4980]: Rebuild20 event detected on md device /dev/md3
> Dec 18 16:15:38 colonialone mdadm[4980]: Rebuild40 event detected on md device /dev/md3
> hung tasks at 18:13
> rebooted at 18:38
> 
> Dec 18 23:29:03 colonialone mdadm[4987]: Rebuild20 event detected on md device /dev/md3
> Dec 19 01:59:03 colonialone mdadm[4987]: Rebuild40 event detected on md device /dev/md3
> Dec 19 04:08:03 colonialone mdadm[4987]: Rebuild60 event detected on md device /dev/md3
> Dec 19 06:48:04 colonialone mdadm[4987]: Rebuild80 event detected on md device /dev/md3
> hung tasks at 07:17
> rebooted at 07:30
> 
> Dec 19 07:36:26 colonialone mdadm[5098]: RebuildStarted event detected on md device /dev/md3
> Dec 19 08:14:26 colonialone mdadm[5098]: Rebuild20 event detected on md device /dev/md3
> Dec 19 08:55:26 colonialone mdadm[5098]: Rebuild40 event detected on md device /dev/md3
> Dec 19 09:38:26 colonialone mdadm[5098]: Rebuild60 event detected on md device /dev/md3
> Dec 19 10:27:26 colonialone mdadm[5098]: Rebuild80 event detected on md device /dev/md3
> Dec 19 11:19:24 colonialone mdadm[5098]: RebuildFinished event detected on md device /dev/md3
> rebooted at 11:22 as a countermeasure in case some corruption had accumulated (no theory about this), and also to prove that the RAID would be synchronized after the boot
> no more hung tasks since
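The one rebuild that ran to completion without a hang (Dec 19, 07:36:26 to 11:19:24) gives a baseline rate for planning the remaining resyncs; the arithmetic, using the timestamps above and the 955-million-block size of md3:

```shell
# Convert the RebuildStarted/RebuildFinished timestamps to seconds.
started=$((  7*3600 + 36*60 + 26 ))   # 07:36:26
finished=$(( 11*3600 + 19*60 + 24 )) # 11:19:24

elapsed=$(( finished - started ))     # rebuild duration in seconds
rate=$(( 955000000 / elapsed ))       # blocks resynced per second

echo "$elapsed $rate"                 # 13378 71385 (~3h43m, ~71k blocks/s)
```

That is consistent with the "about 4 hours of downtime per disk" estimate below, provided the mystery overloads stay away.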
> 
> So it appears that if we accept 4 hours of downtime (no VMs running), we can add one disk to the RAID.  Adding two disks would take longer, but probably not a lot longer.  Trying to add disks without an explanation of why the overloads take place seems risky.
> 
> Peter Olson
> FSF Senior Systems Administrator


