From: Bob Proulx
Subject: [Savannah-hackers-public] Savannah Outage Event Today 2019-12-09
Date: Mon, 9 Dec 2019 15:09:12 -0700
User-agent: Mutt/1.12.2 (2019-09-21)

Savannah git, svn, hg, bzr, and download: Outage Event Today 2019-12-09

We had a problem this afternoon that caused all of those services to
fail for a while.  Everything is back online and working normally at
this time.  Sorry for the inconvenience.  Please report any problems
you are still seeing.

Here is the story of the events today:

The trigger was my upgrading nfs1 to the latest security updates and
rebooting it.  Three packages had upgrades pending that needed to be
installed: systemd, systemd-sysv, and libsystemd0, moving from the
Trisquel 8 version 229-4ubuntu21.22 to 229-4ubuntu21.23.
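
For the record the upgrade itself was just the routine apt dance,
roughly this (exact invocation from memory, not a transcript):

  apt-get update
  apt-get install systemd systemd-sysv libsystemd0
  reboot

A plain "apt-get upgrade" would have done the same thing since those
were the only packages with upgrades pending.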

The system had been running continuously for 90 days.  That is not a
problem in itself.  But it is long enough that some forgotten change
might have been made that would only take effect on a reboot.  If a
system has been up for longer than even a week then, as a paranoid
standard operating practice, I always reboot it before doing any new
maintenance.  That way, if there is a problem, I know it was something
previously existing and not something I was just doing.  I rebooted
nfs1.  All normal.
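
That pre-maintenance routine is nothing fancy, roughly this:

  uptime                   # see how long it has been up
  apt list --upgradable    # note what is pending before touching anything
  reboot                   # get a clean baseline boot first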

Then I applied the upgrades listed above and rebooted again.  On nfs1
everything seemed perfectly normal.  But that is when vcs0 and
download0 started reporting a stale NFS handle on the mount point.
Nagios detected this and sent out alerts.  Gack!  Time to jump in,
understand the problem, and fix it.
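
On the clients the symptom is the classic ESTALE error on anything
touching the mount point, roughly like this (the mount point path
here is illustrative, not the real one):

  $ ls /mnt/savannah        # path illustrative
  ls: cannot access '/mnt/savannah': Stale file handle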

I poked and probed the patients.  Trying to mount the partition
manually failed with a "mount.nfs4: Connection timed out" error.  Yet
running tcpdump on the clients showed the two sides communicating, so
despite the timeout it did not appear to be a network connectivity
problem.  And in the tcpdump trace nfs1 kept returning "getattr
ERROR: Stale NFS file handle" errors.  Strange.  A stale handle on
the mount point itself?
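
The poking was roughly along these lines (the export name, path, and
interface here are illustrative):

  # try the mount by hand on a client (export and path illustrative)
  mount -t nfs4 nfs1:/export /mnt/test
  # -> mount.nfs4: Connection timed out

  # meanwhile watch the NFS traffic (interface name illustrative)
  tcpdump -n -i eth0 host nfs1 and port 2049
  # -> replies kept showing "getattr ERROR: Stale NFS file handle"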

Without a better explanation I concluded that nfs1 itself was
behaving badly.  Even though it had been freshly rebooted I decided
to reboot it again.

After this second nfs1 reboot the clients could mount the partition
again.  This seems to indicate that nfs1 had, for some unknown
reason, latched into a bad state.

I rebooted vcs0 so that it would get a clean start from boot.  Then I
ran through the regression suite and all of the service tests passed.
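
The regression suite here is just our set of service checks, but
anyone can do a quick smoke test from the outside by simply cloning
something, for example (any hosted project will do, coreutils here):

  git clone https://git.savannah.gnu.org/git/coreutils.git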

During the problem event download0, which has two mount points, had
one stale and one okay, even though the two mounts are basically
identical other than their names.  After the reboot of nfs1, when
things started working again, download0 could mount the previously
stale partition and everything was okay there too.  The mount point
on frontend1 was okay throughout.  Odd that some client systems had
problems and others did not.

That leaves us with an unresolved, and perhaps unresolvable, question
of why nfs1 behaved badly.  It will need continued investigation.
Sometimes things like this only become understood after a longer
period of research.

At the start of things I asked Andrew to post a status update:

  https://hostux.social/@fsfstatus/

That's always a good URL to bookmark in order to get an out-of-band
update of infrastructure system status.  It is hosted on a non-GNU
system so that if the GNU sites are offline there is still a way to
communicate in that event.

Bob


