cfservd under load - SplayTime quirks
From: Eric Sorenson
Subject: cfservd under load - SplayTime quirks
Date: Tue, 7 Dec 2004 13:25:04 -0800 (PST)
Hi, after upgrading to 2.1.11 last week, our RH9-based cfservd master server
started behaving oddly. cfservd crashed a lot (four or five times a day) and
when I tried to debug it, sometimes it would go into an unkillable state (like
a zombie but reported by linux as 'defunct'). New cfservd's wouldn't be able
to bind to *:tcp/5308 and rebooting the machine was our only recourse to bring
it back. I investigated a bit further today, and here's a writeup of a few
things I've noticed. This is more of a narrative than a concise bug report
but there are a couple of subtle bugs described herein.
(About our setup: we have about 1200 mostly linux machines, all recently
upgraded to 2.1.11, talking to a RH9 masterserver, also now running 2.1.11.
Cfengine runs from cron every hour, with a SplayTime of 30 minutes. The config
mostly does copy: actions with a few local shellcommands.)
The first thing I noticed was that despite the SplayTime, right at 25 min past
the hour (when the cron job is scheduled), the server gets absolutely
pounded with clients. cfservd crashes within a few seconds, sometimes with
no corefile and nothing logged, and sometimes with stuff like:
Dec 5 04:25:51 sinistar cfservd[20038]: Unable to lookup hostname
(build-15.domain.com) or cfengine service: Temporary failure in name resolution
Dec 5 04:25:51 sinistar cfservd[20038]: Couldn't open last-seen database /var/cfengine/cf_lastseen.db
Dec 5 04:25:51 sinistar cfservd[20038]: db_open: Too many open files
Ok, that seems straightforward: there are a ton of clients connecting, each
one eating up a few file descriptors, and at some point we run out. But
'ulimit -n' permits 1024 open files, and /proc/sys/fs/file-nr shows
"2668 1593 52422" (allocated, used, maximum). When I've been able to snatch
an 'lsof' from a busy cfservd, there are maybe 100 fds in use, so I don't think
either of these system limits is being hit. This led me in two directions:
first, to investigate splaying out the clients more, and second, to tune
cfservd to behave more nicely when it's getting pummeled with connections.
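For anyone wanting to repeat the descriptor check, here is a rough Python sketch of the three numbers compared above. The function names are my own, the field labels for file-nr are simply the ones used in this post, and the /proc/<pid>/fd listing assumes Linux. Note that getrlimit reports the limit of the process that calls it; a running daemon may have inherited a different limit from the one an interactive shell shows.

```python
import os
import resource


def parse_file_nr(text):
    """Split the three numbers from /proc/sys/fs/file-nr into named fields.

    The labels (allocated, used, maximum) follow the description in this
    post; check the kernel documentation for your kernel version.
    """
    allocated, used, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "used": used, "maximum": maximum}


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds(pid):
    """Count descriptors held by a process by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir("/proc/%d/fd" % pid))


if __name__ == "__main__":
    print(parse_file_nr("2668 1593 52422"))
    print("soft/hard nofile limit:", nofile_limits())
    print("fds open in this process:", count_open_fds(os.getpid()))
```

Pointing count_open_fds at the cfservd pid (as root) should give roughly the same count as the lsof snapshot.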
Well, right off I realized I'd made an error. My cfagent.conf was set to a 30
minute splay, but update.conf was only set to 5 minutes. The docs say (from
the Tutorial):
Every machine will go to sleep for a different length of time, which is no
longer than the time you specify in minutes. A hashing algorithm, based on
the fully qualified name of the host, is used to compute a unique time for
hosts. The shorter the interval, the more clustered the hosts will be.
However, if you use update.conf, the SplayTime in your cfagent.conf gets
ignored entirely -- something I really wasn't expecting!
* (update context)
Sleeping for SplayTime 398 seconds
* (main context)
Time splayed once already - not repeating
I guess this is the intended behavior (?); if so, it could stand to be documented
better. The comment at the code example in the tutorial says:
# Put this in update.conf, so that the updates are also splayed
But what it means is:
# If you put this in update.conf, the whole run will splay to this value
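To make the "splay once per run" behavior concrete, here is a toy Python model of what the agent output above suggests. It is not cfengine code: run_agent and its messages are made up for illustration, and the case where update.conf sets no SplayTime (None here) is my assumption, not something tested in this post.

```python
def run_agent(update_splay_min, main_splay_min):
    """Return the (context, action) steps for one simulated agent run.

    Hypothetical model: the run sleeps at most once, in whichever context
    asks for a splay first; later contexts skip it.
    """
    steps = []
    already_splayed = False
    for context, splay in (("update", update_splay_min), ("main", main_splay_min)):
        if splay is None:
            steps.append((context, "no SplayTime set"))
        elif already_splayed:
            steps.append((context, "splayed once already - not repeating"))
        else:
            steps.append((context, "sleep up to %d minutes" % splay))
            already_splayed = True
    return steps


if __name__ == "__main__":
    # The setup described in this post: 5 minutes in update.conf, 30 in cfagent.conf.
    for step in run_agent(5, 30):
        print(step)
```

With (5, 30) the model sleeps in the update context and skips the 30 minute value in the main context, which is the surprise described above.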
FWIW I wrote a little program to understand the SplayTime hashing
algorithm better; the curious can see it here:
http://cfwiki.org/cfwiki/index.php/SplayTime_testing
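The general idea the Tutorial describes (hash the fully qualified hostname, map it into the splay window) can be sketched in a few lines of Python. This is not cfengine's actual hash function; MD5 is just a stand-in, and the build-N.domain.com hostnames are invented to get roughly the 1200 hosts mentioned above.

```python
import hashlib
from collections import Counter


def splay_seconds(fqdn, splay_minutes):
    """Map a hostname to a fixed offset in [0, splay_minutes * 60) seconds.

    MD5 is a stand-in hash for illustration, not what cfengine uses.
    """
    digest = hashlib.md5(fqdn.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (splay_minutes * 60)


def busiest_second(hosts, splay_minutes):
    """Return how many hosts land on the most crowded one-second slot."""
    counts = Counter(splay_seconds(host, splay_minutes) for host in hosts)
    return max(counts.values())


if __name__ == "__main__":
    # Invented hostnames, about the size of the site described in this post.
    hosts = ["build-%d.domain.com" % n for n in range(1200)]
    for minutes in (5, 30):
        print("%2d minute splay: busiest second has %d hosts"
              % (minutes, busiest_second(hosts, minutes)))
```

With 1200 hosts squeezed into a 5 minute window there are only 300 one-second slots, so at least one slot must hold four or more hosts no matter how good the hash is; a 30 minute window gives 1800 slots.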
That is enough jabber for this post; in part two I'll cover cfservd tuning.
--
- Eric Sorenson - Explosive Networking - http://eric.explosive.net -