Load problem with cfservd
From: Baker, Darryl
Subject: Load problem with cfservd
Date: Mon, 14 Mar 2005 16:08:01 -0500
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
My master machine runs Solaris 9. All systems run Solaris 8 or 9 and
cfengine 2.1.13.
The problem we have with cfservd manifests itself as a periodic clog
that takes about a minute to resolve. This period is characterized by
the following symptoms:
1. Load average spike from ~3 (on a 4-processor system) to the 6-8
range. Occasionally the spike breaks into double digits.
2. Increase in concurrent port 5308 (cfengine) sessions from a base
level of 0-4 to peaks in the 12-30 range, with the number of LWPs in
the cfservd processes tracking the number of connections linearly.
(Client systems are set to connect twice an hour with a 25-minute
splay time.)
3. Running lockstat shows severe contention for a single adaptive
mutex:
    root@sysadm05:proc# lockstat sleep 5

    Adaptive mutex spin: 157416 events in 5.040 seconds (31233 events/sec)

     Count indv cuml rcnt     spin Lock       Caller
    ------------------------------------------------------------------
    136805  87%  87% 1.00       75 0x152ec90  sfmmu_mlist_enter+0x84
    [...]

    Adaptive mutex block: 648 events in 5.040 seconds (129 events/sec)

     Count indv cuml rcnt     nsec Lock       Caller
    ------------------------------------------------------------------
       547  84%  84% 1.00   391652 0x152ec90  sfmmu_mlist_enter+0x84
When the system is in its 'calm' state, both of those event types run
about 2 orders of magnitude lower in total, and this specific lock
runs as much as 3 orders of magnitude lower (i.e. ~100 spins and no
blocks).
4. The cfservd process becomes by far the top CPU user, eating 10-25%
of total CPU on a 4-processor system.
5. The system retains some idle time (5-30%) but the time used by the
kernel jumps to the 40-70% range.
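As a sanity check on the lockstat figures in item 3, here is a small
Python sketch that recomputes the per-second rates and the share of
events landing on the one lock (0x152ec90). It only uses the numbers
quoted above. The variable and function names are made up for
illustration and are not part of lockstat or cfengine.

```python
# Figures copied from the lockstat output quoted in item 3.
INTERVAL_S = 5.040      # sampling window ("lockstat sleep 5")
SPIN_TOTAL = 157416     # all adaptive mutex spin events
SPIN_ON_LOCK = 136805   # spin events on lock 0x152ec90
BLOCK_TOTAL = 648       # all adaptive mutex block events
BLOCK_ON_LOCK = 547     # block events on lock 0x152ec90


def rate(events, seconds=INTERVAL_S):
    """Events per second over the sampling window."""
    return events / seconds


def share(part, total):
    """Percentage of all events that fall on one lock."""
    return 100.0 * part / total


spin_rate = rate(SPIN_TOTAL)
block_rate = rate(BLOCK_TOTAL)
spin_share = share(SPIN_ON_LOCK, SPIN_TOTAL)
block_share = share(BLOCK_ON_LOCK, BLOCK_TOTAL)

print("spin:  %.0f events/sec, %.0f%% on one lock" % (spin_rate, spin_share))
print("block: %.0f events/sec, %.0f%% on one lock" % (block_rate, block_share))
```

The rounded results agree with what lockstat printed, so the two
summaries are internally consistent: the great majority of both spin
and block events are on the same lock.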
Our troubleshooting history leads me to believe that the heavy ssh
usage on this host is a significant compounding factor, i.e. that we
hit some common bottleneck when cfservd is accepting connections while
we are spawning batches of 30-100 outbound ssh connections at once.
Reducing the herds of outbound ssh sessions has reduced the frequency
and severity of these clog periods, but every time we change much of
anything on the system, we end up back in a state where these clogs
become common.
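For illustration only, one way to keep outbound ssh batches from
starting all at once is to run them through a bounded worker pool.
The Python sketch below is an assumption about how that could be done,
not a description of what is running on this host: run_ssh,
run_bounded, the host names, and the default limit of 8 are all
hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_ssh(host, command):
    """Run one command on one host via ssh and return its exit status."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True,
        text=True,
    )
    return result.returncode


def run_bounded(hosts, command, max_parallel=8, runner=run_ssh):
    """Run command on every host, with at most max_parallel sessions at once.

    max_parallel=8 is an arbitrary illustrative default, not a tuned value.
    """
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        statuses = list(pool.map(lambda h: runner(h, command), hosts))
    return dict(zip(hosts, statuses))
```

Called as run_bounded(["hostA", "hostB"], "uptime", max_parallel=4),
this returns a dict mapping each host to the ssh exit status. The pool
size caps how many ssh processes exist at any moment, so a batch of
30-100 targets is spread out over time instead of being spawned
together.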
_____________________________________________________________________
Darryl Baker
gedas USA, Inc.
Operational Services Business Unit
3800 Hamlin Road
Auburn Hills, MI 48326
US
phone +1-248-754-5341
fax +1-248-754-6399
Darryl.Baker@gedas.com
http://www.gedasusa.com
_____________________________________________________________________
-----BEGIN PGP SIGNATURE-----
Version: PGP Personal Security 7.0.3
iQA/AwUBQjX9Mle1Bhkj9lZeEQLTgQCeNHbP4+Zf+P2luqNx/QRNpLeOYF8AnRvL
BXCjcj0Rs4JDtgcQzjKv016V
=IHlF
-----END PGP SIGNATURE-----