For things like logging and I/O on a frozen system, I agree we'd need some
flag for these kinds of situations. Maybe a disable_logging() flag... I
really don't like this, though. I'd imagine even syslogd() could block
virtagent in this type of situation, so that would need to be disabled as
well.
But doing so completely subverts our attempts at providing the user with
proper accounting of what the agent is doing. A user can freeze the
filesystem, knowing that logging would be disabled, then prod at whatever
they want. So the handling should be something specific to fsfreeze, with
stricter requirements:
If a user calls fsfreeze(), we disable logging, but also disable the ability
to do anything other than fsthaw() or fsstatus(). This actually solves the
potential deadlocking problem for other RPCs as well, since they can't be
executed in the first place.
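
To make the gating concrete, here's a rough sketch of what the dispatch
check might look like. Nothing here is existing virtagent code: fs_frozen,
set_fs_frozen() and rpc_allowed() are names I made up for illustration, and
the RPC names are compared without the trailing parentheses.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical agent state: set on fsfreeze(), cleared on fsthaw(). */
static bool fs_frozen = false;

/* The only RPCs that stay callable while the filesystem is frozen. */
static const char *const frozen_allowlist[] = { "fsthaw", "fsstatus" };

void set_fs_frozen(bool frozen)
{
    fs_frozen = frozen;
}

/* Returns true if the named RPC may be dispatched in the current state. */
bool rpc_allowed(const char *name)
{
    size_t i;

    assert(name != NULL);
    if (!fs_frozen)
        return true;
    for (i = 0; i < sizeof(frozen_allowlist) / sizeof(frozen_allowlist[0]); i++) {
        if (strcmp(name, frozen_allowlist[i]) == 0)
            return true;
    }
    return false;   /* everything else is refused until fsthaw() */
}
```

The dispatcher would call rpc_allowed() before doing any logging or other
work for a request, so a refused RPC never touches the frozen filesystem.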
So I think that addresses the agent deadlocking itself post-freeze.
However, fsfreeze() itself might lock up the agent as well. I'm not
confident we can really put any kind of bound on how long it'll take to
execute, and even if we time out on the client side, the agent can still
block here.
Plus there are any number of other situations where an RPC can still hang
things. In the future, when we potentially allow things like script
execution, a script might do something like attempt to connect to a socket
that's already in use and wait on the server for an arbitrary amount of
time, or open a file on an NFS share that is currently unresponsive.
So a solution for these situations is still needed, and I'm starting to
agree that threads are needed, but I don't think we should do RPCs
concurrently (not sure if that's what is being suggested or not). At least,
there's no pressing reason for it as things currently stand: there aren't
currently any RPCs where fast response times are all that important, so it's
okay to serialize them behind previous RPCs, and HMP/QMP are one command at
a time. It's also something that I'm fairly confident can be added if the
need arises in the future.
But for dealing with a situation where an RPC can hang the agent, I think
one thread should do it. Basically:
We associate each RPC with a time limit. Some RPCs, very special ones that
we'd trust with our kids, could potentially specify an unlimited timeout.
The client side should use this same timeout on its end. In the future we
might allow the user to explicitly disable the timeout for a certain RPC.
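
As a sketch of the per-RPC time limit, a static table along these lines
would probably do. The struct, the lookup function and every number in it
are placeholders I picked for illustration (including which RPC gets the
unlimited timeout); only the RPC names come from the discussion above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RPC_TIMEOUT_UNLIMITED 0u   /* hypothetical sentinel: no time limit */
#define RPC_TIMEOUT_DEFAULT   30u  /* placeholder default, in seconds */

struct rpc_timeout {
    const char *name;
    unsigned int seconds;
};

/* Placeholder values; real limits would need to be chosen per RPC. */
static const struct rpc_timeout rpc_timeouts[] = {
    { "fsfreeze", 120u },
    { "fsthaw",   RPC_TIMEOUT_UNLIMITED },
    { "fsstatus", 5u },
};

/* Look up the time limit for an RPC, falling back to the default. */
unsigned int rpc_timeout_for(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof(rpc_timeouts) / sizeof(rpc_timeouts[0]); i++) {
        if (strcmp(name, rpc_timeouts[i].name) == 0)
            return rpc_timeouts[i].seconds;
    }
    return RPC_TIMEOUT_DEFAULT;
}
```

If the client is built from the same table, both ends agree on the limit
without having to negotiate it per request.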
The logic would then be:
- read in a client RPC request
- start a thread to do the RPC
- if there's a timeout, register an alarm(<timeout>), with a handler that
  will call something like pthread_kill(current_worker_thread, <signal>).
  On the thread side, this signal will induce a pthread_exit()
- wait for the thread to return (pthread_join(current_worker_thread))
- return its response back to the caller if it finished, return a timeout
  indication otherwise
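
Here's a rough sketch of that flow for a single worker thread. To be
clear, it is not the alarm()/pthread_kill() mechanism described in the
list: I've swapped in pthread_cond_timedwait() for the alarm and
pthread_cancel() for the signal plus pthread_exit(), mostly because
calling pthread_exit() from a signal handler is not async-signal-safe.
The rpc_fn type, struct rpc_job, run_rpc_with_timeout() and the two demo
handlers are all made-up names, and a timeout of 0 or less stands for "no
limit".

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

typedef int (*rpc_fn)(void *arg);   /* hypothetical RPC handler signature */

struct rpc_job {
    rpc_fn fn;
    void *arg;
    int result;
    bool done;                      /* protected by lock */
    pthread_mutex_t lock;
    pthread_cond_t cond;
};

static void *rpc_worker(void *opaque)
{
    struct rpc_job *job = opaque;
    int r = job->fn(job->arg);      /* may block; cancellable while blocked */

    pthread_mutex_lock(&job->lock);
    job->result = r;
    job->done = true;
    pthread_cond_signal(&job->cond);
    pthread_mutex_unlock(&job->lock);
    return NULL;
}

/* Returns 0 and fills *result, or -ETIMEDOUT if the RPC was cancelled. */
int run_rpc_with_timeout(rpc_fn fn, void *arg, int timeout_sec, int *result)
{
    struct rpc_job job = { .fn = fn, .arg = arg, .done = false };
    struct timespec deadline;
    pthread_t worker;
    bool finished;
    int rc = 0;

    assert(fn != NULL && result != NULL);
    pthread_mutex_init(&job.lock, NULL);
    pthread_cond_init(&job.cond, NULL);
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_sec;

    if (pthread_create(&worker, NULL, rpc_worker, &job) != 0) {
        pthread_cond_destroy(&job.cond);
        pthread_mutex_destroy(&job.lock);
        return -EAGAIN;
    }

    /* Wait for completion, or for the deadline if there is one. */
    pthread_mutex_lock(&job.lock);
    while (!job.done && rc == 0) {
        rc = timeout_sec > 0
            ? pthread_cond_timedwait(&job.cond, &job.lock, &deadline)
            : pthread_cond_wait(&job.cond, &job.lock);
    }
    finished = job.done;
    pthread_mutex_unlock(&job.lock);

    if (!finished)
        pthread_cancel(worker);     /* acts at the next cancellation point */
    pthread_join(worker, NULL);

    finished = job.done;            /* worker may have finished at the wire */
    if (finished)
        *result = job.result;
    pthread_cond_destroy(&job.cond);
    pthread_mutex_destroy(&job.lock);
    return finished ? 0 : -ETIMEDOUT;
}

/* Demo handlers, for illustration only. */
int rpc_quick(void *arg) { (void)arg; return 42; }
int rpc_hang(void *arg)  { (void)arg; sleep(30); return 0; }
```

Calling run_rpc_with_timeout(rpc_hang, NULL, 1, &result) should come back
with -ETIMEDOUT after about a second, while rpc_quick returns 0 with
result set to 42. Two caveats: the cancel is deferred, so it only takes
effect when the worker reaches a cancellation point, and a cancelled
handler gets no chance to release whatever it was holding unless it
registers cleanup handlers.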