Re: Reliability of RPC services

From: Jonathan S. Shapiro
Subject: Re: Reliability of RPC services
Date: Tue, 25 Apr 2006 11:32:39 -0400

On Mon, 2006-04-24 at 19:30 +0200, Marcus Brinkmann wrote:
> Hi,
> At Mon, 24 Apr 2006 11:48:22 -0400,
> "Jonathan S. Shapiro" <address@hidden> wrote:
> > So one way to guard against a failing server is to use idempotent timer
> > events to implement a "heartbeat" -- in much the way that TCP does.
> >
> > I like this much better than complicating the invocation mechanism or
> > the capability overwrite mechanism, because the majority of interprocess
> > interactions are between components of the same application. These have
> > been separated into processes for reasons of isolation, reuse, and
> > testability, but they still fail as a unit. We do not want to impose
> > capability semantics that discourage this pattern, and death notices
> > between such processes are undesirable.
> > 
> > The heartbeat does introduce a new specification problem. Basically, we
> > are introducing a new class of error that is visible all the way up to
> > the user (X timed out) and a new requirement for wall-clock response
> > time limits.
>
> If you are going this way, it seems to make more sense to me to design
> the system as a real time operating system in the first place, because
> then one can at least precisely define what the requirements for
> wall-clock (or even CPU) response time limits are.

Indeed. This would make 10 wonderful Ph.D. dissertations.

> The key term you use above is that the processes "fail as a unit".
> This is quite pessimistic.  I am not sure if I accept this yet...

I think you misunderstand what I am saying. I am saying that there are
two cases:

  Processes A and B are in separate failure domains. In this case,
  one must guard against the failure of the other.

  Processes A and B are in the *same* failure domain. In this case,
  neither is required to guard against the other. This is what I
  meant by "the processes fail as a unit". A more precise way to
  say this is "the definition of a failure domain is that all
  processes in that failure domain fail as a unit."
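
To make the two cases concrete, here is a minimal sketch of the distinction. It is only an illustration under stated assumptions: `FailureDomain` and `must_guard` are hypothetical names invented for this example, not part of any existing system or API.

```python
from dataclasses import dataclass, field


@dataclass
class FailureDomain:
    """Hypothetical model: a set of processes that fail as a unit."""
    name: str
    members: set = field(default_factory=set)
    failed: bool = False

    def add(self, process: str) -> None:
        self.members.add(process)

    def fail(self) -> None:
        # By definition, a failure in the domain takes down every member,
        # so a single flag describes the state of all of them.
        self.failed = True

    def is_alive(self, process: str) -> bool:
        return process in self.members and not self.failed


def must_guard(domain_of: dict, a: str, b: str) -> bool:
    """Return True only when a and b live in different failure domains."""
    return domain_of[a] is not domain_of[b]


# Usage: A and B are components of one application, C is an outside server.
app = FailureDomain("app")
server = FailureDomain("server")
for proc in ("A", "B"):
    app.add(proc)
server.add("C")
domain_of = {"A": app, "B": app, "C": server}
```

With this setup, `must_guard(domain_of, "A", "B")` is false and `must_guard(domain_of, "A", "C")` is true: only the cross-domain pair needs a guard such as a heartbeat.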

> Timeouts do not scale, and they cause a constant background noise
> that, depending on the details, I suspect would cause performance and
> power management issues.

I seem to recall saying this myself, and I agree. The problem is that
the kind of reliability you want to achieve cannot be had without them.
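
For illustration only, a timeout-based guard between two failure domains might look roughly like the sketch below. `HeartbeatMonitor`, its `timeout` parameter, and the injectable `clock` are assumptions made up for this example; nothing here describes an actual kernel interface.

```python
import time


class HeartbeatMonitor:
    """Hypothetical watchdog: treats a peer as failed after `timeout`
    seconds without a heartbeat."""

    def __init__(self, timeout: float, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock            # injectable so the logic can be tested
        self.last_seen = clock()

    def beat(self) -> None:
        # Idempotent in effect: a duplicated beat only refreshes the
        # timestamp, so delivering it twice does no harm.
        self.last_seen = self.clock()

    def peer_alive(self) -> bool:
        # The peer counts as alive while the silence is within the limit.
        return (self.clock() - self.last_seen) <= self.timeout


# Usage with a fake clock, so the example does not depend on real time.
now = [0.0]
monitor = HeartbeatMonitor(timeout=5.0, clock=lambda: now[0])
now[0] = 4.0
monitor.beat()                        # peer reports in at t = 4
now[0] = 8.0
alive_at_8 = monitor.peer_alive()     # 4 s of silence, within the limit
now[0] = 9.5
alive_at_9_5 = monitor.peer_alive()   # 5.5 s of silence, over the limit
```

Note that something has to call `peer_alive()` periodically, and the peer has to keep sending beats even when idle. That recurring wakeup is exactly the background noise the objection above is about, and the `timeout` value is the wall-clock limit that now has to be specified.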

