
Re: [Chicken-hackers] pastiche db drop


From: Alaric Snell-Pym
Subject: Re: [Chicken-hackers] pastiche db drop
Date: Mon, 03 Feb 2014 14:23:34 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130713 Thunderbird/17.0.7

On 03/02/14 14:13, John Cowan wrote:

> 
> On the other hand, that was the *only* time the system went down in a
> serious way.  It was a Mickey-Mouse-watch design:  if you drop it,
> it stops; but if you pick it up and shake it, it works again.
> In particular, if the web and FTP sites were messed up, I would just
> say "Wait an hour for reprocessing", and everything would be right again.
> 

Yeah! I once had the delight of looking after a somewhat distributed
system: an online service composed of many complex parts, with various
back-end shared components (databases, file stores, and "business logic"
RPC servers) as well as various front-end systems. All of this ran on a
shoestring: nearly no hardware budget, growing usage requirements,
growing feature requirements, and no proper sysadmins, just me writing
the code and maintaining the software, hardware, and network.

Since firefighting was a constant threat to my time, I tried hard to
build fallbacks and retries into the system wherever I could, so that
the (frequent) component failures didn't easily translate into observed
system failures!

A big part of that was making as many actions as possible asynchronous
and putting them into persistent queues. Daemons pulled jobs from the
queues in such a way that a failed attempt re-queued the job but
incremented a try counter, so a bad job couldn't wedge the queue forever.
This meant it was hard to overwhelm the system with load spikes (they
just consumed disk space in the queue), and if components went down,
jobs just waited until the component came back up.
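The pattern above (persistent queue, re-queue on failure, try counter
with a cap) can be sketched roughly like this. This is my own minimal
illustration, not the original system's code: the `PersistentQueue`
class, the `MAX_TRIES` limit, and the 'dead' state for exhausted jobs
are all assumed details, using SQLite as a stand-in for whatever durable
storage backed the real queues.

```python
import sqlite3

MAX_TRIES = 5  # hypothetical retry limit; the original doesn't name one


class PersistentQueue:
    """Durable job queue: a failed job is re-queued at the tail with an
    incremented try counter, and a job that exhausts MAX_TRIES is parked
    as 'dead' so it can't wedge the queue forever."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS jobs ("
            " id INTEGER PRIMARY KEY,"
            " payload TEXT NOT NULL,"
            " tries INTEGER NOT NULL DEFAULT 0,"
            " state TEXT NOT NULL DEFAULT 'ready')")
        self.db.commit()

    def put(self, payload):
        self.db.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
        self.db.commit()

    def run_one(self, worker):
        """Pull the oldest ready job and run worker(payload).
        Returns False when the queue has no ready jobs."""
        row = self.db.execute(
            "SELECT id, payload, tries FROM jobs"
            " WHERE state = 'ready' ORDER BY id LIMIT 1").fetchone()
        if row is None:
            return False
        job_id, payload, tries = row
        try:
            worker(payload)
            # Success: the job is done, drop it.
            self.db.execute("DELETE FROM jobs WHERE id = ?", (job_id,))
        except Exception:
            # Failure: delete and re-insert so the job moves to the
            # tail of the queue with its try counter bumped, or park
            # it as 'dead' once it has used up its tries.
            self.db.execute("DELETE FROM jobs WHERE id = ?", (job_id,))
            if tries + 1 < MAX_TRIES:
                self.db.execute(
                    "INSERT INTO jobs (payload, tries) VALUES (?, ?)",
                    (payload, tries + 1))
            else:
                self.db.execute(
                    "INSERT INTO jobs (payload, tries, state)"
                    " VALUES (?, ?, 'dead')",
                    (payload, tries + 1))
        self.db.commit()
        return True
```

Because re-queuing moves the job to the tail rather than leaving it at
the head, a flaky job is retried behind the jobs that arrived after it,
which is part of what keeps one bad job from blocking everything else.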

I should write up all of the tips and tricks I used in a blog post some
day! I did some fun stuff with system monitoring to figure out where
bottlenecks or deadlocks were, which I've talked about a bit at
http://www.snell-pym.org.uk/archives/2012/12/27/logging-profiling-debugging-and-reporting-progress/
- but not so much about the fault-tolerance side.

ABS

-- 
Alaric Snell-Pym
http://www.snell-pym.org.uk/alaric/
