monit-dev

Re: monit 4.2 release?


From: Martin Pala
Subject: Re: monit 4.2 release?
Date: Thu, 12 Feb 2004 23:21:35 +0100
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.6) Gecko/20040122 Debian/1.6-1

Jan-Henrik Haukeland wrote:

Unless you plan to refactor the whole event engine, it shouldn't
require too much code to add an up-event in the existing code? If you
are planning to refactor, please let us know your ideas first. The
reason I say this is that I'm unsure if and what needs refactoring in
the event engine.


I'm trying to refactor the whole engine. I will try to describe the present idea (partially done). The following are just development ideas (not clean or fully functional code):



0.) new syntax:
---------------

An ELSE statement is added to the IF...THEN statement; it specifies an action using the common syntax. Dummy example:


#    if failed port 443 type tcpssl proto http with timeout 15 seconds
#       then restart
#       else alert

=> if the connection fails, monit will restart the service and send an alert. As soon as the service is up again, it will send an alert as well.

Note: the "else alert" part is implicit => there is no need to write it. The ELSE statement makes sense when you need a different action (for example EXEC) once the service is up again, like:

#    if failed port 22 proto ssh with timeout 15 seconds
#       then restart
#       else exec "/bin/sms_send 'ssh is up again'"



1.) New structures:
-------------------

The structures provide common objects for all services and service tests. The EventAction_T object replaces the repeated Command_T, event_handled, event_flag, etc. It contains the action specification (including an optional Command_T for execution) for the "up" and "down" states of a service or test. An event affects either the primary monitored service object or a particular test, depending on the event type. For example, NONEXIST means that the service (for example a process) does not exist at all, so the global s->action is taken. On the other hand, a CONNECTION event allows a test-specific s->portlist->action for the given test instance (each connection test has its own action). EventList is global.


/** Defines an event action object */
typedef struct myaction {
  int action;                    /**< Action to be done */
  Command_T exec;                /**< Optional command to be executed */
} *Action_T;


/** Defines event's up and down actions */
typedef struct myeventaction {
  Action_T down;                 /**< Action in the case of failure */
  Action_T up;                   /**< Action when the failure has cleared (up) */
} *EventAction_T;


/** Defines an event */
typedef struct myevent {
  int id;                        /**< The event identification */
  short state;                   /**< TRUE for failed, FALSE for passed event state */
  unsigned long counter;         /**< Event occurrence counter */
  EventAction_T action;          /**< Description of the event action */
  char *message;                 /**< Optional message describing the event */
} *Event_T;


/** Defines a pending events list */
typedef struct myeventlist {
Event_T entry; /**< Pending events list object */

  /** For internal use */
Event_T next; /**< next event in chain */
} *EventList_T;

...

/** Defines gid object */
typedef struct mygid {
  gid_t gid;                     /**< Owner's gid */
  int has_error;                 /**< TRUE if the service has a GID error */
  EventAction_T action;          /**< Description of the action upon event occurrence */
} *Gid_T;

...

/** Defines service data */
typedef struct myservice {
...
EventAction_T action; /**< Description of the action upon event occurrence */
...
EventList_T eventspending; /**< Pending events list */
...
} *Service_T;
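
To illustrate how an up/down action pair might be consulted, here is a minimal sketch. It is not monit code: the ACTION_* constants, the plain string standing in for Command_T, and the select_action() helper are all assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical action codes; the real constants are not shown in this mail. */
enum { ACTION_ALERT = 1, ACTION_RESTART, ACTION_EXEC };

/* Simplified copy of the proposed action object. */
typedef struct myaction {
  int action;          /* action to be done */
  const char *exec;    /* stand-in for Command_T: a command line, or NULL */
} *Action_T;

/* Simplified copy of the proposed up/down pair. */
typedef struct myeventaction {
  Action_T down;       /* action when the test fails */
  Action_T up;         /* action when the test passes again */
} *EventAction_T;

/* Hypothetical helper: pick the action matching the event state
 * (nonzero = failed, zero = passed). */
static Action_T select_action(EventAction_T ea, short failed) {
  assert(ea != NULL);
  return failed ? ea->down : ea->up;
}
```

With the ssh rule from section 0, "down" would hold the restart action and "up" the exec action, so a failed test selects restart and a passed test selects the sms_send command.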



2.) Event posting interface
---------------------------

prototype:

/**
 * Post a new Event
 * @param service The Service the event belongs to
 * @param id The event identification
 * @param state TRUE for failed, FALSE for passed event state
 * @param action Description of the event action
 * @param s Optional message describing the event
 */
void Event_post(Service_T service, long id, short state, EventAction_T action, char *s, ...);

Example usage:


static int check_directory(Service_T s) {

  struct stat stat_buf;
  char report[STRLEN]= {0};

  if(stat(s->path, &stat_buf) != 0) {
    Event_post(s, EVENT_NONEXIST, TRUE, s->action, "Event: directory '%s' doesn't exist", s->name);
    return FALSE;
  } else {
    Event_post(s, EVENT_NONEXIST, FALSE, s->action, "Event: directory '%s' exists", s->name);
  }

  if(!S_ISDIR(stat_buf.st_mode)) {
    Event_post(s, EVENT_INVALID, TRUE, s->action, "Event: '%s' is not a directory", s->name);
    return FALSE;
  } else {
    Event_post(s, EVENT_INVALID, FALSE, s->action, "Event: '%s' is a directory", s->name);
  }

  if(check_perm(s, stat_buf.st_mode, report)) {
    /* pass the report as an argument, never as the format string */
    Event_post(s, EVENT_PERMISSION, TRUE, s->perm->action, "%s", report);
  } else {
    Event_post(s, EVENT_PERMISSION, FALSE, s->perm->action, "Event: '%s' permission passed", s->name);
  }
...




3.) New events:
---------------

NONEXIST ... for a critical service failure which causes a restart by default: process not running, file doesn't exist, etc.

INVALID ... bad type (path is not a file in a file service check, etc.)



4.) State machine:
------------------

Each service has its own global events list. Events are either positive (in the case of failure) or negative (in the case that the test passed). The list is freed on configuration reload.

An event is identified by its EventAction_T pointer address, which is unique for each object that supports an action (=> can be a source of events). When an event occurs, monit will go through the list and try to find an event with the same source (action pointer address) and type. If no such event exists, monit will add it and set the event counter to 1 occurrence. If another event with the same "polarity" (positive or negative) arrives, monit will increment the counter.

When an event with the opposite polarity appears, monit will find the existing entry in the queue, change its polarity and reset the counter to 1.

The event handler is called for the event at the end of event posting. It sends an alert if counter == 1. The event contains the event identification and polarity => the event handler will do the required action specific to the given combination.



void Event_post(Service_T service, long id, short state, EventAction_T action, char *s, ...) {

  Event_T e = NULL;

  ASSERT(service);
  ASSERT(action);

  if(service->eventspending == NULL)
  {
    /* initialize event list and add first event */
    NEW(service->eventspending);
    NEW(e);
    e->id = id;
    e->state = state;
    e->counter = 1;
    e->action = action;
    if(s) {
      long l;
      va_list ap;

      va_start(ap, s);
      e->message = format(s, ap, &l);
      va_end(ap);
    }
    service->eventspending->entry = e;
  }
  else
  {
    /* in the case that event list exists it should contain at least one event already */
    e = service->eventspending->entry;

    /* Try to find the event with the same origin and type identification.
     * Each service and each test have its own custom actions object, so
     * we use actions object address for event source identification. */
    do
    {
      if(e->action == action && e->id == id)
      {
        if(e->state == state)
        {
          /* recurrent event */
          e->counter++;
          break;
        }
        else
        {
          /* event state changed */
          e->state = state;
          e->counter = 1;
          break;
        }
      }

      e = e->next;
    }
    while(e);

    if(!e)
    {
      /* event was not found in pending events list, we will add it */
      NEW(e);
      e->id = id;
      e->state = state;
      e->counter = 1;
      e->action = action;
      if(s) {
        long l;
        va_list ap;

        va_start(ap, s);
        e->message = format(s, ap, &l);
        va_end(ap);
      }
      e->next = service->eventspending->entry;
      service->eventspending->entry = e;
    }
  }

  /* FIXME: adjust the behaviour according to the new model */

  handle_event(e);

}
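
For comparison, the counter and polarity logic described in this section can be reduced to a small standalone sketch. This is not the draft above and not monit code: the Event type, event_record() and event_free_all() are hypothetical names, the message and handler call are left out, and the "next" pointer is placed directly in the event node as an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified event node. */
typedef struct event {
  const void *source;     /* address of the action object that raised it */
  int id;                 /* event type */
  short state;            /* 1 = failed, 0 = passed */
  unsigned long counter;  /* consecutive occurrences with this state */
  struct event *next;     /* next pending event */
} Event;

/* Record one occurrence. Returns the updated counter (1 means the event is
 * new or has just changed polarity), or 0 if memory allocation failed. */
static unsigned long event_record(Event **list, const void *source,
                                  int id, short state) {
  Event *e;

  assert(list != NULL);
  state = state ? 1 : 0;

  /* Look for an event with the same source and type. */
  for (e = *list; e != NULL; e = e->next) {
    if (e->source == source && e->id == id) {
      if (e->state == state) {
        e->counter++;        /* recurrent event */
      } else {
        e->state = state;    /* polarity changed */
        e->counter = 1;
      }
      return e->counter;
    }
  }

  /* Not found: add a new entry at the head of the list. */
  e = calloc(1, sizeof *e);
  if (e == NULL)
    return 0;
  e->source = source;
  e->id = id;
  e->state = state;
  e->counter = 1;
  e->next = *list;
  *list = e;
  return 1;
}

/* Free the whole list, e.g. on configuration reload. */
static void event_free_all(Event **list) {
  assert(list != NULL);
  while (*list != NULL) {
    Event *next = (*list)->next;
    free(*list);
    *list = next;
  }
}
```

A caller would send an alert whenever event_record() returns 1, which matches the "counter == 1" condition described above.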



SUMMARY:
--------

There are a lot of ideas and issues which I'm thinking about. The above model is not final; major modification is very possible. It will need to change to support the following rule:

IF x RESTARTS WITHIN y CYCLES THEN TIMEOUT


I prefer to prepare the model so that the following general syntax will be possible:

IF x event WITHIN y CYCLES THEN action


It should be possible to stack these actions and in this way support "hard" and "soft" error levels (action depending on the error condition level):

This is just an add-on - backward-compatible implicit rules are predefined, so if you do not specify anything, monit will behave as usual.
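
One possible way to evaluate an "IF x event WITHIN y CYCLES" condition is a per-rule history bitmask, sketched below. This is only an illustration under assumptions: the WindowRule type and window_rule_cycle() are hypothetical names, the window is limited to 32 cycles, and nothing here reflects how monit actually implements the rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical rule state for "IF count events WITHIN cycles CYCLES". */
typedef struct {
  uint32_t history;  /* bit i set = the event occurred i cycles ago */
  unsigned count;    /* x: number of occurrences required */
  unsigned cycles;   /* y: window size in cycles, 1..32 */
} WindowRule;

/* Call once per poll cycle. Returns 1 if the event occurred at least
 * `count` times within the last `cycles` cycles, otherwise 0. */
static int window_rule_cycle(WindowRule *r, int occurred) {
  uint32_t mask, bits;
  unsigned n = 0;

  assert(r != NULL && r->cycles >= 1 && r->cycles <= 32);

  /* Shift the history by one cycle and record the current result. */
  r->history = (r->history << 1) | (occurred ? 1u : 0u);

  /* Count the occurrences inside the window. */
  mask = (r->cycles == 32) ? UINT32_MAX : ((UINT32_C(1) << r->cycles) - 1u);
  for (bits = r->history & mask; bits != 0; bits >>= 1)
    n += bits & 1u;

  return n >= r->count;
}
```

Several such rules with different count and cycle values could be attached to the same event, which is one way to get stacked "soft" and "hard" levels.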




... a lot of work, insufficient time. I didn't want to spend much time on this now (so as not to freeze my project), so if you prefer a simpler solution, I'll be happy :)


Martin










