Resources status

Statuses indicate the condition of hosts and services. Each status has a precise meaning for the resource and is determined by monitoring the resource against user-defined thresholds.

Host status

The table below summarizes all the possible statuses for a host.

| Status | Description |
|---|---|
| UP | The host is available and reachable |
| DOWN | The host is unavailable |
| UNREACHABLE | The host is unreachable |

Service status

The table below summarizes all the possible statuses for a service.

| Status | Description |
|---|---|
| OK | The service presents no problem |
| WARNING | The service has reached the warning threshold |
| CRITICAL | The service has reached the critical threshold |
| UNKNOWN | The status of the service cannot be checked (e.g. SNMP agent down) |
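As an illustration of how thresholds map to the statuses in the table above, here is a minimal sketch. The `classify` function, its "higher is worse" comparison, and the numeric values (borrowed from the common monitoring-plugin exit-code convention) are assumptions made for this example, not something this page specifies.

```python
from enum import IntEnum
from typing import Optional


class ServiceStatus(IntEnum):
    # Values follow the common monitoring-plugin exit-code convention
    # (an assumption for this sketch).
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def classify(value: Optional[float], warning: float, critical: float) -> ServiceStatus:
    """Map a measured value to a status, assuming higher values are worse."""
    if value is None:
        # The check could not obtain a measurement (e.g. agent down).
        return ServiceStatus.UNKNOWN
    if value >= critical:
        return ServiceStatus.CRITICAL
    if value >= warning:
        return ServiceStatus.WARNING
    return ServiceStatus.OK
```

For example, with a warning threshold of 80 and a critical threshold of 90, a measured value of 85 would be classified as WARNING.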

Advanced statuses

In addition to the standard statuses, advanced statuses provide further information:

  • The PENDING status is displayed for a service or a host that has just been configured but has not yet been checked by the scheduler.
  • The UNREACHABLE status indicates that the host is located downstream (through a parent relationship) of a host with a DOWN status.
  • The FLAPPING status indicates that the status change percentage of the resource is very high. This percentage is calculated by the monitoring engine.
  • The ACKNOWLEDGED status indicates that a user has taken the incident on the service or host into account.
  • The DOWNTIME status indicates that the incident on the service or host occurred during a downtime period.
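Two of the statuses above can be illustrated with a short sketch. Both functions are hypothetical simplifications: `resolve_host_status` assumes that a non-responding host is UNREACHABLE when none of its parents is UP (and that the parent graph has no cycles), and `state_change_percentage` uses a plain unweighted ratio, whereas a real monitoring engine may weight recent changes differently.

```python
from typing import Dict, List, Sequence, Set


def resolve_host_status(host: str, responding: Set[str], parents: Dict[str, List[str]]) -> str:
    """Return UP, DOWN or UNREACHABLE for a host (simplified assumption)."""
    if host in responding:
        return "UP"
    host_parents = parents.get(host, [])
    if not host_parents:
        # No parent to blame: the host itself is considered DOWN.
        return "DOWN"
    statuses = [resolve_host_status(p, responding, parents) for p in host_parents]
    # If at least one parent is UP, the problem lies with the host itself.
    return "DOWN" if "UP" in statuses else "UNREACHABLE"


def state_change_percentage(history: Sequence[str]) -> float:
    """Unweighted share of consecutive checks whose status differs."""
    if len(history) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return 100.0 * changes / (len(history) - 1)
```

With `parents = {"switch": ["router"], "server": ["switch"]}` and no host responding, the router resolves to DOWN while the switch and the server resolve to UNREACHABLE.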

Status confirmation

A resource can be in one of two states:

  • SOFT: An incident has just been detected and has yet to be confirmed.
  • HARD: The status of the incident is confirmed. Once the status is confirmed, the notification process starts (sending of an email, SMS, etc.).

Explanation

An incident (Not-OK status) is confirmed once the configured number of check attempts has been reached. The configuration of a resource (host or service) requires a normal check interval, a number of attempts to confirm a Not-OK status, and a retry check interval used while the incident is unconfirmed. When an incident is first detected, the state is "SOFT" until it is confirmed as "HARD", which triggers the notification process.

Example:

A service has the following check settings:

  • Max check attempts: 3
  • Normal check interval: 5 minutes
  • Retry check interval: 1 minute
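With these settings, the delay between the first failed check and the HARD confirmation can be estimated with simple arithmetic. The helper below is a hypothetical sketch that assumes every retry fails and is scheduled exactly one retry interval after the previous check.

```python
def minutes_to_hard(max_check_attempts: int, retry_interval_min: float) -> float:
    """Minutes between the first Not-OK check and the HARD confirmation."""
    # The first failed check counts as attempt 1; each remaining attempt
    # runs one retry interval after the previous one.
    return max(max_check_attempts - 1, 0) * retry_interval_min


# Settings from the example: 3 attempts, 1-minute retry interval.
delay = minutes_to_hard(3, 1)
```

Here `delay` is 2 minutes: a problem first seen at t+5 would be confirmed at t+7.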

Let us imagine the following scenario:

| Time | Check attempt | Status | State | State change | Note |
|---|---|---|---|---|---|
| t+0 | 1/3 | OK | HARD | No | Initial state of the service |
| t+5 | 1/3 | CRITICAL | SOFT | Yes | First detection of a non-OK state. Event handlers execute. |
| t+6 | 2/3 | WARNING | SOFT | Yes | Service continues to be in a non-OK state. Event handlers execute. |
| t+7 | 3/3 | CRITICAL | HARD | Yes | Max check attempts has been reached, so service goes into a HARD state. Event handlers execute and a problem notification is sent out. Check # is reset to 1 immediately after this happens. |
| t+12 | 3/3 | WARNING | HARD | Yes | Service changes to a HARD WARNING state. Event handlers execute and a problem notification is sent out. |
| t+17 | 3/3 | WARNING | HARD | No | Service stabilizes in a HARD problem state. Depending on what the notification interval for the service is, another notification might be sent out. |
| t+22 | 1/3 | OK | HARD | Yes | Service experiences a HARD recovery. Event handlers execute and a recovery notification is sent out. |
| t+27 | 1/3 | OK | HARD | No | Service is still OK. |
| t+32 | 1/3 | UNKNOWN | SOFT | Yes | Service is detected as changing to a SOFT non-OK state. Event handlers execute. |
| t+33 | 2/3 | OK | SOFT | Yes | Service experiences a SOFT recovery. Event handlers execute, but notifications are not sent, as this wasn't a "real" problem. State type is set to HARD and check # is reset to 1 immediately after this happens. |
| t+34 | 1/3 | OK | HARD | No | Service stabilizes in an OK state. |
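The scenario above can be replayed with a small state machine. This is a simplified sketch, not the monitoring engine's actual implementation: the `ServiceState` class, the `replay` helper, and the notification labels are hypothetical names, and re-notifications based on the notification interval as well as event handlers are deliberately left out.

```python
from typing import List, Optional, Tuple

Result = Tuple[str, str, Optional[str]]  # (status, state type, notification)


class ServiceState:
    """Hypothetical, simplified model of SOFT/HARD confirmation."""

    def __init__(self, max_check_attempts: int = 3) -> None:
        self.max_check_attempts = max_check_attempts
        self.status = "OK"
        self.state_type = "HARD"
        self.attempt = 1

    def process(self, new_status: str) -> Result:
        notification: Optional[str] = None
        if new_status == "OK":
            if self.status == "OK":
                reported = "HARD"  # still OK, nothing to confirm
            elif self.state_type == "HARD":
                reported, notification = "HARD", "recovery"
            else:
                reported = "SOFT"  # soft recovery: no notification
            # After any OK result the state type is HARD and attempts reset.
            self.status, self.state_type, self.attempt = "OK", "HARD", 1
            return ("OK", reported, notification)

        if self.status == "OK":
            # First detection of a problem.
            self.attempt = 1
            self.state_type = "HARD" if self.max_check_attempts <= 1 else "SOFT"
            if self.state_type == "HARD":
                notification = "problem"
        elif self.state_type == "SOFT":
            # Problem persists: count one more attempt towards confirmation.
            self.attempt += 1
            if self.attempt >= self.max_check_attempts:
                self.state_type = "HARD"
                notification = "problem"
        elif new_status != self.status:
            # Already HARD, but the problem status itself changed.
            notification = "problem"
        self.status = new_status
        return (new_status, self.state_type, notification)


def replay(statuses: List[str], max_check_attempts: int = 3) -> List[Result]:
    service = ServiceState(max_check_attempts)
    return [service.process(s) for s in statuses]


# Check results from the table, in chronological order (t+0 to t+34).
SCENARIO = ["OK", "CRITICAL", "WARNING", "CRITICAL", "WARNING", "WARNING",
            "OK", "OK", "UNKNOWN", "OK", "OK"]
```

Calling `replay(SCENARIO)` yields one tuple per row of the table. Under the assumptions above, the state types and notifications follow the Status, State, and Note columns: problem notifications at t+7 and t+12, a recovery notification at t+22, and no notification for the SOFT recovery at t+33.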