Skip to main content
Version: 23.04

Possible statuses of a resource

Statuses show the availability of a host, and the availability or performance of a service. Each status has a precise meaning for the resource.

  • The statuses and states of a resource can be seen on page Resources Status. You can filter the page according to these statuses and to certain states.
  • Some statuses are determined according to user-defined thresholds.

Host status

The table below summarizes all the possible statuses for a host.

StatusDescription
UPThe host is available and reachable
DOWNThe host is unavailable
UNREACHABLEThe host is unreachable: it depends on a host whose status is DOWN
PENDINGThe host has just been created and has not been checked yet by the monitoring engine

Service status

The table below summarizes all the possible statuses for a service.

StatusDescription
OKThe service presents no problem
WARNINGThe service has reached the warning threshold
CRITICALThe service has reached the critical threshold
UNKNOWNThe status of the service cannot be checked (e.g.: SNMP agent down, etc.)
PENDINGThe service has just been created and has not been checked yet by the monitoring engine

States

In addition to their status, resources can be in several states:

  • Acknowledged: indicates that the incident on the service or on the host has been taken into account by a user. Acknowledged resources have a yellow background.

  • In downtime: indicates that notifications are temporarily stopped. A downtime can be planned in advance to avoid receiving alerts during maintenance periods, or be set following an incident. Resources in downtime have a purple background.

  • Flapping: indicates that the status change percentage of the resource is very high. This percentage is obtained from calculations performed by the network monitoring engine. Flapping resources have the following icon in their Details panel: image

Status types

The status of a resource can have one of these 2 types:

  • SOFT: Signifies that an incident has just been detected and that it has to be confirmed.
  • HARD: Signifies that the status of the incident is confirmed. Once the status is confirmed, the notification process is engaged (sending of an email, SMS, etc.).

You can filter the view on the Resources Status page according to the status type.

Explanation

An incident (Not-OK status) is confirmed as soon as the number of validation attempts has reached its end. The configuration of a resource (host or service) requires a regular check interval, a number of attempts to confirm a Not-OK status and an irregular check interval. As soon as the first incident is detected, the state is "SOFT" until its confirmation into "HARD", triggering the notification process.

Example:

A service has the following check settings:

  • Max check attempts: 3
  • Normal check interval: 5 minutes
  • Retry check interval: 1 minute

Let us imagine the following scenario:

image

TimeCheck attemptStatusStatus typeState changeNote
t+01/3OKHARDNoInitial state of the service
t+51/3CRITICALSOFTYesFirst detection of a non-OK state. Event handlers execute.
t+62/3WARNINGSOFTYesService continues to be in a non-OK state. Event handlers execute.
t+73/3CRITICALHARDYesMax check attempts has been reached, so service goes into a HARD state. Event handlers execute and a problem notification is sent out. Check # is reset to 1 immediately after this happens.
t+123/3WARNINGHARDYesService changes to a HARD WARNING state. Event handlers execute and a problem notification is sent out.
t+173/3WARNINGHARDNoService stabilizes in a HARD problem state. Depending on what the notification interval for the service is, another notification might be sent out.
t+221/3OKHARDYesService experiences a HARD recovery. Event handlers execute and a recovery notification is sent out.
t+271/3OKHARDNoService is still OK.
t+321/3UNKNOWNSOFTYesService is detected as changing to a SOFT non-OK state. Event handlers execute.
t+332/3OKSOFTYesService experiences a SOFT recovery. Event handlers execute, but notification are not sent, as this wasn't a "real" problem. State type is set HARD and check # is reset to 1 immediately after this happens.
t+341/3OKHARDNoService stabilizes in an OK state.