Use cases

Centreon Log Management (CLM) enables you to detect and resolve a wide variety of issues in an IT system, ranging from minor errors to major incidents. Many typical CLM use cases focus on root cause analysis. Here are a few concrete examples of what CLM can help you detect by analyzing missing logs, unexpected log types, or unusual log volumes.

Integration and communication issues between services​

Microservices or API failures​

If a service interacting with other services or APIs does not respond or fails, this will often be recorded in the logs (e.g., HTTP errors such as 500, 503, or 404).

  • Alert rule (count): Track failed requests from a specific service (HTTP 4xx and 5xx) to detect service or API failures quickly. Example of a query that could be used in such an alert rule:

    service_name:"my-service" AND attributes.http.response.status_code >= 400 AND attributes.http.response.status_code < 600
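
To make the status-code range in this query concrete, the following Python sketch applies the same condition to a few in-memory log records. It only illustrates the matching logic and is not CLM code: the record layout (flat dictionaries keyed by the attribute names from the query), the sample records, and the function name `count_failed_requests` are assumptions made for the example.

```python
from typing import Iterable, Mapping

STATUS_KEY = "attributes.http.response.status_code"


def count_failed_requests(logs: Iterable[Mapping[str, object]], service_name: str) -> int:
    """Count records for one service whose HTTP status is in the 4xx or 5xx range."""
    failed = 0
    for record in logs:
        status = record.get(STATUS_KEY)
        if (
            record.get("service_name") == service_name
            and isinstance(status, int)
            and 400 <= status < 600  # same bounds as the query: >= 400 AND < 600
        ):
            failed += 1
    return failed


# Hypothetical sample records, for illustration only.
sample_logs = [
    {"service_name": "my-service", STATUS_KEY: 200},
    {"service_name": "my-service", STATUS_KEY: 404},
    {"service_name": "my-service", STATUS_KEY: 503},
    {"service_name": "other-service", STATUS_KEY: 500},
]

failed_count = count_failed_requests(sample_logs, "my-service")
```

With these sample records, two of the three `my-service` entries fall in the 4xx/5xx range; a count alert rule would compare that number with the threshold you choose.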

Data inconsistency​

If expected data is not received or sent correctly between services or components, this can generate error or conflict logs. The query to use in your alert depends on what your service or component returns: for instance, "missing argument" or "bad deserialization of json" in the body.message attribute, or the corresponding HTTP code.

Synchronization issues​

Errors in the processing of message queues or asynchronous events can be identified by CLM. You can detect these by filtering the log explorer on the service name (service_name:"synchronization service").

Server or infrastructure issues​

Server failures​

If a server experiences a hardware failure (such as a hard drive issue or overload), this will typically appear in the system logs. Root cause analysis can be performed in the log explorer.

Application errors​

Code issues​

Exceptions or errors in an application's code, such as null pointer exceptions, syntax errors, or logic errors, can be easily identified in the logs.

  • Alert rule (count): Trigger an alert if there are X or more exception messages in 5 minutes, indicating a potential problem in the application. Example of a query that could be used in such an alert rule:

    service_name:"my-app" AND severity_number>"17" AND body.message:"this is an exception"
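
The "X or more messages in 5 minutes" condition can be pictured as a sliding-window count. The Python sketch below is a simplified illustration under stated assumptions, not a description of how CLM evaluates alert rules internally: the function name `threshold_reached`, the sample timestamps, and the threshold of 3 are all hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Iterable


def threshold_reached(
    timestamps: Iterable[datetime],
    threshold: int,
    window: timedelta = timedelta(minutes=5),
) -> bool:
    """Return True if any window of the given length holds at least `threshold` events."""
    recent: deque = deque()
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events that are a full window (or more) older than the newest one.
        while ts - recent[0] >= window:
            recent.popleft()
        if len(recent) >= threshold:
            return True
    return False


start = datetime(2024, 1, 1, 12, 0, 0)
# Hypothetical exception timestamps: three within two minutes, one much later.
exception_times = [
    start,
    start + timedelta(minutes=1),
    start + timedelta(minutes=2),
    start + timedelta(minutes=30),
]

alert = threshold_reached(exception_times, threshold=3)
```

Here the first three exceptions fall inside one 5-minute window, so a threshold of 3 is reached, while a threshold of 4 would not be.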

Database connection failure​

If an application fails to connect to a database, CLM can report relevant error messages. You would typically detect these in the log explorer.

Configuration errors​

A configuration error in a settings file (such as an incorrect port, an invalid API key, or a missing setting) can be identified from the error messages the application logs at startup or at runtime.

Compatibility or update issues​

Problems after an update​

After deploying or updating an application, errors or unexpected behavior may appear in the logs.

A typical example is deploying a configuration change to a small set of test machines. You would monitor the logs in the log explorer in real time to spot any issues, filtering by the service name or namespace you are updating.

Version incompatibility​

Conflicts between different versions of software, tools, or libraries can be identified in the error logs. Your query will depend on the type of messages returned by your application.

Automation and batch issues​

Failed batch processes or automated jobs​

If an automated job or batch script fails, CLM can display the associated errors.

Examples:

  • A nightly batch updates and synchronizes financial data across systems: all operations must succeed. Create an alert rule that detects every single failure.
  • A service copies files every night: this is a less critical case, so failures can be detected in the morning by filtering the log explorer.

Scheduling issues​

For example, if a cron job fails to run correctly at a given time, this may be reported in the logs.

Compliance issues​

Violation of security rules or policies​

If actions or login attempts do not comply with security or compliance rules (e.g., attempts to access without strong authentication), they can be detected.

  • Create an alert rule on the number of login attempts.

Availability and scalability issues​

Decreased ability to respond to requests​

Logs can also help detect a lack of resources or overload that prevents services from handling a high volume of requests.

  • Create an alert rule on the corresponding HTTP code.

Security incidents​

Failed login attempts or brute force attacks​

If a user or attacker repeatedly tries to log in to a system without success, this generates logs that can be analyzed to detect brute force attack attempts.

  • Alert rule (count): Trigger an alert when there are more than X failed SSH login attempts within a given time window. Example of a query that could be used in such an alert rule:

    event.type:"ssh_login" AND attributes.http.response.status_code >= 400 AND attributes.http.response.status_code < 500

Intrusions or unauthorized access​

Logs can reveal attempts to gain unauthorized access to sensitive systems or applications (e.g., alerts for permission changes, connections to a server without a valid key, etc.).

  • Alert rule (count): Trigger a CRITICAL alert event if the number of logs recording access attempts to a specific GitHub repository is higher than 0 for users that do not belong to my_github_group. Example of a query that could be used in such an alert rule:

    repo:"my-github-repo" AND attributes.http.response.status_code IS NOT NULL AND NOT user_groups:"my_github_group"
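
As a rough illustration of what this query selects, the Python sketch below filters in-memory records with the same three conditions and raises a CRITICAL state as soon as one record matches. It is not CLM code. The record layout, the `user` field, the sample data, the function name `unauthorized_access_logs`, and the reading of `user_groups` as a list that either contains the group or not are all assumptions.

```python
from typing import Iterable, List, Mapping

STATUS_KEY = "attributes.http.response.status_code"


def unauthorized_access_logs(
    logs: Iterable[Mapping[str, object]], repo: str, allowed_group: str
) -> List[Mapping[str, object]]:
    """Return records for `repo` that carry a status code and whose user is outside `allowed_group`."""
    return [
        record
        for record in logs
        if record.get("repo") == repo
        and record.get(STATUS_KEY) is not None  # IS NOT NULL
        and allowed_group not in record.get("user_groups", [])  # NOT user_groups:"..."
    ]


# Hypothetical sample records, for illustration only.
sample_logs = [
    {"repo": "my-github-repo", "user": "alice", "user_groups": ["my_github_group"], STATUS_KEY: 200},
    {"repo": "my-github-repo", "user": "mallory", "user_groups": ["contractors"], STATUS_KEY: 403},
    {"repo": "other-repo", "user": "mallory", "user_groups": ["contractors"], STATUS_KEY: 200},
]

matches = unauthorized_access_logs(sample_logs, "my-github-repo", "my_github_group")
# The rule described above is CRITICAL as soon as the count is higher than 0.
severity = "CRITICAL" if len(matches) > 0 else "OK"
```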

Network issues​

Timeout errors​

Connection or communication failures between services (e.g., a server that does not respond within the expected time frame) can be detected in the logs. The proportion matters more than the raw count: 10 timeouts out of 100 requests (10%) suggest a problem, whereas 10 timeouts out of 1,000,000 requests (0.001%) usually do not.

  • Alert rule (ratio): Trigger an alert when the ratio of HTTP 408 responses (timeouts) exceeds X%. This indicates that too many requests are timing out. To calculate this ratio, you could use queries like these:

    • Query to divide:

      service_name:"my-service" AND attributes.http.response.status_code = 408
    • Query to divide by:

      service_name:"my-service" AND attributes.http.response.status_code IS NOT NULL
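
The arithmetic behind a ratio rule is the number of HTTP 408 responses divided by the number of all responses, expressed as a percentage. The Python sketch below works through the two cases mentioned above (10 timeouts out of 100 requests, and 10 out of 1,000,000). The function names and the 5% threshold are hypothetical and serve only to illustrate the calculation.

```python
def timeout_ratio_percent(timeouts: int, total_requests: int) -> float:
    """Share of requests that timed out, as a percentage (0.0 when there is no traffic)."""
    if total_requests == 0:
        return 0.0
    return 100.0 * timeouts / total_requests


def ratio_alert(timeouts: int, total_requests: int, threshold_percent: float) -> bool:
    """Return True when the timeout ratio exceeds the threshold."""
    return timeout_ratio_percent(timeouts, total_requests) > threshold_percent


THRESHOLD_PERCENT = 5.0  # hypothetical value for X

small_service = timeout_ratio_percent(10, 100)        # 10 timeouts out of 100 requests
large_service = timeout_ratio_percent(10, 1_000_000)  # 10 timeouts out of 1,000,000 requests

small_alert = ratio_alert(10, 100, THRESHOLD_PERCENT)
large_alert = ratio_alert(10, 1_000_000, THRESHOLD_PERCENT)
```

With a 5% threshold, the first case (10%) would trigger the alert and the second (0.001%) would not, even though both have the same number of timeouts.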

Examples of questions you can find answers to​

  • Which service is generating the most errors today? Filter the timeline to today, then filter by severity_number>17. Stack the graph by service name.
  • Which services have changed their behavior after deployment? In the log explorer, filter by the service name or namespace, and check for any errors.