Use cases
Centreon Log Management enables you to detect and resolve a wide variety of issues in an IT system, ranging from minor errors to major incidents. Many typical CLM use cases focus on root cause analysis. Here are a few concrete examples of what CLM can help you detect by analyzing missing logs, unexpected log types, or unusual log volumes.
Integration and communication issues between services
Microservices or API failures
If a service that interacts with other services or APIs does not respond or fails, this will often be recorded in the logs (e.g., HTTP errors such as 500, 503, or 404).
- Alert rule (count): Track failed requests from a specific service (HTTP 4xx and 5xx) to detect service or API failures quickly. Example of a query that could be used in such an alert rule:
service_name:"my-service" AND attributes.http.response.status_code >= 400 AND attributes.http.response.status_code < 600
Data inconsistency
If expected data is not received or sent correctly between different services or components, this can generate error or conflict logs. The query you will use in your alert depends on what your service or component returns (for instance, "missing argument" or "bad deserialization of json" in the body.message attribute, or the corresponding HTTP code).
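For instance, a rule could match known error strings in body.message. The service name and error messages below are placeholders; adapt them to what your components actually log:
service_name:"payment-service" AND (body.message:"missing argument" OR body.message:"bad deserialization of json")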
Synchronization issues
Errors in the processing of message queues or asynchronous events can be identified by CLM. You can detect these by filtering the log explorer on the service name (service_name:"synchronization service").
Server or infrastructure issues
Server failures
If a server experiences a hardware failure (such as a hard drive issue or overload), this will typically appear in the system logs. Root cause analysis can then be performed in the log explorer.
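For example, you could narrow the log explorer to high-severity system logs from the affected machine. The host_name attribute below is an assumption; use whichever attribute identifies the host in your log schema:
host_name:"my-server" AND severity_number>17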
Application errors
Code issues
Exceptions or errors in an application's code, such as null pointer exceptions, syntax errors, or logic errors, can be easily identified in the logs.
- Alert rule (count): Trigger an alert if there are X or more exception messages in 5 minutes, indicating a potential problem in the application. Example of a query that could be used in such an alert rule:
service_name:"my-app" AND severity_number>"17" AND body.message:"this is an exception"
Database connection failure
If an application fails to connect to a database, CLM can report the relevant error messages. You would typically detect these in the log explorer.
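A query along these lines could surface connection errors. The service name and message text are placeholders, since the exact wording depends on your database driver:
service_name:"my-app" AND body.message:"connection refused"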
Configuration errors
A configuration error in a settings file (such as an incorrect port, a wrong API key, or a missing setting) will often cause errors at startup that appear in the logs.
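For example, you could alert on high-severity logs mentioning configuration problems. The service name and message text below are assumptions to adapt to what your application logs:
service_name:"my-app" AND severity_number>17 AND body.message:"invalid configuration"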
Compatibility or update issues
Problems after an update
After deploying or updating an application, errors or unexpected behavior may appear in the logs.
A typical example is deploying a configuration change to a small set of test machines. You would monitor the logs in real time in the log explorer to spot any issues, filtering by the service name or namespace you are updating.
Version incompatibility
Conflicts between different versions of software, tools, or libraries can be identified in the error logs. Your query will depend on the type of messages returned by your application.
Automation and batch issues
Failed batch processes or automated jobs
If an automated job or batch script fails, CLM can display the associated errors.
Examples:
- A nightly batch updates and synchronizes financial data across systems: all operations must succeed. Create an alert rule that detects every single failure.
- A service copies files every night: this is a less critical case, and failures can be detected in the morning by filtering the log explorer.
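For the first case, an alert rule could fire on any error-level log from the batch service. The service name below is a placeholder:
service_name:"financial-batch" AND severity_number>17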
Scheduling issues
For example, if a cron job fails to run correctly at a given time, this may be reported in the logs.
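If the scheduler's logs are collected, a filter along these lines could surface failed runs. Both the service name and message text are assumptions to adapt to your environment:
service_name:"cron" AND body.message:"failed"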
Compliance issues
Violation of security rules or policies
If actions or login attempts do not comply with security or compliance rules (e.g., access attempts without strong authentication), they can be detected.
- Create an alert rule on the number of login attempts.
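Such a rule could count unauthenticated login events and trigger above a threshold. The event.type value below is an assumption; adapt it to how your authentication system tags its logs:
event.type:"login" AND attributes.http.response.status_code = 401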
Availability and scalability issues
Decreased ability to respond to requests
Logs can also help detect a lack of resources or an overload that prevents services from handling a high volume of requests.
- Create an alert rule on the corresponding HTTP code.
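Overload typically shows up as HTTP 503 (Service Unavailable) or 429 (Too Many Requests) responses. A rule could count those; the service name below is a placeholder:
service_name:"my-service" AND (attributes.http.response.status_code = 503 OR attributes.http.response.status_code = 429)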
Security incidents
Failed login attempts or brute force attacks
If a user or attacker repeatedly tries to log in to a system without success, this generates logs that can be analyzed to detect brute force attacks.
- Alert rule (count): Trigger an alert when there are more than X failed SSH login attempts within a given time window. Example of a query that could be used in such an alert rule:
event.type:"ssh_login" AND attributes.http.response.status_code >= 400 AND attributes.http.response.status_code < 500
Intrusions or unauthorized access
Logs can reveal attempts to gain unauthorized access to sensitive systems or applications (e.g., alerts on permission changes, connections to a server without a valid key, etc.).
- Alert rule (count): Trigger a CRITICAL alert event if the number of logs recording access attempts to a specific GitHub repository by users who do not belong to my_github_group is higher than 0. Example of a query that could be used in such an alert rule:
repo:"my-github-repo" AND attributes.http.response.status_code IS NOT NULL AND NOT user_groups:"my_github_group"
Network issues
Timeout errors
Connection or communication failures between services (e.g., a server that does not respond within the expected time frame) can be detected in the logs. A ratio puts the raw count in context: 10 timeouts out of 100 requests suggests a problem, whereas 10 timeouts out of 1,000,000 requests does not.
- Alert rule (ratio): Trigger an alert when the ratio of HTTP 408 (Request Timeout) responses exceeds X%. This indicates that too many requests are timing out. To calculate this ratio, you could use queries like these:
- Query to divide:
service:"my-service" AND attributes.http.response.status_code = 408
- Query to divide by:
service:"my-service" AND attributes.http.response.status_code IS NOT NULL
Examples of questions you can find answers to
- Which service is generating the most errors today? Filter the timeline on today, then filter by severity_number>17. Stack the graph by service name.
- Which services have changed their behavior after a deployment? In the log explorer, filter by the service name or namespace, and check for any errors.