Use cases
Centreon Log Management enables you to detect and resolve a wide variety of issues in an IT system, ranging from minor errors to major incidents. Many typical CLM use cases focus on root cause analysis. Here are a few concrete examples of what CLM can help you detect by analyzing missing logs, unexpected log types, or unusual log volumes.
Integration and communication issues between services
Microservices or API failures
If a service that interacts with other services or APIs does not respond or fails, this will often be recorded in the logs (e.g., HTTP errors such as 500, 503, or 404).
- Alert rule (count): Track failed requests from a specific service (HTTP 4xx and 5xx) to detect service or API failures quickly. Example of a query that could be used in such an alert rule:
service_name:"my-service" AND attributes.http.response.status_code >= 400 AND attributes.http.response.status_code < 600
Data inconsistency
If expected data is not received or sent correctly between different services or components, this can generate error or conflict logs. The query you will use in your alert depends on what your service or component returns (for instance, "missing argument" or "bad deserialization of json" in the body.message attribute, or the corresponding HTTP code).
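For instance, a rule could match known error strings in body.message. The service name and error messages below are placeholders; adapt them to what your components actually log:
service_name:"payment-service" AND (body.message:"missing argument" OR body.message:"bad deserialization of json")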
Synchronization issues
Errors in the processing of message queues or asynchronous events can be identified by CLM. You can detect these by filtering the log explorer on the service name (service_name:"synchronization service").
Server or infrastructure issues
Server failures
If a server experiences a hardware failure (such as a hard drive issue or overload), this will typically appear in the system logs. Root cause analysis can then be performed in the log explorer.
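For example, you could narrow the log explorer to high-severity system logs from the affected machine. The host_name attribute below is an assumption; use whichever attribute identifies the host in your log schema:
host_name:"my-server" AND severity_number>17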
Application errors
Code issues
Exceptions or errors in an application's code, such as null pointer exceptions, syntax errors, or logic errors, can be easily identified in the logs.
- Alert rule (count): Trigger an alert if there are X or more exception messages in 5 minutes, indicating a potential problem in the application. Example of a query that could be used in such an alert rule:
service_name:"my-app" AND severity_number>"17" AND body.message:"this is an exception"
Database connection failure
If an application fails to connect to a database, CLM can report the relevant error messages. You would typically detect these in the log explorer.
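A query along these lines could surface connection errors. The service name and message text are placeholders, since the exact wording depends on your database driver:
service_name:"my-app" AND body.message:"connection refused"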
Configuration errors
A configuration error in a settings file (such as an incorrect port, a wrong API key, or a missing setting) will often cause errors at startup that appear in the logs.
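For example, you could alert on high-severity logs mentioning configuration problems. The service name and message text below are assumptions to adapt to what your application logs:
service_name:"my-app" AND severity_number>17 AND body.message:"invalid configuration"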
Compatibility or update issues
Problems after an update
After deploying or updating an application, errors or unexpected behavior may appear in the logs.
A typical example is deploying a configuration change to a small set of test machines. You would monitor the logs in real time in the log explorer to spot any issues, filtering by the service name or namespace you are updating.
Version incompatibility
Conflicts between different versions of software, tools, or libraries can be identified in the error logs. Your query will depend on the type of messages returned by your application.
Automation and batch issues
Failed batch processes or automated jobs
If an automated job or batch script fails, CLM can display the associated errors.
Examples:
- A nightly batch updates and synchronizes financial data across systems: all operations must succeed. Create an alert rule that detects every single failure.
- A service copies files every night: this is a less critical case, and failures can be detected in the morning by filtering the log explorer.
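For the first case, an alert rule could fire on any error-level log from the batch service. The service name below is a placeholder:
service_name:"financial-batch" AND severity_number>17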
Scheduling issues
For example, if a cron job fails to run correctly at a given time, this may be reported in the logs.
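If the scheduler's logs are collected, a filter along these lines could surface failed runs. Both the service name and message text are assumptions to adapt to your environment:
service_name:"cron" AND body.message:"failed"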
Compliance issues
Violation of security rules or policies
If actions or login attempts do not comply with security or compliance rules (e.g., access attempts without strong authentication), they can be detected.
- Create an alert rule on the number of login attempts.
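Such a rule could count unauthenticated login events and trigger above a threshold. The event.type value below is an assumption; adapt it to how your authentication system tags its logs:
event.type:"login" AND attributes.http.response.status_code = 401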
Availability and scalability issues
Decreased ability to respond to requests
Logs can also help detect a lack of resources or an overload that prevents services from handling a high volume of requests.
- Create an alert rule on the corresponding HTTP code.
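Overload typically shows up as HTTP 503 (Service Unavailable) or 429 (Too Many Requests) responses. A rule could count those; the service name below is a placeholder:
service_name:"my-service" AND (attributes.http.response.status_code = 503 OR attributes.http.response.status_code = 429)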
Security incidents
Failed login attempts or brute force attacks
If a user or attacker repeatedly tries to log in to a system without success, this generates logs that can be analyzed to detect brute force attacks.
- Alert rule (count): Trigger an alert when there are more than X failed SSH login attempts within a given time window. Example of a query that could be used in such an alert rule:
event.type:"ssh_login" AND attributes.http.response.status_code >= 400 AND attributes.http.response.status_code < 500
Intrusions or unauthorized access
Logs can reveal attempts to gain unauthorized access to sensitive systems or applications (e.g., alerts on permission changes, connections to a server without a valid key, etc.).
- Alert rule (count): Trigger a CRITICAL alert event if the number of logs recording access attempts to a specific GitHub repository by users who do not belong to my_github_group is higher than 0. Example of a query that could be used in such an alert rule:
repo:"my-github-repo" AND attributes.http.response.status_code IS NOT NULL AND NOT user_groups:"my_github_group"
Network issues
Timeout errors
Connection or communication failures between services (e.g., a server that does not respond within the expected time frame) can be detected in the logs. A ratio puts the raw count in context: 10 timeouts out of 100 requests suggests a problem, whereas 10 timeouts out of 1,000,000 requests does not.
- Alert rule (ratio): Trigger an alert when the ratio of HTTP 408 (Request Timeout) responses exceeds X%. This indicates that too many requests are timing out. To calculate this ratio, you could use queries like these:
- Query to divide:
service:"my-service" AND attributes.http.response.status_code = 408
- Query to divide by:
service:"my-service" AND attributes.http.response.status_code IS NOT NULL
Examples of questions you can find answers to
- Which service is generating the most errors today? Filter the timeline on today, then filter by severity_number>17. Stack the graph by service name.
- Which services have changed their behavior after a deployment? In the log explorer, filter by the service name or namespace, and check for any errors.