One or more agent servers have not been responding for [n] minutes
Description
This alert is triggered when the Server Manager agent is unable to contact one or more Agent Servers for the configured period of time. This indicates that the Agent Server may be unresponsive or may have failed to check in.
Resolution Guidance
Impact When Active
- The Agent Server is likely not functioning correctly.
- All Agents on the server may stop responding and will not perform their assigned work.
- Agent check-in times will stop updating and jobs may remain unprocessed.
- If left unresolved, this may delay workflows or cause job queues to grow.
How To Resolve
- Log in to the host referenced by the alert.
- Use the Relativity Service account credentials to access the back end of the Agent Server.
- Open the Services control panel on the server.
- Locate this service: kCura EDDS Agent Manager
- If the service is not running:
  - Right-click on the service name and choose Start.
- If already running but not responding, right-click and select Stop, then Start to restart it.
- If the issue persists after restarting services:
  - Open Event Viewer on the Agent Server.
  - Review Windows Logs | Application and System logs for any error messages related to the failing services.
  - Use the information to identify underlying issues preventing the services from starting or functioning properly.
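The manual steps above can also be sketched as a small script. This is a minimal sketch, not a supported Relativity tool. It assumes Python is available on the Agent Server, that the script runs from an elevated prompt, and that the service is registered under the display name given above. The function names (`plan_actions`, `build_command`, `build_event_query`, `query_status`, `restart_if_needed`) are hypothetical, and whether the service counts as "responding" is left to the operator, since the steps do not define a programmatic check.

```python
import subprocess

SERVICE_DISPLAY_NAME = "kCura EDDS Agent Manager"  # service named in the steps above


def plan_actions(status: str, responding: bool = True) -> list[str]:
    """Map the observed service state to the manual steps described above."""
    state = status.strip().lower()
    if state == "stopped":
        return ["start"]          # not running: Start
    if state == "running" and not responding:
        return ["stop", "start"]  # running but not responding: Stop, then Start
    return []                     # healthy or transitional state: take no action


def build_command(action: str, display_name: str = SERVICE_DISPLAY_NAME) -> list[str]:
    """Build a `net start` / `net stop` command for the service display name."""
    if action not in ("start", "stop"):
        raise ValueError(f"unsupported action: {action}")
    return ["net", action, display_name]


def build_event_query(log_name: str, max_events: int = 20) -> list[str]:
    """Build a `wevtutil` query for the newest error-level events in one log."""
    return [
        "wevtutil", "qe", log_name,
        "/q:*[System[(Level=2)]]",  # Level 2 = Error
        f"/c:{max_events}",         # cap the number of events returned
        "/rd:true",                 # newest events first
        "/f:text",                  # plain-text output
    ]


def query_status(display_name: str = SERVICE_DISPLAY_NAME) -> str:
    """Ask PowerShell for the current service status (Windows only)."""
    ps = f"(Get-Service -DisplayName '{display_name}').Status"
    result = subprocess.run(
        ["powershell", "-NoProfile", "-Command", ps],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def restart_if_needed(responding: bool) -> list[str]:
    """Run the planned actions on the Agent Server and return what was done."""
    actions = plan_actions(query_status(), responding)
    for action in actions:
        subprocess.run(build_command(action), check=True)
    return actions
```

On the Agent Server, calling `restart_if_needed(responding=False)` would stop and start a running service, or only start a stopped one. If the problem persists, `subprocess.run(build_event_query("Application"))` and the same call with `"System"` print recent error events from the two logs mentioned above. Keeping the decision logic in `plan_actions` separate from the calls that touch Windows makes the logic easy to review before anything is restarted.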
Alert Details
Alert Condition Details
| Name | Value | Description |
|---|---|---|
| Rule Type | Elastic Query | Tracks check-in status of Agent Servers |
| Data View | metrics-* | |
| Filter Query | relsvr.agent.disabled : 0 and labels.relsvr_agent_status : "not responding" | Matches enabled Agents whose status is not responding |
| Group | Count | Number of Agents not responding |
| Threshold | >= 1 | Triggered when one or more Agents are not responsive |
| Time Window | 1 min | Evaluates data from the last 1 minute |
| Frequency | 30 sec | Alert condition checked every 30 seconds |
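To make the rule in the table above concrete, the following sketch evaluates the same condition over a list of in-memory documents. It is an illustration only and does not reproduce how the Elastic rule is actually executed. It assumes each document is a flat dictionary keyed by the field names in the table, and that each document carries an `@timestamp` value already parsed into a timezone-aware `datetime` (the timestamp field name is an assumption, as are the function names).

```python
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=1)  # Time Window: 1 min
THRESHOLD = 1                  # Threshold: >= 1


def matches_filter(doc: dict) -> bool:
    """Mirror the Filter Query: Agent is enabled and its status is 'not responding'."""
    return (
        doc.get("relsvr.agent.disabled") == 0
        and doc.get("labels.relsvr_agent_status") == "not responding"
    )


def alert_fires(docs: list[dict], now: datetime) -> bool:
    """Count matching documents inside the time window and compare to the threshold."""
    window_start = now - WINDOW
    count = sum(
        1
        for doc in docs
        if window_start <= doc["@timestamp"] <= now and matches_filter(doc)
    )
    return count >= THRESHOLD
```

With the 30-second frequency from the table, a scheduler would call `alert_fires(docs, datetime.now(timezone.utc))` every 30 seconds. A disabled Agent (`relsvr.agent.disabled` equal to 1) never contributes to the count, even when its status is "not responding".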
Alert Metric Details
Metric Name: labels.relsvr_agent_status
Metric Description: Indicates the agent status. Possible values are Running, Inactive, and Not Responding.
Metric Attributes:
| Attribute Name | Description | Value |
|---|---|---|
| host.name | Hostname of the affected Agent Server | |
| labels.agent_name | Relativity Agent Name | |
| labels.agent_type_name | Relativity Agent Type | |
| labels.application_name | Application Name | Environment Administration & Operations |
| labels.exception_message | Any exception message on Agent | |
| labels.message | Message describes the issue | Agent Manager is not responding. |
| labels.name | Name of metric | Agent Disabled |
| labels.relsvr_artifact_id | Relativity Agent Artifact Id | |
| labels.relsvr_subsystem | Agent Name | |
| labels.relsvr_system | System name | Agents |
| labels.relsvr_agent_status | The current status of the agent | not responding |
| labels.relsvr_agent_type | The name of the agent type of the stale agent | |
| labels.relsvr_resource_server_name | The name of the server | |
| relsvr.agent.disabled | Indicates whether the agent is disabled | If 1, the agent is disabled |
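When several Agents on the same server stop responding, the attributes above can be used to see which hosts are affected. The sketch below groups matching documents by `host.name` and lists the Agent names on each host. It assumes the documents are available as flat dictionaries keyed by the attribute names in the table. The function name `affected_agents_by_host` and the `"unknown"` fallback for missing values are hypothetical.

```python
from collections import defaultdict


def affected_agents_by_host(docs: list[dict]) -> dict[str, list[str]]:
    """Group enabled, non-responding Agents by the server that hosts them."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for doc in docs:
        if doc.get("labels.relsvr_agent_status") != "not responding":
            continue  # only Agents flagged by the alert
        if doc.get("relsvr.agent.disabled") != 0:
            continue  # disabled Agents are excluded by the filter query
        host = doc.get("host.name", "unknown")
        grouped[host].append(doc.get("labels.agent_name", "unknown"))
    # Sort names so the summary is stable from run to run.
    return {host: sorted(names) for host, names in grouped.items()}
```

A host whose Agents all appear in the result is a likely candidate for the service restart described under How To Resolve.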