One or more agent servers have not been responding for n minutes

Description

The alert is active when the 'kCura EDDS Agent Manager' Windows service is not running.
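The condition above can be checked by hand on an agent server. The following is a minimal sketch, assuming the standard Windows `sc query` command is available and that its output contains a `STATE` line such as `STATE : 4  RUNNING`. The helper names `parse_sc_state` and `service_is_running` are hypothetical. `sc query` expects the service key name, which may differ from the display name used on this page (`sc getkeyname` can resolve it).

```python
import re
import subprocess
from typing import Optional

# Assumption: this is the display name quoted on this page. The service key
# name that `sc query` expects may differ on a given server.
SERVICE_NAME = "kCura EDDS Agent Manager"


def parse_sc_state(sc_output: str) -> Optional[str]:
    """Return the state word (e.g. RUNNING, STOPPED) from `sc query` output."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_output)
    return match.group(1) if match else None


def service_is_running(service_name: str = SERVICE_NAME) -> bool:
    """Run `sc query` (Windows only) and report whether the service is RUNNING."""
    result = subprocess.run(
        ["sc", "query", service_name],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit code simply yields no STATE line
    )
    return parse_sc_state(result.stdout) == "RUNNING"
```

If `service_is_running()` returns False, the alert condition described above is likely to hold for that server.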

Alert Details

Alert ID: 18778d10-60f1-4703-ba94-759175f04ce4

Tags: Each tag should follow "key:value" format.

  • FeatureDomain:Agents
  • PageType:Dashboard
  • PageID:cd200ee0-1e61-4645-8220-83ce82914a71
  • CreatedBy:Relativity
  • ResolutionText:Go to 'Windows Services' and restart 'kCura EDDS Agent Manager'
  • Resolution
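The "key:value" convention for tags can be checked mechanically. The following is a minimal sketch, not part of the alert tooling itself; the function name `parse_tag` is hypothetical. It splits on the first colon only, so values may themselves contain colons.

```python
from typing import Tuple


def parse_tag(tag: str) -> Tuple[str, str]:
    """Split a tag into (key, value) at the first colon.

    Raises ValueError when the tag does not follow the "key:value" format.
    """
    key, sep, value = tag.partition(":")
    key, value = key.strip(), value.strip()
    if not sep or not key or not value:
        raise ValueError(f"tag does not follow key:value format: {tag!r}")
    return key, value


def is_valid_tag(tag: str) -> bool:
    """Return True when the tag parses as key:value."""
    try:
        parse_tag(tag)
    except ValueError:
        return False
    return True
```

With this check, a tag such as FeatureDomain:Agents is accepted, while a bare key with no value is rejected.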

Metric/Log/Trace Details

Metric Name: relsvr.agent.status

Metric Attributes:

Attribute Name                      Description                                    Value
labels.agent_name                   Relativity Agent Name
labels.agent_type_name              Relativity Agent Type
labels.application_name             Application Name                               Environment Administration & Operations
labels.exception_message            Any exception message on Agent
labels.message                      Message describes the issue                    Agent Manager is not responding.
labels.name                         Name of metric                                 Agent Disabled
labels.relsvr_artifact_id           Relativity agent artifact Id
labels.relsvr_subsystem             Agent Name
labels.relsvr_system                System name                                    Agents
labels.relsvr_agent_status          The current status of the agent                not responding
labels.relsvr_agent_type            The name of the agent type of the stale agent
labels.relsvr_resource_server_name  The name of the server

Rule details

Alert Condition Description: The alert triggers when the count of agent servers that have not been responding is greater than 0 over the last 1 minute.

Name          Value                                                                      Description
Rule Type     Elastic Query
Data View     metrics-*
Filter Query  relsvr.agent.disabled:0 and labels.relsvr_agent_status : "not responding"  Agent not responding
Group         Count                                                                      Number of agents not responding
Threshold     > 0                                                                        Alert triggers when the count is greater than 0
Time Window   1 min                                                                      Evaluates data from the last 1 minute
Frequency     30 sec                                                                     Checks every 30 seconds
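The rule logic above (filter, count, threshold, time window) can be illustrated outside the alerting system. The following is a minimal sketch under stated assumptions: documents are modelled as flat Python dictionaries with an `@timestamp` key, which is a hypothetical layout for illustration, and the function names are not part of any Relativity or Elastic API. The actual rule is evaluated by the Elastic Query rule itself, every 30 seconds.

```python
from datetime import datetime, timedelta, timezone

TIME_WINDOW = timedelta(minutes=1)  # "Time Window: 1 min"
THRESHOLD = 0                       # "Threshold: > 0"


def matches_filter(doc: dict) -> bool:
    """Mirror the Filter Query: agent enabled and status 'not responding'."""
    return (
        doc.get("relsvr.agent.disabled") == 0
        and doc.get("labels.relsvr_agent_status") == "not responding"
    )


def rule_fires(docs: list, now: datetime) -> bool:
    """Count matching documents inside the time window and apply the threshold."""
    window_start = now - TIME_WINDOW
    count = sum(
        1
        for doc in docs
        if window_start <= doc["@timestamp"] <= now and matches_filter(doc)
    )
    return count > THRESHOLD


# Hypothetical sample data: one matching document 20 seconds old.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
sample_docs = [
    {
        "@timestamp": now - timedelta(seconds=20),
        "relsvr.agent.disabled": 0,
        "labels.relsvr_agent_status": "not responding",
    }
]
fires = rule_fires(sample_docs, now)
```

A single matching document inside the one-minute window is enough to exceed the threshold, which is why one stale agent server activates the alert.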

Requires User Intervention

  • Yes: alert immediately
    • Min time before the alert is active/inactive: 90 seconds

Windows Services Dashboard

The Host Heartbeat alert should not be in an active state.