Status Timelines & A/R Results
Description
The ARGO Monitoring Service is checking the services at regular intervals. It actually runs explicit tests (checks) in order to assess the status of the service. The result of the checks decides on the state of the service. Based on that each service may have a state :
- OK: the check succeeds
- CRITICAL: the check does not succeed
- WARNING: the check succeeds but performs unusually
- MISSING: the check’s state is not recorded
- UNKNOWN: the check could not apply on the monitoring item and as a result the check’s state is unknown
As configuration problems, troublesome services, or other service internal problems occur, the checks on the monitoring items can result in a problematic state (critical, warning or unknown state), for a time period.
The monitored item is part of a multi-level hierarchy, a.k.a topology. A topology is organized as following:
|--group
|--service
|--endpoint
|--metric
The results of these daily checks a.k.a Metric Data, are ingested and stored in HDFS. Each Metric Data record includes information about the timestamp and the status of the monitored metric check, also information about the topology and some extra information regarding the check .
An example metric result in is shown below:
{
"timestamp": "2021-05-02T10:53:38Z",
"metric": "http.check",
"service_type": "Web-Site",
"hostname": "host1.example.com",
"status": "OK"
"summary":""
"description": ""
}
ARGO Monitoring service using its analytics engine (a big data friendly platform), analyzes the stream of collected Metric Data, and is able to apply calculations on all the levels of the topology. It provides timeline results and availability/reliability results of the monitored item or a group of monitored items.