Compute engine job configurations

This document describes the job configurations

Overview

Job Configuration	Description	Shortcut
`/etc/ar-compute-engine.conf`	This file includes various global parameters used by the engine which are organized in sections, as described next:	Description
`{TENANT_NAME}_ops.json`	The ops files are json filetypes that are used to describe the available status types encountered in the monitoring environment of a tenant and also the available algorithmic operations available to use in status aggregations.	Description
`{TENANT_NAME}_{JOB_ID}_cfg.json`	A job config file is a json file that contains specific information needed during the job run such as grouping parameters, the name of the availability profile used and many more.	Description
`{TENANT_NAME}_{JOB_NAME}_ap.json`	The availability profile is a json file used per specific job that describes the operations that must take place during aggregating statuses up to endpoint group level.	Description

Tenant and Report configuration

The ARGO Compute Engine is multi-tenant, meaning that a single installation of the ARGO Compute Engine can support multiple tenants (customers). Each tenant must be configured properly before the ARGO Compute Engine can start computing availability and reliability reports for that tenant.

For each tenant there should be at least one Job Configuration. A Job Configuration defines the topology, metric and availability profiles that will be used by the ARGO Compute Engine in order to perform the computations using as input the monitoring metric results received.

For example regarding a hypothetical tenant, we might have two Job Configurations based on two different metric profiles: Critical and All. The ARGO Compute Engine, using the Critical job configuration, will compute the availability and reliability report by taking into account the most critical metrics for each service. On the other hand, the All job configuration is more versatile as it takes into account all metrics for each service.

Each Job Configuration includes the following data:

Topology
Metric Profile
Availability Profile
Weights (optional)
Downtimes (optional)

Hadoop client configuration

In order for the engine to be able to connect and submit jobs successfully in a hadoop cluster, proper hadoop client configuration files must be present the installed node (/etc/hadoop/conf/)

ARGO Compute Engine configuration files

The main configuration file of the ARGO Compute Engine component is installed by default at /etc/ar-compute-engine.conf. In addition, a directory with a supplementary secondary configuration files is created in /etc/ar-compute/

/etc/ar-compute-engine.conf

The main configuration files includes various global parameters used by the engine which are organized in sections, as described next:

`[default]`

Name	Type	Description	Required
`mongo_host`	String	Specify the ip address of the datastore node (running mongodb)	`YES`
`mongo_port`	String	Specify the port number of the datastore node (running mongodb)	`YES`
`mode`	String	The mode the engine runs. There are two available options: cluster and local: `cluster`: If the mode is specified as cluster, the engine runs connecting to an existing hadoop cluster. It expects that the hadoop client is properly installed and configured. `local` : If the mode is specified as local, the engine runs local node.	`YES`
`serialization`	String	The serialization type used. There are two available options avro and none: `avro` : If specified as avro, the engine expects metric and sync data in avro format. `none` :If specified as none, the engine expects to find metric and sync data in simple text file delimited format.	`YES`
`prefilter_clean`	Boolean	Controls whether the local prefilter file will be automatically removed after it has been uploaded to the Compute Engine. If set to true, the local prefilter file will be automatically removed after it is uploaded.	`YES`
`sync_clean`	Boolean	Controls whether the uploaded sync files will be automatically removed after a job completion. If set to true the uploaded sync files will be automatically removed after the job completion	`YES`

`[logging]`

In this section we declare the specific logging options for the compute engine

Name	Type	Description	Required
`log_mode`	String	This parameter specifies the log_mode used by the compute engine. Possible values: `syslog` (default), `file`, `none`. a) `syslog`: the compute engine is configured to use the syslog facility, b) `file`: the compute engine can write directly to a file defined by `log_file`, c) `none`: the compute engine does not output any logs	`YES`
`log_file`	String	This parameter must be specified if `log_mode=file`. The file which the compute engine will use in order to write logging information	`NO`
`log_level`	String	Possible values: `DEBUG` (default), `INFO`, `WARNING`, `ERROR`, `CRITICAL`. Defines the log level that is used by the compute engine.	`YES`
`hadoop_log_root`	String	Hadoop clients log level and log appender. If the user wants the hadoop components to log via SYSLOG must make sure to define an appropriate appender in hadoop log4j.properties file. The name of this appender must be added in this parameter.	`YES`

For example at hadoop_log_root if the available appenders in the log4j.properties file are SYSLOG and console the above line will be:

   hadoop_log_root=SYSLOG,console

`[jobs]`

In this section we declare the specific tenant used in the installation and the set of jobs available (as we described them above in the “Tenant and Report configuration”).

Name	Type	Description	Required
`tenants`	List	Comma separated list with the names of the available tenants. For eg: `tenants=tenantFoo,tenantBar`	`YES`
`{tenant-name}_jobs`	List	For each tenant: a comma separated job list with the names of the available reports to be produced. Names are case-sensitive. Each tenant can have multiple report configurations. Each report configuration is defined by a set of topologies, metric profiles, weights etc. For eg: `tenantFoo_jobs=Major,Minor,ExampleA,Critical`	`YES`
`{tenant-name}_prefilter`	Path(String)	For each tenant: An optional attribute that specifies the path of a prefilter wrapper - if and only if the tenant requires it. For eg: `tenantFoo_prefilter=/path/to/the/prefilter/script`	`NO`

`[sampling]`

Name	Type	Description	Required
`s_period`	minutes	The sampling period time in minutes	`YES`
`s_interval`	minutes	The sampling interval time in minutes	`YES`

Note the number of samples used in a/r calculations is determined by the s_period/s_interval value. Default values used

s_period = 1440
s_interval = 5

so number of samples = 1440/5 = 288

`/etc/ar-compute/`

As mentioned above secondary configuration files used by the compute-engine are stored to the /etc/ar-compute directory. Here are files describing the set of status state types used in the monitoring engine, algorithmic operations of how to combine those states, availability profiles & configuration files for available jobs.

`{TENANT_NAME}_ops.json (per tenant)`

These are configuration files expressed in JSON, which describe the available status types encountered in the monitoring environment of a tenant and also the available algorithmic operations to use in status aggregations.

For example if the tenant name is T1 the corresponding ops filename will be T1_ops.json

During computations many operations take place among service statuses which need to be described explicitly. The ARGO Compute Engine gives the flexibility to the end user to declare the available monitoring statuses that are produced by the monitoring infrastructure and to map them to its internal statuses. Then, using truth tables the user can describe the logical operations on these statuses and their results. An ops file contains:

the list of available status types
which status type is considered as default in missing circumstances
which status type is considered as default in downtime circumstances
which status type is considered as default in unknown circumstances
a list of available operations between statuses expressed in the form of truth tables

The available status states produced by the Monitoring Engine(s) are expressed in the “states” list. For example below is the definition of the status states produced by Nagios compatible Monitoring Engines:

"states": [
    "OK",
    "WARNING",
    "UNKNOWN",
    "MISSING",
    "CRITICAL",
    "DOWNTIME"
]

The ARGO Compute Engine requires the user to define a mapping for the default_down, default_missing and default_unknown. For example:

"default_down": "DOWNTIME",
"default_missing": "MISSING",
"default_unknown": "UNKNOWN",

Note: The importance of the default states : Since compute engine gives the ability to define completely custom states based on your monitoring infrastructure output we must also tag some custom states with specific meaning. These states might not be present in the monitoring messages but are produced during computations by the compute engine according to a specific logic. So we need to “tie” some of the custom status we declare to a specific default state of service.

Name	Description
`"default_down": "DOWNTIME"`	Means that whenever compute engine needs to produce a status for a scheduled downtime will mark it using the “DOWNTIME” state.
`"default_missing": "MISSING"`	Means whenever compute engine decides that a service status must declared missing (because there is no information provided from the metric data) will mark it using the “MISSING” state.
`"default_unknown: "UNKNOWN"`	Means whenever compute engine decides that must produce a service status to be considered unknown (for e.g. during recomputation requests) will mark it using the “UNKNOWN” state.

The available operations are declared in the operations list using truth tables as follows:

"operations": {
  "AND":[],
  "OR":[]
}

Each operation consists of a JSON array used to describe a truth table. An example of such a truth table is presented below:

"operations": {
  "AND": [
    { "A":"OK",       "B":"OK",       "X":"OK"       },
    { "A":"OK",       "B":"WARNING",  "X":"WARNING"  },
    { "A":"OK",       "B":"UNKNOWN",  "X":"UNKNOWN"  },
  ]
}

Each element of the JSON array describes a row of the truth table for example:

{ "A":"OK", "B":"WARNING", "X":"WARNING"}

declares that in an algorithmic AND operation between two status states of OK and WARNING the result is WARNING

In the ops file the user is able to declare any number of available monitoring states and any number of available custom operations on those states. The ARGO Compute Engine uses this information to create the corresponding truth tables in memory.

`{TENANT_NAME}_{JOB_ID}_cfg.json (per job)`

A Job Configuration file is a JSON file that contains specific information needed during the computations such as grouping parameters, the name of the availability profile used and many more. If the tenant’s name is T1 and the report name is JobA then the filename of the config file must be T1_JobA_cfg.json

The configuration file of the job contains mandatory and optional fields with rich information describing the parameters of the specific job. Some important fields are:

"tenant": "tenant_name"`
"job": "job_name",
"aprofile": "availability_profile_name",
"egroup": "endpoint_group_type_name",
"ggroup": "group_of_group_type_name",
"weight": "weight_factor_type_name"

In the above snippet we have declared the name of the tenant, the name of the job, the name of the specific availability profile used in the job. Also the type of endpoint grouping that will be used is declared here and the type of upper hierarchical grouping. Also if available here is declared the type of weight factor used for upper level A/R aggregations

Name	Description
`"tenant"`	This field is explicitly linked to the value of the tenant declaration of the global ar-compute-engine.conf file link to description above
`"job"`	This field is explicitly linked to the name of a job declared in the job_set variable of the global ar-compute-engine.conf file link to description above
`"aprofile"`	This field is explicitly linked to one of the availability profile json files declared in the `/etc/ar_compute/` folder and they are described below link to description further below
`"egroup"`	This field is used to declare the endpoint group that will be used during computation aggregations. The value corresponds to one of the values present in the field `type` of the topology file group_endpoints.avro
`"ggroup"`	This field is used to declare the group of groups that will be used during computation aggregations. The value corresponds to one of the values present in the field `type` of the topology file group_groups.avro
`"weight"`	This field is used to declare the type of weight that will be used during computation aggregations. The value corresponds to one of the values present in the field `type` of the weight (factors) file weight_sync.avro

In the configuration file are specified the specific tag values that will be used during the job in order to filter metric data.

For example:

"egroup_tags": {
  "scope":"scope_type",
  "production":"Y",
  "monitored":"Y"
}

In the egroup_tag list are declared values for available tag fields that will be encountered in the endpoint group topology sync file (produced by ar-sync components).These tag fields are explicitly linked to the description of the schema of the group_endpoints.avro file

`{TENANT_NAME}_{JOB_ID}_ap.json (per report)`

The availability profile is a json file used per specific job that describes the operations that must take place during aggregating statuses up to endpoint group level. For example if the tenant name is T1 and the report name JobA the corresponding availability profile name must be T1_ReportA_ap.json

The information in the availability profile JSON file is automatically picked up by the compute-engine during computations.

Name	Type	Description
`"name"`	string	The name of the availability profile
`"namespace"`	string	The name of the namespace used by the profile
`"metric_profile"`	string	The name of the metric profile linked to this availability profile
`"metric_ops"`	string	The default operation to be used when aggregating low level metric statuses
`"group_type"`	string	The default endpoint group type used in aggregation

In the availability profile JSON file also are declared custom grouping of services to be used in the aggregation. The grouping of services are expressed in the JSON “groups” list see example below:

"groups": {
  "my_group_of_services_1": {
    "services":{
      "service_type_A":"OR",
      "service_type_B":"OR"
    },
    "operation":"OR"
  },
  "my_group_of_services_2": {
    "services":{
      "service_type_C":"OR",
      "service_type_D":"OR"
    },
    "operation":"OR"
  },
  "operation":"AND"
}

In the above example the service types are grouped in two groups:

my_group_of_services_1 and
my_group_of_services_2.

Each group contains a “service” list containing service types included in the group as fields and the operation values in order to choose who to aggregate the various instances of a specific service. For example if for “service_type_A” are 3 service endpoints available, they are going to be aggregated using the OR operation. The “operation” field under each group of services is used to declare the operation that will be used to aggregate the service types under that group. The outer “operation” field in the root of the json document is used to declare the operation used to aggregate the various groups in order to produce the final endpoint aggregation result.