Grid services monitoring probes and metrics by org.sam

The following describes the gLite middleware grid services monitoring probes and metrics developed by the SAM team and available via the grid-monitoring-probes-org.sam RPM.

RPM

The grid-monitoring-probes-org.sam RPM is available via the repository http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/ (and the egee-NAGIOS meta RPM). It provides the following Nagios probes: SRM-probe, CE-probe, CREAMCE-probe, CREAMCEDJS-probe, WN-probe, WMS-probe, LFC-probe, samtest-run, nagtest-run. The probes can run in active and/or passive modes (in the Nagios sense). Passive test results can be published from inside the probes via the Nagios command file or NSCA. On worker nodes, Nagios is used as the probes' scheduler and executor. Metric results from WNs are sent to a Message Broker.
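
As an illustration of the command-file route, here is a minimal sketch of publishing a passive service check result in the Nagios external-command format. The function and its arguments are illustrative only (the probes drive this via the --pass-check-dest and --nagcmdfile parameters described below); PROCESS_SERVICE_CHECK_RESULT itself is the standard Nagios external command for passive service results.

import time

def submit_passive_result(cmdfile, host, service, status, output):
    # Format: [timestamp] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<status>;<output>
    # status is a Nagios return code: 0/1/2/3
    line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        int(time.time()), host, service, status, output)
    f = open(cmdfile, 'a')  # on a live Nagios this is a FIFO, e.g. /var/nagios/rw/nagios.cmd
    try:
        f.write(line)
    finally:
        f.close()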

Structure and dependencies

  • Structure
    /etc/gridmon/
    /usr/lib/python2.4/site-packages/gridmetrics/
    /usr/libexec/grid-monitoring/probes/org.sam/
    /usr/libexec/grid-monitoring/probes/org.sam/wnjob
    /usr/libexec/grid-monitoring/probes/org.sam/wnjob/nagios.d/{bin/,etc/,lib/,plugins/,probes/,tmp/,var/}
    /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam/{etc/wn.d/org.sam/,probes/org.sam/}
    
  • Dependencies
    python >= 2.4
    python-GridMon >= 1.1.3
    python-ldap
    python-suds >= 0.3.5
    grid-monitoring-probes-hr.srce >= 0.20.1
    

python-GridMon is a Python helper library for grid monitoring tools. It contains helper routines for Nagios and grid security, and was used in the development of the Python probes and metrics.

Content of RPM

  • SAM Nagios probes (in /usr/libexec/grid-monitoring/probes/org.sam/):
    • CE-probe - CE probe containing a number of CE tests (metrics) for job submission via WMS
    • CREAMCE-probe - as above, but for CREAM CEs
    • CREAMCEDJS-probe - direct job submission to CREAM CEs (asynchronous)
    • SRM-probe - SRM probe containing a number of metrics for SRM service
    • T-probe - template probe, which serves as an example for writing your own probes based on the Python framework currently provided by the package (see Writing a probe under "Python based probes using org.sam's 'gridmetrics' module" section on the same page)
    • WN-probe - WN probe containing a number of metrics to be run on WNs
    • WMS-probe - metrics to test whether job submission through the WMS works (asynchronous)
  • wrapper checks (in /usr/libexec/grid-monitoring/probes/org.sam/):
    • samtest-run - to run "native" SAM tests (see link)
    • nagtest-run - to run "semi"-Nagios checks (see link)
  • /usr/libexec/grid-monitoring/probes/org.sam/wnjob - directory containing
    • nagios.d/ - directory with the Nagios distribution used as the checks' scheduler on WNs
    • nagrun.sh - wrapper script to be launched on WNs (sets up required environment, launches and monitors Nagios, periodically sends WN metrics results to Message Bus)
    • org.sam/ -
      • probes/ - directory with SAM WN probes/tests ("new and old" ones), samtest-run and nagtest-run wrappers
      • etc/ - WN Nagios configuration for the above checks
  • gridmetrics Python package (in /usr/lib/python2.4/site-packages/):
    • used by the above SAM probes.
  • /etc/gridmon/ - configuration directory:
    • org.sam.conf - main configuration file

Source code

Source code can be browsed here: https://svnweb.cern.ch/trac/sam/browser/trunk/probes, http://svn.cern.ch/guest/sam/trunk/probes

To check out:

svn co http://svn.cern.ch/guest/sam/trunk/probes

Probes and metrics

CLI parameters to probes and metrics

Parameters to probes and metrics are sub-divided into three parts; a complete example invocation is shown after the list.

  • probes' common parameters. This is the standard set of parameters defined for any Nagios-compliant check, predefined by the metrics development framework.
    # /usr/libexec/grid-monitoring/probes/org.sam/*-probe -h
    Usage: /usr/libexec/grid-monitoring/probes/org.sam/CREAMCEDJS-probe
    [-H|--hostname <FQDN>]|[-u|--uri <URI>] [-m|--metric <name>] [-t|--timeout sec]
    [-V] [-h|--help] [--wlcg] [-v|--verbose 0-3] [-l|--list] [-x proxy] [<metric
    specific parameters>]
    
    -V                 Displays version
    -h|--help          Displays help
    -t|--timeout sec   Sets metric's global timeout. (Default: 600)
    -m|--metric <name> Name of a metric to be collected. Eg. org.sam.SRMv2-Put.
                       If not given, a default wrapper metric will be executed.
    -H|--hostname FQDN Hostname where a service to be tested is running on
    -u|--uri <URI>     Service URI to be tested
    -v|--verbose 0-3   Verbosity. (Default: 0)
                       0 Single line, minimal output. Summary
                       1 Single line, additional information
                       2 Multi line, configuration debug output
                       3 Lots of details for plugin problem diagnosis
    -l|--list          Metrics list in WLCG format
    -x                 VOMS proxy (Order: -x, X509_USER_PROXY, /tmp/x509up_u<UID>)
    --nosanity         Don't sanitize metrics output.
    
      Mandatory parameters: hostname (-H) or URI (-u).
    
      If specified with -m|--metric <name>, the given metric will be executed.
      Otherwise, a wrapper metric (acting as an active check) will be run. The
      latter is equivalent to "-m|--metric <nameSpace>.<Service>-All"
    ...
    
  • all metrics common parameters. Predefined by the metrics development framework.
        Metrics common parameters:
    
    Reporting passive checks (when used with wrapper checks)
    
    --pass-check-dest <config|nsca|nagcmd|active> (Default: config)
    
    --pass-check-conf <path> Configuration file for reporting passive checks.
                             Used with '--pass-check-dest config'. Overrides
                             passive checks submission library default one.
    
    --nsca-server <fqdn|ip> NSCA server FQDN or IP. Required if --pass-check-dest
                            is set to 'nsca'.
    --nsca-port <port>      Port NSCA is listening on (Default: 5667)
    --send-nsca <path>      NSCA client binary.  (Default: /usr/sbin/send_nsca)
    --send-nsca-conf <path> NSCA configuration file. (Default: /etc/nagios/send_nsca.cfg)
    
    --nagcmdfile <path>   Nagios command file.
                          Order: $NAGIOS_COMMANDFILE, --nagcmdfile
                          (Default: /var/nagios/rw/nagios.cmd)
    
    --vo <name>           Virtual Organization. (Default: ops)
    --vo-fqan <name>      VOMS primary attribute as FQAN. If given, will be used
                          along with --vo.
    --err-db <file>       Full path. Database file containing gLite CLI/API errors
                          for categorizing runtime errors. (Default: /etc/gridmon/gridmon.errdb)
    --err-topics <top1,>  Comma separated list of topics (Default: default)
    
    --work-dir <dir>      Working directory for metrics.
                          (Default: /var/lib/gridprobes)
    
    --stdout              Detailed output of metrics will be printed to stdout as
                          it is being produced by metrics. The default is to store
                          the output in a container and then produce
                          Nagios-compliant output.
    
    --no-details-header   Don't include header in details data.
    
  • parameters per metric. Defined by the metrics developer. They should be long parameters (i.e. starting with --).
        Metrics specific parameters:
    
    org.sam.WN-{Rep*}
    ...
    org.sam.WN-ISEnv
    ...
    

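For example, using the common parameters above, a single metric can be run against a given host (the hostname here is a placeholder; the metric name is the one quoted in the help output):

/usr/libexec/grid-monitoring/probes/org.sam/SRM-probe -H srm.example.org -m org.sam.SRMv2-Put --vo ops -t 300 -v 1
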
Collection of gLite CLI/API error messages

To distinguish between errors and determine their severity in particular test cases, a mechanism was implemented that consults a custom collection (database) of gLite CLI/API error messages. It is driven by the following parameters:

--err-db <file>       Full path. Database file containing gLite CLI/API errors
                      for categorizing runtime errors. (Default: /etc/gridmon/gridmon.errdb)
--err-topics <top1,>  Comma separated list of topics (Default: default)

/etc/gridmon/gridmon.errdb comes with the python-GridMon RPM (the library implementing this functionality). The file is marked as a configuration file in the RPM, so it can be modified/adjusted by the local administrator without being overwritten on package upgrade. It is an .INI-type configuration file with sections.

The structure of the file is as follows. Each section defines a logical "topic" covering a set of errors, with their classification into client- and server-side problems and their severity. E.g.:

[lcg_util]
client_status = UNKNOWN
client:
 Could NOT load client credentials|
 Bad credentials|
 Host not known

clientw_status = WARNING
clientw:
 Segmentation fault

server_status = CRITICAL
server:
 User timeout over|
 Error reading token data header: Connection closed|
...

When a topic is provided with --err-topics, corresponding regular expressions are built and later matched against the error messages returned by the CLI or API in use. *_status should be a Nagios-compliant status (UNKNOWN, WARNING, CRITICAL, OK); that status is returned by a metric on an error if one of the error strings is found in the error message of the CLI or API. The implementation stops at the first match, so it is best to be consistent and unambiguous when defining topics.
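
The sketch below illustrates this matching logic on the section shown above. It is only a sketch, assuming the errdb file parses as a standard .INI file and that the entries are matched as literal substrings; the actual implementation lives in gridmon's errmatch.py (see the package contents below) and may differ in detail - the function name here is illustrative.

import re
import ConfigParser  # Python 2.x, in line with the python >= 2.4 dependency

def classify_error(errdb_path, topics, error_message):
    """Return the *_status of the first errdb pattern found in
    error_message, or None if nothing matches."""
    cp = ConfigParser.RawConfigParser()
    cp.read(errdb_path)
    for topic in topics:                      # e.g. ['default', 'lcg_util']
        if not cp.has_section(topic):
            continue
        for key in ('client', 'clientw', 'server'):
            if not cp.has_option(topic, key):
                continue
            # each option holds error strings separated by '|'
            patterns = [p.strip() for p in cp.get(topic, key).split('|')
                        if p.strip()]
            regex = re.compile('|'.join([re.escape(p) for p in patterns]))
            if regex.search(error_message):
                return cp.get(topic, key + '_status')  # first match wins
    return None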

Probes and metrics per service

CE

Testing of Computing Elements (CEs) is done in two ways:

  • job submission testing via the CE probe/metrics: the full job submission chain is exercised - job submission, state monitoring, output sandbox retrieval
  • WN testing via the WN probe/metrics: mainly a set of replica management tests (i.e., WN<->SE communication)
CREAM-CE

CREAMCE-probe

Job submission via WMS

There are no differences between the LCG-CE and the CREAM-CE w.r.t. this way of job submission (and thus monitoring). Please refer to the CE section.

Probe and metrics names differ only in the name of the service (CREAM vs CE): probe org.sam/CREAMCE-probe, metrics org.sam.CREAMCE-*.

CREAM-CE Metrics

Name                        Description
org.sam.CREAMCE-JobState    Submits grid job to CE
org.sam.CREAMCE-JobMonit    Monitors grid jobs submitted to CEs
org.sam.CREAMCE-JobSubmit   Passive check
Direct job submission

See CREAM-CE probe and metrics for direct job submission - CREAMCE DJS.

WN

See WN probe and metrics.

SRM

See SRM probe and metrics.

WMS

See WMS probe and metrics.

LFC

See LFC probes and metrics.

python-GridMon library

python-GridMon is a Python library for development and running of Nagios grid probes and metrics.

RPM

python-GridMon RPM is available via this repository http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/ (and egee-NAGIOS meta RPM).

Structure and dependencies

  • Structure
    /etc/gridmon/ - configuration directory
    /usr/lib/python<Version>/site-packages/gridmon/ - Python 'gridmon' package
    
  • Dependencies
    none

Content of RPM

  • /etc/gridmon/ - configuration directory:
    • gridmon.errdb - collection of gLite m/w CLI and API error messages and their mapping to Nagios statuses
    • gridmon.conf - configuration file
  • /usr/lib/python<Version>/site-packages/gridmon/ - 'gridmon' package
    • contents of the package
      $ tree -P "*.py" -I "*__.py" /usr/lib/python2.4/site-packages/gridmon
      /usr/lib/python2.4/site-packages/gridmon
      |-- config.py
      |-- errmatch.py
      |-- gridutils.py
      |-- metricoutput.py
      |-- nagios
      |   |-- nagios.py
      |   `-- perfdata.py
      |-- probe.py
      |-- process
      |   |-- pexpect.py
      |   |-- pexpectpgrp.py
      |   |-- popenpgrp.py
      |   `-- signaling.py
      |-- security
      |-- singleton.py
      |-- template.py
      `-- utils.py
      

Sources

For anonymous read-only access

svn co http://www.sysadmin.hep.ac.uk/svn/grid-monitoring/trunk/GridMon/python

Writing Nagios checks with GridMon

Nagios checks - return codes and output

Nagios (from version 3.x) assumes that checks can produce multi-line output. The first line is taken as the check's summary and the rest as the check's details data. Return codes: 0 - OK, 1 - WARNING, 2 - CRITICAL, 3 - UNKNOWN (for more details see Plugin Return Codes).

Example of running a dummy Nagios check:

$ ./check_dummy
check_dummy: Could not parse arguments
Usage: check_dummy <integer state> [optional text]
$
$ ./check_dummy 2 "my summary
> my details data line 1
> my details data line 2"
CRITICAL: my summary
my details data line 1
my details data line 2
$
$ echo $?
2

$ ./check_dummy 2 "my summary
> my details data line 1
> my details data line 2" | nl
     1  CRITICAL: my summary
     2  my details data line 1
     3  my details data line 2
$

When the check is run under Nagios and its output is collected, line 1 and lines 2 onwards of the check's output (from the example above) are stored in two different containers - summary and details data (SERVICEOUTPUT and LONGSERVICEOUTPUT in Nagios terms, respectively). By default, Nagios reads only 4 KB of the data returned by checks (this can be overridden by recompiling Nagios). The Nagios version distributed by OAT has a 16 KB limit on metric output (as of November 24 2010).
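
A probe therefore needs to keep its output within the collector's limit. A minimal sketch of such a guard follows (an assumption about where the trimming would happen - gridmon may well handle this internally):

MAX_OUTPUT = 16 * 1024  # limit of the OAT-distributed Nagios mentioned above

def fit_output(summary, details):
    """Trim details data so that summary + details stays within the limit."""
    budget = MAX_OUTPUT - len(summary) - 1
    if len(details) > budget:
        details = details[:budget - 12] + "\n[truncated]"
    return summary + "\n" + details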

Python probes using gridmon package from python-GridMon RPM

The Python 'gridmon' package provides the metric base class probe.MetricGatherer and a library for writing Python-based, Nagios-compliant probes and metrics.

Reference example - org.sam and ch.cern

As examples of probes and metrics developed using 'gridmon', see org.sam and the ch.cern LFC probe.

org.sam probes and metrics

Naming

Proposed naming conventions:

  • /path/<nameSpace>/<serviceAbbreviation>-probe - a probe is an executable script containing a set of metrics to test a particular service. <nameSpace> can e.g. be org.<VOname>. E.g., in the case of SAM: /usr/libexec/grid-monitoring/probes/org.sam/SRM-probe.
  • metric - a test examining a particular functionality of a service. Metrics follow the naming convention <nameSpace>.<serviceAbbreviation>-<testName>. E.g., org.sam/SRM-probe contains a number of metrics: org.sam.SRM-LsDir, org.sam.SRM-Put, org.sam.SRM-Del etc.

Probes/metrics naming (in short):

Probe                                      Metric
<nameSpace>.<serviceAbbreviation>-probe    <nameSpace>.<serviceAbbreviation>-<testName>
org.sam.SRM-probe                          org.sam.SRM-LsDir

Writing probe and metrics

Please check org.sam/T-probe (template probe).

Skeleton of your probe:

# my imports
import sys

from gridmon import probe

class MYMetrics(probe.MetricGatherer):
   # my class attributes
   def __init__(self):
      probe.MetricGatherer.__init__(self)
      # mandatory initialization code
      # my custom initialization code
   def metricMyMet1(self):
      # metric body
      pass
   def metricMyMet2(self):
      # metric body
      pass

runner = probe.Runner(MYMetrics, probe.ProbeFormatRenderer())
sys.exit(runner.run(sys.argv))

NB! Class methods whose names start with "metric" are treated as metrics by the metric launcher class Runner; following the naming convention above, the metric MyMet1 is thus implemented by the method metricMyMet1.

Simple step-by-step example (a usage example is shown after the list):

  • import the probe module from the gridmon package (sys is also needed, for the exit code).
import sys

try:
    from gridmon import probe
except ImportError, e:
    print "UNKNOWN: Error loading modules : %s" % (e)
    sys.exit(3)
  • subclass your metrics class from probe.MetricGatherer; set some class attributes
    class LFCMetrics(probe.MetricGatherer):
       # dictionary describing metrics implemented in the class
       __metrics = { 'MyMet1':{'metricDescription':'My metric one'},
                     'MyMet2':{'metricDescription':'My metric two'} }
  • provide the metrics' descriptions in a dictionary, e.g.:
    class LFCMetrics(probe.MetricGatherer):
       __metrics = {
                    'Cleanup' : {
                            # required keys
                            'metricDescription' : "Clean test area on LFC",
                            'metricLocality'    : 'remote',
                            'metricType'        : 'status',
                            'metricVersion'     : '0.1',
                            # optional keys
                            'cmdLineOptions'    : ['cleanup-timeout=',
                                                   'cleanup-dir=',
                                                   'cleanup-threads='],
                            'metricChildren'    : []
                            }
       }
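
Once the metrics dictionary is in place, the probe can list its metrics in WLCG format via the -l option described earlier (the probe name below is hypothetical):

/usr/libexec/grid-monitoring/probes/org.sam/MY-probe -l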

General Troubleshooting

"Return code of 139 is out of bounds" and metrics in CRITICAL

This is most probably a check segfaulting (return code 139 = 128 + 11, i.e. the process was killed by SIGSEGV); Nagios then marks the tested host/service as CRITICAL. Check /var/log/messages:

$ egrep "kernel.*segfault" /var/log/messages
...
Nov  5 08:33:19 samnag016 kernel: python[28524]: segfault at 0000000000000071 rip 00002b9960e16390 rsp 00007fffee35cd20 error 4
...

NB! A Nagios check for parsing logs was planned to detect such errors in the logs and notify Nagios instance admins. It was not there yet as of Friday, November 05 2010.

To resolve the problem on the probe's side, org.sam.SRM-GetTURLs was modified to use the CLI instead of the Python API. This "shifts" the segfault into the CLI (wrapping it in a forked sub-shell) where it can be caught: the segfault no longer crashes CPython, so Nagios does not erroneously mark the tested service as CRITICAL (Nagios's default behavior when a check segfaults).
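
The idea can be sketched as follows (illustrative only - the actual probe code may differ): the CLI runs in a child process, so a segfault surfaces as a negative return code instead of killing the interpreter.

import signal
import subprocess  # available from Python 2.4

def run_cli(cmd):
    """Run a CLI command in a child process; a segfault in the child is
    reported instead of crashing the calling interpreter."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    if p.returncode == -signal.SIGSEGV:
        # child killed by SIGSEGV: report UNKNOWN instead of crashing
        return 3, "UNKNOWN: %s segfaulted" % cmd[0]
    return p.returncode, out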

Manually running checks via nagios-run-check and $<NAGIOS_MACRO>$ parameters

Some metrics may use Nagios macros as arguments to CLI parameters. Eg: after running

nagios-run-check -d -v -s org.sam.CE-JobState-/ops/Role=lcgadmin -H glite02-kvm.hpc2n.umu.se

you may get

su nagios -l -c '... --prev-status $LASTSERVICESTATEID$ ...'

where $LASTSERVICESTATEID$ is intended to be expanded by Nagios when it internally prepares to run the command. To run the command manually, delete the parameter or substitute the macro with a meaningful value (e.g. 2 for CRITICAL, since $LASTSERVICESTATEID$ holds the numeric state of the last check).

For manual submission and monitoring of CE jobs see.

Probes testing Nagios instance

See Probes testing Nagios instance.
