SAM Doc : grid-monitoring-probes-org.sam
This page last changed on Jul 02, 2012 by prodrigu.
Grid services monitoring probes and metrics by org.sam

The following describes the gLite middleware grid services monitoring probes and metrics developed by the SAM team and available via the grid-monitoring-probes-org.sam RPM.

RPM

The grid-monitoring-probes-org.sam RPM is available via this repository http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/ (and the egee-NAGIOS meta RPM). It provides the following Nagios probes: SRM-probe, CE-probe, CREAMCE-probe, CREAMCEDJS-probe, WN-probe, WMS-probe, LFC-probe, samtest-run, nagtest-run. The probes can run in active and/or passive modes (in the Nagios sense). Passive test results can be published from inside the probes via the Nagios command file or NSCA. On worker nodes Nagios is used as the probe scheduler and executor. Metric results from WNs are sent to the Message Broker.

Structure and dependencies
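Publication via the Nagios command file uses the standard PROCESS_SERVICE_CHECK_RESULT external command. The sketch below only formats such a command line; the host name, service name, and helper function are illustrative, not part of the actual probes:

```python
import time

def passive_result_line(host, service, code, output):
    """Format a Nagios PROCESS_SERVICE_CHECK_RESULT external command line.

    In production the line is appended to the Nagios command file
    (commonly a FIFO such as /var/nagios/rw/nagios.cmd) or shipped
    with send_nsca instead.
    """
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s" % (
        int(time.time()), host, service, code, output)

# Hypothetical host and metric names, for illustration only.
line = passive_result_line("ce01.example.org", "org.sam.CE-JobSubmit",
                           0, "OK: job submitted")
```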
python-GridMon is a Python helper library for grid monitoring tools. It contains helper routines for Nagios and Grid Security, and was used in the development of the Python probes and metrics.

Content of RPM
Source code

Source code can be browsed here: https://svnweb.cern.ch/trac/sam/browser/trunk/probes, http://svn.cern.ch/guest/sam/trunk/probes

To check out:

$ svn co http://svn.cern.ch/guest/sam/trunk/probes

Probes and metrics

CLI parameters to probes and metrics

Parameters to probes and metrics are sub-divided into three parts.
Collection of gLite CLI/API error messages

To distinguish between errors and their severity in particular testing cases, a mechanism of "consulting" a custom collection (database) of gLite CLI/API error messages was implemented. It is driven by the following parameters:

--err-db <file>       Full path to the database file containing gLite CLI/API errors for categorizing runtime errors. (Default: /etc/gridmon/gridmon.errdb)
--err-topics <top1,>  Comma-separated list of topics. (Default: default)

/etc/gridmon/gridmon.errdb comes with the python-GridMon RPM (the library implementing the functionality). The file is marked as a configuration file in the RPM, so it can be modified/adjusted by a local administrator without being overwritten. It is an .INI-type configuration file with sections. The structure of the file is the following: sections define logical "topics"; each topic lists error strings together with their classification into client/server side problems and their severity. E.g.:

[lcg_util]
client_status = UNKNOWN
client: Could NOT load client credentials| Bad credentials| Host not known
clientw_status = WARNING
clientw: Segmentation fault
server_status = CRITICAL
server: User timeout over| Error reading token data header: Connection closed| ...

When a topic is provided with --err-topics, corresponding regular expressions are built and later matched against the error messages returned by the CLI or API that was used. *_status should be a Nagios-compliant exit status (UNKNOWN, WARNING, CRITICAL, OK). That status is returned by a metric on an error if one of the error strings is found in the error message of the CLI or API. The implementation stops on the first match, so it is better to be consistent and unambiguous when defining topics.

Probes and metrics per service

CE

Testing of Computing Elements (CE) is done twofold
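The matching behaviour described above (first match wins, status taken from the matching group) can be sketched as follows. The in-memory ERRDB dictionary below is a hypothetical stand-in for the parsed /etc/gridmon/gridmon.errdb file; this is an illustration of the mechanism, not the python-GridMon implementation:

```python
import re

# Hypothetical stand-in for the parsed .INI error database;
# the real file is /etc/gridmon/gridmon.errdb.
ERRDB = {
    "lcg_util": [
        ("UNKNOWN",  ["Could NOT load client credentials",
                      "Bad credentials",
                      "Host not known"]),
        ("WARNING",  ["Segmentation fault"]),
        ("CRITICAL", ["User timeout over",
                      "Error reading token data header: Connection closed"]),
    ],
}

def classify(message, topics=("lcg_util",)):
    """Return the Nagios status for the first error string found in
    the message; matching stops on the first hit."""
    for topic in topics:
        for status, patterns in ERRDB.get(topic, []):
            for pat in patterns:
                if re.search(re.escape(pat), message):
                    return status
    return None  # no known error string matched
```

Because the first match wins, an error string listed under two groups would always be classified by the group that appears first.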
CREAM-CE

CREAMCE-probe

Job submission via WMS

There are no differences between LCG-CE and CREAM-CE with respect to this way of submitting jobs (and thus monitoring). Please refer to the CE section. Probe and metric names differ only in the name of the service (CREAM vs CE): probe org.sam/CREAMCE-probe, metrics org.sam.CREAMCE-*.

CREAM-CE Metrics
Direct job submission

See the CREAM-CE probe and metrics for direct job submission - CREAMCE DJS.

WN

See WN probe and metrics.

SRM

WMS

LFC

python-GridMon library

python-GridMon is a Python library for developing and running Nagios grid probes and metrics.

RPM

The python-GridMon RPM is available via this repository http://www.sysadmin.hep.ac.uk/rpms/egee-SA1/ (and the egee-NAGIOS meta RPM).

Structure and dependencies
Content of RPM
Sources

For anonymous read-only access:

$ svn co http://www.sysadmin.hep.ac.uk/svn/grid-monitoring/trunk/GridMon/python

Writing Nagios checks with GridMon

Nagios checks - return codes and output

Nagios (from version 3.x) assumes that checks can produce multi-line output. The first line is taken as the check's summary and all the rest as the check's details data. Return codes: 0 - OK, 1 - WARNING, 2 - CRITICAL, 3 - UNKNOWN (for more details see Plugin Return Codes).

Example of running a dummy Nagios check:

$ ./check_dummy
check_dummy: Could not parse arguments
Usage: check_dummy <integer state> [optional text]
$
$ ./check_dummy 2 "my summary
> my details data line 1
> my details data line 2"
CRITICAL: my summary
my details data line 1
my details data line 2
$
$ echo $?
2
$ ./check_dummy 2 "my summary
> my details data line 1
> my details data line 2" | nl
     1  CRITICAL: my summary
     2  my details data line 1
     3  my details data line 2
$

When the check is run under Nagios and its output is collected, line 1 and lines 2 onwards of the check's output (from the example above) are stored in two different containers - summary and details data (SERVICEOUTPUT and LONGSERVICEOUTPUT in Nagios terms, respectively). By default, Nagios reads only 4 KB of data returned by checks (this can be overridden by recompiling Nagios - see). The Nagios version distributed by OAT has a 16 KB limit on metric output (as of November 24 2010).

Python probes using the gridmon package from the python-GridMon RPM

The Python 'gridmon' package provides the metric base class probe.MetricGatherer and a library for writing Python-based, Nagios-compliant probes and metrics.

Reference example - org.sam and ch.cern

As an example of probes and metrics developed using 'gridmon', see the org.sam probes and the ch.cern LFC probe.

org.sam probes and metrics
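The summary/details convention above can be reproduced in a few lines of Python. This is a minimal sketch of the output format, not GridMon's renderer; the function name is illustrative:

```python
# Nagios plugin return codes and their labels.
STATUS = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

def render(code, summary, details=()):
    """Build Nagios-style multi-line plugin output: the first line
    becomes SERVICEOUTPUT, the remaining lines LONGSERVICEOUTPUT."""
    lines = ["%s: %s" % (STATUS.get(code, "UNKNOWN"), summary)]
    lines.extend(details)
    return "\n".join(lines)

out = render(2, "my summary",
             ["my details data line 1", "my details data line 2"])
# A real check would then do: print(out); sys.exit(2)
```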
Naming

Proposed naming conventions:
Probes/metrics naming (in short):
Writing probe and metrics

Please check org.sam/T-probe (template probe). Skeleton of your probe:

    # my imports
    import sys
    from gridmon import probe

    class MYMetrics(probe.MetricGatherer):
        # my class attributes
        def __init__(self):
            probe.MetricGatherer.__init__(self)
            # mandatory initialization code
            # my custom initialization code
        def metricMyMet1(self):
            # metric body
        def metricMyMet2(self):
            # metric body

    runner = probe.Runner(MYMetrics, probe.ProbeFormatRenderer())
    sys.exit(runner.run(sys.argv))

NB! Class methods whose names start with "metric" are treated as metrics by the metrics launcher class Runner.

Simple step-by-step example:
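The name-based discovery mentioned in the NB above can be illustrated with plain introspection. The class and function below are toy stand-ins, not part of gridmon; they only show the "methods starting with metric" convention:

```python
class FakeGatherer(object):
    """Toy gatherer illustrating metric discovery by method name."""
    def metricJobSubmit(self):
        return (0, "OK: submitted")
    def metricJobState(self):
        return (0, "OK: done")
    def helper(self):
        return "not a metric"  # name does not start with 'metric'

def discover_metrics(obj):
    # Analogous to how a Runner-like launcher can find metrics:
    # collect callable attributes whose names start with 'metric'.
    return sorted(name for name in dir(obj)
                  if name.startswith("metric")
                  and callable(getattr(obj, name)))

metrics = discover_metrics(FakeGatherer())
```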
    try:
        from gridmon import probe
    except ImportError, e:
        print "UNKNOWN: Error loading modules : %s" % (e)
        sys.exit(3)
General Troubleshooting

"Return code of 139 is out of bounds" and metrics in CRITICAL

This is most probably a check segfaulting; Nagios then marks the tested host/service as CRITICAL. Check /var/log/messages:

$ egrep "kernel.*segfault" /var/log/messages
...
Nov  5 08:33:19 samnag016 kernel: python[28524]: segfault at 0000000000000071 rip 00002b9960e16390 rsp 00007fffee35cd20 error 4
...

NB! A Nagios check for parsing logs was planned to detect such errors and notify Nagios instance admins. Not there yet as of Friday, November 05 2010.
To resolve the problem on the probe side, org.sam.SRM-GetTURLs was modified to use the CLI instead of the Python API. This "shifts" the segfault to the CLI ("wraps" it into a forked sub-shell), where it can be caught. The segfault then does not crash CPython, so Nagios does not erroneously mark the tested service as CRITICAL (the default behaviour of Nagios when a check segfaults).

Manually running checks via nagios-run-check and $<NAGIOS_MACRO>$ parameters

Some metrics may use Nagios macros as arguments to CLI parameters. E.g., after running

$ nagios-run-check -d -v -s org.sam.CE-JobState-/ops/Role=lcgadmin -H glite02-kvm.hpc2n.umu.se

you may get

su nagios -l -c '... --prev-status $LASTSERVICESTATEID$ ...'

where $LASTSERVICESTATEID$ is intended to be expanded by Nagios when it internally prepares to run the command. To be able to run the command manually, you will have to delete the parameter or substitute the macro with a meaningful value. For manual submission and monitoring of CE jobs see.

Probes testing Nagios instance
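The CLI-wrapping approach described in the troubleshooting section above can be sketched with subprocess: a segfault in a forked child produces a negative return code instead of crashing the interpreter. This is a minimal illustration, not the org.sam.SRM-GetTURLs code; the command and URL are placeholders:

```python
import signal
import subprocess

def run_cli(cmd):
    """Run a CLI command in a child process so that a segfault kills
    only the child, never the Python interpreter running the check."""
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE)
    out, _ = p.communicate()
    # A negative return code means the child died on a signal.
    if p.returncode < 0 and -p.returncode == signal.SIGSEGV:
        return (2, "CRITICAL: command segfaulted: %s" % " ".join(cmd))
    return (p.returncode, out.decode("utf-8", "replace"))

# Placeholder command standing in for the real SRM CLI invocation.
code, output = run_cli(["echo", "gsiftp://se.example.org/turl"])
```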
Document generated by Confluence on Feb 27, 2014 10:19