SAM Doc : WN
This page last changed on Sep 26, 2011 by imamagic.
WN-probe

WN-probe acts as a container for WN metrics. The -m parameter is used to invoke a particular metric.

samtest-run wrapper and standard SAM tests

WN Metrics
Metrics are written to use the ldap/lcg_util/gfal/lfc APIs. However, if the respective API fails to load, the code falls back to the ldap/lcg_util/lfc CLIs. For all metrics, the detailed output contains either the parameters of the API call or the command line used to test the service.

org.sam.WN-Rep

Wrapper check that launches the replica management checks and publishes passive check results to Nagios. The order and dependencies are statically defined in WN-probe (the WNMetrics._metrics dictionary attribute). This is the metrics dependency tree:

            0:Rep (wrapper)
                  |
             1:RepISenv
              ^       ^
             /         \
      2:RepFree     __3:RepCr____
                    ^     ^     ^
                   /      |      \
            4:RepGet  5:RepRep  6:RepDel

To pass the name of the file registered in LFC by WN-RepCr to subsequent tests, a simple file-based IPC is used: the file name in LFC (lfn) is stored in a text file under the metric's work directory.

org.sam.WN-RepISenv

Check if the LCG_GFAL_INFOSYS variable is set.

org.sam.WN-RepFree

Check if the VO default SE has any free space left according to the information system (LCG_GFAL_INFOSYS).
Returns CRITICAL
Returns WARNING
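
The Rep dependency tree described above, under org.sam.WN-Rep, can be read as a parent map plus a fixed run order. The following is only a minimal sketch of that idea: the real layout of WNMetrics._metrics is not shown on this page, so the names PARENTS, ORDER and run_chain, the "SKIPPED" marker, and the rule that a child runs only when its parent returned OK are all assumptions made for illustration.

```python
# Hypothetical parent map reconstructed from the dependency tree above.
# The actual structure of WNMetrics._metrics is not documented here.
PARENTS = {
    "RepISenv": None,
    "RepFree": "RepISenv",
    "RepCr": "RepISenv",
    "RepGet": "RepCr",
    "RepRep": "RepCr",
    "RepDel": "RepCr",
}

# Run order follows the numbering 1..6 used in the tree.
ORDER = ["RepISenv", "RepFree", "RepCr", "RepGet", "RepRep", "RepDel"]


def run_chain(checks):
    """Run the checks in ORDER, skipping any metric whose parent was not OK.

    `checks` maps a metric name to a zero-argument callable that returns a
    Nagios status string ("OK", "WARNING", "CRITICAL" or "UNKNOWN").
    """
    results = {}
    for name in ORDER:
        parent = PARENTS[name]
        if parent is not None and results.get(parent) != "OK":
            # Assumption: dependants are not executed after a failed parent.
            results[name] = "SKIPPED"
            continue
        results[name] = checks[name]()
    return results
```

With this layout a failure in RepCr leaves RepFree unaffected but prevents RepGet, RepRep and RepDel from running, which matches the shape of the tree.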
org.sam.WN-RepCr

Copy and register a file to the VO default SE, into the default space area. The metric sets LFC_HOST to a default LFC (hard-coded in the probe) or uses the one provided by the --lfc parameter. Because file registration requires a writable LFC, an attempt to "write" to the given LFC is made. If that fails and the reason for the failure is understood, the test is terminated with the corresponding exit code (determined by the mapping in the errors DB) and error message. Otherwise, LFC discovery is performed in LCG_GFAL_INFOSYS and a "write" operation is tried on each LFC found. A working writable LFC is then used with lcg_cr3() or lcg-cr. To store a file, only the SE name is given (w/o any path). LFC can be specified via
org.sam.WN-RepGet

Copy the file back from the SE to the WN (lcg_cp()/lcg-cp). Compare the files (with diff). Both operations are critical.

org.sam.WN-RepRep

Replicate the file from the SE to a given 'central' replication SE. Retrieve the list of replicas. The default replication SE is hard-coded in WN-probe. --se-rep <se1,..> can be used to define a list of SEs. The SEs are usually defined at job submission time with org.sam.CE-JobState using --wn-se-rep <se1,..> or --wn-se-rep-file <name>. This is exposed via YAIM as follows. Static and/or dynamic mechanisms are possible.

JOBSUBMIT_WN_SE_REP can be defined with a list of comma-separated hostnames; this provides a static mechanism for defining replication SEs.

JOBSUBMIT_WN_SE_REP_FILE, if specified, should be a file name (w/o path, which is dynamically generated by the respective metrics based on the VO and/or FQAN for which the metrics are defined). The file will be filled in with a list of SEs defined on the Nagios instance that recently passed the org.sam.SRM-All set of tests. Setting the variable triggers execution of the local hr.srce.GoodSEs check to generate the list of "good" SEs, and provides the file as an input parameter to the org.sam.{CREAM}CE-JobState metric(s). The latter takes at most 3 hosts from the file and, if JOBSUBMIT_WN_SE_REP was defined, appends them to the static list.

On the WN, org.sam.WN-RepRep tries to replicate to the SEs in the provided order until a replication succeeds. The metric returns CRITICAL if the file couldn't be replicated to any of the SEs.

org.sam.WN-RepDel

Delete the given file(s) from SRM.

org.sam.WN-PyVer

Check the version of Python installed on the WN.

org.sam.glexec.WN-gLExec

WN test which tries to execute

    export GLEXEC_CLIENT_CERT=$X509_USER_PROXY; /<path_to>/glexec /usr/bin/id -a

A separate CE job submission/monitoring is used (org.sam.glexec.CE-Job{State,Submit,Monit}-<FQAN>) to deliver the test to WNs. OPS VO Role=pilot is used to submit the jobs.
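
The exit codes and summaries listed next for this test lend themselves to a lookup table. The following is a minimal sketch of such a mapping, not the probe's actual code: the names GLEXEC_STATUS and classify_glexec are hypothetical, and treating exit code 0 as the 'success' case is an assumption. The "glexec command not found" and "client cert file error" cases are left out because this page does not tie them to a specific exit code.

```python
# Status/summary pairs copied from the list on this page.
# Assumption: exit code 0 corresponds to the OK / 'success' entry.
GLEXEC_STATUS = {
    0: ("OK", "success"),
    126: ("WARNING", "executable can't be executed (126)"),
    201: ("WARNING", "client error (201)"),
    202: ("WARNING", "system error (202)"),
    203: ("WARNING", "authorization error (203)"),
    204: ("UNKNOWN", "exit code overlap (204)"),
}


def classify_glexec(exit_code):
    """Return a (status, summary) pair for a glexec exit code."""
    try:
        return GLEXEC_STATUS[exit_code]
    except KeyError:
        # Anything not in the table is reported as unrecognised.
        return ("UNKNOWN", "unrecognised exit code (%d)" % exit_code)
```

For example, classify_glexec(203) yields the WARNING status with the authorization error summary, while any code missing from the table falls through to UNKNOWN.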
Currently [SAM:Nov 23 2010], the test exit codes (Nagios) and summaries are the following:

status = 'OK' summary = 'success'
status = 'UNKNOWN' summary = "glexec command not found."
status = 'WARNING' summary = "client cert file error: <error message>"
status = 'WARNING' summary = "executable can't be executed (126)"
status = 'WARNING' summary = "client error (201)"
status = 'WARNING' summary = "system error (202)"
status = 'WARNING' summary = "authorization error (203)"
status = 'UNKNOWN' summary = "exit code overlap (204)"
status = 'UNKNOWN' summary = "unrecognised exit code (<exit code>)"

org.sam.WN-CAver

Check the availability and validity of CA public certificates on the WN. It is run with the SAM-to-Nagios adapter samtest-run. The metric checks the versions of the CA RPMs installed on the WN and compares them with the reference ones. If the RPM check fails for any reason (e.g. a different installation method), it falls back to a test of the physical files (MD5 checksum comparison of all CA certs against the reference list).
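
The fallback file test described above amounts to comparing MD5 digests of the installed certificate files against a reference list. The sketch below illustrates that comparison only; it is not the code of org.sam.WN-CAver. The function names md5_of and find_mismatches, and the representation of the reference list as a dictionary of file name to hex digest, are assumptions.

```python
import hashlib
import os


def md5_of(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(reference, cert_dir):
    """Return the sorted names of files that are missing or differ.

    `reference` is a hypothetical dict mapping a certificate file name
    to its expected MD5 hex digest.
    """
    bad = []
    for name, expected in sorted(reference.items()):
        path = os.path.join(cert_dir, name)
        if not os.path.isfile(path) or md5_of(path) != expected:
            bad.append(name)
    return bad
```

An empty result means every file in the reference list is present with the expected checksum; a non-empty result names the files to investigate.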
Building reference CA DB for the metric

Adapted from https://docs.google.com/Doc?id=dhm26h7x_54jxrp4vf&invite&pli=1, written for "old SAM" (the file may have restrictive access rights; ask Wojciech.Lapka@NOSPAM-cern.ch for read permissions).

NB! The following steps are for the "Developer" role only. You'll need write access to svn+ssh://<your_account>@svn.cern.ch/reps/sam/trunk/probes.

When a new version of the CA is released, the SAM team is called on to update the tests and release them in production as soon as possible. The trigger is a GGUS ticket with the subject "CA update, version X.Y.Z-R". Official procedures are available at http://goc.grid.sinica.edu.tw/gocwiki/Procedure_for_new_CA_release

Developer
org.sam.WN-Bi

Check if BrokerInfo works. It is run with the SAM-to-Nagios adapter samtest-run. The procedure is the following:
org.sam.WN-Csh

Check if CSH works. It is run with the SAM-to-Nagios adapter samtest-run.

org.sam.WN-SoftVer

Detect the version of middleware installed on the WN. It is run with the SAM-to-Nagios adapter samtest-run. To detect the version, the lcg-version and glite-version commands are tried; if the commands are not available, the script exits with an error.

Execution on WN

On WNs, a statically compiled version of Nagios is used to execute the probes. For a metric to be launched by Nagios, a Nagios service object configuration must be created per metric. For org.sam.WN-* metrics, the configuration is defined in files located in /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam/etc/wn.d/org.sam/. For more details on how to integrate your WN probes/metrics with job submission see.

Troubleshooting

Increasing debugging on WN

See Log levels on WN.

Getting logs from WN

See Logs from WN.

Missing attributes in message body

No results arrive from WNs and you are sure that the problem is not with the brokers.

Symptom. WN checks are in PENDING. Messages reach the Nagios box. You see that they are consumed by msg-to-handler from the destinations

    $ grep destination /etc/msg-to-handler.d/*.conf

but either all or part of them do not get into the respective local directory queues

    $ grep CACHE_DIR /etc/msg-to-handler.d/*.conf

and, as a consequence, Nagios passive results don't reach the Nagios command file. You may see messages like the following in /var/log/messages:

    Oct 31 07:00:17 samnag013 msg-to-handler[23485]: [WARNING] msg-to-handler: could not handle message ID:gridmsg002.cern.ch-41281-1288255386367-4:1287764:-1:1:1: handler warning: Got error creating Nagios passive result: Nagios Parser ERROR: Missing attribute hostname. .

This message comes from one of the handlers defined for msg-to-handler in /etc/msg-to-handler.d/. NB! In this case [SAM:as of Friday, November 05 2010] the message handler doesn't print its name as it is identified in the respective /etc/msg-to-handler.d/*.conf. You'll have to "map" it yourself.
msg-to-handler subscribes with auto acknowledge, and in case of such a failure it doesn't re-send the messages to a "dead-queue". The best approach is to consume a couple of messages from the destination you suspect of having bad messages.

WN metrics in PENDING

There can be multiple reasons for that.

JobSubmit works. Only for some CEs WN metrics in PENDING.

Look into the job output (stdout/stderr) of the framework from the WN:

    /var/lib/gridprobes/<VO or FQAN>/org.sam/CE/<hostname>/jobOutput*/jobOutput_<jobID>/gridjob.out

This may give some hints on what is going on. Although the framework on the WN was designed to catch possible problems with its components, some unpredictable WN configurations can still lead to unexpected behavior.

Also check: Increasing debugging on WN, Getting logs from WN
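
When several CEs are affected, locating the relevant gridjob.out files by hand gets tedious. As a rough sketch, the path pattern given above can be expanded with a glob; the helper name find_job_outputs and its parameters are hypothetical, and the newest-first ordering is merely a convenience choice.

```python
import glob
import os


def find_job_outputs(vo_or_fqan, ce_host, base="/var/lib/gridprobes"):
    """List gridjob.out files for one VO/FQAN and CE host, newest first.

    The directory layout follows the path pattern quoted on this page;
    `base` can be overridden when inspecting a copy of the tree.
    """
    pattern = os.path.join(
        base, vo_or_fqan, "org.sam", "CE", ce_host,
        "jobOutput*", "jobOutput_*", "gridjob.out",
    )
    return sorted(glob.glob(pattern), key=os.path.getmtime, reverse=True)
```

An empty list for a given CE suggests that no job output was retrieved for it at all, which narrows the problem down to job submission or output retrieval rather than the WN framework itself.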
Document generated by Confluence on Feb 27, 2014 10:19