SAM Doc : WN

This page last changed on Sep 26, 2011 by imamagic.

WN-probe
WN Metrics

org.sam.WN-Rep

org.sam.WN-RepISenv
org.sam.WN-RepFree
org.sam.WN-RepCr
org.sam.WN-RepGet
org.sam.WN-RepRep
org.sam.WN-RepDel

org.sam.WN-PyVer
org.sam.glexec.WN-gLExec
org.sam.WN-CAver

Building reference CA DB for the metric

org.sam.WN-Bi
org.sam.WN-Csh
org.sam.WN-SoftVer

Execution on WN
Troubleshooting

Increasing debugging on WN
Getting logs from WN
Missing attributes in message body
WN metrics in PENDING

JobSubmit works. Only for some CEs WN metrics in PENDING.

WN-probe

WN-probe acts as a container for WN metrics. The -m parameter is used to invoke a particular metric.

samtest-run wrapper and standard SAM tests

WN Metrics

Metrics	Description
org.sam.WN-Rep	Wrapper check to launch the replica management checks and publish passive check results to Nagios.
org.sam.WN-RepISenv	Check if LCG_GFAL_INFOSYS variable is set.
org.sam.WN-RepFree	Check if Close (or VO default) SE has any free space left according to the information system.
org.sam.WN-RepCr	Copy and register a file to the Close (or default) SE into default space area.
org.sam.WN-RepGet	Copy the file back from Close SE to the WN. Compare the files.
org.sam.WN-RepRep	Replicate the file from SE to a given 'central' SE. Retrieve list of replicas.
org.sam.WN-RepDel	Delete given file(s) from SRM.
org.sam.WN-PyVer	Check version of Python installed on WN.
org.sam.glexec.WN-gLExec	Test gLExec on WN with a special FQAN role.
org.sam.WN-Bi	Check if BrokerInfo works.
org.sam.WN-CAver (obsoleted from Update-14)	Check availability of the latest CAs public certificates on WN.
org.sam.WN-Csh	Check if CSH works.
org.sam.WN-SoftVer	Detect the version of middleware installed on the WN.

Metrics are written to use the ldap/lcg_util/gfal/lfc APIs. However, if the respective API fails to load, the code falls back to using the ldap/lcg_util/lfc CLIs.
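The API-to-CLI fallback described above can be sketched as follows. This is a minimal illustration, not the probe's code: the function name pick_backend and the injectable which argument are hypothetical, and the module and command names passed in are up to the caller.

```python
import importlib
import shutil


def pick_backend(module_name, cli_name, which=shutil.which):
    """Prefer the Python API; fall back to the CLI if the import fails.

    Returns ("api", module) or ("cli", path_to_command).
    """
    try:
        return "api", importlib.import_module(module_name)
    except ImportError:
        # The API binding could not be loaded: look for the command instead.
        path = which(cli_name)
        if path is None:
            raise RuntimeError(
                "neither module %s nor command %s is available"
                % (module_name, cli_name)
            )
        return "cli", path
```

For example, pick_backend("lcg_util", "lcg-cr") would return the imported module where the binding is installed and the path of the lcg-cr command otherwise.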

For all metrics, the detailed output contains either the parameters of the API call or the command line used to test the service.

org.sam.WN-Rep

Wrapper check to launch the replica management checks and publish passive check results to Nagios. The order and dependencies are statically defined in the WN-probe (WNMetrics._metrics dictionary attribute). This is the metrics dependency tree:

        0:Rep (wrapper)
          |
     1:RepISenv
       ^     ^
      /       \
 2:RepFree   __3:RepCr____
              ^   ^      ^
             /    |       \
       4:RepGet 5:RepRep 6:RepDel
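The tree above can be written down as a small dependency table. The sketch below is an illustration only: the actual layout of WNMetrics._metrics is not shown on this page, so DEPENDS_ON, ORDER and runnable are hypothetical names, and skipping a metric whose parent failed is an assumption about how the dependency is used.

```python
# Hypothetical encoding of the tree: metric -> the metric it depends on.
DEPENDS_ON = {
    "RepISenv": "Rep",
    "RepFree": "RepISenv",
    "RepCr": "RepISenv",
    "RepGet": "RepCr",
    "RepRep": "RepCr",
    "RepDel": "RepCr",
}

# Execution order, following the numbering 0..6 in the tree.
ORDER = ["Rep", "RepISenv", "RepFree", "RepCr", "RepGet", "RepRep", "RepDel"]


def runnable(failed):
    """Return the metrics, in order, whose ancestors all succeeded.

    `failed` is the set of metrics that ran and failed. A failed metric is
    still listed (it did run); its descendants are skipped.
    """
    skipped = set()
    result = []
    for metric in ORDER:
        parent = DEPENDS_ON.get(metric)
        if parent in failed or parent in skipped:
            skipped.add(metric)
        else:
            result.append(metric)
    return result
```

Under these assumptions, a failure of RepCr leaves RepFree unaffected but skips RepGet, RepRep and RepDel.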

To pass the name of the file registered in LFC by WN-RepCr to subsequent tests, a simple file-based IPC is used: the file name in LFC (lfn) is stored in a text file under the metric's work directory.
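A file-based hand-over of this kind can be sketched in a few lines. The file name lfn.txt and the helper names are hypothetical; the page does not say what the probe calls the file.

```python
import os

LFN_FILE = "lfn.txt"  # hypothetical name of the text file in the work directory


def save_lfn(workdir, lfn):
    """Called by the creating metric: remember the registered LFN."""
    with open(os.path.join(workdir, LFN_FILE), "w") as handle:
        handle.write(lfn + "\n")


def load_lfn(workdir):
    """Called by later metrics: read the LFN back."""
    with open(os.path.join(workdir, LFN_FILE)) as handle:
        return handle.read().strip()
```

A later metric that finds no such file can conclude that the file was never registered.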

org.sam.WN-RepISenv

Check if LCG_GFAL_INFOSYS variable is set.

org.sam.WN-RepFree

Check if VO default SE has any free space left according to the information system (LCG_GFAL_INFOSYS).

Returns CRITICAL if:

0 of total space is published for both GlueSA{FreeOnlineSize,StateAvailableSpace}
no space is published for both GlueSA{FreeOnlineSize,StateAvailableSpace}
there is at least one SA, that the VO has access to
- with GlueSALocalID VO:<VO> or VOMS:<FQAN> and with no GlueSAFreeOnlineSize attribute
- with a GlueSAStateAvailableSpace attribute with a value not a positive integer
- with a GlueSAFreeOnlineSize attribute with a value not a positive integer

Returns WARNING if:

for an SA with GlueSALocalID != <VO>, an FQAN-based ACBR is not published
connection to BDII fails
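The per-SA part of the CRITICAL conditions can be sketched as a small classifier. This is a simplified reading of the list above, not the probe's logic: it looks at one storage area the VO has access to, takes its attributes as a plain dictionary (a hypothetical input format), and does not cover the WARNING cases.

```python
def is_positive_int(value):
    """True if value parses as an integer greater than zero."""
    try:
        return int(value) > 0
    except (TypeError, ValueError):
        return False


def classify_sa(attrs):
    """Classify one storage area from its published Glue attributes.

    `attrs` maps attribute names to the published string values.
    """
    free = attrs.get("GlueSAFreeOnlineSize")
    available = attrs.get("GlueSAStateAvailableSpace")
    if free is None:
        return "CRITICAL"  # no GlueSAFreeOnlineSize attribute
    if not is_positive_int(free):
        return "CRITICAL"  # GlueSAFreeOnlineSize is not a positive integer
    if available is not None and not is_positive_int(available):
        return "CRITICAL"  # GlueSAStateAvailableSpace is not a positive integer
    return "OK"
```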

org.sam.WN-RepCr

Copy and register a file to the VO default SE into default space area.

The metric sets LFC_HOST to a default LFC (hard-coded in the probe) or to the one provided with the --lfc parameter. Because file registration requires a writable LFC, the metric first attempts a "write" to the given LFC. If the attempt fails and the reason for the failure is understood, the test terminates with the corresponding exit code (determined by the mapping in the errors DB) and an error message. Otherwise, LFCs are discovered via LCG_GFAL_INFOSYS and a "write" operation is tried on each LFC found. A working writable LFC is then used with lcg_cr3() or lcg-cr. To store the file, only the SE name is given (w/o any path).
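The LFC selection described above can be sketched as follows. All names here are hypothetical: try_write stands for the "write" attempt and returns a pair (succeeded, failure_understood), and discover stands for the lookup in the information system.

```python
class UnderstoodFailure(Exception):
    """The write failed for a known reason; the test should terminate."""


def choose_writable_lfc(default_lfc, try_write, discover):
    """Return a writable LFC host, or None if none was found."""
    ok, understood = try_write(default_lfc)
    if ok:
        return default_lfc
    if understood:
        # The probe would map the error to an exit code and message here.
        raise UnderstoodFailure(default_lfc)
    # Failure not understood: try every LFC found in the information system.
    for host in discover():
        ok, _ = try_write(host)
        if ok:
            return host
    return None
```

What the probe does when discovery yields no writable LFC is not stated above; returning None is a placeholder for that case.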

The LFC can be specified via

--wn-lfc <hostname> to org.sam.CE-JobState metric during job submission.
wn_lfc = <hostname> in /etc/gridmon/org.sam.conf (section [SAM:ce_metrics]) on the submitting Nagios instance, or in a configuration file given with --config <file1,>.
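As an illustration of the configuration-file variant, the fragment below shows the section and key named above and reads them back with Python's standard configparser. The host name lfc.example.org is a placeholder, and whether the probes parse the file this way is an assumption.

```python
import configparser

# Example content for /etc/gridmon/org.sam.conf (host name is a placeholder).
SAMPLE = """
[SAM:ce_metrics]
wn_lfc = lfc.example.org
"""


def read_wn_lfc(text):
    """Return the wn_lfc value from the [SAM:ce_metrics] section, or None."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser.get("SAM:ce_metrics", "wn_lfc", fallback=None)
```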

org.sam.WN-RepGet

Copy the file back from SE to the WN (lcg_cp()/lcg-cp). Compare the files (with diff). Both operations are critical.

org.sam.WN-RepRep

Replicate the file from SE to a given 'central' replication SE. Retrieve list of replicas.

The default replication SE is hard-coded in WN-probe. --se-rep <se1,..> can be used to define a list of SEs.

The SEs are usually defined at job submission time with org.sam.CE-JobState using --wn-se-rep <se1,..> or --wn-se-rep-file <name>. This is exposed via YAIM through two variables, which provide static and/or dynamic mechanisms:

JOBSUBMIT_WN_SE_REP can be defined as a comma-separated list of hostnames; this is the static mechanism for defining replication SEs.
JOBSUBMIT_WN_SE_REP_FILE, if specified, should be a file name (w/o path; the path is generated dynamically by the respective metrics based on the VO and/or FQAN for which the metrics are defined). The file is filled in with a list of SEs defined on the Nagios instance that recently passed the org.sam.SRM-All set of tests. Setting the variable triggers execution of the local hr.srce.GoodSEs check, which generates the list of "good" SEs, and passes the file as an input parameter to the org.sam.{CREAM}CE-JobState metric(s). The latter take at most 3 hosts from the file and, if JOBSUBMIT_WN_SE_REP was defined, append them to the static list.

On the WN, org.sam.WN-RepRep tries to replicate to the SEs in the provided order until a replication succeeds. The metric returns CRITICAL if the file could not be replicated to any of the SEs.
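The list building and the try-in-order behaviour can be sketched as below. The function names are hypothetical, and replicate stands for whatever call performs one replication and reports success.

```python
def build_se_list(static_ses, good_ses, max_dynamic=3):
    """Static SEs first, then at most `max_dynamic` SEs from the "good" list."""
    return list(static_ses) + list(good_ses)[:max_dynamic]


def replicate_to_first(ses, replicate):
    """Try the SEs in order; stop at the first successful replication.

    Returns ("OK", se) on success and ("CRITICAL", None) if every SE failed.
    """
    for se in ses:
        if replicate(se):
            return "OK", se
    return "CRITICAL", None
```

With one static SE and five "good" SEs, the resulting list has four entries, and the metric only fails if all four replications fail.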

org.sam.WN-RepDel

Delete given file(s) from SRM.

org.sam.WN-PyVer

Check version of Python installed on WN.

org.sam.glexec.WN-gLExec

WN test which tries to execute

export GLEXEC_CLIENT_CERT=$X509_USER_PROXY; /<path_to>/glexec /usr/bin/id -a

A separate CE job submission/monitoring is used (org.sam.glexec.CE-Job{State,Submit,Monit}-<FQAN>) to deliver the test to WNs. OPS VO Role=pilot is used to submit the jobs.

Currently [SAM:Nov 23 2010], the test exit codes (Nagios) and summaries are as follows:

   status = 'OK'
   summary = 'success'

   status = 'UNKNOWN'
   summary = "glexec command not found."

   status = 'WARNING'
   summary = "client cert file error: <error message>"

   status = 'WARNING'
   summary = "executable can't be executed (126)"

   status = 'WARNING'
   summary = "client error (201)"

   status = 'WARNING'
   summary = "system error (202)"

   status = 'WARNING'
   summary = "authorization error (203)"

   status = 'UNKNOWN'
   summary = "exit code overlap (204)"

   status = 'UNKNOWN'
   summary = "unrecognised exit code (<exit code>)"
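The numeric part of the list above amounts to a lookup table from the glexec exit code to a Nagios status and summary. The sketch below covers only the cases that carry an exit code; treating exit code 0 as 'success' is an assumption, and the names are hypothetical.

```python
GLEXEC_EXIT_MAP = {
    0: ("OK", "success"),  # assumption: exit code 0 means success
    126: ("WARNING", "executable can't be executed (126)"),
    201: ("WARNING", "client error (201)"),
    202: ("WARNING", "system error (202)"),
    203: ("WARNING", "authorization error (203)"),
    204: ("UNKNOWN", "exit code overlap (204)"),
}


def classify_glexec_exit(code):
    """Map a glexec exit code to (Nagios status, summary)."""
    default = ("UNKNOWN", "unrecognised exit code (%d)" % code)
    return GLEXEC_EXIT_MAP.get(code, default)
```

The "glexec command not found" and "client cert file error" cases are detected before or around the call itself and are therefore not part of this table.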

org.sam.WN-CAver

Check availability and validity of the CA public certificates on the WN. It's run with the SAM-to-Nagios adapter samtest-run.

Check the versions of the CA RPMs installed on the WN and compare them with the reference ones. If the RPM check fails for any reason (e.g. another installation method was used), fall back to a test of the physical files (MD5 checksum comparison of all CA certs against the reference list).
This metric returns OK if:

the installed CA RPMs are identical to the references.
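The physical-files fallback (MD5 comparison against a reference list) can be sketched with the standard hashlib module. The reference format used here, a dictionary from file name to expected MD5 hex digest, is hypothetical and is not the format of the probe's reference DB.

```python
import hashlib
import os


def md5_of(path):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_certs(cert_dir, reference):
    """Return the names of reference files that are missing or differ."""
    bad = []
    for name, expected in sorted(reference.items()):
        path = os.path.join(cert_dir, name)
        if not os.path.isfile(path) or md5_of(path) != expected:
            bad.append(name)
    return bad
```

An empty result means every file in the reference list is present with the expected checksum.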

Building reference CA DB for the metric

Adapted from https://docs.google.com/Doc?id=dhm26h7x_54jxrp4vf&invite&pli=1 written for "old SAM" (the file may have restrictive access rights; ask Wojciech.Lapka@NOSPAM-cern.ch for read permissions). NB! The following steps are for the "Developer" role only. You'll need write access to svn+ssh://<your_account>@svn.cern.ch/reps/sam/trunk/probes.

When a new version of the CA is released, the SAM team is called on to update the tests and release them to production as soon as possible. The trigger is a GGUS ticket with the subject "CA update, version X.Y.Z-R".

The official procedure is available at http://goc.grid.sinica.edu.tw/gocwiki/Procedure_for_new_CA_release

Developer

accept the GGUS ticket
- set it to "in progress"
contact the Integration Team
- for synchronization, just to make sure the operation goes smoothly
create a Feature Request (JIRA) with the summary "CA release n.m", giving the GGUS ticket ID as reference
assign the JIRA ticket to a developer
the developer starts working and sets the bug's status to 'in progress'

get new CA rpms from the URL in the GGUS ticket

    cd <newCA_RPMDIR>
    mkdir newCA
    cd newCA    # wget -nd saves into the current directory
    wget -r -nd -np http://www.cern.ch/groep/cadist/lcgpreview-VERSION/RPMS.lcg/
       
         OR

    wget -r -nd -np http://lcg-igtf.ndpf.info/distribution/lcg/lcgpreview-VERSION/RPMS.lcg/

    Endpoint should be provided in the initial ticket.

NOTE: If recursive download of the RPMs is not allowed, use this command instead

   wget -np -nd -r -B http://groep.web.cern.ch/groep/cadist/lcgpreview-VERSION/RPMS.lcg/ -i index.html -F

after the index.html file has been downloaded with the previous command. To check that at least the
number of files is the same:

     grep '\.rpm' index.html | wc -l
     ls -la ./*.rpm | wc -l

Both commands should print the same number.
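A stricter check than comparing counts is to compare names. The sketch below lists the RPMs that index.html links to but that were not downloaded; it uses only the standard library, and the names RpmLinks and missing_rpms are hypothetical. It assumes the index is a plain HTML directory listing with one link per RPM.

```python
import os
from html.parser import HTMLParser


class RpmLinks(HTMLParser):
    """Collect the file names of all links that end in .rpm."""

    def __init__(self):
        super().__init__()
        self.rpms = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.endswith(".rpm"):
                self.rpms.add(os.path.basename(value))


def missing_rpms(index_html, downloaded):
    """Return RPM names linked in the index but absent from `downloaded`."""
    parser = RpmLinks()
    parser.feed(index_html)
    return sorted(parser.rpms - set(downloaded))
```

In the download directory this could be called as missing_rpms(open("index.html").read(), os.listdir(".")).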

check out the code from SVN

svn co svn+ssh://<your_account>@svn.cern.ch/reps/sam/trunk/probes

create new CA data and the respective configuration files

    cd probes/src/wnjob/org.sam/probes/org.sam/sam/
    ./CE-sft-caver -a <newCA_RPMDIR/newCA>

(You should see no errors reported for RPMs named 'ca_<CANAME>'.
The rest (e.g. lcg-CA-<VERSION>, ca_patch_eugridpma_gridppvuln-, ca_policy_igtf-mics-, ca_policy_igtf-slcs-,
ca_policy_igtf-classic-) are not recognized as CA RPMs, so errors reported for them are expected.)

    ./CE-sft-caver -m

=> answer 's' at the 1st occurrence of an RPM from the release 2(!) versions back

(e.g. new release: 13.1 => 's' for 11.*)

RPMs of that release will not be queried anymore

=> answer 'y' to RPM release numbers from the previous release
=> answer 'y' to RPM release numbers from the new release

     ./CE-sft-caver -T     # Update timestamp in config file to this very moment

Check that
    1. the warning timeout
    2. the critical error timeout
are set as they are supposed to be. If not, use the -w and --timeout flags (see ./CE-sft-caver -h for parameters)
to set the correct values, e.g.

        ./CE-sft-caver --timeout 192 # (8*24)

Result: up-to-date 
probes/src/wnjob/org.sam/probes/org.sam/sam/ca_data.dat
  and
probes/src/wnjob/org.sam/probes/org.sam/sam/ca_data.conf

test the new CA test in Testing
commit the changes to SVN trunk
tag
build RPM from tag
"resolve" the JIRA issue

org.sam.WN-Bi

Check if BrokerInfo works. It's run with SAM-to-Nagios adapter samtest-run.

The procedure is the following:

First, check whether the BrokerInfo file is defined in one of the $GLITE_WMS_RB_BROKERINFO, $GLITE_WL_RB_BROKERINFO or $EDG_WL_RB_BROKERINFO variables.
Then try to get the CE host name using the edg-brokerinfo getCE or glite-brokerinfo getCE command, respectively. If the command's exit value is different from 0, the test fails.
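The variable lookup can be sketched as a pure function over the environment. The pairing of variables with commands (GLITE_* with glite-brokerinfo, EDG_* with edg-brokerinfo) is an assumption based on the name prefixes, as is the order in which the variables are checked; the names CANDIDATES and pick_brokerinfo are hypothetical.

```python
# Assumed pairing of environment variable and command, checked in this order.
CANDIDATES = [
    ("GLITE_WMS_RB_BROKERINFO", "glite-brokerinfo"),
    ("GLITE_WL_RB_BROKERINFO", "glite-brokerinfo"),
    ("EDG_WL_RB_BROKERINFO", "edg-brokerinfo"),
]


def pick_brokerinfo(env):
    """Return (variable, command line) for the first variable that is set.

    `env` is a mapping such as os.environ. Returns None if no variable is set.
    """
    for variable, command in CANDIDATES:
        if env.get(variable):
            return variable, [command, "getCE"]
    return None
```

The returned command line could then be run with subprocess, with a non-zero exit status treated as a failed test.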

org.sam.WN-Csh

Check if CSH works. It's run with SAM-to-Nagios adapter samtest-run.

org.sam.WN-SoftVer

Detect the version of middleware installed on the WN. It's run with SAM-to-Nagios adapter samtest-run.

To detect the version, the lcg-version and glite-version commands are tried; if the commands are not available, the script exits with an error.

Execution on WN

On WNs, a statically compiled version of Nagios is used to execute the probes. For a metric to be launched by Nagios, a Nagios service object configuration must be created per metric. For org.sam.WN-* metrics, the configuration is defined in configuration files located in /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam/etc/wn.d/org.sam/. For more details on how to integrate your WN probes/metrics with job submission see.

Troubleshooting

Increasing debugging on WN

See Log levels on WN.

Getting logs from WN

See Logs from WN.

Missing attributes in message body

There are no results from WNs and you are sure that the problem is not with the brokers.

Symptom: WN checks are in PENDING. Messages reach the Nagios box, and you can see that they are consumed by msg-to-handler from the destinations

$ grep destination /etc/msg-to-handler.d/*.conf

but all or some of them do not get into the respective local directory queues

$ grep CACHE_DIR /etc/msg-to-handler.d/*.conf

and, as a consequence, Nagios passive results don't reach the Nagios command file.

You may see similar messages in /var/log/messages:

Oct 31 07:00:17 samnag013 msg-to-handler[23485]: [WARNING] msg-to-handler: could not handle message
ID:gridmsg002.cern.ch-41281-1288255386367-4:1287764:-1:1:1: handler warning: Got error creating Nagios
passive result: Nagios Parser ERROR: Missing attribute hostname. .

This message is from one of the handlers defined for msg-to-handler in /etc/msg-to-handler.d/. NB! In this case [SAM:as of Friday, November 05 2010] the message handler doesn't print its name as it is identified in the respective /etc/msg-to-handler.d/*.conf. You'll have to "map" it yourself.

msg-to-handler subscribes with auto acknowledge, and in case of such a failure it doesn't re-send the messages to a "dead-queue".

The best approach is to consume a couple of messages from the destination you suspect of carrying bad messages.
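To see which attributes are reported missing, and how often, the handler warnings in /var/log/messages can be tallied. The sketch below relies only on the "Missing attribute <name>" wording shown in the log excerpt above; the function name is hypothetical.

```python
import re
from collections import Counter

MISSING_ATTR = re.compile(r"Missing attribute (\w+)")


def missing_attributes(log_text):
    """Count 'Missing attribute <name>' occurrences in syslog text."""
    # Collapse whitespace so that wrapped lines do not split the phrase.
    flat = " ".join(log_text.split())
    return Counter(MISSING_ATTR.findall(flat))
```

Calling missing_attributes(open("/var/log/messages").read()) gives a per-attribute count, which helps to tell a single malformed message from a systematic problem with one publisher.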

WN metrics in PENDING

There can be multiple reasons for that.

JobSubmit works. Only for some CEs WN metrics in PENDING.

Look into job output (stdout/stderr) of the framework from WN:

/var/lib/gridprobes/<VO or FQAN>/org.sam/CE/<hostname>/jobOutput*/jobOutput_<jobID>/gridjob.out

This may give some hints on what is going on. Although the framework on the WN was designed to catch possible problems with its components, some unpredictable WN configurations can still lead to unexpected behavior.

Also check: Increasing debugging on WN, Getting logs from WN

Document generated by Confluence on Feb 27, 2014 10:19