SAM Doc : CE
This page last changed on Dec 09, 2010 by kskaburs.
CE-probe

CE Metrics
org.sam.CE-JobState: Active + Passive check. By default the active check is assumed to be run hourly.
org.sam.CE-JobSubmit: Passive check.
org.sam.CE-JobMonit: Active check. By default it runs every 5 min.
Job submission

Submission and Monitoring
According to the WMS Job State Machine (link p.17), a job can be in the following states
On the WMS, two main parameters are responsible for timeouts in job matchmaking: MatchRetryPeriod and ExpiryPeriod.
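As a rough illustration of how the two WMS matchmaking timeouts, MatchRetryPeriod and ExpiryPeriod, interact, the sketch below counts matchmaking attempts under a simplified model: one attempt at submission and one more each time MatchRetryPeriod elapses, until ExpiryPeriod is reached. The function name and the model itself are assumptions made for illustration; this is not WMS code.

```python
def matchmaking_attempts(match_retry_period, expiry_period):
    """Count matchmaking attempts under a simplified, assumed model.

    Assumption: the WMS tries once at submission time (t = 0) and then
    once more every `match_retry_period` seconds, for as long as the
    job has not passed `expiry_period` seconds.
    """
    if match_retry_period <= 0 or expiry_period < 0:
        raise ValueError("periods must be positive numbers of seconds")
    return expiry_period // match_retry_period + 1


# Monitoring defaults quoted in this page: 58 min and 2 hours.
print(matchmaking_attempts(58 * 60, 2 * 60 * 60))

# Values suggested in this page for hourly submission: 22 min and 50 min.
print(matchmaking_attempts(1320, 3000))
```

Under this model both settings give three attempts; the shorter setting simply fits them within the one hour submission interval.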
The defaults allow a job to be matched at most three times within two hours after job submission. With the JDL

    JobType="Normal";
    ...
    RetryCount = 0;
    ShallowRetryCount = 1;
    Requirements = other.GlueCEInfoHostName == "<CE hostname>";

and a 1 hour interval between job submissions, it is advisable to set e.g. MatchRetryPeriod = 1320 (22 min) and ExpiryPeriod = 3000 (50 min). This way the WMS will naturally abort jobs if information about the CE isn't available in the IS.

In Nagios, job submission and monitoring were implemented in the following way.
Currently [SAM:Nov 23 2010] we are monitoring with all the defaults.

[Timeline diagram: the timeouts listed below, laid out along the time axis t]

    T_J_WAIT        - 45 min (--timeout-job-waiting)
    T_J_GLOB        - 55 min (--timeout-job-global)
    T_WMS_MatchRetr - 58 min (MatchRetryPeriod)
    T_N_SCHED       - 1 hour (Nagios metric scheduling)
    T_WMS_Exp       - 2 hours (ExpiryPeriod)
    T_J_SCHEDRUN    - 5h30min (--timeout-job-schedrun)
    T_J_DISCARD     - 6 hours (--timeout-job-discard)

Thus, in most cases we cancel jobs that are in Waiting due to no compatible resources when T_J_WAIT kicks in (after only one initial matchmaking in the WM) and issue CRITICAL for org.sam.CE-Job{State,Submit}. Moving to the case T_WMS_MatchRetr < T_WMS_Exp < T_J_WAIT (or even 2*T_WMS_MatchRetr < T_WMS_Exp < T_J_WAIT) is quite feasible. In that case, jobs sent to CEs which are not (properly) published in the IS will be naturally discarded by the WMS (Aborted; reason: no compatible resources). The monitoring metric (org.sam.CE-JobMonit) is ready to handle such cases and will issue CRITICAL against the CE.

Integration of third-party WN checks
The following CLI parameters to the org.sam.CE-JobState metric are available:

    --add-wntar-nag <d1,d2,..>
        Comma-separated list of top level directories with Nagios compliant
        directory structure to be added to the tarball sent to the WN.
    --add-wntar-nag-nosam
        Instructs the metric not to include standard SAM WN probes and their
        Nagios config in the WN tarball. (Default: WN probes are included)
    --add-wntar-nag-nosamcfg
        Instructs the metric not to include the Nagios configuration for SAM WN
        probes in the WN tarball. The probes themselves and the respective
        Python packages, however, will be included.
For this particular part of Nagios objects configuration and macros please see the following Nagios resources: Object Configuration Overview, Service definitions, Command definitions, Service dependency definitions, Nagios macros, Macros and resource file. Also, as an example, you might want to check the object configurations defined for org.sam WN checks in /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam/etc/wn.d/org.sam/ of the grid-monitoring-probes-org.sam RPM. The Nagios resource file that will be used on the WN is located on the UI at /usr/libexec/grid-monitoring/probes/org.sam/wnjob/nagios.d/etc/resource.cfg.

Example (only relevant CLI parameters are shown):

    ./CE-probe -m org.sam.CE-JobState \
        --add-wntar-nag /path/to/your/probes/wnjob/org.my/,/path/to/org.sam.sec

This will add to the WN tarball two sets of WN checks, provided by org.my and org.sam.sec. NB! org.sam WN checks and their respective Nagios configurations will still be added and launched on the WN as well!
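When the probe is launched from a wrapper script, the third-party options can be assembled programmatically. The following is a minimal sketch that only uses the --add-wntar-nag* options documented on this page; the helper name build_jobstate_args and its keyword arguments are hypothetical and not part of the probe.

```python
def build_jobstate_args(extra_dirs, include_sam=True, include_sam_cfg=True):
    """Build the CE-probe arguments for third-party WN checks.

    extra_dirs      -- top level directories with Nagios compliant structure
    include_sam     -- False adds --add-wntar-nag-nosam
    include_sam_cfg -- False adds --add-wntar-nag-nosamcfg
    """
    args = ["-m", "org.sam.CE-JobState"]
    if extra_dirs:
        # The option takes a single comma-separated list of directories.
        args += ["--add-wntar-nag", ",".join(extra_dirs)]
    if not include_sam:
        # Drops the SAM WN probes together with their Nagios configuration.
        args.append("--add-wntar-nag-nosam")
    elif not include_sam_cfg:
        # Keeps the SAM WN probes but drops their Nagios configuration.
        args.append("--add-wntar-nag-nosamcfg")
    return args


args = build_jobstate_args(
    ["/path/to/your/probes/wnjob/org.my/", "/path/to/org.sam.sec"]
)
print("./CE-probe " + " ".join(args))
```

Passing include_sam=False would be the way to ship only the third-party checks, since by default the org.sam WN checks are still included.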
Providing JDL
The JDL used by the framework is the following:

    [
    Type="Job";
    JobType="Normal";
    Executable = "<jdlExecutable>";
    StdError = "gridjob.out";
    StdOutput = "gridjob.out";
    Arguments = "<jdlArguments>";
    InputSandbox = {"<jdlInputSandboxExecutable>", "<jdlInputSandboxTarball>"};
    OutputSandbox = {"gridjob.out","wnlogs.tgz"};
    RetryCount = <jdlRetryCount>;
    ShallowRetryCount = <jdlShallowRetryCount>;
    Requirements = other.GlueCEInfoHostName == "<jdlReqCEInfoHostName>";
    ]

It is located in /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam.gridJob.jdl.template.
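To make the role of the <jdl...> placeholders concrete, here is a small sketch of how such a template could be filled in. It is an illustration only: the function render_jdl and the substitution rule (replace every <jdlName> for which a value is supplied, leave the rest untouched) are assumptions, not the framework's actual implementation, and the values passed in the example are made up.

```python
import re

TEMPLATE = """[
Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
Arguments = "<jdlArguments>";
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = other.GlueCEInfoHostName == "<jdlReqCEInfoHostName>";
]"""


def render_jdl(template, values):
    """Substitute <jdlName> placeholders with entries from `values`.

    Placeholders without a supplied value are left as they are, so a
    template may use only some of the substitutable parameters.
    """
    def replace(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)

    return re.sub(r"<(jdl\w+)>", replace, template)


print(render_jdl(TEMPLATE, {
    "jdlExecutable": "nagrun.sh",
    "jdlArguments": "-v ops -d /queue/example",  # illustrative values
    "jdlRetryCount": 0,
    "jdlShallowRetryCount": 1,
    "jdlReqCEInfoHostName": "ce03.esac.esa.int",
}))
```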
"Third-party" JDL
One can provide one's own JDL with the following parameter to the org.sam.CE-JobState metric:

    --jdl-templ <file>
        JDL template file (full path).

The provided JDL can itself be a template containing all or some of the substitutable parameters mentioned above. For better flexibility the RetryCount and ShallowRetryCount ClassAd attributes are exposed as parameters to the metric:

    --jdl-retrycount <val>
        JDL RetryCount (Default: 0).
    --jdl-shallowretrycount <val>
        JDL ShallowRetryCount (Default: 1).

Logs from WN
The default JDL's output sandbox defines two files that will be taken from the WN:

    OutputSandbox = {"gridjob.out","wnlogs.tgz"};
On the Nagios UI the OutputSandbox is stored per CE under /var/lib/gridprobes/<VO or FQAN>/<namespace>/<service>/<hostname>/jobOutput* directories. E.g.:

    # ls -1d /var/lib/gridprobes/ops.Role=lcgadmin/org.sam/CREAMCE/stremsel.nikhef.nl/jobOutput*
    /var/lib/gridprobes/ops.Role=lcgadmin/org.sam/CREAMCE/stremsel.nikhef.nl/jobOutput
    /var/lib/gridprobes/ops.Role=lcgadmin/org.sam/CREAMCE/stremsel.nikhef.nl/jobOutput.LAST
    #

jobOutput.LAST contains the last historical output from the WN.

Log levels on WN
The logging level of the framework and metrics on the WN is controlled by the following parameters to org.sam.*-JobState metrics:

    --wn-verb <0-3>
        Metrics verbosity level on WN. [-v <VERBOSITY>] (Default: 0)
    --wn-verb-fw <0-3>
        Framework verbosity level on WN (Default: 1)
Nagios and messaging on WN
The codebase for scheduling metrics on the WN and delivering metric results to the messaging system is located under /usr/libexec/grid-monitoring/probes/org.sam/wnjob/ (referred to below as /<WN_codebase>/).

nagrun.sh
On the WN (as specified in the JDL with Executable = "<jdlExecutable>") the nagrun.sh script (located on the Nagios UI in /<WN_codebase>/) is used to
Parameters to nagrun.sh, specified by Arguments = "<jdlArguments>", define what should be launched and how.

    nagrun.sh -h
    usage: nagrun.sh -v <vo> -d <dest> [-b <broker_uri>] [-n <broker_network>]
           [-t <timeout>] [-w <fw_verb>] [-l <hostname>] [-s <hostname>]
           [-z <metric_verb>] [-f <fqan>] [-i <host:port,..>] -B -R -N -h
    -v and -d are mandatory parameters.
    Defaults:
      <broker_network> - PROD
      <timeout> - 600 sec
      <metrics_verb> - 0
      <fw_verb> - 1 (2 - messages, 3 - Nagios config/stats/debug)
    -l <hostname> - LFC for replica tests
    -s <hostname> - SE for replica tests
    -i <host:port,..> - BDII(s) for tests
    -f <fqan> - VOMS FQAN
    -B - don't do broker discovery
    -R - take MB randomly; by default sort by min response time

In most cases these parameters are translations of the corresponding ones given to the org.sam.*-JobState metric.
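The translation from metric options to nagrun.sh arguments can be pictured with a short sketch. It only uses flags listed in the usage text above; the helper nagrun_arguments, its keyword names, and the quoting of the result are assumptions for illustration and do not reproduce the framework's own code.

```python
import shlex


def nagrun_arguments(vo, dest, broker_uri=None, timeout=None,
                     fw_verb=None, metric_verb=None, fqan=None):
    """Build an argument string for nagrun.sh (hypothetical helper).

    -v and -d are mandatory; the other flags are added only when a
    value is given, so nagrun.sh falls back to its own defaults.
    """
    if not vo or not dest:
        raise ValueError("-v <vo> and -d <dest> are mandatory")
    args = ["-v", vo, "-d", dest]
    optional = (
        ("-b", broker_uri),
        ("-t", timeout),
        ("-w", fw_verb),
        ("-z", metric_verb),
        ("-f", fqan),
    )
    for flag, value in optional:
        if value is not None:
            args += [flag, str(value)]
    # Quote each item so the result is safe to paste into a shell.
    return " ".join(shlex.quote(a) for a in args)


print(nagrun_arguments(
    "ops",
    "/queue/grid.probe.metricOutput.EGEE.sam-ne-roc-val_cern_ch",
    broker_uri="stomp://gridmsg002.cern.ch:6163/",
    fw_verb=3,
    fqan="/ops/Role=lcgadmin",
))
```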
MTA
The Message Transfer Agent (MTA) is used to send messages with metric results from the WN to Message Brokers. On the Nagios UI the code for the MTA, mta-simple, is located under /<WN_codebase>/bin/ and its implementation in /<WN_codebase>/lib/python2.3/site-packages/mig. MTA:
Messages with metric results are stored in the outgoing messages queue by a Nagios handler, handle_service_check, invoked by Nagios after execution of each check.

Troubleshooting

Manual job submission and monitoring
To submit a job manually with modified default command line parameters do the following. E.g.:

    # nagios-run-check -d -v -s org.sam.CE-JobState-/ops/Role=lcgadmin -H ce03.esac.esa.int
    Executing command:
    su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "ce03.esac.esa.int" -t 600 --vo ops --mb-destination /queue/grid.probe.metricOutput.EGEE.sam-ne-roc-val_cern_ch -x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin --vo-fqan /ops/Role=lcgadmin --mb-uri stomp://gridmsg002.cern.ch:6163/ --prev-status $LASTSERVICESTATEID$ --wn-se-rep-file GoodSEs --wn-se-rep fake.srm1,samdpm002.cern.ch -m org.sam.CE-JobState --err-topics ce_wms,default'
    #

Take the produced command, modify/add/delete parameters as required and run it. For example, suppose we want to increase the verbosity level of the framework on the WN. First, you need to remove all options with parameters of the form $SOMENAMEHERE$, as those are Nagios macros and are valid only when the check is run under Nagios. In the example above we remove --prev-status $LASTSERVICESTATEID$. We also add --wn-verb-fw 3 to set the maximum verbosity level for the framework on the WN.
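Removing the macro-valued options by hand is error-prone on long command lines, so the step can be sketched in a few lines of Python. This is an illustrative helper, not part of the SAM tooling: the name strip_nagios_macros is hypothetical, and it assumes that every macro appears as a separate $NAME$ token directly after the option it belongs to.

```python
import re
import shlex

# A Nagios macro token such as $LASTSERVICESTATEID$.
MACRO = re.compile(r"^\$[A-Z0-9_:]+\$$")


def strip_nagios_macros(command):
    """Return the argument list of `command` without macro-valued options."""
    kept = []
    for token in shlex.split(command):
        if MACRO.match(token):
            # Drop the macro and the option that introduced it.
            if kept and kept[-1].startswith("-"):
                kept.pop()
            continue
        kept.append(token)
    return kept


probe_cmd = ('/usr/libexec/grid-monitoring/probes/org.sam/CE-probe '
             '-H "ce03.esac.esa.int" -t 600 --vo ops '
             '--prev-status $LASTSERVICESTATEID$ '
             '-m org.sam.CE-JobState --err-topics ce_wms,default')

# Extra options are appended after the macros have been removed.
args = strip_nagios_macros(probe_cmd) + ["--wn-verb-fw", "3"]
print(" ".join(shlex.quote(a) for a in args))
```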
So, the resulting command will be

    # su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "ce03.esac.esa.int" -t 600 --vo ops --mb-destination /queue/grid.probe.metricOutput.EGEE.sam-ne-roc-val_cern_ch -x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin --vo-fqan /ops/Role=lcgadmin --mb-uri stomp://gridmsg002.cern.ch:6163/ --wn-se-rep-file GoodSEs --wn-se-rep fake.srm1,samdpm002.cern.ch -m org.sam.CE-JobState --err-topics ce_wms,default --wn-verb-fw 3 -v 3'
    OK: Active job - Submitted [2010-12-09T09:53:16Z]
    OK: Active job - Submitted [2010-12-09T09:53:16Z]
    Testing from: samnag011.cern.ch
    DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=gridmon/CN=137254/CN=Robot: Grid Monitoring-Framework
    VOMS FQANs: /ops/Role=lcgadmin/Capability=NULL, /ops/Role=NULL/Capability=NULL
    Reading configuration file(s).
    Configuration files: /etc/gridmon/org.sam.conf
    Reading from configuration file(s): /etc/gridmon/org.sam.conf
    Parsing command line parameters.
    Loading active job's description ... done.
    >>> Active job found
    file: /var/lib/gridprobes/ops.Role=lcgadmin/org.sam/CE/ce03.esac.esa.int/activejob.map
    serviceDesc : org.sam.CE-JobSubmit-/ops/Role=lcgadmin
    jobState : Submitted
    submitTimeStamp : 1291888396
    jobID : https://wms220.cern.ch:9000/02RCPJTL3CehL-Ldi66MMg
    hostNameCE : ce03.esac.esa.int
    lastStateTimeStamp : 1291888396
    https://wms220.cern.ch:9000/02RCPJTL3CehL-Ldi66MMg
    #

However, the probe reported that there is already an active job. (Note that -v 3 was added to see more debugging output during job submission.) You can either simply delete the active job bookkeeping file /var/lib/gridprobes/ops.Role=lcgadmin/org.sam/CE/ce03.esac.esa.int/activejob.map (it acts as a lock).
Or, starting from grid-monitoring-probes-org.sam-0.1.18, you can run the JobCancel metric to gracefully cancel the job:

    # su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "ce03.esac.esa.int" -m org.sam.CE-JobCancel -x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin --vo-fqan /ops/Role=lcgadmin --vo ops'
    OK: job cancelled
    OK: job cancelled
    Testing from: samnag011.cern.ch
    DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=gridmon/CN=137254/CN=Robot: Grid Monitoring-Framework
    VOMS FQANs: /ops/Role=lcgadmin/Capability=NULL, /ops/Role=NULL/Capability=NULL
    Job cancellation request sent:
    glite-wms-job-cancel --noint -i /var/lib/gridprobes/ops.Role=lcgadmin/org.sam/CE/ce03.esac.esa.int/jobID
    Job bookkeeping files deleted.
    #

After you submit a new job with the required parameters, it can be monitored via the Nagios interface or manually with the JobMonit check. To find out which command line to run, do the following, which is similar to what was done for the JobState metric. -H sam-ne-roc-val.cern.ch should be the hostname of your Nagios instance.

    # nagios-run-check -d -v -s org.sam.CE-JobMonit-/ops/Role=lcgadmin -H sam-ne-roc-val.cern.ch
    Executing command:
    su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "sam-ne-roc-val.cern.ch" -t 600 --vo ops -x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin --vo-fqan /ops/Role=lcgadmin -m org.sam.CE-JobMonit --err-topics ce_wms'
    #

Then, add --hosts ce03.esac.esa.int to monitor only the required hosts (a comma-separated list is possible) and run the command.
    # su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "sam-ne-roc-val.cern.ch" -t 600 --vo ops -x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin --vo-fqan /ops/Role=lcgadmin -m org.sam.CE-JobMonit --err-topics ce_wms --hosts ce03.esac.esa.int'
    OK: Jobs processed - 1
    OK: Jobs processed - 1
    [Running] : 1|jobs_processed=1;; DONE=0;; RUNNING=1;; SCHEDULED=0;; SUBMITTED=0;; READY=0;; WAITING=0;; WAITING-CANCELLED=0;; WAITING-CANCEL=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2
    #

We can see that the job is in the Running state. The output after "|" is Nagios performance data. When the job enters a terminal state, JobMonit will fetch the job's output sandbox to the Nagios UI. For the location of the job output see Logs from WN.
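When checking many CEs by hand it can help to turn the performance data after "|" into per-state counters. The sketch below does that for output shaped like the JobMonit example; the function parse_perfdata is a hypothetical helper, and it assumes space-separated label=value;warn;crit items with plain integer values and no units, which holds for the output shown here but not for Nagios performance data in general.

```python
def parse_perfdata(plugin_output):
    """Extract {label: value} from the part of the output after '|'."""
    _, separator, perfdata = plugin_output.partition("|")
    counters = {}
    if not separator:
        return counters  # no performance data in this output
    for item in perfdata.split():
        label, _, rest = item.partition("=")
        # The value is the first ';'-separated field; the following
        # fields are the warning and critical thresholds, if any.
        counters[label] = int(rest.split(";")[0])
    return counters


line = ("[Running] : 1|jobs_processed=1;; DONE=0;; RUNNING=1;; "
        "WAITING=0;; ABORTED=0;; unknown=0;1;2")
counters = parse_perfdata(line)
active = [state for state, count in counters.items()
          if count and state != "jobs_processed"]
print(active)
```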
Document generated by Confluence on Feb 27, 2014 10:19