
CE-probe

CE Metrics

Name                   Description
org.sam.CE-JobState    Active + Passive. Submits a grid job to the CE.
org.sam.CE-JobMonit    Active. Monitors grid jobs submitted to CEs.
org.sam.CE-JobSubmit   Passive. Holds the terminal status of job submission to the CE.

org.sam.CE-JobState

Active + Passive check. By default the active check is assumed to run hourly.

  • submits a grid job to the CE
    • stores the active job's attributes in /<metric work dir>/activejob.map. The file acts as a lock and prevents submission of the next job. activejob.map is removed in two cases: 1) by org.sam.CE-JobMonit when the job enters a terminal state; 2) by org.sam.CE-JobState when the global timeout for a running job is exceeded (--timeout-job-discard, default 6 h). To force the next submission one can simply remove the file (see the sketch after this list).
  • accepts passive check results (from org.sam.CE-JobMonit) for the submitted grid job
    • holds the status of the grid job. When the grid job enters a terminal state (as seen by org.sam.CE-JobMonit) its status is passively updated by org.sam.CE-JobMonit according to the job's state.
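
A minimal sketch of this lock-file logic, assuming the activejob.map record layout described under "Job submission" below (submitTimeStamp|hostNameCE|serviceDesc|jobID|jobState|lastStateTimeStamp); may_submit() is an illustrative helper, not the framework's actual API:

  import os, time

  TERMINAL_STATES = ('Done', 'Aborted', 'Cancelled', 'Cleared')
  TIMEOUT_JOB_DISCARD = 6 * 3600          # --timeout-job-discard default (21600 s)

  def may_submit(activejob_map):
      """Decide whether org.sam.CE-JobState may submit a new grid job."""
      if not os.path.exists(activejob_map):
          return True                     # no lock file: submit
      # record layout: submitTimeStamp|hostNameCE|serviceDesc|jobID|jobState|lastStateTimeStamp
      fields = open(activejob_map).read().strip().split('|')
      submit_ts, job_state = int(fields[0]), fields[4]
      if job_state in TERMINAL_STATES:
          os.remove(activejob_map)        # job finished: discard and resubmit
          return True
      if time.time() - submit_ts > TIMEOUT_JOB_DISCARD:
          os.remove(activejob_map)        # global timeout exceeded: discard
          return True
      return False                        # an active job exists: skip submission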

org.sam.CE-JobSubmit

Passive check.

  • holds the terminal status of job submission to the CE (mapping from gLite job terminal states ('Done', 'Aborted', 'Cancelled') to Nagios statuses (OK, WARNING, CRITICAL, UNKNOWN))
  • passively updated by org.sam.CE-JobMonit.

org.sam.CE-JobMonit

Active check. By default it runs every 5 minutes.

  • monitors the status of all submitted jobs (as defined in activejob.map files) and updates the states of the org.sam.CE-JobState and org.sam.CE-JobSubmit metrics. Acts as a babysitter for all grid jobs submitted by org.sam.CE-JobState. org.sam.CE-JobState and org.sam.CE-JobSubmit are updated (as passive checks) either via the Nagios command file or NSCA (a sketch of the command-file path is shown below).
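
As an illustration of the passive-update path via the Nagios command file: the PROCESS_SERVICE_CHECK_RESULT external command itself is standard Nagios, but the command-file location, the output text and the helper below are assumptions, not part of the framework (the service description format is the one that appears in the troubleshooting examples further down):

  import time

  def publish_passive_result(host, svc, code, output,
                             command_file='/var/nagios/rw/nagios.cmd'):
      """Append a PROCESS_SERVICE_CHECK_RESULT line to the Nagios command file."""
      line = '[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n' % (
          int(time.time()), host, svc, code, output)
      f = open(command_file, 'a')
      f.write(line)
      f.close()

  # e.g. mark the job submission for a CE as OK (0):
  publish_passive_result('ce03.esac.esa.int',
                         'org.sam.CE-JobSubmit-/ops/Role=lcgadmin',
                         0, 'OK: job finished successfully')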

Job submission

Submission and Monitoring

According to the WMS Job State Machine (link, p. 17) a job can be in one of the following states:

  • non-terminal: Submitted, Waiting, Ready, Scheduled, Running
    • Submitted: the job has been entered by the user on the UI but not yet transferred to the NS for processing
    • Waiting: the job has been accepted by the NS and is waiting for WM processing, or is being processed by the WM Helper modules (e.g., the WM is busy, no appropriate CE (cluster) has been found yet, ...).
    • Ready: the job has been processed by the WM and its Helper modules (in particular, an appropriate CE has been found) but not yet transferred to the CE (local batch system queue) via the Job Controller and CondorC.
    • Scheduled: the job is waiting in the queue on the Computing Element.
    • Running: the job is running.
  • terminal: Done, Aborted, Cancelled, Cleared
    • Done: the job exited, or is considered to be in a terminal state by CondorC (e.g., submission to the CE has failed in an unrecoverable way).
    • Aborted: job processing was aborted by the WMS (waiting in the WM queue or on the CE for too long, over-use of quotas, expiration of user credentials, etc.).
    • Cancelled: the job has been successfully cancelled on user request.
    • Cleared: the output sandbox was transferred to the user or removed due to a timeout.

On the WMS there are two main parameters responsible for timeouts in job matchmaking:

  • MatchRetryPeriod = 3500 (58 min) - interval between successive retries to match a job to a resource (T_WMS_MatchRetr)
  • ExpiryPeriod = 7200 (2 hours) - time after which job will be aborted with 'no compatible resources' (T_WMS_Exp)

The defaults allow a job to be matched at most three times within the two hours after submission: matchmaking is attempted right after submission and again roughly 58 and 116 minutes later, before the 2-hour ExpiryPeriod aborts the job.

With JDL

   JobType="Normal";
   ...
   RetryCount = 0;
   ShallowRetryCount = 1;
   Requirements = other.GlueCEInfoHostName == "<CE hostname>";

and a 1-hour interval between job submissions, it is advisable to set e.g. MatchRetryPeriod = 1320 (22 min) and ExpiryPeriod = 3000 (50 min). This way the WMS will naturally abort jobs if information about the CE is not available in the IS.

In Nagios, job submission and monitoring are implemented in the following way (a condensed sketch of the per-job decision logic follows the list).

  • timeouts defined for org.sam.CE-JobState and org.sam.CE-JobMonit metrics
    org.sam.CE-JobState
    --timeout-job-discard <sec> Discard job after the timeout. (Default: 21600)
    
    org.sam.CE-JobMonit
    --timeout-job-global <sec>  Global timeout for jobs. Job will be canceled
                                and dropped if it is not in terminal state by
                                that time. (Default: 3300)
    --timeout-job-waiting <sec> Time allowed for a job to stay in Waiting with
                                'no compatible resources'. (Default: 2700)
    --timeout-job-discard <sec> Discard job after the timeout. (Default: 21600)
    --timeout-job-schedrun <sec> Scheduled/Running states timeout. (Default: 19800)
    
  • org.sam.CE-JobState metric (active Nagios check). Runs hourly (normal_check_interval 60).
    • initially submits the job and saves /<workdirRun>/<voName>/<nameSpace>/<serviceType>/<nodeName>/activejob.map with submitTimeStamp|hostNameCE|serviceDesc|jobID|jobState|lastStateTimeStamp.
    • if activejob.map was found
      • jobState is a terminal state - discard the job, proceed with submission
      • jobState is a non-terminal state
        • lastStateTimeStamp - submitTimeStamp < timeout-job-discard - exit with OK: Active job - <jobState> [time]
        • lastStateTimeStamp - submitTimeStamp > timeout-job-discard - discard the job, proceed with submission
  • org.sam.CE-JobMonit metric (active Nagios check; checks all jobs and updates activejob.map, org.sam.CE-JobState & org.sam.CE-JobSubmit). Runs every 5 min (normal_check_interval 5). For all currently submitted jobs (activejob.map files) it gets the job state from the WMS:
    • on error getting job state
      • UI problem - update org.sam.CE-JobState with WARNING
      • WMS problem
        • timeNow - submitTimeStamp < timeout-job-discard - update org.sam.CE-JobState with WARNING (unable to get job status. Job will be deleted in N min; N = (timeout-job-discard - (timeNow - submitTimeStamp))/60)
        • else - update org.sam.CE-JobState with WARNING and org.sam.CE-JobSubmit with UNKNOWN (unable to get job status. Job discarded.)
    • on OK getting job state
      • Done
        • Current Status: Done (Success) - update org.sam.CE-Job{State,Submit} with OK.
        • Current Status: Done (Exit Code != 0) - the framework on the WN exits with Nagios-compliant exit codes; check the exit code. Update org.sam.CE-Job{State,Submit} with WARNING, CRITICAL or UNKNOWN respectively.
        • delete activejob.map.
      • Aborted
        • get logging info and get the reason
          • request expired
            • BrokerHelper: no compatible resources - update org.sam.CE-Job{State,Submit} with CRITICAL (Job was aborted. Failed to match.).
            • else - update org.sam.CE-Job{State,Submit} with UNKNOWN (Job was aborted. Check WMS.)
          • else - update org.sam.CE-Job{State,Submit} with CRITICAL (Job was aborted.).
        • delete activejob.map.
      • Cleared
        • delete activejob.map.
      • Cancelled
        • delete activejob.map.
      • Waiting
        • get logging info and get the reason
          • no compatible resources
            • timeNow - submitTimeStamp > timeout-job-waiting - update org.sam.CE-Job{State,Submit} with CRITICAL (BrokerHelper: no compatible resources). Cancel and discard the job.
            • else - update org.sam.CE-JobState with WARNING. Update activejob.map.
          • else
            • timeNow - submitTimeStamp > timeout-job-discard - cancel & delete activejob.map.
      • Ready, Submitted
        • timeNow - submitTimeStamp > timeout-job-global - update org.sam.CE-Job{State,Submit} with UNKNOWN. Get logging info & include it into the details data; cancel & delete activejob.map.
        • else - update activejob.map.
      • Scheduled, Running
        • timeNow - submitTimeStamp > timeout-job-schedrun - update org.sam.CE-Job{State,Submit} with WARNING. Get logging info & include it into the details data; cancel & delete activejob.map. Issue CRITICAL on the second successive timeout (JobMonit keeps a counter of the states).
        • else - update activejob.map.
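
The per-job logic above can be condensed into the following sketch (the status constants follow the Nagios plugin convention; the update() helper and the dictionary field names are illustrative, not the framework's actual API, and only the main branches are shown):

  import time

  # Nagios return codes
  OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

  # defaults of the timeouts listed above (seconds)
  T_WAIT, T_GLOB, T_SCHEDRUN, T_DISCARD = 2700, 3300, 19800, 21600

  def update(metric, status, msg):
      # placeholder for the passive update of org.sam.CE-JobState/JobSubmit
      print '%s <- %d (%s)' % (metric, status, msg)

  def monitor_job(job, now=None):
      """Condensed org.sam.CE-JobMonit logic for one active job.

      'job' is a dict with the activejob.map fields plus, where relevant,
      the abort/waiting reason reported by the WMS. Returns True when the
      job should be cancelled/discarded (activejob.map removed).
      """
      now = now or time.time()
      age = now - job['submitTimeStamp']
      state, reason = job['jobState'], job.get('reason', '')

      if state == 'Done':
          if job.get('exitCode', 0) == 0:
              status = OK
          else:
              status = CRITICAL
          update('CE-JobState', status, 'job finished')
          update('CE-JobSubmit', status, 'job finished')
          return True
      if state == 'Aborted':
          if 'no compatible resources' in reason:
              status, msg = CRITICAL, 'Job was aborted. Failed to match.'
          elif 'request expired' in reason:
              status, msg = UNKNOWN, 'Job was aborted. Check WMS.'
          else:
              status, msg = CRITICAL, 'Job was aborted.'
          update('CE-JobState', status, msg)
          update('CE-JobSubmit', status, msg)
          return True
      if state in ('Cleared', 'Cancelled'):
          return True
      if state == 'Waiting' and 'no compatible resources' in reason:
          if age > T_WAIT:
              update('CE-JobState', CRITICAL, 'no compatible resources')
              update('CE-JobSubmit', CRITICAL, 'no compatible resources')
              return True                 # cancel and discard
          update('CE-JobState', WARNING, 'still waiting')
          return False
      if state in ('Ready', 'Submitted') and age > T_GLOB:
          update('CE-JobState', UNKNOWN, 'global timeout exceeded')
          update('CE-JobSubmit', UNKNOWN, 'global timeout exceeded')
          return True                     # cancel and discard
      if state in ('Scheduled', 'Running') and age > T_SCHEDRUN:
          update('CE-JobState', WARNING, 'Scheduled/Running timeout')
          update('CE-JobSubmit', WARNING, 'Scheduled/Running timeout')
          return True                     # cancel and discard
      return False                        # keep the job, refresh activejob.map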

Currently [Nov 23 2010] we are monitoring with all the defaults:

                T_N_SCHED(1h)                T_J_DISCARD
          T_J_GLOB |               T_J_SCHEDRUN  |
      T_J_WAIT  |  |                          |  |
|-----------|---|--#---------------|------...-|--| -> t
                  |                |
                T_WMS_MatchRetr  T_WMS_Exp

T_J_WAIT - 45 min (--timeout-job-waiting)
T_J_GLOB - 55 min (--timeout-job-global)
T_WMS_MatchRetr - 58 min (MatchRetryPeriod)
T_N_SCHED - 1 hour (Nagios metric scheduling)
T_WMS_Exp - 2 hours (ExpiryPeriod)
T_J_SCHEDRUN - 5h30min (--timeout-job-schedrun)
T_J_DISCARD - 6 hours (--timeout-job-discard)

Thus, in most cases we cancel jobs that are in Waiting due to 'no compatible resources' when T_J_WAIT kicks in (after only one initial matchmaking attempt in the WM) and issue CRITICAL for org.sam.CE-Job{State,Submit}.

Moving to the case T_WMS_MatchRetr < T_WMS_Exp < T_J_WAIT (or even 2*T_WMS_MatchRetr < T_WMS_Exp < T_J_WAIT) is quite feasible. In that configuration, jobs sent to CEs that are not (properly) published in the IS will be naturally discarded by the WMS (Aborted; reason 'no compatible resources'). The monitoring metric (org.sam.CE-JobMonit) already handles this case and will issue CRITICAL against the CE.

Integration of third-party WN checks

The following CLI parameters to the org.sam.CE-JobState metric are available:

--add-wntar-nag <d1,d2,..>  Comma-separated list of top level directories with
                            Nagios compliant directories structure to be added
                            to tarball to be sent to WN.
--add-wntar-nag-nosam       Instructs the metric not to include standard SAM WN
                            probes and their Nagios config to WN tarball.
                            (Default: WN probes are included)
--add-wntar-nag-nosamcfg    Instructs the metric not to include Nagios
                            configuration for SAM WN probes to WN tarball. The
                            probes themselves and respective Python packages,
                            however, will be included.
  • with the --add-wntar-nag <d1,d2,..> parameter, the respective "Nagios compliant directory structure" should look like this:
      [kvs] ~ tree /path/to/your/probes/wnjob/org.my/
      /path/to/your/probes/wnjob/org.my/
      |-- etc
      |   `-- wn.d
      |       `-- org.my
      |           |-- commands.cfg
      |           `-- services.cfg
      `-- probes
          `-- org.my
              |-- check_A
              |-- check_B
              `-- checks_lib.sh
    
    • probes/org.my/* should contain your probes/checks (a minimal example check is sketched after this list)
    • etc/wn.d/org.my/ should contain file(s) with the .cfg extension holding Nagios command and service object definitions (and, optionally, service dependency definitions). In your etc/wn.d/org.my/*.cfg files please use the following Nagios macros and framework template names when defining paths:
      • $USER3$ - macro defining path to <nagiosRoot>/probes/ directory on WN. Usage:
          define command{
                 command_name   check_A1
                 command_line   $USER3$/org.my/check_A
                 }
        
      • <wnjobWorkDir> - will be substituted with the job's working directory on WN. Handy if your check requires and creates a working directory. Possible usage (assumes -w instructs check_A to create <wnjobWorkDir>/.mygridprobes directory):
          define command{
                 command_name   check_A2
                 command_line   $USER3$/org.my/check_A -w <wnjobWorkDir>/.mygridprobes
                 }
        

For this part of the Nagios object configuration and macros, please see the following Nagios documentation: Object Configuration Overview, Service definitions, Command definitions, Service dependency definitions, Nagios macros, Macros and resource file.

Also, as an example you might want to check the object configurations defined for the org.sam WN checks in /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam/etc/wn.d/org.sam/ of the grid-monitoring-probes-org.sam RPM. The Nagios resource file that will be used on the WN is located on the UI at /usr/libexec/grid-monitoring/probes/org.sam/wnjob/nagios.d/etc/resource.cfg.

Example (only relevant CLI parameters are shown):

  ./CE-probe -m org.sam.CE-JobState \
             --add-wntar-nag /path/to/your/probes/wnjob/org.my/,/path/to/org.sam.sec

This will add into the WN tarball the two sets of WN checks provided by org.my and org.sam.sec. NB! The org.sam WN checks and their respective Nagios configurations will still be added and launched on the WN as well!

  • Use --add-wntar-nag-nosam if you want neither the org.sam probes to be run on the WN nor to use them with your own custom configurations. This instructs the org.sam.CE-JobState metric not to include the org.sam WN probes and their Nagios configuration in the WN tarball. Example (only relevant CLI parameters are shown):
      ./CE-probe -m org.sam.CE-JobState --add-wntar-nag-nosam \
                 --add-wntar-nag /path/to/your/probes/wnjob/org.my/
    

    The WN tarball will contain only your probes and Nagios configurations from /path/to/your/probes/wnjob/org.my/.

  • Use --add-wntar-nag-nosamcfg if you don't want the org.sam probes to be run on the WN, but still want them to be included in the WN tarball. This instructs the org.sam.CE-JobState metric to include the org.sam WN probes and the Python gridmon and gridmetrics packages with their respective modules in the WN tarball; the Nagios configuration of the probes will not be included. This is done for your convenience:
      1. in case you want to use some of the org.sam provided probes/wrappers (e.g. $USER3$/org.sam/{nag,sam}test-run); note that you will then have to include your own Nagios object configurations yourself (they can be taken directly from /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam/etc/wn.d/org.sam/*.cfg and included into your *.cfg files)
      2. in case you developed your Python WN probes using the gridmon or gridmetrics packages. These packages will be added to $PYTHONPATH before launching Nagios on the WN, so your probes can safely import the required modules from them.
        Example (only relevant CLI parameters are shown):
          ./CE-probe -m org.sam.CE-JobState --add-wntar-nag-nosamcfg \
                     --add-wntar-nag /path/to/your/probes/wnjob/org.my/
        

        The WN tarball will contain your probes and Nagios configurations from /path/to/your/probes/wnjob/org.my/, as well as the org.sam probes (but without their standard org.sam Nagios configurations) and the gridmon and gridmetrics Python packages.

Providing JDL

The JDL used by the framework is the following:

[
Type="Job";
JobType="Normal";
Executable = "<jdlExecutable>";
StdError = "gridjob.out";
StdOutput = "gridjob.out";
Arguments = "<jdlArguments>";
InputSandbox =  {"<jdlInputSandboxExecutable>", "<jdlInputSandboxTarball>"};
OutputSandbox = {"gridjob.out","wnlogs.tgz"};
RetryCount = <jdlRetryCount>;
ShallowRetryCount = <jdlShallowRetryCount>;
Requirements = other.GlueCEInfoHostName == "<jdlReqCEInfoHostName>";
]

and is located in /usr/libexec/grid-monitoring/probes/org.sam/wnjob/org.sam.gridJob.jdl.template.

  • The substitutable elements of the template are (a substitution sketch follows the list):
    • jdlExecutable - name of executable on WN (by default set to nagrun.sh)
    • jdlArguments - list of arguments to jdlExecutable (dynamically composed by JobState metric)
    • jdlInputSandboxExecutable - path to executable on UI (set by the framework to /metric/work/dir/<jdlExecutable>)
    • jdlInputSandboxTarball - path to WN tarball (set by the framework to /metric/work/dir/gridjob.tgz)
    • jdlRetryCount - default 0 (can be modified via CLI parameter --jdl-retrycount (see below))
    • jdlShallowRetryCount - default 1 (can be modified via CLI parameter --jdl-shallowretrycount (see below))
    • jdlReqCEInfoHostName - CE host name (set by the framework)
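
A minimal sketch of how these placeholders could be substituted, assuming plain string replacement of the <...> tokens (the example values are illustrative; the framework composes them internally):

  # illustrative substitution of the template placeholders; the example values
  # (paths, arguments, CE name) are assumptions, not the framework's defaults
  template = open('/usr/libexec/grid-monitoring/probes/org.sam/wnjob/'
                  'org.sam.gridJob.jdl.template').read()

  substitutions = {
      'jdlExecutable':             'nagrun.sh',
      'jdlArguments':              '-v ops -d /queue/grid.probe.metricOutput',
      'jdlInputSandboxExecutable': '/var/lib/gridprobes/ops/org.sam/CE/ce.example.org/nagrun.sh',
      'jdlInputSandboxTarball':    '/var/lib/gridprobes/ops/org.sam/CE/ce.example.org/gridjob.tgz',
      'jdlRetryCount':             '0',
      'jdlShallowRetryCount':      '1',
      'jdlReqCEInfoHostName':      'ce.example.org',
  }

  jdl = template
  for key, value in substitutions.items():
      jdl = jdl.replace('<%s>' % key, value)

  open('gridjob.jdl', 'w').write(jdl)     # ready to submit to the WMS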

"Third-party" JDL

One can provide one's own JDL with the following parameter to the org.sam.CE-JobState metric:

--jdl-templ <file>    JDL template file (full path).

The provided JDL can be a template as well, with all or some of the above-mentioned substitutable parameters.

For better flexibility the RetryCount and ShallowRetryCount ClassAds are exposed as parameters to the
org.sam.CE-JobState metric:

--jdl-retrycount <val>          JDL RetryCount (Default: 0).
--jdl-shallowretrycount <val>   JDL ShallowRetryCount (Default: 1).

Logs from WN

The default JDL's output sandbox defines two files that will be taken from the WN:

OutputSandbox = {"gridjob.out","wnlogs.tgz"};
  • gridjob.out contains the logging output of the WN job as seen by the WMS job wrapper, i.e. stdout and stderr from the testing framework's launching script on the WN (nagrun.sh).
  • wnlogs.tgz contains the following directories from the WN: /<job working dir>/nagios/{var,tmp}. The framework's messaging and Nagios logging and debugging output are stored there.
    • when writing probes for the WN, one can direct output into files in those directories - they will be brought back to the UI in wnlogs.tgz.

On the Nagios UI the OutputSandbox is stored per CE under /var/lib/gridprobes/<VO or FQAN>/<namespace>/<service>/<hostname>/jobOutput* directories. E.g.:

# ls -1d /var/lib/gridprobes/ops.Role=lcgadmin/org.sam/CREAMCE/stremsel.nikhef.nl/jobOutput*
/var/lib/gridprobes/ops.Role=lcgadmin/org.sam/CREAMCE/stremsel.nikhef.nl/jobOutput
/var/lib/gridprobes/ops.Role=lcgadmin/org.sam/CREAMCE/stremsel.nikhef.nl/jobOutput.LAST
#

jobOutput.LAST contains the previous (historical) output fetched from the WN.

Log levels on WN

The logging level of the framework and metrics on the WN is controlled by the following parameters to the org.sam.*-JobState metrics:

--wn-verb <0-3>             Metrics verbosity level on WN. [-v <VERBOSITY>]
                            (Default: 0)
--wn-verb-fw <0-3>          Framework verbosity level on WN (Default: 1)
  • --wn-verb - on the WN the given value is substituted for <VERBOSITY> in the Nagios metric definition *.cfg files.
  • --wn-verb-fw <0-3> - used by nagrun.sh to set its own debugging output as well as the logging and debugging output of the message publishing client (Message Transfer Agent, MTA) and of Nagios (a sketch of the mapping to the Nagios debug level follows this list).
    • messaging:
          >= 2  - 'debug'
          == 1  - 'info'
          == 0  - 'warn'
      
    • nagios:
      # Nagios debug codes/types   | FW debug levels
      #    -1 = Everything            -
      #     0 = Nothing               0
      #     1 = Functions             3
      #     2 = Configuration         3
      #     4 = Process information   2,3
      #     8 = Scheduled events      2,3
      #    16 = Host/service checks   1,2,3
      #    32 = Notifications         -
      #    64 = Event broker          3
      #   128 = External commands     1,2,3
      #   256 = Commands (handlers)   2,3
      #   512 = Scheduled downtime    -
      #  1024 = Comments              -
      #  2048 = Macros                1,2,3
      
      >= 3  - 4095 ^ 32 ^ 512 ^ 1024 (bit-wise XOR)
      == 2  - 4 | 8 | 16 | 128 | 256 | 2048 (bit-wise OR)
      == 1  - 16 | 128 | 2048 (bit-wise OR)
      <= 0  - 0
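
The same mapping, expressed as a small sketch (nagios_debug_level() is an illustrative name; the returned numbers are what would end up as Nagios' debug_level):

  def nagios_debug_level(fw_verb):
      """Map the framework verbosity (--wn-verb-fw) to a Nagios debug_level value."""
      if fw_verb >= 3:
          # everything except notifications, scheduled downtime and comments
          return 4095 ^ 32 ^ 512 ^ 1024            # = 2527
      if fw_verb == 2:
          # process info, scheduled events, checks, external commands,
          # command handlers and macros
          return 4 | 8 | 16 | 128 | 256 | 2048     # = 2460
      if fw_verb == 1:
          # host/service checks, external commands and macros only
          return 16 | 128 | 2048                   # = 2192
      return 0                                     # fw_verb <= 0: no debugging

  for verb in range(4):
      print verb, nagios_debug_level(verb)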
      

Nagios and messaging on WN

The codebase for scheduling metrics on the WN and delivering metric results to the messaging system is located under /usr/libexec/grid-monitoring/probes/org.sam/wnjob/ (referred to below as /<WN_codebase>/).

nagrun.sh

On the WN the nagrun.sh script (specified in the JDL as Executable = "<jdlExecutable>"; located on the Nagios UI in /<WN_codebase>/) is used to

  • set required environment variables
  • launch messages transfer agent (MTA) - metric results publisher to message brokers
  • make substitutions in templated (mainly Nagios) configuration files
  • launch and monitor Nagios
  • after all metrics are executed (or on timeout) terminate Nagios and MTA
  • do on-exit cleanup

The parameters to nagrun.sh, specified via Arguments = "<jdlArguments>", define what should be launched and how.

nagrun.sh -h
usage: nagrun.sh -v <vo> -d <dest> [-b <broker_uri>]
 [-n <broker_network>] [-t <timeout>] [-w <fw_verb>]
 [-l <hostname>] [-s <hostname>] [-z <metric_verb>] [-f <fqan>]
 [-i <host:port,..>] -B -R -N -h
 -v and -d are mandatory paramters. Defaults:
 <broker_network> - PROD
 <timeout> - 600 sec
 <metrics_verb> - 0
 <fw_verb> - 1 (2 - messages, 3 - Nagios config/stats/debug)
 -l <hostname> - LFC for replica tests
 -s <hostname> - SE for replica tests
 -i <host:port,..> - BDII(s) for tests
 -f <fqan> - VOMS FQAN
 -B - don't do broker discovery
 -R - take MB randomly; by default sort by min response time

In most cases the parameters are a direct translation of the corresponding ones given to the org.sam.*-JobState metric:

org.sam.*-JobState parameter   Description                                       nagrun.sh
--mb-network <net>             Brokers network for broker discovery on WN.       -n <broker_network>
--mb-uri <URI>                 Message Broker URI.                                -b <broker_uri>
--mb-destination <dest>        Queue/topic on MB to publish to.                   -d <dest>
--mb-no-discovery              Do not do broker discovery on WN.                  -B
--mb-choice <best|random>      How to choose MB on WN.                            -R
--vo <name>                    Virtual Organization.                              -v <vo>
--vo-fqan <name>               VOMS primary attribute as FQAN.                    -f <fqan>
--wn-lfc <hostname>            LFC for replica tests on WN.                       -l <hostname>
--wn-se-rep <se1,..>           Comma-separated list of SEs for replica tests.     -s <hostname>
--wn-bdii <host:port,...>      BDII(s) to be used on WN.                          -i <host:port,..>
--wn-verb <0-3>                Metrics verbosity level on WN.                     -z <metric_verb>
--wn-verb-fw <0-3>             Framework verbosity level on WN.                   -w <fw_verb>
--timeout-wnjob-global <sec>   Global timeout for a job on WN.                    -t <timeout>

MTA

The Message Transfer Agent (MTA) is used to send messages with metric results from the WN to the Message Brokers. On the Nagios UI the MTA executable - mta-simple - is located under /<WN_codebase>/bin/, and its implementation in /<WN_codebase>/lib/python2.3/site-packages/mig.

MTA:

  • tries to establish a connection to a broker (either a given one, or one found via discovery in the IS followed by ranking)
  • takes messages from a directory-based queue (Python dirq; implementation in /<WN_codebase>/lib/python2.3/site-packages/dirq) and sends them to the broker. The default location of the outgoing message queue on the WN is /tmp/sam.<pid-of-nagrun.sh>.$RANDOM/msg-outgoing

Messages with metric results are stored in the outgoing message queue by a Nagios event handler, handle_service_check, which Nagios invokes after the execution of each check.
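
Conceptually, the directory-based queue decouples the producer (the check-result handler) from the consumer (the MTA): each metric result is dropped as a file into the queue directory, and the MTA picks the files up and forwards them to the broker. A purely illustrative sketch of the producer side, assuming a simple write-then-rename scheme (this is not the actual dirq on-disk layout, and the message content is made up):

  import os, tempfile, time

  def enqueue(queue_dir, body):
      """Drop one metric-result message into the outgoing queue directory.

      The message is written to a temporary file first and then renamed, so
      the consumer (the MTA) never sees a partially written file.
      """
      if not os.path.isdir(queue_dir):
          os.makedirs(queue_dir)
      fd, tmp_path = tempfile.mkstemp(dir=queue_dir, prefix='.tmp-')
      os.write(fd, body)
      os.close(fd)
      final_path = os.path.join(queue_dir, '%.6f' % time.time())
      os.rename(tmp_path, final_path)      # atomic on the same filesystem
      return final_path

  # e.g. called from a check-result handler:
  enqueue('/tmp/sam.12345.6789/msg-outgoing',
          'hostName: ce03.esac.esa.int\nmetricStatus: OK\n')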

Troubleshooting

Manual job submission and monitoring

To submit a job manually with modified default command line parameters, first obtain the command that Nagios would run. E.g.:

# nagios-run-check -d -v -s org.sam.CE-JobState-/ops/Role=lcgadmin -H ce03.esac.esa.int
Executing command:
su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "ce03.esac.esa.int" -t 600 --vo ops 
  --mb-destination /queue/grid.probe.metricOutput.EGEE.sam-ne-roc-val_cern_ch -x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin 
  --vo-fqan /ops/Role=lcgadmin --mb-uri stomp://gridmsg002.cern.ch:6163/ --prev-status $LASTSERVICESTATEID$ --wn-se-rep-file GoodSEs 
  --wn-se-rep fake.srm1,samdpm002.cern.ch -m org.sam.CE-JobState --err-topics ce_wms,default'
# 

Take the produced command, modify/add/delete parameters as required and run it. For example, suppose we want to increase the verbosity level of the framework on the WN. First, remove all options whose parameters are of the form $SOMENAMEHERE$, as those are Nagios macros and are valid only when the check is run under Nagios; in the example above we remove --prev-status $LASTSERVICESTATEID$. Then add --wn-verb-fw 3 to set the maximum verbosity level for the framework on the WN.

The resulting command will be:

# su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "ce03.esac.esa.int" -t 600 --vo ops 
  --mb-destination /queue/grid.probe.metricOutput.EGEE.sam-ne-roc-val_cern_ch -x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin 
  --vo-fqan /ops/Role=lcgadmin --mb-uri stomp://gridmsg002.cern.ch:6163/ --wn-se-rep-file VeryGoodSEs --wn-se-rep fake.srm1,samdpm002.cern.ch 
  -m org.sam.CE-JobState --err-topics ce_wms,default --wn-verb-fw 3 -v 3'

OK: Active job - Submitted [2010-12-09T09:53:16Z]
OK: Active job - Submitted [2010-12-09T09:53:16Z]
Testing from: samnag011.cern.ch
DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=gridmon/CN=137254/CN=Robot: Grid Monitoring-Framework
VOMS FQANs: /ops/Role=lcgadmin/Capability=NULL, /ops/Role=NULL/Capability=NULL
Reading configuration file(s).
Configuration files:
 /etc/gridmon/org.sam.conf
Reading from configuration file(s):
 /etc/gridmon/org.sam.conf
Parsing command line parameters.
Loading active job's description ... done.
>>> Active job found
file: /var/lib/gridprobes/ops.Role=lcgadmin/org.sam/CE/ce03.esac.esa.int/activejob.map
serviceDesc : org.sam.CE-JobSubmit-/ops/Role=lcgadmin
jobState : Submitted
submitTimeStamp : 1291888396
jobID : https://wms220.cern.ch:9000/02RCPJTL3CehL-Ldi66MMg
hostNameCE : ce03.esac.esa.int
lastStateTimeStamp : 1291888396
https://wms220.cern.ch:9000/02RCPJTL3CehL-Ldi66MMg
# 

However, the check reported that there is already an active job (note that -v 3 was added to see more debugging output during job submission). You can either simply delete the active job bookkeeping file /var/lib/gridprobes/ops.Role=lcgadmin/org.sam/CE/ce03.esac.esa.int/activejob.map (it acts as a lock), or, starting from grid-monitoring-probes-org.sam-0.1.18, run the JobCancel metric to gracefully cancel the job:

# su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "ce03.esac.esa.int" -m org.sam.CE-JobCancel 
  -x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin --vo-fqan /ops/Role=lcgadmin --vo ops'
OK: job cancelled
OK: job cancelled
Testing from: samnag011.cern.ch
DN: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=gridmon/CN=137254/CN=Robot: Grid Monitoring-Framework
VOMS FQANs: /ops/Role=lcgadmin/Capability=NULL, /ops/Role=NULL/Capability=NULL
Job cancellation request sent:
glite-wms-job-cancel --noint  -i /var/lib/gridprobes/ops.Role=lcgadmin/org.sam/CE/ce03.esac.esa.int/jobID
Job bookkeeping files deleted.
# 

After you submit a new job with the required parameters it can be monitored via the Nagios interface or manually with the JobMonit check. To find out which command line to run, proceed as for the JobState metric.

-H sam-ne-roc-val.cern.ch should be the hostname of your Nagios instance.

# nagios-run-check -d -v -s org.sam.CE-JobMonit-/ops/Role=lcgadmin -H sam-ne-roc-val.cern.ch
Executing command:
su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "sam-ne-roc-val.cern.ch" -t 600 --vo ops 
   -x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin --vo-fqan /ops/Role=lcgadmin -m org.sam.CE-JobMonit --err-topics ce_wms'
#

Then add --hosts ce03.esac.esa.int to monitor only the required hosts (a comma-separated list is possible) and run the command.

# su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "sam-ne-roc-val.cern.ch" -t 600 --vo ops 
   -x /etc/nagios/globus/userproxy.pem--ops-Role_lcgadmin --vo-fqan /ops/Role=lcgadmin -m org.sam.CE-JobMonit --err-topics ce_wms 
   --hosts ce03.esac.esa.int'
OK: Jobs processed - 1
OK: Jobs processed - 1
[Running] : 1|jobs_processed=1;; DONE=0;; RUNNING=1;; SCHEDULED=0;; SUBMITTED=0;; READY=0;; WAITING=0;; 
  WAITING-CANCELLED=0;; WAITING-CANCEL=0;; ABORTED=0;; CANCELLED=0;; CLEARED=0;; MISSED=0;; UNDETERMINED=0;; unknown=0;1;2
#

We can see that the job is in the Running state. The output after "|" is Nagios performance data.

When the job enters a terminal state JobMonit will fetch the job's output sandbox to the Nagios UI. For the location of the job output see Logs from WN.
