This page last changed on Feb 25, 2014 by mbabik.

Introduction

This page lists frequently asked questions related to SAM.

The following troubleshooting items are already available under two categories:

Installation

Configuration

Monitoring of the local services

To change the URL that ATP uses to connect to GOCDB, manual reconfiguration is needed (please note that this change will be overwritten by yaim). In /etc/atp/atp_synchro.conf, change the following URLs

[gocdb]
site_url            = https://goc.egi.eu/gocdbpi/private/?method=get_site
serviceendpoint_url = https://goc.egi.eu/gocdbpi/private/?method=get_service_endpoint

to
site_url            = https://goc.egi.eu/gocdbpi/private/?method=get_site&scope=EGI,Local&scope_match=any
serviceendpoint_url = https://goc.egi.eu/gocdbpi/private/?method=get_service_endpoint&scope=EGI,Local&scope_match=any
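The change above amounts to appending two query parameters to each GOCDB URL. A minimal Python sketch of that transformation, using only the standard library; the helper name add_scope is hypothetical and is not part of ATP:

```python
from urllib.parse import urlsplit, urlunsplit


def add_scope(url, scope="EGI,Local", scope_match="any"):
    """Append scope and scope_match to a GOCDB URL (hypothetical helper)."""
    parts = urlsplit(url)
    # Build the extra parameters by hand so the comma in the scope list
    # stays literal, as it appears in atp_synchro.conf.
    extra = f"scope={scope}&scope_match={scope_match}"
    query = f"{parts.query}&{extra}" if parts.query else extra
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


if __name__ == "__main__":
    print(add_scope("https://goc.egi.eu/gocdbpi/private/?method=get_site"))
```

The sketch only rewrites a string; the edited values still have to be written to /etc/atp/atp_synchro.conf by hand.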
Nagios access configuration

Access is configured via voms2htpasswd, which combines dynamic and static configuration to generate /etc/httpd/htpasswd. The dynamic part is generated from VOMS and GOCDB site contacts, e.g.

# cat /etc/voms2htpasswd.conf
vomss://voms1.somewhere:8443/voms/dteam?/dteam
vomss://voms2.somewhereelse:8443/voms/ops?/ops
gocdb://gocdb/api/private/?method=getcontacts&roc=ROC
gocdb://gocdb/api/private/?method=getcontacts&roc=ROC

This is based on the site-info.def variables VO_<VO>_VOMS_SERVERS and NCG_GOCDB_ROC_NAME.

It is also possible to add a list of static DNs in a .conf file under /etc/voms2htpasswd-static.d/

# cat /etc/voms2htpasswd-static.d/YAIM-ops-monitor.conf 
/DC=dc/DC=dc2/OU=ou1s/CN=SomeCN
Failover Nagios

Starting from Update-13 SAM supports deployment of hot-standby (active/active) instances. The system works in the following way:

  • The backup instance is deployed by using the same Yaim configuration with the added BACKUP_INSTANCE variable described below.
  • The SAM administrator opens a GGUS ticket to the SAM/Nagios Support Unit requesting that the backup host be added to the message consumer filter (file nagios-roles.conf).
  • The backup instance monitors resources continuously, with the following differences:
    • alarms are not sent to the Operations portal
    • email notifications are disabled
    • results are not sent to the central MRS database
    • note: results are stored in the local MRS, so MyEGI shows the correct history on both instances.
  • If the main instance fails, the SAM administrator has to manually switch off the BACKUP_INSTANCE variable on the backup instance.

Backup instance can be defined in several ways:

  • Via Yaim variable which sets variable BACKUP_INSTANCE in /etc/sysconfig/ncg (recommended mechanism)
    NCG_BACKUP_INSTANCE=true
  • Fast backup configuration without YAIM execution:
    • Setting variable BACKUP_INSTANCE in /etc/sysconfig/ncg (this approach can be used for fast failover):
      BACKUP_INSTANCE=true
    • Setting global variable in ncg.conf file:
      BACKUP_INSTANCE=1
    • Using ncg.pl argument:
      ncg.pl --backup-instance

In case of backup configuration without Yaim the following additional step is needed:

/sbin/chkconfig send-to-dashboard off
/sbin/service send-to-dashboard stop

To turn the backup instance into the active one, the SAM administrator has to remove the BACKUP_INSTANCE variable. If Yaim is not used, the following additional step is needed:

/sbin/chkconfig send-to-dashboard on
/sbin/service send-to-dashboard start
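Before and after a failover it can help to confirm which mode an instance is in. The following Python sketch reads sysconfig-style text and reports whether BACKUP_INSTANCE is enabled. It is an illustration only: the function name is hypothetical, and treating only "true" and "1" as enabled is an assumption based on the two values shown above, not on how NCG itself parses the file.

```python
def is_backup_instance(sysconfig_text):
    """Return True if the last BACKUP_INSTANCE assignment enables backup mode."""
    enabled = False
    for raw in sysconfig_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "BACKUP_INSTANCE":
            # Assumption: only the values used on this page count as enabled.
            enabled = value.strip().strip("\"'").lower() in ("true", "1")
    return enabled


if __name__ == "__main__":
    sample = "# /etc/sysconfig/ncg\nBACKUP_INSTANCE=true\n"
    print(is_backup_instance(sample))
```

In practice the text would come from /etc/sysconfig/ncg; the sample string here only keeps the sketch self-contained.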
Robot certificates

Starting from Update-09 SAM supports the use of robot certificates instead of MyProxy credentials. If your CA supports robot certificates, we suggest switching to them, as they are easier to maintain. Robot certificates also provide better availability, as SAM no longer depends on the availability of a MyProxy server.

In order to use robot certificates set the following YAIM variables:

NCG_USE_ROBOT_CERT=true
# Robot cert and key can be different for each VO
# and standard Yaim VO notation is used
VO_OPS_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem
VO_OPS_ROBOT_KEY=/etc/nagios/globus/robot-key.pem
VO_DTEAM_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem-dteam
VO_DTEAM_ROBOT_KEY=/etc/nagios/globus/robot-key.pem-dteam
Variables ROBOT_CERT and ROBOT_KEY use standard Yaim VO notation. If VO directories (vo.d/voname) are used, the variables should be put in the appropriate VO files.
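Since each VO needs both a certificate and a key variable, a quick consistency check can catch a missing half. The Python sketch below scans site-info.def-style text for VO_<VO>_ROBOT_CERT and VO_<VO>_ROBOT_KEY and lists VOs that define only one of them. The function name is hypothetical and the check is not part of YAIM.

```python
import re

# Matches lines such as VO_OPS_ROBOT_CERT=/path or VO_DTEAM_ROBOT_KEY=/path
ROBOT_VAR = re.compile(r"^VO_(\w+?)_ROBOT_(CERT|KEY)=(.+)$")


def unpaired_robot_vars(text):
    """Return the sorted list of VOs that define only CERT or only KEY."""
    seen = {}
    for line in text.splitlines():
        match = ROBOT_VAR.match(line.strip())
        if match:
            seen.setdefault(match.group(1), set()).add(match.group(2))
    return sorted(vo for vo, kinds in seen.items() if kinds != {"CERT", "KEY"})


if __name__ == "__main__":
    sample = (
        "NCG_USE_ROBOT_CERT=true\n"
        "VO_OPS_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem\n"
        "VO_OPS_ROBOT_KEY=/etc/nagios/globus/robot-key.pem\n"
        "VO_DTEAM_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem-dteam\n"
    )
    print(unpaired_robot_vars(sample))
```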
Enabling ACE in MyWLCG

Currently it's only for the central MyWLCG instance. YAIM configuration:

MYEGI_ACE=true
Multi-VO configuration setup

Starting from Update-17.1 you can set up VO-Nagios and define multiple profiles in your local POEM web interface (see https://tomtools.cern.ch/confluence/display/SAMDOC/POEM+User%27s+guide). To support multiple VOs, define a VO feed for each VO as described in https://tomtools.cern.ch/confluence/display/SAM/ATP+VO+Feeds and then define a POEM profile for each VO. The rest of the configuration should follow the standard VO-Nagios setup. Please note that the Nagios.pm.mvopatch patch attached to this page is needed for NCG to make this work.

Setting alternative SE for metric org.sam.WN-RepRep

Starting from the release Update-07, it is possible to specify more than one replication SE for WN replica test org.sam.WN-RepRep. Static and/or dynamic mechanisms are possible.

In order to define static list of comma-separated hostnames set the following Yaim variable:

JOBSUBMIT_WN_SE_REP=se1[,se2,se3...]

Dynamic list is filled with a list of SEs defined on the Nagios instance that recently successfully passed org.sam.SRM-All set of tests. In order to use dynamic list set the following Yaim variable:

JOBSUBMIT_WN_SE_REP_FILE=filename

Filename must be defined without path.
If the dynamic list is used metric hr.srce.GoodSEs will be associated to Nagios host. The hr.srce.GoodSEs metric generates the list of "good" SEs, as well as provides the file as input parameter to org.sam.(CREAM)CE-JobState metric(s).

The org.sam.(CREAM)CE-JobState metric(s) take up to 3 hosts from the file and, if JOBSUBMIT_WN_SE_REP was defined, append them to the static list. On the WN, org.sam.WN-RepRep tries to replicate to the SEs in the provided order until a replication succeeds. The metric returns CRITICAL if the file could not be replicated to any of the SEs. This fixes https://tomtools.cern.ch/jira/browse/SAM-442.
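The selection logic described above can be sketched in Python as follows. This is an illustration of the described behaviour, not the probe's actual code: the function names and the try_replicate callback are hypothetical, and returning "OK" on success is an assumption (the text only states the CRITICAL case).

```python
def build_se_list(static_ses, dynamic_ses, max_dynamic=3):
    """Static SEs first, followed by at most max_dynamic hosts from the dynamic file."""
    return list(static_ses) + list(dynamic_ses)[:max_dynamic]


def replicate(se_list, try_replicate):
    """Try each SE in the given order and stop at the first success."""
    for se in se_list:
        if try_replicate(se):
            return "OK", se
    # No SE accepted the replica.
    return "CRITICAL", None


if __name__ == "__main__":
    ses = build_se_list(["se1.example.org"], ["a", "b", "c", "d"])
    # Pretend that only SE "b" is reachable.
    print(ses, replicate(ses, lambda se: se == "b"))
```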

Setting alternative BDII for metric org.sam.SRM-All

Metric org.sam.SRM-All uses sam-bdii.cern.ch top BDII by default. In order to make tests less dependent on CERN top BDII it is suggested to set alternative BDII.

In order to set alternative BDII create localdb file (e.g. /etc/ncg/ncg-localdb.d/srm.conf). There are two options:
1. switch to your own top BDII:

MODIFY_METRIC_PARAMETER!org.sam.SRM-All!--ldap-uri!your.top.bdii

2. use site BDII:

MODIFY_METRIC_ATTRIBUTE!org.sam.SRM-All!SITE_BDII!--ldap-uri
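The localdb lines shown above are "!"-separated records: a directive followed by its arguments. A minimal Python sketch that splits such lines, useful for sanity-checking a file before rerunning NCG. The function name is hypothetical and the sketch does not validate directive names or argument counts.

```python
def parse_localdb(text):
    """Split each localdb line into (directive, [arguments])."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        directive, *args = line.split("!")
        entries.append((directive, args))
    return entries


if __name__ == "__main__":
    sample = (
        "MODIFY_METRIC_PARAMETER!org.sam.SRM-All!--ldap-uri!your.top.bdii\n"
        "MODIFY_METRIC_ATTRIBUTE!org.sam.SRM-All!SITE_BDII!--ldap-uri\n"
    )
    for directive, args in parse_localdb(sample):
        print(directive, args)
```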
Setting alternative LFC for metrics org.sam.WN-Rep*

Metrics org.sam.WN-Rep* use prod-lfc-shared-central.cern.ch LFC by default.

In order to set an alternative LFC, create a localdb file (e.g. /etc/ncg/ncg-localdb.d/LFC.conf):

MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--wn-lfc!lfc.my.domain
MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-lfc!lfc.my.domain
Setting alternative BDII for metric org.sam.CREAMCE-DirectJobState

Metric org.sam.CREAMCE-DirectJobState uses sam-bdii.cern.ch top BDII by default. In order to make tests less dependent on CERN top BDII it is suggested to set alternative BDII.

In order to set alternative BDII create localdb file (e.g. /etc/ncg/ncg-localdb.d/creamcedjs.conf). There are two options:
1. switch to your own top BDII:

MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-DirectJobState!--ldap-uri!your.top.bdii

2. use site BDII:

MODIFY_METRIC_ATTRIBUTE!org.sam.CREAMCE-DirectJobState!SITE_BDII!--ldap-uri
Setting alternative list of CEs for metric org.sam.WMS-JobState

If the monitored infrastructure contains WMS service and no CE services, metric hr.srce.GoodCEs associated to Nagios service will fail with the following error:

HealthyNodes CRITICAL - No healthy hosts found.

There are two options to solve this issue.

1. If the infrastructure contains CREAM-CE services create file /etc/ncg/ncg-localdb.d/GoodCEs-fix with the following content:

MODIFY_METRIC_PARAMETER!hr.srce.GoodCEs!--metric!org.sam.CREAMCE-JobSubmit

2. In order to use static list of CEs or CREAM-CEs create file /etc/ncg/ncg-localdb.d/GoodCEs-fix with the following content:

REMOVE_METRIC!hr.srce.GoodCEs

For each VO supported on SAM instance create file /var/lib/gridprobes/<VO_NAME>/GoodCEs and list CE/CREAM-CE names in it. Example is:

ce1.reliable.my
ce2.reliable.my
cream-ce3.reliable.my

In case VO_FQAN is used (e.g. /ops/Role=lcgadmin) <VO_NAME> should be set to VO_FQAN with "/" replaced with "." (e.g. /var/lib/gridprobes/ops.Role=lcgadmin/GoodCEs).
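The directory naming rule can be sketched in Python as below. The sketch drops the leading "/" before replacing the remaining "/" characters with ".", which is an assumption chosen to reproduce the example path above; the function name is hypothetical.

```python
import posixpath


def good_ces_path(vo_or_fqan, base="/var/lib/gridprobes"):
    """Return the GoodCEs file path for a VO name or a VO FQAN."""
    # Drop the leading "/" and turn the remaining "/" into ".",
    # which matches the ops.Role=lcgadmin example.
    name = vo_or_fqan.lstrip("/").replace("/", ".")
    return posixpath.join(base, name, "GoodCEs")


if __name__ == "__main__":
    print(good_ces_path("/ops/Role=lcgadmin"))
    print(good_ces_path("ops"))
```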

Monitoring Globus services

Globus services currently do not support VOs. In order to monitor Globus services, the SAM administrator has to contact all sites and ask them to add the certificate DN to the grid-mapfile.

Removing metrics from alias only
REMOVE_ALIAS_METRIC!alias!metric
Enabling host and site contacts when global notifications (ENABLE_NOTIFICATIONS) are disabled

It's done in localdb:

ADD_SITE_CONTACT!sitename!emailAddress
ENABLE_SITE_CONTACT!sitename!emailAddress
ENABLE_HOSTCONTACT!hostname!emailAddress
Throttling of MyWLCG WEB API

Performance limits in MyWLCG/MyEGI portal are set by YAIM variables.
These variables have the default values listed below:

# Limit number of rows that can be fetched at a time to avoid DB dumps.
MYWLCG_DB_LIMIT=50000
# Limit number of accesses per IP address in a given time(seconds).
MYWLCG_ACCESS_PERIOD=5
MYWLCG_NUMBER_OF_ACCESSES=100
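To illustrate what "number of accesses per IP address in a given time" means, here is a sliding-window limiter sketched in Python with the default values above. It is not the MyWLCG implementation, whose internals are not described here; the class name and the sliding-window choice are assumptions made for illustration.

```python
import time
from collections import defaultdict, deque


class AccessLimiter:
    """Allow at most max_accesses requests per IP within period seconds."""

    def __init__(self, period=5, max_accesses=100, clock=time.monotonic):
        self.period = period
        self.max_accesses = max_accesses
        self.clock = clock
        self.hits = defaultdict(deque)  # IP address -> timestamps of accepted requests

    def allow(self, ip):
        now = self.clock()
        window = self.hits[ip]
        # Discard timestamps that have fallen out of the period.
        while window and now - window[0] >= self.period:
            window.popleft()
        if len(window) >= self.max_accesses:
            return False
        window.append(now)
        return True


if __name__ == "__main__":
    limiter = AccessLimiter(period=5, max_accesses=100)
    print(all(limiter.allow("192.0.2.1") for _ in range(100)))
    print(limiter.allow("192.0.2.1"))
```

The clock parameter exists only so that the behaviour can be exercised without waiting for real time to pass.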

Runtime

How to know which Messaging Broker is configured in Nagios?
cat /var/cache/msg/broker-cache-file/broker-list
How to remove hosts from Nagios so they are not monitored?

In /etc/ncg/ncg.conf, set:

<NCG::SiteSet>
  <File>
      DB_FILE=/etc/ncg/ncg.localdb.d/sites.conf
      DB_DIRECTORY=/etc/ncg/ncg-localdb.d
  </File>
</NCG::SiteSet>

and in the <NCG::ConfigGen> section, set

INCLUDE_EMPTY_HOSTS=0

In the /etc/ncg/ncg.localdb.d directory, create a file ending in '.conf' (for instance sites.conf) with the list of sites to monitor and the list of services to remove, using lines like these:

ADD_SITE!CERN-PROD
REMOVE_HOST!samdpm001.cern.ch
How to replace the default SE used for replica management tests

There are two possibilities:

1. In the /etc/gridmon/org.sam.conf configuration file, uncomment this line:

#wn_se_rep = samdpm002.cern.ch

and set it to your preferred SE. You don't need to rerun NCG after this change.

2. Create a file in the NCG localdb directory (/etc/ncg/ncg.localdb.d/anyfilename) with the following line in it:

MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-se-rep!your.srm

and then rerun NCG:

[root@vtb-generic-30 ncg]# ncg.pl
[root@vtb-generic-30 ncg]# service nagios reload

Note: after the change of configuration check the output of metric 'org.sam.WN-Rep' to see which was the SE node used, e.g.:

File was copied to SE my-cro-se.srce.hr and registered in LFC prod-lfc-shared-central.cern.ch.

The "Last Check Time" should be later than the time of the configuration change.

How to create/renew the MyProxy proxy for Nagios use?

In order to execute probes that interact with Grid services, Nagios needs a proxy certificate. This proxy certificate is automatically renewed based on a MyProxy certificate stored by the Nagios Admins for its operation.

In order to store a MyProxy certificate one needs to execute the following command at the UI where the certificate is available:

myproxy-init -l nagios -s <MyProxy Server> -k NagiosRetrieve-<Nagios Server>-<VO> -c 336 -x -Z "<Nagios Server certificate's subject DN>"

For example:

myproxy-init -l nagios -s myproxy.example.com -k NagiosRetrieve-nagios.example.com-dteam -c 336 -x -Z "/C=XX/O=The Grid/OU=Monitoring Service/CN=nagios.example.com"
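When several VOs or Nagios servers are involved, the command line can be assembled from its parts to avoid typos in the credential name. The Python sketch below builds the same argument list as the template above and quotes it for a shell; it only constructs the string and does not run myproxy-init. The function name is hypothetical.

```python
import shlex


def myproxy_init_command(myproxy_server, nagios_server, vo, nagios_dn, hours=336):
    """Build the myproxy-init command line following the template above."""
    args = [
        "myproxy-init",
        "-l", "nagios",
        "-s", myproxy_server,
        "-k", f"NagiosRetrieve-{nagios_server}-{vo}",
        "-c", str(hours),
        "-x",
        "-Z", nagios_dn,
    ]
    # shlex.join (Python 3.8+) quotes arguments containing spaces, such as the DN.
    return shlex.join(args)


if __name__ == "__main__":
    print(myproxy_init_command(
        "myproxy.example.com",
        "nagios.example.com",
        "dteam",
        "/C=XX/O=The Grid/OU=Monitoring Service/CN=nagios.example.com",
    ))
```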
How to manually add services on NGI/ROC Nagioses

1. Create additional config file for the site (e.g. /etc/ncg/ncg.conf.d/sitename.conf) with the following content:

<NCG::SiteInfo sitename>
  # NCG::SiteInfo content from /etc/ncg/ncg.conf
  <ATP>
    ATP_ROOT_URL=https://grid-monitoring.cern.ch/atp
  </ATP>
  <File>
    DB_FILE=/etc/ncg/ncg.localdb
    DB_DIRECTORY=/etc/ncg/ncg-localdb.d
  </File>

  # additional config line
  <File>
    DB_FILE=/etc/ncg/ncg.sitename
  </File>
</NCG::SiteInfo>
Warning
DB_FILE must not be located in the default directory /etc/ncg/ncg-localdb.d.
Warning
If the NCG::SiteInfo block in /etc/ncg/ncg.conf differs from the example above, copy its content to /etc/ncg/ncg.conf.d/sitename.conf and add a File block with /etc/ncg/ncg.sitename (see the comments above).

2. In the local db file /etc/ncg/ncg.sitename add the following content:

# adding VO specific service
ADD_HOST_SERVICE_VO!hostname!MPICH!ops
# adding generic service
ADD_HOST_SERVICE!hostname!BDII
What is the command executed by Nagios to run the check 'org.sam.SRM-All'?

On the Nagios box run:

nagios-run-check your_hostname org.sam.SRM-All-/ops/Role=lcgadmin

To see the detailed log run:

nagios-run-check gridvm02.roma2.infn.it org.sam.SRM-All-/ops/Role=lcgadmin --verbose --dryrun
How to monitor glexec services on ROC/NGI Nagios?

In the Yaim configuration set the following variable:

NCG_HASH_CONFIG_PROFILES=<role_name>,glexec

where <role_name> is name of your role.

In case you want to run glexec metrics with different VO FQAN (e.g. /ops/Role=pilot) set the following variable:

NCG_PROFILE_FQAN_glexec=/ops/Role=pilot
How to execute Nagios check from command line interface.

Only active checks can be run this way. Run the following command:

nagios-run-check -v -d -H <hostname> -s org.sam.CE-JobState-ops

The output shows the command that Nagios executes, e.g.:

[root@vtb-generic-30 ~]# nagios-run-check -v -d -H ce02.tier2.hep.manchester.ac.uk -s org.sam.CE-JobState-ops
Executing command:
su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "ce02.tier2.hep.manchester.ac.uk" -t 600 --vo ops --mb-destination /queue/grid.probe.metricOutput.EGEE.vtb-generic-30_cern_ch -x /etc/nagios/globus/userproxy.pem-ops --prev-status $LASTSERVICESTATEID$ -m org.sam.CE-JobState --err-topics ce_wms,default'
How to manage multiple proxies on the Nagios server

Add profile <PROFILE_NAME> in your site-info.def file and rerun yaim:

NCG_HASH_CONFIG_PROFILES="(...),<PROFILE_NAME>"
NCG_PROFILE_FQAN_<PROFILE_NAME>=/ops/Role=pilot
MyEGI portal doesn't show any results.

Do you have MDDB profile defined for your VO?

Nagios doesn't test services from site <SITENAME>

Check if these services support VO configured in your Nagios.

  • Nagios bootstrapped by ATP
    http://grid-monitoring.cern.ch/atp/api/search/servicemap/json?site=<SITENAME>&ismonitored=on
  • Nagios bootstrapped by old SAM
    http://lcg-sam.cern.ch:8080/same-pi/services_per_vo_monitored.jsp?Site_name=<SITENAME>
    http://lcg-sam.cern.ch:8080/same-pi/service_types_per_service_endpoint_monitored.jsp
I don't see metric results for my service for HEP VO(s)
  • Example:
    http://grid-monitoring.cern.ch/myegi/sam-pi/status_of_service_in_profile?vo_name=atlas&profile_name=ATLAS_CRITICAL
  • Solution:
    • Check if the service is being tested by HEP VO's Nagios:
      https://sam-alice.cern.ch/nagios/
      https://sam-atlas.cern.ch/nagios/
      https://sam-cms.cern.ch/nagios/
      https://sam-lhcb.cern.ch/nagios/
    • If not, then assign ticket to 'VOSupport'
      Don't forget to choose in GGUS the affected VO.
    • If yes, then
      • Check if the service is defined in VO feed with correct flavour.
        URLs to VO feeds are defined in https://tomtools.cern.ch/confluence/display/SAM/ATP+VO+Feeds
        • If not, then assign ticket to 'VOSupport'
          Don't forget to choose in GGUS the affected VO.
        • If yes, then assign ticket to 'SAM/Nagios 3rd Level Support'
MyEGI/MyWLCG web services give output: "404: File Not Found"
  • Check if you are calling the web service correctly, e.g.
    http://grid-monitoring.cern.ch/myegi/sam-pi/latest_metric_results_in_profile?vo_name=ops&profile_name=ROC_CRITICAL&service_flavour=CREAM-CE
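A malformed query string is a common cause of the 404 response. One way to avoid it is to build the URL programmatically, as in the Python sketch below, which reproduces the example call. The helper name sam_pi_url is hypothetical; the host, method and parameter names are taken from the example.

```python
from urllib.parse import urlencode


def sam_pi_url(host, method, **params):
    """Build a MyEGI/MyWLCG sam-pi URL from a method name and query parameters."""
    return f"http://{host}/myegi/sam-pi/{method}?{urlencode(params)}"


if __name__ == "__main__":
    print(sam_pi_url(
        "grid-monitoring.cern.ch",
        "latest_metric_results_in_profile",
        vo_name="ops",
        profile_name="ROC_CRITICAL",
        service_flavour="CREAM-CE",
    ))
```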
MyEGI/MyWLCG: The services are missing in NGI Nagios instance but visible on central grid-monitoring instance
  • Example:
    http://rnagios.ibergrid.cesga.es/myegi/sam-pi/latest_metric_results_in_profile?vo_name=ops&profile_name=ROC&service_flavour=CREAM-CE&service_hostname=ce09.pic.es
    http://grid-monitoring.cern.ch/myegi/sam-pi/latest_metric_results_in_profile?vo_name=ops&profile_name=ROC&service_flavour=CREAM-CE&service_hostname=ce09.pic.es
  • Solution
    • Check in your database if ATP synchronizer is running correctly and if service is present.
      mysql> use mrs;
      mysql> select * from service where hostname = 'ce09.pic.es';
      mysql> select * from synchronizer_lastrun;
      mysql> select * from synchronizer;
Alarms for 'NGI_A' in Operations Portal (https://operations-portal.in2p3.fr/dashboard) are pointing to wrong regional nagios instance.

The Operations portal filters alarms based on NGI_NAME and NAGIOS_ROLE (see http://gridops.cern.ch/config/nagios-roles.conf). Most likely another Nagios instance has been misconfigured and has started sending alarms for 'NGI_A'.
Solution: Contact the admin of this Nagios instance.

Which metrics have been changed between SAM Update-20 and SAM Update-22?

As part of the integration of EMI probes, several profile changes are needed.

Removals

Several metrics need to be removed from profiles, as they have been deprecated by their developers.
With these removals, the following metrics will disappear from all APIs and interfaces (by 1st of June 2013).

  • org.sam.LFC-CertLifetime (there is no replacement)
  • org.arc.AUTH (not needed, as this is tested indirectly at any job submission)
  • org.arc.SW-VERSION (checked ARC version publishing; this functionality is now provided by the new test org.nordugrid.ARC-CE-ARIS)
  • org.sam.mpi.CE-JobSubmit (replaced by new MPI tests)
  • org.sam.WN-MPI (replaced by new MPI tests)
Renames/replacements

Some metrics have been renamed by their developers.
The idea is to keep both names till all instances are upgraded:

  • The central instance and upgraded NGIs with the new names
  • Non-upgraded NGIs with the old names

SAM Update-22 includes a metric-renaming mechanism, to ensure correct functionality during transition period.

The metric history will be preserved, and this change will not affect the Availability and Reliability calculation.

The full list of replacements is the following:

Old name                         New name
org.sam.CE-JobSubmit             emi.ce.CREAMCE-JobSubmit
org.sam.CREAMCE-DirectJobSubmit  emi.cream.CREAMCE-DirectJobSubmit
org.sam.CREAMCE-JobSubmit        emi.cream.CREAMCE-JobSubmit
org.sam.WN-Bi                    emi.wn.WN-Bi
org.sam.WN-Csh                   emi.wn.WN-Csh
org.sam.WN-SoftVer               emi.wn.WN-SoftVer
org.sam.glexec.CE-JobSubmit      emi.cream.glexec.CREAMCE-JobSubmit
org.sam.glexec.WN-gLExec         emi.cream.glexec.WN-gLExec
org.arc.ARC-STATUS               org.nordugrid.ARC-CE-ARIS
org.arc.CA-VERSION               org.nordugrid.ARC-CE-IGTF
org.arc.csh                      org.nordugrid.ARC-CE-sw-csh
org.arc.gcc                      org.nordugrid.ARC-CE-sw-gcc
org.arc.perl                     org.nordugrid.ARC-CE-sw-perl
org.arc.python                   org.nordugrid.ARC-CE-sw-python
org.arc.Jobsubmit                org.nordugrid.ARC-CE-result
org.arc.LFC                      org.nordugrid.ARC-CE-lfc
org.arc.SRM                      org.nordugrid.ARC-CE-srm
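Scripts that query metric results by name may need to accept both old and new names during the transition. The Python sketch below shows one way to normalise names with a lookup table. It includes only a few rows of the list above for brevity and is not the renaming mechanism shipped with SAM Update-22; the names RENAMES and current_name are hypothetical.

```python
# A subset of the old-name -> new-name pairs listed above.
RENAMES = {
    "org.sam.CREAMCE-JobSubmit": "emi.cream.CREAMCE-JobSubmit",
    "org.sam.WN-Bi": "emi.wn.WN-Bi",
    "org.arc.ARC-STATUS": "org.nordugrid.ARC-CE-ARIS",
}


def current_name(metric):
    """Return the new metric name, or the input unchanged if it was not renamed."""
    return RENAMES.get(metric, metric)


if __name__ == "__main__":
    for name in ("org.sam.WN-Bi", "org.sam.SRM-All"):
        print(name, "->", current_name(name))
```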
Additions

These are the new metrics added:

  • emi.ARGUS
    • emi.ARGUS.PDP-memory
    • emi.ARGUS.PDP-status
    • emi.ARGUS.PDP-traffic
    • emi.ARGUS.PEP-memory
    • emi.ARGUS.PEP-status
    • emi.ARGUS.PEP-traffic
  • eu.egi.MPI
    • eu.egi.mpi.complexjob.CREAMCE-JobSubmit
    • eu.egi.mpi.complexjob.WN
    • eu.egi.mpi.simplejob.CREAMCE-JobSubmit
    • eu.egi.mpi.simplejob.WN
    • eu.egi.mpi.EnvSanityCheck
  • ARC-CE
    • org.nordugrid.ARC-CE-submit
    • org.nordugrid.ARC-CE-LFC-submit
    • org.nordugrid.ARC-CE-SRM-submit
    • org.nordugrid.ARC-CE-LFC-result
    • org.nordugrid.ARC-CE-SRM-result

Nagios.pm.mvopatch (application/octet-stream)
Document generated by Confluence on Feb 27, 2014 10:19