This page last changed on Feb 25, 2014 by mbabik.
This page lists frequently asked questions about SAM.
The following troubleshooting items are already available under two categories:
Monitoring of the local services
The URL that ATP uses to connect to GOCDB must be changed manually (please note that yaim will overwrite this change). In /etc/atp/atp_synchro.conf, change the first pair of URLs below to the second pair, which adds the scope parameters:
site_url = https://goc.egi.eu/gocdbpi/private/?method=get_site
serviceendpoint_url = https://goc.egi.eu/gocdbpi/private/?method=get_service_endpoint
site_url = https://goc.egi.eu/gocdbpi/private/?method=get_site&scope=EGI,Local&scope_match=any
serviceendpoint_url = https://goc.egi.eu/gocdbpi/private/?method=get_service_endpoint&scope=EGI,Local&scope_match=any
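The second pair of URLs above differs from the first only by the appended scope and scope_match query parameters. As a minimal illustration, the scoped values can be derived from the unscoped ones as follows. The helper name add_scope is hypothetical and not part of ATP.

```python
def add_scope(url, scopes=("EGI", "Local"), match="any"):
    """Append GOCDB scope parameters to a gocdbpi URL (hypothetical helper)."""
    # The URLs above already carry '?method=...', so extra parameters use '&'.
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}scope={','.join(scopes)}&scope_match={match}"


site_url = add_scope("https://goc.egi.eu/gocdbpi/private/?method=get_site")
serviceendpoint_url = add_scope(
    "https://goc.egi.eu/gocdbpi/private/?method=get_service_endpoint"
)
print("site_url =", site_url)
print("serviceendpoint_url =", serviceendpoint_url)
```

The printed lines can be compared with the values to be placed in /etc/atp/atp_synchro.conf.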
Nagios access configuration
This is done via voms2htpasswd, which combines a dynamic and a static configuration to generate /etc/httpd/htpasswd. The dynamic part is generated from VOMS and GOCDB site contacts, e.g.
# cat /etc/voms2htpasswd.conf
This is based on the site-info.def variables VO_<VO>_VOMS_SERVERS and NCG_GOCDB_ROC_NAME.
A list of static DNs can also be added in a .conf file under /etc/voms2htpasswd-static.d/
# cat /etc/voms2htpasswd-static.d/YAIM-ops-monitor.conf
Starting from Update-13, SAM supports deployment of hot-standby (active/active) instances. The system works in the following way:
- The backup instance is deployed using the same Yaim configuration, with the added BACKUP_INSTANCE variable described below.
- The SAM administrator opens a GGUS ticket to the SAM/Nagios Support Unit requesting the addition of the backup host to the message consumer filter (file nagios-roles.conf).
- The backup instance constantly monitors resources, but with the following differences:
  - alarms are not sent to the Operations portal
  - email notifications are disabled
  - results are not sent to the central MRS database
  - note: results are stored in the local MRS, so MyEGI shows the correct history on both instances.
- In case of failure of the main instance, the SAM administrator has to manually switch off the BACKUP_INSTANCE variable on the backup instance.
The backup instance can be defined in several ways:
- Via the Yaim variable which sets the variable BACKUP_INSTANCE in /etc/sysconfig/ncg (recommended mechanism)
- Fast backup configuration without YAIM execution:
  - Setting the variable BACKUP_INSTANCE in /etc/sysconfig/ncg (this approach can be used for fast failover):
  - Setting a global variable in the ncg.conf file:
  - Using an ncg.pl argument:
In case of backup configuration without Yaim the following additional step is needed:
/sbin/chkconfig send-to-dashboard off
/sbin/service send-to-dashboard stop
In order to turn the backup instance into the active one, the SAM administrator has to remove the BACKUP_INSTANCE variable. If Yaim is not used, the following additional step is needed:
/sbin/chkconfig send-to-dashboard on
/sbin/service send-to-dashboard start
Starting from Update-09, SAM supports the use of robot certificates instead of MyProxy credentials. If your CA supports robot certificates, we suggest switching to them, as they are easier to maintain. Robot certificates also provide better availability, because SAM no longer depends on the availability of a MyProxy server.
In order to use robot certificates set the following YAIM variables:
# Robot cert and key can be different for each VO
# and standard Yaim VO notation is used
Note: Variables ROBOT_CERT and ROBOT_KEY use standard Yaim VO notation. If VO directories (vo.d/voname) are used, the variables should be put in the appropriate VO files.
Enabling ACE in MyWLCG
Currently it's only for the central MyWLCG instance. YAIM configuration:
Multi-VO configuration setup
Starting from Update-17.1, you can set up VO-Nagios and define multiple profiles in your local POEM web interface (see https://tomtools.cern.ch/confluence/display/SAMDOC/POEM+User%27s+guide). In order to support multiple VOs, you need to define a VO feed for each VO as described in https://tomtools.cern.ch/confluence/display/SAM/ATP+VO+Feeds and then define a POEM profile for each VO. The rest of the configuration should follow the standard VO-Nagios setup. Please note that NCG needs the Nagios.pm.mvopatch patch for this to work.
Setting alternative SE for metric org.sam.WN-RepRep
Starting from the release Update-07, it is possible to specify more than one replication SE for WN replica test org.sam.WN-RepRep. Static and/or dynamic mechanisms are possible.
In order to define a static list of comma-separated hostnames, set the following Yaim variable:
The dynamic list is filled with the SEs defined on the Nagios instance that recently passed the org.sam.SRM-All set of tests. In order to use the dynamic list, set the following Yaim variable:
The filename must be defined without a path.
If the dynamic list is used, the metric hr.srce.GoodSEs will be associated with the Nagios host. The hr.srce.GoodSEs metric generates the list of "good" SEs and provides the file as an input parameter to the org.sam.(CREAM)CE-JobState metric(s).
The org.sam.(CREAM)CE-JobState metric(s) take at most 3 hosts from the file and, if JOBSUBMIT_WN_SE_REP was defined, append them to the static list. On the WN, org.sam.WN-RepRep tries to replicate to the SEs in the provided order until the replication succeeds. The metric returns CRITICAL if the file could not be replicated to any of the SEs. This fixes https://tomtools.cern.ch/jira/browse/SAM-442.
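The selection and fallback behaviour described above can be sketched roughly as follows. This is an illustration only, not the probe's actual code: the function names, the replicate callable and the hostnames are hypothetical, and the "OK" status on success is an assumption (the text above only states the CRITICAL case).

```python
def build_se_list(static_ses, good_ses, max_dynamic=3):
    """Static SEs first, then at most max_dynamic hosts from the GoodSEs file."""
    return list(static_ses) + list(good_ses)[:max_dynamic]


def replicate_with_fallback(se_list, replicate):
    """Try each SE in the given order and stop at the first success.

    `replicate` is a hypothetical callable returning True on success.
    """
    for se in se_list:
        if replicate(se):
            return "OK", se  # "OK" is an assumed status name
    # No SE accepted the replica.
    return "CRITICAL", None


# Made-up hostnames; in this example only se2.example.org accepts the replica.
ses = build_se_list(
    ["se1.example.org"],
    ["se2.example.org", "se3.example.org", "se4.example.org", "se5.example.org"],
)
status, used = replicate_with_fallback(ses, lambda se: se == "se2.example.org")
print(ses, status, used)
```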
Setting alternative BDII for metric org.sam.SRM-All
The metric org.sam.SRM-All uses the sam-bdii.cern.ch top BDII by default. To make the tests less dependent on the CERN top BDII, setting an alternative BDII is recommended.
In order to set an alternative BDII, create a localdb file (e.g. /etc/ncg/ncg-localdb.d/srm.conf). There are two options:
1. switch to your own top BDII:
2. use site BDII:
Setting alternative LFC for metrics org.sam.WN-Rep*
Metrics org.sam.WN-Rep* use prod-lfc-shared-central.cern.ch LFC by default.
In order to set an alternative LFC, create a localdb file (e.g. /etc/ncg/ncg-localdb.d/LFC.conf):
Setting alternative BDII for metric org.sam.CREAMCE-DirectJobState
The metric org.sam.CREAMCE-DirectJobState uses the sam-bdii.cern.ch top BDII by default. To make the tests less dependent on the CERN top BDII, setting an alternative BDII is recommended.
In order to set an alternative BDII, create a localdb file (e.g. /etc/ncg/ncg-localdb.d/creamcedjs.conf). There are two options:
1. switch to your own top BDII:
2. use site BDII:
Setting alternative list of CEs for metric org.sam.WMS-JobState
If the monitored infrastructure contains WMS service and no CE services, metric hr.srce.GoodCEs associated to Nagios service will fail with the following error:
HealthyNodes CRITICAL - No healthy hosts found.
There are two options to solve this issue.
1. If the infrastructure contains CREAM-CE services create file /etc/ncg/ncg-localdb.d/GoodCEs-fix with the following content:
2. In order to use static list of CEs or CREAM-CEs create file /etc/ncg/ncg-localdb.d/GoodCEs-fix with the following content:
For each VO supported on the SAM instance, create the file /var/lib/gridprobes/<VO_NAME>/GoodCEs and list the CE/CREAM-CE names in it. For example:
In case VO_FQAN is used (e.g. /ops/Role=lcgadmin) <VO_NAME> should be set to VO_FQAN with "/" replaced with "." (e.g. /var/lib/gridprobes/ops.Role=lcgadmin/GoodCEs).
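The naming rule for the GoodCEs file can be illustrated with a small helper. The function name good_ces_path is hypothetical. Dropping the leading "/" of the FQAN is an assumption inferred from the example path above.

```python
import posixpath


def good_ces_path(vo_or_fqan, base="/var/lib/gridprobes"):
    """Return the GoodCEs file path for a VO name or a VO FQAN (hypothetical helper)."""
    # Assumption: the leading '/' is dropped and the remaining '/' become '.',
    # which matches the example /var/lib/gridprobes/ops.Role=lcgadmin/GoodCEs.
    name = vo_or_fqan.strip("/").replace("/", ".")
    return posixpath.join(base, name, "GoodCEs")


print(good_ces_path("ops"))
print(good_ces_path("/ops/Role=lcgadmin"))
```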
Monitoring Globus services
Globus services currently do not support VOs. In order to monitor Globus services, the SAM administrator has to contact all sites and ask them to add the certificate DN to the grid-mapfile.
Removing metrics from alias only
Enabling host and site contacts when global notifications (ENABLE_NOTIFICATIONS) are disabled
It's done in localdb:
Throttling of MyWLCG WEB API
Performance limits in MyWLCG/MyEGI portal are set by YAIM variables.
These variables have the default values listed below:
# Limit number of rows that can be fetched at a time to avoid DB dumps.
# Limit number of accesses per IP address in a given time(seconds).
How to know which Messaging Broker is configured in Nagios?
How to remove hosts from Nagios so they are not monitored?
In /etc/ncg/ncg.conf, set:
and in the <NCG::ConfigGen> section, set
In the /etc/ncg/ncg-localdb.d directory, create a file with the '.conf' extension (for instance sites.conf) containing the list of sites to monitor and the list of services to remove, with lines like these:
How to replace the default SE used for replica management tests
There are two possibilities:
In /etc/gridmon/org.sam.conf configuration file, uncomment this line:
#wn_se_rep = samdpm002.cern.ch
and set it to your preferred SE. You don't need to rerun NCG after this change.
Create a file in the NCG localdb directory (/etc/ncg/ncg-localdb.d/anyfilename) with the following line in it:
and then rerun NCG:
[root@vtb-generic-30 ncg]# ncg.pl
[root@vtb-generic-30 ncg]# service nagios reload
Note: after changing the configuration, check the output of the metric 'org.sam.WN-Rep' to see which SE node was used, e.g.:
File was copied to SE my-cro-se.srce.hr and registered in LFC prod-lfc-shared-central.cern.ch.
The "Last Check Time" should be later than the configuration change.
How to create/renew the MyProxy proxy for Nagios use?
In order to execute probes that interact with Grid services, Nagios needs a proxy certificate. This proxy certificate is automatically renewed based on a MyProxy certificate stored by the Nagios Admins for its operation.
In order to store a MyProxy certificate one needs to execute the following command at the UI where the certificate is available:
myproxy-init -l nagios -s <MyProxy Server> -k NagiosRetrieve-<Nagios Server>-<VO> -c 336 -x -Z "<Nagios Server certificate's subject DN>"
myproxy-init -l nagios -s myproxy.example.com -k NagiosRetrieve-nagios.example.com-dteam -c 336 -x -Z "/C=XX/O=The Grid/OU=Monitoring Service/CN=nagios.example.com"
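The relation between the template and the example above can be made explicit with a short sketch that fills in the placeholders. The helper name is hypothetical. It only assembles the same options shown above and does not run the command.

```python
import shlex


def myproxy_init_command(myproxy_server, nagios_server, vo, nagios_dn, cred_lifetime=336):
    """Build the myproxy-init argument list from the template above (hypothetical helper)."""
    return [
        "myproxy-init",
        "-l", "nagios",
        "-s", myproxy_server,
        # Credential name follows the pattern NagiosRetrieve-<Nagios Server>-<VO>.
        "-k", f"NagiosRetrieve-{nagios_server}-{vo}",
        "-c", str(cred_lifetime),  # same value as in the example above
        "-x",
        "-Z", nagios_dn,
    ]


cmd = myproxy_init_command(
    "myproxy.example.com",
    "nagios.example.com",
    "dteam",
    "/C=XX/O=The Grid/OU=Monitoring Service/CN=nagios.example.com",
)
# shlex.join quotes the DN so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Keeping the arguments as a list avoids quoting problems with DNs that contain spaces.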
How to manually add services on NGI/ROC Nagioses
1. Create additional config file for the site (e.g. /etc/ncg/ncg.conf.d/sitename.conf) with the following content:
# NCG::SiteInfo content from /etc/ncg/ncg.conf
# additional config line
DB_FILE must not be in the default directory /etc/ncg/ncg-localdb.d.
If the NCG::SiteInfo block in /etc/ncg/ncg.conf is different from the example above, copy its content to /etc/ncg/ncg.conf.d/sitename.conf and add a File block with /etc/ncg/ncg.sitename (see comments above).
2. In the local db file /etc/ncg/ncg.sitename add the following content:
# adding VO specific service
# adding generic service
What is the command executed by Nagios to run the check 'org.sam.SRM-All'?
On the Nagios box run:
nagios-run-check your_hostname org.sam.SRM-All-/ops/Role=lcgadmin
To see the detailed log run:
nagios-run-check gridvm02.roma2.infn.it org.sam.SRM-All-/ops/Role=lcgadmin --verbose --dryrun
How to monitor glexec services on ROC/NGI Nagios?
In the Yaim configuration set the following variable:
where <role_name> is name of your role.
In case you want to run glexec metrics with different VO FQAN (e.g. /ops/Role=pilot) set the following variable:
How to execute Nagios check from command line interface.
Only active checks can be run this way. Run the following command:
nagios-run-check -v -d -H <hostname> -s org.sam.CE-JobState-ops
The output shows the command that is executed, e.g.:
[root@vtb-generic-30 ~]# nagios-run-check -v -d -H ce02.tier2.hep.manchester.ac.uk -s org.sam.CE-JobState-ops
su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "ce02.tier2.hep.manchester.ac.uk" -t 600 --vo ops --mb-destination /queue/grid.probe.metricOutput.EGEE.vtb-generic-30_cern_ch -x /etc/nagios/globus/userproxy.pem-ops --prev-status $LASTSERVICESTATEID$ -m org.sam.CE-JobState --err-topics ce_wms,default'
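In the examples on this page, the service name passed to nagios-run-check appears to be the metric name followed by a hyphen and the VO name or VO FQAN (e.g. org.sam.CE-JobState-ops, org.sam.SRM-All-/ops/Role=lcgadmin). Assuming that pattern holds, the invocation above can be assembled like this. The helper names are hypothetical, and only the options shown above are used.

```python
def service_name(metric, vo_or_fqan):
    """Compose a Nagios service name as <metric>-<VO or FQAN> (assumed pattern)."""
    return f"{metric}-{vo_or_fqan}"


def run_check_command(hostname, metric, vo_or_fqan):
    """Build the nagios-run-check argument list used in the example above."""
    return [
        "nagios-run-check", "-v", "-d",
        "-H", hostname,
        "-s", service_name(metric, vo_or_fqan),
    ]


cmd = run_check_command("ce02.tier2.hep.manchester.ac.uk", "org.sam.CE-JobState", "ops")
print(" ".join(cmd))
```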
How to manage multiple proxies on the Nagios server
Add profile <PROFILE_NAME> in your site-info.def file and rerun yaim:
MyEGI portal doesn't show any results.
Do you have MDDB profile defined for your VO?
Nagios doesn't test services from site <SITENAME>
Check if these services support VO configured in your Nagios.
- Nagios bootstrapped by ATP
- Nagios bootstrapped by old SAM
I don't see metric results for my service for HEP VO(s)
- Check if the service is being tested by the HEP VO's Nagios:
  - If not, assign the ticket to 'VOSupport'. Don't forget to select the affected VO in GGUS.
  - If yes:
    - Check if the service is defined in the VO feed with the correct flavour. URLs to VO feeds are defined in https://tomtools.cern.ch/confluence/display/SAM/ATP+VO+Feeds
      - If not, assign the ticket to 'VOSupport'. Don't forget to select the affected VO in GGUS.
      - If yes, assign the ticket to 'SAM/Nagios 3rd Level Support'.
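The checklist above amounts to a small decision function. The sketch below encodes one reading of it. The function and parameter names are hypothetical, and the two inputs still have to be checked by hand.

```python
def support_unit(tested_by_vo_nagios, in_vo_feed_with_correct_flavour):
    """Return the GGUS support unit suggested by the checklist above."""
    if not tested_by_vo_nagios:
        # The service is not tested by the HEP VO's Nagios at all.
        return "VOSupport"
    if not in_vo_feed_with_correct_flavour:
        # Tested, but missing from the VO feed or listed with the wrong flavour.
        return "VOSupport"
    return "SAM/Nagios 3rd Level Support"


print(support_unit(True, True))
```

In both 'VOSupport' cases the affected VO still has to be selected in GGUS.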
MyEGI/MyWLCG web services give output: "404: File Not Found"
- Check if you are calling the web service correctly, e.g.
MyEGI/MyWLCG: The services are missing in NGI Nagios instance but visible on central grid-monitoring instance
- Check in your database if ATP synchronizer is running correctly and if service is present.
mysql> use mrs;
mysql> select * from service where hostname = 'ce09.pic.es';
mysql> select * from synchronizer_lastrun;
mysql> select * from synchronizer;
Alarms for 'NGI_A' in Operations Portal (https://operations-portal.in2p3.fr/dashboard) are pointing to the wrong regional Nagios instance.
The Operations portal filters alarms based on NGI_NAME and NAGIOS_ROLE (see http://gridops.cern.ch/config/nagios-roles.conf). Most likely another Nagios instance was misconfigured and started sending alarms for 'NGI_A'.
Solution: Contact the admin of this Nagios instance.
Which metrics have been changed between SAM Update-20 and SAM Update-22?
As part of the integration of EMI probes, several profile changes are needed.
Several metrics need to be removed from profiles, as they have been deprecated by their developers.
With these removals, the following metrics will disappear from all APIs and interfaces (by 1st of June 2013).
- org.sam.LFC-CertLifetime (there is no replacement)
- org.arc.AUTH (not needed, as this is tested indirectly at any job submission)
- org.arc.SW-VERSION (checked ARC version publishing; this functionality is now provided by the new test org.nordugrid.ARC-CE-ARIS)
- org.sam.mpi.CE-JobSubmit (replaced by new MPI tests)
- org.sam.WN-MPI (replaced by new MPI tests)
Some metrics have been renamed by their developers.
The idea is to keep both names till all instances are upgraded:
- The central instance and upgraded NGIs with the new names
- Non-upgraded NGIs with the old names
SAM Update-22 includes a metric-renaming mechanism, to ensure correct functionality during transition period.
Note: The metric history will be preserved; this change will not affect Availability and Reliability calculations.
The full list of replacements is the following:
| Old name | New name |
These are the new metrics added: