SAM Doc : FAQs
This page last changed on Feb 25, 2014 by mbabik.
Introduction
This page lists Frequently Asked Questions related to SAM. The following troubleshooting items are already available under two categories:
Installation
Configuration
Monitoring of the local services
Manual re-configuration is needed to change the URL that ATP uses to connect to GOCDB (please note that this will be overwritten by yaim). In /etc/atp/atp_synchro.conf, change the following URLs

    [gocdb]
    site_url = https://goc.egi.eu/gocdbpi/private/?method=get_site
    serviceendpoint_url = https://goc.egi.eu/gocdbpi/private/?method=get_service_endpoint

to

    site_url = https://goc.egi.eu/gocdbpi/private/?method=get_site&scope=EGI,Local&scope_match=any
    serviceendpoint_url = https://goc.egi.eu/gocdbpi/private/?method=get_service_endpoint&scope=EGI,Local&scope_match=any

Nagios access configuration
This is done via voms2htpasswd, which uses dynamic and static configuration to generate /etc/httpd/htpasswd. The dynamic part is generated from VOMS and GOCDB site contacts, e.g.

    # cat /etc/voms2htpasswd.conf
    vomss://voms1.somewhere:8443/voms/dteam?/dteam
    vomss://voms2.somewhereelse:8443/voms/ops?/ops
    gocdb://gocdb/api/private/?method=getcontacts&roc=ROC
    gocdb://gocdb/api/private/?method=getcontacts&roc=ROC

This is based on the site-info.def variables VO_<VO>_VOMS_SERVERS and NCG_GOCDB_ROC_NAME. It is also possible to add a list of static DNs in a .conf file under /etc/voms2htpasswd-static.d/:

    # cat /etc/voms2htpasswd-static.d/YAIM-ops-monitor.conf
    /DC=dc/DC=dc2/OU=ou1s/CN=SomeCN

Failover Nagios
Starting from Update-13, SAM supports deployment of hot-standby (active/active) instances. The system works in the following way:
A backup instance can be defined in several ways:
If the backup instance is configured without Yaim, the following additional step is needed:

    /sbin/chkconfig send-to-dashboard off
    /sbin/service send-to-dashboard stop

In order to turn the backup instance into the active one, the SAM administrator has to remove the BACKUP_INSTANCE variable. If Yaim is not used, the following additional step is needed:

    /sbin/chkconfig send-to-dashboard on
    /sbin/service send-to-dashboard start

Robot certificates
Starting from Update-09, SAM supports the use of robot certificates instead of MyProxy credentials. If your CA supports robot certificates, we suggest switching to them, as they are easier to maintain. Robot certificates also provide better availability, as SAM then does not depend on the availability of a MyProxy server. In order to use robot certificates, set the following YAIM variables:

    NCG_USE_ROBOT_CERT=true
    # Robot cert and key can be different for each VO
    # and standard Yaim VO notation is used
    VO_OPS_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem
    VO_OPS_ROBOT_KEY=/etc/nagios/globus/robot-key.pem
    VO_DTEAM_ROBOT_CERT=/etc/nagios/globus/robot-cert.pem-dteam
    VO_DTEAM_ROBOT_KEY=/etc/nagios/globus/robot-key.pem-dteam
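As a rough illustration of the per-VO naming pattern in the robot certificate variables above, the following Python sketch renders the corresponding site-info.def lines for a set of VOs. The helper name robot_cert_vars is hypothetical and is not part of SAM or YAIM. Upper-casing the VO name is inferred only from the two examples above (ops, dteam); VO names containing dots or dashes may need different handling.

```python
def robot_cert_vars(vos):
    """Render site-info.def lines enabling robot certificates.

    `vos` maps a VO name to a (certificate path, key path) tuple.
    Assumption: the variable prefix is "VO_" plus the upper-cased VO name,
    as in the ops and dteam examples; this is not checked against YAIM.
    """
    lines = ["NCG_USE_ROBOT_CERT=true"]
    for vo, (cert, key) in vos.items():
        prefix = "VO_" + vo.upper()
        lines.append(f"{prefix}_ROBOT_CERT={cert}")
        lines.append(f"{prefix}_ROBOT_KEY={key}")
    return lines


if __name__ == "__main__":
    example = {
        "ops": ("/etc/nagios/globus/robot-cert.pem",
                "/etc/nagios/globus/robot-key.pem"),
        "dteam": ("/etc/nagios/globus/robot-cert.pem-dteam",
                  "/etc/nagios/globus/robot-key.pem-dteam"),
    }
    print("\n".join(robot_cert_vars(example)))
```

The generated lines would still have to be placed in site-info.def by hand and Yaim rerun; the sketch only formats text and does not check that the certificate files exist.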
Enabling ACE in MyWLCG
Currently this is available only for the central MyWLCG instance. YAIM configuration:

    MYEGI_ACE=true

Multi-VO configuration setup
Starting from Update-17.1 you can set up a VO-Nagios and define multiple profiles in your local POEM Web interface (https://tomtools.cern.ch/confluence/display/SAMDOC/POEM+User%27s+guide). In order to support multiple VOs you will need to define a VO feed for each VO as described in https://tomtools.cern.ch/confluence/display/SAM/ATP+VO+Feeds and afterwards define a POEM profile for each VO. The rest of the configuration should follow the standard VO-Nagios setup. Please note that the Nagios.pm.mvopatch patch is needed for NCG to make this work.

Setting alternative SE for metric org.sam.WN-RepRep
Starting from the release Update-07, it is possible to specify more than one replication SE for the WN replica test org.sam.WN-RepRep. Static and/or dynamic mechanisms are possible. In order to define a static list of comma-separated hostnames, set the following Yaim variable:

    JOBSUBMIT_WN_SE_REP=se1[,se2,se3...]

The dynamic list is filled with SEs defined on the Nagios instance that recently passed the org.sam.SRM-All set of tests. In order to use the dynamic list, set the following Yaim variable:

    JOBSUBMIT_WN_SE_REP_FILE=filename

The filename must be given without a path. The org.sam.(CREAM)CE-JobState metric(s) take at most 3 hosts from the file and, if JOBSUBMIT_WN_SE_REP was defined, append them to the static list. On the WN, org.sam.WN-RepRep tries to replicate to the SEs in the provided order until a replication succeeds. The metric returns CRITICAL if the file could not be replicated to any of the SEs. This fixes https://tomtools.cern.ch/jira/browse/SAM-442.

Setting alternative BDII for metric org.sam.SRM-All
Metric org.sam.SRM-All uses the sam-bdii.cern.ch top BDII by default. In order to make tests less dependent on the CERN top BDII, it is suggested to set an alternative BDII.
In order to set an alternative BDII, create a localdb file (e.g. /etc/ncg/ncg-localdb.d/srm.conf). There are two options:

1. use a different top BDII:

    MODIFY_METRIC_PARAMETER!org.sam.SRM-All!--ldap-uri!your.top.bdii

2. use site BDII:

    MODIFY_METRIC_ATTRIBUTE!org.sam.SRM-All!SITE_BDII!--ldap-uri

Setting alternative LFC for metrics org.sam.WN-Rep*
Metrics org.sam.WN-Rep* use the prod-lfc-shared-central.cern.ch LFC by default. In order to set an alternative LFC, create a localdb file (e.g. /etc/ncg/ncg-localdb.d/LFC.conf):

    MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--wn-lfc!lfc.my.domain
    MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-lfc!lfc.my.domain

Setting alternative BDII for metric org.sam.CREAMCE-DirectJobState
Metric org.sam.CREAMCE-DirectJobState uses the sam-bdii.cern.ch top BDII by default. In order to make tests less dependent on the CERN top BDII, it is suggested to set an alternative BDII. In order to set an alternative BDII, create a localdb file (e.g. /etc/ncg/ncg-localdb.d/creamcedjs.conf). There are two options:

1. use a different top BDII:

    MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-DirectJobState!--ldap-uri!your.top.bdii

2. use site BDII:

    MODIFY_METRIC_ATTRIBUTE!org.sam.CREAMCE-DirectJobState!SITE_BDII!--ldap-uri

Setting alternative list of CEs for metric org.sam.WMS-JobState
If the monitored infrastructure contains a WMS service and no CE services, the metric hr.srce.GoodCEs associated with the Nagios service will fail with the following error:

    HealthyNodes CRITICAL - No healthy hosts found.

There are two options to solve this issue.

1. If the infrastructure contains CREAM-CE services, create the file /etc/ncg/ncg-localdb.d/GoodCEs-fix with the following content:

    MODIFY_METRIC_PARAMETER!hr.srce.GoodCEs!--metric!org.sam.CREAMCE-JobSubmit

2. In order to use a static list of CEs or CREAM-CEs, create the file /etc/ncg/ncg-localdb.d/GoodCEs-fix with the following content:

    REMOVE_METRIC!hr.srce.GoodCEs

For each VO supported on the SAM instance, create the file /var/lib/gridprobes/<VO_NAME>/GoodCEs and list CE/CREAM-CE names in it.
For example:

    ce1.reliable.my
    ce2.reliable.my
    cream-ce3.reliable.my

If a VO_FQAN is used (e.g. /ops/Role=lcgadmin), <VO_NAME> should be set to the VO_FQAN with the leading "/" removed and each remaining "/" replaced with "." (e.g. /var/lib/gridprobes/ops.Role=lcgadmin/GoodCEs).

Monitoring Globus services
Globus services currently do not support VOs. In order to monitor Globus services, the SAM administrator has to contact all sites and ask them to add the certificate DN to the grid-mapfile.

Removing metrics from alias only

    REMOVE_ALIAS_METRIC!alias!metric

Enabling host and site contacts when global notifications (ENABLE_NOTIFICATIONS) are disabled
This is done in localdb:

    ADD_SITE_CONTACT!sitename!emailAddress
    ENABLE_SITE_CONTACT!sitename!emailAddress
    ENABLE_HOSTCONTACT!hostname!emailAddress

Throttling of MyWLCG WEB API
Performance limits in the MyWLCG/MyEGI portal are set by YAIM variables.

    # Limit number of rows that can be fetched at a time to avoid DB dumps.
    MYWLCG_DB_LIMIT=50000
    # Limit number of accesses per IP address in a given time (seconds).
    MYWLCG_ACCESS_PERIOD=5
    MYWLCG_NUMBER_OF_ACCESSES=100

Runtime
How to know which Messaging Broker is configured in Nagios?

    cat /var/cache/msg/broker-cache-file/broker-list

How to remove hosts from Nagios so they are not monitored?
In /etc/ncg/ncg.conf, set:

    <NCG::SiteSet>
      <File>
        DB_FILE=/etc/ncg/ncg.localdb.d/sites.conf
        DB_DIRECTORY=/etc/ncg/ncg-localdb.d
      </File>
    </NCG::SiteSet>

and in the <NCG::ConfigGen> section, set

    INCLUDE_EMPTY_HOSTS=0

In the /etc/ncg/ncg.localdb.d directory, create a file ending in '.conf', for instance sites.conf, with the list of sites to monitor and the list of services to remove. It contains lines like these:

    ADD_SITE!CERN-PROD
    REMOVE_HOST!samdpm001.cern.ch

How to replace the default SE used for replica management tests
There are two possibilities:

In the /etc/gridmon/org.sam.conf configuration file, uncomment this line:

    #wn_se_rep = samdpm002.cern.ch

and set it to your preferred SE. You don't need to rerun NCG after this change.
Create a file in the NCG localdb directory (/etc/ncg/ncg.localdb.d/anyfilename) with the following line in it:

    MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--wn-se-rep!your.srm

and then rerun NCG:

    [root@vtb-generic-30 ncg]# ncg.pl
    [root@vtb-generic-30 ncg]# service nagios reload

Note: after the configuration change, check the output of metric 'org.sam.WN-Rep' to see which SE node was used, e.g.:

    File was copied to SE my-cro-se.srce.hr and registered in LFC prod-lfc-shared-central.cern.ch.

The "Last Check Time" should be later than the configuration change.

How to create/renew the MyProxy proxy for Nagios use?
In order to execute probes that interact with Grid services, Nagios needs a proxy certificate. This proxy certificate is automatically renewed from a MyProxy certificate that the Nagios admins store for this purpose. In order to store a MyProxy certificate, execute the following command on the UI where the certificate is available:

    myproxy-init -l nagios -s <MyProxy Server> -k NagiosRetrieve-<Nagios Server>-<VO> -c 336 -x -Z "<Nagios Server certificate's subject DN>"

For example:

    myproxy-init -l nagios -s myproxy.example.com -k NagiosRetrieve-nagios.example.com-dteam -c 336 -x -Z "/C=XX/O=The Grid/OU=Monitoring Service/CN=nagios.example.com"

How to manually add services on NGI/ROC Nagioses
1. Create an additional config file for the site (e.g. /etc/ncg/ncg.conf.d/sitename.conf) with the following content:

    <NCG::SiteInfo sitename>
      # NCG::SiteInfo content from /etc/ncg/ncg.conf
      <ATP>
        ATP_ROOT_URL=https://grid-monitoring.cern.ch/atp
      </ATP>
      <File>
        DB_FILE=/etc/ncg/ncg.localdb
        DB_DIRECTORY=/etc/ncg/ncg-localdb.d
      </File>
      # additional config line
      <File>
        DB_FILE=/etc/ncg/ncg.sitename
      </File>
    </NCG::SiteInfo>
2. In the local db file /etc/ncg/ncg.sitename add the following content:

    # adding VO specific service
    ADD_HOST_SERVICE_VO!hostname!MPICH!ops
    # adding generic service
    ADD_HOST_SERVICE!hostname!BDII

What is the command executed by Nagios to run the check 'org.sam.SRM-All'?
On the Nagios box run:

    nagios-run-check your_hostname org.sam.SRM-All-/ops/Role=lcgadmin

To see the detailed log run:

    nagios-run-check gridvm02.roma2.infn.it org.sam.SRM-All-/ops/Role=lcgadmin --verbose --dryrun

How to monitor glexec services on ROC/NGI Nagios?
In the Yaim configuration set the following variable:

    NCG_HASH_CONFIG_PROFILES=<role_name>,glexec

where <role_name> is the name of your role. If you want to run glexec metrics with a different VO FQAN (e.g. /ops/Role=pilot), set the following variable:

    NCG_PROFILE_FQAN_glexec=/ops/Role=pilot

How to execute Nagios check from command line interface.
You can run only active checks. Running this command:

    nagios-run-check -v -d -H <hostname> -s org.sam.CE-JobState-ops

shows the command that is executed, e.g.:

    [root@vtb-generic-30 ~]# nagios-run-check -v -d -H ce02.tier2.hep.manchester.ac.uk -s org.sam.CE-JobState-ops
    Executing command: su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "ce02.tier2.hep.manchester.ac.uk" -t 600 --vo ops --mb-destination /queue/grid.probe.metricOutput.EGEE.vtb-generic-30_cern_ch -x /etc/nagios/globus/userproxy.pem-ops --prev-status $LASTSERVICESTATEID$ -m org.sam.CE-JobState --err-topics ce_wms,default'

How to manage multiple proxies on the Nagios server
Add the profile <PROFILE_NAME> in your site-info.def file and rerun yaim:

    NCG_HASH_CONFIG_PROFILES="(...),<PROFILE_NAME>"
    NCG_PROFILE_FQAN_<PROFILE_NAME>=/ops/Role=pilot

MyEGI portal doesn't show any results.
Do you have an MDDB profile defined for your VO?

Nagios doesn't test services from site <SITENAME>
Check whether these services support the VO configured in your Nagios.
I don't see metric results for my service for HEP VO(s)
MyEGI/MyWLCG web services give output: "404: File Not Found"
MyEGI/MyWLCG: The services are missing in NGI Nagios instance but visible on central grid-monitoring instance
Alarms for 'NGI_A' in Operations Portal (https://operations-portal.in2p3.fr/dashboard) are pointing to wrong regional nagios instance.
The Operations Portal filters alarms based on NGI_NAME and NAGIOS_ROLE (see: http://gridops.cern.ch/config/nagios-roles.conf). Most likely another Nagios instance has been misconfigured and has started sending alarms for 'NGI_A'.

Which metrics have been changed between SAM Update-20 and SAM Update-22?
As part of the integration of EMI probes, several profile changes are needed.

Removals
Several metrics need to be removed from profiles, as they have been deprecated by their developers.
Renames/replacements
Some metrics have been renamed by their developers.
SAM Update-22 includes a metric-renaming mechanism to ensure correct functionality during the transition period.
The full list of replacements is the following:
Additions
These are the new metrics added:
Document generated by Confluence on Feb 27, 2014 10:19