SAM Doc : Troubleshooting
This page last changed on Jun 29, 2012 by prodrigu.
Introduction

This page lists troubleshooting information for SAM. The items below are grouped into two categories: Installation and Run-time.

Installation

Nagios web interface shows 0 hosts and services

The web interface shows 0 hosts and services. After a couple of reloads you see your services, but after more reloads they disappear again. The configuration check reports no problems:

    nagios -v /etc/nagios/nagios.cfg
    ...
    Things look okay - No serious problems were detected during the pre-flight check

In the log file you can see entries such as:

    Warning: Check result queue contained results for host 'bdii.srce.hr', but the host could not be found! Perhaps you forgot to define the host in your config files?
    Warning: Check result queue contained results for service 'SSH' on host 'localhost', but the service could not be found! Perhaps you forgot to define the service in your config files?

Solution: you are running two instances of Nagios with two different configurations. Stop all Nagios processes and start the service again with:

    service nagios start

Problem with org.sam.SRM-All and org.sam.CE-JobState

    [1277737799] SERVICE ALERT: ce1-egee.srce.hr;org.sam.CE-JobState-dteam;UNKNOWN;HARD;2;UNKNOWN: CRITICAL: Problem with job submission to CE.
    [1277738429] SERVICE NOTIFICATION: msg-contact;egee2.irb.hr;org.sam.SRM-All-dteam;UNKNOWN;ncg-notify-by-msg;UNKNOWN: Error loading modules : No module named lcg_util

Solution: most likely the nagios service was restarted before glite-UI was configured. Run yaim with glite-UI first, or restart the nagios service.

Problem with org.ggus.Tickets (also described in SAM-569)

    SERVICE ALERT: ce1-egee.srce.hr;org.ggus.Tickets;UNKNOWN;SOFT;2;GGUS UNKNOWN - Could not parse XML from GGUS, Unable to recognise encoding of this document at /usr/lib/perl5/vendor_perl/5.8.8/XML/SAX/PurePerl/EncodingDetect.pm line 96.

Solution: reinstall the package perl-XML-SAX-Expat and restart Nagios:

    yum reinstall perl-XML-SAX-Expat
    service nagios restart

Problem with org.egee.MDDBSync (also described in SAM-613)

    [1277737669] SERVICE NOTIFICATION: nagiosadmin;nagiosdev001.cern.ch;org.egee.MDDBSync;CRITICAL;ncg-notify-by-email;MDDB_SYNC: Error running mddb-synchronizer, see logfile.

Solution: change the ownership of the files in /var/log/mddb/:

    chown -R nagios:nagios /var/log/mddb

Incorrect version of perl-DBD-SQLite during ncg run

    install_driver(SQLite) failed: DBI version 1.57 required--this is only version 1.52 at /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/DBD/SQLite.pm line 5.
    BEGIN failed--compilation aborted at /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/DBD/SQLite.pm line 5.
    Compilation failed in require at (eval 56) line 3.
     at /usr/lib/perl5/vendor_perl/5.8.8/GridMon/ConfigCache.pm line 38

Solution: exclude the perl-DBI package from the slc5-os and slc5-updates repositories (see section 3). If the problem occurs, run the following after excluding perl-DBI from base:

    yum update perl-DBI

Incorrect version of perl-DBD-MySQL

    DBD::mysql::st execute failed: PROCEDURE mrs.getFreshMetrics can't return a result set in the given context at /usr/libexec/grid-monitoring/plugins/nagios/check_missing_probes_mrs line 191

Solution: use the perl-DBD-MySQL package from the RPMforge extras repository. See configuration of Yum repositories.

Incorrect version of perl-SOAP-Lite causes voms-htpasswd failure

    Can't call method "header" on an undefined value at /usr/bin/voms2htpasswd line 144, <CONFIG> line 3.

Solution: use the perl-SOAP-Lite package from the RPMforge repository. See configuration of Yum repositories.

Run-time

Errors: "UNKNOWN: This metric is part of the org.sam.XXX bundle and cannot be executed independently."

This kind of error on hosts, "UNKNOWN: This metric is part of the org.sam.XXX bundle and cannot be executed independently.", indicates one of these two situations:

A WN metric added to my Hash.pm file is in Pending state and with 'Service is not scheduled to be checked' status.

a) Adding a new metric to the Hash.pm file is not enough. How to integrate a new metric in Nagios is documented here: https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics#Integration_of_WN_checks

b) In the Hash.pm file, metric names should not have a trailing '-<vo name>' (see 'org.sam.WN-sft-vo-swdir-alice' in the example below), because Nagios adds or removes the VO name automatically when needed. In this example, 'org.sam.WN-sft-vo-swdir-alice' should be replaced by 'org.sam.WN-sft-vo-swdir':

    $WLCG_NODETYPE->{VO_ALICE}->{'CE'} = [
        'org.sam.CE-JobState',
        'org.sam.CE-JobSubmit',
        'org.sam.WN-SoftVer',
        'org.sam.WN-sft-vo-swdir-alice'
    ];

My Nagios doesn't receive Worker Node metrics.

a) Check the service 'msg-to-handler':
    - service msg-to-handler status
    - https://<NAGIOS_HOST>/nagios/cgi-bin/extinfo.cgi?type=2&host=<NAGIOS_HOST>&service=org.nagios.ProcessMsgToHandler

b) Restart 'msg-to-handler' if no messages are received by 'https://<NAGIOS_HOST>/nagios/html/pnp4nagios/index.php?host=<NAGIOS_HOST>&srv=org.egee.RecvFromQueue'

c) The problem is fixed when the check 'https://<NAGIOS_HOST>/nagios/html/pnp4nagios/index.php?host=<NAGIOS_HOST>&srv=org.egee.RecvFromQueue' receives messages.

d) If you still see the problem, check your WMSes (see the details of test org.sam.CE-JobState on any of the computing elements).

e) Report the problem to the 'Nagios' team.

Note: WN tests use a discovery mechanism to check which messaging broker they should contact. If you want to use a particular host, it must be specified on the command line when the checks are executed.
For this, update your /etc/ncg/ncg.localdb file to include these lines:

    MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--mb-uri!stomp://gridmsgxxx.cern.ch:6163/
    MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--mb-uri!stomp://gridmsgxxx.cern.ch:6163/
    MODIFY_METRIC_PARAMETER!org.sam.glexec.CE-JobState!--mb-uri!stomp://gridmsgxxx.cern.ch:6163/
    MODIFY_METRIC_PARAMETER!org.sam.mpi.CE-JobState!--mb-uri!stomp://gridmsgxxx.cern.ch:6163/

Then rerun 'ncg.pl' and reload Nagios when it has finished:

    [root@vtb-generic-30 ncg]# ncg.pl
    [root@vtb-generic-30 ncg]# service nagios reload

If you now run nagios-run-check for the metric, you can see that Nagios passes the --mb-uri parameter to the probe:

    [root@vtb-generic-30 ncg]# nagios-run-check -v -d -H ce02.tier2.hep.manchester.ac.uk -s org.sam.CE-JobState-ops
    Executing command: su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "ce02.tier2.hep.manchester.ac.uk" -t 600 --vo ops --mb-destination /queue/grid.probe.metricOutput.EGEE.vtb-generic-30_cern_ch -x /etc/nagios/globus/userproxy.pem-ops --mb-uri stomp://gridmsgxxx.cern.ch:6163/ --prev-status $LASTSERVICESTATEID$ -m org.sam.CE-JobState --err-topics ce_wms,default'

JobSubmit works but some CEs' WN checks are in PENDING state

Look into the stdout/stderr of the framework from the WN:

    /var/lib/gridprobes/atlas.Role=lcgadmin/org.sam/CE/<hostname>/jobOutput*/jobOutput_<jobID>/gridjob.out

This may give some hints on what is going on.

My metric results aren't sent to the Central Metric Store

a) Check the details of the check 'org.egee.SendToMsg':
    - 'https://<NAGIOS_HOST>/nagios/cgi-bin/extinfo.cgi?type=2&host=<NAGIOS_HOST>&service=org.egee.SendToMsg'
    - 'https://<NAGIOS_HOST>/nagios/html/pnp4nagios/index.php?host=<NAGIOS_HOST>&srv=org.egee.SendToMsg'

b) Run the check manually from the command line.

c) Check whether there are messages (files) in /var/spool/msg-nagios-bridge/outgoing-messages/ with the command:

    find /var/spool/msg-nagios-bridge/outgoing-messages/ -name text -exec ls -l {} \;

Repeat it every few seconds and compare the reported timestamps to see whether messages are leaving the box for the messaging system (that is, whether the Nagios handler is producing messages).

d) Check whether the messages are being sent:

    find /var/spool/msg-nagios-bridge/outgoing-messages/ -name header -exec ls -l {} \;

Then open any of the header files to see whether it has content. The header contains a topic/queue destination, so the file shows where the messages are sent.

e) Go to the broker the messages are sent to. You can find it in the output of the org.egee.SendToMsg metric:

    https://<hostname>/nagios/cgi-bin/extinfo.cgi?type=2&host=<hostname>&service=org.egee.SendToMsg

If, for instance, you are using gridmsg002 and a topic, go to:

    https://gridmsg002.cern.ch/admin/topics.jsp;jsessionid=52ng60zlxn5w

and search for the corresponding topic to see whether messages are sent but not consumed (in our case we check the queue Consumer.Old_SAM_NAGIOS_HEP_VO.grid.xxxx).

f) You can also take one of the header and text messages and check whether the content is OK.

g) If you don't know where your consumer is running, connect to lxadm and execute:

    wassh root@samnag[004-028] "ps axf | grep msg-consume2db | grep -v grep"

To find out which of those is the one we are interested in, we check all their config files (/etc/msg-consume2db/msg-consume2db-<number>.conf) by running this command:

    wassh root@samnag[004-028] 'egrep -H "^destination" /etc/msg-consume2db/msg-consume2db*.conf'

and we see that

    samnag016: /etc/msg-consume2db/msg-consume2db.conf:destination=/topic/grid.probe.metricOutput.EGEE.vo.*

is what we were looking for, so we go to that box (samnag016) and run:

    cat /etc/msg-consume2db/msg-consume2db.conf

The config file should contain these lines:

    accept_unknown_region=true
    accepted_role_for_unknown=vo

On the same box, we also check the log file of the consumer to see whether there are errors inserting tuples:

    tail -f /var/log/msg-consume2db/msg-consume2db.log

Checks org.nagios.MrsDirSize and org.nagios.MsgDirSize return 'WARNING - /var/spool/nagios2metricstore size: 11484 KB'

a) In /etc/nagios/nagios.cfg, check that the following variable is set to zero:

    obsess_over_services=0

b) Check also whether you have the following variable in section NCG::LocalMetricsAttrs/Active:

    INCLUDE_MSG_CHECKS_SEND = 0

c) Check that the following variable is NOT set to 1 in either NCG::LocalMetricsAttrs/Active or NCG::ConfigGen/Nagios:

    LOCAL_METRIC_STORE

Nagios verification reports error

A manual run of the Nagios verification fails, e.g.:

    # nagios -v /etc/nagios/nagios.cfg
    ...
    One or more problems was encountered while processing the config files...
    Check your configuration file(s) ...

Check whether an ncg.pl process is running. Currently ncg.pl first moves the existing configuration (/etc/nagios/wlcg.d) and then creates a new one. Running the verification while ncg.pl is active fails because the configuration is not completely generated. Once ncg.pl has finished, the verification will work fine.
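
As a rough illustration of the advice above, the following shell sketch postpones a command while the configuration generator is active. The helper names, the process pattern, and the use of pgrep are assumptions of this sketch and are not part of SAM.

```shell
#!/bin/sh
# Sketch only: helper names and the process pattern are assumptions.
# The [n] bracket stops pgrep -f from matching a command line that
# merely quotes this pattern (for example the shell running this code).
NCG_PATTERN='[n]cg\.pl'

# Succeeds when a process whose command line matches NCG_PATTERN exists.
ncg_running() {
    pgrep -f "$NCG_PATTERN" >/dev/null 2>&1
}

# Runs the given command only when the configuration generator is idle.
run_when_ncg_idle() {
    if ncg_running; then
        echo "configuration generator still running; retry later" >&2
        return 1
    fi
    "$@"
}
```

With these helpers loaded, the verification could be started as: run_when_ncg_idle nagios -v /etc/nagios/nagios.cfg. Because pgrep -f matches the whole command line, any other process that mentions the script name (an editor with the file open, for instance) may also be reported as running.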

Test hr.srce.CADist-GetFiles reports warning "Use of uninitialized value in substitution"

The test reports:

    Download Files WARNING - Getting http://repository.egi.eu/sw/production/cas/1/current/meta/ca-policy-egi-core.release failed: 500 Use of uninitialized value in substitution (s///).
    Getting http://repository.egi.eu/sw/production/cas/1/current/meta/ca-policy-egi-core.list failed: 500 Use of uninitialized value in substitution (s///).
    Getting http://repository.egi.eu/sw/production/cas/1/current/meta/ca-policy-egi-core.obsoleted failed: 500 Use of uninitialized value in substitution (s///).

The problem occurs with the perl-libwww-perl package from the Cydia repository (version: 5.837). Set up the priorities according to the documentation and install the package perl-libwww-perl from the OS base or updates repository.

Test org.sam.SRM-All reports warning "Error loading modules : No module named lcg_util"

Starting from gLite-UI 3.2.10-1 there is a known issue with missing environment variables. For details see: SAM-1693. The solution is to restart the nagios service:

    service nagios restart
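
To see whether a Nagios instance shows this error before restarting it, one option is to count the matching entries in the Nagios log. This is a minimal sketch; the function name and the default log path are assumptions, so pass the path your installation actually uses.

```shell
#!/bin/sh
# Sketch only: the function name and the default log path are assumptions.
# Prints the number of log lines that contain the lcg_util import error.
lcg_util_error_count() {
    log_file="${1:-/var/log/nagios/nagios.log}"
    # grep -c prints the count of matching lines (0 when none match).
    grep -c 'No module named lcg_util' "$log_file"
}
```

A non-zero count means the error has been logged, in which case the restart described above is the fix to try.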

Document generated by Confluence on Feb 27, 2014 10:19