SAM Doc : Troubleshooting

This page last changed on Jun 29, 2012 by prodrigu.

Introduction

This page lists troubleshooting information for SAM.

The troubleshooting items are grouped into two categories:

Installation

Nagios web interface shows 0 hosts and services
Problem with org.sam.SRM-All and org.sam.CE-JobState
Problem with org.ggus.Tickets
Problem with org.egee.MDDBSync
Incorrect version of perl-DBD-SQLite during ncg run
Incorrect version of perl-DBD-MySQL
Incorrect version of perl-SOAP-Lite causes voms-htpasswd failure

Run-time

Errors: "UNKNOWN: This metric is part of the org.sam.XXX bundle and cannot be executed independently."
A WN metric added to my Hash.pm file is in Pending state and with 'Service is not scheduled to be checked' status.
My Nagios doesn't receive Worker Node metrics.
JobSubmit works but some CEs' WN checks are in PENDING state
My metric results aren't sent to the Central Metric Store
Checks org.nagios.MrsDirSize and org.nagios.MsgDirSize return 'WARNING - /var/spool/nagios2metricstore size: 11484 KB'
Nagios verification reports error
Test hr.srce.CADist-GetFiles reports warning "Use of uninitialized value in substitution"
Test org.sam.SRM-All reports warning "Error loading modules : No module named lcg_util"

Installation

Nagios web interface shows 0 hosts and services

The web interface shows 0 hosts and services. After a couple of reloads the services appear, but after further reloads they disappear again.
Validating the Nagios configuration reports no problems:

nagios -v /etc/nagios/nagios.cfg
 ...
 Things look okay - No serious problems were detected during the pre-flight check

In the log file you can see entries such as:

Warning: Check result queue contained results for host 'bdii.srce.hr', but the host could not be found!  Perhaps you forgot to define the host in your config files?
 Warning: Check result queue contained results for service 'SSH' on host 'localhost', but the service could not be found!  Perhaps you forgot to define the service in your config files?

Solution: two instances of Nagios are running with two different configurations. Stop all Nagios processes and start the service again with:

service nagios start
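
To confirm the diagnosis before restarting, the Nagios daemons in the process table can be counted. The helper below is a sketch, not part of SAM: count_nagios_daemons is a hypothetical name, and it assumes that each daemon was started with -d and has been re-parented to PID 1, while forked check workers share the command line but keep the daemon as their parent.

```shell
# Hypothetical helper: count Nagios daemons in `ps -eo pid,ppid,args` output (stdin).
# Assumption: a daemon runs with -d and has PPID 1.
count_nagios_daemons() {
    awk '$2 == 1 && ($3 == "nagios" || $3 ~ /\/nagios$/) {
             for (i = 4; i <= NF; i++) if ($i == "-d") { n++; break }
         }
         END { print n + 0 }'
}

# On the Nagios host (not executed here):
#   ps -eo pid,ppid,args | count_nagios_daemons
```

A result above 1 would suggest a stray instance, in which case all Nagios processes should be stopped before the service is started again.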

Problem with org.sam.SRM-All and org.sam.CE-JobState

[1277737799] SERVICE ALERT: ce1-egee.srce.hr;org.sam.CE-JobState-dteam;UNKNOWN;HARD;2;UNKNOWN: CRITICAL: Problem with job submission to CE.
[1277738429] SERVICE NOTIFICATION: msg-contact;egee2.irb.hr;org.sam.SRM-All-dteam;UNKNOWN;ncg-notify-by-msg;UNKNOWN: Error loading modules : No module named lcg_util

Solution: most likely the nagios service was restarted before glite-UI was configured. Run yaim with glite-UI first, or restart the nagios service.

Problem with org.ggus.Tickets

(also described in SAM-569)

SERVICE ALERT: ce1-egee.srce.hr;org.ggus.Tickets;UNKNOWN;SOFT;2;GGUS UNKNOWN - Could not parse XML from GGUS, Unable to recognise encoding of this document at /usr/lib/perl5/vendor_perl/5.8.8/XML/SAX/PurePerl/EncodingDetect.pm line 96.

Solution: reinstall the perl-XML-SAX-Expat package and restart Nagios:

yum reinstall perl-XML-SAX-Expat
service nagios restart

Problem with org.egee.MDDBSync

(also described in SAM-613)

[1277737669] SERVICE NOTIFICATION: nagiosadmin;nagiosdev001.cern.ch;org.egee.MDDBSync;CRITICAL;ncg-notify-by-email;MDDB_SYNC: Error running mddb-synchronizer, see logfile.

Solution: change the owner of the files in /var/log/mddb/ to nagios:

chown -R nagios:nagios /var/log/mddb
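
To see which files are affected before changing ownership, find can list everything under the log directory that does not belong to the nagios user. A minimal sketch; the function name not_owned_by is made up for this example.

```shell
# Hypothetical helper: list entries under DIR ($2) that are not owned by USER ($1).
# USER may be a login name or a numeric UID.
not_owned_by() {
    find "$2" ! -user "$1"
}

# On the Nagios host (not executed here):
#   not_owned_by nagios /var/log/mddb
```

Empty output means the ownership is already correct.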

Incorrect version of perl-DBD-SQLite during ncg run

install_driver(SQLite) failed: DBI version 1.57 required--this is only version 1.52 at /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/DBD/SQLite.pm line 5.
BEGIN failed--compilation aborted at /usr/lib64/perl5/vendor_perl/5.8.8/x86_64-linux-thread-multi/DBD/SQLite.pm line 5.
Compilation failed in require at (eval 56) line 3.

 at /usr/lib/perl5/vendor_perl/5.8.8/GridMon/ConfigCache.pm line 38

Solved by excluding the perl-DBI package from the slc5-os and slc5-updates repositories (see section 3). If the problem occurs, run the following after excluding perl-DBI from the base repositories:

yum update perl-DBI
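
To check whether the update took effect, the installed DBI version can be compared with the one DBD::SQLite asks for (1.57 in the error above). The sketch below assumes that DBI versions are plain decimal numbers, which is how Perl compares them; version_at_least is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if decimal version HAVE ($1) is >= NEED ($2).
version_at_least() {
    awk -v have="$1" -v need="$2" 'BEGIN { exit !(have + 0 >= need + 0) }'
}

# On the Nagios host (not executed here):
#   have=$(perl -MDBI -e 'print $DBI::VERSION')
#   version_at_least "$have" 1.57 || echo "DBI $have is too old"
```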

Incorrect version of perl-DBD-MySQL

DBD::mysql::st execute failed: PROCEDURE mrs.getFreshMetrics can't return a result set in the given context
 at /usr/libexec/grid-monitoring/plugins/nagios/check_missing_probes_mrs line 191

Solved by using the perl-DBD-MySQL package from RPMforge extras repository. See configuration of Yum repositories.

Incorrect version of perl-SOAP-Lite causes voms-htpasswd failure

Can't call method "header" on an undefined value at /usr/bin/voms2htpasswd line 144, <CONFIG> line 3.

Solved by using the perl-SOAP-Lite package from RPMforge repository. See configuration of Yum repositories.

Run-time

Errors: "UNKNOWN: This metric is part of the org.sam.XXX bundle and cannot be executed independently."

This kind of error on hosts:

"UNKNOWN: This metric is part of the org.sam.XXX bundle and cannot be executed independently."

indicates one of these two situations:

Active checks were enabled on metrics which NCG configured as passive. You can recognize this situation if the service doesn't have the PASV icon on the "Service Status Details" page, or if it shows "Active Checks: ENABLED" on the "Service Information" page. The solution is to go to the "Service Information" page and click "Disable active checks of this service".
A forced check execution was invoked on a passive check. The solution is to force-schedule the parent check or to wait for the next regular execution of the parent check.

Warning
Never use the web interface to enable active checks of checks which NCG configured as passive.
Never use the web interface to force execution of checks which NCG configured as passive.
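
The two solutions above (disabling active checks, forcing the parent check) can also be submitted through the Nagios external command file. DISABLE_SVC_CHECK and SCHEDULE_FORCED_SVC_CHECK are standard Nagios external commands. The function names and the host and service names in the usage comment are assumptions for illustration, and which service is the parent check depends on your configuration.

```shell
# Format a Nagios external command that disables active checks of a service.
# Arguments: HOST SERVICE [TIMESTAMP]
disable_active_check_cmd() {
    printf '[%s] DISABLE_SVC_CHECK;%s;%s\n' "${3:-$(date +%s)}" "$1" "$2"
}

# Format a forced, immediate check of the parent (active) service.
# Arguments: HOST SERVICE [TIMESTAMP]
force_parent_check_cmd() {
    ts="${3:-$(date +%s)}"
    printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' "$ts" "$1" "$2" "$ts"
}

# On the Nagios host (not executed here). $command_file stands for the value
# of command_file in nagios.cfg; host and service names are placeholders:
#   disable_active_check_cmd ce.example.org some-passive-service >> "$command_file"
```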

A WN metric added to my Hash.pm file is in Pending state and with 'Service is not scheduled to be checked' status.

a) Adding a new metric to the Hash.pm file is not enough. How to integrate a new metric in Nagios is documented here: https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics#Integration_of_WN_checks

b) On the other hand, in the Hash.pm file the metric names should not have a trailing '-<vo name>' (see 'org.sam.WN-sft-vo-swdir-alice' in the example below) because the VO name will be added/removed when needed automatically by Nagios. In this example, 'org.sam.WN-sft-vo-swdir-alice' should be replaced by 'org.sam.WN-sft-vo-swdir'

$WLCG_NODETYPE->{VO_ALICE}->{'CE'} = [
'org.sam.CE-JobState',
'org.sam.CE-JobSubmit',
'org.sam.WN-SoftVer',
'org.sam.WN-sft-vo-swdir-alice'
];
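
A quick way to spot such entries is to search Hash.pm for quoted metric names that end in the VO name. This is a sketch only: find_vo_suffixed_metrics is a hypothetical helper, and the location of Hash.pm is left to the reader.

```shell
# Hypothetical helper: print the numbered lines of a Hash.pm-style file (stdin)
# that contain a single-quoted name ending in "-<vo>".
find_vo_suffixed_metrics() {
    grep -nF -- "-$1'"
}

# Usage (not executed here):
#   find_vo_suffixed_metrics alice < /path/to/Hash.pm
```

Every line it prints is a candidate for removing the trailing VO name.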

My Nagios doesn't receive Worker Node metrics.

a) Check service 'msg-to-handler':

- service msg-to-handler status

- https://<NAGIOS_HOST>/nagios/cgi-bin/extinfo.cgi?type=2&host=<NAGIOS_HOST>&service=org.nagios.ProcessMsgToHandler

b) Restart 'msg-to-handler' if no messages are received by 'https://<NAGIOS_HOST>/nagios/html/pnp4nagios/index.php?host=<NAGIOS_HOST>&srv=org.egee.RecvFromQueue'

c) Problem is fixed when check 'https://<NAGIOS_HOST>/nagios/html/pnp4nagios/index.php?host=<NAGIOS_HOST>&srv=org.egee.RecvFromQueue' receives messages.

d) Check your WMSes if you still see the problem (See details of test org.sam.CE-JobState of any of the computing elements).

e) Report problem to 'Nagios' team.

Note: WN tests use a discovery mechanism to find out which messaging broker they should contact. To use a particular broker host instead, it must be passed on the command line when the checks are executed. To do this, add these lines to your /etc/ncg/ncg.localdb file:

MODIFY_METRIC_PARAMETER!org.sam.CREAMCE-JobState!--mb-uri!stomp://gridmsgxxx.cern.ch:6163/
MODIFY_METRIC_PARAMETER!org.sam.CE-JobState!--mb-uri!stomp://gridmsgxxx.cern.ch:6163/
MODIFY_METRIC_PARAMETER!org.sam.glexec.CE-JobState!--mb-uri!stomp://gridmsgxxx.cern.ch:6163/
MODIFY_METRIC_PARAMETER!org.sam.mpi.CE-JobState!--mb-uri!stomp://gridmsgxxx.cern.ch:6163/
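
The four lines differ only in the metric name, so they can be generated for a given broker URI. A sketch: mb_uri_overrides is a hypothetical helper, and its metric list is simply the one shown above.

```shell
# Hypothetical helper: print the ncg.localdb overrides for broker URI $1.
mb_uri_overrides() {
    for metric in org.sam.CREAMCE-JobState org.sam.CE-JobState \
                  org.sam.glexec.CE-JobState org.sam.mpi.CE-JobState; do
        printf 'MODIFY_METRIC_PARAMETER!%s!--mb-uri!%s\n' "$metric" "$1"
    done
}

# Usage (not executed here):
#   mb_uri_overrides stomp://gridmsgxxx.cern.ch:6163/ >> /etc/ncg/ncg.localdb
```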

Then rerun 'ncg.pl' and reload Nagios when it has finished:

[root@vtb-generic-30 ncg]# ncg.pl
[root@vtb-generic-30 ncg]# service nagios reload

If you now run

nagios-run-check -v -d -H ce02.tier2.hep.manchester.ac.uk -s org.sam.CE-JobState-ops

Nagios will pass the --mb-uri parameter to the probe:

[root@vtb-generic-30 ncg]# nagios-run-check -v -d -H ce02.tier2.hep.manchester.ac.uk -s org.sam.CE-JobState-ops
Executing command:
su nagios -l -c '/usr/libexec/grid-monitoring/probes/org.sam/CE-probe -H "ce02.tier2.hep.manchester.ac.uk" -t 600 --vo ops --mb-destination /queue/grid.probe.metricOutput.EGEE.vtb-generic-30_cern_ch -x /etc/nagios/globus/userproxy.pem-ops --mb-uri stomp://gridmsgxxx.cern.ch:6163/ --prev-status $LASTSERVICESTATEID$ -m org.sam.CE-JobState --err-topics ce_wms,default'
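
Rather than reading the long command line by eye, the broker URI can be pulled out of the nagios-run-check output. A small sketch; extract_mb_uri is a hypothetical helper and assumes the URI follows --mb-uri after a single space, as in the output above.

```shell
# Hypothetical helper: print the value that follows --mb-uri on any input line.
extract_mb_uri() {
    sed -n 's/.*--mb-uri \([^ ]*\).*/\1/p'
}

# Usage (not executed here):
#   nagios-run-check -v -d -H <host> -s org.sam.CE-JobState-ops | extract_mb_uri
```

No output means the parameter was not passed to the probe.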

JobSubmit works but some CEs' WN checks are in PENDING state

Look into stdout/stderr of the framework from WN:

/var/lib/gridprobes/atlas.Role=lcgadmin/org.sam/CE/<hostname>/jobOutput*/jobOutput_<jobID>/gridjob.out

This may give some hints on what is going on.
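
When many jobs have run, it helps to list only the recently written output files. A sketch using find; recent_job_outputs is a hypothetical helper, and -mmin is a GNU/BSD find extension.

```shell
# Hypothetical helper: list gridjob.out files under BASE ($1) that were
# modified within the last MINUTES ($2) minutes.
recent_job_outputs() {
    find "$1" -type f -name gridjob.out -mmin "-$2"
}

# Usage (not executed here):
#   recent_job_outputs /var/lib/gridprobes 120
```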

My metric results aren't sent to the Central Metric Store

a) Check details of check 'org.egee.SendToMsg':

- 'https://<NAGIOS_HOST>/nagios/cgi-bin/extinfo.cgi?type=2&host=<NAGIOS_HOST>&service=org.egee.SendToMsg'

- 'https://<NAGIOS_HOST>/nagios/html/pnp4nagios/index.php?host=<NAGIOS_HOST>&srv=org.egee.SendToMsg'

b) Run the check manually from the command line.

c) Check whether there are messages (files) in /var/spool/msg-nagios-bridge/outgoing-messages/ with the command:

find /var/spool/msg-nagios-bridge/outgoing-messages/ -name text -exec ls -l {} \;

Repeat it every few seconds and compare the reported timestamps to see whether the box is sending outgoing messages to the messaging system (that is, whether the Nagios handler is producing messages).

d) Check if the messages are being sent:

find /var/spool/msg-nagios-bridge/outgoing-messages/ -name header -exec ls -l {} \;

Then open any of the header files to see whether it has content. The header contains a topic/queue destination, so the file shows where the messages are sent.
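
To get an overview instead of opening header files one by one, the destinations can be collected from all of them. This sketch makes two assumptions that the text above does not confirm: that each header is a regular file, and that the destination appears in it verbatim as /topic/... or /queue/.... list_destinations is a hypothetical helper.

```shell
# Hypothetical helper: print the distinct /topic/... and /queue/... strings
# found in the header files under DIR ($1).
list_destinations() {
    find "$1" -type f -name header \
        -exec grep -hoE '/(topic|queue)/[^[:space:]]+' {} + | sort -u
}

# Usage (not executed here):
#   list_destinations /var/spool/msg-nagios-bridge/outgoing-messages/
```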

e) Go to the broker to which the messages are sent. You can find it in the output of the org.egee.SendToMsg metric:

https://<hostname>/nagios/cgi-bin/extinfo.cgi?type=2&host=<hostname>&service=org.egee.SendToMsg

If for instance you are using the gridmsg002 and a topic, go to:

https://gridmsg002.cern.ch/admin/topics.jsp;jsessionid=52ng60zlxn5w

and grep for the corresponding topic used to see if there are messages sent but not being consumed (in our case we check the queue: Consumer.Old_SAM_NAGIOS_HEP_VO.grid.xxxx)

f) On the other hand, you can take one of the header & text messages and check if the content is OK.

g) If you don't know where your consumer is running, connect to lxadm and execute:

wassh root@samnag[004-028] "ps axf | grep msg-consume2db | grep -v grep"

To find out which of those is the one we are interested in, we check all their config files:

/etc/msg-consume2db/msg-consume2db-<number>.conf

by running this command:

wassh root@samnag[004-028] 'egrep -H "^destination" /etc/msg-consume2db/msg-consume2db*.conf'

and we see that samnag016:

/etc/msg-consume2db/msg-consume2db.conf:destination=/topic/grid.probe.metricOutput.EGEE.vo.*

is what we were looking for, so we go into that box (samnag016) and we do:

cat /etc/msg-consume2db/msg-consume2db.conf

the config file should contain these lines:

accept_unknown_region=true
accepted_role_for_unknown=vo
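
The presence of both settings can also be checked without reading the whole file. A sketch; consumer_conf_ok is a hypothetical helper and assumes the lines appear exactly as written above, with no extra whitespace.

```shell
# Hypothetical helper: succeed only if FILE ($1) contains both settings
# as complete lines.
consumer_conf_ok() {
    grep -qx 'accept_unknown_region=true' "$1" &&
    grep -qx 'accepted_role_for_unknown=vo' "$1"
}

# Usage (not executed here):
#   consumer_conf_ok /etc/msg-consume2db/msg-consume2db.conf || echo "settings missing"
```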

On the same box, we also check the log file of the consumer to see if there are errors inserting tuples:

tail -f /var/log/msg-consume2db/msg-consume2db.log

Checks org.nagios.MrsDirSize and org.nagios.MsgDirSize return 'WARNING - /var/spool/nagios2metricstore size: 11484 KB'

a) In /etc/nagios/nagios.cfg check that the following variable is set to zero:

obsess_over_services=0

b) Check also if you have the following variable in section NCG::LocalMetricsAttrs/Active:

INCLUDE_MSG_CHECKS_SEND = 0

c) Check that in either NCG::LocalMetricsAttrs/Active or NCG::ConfigGen/Nagios, the following variable is NOT set to 1:

LOCAL_METRIC_STORE
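
The settings named in a) and b) can be read out with a small helper instead of searching the files by hand. A sketch under the assumption that the setting sits on one line of the form key=value, optionally with spaces around the equals sign; cfg_value is a hypothetical name, and the helper does not understand the section structure of the NCG configuration.

```shell
# Hypothetical helper: print the value(s) assigned to KEY ($1) in FILE ($2).
# Commented-out lines are ignored because the key must start the line.
cfg_value() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2"
}

# Usage (not executed here):
#   [ "$(cfg_value obsess_over_services /etc/nagios/nagios.cfg)" = 0 ] ||
#       echo "obsess_over_services is not 0"
```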

Nagios verification reports error

Manual run of Nagios verification fails, e.g.:

# nagios -v /etc/nagios/nagios.cfg
...
One or more problems was encountered while processing the config files... Check your configuration file(s)
...

Check whether an ncg.pl process is running. Currently ncg.pl first moves the existing configuration (/etc/nagios/wlcg.d) away and then creates a new one. Running the verification while ncg.pl is active fails because the configuration is not yet completely generated. Once ncg.pl has finished, the verification works again.
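
The check for a running ncg.pl can be scripted so that the verification only runs once the configuration is complete. A sketch; ncg_in_progress is a hypothetical helper that inspects process command lines supplied on stdin.

```shell
# Hypothetical helper: succeed if `ps -eo args` output (stdin) contains a
# command line that references ncg.pl as a whole word.
ncg_in_progress() {
    grep -Eq '(^|[ /])ncg\.pl( |$)'
}

# Usage (not executed here):
#   if ps -eo args | ncg_in_progress; then
#       echo "ncg.pl is still running, retry later"
#   else
#       nagios -v /etc/nagios/nagios.cfg
#   fi
```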

Test hr.srce.CADist-GetFiles reports warning "Use of uninitialized value in substitution"

Test reports:

Download Files WARNING - Getting http://repository.egi.eu/sw/production/cas/1/current/meta/ca-policy-egi-core.release failed: 500 Use of uninitialized value in substitution (s///). Getting http://repository.egi.eu/sw/production/cas/1/current/meta/ca-policy-egi-core.list failed: 500 Use of uninitialized value in substitution (s///). Getting http://repository.egi.eu/sw/production/cas/1/current/meta/ca-policy-egi-core.obsoleted failed: 500 Use of uninitialized value in substitution (s///).

The problem occurs with the perl-libwww-perl package from the Cydia repository (version 5.837). Set up the priorities according to the documentation and install the perl-libwww-perl package from the OS base or updates repository.

Test org.sam.SRM-All reports warning "Error loading modules : No module named lcg_util"

Starting from gLite-UI 3.2.10-1 there is a known issue with missing environment variables. For details see SAM-1693. The solution is to restart the nagios service:

service nagios restart

Document generated by Confluence on Feb 27, 2014 10:19