SAM Doc : Troubleshooting
This page last changed on Jun 29, 2012 by prodrigu.
This page lists troubleshooting information for SAM.
The following troubleshooting items are already available under two categories:
Web interface shows 0 hosts and services. After a couple of reloads you see your services, but after more reloads they disappear again.
In the log file you can see entries such as:
Solution: you are running two instances of nagios with two different configurations. Stop all nagios processes and start the service again with:
Solution: most likely the nagios service was restarted before glite-UI was configured. Run yaim with glite-UI first, or restart the nagios service.
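To confirm this diagnosis before restarting, it can help to list the running nagios processes together with the configuration file each one was started with. The following is a minimal Python sketch of that check, not the restart command itself. It assumes that the daemon binary name ends in "nagios" and that the configuration file is passed as an argument ending in ".cfg"; the sample process list and its paths are hypothetical.

```python
import subprocess
from collections import defaultdict

# Hypothetical `ps -eo pid,args` output, used only for illustration.
SAMPLE_PS = [
    "  PID COMMAND",
    " 1234 /usr/sbin/nagios -d /etc/nagios/nagios.cfg",
    " 5678 /usr/sbin/nagios -d /etc/nagios-old/nagios.cfg",
    " 9012 /usr/sbin/httpd",
]


def nagios_configs(ps_lines):
    """Map each nagios configuration file to the PIDs started with it.

    ps_lines are lines of the form "<pid> <command line>".
    """
    configs = defaultdict(list)
    for line in ps_lines:
        parts = line.split()
        if len(parts) < 2 or not parts[0].isdigit():
            continue  # skip the header line and blank lines
        pid, argv = int(parts[0]), parts[1:]
        if not argv[0].endswith("nagios"):
            continue  # not a nagios daemon
        # Assumption: the configuration file is the last argument ending in .cfg
        cfgs = [arg for arg in argv[1:] if arg.endswith(".cfg")]
        if cfgs:
            configs[cfgs[-1]].append(pid)
    return dict(configs)


def running_nagios_configs():
    """Run ps and apply nagios_configs() to the live process list."""
    out = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    ).stdout
    return nagios_configs(out.splitlines())
```

If `running_nagios_configs()` returns more than one key, two instances with different configurations are probably running, which matches the symptom described above.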
(also described in SAM-569)
Solution: reinstall the perl-XML-SAX-Expat package.
(also described in SAM-613)
Solution is to chown files in /var/log/mddb/
Solved by excluding the perl-DBI package from the slc5-os and slc5-updates repositories (see section 3). If the problem occurs, first exclude perl-DBI from base, then run:
Solved by using the perl-DBD-MySQL package from RPMforge extras repository. See configuration of Yum repositories.
Solved by using the perl-SOAP-Lite package from RPMforge repository. See configuration of Yum repositories.
Errors: "UNKNOWN: This metric is part of the org.sam.XXX bundle and cannot be executed independently."
This kind of error on hosts:
indicates one of these two situations:
A WN metric added to my Hash.pm file is in Pending state and with 'Service is not scheduled to be checked' status.
a) Adding a new metric to the Hash.pm file is not enough. How to integrate a new metric in Nagios is documented here: https://twiki.cern.ch/twiki/bin/view/LCG/SAMProbesMetrics#Integration_of_WN_checks
b) On the other hand, in the Hash.pm file the metric names should not have a trailing '-<vo name>' (see 'org.sam.WN-sft-vo-swdir-alice' in the example below) because the VO name will be added/removed when needed automatically by Nagios. In this example, 'org.sam.WN-sft-vo-swdir-alice' should be replaced by 'org.sam.WN-sft-vo-swdir'
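The naming rule in b) can be checked mechanically. The sketch below is a hypothetical helper, not part of SAM or of Hash.pm: it removes a trailing '-<vo name>' from a metric name. The list of VO names is an assumption made for the illustration and should be replaced by the VOs configured on your instance.

```python
# Assumption: example VO names; replace with the VOs your instance supports.
KNOWN_VOS = ["alice", "atlas", "cms", "lhcb"]


def strip_vo_suffix(metric, vo_names=KNOWN_VOS):
    """Return the metric name without a trailing '-<vo name>', if present."""
    for vo in vo_names:
        suffix = "-" + vo
        if metric.endswith(suffix):
            return metric[: -len(suffix)]
    return metric  # no VO suffix found, name is already in the expected form


def has_vo_suffix(metric, vo_names=KNOWN_VOS):
    """True if the metric name would need fixing before going into Hash.pm."""
    return strip_vo_suffix(metric, vo_names) != metric
```

For example, `strip_vo_suffix("org.sam.WN-sft-vo-swdir-alice")` yields the name without the VO suffix, while a name that already lacks the suffix is returned unchanged.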
a) Check service 'msg-to-handler':
- service msg-to-handler status
b) Restart 'msg-to-handler' if no messages are received by 'https://<NAGIOS_HOST>/nagios/html/pnp4nagios/index.php?host=<NAGIOS_HOST>&srv=org.egee.RecvFromQueue'
c) Problem is fixed when check 'https://<NAGIOS_HOST>/nagios/html/pnp4nagios/index.php?host=<NAGIOS_HOST>&srv=org.egee.RecvFromQueue' receives messages.
d) If you still see the problem, check your WMSes (see the details of test org.sam.CE-JobState for any of the computing elements).
e) Report the problem to the 'Nagios' team.
Note: WN tests use a discovery mechanism to check which messaging broker they should contact. If you want to specify a particular host, this must be specified in the command line during the execution of the checks. For this, you should update your /etc/ncg/ncg.localdb file including these lines:
Then rerun 'ncg.pl' and restart your Nagios when finished:
You can see now that if you run
Nagios will pass the --mb-uri parameter to the probe:
Look into stdout/stderr of the framework from WN:
This may give some hints on what is going on.
a) Check details of check 'org.egee.SendToMsg':
b) Run the check manually from the command line.
c) Check whether there are messages (files) in /var/spool/msg-nagios-bridge/outgoing-messages/ and, with the command:
check the reported timestamps every few seconds to see whether messages are leaving the box for the messaging system (that is, whether the Nagios handler is producing messages)
d) Check if the messages are being sent:
then open any of the header files to check that it is not empty. The header contains the topic/queue destination, so the file shows where the messages are sent.
e) Go to the broker where the messages are sent. You can find it by checking the output of the org.egee.SendToMsg metric:
If for instance you are using the gridmsg002 and a topic, go to:
and grep for the corresponding topic used to see if there are messages sent but not being consumed (in our case we check the queue: Consumer.Old_SAM_NAGIOS_HEP_VO.grid.xxxx)
f) On the other hand, you can take one of the header & text messages and check if the content is OK.
g) If you don't know where your consumer is running, connect to lxadm and execute:
To find out which of those is the one we're interested in, we check all their config files:
running this command:
and we see that samnag016:
is what we were looking for, so we go into that box (samnag016) and we do:
the config file should contain these lines:
On the same box, we also check the log file of the consumer to see if there are errors inserting tuples:
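For step c) above, the timestamp check on the outgoing spool directory can also be scripted. The following is a minimal Python sketch under stated assumptions: it is not the command from the original page, the function name is hypothetical, and it only reports the age of the most recently modified file under the directory named in step c).

```python
import os
import time

# Directory named in step c).
SPOOL_DIR = "/var/spool/msg-nagios-bridge/outgoing-messages/"


def newest_message_age(directory, now=None):
    """Return the age in seconds of the newest file under directory.

    Returns None when the directory tree contains no files.
    """
    now = time.time() if now is None else now
    newest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            try:
                mtime = os.stat(os.path.join(root, name)).st_mtime
            except FileNotFoundError:
                continue  # file was removed while we were scanning
            if newest is None or mtime > newest:
                newest = mtime
    return None if newest is None else now - newest
```

Calling `newest_message_age(SPOOL_DIR)` every few seconds shows whether new files keep appearing. An age that only grows suggests that the Nagios handler is not producing new messages; files that stay in place for a long time suggest that they are not being sent.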
Checks org.nagios.MrsDirSize and org.nagios.MsgDirSize return 'WARNING - /var/spool/nagios2metricstore size: 11484 KB'
a) In /etc/nagios/nagios.cfg check that the following variable is set to zero:
b) Check also if you have the following variable in section NCG::LocalMetricsAttrs/Active:
c) Check that in either NCG::LocalMetricsAttrs/Active or NCG::ConfigGen/Nagios, the following variable is NOT set to 1:
Manual run of Nagios verification fails, e.g.:
Check whether an ncg.pl process is running. Currently ncg.pl first moves the existing configuration (/etc/nagios/wlcg.d) and then creates a new one. Running the verification while ncg.pl is active fails because the configuration is not completely generated. Once ncg.pl has finished, the verification works.
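The check for a running ncg.pl can be automated before the verification is started. The sketch below is a hedged illustration in Python: the function names are hypothetical, and it only looks for "ncg.pl" as a separate argument on the command line of running processes, so an editor opened on a file called ncg.pl would also match.

```python
import subprocess


def ncg_running(command_lines):
    """True if any command line mentions ncg.pl as a separate argument."""
    for line in command_lines:
        for token in line.split():
            if token == "ncg.pl" or token.endswith("/ncg.pl"):
                return True
    return False


def current_command_lines():
    """Return the command lines of all running processes, as printed by ps."""
    out = subprocess.run(
        ["ps", "-eo", "args"], capture_output=True, text=True, check=True
    ).stdout
    return out.splitlines()[1:]  # drop the header line


def safe_to_verify():
    """True when no ncg.pl process is regenerating the configuration."""
    return not ncg_running(current_command_lines())
```

A wrapper script could poll `safe_to_verify()` and start the Nagios verification only once it returns True.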
Problem occurs with the perl-libwww-perl package from the RPMforge repository (version: 5.837). Set up the repository priorities according to the documentation and install the perl-libwww-perl package from the OS base or updates repository.
Starting from gLite-UI 3.2.10-1 there is a known issue with missing environment variables. For details see SAM-1693. The solution is to restart the nagios service:
Document generated by Confluence on Feb 27, 2014 10:19