9. Troubleshooting

9.1. Worker not running

An example troubleshooting workflow typically includes the following steps:

  1. Watch dashboard
  2. Examine process status
  3. Show logs
  4. Inspect queue sizes

Below an example problem scenario is presented.

9.1.1. Show dashboard

Note

The monitoring components of XFM are not yet available for version 3.

_images/symon.png _images/dashboard.png

9.1.2. Examine process status

$ xfmctl --roles=singlenode supervisor
[127.0.0.1] Executing task 'supervisor'
[127.0.0.1] sudo: /usr/bin/supervisorctl -c /etc/xfm/supervisord/supervisord.conf status
[127.0.0.1] out: xfm-generator:xfm-generator-0       STOPPED   Not started
[127.0.0.1] out: xfm-ingester:xfm-ingester-0         RUNNING   pid 3040, uptime 1:50:41
[127.0.0.1] out: xfm-lake-runner:xfm-lake-runner-0   STOPPED   Not started
[127.0.0.1] out: xfm-lakeserver:xfm-lakeserver-0     RUNNING   pid 2386, uptime 2:06:24
[127.0.0.1] out: xfm-loader:xfm-loader-0             RUNNING   pid 5437, uptime 0:38:12
[127.0.0.1] out: xfm-pondserver:xfm-pondserver-0     RUNNING   pid 2377, uptime 2:06:24
[127.0.0.1] out: xfm-transformer:xfm-transformer-0   BACKOFF   Exited too quickly (process log may have details)

The transformer has trouble starting. Process definitions are in /etc/xfm/supervisord/conf.d. For example the definition for the transformer is /etc/xfm/supervisord/conf.d/xfm-transformer.conf:

[program:xfm-transformer]
directory=/
autostart=true
autorestart=true
command=/usr/bin/python /opt/mgrid/messaging3/integration/rabbitmq/transformer.py /etc/xfm/xfm.json
process_name=%(program_name)s-%(process_num)01d
numprocs=1
stopasgroup=true
startretries=10
redirect_stderr=true
stdout_logfile=/var/log/xfm/%(program_name)s-%(process_num)01d.log
user=xfmuser
environment=

It shows the log location: /var/log/xfm.

9.1.3. Show logs

XFM logs are in /var/log/xfm, but the supervisor command of xfmctl can also help:

$ xfmctl --roles=singlenode supervisor:"tail xfm-transformer:xfm-transformer-0"
[127.0.0.1] Executing task 'supervisor'
[127.0.0.1] sudo: /usr/bin/supervisorctl -c /etc/xfm/supervisord/supervisord.conf tail xfm-transformer:xfm-transformer-0
[127.0.0.1] out: 168.122.43 with userid xfm and vhost /messaging
[127.0.0.1] out: Traceback (most recent call last):
[127.0.0.1] out:   File "/opt/mgrid/messaging3/integration/rabbitmq/transformer.py", line 125, in <module>
[127.0.0.1] out:     confirm_publish=True)
[127.0.0.1] out:   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
[127.0.0.1] out:     self.transport = self.Transport(host, connect_timeout, ssl)
[127.0.0.1] out:   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
[127.0.0.1] out:     return create_transport(host, connect_timeout, ssl)
[127.0.0.1] out:   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 299, in create_transport
[127.0.0.1] out:     return TCPTransport(host, connect_timeout)
[127.0.0.1] out:   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 95, in __init__
[127.0.0.1] out:     raise socket.error(last_err)
[127.0.0.1] out: socket.error: [Errno 111] Connection refused
[127.0.0.1] out: Connecting to broker on 192.168.122.43 with userid xfm and vhost /messaging
[127.0.0.1] out: Traceback (most recent call last):
[127.0.0.1] out:   File "/opt/mgrid/messaging3/integration/rabbitmq/transformer.py", line 125, in <module>
[127.0.0.1] out:     confirm_publish=True)
[127.0.0.1] out:   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 165, in __init__
[127.0.0.1] out:     self.transport = self.Transport(host, connect_timeout, ssl)
[127.0.0.1] out:   File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 186, in Transport
[127.0.0.1] out:     return create_transport(host, connect_timeout, ssl)
[127.0.0.1] out:   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 299, in create_transport
[127.0.0.1] out:     return TCPTransport(host, connect_timeout)
[127.0.0.1] out:   File "/usr/lib/python2.7/site-packages/amqp/transport.py", line 95, in __init__
[127.0.0.1] out:     raise socket.error(last_err)
[127.0.0.1] out: socket.error: [Errno 111] Connection refused

Looks like the transformer is unable to connect to the message broker (RabbitMQ). Do the ingester and loader also have the same problem?

$ xfmctl --roles=singlenode supervisor:"tail xfm-ingester:xfm-ingester-0"
[127.0.0.1] Executing task 'supervisor'
[127.0.0.1] sudo: /usr/bin/supervisorctl -c /etc/xfm/supervisord/supervisord.conf tail xfm-ingester:xfm-ingester-0
[127.0.0.1] out: unnable.java:64)
[127.0.0.1] out:    at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:53)
[127.0.0.1] out:    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
[127.0.0.1] out:    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
[127.0.0.1] out:    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
[127.0.0.1] out:    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[127.0.0.1] out:    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[127.0.0.1] out:    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[127.0.0.1] out:    at java.lang.Thread.run(Thread.java:745)
[127.0.0.1] out: Caused by: java.net.ConnectException: Connection refused
[127.0.0.1] out:    at java.net.PlainSocketImpl.socketConnect(Native Method)
[127.0.0.1] out:    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
[127.0.0.1] out:    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
[127.0.0.1] out:    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
[127.0.0.1] out:    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
[127.0.0.1] out:    at java.net.Socket.connect(Socket.java:579)
[127.0.0.1] out:    at com.rabbitmq.client.ConnectionFactory.createFrameHandler(ConnectionFactory.java:445)
[127.0.0.1] out:    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:504)
[127.0.0.1] out:    at com.rabbitmq.client.ConnectionFactory.newConnection(ConnectionFactory.java:545)
[127.0.0.1] out:    at org.springframework.amqp.rabbit.connection.AbstractConnectionFactory.createBareConnection(AbstractConnectionFactory.java:181)
[127.0.0.1] out:    ... 17 more

Indeed this seems the case. Note that supervisor command did show the RUNNING status for the ingester (and loader). This is because both the ingester and loader have an internal retry mechanism when connecting to the broker, so supervisor sees a running process. The transformer exists when it is not able to connect, and relies on the retry mechanism of supervisor.

Let’s inspect the broker.

9.1.4. Inspect queue sizes

Show an overview of the queues:

$ xfmctl --roles=singlenode messaging
[127.0.0.1] Executing task 'messaging'
[127.0.0.1] sudo: /opt/mgrid/xfm3-bootstrap/scripts/rabbitmqcmd.py broker list queues
[127.0.0.1] out: *** Could not connect: [Errno 111] Connection refused
[127.0.0.1] out:


Fatal error: sudo() received nonzero return code 1 while executing!

Requested: /opt/mgrid/xfm3-bootstrap/scripts/rabbitmqcmd.py broker list queues
Executed: sudo -S -p 'sudo password:'  /bin/bash -l -c "/opt/mgrid/xfm3-bootstrap/scripts/rabbitmqcmd.py broker list queues"

Again an issue connecting to the message broker.

Is the broker listening for connections?

$ xfmctl --roles=singlenode -- sudo systemctl status rabbitmq-server
[127.0.0.1] Executing task '<remainder>'
[127.0.0.1] run: sudo systemctl status rabbitmq-server
[127.0.0.1] out: ● rabbitmq-server.service - RabbitMQ broker
[127.0.0.1] out:    Loaded: loaded (/usr/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: disabled)
[127.0.0.1] out:    Active: inactive (dead) since Sun 2016-01-10 15:54:32 EST; 20min ago
[127.0.0.1] out:   Process: 6370 ExecStop=/usr/lib/rabbitmq/bin/rabbitmqctl stop (code=exited, status=0/SUCCESS)
[127.0.0.1] out:   Process: 6111 ExecStart=/usr/lib/rabbitmq/bin/rabbitmq-server (code=killed, signal=TERM)
[127.0.0.1] out:  Main PID: 6111 (code=killed, signal=TERM)

Aha, it is not running. Let’s start it.

$ xfmctl --roles=singlenode -- sudo systemctl start rabbitmq-server

9.2. Unable to open /etc/scl/prefixes/–

[127.0.0.1] out: Notice: /Stage[main]/Xfm_common::Pyenv/Exec[virtualenv]/returns: Unable to open /etc/scl/prefixes/--!
[127.0.0.1] out: Error: scl enable python27 --  virtualenv /var/lib/xfm/pyenv returned 1 instead of one of [0]
[127.0.0.1] out: Error: /Stage[main]/Xfm_common::Pyenv/Exec[virtualenv]/returns: change from notrun to 0 failed: scl enable python27 --  virtualenv /var/lib/xfm/pyenv returned 1 instead of one of [0]

You are using an old version of scl-utils. Upgrade to a newer version with the following command:

$ yum upgrade scl-utils