Troubleshooting Network Connectivity
Overview
This guide accompanies the one on networking and focuses on troubleshooting of network connections.
For connections that use TLS there is an additional guide on troubleshooting TLS.
Troubleshooting Methodology
Troubleshooting of network connectivity issues is a broad topic. There are entire books written about it. This guide explains a methodology and widely available networking tools that help narrow most common issues down efficiently.
Networking protocols are layered. So are problems with them. An effective troubleshooting strategy typically uses the process of elimination to pinpoint the issue (or multiple issues), starting at higher levels. Specifically for messaging technologies, the following steps are often effective and sufficient:
- Verify client configuration
- Verify server configuration using rabbitmq-diagnostics listeners, rabbitmq-diagnostics status, rabbitmq-diagnostics environment
- Inspect server logs
- Verify hostname resolution
- Verify which TCP ports are used and whether they are accessible
- Verify IP routing
- If needed, take and analyze a traffic dump (traffic capture)
- Verify that clients can successfully authenticate
These steps, when performed in sequence, usually help identify the root cause of the vast majority of networking issues. Troubleshooting tools and techniques for levels lower than the Internet (networking) layer are outside of the scope of this guide.
Certain problems only happen in environments with a high degree of connection churn. Client connections can be inspected using the management UI. It is also possible to inspect all TCP connections of a node and their state. That information collected over time, combined with server logs, will help detect connection churn, file descriptor exhaustion and related issues.
Verify Client Configuration
All developers and operators have been there: typos, outdated values, issues in provisioning tools, mixed up public and private key paths, and so on. Step one is to double check application and client library configuration.
Verify Server Configuration
Verifying server configuration helps prove that RabbitMQ is running with the expected set of settings related to networking. It also verifies that the node is actually running. Here are the recommended steps:
- Make sure the node is running using rabbitmq-diagnostics status
- Verify config file is correctly placed and has correct syntax/structure
- Inspect listeners using rabbitmq-diagnostics listeners or the listeners section in rabbitmq-diagnostics status
- Inspect effective configuration using rabbitmq-diagnostics environment
Note that in older RabbitMQ versions, the status and environment commands were only available as part of rabbitmqctl: rabbitmqctl status and so on. In modern versions either tool can be used to run those commands but rabbitmq-diagnostics is what most documentation guides will typically recommend.
The listeners section will look something like this:
Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Interface: [::], port: 15672, protocol: http, purpose: HTTP API
Interface: [::], port: 15671, protocol: https, purpose: HTTP API over TLS (HTTPS)
Interface: [::], port: 1883, protocol: mqtt, purpose: MQTT
In the above example, there are 6 TCP listeners on the node:
- Inter-node and CLI tool communication on port 25672
- AMQP 0-9-1 (and 1.0, if enabled) listener for non-TLS connections on port 5672
- AMQP 0-9-1 (and 1.0, if enabled) listener for TLS-enabled connections on port 5671
- HTTP API listeners on ports 15672 (HTTP) and 15671 (HTTPS)
- MQTT listener for non-TLS connections on port 1883
As a second example, consider a node with 4 TCP listeners:
- Inter-node and CLI tool communication on port 25672
- AMQP 0-9-1 (and 1.0, if enabled) listener for non-TLS connections on port 5672
- AMQP 0-9-1 (and 1.0, if enabled) listener for TLS-enabled connections on port 5671
- HTTP API listener on port 15672 (HTTP only)
All listeners are bound to all available interfaces.
Inspecting TCP listeners used by a node helps spot non-standard port configuration, protocol plugins (e.g. MQTT) that are supposed to be configured but aren't, cases when the node is limited to only a few network interfaces, and so on. If a port is not on the listener list it means the node cannot accept any connections on it.
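The listener check described above can be scripted. The following is a minimal sketch rather than part of RabbitMQ's tooling: missing_listener_ports is a hypothetical helper that reads listener output in the format shown above on standard input and prints every expected port that has no listener. The list of expected ports is an assumption and should match what the node is meant to serve.

```shell
# Hypothetical helper (not a RabbitMQ command): prints each expected port
# that is absent from the listener output supplied on standard input.
missing_listener_ports() {
  listeners_output=$(cat)
  for port in "$@"; do
    # Listener lines contain "port: <number>," as in the sample output above.
    printf '%s\n' "$listeners_output" | grep -q "port: ${port}," || printf '%s\n' "$port"
  done
}
```

A possible invocation is rabbitmq-diagnostics listeners | missing_listener_ports 5672 5671 1883. No output means every expected port has a listener.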
Inspect Server Logs
RabbitMQ nodes will log key client connection lifecycle events. A TCP connection must be successfully established and at least 1 byte of data must be sent by the peer for a connection to be considered (and logged as) accepted.
From this point, connection handshake and negotiation proceeds as defined by the specification of the messaging protocol used, e.g. AMQP 0-9-1, AMQP 1.0 or MQTT.
If no events are logged, this means that either there were no successful inbound TCP connections or they sent no data.
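To check quickly whether any connection lifecycle events were logged, the log can be filtered for them. This is a sketch under assumptions: connection_events is a hypothetical helper, and the message patterns are taken from the sample AMQP log entries in the connection churn section of this guide. Other protocols and RabbitMQ versions may word these entries differently.

```shell
# Hypothetical helper: keeps only AMQP connection lifecycle entries from
# RabbitMQ log lines supplied on standard input.
# The patterns are assumptions based on the sample log entries in this guide.
connection_events() {
  grep -E 'accepting AMQP connection|authenticated and granted access|closing AMQP connection'
}
```

It could be used as connection_events < /path/to/node.log | tail -n 20, where the log file path is a placeholder that depends on the installation.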
Hostname Resolution
It is very common for applications to use hostnames or URIs with hostnames when connecting to RabbitMQ. dig and nslookup are commonly used tools for troubleshooting hostname resolution.
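A minimal sketch of two complementary lookups follows. dns_lookup and system_lookup are hypothetical helper names. dig queries the configured DNS servers directly and does not read /etc/hosts, so its answer can differ from what an application sees. getent goes through the system resolver instead, which typically consults /etc/hosts as well. It is assumed to be available, which is the case on Linux and some other UNIX-like systems.

```shell
# Query DNS directly for A and AAAA records, with a short timeout.
# dig does not read /etc/hosts.
dns_lookup() {
  dig +short +time=2 +tries=1 "$1" A
  dig +short +time=2 +tries=1 "$1" AAAA
}

# Resolve through the system resolver, which is closer to what
# client applications do. Assumes getent is installed.
system_lookup() {
  getent hosts "$1"
}
```

If dns_lookup and system_lookup disagree for the same hostname, a local override (such as an /etc/hosts entry) is a likely explanation.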
Port Access
Besides hostname resolution and IP routing issues, TCP port inaccessibility for outside connections is a common reason for failing client connections. telnet is a commonly used, very minimalistic tool for testing TCP connections to a particular hostname and port.
The following example uses telnet to connect to host localhost on port 5672. There is a node with stock defaults running on localhost and nothing blocks access to the port, so the connection succeeds. 12345 is then entered for input followed by an Enter. This data will be sent to the node on the opened connection. Since 12345 is not a correct AMQP 0-9-1 or AMQP 1.0 protocol header, the server closes the TCP connection:
telnet localhost 5672
# => Trying ::1...
# => Connected to localhost.
# => Escape character is '^]'.
12345 # enter this and hit Enter to send
# => AMQP Connection closed by foreign host.
After the telnet connection succeeds, use Control + ] and then Control + D to quit it.
The following example connects to localhost on port 5673. The connection fails (refused by the OS) since there is no process listening on that port.
telnet localhost 5673
# => Trying ::1...
# => telnet: connect to address ::1: Connection refused
# => Trying 127.0.0.1...
# => telnet: connect to address 127.0.0.1: Connection refused
# => telnet: Unable to connect to remote host
Failed or timing out telnet connections strongly suggest there's a proxy, load balancer or firewall that blocks incoming connections on the target port. It could also be due to the RabbitMQ process not running on the target node or using a non-standard port. Those scenarios should be eliminated at the step that double checks server listener configuration.
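telnet is interactive, which makes it awkward to use in scripts. The following sketch assumes netcat (nc) is installed. port_open is a hypothetical helper name, and the -z and -w flags are supported by common netcat variants, although variants do differ. Note that the helper cannot tell a refused connection from a timed out one, or from nc being absent.

```shell
# Succeeds if a TCP connection to host $1, port $2 can be opened within
# 5 seconds. -z: connect without sending data; -w: timeout in seconds.
port_open() {
  nc -z -w 5 "$1" "$2" >/dev/null 2>&1
}
```

For example, port_open localhost 5672 && echo reachable || echo unreachable.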
There's a great number of firewall, proxy and load balancer tools and products. iptables is a commonly used firewall on Linux. There is no shortage of iptables tutorials on the Web.
Open ports, TCP and UDP connections of a node can be inspected using netstat, ss, lsof.
The following example uses lsof to display OS processes that listen on port 5672 and use IPv4:
sudo lsof -n -i4TCP:5672 | grep LISTEN
Similarly, for programs that use IPv6:
sudo lsof -n -i6TCP:5672 | grep LISTEN
On port 1883:
sudo lsof -n -i4TCP:1883 | grep LISTEN
sudo lsof -n -i6TCP:1883 | grep LISTEN
If the above commands produce no output then no local OS processes listen on the given port.
The following example uses ss to display listening TCP sockets that use IPv4 and their OS processes:
sudo ss --tcp -f inet --listening --numeric --processes
Similarly, for TCP sockets that use IPv6:
sudo ss --tcp -f inet6 --listening --numeric --processes
For the list of ports used by RabbitMQ and its various plugins, see above. Generally all ports used for external connections must be allowed by the firewalls and proxies.
rabbitmq-diagnostics listeners and rabbitmq-diagnostics status can be used to list enabled listeners and their ports on a RabbitMQ node.
IP Routing
Messaging protocols supported by RabbitMQ use TCP and require IP routing between clients and RabbitMQ hosts to be functional. There are several tools and techniques that can be used to verify IP routing between two hosts. traceroute and ping are two common options available for many operating systems. Most routing table inspection tools are OS-specific.
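Note that ping uses ICMP and traceroute relies on ICMP responses to its probes (which, depending on the implementation, are UDP or ICMP packets by default), while RabbitMQ client libraries and inter-node connections use TCP. Therefore a successful ping run alone does not guarantee successful client connectivity.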
Both traceroute and ping have Web-based and GUI tools built on top.
Capturing Traffic
All network activity can be inspected, filtered and analyzed using a traffic capture.
tcpdump and its GUI sibling Wireshark are the industry standards for capturing traffic, filtering and analysis. Both support all protocols supported by RabbitMQ. See the Using Wireshark with RabbitMQ guide for an overview.
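As a sketch of how a capture can be limited to RabbitMQ traffic, the hypothetical helper below builds a capture filter expression for a set of TCP ports. The ports passed to it are assumptions and should match the node's actual listeners.

```shell
# Hypothetical helper: builds a capture filter that matches any of the given
# TCP ports, e.g. "tcp port 5672 or tcp port 5671".
capture_filter() {
  filter=""
  for port in "$@"; do
    # Prepend " or " only when the filter already has a clause.
    filter="${filter:+$filter or }tcp port $port"
  done
  printf '%s\n' "$filter"
}
```

The result can be passed to tcpdump, for example sudo tcpdump -n -i any -w rabbitmq.pcap "$(capture_filter 5672 5671)". Here rabbitmq.pcap is a placeholder file name and the any pseudo-interface is Linux-specific. The resulting file can then be opened in Wireshark.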
TLS Connections
For connections that use TLS there is a separate guide on troubleshooting TLS.
When adopting TLS it is important to make sure that clients use the correct port to connect (see the list of ports above) and that they are instructed to use TLS (perform TLS upgrade). A client that is not configured to use TLS will successfully connect to a TLS-enabled server port but its connection will then time out since it never performs the TLS upgrade that the server expects.
A TLS-enabled client connecting to a non-TLS enabled port will successfully connect and try to perform a TLS upgrade which the server does not expect, thus triggering a protocol parser exception. Such exceptions will be logged by the server.
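One way to tell which kind of port a client is talking to is to attempt a TLS handshake against it. The sketch below assumes the openssl command line tool is installed. tls_handshake_ok is a hypothetical helper name. It does not enforce peer verification and has no timeout of its own, so it only indicates whether a handshake can complete at all. A server that requires client certificates will reject it.

```shell
# Succeeds if a TCP connection to host $1, port $2 can be opened and a TLS
# handshake completes. Standard input is closed so the session ends at once.
tls_handshake_ok() {
  openssl s_client -connect "$1:$2" </dev/null >/dev/null 2>&1
}
```

For example, tls_handshake_ok localhost 5671 && echo "TLS port" || echo "handshake failed". The dedicated TLS troubleshooting guide covers certificate and verification problems.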
Inspecting Connections
Open ports, TCP and UDP connections of a node can be inspected using netstat, ss, lsof.
The following example uses netstat to list all TCP connection sockets regardless of their state and interface. IP addresses and ports will be displayed as numbers instead of being resolved to domain and service names. The name of the program that owns each socket will be printed as well.
sudo netstat --all --numeric --tcp --programs
Both inbound (client, peer nodes, CLI tools) and outgoing (peer nodes, Federation links and Shovels) connections can be inspected this way.
rabbitmqctl list_connections and the management UI can be used to inspect more connection properties, some of which are RabbitMQ- or messaging protocol-specific:
- Network traffic flow, both inbound and outbound
- Messaging (application-level) protocol used
- Connection virtual host
- Time of connection
- Username
- Number of channels
- Client library details (name, version, capabilities)
- Effective heartbeat timeout
- TLS details
Combining connection information from the management UI or CLI tools with that of netstat or ss can help troubleshoot misbehaving applications, application instances and client libraries.
Most relevant connection metrics can be collected, aggregated and monitored using Prometheus and Grafana.
Detecting High Connection Churn
High connection churn (lots of connections opened and closed after a brief period of time) can lead to resource exhaustion. It is therefore important to be able to identify such scenarios. netstat and ss are the most popular options for inspecting TCP connections. A lot of connections in the TIME_WAIT state is a likely symptom of high connection churn. Lots of connections in states other than ESTABLISHED also might be a symptom worth investigating.
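A per-state summary makes such symptoms easier to spot than a raw socket listing. The following is a minimal sketch: count_tcp_states is a hypothetical helper that expects ss output for TCP sockets only, where the first column is the socket state. Note that ss prints the state as TIME-WAIT while netstat prints TIME_WAIT.

```shell
# Hypothetical helper: reads TCP-only ss output on standard input and prints
# the number of sockets per state, most frequent first.
count_tcp_states() {
  # Skip the header line, then tally the first column (the state).
  awk 'NR > 1 { count[$1]++ } END { for (state in count) print count[state], state }' | sort -rn
}
```

A possible invocation is ss --tcp --all --numeric | count_tcp_states. Adding a filter such as '( sport = :5672 )' to the ss command should limit the summary to a single listener port. Running this periodically and comparing the results shows whether the number of short-lived connections is growing.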
Evidence of short-lived connections can be found in RabbitMQ log files. For example, here is such a connection that lasted only a few milliseconds:
2018-06-17 16:23:29.851 [info] <0.634.0> accepting AMQP connection <0.634.0> (127.0.0.1:58588 -> 127.0.0.1:5672)
2018-06-17 16:23:29.853 [info] <0.634.0> connection <0.634.0> (127.0.0.1:58588 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2018-06-17 16:23:29.855 [info] <0.634.0> closing AMQP connection <0.634.0> (127.0.0.1:58588 -> 127.0.0.1:5672, vhost: '/', user: 'guest')
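Connection lifetimes can be extracted from such entries automatically. The sketch below assumes the log format shown above (date, time, level, then the connection process ID as the fourth field) and that a connection is accepted and closed on the same day. connection_lifetimes is a hypothetical helper name. Other RabbitMQ versions may use a different log format.

```shell
# Hypothetical helper: pairs "accepting" and "closing" entries by connection
# process ID (fourth field) and prints "<pid> <lifetime in milliseconds>".
# Assumes the log format shown above and no midnight rollover.
connection_lifetimes() {
  awk '
    # Converts HH:MM:SS.mmm to milliseconds since midnight.
    function to_ms(clock,    parts) {
      split(clock, parts, ":")
      return ((parts[1] * 60 + parts[2]) * 60 + parts[3]) * 1000
    }
    /accepting AMQP connection/ { accepted[$4] = to_ms($2) }
    /closing AMQP connection/ && ($4 in accepted) {
      printf "%s %.0f\n", $4, to_ms($2) - accepted[$4]
      delete accepted[$4]
    }
  '
}
```

For the three entries above, connection_lifetimes prints <0.634.0> 4, that is, a connection that lived for about 4 milliseconds. Many such lines with small values are a sign of high connection churn.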