Version: 4.0

Troubleshooting Network Connectivity

Overview

This guide accompanies the one on networking and focuses on troubleshooting of network connections.

For connections that use TLS there is an additional guide on troubleshooting TLS.

Troubleshooting Methodology

Troubleshooting of network connectivity issues is a broad topic. There are entire books written about it. This guide explains a methodology and widely available networking tools that help narrow most common issues down efficiently.

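Networking protocols are layered, and so are the problems with them. An effective troubleshooting strategy typically uses the process of elimination to pinpoint the issue (or multiple issues), starting at higher levels. Specifically for messaging technologies, the following steps are often effective and sufficient:

  • Verify client configuration
  • Verify server configuration
  • Inspect server logs
  • Verify hostname resolution
  • Verify what TCP ports are used and their accessibility
  • Verify IP routing
  • If needed, take and analyze a traffic capture
  • Inspect the connections that were established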

These steps, when performed in sequence, usually help identify the root cause of the vast majority of networking issues. Troubleshooting tools and techniques for levels lower than the Internet (networking) layer are outside of the scope of this guide.

Certain problems only happen in environments with a high degree of connection churn. Client connections can be inspected using the management UI. It is also possible to inspect all TCP connections of a node and their state. That information collected over time, combined with server logs, will help detect connection churn, file descriptor exhaustion and related issues.

Verify Client Configuration

All developers and operators have been there: typos, outdated values, issues in provisioning tools, mixed up public and private key paths, and so on. Step one is to double check application and client library configuration.
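A quick way to catch such mistakes is to print what a connection URI actually resolves to. The following Python sketch is illustrative only: describe_amqp_uri is a hypothetical helper, not part of any client library, and it assumes the amqp:// and amqps:// URI schemes with default ports 5672 and 5671. Client libraries may apply their own defaults, for example when the virtual host is omitted.

```python
from urllib.parse import urlsplit, unquote

# Assumed defaults: 5672 for amqp://, 5671 for amqps:// (TLS).
DEFAULT_PORTS = {"amqp": 5672, "amqps": 5671}


def describe_amqp_uri(uri):
    """Return the effective connection settings encoded in an AMQP URI.

    The password is deliberately left out so the result is safe to log.
    """
    parts = urlsplit(uri)
    if parts.scheme not in DEFAULT_PORTS:
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "tls": parts.scheme == "amqps",
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORTS[parts.scheme],
        "username": unquote(parts.username) if parts.username else None,
        # The virtual host is the percent-decoded path without its leading slash;
        # None means the URI did not specify one.
        "vhost": unquote(parts.path[1:]) if len(parts.path) > 1 else None,
    }


if __name__ == "__main__":
    print(describe_amqp_uri("amqps://user:secret@rabbit.example.com/prod"))
```

Comparing this output against the listener ports reported by the server often reveals a scheme and port mismatch, such as amqp:// pointed at port 5671.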

Verify Server Configuration

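Verifying server configuration helps prove that RabbitMQ is running with the expected set of settings related to networking. It also verifies that the node is actually running. Here are the recommended steps:

  • Make sure the node is running using rabbitmq-diagnostics status
  • Inspect listeners using rabbitmq-diagnostics listeners or the listeners section in rabbitmq-diagnostics status
  • Inspect effective configuration using rabbitmq-diagnostics environment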

Note that in older RabbitMQ versions, the status and environment commands were only available as part of rabbitmqctl: rabbitmqctl status and so on. In modern versions either tool can be used to run those commands but rabbitmq-diagnostics is what most documentation guides will typically recommend.

The listeners section will look something like this:

Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Interface: [::], port: 15672, protocol: http, purpose: HTTP API
Interface: [::], port: 15671, protocol: https, purpose: HTTP API over TLS (HTTPS)
Interface: [::], port: 1883, protocol: mqtt, purpose: MQTT

In the above example, there are 6 TCP listeners on the node:

  • Inter-node and CLI tool communication on port 25672
  • AMQP 0-9-1 (and 1.0, if enabled) listener for non-TLS connections on port 5672
  • AMQP 0-9-1 (and 1.0, if enabled) listener for TLS-enabled connections on port 5671
  • HTTP API listeners on ports 15672 (HTTP) and 15671 (HTTPS)
  • MQTT listener for non-TLS connections on port 1883

In a second example (output not shown), there are 4 TCP listeners on the node:

  • Inter-node and CLI tool communication on port 25672
  • AMQP 0-9-1 (and 1.0, if enabled) listener for non-TLS connections on port 5672
  • AMQP 0-9-1 (and 1.0, if enabled) listener for TLS-enabled connections on port 5671
  • HTTP API listener on port 15672 (HTTP only)

All listeners are bound to all available interfaces.

Inspecting TCP listeners used by a node helps spot non-standard port configuration, protocol plugins (e.g. MQTT) that are supposed to be configured but aren't, cases when the node is limited to only a few network interfaces, and so on. If a port is not on the listener list it means the node cannot accept any connections on it.
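This check can be scripted. The following Python sketch parses listener lines in the format shown above and reports expected ports that have no listener. parse_listeners and missing_ports are hypothetical helper names, and the line format is assumed to match the example output above; it may differ between versions.

```python
import re

# Matches lines such as:
# Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
LISTENER_RE = re.compile(
    r"Interface: (?P<interface>\S+), port: (?P<port>\d+), "
    r"protocol: (?P<protocol>[^,]+), purpose: (?P<purpose>.+)"
)


def parse_listeners(output):
    """Map each listener port found in the output to its protocol name."""
    listeners = {}
    for line in output.splitlines():
        match = LISTENER_RE.search(line)
        if match:
            listeners[int(match["port"])] = match["protocol"].strip()
    return listeners


def missing_ports(output, expected_ports):
    """Return the expected ports that have no listener, in ascending order."""
    found = parse_listeners(output)
    return sorted(port for port in expected_ports if port not in found)


# Listener lines copied from the example output above.
SAMPLE = """\
Interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Interface: [::], port: 15672, protocol: http, purpose: HTTP API
Interface: [::], port: 15671, protocol: https, purpose: HTTP API over TLS (HTTPS)
Interface: [::], port: 1883, protocol: mqtt, purpose: MQTT
"""

if __name__ == "__main__":
    # 8883 is used here only as an example of a port that is not in the list.
    print(missing_ports(SAMPLE, [5672, 1883, 8883]))  # prints [8883]
```

In practice the text would come from the output of rabbitmq-diagnostics listeners captured on the node being investigated.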

Inspect Server Logs

RabbitMQ nodes will log key client connection lifecycle events. A TCP connection must be successfully established and at least 1 byte of data must be sent by the peer for a connection to be considered (and logged as) accepted.

From this point, connection handshake and negotiation proceeds as defined by the specification of the messaging protocol used, e.g. AMQP 0-9-1, AMQP 1.0 or MQTT.

If no events are logged, this means that either there were no successful inbound TCP connections or they sent no data.
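When a log file is large, counting accepted connections per client address quickly shows whether a given application host ever reached the node. The Python sketch below is a rough illustration: accepted_peers is a hypothetical helper, and it assumes AMQP log lines of the form "accepting AMQP connection <pid> (client:port -> server:port)", as in the log excerpt in the connection churn section of this guide. Other protocols and versions may log different messages.

```python
import re
from collections import Counter

# Captures the client address from an "accepting AMQP connection" log line.
ACCEPT_RE = re.compile(
    r"accepting AMQP connection \S+ \((?P<peer>\S+):(?P<peer_port>\d+) -> "
)


def accepted_peers(log_text):
    """Count accepted AMQP connections per client address."""
    counts = Counter(match["peer"] for match in ACCEPT_RE.finditer(log_text))
    return dict(counts)
```

An address that is missing from the result either never completed a TCP connection to the node or never sent any data on it.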

Hostname Resolution

It is very common for applications to use hostnames or URIs with hostnames when connecting to RabbitMQ. dig and nslookup are commonly used tools for troubleshooting hostname resolution.
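Applications usually resolve hostnames through the operating system resolver, which can also consult local sources such as the hosts file, whereas dig and nslookup query DNS servers directly. The following Python sketch shows what the OS resolver returns from the machine where it runs. resolve is a hypothetical helper, and the default port of 5672 is only an assumption used to complete the lookup.

```python
import socket


def resolve(hostname, port=5672):
    """Return the sorted, de-duplicated IP addresses the OS resolver reports.

    An empty list means the hostname could not be resolved.
    """
    try:
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Running resolve with the exact hostname from the client configuration, on the client's host, shows whether the application would reach the intended addresses.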

Port Access

Besides hostname resolution and IP routing issues, TCP port inaccessibility for outside connections is a common reason for failing client connections. telnet is a commonly used, very minimalistic tool for testing TCP connections to a particular hostname and port.

The following example uses telnet to connect to host localhost on port 5672. There is a running node with stock defaults running on localhost and nothing blocks access to the port, so the connection succeeds. 12345 is then entered for input followed by an Enter. This data will be sent to the node on the opened connection.

Since 12345 is not a correct AMQP 0-9-1 or AMQP 1.0 protocol header, the server closes the TCP connection:

telnet localhost 5672
# => Trying ::1...
# => Connected to localhost.
# => Escape character is '^]'.
12345 # enter this and hit Enter to send
# => AMQP Connection closed by foreign host.

After a telnet connection succeeds, use Control + ] and then Control + D to quit it.

The following example connects to localhost on port 5673. The connection fails (refused by the OS) since there is no process listening on that port.

telnet localhost 5673
# => Trying ::1...
# => telnet: connect to address ::1: Connection refused
# => Trying 127.0.0.1...
# => telnet: connect to address 127.0.0.1: Connection refused
# => telnet: Unable to connect to remote host

Failed or timing out telnet connections strongly suggest there's a proxy, load balancer or firewall that blocks incoming connections on the target port. It could also be because the RabbitMQ process is not running on the target node or uses a non-standard port. Those scenarios should be eliminated at the step that double checks server listener configuration.
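Where telnet is not installed, the same test can be approximated with a few lines of Python. The sketch below is a minimal illustration: check_tcp_port is a hypothetical helper and the 3 second timeout is an arbitrary choice. It separates a refused connection from one that times out, since the two usually point at different causes.

```python
import socket


def check_tcp_port(host, port, timeout=3.0):
    """Attempt a TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered but rejected the connection, typically because
        # nothing listens on that port.
        return "refused"
    except socket.timeout:
        # No answer within the timeout, which is common when packets are dropped.
        return "timeout"
    except OSError as exc:
        # Anything else, such as a hostname that does not resolve.
        return f"error: {exc}"
```

For example, check_tcp_port("localhost", 5672) would return "open" against the node from the first telnet example and "refused" for port 5673 from the second.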

There's a great number of firewall, proxy and load balancer tools and products. iptables is a commonly used firewall on Linux and other UNIX-like systems. There is no shortage of iptables tutorials on the Web.

Open ports, TCP and UDP connections of a node can be inspected using netstat, ss, or lsof.

The following example uses lsof to display OS processes that listen on port 5672 and use IPv4:

sudo lsof -n -i4TCP:5672 | grep LISTEN

Similarly, for programs that use IPv6:

sudo lsof -n -i6TCP:5672 | grep LISTEN

On port 1883:

sudo lsof -n -i4TCP:1883 | grep LISTEN
sudo lsof -n -i6TCP:1883 | grep LISTEN

If the above commands produce no output then no local OS processes listen on the given port.

The following example uses ss to display listening TCP sockets that use IPv4 and their OS processes:

sudo ss --tcp -f inet --listening --numeric --processes

Similarly, for TCP sockets that use IPv6:

sudo ss --tcp -f inet6 --listening --numeric --processes

For the list of ports used by RabbitMQ and its various plugins, see above. Generally all ports used for external connections must be allowed by the firewalls and proxies.

rabbitmq-diagnostics listeners and rabbitmq-diagnostics status can be used to list enabled listeners and their ports on a RabbitMQ node.

IP Routing

Messaging protocols supported by RabbitMQ use TCP and require IP routing between clients and RabbitMQ hosts to be functional. There are several tools and techniques that can be used to verify IP routing between two hosts. traceroute and ping are two common options available for many operating systems. Most routing table inspection tools are OS-specific.

Note that ping uses ICMP, and traceroute typically probes with ICMP or UDP, while RabbitMQ client libraries and inter-node connections use TCP. Therefore a successful ping or traceroute run alone does not guarantee successful client connectivity.

Both traceroute and ping have Web-based and GUI tools built on top.

Capturing Traffic

All network activity can be inspected, filtered and analyzed using a traffic capture.

tcpdump and its GUI sibling Wireshark are the industry standards for capturing traffic, filtering and analysis. Both support all protocols supported by RabbitMQ. See the Using Wireshark with RabbitMQ guide for an overview.

TLS Connections

For connections that use TLS there is a separate guide on troubleshooting TLS.

When adopting TLS it is important to make sure that clients use the correct port to connect (see the list of ports above) and that they are instructed to use TLS (perform TLS upgrade). A client that is not configured to use TLS will successfully connect to a TLS-enabled server port but its connection will then time out since it never performs the TLS upgrade that the server expects.

A TLS-enabled client connecting to a non-TLS enabled port will successfully connect and try to perform a TLS upgrade which the server does not expect, thus triggering a protocol parser exception. Such exceptions will be logged by the server.
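To find out which of these situations applies to a given port, it can help to attempt a TLS handshake and look at how it ends. The Python sketch below is a probe under stated assumptions, not a replacement for the TLS troubleshooting guide: probe_tls is a hypothetical helper, the timeout is arbitrary, and certificate verification is deliberately disabled because only the handshake outcome is of interest. Do not reuse these context settings in application code.

```python
import socket
import ssl


def probe_tls(host, port, timeout=3.0):
    """Attempt a TLS handshake and classify how it ends.

    Returns "tcp-connect-failed", "tls-ok", "tls-handshake-failed",
    "tls-handshake-timeout" or "connection-dropped".
    """
    context = ssl.create_default_context()
    # Probe only: skip certificate and hostname checks (unsafe for real clients).
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "tcp-connect-failed"
    with raw:
        try:
            # wrap_socket performs the handshake on the already connected socket.
            with context.wrap_socket(raw, server_hostname=host):
                return "tls-ok"
        except ssl.SSLError:
            # The peer answered with something that is not valid TLS,
            # or the two sides could not agree on TLS parameters.
            return "tls-handshake-failed"
        except socket.timeout:
            # The peer accepted the TCP connection but never answered.
            return "tls-handshake-timeout"
        except OSError:
            return "connection-dropped"
```

A result of "tls-handshake-failed" against a port that accepts plain TCP connections suggests the client was pointed at a non-TLS listener; the server log should then contain the corresponding parser exception.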

Inspecting Connections

Open ports, TCP and UDP connections of a node can be inspected using netstat, ss, or lsof.

The following example uses netstat to list all TCP connection sockets regardless of their state and interface. IP addresses will be displayed as numbers instead of being resolved to domain names. Program names will be printed next to numeric port values (as opposed to protocol names).

sudo netstat --all --numeric --tcp --programs

Both inbound (client, peer nodes, CLI tools) and outgoing (peer nodes, Federation links and Shovels) connections can be inspected this way.

rabbitmqctl list_connections and the management UI can be used to inspect more connection properties, some of which are RabbitMQ- or messaging protocol-specific:

  • Network traffic flow, both inbound and outbound
  • Messaging (application-level) protocol used
  • Connection virtual host
  • Time of connection
  • Username
  • Number of channels
  • Client library details (name, version, capabilities)
  • Effective heartbeat timeout
  • TLS details

Combining connection information from the management UI or CLI tools with that from netstat or ss can help troubleshoot misbehaving applications, application instances and client libraries.

Most relevant connection metrics can be collected, aggregated and monitored using Prometheus and Grafana.

Detecting High Connection Churn

High connection churn (lots of connections opened and closed after a brief period of time) can lead to resource exhaustion. It is therefore important to be able to identify such scenarios. netstat and ss are the most popular options for inspecting TCP connections. A lot of connections in the TIME_WAIT state is a likely symptom of high connection churn. Lots of connections in states other than ESTABLISHED also might be a symptom worth investigating.
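Counting connections per state is easy to automate. The Python sketch below tallies states for one local port from text in the column layout printed by ss --tcp --all --numeric, where the state is the first column and the local address the fourth. count_states is a hypothetical helper and the sample rows are made up for illustration. Note that ss prints TIME-WAIT and ESTAB where netstat prints TIME_WAIT and ESTABLISHED.

```python
from collections import Counter


def count_states(ss_output, local_port):
    """Count TCP sockets per state for a given local port."""
    suffix = f":{local_port}"
    states = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Skip blank lines, short lines and the header row.
        if len(fields) < 5 or fields[0] == "State":
            continue
        # fields[3] is the local address:port column.
        if fields[3].endswith(suffix):
            states[fields[0]] += 1
    return dict(states)


# Illustrative rows only, not captured from a real node.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      127.0.0.1:5672      127.0.0.1:58588
TIME-WAIT  0      0      127.0.0.1:5672      127.0.0.1:58590
TIME-WAIT  0      0      127.0.0.1:5672      127.0.0.1:58592
ESTAB      0      0      127.0.0.1:15672     127.0.0.1:40112
"""

if __name__ == "__main__":
    print(count_states(SAMPLE, 5672))  # prints {'ESTAB': 1, 'TIME-WAIT': 2}
```

Sampling these counts at regular intervals and comparing them over time gives a rough picture of churn on a listener port.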

Evidence of short-lived connections can be found in RabbitMQ log files. For example, here is a connection that lasted only a few milliseconds:

2018-06-17 16:23:29.851 [info] <0.634.0> accepting AMQP connection <0.634.0> (127.0.0.1:58588 -> 127.0.0.1:5672)
2018-06-17 16:23:29.853 [info] <0.634.0> connection <0.634.0> (127.0.0.1:58588 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2018-06-17 16:23:29.855 [info] <0.634.0> closing AMQP connection <0.634.0> (127.0.0.1:58588 -> 127.0.0.1:5672, vhost: '/', user: 'guest')
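Connection lifetimes can be extracted from such log lines automatically. The Python sketch below pairs "accepting" and "closing" events by the connection process identifier and computes how long each connection lasted. connection_lifetimes_ms is a hypothetical helper, and the timestamp and message format are assumed to match the excerpt above; other versions and protocols may log differently.

```python
import re
from datetime import datetime, timedelta

# Matches the timestamp, connection process identifier and event type.
EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \[\w+\] "
    r"(?P<pid><[\d.]+>) (?P<event>accepting|closing) AMQP connection",
    re.MULTILINE,
)


def connection_lifetimes_ms(log_text):
    """Return (pid, lifetime in milliseconds) for each accept/close pair."""
    opened = {}
    lifetimes = []
    for match in EVENT_RE.finditer(log_text):
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        if match["event"] == "accepting":
            opened[match["pid"]] = ts
        elif match["pid"] in opened:
            delta = ts - opened.pop(match["pid"])
            lifetimes.append((match["pid"], delta / timedelta(milliseconds=1)))
    return lifetimes


# The log excerpt shown above.
LOG = """\
2018-06-17 16:23:29.851 [info] <0.634.0> accepting AMQP connection <0.634.0> (127.0.0.1:58588 -> 127.0.0.1:5672)
2018-06-17 16:23:29.853 [info] <0.634.0> connection <0.634.0> (127.0.0.1:58588 -> 127.0.0.1:5672): user 'guest' authenticated and granted access to vhost '/'
2018-06-17 16:23:29.855 [info] <0.634.0> closing AMQP connection <0.634.0> (127.0.0.1:58588 -> 127.0.0.1:5672, vhost: '/', user: 'guest')
"""

if __name__ == "__main__":
    print(connection_lifetimes_ms(LOG))  # prints [('<0.634.0>', 4.0)]
```

Many entries with lifetimes of a few milliseconds are a strong hint that an application opens a new connection per operation instead of reusing a long-lived one.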