Description
Version 1.0.1
I'm debugging a case where sub.poll suddenly doesn't receive any more messages, and it stays that way. On the sending side pub.offer is successful (positive return value and connected=true closed=false).
Only two nodes.
- node1: 172-31-10-77
- node2: 172-31-8-204
node1 tries to send to node2, but node2 is not started yet. ~30 seconds later node2 is started and they successfully exchange a few messages (I have application-level logs for that), and then node2 stops receiving more messages in sub.poll. onFragment in the FragmentAssembler is not invoked even though I repeatedly call poll. The messages are below the MTU size.
The systems are rather loaded in this scenario but not overloaded, and the load is stopped after a while.
AeronStat from node1 172-31-10-77:
23: 1 - recv-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
24: 1,627,656,896 - snd-pos: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
25: 1 - send-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
26: 1,636,045,504 - pub-lmt: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
27: 170,432 - sub-pos: 1 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
28: 170,432 - rcv-hwm: 7 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
29: 19,750,176 - sub-pos: 2 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
30: 19,750,176 - rcv-hwm: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
31: 19,750,176 - rcv-pos: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
32: 28,953,216 - pub-lmt: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
33: 20,564,608 - snd-pos: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
AeronStat from node2 172-31-8-204:
23: 1 - recv-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
24: 1 - send-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
25: 8,559,040 - pub-lmt: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
26: 170,432 - snd-pos: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
27: 28,138,784 - pub-lmt: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
28: 19,750,176 - snd-pos: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
29: 2,176 - sub-pos: 1 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
30: 1,627,656,896 - rcv-hwm: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
31: 2,176 - rcv-pos: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
32: 20,500,992 - sub-pos: 2 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
33: 20,500,992 - rcv-hwm: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
34: 20,500,992 - rcv-pos: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
The problematic session is 1287205435. The other streams seem to progress. I kept it running for minutes after the stall.
Stream 1 is our control stream and it carries low traffic, only a few messages per second.
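To make the stall easier to see, here is a rough sketch that just subtracts the counters quoted above for session 1287205435 on stream 1. The class and method names (StallCheck, gap) are made up for illustration, and the reading of the counters is my assumption: rcv-hwm as the highest position the receiver has seen, and rcv-pos as the position up to which it has contiguous data.

```java
// Hypothetical helper, not part of Aeron: compares the AeronStat values
// copied from the output above for session 1287205435, stream 1.
public class StallCheck {
    // Distance in bytes between two position counters.
    static long gap(long ahead, long behind) {
        return ahead - behind;
    }

    public static void main(String[] args) {
        // node1 (sender) counters
        long sndPos = 1_627_656_896L;
        long pubLmt = 1_636_045_504L;

        // node2 (receiver) counters
        long rcvHwm = 1_627_656_896L;
        long rcvPos = 2_176L;
        long subPos = 2_176L;

        // Sender position versus the receiver's high-water mark.
        System.out.println("snd-pos - rcv-hwm = " + gap(sndPos, rcvHwm));
        // High-water mark versus the receiver's completed position (assumed meaning).
        System.out.println("rcv-hwm - rcv-pos = " + gap(rcvHwm, rcvPos));
        // Receiver position versus what the subscriber has consumed.
        System.out.println("rcv-pos - sub-pos = " + gap(rcvPos, subPos));
        // Room the publisher still has before hitting its limit.
        System.out.println("pub-lmt - snd-pos = " + gap(pubLmt, sndPos));
    }
}
```

With these values rcv-hwm matches snd-pos exactly, while rcv-pos and sub-pos both sit at 2,176, so the subscriber seems to have consumed everything the receiver considers complete. That position is about 1.6 GB behind the high-water mark, which is surprising for a stream that only carries a few messages per second.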
I have all the files if you need more information.