Stalled Publication/Subscription · Issue #281 · aeron-io/aeron · GitHub
Stalled Publication/Subscription #281
Closed
@patriknw

Description

Version 1.0.1

I'm debugging a case where sub.poll suddenly stops receiving messages and never recovers. On the sending side pub.offer still succeeds (positive return value, connected=true, closed=false).

There are only two nodes:

  • node1: 172-31-10-77
  • node2: 172-31-8-204

node1 tries to send to node2 before node2 has started. About 30 seconds later node2 starts and the two nodes successfully exchange a few messages (I have application-level logs for that), and then node2 stops receiving messages in sub.poll. onFragment in the FragmentAssembler is not invoked even though I call poll repeatedly. The messages are smaller than the MTU.

The systems are rather loaded in this scenario but not overloaded, and the load is stopped after a while.

AeronStat from node1 172-31-10-77:

 23:                    1 - recv-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
 24:        1,627,656,896 - snd-pos: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 25:                    1 - send-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
 26:        1,636,045,504 - pub-lmt: 8 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 27:              170,432 - sub-pos: 1 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
 28:              170,432 - rcv-hwm: 7 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 29:           19,750,176 - sub-pos: 2 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520 @0
 30:           19,750,176 - rcv-hwm: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 31:           19,750,176 - rcv-pos: 9 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 32:           28,953,216 - pub-lmt: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 33:           20,564,608 - snd-pos: 10 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520

AeronStat from node2 172-31-8-204:

 23:                    1 - recv-channel: aeron:udp?endpoint=ip-172-31-8-204:25520
 24:                    1 - send-channel: aeron:udp?endpoint=ip-172-31-10-77:25520
 25:            8,559,040 - pub-lmt: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 26:              170,432 - snd-pos: 3 434696915 1 aeron:udp?endpoint=ip-172-31-10-77:25520
 27:           28,138,784 - pub-lmt: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 28:           19,750,176 - snd-pos: 4 434696916 2 aeron:udp?endpoint=ip-172-31-10-77:25520
 29:                2,176 - sub-pos: 1 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 30:        1,627,656,896 - rcv-hwm: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 31:                2,176 - rcv-pos: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 32:           20,500,992 - sub-pos: 2 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 33:           20,500,992 - rcv-hwm: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 34:           20,500,992 - rcv-pos: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520

The problematic session is 1287205435 (stream 1, node1 to node2). The other streams seem to progress. I kept it running for several minutes after the stall.
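
To make the stall easier to see in the numbers, here is a small sketch that parses the node2 counter lines pasted above and computes rcv-hwm minus rcv-pos per session and stream. It is only an illustration: the helper names (`parse_positions`, `receive_gaps`) are made up for this example, and it assumes the three integers after the counter name are registration id, session id and stream id, in that order.

```python
import re

# Counter lines copied from the node2 AeronStat output above.
NODE2_COUNTERS = """\
 29:                2,176 - sub-pos: 1 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 30:        1,627,656,896 - rcv-hwm: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 31:                2,176 - rcv-pos: 5 1287205435 1 aeron:udp?endpoint=ip-172-31-8-204:25520
 32:           20,500,992 - sub-pos: 2 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520 @0
 33:           20,500,992 - rcv-hwm: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
 34:           20,500,992 - rcv-pos: 6 1287205436 2 aeron:udp?endpoint=ip-172-31-8-204:25520
"""

# "<counter id>: <value> - <name>: <registration id> <session id> <stream id> <channel>"
# The meaning of the three integers is an assumption of this sketch.
LINE = re.compile(r"^\s*\d+:\s+([\d,]+) - ([a-z-]+): (\d+) (\d+) (\d+) ")


def parse_positions(text):
    """Return {(session_id, stream_id): {counter_name: value}}."""
    positions = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # channel status lines and blanks carry no position
        value = int(match.group(1).replace(",", ""))
        name = match.group(2)
        key = (int(match.group(4)), int(match.group(5)))
        positions.setdefault(key, {})[name] = value
    return positions


def receive_gaps(positions):
    """Bytes between the highest position seen (rcv-hwm) and rcv-pos."""
    return {
        key: counters["rcv-hwm"] - counters["rcv-pos"]
        for key, counters in positions.items()
        if "rcv-hwm" in counters and "rcv-pos" in counters
    }


if __name__ == "__main__":
    for (session_id, stream_id), gap in receive_gaps(parse_positions(NODE2_COUNTERS)).items():
        print(f"session {session_id} stream {stream_id}: rcv-hwm - rcv-pos = {gap:,}")
```

With the numbers above, session 1287205436 has a gap of zero, while session 1287205435 has rcv-hwm far ahead of rcv-pos and sub-pos, which both sit at 2,176.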

Stream 1 is our control stream and carries little traffic, only a few messages per second.

I have all the files if you need more information.
