Description
For context: as part of our Aeron Cluster monitoring, we compare snapshots taken at the same log position. This lets us detect state divergence caused by bugs that introduce non-deterministic logic.
From time to time we have detected the consensus module producing different snapshots on different nodes. The difference is in the `nextSessionId` field. After some investigation I found that `ConsensusModuleAgent#nextSessionId` is updated on the leader at the moment the "session open" message is added to the log, while on the followers it is updated only when they reach the "session open" message in the log.
Consider the following scenario:
- A snapshot command is issued
- Leader node adds the snapshot message to the log
- A new client connects
- Leader node increments `nextSessionId` and adds the "session open" message to the log
- Nodes reach the snapshot message and take a snapshot (at this point leader and followers have different `nextSessionId`)
- Followers reach the "session open" message in the log and increment `nextSessionId` (now all nodes have the same `nextSessionId`)
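
To make the ordering concrete, here is a minimal toy model of the scenario above. It is not Aeron code: `SnapshotDivergenceSketch`, `Node`, `EntryType`, and the starting value of 1 are all hypothetical names and assumptions made for illustration. The only behaviour it takes from the description is that the leader increments `nextSessionId` when it appends "session open", while a follower increments it when it processes that entry.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: none of these types exist in Aeron.
final class SnapshotDivergenceSketch
{
    enum EntryType { SNAPSHOT, SESSION_OPEN }

    static final class Node
    {
        long nextSessionId = 1;          // arbitrary starting value (assumption)
        long snapshotNextSessionId = -1; // value captured when the snapshot is taken

        // Process the log in order. The leader has already incremented
        // nextSessionId at append time, so it does not increment again here.
        void process(final List<EntryType> log, final boolean isLeader)
        {
            for (final EntryType entry : log)
            {
                if (entry == EntryType.SNAPSHOT)
                {
                    snapshotNextSessionId = nextSessionId;
                }
                else if (entry == EntryType.SESSION_OPEN && !isLeader)
                {
                    nextSessionId++;
                }
            }
        }
    }

    // Returns { leader snapshot value, follower snapshot value,
    //           leader final value, follower final value }.
    static long[] run()
    {
        final List<EntryType> log = new ArrayList<>();
        final Node leader = new Node();
        final Node follower = new Node();

        log.add(EntryType.SNAPSHOT);     // leader adds the snapshot message

        leader.nextSessionId++;          // new client: leader increments immediately...
        log.add(EntryType.SESSION_OPEN); // ...and adds "session open" to the log

        leader.process(log, true);
        follower.process(log, false);

        return new long[]
        {
            leader.snapshotNextSessionId, follower.snapshotNextSessionId,
            leader.nextSessionId, follower.nextSessionId
        };
    }

    public static void main(final String[] args)
    {
        final long[] r = run();
        System.out.println("snapshot: leader=" + r[0] + " follower=" + r[1]);
        System.out.println("final:    leader=" + r[2] + " follower=" + r[3]);
    }
}
```

In this model the leader's snapshot records `nextSessionId` as 2 while the follower's records 1, even though both nodes end up at 2 once the follower has processed "session open".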
Is it expected that nodes in the cluster may have different consensus module snapshots? Or should the leader write the same `nextSessionId` value as a follower would?