ZooKeeper server set namer `io.l5d.serversets` appears to leak ZooKeeper watches · Issue #2460 · linkerd/linkerd
Issue Type:
Bug report / question
Given that linkerd 1.x is in maintenance mode, I'm not sure how likely a bug fix is. At minimum, this issue can serve to confirm the behavior and to help anyone else who runs into it on linkerd 1.x.
What happened:
Linkerd can lose its connection to ZooKeeper, either through a normal rolling restart/upgrade of the ZooKeeper cluster or through spurious connectivity loss. When this happens, linkerd enters a tight loop, logging this message repeatedly:
Reacquiring watch on com.twitter.finagle.serverset2.client.SessionState$SyncConnected$@742bd12c. Session: 703ca837df3e0a2
During this loop, linkerd can consume >99% of the host's CPU.
The longer linkerd has been running beforehand, the longer the loop goes on (thousands of messages). Linkerd then starts logging:
log queue overflow - record dropped
Ultimately the linkerd process OOMs:
VM error: Java heap space
java.lang.OutOfMemoryError: Java heap space
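For anyone else hitting this, a minimal detection sketch: count "Reacquiring watch" lines appended to the linkerd log over a short window, so the storm can be caught before the heap fills. The log path and threshold below are placeholders for illustration, not linkerd defaults.

# Watch newly appended linkerd log lines for WINDOW_SECONDS and count
# occurrences of the "Reacquiring watch" message.
import time

LOG_PATH = "/var/log/linkerd/linkerd.log"  # placeholder path
WINDOW_SECONDS = 60
THRESHOLD = 100  # healthy instances should be nowhere near this

with open(LOG_PATH) as log:
    log.seek(0, 2)  # jump to end of file; only watch newly appended lines
    deadline = time.time() + WINDOW_SECONDS
    hits = 0
    while time.time() < deadline:
        line = log.readline()
        if not line:
            time.sleep(0.5)  # nothing new yet; poll again shortly
            continue
        if "Reacquiring watch" in line:
            hits += 1

if hits >= THRESHOLD:
    print(f"'Reacquiring watch' logged {hits} times in {WINDOW_SECONDS}s")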
As an experiment, I ran a script to restart all instances of linkerd and let them come back up as normal. The total watch count on the ZooKeeper cluster dropped significantly.
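For reference, watch counts like these can be sampled with ZooKeeper's four-letter-word commands. A minimal sketch using wchs (on ZooKeeper 3.5+ the command must be enabled via 4lw.commands.whitelist); hostnames are placeholders:

# Sum total watch counts across a ZooKeeper ensemble using the "wchs"
# four-letter-word command.
import socket

ZK_HOSTS = ["zk1.example.com", "zk2.example.com", "zk3.example.com"]
ZK_PORT = 2181

def wchs(host, port=ZK_PORT):
    # Send the command and read until the server closes the connection.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(b"wchs")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode()

total = 0
for host in ZK_HOSTS:
    for line in wchs(host).splitlines():
        # wchs replies end with a line like "Total watches:3020".
        if line.startswith("Total watches:"):
            count = int(line.split(":", 1)[1])
            print(f"{host}: {count} watches")
            total += count
print(f"ensemble total: {total}")

The zk_watch_count field of the mntr command reports the same per-server figure, if that command is easier to enable.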
What you expected to happen:
Linkerd would recover gracefully from a spurious ZooKeeper disconnect or rolling restart of ZooKeeper nodes.
Watch count on the ZooKeeper cluster would be stable. It wouldn't increase linearly with linkerd uptime, nor suddenly decrease when linkerd is restarted.
How to reproduce it (as minimally and precisely as possible):
Set up linkerd using the io.l5d.serversets namer and allow it to run for a long time. Then run a rolling restart of the ZooKeeper cluster, or a rolling restart of the linkerd processes.
Anything else we need to know?:
Linkerd is deployed one-per-host (daemon) on a cluster. So in practice, each instance of linkerd needs to be up all the time to enable communication between services.
In my case, the uptime of the linkerd processes was well over 200 days. A workaround could be to cap linkerd's uptime by restarting the processes periodically, before thousands of ZooKeeper watches accumulate. This would make the system resilient to blips in ZooKeeper, but restarting linkerd in this deployment model (one-per-host) incurs errors in service communication (or, at best, latency spikes, if clients have well-tuned retry logic).
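For anyone trying that workaround, a staggered restart keeps only one host's proxy down at a time. The host list, service name, and use of ssh/systemd below are assumptions about the deployment, not anything linkerd-specific:

# Staggered restart sketch: restart the per-host linkerd daemons one at a
# time, pausing so each instance can rebuild its ZooKeeper watches before
# the next host goes down. Hosts and service name are placeholders.
import subprocess
import time

LINKERD_HOSTS = ["app1.example.com", "app2.example.com"]  # placeholder hosts
PAUSE_SECONDS = 60  # give each instance time to re-register before moving on

for host in LINKERD_HOSTS:
    subprocess.run(
        ["ssh", host, "sudo", "systemctl", "restart", "linkerd"],
        check=True,  # stop the rollout if a restart fails
    )
    time.sleep(PAUSE_SECONDS)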
Environment:
I've seen this on linkerd versions 1.7.4 and 1.7.5.
Linkerd config snippet:
namers:
- kind: io.l5d.serversets
  zkAddrs:
  {%- for zk_addr in zks %}
  - host: {{ zk_addr }}
    port: {{ zk_port }}
  {%- endfor %}
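For concreteness, with two ZooKeeper hosts that template renders to plain linkerd YAML along these lines (hostnames and port are placeholders):

namers:
- kind: io.l5d.serversets
  zkAddrs:
  - host: zk1.example.com
    port: 2181
  - host: zk2.example.com
    port: 2181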