Ensure only one daemon can run at a time #622
Conversation
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
For additional context,
Looks good to me with green CI.
Any thoughts on what is going on in macOS? That seems to be new with this patch.
Yeah, it looks like some nuance in the macOS network stack. I'll investigate.
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
Found the issue. See 03a2926. CI up to
Alright, finally green. Going in.
It's failing when calling
That's surprising 😂 .
Argh... I'll take a look.
I have just found the same failure in
That job reports it used a4daa76, which does include this change:

```yaml
src/ros2/ros2cli:
  type: git
  url: https://github.com/ros2/ros2cli.git
  version: a4daa7672f287997d1345a44ebb9e0c3d0c490b6
```

I think this PR is the cause of that failure too. It has now happened on
Sorry, I forgot that the CI jobs in
The original problem here is that running the test_strategy.py test in the nightly repeated jobs "sometimes" fails.

There have been a few attempts to fix flakiness in the ros2 daemon in the past, including #620, #622, and #652. These all changed things in various ways, but the key PR was #652, which made spawning the daemon a reliable operation. #622 changed the sockets to add SO_LINGER with a zero timeout. That improved, but did not totally solve, the situation. It also has its own downsides: SO_LINGER doesn't gracefully terminate connections, and instead just sends RST on the socket and terminates it.

To fix this for real requires 3 parts in this commit, though one of the parts is platform-dependent:

1. When the daemon is exiting cleanly, it should explicitly shut down the socket that it was using for the XMLRPC server. That cleanly shuts down the socket and tells the kernel it can start the cleanup. On its own, this does not completely solve the problem, but it reduces the amount of time that things hang around waiting for the Python interpreter and/or the kernel to implicitly clean things up.
2. We should not specify SO_LINGER on the daemon sockets. As mentioned above, this is actually something of an anti-pattern and does not properly terminate connections with FIN (it just sends RST).
3. We should specify SO_REUSEADDR, but only on Unix. On Unix, SO_REUSEADDR essentially means "allow binding to an address/port that is in TCP TIME_WAIT (but not one that is otherwise in use)". This is exactly the behavior we want. On Windows, SO_REUSEADDR causes undefined behavior, as it can cause a socket to bind even if something else is already bound to that address. Because of that, we want to set SO_REUSEADDR on Unix, but not Windows.

Finally, while testing here I had to add one bugfix to make things reliable on Windows, which is to also catch ConnectionResetError. That arises because we can attempt to "connect" to a daemon that is in the process of shutting down. In that case, we should also consider the daemon not "connected".

Signed-off-by: Chris Lalancette <clalancette@gmail.com>
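The three socket-handling parts above can be sketched roughly as follows. This is an illustrative sketch, not the actual ros2cli implementation; the function names and the default address are assumptions made for the example:

```python
import socket
import sys


def make_daemon_server_socket(addr=('127.0.0.1', 11511)):
    """Create a listening socket following the three fixes described above.

    The address is an assumption for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Part 3: SO_REUSEADDR only on Unix. There it merely allows rebinding
    # an address stuck in TCP TIME_WAIT; on Windows it could let us bind
    # over a live listener, so we leave it off there.
    if sys.platform != 'win32':
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Part 2: deliberately *no* SO_LINGER here. A zero-timeout linger would
    # abort connections with RST instead of a graceful FIN handshake.
    sock.bind(addr)
    sock.listen()
    return sock


def close_daemon_server_socket(sock):
    """Part 1: explicitly shut the socket down on clean daemon exit.

    This tells the kernel to begin connection teardown immediately instead
    of waiting for interpreter garbage collection to do it implicitly.
    """
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        pass  # nothing connected, or already shut down
    sock.close()
```

A client talking to such a daemon would additionally catch `ConnectionResetError` (alongside `ConnectionRefusedError`) when connecting, and treat it as "daemon not connected", matching the Windows bugfix described above.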
Improvement over #620. Flakes in ros2cli's test_strategy.py, such as this one, were due to TOCTOU races between an `is_daemon_running` check and a `socket.bind` within the spawned daemon process. This patch does not resolve that race (not trivial, see discussion in #620), but ensures only one daemon can successfully bind to the same address.

CI up to ros2cli: