Ensure only one daemon can run at a time #622

hidmic · 2021-04-14T22:11:34Z

Improvement over #620. Flakes in ros2cli's test_strategy.py such as this one were due to TOCTOU races between an is_daemon_running check and a socket.bind within the spawned daemon process. This patch does not resolve that race (not trivial, see discussion in #620), but ensures only one daemon can successfully bind to the same address.
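For illustration, a minimal sketch of why the bind itself can arbitrate between competing daemons: without SO_REUSEADDR, a second bind to an address that is already bound fails, so at most one process wins regardless of what an earlier check reported. `try_bind` is a hypothetical helper written for this example, not code from ros2cli.

```python
import errno
import socket


def try_bind(host, port):
    """Return a listening socket, or None if the address is already taken.

    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # No SO_REUSEADDR here: the kernel rejects a second bind atomically.
        sock.bind((host, port))
        sock.listen(1)
    except OSError as e:
        sock.close()
        if e.errno == errno.EADDRINUSE:
            return None
        raise
    return sock


first = try_bind('127.0.0.1', 0)      # port 0: let the OS pick a free port
port = first.getsockname()[1]
second = try_bind('127.0.0.1', port)  # same address: expected to lose the race
first.close()
```

The check-then-bind race still exists in this sketch; the point is only that the loser finds out at bind time instead of silently sharing the address.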

CI up to ros2cli:

Linux
Linux-aarch64
macOS
Windows

Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>

hidmic · 2021-04-14T22:13:50Z

For additional context, SO_REUSEADDR was used as a workaround for sockets left in TIME_WAIT state after being actively closed by the XMLRPC server. SO_LINGER ensures sockets are hard closed (i.e. RST packet instead of FIN packet).
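As a rough sketch of what a zero-timeout SO_LINGER looks like in Python (assuming the POSIX `struct linger` layout of two C ints; Windows uses two unsigned shorts instead, and this is not necessarily the exact code in the patch):

```python
import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# struct linger { int l_onoff; int l_linger; } on POSIX systems.
# l_onoff=1, l_linger=0: close() discards unsent data and resets the
# connection, so the socket does not pass through TIME_WAIT.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))

# Read the option back to confirm what the kernel stored.
raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.calcsize('ii'))
onoff, linger_seconds = struct.unpack('ii', raw)
sock.close()
```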

audrow

Looks good to me with green CI.

clalancette · 2021-04-15T12:30:06Z

Any thoughts on what is going on in macOS? That seems to be new with this patch.

hidmic · 2021-04-15T13:18:10Z

> Any thoughts on what is going on in macOS? That seems to be new with this patch.

Yeah, it looks like some nuance in OSX net stack. I'll investigate.

Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>

hidmic · 2021-04-15T14:38:15Z

Found the issue. See 03a2926.

CI up to ros2cli:

Linux
Linux-aarch64
macOS
Windows

hidmic · 2021-04-15T16:31:51Z

Alright, finally green. Going in.

ivanpauno · 2021-04-19T13:18:22Z

https://ci.ros2.org/view/nightly/job/nightly_linux-aarch64_debug/1577/testReport/junit/ros2param.ros2param.test/test_verb_load/test_verb_load/ 😕

ivanpauno · 2021-04-19T13:24:17Z

It's failing when calling connect() with EADDRNOTAVAIL, which according to the man pages means:

         (Internet domain sockets) The socket referred to by sockfd
         had not previously been bound to an address and, upon
         attempting to bind it to an ephemeral port, it was
         determined that all port numbers in the ephemeral port
         range are currently in use.  See the discussion of
         /proc/sys/net/ipv4/ip_local_port_range in ip(7).

That's surprising 😂 .
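A small diagnostic sketch for this kind of failure, assuming a Linux host: it reads the ephemeral port range the man page refers to and maps the errno of a failed connect() to a short description. Both function names are hypothetical and exist only for this example.

```python
import errno
import os


def ephemeral_port_range(path='/proc/sys/net/ipv4/ip_local_port_range'):
    """Return (low, high) from the Linux procfs entry, or None if unavailable."""
    try:
        with open(path) as f:
            low, high = map(int, f.read().split())
    except (OSError, ValueError):
        return None
    return low, high


def describe_connect_error(exc):
    """Map the errno of an OSError raised by connect() to a short hint."""
    if exc.errno == errno.EADDRNOTAVAIL:
        return 'no ephemeral port available for the client socket'
    if exc.errno == errno.ECONNREFUSED:
        return 'nothing is listening at the target address'
    return os.strerror(exc.errno)
```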

ivanpauno · 2021-04-19T13:29:24Z

FWIW, it only failed in aarch64 debug and repeated jobs.

hidmic · 2021-04-19T13:33:24Z

Argh... I'll take a look.

ivanpauno · 2021-04-19T13:45:17Z

I have just found the same failure in build.ros2.org amd64: https://build.ros2.org/view/Rci/job/Rci__nightly-extra-rmw-release_ubuntu_focal_amd64/294/testReport/junit/ros2param.ros2param.test/test_verb_list/test_verb_list/, so:

- It's not aarch64 only.
- We never released this change, so it seems to be unrelated.

sloretz · 2021-04-20T22:45:30Z

> I have just found the same failure in build.ros2.org amd64: https://build.ros2.org/view/Rci/job/Rci__nightly-extra-rmw-release_ubuntu_focal_amd64/294/testReport/junit/ros2param.ros2param.test/test_verb_list/test_verb_list/
>
> We never released this change, so it seems to be unrelated.

That job reports it used a4daa76, which does include this change.

  src/ros2/ros2cli:
    type: git
    url: https://github.com/ros2/ros2cli.git
    version: a4daa7672f287997d1345a44ebb9e0c3d0c490b6

I think this PR is the cause of that failure too. It has now happened on nightly_linux_release too. https://ci.ros2.org/job/nightly_linux_release/1891

ivanpauno · 2021-04-21T14:37:46Z

> That job reports it used a4daa76, which does include this change.

Sorry, I forgot that the CI jobs in build.ros2.org were also testing "master".

The original problem here is that running the test_strategy.py test in the nightly repeated jobs "sometimes" fails.

There have been a few attempts to fix flakiness in the ros2 daemon in the past. These include #620, #622, and #652. These all changed things in various ways, but the key PR was #652, which made spawning the daemon a reliable operation.

#622 changed the sockets to add SO_LINGER with a zero timeout. That improved, but did not totally solve, the situation. It also has its own downsides, as SO_LINGER doesn't gracefully terminate connections and instead just sends RST on the socket and terminates it.

To fix this for real requires 3 parts in this commit, though one of the parts is platform-dependent:

1. When the daemon is exiting cleanly, it should explicitly shut down the socket that it was using for the XMLRPC server. That will cleanly shut down the socket, and tell the kernel it can start the cleanup. On its own, this does not completely solve the problem, but it reduces the amount of time that things are hanging about waiting for the Python interpreter and/or the kernel to implicitly clean things up.
2. We should not specify SO_LINGER on the daemon sockets. As mentioned above, this is actually something of an anti-pattern and does not properly terminate connections with FIN (it just sends RST).
3. We should specify SO_REUSEADDR, but only on Unix. On Unix, SO_REUSEADDR essentially means "allow binding to an address/port that is in TCP TIME_WAIT (but not that is otherwise in use)". This is exactly the behavior we want. On Windows, SO_REUSEADDR causes undefined behavior, as it can cause a socket to bind even if there is something else bound already. Because of that, we want to set SO_REUSEADDR on Unix, but not Windows.

Finally, while testing here I had to add in one bugfix to make things reliable on Windows, which is to also catch ConnectionResetError. That arises because we can attempt to "connect" to a daemon that is in the process of shutting down. In that case, we should also consider the daemon not "connected".

Signed-off-by: Chris Lalancette <clalancette@gmail.com>
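The three parts described above, plus the ConnectionResetError handling, can be sketched roughly as follows. This is a simplified illustration using plain sockets, not the actual ros2cli XMLRPC server code; `make_listening_socket`, `close_cleanly`, and `is_reachable` are hypothetical names.

```python
import os
import socket


def make_listening_socket(host, port):
    """Bind a TCP listening socket, setting SO_REUSEADDR on Unix only."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if os.name == 'posix':
        # On Unix this permits rebinding an address that is in TIME_WAIT.
        # It is skipped on Windows, where the option has different semantics.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Note: no SO_LINGER is set, so connections close with a normal FIN.
    sock.bind((host, port))
    sock.listen(5)
    return sock


def close_cleanly(sock):
    """Explicitly shut the socket down before closing it."""
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        # Some platforms report ENOTCONN when shutting down a listening socket.
        pass
    sock.close()


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except (ConnectionRefusedError, ConnectionResetError, socket.timeout):
        # A reset can happen when the peer is in the middle of shutting down;
        # treat it the same as "not connected".
        return False
```

A caller would create the socket with `make_listening_socket('127.0.0.1', 0)`, probe it with `is_reachable`, and release it with `close_cleanly` on exit.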

hidmic added 2 commits April 14, 2021 18:56

Substitute SO_REUSEADDR by SO_LINGER.

742e20e

Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>

Wait for daemon shutdown before testing NodeStrategy

5c2f0b1

Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>

hidmic requested review from audrow and ivanpauno April 14, 2021 22:11

hidmic mentioned this pull request Apr 14, 2021

Wait for daemon shutdown before testing NodeStrategy #620

Closed

hidmic changed the title ~~Ensure only one daemon can runs at a time~~ Ensure only one daemon can run at a time Apr 14, 2021

audrow approved these changes Apr 14, 2021

Use SO_LINGER for accepted sockets too.

03a2926

Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>

ivanpauno approved these changes Apr 15, 2021

hidmic merged commit 05b0a5e into master Apr 15, 2021

delete-merged-branch bot deleted the hidmic/fix-daemon-spawning-race branch April 15, 2021 16:32

sloretz mentioned this pull request Apr 20, 2021

Flaky tests test_verb_load test_verb_list can't create a daemon OSError Errno 99 #630

Closed

hidmic mentioned this pull request Apr 23, 2021

[POC] Fix failing ros2param tests #632

Closed

clalancette mentioned this pull request Nov 18, 2024

Allow reuse address when firing up the ros2cli daemon. #947

Merged
