Ensure only one daemon can run at a time #622

Merged
hidmic merged 3 commits into master from hidmic/fix-daemon-spawning-race
Apr 15, 2021


hidmic
Contributor
@hidmic hidmic commented Apr 14, 2021

Improvement over #620. Flakes in ros2cli's test_strategy.py such as this one were due to TOCTOU races between an is_daemon_running check and a socket.bind within the spawned daemon process. This patch does not resolve that race (not trivial, see discussion in #620), but ensures only one daemon can successfully bind to the same address.
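As a rough illustration of the "only one daemon can successfully bind" guarantee, here is a minimal sketch, not the actual ros2cli code. The helper name `try_bind_exclusively` is hypothetical. It assumes that leaving `SO_REUSEADDR` unset is what makes a second bind to an address that is already in use fail with `EADDRINUSE`.

```python
import errno
import socket


def try_bind_exclusively(host, port):
    """Hypothetical helper: return a listening socket, or None if the address is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # SO_REUSEADDR is deliberately not set, so the kernel arbitrates:
        # whoever binds first wins, regardless of any earlier "is it running?" check.
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError as e:
        sock.close()
        if e.errno == errno.EADDRINUSE:
            return None
        raise


# Port 0 asks the OS for any free port; the second attempt reuses that port.
first = try_bind_exclusively('127.0.0.1', 0)
port = first.getsockname()[1]
second = try_bind_exclusively('127.0.0.1', port)
print('second bind succeeded:', second is not None)
first.close()
```

The check-then-spawn race is still there in this sketch. The losing process simply fails at `bind` instead of ending up as a second daemon.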

CI up to ros2cli:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

hidmic added 2 commits April 14, 2021 18:56
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
@hidmic hidmic requested review from audrow and ivanpauno April 14, 2021 22:11
@hidmic
Contributor Author
hidmic commented Apr 14, 2021

For additional context, SO_REUSEADDR was used as a workaround for sockets left in TIME_WAIT state after being actively closed by the XMLRPC server. SO_LINGER ensures sockets are hard closed (i.e. RST packet instead of FIN packet).
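For readers unfamiliar with the option, a minimal sketch of enabling `SO_LINGER` with a zero timeout from Python follows. The helper name `enable_hard_close` is hypothetical, and the struct layouts (two C ints on POSIX, two unsigned shorts on Windows) are assumptions about the platform's `struct linger`, not something taken from this patch.

```python
import socket
import struct
import sys


def enable_hard_close(sock):
    """Hypothetical helper: set SO_LINGER on, with a zero timeout."""
    # Assumed layout of struct linger: (l_onoff, l_linger).
    fmt = 'HH' if sys.platform == 'win32' else 'ii'
    # l_onoff=1, l_linger=0: close() discards unsent data and resets the
    # connection (RST) instead of the usual FIN handshake and TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack(fmt, 1, 0))
    return fmt


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
fmt = enable_hard_close(sock)
raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.calcsize(fmt))
l_onoff, l_linger = struct.unpack(fmt, raw)
print('linger enabled:', l_onoff != 0, 'timeout:', l_linger)
sock.close()
```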

@hidmic hidmic changed the title from "Ensure only one daemon can runs at a time" to "Ensure only one daemon can run at a time" Apr 14, 2021
Member
@audrow audrow left a comment

Looks good to me with green CI.

@clalancette
Contributor

Any thoughts on what is going on in macOS? That seems to be new with this patch.

@hidmic
Contributor Author
hidmic commented Apr 15, 2021

Any thoughts on what is going on in macOS? That seems to be new with this patch.

Yeah, it looks like some nuance in OSX net stack. I'll investigate.

Signed-off-by: Michel Hidalgo <michel@ekumenlabs.com>
@hidmic
Contributor Author
hidmic commented Apr 15, 2021

Found the issue. See 03a2926.

CI up to ros2cli:

  • Linux Build Status
  • Linux-aarch64 Build Status
  • macOS Build Status
  • Windows Build Status

@hidmic
Contributor Author
hidmic commented Apr 15, 2021

Alright, finally green. Going in.

@hidmic hidmic merged commit 05b0a5e into master Apr 15, 2021
@delete-merged-branch delete-merged-branch bot deleted the hidmic/fix-daemon-spawning-race branch April 15, 2021 16:32
@ivanpauno
Member

It's failing when calling connect() with EADDRNOTAVAIL, which according to the man pages means:

         (Internet domain sockets) The socket referred to by sockfd
         had not previously been bound to an address and, upon
         attempting to bind it to an ephemeral port, it was
         determined that all port numbers in the ephemeral port
         range are currently in use.  See the discussion of
         /proc/sys/net/ipv4/ip_local_port_range in ip(7).

That's surprising 😂 .

@ivanpauno
Member

FWIW, it only failed in aarch64 debug and repeated jobs.

@hidmic
Contributor Author
hidmic commented Apr 19, 2021

Argh... I'll take a look.

@ivanpauno
Member

I have just found the same failure in build.ros2.org amd64: https://build.ros2.org/view/Rci/job/Rci__nightly-extra-rmw-release_ubuntu_focal_amd64/294/testReport/junit/ros2param.ros2param.test/test_verb_list/test_verb_list/, so:

  • It's not aarch64 only.
  • We never released this change, so it seems to be unrelated.

@sloretz
Contributor
sloretz commented Apr 20, 2021

I have just found the same failure in build.ros2.org amd64: https://build.ros2.org/view/Rci/job/Rci__nightly-extra-rmw-release_ubuntu_focal_amd64/294/testReport/junit/ros2param.ros2param.test/test_verb_list/test_verb_list/

We never released this change, so it seems to be unrelated.

That job reports it used a4daa76, which does include this change.

  src/ros2/ros2cli:
    type: git
    url: https://github.com/ros2/ros2cli.git
    version: a4daa7672f287997d1345a44ebb9e0c3d0c490b6

I think this PR is the cause of that failure too. It has now happened on nightly_linux_release too. https://ci.ros2.org/job/nightly_linux_release/1891

@ivanpauno
Member

That job reports it used a4daa76, which does include this change.

Sorry, I forgot that the CI jobs in build.ros2.org were also testing "master".

clalancette added a commit that referenced this pull request Nov 20, 2024
The original problem here is that running the test_strategy.py test
in the nightly repeated jobs "sometimes" fails.

There have been a few attempts to fix flakiness in the ros2 daemon in the past.
These include #620, #622, and #652.
These all changed things in various ways, but the key PR was #652, which made
spawning the daemon a reliable operation.

#622 changed the daemon sockets to set SO_LINGER with a zero timeout.
That improved the situation, but did not totally solve it. It also has its own downsides, as
SO_LINGER with a zero timeout doesn't gracefully terminate connections and instead just sends RST on the
socket and terminates it.

To fix this for real requires 3 parts in this commit, though one of the parts is platform-dependent:

1.  When the daemon is exiting cleanly, it should explicitly shut down the socket that it was using
for the XMLRPC server. That will cleanly shut down the socket, and tell the kernel it can start the
cleanup. On its own, this does not completely solve the problem, but it reduces the amount of time
that things are hanging about waiting for the Python interpreter and/or the kernel to implicitly
clean things up.
2.  We should not specify SO_LINGER on the daemon sockets. As mentioned above, this is actually
something of an anti-pattern and does not properly terminate connections with FIN (it just sends RST).
3.  We should specify SO_REUSEADDR, but only on Unix. On Unix, SO_REUSEADDR essentially means
"allow binding to an address/port that is in TCP TIME_WAIT (but not to one that is otherwise in use)".
This is exactly the behavior we want. On Windows, SO_REUSEADDR causes undefined behavior, as
it can cause a socket to bind even if there is something else bound already. Because of that, we want
to set SO_REUSEADDR on Unix, but not Windows.
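
The three parts above can be sketched roughly as follows. This is an illustrative sketch on a plain TCP socket, not the code from the commit. The function names `make_server_socket` and `close_server_socket` are hypothetical, and the `sys.platform` check is one assumed way of telling Unix from Windows.

```python
import socket
import sys


def make_server_socket(host, port):
    """Hypothetical helper: create a listening socket for a daemon-like server."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    if sys.platform != 'win32':
        # Part 3: on Unix, allow rebinding to an address still in TIME_WAIT.
        # Skipped on Windows, where the option has different semantics.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Part 2: SO_LINGER is intentionally left at its default.
    sock.bind((host, port))
    sock.listen(5)
    return sock


def close_server_socket(sock):
    """Hypothetical helper: part 1, shut the socket down explicitly before closing."""
    try:
        sock.shutdown(socket.SHUT_RDWR)
    except OSError:
        # Some platforms reject shutdown() on a socket that is not connected.
        pass
    sock.close()


server = make_server_socket('127.0.0.1', 0)
reuse = server.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
close_server_socket(server)
print('socket closed:', server.fileno() == -1)
```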

Finally, while testing here I had to add in one bugfix to make things reliable on Windows, which is to
also catch ConnectionResetError. That arises because we can attempt to "connect" to a daemon that
is in the process of shutting down. In that case, we should also consider the daemon not "connected".

Signed-off-by: Chris Lalancette <clalancette@gmail.com>
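
The ConnectionResetError handling described in the commit message above might look something like the sketch below. It probes with a raw TCP connection rather than the XMLRPC client the daemon uses, and the function name `is_daemon_reachable` and its `timeout` parameter are hypothetical.

```python
import socket


def is_daemon_reachable(host, port, timeout=1.0):
    """Hypothetical probe: True if something accepts a TCP connection at host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except (ConnectionRefusedError, ConnectionResetError, socket.timeout):
        # A reset can occur when the peer is in the middle of shutting down;
        # treat it the same as "not connected".
        return False


# Usage: a listener on an OS-chosen port is reported as reachable.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(('127.0.0.1', 0))
listener.listen(1)
print(is_daemon_reachable('127.0.0.1', listener.getsockname()[1]))
listener.close()
```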