Ensure stuck workers are killed #72

dfradette-stripe · 2018-02-22T15:35:49Z

What's the problem?

There is a rare case where after an upgrade, a worker will become stuck after being issued a USR2. If the signal_timeout timer doesn't fire for some reason, the child will remain running forever with an old version.

What's the fix?

This new behaviour adds another stage to cull where it will periodically check if any previously upgraded (read: signaled with USR2) are running longer than signal-timeout and, if so, terminates them with KILL.

One of the broader changes here is setting the IO read/write timeout to the same value as signal-timeout so that the read doesn't block forever and the reap/replenish/cull gets hit periodically.

This should ensure that any old upgraded workers are short lived.

There is a rare case where after an upgrade, a worker will become stuck after being issued a USR2. If the signal_timeout timer doesn't fire for some reason, the child will remain running forever. This changes the signal-timeout timer to process all signaled children and use the `last_signaled_at` to determine if the child process should be killed. This should reduce/eliminate old stuck workers in the long run since each future signal timeout timer will attempt to kill anything that has been signaled and is still running.

dfradette-stripe · 2018-04-17T13:49:42Z

cc @evan-stripe @asf-stripe @nelhage to get some experienced eyes on this PR.

asf-stripe

Be careful what you wish for! (-:

I have put in a few comments about stuff that's unclear, especially around what the interactions between signal_timeout and resignal_timeout are. It does look like einhorn is trying to do what this PR is doing, but is not completely hooked up to the mains?

asf-stripe · 2018-04-17T13:59:06Z

lib/einhorn/command.rb

+
+        Einhorn.log_info("Child #{pid.inspect} was signaled #{(child[:last_signaled_at] - now).to_i}s ago. Sending SIGKILL as it is still active after #{Einhorn::State.signal_timeout}s timeout.", :upgrade)
+        begin
+          Process.kill('KILL', pid)


Hm! I'm not sure what the difference is between this and the Child #{child.inspect} is still active after #{Einhorn::State.signal_timeout}s. Sending SIGKILL code block. Why would this get hit, and the other wouldn't? (I guess it has to do with the last_signaled_at portion, but I thought the code above should still SIGKILL things, when it really doesn't seem to?

No problem, the change is subtle and the diff doesn't show it all that great.

The original behaviour in signal_all was optimistic and assume the timer would always work. That is, it would schedule a timer that would act only on the local children values. If that timer didn't work for one reason or another (as has been observed with "stuck workers"), those processes would never be looked at again as it assumes the timer handled them properly.

The new block is more pessimistic. It will not concern itself with what was just signaled, instead it will look at all processes that have been signaled and compare their last_signaled_at timestamp delta with the signal_timeout value. This way, the timer in signal_all will always try to ensure processes that should be shut down are shutdown.

asf-stripe · 2018-04-17T14:02:04Z

bin/einhorn

@@ -310,6 +310,10 @@ if true # $0 == __FILE__
      Einhorn::State.signal_timeout = Integer(t)
    end

+    opts.on('-r', '--resignal-timeout', 'Only applies if --signal-timeout is set. Resignal all children instead of just the ones most recently terminated.') do
+      Einhorn::State.config[:resignal_timeout] = true


What's the difference between this and Einhorn::State.signal_timeout, used below?

Are they the same? Is Einhorn::State.signal_timeout some vestige of something?

--signal-timeout N is the original flag that will send a KILL to a process if it is still running N seconds after an upgrade was issued. resignal-timeout is a new flag that opts in to the new behaviour in this PR.

asf-stripe · 2018-04-17T14:03:14Z

lib/einhorn/command.rb

+
+        now = Time.now
+        expires_at = child[:last_signaled_at] + Einhorn::State.signal_timeout
+        next unless now >= expires_at


I guess this should set up a new timer that fires at expires_at? Or is the assumption that this is set up already?

Hm, interesting idea! Could definitely set up a new timer here. I was more going for "eventual consistency" where it would just clean up successfully on the next upgrade.

asf-stripe

I gave this a quick skim! Out for today (night for me), but I'll give it a more thorough read tomorrow!

asf-stripe · 2018-04-18T17:39:30Z

lib/einhorn/command.rb


-      "Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.keys}"
+        "Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.keys}"


Wait, lol, was that meant to be an Einhorn.log_debug? The call sites I see don't do anything with that string. 10000

Probably. Good catch, will add that.

asf-stripe · 2018-04-18T17:47:05Z

lib/einhorn/command.rb

+    def self.kill_old_upgraded_workers
+      now = Time.now
+      children = Einhorn::State.children.select do |_,c|
+        # Only interested in old, upgraded workers that have been signaled.


I think the term "upgraded" here might be wrong? Aren't they more like outdated?

upgraded meaning a USR2 was sent to them previously. outdated would be a version difference (I think?)

Yeah, to me, "upgraded" means that they're a newer version than the stuff that was running before. If this is for the kicking task, I think in einhorn's view they're still an old version, right?

asf-stripe · 2018-04-18T17:50:11Z

lib/einhorn/command.rb

+          Process.kill('KILL', pid)
+        rescue Errno::ESRCH
+          Einhorn.log_debug("Attempted to SIGKILL child #{pid.inspect} but the process does not exist.")
+          mourn(pid)


I belieeeeve we should do here some of what reap does, or defer to the later stages of the event loop, where reap will reap the exited child processes (otherwise their dead PID will show up in later runs of the event loop and cause mourn to get called again).

Looking at this a bit more, I am not sure mourn should be called here at all (The timer in signal_all doesn't do it either).

asf-stripe

I think this looks very good - just two requests for changes (:

asf-stripe · 2018-04-23T17:15:30Z

lib/einhorn/command.rb

+      children = Einhorn::State.children.select do |_,c|
+        # Only interested in USR2 signaled workers
+        next unless c[:signaled] && c[:signaled].length > 0
+        next unless c[:signaled].include?('USR2')


I think this should also filter out the processes that have received KILL already - since the signal can't be trapped, the child process should (barring kernel/machine trouble) always exit.

If einhorn should get a bug where it doesn't remove the child from the list of its children, this seems like a very easy way 6DB6 to set up a trap for later processes that get the same PID later on.

asf-stripe · 2018-04-23T17:17:26Z

lib/einhorn.rb

+    # in the same timeframe, ensuring processes are culled
+    # on a regular basis.
+    if Einhorn::State.signal_timeout
+      Einhorn::Event.default_timeout = Einhorn::State.signal_timeout


Ahhh, this is great. Excellent to break from select for the signal_timeout. One suggestion though, take the minimum of any existing default_timeout & the signal_timeout?

(or make an accessor that reduces the default_timeout to a given value (-:)

dfradette-stripe · 2018-04-24T15:12:52Z

@asf-stripe updated with the changes you requested!

asf-stripe

\o/

dfradette-stripe force-pushed the resignal-stuck-workers branch from 26781a0 to fb8557e Compare April 16, 2018 19:33

asf-stripe suggested changes Apr 17, 2018

View reviewed changes

dfradette-stripe force-pushed the resignal-stuck-workers branch from fb8557e to f2831f8 Compare April 18, 2018 17:29

asf-stripe reviewed Apr 18, 2018

View reviewed changes

dfradette-stripe force-pushed the resignal-stuck-workers branch from f2831f8 to bd841b3 Compare April 20, 2018 17:59

asf-stripe suggested changes Apr 23, 2018

View reviewed changes

Refactored KILLs to also occur during culling

a061913

dfradette-stripe force-pushed the resignal-stuck-workers branch from bd841b3 to a061913 Compare April 24, 2018 15:01

asf-stripe approved these changes Apr 24, 2018

View reviewed changes

dfradette-stripe merged commit 0767c90 into contribsys:master Jun 22, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensure stuck workers are killed #72

Ensure stuck workers are killed #72


		"Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.keys}"
		"Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.keys}"

Ensure stuck workers are killed #72

Ensure stuck workers are killed #72

Conversation

What's the problem?

What's the fix?

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment