8000 Ensure stuck workers are killed by dfradette-stripe · Pull Request #72 · contribsys/einhorn · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Ensure stuck workers are killed #72

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

dfradette-stripe
Copy link
Contributor
@dfradette-stripe dfradette-stripe commented Feb 22, 2018

What's the problem?

There is a rare case where after an upgrade, a worker will become stuck after being issued a USR2. If the signal_timeout timer doesn't fire for some reason, the child will remain running forever with an old version.

What's the fix?

This new behaviour adds another stage to cull where it will periodically check if any previously upgraded (read: signaled with USR2) are running longer than signal-timeout and, if so, terminates them with KILL.

One of the broader changes here is setting the IO read/write timeout to the same value as signal-timeout so that the read doesn't block forever and the reap/replenish/cull gets hit periodically.

This should ensure that any old upgraded workers are short lived.

There is a rare case where after an upgrade, a worker will become
stuck after being issued a USR2. If the signal_timeout timer doesn't
fire for some reason, the child will remain running forever.

This changes the signal-timeout timer to process all signaled children
and use the `last_signaled_at` to determine if the child process should
be killed. This should reduce/eliminate old stuck workers in the long
run since each future signal timeout timer will attempt to kill anything
that has been signaled and is still running.
@dfradette-stripe
Copy link
Contributor Author

cc @evan-stripe @asf-stripe @nelhage to get some experienced eyes on this PR.

Copy link
Contributor
@asf-stripe asf-stripe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be careful what you wish for! (-:

I have put in a few comments about stuff that's unclear, especially around what the interactions between signal_timeout and resignal_timeout are. It does look like einhorn is trying to do what this PR is doing, but is not completely hooked up to the mains?


Einhorn.log_info("Child #{pid.inspect} was signaled #{(child[:last_signaled_at] - now).to_i}s ago. Sending SIGKILL as it is still active after #{Einhorn::State.signal_timeout}s timeout.", :upgrade)
begin
Process.kill('KILL', pid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm! I'm not sure what the difference is between this and the Child #{child.inspect} is still active after #{Einhorn::State.signal_timeout}s. Sending SIGKILL code block. Why would this get hit, and the other wouldn't? (I guess it has to do with the last_signaled_at portion, but I thought the code above should still SIGKILL things, when it really doesn't seem to?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, the change is subtle and the diff doesn't show it all that great.

The original behaviour in signal_all was optimistic and assume the timer would always work. That is, it would schedule a timer that would act only on the local children values. If that timer didn't work for one reason or another (as has been observed with "stuck workers"), those processes would never be looked at again as it assumes the timer handled them properly.

The new block is more pessimistic. It will not concern itself with what was just signaled, instead it will look at all processes that have been signaled and compare their last_signaled_at timestamp delta with the signal_timeout value. This way, the timer in signal_all will always try to ensure processes that should be shut down are shutdown.

bin/einhorn Outdated
@@ -310,6 +310,10 @@ if true # $0 == __FILE__
Einhorn::State.signal_timeout = Integer(t)
end

opts.on('-r', '--resignal-timeout', 'Only applies if --signal-timeout is set. Resignal all children instead of just the ones most recently terminated.') do
Einhorn::State.config[:resignal_timeout] = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between this and Einhorn::State.signal_timeout, used below?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they the same? Is Einhorn::State.signal_timeout some vestige of something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

--signal-timeout N is the original flag that will send a KILL to a process if it is still running N seconds after an upgrade was issued. resignal-timeout is a new flag that opts in to the new behaviour in this PR.


now = Time.now
expires_at = child[:last_signaled_at] + Einhorn::State.signal_timeout
next unless now >= expires_at
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this should set up a new timer that fires at expires_at? Or is the assumption that this is set up already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, interesting idea! Could definitely set up a new timer here. I was more going for "eventual consistency" where it would just clean up successfully on the next upgrade.

Copy link
Contributor
@asf-stripe asf-stripe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this a quick skim! Out for today (night for me), but I'll give it a more thorough read tomorrow!


"Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.keys}"
"Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.keys}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, lol, was that meant to be an Einhorn.log_debug? The call sites I see don't do anything with that string. 10000

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. Good catch, will add that.

def self.kill_old_upgraded_workers
now = Time.now
children = Einhorn::State.children.select do |_,c|
# Only interested in old, upgraded workers that have been signaled.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the term "upgraded" here might be wrong? Aren't they more like outdated?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

upgraded meaning a USR2 was sent to them previously. outdated would be a version difference (I think?)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, to me, "upgraded" means that they're a newer version than the stuff that was running before. If this is for the kicking task, I think in einhorn's view they're still an old version, right?

Process.kill('KILL', pid)
rescue Errno::ESRCH
Einhorn.log_debug("Attempted to SIGKILL child #{pid.inspect} but the process does not exist.")
mourn(pid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I belieeeeve we should do here some of what reap does, or defer to the later stages of the event loop, where reap will reap the exited child processes (otherwise their dead PID will show up in later runs of the event loop and cause mourn to get called again).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this a bit more, I am not sure mourn should be called here at all (The timer in signal_all doesn't do it either).

Copy link
Contributor
@asf-stripe asf-stripe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks very good - just two requests for changes (:

children = Einhorn::State.children.select do |_,c|
# Only interested in USR2 signaled workers
next unless c[:signaled] && c[:signaled].length > 0
next unless c[:signaled].include?('USR2')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should also filter out the processes that have received KILL already - since the signal can't be trapped, the child process should (barring kernel/machine trouble) always exit.

If einhorn should get a bug where it doesn't remove the child from the list of its children, this seems like a very easy way 6DB6 to set up a trap for later processes that get the same PID later on.

lib/einhorn.rb Outdated
# in the same timeframe, ensuring processes are culled
# on a regular basis.
if Einhorn::State.signal_timeout
Einhorn::Event.default_timeout = Einhorn::State.signal_timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahhh, this is great. Excellent to break from select for the signal_timeout. One suggestion though, take the minimum of any existing default_timeout & the signal_timeout?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(or make an accessor that reduces the default_timeout to a given value (-:)

@dfradette-stripe
Copy link
Contributor Author

@asf-stripe updated with the changes you requested!

Copy link
Contributor
@asf-stripe asf-stripe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

\o/

@dfradette-stripe dfradette-stripe merged commit 0767c90 into contribsys:master Jun 22, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0