-
Notifications
You must be signed in to change notification settings - Fork 46
Ensure stuck workers are killed #72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure stuck workers are killed #72
Conversation
There is a rare case where after an upgrade, a worker will become stuck after being issued a USR2. If the signal_timeout timer doesn't fire for some reason, the child will remain running forever. This changes the signal-timeout timer to process all signaled children and use the `last_signaled_at` to determine if the child process should be killed. This should reduce/eliminate old stuck workers in the long run since each future signal timeout timer will attempt to kill anything that has been signaled and is still running.
26781a0
to
fb8557e
Compare
cc @evan-stripe @asf-stripe @nelhage to get some experienced eyes on this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be careful what you wish for! (-:
I have put in a few comments about stuff that's unclear, especially around what the interactions between signal_timeout
and resignal_timeout
are. It does look like einhorn is trying to do what this PR is doing, but is not completely hooked up to the mains?
lib/einhorn/command.rb
Outdated
|
||
Einhorn.log_info("Child #{pid.inspect} was signaled #{(child[:last_signaled_at] - now).to_i}s ago. Sending SIGKILL as it is still active after #{Einhorn::State.signal_timeout}s timeout.", :upgrade) | ||
begin | ||
Process.kill('KILL', pid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm! I'm not sure what the difference is between this and the Child #{child.inspect} is still active after #{Einhorn::State.signal_timeout}s. Sending SIGKILL
code block. Why would this get hit, and the other wouldn't? (I guess it has to do with the last_signaled_at
portion, but I thought the code above should still SIGKILL things, when it really doesn't seem to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem, the change is subtle and the diff doesn't show it all that great.
The original behaviour in signal_all
was optimistic and assume the timer would always work. That is, it would schedule a timer that would act only on the local children
values. If that timer didn't work for one reason or another (as has been observed with "stuck workers"), those processes would never be looked at again as it assumes the timer handled them properly.
The new block is more pessimistic. It will not concern itself with what was just signaled, instead it will look at all processes that have been signaled and compare their last_signaled_at
timestamp delta with the signal_timeout
value. This way, the timer in signal_all
will always try to ensure processes that should be shut down are shutdown.
bin/einhorn
Outdated
@@ -310,6 +310,10 @@ if true # $0 == __FILE__ | |||
Einhorn::State.signal_timeout = Integer(t) | |||
end | |||
|
|||
opts.on('-r', '--resignal-timeout', 'Only applies if --signal-timeout is set. Resignal all children instead of just the ones most recently terminated.') do | |||
Einhorn::State.config[:resignal_timeout] = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between this and Einhorn::State.signal_timeout
, used below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are they the same? Is Einhorn::State.signal_timeout
some vestige of something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
--signal-timeout N
is the original flag that will send a KILL
to a process if it is still running N seconds after an upgrade was issued. resignal-timeout
is a new flag that opts in to the new behaviour in this PR.
lib/einhorn/command.rb
Outdated
|
||
now = Time.now | ||
expires_at = child[:last_signaled_at] + Einhorn::State.signal_timeout | ||
next unless now >= expires_at |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this should set up a new timer that fires at expires_at
? Or is the assumption that this is set up already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, interesting idea! Could definitely set up a new timer here. I was more going for "eventual consistency" where it would just clean up successfully on the next upgrade.
fb8557e
to
f2831f8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I gave this a quick skim! Out for today (night for me), but I'll give it a more thorough read tomorrow!
lib/einhorn/command.rb
Outdated
|
||
"Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.keys}" | ||
"Successfully sent #{signal}s to #{signaled.length} processes: #{signaled.keys}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, lol, was that meant to be an Einhorn.log_debug
? The call sites I see don't do anything with that string.
10000
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably. Good catch, will add that.
lib/einhorn/command.rb
Outdated
def self.kill_old_upgraded_workers | ||
now = Time.now | ||
children = Einhorn::State.children.select do |_,c| | ||
# Only interested in old, upgraded workers that have been signaled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the term "upgraded" here might be wrong? Aren't they more like outdated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
upgraded meaning a USR2
was sent to them previously. outdated would be a version difference (I think?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, to me, "upgraded" means that they're a newer version than the stuff that was running before. If this is for the kicking task, I think in einhorn's view they're still an old version, right?
lib/einhorn/command.rb
Outdated
Process.kill('KILL', pid) | ||
rescue Errno::ESRCH | ||
Einhorn.log_debug("Attempted to SIGKILL child #{pid.inspect} but the process does not exist.") | ||
mourn(pid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I belieeeeve we should do here some of what reap
does, or defer to the later stages of the event loop, where reap
will reap the exited child processes (otherwise their dead PID will show up in later runs of the event loop and cause mourn
to get called again).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this a bit more, I am not sure mourn
should be called here at all (The timer in signal_all
doesn't do it either).
f2831f8
to
bd841b3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks very good - just two requests for changes (:
children = Einhorn::State.children.select do |_,c| | ||
# Only interested in USR2 signaled workers | ||
next unless c[:signaled] && c[:signaled].length > 0 | ||
next unless c[:signaled].include?('USR2') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should also filter out the processes that have received KILL
already - since the signal can't be trapped, the child process should (barring kernel/machine trouble) always exit.
If einhorn should get a bug where it doesn't remove the child from the list of its children, this seems like a very easy way 6DB6 to set up a trap for later processes that get the same PID later on.
lib/einhorn.rb
Outdated
# in the same timeframe, ensuring processes are culled | ||
# on a regular basis. | ||
if Einhorn::State.signal_timeout | ||
Einhorn::Event.default_timeout = Einhorn::State.signal_timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahhh, this is great. Excellent to break from select
for the signal_timeout
. One suggestion though, take the minimum of any existing default_timeout & the signal_timeout?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(or make an accessor that reduces the default_timeout to a given value (-:)
bd841b3
to
a061913
Compare
@asf-stripe updated with the changes you requested! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\o/
What's the problem?
There is a rare case where after an upgrade, a worker will become stuck after being issued a USR2. If the signal_timeout timer doesn't fire for some reason, the child will remain running forever with an old version.
What's the fix?
This new behaviour adds another stage to
cull
where it will periodically check if any previously upgraded (read: signaled withUSR2
) are running longer thansignal-timeout
and, if so, terminates them withKILL
.One of the broader changes here is setting the IO read/write timeout to the same value as
signal-timeout
so that the read doesn't block forever and thereap/replenish/cull
gets hit periodically.This should ensure that any old upgraded workers are short lived.