-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix deadlock when logging happens during a database reset #7798
Fix deadlock when logging happens during a database reset #7798
Conversation
A deadlock could happen if a log relay thread was trying to serialize logs into the database when a database reset was being attempted. Since the log relay thread is started by the same thread that executes the database reset, the scheduler thread, ensure that the log relay thread has finished its work before doing a database reset on the next scheduler loop. Also ensure that when the scheduler is finishing its work, to permit osquery to exit, we wait on the log relaying thread if it's still running to prevent race conditions and possible crashes on shutdown. Finally remove the relayStatusLog call from the watcher process, it's a no-op since there's no logger plugin active.
1af2d93
to
e411588
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved.
in a path called by Google Log, and failing to properly wait | ||
for the thread to finish will either cause a race condition or a deadlock | ||
*/ | ||
kOptBufferedLogSinkSender->wait(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is is possible for this to block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent here is to block, to ensure the ordering of the threads; if you mea 8000 n to ask "can this block indefinitely", if there's a bug, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ha, yes indefinitely was implied. In your estimate, how likely is a bug there? I imagine a kill-9 will still kill off osquery, and someone will report it and we can deal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, that's difficult to estimate, but that shouldn't stop us, meaning, the synchronization is needed.
And yes kill -9 will kill osquery. Even killing the watchdog process will have the osquery worker process exit, due to the separate thread which is there only to ensure that the parent (the watchdog) is still alive.
But anyway, this wait is nothing different from any other lock we have.
A deadlock could happen if a log relay thread
was trying to serialize logs into the database
when a database reset was being attempted.
Since the log relay thread is started by the same thread that executes the database reset, the scheduler thread, ensure that the log relay thread has finished its work before doing a database reset on the next scheduler loop.
Also ensure that when the scheduler is finishing its work, to permit osquery to exit, we wait on the log relaying thread if it's still running to prevent race conditions
and possible crashes on shutdown.
Finally remove the relayStatusLog call from the watcher process, it's a no-op since there's no logger plugin active.
To reproduce
Since the database reset doesn't happen at every scheduler step, but every
schedule_reload
seconds, and by default that is 3600, reducing it to a lower value increases the chances to encounter the issue.The starting of the relay thread doesn't too happen at every step, but every 3, so one has to wait when the two somehow align, so that at the previous step a relay thread is started, that at the next step a database reset happens, and the relay thread is still running.
When testing, to increase the chances for this to happen I've inserted a sleep when the reset database logic takes the exclusive lock on the database reset mutex (it would be below this line):
osquery/osquery/database/database.cpp
Line 134 in 0273079
Moreover this causes the scheduler loop to take more than the normally allotted time for a step (1 second), so multiple steps happen one after the other, increase the overlap chance.
osquery is launched with:
and the test TLS server with: