Fix deadlock when logging happens during a database reset by Smjert · Pull Request #7798 · osquery/osquery · GitHub
Fix deadlock when logging happens during a database reset #7798

Conversation

Smjert
Member
@Smjert Smjert commented Oct 24, 2022

A deadlock could happen if a log relay thread
was trying to serialize logs into the database
when a database reset was being attempted.

Since the log relay thread is started by the same thread that executes the database reset (the scheduler thread), ensure that the log relay thread has finished its work before doing a database reset on the next scheduler loop.

Also ensure that when the scheduler is finishing its work, to permit osquery to exit, we wait on the log relay thread if it's still running, to prevent race conditions and possible crashes on shutdown.

Finally, remove the relayStatusLog call from the watcher process; it's a no-op since there's no logger plugin active.

To reproduce

Since the database reset doesn't happen at every scheduler step, but every schedule_reload seconds (3600 by default), reducing it to a lower value increases the chances of encountering the issue.
The relay thread isn't started at every step either, but every 3 steps, so one has to wait for the two to align: a relay thread is started at one step, a database reset happens at the next step, and the relay thread is still running.

When testing, to increase the chances of this happening, I've inserted a sleep where the reset-database logic takes the exclusive lock on the database reset mutex (it would be below this line):

WriteLock lock(kDatabaseReset);

Moreover, this causes the scheduler loop to take more than the normally allotted time for a step (1 second), so multiple steps happen one after the other, increasing the overlap chance.

osquery is launched with:

sudo osquery/osqueryd --allow_unsafe --enroll_tls_endpoint=/enroll --tls_client_cert=test_configs/test_client.pem --tls_client_key=test_configs/test_client.key --tls_server_certs=test_configs/test_server_ca.pem --enroll_secret_path=test_configs/test_enroll_secret.txt --tls_hostname=localhost:8080 --logger_plugin=tls --verbose --schedule_reload=5

and the test TLS server with:

tools/tests/test_http_server.py --tls --persist --verbose --ca test_configs/test_server_ca.pem --cert test_configs/test_server.pem --key test_configs/test_server.key --test-configs-dir test_configs 8080

@Smjert Smjert force-pushed the stefano/fix/reset-database-logging-deadlock branch from 1af2d93 to e411588 Compare October 24, 2022 16:45
@Smjert Smjert added this to the 5.7.0 milestone Oct 24, 2022
Member
@directionless directionless left a comment
Approved.

in a path called by Google Log, and failing to properly wait
for the thread to finish will either cause a race condition or a deadlock
*/
kOptBufferedLogSinkSender->wait();
Member
Is it possible for this to block?

Member Author

The intent here is to block, to ensure the ordering of the threads; if you mean to ask "can this block indefinitely": if there's a bug, yes.

Member

Ha, yes, indefinitely was implied. In your estimate, how likely is a bug there? I imagine a kill -9 will still kill off osquery, and someone will report it and we can deal.

Member Author
@Smjert Smjert Oct 30, 2022

Well, that's difficult to estimate, but that shouldn't stop us; meaning, the synchronization is needed.

And yes kill -9 will kill osquery. Even killing the watchdog process will have the osquery worker process exit, due to the separate thread which is there only to ensure that the parent (the watchdog) is still alive.

But anyway, this wait is nothing different from any other lock we have.

@mike-myers-tob mike-myers-tob merged commit e5276eb into osquery:master Nov 22, 2022
@mike-myers-tob mike-myers-tob deleted the stefano/fix/reset-database-logging-deadlock branch November 22, 2022 19:10