Fix deadlock when logging happens during a database reset #7798

Smjert · 2022-10-24T15:17:11Z

A deadlock could happen if a log relay thread
was trying to serialize logs into the database
when a database reset was being attempted.

Since the log relay thread is started by the same thread that executes the database reset, the scheduler thread, ensure that the log relay thread has finished its work before doing a database reset on the next scheduler loop.

Also ensure that when the scheduler is finishing its work to permit osquery to exit, we wait on the log relay thread if it's still running, to prevent race conditions
and possible crashes on shutdown.

Finally, remove the relayStatusLog call from the watcher process; it's a no-op since there's no logger plugin active.
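The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: SchedulerSketch, startRelay, resetDatabase, shutdown and the mutex names are all hypothetical stand-ins, and the relay is modelled with std::async instead of osquery's own thread handling. It shows only the "wait for the relay before resetting or exiting" ordering, not the original deadlock.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the scheduler; none of these names come from osquery.
class SchedulerSketch {
 public:
  // Start a relay thread that "serializes" one log line under a shared lock.
  void startRelay(const std::string& line) {
    waitForRelay();  // keep at most one relay in flight
    relay_ = std::async(std::launch::async, [this, line] {
      std::shared_lock<std::shared_timed_mutex> db(reset_mutex_);
      std::lock_guard<std::mutex> guard(events_mutex_);
      events_.push_back("relay:" + line);
    });
  }

  // Reset only after the previous relay has finished, then lock exclusively.
  void resetDatabase() {
    waitForRelay();
    // Invariant: no relay is running once we reach the exclusive lock.
    assert(!relay_.valid() ||
           relay_.wait_for(std::chrono::seconds(0)) ==
               std::future_status::ready);
    std::unique_lock<std::shared_timed_mutex> db(reset_mutex_);
    std::lock_guard<std::mutex> guard(events_mutex_);
    events_.push_back("reset");
  }

  // On shutdown, wait as well so the relay never outlives the scheduler.
  void shutdown() { waitForRelay(); }

  // Snapshot of what happened, in order.
  std::vector<std::string> events() {
    std::lock_guard<std::mutex> guard(events_mutex_);
    return events_;
  }

 private:
  void waitForRelay() {
    if (relay_.valid()) {
      relay_.wait();
    }
  }

  std::shared_timed_mutex reset_mutex_;
  std::mutex events_mutex_;
  std::vector<std::string> events_;
  // Declared last so it is destroyed (and therefore joined) first.
  std::future<void> relay_;
};
```

Calling startRelay("a"), then resetDatabase(), then startRelay("b") and shutdown() on one instance records the relay of "a", the reset, and the relay of "b" in that order, because the reset never starts while a relay is still in flight.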

To reproduce

The database reset doesn't happen at every scheduler step, but every schedule_reload seconds, which defaults to 3600. Reducing it to a lower value increases the chances of encountering the issue.
The relay thread also isn't started at every step, but every 3 steps, so one has to wait until the two align: a relay thread is started at one step, a database reset happens at the next step, and the relay thread is still running.
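To get a feel for how often the two line up, here is a small sketch of the arithmetic. It assumes a simplified model that is not taken from the osquery source: a step counter starting at 1, one step per second, and both the relay start and the reset triggered by a plain modulo check. The function name overlapSteps and its parameters are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the steps (2..max_step) at which a reset runs immediately after a
// step that started a relay, under the simplified modulo model above.
std::vector<std::size_t> overlapSteps(std::size_t relay_every,
                                      std::size_t reload_every,
                                      std::size_t max_step) {
  assert(relay_every > 0 && reload_every > 0);
  std::vector<std::size_t> steps;
  for (std::size_t i = 2; i <= max_step; ++i) {
    const bool reset_now = (i % reload_every) == 0;
    const bool relay_started_before = ((i - 1) % relay_every) == 0;
    if (reset_now && relay_started_before) {
      steps.push_back(i);
    }
  }
  return steps;
}
```

With a relay every 3 steps and schedule_reload set to 5, overlapSteps(3, 5, 30) yields steps 10 and 25, so under this model the pattern shows up within the first few seconds of each run. Whether the relay thread is actually still running at that point depends on timing, which is why the extra sleep described below helps.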

When testing, to increase the chances of this happening, I've inserted a sleep right after the reset database logic takes the exclusive lock on the database reset mutex (it would be below this line):

osquery/osquery/database/database.cpp

Line 134 in 0273079

WriteLock lock(kDatabaseReset);

Moreover, this causes the scheduler loop to take more than the normally allotted time for a step (1 second), so multiple steps happen one after the other, increasing the chance of overlap.
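The test-only delay could look something like the sketch below. This is an assumption about its shape rather than the exact change used: reset_mutex_sketch stands in for kDatabaseReset, resetWithDelay is a hypothetical function, and the actual reset work is omitted.

```cpp
#include <chrono>
#include <mutex>
#include <shared_mutex>
#include <thread>

// Stand-in for the database reset mutex; not the osquery global.
std::shared_timed_mutex reset_mutex_sketch;

// Takes the exclusive lock, holds it for `hold_for` to widen the race window,
// and returns how long the whole call took in milliseconds.
long long resetWithDelay(std::chrono::milliseconds hold_for) {
  const auto start = std::chrono::steady_clock::now();
  std::unique_lock<std::shared_timed_mutex> lock(reset_mutex_sketch);
  // Test-only: keep the exclusive lock held so that other threads pile up.
  std::this_thread::sleep_for(hold_for);
  // ... the real reset work would go here ...
  const auto elapsed = std::chrono::steady_clock::now() - start;
  return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed)
      .count();
}
```

Any delay longer than one scheduler step (1 second) should be enough to make the following steps run back to back, as described above.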

osquery is launched with:

sudo osquery/osqueryd --allow_unsafe --enroll_tls_endpoint=/enroll --tls_client_cert=test_configs/test_client.pem --tls_client_key=test_configs/test_client.key --tls_server_certs=test_configs/test_server_ca.pem --enroll_secret_path=test_configs/test_enroll_secret.txt --tls_hostname=localhost:8080 --logger_plugin=tls --verbose --schedule_reload=5

and the test TLS server with:

tools/tests/test_http_server.py --tls --persist --verbose --ca test_configs/test_server_ca.pem --cert test_configs/test_server.pem --key test_configs/test_server.key --test-configs-dir test_configs 8080

directionless

Approved.

directionless · 2022-10-29T16:05:54Z

osquery/logger/logger.cpp

+       in a path called by Google Log, and failing to properly wait
+       for the thread to finish will either cause a race condition or a deadlock
+     */
+    kOptBufferedLogSinkSender->wait();


Is it possible for this to block?

The intent here is to block, to ensure the ordering of the threads; if you mean to ask "can this block indefinitely", if there's a bug, yes.

Ha, yes indefinitely was implied. In your estimate, how likely is a bug there? I imagine a kill-9 will still kill off osquery, and someone will report it and we can deal.

Well, that's difficult to estimate, but that shouldn't stop us, meaning, the synchronization is needed.

And yes kill -9 will kill osquery. Even killing the watchdog process will have the osquery worker process exit, due to the separate thread which is there only to ensure that the parent (the watchdog) is still alive.

But anyway, this wait is nothing different from any other lock we have.

Smjert added bug logging database daemon core labels Oct 24, 2022

Smjert requested review from a team as code owners October 24, 2022 15:17

Smjert force-pushed the stefano/fix/reset-database-logging-deadlock branch from 1af2d93 to e411588 Compare October 24, 2022 16:45

Smjert added this to the 5.7.0 milestone Oct 24, 2022

directionless approved these changes Oct 29, 2022

mike-myers-tob merged commit e5276eb into osquery:master Nov 22, 2022

mike-myers-tob deleted the stefano/fix/reset-database-logging-deadlock branch November 22, 2022 19:10
