store: fix multi process with multi goroutines race on db. #2391
Conversation
	return nil
}

func (dl *dbLock) unlock() error {
	if err := dl.fl.Unlock(); err != nil {
		dl.Unlock()
		return err
	}
	dl.Unlock()
why not just call dl.Unlock() at the top of the function?
You want to always unlock the mutex guarding the fileLock second since it's the "outer lock". Unlocking it first would leave the file lock unprotected.
I do think this function would be slightly clearer with a named return, e.g.
// unlock() with a named return value err
err = dl.fl.Unlock()
dl.Unlock()
return
what about using defer? (trying to learn better go)
A `defer dl.Unlock()` at the top would be functionally equivalent, yeah. Whether that or a named return is clearer is debatable.
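For illustration, a minimal sketch of the defer variant being discussed, assuming dbLock embeds a sync.Mutex and fl is a file lock exposing an Unlock() error method (field names taken from the diff above):

```go
// Sketch only: the deferred mutex unlock runs after fl.Unlock() returns,
// so the guarding mutex is still released second, after the file lock.
func (dl *dbLock) unlock() error {
	defer dl.Unlock()
	return dl.fl.Unlock()
}
```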
> why not just call dl.Unlock() at the top of the function?
If your question is why I inverted the order, putting the sync.Mutex lock before the flock, that's because the sync.Mutex must protect the flock. The reason is that calling flock.ExclusiveLock after a previous ExclusiveLock will continue, since it's on the same file descriptor. This can cause something like this:
- Process A, goroutine 1: takes the flock, then takes the lock on the mutex, and does db.Open
- Process A, goroutine 2: takes the flock, blocks on the lock on the mutex
- Process A, goroutine 1: does something on the db, calls db.Close, releases the lock on the mutex, unlocks the flock
- Process A, goroutine 2: takes the lock on the mutex, but now it has the flock UNLOCKED
- Process B, goroutine 1: takes the lock on the mutex (since it's another process), and takes the lock on the flock: the ql camlistore lock will fail.
So the sync.Mutex, which is really exclusive (per store instance), must go before the flock.
Obviously there must be only one store instance (on the same dir) per process. This is what rkt actually does (I haven't found any wrong store usage).
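To illustrate the ordering the patch establishes, here is a rough sketch (assuming dbLock embeds a sync.Mutex and holds a file lock in fl, as in the diff hunks elsewhere in this PR; not a verbatim copy of the merged code):

```go
// Sketch: take the per-process mutex first, so a goroutine can never
// proceed believing it holds the flock after another goroutine of the
// same process has already released it.
func (dl *dbLock) lock() error {
	dl.Lock() // outer, per-process guard
	if err := dl.fl.ExclusiveLock(); err != nil {
		dl.Unlock()
		return err
	}
	return nil
}
```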
So, good news: with this patch I seem to have been able to pass a k8s test that I never passed before. Bad news: I still see similar error output, and see a number of these.
	return nil
}

func (dl *dbLock) unlock() error {
	if err := dl.fl.Unlock(); err != nil {
		dl.Unlock()
I think you can end up in a bad state here.
If the fl.Unlock here fails, it's very possible that the file lock is still held, and yet you release the guarding mutex here. The lock function will now never succeed (returning an error each time).
However, it's worse than that, since the caller can't fix this state. If the caller doesn't call unlock again, they're broken in that lock will never work again. However, if they call unlock again to try and fix this, they'll end up hitting dl.Unlock with it already being unlocked, which is a panic.
I think the correct fix is to keep the dl.Lock moved up as above, but to only Unlock here if the fl.Unlock succeeds, and then have a caller up the line retry.
There are a couple of additional subtleties in that an error from fl.Unlock does not necessarily mean that the lock can ever be unlocked, so you really need to inspect the error and behave differently depending on it.
It's possible I'm missing something here and the above scenario can't happen, or is somehow handled up the call path (which I am not familiar with).
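A rough sketch of the behaviour suggested above (hypothetical, not what this PR ends up doing; it assumes the same dbLock shape as in the diff and that callers are prepared to retry unlock on error):

```go
// Sketch: keep the guarding mutex held when the flock fails to unlock,
// so lock() cannot silently proceed while the file lock may still be
// held; the caller may retry unlock() after inspecting the error.
func (dl *dbLock) unlock() error {
	if err := dl.fl.Unlock(); err != nil {
		// Mutex intentionally left locked; the file lock state is unknown.
		return err
	}
	dl.Unlock()
	return nil
}
```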
> If the fl.Unlock here fails
Looking at man 2 flock, it looks like the only way an unlock can fail is if the underlying fd has been closed. That would be a programming error, but I'm not sure whether there are other cases.
How should this possibility be handled? Usually rkt commands will get the error and exit; the rkt api service, instead, will continue to use that store instance. Probably it should handle this error specially, closing that store instance and creating a new one (this needs to be handled correctly in different ways).
defer dl.Unlock() at the top sounds good to me.
Deferring it doesn't alleviate my concern that we're unlocking it in the case of an fl.Unlock error.
I agree that it's not likely that the fl unlock will fail. Because of that, having a log, a panic, or store-instance cleanup all seem like plausible paths forward for this PR. I'm fine with my concern not being fully fixed, since it's out of scope of the other fix here. At the very least, though, I'd want a comment.
> I agree that it's not likely that the fl unlock will fail. Because of that, having a log, a panic, or store-instance cleanup all seem like plausible paths forward for this PR. I'm fine with my concern not being fully fixed, since it's out of scope of the other fix here. At the very least, though, I'd want a comment.
@euank I added a (quite long) comment trying to explain the problem. In the meantime I removed the mutex unlock and made it panic when the flock unlock fails (let's see if and when that happens).
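A minimal sketch of that approach, under the same assumptions about dbLock as above (illustrative only, not a verbatim copy of the updated code):

```go
// Sketch: a failed flock unlock is treated as unrecoverable. Per flock(2)
// it should only happen on a programming error (e.g. a closed fd), so we
// panic instead of releasing the mutex and leaving the lock state unknown.
func (dl *dbLock) unlock() error {
	if err := dl.fl.Unlock(); err != nil {
		panic(err)
	}
	dl.Unlock()
	return nil
}
```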
Uhm, was the rkt api service restarted? Which tests (so I can try them locally)? I have to leave for some hours; I'll answer the other questions when I'm back.
A1: yes, started fresh.
A2: [Fail] [k8s.io] SchedulerPredicates [Serial] [It] validates MaxPods limit number of pods that are allowed to run [Slow]
I'm basically using https://github.com/yifan-gu/kubernetes/tree/gce_coreos_rkt (i.e. the branch is gce_coreos_rkt). I set up the environment, build (the PYTHON env above is important for this), and then set up the cluster; this downloads an appropriate rkt build from our gcloud storage, as specified by KUBE_RKT_VERSION. I then run the specific test, and in another window I see the error output.
@@ -34,9 +34,11 @@ const (
// dbLock is used to guarantee both thread-safety and process-safety
// for db access.
type dbLock struct {
	// This lock is to make that access to the ql db file being blocking
	// since ql use an internal locking but it won't block but return an
s/but return/and return ?
Tried this test multiple times (on a local k8s git master cluster) but I can't reproduce it with this fix (without it, I can reproduce it every time). That's really strange, given the other test I was using. Sorry for being pedantic 😄 but are you really sure that you're running a patched rkt version?
if err := dl.fl.ExclusiveLock(); err != nil {
	dl.Unlock()
So, I went back through the history and found that the reason we do flock() before the mutex is #1892 (comment).
But I think this fixes the problem.
@sgotti @sjpotter Is the error coming from the ql lock or from our flock? We should add some logging here...
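For example, one hypothetical way to make the origin of the failure obvious in the logs (not part of this PR; it assumes the same dbLock shape as in the diff and just annotates the error with fmt):

```go
// Hypothetical: annotate the error at the flock site so a flock failure
// can be told apart from ql's internal "cannot acquire lock" error.
func (dl *dbLock) lock() error {
	dl.Lock()
	if err := dl.fl.ExclusiveLock(); err != nil {
		dl.Unlock()
		return fmt.Errorf("error acquiring file lock: %v", err)
	}
	return nil
}
```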
sgotti: pretty sure. Can you attach your rkt tarball so I can try again?
OK, so I patched head (1.3.0+) with this and it works perfectly, no unexpected output.
@sjpotter Awesome! Sounds like big progress. Though we need to fix the symlink issue with 1.3.
That sounds great! For me it also worked on a patched 1.2.1, so perhaps there's some bad HTTP caching... Tomorrow I'll update the PR with all the suggestions.
@sgotti Thank you so very much!
Force-pushed from 78fc8d5 to 76712a7 (compare)
Thanks a lot for all your reviews and tests. I tried to implement the various suggestions. Now, if the flock unlock fails, the code will panic, since there's no easy way to recover from it. @yifan-gu, if this ever happens, the rkt api service should be respawned by a systemd restart policy or something similar. Probably some followup patches should:
@@ -78,6 +78,9 @@ func createTable(db *DB, t *testing.T) {
}

func TestDBRace(t *testing.T) {
	// TODO(sgotti) this will not find concurrenct accesses to ql db from
	// multiple processes using multiple goroutines. A test that spawns at
@jonboulle Sorry, it's difficult to explain. What I meant was this:
The store db model is based on making short-lived Open/Tx/Close calls so the ql db is accessible by multiple processes and/or goroutines without long blocking and possible deadlocks.
The "cannot acquire lock: resource temporarily unavailable" error doesn't happen with:
- Multiple OS processes that don't have multiple goroutines using a shared store instance.
- Only one OS process accessing the db (even if it has multiple goroutines using a shared store instance, like this test and the rkt api-service).
It happens only when there are multiple OS processes concurrently accessing the ql db and at least one of them has multiple goroutines using a shared store instance.
Plus, after more analysis, I think that the case
- Only one OS process accessing the db (even if it has multiple goroutines using a shared store instance, like this test and the rkt api-service)
which in the commit comment I said won't give an error but could cause possible db corruption, is probably not true, so I'll remove it from the commit comment.
@jonboulle I changed the commit comment, removing the very confusing description and adding an example.
Force-pushed from 3e70d76 to d4c702d (compare)
When there are multiple OS processes concurrently accessing the ql db and at least one of them has multiple goroutines using a shared store instance, the internal ql (camlistore based) locking error `cannot acquire lock: resource temporarily unavailable` can be triggered. For example:

- Process A, goroutine 1: takes the flock, then takes the lock on the mutex, and does db.Open
- Process A, goroutine 2: takes the flock, blocks on the lock on the mutex
- Process A, goroutine 1: does something on the db, calls db.Close, unlocks the flock, releases the lock on the mutex
- Process A, goroutine 2: takes the lock on the mutex, but now it has the flock UNLOCKED
- Process B, goroutine 1: takes the lock on the flock and then takes the lock on the mutex (since it's another process): the ql internal camlistore lock will fail.

In the code there's already locking done by a mutex, but it looks like it was put there just to make the go race detector happy, and the lock/unlock order is wrong. Instead, it is really needed. This patch fixes the lock/unlock order.

A future test to really exercise this problem will need to spawn two processes, with at least one of them using multiple goroutines on the same store instance.

Here is an example repro:

```go
package main

import (
	"fmt"
	"os"

	"github.com/coreos/rkt/store"
)

func main() {
	s, err := store.NewStore("/tmp/test/rkt")
	if err != nil {
		fmt.Printf("cannot open store: %v", err)
		os.Exit(1)
	}
	runs := 10000
	for i := 0; i < runs; i++ {
		go func() {
			_, err = s.GetAllACIInfos(nil, false)
			if err != nil {
				fmt.Printf("cannot query db: %v", err)
				os.Exit(1)
			}
		}()
	}
	s.Close()
}
```

Compile it and execute it with something like:

```bash
#!/bin/bash
while true; do
	for i in $(seq 0 2); do
		./test01 &
	done
	wait
done
```
Force-pushed from d4c702d to 53104d0 (compare)
By my reading, since there's an early return when there's an error, I'm 👍 on merging this, and giving fun error scenarios more thought in a new issue.
Oh dammit, I even called this problem at #1892 (comment) when it wasn't an issue (because at the time we were creating a new …).
When there are multiple OS processes concurrently accessing the ql
db and at least one of them has multiple goroutines using a shared store
instance, the internal ql (camlistore based) locking error
`cannot acquire lock: resource temporarily unavailable`
can be triggered. For example:
- Process A, goroutine 1: takes the flock, then takes the lock on the mutex, and does db.Open
- Process A, goroutine 2: takes the flock, blocks on the lock on the mutex
- Process A, goroutine 1: does something on the db, calls db.Close, unlocks the flock, releases the lock on the mutex
- Process A, goroutine 2: takes the lock on the mutex, but now it has the flock UNLOCKED
- Process B, goroutine 1: takes the lock on the flock and then takes the lock on the mutex (since it's another process): the ql internal camlistore lock will fail.
In the code there's already locking done by a mutex, but it looks like it
was put there just to make the go race detector happy, and the
lock/unlock order is wrong. Instead, it is really needed.
This patch fixes the lock/unlock order.
A future test to really exercise this problem will need to spawn two
processes, with at least one of them using multiple goroutines on the same
store instance.
An example repro program, and a script to run it, are shown in the commit message earlier in this thread.
Should fix #2374