store: fix multi process with multi goroutines race on db. by sgotti · Pull Request #2391 · rkt/rkt · GitHub
This repository was archived by the owner on Feb 24, 2020. It is now read-only.

store: fix multi process with multi goroutines race on db. #2391

Merged
merged 1 commit into rkt:master from sgotti:fix_store_concurrency on Apr 11, 2016

Conversation

sgotti
Contributor
@sgotti sgotti commented Apr 7, 2016

When there are multiple OS processes concurrently accessing the ql db, and at least one of them has multiple goroutines using a shared store instance, the internal ql (camlistore-based) locking error `cannot acquire lock: resource temporarily unavailable` can be triggered.

For example:

Process A, goroutine 1: takes the flock, then takes the lock on the mutex, and does db.Open
Process A, goroutine 2: takes the flock, blocks on the lock on the mutex
Process A, goroutine 1: does something on the db, calls db.Close, unlocks the flock, releases the lock on the mutex
Process A, goroutine 2: takes the lock on the mutex, but now it has the flock UNLOCKED
Process B, goroutine 1: takes the flock and then takes the lock on the mutex (since it's another process): the ql internal camlistore lock will fail.

The code already has locking via a mutex, but it looks like it was put there just to make the Go race detector happy, and the lock/unlock order is wrong. In fact, this mutex is really needed.

This patch fixes the lock/unlock order.

A future test that really exercises this problem will need to spawn two processes, with at least one of them using multiple goroutines sharing the same store instance (a rough Go sketch of such a driver follows the repro below).

Here's an example repro:

```
package main

import (
        "fmt"
        "os"

        "github.com/coreos/rkt/store"
)

func main() {
        s, err := store.NewStore("/tmp/test/rkt")
        if err != nil {
                fmt.Printf("cannot open store: %v", err)
                os.Exit(1)
        }

        runs := 10000
        for i := 0; i < runs; i++ {
                go func() {
                        _, err = s.GetAllACIInfos(nil, false)
                        if err != nil {
                                fmt.Printf("cannot query db: %v", err)
                                os.Exit(1)
                        }
                }()
        }

        s.Close()
}
```

Compile and run the above program with something like:

```
#!/bin/bash
while true; do
        for i in $(seq 0 2); do
                ./test01 &
        done
        wait
done
```

Should fix #2374
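For reference, the same driver loop can also be written in Go — a minimal sketch, assuming the repro above has been compiled to ./test01 as in the script (nothing here is rkt code):

```
package main

import (
	"log"
	"os/exec"
	"sync"
)

func main() {
	// Spawn a few copies of the compiled repro in parallel so that
	// multiple OS processes (each running many goroutines against a
	// shared store instance) hit the same ql db concurrently.
	for {
		var wg sync.WaitGroup
		for i := 0; i < 3; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				out, err := exec.Command("./test01").CombinedOutput()
				if err != nil {
					log.Fatalf("repro failed: %v\n%s", err, out)
				}
			}()
		}
		wg.Wait()
	}
}
```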

```
return nil
}

func (dl *dbLock) unlock() error {
	if err := dl.fl.Unlock(); err != nil {
		dl.Unlock()
		return err
	}
	dl.Unlock()
```
Contributor

why not just call dl.Unlock() at the top of the function?

Member

You want to always unlock the mutex guarding the fileLock second since it's the "outer lock". Unlocking it first would leave the file lock unprotected.

I do think this function would be slightly clearer with a named return, e.g.

```
// named return value err
err = dl.fl.Unlock()
dl.Unlock()
```
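Spelled out, that named-return variant could look roughly like this (an illustrative sketch, not the code in this PR):

```
func (dl *dbLock) unlock() (err error) {
	// fl.Unlock's error (if any) is returned, but the mutex guarding the
	// file lock is always released second, since it's the "outer" lock.
	err = dl.fl.Unlock()
	dl.Unlock()
	return
}
```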

Contributor

what about using defer? (trying to learn better go)

Member

a defer dl.Unlock() at the top would be functionally equivalent, yeah. Whether that or a named return is clearer is debatable.
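For comparison, the defer variant would look roughly like this (again just a sketch):

```
func (dl *dbLock) unlock() error {
	// The deferred call runs after fl.Unlock returns, so the mutex that
	// guards the file lock is still released second.
	defer dl.Unlock()
	return dl.fl.Unlock()
}
```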

Contributor Author

why not just call dl.Unlock() at the top of the function?

If your question is why I inverted the order, putting the sync.Mutex lock before the flock: that's because the sync.Mutex must protect the flock. The reason is that calling flock.ExclusiveLock after a previous ExclusiveLock just continues (it doesn't block), since it's on the same file descriptor. This can cause something like this:

  • Process A, goroutine 1: takes the flock, then takes the lock on the mutex, and does db.Open
  • Process A, goroutine 2: takes the flock, blocks on the lock on the mutex
  • Process A, goroutine 1: does something on the db, calls db.Close, releases the lock on the mutex, unlocks the flock
  • Process A, goroutine 2: takes the lock on the mutex, but now it has the flock UNLOCKED
  • Process B, goroutine 1: takes the lock on the mutex (since it's another process), and takes the flock: the ql camlistore lock will fail.

So the sync.Mutex, which is really exclusive (per store instance), must go before the flock.

Obviously there must be only one store instance (on the same dir) per process. This is what rkt actually does (I haven't found any wrong store usage).
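In other words, after the fix the lock path takes the mutex first and the flock second, roughly like this (a sketch of the idea, not the exact patch; the method name lock is assumed):

```
func (dl *dbLock) lock() error {
	// Take the per-store-instance mutex first: a second ExclusiveLock on
	// the same fd would succeed within the same process, so only the
	// mutex actually serializes goroutines sharing a store instance.
	dl.Lock()
	if err := dl.fl.ExclusiveLock(); err != nil {
		dl.Unlock()
		return err
	}
	return nil
}
```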

@sjpotter
Contributor
sjpotter commented Apr 7, 2016

So, good news: with this patch I seem to have been able to pass a k8s test that I never passed before.

Bad news: I still see similar error output.

e2e-tests-sched-pred-aboh9   2016-04-07 10:09:04 -0700 PDT   2016-04-07 10:09:04 -0700 PDT   1         maxp-252   Pod       spec.containers{maxp-252}   Warning   Failed    {kubelet spotter-rkt-minion-gtjj}   Failed to create rkt container with error: failed to run [prepare --quiet --pod-manifest /tmp/manifest-maxp-252-392804414]: exit status 1
stdout: 
stderr: prepare: cannot open store: cannot acquire lock: resource temporarily unavailable

and

e2e-tests-sched-pred-aboh9   2016-04-07 10:10:38 -0700 PDT   2016-04-07 10:10:38 -0700 PDT   1         maxp-48   Pod                 Warning   FailedSync   {kubelet spotter-rkt-minion-gtjj}   Error syncing pod, skipping: failed to SyncPod: failed to run [prepare --quiet --pod-manifest /tmp/manifest-maxp-48-665407762]: exit status 1
stdout: 
stderr: image: using image from file /opt/rkt/stage1-coreos.aci
prepare: cannot acquire lock: resource temporarily unavailable

I see a number of these.

```
return nil
}

func (dl *dbLock) unlock() error {
	if err := dl.fl.Unlock(); err != nil {
		dl.Unlock()
```
Member

I think you can end up in a bad state here.

If the fl.Unlock here fails, it's very possible that the file lock is still held, and yet you release the guarding mutex here. The lock function will now never succeed (returning an error each time).

However, it's worse than that, since the caller can't fix this state. If the caller doesn't call unlock again, they're broken in that lock will never work again. However, if they call unlock again to try and fix this, they'll end up hitting dl.Unlock with it already being unlocked, which is a panic.

I think the correct fix is to keep the dl.Lock move above, but to only Unlock here if the fl.Unlock succeeds, and then have a caller up the line retry.

There are a couple of additional subtleties, in that an error from fl.Unlock does not necessarily mean that the lock can ever be unlocked, so you really need to inspect the error and behave differently depending on it.

It's possible I'm missing something here and the above scenario can't happen, or is somehow handled up the call path (which I am not familiar with).

Contributor Author

If the fl.Unlock here fails

Looking at man 2 flock, it looks like the only way an unlock can fail is if the underlying fd has been closed. That should be a programming error, but I'm not sure whether there are other cases.

How to handle this possibility?
Usually rkt commands will get the error and exit; the rkt api service, instead, will continue to use that store instance. It should probably handle this error specially, closing that store instance and creating a new one (this needs to be handled correctly in different ways).
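A very rough sketch of what that special handling could look like in a long-running caller (purely hypothetical; reopenStore is an invented helper, and only store.NewStore and Close from the repro above are assumed):

```
// reopenStore is a hypothetical helper: when an unlock failure makes a
// store instance unusable, drop it and build a fresh one instead of
// keeping a possibly stuck lock around.
func reopenStore(old *store.Store, dir string) (*store.Store, error) {
	old.Close() // best effort: the old instance is considered broken anyway
	return store.NewStore(dir)
}
```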

Contributor

defer dl.Unlock() at the top sounds good to me.

Member

deferring it doesn't alleviate my concern that we're unlocking it in the case of an fl.Unlock error.

I agree that it's not likely that the fl unlock will fail. Because of that, having a log or panic or store-instance cleanup all seem like plausible paths forwards for this PR. I'm fine with my concern not being fully fixed, since it's out of scope of the other fix here. At the very least though, I'd want a comment.

Contributor Author

I agree that it's not likely that the fl unlock will fail. Because of that, having a log or panic or store-instance cleanup all seem like plausible paths forwards for this PR. I'm fine with my concern not being fully fixed, since it's out of scope of the other fix here. At the very least though, I'd want a comment.

@euank I added a (quite long) comment trying to explain the problem. In the meantime I removed the mutex unlock and made it panic when the flock unlock fails (let's see if and when that happens).
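So the unlock path now looks roughly like this (a sketch of the approach just described, not a verbatim quote of the patch):

```
func (dl *dbLock) unlock() error {
	// A failing flock unlock should only happen on a programming error
	// (e.g. the underlying fd was closed); there's no sane way to
	// recover the lock state here, so fail loudly.
	if err := dl.fl.Unlock(); err != nil {
		panic(err)
	}
	dl.Unlock()
	return nil
}
```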

@sgotti
Contributor Author
sgotti commented Apr 7, 2016

Bad news: I still see similar error output.

Hmm, was the rkt api service restarted? Which tests? (So I can try them locally.)

I have to leave for a few hours; I'll answer the other questions when I'm back.

@sjpotter
Contributor
sjpotter commented Apr 7, 2016

A1: yes, started fresh

A2: [Fail] [k8s.io] SchedulerPredicates [Serial] [It] validates MaxPods limit number of pods that are allowed to run [Slow]

I'm basically using https://github.com/yifan-gu/kubernetes/tree/gce_coreos_rkt (i.e. branch is gce_coreos_rkt)

I set up the environment currently with:

```
BUILD_PYTHON_IMAGE=true
KUBE_CONTAINER_RUNTIME=rkt
KUBE_ENABLE_CLUSTER_MONITORING=standalone
KUBE_ENABLE_NODE_LOGGING=false
KUBE_GCE_INSTANCE_PREFIX=${USER}-rkt
KUBE_GCE_MASTER_IMAGE=coreos-alpha-962-0-0-v20160218
KUBE_GCE_MASTER_PROJECT=coreos-cloud
KUBE_GCE_NETWORK=${USER}-rkt
KUBE_GCE_NODE_IMAGE=coreos-alpha-960-0-0-v20160217
KUBE_GCE_NODE_PROJECT=coreos-cloud
KUBE_GCE_ZONE=us-east1-b
KUBE_OS_DISTRIBUTION=coreos
KUBE_RKT_VERSION=rkt-${USER}-hack
KUBE_SKIP_UPDATE=y
```

I build by running (the PYTHON env var above is important for this):

make quick-release

I set up the cluster by doing a

go run hack/e2e.go -v -up

this downloads an appropriate rkt build from our gcloud storage as specified by KUBE_RKT_VERSION

run the specific test by doing a

go run hack/e2e.go -v -test -test_args="-ginkgo.focus=validates.MaxPods.limit"

in another window I do

 ./cluster/kubectl.sh get events -w --all-namespaces

and I see the error output

```
@@ -34,9 +34,11 @@ const (
// dbLock is used to guarantee both thread-safety and process-safety
// for db access.
type dbLock struct {
	// This lock is to make that access to the ql db file being blocking
	// since ql use an internal locking but it won't block but return an
```
Contributor

s/but return/and return ?

@sgotti
Contributor Author
sgotti commented Apr 7, 2016

A1: yes, started fresh

A2: [Fail] [k8s.io] SchedulerPredicates [Serial] [It] validates MaxPods limit number of pods that are allowed to run [Slow]

I tried this test multiple times (on a local k8s git master cluster) but I can't reproduce it with this fix (without it, I can reproduce it every time). That's really strange, since the other test I was using (kubelet should be able to delete 10 pods per node in 1m0s) does the same rkt prepare (with fewer pods).

Sorry for being pedantic 😄 but are you really sure that you're running a patched rkt version?

```
if err := dl.fl.ExclusiveLock(); err != nil {
	dl.Unlock()
```
Contributor

So, I went back through the history and found that the reason we do flock() before the mutex is #1892 (comment)

But I think this fixes the problem.

@yifan-gu
Contributor
yifan-gu commented Apr 7, 2016

Bad news: I still see similar error output.
e2e-tests-sched-pred-aboh9 2016-04-07 10:09:04 -0700 PDT 2016-04-07 10:09:04 -0700 PDT 1 maxp-252 Pod spec.containers{maxp-252} Warning Failed {kubelet spotter-rkt-minion-gtjj} Failed to create rkt container with error: failed to run [prepare --quiet --pod-manifest /tmp/manifest-maxp-252-392804414]: exit status 1
stdout:
stderr: prepare: cannot open store: cannot acquire lock: resource temporarily unavailable

@sgotti @sjpotter Is the error coming from the ql lock or our flock? We should add some logging here...

@sgotti
Contributor Author
sgotti commented Apr 7, 2016

@sgotti @sjpotter Is the error coming from the ql lock or our flock? We should add some logging here...

It's coming from the ql camlistore locking. You're right, and I also wanted to improve the logging, but I wanted to keep this patch focused (perhaps a new PR would be better).
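For instance, the flock error could be wrapped with some context so it's distinguishable from the ql internal one — just a sketch of the kind of logging that could be added, not code from this PR:

```
if err := dl.fl.ExclusiveLock(); err != nil {
	dl.Unlock()
	// Make it clear this failure comes from our flock, not from ql's
	// internal camlistore lock.
	return fmt.Errorf("error acquiring the db flock: %v", err)
}
```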

@sjpotter
Contributor
sjpotter commented Apr 7, 2016

sgotti: pretty sure,

```
spotter-rkt-minion-60dh spotter # systemctl cat kubernetes-install-rkt
# /etc/systemd/system/kubernetes-install-rkt.service
[Unit]
Description=Fetch rkt
Documentation=http://github.com/coreos/rkt
Requires=network-online.target
After=network-online.target
Requires=kube-env.service
After=kube-env.service
[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=/etc/kube-env
ExecStartPre=/usr/bin/mkdir -p /etc/rkt
ExecStartPre=/usr/bin/mkdir -p /opt/downloads
ExecStartPre=/usr/bin/curl --location --create-dirs --output /opt/downloads/rkt.tar.gz https://storage.googleapis.com/rkt-bin/${RKT_VERSION}.tar.gz
ExecStartPre=/usr/bin/tar xf /opt/downloads/rkt.tar.gz -C /opt --overwrite
ExecStartPre=/usr/bin/rm -fr /opt/rkt
ExecStart=/usr/bin/mv /opt/bin /opt/rkt
spotter-rkt-minion-60dh spotter # systemctl cat rkt-api-service
# /etc/systemd/system/rkt-api-service.service
[Unit]
Description=Start rkt API service as Daemon
Requires=kubernetes-install-rkt.service
After=kubernetes-install-rkt.service
[Service]
ExecStart=/opt/rkt/rkt api-service
Restart=always
RestartSec=10
```

Can you attach your rkt tarball so I can try again?

@sjpotter
Contributor
sjpotter commented Apr 7, 2016

OK, so I patched head (1.3.0+) with this and it works perfectly, no unexpected output.

@yifan-gu
Contributor
yifan-gu commented Apr 7, 2016

OK, so I patched head (1.3.0+) with this and it works perfectly, no unexpected output.

@sjpotter Awesome! Sounds like big progress. Though we need to fix the symlink issue with 1.3.

@sgotti
Contributor Author
sgotti commented Apr 7, 2016

OK, so I patched head (1.3.0+) with this and it works perfectly, no unexpected output.

That sounds great! For me it also worked on a patched 1.2.1; perhaps there was some bad HTTP caching...

Tomorrow I'll update the PR with all the suggestions.

@yifan-gu
Contributor
yifan-gu commented Apr 8, 2016

@sgotti Thank you so very much!

@sgotti sgotti force-pushed the fix_store_concurrency branch from 78fc8d5 to 76712a7 Compare April 8, 2016 09:03
@sgotti
Contributor Author
sgotti commented Apr 8, 2016

Thanks a lot for all your reviews and tests.

I tried to implement the various suggestions. Now, if the flock unlock fails, the code will panic, since there's no easy way to recover from it. @yifan-gu, if this ever happens, the rkt api service should be respawned by a systemd restart policy or something similar.

Some follow-up patches should probably:

  • Improve the error reporting.
  • @euank I noticed that when I wrote it I was ignoring the sqldb.Close errors; I'd like to also handle them. Thoughts?

```
@@ -78,6 +78,9 @@ func createTable(db *DB, t *testing.T) {
}

func TestDBRace(t *testing.T) {
	// TODO(sgotti) this will not find concurrenct accesses to ql db from
	// multiple processes using multiple goroutines. A test that spawns at
```
Contributor

@sgotti I don't really understand your use of the word "process" in this context (and in the original comment). Isn't this just a problem within a single process with multiple goroutines?

Contributor Author

@jonboulle Sorry, it's difficult to explain. What I meant was that:

The store db model is based on doing short-lived Open/Tx/Close cycles, so that the ql db is accessible by multiple processes and/or goroutines without long blocking and possible deadlocks.

The "cannot acquire lock: resource temporarily unavailable" doesn't happen with:

  • Multiple OS processes, none of which has multiple goroutines using a shared store instance.
  • Only one OS process accessing the db (even if it has multiple goroutines using a shared store instance, like this test and the rkt api-service).

It happens only when there are multiple OS processes concurrently accessing the ql db and at least one of them has multiple goroutines using a shared store instance.

Plus, after more analysis, I think that for the case

  • Only one OS process accessing the db (even if it has multiple goroutines using a shared store instance, like this test and the rkt api-service).

the claim in the commit comment that it won't give an error but can cause possible db corruption is probably not true, so I'll remove it from the commit comment.

Contributor Author

@jonboulle I changed the commit comment, removing the very confusing description and adding an example.

@sgotti sgotti force-pushed the fix_store_concurrency branch 3 times, most recently from 3e70d76 to d4c702d Compare April 8, 2016 10:58
When there are multiple OS processes concurrently accessing the ql
db and at least one of them has multiple goroutines using a shared store
instance the internal ql (camlistore based) locking error: `cannot
acquire lock: resource temporarily unavailable` can be triggered.

For example:

Process A, goroutine 1: takes the flock, then takes the lock on the mutex, and does db.Open
Process A, goroutine 2: takes the flock, blocks on the lock on the mutex
Process A, goroutine 1: does something on the db, calls db.Close, unlocks the flock, releases the lock on the mutex
Process A, goroutine 2: takes the lock on the mutex, but now it has the flock UNLOCKED
Process B, goroutine 1: takes the flock and then takes the lock on the mutex (since it's another process): the ql internal camlistore lock will fail.

The code already has locking via a mutex, but it looks like it was put there just to make the Go race detector happy, and the lock/unlock order is wrong. In fact, this mutex is really needed.

This patch fixes the lock/unlock order.

A future test that really exercises this problem will need to spawn two processes, with at least one of them using multiple goroutines sharing the same store instance.

Here's an example repro:

```
package main

import (
        "fmt"
        "os"

        "github.com/coreos/rkt/store"
)

func main() {
        s, err := store.NewStore("/tmp/test/rkt")
        if err != nil {
                fmt.Printf("cannot open store: %v", err)
                os.Exit(1)
        }

        runs := 10000
        for i := 0; i < runs; i++ {
                go func() {
                        _, err = s.GetAllACIInfos(nil, false)
                        if err != nil {
                                fmt.Printf("cannot query db: %v", err)
                                os.Exit(1)
                        }
                }()
        }

        s.Close()
}

```

Compile and run the above program with something like:

```
#!/bin/bash
while true; do
        for i in $(seq 0 2); do
                ./test01 &
        done
        wait
done
```
@sgotti sgotti force-pushed the fix_store_concurrency branch from d4c702d to 53104d0 Compare April 8, 2016 11:02
@sgotti sgotti changed the title store: fix goroutines race on db. store: fix multi process with multi goroutines race on db. Apr 8, 2016
@euank
Member
euank commented Apr 8, 2016

By my reading, since there's an early return when there's an sqldb.Close error, you end up in a state with a held lock and no easy path forwards. I think that's a discussion for another issue, but that, too, looks to be fraught with peril.

I'm 👍 on merging this, and giving fun error scenarios more thought in a new issue.

@jonboulle
Contributor

Oh dammit, I even called this problem @ #1892 (comment) when it wasn't an issue (because at the time we were creating a new dbLock for every txn) - but then I missed that it was introduced after all @ #1892 (comment) :-(

@jonboulle
Contributor

LGTM. @euank can you file a follow-up?

thanks @sgotti !

@jonboulle jonboulle merged commit baa7f1f into rkt:master Apr 11, 2016
@jonboulle jonboulle added this to the v1.4.0 milestone Apr 11, 2016
Successfully merging this pull request may close these issues:

occasional failures to prepare in rktnetes testing