store: fix multi process with multi goroutines race on db. by sgotti · Pull Request #2391 · rkt/rkt · GitHub
This repository was archived by the owner on Feb 24, 2020. It is now read-only.

store: fix multi process with multi goroutines race on db. #2391

Merged
merged 1 commit into rkt:master from sgotti:fix_store_concurrency on Apr 11, 2016

Conversation

sgotti
Contributor
@sgotti sgotti commented Apr 7, 2016

When there are multiple OS processes concurrently accessing the ql db, and at least one of them has multiple goroutines using a shared store instance, the internal ql (camlistore-based) locking error `cannot acquire lock: resource temporarily unavailable` can be triggered.

For example:

Process A, goroutine 1: takes the flock, then takes the lock on the mutex, and does db.Open
Process A, goroutine 2: takes the flock, blocks on the lock on the mutex
Process A, goroutine 1: does something on the db, calls db.Close, unlocks the flock, releases the lock on the mutex
Process A, goroutine 2: takes the lock on the mutex, but now it has the flock UNLOCKED
Process B, goroutine 1: takes the flock and then takes the lock on the mutex (since it's another process): the ql internal camlistore lock will fail.

The code already has locking via a mutex, but it looks like it was put there just to make the Go race detector happy, and the lock/unlock order is wrong. In fact, this mutex is really needed.

This patch fixes the lock/unlock order.

A future test that really exercises this problem will need to spawn two processes, with at least one of them using multiple goroutines sharing the same store instance (a rough Go sketch of such a driver follows the repro below).

Here's an example repro:

```
package main

import (
        "fmt"
        "os"

        "github.com/coreos/rkt/store"
)

func main() {
        s, err := store.NewStore("/tmp/test/rkt")
        if err != nil {
                fmt.Printf("cannot open store: %v", err)
                os.Exit(1)
        }

        runs := 10000
        for i := 0; i < runs; i++ {
                go func() {
                        _, err = s.GetAllACIInfos(nil, false)
                        if err != nil {
                                fmt.Printf("cannot query db: %v", err)
                                os.Exit(1)
                        }
                }()
        }

        s.Close()
}
```

Compile and run the above program with something like:

```
#!/bin/bash
while true; do
        for i in $(seq 0 2); do
                ./test01 &
        done
        wait
done
```

Should fix #2374
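For reference, the same driver loop can also be written in Go — a minimal sketch, assuming the repro above has been compiled to ./test01 as in the script (nothing here is rkt code):

```
package main

import (
	"log"
	"os/exec"
	"sync"
)

func main() {
	// Spawn a few copies of the compiled repro in parallel so that
	// multiple OS processes (each running many goroutines against a
	// shared store instance) hit the same ql db concurrently.
	for {
		var wg sync.WaitGroup
		for i := 0; i < 3; i++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				out, err := exec.Command("./test01").CombinedOutput()
				if err != nil {
					log.Fatalf("repro failed: %v\n%s", err, out)
				}
			}()
		}
		wg.Wait()
	}
}
```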

```
return nil
}

func (dl *dbLock) unlock() error {
	if err := dl.fl.Unlock(); err != nil {
		dl.Unlock()
		return err
	}
	dl.Unlock()
```
Contributor

why not just call dl.Unlock() at the top of the function?

Member

You want to always unlock the mutex guarding the fileLock second since it's the "outer lock". Unlocking it first would leave the file lock unprotected.

I do think this function would be slightly clearer with a named return, e.g.

```
// named return value err
err = dl.fl.Unlock()
dl.Unlock()
```
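Spelled out, that named-return variant could look roughly like this (an illustrative sketch, not the code in this PR):

```
func (dl *dbLock) unlock() (err error) {
	// fl.Unlock's error (if any) is returned, but the mutex guarding the
	// file lock is always released second, since it's the "outer" lock.
	err = dl.fl.Unlock()
	dl.Unlock()
	return
}
```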

Contributor

what about using defer? (trying to learn better go)

Member

a defer dl.Unlock() at the top would be functionally equivalent, yeah. Whether that or a named return is clearer is debatable.
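For comparison, the defer variant would look roughly like this (again just a sketch):

```
func (dl *dbLock) unlock() error {
	// The deferred call runs after fl.Unlock returns, so the mutex that
	// guards the file lock is still released second.
	defer dl.Unlock()
	return dl.fl.Unlock()
}
```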

Contributor Author

why not just call dl.Unlock() at the top of the function?

If your question is why I inverted the order, putting the sync.Mutex lock before the flock: that's because the sync.Mutex must protect the flock. The reason is that calling flock.ExclusiveLock after a previous ExclusiveLock just continues (it doesn't block), since it's on the same file descriptor. This can cause something like this:

  • Process A, goroutine 1: takes the flock, then takes the lock on the mutex, and does db.Open
  • Process A, goroutine 2: takes the flock, blocks on the lock on the mutex
  • Process A, goroutine 1: does something on the db, calls db.Close, releases the lock on the mutex, unlocks the flock
  • Process A, goroutine 2: takes the lock on the mutex, but now it has the flock UNLOCKED
  • Process B, goroutine 1: takes the lock on the mutex (since it's another process), and takes the flock: the ql camlistore lock will fail.

So the sync.Mutex, which is really exclusive (per store instance), must go before the flock.

Obviously there must be only one store instance (on the same dir) per process. This is what rkt actually does (I haven't found any wrong store usage).
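In other words, after the fix the lock path takes the mutex first and the flock second, roughly like this (a sketch of the idea, not the exact patch; the method name lock is assumed):

```
func (dl *dbLock) lock() error {
	// Take the per-store-instance mutex first: a second ExclusiveLock on
	// the same fd would succeed within the same process, so only the
	// mutex actually serializes goroutines sharing a store instance.
	dl.Lock()
	if err := dl.fl.ExclusiveLock(); err != nil {
		dl.Unlock()
		return err
	}
	return nil
}
```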

@sjpotter
Contributor
sjpotter commented Apr 7, 2016

So, good news: with this patch I seem to have been able to pass a k8s test that I never passed before.

Bad news: I still see similar error output.

e2e-tests-sched-pred-aboh9   2016-04-07 10:09:04 -0700 PDT   2016-04-07 10:09:04 -0700 PDT   1         maxp-252   Pod       spec.containers{maxp-252}   Warning   Failed    {kubelet spotter-rkt-minion-gtjj}   Failed to create rkt container with error: failed to run [prepare --quiet --pod-manifest /tmp/manifest-maxp-252-392804414]: exit status 1
stdout: 
stderr: prepare: cannot open store: cannot acquire lock: resource temporarily unavailable

and

e2e-tests-sched-pred-aboh9   2016-04-07 10:10:38 -0700 PDT   2016-04-07 10:10:38 -0700 PDT   1         maxp-48   Pod                 Warning   FailedSync   {kubelet spotter-rkt-minion-gtjj}   Error syncing pod, skipping: failed to SyncPod: failed to run [prepare --quiet --pod-manifest /tmp/manifest-maxp-48-665407762]: exit status 1
stdout: 
stderr: image: using image from file /opt/rkt/stage1-coreos.aci
prepare: cannot acquire lock: resource temporarily unavailable

I see a number of these.

```
return nil
}

func (dl *dbLock) unlock() error {
	if err := dl.fl.Unlock(); err != nil {
		dl.Unlock()
```
Member

I think you can end up in a bad state here.

If the fl.Unlock here fails, it's very possible that the file lock is still held, and yet you release the guarding mutex here. The lock function will now never succeed (returning an error each time).

However, it's worse than that, since the caller can't fix this state. If the caller doesn't call unlock again, they're broken in that lock will never work again. However, if they call unlock again to try and fix this, they'll end up hitting dl.Unlock with it already being unlocked, which is a panic.

I think the correct fix is to keep the dl.Lock move above, but to only Unlock here if the fl.Unlock succeeds, and then have a caller up the line retry.

There are a couple of additional subtleties, in that an error from fl.Unlock does not necessarily mean that the lock can ever be unlocked, so you really need to inspect the error and behave differently depending on it.

It's possible I'm missing something here and the above scenario can't happen, or is somehow handled up the call path (which I am not familiar with).

Contributor Author

If the fl.Unlock here fails

Looking at man 2 flock, it looks like the only way an unlock can fail is if the underlying fd has been closed. That should be a programming error, but I'm not sure whether there are other cases.

How to handle this possibility?
Usually rkt commands will get the error and exit; the rkt api service, instead, will continue to use that store instance. It should probably handle this error specially, closing that store instance and creating a new one (this needs to be handled correctly in different ways).
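A very rough sketch of what that special handling could look like in a long-running caller (purely hypothetical; reopenStore is an invented helper, and only store.NewStore and Close from the repro above are assumed):

```
// reopenStore is a hypothetical helper: when an unlock failure makes a
// store instance unusable, drop it and build a fresh one instead of
// keeping a possibly stuck lock around.
func reopenStore(old *store.Store, dir string) (*store.Store, error) {
	old.Close() // best effort: the old instance is considered broken anyway
	return store.NewStore(dir)
}
```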

Contributor

defer dl.Unlock() at the top sounds good to me.

Member

deferring it doesn't alleviate my concern that we're unlocking it in the case of an fl.Unlock error.

I agree that it's not likely that the fl unlock will fail. Because of that, having a log or panic or store-instance cleanup all seem like plausible paths forwards for this PR. I'm fine with my concern not being fully fixed, since it's out of scope of the other fix here. At the very least though, I'd want a comment.

Contributor Author

I agree that it's not likely that the fl unlock will fail. Because of that, having a log or panic or store-instance cleanup all seem like plausible paths forwards for this PR. I'm fine with my concern not being fully fixed, since it's out of scope of the other fix here. At the very least though, I'd want a comment.

@euank I added a (quite long) comment trying to explain the problem. In the meantime I removed the mutex unlock and made it panic when the flock unlock fails (let's see if and when that happens).
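So the unlock path now looks roughly like this (a sketch of the approach just described, not a verbatim quote of the patch):

```
func (dl *dbLock) unlock() error {
	// A failing flock unlock should only happen on a programming error
	// (e.g. the underlying fd was closed); there's no sane way to
	// recover the lock state here, so fail loudly.
	if err := dl.fl.Unlock(); err != nil {
		panic(err)
	}
	dl.Unlock()
	return nil
}
```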

@sgotti
Contributor Author
sgotti commented Apr 7, 2016

Bad news: I still see similar error output.

Hmm, was the rkt api service restarted? Which tests? (So I can try them locally.)

I have to leave for a few hours; I'll answer the other questions when I'm back.

@sjpotter
Contributor
sjpotter commented Apr 7, 2016

A1: yes, started fresh

A2: [Fail] [k8s.io] SchedulerPredicates [Serial] [It] validates MaxPods limit number of pods that are allowed to run [Slow]

I'm basically using https://github.com/yifan-gu/kubernetes/tree/gce_coreos_rkt (i.e. branch is gce_coreos_rkt)

I set up the environment currently with:

```
BUILD_PYTHON_IMAGE=true
KUBE_CONTAINER_RUNTIME=rkt
KUBE_ENABLE_CLUSTER_MONITORING=standalone
KUBE_ENABLE_NODE_LOGGING=false
KUBE_GCE_INSTANCE_PREFIX=${USER}-rkt
KUBE_GCE_MASTER_IMAGE=coreos-alpha-962-0-0-v20160218
KUBE_GCE_MASTER_PROJECT=coreos-cloud
KUBE_GCE_NETWORK=${USER}-rkt
KUBE_GCE_NODE_IMAGE=coreos-alpha-960-0-0-v20160217
KUBE_GCE_NODE_PROJECT=coreos-cloud
KUBE_GCE_ZONE=us-east1-b
KUBE_OS_DISTRIBUTION=coreos
KUBE_RKT_VERSION=rkt-${USER}-hack
KUBE_SKIP_UPDATE=y
```

I build by running (the PYTHON env var above is important for this):

make quick-release

I set up the cluster by doing a

go run hack/e2e.go -v -up

this downloads an appropriate rkt build from our gcloud storage as specified by KUBE_RKT_VERSION

run the specific test by doing a

go run hack/e2e.go -v -test -test_args="-ginkgo.focus=validates.MaxPods.limit"

in another window I do

 ./cluster/kubectl.sh get events -w --all-namespaces

and I see the error output

```
@@ -34,9 +34,11 @@ const (
// dbLock is used to guarantee both thread-safety and process-safety
// for db access.
type dbLock struct {
	// This lock is to make that access to the ql db file being blocking
	// since ql use an internal locking but it won't block but return an
```
Contributor

s/but return/and return ?

@sgotti
Contributor Author
sgotti commented Apr 7, 2016

A1: yes, started fresh

A2: [Fail] [k8s.io] SchedulerPredicates [Serial] [It] validates MaxPods limit number of pods that are allowed to run [Slow]

I tried this test multiple times (on a local k8s git master cluster) but I can't reproduce it with this fix (without it, I can reproduce it every time). That's really strange, since the other test I was using (kubelet should be able to delete 10 pods per node in 1m0s) does the same rkt prepare (with fewer pods).

Sorry for being pedantic 😄 but are you really sure that you're running a patched rkt version?

```
if err := dl.fl.ExclusiveLock(); err != nil {
	dl.Unlock()
```
Contributor

So, I went back through the history and found that the reason we do flock() before the mutex is #1892 (comment)

But I think this fixes the problem.

@yifan-gu
Contributor
yifan-gu commented Apr 7, 2016

Bad news: I still see similar error output.
e2e-tests-sched-pred-aboh9 2016-04-07 10:09:04 -0700 PDT 2016-04-07 10:09:04 -0700 PDT 1 maxp-252 Pod spec.containers{maxp-252} Warning Failed {kubelet spotter-rkt-minion-gtjj} Failed to create rkt container with error: failed to run [prepare --quiet --pod-manifest /tmp/manifest-maxp-252-392804414]: exit status 1
stdout:
stderr: prepare: cannot open store: cannot acquire lock: resource temporarily unavailable

@sgotti @sjpotter Is the error coming from the ql lock or our flock? We should add some logging here...

@sgotti
Contributor Author
sgotti commented Apr 7, 2016

@sgotti @sjpotter Is the error coming from the ql lock or our flock? We should add some logging here...

It's coming from the ql camlistore locking. You're right, and I also wanted to improve the logging, but I wanted to keep this patch focused (perhaps a new PR would be better).
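For instance, the flock error could be wrapped with some context so it's distinguishable from the ql internal one — just a sketch of the kind of logging that could be added, not code from this PR:

```
if err := dl.fl.ExclusiveLock(); err != nil {
	dl.Unlock()
	// Make it clear this failure comes from our flock, not from ql's
	// internal camlistore lock.
	return fmt.Errorf("error acquiring the db flock: %v", err)
}
```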

@sjpotter
Contributor
sjpotter commented Apr 7, 2016

sgotti: pretty sure,

```
spotter-rkt-minion-60dh spotter # systemctl cat kubernetes-install-rkt
# /etc/systemd/system/kubernetes-install-rkt.service
[Unit]
Description=Fetch rkt
Documentation=http://github.com/coreos/rkt
Requires=network-online.target
After=network-online.target
Requires=kube-env.service
After=kube-env.service
[Service]
Type=oneshot
RemainAfterExit=yes
EnvironmentFile=/etc/kube-env
ExecStartPre=/usr/bin/mkdir -p /etc/rkt
ExecStartPre=/usr/bin/mkdir -p /opt/downloads
ExecStartPre=/usr/bin/curl --location --create-dirs --output /opt/downloads/rkt.tar.gz https://storage.googleapis.com/rkt-bin/${RKT_VERSION}.tar.gz
ExecStartPre=/usr/bin/tar xf /opt/downloads/rkt.tar.gz -C /opt --overwrite
ExecStartPre=/usr/bin/rm -fr /opt/rkt
ExecStart=/usr/bin/mv /opt/bin /opt/rkt
spotter-rkt-minion-60dh spotter # systemctl cat rkt-api-service
# /etc/systemd/system/rkt-api-service.service
[Unit]
Description=Start rkt API service as Daemon
Requires=kubernetes-install-rkt.service
After=kubernetes-install-rkt.service
[Service]
ExecStart=/opt/rkt/rkt api-service
Restart=always
RestartSec=10
```

Can you attach your rkt tarball so I can try again?

@sjpotter
Contributor
sjpotter commented Apr 7, 2016

OK, so I patched head (1.3.0+) with this and it works perfectly, no unexpected output.

@yifan-gu
Contributor
yifan-gu commented Apr 7, 2016

OK, so I patched head (1.3.0+) with this and it works perfectly, no unexpected output.

@sjpotter Awesome! Sounds like big progress. Though we need to fix the symlink issue with 1.3.

@sgotti
Contributor Author
sgotti commented Apr 7, 2016

OK, so I patched head (1.3.0+) with this and it works perfectly, no unexpected output.

That sounds great! For me it also worked on a patched 1.2.1; perhaps there was some bad HTTP caching...

Tomorrow I'll update the PR with all the suggestions.

@yifan-gu
Contributor
yifan-gu commented Apr 8, 2016

@sgotti Thank you so very much!

@sgotti sgotti force-pushed the fix_store_concurrency branch from 78fc8d5 to 76712a7 Compare April 8, 2016 09:03
@sgotti
Contributor Author
sgotti commented Apr 8, 2016

Thanks a lot for all your reviews and tests.

I tried to implement the various suggestions. Now, if the flock unlock fails, the code will panic, since there's no easy way to recover from it. @yifan-gu, if this ever happens, the rkt api service should be respawned by a systemd restart policy or something similar.

Some follow-up patches should probably:

  • Improve the error reporting.
  • @euank I noticed that when I wrote it I was ignoring the sqldb.Close errors; I'd like to also handle them. Thoughts?

```
@@ -78,6 +78,9 @@ func createTable(db *DB, t *testing.T) {
}

func TestDBRace(t *testing.T) {
	// TODO(sgotti) this will not find concurrenct accesses to ql db from
	// multiple processes using multiple goroutines. A test that spawns at
```
Contributor

@sgotti I don't really understand your use of the word "process" in this context (and in the original comment). Isn't this just a problem within a single process with multiple goroutines?

Contributor Author

@jonboulle Sorry, it's difficult to explain. What I meant was that:

The store db model is based on doing short-lived Open/Tx/Close cycles, so that the ql db is accessible by multiple processes and/or goroutines without long blocking and possible deadlocks.

The "cannot acquire lock: resource temporarily unavailable" doesn't happen with:

  • Multiple OS processes, none of which has multiple goroutines using a shared store instance.
  • Only one OS process accessing the db (even if it has multiple goroutines using a shared store instance, like this test and the rkt api-service).

It happens only when there are multiple OS processes concurrently accessing the ql db and at least one of them has multiple goroutines using a shared store instance.

Plus, after more analysis, I think that for the case

  • Only one OS process accessing the db (even if it has multiple goroutines using a shared store instance, like this test and the rkt api-service).

the claim in the commit comment that it won't give an error but can cause possible db corruption is probably not true, so I'll remove it from the commit comment.

Contributor Author

@jonboulle I changed the commit comment, removing the very confusing description and adding an example.

@sgotti sgotti force-pushed the fix_store_concurrency branch 3 times, most recently from 3e70d76 to d4c702d Compare April 8, 2016 10:58
When there are multiple OS processes concurrently accessing the ql
db and at least one of them has multiple goroutines using a shared store
instance the internal ql (camlistore based) locking error: `cannot
acquire lock: resource temporarily unavailable` can be triggered.

For example:

Process A, goroutine 1: takes the flock, then takes the lock on the mutex, and does db.Open
Process A, goroutine 2: takes the flock, blocks on the lock on the mutex
Process A, goroutine 1: does something on the db, calls db.Close, unlocks the flock, releases the lock on the mutex
Process A, goroutine 2: takes the lock on the mutex, but now it has the flock UNLOCKED
Process B, goroutine 1: takes the flock and then takes the lock on the mutex (since it's another process): the ql internal camlistore lock will fail.

The code already has locking via a mutex, but it looks like it was put there just to make the Go race detector happy, and the lock/unlock order is wrong. In fact, this mutex is really needed.

This patch fixes the lock/unlock order.

A future test that really exercises this problem will need to spawn two processes, with at least one of them using multiple goroutines sharing the same store instance.

Here's an example repro:

```
package main

import (
        "fmt"
        "os"

        "github.com/coreos/rkt/store"
)

func main() {
        s, err := store.NewStore("/tmp/test/rkt")
        if err != nil {
                fmt.Printf("cannot open store: %v", err)
                os.Exit(1)
        }

        runs := 10000
        for i := 0; i < runs; i++ {
                go func() {
                        _, err = s.GetAllACIInfos(nil, false)
                        if err != nil {
                                fmt.Printf("cannot query db: %v", err)
                                os.Exit(1)
                        }
                }()
        }

        s.Close()
}

```

Compile and run the above program with something like:

```
#!/bin/bash
while true; do
        for i in $(seq 0 2); do
                ./test01 &
        done
        wait
done
```
@sgotti sgotti force-pushed the fix_store_concurrency branch from d4c702d to 53104d0 Compare April 8, 2016 11:02
@sgotti sgotti changed the title store: fix goroutines race on db. store: fix multi process with multi goroutines race on db. Apr 8, 2016
@euank
Member
euank commented Apr 8, 2016

By my reading, since there's an early return when there's an sqldb.Close error, you end up in a state with a held lock and no easy path forwards. I think that's a discussion for another issue, but that, too, looks to be fraught with peril.

I'm 👍 on merging this, and giving fun error scenarios more thought in a new issue.

@jonboulle
Contributor

Oh dammit, I even called this problem @ #1892 (comment) when it wasn't an issue (because at the time we were creating a new dbLock for every txn) - but then I missed that it was introduced after all @ #1892 (comment) :-(

@jonboulle
Contributor

LGTM. @euank can you file a follow-up?

thanks @sgotti !

@jonboulle jonboulle merged commit baa7f1f into rkt:master Apr 11, 2016
@jonboulle jonboulle added this to the v1.4.0 milestone Apr 11, 2016
Successfully merging this pull request may close these issues:

occasional failures to prepare in rktnetes testing