How to simulate most monitor failure scenarios? · Issue #323 · rook/kubectl-rook-ceph · GitHub
How to simulate most monitor failure scenarios? #323

Closed
haoxiaoci opened this issue Sep 9, 2024 · 8 comments · Fixed by #324
Comments

haoxiaoci commented Sep 9, 2024

How do I simulate a failure of the majority of the monitors?

In my test environment I have 3 healthy mons, the rook-operator works fine, the ceph cluster is healthy, and mon-a is the leader:

root@node1:/home/haoxiaoci# kubectl get pod -n rook-ceph -owide | grep mon
rook-ceph-mon-a-74dd5d44f7-xb875                                  2/2     Running     0              8d      10.46.155.135    node1   <none>           <none>
rook-ceph-mon-c-5d8c6b79d5-f68vw                                  2/2     Running     0              14m     10.46.218.207    node2   <none>           <none>
rook-ceph-mon-d-566c87c699-x2nxs                                  2/2     Running     0              2m57s   10.46.218.208    node3   <none>           <none>

# mon-a pod
[root@node1 ceph]# ceph daemon mon.a mon_status |head -n 10
{
    "name": "a",
    "rank": 0,
    "state": "leader",
    "election_epoch": 70,
    "quorum": [
        0,
        1,
        2
    ],
# in tools pod
[root@node3 /]# ceph -s
  cluster:
    id:     a9b5698d-456f-47ea-93b5-7caddbebc808
    health: HEALTH_WARN
            noout flag(s) set
 
  services:
    mon: 3 daemons, quorum a,c,d (age 3m)
    mgr: b(active, since 8d), standbys: a
    osd: 18 osds: 18 up (since 8d), 18 in (since 11d)
         flags noout
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    pools:   11 pools, 1121 pgs
    objects: 65.41k objects, 245 GiB
    usage:   2.2 TiB used, 138 TiB / 140 TiB avail
    pgs:     1121 active+clean
 
  io:
    client:   3.1 MiB/s wr, 0 op/s rd, 79 op/s wr
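
For reference, the leader can also be confirmed from the tools pod without attaching to a mon's admin socket; the commands below are standard Ceph CLI calls, and the exact output fields (e.g. quorum_leader_name) may vary slightly between releases.

# in tools pod: the "quorum_leader_name" field identifies the current leader
ceph quorum_status -f json-pretty
# shorter one-line summary of the mon map and quorum
ceph mon stat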

In order to simulate a majority of the mons being offline, I scaled down mon-d and mon-c:

kubectl scale -n rook-ceph deploy/rook-ceph-mon-d --replicas=0
kubectl scale -n rook-ceph deploy/rook-ceph-mon-c --replicas=0
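
Side note (an assumption about the setup, not something verified in this issue): the Rook operator periodically reconciles the mon deployments and, after the mon failover timeout, may try to bring replacement mons up, which can undo this simulation. If the failure should persist, one option is to pause the operator first and scale it back afterwards:

# optional: keep the operator from failing over / restoring the downed mons
kubectl -n rook-ceph scale deploy/rook-ceph-operator --replicas=0
# run the simulation, then restore the operator when done
kubectl -n rook-ceph scale deploy/rook-ceph-operator --replicas=1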

Then I checked mon-a's state and found it is in the probing state:

[root@node1 ceph]# ceph daemon mon.a mon_status |head -n 10
{
    "name": "a",
    "rank": 0,
    "state": "probing",
    "election_epoch": 74,
    "quorum": [],
    "features": {
        "required_con": "2449958755906961412",
        "required_mon": [
            "kraken",

Then I can't use restore-quorum to restore c and d from the good mon a:

root@node1:/home/haoxiaoci# kubectl rook-ceph mons restore-quorum a

Info: mon "a" state is "probing"
Error: mon "a" in "probing" state but must be in leader/peon state
subhamkrai (Collaborator)

@haoxiaoci "state": "probing" should only be a temporary state.

haoxiaoci (Author)

It has been in the probing state for more than 10 minutes.

subhamkrai (Collaborator)

In that case, try removing the mons from the cluster manually; see https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#removing-a-monitor-manual
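
Since quorum is already lost here, a cluster-level ceph mon remove will hang, so the applicable procedure is the one for an unhealthy cluster on that same doc page: stop the surviving mon, edit the monmap, and inject it back. A rough sketch is below; in a Rook cluster this would normally be done inside the mon-a pod (or a debug copy of it) with the daemon stopped, and the exact paths/flags depend on the image, so treat it as an outline rather than exact commands.

# inside the (stopped) mon-a container -- outline only
ceph-mon -i a --extract-monmap /tmp/monmap   # dump the current monmap
monmaptool /tmp/monmap --rm c                # remove the dead mons from the map
monmaptool /tmp/monmap --rm d
ceph-mon -i a --inject-monmap /tmp/monmap    # write the edited map back
# then start mon-a again so it can form a single-mon quorum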

haoxiaoci (Author)

After scaling mon-c and mon-d to 0, ceph commands hang, so I can't remove mon c or d manually.
Also, is this the right way to simulate this scenario? Are my steps correct?
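
For what it's worth, the hang is expected: cluster-level ceph commands need an established quorum, so with two of three mons down they block. Commands that talk to a single daemon over its admin socket (as already used above) still respond, and the CLI also accepts a connect timeout so it fails fast instead of hanging; rough examples:

# inside the mon-a pod: admin-socket commands work without quorum
ceph daemon mon.a mon_status
ceph daemon mon.a quorum_status
# from the tools pod: give up quickly instead of hanging forever
ceph -s --connect-timeout 10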

subhamkrai (Collaborator)

@haoxiaoci you can try setting the mon count to 1 in the cluster.yaml config file, or you can first scale the mon pods back up and try again.
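
For reference, the two options above could look roughly like this; the CephCluster resource name rook-ceph is an assumption, so adjust it to the actual cluster name:

# option 1: set the mon count to 1 in the CephCluster spec (same effect as editing cluster.yaml)
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p '{"spec":{"mon":{"count":1}}}'
# option 2: scale the mon deployments back up and retry restore-quorum
kubectl -n rook-ceph scale deploy/rook-ceph-mon-c --replicas=1
kubectl -n rook-ceph scale deploy/rook-ceph-mon-d --replicas=1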

travisn (Member) commented Sep 9, 2024

@subhamkrai I wonder if the tool should also allow resetting the quorum in case the state is probing. I saw another case today as well where the cluster's mon was in this state and the tool skipped the restore, when perhaps it would have helped. Can we test (at least manually) whether it works as expected if we allow the restore in the probing state?

haoxiaoci (Author)

Not every mon in the probing state is a "good" mon. I think this project should clarify the specific scenarios in which restoring quorum from a mon in the probing state is appropriate.

travisn (Member) commented Sep 16, 2024

Thanks for the feedback; let's continue the discussion in #326.
