How to simulate most monitor failure scenarios? · Issue #323 · rook/kubectl-rook-ceph · GitHub
How to simulate most monitor failure scenarios? #323

Closed
haoxiaoci opened this issue Sep 9, 2024 · 8 comments · Fixed by #324
Comments

haoxiaoci commented Sep 9, 2024

How do I simulate a failure of the majority of the monitors?

In my test environment I have 3 healthy mons, the rook-operator works fine, the ceph cluster is healthy, and mon-a is the leader:

root@node1:/home/haoxiaoci# kubectl get pod -n rook-ceph -owide | grep mon
rook-ceph-mon-a-74dd5d44f7-xb875                                  2/2     Running     0              8d      10.46.155.135    node1   <none>           <none>
rook-ceph-mon-c-5d8c6b79d5-f68vw                                  2/2     Running     0              14m     10.46.218.207    node2   <none>           <none>
rook-ceph-mon-d-566c87c699-x2nxs                                  2/2     Running     0              2m57s   10.46.218.208    node3   <none>           <none>

# mon-a pod
[root@node1 ceph]# ceph daemon mon.a mon_status |head -n 10
{
    "name": "a",
    "rank": 0,
    "state": "leader",
    "election_epoch": 70,
    "quorum": [
        0,
        1,
        2
    ],
# in tools pod
[root@node3 /]# ceph -s
  cluster:
    id:     a9b5698d-456f-47ea-93b5-7caddbebc808
    health: HEALTH_WARN
            noout flag(s) set
 
  services:
    mon: 3 daemons, quorum a,c,d (age 3m)
    mgr: b(active, since 8d), standbys: a
    osd: 18 osds: 18 up (since 8d), 18 in (since 11d)
         flags noout
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    pools:   11 pools, 1121 pgs
    objects: 65.41k objects, 245 GiB
    usage:   2.2 TiB used, 138 TiB / 140 TiB avail
    pgs:     1121 active+clean
 
  io:
    client:   3.1 MiB/s wr, 0 op/s rd, 79 op/s wr
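
For reference, the leader can also be confirmed from the tools pod without attaching to a mon's admin socket; the commands below are standard Ceph CLI calls, and the exact output fields (e.g. quorum_leader_name) may vary slightly between releases.

# in tools pod: the "quorum_leader_name" field identifies the current leader
ceph quorum_status -f json-pretty
# shorter one-line summary of the mon map and quorum
ceph mon stat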

In order to simulate a majority of the mons being offline, I scaled down mon-d and mon-c:

kubectl scale -n rook-ceph deploy/rook-ceph-mon-d --replicas=0
kubectl scale -n rook-ceph deploy/rook-ceph-mon-c --replicas=0
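
Side note (an assumption about the setup, not something verified in this issue): the Rook operator periodically reconciles the mon deployments and, after the mon failover timeout, may try to bring replacement mons up, which can undo this simulation. If the failure should persist, one option is to pause the operator first and scale it back afterwards:

# optional: keep the operator from failing over / restoring the downed mons
kubectl -n rook-ceph scale deploy/rook-ceph-operator --replicas=0
# run the simulation, then restore the operator when done
kubectl -n rook-ceph scale deploy/rook-ceph-operator --replicas=1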

Then I checked mon-a's state and found it is in the probing state:

[root@node1 ceph]# ceph daemon mon.a mon_status |head -n 10
{
    "name": "a",
    "rank": 0,
    "state": "probing",
    "election_epoch": 74,
    "quorum": [],
    "features": {
        "required_con": "2449958755906961412",
        "required_mon": [
            "kraken",

Then I can't use restore-quorum to restore c and d from the good mon a:

root@node1:/home/haoxiaoci# kubectl rook-ceph mons restore-quorum a

Info: mon "a" state is "probing"
Error: mon "a" in "probing" state but must be in leader/peon state
subhamkrai (Collaborator)

@haoxiaoci "state": "probing" should only be a temporary state.

haoxiaoci (Author)

It has been in the probing state for more than 10 minutes.

subhamkrai (Collaborator)

In that case, try removing the mons from the cluster manually; see https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/#removing-a-monitor-manual
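
Since quorum is already lost here, a cluster-level ceph mon remove will hang, so the applicable procedure is the one for an unhealthy cluster on that same doc page: stop the surviving mon, edit the monmap, and inject it back. A rough sketch is below; in a Rook cluster this would normally be done inside the mon-a pod (or a debug copy of it) with the daemon stopped, and the exact paths/flags depend on the image, so treat it as an outline rather than exact commands.

# inside the (stopped) mon-a container -- outline only
ceph-mon -i a --extract-monmap /tmp/monmap   # dump the current monmap
monmaptool /tmp/monmap --rm c                # remove the dead mons from the map
monmaptool /tmp/monmap --rm d
ceph-mon -i a --inject-monmap /tmp/monmap    # write the edited map back
# then start mon-a again so it can form a single-mon quorum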

haoxiaoci (Author)

After scaling mon-c and mon-d to 0, ceph commands hang, so I can't remove mon c or d manually.
Also, is this the right way to simulate this scenario? Are my steps correct?
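
For what it's worth, the hang is expected: cluster-level ceph commands need an established quorum, so with two of three mons down they block. Commands that talk to a single daemon over its admin socket (as already used above) still respond, and the CLI also accepts a connect timeout so it fails fast instead of hanging; rough examples:

# inside the mon-a pod: admin-socket commands work without quorum
ceph daemon mon.a mon_status
ceph daemon mon.a quorum_status
# from the tools pod: give up quickly instead of hanging forever
ceph -s --connect-timeout 10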

subhamkrai (Collaborator)

@haoxiaoci you can try setting the mon count to 1 in the cluster.yaml config file, or you can first scale the mon pods back up and try again.
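
For reference, the two options above could look roughly like this; the CephCluster resource name rook-ceph is an assumption, so adjust it to the actual cluster name:

# option 1: set the mon count to 1 in the CephCluster spec (same effect as editing cluster.yaml)
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p '{"spec":{"mon":{"count":1}}}'
# option 2: scale the mon deployments back up and retry restore-quorum
kubectl -n rook-ceph scale deploy/rook-ceph-mon-c --replicas=1
kubectl -n rook-ceph scale deploy/rook-ceph-mon-d --replicas=1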

travisn (Member) commented Sep 9, 2024

@subhamkrai I wonder if the tool should also allow resetting the quorum in case the state is probing. I saw another case today as well where the cluster's mon was in this state and the tool skipped the restore, when perhaps it would have helped. Can we test (at least manually) whether it works as expected if we allow the restore in the probing state?

haoxiaoci (Author)

Not every mon in the probing state is a "good" mon. I think this project should clarify the specific scenarios in which restoring quorum from a mon in the probing state is appropriate.

travisn (Member) commented Sep 16, 2024

Thanks for the feedback; let's continue the discussion in #326.
