Progression can't go backwards by ELENAGER · Pull Request #1880 · RamenDR/ramen

Progression can't go backwards #1880

Open · ELENAGER wants to merge 1 commit into main from DFBUGS-612

Conversation

@ELENAGER (Member) commented Mar 3, 2025

Fixes: https://issues.redhat.com/browse/DFBUGS-612

We have encountered several bugs where DRPC progression unexpectedly reverts from advanced stages to earlier ones. In this particular case, it regresses during Failover from Completed to WaitingForReadiness due to conflicts in VRG generations. While these changes typically resolve themselves over time, users may find the regression alarming.

To address this, we will introduce a check before updating the DRPC status to prevent regression, ensuring it remains at the previous progression level. However, this safeguard will apply only when the Phase remains unchanged—when transitioning between Phases, progression can still move in either direction.
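
As a rough sketch of the idea (the function and parameter names below are illustrative, not the exact code in this PR), the guard can look up both the current and the proposed progression in the ordered list of steps for the active workflow, using the standard slices package, and keep the current value whenever the proposed one would move backwards while the Phase is unchanged:

// Illustrative sketch of the guard described above. progressionList is the
// ordered list of steps for the current workflow (deploy, failover or relocate).
func guardProgression(progressionList []rmn.ProgressionStatus,
	current, next rmn.ProgressionStatus, phaseChanged bool,
) rmn.ProgressionStatus {
	if phaseChanged {
		// Between Phases, progression may legitimately move in either direction.
		return next
	}

	currentIndex := slices.Index(progressionList, current)
	nextIndex := slices.Index(progressionList, next)

	// Keep the previous progression if the new value would be a regression.
	if currentIndex != -1 && nextIndex != -1 && nextIndex < currentIndex {
		return current
	}

	return next
}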

@ELENAGER ELENAGER requested review from nirs and BenamarMk March 3, 2025 16:26
@ELENAGER ELENAGER force-pushed the DFBUGS-612 branch 3 times, most recently from c6ded1b to 9bc6ff5 Compare March 4, 2025 10:18
Comment on lines 2304 to 2341
var deployProgressionList = []rmn.ProgressionStatus{
	rmn.ProgressionCreatingMW,
	rmn.ProgressionUpdatingPlRule,
	rmn.ProgressionEnsuringVolSyncSetup,
	rmn.ProgressionSettingupVolsyncDest,
	rmn.ProgressionCompleted,
}

var failoverProgressionList = []rmn.ProgressionStatus{
	rmn.ProgressionCheckingFailoverPrerequisites,
	rmn.ProgressionWaitForFencing,
	rmn.ProgressionWaitForStorageMaintenanceActivation,
	rmn.ProgressionFailingOverToCluster,
	rmn.ProgressionWaitingForResourceRestore,
	rmn.ProgressionEnsuringVolSyncSetup,
	rmn.ProgressionSettingupVolsyncDest,
	rmn.ProgressionWaitForReadiness,
	rmn.ProgressionUpdatedPlacement,
	rmn.ProgressionCompleted,
	rmn.ProgressionCleaningUp,
	rmn.ProgressionWaitOnUserToCleanUp,
}

var relocateProgressionList = []rmn.ProgressionStatus{
	rmn.ProgressionPreparingFinalSync,
	rmn.ProgressionClearingPlacement,
	rmn.ProgressionRunningFinalSync,
	rmn.ProgressionFinalSyncComplete,
	rmn.ProgressionEnsuringVolumesAreSecondary,
	rmn.ProgressionWaitOnUserToCleanUp,
	rmn.ProgressionCompleted,
	rmn.ProgressionCleaningUp,
	rmn.ProgressionWaitingForResourceRestore,
	rmn.ProgressionWaitForReadiness,
	rmn.ProgressionUpdatedPlacement,
	rmn.ProgressionEnsuringVolSyncSetup,
	rmn.ProgressionSettingupVolsyncDest,
}
@raaizik (Member) commented Mar 4, 2025

IMHO, using Golang's enumerated type is better. It's more straightforward and spares the need to handle hardcoded lists. Simply refactor ProgressionStatus in drplacementcontroler_types.go like so (and divide it into groups):

type ProgressionStatus int

const (
	ProgressionCompleted ProgressionStatus = iota
	ProgressionCreatingMW
	ProgressionUpdatingPlRule
	ProgressionWaitForReadiness
	ProgressionCleaningUp
	ProgressionWaitOnUserToCleanUp
	ProgressionCheckingFailoverPrerequisites
	ProgressionFailingOverToCluster
	ProgressionWaitForFencing
	ProgressionWaitForStorageMaintenanceActivation
	ProgressionPreparingFinalSync
	ProgressionClearingPlacement
	ProgressionRunningFinalSync
	ProgressionFinalSyncComplete
	ProgressionEnsuringVolumesAreSecondary
	ProgressionWaitingForResourceRestore
	ProgressionUpdatedPlacement
	ProgressionEnsuringVolSyncSetup
	ProgressionSettingupVolsyncDest
	ProgressionEnsuringVolSyncDestSetup
	ProgressionDeleting
	ProgressionDeleted
	ProgressionActionPaused
)

func (ps ProgressionStatus) String() string {
	return [...]string{
		"Completed",
		"CreatingMW",
		"UpdatingPlRule",
		"WaitForReadiness",
		"Cleaning Up",
		"WaitOnUserToCleanUp",
		"CheckingFailoverPrerequisites",
		"FailingOverToCluster",
		"WaitForFencing",
		"WaitForStorageMaintenanceActivation",
		"PreparingFinalSync",
		"ClearingPlacement",
		"RunningFinalSync",
		"FinalSyncComplete",
		"EnsuringVolumesAreSecondary",
		"WaitingForResourceRestore",
		"UpdatedPlacement",
		"EnsuringVolSyncSetup",
		"SettingUpVolSyncDest",
		"EnsuringVolSyncDestSetup",
		"Deleting",
		"Deleted",
		"Paused",
	}[ps]
}

func (ps ProgressionStatus) EnumIndex() int {
	return int(ps)
}

@ELENAGER (Member, Author) replied:

@raaizik it may be better from a golang point of view, but we are losing the ordered lists for specific workflows.
Right now, I have explicitly defined sequences of steps (failoverProgressionList, relocateProgressionList, etc.).
With the enum approach, I'd need to implement logic to determine valid transitions instead of iterating over a predefined list.

progressionList = deployProgressionList
}

currentProgressionIndex := slices.Index(progressionList, d.instance.Status.Progression)
A reviewer (Member) suggested:

Suggested change
currentProgressionIndex := slices.Index(progressionList, d.instance.Status.Progression)
currentProgressionIndex := slices.Index(progressionList, d.savedInstanceStatus)

@ELENAGER (Member, Author) replied:

In my PR I am working with the in-memory progression, not with the saved one. I am comparing the current in-memory progression with the newer one I want to set. It is still not the common solution we discussed.

}

currentProgressionIndex := slices.Index(progressionList, d.instance.Status.Progression)
nextProgressionIndex := slices.Index(progressionList, nextProgression)
@raaizik (Member) suggested:

Instead of passing in nextProgression as an argument to this function, you could use:

Suggested change
nextProgressionIndex := slices.Index(progressionList, nextProgression)
nextProgressionIndex := slices.Index(progressionList, d.getProgression())

@ELENAGER (Member, Author) replied Mar 4, 2025:

@raaizik no, I can't use d.getProgression() in my specific case. I call this function before setProgression in that block.

@ELENAGER ELENAGER force-pushed the DFBUGS-612 branch 2 times, most recently from 59f8431 to 2f011c3 Compare March 10, 2025 19:25
@ELENAGER ELENAGER marked this pull request as ready for review March 11, 2025 12:06
@ELENAGER ELENAGER requested a review from ShyamsundarR as a code owner March 11, 2025 12:06
@nirs (Member) commented:

Fixes: https://issues.redhat.com/browse/DFBUGS-612

We have encountered several bugs where DRPC progression unexpectedly reverts from advanced stages to earlier ones. In this particular case, it regresses during Failover from Completed to WaitingForReadiness due to conflicts in VRG generations. While these changes typically resolve themselves over time, users may find the regression alarming.

Can you explain why the progression changes to an earlier value? You wrote "conflicts in VRG generations", but it is not clear why this changes the progression.

ramen is designed to try all checks related to a particular state on every reconcile, so it is expected that a failure at the start of the reconcile will change the state that ramen can report.

To address this, we will introduce a check before updating the DRPC status to prevent regression, ensuring it remains at the previous progression level.

Is it correct to report the progression based on the previously reported progression? In general we should not base our status on the current status but on the spec and the system state.

I think we need analysis of the issue shown in linked bug. We need to understand why the progression changes.

Maybe we should change the progression only when we are sure that the world state has changed, and keep the status unchanged if we cannot tell what is the current world state. For example if we have a stale VRG we need to check again and there is no point in updating the status.

@ELENAGER (Member, Author) commented:

@nirs, to your question about how progression can go backwards, here are two examples of how it happens:

ready := d.checkReadiness(failoverCluster)

and
ready := d.checkReadiness(preferredCluster)

Every time we reconcile and enter the RunRelocate or RunFailover functions, we check the VRG conditions on the target cluster. We do this even if we have already completed the relocation or failover process. Sometimes the readiness check fails not because the VRG conditions changed, but because this verification fails:

condition.ObservedGeneration == vrg.Generation

Then we think that the VRG is not ready and mark our progression as not Completed (i.e. going backwards).
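
For context, the staleness test referred to here follows the usual Kubernetes pattern. A minimal sketch (assuming the VRG conditions are metav1.Condition values and using meta.FindStatusCondition from k8s.io/apimachinery; not the exact Ramen helper) of why a stale condition makes the readiness check fail:

// Sketch: a condition is trusted only when its ObservedGeneration matches the
// VRG's current Generation; otherwise the status still describes an older spec
// and says nothing about the current one.
func conditionIsCurrent(vrg *rmn.VolumeReplicationGroup, condType string) bool {
	cond := meta.FindStatusCondition(vrg.Status.Conditions, condType)
	if cond == nil {
		return false
	}

	return cond.ObservedGeneration == vrg.Generation
}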

@nirs (Member) commented Mar 24, 2025

Then we think that the VRG is not ready and mark our progression as not Completed (i.e. going backwards).

So should we mark the VRG as not completed in this case? This sounds like a bug. The actual issue is that we don't have a valid VRG (the condition generation does not match the VRG generation), so we need to requeue without changing the status.

Maybe if we fix a few instances of this issue we won't need to add infrastructure for preventing the progression from going backward, which may be correct in other cases when the world state has changed.

@raaizik (Member) commented Mar 25, 2025

Then we think that the VRG is not ready and mark our progression as not Completed (i.e. going backwards).

So should we mark the VRG as not completed in this case? This sounds like a bug. The actual issue is that we don't have a valid VRG (the condition generation does not match the VRG generation), so we need to requeue without changing the status.

Agreed. Whenever the metadata's generation differs from the status' observed generation, we need to re-queue (= return true) to allow the resource to stabilize (i.e., become valid).

	rmn.ProgressionEnsuringVolSyncSetup,
	rmn.ProgressionSettingupVolsyncDest,
	rmn.ProgressionCompleted,
}
A reviewer (Member) commented:

I don't know if we can guarantee that progression will never go "backward", but having the progression lists is great! It makes it much easier to understand the expected flow.


if nextProgressionIndex < currentProgressionIndex {
	log.Info(fmt.Sprintf("Repair reverse Progression from '%s' to '%s'",
		drpc.Status.Progression, r.savedInstanceStatus.Progression))
@nirs (Member) commented:

Repairing the progression and logging is not the way to go. We should instead not set the wrong progression.

If we are sure we can do this, we can change the place that sets the progression to reject a value that would make it go backward. At that point we can log a debug message about rejecting the progression.
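
A sketch of that alternative (illustrative only; setProgressionGuarded and isBackwardProgression are hypothetical names, not code from this PR):

// Sketch: reject a backward value at the point where the progression is set,
// instead of repairing it after the fact, and log the rejection at debug level.
func (d *DRPCInstance) setProgressionGuarded(next rmn.ProgressionStatus) {
	current := d.instance.Status.Progression

	// isBackwardProgression is a hypothetical helper over the ordered progression lists.
	if isBackwardProgression(current, next) {
		d.log.V(1).Info("Rejecting backward progression",
			"current", current, "rejected", next)

		return
	}

	d.setProgression(next)
}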

@ELENAGER (Member, Author) replied Mar 26, 2025:

@nirs We are talking about this code:

ready := d.checkReadiness(failoverCluster)
if !ready {
	d.log.Info("VRGCondition not ready to finish failover")
	d.setProgression(rmn.ProgressionWaitForReadiness)

	return !done, nil
}

If there is a generation mismatch, we requeue, so the next reconciliation may or may not solve it.
But on every reconcile we go through several steps, even if we have already completed the preferred action.

2025-03-26T09:42:23.338Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:826    Entering RunRelocate  {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1", "state": "Relocated", "progression": "Completed"}
2025-03-26T09:42:23.422Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:2230   function (*DRPCInstance).ensureRelocateActionCompleted changing Progression from 'Completed' to 'Cleaning Up' {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1"}
2025-03-26T09:42:23.522Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:2230   function (*DRPCInstance).EnsureSecondaryReplicationSetup changing Progression from 'Cleaning Up' to 'EnsuringVolSyncSetup'    {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1"}
2025-03-26T09:42:23.523Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:2230   function (*DRPCInstance).refreshVRGSecondarySpec changing Progression from 'EnsuringVolSyncSetup' to 'SettingUpVolSyncDest'   {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1"}
2025-03-26T09:42:23.553Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:2230   function (*DRPCInstance).ensureActionCompleted changing Progression from 'SettingUpVolSyncDest' to 'Completed'        {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1"}

In this case, for example, rejecting any intermediate progression step will leave us with another intermediate, wrong progression.

A reviewer (Member) replied:

@nirs We are talking about this code:

ready := d.checkReadiness(failoverCluster)
if !ready {
	d.log.Info("VRGCondition not ready to finish failover")
	d.setProgression(rmn.ProgressionWaitForReadiness)

	return !done, nil
}

If there is a generation mismatch, we requeue, so the next reconciliation may or may not solve it. But on every reconcile we go through several steps, even if we have already completed the preferred action.

This is the bug! checkReadiness returns false because the generations do not match. It should fail with an error, since it cannot tell whether we are ready or not. The resource status will not be updated until the generations match.
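
A possible shape for that fix (an illustrative sketch, not code from this PR; the condition type string and the function name are assumptions): checkReadiness would distinguish "not ready" from "cannot tell yet", and the caller would requeue on error without touching the progression:

// Sketch: return an error when the VRG status is stale so the caller can
// requeue without changing the progression; return false only when the
// condition is current and genuinely not ready.
func checkReadinessStrict(vrg *rmn.VolumeReplicationGroup, cluster string) (bool, error) {
	cond := meta.FindStatusCondition(vrg.Status.Conditions, "DataReady") // illustrative condition type
	if cond == nil || cond.ObservedGeneration != vrg.Generation {
		// Stale or missing status: we cannot tell whether we are ready.
		return false, fmt.Errorf("VRG status for cluster %s is not up to date", cluster)
	}

	return cond.Status == metav1.ConditionTrue, nil
}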

Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>