Progression can't go backwards by ELENAGER · Pull Request #1880 · RamenDR/ramen

Progression can't go backwards #1880

Open · ELENAGER wants to merge 1 commit into main from DFBUGS-612

Conversation

@ELENAGER (Member) commented Mar 3, 2025

Fixes: https://issues.redhat.com/browse/DFBUGS-612

We have encountered several bugs where DRPC progression unexpectedly reverts from advanced stages to earlier ones. In this particular case, it regresses during Failover from Completed to WaitingForReadiness due to conflicts in VRG generations. While these changes typically resolve themselves over time, users may find the regression alarming.

To address this, we will introduce a check before updating the DRPC status to prevent regression, ensuring it remains at the previous progression level. However, this safeguard will apply only when the Phase remains unchanged—when transitioning between Phases, progression can still move in either direction.
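
As a rough sketch of the idea (the function and parameter names below are illustrative, not the exact code in this PR), the guard can look up both the current and the proposed progression in the ordered list of steps for the active workflow, using the standard slices package, and keep the current value whenever the proposed one would move backwards while the Phase is unchanged:

// Illustrative sketch of the guard described above. progressionList is the
// ordered list of steps for the current workflow (deploy, failover or relocate).
func guardProgression(progressionList []rmn.ProgressionStatus,
	current, next rmn.ProgressionStatus, phaseChanged bool,
) rmn.ProgressionStatus {
	if phaseChanged {
		// Between Phases, progression may legitimately move in either direction.
		return next
	}

	currentIndex := slices.Index(progressionList, current)
	nextIndex := slices.Index(progressionList, next)

	// Keep the previous progression if the new value would be a regression.
	if currentIndex != -1 && nextIndex != -1 && nextIndex < currentIndex {
		return current
	}

	return next
}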

@ELENAGER ELENAGER requested review from nirs and BenamarMk March 3, 2025 16:26
@ELENAGER ELENAGER force-pushed the DFBUGS-612 branch 3 times, most recently from c6ded1b to 9bc6ff5 Compare March 4, 2025 10:18
Comment on lines 2304 to 2341
var deployProgressionList = []rmn.ProgressionStatus{
	rmn.ProgressionCreatingMW,
	rmn.ProgressionUpdatingPlRule,
	rmn.ProgressionEnsuringVolSyncSetup,
	rmn.ProgressionSettingupVolsyncDest,
	rmn.ProgressionCompleted,
}

var failoverProgressionList = []rmn.ProgressionStatus{
	rmn.ProgressionCheckingFailoverPrerequisites,
	rmn.ProgressionWaitForFencing,
	rmn.ProgressionWaitForStorageMaintenanceActivation,
	rmn.ProgressionFailingOverToCluster,
	rmn.ProgressionWaitingForResourceRestore,
	rmn.ProgressionEnsuringVolSyncSetup,
	rmn.ProgressionSettingupVolsyncDest,
	rmn.ProgressionWaitForReadiness,
	rmn.ProgressionUpdatedPlacement,
	rmn.ProgressionCompleted,
	rmn.ProgressionCleaningUp,
	rmn.ProgressionWaitOnUserToCleanUp,
}

var relocateProgressionList = []rmn.ProgressionStatus{
	rmn.ProgressionPreparingFinalSync,
	rmn.ProgressionClearingPlacement,
	rmn.ProgressionRunningFinalSync,
	rmn.ProgressionFinalSyncComplete,
	rmn.ProgressionEnsuringVolumesAreSecondary,
	rmn.ProgressionWaitOnUserToCleanUp,
	rmn.ProgressionCompleted,
	rmn.ProgressionCleaningUp,
	rmn.ProgressionWaitingForResourceRestore,
	rmn.ProgressionWaitForReadiness,
	rmn.ProgressionUpdatedPlacement,
	rmn.ProgressionEnsuringVolSyncSetup,
	rmn.ProgressionSettingupVolsyncDest,
}
@raaizik (Member) commented Mar 4, 2025

IMHO, using Golang's enumerated type is better. It's more straightforward and spares the need to handle hardcoded lists. Simply refactor ProgressionStatus in drplacementcontroler_types.go like so (and divide it into groups):

type ProgressionStatus int

const (
	ProgressionCompleted ProgressionStatus = iota
	ProgressionCreatingMW
	ProgressionUpdatingPlRule
	ProgressionWaitForReadiness
	ProgressionCleaningUp
	ProgressionWaitOnUserToCleanUp
	ProgressionCheckingFailoverPrerequisites
	ProgressionFailingOverToCluster
	ProgressionWaitForFencing
	ProgressionWaitForStorageMaintenanceActivation
	ProgressionPreparingFinalSync
	ProgressionClearingPlacement
	ProgressionRunningFinalSync
	ProgressionFinalSyncComplete
	ProgressionEnsuringVolumesAreSecondary
	ProgressionWaitingForResourceRestore
	ProgressionUpdatedPlacement
	ProgressionEnsuringVolSyncSetup
	ProgressionSettingupVolsyncDest
	ProgressionEnsuringVolSyncDestSetup
	ProgressionDeleting
	ProgressionDeleted
	ProgressionActionPaused
)

func (ps ProgressionStatus) String() string {
	return [...]string{
		"Completed",
		"CreatingMW",
		"UpdatingPlRule",
		"WaitForReadiness",
		"Cleaning Up",
		"WaitOnUserToCleanUp",
		"CheckingFailoverPrerequisites",
		"FailingOverToCluster",
		"WaitForFencing",
		"WaitForStorageMaintenanceActivation",
		"PreparingFinalSync",
		"ClearingPlacement",
		"RunningFinalSync",
		"FinalSyncComplete",
		"EnsuringVolumesAreSecondary",
		"WaitingForResourceRestore",
		"UpdatedPlacement",
		"EnsuringVolSyncSetup",
		"SettingUpVolSyncDest",
		"EnsuringVolSyncDestSetup",
		"Deleting",
		"Deleted",
		"Paused",
	}[ps]
}

func (ps ProgressionStatus) EnumIndex() int {
	return int(ps)
}

@ELENAGER (Member, Author) replied:

@raaizik it may be better from a golang point of view, but we are losing the ordered lists for specific workflows.
Right now, I have explicitly defined sequences of steps (failoverProgressionList, relocateProgressionList, etc.).
With the enum approach, I'd need to implement logic to determine valid transitions instead of iterating over a predefined list.

progressionList = deployProgressionList
}

currentProgressionIndex := slices.Index(progressionList, d.instance.Status.Progression)
A reviewer (Member) suggested:

Suggested change
currentProgressionIndex := slices.Index(progressionList, d.instance.Status.Progression)
currentProgressionIndex := slices.Index(progressionList, d.savedInstanceStatus)

@ELENAGER (Member, Author) replied:

In my PR I am working with the in-memory progression, not with the saved one. I am comparing the current in-memory progression with the newer one I want to set. It is still not the common solution we discussed.

}

currentProgressionIndex := slices.Index(progressionList, d.instance.Status.Progression)
nextProgressionIndex := slices.Index(progressionList, nextProgression)
@raaizik (Member) suggested:

Instead of passing in nextProgression as an argument to this function, you could use:

Suggested change
nextProgressionIndex := slices.Index(progressionList, nextProgression)
nextProgressionIndex := slices.Index(progressionList, d.getProgression())

@ELENAGER (Member, Author) replied Mar 4, 2025:

@raaizik no, I can't use d.getProgression() in my specific case. I call this function before setProgression in that block.

@ELENAGER ELENAGER force-pushed the DFBUGS-612 branch 2 times, most recently from 59f8431 to 2f011c3 Compare March 10, 2025 19:25
@ELENAGER ELENAGER marked this pull request as ready for review March 11, 2025 12:06
@ELENAGER ELENAGER requested a review from ShyamsundarR as a code owner March 11, 2025 12:06
@nirs (Member) commented:

Fixes: https://issues.redhat.com/browse/DFBUGS-612

We have encountered several bugs where DRPC progression unexpectedly reverts from advanced stages to earlier ones. In this particular case, it regresses during Failover from Completed to WaitingForReadiness due to conflicts in VRG generations. While these changes typically resolve themselves over time, users may find the regression alarming.

Can you explain why the progression changes to an earlier value? You wrote "conflicts in VRG generations", but it is not clear why this changes the progression.

ramen is designed to try all checks related to a particular state on every reconcile, so it is expected that a failure at the start of the reconcile will change the state that ramen can report.

To address this, we will introduce a check before updating the DRPC status to prevent regression, ensuring it remains at the previous progression level.

Is it correct to report the progression based on the previously reported progression? In general we should not base our status on the current status but on the spec and the system state.

I think we need analysis of the issue shown in linked bug. We need to understand why the progression changes.

Maybe we should change the progression only when we are sure that the world state has changed, and keep the status unchanged if we cannot tell what is the current world state. For example if we have a stale VRG we need to check again and there is no point in updating the status.

@ELENAGER (Member, Author) commented:

@nirs, to your question about how progression can go backwards, here are two examples of how it happens:

ready := d.checkReadiness(failoverCluster)

and
ready := d.checkReadiness(preferredCluster)

Every time we reconcile and enter the RunRelocate or RunFailover functions, we check the VRG conditions on the target cluster. We do this even if we have already completed the relocation or failover process. Sometimes the readiness check fails not because the VRG conditions changed, but because this verification fails:

condition.ObservedGeneration == vrg.Generation

Then we think that the VRG is not ready and mark our progression as not Completed (i.e. going backwards).
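
For context, the staleness test referred to here follows the usual Kubernetes pattern. A minimal sketch (assuming the VRG conditions are metav1.Condition values and using meta.FindStatusCondition from k8s.io/apimachinery; not the exact Ramen helper) of why a stale condition makes the readiness check fail:

// Sketch: a condition is trusted only when its ObservedGeneration matches the
// VRG's current Generation; otherwise the status still describes an older spec
// and says nothing about the current one.
func conditionIsCurrent(vrg *rmn.VolumeReplicationGroup, condType string) bool {
	cond := meta.FindStatusCondition(vrg.Status.Conditions, condType)
	if cond == nil {
		return false
	}

	return cond.ObservedGeneration == vrg.Generation
}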

@nirs (Member) commented Mar 24, 2025

Then we think that the VRG is not ready and mark our progression as not Completed (i.e. going backwards).

So should we mark the VRG as not completed in this case? This sounds like a bug. The actual issue is that we don't have a valid VRG (the condition generation does not match the VRG generation), so we need to requeue without changing the status.

Maybe if we fix a few instances of this issue we won't need to add infrastructure for preventing the progression from going backward, which may be correct in other cases when the world state has changed.

@raaizik (Member) commented Mar 25, 2025

Then we think that the VRG is not ready and mark our progression as not Completed (i.e. going backwards).

So should we mark the VRG as not completed in this case? This sounds like a bug. The actual issue is that we don't have a valid VRG (the condition generation does not match the VRG generation), so we need to requeue without changing the status.

Agreed. Whenever the metadata's generation differs from the status' observed generation, we need to re-queue (= return true) to allow the resource to stabilize (i.e., become valid).

	rmn.ProgressionEnsuringVolSyncSetup,
	rmn.ProgressionSettingupVolsyncDest,
	rmn.ProgressionCompleted,
}
A reviewer (Member) commented:

I don't know if we can guarantee that progression will never go "backward", but having the progression lists is great! It makes it much easier to understand the expected flow.


if nextProgressionIndex < currentProgressionIndex {
	log.Info(fmt.Sprintf("Repair reverse Progression from '%s' to '%s'",
		drpc.Status.Progression, r.savedInstanceStatus.Progression))
@nirs (Member) commented:

Repairing the progression and logging is not the way to go. We should instead not set the wrong progression.

If we are sure we can do this, we can change the place that sets the progression to reject a value that would make it go backward. At that point we can log a debug message about rejecting the progression.
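
A sketch of that alternative (illustrative only; setProgressionGuarded and isBackwardProgression are hypothetical names, not code from this PR):

// Sketch: reject a backward value at the point where the progression is set,
// instead of repairing it after the fact, and log the rejection at debug level.
func (d *DRPCInstance) setProgressionGuarded(next rmn.ProgressionStatus) {
	current := d.instance.Status.Progression

	// isBackwardProgression is a hypothetical helper over the ordered progression lists.
	if isBackwardProgression(current, next) {
		d.log.V(1).Info("Rejecting backward progression",
			"current", current, "rejected", next)

		return
	}

	d.setProgression(next)
}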

@ELENAGER (Member, Author) replied Mar 26, 2025:

@nirs We are talking about this code:

ready := d.checkReadiness(failoverCluster)
if !ready {
	d.log.Info("VRGCondition not ready to finish failover")
	d.setProgression(rmn.ProgressionWaitForReadiness)

	return !done, nil
}

If there is a generation mismatch, we requeue, so the next reconciliation may or may not solve it.
But on every reconcile we go through several steps, even if we have already completed the preferred action.

2025-03-26T09:42:23.338Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:826    Entering RunRelocate  {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1", "state": "Relocated", "progression": "Completed"}
2025-03-26T09:42:23.422Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:2230   function (*DRPCInstance).ensureRelocateActionCompleted changing Progression from 'Completed' to 'Cleaning Up' {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1"}
2025-03-26T09:42:23.522Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:2230   function (*DRPCInstance).EnsureSecondaryReplicationSetup changing Progression from 'Cleaning Up' to 'EnsuringVolSyncSetup'    {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1"}
2025-03-26T09:42:23.523Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:2230   function (*DRPCInstance).refreshVRGSecondarySpec changing Progression from 'EnsuringVolSyncSetup' to 'SettingUpVolSyncDest'   {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1"}
2025-03-26T09:42:23.553Z        INFO    controllers.DRPlacementControl  controller/drplacementcontrol.go:2230   function (*DRPCInstance).ensureActionCompleted changing Progression from 'SettingUpVolSyncDest' to 'Completed'        {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1"}

In this case, for example, rejecting any intermediate progression step will leave us with another intermediate, wrong progression.

A reviewer (Member) replied:

@nirs We are talking about this code:

ready := d.checkReadiness(failoverCluster)
if !ready {
	d.log.Info("VRGCondition not ready to finish failover")
	d.setProgression(rmn.ProgressionWaitForReadiness)

	return !done, nil
}

If there is a generation mismatch, we requeue, so the next reconciliation may or may not solve it. But on every reconcile we go through several steps, even if we have already completed the preferred action.

This is the bug! checkReadiness returns false because the generations do not match. It should fail with an error, since it cannot tell whether we are ready or not. The resource status will not be updated until the generations match.
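
A possible shape for that fix (an illustrative sketch, not code from this PR; the condition type string and the function name are assumptions): checkReadiness would distinguish "not ready" from "cannot tell yet", and the caller would requeue on error without touching the progression:

// Sketch: return an error when the VRG status is stale so the caller can
// requeue without changing the progression; return false only when the
// condition is current and genuinely not ready.
func checkReadinessStrict(vrg *rmn.VolumeReplicationGroup, cluster string) (bool, error) {
	cond := meta.FindStatusCondition(vrg.Status.Conditions, "DataReady") // illustrative condition type
	if cond == nil || cond.ObservedGeneration != vrg.Generation {
		// Stale or missing status: we cannot tell whether we are ready.
		return false, fmt.Errorf("VRG status for cluster %s is not up to date", cluster)
	}

	return cond.Status == metav1.ConditionTrue, nil
}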

Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>