Progression can't go backwards #1880
base: main
Conversation
Force-pushed from c6ded1b to 9bc6ff5
var deployProgressionList = []rmn.ProgressionStatus{
	rmn.ProgressionCreatingMW,
	rmn.ProgressionUpdatingPlRule,
	rmn.ProgressionEnsuringVolSyncSetup,
	rmn.ProgressionSettingupVolsyncDest,
	rmn.ProgressionCompleted,
}

var failoverProgressionList = []rmn.ProgressionStatus{
	rmn.ProgressionCheckingFailoverPrerequisites,
	rmn.ProgressionWaitForFencing,
	rmn.ProgressionWaitForStorageMaintenanceActivation,
	rmn.ProgressionFailingOverToCluster,
	rmn.ProgressionWaitingForResourceRestore,
	rmn.ProgressionEnsuringVolSyncSetup,
	rmn.ProgressionSettingupVolsyncDest,
	rmn.ProgressionWaitForReadiness,
	rmn.ProgressionUpdatedPlacement,
	rmn.ProgressionCompleted,
	rmn.ProgressionCleaningUp,
	rmn.ProgressionWaitOnUserToCleanUp,
}

var relocateProgressionList = []rmn.ProgressionStatus{
	rmn.ProgressionPreparingFinalSync,
	rmn.ProgressionClearingPlacement,
	rmn.ProgressionRunningFinalSync,
	rmn.ProgressionFinalSyncComplete,
	rmn.ProgressionEnsuringVolumesAreSecondary,
	rmn.ProgressionWaitOnUserToCleanUp,
	rmn.ProgressionCompleted,
	rmn.ProgressionCleaningUp,
	rmn.ProgressionWaitingForResourceRestore,
	rmn.ProgressionWaitForReadiness,
	rmn.ProgressionUpdatedPlacement,
	rmn.ProgressionEnsuringVolSyncSetup,
	rmn.ProgressionSettingupVolsyncDest,
}
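As a reference for the discussion below, a minimal sketch (hypothetical helper name, assuming Go 1.21+ for the slices package) of how such an ordered list can be used to detect a backward transition:

import "slices"

// isBackwardProgression reports whether moving from current to next would go
// backwards within the given ordered workflow list.
func isBackwardProgression(list []rmn.ProgressionStatus, current, next rmn.ProgressionStatus) bool {
	currentIdx := slices.Index(list, current)
	nextIdx := slices.Index(list, next)

	// Statuses missing from this workflow cannot be ordered, so do not
	// treat the move as backward.
	if currentIdx == -1 || nextIdx == -1 {
		return false
	}

	return nextIdx < currentIdx
}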
IMHO, using Golang's enumerated type is better. It's more straightforward and spares the need to handle hardcoded lists. Simply refactor ProgressionStatus in drplacementcontrol_types.go like so (and divide it into groups):
type ProgressionStatus int

const (
	ProgressionCompleted ProgressionStatus = iota
	ProgressionCreatingMW
	ProgressionUpdatingPlRule
	ProgressionWaitForReadiness
	ProgressionCleaningUp
	ProgressionWaitOnUserToCleanUp
	ProgressionCheckingFailoverPrerequisites
	ProgressionFailingOverToCluster
	ProgressionWaitForFencing
	ProgressionWaitForStorageMaintenanceActivation
	ProgressionPreparingFinalSync
	ProgressionClearingPlacement
	ProgressionRunningFinalSync
	ProgressionFinalSyncComplete
	ProgressionEnsuringVolumesAreSecondary
	ProgressionWaitingForResourceRestore
	ProgressionUpdatedPlacement
	ProgressionEnsuringVolSyncSetup
	ProgressionSettingupVolsyncDest
	ProgressionEnsuringVolSyncDestSetup
	ProgressionDeleting
	ProgressionDeleted
	ProgressionActionPaused
)

func (ps ProgressionStatus) String() string {
	return [...]string{
		"Completed",
		"CreatingMW",
		"UpdatingPlRule",
		"WaitForReadiness",
		"Cleaning Up",
		"WaitOnUserToCleanUp",
		"CheckingFailoverPrerequisites",
		"FailingOverToCluster",
		"WaitForFencing",
		"WaitForStorageMaintenanceActivation",
		"PreparingFinalSync",
		"ClearingPlacement",
		"RunningFinalSync",
		"FinalSyncComplete",
		"EnsuringVolumesAreSecondary",
		"WaitingForResourceRestore",
		"UpdatedPlacement",
		"EnsuringVolSyncSetup",
		"SettingUpVolSyncDest",
		"EnsuringVolSyncDestSetup",
		"Deleting",
		"Deleted",
		"Paused",
	}[ps]
}

func (ps ProgressionStatus) EnumIndex() int {
	return int(ps)
}
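A short usage sketch of the suggested enum (hypothetical, assuming the refactor above):

func exampleUsage() {
	p := ProgressionWaitForReadiness
	fmt.Println(p)             // "WaitForReadiness", via the Stringer interface
	fmt.Println(p.EnumIndex()) // 3, its position in the const block
}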
@raaizik it may be better from a Golang point of view, but we are losing the ordered lists for specific workflows.
Right now, I have explicitly defined sequences of steps (failoverProgressionList, relocateProgressionList, etc.).
With the enum approach, I'd need to implement logic to determine valid transitions instead of iterating over a predefined list, as sketched below.
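For illustration, a hedged sketch of what that transition logic might look like with the enum approach (hypothetical names, not code from this PR):

// Each state would need an explicit set of allowed successors.
var validNext = map[ProgressionStatus][]ProgressionStatus{
	ProgressionCheckingFailoverPrerequisites: {ProgressionWaitForFencing},
	ProgressionWaitForFencing:                {ProgressionWaitForStorageMaintenanceActivation},
	// ... one entry per state per workflow, which the ordered lists avoid.
}

func isValidTransition(from, to ProgressionStatus) bool {
	return slices.Contains(validNext[from], to)
}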
	progressionList = deployProgressionList
}

currentProgressionIndex := slices.Index(progressionList, d.instance.Status.Progression)
Suggested change:
- currentProgressionIndex := slices.Index(progressionList, d.instance.Status.Progression)
+ currentProgressionIndex := slices.Index(progressionList, d.savedInstanceStatus)
In my PR I am working with the in-memory progression, not the saved one. I am comparing the current in-memory progression with the newer one I want to set. It is still not the common solution we discussed.
}

currentProgressionIndex := slices.Index(progressionList, d.instance.Status.Progression)
nextProgressionIndex := slices.Index(progressionList, nextProgression)
Instead of passing in nextProgression as an argument to this function, you could use:

Suggested change:
- nextProgressionIndex := slices.Index(progressionList, nextProgression)
+ nextProgressionIndex := slices.Index(progressionList, d.getProgression())
@raaizik no, I can't use d.getProgression() in my specific case. I am calling this function before setProgression in that block, as sketched below.
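Roughly, the call site looks like this (a condensed, hypothetical sketch reusing the isBackwardProgression helper sketched earlier):

nextProgression := rmn.ProgressionWaitForReadiness
// The check must run before setProgression, so getProgression() would still
// return the old in-memory value rather than the candidate one.
if !isBackwardProgression(progressionList, d.instance.Status.Progression, nextProgression) {
	d.setProgression(nextProgression)
}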
Force-pushed from 59f8431 to 2f011c3
Can you explain why the progression changes to an earlier progression? You wrote "conflicts in VRG generations", but it is not clear why this changes the progression. ramen is designed to retry all checks related to a particular state on every reconcile, so it is expected that a failure at the start of the reconcile will change the state that ramen reports.
Is it correct to report the progression based on the previously reported progression? In general we should not report our status based on the current status but on the spec and the system state. I think we need an analysis of the issue shown in the linked bug. We need to understand why the progression changes. Maybe we should change the progression only when we are sure that the world state has changed, and keep the status unchanged if we cannot tell what the current world state is. For example, if we have a stale VRG we need to check again, and there is no point in updating the status.
@nirs to your question of how progression can go backwards, here is how it happens:
Every time we reconcile and enter the RunRelocate or RunFailover functions, we check the VRG conditions on the target cluster. We do this even if we have already completed the relocation or failover process. Sometimes the readiness check fails not because the VRG conditions changed, but because the generation verification fails (the condition's observed generation does not match the VRG's generation).
We then conclude that the VRG is not ready and mark our progression as not Completed (i.e., going backwards).
So should we mark the VRG as not ready in this case? This sounds like a bug. The actual issue is that we don't have a valid VRG (the condition generation does not match the VRG generation), so we need to requeue without changing the status. Maybe if we fix a few instances of this issue we don't need to add infrastructure for preventing progression from going backward, which may be correct in other cases when the world state has changed.
Agreed. Whenever the metadata's generation differs from the status's observed generation, we need to requeue (= return true) to allow the resource to stabilize (i.e., become valid).
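A minimal sketch of that guard, using the standard Kubernetes condition type (the helper name is illustrative):

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// conditionIsStale reports whether the condition was recorded for an older
// generation of the object; a stale condition should trigger a requeue
// rather than a status change.
func conditionIsStale(obj metav1.Object, cond *metav1.Condition) bool {
	return cond == nil || cond.ObservedGeneration != obj.GetGeneration()
}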
	rmn.ProgressionEnsuringVolSyncSetup,
	rmn.ProgressionSettingupVolsyncDest,
	rmn.ProgressionCompleted,
}
I don't know if we can guarantee that progression will never go "backward", but having the progression lists is great! It makes it much easier to understand the expected flow.
if nextProgressionIndex < currentProgressionIndex {
	log.Info(fmt.Sprintf("Repair reverse Progression from '%s' to '%s'",
		drpc.Status.Progression, r.savedInstanceStatus.Progression))
Repairing the progression and logging is not the way to go. We should instead not set the wrong progression in the first place.
If we are sure we can do this, we can change the place that sets the progression to reject a value that would make it go backward. At that point we can log a debug message about rejecting the progression.
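For example, the rejection could live in setProgression itself (a hypothetical sketch, not code from this PR; progressionList() is an assumed helper returning the list for the current action):

func (d *DRPCInstance) setProgression(next rmn.ProgressionStatus) {
	// Reject a backward value at the single place that sets the
	// progression, instead of repairing it after the fact.
	if isBackwardProgression(d.progressionList(), d.instance.Status.Progression, next) {
		d.log.V(1).Info("Rejecting backward progression",
			"current", d.instance.Status.Progression, "rejected", next)

		return
	}

	d.instance.Status.Progression = next
}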
@nirs We are talking about this code:

ready := d.checkReadiness(failoverCluster)
if !ready {
	d.log.Info("VRGCondition not ready to finish failover")
	d.setProgression(rmn.ProgressionWaitForReadiness)

	return !done, nil
}
If there is a generation mismatch, we are requeuing, so the next reconciliation may or may not solve it.
But on every reconcile we go through several steps, even if we have already completed the preferred action:
2025-03-26T09:42:23.338Z INFO controllers.DRPlacementControl controller/drplacementcontrol.go:826 Entering RunRelocate {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1", "state": "Relocated", "progression": "Completed"}
2025-03-26T09:42:23.422Z INFO controllers.DRPlacementControl controller/drplacementcontrol.go:2230 function (*DRPCInstance).ensureRelocateActionCompleted changing Progression from 'Completed' to 'Cleaning Up' {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1"}
2025-03-26T09:42:23.522Z INFO controllers.DRPlacementControl controller/drplacementcontrol.go:2230 function (*DRPCInstance).EnsureSecondaryReplicationSetup changing Progression from 'Cleaning Up' to 'EnsuringVolSyncSetup' {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1"}
2025-03-26T09:42:23.523Z INFO controllers.DRPlacementControl controller/drplacementcontrol.go:2230 function (*DRPCInstance).refreshVRGSecondarySpec changing Progression from 'EnsuringVolSyncSetup' to 'SettingUpVolSyncDest' {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1"}
2025-03-26T09:42:23.553Z INFO controllers.DRPlacementControl controller/drplacementcontrol.go:2230 function (*DRPCInstance).ensureActionCompleted changing Progression from 'SettingUpVolSyncDest' to 'Completed' {"DRPC": {"name":"deployment-rbd-drpc","namespace":"deployment-rbd"}, "rid": "cff50561-04a8-4794-a7d2-eac8924025a1"}
In this case, for example, rejecting any single intermediate progression step would still leave us with another wrong intermediate progression.
This is the bug! checkReadiness returns false because the generations do not match. It should instead fail with an error, since it cannot tell whether we are ready or not. The resource status will not be updated until the generations match.
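A sketch of that direction (a hypothetical signature change; the "DataReady" condition name and the vrgs cache are assumptions for illustration):

// checkReadiness cannot answer "ready or not" for a stale VRG, so it
// returns an error and lets the caller requeue without touching the status.
func (d *DRPCInstance) checkReadiness(cluster string) (bool, error) {
	vrg, ok := d.vrgs[cluster] // assumed per-cluster VRG cache
	if !ok || vrg == nil {
		return false, fmt.Errorf("no VRG found for cluster %s", cluster)
	}

	cond := meta.FindStatusCondition(vrg.Status.Conditions, "DataReady") // condition name illustrative
	if cond == nil || cond.ObservedGeneration != vrg.GetGeneration() {
		return false, fmt.Errorf("stale VRG on cluster %s: cannot determine readiness", cluster)
	}

	return cond.Status == metav1.ConditionTrue, nil
}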
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Fixes: https://issues.redhat.com/browse/DFBUGS-612
We have encountered several bugs where the DRPC progression unexpectedly reverts from advanced stages to earlier ones. In this particular case, it regresses during Failover from Completed to WaitForReadiness due to conflicts in VRG generations. While these changes typically resolve themselves over time, users may find the regression alarming.
To address this, we introduce a check before updating the DRPC status to prevent regression, ensuring it remains at the previous progression level. However, this safeguard applies only when the Phase remains unchanged; when transitioning between Phases, progression can still move in either direction.
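A condensed sketch of that safeguard (hypothetical helper; the actual change wires this into the DRPC status update path, and rmn.DRState is the DRPC phase type):

// allowProgressionUpdate reports whether next may replace prev. Across a
// Phase change, progression may move in either direction; within the same
// Phase it may only move forward along the workflow list.
func allowProgressionUpdate(list []rmn.ProgressionStatus,
	prevPhase, nextPhase rmn.DRState,
	prev, next rmn.ProgressionStatus,
) bool {
	if prevPhase != nextPhase {
		return true
	}

	prevIdx := slices.Index(list, prev)
	nextIdx := slices.Index(list, next)

	// Statuses outside the list cannot be ordered; allow the update.
	if prevIdx == -1 || nextIdx == -1 {
		return true
	}

	return nextIdx >= prevIdx
}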