Implement checkpointing, rollback recovery and failure detection in the Prefect workflow · Issue #30 · casangi/RADPS · GitHub
Implement checkpointing, rollback recovery and failure detection in the Prefect workflow #30
Open
@krlberry

Description

Objective

Implement error handling in the Prefect example workflow.

Note: broken off from #17.

Requirements

  • Incorporate entities that facilitate timely and accurate failure detection.
  • An ideal rollback recovery approach would not require source code modifications, source code recompilation, or relinking support binaries.
  • Rollback recovery should include robust failure detection that activates without user intervention.
  • Checkpoint creation time should be significantly shorter than the application runtime, and checkpoints should be small.
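
The checkpointing and rollback requirements above can be illustrated with a minimal, framework-agnostic sketch. It deliberately does not import Prefect so that it runs with the standard library alone; in a Prefect workflow, task-level retries and result persistence would likely play the roles shown here, but that mapping is an assumption and not something this issue specifies. All names (`save_checkpoint`, `load_checkpoint`, `run_workflow`, the three toy stages) are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state atomically: write a temp file, then rename it over the target."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_stage": 0, "value": 0}


def run_workflow(stages, path: Path, max_attempts: int = 3) -> dict:
    """Run stages in order; after a failure, roll back to the last checkpoint."""
    for attempt in range(1, max_attempts + 1):
        state = load_checkpoint(path)  # rollback point: last completed stage
        try:
            for i in range(state["next_stage"], len(stages)):
                state["value"] = stages[i](state["value"])
                state["next_stage"] = i + 1
                save_checkpoint(path, state)  # small checkpoint after each stage
            return state
        except Exception as exc:
            # Failure is detected and handled without user intervention.
            print(f"attempt {attempt} failed: {exc!r}; rolling back")
    raise RuntimeError("workflow did not complete within the attempt limit")


# Hypothetical stages; the second one fails on its first call only.
calls = {"add_one": 0, "flaky_double": 0}


def add_one(x):
    calls["add_one"] += 1
    return x + 1


def flaky_double(x):
    calls["flaky_double"] += 1
    if calls["flaky_double"] == 1:
        raise RuntimeError("simulated transient failure")
    return x * 2


def add_ten(x):
    return x + 10


with tempfile.TemporaryDirectory() as workdir:
    result = run_workflow(
        [add_one, flaky_double, add_ten], Path(workdir) / "checkpoint.json"
    )
```

Because the checkpoint records the index of the next stage to run, the second attempt resumes at the failed stage and does not repeat the first one. The checkpoint here is a small JSON file; real workflow state may need a different serialization format.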

Definition of Done

  • Methods of checkpointing, rollback recovery and failure detection using the Prefect framework are demonstrated in the example workflow.

Artifacts

  • Source files that implement error handling
  • Logs and outputs from demo execution
  • Notes on the design and implementation

Potential Challenges

  • How should a workflow handle errors/failures that it can't correct?
  • Several categories of failure can occur in a workflow (e.g., an exception raised by task logic vs. preemption of a resource manager agent running a stage). Not all failure modes may be demonstrable before data processing and domain application execution are incorporated into the prototype.
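
One possible way to approach both challenges is to classify each failure before deciding whether to retry or to stop and report. The sketch below is a plain-Python illustration under assumptions: `PreemptedError`, `classify`, and `run_stage` are hypothetical names, and the choice of which exception types count as transient is illustrative only, not a property of Prefect or of this project.

```python
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"  # worth retrying (e.g., preemption, timeout)
    FATAL = "fatal"          # cannot be corrected by the workflow itself


class PreemptedError(Exception):
    """Hypothetical error: the agent running a stage was preempted."""


def classify(exc: Exception) -> FailureKind:
    """Map an exception to a failure category (illustrative policy only)."""
    if isinstance(exc, (PreemptedError, TimeoutError, ConnectionError)):
        return FailureKind.TRANSIENT
    return FailureKind.FATAL


def run_stage(stage, max_retries: int = 2) -> dict:
    """Retry transient failures; report fatal ones instead of retrying."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return {"status": "ok", "result": stage(), "attempts": attempts}
        except Exception as exc:
            kind = classify(exc)
            if kind is FailureKind.FATAL or attempts > max_retries:
                # Uncorrectable: stop and hand a structured report upstream.
                return {
                    "status": "failed",
                    "kind": kind.value,
                    "error": repr(exc),
                    "attempts": attempts,
                }


# Hypothetical stages demonstrating the two categories.
state = {"preempted": False}


def preempted_once():
    if not state["preempted"]:
        state["preempted"] = True
        raise PreemptedError("worker preempted")
    return "done"


def bad_logic():
    raise ValueError("task logic error")


recovered = run_stage(preempted_once)
unrecoverable = run_stage(bad_logic)
```

Returning a structured failure record instead of re-raising keeps the decision about what to do next (alert, skip, abort the run) outside the stage runner, which may make it easier to add categories as new failure modes become demonstrable.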
