Description
Objective
Implement error handling in the Prefect example workflow.
Note: broken off from #17.
Requirements
- Incorporate components that detect failures promptly and accurately.
- An ideal rollback recovery approach would not require source code modifications, source code recompilation, or relinking support binaries.
- Rollback recovery should include robust failure detection that activates without user intervention.
- Checkpoint creation time should be significantly shorter than the application runtime, and checkpoints should be small.
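
To make the requirements above concrete, here is a minimal, framework-agnostic sketch of checkpointing with automatic rollback. It is not the Prefect implementation this issue asks for. The names `save_checkpoint`, `load_checkpoint`, and `run_with_rollback`, the JSON state layout, and the `max_attempts` parameter are all hypothetical, chosen only for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # replaces the old checkpoint in one step


def load_checkpoint(path):
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_stage": 0, "results": []}
    with open(path) as handle:
        return json.load(handle)


def run_with_rollback(stages, path, max_attempts=3):
    """Run stages in order; on failure, roll back to the last checkpoint and retry.

    Assumption: each stage is a callable that takes its index and returns a
    JSON-serializable result.
    """
    attempts = 0
    while True:
        state = load_checkpoint(path)  # rollback point: last committed state
        try:
            for index in range(state["next_stage"], len(stages)):
                state["results"].append(stages[index](index))
                state["next_stage"] = index + 1
                save_checkpoint(path, state)  # commit after every stage
            return state["results"]
        except Exception:
            # Failure detected without user intervention; give up only
            # after max_attempts consecutive failures.
            attempts += 1
            if attempts >= max_attempts:
                raise
```

The checkpoint holds only the stage index and the accumulated results, which keeps it small, and completed stages are not re-run after a failure. In Prefect itself, the task-level `retries` and `retry_delay_seconds` options are likely the closest starting point for the retry part.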
Definition of Done
- The example workflow demonstrates checkpointing, rollback recovery, and failure detection using the Prefect framework.
Artifacts
- Source files that implement error handling
- Logs and outputs from demo execution
- Notes on the design and implementation
Potential Challenges
- How should a workflow handle errors/failures that it can't correct?
- Different categories of failure can occur in a workflow (e.g., an exception raised by task logic vs. preemption of a resource manager agent running a stage). Some failure modes might not be demonstrable until data processing and domain application execution are incorporated into the prototype.
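
One possible way to approach both challenges above is to classify failures and retry only those that are plausibly transient. The sketch below is plain Python rather than Prefect code. The exception classes `TransientFailure` and `FatalFailure`, the `run_stage` helper, and the result dictionary layout are hypothetical names introduced for illustration.

```python
import time


class TransientFailure(Exception):
    """Hypothetical: infrastructure trouble (e.g., a preempted agent)."""


class FatalFailure(Exception):
    """Hypothetical: an error the workflow cannot correct (e.g., bad input)."""


def run_stage(stage, max_retries=2, delay_seconds=0.0):
    """Retry transient failures; report everything else immediately."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return {"status": "completed", "value": stage(), "attempts": attempt + 1}
        except TransientFailure as exc:
            last_error = exc
            time.sleep(delay_seconds)  # back off before the next attempt
        except Exception as exc:
            # Uncorrectable or unknown failure: do not retry, surface it.
            return {"status": "failed", "error": repr(exc), "attempts": attempt + 1}
    # Retries exhausted on a transient failure.
    return {"status": "failed", "error": repr(last_error), "attempts": max_retries + 1}
```

A stage that raises `FatalFailure` comes back with `"status": "failed"` after a single attempt, so the workflow can stop and record the error instead of retrying something it cannot fix.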