Dapr Workflow needs a Remedy of Last Resort to be documented for dealing with crashed workflows

@Xon

This topic came up on Discord on 3/12/25 when "@Xon" commented in the Workflows channel that he might have to try examining and even altering the Workflow State Store (i.e. the actors database) to get his in-process 1.14 workflows working on Dapr 1.15. Several people responded that this was a bad idea and high risk. High risk? I agree. Bad idea? Sometimes it is the only reasonable "Remedy of Last Resort" for a crashed workflow that would otherwise force the end user to redo a lot of work.

Even the best-laid plans occasionally go bad. When an organization depends on Dapr Workflows to keep its day-to-day business processes running smoothly, and then one day something totally improbable or unexpected blows up many of its in-process Workflows, it can effectively shut down part or all of the business! This makes the business's employees, managers, and executives really angry, because they cannot do business as normal until the "Workflows are fixed". And such anger is often taken out on the workflow framework, i.e. in our case Dapr. Not good for anyone.

And this can happen no matter how much effort, planning, resiliency, and redundancy has been invested in the software and its operation. Workflows, aka long-running processes, are exposed to the risk of something really bad happening for a relatively huge time window compared to the short-lived processes we are used to. For example, a Workflow instance can have thousands of seconds of exposure before it completes, vs a few seconds for a normal short-lived process. Thus, for long-running processes like Workflows the risk is much higher, as is the probability that sooner or later something really bad will bring some Workflows to an unplanned halt.
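To put rough numbers on the exposure argument, here is a minimal sketch. It assumes fatal incidents arrive at a constant, independent rate (a simplifying assumption, not something measured on any Dapr system), and the rate of one incident per 30 days is purely hypothetical. The function name and the 3 second vs 3000 second windows are illustrative only.

```python
import math


def failure_probability(exposure_seconds: float, failures_per_second: float) -> float:
    """Probability of at least one fatal incident during the exposure window.

    Assumes a constant, independent incident rate (exponential model):
    P = 1 - exp(-rate * exposure).
    """
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-failures_per_second * exposure_seconds)


# Hypothetical rate: one fatal incident per 30 days of wall-clock time.
RATE = 1.0 / (30 * 24 * 3600)

short_lived = failure_probability(3, RATE)      # a process that runs a few seconds
long_running = failure_probability(3000, RATE)  # a workflow that runs thousands of seconds

print(f"short-lived process: {short_lived:.2e}")
print(f"long-running workflow: {long_running:.2e}")
print(f"risk ratio: {long_running / short_lived:.0f}x")
```

Under this model the risk per instance grows almost linearly with the exposure time while the probabilities are small, so a workflow that lives a thousand times longer carries roughly a thousand times the risk of being caught by the same rare event.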

After having lived through one such episode (see below) I came to this conclusion: Every system that uses Workflows needs a clearly documented "Remedy of Last Resort" that provides the information required to manually nurse crashed workflows back to life. Doing such nursing is ugly, at times difficult, can take painstaking hours, but also can save thousands of dollars and precious time, and for mission critical apps perhaps it can save much more.

How do I know this? In 2008-2009 I was developing in .NET C#, Silverlight UI, WCF, and Windows Workflow for a small consulting firm. Our client was a property management firm that needed an Accounts Payable app that allowed a few levels of managerial approval for the payment of vendors for their services, like mowing lawns, doing repairs, etc. An individual Property Manager submitted an invoice to the Workflow (including scanning in a copy of the vendor invoice), and then the Workflow moved the transaction through several levels of approval, including handling rejects and resubmissions. Fairly standard stuff.

I took over as the main developer once the system was ready for beta testing. We installed the software on our client's site and had them run 60 invoices through the system. The clients really liked it and reported several bugs which we promptly fixed. Then, a few days later while the beta test was still running with the invoices slowly percolating through the Workflow, one night 33 of the Workflows blew up just after midnight. Each one crashed with a database timeout preventing them from persisting the app state and workflow state. Ugh...

Our clients were very, very upset. I started frantically debugging, assisted by the developer who designed and wrote the system initially. After a long, long day we found the cause: our Project Manager had installed a demo version of SQL Server that had a 3 month free license. And that license had EXPIRED at midnight the day before, causing all the Workflows that tried to access the DB after midnight to crash! He had forgotten all about it being a demo version. These things happen!

So, over a few days of searching through the app state and workflow state stores on the newly licensed SQL Server, I figured out a way to change the data in the DB that would resurrect the 33 crashed workflows and let them run to completion "naturally". Essentially I changed the current date on each workflow to the date of the day just before the midnight crash, thus moving each workflow backwards in time. I caused workflow time travel! Ha, ha.

But the work to resurrect each workflow required additional manipulation of the data, beyond the time, so much that it took me a week of full time work to resurrect all 33 crashed workflows. But that was OK because it made our clients happy and restored their confidence in their new software and in our consulting firm.

This manual editing of app state and workflow state was the "Remedy of Last Resort". Considering all parties, both the clients and the service provider (me), I believe it was vastly more effective than the other "Remedy of Last Resort", which was having our client re-enter the data for all 33 crashed workflows, including scanning in all the invoices. And I believe that this is likely to be the general case.

Therefore, please develop instructions on how to implement such a Remedy of Last Resort for Dapr Workflows, including an explanation of what each data element in the workflow state store is, and some indication of how to move a workflow back in time. PS I tried to move Azure Durable Functions workflows back in time with mixed success, although I did almost always manage to change the arguments passed to activities by changing values in Table Storage. So it may well be possible to do this sort of "Dead Workflow Resurrection" with Dapr Workflows.
