Description
As a stayrtr operator, I want stayrtr to keep fetching updates even if the backend system is slow or unresponsive.
If I want updates every 10 minutes and an update takes 5 minutes, I want the next update to start 10 minutes after the previous one started, not 15 minutes after it started (that is, 10 minutes after the previous one finished).
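To make the arithmetic concrete, here is a small sketch that computes update start times under both policies. The function `startTimes` is purely illustrative and not part of stayrtr; it assumes an update never takes longer than the interval.

```go
package main

import "fmt"

// startTimes returns the start minute of the first n updates.
// interval and duration are in minutes. With fixedRate the next update
// starts interval minutes after the previous one started; otherwise it
// starts interval minutes after the previous one finished.
// Hypothetical helper for illustration; assumes duration <= interval.
func startTimes(n, interval, duration int, fixedRate bool) []int {
	starts := make([]int, 0, n)
	t := 0
	for i := 0; i < n; i++ {
		starts = append(starts, t)
		if fixedRate {
			t += interval
		} else {
			t += duration + interval
		}
	}
	return starts
}

func main() {
	fmt.Println(startTimes(4, 10, 5, true))  // [0 10 20 30]
	fmt.Println(startTimes(4, 10, 5, false)) // [0 15 30 45]
}
```

With a 10 minute interval and 5 minute updates, the fixed-delay schedule has drifted by 15 minutes after only four updates.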
Context
When running stayrtr over a slow connection (4G was not cooperating), I noticed that the update loop does not run at a set interval but with a set delay between updates. If the SLURM or JSON responses are slow, each iteration of the loop takes (much) longer.
Root cause
Handling slow responses is a hard problem. It ends up being a tradeoff between liveness of the whole system and getting all information.
For example, in my rpki-client wrapper I found that some repositories were so slow that they prevented me from updating on time. I decided to add a utility to time out and abort fetching from slow repos. There I decided that finishing an update was more important than having all information.
Desired behaviour
first of all:
- exponential backoff on errors
- have basic metrics for HTTP behaviour. We have part of this, but the last successful response per URL, response size, duration, and status code should be tracked. And some metrics can be moved: RefreshStatusCode etc. could be tracked from the HTTP util.
- make both updates (SLURM + VRP JSON) asynchronous; they can be performed in parallel.
then:
- abort the connection if retrieving the response takes longer than [timelimit]
- schedule updates at a set interval: "an update happens every interval", not "interval after the previous update finishes"