-
-
Notifications
You must be signed in to change notification settings - Fork 4.3k
[scheduler] Properly handle millis() overflow #8197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Also rework scheduler debug code.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## dev #8197 +/- ##
==========================================
+ Coverage 53.70% 54.91% +1.20%
==========================================
Files 50 50
Lines 9408 9913 +505
Branches 1654 1354 -300
==========================================
+ Hits 5053 5444 +391
- Misses 4056 4125 +69
- Partials 299 344 +45 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Isn't this THE classic example of why never to do That way you are immune to overflows as long as you keep millis related variables as unsigned. |
Working in 16 bit for example,
So actual elapsed time is 2, but last_execution - now evaluates to FFFE, which is larger than interval, so there would be a spurious additional execution. |
@clydebarrow Oops, now I did it wrong myself. Swap last_execution and now of course. Other way around makes no sense 😅 1 - 0xFFFF = 2 But instead of trying to calculate the new moment, this will always work. No matter an overflow. As long as the variables are unsigned and larger than interval. But in the current situation if we do a callback @ 0xF000, this will set next_execution_ = 0xF000 + 0x4000 = 0x3000 So next time we do a if (item->next_execution_ > now) {
// Not reached timeout yet, done for this call
break;
} Will not break and thus do a callback even though just 0xFF00 - 0xF000 = 0x0F00 time passed... As if you would do |
@septillion-git while I agree the new code does appear to have a bug when |
@DAVe3283 So we rather waste memory than to fix a bug? I would not dare to call that proper handling.... It is after all Arduino 101 I'm willing to change the code to really fix it. But I don't have a dev setup to test it... |
I don't believe this new method actually uses any more memory. It used to be two separate uint32_t values stored per instance, but it is now rolled up to a single uint64_t with a function. Or maybe I'm misreading it, this isn't a part of the codebase I'm familiar with. I'm just happy to have the previous very real bug at 100 days resolved. I do agree, making the scheduler theoretically perfect would be nice. But I don't have a test framework like Clyde has to verify the change works as expected by speeding up execution or however he did it. |
@DAVe3283 True for the fix! But it's a bit shameful it was present to start with... And yeah, does use more memory.
Know it's not major amount. But it's really such a basic bug with such a simple solution. It's a bit comical this required this much to "fix"... Let me see if a have some time to rewrite it :) |
What does this implement/fix?
Some users have reported a failure to update sensors when the device uptime hits around 100 days (second overflow of a 32 bit count in milliseconds.)
The scheduler does have code to manage this overflow, and it works most of the time. In the particular case tested the factor that triggered a failure was setting the
flash_write_interval
to 0s, presumably in the belief that this disabled the flash sync. In fact it doesn't, it does cause the sync function to be called continually, since that literally sets the interval to 0s.Exactly why this triggers the failure is unclear, but the effect is that no other scheduled callbacks ever happen once the millisecond time has overflowed twice. The code handling this is rather opaque, since it evidently was written at a time when
uint64_t
was either unavailable or not preferred, and consequently it's quite hard to check for correctnesss by examination. Testing confirms the failure in the above scenario however.Consequently I have refactored the scheduler code to remove several pieces of convoluted code. The result should actually be more efficient than the original code since it calculates the next execution time for a task once, rather than each time the scheduler runs.
Testing of this confirms correct operation even after multiple
millis()
overflows.In addition the scheduler debug code has been revised to add additional reporting, and allowing that debug code to be enabled with a yaml config option.
Types of changes
Related issue or feature (if applicable):
Pull request in esphome-docs with documentation (if applicable):
Test Environment
Example entry for
config.yaml
:Checklist:
tests/
folder).If user exposed functionality or configuration variables are added/changed: