10000 [scheduler] Properly handle millis() overflow by clydebarrow · Pull Request #8197 · esphome/esphome · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[scheduler] Properly handle millis() overflow #8197

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Mar 27, 2025
Merged

Conversation

clydebarrow
Copy link
Contributor
@clydebarrow clydebarrow commented Feb 4, 2025

What does this implement/fix?

Some users have reported a failure to update sensors when the device uptime hits around 100 days (second overflow of a 32 bit count in milliseconds.)

The scheduler does have code to manage this overflow, and it works most of the time. In the particular case tested the factor that triggered a failure was setting the flash_write_interval to 0s, presumably in the belief that this disabled the flash sync. In fact it doesn't, it does cause the sync function to be called continually, since that literally sets the interval to 0s.

Exactly why this triggers the failure is unclear, but the effect is that no other scheduled callbacks ever happen once the millisecond time has overflowed twice. The code handling this is rather opaque, since it evidently was written at a time when uint64_t was either unavailable or not preferred, and consequently it's quite hard to check for correctnesss by examination. Testing confirms the failure in the above scenario however.

Consequently I have refactored the scheduler code to remove several pieces of convoluted code. The result should actually be more efficient than the original code since it calculates the next execution time for a task once, rather than each time the scheduler runs.

Testing of this confirms correct operation even after multiple millis() overflows.

In addition the scheduler debug code has been revised to add additional reporting, and allowing that debug code to be enabled with a yaml config option.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Code quality improvements to existing code or addition of tests
  • Other

Related issue or feature (if applicable):

Pull request in esphome-docs with documentation (if applicable):

Test Environment

  • ESP32
  • ESP32 IDF
  • ESP8266
  • RP2040
  • BK72xx
  • RTL87xx

Example entry for config.yaml:

# Example config.yaml

esphome:
  # enable scheduler debug logging 
  debug_scheduler: true

Checklist:

  • The code change is tested and works locally.
  • Tests have been added to verify that the new code works (under tests/ folder).

If user exposed functionality or configuration variables are added/changed:

@clydebarrow clydebarrow requested a review from a team as a code owner February 4, 2025 07:25
@codecov-commenter
Copy link
codecov-commenter commented Feb 4, 2025

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 54.91%. Comparing base (4d8b5ed) to head (a87b4be).
Report is 2370 commits behind head on dev.

Files with missing lines Patch % Lines
esphome/core/config.py 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #8197      +/-   ##
==========================================
+ Coverage   53.70%   54.91%   +1.20%     
==========================================
  Files          50       50              
  Lines        9408     9913     +505     
  Branches     1654     1354     -300     
==========================================
+ Hits         5053     5444     +391     
- Misses       4056     4125      +69     
- Partials      299      344      +45     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@clydebarrow clydebarrow added this to the 2025.4.0b1 milestone Mar 27, 2025
@kbx81 kbx81 merged commit 58fe8b3 into esphome:dev Mar 27, 2025
25 checks passed
@clydebarrow clydebarrow deleted the scheduler branch March 28, 2025 00:45
grantbey pushed a commit to grantbey/esphome that referenced this pull request Mar 28, 2025
@jesserockz jesserockz mentioned this pull request Apr 9, 2025
@septillion-git
Copy link
septillion-git commented Apr 9, 2025

Isn't this THE classic example of why never to do previousMillis + interval < millis() but always do millis() - previousMillis > interval? So instead of setting item->next_execution_ = now + offset; we should set item->last_execution_ = now; and change the check to break to if (now - item->last_execution_ < item->interval)

That way you are immune to overflows as long as you keep millis related variables as unsigned.

@clydebarrow
Copy link
Contributor Author

if (item->last_execution_ - now < item->interval)

Working in 16 bit for example,

  • interval == 0x4000
  • last_execution == 0xFFFF
  • now == 1

So actual elapsed time is 2, but last_execution - now evaluates to FFFE, which is larger than interval, so there would be a spurious additional execution.

@septillion-git
Copy link
septillion-git commented Apr 11, 2025

@clydebarrow Oops, now I did it wrong myself. Swap last_execution and now of course. Other way around makes no sense 😅

1 - 0xFFFF = 2

But instead of trying to calculate the new moment, this will always work. No matter an overflow. As long as the variables are unsigned and larger than interval.

But in the current situation if we do a callback @ 0xF000, this will set next_execution_ = 0xF000 + 0x4000 = 0x3000

So next time we do a call() to check if we need to do the callback it could be that now = 0xFF00
And thus this

      if (item->next_execution_ > now) {
        // Not reached timeout yet, done for this call
        break;
      }

Will not break and thus do a callback even though just 0xFF00 - 0xF000 = 0x0F00 time passed...

As if you would do now - item->last_execution_ < item->interval
0xFF00 - 0xF000 < 0x4000 => true and thus does break.

@DAVe3283
Copy link
Contributor

@septillion-git while I agree the new code does appear to have a bug when this->millis_() wraps around, it was changed to a uint64_t, which will wrap every 2^64-1 ms, or 584,942,417 years. So not actually a problem in practice, as far as I can tell.

@septillion-git
Copy link
septillion-git commented Apr 15, 2025

@DAVe3283 So we rather waste memory than to fix a bug?

I would not dare to call that proper handling.... It is after all Arduino 101

I'm willing to change the code to really fix it. But I don't have a dev setup to test it...

@DAVe3283
Copy link
Contributor

@DAVe3283 So we rather waste memory than to fix a bug?

I would not dare to call that proper handling.... It is after all Arduino 101

I'm willing to change the code to really fix it. But I don't have a dev setup to test it...

I don't believe this new method actually uses any more memory. It used to be two separate uint32_t values stored per instance, but it is now rolled up to a single uint64_t with a function. Or maybe I'm misreading it, this isn't a part of the codebase I'm familiar with.

I'm just happy to have the previous very real bug at 100 days resolved. I do agree, making the scheduler theoretically perfect would be nice. But I don't have a test framework like Clyde has to verify the change works as expected by speeding up execution or however he did it.

shibida pushed a commit to shibida/esphome that referenced this pull request Apr 15, 2025
herruzo99 pushed a commit to herruzo99/esphome that referenced this pull request Apr 16, 2025
@septillion-git
Copy link

@DAVe3283 True for the fix! But it's a bit shameful it was present to start with...

And yeah, does use more memory.

  • 2 bytes for the major
  • 4 bytes to store last millis
  • 2 extra bytes to store the moment

Know it's not major amount. But it's really such a basic bug with such a simple solution. It's a bit comical this required this much to "fix"...

Let me see if a have some time to rewrite it :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Secheduler "stops" on second "Incrementing scheduler major" - after ~100d
7 participants
0