[scheduler] Properly handle millis() overflow #8197

clydebarrow · 2025-02-04T07:25:08Z

What does this implement/fix?

Some users have reported a failure to update sensors when the device uptime hits around 100 days (second overflow of a 32 bit count in milliseconds.)

The scheduler does have code to manage this overflow, and it works most of the time. In the particular case tested the factor that triggered a failure was setting the flash_write_interval to 0s, presumably in the belief that this disabled the flash sync. In fact it doesn't, it does cause the sync function to be called continually, since that literally sets the interval to 0s.

Exactly why this triggers the failure is unclear, but the effect is that no other scheduled callbacks ever happen once the millisecond time has overflowed twice. The code handling this is rather opaque, since it evidently was written at a time when uint64_t was either unavailable or not preferred, and consequently it's quite hard to check for correctnesss by examination. Testing confirms the failure in the above scenario however.

Consequently I have refactored the scheduler code to remove several pieces of convoluted code. The result should actually be more efficient than the original code since it calculates the next execution time for a task once, rather than each time the scheduler runs.

Testing of this confirms correct operation even after multiple millis() overflows.

In addition the scheduler debug code has been revised to add additional reporting, and allowing that debug code to be enabled with a yaml config option.

Types of changes

Bugfix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Code quality improvements to existing code or addition of tests
Other

Related issue or feature (if applicable):

fixes Secheduler "stops" on second "Incrementing scheduler major" - after ~100d issues#5375

Pull request in esphome-docs with documentation (if applicable):

[scheduler] document debug_scheduler flag esphome-docs#4641

Test Environment

Example entry for `config.yaml`:

# Example config.yaml

esphome:
  # enable scheduler debug logging 
  debug_scheduler: true

Checklist:

The code change is tested and works locally.
Tests have been added to verify that the new code works (under tests/ folder).

If user exposed functionality or configuration variables are added/changed:

Documentation added/updated in esphome-docs.

Also rework scheduler debug code.

codecov-commenter · 2025-02-04T07:27:07Z

Codecov Report

Attention: Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.

Project coverage is 54.91%. Comparing base (4d8b5ed) to head (a87b4be).
Report is 2370 commits behind head on dev.

Files with missing lines	Patch %	Lines
esphome/core/config.py	0.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##              dev    #8197      +/-   ##
==========================================
+ Coverage   53.70%   54.91%   +1.20%     
==========================================
  Files          50       50              
  Lines        9408     9913     +505     
  Branches     1654     1354     -300     
==========================================
+ Hits         5053     5444     +391     
- Misses       4056     4125      +69     
- Partials      299      344      +45

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

tests/components/esphome/common.yaml

esphome/core/scheduler.cpp

…to scheduler

septillion-git · 2025-04-09T16:33:46Z

Isn't this THE classic example of why never to do previousMillis + interval < millis() but always do millis() - previousMillis > interval? So instead of setting item->next_execution_ = now + offset; we should set item->last_execution_ = now; and change the check to break to if (now - item->last_execution_ < item->interval)

That way you are immune to overflows as long as you keep millis related variables as unsigned.

clydebarrow · 2025-04-09T20:44:35Z

if (item->last_execution_ - now < item->interval)

Working in 16 bit for example,

interval == 0x4000
last_execution == 0xFFFF
now == 1

So actual elapsed time is 2, but last_execution - now evaluates to FFFE, which is larger than interval, so there would be a spurious additional execution.

septillion-git · 2025-04-11T18:52:38Z

@clydebarrow Oops, now I did it wrong myself. Swap last_execution and now of course. Other way around makes no sense 😅

1 - 0xFFFF = 2

But instead of trying to calculate the new moment, this will always work. No matter an overflow. As long as the variables are unsigned and larger than interval.

But in the current situation if we do a callback @ 0xF000, this will set next_execution_ = 0xF000 + 0x4000 = 0x3000

So next time we do a call() to check if we need to do the callback it could be that now = 0xFF00
And thus this

      if (item->next_execution_ > now) {
        // Not reached timeout yet, done for this call
        break;
      }

Will not break and thus do a callback even though just 0xFF00 - 0xF000 = 0x0F00 time passed...

As if you would do now - item->last_execution_ < item->interval
0xFF00 - 0xF000 < 0x4000 => true and thus does break.

DAVe3283 · 2025-04-15T14:55:49Z

@septillion-git while I agree the new code does appear to have a bug when this->millis_() wraps around, it was changed to a uint64_t, which will wrap every 2^64-1 ms, or 584,942,417 years. So not actually a problem in practice, as far as I can tell.

septillion-git · 2025-04-15T16:04:34Z

@DAVe3283 So we rather waste memory than to fix a bug?

I would not dare to call that proper handling.... It is after all Arduino 101

I'm willing to change the code to really fix it. But I don't have a dev setup to test it...

DAVe3283 · 2025-04-15T18:32:54Z

@DAVe3283 So we rather waste memory than to fix a bug?

I would not dare to call that proper handling.... It is after all Arduino 101

I'm willing to change the code to really fix it. But I don't have a dev setup to test it...

I don't believe this new method actually uses any more memory. It used to be two separate uint32_t values stored per instance, but it is now rolled up to a single uint64_t with a function. Or maybe I'm misreading it, this isn't a part of the codebase I'm familiar with.

I'm just happy to have the previous very real bug at 100 days resolved. I do agree, making the scheduler theoretically perfect would be nice. But I don't have a test framework like Clyde has to verify the change works as expected by speeding up execution or however he did it.

septillion-git · 2025-04-16T19:27:06Z

@DAVe3283 True for the fix! But it's a bit shameful it was present to start with...

And yeah, does use more memory.

2 bytes for the major
4 bytes to store last millis
2 extra bytes to store the moment

Know it's not major amount. But it's really such a basic bug with such a simple solution. It's a bit comical this required this much to "fix"...

Let me see if a have some time to rewrite it :)

clydebarrow added 4 commits February 4, 2025 15:26

[scheduler] Properly handle millis() overflow

2b732ce

Also rework scheduler debug code.

Revert testing changes.

2be6034

Remove debug code

6524690

Add tests

00fd113

clydebarrow requested a review from a team as a code owner February 4, 2025 07:25

probot-esphome bot added core has-tests labels Feb 4, 2025

clydebarrow mentioned this pull request Feb 4, 2025

[scheduler] document debug_scheduler flag esphome/esphome-docs#4641

Merged

3 tasks

clydebarrow added the bugfix label Feb 4, 2025

clydebarrow commented Feb 4, 2025

View reviewed changes

tests/components/esphome/common.yaml Outdated Show resolved Hide resolved

Fix test

19f55aa

jesserockz reviewed Feb 4, 2025

View reviewed changes

esphome/core/scheduler.cpp Show resolved Hide resolved

clydebarrow added 7 commits February 5, 2025 21:59

Merge branch 'dev' of https://github.com/esphome/esphome into scheduler

ea85e86

Merge branch 'scheduler' of https://github.com/clydebarrow/esphome in…

3b731bd

…to scheduler

Merge branch 'dev' of https://github.com/esphome/esphome into scheduler

6e9b698

Add comment clarifying the set_interval() semantics

75c589b

Merge branch 'dev' into scheduler

ffc2d0d

Merge branch 'dev' into scheduler

aca218e

Merge branch 'dev' into scheduler

a87b4be

clydebarrow added this to the 2025.4.0b1 milestone Mar 27, 2025

kbx81 approved these changes Mar 27, 2025

View reviewed changes

kbx81 merged commit 58fe8b3 into esphome:dev Mar 27, 2025
25 checks passed

clydebarrow deleted the scheduler branch March 28, 2025 00:45

grantbey pushed a commit to grantbey/esphome that referenced this pull request Mar 28, 2025

[scheduler] Properly handle millis() overflow (esphome#8197)

bb36f29

jesserockz mentioned this pull request Apr 9, 2025

2025.4.0b1 #8534

Merged

shibida pushed a commit to shibida/esphome that referenced this pull request Apr 15, 2025

[scheduler] Properly handle millis() overflow (esphome#8197)

e33ad71

herruzo99 pushed a commit to herruzo99/esphome that referenced this pull request Apr 16, 2025

[scheduler] Properly handle millis() overflow (esphome#8197)

3b35c40

jesserockz mentioned this pull request Apr 16, 2025

2025.4.0 #8581

Merged

gnumpi mentioned this pull request May 16, 2025

Automatic firmware update checking not working FutureProofHomes/Satellite1-ESPHome#316

Closed

sa-crespo pushed a commit to sa-crespo/esphome that referenced this pull request May 26, 2025

[scheduler] Properly handle millis() overflow (esphome#8197)

314e5eb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[scheduler] Properly handle millis() overflow #8197

[scheduler] Properly handle millis() overflow #8197

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

[scheduler] Properly handle millis() overflow #8197

[scheduler] Properly handle millis() overflow #8197

Uh oh!

Conversation

Uh oh!

What does this implement/fix?

Types of changes

Test Environment

Example entry for config.yaml:

Checklist:

Uh oh!

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Example entry for `config.yaml`: