8000 perf(consensus): Use TrySend for hasVote/HasBlockPart messages by ValarDragon · Pull Request #3407 · cometbft/cometbft · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

perf(consensus): Use TrySend for hasVote/HasBlockPart messages #3407

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 18 commits into from
Jul 10, 2024

Conversation

ValarDragon
Copy link
Contributor
@ValarDragon ValarDragon commented Jul 3, 2024

Closes #3151

The broadcast routine in its entirely is the source of ~75% of all created channels in the system, all coming from the NewTimer call in Send. (and 75% of all NewTimer) Furthermore, the NewTimer call from Broadcast is 2% of all CPU time across the system. However I think this is actually more impact, due to the following.

We spend 11% of all CPU time in the scheduler, and 4% of all CPU time is CheckTimers in the scheduler logic. I'd expect reducing most of the Timer calls reduces this. We also reduce total mallocgc time here by 5% by removing NewTimer, which should hopefully reduce GC load as well.


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

8000

@ValarDragon ValarDragon requested review from a team as code owners July 3, 2024 10:25
@ValarDragon
Copy link
Contributor Author

Going to test this in osmosis prod to see the scheduler improvement, mostly because I'm curious if it reflects my understanding.

The test I added tests TryBroadcast to same extent Broadcast is tested within switch. (I didn't see any other test)

@ValarDragon
Copy link
Contributor Author

I fixed the old Go Benchmark (It hasn't been working for months).

BenchmarkSwitchBroadcast-12    	  347770	      4304 ns/op	    1238 B/op	      16 allocs/op
BenchmarkSwitchTryBroadcast-12     2268736	       482.7 ns/op	     381 B/op	      11 allocs/op

So a significant CPU cost reduction!

@ValarDragon
Copy link
Contributor Author

This appears to have been an empiric 20% reduction in scheduler time! (At consensus state and mempool times being roughly equal)

And a 5-10% GC time reduction!

@cason
Copy link
Contributor
cason commented Jul 3, 2024

I will take a detailed look maybe tomorrow, but conceptually I agree, in particular because missing those two messages is not a problem: they are optimizations, preventing duplication, they are not core messages of the protocol.

itsdevbear pushed a commit to berachain/cometbft that referenced this pull request Jul 4, 2024
Copy link
Contributor
@cason cason left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

Please fix the changelog before committing.

use TrySend instead of Send. This saves notable amounts of performance,
as TrySend just drops sending when the channel is full, instead of creating
a timer / new channel.
([\#3342](https://github.com/cometbft/cometbft/issues/3342))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this link right? #3342 is something else.

Copy link
Member
@hvanz hvanz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

ValarDragon and others added 4 commits July 5, 2024 22:37
…-use-trysend.md

Co-authored-by: Daniel <daniel.cason@informal.systems>
…-use-trysend.md

Co-authored-by: Daniel <daniel.cason@informal.systems>
Co-authored-by: Daniel <daniel.cason@informal.systems>
…-use-trysend.md

Co-authored-by: Daniel <daniel.cason@informal.systems>
…-use-trysend.md

Co-authored-by: Daniel <daniel.cason@informal.systems>
Copy link
Contributor
@melekes melekes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ValarDragon ❤️

@melekes melekes enabled auto-merge July 9, 2024 08:38
@melekes
Copy link
Contributor
melekes commented Jul 9, 2024
--- FAIL: TestStateOversizedBlock (0.38s)
    --- FAIL: TestStateOversizedBlock/max_size,_correct_block (0.38s)
        state_test.go:276: Detected real max data size for block; size 65255 softMaxDataBytes 64805
        state_test.go:307: Block Sizes; Limit 65536 Current 65535
        state_test.go:308: Proposal Parts; Maximum 1 Current 1
panic: Timeout expired while waiting for NewVote event [recovered]
	panic: Timeout expired while waiting for NewVote event

goroutine 37363 [running]:
testing.tRunner.func1.2({0x1359020, 0x175c150})
	/opt/hostedtoolcache/go/1.22.5/x64/src/testing/testing.go:1631 +0x3f7
testing.tRunner.func1()
	/opt/hostedtoolcache/go/1.22.5/x64/src/testing/testing.go:1634 +0x6b6
panic({0x1359020?, 0x175c150?})
	/opt/hostedtoolcache/go/1.22.5/x64/src/runtime/panic.go:770 +0x132
github.com/cometbft/cometbft/internal/consensus.ensureVote(0xc008015b00, 0x1, 0x1, 0x2)
	/home/runner/work/cometbft/cometbft/internal/consensus/common_test.go:740 +0x47e
github.com/cometbft/cometbft/internal/consensus.ensurePrecommit(...)
	/home/runner/work/cometbft/cometbft/internal/consensus/common_test.go:728
github.com/cometbft/cometbft/internal/consensus.TestStateOversizedBlock.func1(0xc004f48d00)
	/home/runner/work/cometbft/cometbft/internal/consensus/state_test.go:331 +0xcec
testing.tRunner(0xc004f48d00, 0xc0031f2300)
	/opt/hostedtoolcache/go/1.22.5/x64/src/testing/testing.go:1689 +0x21f
created by testing.(*T).Run in goroutine 37362
	/opt/hostedtoolcache/go/1.22.5/x64/src/testing/testing.go:1742 +0x826
FAIL	github.com/cometbft/cometbft/internal/consensus	20.370s
FAIL

@ValarDragon
Copy link
Contributor Author

I'm at conferences this week, I can work on debugging on weekend! Sorry for delay on replies

@cason
Copy link
Contributor
cason commented Jul 9, 2024

This error is odd, are you sure it was introduced by this PR?

@cason
Copy link
Contributor
cason commented Jul 9, 2024

I mean, TestStateOversizedBlock don't use use a reactor.

@melekes
Copy link
Contributor
melekes commented Jul 9, 2024

This error is odd, are you sure it was introduced by this PR?

not sure

@melekes melekes added consensus backport-to-v1.x Tell Mergify to backport the PR to v1.x p2p labels Jul 10, 2024
@melekes melekes disabled auto-merge July 10, 2024 06:14
@melekes melekes added this pull request to the merge queue Jul 10, 2024
Merged via the queue into main with commit 55493e0 Jul 10, 2024
39 checks passed
@melekes melekes deleted the dev/use_trysend_for_hasXmsgs branch July 10, 2024 06:35
mergify bot pushed a commit that referenced this pull request Jul 10, 2024
Closes #3151

The broadcast routine in its entirely is the source of ~75% of all
created channels in the system, all coming from the `NewTimer` call in
Send. (and 75% of all NewTimer) Furthermore, the NewTimer call from
Broadcast is 2% of all CPU time across the system. However I think this
is actually more impact, due to the following.

We spend 11% of all CPU time in the scheduler, and 4% of all CPU time is
`CheckTimers` in the scheduler logic. I'd expect reducing most of the
Timer calls reduces this. We also reduce total `mallocgc` time here by
5% by removing NewTimer, which should hopefully reduce GC load as well.

---

#### PR checklist

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Daniel <daniel.cason@informal.systems>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
(cherry picked from commit 55493e0)

# Conflicts:
#	internal/consensus/reactor.go
#	p2p/switch.go
#	p2p/switch_test.go
melekes added a commit that referenced this pull request Jul 10, 2024
…ort #3407) (#3466)

Closes #3151 

The broadcast routine in its entirely is the source of ~75% of all
created channels in the system, all coming from the `NewTimer` call in
Send. (and 75% of all NewTimer) Furthermore, the NewTimer call from
Broadcast is 2% of all CPU time across the system. However I think this
is actually more impact, due to the following.

We spend 11% of all CPU time in the scheduler, and 4% of all CPU time is
`CheckTimers` in the scheduler logic. I'd expect reducing most of the
Timer calls reduces this. We also reduce total `mallocgc` time here by
5% by removing NewTimer, which should hopefully reduce GC load as well.

---

#### PR checklist

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #3407 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
ValarDragon added a commit to osmosis-labs/cometbft that referenced this pull request Aug 19, 2024
…ort cometbft#3407) (cometbft#3466)

Closes cometbft#3151

The broadcast routine in its entirely is the source of ~75% of all
created channels in the system, all coming from the `NewTimer` call in
Send. (and 75% of all NewTimer) Furthermore, the NewTimer call from
Broadcast is 2% of all CPU time across the system. However I think this
is actually more impact, due to the following.

We spend 11% of all CPU time in the scheduler, and 4% of all CPU time is
`CheckTimers` in the scheduler logic. I'd expect reducing most of the
Timer calls reduces this. We also reduce total `mallocgc` time here by
5% by removing NewTimer, which should hopefully reduce GC load as well.

---

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request cometbft#3407 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-to-v1.x Tell Mergify to backport the PR to v1.x consensus p2p
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Change broadcastHasVote and broadcastHasBlockpart to use TrySend not Send
4 participants
0