8000 Change broadcastHasVote and broadcastHasBlockpart to use TrySend not Send · Issue #3151 · cometbft/cometbft · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Change broadcastHasVote and broadcastHasBlockpart to use TrySend not Send #3151

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Tracked by #3245
ValarDragon opened this issue May 30, 2024 · 3 comments · Fixed by #3407
Closed
Tracked by #3245

Change broadcastHasVote and broadcastHasBlockpart to use TrySend not Send #3151

ValarDragon opened this issue May 30, 2024 · 3 comments · Fixed by #3407
Labels
consensus enhancement New feature or request

Comments

@ValarDragon
Copy link
Contributor

Feature Request

Summary

Currently both BroadcastHasVote and BroadcastHasBlockpart use Broadcast, which does a Send message to every peer.

While its not clear that BroadcastHasVote is needed, it definitely should be using TrySend not Send! TrySend makes it ignore the channel buffering capacity, and depend on time.Sleep's for cancelling. This increases the overhead for the go scheduler, which does actually take a lot of the time for the system. (Just creating the timers takes 30s of CPU time across a 1hour cpu profile, independent of the scheduler overhead)

Theres a number of things worth revisting about this BroadcastHas*, but it seems like we should be using TrySend everytime not Send. The channel capacity for hasVote is sent to 1, so were not really respecting that right now with the correct behavior. Its likely to me this is an accident in implementation, since the Broadcast method already existed, and used Send, with no corresponding trySend method.

Problem Definition

Reduce our CPU and scheduling overheads, and respect the channel capacities, so we don't "overgossip" from Has* channels.

Proposal

For now, just make a BroadcastTrySend method, and switch the consensus gossip routines to using that!

If for some reason we really wanted Send (which doesn't make sense to me) then we should raise the channel capacities.

@ValarDragon ValarDragon added enhancement New feature or request needs-triage This issue/PR has not yet been triaged by the team. labels May 30, 2024
@andynog andynog added this to CometBFT May 30, 2024
@github-project-automation github-project-automation bot moved this to Todo in CometBFT May 30, 2024
@andynog andynog added this to the 2024-Q2 milestone May 30, 2024
@ValarDragon
Copy link
Contributor Author

After looking at #3182, I think we should just change Broadcast to use TrySend everywhere, not Send. And adapt all relevant channel buffers accordingly.

github-merge-queue bot pushed a commit that referenced this issue Jun 7, 2024
Remove the return channel and waitgroup from switch.Broadcast as its
unused. This was costing us some CPU overhead (2% of broadcast routine
as it stands right now, but broadcast routine time should be dropping)
and one extra goroutine. This removes more overhead than what #3180 may
introduce.

It also better highlights the point that maybe all broadcast usecases
should consider using TrySend (And we potentially adapt channel
buffers), ala the point in #3151 . (This would have significant
system-wide impact, by significantly lowering the number of
timers/scheduler overhead we have)

---

#### PR checklist

- [x] Tests written/updated - Simplifies code, the behavior being
removed was never tested!
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
@cason cason removed the needs-triage This issue/PR has not yet been triaged by the team. label Jun 11, 2024
@cason
Copy link
Contributor
cason commented Jun 11, 2024

Yes, I partially agree.

The problem is that TrySend does not guarantee delivery to correct peers. If we me miss the Has* messages, ok, they are there to save redundancy.

But the NewRoundStepMessage appears to me to be required, I am afraid that dropping them would have unpredictable outputs.

8000

@cason
Copy link
Contributor
cason commented Jun 11, 2024

The easiest solution here is to clone Switch.Broadcast() to some method inside the consensus reactor that does the same while using TrySend() instead of Send().

@cason cason removed the p2p label Jun 11, 2024
@adizere adizere removed this from the 2024-Q2 milestone Jul 2, 2024
github-merge-queue bot pushed a commit that referenced this issue Jul 10, 2024
Closes #3151 

The broadcast routine in its entirely is the source of ~75% of all
created channels in the system, all coming from the `NewTimer` call in
Send. (and 75% of all NewTimer) Furthermore, the NewTimer call from
Broadcast is 2% of all CPU time across the system. However I think this
is actually more impact, due to the following.

We spend 11% of all CPU time in the scheduler, and 4% of all CPU time is
`CheckTimers` in the scheduler logic. I'd expect reducing most of the
Timer calls reduces this. We also reduce total `mallocgc` time here by
5% by removing NewTimer, which should hopefully reduce GC load as well.

---

#### PR checklist

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Daniel <daniel.cason@informal.systems>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
@github-project-automation github-project-automation bot moved this from Todo to Done in CometBFT Jul 10, 2024
mergify bot pushed a commit that referenced this issue Jul 10, 2024
Closes #3151

The broadcast routine in its entirely is the source of ~75% of all
created channels in the system, all coming from the `NewTimer` call in
Send. (and 75% of all NewTimer) Furthermore, the NewTimer call from
Broadcast is 2% of all CPU time across the system. However I think this
is actually more impact, due to the following.

We spend 11% of all CPU time in the scheduler, and 4% of all CPU time is
`CheckTimers` in the scheduler logic. I'd expect reducing most of the
Timer calls reduces this. We also reduce total `mallocgc` time here by
5% by removing NewTimer, which should hopefully reduce GC load as well.

---

#### PR checklist

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Daniel <daniel.cason@informal.systems>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
(cherry picked from commit 55493e0)

# Conflicts:
#	internal/consensus/reactor.go
#	p2p/switch.go
#	p2p/switch_test.go
melekes added a commit that referenced this issue Jul 10, 2024
…ort #3407) (#3466)

Closes #3151 

The broadcast routine in its entirely is the source of ~75% of all
created channels in the system, all coming from the `NewTimer` call in
Send. (and 75% of all NewTimer) Furthermore, the NewTimer call from
Broadcast is 2% of all CPU time across the system. However I think this
is actually more impact, due to the following.

We spend 11% of all CPU time in the scheduler, and 4% of all CPU time is
`CheckTimers` in the scheduler logic. I'd expect reducing most of the
Timer calls reduces this. We also reduce total `mallocgc` time here by
5% by removing NewTimer, which should hopefully reduce GC load as well.

---

#### PR checklist

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #3407 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
mergify bot pushed a commit that referenced this issue Jul 10, 2024
Remove the return channel and waitgroup from switch.Broadcast as its
unused. This was costing us some CPU overhead (2% of broadcast routine
as it stands right now, but broadcast routine time should be dropping)
and one extra goroutine. This removes more overhead than what #3180 may
introduce.

It also better highlights the point that maybe all broadcast usecases
should consider using TrySend (And we potentially adapt channel
buffers), ala the point in #3151 . (This would have significant
system-wide impact, by significantly lowering the number of
timers/scheduler overhead we have)

---

#### PR checklist

- [x] Tests written/updated - Simplifies code, the behavior being
removed was never tested!
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
(cherry picked from commit e731a3f)

# Conflicts:
#	p2p/switch.go
#	p2p/switch_test.go
melekes added a commit that referenced this issue Jul 10, 2024
Remove the return channel and waitgroup from switch.Broadcast as its
unused. This was costing us some CPU overhead (2% of broadcast routine
as it stands right now, but broadcast routine time should be dropping)
and one extra goroutine. This removes more overhead than what #3180 may
introduce.

It also better highlights the point that maybe all broadcast usecases
should consider using TrySend (And we potentially adapt channel
buffers), ala the point in #3151 . (This would have significant
system-wide impact, by significantly lowering the number of
timers/scheduler overhead we have)

---

#### PR checklist

- [x] Tests written/updated - Simplifies code, the behavior being
removed was never tested!
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #3182 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
ValarDragon added a commit to osmosis-labs/cometbft that referenced this issue Aug 19, 2024
…ometbft#3480)

Remove the return channel and waitgroup from switch.Broadcast as its
unused. This was costing us some CPU overhead (2% of broadcast routine
as it stands right now, but broadcast routine time should be dropping)
and one extra goroutine. This removes more overhead than what cometbft#3180 may
introduce.

It also better highlights the point that maybe all broadcast usecases
should consider using TrySend (And we potentially adapt channel
buffers), ala the point in cometbft#3151 . (This would have significant
system-wide impact, by significantly lowering the number of
timers/scheduler overhead we have)

---

- [x] Tests written/updated - Simplifies code, the behavior being
removed was never tested!
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request cometbft#3182 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
ValarDragon added a commit to osmosis-labs/cometbft that referenced this issue Aug 19, 2024
…ometbft#3480) (#134)

Remove the return channel and waitgroup from switch.Broadcast as its
unused. This was costing us some CPU overhead (2% of broadcast routine
as it stands right now, but broadcast routine time should be dropping)
and one extra goroutine. This removes more overhead than what cometbft#3180 may
introduce.

It also better highlights the point that maybe all broadcast usecases
should consider using TrySend (And we potentially adapt channel
buffers), ala the point in cometbft#3151 . (This would have significant
system-wide impact, by significantly lowering the number of
timers/scheduler overhead we have)

---

- [x] Tests written/updated - Simplifies code, the behavior being
removed was never tested!
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request cometbft#3182 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
ValarDragon added a commit to osmosis-labs/cometbft that referenced this issue Aug 19, 2024
…ort cometbft#3407) (cometbft#3466)

Closes cometbft#3151

The broadcast routine in its entirely is the source of ~75% of all
created channels in the system, all coming from the `NewTimer` call in
Send. (and 75% of all NewTimer) Furthermore, the NewTimer call from
Broadcast is 2% of all CPU time across the system. However I think this
is actually more impact, due to the following.

We spend 11% of all CPU time in the scheduler, and 4% of all CPU time is
`CheckTimers` in the scheduler logic. I'd expect reducing most of the
Timer calls reduces this. We also reduce total `mallocgc` time here by
5% by removing NewTimer, which should hopefully reduce GC load as well.

---

- [x] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [x] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request cometbft#3407 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consensus enhancement New feature or request
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants
0