8000 checkqueue: implement a new scriptcheck worker pool with atomic variables by HowHsu · Pull Request #32791 · bitcoin/bitcoin · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

checkqueue: implement a new scriptcheck worker pool with atomic variables #32791

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

HowHsu
Copy link
@HowHsu HowHsu commented Jun 21, 2025
TL;DR
This patchset proposes a new design of scriptcheck worker pool with
atomic variables to leverage more cpu resource on a machine which the
current mutex version cannot. Achieve about 45% boost in performance.

Main Idea:

The current version of worker pool use mutex to do the synchronization
between worker threads, which makes the performance downgrade after
worker_threads_num >= 14. One root cause is the high contention on the
queue's mutex among worker threads. Workers repeatedly sleep and be woken
up in short time, leading to heavy context switching.
The idea is to use atomic variables to protect the queue, workers
acquire it and increase it with a batch_size(a small number). This
significantly reduces context switching.

Implementation

To implement it, below changes are made:(enabled in atomic mode)
- atomic<int> nNext is used to indicate the next available indice in the
  queue
- atomic<int> nTodo2 is used to indicate the number of tasks left
- nBatchSize is the batch size of tasks everytime a worker consumes
  it should be small to keep workload balance among workers
- m_mutex is still used, to protect m_result and conditional vars
- fAtomic to indicate which mode is enabled
- about the queue:

  - the queue is used as queue, not stacks any more
  - the queue is 'add only', which means only adding tasks to it, don't
    remove used tasks.

    the above two are a must to make this idea achievable, and the
    second sacrifices space

  - start the workers only after all the tasks have been added to the
    queue

    this is a trade-off to make the implementaion simple, a worker can
    compare nNext + nBatchSize to queue.size() to find if there are
    tasks undone in the queue, otherwise another atomic<int> tail is
    needed, and race condition may exist in this way (need more
    investigation). From my test, it's not 100% to say this is a
    'trade-off' since it also brings something good, pros && cons:
    - pros:
        1. no contention among producer and consumers when adding tasks
           to the queue
        2. the adding period is very short compared to the verification
           period
    - cons:
        1. the workers start later

For details, please refer to the worker handler code

Tests:

- Numbers
  - For blocks/s numbers
    `./build/bin/bench_bitcoin --filter=ConnectBlockMixedEcdsaSchnor`
  - For cpu usage numbers
    `NANOBENCH_ENDLESS=ConnectBlockAllEcdsa ./build/bin/bench_bitcoin --filter=ConnectBlockAllEcdsa`

  - Some data is missing because the machine was billed by time, and there wasn't
    enough time to complete all measurements.
  - Some cells contain a range instead of a single value due to high fluctuations
  - The numbers represent the average of multiple test runs.

           |     mutex verion       |   atomic version
nr_workers | blocks/s  | cpu usage  | blocks/s  | cpu usage2
-----------|-----------|------------|-----------|-------------
0          | 4.22      | 100%       | 4.2       | 100%
1          | 8.23      | 198%       | 8         | 191.3%
2          | 12.08     | 293.7%     | 11.6      | 276%
3          | 16        | 386.7%     | 14.9      | 354.5%
4          | 19.8      | 476.3%     |           |
5          | 23.12     | 565.4%     |           |
6          | 27.3      | 654%       | 23        | 560%
7          | 28.5      | 738%       |           |
8          | 33.8      | 809%       |           |
9          | 33.5      | 868%       | 28.5      | 730%
10         | 37.4      | 910%       | 30.5      | 777%
11         | 33~40     | 880%~928%  | 32        | 825%
12         | 22~39     | 860%~925%  | 32.5      | 830%
13         | 27.8~37   | 700%~913%  | 34.8      | 928%
15         | 24.8~38.9 | 640%~719%  | 38        | 1000%
19         | 18~23     | 370%~510%  | 41        | 1161%
22         | 7~18      | 270%~400%  |           |
25         | 6.4       | 260%       | 46        | 1400%
30         | 5         | 270%       | 49        | 1560%
60         | 5         | 280%       | 50~58     | 2100%

- Env:
 A virtual machine on cloud
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Address sizes:         46 bits physical, 48 bits virtual
   Byte Order:            Little Endian
   CPU(s):                64
   On-line CPU(s) list:   0-63
   Vendor ID:             GenuineIntel
   Model name:            Intel Xeon Processor (Cascadelake)
   CPU family:            6
   Model:                 85
   Thread(s) per core:    2
   Core(s) per socket:    16
   Socket(s):             2
   Stepping:              6
   BogoMIPS:              5985.93
 Virtualization features:
   Hypervisor vendor:     KVM
   Virtualization type:   full
 Caches (sum of all):
   L1d:                   2 MiB (64 instances)
   L1i:                   2 MiB (64 instances)
   L2:                    128 MiB (32 instances)
   L3:                    32 MiB (2 instances)
 NUMA:
   NUMA node(s):          2
   NUMA node0 CPU(s):     0-31
   NUMA node1 CPU(s):     32-63

Signed-off-by: Hao Xu <hao.xu@linux.dev>

@DrahtBot
Copy link
Contributor
DrahtBot commented Jun 21, 2025

The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.

Code Coverage & Benchmarks

For details see: https://corecheck.dev/bitcoin/bitcoin/pulls/32791.

Reviews

See the guideline for information on the review process.
A summary of reviews will appear here.

Conflicts

Reviewers, this pull request conflicts with the following ones:

  • #32699 (docs: adds correct updated documentation links by Zeegaths)
  • #31132 (validation: fetch block inputs on parallel threads 10% faster IBD by andrewtoth)
  • #28690 (build: Introduce internal kernel library by TheCharlatan)

If you consider this pull request important, please also help to review the conflicting pull requests. Ideally, start with the one that should be merged first.

@DrahtBot
Copy link
Contributor

🚧 At least one of the CI tasks failed.
Task no wallet, libbitcoinkernel: https://github.com/bitcoin/bitcoin/runs/44527955551
LLM reason (✨ experimental): The CI failed due to a linker error: undefined reference to GetNumCores(), indicating a missing implementation or linkage of this function.

Hints

Try to run the tests locally, according to the documentation. However, a CI failure may still
happen due to a number of reasons, for example:

  • Possibly due to a silent merge conflict (the changes in this pull request being
    incompatible with the current code in the target branch). If so, make sure to rebase on the latest
    commit of the target branch.

  • A sanitizer issue, which can only be found by compiling with the sanitizer and running the
    affected test.

  • An intermittent issue.

Leave a comment here, if you need help tracking down a confusing failure.

Copy link
@Muaytie666 Muaytie666 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

98b1db9283b5c04775309fa5dc2161fc597b355a

@HowHsu
Copy link
Author
HowHsu commented Jun 22, 2025

hmmm...Ok, I realized that this may only be suitable for when all the preout Coin are in CCoinsViewCache, otherwise the Add() is not fast due to I/O there.

@HowHsu HowHsu force-pushed the checkqueue_atomic branch from 7d2b002 to 15ca020 Compare June 23, 2025 14:36
@HowHsu HowHsu force-pushed the checkqueue_atomic branch 5 times, most recently from 50a5c2c to 299d5b5 Compare June 24, 2025 12:16
Hao Xu added 4 commits June 24, 2025 21:22
…bles

TL;DR
This patchset proposes a new design of scriptcheck worker pool with
atomic variables to leverage more cpu resource on a machine which the
current mutex version cannot. Achieve about 45% boost in performance.

Main Idea:

The current version of worker pool use mutex to do the synchronization
between worker threads, which makes the performance downgrade after
worker_threads_num >= 14. One root cause is the high contention on the
queue's mutex among worker threads. Workers repeatedly sleep and be woken
up in short time, leading to heavy context switching.
The idea is to use atomic variables to protect the queue, workers
acquire it and increase it with a batch_size(a small number). This
significantly reduces context switching.

Implementation

To implement it, below changes are made:(enabled in atomic mode)
- atomic<int> nNext is used to indicate the next available indice in the
  queue
- atomic<int> nTodo2 is used to indicate the number of tasks left
- nBatchSize is the batch size of tasks everytime a worker consumes
  it should be small to keep workload balance among workers
- m_mutex is still used, to protect m_result and conditional vars
- fAtomic to indicate which mode is enabled
- about the queue:

  - the queue is used as queue, not stacks any more
  - the queue is 'add only', which means only adding tasks to it, don't
    remove used tasks.

    the above two are a must to make this idea achievable, and the
    second sacrifices space

  - start the workers only after all the tasks have been added to the
    queue

    this is a trade-off to make the implementaion simple, a worker can
    compare nNext + nBatchSize to queue.size() to find if there are
    tasks undone in the queue, otherwise another atomic<int> tail is
    needed, and race condition may exist in this way (need more
    investigation). From my test, it's not 100% to say this is a
    'trade-off' since it also brings something good, pros && cons:
    - pros:
        1. no contention among producer and consumers when adding tasks
           to the queue
        2. the adding period is very short compared to the verification
           period
    - cons:
        1. the workers start later

For details, please refer to the worker handler code

Tests:

- Numbers
  - For blocks/s numbers
    `./build/bin/bench_bitcoin --filter=ConnectBlockMixedEcdsaSchnor`
  - For cpu usage numbers
    `NANOBENCH_ENDLESS=ConnectBlockAllEcdsa ./build/bin/bench_bitcoin --filter=ConnectBlockAllEcdsa`

  - Some data is missing because the machine was billed by time, and there wasn't
    enough time to complete all measurements.
  - Some cells contain a range instead of a single value due to high fluctuations
  - The numbers represent the average of multiple test runs.

           |     mutex verion       |   atomic version
nr_workers | blocks/s  | cpu usage  | blocks/s  | cpu usage2
-----------|-----------|------------|-----------|-------------
0          | 4.22      | 100%       | 4.2       | 100%
1          | 8.23      | 198%       | 8         | 191.3%
2          | 12.08     | 293.7%     | 11.6      | 276%
3          | 16        | 386.7%     | 14.9      | 354.5%
4          | 19.8      | 476.3%     |           |
5          | 23.12     | 565.4%     |           |
6          | 27.3      | 654%       | 23        | 560%
7          | 28.5      | 738%       |           |
8          | 33.8      | 809%       |           |
9          | 33.5      | 868%       | 28.5      | 730%
10         | 37.4      | 910%       | 30.5      | 777%
11         | 33~40     | 880%~928%  | 32        | 825%
12         | 22~39     | 860%~925%  | 32.5      | 830%
13         | 27.8~37   | 700%~913%  | 34.8      | 928%
15         | 24.8~38.9 | 640%~719%  | 38        | 1000%
19         | 18~23     | 370%~510%  | 41        | 1161%
22         | 7~18      | 270%~400%  |           |
25         | 6.4       | 260%       | 46        | 1400%
30         | 5         | 270%       | 49        | 1560%
60         | 5         | 280%       | 50~58     | 2100%

- Env:
 A virtual machine on cloud
   Architecture:          x86_64
   CPU op-mode(s):        32-bit, 64-bit
   Address sizes:         46 bits physical, 48 bits virtual
   Byte Order:            Little Endian
   CPU(s):                64
   On-line CPU(s) list:   0-63
   Vendor ID:             GenuineIntel
   Model name:            Intel Xeon Processor (Cascadelake)
   CPU family:            6
   Model:                 85
   Thread(s) per core:    2
   Core(s) per socket:    16
   Socket(s):             2
   Stepping:              6
   BogoMIPS:              5985.93
 Virtualization features:
   Hypervisor vendor:     KVM
   Virtualization type:   full
 Caches (sum of all):
   L1d:                   2 MiB (64 instances)
   L1i:                   2 MiB (64 instances)
   L2:                    128 MiB (32 instances)
   L3:                    32 MiB (2 instances)
 NUMA:
   NUMA node(s):          2
   NUMA node0 CPU(s):     0-31
   NUMA node1 CPU(s):     32-63

Signed-off-by: Hao Xu <hao.xu@linux.dev>
The current max number of scriptcheck threads is MAX_SCRIPTCHECK_THREADS = 14
due to the limitation of the current mechanism, release it to GetNumCores() as
we have atomic mode of scriptcheck worker pool.

Signed-off-by: Hao Xu <hao.xu@linux.dev>
When verification of a transaction get failed, make workers quit the
work doing path and sleep immediately.

Signed-off-by: Hao Xu <hao.xu@linux.dev>
GetNumCores() is now used in checkqueue related code, thus it has to be
added to libbitcoinkernel to keep everything ok

Signed-off-by: Hao Xu <hao.xu@linux.dev>
@HowHsu HowHsu force-pushed the checkqueue_atomic branch from 299d5b5 to a222779 Compare June 24, 2025 13:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants
0