
Establish and implement the relevant metrics to understand storage workloads #46


Closed
3 tasks done
Tracked by #44 ...
lasarojc opened this issue Dec 23, 2022 · 2 comments · Fixed by #1974
Labels: metrics, P:storage-optimization (Priority: Give operators greater control over storage and storage optimization), storage

Comments

lasarojc (Contributor) commented Dec 23, 2022

This issue was originally tendermint/tendermint#9773.

We need to identify the set of metrics to understand the storage workloads of CometBFT. The metrics should help us identify:

  • The access patterns (sequential / random access)
  • How often the data is read/written
  • Who reads/writes the data
  • Whether the data is accessed by multiple components or just one
  • How much of the total height time is spent in storage, on a small network vs. a big network? (Is storage a bottleneck?)

Open question: Do we want information on which CometBFT BlockStore / StateStore method call triggered the access, or are read/write/delete counts and timings enough?
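For illustration only, here is a minimal sketch (not the actual CometBFT implementation) of how such a metric could be exposed with prometheus/client_golang: a single histogram labeled by method and operation gives read/write/delete counts and timings, and also records which BlockStore/StateStore method triggered the access, which would answer the open question above. The metric name, labels, buckets, and helper function are assumptions for the sake of the example.

```go
package store

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// storageAccessSeconds is a hypothetical histogram: duration of storage
// accesses, labeled by the store method that triggered the access and by the
// low-level operation (read, write, delete). Per-label counts come for free
// from the histogram's _count series.
var storageAccessSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "cometbft",
		Subsystem: "storage",
		Name:      "access_duration_seconds",
		Help:      "Duration of accesses to the underlying DB, by method and operation.",
		Buckets:   prometheus.ExponentialBuckets(0.0001, 2, 15), // ~100µs up to ~1.6s
	},
	[]string{"method", "operation"},
)

func init() {
	prometheus.MustRegister(storageAccessSeconds)
}

// observeAccess wraps a single storage operation with timing, e.g.
// observeAccess("save_block", "write", func() error { return db.SetSync(k, v) }).
func observeAccess(method, operation string, op func() error) error {
	start := time.Now()
	err := op()
	storageAccessSeconds.WithLabelValues(method, operation).Observe(time.Since(start).Seconds())
	return err
}
```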

Draft implementation: tendermint/tendermint#9774

DoD (Definition of Done)

jmalicevic (Contributor) commented
For Q4 2023, we will focus on:

  • Storage access time (last point in the issue description)
  • Which data is most frequently read
  • The amount of storage taken by different workloads with and without pruning
  • The throughput with and without pruning - This was derived from a call with Injective, whose users' latency was heavily impacted when pruning was enabled.

This list might be altered after an in-person team meeting deciding on the exact measurements that will help us achieve our Q4 goals.

The main results of this testing should be:

  • Showing that the impact of pruning has improved (i.e., that compaction works)
  • Database access times have decreased
  • Impact of the new pruning mechanism on throughput
  • Impact of different key representation + new pruning mechanism on throughput.

melekes (Contributor) commented Nov 28, 2023

> The throughput with and without pruning - This was derived from a call with Injective, whose users' latency was heavily impacted when pruning was enabled.

The throughput can be hard to measure as it depends on many factors. Is it better to measure the duration of pruning (if possible)? This is similar to how the Golang core team optimised their garbage collector by looking at the duration of GC pauses.
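As a rough sketch of that suggestion (with assumed names, not CometBFT's actual pruning code): each pruning pass can be timed with a plain histogram, analogous to observing GC pause durations.

```go
package store

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// pruningDurationSeconds is a hypothetical metric for the duration of a
// single pruning pass, the storage analogue of a GC pause.
var pruningDurationSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
	Namespace: "cometbft",
	Subsystem: "storage",
	Name:      "pruning_duration_seconds",
	Help:      "Time spent in one block/state pruning pass.",
	Buckets:   prometheus.ExponentialBuckets(0.001, 2, 16), // ~1ms up to ~32s
})

func init() {
	prometheus.MustRegister(pruningDurationSeconds)
}

// pruneBlocks is a stand-in for the real pruning entry point; the timing
// wrapper is the point of this sketch, not the deletion logic.
func pruneBlocks(retainHeight int64) (pruned uint64, err error) {
	start := time.Now()
	defer func() { pruningDurationSeconds.Observe(time.Since(start).Seconds()) }()

	// ... delete blocks below retainHeight and trigger compaction here ...
	return pruned, nil
}
```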

@adizere adizere modified the milestones: 2023-Q4, 2024-Q1 Jan 10, 2024
@adizere adizere added this to CometBFT Jan 10, 2024
@github-project-automation github-project-automation bot moved this to Todo in CometBFT Jan 10, 2024
@jmalicevic jmalicevic moved this from Todo to In Progress in CometBFT Jan 17, 2024
@jmalicevic jmalicevic self-assigned this Jan 17, 2024
@jmalicevic jmalicevic moved this from In Progress to Ready for Review in CometBFT Jan 30, 2024
github-merge-queue bot pushed a commit that referenced this issue Feb 13, 2024
This PR supersedes #79, with some adjustments to the segments of code timed as well as to the bucket sizes. The majority of the code was done by William in #79. I tried to fine-tune the measurements to exclude proto marshalling/unmarshalling where I thought it made sense.

Closes #46

Blocked on benchmarking to confirm it is measuring what we want.
(Follow-up) The metrics gave us nice and realistic measurements in our benchmarks.



Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
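To illustrate the adjustment described in the commit message above (a sketch under assumed names, not the code merged in #1974): only the raw DB read sits inside the timed span, proto unmarshalling happens outside it, and the histogram uses explicit buckets rather than the defaults.

```go
package store

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// kvReader is a minimal stand-in for the DB interface the block store uses.
type kvReader interface {
	Get(key []byte) ([]byte, error)
}

// loadDurationSeconds is a hypothetical histogram with explicit buckets tuned
// for DB reads ranging from well under a millisecond to a few seconds.
var loadDurationSeconds = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "blockstore_access_duration_seconds",
		Help:    "Time spent on raw DB reads, excluding proto (un)marshalling.",
		Buckets: []float64{0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5},
	},
	[]string{"method"},
)

// loadRaw times only the DB access; the caller unmarshals the returned bytes
// outside the measured span.
func loadRaw(db kvReader, method string, key []byte) ([]byte, error) {
	start := time.Now()
	bz, err := db.Get(key)
	loadDurationSeconds.WithLabelValues(method).Observe(time.Since(start).Seconds())
	return bz, err
}
```

A caller, say a hypothetical LoadBlockMeta, would invoke loadRaw(db, "load_block_meta", key) and unmarshal the returned bytes afterwards, so serialization cost never inflates the storage timing.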
@github-project-automation github-project-automation bot moved this from Ready for Review to Done in CometBFT Feb 13, 2024
mergify bot pushed a commit that referenced this issue Feb 22, 2024
(cherry picked from commit dfd3f6c)

# Conflicts:
#	internal/store/store.go