8000 State: Use `std::vector` by glebm · Pull Request #97 · s-yata/marisa-trie · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

State: Use std::vector #97

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 25, 2025
Merged

State: Use std::vector #97

merged 2 commits into from
May 25, 2025

Conversation

glebm
Copy link
Contributor
@glebm glebm commented May 24, 2025

The vectors in State are never memory-mapped, so we can use std::vector instead of Marisa's vector.

std::vector is smaller in size (24 bytes vs 32 bytes).

This reduces State size from 80 bytes to 64 bytes, making it fit into a single cache line.

Benchmark on my machine shows a tiny improvement on a large file (best run of 3, so could just be noise):

Before:

Number of tries: 1 - 5
TAIL mode: Text mode
Node order: Descending weight order
Cache level: Normal cache
Number of keys: 100000
Total length: 2552180
------+----------+--------+--------+--------+--------+--------
#tries       size    build   lookup  reverse   prefix  predict
                                      lookup   search   search
          [bytes]    [K/s]    [K/s]    [K/s]    [K/s]    [K/s]
------+----------+--------+--------+--------+--------+--------
     1    2671808  2103.71  3719.41  4865.23  3502.50   226.75
     2    2623280  1718.33  2442.30  3081.38  2328.61    82.26
     3    2589952  1349.84  1763.61  2414.41  1780.18    50.49
     4    2560184  1160.51  1467.46  1991.91  1430.51    37.06
     5    2533704  1011.27  1239.13  1674.87  1213.36    29.25
------+----------+--------+--------+--------+--------+--------

After:

------+----------+--------+--------+--------+--------+--------
#tries       size    build   lookup  reverse   prefix  predict
                                      lookup   search   search
          [bytes]    [K/s]    [K/s]    [K/s]    [K/s]    [K/s]
------+----------+--------+--------+--------+--------+--------
     1    2671808  2089.91  3801.27  4954.66  3593.12   231.46
     2    2623280  1715.00  2398.54  3184.51  2322.39    83.26
     3    2589952  1360.84  1821.86  2412.37  1742.92    51.08
     4    2560184  1153.80  1468.41  1966.65  1424.16    37.24
     5    2533704  1009.94  1230.84  1680.45  1209.50    29.56

The vectors in `State` are never memory-mapped, so
we can use `std::vector` instead of Marisa's vector.

`std::vector` is smaller in size.

This reduces `State` size from 80 bytes to 64 bytes,
making it fit into a single cache line.
@glebm
Copy link
Contributor Author
glebm commented May 24, 2025

Perhaps in a future PR we can merge State into Agent, reducing overall memory usage and eliminating an allocation.
That might be a subtly backwards-incompatible change (e.g. we'd probably change the agent.key() method to return by-value rather than a const reference) so probably good for 0.3 or 0.4.

@s-yata
Copy link
Owner
s-yata commented May 24, 2025

Please wait.
std::vector can throw std::exception.
libmarisa has to catch all such exceptions and throw marisa::Exception instead.
It's so troublesome.

@glebm
Copy link
Contributor Author
glebm commented May 24, 2025

I think the only exception it can throw is std::bad_alloc? Perhaps there is no need to have a Marisa-specific exception for std::bad_alloc?

@jmr
Copy link
Contributor
jmr commented May 24, 2025

I was also wondering what the advantage of using std::nothrow and then throwing marisa::Exception was.

@s-yata
Copy link
Owner
s-yata commented May 24, 2025

I cannot remember the details.
Probably, for debugging?
marisa::Exception has line() and filename().

Even if the design is strange, we should not change the behavior for compatibility.

@glebm glebm force-pushed the state-vectors branch 2 times, most recently from 5b32d39 to 8014250 Compare May 24, 2025 18:05
@glebm glebm marked this pull request as draft May 24, 2025 18:09
@glebm
Copy link
Contributor Author
glebm commented May 24, 2025

I've added exception rethrowing. It increases libmarisa.a size by ~20 KiB but doesn't seem to reduce performance, so we still have a small performance win.

For a major version bump, I think it'd be useful to consider not having this custom behaviour. Most code out there doesn't customize exceptions from the standard library this way, so it can be unexpected.

For example, in the rare situation where a program actually handles allocation failure, there is usually an std::bad_alloc try-catch around a large section of the program. Marisa throws a custom exception instead of the standard one, bypassing such handlers.

@glebm glebm marked this pull request as ready for review May 24, 2025 18:39
Comment on lines +74 to +84
#ifndef MARISA_USE_EXCEPTIONS
#if defined(__GNUC__) && !defined(__EXCEPTIONS)
#define MARISA_USE_EXCEPTIONS 0
#elif defined(__clang__) && !defined(__cpp_exceptions)
#define MARISA_USE_EXCEPTIONS 0
#elif defined(_MSC_VER) && !_HAS_EXCEPTIONS
#define MARISA_USE_EXCEPTIONS 0
#else
#define MARISA_USE_EXCEPTIONS 1
#endif
#endif
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you assume an environment that disallows exceptions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This supports such environments on GCC, Clang, and MSVC.
Exceptions are commonly disabled in environments such as Google and game development.

It is usually possible to compile individual libraries with exception support in these environments (as long as there is no exception handling in public headers) but it's pointless to catch-and-rethrow in this case.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the information.
I didn't know this.

If exceptions are disabled, an application is terminated by an exception?

Copy link
Contributor Author
@glebm glebm May 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if exceptions are disabled, then throwing an exception terminates the program.

Internally at Google we use explicit error handling with absl::Status (std::expected is the standard library equivalent in C++23).

There is generally no sensible way to handle things like out-of-memory errors or internal logic errors, so we do not try to handle them and simply allow the program to crash in those cases.

@s-yata s-yata merged commit 9c4615a into s-yata:master May 25, 2025
7 checks passed
@glebm glebm deleted the state-vectors branch May 25, 2025 07:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0