Introduce the `small-hash` feature for `bitcoin_hashes` by afilini · Pull Request #1990 · rust-bitcoin/rust-bitcoin · GitHub

Merged: 1 commit merged into rust-bitcoin:master on Aug 17, 2023

Conversation

@afilini (Contributor) commented Aug 11, 2023

When enabled, this feature swaps the hash implementations of sha512, sha256, and ripemd160 for smaller (but also slower) ones.

On embedded processors (Cortex-M4) it can lead to up to a 52% size reduction, from around 37KiB for just the process_block methods of the three hash functions to 17.8KiB.
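For downstream users, opting in would look like the usual Cargo feature syntax (a hypothetical snippet; the version requirement is a placeholder):

```toml
# Illustrative downstream Cargo.toml entry; substitute a real version.
[dependencies]
bitcoin_hashes = { version = "*", features = ["small-hash"] }
```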

The following numbers were collected on aarch64-unknown-linux-gnu with cargo 1.72.0-nightly.

Original

RUSTFLAGS='--cfg=bench -C opt-level=z' cargo bench
test hash160::benches::hash160_10                 ... bench:          33 ns/iter (+/- 1) = 303 MB/s
test hash160::benches::hash160_1k                 ... bench:       2,953 ns/iter (+/- 187) = 346 MB/s
test hash160::benches::hash160_64k                ... bench:     188,480 ns/iter (+/- 11,595) = 347 MB/s
test hmac::benches::hmac_sha256_10                ... bench:          33 ns/iter (+/- 2) = 303 MB/s
test hmac::benches::hmac_sha256_1k                ... bench:       2,957 ns/iter (+/- 104) = 346 MB/s
test hmac::benches::hmac_sha256_64k               ... bench:     192,022 ns/iter (+/- 6,407) = 341 MB/s
test ripemd160::benches::ripemd160_10             ... bench:          25 ns/iter (+/- 1) = 400 MB/s
test ripemd160::benches::ripemd160_1k             ... bench:       2,288 ns/iter (+/- 93) = 447 MB/s
test ripemd160::benches::ripemd160_64k            ... bench:     146,823 ns/iter (+/- 1,102) = 446 MB/s
test sha1::benches::sha1_10                       ... bench:          41 ns/iter (+/- 0) = 243 MB/s
test sha1::benches::sha1_1k                       ... bench:       3,844 ns/iter (+/- 70) = 266 MB/s
test sha1::benches::sha1_64k                      ... bench:     245,854 ns/iter (+/- 10,158) = 266 MB/s
test sha256::benches::sha256_10                   ... bench:          35 ns/iter (+/- 0) = 285 MB/s
test sha256::benches::sha256_1k                   ... bench:       3,063 ns/iter (+/- 15) = 334 MB/s
test sha256::benches::sha256_64k                  ... bench:     195,729 ns/iter (+/- 2,880) = 334 MB/s
test sha256d::benches::sha256d_10                 ... bench:          34 ns/iter (+/- 1) = 294 MB/s
test sha256d::benches::sha256d_1k                 ... bench:       3,071 ns/iter (+/- 107) = 333 MB/s
test sha256d::benches::sha256d_64k                ... bench:     188,614 ns/iter (+/- 8,101) = 347 MB/s
test sha512::benches::sha512_10                   ... bench:          21 ns/iter (+/- 0) = 476 MB/s
test sha512::benches::sha512_1k                   ... bench:       1,714 ns/iter (+/- 36) = 597 MB/s
test sha512::benches::sha512_64k                  ... bench:     110,084 ns/iter (+/- 3,637) = 595 MB/s
test sha512_256::benches::sha512_256_10           ... bench:          22 ns/iter (+/- 1) = 454 MB/s
test sha512_256::benches::sha512_256_1k           ... bench:       1,822 ns/iter (+/- 70) = 562 MB/s
test sha512_256::benches::sha512_256_64k          ... bench:     116,231 ns/iter (+/- 4,745) = 563 MB/s
test siphash24::benches::siphash24_1ki            ... bench:       1,072 ns/iter (+/- 41) = 955 MB/s
test siphash24::benches::siphash24_1ki_hash       ... bench:       1,102 ns/iter (+/- 42) = 929 MB/s
test siphash24::benches::siphash24_1ki_hash_u64   ... bench:       1,064 ns/iter (+/- 41) = 962 MB/s
test siphash24::benches::siphash24_64ki           ... bench:      69,957 ns/iter (+/- 2,712) = 936 MB/s
0000000000005872 t _ZN84_$LT$bitcoin_hashes..ripemd160..HashEngine$u20$as$u20$bitcoin_hashes..HashEngine$GT$5input17hc4800746a9da7ff4E
0000000000007956 t _ZN81_$LT$bitcoin_hashes..sha256..HashEngine$u20$as$u20$bitcoin_hashes..HashEngine$GT$5input17hf49345f65130ce9bE
0000000000008024 t _ZN14bitcoin_hashes6sha2568Midstate10const_hash17h57317bc8012004b4E.llvm.441255102889972912
0000000000010528 t _ZN81_$LT$bitcoin_hashes..sha512..HashEngine$u20$as$u20$bitcoin_hashes..HashEngine$GT$5input17h9bc868d4392bd9acE

Total size: 32380 bytes

With small-hash enabled

RUSTFLAGS='--cfg=bench -C opt-level=z' cargo bench --features small-hash
test hash160::benches::hash160_10                 ... bench:          52 ns/iter (+/- 3) = 192 MB/s
test hash160::benches::hash160_1k                 ... bench:       4,817 ns/iter (+/- 286) = 212 MB/s
test hash160::benches::hash160_64k                ... bench:     319,572 ns/iter (+/- 11,031) = 205 MB/s
test hmac::benches::hmac_sha256_10                ... bench:          54 ns/iter (+/- 2) = 185 MB/s
test hmac::benches::hmac_sha256_1k                ... bench:       4,846 ns/iter (+/- 204) = 211 MB/s
test hmac::benches::hmac_sha256_64k               ... bench:     319,114 ns/iter (+/- 4,451) = 205 MB/s
test ripemd160::benches::ripemd160_10             ... bench:          27 ns/iter (+/- 0) = 370 MB/s
test ripemd160::benches::ripemd160_1k             ... bench:       2,358 ns/iter (+/- 150) = 434 MB/s
test ripemd160::benches::ripemd160_64k            ... bench:     154,573 ns/iter (+/- 3,954) = 423 MB/s
test sha1::benches::sha1_10                       ... bench:          41 ns/iter (+/- 1) = 243 MB/s
test sha1::benches::sha1_1k                       ... bench:       3,700 ns/iter (+/- 243) = 276 MB/s
test sha1::benches::sha1_64k                      ... bench:     231,039 ns/iter (+/- 13,989) = 283 MB/s
test sha256::benches::sha256_10                   ... bench:          51 ns/iter (+/- 3) = 196 MB/s
test sha256::benches::sha256_1k                   ... bench:       4,823 ns/iter (+/- 182) = 212 MB/s
test sha256::benches::sha256_64k                  ... bench:     299,960 ns/iter (+/- 17,545) = 218 MB/s
test sha256d::benches::sha256d_10                 ... bench:          52 ns/iter (+/- 2) = 192 MB/s
test sha256d::benches::sha256d_1k                 ... bench:       4,827 ns/iter (+/- 323) = 212 MB/s
test sha256d::benches::sha256d_64k                ... bench:     302,844 ns/iter (+/- 15,796) = 216 MB/s
test sha512::benches::sha512_10                   ... bench:          34 ns/iter (+/- 1) = 294 MB/s
test sha512::benches::sha512_1k                   ... bench:       3,002 ns/iter (+/- 123) = 341 MB/s
test sha512::benches::sha512_64k                  ... bench:     189,767 ns/iter (+/- 10,396) = 345 MB/s
test sha512_256::benches::sha512_256_10           ... bench:          34 ns/iter (+/- 1) = 294 MB/s
test sha512_256::benches::sha512_256_1k           ... bench:       2,996 ns/iter (+/- 198) = 341 MB/s
test sha512_256::benches::sha512_256_64k          ... bench:     192,024 ns/iter (+/- 8,181) = 341 MB/s
test siphash24::benches::siphash24_1ki            ... bench:       1,081 ns/iter (+/- 65) = 947 MB/s
test siphash24::benches::siphash24_1ki_hash       ... bench:       1,083 ns/iter (+/- 63) = 945 MB/s
test siphash24::benches::siphash24_1ki_hash_u64   ... bench:       1,084 ns/iter (+/- 63) = 944 MB/s
test siphash24::benches::siphash24_64ki           ... bench:      67,237 ns/iter (+/- 4,185) = 974 MB/s
0000000000005384 t _ZN81_$LT$bitcoin_hashes..sha256..HashEngine$u20$as$u20$bitcoin_hashes..HashEngine$GT$5input17hae341658cf9b880bE
0000000000005608 t _ZN14bitcoin_hashes9ripemd16010HashEngine13process_block17h3276b13f1e9feef8E.llvm.13618235596061801146
0000000000005616 t _ZN14bitcoin_hashes6sha2568Midstate10const_hash17h3e6fbef64c15ee00E.llvm.7326223909590351031
0000000000005944 t _ZN81_$LT$bitcoin_hashes..sha512..HashEngine$u20$as$u20$bitcoin_hashes..HashEngine$GT$5input17h321a237bfbe5c0bbE

Total size: 22552 bytes

Conclusion

On aarch64 there's an overall ~30% improvement in size, although ripemd160 doesn't really shrink that much (and its performance isn't impacted much either, with only a 6% slowdown). sha512 and sha256, on the other hand, are almost 40% slower with small-hash enabled.

I don't have performance numbers for other architectures, but in terms of size there was an even larger improvement on thumbv7em-none-eabihf, with a 52% size reduction overall:

   Size          Crate Name
25.3KiB bitcoin_hashes <bitcoin_hashes[fe467ef2aa3a1470]::sha512::HashEngine as bitcoin_hashes[fe467ef2aa3a1470]::HashEngine>::input
 6.9KiB bitcoin_hashes <bitcoin_hashes[fe467ef2aa3a1470]::sha256::HashEngine as bitcoin_hashes[fe467ef2aa3a1470]::HashEngine>::input
 4.8KiB bitcoin_hashes <bitcoin_hashes[fe467ef2aa3a1470]::ripemd160::HashEngine as bitcoin_hashes[fe467ef2aa3a1470]::HashEngine>::input

vs

  Size          Crate Name
9.5KiB bitcoin_hashes <bitcoin_hashes[974bb476ef905797]::sha512::HashEngine as bitcoin_hashes[974bb476ef905797]::HashEngine>::input
4.5KiB bitcoin_hashes <bitcoin_hashes[974bb476ef905797]::ripemd160::HashEngine>::process_block
3.8KiB bitcoin_hashes <bitcoin_hashes[974bb476ef905797]::sha256::HashEngine as bitcoin_hashes[974bb476ef905797]::HashEngine>::input

I'm assuming this is because on more limited architectures the compiler needs to use more instructions to move data in and out of registers (especially for sha512, which would ideally benefit from 64-bit registers), so reusing the code by moving it into functions saves a lot of those instructions.

Also note that the const_hash method on sha256 causes the compiler to emit two independent implementations. I haven't looked into the code yet, maybe there's a way to merge them so that the non-const process_block calls into the const fn.
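The merging idea mentioned above could look roughly like this delegation pattern (hypothetical names and a trivial stand-in body, not the crate's actual code): keep a single const implementation and let the runtime path call into it, so the compiler only has one copy of the round logic to emit.

```rust
// Stand-in for the real compression rounds; const so it can back
// a const_hash-style API as well as the runtime engine.
const fn process_block_const(mut state: [u32; 2], block: [u32; 2]) -> [u32; 2] {
    state[0] = state[0].wrapping_add(block[0]);
    state[1] = state[1].wrapping_add(block[1]);
    state
}

// Runtime path delegates to the const fn instead of duplicating it.
fn process_block(state: &mut [u32; 2], block: [u32; 2]) {
    *state = process_block_const(*state, block);
}
```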


Note: commits are unverified right now because I don't have the keys available; I will sign them after addressing the review comments.

@afilini afilini force-pushed the feature-small-hash branch 2 times, most recently from c2698b8 to e46a718 Compare August 11, 2023 20:08
@apoelstra (Member)

Can you edit clippy.toml to set too-many-arguments-threshold to 9 (or whatever value you need)?
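That would be a one-line clippy.toml fragment (assuming the standard Clippy configuration key for the too_many_arguments lint):

```toml
too-many-arguments-threshold = 9
```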

#[allow(non_snake_case)]
const fn Sigma1(x: u32) -> u32 { x.rotate_left(26) ^ x.rotate_left(21) ^ x.rotate_left(7) }
const fn sigma0(x: u32) -> u32 { x.rotate_left(25) ^ x.rotate_left(14) ^ (x >> 3) }
const fn sigma1(x: u32) -> u32 { x.rotate_left(15) ^ x.rotate_left(13) ^ (x >> 10) }
Comment from a Member:

In e46a718633ee897ff00fd8fbcde81a06ced3b702:

I think you could promote these to the main module. I'm doubtful that converting macros to const fns would have any effect at all on performance. Plausibly it could increase performance if it made the compiler more aware when we're repeating the same operations and could vectorize them. Though we should measure it. But I think it should just give the compiler more freedom, to make them functions when it's optimizing for size and to inline them otherwise.

Reply from @afilini (Contributor, Author):

Yeah I guess the compiler would probably normally inline them, unless you are optimizing for size.

If it ends up affecting performance we could also tell the compiler to inline them explicitly when small-hash is not enabled, with #[cfg_attr(...)].
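A minimal sketch of that cfg_attr idea, reusing the sigma1 helper quoted above (the attribute placement is hypothetical; it assumes the feature is named `small-hash`):

```rust
// Keep the helper a const fn, but ask the compiler to inline it
// whenever the size-optimized `small-hash` feature is NOT enabled.
#[cfg_attr(not(feature = "small-hash"), inline(always))]
const fn sigma1(x: u32) -> u32 { x.rotate_left(15) ^ x.rotate_left(13) ^ (x >> 10) }
```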

@apoelstra (Member)

Overall concept ACK. I'm super impressed at how small this diff is. I was a little apprehensive about reviewing this PR but it's much smaller than I expected.

@dpc (Contributor) commented Aug 11, 2023

Just thinking aloud. I wonder if this should be a Cargo feature. Feels kind of weird ... as this doesn't really enable any "feature", and is more like a compilation-level thing. I wonder if there are any idiomatic alternatives. Feels to me like this would be best controlled by an env var during compilation. Or maybe I'm just overcomplicating.

@afilini (Contributor, Author) commented Aug 12, 2023

> I'm super impressed at how small this diff is. I was a little apprehensive about reviewing this PR but it's much smaller than I expected.

Yeah one of my main goals was to keep this very easy to review (and also for me to write, since I don't have much experience writing and optimizing code like this). I'm sure we could squeeze out a lot more space/performance with more effort, but this is probably a good trade-off between code readability and space/performance imho.

@dpc regarding the cargo feature thing: sometimes features are used in this way; for example, secp has a lowmemory feature that doesn't really add anything to the library but makes it use a smaller precomputed table to save memory. Env variables would be a bit annoying to use, in my opinion: first of all, you would have to document them (where?), while with features one can simply look at the Cargo.toml and find them. Also, it would force downstream projects to either have custom scripts to compile or write a build.rs file that sets these env variables automatically. And it would make it impossible to include two versions of the library with different features (although one might argue that in general nobody would really want to do that..).

@apoelstra (Member)

I agree that it's a little weird to use a cargo feature, but also think it's the right choice, for the same reasons as the rust-secp lowmemory feature. That is, it has zero effect on functionality, and if some dep wants this on, it doesn't (really) make sense for any other dep to be able to turn it off.

@tcharding (Member) left a comment:

What's the trick @afilini, why is one fast and one small, const functions vs macros, I don't see it right now. Thanks

Comment on lines 203 to 214
pub(super) const fn round(
a: u32,
b: u32,
c: u32,
d: u32,
e: u32,
f: u32,
g: u32,
h: u32,
k: u32,
w: u32,
) -> (u32, u32) {
Comment from a Member:

Perhaps custom formatting would be better here, something like:

    #[rustfmt::skip]
    pub(super) const fn round(a: u32, b: u32, c: u32, d: u32, e: u32,
                              f: u32, g: u32, h: u32, k: u32, w: u32) -> (u32, u32) {

Reply from a Collaborator:

Yeah hey all we have to do to have sane formatting is polish our code with some pretty annotations!

@apoelstra (Member)

@tcharding I suspect that const fns are just as fast, but the reason that they (can be) smaller is that macros are always inlined whereas const fns might not be. Inlining is faster because it avoids function call overhead.
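To make the contrast concrete, here is a hypothetical helper written both ways (illustrative, not the crate's actual code): the macro body is textually expanded at every call site, while the const fn is a real function the optimizer may either inline or keep outlined when optimizing for size.

```rust
// Macro form: the body is expanded (inlined) at every use site.
macro_rules! ch {
    ($e:expr, $f:expr, $g:expr) => {
        ($e & $f) ^ (!$e & $g)
    };
}

// const fn form: a genuine function; the optimizer decides whether to inline it.
const fn ch(e: u32, f: u32, g: u32) -> u32 { (e & f) ^ (!e & g) }
```

Both forms compute the same value; only the compiler's freedom to outline the code differs.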

Also +1 to custom formatting.

@tcharding (Member) commented Aug 14, 2023

So I tried to read the dragon book a couple of times before and never made it past the first chapters. Anyone got any suggestions on books/sites to learn about compiler optimisations without making my eyes bleed?

@tcharding (Member)

With green CI I'm happy to ack this. It's basically just swapping macros for const functions; it did take me a while to work that out though :)

@tcharding (Member)

Do we want this in the upcoming hashes release?

@afilini afilini force-pushed the feature-small-hash branch from e46a718 to 0fcb907 Compare August 16, 2023 12:18
When enabled this feature swaps the hash implementation of sha512,
sha256 and ripemd160 for a smaller (but also slower) one.

On embedded processors (Cortex-M4) it can lead to up to a 52% size
reduction, from around 37KiB for just the `process_block` methods of the
three hash functions to 17.8KiB.
@afilini afilini force-pushed the feature-small-hash branch from 0fcb907 to f2c5f19 Compare August 16, 2023 12:19
@afilini (Contributor, Author) commented Aug 16, 2023

I pushed a new version with all the feedback received so far: I didn't notice any performance regression from switching to const fns in the "fast hash" implementation. Even when compiling with z (size) optimization, the numbers were still very close to what they were originally.

I also manually formatted the functions with many arguments, I tried to break them up into a few lines and I think they look pretty good now.

@tcharding just a sidenote: I don't know much about compilers and optimizations either; the idea behind this PR was just to reuse common code as much as possible rather than expanding the full hash function, which ends up being huge. I think the implementation you have here is a pretty good tradeoff between speed and size: it could be made much faster by writing critical sections in assembly manually, or even by using SIMD instructions on supported platforms. It could also be made much smaller by once again playing with assembly manually, but that's pretty complicated, and this library isn't trying to be either a fast or a super-small hash function library, so I think this is more than enough. This is all to say that somebody who really knows this stuff could definitely make big improvements, but they would all have a cost in terms of code readability/simplicity, so I'm not even sure it's worth it for this lib.
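The approach described above, reusing common code via a shared round helper rather than fully expanding the hash function, can be sketched with a toy example (made-up constants and a trivial round; none of this is the crate's real algorithm):

```rust
// Made-up round constants, for illustration only.
const K: [u32; 4] = [0x5a82_7999, 0x6ed9_eba1, 0x8f1b_bcdc, 0xca62_c1d6];

// Shared round helper: this code is emitted once and reused.
fn round(s: u32, w: u32, k: u32) -> u32 {
    s.wrapping_add(w).wrapping_add(k).rotate_left(5)
}

// "Small" shape: one loop, one copy of the round code.
fn compress_loop(mut s: u32, w: &[u32; 4]) -> u32 {
    for i in 0..4 {
        s = round(s, w[i], K[i]);
    }
    s
}

// "Fast" shape: every round expanded at the call site (what macro
// expansion effectively produces); larger code, no loop overhead.
fn compress_unrolled(s: u32, w: &[u32; 4]) -> u32 {
    let s = round(s, w[0], K[0]);
    let s = round(s, w[1], K[1]);
    let s = round(s, w[2], K[2]);
    round(s, w[3], K[3])
}
```

Both shapes produce identical output; the trade is code size against loop and call overhead.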

@apoelstra (Member)

> Do we want this in the upcoming hashes release?

Yeah, let's do it.

@apoelstra (Member)

Nice! I suspect round could also be a fn, but since it can't be a const fn I am less confident that the compiler will inline it. Maybe we could add cfg_attr(feature = "small_hashes", inline(never)) on it (and a corresponding inline(always) in the other case)?

But in the interest of not making you do a ton of iterations on this, I'm happy to ACK this as is. But if you want to try this, that'd be great :).

@apoelstra (Member) left a comment:

ACK f2c5f19

@tcharding (Member)

Thanks for the explanation @afilini!

@tcharding (Member) left a comment:

ACK f2c5f19

@apoelstra apoelstra merged commit 1b3a9d3 into rust-bitcoin:master Aug 17, 2023
@afilini afilini deleted the feature-small-hash branch August 17, 2023 19:29