Other fast(er) hash algorithms · Issue #1 · kelindar/xxrand · GitHub

Other fast(er) hash algorithms #1

Open
dumblob opened this issue Sep 11, 2021 · 6 comments

Comments

@dumblob
dumblob commented Sep 11, 2021

Just out of curiosity, let me mention that there are faster alternatives to xxh3, or algos with about the same speed that are jaw-droppingly simpler. See https://github.com/rurban/smhasher/ .

Namely, the fastest (for both bulk & short inputs) seems to be wyhash, and the simplest of the fast ones is mx3 (btw, I feel mx3 could still have some room for speed optimization, as I suspect some false dependencies between instructions in the pipeline... maybe even good old manual loop unrolling could make a difference, IDK).

But maybe xxh3 is better in some ways that I don't see but would like to understand. I can imagine something related to DBs, but I don't know what exactly 😉.

@dumblob changed the title from "Other faster hash algorithms" to "Other fast(er) hash algorithms" on Sep 11, 2021
@kelindar
Owner

This is a fair point; I might try one of these to see if the performance improves. In this specific use-case we are simply hashing a 4-byte integer, hence the input size is constant and we only need the mixer part. So we're technically not using the full xxh3, and the code might indeed be simpler. In fact, there are no loops to unroll in the first place.

Here's what the library is using now:

x := v ^ (0x1cad21f72c81017c ^ 0xdb979083e96dd4de) + seed // fold the seeded secret into the 64-bit input
x ^= bits.RotateLeft64(x, 49) ^ bits.RotateLeft64(x, 24)  // xxh3 rrmxmx avalanche: rotate-xor step
x *= 0x9fb21c651e98df25
x ^= (x >> 35) + 4 // the +4 is the input length in bytes
x *= 0x9fb21c651e98df25
x ^= (x >> 28)

Most of these hashes are pretty much the same; I just looked at mx3 and we can easily benchmark it. It looks quite similar for this use-case.

x ^= x >> 32;
x *= 0xbea225f9eb34556d;
x ^= x >> 29;
x *= 0xbea225f9eb34556d;
x ^= x >> 32;
x *= 0xbea225f9eb34556d;
x ^= x >> 29;
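
For reference, a direct Go translation of that mx3-style finalizer might look like the sketch below (same constant and shift pattern as the C snippet above; this is not the library's actual code):

func mx3Mix(x uint64) uint64 {
	const c = 0xbea225f9eb34556d // multiplication constant from the snippet above
	x ^= x >> 32
	x *= c
	x ^= x >> 29
	x *= c
	x ^= x >> 32
	x *= c
	x ^= x >> 29
	return x
}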

My biggest problem is actually the atomic counter. That instruction doesn't scale on many-core systems, so I used RDTSCP to get the counter instead, along with some fences.

@kelindar
Owner

In fact, I just tried swapping it with the mx3 mixer. Tests pass, but I need to add a proper chi-squared test.

Here are the results with mx3 hashing:

BenchmarkParallel/rand-001-8            390485431                3.095 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-008-8            383161458                3.243 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-032-8            377720888                3.240 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-128-8            383817609                3.104 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-512-8            389031010                3.138 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-2048-8           377937162                3.139 ns/op           0 B/op          0 allocs/op

And these are with xxh3 hashing:

BenchmarkParallel/rand-001-8            389891288                3.075 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-008-8            339281023                3.272 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-032-8            357304174                3.861 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-128-8            389677950                3.067 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-512-8            378438109                3.175 ns/op           0 B/op          0 allocs/op
BenchmarkParallel/rand-2048-8           364905903                3.083 ns/op           0 B/op          0 allocs/op

@dumblob
Author
dumblob commented Sep 11, 2021

Yep, the speed you measured is more or less what I expected (though xxh3 seems only partially implemented, which is kind of an "unfair" advantage speed-wise 😉 - but that's what you said, so nothing bad about it).

But I wonder why you consider an atomic counter non-scalable? From my humble understanding, it should be enough to have just a "pre-seed" as an atomic counter (to get different random values in each thread), and this shouldn't affect the PRNG at all, because you can read the process-global pre-seed into thread-local storage as a "true seed" for the PRNG, and from then on everything works truly in parallel.

But I probably misunderstood what you mean, in which case there are splittable PRNGs such as the one used in JAX.
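
To make the pre-seed idea concrete, here is a rough sketch (not xxrand's actual API): the shared atomic is hit only once per generator, at construction, and every call afterwards mixes purely goroutine-local state. The splitmix64-style finalizer is used only for illustration; any decent mixer (xxh3's, mx3's, wyhash's) would do.

package sketch

import "sync/atomic"

// process-global pre-seed; the shared atomic is hit only once per generator
var preSeed uint64

type rng struct{ state uint64 }

// newRNG performs the single shared atomic operation (the "pre-seed" read)
func newRNG() *rng {
	return &rng{state: atomic.AddUint64(&preSeed, 0x9e3779b97f4a7c15)}
}

// next advances goroutine-local state and runs it through a splitmix64-style
// finalizer; no shared memory is touched on this path
func (r *rng) next() uint64 {
	r.state += 0x9e3779b97f4a7c15
	x := r.state
	x ^= x >> 30
	x *= 0xbf58476d1ce4e5b9
	x ^= x >> 27
	x *= 0x94d049bb133111eb
	x ^= x >> 31
	return x
}

The only contended operation is the single AddUint64 per generator; the per-call path has no shared writes at all.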

@kelindar
Owner

Hey, I'll have a look into JAX, never heard about it.

The atomic increment is still effectively a lock, so it would not scale beyond a certain number of cores. Most of the time you don't care, as it's not called that often, but when I was working on my column benchmark, I ran into an issue where calling the atomic random too often caused a degradation inside the benchmark.

As for thread-local storage, that was actually my initial approach this morning. The issue is that there's no native way to use it in Go, which is a bit annoying. So I started with an array of counters of length n = runtime.NumCPU(); then, each time a random number was needed, I used a CPUID instruction to get the current core for the goroutine and fetched the counter at that index. It worked, but the overhead of that CPUID instruction was pretty huge, around 30 ns in my benchmarks. That was expected, since it pulls a lot more than just the ID.
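
Roughly, that first approach looked like the sketch below. currentCPU is a hypothetical placeholder here (stubbed to return 0); the real lookup would be an assembly routine built around CPUID, and that lookup is exactly the ~30 ns cost mentioned above.

package sketch

import "runtime"

// one counter per logical CPU, so goroutines on different cores never contend
var counters = make([]uint64, runtime.NumCPU())

// currentCPU is a hypothetical placeholder: the real version would be an
// assembly routine (CPUID) that returns the core the calling goroutine is on
func currentCPU() int { return 0 }

func nextCounter() uint64 {
	i := currentCPU()
	counters[i]++ // no atomics, but only safe while the goroutine stays on that core
	return counters[i]
}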

To get around that, I decided to try out RDTSC, which reads the CPU's time-stamp counter and is pretty fast. As mentioned before, I still want to expose other APIs in the lib where you explicitly provide the counter and seed.

@wangyi-fudan

You can try the wyrand PRNG in the wyhash package.

@dumblob
Author
dumblob commented Sep 12, 2021

> Hey, I'll have a look into JAX, never heard about it.

> The atomic increment is still effectively a lock, so it would not scale beyond a certain number of cores. Most of the time you don't care, as it's not called that often, but when I was working on my column benchmark, I ran into an issue where calling the atomic random too often caused a degradation inside the benchmark.

Sure, anything "atomic" inside of a non-splittable PRNG will cause serious degradation. I'd try really hard to avoid anything shared.

> As for thread-local storage, that was actually my initial approach this morning. The issue is that there's no native way to use it in Go, which is a bit annoying.

This is really bad - I haven't looked at Go's support for thread-local storage, but if it's not there I'd definitely invest the energy (even if it's a lot) into implementing my own tiny, constrained support for it (at least for Go's default compiler - not for cgo or any other, at least at the beginning) using unsafe... Without TLS it's nowadays almost impossible to make scalable algos.
