Atomic operations JMH benchmarks #1911
Some results, run on an i7 920 @ 2.66 GHz, JDK 1.8u25, Windows 7 x64. A few observations:
This is cool information. Thanks.
Sorry for coming into this year-old thread (I'm reviewing the RxJava sources…). I'm not a JMH master, but as far as I understand, the tests in this PR are not very good for one big reason: loops. AFAIK loops in benchmark tests are bad because the JIT is too smart about them, see http://java-performance.info/jmh/. If I remove the loops from these tests, I see that AtomicFieldUpdaters have 25-30% overhead, not 10%. @akarnokd @benjchristensen thoughts? It's probably a reason to reconsider the switch to field updaters. I guess, @benjchristensen, in your usage cases (Netflix) you would rather have faster code than save a few bytes of memory. For Android, it would be nice to avoid this reflection too.
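To illustrate the loop pitfall (a minimal sketch, not code from this PR; class and method names are mine): a hand-rolled loop lets the JIT optimize across iterations, while letting JMH drive each invocation measures the single operation:

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class LoopPitfallBenchmark {
    static final AtomicLongFieldUpdater<LoopPitfallBenchmark> UPDATER =
            AtomicLongFieldUpdater.newUpdater(LoopPitfallBenchmark.class, "value");
    volatile long value;

    @Benchmark
    public long badLoop() {
        long last = 0;
        // the JIT may unroll this loop and optimize across iterations,
        // under-reporting the per-operation cost
        for (int i = 0; i < 1000; i++) {
            last = UPDATER.incrementAndGet(this);
        }
        return last;
    }

    @Benchmark
    public long singleOp() {
        // let JMH drive the iterations; this measures one operation
        return UPDATER.incrementAndGet(this);
    }
}
```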
Except when there are atomic operations inside the loop.
The JIT can sometimes properly eliminate the class validation in the updater, and then it runs as fast as Unsafe. Sometimes it can't, most likely when there are other implementations of the host class around.
It's a tradeoff between portability, memory usage, and the cost of validation vs. dereference. Do you know a benchmark tool for Android where these cases could be run?
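For reference, a sketch of the two alternatives being weighed (hypothetical class names): a separate AtomicLong costs an extra object and a dereference per instance, while a field updater keeps the field inline in the host object at the cost of a class check on each access:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

class WithAtomic {
    // one extra object per instance, accessed through a dereference
    final AtomicLong requested = new AtomicLong();

    long produced(long n) {
        return requested.addAndGet(-n);
    }
}

class WithUpdater {
    static final AtomicLongFieldUpdater<WithUpdater> REQUESTED =
            AtomicLongFieldUpdater.newUpdater(WithUpdater.class, "requested");
    // stored inline in the host object; each access pays a class validation
    volatile long requested;

    long produced(long n) {
        return REQUESTED.addAndGet(this, -n);
    }
}
```

Benchmarking both variants at different thread counts would show the validation overhead discussed above against the extra allocation and indirection.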
Since RxJava needs a lot of atomic operations, it will: …
No :( But we can create one similar to JMH :)
Concurrent operators can become faster by padding away heavily mutated counters, but at the cost of higher memory usage and increased class complexity. These latter two become drawbacks with short-lived tasks that are run from a tight loop (i.e., benchmarking just().observeOn().subscribe(...) with JMH). In addition, holding a reference to Atomic classes may still end up with false sharing, as the GC may pack them next to each other or next to their owner class. Sometimes, even if they are further away, there can be a false-sharing issue with the object header itself. Java 8 Atomic variables might get padded, but that's unreliable. What's left is manually assembling classes with padding fields and either using Unsafe (unavailable on Android) or field updaters. The same is true for our copies of the JCTools queues: class padding may help, and element padding may help up to a certain limit.
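A minimal sketch of such manual padding (Disruptor-style class layering; the names are illustrative), using a field updater so it stays portable to Android:

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// padding fields before the hot counter
abstract class LhsPadding {
    long p01, p02, p03, p04, p05, p06, p07;
}

// the heavily mutated counter itself
abstract class Value extends LhsPadding {
    volatile long value;
}

// padding fields after it, keeping the counter away from
// neighboring objects on both sides
final class PaddedCounter extends Value {
    long p09, p10, p11, p12, p13, p14, p15;

    static final AtomicLongFieldUpdater<Value> VALUE =
            AtomicLongFieldUpdater.newUpdater(Value.class, "value");

    long increment() {
        return VALUE.incrementAndGet(this);
    }
}
```

The superclass layering exists because the JVM may reorder fields within a single class, but it keeps superclass fields grouped together, so the padding actually surrounds the hot field.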
I generally try to avoid them because they can become the new contention point.
Allocation is pretty performant in desktop Java, and if you don't run short-lived tasks that frequently, it isn't a problem. I've seen a tendency toward optimizing for short-lived tasks (i.e., just() and Single), and I believe there is a lower bound on how few operations can achieve the functionality without endangering concurrency safety.
I think Android is at a disadvantage here; I've never had any problems with desktop Java reflection since Java 7. The 2-second-long reflective lookup of all methods (the NewThreadWorker issue) baffled me. I think the Android platform is so far behind that it requires an independent RxJavaAndroid port that specifically knows about such problems.
Most benchmarks really test for operator overhead by doing nothing but blackholing the source values. Many of these, by using boxed integers, test for GC + operator overhead together (although I think there should be perf tests that specifically avoid GC interference).
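As an illustration of this pattern (a sketch, not one of the actual perf tests; it assumes the 1.x API and my own class name), the benchmark does nothing but blackhole emitted values, and because range() emits boxed Integers, allocation and GC cost is measured together with the operator:

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
import rx.Observable;

@State(Scope.Thread)
public class OperatorOverheadBenchmark {
    // range() emits boxed Integers, so GC pressure is part of the measurement
    Observable<Integer> source = Observable.range(1, 1000).map(v -> v + 1);

    @Benchmark
    public void rangeMapOverhead(Blackhole bh) {
        // blackhole every value so the JIT cannot eliminate the pipeline
        source.subscribe(bh::consume);
    }
}
```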
Yeah :( Though I don't like manual padding because it's a platform-specific thing… Okay, your arguments were convincing enough for me; let's leave the field updaters as they are.
@artem-zinnatullin I didn't mean to discourage you from doing performance improvements, just saying that there are tradeoffs per platform and per implementation aim.

There is, however, another effect usually found with async benchmarks and the computation scheduler: each inner iteration may end up on a different core, which may trigger more thread migrations and soak up interference from other programs. I've resolved this in 2.x with the introduction of …

Finally, there is sometimes an effect going on that I call emission-grabbing, where the thread which requests grabs the emission logic and the whole source-sink pipeline ends up on the same thread, boosting the throughput considerably.

For example, in the range test, I've added a pipelined version where the emission is forced to a different thread than the observation. As you can see, the pipelined version has 6x less throughput than the non-pipelined one. Also note that 2.x observeOn is padded by one cache line (the usual padding is 2x to compensate for adjacent-line prefetch, but I was a bit sloppy).

That being said, I'm always open to optimizations. Would you like to pursue this further?
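The actual pipelined perf test isn't shown here; as a rough sketch of the setup (assuming the 1.x API; the scheduler choice and names are mine), subscribeOn pins the emission to one computation worker while observeOn moves observation to another, so the requesting thread cannot grab the emission loop:

```java
import java.util.concurrent.CountDownLatch;
import rx.Observable;
import rx.schedulers.Schedulers;

public class PipelinedRangeExample {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        Observable.range(1, 1_000_000)
                .subscribeOn(Schedulers.computation()) // emission thread
                .observeOn(Schedulers.computation())   // observation thread
                .subscribe(v -> { },
                        Throwable::printStackTrace,
                        done::countDown);
        done.await();
    }
}
```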
Yeah, I understand.
This requires a JMH-like tool for Android first. I'll work on it in my spare time; feel free to review/comment/follow https://github.com/artem-zinnatullin/droid-jmh
Wow.
Yep, I'll continue my "investigations" :)
The CAS vs. getAndAdd conclusion is somewhat misleading. Since JDK 8, getAndAdd is intrinsified into XADD, which scales better than CAS under contention. To see the effect, measure with different thread counts.
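A minimal sketch contrasting the two (my own class name; run with different thread counts, e.g. -t 1, -t 2, -t 4, to see how each scales): getAndAdd compiles to a single LOCK XADD instruction on x86, while the CAS loop has to retry under contention:

```java
import java.util.concurrent.atomic.AtomicLong;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Benchmark) // shared across threads so contention is measured
@BenchmarkMode(Mode.Throughput)
public class CasVsXaddBenchmark {
    final AtomicLong counter = new AtomicLong();

    @Benchmark
    public long getAndAdd() {
        // intrinsified into a single XADD since JDK 8
        return counter.getAndAdd(1);
    }

    @Benchmark
    public long casLoop() {
        // classic CAS retry loop; retries pile up under contention
        for (;;) {
            long current = counter.get();
            if (counter.compareAndSet(current, current + 1)) {
                return current;
            }
        }
    }
}
```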
Created some JMH benchmarks for the typical volatile and atomic operations.
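As a rough illustration of what such benchmarks look like (a sketch under my own naming; the PR's actual benchmark classes may differ), one method per primitive operation, each measured individually by JMH:

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
@BenchmarkMode(Mode.AverageTime)
public class AtomicOpsBenchmark {
    static final AtomicLongFieldUpdater<AtomicOpsBenchmark> UPDATER =
            AtomicLongFieldUpdater.newUpdater(AtomicOpsBenchmark.class, "value");
    volatile long value;

    @Benchmark
    public long volatileRead() {
        return value;
    }

    @Benchmark
    public void volatileWrite() {
        value = 42L;
    }

    @Benchmark
    public long incrementViaUpdater() {
        return UPDATER.incrementAndGet(this);
    }

    @Benchmark
    public boolean casViaUpdater() {
        long v = value;
        return UPDATER.compareAndSet(this, v, v + 1);
    }
}
```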