8000 Atomic operations JMH benchmarks by akarnokd · Pull Request #1911 · ReactiveX/RxJava · GitHub

Atomic operations JMH benchmarks #1911


Merged
merged 2 commits from the AtomicPerf branch into ReactiveX:1.x on Dec 1, 2014

Conversation

@akarnokd (Member) commented Dec 1, 2014

Created some JMH benchmarks for the typical volatile and atomic operations.
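
For reference, a minimal sketch of roughly what such a JMH benchmark looks like; the class name, the parameter handling and the method bodies here are illustrative assumptions, not the actual AtomicPerf source:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)
public class AtomicPerfSketch {
    @Param({ "1", "1000", "1000000" })
    int times;

    final AtomicInteger value = new AtomicInteger();

    @Benchmark
    public void atomicIntIncrementAndGet(Blackhole bh) {
        for (int i = 0; i < times; i++) {
            bh.consume(value.incrementAndGet());
        }
    }

    @Benchmark
    public void atomicIntCASFailure(Blackhole bh) {
        for (int i = 0; i < times; i++) {
            // value never holds -1, so this CAS always fails
            bh.consume(value.compareAndSet(-1, 0));
        }
    }
}
```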

@akarnokd (Member, Author) commented Dec 1, 2014

Some results, run on an i7 920 @ 2.66 GHz, JDK 1.8u25, Windows 7 x64:

Benchmark                                       (times)   Mode   Samples        Score  Score error    Units
r.i.AtomicPerf.atomicIntCASCheckFailure               1   avgt         5        2,668        0,057    ns/op
r.i.AtomicPerf.atomicIntCASCheckFailure            1000   avgt         5     1534,512       74,719    ns/op
r.i.AtomicPerf.atomicIntCASCheckFailure         1000000   avgt         5  2284937,183    28254,968    ns/op
r.i.AtomicPerf.atomicIntCASCheckSuccess               1   avgt         5       16,064        0,576    ns/op
r.i.AtomicPerf.atomicIntCASCheckSuccess            1000   avgt         5    11408,954      213,337    ns/op
r.i.AtomicPerf.atomicIntCASCheckSuccess         1000000   avgt         5 11417083,239   142717,458    ns/op
r.i.AtomicPerf.atomicIntCASFailure                    1   avgt         5        7,985        0,131    ns/op
r.i.AtomicPerf.atomicIntCASFailure                 1000   avgt         5     6899,117      377,535    ns/op
r.i.AtomicPerf.atomicIntCASFailure              1000000   avgt         5  6861438,949   224204,293    ns/op
r.i.AtomicPerf.atomicIntCASSuccess                    1   avgt         5       17,949        0,298    ns/op
r.i.AtomicPerf.atomicIntCASSuccess                 1000   avgt         5     7208,543      140,543    ns/op
r.i.AtomicPerf.atomicIntCASSuccess              1000000   avgt         5  7251170,709   306098,222    ns/op
r.i.AtomicPerf.atomicIntFieldCASFailure               1   avgt         5        9,892        0,096    ns/op
r.i.AtomicPerf.atomicIntFieldCASFailure            1000   avgt         5    10664,317      136,754    ns/op
r.i.AtomicPerf.atomicIntFieldCASFailure         1000000   avgt         5  8024782,072   305465,473    ns/op
r.i.AtomicPerf.atomicIntFieldCASSuccess               1   avgt         5       19,520        0,528    ns/op
r.i.AtomicPerf.atomicIntFieldCASSuccess            1000   avgt         5    11772,415      265,342    ns/op
r.i.AtomicPerf.atomicIntFieldCASSuccess         1000000   avgt         5  8361572,721   162622,626    ns/op
r.i.AtomicPerf.atomicIntFieldGetAndIncrement          1   avgt         5       15,282        0,405    ns/op
r.i.AtomicPerf.atomicIntFieldGetAndIncrement       1000   avgt         5    15215,159      139,704    ns/op
r.i.AtomicPerf.atomicIntFieldGetAndIncrement    1000000   avgt         5 12560260,811   334571,720    ns/op
r.i.AtomicPerf.atomicIntFieldIncrementAndGet          1   avgt         5       15,238        0,191    ns/op
r.i.AtomicPerf.atomicIntFieldIncrementAndGet       1000   avgt         5    12954,997      161,420    ns/op
r.i.AtomicPerf.atomicIntFieldIncrementAndGet    1000000   avgt         5 12557229,845   138688,873    ns/op
r.i.AtomicPerf.atomicIntFieldLazySet                  1   avgt         5        4,179        0,065    ns/op
r.i.AtomicPerf.atomicIntFieldLazySet               1000   avgt         5     3049,292       47,938    ns/op
r.i.AtomicPerf.atomicIntFieldLazySet            1000000   avgt         5  3481701,210   290943,430    ns/op
r.i.AtomicPerf.atomicIntGetAndIncrement               1   avgt         5       11,794        0,212    ns/op
r.i.AtomicPerf.atomicIntGetAndIncrement            1000   avgt         5    11799,472      422,497    ns/op
r.i.AtomicPerf.atomicIntGetAndIncrement         1000000   avgt         5 11851958,773   433823,735    ns/op
r.i.AtomicPerf.atomicIntIncrementAndGet               1   avgt         5       11,449        0,238    ns/op
r.i.AtomicPerf.atomicIntIncrementAndGet            1000   avgt         5    11404,258      206,021    ns/op
r.i.AtomicPerf.atomicIntIncrementAndGet         1000000   avgt         5 11411961,183   194850,135    ns/op
r.i.AtomicPerf.atomicIntLazySet                       1   avgt         5        2,286        0,025    ns/op
r.i.AtomicPerf.atomicIntLazySet                    1000   avgt         5     1150,760       10,376    ns/op
r.i.AtomicPerf.atomicIntLazySet                 1000000   avgt         5  1142247,802    12846,952    ns/op
r.i.AtomicPerf.atomicLongCASCheckFailure              1   avgt         5        2,663        0,027    ns/op
r.i.AtomicPerf.atomicLongCASCheckFailure           1000   avgt         5     1541,088       88,716    ns/op
r.i.AtomicPerf.atomicLongCASCheckFailure        1000000   avgt         5  2285668,598   108122,588    ns/op
r.i.AtomicPerf.atomicLongCASCheckSuccess              1   avgt         5       16,014        0,616    ns/op
r.i.AtomicPerf.atomicLongCASCheckSuccess           1000   avgt         5    11391,372      175,883    ns/op
r.i.AtomicPerf.atomicLongCASCheckSuccess        1000000   avgt         5 11445626,686   192965,285    ns/op
r.i.AtomicPerf.atomicLongCASFailure                   1   avgt         5        7,608        0,071    ns/op
r.i.AtomicPerf.atomicLongCASFailure                1000   avgt         5     6841,162      143,312    ns/op
r.i.AtomicPerf.atomicLongCASFailure             1000000   avgt         5  6856431,735   113946,099    ns/op
r.i.AtomicPerf.atomicLongCASSuccess                   1   avgt         5       17,938        0,332    ns/op
r.i.AtomicPerf.atomicLongCASSuccess                1000   avgt         5     7253,693       99,541    ns/op
r.i.AtomicPerf.atomicLongCASSuccess             1000000   avgt         5  7235321,845    78547,957    ns/op
r.i.AtomicPerf.atomicLongFieldCASFailure              1   avgt         5        9,564        0,463    ns/op
r.i.AtomicPerf.atomicLongFieldCASFailure           1000   avgt         5     9830,242      187,956    ns/op
r.i.AtomicPerf.atomicLongFieldCASFailure        1000000   avgt         5  8022695,593   330115,681    ns/op
r.i.AtomicPerf.atomicLongFieldCASSuccess              1   avgt         5       19,403        0,292    ns/op
r.i.AtomicPerf.atomicLongFieldCASSuccess           1000   avgt         5     9179,600      460,020    ns/op
r.i.AtomicPerf.atomicLongFieldCASSuccess        1000000   avgt         5  7989544,345    68693,036    ns/op
r.i.AtomicPerf.atomicLongFieldGetAndIncrement         1   avgt         5       15,203        0,142    ns/op
r.i.AtomicPerf.atomicLongFieldGetAndIncrement      1000   avgt         5    12964,216      153,536    ns/op
r.i.AtomicPerf.atomicLongFieldGetAndIncrement   1000000   avgt         5 12579871,025   407386,205    ns/op
r.i.AtomicPerf.atomicLongFieldIncrementAndGet         1   avgt         5       15,261        0,192    ns/op
r.i.AtomicPerf.atomicLongFieldIncrementAndGet      1000   avgt         5    13886,679      129,986    ns/op
r.i.AtomicPerf.atomicLongFieldIncrementAndGet   1000000   avgt         5 12570775,806   138961,157    ns/op
r.i.AtomicPerf.atomicLongFieldLazySet                 1   avgt         5        4,183        0,076    ns/op
r.i.AtomicPerf.atomicLongFieldLazySet              1000   avgt         5     3077,770      131,585    ns/op
r.i.AtomicPerf.atomicLongFieldLazySet           1000000   avgt         5  3446711,809   107526,958    ns/op
r.i.AtomicPerf.atomicLongGetAndIncrement              1   avgt         5       11,841        0,329    ns/op
r.i.AtomicPerf.atomicLongGetAndIncrement           1000   avgt         5    11788,437      110,213    ns/op
r.i.AtomicPerf.atomicLongGetAndIncrement        1000000   avgt         5 11813940,290   129279,063    ns/op
r.i.AtomicPerf.atomicLongIncrementAndGet              1   avgt         5       11,421        0,139    ns/op
r.i.AtomicPerf.atomicLongIncrementAndGet           1000   avgt         5    11493,952      396,706    ns/op
r.i.AtomicPerf.atomicLongIncrementAndGet        1000000   avgt         5 11497876,406   797429,030    ns/op
r.i.AtomicPerf.atomicLongLazySet                      1   avgt         5        2,293        0,043    ns/op
r.i.AtomicPerf.atomicLongLazySet                   1000   avgt         5     1157,337       55,157    ns/op
r.i.AtomicPerf.atomicLongLazySet                1000000   avgt         5  1147311,576    55736,976    ns/op
r.i.AtomicPerf.volatileIntRead                        1   avgt         5        3,060        0,155    ns/op
r.i.AtomicPerf.volatileIntRead                     1000   avgt         5     1920,303       48,465    ns/op
r.i.AtomicPerf.volatileIntRead                  1000000   avgt         5  1910374,684    41195,249    ns/op
r.i.AtomicPerf.volatileIntWrite                       1   avgt         5        9,178        0,287    ns/op
r.i.AtomicPerf.volatileIntWrite                    1000   avgt         5     8029,069      271,625    ns/op
r.i.AtomicPerf.volatileIntWrite                 1000000   avgt         5  7986386,165   146129,049    ns/op
r.i.AtomicPerf.volatileLongRead                       1   avgt         5        3,055        0,070    ns/op
r.i.AtomicPerf.volatileLongRead                    1000   avgt         5     1912,078       40,309    ns/op
r.i.AtomicPerf.volatileLongRead                 1000000   avgt         5  1906103,736    21611,386    ns/op
r.i.AtomicPerf.volatileLongWrite                      1   avgt         5        9,175        0,217    ns/op
r.i.AtomicPerf.volatileLongWrite                   1000   avgt         5     8029,490       78,576    ns/op
r.i.AtomicPerf.volatileLongWrite                1000000   avgt         5  7984140,211   151524,011    ns/op
r.i.AtomicPerf.atomicIntGetAndSet                     1   avgt         5       11,036        0,129    ns/op
r.i.AtomicPerf.atomicIntGetAndSet                  1000   avgt         5    11030,586      191,069    ns/op
r.i.AtomicPerf.atomicIntGetAndSet               1000000   avgt         5 11052108,014   294789,921    ns/op
r.i.AtomicPerf.atomicLongGetAndSet                    1   avgt         5       11,044        0,164    ns/op
r.i.AtomicPerf.atomicLongGetAndSet                 1000   avgt         5    11033,459      135,330    ns/op
r.i.AtomicPerf.atomicLongGetAndSet              1000000   avgt         5 11093528,851   543185,442    ns/op

A few observations:

  • incrementAndGet and getAndIncrement cost practically the same,
  • lazySet is generally about twice as fast as a volatile write,
  • a plain CAS takes roughly the same time whether it succeeds or fails,
  • checking the value first to avoid a CAS that would certainly fail is a big win, but the extra read before a CAS that would certainly succeed adds cost (see the sketch after this list),
  • there isn't much difference between ints and longs on x64,
  • AtomicFieldUpdaters seem to have around 10% overhead,
  • getAndSet costs more than compareAndSet (so using getAndSet to swap in a terminal state lowers performance).
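
To make the fourth and last points concrete, here is a minimal sketch of the check-before-CAS and terminal-state patterns; the class, field and method names are hypothetical, not RxJava code:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

class CasPatterns {
    static final Object TERMINATED = new Object();

    final AtomicInteger wip = new AtomicInteger();
    final AtomicReference<Object> state = new AtomicReference<>();

    // Cheap when the CAS would fail anyway: the plain volatile read
    // short-circuits and the expensive CAS instruction is skipped.
    boolean tryEnterChecked() {
        return wip.get() == 0 && wip.compareAndSet(0, 1);
    }

    // When the CAS usually succeeds, the extra read is pure overhead.
    boolean tryEnter() {
        return wip.compareAndSet(0, 1);
    }

    // getAndSet works but, per the benchmark, costs more than a CAS...
    Object terminateWithGetAndSet() {
        return state.getAndSet(TERMINATED);
    }

    // ...so a single CAS (or a CAS loop when several terminators may race)
    // can be cheaper for swapping in a terminal state.
    boolean terminateWithCas(Object current) {
        return state.compareAndSet(current, TERMINATED);
    }
}
```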

benjchristensen added a commit that referenced this pull request Dec 1, 2014
Atomic operations JMH benchmarks
@benjchristensen merged commit 43456c5 into ReactiveX:1.x on Dec 1, 2014
@benjchristensen (Member)

This is cool information. Thanks.

@akarnokd deleted the AtomicPerf branch on December 1, 2014, 18:03
@artem-zinnatullin (Contributor)

Sorry for coming into a year-old thread (I'm reviewing the RxJava sources…), but my runs of ./gradlew benchmark for OperatorObserveOn show that switching to plain Atomic fields instead of volatile + AtomicFieldUpdater gives a pretty significant performance boost (I'll do more tests tomorrow and will probably open a PR).

I'm not a JMH master, but as far as I understand, the tests in this PR are not very good for one big reason: loops. AFAIK loops in benchmark tests are bad because the JIT is too smart about them, see http://java-performance.info/jmh/.

If I remove the loops from these tests, I see that AtomicFieldUpdaters have 25-30% overhead, not 10% (especially in getAndIncrement and incrementAndGet).
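
For context, the usual loop-free JMH shape lets the harness drive the repetition and measures per-invocation cost; a minimal sketch (illustrative, not the benchmarks from this PR):

```java
import java.util.concurrent.atomic.AtomicLong;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

@State(Scope.Thread)
public class NoLoopSketch {
    final AtomicLong counter = new AtomicLong();

    // No hand-written loop: JMH calls this method repeatedly, so there is
    // nothing for the JIT to unroll, hoist or otherwise optimize away.
    // Returning the value lets JMH blackhole it.
    @Benchmark
    public long incrementAndGet() {
        return counter.incrementAndGet();
    }
}
```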

@akarnokd @benjchristensen thoughts? It's probably a reason to reconsider the switch to volatile + AtomicFieldUpdater instead of Atomic*; of course, in some (rare) cases it's good because of the slightly lower memory usage, but the reflection costs too much, especially on Android :(

I guess, @benjchristensen, in your use cases (Netflix) you would rather have faster code than save a few bytes of memory. For Android, it would be nice to avoid this reflection too.

@akarnokd (Member, Author)

benchmark tests are bad because the JIT is too smart about them

Except when there are atomic operations inside the loop.

I see that AtomicFieldUpdaters have 25-30% overhead

The JIT can sometimes properly eliminate the class-validation in the updater, in which case it should run as fast as Unsafe. Sometimes it can't, most likely when there are other implementations of the host class around.

some (rare) cases it's good because of slightly lower memory usage, but reflection costs too much, especially on Android

It's a tradeoff between portability, memory usage, and the cost of validation vs. an extra dereference. Do you know a benchmark tool for Android where these cases could be run?
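
For concreteness, the two layouts being traded off above look roughly like this; the class and field names are illustrative, not actual RxJava classes:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Style A: the counter lives inline in the owner object. No extra allocation
// and no extra dereference, but every access goes through the reflectively
// created updater, whose class-validation the JIT may or may not eliminate.
class UpdaterStyle {
    volatile long requested;
    static final AtomicLongFieldUpdater<UpdaterStyle> REQUESTED =
            AtomicLongFieldUpdater.newUpdater(UpdaterStyle.class, "requested");

    long produced(long n) {
        return REQUESTED.addAndGet(this, -n);
    }
}

// Style B: a separate AtomicLong. One more object (header included) and one
// more dereference per access, but no reflection involved at all.
class AtomicStyle {
    final AtomicLong requested = new AtomicLong();

    long produced(long n) {
        return requested.addAndGet(-n);
    }
}
```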

@artem-zinnatullin (Contributor)

It's a tradeoff between portability, memory usage, and the cost of validation vs. an extra dereference.

Since RxJava needs a lot of AtomicIntegers and AtomicLongs, and since RxJava should be fast and memory-friendly, what do you think about Object Pools for them?

It would:

  • Remove some memory allocations
  • Remove the need for reflection (AtomicFieldUpdater)
  • Make operators faster

Do you know a benchmark tool for Android where these cases could be run?

No :(

But we can create one similar to JMH :)

@akarnokd (Member, Author)

RxJava should be fast and memory-friendly

Concurrent operators can become faster by padding heavily mutated counters away from each other, but at the cost of higher memory usage and increased class complexity. These two become real drawbacks with short-lived tasks that are run from a tight loop (e.g., benchmarking just().observeOn().subscribe(...) with JMH).

In addition, holding references to Atomic classes may still end up with false sharing, as the GC may pack them next to each other or next to their owner class. Sometimes, even if they are further apart, there can be a false-sharing issue with the object header itself. Java 8 Atomic variables might get padded, but that's unreliable. What's left is manually assembling classes with padding fields and either using Unsafe (unavailable on Android) or field updaters.

The same is true for our copies of the JCTools queues: class padding may help, element padding may help up to a certain limit.
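
A minimal sketch of such a manually padded class hierarchy; the field names and the amount of padding here are illustrative assumptions:

```java
import java.util.concurrent.atomic.AtomicLongFieldUpdater;

// Padding on both sides of the hot counter so that other frequently written
// fields (or a neighboring object) do not end up on its cache line.
class LhsPadding {
    long p00, p01, p02, p03, p04, p05, p06, p07;
    long p08, p09, p10, p11, p12, p13, p14, p15;
}

class HotCounter extends LhsPadding {
    volatile long counter;
    static final AtomicLongFieldUpdater<HotCounter> COUNTER =
            AtomicLongFieldUpdater.newUpdater(HotCounter.class, "counter");
}

class RhsPadding extends HotCounter {
    long q00, q01, q02, q03, q04, q05, q06, q07;
    long q08, q09, q10, q11, q12, q13, q14, q15;
}

final class PaddedCounter extends RhsPadding {
    long getAndIncrement() {
        return COUNTER.getAndIncrement(this);
    }
}
```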

what do you think about Object Pools for them

I generally try to avoid them because they can become the new contention point.

Remove some memory allocations

Allocation is pretty fast in desktop Java, and if you don't run short-lived tasks that frequently, it isn't a problem. I have seen a tendency toward optimizing for short-lived tasks (i.e., just() and Single), and I believe there is a limit to how few operations can achieve the functionality without endangering concurrency-safety.

Remove the need for reflection (AtomicFieldUpdater)

I think Android is at a disadvantage here; I haven't had any problems with desktop Java reflection since Java 7. The 2-second-long reflective lookup of all methods (the NewThreadWorker issue) baffled me. I think the Android platform is so far behind that it requires an independent RxJavaAndroid port that specifically knows about such problems.

Make operators faster

Most benchmarks really test operator overhead by doing nothing but blackholing the source values. Many of them, by using boxed integers, test GC + operator overhead together (although I think there should be perf tests that specifically avoid GC interference).

@artem-zinnatullin (Contributor)

I think Android is at a disadvantage here;

Yeah :(

Though I don't like manual padding because it's a platform-specific thing…

Okay, your arguments were convincing enough for me, let's keep AtomicFieldUpdater :)

@akarnokd (Member, Author)

@artem-zinnatullin I didn't mean to discourage you from doing performance improvements, just saying that there are tradeoffs per platform and per implementation aim.

With observeOn we can check whether the platform is Android and run a different inner subscriber for each case. For example, desktop could run a padded version and Android could run with AtomicXXX instances.
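
A rough sketch of what such a platform check could look like; the detection mechanism and the subscriber class names are assumptions, not existing RxJava API:

```java
// Hypothetical: pick the inner subscriber implementation based on the runtime.
final class PlatformSupport {
    static final boolean ANDROID = isAndroid();

    private static boolean isAndroid() {
        // Dalvik and ART both report "Dalvik" here, as far as I know.
        String vm = System.getProperty("java.vm.name", "");
        return vm.toLowerCase().contains("dalvik");
    }
}

// ...and in the operator (hypothetical class names):
// Subscriber<? super T> inner = PlatformSupport.ANDROID
//         ? new CompactObserveOnSubscriber<T>(child)  // AtomicXXX-based
//         : new PaddedObserveOnSubscriber<T>(child);  // manually padded fields
```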

There is, however, another effect usually found with async benchmarks and the computation scheduler: each inner iteration may end up on different cores, which may trigger more thread migrations and soak up interference from other programs. I've resolved this in 2.x with the introduction of Schedulers.single(), which routes to the same underlying thread all the time.

Finally, there is sometimes an effect I call emission-grabbing, where the requesting thread grabs the emission logic and the whole source-to-sink pipeline ends up on the same thread, boosting throughput considerably. For observeOn that means the padding optimization may not matter much, because there is going to be little to no concurrency happening there.

For example, in the range test, I've added a pipelined version where the emission is forced to a different thread than the observation:

[chart: range benchmark throughput, pipelined vs. non-pipelined]

As you can see, the pipelined version has 6x lower throughput than the non-pipelined one. Also note that the 2.x observeOn is padded by 1 cache line (the usual padding is 2 lines to compensate for adjacent-line prefetch, but I was a bit sloppy).

That being said, I'm always open to optimizations. Would you like to pursue this further?

@artem-zinnatullin (Contributor)

I didn't mean to discourage you from doing performance improvements, just saying that there are tradeoffs per platform and per implementation aim.

Yeah, I understand.

With observeOn we can check whether the platform is Android and run a different inner subscriber for each case. For example, desktop could run a padded version and Android could run with AtomicXXX instances.

This requires a JMH-like tool for Android first. I'll work on it in my spare time; feel free to review/comment/follow https://github.com/artem-zinnatullin/droid-jmh

As you can see, the pipelined version has 6x lower throughput than the non-pipelined one.

Wow.

That being said, I'm always open to optimizations. Would you like to pursue this further?

Yep, I'll continue my "investigations" :)

@nitsanw (Contributor) commented Mar 2, 2016

The CAS vs. getAndAdd conclusion is somewhat misleading. From JDK 8 on, getAndAdd is intrinsified into XADD, which scales better than CAS under contention. To see the effect, measure with different thread counts.
The volatile read cost is misleading in the sense that the side effects of volatile are not measured/discussed. For a single-value read there should be no difference, but volatile reads can stop many other memory reads from being optimized.
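
A sketch of how one might measure that scaling difference with JMH (illustrative, not part of this PR's benchmarks):

```java
import java.util.concurrent.atomic.AtomicLong;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

@State(Scope.Benchmark)          // one shared counter, contended by all threads
public class ContendedAddSketch {
    final AtomicLong counter = new AtomicLong();

    @Benchmark
    @Threads(4)                  // vary this to see how each approach scales
    public long getAndAdd() {
        // Intrinsified to XADD from JDK 8: every attempt makes progress.
        return counter.getAndAdd(1);
    }

    @Benchmark
    @Threads(4)
    public long casLoop() {
        // Classic CAS retry loop: under contention, failed CASes waste work.
        for (;;) {
            long current = counter.get();
            if (counter.compareAndSet(current, current + 1)) {
                return current;
            }
        }
    }
}
```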
