observed performance regression · Issue #1549 · gperftools/gperftools · GitHub

observed performance regression #1549

Closed
lano1106 opened this issue Aug 28, 2024 · 38 comments
@lano1106
lano1106 commented Aug 28, 2024

I was evaluating switching to the new Google-maintained tcmalloc project. As part of the evaluation, I reran the benchmark that Sanjay originally ran at:
https://gperftools.github.io/gperftools/tcmalloc.html

I do not know why, but it seems like the 2024 gperftools tcmalloc is much less performant than what was there back then...

This is most flagrant when you compare the 8-thread, 64-byte allocation Mops.

google/tcmalloc#260 (comment)

It went from 8 Mops to 2 Mops...

I do not know what, but it seems like something seriously broke in between...

@lano1106
Author
lano1106 commented Aug 29, 2024

I have included jemalloc in the benchmark...

Something suspicious is that all allocators appear to suffer from lock contention except glibc... what could be the common denominator?

I have no clue how the thread_local keyword is implemented, but it could be the culprit...

mops_vs_threads_jemalloc

Update:
I have just checked glibc/malloc/malloc.c; glibc also uses TLS, so this is not it... maybe some false sharing when updating some global atomic stats variables...
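For illustration only (not glibc or tcmalloc code), here is a minimal C++ sketch of the kind of false sharing I have in mind and the usual fix of padding each counter to its own cache line; all names are made up:

#include <atomic>
#include <cstddef>

// Hypothetical per-thread stat counters.  Without the alignas(64), adjacent
// counters can land in the same cache line, and concurrent increments from
// different threads keep bouncing that line between cores (false sharing).
struct alignas(64) PaddedCounter {
  std::atomic<std::size_t> value{0};
};

constexpr int kMaxThreads = 64;          // made-up bound for the sketch
PaddedCounter g_alloc_count[kMaxThreads];

void NoteAllocation(int thread_index) {
  g_alloc_count[thread_index].value.fetch_add(1, std::memory_order_relaxed);
}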

@lano1106
Author

I went as far as recompiling the perftools-1.0 tag, and the same crappy performance shows up with that version, so the problem is external to the project code...

I fail to see what it could be...

It would be really nice if someone could test:

LD_PRELOAD=libtcmalloc_minimal.so time ./ptmalloc_test1 1 1 1000000 64
LD_PRELOAD=libtcmalloc_minimal.so time ./ptmalloc_test1 8 8 1000000 64

to validate that they see the same performance deterioration as soon as several threads are used.

@alk
Contributor
alk commented Aug 29, 2024

So this test appears to be running extremely quickly. If the run time is too short, the measurements cannot be trusted.

Also, can you please point exactly to the code you're running? (I just checked out the perftools-0.1 tag and built src/tests/ptmalloc/t-test1.c with -O2; no idea if you're doing the same/similar.)

@lano1106
Author
lano1106 commented Aug 29, 2024

Yes, I have compiled src/tests/ptmalloc/t-test1.c.

Not with -O2 but with -O3. Besides that, this is what I have done.

ptmalloc_test1 is the resulting binary from compiling t-test1.c.

The one that I have used is:

git log --all --full-history -- "**/t-test1.c"

commit 25eed16e1b042a80c9a3e83bbf7ed227d04fb45a
Author: csilvers <csilvers@6b5cf1ce-ec42-a296-1ba9-69fdba395a50>
Date:   Tue Oct 27 17:30:52 2009 +0000

            * Fix Symbolize() to call pprof once, rather than once/symbol (glider)
            * Fix unsetting of hooks before forking, in debug mode (maxim)
            * Add some documention for pmuprofile (aruns)
            * Speed up futex with FUTEX_PRIVATE_FLAG (m3b)
            * Fix os x 10.6: prefer sys/ucontext.h to ucontext.h (csilvers)
            * Fix C shims to be actually valid C: malloc_extension/etc (csilvers)
            * Fix a longtime memset bug (csilvers)
            * Implement nothrow versions of delete (csilvers)
            * Fix recursively-unmapped-region accounting (ppluzhnikov)
            * Better distinguish between real and fake VDSO (ppluzhnikov)
            * Modify span coalescing to improve performance (sanjay)
            * WINDOWS: Remove unnecessary lock around VirtualAlloc (mbelshe)
            * Remove performance tests for ptmalloc2 (csilvers)
    
    
    git-svn-id: http://gperftools.googlecode.com/svn/trunk@77 6b5cf1ce-ec42-a296-1ba9-69fdba395a50

followed by:

git checkout 25eed16e1b042a80c9a3e83bbf7ed227d04fb45a^

But I guess that checking out perftools-1.0 is just as good.

I'll follow your suggestion and add one zero to 1,000,000 to see if it results in something different.

Update:
Running 10,000,000 iterations did provide some more precision to the measurements, but nothing that makes a noticeable change on the graphs...

@lano1106
Author
lano1106 commented Aug 29, 2024

My investigation led me to a minor glitch in the project code... I'll prepare a pull request to share this finding soon...

#1551

@lano1106
Author
lano1106 commented Aug 30, 2024

My new hypothesis to explain my observation is that the central heap is somehow accessed way too much.

It is protected by a spinlock, so if that is the case, the mystery will be solved... it is a no-brainer that if 8 threads contend over a lock at the same time, it will show in performance.

I am not too sure if there is an easy, ready-to-use way to monitor central heap accesses... I might add my own temporary debug hooks to validate this...

One possible factor in this sad state might be the 64-byte class batch_size: without FLAGS_tcmalloc_transfer_num_objects it would be 1024, but the default value is 32... I am wondering whether, in the initially benchmarked tcmalloc version, batch_size was not bounded at all, considering that the TCMALLOC_TRANSFER_NUM_OBJ env var does not appear to be documented in the original HTML documents (is it documented anywhere today?). A rough sketch of the clamping idea follows.
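As a rough, self-contained sketch of what I mean by the clamping (this is not the actual gperftools code; the helper name and the default of 32 are only taken from the discussion above):

#include <algorithm>
#include <cstdlib>

// Hypothetical illustration of a TCMALLOC_TRANSFER_NUM_OBJ-style override:
// the "ideal" batch for a size class (e.g. 1024 objects for 64-byte objects)
// gets capped at 32 unless the env var raises the cap.
static int TransferBatchSize(int ideal_batch) {
  int cap = 32;  // assumed default cap, per the discussion above
  if (const char* env = std::getenv("TCMALLOC_TRANSFER_NUM_OBJ")) {
    cap = std::atoi(env);
  }
  return std::min(ideal_batch, cap);
}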

Update:
I did run the test with:
TCMALLOC_RELEASE_RATE=0 TCMALLOC_TRANSFER_NUM_OBJ=256

It did help a little bit, but absolutely not enough to say case closed...

The 64-byte, 8-thread Mops went from 2.5 to 2.557.

@lano1106
Author
lano1106 commented Aug 30, 2024

I have done this:

diff --git a/src/thread_cache.cc b/src/thread_cache.cc
index 3ac1019..c88b7de 100644
--- a/src/thread_cache.cc
+++ b/src/thread_cache.cc
@@ -80,7 +80,8 @@ ThreadCache::ThreadCache() {
   ASSERT(Static::pageheap_lock()->IsHeld());
 
   size_ = 0;
-
+  central_fetch_cnt_ = 0;
+  central_release_cnt_ = 0;
   max_size_ = 0;
   IncreaseCacheLimitLocked();
   if (max_size_ == 0) {
@@ -114,6 +115,13 @@ ThreadCache::~ThreadCache() {
       ReleaseToCentralCache(&list_[cl], cl, list_[cl].length());
     }
   }
+  FILE *fp = fopen("tc_stats.txt", "a");
+
+  if (fp) {
+      fprintf(fp, "central fetch:%lu release:%lu\n",
+              central_fetch_cnt_, central_release_cnt_);
+      fclose(fp);
+  }
 }
 
 // Remove some objects of class "cl" from central cache and add to thread heap.
@@ -128,6 +136,7 @@ void* ThreadCache::FetchFromCentralCache(uint32_t cl, int32_t byte_size,
   void *start, *end;
   int fetch_count = Static::central_cache()[cl].RemoveRange(
       &start, &end, num_to_move);
+  ++central_fetch_cnt_;
 
   if (fetch_count == 0) {
     ASSERT(start == NULL);
@@ -166,6 +175,7 @@ void ThreadCache::ListTooLong(FreeList* list, uint32_t cl) {
 
   const int batch_size = Static::sizemap()->num_objects_to_move(cl);
   ReleaseToCentralCache(list, cl, batch_size);
+  ++central_release_cnt_;
 
   // If the list is too long, we need to transfer some number of
   // objects to the central cache.  Ideally, we would transfer
diff --git a/src/thread_cache.h b/src/thread_cache.h
index 7454688..23bd1cf 100644
--- a/src/thread_cache.h
+++ b/src/thread_cache.h
@@ -305,6 +305,8 @@ class ThreadCache {
   // All ThreadCache objects are kept in a linked list (for stats collection)
   ThreadCache* next_;
   ThreadCache* prev_;
+  size_t       central_fetch_cnt_;
+  size_t       central_release_cnt_;
 
   // Ensure that this class is cacheline-aligned. This is critical for
   // performance, as false sharing would negate many of the benefits

for:

LD_PRELOAD=/home/lano1106/dev/packages/gperftools/src/gperftools/.libs/libtcmalloc_minimal.so TCMALLOC_RELEASE_RATE=0 TCMALLOC_TRANSFER_NUM_OBJ=256 time ./ptmalloc_test1 1 1 10000000 64
Using posix threads.
total=1 threads=1 i_max=10000000 size=64 bins=125000
Created thread 7841c66006c0.
Done.
0.62user 0.01system 0:00.94elapsed 67%CPU (0avgtext+0avgdata 13576maxresident)k
0inputs+32outputs (0major+2065minor)pagefaults 0swaps

I got:

central fetch:1085 release:461
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0

for

LD_PRELOAD=/home/lano1106/dev/packages/gperftools/src/gperftools/.libs/libtcmalloc_minimal.so TCMALLOC_RELEASE_RATE=0 TCMALLOC_TRANSFER_NUM_OBJ=256 time ./ptmalloc_test1 8 8 10000000 64
Using posix threads.
total=8 threads=8 i_max=10000000 size=64 bins=15625
Created thread 7a553fa006c0.
Created thread 7a553f0006c0.
Created thread 7a553e6006c0.
Created thread 7a553dc006c0.
Created thread 7a553d2006c0.
Created thread 7a553c8006c0.
Created thread 7a553be006c0.
Created thread 7a553b4006c0.
Done.
4.09user 0.04system 0:01.53elapsed 269%CPU (0avgtext+0avgdata 15920maxresident)k
0inputs+168outputs (0major+1785minor)pagefaults 0swaps

I got:

central fetch:793 release:524
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:793 release:521
central fetch:789 release:524
central fetch:786 release:523
central fetch:792 release:523
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:788 release:524
central fetch:786 release:523
central fetch:790 release:524
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
central fetch:2 release:0
  1. I am surprised that 4 ThreadCache objects per thread get created. I was expecting 1!
  2. The number of central fetches is higher than I would have expected for a reasonably bounded allocation pattern.
  3. TBH, I do not fully understand what t-test1.c is doing... It does the specified number of iterations, but at each iteration it seems to choose a random action... I honestly cannot affirm that the test program is correct. If it is leaking, that could explain why the central heap is solicited so much. OTOH, glibc appears to manage the situation much better by adopting a more performant strategy to cope with what is thrown at it...

@alk
Contributor
alk commented Aug 30, 2024

Hi. I'll give a more elaborate comment later (likely tomorrow), but here are some quick thoughts.

  1. You're likely seeing not a regression in gperftools or ("abseil") tcmalloc, but glibc getting better (in this benchmark, too).

  2. I am thankful for your insightful comments and general interest in gperftools. We need more people, even if it is just for extra code review (but we have some "writing code" projects too). But do keep in mind that way back when the original tcmalloc was written, glibc and some common mallocs were not so great for common C++-ful workloads. Nearly any benchmark was good enough to enable good progress. But today, ~all common mallocs got their basics right, so now benchmark and/or workload specifics matter. This benchmark, do keep in mind, is not necessarily a great representation of whatever workloads you're optimizing.

@lano1106
Author
lano1106 commented Aug 30, 2024

You make me think about something... I must not be calculating Mops the same way Sanjay did.

All I am doing is iterations/time.

By looking at the graph on
https://gperftools.github.io/gperftools/tcmalloc.html

I now realize that he must be doing num_threads*iterations/time;
this is how the Mops can increase with the number of threads...
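To make the difference concrete, a tiny computation using the 8-thread, 10,000,000-iteration run quoted earlier (1.53 s elapsed); only the thread factor distinguishes the two figures:

#include <cstdio>

int main() {
  const double iterations  = 10000000.0;  // per thread, from the quoted run
  const double num_threads = 8.0;
  const double elapsed_sec = 1.53;

  // iterations/time vs num_threads*iterations/time
  std::printf("per-thread: %.2f Mops\n", iterations / elapsed_sec / 1e6);               // ~6.5
  std::printf("aggregate:  %.2f Mops\n", num_threads * iterations / elapsed_sec / 1e6); // ~52.3
  return 0;
}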

Maybe not a regression... but certainly valid questions... and improvements might still be possible in your very mature project.

I fail to see why there are so many accesses to the central heap...
The number of thread caches is suspicious too...

If it is only a matter of configuration... the library is incredibly opaque about suboptimal operation... I have been struggling for a few days to make sense of what I am seeing, and no one has been able to provide a satisfying explanation yet...

Thanks for the kind words, and I am looking forward to your more elaborate comment!

P.S.: I have just run valgrind on the t-test1 executable... no leaks...

P.P.S.: gperftools is still the fastest single-threaded allocator:
google/tcmalloc#260 (comment)

I erroneously thought that the project actively maintained by Google, with all its bells & whistles, would automatically be better...

If I can reduce the central heap contention, then as far as I am concerned gperftools will stay the fastest allocator currently available.

@lano1106
Author
lano1106 commented Aug 31, 2024

2. Nearly any benchmark was good enough to enable good progress. But today, ~all common mallocs got their basics right, so now benchmark and/or workload specifics matter. This benchmark, do keep in mind, is not necessarily a great representation of whatever workloads you're optimizing.

I get what you are saying here... the simple benchmark may not be representative of my actual workload...

Yet it makes tcmalloc exhibit a behavior that I want to avoid in my real application, and I am only starting to be aware of it and to know what to look for... there must be some stats out there that I am unaware of that would tell me precisely what I want to know... My app has perhaps a 1 GB virtual memory footprint (111.2 MB resident) running on a dedicated system with 128 GB of RAM... I absolutely do not care about accumulating memory in the thread cache. The only thing that I want is the fastest allocation possible... I have multiple threads having memory allocated by OpenSSL (95% of all my app's allocations)... t-test1.c is representative in the sense that it does not appear to allocate an unreasonable amount of memory, yet the library releases memory to the central heap very often, making it 4-5x slower than glibc.

I am perhaps a few hours away from finally fully understanding how tcmalloc makes this decision and how to configure it to behave the way I want (or maybe less if I get some help)...

The question that I have, if it is only a matter of configuration: are tcmalloc's default settings adequate for general-purpose usage? Unless you tell me that there are a lot of embedded-systems users in the tcmalloc userbase, maybe the defaults could be a bit more liberal, like what glibc chose to be, to exhibit better performance in t-test1.c, which I believe to be reasonably representative of average multithreaded heap usage...

Update:
I have finally figured out precisely what t-test1.c is doing.

If I get it right, it will allocate approximately 4 MB by doing roughly 60,000 64-byte allocations (there is a random element to the test).

The number of live blocks per thread is always that ~4 MB divided by (size*n_thr)... so the tcmalloc slowdown has to do with the size-class free-list length, and this is why tcmalloc performs better with bigger allocations than with smaller ones...
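A back-of-envelope check against the bins values printed by the quoted runs (assuming the ~50% allocation rate from the RANDOM(&ld, 2) coin flip):

#include <cstdio>

int main() {
  const long size      = 64;
  const long bins_1thr = 125000;  // from "total=1 threads=1 ... bins=125000"
  const long bins_8thr = 15625;   // from "total=8 threads=8 ... bins=15625"

  // Total bin capacity is the same in both runs: bins * size * n_threads.
  std::printf("capacity (1 thr): %ld bytes\n", bins_1thr * size);      // 8,000,000
  std::printf("capacity (8 thr): %ld bytes\n", bins_8thr * size * 8);  // 8,000,000

  // With roughly half the bins holding a live block at any moment, that is
  // about 4 MB live, i.e. ~62,500 live 64-byte blocks in the 1-thread run.
  std::printf("approx live: %ld bytes in %ld blocks\n",
              bins_1thr * size / 2, bins_1thr / 2);
  return 0;
}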

@lano1106
Author
lano1106 commented Aug 31, 2024

Oh, here is a small gem that I have discovered... I am using GCC 14.1.2 and the switch -std=c++26 when I compile tcmalloc... I do not know what in C++26 is doing it, but with this setting I have seen some results from malloc_bench come out 10% faster than when compiled with the default standard of -std=c++17...

A 10% improvement... this is not something to disregard...
(Or maybe adding -march=sapphirerapids contributed too... that is something to confirm. I got a 10% boost just by recompiling.)

@lano1106
Author

In t-test1.c's malloc_test() function, I identify 3 sections:

  1. Init bins. 50% of them will be allocated. (Most of the central heap fetches are going to occur there.)
  2. Main test: random deallocations/allocations are performed 'iterations' times.
  3. Teardown: all the memory is released. (This is where the majority of the central heap releases will occur, IMHO.)

I think that it would be interesting to know how many times the central heap is accessed in step 2; possibly never, or very rarely (hopefully)... I'll try to modify my debug hooks to answer this question...

Besides that, it might be possible to improve the central heap access logic to minimize accesses to it...

I am throwing out a few random ideas that come to my mind:

Make the fetch ramp-up batch size customizable... It could be interesting to give users the option to initialize FreeList::max_length_ with Static::sizemap()->num_objects_to_move(cl). A rough sketch of the idea follows.
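Something along these lines is what I have in mind; this is a self-contained sketch with made-up names (it is not the gperftools FreeList API), just to show the idea of starting the per-class cap at the transfer batch instead of ramping up from 1:

// Self-contained sketch; names are invented for illustration, not gperftools API.
struct FreeListSketch {
  int max_length = 1;  // the per-thread list cap normally starts tiny and grows
};

constexpr int kNumClassesSketch = 88;   // illustrative class count
int NumObjectsToMoveSketch(int cl) {    // stand-in for the sizemap lookup
  return cl < 8 ? 32 : 8;
}

void InitThreadCacheSketch(FreeListSketch (&lists)[kNumClassesSketch],
                           bool start_at_batch) {
  for (int cl = 0; cl < kNumClassesSketch; ++cl) {
    // Optionally skip the ramp-up and start at the batch size right away,
    // saving the first few central-cache round trips per size class.
    lists[cl].max_length = start_at_batch ? NumObjectsToMoveSketch(cl) : 1;
  }
}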

@lano1106
Author
lano1106 commented Aug 31, 2024

I have enhanced my ThreadCache log trace to include ThreadCache::max_size_

The result obtained with
./ptmalloc_test1 1 1 10000000 64

central fetch:1085 release:461 max_size_:196608
central fetch:2 release:0 max_size_:65536
central fetch:2 release:0 max_size_:65536
central fetch:2 release:0 max_size_:65536

Where does this ridiculously low value of 196608 come from?

https://gperftools.github.io/gperftools/tcmalloc.html

mentions that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES defaults to 32 MB (or 16 MB if you go by the value in the text).

The mysterious bogus cache objects are stealing kStealAmount (1 << 16) from unclaimed_cache_space_, which is initialized with kDefaultOverallThreadCacheSize (32 MB).

196608 happens to be (3 << 16)
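Spelled out (just redoing the arithmetic from the values above):

#include <cstdio>

int main() {
  const long kStealAmount = 1L << 16;                           // 65536, per the source
  std::printf("3 * kStealAmount = %ld\n", 3 * kStealAmount);    // 196608, the max_size_ seen above
  std::printf("overall budget   = %ld\n", 32L * 1024 * 1024);   // 33554432 (32 MB default)
  return 0;
}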

@alk
Contributor
alk commented Aug 31, 2024

So another quick thing. I am not sure what is going on on your end.

  1. That 10000000 "imax" parameter is way, way too short. It runs in 15 ms (yes, milliseconds) on my box. I need to run 100x more to get plausible figures.

  2. A profile of that run shows ~all malloc ticks in the fast path. No central free list business. In fact, for a "size" parameter of 64 we'll get only a few size classes in use, and so we should be able to adapt reasonably quickly to the only 125k "bins" that this test uses.

Ah, btw, gperftools is faster than glibc on this workload (because, again, we're pretty good on this fast-path thingy).

So I am not sure what you're chasing after.

@alk
Contributor
alk commented Aug 31, 2024

profile

@lano1106
Author
lano1106 commented Aug 31, 2024

You are performing the test on a much faster machine than mine. My dev machine is an old Sandy Bridge system.

  1. I got results in the 15 ms range with 1,000,000 iterations... with 10,000,000 iterations it takes a couple of seconds (4-ish).
  2. Cut the 125k "bins" number in half. Only roughly half of them are allocated:
		if(RANDOM(&ld, 2) == 0)
			bin_alloc(&p.m[b], RANDOM(&ld, p.size) + 1,
					  lran2(&ld));

Do you have env vars configured that would explain why you get a different result?
Sure, if you never access the central heap the way I do, you will get a much faster run...

I do not understand how it is possible that you get away with hardly ever accessing the central heap...
max_size_ is initialized at kStealAmount (64 KB) and is incremented by kStealAmount each time salvage() is called...

I am going to produce a profiler run output like you did, to show you how different my run looks...

BTW, I have solved the bogus ThreadCache thing... I think that temporary ThreadCache objects are created because I am calling functions that allocate memory from the ThreadCache destructor... I have still improved my trace by adding a few more fields... The bogus cache objects are reusing the same addresses:

central fetch:2163 release:0 max_size_:917504 unclaimed_cache_space_:32505856 thread_heap_count_:3 this:0x653987e22040
central fetch:2    release:0 max_size_:65536  unclaimed_cache_space_:33357824 thread_heap_count_:3 this:0x653987e23080
central fetch:2    release:0 max_size_:65536  unclaimed_cache_space_:33357824 thread_heap_count_:3 this:0x653987e22040
central fetch:2    release:0 max_size_:65536  unclaimed_cache_space_:33357824 thread_heap_count_:3 this:0x653987e23080

In reality, my program must have 2 real caches... 1 for the main thread, 1 for the test thread. The 3rd one would be the temporary one...

BTW, I have succeeded in getting a run with zero releases... I achieved this by changing the

kMaxDynamicFreeListLength value from 8192 to 32768...

At this point, I am just playing with the various knobs to see what they do...

@lano1106
Author
lano1106 commented Aug 31, 2024

Also, since we are running on different HW, it does not matter that your run is much faster than mine... what you want to check is how the time for an 8-thread run compares with a 1-thread run... What I have observed is an 8x slowdown caused by central heap accesses... If you do not have the same contention, then the expected result should be that the 1-thread test time is more or less equal to the 8-thread test time.

@alk
Contributor
alk commented Aug 31, 2024

I do not understand how it is possible that you get away with hardly ever accessing the central heap... max_size_ is initialized at kStealAmount (64 KB) and is incremented by kStealAmount each time salvage() is called...

It is not that they don't happen. It is just peanuts compared to the total amount of work done.

I don't set any environment variables (except LD_PRELOAD and CPUPROFILE).

Yes, it shouldn't matter whether you're doing it on an older machine or not. I guess I'll try this on the RPI 4 I have, to double check.

@lano1106
Author
lano1106 commented Aug 31, 2024

I have reproduced your profiler profile on my side:
pprof

IMHO, it is not conclusive...

Yes, it is peanuts in the single-thread setup, since there is no contention.

The real test is comparing the exec time on the same HW with the same parameters. If the multi-thread test exhibits an exec time 8x slower, it is a sign of contention... The only contention that I am aware of is the central heap lock... If it is peanuts, I would expect the 8-thread setup to perform almost as fast as the single-thread one...

I would not have fallen into the rabbit hole if the contention was not real...

@alk
Contributor
alk commented Aug 31, 2024

Absolutely no signs of lock contention. Here is RPI 4:

pi@raspberrypi:~/src/ptmalloc-bench $ perf stat \time sh -c 'LD_PRELOAD=/home/pi/gpt/lib/libtcmalloc_minimal.so ./t-test1 1 1 100000000 64'
Using posix threads.                                         
total=1 threads=1 i_max=100000000 size=64 bins=125000        
Created thread 7f918ff160.                                   
num_bin_allocs: 50047547                                     
Done.                                                        
14.54user 0.00system 0:14.55elapsed 99%CPU (0avgtext+0avgdata 11744maxresident)k
0inputs+0outputs (0major+2259minor)pagefaults 0swaps         
                                                             
 Performance counter stats for 'time sh -c LD_PRELOAD=/home/pi/gpt/lib/libtcmalloc_minimal.so ./t-test1 1 1 100000000 64':
                                                             
          14554.16 msec task-clock                       #    1.000 CPUs utilized             
                85      context-switches                 #    5.840 /sec                      
                 0      cpu-migrations                   #    0.000 /sec                      
              2312      page-faults                      #  158.855 /sec                      
       26196192120      cycles                           #    1.800 GHz                       
       11177991268      instructions                     #    0.43  insn per cycle            
   <not supported>      branches                                                              
         142574508      branch-misses                                                         

      14.558668832 seconds time elapsed

      14.545619000 seconds user
       0.009988000 seconds sys


pi@raspberrypi:~/src/ptmalloc-bench $ 
pi@raspberrypi:~/src/ptmalloc-bench $ perf stat \time sh -c 'LD_PRELOAD=/home/pi/gpt/lib/libtcmalloc_minimal.so ./t-test1 4 4 100000000 64'
Using posix threads.
total=4 threads=4 i_max=100000000 size=64 bins=31250
Created thread 7f8c2df160.
Created thread 7f8bacf160.
Created thread 7f8b2bf160.
Created thread 7f8aaaf160.
num_bin_allocs: 50024031
num_bin_allocs: 50040030
num_bin_allocs: 50005881
num_bin_allocs: 50021897
Done.
95.46user 0.07system 0:24.09elapsed 396%CPU (0avgtext+0avgdata 11920maxresident)k
0inputs+0outputs (0major+2302minor)pagefaults 0swaps

 Performance counter stats for 'time sh -c LD_PRELOAD=/home/pi/gpt/lib/libtcmalloc_minimal.so ./t-test1 4 4 100000000 64':

          95548.26 msec task-clock                       #    3.965 CPUs utilized             
               451      context-switches                 #    4.720 /sec                      
                 2      cpu-migrations                   #    0.021 /sec                      
              2356      page-faults                      #   24.658 /sec                      
      171972371510      cycles                           #    1.800 GHz                       
       44659199504      instructions                     #    0.26  insn per cycle            
   <not supported>      branches                                                              
         283899065      branch-misses                                                         

      24.098579251 seconds time elapsed

      95.468168000 seconds user
       0.077941000 seconds sys

Bumping the thread count proportionally scales up the amount of work. And yes, each core gets about 2x slower, but the total number of instructions scales 4x while doing 4x more work. So whatever slowdown there is comes down to microarchitectural reasons (thermal budget, shared caches, things like that).

And ~100% CPU utilization too. Lock contention would lower it due to sleeping, or we'd see spinlock busy-looping on profiles (we do see a bit). None of that is there.
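For what it's worth, redoing the division on the numbers quoted above (test iterations taken from i_max per thread) shows the per-iteration instruction count staying flat while cycles per iteration grow, which fits the microarchitectural explanation rather than lock contention:

#include <cstdio>

int main() {
  // Values copied from the two RPI 4 perf runs above.
  const double iters_1t = 100e6,          iters_4t = 4 * 100e6;
  const double insn_1t  = 11177991268.0,  insn_4t  = 44659199504.0;
  const double cyc_1t   = 26196192120.0,  cyc_4t   = 171972371510.0;

  std::printf("instructions/iter: %.0f vs %.0f\n", insn_1t / iters_1t, insn_4t / iters_4t); // ~112 vs ~112
  std::printf("cycles/iter:       %.0f vs %.0f\n", cyc_1t / iters_1t,  cyc_4t / iters_4t);  // ~262 vs ~430
  return 0;
}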

Okay, I think this can be closed as 'not a bug'.

Ah, one final thing. The profile you show above has a catastrophically small number of profiling ticks. That screams noise everywhere, and it likely explains the oddities you report, like random unrelated flags showing a big impact that is "impossible".

@alk alk closed this as completed Aug 31, 2024
@lano1106
Author
lano1106 commented Aug 31, 2024

OK, my bad... I must have confused user time with elapsed time (aka wall clock time).

glibc times look better when looking at user time because it must be sleeping on locks.

thank you for your help

@lano1106
Author
lano1106 commented Sep 1, 2024

Here are the updated graphs with the corrected measurement methodology:
mops_vs_size_v2
mops_vs_threads_jemalloc_v2

@alk
Contributor
alk commented Sep 1, 2024

Thanks. I am slightly surprised to see "abseil" tcmalloc lose to us here.

Could be the "prefetch" thingy though. I.e., they do a prefetch in the allocation path, for which there is some evidence (although I wasn't 100% sure I buy it) that it benefits real-world app performance despite doing slightly more work within malloc. And it certainly degrades benchmark numbers a bit.

@lano1106
Author
lano1106 commented Sep 1, 2024

Yes... it is very surprising... their documentation is very well done... the features/benefits list is very enticing...

But OTOH, I had a few question marks:

With all those friction points, I wanted to be absolutely sure that transitioning to this variant was the best possible move. This is the driver for the benchmark effort.

Because of a stupid error on my part, I spent a lot of time (and some of yours) trying to resolve an imaginary problem. OTOH, I have learned a lot from it... I can now claim to have a pretty good understanding of how tcmalloc works...

That was an incredible learning activity...

My opinion about "abseil" tcmalloc (I am not aware of the prefetch thing) is that they have traded off some allocation/deallocation performance to offer a per-CPU cache...

This might be very beneficial when you have a system with many more threads than CPUs... This is not the case in my setup...

@alk
Contributor
alk commented Sep 2, 2024

My opinion about "abseil" tcmalloc (I am not aware of the prefetch thing) is that they have traded off some allocation/deallocation performance to offer a per-CPU cache...

Not entirely correct. Yes, for them per-cpu caches matter, or at least they think so, because of the fibers use-case (think of it as thread-per-request; that would be a ton of threads on which to "waste" memory for per-thread caches). But per-cpu caches actually enable some performance, despite some (fairly small) rseq overhead. This is because pushing and popping elements on the per-size-class lists doesn't involve pointer chasing. And it seems to matter quite a bit. It also enables them to do a much improved central free list design (also saving tons of pointer chasing). This, plus some pretty good attention to performance, makes abseil tcmalloc pretty good performance-wise.

And there are certainly a ton of workloads, non-Google workloads, that would benefit from per-cpu caches today (e.g. Mongo).

I am not sure what causes the gaps in the graphs above, but they don't represent reality in practice for C++ workloads. I mention C++ workloads because of one of the changes there: they turned malloc into memalign to conform to the nonsensical minimal alignment requirements that are formally part of the C standard (BTW, C++ might actually do the same). So operator new is full performance. malloc is slightly slower (but I think still faster than our memalign).

Ah, and another thing. If you have some appetite for chasing central free list performance, there is actually one semi-prominent benchmark that we lose today, and it involves the CFL. Here: #1448

@lano1106
Author
lano1106 commented Sep 3, 2024

When you mention pointer chasing... this is a tcmalloc aspect that did not stand out in my code review. I suppose that this pointer chasing business occurs during deallocation, to figure out which size class a pointer belongs to... Hence the C++14 sized delete operator is a welcome addition, as it completely shortcuts the pointer chasing.
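For reference, this is the C++14 sized global deallocation signature being referred to; a simplified replacement sketch (alignment and nothrow overloads omitted), with malloc/free standing in for a real allocator:

#include <cstddef>
#include <cstdlib>
#include <new>

void* operator new(std::size_t size) {
  if (void* p = std::malloc(size)) return p;
  throw std::bad_alloc();
}

void operator delete(void* p) noexcept { std::free(p); }

// C++14 sized deallocation: the compiler passes the allocation size, so an
// allocator can map it straight to a size class instead of looking the
// pointer up first.
void operator delete(void* p, std::size_t size) noexcept {
  (void)size;  // a tcmalloc-style allocator would use this to pick the size class
  std::free(p);
}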

I think that recent C++ standard revisions also mandate some minimal alignment requirements...

I am not sure why you say that the t-test1 benchmark does not represent a C++ workload... What is a representative C++ workload? How would you characterize it?

The alignment aspect is possibly a non-issue in the t-test1 benchmark, considering that all the sizes are perfect powers of 2...

The only difference that I can possibly imagine between the t-test1 workload and a real app workload is that the real app might have more variety in its allocation sizes... to me the t-test1 benchmark is pretty convincing beyond that point...

t-test1 measures how fast an allocator can service a sequence of random allocations/deallocations of a certain size across N threads... this seems to me like a valid way to measure and compare allocator performance...

Do you have data that shows abseil tcmalloc at its best? I have not been able to find any...

Otherwise, what would be the ultimate way to see what the allocator is capable of with a real workload? Profile the app using different allocators and see where malloc sits in the list of top profiled functions?

Besides that, thank you for the link, but no thank you... my craving for benchmark numbers is satisfied... I need to move on, as I have already spent too much time on this topic. I am sure that there exist benchmarks that press all the wrong buttons to make certain allocators look bad simply because of their design choices... the type of situation that does not show up in real scenarios... I am trying to put the lid back on the pot concerning this specific topic...

I am going to stick with my current conclusion that for a series of random allocations/deallocations, abseil tcmalloc performs less well than vanilla tcmalloc, until new data convinces me otherwise... I need hard data to form an opinion...

Here is a unified bench_malloc result table:
malloc_bench_2

My understanding of what it does: it tries to isolate the execution time of a single allocation with a single thread across different setups.

I am convinced that abseil tcmalloc has its strengths, but based on my investigation, it does not seem like raw speed is one of them...

@alk
Contributor
alk commented Sep 3, 2024

I've yet to check that link you posted above about RSEQ compatibility issues, but the numbers you posted don't look right. I speculate (plausibly, I think) that your abseil tcmalloc doesn't run in per-cpu mode.

I now see your statement above about "compromising" performance for per-cpu mode in a potentially new light. So let me clarify. It is arguably part of a big "weakness" of sorts of the Abseil thing in general, but abseil tcmalloc especially. They only care about their use-case and are not interested in any compromises. Open source in this project is there to enable sharing with other Alphabet bets and to encourage external contributions (like from Intel or Arm, stuff like that). They're not interested in other contributions as long as those don't help their somewhat narrow use-case.

So in this regard, per-cpu mode is the only mode they optimize for. There were numerous internal discussions while I was part of that team about even dropping the "legacy" per-thread stuff. I can tell you with certainty that I have seen abseil tcmalloc perform much better than the stats above, and I know both Chris and Dmitry won't regress performance.

So as long as you're not running per-cpu mode in practice, you're not seeing its true might. And if (again, saying this without really checking) the lack of per-cpu mode happens because of some glibc/tcmalloc RSEQ incompatibility, yes, that is a problem. But don't expect them to bother much, sadly.

Now for the pointer chasing. The classic case of pointer chasing is traversing linked lists. When you have code that continuously pops elements off a free list, that is classic pointer chasing. Yes, it can be argued that as part of malloc/operator new you're reading memory that you're about to return and which your caller is likely to use anyway, but it adds work. It adds even more work and slowness to the "pop a bunch off a per-thread list" operation that we do as part of central free list operations. And when pushing elements onto the free list as part of free, we also have an "extra" memory store, touching and potentially bringing into the caches memory that is being freed.
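A minimal, generic illustration of the difference (not either allocator's actual code): popping from a linked free list must dereference the object being handed out, while an array-of-pointers cache does not:

#include <cstddef>

// Linked free list: the "next" pointer lives inside the free object itself,
// so every pop reads (and pulls into cache) the memory being returned.
struct FreeObject { FreeObject* next; };

void* PopLinked(FreeObject*& head) {
  FreeObject* obj = head;
  if (obj) head = obj->next;   // pointer chase: touches *obj
  return obj;
}

// Array-based cache (per-cpu style): pointers are stored out of line, so a
// pop is an index decrement plus one load from the array, no touch of *slot.
struct ArrayCache {
  void*       slots[64];
  std::size_t count = 0;
};

void* PopArray(ArrayCache& c) {
  return c.count ? c.slots[--c.count] : nullptr;
}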

As for the difference in workloads: I don't have a super-great understanding, but real workloads have:

a) A non-uniform distribution of object sizes. Sizes in the ballpark of 32 bytes or so are most common, and there is a typical "peak"-ful distribution.

b) A non-uniform distribution of object lifetimes, with a quite heavy head towards very short lifetimes.

c) Dependencies in object lifetimes. Stuff allocated together is likely to be freed together. Also, threads often run in phases of sorts, where for some time a thread may allocate a bunch of stuff (say, to populate a hash table or something) and then do a burst of frees.

None of that is simulated by the ptmalloc bench (or any other bench I've seen yet). And, yes, C++ workloads at Google (but I suspect not much beyond) are by far dominated by sized delete, which may affect some design choices.

The alignment aspect is possibly a non-issue in the t-test1 benchmark, considering that all the sizes are perfect powers of 2...

The alignment aspect is a (very tiny) issue because abseil tcmalloc's malloc() does more work than operator new.

@lano1106
Author
lano1106 commented Sep 3, 2024

It is a super interesting discussion...

I am aware that the per-cpu cache mode can be disabled because of glibc... I can assure you that the bench-malloc numbers are with per-cpu caches enabled.

TBH, I took many measurements when benchmarking t-test1 only to realize that I had forgotten GLIBC_TUNABLES=glibc.pthread.rseq=0.

And you know what? The difference was barely noticeable... It looked very much like the graphs accompanying my experimental commit #1554.

To me, it seems like everyone who has been involved with abseil tcmalloc has the strong belief that it must be better than the legacy project.

I have not seen any evidence of this so far... Of course, I can imagine scenarios where abseil tcmalloc shines and leaves every other allocator far behind...

i.e., having hundreds of threads on a 32-CPU system...

I am sure that the per-cpu cache may enable allocations that optimize CPU cache usage... and possibly, despite a longer allocation time, it delivers better overall app performance...

But SW design is always a matter of tradeoffs... if abseil tcmalloc is doing more complex stuff, it will need to give up some raw speed compared to the simpler design... IMHO, it is inevitable...

Concerning the idea that they are unwilling to compromise outside of their very narrow use-case... I figured that point out by myself... The Bazel-only support is another sign of this... It comes straight from Titus Winters' legacy of building the whole world on each build... (the living-at-HEAD philosophy... which I must confess has clear benefits).

I definitely invite you to check the RSEQ ticket... RSEQ creator Mathieu Desnoyers chimed in and was not very happy about some aspects, to the point where he committed something to the kernel to express his disapproval... I am sure that it will end up with a happy compromise or a win-win solution, but what is happening around tcmalloc is very fascinating... It feels like a soap opera... It is very entertaining to watch very talented people trying to push the limits to create better software...

@compudj
compudj commented Sep 3, 2024

I definitely invite you to check the RSEQ ticket... RSEQ creator Mathieu Desnoyers chimed in and was not very happy about some aspects, to the point where he committed something to the kernel to express his disapproval... I am sure that it will end up with a happy compromise or a win-win solution, but what is happening around tcmalloc is very fascinating... It feels like a soap opera... It is very entertaining to watch very talented people trying to push the limits to create better software...

Note that I have not pushed anything upstream that would prevent users from misusing the rseq ABI, but I indeed have a patch ready. I want to find alternative solutions in collaboration with the tcmalloc people first before starting to enforce proper ABI use from the kernel.

Note that I've just posted a patch series improving cache-locality of RSEQ mm_cid. You may want to try running your benchmarks with it:

https://lore.kernel.org/lkml/20240903190650.53644-1-mathieu.desnoyers@efficios.com/T/#t

@lano1106
Author
lano1106 commented Sep 3, 2024

Mathieu, if it can help to see how the patch impacts the results measured in my benchmark, I can give it a shot...

Yes, I know that you did not commit anything that would impede anyone from misusing RSEQ... If I understood correctly, it was logging warnings when configured for debugging...

Please explain how mm_cid can impact tcmalloc performance.

I am not sure that I saw anything named mm_cid in what is shared between the kernel and RSEQ users...

And keep in mind too that in all the benchmark setups there are always far fewer threads than CPUs (my testing setup is not a NUMA machine), so I am not expecting a lot of task migrations or excessive context switches that could impact performance due to suboptimal cache behavior...

@alk
Contributor
alk commented Sep 3, 2024

I can assure you that the bench-malloc numbers are with per-cpu caches enabled.

I encourage you to double check. At least on my system (up-to-date kernel and Debian Sid), running the latest abseil tcmalloc's tcmalloc_benchmark thingy, I see that cpu_cache_active is false. I don't really want to bother checking why. It used to work.

Broadly, I don't think we have much disagreement. We broadly agree that a number of choices made as part of abseil make it very hard, if at all possible, to use their malloc.

It is just that when you say "per-cpu mode has to do more work", you activate my "someone is wrong on the internet" mode :) I encourage you to check the disassembly of operator new and operator delete in both gperftools and tcmalloc, and you'll see your statement isn't true (yes, despite some tiny extra overhead of storing into the "active rseq" field using upstream rseq compared to what it could have been).

Also, don't underestimate the performance hit per-thread mode takes when a number of threads want to grow their per-thread cache and fight each other. It really gets nasty very fast. And per-cpu mode fixes it nicely despite all the troubles. And despite that, I am not really 100% on board yet with this approach, even more so for their "virtual cpu id" stuff. It is just your "per-cpu mode is slow" statement that provokes me some.

@compudj
compudj commented Sep 4, 2024

Mathieu, if it can help to see how the patch impacts the results measured in my benchmark, I can give it a shot...

Yes, I know that you did not commit anything that would impede anyone from misusing RSEQ... If I understood correctly, it was logging warnings when configured for debugging...

Well, actually, the approach I intend to take is to kill any process that corrupts RSEQ fields documented as read-only for user-space. But as I said, I want to discuss a way forward with the tcmalloc developers before introducing this kind of validation.

Please explain how mm_cid can impact tcmalloc performance.

I am not sure that I saw anything named mm_cid in what is shared between the kernel and RSEQ users...

Hrm, indeed in the tcmalloc master branch I see that struct kernel_rseq has padding where the node_id and mm_cid fields are located, and then has a union with vcpu stuff that's really not something that is present in the upstream ABI (and will conflict with future upstream kernel changes). So I guess you need a Google-internal version of the kernel to have those custom fields.

I'd recommend wiring up tcmalloc with the upstream mm_cid field instead to replace all this custom vcpu stuff. See include/uapi/linux/rseq.h for details.

And keep in mind too that in all the benchmark setups there are always far fewer threads than CPUs (my testing setup is not a NUMA machine), so I am not expecting a lot of task migrations or excessive context switches that could impact performance due to suboptimal cache behavior...

When using mm_cid without my recent cache locality improvements, even with a few threads running intermittently across a few cores, you would end up with cache line bouncing due to allocation of the same concurrency ID across all used cores. My recent patch fixes that. This cache line bouncing is not present when the workload runs threads across all used cores at least every 100 ms, which may explain why this slowdown does not show up in typical stress tests but might show up in real-life workloads.

@lano1106
Author
lano1106 commented Sep 4, 2024

I can assure you that the bench-malloc numbers are with per-cpu caches enabled.

I encourage you to double check. At least on my system (up-to-date kernel and Debian Sid), running the latest abseil tcmalloc's tcmalloc_benchmark thingy, I see that cpu_cache_active is false. I don't really want to bother checking why. It used to work.

Broadly, I don't think we have much disagreement. We broadly agree that a number of choices made as part of abseil make it very hard, if at all possible, to use their malloc.

It is just that when you say "per-cpu mode has to do more work", you activate my "someone is wrong on the internet" mode :) I encourage you to check the disassembly of operator new and operator delete in both gperftools and tcmalloc, and you'll see your statement isn't true (yes, despite some tiny extra overhead of storing into the "active rseq" field using upstream rseq compared to what it could have been).

Also, don't underestimate the performance hit per-thread mode takes when a number of threads want to grow their per-thread cache and fight each other. It really gets nasty very fast. And per-cpu mode fixes it nicely despite all the troubles. And despite that, I am not really 100% on board yet with this approach, even more so for their "virtual cpu id" stuff. It is just your "per-cpu mode is slow" statement that provokes me some.

I suppose the reason the per-cpu cache used to be enabled and is not anymore has something to do with a glibc upgrade. RSEQ registration by glibc was introduced in v2.35... if tcmalloc fails to register rseq because glibc did it first, the per-cpu cache is disabled. The simple fix for that is to prefix the benchmark command with GLIBC_TUNABLES=glibc.pthread.rseq=0.

I even added a print statement in bench_malloc to confirm that I was using the per-cpu cache... I am 100% confident about this...
google/tcmalloc#260 (comment)

I even went as far as comparing the results with/without the per-cpu cache... the difference is noticeable but marginal... not a big deal at all (in the benchmark context).

Concerning your "someone is wrong on the internet" mode: OK, I might be wrong... It is my best educated guess to explain the observation... (If I knew exactly why, I would offer a patch.)

I still think that this must not be that far from reality. A few days ago, you mentioned a "prefetch" thing as a possible explanation for the result. IMHO, that "prefetch" thing would fall into the extra-work category...

Another hypothesis that I have offered is that gcc (the compiler that I am using) is maybe not able to inline as well as clang (I assume that is the one used internally, since Google actively works on clang)... It turns out that PageMap::sizeclass() in the abseil version is more complex, using templates. IMHO, this is unlikely, but this idea has remained uncommented...
google/tcmalloc#260 (comment)

I feel like my bad-performance report is being received with disbelief, possibly from being in "someone is wrong on the internet" mode too...

But the fact remains that abseil tcmalloc is slower than legacy tcmalloc... If this is unexpected, some investigation should be performed to find out why... this is not something that I am going to do... I am going to leave that task to the maintainers, if they are interested in discovering and potentially fixing a problem in that area...

@lano1106
Author
lano1106 commented Sep 4, 2024

Mathieu, if it can help to see how the patch impacts the results measured in my benchmark, I can give it a shot...
Yes, I know that you did not commit anything that would impede anyone from misusing RSEQ... If I understood correctly, it was logging warnings when configured for debugging...

Well, actually, the approach I intend to take is to kill any process that corrupts RSEQ fields documented as read-only for user-space. But as I said, I want to discuss a way forward with the tcmalloc developers before introducing this kind of validation.

Please explain how mm_cid can impact tcmalloc performance.
I am not sure that I saw anything named mm_cid in what is shared between the kernel and RSEQ users...

Hrm, indeed in the tcmalloc master branch I see that struct kernel_rseq has padding where the node_id and mm_cid fields are located, and then has a union with vcpu stuff that's really not something that is present in the upstream ABI (and will conflict with future upstream kernel changes). So I guess you need a Google-internal version of the kernel to have those custom fields.

I'd recommend wiring up tcmalloc with the upstream mm_cid field instead to replace all this custom vcpu stuff. See include/uapi/linux/rseq.h for details.

And keep in mind too that in all the benchmark setups there are always far fewer threads than CPUs (my testing setup is not a NUMA machine), so I am not expecting a lot of task migrations or excessive context switches that could impact performance due to suboptimal cache behavior...

When using mm_cid without my recent cache locality improvements, even with a few threads running intermittently across a few cores, you would end up with cache line bouncing due to allocation of the same concurrency ID across all used cores. My recent patch fixes that. This cache line bouncing is not present when the workload runs threads across all used cores at least every 100 ms, which may explain why this slowdown does not show up in typical stress tests but might show up in real-life workloads.

I am going to summarize my takeaway from this reply.

Whether an RSEQ user app uses the mm_cid field or not, prior to the fix there was a cache line bouncing issue at the kernel level, and that would impact every RSEQ-enabled app.

Is this correct?

@lano1106
Author
lano1106 commented Sep 4, 2024

Hrm, indeed in the tcmalloc master branch I see that struct kernel_rseq has padding where the node_id and mm_cid fields are located, and then has a union with vcpu stuff that's really not something that is present in the upstream ABI (and will conflict with future upstream kernel changes). So I guess you need a Google-internal version of the kernel to have those custom fields.

I'd recommend wiring up tcmalloc with the upstream mm_cid field instead to replace all this custom vcpu stuff. See include/uapi/linux/rseq.h for details.

You are correct. The vcpu stuff must be from an internal custom kernel. The tcmalloc GitHub version disables the usage of that field.

See:
https://github.com/google/tcmalloc/blob/91f53c7725c8427083bb303fcca6d36691d54fe3/tcmalloc/internal/percpu.cc#L97

@compudj
compudj commented Sep 4, 2024

[...]

I am going to summarize my takeaway from this reply.

Whether an RSEQ user app uses the mm_cid field or not, prior to the fix there was a cache line bouncing issue at the kernel level, and that would impact every RSEQ-enabled app.

Is this correct?

No, it only affects cache locality of userspace data structures indexed using the rseq mm_cid field.

@alk
Contributor
alk commented Sep 5, 2024

Okay, so to close this off. Indeed, turning off glibc rseq transforms abseil tcmalloc performance.

And its asm code looks reasonable and tight (I don't recognize that "btr" stuff, which seems new, and the frame pointer stuff looks new too). It is (as expected) tighter than ours. And yet it does lose to gperftools a bit. On the other hand, performance seems to depend a lot on microarchitectural details and compilers (gcc beating clang, again, as seems to be the most common occurrence). And the difference isn't huge. Plus, as I noted, they do prefetches, which are strictly overhead on micro-benchmarks but somewhat helpful for real-world performance.

Anyway, thanks for bringing this up. And the summary is still: we lack trustworthy memory allocator benchmarks that show real-world strengths and weaknesses.

@lano1106
Author
lano1106 commented Sep 6, 2024

alk, it has been a pleasure to exchange with you. Yes, I think that it is time to close this off and move on... but hopefully someone will see this exchange and pick up the challenge of finding a way to improve abseil tcmalloc performance...

As a parting comment, I just wanted to let you know that this morning I realized that my LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so on my main app fails to accomplish its purpose... the loader encounters an issue and falls back to the glibc allocator...

I noticed this bug by looking at the process's virtual address mappings in /proc and seeing that the tcmalloc lib was missing... The performance benefit of one allocator over another is so subtle that this bug went unnoticed for weeks...

Update: I found the cause...

My binary is granted Linux capabilities with setcap, and doing that disables LD_PRELOAD for security reasons.

I have worked around this last issue by linking with tcmalloc this way:
-Wl,-whole-archive ./sapphirerapids/libtcmalloc_minimal.a -Wl,-no-whole-archive
