observed performance regression #1549
Comments
I have included jemalloc in the benchmark... something suspicious is that all allocators appear to suffer from lock contention except glibc... what could be the common denominator? I have no clue how the thread_local keyword is implemented, but it could be the culprit... update: |
I went as far as recompiling tag perftools-1.0, and the same poor performance shows up with this version, so the problem is external to the project code... I fail to see what it could be... It would be really nice if someone could test:
to validate that they see the same performance deterioration as soon as several threads are used. |
So this test appears to run extremely quickly. If run time is too short, measurements cannot be trusted. Also, can you please point exactly to the code you're running? (I just checked out the perftools-0.1 tag and built src/tests/ptmalloc/t-test1.c with -O2; no idea if you're doing the same or similar.) |
Yes, I have compiled src/tests/ptmalloc/t-test1.c, though with -O3 rather than -O2. Besides that, this is what I have done: ptmalloc_test1 is the binary resulting from compiling t-test1.c, and the command I used is:
followed by:
but I guess that checking out perftools-1.0 is just as good. I'll follow your suggestion and add one zero to 1,000,000 to see if it results in something different. Update: |
My investigation led me to a minor glitch in the project code... I'll prepare a pull request to share this finding soon... |
My new hypothesis to explain my observation is that the central heap is accessed way too often. It is protected by a spinlock, so if that is the case, the mystery is solved... it is a no-brainer that contention over a lock by 8 threads at the same time will show up in performance. I am not sure there is an easy, ready-to-use way to monitor central heap accesses... I might add my own temporary debug hooks to validate this... One possible factor for this sad state might be the 64-byte class batch_size: without FLAGS_tcmalloc_transfer_num_objects it would be 1024, but the flag's default value is 32... I am wondering whether, in the originally benchmarked tcmalloc version, batch_size was not bounded at all, considering that the TCMALLOC_TRANSFER_NUM_OBJ env var does not appear in the original HTML documents (is it documented anywhere today?).
Update: it did help a little bit, but absolutely not enough to call the case closed... 64 bytes, 8 threads: Mops went from 2.5 to 2.557 |
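To make the bounding I am describing concrete, here is a rough sketch (my own illustration; the flag and env var names are the real ones from the source, but the surrounding code is not the actual implementation):

```cpp
#include <algorithm>
#include <cstdlib>

// Illustration only: how the per-class transfer batch size ends up capped.
// The cap default of 32 and the unbounded value of 1024 for the 64-byte
// class are the numbers from the discussion above.
int BatchSizeForClass(int uncapped_batch /* e.g. 1024 for 64-byte objects */) {
  int cap = 32;  // assumed default of FLAGS_tcmalloc_transfer_num_objects
  if (const char* env = std::getenv("TCMALLOC_TRANSFER_NUM_OBJ")) {
    cap = std::atoi(env);
  }
  return std::min(uncapped_batch, cap);
}
```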
I have done this:
for:
I got:
for:
I got:
|
Hi. I'll give a more elaborate comment later (likely tomorrow), but here are some quick thoughts.
|
You make me think about something... I must not be calculating Mops the same way Sanjay did. All I am doing is iterations/time; looking at the graph, I now realize that he must be computing num_threads*iterations/time. So maybe not a regression... but certainly valid questions... and improvements might still be possible in your very mature project. I fail to see why there are so many accesses to the central heap... if it is only a matter of configuration, the library is incredibly opaque about suboptimal operation... I have been struggling for a few days to make sense of what I am seeing, and no one has been able to provide a satisfying explanation yet... Thanks for the good words, and I am looking forward to your more elaborate comment!
p.s.: I have just run valgrind on the t-test1.c executable... no leaks...
p.p.s.: gperftools is still the fastest single-threaded allocator: I erroneously thought that the project actively maintained by Google, with all its bells & whistles, would automatically be better... If I can reduce central heap contention, then as far as I am concerned gperftools will remain the fastest allocator currently available. |
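To spell out the two ways of computing Mops, a trivial sketch (my own illustration with hypothetical numbers, not code from the benchmark):

```cpp
#include <cstdio>

int main() {
  // Hypothetical numbers for illustration only.
  const double iterations  = 1000000;  // per-thread loop count
  const int    num_threads = 8;
  const double seconds     = 3.2;

  // What I was computing: per-thread throughput.
  double mops_per_thread = iterations / seconds / 1e6;
  // What the original graphs most likely report: aggregate throughput.
  double mops_total = num_threads * iterations / seconds / 1e6;

  std::printf("per-thread: %.2f Mops, aggregate: %.2f Mops\n",
              mops_per_thread, mops_total);
}
```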
I get what you say here... the simple benchmark may not be representative of my actual workload... yet it makes tcmalloc exhibit a behavior that I want to avoid in my real application, and I am only starting to be aware of it and to know what to look for... there must be some stats out there that I am unaware of that would tell me precisely what I want... My app has perhaps a 1GB virtual memory footprint (111.2MB resident) running on a dedicated system with 128GB of RAM... I do not care at all about accumulating memory in the thread cache. The only thing I want is the fastest allocation possible... I have multiple threads whose memory is allocated by openssl (95% of all my app's allocations)... t-test1.c is representative in the sense that it does not appear to allocate an unreasonable amount of memory, yet the library releases memory to the central heap very often, making it 4-5x slower than glibc. I am perhaps a few hours away from fully understanding how tcmalloc takes this decision and how to configure it to behave the way I want (maybe less if I get some help)... The question I have, if it is only a matter of configuration: are tcmalloc's default settings adequate for general-purpose usage? Unless you tell me that there are a lot of embedded-systems users in the tcmalloc user base, maybe the defaults could be a bit more liberal, like what glibc chose to be, so as to show better performance in t-test1.c, which I believe to be reasonably representative of average multithreaded heap usage...
update: If I get it right, it will allocate approximately 4MB by doing roughly 60,000 64-byte allocations (there is a random element to the test). It is always 4MB divided by (size*n_thr)... so the tcmalloc slowdown has to do with the length of the size class's list, and this is why tcmalloc performs better with bigger allocations than with smaller ones... |
Oh, here is a small gem that I have discovered... I am using GCC 14.1.2 and the switch -std=c++26 when I compile tcmalloc... I do not know what in C++26 is responsible, but with this setting I have seen some malloc_bench results come out 10% faster than when compiled with the default standard of -std=c++17... a 10% improvement... this is not something to disregard... |
In t-test1.c's malloc_test() function, I identify 3 sections:
I think it would be interesting to know how many times the central heap is accessed in step 2: possibly never, or very little (hopefully)... I'll try to modify my debug hooks to answer this question... Beyond that, it might be possible to improve the central heap access logic to minimize access to it... I am throwing out a few random ideas that come to mind: make the fetch ramp-up batch size customizable... It could be interesting to give users the option to initialize FreeList::max_length_ with Static::sizemap()->num_objects_to_move(cl). |
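A sketch of the knob I have in mind (illustrative only; the real FreeList and SizeMap live in gperftools' sources, this is not their actual code):

```cpp
#include <cstdio>

struct FreeListSketch {
  int max_length_ = 1;  // today: starts tiny and ramps up, so early allocations
                        // keep going back to the central heap for small refills
};

// Hypothetical option: start each class's list at the full batch size so the
// very first central-heap fetch already brings num_objects_to_move(cl) objects.
void InitWithFullBatch(FreeListSketch& list, int num_objects_to_move) {
  list.max_length_ = num_objects_to_move;
}

int main() {
  FreeListSketch list;
  InitWithFullBatch(list, 32);  // 32 = the default batch size discussed above
  std::printf("max_length_ starts at %d instead of 1\n", list.max_length_);
}
```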
I have enhanced my ThreadCache log trace to include ThreadCache::max_size_. The result obtained with
Where does this ridiculously low value of 196608 come from? https://gperftools.github.io/gperftools/tcmalloc.html mentions that TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES defaults to 32MB (or 16MB if you go by the value in the text). The mysterious bogus cache objects are stealing kStealAmount (1 << 16) from unclaimed_cache_space_, which is initialized with kDefaultOverallThreadCacheSize (32MB). 196608 happens to be (3 << 16). |
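Spelling out that arithmetic (values as stated above, my own snippet):

```cpp
#include <cstdio>

int main() {
  const long kStealAmount = 1L << 16;  // 65536, the per-cache steal amount
  // Three ThreadCache objects each stealing kStealAmount from
  // unclaimed_cache_space_ accounts exactly for the 196608 in my trace.
  std::printf("3 * %ld = %ld\n", kStealAmount, 3 * kStealAmount);  // 196608
}
```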
So another quick thing. I am not sure what is going on on your end.
Ah, btw, gperftools is faster than glibc on these workloads (because, again, we're pretty good on this fast-path thingy), so I am not sure what you're chasing after. |
You are performing the test on a much faster machine than mine. My dev machine is an old Sandy Bridge system.
Do you have env vars configured that would explain why you get a different result? I do not understand how it is possible that you get away with never accessing the central heap... I am going to create a profiler run output like you did, to show you how different my run looks... BTW, I have solved the bogus ThreadCache thing... I think temporary ThreadCache objects are created because I am calling functions that allocate memory from the ThreadCache destructor... I have also improved my trace by adding a few more fields... The bogus cache objects are reusing the same addresses:
In reality, my program must have 2 real caches: 1 for the main thread and 1 for the test thread. The 3rd one would be the temporary one... BTW, I have managed to get a run with zero releases... I achieved this by changing the kMaxDynamicFreeListLength value from 8192 to 32768... at this point, I am just playing with the various knobs to see what they do... |
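The generic pattern I suspect behind the temporary caches, reduced to a self-contained example (nothing tcmalloc-specific; it only shows that a thread_local destructor which allocates runs during thread teardown, after per-thread allocator state may already be gone):

```cpp
#include <cstdio>
#include <string>
#include <thread>

// A thread_local object whose destructor allocates: the allocation happens
// during thread teardown, which can force a fresh (temporary) per-thread
// cache to be created just to service that late allocation.
struct AllocatingAtExit {
  ~AllocatingAtExit() {
    std::string late(128, 'x');  // heap allocation during thread destruction
    std::printf("late allocation of %zu bytes\n", late.size());
  }
};

thread_local AllocatingAtExit guard;

int main() {
  std::thread worker([] { (void)&guard; });  // force construction in the thread
  worker.join();
}
```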
Also, since we are running on different HW, it does not matter that your run is much faster than mine... what you want to check is how the time for an 8-thread run compares with a 1-thread run... What I have observed is an 8x slowdown caused by central heap accesses... If you do not have the same contention, then the expected result is that the 1-thread test time is more or less equal to the 8-thread test time. |
It is not that they don't happen. It is just peanuts compared to the total amount of work done. I don't set any environment variables (except LD_PRELOAD and CPUPROFILE). Yes, it shouldn't matter whether you're doing it on an older machine or not. I guess I'll try this on the RPI 4 I have, to double check. |
Absolutely no signs of lock contention. Here is RPI 4:
Bumping the thread count proportionally scales up the amount of work. And yes, each core gets about 2x slower, but the total number of instructions scales 4x while doing 4x more work. So whatever slowdown there is comes from microarchitectural reasons (thermal budget, shared caches, things like that). And 100% CPU utilization too. Lock contention would lower it due to sleeping, or we'd see spinlock busy-looping (we do a bit) in the profiles. None of that is there. Okay, I think this can be closed as 'not a bug'. Ah, final thing. The profile you show above has a catastrophically small number of profiling ticks. That screams noise everywhere, and likely explains oddities you report, like random unrelated flags showing a big impact that is "impossible". |
Ok, my bad... I must have confused user time with elapsed time (aka wall-clock time). glibc's times look better when looking at user time because it must be sleeping on locks. Thank you for your help. |
Thanks. I am slightly surprised to see "abseil" tcmalloc lose to us here. Could be the "prefetch" thingy though, i.e. they prefetch in the allocation path, which has some evidence (although I wasn't 100% sure I bought it) that it benefits real-world app performance despite doing slightly more work within malloc. And it certainly degrades benchmark numbers a bit. |
yes... it is very surprising... their documentation is very well done... the features/benefits list is very enticing... but OTOH, I had a few question marks:
With all those friction points, I wanted to be absolutely sure that transitioning to this variant was the best possible move. This was the driver for the benchmark effort. Because of a stupid error on my part, I spent a lot of my time (and some of yours) trying to resolve an imaginary problem. OTOH, I have learned a lot from it... I can now claim to have a pretty good understanding of how tcmalloc works... that was an incredible learning activity... My opinion about "abseil" tcmalloc (I am not aware of the prefetch thing) is that they have traded off some allocation/deallocation performance to offer a per-cpu cache... This might be very beneficial when you have a system with many more threads than CPUs... This is not the case in my setup... |
Not entirely correct. Yes, for them per-cpu caches matter, or at least they think so, because of the fibers use-case (think thread-per-request; that would be a ton of threads to "waste" memory on with per-thread caches). But per-cpu caches actually enable some performance, despite some (fairly small) rseq overhead. This is because pushing and popping elements on the per-size-class lists doesn't involve pointer chasing. And it seems to matter quite a bit. It also enables a much improved central free list design (also saving tons of pointer chasing). This, plus some pretty good attention to performance, makes abseil tcmalloc pretty good performance-wise. And there are certainly a ton of workloads, non-Google workloads, that would benefit from per-cpu caches today (e.g. Mongo). I am not sure what causes the gaps in the graphs above, but they don't represent reality in practice for C++ workloads. I mention C++ workloads because one of the changes there is that they turned malloc into memalign to conform to the nonsensical minimal alignment requirements formally part of the C standard (BTW, C++ might actually do the same). So operator new is full performance; malloc is slightly slower (but I think still faster than our memalign). Ah, and another thing. If you have some appetite for chasing central free list performance, there is actually one semi-prominent benchmark that we lose today, and it involves the CFL. Here: #1448 |
I've yet to check that link you posted above about the RSEQ compatibility issues, but the numbers you posted above don't look right. I speculate (but plausibly) that your abseil tcmalloc doesn't run in per-cpu mode. I now see your statement above about "compromising" performance for per-cpu mode in a potentially new light. So let me clarify. It is arguably part of the big "weakness" of sorts of the Abseil thing in general, but of abseil tcmalloc especially: they only care about their use-case and are not interested in any compromises. Open source in this project exists to enable sharing with other Alphabet bets and to encourage external contributions (like from Intel or Arm, stuff like that). They're not interested in other contributions as long as they don't help their somewhat narrow use-case. So in this regard, per-cpu mode is the only mode they optimize for. There were numerous internal discussions while I was part of that team about even dropping the "legacy" per-thread stuff. I can tell you with certainty that I have seen abseil tcmalloc perform much better than the stats above, and I know both Chris and Dmitry won't regress performance. So as long as you're not running per-cpu mode in practice, you're not seeing its true might. And if (again, saying this without really checking) the lack of per-cpu mode happens because of some glibc/tcmalloc RSEQ incompatibility, yes, that is a problem. But don't expect them to bother much, sadly.
Now for the pointer chasing (a toy illustration follows at the end of this comment). The classic case of pointer chasing is traversing linked lists. When you have code that continuously pops elements off a free list, that is classic pointer chasing. Yes, it can be argued that as part of malloc/operator new you're reading memory that you're about to return and which your caller is likely to use anyway, but it adds work. It adds even more work and slowness to the "pop a bunch off the per-thread list" operation that we do as part of central free list operations. And when pushing elements onto the free list as part of free, we also have an "extra" memory store, touching and potentially bringing into the caches memory that is being freed.
For the difference in workloads: I don't have a super-great understanding, but real workloads have: a) a non-uniform distribution of object sizes, where sizes in the ballpark of 32 bytes or so are most common and there is a typical "peak"-ful distribution; b) a non-uniform distribution of object lifetimes, heavily skewed towards very short lifetimes; c) dependency between object lifetimes: stuff allocated together is likely to be freed together. Also, threads often run in phases, where for some time a thread may allocate a bunch of stuff (say, to populate a hash table or something) and then do a burst of frees. None of that is simulated by the ptmalloc bench (or any other bench I've seen yet). And, yes, C++ workloads at Google (but I suspect not much beyond) are by far dominated by sized delete, which may affect some design choices.
The alignment aspect is a (very tiny) issue because abseil tcmalloc's malloc() does more work than operator new. |
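To illustrate the pointer-chasing point with a self-contained toy (my own code, not either allocator's): popping from an intrusive free list must read the next pointer out of the object being handed back, while an array-backed slot (the per-cpu style) only touches the metadata array.

```cpp
#include <cstddef>

// Intrusive free list: the "next" pointer lives inside the free object,
// so every pop dereferences memory we are about to hand out (pointer chase).
void* PopLinked(void** head) {
  void* obj = *head;
  if (obj != nullptr) {
    *head = *static_cast<void**>(obj);  // load from the object itself
  }
  return obj;
}

// Array-backed list: the pop only touches the metadata array,
// not the object being returned.
void* PopArray(void** slots, size_t* count) {
  return *count == 0 ? nullptr : slots[--*count];
}
```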
It is a super interesting discussion... I am aware that per-cpu cache mode can possibly be disabled because of glibc... I can assure you that the bench-malloc numbers are with per-cpu caches enabled. TBH, I took many measurements when benchmarking t-test1 only to realize that I had forgotten, and you know what? the difference was barely noticeable... It looked very much like the graphs accompanying my experimental commit #1554. To me, it seems like everyone who has been involved with abseil tcmalloc has the strong belief that it must be better than the legacy project. I have not seen any evidence of this so far... Of course, I can imagine scenarios where abseil tcmalloc shines and leaves every other allocator far behind, e.g. having hundreds of threads on a 32-CPU system... I am sure that the per-cpu cache possibly enables allocations that optimize CPU cache usage... and possibly, despite a longer allocation time, it delivers overall better app performance... but software design is always a matter of tradeoffs... if abseil tcmalloc is doing more complex stuff, it will need to give up some raw speed compared to the simpler design... IMHO, it is inevitable... Concerning the idea that they are unwilling to compromise outside of their very narrow use-case: I understood that point by myself... The Bazel-only support is another sign of this... This is coming straight from Titus Winters' legacy of building the whole world at each build (the living-at-HEAD philosophy... which I must confess has clear benefits). I definitely invite you to check the RSEQ ticket... RSEQ creator Mathieu Desnoyers did chime in and was not very happy about some aspects, to the point where he committed something to the kernel to express his disapproval... I am sure that it will end up with a happy compromise or a win-win solution, but what is happening around tcmalloc is very fascinating... It feels like a soap opera... It is very entertaining to watch very talented people trying to push the limits to create better software... |
Note that I have not pushed anything upstream that would prevent users from misusing the rseq ABI, but I indeed have a patch ready. I want to find alternative solutions in collaboration with the tcmalloc people first before starting to enforce proper ABI use from the kernel. Note that I've just posted a patch series improving cache-locality of RSEQ mm_cid. You may want to try running your benchmarks with it: https://lore.kernel.org/lkml/20240903190650.53644-1-mathieu.desnoyers@efficios.com/T/#t |
Mathieu, if it can help to see how the patch impacts the results measured in my benchmark, I can give it a shot... Yes, I know that you did not commit anything that would impede anyone from misusing RSEQ... If I understood it correctly, it was logging warnings when configured for debugging... Please explain how mm_cid can impact tcmalloc performance? I am not sure I have seen anything named mm_cid in what is shared between the kernel and RSEQ users... and keep in mind too that in all my benchmark setups there are always far fewer threads than CPUs (my testing setup is not a NUMA machine), so I am not expecting a lot of task migrations or excessive context switches that could hurt performance through suboptimal cache usage... |
I encourage you to double check. At least on my system (up-to-date kernel and Debian Sid), running the latest abseil tcmalloc's tcmalloc_benchmark thingy, I see that cpu_cache_active is false. I don't really want to bother checking why; it used to work. Broadly, I don't think we have much disagreement. We broadly agree that a number of choices made as part of abseil make it very hard, if at all possible, to use their malloc. It is just that when you say "per-cpu mode has to do more work", you activate my "someone is wrong on the internet" mode :) I encourage you to check the disassembly of operator new and operator delete in both gperftools and tcmalloc, and you'll see your statement isn't true (yes, despite some tiny extra overhead of storing into the "active rseq" field using upstream rseq, compared to what it could have been). Also don't underestimate the performance hit that per-thread mode takes when a number of threads want to grow their per-thread caches and fight each other. It really gets nasty very fast. And per-cpu mode fixes it nicely despite all the troubles. And despite that, I am not really 100% on board yet with this approach, even more so for their "virtual cpu id" stuff. It is just your "per-cpu mode is slow" statement that provokes me some
Well, actually, the approach I intend to take is to kill any process that corrupts RSEQ fields documented as read-only by user-space. But as I said, I want to discuss a way forward with the tcmalloc developers before introducing this kind of validation.
Hrm, indeed in the tcmalloc master branch I see that struct kernel_rseq has padding where the node_id and mm_cid fields are located, and then has a union with vcpu stuff that's really not something that is present in the upstream ABI (and will conflict with future upstream kernel changes). So I guess you need a Google-internal version of the kernel to have those custom fields. I'd recommend wiring up tcmalloc with the upstream mm_cid field instead to replace all this custom vcpu stuff. See include/uapi/linux/rseq.h for details.
When using mm_cid without my recent cache-locality improvements, even with few threads running intermittently across a few cores, you would end up with cache line bouncing due to allocation of the same concurrency ID across all used cores. My recent patch fixes that. This cache line bouncing is not present when the workload runs threads across all used cores at least every 100ms, which may explain why this slowdown does not show up in typical stress tests but might show up in real-life workloads. |
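To make the suggestion concrete, a rough sketch of indexing per-CPU allocator state by mm_cid instead of cpu_id (purely illustrative; the field names follow the upstream struct rseq in include/uapi/linux/rseq.h, and nothing here is actual tcmalloc code):

```cpp
#include <cstdint>

// mm_cid is a compact, per-process concurrency ID assigned by the kernel
// (bounded by the number of threads/CPUs actually in use), so an array
// indexed by it can be much smaller than one indexed by cpu_id.
struct RseqAreaView {   // subset of the upstream struct rseq fields
  uint32_t cpu_id_start;
  uint32_t cpu_id;
  uint64_t rseq_cs;
  uint32_t flags;
  uint32_t node_id;
  uint32_t mm_cid;
};

struct CpuCache { /* per-slot allocator state */ };

CpuCache& PickCache(const RseqAreaView& area, CpuCache* caches_by_cid) {
  return caches_by_cid[area.mm_cid];  // instead of caches_by_cpu[area.cpu_id]
}
```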
I suppose the reason the per-cpu cache used to be enabled and is not anymore has something to do with a glibc upgrade. RSEQ registration by glibc was introduced in v2.35... if tcmalloc fails to register rseq because glibc did it first, the per-cpu cache is disabled. The simple fix for that is to prepend the benchmark program with
I even added a print statement in bench_malloc to confirm that I was using the per-cpu cache... I am 100% confident about this... I even went as far as comparing the results with/without the per-cpu cache... the difference is noticeable but marginal... not a big deal at all (in the benchmark context). Concerning your "someone is wrong on the internet" mode: ok, I might be wrong... It is my best educated guess to explain the observation (if I knew exactly why, I would offer a patch), but I still think it cannot be that far from reality. A few days ago, you mentioned a "prefetch" thing as a possible explanation for the result. IMHO, that "prefetch" thing falls into the extra-work category... Another hypothesis I have offered is that GCC (the compiler I am using) is maybe not able to inline as well as clang (I assume that is the one used internally, since Google actively works on clang)... It turns out that I feel my bad performance report is being received with disbelief, possibly from being in "someone is wrong on the internet" mode too... but the fact that abseil tcmalloc is slower than legacy tcmalloc remains... If this is unexpected, some investigation should be performed to find out why... but this is not something I am going to do... I am going to leave that task to the maintainers, if they are interested in discovering and potentially fixing a problem in that area... |
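A quick way to see whether glibc has already registered rseq for the process (my own snippet; it assumes glibc 2.35+, which exports these symbols via <sys/rseq.h>):

```cpp
#include <cstdio>
#include <sys/rseq.h>  // glibc >= 2.35: __rseq_offset, __rseq_size, __rseq_flags

int main() {
  // A non-zero size means glibc registered an rseq area for the thread;
  // a library that then tries to register its own, differently laid out
  // area will fail and may fall back to per-thread caches.
  std::printf("glibc rseq size: %u, offset from thread pointer: %td\n",
              __rseq_size, __rseq_offset);
  return 0;
}
```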
I am going to summarize my takeaway from this reply: whether or not an RSEQ-using app reads the mm_cid field, prior to the fix there was a cache line bouncing issue at the kernel level, and it would impact every RSEQ-enabled app. Is this correct? |
You are correct, the vcpu stuff must be from an internal custom kernel. The tcmalloc GitHub version disables the usage of that field. |
[...]
No, it only affects cache locality of userspace data structures indexed using the rseq mm_cid field. |
Okay, so to close this off. Indeed, turning off glibc rseq transforms abseil tcmalloc performance. And its asm code looks reasonable and tight (I don't recognize that "btr" stuff, which seems new, and the frame pointer stuff looks new too). It is (as expected) tighter than ours. And yet it does lose to gperftools a bit. On the other hand, performance seems to depend a lot on microarchitectural details and compilers (gcc beating clang, again, as seems to be the most common occurrence). And the difference isn't huge. Plus, as I noted, they do prefetches, which are strictly overhead on micro-benchmarks but somewhat helpful for real-world performance. Anyways, thanks for bringing this up. And the summary is still: we lack trustworthy memory allocator benchmarks that show real-world strengths and weaknesses. |
alk, it has been a pleasure to exchange with you. Yes, I think it is time to close this off and move on... but hopefully someone will see this exchange and pick up the challenge of finding a way to improve abseil tcmalloc performance... As a departing comment, I just wanted to let you know that this morning I realized that my LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so on my main app fails to accomplish its purpose... the loader encounters an issue and falls back on the glibc allocator... I noticed this bug by looking at the process's virtual address mappings in /proc and seeing that the tcmalloc lib was missing... The performance benefit of one allocator over another is so subtle that this bug went unnoticed for weeks... Update: I found the cause... my binary is granted Linux capabilities with setcap, and doing that disables LD_PRELOAD for security reasons. I have worked around this last issue by linking with tcmalloc this way: |
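For anyone wanting to double-check that the allocator really got loaded (this is the check that exposed the problem for me), here is a minimal stand-alone version of it; my own snippet, not part of the project:

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Verify at runtime that a tcmalloc shared object is actually mapped into
// the process; LD_PRELOAD is silently ignored for setcap/setuid binaries.
int main() {
  std::ifstream maps("/proc/self/maps");
  std::string line;
  while (std::getline(maps, line)) {
    if (line.find("tcmalloc") != std::string::npos) {
      std::cout << "tcmalloc mapped: " << line << "\n";
      return 0;
    }
  }
  std::cerr << "tcmalloc NOT mapped; running on the libc allocator\n";
  return 1;
}
```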
I was evaluating switching to the new Google-maintained tcmalloc project. As part of the evaluation, I reran the benchmark that Sanjay originally ran at:
https://gperftools.github.io/gperftools/tcmalloc.html
I do not know why, but it seems like the 2024 gperftools tcmalloc is much less performant than what was there back then...
This is most flagrant when you compare the 8-thread, 64-byte allocation Mops.
google/tcmalloc#260 (comment)
It went from 8 Mops to 2 Mops...
I do not know what it is, but it seems like something seriously broke in between...