we should pthread_atfork around all of our lock(s) (was: LLVM thread(s) hang after fork from the parent process) #1425

kvignesh1420 · 2023-08-30T21:49:59Z

Setup: I am using gperftools 2.11 for heap profiling of tensorflow 2.11 training jobs on a RHEL 7.9 machine.

Observation: When tensorflow ops are being compiled, the main process creates an llvm thread group and uses it for parallel compilation of the ops. In my setup, I observed that the only child process created via fork hangs when tcmalloc+heap profiling is enabled.

The backtrace for the parent process is shown below:

#0  0x00007fa2ca2bba35 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
#1  0x00007fa2c9582aec in std::condition_variable::wait(std::unique_lock<std::mutex>&) () from /usr/lib64/libstdc++.so.6
#2  0x00007fa2a7f2339b in llvm::ThreadPool::wait(llvm::ThreadPoolTaskGroup&) ()
   from /opt/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#3  0x00007fa2a73a1e6c in mlir::detail::OpToOpPassAdaptor::runOnOperationAsyncImpl(bool) ()

The backtrace for the child process is shown below:

#0  0x00007f5f54e7ae29 in syscall () from /usr/lib64/libc.so.6
#1  0x00007f5f562b8cb0 in base::internal::SpinLockDelay (w=w@entry=0x7f5f5667adb0 <heap_lock>, value=2, loop=loop@entry=31771) at ./src/base/spinlock_linux-inl.h:86
#2  0x00007f5f562b8b67 in SpinLock::SlowLock (this=this@entry=0x7f5f5667adb0 <heap_lock>) at src/base/spinlock.cc:134
#3  0x00007f5f562b3f4a in SpinLock::Lock (this=0x7f5f5667adb0 <heap_lock>) at src/base/spinlock.h:71
#4  SpinLockHolder::SpinLockHolder (l=0x7f5f5667adb0 <heap_lock>, this=<synthetic pointer>) at src/base/spinlock.h:123
#5  RecordAlloc (skip_count=0, bytes=16, ptr=0x4a06b960) at src/heap-profiler.cc:319
#6  NewHook (ptr=0x4a06b960, size=16) at src/heap-profiler.cc:341
#7  0x00007f5f562aec02 in MallocHook::InvokeNewHookSlow (p=p@entry=0x4a06b960, s=s@entry=16) at src/malloc_hook.cc:314
#8  0x00007f5f562bafa4 in MallocHook::InvokeNewHook (s=16, p=0x4a06b960) at src/malloc_hook-inl.h:133
#9  tcmalloc::do_allocate_full<tcmalloc::cpp_throw_oom> (size=16) at src/tcmalloc.cc:1808
#10 tcmalloc::allocate_full_cpp_throw_oom (size=16) at src/tcmalloc.cc:1818
#11 0x00007f5ef29028b1 in arrow::util::(anonymous namespace)::AfterForkState::AfterFork() ()
   from /opt/site-packages/pyarrow/libarrow.so.900
#12 0x00007f5f54e47c4e in fork () from /usr/lib64/libc.so.6
#13 0x00007f5f54e70830 in __spawni () from /usr/lib64/libc.so.6
#14 0x00007f5f54e707b0 in posix_spawnp@@GLIBC_2.15 () from /usr/lib64/libc.so.6
#15 0x00007f5f3389bf6d in tsl::SubProcess::Start() ()
   from /opt/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so
#16 0x00007f5f33546975 in stream_executor::CompileGpuAsm(int, int, char const*, stream_executor::GpuAsmOpts) ()
   from /opt/site-packages/tensorflow/python/_pywrap_tensorflow_internal.so

alk · 2023-08-31T00:33:33Z

So this is an issue with the interaction between fork and pthread_atfork. I will spare you all the details and complexities of the atfork stuff, but here is what matters specifically in your case:

a) TF is calling the (only) right API to spawn a child process: posix_spawn. But sadly, for some unknown reason, glibc until 2.24 had a "broken" implementation that called fork (instead of vfork, or rather clone with CLONE_VFORK, which it should have used and which it uses now). This lack of "properness" of posix_spawn in your older version of glibc is what triggers the whole mess.

b) pthread_atfork is itself a super tricky thing in many cases. Google's internal policy, for example, is to never use it. There is an internal paper with the details, but somehow I am not able to find a public version. Some of the paper's arguments are imho not right, but the main point is sound. That is: the lock and state nestings of different libraries/modules won't always match the nesting of atfork handlers established at runtime, which causes deadlocks. For some reason that arrow thingy chose to employ atfork stuff. Perhaps it is occasionally used in forked settings rather than threaded settings (I guess the fact that python is only able to offer parallelism via fork is what adds demand for this questionable feature). Then their atfork "after" handler calls into tcmalloc. You're likely using libtcmalloc LD_PRELOAD-ed (not much else makes sense). And we do some atfork business, but only for our main locks (yes, despite some arguments that maybe we shouldn't). So it would have worked, but we don't do the usual atfork dance for the heap profiler's heap_lock thingy. And this is where the child's after handler finds heap_lock arbitrarily "broken" and hangs.
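For readers who have not seen the "usual atfork dance", here is a minimal sketch of it. This is not gperftools code: `g_demo_lock` is a hypothetical stand-in for a library-internal lock such as heap_lock, and all function names are made up for illustration. The idea is that the prepare handler takes the lock before fork, so no other thread can be holding it at the moment the address space is copied, and both the parent and the child handlers release it afterwards.

```cpp
#include <pthread.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <atomic>
#include <cassert>
#include <cerrno>

// Hypothetical stand-in for a library-internal spinlock (think heap_lock).
static std::atomic<bool> g_demo_lock{false};

static void demo_lock() {
  // Spin until we flip the flag from false to true.
  while (g_demo_lock.exchange(true, std::memory_order_acquire)) {
    sched_yield();
  }
}

static void demo_unlock() {
  g_demo_lock.store(false, std::memory_order_release);
}

// prepare handler: runs in the forking thread before fork().
static void before_fork() { demo_lock(); }

// parent and child handlers: run after fork() in the respective process.
static void after_fork() { demo_unlock(); }

// Registers the handlers. Returns 0 on success, like pthread_atfork itself.
int install_fork_handlers() {
  return pthread_atfork(before_fork, after_fork, after_fork);
}

// Forks, takes the lock once in the child, and returns the child's exit
// status (or -1 on error). Without the handlers above, the child could
// spin forever if another thread held the lock at fork time.
int fork_then_lock_in_child() {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    demo_lock();
    demo_unlock();
    _exit(0);
  }
  int status = 0;
  while (waitpid(pid, &status, 0) < 0) {
    if (errno != EINTR) return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `install_fork_handlers()` once at startup and then `fork_then_lock_in_child()` should return 0. Note that this only covers a single lock. With several libraries each registering their own handlers, the order in which handlers run may not match the order in which the locks nest, which is the deadlock risk described above.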

So we could amend our atfork handling to also do the locking around the heap profiler lock. Alternatively, you can avoid all the trouble by using a libc that has the right posix_spawn implementation. I have checked that RHEL 8 does. If upgrading to RHEL 8 or later isn't an option, then consider "stealing" the right posix_spawn implementation from either modern glibc or musl.
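To check which side of that line a given machine is on, something like the following sketch may help. It assumes a glibc system (`gnu_get_libc_version()` is a glibc extension) and takes the 2.24 threshold from the comment above. The helper names `glibc_version_at_least_2_24`, `running_glibc_has_fixed_spawn` and `spawn_and_wait` are hypothetical. The caller side of posix_spawnp looks the same on old and new glibc. Only what libc does internally differs.

```cpp
#include <gnu/libc-version.h>  // glibc-specific: gnu_get_libc_version()
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <cassert>
#include <cerrno>
#include <cstdio>

extern char **environ;

// Parses a "major.minor" version string and reports whether it is >= 2.24,
// the release said above to have the non-fork posix_spawn implementation.
bool glibc_version_at_least_2_24(const char *version) {
  int major = 0;
  int minor = 0;
  if (std::sscanf(version, "%d.%d", &major, &minor) != 2) return false;
  return major > 2 || (major == 2 && minor >= 24);
}

// Applies the check to the glibc this process is actually running against.
bool running_glibc_has_fixed_spawn() {
  return glibc_version_at_least_2_24(gnu_get_libc_version());
}

// Spawns argv[0] via posix_spawnp (searched on PATH) and waits for it.
// Returns the child's exit status, or -1 on failure.
int spawn_and_wait(char *const argv[]) {
  pid_t pid = 0;
  // posix_spawnp returns an error number directly instead of setting errno.
  int rc = posix_spawnp(&pid, argv[0], nullptr, nullptr, argv, environ);
  if (rc != 0) return -1;
  int status = 0;
  while (waitpid(pid, &status, 0) < 0) {
    if (errno != EINTR) return -1;
  }
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For reference, RHEL 7 ships glibc 2.17 and RHEL 8 ships glibc 2.28, so `running_glibc_has_fixed_spawn()` would be expected to return false on the former and true on the latter.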

kvignesh1420 · 2023-09-01T04:55:43Z

Thanks for the insight @alk

kvignesh1420 closed this as completed Sep 1, 2023

alk changed the title ~~LLVM thread(s) hang after fork from the parent process~~ we should phread_atfork around heap profiler lock(s) (was: LLVM thread(s) hang after fork from the parent process) Sep 1, 2023

alk reopened this Sep 1, 2023

alk added the enhancement label Sep 1, 2023

alk changed the title ~~we should phread_atfork around heap profiler lock(s) (was: LLVM thread(s) hang after fork from the parent process)~~ we should phread_atfork around all of our lock(s) (was: LLVM thread(s) hang after fork from the parent process) Oct 15, 2024

alk mentioned this issue Oct 15, 2024

tcmalloc is incompatible with fork(2) #1570

Closed

alk added a commit that referenced this issue Oct 15, 2024

implement fork torture testing

dd043fe

I.e. so that we can exercise malloc in forked child. Referenced github issue #1570 and github issue #1425

alk added a commit that referenced this issue Oct 15, 2024

have atfork handler also handle SlowTLS and SysAllocator locks

8560276

Referenced github issue #1570 and github issue #1425. This enables "minimal" allocator to pass fork torture testing.
