we should pthread_atfork around all of our lock(s) (was: LLVM thread(s) hang after fork from the parent process) #1425
So this is an issue with how fork and pthread_atfork interact. I will spare you the details and complexities of the atfork machinery, but here is what matters specifically in your case:
a) TF is calling the (only) right API to spawn a child process: posix_spawn. Sadly, for some unknown reason, glibc before 2.24 had a "broken" implementation that forked instead of using vfork (or rather clone with CLONE_VFORK, which it should have used and which it now uses). This shortcoming of posix_spawn in your older version of glibc is what triggers the whole mess.
b) pthread_atfork is itself a super tricky thing in many cases. Google's internal policy, for example, is to never use it. There is an internal paper with the details, but I have not been able to find a public version. Some of the paper's arguments are IMHO not right, but the main point is sound: the lock and state nesting of different libraries/modules won't always match the nesting of the atfork handlers established at runtime, and that causes deadlocks.
For some reason that arrow thingy chose to employ atfork handlers. Perhaps it is occasionally used in forked rather than threaded settings (I guess the demand for this questionable feature comes from Python only being able to offer parallelism via fork). Their atfork "after" handler then calls into tcmalloc. You're likely using libtcmalloc via LD_PRELOAD (not much else makes sense). We do some atfork handling ourselves, but only for our main locks (yes, despite some arguments that maybe we shouldn't). So it would have worked, except that we don't do the usual atfork dance for the heap profiler's heap_lock. And this is where the child's after handler finds heap_lock in an arbitrary "broken" state and hangs.
So we could amend our atfork handling to also lock around the heap profiler lock. Alternatively, you can avoid all the trouble by using a libc that has a proper posix_spawn implementation. I have checked that RHEL 8 does.
If upgrading to RHEL 8 or later isn't an option, then consider "stealing" a proper posix_spawn implementation from either a modern glibc or from musl.
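For readers unfamiliar with the "usual atfork dance" mentioned above, here is a minimal sketch of the general idiom in C. It is not gperftools code: `demo_lock` is a hypothetical stand-in for an internal lock such as heap_lock, and every `demo_*` name is invented for illustration. The prepare handler takes the lock before fork, and both the parent and child handlers release it afterwards, so the child never inherits the lock in a state held by a thread that does not exist there.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for a library-internal lock (e.g. a profiler lock). */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_once_t demo_once = PTHREAD_ONCE_INIT;

/* Runs in the forking thread just before fork(): take the lock so that
   no other thread can be holding it at the moment of the fork. */
static void demo_prepare(void) { pthread_mutex_lock(&demo_lock); }
/* Runs in the parent right after fork(): release the lock again. */
static void demo_parent(void) { pthread_mutex_unlock(&demo_lock); }
/* Runs in the child right after fork(): release the inherited lock. */
static void demo_child(void) { pthread_mutex_unlock(&demo_lock); }

/* Register the handlers exactly once; registering twice would make the
   prepare handler lock a non-recursive mutex twice and deadlock. */
static void demo_register(void) {
    pthread_atfork(demo_prepare, demo_parent, demo_child);
}

/* Background thread that keeps taking and releasing the lock. */
static void *demo_worker(void *arg) {
    atomic_int *stop = arg;
    while (!atomic_load(stop)) {
        pthread_mutex_lock(&demo_lock);
        pthread_mutex_unlock(&demo_lock);
    }
    return NULL;
}

/* Forks while the worker is hammering the lock. Returns the child's exit
   code (0 if the child could take the lock), or -1 on error. */
int demo_fork_with_handlers(void) {
    static atomic_int stop;
    pthread_t worker;
    int status = -1;

    atomic_store(&stop, 0);
    if (pthread_once(&demo_once, demo_register) != 0) return -1;
    if (pthread_create(&worker, NULL, demo_worker, &stop) != 0) return -1;

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: only the forking thread exists here. The child handler
           already released the lock, so this does not hang. */
        pthread_mutex_lock(&demo_lock);
        pthread_mutex_unlock(&demo_lock);
        _exit(0);
    }

    atomic_store(&stop, 1);
    pthread_join(worker, NULL);
    if (pid < 0) return -1;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `demo_fork_with_handlers()` from `main` should return 0. The sketch covers a single lock only; with several libraries each registering handlers, the ordering problem described in point b) still applies.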
Thanks for the insight @alk
Setup: I am using gperftools 2.11 for heap profiling of TensorFlow 2.11 training jobs on a RHEL 7.9 machine.
Observation: When TensorFlow ops are being compiled, the main process creates an LLVM thread group and uses it for parallel compilation of the ops. In my setup, I observed that the only child process created via fork hangs when tcmalloc with heap profiling is enabled.
The back trace for the parent process is shown below
The back trace for the child process is shown below
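The kind of hang reported here can be illustrated in miniature. The following is a hedged sketch, not the actual TensorFlow or tcmalloc code path: all names (`held_lock`, `holder`, `child_sees_stuck_lock`) are hypothetical. A helper thread holds a mutex at the moment the main thread forks; the child inherits the mutex in its locked state, but the thread that would unlock it does not exist in the child. The child uses `pthread_mutex_trylock` so that the demo reports the stuck lock instead of hanging.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical lock that some other thread happens to hold during fork(). */
static pthread_mutex_t held_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t lock_taken;    /* posted once the helper owns held_lock */
static sem_t release_lock;  /* posted when the helper may unlock */

/* Helper thread: take the lock, announce it, and hold it until told. */
static void *holder(void *arg) {
    (void)arg;
    pthread_mutex_lock(&held_lock);
    sem_post(&lock_taken);
    sem_wait(&release_lock);
    pthread_mutex_unlock(&held_lock);
    return NULL;
}

/* Returns 1 if the forked child found the lock busy, 0 if the child could
   take it, and -1 on error. */
int child_sees_stuck_lock(void) {
    pthread_t t;
    int status = -1;

    if (sem_init(&lock_taken, 0, 0) != 0) return -1;
    if (sem_init(&release_lock, 0, 0) != 0) return -1;
    if (pthread_create(&t, NULL, holder, NULL) != 0) return -1;

    sem_wait(&lock_taken);  /* the helper now owns held_lock */

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: the helper thread was not copied, so nobody will ever
           unlock held_lock here. A plain pthread_mutex_lock would hang. */
        _exit(pthread_mutex_trylock(&held_lock) == EBUSY ? 1 : 0);
    }

    /* Parent: let the helper finish and clean up. */
    sem_post(&release_lock);
    pthread_join(t, NULL);
    sem_destroy(&lock_taken);
    sem_destroy(&release_lock);

    if (pid < 0) return -1;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Replacing the `trylock` with a blocking `pthread_mutex_lock` would leave the child stuck forever, which is the same general shape as a forked child waiting on a lock that no thread in it can release.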