memfs sysallocator can't share fd across forking (was: SIGBUS error in child after parent performs allocation on Hugepages) #1538
Comments
@alk have you seen this before? |
SIGBUS is how hugetlbfs reacts to failures to allocate a huge page. So just either don't fork or reserve more memory for huge pages. |
Thanks for the quick response, @alk. I see your point, but fork is necessary for my application. What exactly do you mean by reserving more memory? I already tried using memfs_malloc_limit_mb and it fails with SIGBUS again. Also, I believe I receive SIGBUS when the child process accesses memory that was already allocated. Do you think it is because of the posix_memalign call? Finally, if you mean that HugePages_Free in /proc/meminfo is 0 when we get here, then I don't think I follow that because -
|
Here is what happens. You've set up your malloc to use hugepages. So that memory was allocated (and you say you have 0 free hugepages left). Then you fork, which means the entire address space is logically duplicated. Then, as you touch the memory, hugetlbfs needs to allocate new pages (because of the duplication). And it runs out of pages to give. So it gives you SIGBUS. |
@alk I checked (through /proc/meminfo) and I have enough hugepages available in the system. Could you please try reproducing with the steps I mentioned? You should then be able to see my point. |
No I will not. You pointed out above you have no free hugetlbfs pages left reserved. Reserve a lot more and try yourself. You need 2x if you fork |
Perhaps I was not clear earlier, but I have around 28GB of free hugepages in my system to begin with, and this value does not go down to 0. I am not sure where you are getting the idea that I have no free hugepages left. @alk, can you please check once more? This is not a case of no free hugepages being available. |
It just works on my box. Look, I am not here to do other people's homework. You can surely do a lot more debugging on your end. And from what I can see, you're fairly capable. Also please do note that hugetlbfs has somewhat elaborate logic around reserving hugetlbfs pages which is distinct from allocating them. If/when you investigate also consider testing or reading kernel source for MAP_NORESERVE behavior on hugetlbfs. From what I recall (but not 100% sure that I remember correctly) it significantly changes how reservation logic works. Good luck. |
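Since the exchange above turns on how many huge pages are free versus reserved, here is a minimal sketch of reading those counters programmatically. The helper name `ReadMeminfoField` is hypothetical (it is not part of gperftools); `HugePages_Free` and `HugePages_Rsvd` are the field names Linux reports in /proc/meminfo.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the numeric value of `key` from
// /proc/meminfo-style text ("Key:   value [unit]"), or -1 if the key is
// missing or has no numeric value.
long ReadMeminfoField(std::istream& in, const std::string& key) {
  const std::string prefix = key + ":";
  std::string line;
  while (std::getline(in, line)) {
    // Match only lines that start with "<key>:".
    if (line.compare(0, prefix.size(), prefix) == 0) {
      std::istringstream rest(line.substr(prefix.size()));
      long value = -1;
      if (!(rest >> value)) return -1;
      return value;
    }
  }
  return -1;
}
```

To use it on a live system, pass a `std::ifstream` opened on /proc/meminfo. Comparing `HugePages_Free` with `HugePages_Rsvd` before and after the fork shows whether pages are actually being exhausted or merely reserved.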
oh, that was too quick. One sec. |
Ah, okay. Here is what happens. After fork, both child and parent have the same FD for the hugetlbfs file. As part of allocating memory, we do ftruncate, with the intention of extending the file. But after the child has allocated its pages, the file has grown, and then the parent actually shrinks it when it allocates its own memory. So then the child gets SIGBUS, as per normal behavior. This code has ~never had actual working support for MAP_PRIVATE and "sharing" the hugetlbfs heap file descriptor. From what I can see, we actually don't have to do ftruncate, as mmap-ing "past" the end of the file usually works. And it does seem to work on my fairly up-to-date Linux kernel for hugetlbfs (maybe it didn't way back when this code was written).

Now I have to comment on the homework remark I made above. In a perfect world, I can see you following up on my SIGBUS comment above and reducing your test program until you get to something. Maybe not a final conclusion, but a much tighter test case. Please don't take this comment as some sort of offense, but rather as advice on how to become a better engineer. Good luck.

P.S. here is my test program that I ended up building to narrow things down:
|
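The ftruncate interaction described above can be replayed without fork or hugetlbfs. What follows is not the test program from the comment above; it is a simplified sketch under stated assumptions: an ordinary temporary file stands in for the hugetlbfs file, and `FakeMemfsState`, `AllocateChunk`, `OpenScratchFile` and `ParentShrinksChildFile` are hypothetical names. The only behaviour taken from the thread is that each process extends the file with ftruncate based on its own private offset.

```cpp
#include <cassert>
#include <cstdlib>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical stand-in for the allocator's per-process bookkeeping.
struct FakeMemfsState {
  int fd;        // backing file descriptor, shared across fork
  off_t offset;  // next free offset, private to each process after fork
};

// "Extend" the file the way the thread describes: ftruncate(offset + len).
// Returns the resulting file size, or -1 on error.
off_t AllocateChunk(FakeMemfsState& s, off_t len) {
  if (ftruncate(s.fd, s.offset + len) != 0) return -1;
  s.offset += len;
  struct stat st;
  if (fstat(s.fd, &st) != 0) return -1;
  return st.st_size;
}

// Opens an unlinked temporary file (an ordinary file, not hugetlbfs).
int OpenScratchFile() {
  char path[] = "/tmp/memfs_sketch_XXXXXX";
  int fd = mkstemp(path);
  if (fd >= 0) unlink(path);
  return fd;
}

// Replays the sequence with two copies of the state instead of a real fork.
bool ParentShrinksChildFile() {
  const off_t kChunk = 4096;
  int fd = OpenScratchFile();
  if (fd < 0) return false;

  FakeMemfsState parent{fd, 0};
  AllocateChunk(parent, kChunk);   // before "fork": file is 1 chunk long

  FakeMemfsState child = parent;   // "fork": same fd, copied offset
  AllocateChunk(child, kChunk);
  off_t after_child = AllocateChunk(child, kChunk);    // file: 3 chunks

  // The parent still believes the file ends at 1 chunk, so its "extension"
  // to 2 chunks cuts off the child's last chunk.
  off_t after_parent = AllocateChunk(parent, kChunk);  // file: 2 chunks

  close(fd);
  return after_child == 3 * kChunk && after_parent == 2 * kChunk;
}
```

In the real scenario the child would have the truncated range mapped, and touching a mapped page that now lies beyond the end of the file is what raises SIGBUS.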
Thanks @alk for the quick response. I believe I was too blinded by the mmap syscall and missed looking into ftruncate. Also, thanks for the useful tips, and yeah, it's all cool😃😃😃. I am always on the lookout for opportunities to improve myself and grow. Circling back to the problem: from the comment mentioned here (LINK), it looks like we might need to perform ftruncate for tmpfs. I have yet to test whether we still need it on newer kernel versions.
Please do let me know your views on this. Still thinking if there could be better ways to fix this. |
We're trying to keep rhel 6 / rhel 7 workable, kernel version-wise. |
@alk thought of one more approach other than the ones I mentioned above.
Also, extending my 2nd approach - Let me know what you think so that we can work towards a patch for this. Personally speaking, the 4th one seems like a quick fix for this problem with the least blast radius. |
Looks like there are only 2 decent alternatives:
|
Interesting thread. Taking a step back, though: who uses tcmalloc's memfs code path with tmpfs? I thought the memfs code path was written specifically to work with a huge page setup. |
No, some Google workloads do use (sibling of) memfs allocator over tmpfs, for complicated reasons. And in general, I don't see memfs as hugetlbfs and Linux-only. It can be useful for all kinds of memories exposed as a filesystem. As for your comment that your workloads run with MAP_PRIVATE, I am curious for more details. As far as I can see, it is only useful across forks, but those forks are broken as noted here. |
Regarding the use case: for historic reasons, we have a parent process which does not do much itself but forks the child, which does all the allocations, so dropping MAP_PRIVATE would cause a lot of legacy headaches. Given the tmpfs use cases that you cited, it seems atfork is the only agreeable option. |
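One possible shape of the atfork idea, sketched under assumptions: `pthread_atfork` is the real POSIX call, but everything in the `sketch` namespace is hypothetical and is not the gperftools patch. The child handler gives the child its own backing file and resets its offset, so the parent's later ftruncate calls can no longer affect the file the child allocates from.

```cpp
#include <cassert>
#include <cstdlib>
#include <pthread.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

namespace sketch {

int g_fd = -1;       // hypothetical: the allocator's backing file
off_t g_offset = 0;  // hypothetical: next free offset in that file

// Opens an unlinked temporary file (an ordinary file, not hugetlbfs).
int OpenScratchFile() {
  char path[] = "/tmp/memfs_atfork_XXXXXX";
  int fd = mkstemp(path);
  if (fd >= 0) unlink(path);
  return fd;
}

// Runs in the child right after fork(): stop sharing the parent's file.
// Caveat: a real handler should restrict itself to calls that are safe
// in a freshly forked child; mkstemp is used here only for brevity.
void ChildAfterFork() {
  if (g_fd >= 0) close(g_fd);
  g_fd = OpenScratchFile();
  g_offset = 0;
}

// Opens the initial file and registers the child-side handler.
bool Init() {
  g_fd = OpenScratchFile();
  if (g_fd < 0) return false;
  return pthread_atfork(nullptr, nullptr, ChildAfterFork) == 0;
}

// Forks and reports whether the child ended up with its own file.
bool ChildGetsOwnFile() {
  struct stat parent_st;
  if (fstat(g_fd, &parent_st) != 0) return false;
  pid_t pid = fork();
  if (pid < 0) return false;
  if (pid == 0) {
    struct stat child_st;
    bool ok = g_fd >= 0 && fstat(g_fd, &child_st) == 0 &&
              child_st.st_ino != parent_st.st_ino && g_offset == 0;
    _exit(ok ? 0 : 1);
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return false;
  return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}

}  // namespace sketch
```

Mappings the child inherited from the parent still refer to the old file, but they all lie below the parent's offset at fork time, and the parent only ever truncates to a size at or above that offset.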
Assigned. But do note that we don't normally do the bureaucracy thingy around assignments etc. Just do the thing, and if it's right we'll merge and be thankful. |
I am receiving a SIGBUS error in the child process. The flow of events is as follows -
Steps to reproduce -
g++ tcm.cpp -o tcm -L/usr/local/lib -ltcmalloc_debug
TCMALLOC_MEMFS_DISABLE_FALLBACK=true TCMALLOC_MEMFS_MAP_PRIVATE=true TCMALLOC_MEMFS_MALLOC_PATH=/dev/hugepages/tcmalloc ./tcm
I am using MAP_PRIVATE on a hugepage setup. My understanding is that this should ensure that parent and child have their own separate mappings, so we shouldn't get SIGBUS. But somehow, after the parent process does its allocations, the mappings of the child process get corrupted and we receive SIGBUS. I am using gperftools master. Is this a known issue?
tcm.cpp.txt
Output Received: