8000 memfs sysallocator can't share fd across forking (was: SIGBUS error in child after parent performs allocation on Hugepages) · Issue #1538 · gperftools/gperftools · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

memfs sysallocator can't share fd across forking (was: SIGBUS error in child after parent performs allocation on Hugepages) #1538

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
yashrs76 opened this issue Jul 24, 2024 · 18 comments
Assignees

Comments

@yashrs76
Copy link
yashrs76 commented Jul 24, 2024

I am facing an error where I receive SIGBUS error in the child process. The flow of things happening here are as follows -

  1. Fork()
  2. Child Process performs some allocation and goes to sleep.
  3. Parent process which was sleeping till now performs some allocations and goes to sleep again.
  4. Child Process wakes up and tries accessing the memory it allocated earlier -------> SIGBUS received.

Steps to reproduce -
g++ tcm.cpp -o tcm -L/usr/local/lib -ltcmalloc_debug
TCMALLOC_MEMFS_DISABLE_FALLBACK=true TCMALLOC_MEMFS_MAP_PRIVATE=true TCMALLOC_MEMFS_MALLOC_PATH=/dev/hugepages/tcmalloc ./tcm

I am using MAP_PRIVATE on Hugepage setup. My understanding is that this should have ensured that parent and child have their own separate mapping and we shouldn't get into SIGBUS. But somehow after the parent process does its allocations the mappings of child process get corrupted and we receive SIGBUS. I am using gperftools master. Is this a known issue ?
tcm.cpp.txt

Output Received:

Parent Process: 2202509
Child Process: 2202513
Memory Allocated by Child from 0x7fbd0b801000:0x7fbd0bc01000
malloc_hook section is missing, thus InHookCaller is broken!
Memory Allocated by Child from 0x7fbd0b201000:0x7fbd0b601000
Memory Allocated by Child from 0x7fbd0ac01000:0x7fbd0b001000
Memory Allocated by Child from 0x7fbd0a601000:0x7fbd0aa01000
Memory Allocated by Parent from 0x7fbd0b801000:0x7fbd0bc01000
About to memset on addrress: 0x7fbd0b801000
Memset done on address: 0x7fbd0b801000
About to memset on addrress: 0x7fbd0b201000
Invoking SIGBUS handler in context of 2202513
Signal number: 7
Error number: 0
Signal code: 2
Sending process ID: 186650624
Sending user ID: 32701
Fault address: 0x7fbd0b201000
Exit value or signal: 0
Band event: 140449912197120
Timer overrun count: 32701
POSIX timer ID: 0xb201000
User CPU time consumed: 0
System CPU time consumed: 0
Signal value: (nil)
POSIX.1b signal: 0
POSIX.1b signal pointer: (nil)
System call number: 0
ARCH dependent data: (nil)
Parent Process Exit
malloc_hook section is missing, thus InHookCaller is broken!

@yashrs76
Copy link
Author

@alk have you seen this earlier ?

@alk
Copy link
Contributor
alk commented Jul 24, 2024

SIGBUS is how hugetlbfs reacts to failures to allocate a huge page. So just either don't fork or reserve more memory for huge pages.

@alk alk closed this as completed Jul 24, 2024
@yashrs76
Copy link
Author
yashrs76 commented Jul 24, 2024

Thanks for the quick response @alk. I see your point. But fork is necessary for my application. Exactly what do you mean by reserve more memory? I already tried using memfs_malloc_limit_mb and it fails with SIGBUS again.

Also I believe that I am receiving SIGBUS when I am trying to access the already allocated memory in child process again. Do you think it is because of posix_memalign call ?

Finally if you mean that HugetlbPages_Free in /proc/meminfo is 0 when we get here, then I don't think I follow that because -

  1. We have enough free hugepages at this point
  2. There should be no new allocation at the point when we access that memory

@alk
Copy link
Contributor
alk commented Jul 25, 2024

Here is what happens. You've set up your malloc to use hugepages. So that memory was allocated (and you say you have 0 free hugepages left). Then you fork, which means entire address space is logically duplicated. Then as you touch the memory, hugetlbfs needs to allocate new pages (because of duplication). And it runs out of pages to give. So it gives you SIGBUS.

@yashrs76
Copy link
Author

@alk I checked (through /proc/meminfo) and I have enough hugepages available in the system. Can you please try reproducing using the steps I mentioned and you would be able to see my point.

@alk
Copy link
Contributor
alk commented Jul 25, 2024

No I will not. You pointed out above you have no free hugetlbfs pages left reserved. Reserve a lot more and try yourself. You need 2x if you fork

@yashrs76
Copy link
Author

I am not sure if I was not clear earlier, but I have around 28GB of free Hugepages in my system to begin with and this value does not goes down to 0. I am not sure from where you are getting the idea that I have no free hugepages left. @alk can you please check once, this is not the case of no free hugepage available.

@alk
Copy link
Contributor
alk commented Jul 27, 2024

It just works on my box. Look, I am not here to do other people's homework. You can surely do a lot more debugging on your end. And from what I can see, you're fairly capable.

Also please do note that hugetlbfs has somewhat elaborate logic around reserving hugetlbfs pages which is distinct from allocating them. If/when you investigate also consider testing or reading kernel source for MAP_NORESERVE behavior on hugetlbfs. From what I recall (but not 100% sure that I remember correctly) it significantly changes how reservation logic works.

Good luck.

@alk
Copy link
Contributor
alk commented Jul 27, 2024

oh, that was too quick. One sec.

@alk
Copy link
Contributor
alk commented Jul 27, 2024

Ah, okay. Here is what happens. After fork, both child and parent have same FD for hugetlbfs file. As part of allocating memory, we do ftruncate, with intention to extend the file. But after child has allocated it's pages, the file has grown and then parent actually shrinks it when it allocates it's memory. So than child gets SIGBUS as per normal behavior.

This code has ~never had actual working support for MAP_PRIVATE and "sharing" hugetlbfs heap file descriptor.

From what I can see, we actually don't have to do ftruncate as mmap-ing "past" end of file usually works. And does seem to work on my fairly up-to-date Linux kernel for hugetlbfs (maybe it didn't way back when this code was written).

Now I have to comment about the homework comment I already made above. So in the perfect world, I can see you following up from my SIGBUS comment above and reducing your test program until you get to something. Maybe not final conclusion but to much tighter test case. Please don't take this comment as some sort of offense, but rather as advise on how to become better engineer. Good luck.

P.S. here is my test program that I ended up building to narrow things down:

// -*- Mode: C++; c-basic-offset: 2; indent-tabs-mode: nil -*-
#include <inttypes.h>
#include <semaphore.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int tcm_fd;
off_t tcm_off;

struct sync_state {
  sem_t child_ready;
  sem_t parent_ready;
};

sync_state* the_sync;

static constexpr size_t kPageSize = 2 << 20;

char* allocate(size_t amt) {
  if ((amt & (kPageSize-1)) != 0) {
    abort();
  }

#if 1
  int rv = ftruncate(tcm_fd, tcm_off + amt);
  if (rv < 0) {
    perror("ftruncate");
    abort();
  }
#endif

  void* addr = mmap(nullptr, amt, PROT_READ|PROT_WRITE, MAP_PRIVATE, tcm_fd, tcm_off);
  if (addr == MAP_FAILED) {
    perror("mmap");
    abort();
  }
  tcm_off += 2 << 20;
  return static_cast<char*>(addr);
}

void create_file() {
  char path[] = "/dev/hugepages/tcmalloc.XXXXXX";
  tcm_fd = mkstemp(path);
  if (tcm_fd < 0) {
    perror("mkstemp");
    abort();
  }
  (void)unlink(path);
}

void inspect_fd() {
  struct stat st;
  if (fstat(tcm_fd, &st) != 0) {
    perror("fstat");
    abort();
  }
  printf("fd: blksize %zu\n", (size_t)st.st_blksize);
  printf("fd: st_blocks %zu\n", (size_t)st.st_blocks);
  printf("fd: st_size %zu\n", (size_t)st.st_size);
}

int main() {
  the_sync = static_cast<sync_state*>(mmap(nullptr, 4096, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_SHARED, -1, 0));
  sem_init(&the_sync->child_ready, /* shared = */ 1, 0);
  sem_init(&the_sync->parent_ready, /* shared = */ 1, 0);

  create_file();

  int pid = fork();
  if (pid) {
    printf("parent: My pid: %d, child pid: %d\n", (int)getpid(), pid);
    sem_wait(&the_sync->child_ready);
    char* addr = allocate(kPageSize);
    memset(addr, 3, kPageSize);
    printf("parent: allocated and memset at %p\n", addr);
    sem_post(&the_sync->parent_ready);

    int status;
    wait(&status);
    printf("parent: wait status: %d\n", status);

    printf("my byte: %d\n", (int)*addr);
  } else {
    printf("child: My pid: %d\n", (int)getpid());
    char* addr = allocate(2 * kPageSize);
    memset(addr, 1, kPageSize);

    printf("child: allocated and memset (one of 2 pages) at %p\n", addr);
    inspect_fd();

    sem_post(&the_sync->child_ready);
    sem_wait(&the_sync->parent_ready);

    printf("child: about to touch my page again\n");
    inspect_fd();

    memset(addr, 2, 2 * kPageSize);
    printf("child: done\n");
  }
}

@yashrs76
Copy link
Author

Thanks @alk for the quick response. I believe I was too blinded by the mmap syscall and missed looking into ftruncate. Also thanks for providing useful tips, and yeah its all cool😃😃😃, I am always on the lookout for opportunities to improve myself and grow.

Circling back to the problem now, from the comment mentioned here (LINK) it looks like we might need to perform ftruncate for tmpfs. Have to test it out though if we still need it in the newer kernel versions.
I think we can fix this issue by -

  1. Eliminate file backed mapping completely for the MAP_PRIVATE case and use MAP_ANONYMOUS
  2. Use some mechanism to create a new fd after each fork(not sure how much performance impact this would have).
  3. Eliminate ftruncate --> The approach you mentioned. Only concern is the impact on someone who is using the newer tcmalloc version with some older kernel. Do we explicitly say that we don't support so and so kernels with this tcmalloc version ?

Please do let me know your views on this. Still thinking if there could be better ways to fix this.

@alk
Copy link
Contributor
alk commented Jul 29, 2024

We're trying to keep rhel 6 / rhel 7 workable, kernel version-wise.

@alk alk reopened this Jul 29, 2024
@alk alk changed the title SIGBUS error in child after parent performs allocation on Hugepages memfs sysallocator can't share fd across forking (was: SIGBUS error in child after parent performs allocation on Hugepages) Jul 29, 2024
@yashrs76
Copy link
Author

@alk thought of one more approach other than the ones I mentioned above.

  1. We can have a bool variable to skip ftruncate for the hugetlbfs case. This variable will be set to true if the memfs_malloc_path corresponds to hugetlbfs file system. I understand that this will not solve the problem completely, and we could still end up into the SIGBUS scenarion on tmpfs case.

Also extending on my 2nd Approach-
2. We can use pthread_atfork to have seperate fds after each init for the child process. Parent can continue using the existing fd.

Let me know what you think so that we can work towards a patch for the same. Personally speaking, 4th one seems like a quick fix for this problem with least blast radius.

@alk
Copy link
Contributor
alk commented Jul 31, 2024

Looks like there are only 2 decent alternatives:

  • drop MAP_PRIVATE support and officially recognize that we don't support forking use-case
  • (super-carefully) do atfork thingy

@alokkataria
Copy link

Interesting thread, though taking a step back, who uses tcmalloc's memfs code path with tmpfs ? I thought memfs code path was specifically written to work with huge page setup.
Given all the options we have mentioned here, I think the better and less risky thing would be to drop tmpfs support and get rid of the ftruncate call here.
I don't know about others, but we do rely on the MAP_PRIVATE support in memfs, so would like to see that being preserved.

@alk
Copy link
Contributor
alk commented Aug 1, 2024
8000

No, some Google workloads do use (sibling of) memfs allocator over tmpfs, for complicated reasons.

And in general, I don't see memfs as hugetlbfs and Linux-only. It can be useful for all kinds of memories exposed as a filesystem.

As for your comment that your workloads run with MAP_PRIVATE, I am curious for more details. As far as I can see, it is only useful across forks, but those forks are broken as noted here.

@alokkataria
Copy link

Regarding the usecase, for historic reasons, we do have a parent process which is not doing much but forks the child which does all the allocations, so dropping MAP_PRIVATE would cause lot of legacy headaches.

Given the tmpfs usecases that you cited, it seems atfork is the only agreeable option.

@alk
Copy link
Contributor
alk commented Aug 6, 2024

Assigned. But do note, that we don't normally do the bureaucracy thingy around assignments etc. Just do the thing and if its right we'll merge and be thankful.

@alk alk assigned alokkataria and yashrs76 and unassigned alokkataria and yashrs76 Aug 6, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants
0