Don't skip munmap of mtcp_restart regions. #353
…ging. FIXED: Cleanup at end of registerLocalSendsAndRecvs
Please modify the comment and add the extra requested comment. Otherwise, we will look back at this mysterious special case in the future and wonder why it is there.
Separately, I assume that you're going to squash the two commits together before pushing this in.
I'd like to wait to see the added comment before approving, just to make sure we're documenting the code well. Thanks.
@@ -849,6 +849,11 @@ mtcp_plugin_skip_memory_region_munmap(Area *area, RestoreInfo *rinfo)
  LhCoreRegions_t *lh_regions_list = NULL;
  int total_lh_regions = lh_info->numCoreRegions;

  // Don't skip munmap of mtcp_restart regions.
Let's change the comment to remove the double negative:
// Do an munmap of mtcp_restart regions during restart. Don't skip this.
Also, please add a comment about why we need to munmap the mtcp_restart regions within MANA, but we don't need to do that within ordinary DMTCP. Where is the potential address conflict that we're trying to avoid?
I want to add that we should not skip every [heap] region, only the [heap] region right after the mtcp_restart region.
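To make the "only the [heap] right after mtcp_restart" condition concrete, here is a minimal sketch. It assumes the memory areas are visited in address order and that only the name matters for the decision. `DemoArea`, `ends_with`, and `is_restart_or_adjacent_heap` are hypothetical names for illustration; `ends_with` is a local stand-in with the semantics assumed for `mtcp_strendswith`, and nothing here is the actual MANA code.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the Area struct; only the name is modeled. */
typedef struct {
  const char *name;
} DemoArea;

/* Local helper: returns 1 if s ends with suffix (assumed to match the
 * intent of mtcp_strendswith). */
static int ends_with(const char *s, const char *suffix) {
  size_t n = strlen(s), m = strlen(suffix);
  return n >= m && strcmp(s + n - m, suffix) == 0;
}

/* Returns 1 if areas[i] is an mtcp_restart region, or a [heap] region
 * whose immediate predecessor is an mtcp_restart region. Any other
 * [heap] region returns 0. */
static int is_restart_or_adjacent_heap(const DemoArea *areas, int i) {
  if (ends_with(areas[i].name, "/mtcp_restart")) {
    return 1;
  }
  return i > 0 && strcmp(areas[i].name, "[heap]") == 0 &&
         ends_with(areas[i - 1].name, "/mtcp_restart");
}
```

The point of tracking the predecessor is that a bare name match on "[heap]" cannot tell the heap of mtcp_restart apart from any other heap mapping.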
@@ -849,6 +849,12 @@ mtcp_plugin_skip_memory_region_munmap(Area *area, RestoreInfo *rinfo)
  LhCoreRegions_t *lh_regions_list = NULL;
  int total_lh_regions = lh_info->numCoreRegions;

  // Don't skip munmap of mtcp_restart regions.
  if (mtcp_strendswith(area->name, "/mtcp_restart") ||
      mtcp_strendswith(area->name, "[heap]")) {
We only want to skip the [heap] right after mtcp_restart. Also, in mpi_plugin.cpp we need to do the same, so that those areas are skipped when libsStart is considered.
@@ -849,6 +849,12 @@ mtcp_plugin_skip_memory_region_munmap(Area *area, RestoreInfo *rinfo)
  LhCoreRegions_t *lh_regions_list = NULL;
  int total_lh_regions = lh_info->numCoreRegions;

  // Don't skip munmap of mtcp_restart regions.
  if (mtcp_strendswith(area->name, "/mtcp_restart") ||
      mtcp_strendswith(area->name, "[heap]")) {
Update: For some reason, unmapping the heap region here causes a segfault, but without unmapping it we encounter a conflict later as well. @karya0
@gc00: This PR is insufficient for the fix. The problem lies in how lower-half/lh-proxy account for "core" regions vs. the rest of the regions. The current logic in the split process considers all areas until Further, the upper-half plugin, mpi_plugin.cpp, incorrectly labels the heap created by the new lh-proxy process as part of the upper half and saves it as part of the checkpoint. That's why the heap also sees a conflict on the second restart. We need to come up with a proper fix to handle both cases. This PR can plaster over the mtcp_restart conflict but not the heap conflict.
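As an illustration of the accounting problem described above, one possible shape of a check is sketched here: if the lower half recorded every region it owns, including the heap created by lh-proxy, the upper-half plugin could decline to save any area that overlaps that list. This is only a sketch under those assumptions. `DemoRegion` and `overlaps_lower_half` are hypothetical names, not MANA's actual types or API, and this is not claimed to be the fix adopted later.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one lower-half region; not the real
 * LhCoreRegions_t layout. The end address is exclusive. */
typedef struct {
  uintptr_t start;
  uintptr_t end;
} DemoRegion;

/* Returns 1 if the half-open range [start, end) overlaps any of the
 * n recorded lower-half regions, and 0 otherwise. */
static int overlaps_lower_half(uintptr_t start, uintptr_t end,
                               const DemoRegion *lh, size_t n) {
  for (size_t i = 0; i < n; i++) {
    if (start < lh[i].end && lh[i].start < end) {
      return 1;
    }
  }
  return 0;
}
```

An address-range test like this does not depend on region names, so it would treat a lower-half heap the same way regardless of which process created it, provided the list of recorded regions is complete.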
See PR #357 for the continuation of this analysis. We should probably close this PR without committing it.
No description provided.