-
Notifications
You must be signed in to change notification settings - Fork 8
Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast and skip NLTE solver if Te=MINTEMP #54
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Changes have been made to sections of the code which deal with the bfestimators. MPI_broadcast of the prev_bfrate_normed quantity was causing errors. As such, I have changed the do_MPI_Bcast to not broadcast the prev_bfrate_normed quantitity. This has meant that I have changed the radfield::normalise_bf_estimators to occur earlier across each rank, I this means that we dont have to broadcast the prev_bfrate_normed quantity anymore. |
Regarding the movement of the normalise_bf_estimators, I have ran the W7 model and it produces consistent results. However, a mpirun error has appeared within the nt_MPI_Bcast() function when broadcasting the frac_deposition ratecoeffperdeposition and lineindex across all ranks. I think this error is happening because the broadcast is MPI_COMM_WORLD. Any advice on how to fix/change this part of the code would be appreciated. |
Can you put your description into the first comment, please? Only the first message will be kept after merging. |
To me, it sounds like either the cluster is failing or there are some invalid memory accesses in the code. The CI tests for the parts of the code that you're editing can easily miss MPI problems because the nebularonezone test is run with two ranks on a single node, and the model has only a single radial shell. |
I think you are correct about this. When I ran the checks for integer overflows, I never checked when the NLTE solver turned on as this would take prohibitively long on my laptop. I would guess that there is no way to get the flag -fsanitize=integer to work on the cluster? Which would mean I just have to go through the code slowly and check for the locations of the error. |
I have tried running a full-scale simulation in testmode (make TESTMODE=ON sn3d), but it slowed to a crawl on the atomic data and I gave up waiting for it. I suggest you try that first, and it it isn't making any progress, then just add -fsanitize=interger,address to the CXXFLAGS (e.g., where -std=c++20 is set). |
I have rebased my branch onto the new main version of the code. From the looks of it, my change of the normalisation of the pop_norm_factor_vec, now breaks the GitHub actions. I have reverted to the previous normalisation method (i.e. normalised to LTE). The temperature check may catch some of the most extreme populations, causing some of the errors in 3D models. However, I do not think how we normalise should change the solution as, in both cases, they should both be valid. I will run two nebular versions of the W7 Model, one with the normalisation and one without, to test if the change is significant. Also, once these normalisation issues have been checked, can we merge these into the main branch so I can open a separate PR for the NT_Bcast issues if needed? |
…te MPI_Bcast and skip NLTE solver if Te=MINTEMP (#54) **Normalise bound-free estimators independently on all ranks** The normalisation of the bf estimators has been changed to occur across all cells instead of the rank-assigned cells. This means that we no longer need to broadcast the prev_bfratenormed array. **Skip NLTE solver if Te=MINTEMP:** The temperature check was added due to populations in some grid cells with a low temperature, as populations being unphysical (i.e producing nne outsude of the limits of a float). These grid cells typically had a lower amount of 56Ni.
…te MPI_Bcast and skip NLTE solver if Te=MINTEMP (#54) **Normalise bound-free estimators independently on all ranks** The normalisation of the bf estimators has been changed to occur across all cells instead of the rank-assigned cells. This means that we no longer need to broadcast the prev_bfratenormed array. **Skip NLTE solver if Te=MINTEMP:** The temperature check was added due to populations in some grid cells with a low temperature, as populations being unphysical (i.e producing nne outsude of the limits of a float). These grid cells typically had a lower amount of 56Ni.
Normalise bound-free estimators independently on all ranks
The normalisation of the bf estimators has been changed to occur across all cells instead of the rank-assigned cells. This means that we no longer need to broadcast the prev_bfratenormed array.
Skip NLTE solver if Te=MINTEMP:
The temperature check was added due to populations in some grid cells with a low temperature, as populations being unphysical (i.e producing nne outsude of the limits of a float). These grid cells typically had a lower amount of 56Ni.