Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast and skip NLTE solver if Te=MINTEMP by jpollin98 · Pull Request #54 · artis-mcrt/artis · GitHub

Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast and skip NLTE solver if Te=MINTEMP #54


Merged: 63 commits merged into develop on May 29, 2024

Conversation

@jpollin98 (Contributor) commented May 3, 2024

Normalise bound-free estimators independently on all ranks
The normalisation of the bound-free (bf) estimators now occurs over all cells on every rank, instead of only over each rank's assigned cells. This means that we no longer need to broadcast the prev_bfrate_normed array.
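
For illustration, a minimal sketch of the idea in C++ (all names here, including the normalisation factor, are hypothetical stand-ins rather than the actual ARTIS identifiers): because the raw estimators are available on every rank, each rank can apply the normalisation over the full grid itself, producing an identical prev_bfrate_normed everywhere without any communication.

```cpp
#include <cstddef>

// Sketch: every rank normalises the bound-free estimators for ALL cells,
// not just its rank-assigned slice. Since each rank computes the same
// result, the normalised array never needs to be broadcast.
// ngrid, nbfcontinua, bfrate_raw, cell_volume and deltat are illustrative.
void normalise_bf_estimators(const std::size_t ngrid, const std::size_t nbfcontinua,
                             const double *bfrate_raw, const double *cell_volume,
                             const double deltat, double *prev_bfrate_normed) {
  for (std::size_t cell = 0; cell < ngrid; cell++) {  // all cells, on every rank
    // hypothetical normalisation factor (e.g. per cell volume and timestep)
    const double factor = 1. / (cell_volume[cell] * deltat);
    for (std::size_t i = 0; i < nbfcontinua; i++) {
      prev_bfrate_normed[cell * nbfcontinua + i] = bfrate_raw[cell * nbfcontinua + i] * factor;
    }
  }
}
```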

Skip NLTE solver if Te=MINTEMP:
The temperature check was added because some low-temperature grid cells produced unphysical populations (e.g. an nne outside the limits representable by a float). These grid cells typically had a lower 56Ni abundance.
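
A minimal, self-contained sketch of what such a guard could look like (the floor value and the helper functions below are placeholders, not the actual ARTIS implementation):

```cpp
// Sketch of the Te=MINTEMP guard; values and helpers are illustrative only.
constexpr double MINTEMP = 1000.;  // solver temperature floor [K] (placeholder value)

double get_Te(const int /*modelgridindex*/) { return MINTEMP; }  // stub accessor
void set_lte_pops(const int /*modelgridindex*/) {}               // stub fallback

void solve_nlte_pops(const int modelgridindex) {
  // At the temperature floor, the NLTE solution can produce unphysical
  // populations (nne outside float limits), so skip the matrix solve
  // and fall back to LTE populations for this cell.
  if (get_Te(modelgridindex) <= MINTEMP) {
    set_lte_pops(modelgridindex);
    return;
  }
  // ... otherwise assemble and solve the NLTE rate matrix as before ...
}
```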

@jpollin98 (Contributor, Author)

Changes have been made to the sections of the code that deal with the bf estimators.

The MPI_Bcast of the prev_bfrate_normed quantity was causing errors. As such, I have changed do_MPI_Bcast to no longer broadcast prev_bfrate_normed, and moved radfield::normalise_bf_estimators earlier so that it runs on each rank, which means we no longer have to broadcast the prev_bfrate_normed quantity at all.

@jpollin98 (Contributor, Author)

Regarding the movement of normalise_bf_estimators: I have run the W7 model and it produces consistent results.

However, an mpirun error has appeared within the nt_MPI_Bcast() function when broadcasting frac_deposition, ratecoeffperdeposition, and lineindex across all ranks. I think this error is happening because the broadcast uses MPI_COMM_WORLD. Any advice on how to fix/change this part of the code would be appreciated.

@lukeshingles (Member)

Changes have been made to the sections of the code that deal with the bf estimators.

The MPI_Bcast of the prev_bfrate_normed quantity was causing errors. As such, I have changed do_MPI_Bcast to no longer broadcast prev_bfrate_normed, and moved radfield::normalise_bf_estimators earlier so that it runs on each rank, which means we no longer have to broadcast the prev_bfrate_normed quantity at all.

Can you put your description into the first comment, please? Only the first message will be kept after merging.

@lukeshingles (Member)

Regarding the movement of normalise_bf_estimators: I have run the W7 model and it produces consistent results.

However, an mpirun error has appeared within the nt_MPI_Bcast() function when broadcasting frac_deposition, ratecoeffperdeposition, and lineindex across all ranks. I think this error is happening because the broadcast uses MPI_COMM_WORLD. Any advice on how to fix/change this part of the code would be appreciated.

To me, it sounds like either the cluster is failing or there are some invalid memory accesses in the code. The CI tests for the parts of the code that you're editing can easily miss MPI problems because the nebularonezone test is run with two ranks on a single node, and the model has only a single radial shell.

@lukeshingles lukeshingles changed the title Nebular 50 cubed Normalise bound-free estimators on all ranks to eliminate MPI_Bcast, and remove normalisation factors on NLTE matrix solution. May 7, 2024
@lukeshingles lukeshingles changed the title Normalise bound-free estimators on all ranks to eliminate MPI_Bcast, and remove normalisation factors on NLTE matrix solution. Normalise bound-free estimators on all ranks to eliminate MPI_Bcast and remove normalisation factors on NLTE matrix solution. May 7, 2024
@jpollin98 (Contributor, Author)

Regarding the movement of normalise_bf_estimators: I have run the W7 model and it produces consistent results.
However, an mpirun error has appeared within the nt_MPI_Bcast() function when broadcasting frac_deposition, ratecoeffperdeposition, and lineindex across all ranks. I think this error is happening because the broadcast uses MPI_COMM_WORLD. Any advice on how to fix/change this part of the code would be appreciated.

To me, it sounds like either the cluster is failing or there are some invalid memory accesses in the code. The CI tests for the parts of the code that you're editing can easily miss MPI problems because the nebularonezone test is run with two ranks on a single node, and the model has only a single radial shell.

I think you are correct about this. When I ran the checks for integer overflows, I never tested with the NLTE solver turned on, as that would take prohibitively long on my laptop.

I would guess that there is no way to get the -fsanitize=integer flag to work on the cluster? If not, I will just have to go through the code slowly and check for the locations of the error.

@lukeshingles (Member) commented May 8, 2024

I would guess that there is no way to get the -fsanitize=integer flag to work on the cluster? If not, I will just have to go through the code slowly and check for the locations of the error.

I have tried running a full-scale simulation in testmode (make TESTMODE=ON sn3d), but it slowed to a crawl on the atomic data and I gave up waiting for it. I suggest you try that first, and if it isn't making any progress, then just add -fsanitize=integer,address to the CXXFLAGS (e.g., where -std=c++20 is set).
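
For reference, a trivial self-contained example of the class of bug -fsanitize=integer,address catches at runtime (the numbers below are made up, but a large 3D grid times many bound-free continua can overflow a 32-bit flat index in just this way):

```cpp
// Compile with: clang++ -std=c++20 -fsanitize=integer,address overflow_demo.cpp
// Instead of silently wrapping, the integer sanitizer reports the signed
// overflow below at runtime, pointing at the exact file and line.
#include <cstdio>

int main(int argc, char **) {
  const int ncells = 100'000 * argc;  // runtime value, so nothing is constant-folded
  const int nbfcontinua = 50'000;     // many bf continua per cell
  const int flatindex = ncells * nbfcontinua;  // ~5e9 overflows a 32-bit int
  std::printf("flatindex = %d\n", flatindex);
  return 0;
}
```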

@jpollin98 (Contributor, Author) commented May 28, 2024

Hi Luke, I tried running with the other modules you suggested and still had problems. When I added the MPI_Barrier(MPI_COMM_WORLD) statements, I got slightly further, by about ~150 grid cells (~150,000 more broadcasts). I have had a look at the values that were still being broadcast; they are sensible numbers, so I still think this is an MPI problem.
I have changed the nt_MPI_Bcast broadcast to use MPI_Pack and MPI_Unpack instead. These have passed the GitHub tests, but would you be able to have a quick look at the code before I run the simulation? (Just in case there are any glaring issues.)
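
For reference, a minimal sketch of the pack-then-broadcast pattern described above; the array names follow the quantities mentioned in this thread, but the function and data layout are illustrative, not the actual nt_MPI_Bcast code:

```cpp
#include <mpi.h>
#include <vector>

// Sketch: pack the three arrays into one contiguous buffer, broadcast it
// once, and unpack on the receiving ranks, replacing three separate typed
// MPI_Bcast calls. Assumes every rank already knows n and has sized the
// vectors accordingly.
void nt_bcast_packed(std::vector<double> &frac_deposition,
                     std::vector<double> &ratecoeffperdeposition,
                     std::vector<int> &lineindex, const int root) {
  const int n = static_cast<int>(frac_deposition.size());

  // Ask MPI for the packed sizes rather than assuming sizeof().
  int size_doubles = 0;
  int size_ints = 0;
  MPI_Pack_size(n, MPI_DOUBLE, MPI_COMM_WORLD, &size_doubles);
  MPI_Pack_size(n, MPI_INT, MPI_COMM_WORLD, &size_ints);
  const int bufsize = 2 * size_doubles + size_ints;

  std::vector<char> buffer(bufsize);
  int my_rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

  if (my_rank == root) {
    int pos = 0;
    MPI_Pack(frac_deposition.data(), n, MPI_DOUBLE, buffer.data(), bufsize, &pos, MPI_COMM_WORLD);
    MPI_Pack(ratecoeffperdeposition.data(), n, MPI_DOUBLE, buffer.data(), bufsize, &pos, MPI_COMM_WORLD);
    MPI_Pack(lineindex.data(), n, MPI_INT, buffer.data(), bufsize, &pos, MPI_COMM_WORLD);
  }

  // A single broadcast of raw packed bytes replaces the three typed broadcasts.
  MPI_Bcast(buffer.data(), bufsize, MPI_PACKED, root, MPI_COMM_WORLD);

  if (my_rank != root) {
    int pos = 0;
    MPI_Unpack(buffer.data(), bufsize, &pos, frac_deposition.data(), n, MPI_DOUBLE, MPI_COMM_WORLD);
    MPI_Unpack(buffer.data(), bufsize, &pos, ratecoeffperdeposition.data(), n, MPI_DOUBLE, MPI_COMM_WORLD);
    MPI_Unpack(buffer.data(), bufsize, &pos, lineindex.data(), n, MPI_INT, MPI_COMM_WORLD);
  }
}
```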

Try a rebase onto the current master. Fionn's deflagration models slowed to a crawl with some kind of bound-free process issue. The latest version is slower generally, but seems to avoid this kind of deadlock. See if it makes a difference for your models.

So, whatever issue affected Classic also affected the NLTE branch of the code? That is very surprising... I wouldn't have expected any specific issues with Classic to cause a problem in the MPI routines.

All of my NLTE models worked fine. I'm not exactly sure what caused the photoionisation issues in the classic model, but I can't say for sure that they didn't affect any NLTE models. Better to rebase onto the latest commit to save yourself time.

I have rebased my branch onto the new main version of the code. From the looks of it, my change to the normalisation of pop_norm_factor_vec now breaks the GitHub actions, so I have reverted to the previous normalisation method (i.e. normalised to LTE). The temperature check may catch some of the most extreme populations that cause errors in 3D models.

However, I do not think how we normalise should change the solution, as both normalisations should be valid. I will run two nebular versions of the W7 model, one with the new normalisation and one without, to test whether the change is significant.

Also, once these normalisation issues have been checked, can we merge this into the main branch so I can open a separate PR for the nt_MPI_Bcast issues if needed?

@lukeshingles lukeshingles changed the base branch from main to develop May 29, 2024 06:24
@lukeshingles lukeshingles changed the title Normalise bound-free estimators on all ranks to eliminate MPI_Bcast and remove normalisation factors on NLTE matrix solution. Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast May 29, 2024
@lukeshingles lukeshingles changed the title Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast and skip NLTE solver if Te=MINTEMP May 29, 2024
@lukeshingles lukeshingles enabled auto-merge (squash) May 29, 2024 07:33
@lukeshingles lukeshingles disabled auto-merge May 29, 2024 08:39
@lukeshingles lukeshingles merged commit a89857c into develop May 29, 2024
44 of 49 checks passed
lukeshingles pushed a commit that referenced this pull request May 29, 2024
Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast and skip NLTE solver if Te=MINTEMP (#54)

lukeshingles pushed a commit that referenced this pull request Jun 3, 2024
Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast and skip NLTE solver if Te=MINTEMP (#54)

@jpollin98 jpollin98 deleted the nebular-50-cubed branch June 8, 2024 20:49