Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast and skip NLTE solver if Te=MINTEMP by jpollin98 · Pull Request #54 · artis-mcrt/artis · GitHub

Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast and skip NLTE solver if Te=MINTEMP #54


Merged: 63 commits merged into develop on May 29, 2024

Conversation

@jpollin98 (Contributor) commented May 3, 2024

Normalise bound-free estimators independently on all ranks
The normalisation of the bound-free (bf) estimators now occurs over all cells on every rank, instead of only over each rank's assigned cells. This means that we no longer need to broadcast the prev_bfrate_normed array.
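
For illustration, a minimal sketch of the idea in C++ (all names here, including the normalisation factor, are hypothetical stand-ins rather than the actual ARTIS identifiers): because the raw estimators are available on every rank, each rank can apply the normalisation over the full grid itself, producing an identical prev_bfrate_normed everywhere without any communication.

```cpp
#include <cstddef>

// Sketch: every rank normalises the bound-free estimators for ALL cells,
// not just its rank-assigned slice. Since each rank computes the same
// result, the normalised array never needs to be broadcast.
// ngrid, nbfcontinua, bfrate_raw, cell_volume and deltat are illustrative.
void normalise_bf_estimators(const std::size_t ngrid, const std::size_t nbfcontinua,
                             const double *bfrate_raw, const double *cell_volume,
                             const double deltat, double *prev_bfrate_normed) {
  for (std::size_t cell = 0; cell < ngrid; cell++) {  // all cells, on every rank
    // hypothetical normalisation factor (e.g. per cell volume and timestep)
    const double factor = 1. / (cell_volume[cell] * deltat);
    for (std::size_t i = 0; i < nbfcontinua; i++) {
      prev_bfrate_normed[cell * nbfcontinua + i] = bfrate_raw[cell * nbfcontinua + i] * factor;
    }
  }
}
```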

Skip NLTE solver if Te=MINTEMP:
The temperature check was added because some low-temperature grid cells produced unphysical populations (e.g. an nne outside the limits representable by a float). These grid cells typically had a lower 56Ni abundance.
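
A minimal, self-contained sketch of what such a guard could look like (the floor value and the helper functions below are placeholders, not the actual ARTIS implementation):

```cpp
// Sketch of the Te=MINTEMP guard; values and helpers are illustrative only.
constexpr double MINTEMP = 1000.;  // solver temperature floor [K] (placeholder value)

double get_Te(const int /*modelgridindex*/) { return MINTEMP; }  // stub accessor
void set_lte_pops(const int /*modelgridindex*/) {}               // stub fallback

void solve_nlte_pops(const int modelgridindex) {
  // At the temperature floor, the NLTE solution can produce unphysical
  // populations (nne outside float limits), so skip the matrix solve
  // and fall back to LTE populations for this cell.
  if (get_Te(modelgridindex) <= MINTEMP) {
    set_lte_pops(modelgridindex);
    return;
  }
  // ... otherwise assemble and solve the NLTE rate matrix as before ...
}
```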

@jpollin98 (Contributor, Author)

Changes have been made to the sections of the code that deal with the bf estimators.

The MPI_Bcast of the prev_bfrate_normed quantity was causing errors. As such, I have changed do_MPI_Bcast to no longer broadcast prev_bfrate_normed, and moved radfield::normalise_bf_estimators earlier so that it runs on each rank, which means we no longer have to broadcast the prev_bfrate_normed quantity at all.

@jpollin98 (Contributor, Author)

Regarding the movement of normalise_bf_estimators: I have run the W7 model and it produces consistent results.

However, an mpirun error has appeared within the nt_MPI_Bcast() function when broadcasting frac_deposition, ratecoeffperdeposition, and lineindex across all ranks. I think this error is happening because the broadcast uses MPI_COMM_WORLD. Any advice on how to fix/change this part of the code would be appreciated.

@lukeshingles (Member)

Changes have been made to the sections of the code that deal with the bf estimators.

The MPI_Bcast of the prev_bfrate_normed quantity was causing errors. As such, I have changed do_MPI_Bcast to no longer broadcast prev_bfrate_normed, and moved radfield::normalise_bf_estimators earlier so that it runs on each rank, which means we no longer have to broadcast the prev_bfrate_normed quantity at all.

Can you put your description into the first comment, please? Only the first message will be kept after merging.

@lukeshingles (Member)

Regarding the movement of normalise_bf_estimators: I have run the W7 model and it produces consistent results.

However, an mpirun error has appeared within the nt_MPI_Bcast() function when broadcasting frac_deposition, ratecoeffperdeposition, and lineindex across all ranks. I think this error is happening because the broadcast uses MPI_COMM_WORLD. Any advice on how to fix/change this part of the code would be appreciated.

To me, it sounds like either the cluster is failing or there are some invalid memory accesses in the code. The CI tests for the parts of the code that you're editing can easily miss MPI problems because the nebularonezone test is run with two ranks on a single node, and the model has only a single radial shell.

@lukeshingles lukeshingles changed the title Nebular 50 cubed Normalise bound-free estimators on all ranks to eliminate MPI_Bcast, and remove normalisation factors on NLTE matrix solution. May 7, 2024
@lukeshingles lukeshingles changed the title Normalise bound-free estimators on all ranks to eliminate MPI_Bcast, and remove normalisation factors on NLTE matrix solution. Normalise bound-free estimators on all ranks to eliminate MPI_Bcast and remove normalisation factors on NLTE matrix solution. May 7, 2024
@jpollin98 (Contributor, Author)

Regarding the movement of normalise_bf_estimators: I have run the W7 model and it produces consistent results.
However, an mpirun error has appeared within the nt_MPI_Bcast() function when broadcasting frac_deposition, ratecoeffperdeposition, and lineindex across all ranks. I think this error is happening because the broadcast uses MPI_COMM_WORLD. Any advice on how to fix/change this part of the code would be appreciated.

To me, it sounds like either the cluster is failing or there are some invalid memory accesses in the code. The CI tests for the parts of the code that you're editing can easily miss MPI problems because the nebularonezone test is run with two ranks on a single node, and the model has only a single radial shell.

I think you are correct about this. When I ran the checks for integer overflows, I never tested with the NLTE solver turned on, as that would take prohibitively long on my laptop.

I would guess that there is no way to get the -fsanitize=integer flag to work on the cluster? If not, I will just have to go through the code slowly and check for the locations of the error.

@lukeshingles (Member) commented May 8, 2024

I would guess that there is no way to get the -fsanitize=integer flag to work on the cluster? If not, I will just have to go through the code slowly and check for the locations of the error.

I have tried running a full-scale simulation in testmode (make TESTMODE=ON sn3d), but it slowed to a crawl on the atomic data and I gave up waiting for it. I suggest you try that first, and if it isn't making any progress, then just add -fsanitize=integer,address to the CXXFLAGS (e.g., where -std=c++20 is set).
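
For reference, a trivial self-contained example of the class of bug -fsanitize=integer,address catches at runtime (the numbers below are made up, but a large 3D grid times many bound-free continua can overflow a 32-bit flat index in just this way):

```cpp
// Compile with: clang++ -std=c++20 -fsanitize=integer,address overflow_demo.cpp
// Instead of silently wrapping, the integer sanitizer reports the signed
// overflow below at runtime, pointing at the exact file and line.
#include <cstdio>

int main(int argc, char **) {
  const int ncells = 100'000 * argc;  // runtime value, so nothing is constant-folded
  const int nbfcontinua = 50'000;     // many bf continua per cell
  const int flatindex = ncells * nbfcontinua;  // ~5e9 overflows a 32-bit int
  std::printf("flatindex = %d\n", flatindex);
  return 0;
}
```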

@jpollin98 (Contributor, Author) commented May 28, 2024

Hi Luke, I tried running with the other modules you suggested and still had problems. When I added the MPI_Barrier(MPI_COMM_WORLD) statements, I got slightly further, by about ~150 grid cells (~150,000 more broadcasts). I have had a look at the values that were still being broadcast; they are sensible numbers, so I still think this is an MPI problem.
I have changed the nt_MPI_Bcast broadcast to use MPI_Pack and MPI_Unpack instead. These have passed the GitHub tests, but would you be able to have a quick look at the code before I run the simulation? (Just in case there are any glaring issues.)
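
For reference, a minimal sketch of the pack-then-broadcast pattern described above; the array names follow the quantities mentioned in this thread, but the function and data layout are illustrative, not the actual nt_MPI_Bcast code:

```cpp
#include <mpi.h>
#include <vector>

// Sketch: pack the three arrays into one contiguous buffer, broadcast it
// once, and unpack on the receiving ranks, replacing three separate typed
// MPI_Bcast calls. Assumes every rank already knows n and has sized the
// vectors accordingly.
void nt_bcast_packed(std::vector<double> &frac_deposition,
                     std::vector<double> &ratecoeffperdeposition,
                     std::vector<int> &lineindex, const int root) {
  const int n = static_cast<int>(frac_deposition.size());

  // Ask MPI for the packed sizes rather than assuming sizeof().
  int size_doubles = 0;
  int size_ints = 0;
  MPI_Pack_size(n, MPI_DOUBLE, MPI_COMM_WORLD, &size_doubles);
  MPI_Pack_size(n, MPI_INT, MPI_COMM_WORLD, &size_ints);
  const int bufsize = 2 * size_doubles + size_ints;

  std::vector<char> buffer(bufsize);
  int my_rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

  if (my_rank == root) {
    int pos = 0;
    MPI_Pack(frac_deposition.data(), n, MPI_DOUBLE, buffer.data(), bufsize, &pos, MPI_COMM_WORLD);
    MPI_Pack(ratecoeffperdeposition.data(), n, MPI_DOUBLE, buffer.data(), bufsize, &pos, MPI_COMM_WORLD);
    MPI_Pack(lineindex.data(), n, MPI_INT, buffer.data(), bufsize, &pos, MPI_COMM_WORLD);
  }

  // A single broadcast of raw packed bytes replaces the three typed broadcasts.
  MPI_Bcast(buffer.data(), bufsize, MPI_PACKED, root, MPI_COMM_WORLD);

  if (my_rank != root) {
    int pos = 0;
    MPI_Unpack(buffer.data(), bufsize, &pos, frac_deposition.data(), n, MPI_DOUBLE, MPI_COMM_WORLD);
    MPI_Unpack(buffer.data(), bufsize, &pos, ratecoeffperdeposition.data(), n, MPI_DOUBLE, MPI_COMM_WORLD);
    MPI_Unpack(buffer.data(), bufsize, &pos, lineindex.data(), n, MPI_INT, MPI_COMM_WORLD);
  }
}
```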

Try a rebase onto the current master. Fionn's deflagration models slowed to a crawl with some kind of bound-free process issue. The latest version is slower generally, but seems to avoid this kind of deadlock. See if it makes a difference for your models.

So, whatever issue affected Classic also affected the NLTE branch of the code? That is very surprising... I wouldn't have expected any specific issues with Classic to cause a problem in the MPI routines.

All of my NLTE models worked fine. I'm not exactly sure what caused the photoionisation issues in the classic model, but I can't say for sure that they didn't affect any NLTE models. Better to rebase onto the latest commit to save yourself time.

I have rebased my branch onto the new main version of the code. From the looks of it, my change to the normalisation of pop_norm_factor_vec now breaks the GitHub actions, so I have reverted to the previous normalisation method (i.e. normalised to LTE). The temperature check may catch some of the most extreme populations that cause errors in 3D models.

However, I do not think how we normalise should change the solution, as both normalisations should be valid. I will run two nebular versions of the W7 model, one with the new normalisation and one without, to test whether the change is significant.

Also, once these normalisation issues have been checked, can we merge this into the main branch so I can open a separate PR for the nt_MPI_Bcast issues if needed?

@lukeshingles lukeshingles changed the base branch from main to develop May 29, 2024 06:24
@lukeshingles lukeshingles changed the title Normalise bound-free estimators on all ranks to eliminate MPI_Bcast and remove normalisation factors on NLTE matrix solution. Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast May 29, 2024
@lukeshingles lukeshingles changed the title Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast and skip NLTE solver if Te=MINTEMP May 29, 2024
@lukeshingles lukeshingles enabled auto-merge (squash) May 29, 2024 07:33
@lukeshingles lukeshingles disabled auto-merge May 29, 2024 08:39
@lukeshingles lukeshingles merged commit a89857c into develop May 29, 2024
44 of 49 checks passed
lukeshingles pushed a commit that referenced this pull request May 29, 2024
Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast and skip NLTE solver if Te=MINTEMP (#54)

lukeshingles pushed a commit that referenced this pull request Jun 3, 2024
Normalise bound-free estimators on each rank independently to eliminate MPI_Bcast and skip NLTE solver if Te=MINTEMP (#54)

@jpollin98 jpollin98 deleted the nebular-50-cubed branch June 8, 2024 20:49