Increasing GPU memory usage caused by the iterative use of FBP · Issue #125 · LLNL/LEAP

Open
TonyLi-Shu opened this issue Oct 27, 2024 · 4 comments

TonyLi-Shu commented Oct 27, 2024

Hi Kyle Champley,

Thanks a lot for this amazing toolkit. It is very convenient to use with PyTorch. However, I am running into an issue where my GPU memory keeps increasing when I iteratively run FBP on different projections (via either leaptorch or leapctypes).

Background:

I want to train a neural network to identify the noise in 2D CT FBP images, working from the projection domain. My input is therefore a batch of projections with shape (batch_size, Num_Projections, Num_rows, Num_cols), and I need to repeatedly run filtered backprojection (FBP) to obtain the 2D FBP images.

Current approach:

I created a Projector using leaptorch and then used proj.fbp to run FBP on every batch of projections. I observed that GPU memory usage grows as the number of FBP calls increases, and eventually the GPU memory fills up completely.

Troubleshooting:

  1. I double-checked the rest of my PyTorch code by commenting out the FBP operations. In that case the GPU memory stays constant (does not increase) during training of the neural network, which suggests the growing GPU memory is caused by FBP, or possibly by my incorrect use of FBP.
  2. I tried fbp in leaptorch, FBP_gpu in leapctypes with inplace=True, and even FBP_gpu in libprojectors. All three functions make GPU memory grow when I run FBP iteratively on each batch.
  3. I also reduced my batch size to 12, 8, or 6. GPU memory still grows as long as I run FBP iteratively.
  4. I tried torch.cuda.empty_cache() after deleting the variables, as well as gc.collect(), and I also tried creating a new projector for every iteration or epoch (see the cleanup sketch after this list). Unfortunately, none of this works; the memory still increases.
  5. My PyTorch version is 2.4.1+cu118 and my LEAP version is 1.23 (I upgraded to 1.23 on 2024/10/24, which I believe is the newest release).
I therefore dug further into the CUDA code, and I see many GPU memory transfer operations (Memcpy and Memcpy3D). This makes me worry that there might be a conflict between the PyTorch training and FBP, or that FBP leaves some GPU memory unfreed which accumulates batch after batch.

Are my code and settings correct for FBP? If they are, is there a way to free the GPU memory after FBP so that it does not accumulate over iterations?

Details about the projector:

The current settings are Num_Projections=720, Num_rows=1, Num_cols=1024, and batch_size=16.
[screenshot: projector configuration printout]

Details about the FBP code:

# self.tempo_A is the cone-beam Projector from leaptorch
def tempo_A_FBP(self, y):
    # Pre-allocate the reconstruction volume for the whole batch on the GPU
    result_x = torch.zeros((y.size(0), 1, self.image_size, self.image_size), requires_grad=False).contiguous().to(self.device)
    y = y.contiguous()
    for i_ in range(y.size(0)):
        with torch.no_grad():
            if self.tempo_A.leapct.verify_inputs(y[i_, :, :, :], result_x[i_, :, :, :]):
                # result_x[i_, :, :, :] = self.tempo_A.fbp(y[i_, :, :, :])
                self.tempo_A.leapct.FBP_gpu(y[i_, :, :, :], result_x[i_, :, :, :], inplace=True)
                # Also tried calling libprojectors directly through ctypes:
                # self.tempo_A.leapct.libprojectors.FBP_gpu.restype = ctypes.c_bool
                # self.tempo_A.leapct.libprojectors.FBP_gpu.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
                # self.tempo_A.leapct.set_model()
                # self.tempo_A.leapct.libprojectors.FBP_gpu(y[i_, :, :, :].data_ptr(), result_x[i_, :, :, :].data_ptr())
            else:
                raise Exception("Error in FBP!")

    return result_x
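
To help localize whether the growth lives inside PyTorch's caching allocator or outside it (for example in memory allocated directly by the CUDA library), here is a sketch of per-batch logging around the FBP calls; `loader` and `model` are placeholders for my data pipeline and the class that owns tempo_A_FBP:

import torch

device = torch.device("cuda")

def log_gpu_memory(tag):
    # Memory tracked by PyTorch's caching allocator vs. free memory on the whole device.
    # Growth that shows up only in the device-level numbers points at allocations made
    # outside PyTorch (e.g. cudaMalloc calls inside a C++/CUDA library).
    allocated = torch.cuda.memory_allocated(device) / 2**20  # MiB held by tensors
    reserved = torch.cuda.memory_reserved(device) / 2**20    # MiB cached by PyTorch
    free, total = torch.cuda.mem_get_info(device)            # bytes, whole device
    print(f"{tag}: allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB, "
          f"device free={free / 2**20:.1f} / {total / 2**20:.1f} MiB")

for batch_idx, projections in enumerate(loader):
    log_gpu_memory(f"before FBP, batch {batch_idx}")
    recon = model.tempo_A_FBP(projections.to(device))
    log_gpu_memory(f"after FBP, batch {batch_idx}")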

Error when the GPU memory is full

Here is the error I get when the GPU memory fills up while using fbp in leaptorch:
[screenshot: out-of-memory error from fbp in leaptorch]
And here is another one when I use FBP_gpu in leapctypes:
[screenshot: out-of-memory error from FBP_gpu in leapctypes]

Let me know if there is anything else I should provide to make this issue clearer. Looking forward to hearing your thoughts. Thanks a lot in advance.

kylechampley (Collaborator) commented

Thanks for reporting this issue.

Yes, LEAP does need to allocate temporary memory to perform its operations. I am pretty careful about freeing that memory when it is no longer needed, but I may have missed something. I'll run some tests and get back to you.

kylechampley (Collaborator) commented

FYI, if your cone-beam data only has one detector row, I recommend using a fan-beam geometry because the cone-beam geometry models the divergence of the rays in the z-direction and may clip off some of your volume.
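
For readers following along, switching to fan-beam would look roughly like the sketch below. All values are placeholders, and the exact set_fanbeam argument list (the order here follows the pattern of the LEAP demo scripts) should be verified against the current LEAP documentation:

from leapctype import tomographicModels

leapct = tomographicModels()

# Placeholder geometry matching the sizes in this issue (720 views, 1 row, 1024 columns).
numAngles, numRows, numCols = 720, 1, 1024
pixelSize = 1.0            # assumed detector pixel size (mm)
sod, sdd = 1100.0, 1400.0  # assumed source-to-object / source-to-detector distances (mm)

# Assumed argument order: numAngles, numRows, numCols, pixelHeight, pixelWidth,
# centerRow, centerCol, phis, sod, sdd
leapct.set_fanbeam(numAngles, numRows, numCols,
                   pixelSize, pixelSize,
                   0.5 * (numRows - 1), 0.5 * (numCols - 1),
                   leapct.setAngleArray(numAngles, 360.0),
                   sod, sdd)
leapct.set_default_volume()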

TonyLi-Shu (Author) commented

Thanks a lot for the fast reply and the great suggestions, Kyle! Looking forward to your findings.

kylechampley (Collaborator) commented

I've done a lot of stress testing of LEAP trying to find memory leaks, but I cannot find any.

Have you tried the newest version of LEAP, which was released a few days ago? It uses less memory and may help resolve your issue.

Also note that for some algorithms to work, LEAP must make temporary copies of the volume and/or projection data. These copies are freed when the algorithm completes, but they may push you beyond the available GPU memory because PyTorch typically uses a TON of memory. Although it is possible there is a memory leak in LEAP, if the memory issue persists with the latest version of LEAP, at this point I think it has to do with PyTorch.
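
A quick way to check this on your side is to compare what PyTorch's allocator reports against what is actually allocated to tensors (a sketch using standard PyTorch calls; nothing here is LEAP-specific):

import torch

device = torch.device("cuda")

# memory_summary() breaks down what PyTorch's caching allocator is holding,
# including blocks that are reserved but not currently backing any tensor.
print(torch.cuda.memory_summary(device))

# If "reserved" is much larger than "allocated", PyTorch is caching freed blocks.
# empty_cache() returns those cached blocks to the driver so that other libraries
# (such as LEAP's CUDA code) can use them.
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated(device), torch.cuda.memory_reserved(device))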
