[LTC] Memory keeps growing during training until OOM #65640
@leslie-fang-intel I threw this into a script and ran it, and I didn't see an obvious (or fast) increase in memory using htop. Can you say how much memory you are seeing used near the beginning of the loop and how fast it grows? Also, I deleted the 2 prints and added another. So far I got to iter 2600 without seeing an issue.
With RN50 training, we can quickly reproduce the OOM issue with BS112. Since the model in this test case is quite small, we only see the memory keep growing (similar to RN50 training), but it will not OOM (or perhaps only after a very long time).
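A quick way to quantify the growth being discussed here is to log the process RSS periodically inside the training loop; a minimal sketch, assuming psutil is installed:

```python
import os
import psutil  # assumed third-party dependency: pip install psutil

proc = psutil.Process(os.getpid())

def log_rss(step, every=100):
    # Print resident set size every `every` iterations so the starting
    # footprint and its growth rate can be compared across runs.
    if step % every == 0:
        rss_mb = proc.memory_info().rss / (1024 ** 2)
        print(f"step {step}: RSS = {rss_mb:.1f} MiB")
```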
I didn't see a noticeable increase in memory with this example. For RN50, were you using torchbenchmark or something else? Note that our convolution operator is currently broken, so don't read too much into convnet performance (but that should be unrelated to a memory leak issue).
I am using
The RN50 is based on https://github.com/pytorch/examples/tree/master/imagenet. Let me check if I can make it public.
Here is the RN50 lazy tensor benchmark, from which I can reproduce the memory leak issue.
Sorry for the delay, we'll try to debug this soon; we haven't had time yet. I did confirm that I can see a very slow leak on the small model, but it's probably easier to debug it on a bigger one, or to see if ASAN provides clues.
Hi @wconstab, do we have any clues about this issue? 😁
We still haven't investigated. We're focused mostly on refactoring and adding coverage to our op codegen, so we can start testing/benchmarking across a wide set of models. But @Krovatkin is planning to do a run with ASAN soon and see if we find anything that way.
Got it. Thanks for the comments.
@Krovatkin just checking if you had a chance to run with ASAN. Any update?
@wenzhe-nrv We fixed a few issues related to OOM, but we believe there's something else lurking in the depths of LTC. Could you try the latest LTC with your example and use case? Note, we will be using a very small amount of memory for caching, but after 20 iterations the memory usage should stabilize. I believe I was able to get stable numbers for this example.
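One rough way to verify the stabilize-after-warmup behavior described above is to record an RSS baseline once the first 20 iterations are done and flag any later growth; a sketch assuming psutil, with an arbitrary tolerance:

```python
import os
import psutil  # assumed third-party dependency

proc = psutil.Process(os.getpid())
WARMUP_STEPS = 20  # per the comment above, caches should be warm by then
baseline_mb = None

def check_memory(step, tolerance_mb=200):
    # Take a baseline at the warmup boundary, then report growth beyond it.
    # The 200 MiB tolerance is an arbitrary illustrative choice.
    global baseline_mb
    rss_mb = proc.memory_info().rss / (1024 ** 2)
    if step == WARMUP_STEPS:
        baseline_mb = rss_mb
    elif baseline_mb is not None and rss_mb - baseline_mb > tolerance_mb:
        print(f"step {step}: RSS grew {rss_mb - baseline_mb:.0f} MiB past the warmup baseline")
```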
@Krovatkin just to check before running with the latest - I'm using this commit
@Krovatkin I tried the latest LTC. How did you build both pytorch and LTC with ASAN?
Part of this memory leak may be related to #64412.
Thanks for the updates @wenzhe-nrv - just checking my understanding:
You mean that the
Helps on lazy tensor as well as eager?
Ok, interesting.
No, not that; jemalloc helped for both cases. My assumption on the memory optimization: Valgrind memcheck reported a leak under the stack of the conv and linear ops (MKL and MKLDNN involved), and its upper stack points to memory allocation during thread creation. The GRU example's valgrind log didn't show anything mkl/mkldnn related, which could be why jemalloc fully fixed the GRU memory leak but the conv example is still not fully stable (less memory usage and smaller jumps, though).
Forgot to mention this, I tried
Ok - so, it sounds like 2 leaks were identified: (a) somewhere in Python (jemalloc helps), and (b) in the mkldnn-conv and mkl-linear ops. I'm not sure if (a) has been fully explained, but it sounds like both (a) and (b) are not LTC related? Are there any more leaks that seem LTC related?
Yea, (a) and (b) are not LTC related. I'm looking into (b) for now. I'll let you know if I find other issues related to LTC.
Hi, we are running ResNet50 FP32 training with lazy tensor core, and we find that the memory keeps growing in each iteration until OOM. To reproduce this issue, I have written a simple test case as below:
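A minimal sketch of the kind of lazy-tensor training loop described, assuming the lazy_tensor_core package from the lazy_tensor_staging branch and its torch_xla-style ltm.mark_step() API (the model and sizes are illustrative, not the original script):

```python
import torch
import torch.nn as nn

# Assumed API shape from the lazy_tensor_staging branch; not the original script.
import lazy_tensor_core
import lazy_tensor_core.core.lazy_model as ltm

lazy_tensor_core._LAZYC._ltc_init_ts_backend()  # initialize the TorchScript backend

device = 'lazy'
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for step in range(10000):
    data = torch.randn(112, 1024, device=device)          # BS112, as in the RN50 repro
    target = torch.randint(0, 10, (112,), device=device)
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()
    ltm.mark_step()  # cut the lazy trace and trigger execution for this iteration
```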
From this test case, I also see the memory growing while running. Is there anything I am misunderstanding when modifying the training test case to use lazy_tensor? BTW: I am using the lazy_tensor_staging branch (7f3d592).