LBFGS always give nan results, why #5953
Comments
can you give more info? |
We confirm that this happens, seemingly in particular when the objective function is noisy (in our case possibly due to non-determinism in GPU computation). We experienced it, and it is documented here:
A simple fix is, when a NaN is detected, to reset the optimizer (no memory, and starting point = the best visited point before the NaN). |
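For concreteness, here is a minimal sketch of the reset-on-NaN workaround described above, written against the public torch.optim.LBFGS API; the objective f, the helper make_optimizer, and all tensor names are illustrative, not taken from the issue:

```python
import math
import torch

# Illustrative setup: x is the parameter tensor being optimized, f a stand-in objective.
x = torch.randn(10, requires_grad=True)

def f(t):
    return (t ** 2).sum()

def make_optimizer(params):
    # A fresh LBFGS instance means "no memory": the curvature history is discarded.
    return torch.optim.LBFGS(params, lr=1.0, max_iter=20, history_size=10)

optimizer = make_optimizer([x])
best_x, best_loss = x.detach().clone(), math.inf

for step in range(100):
    def closure():
        optimizer.zero_grad()
        loss = f(x)
        loss.backward()
        return loss

    loss = optimizer.step(closure)

    if torch.isnan(loss) or torch.isnan(x).any():
        # Reset: restart from the best point visited before the NaN, with a fresh optimizer.
        with torch.no_grad():
            x.copy_(best_x)
        optimizer = make_optimizer([x])
    elif loss.item() < best_loss:
        best_loss, best_x = loss.item(), x.detach().clone()
```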
cc @vincentqb |
@jyzhang-bjtu -- do you have an example we could look at? @teytaud -- The first reference is for SGD, not LBFGS, so the two issues might be different. The second reference refers to a limitation of LBFGS when applied to certain problems. In this case, a modified or different algorithm would need to be used. @bamos -- have you run into such an issue with LBFGS? |
The number of users reporting this bug keeps increasing; maybe we should integrate the fix. |
I definitely noticed some weird behavior in higher-dimensional spaces where it either could never really step, or gave NaNs. I can go back and try to reproduce it if that's helpful. Curious about @teytaud's fix, as in, whether it makes sense from the math why that would help. |
I am not sure why the NaN appears; I suspected it was related to non-convexities. In any case, what is sure is that my fix has been reproduced by other people (in a similarly hackish manner) and they were satisfied. |
I also came across this behavior of L-BFGS. Interestingly enough, the problem disappeared when switching from ReLU to the hyperbolic tangent as the activation function. Maybe this has something to do with the non-differentiability of ReLU? |
I also ran into this behavior with LBFGS after some steps when using it across multiple batches; maybe it cannot be used in a multi-batch setting? |
Facing the same issue here... Adding to @teytaud's point: even for a convex problem on a bounded domain with some really flat dimensions, I think LBFGS can return NaN in theory. |
Met with the same problem. Any solutions yet? Current version of PyTorch: 1.9.1 |
Another possible cause is the step size, t, of the strong_wolfe line search. |
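For reference, the strong Wolfe line search is opt-in on torch.optim.LBFGS via the line_search_fn argument; with the default (None) the optimizer takes fixed steps governed by lr, so lr is the knob that effectively controls the step size t. A minimal configuration sketch (the parameter tensor and objective are placeholders):

```python
import torch

x = torch.randn(10, requires_grad=True)   # placeholder parameter

optimizer = torch.optim.LBFGS(
    [x],
    lr=0.5,                           # fixed step size when line_search_fn is None;
                                      # otherwise it scales the trial step of the search
    max_iter=20,
    history_size=10,
    line_search_fn="strong_wolfe",    # the only built-in line search; default is None
)

def closure():
    optimizer.zero_grad()
    loss = (x ** 2).sum()             # placeholder objective
    loss.backward()
    return loss

optimizer.step(closure)
```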
(Notation below is from Algorithm 6.1, page 140, of "Numerical Optimization" by Nocedal and Wright.) From my experience, it seems that a NaN occurs in one of two scenarios: either the iterate lands in a region where the Hessian is nearly singular, so that rho_k_inv = y_k^T s_k becomes very large, or s_k is (numerically) zero, which can happen when the gradient and the inverse Hessian estimate are not small but the line search picks a very small alpha_k. In the first scenario the reciprocal of rho_k_inv is vanishingly small and the inverse Hessian estimate degenerates; in the second it is a division by zero. Both of these scenarios can be addressed by checking the value of rho_k_inv before taking its reciprocal:
```python
"""
if landed in an area where Hessian is likely to be almost singular,
such that rho_k_inv is really big, or when s_k is zero, which may occur
when both the gradient and the inverse Hessian guess are not small,
but the chosen alpha_k through line search is small, restart the inverse
Hessian guess to I and randomly move around so that Hessian is less
singular or s_k is not zero.
See the following links for details
https://scicomp.stackexchange.com/q/29616
https://github.com/scipy/scipy/blob/da64f8ca0ef2353b59994e7e37ecee4e67a9b1d3/scipy/optimize/_optimize.py#L1360
https://github.com/pytorch/pytorch/issues/5953
"""
rho_k_inv = y_k.T @ s_k
if rho_k_inv == 0 or rho_k_inv > 1e7:
    print("restarted here\n")
    # TODO: how much to move around should ideally be inversely proportional to the
    # "size" of the Hessian, so that we do not accidentally jump out of a valley
    # containing the global minimum. Will probably need to change the standard
    # deviation of the noise to account for this.
    x_k = x_k + torch.randn_like(x_k) * 1
    x_k.requires_grad = True
    y = f(x_k)
    y.backward()
    nabla_k = x_k.grad.clone()
    x_k.requires_grad = False
    x_k.grad.zero_()
    H_inv = I
    continue
else:
    rho_k = 1 / rho_k_inv
    # inverse Hessian estimate computed here
```
|
anyone got a solution to this yet? |
Same issue when using the gradient value of a NeRF's MLP as the objective. |
Fixes LBFGS NaN bug (pytorch#5953): this is done to avoid very large gradient steps. Solution similar to scipy: https://github.com/scipy/scipy/blob/da64f8ca0ef2353b59994e7e37ecee4e67a9b1d3/scipy/optimize/_optimize.py#L1360-L1367 |
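For readers not following the link, the scipy safeguard being referenced amounts to substituting a large constant for rho_k when the curvature product y_k^T s_k is exactly zero, rather than dividing by zero. A paraphrase from memory (not a verbatim copy of scipy; the vectors are illustrative):

```python
import numpy as np

# Illustrative BFGS quantities: sk = x_{k+1} - x_k, yk = grad_{k+1} - grad_k.
sk = np.zeros(3)                       # a zero step, the problematic case
yk = np.array([0.1, -0.2, 0.3])

rhok_inv = np.dot(yk, sk)
if rhok_inv == 0.0:
    rhok = 1000.0                      # assume a large rho instead of dividing by zero
else:
    rhok = 1.0 / rhok_inv
```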
I've implemented a fix for this, does anyone have a simple reproducible example I can use as a test case to check the fix works? |
I'm not completely sure of this, but either the test function |
I also ran into a similar problem when using the L-BFGS algorithm in MATLAB to fit B-spline curves (the result becomes NaN when the number of iterations is too large). After debugging I located a divide-by-zero error that causes the NaN, namely:
note that when ygk is the zero matrix, i.e. ygk'*ygk = 0 <= 1e-20, the value of ygk is not corrected to a proper value. My fix is to apply a correction whenever ygk'*ygk <= 1e-20.
|
For ease, I've popped the above into a translator and got the following: I had a similar problem with fitting a B-spline using the L-BFGS algorithm in MATLAB (the result was NaN when the number of iterations was too large), and I found a divide-by-0 error after debugging, which caused the NaN, i.e.

```matlab
ygk = g-gp; s = x-xp;
if ygk'*ygk>1e-20
    istore = istore + 1;
    pos = mod(istore, m); if pos == 0; pos = m; end
    YK(:,pos) = ygk;
    SK(:,pos) = s;
    rho(pos) = 1/(ygk'*s);
    if istore <= m; status = istore; perm = [perm, pos];
    else status = m; perm = [perm(2:m), perm(1)]; end
end
```

Notice that when ygk is the zero matrix, i.e. ygk'*ygk = 0 <= 1e-20, ygk is not corrected to a proper value; the fix is to apply a correction whenever ygk'*ygk <= 1e-20.
|
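A common way to realize the guard described in the comment above inside an L-BFGS memory update is to skip the curvature pair altogether when the products are tiny, so that rho = 1/(ygk'*s) can never overflow or become NaN. A rough Python sketch of that idea; the names mirror the MATLAB snippet, everything else is illustrative:

```python
import numpy as np
from collections import deque

m = 10                                           # history size, as in the MATLAB snippet
SK, YK, rho = deque(maxlen=m), deque(maxlen=m), deque(maxlen=m)

def update_memory(x, xp, g, gp, eps=1e-20):
    """Store the L-BFGS curvature pair (s, ygk) only when it is numerically safe."""
    s, ygk = x - xp, g - gp
    # Guard both ygk'*ygk (as in the MATLAB code) and ygk'*s, the actual
    # denominator of rho, so the division can never blow up.
    if np.dot(ygk, ygk) > eps and abs(np.dot(ygk, s)) > eps:
        SK.append(s)
        YK.append(ygk)
        rho.append(1.0 / np.dot(ygk, s))
    # otherwise the pair is dropped and the existing memory is left untouched

# Example call with a zero step -- the case that previously divided by zero:
update_memory(np.zeros(3), np.zeros(3), np.array([0.1, -0.2, 0.3]), np.array([0.1, -0.2, 0.3]))
```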
I am using the LBFGS algorithm and found that if maxiter is large enough, i.e., maxiter > 10, the optimizer always gives NaN results. Why?
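For context, this is the closure-based usage pattern the report describes, instrumented to show where a NaN first appears; the quadratic objective is only a placeholder, not a confirmed reproduction of the bug:

```python
import torch

x = torch.randn(100, requires_grad=True)                   # placeholder parameters
optimizer = torch.optim.LBFGS([x], lr=1.0, max_iter=20)    # max_iter > 10, as in the report

def closure():
    optimizer.zero_grad()
    loss = (x ** 2).sum()                                   # placeholder objective
    if torch.isnan(loss):
        raise FloatingPointError("NaN in the objective itself")
    loss.backward()
    if torch.isnan(x.grad).any():
        raise FloatingPointError("NaN in the gradient")
    return loss

loss = optimizer.step(closure)
print("loss:", loss.item(), "| params contain NaN:", torch.isnan(x).any().item())
```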