LBFGS always give nan results, why #5953

Open
jyzhang-bjtu opened this issue Mar 23, 2018 · 19 comments
Labels
module: numerical-stability · module: optimizer · needs reproduction · triaged

Comments

@jyzhang-bjtu

I use the LBFGS algorithm, and I found that if maxiter is large enough, e.g., maxiter > 10, the optimizer always gives NaN results. Why?

@ssnl
Collaborator
ssnl commented Mar 25, 2018

Can you give more info?

@zou3519 added the awaiting response (this tag is deprecated) label May 14, 2018
@teytaud
teytaud commented Apr 11, 2019

We confirm that this happens, seemingly in particular when the objective function is noisy (in our case, possibly due to non-determinism in GPU computation). We experienced it, and it is documented here:

A simple fix is, when a NaN is detected, to reset the optimizer (no memory, and starting point = best visited point before the NaN).
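
For illustration, a minimal sketch of that kind of wrapper against the public torch.optim.LBFGS API (hypothetical names; not the actual pytorch_gan_zoo code). It assumes `params` is the list of tensors being optimized and `closure` is the usual LBFGS closure (zero grads, compute the loss, backward, return the loss):

import math
import torch

def lbfgs_with_nan_restart(params, closure, rounds=10):
    """Run LBFGS in short rounds; whenever the loss or a parameter turns NaN,
    restore the best parameters seen so far and rebuild the optimizer so that
    its curvature memory is dropped (the "reset" described above)."""
    def make_opt():
        return torch.optim.LBFGS(params, max_iter=20, line_search_fn="strong_wolfe")

    opt = make_opt()
    best_loss = math.inf
    best_params = [p.detach().clone() for p in params]
    for _ in range(rounds):
        loss = float(opt.step(closure))
        hit_nan = math.isnan(loss) or any(torch.isnan(p).any() for p in params)
        if hit_nan:
            with torch.no_grad():
                for p, best in zip(params, best_params):
                    p.copy_(best)        # restart from the best visited point
            opt = make_opt()             # and forget the L-BFGS history
        elif loss < best_loss:
            best_loss = loss
            best_params = [p.detach().clone() for p in params]
    return best_loss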

@ezyang added the high priority, module: optimizer, module: numerical-stability, and triaged labels and removed the awaiting response (this tag is deprecated) label Apr 14, 2019
@ezyang
Contributor
ezyang commented Jul 8, 2019

cc @vincentqb

@vincentqb
Contributor
vincentqb commented Jul 9, 2019

@jyzhang-bjtu -- do you have an example we could look at?

@teytaud -- The first reference is for SGD, not LBFGS, so the two issues might be different. The second reference refers to a limitation of LBFGS when applied to certain problems. In this case, a modified or different algorithm would need to be used.

@bamos -- have you run into such an issue with LBFGS?

@vincentqb added the needs reproduction label Jul 9, 2019
@teytaud
teytaud commented Jan 30, 2020

The number of users reporting this bug keeps growing; maybe we should integrate the fix.
I've done that with a dirty hack for our needs in pytorch_gan_zoo, namely: if we get a NaN, reboot from the best point so far.
I can help, but to do it properly I would need to discuss with someone who knows the code (I did the hack outside the algorithm, which is not what we want to do...).

@natolambert

I definitely noticed some weird behavior on higher-dimensional spaces where it either could never really step, or gave NaNs. I can go back and try to reproduce if it's helpful. Curious about @teytaud's fix, i.e., whether it makes sense from the math why that would help.

@teytaud
teytaud commented Jan 31, 2020

I am not sure why the NaN appears; I suspected it was related to non-convexities. What is sure is that my fix has been reproduced by other people (in a similarly hackish manner) and they were satisfied.
Unfortunately, as I did not know the LBFGS code and needed a fast fix, I did it in a hackish manner: I simply stopped LBFGS as soon as a NaN appeared and relaunched it from the current point, i.e., my hack sits outside of the LBFGS code (a fast, dirty fix). I think the code using LBFGS in pytorch_gan_zoo was fixed by @Molugan with the same hack.

@HenKlei
HenKlei commented Apr 4, 2020

I also came across this behavior of L-BFGS. Interestingly enough, the problem disappeared when switching from ReLU to the hyperbolic tangent as the activation function. Maybe this has something to do with the non-differentiability of ReLU?

@wly2014
wly2014 commented Apr 25, 2020

I also ran into this behavior of LBFGS after some steps when using it with multiple batches. Could it be that it cannot be used in a multi-batch setting?

@Animadversio

Facing the same issue here...
From my limited knowledge of the algorithm (https://en.wikipedia.org/wiki/Limited-memory_BFGS), I feel that whenever the Hessian matrix is relatively ill-conditioned, there will be some dimensions with a very small y_k that generate a large \rho_k, which explodes during the update of the inverse Hessian and sends the iterate far, far away. If the domain of the function is limited, this could cause a NaN.

Adding to @teytaud's point: even for a convex problem on a limited domain with some really flat dimensions, I think LBFGS can in theory return NaN.
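
For reference, in the standard L-BFGS update (same notation as the Wikipedia article above), the history pair $(s_k, y_k)$ enters the inverse-Hessian estimate through

$$\rho_k = \frac{1}{y_k^\top s_k}, \qquad H_{k+1} = (I - \rho_k s_k y_k^\top)\, H_k\, (I - \rho_k y_k s_k^\top) + \rho_k s_k s_k^\top,$$

so when $y_k^\top s_k$ is tiny (nearly flat or noisy directions), $\rho_k$ and hence the next search direction can blow up, which is consistent with the NaNs reported in this thread.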

@charlesxu90

I'm running into the same problem. Any solutions yet?

Current version of PyTorch: 1.9.1

@liangbright

Another possible cause is the step size t of the strong_wolfe line search:
t is unbounded, and it can sometimes be >> 1, e.g., 1000.
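
If that step size is the culprit, one thing to try (a guess, not a confirmed fix): as far as I can tell, lr is used as the initial trial step handed to the line search, so a small lr at least keeps the starting step small, e.g.:

import torch

# hypothetical parameter tensor, just to make the snippet self-contained;
# note that lr only sets the initial trial step of the line search, it does
# not cap the final step t that strong_wolfe may extrapolate to
w = torch.zeros(10, requires_grad=True)
opt = torch.optim.LBFGS([w], lr=0.1, max_iter=20, line_search_fn="strong_wolfe")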

@mhdadk
mhdadk commented Mar 14, 2022

(Notation below is from algorithm 6.1, page 140, in "Numerical Optimization" by Nocedal et al.)

From my experience, it seems that a NaN occurs in one of two scenarios:

  1. s_k is equal to zero.
  2. The estimate for the inverse Hessian is almost singular.

In the first scenario, the reciprocal of rho_k (i.e., rho_k_inv = y_k^T s_k) will be equal to 0, and dividing by this reciprocal will yield a NaN. A fix for this has been implemented in SciPy here. In the second scenario, because the inverse Hessian estimate is almost singular, the search direction p_k will most likely be very large in magnitude (or very large in one direction and very small in another), which means that rho_k_inv becomes very large too.

Both of these scenarios can be addressed by checking the value of rho_k_inv. If rho_k_inv == 0, then it is likely that s_k is 0. If rho_k_inv > 1e7 (for example), then it is likely that the inverse Hessian estimate is almost singular. If either case is true, we can add some noise to the current iterate x_k to move to a location where the actual Hessian is less likely to be singular (e.g., away from a narrow valley), reset the inverse Hessian estimate to the identity matrix, and then continue iterating. Here is some example code doing this:

"""
if landed in an area where Hessian is likely to be almost singular,
such that rho_k_inv is really big, or when s_k is zero, which may occur
when both the gradient and the inverse Hessian guess are not small,
but the chosen alpha_k through line search is small, restart the inverse
Hessian guess to I and randomly move around so that Hessian is less
singular or s_k is not zero.

See the following links for details
https://scicomp.stackexchange.com/q/29616
https://github.com/scipy/scipy/blob/da64f8ca0ef2353b59994e7e37ecee4e67a9b1d3/scipy/optimize/_optimize.py#L1360
https://github.com/pytorch/pytorch/issues/5953
"""
rho_k_inv = y_k.T @ s_k
if rho_k_inv == 0 or rho_k_inv > 1e7:
    print("restarted here\n")
    # TODO: how much to move around should ideally be inversely proportional to the
    # "size" of the Hessian, so that we do not accidentally jump out
    # of a valley containing the global minimum. Will probably need to change the standard deviation of the noise to account for this
    x_k = x_k + torch.randn_like(x_k) * 1
    x_k.requires_grad = True
    y = f(x_k)
    y.backward()
    nabla_k = x_k.grad.clone()
    x_k.requires_grad = False
    x_k.grad.zero_()
    H_inv = I
    continue
else:
    rho_k = 1 / rho_k_inv

# inverse Hessian estimate computed here

@keyshavmor

Has anyone got a solution to this yet?

@csrqli
csrqli commented Apr 2, 2023

Same issue when using the gradient value of a NeRF's MLP as the objective.

abdrysdale added a commit to abdrysdale/pytorch that referenced this issue Jan 31, 2024
@abdrysdale

I've implemented a fix for this. Does anyone have a simple reproducible example I can use as a test case to check that the fix works?

@mhdadk
mhdadk commented Feb 2, 2024

I've implemented a fix for this. Does anyone have a simple reproducible example I can use as a test case to check that the fix works?

I'm not completely sure of this, but either the test function $f(x,y) = x^2 + 2xy + y^2$ or $g(x,y) = 1.00001x^2 + 2xy + 1.00001y^2$ may work. For $f(x,y)$, the Hessian is singular everywhere, while for $g(x,y)$, the Hessian is almost singular everywhere. Therefore, the estimate of the inverse Hessian should blow up for $f(x,y)$ and be very large for $g(x,y)$.
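
A self-contained sketch of such a test case with torch.optim.LBFGS (whether it actually triggers a NaN may depend on the PyTorch version, starting point, and tolerances):

import torch

# f(x, y) = x^2 + 2*x*y + y^2 has a Hessian that is singular everywhere
xy = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.LBFGS([xy], lr=1.0, max_iter=100, line_search_fn="strong_wolfe")

def closure():
    opt.zero_grad()
    x, y = xy[0], xy[1]
    loss = x ** 2 + 2 * x * y + y ** 2
    loss.backward()
    return loss

loss = opt.step(closure)
print(loss.item(), xy)  # check whether the loss or the iterate contains NaN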

@zJay26
zJay26 commented Mar 29, 2024

I ran into a similar problem when using the L-BFGS algorithm in MATLAB to fit a B-spline curve (the result becomes NaN when the number of iterations is too large). After debugging, I located a division-by-zero error that produces the NaN, namely the line y = q/(rho(pos)* (ygk'*ygk)); where ygk can be the zero vector, so the final divisor is 0. ygk is defined by the following code:

ygk = g-gp;     s = x-xp;
    if ygk'*ygk>1e-20
        istore = istore + 1;
        pos = mod(istore, m); if pos == 0; pos = m; end
        YK(:,pos) = ygk;
        SK(:,pos) = s;
        rho(pos) = 1/(ygk'*s);
        if istore <= m; status = istore; perm = [perm, pos];
        else status = m; perm = [perm(2:m), perm(1)]; end
    end

Note that when ygk is the zero vector, i.e. when ygk'*ygk = 0 <= 1e-20, ygk is not corrected to a proper value. My fix is to execute ygk = YK(:,pos); to correct ygk whenever ygk'*ygk <= 1e-20. After this change, the NaN error no longer appears in my tests.

Addendum: I have not studied the mathematical principles of L-BFGS in depth; the analysis and modification above are based only on my understanding of the code, and the modification may not be mathematically correct. Corrections are welcome.

