torch.nn.functional.kl_div fails gradgradcheck if the target requires a gradient · Issue #65466 · pytorch/pytorch · GitHub

torch.nn.functional.kl_div fails gradgradcheck if the target requires a gradient #65466


Closed
pmeier opened this issue Sep 22, 2021 · 11 comments
Assignees
Labels
module: autograd Related to torch.autograd, and the autograd engine in general module: nn Related to torch.nn triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@pmeier
Collaborator
pmeier commented Sep 22, 2021

🐛 Bug

torch.nn.functional.kl_div fails gradgradcheck if the target requires a gradient.

To Reproduce

import torch
from torch.autograd import gradgradcheck
from torch.nn.functional import softmax, log_softmax, kl_div

torch.manual_seed(0)
# input should be log probabilities
input = log_softmax(torch.randn(3, dtype=torch.float64, requires_grad=True), dim=-1)
# target should be probabilities
target = softmax(torch.randn_like(input, requires_grad=True), dim=-1)

gradgradcheck(kl_div, inputs=(input, target))
torch.autograd.gradcheck.GradcheckError: Jacobian mismatch for output 0 with respect to input 1,
numerical:tensor([[-0.1344,  0.0000,  0.0000],
        [ 0.0000, -0.1344,  0.0000],
        [ 0.0000,  0.0000, -0.1344]], dtype=torch.float64)
analytical:tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]], dtype=torch.float64)

The error only shows up if the target requires a gradient.
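For comparison (an illustrative sketch, not part of the original report): a reference implementation of kl_div written purely in terms of differentiable primitives passes the same check, which suggests the problem is in the hand-written backward formula rather than in the math. The helper name kl_div_ref is made up for this sketch.

```python
import torch
from torch.autograd import gradgradcheck
from torch.nn.functional import softmax, log_softmax

def kl_div_ref(input, target):
    # Pointwise KL term as PyTorch defines it: target * (log(target) - input),
    # with `input` given as log-probabilities, mean-reduced like the default
    # reduction='mean'. Built from primitives, so autograd derives all
    # higher-order gradients itself instead of using a custom backward.
    return (target * (target.log() - input)).mean()

torch.manual_seed(0)
input = log_softmax(torch.randn(3, dtype=torch.float64, requires_grad=True), dim=-1)
target = softmax(torch.randn_like(input, requires_grad=True), dim=-1)

gradgradcheck(kl_div_ref, inputs=(input, target))
```

Unlike the snippet above with torch.nn.functional.kl_div, this raises no GradcheckError, because autograd differentiates through target.log() for the target gradient automatically.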

Additional context

cc @ezyang @albanD @zou3519 @gqchen @pearu @nikitaved @soulitzer @lezcano @Varal7 @mruberry @jbschlosser @walterddr

@pmeier pmeier added module: autograd Related to torch.autograd, and the autograd engine in general module: nn Related to torch.nn labels Sep 22, 2021
@nikitaved
Collaborator
nikitaved commented Sep 22, 2021

It is not a bug, I guess; the issue is that the tensors are created with type float rather than double.

@pmeier
Collaborator Author
pmeier commented Sep 22, 2021

My bad, I need to recheck why the gradcheck tests in test/test_ops.py failed. They are run in float64.

@ezyang ezyang added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label Sep 22, 2021
@pmeier pmeier changed the title torch.nn.functional.kl_div fails gradcheck torch.nn.functional.kl_div fails gradgradcheck if the target requires a gradient Sep 22, 2021
@pmeier
Collaborator Author
pmeier commented Sep 22, 2021

@nikitaved

It is not a bug I guess, the issue is that tensors of type float are created, rather than of double.

You were right about that. The tests in test_ops.py failed because I used an unreasonable lower limit for the input values. Fixing that, a gradgradcheck error now shows up. I've edited the original comment with the new reproduction / error message.

@nikitaved
Collaborator
nikitaved commented Sep 22, 2021

OK, it looks like the backward of kl_div has a custom backward implemented which sets the grad wrt target (parameter 1) to zero, as can be seen here:

- name: kl_div_backward(Tensor grad_output, Tensor self, Tensor target, int reduction=Mean, *, bool log_target=False) -> Tensor

Which, I guess, is wrong, as log is infinitely many times differentiable with a non-zero derivative for positive values.
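To make the point concrete (an illustrative check, not from the thread): for the pointwise term ℓ(x, t) = t·(log t − x), the partial derivatives wrt the target are ∂ℓ/∂t = log t − x + 1 and ∂²ℓ/∂t² = 1/t, which is nonzero for positive t. Building the loss from primitives lets autograd confirm both:

```python
import torch

torch.manual_seed(0)
x = torch.randn(3, dtype=torch.float64)                   # log-probabilities (input)
t = torch.rand(3, dtype=torch.float64).requires_grad_()   # positive target values

# Pointwise KL term as PyTorch defines it: t * (log t - x), sum-reduced.
loss = (t * (t.log() - x)).sum()

# First derivative wrt target: log t - x + 1
(g,) = torch.autograd.grad(loss, t, create_graph=True)
assert torch.allclose(g, t.log() - x + 1)

# Second derivative wrt target: 1 / t, clearly nonzero for positive t
(h,) = torch.autograd.grad(g.sum(), t)
assert torch.allclose(h, 1 / t)
```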
@pmeier , unless you want to have a look into this, I could try to resolve it.

@lezcano
Collaborator
lezcano commented Sep 22, 2021

Note: Interestingly enough, kl_div_backward does have the second derivative implemented by hand, but the derivative wrt the other input Q, that is, kl_div_target_backward, is implicitly differentiable. This is a bit odd.
Given how simple the implementation of the backward is using TensorIterator, we could consider implementing the backward and double backward for kl_div with it.

@nikitaved
Collaborator
nikitaved commented Sep 22, 2021

In order to fix that, we need to implement a kl_div_backward which considers input and target jointly (as a function of two variables), without the assumption that either of them is constant, as in this case, where only partial derivatives are defined. Many loss functions suffer from this issue of having a wrong double backward, because the double backward of a forward function depends on both partial derivatives, grad_input and grad_target, not just a single one of them.
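As an illustrative sketch of treating the two arguments jointly (the helper names kl_div_fwd and kl_div_joint_backward are made up, for reduction='sum'), the backward should produce both partial derivatives from the same formula:

```python
import torch

def kl_div_fwd(input, target):
    # Sum-reduced KL: sum(target * (log(target) - input)),
    # with `input` given as log-probabilities.
    return (target * (target.log() - input)).sum()

def kl_div_joint_backward(grad_output, input, target):
    # Both partial derivatives of the joint function of (input, target):
    grad_input = -target * grad_output                       # d/d input
    grad_target = (target.log() - input + 1) * grad_output   # d/d target
    return grad_input, grad_target

# Check both partials against autograd.
torch.manual_seed(0)
x = torch.randn(3, dtype=torch.float64, requires_grad=True)
t = torch.rand(3, dtype=torch.float64, requires_grad=True)
loss = kl_div_fwd(x, t)
gx, gt = torch.autograd.grad(loss, (x, t))
ref_gx, ref_gt = kl_div_joint_backward(
    torch.tensor(1.0, dtype=torch.float64), x, t
)
assert torch.allclose(gx, ref_gx)
assert torch.allclose(gt, ref_gt)
```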

@pmeier pmeier self-assigned this Sep 22, 2021
@mruberry
Collaborator

@albanD what do you think?

@albanD
Collaborator
albanD commented Sep 23, 2021

Many loss functions suffer from this issue of having a wrong double backward, because the double backward of a forward function depends on both partial derivatives, grad_input and grad_target, not just a single one of them.

I can see individual formulas being wrong, but I don't think it is a problem to have the formulas separated out.
You can see in the image below that both partial derivatives properly influence both inputs.
Or am I missing something?

For example if you do

import torch
# !pip install torchviz
import torchviz


a = torch.rand(1, 10, requires_grad=True)
t = torch.rand(1, 10, requires_grad=True)

loss = torch.kl_div(a, t).sum()

ga, gt = torch.autograd.grad(loss, (a, t), create_graph=True)

torchviz.make_dot((loss, ga, gt), params={k:v for k,v in locals().items() if isinstance(v, torch.Tensor)})

[image: torchviz graph of loss, ga, and gt]

@lezcano
Collaborator
lezcano commented Sep 23, 2021

fwiw, @pmeier will submit a patch for this later today / tomorrow. We found that having the formulas separated was really the way to go.

On a completely unrelated topic, that's some very cool graph @albanD :D

@nikitaved
Collaborator
nikitaved commented Sep 23, 2021

@albanD, when only partial derivatives are defined and I want to double backward, the effects of the differentiated backwards will be accumulated, right? Judging from how kl_div_backward is implemented, it sets the derivative for input, and the backward of kl_div_backward sets the grad wrt target to zero; yet kl_div_backward is a function of target, so it must have a non-zero derivative (neither kl_div nor its derivatives are constant functions wrt either target or input), or am I missing something? So, apparently, the engine does accumulate grads from the backwards of partial derivatives...

@albanD
Collaborator
albanD commented Sep 23, 2021

@nikitaved each backward formula is only responsible to provide the gradient flowing through the forward function they define. If some other function is also using target, then this other function is responsible for computing that part of the gradient.

the engine does accumulate grads from the backwards of partial derivatives

Yes, if a variable is re-used multiple times, the engine will make sure that the gradients from all the usages are added before processing further.
In the graph above, you can see 3 arrows coming out of t's AccumulateGrad Node. That's because it is used by 3 different functions.
Most of the complexity of the execution engine is to actually know how many of these arrows we should expect gradients from for a given backward() call and accumulate them properly.
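The accumulation behaviour described above can be seen in a minimal sketch (not from the thread): a tensor consumed by two independent functions receives the sum of both gradient contributions.

```python
import torch

t = torch.ones(3, requires_grad=True)

# t is consumed by two independent functions; the engine sums the
# gradient contributions flowing back through both before writing t.grad.
loss = (2 * t).sum() + (t * t).sum()
loss.backward()

# d/dt [2t] = 2 and d/dt [t^2] = 2t = 2 at t = 1; accumulated: 4
assert torch.equal(t.grad, torch.full((3,), 4.0))
```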

@pmeier pmeier linked a pull request Sep 27, 2021 that will close this issue
@pmeier pmeier linked a pull request Dec 1, 2021 that will close this issue
facebook-github-bot pushed a commit that referenced this issue Jun 10, 2022
Adds forward-over-reverse AD support. (#79007)

Summary:
Fixes #78867,
fixes #65466.
Adds forward-over-reverse AD support.

Pull Request resolved: #79007
Approved by: https://github.com/soulitzer, https://github.com/jbschlosser

Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/72ad222cff59cbe730a49dd828cb0a25d2a18417

Reviewed By: osalpekar

Differential Revision: D37058939

Pulled By: osalpekar

fbshipit-source-id: 28ee709c47bc5fcb82ae31dd4a30e9ecac573709