Use SimpleDistributedPerLayerClipping optimizer in hooks mode #750
Summary:
We use `SimpleDistributedPerLayerOptimizer` instead of `DistributedPerLayerOptimizer`. The latter breaks when switching to `register_full_backward_hook`, because it registers per-parameter hooks on top of the per-module hooks. During the backward pass, the per-parameter hooks fire before the per-module hooks. Per-sample gradients are only computed when the per-module hooks fire, so the per-parameter hooks raise an error when they try to access per-sample gradients that do not yet exist. PyTorch provides no way to force the order in which hooks are called.
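
For context, a minimal standalone sketch (not part of this diff) of the ordering problem: a hook registered directly on a parameter fires as soon as that parameter's gradient is produced, while a `register_full_backward_hook` module hook fires later, once gradients w.r.t. the module's inputs are ready.

```python
import torch
import torch.nn as nn

order = []
layer = nn.Linear(4, 2)

# Per-module hook: fires after gradients w.r.t. the module's inputs are computed.
layer.register_full_backward_hook(
    lambda module, grad_input, grad_output: order.append("module hook")
)

# Per-parameter hook: fires as soon as the gradient for this tensor is produced.
layer.weight.register_hook(lambda grad: order.append("parameter hook"))

x = torch.randn(3, 4, requires_grad=True)
layer(x).sum().backward()
print(order)  # typically ['parameter hook', 'module hook']
```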
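Given that constraint, the change effectively amounts to selecting the simple variant for hooks mode as well. A hypothetical illustration of the dispatch — the class name comes from `opacus.optimizers`, but the function below is a sketch, not the actual Opacus selection code:

```python
# Hypothetical sketch -- not the actual Opacus dispatch code.
from opacus.optimizers import SimpleDistributedPerLayerOptimizer

def pick_distributed_per_layer_optimizer(grad_sample_mode: str):
    """Return the optimizer class for distributed per-layer clipping."""
    if grad_sample_mode in ("hooks", "ew"):
        # Previously, "hooks" mode used DistributedPerLayerOptimizer, whose
        # per-parameter hooks fire before per-sample gradients exist under
        # register_full_backward_hook; both modes now use the simple variant.
        return SimpleDistributedPerLayerOptimizer
    raise ValueError(f"Unsupported grad_sample_mode: {grad_sample_mode}")
```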
Differential Revision: D72420168