Adding fsdp fp16 and bf16 hooks #80557
Conversation
Dr. CI: ✅ No failures (0 pending) as of commit b9ae5a8. 💚 Looks good so far! There are no failures yet.
Great work! A couple of suggestions
```
optim_hook = torch.optim.SGD(fsdp_with_hook.parameters(), lr=0.1)
optim_mp = torch.optim.SGD(fsdp_with_mp.parameters(), lr=0.1)

in_data = torch.rand(16, 8).cuda()
```
nit: `torch.rand(16, 8, device='cuda')`
```
fsdp_with_hook.train()
fsdp_with_mp.train()
optim_mp.zero_grad()
optim_hook.zero_grad()
```
I don't think zero_grad does anything here, right? We haven't accumulated any grads yet.
```
dist.barrier()

for hook_params, mp_params in zip(fsdp_with_hook.parameters(), fsdp_with_mp.parameters()):
```
nit: `hook_param`, `mp_param`, since each is a single param.
```
if state.gradient_postdivide_factor > 1:
    grad.div_(state.gradient_postdivide_factor)

def fp16_compress_hook(state: LowPrecisionState, grad: torch.Tensor):
```
I think we can consolidate the fp16 and bf16 hooks, for example:

```python
def reduced_precision_hook(prec, state, grad):
    grad.data = grad.data.to(prec)
    allreduce_hook(state, grad)
    _decompress(state, grad)


def fp16_compress_hook(state, grad):
    fp16_hook = functools.partial(reduced_precision_hook, torch.float16)
    return fp16_hook(state, grad)
```

and the same for bfloat16.
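For illustration, the bfloat16 counterpart would then be just another partial. This is a sketch assuming the `reduced_precision_hook`, `allreduce_hook`, and `_decompress` helpers from the suggestion above, not necessarily the final API:

```python
import functools

import torch


def bf16_compress_hook(state, grad):
    # Same pattern as fp16_compress_hook above, just targeting bfloat16.
    bf16_hook = functools.partial(reduced_precision_hook, torch.bfloat16)
    return bf16_hook(state, grad)
```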
Thanks for the example!
```
process_group (ProcessGroup): The process group to be used for all-reduce.
world_size (int): The number of workers in a process group.
    Determined based on a ``process_group``.
gradient_predivide_factor (float): A factor for gradients' pre-division.
```
I guess these aren't actually args to `DefaultState`, so we can remove them from the docs?
```
if (
    self._mixed_precision_enabled_for_params() or self._mixed_precision_enabled_for_reduce()
    not self._low_precision_hook_enabled() and
```
We should think about how to refactor this, e.g. if reduce_dtype is available, just register a reduced-precision communication hook, but that requires a bit more fleshing out. Might be good to file a follow-up issue.
I will create a follow-up issue after this PR lands.
We check `not self._low_precision_hook_enabled` on L2980, so this check should always be true within this branch, right?
@rohan-varma right, thanks for catching this
@rohan-varma after looking into this: the first check (now on L3024) is for pre-casting gradients in all cases. This line is for re-casting gradients back to full precision for `NO_SHARD`. It does not belong to the body of the `if` on L3023, so this check is still needed. Otherwise `comm_hook` will cast the grads back (and occupy memory for `orig_param_grad_data`), and after `comm_hook` is done, without this check here, we would attempt a second re-cast.
Sounds good
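To spell out the condition being discussed, here is a standalone restatement as a hypothetical helper (for illustration only; FSDP itself evaluates these flags inline, and the names mirror the diff context above):

```python
def should_recast_grad_to_full_precision(
    low_precision_hook_enabled: bool,
    mixed_precision_for_params: bool,
    mixed_precision_for_reduce: bool,
) -> bool:
    # Re-cast gradients back to full precision only when FSDP's own
    # MixedPrecision path performed the down-cast. A registered
    # low-precision comm hook decompresses its own gradients, so
    # re-casting here as well would do the work twice and allocate
    # an extra full-precision copy.
    return not low_precision_hook_enabled and (
        mixed_precision_for_params or mixed_precision_for_reduce
    )
```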
```
def __init__(
    self,
    process_group,
    parameter_type,
```
Let's just default this to `torch.float32`, which is the usual case, so the user doesn't have to specify it every time?
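A minimal sketch of what that defaulted signature could look like. This is a simplified, standalone stand-in for illustration; the real state class in this PR also carries the process-group bookkeeping and division factors from `DefaultState`:

```python
import torch


class LowPrecisionState:
    """Illustrative stand-in for the PR's low-precision hook state."""

    def __init__(self, process_group, parameter_type=torch.float32):
        self.process_group = process_group
        # Precision that gradients are cast back to after communication;
        # defaults to torch.float32 since that is the usual parameter dtype.
        self.parameter_type = parameter_type
```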
@pytorchbot rebase

@pytorchbot successfully started a rebase job. Check the current status here.

Successfully rebased 6e003f0 to c208bef.
@pytorchbot rebase

@pytorchbot successfully started a rebase job. Check the current status here.

Successfully rebased 51e01a4 to 386d614.
LGTM, one non-minor comment, please take a look at it.
```
state (DefaultState): State information, configures pre- and post-division factors
grad (torch.Tensor): A gradient for the local batch that needs to be communicated across ranks in a lower precision.
"""
bp16_hook = functools.partial(lower_precision_hook, torch.bfloat16)
```
nit: bf16_hook
@pytorchbot rebase

@pytorchbot successfully started a rebase job. Check the current status here.

Successfully rebased 386d614 to 65d2675.
@pytorchbot rebase

@pytorchbot successfully started a rebase job. Check the current status here.

Successfully rebased 1b0ad0e to b9ae5a8.
@pytorchbot merge

@pytorchbot successfully started a merge job. Check the current status here.

Hey @aovladi.
@pytorchbot revert -m "broke distributed tests on trunk"

❌ 🤖 pytorchbot command failed.

@pytorchbot revert -m "broke distributed tests on trunk" -c weird

@pytorchbot successfully started a revert job. Check the current status here.

@aovladi your PR has been successfully reverted.
This reverts commit f7d6828. Reverted #80557 on behalf of https://github.com/aovladi due to broke distributed tests on trunk
Recently, `register_comm_hook` was introduced to `FSDP`. At the moment it supports only the `NO_SHARD` strategy and ships with a default `all_reduce` hook. This PR adds two lower-precision hooks alongside the existing default hook.

I've also made slight adjustments to the existing implementation of the `all_reduce` hook in `default_hooks.py`, including renaming `AllReduceState` to `DefaultState`. Motivation: `AllReduceState` is not specific to `all_reduce`; gradients' pre- and post-division factors are also useful for other hooks that require pre- and post-division, e.g. the fp16 and bf16 hooks.

Additionally, `FSDP` supports `MixedPrecision`, so in principle a user could specify `MixedPrecision` for gradients and also attach a lower-precision hook to the model. To avoid double-casting, I've added a couple of checks to `fully_sharded_data_parallel`, so that casting to the lower precision and back is performed by the lower-precision hook only. As a next step, I think it would be nice to ensure that a user can't have both a lower-precision hook and `MixedPrecision(reduce_dtype=<precision>)` specified, but I am happy to discuss this and adjust the current implementation.

As a test, I create two models, one with a lower-precision hook and one with `MixedPrecision(reduce_dtype=<precision>)` specified, perform one forward/backward pass and optimizer step, and compare gradients.
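A hedged sketch of that parity test is below. It assumes a process group is already initialized and that the hook and state names match this PR; treat the exact import paths, the `register_comm_hook` signature, and the model shapes as assumptions rather than a reference to the actual test in the PR.

```python
import copy

import torch
import torch.distributed as dist
from torch.distributed.algorithms._comm_hooks import default_hooks
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import MixedPrecision, ShardingStrategy


def check_low_precision_hook_matches_mixed_precision():
    net = torch.nn.Linear(8, 4).cuda()

    # Model 1: NO_SHARD FSDP with an explicit fp16 communication hook.
    fsdp_with_hook = FSDP(
        copy.deepcopy(net), sharding_strategy=ShardingStrategy.NO_SHARD
    )
    state = default_hooks.LowPrecisionState(process_group=dist.group.WORLD)
    fsdp_with_hook.register_comm_hook(state, default_hooks.fp16_compress_hook)

    # Model 2: same module, but gradients reduced in fp16 via MixedPrecision.
    fsdp_with_mp = FSDP(
        copy.deepcopy(net),
        sharding_strategy=ShardingStrategy.NO_SHARD,
        mixed_precision=MixedPrecision(reduce_dtype=torch.float16),
    )

    optim_hook = torch.optim.SGD(fsdp_with_hook.parameters(), lr=0.1)
    optim_mp = torch.optim.SGD(fsdp_with_mp.parameters(), lr=0.1)

    in_data = torch.rand(16, 8, device="cuda")
    for model, optim in ((fsdp_with_hook, optim_hook), (fsdp_with_mp, optim_mp)):
        model.train()
        model(in_data).sum().backward()
        optim.step()

    # After one step, parameters (and hence the applied gradients) should
    # agree, since both models reduced gradients in fp16. Tolerances may
    # need loosening depending on the precision involved.
    for hook_param, mp_param in zip(
        fsdp_with_hook.parameters(), fsdp_with_mp.parameters()
    ):
        assert torch.allclose(hook_param, mp_param)
```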