
[DCP] Allow for rank-specific tensors with duplicate keys #146566

Open

cassanof opened this issue Feb 6, 2025 · 3 comments
Labels
oncall: distributed checkpointing - Oncall label should be attached to any issues related to distributed checkpointing.
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module.

Comments

cassanof (Contributor) commented Feb 6, 2025

🚀 The feature, motivation and pitch

My understanding of DCP is that it assumes either DTensor or fully replicated tensors in the state dict. I have a custom sharding implementation that doesn't use DTensor, so I needed to write a custom SavePlanner class that gathers each shard before saving.
The logic for loading is even uglier, as I need to modify the metadata object. For some other tensors it's even worse, because it's not clear how to gather them at all (e.g. torchao's TorchAOBaseTensor, used for AdamWFp8); I haven't found a workaround for this.
It would be great if there were an option to save a checkpoint in which some tensors are rank-specific and don't need to be gathered.
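For illustration, a minimal sketch of the gather-before-save workaround described above. The key name `my_param`, the shard shape, the checkpoint path, and the assumption of equal-sized row shards are all hypothetical, a process group is assumed to already be initialized (e.g. via torchrun), and `dcp.save(..., checkpoint_id=...)` assumes a recent PyTorch release:

```python
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

def gather_full_tensor(local_shard: torch.Tensor, dim: int = 0) -> torch.Tensor:
    """All-gather equal-sized per-rank shards and concatenate them along `dim`."""
    world_size = dist.get_world_size()
    shards = [torch.empty_like(local_shard) for _ in range(world_size)]
    dist.all_gather(shards, local_shard.contiguous())
    return torch.cat(shards, dim=dim)

# `local_shard` stands in for a shard produced by the custom sharding scheme.
local_shard = torch.randn(1024, 1024, device="cuda")

# Every rank now holds the same replicated tensor, which DCP dedupes at save
# time, but the full tensor has to fit in memory on every rank.
state_dict = {"my_param": gather_full_tensor(local_shard)}
dcp.save(state_dict, checkpoint_id="checkpoint/step_1000")
```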

Alternatives

No response

Additional context

No response

cc @LucasLLC @pradeepfn @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

malfet added the oncall: distributed label on Feb 6, 2025
yifuwang (Collaborator) commented:

cc @fegin

ekr0 added the oncall: distributed checkpointing label and removed the oncall: distributed label on Feb 13, 2025
jbschlosser added the triaged label on Feb 14, 2025
saumishr (Contributor) commented Apr 8, 2025

@cassanof DCP makes no assumptions about the parallelization or replication. Once the state dict is provided, it's saved using SPMD with dedupe and loaded with re-sharding if needed. Therefore, whatever tensors are provided to the ranks will get saved accordingly. Both of the options should work.
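Independent of which option is chosen, here is a minimal sketch of the SPMD save-with-dedupe and load-with-re-sharding flow described above, using DTensor (which DCP understands natively). The mesh size, shapes, key name, and checkpoint path are illustrative, and the `torch.distributed.tensor` import assumes PyTorch 2.5 or newer (older releases expose it under `torch.distributed._tensor`):

```python
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

# Assumes torchrun with 8 processes and an initialized process group.
mesh = init_device_mesh("cuda", (8,))

# Shard a tensor across the mesh; each rank holds only its local shard.
full = torch.randn(4096, 4096)
dt = distribute_tensor(full, mesh, placements=[Shard(0)])

# Each rank contributes its own shard; duplicates across ranks are deduped at save.
dcp.save({"w": dt}, checkpoint_id="ckpt")

# On load, DCP re-shards into whatever placement the destination DTensor uses,
# even if it differs from the placement used at save time.
dest = distribute_tensor(torch.empty(4096, 4096), mesh, placements=[Shard(1)])
dcp.load({"w": dest}, checkpoint_id="ckpt")
```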

cassanof (Contributor, Author) commented Apr 9, 2025

Hey @saumishr, thanks for the pointer.

Option #1 wouldn't work for us because the fully-replicated state won't fit in memory.

For option #2, I previously built a custom save planner for this exact purpose: it all-gathers the shards before saving them. However, the main problem is loading: it's unclear to me how to do the reverse operation, i.e. create a load planner that loads only a view of the fully-replicated tensor.
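A minimal sketch of the naive version of that reverse operation, assuming equal row-sharding and the same hypothetical key and checkpoint path as above: it loads the fully replicated tensor and then keeps only a per-rank view, so it still materializes the full tensor on every rank first.

```python
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp

rank, world_size = dist.get_rank(), dist.get_world_size()

# Allocate a buffer matching the saved (replicated) shape and load into it in-place.
full = torch.empty(1024, 1024)
dcp.load({"my_param": full}, checkpoint_id="checkpoint/step_1000")

# Keep only this rank's row-slice; the full tensor was still materialized on
# every rank, which is exactly the memory problem described above.
rows = full.size(0) // world_size
local_shard = full.narrow(0, rank * rows, rows).clone()
del full
```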

Due to these issues, we decided to roll our own checkpointing logic, but it would be great to come back to DCP if there is a workaround for this.

Projects
None yet
Development

No branches or pull requests

6 participants