Tags: StellarrZ/torchrec
typing for decorators - fx/_compatibility (pytorch#2322) Summary: X-link: ctrl-labs/src2#33884 X-link: pytorch/executorch#4810 Pull Request resolved: pytorch#2322 X-link: pytorch/pytorch#134054 See #131429 Reviewed By: laithsakka Differential Revision: D61493706 fbshipit-source-id: d2b3feeff2abf8610e4e9940a1b93b5f80777dc2
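The pull request above adds typing to the `fx/_compatibility` decorator. As a generic illustration of typing a decorator factory with `ParamSpec` (a sketch of the pattern only, not the actual PyTorch implementation):
```python
from typing import Callable, TypeVar
from typing_extensions import ParamSpec

P = ParamSpec("P")
R = TypeVar("R")

# A typed decorator factory: the wrapped function keeps its original parameter
# and return types, so type checkers can see through the decorator.
def compatibility(is_backward_compatible: bool) -> Callable[[Callable[P, R]], Callable[P, R]]:
    def decorator(fn: Callable[P, R]) -> Callable[P, R]:
        return fn
    return decorator

@compatibility(is_backward_compatible=True)
def trace(root: object) -> object:
    return root
```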
Don't skip_torchrec when using torchrec PT2 pipeline (pytorch#2298) Summary: Pull Request resolved: pytorch#2298 We introduced `torch._dynamo.config.skip_torchrec` to control whether Dynamo traces into torchrec paths. The PT2 pipeline is mainly used for torchrec PT2 compilation, so it should set `skip_torchrec` to False by default. Reviewed By: IvanKobzarev Differential Revision: D61219995 fbshipit-source-id: fa68455f0087afc1d444de70d1f26944f22d355f
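A minimal sketch of what the flag means for user code; the config flag is the one named in the commit, while the model and compile call are illustrative stand-ins:
```python
import torch
import torch.nn as nn

# With the change above, the torchrec PT2 pipeline flips this flag itself; when
# compiling manually, the same idea applies: let Dynamo trace into torchrec paths.
torch._dynamo.config.skip_torchrec = False

model = nn.Linear(8, 4)  # placeholder for a real (sharded) torchrec model
compiled = torch.compile(model)
out = compiled(torch.randn(2, 8))
```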
fix pt2 test bug Summary:
# context
* the test runs fine internally, but the OSS run failed: {F1799786365}
```
FAILED torchrec/distributed/tests/test_pt2_multiprocess.py::TestPt2Train::test_compile_multiprocess_fake_pg - hypothesis.errors.InvalidArgument: Using `settings` on a test without `given` is completely pointless.
```
* after removing the extra `settings` line, the OSS test skips fake_pg (CUDA is needed, so the CPU-only run is skipped) {F1799792288}
Reviewed By: gnahzg Differential Revision: D50747500 fbshipit-source-id: f6b187cfda42940acdb6fcd469e8904c476385f3
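For context, a minimal illustration of the Hypothesis rule behind the error above (test and class names are made up, not the actual torchrec test):
```python
import unittest
from hypothesis import given, settings, strategies as st

class ExampleTest(unittest.TestCase):
    # `settings` only makes sense together with `given`; applying it to a test
    # with no `given` raises hypothesis.errors.InvalidArgument ("Using `settings`
    # on a test without `given` is completely pointless."), which is why the diff
    # removes the stray decorator line.
    @settings(deadline=None)
    @given(st.integers())
    def test_with_given(self, x: int) -> None:
        self.assertIsInstance(x, int)
```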
fix torchrec test failure (pytorch#2269) Summary: Pull Request resolved: pytorch#2269 Fixes T192448049. The module call forms an unusual call stack for the nodes: https://www.internalfb.com/phabricator/paste/view/P1507230978. This is currently not supported by the unflattener and needs some extra design work. We'll comment it out for now. A TODO is also added to the unflattener code base in D60528900. Reviewed By: PaulZhang12 Differential Revision: D60682384 fbshipit-source-id: 6633932269918496c1f53e7c600599ecff361f4d
Open Slots API (pytorch#2249) Summary: Pull Request resolved: pytorch#2249 Adds the concept of open slots to show users whether inserts are just growing the table or actually replacing existing ids. Also fixes the default input_hash_size to max int64 (2**63 - 1). Example logs when insert only: {F1774359971} vs replacement: {F1774361026} Reviewed By: iamzainhuda Differential Revision: D59931393 fbshipit-source-id: 3d46198f5e4d2bedbeaee80886b64f8e4b1817f1
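A rough sketch of the idea, assuming a fixed-capacity id-mapping table; the helper and its parameters are hypothetical, not the torchrec API:
```python
# Hypothetical helper: if the number of free (open) slots covers the incoming
# new ids, the update only grows the table; otherwise existing ids get replaced.
def classify_update(table_size: int, used_ids: int, new_ids: int) -> str:
    open_slots = table_size - used_ids
    return "insert-only" if new_ids <= open_slots else "replacement"

print(classify_update(table_size=1000, used_ids=900, new_ids=50))   # insert-only
print(classify_update(table_size=1000, used_ids=900, new_ids=200))  # replacement
```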
Enable prefetch stage for StagedTrainPipeline (pytorch#2239) Summary: Pull Request resolved: pytorch#2239
Add the ability to run prefetch as a stage in `StagedTrainPipeline`.
Recommended usage to run a 3-stage pipeline with data copy, sparse dist and prefetch steps (changes required shown with arrows):
```
sdd = SparseDataDistUtil(
    model=self._model,
    data_dist_stream=torch.cuda.Stream(),
    prefetch_stream=torch.cuda.Stream(),  # <--- define prefetch stream
)

pipeline = [
    PipelineStage(
        name="data_copy",
        runnable=lambda batch, context: batch.to(
            self._device, non_blocking=True
        ),
        stream=torch.cuda.Stream(),
    ),
    PipelineStage(
        name="start_sparse_data_dist",
        runnable=sdd.start_sparse_data_dist,
        stream=sdd.data_dist_stream,
        fill_callback=sdd.wait_sparse_data_dist,
    ),
    PipelineStage(
        name="prefetch",
        runnable=sdd.prefetch,  # <--- add stage with runnable=sdd.prefetch
        stream=sdd.prefetch_stream,
        fill_callback=sdd.load_prefetch,  # <--- fill_callback of sdd.load_prefetch
    ),
]

return StagedTrainPipeline(pipeline_stages=pipeline)
```
Order of execution for the above pipeline:

Iteration #1:
_fill_pipeline():
batch 0: memcpy, start_sdd, wait_sdd (callback), prefetch, load_prefetch (callback)
batch 1: memcpy, start_sdd, wait_sdd (callback)
batch 2: memcpy
progress():
batch 3: memcpy
batch 2: start_sdd
batch 1: prefetch
after pipeline progress():
model(batch 0)
load_prefetch (prepares for model fwd on batch 1)
wait_sdd (prepares for batch 2 prefetch)

Iteration #2:
progress():
batch 4: memcpy
batch 3: start_sdd
batch 2: prefetch
after pipeline progress():
model(batch 1)
load_prefetch (prepares for model fwd on batch 2)
wait_sdd (prepares for batch 3 prefetch)

Reviewed By: zzzwen, joshuadeng Differential Revision: D59786807 fbshipit-source-id: 6261c07cd6823bc541463d24ff867ab0e43631ea
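A short sketch of how such a pipeline is typically driven, assuming a `StagedTrainPipeline`-style `progress()` that returns the next fully staged batch (or `None` when the dataloader is exhausted); `dataloader`, `model`, and `optimizer` are placeholders, not taken from the diff:
```python
# Illustrative training loop around the pipeline built above.
dataloader_iter = iter(dataloader)
while True:
    batch = pipeline.progress(dataloader_iter)
    if batch is None:
        break  # dataloader exhausted, pipeline drained
    loss = model(batch)      # fwd on a batch whose copy/dist/prefetch already ran
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```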
benchmark of fbgemm op - regroup_kts (pytorch#2159) Summary: Pull Request resolved: pytorch#2159
# context
* added **fn-level** benchmark for `regroup_keyed_tensor`
* `keyed_tensor_regroup` further reduces the CPU runtime from 2.0 ms to 1.3 ms (35% improvement) without hurting GPU runtime/memory usage
# conclusion
* CPU runtime **reduces 40%**, from 1.8 ms to 1.1 ms
* GPU runtime **reduces 60%**, from 4.9 ms to 2.0 ms
* GPU memory **reduces 33%**, from 1.5 K to 1.0 K
* **we should migrate to the new op** unless there is an unknown concern/blocker
# traces
* [files](https://drive.google.com/drive/folders/1iiEf30LeG_i0xobMZVhmMneOQ5slmX3U?usp=drive_link)
```
[hhy@24963.od /data/sandcastle/boxes/fbsource (04ad34da3)]$ ll *.json
-rw-r--r-- 1 hhy hhy  552501 Jul 10 16:01 'trace-[1 Op] KT_regroup_dup.json'
-rw-r--r-- 1 hhy hhy  548847 Jul 10 16:01 'trace-[1 Op] KT_regroup.json'
-rw-r--r-- 1 hhy hhy  559006 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs_dup.json'
-rw-r--r-- 1 hhy hhy  553199 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs.json'
-rw-r--r-- 1 hhy hhy 5104239 Jul 10 16:01 'trace-[Module] KTRegroupAsDict_dup.json'
-rw-r--r-- 1 hhy hhy  346643 Jul 10 16:01 'trace-[Module] KTRegroupAsDict.json'
-rw-r--r-- 1 hhy hhy  895096 Jul 10 16:01 'trace-[Old Prod] permute_pooled_embs.json'
-rw-r--r-- 1 hhy hhy  561685 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup_dup.json'
-rw-r--r-- 1 hhy hhy  559147 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup.json'
-rw-r--r-- 1 hhy hhy 7958676 Jul 10 16:01 'trace-[pytorch generic] fallback_dup.json'
-rw-r--r-- 1 hhy hhy 7978141 Jul 10 16:01 'trace-[pytorch generic] fallback.json'
```
* pytorch generic {F1755208341}
* current prod {F1755209251}
* permute_multi_embedding (2 Ops) {F1755210682}
* KT.regroup (1 Op) {F1755210008}
* regroupAsDict (Module) {F1755210990}
* metrics

|Operator|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**[fallback] pytorch generic**|3.9 ms|3.2 ms|1.0 K|CPU-bound, allows duplicates|
|**[prod] _fbgemm_permute_pooled_embs**|1.9 ms|4.9 ms|1.5 K|GPU-bound, does **NOT** allow duplicates, PT2-incompatible `pin_and_move`|
|**[hybrid python/cu] keyed_tensor_regroup**|1.5 ms|2.0 ms|1.0 K|both GPU runtime and memory improved, **ALLOWS** duplicates, PT2 friendly|
|**[pure c++/cu] permute_multi_embedding**|1.0 ms|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, **ALLOWS** duplicates, PT2 friendly|

Reviewed By: dstaay-fb Differential Revision: D58907223 fbshipit-source-id: 108ce355b9191cba6fe6a79e54dc7291b8463f7b
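For reference, a minimal sketch of the operation being benchmarked, assuming the `KeyedTensor.regroup` API from `torchrec.sparse.jagged_tensor` (keys and shapes are made up):
```python
import torch
from torchrec.sparse.jagged_tensor import KeyedTensor

# Two pooled-embedding KeyedTensors with different feature keys (illustrative shapes).
kt_1 = KeyedTensor.from_tensor_list(["f1", "f2"], [torch.randn(8, 4), torch.randn(8, 4)])
kt_2 = KeyedTensor.from_tensor_list(["f3", "f4"], [torch.randn(8, 4), torch.randn(8, 4)])

# Regroup features across the two KeyedTensors; each group is concatenated into
# one dense tensor, which is the work the benchmarked ops implement.
grouped = KeyedTensor.regroup([kt_1, kt_2], [["f1", "f3"], ["f2", "f4"]])
print([t.shape for t in grouped])  # e.g. [torch.Size([8, 8]), torch.Size([8, 8])]
```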
correct VBE output merging logic to only apply to multiple TBE cases (pytorch#2225) Summary: Pull Request resolved: pytorch#2225
- fixes an issue where empty rank embeddings broke with `RuntimeError: torch.cat(): expected a non-empty list of Tensors`
- we prevent this by enforcing that the merge logic only runs when dealing with multiple TBE outputs
- stops redundant merging logic and splits calculation when dealing with a single embedding output, which is the most common case
Reviewed By: ge0405 Differential Revision: D59705585 fbshipit-source-id: 98cd37be62289060524dee3404c71d826e8b18e4
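A generic sketch of the guard pattern described above (not the actual torchrec code; the function and variable names are illustrative):
```python
from typing import List
import torch

def merge_vbe_outputs(outputs: List[torch.Tensor]) -> torch.Tensor:
    # Single-TBE case: return the output as-is, skipping the redundant
    # merge/splits pass (and the code path that can reach torch.cat() with an
    # empty list when some ranks have no embeddings).
    if len(outputs) == 1:
        return outputs[0]
    # Multi-TBE case: merge the per-TBE outputs into one tensor.
    return torch.cat(outputs, dim=0)

print(merge_vbe_outputs([torch.randn(4, 8)]).shape)                     # torch.Size([4, 8])
print(merge_vbe_outputs([torch.randn(4, 8), torch.randn(2, 8)]).shape)  # torch.Size([6, 8])
```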
avoid reserved python word in kwargs (pytorch#2205) Summary: Pull Request resolved: pytorch#2205 as per title Reviewed By: gnahzg, iamzainhuda Differential Revision: D59336088 fbshipit-source-id: 1614039ef2c8d7958c4e98e1b02588c18b932561
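As background (the diff does not say which word was involved), Python reserved words cannot appear as keyword-argument names at a call site; an illustrative snippet of the issue and the usual workaround:
```python
def configure(**kwargs: object) -> dict:
    return dict(kwargs)

# configure(lambda=1)              # SyntaxError: reserved words can't be keyword names
print(configure(**{"lambda": 1}))  # workaround: pass via an unpacked dict
print(configure(fn=1))             # better: avoid the reserved word entirely
```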
Overlap comms on backward pass (pytorch#2117) Summary: Pull Request resolved: pytorch#2117 Resolves issues around CUDA streams / NCCL deadlock with autograd. Basically, create separate streams per pipelined embedding arch. Reviewed By: sarckk Differential Revision: D58220332 fbshipit-source-id: e203acad4a92702b94a42e2106d6de4f5d89e112
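A rough sketch of the per-module stream idea (illustrative only, not the torchrec implementation): each pipelined embedding arch gets its own CUDA stream so its backward-pass communication can overlap instead of contending on a single shared stream.
```python
from typing import Callable, Dict
import torch

# Hypothetical: one dedicated stream per pipelined embedding module.
module_names = ["sparse_arch_0", "sparse_arch_1"]
comm_streams: Dict[str, torch.cuda.Stream] = (
    {name: torch.cuda.Stream() for name in module_names}
    if torch.cuda.is_available()
    else {}
)

def run_comm(name: str, work: Callable[[], None]) -> None:
    stream = comm_streams.get(name)
    if stream is None:
        work()  # CPU-only fallback for this illustration
    else:
        with torch.cuda.stream(stream):
            work()
```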