Tags: StellarrZ/torchrec
typing for decorators - fx/_compatibility (pytorch#2322) Summary: X-link: ctrl-labs/src2#33884 X-link: pytorch/executorch#4810 Pull Request resolved: pytorch#2322 X-link: pytorch/pytorch#134054 See #131429 Reviewed By: laithsakka Differential Revision: D61493706 fbshipit-source-id: d2b3feeff2abf8610e4e9940a1b93b5f80777dc2
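The pull request above adds typing to the `fx/_compatibility` decorator. As a generic illustration of typing a decorator factory with `ParamSpec` (a sketch of the pattern only, not the actual PyTorch implementation):
```python
from typing import Callable, TypeVar
from typing_extensions import ParamSpec

P = ParamSpec("P")
R = TypeVar("R")

# A typed decorator factory: the wrapped function keeps its original parameter
# and return types, so type checkers can see through the decorator.
def compatibility(is_backward_compatible: bool) -> Callable[[Callable[P, R]], Callable[P, R]]:
    def decorator(fn: Callable[P, R]) -> Callable[P, R]:
        return fn
    return decorator

@compatibility(is_backward_compatible=True)
def trace(root: object) -> object:
    return root
```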
Don't skip_torchrec when using torchrec PT2 pipeline (pytorch#2298) Summary: Pull Request resolved: pytorch#2298 We introduced `torch._dynamo.config.skip_torchrec` to control whether Dynamo traces into torchrec paths. The PT2 pipeline is mainly used for torchrec PT2 compilation, so it should set `skip_torchrec` to False by default. Reviewed By: IvanKobzarev Differential Revision: D61219995 fbshipit-source-id: fa68455f0087afc1d444de70d1f26944f22d355f
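A minimal sketch of what the flag means for user code; the config flag is the one named in the commit, while the model and compile call are illustrative stand-ins:
```python
import torch
import torch.nn as nn

# With the change above, the torchrec PT2 pipeline flips this flag itself; when
# compiling manually, the same idea applies: let Dynamo trace into torchrec paths.
torch._dynamo.config.skip_torchrec = False

model = nn.Linear(8, 4)  # placeholder for a real (sharded) torchrec model
compiled = torch.compile(model)
out = compiled(torch.randn(2, 8))
```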
fix pt2 test bug Summary:
# context
* the test runs fine internally, but the OSS run failed: {F1799786365}
```
FAILED torchrec/distributed/tests/test_pt2_multiprocess.py::TestPt2Train::test_compile_multiprocess_fake_pg - hypothesis.errors.InvalidArgument: Using `settings` on a test without `given` is completely pointless.
```
* after removing the extra `settings` line, the OSS test skips fake_pg (CUDA is needed, so the CPU-only run is skipped) {F1799792288}
Reviewed By: gnahzg Differential Revision: D50747500 fbshipit-source-id: f6b187cfda42940acdb6fcd469e8904c476385f3
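For context, a minimal illustration of the Hypothesis rule behind the error above (test and class names are made up, not the actual torchrec test):
```python
import unittest
from hypothesis import given, settings, strategies as st

class ExampleTest(unittest.TestCase):
    # `settings` only makes sense together with `given`; applying it to a test
    # with no `given` raises hypothesis.errors.InvalidArgument ("Using `settings`
    # on a test without `given` is completely pointless."), which is why the diff
    # removes the stray decorator line.
    @settings(deadline=None)
    @given(st.integers())
    def test_with_given(self, x: int) -> None:
        self.assertIsInstance(x, int)
```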
fix torchrec test failure (pytorch#2269) Summary: Pull Request resolved: pytorch#2269 Fixes T192448049. The module call forms an unusual call stack for the nodes: https://www.internalfb.com/phabricator/paste/view/P1507230978. This is currently not supported by the unflattener and needs some extra design work. We'll comment it out for now. A TODO is also added to the unflattener code base in D60528900. Reviewed By: PaulZhang12 Differential Revision: D60682384 fbshipit-source-id: 6633932269918496c1f53e7c600599ecff361f4d
Open Slots API (pytorch#2249) Summary: Pull Request resolved: pytorch#2249 Adds the concept of open slots to show users whether inserts are just growing the table or actually replacing existing ids. Also fixes the default input_hash_size to max int64 (2**63 - 1). Example logs when insert only: {F1774359971} vs replacement: {F1774361026} Reviewed By: iamzainhuda Differential Revision: D59931393 fbshipit-source-id: 3d46198f5e4d2bedbeaee80886b64f8e4b1817f1
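A rough sketch of the idea, assuming a fixed-capacity id-mapping table; the helper and its parameters are hypothetical, not the torchrec API:
```python
# Hypothetical helper: if the number of free (open) slots covers the incoming
# new ids, the update only grows the table; otherwise existing ids get replaced.
def classify_update(table_size: int, used_ids: int, new_ids: int) -> str:
    open_slots = table_size - used_ids
    return "insert-only" if new_ids <= open_slots else "replacement"

print(classify_update(table_size=1000, used_ids=900, new_ids=50))   # insert-only
print(classify_update(table_size=1000, used_ids=900, new_ids=200))  # replacement
```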
Enable prefetch stage for StagedTrainPipeline (pytorch#2239) Summary: Pull Request resolved: pytorch#2239
Add the ability to run prefetch as a stage in `StagedTrainPipeline`.
Recommended usage to run a 3-stage pipeline with data copy, sparse dist and prefetch steps (changes required shown with arrows):
```
sdd = SparseDataDistUtil(
    model=self._model,
    data_dist_stream=torch.cuda.Stream(),
    prefetch_stream=torch.cuda.Stream(),  # <--- define prefetch stream
)

pipeline = [
    PipelineStage(
        name="data_copy",
        runnable=lambda batch, context: batch.to(
            self._device, non_blocking=True
        ),
        stream=torch.cuda.Stream(),
    ),
    PipelineStage(
        name="start_sparse_data_dist",
        runnable=sdd.start_sparse_data_dist,
        stream=sdd.data_dist_stream,
        fill_callback=sdd.wait_sparse_data_dist,
    ),
    PipelineStage(
        name="prefetch",
        runnable=sdd.prefetch,  # <--- add stage with runnable=sdd.prefetch
        stream=sdd.prefetch_stream,
        fill_callback=sdd.load_prefetch,  # <--- fill_callback of sdd.load_prefetch
    ),
]

return StagedTrainPipeline(pipeline_stages=pipeline)
```
Order of execution for the above pipeline:

Iteration #1:
_fill_pipeline():
batch 0: memcpy, start_sdd, wait_sdd (callback), prefetch, load_prefetch (callback)
batch 1: memcpy, start_sdd, wait_sdd (callback)
batch 2: memcpy
progress():
batch 3: memcpy
batch 2: start_sdd
batch 1: prefetch
after pipeline progress():
model(batch 0)
load_prefetch (prepares for model fwd on batch 1)
wait_sdd (prepares for batch 2 prefetch)

Iteration #2:
progress():
batch 4: memcpy
batch 3: start_sdd
batch 2: prefetch
after pipeline progress():
model(batch 1)
load_prefetch (prepares for model fwd on batch 2)
wait_sdd (prepares for batch 3 prefetch)

Reviewed By: zzzwen, joshuadeng Differential Revision: D59786807 fbshipit-source-id: 6261c07cd6823bc541463d24ff867ab0e43631ea
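A short sketch of how such a pipeline is typically driven, assuming a `StagedTrainPipeline`-style `progress()` that returns the next fully staged batch (or `None` when the dataloader is exhausted); `dataloader`, `model`, and `optimizer` are placeholders, not taken from the diff:
```python
# Illustrative training loop around the pipeline built above.
dataloader_iter = iter(dataloader)
while True:
    batch = pipeline.progress(dataloader_iter)
    if batch is None:
        break  # dataloader exhausted, pipeline drained
    loss = model(batch)      # fwd on a batch whose copy/dist/prefetch already ran
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```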
benchmark of fbgemm op - regroup_kts (pytorch#2159) Summary: Pull Request resolved: pytorch#2159
# context
* added **fn-level** benchmark for `regroup_keyed_tensor`
* `keyed_tensor_regroup` further reduces the CPU runtime from 2.0 ms to 1.3 ms (35% improvement) without hurting GPU runtime/memory usage
# conclusion
* CPU runtime **reduces 40%**, from 1.8 ms to 1.1 ms
* GPU runtime **reduces 60%**, from 4.9 ms to 2.0 ms
* GPU memory **reduces 33%**, from 1.5 K to 1.0 K
* **we should migrate to the new op** unless there is an unknown concern/blocker
# traces
* [files](https://drive.google.com/drive/folders/1iiEf30LeG_i0xobMZVhmMneOQ5slmX3U?usp=drive_link)
```
[hhy@24963.od /data/sandcastle/boxes/fbsource (04ad34da3)]$ ll *.json
-rw-r--r-- 1 hhy hhy  552501 Jul 10 16:01 'trace-[1 Op] KT_regroup_dup.json'
-rw-r--r-- 1 hhy hhy  548847 Jul 10 16:01 'trace-[1 Op] KT_regroup.json'
-rw-r--r-- 1 hhy hhy  559006 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs_dup.json'
-rw-r--r-- 1 hhy hhy  553199 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs.json'
-rw-r--r-- 1 hhy hhy 5104239 Jul 10 16:01 'trace-[Module] KTRegroupAsDict_dup.json'
-rw-r--r-- 1 hhy hhy  346643 Jul 10 16:01 'trace-[Module] KTRegroupAsDict.json'
-rw-r--r-- 1 hhy hhy  895096 Jul 10 16:01 'trace-[Old Prod] permute_pooled_embs.json'
-rw-r--r-- 1 hhy hhy  561685 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup_dup.json'
-rw-r--r-- 1 hhy hhy  559147 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup.json'
-rw-r--r-- 1 hhy hhy 7958676 Jul 10 16:01 'trace-[pytorch generic] fallback_dup.json'
-rw-r--r-- 1 hhy hhy 7978141 Jul 10 16:01 'trace-[pytorch generic] fallback.json'
```
* pytorch generic {F1755208341}
* current prod {F1755209251}
* permute_multi_embedding (2 Ops) {F1755210682}
* KT.regroup (1 Op) {F1755210008}
* regroupAsDict (Module) {F1755210990}
* metrics

|Operator|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**[fallback] pytorch generic**|3.9 ms|3.2 ms|1.0 K|CPU-bound, allows duplicates|
|**[prod] _fbgemm_permute_pooled_embs**|1.9 ms|4.9 ms|1.5 K|GPU-bound, does **NOT** allow duplicates, PT2-incompatible `pin_and_move`|
|**[hybrid python/cu] keyed_tensor_regroup**|1.5 ms|2.0 ms|1.0 K|both GPU runtime and memory improved, **ALLOWS** duplicates, PT2 friendly|
|**[pure c++/cu] permute_multi_embedding**|1.0 ms|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, **ALLOWS** duplicates, PT2 friendly|

Reviewed By: dstaay-fb Differential Revision: D58907223 fbshipit-source-id: 108ce355b9191cba6fe6a79e54dc7291b8463f7b
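For reference, a minimal sketch of the operation being benchmarked, assuming the `KeyedTensor.regroup` API from `torchrec.sparse.jagged_tensor` (keys and shapes are made up):
```python
import torch
from torchrec.sparse.jagged_tensor import KeyedTensor

# Two pooled-embedding KeyedTensors with different feature keys (illustrative shapes).
kt_1 = KeyedTensor.from_tensor_list(["f1", "f2"], [torch.randn(8, 4), torch.randn(8, 4)])
kt_2 = KeyedTensor.from_tensor_list(["f3", "f4"], [torch.randn(8, 4), torch.randn(8, 4)])

# Regroup features across the two KeyedTensors; each group is concatenated into
# one dense tensor, which is the work the benchmarked ops implement.
grouped = KeyedTensor.regroup([kt_1, kt_2], [["f1", "f3"], ["f2", "f4"]])
print([t.shape for t in grouped])  # e.g. [torch.Size([8, 8]), torch.Size([8, 8])]
```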
correct VBE output merging logic to only apply to multiple TBE cases (pytorch#2225) Summary: Pull Request resolved: pytorch#2225
- fixes an issue where empty rank embeddings broke with `RuntimeError: torch.cat(): expected a non-empty list of Tensors`
- we prevent this by enforcing that the merge logic only runs when dealing with multiple TBE outputs
- stops redundant merging logic and splits calculation when dealing with a single embedding output, which is the most common case
Reviewed By: ge0405 Differential Revision: D59705585 fbshipit-source-id: 98cd37be62289060524dee3404c71d826e8b18e4
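A generic sketch of the guard pattern described above (not the actual torchrec code; the function and variable names are illustrative):
```python
from typing import List
import torch

def merge_vbe_outputs(outputs: List[torch.Tensor]) -> torch.Tensor:
    # Single-TBE case: return the output as-is, skipping the redundant
    # merge/splits pass (and the code path that can reach torch.cat() with an
    # empty list when some ranks have no embeddings).
    if len(outputs) == 1:
        return outputs[0]
    # Multi-TBE case: merge the per-TBE outputs into one tensor.
    return torch.cat(outputs, dim=0)

print(merge_vbe_outputs([torch.randn(4, 8)]).shape)                     # torch.Size([4, 8])
print(merge_vbe_outputs([torch.randn(4, 8), torch.randn(2, 8)]).shape)  # torch.Size([6, 8])
```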
avoid reserved python word in kwargs (pytorch#2205) Summary: Pull Request resolved: pytorch#2205 as per title Reviewed By: gnahzg, iamzainhuda Differential Revision: D59336088 fbshipit-source-id: 1614039ef2c8d7958c4e98e1b02588c18b932561
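As background (the diff does not say which word was involved), Python reserved words cannot appear as keyword-argument names at a call site; an illustrative snippet of the issue and the usual workaround:
```python
def configure(**kwargs: object) -> dict:
    return dict(kwargs)

# configure(lambda=1)              # SyntaxError: reserved words can't be keyword names
print(configure(**{"lambda": 1}))  # workaround: pass via an unpacked dict
print(configure(fn=1))             # better: avoid the reserved word entirely
```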
Overlap comms on backward pass (pytorch#2117) Summary: Pull Request resolved: pytorch#2117 Resolves issues around CUDA streams / NCCL deadlock with autograd. Basically, create separate streams per pipelined embedding arch. Reviewed By: sarckk Differential Revision: D58220332 fbshipit-source-id: e203acad4a92702b94a42e2106d6de4f5d89e112
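A rough sketch of the per-module stream idea (illustrative only, not the torchrec implementation): each pipelined embedding arch gets its own CUDA stream so its backward-pass communication can overlap instead of contending on a single shared stream.
```python
from typing import Callable, Dict
import torch

# Hypothetical: one dedicated stream per pipelined embedding module.
module_names = ["sparse_arch_0", "sparse_arch_1"]
comm_streams: Dict[str, torch.cuda.Stream] = (
    {name: torch.cuda.Stream() for name in module_names}
    if torch.cuda.is_available()
    else {}
)

def run_comm(name: str, work: Callable[[], None]) -> None:
    stream = comm_streams.get(name)
    if stream is None:
        work()  # CPU-only fallback for this illustration
    else:
        with torch.cuda.stream(stream):
            work()
```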