Tags · StellarrZ/torchrec · GitHub

Tags: StellarrZ/torchrec

Tags

v2024.08.26.00

typing for decorators - fx/_compatibility (pytorch#2322)

Summary:
X-link: ctrl-labs/src2#33884

X-link: pytorch/executorch#4810

Pull Request resolved: pytorch#2322

X-link: pytorch/pytorch#134054

See #131429

Reviewed By: laithsakka

Differential Revision: D61493706

fbshipit-source-id: d2b3feeff2abf8610e4e9940a1b93b5f80777dc2

v2024.08.19.00

Don't skip_torchrec when using torchrec PT2 pipeline (pytorch#2298)

Summary:
Pull Request resolved: pytorch#2298

We introduced `torch._dynamo.config.skip_torchrec` to control whether Dynamo traces into torchrec paths. The PT2 pipeline is mainly used for torchrec PT2 compilation, so it should set `skip_torchrec` to False by default.
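
A minimal sketch of what the flag does, using a stand-in `nn.Linear` module rather than the torchrec PT2 pipeline itself:

```
import torch
import torch.nn as nn

# Trace into torchrec paths instead of skipping them (the PT2 pipeline is
# expected to flip this itself; setting it manually is shown for illustration).
torch._dynamo.config.skip_torchrec = False

model = nn.Linear(8, 4)            # stand-in module, not a torchrec model
compiled = torch.compile(model)
out = compiled(torch.randn(2, 8))
```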

Reviewed By: IvanKobzarev

Differential Revision: D61219995

fbshipit-source-id: fa68455f0087afc1d444de70d1f26944f22d355f

v2024.08.12.00

fix pt2 test bug

Summary:
# context
* test runs fine internally, but OSS failed:
{F1799786365}
```
FAILED torchrec/distributed/tests/test_pt2_multiprocess.py::TestPt2Train::test_compile_multiprocess_fake_pg - hypothesis.errors.InvalidArgument: Using `settings` on a test without `given` is completely pointless.
```
* after removing the extra `settings` line, the OSS test skips fake_pg (CUDA is needed, so the CPU test is skipped); a sketch of the required decorator pairing is shown below
 {F1799792288}
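
A minimal sketch of the decorator pairing hypothesis expects (the test and strategy below are illustrative, not the torchrec test):

```
import unittest
from hypothesis import given, settings, strategies as st

class TestExample(unittest.TestCase):
    # A bare @settings on a test without @given raises
    # hypothesis.errors.InvalidArgument, which is the OSS failure above.
    # @settings is only meaningful when paired with @given:
    @settings(deadline=None, max_examples=5)
    @given(x=st.integers(min_value=0, max_value=10))
    def test_with_given(self, x: int) -> None:
        self.assertGreaterEqual(x, 0)

if __name__ == "__main__":
    unittest.main()
```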

Reviewed By: gnahzg

Differential Revision: D50747500

fbshipit-source-id: f6b187cfda42940acdb6fcd469e8904c476385f3

v2024.08.05.00

fix torch rec test failure (pytorch#2269)

Summary:
Pull Request resolved: pytorch#2269

Fixes T192448049. The module call forms an unusual call stack for the nodes: https://www.internalfb.com/phabricator/paste/view/P1507230978. This is currently not supported by the unflattener and needs some extra design to make it work. We'll comment it out for now.

A TODO is also added to the unflattener code base in D60528900.

Reviewed By: PaulZhang12

Differential Revision: D60682384

fbshipit-source-id: 6633932269918496c1f53e7c600599ecff361f4d

v2024.07.29.00

Open Slots API (pytorch#2249)

Summary:
Pull Request resolved: pytorch#2249

Adds the concept of open slots to show users whether the table is just growing or ids are actually being replaced. Also fixes the default input_hash_size to max int64 (2**63 - 1).

Example logs when insert only:

{F1774359971}

vs replacement:

{F1774361026}

Reviewed By: iamzainhuda

Differential Revision: D59931393

fbshipit-source-id: 3d46198f5e4d2bedbeaee80886b64f8e4b1817f1

v0.8.0

Enable prefetch stage for StagedTrainPipeline (pytorch#2239)

Summary:
Pull Request resolved: pytorch#2239

Add ability to run prefetch as a stage in `StagedTrainPipeline`

Recommended usage to run a 3-stage pipeline with data copy, sparse dist, and prefetch steps (required changes shown with arrows):
```
sdd = SparseDataDistUtil(
    model=self._model,
    data_dist_stream=torch.cuda.Stream(),
    prefetch_stream=torch.cuda.Stream(), <--- define prefetch stream
)

pipeline = [
    PipelineStage(
        name="data_copy",
        runnable=lambda batch, context: batch.to(
            self._device, non_blocking=True
        ),
        stream=torch.cuda.Stream(),
    ),
    PipelineStage(
        name="start_sparse_data_dist",
        runnable=sdd.start_sparse_data_dist,
        stream=sdd.data_dist_stream,
        fill_callback=sdd.wait_sparse_data_dist,
    ),
    PipelineStage(
        name="prefetch",
        runnable=sdd.prefetch, <--- add stage with runnable=sdd.prefetch
        stream=sdd.prefetch_stream,
        fill_callback=sdd.load_prefetch, <--- fill_callback of sdd.load_prefetch
    ),
]

return StagedTrainPipeline(pipeline_stages=pipeline)
```

Order of execution for above pipeline:

Iteration pytorch#1:

_fill_pipeline():
batch 0: memcpy, start_sdd, wait_sdd (callback), prefetch, load_prefetch (callback)
batch 1: memcpy, start_sdd, wait_sdd (callback)
batch 2: memcpy

progress():
batch 3: memcpy
batch 2: start_sdd
batch 1: prefetch

after pipeline progress():
model(batch 0)
load_prefetch (prepares for model fwd on batch 1)
wait_sdd (prepares for batch 2 prefetch)

Iteration pytorch#2:
progress():
batch 4: memcpy
batch 3: start_sdd
batch 2: prefetch

after pipeline progress():
model(batch 1)
load_prefetch (prepares for model fwd on batch 2)
wait_sdd (prepares for batch 3 prefetch)
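
A hedged sketch of the driver loop that produces the order above, assuming `pipeline` is the `StagedTrainPipeline` returned by the snippet earlier and that `progress(dataloader_iter)` yields the next fully staged batch (or None once the iterator is exhausted); `dataloader`, `model`, and `optimizer` are placeholders:

```
dataloader_iter = iter(dataloader)       # placeholder dataloader
while True:
    # progress() advances every stage by one batch and returns the batch
    # whose stages have all completed (None once the iterator is exhausted).
    batch = pipeline.progress(dataloader_iter)
    if batch is None:
        break
    optimizer.zero_grad()
    loss, _ = model(batch)               # model fwd runs after progress()
    loss.backward()
    optimizer.step()
```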

Reviewed By: zzzwen, joshuadeng

Differential Revision: D59786807

fbshipit-source-id: 6261c07cd6823bc541463d24ff867ab0e43631ea

v2024.07.22.00

benchmark of fbgemm op - regroup_kts (pytorch#2159)

Summary:
Pull Request resolved: pytorch#2159

# context
* added **fn-level** benchmark for the `regroup_keyed_tensor`
* `keyed_tensor_regroup` further reduces the CPU runtime from 2.0ms to 1.3ms (35% improvement) without hurting the GPU runtime/memory usage

# conclusion
* CPU runtime **reduces 40%** from 1.8 ms to 1.1 ms
* GPU runtime **reduces 60%** from 4.9 ms to 2.0 ms
* GPU memory **reduces 33%** from 1.5 K to 1.0 K
* **we should migrate to the new op** unless there is an unknown concern or blocker (a usage sketch follows below)
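
For context, a hedged usage sketch of the regroup path being benchmarked; the keys, sizes, and grouping below are made up for illustration:

```
import torch
from torchrec.sparse.jagged_tensor import KeyedTensor

kt = KeyedTensor(
    keys=["f1", "f2", "f3"],
    length_per_key=[4, 4, 8],      # pooled embedding dim per key
    values=torch.randn(2, 16),     # batch_size x sum(length_per_key)
)

# Regroup the pooled embeddings into two groups; returns one tensor per group.
grouped = KeyedTensor.regroup([kt], [["f1", "f3"], ["f2"]])
print([t.shape for t in grouped])  # [torch.Size([2, 12]), torch.Size([2, 4])]
```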

# traces
* [files](https://drive.google.com/drive/folders/1iiEf30LeG_i0xobMZVhmMneOQ5slmX3U?usp=drive_link)
```
[hhy@24963.od /data/sandcastle/boxes/fbsource (04ad34da3)]$ ll *.json
-rw-r--r-- 1 hhy hhy  552501 Jul 10 16:01 'trace-[1 Op] KT_regroup_dup.json'
-rw-r--r-- 1 hhy hhy  548847 Jul 10 16:01 'trace-[1 Op] KT_regroup.json'
-rw-r--r-- 1 hhy hhy  559006 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs_dup.json'
-rw-r--r-- 1 hhy hhy  553199 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs.json'
-rw-r--r-- 1 hhy hhy 5104239 Jul 10 16:01 'trace-[Module] KTRegroupAsDict_dup.json'
-rw-r--r-- 1 hhy hhy  346643 Jul 10 16:01 'trace-[Module] KTRegroupAsDict.json'
-rw-r--r-- 1 hhy hhy  895096 Jul 10 16:01 'trace-[Old Prod] permute_pooled_embs.json'
-rw-r--r-- 1 hhy hhy  561685 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup_dup.json'
-rw-r--r-- 1 hhy hhy  559147 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup.json'
-rw-r--r-- 1 hhy hhy 7958676 Jul 10 16:01 'trace-[pytorch generic] fallback_dup.json'
-rw-r--r-- 1 hhy hhy 7978141 Jul 10 16:01 'trace-[pytorch generic] fallback.json'
```
* pytorch generic
 {F1755208341}
* current prod
 {F1755209251}
* permute_multi_embedding (2 Ops)
 {F1755210682}
* KT.regroup (1 Op)
 {F1755210008}
* regroupAsDict (Module)
 {F1755210990}
* metrics
|Operator|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**[fallback] pytorch generic**|3.9 ms|3.2 ms|1.0 K|CPU-bounded, allow duplicates|
|**[prod] _fbgemm_permute_pooled_embs**|1.9 ms|4.9 ms|1.5 K|GPU-bounded, does **NOT** allow duplicates, PT2 non-compatible `pin_and_move`|
|**[hybrid python/cu] keyed_tensor_regroup**|1.5 ms|2.0 ms|1.0 K|both GPU runtime and memory improved, **ALLOW** duplicates, PT2 friendly|
|**[pure c++/cu] permute_multi_embedding**|1.0 ms|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, **ALLOW** duplicates, PT2 friendly|

Reviewed By: dstaay-fb

Differential Revision: D58907223

fbshipit-source-id: 108ce355b9191cba6fe6a79e54dc7291b8463f7b

v2024.07.15.00

correct VBE output merging logic to only apply to multiple TBE cases (pytorch#2225)

Summary:
Pull Request resolved: pytorch#2225

- fixes an issue that was breaking with empty-rank embeddings
  - `RuntimeError: torch.cat(): expected a non-empty list of Tensors`
  - we prevent this by running the merge logic only when dealing with multiple TBE outputs
- skips the redundant merging logic and splits calculation when dealing with a single embedding output, which is the most common case (see the sketch below)
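
A minimal sketch of the guard pattern described above (the function and names are illustrative, not the actual torchrec code):

```
import torch
from typing import List

def merge_vbe_outputs(tbe_outputs: List[torch.Tensor]) -> torch.Tensor:
    # Single (or no) TBE output, which is the common case: skip the merge
    # logic entirely so torch.cat never sees an empty list.
    if len(tbe_outputs) <= 1:
        return tbe_outputs[0] if tbe_outputs else torch.empty(0)
    # Multiple TBE outputs: only then run the merge/splits computation.
    return torch.cat(tbe_outputs, dim=0)
```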

Reviewed By: ge0405

Differential Revision: D59705585

fbshipit-source-id: 98cd37be62289060524dee3404c71d826e8b18e4

v2024.07.08.00

avoid reserved python word in kwargs (pytorch#2205)

Summary:
Pull Request resolved: pytorch#2205

as per title
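
For context, a small generic illustration of why reserved words make poor kwarg names (the names below are hypothetical, not the torchrec API):

```
# `def build(values, global):` is a SyntaxError because `global` is reserved,
# and `build(global=True)` cannot even be written at the call site.
# Picking a non-reserved name avoids both problems:
def build(values, use_global_pooling=False):
    return {"values": values, "global_pooling": use_global_pooling}

print(build([1, 2, 3], use_global_pooling=True))
```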

Reviewed By: gnahzg, iamzainhuda

Differential Revision: D59336088

fbshipit-source-id: 1614039ef2c8d7958c4e98e1b02588c18b932561

v2024.07.01.00

Overlap comms on backward pass (pytorch#2117)

Summary:
Pull Request resolved: pytorch#2117

Resolves issues around CUDA streams / NCCL deadlock with autograd.

Basically, create separate streams per pipelined embedding arch.
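
A hedged sketch of the separate-streams idea (names are illustrative; this is not the torchrec implementation):

```
import torch

if torch.cuda.is_available():
    # One dedicated stream per pipelined embedding module, so their
    # backward-pass collectives can overlap instead of serializing (or
    # deadlocking) on a single shared stream.
    comm_streams = {name: torch.cuda.Stream() for name in ("ebc_user", "ebc_item")}
    with torch.cuda.stream(comm_streams["ebc_user"]):
        pass  # enqueue ebc_user's comms on its own stream here
```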

Reviewed By: sarckk

Differential Revision: D58220332

fbshipit-source-id: e203acad4a92702b94a42e2106d6de4f5d89e112