[KT.regroup Ops][4/N] benchmark of fbgemm op - regroup_kts by TroyGarden · Pull Request #2159 · pytorch/torchrec · GitHub

[KT.regroup Ops][4/N] benchmark of fbgemm op - regroup_kts #2159


Closed
wants to merge 5 commits into from

Conversation

@TroyGarden (Contributor) commented Jun 22, 2024

context

  • added fn-level benchmark for the regroup_keyed_tensor
  • keyed_tensor_regroup further reduces the CPU runtime from 2.0ms to 1.3ms (35% improvement) without hurting the GPU runtime/memory usage

conclusion

  • CPU runtime reduces 40% from 1.8 ms to 1.1 ms
  • GPU runtime reduces 60% from 4.9 ms to 2.0 ms
  • GPU memory reduces 33% from 1.5 K to 1.0 K
  • we should migrate to the new op unless any unknown concern/blocker

traces

```
[hhy@24963.od /data/sandcastle/boxes/fbsource (04ad34da3)]$ ll *.json
-rw-r--r-- 1 hhy hhy  552501 Jul 10 16:01 'trace-[1 Op] KT_regroup_dup.json'
-rw-r--r-- 1 hhy hhy  548847 Jul 10 16:01 'trace-[1 Op] KT_regroup.json'
-rw-r--r-- 1 hhy hhy  559006 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs_dup.json'
-rw-r--r-- 1 hhy hhy  553199 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs.json'
-rw-r--r-- 1 hhy hhy 5104239 Jul 10 16:01 'trace-[Module] KTRegroupAsDict_dup.json'
-rw-r--r-- 1 hhy hhy  346643 Jul 10 16:01 'trace-[Module] KTRegroupAsDict.json'
-rw-r--r-- 1 hhy hhy  895096 Jul 10 16:01 'trace-[Old Prod] permute_pooled_embs.json'
-rw-r--r-- 1 hhy hhy  561685 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup_dup.json'
-rw-r--r-- 1 hhy hhy  559147 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup.json'
-rw-r--r-- 1 hhy hhy 7958676 Jul 10 16:01 'trace-[pytorch generic] fallback_dup.json'
-rw-r--r-- 1 hhy hhy 7978141 Jul 10 16:01 'trace-[pytorch generic] fallback.json'
```
  • pytorch generic
    {F1755208341}
  • current prod
    {F1755209251}
  • permute_multi_embedding (2 Ops)
    {F1755210682}
  • KT.regroup (1 Op)
    {F1755210008}
  • regroupAsDict (Module)
    {F1755210990}
  • metrics
    |Operator|CPU runtime|GPU runtime|GPU memory|notes|
    |---|---|---|---|---|
    |[fallback] pytorch generic|3.9 ms|3.2 ms|1.0 K|CPU-bounded, allow duplicates|
    |[prod] _fbgemm_permute_pooled_embs|1.9 ms|4.9 ms|1.5 K|GPU-bounded, does NOT allow duplicates, PT2 non-compatible pin_and_move|
    |[hybrid python/cu] keyed_tensor_regroup|1.5 ms|2.0 ms|1.0 K|both GPU runtime and memory improved, ALLOW duplicates, PT2 friendly|
    |[pure c++/cu] permute_multi_embedding|1.0 ms|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, ALLOW duplicates, PT2 friendly|

Differential Revision: D58907223

Summary:

# context
* we are adding fbgemm operators for the KT.regroup function.
* we wanted a good way to measure the performance besides the runtime
* **trace is very important to evaluate the actual performance impact**
* for example, just from the GPU runtime readings, it seems like the native-pytorch implementation (`_regroup_keyed_tenors`) has better performance than the fbgemm_gpu implementation (`KeyedTensor.regroup`)
* but if we look at the CPU/GPU traces, we'll find that the native-pytorch implementation is actually CPU-bounded and has a very negative impact on the overall performance.

# usage
* to generate trace files in the given path (`.`), run the command below (a minimal standalone profiler sketch follows the output listing)
```
buck2 run fbcode//mode/opt fbcode//torchrec/sparse/tests:jagged_tensor_benchmark -- --profile=.
```
```
$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062963 Jun 21 22:21 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy  943675 Jun 21 22:21 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140105 Jun 21 22:21 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy  350349 Jun 21 22:21 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy 8025287 Jun 21 22:21 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041473 Jun 21 22:21 trace-_regroup_keyed_tenors.json
```
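
For readers outside fbcode, here is a minimal standalone sketch of how such trace files can be produced with `torch.profiler`. This is an illustrative stand-in, not the benchmark script itself; `run_regroup` is a placeholder for whichever regroup implementation is being measured:
```
import torch
from torch.profiler import ProfilerActivity, profile


def run_regroup(values):
    # placeholder for the implementation under test, e.g. KeyedTensor.regroup
    return torch.cat(values, dim=1)


values = [torch.randn(1024, 96, device="cuda") for _ in range(3)]

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        run_regroup(values)
    torch.cuda.synchronize()

# writes a Chrome-trace-compatible json, like the trace-*.json files above
prof.export_chrome_trace("trace-example.json")
```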

# performance
* GPU
```
INFO:2024-06-21 22:22:51 1102779:1102779 CuptiCallbackApi.cpp:78] Callback: domain = 3, cbid = 1
INFO:2024-06-21 22:22:51 1102779:1102779 CuptiActivityProfiler.cpp:241] CUDA versions. CUPTI: 18; Runtime: 12000; Driver: 12000
INFO:2024-06-21 22:22:51 1102779:1102779 NcclProfiler.cpp:150] NCCL Profiler Instantiated
  _regroup_keyed_tenors               | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1011.0
  KeyedTensor.regroup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   5.0 ms | Memory (P90): 1517.0
  KTRegroupAsDict                     | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   4.9 ms | Memory (P90): 1517.0
  _regroup_keyed_tenors_dup           | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  KeyedTensor.regroup_dup             | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  KTRegroupAsDict_dup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
```
* CPU
```
  _regroup_keyed_tenors               | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 144.8 ms | Memory (P90):   0.0
  KeyedTensor.regroup                 | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 159.1 ms | Memory (P90):   0.0
  KTRegroupAsDict                     | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 203.0 ms | Memory (P90):   0.0
  _regroup_keyed_tenors_dup           | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 132.4 ms | Memory (P90):   0.0
  KeyedTensor.regroup_dup             | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 134.7 ms | Memory (P90):   0.0
  KTRegroupAsDict_dup                 | B: 1024     | F: 1020     | device: cpu      | Runtime (P90): 131.8 ms | Memory (P90):   0.0
```
# traces
* _regroup_keyed_tenors
 {F1712147044} 
* KeyedTensor.regroup
 {F1712148863} 
* KTRegroupAsDict
 {F1712150411}

Differential Revision: D58906521
Summary:
X-link: pytorch/FBGEMM#2738


# context
* currently we have a working function `permute_pooled_embs_auto_grad` to do a full permute of KTs, including forward and backward
* it has several limitations:
a) it has to be a full permute; duplicates are not supported;
b) in the main [use case](https://fburl.com/code/89od0rqm) there has to be a torch.concat on the input KTs, which is not very efficient;
c) the function outputs a single KT, which then requires a split operation
* there have been attempts to support duplicated outputs, but the backward pass doesn't work
* this diff creates a new kernel (named `permute_multi_embedding`) to support a multiple-KT to multiple-KT mapping operation with backward support

# notes
* this diff focuses on the implementation and test of the operator
* performance analysis and benchmark are in the next diff

# operator example usage
* used in python
```
# test inputs: 3 KTs with batch_size=2048
batch_size = 2048
keys = [["f1", "f2"], ["f3", "f4", "f5"], ["f6"]]
lengths = [[96, 256], [512, 128, 768], [1024]]
values = [
    torch.randn(batch_size, sum(lens), device="cuda", requires_grad=True)
    for lens in lengths
]

# target outputs: 4 KTs with re-arranged keys (features), duplicates are allowed
groups = [["f1", "f3"], ["f2"], ["f4", "f1", "f6"], ["f1", "f5"]]

# auxiliary arguments for the op/kernel
permutes, in_lengths, out_lengths = _multi_remap_to_groups(
    keys, lengths, groups
)

# call the operator with the prepared arguments
outputs = torch.ops.fbgemm.permute_multi_embedding(
    values, permutes, in_lengths, out_lengths
)
```
* permutes
```
permutes = tensor(
            [
                [0, 0, 0, 0, 3, 4],  # f1
                [1, 0, 0, 3, 5, 0],  # f3
                [0, 1, 3, 0, 4, 0],  # f2
                [1, 2, 5, 0, 6, 0],  # f4
                [0, 2, 0, 6, 3, -6],  # f1
                [2, 2, 0, 9, 8, 0],  # f6
                [0, 3, 0, 0, 3, -8],  # f1
                [1, 3, 11, 3, 7, 0],  # f5
            ]
)
```

# details
1. from the above example usage, we can clearly see that the operator takes in the following:
a) values: List[torch.Tensor], which represents the input KTs
b) permutes: torch.Tensor, which contains the permute information (explained below)
c) output_lengths_list: List[int], the lengths of the output tensors (KTs), needed to allocate device memory ahead of time
d) in_lengths: torch.Tensor, the lengths of the input tensors, resident on device
e) out_lengths: torch.Tensor, the lengths of the output tensors, resident on device
2. the operator returns a list of tensors, which represents the permuted KTs
3. `permutes` is the most critical argument in this operator:
a) a 2-D tensor
b) each row represents one key (feature) permute move
c) a permute move = [input_tensor_id, output_tensor_id, input_start_idx, output_start_idx, feature_length, jump]
d) jump is used in the backward pass when a key (feature) from the input tensor is mapped to multiple places in the output tensors (see the reference sketch below)
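
To make the permute-move semantics above concrete, here is a pure-python reference of the forward copy loop (an illustrative sketch only; the real operator runs this in C++/CUDA and additionally uses the `jump` field in the backward pass, which is ignored here):
```
import torch
from typing import List


def permute_multi_embedding_reference(
    values: List[torch.Tensor],
    permutes: torch.Tensor,
    out_lengths: List[int],
) -> List[torch.Tensor]:
    batch_size = values[0].size(0)
    outputs = [values[0].new_zeros(batch_size, length) for length in out_lengths]
    for in_id, out_id, in_start, out_start, length, _jump in permutes.tolist():
        # copy one key (feature) slice from an input KT to an output KT
        outputs[out_id][:, out_start : out_start + length] = values[in_id][
            :, in_start : in_start + length
        ]
    return outputs
```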

Differential Revision: D57055616
Summary:

X-link: pytorch/FBGEMM#2771

# context
* added both **op-level** and **fn-level** benchmarks for the KT.regroup implementations
* analyze the op-level and fn-level performance in runtime and memory usage
* findings are that:
**a**. In the fn-level performance, the `permute_multi_embedding` (new op) outperforms both the native-pytorch implementation and `permute_pooled_embs_auto_grad` (current prod) by ~50% in GPU runtime and ~33% in GPU memory usage
**b**. In the op-level performance, the new op is slightly slower than the current prod (by ~5% GPU runtime)
* conclusion: **we should use the new op**

# other considerations
The good:
1. the algorithm is designed in a way that it doesn't need to know in advance whether the 1-to-N mapping exists in the permutes. 
2. `_all_keys_used_once` is no longer needed
3. no longer need a torch.cat before calling the old operator
4. no need to use `_pin_and_move` for the metadata (arguments); it is handled inside the operator, which is more tracing-friendly.
5. no longer need to fallback to native-pytorch implementation when duplicates existed

The same bad:
1. it requires several HtoD communications (moving tensors to device):
a) [resolved] 3 tensors, namely `permutes`, `input_lengths`, and `output_lengths`; those tensors need to be on the device so that the CUDA kernels can access them.
b) [resolved] 2 lists of (scalar_t*) pointers, for the input and output tensor lists.
c) [resolved] didn't find a good way to let the kernel know the addresses of the input/output tensor lists, because those lists also need to be on the device.
2. `tensor.contiguous` is needed in the backward function; the grads coming back from autograd are somehow not contiguous (see the minimal sketch below).
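
Regarding point 2, a toy `torch.autograd.Function` illustrating why the incoming grad is made contiguous before being handed to a custom kernel (a minimal sketch for illustration, not the actual operator's autograd wrapper):
```
import torch


class ContiguousGradConcat(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, b):
        ctx.split = a.size(1)
        return torch.cat([a, b], dim=1)

    @staticmethod
    def backward(ctx, grad_out):
        # grad_out may arrive as a non-contiguous view (e.g. from an upstream
        # slice/permute); custom kernels typically assume contiguous memory,
        # hence the explicit copy here
        grad_out = grad_out.contiguous()
        return grad_out[:, : ctx.split], grad_out[:, ctx.split :]


a = torch.randn(4, 3, requires_grad=True)
b = torch.randn(4, 5, requires_grad=True)
ContiguousGradConcat.apply(a, b).sum().backward()
```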

# benchmark
* op-level results: new op is ~5% slower in GPU runtime
```
INFO:root:size: 1024 x 136896; permute_multi_embedding: 2.25 ms; permute_pooled_embs: 2.15 ms; delta: 4.5%
INFO:root:size: 1024 x 108432; permute_multi_embedding: 1.79 ms; permute_pooled_embs: 1.7 ms; delta: 5.3%
INFO:root:size: 1024 x 277232; permute_multi_embedding: 4.54 ms; permute_pooled_embs: 4.37 ms; delta: 3.9%
INFO:root:size: 1024 x 244352; permute_multi_embedding: 4.01 ms; permute_pooled_embs: 3.83 ms; delta: 4.9%
INFO:root:size: 1024 x 524224; permute_multi_embedding: 8.62 ms; permute_pooled_embs: 8.25 ms; delta: 4.5%
INFO:root:size: 1024 x 564080; permute_multi_embedding: 9.27 ms; permute_pooled_embs: 8.92 ms; delta: 3.9%
```
* fn-level results: new op is 50%+ faster in GPU runtime and uses ~33% less GPU memory
```
  _regroup_keyed_tenors               | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1011.0
  KeyedTensor.regroup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   5.0 ms | Memory (P90): 1517.0
  KTRegroupAsDict                     | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   4.9 ms | Memory (P90): 1517.0
  permute_multi_embs                  | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.2 ms | Memory (P90): 1011.0
  _regroup_keyed_tenors_dup           | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  KeyedTensor.regroup_dup             | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  KTRegroupAsDict_dup                 | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 1011.0
  permute_multi_embs_dup              | B: 1024     | F: 1020     | device: cuda     | Runtime (P90):   3.2 ms | Memory (P90): 1011.0
```

# traces
* [files](https://drive.google.com/drive/folders/1_9hOtQUQeFICBVxQtusvpQ_VajduFUmR?usp=sharing)
```
[hhy@50836.od /data/sandcastle/boxes/fbsource (ae677c240)]$ ll *.json
-rw-rw-r-- 1 hhy hhy 8062993 Jun 21 23:26 trace-KeyedTensor.regroup_dup.json
-rw-rw-r-- 1 hhy hhy  949610 Jun 21 23:26 trace-KeyedTensor.regroup.json
-rw-rw-r-- 1 hhy hhy 5140143 Jun 21 23:26 trace-KTRegroupAsDict_dup.json
-rw-rw-r-- 1 hhy hhy  350370 Jun 21 23:26 trace-KTRegroupAsDict.json
-rw-rw-r-- 1 hhy hhy  581033 Jun 21 23:26 trace-permute_multi_embs_dup.json
-rw-rw-r-- 1 hhy hhy  582607 Jun 21 23:26 trace-permute_multi_embs.json
-rw-rw-r-- 1 hhy hhy 8025337 Jun 21 23:26 trace-_regroup_keyed_tenors_dup.json
-rw-rw-r-- 1 hhy hhy 8041586 Jun 21 23:26 trace-_regroup_keyed_tenors.json
```
* native-pytorch
 {F1713052022} 
* current prod
 {F1713052648} 
* new op
 {F1713052907} 
* runtime
|Operator|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**native-pytorch**|3.9 ms|3.1 ms|1.0 K|CPU-bounded, allow duplicates|
|**prod op**|2.1 ms|4.9 ms|1.5 K|GPU-bounded due to torch.cat, does **NOT** allow duplicates|
|**new op**|2.0 ms|2.2 ms|1.0 K|both CPU and GPU runtime outperformed, **ALLOW** duplicates|

Differential Revision: D58906839
Summary:
X-link: pytorch/FBGEMM#2772


# context
* learned from the previous benchmark/trace analysis that the CPU runtime (~2.0 ms) is still comparable to the GPU runtime (~2.2 ms)
|Operator|CPU runtime|GPU runtime|
|---|---|---|
|**native-pytorch**|3.9 ms|3.1 ms|
|**[prod] permute_pooled_embs**|2.1 ms|4.9 ms|
|**[new] permute_multi_embedding**|2.0 ms|2.2 ms|
* after a closer look, ~1.1 ms is spent on the meta-argument preparation/calculation, particularly in `_multi_remap_to_groups`
 {F1713121552}
* in order to further improve the CPU runtime performance, we are moving the meta argument preparation into the C++ domain (inside the operator)

# details
* the metadata from python are: 1) the feature/key lists: `List[List[str]]`, 2) the feature/key lengths: `List[List[int]]`, 3) the permuted (grouped) feature/key lists: `List[List[str]]`
* since the op schema only supports flat `List[str]` and `List[int]`, we have to flatten those lists of lists and also pass the splits to the downstream operator
* the minimal meta operations left in python are the following:
```
keys, lengths, values = _desugar_keyed_tensors(keyed_tensors)
_keys = [a for b in keys for a in b]
_groups = [a for b in groups for a in b]
_lengths = [a for b in lengths for a in b]
key_splits = [len(k) for k in keys]
group_splits = [len(v) for v in groups]
```
* use `torch.ops.fbgemm.generate_keyed_tensor_permutes` to do the same work as `_multi_remap_to_groups`
* the new op `regroup_keyed_tensor` first calls `generate_keyed_tensor_permutes` to get the proper arguments,
* then calls the `permute_multi_embedding` op for the actual permutation (a sketch of the flow follows below).
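
A minimal sketch of the hybrid python/c++ flow described above. The exact python-side signature of `generate_keyed_tensor_permutes` is an assumption for illustration only; `permute_multi_embedding` is called as in the earlier example, and `_desugar_keyed_tensors` is the torchrec helper referenced in the snippet above:
```
import torch
from torchrec.sparse.jagged_tensor import _desugar_keyed_tensors


def regroup_keyed_tensors_sketch(keyed_tensors, groups):
    # flatten the python-side metadata (see the minimal meta operations above)
    keys, lengths, values = _desugar_keyed_tensors(keyed_tensors)
    flat_keys = [k for ks in keys for k in ks]
    flat_groups = [g for gs in groups for g in gs]
    flat_lengths = [n for ns in lengths for n in ns]
    key_splits = [len(ks) for ks in keys]
    group_splits = [len(gs) for gs in groups]

    # assumed signature: compute permutes/in_lengths/out_lengths on the C++ side
    permutes, in_lengths, out_lengths = torch.ops.fbgemm.generate_keyed_tensor_permutes(
        flat_keys, flat_lengths, flat_groups, key_splits, group_splits
    )
    # actual permutation, as in the operator example earlier in this PR
    return torch.ops.fbgemm.permute_multi_embedding(
        values, permutes, in_lengths, out_lengths
    )
```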

Differential Revision: D58649553
Differential Revision: D58907223
@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 22, 2024
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D58907223

facebook-github-bot pushed a commit that referenced this pull request Jul 13, 2024
Summary:

# context
* added **fn-level** benchmark for the `regroup_keyed_tensor`
* `keyed_tensor_regroup` further reduces the CPU runtime from 2.0ms to 1.3ms (35% improvement) without hurting the GPU runtime/memory usage

# conclusion
* CPU runtime **reduces 40%** from 1.8 ms to 1.1 ms
* GPU runtime **reduces 60%** from 4.9 ms to 2.0 ms
* GPU memory **reduces 33%** from 1.5 K to 1.0 K
* **we should migrate to the new op** unless any unknown concern/blocker

# traces
* [files](https://drive.google.com/drive/folders/1iiEf30LeG_i0xobMZVhmMneOQ5slmX3U?usp=drive_link)
```
[hhy@24963.od /data/sandcastle/boxes/fbsource (04ad34da3)]$ ll *.json
-rw-r--r-- 1 hhy hhy  552501 Jul 10 16:01 'trace-[1 Op] KT_regroup_dup.json'
-rw-r--r-- 1 hhy hhy  548847 Jul 10 16:01 'trace-[1 Op] KT_regroup.json'
-rw-r--r-- 1 hhy hhy  559006 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs_dup.json'
-rw-r--r-- 1 hhy hhy  553199 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs.json'
-rw-r--r-- 1 hhy hhy 5104239 Jul 10 16:01 'trace-[Module] KTRegroupAsDict_dup.json'
-rw-r--r-- 1 hhy hhy  346643 Jul 10 16:01 'trace-[Module] KTRegroupAsDict.json'
-rw-r--r-- 1 hhy hhy  895096 Jul 10 16:01 'trace-[Old Prod] permute_pooled_embs.json'
-rw-r--r-- 1 hhy hhy  561685 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup_dup.json'
-rw-r--r-- 1 hhy hhy  559147 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup.json'
-rw-r--r-- 1 hhy hhy 7958676 Jul 10 16:01 'trace-[pytorch generic] fallback_dup.json'
-rw-r--r-- 1 hhy hhy 7978141 Jul 10 16:01 'trace-[pytorch generic] fallback.json'
```
* pytorch generic
 {F1755208341} 
* current prod
 {F1755209251} 
* permute_multi_embedding (2 Ops)
 {F1755210682} 
* KT.regroup (1 Op)
 {F1755210008} 
* regroupAsDict (Module)
 {F1755210990} 
* metrics
|Operator|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**[fallback] pytorch generic**|3.9 ms|3.2 ms|1.0 K|CPU-bounded, allow duplicates|
|**[prod] _fbgemm_permute_pooled_embs**|1.9 ms|4.9 ms|1.5 K|GPU-bounded, does **NOT** allow duplicates, PT2 non-compatible `pin_and_move`|
|**[hybrid python/cu] keyed_tensor_regroup**|1.5 ms|2.0 ms|1.0 K|both GPU runtime and memory improved, **ALLOW** duplicates, PT2 friendly|
|**[pure c++/cu] permute_multi_embedding**|1.0 ms|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, **ALLOW** duplicates, PT2 friendly|

Differential Revision: D58907223
TroyGarden pushed a commit that referenced this pull request Jul 13, 2024
Summary:
Pull Request resolved: #2159

# context
* added **fn-level** benchmark for the `regroup_keyed_tensor`
* `keyed_tensor_regroup` further reduces the CPU runtime from 2.0ms to 1.3ms (35% improvement) without hurting the GPU runtime/memory usage

# conclusion
* CPU runtime **reduces 1/3** from 1.8 ms to 1.1 ms
* GPU runtime **reduces 2/3** from 7.0 ms to 2.0 ms
* GPU memory **reduces 1/3** from 1.5 K to 1.0 K
* **we should migrate to the new op** unless any unknown concern/blocker

# traces
* [files](https://drive.google.com/drive/folders/1iiEf30LeG_i0xobMZVhmMneOQ5slmX3U?usp=drive_link)
```
[hhy@24963.od /data/sandcastle/boxes/fbsource (04ad34da3)]$ ll *.json
-rw-r--r-- 1 hhy hhy  552501 Jul 10 16:01 'trace-[1 Op] KT_regroup_dup.json'
-rw-r--r-- 1 hhy hhy  548847 Jul 10 16:01 'trace-[1 Op] KT_regroup.json'
-rw-r--r-- 1 hhy hhy  559006 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs_dup.json'
-rw-r--r-- 1 hhy hhy  553199 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs.json'
-rw-r--r-- 1 hhy hhy 5104239 Jul 10 16:01 'trace-[Module] KTRegroupAsDict_dup.json'
-rw-r--r-- 1 hhy hhy  346643 Jul 10 16:01 'trace-[Module] KTRegroupAsDict.json'
-rw-r--r-- 1 hhy hhy  895096 Jul 10 16:01 'trace-[Old Prod] permute_pooled_embs.json'
-rw-r--r-- 1 hhy hhy  561685 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup_dup.json'
-rw-r--r-- 1 hhy hhy  559147 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup.json'
-rw-r--r-- 1 hhy hhy 7958676 Jul 10 16:01 'trace-[pytorch generic] fallback_dup.json'
-rw-r--r-- 1 hhy hhy 7978141 Jul 10 16:01 'trace-[pytorch generic] fallback.json'
```
* pytorch generic
 {F1752502508}
* current prod
 {F1752503546}
* permute_multi_embedding (2 Ops)
 {F1752503160}
* KT.regroup (1 Op)
 {F1752504258}
* regroupAsDict (Module)
{F1752504964}
* metrics
|Operator|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**[fallback] pytorch generic**|3.9 ms|3.2 ms|1.0 K|CPU-bounded, allow duplicates|
|**[prod] _fbgemm_permute_pooled_embs**|1.9 ms|7.1 ms|1.5 K|GPU-bounded, does **NOT** allow duplicates, PT2 non-compatible `pin_and_move`|
|**[hybrid python/cu] keyed_tensor_regroup**|1.5 ms|2.0 ms|1.0 K|both GPU runtime and memory improved, **ALLOW** duplicates, PT2 friendly|
|**[pure c++/cu] permute_multi_embedding**|1.0 ms|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, **ALLOW** duplicates, PT2 friendly|

Differential Revision: D58907223
@TroyGarden changed the title from "benchmark of fbgemm op - regroup_keyed_tensor" to "[KT.regroup Ops][4/N] benchmark of fbgemm op - regroup_kts" on Jul 13, 2024
TroyGarden pushed a commit to TroyGarden/torchrec that referenced this pull request Jul 20, 2024
Summary:
Pull Request resolved: pytorch#2159

@TroyGarden deleted the export-D58907223 branch June 4, 2025 05:53
Labels
CLA Signed · fb-exported