Tags: pytorch/torchrec
Fix Unit Test SkipIf Worldsize Check (#3098)

Summary: Pull Request resolved: #3098

These unit tests actually require at least 4 GPUs due to their world-size requirements. The skipIf condition is updated to match.

Created from CodeHub with https://fburl.com/edit-in-codehub

Reviewed By: aliafzal

Differential Revision: D76621861

fbshipit-source-id: 09f9b04c4d3cb7b10736fbbaff3886a8534b96fa
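As a rough illustration of the kind of guard this commit describes, a minimal skipIf based on the visible GPU count might look like the sketch below; the class name, test name, and message are hypothetical, not the actual TorchRec test code.

```python
# Minimal sketch: skip when fewer than 4 GPUs are visible, matching the
# world size the test actually needs. Class/test names here are hypothetical.
import unittest

import torch


@unittest.skipIf(
    torch.cuda.device_count() < 4,
    "Requires at least 4 GPUs due to the sharding world size",
)
class ModelParallelWorldSize4Test(unittest.TestCase):
    def test_sharding(self) -> None:
        ...  # test body elided
```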
Minor refactor of the GitHub workflow (#3062)

Summary: Pull Request resolved: #3062

# context
* refactor the matrix statement in the GitHub workflow
* for pull requests, only cu128 + py313 runs in the GPU CI, and only py39 and py313 run in the CPU CI

Reviewed By: aporialiao

Differential Revision: D76242338

fbshipit-source-id: b56ba4965842d89371d9a7baec858734bc306aaf
PMT (#3023)

Summary: Pull Request resolved: #3023

# context
* `_test_sharding` is a frequently used test function covering many TorchRec sharding test cases
* the multiprocess environment often introduces additional difficulties when debugging, especially for kernel-size issues (where the multiprocess environment is not actually needed)
* this change makes the test run on the main process when `world_size == 1`, so that a simple `breakpoint()` just works

Reviewed By: iamzainhuda

Differential Revision: D74131796

fbshipit-source-id: ccc34ab589c0153cc0ce1187bba3df7dd63cbfc6
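A hedged sketch of the dispatch this commit describes: run the test callable directly in the main process when `world_size == 1`, otherwise spawn one process per rank. The function name and signature below are assumptions for illustration, not TorchRec's actual `_test_sharding` internals.

```python
# Sketch only: callable_fn takes (rank, world_size); names are illustrative.
from typing import Callable

import torch.multiprocessing as mp


def run_sharding_test(
    callable_fn: Callable[[int, int], None], world_size: int
) -> None:
    if world_size == 1:
        # Single-rank case: stay in the main process so breakpoint()/pdb work.
        callable_fn(0, world_size)
    else:
        # Multi-rank case: spawn one subprocess per rank, as before.
        mp.spawn(callable_fn, args=(world_size,), nprocs=world_size)
```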
Fix empty sharding constraints in `test_model_parallel.py` (#2998)

Summary: Pull Request resolved: #2998

#### Context
Several unit tests in `test_model_parallel.py` passed **empty constraints** into `self._test_sharding`, because the constraints were generated from an empty `self.tables` before `self._build_tables_and_groups` was invoked. The impacted tests are:
* `test_sharding_twcw`
* `test_sharding_variable_batch`
* `test_sharding_multiple_kernels`

#### Changes
* Constraints only depend on table names, so a new list `self.table_names` is created in the `setUp()` stage and used to construct the constraints.
* Updates `self._build_tables_and_groups` to use the generated table names.
* Increases `max_examples` for `test_sharding_multiple_kernels` to cover both FP32 and FP16 cases.

Reviewed By: TroyGarden

Differential Revision: D75306149

fbshipit-source-id: b93f7656e45a8c79393a1c347437f757aac07557
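As a rough illustration of constraints that depend only on table names, the snippet below builds a constraints dict from a pre-generated name list; the table names and the specific `ParameterConstraints` fields are assumptions for illustration, not the values used in `test_model_parallel.py`.

```python
# Sketch: constraints keyed purely by table name, built before any tables exist.
from torchrec.distributed.planner.types import ParameterConstraints

table_names = [f"table_{i}" for i in range(4)]  # stand-in for self.table_names

constraints = {
    name: ParameterConstraints(sharding_types=["table_wise", "column_wise"])
    for name in table_names
}
```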
split train_pipeline.utils - pipeline_context (#2978)

Summary: Pull Request resolved: #2978

# context
* the train_pipeline.utils file is overloaded
* split its functions, classes, etc. into three files, each under ~1000 lines
* this diff: pipeline_context.py

Reviewed By: malaybag

Differential Revision: D73906059

fbshipit-source-id: 7b3e59279a5b27b1953d0e24cc206c8a395bbd8e
Add params needed for raw embedding streaming in trec and mvai (#2935)

Summary: Pull Request resolved: #2935

Add the variables needed in D73792631 to mvai and TorchRec so they can be controlled via config.

Reviewed By: aliafzal

Differential Revision: D74086201

fbshipit-source-id: 53fb269c17f08d87589a837d2049b733db0d665e
Support zero-collision tables in SSD TBE (#2919)

Summary:
X-link: pytorch/FBGEMM#4033
X-link: facebookresearch/FBGEMM#1117
Pull Request resolved: #2919

# What is Key-Value Zero-Collision-Hash
Details can be found [here](https://fburl.com/oni52nmh). In short, we want to introduce a 1:1 mapping between embedding lookup ids (the values in a KJT) and embeddings. To do that, we use an extremely large embedding space, e.g. 2^50, and utilize the kv embedding capability already provided by SSD TBE. The difference is that we don't need to preallocate all the embeddings; instead we allocate and deallocate while training, aka dynamic embedding.

The major functionality is already provided by SSD TBE; the extra support needed is as follows:
1. optimizer offloading (taken care of by Benson and Sarunya), since we can no longer pre-allocate the optimizer
2. update split_embedding_weights so it returns not only weights but also weight_ids and buckets (both introduced in detail below)
3. dram kv, a new backend in addition to the SSD kv backend, needed for smaller models whose size can be handled by inference

NOTE: weight_id is needed because the embedding id (aka the embedding offset originally) is no longer contiguous; bucket is a new concept introduced specifically to tackle the checkpoint/publish resharding issue. These two are generated every time split_embedding_weights is called, instead of being member variables.

# change list
1. add the bucket concept to SSD TBE
2. update split_embedding_weights to return a tuple of 3 tensors (weight, weight_id, id_cnt_per_bucket)
3. add new unit tests for the key-value embedding cases
4. modify debug_optimizer_split to return only the valid optimizer state based on the weight ids

Reviewed By: q10

Differential Revision: D73274786

fbshipit-source-id: c3c37bdd306f2a542c7d90e14ffdb7f96594b4df
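A self-contained sketch of the three-tensor return shape described in change-list item 2; the tensor contents below are fabricated stand-ins, not real SSD TBE output, and the exact per-table structure of the real API may differ.

```python
# Fabricated example of the (weight, weight_id, id_cnt_per_bucket) shape.
import torch


def fake_split_embedding_weights() -> tuple:
    # Three embedding rows currently materialized for one table, embedding dim 4.
    weight = torch.randn(3, 4)
    # Sparse ids those rows belong to; ids are no longer a contiguous 0..N-1 range.
    weight_id = torch.tensor([17, 1_048_576, 9_000_000_000])
    # Per-bucket id counts used for checkpoint/publish resharding.
    id_cnt_per_bucket = torch.tensor([1, 1, 1])
    return weight, weight_id, id_cnt_per_bucket


weight, weight_id, id_cnt_per_bucket = fake_split_embedding_weights()
```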