SimSMoE: Toward Efficient Training Mixture of Experts via Solving Representational Collapse

License: MIT

This repo contains the code for the paper "SimSMoE: Toward Efficient Training Mixture of Experts via Solving Representational Collapse".

Giang Do, Hung Le, Truyen Tran

Overview

Sparse mixture of experts (SMoE) has emerged as an effective approach for scaling large language models while keeping computational cost constant. Despite several notable successes of SMoE, effectively training such architectures remains elusive due to the representation collapse problem, which harms model performance and causes parameter redundancy. In this work, we present Similarity-based Sparse Mixture of Experts (SimSMoE), a novel algorithm based on neural network similarity that guarantees a solution to the representation collapse issue between experts given a fixed FLOPs budget. We conduct extensive empirical evaluations on three large language models for both pre-training and fine-tuning tasks to illustrate the efficacy, robustness, and scalability of our method. The results demonstrate that SimSMoE significantly enhances existing routing policies and outperforms other SMoE routing methods on these tasks.
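
To make the idea concrete, the sketch below shows one way to quantify representational similarity between two experts, using linear CKA (centered kernel alignment) in PyTorch. This is an illustration under our own assumptions, not the paper's reference implementation; the exact similarity measure and how it enters training may differ.

    # Minimal sketch: linear CKA between two experts' hidden states on the
    # same batch of tokens. A CKA value close to 1 indicates the experts have
    # collapsed to near-identical representations.
    import torch

    def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """Linear CKA between two [num_tokens, hidden_dim] activation matrices."""
        x = x - x.mean(dim=0, keepdim=True)  # center features
        y = y - y.mean(dim=0, keepdim=True)
        num = (y.T @ x).pow(2).sum()                                   # ||Y^T X||_F^2
        den = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)  # ||X^T X||_F * ||Y^T Y||_F
        return num / (den + 1e-8)

    h_a = torch.randn(512, 768)              # expert A's outputs (toy data)
    h_b = h_a + 0.1 * torch.randn(512, 768)  # a nearly collapsed expert B
    print(linear_cka(h_a, h_b).item())       # close to 1.0
    print(linear_cka(h_a, torch.randn(512, 768)).item())  # noticeably lower for unrelated activations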

Prerequisites

  • FastMoE: A fast MoE implementation for PyTorch
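
A quick way to confirm the prerequisite is in place before running the scripts (this assumes FastMoE installs as the fmoe Python package; adjust if your setup differs):

    # Sanity check that PyTorch and FastMoE are importable.
    import torch
    import fmoe  # raises ImportError if FastMoE is not installed

    print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
    print("FastMoE imported from:", fmoe.__file__)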

Running Experiments in the Paper

Pre-training

  • Download the enwik8, text8, and wikitext-2 datasets from here, then update the bash scripts to point at your local data paths (a small layout check is sketched at the end of this section). The expected layout:
data_folder/
└── pretraining
    ├── enwik8
    │   ├── test.txt
    │   ├── train.txt
    │   └── valid.txt
    ├── text8
    │   ├── test.txt
    │   ├── train.txt
    │   └── valid.txt
    └── wikitext-2
        ├── test.txt
        ├── train.txt
        └── valid.txt
  • Select the Transformer architecture, its scale, and the type of SMoE layer. We support:
SMoE layer types: SMoE, SMoE-Dropout, XMoE, StableMoE, SimSMoE
Architectures: Transformer (S/M/L), GLAM (S/M/L)
  • Run all corresponding scripts:
    bash enwik8_exp.sh
    bash text8_exp.sh
    bash wikitext2_exp.sh

  • The checkpoint will be saved at checkpoints/enwik8/transformers-s during training.
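
Before launching the scripts, a small check like the one below (referenced in the data-download step above) can confirm the data is laid out as expected; data_folder is a placeholder for your local data path.

    # Verify the pre-training datasets follow the expected directory layout.
    from pathlib import Path

    data_root = Path("data_folder") / "pretraining"  # placeholder: set to your local path
    for name in ["enwik8", "text8", "wikitext-2"]:
        for split in ["train.txt", "valid.txt", "test.txt"]:
            path = data_root / name / split
            print(f"{path}: {'ok' if path.is_file() else 'MISSING'}")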

Citation

If you use SimSMoE in your research, please cite our paper:

Giang Do, Hung Le, and Truyen Tran. 2025.
SimSMoE: Toward Efficient Training Mixture of Experts via Solving Representational Collapse.
In Findings of the Association for Computational Linguistics: NAACL 2025, pages 2012–2025, Albuquerque, New Mexico. Association for Computational Linguistics.
https://aclanthology.org/2025.findings-naacl.107/
ISBN: 979-8-89176-195-7

@inproceedings{do-etal-2025-simsmoe,
  title = "{S}im{SM}o{E}: Toward Efficient Training Mixture of Experts via Solving Representational Collapse",
  author = "Do, Giang and Le, Hung and Tran, Truyen",
  editor = "Chiruzzo, Luis and Ritter, Alan and Wang, Lu",
  booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
  month = apr,
  year = "2025",
  address = "Albuquerque, New Mexico",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2025.findings-naacl.107/",
  pages = "2012--2025",
  ISBN = "979-8-89176-195-7"
}
