SimSMoE: Toward Efficient Training Mixture of Experts via Solving Representational Collapse

License: MIT

This repo contains the code for the paper "SimSMoE: Toward Efficient Training Mixture of Experts via Solving Representational Collapse".

Giang Do, Hung Le, Truyen Tran

Overview

Sparse mixture of experts (SMoE) has emerged as an effective approach for scaling large language models while keeping computational cost constant. Despite several notable successes of SMoE, effectively training such architectures remains elusive due to the representation collapse problem, which harms model performance and causes parameter redundancy. In this work, we present Similarity-based Sparse Mixture of Experts (SimSMoE), a novel algorithm based on neural network similarity that guarantees a solution to the representation collapse issue between experts given a fixed FLOPs budget. We conduct extensive empirical evaluations on three large language models for both pre-training and fine-tuning tasks to illustrate the efficacy, robustness, and scalability of our method. The results demonstrate that SimSMoE significantly enhances existing routing policies and outperforms other SMoE routing methods on these tasks.
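
To make the idea concrete, the sketch below shows one way to quantify representational similarity between two experts, using linear CKA (centered kernel alignment) in PyTorch. This is an illustration under our own assumptions, not the paper's reference implementation; the exact similarity measure and how it enters training may differ.

    # Minimal sketch: linear CKA between two experts' hidden states on the
    # same batch of tokens. A CKA value close to 1 indicates the experts have
    # collapsed to near-identical representations.
    import torch

    def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        """Linear CKA between two [num_tokens, hidden_dim] activation matrices."""
        x = x - x.mean(dim=0, keepdim=True)  # center features
        y = y - y.mean(dim=0, keepdim=True)
        num = (y.T @ x).pow(2).sum()                                   # ||Y^T X||_F^2
        den = torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y)  # ||X^T X||_F * ||Y^T Y||_F
        return num / (den + 1e-8)

    h_a = torch.randn(512, 768)              # expert A's outputs (toy data)
    h_b = h_a + 0.1 * torch.randn(512, 768)  # a nearly collapsed expert B
    print(linear_cka(h_a, h_b).item())       # close to 1.0
    print(linear_cka(h_a, torch.randn(512, 768)).item())  # noticeably lower for unrelated activations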

Prerequisites

  • FastMoE: A fast MoE implementation for PyTorch
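
A quick way to confirm the prerequisite is in place before running the scripts (this assumes FastMoE installs as the fmoe Python package; adjust if your setup differs):

    # Sanity check that PyTorch and FastMoE are importable.
    import torch
    import fmoe  # raises ImportError if FastMoE is not installed

    print(f"PyTorch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
    print("FastMoE imported from:", fmoe.__file__)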

Running Experiments in the Paper

Pre-training

  • Download the enwik8, text8, and wikitext-2 datasets from here, then update the bash scripts to point at your local data paths (a small layout check is sketched at the end of this section). The expected layout:
data_folder/
└── pretraining
    ├── enwik8
    │   ├── test.txt
    │   ├── train.txt
    │   └── valid.txt
    ├── text8
    │   ├── test.txt
    │   ├── train.txt
    │   └── valid.txt
    └── wikitext-2
        ├── test.txt
        ├── train.txt
        └── valid.txt
  • Select the Transformer architecture, its scale, and the type of SMoE layer. We support:
SMoE layer types: SMoE, SMoE-Dropout, XMoE, StableMoE, SimSMoE
Architectures: Transformer (S/M/L), GLAM (S/M/L)
  • Run all corresponding scripts:
    bash enwik8_exp.sh
    bash text8_exp.sh
    bash wikitext2_exp.sh

  • The checkpoint will be saved at checkpoints/enwik8/transformers-s during training.
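
Before launching the scripts, a small check like the one below (referenced in the data-download step above) can confirm the data is laid out as expected; data_folder is a placeholder for your local data path.

    # Verify the pre-training datasets follow the expected directory layout.
    from pathlib import Path

    data_root = Path("data_folder") / "pretraining"  # placeholder: set to your local path
    for name in ["enwik8", "text8", "wikitext-2"]:
        for split in ["train.txt", "valid.txt", "test.txt"]:
            path = data_root / name / split
            print(f"{path}: {'ok' if path.is_file() else 'MISSING'}")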

Citation

If you use SimSMoE in your research, please cite our paper:

Giang Do, Hung Le, and Truyen Tran. 2025.
SimSMoE: Toward Efficient Training Mixture of Experts via Solving Representational Collapse.
In Findings of the Association for Computational Linguistics: NAACL 2025, pages 2012–2025, Albuquerque, New Mexico. Association for Computational Linguistics.
https://aclanthology.org/2025.findings-naacl.107/
ISBN: 979-8-89176-195-7

@inproceedings{do-etal-2025-simsmoe,
  title = "{S}im{SM}o{E}: Toward Efficient Training Mixture of Experts via Solving Representational Collapse",
  author = "Do, Giang and Le, Hung and Tran, Truyen",
  editor = "Chiruzzo, Luis and Ritter, Alan and Wang, Lu",
  booktitle = "Findings of the Association for Computational Linguistics: NAACL 2025",
  month = apr,
  year = "2025",
  address = "Albuquerque, New Mexico",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2025.findings-naacl.107/",
  pages = "2012--2025",
  ISBN = "979-8-89176-195-7"
}
