Stars
Official code of "StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs".
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…
A lightweight design for computation-communication overlap.
yuanzhoulvpi2017 / nano_rl
Forked from volcengine/verl. Custom reward development built on top of verl.
Distributed Compiler Based on Triton for Parallel Systems
Efficient Triton Kernels for LLM Training
Train transformer language models with reinforcement learning.
A Datacenter Scale Distributed Inference Serving Framework
PyTorch building blocks for the OLMo ecosystem
An Open-source RL System from ByteDance Seed and Tsinghua AIR
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
verl: Volcano Engine Reinforcement Learning for LLMs
AIInfra (AI infrastructure) refers to the full AI system stack, from underlying hardware such as chips up to the software layers that support training and inference of large AI models.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
FlashMLA: Efficient MLA decoding kernels
Qwen2.5-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
A very simple GRPO implementation for reproducing r1-like LLM thinking.
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
A unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment…