Stars
Official code of "StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs".
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…
A lightweight design for computation-communication overlap.
yuanzhoulvpi2017 / nano_rl
Forked from volcengine/verl. Custom reward development built on top of verl.
Distributed Compiler Based on Triton for Parallel Systems
Efficient Triton Kernels for LLM Training
Train transformer language models with reinforcement learning.
A Datacenter Scale Distributed Inference Serving Framework
PyTorch building blocks for the OLMo ecosystem
An Open-source RL System from ByteDance Seed and Tsinghua AIR
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
verl: Volcano Engine Reinforcement Learning for LLMs
AIInfra (AI infrastructure) refers to the full AI system stack, from underlying hardware such as chips up to the software layers that support training and inference of large AI models.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
FlashMLA: Efficient MLA decoding kernels
Qwen2.5-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
A very simple GRPO implementation for reproducing r1-like LLM thinking.
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
A unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment…