zhanjiqing (lucas) / Starred · GitHub

Official code of "StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs".

Python 65 4 Updated Jun 23, 2025
Python 723 47 Updated May 30, 2025

Async pipelined version of Verl

Python 103 11 Updated Apr 8, 2025

CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge techniques in sparse architecture, speculative sampling and qua…

Cuda 155 13 Updated Jul 1, 2025

A lightweight design for computation-communication overlap.

Cuda 144 5 Updated Jun 20, 2025

Custom reward development on top of verl

Python 61 4 Updated May 22, 2025

Distributed Compiler Based on Triton for Parallel Systems

Python 859 67 Updated Jun 18, 2025

Perplexity GPU Kernels

C++ 385 46 Updated Jun 10, 2025

Efficient Triton Kernels for LLM Training

Python 5,288 361 Updated Jul 2, 2025

Train transformer language models with reinforcement learning.

Python 14,412 2,002 Updated Jul 1, 2025

CUDA Templates for Linear Algebra Subroutines

C++ 7,780 1,294 Updated Jun 27, 2025

A Datacenter Scale Distributed Inference Serving Framework

Rust 4,392 460 Updated Jul 2, 2025

PyTorch building blocks for the OLMo ecosystem

Python 243 43 Updated Jul 2, 2025

An Open-source RL System from ByteDance Seed and Tsinghua AIR

Python 1,394 59 Updated May 11, 2025

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ 993 67 Updated May 28, 2025

verl: Volcano Engine Reinforcement Learning for LLMs

Python 10,231 1,692 Updated Jul 2, 2025
Python 10 Updated Jan 14, 2025

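AIInfra (AI infrastructure) refers to the AI system stack, from low-level hardware such as chips up to the upper-level software stack, that supports training and inference of large AI models.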

Python 3,267 455 Updated Jul 2, 2025

Expert Parallelism Load Balancer

Python 1,221 195 Updated Mar 24, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.

Python 2,821 298 Updated Mar 10, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Python 5,492 635 Updated Jun 23, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,239 830 Updated Jul 1, 2025

FlashMLA: Efficient MLA decoding kernels

Cuda 11,637 872 Updated Apr 29, 2025

Qwen2.5-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.

Jupyter Notebook 11,304 821 Updated May 15, 2025

A very simple GRPO implementation for reproducing R1-like LLM thinking.

Python 1,154 95 Updated Apr 3, 2025

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

7,846 279 Updated May 15, 2025

[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training

Python 211 15 Updated Jun 16, 2025

Microsoft Automatic Mixed Precision Library

Python 611 49 Updated Sep 29, 2024

A unified library of state-of-the-art model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment…

Python 1,025 92 Updated Jun 30, 2025