limin2021

Li Min limin2021

Majoring in computer science: high-performance computing and parallel computing.

23 followers · 63 following

ISCAS
Beijing

Stars

deepseek-ai / DeepGEMM

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Python 5,474 628 Updated Jun 23, 2025

BBuf / how-to-optim-algorithm-in-cuda

How to optimize some algorithms in CUDA.

Cuda 2,279 205 Updated Jun 25, 2025

KEKE046 / mlir-tutorial

Hands-On Practical MLIR Tutorial

C++ 514 73 Updated Oct 20, 2023

NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…

Python 2,508 435 Updated Jun 25, 2025

laekov / fastmoe

A fast MoE impl for PyTorch

Python 1,749 196 Updated Feb 10, 2025

volcengine / veScale

A PyTorch Native LLM Training Framework

Python 824 49 Updated Dec 27, 2024

feifeibear / long-context-attention

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference

Python 520 53 Updated May 27, 2025

FlagOpen / FlagGems

FlagGems is an operator library for large language models implemented in the Triton Language.

Python 581 105 Updated Jun 26, 2025

xlite-dev / LeetCUDA

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA.

Cuda 4,891 536 Updated Jun 21, 2025

intelligent-machine-learning / dlrover

DLRover: An Automatic Distributed Deep Learning System

Python 1,491 183 Updated Jun 23, 2025

huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support

Python 8,871 1,137 Updated Jun 24, 2025

imarvinle / awesome-cs-books

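🔥 A comprehensive collection of classic programming books, covering: computer systems and networking, system architecture, algorithms and data structures, front-end development, back-end development, mobile development, databases, testing, projects and teams, programmer career development, job hunting and interviews, and more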

17,915 2,553 Updated Mar 5, 2025

NVIDIA / nccl-tests

NCCL Tests

Cuda 1,156 290 Updated Jun 6, 2025

zhuzilin / ring-flash-attention

Ring attention implementation with flash attention

Python 790 69 Updated Jun 12, 2025

exists-forall / striped_attention

Python 39 2 Updated Nov 10, 2023

RulinShao / LightSeq

Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training

Python 211 10 Updated Aug 19, 2024

NVIDIA / AMGX

Distributed multigrid linear solver library on GPU

Cuda 572 157 Updated Feb 7, 2025

InternLM / InternEvo

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.

Python 393 69 Updated Jun 11, 2025

sail-sg / zero-bubble-pipeline-parallelism

Forked from NVIDIA/Megatron-LM

Zero Bubble Pipeline Parallelism

Python 399 26 Updated May 7, 2025

haoliuhl / ringattention

UDC-GAC / venom

codeplaysoftware / portDNN

codeplaysoftware / portBLAS

icl-utk-edu / blaspp

anyscale / llm-continuous-batching-benchmarks

bytedance / ByteTransformer

lzhangbv / dear_pytorch

opencv / opencv

mit-han-lab / inter-operator-scheduler

alibaba / MNN