limin2021 (Li Min) / Starred · GitHub
  • ISCAS
  • Beijing

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Python 5,474 628 Updated Jun 23, 2025

How to optimize some algorithms in CUDA.

Cuda 2,279 205 Updated Jun 25, 2025

Hands-On Practical MLIR Tutorial

C++ 514 73 Updated Oct 20, 2023

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…

Python 2,508 435 Updated Jun 25, 2025

A fast MoE impl for PyTorch

Python 1,749 196 Updated Feb 10, 2025

A PyTorch Native LLM Training Framework

Python 824 49 Updated Dec 27, 2024

USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference

Python 520 53 Updated May 27, 2025

FlagGems is an operator library for large language models implemented in the Triton Language.

Python 581 105 Updated Jun 26, 2025

📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA.

Cuda 4,891 536 Updated Jun 21, 2025

DLRover: An Automatic Distributed Deep Learning System

Python 1,491 183 Updated Jun 23, 2025

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support

Python 8,871 1,137 Updated Jun 24, 2025

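🔥 A comprehensive collection of classic programming books, covering: computer systems and networking, system architecture, algorithms and data structures, front-end development, back-end development, mobile development, databases, testing, projects and teams, programmer career development, job hunting and interviews, and more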

17,915 2,553 Updated Mar 5, 2025

NCCL Tests

Cuda 1,156 290 Updated Jun 6, 2025

Ring attention implementation with flash attention

Python 790 69 Updated Jun 12, 2025

Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training

Python 211 10 Updated Aug 19, 2024

Distributed multigrid linear solver library on GPU

Cuda 572 157 Updated Feb 7, 2025

InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.

Python 393 69 Updated Jun 11, 2025

Zero Bubble Pipeline Parallelism

Python 399 26 Updated May 7, 2025

Large Context Attention

Python 716 53 Updated Jan 24, 2025

A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores

Python 51 7 Updated Nov 24, 2023

portDNN is a library implementing neural network algorithms written using SYCL

C++ 113 22 Updated May 21, 2024

Archived implementation of BLAS using the SYCL open standard. See oneMath for a replacement.

C++ 261 51 Updated Jan 13, 2025

BLAS++ is a C++ wrapper around CPU and GPU BLAS (basic linear algebra subroutines), developed as part of the SLATE project.

C++ 80 28 Updated Jun 19, 2025

Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052

C++ 474 37 Updated Mar 15, 2024

[ICDCS 2023] DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining

Python 12 3 Updated Dec 4, 2023

Open Source Computer Vision Library

C++ 82,774 56,202 Updated Jun 25, 2025

[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration

C++ 200 33 Updated Apr 27, 2022

MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba. Full multimodal LLM Android App:[MNN-LLM-Android](./apps/Android/MnnLlmChat/READ…

C++ 12,054 1,958 Updated Jun 24, 2025