ISCAS
Beijing
Stars
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
How to optimize some algorithms in CUDA.
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Blackwell GPUs, to provide better performance with lower memory…
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
FlagGems is an operator library for large language models implemented in the Triton Language.
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA.
DLRover: An Automatic Distributed Deep Learning System
🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
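🔥 A comprehensive collection of classic programming books, covering: computer systems and networks, system architecture, algorithms and data structures, front-end development, back-end development, mobile development, databases, testing, projects and teams, programmer career development, job hunting and interviews, and more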
Ring attention implementation with flash attention
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
InternEvo is an open-source, lightweight training framework that aims to support model pre-training without the need for extensive dependencies.
Zero Bubble Pipeline Parallelism
A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
portDNN is a library implementing neural network algorithms written using SYCL
Archived implementation of BLAS using the SYCL open standard. See oneMath for a replacement.
BLAS++ is a C++ wrapper around CPU and GPU BLAS (basic linear algebra subroutines), developed as part of the SLATE project.
Optimized BERT transformer inference on NVIDIA GPUs. https://arxiv.org/abs/2210.03052
[ICDCS 2023] DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba. Full multimodal LLM Android App:[MNN-LLM-Android](./apps/Android/MnnLlmChat/READ…