Stars
A lightweight design for computation-communication overlap.
kwai / Megatron-Kwai
Forked from NVIDIA/Megatron-LM
[USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Parallelism
PyTorch distributed training acceleration framework
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA/Tensor Cores Kernels, HGEMM, FA-2 MMA.
My learning notes and code for ML SYS.
Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O
KV cache compression for high-throughput LLM inference
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
A throughput-oriented high-performance serving framework for LLMs
SGLang is a fast serving framework for large language models and vision language models.
TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale.
SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders.
A performance library for machine learning applications.
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
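A good project for campus recruiting (fall and spring hiring) and internships! Build a high-performance deep learning inference library from scratch, with support for inference on llama2, Unet, Yolov5, Resnet, and other models. Implement a high-performance deep learning inference library step by step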
A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API
The official gpt4free repository | various collection of powerful language models | o4, o3 and deepseek r1, gpt-4.1, gemini 2.5
🐍 Geometric Computer Vision Library for Spatial AI
A Data Streaming Library for Efficient Neural Network Training
Profiling and Improving the PyTorch Dataloader for high-latency Storage
Effective Modern C++: translation complete
MegBox is an easy-to-use, well-rounded and safe toolbox for MegEngine. It aims to improve the user experience and speed up the development process.
AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.