Stars
Distributed Compiler Based on Triton for Parallel Systems
Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train Qwen3, Llama 4, DeepSeek-R1, Gemma 3, TTS 2x faster with 70% less VRAM.
Efficient Triton Kernels for LLM Training
Notes for EE290 Mathematics of Data Science at UC Berkeley, taught by Jiantao Jiao in Fall 2019
Flash Attention in ~100 lines of CUDA (forward pass only)
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
What would you do with 1000 H100s...
Puzzles for exploring transformers
Shared Middle-Layer for Triton Compilation
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorR…
A survival guide for new employees and interns. Good Luck and Survive!
FlagTree is a unified compiler for multiple AI chips, forked from triton-lang/triton.
micropuma / torch-mlir
Forked from llvm/torch-mlir. The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.
ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines (FPGA 2025 Best Paper Nominee)
An MLIR Compiler for PyTorch/C/C++ Code into HLS Dataflow Designs
An extremely fast Python package and project manager, written in Rust.
Student version of Assignment 1 for Stanford CS336 - Language Modeling From Scratch
Optimizing SGEMM kernel functions on NVIDIA GPUs to close-to-cuBLAS performance (a naive baseline is sketched after this list).
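
For context on that last entry, here is a minimal sketch of the naive SGEMM baseline that such optimization work typically starts from. This is not code from that repository; the kernel name, problem sizes, and launch configuration are illustrative, and close-to-cuBLAS kernels add shared-memory tiling, register blocking, and vectorized loads on top of this.

// Naive SGEMM baseline: C = alpha*A*B + beta*C, row-major.
// One thread computes one element of C; no tiling or shared memory.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float* A, const float* B,
                            float beta, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];  // dot(A row, B col)
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}

int main() {
    const int M = 256, N = 256, K = 256;  // illustrative sizes
    std::vector<float> hA(M * K, 1.0f), hB(K * N, 1.0f), hC(M * N, 0.0f);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, hC.data(), hC.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // 256 threads per block
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    sgemm_naive<<<grid, block>>>(M, N, K, 1.0f, dA, dB, 0.0f, dC);
    cudaDeviceSynchronize();

    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("C[0] = %f (expect %d)\n", hC[0], K);  // all-ones inputs sum K terms
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}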