micropuma (Leon) / Starred · GitHub

Distributed Compiler Based on Triton for Parallel Systems

Python 853 67 Updated Jun 18, 2025

Perplexity GPU Kernels

C++ 378 46 Updated Jun 10, 2025

Fine-tuning & Reinforcement Learning for LLMs. 🦥 Train Qwen3, Llama 4, DeepSeek-R1, Gemma 3, TTS 2x faster with 70% less VRAM.

Python 41,192 3,277 Updated Jun 27, 2025

Efficient Triton Kernels for LLM Training

Python 5,275 358 Updated Jun 27, 2025

Notes for EE290 Mathematics of Data Science at UC Berkeley, taught by Jiantao Jiao in Fall 2019

TeX 1 Updated Nov 16, 2019

CS252 & EE290 Project, Spring 2020

C 6 2 Updated Mar 29, 2021

Flash Attention in ~100 lines of CUDA (forward pass only)

Cuda 854 84 Updated Dec 30, 2024

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

C++ 1,334 106 Updated Jun 27, 2025

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.

Python 37,732 6,533 Updated Jun 28, 2025

What would you do with 1000 H100s...

Jupyter Notebook 1,056 66 Updated Jan 10, 2024

Puzzles for exploring transformers

Jupyter Notebook 352 30 Updated May 4, 2023

Shared Middle-Layer for Triton Compilation

MLIR 257 66 Updated Jun 23, 2025

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and support state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorR…

C++ 10,882 1,529 Updated Jun 27, 2025

Cataloging released Triton kernels.

240 12 Updated Jan 10, 2025
Jupyter Notebook 138 14 Updated Apr 29, 2025

A survival guide for new employees and interns. Good Luck and Survive!

Python 244 34 Updated Mar 6, 2025

FlagTree is a unified compiler for multiple AI chips, which is forked from triton-lang/triton.

C++ 54 11 Updated Jun 27, 2025

The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.

C++ 1 Updated Feb 13, 2025

ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines (FPGA 2025 Best Paper Nominee)

C++ 33 8 Updated Jun 24, 2025

An MLIR Compiler for PyTorch/C/C++ Codes into HLS Dataflow Designs

MLIR 41 5 Updated May 20, 2025

An extremely fast Python package and project manager, written in Rust.

Rust 59,565 1,695 Updated Jun 28, 2025
Cuda 1 Updated May 23, 2025

A guide to hand-writing CUDA operators and interview preparation

Cuda 443 55 Updated Jan 15, 2025

Student version of Assignment 1 for Stanford CS336 - Language Modeling From Scratch

Python 252 224 Updated Apr 16, 2025

Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.

Cuda 358 50 Updated Jan 2, 2025
Python 510 50 Updated Jul 11, 2024

CUDA/Metal accelerated language model inference

C 592 27 Updated May 29, 2025

LLM inference in C/C++

C++ 82,288 12,213 Updated Jun 28, 2025