Stars
hipBLASLt is a library that provides general matrix-matrix operations with a flexible API and extends functionality beyond a traditional BLAS library.
FlagGems is an operator library for large language models implemented in the Triton Language.
Efficient Triton Kernels for LLM Training
Helpful tools and examples for working with flex-attention
Penn CIS 5650 (GPU Programming and Architecture) Final Project
Llama3-Tutorial (XTuner, LMDeploy, OpenCompass)
Supporting PyTorch models with the Google AI Edge TFLite runtime.
fanshiqing / grouped_gemm
Forked from tgale96/grouped_gemm. PyTorch bindings for CUTLASS grouped GEMM.
CUDA tutorials for maths & ML, with examples covering multi-GPU programming, fused attention, Winograd convolution, and reinforcement learning.
llama3 implementation one matrix multiplication at a time
Xiao's CUDA Optimization Guide [actively adding new content]
GPU programming related news and material links
【LLMs Nine-Story Demon Tower】Shares hands-on practice and experience with LLMs in natural language processing (ChatGLM, Chinese-LLaMA-Alpaca, Vicuna, LLaMA, GPT4ALL, etc.), information retrieval (langchain), speech synthesis, speech recognition, multimodal models (Stable Diffusion, MiniGPT-4, VisualGLM-6B, Ziya-Visual, etc.), and more.
PyTorch extensions for high performance and large scale training.