cjmcv (Chen Jianming) / Starred · GitHub

Quantized Attention achieves speedups of 2-5x and 3-11x compared to FlashAttention and xformers respectively, without losing end-to-end metrics across language, image, and video models.

Cuda 1 Updated Jun 27, 2025

Fast and memory-efficient exact attention

Python 1 Updated Jun 28, 2025

Quantized Attention achieves speedups of 2-5x and 3-11x compared to FlashAttention and xformers respectively, without losing end-to-end metrics across language, image, and video models.

Cuda 1,836 139 Updated Jul 1, 2025

FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/

C++ 1,390 609 Updated Jul 2, 2025

CUDA Matrix Multiplication Optimization

Cuda 196 21 Updated Jul 19, 2024
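
A minimal NumPy sketch of the blocking/tiling idea that CUDA GEMM optimizations are built around: compute C one tile at a time so each tile of A and B is reused many times from fast memory. The function name and tile size are illustrative, not taken from this repository.

import numpy as np

def matmul_blocked(A, B, tile=32):
    # Blocked matrix multiply; in a CUDA kernel each (tile x tile)
    # sub-block of A and B would be staged in shared memory and reused
    # by a whole thread block instead of being re-read from global memory.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

# Sanity check against the library matmul.
A = np.random.default_rng(0).standard_normal((96, 64))
B = np.random.default_rng(1).standard_normal((64, 80))
assert np.allclose(matmul_blocked(A, B), A @ B)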

Fast and memory-efficient exact attention

Python 18,114 1,778 Updated Jul 2, 2025
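
A minimal NumPy sketch of the tiling/online-softmax idea behind memory-efficient exact attention: scores are computed one key/value block at a time while keeping a running row-wise max and normalizer, so the full attention matrix is never materialized. The function name and block size are illustrative, not the library's API.

import numpy as np

def attention_tiled(Q, K, V, block=64):
    # Exact softmax attention, processed block by block over the keys/values.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, d))
    m = np.full(n, -np.inf)   # running row-wise max of the scores
    l = np.zeros(n)           # running softmax normalizer
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale                # scores for this block only
        m_new = np.maximum(m, S.max(axis=1))
        p = np.exp(S - m_new[:, None])        # block-local probabilities
        corr = np.exp(m - m_new)              # rescale previously accumulated partials
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

# Matches the naive reference that materializes the full score matrix.
rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((128, 32)), rng.standard_normal((256, 32)), rng.standard_normal((256, 32))
S = Q @ K.T / np.sqrt(32)
ref = (np.exp(S - S.max(1, keepdims=True)) / np.exp(S - S.max(1, keepdims=True)).sum(1, keepdims=True)) @ V
assert np.allclose(attention_tiled(Q, K, V), ref)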

KV cache store for distributed LLM inference

C++ 278 28 Updated Jun 6, 2025
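
An illustrative Python toy of the general idea behind a KV cache store: attention key/value tensors are kept under a key derived from the token prefix, so a later request that shares the prefix can reuse them instead of recomputing prefill. This in-process dictionary version is only a sketch of the concept, not this repository's distributed design or API.

import hashlib
import numpy as np

class KVCacheStore:
    # Toy store: prefix token ids -> cached key/value tensors.
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(token_ids):
        return hashlib.sha256(str(list(token_ids)).encode("utf-8")).hexdigest()

    def put(self, token_ids, kv):
        self._store[self._key(token_ids)] = kv

    def get(self, token_ids):
        return self._store.get(self._key(token_ids))

store = KVCacheStore()
prefix = [101, 2009, 2003]
store.put(prefix, {"k": np.zeros((3, 8)), "v": np.zeros((3, 8))})
assert store.get(prefix) is not None   # cache hit: prefill for this prefix can be skipped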

A programmer's guide to cooking at home (Simplified Chinese only).

Dockerfile 90,537 10,326 Updated Jul 1, 2025

Open Source DeepWiki: AI-Powered Wiki Generator for GitHub/Gitlab/Bitbucket Repositories. Join the discord: https://discord.gg/gMwThUMeme

TypeScript 7,639 733 Updated Jul 1, 2025

C++ extensions in PyTorch

Python 1,110 233 Updated Jun 13, 2025

Perplexity GPU Kernels

C++ 384 46 Updated Jun 10, 2025

Ongoing research training transformer models at scale

Python 12,722 2,891 Updated Jul 2, 2025

A fast communication-overlapping library for tensor/expert parallelism on GPUs.

C++ 992 67 Updated May 28, 2025

My learning notes/codes for ML SYS.

Python 2,719 169 Updated Jul 1, 2025

A CPU tool for benchmarking peak floating-point performance.

Assembly 551 131 Updated May 8, 2025

NVIDIA curated collection of educational resources related to general purpose GPU programming.

Jupyter Notebook 541 96 Updated Jun 30, 2025

High-performance inference framework for large language models, focusing on efficiency, flexibility, and availability.

Python 1,153 76 Updated Jun 26, 2025

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Python 1,433 177 Updated Jul 12, 2024
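
A minimal NumPy sketch of the SmoothQuant idea: per-input-channel scales migrate activation outliers into the weights before quantization, leaving the layer's output mathematically unchanged. The function name and alpha value are illustrative, not the repository's API.

import numpy as np

def smooth_scales(X, W, alpha=0.5):
    # X: (tokens, in_features) activations, W: (in_features, out_features) weights.
    # s_j = max|X[:, j]|^alpha / max|W[j, :]|^(1 - alpha) for each input channel j.
    act_max = np.abs(X).max(axis=0)
    w_max = np.abs(W).max(axis=1)
    return (act_max ** alpha) / (w_max ** (1.0 - alpha))

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
X[:, 3] *= 50.0                       # simulate an activation outlier channel
W = rng.standard_normal((16, 4))

s = smooth_scales(X, W)
X_s, W_s = X / s, W * s[:, None]      # scale activations down, weights up
assert np.allclose(X_s @ W_s, X @ W)  # same output, but X_s has a much flatter range,
                                      # so INT8 activation quantization loses less accuracy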

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 1 Updated Feb 9, 2025

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Python 1 Updated Feb 10, 2025

CUDA Templates for Linear Algebra Subroutines

C++ 1 Updated Jun 20, 2025

A collection of benchmarks to measure basic GPU capabilities.

C++ 386 55 Updated Feb 11, 2025

FlashMLA: Efficient MLA decoding kernels

Cuda 11,637 872 Updated Apr 29, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,239 829 Updated Jul 1, 2025

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Python 14,478 1,030 Updated Jul 1, 2025

Reading notes on the open source code of AI infrastructure (sglang, llm, cutlass, hpc, etc.)

3 Updated Jun 29, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 1 Updated Jun 26, 2025

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

C++ 805 68 Updated Jun 3, 2025

CUDA Core Compute Libraries

C++ 1,724 231 Updated Jul 2, 2025

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

Python 1,675 300 Updated Jun 30, 2025