Stars
Using Tree-of-Thought Prompting to boost ChatGPT's reasoning
A dummy's guide to setting up (and using) HPC clusters on Ubuntu 22.04 LTS using Slurm and Munge. Created by the Quant Club @ UIowa.
FlashInfer: Kernel Library for LLM Serving
Running large language models on a single GPU for throughput-oriented scenarios.
A low-latency & high-throughput serving engine for LLMs
Disaggregated serving system for Large Language Models (LLMs).
A collection of benchmarks and datasets for evaluating LLMs.
Large Language Model (LLM) Systems Paper List
Triton implementation of FlashAttention2 that adds Custom Masks.
This is the official repo of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"
Doing simple retrieval from LLMs at various context lengths to measure accuracy
🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
[ICLR 2023] "Learning to Grow Pretrained Models for Efficient Transformer Training" by Peihao Wang, Rameswar Panda, Lucas Torroba Hennigen, Philip Greengard, Leonid Karlinsky, Rogerio Feris, David …
MoMA: Multimodal LLM Adapter for Fast Personalized Image Generation
✨✨Latest Advances on Multimodal Large Language Models
Reading list for research topics in multimodal machine learning
A curated list for Efficient Large Language Models
[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
📰 Must-read papers and blogs on Speculative Decoding ⚡️
Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main)
High-speed Large Language Model Serving for Local Deployment
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding