Starred repositories
ZeroSearch: Incentivize the Search Capability of LLMs without Searching
The official implementation for Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
IKEA: Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent
User-friendly implementation of Mixture-of-Sparse-Attention (MoSA). MoSA selects distinct tokens for each head via expert-choice routing, providing a content-based sparse attention mechanism.
Awesome RL Reasoning Recipes ("Triple R")
SkyRL-v0: Train Real-World Long-Horizon Agents via Reinforcement Learning
The code for Consistent In-Context Editing, an approach for tuning language models through contextual distributions, overcoming the limitations of traditional fine-tuning methods that learn towards…
[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation
Accelerate LLM preference tuning via prefix sharing with a single line of code
[ICLR 2025🔥] D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models
Paper list for Efficient Reasoning.
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥
Efficient LLM Inference over Long Sequences
SpargeAttention: A training-free sparse attention that can accelerate any model inference.
Quantized Attention achieves speedups of 2-3x and 3-5x compared to FlashAttention and xformers, respectively, without losing end-to-end metrics across language, image, and video models.
[NAACL 2025] Official Implementation of "HMT: Hierarchical Memory Transformer for Long Context Language Processing"
Code repo for "CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs".
Trains Transformer model variants without shuffling data between batches.
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
KV cache compression for high-throughput LLM inference
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
[ICLR'25] Data and code for our paper "Why Does the Effective Context Length of LLMs Fall Short?"
Qwen2.5-Omni is an end-to-end multimodal model by the Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, and video, and performing real-time speech generation.
[ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate"