Now, you can use SCBench to evaluate long-context methods across the full KV cache lifecycle and various long-context capabilities.
🧩 [24/12/11] We will present SCBench at the Microsoft Booth and ENLSP at NeurIPS'24. See you in Vancouver!
Long-context Large Language Models (LLMs) have enabled numerous downstream applications but also introduced significant challenges related to computational and memory efficiency. To address these challenges, optimizations for long-context inference have been developed, centered around the KV cache. However, existing benchmarks often evaluate only single-request scenarios, neglecting the full lifecycle of the KV cache in real-world use. This oversight is particularly critical, as KV cache reuse has become widely adopted in LLM inference frameworks, such as vLLM and SGLang, as well as by LLM providers, including OpenAI, Microsoft, Google, and Anthropic. To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, and 4) KV cache loading. Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared-context modes and covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task. With SCBench, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs (Codestral-Mamba), Mamba-Attention hybrids (Jamba-1.5-Mini), and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on six Transformer-based long-context LLMs: Llama-3.1-8B/70B, Qwen2.5-72B/32B, Llama-3-8B-262K, and GLM-4-9B. Our findings reveal that sub-O(n) memory methods often struggle with accuracy in multi-turn scenarios, while sparse encoding methods with O(n) memory and sub-O(n^2) computation in pre-filling perform robustly. Additionally, dynamic sparse patterns yield more expressive KV caches than static ones, and layer-level sparsity in hybrid architectures effectively reduces memory usage while delivering promising results.
Long-context methods are designed and utilized around the KV cache, but existing benchmarks focus only on single-request scenarios, ignoring its full lifecycle in real-world use.
This oversight is particularly critical, as KV cache reuse has become widely adopted in LLM inference frameworks, such as vLLM and SGLang, as well as by LLM providers, including OpenAI, Microsoft, Google, and Anthropic.
Figure 1. KV cache lifecycle. Prior benchmarks focus on single-request evaluation, while real-world applications reuse the KV cache across requests. We propose SCBench and categorize long-context methods into KV Cache Generation, Compression, Retrieval, and Loading from a KV-cache-centric perspective.
To address this gap, we introduce SCBench (SharedContextBench), a comprehensive benchmark for evaluating long-context methods from a KV cache-centric perspective: 1) KV cache generation, 2) KV cache compression, 3) KV cache retrieval, and 4) KV cache loading, as shown in Fig.(1). Specifically, SCBench uses test examples with shared context, spanning 12 tasks with two shared-context modes (Multi-Turn Mode, Multi-Request Mode) and covering four categories of long-context capabilities: string retrieval, semantic retrieval, global information, and multi-task.
Figure 2. Long-context tasks often involve context sharing, e.g., multi-turn dialogues, multi-step reasoning, and repository-level tasks. (a) Illustration of two common shared-context patterns. (b) Overview of tasks and scenarios covered by our benchmark, encompassing four categories of long-context abilities and two shared-context modes.
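To make the two shared-context modes concrete, here is a minimal sketch of how a session differs between them. The field names are illustrative only and are not the benchmark's actual schema.

```python
# Illustrative only: field names are hypothetical, not SCBench's actual schema.
shared_context = "<a long document, codebase, or dialogue history>"

# Multi-turn mode: one session reuses the KV cache of the shared context
# within the same conversation; later turns may depend on earlier answers.
multi_turn_session = {
    "context": shared_context,
    "turns": [
        {"query": "Question 1 about the context"},
        {"query": "Follow-up that may reference the previous answer"},
    ],
}

# Multi-request mode: independent requests share the same context prefix,
# so the KV cache is reused across requests (as in prefix caching).
multi_request_sessions = [
    {"context": shared_context, "query": "Independent question 1"},
    {"context": shared_context, "query": "Independent question 2"},
]
```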
With SCBench, we provide an extensive KV cache-centric analysis of eight categories of long-context solutions, including Gated Linear RNNs (Codestral-Mamba), Mamba-Attention hybrids (Jamba-1.5-Mini), and efficient methods such as sparse attention, KV cache dropping, quantization, retrieval, loading, and prompt compression. The evaluation is conducted on six Transformer-based long-context LLMs: Llama-3.1-8B/70B, Qwen2.5-72B/32B, Llama-3-8B-262K, and GLM-4-9B.
Table 1. We evaluated long-context methods on SCBench, where n represents the token size of the input prompt and m represents the generation token size, with n ≫ m.
Based on SCBench, our experimental results reveal the following insights:
1) Sub-O(n) memory is almost infeasible in multi-turn decoding, as shown in Fig.(3). Sparse decoding methods (sub-O(n) memory) perform well on the first query but lose accuracy in subsequent requests. In contrast, sparse encoding methods (O(n) memory with O(n^2) computation during pre-filling) can approximate full attention accuracy across multiple queries (see the back-of-the-envelope memory sketch after this list).
2) Task performance shows varying decline trends, as illustrated in Fig.(3b). Sparse KV cache methods excel in tasks requiring global information, whereas O(n) memory is essential for tasks involving exact-match retrieval.
3) All long-context methods experience performance degradation as the compression rate decreases, as shown in Fig.(4). However, sub-O(n) memory methods exhibit a significant performance drop at a 1/4 compression rate. Methods such as RetrievalAttention and KIVI, which maintain O(n) memory with sparse decoding, sustain higher performance even under stronger compression.
4) Long-generation scenarios exhibit distribution shift issues: as generation length and the number of rounds increase, the importance distribution of the KV cache changes significantly. This out-of-distribution (OOD) issue degrades performance even for O(n) memory methods like RetrievalAttention on extended tasks, as shown in Fig.(3).
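To make the memory argument in insight 1 concrete, here is a back-of-the-envelope sketch (not from the paper) of how the full KV cache grows linearly with context length. The model dimensions are assumed values roughly matching a GQA model like Llama-3.1-8B; adjust them for your model.

```python
# Back-of-the-envelope KV cache size for a GQA Transformer.
# Assumed dimensions (roughly Llama-3.1-8B): 32 layers, 8 KV heads,
# head_dim 128, fp16 (2 bytes) -- adjust for your model.
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x accounts for keys and values
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

for n in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(n) / 2**30
    print(f"{n:>9,} tokens -> {gib:6.1f} GiB of KV cache")

# O(n) memory: doubling the context doubles the cache. Sub-O(n) methods
# (e.g., KV cache dropping) cap this, but the dropped entries cannot be
# recovered for later turns that attend to different parts of the context.
```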
Figure 3. Overview of performance results for SCBench. (a) Performance trends of various long-context methods across multiple requests. Methods with O(n) memory cost in decoding show improving performance as requests increase. In contrast, methods with sub-O(n) KV cache in decoding, like KV cache dropping methods, perform well only in the first request. (b) Specific performance of different long-context methods across various long-context capability tasks. All evaluated long-context methods exhibit some loss in Retrieval capability while largely maintaining Global Information processing capability.
Figure 4. Performance of various long-context methods at different compression rates on SCBench using Llama-3.1-8B.
In this work, we propose a novel perspective: these long-context methods can be viewed as optimizations centered around the KV cache at different stages. Specifically, we introduce a KV-cache-centric framework that systematically categorizes long-context methods into four stages: KV Cache Generation, Compression, Retrieval, and Loading, as illustrated in Fig.(1).
Specifically, the four stages of the KV-cache-centric framework are defined as follows (a minimal categorization sketch follows this list):
1) KV Cache Generation: This stage optimizes the efficient generation of KV cache during inference. Techniques include sparse attention (e.g., A-shape, Tri-shape, MInference), SSM or hybrid approaches (e.g., Mamba, Jamba), and prompt compression (e.g., LLMLingua-2).
2) KV Cache Compression: After generation, the KV cache is compressed before being stored. Methods include KV cache dropping (e.g., StreamingLLM, SnapKV) and KV cache quantization (e.g., KIVI).
3) KV Cache Retrieval: Relevant KV cache blocks are retrieved from a storage pool based on the request’s prefix, reducing time-to-first-token (TTFT). Approaches include semantic retrieval methods like CacheBlend.
4) KV Cache Loading: This stage dynamically loads the KV cache from storage (e.g., VRAM, DRAM, SSD, or RDMA) into GPU on-chip SRAM and computes sparse attention over it. Methods include Quest, RetrievalAttention, and MagicPIG.
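As an illustration of this categorization, the stages and the methods discussed above can be expressed as a simple lookup. This is a minimal sketch of the taxonomy, not an API from the SCBench codebase.

```python
from enum import Enum

class KVCacheStage(Enum):
    GENERATION = "KV cache generation"    # building the cache during pre-filling
    COMPRESSION = "KV cache compression"  # shrinking the cache before storage
    RETRIEVAL = "KV cache retrieval"      # fetching cached blocks by request prefix
    LOADING = "KV cache loading"          # moving the cache to SRAM + sparse attention

# Mapping of methods mentioned in this README to their primary stage.
METHOD_STAGE = {
    "A-shape": KVCacheStage.GENERATION,
    "Tri-shape": KVCacheStage.GENERATION,
    "MInference": KVCacheStage.GENERATION,
    "LLMLingua-2": KVCacheStage.GENERATION,
    "StreamingLLM": KVCacheStage.COMPRESSION,
    "SnapKV": KVCacheStage.COMPRESSION,
    "KIVI": KVCacheStage.COMPRESSION,
    "CacheBlend": KVCacheStage.RETRIEVAL,
    "Quest": KVCacheStage.LOADING,
    "RetrievalAttention": KVCacheStage.LOADING,
    "MagicPIG": KVCacheStage.LOADING,
}
```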
We also introduce a novel training-free sparse attention method, Tri-shape, with improved first-turn accuracy, as shown in Fig.(5).
Specifically, in addition to retaining the sink token and local window regions preserved by A-shape, Tri-shape also retains the last window query region, forming a triangular pattern for sparse attention during the pre-filling stage. The motivation arises from the observation that A-shape with dense decoding exhibits significant performance improvement after multiple requests. Tri-shape can notably enhance performance in both turn-0 and multi-request scenarios, as detailed in Sec.4. Furthermore, it preserves the ability of LLMs to follow instructions, as demonstrated in the case study in Appendix F.
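For readers who want to see the pattern concretely, below is a minimal sketch of a Tri-shape pre-filling mask as described above: sink tokens plus a local window (the A-shape components) plus the last-window query rows attending to the full prefix. The window sizes are illustrative defaults, and the official implementation in this repo may differ in details.

```python
import torch

def tri_shape_mask(seq_len, n_sink=4, local_window=1024, last_q_window=64):
    """Boolean pre-filling attention mask (True = attend) for Tri-shape.

    Keeps: (1) sink tokens, (2) a local sliding window, and
    (3) the last `last_q_window` query rows attending to the whole prefix,
    all restricted to the causal (lower-triangular) region.
    """
    q = torch.arange(seq_len).unsqueeze(1)      # query positions
    k = torch.arange(seq_len).unsqueeze(0)      # key positions
    causal = k <= q
    sink = k < n_sink                           # A-shape: attention sinks
    local = (q - k) < local_window              # A-shape: local window
    last_rows = q >= (seq_len - last_q_window)  # Tri-shape: last-window queries
    return causal & (sink | local | last_rows)

mask = tri_shape_mask(4096)
print(mask.float().mean())  # fraction of entries attended vs. full causal attention
```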
Notably, some recent concurrent works (e.g., Star Attention) have also proposed similar patterns to accelerate the long-context pre-filling stage.
Figure 5. The sparse attention methods framework.
SCBench comprises 12 tasks covering four long-context abilities: string retrieval, semantic retrieval, global information processing, and multi-tasking, across two shared context modes—multi-turn and multi-request. These tasks span various domains, including code, retrieval, question answering, summarization, in-context learning, multi-hop tracing, and multi-tasking, as shown in Fig.(2b). In total, SCBench includes 931 multi-turn sessions with 4,853 queries, averaging 5 turns per session. Task statistics are provided in Table 2, with examples and configurations in Table 3. Below, we detail the construction of our benchmark.
Table 2. Overview of SCBench tasks.
Table 3. Task examples and configurations in SCBench. We use different colors to highlight the questions, answers, and distractors in our examples.
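The benchmark data can be loaded programmatically. The sketch below assumes the data is hosted on the Hugging Face Hub under `microsoft/SCBench` with per-task configs such as `scbench_kv`; consult this repository's usage instructions for the exact repo id, config names, and split names.

```python
# Hedged sketch: load one SCBench task from the Hugging Face Hub.
# The repo id "microsoft/SCBench", the config "scbench_kv", and the split
# name are assumptions -- check this repository's docs for the exact names.
from datasets import load_dataset

data = load_dataset("microsoft/SCBench", "scbench_kv", split="test")
session = data[0]
# Each session pairs one long shared context with several follow-up queries,
# so the KV cache built for the context can be reused across turns.
print({k: type(v).__name__ for k, v in session.items()})
```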
We analyzed the attention distribution for the Retr.KV task across multiple turns with a shared context. As shown in Fig.(5a), the critical key-value pairs (KVs) are highly query-dependent and vary significantly between turns.
Figure 5. Attention visualization of Retr.KV for the shared context across multiple turns.
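A simple way to quantify this query dependence is to compare the top-k attended key positions of two different turns over the same shared context. The snippet below is a sketch of that kind of analysis, not the paper's exact script; the toy random inputs stand in for per-position attention mass extracted from a model.

```python
import torch

def topk_kv_overlap(attn_turn_a, attn_turn_b, k=512):
    """Jaccard overlap of the top-k attended key positions of two turns.

    attn_turn_a / attn_turn_b: 1-D tensors of attention mass per context
    position (e.g., averaged over the query tokens and heads of each turn).
    """
    top_a = set(torch.topk(attn_turn_a, k).indices.tolist())
    top_b = set(torch.topk(attn_turn_b, k).indices.tolist())
    return len(top_a & top_b) / len(top_a | top_b)

# Toy example with random "attention" over a 32k-token shared context:
a, b = torch.rand(32_000), torch.rand(32_000)
print(topk_kv_overlap(a, b))  # low overlap -> the critical KVs are query-dependent
```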
Fig.(6) illustrates the performance of various long-context methods across multiple tasks and shared-context modes on different base LLMs. Key observations include:
1) In retrieval tasks, most long-context methods, except MInference, perform poorly, particularly in string retrieval.
2) Tri-shape also generalizes well across tasks, ranking second only to MInference across models. Our analysis reveals that the bottom (last-window query) region of Tri-shape improves first-turn instruction following, thus enhancing overall performance, while A-shape disrupts instruction information, leading to random outputs, as shown in Table 17.
3) KV cache compression methods generally underperform in shared-context scenarios.
4) Prompt compression methods enhance global-information tasks like many-shot ICL but degrade performance significantly on retrieval-related tasks.
Figure 6. Performance of different long-context methods across various tasks and turns. The results for multi-tasking tasks are shown in Fig. 10, and the results are averaged across all tested base LLMs.
We compare SCBench against existing long-context benchmarks in terms of the long-context capabilities assessed, the request types considered, and the implementations they adopt, as shown in Table 4.
Table 4. Comparison of Long-Context Benchmarks.
If you find this project helpful, please cite the following papers:
@inproceedings{li2025scbench,
title={{SCB}ench: A {KV} Cache-Centric Analysis of Long-Context Methods},
author={Yucheng Li and Huiqiang Jiang and Qianhui Wu and Xufang Luo and Surin Ahn and Chengruidong Zhang and Amir H. Abdi and Dongsheng Li and Jianfeng Gao and Yuqing Yang and Lili Qiu},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=gkUyYcY1W9}
}