A comprehensive analysis showing that the concentrated massive values appearing in attention Q and K matrices are mainly responsible for contextual knowledge understanding.
- [Feb 2025] We submitted the paper Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding to arXiv. It shows that the concentrated massive values appearing in attention Q and K matrices are mainly responsible for contextual knowledge understanding, and that this phenomenon originates from RoPE's effects on low-frequency channels.
- [May 2025] Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding was accepted by ICML 2025 🎉
- [May 2025] I will give a talk about Massive Values in Self-Attention Modules at NiceNLP.
- [May 2025] We will release the camera-ready version of Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding.
Our study systematically investigates massive values in LLMs' attention mechanisms. First, we observe that massive values are concentrated in low-frequency dimensions across different attention heads, and that they appear exclusively in the attention queries (Q) and keys (K) while being absent in the values (V). Through extensive experiments, we further find that these massive values are primarily responsible for understanding contextual knowledge (i.e., knowledge from the context window) rather than retrieving parametric knowledge (i.e., knowledge stored in the trained parameters). Consistent with this observation, our analysis of quantization methods shows that failing to protect massive values during quantization leads to substantially larger performance degradation on contextual knowledge understanding. Finally, we analyze the root cause of the massive values and demonstrate that they are induced by the mechanisms of RoPE (Rotary Positional Encoding), whose low-frequency dimensions are only weakly affected by position information. Our findings collectively uncover the functions of massive values in Q and K, providing valuable insights for future model design and optimization strategies.
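To make the RoPE connection concrete, here is a small illustrative sketch (added for this README, not code from the paper's experiments) that computes the standard RoPE inverse frequencies and shows that the last channels rotate very slowly over thousands of positions, i.e., they are the low-frequency dimensions that carry little position information:

```python
import torch

def rope_inv_freq(head_dim: int = 128, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies: theta_i = base^(-2i/d), i = 0..d/2-1.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

inv_freq = rope_inv_freq()
positions = torch.arange(4096)
# Rotation angle of each channel pair at every position: [seq_len, head_dim // 2]
angles = positions[:, None] * inv_freq[None, :]
print(angles[:, 0].max().item())   # fastest (high-frequency) channel: thousands of radians
print(angles[:, -1].max().item())  # slowest (low-frequency) channel: < 1 radian over 4096 tokens
```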
>> conda create -n myenv python=3.9
>> conda activate myenv
>> pip install -r requirements.txt
Set the environment variables in the .env file. For example:
OPENAI_API_KEY=""
HF_AUTH_TOKENS=""
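How these variables are consumed depends on the scripts; a minimal sketch, assuming the common python-dotenv pattern (the repository may read them differently):

```python
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads the .env file in the working directory
openai_key = os.environ["OPENAI_API_KEY"]  # used for knowledge QA synthesis
hf_token = os.environ["HF_AUTH_TOKENS"]    # used for gated Hugging Face models
```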
>> bash scripts/run_passkey.sh
[Core Parameters]:
- seq_length=128: Total length of the generated text sequence (must be ≥ 101)
- begin_pos=50: Starting position at which the passkey is inserted
- passkey_length=6: Length of the passkey to be inserted
[Data Generation Controls]:
- num_gen_example=200: Number of examples to generate
- max_data_num=200: Maximum number of examples in the final dataset
Note
To adjust the dataset size, set both num_gen_example and max_data_num to the same value. For example, to generate 300 examples, set both parameters to 300. A minimal sketch of how an example is assembled from these parameters follows the config below.
Warning
Setting seq_length below 101 will result in an error.
pos_interval=500
begin_pos=50
seq_length=128
passkey_length=6
# Dataset Parameters
DATASET_CONFIG=(
--dataset 'passkey_retrieval'
--split 'test'
--dataset_folder './synthetic_tasks'
--num_gen_example 200
--max_data_num 200
--max_generation_length 10
)
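For intuition, the following is a minimal sketch of how a single example could be assembled from these parameters. The variable names mirror the script above, but the generation logic is only illustrative; the actual dataset builder in this repository may differ.

```python
import random
import string

def make_passkey_example(seq_length=128, begin_pos=50, passkey_length=6):
    # Mirrors the script's constraint: seq_length below 101 raises an error.
    assert seq_length >= 101, "seq_length must be >= 101"
    passkey = "".join(random.choices(string.digits, k=passkey_length))
    filler = ["the grass is green"] * seq_length  # placeholder context (illustrative)
    hint = f"The passkey is {passkey}. Remember it."
    context = " ".join(filler[:begin_pos] + [hint] + filler[begin_pos:])
    return {"context": context, "question": "What is the passkey?", "answer": passkey}

print(make_passkey_example()["answer"])
```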
You can find the other datasets in the datasets folder.
The pipeline of knowledge QA data synthesis is shown in the figure below.
>> bash scripts/run_knowledge_qa.sh
The generated datasets are already available in the datasets folder.
You can also run datasets/create_knowledge_qa.py to customize the categories and number of knowledge QA pairs.
>> python datasets/create_knowledge_qa.py --category <YOUR_CATEGORY> --num_pairs <YOUR_NUM_PAIRS>
[Core Parameters]:
- category: The category of the knowledge QA pairs to generate
- num_pairs: The number of knowledge QA pairs to generate
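Because an OPENAI_API_KEY is expected in .env, the synthesis presumably calls an OpenAI model. Below is a hypothetical sketch of such a step; the prompt, model name, and output format are our assumptions, not the script's actual implementation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def synthesize_qa_pairs(category: str, num_pairs: int) -> str:
    prompt = (
        f"Generate {num_pairs} factual question-answer pairs about {category}, "
        "one pair per line, formatted as 'Q: ... | A: ...'."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; the script may use another
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(synthesize_qa_pairs("sports", 5))
```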
>> sh scripts/save_attn_map.sh
Step 1: Set pattern="save_attn" and select a language model such as meta-llama/Llama-2-7b-chat-hf
>> CUDA_VISIBLE_DEVICES=0 python llm_example_save_attn.py \
--model_name meta-llama/Llama-2-7b-chat-hf \
--pattern "$pattern" \
--round "$round"
Step 2: Choose the layers to save in different LLMs
For example, in Llama:
Find modeling_llama.py and search for the code below. Then change the line "if GLOBAL_L == 1 or GLOBAL_L == 2 or GLOBAL_L == 10:" to select the layers (GLOBAL_L) you want.
global GLOBAL_L
head_set = range(32)
if GLOBAL_L in range(32):
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    # Save the Q/K/V states only for the layers you want to inspect.
    if GLOBAL_L == 1 or GLOBAL_L == 2 or GLOBAL_L == 10:
        print(query_states.shape)
        torch.save(query_states, f"{save_dir}/q_merged_attn_weights_layer{GLOBAL_L}.pt")
        torch.save(key_states, f"{save_dir}/k_merged_attn_weights_layer{GLOBAL_L}.pt")
        torch.save(value_states, f"{save_dir}/v_merged_attn_weights_layer{GLOBAL_L}.pt")
Step 3: Use attn.ipynb or appendix_result/run.ipynb to visualize the results.
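If you prefer a plain script over the notebooks, here is a minimal sketch of the kind of inspection they perform: load a saved Q tensor and look at the per-dimension norm, where massive values show up as a few dimensions that dominate each head. The layout is assumed to be [batch, seq, heads, head_dim], matching the perturbation snippet further below; adjust the dims if your saved shapes differ.

```python
import torch

save_dir = "./attn_maps"  # set this to the save_dir used in modeling_llama.py
layer = 2                 # one of the layers enabled in Step 2

q = torch.load(f"{save_dir}/q_merged_attn_weights_layer{layer}.pt")
per_dim_norm = q.norm(dim=1).squeeze(0)            # [heads, head_dim]
top_vals, top_dims = per_dim_norm.topk(3, dim=-1)  # a few dominant dims per head
for head in range(per_dim_norm.size(0)):
    print(f"head {head}: massive dims {top_dims[head].tolist()}, "
          f"norms {top_vals[head].tolist()}")
```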
>> sh scripts/save_attn_map.sh
[Core Parameters]:
- model_name: Select the model by name, e.g., meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Llama-2-7b-chat-hf, mistralai/Mistral-7B-Instruct-v0.3, Qwen/Qwen2.5-7B-Instruct, google/gemma-2-9b-it, facebook/opt-2.7b, ai21labs/Jamba-v0.1, Qwen/Qwen2-VL-2B-Instruct
- round: Number of rounds to run; results are averaged across rounds
- pattern: The dataset to use: "city", "aqua", "imdb", "sports", "art", "cele", "long"
pattern="city"
round=2
CUDA_VISIBLE_DEVICES=0 python llm_example_save_attn.py \
--model_name meta-llama/Llama-2-7b-chat-hf \
--pattern "$pattern" \
--round "$round"
# 2>&1 | tee ./imdb_destroy.log
[How to Destroy Massive Values]: Find the file modeling_llama.py and search for the code below (if you want to use another LLM, use modeling_gemma2.py, modeling_qwen2.py, etc.):
- add_mean_perturbation = True: replace the massive values in the QK embedding vectors with their mean value.
- add_other_perturbation = True: replace the other (non-massive) positions in the QK embedding vectors with their mean value.
add_mean_perturbation = False
add_other_perturbation = False

if q_len != 1:
    if add_mean_perturbation:
        num_heads = query_states.size(2)
        # Per-head L2 norm over the sequence dimension: [heads, head_dim]
        matrix = query_states.norm(dim=1).squeeze(0)
        for head_idx in range(num_heads):
            outlier = matrix[head_idx, :]
            # Locate the dimension(s) with the largest norm, i.e., the massive values.
            values, indices = torch.topk(outlier, 1)
            top_indices = indices.tolist()
            target_vectors = [query_states[:, :, head_idx, idx] for idx in top_indices]
            mean_value = torch.mean(torch.stack([vector.mean() for vector in target_vectors]))
            # Overwrite the massive dimensions with their mean value.
            for idx in top_indices:
                query_states[:, :, head_idx, idx] = mean_value
# rest of the code
- torch.topk(outlier, 1): Some LLMs have multiple outliers; you can increase the 1 to 2, 3, 4, and so on.
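The snippet above only shows the add_mean_perturbation branch; the add_other_perturbation branch is its complement. Below is a minimal sketch of how it could look, reconstructed from the description above rather than copied from the repository, and intended to sit next to the branch shown earlier in modeling_llama.py: keep the top-k massive dimensions and overwrite every other dimension with the mean.

```python
# Complementary branch (add_other_perturbation): overwrite the non-massive
# dimensions instead of the massive ones. Reconstruction, not the repo's exact code.
if add_other_perturbation:
    num_heads = query_states.size(2)
    matrix = query_states.norm(dim=1).squeeze(0)   # [heads, head_dim]
    for head_idx in range(num_heads):
        outlier = matrix[head_idx, :]
        _, indices = torch.topk(outlier, 1)        # massive dims to keep untouched
        keep = set(indices.tolist())
        other = [i for i in range(outlier.size(0)) if i not in keep]
        other_vectors = [query_states[:, :, head_idx, idx] for idx in other]
        mean_value = torch.mean(torch.stack([v.mean() for v in other_vectors]))
        for idx in other:
            query_states[:, :, head_idx, idx] = mean_value
```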
[Core Parameters]:
- model_name: Select the model by name; we use meta-llama/Meta-Llama-3-8B-Instruct by default
- round: Number of rounds to run; results are averaged across rounds
- pattern: The dataset to use: "city", "aqua", "imdb", "sports", "art", "cele", "long"
- quantized: The quantization method to use: "awq", "smooth_quant", "gptq"
>> python run_quantization.py \
--model_name meta-llama/Meta-Llama-3-8B-Instruct \
--pattern city \
--quantized smooth_quant \
--round 1
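As a toy illustration of why unprotected massive values matter under quantization (this is not how awq, smooth_quant, or gptq are implemented), a symmetric per-tensor INT8 round trip shows that a single massive entry stretches the quantization range and inflates the error on all other entries:

```python
import torch

def int8_roundtrip(x: torch.Tensor) -> torch.Tensor:
    # Symmetric per-tensor INT8 quantization: scale by the max magnitude.
    scale = x.abs().max() / 127.0
    return torch.clamp((x / scale).round(), -127, 127) * scale

x = torch.randn(4096)
x_massive = x.clone()
x_massive[0] = 80.0  # one concentrated massive value

err_plain = (int8_roundtrip(x) - x).abs().mean()
err_massive = (int8_roundtrip(x_massive) - x_massive).abs().mean()
print(err_plain.item(), err_massive.item())  # the massive entry inflates the error
```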
If you find this code valuable, please cite our work:
@inproceedings{jin2025massive,
title={Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding},
author={Jin, Mingyu and Mei, Kai and Xu, Wujiang and Sun, Mingjie and Tang, Ruixiang and Du, Mengnan and Liu, Zirui and Zhang, Yongfeng},
booktitle={Forty-second International Conference on Machine Learning},
year={2025}
}
@article{jin2025massive,
title={Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding},
author={Jin, Mingyu and Mei, Kai and Xu, Wujiang and Sun, Mingjie and Tang, Ruixiang and Du, Mengnan and Liu, Zirui and Zhang, Yongfeng},
journal={arXiv preprint arXiv:2502.01563},
year={2025}
}