🍊 Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding

A comprehensive analysis showing that the concentrated massive values appearing in the attention Q and K matrices are mainly responsible for contextual knowledge understanding.

📃 Paper | 🌐 Website

🆕 News

❗ Abstract

Our study systematically investigates massive values in LLMs' attention mechanisms. First, we observe that massive values are concentrated in low-frequency dimensions across different attention heads, and that they appear exclusively in the attention queries (Q) and keys (K) while being absent in the values (V). Second, through extensive experiments, we find that these massive values are more responsible for understanding contextual knowledge (i.e., knowledge from the context window) than for retrieving parametric knowledge (i.e., knowledge stored in the trained parameters). Further analysis of quantization methods shows that failing to protect massive values during quantization leads to more substantial performance degradation in contextual knowledge understanding, consistent with this observation. In addition, we analyze the root cause of the massive values and demonstrate that they are induced by the mechanisms of RoPE (Rotary Positional Encoding), under which the low-frequency dimensions are less affected by position information. Our findings collectively uncover the functions of massive values in Q and K, providing valuable insights for future model design and optimization strategies.
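
The link between RoPE and the low-frequency dimensions can be made concrete with a short, self-contained sketch (illustrative only, not repository code; head_dim=128 and base=10000 are the Llama-2 defaults): RoPE rotates each pair of Q/K dimensions at a frequency base^(-2i/d), so the last pairs rotate very slowly and carry little positional information.

import torch

# Illustrative sketch of the RoPE frequency schedule (not repository code).
# Assumed defaults: head_dim = 128 and base = 10000, as in Llama-2.
head_dim = 128
base = 10000.0

# One rotation frequency per pair of dimensions: theta_i = base^(-2i / head_dim).
inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)

# Early pairs rotate quickly with position (high frequency); the last pairs
# barely rotate (low frequency), which is where the massive values in Q and K
# concentrate according to the analysis above.
print("fastest rotation per token:", inv_freq[0].item())   # 1.0
print("slowest rotation per token:", inv_freq[-1].item())  # ~1.2e-4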

📝 How to Run the Code

🤗 1. Environment Setting

>> conda create -n myenv python=3.9
>> conda activate myenv
>> pip install -r requirements.txt

Set the environment variables in the .env file. For example:

OPENAI_API_KEY=""
HF_AUTH_TOKENS=""
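
How the variables are read depends on the scripts; a typical loading pattern looks like the sketch below (it assumes the python-dotenv package, which may differ from what the repository actually uses).

import os

from dotenv import load_dotenv  # assumption: python-dotenv is available

# Read the key/value pairs from .env into the process environment.
load_dotenv()

openai_api_key = os.environ.get("OPENAI_API_KEY", "")
hf_auth_token = os.environ.get("HF_AUTH_TOKENS", "")

if not hf_auth_token:
    raise RuntimeError("HF_AUTH_TOKENS is empty; gated models cannot be downloaded.")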

📊 2. Passkey Retrieval Data Synthesis

>> bash scripts/run_passkey.sh 

Passkey Retrieval Data Synthesis Parameters:

[Core Parameters]:

  • seq_length=128: Total length of the generated text sequence (must be ≥ 101)
  • begin_pos=50: Starting position for passkey insertion
  • passkey_length=6: Length of the passkey to be inserted

[Data Generation Controls]:

  • num_gen_example=200: Number of examples to generate
  • max_data_num=200: Maximum number of examples in the final dataset

Note

To adjust the dataset size, both num_gen_example and max_data_num should be set to the same value. For example, to generate 300 examples, set both parameters to 300.

Warning

Setting seq_length below 101 will result in an error.

pos_interval=500
begin_pos=50
seq_length=128
passkey_length=6
# Dataset Parameters
DATASET_CONFIG=(
    --dataset 'passkey_retrieval'
    --split 'test'
    --dataset_folder './synthetic_tasks'
    --num_gen_example 200
    --max_data_num 200
    --max_generation_length 10
)
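
For intuition only, the sketch below shows how one passkey-retrieval example could be assembled from these parameters; the filler sentence and question wording are made up, so refer to the scripts and the datasets folder for the actual generation logic.

import random

# Hypothetical sketch of a single passkey-retrieval example (not the
# repository's generation code; filler text and question wording are invented).
seq_length = 128      # length of the filler context, must be >= 101
begin_pos = 50        # position at which the passkey sentence is inserted
passkey_length = 6    # number of digits in the passkey

passkey = "".join(random.choice("0123456789") for _ in range(passkey_length))
filler = ["The sky is blue."] * seq_length            # stand-in filler sentences
needle = f"The pass key is {passkey}. Remember it."

context = " ".join(filler[:begin_pos] + [needle] + filler[begin_pos:])
example = {"input": context + " What is the pass key?", "target": passkey}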

You can find the other datasets in the datasets folder.

📊 3. Knowledge QA Data Synthesis

The pipeline of knowledge QA data synthesis is shown in the figure below.

[Figure: Knowledge QA data synthesis pipeline]

>> bash scripts/run_knowledge_qa.sh 

The generated datasets are already available in the datasets folder.

You can also run datasets/create_knowledge_qa.py to customize the categories and the number of knowledge QA pairs.

>> python datasets/create_knowledge_qa.py --category <YOUR_CATEGORY> --num_pairs <YOUR_NUM_PAIRS>

[Core Parameters]:

  • category: The category of the generated knowledge QA pairs
  • num_pairs: The number of knowledge QA pairs to generate
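
To generate several categories in one go, you can loop over the script; the driver below is a sketch, and the category names in it are examples only.

import subprocess

# Hypothetical batch driver around datasets/create_knowledge_qa.py.
# The categories listed here are examples, not an official list.
categories = ["city", "sports", "art"]
num_pairs = 100

for category in categories:
    subprocess.run(
        [
            "python", "datasets/create_knowledge_qa.py",
            "--category", category,
            "--num_pairs", str(num_pairs),
        ],
        check=True,
    )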

🎯 4. Get the Embedding Vectors in Different LLMs

>>  sh scripts/save_attn_map.sh 

Step 1: Set pattern="save_attn" and select a language model such as meta-llama/Llama-2-7b-chat-hf.

>> CUDA_VISIBLE_DEVICES=0 python llm_example_save_attn.py \
    --model_name meta-llama/Llama-2-7b-chat-hf \
    --pattern "$pattern" \
    --round "$round"

Step 2: Choose the layers to save in different LLMs

For example, in Llama 🦙:

Find modeling_llama.py and search for the code below. Then edit the line if GLOBAL_L == 1 or GLOBAL_L == 2 or GLOBAL_L == 10: to select the layers GLOBAL_L you want to save.

global GLOBAL_L  # layer index of the current decoder layer, maintained elsewhere in modeling_llama.py

head_set = range(32)  # all 32 attention heads of Llama-2-7B

if GLOBAL_L in range(32):
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    # Save the Q/K/V states only for the selected layers (here layers 1, 2, and 10).
    if GLOBAL_L == 1 or GLOBAL_L == 2 or GLOBAL_L == 10:
        print(query_states.shape)
        torch.save(query_states, f"{save_dir}/q_merged_attn_weights_layer{GLOBAL_L}.pt")
        torch.save(key_states, f"{save_dir}/k_merged_attn_weights_layer{GLOBAL_L}.pt")
        torch.save(value_states, f"{save_dir}/v_merged_attn_weights_layer{GLOBAL_L}.pt")

Step 3: Use attn.ipynb or appendix_result/run.ipynb to visualize the results.
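
If you want a quick look outside the notebooks, the sketch below loads one of the saved query tensors and reports the largest-norm dimension per head; the directory name is a placeholder for whatever save_dir is set to in modeling_llama.py, and the (batch, seq_len, num_heads, head_dim) layout is an assumption.

import torch

# Sketch: locate the massive-value dimension of each head in a saved tensor.
# Replace save_dir with the directory configured in modeling_llama.py; the
# (batch, seq_len, num_heads, head_dim) layout is assumed, with batch size 1.
save_dir = "./attn_maps"
layer = 2

q = torch.load(f"{save_dir}/q_merged_attn_weights_layer{layer}.pt", map_location="cpu")

# Norm over the sequence dimension gives one magnitude per (head, dim) pair.
per_dim_norm = q.norm(dim=1).squeeze(0)        # (num_heads, head_dim)
top_vals, top_dims = per_dim_norm.max(dim=-1)  # massive-value dimension per head

for head, (dim, val) in enumerate(zip(top_dims.tolist(), top_vals.tolist())):
    print(f"head {head:2d}: largest-norm dimension {dim} (norm {val:.2f})")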

🔬 5. Massive Value Disruption: Replicating Contextual Knowledge Understanding Experiments


>>  sh scripts/save_attn_map.sh 

[Core Parameters]:

  • model_name: The model to use, e.g. meta-llama/Meta-Llama-3-8B-Instruct, meta-llama/Llama-2-7b-chat-hf, mistralai/Mistral-7B-Instruct-v0.3, Qwen/Qwen2.5-7B-Instruct, google/gemma-2-9b-it, facebook/opt-2.7b, ai21labs/Jamba-v0.1, or Qwen/Qwen2-VL-2B-Instruct
  • round: How many rounds to run; results are averaged over the rounds
  • pattern: Which dataset to use: "city", "aqua", "imdb", "sports", "art", "cele", or "long"
pattern="city"
round=2

CUDA_VISIBLE_DEVICES=0 python llm_example_save_attn.py \
    --model_name meta-llama/Llama-2-7b-chat-hf \
    --pattern "$pattern" \
    --round "$round"
    # 2>&1 | tee ./imdb_destroy.log

[How to destroy massive values]: Find the file modeling_llama.py and search for the code below (if you want to use another LLM, edit modeling_gemma2.py, modeling_qwen2.py, etc.):

  • add_mean_perturbation = True: replaces the massive values in the Q/K embedding vectors with their mean value.
  • add_other_perturbation = True: replaces the other (non-massive) positions in the Q/K embedding vectors with their mean value.
add_mean_perturbation = False
add_other_perturbation = False
if q_len != 1:  # skip single-token decoding steps
    if add_mean_perturbation == True:
        num_heads = query_states.size(2)
        # Per-head norm over the sequence dimension: shape (num_heads, head_dim).
        matrix = query_states.norm(dim=1).squeeze(0)
        for head_idx in range(num_heads):
            outlier = matrix[head_idx, :]
            # Pick the dimension(s) with the largest norm, i.e. the massive values.
            values, indices = torch.topk(outlier, 1)
            top_indices = indices.tolist()
            target_vectors = [query_states[:, :, head_idx, idx] for idx in top_indices]
            # Overwrite the massive-value dimension(s) with their mean value.
            mean_value = torch.mean(torch.stack([vector.mean() for vector in target_vectors]))
            for idx in top_indices:
                query_states[:, :, head_idx, idx] = mean_value
                # rest of the code
  • torch.topk(outlier, 1): Some LLMs have more than one outlier dimension; you can increase the 1 to 2, 3, 4, and so on. A standalone sketch of this perturbation is given below.
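
For reference, the same mean perturbation can be written as a standalone helper; this is a sketch that assumes the (batch=1, seq_len, num_heads, head_dim) layout used in the snippet above.

import torch

def perturb_massive_values(states: torch.Tensor, top_k: int = 1) -> torch.Tensor:
    """Sketch of the mean perturbation above. Assumes states has shape
    (1, seq_len, num_heads, head_dim). For each head, the top_k dimensions
    with the largest norm over the sequence are overwritten with their mean."""
    states = states.clone()
    num_heads = states.size(2)
    per_dim_norm = states.norm(dim=1).squeeze(0)      # (num_heads, head_dim)
    for head_idx in range(num_heads):
        _, top_dims = torch.topk(per_dim_norm[head_idx], top_k)
        target = states[:, :, head_idx, top_dims]     # (1, seq_len, top_k)
        states[:, :, head_idx, top_dims] = target.mean()
    return states

Applying this helper to query_states (and analogously to key_states) mimics the add_mean_perturbation branch shown above.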

6. Quantization Experiments

[Core Parameters]:

  • model_name: The model to use; we use meta-llama/Meta-Llama-3-8B-Instruct by default
  • round: How many rounds to run; results are averaged over the rounds
  • pattern: Which dataset to use: "city", "aqua", "imdb", "sports", "art", "cele", or "long"
  • quantized: Which quantization method to use: "awq", "smooth_quant", or "gptq"
>> python run_quantization.py \
    --model_name meta-llama/Meta-Llama-3-8B-Instruct \
    --pattern city \
    --quantized smooth_quant \
    --round 1
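
To sweep all three quantization methods over one dataset, a small driver like the sketch below works; the flags mirror the command above, while the pattern and round values are example choices.

import subprocess

# Sketch: run the quantization experiment once per supported method.
methods = ["awq", "smooth_quant", "gptq"]

for method in methods:
    subprocess.run(
        [
            "python", "run_quantization.py",
            "--model_name", "meta-llama/Meta-Llama-3-8B-Instruct",
            "--pattern", "city",
            "--quantized", method,
            "--round", "1",
        ],
        check=True,
    )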

Citation

If you find this code valuable, please cite our work:

@inproceedings{jin2025massive,
  title={Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding},
  author={Jin, Mingyu and Mei, Kai and Xu, Wujiang and Sun, Mingjie and Tang, Ruixiang and Du, Mengnan and Liu, Zirui and Zhang, Yongfeng},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025}
}
@article{jin2025massive,
  title={Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding},
  author={Jin, Mingyu and Mei, Kai and Xu, Wujiang and Sun, Mingjie and Tang, Ruixiang and Du, Mengnan and Liu, Zirui and Zhang, Yongfeng},
  journal={arXiv preprint arXiv:2502.01563},
  year={2025}
}
