CompassJudger-1

🤗 Hugging Face   |   🤖 ModelScope   |   📑 Paper   |   🎖️ Leaderboard

👋 join us on Discord and WeChat

Introduction

The CompassJudger-1 series is a family of all-in-one judge models introduced by OpenCompass. These models not only excel at a range of evaluation methods, such as scoring and pairwise comparison, but can also output reviews with assessment details in a specified format, making them suitable for any evaluation dataset. Moreover, they can perform general tasks like a typical instruction model, serving as a versatile tool with strong generalization and judging capabilities.

  • Comprehensive Evaluation Capabilities: CompassJudger-1 can execute multiple evaluation methods, including but not limited to scoring, pairwise comparison, and detailed assessment feedback.
  • Formatted Output: Supports output in a specified format as instructed, facilitating further analysis and understanding of the evaluation results.
  • Versatility: Beyond its evaluation functions, CompassJudger-1 can also act as a general-purpose instruction model for everyday tasks. It also supports inference acceleration frameworks such as vLLM and LMDeploy (see the minimal vLLM sketch below).
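
For the acceleration path, here is a minimal vLLM sketch. It assumes a recent vLLM release that provides `LLM.chat`; the sampling settings are illustrative, not an official recommendation.

```python
# Minimal vLLM sketch (assumes vLLM is installed and provides LLM.chat).
from vllm import LLM, SamplingParams

llm = LLM(model="opencompass/CompassJudger-1-7B-Instruct")
params = SamplingParams(temperature=0.0, max_tokens=2048)  # illustrative settings
messages = [{"role": "user", "content": "Hello, can you help me to judge something?"}]
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```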

Subjective Leaderboard Judged by CompassJudger-1-32B

Note*:

  • All open-source models in the table use greedy decoding to ensure reproducibility. All results are obtained through OpenCompass.
  • The Average Score is calculated by converting the score on each dataset to a percentage scale and then taking the mean.
  • AlignBench results are for v1.1.
| Models | Average* | AlignBench* | AlpacaEvalv2 | ArenaHard | MTBench101 | WildBench | FollowBench | FoFo |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nvidia-3_1-Nemotron-70b-Instruct | 68.43 | 6.35 | 85.47 | 85.99 | 8.68 | 48.35 | 0.96 | 0.56 |
| GLM-4-plus | 68.30 | 7.00 | 74.91 | 84.30 | 8.53 | 40.85 | 0.95 | 0.67 |
| Qwen2.5-72b-Instruct | 65.59 | 6.86 | 58.26 | 86.33 | 8.52 | 38.64 | 0.92 | 0.66 |
| Qwen_Max_0919 | 65.18 | 6.99 | 63.73 | 83.16 | 8.57 | 36.96 | 0.93 | 0.58 |
| GPT4o-20240513 | 65.14 | 6.94 | 56.77 | 83.89 | 8.48 | 34.88 | 0.93 | 0.66 |
| Deepseek-v2_5 | 63.92 | 6.72 | 60.99 | 76.23 | 8.30 | 32.82 | 0.94 | 0.64 |
| GPT4-1106 | 61.91 | 6.61 | 51.30 | 82.10 | 8.41 | 21.50 | 0.92 | 0.59 |
| claude-3-5-sonnet-202410 | 61.56 | 6.60 | 42.24 | 84.83 | 8.40 | 23.80 | 0.92 | 0.62 |
| GPT4o-20240806 | 61.46 | 6.86 | 36.77 | 80.81 | 8.36 | 23.91 | 0.94 | 0.66 |
| GPT-4o-mini-2024-07-18 | 61.26 | 6.26 | 45.09 | 77.75 | 8.45 | 30.23 | 0.90 | 0.65 |
| Qwen2.5-32b-Instruct | 59.85 | 6.78 | 35.53 | 76.82 | 8.49 | 22.46 | 0.94 | 0.59 |
| Mixtral-Large-Instruct-2407 | 59.25 | 6.46 | 38.51 | 76.62 | 8.47 | 29.08 | 0.90 | 0.55 |
| Yi-Large | 58.06 | 6.28 | 51.06 | 67.87 | 8.11 | 17.61 | 0.91 | 0.52 |
| Qwen2.5-14b-Instruct | 57.91 | 6.56 | 33.66 | 71.00 | 8.43 | 23.42 | 0.92 | 0.55 |
| Llama-3_1-405b-Instruct-FP8 | 57.02 | 6.09 | 35.03 | 70.73 | 8.35 | 19.95 | 0.90 | 0.56 |
| Ernie-4.0-turbo-8k-preview | 56.79 | 6.98 | 41.74 | 66.09 | 8.24 | 15.35 | 0.94 | 0.43 |
| Gemma-2-27b-Instruct | 54.83 | 5.86 | 42.86 | 59.27 | 8.39 | 17.01 | 0.92 | 0.44 |
| Llama-3_1-70b-Instruct | 53.59 | 4.87 | 40.62 | 60.75 | 8.46 | 15.07 | 0.87 | 0.50 |
| Mistral-small-Instruct-2409 | 53.01 | 5.52 | 36.40 | 61.98 | 8.28 | 20.45 | 0.83 | 0.45 |
| Baichuan4 | 52.80 | 6.04 | 35.40 | 51.44 | 7.98 | 14.10 | 0.91 | 0.47 |
| Qwen2.5-7b-Instruct | 52.54 | 6.16 | 32.42 | 54.72 | 8.37 | 15.69 | 0.86 | 0.45 |
| Doubao_pro_32k_240828 | 52.47 | 6.86 | 19.25 | 58.17 | 8.31 | 7.55 | 0.90 | 0.47 |
| Minimax-abab6_5s_chat | 50.32 | 6.62 | 32.80 | 57.87 | 6.85 | 11.65 | 0.89 | 0.33 |
| Gemma-2-9b-Instruct | 49.34 | 5.67 | 34.91 | 44.59 | 8.33 | 8.38 | 0.85 | 0.36 |
| Yi-1.5-9b-chat | 47.97 | 5.57 | 38.76 | 39.04 | 7.89 | 9.75 | 0.84 | 0.33 |
| Ministral-8B-instruct-2410 | 47.92 | 5.07 | 28.07 | 44.84 | 8.09 | 3.68 | 0.79 | 0.48 |
| GLM-4-9b-chat | 47.41 | 5.84 | 21.61 | 35.93 | 8.14 | 2.95 | 0.88 | 0.43 |
| Llama-3_1-8b-instruct | 44.03 | 4.66 | 24.10 | 31.33 | 8.18 | -2.24 | 0.83 | 0.37 |
| Internlm2_5-7b-chat | 42.97 | 5.65 | 25.96 | 17.68 | 8.01 | -12.96 | 0.80 | 0.40 |
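
To make the Average Score rule from the notes concrete: each dataset score is mapped to a 0-100 scale, then the plain mean is taken. The scores and scale factors in this small sketch are assumptions for illustration only; the authoritative conversion is implemented inside OpenCompass.

```python
# Illustration of the Average Score rule: normalize each score to 0-100, then average.
# Scores and scale factors here are hypothetical, not taken from the table above.
raw = {"AlignBench": 6.5, "MTBench101": 8.4, "FollowBench": 0.92}
to_percent = {
    "AlignBench": lambda s: s * 10,    # assumed 0-10 scale
    "MTBench101": lambda s: s * 10,    # assumed 0-10 scale
    "FollowBench": lambda s: s * 100,  # assumed 0-1 scale
}
average = sum(to_percent[name](score) for name, score in raw.items()) / len(raw)
print(f"Average: {average:.2f}")  # (65.0 + 84.0 + 92.0) / 3 = 80.33
```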

Quick Start

The following code shows how to load the tokenizer and model, and how to generate content.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opencompass/CompassJudger-1-7B-Instruct"

# Load the model and tokenizer (dtype and device placement are chosen automatically)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = """your prompt"""

messages = [
    {"role": "user", "content": prompt}
]

# Apply the chat template and tokenize
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then strip the prompt tokens from the output
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

We also provide examples for different usage scenarios:

General Chat

**Input**: Hello, can you help me to judge something?

**Output**: Of course! I'd be happy to help you make a judgment or provide any assistance you need. Please tell me what you're looking to evaluate or understand.

Judge as Reward Model

**Input**: ```Please read the dialogue between the two assistants and the user to determine which assistant performed better during the conversation. Here is the dialogue content:
[Dialogue Begin]
User: What is a 5-letter word that starts with the letter \"A\" and contains the letters \"D\", \"R\", and \"O\" where \"D\" is not the second letter?
Assistant A: Aardvark.
Assistant B: The word that meets the given criteria is \"adroit\".
User: \"D\" shouldn't be the second letter and the word must be a 5-letter word.
Assistant A: Aardvark.
Assistant B: I apologize for the confusion. A 5-letter word that starts with the letter \"A\" and contains the letters \"D\", \"R\", and \"O\" where \"D\" is not the second letter is \"ardor\".
[Dialogue End]
If you believe Assistant A performed better, please output A directly.\nIf you believe Assistant B performed better, please output B directly.\nDo not output any other content, just the option. Please output:```

**Output**: B
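
To use the judge programmatically in this reward-model style, the prompt format above can be wrapped in a small helper. This is a hypothetical sketch (not part of the repo) that reuses the `model` and `tokenizer` objects from the Quick Start snippet:

```python
# Hypothetical helper: format a pairwise dialogue and return the judge's A/B verdict.
def judge_pair(dialogue: str) -> str:
    prompt = (
        "Please read the dialogue between the two assistants and the user to determine "
        "which assistant performed better during the conversation. Here is the dialogue content:\n"
        f"[Dialogue Begin]\n{dialogue}\n[Dialogue End]\n"
        "If you believe Assistant A performed better, please output A directly.\n"
        "If you believe Assistant B performed better, please output B directly.\n"
        "Do not output any other content, just the option. Please output:"
    )
    messages = [{"role": "user", "content": prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=8)  # the verdict is a single letter
    return tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True).strip()
```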

Point-wise Judge

**Input**: ```You are an assistant skilled at evaluating text quality.\nPlease act as an impartial judge and assess the quality of an AI assistant's response to a user question. Since the type of response you are evaluating is role-play, you need to assess it along the following dimensions:\n1. Factual correctness: whether the information given in the response is accurate and based on reliable facts and data.\n2. Meeting user needs: whether the response satisfies the purpose and needs behind the user's question and addresses it fully and appropriately.\n3. Logical coherence: whether the response stays consistent as a whole and remains logically coherent across its parts, avoiding self-contradiction.\n4. Creativity: whether the response is innovative or unique, offering novel insights or solutions.\n5. Richness: whether the response contains rich information, depth, contextual consideration, diversity, detailed explanations, and examples, so as to meet user needs and provide a comprehensive understanding.\nWe will provide you with the user's question, a high-quality reference answer, and the AI assistant's answer that you need to evaluate. When you start your evaluation, you must follow this process:\n1. Compare the AI assistant's answer with the reference answer, point out the shortcomings of the AI assistant's answer, and explain them further.\n2. Evaluate the AI assistant's answer along the different dimensions; after the evaluation of each dimension, give it a score from 1 to 10.\n3. Finally, combining the assessments across dimensions, give the AI assistant's answer an overall score from 1 to 10.\n4. Your scoring should be as strict as possible and follow these rules: in general, the higher the quality of the model's response, the higher the score. Factual correctness and meeting user needs are the two most important dimensions, and their scores dominate the final overall score. If the model's response is irrelevant to the question, contains fundamental factual errors, or generates harmful content, the overall score must be 1 to 2; if the model's response has no serious errors and is basically harmless but is of low quality and fails to meet user needs, the overall score is 3 to 4; if the model's response basically meets the user's requirements but performs poorly on some dimensions and is of medium quality, the overall score can be 5 to 6; if the model's response is close to the reference answer in quality and performs well on all dimensions, the overall score is 7 to 8; only if the model's response clearly exceeds the reference answer in quality, fully solves the user's problem and all needs, and is close to a perfect score on all dimensions can it get 9 to 10. As an example, the reference answer would receive a score of 8.\nRemember, you must evaluate and explain before scoring. After your explanation of each dimension, add the score for that dimension. Then, at the end of your answer, return all of your scores in the following dictionary format (including the brackets), and make sure your scores are integers:\n{'Dimension 1': score, 'Dimension 2': score, ..., 'Overall score': score}, for example: {'Factual correctness': 9, 'Meeting user needs': 6, ..., 'Overall score': 7}.\nUser's question: I want you to act as a chat window for chatting with a 20-year-old girl. I will type the chat content, and you should output only what a 20-year-old girl might reply, without any extra text. What hobbies do you usually have?\n[Reference answer begins]\nI like reading, especially novels and poetry, and I also enjoy listening to music and painting.\n[Reference answer ends]\n[Assistant's answer begins]\nI like listening to music and watching movies. How about you?\n[Assistant's answer ends]```

**Output**: ```Comparing the AI assistant's answer with the reference answer and pointing out its shortcomings: the AI assistant's answer offers two hobbies, listening to music and watching movies, and ends with a question back to the user about their hobbies. Compared with the reference answer, the AI assistant's response is rather simple in content and does not provide enough information to demonstrate richness or creativity. Moreover, the "reading" and "painting" mentioned in the reference answer show more diverse and personalized hobbies than just listening to music and watching movies. Evaluating the AI assistant's answer along the different dimensions:
    Factual correctness: in this context, factual correctness is not a primary evaluation criterion, because the answer responds to a hypothetical scenario. However, the AI assistant's answer contains no inaccurate information, so it is appropriate on this dimension. Score: 8.
    Meeting user needs: the user wants a reply that a 20-year-old girl might give. The AI assistant's answer provides some hobbies, but the content is rather simple and does not fully deliver the richness and diversity the user may expect. Score: 5.
    Logical coherence: the AI assistant's answer is logically coherent, offering a reasonable hobby and ending the exchange in a natural way. Score: 8.
    Creativity: the AI assistant's answer lacks creativity; the hobbies offered are fairly common, showing no novel or unique insight. Score: 4.
    Richness: the AI assistant's answer is rather short and lacks depth and detail, failing to provide diverse hobbies or detailed explanations, so it performs poorly on richness. Score: 4.
    Combining the assessments across dimensions, give the AI assistant's answer an overall score from 1 to 10.
Overall, the AI assistant's answer basically meets the user's needs but lacks richness and creativity, and there is still a gap compared with the reference answer. Overall score: 5.
{'Factual correctness': 8, 'Meeting user needs': 5, 'Logical coherence': 8, 'Creativity': 4, 'Richness': 4, 'Overall score': 5}```
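
Since the point-wise prompt asks for a trailing score dictionary, the response can be post-processed mechanically. Here is a hypothetical parsing sketch; the regex assumes the dictionary format requested in the prompt above.

```python
import ast
import re

# Hypothetical post-processing: pull the final score dictionary out of a
# point-wise judge response and parse it into a Python dict.
def parse_scores(response: str) -> dict:
    match = re.search(r"\{[^{}]*\}\s*$", response.strip())
    if match is None:
        raise ValueError("no score dictionary found in response")
    return ast.literal_eval(match.group(0))
```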

Pair-wise Judge

**Input**: ```Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.\n\nBegin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.\n\nWhen evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.\n\nThen consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.\n\nThen consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.\n\nAfter providing your explanation, you must output only one of the following choices as your final verdict with a label:\n\n1. Assistant A is significantly better: [[A>>B]]\n2. Assistant A is slightly better: [[A>B]]\n3. Tie, relatively the same: [[A=B]]\n4. Assistant B is slightly better: [[B>A]]\n5. Assistant B is significantly better: [[B>>A]]\n\nExample output: \"My final verdict is tie: [[A=B]]\".\n<|User Prompt|>\nUse ABC notation to write a melody in the style of a folk tune.\n\n<|The Start of Assistant A's Answer|>\nX:1\nT:Untitled Folk Tune\nM:4/4\nL:1/8\nK:G\n|:G2A2|B2A2|G2E2|D4|E2F2|G2F2|E2C2|B,4|\nA2B2|c2B2|A2F2|E4|D2E2|F2E2|D2B,2|C4:|\n<|The End of Assistant A's Answer|>\n\n<|The Start of Assistant B's Answer|>\nSure, I'll create a simple folk tune in ABC notation for you. ABC notation is a simple way to write down music using letters to represent notes and various symbols to represent timing and other musical elements. Here's a short, cheerful folk tune:\n\n\nX:1\nT:Folk Tune\nM:4/4\nL:1/4\nK:G\nGABc dedB|ABAF GFE2|GABc dedB|1 ABAG FGA2:|2 ABAG FGAB|\nc2ec B2dB|c2ec BAGB|c2ec B2dB|1 AGFE D3B:|2 AGFE D3z|]\n\n\nHere's a breakdown of what each part means:\n\n- `X:` The index number of the tune.\n- `T:` The title of the tune.\n- `M:` The meter, in this case, 4/4 time.\n- `L:` The default note length, in this case, 1/4.\n- `K:` The key of the tune, in this case, G major.\n\nThe tune itself is written in two parts, each ending with a repeat mark (`:|`). The first part (`GABc dedB|ABAF GFE2|...`) is repeated with a different ending (`1 ABAG FGA2:|2 ABAG FGAB|`) to add variety. The second part (`c2ec B2dB|c2ec BAGB|...`) also has a repeat with a different ending, and it ends with a long note (`D3B` or `D3z`) to give it a sense of finality.\n\nThis tune is designed to be simple and catchy, with a mix of stepwise motion and skips to give it a lively folk feel. You can play it on any melodic instrument, and it's suitable for singing as well if you add lyrics.\n<|The End of Assistant B's Answer|>\n```

**Output**: ```To generate a melody in the style of a folk tune using ABC notation, I would create a simple and repetitive melody, typical of folk music. Here is my attempt: X:1 T:Folk Tune Example M:4/4 L:1/8 K:D
|: D2 E2 | F2 E2 | D2 A2 | B4 |
|  A2 B2 | c2 B2 | A2 F2 | E4 :|
|  D2 E2 | f2 e2 | d2 A2 | G4 |
|  F2 G2 | A2 G2 | F2 D2 | E4 :|
:|
Explanation: `X:` is the tune's index number. `T:` is the title of the tune. `M:` specifies the meter (4/4 time). `L:` sets the default note length to 1/8. `K:` indicates the key (D major).
The melody is constructed with simple intervals and repetitive phrases, which are characteristic of folk music. The tune has a clear structure with two verses and a repeat sign (`:`) to indicate repetition.
Now, let's evaluate the responses from Assistant A and Assistant B.
Assistant A's Answer: The melody provided by Assistant A is concise and follows the standard ABC notation. The choice of notes and structure is simple, which is fitting for a folk tune. The use of repetition and simple progression is consistent with folk music characteristics. The melody is clear and could be easily played by a musician.
Assistant B's Answer: Assistant B provides a detailed explanation of ABC notation, which is helpful for context. The melody includes more complex elements, such as varied repeat sections and additional symbols like `1` and `2`, which might not be typical for a straightforward folk tune. The explanation about the structure and the breakdown of each part is informative and adds value to the response. However, the use of `1` and `2` in the melody structure is unconventional and might confuse someone unfamiliar with ABC notation.
Evaluation: Both assistants provide a melody in ABC notation. However, Assistant B's response includes unnecessary complexity for a folk tune and might be confusing due to unconventional use of symbols. Assistant A's response is more straightforward and adheres closely to the expectations of a folk tune in ABC notation.
My final verdict is that Assistant A's answer is slightly better due to its simplicity and directness in creating a folk tune: [[A>B]].```
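
The pair-wise prompt constrains the final verdict to one of five bracketed labels, so it can also be extracted mechanically. A hypothetical sketch:

```python
import re

# Hypothetical post-processing: extract the [[...]] verdict label from a
# pair-wise judge response.
def parse_verdict(response: str) -> str:
    match = re.search(r"\[\[(A>>B|A>B|A=B|B>A|B>>A)\]\]", response)
    return match.group(1) if match else "unknown"

print(parse_verdict("My final verdict is ... [[A>B]]."))  # prints: A>B
```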

Response Critique

**Input**: ```Please help me give suggestions for revising the answer below. User question: Write a poem about autumn without using the words "fallen leaves", but in a way that still conveys the scene of falling leaves. Model answer: The autumn wind rises, the sky turns yellow; the drifting ones hurry from the branch.
The lush green of old has grown late; the years drift on, and who will speak of parting's sorrow?
Slow steps, long shadow, sparse treetops against the setting sun.
Past splendor has all passed before the eyes; now lonely, it covers the old roadside.
The river of years, quiet and still; the light boat has passed ten thousand mountains.
That summer, laughter rang; now all that remains is the lingering scent of the fallen leaves.```

**Output**: ```This poem is very evocative and, on the whole, conveys the autumn atmosphere and the sense of parting and change associated with falling leaves well. However, the user's request specifically stressed "do not use the words 'fallen leaves', but the scene of falling leaves should still come through." Based on this, we can adjust the poem to fit the user's needs more closely. Here are some suggestions:
    "The autumn wind rises, the sky turns yellow; the drifting ones hurry from the branch."
    could be changed to: "The autumn wind rises, the sky turns yellow; dancing in the air, they hurry from the branch."
    This avoids using the word "drifting" directly while still conveying leaves floating down from the tree.
    "The lush green of old has grown late; the years drift on, and who will speak of parting's sorrow?"
    could be changed to: "The lush green of old has grown late; the years stretch long, and the story falls silent."
    After this change, parting's sorrow is no longer stated outright, yet one can still feel the passage of time and the quiet left behind by faded splendor.
    "Slow steps, long shadow, sparse treetops against the setting sun."
    This line already paints the stillness and solitude of dusk well and can stay unchanged.
    "Past splendor has all passed before the eyes; now lonely, it covers the old roadside."
    could be changed to: "Past splendor has become memory; now lonely, the old road reflects the setting sun."
    This change expresses the former splendor while highlighting the present loneliness and desolation.
    "The river of years, quiet and still; the light boat has passed ten thousand mountains."
    This line can stay unchanged.
    "That summer, laughter rang; now all that remains is the lingering scent of the fallen leaves."
    could be changed to: "That summer, laughter rang; now all that remains is a slow, lingering scent."
    This avoids mentioning "fallen leaves" directly while keeping the autumn sentiment and the sense of the years' passing.
The revised poem reads:
The autumn wind rises, the sky turns yellow; dancing in the air, they hurry from the branch.
The lush green of old has grown late; the years stretch long, and the story falls silent.
Slow steps, long shadow, sparse treetops against the setting sun.
Past splendor has become memory; now lonely, the old road reflects the setting sun.
The river of years, quiet and still; the light boat has passed ten thousand mountains.
That summer, laughter rang; now all that remains is a slow, lingering scent.
This revised poem still depicts the scenery of autumn and the feelings the season stirs, while cleverly avoiding the word "fallen leaves" and conveying the scene of falling leaves through other imagery.```

JudgerBench

We have also established a new benchmark named JudgerBench, which aims to standardize the evaluation of judge models and thereby help identify more effective evaluator models. To test your own judge model on JudgerBench with OpenCompass, change the models in configs/eval_judgerbench.py to your own models, then run:

```bash
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
python run.py configs/eval_judgerbench.py --mode all --reuse latest
```
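
As a rough sketch of what "change the models" means, a judge entry in that config might look like the following. The surrounding structure of configs/eval_judgerbench.py is assumed here rather than copied from the repo; the entry mirrors the judge_models block shown later in this README.

```python
# Hypothetical model entry for configs/eval_judgerbench.py; replace abbr/path
# with your own judge model. Mirrors the judge_models block later in this README.
from opencompass.models import TurboMindModelwithChatTemplate

models = [dict(
    type=TurboMindModelwithChatTemplate,
    abbr='my-judge-model',             # hypothetical name
    path='your-org/your-judge-model',  # your checkpoint
    engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
    gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
    max_seq_len=16384,
    max_out_len=2048,
    batch_size=16,
    run_cfg=dict(num_gpus=1),
)]
```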

We also provide a leaderboard for JudgerBench: https://huggingface.co/spaces/opencompass/judgerbench_leaderboard

If you want to add your model to this leaderboard, feel free to open an issue in this repository.

Use CompassJudger-1 to Test Subjective Datasets in OpenCompass

If you wish to evaluate common subjective datasets with CompassJudger-1 in OpenCompass, take the evaluation of AlignBench as an example and follow the code below.

You need to set up three items first:

  • datasets (the subjective datasets you want to test)
  • models (the models you want to test on the subjective datasets)
  • judge_models (the judge models you want to use as evaluators)

For more settings, please refer to the advanced guidance in OpenCompass.

```python
from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_1_5b_instruct import models as lmdeploy_qwen2_5_1_5b_instruct 
from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI, TurboMindModelwithChatTemplate
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import SubjectiveSummarizer

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ]
)

# -------------Inference Stage ----------------------------------------
models = [*lmdeploy_qwen2_5_1_5b_instruct] # add models you want
datasets = [*alignbench_datasets] # add datasets you want


infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=OpenICLInferTask)),
)
# -------------Evaluation Stage ----------------------------------------

## ------------- JudgeLLM Configuration
judge_models = [dict(
        type=TurboMindModelwithChatTemplate,
        abbr='CompassJudger-1-7B-Instruct',
        path='opencompass/CompassJudger-1-7B-Instruct',
        engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
        gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
        max_seq_len=16384,
        max_out_len=2048,
        batch_size=16,
        run_cfg=dict(num_gpus=1),
    )]

## ------------- Evaluation Configuration
eval = dict(
    partitioner=dict(type=SubjectiveNaivePartitioner, models=models, judge_models=judge_models,),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(type=SubjectiveSummarizer, function='subjective')
work_dir = 'outputs/subjective/'
```

Then run:

```bash
python run.py configs/eval_subjective.py --mode all --reuse latest
```

For more detailed subjective evaluation guidelines, please refer to: https://github.com/open-compass/opencompass/blob/main/docs/en/advanced_guides/subjective_evaluation.md

Subjective Evaluation Leaderboard by CompassJudger-1

To facilitate better comparisons within the community, we have tested the subjective performance of some models using CompassJudger-1.

See: https://huggingface.co/spaces/opencompass/judgerbench_leaderboard

If you want to add your model to this leaderboard, feel free to open an issue in this repository.

Citation

```bibtex
@article{cao2024compass,
  title={CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution},
  author={Maosong Cao and Alexander Lam and Haodong Duan and Hongwei Liu and Songyang Zhang and Kai Chen},
  journal={arXiv preprint arXiv:2410.16256},
  year={2024}
}
```

Acknowledgements
