🤗 Hugging Face  |   🤖 ModelScope  |   📑 Paper   |   🎖️ Leaderboard
👋 Join us on Discord and WeChat
The CompassJudger-1 series is a family of all-in-one judge models introduced by OpenCompass. These models not only perform well across evaluation methods such as scoring and pairwise comparison, but can also output reviews with assessment details in a specified format, making them suitable for any evaluation dataset. They can also handle general tasks like a typical instruction model, so they serve as versatile tools with strong generalization and judging capabilities.
- Comprehensive Evaluation Capabilities: CompassJudger-1 is capable of executing multiple evaluation methods, including but not limited to scoring, comparison, and providing detailed assessment feedback.
- Formatted Output: Produces output in a specific format when instructed to, which makes the evaluation results easier to parse and analyze.
- Versatility: In addition to its evaluation functions, CompassJudger-1 can act as a general-purpose instruction model for everyday tasks. It also supports inference acceleration frameworks such as vLLM and LMDeploy.
Note*:
- All open-source models in the table are set to greedy inference to ensure reproducibility. All results are obtained through OpenCompass.
- The Average Score is calculated by converting the scores on each dataset to a percentage scale and then taking the average.
- AlignBench results are reported for v1.1.
Models | Average* | AlignBench* | AlpacaEvalv2 | ArenaHard | MTBench101 | WildBench | FollowBench | FoFo |
---|---|---|---|---|---|---|---|---|
Nvidia-3_1-Nemotron-70b-Instruct | 68.43 | 6.35 | 85.47 | 85.99 | 8.68 | 48.35 | 0.96 | 0.56 |
GLM-4-plus | 68.30 | 7.00 | 74.91 | 84.30 | 8.53 | 40.85 | 0.95 | 0.67 |
Qwen2.5-72b-Instruct | 65.59 | 6.86 | 58.26 | 86.33 | 8.52 | 38.64 | 0.92 | 0.66 |
Qwen_Max_0919 | 65.18 | 6.99 | 63.73 | 83.16 | 8.57 | 36.96 | 0.93 | 0.58 |
GPT4o-20240513 | 65.14 | 6.94 | 56.77 | 83.89 | 8.48 | 34.88 | 0.93 | 0.66 |
Deepseek-v2_5 | 63.92 | 6.72 | 60.99 | 76.23 | 8.30 | 32.82 | 0.94 | 0.64 |
GPT4-1106 | 61.91 | 6.61 | 51.30 | 82.10 | 8.41 | 21.50 | 0.92 | 0.59 |
claude-3-5-sonnet-202410 | 61.56 | 6.60 | 42.24 | 84.83 | 8.40 | 23.80 | 0.92 | 0.62 |
GPT4o-20240806 | 61.46 | 6.86 | 36.77 | 80.81 | 8.36 | 23.91 | 0.94 | 0.66 |
GPT-4o-mini-2024-07-18 | 61.26 | 6.26 | 45.09 | 77.75 | 8.45 | 30.23 | 0.90 | 0.65 |
Qwen2.5-32b-Instruct | 59.85 | 6.78 | 35.53 | 76.82 | 8.49 | 22.46 | 0.94 | 0.59 |
Mixtral-Large-Instruct-2407 | 59.25 | 6.46 | 38.51 | 76.62 | 8.47 | 29.08 | 0.90 | 0.55 |
Yi-Large | 58.06 | 6.28 | 51.06 | 67.87 | 8.11 | 17.61 | 0.91 | 0.52 |
Qwen2.5-14b-Instruct | 57.91 | 6.56 | 33.66 | 71.00 | 8.43 | 23.42 | 0.92 | 0.55 |
Llama-3_1-405b-Instruct-FP8 | 57.02 | 6.09 | 35.03 | 70.73 | 8.35 | 19.95 | 0.90 | 0.56 |
Ernie-4.0-turbo-8k-preview | 56.79 | 6.98 | 41.74 | 66.09 | 8.24 | 15.35 | 0.94 | 0.43 |
Gemma-2-27b-Instruct | 54.83 | 5.86 | 42.86 | 59.27 | 8.39 | 17.01 | 0.92 | 0.44 |
Llama-3_1-70b-Instruct | 53.59 | 4.87 | 40.62 | 60.75 | 8.46 | 15.07 | 0.87 | 0.50 |
Mistral-small-Instruct-2409 | 53.01 | 5.52 | 36.40 | 61.98 | 8.28 | 20.45 | 0.83 | 0.45 |
Baichuan4 | 52.80 | 6.04 | 35.40 | 51.44 | 7.98 | 14.10 | 0.91 | 0.47 |
Qwen2.5-7b-Instruct | 52.54 | 6.16 | 32.42 | 54.72 | 8.37 | 15.69 | 0.86 | 0.45 |
Doubao_pro_32k_240828 | 52.47 | 6.86 | 19.25 | 58.17 | 8.31 | 7.55 | 0.90 | 0.47 |
Minimax-abab6_5s_chat | 50.32 | 6.62 | 32.80 | 57.87 | 6.85 | 11.65 | 0.89 | 0.33 |
Gemma-2-9b-Instruct | 49.34 | 5.67 | 34.91 | 44.59 | 8.33 | 8.38 | 0.85 | 0.36 |
Yi-1.5-9b-chat | 47.97 | 5.57 | 38.76 | 39.04 | 7.89 | 9.75 | 0.84 | 0.33 |
Ministral-8B-instruct-2410 | 47.92 | 5.07 | 28.07 | 44.84 | 8.09 | 3.68 | 0.79 | 0.48 |
GLM-4-9b-chat | 47.41 | 5.84 | 21.61 | 35.93 | 8.14 | 2.95 | 0.88 | 0.43 |
Llama-3_1-8b-instruct | 44.03 | 4.66 | 24.10 | 31.33 | 8.18 | -2.24 | 0.83 | 0.37 |
Internlm2_5-7b-chat | 42.97 | 5.65 | 25.96 | 17.68 | 8.01 | -12.96 | 0.80 | 0.40 |
The following code shows how to load the tokenizer and model and how to generate content.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "opencompass/CompassJudger-1-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
prompt = """your prompt"""
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=2048
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
We also provide some examples for different usage situations:
**Input**: Hello, can you help me to judge something?
**Output**: Of course! I'd be happy to help you make a judgment or provide any assistance you need. Please tell me what you're looking to evaluate or understand.
**Input**: ```Please read the dialogue between the two assistants and the user to determine which assistant performed better during the conversation.Here is the dialogue content:
[Dialogue Begin]
User: What is a 5-letter word that starts with the letter \"A\" and contains the letters \"D\", \"R\", and \"O\" where \"D\" is not the second letter?
Assistant A: Aardvark.
Assistant B: The word that meets the given criteria is \"adroit\".
User: \"D\" shouldn't be the second letter and the word must be a 5-letter word.
Assistant A: Aardvark.
Assistant B: I apologize for the confusion. A 5-letter word that starts with the letter \"A\" and contains the letters \"D\", \"R\", and \"O\" where \"D\" is not the second letter is \"ardor\".
[Dialogue End]
If you believe Assistant A performed better, please output A directly.\nIf you believe Assistant B performed better, please output B directly.\nDo not output any other content, just the option. Please output:```
**Output**: B
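The pairwise prompt above can also be assembled programmatically. The following is a minimal sketch rather than part of any OpenCompass API: `build_pairwise_prompt` and `parse_choice` are hypothetical helper names, and the instruction wording is copied from the example above.

```python
from typing import List, Optional, Tuple

# Each turn is (user message, Assistant A reply, Assistant B reply).
Turn = Tuple[str, str, str]

# Instruction text taken from the example prompt in this README.
HEADER = (
    "Please read the dialogue between the two assistants and the user to "
    "determine which assistant performed better during the conversation."
    "Here is the dialogue content:\n[Dialogue Begin]\n"
)
FOOTER = (
    "[Dialogue End]\n"
    "If you believe Assistant A performed better, please output A directly.\n"
    "If you believe Assistant B performed better, please output B directly.\n"
    "Do not output any other content, just the option. Please output:"
)


def build_pairwise_prompt(turns: List[Turn]) -> str:
    """Hypothetical helper: render a multi-turn dialogue into the judge prompt."""
    body = ""
    for user, reply_a, reply_b in turns:
        body += f"User: {user}\nAssistant A: {reply_a}\nAssistant B: {reply_b}\n"
    return HEADER + body + FOOTER


def parse_choice(response: str) -> Optional[str]:
    """Hypothetical helper: accept only a bare 'A' or 'B'; anything else is None."""
    answer = response.strip().upper()
    return answer if answer in ("A", "B") else None
```

The resulting string can be passed as `prompt` in the loading snippet earlier in this README. Returning `None` for anything other than a single letter makes malformed judgments easy to count separately instead of silently assigning them to one side.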
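**Input** (translated from Chinese): ```You are an assistant skilled at evaluating text quality.\nPlease act as an impartial judge and evaluate the quality of an AI assistant's answer to a user's question. Because the answer you are evaluating is of the role-play type, you need to evaluate it along the following dimensions:\n1. Factual Correctness: Whether the information in the answer is accurate and based on credible facts and data.\n2. Meeting User Needs: Whether the answer fulfils the purpose and needs behind the user's question, and whether it responds to the question comprehensively and appropriately.\n3. Logical Coherence: Whether the answer is consistent overall and logically coherent across its parts, avoiding self-contradiction.\n4. Creativity: Whether the answer is innovative or distinctive, offering novel insights or solutions.\n5. Richness: Whether the answer contains rich information, depth, contextual consideration, diversity, detailed explanation and examples, so as to meet the user's needs and provide a comprehensive understanding.\nWe will provide you with the user's question, a high-quality reference answer, and the AI assistant's answer that you need to evaluate. When you begin your evaluation, follow this procedure:\n1. Compare the AI assistant's answer with the reference answer, point out the shortcomings of the AI assistant's answer, and explain them further.\n2. Evaluate the AI assistant's answer on each dimension, and after each dimension's evaluation give that dimension a score from 1 to 10.\n3. Finally, combining the evaluations of all dimensions, give the AI assistant's answer an overall score from 1 to 10.\n4. Your scoring must be as strict as possible and must follow these rules: in general, the higher the quality of the model's answer, the higher the score. Factual correctness and meeting user needs are the most important dimensions, and their scores dominate the final overall score. When the model's answer is irrelevant to the question, contains fundamental factual errors, or produces harmful content, the total score must be 1 to 2; when the answer has no serious errors and is basically harmless but is of low quality and does not meet the user's needs, the total score is 3 to 4; when the answer basically meets the user's requirements but performs poorly on some dimensions and is of medium quality, the total score can be 5 to 6; when the quality of the answer is close to that of the reference answer and it performs well on all dimensions, the total score is 7 to 8; only when the quality of the answer significantly exceeds the reference answer, fully resolves the user's question and all needs, and is close to full marks on all dimensions can it receive 9 to 10. As an example, the reference answer would receive a score of 8.\nRemember that you must evaluate and explain before you score. After your explanation of each dimension, add the score for that dimension. Then, at the end of your answer, return all of your scores in the following dictionary format (including the braces), and make sure your scores are integers:\n{'Dimension 1': score, 'Dimension 2': score, ..., 'Overall Score': score}, for example: {'Factual Correctness': 9, 'Meeting User Needs': 6, ..., 'Overall Score': 7}.\nUser's question: I want you to act as a chat window for chatting with a 20-year-old girl. I will type what I say to the 20-year-old girl, and you output only what a 20-year-old girl might reply, without any extra text. What hobbies do you usually have?\n[Reference answer begins]\nI like reading, especially novels and poetry, and I also like listening to music and painting.\n[Reference answer ends]\n[Assistant's answer begins]\nI like listening to music and watching movies. What about you?\n[Assistant's answer ends]```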
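**Output** (translated from Chinese): ```Compare the AI assistant's answer with the reference answer, point out the shortcomings of the AI assistant's answer, and explain them further. The AI assistant's answer gives two hobbies, listening to music and watching movies, and ends with a question back to the other person about their hobbies. Compared with the reference answer, the AI assistant's answer is fairly simple in content and does not provide enough information to show richness or creativity. In addition, the "reading" and "painting" mentioned in the reference answer show more diverse and personal hobbies than just listening to music and watching movies. Evaluating the AI assistant's answer on each dimension:
Factual Correctness: In this scenario factual correctness is not a primary criterion, because the answer concerns a hypothetical situation. However, the AI assistant's answer contains no inaccurate information, so it is acceptable on this dimension. Score: 8.
Meeting User Needs: The user wants a reply that a 20-year-old girl might give. The AI assistant's answer offers some hobbies, but the content is fairly simple and does not fully deliver the richness and diversity the user may expect. Score: 5.
Logical Coherence: The AI assistant's answer is logically coherent, gives plausible hobbies, and ends the exchange in a natural way. Score: 8.
Creativity: The AI assistant's answer lacks creativity; the hobbies it gives are common, and it shows no novel or distinctive insight. Score: 4.
Richness: The AI assistant's answer is short and lacks depth and detail; it does not provide varied hobbies or detailed explanation, so it performs poorly on richness. Score: 4.
Combining the evaluations of all dimensions, give the AI assistant's answer an overall score from 1 to 10.
Overall, the AI assistant's answer basically meets the user's needs but lacks richness and creativity, and there is still a gap compared with the reference answer. Overall Score: 5.
{'Factual Correctness': 8, 'Meeting User Needs': 5, 'Logical Coherence': 8, 'Creativity': 4, 'Richness': 4, 'Overall Score': 5}```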
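Because the review ends with a Python-style dictionary of per-dimension scores, the scores can be recovered with the standard library. The following is a minimal sketch under that assumption; `extract_scores` is a hypothetical helper name, not an OpenCompass function, and the sample keys in the usage note are illustrative only.

```python
import ast
import re
from typing import Dict, Optional


def extract_scores(review: str) -> Optional[Dict[str, int]]:
    """Return the last {...} literal in the review that parses as a
    non-empty dict with integer values, or None if there is none."""
    # Non-nested brace groups only; the score dict in the example is flat.
    candidates = re.findall(r"\{[^{}]*\}", review)
    for text in reversed(candidates):
        try:
            parsed = ast.literal_eval(text)
        except (ValueError, SyntaxError, TypeError):
            # Not a valid Python literal (for example a template with "...").
            continue
        if (
            isinstance(parsed, dict)
            and parsed
            and all(isinstance(value, int) for value in parsed.values())
        ):
            return parsed
    return None
```

For a review ending in `{'Overall Score': 5}` (an illustrative key), `extract_scores` returns the dictionary itself. Using `ast.literal_eval` rather than `eval` means the judge's output is never executed as code.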
**Input**: ```Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.\n\nBegin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.\n\nWhen evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.\n\nThen consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.\n\nThen consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.\n\nAfter providing your explanation, you must output only one of the following choices as your final verdict with a label:\n\n1. Assistant A is significantly better: [[A>>B]]\n2. Assistant A is slightly better: [[A>B]]\n3. Tie, relatively the same: [[A=B]]\n4. Assistant B is slightly better: [[B>A]]\n5. 
Assistant B is significantly better: [[B>>A]]\n\nExample output: \"My final verdict is tie: [[A=B]]\".\n<|User Prompt|>\nUse ABC notation to write a melody in the style of a folk tune.\n\n<|The Start of Assistant A's Answer|>\nX:1\nT:Untitled Folk Tune\nM:4/4\nL:1/8\nK:G\n|:G2A2|B2A2|G2E2|D4|E2F2|G2F2|E2C2|B,4|\nA2B2|c2B2|A2F2|E4|D2E2|F2E2|D2B,2|C4:|\n<|The End of Assistant A's Answer|>\n\n<|The Start of Assistant B's Answer|>\nSure, I'll create a simple folk tune in ABC notation for you. ABC notation is a simple way to write down music using letters to represent notes and various symbols to represent timing and other musical elements. Here's a short, cheerful folk tune:\n\n\nX:1\nT:Folk Tune\nM:4/4\nL:1/4\nK:G\nGABc dedB|ABAF GFE2|GABc dedB|1 ABAG FGA2:|2 ABAG FGAB|\nc2ec B2dB|c2ec BAGB|c2ec B2dB|1 AGFE D3B:|2 AGFE D3z|]\n\n\nHere's a breakdown of what each part means:\n\n- `X:` The index number of the tune.\n- `T:` The title of the tune.\n- `M:` The meter, in this case, 4/4 time.\n- `L:` The default note length, in this case, 1/4.\n- `K:` The key of the tune, in this case, G major.\n\nThe tune itself is written in two parts, each ending with a repeat mark (`:|`). The first part (`GABc dedB|ABAF GFE2|...`) is repeated with a different ending (`1 ABAG FGA2:|2 ABAG FGAB|`) to add variety. The second part (`c2ec B2dB|c2ec BAGB|...`) also has a repeat with a different ending, and it ends with a long note (`D3B` or `D3z`) to give it a sense of finality.\n\nThis tune is designed to be simple and catchy, with a mix of stepwise motion and skips to give it a lively folk feel. You can play it on any melodic instrument, and it's suitable for singing as well if you add lyrics.\n<|The End of Assistant B's Answer|>\n```
**Output**: ```To generate a melody in the style of a folk tune using ABC notation, I would create a simple and repetitive melody, typical of folk music. Here is my attempt: X:1 T:Folk Tune Example M:4/4 L:1/8 K:D
|: D2 E2 | F2 E2 | D2 A2 | B4 |
| A2 B2 | c2 B2 | A2 F2 | E4 :|
| D2 E2 | f2 e2 | d2 A2 | G4 |
| F2 G2 | A2 G2 | F2 D2 | E4 :|
:|
Explanation: `X:` is the tune's index number. `T:` is the title of the tune. `M:` specifies the meter (4/4 time). `L:` sets the default note length to 1/8. `K:` indicates the key (D major).
The melody is constructed with simple intervals and repetitive phrases, which are characteristic of folk music. The tune has a clear structure with two verses and a repeat sign (`:`) to indicate repetition.
Now, let's evaluate the responses from Assistant A and Assistant B.
Assistant A's Answer: The melody provided by Assistant A is concise and follows the standard ABC notation. The choice of notes and structure is simple, which is fitting for a folk tune. The use of repetition and simple progression is consistent with folk music characteristics. The melody is clear and could be easily played by a musician.
Assistant B's Answer: Assistant B provides a detailed explanation of ABC notation, which is helpful for context. The melody includes more complex elements, such as varied repeat sections and additional symbols like `1` and `2`, which might not be typical for a straightforward folk tune. The explanation about the structure and the breakdown of each part is informative and adds value to the response. However, the use of `1` and `2` in the melody structure is unconventional and might confuse someone unfamiliar with ABC notation.
Evaluation: Both assistants provide a melody in ABC notation. However, Assistant B's response includes unnecessary complexity for a folk tune and might be confusing due to unconventional use of symbols. Assistant A's response is more straightforward and adheres closely to the expectations of a folk tune in ABC notation.
My final verdict is that Assistant A's answer is slightly better due to its simplicity and directness in creating a folk tune: [[A>B]].```
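The verdict label requested by this prompt, one of `[[A>>B]]`, `[[A>B]]`, `[[A=B]]`, `[[B>A]]`, or `[[B>>A]]`, can be pulled out of the judgment with a regular expression. The following is a minimal sketch; `extract_verdict` is a hypothetical helper name, and taking the last match is an assumption made here so that labels quoted earlier in the explanation do not override the final verdict.

```python
import re
from typing import Optional

# The five labels listed in the judge prompt above.
VERDICT_PATTERN = re.compile(r"\[\[(A>>B|A>B|A=B|B>A|B>>A)\]\]")


def extract_verdict(judgement: str) -> Optional[str]:
    """Return the last verdict label found in the judgment, or None."""
    matches = VERDICT_PATTERN.findall(judgement)
    return matches[-1] if matches else None
```

Apply this to the model's response only, not to the prompt, since the prompt itself lists all five labels.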
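**Input** (translated from Chinese): ```Please give revision suggestions for the answer below. User question: Write a poem about autumn that does not use the words "fallen leaves" but still conveys the scene of leaves falling. Model answer: The autumn wind rises, the sky fills with yellow; what drifts down is busy leaving the branch.
The green once deep is now late in its season; the years stretch on, and who will voice the sorrow of parting?
Slow steps, a lengthening shadow; sparse treetops catch the slanting sun.
All former splendor has passed before the eyes; now loneliness lies spread along the old road.
The river of years runs silently; the light boat has already passed ten thousand mountains.
That summer, laughter rang out; now all that remains is the deep fragrance of fallen leaves.```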
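**Output** (translated from Chinese): ```This poem is very evocative, and overall it captures the atmosphere of autumn and the sense of parting and change associated with falling leaves. However, the user's request specifically stressed "does not use the words 'fallen leaves' but still conveys the scene of leaves falling". On that basis, we can adjust the poem to fit the user's needs more closely. Here are some suggestions:
"The autumn wind rises, the sky fills with yellow; what drifts down is busy leaving the branch."
This could be changed to: "The autumn wind rises, the sky fills with yellow; dancing through the air, busy leaving the branch."
This avoids using the word "drifting" directly while still conveying leaves floating down from the tree.
"The green once deep is now late in its season; the years stretch on, and who will voice the sorrow of parting?"
This could be changed to: "The green once deep is now late in its season; the years are long, and the story falls silent."
After this change, although the sorrow of parting is not stated directly, the reader can still feel the desolation of past splendor fading as time passes.
"Slow steps, a lengthening shadow; sparse treetops catch the slanting sun."
This line already depicts the calm and solitude of dusk well and can stay unchanged.
"All former splendor has passed before the eyes; now loneliness lies spread along the old road."
This could be changed to: "The splendor of bygone days is now a memory; now loneliness, and the old road glows in the slanting sun."
After this change, the line expresses the former splendor and also highlights the present loneliness and desolation.
"The river of years runs silently; the light boat has already passed ten thousand mountains."
This line can stay unchanged.
"That summer, laughter rang out; now all that remains is the deep fragrance of fallen leaves."
This could be changed to: "That summer, laughter rang out; now all that remains is a lingering deep fragrance."
This avoids mentioning "fallen leaves" directly while keeping the autumn reflection and the passage of the years.
The revised poem is as follows:
The autumn wind rises, the sky fills with yellow; dancing through the air, busy leaving the branch.
The green once deep is now late in its season; the years are long, and the story falls silent.
Slow steps, a lengthening shadow; sparse treetops catch the slanting sun.
The splendor that once was is now a memory; now loneliness, and the old road glows in the slanting sun.
The river of years runs silently; the light boat has already passed ten thousand mountains.
That summer, laughter rang out; now all that remains is a lingering deep fragrance.
The revised poem still depicts the autumn scenery and the thoughts the season brings, while neatly avoiding the words "fallen leaves" and conveying the scene of leaves falling through other imagery.```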
We have also established a new benchmark named JudgerBench, aimed at standardizing the evaluation capabilities of different judging models, thereby helping to identify more effective evaluator models.
To test your judge model on JudgerBench with OpenCompass, install OpenCompass with the commands below, replace the models in configs/eval_judgerbench.py with your own, and then run the evaluation:
git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
python run.py configs/eval_judgerbench.py --mode all --reuse latest
We also provide a leaderboard for JudgerBench: https://huggingface.co/spaces/opencompass/judgerbench_leaderboard
To add your model to this leaderboard, please open an issue in this repository.
To evaluate common subjective datasets with CompassJudger-1 in OpenCompass, follow the example below, which uses AlignBench.
You need to set up three items first:
1. datasets (the subjective datasets you want to test)
2. models (the models you want to test on the subjective datasets)
3. judge_models (the judge models you want to use as evaluators)

For more settings, please refer to the advanced guides in OpenCompass.
from mmengine.config import read_base
with read_base():
    from opencompass.configs.datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_1_5b_instruct import models as lmdeploy_qwen2_5_1_5b_instruct
from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI, TurboMindModelwithChatTemplate
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import SubjectiveSummarizer
api_meta_template = dict(
round=[
dict(role='HUMAN', api_role='HUMAN'),
dict(role='BOT', api_role='BOT', generate=True),
]
)
# -------------Inference Stage ----------------------------------------
models = [*lmdeploy_qwen2_5_1_5b_instruct] # add models you want
datasets = [*alignbench_datasets] # add datasets you want
infer = dict(
partitioner=dict(type=NaivePartitioner),
runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=OpenICLInferTask)),
)
# -------------Evaluation Stage ----------------------------------------
## ------------- JudgeLLM Configuration
judge_models = [dict(
type=TurboMindModelwithChatTemplate,
abbr='CompassJudger-1-7B-Instruct',
path='opencompass/CompassJudger-1-7B-Instruct',
engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
max_seq_len=16384,
max_out_len=2048,
batch_size=16,
run_cfg=dict(num_gpus=1),
)]
## ------------- Evaluation Configuration
eval = dict(
partitioner=dict(type=SubjectiveNaivePartitioner, models=models, judge_models=judge_models,),
runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)
summarizer = dict(type=SubjectiveSummarizer, function='subjective')
work_dir = 'outputs/subjective/'
Then run:
python run.py configs/eval_subjective.py --mode all --reuse latest
For more detailed subjective evaluation guidelines, please refer to: https://github.com/open-compass/opencompass/blob/main/docs/en/advanced_guides/subjective_evaluation.md
To facilitate better comparisons within the community, we have tested the subjective performance of some models using CompassJudger-1.
See: https://huggingface.co/spaces/opencompass/judgerbench_leaderboard
To add your model to this leaderboard, please open an issue in this repository.
@article{cao2024compass,
title={CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution},
author={Cao, Maosong and Lam, Alexander and Duan, Haodong and Liu, Hongwei and Zhang, Songyang and Chen, Kai},
journal={arXiv preprint arXiv:2410.16256},
year={2024}
}