Description
Thanks for your great work, I've learned a lot from it. I have some questions about reproducing your results.
1. I couldn't reproduce the V* results reported in the paper
I ran the experiments several times following the instructions, but my results differ from those reported in the paper. Here's a table with some of my runs:
Run | direct_attributes | relative_position | overall |
---|---|---|---|
1 | 87.83 | 86.84 | 87.43 |
2 | 87.83 | 88.16 | 87.96 |
3 | 87.83 | 89.47 | 88.48 |
4 | 88.70 | 90.79 | 89.53 |
5 | 87.83 | 89.47 | 88.48 |
I added `"stop": ["<|im_end|>\n".strip(), "</tool_call>"]`
because the model kept generating repeated tool calls or extra `\n` tokens, but I also tried without it. Here's the result without it:
Run | direct_attributes | relative_position | overall |
---|---|---|---|
1 | 87.83 | 84.21 | 86.39 |
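To make it concrete what the stop list is doing on my side, here's a minimal pure-Python sketch of the truncation behavior I expect (the function name and `keep_stop` flag are my own, not from your codebase; note that `"<|im_end|>\n".strip()` is just `"<|im_end|>"`):

```python
# The stop list I pass in; "<|im_end|>\n".strip() evaluates to "<|im_end|>".
STOP = ["<|im_end|>", "</tool_call>"]

def truncate_at_stop(text, stop=STOP, keep_stop=True):
    # Find the earliest occurrence of any stop string.
    hits = [(text.find(s), s) for s in stop if s in text]
    if not hits:
        return text
    i, s = min(hits)
    # Keeping the stop string preserves a parseable </tool_call> tag;
    # inference engines differ on whether the stop string is included.
    return text[: i + len(s)] if keep_stop else text[:i]
```

This cuts off exactly the repeated `<tool_call>` tail I was seeing.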
Any tips for reproducing the results in the paper would be appreciated.
2. Results are similar with or without image insertion
I tried using your checkpoint and calling the tool without inserting the returned image into the conversation. The response looked like this:
And the results were similar to when I did insert the image:
Run | direct_attributes | relative_position | overall |
---|---|---|---|
1 | 87.83 | 89.47 | 88.48 |
2 | 86.96 | 89.47 | 87.96 |
3 | 86.96 | 89.47 | 87.96 |
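For clarity, this is how I understand the two settings I compared; a schematic of the follow-up turn I build after a tool call (the message schema here is my guess at a Qwen-style multimodal format, not copied from your repo):

```python
def build_followup(messages, tool_result_text, zoomed_image=None):
    """Append the tool-response turn to the conversation.

    With image insertion, the zoomed crop is added as an image part;
    without it, only the text result is sent back to the model.
    """
    content = [{"type": "text", "text": tool_result_text}]
    if zoomed_image is not None:
        content.append({"type": "image", "image": zoomed_image})
    messages.append({"role": "user", "content": content})
    return messages
```

Given the similar scores, I wonder whether the model is actually attending to the inserted crop, or whether the text result alone is enough on V*.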
3. A question about the user prompt
In addition, I'm curious why you chose USER_PROMPT_V2. From what I see, USER_PROMPT_V5 seems more aligned with your setup. Did you encounter any issues when training with USER_PROMPT_V5?
```python
USER_PROMPT_V2 = "\nThink first, call **image_zoom_in_tool** if needed, then answer. Format strictly as: <think>...</think> <tool_call>...</tool_call> (if tools needed) <answer>...</answer> "
USER_PROMPT_V5 = "\nThink in the mind first, and then decide whether to call tools one or more times OR provide final answer. Format strictly as: <think>...</think> <tool_call>...</tool_call> <tool_call>...</tool_call> (if any tools needed) OR <answer>...</answer> (if no tools needed)."
```
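As I read them, V2 describes one template where tool calls optionally precede the answer, while V5 makes tool calls and the answer mutually exclusive within a turn. A quick regex sketch of that reading (my own interpretation, not from your code):

```python
import re

# V2 (my reading): <think> block, then zero or more <tool_call>s, then <answer>.
V2_RE = re.compile(
    r"^<think>.*?</think>\s*(<tool_call>.*?</tool_call>\s*)*<answer>.*?</answer>\s*$",
    re.S,
)

# V5 (my reading): <think> block, then EITHER one or more <tool_call>s
# OR a single <answer>, never both in the same turn.
V5_RE = re.compile(
    r"^<think>.*?</think>\s*((<tool_call>.*?</tool_call>\s*)+|<answer>.*?</answer>\s*)$",
    re.S,
)
```

If that reading is right, V5 seems closer to the multi-turn rollout where an answer only appears once no more tools are needed, which is why I was surprised V2 was chosen.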