Questions about reproducing Vstar results and the effect of image insertion #60
Open
@yeahhh999

Description


Thanks for your great work, I've learned a lot from it. I have some questions about reproducing your results.

1. I couldn’t reproduce the results on V* as reported in the paper

I tried running the experiments several times following the instructions, but my results were different from what’s in the paper. Here’s a table with some of my runs:

| Run | direct_attributes | relative_position | overall |
| --- | --- | --- | --- |
| 1 | 87.83 | 86.84 | 87.43 |
| 2 | 87.83 | 88.16 | 87.96 |
| 3 | 87.83 | 89.47 | 88.48 |
| 4 | 88.70 | 90.79 | 89.53 |
| 5 | 87.83 | 89.47 | 88.48 |
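(For anyone cross-checking: I'm assuming `overall` is the sample-weighted mean over the two subsets, with the usual V* Bench split of 115 direct_attributes and 76 relative_position questions; that split is my assumption, not something stated in the repo.)

```python
# Sanity check of the "overall" column, assuming the standard V* Bench split sizes (my assumption).
n_attr, n_rel = 115, 76
overall_run1 = (87.83 * n_attr + 86.84 * n_rel) / (n_attr + n_rel)
print(round(overall_run1, 2))  # ~87.44; matches run 1's 87.43 up to rounding of the per-subset scores
```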

I added "stop": ["<|im_end|>\n".strip(), "</tool_call>"] because the model kept generating repeated tool call or extra \n, but I also tried without adding it. Here’s the result without it:

| Run | direct_attributes | relative_position | overall |
| --- | --- | --- | --- |
| 1 | 87.83 | 84.21 | 86.39 |
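For reference, this is roughly how I set the stop strings (a sketch assuming a vLLM-style `SamplingParams`; everything except `stop` is illustrative and the exact rollout config in the repo may differ):

```python
# Sketch only: assumes the rollout uses vLLM; values other than `stop` are illustrative defaults.
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.0,   # illustrative; I kept the repo defaults for everything except `stop`
    max_tokens=2048,
    # Stop strings I added to cut off repeated tool calls / trailing newlines.
    stop=["<|im_end|>", "</tool_call>"],
    # Keep the matched stop string so the closing </tool_call> tag can still be parsed.
    include_stop_str_in_output=True,
)
```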

Any tips for reproducing the numbers reported in the paper would be appreciated.

2. Results are similar with or without image insertion

I tried using your checkpoint and running the tool call without inserting the zoomed-in image back into the conversation. The response looked like this:

{"role": "assistant", "content": "<think>The motorcycle is on the right side of the street. I can see it clearly in the image, positioned further down the street on the right side. The question asks for the side of the street where the motorcycle is located, and it is clearly on the right side.</think>  
<tool_call>
{\"name\": \"image_zoom_in_tool\", \"arguments\": {\"bbox_2d\": [1330, 1410, 1368, 1450], \"label\": \"motorcycle\"}}
</tool_call>"}, 
{"role": "assistant", "content": "<answer>B. The motorcycle is on the right side of the street.</answer>  
<tool_call>
{\"name\": \"image_zoom_in_tool\", \"arguments\": {\"bbox_2d\": [1330, 1410, 1368, 1450], \"label\": \"motorcycle\"}}
</tool_call>"}], "status": "success"}

And the results were similar to when I did insert the image:

| Run | direct_attributes | relative_position | overall |
| --- | --- | --- | --- |
| 1 | 87.83 | 89.47 | 88.48 |
| 2 | 86.96 | 89.47 | 87.96 |
| 3 | 86.96 | 89.47 | 87.96 |

3. A question about the user prompt

In addition, I'm curious why you chose USER_PROMPT_V2. From what I see, USER_PROMPT_V5 seems more aligned with your setup. Did you encounter any issues when training with USER_PROMPT_V5?

```python
USER_PROMPT_V2 = "\nThink first, call **image_zoom_in_tool** if needed, then answer. Format strictly as:  <think>...</think>  <tool_call>...</tool_call> (if tools needed)  <answer>...</answer> "
USER_PROMPT_V5 = "\nThink in the mind first, and then decide whether to call tools one or more times OR provide final answer. Format strictly as: <think>...</think> <tool_call>...</tool_call> <tool_call>...</tool_call> (if any tools needed) OR <answer>...</answer> (if no tools needed)."
```
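For context, this is how I've been assuming the prompt is attached to each question (my own reading; the `build_messages` helper and message layout are hypothetical, not the repo's actual code):

```python
# My assumption about how USER_PROMPT_V2 is combined with each benchmark question;
# the repo's actual message construction may differ.
def build_messages(question: str, image, user_prompt: str = USER_PROMPT_V2):
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question + user_prompt},
        ],
    }]
```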

@JaaackHongggg
