Questions about reproducing Vstar results and the effect of image insertion #60
Open
@yeahhh999

Description


Thanks for your great work, I've learned a lot from it. I have some questions about reproducing your results.

1. I couldn’t reproduce the results on V* as reported in the paper

I tried running the experiments several times following the instructions, but my results were different from what’s in the paper. Here’s a table with some of my runs:

| Run | direct_attributes | relative_position | overall |
| --- | --- | --- | --- |
| 1 | 87.83 | 86.84 | 87.43 |
| 2 | 87.83 | 88.16 | 87.96 |
| 3 | 87.83 | 89.47 | 88.48 |
| 4 | 88.70 | 90.79 | 89.53 |
| 5 | 87.83 | 89.47 | 88.48 |
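(For anyone cross-checking: I'm assuming `overall` is the sample-weighted mean over the two subsets, with the usual V* Bench split of 115 direct_attributes and 76 relative_position questions; that split is my assumption, not something stated in the repo.)

```python
# Sanity check of the "overall" column, assuming the standard V* Bench split sizes (my assumption).
n_attr, n_rel = 115, 76
overall_run1 = (87.83 * n_attr + 86.84 * n_rel) / (n_attr + n_rel)
print(round(overall_run1, 2))  # ~87.44; matches run 1's 87.43 up to rounding of the per-subset scores
```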

I added "stop": ["<|im_end|>\n".strip(), "</tool_call>"] because the model kept generating repeated tool call or extra \n, but I also tried without adding it. Here’s the result without it:

| Run | direct_attributes | relative_position | overall |
| --- | --- | --- | --- |
| 1 | 87.83 | 84.21 | 86.39 |
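For reference, this is roughly how I set the stop strings (a sketch assuming a vLLM-style `SamplingParams`; everything except `stop` is illustrative and the exact rollout config in the repo may differ):

```python
# Sketch only: assumes the rollout uses vLLM; values other than `stop` are illustrative defaults.
from vllm import SamplingParams

sampling_params = SamplingParams(
    temperature=0.0,   # illustrative; I kept the repo defaults for everything except `stop`
    max_tokens=2048,
    # Stop strings I added to cut off repeated tool calls / trailing newlines.
    stop=["<|im_end|>", "</tool_call>"],
    # Keep the matched stop string so the closing </tool_call> tag can still be parsed.
    include_stop_str_in_output=True,
)
```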

Any tips for reproducing the numbers reported in the paper would be appreciated.

2. Results are similar with or without image insertion

I tried using your checkpoint and running the tool call without inserting the zoomed-in image back into the conversation. The response looked like this:

{"role": "assistant", "content": "<think>The motorcycle is on the right side of the street. I can see it clearly in the image, positioned further down the street on the right side. The question asks for the side of the street where the motorcycle is located, and it is clearly on the right side.</think>  
<tool_call>
{\"name\": \"image_zoom_in_tool\", \"arguments\": {\"bbox_2d\": [1330, 1410, 1368, 1450], \"label\": \"motorcycle\"}}
</tool_call>"}, 
{"role": "assistant", "content": "<answer>B. The motorcycle is on the right side of the street.</answer>  
<tool_call>
{\"name\": \"image_zoom_in_tool\", \"arguments\": {\"bbox_2d\": [1330, 1410, 1368, 1450], \"label\": \"motorcycle\"}}
</tool_call>"}], "status": "success"}

And the results were similar to when I did insert the image:

| Run | direct_attributes | relative_position | overall |
| --- | --- | --- | --- |
| 1 | 87.83 | 89.47 | 88.48 |
| 2 | 86.96 | 89.47 | 87.96 |
| 3 | 86.96 | 89.47 | 87.96 |

3. A question about the user prompt

In addition, I'm curious why you chose USER_PROMPT_V2. From what I see, USER_PROMPT_V5 seems more aligned with your setup. Did you encounter any issues when training with USER_PROMPT_V5?

```python
USER_PROMPT_V2 = "\nThink first, call **image_zoom_in_tool** if needed, then answer. Format strictly as:  <think>...</think>  <tool_call>...</tool_call> (if tools needed)  <answer>...</answer> "
USER_PROMPT_V5 = "\nThink in the mind first, and then decide whether to call tools one or more times OR provide final answer. Format strictly as: <think>...</think> <tool_call>...</tool_call> <tool_call>...</tool_call> (if any tools needed) OR <answer>...</answer> (if no tools needed)."
```
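For context, this is how I've been assuming the prompt is attached to each question (my own reading; the `build_messages` helper and message layout are hypothetical, not the repo's actual code):

```python
# My assumption about how USER_PROMPT_V2 is combined with each benchmark question;
# the repo's actual message construction may differ.
def build_messages(question: str, image, user_prompt: str = USER_PROMPT_V2):
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question + user_prompt},
        ],
    }]
```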

@JaaackHongggg
