
Describe Anything: Detailed Localized Image and Video Captioning

NVIDIA, UC Berkeley, UCSF

Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui

[Paper] | [Project Page] | [Video] | [HuggingFace Demo] | [Model/Benchmark/Datasets] | [Citation]


TL;DR: Our Describe Anything Model (DAM) takes in a region of an image or a video, specified as points, boxes, scribbles, or masks, and outputs a detailed description of that region. For videos, it is sufficient to supply an annotation on any single frame. We also release a new benchmark, DLC-Bench, to evaluate models on the detailed localized captioning (DLC) task.

Running the Describe Anything Model

Installation

Install the dam package:

# You can install it without cloning the repo
pip install git+https://github.com/NVlabs/describe-anything

# You can also clone the repo and install it locally
git clone https://github.com/NVlabs/describe-anything
cd describe-anything
pip install -v .
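
A quick sanity check after installation is to import the package (a minimal check, assuming the installed top-level module is named dam, matching the package name above):

python -c "import dam; print('dam imported successfully')"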

We also provide a self-contained script for detailed localized image descriptions that does not require installing additional dependencies. Please refer to examples/dam_with_sam_self_contained.py or this Colab for more details.

Interactive Demo

Full Huggingface Demo (this demo is also hosted on Huggingface Spaces)

(Demo video: demo.mov)

To run the demo locally:

cd demo
python app.py

Simple Gradio Demo for Detailed Localized Image Descriptions

demo_simple.py - Interactive Gradio web interface for drawing masks on images and getting descriptions, with optional SAM integration for automated mask generation. This demo is tested with gradio 5.5.0.
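
To try it, launch the script and open the local URL that Gradio prints (a minimal invocation; any optional flags the script accepts are not shown):

python demo_simple.py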

Simple Gradio Demo for Detailed Localized Video Descriptions

demo_video.py - Interactive Gradio web interface for drawing masks on videos and getting descriptions, with SAM 2 integration for automated mask generation. This demo is tested with gradio 5.5.0.
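
The video demo is launched the same way (again a minimal invocation; SAM 2 checkpoints are assumed to be available as required by the script):

python demo_video.py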

Examples

Detailed Localized Image Descriptions

  • examples/dam_with_sam.py - Command-line tool for processing single images using SAM v1, allowing users to specify points or bounding boxes for mask generation
Example commands:
# You can use it with points or a bounding box for the region of interest.
# SAM is used to turn points or a bounding box into a mask.
# You can also use a mask directly; see `examples/query_dam_server.py`.
python examples/dam_with_sam.py --image_path images/1.jpg --points '[[1172, 812], [1572, 800]]' --output_image_path output_visualization.png
python examples/dam_with_sam.py --image_path images/1.jpg --box '[800, 500, 1800, 1000]' --use_box --output_image_path output_visualization.png
Example output:
A medium-sized dog with a thick, reddish-brown coat and white markings on its face, chest, and paws. The dog has pointed ears, a bushy tail, and is wearing a red collar. Its mouth is open, showing its tongue and teeth, and it appears to be in mid-leap.

Detailed Localized Image Descriptions without Installing Additional Dependencies

  • examples/dam_with_sam_self_contained.py - Self-contained script for processing single images using SAM v1, allowing users to specify points or bounding boxes for mask generation, without installing the dam package.
Example command:
python examples/dam_with_sam_self_contained.py --image_path images/1.jpg --points '[[1172, 812], [1572, 800]]' --output_image_path output_visualization.png

Detailed Localized Video Descriptions

  • examples/dam_video_with_sam2.py - Video processing script using SAM v2.1 that only requires first-frame localization and automatically propagates masks through the video
Example commands:
# You can use it with points or a bounding box for the region of interest. Annotation on one frame is sufficient.
# You can also use a mask directly; see `examples/query_dam_server_video.py`.
python examples/dam_video_with_sam2.py --video_dir videos/1 --points '[[1824, 397]]' --output_image_dir videos/1_visualization
python examples/dam_video_with_sam2.py --video_dir videos/1 --box '[1612, 364, 1920, 430]' --use_box --output_image_dir videos/1_visualization

# You can also pass a video file directly.
python examples/dam_video_with_sam2.py --video_file videos/1.mp4 --points '[[1824, 397]]' --output_image_dir videos/1_visualization
Example output:
A sleek, silver SUV is prominently featured, showcasing a modern and aerodynamic design. The vehicle's smooth, metallic surface reflects light, highlighting its well-defined contours and sharp lines. The front of the SUV is characterized by a bold grille and sharp headlights, giving it a dynamic and assertive appearance. As the sequence progresses, the SUV moves steadily forward, its wheels turning smoothly on the road. The side profile reveals tinted windows and a streamlined body, emphasizing its spacious interior and robust build. The rear of the SUV is equipped with stylish taillights and a subtle spoiler, adding to its sporty aesthetic. Throughout the sequence, the SUV maintains a consistent speed, suggesting a confident and controlled drive, seamlessly integrating into the flow of traffic.

OpenAI-compatible API

  • dam_server.py - Core server implementation providing an OpenAI-compatible API endpoint for the Describe Anything Model (DAM), handling both image and video inputs with streaming support. The alpha channel of the input image/video frames carries the mask for the region of interest.
Example commands:
# Image-only DAM
python dam_server.py --model-path nvidia/DAM-3B --conv-mode v1 --prompt-mode focal_prompt --temperature 0.2 --top_p 0.9 --num_beams 1 --max_new_tokens 512 --workers 1

# Image-video joint DAM
python dam_server.py --model-path nvidia/DAM-3B-Video --conv-mode v1 --prompt-mode focal_prompt --temperature 0.2 --top_p 0.9 --num_beams 1 --max_new_tokens 512 --workers 1 --image_video_joint_checkpoint

Examples for DAM Server with OpenAI-compatible API

Example commands:
python examples/query_dam_server.py --model describe_anything_model --server_url http://localhost:8000
python examples/query_dam_server_raw.py --model describe_anything_model --server_url http://localhost:8000
  • examples/query_dam_server_video.py - Client example demonstrating how to process videos through the DAM server using the OpenAI SDK, handling multiple frames in a single request
Example command (note: use the joint checkpoint trained on both images and videos for video processing):
python examples/query_dam_server_video.py --model describe_anything_model --server_url http://localhost:8000
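
Because the server exposes an OpenAI-compatible endpoint, it can also be queried directly with the OpenAI Python SDK. The following is a minimal sketch, not the repository's reference client: the file name masked_region.png, the prompt text, and the exact endpoint path and payload are assumptions; consult examples/query_dam_server.py for the authoritative request format.

import base64
from openai import OpenAI  # OpenAI Python SDK (pip install openai)

# Hypothetical illustration: "masked_region.png" is an RGBA image whose alpha channel
# marks the region of interest, as described above. The exact payload dam_server.py
# expects may differ; examples/query_dam_server.py is the reference client.
with open("masked_region.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

client = OpenAI(base_url="http://localhost:8000", api_key="not-needed")  # local DAM server
response = client.chat.completions.create(
    model="describe_anything_model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the masked region in detail."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)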

Evaluating your model on DLC-Bench

We provide a script to evaluate your model on DLC-Bench. Please refer to the evaluation README for more details.

License

We are releasing the Describe Anything Models under the following licenses:

Citation

If you use our work or the implementation in this repo, or find them helpful, please consider citing it with the following BibTeX entry.

@article{lian2025describe,
  title={Describe Anything: Detailed Localized Image and Video Captioning}, 
  author={Long Lian and Yifan Ding and Yunhao Ge and Sifei Liu and Hanzi Mao and Boyi Li and Marco Pavone and Ming-Yu Liu and Trevor Darrell and Adam Yala and Yin Cui},
  journal={arXiv preprint arXiv:2504.16072},
  year={2025}
}

Acknowledgements

We would like to thank the following projects for their contributions to this work:
