NVIDIA, UC Berkeley, UCSF
Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, Yin Cui
[Paper] | [Project Page] | [Video] | [HuggingFace Demo] | [Model/Benchmark/Datasets] | [Citation]
TL;DR: Our Describe Anything Model (DAM) takes in a region of an image or a video in the form of points/boxes/scribbles/masks and outputs a detailed description of the region. For videos, it is sufficient to supply an annotation on any single frame. We also release a new benchmark, DLC-Bench, to evaluate models on the detailed localized captioning (DLC) task.
Install the `dam` package:
# You can install it without cloning the repo
pip install git+https://github.com/NVlabs/describe-anything
# You can also clone the repo and install it locally
git clone https://github.com/NVlabs/describe-anything
cd describe-anything
pip install -v .
We also provide a self-contained script for detailed localized image descriptions without installing additional dependencies. Please refer to `examples/dam_with_sam_self_contained.py` or this Colab for more details.
Full Hugging Face Demo (this demo is also hosted on Hugging Face Spaces)
To run the demo, run the following commands:
cd demo
python app.py
demo_simple.py
- Interactive Gradio web interface for drawing masks on images and getting descriptions, with optional SAM integration for automated mask generation. This demo is tested with gradio 5.5.0.
demo_video.py
- Interactive Gradio web interface for drawing masks on videos and getting descriptions, with SAM 2 integration for automated mask generation. This demo is tested with gradio 5.5.0.
examples/dam_with_sam.py
- Command-line tool for processing single images using SAM v1, allowing users to specify points or bounding boxes for mask generation
Example commands:
# You can use it with points or a bounding box for the region of interest.
# SAM is used to turn points or a bounding box into a mask.
# You can also use a mask directly; see `examples/query_dam_server.py`.
python examples/dam_with_sam.py --image_path images/1.jpg --points '[[1172, 812], [1572, 800]]' --output_image_path output_visualization.png
python examples/dam_with_sam.py --image_path images/1.jpg --box '[800, 500, 1800, 1000]' --use_box --output_image_path output_visualization.png
Example output:
A medium-sized dog with a thick, reddish-brown coat and white markings on its face, chest, and paws. The dog has pointed ears, a bushy tail, and is wearing a red collar. Its mouth is open, showing its tongue and teeth, and it appears to be in mid-leap.
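Under the hood, the script uses SAM to turn the clicks or box into a mask. A minimal sketch of that step, assuming the standard `segment_anything` API (the checkpoint path and model type below are illustrative):

```python
import numpy as np
from PIL import Image
from segment_anything import SamPredictor, sam_model_registry

# Checkpoint path and model type are illustrative; use whichever SAM v1 weights you have.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# SAM expects an H x W x 3 RGB uint8 array.
image = np.array(Image.open("images/1.jpg").convert("RGB"))
predictor.set_image(image)

# Positive clicks (label 1) on the region of interest, as in the command above.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[1172, 812], [1572, 800]]),
    point_labels=np.array([1, 1]),
    multimask_output=False,
)
mask = masks[0]  # boolean H x W mask for the region, passed on to DAM
```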
examples/dam_with_sam_self_contained.py
- Self-contained script for processing single images using SAM v1, allowing users to specify points or bounding boxes for mask generation without cloning the `dam` package.
Example commands:
python examples/dam_with_sam_self_contained.py --image_path images/1.jpg --points '[[1172, 812], [1572, 800]]' --output_image_path output_visualization.png
examples/dam_video_with_sam2.py
- Video processing script using SAM v2.1 that only requires first-frame localization and automatically propagates masks through the video
Example commands:
# You can use it with points or a bounding box for the region of interest. Annotation on one frame is sufficient.
# You can also use a mask directly; see `examples/query_dam_server_video.py`.
python examples/dam_video_with_sam2.py --video_dir videos/1 --points '[[1824, 397]]' --output_image_dir videos/1_visualization
python examples/dam_video_with_sam2.py --video_dir videos/1 --box '[1612, 364, 1920, 430]' --use_box --output_image_dir videos/1_visualization
# You can also use a video file directly.
python examples/dam_video_with_sam2.py --video_file videos/1.mp4 --points '[[1824, 397]]' --output_image_dir videos/1_visualization
Example output:
A sleek, silver SUV is prominently featured, showcasing a modern and aerodynamic design. The vehicle's smooth, metallic surface reflects light, highlighting its well-defined contours and sharp lines. The front of the SUV is characterized by a bold grille and sharp headlights, giving it a dynamic and assertive appearance. As the sequence progresses, the SUV moves steadily forward, its wheels turning smoothly on the road. The side profile reveals tinted windows and a streamlined body, emphasizing its spacious interior and robust build. The rear of the SUV is equipped with stylish taillights and a subtle spoiler, adding to its sporty aesthetic. Throughout the sequence, the SUV maintains a consistent speed, suggesting a confident and controlled drive, seamlessly integrating into the flow of traffic.
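For reference, here is a rough sketch of the first-frame annotation and propagation that SAM 2 performs (method names vary slightly across SAM 2 releases; the config and checkpoint paths are illustrative):

```python
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Config and checkpoint are illustrative; see the SAM 2 repo for the current files.
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode():
    # videos/1 is a directory of frames, as in the commands above.
    state = predictor.init_state(video_path="videos/1")

    # Annotate a single frame; SAM 2 propagates the mask through the video.
    predictor.add_new_points(
        inference_state=state,
        frame_idx=0,
        obj_id=1,
        points=np.array([[1824, 397]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),  # 1 = positive click
    )

    masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks[frame_idx] = (mask_logits[0] > 0).cpu().numpy()  # first object's mask
```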
dam_server.py
- Core server implementation providing an OpenAI-compatible API endpoint for the Describe Anything Model (DAM), handling both image and video inputs with streaming support. The mask for the region of interest is passed through the alpha channel of the input image/video frames.
Example commands:
# Image-only DAM
python dam_server.py --model-path nvidia/DAM-3B --conv-mode v1 --prompt-mode focal_prompt --temperature 0.2 --top_p 0.9 --num_beams 1 --max_new_tokens 512 --workers 1
# Image-video joint DAM
python dam_server.py --model-path nvidia/DAM-3B-Video --conv-mode v1 --prompt-mode focal_prompt --temperature 0.2 --top_p 0.9 --num_beams 1 --max_new_tokens 512 --workers 1 --image_video_joint_checkpoint
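Concretely, a request image can be prepared by packing the binary mask into the alpha channel of an RGBA PNG, along these lines (file names are illustrative):

```python
import numpy as np
from PIL import Image

# Pack a binary mask into the alpha channel of the image.
image = Image.open("images/1.jpg").convert("RGB")
mask = np.array(Image.open("mask.png").convert("L"))  # 0 = background, 255 = region

rgba = image.convert("RGBA")
rgba.putalpha(Image.fromarray(((mask > 127) * 255).astype(np.uint8)))
rgba.save("image_with_mask.png")  # send this PNG to the server
```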
examples/query_dam_server.py
- Client example using OpenAI's SDK to interact with the DAM server for single image processing
Example commands:
python examples/query_dam_server.py --model describe_anything_model --server_url http://localhost:8000
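The gist of the client is a standard OpenAI-style chat completion with the masked RGBA PNG inlined as a data URI. A sketch, assuming the prompt text and file name below (whether the base URL needs a `/v1` suffix depends on how the server mounts its routes):

```python
import base64
from openai import OpenAI

# The server is OpenAI-compatible; the API key is unused.
client = OpenAI(base_url="http://localhost:8000", api_key="not-needed")

with open("image_with_mask.png", "rb") as f:  # mask in the alpha channel
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="describe_anything_model",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the masked region in detail."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)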
examples/query_dam_server_raw.py
- Client example showing how to make direct HTTP requests to the DAM server without using the OpenAI SDK
Example commands:
python examples/query_dam_server_raw.py --model describe_anything_model --server_url http://localhost:8000
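Without the SDK, the same call is a plain POST of the JSON payload; a sketch, assuming the usual OpenAI-style `/chat/completions` route:

```python
import base64
import requests

with open("image_with_mask.png", "rb") as f:  # mask in the alpha channel
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "describe_anything_model",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the masked region in detail."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
}

# The exact route is an assumption based on the OpenAI API convention.
r = requests.post("http://localhost:8000/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])
```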
examples/query_dam_server_video.py
- Client example demonstrating how to process videos through the DAM server using the OpenAI SDK, handling multiple frames in a single request
Example commands:
Note: for video processing, please use the joint checkpoint trained on both images and videos (see `dam_server.py` above).
python examples/query_dam_server_video.py --model describe_anything_model --server_url http://localhost:8000
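For video, the request simply carries several masked RGBA frames as separate image entries in one message; a sketch (frame file names are illustrative):

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000", api_key="not-needed")

def to_data_url(path):
    with open(path, "rb") as f:  # each frame has its mask in the alpha channel
        return f"data:image/png;base64,{base64.b64encode(f.read()).decode()}"

content = [{"type": "text", "text": "Describe the masked region in the video."}]
content += [{"type": "image_url", "image_url": {"url": to_data_url(p)}}
            for p in ["frame_00.png", "frame_01.png", "frame_02.png"]]

response = client.chat.completions.create(
    model="describe_anything_model",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```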
We provide a script to evaluate your model on DLC-Bench. Please refer to the evaluation README for more details.
We are releasing the Describe Anything Models under the following licenses:
- Code: Apache License 2.0
- Model weights (in Hugging Face):
- Training Data (in Hugging Face): NVIDIA Noncommercial License
- DLC-Bench (in Hugging Face): CC BY-NC-SA 4.0
If you use our work or our implementation in this repo, or find them helpful, please consider citing us in the following format.
@article{lian2025describe,
  title={Describe Anything: Detailed Localized Image and Video Captioning},
  author={Long Lian and Yifan Ding and Yunhao Ge and Sifei Liu and Hanzi Mao and Boyi Li and Marco Pavone and Ming-Yu Liu and Trevor Darrell and Adam Yala and Yin Cui},
  journal={arXiv preprint arXiv:2504.16072},
  year={2025}
}
We would like to thank the following projects for their contributions to this work: