DOI: 10.1145/3678717.3691296
Short paper · Open access

Towards Zero-Shot Annotation of the Built Environment with Vision-Language Models

Published: 22 November 2024

Abstract

Equitable urban transportation applications require high-fidelity digital representations of the built environment (streets, crossings, curb ramps, and more). Direct inspection and manual annotation are costly at scale, while conventional machine learning methods require substantial annotated training data to perform adequately. This study explores vision-language models as a tool for annotating diverse urban features from satellite images, reducing the dependence on human annotation. Although these models excel at describing common objects in human-centric images, their training sets may lack signals for esoteric built-environment features, making their performance uncertain. We demonstrate a proof of concept using a vision-language model and a visual prompting strategy that operates on segmented image elements. Experiments on two urban features, stop lines and raised tables, show that while zero-shot prompting rarely works, the segmentation and visual prompting strategies achieve nearly 40% intersection-over-union accuracy. We describe how these results motivate further research on automatic annotation of the built environment to improve equity, accessibility, and safety at scale and across diverse environments.
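
Only the abstract is available here, but the pipeline it describes (segment a satellite tile into candidate regions, visually prompt a vision-language model with those regions, and score the result by intersection over union) is concrete enough to sketch. The following is a minimal illustration under stated assumptions, not the authors' code: it uses the public segment-anything package and an OpenAI-style chat endpoint, and the function names, prompt wording, and model choice ("gpt-4o") are placeholders rather than the paper's actual setup.

```python
"""Sketch of a segment-then-prompt annotation pipeline for satellite tiles."""
import base64
import io

import numpy as np
from PIL import Image, ImageDraw
from openai import OpenAI
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry


def segment_tile(image: np.ndarray, checkpoint: str) -> list[dict]:
    """Propose candidate regions with SAM's automatic mask generator."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    return SamAutomaticMaskGenerator(sam).generate(image)  # list of mask dicts


def draw_marks(image: np.ndarray, masks: list[dict]) -> Image.Image:
    """Set-of-mark-style visual prompt: box and number each candidate region."""
    canvas = Image.fromarray(image).convert("RGB")
    draw = ImageDraw.Draw(canvas)
    for i, m in enumerate(masks):
        x, y, w, h = m["bbox"]  # SAM bounding boxes are XYWH
        draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
        draw.text((x + 3, y + 3), str(i), fill="red")
    return canvas


def ask_vlm(marked: Image.Image, feature: str) -> list[int]:
    """Ask a VLM which numbered regions show the feature (e.g. 'stop lines')."""
    buf = io.BytesIO()
    marked.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    resp = OpenAI().chat.completions.create(
        model="gpt-4o",  # placeholder; the paper's model is not stated here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Which numbered regions are {feature}? "
                         "Reply with the numbers only, comma-separated, "
                         "or 'none'."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    text = resp.choices[0].message.content or ""
    return [int(t) for t in text.replace(",", " ").split() if t.isdigit()]


def union_masks(masks: list[dict], ids: list[int]) -> np.ndarray:
    """Merge the VLM-selected candidate masks into one prediction mask."""
    pred = np.zeros_like(masks[0]["segmentation"], dtype=bool)
    for i in ids:
        pred |= masks[i]["segmentation"]
    return pred


def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over union between boolean prediction and label masks."""
    union = np.logical_or(pred, truth).sum()
    return float(np.logical_and(pred, truth).sum() / union) if union else 0.0
```

Given a ground-truth mask, iou(union_masks(masks, ids), truth) yields the kind of per-feature intersection-over-union score the abstract reports at nearly 40% for the segmentation-plus-prompting strategy.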





Published In

SIGSPATIAL '24: Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems
October 2024 · 743 pages

This work is licensed under a Creative Commons Attribution 4.0 International License.


Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Image Segmentation
  2. Large Language Model
  3. Urban Computing
  4. Urban Data Annotation
  5. Vision Language Model

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

SIGSPATIAL '24

Acceptance Rates

SIGSPATIAL '24 paper acceptance rate: 37 of 122 submissions (30%).
Overall acceptance rate: 257 of 1,238 submissions (21%).

