DOI: 10.1145/3678717.3691296
Short paper · Open access

Towards Zero-Shot Annotation of the Built Environment with Vision-Language Models

Published: 22 November 2024

Abstract

Equitable urban transportation applications require high-fidelity digital representations of the built environment (streets, crossings, curb ramps, and more). Direct inspection and manual annotation are costly at scale, while conventional machine learning methods require substantial annotated training data to perform adequately. This study explores vision-language models as a tool for annotating diverse urban features from satellite images, reducing the dependence on human annotation. Although these models excel at describing common objects in human-centric images, their training sets may lack signals for esoteric built-environment features, making their performance uncertain. We demonstrate a proof of concept using a vision-language model and a visual prompting strategy that operates on segmented image elements. Experiments on two urban features, stop lines and raised tables, show that while zero-shot prompting rarely works, the segmentation and visual prompting strategies achieve nearly 40% intersection-over-union accuracy. We describe how these results motivate further research on automatic annotation of the built environment to improve equity, accessibility, and safety at scale and across diverse environments.
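
Only the abstract is available here, but the pipeline it describes (segment a satellite tile into candidate regions, visually prompt a vision-language model with those regions, and score the result by intersection over union) is concrete enough to sketch. The following is a minimal illustration under stated assumptions, not the authors' code: it uses the public segment-anything package and an OpenAI-style chat endpoint, and the function names, prompt wording, and model choice ("gpt-4o") are placeholders rather than the paper's actual setup.

```python
"""Sketch of a segment-then-prompt annotation pipeline for satellite tiles."""
import base64
import io

import numpy as np
from PIL import Image, ImageDraw
from openai import OpenAI
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry


def segment_tile(image: np.ndarray, checkpoint: str) -> list[dict]:
    """Propose candidate regions with SAM's automatic mask generator."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    return SamAutomaticMaskGenerator(sam).generate(image)  # list of mask dicts


def draw_marks(image: np.ndarray, masks: list[dict]) -> Image.Image:
    """Set-of-mark-style visual prompt: box and number each candidate region."""
    canvas = Image.fromarray(image).convert("RGB")
    draw = ImageDraw.Draw(canvas)
    for i, m in enumerate(masks):
        x, y, w, h = m["bbox"]  # SAM bounding boxes are XYWH
        draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
        draw.text((x + 3, y + 3), str(i), fill="red")
    return canvas


def ask_vlm(marked: Image.Image, feature: str) -> list[int]:
    """Ask a VLM which numbered regions show the feature (e.g. 'stop lines')."""
    buf = io.BytesIO()
    marked.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    resp = OpenAI().chat.completions.create(
        model="gpt-4o",  # placeholder; the paper's model is not stated here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Which numbered regions are {feature}? "
                         "Reply with the numbers only, comma-separated, "
                         "or 'none'."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    text = resp.choices[0].message.content or ""
    return [int(t) for t in text.replace(",", " ").split() if t.isdigit()]


def union_masks(masks: list[dict], ids: list[int]) -> np.ndarray:
    """Merge the VLM-selected candidate masks into one prediction mask."""
    pred = np.zeros_like(masks[0]["segmentation"], dtype=bool)
    for i in ids:
        pred |= masks[i]["segmentation"]
    return pred


def iou(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over union between boolean prediction and label masks."""
    union = np.logical_or(pred, truth).sum()
    return float(np.logical_and(pred, truth).sum() / union) if union else 0.0
```

Given a ground-truth mask, iou(union_masks(masks, ids), truth) yields the kind of per-feature intersection-over-union score the abstract reports at nearly 40% for the segmentation-plus-prompting strategy.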





Published In

SIGSPATIAL '24: Proceedings of the 32nd ACM International Conference on Advances in Geographic Information Systems
October 2024 · 743 pages

This work is licensed under a Creative Commons Attribution 4.0 International License.


Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

  1. Image Segmentation
  2. Large Language Model
  3. Urban Computing
  4. Urban Data Annotation
  5. Vision Language Model

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

SIGSPATIAL '24

Acceptance Rates

SIGSPATIAL '24 paper acceptance rate: 37 of 122 submissions (30%).
Overall acceptance rate: 257 of 1,238 submissions (21%).

