DOI: 10.1145/3382507.3418861
Short paper
ROSMI: A Multimodal Corpus for Map-based Instruction-Giving

Published: 22 October 2020

Abstract

We present the publicly available Robot Open Street Map Instructions (ROSMI) corpus: a rich multimodal dataset of map and natural-language instruction pairs collected via crowdsourcing. The goal of this corpus is to advance state-of-the-art visual-dialogue tasks, including reference resolution and robot-instruction understanding. The domain described here concerns robots and autonomous systems used for inspection and emergency response. The ROSMI corpus is unique in that it captures interaction grounded in map-based visual stimuli that are both human-readable and contain the rich metadata needed to plan and deploy robots and autonomous systems, thus facilitating human-robot teaming.
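As a purely illustrative sketch of the kind of map-and-instruction pair the abstract describes: the actual ROSMI schema is not given on this page, so every field name and value below is a hypothetical assumption, not the published format.

```python
from dataclasses import dataclass, field

@dataclass
class MapInstructionPair:
    """Hypothetical sketch of a map/instruction pair; not the published ROSMI schema."""
    instruction: str   # natural-language instruction collected from a crowdworker
    map_id: str        # identifier of the OpenStreetMap-based visual stimulus
    robot: str         # the robot or autonomous system the instruction addresses
    target: dict = field(default_factory=dict)  # grounded goal metadata, e.g. coordinates

# Example entry (invented values, for illustration only)
pair = MapInstructionPair(
    instruction="Send the ground robot to the substation north of the river.",
    map_id="map_003",
    robot="ugv_1",
    target={"lat": 55.91, "lon": -3.32},
)
print(pair.instruction)
```

The point of such a pairing is that the same record serves both the human reader (the instruction text over a map) and the planner (the machine-readable goal metadata).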

Supplementary Material

MP4 File (3382507.3418861.mp4)
We present ROSMI: A Multimodal Corpus for Map-based Instruction-Giving. The data collection was motivated by the fact that in emergency response scenarios, the domain of our research, a central hub is used for controlling and monitoring autonomous vehicles. We aim to push the state of the art in map understanding and robot instruction-giving, and to increase interaction between human operators and Robots and Autonomous Systems (RAS). We introduce related work and its main differences from our dataset. We show the web interface used for the data collection. In addition, we present statistics of our dataset and explain the corpus analysis. We conclude with our current work and uses of this dataset, such as training end-to-end models.



Published In

ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction
October 2020, 920 pages
ISBN: 9781450375818
DOI: 10.1145/3382507
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. crowdsourcing
  2. data collection
  3. dialogue system
  4. human-robot interaction
  5. multimodal

Qualifiers

  • Short-paper

Conference

ICMI '20: International Conference on Multimodal Interaction
October 25-29, 2020
Virtual Event, Netherlands

Acceptance Rates

Overall Acceptance Rate 453 of 1,080 submissions, 42%

