DOI: 10.1145/3472749.3474771
research-article

Hierarchical Summarization for Longform Spoken Dialog

Published: 12 October 2021

Abstract

Every day we are surrounded by spoken dialog. This medium delivers rich, diverse streams of information auditorily; however, systematically understanding dialog is often non-trivial. Despite the pervasiveness of spoken dialog, automated speech understanding and quality information extraction remain markedly poor, especially when compared to written prose. Furthermore, compared to understanding text, auditory communication poses many additional challenges, such as speaker disfluencies, informal prose styles, and lack of structure. These concerns all demonstrate the need for a distinctly speech-tailored interactive system to help users understand and navigate the spoken language domain. While individual automatic speech recognition (ASR) and text summarization methods already exist, they are imperfect technologies; neither considers user purpose and intent nor addresses spoken-language-induced complications. Consequently, we design a two-stage ASR and text summarization pipeline and propose a set of semantic segmentation and merging algorithms to resolve these speech modeling challenges. Our system enables users to easily browse and navigate content as well as recover from errors in these underlying technologies. Finally, we present an evaluation of the system that highlights user preference for hierarchical summarization as a tool to quickly skim audio and identify content of interest.
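
The abstract does not spell out the segmentation and merging algorithms, so the following is only an illustrative sketch of the general idea of building a summary hierarchy by repeatedly merging semantically similar adjacent segments. It assumes the ASR stage has already produced transcript sentences. Every name here (`bow`, `cosine`, `merge_once`, `build_hierarchy`, `summarize`) is hypothetical, the bag-of-words similarity is a stand-in for a learned sentence embedding, and `summarize` is a placeholder for an abstractive summarization model.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (assumed stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_once(segments):
    """Merge the most similar pair of adjacent segments (each a list of sentences)."""
    sims = [
        cosine(bow(" ".join(segments[i])), bow(" ".join(segments[i + 1])))
        for i in range(len(segments) - 1)
    ]
    i = max(range(len(sims)), key=sims.__getitem__)
    return segments[:i] + [segments[i] + segments[i + 1]] + segments[i + 2:]


def build_hierarchy(sentences, top_level_size=2):
    """Return levels from finest (one sentence per segment) to coarsest."""
    levels = [[[s] for s in sentences]]
    while len(levels[-1]) > max(1, top_level_size):
        levels.append(merge_once(levels[-1]))
    return levels


def summarize(segment):
    """Placeholder summarizer: returns the first sentence of the segment."""
    return segment[0]


# Toy transcript (invented for illustration, not data from the paper).
transcript = [
    "we trained the speech model on podcast audio",
    "the speech model still makes transcription errors",
    "lunch today is pizza with mushrooms",
    "the pizza place also sells salad",
]
levels = build_hierarchy(transcript, top_level_size=2)
for segment in levels[-1]:
    print(len(segment), "sentences:", summarize(segment))
```

Only adjacent segments are merged, so each level preserves the temporal order of the audio. A reader could skim the coarsest level and then drill down into finer levels for a segment of interest.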

Information & Contributors

Information

Published In

UIST '21: The 34th Annual ACM Symposium on User Interface Software and Technology
October 2021
1357 pages
ISBN: 9781450386357
DOI: 10.1145/3472749
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. automatic speech recognition
  2. information retrieval
  3. machine learning applications
  4. natural language interaction
  5. summarization

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

UIST '21

Acceptance Rates

Overall Acceptance Rate 561 of 2,567 submissions, 22%

Cited By

  • (2024) PodReels: Human-AI Co-Creation of Video Podcast Teasers. Proceedings of the 2024 ACM Designing Interactive Systems Conference, 958-974. DOI: 10.1145/3643834.3661591. Online publication date: 1-Jul-2024
  • (2024) Making Short-Form Videos Accessible with Hierarchical Video Summaries. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-17. DOI: 10.1145/3613904.3642839. Online publication date: 11-May-2024
  • (2024) Rambler: Supporting Writing With Speech via LLM-Assisted Gist Manipulation. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-19. DOI: 10.1145/3613904.3642217. Online publication date: 11-May-2024
  • (2023) Taxonomy of Abstractive Dialogue Summarization: Scenarios, Approaches, and Future Directions. ACM Computing Surveys 56:3, 1-38. DOI: 10.1145/3622933. Online publication date: 5-Oct-2023
  • (2022) Synthesis-Assisted Video Prototyping From a Document. Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, 1-10. DOI: 10.1145/3526113.3545676. Online publication date: 29-Oct-2022
  • (2022) Beyond Text Generation: Supporting Writers with Continuous Automatic Text Summaries. Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, 1-13. DOI: 10.1145/3526113.3545672. Online publication date: 29-Oct-2022
