DOI: 10.1145/3526113.3545676
research-article
Open access

Synthesis-Assisted Video Prototyping From a Document

Published: 28 October 2022

Abstract

Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source material is a written document, such as a web tutorial, it takes several iterations to refine the content from a text article into spoken dialogue while considering the visual composition of each scene. We propose Doc2Video, a video prototyping approach that converts a document into an interactive script with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes and automatically creates a synthesized video of a virtual instructor for each scene. Designed for a specific domain, programming cookbooks, the pipeline applies visual elements from the source document, such as a keyword, a code snippet, or a screenshot, in suitable layouts. In our Editing UI, users edit narration sentences, break or combine sections, and modify visuals to prototype a video. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provides a reasonable starting point that engages them in interactive scripting for a narrated instructional video.
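
To make the decomposition step concrete, the following is a minimal sketch of how a Markdown cookbook page might be split into scenes, each carrying narration sentences and candidate visuals. It is not the Doc2Video implementation: the `Scene` class, the `split_into_scenes` function, the heading-based scene boundaries, and the punctuation-based sentence splitting are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field

FENCE = "`" * 3  # Markdown code-fence delimiter, built here to keep this example readable

# Matches Markdown images such as ![alt](path.png) and captures the path.
IMAGE_RE = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


@dataclass
class Scene:
    """Hypothetical scene record: one heading-delimited section of the document."""
    title: str
    narration: list[str] = field(default_factory=list)      # sentences to be spoken
    code_snippets: list[str] = field(default_factory=list)  # candidate code visuals
    images: list[str] = field(default_factory=list)         # candidate screenshots


def _has_content(scene: Scene) -> bool:
    return bool(scene.narration or scene.code_snippets or scene.images)


def split_into_scenes(markdown: str) -> list[Scene]:
    """Start a new scene at every heading (an assumed rule, not the paper's)."""
    scenes: list[Scene] = []
    current = Scene(title="Introduction")
    in_code = False
    code_lines: list[str] = []

    for line in markdown.splitlines():
        if line.strip().startswith(FENCE):
            if in_code:  # closing fence: store the collected snippet
                current.code_snippets.append("\n".join(code_lines))
                code_lines = []
            in_code = not in_code
            continue
        if in_code:
            code_lines.append(line)
            continue

        heading = re.match(r"^#{1,6}\s+(.*)$", line)
        if heading:
            # Sections with no narration or visuals are dropped in this sketch.
            if _has_content(current):
                scenes.append(current)
            current = Scene(title=heading.group(1).strip())
            continue

        current.images.extend(IMAGE_RE.findall(line))
        text = IMAGE_RE.sub("", line).strip()
        if text:
            # Naive sentence split on end punctuation followed by whitespace.
            current.narration.extend(
                s for s in re.split(r"(?<=[.!?])\s+", text) if s
            )

    if _has_content(current):
        scenes.append(current)
    return scenes
```

Each resulting `Scene` could then be handed to a text-to-speech and layout stage; those stages are outside this sketch. Splitting at headings keeps the example short, whereas an editing interface like the one described above would also let users break or combine the resulting sections by hand.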


Cited By

  • (2024) The Generative Fairy Tale of Scary Little Red Riding Hood. Proceedings of the 2024 ACM International Conference on Interactive Media Experiences. 10.1145/3639701.3656303 (129-144). Online publication date: 7-Jun-2024
  • (2024) AI-Generated Media for Exploring Alternate Realities. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems. 10.1145/3613905.3650861 (1-8). Online publication date: 11-May-2024
  • (2024) Piet: Facilitating Color Authoring for Motion Graphics Video. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 10.1145/3613904.3642711 (1-17). Online publication date: 11-May-2024
  • (2023) Papeos: Augmenting Research Papers with Talk Videos. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology. 10.1145/3586183.3606770 (1-19). Online publication date: 29-Oct-2023


    Published In

    UIST '22: Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology
    October 2022
    1363 pages
    ISBN: 9781450393201
    DOI: 10.1145/3526113
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. video creation
    2. creativity tools
    3. programming cookbooks
    4. talking head videos
    5. tutorials
    6. video prototyping
    7. voiceover

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    UIST '22

    Acceptance Rates

    Overall Acceptance Rate 561 of 2,567 submissions, 22%
