Automated Conversion of Music Videos into Lyric Videos

Authors:

Rubaiat Habib Kazi,

Hijung Valentina Shin,

Maneesh Agrawala

UIST '23: Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology

Article No.: 13, Pages 1 - 11

https://doi.org/10.1145/3586183.3606757

Published: 29 October 2023

Abstract

Musicians and fans often produce lyric videos, a form of music videos that showcase the song’s lyrics, for their favorite songs. However, making such videos can be challenging and time-consuming as the lyrics need to be added in synchrony and visual harmony with the video. Informed by prior work and close examination of existing lyric videos, we propose a set of design guidelines to help creators make such videos. Our guidelines ensure the readability of the lyric text while maintaining a unified focus of attention. We instantiate these guidelines in a fully automated pipeline that converts an input music video into a lyric video. We demonstrate the robustness of our pipeline by generating lyric videos from a diverse range of input sources. A user study shows that lyric videos generated by our pipeline are effective in maintaining text readability and unifying the focus of attention.

Supplemental Material

CSV File: List of top 100 most viewed lyric videos on YouTube (9.88 KB)

ZIP File: Supplemental File (81.62 MB)

Cited By

Tilekbay B., Yang S., Lewkowicz M., Suryapranata A., Kim J. (2024). ExpressEdit: Video Editing with Natural Language and Sketching. Proceedings of the 29th International Conference on Intelligent User Interfaces, 515-536. Online publication date: 18-Mar-2024. https://dl.acm.org/doi/10.1145/3640543.3645164

Index Terms

Automated Conversion of Music Videos into Lyric Videos
1. Human-centered computing
  1. Human computer interaction (HCI)

Information & Contributors

Information

Published In

UIST '23: Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology

October 2023

1825 pages

ISBN:9798400701320

DOI:10.1145/3586183

Editors:
Sean Follmer (Stanford University, USA),
Jeff Han,
Jürgen Steimle (Saarland University, Germany),
Nathalie Henry Riche (Microsoft Research, USA)

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

UIST '23: The 36th Annual ACM Symposium on User Interface Software and Technology

October 29 - November 1, 2023

San Francisco, CA, USA

Acceptance Rates

Overall Acceptance Rate 561 of 2,567 submissions, 22%
