
Towards Music-Aware Virtual Assistants

Published: 11 October 2024

Abstract

We propose a system for modifying spoken notifications so that they are sensitive to the music a user is listening to. Spoken notifications provide convenient access to rich information without the need for a screen. Virtual assistants are widely used in hands-free settings such as driving or exercising, activities during which users also regularly listen to music. In such settings, virtual assistants temporarily mute the user's music to improve intelligibility, but users may perceive these interruptions as intrusive and detrimental to their listening experience. To address this challenge, we propose the concept of music-aware virtual assistants, in which spoken notifications are modified to resemble a voice singing in harmony with the user's music. We contribute a system that processes the user's music and the notification text to produce a blended mix, replacing the original song lyrics with the notification content. In a user study comparing musical assistants to standard virtual assistants, participants reported that musical assistants fit better with their music, were less intrusive, and provided a more delightful listening experience overall.
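
To make the pipeline concrete, below is a minimal sketch of one way such a system could be assembled from off-the-shelf Python audio tools; this is an illustrative approximation, not the authors' implementation. It separates the song into vocal and accompaniment stems (spleeter), synthesizes the notification text as speech (gTTS), estimates the vocal melody (librosa's pYIN), pitch-shifts the speech toward that melody (psola), and overlays the result on the accompaniment. The file names, the frame-rate assumption in step 4, and the mixing gains are placeholders.

```python
# Sketch of a possible music-aware notification pipeline. Package choices
# (spleeter, gTTS, librosa, psola), file names, frame-rate assumptions, and
# mixing gains are illustrative, not the paper's implementation details.
import numpy as np
import librosa
import soundfile as sf
import psola
from gtts import gTTS
from spleeter.separator import Separator

SR = 22050  # working sample rate

# 1. Split the user's song into vocal and accompaniment stems.
Separator("spleeter:2stems").separate_to_file("song.mp3", "stems/")

# 2. Synthesize the notification text as plain speech.
gTTS("Your ride is arriving in five minutes").save("notification.mp3")
speech, _ = librosa.load("notification.mp3", sr=SR)

# 3. Estimate the vocal melody with pYIN; fill unvoiced gaps with the
#    median pitch so the target contour is defined everywhere.
vocals, _ = librosa.load("stems/song/vocals.wav", sr=SR)
f0, voiced, _ = librosa.pyin(vocals, fmin=80, fmax=800, sr=SR)
melody = np.where(voiced, f0, np.nanmedian(f0))

# 4. Pitch-shift the speech toward the melody so it lands "in key".
#    Crude alignment: stretch the melody contour over the speech duration;
#    the frame rate expected by psola is assumed here (~512-sample hops).
n_frames = 1 + len(speech) // 512
target = np.interp(np.linspace(0, len(melody) - 1, n_frames),
                   np.arange(len(melody)), melody)
sung = psola.vocode(speech, SR, target_pitch=target, fmin=80, fmax=800)

# 5. Overlay the sung notification on the accompaniment and normalize.
accomp, _ = librosa.load("stems/song/accompaniment.wav", sr=SR)
n = min(len(accomp), len(sung))
mix = 0.7 * accomp[:n] + sung[:n]
sf.write("blended_notification.wav", mix / np.max(np.abs(mix)), SR)
```

Note that the system described in the abstract goes further than this sketch: it replaces the original song lyrics with the notification content, which requires aligning notification syllables to the song's melody rather than simply overlaying pitch-shifted speech.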



Published In

UIST '24: Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology
October 2024
2334 pages
ISBN: 9798400706288
DOI: 10.1145/3654777
This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 October 2024
DOI: https://doi.org/10.1145/3654777.3676416


Author Tags

  1. Audio
  2. Interruptions
  3. Machine Learning
  4. Music
  5. Notification
  6. Speech
  7. Virtual Assistants

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

UIST '24

Acceptance Rates

Overall Acceptance Rate 561 of 2,567 submissions, 22%
