CycleGAN-Based Singing/Humming to Instrument Conversion Technique
Figure 1. Dual conversion model for singing to instrument.
Figure 2. Convergence of the generator (left) and discriminator (right) losses of (a) CycleGAN-VC, (b) CycleGAN-VC2, (c) CycleGAN-IC, and (d) CycleGAN-IC2.
Figure 3. Box plots of (a) RMSE and (b) MCD for testing set MIR-QBSH-3 in humming-to-viola conversion.
Figure 4. Box plots of (a) RMSE and (b) MCD for testing set MIR-QBSH-4 in singing-to-viola conversion.
Abstract
1. Introduction
2. CycleGAN-VC and CycleGAN-VC2
3. Proposed Methods
3.1. CycleGAN-IC and CycleGAN-IC2
3.2. Dual Conversion Model
3.3. Theoretical Analysis of Convergence and Complexity
4. Experiment
4.1. Corpus
4.2. Objective and Subjective Measures
4.3. Experimental Results of Humming to Viola
4.4. Experimental Results of Singing to Viola
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Chen, Y.; Chu, M.; Chang, E.; Liu, J.; Liu, R. Voice Conversion with Smoothed GMM and MAP Adaptation. In Proceedings of the European Conference on Speech Communication and Technology, Geneva, Switzerland, 1–4 September 2003; pp. 2413–2416.
- Toda, T.; Black, A.W.; Tokuda, K. Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 2222–2235.
- Desai, S.; Black, A.W.; Yegnanarayana, B.; Prahallad, K. Spectral Mapping Using Artificial Neural Networks for Voice Conversion. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 954–964.
- Chen, L.-H.; Ling, Z.-H.; Liu, L.-J.; Dai, L.-R. Voice Conversion Using Deep Neural Networks with Layer-Wise Generative Training. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1859–1872.
- Nakashika, T.; Takiguchi, T.; Ariki, Y. Voice Conversion Using RNN Pre-Trained by Recurrent Temporal Restricted Boltzmann Machines. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 580–587.
- Sun, L.; Li, K.; Wang, H.; Kang, S.; Meng, H. Phonetic Posteriorgrams for Many-to-One Voice Conversion without Parallel Data Training. In Proceedings of the 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, USA, 11–15 July 2016; pp. 1–6.
- Liu, L.-J.; Ling, Z.-H.; Jiang, Y.; Zhou, M.; Dai, L.-R. WaveNet Vocoder with Limited Training Data for Voice Conversion. In Proceedings of Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 1983–1987.
- Saito, Y.; Ijima, Y.; Nishida, K.; Takamichi, S. Non-Parallel Voice Conversion Using Variational Autoencoders Conditioned by Phonetic Posteriorgrams and D-Vectors. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5274–5278.
- Kaneko, T.; Kameoka, H. CycleGAN-VC: Non-Parallel Voice Conversion Using Cycle-Consistent Adversarial Networks. In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; pp. 2100–2104.
- Kaneko, T.; Kameoka, H.; Tanaka, K.; Hojo, N. CycleGAN-VC2: Improved CycleGAN-Based Non-Parallel Voice Conversion. arXiv 2019, arXiv:1904.04631.
- Fang, F.; Yamagishi, J.; Echizen, I.; Lorenzo-Trueba, J. High-Quality Nonparallel Voice Conversion Based on Cycle-Consistent Adversarial Network. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5279–5283.
- Wang, C.; Yu, Y.-B. CycleGAN-VC-GP: Improved CycleGAN-Based Non-Parallel Voice Conversion. In Proceedings of the 2020 IEEE 20th International Conference on Communication Technology (ICCT), Nanning, China, 28–31 October 2020; pp. 1281–1284.
- Serrà, J.; Pascual, S.; Segura, C. Blow: A Single-Scale Hyperconditioned Flow for Non-Parallel Raw-Audio Voice Conversion. arXiv 2019, arXiv:1906.00794.
- Deng, C.; Yu, C.; Lu, H.; Weng, C.; Yu, D. PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 7749–7753.
- Nachmani, E.; Wolf, L. Unsupervised Singing Voice Conversion. arXiv 2019, arXiv:1904.06590.
- Lu, J.; Zhou, K.; Sisman, B.; Li, H. VAW-GAN for Singing Voice Conversion with Non-Parallel Training Data. arXiv 2020, arXiv:2008.03992.
- Zhou, K.; Sisman, B.; Liu, R.; Li, H. Seen and Unseen Emotional Style Transfer for Voice Conversion with a New Emotional Speech Dataset. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 920–924.
- Gao, J.; Chakraborty, D.; Tembine, H.; Olaleye, O. Nonparallel Emotional Speech Conversion. In Proceedings of Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2858–2862.
- AlBadawy, E.A.; Lyu, S. Voice Conversion Using Speech-to-Speech Neuro-Style Transfer. In Proceedings of Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 4726–4730.
- Seshadri, S.; Juvela, L.; Räsänen, O.; Alku, P. Vocal Effort Based Speaking Style Conversion Using Vocoder Features and Parallel Learning. IEEE Access 2019, 7, 17230–17246.
- Lian, H.; Hu, Y.; Yu, W.; Zhou, J.; Zheng, W. Whisper to Normal Speech Conversion Using Sequence-to-Sequence Mapping Model with Auditory Attention. IEEE Access 2019, 7, 130495–130504.
- Lian, H.; Hu, Y.; Zhou, J.; Wang, H.; Tao, L. Whisper to Normal Speech Based on Deep Neural Networks with MCC and F0 Features. In Proceedings of the 2018 IEEE 23rd International Conference on Digital Signal Processing (DSP), Shanghai, China, 19–21 November 2018; pp. 1–5.
- O’Connor, B.; Dixon, S.; Fazekas, G. Zero-Shot Singing Technique Conversion. arXiv 2021, arXiv:2111.08839.
- Biadsy, F.; Weiss, R.J.; Moreno, P.J.; Kanevsky, D.; Jia, Y. Parrotron: An End-to-End Speech-to-Speech Conversion Model and Its Applications to Hearing-Impaired Speech and Speech Separation. arXiv 2019, arXiv:1904.04169.
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. Available online: http://www.deeplearningbook.org (accessed on 4 May 2022).
- Haykin, S. Neural Networks and Learning Machines; Pearson Prentice Hall: Hoboken, NJ, USA, 2009.
- Zhu, J.-Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. arXiv 2018, arXiv:1703.10593.
- Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. arXiv 2016, arXiv:1603.08155.
- Dauphin, Y.N.; Fan, A.; Auli, M.; Grangier, D. Language Modeling with Gated Convolutional Networks. arXiv 2017, arXiv:1612.08083.
- Li, C.; Wand, M. Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks. arXiv 2016, arXiv:1604.04382.
- Lai, W.-H.; Wang, S.-L.; Xu, Z.-Y. Humming-to-Instrument Conversion Based on CycleGAN. In Proceedings of the 2021 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Hualien City, Taiwan, 16–19 November 2021; pp. 1–2.
- Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. arXiv 2017, arXiv:1701.07875.
- Kodali, N.; Abernethy, J.; Hays, J.; Kira, Z. On Convergence and Stability of GANs. arXiv 2017, arXiv:1705.07215.
- Kindler, J. A Simple Proof of Sion’s Minimax Theorem. Am. Math. Mon. 2005, 112, 356–358.
- Hazan, E.; Singh, K.; Zhang, C. Efficient Regret Minimization in Non-Convex Games. arXiv 2017, arXiv:1708.00075.
- Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; Kautz, J. Pruning Convolutional Neural Networks for Resource Efficient Inference. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
- Jang, J.-S.R. MIR-QBSH Corpus. Available online: http://mirlab.org/dataset/public/MIR-QBSH-corpus.rar (accessed on 29 April 2015).
- Cakewalk Inc. Cakewalk—SONAR Family—SONAR Platinum, SONAR Studio and SONAR Artist. Available online: https://www.cakewalk.com/products/SONAR (accessed on 12 June 2021).
- ITU-T Recommendation P.800: Methods for Subjective Determination of Transmission Quality. Available online: https://www.itu.int/rec/T-REC-P.800-199608-I (accessed on 9 January 2021).
(a) Generator

| State | Amount | Operation | Architecture |
|---|---|---|---|
| Input | | | |
| | | Conv | |
| | | GLU | - |
| down-sample (2D) | 1 | Conv | |
| | | Instance norm | - |
| | | GLU | - |
| | | Conv | |
| | | Instance norm | - |
| | | GLU | - |
| 2D → 1D | 1 | Reshape | |
| | | Conv | |
| | | Instance norm | - |
| residual blocks (1D) | 6 | Conv | |
| | | Instance norm | - |
| | | GLU | - |
| | | Conv | |
| | | Instance norm | - |
| | | Sum | - |
| 1D → 2D | 1 | Conv | |
| | | Instance norm | - |
| | | Reshape | |
| up-sample (2D) | 1 | Conv | |
| | | Pixel Shuffler | - |
| | | Instance norm | - |
| | | GLU | - |
| | | Conv | |
| | | Pixel Shuffler | - |
| | | Instance norm | - |
| | | GLU | - |
| | | Conv | |
| Output | | | |
(b) Discriminator

| State | Amount | Operation | Architecture |
|---|---|---|---|
| Input | | | |
| | | Conv | |
| | | GLU | - |
| down-sample (2D) | 1 | Conv | |
| | | Instance norm | - |
| | | GLU | - |
| | | Conv | |
| | | Instance norm | - |
| | | GLU | - |
| | | Conv | |
| | | Instance norm | - |
| | | GLU | - |
| | | Conv | |
| | | Instance norm | - |
| | | GLU | - |
| | | Conv | |
| Output | | Real/Fake | - |
| Method | vs. Original Viola | vs. Original Humming |
|---|---|---|
| CycleGAN-VC | 1.3193 | 1.7613 |
| CycleGAN-VC2 | 1.2566 | 1.7171 |
| CycleGAN-IC | 1.3085 | 1.7206 |
| CycleGAN-IC2 | 1.2264 | 1.6883 |
| Method | vs. Original Viola | vs. Original Humming |
|---|---|---|
| CycleGAN-VC | 4.7665 | 7.4907 |
| CycleGAN-VC2 | 4.3093 | 7.6802 |
| CycleGAN-IC | 4.5214 | 7.8334 |
| CycleGAN-IC2 | 4.2884 | 7.7419 |
| Method | Bad | Poor | Fair | Good | Excellent |
|---|---|---|---|---|---|
| CycleGAN-VC | 0% | 58% | 42% | 0% | 0% |
| CycleGAN-VC2 | 0% | 25% | 73% | 2% | 0% |
| CycleGAN-IC | 0% | 46% | 54% | 0% | 0% |
| CycleGAN-IC2 | 0% | 10% | 79% | 11% | 0% |
| Comparison | CycleGAN-VC | CycleGAN-VC2 | CycleGAN-IC | CycleGAN-IC2 | Equal |
|---|---|---|---|---|---|
| CycleGAN-IC vs. CycleGAN-VC | 12% | - | 18% | - | 70% |
| CycleGAN-IC vs. CycleGAN-VC2 | - | 39% | 24% | - | 37% |
| CycleGAN-IC2 vs. CycleGAN-VC | 12% | - | - | 60% | 28% |
| CycleGAN-IC2 vs. CycleGAN-VC2 | - | 24% | - | 50% | 26% |
| CycleGAN-IC vs. CycleGAN-IC2 | - | - | 13% | 60% | 27% |
| Method | RMSE |
|---|---|
| CycleGAN-IC2 | 1.1697 |
| CycleGAN-ICd | 1.0473 |

| Method | MCD |
|---|---|
| CycleGAN-IC2 | 4.0608 |
| CycleGAN-ICd | 3.9803 |
| Method | Bad | Poor | Fair | Good | Excellent |
|---|---|---|---|---|---|
| CycleGAN-IC2 | 0% | 17% | 73% | 11% | 0% |
| CycleGAN-ICd | 0% | 10% | 68% | 22% | 0% |
| Comparison | CycleGAN-IC2 | CycleGAN-ICd | Equal |
|---|---|---|---|
| CycleGAN-IC2 vs. CycleGAN-ICd | 23% | 56% | 21% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Lai, W.-H.; Wang, S.-L.; Xu, Z.-Y. CycleGAN-Based Singing/Humming to Instrument Conversion Technique. Electronics 2022, 11, 1724. https://doi.org/10.3390/electronics11111724