Abstract
We present Latent Timbre Synthesis, a new audio synthesis method using deep learning. The synthesis method allows composers and sound designers to interpolate and extrapolate between the timbres of multiple sounds using the latent space of audio frames. We provide the details of two Variational Autoencoder architectures for Latent Timbre Synthesis and compare their advantages and drawbacks. The implementation includes a fully working application with a graphical user interface, called interpolate_two, which enables practitioners to generate timbres between two audio excerpts of their choice using interpolation and extrapolation in the latent space of audio frames. Our implementation is open source, and we aim to improve the accessibility of this technology by providing a guide for users with any technical background. Our study includes a qualitative analysis in which nine composers evaluated Latent Timbre Synthesis and the interpolate_two application within their practices.
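The core idea can be sketched as blending the per-frame latent vectors of two encoded sounds before decoding. The snippet below is a rough illustration only; the function and variable names, the array shapes, and the linear blending rule are assumptions made for exposition, not the exact architecture described in the paper.

```python
import numpy as np

def mix_latents(z_a: np.ndarray, z_b: np.ndarray, alpha: float) -> np.ndarray:
    """Blend two sequences of latent frame vectors of shape (n_frames, latent_dim).

    alpha in [0, 1] interpolates between sound A and sound B;
    alpha < 0 or alpha > 1 extrapolates beyond either sound.
    """
    return (1.0 - alpha) * z_a + alpha * z_b

# Hypothetical usage with a trained VAE (encoder/decoder are placeholders):
#   z_mix = mix_latents(encoder(frames_a), encoder(frames_b), alpha=0.3)
#   mag_mix = decoder(z_mix)   # magnitude frames, inverted to audio afterwards
```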
Notes
The source code is available at https://www.gitlab.com/ktatar/latent-timbre-synthesis.
We provide sound examples at https://kivanctatar.com/Latent-Timbre-Synthesis.
Appendix A summarizes the calculation and parameters of CQT.
We outline the inverse CQT algorithm in Appendix B.
We summarize the details of CQT calculation in Appendix A.
Example audio reconstructions using trained models, training statistics with loss values, and hyper-parameter settings are available on the project page: https://kivanctatar.com/latent-timbre-synthesis.
Exploration and exploitation are two search strategies in optimization applications [35, Sect. 5.3].
The samples are available to download at the following two links: https://freesound.org/people/Erokia/packs/26656/ and https://freesound.org/people/Erokia/packs/26994/.
Pre-trained models and example sounds are available at https://kivanctatar.com/latent-timbre-synthesis.
The complete set of answers given by the composers is available at https://medienarchiv.zhdk.ch/entries/40dda1c8-6287-4356-adf4-ecdccec46119.
References
Akten M (2018) Grannma MagNet. https://www.memo.tv/works/grannma-magnet/
Briot JP, Pachet F (2020) Deep learning for music generation: challenges and directions. Neural Computing and Applications 32(4):981–993. https://doi.org/10.1007/s00521-018-3813-6
Dieleman S. Generating music in the raw audio domain. https://www.youtube.com/watch?v=y8mOZSJA7Bc
Dieleman S, Oord Avd, Simonyan K (2018) The challenge of realistic music generation: modelling raw audio at scale. In: Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), p. 11. Montreal QC, Canada
Engel J, Hantrakul LH, Gu C, Roberts A (2020) DDSP: Differentiable Digital Signal Processing. In: International Conference on Learning Representations. https://openreview.net/forum?id=B1x1ma4tDr
Esling P, Chemla-Romeu-Santos A, Bitton A (2018) Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics. arXiv:1805.08501. http://arxiv.org/abs/1805.08501
Gabor D (1947) Acoustical Quanta and the Theory of Hearing. Nature 159(4044):591–594. https://doi.org/10.1038/159591a0
Grey JM (1977) Multidimensional perceptual scaling of musical timbres. The Journal of the Acoustical Society of America 61(5):1270–1277. https://doi.org/10.1121/1.381428
Griffin DW, Lim JS (1984) Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 32(2):236–243. https://doi.org/10.1109/TASSP.1984.1164317
Hantrakul L, Engel J, Roberts A, Gu C (2019) Fast and Flexible Neural Audio Synthesis. In: Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR 2019), p. 7
He K, Zhang X, Ren S, Sun J (2016) Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. IEEE, Las Vegas, NV, USA. https://doi.org/10.1109/CVPR.2016.90
Iverson P, Krumhansl CL (1993) Isolating the dynamic attributes of musical timbre. The Journal of the Acoustical Society of America 94(5):2595–2603
Kingma DP, Welling M (2014) Auto-Encoding Variational Bayes. arXiv:1312.6114. http://arxiv.org/abs/1312.6114
Kingma DP, Welling M (2019) An Introduction to Variational Autoencoders. Foundations and Trends in Machine Learning 12(4):307–392. http://arxiv.org/abs/1906.02691
Krumhansl CL (1989) Why is musical timbre so hard to understand. Structure and perception of electroacoustic sound and music 9:43–53
Kumar K, Kumar R, de Boissiere T, Gestin L, Teoh WZ, Sotelo J, de Brebisson A, Bengio Y, Courville A (2019) MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), p. 12. Vancouver, BC, Canada
Lakatos S (2000) A common perceptual space for harmonic and percussive timbres. Perception & Psychophysics 62(7):1426–1439
LeCun Y, Cortes C, Burges C. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
Russolo L (1967) The Art of Noise. A Great Bear Pamphlet
Maaten Lvd (2014) Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research 15(1):3221–3245
McAdams S, Winsberg S, Donnadieu S, De Soete G, Krimphoff J (1995) Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes. Psychological Research 58(3):177–192
McFee B, Raffel C, Liang D, Ellis DP, McVicar M, Battenberg E, Nieto O (2015) librosa: Audio and Music Signal Analysis in Python. In: Proceedings of The 14th Python in Science Conference (SCIPY 2015)
Müller M (2015) Fundamentals of Music Processing. Springer International Publishing, Cham. https://doi.org/10.1007/978-3-319-21945-5
Nieto O, Bello JP (2016) Systematic Exploration Of Computational Music Structure Research. In: Proceedings of the 17th International Society for Music Information Retrieval Conference (ISMIR 2016), p. 7. New York, NY, USA
Oord Avd, Dieleman S, Zen H, Simonyan K, Vinyals O, Graves A, Kalchbrenner N, Senior A, Kavukcuoglu K (2016) Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499
Oord Avd, Li Y, Babuschkin I, Simonyan K, Vinyals O, Kavukcuoglu K, Driessche Gvd, Lockhart E, Cobo LC, Stimberg F, Casagrande N, Grewe D, Noury S, Dieleman S, Elsen E, Kalchbrenner N, Zen H, Graves A, King H, Walters T, Belov D, Hassabis D (2017) Parallel WaveNet: Fast High-Fidelity Speech Synthesis. arXiv:1711.10433. http://arxiv.org/abs/1711.10433
Perraudin N, Balazs P, Sondergaard PL (2013) A fast Griffin-Lim algorithm. In: 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pp. 1–4. IEEE, New Paltz, NY. https://doi.org/10.1109/WASPAA.2013.6701851
Roads C (2004) Microsound. The MIT Press, Cambridge, Mass
Roads C (2015) Composing electronic music: a new aesthetic. Oxford University Press, Oxford
Schaeffer P (1964) Traité des objets musicaux, nouv. edn. Seuil
Schörkhuber C, Klapuri A (2010) Constant-Q Transform Toolbox For Music Processing. In: Proceedings of the 7th Sound and Music Computing Conference (SMC 2010), p. 8. Barcelona, Spain
Smalley D (1997) Spectromorphology: explaining sound-shapes. Organised Sound 2(2):107–126. https://doi.org/10.1017/S1355771897009059
Stockhausen K (1972) Four Criteria of Electronic Music with Examples from Kontakte. https://www.youtube.com/watch?v=7xyGtI7KKIY&list=PLRBdTyZ76lvAFOtZvocPjpRVTL6htJzoP
Sønderby CK, Raiko T, Maaløe L, Sønderby SK, Winther O (2016) How to Train Deep Variational Autoencoders and Probabilistic Ladder Networks. In: Proceedings of the 23rd international conference on Machine learning (ICML 2016). ACM Press, Pittsburgh, Pennsylvania
Tatar K, Macret M, Pasquier P (2016) Automatic Synthesizer Preset Generation with PresetGen. Journal of New Music Research 45(2):124–144. https://doi.org/10.1080/09298215.2016.1175481
Tatar K, Pasquier P (2017) MASOM: A Musical Agent Architecture based on Self Organizing Maps, Affective Computing, and Variable Markov Models. In: Proceedings of the 5th International Workshop on Musical Metacreation (MUME 2017). Atlanta, Georgia, USA
Tatar K, Pasquier P (2019) Musical agents: A typology and state of the art towards Musical Metacreation. Journal of New Music Research 48(1):56–105. https://doi.org/10.1080/09298215.2018.1511736
Tatar K, Pasquier P, Siu R (2019) Audio-based Musical Artificial Intelligence and Audio-Reactive Visual Agents in Revive. In: Proceedings of the joint International Computer Music Conference and New York City Electroacoustic Music Festival 2019 (ICMC-NYCEMF 2019), p. 8. International Computer Music Association, New York City, NY, USA
Technavio: Global Music Synthesizers Market 2019-2023. https://www.technavio.com/report/global-music-synthesizers-market-industry-analysis
Vaggione H (2001) Some ontological remarks about music composition processes. Computer Music Journal 25(1):54–61
Varese E, Wen-chung C (1966) The Liberation of Sound. Perspectives of New Music 5(1):11–19. https://www.jstor.org/stable/832385
Velasco GA, Holighaus N, Dörfler M, Grill T (2011) Constructing An Invertible Constant-Q Transform With Nonstationary Gabor Frames. In: Proceedings of the 14th International Conference on Digital Audio Effects (DAFx-11), p. 7. Paris, France
Yu F, Koltun V (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122
Acknowledgements
This research has been supported by the Swiss National Science Foundation, Natural Sciences and Engineering Research Council of Canada, Social Sciences and Humanities Research Council of Canada, and Compute Canada.
Ethics declarations
Conflict of interest
To the best of our knowledge, there is no potential conflict of interest related to this work.
Appendices
Appendix A Constant-Q transform
We can calculate the CQT of an audio recording [31], a discrete time-domain signal x(n), using the following formula:

$$X^{CQ}(k,n) = \sum_{j = n - \lfloor N_k / 2 \rfloor}^{n + \lfloor N_k / 2 \rfloor} x(j)\, a_k^{*}\!\left(j - n + \frac{N_k}{2}\right)$$

where k represents the CQT frequency bins with a range of [1, K], and \(X^{CQ} (k,n)\) is the CQT transform. \(N_k\) is the window length of a CQT bin, which is inversely proportional to the center frequency \(f_k\) defined below. Note that \(\lfloor \cdot \rfloor\) denotes rounding towards negative infinity, and \(a_k ^ *\) is the complex conjugate of the basis function \(a_k (n)\),

$$a_k(n) = \frac{1}{N_k}\, w\!\left(\frac{n}{N_k}\right) \exp\!\left(-i\, 2 \pi n \frac{f_k}{f_s}\right)$$

where w(t) is the window function, \(f_k\) is the center frequency of bin k, and \(f_s\) is the sampling rate. CQT requires a fundamental frequency parameter \(f_1\), which is the center frequency of the lowest bin. The center frequencies of the remaining bins are calculated using

$$f_k = f_1\, 2^{\frac{k-1}{B}}$$

where B is the number of bins per octave.
CQT is a wavelet-based transform because the window size is inversely proportional to \(f_k\) while ensuring the same Q-factor for all bins k. We can calculate the Q-factor using

$$Q = \frac{q}{2^{1/B} - 1}$$

where q is a scaling factor in the range [0, 1], equal to 1 in the default setting. We direct our readers to the original publication for the specific details of the CQT [31], which also proposes a fast algorithm to compute the CQT and the inverse CQT (i-CQT), given in Fig. 7.
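For reference, the CQT and an approximate inverse CQT are available in the librosa library (McFee et al.). The snippet below is a minimal, illustrative sketch rather than our exact pipeline; the file name, hop length, fundamental frequency, and number of bins are placeholder assumptions (the hyper-parameters used in our experiments are listed on the project page).

```python
import librosa
import soundfile as sf

# Illustrative placeholder parameters (not the settings used in the paper).
SR = 44100
HOP = 256                            # librosa.cqt requires HOP divisible by 2**(n_octaves - 1)
FMIN = librosa.note_to_hz("C1")      # fundamental frequency f_1 of the lowest bin
BINS_PER_OCTAVE = 48                 # B
N_BINS = BINS_PER_OCTAVE * 8         # K, spanning 8 octaves

y, sr = librosa.load("input.wav", sr=SR)

# Forward CQT: complex matrix of shape (N_BINS, n_frames).
C = librosa.cqt(y, sr=sr, hop_length=HOP, fmin=FMIN,
                n_bins=N_BINS, bins_per_octave=BINS_PER_OCTAVE)

# Inverse CQT from the complex spectrogram (magnitude and phase).
y_hat = librosa.icqt(C, sr=sr, hop_length=HOP, fmin=FMIN,
                     bins_per_octave=BINS_PER_OCTAVE)
sf.write("reconstruction.wav", y_hat, sr)
```

When only CQT magnitudes are available, as in our synthesis setting, the phase has to be estimated before the inverse transform; this is the subject of Appendix B.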
Appendix B Phase estimation algorithms
Given an audio signal x(n) and its frequency transform X(i), the Griffin-Lim algorithm (GLA) iteratively estimates a signal whose transform magnitude matches \(|X|\),

$$\hat{x}_{m+1} = \mathrm{IT}\!\left( |X|\, \frac{T({\hat{x}}_m)}{\left|T({\hat{x}}_m)\right|} \right), \quad m = 0, 1, \ldots, N-1$$

where N is the total number of GLA iterations, and T and IT are the frequency transform and the inverse frequency transform functions, respectively; such as the Short-Time Fourier Transform or, in our case, the Constant-Q Transform. Note that the space of audio spectrograms is a subset of the complex number space. Each iteration of Griffin-Lim moves the complex spectrogram of the estimated signal \({\hat{x}}(n)\) towards the complex number space of audio signals, as proven in [9].
The Fast Griffin-Lim algorithm (F-GLA) is a revision of the original Griffin-Lim algorithm. A previous study [27] showed that the F-GLA revision significantly improves the signal-to-noise ratio (SNR) compared to the GLA, where the setting \(\alpha = 1\) (a constant in Algorithm 2) resulted in the highest SNR value.
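For illustration, the sketch below implements the basic GLA loop with numpy and librosa for STFT magnitudes; it is a simplified, assumption-based example rather than our exact implementation, which applies the same idea to CQT magnitude frames. librosa also ships ready-made griffinlim and griffinlim_cqt functions, which expose a momentum argument corresponding to the fast variant.

```python
import numpy as np
import librosa

def griffin_lim_stft(mag, n_iter=100, n_fft=2048, hop_length=512):
    """Basic Griffin-Lim loop: estimate a waveform whose STFT magnitude matches `mag`.

    `mag` is a magnitude spectrogram of shape (1 + n_fft // 2, n_frames).
    """
    rng = np.random.default_rng(0)
    # Initialize with random phase.
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        # IT: inverse transform of the current complex estimate.
        x_hat = librosa.istft(mag * angles, hop_length=hop_length)
        # T: re-analyse and keep only the phase; the magnitude is reset to the target.
        spec = librosa.stft(x_hat, n_fft=n_fft, hop_length=hop_length)
        angles = np.exp(1j * np.angle(spec))
    return librosa.istft(mag * angles, hop_length=hop_length)

# Ready-made alternatives in librosa, e.g.:
# y_hat = librosa.griffinlim(mag, n_iter=100, hop_length=512)       # STFT magnitudes
# y_hat = librosa.griffinlim_cqt(C_mag, sr=44100, hop_length=256)   # CQT magnitudes
```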
Appendix C Interview questions
1. Describe your compositional process when working with the Timbre Space tools.
2. What was the theme and concept of your composition?
3. How did you incorporate the Timbre Space tools into your work?
4. How did working with the Timbre Space tools change your composition workflow? What was unique?
5. What additional tools/technologies apart from the Timbre Space tools were involved in your work?
6. How would you describe the sound qualities of Timbre Space?
7. What were the unique aesthetic possibilities of the Timbre Space tools?
8. What kind of dataset(s) did you train Timbre Space with?
9. If you trained Timbre Space with several datasets, what kind of relationship did you notice between the datasets and the musical results obtained from Timbre Space?
10. Did you feel control and authorship over the musical material generated?
11. Did you achieve the aesthetic result you intended?
12. What were the positive aspects when working with the tool?
13. What were the frustrations when working with the tool? How can it be improved?
14. Would you use it again (if the above were addressed)?
15. For whom else or what musical genres/sectors would this tool be particularly useful (if the criticism was addressed)?