
From Music Scores to Audio Recordings: Deep Pitch-Class Representations for Measuring Tonal Structures

Published: 31 July 2024

Abstract

The availability of digital music data in various modalities provides opportunities both for music enjoyment and music research. Regarding the latter, the computer-assisted analysis of tonal structures is a central topic. For Western classical music, studies typically rely on machine-readable scores, which are tedious to create for large-scale works and comprehensive corpora. As an alternative, music audio recordings, which are readily available, can be analyzed with computational methods. With this article, we want to bridge the gap between score- and audio-based measurements of tonal structures by leveraging the power of deep neural networks. Such networks are commonly trained in an end-to-end fashion, which introduces biases towards the training repertoire or towards specific annotators. To overcome these problems, we propose a multi-step strategy. First, we compute pitch-class representations of the audio recordings using networks trained on score–audio pairs. Second, we measure the presence of specific tonal structures using a pattern-matching technique that solely relies on music theory knowledge and does not require annotated training data. Third, we highlight these measurements with interactive visualizations, thus leaving the interpretation to the musicological experts. Our experiments on Richard Wagner's large-scale cycle Der Ring des Nibelungen indicate that deep pitch-class representations lead to a high similarity between score- and audio-based measurements of tonal structures, thus demonstrating how to leverage multi-modal data for application scenarios in the computational humanities, where an explicit and interpretable methodology is essential.

1 Introduction

Digital music collections comprise a wide variety of modalities including audio and video recordings of performances or scores in different formats [23]. While this multi-modality poses challenges for computational approaches, it also provides opportunities for practical applications (e.g., multi-modal search and retrieval [39]) both for consumption and research purposes. In this context, multi-modal representations can be exploited for gaining a deeper understanding of how music is composed [11], performed [16], and perceived [34]. As a peculiarity of Western classical music, several recorded performances (versions) of a composition are usually available, which can be exploited to investigate performance differences [16] or focus on commonalities that highlight work-related aspects, such as harmony and tonality [11].
Computational Musicology. In the field of computational musicology, algorithms are commonly utilized by human experts in a controlled and interactive way. As illustrated by Figure 1, analytical studies typically involve several steps comprising (a) the digitization of a music collection (raw data), (b) the representation of this data according to musically relevant and interpretable entities, (c) the measurement of salient musical structures and phenomena, and (d) the interpretation of the measurement results involving musicological knowledge. For Western classical music, machine-readable scores are commonly considered the underlying raw data, since they explicitly capture symbolic information corresponding to the composer's intention. Subsequently, suitable representations of, for instance, pitch activity over time can be directly derived from such scores. From a practical perspective, however, the limited availability of symbolic scores together with the expertise and effort needed for creating this kind of data1 impedes the scalability to large-scale works and comprehensive music corpora.
Fig. 1.
Fig. 1. Analysis strategy typical for many studies in computational musicology. (a) Digitized raw data: piano score and audio waveform. (b) Representation of pitch classes computed using DL. (c) Visualization of diatonic scale measurements (for the nomenclature of scales, please see Figure 7). (d) Interpretation by a musicological expert.
As an alternative to scores, one can exploit audio recordings, which exist in large quantities for the Western classical repertoire. Usually, there are even multiple performances of a work available. Relying on audio allows for considering music scenarios where scores are not available (e.g., movie soundtrack) or do not even exist (e.g., some non-Western, electronic, or improvised music). In this context, dedicated audio processing methods are required. For instance, researchers have approached the automatic recognition of tonal structures such as chords or local keys with traditional signal processing (SP) [5, 11] or deep-learning (DL) techniques [22, 27]. Such algorithms are usually optimized on annotated training (and validation) data. In particular, DL approaches often follow an end-to-end strategy for solving steps (c) or even (d) based on a waveform or time–frequency representation as input. This strategy achieved promising results on standard Music Information Retrieval (MIR) datasets and tasks (such as chord labeling in pop and rock music).
Multi-Step Strategy. However, there are major drawbacks when applying these fully learning-based methods for the purpose of musicological research. First, the dependency on (usually limited) training data introduces a repertoire bias: the notion of, e.g., local key for one specific style, composer, or genre might not apply to others. Second, there are musical ambiguities between tonal structures [45], which might even be intended by a composer as an artistic element and pose a problem as soon as algorithms involve hard decisions on one label for each time segment.2 Third, the analytical interpretation is influenced by the subjectivity of annotations, which has been pointed out by recent research [21, 32, 45]. These findings fundamentally question the applicability of end-to-end approaches for tonal analysis in computational musicology. For these reasons, we follow a different strategy in this article. Instead of explicitly recognizing chord or key labels with a trained model, we aim for an objective (i.e., untrained and unbiased) measurement of specific tonal structures. To approach this, we compute probabilities over several labels (here: transpositions of the diatonic scale) at a time. We then visualize the measurement results without forcing hard decisions, leaving the interpretation of possible musical ambiguities to the human expert.
Main Contributions. The contributions of this article can be summarized as follows. (1) From a technical point of view, this article is an extension of a recent conference paper [48]. We take up the proposed strategy of training convolutional neural networks (CNNs) on score–audio pairs of classical music, thereby targeting a transcription-like pitch-class representation that relates to the task of multi-pitch estimation (MPE). Building upon [48], we systematically evaluate different architectures and training–test splits. (2) Relying on this front end, we realize our strategy for measuring tonal structures. In particular, we will focus on diatonic scales as proposed in [41, 42] and consider visualization techniques for an interactive analysis. (3) As the underlying raw data, we systematically compare the use of score representations (SC) with audio recordings, where we derive pitch-class representations using DL as well as traditional SP. To this end, we introduce suitable similarity measures for quantitative evaluation and visualizations for qualitative evaluation. We show that the DL approach helps to bridge the gap between audio- and score-based measurements of tonal structures—an encouraging finding since an audio-based analysis can be scaled up easily thanks to the abundance of audio recordings (as opposed to high-quality machine-readable scores). (4) We demonstrate the practicability of our multi-step strategy by considering Richard Wagner's tetralogy Der Ring des Nibelungen, exploiting a multi-modal and multi-version dataset [40] that comprises 16 different performances of approx. 14–15 hours each (232 hours in total) and serves as a prime example for a large-scale corpus analysis of tonal structures.
The remainder of this article is organized into two main parts. Section 2 deals with the extraction of pitch-class representations (Figure 1(b)) from audio recordings building on the initial study [48]. Section 3 deals with the measurement of tonal structures (Figure 1(c)). Each of these sections covers related work, datasets, computational approaches, and experiments, respectively. Section 4 concludes the article.

2 Learning Pitch-Class Representations of Audio

This first part deals with the extraction of pitch-class representations from audio recordings using DL strategies [48]. We start with discussing related work (Section 2.1), then summarize the datasets used (Section 2.2), introduce our computational approaches (Section 2.3), and present experimental results (Section 2.4).

2.1 Motivation and Related Work

In MIR and music processing, pitch-class or chroma representations have been successfully used as a basis for different applications such as chord recognition [5, 26], key estimation [45], structure analysis [34], or audio retrieval [29]. Such representations capture the signal's energy distribution over the 12 chromatic pitch classes, aggregating pitch information across octaves (e.g., C4, C5, …) and enharmonic spellings (e.g., C\(\sharp\), D\(\flat\)), and allow for a direct musical interpretation of tonal information. Traditional chroma features were designed in a handcrafted fashion based on SP [13, 30] and exhibit several drawbacks caused by audio-related artifacts, including timbral characteristics, overtones, vibrato, or transients. Moreover, the relative loudness of a note has an immediate influence on the features.
Enhancements Based on SP. Over the years, these problems were approached with a number of techniques such as spectral whitening [20], overtone removal [26], or timbre homogenization [29]. Most of these techniques led to improved results for tasks such as chord recognition [5, 26] or audio retrieval [29]. However, these results have to be taken with care since improvements for one task may lead to deteriorations for another—a good chroma representation for music synchronization [29] might be less suitable for chord recognition [5], or removal of harmonics might introduce sub-harmonic artifacts [26]. Finally, chroma features are often noisy compared to the pitch classes notated in the score, limiting their interpretability by musicologists as well as their potential for visualization or cross-modal retrieval.
DL Approaches. To address these problems, more recent strategies make use of deep neural networks for learning pitch-class representations from data [22, 27, 43, 49, 50]. For the successful training of high-capacity networks, large amounts of annotated recordings are necessary. Since the manual creation of pitch-class annotations is tedious and requires expert knowledge, there are several alternative strategies, all of which have their benefits and limitations. The first studies that aimed for learning a “deep chroma” made use of chord labels to derive the pitch-class training targets [22, 27]. As shown in [22] (where only major and minor triads are used), this leads to a chroma extractor with a strong bias toward the chords’ pitch classes and does not detect the pitch classes as notated in the score, thus limiting interpretability and capability for generalization to other chord vocabularies, genres, and analysis tasks. An alternative strategy [49] relies on symbolic music representations to render synthetic audio recordings together with the corresponding pitch-class annotations—a pragmatic approach that, however, is not able to generalize well to real audio recordings. Other approaches such as those in [10, 17] make use of instruments equipped with a Musical Instrument Digital Interface (MIDI), e.g., Disklaviers, which can capture the velocity and timing of pressed piano keys—a strategy that is limited to the piano (or other MIDI instruments).
Training with Aligned Scores. A further strategy exploits score–audio pairs of classical music to generate pitch-class annotations (Figure 2) using score–audio synchronization methods [31] or specialized training strategies for dealing with weakly aligned data such as the Connectionist Temporal Classification (CTC) loss [43, 50]. In this article, extending [48], we derive our training targets from score–audio pairs of classical music that are pre-aligned with a variant of dynamic time warping [31]. We use score-based target labels for supervised training with the goal of detecting framewise pitch-class activations, i.e., a multi-class and multi-label task as illustrated in Figure 2(c). Building upon [48] and a related study on MPE [44], we consider several CNN architectures inspired by [4, 50] and evaluate the networks’ predictions against score-based pitch classes, making use of different training–test splits based on cross-version and cross-dataset strategies.
Fig. 2.
Fig. 2. Pitch-class training strategy with an example from Schubert's Winterreise [46]. (a) Score. (b) Audio waveform. (c) Pitch-class estimates by the CNN. (d) Pitch-class labels derived from aligned score, serving as target for training the CNN.

2.2 Datasets

As mentioned before, the limited availability of annotated data is a major issue for MPE and pitch-class estimation—a “key challenge” for the broader field of automatic music transcription [3]. In this section, we describe the datasets with pitch-class annotations (considered in [44, 48]) used for training our models. Then, we introduce our dataset on Wagner's Ring, which serves as our application and evaluation scenario for pitch-class estimation and will be taken up for the subsequent measurements of tonal structures (Section 3). Table 1 provides an overview of the datasets used in this work.
Table 1.
ID  | Dataset Name               | Style/Instrumentation                 | Pitch Annotation Strategy | Mix Tracks | Works | Versions  | hh:mm
MuN | MusicNet [38]              | Chamber music (piano, strings, winds) | Aligned scores            | 330        | 306   | 1 up to 3 | 34:08
SWD | Schubert Winterreise [46]  | Chamber music (piano, solo voice)     | Aligned scores            | 216        | 24    | 9         | 10:50
Tri | TRIOS [12]                 | Chamber music (piano, strings, winds) | Multi-track               | 5          | 5     | 1         | 0:03
B10 | Bach10 [8]                 | Chamber music (violin, winds)         | Multi-track               | 10         | 10    | 1         | 0:06
PhA | PHENICX-Anechoic [28]      | Symphonic (orchestra)                 | Multi-track               | 4          | 4     | 1         | 0:10
CSD | Choral Singing Dataset [6] | A cappella (choir)                    | MIDI-guided performance   | 3          | 3     | 1         | 0:07
WRD | Wagner Ring Dataset [40]   | Opera (orchestra, voice)              | Aligned piano scores      | 176        | 11    | 16        | 231:56
Table 1. Music Datasets with Multi-Pitch or Pitch-Class Annotations Used in This Article
For work cycles, we count each part/movement as a work. We further report the number of versions (performances) per work for multi-version datasets.
Multi-Pitch and Pitch-Class Datasets. Since datasets with pitch annotations are tedious to generate [35], many datasets are created using pianos with key (MIDI) sensors [10, 17]. In this article, we go beyond the piano scenario, where only a few small datasets with pitch annotations are available, such as Bach10 (B10) [8], TRIOS (Tri) [12], PHENICX-Anechoic (PhA) [28], or the Choral Singing Dataset (CSD) [6] (all \(\leq\)10 pieces, see Table 1 for an overview). For creating most of these datasets [8, 12, 28], multi-track recordings were involved, which simplifies manual annotation or automatic alignment of score-based pitch information to the individual parts. As noted in Section 2.1, an alternative strategy relies on pre-synchronized score–audio pairs of classical music to generate pitch-class annotations. Datasets created with this strategy are MusicNet (MuN) [38], which comprises pitch-class annotations for 330 recordings of piano and chamber music, and the Schubert Winterreise Dataset (SWD) [46], which contains symbolic scores as well as nine performances (versions) of the 24 songs in the cycle (for piano and singing voice).3 All datasets mentioned above are publicly available.4
Wagner's Ring. As our central application scenario, we consider the Wagner Ring Dataset (WRD) [40] based on the tetralogy Der Ring des Nibelungen by Richard Wagner—a unique and extensive work cycle spanning 4 operas (or music dramas) and 21,939 measures in total. We structure the Ring according to 11 acts (see Figure 3), which correspond to regions of contiguous measure counts in the full score (using the complete edition [7] as a reference).
Fig. 3.
Fig. 3. Overview of the 11 musical parts (operas/acts) in the Wagner Ring Dataset (WRD). For an overview of available versions, see [40].
The WRD comprises 16 recorded performances of the cycle, each lasting roughly 14–15 hours, resulting in a total amount of almost 232 hours of audio material (see [40] for detailed statistics).5 In an extensive manual annotation process [40], we specified measure positions for 3 full performances (conducted by H. v. Karajan, D. Barenboim, and B. Haitink), which we transferred to the remaining 13 versions using a specialized audio–audio synchronization pipeline [51]. Having such measure annotations allows for conducting analyses on a musical time axis as well as for linking the different versions with the musical score and across each other. In our experiments, we exploit this cross-version scenario for studying the robustness and generalization of our approaches for pitch-class estimation (Figure 1(b)).
Beyond the audio recordings, our dataset contains a machine-readable version of a piano score by Richard Kleinmichel (see Figure 1(a)),6 which is a reduction of the full orchestral texture (incorporating the singers’ parts) arranged for piano. Being restricted to two staff systems, the piano score turned out to be a useful compromise since the pitch-class content is almost identical to the full score and encoding a full score for the entire Ring appeared to be too time-consuming in practice. Relying on score–audio synchronization methods [31], we transfer the pitch-class information from the piano score (musical time axis) to the audio versions (physical time axis) as illustrated by Figure 2 and use the resulting pitch-class annotations as training targets and for evaluation. As the main focus of Section 3, we illustrate with the WRD dataset how audio-based measurements of tonal structures compare against score-based measurements, and how such techniques may be employed for the purpose of computational musicology (Figure 1(c) and (d)).

2.3 Computational Approach

We now describe our CNN-based approach for extracting pitch-class representations and discuss our design choices, motivated by [48] and further related work. Previous DL approaches for pitch-class representations use a variety of architectures including fully connected neural networks [22] and CNNs [22, 49, 50]. Training deep networks can be accelerated by using so-called residual connections [49]. For MPE, Kelz et al. [19] showed that a CNN performs on par with dense or recurrent architectures while having fewer parameters. Motivated by this result and other approaches [4, 50], we rely on CNNs as the fundamental architectural principle. We further realize an extension to the basic CNN architecture as summarized in Figure 4. All networks in this article closely follow the models tested in [48] with minor modifications inspired by previous work on pitch-class estimation [43] and MPE [44].
Fig. 4.
Fig. 4. Overview of the model architectures used in this work. (a) Shallow CNN (Basic). (b) Deep residual CNNs (DeepResNet and DeepResWide). Figure inspired by [44].
Input Representation. As network input, we make use of a harmonic constant-Q transform (HCQT) with CQTs in harmonic frequency ratios stacked on top of each other, thus allowing convolutional kernels to span across harmonics (overtones), being able to learn templates for a pitch's harmonic series [4]. To model the most prominent harmonics, we use such an HCQT with five harmonics and one sub-harmonic for avoiding octave confusions (six channels). Based on audio sampled at 22,050 Hz, we use a CQT hopsize of 512 samples resulting in a frame rate of 43.07 Hz.7 Our HCQT spans 72 semitones (6 octaves) starting at C1 with a resolution of 3 bins per semitone. We choose a centric strategy with bins corresponding to integer MIDI pitches placed between the two surrounding bins, where center frequencies are adjusted based on the estimated tuning.8 To enhance smaller yet potentially important values in the HCQT, we employ logarithmic compression. We feed the network with 75 context frames (37 to each side of the target frame), corresponding to 1.74 seconds. The resulting input tensors are of shape \(216\times 75\times 6\). The output is a 12-dimensional vector, predicting the pitch-classes of the center frame.
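To make this front end concrete, the following Python sketch computes such an HCQT with librosa under the parameters stated above (one sub-harmonic plus five harmonics, 216 bins covering six octaves from C1 at 3 bins per semitone, hop size 512 at 22,050 Hz). The compression factor and the exact tuning handling are illustrative assumptions and may differ from our actual implementation.

```python
import librosa
import numpy as np

def compute_hcqt(audio_path, harmonics=(0.5, 1, 2, 3, 4, 5)):
    """Sketch of an HCQT front end: one CQT per harmonic, stacked as channels.

    Parameters follow the description in the text (22,050 Hz, hop size 512,
    72 semitones from C1, 3 bins per semitone); the compression factor is
    an assumption for illustration.
    """
    y, sr = librosa.load(audio_path, sr=22050)
    # Estimate the tuning deviation (in fractions of a CQT bin).
    tuning = librosa.estimate_tuning(y=y, sr=sr, bins_per_octave=36)
    fmin_c1 = librosa.note_to_hz("C1")
    channels = []
    for h in harmonics:
        C = librosa.cqt(y, sr=sr, hop_length=512, fmin=h * fmin_c1,
                        n_bins=216, bins_per_octave=36, tuning=tuning)
        channels.append(np.abs(C))
    hcqt = np.stack(channels, axis=-1)   # shape: (216, num_frames, 6)
    return np.log(1.0 + 10.0 * hcqt)     # logarithmic compression
```

From the resulting array, excerpts of 75 frames (shape 216 x 75 x 6) are cut out as network input.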
Basic CNN Architecture. Our simplest CNN variant (denoted as Basic) filters the input data in a musically meaningful way. The blue arrows (path (a)) in Figure 4 schematically illustrate this model. After a layer normalization [2], we extract \(N_{0}\) feature maps in a prefiltering layer. Using a kernel size of \(15\times 15\) allows the network to detect, e.g., vibrato for singing. The second convolutional layer performs a binning to MIDI pitches by moving a \(3\times 3\) kernel with stride 3 and no padding along the pitch axis, so that each output bin corresponds to an integer MIDI pitch. We learn \(N_{1}\) feature maps with 72 bins each. Third, a convolution across time (\(75\times 1\) kernel) performs time reduction, resulting in \(N_{2}\) feature maps. Fourth, we reduce the channel dimensionality to \(N_{3}\) channels with a \(1\times 1\) convolution. Fifth, we project the 72 pitches to the 12 pitch classes. To this end, we move a kernel with length \(72-11=61\) along the pitch axis. In all convolutional layers, we use LeakyReLU activation (negative slope 0.3) to prevent vanishing gradients and MaxPooling along time to force generalization. Dropout (rate 0.2) hampers overfitting while retaining a large amount of information. We use sigmoid activations in the final layer and train with binary cross-entropy loss between predicted pitch-class vectors \(\mathbf{p}\!\in\![0,1]^{12}\) and multi-hot target vectors \(\mathbf{t}\in\!\{0,1\}^{12}\), obtained from the note occurrences in the score (see Figure 2). As default setting for Basic, we choose \(N_{0}\!=\!N_{1}\!=\!20,{}N_{2}\!=\!10,{}N_{3}\!=\!1\), resulting in a network of roughly 48k parameters in total.
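The following PyTorch sketch assembles a Basic-like model with the default channel numbers \(N_{0}\!=\!N_{1}\!=\!20\), \(N_{2}\!=\!10\), \(N_{3}\!=\!1\). The placement of pooling and dropout and the layer-normalization shape are simplifying assumptions, so the sketch only approximates the roughly 48k parameters stated above.

```python
import torch
import torch.nn as nn

class BasicPitchClassCNN(nn.Module):
    """Sketch of the Basic model: prefiltering, pitch binning, time reduction,
    channel reduction, and pitch-class projection (channel numbers as in the
    text; dropout placement is a simplifying assumption)."""

    def __init__(self, n0=20, n1=20, n2=10, n3=1):
        super().__init__()
        self.norm = nn.LayerNorm([216, 75], elementwise_affine=False)
        self.prefilter = nn.Conv2d(6, n0, kernel_size=15, padding=7)
        self.pitch_bin = nn.Conv2d(n0, n1, kernel_size=3,
                                   stride=(3, 1), padding=(0, 1))  # 216 -> 72 pitches
        self.time_red = nn.Conv2d(n1, n2, kernel_size=(1, 75))     # collapse 75 frames
        self.chan_red = nn.Conv2d(n2, n3, kernel_size=1)
        self.pc_proj = nn.Conv2d(n3, 1, kernel_size=(61, 1))       # 72 -> 12 pitch classes
        self.act = nn.LeakyReLU(0.3)
        self.drop = nn.Dropout(0.2)

    def forward(self, x):                   # x: (batch, 6, 216, 75)
        x = self.norm(x)
        x = self.drop(self.act(self.prefilter(x)))
        x = self.act(self.pitch_bin(x))     # (batch, 20, 72, 75)
        x = self.act(self.time_red(x))      # (batch, 10, 72, 1)
        x = self.act(self.chan_red(x))      # (batch, 1, 72, 1)
        x = torch.sigmoid(self.pc_proj(x))  # (batch, 1, 12, 1)
        return x.reshape(x.shape[0], 12)
```

A dummy forward pass such as BasicPitchClassCNN()(torch.randn(8, 6, 216, 75)) yields an output of shape (8, 12), one pitch-class probability vector per excerpt.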
Deep Residual Architectures. In [48], we realized several extensions of the Basic model by increasing the width (higher number of channels \(N_{0}\), \(\ldots\), \(N_{3}\)) or the depth (repeating the prefiltering module). For this article's experiments, we restrict ourselves to considering the most promising of these extensions, the deep residual variant. In this model (red arrows or path (b) in Figure 4), we replicate the first (prefiltering) layer five times. Since training deep architectures is difficult due to vanishing gradients, we add residual connections [18] to these five layers, leaving the remaining layers unchanged. We realize two variants of this model differing in width and number of parameters. The smaller variant DeepResNet (as in [48]) has the same number of channels as Basic (\(N_{0}\!=\!N_{1}\!=\!20,{}N_{2}\!=\!10,{}N_{3}\!=\!1\)) and 408k parameters in total. As an extension compared to [48], we also consider a larger variant DeepResWide where we increase the number of channels to \(N_{0}\!=\!N_{1}\!=\!70,{}N_{2}\!=\!50,{}N_{3}\!=\!10\), resulting in a network of 4.8 M parameters in total.9
Data Augmentation. Going beyond [48], we extend our training set by applying data augmentation on the precomputed HCQT representations. First, we simulate transposition via shifting the HCQT by a random number of steps (up to \(\pm\)5 semitones) using zero-padding at the edges and shifting the labels accordingly. Second, we perform tuning augmentation via shifting by \(\pm\)1 CQT bin (1/3 semitone) or by \(\pm\)0.5 bins (averaging neighboring bins), thereby preserving the labels. Third, we add Gaussian noise to all HCQT values with a standard deviation of \(10^{-4}\). Following [1], we finally apply a RandomEQ—a multiplicative weighting of the CQT magnitudes with randomized parabolic functions—to obtain a kind of timbre augmentation.
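As an illustration, the following NumPy sketch implements the transposition and noise augmentations on a precomputed HCQT of shape (216, frames, 6) with its framewise 12-dimensional labels; the tuning shifts and the RandomEQ are omitted, and the function names are hypothetical.

```python
import numpy as np

def transpose_hcqt(hcqt, labels, max_shift=5, bins_per_semitone=3):
    """Shift the HCQT along the pitch axis by a random number of semitones
    (zero-padding at the edges) and roll the pitch-class labels accordingly."""
    shift = np.random.randint(-max_shift, max_shift + 1)
    b = shift * bins_per_semitone
    shifted = np.zeros_like(hcqt)
    if b > 0:
        shifted[b:, ...] = hcqt[:-b, ...]   # shift content upward in pitch
    elif b < 0:
        shifted[:b, ...] = hcqt[-b:, ...]   # shift content downward in pitch
    else:
        shifted = hcqt.copy()
    labels_shifted = np.roll(labels, shift, axis=-1)  # labels: (..., 12)
    return shifted, labels_shifted

def add_noise(hcqt, std=1e-4):
    """Additive Gaussian noise as described in the text."""
    return hcqt + np.random.randn(*hcqt.shape) * std
```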
Model Training. To train our models, we feed mini-batches of 25 HCQT excerpts as input (see “Input representation” above) along with the pitch-class labels for the center frame. We take these excerpts from the full HCQT at regular intervals by setting the hopsize (stride) in a way that the training set comprises roughly 300k excerpts and the validation set roughly 40k excerpts. After 3,800 batches, we declare one epoch to be finished. We train for at most 100 epochs with the AdamW optimizer [25]—an improved version of Adam using decoupled weight decay regularization—with an initial learning rate of \(10^{-3}\) for the Basic CNN and \(10^{-4}\) for the residual CNNs. For scheduling, we halve the learning rate if the validation loss does not decrease for five consecutive epochs. We stop training after 12 non-improving epochs and use the best model (regarding validation loss) for testing.
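A minimal sketch of this optimization setup is given below, assuming the BasicPitchClassCNN sketch from above and random dummy tensors in place of the HCQT excerpts and score-derived labels; data loading and the actual sampling of excerpts are omitted.

```python
import torch

model = BasicPitchClassCNN()                    # architecture sketch from above
criterion = torch.nn.BCELoss()                  # binary cross-entropy on sigmoid outputs
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)   # 1e-4 for residual variants
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=5)           # halve LR after 5 stale epochs

best_val, stale = float("inf"), 0
for epoch in range(100):                        # at most 100 epochs
    for _ in range(3800):                       # one epoch = 3,800 mini-batches
        x = torch.randn(25, 6, 216, 75)         # dummy mini-batch of 25 excerpts
        t = torch.randint(0, 2, (25, 12)).float()
        loss = criterion(model(x), t)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():                       # dummy validation pass
        val_loss = criterion(model(torch.randn(25, 6, 216, 75)),
                             torch.randint(0, 2, (25, 12)).float()).item()
    scheduler.step(val_loss)
    if val_loss < best_val:
        best_val, stale = val_loss, 0
        torch.save(model.state_dict(), "best_model.pt")   # keep best model for testing
    else:
        stale += 1
        if stale >= 12:                         # stop after 12 non-improving epochs
            break
```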

2.4 Experiments

We now report on our experiments regarding pitch-class representations, where we train and test the models described in Section 2.3. We focus on WRD as our central application scenario, comparing an internal split of WRD with cross-dataset evaluation (using other datasets from Table 1 for training). For evaluation, we confine ourselves to reporting the average framewise cosine similarity (CS) between predicted and target activations (at a frame rate of 43.07 Hz), which is computed as the inner product of the corresponding vectors after \(\ell_{2}\)-normalization. We chose this metric since it does not depend on any threshold and, in all our experiments, other metrics such as the F-measure or the average-precision score showed similar trends.
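For clarity, this evaluation measure can be computed with a few lines of NumPy, assuming prediction and target matrices of shape (number of frames, 12); the small epsilon guarding against all-zero frames is an assumption.

```python
import numpy as np

def framewise_cosine_similarity(pred, target, eps=1e-9):
    """Average framewise cosine similarity between predicted activations and
    binary targets, both of shape (num_frames, 12)."""
    pred_n = pred / (np.linalg.norm(pred, axis=1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=1, keepdims=True) + eps)
    return float(np.mean(np.sum(pred_n * target_n, axis=1)))
```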
Cross-Dataset Evaluation. In our first experiment, we investigate the quality of the learned pitch-class estimates using a realistic scenario where no part of the Ring (and not even any opera) is seen during training. To this end, we perform a cross-dataset experiment using all datasets from Table 1 for training and validation (excluding WRD). These datasets comprise a wide variety of instrumentations and classical music styles including orchestra (PhA), chamber music (MuN, Tri, and B10), singing with piano (SWD), and choir music (CSD). For testing, we consider three versions of the first act of Die Walküre, WWV 86 B-1 (Figure 3, yellow cell). We selected this specific act since it has the highest annotation quality (especially regarding measure annotations and synchronization quality; see [40]).10 Figure 5(a) shows the results. As a baseline, we consider a traditional chroma feature variant [30] relying on SP techniques—concretely, an infinite impulse response (IIR) filter-bank approach post-processed by weak logarithmic compression. Evaluating these features (denoted as IIR) against the pitch-class annotations derived from the piano score, we obtain a CS of 0.662 (averaged over all three versions). Compared to this baseline, the DL approaches perform clearly better. Using the Basic model, we achieve a CS of 0.774. With the residual models, we observe a further increase to CS \(=\) 0.782 for both DeepResNet and DeepResWide. In this cross-dataset setting, the higher number of parameters in DeepResWide does not lead to a further increase. Regarding the consistency of results, we note that the standard deviation over the three test versions is below 0.006 for all approaches (not shown in Figure 5 due to its small size). This stability across versions is encouraging, indicating that the learned representations are robust to differences in interpretation and acoustic conditions.
Fig. 5.
Fig. 5. Pitch-class estimation results for different models evaluated on the first act of Die Walküre WWV 86 B-1 (see Figure 3) in three versions. (a) Cross-dataset evaluation trained on other datasets. (b) Training–test split on WRD.
Overall, this experiment largely confirms our findings from previous work [48] regarding the results on the WRD dataset. Though the results were computed with a different implementation (including a number of minor differences in the methodological design), we obtained the same main trends as in [48]. First, we observed that a larger and more complex architecture can improve the effectiveness of pitch-class estimation models, with a residual CNN variant yielding the best results. Second, we found that training on data of a different type, such as piano or chamber music, is less beneficial for pitch-class estimation quality, and that the best pitch-class estimates were achieved by using training data from the same domain (opera).
Training–Test Split on WRD. Motivated by these findings, we now report on our next experiment where we investigate the models’ effectiveness when having seen similar data during training. To this end, we train and test our three models by splitting the WRD dataset. To get close to a real-world scenario, we employ a strict split indicated by Figure 3 where neither the test act nor the test versions are seen by the model during training or validation (therefore denoted as “neither split” in [45]). We adjust the training parameters such that the number of training and validation examples (in total and per epoch) is comparable to the previous experiment and consider the same test set (three versions of Die Walküre WWV 86 B-1). Figure 5(b) shows the corresponding results. The baseline IIR yields the same results as above since it does not involve any training. The DL approaches, however, lead to higher cosine similarities with CS \(=\) 0.797 for Basic. Furthermore, the improvements with larger architectures are more substantial, with the residual model DeepResNet achieving CS \(=\) 0.809 and its larger counterpart DeepResWide even reaching CS \(=\) 0.821. The results are still stable across versions (all standard deviations below 0.006). From this, we conclude that the CNNs are able to adapt to the properties of opera recordings (e.g., instrumentation, expressive singing with heavy use of vibrato, pitch range), and that the larger, high-capacity models can exploit this knowledge more effectively. In summary, these results confirm the observations from [48]: knowing the specific type of music (Romantic opera) is helpful for the models, and larger, residual models are beneficial in that case.
Evaluation on the Full Ring. Since all previous results are obtained from a single act, we now turn to a more comprehensive evaluation on the full WRD dataset (all acts in all 16 versions, roughly 232 hours of audio material). To this end, we employ the cross-dataset split, where no part of WRD nor any other opera data is seen during training. Figure 6 shows the average CS as well as the minimal and maximal CS across the 16 versions for the 11 acts, focusing on the baseline IIR (blue) and the best-performing DL model DeepResWide (red). As the most important observation, the superiority of the DL approach is clearly confirmed, achieving CS values at least 0.1 higher than the IIR baseline for all acts. Overall, we observe a stable behavior across versions. For the DeepResWide model, the best and worst version of each act differ by only 0.02–0.05. However, we note some substantial variations between acts. While our previous test act B-1 seems to be among the easier ones (together with C-3, D-0, and D-2), others such as B-3, C-1, or C-2 seem to be harder for pitch-class estimation. Interestingly, the variation across acts is larger than across versions. Please note that the best-performing versions for the difficult acts obtain lower CS values than the worst-performing versions of the easier acts. This indicates that musical differences between the acts of the Ring crucially influence the ability to predict pitch classes. Overall, this large-scale evaluation is strong evidence that pitch-class representations based on deep neural networks—as compared to SP methods—achieve a substantially higher similarity to the pitch-class information derived from a symbolic score, even when being trained on entirely different datasets.
Fig. 6.
Fig. 6. Pitch-class estimation results of the baseline IIR (blue, dashed line) and the DeepResWide model (red, solid line) for the different parts of the Ring cycle, evaluated for all 16 versions in the dataset. Lines and “\(+\)” symbols indicate the average CS over the 16 versions; the transparent areas highlight the region between minimal and maximal CS across all 16 versions.

3 Measuring Tonal Structures

This second part deals with the measurement of tonal structures based on the pitch-class representations presented above. For the specific application scenario of diatonic scales, we present related work (Section 3.1), our computational approach (Section 3.2), experimental results (Section 3.3), and future directions (Section 3.4).
Fig. 7.
Fig. 7. Diatonic scales as subsets of the circle of fifths, denoted according to the number and type of accidentals. The scales are defined on the pitch-class level, i.e., all pitches related by one or several octaves also belong to the scale.

3.1 Motivation and Related Work

The analysis of tonal structures is central to MIR and computational musicology. Notions of tonal structures relate to different temporal scales, with a hierarchical interdependency. Besides the global key, a single label for describing the tonality of a work or movement (work part), more fine-grained notions include local keys [24], which may change (modulate) throughout a movement, and chords [22]. Often, the intended practical application is to provide useful chord (or key) labels for amateur musicians to re-play or accompany a song11 rather than aiming for a precise and faithful chord transcription. Local key or chord analysis is typically approached as a framewise classification task involving an explicit segmentation and labeling (hard decision). However, this strategy does not reflect gradual modulation processes, tonal ambiguities, or different ways of interpretation. As a consequence, there is a high subjectivity in the labels, which has been shown in recent studies for chord [21, 32] and local key estimation (LKE) [45]. Since modern approaches to chord estimation and LKE commonly rely on supervised DL [22, 27] trained on single-expert annotations, the resulting systems suffer from a bias to the annotators [21] and to the repertoire covered by the training set [45].
Multi-Step Strategy. Such biases are generally not acceptable for analysis systems utilized in computational musicology, where objective tools are required. Moreover, the end-to-end paradigm in DL prevents the investigation of intermediate results and hampers interpretability. We address this demand by pursuing a multi-step approach where we rely on DL and training data only for the primary step (Figure 1(b)) of extracting pitch-class representations (as covered by Section 2)—an SP task that does not involve ambiguities in the way tonal analysis tasks do. For the subsequent step of measuring tonal structures (Figure 1(c)), we pursue a model-based approach relying on template matching, which can be realized for chords [13], scales [41], or other tonal structures defined as pitch-class sets. In this article, we focus on the measurement of the 12 diatonic scales, which are suitable for highlighting the overall modulation process while avoiding the frequent ambiguity between relative keys (e.g., C major vs. A minor).12 Such a diatonic scale merely defines a set of seven pitch classes related by perfect fifth intervals (two pitch classes that are seven semitones apart, see also Figure 7)—without the need to specify one of these pitch classes as the tonal center (as for local keys), thus avoiding ambiguities as mentioned above. One example of a diatonic scale is given by the white keys of the piano, which can be notated without accidentals and is therefore denoted as the 0 diatonic scale [14]. To measure these scales, we follow a simple template-matching procedure proposed in [41], which is loosely inspired by template-based key finding approaches [36] and their application for visualizing local tonality [15, 33]. In contrast to these works, we model the scales with binary templates—a straightforward choice which has the advantage of avoiding biases since no training (as in [37]) or perceptual experiments (as in [36]) are involved. Inspired by [41, 47], we visualize the measured similarities (probabilities) over time, thus avoiding an explicit segmentation and accounting for ambiguities (indicated by non-zero probabilities for more than one scale). By employing a musically motivated ordering according to the circle of fifths (where neighboring scales overlap by six out of seven pitch classes), these visualizations provide a good overview of the modulation process, in particular for complex and large-scale works such as symphonies or operas (Figure 1(c)). We demonstrate the usefulness of this approach for musicological studies using Wagner's Ring as an illustrative example.

3.2 Computational Approach

We now describe our template-based method for measuring tonal structures [41, 42, 47], closely following previous work [42] where we applied this technique to Beethoven's piano sonatas. This approach builds on measurewise pitch-class representations. In this article, we derive those representations either from the score (accumulating the duration of each pitch class in a measure, ignoring velocity, dynamics, or octave duplicates) or from the audio recordings (as in Section 2.3), where we consider the SP approach (IIR) and the best DL model (DeepResWide). For all representations, we use our measure annotations (Section 2.2) to obtain an average pitch-class distribution for each measure (compare Figure 1(b)). Since local keys or scales refer to the tonal content of longer segments, we smooth the measurewise representations using a window of size \(w\!\in\!{\mathbb{N}}\) measures and a hopsize of one measure. We normalize the smoothed pitch-class representation according to the \(\ell_{2}\)-norm, obtaining for each window a vector \(\mathbf{c}\!\in\!{\mathbb{R}}^{12}\) with the entries \(c_{0}\), \(c_{1}\), \(\ldots\), \(c_{11}\) corresponding to the pitch classes \(\mathrm{C}\), \(\mathrm{C}\sharp\), \(\ldots\), \(\mathrm{B}\) in chromatic order.
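The windowing and normalization step can be sketched as follows, assuming a measurewise pitch-class matrix of shape (M, 12); boundary handling (to retain one vector per measure) is omitted for brevity.

```python
import numpy as np

def smooth_and_normalize(pc_per_measure, w=32, eps=1e-9):
    """Smooth measurewise pitch-class distributions with a window of w measures
    (hopsize one measure) and l2-normalize each windowed vector.

    pc_per_measure: array of shape (M, 12); returns array of shape (M - w + 1, 12).
    """
    M = pc_per_measure.shape[0]
    windows = np.stack([pc_per_measure[m:m + w].sum(axis=0)
                        for m in range(M - w + 1)])
    return windows / (np.linalg.norm(windows, axis=1, keepdims=True) + eps)
```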
Template Matching. To measure the presence of specific tonal structures, we match each pitch-class vector \(\mathbf{c}\) with binary diatonic scale templates \(\mathbf{t}\!\in\!{\mathbb{R}}^{12}\) using the inner product \(\langle\mathbf{t},\mathbf{c}\rangle\) of \(\ell_{2}\)-normalized vectors (CS). For instance, in our chromatic ordering, the template for the “0 diatonic scale” (corresponding to the C major and A natural minor scales, i.e., the white keys on the piano) is defined by
\begin{align} \mathbf{t}_{0}=\frac{1}{\sqrt{7}}\left(1,0,1,0,1,1,0,1,0,1,0,1\right)^{\top},\end{align}
(1)
where \(1\) denotes an active and \(0\) an inactive pitch class. Assuming enharmonic equivalence (C\(\sharp\) \(=\) D\(\flat\), etc.), we distinguish 12 diatonic scales whose templates are obtained by circularly shifting \(\mathbf{t}_{0}\). We obtain for each frame an analytical measurement given by a 12-dimensional vector \(\mathbf{\hat{a}}\in{\mathbb{R}}^{12}\), with \(\hat{a}_{i}=\langle\mathbf{t}_{i},\mathbf{c}\rangle\), \(i\!\in\!\{-5,-4,\ldots,+5,+6\}\).
Rescaling and Reordering. As discussed in Section 3.1, instead of deciding on the best matching scale (using the \(\mathrm{argmax}\) function), we propose to use suitable visualization techniques [41] allowing for a direct interpretation of the continuous-valued diatonic scale likelihoods. Beyond that, this strategy allows one to visually grasp ambiguities, e.g., in the case that two scales obtain comparable probabilities. To generate such visualizations, we obtain rescaled local analyses \(\mathbf{a}\!\in\!{\mathbb{R}}^{12}\) by using the \(\mathrm{softmax}\) function for suppressing the weak components and enhancing the large ones:
\begin{align} a_{i}=\frac{\exp(\beta\hat{a}_{i})}{\sum_{j=-5}^{6}\exp(\beta\hat{a}_{j})}\end{align}
(2)
with \(i\!\in\!\{-5,\ldots,+6\}\) and a parameter \(\beta\in{\mathbb{R}}\) (inverse softmax temperature), which can be used for adjusting the level of enhancement. Due to the \(\ell_{1}\)-normalization involved here, we can interpret the analysis as pseudo-probabilities of diatonic scales, which we then visualize in grayscale (see Figure 1(c) for an example). We combine the analytical measurements \(\mathbf{a}\) for all musical measures into a matrix \(\mathcal{A}\in[0,1]^{12\times M}\), with \(M\) indicating the total number of measures. For the visualization step, we adopt a musical criterion for arranging the order of the scales following the circle of fifths (C, G, D, \(\ldots\)), thus accounting for the tonal proximity of fifth-related scales [14, 41]. In this context, scales notated with sharps (\(\sharp\)) are denoted \(+1\), \(+2\), and so forth, and scales notated with flats (\(\flat\)) as \(-1\), \(-2\), and so forth (due to enharmonic equivalence, \(-6\) \(\widehat{=}\) \(+6\)).
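The following NumPy sketch implements the template matching of Equation (1) and the rescaling of Equation (2), including a reordering along the circle of fifths; it assumes \(\ell_{2}\)-normalized input chroma vectors (e.g., from the smoothing sketch above), and the output column layout is a simplified assumption of the arrangement described in the text.

```python
import numpy as np

# Binary template for the "0 diatonic scale" (white keys), chromatic order C..B
T0 = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1], dtype=float)
# 12 transpositions (circular shifts), l2-normalized
TEMPLATES = np.stack([np.roll(T0, k) for k in range(12)]) / np.sqrt(7)

def diatonic_scale_measurement(chroma, beta=50.0):
    """Template matching (inner products with l2-normalized chroma vectors)
    followed by softmax rescaling; chroma has shape (num_windows, 12)."""
    scores = chroma @ TEMPLATES.T                                    # shape (N, 12)
    e = np.exp(beta * (scores - scores.max(axis=1, keepdims=True)))  # stable softmax
    probs = e / e.sum(axis=1, keepdims=True)
    # Reorder along the circle of fifths: the scale with n sharps corresponds to
    # a chromatic shift of 7*n mod 12; columns 7..11 can be read as -5..-1 flats.
    fifths_order = [(7 * n) % 12 for n in range(12)]
    return probs[:, fifths_order]
```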
Parameter Choice. The choice of the analysis parameters (analysis window size \(w\) and scaling parameter \(\beta\)) has a crucial influence on the resulting visualization. In this article, however, we do not aim for optimizing this choice but leave it to the musicological experts, who can work with the visualizations in an interactive fashion. The idea of the visualizations is not to perform a pre-selection but to provide a user-friendly, intuitive tool that allows for adjustments. For this reason, we try to avoid hard decisions (e.g., a boundary detector) and instead aim for an explicit, easy-to-understand white-box approach. To demonstrate this approach, we set \(w=32\) measures and \(\beta=50\) in the following and show in a dedicated experiment (Figure 10) that the parameter choice does not impact our main experimental findings concerning the usefulness of deep pitch-class representations for this task. For a more detailed discussion on the effect of the window size, we refer to [42].

3.3 Experiments

We now present the results of our experiments for measuring tonal structures. As our central evaluation scenario, we reconsider the WRD dataset, which constitutes a prime example for a large-scale, tonally complex musical work. Based on the different pitch-class representations covered in Section 2, we apply the methodology explained in Section 3.2 for measuring diatonic scale probabilities.
Qualitative Evaluation. To gain an intuitive understanding of the musical differences between our three strategies, we start with a qualitative comparison for the first act of Die Walküre WWV 86 B-1, focusing on the version conducted by Herbert von Karajan. As for the underlying pitch-class representations, we consider the variant based on the piano score (denoted as SC), the IIR variant based on traditional signal processing (denoted as SP), and the DL variant based on the DeepResWide model trained in the cross-dataset split (now denoted as DL for the sake of brevity). Relying on these different representations, we obtain three measurement matrices \(\mathcal{A}_{\textsf{SC}},\mathcal{A}_{\textsf{SP}},\mathcal{A}_{ \textsf{DL}}\in[0,1]^{12\times M}\). To visually compare these measurements, we make use of an additive color scheme loosely inspired by [47]. For computing our color scheme, we map the three measurement results to the red, green, blue (RGB) color space and invert this assignment
\begin{align}\begin{pmatrix}\mathrm{R} \\\mathrm{G} \\\mathrm{B}\end{pmatrix}=\begin{pmatrix}1-\mathcal{A}_{\textsf{SC}}(n,m) \\1-\mathcal{A}_{\textsf{SP}}(n,m) \\1-\mathcal{A}_{\textsf{DL}}(n,m)\end{pmatrix} \end{align}
(3)
with \(n\in\{-5,\ldots,6\}\) and \(m\in\{1,\ldots,M\}\). Basically, this corresponds to plotting the three matrices on top of each other in a transparent way, such that they merge to black or grayscale values in case all matrices are the same. The resulting colors from this assignment are shown in the legend of Figure 8. When occurring in isolation, an SC-based measurement results in cyan, the SP-based measurement results in magenta, and the DL-based measurement results in yellow. As soon as exactly two measurements match, we observe as mix colors blue (SC \(+\) SP), green (SC \(+\) DL), and red (SP \(+\) DL), respectively. If all three representations lead to the same measurement, this is displayed in black or grayscale.
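Equation (3) amounts to stacking the three (inverted) measurement matrices as the red, green, and blue channels of an image, as in the following sketch; matrix shapes and the plotting details are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def comparative_rgb(A_sc, A_sp, A_dl):
    """Combine three diatonic-scale measurement matrices (each of shape (12, M),
    values in [0, 1]) into an RGB image following Equation (3):
    R = 1 - A_SC, G = 1 - A_SP, B = 1 - A_DL."""
    rgb = np.stack([1.0 - A_sc, 1.0 - A_sp, 1.0 - A_dl], axis=-1)  # (12, M, 3)
    return np.clip(rgb, 0.0, 1.0)

# Usage sketch with random data in place of real measurements:
A = [np.random.rand(12, 200) for _ in range(3)]
plt.imshow(comparative_rgb(*A), aspect="auto", origin="lower")
plt.xlabel("Measures")
plt.ylabel("Diatonic scale")
plt.show()
```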
Fig. 8.
Fig. 8. Comparative visualization of diatonic scale probabilities for the act WWV 86 B-1 in the version conducted by H. v. Karajan. The visualization aggregates results based on different pitch-class representations (SP, DL, and SC) using an additive color scheme as indicated by the legend.
In Figure 8, we observe black—indicating agreement across all three representations—for a number of sections, e.g., in measure regions 50–110, 605–640, or 1,280–1,330. The latter region, for instance, starts in clear C major (with a chromatically figurated dominant chord), then circulates around this diatonic center, touching related keys such as F major and A minor but being quite stable against short altered chords. For many other regions, we do not observe such an agreement. Most deviations occur in neighboring rows corresponding to musically similar diatonic scales (differing by only one accidental, as in measures 925–960 or 1,345–1,400). However, we also see stronger deviations such as for measures 440–600, a fluctuating, recitative-like passage in minor, whose tonal centers G minor and A minor (from measure 535 on) can be observed from the SC and DL representations. Overall, we identify two dominating colors: green (SC \(+\) DL together) and magenta (SP in isolation). This shows that, for a large part of this act, the measurement based on SP deviates from the score-based one, but the DL- and the SC-based measurements coincide. We take this observation as a first indication for DL leading to a higher consistency between audio- and score-based measurements of tonal structures.
Quantitative Evaluation. To investigate this observation in a more quantitative way, we now introduce a metric for comparing two measurements of diatonic scales. We tested several metrics including the CS. In practice, we found the most informative metric to be based on the Jensen–Shannon (JS) divergence \(\mathcal{J}\), which is a symmetrized version of the Kullback–Leibler divergence \(\mathcal{K}\) for comparing two probability distributions. For calculating this metric, we interpret the \(\ell_{1}\)-normalized vectors \(\mathbf{a}\in{\mathbb{R}}^{12}\) (columns of \(\mathcal{A}\)) as probability distributions. The JS divergence yields values \(\mathcal{J}(\mathbf{a},\mathbf{a}^{\prime})\!\in\![0,1]\) and is computed as follows:
\begin{align} \mathcal{J}(\mathbf{a},\mathbf{a}^{\prime})=\frac{1}{2}\left(\mathcal{K}(\mathbf{a},\mathbf{m})+\mathcal{K}(\mathbf{a}^{\prime},\mathbf{m})\right)\end{align}
(4)
with \(\mathbf{m}\!=\!1/2(\mathbf{a}\!+\!\mathbf{a}^{\prime})\) and
\begin{align} \mathcal{K}(\mathbf{a},\mathbf{a}^{\prime})=\sum_{i=1}^{12}a_{i}\cdot\log\left (\frac{a_{i}}{a_{i}^{\prime}}\right).\end{align}
(5)
To obtain a similarity measure (like the cosine measure used before), we define the JS similarity \(\mathcal{S}\):
\begin{align} \mathcal{S}(\mathbf{a},\mathbf{a}^{\prime})=1-\mathcal{J}(\mathbf{a},\mathbf{a }^{\prime})\end{align}
(6)
such that \(\mathcal{S}(\mathbf{a},\mathbf{a}^{\prime})\in[0,1]\) and \(\mathcal{S}(\mathbf{a},\mathbf{a}^{\prime})=1\) if \(\mathbf{a}=\mathbf{a}^{\prime}\). For comparing two measurements for a whole act or the full Ring, we calculate the pairwise JS similarity for each musical measure and finally average over all measures.
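A direct implementation of Equations (4)–(6) is sketched below; base-2 logarithms are assumed so that the JS divergence lies in [0, 1], and the small epsilon guarding against zero entries is a numerical safeguard not specified in the text.

```python
import numpy as np

def kl_divergence(a, b, eps=1e-12):
    """Kullback-Leibler divergence (base 2) between two 12-dim distributions."""
    a = np.clip(a, eps, 1.0)
    b = np.clip(b, eps, 1.0)
    return float(np.sum(a * np.log2(a / b)))

def js_similarity(a, b):
    """JS similarity S = 1 - J as in Equations (4)-(6)."""
    m = 0.5 * (a + b)
    j = 0.5 * (kl_divergence(a, m) + kl_divergence(b, m))
    return 1.0 - j

def mean_js_similarity(A, B):
    """Average JS similarity over all musical measures; A, B of shape (12, M)."""
    return float(np.mean([js_similarity(A[:, m], B[:, m])
                          for m in range(A.shape[1])]))
```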
In Figure 9, we show the resulting JS similarities for the different pairs of representations. Let us first focus on Figure 9(a), which shows the results for the act WWV 86 B-1. For each color, the left (hatched) bar indicates the JS similarity computed on the Karajan version—corresponding to the measurement results visualized in Figure 8. The results confirm our qualitative observation from above. First, note that the similarity between the score-based and SP approaches is lowest with \(\mathcal{S}\) \(=\) 0.870 (SC–SP, blue). In contrast, using the DL approach, the resulting representation is much closer to the score-based one, as indicated by the value of \(\mathcal{S}\) \(=\) 0.950 (SC–DL, green). Finally, the similarity between the SP and the DL approach lies between the two previous combinations with \(\mathcal{S}\) \(=\) 0.888 (SP–DL, red). This confirms that SP clearly differs from both DL and SC. We repeat this experiment for all 16 versions of the act B-1, indicated by the right (solid) bar of each color in Figure 9(a), respectively. The findings confirm the trend observed for Karajan, with the SP strategy being even less similar to SC (\(\mathcal{S}\) \(=\) 0.839, blue). This indicates that versions other than Karajan's are acoustically more challenging for the SP approach but do not pose a problem for DL. In general, the JS similarities vary across versions: For SC–SP (blue), we obtain a standard deviation of 0.040. This decreases when using DL, with SC–DL (green) yielding a standard deviation of only 0.015. We conclude that the DL approach not only leads to a measurement of tonal structures much closer to the SC-based one, but that this similarity is also more robust against variations in interpretation and acoustic conditions, pointing to high generalization capabilities of the trained networks (please remember that the networks have not seen any opera during training).
Fig. 9.
Fig. 9. Diatonic scale measurements based on different representations (SP, DL, and SC): pairwise comparisons using the JS similarity. The left (hatched) bar of each color indicates the JS similarity for a single version (Karajan). The right (solid) bar shows the mean JS similarity and its standard deviation (error bar) across all 16 versions in WRD. (a) The first act of Die Walküre WWV 86 B-1 and (b) the full Ring cycle.
We now repeat this experiment for the full WRD dataset, considering all 11 acts for evaluation (Figure 9(b)). Overall, we observe lower JS similarities as compared to the previous experiment. This suggests that the act B-1 is a tonally simpler part of the Ring as compared to other acts and operas, thus confirming our observation for the pitch-class estimation task (Figure 6). However, the general trend remains the same. Looking at the mean over all versions (right, solid bars), the similarity between SC–SP is lowest at \(\mathcal{S}\) \(=\) 0.791 and the similarity between SC–DL highest at \(\mathcal{S}\) \(=\) 0.904. This is an encouraging finding, indicating that even for tonally complex, large-scale works such as the Ring, audio-based measurements of tonal structures lead to results very close to score-based ones when relying on a deep pitch-class representation. In summary, we conclude that computational approaches face particular challenges in the musical analysis of local keys or scales, whereas DL methods can effectively perform the technical front-end task of extracting pitch-class representations from audio.
Influence of Parameters. In the previous experiments, we always fixed the parameters for the template-based measurement of diatonic scales to a window size of \(w\) \(=\) 32 measures and a scaling parameter of \(\beta\) \(=\) 50. We finally want to test if and how this choice influences our central findings. To this end, we repeat the previous experiment on the full Ring in the Karajan version (Figure 9(b), left/hatched bars) for different values of the window size \(w\), which is a central parameter of the measurement. The pairwise similarities for the different window sizes and representations are shown in Figure 10. Overall, we observe that the measurements become more similar for larger window sizes. This may be an effect of averaging, where differences are suppressed when using larger windows. However, irrespective of the value of \(w\), DL and SC are most similar to each other, and the similarity between SC–SP is always substantially lower. This consistent behavior strongly supports our conclusion above: by using DL techniques, the computational measurements of tonal structures based on audio recordings can become similar to those based on symbolic scores.
Fig. 10.
Fig. 10. Pairwise comparison of diatonic scale measurements for the full Ring based on different representations (SP, DL, and SC) and window sizes. Results are computed based on the Karajan version of the Ring.

3.4 Future Work

We finally want to point out several interesting directions for future work. First, we compared in this section the measurement results based on different representations directly with each other, which does not necessarily indicate how close these measurements are to a human analysis. While annotating the full Ring regarding local keys or diatonic scales is extremely time-consuming and may introduce subjectivity [45], we can derive a weak annotation of diatonic scales from the key signature regions in the score (whose number and type of accidentals define a diatonic scale). Preliminary experiments showed that the DL-based measurements of diatonic scales are closer to such annotations than the SP-based ones, and there is no observable difference between DL and SC.13
Second, in this article, we used the different versions of our Ring dataset only for testing the robustness of the results. However, one may also exploit the availability of several versions for improving the measurements themselves by performing cross-version aggregation [11, 47], either with an early-fusion approach (aggregating pitch-class information from different versions) or a late-fusion approach (aggregating tonal measurements). These strategies may further improve audio-based approaches for the computational analysis of tonal structures.

4 Conclusions

In this article, we investigated the measurement of tonal structures in a case study on Richard Wagner's Ring cycle. As our main strategy, we proposed a multi-step approach that allows for an objective measurement and for interpretation by human experts. In the first step, we extracted pitch-class representations from the audio recordings. We found that a DL approach is able to derive pitch-class representations closer to the score than an SP approach, even when the networks are trained on entirely different datasets. Larger and more complex networks were able to further improve the representations when being exposed to similar data (operas) during training. On the basis of the different pitch-class representations, we then performed experiments for measuring diatonic scale probabilities using a traditional pattern-matching technique. We showed that the measurement results with pitch-class representations based on DL are close to the ones based on scores. In contrast, pitch-class representations extracted with SP produced measurements that were notably less similar to those based on scores. We found this observation to be independent of the musical performances (versions) used for testing and of the parameters of the measurement. From these findings, we conclude that DL strategies substantially contribute to bridging the gap between audio- and score-based analyses. Such strategies make it possible to conduct large-scale corpus studies—traditionally based on scores—on the basis of audio recordings, thus opening up new possibilities to exploit multi-modal data for musicological research.

Acknowledgments

The International Audio Laboratories Erlangen is a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS. We thank Johannes Zeitler, Geoffroy Peeters, Mark Gotham, Stephanie Klauk, Rainer Kleinertz, and Vlora Arifi-Müller for collaboration on prior work. Some ideas were inspired by the Dagstuhl Seminar 22082 on Deep Learning and Knowledge Integration for Music Audio Analysis.

Footnotes

1
Assistance systems for digitization based on optical music recognition still require a considerable amount of manual intervention [9].
2
This challenge is also caused by the hierarchical nature of tonality, which makes not only labeling but also segmentation an ambiguous task. For instance, one may consider different levels of local key changes, denoted as modulations or tonicizations [24].
3
Pitch and pitch-class annotations for the SWD can be found at https://zenodo.org/record/5139893/, version 2.0.
4
Due to copyright restrictions, for SWD and WRD, the audio data are included in the respective archives only for a subset of versions. Other versions are available as commercial CDs as specified in [40, 46].
5
The dataset is available online at Zenodo via https://zenodo.org/records/7672157
6
The original PDF is available via the International Music Score Library Project (IMSLP), e.g., https://imslp.org/wiki/Special:ImagefromIndex/253868/qrol; the score is accessible as MIDI/MusicXML in [40].
7
The CQT is determined by a single parameter, the hopsize, which must be an integer multiple of a power of two.
8
We use the tuning estimation of librosa, see https://librosa.org/
9
Our source code (PyTorch) and pretrained models are available online via https://github.com/christofw/pitchclass_tonalstructures
10
Moreover, in future work, we plan to extend our experiments to compare with pitch-class annotations from a full orchestral score (not only the piano reduction), which we have available for this act but not for any other part of the Ring.
11
Related commercial applications are, e.g., Chordify or Yousician.
12
In essence, this diatonic scale measurement is a simplification of the LKE task. Omitting the referential chord (tonic) and, thus, the distinction between major and relative minor certainly entails a loss of information; this has downsides (such as disregarding altered scale degrees in harmonic or melodic minor) but also advantages. Being a pure measurement of pitch-class content over time rather than an actual “analysis,” it involves less interpretation by trained experts, and perceptual aspects such as the implied “center of gravity” or “feeling of arrival” necessary for establishing a local key are less relevant. This makes the measurements less prone to annotator subjectivity and disagreement, which frequently occur in local key annotations, as shown in [45], while still remaining explicitly interpretable for musicologists.
13
An article with these experiments on the full WRD [40] is currently under preparation.

References

[1]
Jakob Abeßer and Meinard Müller. 2021. Jazz bass transcription using a U-net architecture. Electronics 10, 6 (2021), 670 1–11. DOI:
[2]
Lei J. Ba, Jamie R. Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. arXiv:1607.06450. Retrieved from https://arxiv.org/abs/1607.06450
[3]
Emmanouil Benetos, Simon Dixon, Zhiyao Duan, and Sebastian Ewert. 2019. Automatic music transcription: An overview. IEEE Signal Processing Magazine 36, 1 (2019), 20–30. DOI:
[4]
Rachel M. Bittner, Brian McFee, Justin Salamon, Peter Li, and Juan P. Bello. 2017. Deep salience representations for F0 tracking in polyphonic music. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR ’17). 63–70. DOI:
[5]
Taemin Cho and Juan P. Bello. 2014. On the relative importance of individual components of chord recognition systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, 2 (2014), 477–492.
[6]
Helena Cuesta, Emilia Gómez, Agustín Martorell, and Felipe Loáiciga. 2018. Analysis of intonation in unison choir singing. In Proceedings of the International Conference of Music Perception and Cognition (ICMPC ’18). 125–130.
[7]
Carl Dahlhaus, Klaus Döge, and Egon Voss (Eds.). 1970–2020. Richard Wagner – Gesamtausgabe. Schott Music, Mainz.
[8]
Zhiyao Duan, Bryan Pardo, and Changshui Zhang. 2010. Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions. IEEE Transactions on Audio, Speech, and Language Processing 18, 8 (2010), 2121–2133.
[9]
Sachinda Edirisooriya, Hao-Wen Dong, Julian J. McAuley, and Taylor Berg-Kirkpatrick. 2021. An empirical evaluation of end-to-end polyphonic optical music recognition. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR ’21). Online, 167–173.
[10]
Valentin Emiya, Roland Badeau, and Bertrand David. 2010. Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle. IEEE Transactions on Audio, Speech, and Language Processing 18, 6 (2010), 1643–1654.
[11]
Sebastian Ewert, Meinard Müller, Verena Konz, Daniel Müllensiefen, and Geraint A. Wiggins. 2012. Towards cross-version harmonic analysis of music. IEEE Transactions on Multimedia 14, 3–2 (2012), 770–782.
[12]
Joachim Fritsch and Mark D. Plumbley. 2013. Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP ’13). 888–891.
[13]
Takuya Fujishima. 1999. Realtime chord recognition of musical sound: A system using common lisp music. In Proceedings of the International Computer Music Conference (ICMC ’99). 464–467.
[14]
Zsolt Gárdonyi and Hubert Nordhoff. 2002. Harmonik (2nd ed.). Möseler, Wolfenbüttel.
[15]
Emilia Gómez and Jordi Bonada. 2005. Tonality visualization of polyphonic audio. In Proceedings of the International Computer Music Conference (ICMC ’05). 57–60.
[16]
Maarten Grachten, Carlos E. Cancino-Chacón, Thassilo Gadermaier, and Gerhard Widmer. 2017. Toward computer-assisted understanding of dynamics in symphonic music. IEEE MultiMedia 24, 1 (2017), 36–46.
[17]
Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi A. Huang, Sander Dieleman, Erich Elsen, Jesse H. Engel, and Douglas Eck. 2019. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In Proceedings of the International Conference on Learning Representations (ICLR ’19).
[18]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’16). 770–778. DOI:
[19]
Rainer Kelz, Matthias Dorfer, Filip Korzeniowski, Sebastian Böck, Andreas Arzt, and Gerhard Widmer. 2016. On the potential of simple framewise approaches to piano transcription. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR ’16). 475–481.
[20]
Anssi P. Klapuri. 2008. Multipitch analysis of polyphonic music and speech signals using an auditory model. IEEE Transactions on Audio, Speech, and Language Processing 16, 2 (2008), 255–266.
[21]
Hendrik V. Koops, Wander B. de Haas, John A. Burgoyne, Jeroen Bransen, Anna Kent-Muller, and Anja Volk. 2019. Annotator subjectivity in harmony annotations of popular music. Journal of New Music Research 48, 3 (2019), 232–252. DOI:
[22]
Filip Korzeniowski and Gerhard Widmer. 2016. Feature learning for chord recognition: The deep chroma extractor. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR ’16). 37–43.
[23]
Bochen Li, Xinzhao Liu, Karthik Dinesh, Zhiyao Duan, and Gaurav Sharma. 2019. Creating a multitrack classical music performance dataset for multimodal music analysis: Challenges, insights, and applications. IEEE Transactions on Multimedia 21, 2 (2019), 522–535.
[24]
Néstor Nápoles López, Laurent Feisthauer, Florence Levé, and Ichiro Fujinaga. 2020. On local keys, modulations, and tonicizations: A dataset and methodology for evaluating changes of key. In Proceedings of the International Conference on Digital Libraries for Musicology (DLfM ’20). 18–26.
[25]
Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In Proceedings of the International Conference on Learning Representations (ICLR ’19).
[26]
Matthias Mauch and Simon Dixon. 2010. Approximate note transcription for the improved identification of difficult chords. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR ’10). 135–140.
[27]
Brian McFee and Juan P. Bello. 2017. Structured training for large-vocabulary chord recognition. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR ’17). 188–194. DOI:
[28]
Marius Miron, Julio J. Carabias-Orti, Juan J. Bosch, Emilia Gómez, and Jordi Janer. 2016. Score-informed source separation for multichannel orchestral recordings. Journal of Electrical and Computer Engineering 2016 (2016), 8363507:1–8363507:19.
[29]
Meinard Müller and Sebastian Ewert. 2010. Towards timbre-invariant audio features for harmony-based music. IEEE Transactions on Audio, Speech, and Language Processing 18, 3 (2010), 649–662.
[30]
Meinard Müller, Frank Kurth, and Michael Clausen. 2005. Chroma-based statistical audio features for audio matching. In Proceedings of the IEEE Workshop on Applications of Signal Processing (WASPAA ’05). 275–278. DOI:
[31]
Meinard Müller, Yigitcan Özer, Michael Krause, Thomas Prätzlich, and Jonathan Driedger. 2021. Sync Toolbox: A Python package for efficient, robust, and accurate music synchronization. Journal of Open Source Software (JOSS) 6, 64 (2021), 3434 1–4.
[32]
Yizhao Ni, Matt McVicar, Raúl Santos-Rodríguez, and Tijl De Bie. 2013. Understanding effects of subjectivity in measuring chord estimation accuracy. IEEE Transactions on Audio, Speech, and Language Processing 21, 12 (2013), 2607–2615.
[33]
Hendrik Purwins, Benjamin Blankertz, and Klaus Obermayer. 2001. Constant Q profiles for tracking modulations in audio data. In Proceedings of the International Computer Music Conference (ICMC ’01).
[34]
Jordan B. L. Smith, Ching-Hua Chuan, and Elaine Chew. 2014. Audio properties of perceived boundaries in music. IEEE Transactions on Multimedia 16, 5 (2014), 1219–1228.
[35]
Li Su and Yi-Hsuan Yang. 2015. Escaping from the abyss of manual annotation: New methodology of building polyphonic datasets for automatic music transcription. In Proceedings of the 11th International Symposium on Computer Music Multidisciplinary Research (CMMR ’15). R. Kronland-Martinet, M. Aramaki, and S. Ystad (Eds.), Lecture Notes in Computer Science. Springer, 309–321.
[36]
David Temperley. 1999. What's key for key? The Krumhansl-Schmuckler key-finding algorithm reconsidered. Music Perception 17, 1 (1999), 65–100.
[37]
David Temperley. 2001. The Cognition of Basic Musical Structures. MIT Press.
[38]
John Thickstun, Zaïd Harchaoui, and Sham M. Kakade. 2017. Learning features of music from scratch. In Proceedings of the International Conference on Learning Representations (ICLR ’17).
[39]
T. J. Tsai, Daniel Yang, Mengyi Shan, Thitaree Tanprasert, and Teerapat Jenrungrot. 2020. Using cell phone pictures of sheet music to retrieve MIDI passages. IEEE Transactions on Multimedia 22, 12 (2020), 3115–3127.
[40]
Christof Weiß, Vlora Arifi-Müller, Michael Krause, Frank Zalkow, Stephanie Klauk, Rainer Kleinertz, and Meinard Müller. 2023. Wagner Ring dataset: A complex opera scenario for music processing and computational musicology. Transactions of the International Society for Music Information Retrieval (TISMIR) 6, 1 (2023), 135–149. DOI:
[41]
Christof Weiß and Julian Habryka. 2014. Chroma-based scale matching for audio tonality analysis. In Proceedings of the Conference on Interdisciplinary Musicology (CIM ’14). 168–173.
[42]
Christof Weiß, Stephanie Klauk, Mark Gotham, Meinard Müller, and Rainer Kleinertz. 2020. Discourse not dualism: An interdisciplinary dialogue on sonata form in Beethoven's early piano sonatas. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR ’20). 199–206.
[43]
Christof Weiß and Geoffroy Peeters. 2021. Training deep pitch-class representations with a multi-label CTC loss. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR ’21). Online, 754–761.
[44]
Christof Weiß and Geoffroy Peeters. 2022. Comparing deep models and evaluation strategies for multi-pitch estimation in music recordings. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022), 2814–2827. DOI:
[45]
Christof Weiß, Hendrik Schreiber, and Meinard Müller. 2020b. Local key estimation in music recordings: A case study across songs, versions, and annotators. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020), 2919–2932. DOI:
[46]
Christof Weiß, Frank Zalkow, Vlora Arifi-Müller, Meinard Müller, Hendrik Vincent Koops, Anja Volk, and Harald Grohganz. 2021a. Schubert Winterreise dataset: A multimodal scenario for music analysis. ACM Journal on Computing and Cultural Heritage (JOCCH) 14, 2 (2021), 25 1–18. DOI:
[47]
Christof Weiß, Frank Zalkow, Meinard Müller, Stephanie Klauk, and Rainer Kleinertz. 2017. Versionsübergreifende Visualisierung harmonischer Verläufe: Eine Fallstudie zu Wagners Ring-Zyklus. In Proceedings of the Jahrestagung der Gesellschaft für Informatik (GI ’17). 205–217. DOI:
[48]
Christof Weiß, Johannes Zeitler, Tim Zunner, Florian Schuberth, and Meinard Müller. 2021. Learning pitch-class representations from score–audio pairs of classical music. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR ’21). Online, 746–753.
[49]
Yiming Wu and Wei Li. 2019. Automatic audio chord recognition with MIDI-trained deep feature and BLSTM-CRF sequence decoding model. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, 2 (2019), 355–366.
[50]
Frank Zalkow and Meinard Müller. 2020. Using weakly aligned score–audio pairs to train deep chroma models for cross-modal music retrieval. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR ’20). 184–191. DOI:
[51]
Frank Zalkow, Christof Weiß, Thomas Prätzlich, Vlora Arifi-Müller, and Meinard Müller. 2017. A multi-version approach for transferring measure annotations between music recordings. In Proceedings of the AES International Conference on Semantic Audio. 148–155.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Author Tags

  1. Computational music analysis
  2. pitch-class representations
  3. tonal analysis
  4. deep learning
