3.1 Motivation and Related Work
The analysis of tonal structures is central to MIR and computational musicology. Notions of tonal structures relate to different temporal scales, with a hierarchical interdependency. Besides the global key, a single label describing the tonality of a work or movement (work part), more fine-grained notions include local keys [24], which may change (modulate) throughout a movement, and chords [22]. Often, the intended practical application is to provide useful chord (or key) labels for amateur musicians to re-play or accompany a song, rather than aiming for a precise and faithful chord transcription. Local key or chord analysis is typically approached as a framewise classification task involving an explicit segmentation and labeling (hard decision). However, this strategy does not reflect gradual modulation processes, tonal ambiguities, or different ways of interpretation. As a consequence, the labels are highly subjective, as recent studies have shown for chord estimation [21, 32] and local key estimation (LKE) [45]. Since modern approaches to chord estimation and LKE commonly rely on supervised DL [22, 27] trained on single-expert annotations, the resulting systems suffer from a bias toward the annotators [21] and toward the repertoire covered by the training set [45].
Multi-Step Strategy. Such biases are generally not acceptable for analysis systems utilized in computational musicology, where objective tools are required. Moreover, the end-to-end paradigm in DL prevents the investigation of intermediate results and hampers interpretability. We address this demand by pursuing a multi-step approach where we rely on DL and training data only for the primary step (Figure 1(b)) of extracting pitch-class representations (as covered by Section 2)—an SP task that does not involve ambiguities in the way tonal analysis tasks do. For the subsequent step of measuring tonal structures (Figure 1(c)), we pursue a model-based approach relying on template matching, which can be realized for chords [13], scales [41], or other tonal structures defined as pitch-class sets. In this article, we focus on the measurement of the 12 diatonic scales, which are suitable for highlighting the overall modulation process while avoiding the frequent ambiguity between relative keys (e.g., C major vs. A minor). Such a diatonic scale merely defines a set of seven pitch classes related by perfect fifth intervals (two pitch classes that are seven semitones apart, see also Figure 7)—without the need to specify one of these pitch classes as the tonal center (as for local keys), thus avoiding the ambiguities mentioned above. One example of a diatonic scale is given by the white keys of the piano, which can be notated without accidentals and is therefore denoted as the 0 diatonic scale [14]. To measure these scales, we follow a simple template-matching procedure proposed in [41], which is loosely inspired by template-based key-finding approaches [36] and their application for visualizing local tonality [15, 33]. In contrast to these works, we model the scales with binary templates—a straightforward choice which has the advantage of avoiding biases since no training (as in [37]) or perceptual experiments (as in [36]) are involved. Inspired by [41, 47], we visualize the measured similarities (probabilities) over time, thus avoiding an explicit segmentation and accounting for ambiguities (indicated by non-zero probabilities for more than one scale). By employing a musically motivated ordering according to the circle of fifths (where neighboring scales overlap by six out of seven pitch classes), these visualizations provide a good overview of the modulation process, in particular for complex and large-scale works such as symphonies or operas (Figure 1(c)). We demonstrate the usefulness of this approach for musicological studies using Wagner's Ring as an illustrative example.
3.2 Computational Approach
We now describe our template-based method for measuring tonal structures [41, 42, 47], closely following previous work [42] where we applied this technique to Beethoven's piano sonatas. This approach builds on measurewise pitch-class representations. In this article, we derive those representations either from the score (accumulating the duration of each pitch class in a measure, ignoring velocity, dynamics, or octave duplicates) or from the audio recordings (as in Section 2.3), where we consider the SP approach (IIR) and the best DL model (DeepResWide). For all representations, we use our measure annotations (Section 2.2) to obtain an average pitch-class distribution for each measure (compare Figure 1(b)). Since local keys or scales refer to the tonal content of longer segments, we smooth the measurewise representations using a window of size \(w\!\in\!{\mathbb{N}}\) measures and a hopsize of one measure. We normalize the smoothed pitch-class representation according to the \(\ell_{2}\)-norm, obtaining for each window a vector \(\mathbf{c}\!\in\!{\mathbb{R}}^{12}\) with the entries \(c_{0}, c_{1}, \ldots, c_{11}\) corresponding to the pitch classes \(\mathrm{C}, \mathrm{C}\sharp, \ldots, \mathrm{B}\) in chromatic order.
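For illustration, the smoothing and normalization steps can be sketched as follows (a minimal NumPy sketch with our own variable names and a toy input, not code from the actual pipeline):

```python
import numpy as np

def smooth_and_normalize(pcs, w):
    """Smooth measurewise pitch-class vectors with a window of w measures
    (hopsize of one measure) and l2-normalize each windowed vector.

    pcs: array of shape (M, 12), one 12-dim pitch-class vector per measure.
    Returns an array of shape (M - w + 1, 12) of l2-normalized vectors c.
    """
    M = pcs.shape[0]
    windows = np.stack([pcs[m:m + w].mean(axis=0) for m in range(M - w + 1)])
    norms = np.linalg.norm(windows, axis=1, keepdims=True)
    return windows / np.maximum(norms, 1e-12)  # guard against all-zero windows

# Toy example: 4 measures, window size w = 2
pcs = np.random.rand(4, 12)
c = smooth_and_normalize(pcs, w=2)
print(c.shape)  # (3, 12); each row has unit l2-norm
```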
Template Matching. To measure the presence of specific tonal structures, we match each pitch-class vector \(\mathbf{c}\) with binary diatonic scale templates \(\mathbf{t}\!\in\!{\mathbb{R}}^{12}\) using the inner product \(\langle\mathbf{t},\mathbf{c}\rangle\) of \(\ell_{2}\)-normalized vectors (CS). For instance, in our chromatic ordering, the template for the “0 diatonic scale” (corresponding to the C major and A natural minor scales, i.e., the white keys on the piano) is defined by
\[\mathbf{t}_{0}=(1,0,1,0,1,1,0,1,0,1,0,1)^{\top},\]
where \(1\) denotes an active and \(0\) an inactive pitch class. Assuming enharmonic equivalence (C\(\sharp\) \(=\) D\(\flat\) etc.), we distinguish 12 diatonic scales whose templates are obtained by circularly shifting \(\mathbf{t}_{0}\). We obtain for each frame an analytical measurement given by a 12-dimensional vector \(\hat{\mathbf{a}}\in{\mathbb{R}}^{12}\), with \(\hat{a}_{i}=\langle\mathbf{t}_{i},\mathbf{c}\rangle\), \(i\!\in\!\{-5,-4,\ldots,+5,+6\}\).
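The template construction and matching can be sketched as follows (our own illustrative NumPy code; it assumes the circle-of-fifths indexing described above, where a shift by one fifth corresponds to a circular shift of seven semitones):

```python
import numpy as np

# Binary template for the 0 diatonic scale (white keys), chromatic order C..B
T0 = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1], dtype=float)

def diatonic_templates():
    """Templates t_i for i in {-5,...,+6}, obtained by circularly shifting t_0.
    Shifting by i fifths corresponds to rolling by 7*i semitones (mod 12)."""
    return {i: np.roll(T0, (7 * i) % 12) for i in range(-5, 7)}

def match(c):
    """Cosine similarity between an l2-normalized pitch-class vector c
    and each l2-normalized diatonic template."""
    c = c / np.linalg.norm(c)
    return {i: float(np.dot(t / np.linalg.norm(t), c))
            for i, t in diatonic_templates().items()}

# A pure white-key pitch-class profile matches the 0 diatonic scale perfectly;
# neighboring scales (sharing 6 of 7 pitch classes) score 6/7.
a_hat = match(T0.copy())
print(max(a_hat, key=a_hat.get))  # 0
print(round(a_hat[0], 3))         # 1.0
```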
Rescaling and Reordering. As discussed in Section 3.1, instead of deciding on the best matching scale (using the \(\mathrm{argmax}\) function), we propose to use suitable visualization techniques [41] allowing for a direct interpretation of the continuous-valued diatonic scale likelihoods. Beyond that, this strategy makes it possible to visually grasp ambiguities, e.g., in the case that two scales obtain comparable probabilities. To generate such visualizations, we obtain rescaled local analyses \(\mathbf{a}\!\in\!{\mathbb{R}}^{12}\) by using the \(\mathrm{softmax}\) function for suppressing the weak components and enhancing the large ones
\[a_{i}=\frac{\exp(\beta\,\hat{a}_{i})}{\sum_{j=-5}^{+6}\exp(\beta\,\hat{a}_{j})}\]
with \(i\!\in\!\{-5,\ldots,+6\}\) and a parameter \(\beta\in{\mathbb{R}}\) (inverse softmax temperature), which can be used for adjusting the level of enhancement. Due to the \(\ell_{1}\)-normalization involved here, we can interpret the analysis as pseudo-probabilities of diatonic scales, which we then visualize in grayscale (see Figure 1(c) for an example). We combine the analytical measurements \(\mathbf{a}\) for all musical measures to a matrix \(\mathcal{A}\in[0,1]^{12\times M}\) with \(M\) indicating the total number of measures. For the visualization step, we adopt a musical criterion for arranging the order of the scales following the circle of fifths (C, G, D, \(\ldots\)), thus accounting for the tonal proximity of fifth-related scales [14, 41]. In this context, scales notated with sharps (\(\sharp\)) are denoted \(+1\), \(+2\), and so forth, and scales notated with flats (\(\flat\)) as \(-1\), \(-2\), and so forth (due to enharmonic equivalence, \(-6 \;\widehat{=}\; +6\)).
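The softmax rescaling can be sketched as follows (a minimal example of our own; the shift by the maximum is a standard numerical-stability detail, not part of the rescaling described above):

```python
import numpy as np

def rescale(a_hat, beta=50.0):
    """Softmax rescaling of the 12 raw similarities a_hat with inverse
    temperature beta; the result is l1-normalized (sums to 1) and can be
    read as pseudo-probabilities of the 12 diatonic scales."""
    z = np.exp(beta * (a_hat - a_hat.max()))  # shift for numerical stability
    return z / z.sum()

a_hat = np.array([0.95, 0.90, 0.60] + [0.5] * 9)  # toy similarity values
a = rescale(a_hat)
print(round(a.sum(), 6))   # 1.0
print(a[0] > a[1] > a[2])  # True: enhancement preserves the ranking
```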
Parameter Choice. The choice of the analysis parameters (analysis window size \(w\) and scaling parameter \(\beta\)) has a crucial influence on the resulting visualization. In this article, however, we do not aim at optimizing this choice but leave it to the musicological experts, who can work with the visualizations in an interactive fashion. The idea of the visualizations is not to perform a pre-selection but to provide a user-friendly, intuitive tool that allows for adjustments. For this reason, we avoid hard decisions (e.g., a boundary detector) and instead aim for an explicit, easy-to-understand white-box approach. To demonstrate this approach, we set \(w=32\) measures and \(\beta=50\) in the following and show in a dedicated experiment (Figure 10) that the parameter choice does not impact our main experimental findings concerning the usefulness of deep pitch-class representations for this task. For a more detailed discussion of the effect of the window size, we refer to [42].
3.3 Experiments
We now present the results of our experiments for measuring tonal structures. As our central evaluation scenario, we reconsider the WRD dataset, which constitutes a prime example of a large-scale, tonally complex musical work. Based on the different pitch-class representations covered in Section 2, we apply the methodology explained in Section 3.2 for measuring diatonic scale probabilities.
Qualitative Evaluation. To gain an intuitive understanding of the musical differences between our three strategies, we start with a qualitative comparison for the first act of Die Walküre WWV 86 B-1, focusing on the version conducted by Herbert von Karajan. As for the underlying pitch-class representations, we consider the variant based on the piano score (denoted as SC), the IIR variant based on traditional signal processing (denoted as SP), and the DL variant based on the DeepResWide model trained in the cross-dataset split (now denoted as DL for the sake of brevity). Relying on these different representations, we obtain three measurement matrices \(\mathcal{A}_{\textsf{SC}},\mathcal{A}_{\textsf{SP}},\mathcal{A}_{\textsf{DL}}\in[0,1]^{12\times M}\). To visually compare these measurements, we make use of an additive color scheme loosely inspired by [47]. For computing our color scheme, we map the three measurement results to the red, green, blue (RGB) color space and invert this assignment
\[\mathrm{R}(n,m)=1-\mathcal{A}_{\textsf{SC}}(n,m),\quad \mathrm{G}(n,m)=1-\mathcal{A}_{\textsf{SP}}(n,m),\quad \mathrm{B}(n,m)=1-\mathcal{A}_{\textsf{DL}}(n,m)\]
with \(n\in\{-5,\ldots,6\}\) and \(m\in\{1,\ldots,M\}\). Basically, this corresponds to plotting the three matrices on top of each other in a transparent way, such that they merge to black or grayscale values in case all matrices are the same. The resulting colors from this assignment are shown in the legend of Figure 8. When occurring in isolation, an SC-based measurement results in cyan, the SP-based measurement in magenta, and the DL-based measurement in yellow. As soon as exactly two measurements match, we observe the mix colors blue (SC \(+\) SP), green (SC \(+\) DL), and red (SP \(+\) DL), respectively. If all three representations lead to the same measurement, this is displayed in black or grayscale.
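As we read this description, the inverted RGB assignment could be implemented as follows (an illustrative sketch; the array shapes and names are our own):

```python
import numpy as np

def color_matrix(A_sc, A_sp, A_dl):
    """Inverted RGB assignment: each (12, M) measurement matrix controls one
    channel, so agreement across all three yields black/grayscale pixels."""
    return np.stack([1 - A_sc, 1 - A_sp, 1 - A_dl], axis=-1)  # (12, M, 3)

# Toy check: a measurement present only in SC yields cyan (R=0, G=1, B=1)
A_sc = np.ones((12, 1))
A_sp = np.zeros((12, 1))
A_dl = np.zeros((12, 1))
rgb = color_matrix(A_sc, A_sp, A_dl)
print(rgb[0, 0].tolist())  # [0.0, 1.0, 1.0]
```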
In
Figure 8, we observe black—indicating agreement across all three representations—for a number of sections, e.g., in measure regions 50–110, 605–640, or 1,280–1,330. The latter region, for instance, starts in clear C major (with a chromatically figurated dominant chord), then circulates around this diatonic center, touching related keys such as F major and A minor but being quite stable against short altered chords. For many other regions, we do not observe such an agreement. Most deviations occur in neighboring rows corresponding to musically similar diatonic scales (differing by only one accidental, as in measures 925–960 or 1,345–1,400). However, we also see stronger deviations such as for measures 440–600, a fluctuative, recitative-like passage in minor, whose tonal centers G minor and A minor (from measure 535 on) can be observed from the SC and DL representation. Overall, we identify two dominating colors: green (
SC \(+\) DL together) and magenta (SP in isolation). This shows that, for a large part of this act, the measurement based on SP deviates from the score-based one, whereas the DL- and the SC-based measurements coincide. We take this observation as a first indication that DL leads to a higher consistency between audio- and score-based measurements of tonal structures.
Quantitative Evaluation. To investigate this observation in a more quantitative way, we now introduce a metric for comparing two measurements of diatonic scales. We tested several metrics including the CS. In practice, we found the most informative metric to be based on the Jensen–Shannon (JS) divergence \(\mathcal{J}\), which is a symmetrized version of the Kullback–Leibler divergence \(\mathcal{K}\) for comparing two probability distributions. For calculating this metric, we interpret the \(\ell_{1}\)-normalized vectors \(\mathbf{a}\in{\mathbb{R}}^{12}\) (columns of \(\mathcal{A}\)) as probability distributions. The JS divergence yields values \(\mathcal{J}(\mathbf{a},\mathbf{a}^{\prime})\!\in\![0,1]\) and is computed as follows:
\[\mathcal{J}(\mathbf{a},\mathbf{a}^{\prime})=\frac{1}{2}\,\mathcal{K}(\mathbf{a},\mathbf{m})+\frac{1}{2}\,\mathcal{K}(\mathbf{a}^{\prime},\mathbf{m})\]
with \(\mathbf{m}\!=\!1/2(\mathbf{a}\!+\!\mathbf{a}^{\prime})\) and
\[\mathcal{K}(\mathbf{a},\mathbf{m})=\sum_{i}a_{i}\log_{2}\frac{a_{i}}{m_{i}}.\]
To obtain a similarity measure (like the cosine measure used before), we define the JS similarity \(\mathcal{S}\)
\[\mathcal{S}(\mathbf{a},\mathbf{a}^{\prime})=1-\mathcal{J}(\mathbf{a},\mathbf{a}^{\prime})\]
such that \(\mathcal{S}(\mathbf{a},\mathbf{a}^{\prime})\in[0,1]\) and \(\mathcal{S}(\mathbf{a},\mathbf{a}^{\prime})=1\) if \(\mathbf{a}=\mathbf{a}^{\prime}\). For comparing two measurements for a whole act or the full Ring, we calculate the pairwise JS similarity for each musical measure and finally average over all measures.
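The JS similarity can be sketched as follows (a minimal NumPy implementation of our own based on the definitions above; treating zero entries of the first argument as contributing nothing is our own convention, following the usual limit \(0\log 0=0\)):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence K (base 2), with 0*log(0) treated as 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def js_similarity(a, a_prime):
    """JS similarity S = 1 - J for two l1-normalized 12-dim vectors."""
    m = 0.5 * (a + a_prime)
    j = 0.5 * kl(a, m) + 0.5 * kl(a_prime, m)
    return 1.0 - j

a = np.full(12, 1 / 12)                # uniform distribution
print(round(js_similarity(a, a), 6))   # 1.0 for identical inputs

b = np.zeros(12)
b[0] = 1.0                             # a very different distribution
print(0.0 <= js_similarity(a, b) <= 1.0)  # True: values stay in [0, 1]
```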
In Figure 9, we show the resulting JS similarities for the different pairs of representations. Let us first focus on Figure 9(a), which shows the results for the act WWV 86 B-1. For each color, the left (hatched) bar indicates the JS similarity computed on the Karajan version—corresponding to the measurement results visualized in Figure 8. The results confirm our qualitative observation from above. First, note that the similarity between the score-based and SP approaches is lowest with \(\mathcal{S}=0.870\) (SC–SP, blue). In contrast, using the DL approach, the resulting representation is much closer to the score-based one, as indicated by the value of \(\mathcal{S}=0.950\) (SC–DL, green). Finally, the similarity between the SP and the DL approach lies between the two previous combinations with \(\mathcal{S}=0.888\) (SP–DL, red). This confirms that SP clearly differs from both DL and SC. We repeat this experiment for all 16 versions of the act B-1, indicated by the right (solid) bar of each color in Figure 9(a), respectively. The findings confirm the trend observed for Karajan, with the SP strategy being even less similar to SC (\(\mathcal{S}=0.839\), blue). This indicates that versions other than Karajan's are acoustically more challenging for the SP approach but do not pose a problem for DL. In general, the JS similarities vary across versions: For SC–SP (blue), we obtain a standard deviation of 0.040. This variation decreases when using DL, with SC–DL (green) yielding a standard deviation of only 0.015. We conclude that the DL approach not only leads to a measurement of tonal structures much closer to the SC-based one, but that this similarity is also more robust against variations in interpretation and acoustic conditions, pointing to high generalization capabilities of the trained networks (recall that the networks have not seen any opera during training).
We now repeat this experiment for the full WRD dataset, considering all 11 acts for evaluation (Figure 9(b)). Overall, we observe lower JS similarities as compared to the previous experiment. This suggests that the act B-1 is a tonally simpler part of the Ring as compared to other acts and operas, thus confirming our observation for the pitch-class estimation task (Figure 6). However, the general trend remains the same. Looking at the mean over all versions (right, solid bars), the similarity between SC–SP is lowest at \(\mathcal{S}=0.791\) and the similarity between SC–DL highest at \(\mathcal{S}=0.904\). This is an encouraging finding, indicating that even for tonally complex, large-scale works such as the Ring, audio-based measurements of tonal structures lead to results very close to score-based ones when relying on a deep pitch-class representation. In summary, we conclude that computational approaches face particular challenges in the musical analysis of local keys or scales, whereas DL methods can effectively perform the technical front-end task of extracting pitch-class representations from audio.
Influence of Parameters. In the previous experiments, we always fixed the parameters for the template-based measurement of diatonic scales to a window size of \(w=32\) measures and a scaling parameter of \(\beta=50\). We finally want to test if and how this choice influences our central findings. To this end, we repeat the previous experiment on the full Ring in the Karajan version (Figure 9(b), left/hatched bars) for different values of the window size \(w\), which is a central parameter of the measurement. The pairwise similarities for the different window sizes and representations are shown in Figure 10. Overall, we observe that the measurements become more similar for larger window sizes. This may be an effect of averaging, where differences are suppressed when using larger windows. However, irrespective of the value of \(w\), DL and SC are most similar to each other, and the similarity between SC–SP is always substantially lower. This consistent behavior strongly supports our conclusion above: by using DL techniques, computational measurements of tonal structures based on audio recordings can become similar to those based on symbolic scores.