1. Introduction

Optical music recognition (OMR) (; ) is a classical and challenging area of document analysis that aims to convert images of written music into a machine-readable, encoded form. A crucial component of any OMR pipeline is a music object recognition (MOR) system. In recent years, MOR systems have improved greatly in performance thanks to the adoption of deep learning (; ) and the availability of large datasets (; , ).

Despite these advancements, we have identified two major roadblocks that hold current MOR systems back from reaching their full potential in practical, real-world settings. Even though deep neural networks have revolutionised computer vision tasks such as classification, object detection, segmentation, and image retrieval, they often fail to replicate their benchmark performance on new domains. This issue is attributed to cross-domain mismatch; the problems surrounding it are highlighted below.

Firstly, the currently available MOR training datasets are either synthetically generated or very high-quality scans that are visually very close to synthetic imagery. This causes the resulting detectors to perform very well on clean samples (; ), but to struggle significantly when confronted with the sub-optimal data quality common in real-world applications (later referred to as real-world data), such as scans of old or degraded pages or smartphone pictures taken under non-ideal conditions.

Secondly, deep neural networks are notoriously overconfident in their predictions – especially if an input lies outside the previously observed training data (). This has serious implications for the practical usability of MOR because quality control, which is typically performed manually by humans, is forced to check every detection with high diligence. This is particularly cumbersome for sheet music, where clusters of many tightly packed symbols are very common.

We attempt to resolve the issue of creating effective MOR systems for real-world sheet music images by casting it as a domain shift problem. To bridge the domain gap between synthetic and realistic images, we propose two approaches: ScoreAug (Section 5.1), which augments the training data to diversify its feature distribution, and unsupervised domain adaptation (UDA) (Section 5.2), which enables a model trained on one data distribution to also perform well on a different target distribution.

ScoreAug uses real-world, scanned blank pages with natural signs of degradation and combines them with the synthetic input from our initial dataset. By melding the two, realistic-looking samples are created on the fly, so that the model generalises better to real-world data, thereby bridging the domain shift.

Unsupervised domain adaptation addresses the cross-domain mismatch by additionally exploiting unlabelled target domain samples (in our experiments, the real-world images). In our research, we use a domain-adversarial loss to enforce that target (real-world data) embeddings lie close to the source (synthetic data) embeddings in a latent feature space. We bridge the gap between domains with the help of a binary domain discriminator, without needing the generative modelling capabilities of adversarial models.

We address the problem of overconfident predictions by using an ensemble method. Good ensembles result when the predictions of the ensemble members are accurate and have independent errors (). Thus, prediction confidence can be estimated over the predictions of several members and hence be better quantified.

We use SnapshotEnsemble (), a method that creates ensemble members at no additional training cost. At inference time, each ensemble member makes a prediction independent of the other members. These predictions are fused into an average prediction with higher accuracy and more reliable confidence ratings and thus facilitate subsequent quality control.

In summary, the contributions of this paper are as follows (cf. Figure 1):

  • A MOR test dataset (RealScores, see Section 4) that contains 14 pages of real-world sheet music. The annotations for this data were newly created by hand and follow the class definitions and data structure of DeepScoresV2 ();
  • ScoreAug (see Section 5.1): a sophisticated data augmentation scheme and training schedule using a combination of synthetic data (DeepScoresV2) and perturbations sourced from a diverse array of real-world documents (IMSLP), which are combined using randomised heuristics;
  • An adversarial discriminative method for implementing unsupervised domain adaptation (see Section 5.2) in MOR, finding domain-invariant representations for the distributions of synthetic (DeepScoresV2) and real-world (IMSLP) features and thereby bridging the gap between these domains in a latent feature space;
  • Trustworthy confidence ratings (Section 5.3) for symbol-level detections based on a prediction fusion algorithm that utilises the confidence scores of ensemble outputs to calculate average predictions and confidence ratings.
Figure 1 

Graphical overview of the methods we contribute: ScoreAug (Section 5.1) in the top row, unsupervised domain adaptation (Section 5.2) in the centre, and snapshot-ensemble-based confidence ratings (Section 5.3) at the bottom.

The rest of this paper is organised as follows: Section 2 gives a thorough introduction to making OMR more robust in practice by surveying related work. In Section 3 we introduce our baseline model, on which all of our solutions are built. Section 4 presents the RealScores dataset, a newly sourced and annotated small test set for real-world MOR. In Section 5 we present our proposed methods. Section 6 contains descriptions and results of all our experiments. Lastly, in Section 7 we draw conclusions and discuss possible future work.

2. Survey of Related Work

Music Object Detection Traditionally, OMR systems consisted of a cascade of components such as staff-line removal (; ), symbol segmentation () and symbol classification (), which were built using classical computer vision methods. With the advent of increased computing power and the availability of large-scale datasets (; ), deep-learning-based approaches () started to take over. Deep learning methods resulted in greatly increased performance in the above-mentioned tasks (). More recent work applies convolutional neural networks directly to the raw input data, making multi-step designs obsolete (; ; ). There are efforts to solve the whole OMR problem in a single step, as is state of the art in related fields such as text () or speech () recognition. However, due to the high complexity of music notation, all existing solutions focus on a simplified problem such as mensural notation () or monophonic scores, both typeset (; ) and handwritten ().

Input Data Augmentation Input data augmentation has a rich history in deep learning. However, it is mostly used to improve performance on a single domain. Typically, data augmentation consists of scaling, translations, and rotations (; ). On larger natural datasets such as ImageNet, more sophisticated transforms like random cropping, image flipping, and colour normalisation have become commonplace (). Generative adversarial networks have been employed to generate additional realistic training data (). Recently, automatic search for optimal augmentation strategies on a per-dataset basis has become standard ().

There has already been some effort to address the domain gap between existing datasets and real-world data using data augmentation in the context of MOR. Datasets altered to mimic realistic data have been created, either by applying a sequence of graphics filters () or by printing and scanning the data ().

To the best of our knowledge, this is the first work to present an input augmentation technique that combines algorithmic distortions with real-world perturbations for MOR.

Domain Adaptation for Object Detection UDA is an unsupervised learning approach () to transfer knowledge obtained from a source domain with labelled data to a target domain with unlabelled data. One of the fundamental approaches in UDA was proposed by Tzeng et al. (), creating a generalised framework for adversarial adaptation in image classification.

Recently, UDA methods for tasks beyond classification have attracted increasing attention, in particular for object detection, which is the primary focus of our OMR models. Chen et al. () wrote one of the pioneering works on this task. The authors observed image-level and instance-level shifts and proposed separate components to alleviate the domain discrepancy.

Adversarial approaches for discriminative UDA have recently shown strong results in object detection (; ; ). The primary goal of most of the adversarial approaches cited above is adversarial feature alignment between the source and target domain.

In the context of MOR, Mateiu et al. () employed a domain adversarial neural network () to enable the classification of individual handwritten symbols in old music manuscripts. Castellanos et al. () use UDA to improve document analysis (splitting the input into layers containing different information, e.g. staves, notes, or background) on historical music sheets.

To the best of our knowledge, this is the first work to employ and systematically evaluate UDA techniques for a full-fledged MOR system.

Confidence Ratings Most state-of-the-art approaches to estimate predictive uncertainty rely on ensembles (; ; ; ; ; ; ). Bayesian deep learning approaches like MC-dropout have interesting properties but fail to deliver in practice due to computational or technical constraints ().

Since we train our models for 1000+ epochs and the input images are large (i.e. require a lot of memory), we focus on approaches known as “economic ensembles”, such as HypernetEnsembles () or Masksembles (). For these methods, the computational and memory costs do not increase linearly with the number of ensemble members and thus scale well with large deep learning models.

SnapshotEnsemble was proposed by Huang et al. () and tries to achieve the seemingly paradoxical goal of producing an ensemble at no additional training cost. Their method leverages work on cyclic learning rate schedules (). They lower the learning rate at a very fast pace, thus encouraging the model to converge quickly to its first local minimum. Then the optimisation is continued with a higher learning rate to dislodge the model from this local minimum again. This procedure is repeated multiple times. At each local minimum, the model is saved (i.e. a snapshot is taken). Ensembling the snapshots results in consistently lower error rates than single models. In this work, we exclusively employ SnapshotEnsembles due to their minimal compute requirements.

To the best of our knowledge, this is the first work to employ uncertainty measures in the context of OMR and systematically evaluate their merits.

3. Baseline Model

All our experiments are based on the S2A-Net architecture (), which, unlike earlier methods (; ), allows for oriented detections. The S2A-Net is an anchor-based object detector that uses a single-shot alignment network to generate accurately oriented object detections. Its novel feature alignment and oriented detection modules are fed by a ResNet-based backbone () and feature pyramid networks ().

We achieve good results on the “oriented mode” of DeepScoresV2 () when training S2A-Nets by scaling the data by a factor of 0.5 and then using random crops of 1000 by 1000 pixels. We are able to conserve GPU memory whilst keeping high precision by using a single anchor ratio of 1.0 and a single anchor scale of 4. We train our models with SGD using a learning rate of α = 2.5·10⁻³ and a momentum of 0.9. Table 1 contains the average precision (AP) of our baseline model against two state-of-the-art models, which illustrates the competitive performance of our new S2A-Net based approach. The complete training details can be found in the published code.

Table 1

The AP at 0.5 overlap for our baseline model and two state-of-the-art models (DWD, Faster R-CNN (Tuggener et al., 2021)) on DeepScoresV2.


DeepScoresV2 dataset

Model             AP (overlap = 0.50)
Baseline model    89.3%
DWD               50.3%
Faster R-CNN      79.9%
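For reference, the key hyperparameters of this section can be summarised as follows. This is a minimal, illustrative sketch in the style of an mmdetection-like configuration; the field names are our own and do not necessarily match the published code.

```python
# Illustrative summary of the baseline S2A-Net hyperparameters (Section 3).
# Field names are hypothetical; see the published code for the exact configuration.
baseline_config = dict(
    data_pipeline=dict(
        rescale_factor=0.5,          # scale input pages by a factor of 0.5
        random_crop=(1000, 1000),    # random 1000 x 1000 pixel crops
    ),
    anchors=dict(
        ratios=[1.0],                # a single anchor ratio ...
        scales=[4],                  # ... and a single anchor scale to conserve GPU memory
    ),
    optimizer=dict(
        type="SGD",
        lr=2.5e-3,
        momentum=0.9,
    ),
)
```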

4. The RealScores Data

So far, no real-world test data has been available to benchmark models on. Such data is crucial to observe how well our models perform when facing a domain gap. To create a benchmark dataset for real-world OMR, we sourced digitised music scores from the International Music Score Library Project (IMSLP)/Petrucci Music Library. Of the downloaded music scores, only those with specific characteristics were considered for the new test set: the sheets had to be scans or photographs of music scores and be visibly non-synthetic, meaning that they exhibit scanning artefacts, discolourations, stains, folds, skewed angles, and other imperfections. Music scores that were handwritten, of very poor quality, or written in non-standard notation were considered out of scope for this work. The selected samples were then annotated by hand using ScorePad’s current OMR pipeline (). The resulting test set, which we name RealScores, consists of 12 music sheets with a total of 12,553 annotations. The annotations are stored in the same format that DeepScoresV2 () introduced. Due to the limited size of the test set, only 61 of the original 136 classes are present. Excerpts from two samples with their corresponding annotations are shown in Figure 2.

Figure 2 

Example snippets from two RealScores pages with ground truth annotations overlaid.

In a second step, we sourced a number of “blank” pages from the aforementioned Petrucci Music Library. This was possible because many uploaded music scores are sourced from completely scanned books, including the front and back covers. Such scans sometimes contain blank pages without any written music, but with all the perturbations that naturally occur on sheets of paper. This is a valuable source of real-world noise that can be overlaid with synthetic data. We manually looked through the sourced data for suitable blanks, converted them into images, and normalised their size to match the synthetic data of DeepScoresV2. A total of 51 such blank pages were selected – 30 of which have a significant portion of the sheet border visible, while 21 do not. Figure 3 shows six of those pages.

Figure 3 

Example blank pages.

5. Methods

In this section, we present our proposed methods to address domain shift and overconfidence. In Section 5.1 we propose a powerful data augmentation method, in Section 5.2 we present an alternative solution based on UDA, and in Section 5.3 we describe our scheme to produce confidence ratings.

5.1 Input Data Augmentation

We propose a sophisticated data augmentation scheme, which we call ScoreAug, to address the domain gap. With ScoreAug, input samples can be blurred, overlaid with salt-and-pepper-like noise, given irregular edges in the border area, rotated by a small angle, or augmented with other irregularities not found in a synthetic dataset like DeepScoresV2. Additionally, we go one step beyond these algorithmic perturbations and complement them by overlaying the samples on our blank pages from the RealScores data. Using this combination of augmentation techniques, we aim to bring synthetic data close enough to the real-world domain to train models that generalise to real-world inputs. For a given synthetic input image, we select one of the 51 blank pages. To increase variability, the blank page and the synthetic data undergo a variety of further augmentations, as shown in Table 2.

Table 2

Probabilities of augmentations as part of ScoreAug that can be applied to either the blanks, the synthetic scores, or both at the same time. Note that P_aug determines how likely the other augmentations (after the salt-and-pepper noise) are to be applied, in order not to feed only ScoreAugmented samples to the model. Our final model uses P_snp = 0%, P_aug = 30%, P_blur = 10%.


Augmentation                  Probability
Salt and Pepper Noise         P_snp
No Additional Augmentations   P_aug
Horizontal Flip               50%
Vertical Flip                 50%
Crop and Resize               20%
Randomise Brightness          50%
Higher Contrast               20%
Small Angle Rotation          60% (blanks), 60% (scores)
Additional Brightness         40%
Gaussian Blur                 P_blur

To ensure alignment with the transformed image data, the ground-truth bounding boxes undergo the same transformations. Upon completing these augmentations, foreground and background are merged by keeping the darker pixel at each position. Darker pixels thus overpower lighter shades, preserving the dark symbols of the augmented synthetic score (the foreground) and replacing the pixels of its white background with the darker pixels of the augmented blank page (the background). This yields results optically similar to real-world scanned music scores, as shown in Figure 4, and the behaviour can be adjusted to one’s needs via the hyperparameters (P_snp, P_aug, P_blur).
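The merge itself is straightforward. The following sketch illustrates it with NumPy, assuming both the augmented score and the augmented blank page are greyscale uint8 arrays of the same shape; function and variable names are ours and not taken from the released code.

```python
import numpy as np

def scoreaug_merge(score: np.ndarray, blank: np.ndarray) -> np.ndarray:
    """Overlay an augmented synthetic score on an augmented blank page.

    Both inputs are expected to be uint8 greyscale images of identical shape,
    where 0 is black and 255 is white. Taking the per-pixel minimum keeps the
    dark symbols of the synthetic score and replaces its white background with
    the darker, naturally degraded pixels of the blank page.
    """
    assert score.shape == blank.shape
    return np.minimum(score, blank)

# Example: a white page with one dark "symbol" pixel merged onto a grey blank.
score = np.full((4, 4), 255, dtype=np.uint8)
score[1, 2] = 20                               # a dark symbol pixel
blank = np.full((4, 4), 230, dtype=np.uint8)   # slightly discoloured paper
merged = scoreaug_merge(score, blank)          # symbol stays at 20, background becomes 230
```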

Figure 4 

ScoreAug examples (top right, bottom row) derived from the same synthetic sample (top left).

5.2 Unsupervised Adversarial Domain Adaptation

The most common approach to overcome a domain shift is supervised domain adaptation, where densely annotated images are required in the target domain (annotations generally involve instance-level bounding boxes for object detection). Such a solution would require the collection and annotation of a full-scale training dataset consisting of data from the target domain. This approach, therefore, would be tedious and lack the ability to scale, especially for detecting tiny objects in images that are cumbersome to annotate, such as notes in sheet music. Unsupervised domain adaptation (UDA), on the other hand, reduces the expense of annotation by only requiring annotations in the source domain.

Adversarial domain adaptation, which strives to minimise the domain dependency of an object detector via a domain-adversarial loss function utilising a discriminator, is a popular approach for UDA. As highlighted by Tzeng et al. (), adversarial domain adaptation is similar to generative adversarial learning, where a generator and discriminator are pitted against each other. For UDA, this concept is used to train a neural network to be unable to differentiate between two domains (in our case synthetic and real-world sheet music images) and ultimately show similar performance on source and target domain samples.

Here, the source domain is the DeepScoresV2 dataset. For our target domain data, we source non-annotated real-world images from IMSLP. Our system consists of a baseline S2A-Net (comprising a backbone network f^backbone_θ and an object detector f^detect_θ′), a gradient reversal layer, and a small domain classifier network f^domain_θ″. The network weights are denoted by θ, θ′ and θ″, respectively. The system is trained using two independent losses, the domain confusion loss L_domain and the object detection loss L_detect. The following paragraph gives an overview of each component; see Figure 5 for a graphical overview.

Figure 5 

Overview of our UDA system, with data, gradient, and label flow of step (I) shown in orange, of step (II) in green and of step (III) in blue.

The baseline S2A-Net is configured as described in Section 3. Here we start with networks that have been fully trained on DeepScoresV2 to ensure that the network filters are tuned to sheet music. The gradient reversal layer () can be viewed as a virtual layer in the network that is only active on the backward pass, inverting all gradients passing through it. This causes the layers preceding it in the forward pass to maximise the training loss (in our case f^backbone_θ maximising L_domain, causing the backbone embeddings to carry as little information about the domain as possible). f^domain_θ″ has the job of classifying whether an embedding generated by f^backbone_θ stems from a data point of the source domain or the target domain. L_domain is a binary classification loss based on the domain label of the input, as used in GANs (). Finally, L_detect is the standard S2A-Net loss used to train the whole object detector.
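For reference, the gradient reversal layer itself is only a few lines of PyTorch. The sketch below follows the standard construction (identity in the forward pass, negated and optionally scaled gradients in the backward pass) and is not taken from our released code.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, -lambda * gradient on the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Invert (and optionally scale) all gradients flowing towards the backbone.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)
```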

Training the whole system requires the following steps: (I) training f^domain_θ″ based on L_domain (θ″ is updated, θ is frozen); (II) taking the gradients generated by L_domain, propagating them through f^domain_θ″, applying the gradient reversal layer, and propagating the resulting gradients through f^backbone_θ, tuning θ to maximise L_domain; and (III) using labelled samples from the source domain to perform a regular S2A-Net training step (training f^backbone_θ and f^detect_θ′ based on L_detect). Steps (I) and (II) are pitted against each other in an adversarial game, with the goal of “deleting” information that allows the discriminator to differentiate between domains based on the output of the backbone, making the system “domain blind”. Step (III) is necessary since the backbone changes and the object detection head needs to adapt to the new embeddings accordingly.
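A compact sketch of steps (I) to (III) is given below. It reuses the grad_reverse helper from the previous sketch, assumes the backbone returns a pooled per-image embedding, and treats the detector as a callable that returns its training loss; all names are illustrative and do not correspond to our released code.

```python
import torch
import torch.nn.functional as F

def uda_training_step(backbone, detector, domain_clf,
                      opt_domain, opt_backbone, opt_detect,
                      source_imgs, source_targets, target_imgs):
    """One illustrative pass over steps (I)-(III)."""
    # (I) Train the domain classifier on frozen backbone embeddings.
    #     Source domain is labelled 1, target domain 0.
    with torch.no_grad():
        emb_src = backbone(source_imgs)
        emb_tgt = backbone(target_imgs)
    labels = torch.cat([torch.ones(len(emb_src)), torch.zeros(len(emb_tgt))])
    logits = domain_clf(torch.cat([emb_src, emb_tgt])).squeeze(-1)
    loss_domain = F.binary_cross_entropy_with_logits(logits, labels)
    opt_domain.zero_grad()
    loss_domain.backward()
    opt_domain.step()

    # (II) Update the backbone through the gradient reversal layer so that its
    #      embeddings maximise the domain loss, i.e. become domain-indistinguishable.
    emb = backbone(torch.cat([source_imgs, target_imgs]))
    logits = domain_clf(grad_reverse(emb)).squeeze(-1)
    loss_confusion = F.binary_cross_entropy_with_logits(logits, labels)
    opt_backbone.zero_grad()
    loss_confusion.backward()
    opt_backbone.step()          # only backbone parameters are updated here

    # (III) Regular detection step on labelled source data, so the detection
    #       head adapts to the shifted backbone embeddings.
    loss_detect = detector(backbone(source_imgs), source_targets)
    opt_detect.zero_grad()
    loss_detect.backward()
    opt_detect.step()
```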

Preliminary experiments show that without pretraining on the DeepScoresV2 dataset, UDA does not work at all. We conjecture that, since we are dealing with unlabelled target domain data, it is crucial to first learn good representations; otherwise, the embeddings produced by the backbone are too noisy and the domain classifier is unable to learn anything.

While implementing adversarial discriminative domain adaptation, we methodologically distinguish our work from Tzeng et al. () in the following ways:

  • We do not use separate networks for the source and target domains, which allows for efficient weight sharing.
  • We do not fix the weights of our object detection module, allowing it to adapt to the changes in the backbone.
  • We do not adopt asymmetric objective mappings for the feature extractor (in our case, the output of the S2A-Net backbone).

5.3 Confidence Ratings

A music sheet often contains hundreds of musical symbols. Even if OMR software works very reliably, the probability of at least some misclassifications is high simply due to the large number of symbols. To identify such misclassifications, it is helpful to analyse the prediction confidence of the model. However, deep classification networks are overconfident because their softmax layer, which assigns a probability to each class, tends to push these probabilities close to either 0 or 1. Therefore, the model outputs cannot directly be used as a useful measure of confidence ().

We mitigate this issue by using SnapshotEnsemble () (i.e. multiple predictions) to quantify the predictive uncertainty of our model. This method generates several snapshots (i.e. ensemble members) during training. During inference, each snapshot creates independent predictions of bounding boxes. We use the Weighted Box Fusion (WBF) () algorithm to fuse the bounding boxes. This method constructs the averaged bounding boxes with a corresponding confidence score by utilising the position and confidence scores of all proposed boxes. This overall score can be used as a measurement of the predictive uncertainty.
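As a minimal illustration of the fusion step, the open-source ensemble-boxes package provides a WBF implementation. The example below uses axis-aligned boxes in normalised [0, 1] coordinates; the values and thresholds shown here are purely illustrative.

```python
# Sketch of fusing per-snapshot detections with Weighted Box Fusion, using the
# open-source "ensemble-boxes" package (pip install ensemble-boxes).
from ensemble_boxes import weighted_boxes_fusion

# One entry per ensemble member: axis-aligned boxes [x1, y1, x2, y2], scores, labels.
boxes_list = [
    [[0.10, 0.20, 0.15, 0.25]],   # member 1
    [[0.11, 0.21, 0.16, 0.26]],   # member 2
]
scores_list = [[0.92], [0.85]]
labels_list = [[3], [3]]

fused_boxes, fused_scores, fused_labels = weighted_boxes_fusion(
    boxes_list, scores_list, labels_list,
    iou_thr=0.3,        # boxes of the same class with IoU >= 0.3 are fused (cf. Section 6.3)
    skip_box_thr=0.0,   # keep all proposed boxes before fusion
)
# fused_scores averages the member confidences and serves as the confidence rating.
```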

6. Experiments and Results

6.1 Input Data Augmentation

Experimental Setup To measure the impact of ScoreAug, we trained one baseline model with ScoreAug and another without it – each for 2000 epochs. Both models were trained on half-resolution, cropped samples to allow for larger batch sizes and faster convergence. During training, we used a learning rate of α = 2.5·10⁻³ throughout and linear warmup with a ratio of 1/3 for the first 500 epochs. We observed that the models lack global awareness (e.g. predicting noteheads at the corner of the page); therefore, we trained some models for an additional 200 epochs on full pages, a step we denote as Finalise. During evaluation, we make sure to only consider the results of classes that have at least one positive prediction per model. We evaluate our models using average precision (AP) at an overlap of 25%. We use this unusually low overlap threshold due to the very small object sizes common in MOR, which cause detections that are very usable in practice to often fall below the 50% mark.

Results Thanks to ScoreAug and Finalise, we observe an absolute increase in AP of roughly 40 percentage points compared to models trained for the same number of epochs without either (see Table 3). On the source dataset DeepScoresV2, the performance degrades slightly from 87.6% to 83.3%. However, this is to be expected since we move from a model specifically trained on and for synthetic data to one that can handle a much wider variety of data.

Table 3

The AP for the baseline model and models with ScoreAug and Finalise data augmentation on the DeepScoresV2 and the RealScores datasets.


DeepScoresV2 dataset

Model                  AP (overlap = 0.25)
Baseline               87.6%
ScoreAug               86.0%
ScoreAug + Finalise    83.3%

RealScores dataset

Model                  AP (overlap = 0.25)
Baseline               36.0%
ScoreAug               56.5%
ScoreAug + Finalise    73.7%

6.2 Unsupervised Adversarial Domain Adaptation

Experimental Setup Pretraining the S2A-Net for UDA proved highly beneficial, allowing us to train for relatively few epochs. In our experiments, the pretrained checkpoint had been trained for 250 epochs on the DeepScoresV2 dataset. We train our UDA pipeline for 30 further epochs. For the domain discriminator, the source domain label is set to 1 and the target domain label is set to 0. The input feature size is 128, based on the output from the S2A-Net, and the hidden feature size is 256. Batch normalisation () is applied to calculate the mean and standard deviation per dimension over the mini-batches. We use an Adam optimiser () with an initial learning rate of 0.01 and constant epoch-driven decay for both targets. We train with a batch size of 4 for both the source and target data loaders, which is the maximum batch size our GPUs allowed while keeping the number of samples balanced between domains. For a fair comparison, we follow the same configuration as the baseline model in terms of S2A-Net initialisation and DeepScoresV2 data loader structure. We limit data augmentations on the RealScores data to geometric transformations such as scaling by a factor of 0.5 and random cropping of 1000 by 1000 pixels, matching the DeepScoresV2 data loader.
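As an illustration of the discriminator described above, a network consistent with the stated sizes (128 input features, 256 hidden features, batch normalisation, one output logit for the domain label) could look as follows; the exact layer arrangement is an assumption on our part.

```python
import torch.nn as nn

class DomainDiscriminator(nn.Module):
    """Binary classifier: source domain (label 1) vs. target domain (label 0)."""

    def __init__(self, in_features: int = 128, hidden_features: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_features),
            nn.BatchNorm1d(hidden_features),   # per-dimension statistics over the mini-batch
            nn.ReLU(inplace=True),
            nn.Linear(hidden_features, 1),     # single logit for the domain label
        )

    def forward(self, x):
        return self.net(x)
```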

Results Table 4 shows the average precisions for UDA. For the target domain RealScores, we observe a gain of 12.9 percentage points, from 36.0% to 48.9%. UDA incurs the largest source domain performance loss, from 87.6% down to 72.4%. The gain is not quite as impressive as for ScoreAug, but we believe it shows the merit of this fully unsupervised approach for MOR. Additionally, the UDA models have been trained only on low-resolution samples to overcome current GPU constraints. It is likely that results would improve with higher-resolution images, which generally aid object detection models dealing with tiny objects, as is the case in MOR.

Table 4

The AP for the baseline model and a model with UDA on the DeepScoresV2 and the RealScores datasets.


DeepScoresV2 dataset

Model       AP (overlap = 0.25)
Baseline    87.6%
UDA         72.4%

RealScores dataset

Model       AP (overlap = 0.25)
Baseline    36.0%
UDA         48.9%

6.3 Producing Confidence Ratings

Experimental Setup We train different ensemble versions as well as a model not utilising ensembles on the DeepScoresV2 dataset. We train each model for 1000 epochs and with ScoreAug. For the model not utilising ensembles, we use a constant learning rate of α = 2.5·10⁻³. For the SnapshotEnsemble models, we start with the same learning rate and decrease it over 500 epochs to 1·10⁻⁵ using a single cosine annealing cycle. This rather long first cycle serves as pretraining of the model before the actual ensemble members are generated. To obtain the ensembles, we train the model for 500 additional epochs with shorter cosine annealing cycles of 10, 20, and 30 epochs, with learning rates in the range of 1·10⁻⁵ ≤ α ≤ 7.5·10⁻³.
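For illustration, the restarting cosine schedule can be written as a simple function of the epoch. The sketch below is our own formulation following Huang et al., using the cycle length and learning-rate range from this section.

```python
import math

def snapshot_lr(epoch: int, cycle_len: int = 20,
                lr_max: float = 7.5e-3, lr_min: float = 1e-5) -> float:
    """Cosine-annealed learning rate that restarts every `cycle_len` epochs.

    At the end of each cycle the rate approaches lr_min, a snapshot is saved,
    and the schedule jumps back to lr_max to dislodge the model from the
    current local minimum.
    """
    t = (epoch % cycle_len) / cycle_len   # position within the current cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

# Example: learning rates over one 20-epoch cycle.
rates = [snapshot_lr(e) for e in range(20)]   # starts at 7.5e-3, ends near 1e-5
```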

After training, the AP for a given overlap of 0.25 is calculated on the test set. In addition, the overlap between snapshots is calculated by using the output of one model as ground truth and the output of a second model as the prediction. We build a set of 10 ensemble members iteratively, starting with an empty set of snapshots. First, the snapshot with the highest AP is added. Afterwards, we add the snapshot that (i) has an overall AP at most 5% worse than that of our best model and (ii) has the smallest average overlap with the models already added to our set of ensemble members. We repeat this procedure until our set of ensemble members contains 10 snapshots.
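The greedy selection procedure can be summarised as follows. The sketch assumes precomputed per-snapshot APs (in [0, 1]) and a pairwise overlap matrix, and all names are illustrative rather than taken from our code.

```python
def select_ensemble(ap, overlap, size: int = 10, ap_margin: float = 0.05):
    """Greedy selection of snapshot indices for the ensemble.

    ap:       list of AP values in [0, 1], one per snapshot.
    overlap:  overlap[i][j] = AP obtained when snapshot i's output is scored
              against snapshot j's output used as ground truth.
    """
    best = max(range(len(ap)), key=lambda i: ap[i])
    selected = [best]
    while len(selected) < size:
        # Candidates whose AP is at most `ap_margin` below the best snapshot.
        candidates = [i for i in range(len(ap))
                      if i not in selected and ap[i] >= ap[best] - ap_margin]
        if not candidates:
            break
        # Pick the candidate with the smallest average overlap with the snapshots
        # already selected, to encourage independent errors among members.
        next_i = min(candidates,
                     key=lambda i: sum(overlap[i][j] for j in selected) / len(selected))
        selected.append(next_i)
    return selected
```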

The final predictions are the fused boxes generated by the WBF algorithm. The fusion threshold of WBF was set to 0.3, meaning that boxes with the same label and an intersection over union (IoU) of ≥ 0.3 are fused into one box. Since the WBF confidence score quantifies predictive uncertainty, we use it to remove predictions with a confidence score below 10% on the RealScores dataset. We found that this improves prediction quality and, in particular, reduces false positive rates. In contrast, the predictions on the DeepScoresV2 dataset are of high quality and no bounding boxes have to be removed based on their confidence score.

Results We found that ensembles yield better results than a single model. Table 5 shows the AP of the ensemble approaches as well as the AP of the model not utilising ensembles. It can be observed that ensembles improve the AP by up to 5.2 p.p. on the DeepScoresV2 dataset and up to 9.1 p.p. on the RealScores dataset compared to the model without ensembles. Of the three cosine annealing cycle lengths validated, the ensemble with a cycle length of 20 worked best on the DeepScoresV2 dataset, while the ensemble with a cycle length of 30 achieved the highest AP on the RealScores dataset. Compared to the results reported in Table 3, the ensemble approaches achieved a lower AP. Since the model’s prediction accuracy increases continuously with more training epochs, we suspect that this is due to the ensembles being trained for only 1000 epochs, while the models in Table 3 are trained for 2000 epochs. With an equal training budget, it is likely that the ensemble would achieve similar or slightly better results, since ensembles typically improve results ().

Table 5

The AP for the model not utilising ensembles and for ensemble models with different cosine annealing cycle lengths on the DeepScoresV2 and the RealScores datasets.


DeepScoresV2 dataset

Model                                  AP (overlap = 0.25)
ScoreAug                               82.1%
ScoreAug ensemble (cycle length 10)    85.6%
ScoreAug ensemble (cycle length 20)    87.3%
ScoreAug ensemble (cycle length 30)    83.4%

RealScores dataset

Model                                  AP (overlap = 0.25)
ScoreAug                               37.9%
ScoreAug ensemble (cycle length 10)    44.6%
ScoreAug ensemble (cycle length 20)    46.7%
ScoreAug ensemble (cycle length 30)    47.0%

Having high precision, and thus a low false-positive rate, is particularly important for OMR since it is easier for human annotators to find and label missing annotations than to identify wrong predictions. The confidence ratings can be used to reduce the number of false positive predictions and to increase precision. We found that removing predictions with a confidence score below 10% increases precision from 87.8% to 97.2% on the DeepScoresV2 dataset and from 35.7% to 41.9% on the RealScores dataset. Thus, retaining only predictions with a confidence score larger than a predefined threshold makes it possible to increase precision at the expense of recall.

Additionally, we assess the confidence ratings visually. Figure 6 shows result excerpts from model outputs with the predictions coloured according to their confidence score. These visualisations can provide useful insights for creating annotations. In accordance with the previous findings, we observed that analysing the predictions with low confidence is particularly helpful as wrong predictions usually have low confidence.

Figure 6 

Four cropped visualisation samples of predictions made by an ensemble. The colour of the bounding box indicates the model’s confidence (green means high confidence, and red means low confidence). For symbols with a confidence score below 30%, we plot not only the coloured bounding box but also the assigned label as well as the confidence score.

As in Section 6.1 (cf. Table 3), we examine the effect of using ScoreAug in combination with Finalise (i.e. training on full pages) for the ensemble approach. We perform Finalise for 50 epochs on each ensemble member obtained. The results with and without Finalise are shown in Table 6. The effectiveness of Finalise can be observed particularly clearly on the RealScores dataset. When training the ensemble approach for 1000 epochs with ScoreAug but without Finalise, we achieve an AP of 46.7%. If ScoreAug is combined with 50 additional Finalise epochs per ensemble member, the AP further improves to 63.6%. Finalise thus improves the results not only for single models but also for ensembles. On the source dataset, this once again leads to a small loss in performance, from 87.3% to 81.5%. However, Finalise in combination with SnapshotEnsembles has the disadvantage that, after creating the ensemble members, each member must be fine-tuned separately. This increases the duration of fine-tuning linearly with the number of ensemble members.

Table 6

The AP for the ensemble trained with a cosine annealing cycle length of 20. The model is trained once with ScoreAug only and once with ScoreAug in combination with 50 subsequent Finalise cycles.


DeepScoresV2 dataset

Ensemble (cycle length = 20)    AP (overlap = 0.25)
ScoreAug                        87.3%
ScoreAug & Finalise             81.5%

RealScores dataset

Ensemble (cycle length = 20)    AP (overlap = 0.25)
ScoreAug                        46.7%
ScoreAug & Finalise             63.6%

A combination of UDA with ScoreAug and snapshot ensembles is currently not indicated by the individual results: performance on RealScores exceeds 73% using ScoreAug + Finalise (see Table 3) and reaches beyond 63% when adding confidence ratings via ensembling (see Table 6), but only achieves ca. 49% using UDA (over a baseline of 36%, see Table 4). We do not expect a strong performance boost from a combination, especially since integration is technically uncertain: UDA relies on fully pretrained networks and on the fragile interplay between steps (I) to (III), which require specific learning rates (see Section 5.2).

7. Conclusions and Future Work

We presented multiple successful avenues towards improving the practical usability of OMR systems. Together, they improve the speed of professional-grade music digitisation on medium-quality scores by more than a factor of 3 over a strong baseline () for high-quality scores. Specifically, 11 minutes per page using the baseline could be reduced to 3.5 minutes on average within the digitisation pipeline of ScorePad AG, which consists of our MOR solution coupled with a proprietary backend that combines all the information and features a human-in-the-loop correction step. A fully manual transcription by professional musicians would take ca. 40 minutes.

To bridge the domain gap between synthetic datasets and real-world data, algorithmic input augmentation paired with noise sourced from aged real-world documents proved especially fruitful, increasing average detection precision by nearly 50% on the RealScores data. In conjunction with Finalise, the model performed twice as well as the model trained on synthetic data only.

Unsupervised adversarial domain adaptation showed some promise, outperforming the baseline by 36%. We believe this could be further improved by a UDA method working at very high resolution, to prevent the destruction of fine-grained information in the small patterns of music notation. Both domain adaptation methods had a marginally adverse effect on the performance of the model on synthetic data. In a practical setting, this can be alleviated by employing a data quality classifier and using multiple expert models for high- and low-quality data.

Using ensembles in combination with weighted box fusion has improved the AP by up to 9.1 p.p. Besides the better results, ensembles allow us to calculate reliable confidence ratings. These confidence ratings can be used to identify misclassifications and thus to simplify the manual post-processing of the predictions.

The current models cannot deal with handwritten music scores, which could be addressed in the future. Another drawback is the heavy reliance on exact interline scaling: we observed a steep performance drop-off when the interline space is outside the 8 to 12 pixel range. SnapshotEnsembles create ensembles without additional training costs by storing snapshots during a single training run. The resulting snapshots are fine-tuned separately using Finalise to achieve better performance on real-world data. Training would be more efficient if Finalise could be incorporated into the ensemble-generating training cycle and did not have to be done for each ensemble member separately.