1 Introduction

For the last decade, deep learning architectures have undoubtedly established the state of the art in computer vision tasks such as image classification [18, 38] or object detection [15, 33]. These architectures, e.g. ConvNets, consist of several convolutional layers, followed by a few fully connected layers and by a classification softmax layer with, for instance, a cross-entropy loss. ConvNets have also been used for regression, i.e. to predict continuous as opposed to categorical output values. Classical regression-based computer vision methods have addressed human pose estimation [39], age estimation [30], head-pose estimation [9], or facial landmark detection [37], to cite a few. Whenever ConvNets are used to learn a regression network, the softmax layer is replaced with a fully connected layer, with linear or sigmoid activations, and the \(L_2\) loss is often used to measure the discrepancy between prediction and target variables. It is well known that the \(L_2\) loss is strongly sensitive to outliers, potentially leading to poor generalization performance [17]. While robust regression is extremely well investigated in statistics, there have been only a handful of methods that combine robust regression with deep architectures.

Fig. 1. A Gaussian-uniform mixture model is combined with a ConvNet architecture to downgrade the influence of wrongly annotated targets (outliers) on the learning process.

This paper proposes to mitigate the influence of outliers when deep neural architectures are used to learn a regression function, ConvNets in particular. More precisely, we investigate a methodology specifically designed to cope with two types of outliers that are often encountered: (i) samples that lie at an abnormal distance away from the other training samples, and (ii) wrongly annotated training samples. On the one hand, abnormal samples are present in almost any measurement system and they are known to bias the regression parameters. On the other hand, deep learning requires very large amounts of data and the annotation process, be it automatic or manual, is inherently prone to errors. These unavoidable issues fully justify the development of robust deep regression.

The proposed method combines the representation power of ConvNets with the principled probabilistic mixture framework for outlier detection and rejection, see Fig. 1. We propose to use a Gaussian-uniform mixture (GUM) as the last layer of a ConvNet, and we refer to this combination as DeepGUM. The mixture model hypothesizes a Gaussian distribution for inliers and a uniform distribution for outliers. We interleave an EM procedure within stochastic gradient descent (SGD) to downgrade the influence of outliers in order to robustly estimate the network parameters. We empirically validate the effectiveness of the proposed method with four computer vision problems and associated datasets: facial and fashion landmark detection, age estimation, and head pose estimation. The standard regression measures are accompanied by statistical tests that discern between random differences and systematic improvements.

The remainder of the paper is organized as follows. Section 2 describes the related work. Section 3 describes in detail the proposed method and the associated algorithm. Section 4 describes extensive experiments with several applications and associated datasets. Section 5 draws conclusions and discusses the potential of robust deep regression in computer vision.

2 Related Work

Robust regression has long been studied in statistics [17, 24, 31] and in computer vision [6, 25, 36]. Robust regression methods are characterized by a high breakdown point, i.e. the smallest fraction of outlier contamination that can drive an estimator to arbitrarily poor results. Prominent examples are least trimmed squares, the Theil-Sen estimator and heavy-tailed distributions [14]. Several robust training strategies for artificial neural networks are also available [5, 27].

M-estimators, sampling methods, trimming methods and robust clustering are among the most used robust statistical methods. M-estimators [17] minimize the sum of a positive-definite function of the residuals and attempt to reduce the influence of large residual values. The minimization is carried out with weighted least squares techniques, with no proof of convergence for most M-estimators. Sampling methods [25], such as least-median-of-squares or random sample consensus (RANSAC), estimate the model parameters by solving a system of equations defined for a randomly chosen data subset. The main drawback of sampling methods is that they require complex data-sampling procedures and it is tedious to use them for estimating a large number of parameters. Trimming methods [31] rank the residuals and down-weight the data points associated with large residuals. They are typically cast into a (non-linear) weighted least squares optimization problem, where the weights are modified at each iteration, leading to iteratively re-weighted least squares problems. Robust statistics have also been addressed in the framework of mixture models and a number of robust mixture models were proposed, such as Gaussian mixtures with a uniform noise component [2, 8], heavy-tailed distributions [11], trimmed likelihood estimators [12, 28], or weighted-data mixtures [13]. Importantly, it has recently been reported that modeling outliers with a uniform component yields very good performance [8, 13].
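To make the re-weighting mechanism shared by M-estimators and trimming methods concrete, here is a minimal NumPy sketch of iteratively re-weighted least squares with Huber weights for a linear model; the function name and the tuning constant \(c=1.345\) (a standard choice for the Huber estimator) are ours, not taken from the cited works.

```python
import numpy as np

def irls_huber(X, y, c=1.345, n_iter=50):
    """Iteratively re-weighted least squares for y ~ X @ beta with Huber weights:
    residuals larger than c (in robust-scale units) are progressively down-weighted."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from ordinary least squares
    for _ in range(n_iter):
        res = y - X @ beta
        # Robust scale estimate: median absolute deviation (MAD).
        scale = 1.4826 * np.median(np.abs(res - np.median(res))) + 1e-12
        u = np.abs(res) / scale
        w = np.where(u <= c, 1.0, c / u)         # Huber weights: 1 for small residuals
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta
```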

Deep robust classification was recently addressed, e.g. in [3], which assumes that observed labels are generated from true labels with unknown noise parameters: a probabilistic model that maps true labels onto observed labels is proposed and an EM algorithm is derived. In [41], a probabilistic model is proposed that exploits the relationships between classes, images and noisy labels for large-scale image classification. This framework requires a dataset with explicit clean- and noisy-label annotations as well as an additional dataset annotated with a noise type for each sample, thus making the method difficult to use in practice. A classification algorithm based on a distillation process to learn from noisy data was also recently proposed [21].

Recently, deep regression methods were proposed, e.g. [19, 26, 29, 37, 39]. Despite the vast robust statistics literature and the importance of regression in computer vision, to the best of our knowledge there has been only one attempt to combine robust regression with deep networks [4], where robustness is achieved by minimizing Tukey's biweight loss function, i.e. an M-estimator. In this paper we take a radically different approach and propose to use robust mixture modeling within a ConvNet. We conjecture that while inlier noise follows a Gaussian distribution, outlier errors are uniformly distributed over the volume occupied by the data. Mixture modeling provides a principled way to characterize data points individually, based on posterior probabilities. We propose an algorithm that interleaves a robust mixture model with network training, i.e. alternates between EM and SGD. EM evaluates data-posterior probabilities which are then used to weight the residuals used by the network loss function and hence to downgrade the influence of samples drawn from the uniform distribution. Then, the network parameters are updated, which in turn are used by EM. A prominent feature of the algorithm is that it requires neither annotated outlier samples nor prior information about their percentage in the data. This is in contrast with [41], which requires explicit inlier/outlier annotations, and with [4], which uses a fixed hyperparameter (\(c=4.6851\)) that excludes samples with high residuals from SGD.

3 Deep Regression with a Robust Mixture Model

We assume that the inlier noise follows a Gaussian distribution while the outlier error follows a uniform distribution. Let \(\varvec{x}\in \mathbb {R}^M\) and \(\varvec{y}\in \mathbb {R}^D\) be the input image and the output vector with dimensions M and D, respectively, with \(D\ll M\). Let \(\varvec{\phi }\) denote a ConvNet with parameters \(\varvec{w}\) such that \(\varvec{y}=\varvec{\phi }(\varvec{x},\varvec{w})\). We aim to train a model that detects outliers and downgrades their role in the prediction of the network output, without any prior information about the percentage and spread of outliers. The probability of \(\varvec{y}\) conditioned on \(\varvec{x}\) follows a Gaussian-uniform mixture model (GUM):

$$\begin{aligned} p(\varvec{y}|\varvec{x};\varvec{\theta },\varvec{w}) = \pi \, \mathcal {N}(\varvec{y};\varvec{\phi }(\varvec{x},\varvec{w}),\mathbf{\Sigma }) + (1-\pi )\, \mathcal {U}(\varvec{y};\gamma ), \end{aligned}$$
(1)

where \(\pi \) is the prior probability of an inlier sample, \(\gamma \) is the normalization parameter of the uniform distribution and \(\mathbf{\Sigma }\in \mathbb {R}^{D\times D}\) is the covariance matrix of the multivariate Gaussian distribution. Let \(\varvec{\theta }=\{\pi ,\gamma ,\mathbf{\Sigma }\}\) be the parameter set of GUM. At training time we estimate the parameters of the mixture model, \(\varvec{\theta }\), and of the network, \(\varvec{w}\). An EM algorithm is used to estimate the former together with the responsibilities \(r_n\), which are plugged into the network's loss, minimized using SGD so as to estimate the latter.

3.1 EM Algorithm

Let a training dataset consist of N image-vector pairs \(\{\varvec{x}_n,\varvec{y}_n\}_{n=1}^N\). At each iteration, EM alternates between evaluating the expected complete-data log-likelihood (E-step) and updating the parameter set \(\varvec{\theta }\) conditioned on the network parameters (M-step). In practice, the E-step evaluates the posterior probability (responsibility) of an image-vector pair n being an inlier:

$$\begin{aligned} r_{n}(\varvec{\theta }^{(i)}) = \frac{\pi ^{(i)} \mathcal {N}(\varvec{y}_n;\varvec{\phi }(\varvec{x}_n,\varvec{w}^{(c)}),\mathbf{\Sigma }^{(i)})}{\pi ^{(i)} \mathcal {N}(\varvec{y}_n;\varvec{\phi }(\varvec{x}_n,\varvec{w}^{(c)}),\mathbf{\Sigma }^{(i)})+(1-\pi ^{(i)})\gamma ^{(i)}}, \end{aligned}$$
(2)

where (i) denotes the EM iteration index and \(\varvec{w}^{(c)}\) denotes the currently estimated network parameters. The posterior probability of the n-th data pair being an outlier is \(1-r_n(\varvec{\theta }^{(i)})\). The M-step updates the mixture parameters \(\varvec{\theta }\) with:

$$\begin{aligned} \mathbf{\Sigma }^{(i+1)}&= \sum _{n=1}^N r_{n}(\varvec{\theta }^{(i)}) \varvec{\delta }_n^{(i)}\varvec{\delta }_n^{(i)\top },\end{aligned}$$
(3)
$$\begin{aligned} \pi ^{(i+1)}&=\sum _{n=1}^Nr_{n}(\varvec{\theta }^{(i)})/N,\end{aligned}$$
(4)
$$\begin{aligned} \frac{1}{\gamma ^{(i+1)}}&= \prod _{d=1}^{D} 2\sqrt{3\left( C^{(i+1)}_{2d}-\left( C^{(i+1)}_{1d}\right) ^2\right) }, \end{aligned}$$
(5)

where \(\varvec{\delta }_n^{(i)} = \varvec{y}_n-\varvec{\phi }(\varvec{x}_n;\varvec{w}^{(c)})\), and \(C_1\) and \(C_2\) are the first- and second-order centered data moments of the outliers, computed as follows (\(\delta _{nd}^{(i)}\) denotes the d-th entry of \(\varvec{\delta }_n^{(i)}\)):

$$\begin{aligned} C_{1d}^{(i+1)}=\frac{1}{N}\sum _{n=1}^N\frac{(1-r_{n}(\varvec{\theta }^{(i)}))}{1- \pi ^{(i+1)} }\delta _{nd}^{(i)},\; C_{2d}^{(i+1)}=\frac{1}{N}\sum _{n=1}^N\frac{(1-r_{n}(\varvec{\theta }^{(i)}))}{1- \pi ^{(i+1)}}\left( \delta _{nd}^{(i)}\right) ^2. \end{aligned}$$
(6)

The iterative estimation of \(\gamma \) as just proposed has an advantage over using a constant value based on the volume of the data, as done in robust mixture models [8]. Indeed, \(\gamma \) is updated using the actual volume occupied by the outliers, which increases the ability of the algorithm to discriminate between inliers and outliers.
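For concreteness, the following NumPy sketch implements one EM iteration under the image-wise outlier model; `delta` stacks the residuals \(\varvec{\delta }_n\) computed with the current network, and the normalization of \(\mathbf{\Sigma }\) by the total responsibility is our reading of (3). This illustrates the equations above and is not the authors' implementation.

```python
import numpy as np

def em_iteration(delta, pi, Sigma, gamma):
    """One EM iteration on the GUM parameters; delta has shape (N, D)."""
    N, D = delta.shape
    # E-step, Eq. (2): posterior probability that each sample is an inlier.
    maha = np.einsum('nd,de,ne->n', delta, np.linalg.inv(Sigma), delta)
    gauss = np.exp(-0.5 * maha) / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    r = pi * gauss / (pi * gauss + (1 - pi) * gamma)
    # M-step, Eqs. (3)-(5).
    pi_new = r.mean()                                      # Eq. (4)
    Sigma_new = (r[:, None] * delta).T @ delta / r.sum()   # Eq. (3), normalized
    w_out = ((1 - r) / (1 - pi_new))[:, None]              # outlier weights of Eq. (6)
    C1 = (w_out * delta).mean(axis=0)                      # first-order moments
    C2 = (w_out * delta ** 2).mean(axis=0)                 # second-order moments
    gamma_new = 1.0 / np.prod(2.0 * np.sqrt(3.0 * (C2 - C1 ** 2)))  # Eq. (5)
    return r, pi_new, Sigma_new, gamma_new
```

Equation (5) can be read as a method-of-moments fit of an axis-aligned uniform box: a uniform distribution with variance \(v\) has a support of length \(2\sqrt{3v}\), and \(\gamma \) is the inverse of the resulting box volume.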

Another prominent advantage of DeepGUM for robustly predicting multidimensional outputs is its flexibility for handling the granularity of outliers. Consider for example the problem of locating landmarks in an image. One may want to devise a method that disregards outlying landmarks rather than the whole image. In this case, one may use a GUM model for each landmark category. In the case of two-dimensional landmarks, this induces D/2 covariance matrices of size \(2\times 2\) (D is the dimensionality of the target space). Similarly, one may use a coordinate-wise outlier model, namely D scalar variances. Finally, one may use an image-wise outlier model, i.e. the model detailed above. This flexibility is an attractive property of the proposed model as opposed to [4], which uses a coordinate-wise outlier model.
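As an illustration (the shapes below are hypothetical), the three granularities only differ in how the residual tensor is reshaped before being fed to the mixture:

```python
import numpy as np

N, L = 1000, 8                       # e.g. 8 two-dimensional landmarks, D = 2L
delta = np.random.randn(N, L, 2)     # residuals, one 2-D vector per landmark

delta_image = delta.reshape(N, 2 * L)     # image-wise: one D-dim GUM, r of shape (N,)
delta_landmark = delta                    # landmark-wise: L 2-D GUMs, r of shape (N, L)
delta_coord = delta.reshape(N, 2 * L, 1)  # coordinate-wise: D scalar GUMs, r of shape (N, D)
```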

3.2 Network Loss Function

As already mentioned, we use SGD to estimate the network parameters \(\varvec{w}\). Given the updated GUM parameters estimated with EM, \(\varvec{\theta }^{(c)}\), the regression loss function is weighted with the responsibility of each data pair:

$$\begin{aligned} \mathcal {L}_{\textsc {deepgum}}=\sum _{n=1}^Nr_{n}(\varvec{\theta }^{(c)}){||\varvec{y}_n-\varvec{\phi }(\varvec{x}_n; \varvec{w})||}^2_2. \end{aligned}$$
(7)
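In a deep learning framework, (7) is simply a re-weighted squared error in which the responsibilities act as constants; a minimal PyTorch-style sketch (ours, not the authors' code):

```python
import torch

def deepgum_loss(y_pred, y_true, r):
    """Responsibility-weighted L2 loss of Eq. (7); r holds the EM responsibilities
    and is detached so that gradients flow only through the network prediction."""
    sq_err = ((y_true - y_pred) ** 2).sum(dim=1)  # ||y_n - phi(x_n; w)||_2^2
    return (r.detach() * sq_err).sum()
```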
Fig. 2. Loss gradients for Biweight (black), Huber (cyan), \(L_2\) (magenta), and DeepGUM (remaining colors). Huber and \(L_2\) overlap up to \(\delta =4.6851\) (the plots are truncated along the vertical coordinate). DeepGUM is shown for different values of \(\pi \) and \(\gamma \), although in practice they are estimated via EM. The gradients of DeepGUM and Biweight vanish for large residuals. DeepGUM offers some flexibility over Biweight thanks to \(\pi \) and \(\gamma \). (Color figure online)

With this formulation, the contribution of a training pair to the loss gradient vanishes (i) if the sample is an inlier with small error (\(\Vert \varvec{\delta }_n\Vert _2\rightarrow 0,r_n\rightarrow 1\)) or (ii) if the sample is an outlier (\(r_n\rightarrow 0\)). In both cases, the network will not back propagate any error. Consequently, the parameters \(\varvec{w}\) are updated only with inliers. This is graphically shown in Fig. 2, where we plot the loss gradient as a function of a one-dimensional residual \(\delta \), for DeepGUM, Biweight, Huber and \(L_2\). For fair comparison with Biweight and Huber, the plots correspond to a unit variance (i.e. standard normal, see discussion following Eq. (3) in [4]). We plot the DeepGUM loss gradient for different values of \(\pi \) and \(\gamma \) to discuss different situations, although in practice all the parameters are estimated with EM. We observe that the gradient of the Huber loss increases linearly with \(\delta \), until reaching a stable point (corresponding to \(c=4.6851\) in [4]). Conversely, the gradient of both DeepGUM and Biweight vanishes for large residuals (i.e. \(\delta >c\)). Importantly, DeepGUM offers some flexibility as compared to Biweight. Indeed, we observe that when the amount of inliers increases (large \(\pi \)) or the spread of outliers increases (small \(\gamma \)), the importance given to inliers is higher, which is a desirable property. The opposite effect takes place for lower amounts of inliers and/or reduced outlier spread.
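To spell out the vanishing-gradient argument, for a one-dimensional residual \(\delta \) with unit variance, and viewing the responsibility as a function of the residual as in Fig. 2, the plotted gradient magnitude is

$$\begin{aligned} \Big |\frac{\partial \mathcal {L}}{\partial \delta }\Big | = 2\, r(\delta )\, |\delta |, \qquad r(\delta ) = \frac{\pi \, \mathcal {N}(\delta ;0,1)}{\pi \, \mathcal {N}(\delta ;0,1) + (1-\pi )\gamma }. \end{aligned}$$

Since \(\mathcal {N}(\delta ;0,1)\) decays as \(e^{-\delta ^2/2}\) while the uniform term stays constant, \(r(\delta )\,\delta \rightarrow 0\) for large residuals; increasing \(\pi \) or decreasing \(\gamma \) moves the point where the uniform term starts to dominate, which is the flexibility discussed above.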

3.3 Training Algorithm

In order to train the proposed model, we assume the existence of training and validation datasets, denoted \(\mathcal {T}=\{\varvec{x}_n^\textsc {t},\varvec{y}_n^\textsc {t}\}_{n=1}^{N_\textsc {t}}\) and \(\mathcal {V}=\{\varvec{x}_n^\textsc {v},\varvec{y}_n^\textsc {v}\}_{n=1}^{N_\textsc {v}}\), respectively. The training alternates between the unsupervised EM algorithm of Sect. 3.1 and the supervised SGD algorithm of Sect. 3.2, see Algorithm 1. EM takes as input the training set, alternates between responsibility evaluation (2) and mixture parameter updates (3)-(5), and iterates until convergence, namely until the mixture parameters do not evolve anymore. The current mixture parameters are then used to evaluate the responsibilities of the validation set. The SGD algorithm takes as input the training and validation sets as well as the associated responsibilities. In order to prevent over-fitting, we perform early stopping on the validation set with a patience of K epochs.

Notice that the training procedure requires neither a specific annotation of outliers nor the ratio of outliers present in the data. The procedure is initialized by executing SGD, as just described, with all the samples assumed to be inliers, i.e. \(r_n=1, \forall n\). Algorithm 1 is stopped when \(\mathcal {L}_\textsc {DEEPGUM}\) does not decrease anymore. It is important to notice that we do not need to constrain the model to avoid the trivial solution in which all the samples are considered as outliers. This is because, after the first SGD execution, the network can already discriminate between the two categories. In the extreme case where DeepGUM would consider all the samples as outliers, the algorithm would stop after the first SGD run and output the initial model.

Algorithm 1. The DeepGUM training procedure, alternating between EM (Sect. 3.1) and SGD (Sect. 3.2).
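To make the alternation of Algorithm 1 concrete, the following self-contained toy sketch replaces the ConvNet with a scalar linear model \(\phi (x;w)=wx\) and the SGD phase with its closed-form weighted least-squares solution; all data, initializations and constants are ours, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 30% of the targets are wrongly annotated (uniform outliers).
N = 500
x = rng.uniform(-1.0, 1.0, N)
y = 2.0 * x + rng.normal(0.0, 0.1, N)            # inliers: Gaussian noise
out = rng.random(N) < 0.3
y[out] = rng.uniform(-4.0, 4.0, out.sum())       # outliers: wrong annotations

w = np.sum(x * y) / np.sum(x * x)                # init: plain L2 fit (all r_n = 1)
for _ in range(20):                              # outer loop of Algorithm 1
    delta = y - w * x                            # residuals under the current model
    pi, var, gamma = 0.9, np.var(delta), 1.0 / np.ptp(delta)
    for _ in range(100):                         # EM on the mixture parameters
        gauss = np.exp(-0.5 * delta**2 / var) / np.sqrt(2.0 * np.pi * var)
        r = pi * gauss / (pi * gauss + (1.0 - pi) * gamma)          # E-step, Eq. (2)
        pi = r.mean()                                               # Eq. (4)
        var = np.sum(r * delta**2) / np.sum(r)                      # Eq. (3), normalized
        w_out = (1.0 - r) / (1.0 - pi)
        C1, C2 = np.mean(w_out * delta), np.mean(w_out * delta**2)
        gamma = 1.0 / (2.0 * np.sqrt(3.0 * max(C2 - C1**2, 1e-9)))  # Eq. (5)
    w = np.sum(r * x * y) / np.sum(r * x * x)    # weighted "SGD" step (closed form)

print(f"robust estimate of w: {w:.3f} (ground truth: 2.0)")
```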

Since EM provides the data covariance matrix \(\mathbf{\Sigma }\), it may be tempting to use the Mahalanobis norm instead of the \(L_2\) norm in (7). The covariance matrix is narrow along output dimensions with low-amplitude noise and wide along dimensions with high-amplitude noise. The Mahalanobis distance would give equal importance to low- and high-amplitude noise dimensions which is not desired. Another interesting feature of the proposed algorithm is that the posterior \(r_n\) weights the learning rate of sample n as its gradient is simply multiplied by \(r_n\). Therefore, the proposed algorithm automatically selects a learning rate for each individual training sample.

4 Experiments

The purpose of the experimental validation is two-fold. First, we empirically validate DeepGUM with three datasets that are naturally corrupted with outliers. The validations are carried out with the following applications: fashion landmark detection (Sect. 4.1), age estimation (Sect. 4.2) and head pose estimation (Sect. 4.3). Second, we delve into the robustness of DeepGUM and analyze its behavior in comparison with existing robust deep regression techniques by corrupting the annotations with an increasing percentage of outliers on the facial landmark detection task (Sect. 4.4).

We systematically compare DeepGUM with the standard \(L_2\) loss, the Huber loss and the Biweight loss (used in [4]). In all these cases, we use the VGG-16 architecture [35] pre-trained on ImageNet [32]. We also tried to use the architecture proposed in [4], but we were unable to reproduce the results reported in [4] on the LSP and Parse datasets using the code provided by the authors. Therefore, for the sake of reproducibility and for a fair comparison between different robust loss functions, we used VGG-16 in all our experiments. Following the recommendations from [20], we fine-tune the last convolutional block and both fully connected layers with a mini-batch of size 128 and a learning rate set to \(10^{-4}\). The fine-tuning starts with 3 epochs of the \(L_2\) loss, before exploiting either the Biweight, Huber or DeepGUM loss. When using any of these three losses, the network output is normalized with the median absolute deviation (as in [4]), computed on the entire dataset after each epoch. Early stopping with a patience of \(K=5\) epochs is employed and the data is augmented using mirroring.
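For reference, a common formulation of the median absolute deviation used for this normalization (a sketch; the exact variant used in [4] may differ in details):

```python
import numpy as np

def mad(residuals):
    """Median absolute deviation per output dimension; the 1.4826 factor makes it
    a consistent estimator of the standard deviation under Gaussian noise."""
    med = np.median(residuals, axis=0)
    return 1.4826 * np.median(np.abs(residuals - med), axis=0)
```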

In order to evaluate the methods, we report the mean absolute error (MAE) between the regression target and the network output over the test set. Inspired by [20], we complete the evaluation with statistical tests that allow us to point out when the differences between methods are systematic and statistically significant, rather than due to chance. Statistical tests are run on per-image regression errors and therefore can only be applied to the methods for which the code is available, not to average errors reported in the literature; in the latter case, only the MAE is available. In practice, we use the non-parametric Wilcoxon signed-rank test [40] to assess whether the null hypothesis (the median difference between pairs of observations is zero) holds. We denote the statistical significance with \(^*\), \(^{**}\) or \(^{***}\), corresponding to a p-value (the conditional probability, given that the null hypothesis is true, of obtaining a test statistic as extreme or more extreme than the one observed) smaller than \(p=0.05\), \(p=0.01\) or \(p=0.001\), respectively. We only report the statistical significance of the methods with the lowest MAE. For instance, A\(^{***}\) means that the probability that method A is equivalent to any other method is less than \(p=0.001\).
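Such a test is a single call with SciPy; the per-image errors below are synthetic placeholders, not results from the paper:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
err_a = rng.gamma(2.0, 1.0, 1000)              # per-image errors of method A (synthetic)
err_b = err_a + rng.normal(0.05, 0.2, 1000)    # method B: slightly worse on average
stat, p = wilcoxon(err_a, err_b)               # H0: median paired difference is zero
print(f"p = {p:.1e}")                          # p < 0.05 / 0.01 / 0.001 -> * / ** / ***
```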

4.1 Fashion Landmark Detection

Visual fashion analysis presents a wide spectrum of applications such as cloth recognition, retrieval, and recommendation. We employ the fashion landmark dataset (FLD) [22] that includes more than 120K images, where each image is labeled with eight landmarks. The dataset is equally divided into three subsets: upper-body clothes (6 landmarks), full-body clothes (8 landmarks) and lower-body clothes (4 landmarks). We randomly split each subset into test (5K), validation (5K) and training (\({\sim }30K\)) sets. Two metrics are used: the mean absolute error (MAE) of the landmark localization and the percentage of failures (landmarks detected farther from the ground truth than a given threshold). We employ landmark-wise \(r_{n}\).

Table 1. Mean absolute error on the upper-body subset of FLD, per landmark and on average. The landmarks are left (L) and right (R) collar (C), sleeve (S) and hem (H). The results of DFA are from [23] and therefore do not take part in the statistical comparison.

Table 1 reports the results obtained on the upper-body subset of the fashion landmark dataset (additional results on the full-body and lower-body subsets are included in the supplementary material). We report the mean absolute error (in pixels) for each landmark individually, and the overall average (last column). While for this subset we can compare with the very recent results reported in [23], for the others there are no previously reported results. Generally speaking, we outperform all other baselines on average, but also on each of the individual landmarks. The only exception is the comparison against the method utilizing five VGG pipelines to estimate the position of the landmarks. Although this method reports slightly better performance than DeepGUM for some columns of Table 1, we recall that we are using a single VGG as front-end, and therefore the representation power cannot be the same as that of a pipeline employing five VGGs trained for tasks such as pose estimation and cloth classification, which clearly aid the fashion landmark estimation task.

Interestingly, DeepGUM yields better results than \(L_2\) regression and a major improvement over Biweight [4] and Huber [16]. This behavior is systematic for all fashion landmarks and statistically significant (with \(p<0.001\)). In order to better understand this behavior, we computed the percentage of outliers detected by DeepGUM and Biweight, which are \(3\%\) and \(10\%\) respectively (after convergence). We believe that within this difference (\(7\%\) corresponds to 2.1K images) there are mostly “difficult” inliers, from which the network could learn a lot (and does so in DeepGUM) if they were not discarded, as happens with Biweight. This illustrates the importance of rejecting the outliers while keeping the inliers in the learning loop, and exhibits the robustness of DeepGUM in doing so. Figure 3 displays a few landmarks estimated by DeepGUM.

Fig. 3. Sample fashion landmarks detected by DeepGUM.

Fig. 4. Results on the CACD dataset: (left) mean absolute error and (right) images considered as outliers by DeepGUM; the annotation is displayed below each image.

4.2 Age Estimation

Age estimation from a single face image is an important task in computer vision with applications in access control and human-computer interaction. This task is closely related to the prediction of other biometric and facial attributes, such as gender, ethnicity, and hair color. We use the cross-age celebrity dataset (CACD) [7] that contains 163,446 images of 2,000 celebrities. The images are collected from search engines using the celebrity's name and desired year (from 2004 to 2013). The dataset is split into three parts: 1,800 celebrities are used for training, 80 for validation and 120 for testing. The validation and test sets are manually cleaned whereas the training set is noisy. In our experiments, we report results using image-wise \(r_{n}\).

Apart from DeepGUM, \(L_2\), Biweight and Huber, we also compare to the age estimation method based on deep expectation (Dex) [30], which was the winner of the Looking at People 2015 challenge. This method uses the VGG-16 architecture and poses the age estimation problem as a classification problem followed by a softmax expected-value refinement. Regression-by-classification strategies have also been proposed for memorability and virality [1, 34]. We report results with two different approaches using Dex: first, our implementation of the original Dex model; second, we add the GUM model on top of the Dex architecture, an architecture we term DexGUM.
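The expected-value refinement of Dex amounts to taking the softmax expectation over discretized age bins; a short sketch, where the one-year 0-100 discretization and the stand-in probabilities are our assumptions:

```python
import numpy as np

ages = np.arange(0, 101)                             # assumed one-year age bins
probs = np.random.dirichlet(np.ones(101), size=4)    # stand-in softmax outputs, (N, 101)
age_pred = probs @ ages                              # expected age per image
```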

The table in Fig. 4 reports the results obtained on the CACD test set for age estimation. We report the mean absolute error (in years) for six different methods. We can easily observe that DeepGUM exhibits the best results: 5.08 years of MAE (0.7 years better than \(L_2\)). Importantly, the architectures using GUM (DeepGUM followed by DexGUM) are the ones offering the best performance. This claim is supported by the results of the statistical tests, which say that DexGUM and DeepGUM are statistically better than the rest (with \(p<0.001\)), and that there are no statistical differences between them. This is further supported by the histogram of the error included in the supplementary material. DeepGUM considered that \(7\%\) of the images were outliers, and thus these images were undervalued during training. The images in Fig. 4 correspond to outliers detected by DeepGUM during training, and illustrate its ability to detect outliers. Since the dataset was automatically annotated, it is prone to corrupted annotations. Indeed, the age of each celebrity is automatically annotated by subtracting the date of birth from the picture time-stamp. Intuitively, this procedure is problematic since it assumes that the automatically collected and annotated images show the right celebrity and that the time-stamp and date of birth are correct. Our experimental evaluation clearly demonstrates the benefit of a robust regression technique to operate on datasets populated with outliers.

4.3 Head Pose Estimation

The McGill real-world face video dataset [9] consists of 60 videos (a single participant per video, 31 women and 29 men) recorded with the goal of studying unconstrained face classification. The videos were recorded in both indoor and outdoor environments under different illumination conditions, and participants move freely. Consequently, some frames suffer from important occlusions. The yaw angle (ranging from \(-90^\circ \) to \(90^\circ \)) is annotated using a two-step labeling procedure that first automatically provides the most probable angle as well as a degree of confidence, after which the final label is chosen by a human annotator among the plausible angle values. Since the resulting annotations are not perfect, this dataset is well suited to benchmark robust regression models. As the training and test sets are not separated in the original dataset, we perform a 7-fold cross-validation. We report the fold-wise MAE average and standard deviation as well as the statistical significance corresponding to the concatenation of the test results of the 7 folds. Importantly, only a subset of the dataset is publicly available (35 videos out of 60).

In Table 2, we report the results obtained with different methods and employ a dagger to indicate when a particular method uses the entire dataset (60 videos) for training. We can easily notice that DeepGUM exhibits the best results compared to the other ConvNets methods (respectively \(0.99^\circ \), \(0.50^\circ \) and \(0.20^\circ \) lower than \(L_2\), Huber and Biweight in MAE). The last three approaches, all using deep architectures, significantly outperform the current state-of-the-art approach [10]. Among them, DeepGUM is significantly better than the rest with \(p<0.001\).

Table 2. Mean absolute error on the McGill dataset. The results in the first half of the table are directly taken from the respective papers, and therefore no statistical comparison is possible. \(^{\dagger }\)Uses extra training data.

4.4 Facial Landmark Detection

We perform experiments on the LFW and NET facial landmark detection datasets [37], which consist of 5590 and 7876 face images, respectively. We combined both datasets and employed the same data partition as in [37]. Each face is labeled with the positions of five key-points in Cartesian coordinates, namely the left and right eyes, the nose, and the left and right corners of the mouth. The detection error is measured with the Euclidean distance between the estimated and the ground-truth position of the landmark, divided by the width of the face image, as in [37]. The performance is measured with the failure rate of each landmark, where errors larger than \(5\%\) are counted as failures. The two aforementioned datasets can be considered outlier-free since the average failure rate reported in the literature falls below \(1\%\). Therefore, we artificially corrupt the annotations to find the breakdown point of DeepGUM. Our purpose is to study the robustness of the proposed deep mixture model to outliers generated in controlled conditions. We use three different types of outliers:

  • Normally Generated Outliers (NGO): A percentage of landmarks is selected, regardless of whether they belong to the same image or not, and shifted a distance of d pixels in a uniformly chosen random direction. The distance d follows a Gaussian distribution, \(\mathcal {N}(25, 2)\). NGO simulates errors produced by human annotators that made a mistake when clicking, thus annotating in a slightly wrong location.

  • Local - Uniformly Generated Outliers (l-UGO): It follows the same philosophy as NGO, but samples the distance d from a uniform distribution over the image instead of a Gaussian. Such errors simulate human mistakes that are unrelated to precision, such as failing to select the point or misunderstanding the image.

  • Global - Uniformly Generated Outliers (g-UGO): As in the previous case, the landmarks are corrupted with uniform noise. However, in g-UGO the landmarks to be corrupted are grouped by image. In other words, we do not corrupt a subset of all landmarks regardless of the image they belong to, but rather corrupt all landmarks of a subset of the images. This strategy simulates problems with the annotation files or in the sensors in case of automatic annotation.

The first and the second types of outlier contamination employ landmark-wise \(r_{n}\), while the third uses image-wise \(r_{n}\); a sketch of the three corruption schemes is given below.
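A NumPy sketch of the three corruption schemes follows; the image size and the interpretation of \(\mathcal {N}(25, 2)\) as mean and standard deviation are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(landmarks, kind, rate, img_size=256):
    """Corrupt a (N, L, 2) array of pixel landmarks with NGO, l-UGO or g-UGO noise."""
    lm = landmarks.copy()
    N, L, _ = lm.shape
    if kind in ("NGO", "l-UGO"):                     # corrupt individual landmarks
        mask = rng.random((N, L)) < rate
        k = mask.sum()
        if kind == "NGO":                            # shift of d ~ N(25, 2) pixels
            d = rng.normal(25.0, 2.0, k)             # in a uniformly random direction
            ang = rng.uniform(0.0, 2.0 * np.pi, k)
            lm[mask] += np.stack([d * np.cos(ang), d * np.sin(ang)], axis=1)
        else:                                        # l-UGO: uniform over the image
            lm[mask] = rng.uniform(0.0, img_size, (k, 2))
    else:                                            # g-UGO: corrupt whole images
        imgs = rng.random(N) < rate
        lm[imgs] = rng.uniform(0.0, img_size, (int(imgs.sum()), L, 2))
    return lm
```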

Fig. 5. Evolution of the failure rate (top) when augmenting the noise for the three types of outliers considered, together with the corresponding precision and recall in percentage (bottom) for the outlier class. (Color figure online)

The plots in Fig. 5 report the failure rate of DeepGUM, Biweight, Huber and \(L_2\) (top) on the clean test set, and the outlier detection precision and recall of all methods except \(L_2\) (bottom) for the three types of synthetic noise on the corrupted training set. The precision corresponds to the percentage of training samples classified as outliers that are true outliers; the recall corresponds to the percentage of outliers that are classified as such. Two conclusions can be drawn directly from this figure. On the one hand, Biweight and Huber systematically present a lower recall than DeepGUM; in other words, DeepGUM exhibits the highest reliability at identifying and, therefore, ignoring outliers during training. On the other hand, DeepGUM tends to present a lower failure rate than Biweight, Huber and \(L_2\) in most of the scenarios contemplated.

Regarding the four left-most plots, l-UGO and g-UGO, we can clearly observe that, while for limited amounts of outliers (i.e. \({<}10\%\)) all methods report comparable performance, DeepGUM is clearly superior to \(L_2\), Biweight and Huber for larger amounts of outliers. We can also safely identify a breakdown point of DeepGUM on l-UGO at \({\thicksim }40\%\). This is in line with the reported precision and recall for the outlier detection task: while for Biweight and Huber both measures decrease as the number of outliers increases, they remain constantly around \(99\%\) for DeepGUM (up to \(40\%\) of outliers for l-UGO). The fact that the breakdown point of DeepGUM under g-UGO is higher than \(50\%\) is due to the fact that the a priori model of the outliers (i.e. the uniform distribution) corresponds to the way the data is corrupted.

For NGO, the corrupted annotation is always around the ground truth, leading to a failure rate smaller than \(7\%\) for all methods. We can see that all four methods exhibit comparable performance up to \(30\%\) of outliers. Beyond that threshold, Biweight outperforms the other methods in spite of presenting a progressively lower recall and a high precision (i.e. Biweight identifies very few outliers, but the ones identified are true outliers). This behavior is also exhibited by Huber. Regarding DeepGUM, we observe that in this particular setting the results are aligned with \(L_2\). This is because the SGD procedure is not able to find a better optimum after the first epoch; therefore the early stopping mechanism is triggered and SGD outputs the initial network, which corresponds to \(L_2\). We can conclude that the strategy of DeepGUM, consisting of removing all points detected as outliers, is not effective in this particular experiment. In other words, having more noisy data is better than having only a few clean samples in this particular case of zero-mean, highly correlated noise. Nevertheless, we consider it an attractive property of DeepGUM that it can automatically identify these particular cases and return an acceptable solution.

5 Conclusions

This paper introduced a deep robust regression learning method that uses a Gaussian-uniform mixture model. The novelty of the paper resides in combining a probabilistic robust mixture model with deep learning in a jointly trainable fashion. In this context, previous studies only dealt with the classical \(L_2\) loss function or with Tukey's biweight function, an M-estimator robust to outliers [4]. Our proposal yields better performance than previous deep regression approaches thanks to a novel technique, and the derived optimization procedure, that alternates between the unsupervised task of outlier detection and the supervised task of learning network parameters. The experimental validation addresses four different tasks: facial and fashion landmark detection, age estimation, and head pose estimation. We have empirically shown that DeepGUM (i) is a robust deep regression approach that does not need to rigidly specify a priori the distribution (number and spread) of outliers, (ii) exhibits a higher breakdown point than existing methods when the outliers are sampled from a uniform distribution (being able to deal with more than \(50\%\) of outlier contamination without providing incorrect results), and (iii) is capable of providing results comparable to or better than current state-of-the-art approaches in the four aforementioned tasks. Finally, DeepGUM could be easily used to remove undesired samples that arise from tedious manual annotation. It could also deal with highly unusual training samples inherently present in automatically collected huge datasets, a problem that is currently addressed using error-prone and time-consuming human supervision.