survey

Open access

AUC Maximization in the Era of Big Data and AI: A Survey

Authors:

Tianbao Yang,

Yiming YingAuthors Info & Claims

ACM Computing Surveys, Volume 55, Issue 8

Article No.: 172, Pages 1 - 37

https://doi.org/10.1145/3554729

Published: 23 December 2022 Publication History

All formats PDF

Abstract

Area under the ROC curve, a.k.a. AUC, is a measure of choice for assessing the performance of a classifier for imbalanced data. AUC maximization refers to a learning paradigm that learns a predictive model by directly maximizing its AUC score. It has been studied for more than two decades dating back to late 90s, and a huge amount of work has been devoted to AUC maximization since then. Recently, stochastic AUC maximization for big data and deep AUC maximization (DAM) for deep learning have received increasing attention and yielded dramatic impact for solving real-world problems. However, to the best our knowledge, there is no comprehensive survey of related works for AUC maximization. This article aims to address the gap by reviewing the literature in the past two decades. We not only give a holistic view of the literature but also present detailed explanations and comparisons of different papers from formulations to algorithms and theoretical guarantees. We also identify and discuss remaining and emerging issues for DAM and provide suggestions on topics for future work.

1 Introduction

ROC (receiver operating characteristic) curve is a curve of true positive rate (TPR, equivalently sensitivity or recall) versus false positive rate (FPR, equivalently fall-out) of a classifier by varying the threshold. The method was originally developed for operators of military radar receivers starting in 1941 [52]. ROC analysis has emerged as an important tool in many domains, e.g., medicine, radiology, biometrics, meteorology, forecasting of natural hazards, and is widely used in machine learning and artificial intelligence. A statistic measure associated with the ROC curve if the area under the curve (AUC), which has been widely used for assessing the performance of a classifier. Another closely related measure is called partial AUC, which refers to AUC in a certain region that restricts the range of FPR and/or TPR.

A standard approach in machine learning for learning a predictive model is to optimize some performance metric. A traditional performance metric of a classifier is the accuracy, i.e., the proportion of examples that are predicted correctly. However, accuracy can be misleading when the data is imbalanced, meaning that the number of data points from one class is much larger than the number of data points from the another class. In contrast, AUC is a more informative measure than accuracy for imbalanced data. However, studies show that algorithms that maximize accuracy of a model do not necessarily maximize the AUC score [28]. Hence, it is necessary to study algorithms for maximizing AUC directly.

AUC maximization in machine learning has a long history dating back to late ’90s [68]. Tremendous studies have been devoted to this topic and various aspects have been studied ranging from formulations to algorithms and theories. Below, we give a brief overview with exemplar references. First, AUC maximization has been studied in the context of different learning paradigms, e.g., supervised learning [81, 152], semi-supervised learning [80, 165], positive-unlabeled (PU) learning [140, 143], active learning [30, 60], Bayesian learning [59], federated learning [55, 195], online learning [47, 201]. Second, models in different forms have been learned in the context of AUC maximization, including linear models [192], kernel models [68, 132], extreme learning machines [188], decision trees [45], neural networks [178], deep neural nets [194]. Third, various solvers based on different methodologies have been studied, e.g., linear programming [130], quadratic programming [68], cutting-plane methods [81], L-BFGS [94], evolutionary algorithms [106], gradient descent methods [69], stochastic gradient methods [192], other methods [13, 19]. Fourth, different theoretical guarantees have been examined, e.g., consistency [48], generalization error bounds [95], excess risk bounds [58, 193], regret bounds [201], convergence rates or sample complexities [101], stability [96, 184]. Last but not least, AUC maximization has been successfully investigated in a variety of applications [6, 9, 43, 61, 77, 150, 153, 166, 167, 176, 204, 208], e.g., medical image classification [196] and molecular properties prediction [171], to mention but a few.

A bulk of studies related to AUC maximization revolve around the development of the solver, i.e., optimization algorithms, for learning a predictive model. The reason is that, compared with the traditional metric of accuracy, the AUC score is non-decomposable over individual examples, which renders its optimization much more challenging, especially for big data. The research of AUC maximization algorithms has experienced four different ages in the long history of two decades, namely, full-batch-based methods for the first age (roughly 2000–2010), online methods for the second age (roughly 2011–2015), stochastic methods for the third age (roughly 2016–2019), and deep learning methods for the recent age (roughly 2020–present). The first three ages focus on learning linear models or kernelized models, and the last age focuses on deep neural networks. In each age, there have been seminal works in rigorous optimization algorithms that play important roles in the evolution of AUC maximization methods. The four ages are illustrated in Figure 1.

Fig. 1.

To the best of our knowledge, there is no comprehensive survey devoted to AUC maximization. The only related survey work is Reference [162], published in 2011. Nevertheless, it focuses on ordinal regression and does not provide a comprehensive survey of optimization algorithms for AUC maximization with theoretical guarantees. This article aims to address this gap by providing a comprehensive review of related works for AUC maximization, with a particular focus on the optimization algorithms. We will cover important works in all four ages about the optimization algorithms and discuss their properties. The remainder of this article is organized as follows:

•

We provide some background for AUC and AUC estimators in Section 2. We give definitions for both AUC and partial AUC and derive their non-parametric estimators.

•

In Section 3, we review different objective functions for AUC maximization and mainly discuss three families of objectives.

•

We review full-batch-based methods for solving AUC maximization in the first age for both AUC maximization and partial AUC maximization in Section 4.

•

In Section 5, we present two classes of online optimization methods for AUC maximization and discuss their properties.

•

We present stochastic optimization methods in both offline setting and online setting in Section 6, and compare their properties.

•

In Section 7, we survey recent papers about non-convex optimization for deep AUC and partial AUC maximization, and discuss their applications in the real world.

•

In Section 8, we discuss remaining and emerging issues in deep AUC maximization and provide suggestions of topics for future work. Finally, we conclude in Section 9.

Disclaimer. Before ending this section, we would like to point out that we have done our best to include as many related works in machine learning as possible and may innocently miss some relevant papers in machine learning or other areas. We also emphasize that this article is about maximization of areas under ROC curves and does not cover the maximization of areas under Precision-Recall curves. Finally, we present a list of three fundamental papers of AUC, top 10 Cited Papers (as of 07/28/2022) related to AUC maximization, and two representative works for deep AUC maximization in Table 1.

Table 1.

Title	Authors	Year	Citations	Venue	Reference
1. The Meaning and Use of the Area under a Receiver Operating Characteristic	J. A. Hanley and B. J. McNeil.	1982	22314	Radiology	[63]
2. Analyzing a Portion of the ROC Curve	D. Katzman McClish	1989	769	Med Decis Making	[110]
3. Partial AUC Estimation and Regression	L. Dodd and M. Pepe	2003	389	Biometrics	[37]
1. A Support Vector Method for Multivariate Performance Measures	T. Joachims	2005	1022	ICML	[81]
2. AUC Optimization vs. Error Rate Minimization	C. Cortes and M. Mohri	2003	721	NIPS	[28]
3. Optimizing Classifier Performance via an Approximation to the Wilcoxon-Mann-Whitney Statistic	L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz	2003	350	ICML	[178]
4. Optimising Area under the ROC Curve Using Gradient Descent	A. Herschtal and B. Raskutti	2004	231	ICML	[69]
5. Online AUC Maximization	P. Zhao, S. C. H. Hoi, R. Jin, and T. Yang	2011	217	ICML	[201]
6. AUC Maximizing Support Vector Learning	U. Brefeld, T. Scheffer	2005	184	ICML (workshop)	[18]
7. The P-Norm Push: A Simple Convex Ranking Algorithm that Concentrates at the Top of the List	C. Rudin	2009	179	JMLR	[142]
8. One-pass AUC Optimization	W. Gao, R. Jin, S. Zhu, and Z. Zhou	2013	153	ICML	[47]
9. Efficient AUC Optimization for Classification	T. Calders and S. Jaroszewicz	2007	128	PKDD	[19]
10. Stochastic Online AUC Maximization	Y. Ying, L. Wen, and S. Lyu	2016	104	NIPS	[192]
1. Stochastic AUC Maximization with Deep Neural Networks	M. Liu, Z. Yuan, Y. Ying, and T. Yang.	2020	35	ICLR	[101]
2. Robust Deep AUC Maximization: A New Surrogate Loss and Empirical Studies on Medical Image Classification	Z. Yuan, Y. Yan, M. Sonka and T. Yang.	2021	13	ICCV	[196]

Table 1. Three Fundamental Papers of AUC, Top 10 Cited Papers Related to AUC Maximization, and Two Representative Works for Deep AUC Maximization

The citations data is from Google Scholar as of July 28, 2022, which is included for reference and is by no means the only metric to measure the influence of a paper. Remark: We do not include the highly cited paper [45] in the list due to that it was studied in Reference [28] for AUC maximization.

2 Background

Notations. Let \(\mathbb {I}(\cdot)\) be an indicator function of a predicate, and \([n]=\lbrace 1,\ldots , n\rbrace\). Let \({\bf z}= ({\bf x}, y)\) denote an input-output pair, where \({\bf x}\in \mathcal {X}\) denotes the input data and \(y\in \lbrace 1,-1\rbrace\) denotes its class label. Let \(\mathcal {P}_+\) denote the distribution of positive examples and \(\mathcal {P}_-\) denote the distribution of negative examples. Let \(f({\bf x}):\mathcal {X}\rightarrow \mathbb {R}\) denote a predictive function to be learned. It is usually parameterized by a vector \(\mathbf {w}\in \mathbb {R}^d\) and we use the notation \(f_\mathbf {w}(\cdot)\) to emphasize that it is a parameterized model. Let \(\ell (\mathbf {w}; {\bf x}, {\bf x}^{\prime }) = \ell (f_\mathbf {w}({\bf x}^{\prime }) - f_\mathbf {w}({\bf x}))\) denote a pairwise loss for a positive-negative pair \(({\bf x}, {\bf x}^{\prime })\).

For a set of given training examples \(\mathcal {S}=\lbrace ({\bf x}_i, y_i), i\in [n]\rbrace\) in the offline setting, let \(\mathcal {S}_+\) and \(\mathcal {S}_-\) be the subsets of \(\mathcal {S}\) with only positive and negative examples, respectively, and let \(n_+=|\mathcal {S}_+|\) and \(n_-=|\mathcal {S}_-|\) be the number of positive and negative examples, respectively. Denote by \(\mathcal {S}^{\downarrow }[k_1,k_2]\subseteq \mathcal {S}\) the subset of examples whose rank in terms of their prediction scores in the descending order are in the range of \([k_1, k_2]\), where \(k_1\le k_2\). Similarly, let \(\mathcal {S}^{\uparrow }[k_1,k_2]\subseteq \mathcal {S}\) denote the subset of examples whose rank in terms of their prediction scores in the ascending order are in the range of \([k_1, k_2]\), where \(k_1\le k_2\). We denote by \(\mathbb {E}_{{\bf x}\sim \mathcal {S}}\) the average over \({\bf x}\in \mathcal {S}\). Let \(D(\mathbf {p}, \mathbf {q})\) denote the KL divergence between two probability vectors. Let \(\Delta\) denote a simplex of a proper dimension, and \(\Pi _{\Omega }[\cdot ]\) denote the standard Euclidean projection onto a set \(\Omega\).

2.1 Definitions of AUC, Partial AUC, Two-way Partial AUC

In this subsection, we fix the predictive model \(f(\cdot)\) and present the definitions and formulas for computing AUC of \(f\). For a given threshold \(t\), the true positive rate (TPR) can be written as \(\text{TPR}(t) = \Pr (f({\bf x})\gt t|y = 1) = \mathbb {E}_{{\bf x}\sim \mathcal {P}_+}[\mathbb {I}(f({\bf x})\gt t)]\), and the false positive rate (FPR) can be written as \(\text{FPR}(t) = \Pr (f({\bf x})\gt t | y=-1) = \mathbb {E}_{{\bf x}\sim \mathcal {P}_-}[\mathbb {I}(f({\bf x})\gt t)]\). Let \(F_-(t) = 1 - \text{FPR}(t)\) denote the cumulative density function of the random variable \(f({\bf x})\) for \({\bf x}\sim \mathcal {P}_-\). Let \(p_-(t)\) denotes its corresponding probability density function. Similarly, let \(F_+(t) = 1- \text{TPR}(t)\) and \(p_+(t)\) denote the cumulative density function and the probability density function of \(f({\bf x})\) for \({\bf x}\sim \mathcal {P}_+\), respectively.

For a given \(u\in [0,1]\), let \(\text{FPR}^{-1}(u) = \inf \lbrace t\in \mathbb {R}: \text{FPR}(t)\le u\rbrace\). The ROC curve is defined as \(\lbrace u, \text{ROC}(u)\rbrace\), where \(u\in [0,1]\) and \(\text{ROC}(u) = \text{TPR}(\text{FPR}^{-1}(u))\). The AUC score of \(f\) is given by

\begin{align} \text{AUC}(f)& = \int ^{1}_0\text{ROC}(u) du = \int ^{\infty }_{-\infty }\text{TPR}(t)dF_-(t)= \int ^{\infty }_{-\infty }\text{TPR}(t)p_-(t)dt \nonumber \nonumber\\ & = \int ^{\infty }_{-\infty }\int ^{\infty }_t p_+(s)ds p_-(t)dt = \int ^{\infty }_{-\infty }\int ^{\infty }_{-\infty } p_+(s) p_-(t)\mathbb {I}(s\gt t)ds dt. \end{align}

(1)

The above expression also gives a probabilistic interpretation of AUC [63], i.e.,

\begin{align} \text{AUC}(f) = \Pr (f({\bf x}_+)\gt f({\bf x}_-)) = \mathbb {E}_{{\bf x}_+\sim \mathcal {P}_+, {\bf x}_-\sim \mathcal {P}_-}[\mathbb {I}(f({\bf x}_+)\gt f({\bf x}_-))]. \end{align}

(2)

The normal AUC measure could be misleading when the data is highly imbalanced. In many applications (e.g., medical diagnostics), we would like to control the FPR in a certain range, e.g., \(\text{FPR}\in (\alpha , \beta)\). Hence, another measure of interest is partial AUC (pAUC) with FRP restricted in the range \((\alpha , \beta)\), which is given by

\begin{equation} \text{pAUC}(f, \alpha , \beta) = \int ^{\text{FPR}^{-1}(\alpha)}_{\text{FPR}^{-1}(\beta)}\text{TPR}(t)dF_-(t) = \int ^{\text{FPR}^{-1}(\alpha)}_{\text{FPR}^{-1}(\beta)}\int ^{\infty }_{-\infty } p_+(s) p_-(t)\mathbb {I}(s\gt t)ds dt. \end{equation}

(3)

This expression gives a probabilistic interpretation of pAUC, which was first shown in Reference [37], i.e.,

\begin{align} \text{pAUC}(f, \alpha , \beta) =\Pr (f({\bf x}_+)\gt f({\bf x}_-), f({\bf x}_-)\in [\text{FPR}^{-1}(\beta), \text{FPR}^{-1}(\alpha)]). \end{align}

(4)

In contrast to pAUC defined above that is also referred to as one-way pAUC, two-way pAUC has been also studied [181]. A two-way pAUC is defined by specifying an upper bound \(\beta\) on the FPR and a lower bound on \(\alpha\) on the TPR. Then, the two-way pAUC (TPAUC) is given by

\begin{align} \text{TPAUC}(f, \alpha , \beta) = \Pr (f({\bf x}_+)\gt f({\bf x}_-), f({\bf x}_-)\ge \text{FPR}^{-1}(\beta), f({\bf x}_+)\le \text{TPR}^{-1}(\alpha)\rbrace). \end{align}

(5)

An illustration of AUC, one-way partial AUC and two-way partial AUC is given in Figure 2.

Fig. 2.

2.2 Non-Parametric Estimators

Given a set of examples \(\mathcal {S}=\mathcal {S}_+\cup \mathcal {S}_-\), how can we estimate AUC and pAUC? There are parametric estimators assuming the prediction scores following a particular distribution (e.g., normal distribution) [110], and non-parametric estimators that do not make any assumptions regarding the distribution of prediction scores. We will focus on non-parametric estimators below, since they are widely used for AUC maximization.

According to the probabilistic interpretation of AUC in Equation (2), a non-parametric estimator can be computed as follows that corresponds to the Mann-Whitney U-statistic [63]:

\begin{align} {\text{AUC}}(f; \mathcal {S}) = \frac{1}{n_+}\frac{1}{n_-}\sum _{{\bf x}_i\in \mathcal {S}_+}\sum _{{\bf x}_j\in \mathcal {S}_-}\mathbb {I}(f({\bf x}_i)\gt f({\bf x}_j)). \end{align}

(6)

A (non-normalized) non-parametric estimator of pAUC can be computed by [37]:

\begin{align*} {\text{pAUC}}(f,\alpha , \beta ; \mathcal {S}) = \frac{1}{n_+}\frac{1}{n_-}\sum _{{\bf x}_i\in \mathcal {S}_+}\sum _{{\bf x}_j\in \mathcal {S}_-}\mathbb {I}(f({\bf x}_i)\gt f({\bf x}_j), f({\bf x}_j)\in (q_{\beta }, q_{\alpha })), \end{align*}

where \(q_\alpha\) denotes the \(\alpha\) quantile of \(f({\bf x}_-), {\bf x}_-\sim \mathcal {P}_-\). The quantiles \(q_{\alpha }, q_{\beta }\) are usually replaced by their empirical estimations, which gives the following non-normalized estimator of pAUC:

\begin{align} {\text{pAUC}}(f, \alpha , \beta ; \mathcal {S}) = \frac{1}{n_+}\frac{1}{n_-}\sum _{{\bf x}_i\in \mathcal {S}_+}\sum _{{\bf x}_j\in \mathcal {S}^\downarrow _-[k_1+1, k_2]}\mathbb {I}(f({\bf x}_i)\gt f({\bf x}_j)), \end{align}

(7)

where \(k_1=\lceil n_-\alpha \rceil , k_2 = \lfloor n_-\beta \rfloor\). Similarly, a (non-normalized) non-parametric estimator for two-way pAUC is given by

\begin{align} {\text{TPAUC}}(f, \alpha , \beta ; \mathcal {S}) = \frac{1}{n_+}\frac{1}{n_-}\sum _{{\bf x}_i\in \mathcal {S}_+^{\uparrow }[1, k_1]}\sum _{{\bf x}_j\in \mathcal {S}^\downarrow _-[1, k_2]}\mathbb {I}(f({\bf x}_i)\gt f({\bf x}_j)), \end{align}

(8)

where \(k_1=\lfloor n_+\alpha \rfloor , k_2 = \lfloor n_-\beta \rfloor\).

3 Surrogate Objectives for AUC Maximization

Objectives based on a Pairwise Surrogate Loss. . For AUC maximization, one often replaces the indicator function in the non-parametric estimators of AUCs defined above by a surrogate loss function \(\ell (f_\mathbf {w}({\bf x}_-) - f_\mathbf {w}({\bf x}_+))\) of \(\mathbb {I}(f_\mathbf {w}({\bf x}_-)\ge f_\mathbf {w}({\bf x}_+))\) to formulate the objective function. As a result, AUC maximization can be formulated as

\begin{equation} \min _{\mathbf {w}\in \mathbb {R}^d} \frac{1}{n_+n_-} \sum _{{\bf x}_i\in \mathcal {S}_+}\sum _{{\bf x}_j\in \mathcal {S}_-}\ell (f_\mathbf {w}({\bf x}_j) - f_\mathbf {w}({\bf x}_i)), \end{equation}

(9)

and (one-way) pAUC maximization can be formulated as

\begin{equation} \min _{\mathbf {w}\in \mathbb {R}^d} \frac{1}{n_+}\frac{1}{n_-}\sum _{{\bf x}_i\in \mathcal {S}_+}\sum _{{\bf x}_j\in \mathcal {S}^\downarrow _-[k_1+1, k_2]}\ell (f_\mathbf {w}({\bf x}_j) - f_\mathbf {w}({\bf x}_i)), \end{equation}

(10)

and two-way pAUC maximization can be formulated as

\begin{equation} \min _{\mathbf {w}\in \mathbb {R}^d} \frac{1}{n_+}\frac{1}{n_-}\sum _{{\bf x}_i\in \mathcal {S}_+^{\uparrow }[1, k_1]}\sum _{{\bf x}_j\in \mathcal {S}^\downarrow _-[1, k_2]}\ell (f_\mathbf {w}({\bf x}_j) - f_\mathbf {w}({\bf x}_i)). \end{equation}

(11)

Regarding the pairwise surrogate loss \(\ell (f_\mathbf {w}({\bf x}_-) - f_\mathbf {w}({\bf x}_+))\), different choices have been investigated in the literature. A list of different surrogate loss functions with a sampling of references are summarized in Table 2. An important property of pairwise surrogate loss for AUC maximization is its consistency [48]. Loosely speaking, a pairwise surrogate loss \(\ell (\cdot)\) is called consistent if optimizing the surrogate loss with infinite amount of data gives a solution to optimizing the original AUC score. A rigorous definition is given in Reference [48]. A necessary condition for the consistency of a pairwise loss \(\ell (f({\bf x}^{\prime }) - f({\bf x}))\) for a positive-negative pair \(({\bf x}, {\bf x}^{\prime })\) is that \(\ell\) is a convex, differentiable and non-decreasing function with \(\ell ^{\prime }(0) \gt 0\) [48].

Table 2.

Name	Form	Parameters	Remarks	References
Square	\(\ell (s)=(s+c)^2\)	\(c\gt 0\)	Consistent	[35, 47, 103, 149, 192]
Hinge	\(\ell (s)=\max (0,c+s)\)	\(c\gt 0\)	Non-consistent	[18, 86, 87, 169]
Squared Hinge	\(\ell (s)=max(0, c+s)^2\)	\(c\gt 0\)	Consistent	[18, 87, 178]
Logistic	\(\ell (s)=-\log \frac{1}{1+\exp (cs)}\)	\(c\gt 0\)	Consistent	[154]
Exponential Loss	\(\ell (s) = \exp (cs)\)	\(c\gt 0\)	Consistent	[45]
Barrier Hinge	\(\ell (s) = \max (-b(r+s)+r, \max (b(s-r), r-s))\)	\(r\gt 0, b\gt 0\)	Noisy Labels	[22]
Sigmoid	\(\ell (s) = \frac{1}{1+\exp (-cs)}\)	\(c\gt 0\)	Non-convex	[19, 69, 80, 159, 170]
Ramp Function	\(\ell (s)=\max (0, 1+s) - \max (0, c+s)\)	\(c\lt 0\)	Non-convex	[25]
CDF of Normal Distribution	\(\ell (s)=\frac{1}{2}(1+erf(\frac{s}{\sqrt {2}c}))\)	\(c\gt 0\)	Non-convex	[90]
Exponential-type	\(\ell (s) = 1 - \exp (-cs)\)	\(c\gt 0\)	Non-convex	[157]
Chebyshev Polynomial	\(\ell (s)=\sum _{k=0}^mc_ks^k\)	\(m\gt 0\)	Decomposable	[19]

Table 2. Different Surrogate Loss Functions \(\ell (s)\) Used in AUC Maximization

Min-Max Objectives for AUC Maximization. . One issue of the pairwise loss-based objective is that it needs to explicitly construct the positive-negative pairs, which is not suitable for online learning where the data comes sequentially and distributed optimization where data is distributed over many different machines. To address this issue, Ying et al. [192] propose to formulate an equivalent min-max objective for using a pairwise square loss. Specially, when \(\ell (f_\mathbf {w}({\bf x}_j) - f_\mathbf {w}({\bf x}_i)) = (c+f_\mathbf {w}({\bf x}_j) - f_\mathbf {w}({\bf x}_i))^2\), the problem in Equation (9) is equivalent to

\begin{align} \min _{\mathbf {w}\in \mathbb {R}^d,(a,b)\in \mathbb {R}^2}\max _{\alpha \in \mathbb {R}}F\left(\mathbf {w},a,b,\alpha \right):=\mathbb {E}_{{\bf z}}\left[F\left(\mathbf {w},a,b,\alpha ;{\bf z}\right)\right], \end{align}

(12)

where \(\mathbb {E}_{{\bf z}}\) denotes the empirical average of training data (offline setting) or expectation over underlying distribution (online setting), \(F(\mathbf {w}, a, b, \alpha ; {\bf z})\) is given by

\begin{align} F(\mathbf {w},a,b,\alpha ;{\bf z}) &= (1-p)\left(f_\mathbf {w}({\bf x})-a\right)^2\mathbb {I}(y=1)+p(f_\mathbf {w}({\bf x})-b)^2\mathbb {I}(y=-1)\\ &\quad -p(1-p)\alpha ^2+2\alpha \left(p(1-p)c+ p f_\mathbf {w}({\bf x})\mathbb {I}(y=-1)-(1-p)f_\mathbf {w}({\bf x})\mathbb {I}(y=1)\right)\nonumber \nonumber , \end{align}

(13)

and \(p = \Pr (y=1)\) (online setting) or \(p=n_+/n\) (offline setting). A benefit of this objective function is that it is decomposable over individual examples. Hence, it enables one to develop efficient stochastic algorithms for updating the model parameter \(\mathbf {w}\) without explicitly constructing and handling positive-negative pairs. It is notable that a similar min-max formulation for AUC maximization was also independently examined in Reference [134] for the offline setting.

Recently, Yuan et al. [196] reveal some potential issues of optimizing pairwise square loss and its equivalent min-max objective. They demonstrate that optimizing the pairwise square loss or its equivalent min-max objective is sensitive to noisy data and also has adverse effect on easy data. To address these issues, they decompose the square loss-based objective into three components:

\begin{align} &\mathbb {E}[(c - h_\mathbf {w}({\bf x}) + h_\mathbf {w}({\bf x}^{\prime }))^2|y=1, y^{\prime }=-1]\\ &= {\mathbb {E}[(h_\mathbf {w}({\bf x}) - a(\mathbf {w}))^2|y=1]}+ {\mathbb {E}[(h_\mathbf {w}({\bf x}^{\prime }) - b(\mathbf {w}))^2|y^{\prime }=1]}+ {(c - a(\mathbf {w}) + b(\mathbf {w}))^2}\nonumber \nonumber , \end{align}

(14)

where \(a(\mathbf {w})=\mathbb {E}_{{\bf x}}[f_\mathbf {w}({\bf x})|y=1]\) and \(b(\mathbf {w}) = \mathbb {E}_{{\bf x}}[f_\mathbf {w}({\bf x})|y=-1]\), and they propose to replace the last term by using a squared hinge loss:

\begin{align} \min _{\mathbf {w}\in \mathbb {R}^d}{\mathbb {E}[(h_\mathbf {w}({\bf x}) - a(\mathbf {w}))^2|y=1]}+ {\mathbb {E}[(h_\mathbf {w}({\bf x}^{\prime }) - b(\mathbf {w}))^2|y^{\prime }=1]}+ (c- a(\mathbf {w}) + b(\mathbf {w}))_+^2, \end{align}

(15)

whose objective is referred to as min-max margin loss [196]. For solving the above problem, they formulate the problem into an equivalent min-max optimization problem:

\begin{align} \min _{\mathbf {w}\in \mathbb {R}^d,(a,b)\in \mathbb {R}^2}\max _{\alpha \ge 0}f\left(\mathbf {w},a,b,\alpha \right):=\mathbb {E}_{{\bf z}}\left[F\left(\mathbf {w},a,b,\alpha ;{\bf z}\right)\right], \end{align}

(16)

where \(F(\mathbf {w}, a, b,\alpha ; {\bf z})\) is the same as Equation (13). The difference between the above objective and Reference (12) is that there is a non-negative constraint on the dual variable \(\alpha \ge 0\).

Composite Objectives for AUC Maximization. . Recently, Zhu et al. [207] propose another family of objectives for AUC maximization, which subsumes min-max objective of the pairwise square loss and the min-max margin loss as special cases. The objective consists of three terms:

\begin{align} \min _{\mathbf {w}\in \mathbb {R}^d}{\mathbb {E}[(f_\mathbf {w}({\bf x}) - a(\mathbf {w}))^2|y=1]}+ {\mathbb {E}[(f_\mathbf {w}({\bf x}^{\prime }) - b(\mathbf {w}))^2|y^{\prime }=1]}+ \ell (c- a(\mathbf {w}) + b(\mathbf {w}))^2, \end{align}

(17)

where \(\ell (\cdot)\) is a surrogate loss. When \(\ell (\cdot)\) is a square function, the above objective is equivalent to the pairwise square loss-based objective or the min-max objective in Equation (12). When \(\ell (\cdot)\) is a squared hinge function, the above objective is equivalent to the min-max margin objective (16). Other choices of \(\ell (\cdot)\) are possible [207]. For solving the above objective, it can be transformed into

\begin{align} \min _{\mathbf {w}\in \mathbb {R}^d, (a, b)\in \mathbb {R}^2}{\mathbb {E}[(f_\mathbf {w}({\bf x}) - a)^2|y=1]}+ {\mathbb {E}[(f_\mathbf {w}({\bf x}^{\prime }) - b)^2|y^{\prime }=1]}+ \ell (c- a(\mathbf {w}) + b(\mathbf {w}))^2, \end{align}

(18)

where the last term can be regarded as a two-level stochastic compositional function [49, 164]. Another way is to view all three terms in Equation (17) as compositional functions.

It is notable that a regularization term about \(\mathbf {w}\) (e.g., \(\ell _2\) norm square) can be added to the above objectives for improving generalization. In addition, formulations can be extended to the multi-class scenario following the one-vs.-all or one-vs.-one settings [62, 101, 186].

4 Full Batch Based Methods - the First Age

Earlier works for AUC maximization use full batch-based methods, which process all training examples at each iteration in the algorithmic optimization. Notable optimization algorithms for AUC maximization include the quadratic programming, gradient decent methods, cutting plane algorithms, and boosting-type methods.

4.1 AUC Maximization

Quadratic Programming. . To the best of our knowledge, the earliest work dates back to 1999 [68] and derives the dual problem of the support vector machine (SVM) formulation for ordinal regression in the kernel setting. Quadratic programming is then employed to obtain the optimal solution that can apply to AUC maximization. References [18, 138] use a similar optimization algorithm for AUC maximization with the hinge loss. Since the number of constraints and parameters grows quadratically in the number of examples, running such quadratic programming for AUC maximization is very computationally expensive for large-scale datasets. To mitigate such computational burden, in References [18, 138] heuristic tricks using k-means clustering and k-nearest neighborhood are proposed to reduce the number of constraints. However, such approximate solutions do not guarantee an optimal solution to the original AUC maximization problem.

Gradient Descent Methods. . Gradient descent methods are used in References [19, 69, 178] for AUC maximization. Reference [178] is probably the first work that applies the gradient descent method for AUC maximization. They use the the hinge function with a power \(p\gt 1\) as the surrogate loss. One year later, Reference [69] considers improving the gradient descent algorithm for AUC maximization, where they use the sigmoid function as a surrogate loss. They also propose a heuristic technique by reducing the number of positive-negative pairs used in the gradient descent methods. In particular, for each negative data they only construct a pairwise loss with only one positive data. However, the quality of such approximation highly depends on the properties of the dataset. When the examples have large intra-variance, their objective could yield poor performance. Reference [19] uses a different method to improve the scalability of the gradient descent method. In particular, they use the Chebyshev polynomial to approximate the indicator function in the original formulation of the AUC score given by Equation (6) and then a gradient descent method is employed to optimize such approximated AUC score, which only requires a linear scan of all examples at each iteration without explicitly working on all pairs.

Cutting Plane and Accelerated Gradient-based Methods. . The seminal work by Joachims [81] uses the cutting plane algorithms to optimize a general multivariate performance measure including the AUC score. The basic principle behind this optimization algorithm is to, at each iteration, solve a quadratic programming problem subject to a selected subset of constraints. The sufficient subset of constraints is generated by gradually adding the currently most violated constraint in each iteration. The cutting plane methods converge with an iteration complexity of \(\mathcal {O}(\frac{1}{\lambda \epsilon })\) to find a \(\epsilon\)-accurate solution, where \(\lambda\) is the regularization parameter in the formulation. Zhang et al. [198] work on the dual form of the formulation for optimizing the multivariate performance measure, which may be not smooth, and use the smoothing techniques [127] to smooth the empirical objective function. Then, the Nesterov’s accelerated gradient method [126] is employed to optimize the smoothed objective function, which has an iteration complexity of \(\max (\mathcal {O}(\frac{1}{\epsilon }, \frac{1}{\sqrt {\lambda \epsilon }}))\).

Boosting Methods. . Freund et al. [45] propose a boosting method named RankBoost for bipartite ranking, which is applicable to AUC maximization. The RankBoost algorithm is based on Freund and Schapire’s AdaBoost algorithm [46] and its successor developed by Schapire and Singer [144]. RankBoost works by combining many “weak” rankings of the given instances to learn a strong ranking model. The RankBoost algorithm was later applied to AUC maximization [28]. The boosting methods for AUC maximization have also been examined in Reference [105].

4.2 Partial AUC Maximization

Compared to AUC maximization, partial AUC maximization is much more challenging due to that it involves selection of examples whose prediction scores are in a certain range. We provide a survey of partial AUC maximization according to the chronological order and group them according to the underlying methodologies.

Indirect Methods. . Wu et al. [173] propose a new support vector machine (SVM) named assymetric SVM, which aims to lower the false positive rate while maximizing the margin. To achieve this, it maximizes two margins, the core-margin (i.e., the margin between the negative class and the high confidence subset of the positive class), and the traditional class-margin. By enlarging the core-margin, it is able to enclose the core (i.e., high confident examples) of the positive class in a set. The authors employ Sequential Minimal Optimization (SMO) to solve the resulting objective.

Rudin [142] proposes the p-norm push method for bipartitie ranking, which is to optimize a measure focusing on the left end of the ROC curve aiming to the make the leftmost portion of ROC curve higher. The measure to be minimized is defined as a sum of p-norm of the heights of negative examples, where the height of a negative example is defined as the number of positive examples that are ranked lower than the negative example. The author proposes a boosting-type algorithm for optimizing the p-norm push objective.

Later, Agarwal [2] proposes the infinite-push method to minimize the maximal height of all negative examples, which can be considered as an empirical estimator of pAUC with the FPR controlled below \(1/n_-\). The author proposes a gradient descent algorithm for solving the infinite-push objective, which suffers a higher per-iteration cost in the order of \(O(n_+n_-d+n_+n_-\log (n_+n_-))\) and an iteration complexity of \(O(1/\epsilon ^2)\), where \(d\) is the dimensionality of input data.

Rakotomamonjy [139] extends the infinite-push method to handle sparsity-inducing regularizers and proposes an ADMM-based algorithm for optimizing the problem, which has a per-iteration cost of \(O(n_+n_-d + n_+n_-\log (n_+n_-))\) and an iteration complexity of \(O(1/\epsilon)\). In 2014, Li, Jin, and Zhou [98] propose a method called TopPush for optimizing the infinity-push objective. The authors use a different formulation from that in Reference [2] where each positive example is only compared with the negative example with the highest score before computing the loss, which leads to a more efficient algorithm with a per-iteration cost of \(O((n_+ + n_-)d)\). They employ the Nesterov’s accelerated gradient method to optimize the dual objective with an iteration complexity of \(O(1/\sqrt {\epsilon })\).

Boosting-type methods. . Komori and Eguchi [90] propose a boosting-style algorithm named pAUCboost for partial AUC maximization. In this work, the indicator function \(\mathbb {I}(f({\bf x}_-)\gt f({\bf x}_+))\) is approximated by the cumulative density function of the normal distribution. The weak leaner is defined by a natural cubic spline. To simplify the optimization for the weaker learner and its weight at each iteration, the algorithm first employs one-step Newton-Raphson to update the weight and then solves for the optimal weaker learner given its weight. However, it does not discuss complexity and efficiency in finding weaker learners for maximizing pAUC at each iteration. Takenouchi, Komori, and Eguchi [157] propose a more principled boosting method for pAUC maximization named pU-AUCBoost, where U stands for a surrogate function of the indicator \(\mathbb {I}(f({\bf x}_+)\gt f({\bf x}_-))\). To address the inter-dependency issue between the weak learner and its weights, they derive a lower bound of the pAUC objective at each iteration, which decouples the weaker learner and its weights. In these papers, the authors only conduct experiments on small scale datasets with few hundred or thousand examples.

Heuristic Methods. . Wang and Chang [170] consider the marker (feature) selection problem via maximizing the partial AUC of linear risk scores. They propose a surrogate loss function for pAUC and show its non-asymptotic convergence and greedily select features for learning a linear classifier. There is no discussion on efficiency and complexity of how to solve the pAUC maximization problem. The authors have conducted experiments on some simulated data and real data with only few hundred examples. Ricamato and Tortorella [141] examine the problem of how to combine two or multiple classifiers to maximize partial AUC. The problem is reduced to optimizing a scalar combination weight, which is different from standard pAUC maximization methods for learning a classifier. For combining multiple classifiers, they use a greedy method to select which two classifiers to combine at each iteration. As a result, they derive a boosting algorithm similar to the classical Adaboost algorithm, which first finds the optimal base learner given previous combined learner and then optimizes the weight of the base learner.

Structural SVM Methods. . Narasimhan and Agarwal [117] propose a structural SVM-based approach for learning a linear model by optimizing partial AUC inspired by Reference [81]. Their formulated optimization problem has an exponential number of constraints, one for each possible ordering of training data. To solve this problem, they use the cutting plane method, which is based on the fact that for any \(\epsilon \gt 0\) a small subset of the constraints is sufficient to find an \(\epsilon\)-approximate solution to the problem. However, the bottleneck lies at finding the most violated constraint at each iteration, which could cost \(O((n_++n_-)d+n_+n_-+ n_-\log n_-)\) time complexity. In addition, the cutting-plane method could have a slow convergence with an iteration complexity of \(O(1/\epsilon)\). In the extended version [119], the authors have managed to reduce the per-iteration time complexity to \(O((n_++n_-)d+n_+n_-\beta + n_-\log n_-)\), where \(\beta \in (0,1)\) is the upper bound parameter of the FPR. In 2013, the same authors propose a tight surrogate loss for the partial AUC in the structural SVM framework [118]. In this article, the authors also present a projected gradient method, which suffers a per-iteration cost of \(O((n_++n_-)d+ n_-\log n_- + (n_++n_-\beta)\log (n_++n_-\beta))\) for learning a linear model of dimentionality of \(d\), and an iteration complexity of \(O(1/\epsilon ^2)\). A DC programming approach is also presented in Reference [119] for optimizing pAUC with FPR restricted in a range \((\alpha _0, \alpha _1)\) where \(\alpha _0\gt 0\), which is computationally more expensive than the structural SVM approach due to requiring to solve an entire structural SVM optimization at each iteration. In these papers, the authors have conducted experiments on multiple datasets with size ranging from a few thousand to a few hundred thousand. The theoretical work cited in Reference [109] provides a statistical performance guarantee for algorithms of maximizing the empirical pAUC proposed in References [117, 118, 119].

Constrained Optimization . Maximizing the partial AUC can be reformulated as a constrained optimization problem that involves optimizing a non-decomposable evaluation metric with a certain thresholded form, while constraining another metric of interest. In particular, Reference [40] proposes to approximate the area under the ROC curve using a Riemann approximation while dividing the range of FPRs into a number of bins where each threshold is associated with a bin. This approach allows the reformulation of a constrained optimization problem where the objective is to maximize the sum of the TPRs at each threshold with constraints associated with threshold satisfying the FPRs. Replacing TPRs and FPRs with surrogate relaxations, it can be further shown to be equivalent to a Lagrangian (mini-max) problem and then vanilla stochastic gradient descent and ascent algorithms can be applied. The follow-up works [29, 120] have improved this approach using the surrogate relaxations for the primal updates. In Reference [92], the authors further improved this approach by expressing the threshold parameter as a function of the model parameters via the Implicit Function theorem [158]. The resulting optimization problem can be solved using standard gradient-based methods.

4.3 Summary

We compare different methods in Table 3 for pAUC maximization from different perspectives, where we also include deep partial AUC maximization methods reviewed in Section 7. The full-batch-based algorithms could suffer a quadratic time complexity in the worst-case or a super-linear (e.g., log-linear) time complexity per-iteration, which makes them not amenable for handling large-scale datasets. Most of them are for learning traditional models (e.g., linear models, kernel models) and algorithms for solving the underlying optimization problem are not scalable to large-scale datasets and not suitable for deep learning.

Table 3.

Work	Category	Objective Functions	Models	Complexity/Convergence Analysis	Size of Data
[173]	Indirect Methods	SVM-like	Kernel	No	\(10^5\)
[142]	Indirect Methods	P-norm Push	Linear	No	\(10^4\)
[2]	Indirect Methods	Infinity Push	Linear	Yes	\(10^3\)
[139]	Indirect Methods	Infinity Push	Linear	Yes	\(10^3\)
[98]	Indirect Methods	Infinity Push	Linear	Yes	\(10^6\)
[90]	Boosting-type	pAUC Surrogate	Cubic Spline	No	\(10^3\)
[157]	Boosting-type	pAUC Surrogate	Decision Stump	No	\(10^3\)
[170]	Heuristic Methods	pAUC Surrogate	Linear	No	\(10^2\)
[141]	Heuristic Methods	pAUC Surrogate	Any	No	\(10^5\)
[117, 118, 119]	Structural SVM	pAUC surrogate	Linear	Yes	\(10^6\)
[29, 40, 92, 120]	Constrained Opt.	Riemann approximation	Linear/Non-linear	Convex Only	\(10^5\)
[187]	Stochastic/Deep	Appr. Pairwise Surrogate	Deep Nets	No	\(10^5\)
[206]	Stochastic/Deep	(10), (11), (23), (24)	Deep Nets	Yes	\(10^5\)
[190]	Stochastic/Deep	(10)	Deep Nets	Yes	\(10^5\)

Table 3. Comparison of Different Studies for pAUC Maximization

5 Online AUC Maximization - the Second Age

In contrast to the full-batch methods that need all training data beforehand, online learning algorithms [21] can update the model parameter upon receiving new datum and can efficiently handle streaming data where examples are presented in sequence. Online learning with point-wise loss has been studied extensively [65, 131, 147]. However, online learning for AUC maximization has different challenges due to that the pairwise loss does not naturally fit the streaming data. In the literature, there has been a wave of studies focusing on online learning for AUC maximization. Below, we will categorize them into two classes, namely, online buffer-based methods, online statistics-based methods. Revolving around these methods, we will discuss two theoretical properties, i.e., regret bounds and statistical error bounds.

We first provide some background on regret bounds and statistical error bounds. In the standard online learning setting, there is no statistical assumption on the data received, e.g., IID assumption. Hence, the measure of interest is the regret bound. Let \(\lbrace ({\bf x}_1, y_1), \ldots , ({\bf x}_T, y_T)\rbrace\) denote the sequence of data received in the stream. To measure the regret, let \(L_t(\mathbf {w}_t, {\bf x}_t, y_t)\) denote the cost measure of the \(t\)th model \(\mathbf {w}_t\) with respect to the received data \(({\bf x}_t, y_t)\) at the \(t\)th iteration, let \(L(\mathbf {w}, \lbrace {\bf x}_t, y_t\rbrace _{t=1}^T)\) denote the cost measure defined on all data. Then the regret is defined as

\begin{align*} R_T = \sum _{t=1}^TL_t(\mathbf {w}_t, {\bf x}_t, y_t) - \min _{\mathbf {w}}\sum _{t=1}^TL_t(\mathbf {w}, {\bf x}_t, y_t). \end{align*}

There are different ways to define the cost at each iteration for AUC maximization, which will be discussed in the following:

When the received data is assumed to follow the IID assumption, the statistical error bound is another performance guarantee of interest. There are two types of statistical error bounds, namely, generalization error bounds and excess risk bounds, where the former refers to the bounds of the difference between the expected risk of a learned model and the empirical risk, and the latter refers to the bounds of the difference between the expected loss of a learned model and the optimal expected loss. In particular, let \(\mathcal {R}(\mathbf {w})\) denote the expected risk for AUC maximization, which is given by \(\mathcal {R}(\mathbf {w})=\mathbb {E}_{{\bf x}_+\sim \mathcal {P}_+, {\bf x}_-\sim \mathcal {P}_-}[\ell (f_\mathbf {w}({\bf x}_j) - f_\mathbf {w}({\bf x}_i))]\). Then the generalization error bounds usually take the form of \(\mathcal {R}(\widehat{\mathbf {w}}_T)\le \frac{1}{T}\sum _{t=1}^TL_t(\mathbf {w}_t, {\bf x}_t, y_t) + O(T^\alpha)\) for some \(\widehat{\mathbf {w}}_T\) and \(\alpha \gt 0\), and the excess risk bounds usually take the form of \(\mathcal {R}(\widehat{\mathbf {w}}_T)\le \min _{\mathbf {w}\in \mathcal {W}}\mathcal {R}(\mathbf {w}) + O(T^{-\alpha })\) for some \(\widehat{\mathbf {w}}_T\) and \(\alpha\).

5.1 Online Buffered Gradient Descent for AUC Maximization

The most representative online buffer-based methods is the online buffer gradient descent method proposed in the seminal paper cited in Reference [201] in 2011 by Zhao, Hoi, Jin, and Yang. It is the first work that studies online AUC maximization and inspires many following studies. They propose online buffered gradient descent methods, whose algorithmic framework is shown in Algorithm 1. There are two key functions, i.e., \(\text{UpdateBuffer}\) and \(\text{UpdateModel}\). In the paper, the authors define the following cost function for each iteration:

\begin{align*} L_t(\mathbf {w}, {\bf x}_t, y_t) = \sum _{i\lt t, y_i=1}\mathbb {I}(y_t=-1)\ell (f_\mathbf {w}({\bf x}_t) - f_\mathbf {w}({\bf x}_i)) + \sum _{i\lt t, y_i=-1}\mathbb {I}(y_t=1)\ell (f_\mathbf {w}({\bf x}_i) - f_\mathbf {w}({\bf x}_t)). \end{align*}

They update the buffer by using the “reservoir sampling” technique [161], which aims to simulate a uniform sampling of the received examples. They update the model parameter based on the gradient descent of the cost function \(L_t\) by only using examples in the buffer, i.e., \(({\bf x}_i, y_i)\in \mathcal {B}_t\). They establish a regret bound in the order of \(\scriptstyle {\sqrt {B_+T_+^3 + B_-T_-^3}}\), where \(B_+\) and \(B_-\) denote the buffer size for positive samples and negative samples, respectively, \(T_+\) and \(T_-\) denote the number of received positive examples and negative examples over \(T\) iterations, respectively. The authors provide an explanation regarding the optimal buffer size in the presence of the variance terms that have been ignored in the regret bound analysis, which gives an optimal buffer size \(B_+=\sqrt {T_+}\) and \(B_-=\sqrt {T_-}\).

Later, the statistical error bounds of online buffer-based methods are established in References [86, 168]. Wang et al. [168] provide the generalization error bounds for any arbitrary online learner with an infinite buffer size and a finite buffer size for learning from \(n\) examples. They use the covering number to bound the complexity of hypothesis and derive a generalization error bound of a tailed-averaged solution in the order of \(O(\log (\mathcal {N}(\epsilon) n/\delta)/\sqrt {\min (T, B)})\) with a high probability \(1-\delta\), where \(\mathcal {N}(\epsilon)\) denotes the cardinality of \(\epsilon\)-net of the hypothesis space for a small value \(\epsilon\), and \(B\) denotes the buffer size. Kar et al. [86] improve the generalization error bound of the method with an infinite buffer by using the Rademacher complexity of the hypothesis space. Their error bound of the averaged solution is in the order of \(O(C_d/\sqrt {T})\), where \(C_d\) is a constant in the Rademacher complexity that is dependent only on the dimension \(d\) of the input space. Based on the generalization error bound and the regret bound, they also establish an excess risk bound in the order of \(R_T/T + O(\frac{C_d+\log (T/\delta)}{T})\), where \(R_T\) is the regret bound and \(\delta \in (0,1)\) is the failure probability. In addition, they provide generalization error bounds and a high probability excess risk bound for online buffered gradient descent method with a finite-sized buffer, which has a dominating term of \(O(C_d\log (T/\delta)/\sqrt {B})\), where \(B\) is the buffer size.

Kar et al. [85] also study an online buffer-based method with an infinite buffer size for partial AUC maximization. In the paper, they define a different cost function for each iteration. Let \(L(\mathbf {w}, \lbrace {\bf x}_i, y_i\rbrace _{i=1}^t)\) denote the pairwise loss summed over all pairs received in the first \(t\) iterations. The cost function at the \(t\)th iteration is defined as \(L_t(\mathbf {w}; {\bf x}_t, y_t) = L(\mathbf {w}, \lbrace {\bf x}_i, y_i\rbrace _{i=1}^t) - L(\mathbf {w}, \lbrace {\bf x}_i, y_i\rbrace _{i=1}^{t-1})\). In this way, the optimal model in hindsight \(\min _{\mathbf {w}}L_t(\mathbf {w}; {\bf x}_t, y_t)\) indeed optimizes the objective of interest (e.g., pairwise-loss-based objective for AUC maximization). They employ the Follow-the-Regularized-Leader (FTRL) algorithm for updating the model and establish a regret bound in the order of \(\scriptstyle {\sqrt {T_+T_-^2 + T_-T_+^2}}\). They also establish an excess risk bound for a modified FTRL method that uses \(s\) samples per-iteration, which is in the order of \(O(1/T^{1/4})\) for \(s=\sqrt {T}\).

5.2 Online Statistics-based Methods

To address the issue of maintaining a large buffer size, Gao et al. [47] propose an online AUC maximization by leveraging the property of pairwise square loss for learning a linear model. They use the same definition of the cost function \(L_t(\mathbf {w}; {\bf x}_t, y_t)\) as Reference [201]. By using the square loss for learning a linear model, they show that the gradient of the cost function \(L_t\) can be computed based on first-order moments (mean vectors of positive and negative examples) and second-order moments (covariance matrices of positive and negative examples) of the received data before the \(t\)th iteration. Nevertheless, it also introduces high memory costs for maintaining the covariance matrix. To address this issue, the authors develop low-rank approximation methods and only update low-rank matrices for the second-order moments at each iteration. In the paper, the authors also establish the regret bounds for both full-rank and the low-rank approximation methods.

5.3 Online Non-linear Methods for AUC Maximization

Online nonlinear kernel methods based on AUC maximization have been proposed and studied in References [35, 70, 156] to address the non-separability of the data and the scalability issues. In particular, Ding et al. [35] extend the online buffered gradient descent method to learn non-linear kernel-based models. They employ two functional approximation strategies, i.e., random fourier features (RFF) [137] to approximate the shift invariant kernels and the Nyström method [172] to approximate the kernel matrix. For the two methods, the authors have established regret bounds in the order of \(\sqrt {T}\). Nevertheless, it is claimed that the RFF-based method require \(m=T\) random features for achieving a high probability bound.

Hu et al. [70] propose a different kernelized online AUC maximization method. They do not use RFF or the Nyström method to approximately compute the kernel similarities. Instead, they use the pairwise hinge loss or squared hinge loss as the surrogate loss and maintain support vectors of positive and negative classes in the online fashion, i.e., those examples whose contribution weights in the classifier are non-zero. They maintain and update two buffers for storing these support vectors and their contribution weights. The cost function at each iteration is defined similarly as in Reference [201] except that \(k\)-nearest examples to the received data in the buffer are used to compute the loss. They establish a regret bound in the order of \(\sqrt {T}\). They also present an extension to the multiple kernel learning framework that can automatically determine a good kernel representation.

Szörényi et al. [156] propose a kNN-based online AUC maximization method by suggesting an algorithmic solution based on the kNN-estimate of the conditional probability function. They use an infinite buffer that stores all received examples.

5.4 Adaptive Online AUC Maximization

References [25, 36, 103] propose and study adaptive online AUC maximization algorithms belonging to the two classes of online AUC maximization methods. Ding et al. [36] extend the online statistics-based method proposed in Reference [47] for AUC maximization by incorporating an online adaptive gradient method (AdaGrad) [39] for exploiting the knowledge of historical gradients. Cheng et al. [25] propose to use the Adam-style update in the framework of online buffered gradient descent methods, where the buffer is maintained in first-in-first-out fashion. In the paper, they use a non-convex ramp loss as the surrogate function of the indicator \(\mathbb {I}(f_\mathbf {w}({\bf x}_-) - f_\mathbf {w}({\bf x}_+))\) and use concave-convex procedure (CCCP) to approximate the cost function at each iteration. Liu et al. [103] leverage the Adam-style update [88] in the framework of online statistics-based method for AUC maximization.

5.5 Summary

Two classes of methods, namely, online buffer-based methods and online statistics-based methods, have been proposed for online AUC maximization. Online buffer-based methods are more generic, which can be leveraged for learning both linear and non-linear classifiers for any possible pairwise surrogate losses, while online statistics-based methods are restricted to learning linear models and using pairwise square loss. Nevertheless, online buffer-based methods usually require a large buffer to achieve a good performance, and online statistics-based methods could enjoy a lower regret and a lower memory costs for low-dimensional data. We compare different works in Table 4 from different perspectives.

Table 4.

Work	Category	Loss Functions	Models	Regret/Generalization Analysis	Memory Costs
[201]	Buffer-based	Pairwise AUC Surrogate	Linear	Yes	\(O(Bd)\)
[168]	Buffer-based	Pairwise AUC Surrogate	Linear	Yes	\(O(Td)\)
[86]	Buffer-based	Pairwise AUC Surrogate	Linear	Yes	\(O(Td)\) or \(O(Bd)\)
[85]	Buffer-based	Pairwise pAUC Surrogate	Linear	Yes	\(O(Td)\)
[35]	Buffer-based	Pairwise AUC Surrogate	Kernel	Yes	\(O(Bd)\)
[70]	Buffer-based	Pairwise (squared) hinge loss	Kernel	Yes	\(O(Bd)\)
[25]	Buffer-based	Pairwise Ramp loss	Linear	Yes	\(O(Bd)\)
[156]	Buffer-based	NA	Non-Parametric	Yes	\(O(Td)\)
[47]	Statistics-based	Pairwise square loss	Linear	Yes	\(O(d^2)\) or \(O(dr)\)
[36]	Statistics-based	Pairwise square loss	Linear	Yes	\(O(d^2)\)
[103]	Statistics-based	Pairwise square loss	Linear	Yes	\(O(d^2)\)

Table 4. Comparison of Different Studies for Online AUC Maximization

Where \(T\) is the Total Number of Iterations, \(B\) is a Fixed Buffer Size, \(r\) is a Constant Size of the Low Rank Approximation, \(d\) is the Dimensionality of Input Data.

6 Stochastic AUC Maximization - the Third Age

Stochastic AUC maximization refers to a family of methods that only process one or a small mini-batch of examples at each iteration for updating the model parameters, which are amenable for handling big data. The difference from online AUC maximization is that the IID assumption of data is typically assumed in stochastic AUC maximization. In this section, we provide a review on works for learning linear and kernel-based models for AUC maximization, and present a review for deep AUC maximization in Section 7. We categorize the existing stochastic methods for AUC maximization into two classes, i.e., stochastic batch-based pairwise methods, stochastic primal-dual methods. The existing works consider two learning settings: online setting similar to stochastic approximation in conventional literature [148] and offline setting similar to stochastic average approximation in conventional literature [125]. In the online setting, the data \(\lbrace {\bf z}_1,{\bf z}_2,\ldots , {\bf z}_t, \ldots \rbrace\) is assumed to be i.i.d. from an unknown distribution and continuously arriving, i.e., streaming data, and the goal is to minimize the expected loss in Equation (2). In the offline setting, a set of training data \(\mathcal {S}=\lbrace ({\bf x}_i, y_i), i\in [n]\rbrace\) of size \(n\) is given beforehand, and the goal is to minimize the empirical loss in Equation (9). There are two different errors that have been analyzed for different algorithms, namely, optimization error and statistical error. For optimizing the expected loss (2), the optimization error and the statistical error coincide.

6.1 Stochastic Batch-based Pairwise Methods

The idea of batch-based pairwise methods is to use a mini-batch of data points for computing a stochastic gradient estimator for updating the model parameter. Below, we discuss two categories of methods for the offline setting and the online setting, respectively.

Offline setting. . A straightforward approach for designing stochastic AUC maximization algorithms is by using stochastic gradients of the pairwise loss function \(\ell (\mathbf {w}; {\bf z}, {\bf z}^{\prime }) = \ell (f_\mathbf {w}({\bf z}^{\prime }) - f_\mathbf {w}({\bf z}))\) for the sampled positive-negative pairs \(({\bf z}, {\bf z}^{\prime })\). Then the model parameter can be updated by any suitable stochastic algorithms, e.g., SGD. This approach has been adopted and studied in several papers with different aims [33, 53, 96, 184].

Gu et al. [53] focus on establishing statistical error in the order of \(O(1/\sqrt {n})\) of a stochastic algorithm based on a finite training data set of size \(n\). They propose a doubly stochastic gradient algorithm (AdaDSG) by solving regularized pairwise learning problems. Specifically, at each stage, AdaDSG uses an inner solver to solve a sampled sub-problem and then uses the solution obtained from this sub-problem as a warm start for the next larger problem with a doubled size of training samples. The inner solver simply uses the SGD method based on a randomly sampled positive-negative pair for updating the model parameter. Reference [33] proposes a triply stochastic functional gradient for AUC maximization problem for learning a kernelized model. At each iteration, this algorithm performs SGD update based on an unbiased functional gradient calculated from a random pair of examples using random Fourier features (the pair of examples and the random variable for constructing the Fourier features constitute the triplet). A convergence rate in the order \(\mathcal {O}({1\over {T}})\) for the optimization error was established for the strongly regularized empirical AUC maximization problem. Reference [149] also considers a similar algorithm for semi-supervised ordinal regression based on AUC optimization.

Recent works cited in References [95, 96] focus on establishing the statistical error for a specific type of SGD algorithms for pairwise learning in the offline setting. In particular, they study the following SGD-type algorithm for pairwise learning: At the each iteration, it randomly draws \(({\bf z}_{i_t}, {\bf z}_{j_{t}})\) from all possible \({n(n-1)/2}\) pairs of examples, and the model parameter is updated by \(\mathbf {w}_{t+1} = \mathbf {w}_{t} - \eta _t \nabla \ell (\mathbf {w}_{t}; {\bf z}_{i_t}, {\bf z}_{j_{t}})\). Reference [95] uses the the concept of uniform stability [15] and the corresponding high probability generalization bounds [16, 44] to derive the excess risk bound \(\mathcal {O}({\log n/\sqrt {n}})\) in the convex case. Reference [96] further provides improved results by incorporating the variance information and show that, under an interpolation or a low noise assumption, the risk bounds can achieve \(\mathcal {O}({1/n})\) through exploiting the smoothness assumption.

Yang et al. [184] propose a simple SGD-type algorithm for pairwise learning where, at the each iteration, it randomly draws \(i_t\in [n]\) and the current example \({\bf z}_{i_t}\) is paired with previous one \({\bf z}_{i_{t-1}}\), and the model parameter is updated by the SGD based on the pair \(({\bf z}_{i_t}, {\bf z}_{i_{t-1}})\), i.e., \(\mathbf {w}_{t+1} = \mathbf {w}_{t} - \eta _t \nabla \ell (\mathbf {w}_{t}; {\bf z}_{i_t}, {\bf z}_{i_{t-1}})\). The authors have established excess risk bounds \(\mathcal {O}({1\over \sqrt {n}})\) for smooth and non-smooth convex losses and smooth non-convex losses under Polyak-Lojasiewicz (PL) condition, in different orders with different number of iterations. Online setting.. The online setting is more challenging due to that each iteration only receives or samples one data point. The challenge is that an unbiased stochastic gradient cannot be computed based on one data point. To address the challenge, the received data will be stored in a buffer and will be used for computing a stochastic gradient, which is similar to online AUC optimization [86, 168, 169]. References [11, 58, 193] have considered SGD for pairwise learning in the stochastic setting with an infinite buffer. In particular, for such SGD-based pairwise learning algorithms, at each iteration, the current example \({\bf z}_{t}\) is paired with previous ones \(\lbrace {\bf z}_{1},\ldots ,{\bf z}_{{t-1}}\rbrace\), the model parameter is updated by gradient descent based on the gradient of \(L_t(\mathbf {w}_t; {\bf z}_t) = {1\over t-1} \sum _{j=1}^{t-1} \ell (\mathbf {w}_t; {\bf z}_t, {\bf z}_{j})\), i.e., \(\mathbf {w}_{t+1} = \mathbf {w}_t - \eta _t \nabla L_t(\mathbf {w}_t; {\bf z}_t)\) where \(\eta _t\) is a step size. The authors of References [58, 193] prove the convergence of such SGD-based algorithm for learning a kernelized model with a convergence rate of \(\mathcal {O}({1\over t})\) for a strongly convex objective function and \(\mathcal {O}({1\over \sqrt {t}})\) for using the pairwise square loss without explicit regularization term. For learning a linear model, Boissier et al. [11] prove that a fast convergence rate \(\mathcal {O}({1\over t})\) is still possible for using the pairwise square loss. However, it is notable that the algorithms in References [11, 58, 193] are not scalable to large-scale datasets, since the buffer size increases as the number of sampled data. This issue can be addressed by the stochastic primal-dual methods discussed shortly.

6.2 Stochastic Primal-Dual (PD) Methods

The idea of stochastic primal-dual methods is to directly apply stochastic methods for addressing the min-max formulations of AUC maximization, e.g., Equation (12). The benefit of the min-max formulations is that the minimax objective is simply the average of individual data, which makes it suitable to the online setting. Nevertheless, the algorithms discussed below can be applied to both online and offline settings. It is notable that most of the algorithms discussed in this subsection are developed for solving the min-max formulation for the pairwise square loss. However, many of them can be easily extended for solving the min-max margin loss (16).

Ying et al. [192] are the first to propose the idea of solving a minimax objective (i.e., Equation (12)) by a stochastic algorithm for AUC maximization with a pairwise square loss.

The authors propose to use the stochastic first-order primal-dual algorithm [124] for AUC maximization, which is referred to as SOLAM. The algorithm uses a stochastic gradient descent for updating the primal variables \(\mathbf {w}, a, b\) and uses a stochastic gradient ascent for updating the dual variable \(\alpha\). It enjoys a convergence rate \(\mathcal {O}({1\over \sqrt {T}})\) with a per-iteration complexity \(\mathcal {O}(d)\) for learning a linear model of dimensionality of \(d\).

In the subsequent work cited in References [97, 121], the authors leverage the special formulation of the minimax objective for AUC maximization with a square loss to derive faster algorithms for learning a linear model, i.e., \(f_\mathbf {w}({\bf x})=\mathbf {w}^{\top }{\bf x}\). In particular, Natole et al. [121] derive a closed form solution for \(a, b, \alpha\) given \(\mathbf {w}\) for Equation (12), i.e., \(a(\mathbf {w}) = \mathbf {w}^{\top }\mathbb {E}[{\bf x}|y=1]\), \(b(\mathbf {w})=\mathbf {w}^{\top }\mathbb {E}[{\bf x}|y=-1]\) and \(\alpha (\mathbf {w}) = b(\mathbf {w})+ c - a(\mathbf {w}).\)¹ Given that data statistics \(\mathbb {E}[{\bf x}| y=1]\), \(\mathbb {E}[{\bf x}| y=-1]\) and the probability \(p=\Pr (y=1)\) can easily be estimated from training data, the authors propose a stochastic proximal gradient descent algorithm that only updates \(\mathbf {w}\) while the auxiliary variables \(a, b\), and \(\alpha\) are subsequently computed from \(\mathbf {w}\) using the updated data statistics. A fast convergence rate \(\mathcal {O}({1\over T})\) is proved for AUC maximization by leveraging the strong convexity of the regularization term, e.g., \(\ell _2\) norm square regularization. Lei and Ying [97] give an alternative but self-contained proof for stochastic saddle point formulation in References [121, 192] by writing the objective function in Equation (12) with \(c=1\) as

\begin{align*} &(1 - \mathbf {w}^\top (x - x^{\prime }))^2 = ([1 + \alpha (\mathbf {w}) ] + [\mathbf {w}^\top x^{\prime } - b(\mathbf {w})] - [\mathbf {w}^\top x - a(\mathbf {w})])^2 \end{align*}

\begin{align} & \!\!= ([1 + \mathbf {w}^\top (\mathbb {E}[\tilde{x}|\tilde{y}=-1] - \mathbb {E}[\tilde{x}|\tilde{y}=1])] \!+\! [\mathbf {w}^\top (x^{\prime } - \mathbb {E}[\tilde{x}|\tilde{y}=-1]) - \mathbf {w}^\top (x - \mathbb {E}[\tilde{x}|\tilde{y}=1])])^2. \end{align}

(19)

Based on this important observation, they prove that AUC maximization (9) is equivalent to \(\min _{\mathbf {w}} \mathbb {E}_{{\bf z}} [ \widetilde{F}(\mathbf {w};{\bf z}) ] =f(\mathbf {w})\), where

\begin{align*} & \widetilde{F}(\mathbf {w};{\bf z}) =p(1-p)+ (1-p)(\mathbf {w}^\top (x-\mathbb {E}[{\bf x}^{\prime }|y^{\prime }=1]))^2\mathbb {I}(y=1) +p(\mathbf {w}^\top ({\bf x}-\mathbb {E}[{\bf x}^{\prime }|y^{\prime }=-1]))^2{\bf I}_{[y=-1]} \\ & +2p(1-p)\mathbf {w}^\top (\mathbb {E}[{\bf x}^{\prime }|y^{\prime }=-1]-\mathbb {E}[{\bf x}^{\prime }|y^{\prime }=1]) +p(1-p)(\mathbf {w}^\top (\mathbb {E}[{\bf x}^{\prime }|y^{\prime }=-1]-\mathbb {E}[{\bf x}^{\prime }|y^{\prime }=1]))^2. \end{align*}

From this key observation, they propose a stochastic proximal stochastic gradient (SPAM) that also only needs to update \(\mathbf {w}\). In particular, the authors prove that, in either the unconstrained case without explicit regularizer or with a strong convex regularizer, SPAM can achieve a fast convergence rate \(\mathcal {O}({1\over T})\) with a linear per-iteration cost \(\mathcal {O}(d).\)

There are further studies trying to improve the convergence rate for solving the minimax objective (12) without assuming the strong convexity of the regularizer. Liu et al. [102] propose an improved stochastic algorithm for solving the minimax objective of AUC maximization. The idea is to leverage the strong concavity in terms of the dual variable \(\alpha\) and a proved error bound condition of the primal objective function in terms of \(\mathbf {w}, a, b\). Their algorithm needs to know the total number of iterations \(T\) beforehand and divides the update into multiple stages according to \(T\), and each stage calls a stochastic primal-dual method with a constant step size. After each stage, the step size is decreased by a constant factor. Their algorithm enjoys a convergence rate of \(O(1/T)\) for \(T\) iterations with one example per-iteration. Later on, Yan et al. [180] consider a more general minimax objective under an error bound condition of the primal objective and develop a stagewise stochastic algorithm without knowing the total number of iterations \(T\) in advance. Their algorithm also enjoys a convergence rate of \(O(1/T)\).

Stochastic algorithms with linear convergence for AUC maximization by using more advanced techniques, e.g., variance-reduction, have been considered in several later works, e.g., References [32, 123, 189]. Natole Jr et al. [123] propose a minibatch stochastic primal-dual algorithm (SPDAM) with a linear convergence rate. This algorithm is adapted from the mini-batch stochastic primal-dual coordinate method in Reference [200] to the problem of AUC maximization with the pairwise square loss and a strongly convex regularizer. The authors prove its linear convergence rate \(e^{ - c(n,m, \lambda) T}\) where \(c(n,m,\lambda)\) depends on the size \(m\) of the minibatch set, the size \(n\) of training data, and the strong convexity parameter \(\lambda .\) Reference [32] further extends SPAM [121] by using the variance-reduction technique [83]. It enjoys a linear convergence rate \(e^{-c(M,\beta , \eta)T}\) where \(\beta\) is the strongly-convex parameter, \(M\) is the strongly smooth parameter, and \(\eta\) is the constant step size.

In References [189, 203], the authors develop efficient sparse AUC maximization algorithms with the pairwise square loss for analyzing the high dimensional data. Both studies use the minimax objective (e.g., Equation (12)) and the explicit solutions for the auxiliary variables \(a,b\), and \(\alpha\) as observed in References [97, 121]. In particular, Reference [189] uses the hard thresholding algorithms for AUC maximizatin and prove its linear convergence under the assumption of restricted strong convexity (RSC) and restricted strong smoothness (RSS) on the objective function. Reference [203] considers the application of AUC maximization for handling sparse high-dimensional datasets in the sense that the number of nonzero features \(k\) in each example is far less than the total number of features \(d\). Such datasets are abundant in online spam filtering [145], ad click prediction [112], and identifying malicious URLs [108]. They develop a generalized Follow-The-Regularized-Leader framework [111] for AUC maximization with a lazy update that only involves a per-iteration cost \(O(k).\)

Recently, Reference [185] also proposes stochastic primal-dual algorithm for solving AUC maximization with a general convex pairwise loss. They propose to use Bernstein polynomials [135] to uniformly approximate a general loss. This reduction for AUC maximization with a general convex pairwise loss is equivalent to a weakly convex min-max problem (for learning a linear model). Then, the authors apply the stochastic proximal point-based method [136] for AUC maximization that has a per-iteration cost \(\mathcal {O}(md)\), where \(m\) is the degree of Bernstein polynomials used to approximate the original convex surrogate loss. Despite its non-convexity, they have proved its global convergence by exploring the appealing convexity-preserving property [135] of Bernstein polynomials and the intrinsic structure of the min-max formulation. However, the final convergence in terms of the original objective function is of a slow rate \(\mathcal {O}({1\over \sqrt {m}}).\)

6.3 Summary

Two main classes of methods have been proposed for stochastic AUC maximization: stochastic batch-pairwise (BP) methods and stochastic primal-dual (PD) methods. Stochastic batch-pairwise methods are generic, which depend on the strategy of pairing examples, while the stochastic PD methods explore the special problem structure that facilitates the design of fast stochastic optimization algorithms. We have compared different works in Table 5 from different perspectives.

Table 5.

Work	Category	Objective	Model	Guarantee	Rate	Memory Cost
[58]	Stochastic BP	Pairwise hinge loss	kernel	Opt. Error	\(\mathcal {O}({1/ T})\)	\(\mathcal {O}(T^2)\)
[193]	Stochastic BP	Pairwise square loss	kernel	Opt. Error	\(\mathcal {O}({1/ {T}^{1/3}})\)	\(\mathcal {O}(T^2)\)
[11]	Stochastic BP	Pairwise square loss	linear	Opt. Error	\(\mathcal {O}({1/ T})\)	\(\mathcal {O}(d^2)\)
[33]	Stochastic BP	Pairwise loss	linear	Opt. Error	\(\mathcal {O}({1/T})\)	\(\mathcal {O}(d)\)
[53]	Stochastic BP	Pairwise loss	linear	Stat. Error	\(\mathcal {O}({1/{\sqrt {n}}})\)	\(\mathcal {O}(D d)\)
[96]	Stochastic BP	Pairwise loss	linear	Stat. Error	\(\mathcal {O}({1/ \sqrt {n} })\)	\(\mathcal {O}(d)\)
[184]	Stochastic BP	Pairwise loss	linear	Stat. Error	\(\mathcal {O}(1/ \sqrt {n})\)	\(\mathcal {O}(d)\)
[192]	Stochastic PD	Minimax	linear	Opt. Error	\(\mathcal {O}({1/ \sqrt {T}})\)	\(\mathcal {O}(d)\)
[102]	Stochastic PD	Minimax (square loss)	linear	Opt. Error	\(\mathcal {O}({1/{T}})\)	\(\mathcal {O}(d)\)
[180]	Stochastic PD	Minimax	linear	Opt. Error	\(\mathcal {O}({1/{T}})\)	\(\mathcal {O}(d)\)
[121]	Stochastic PD	Minimax (square loss)	linear	Opt. Error	\(\mathcal {O}({1/{T}})\)	\(\mathcal {O}(d)\)
[97]	Stochastic PD	Minimax (square loss)	linear	Opt. Error	\(\mathcal {O}({1/ {T}})\)	\(\mathcal {O}(d)\)
[123]	Stochastic PD	Minimax (square loss)	linear	Opt. Error	\(\mathcal {O}({1/ {T}})\)	\(\mathcal {O}(d)\)
[189]	Stochastic PD	Minimax (square loss)	linear	Opt. Error	\(e^{-c(\rho _k^+(B),\rho _k^-) T }\)	\(\mathcal {O}(B d)\)
[32]	Stochastic PD	Minimax (square loss)	linear	Opt. Error	\(e^{-c(\beta , \eta , M) T }\)	\(\mathcal {O}(d)\)
[185]	Stochastic PD	Minimax (general loss)	linear	Opt. Error	\(\mathcal {O}({1/\sqrt {m} })\)	\(\mathcal {O}(m d)\)

Table 5. Comparison of Different Studies for Stochastic AUC Maximization Algorithms

Where \(T\) is the Total Number of Iterations, \(d\) denotes the Dimensionality of the Input Data, \(B\) is the Batch Size used in Reference [189] on which the Parameter \(\rho _k^+(B)\) is Dependent, \(m\) is the Degree of Bernstein Polynomials to Approximate the General Convex Loss used in Reference [185], \(\rho _k^+(B)\) and \(\rho _k^-\) are RSC and RSS Parameters used in Reference [189], and \(\beta , M\) and \(\eta\), Respectively, Denote the Strong-convexity Parameter, the Smooth Parameter and the Constant Step Size in Reference [32]. Opt. is short for optimization and Stat. is short for statistical.

7 Deep AUC Maximization (DAM): the Fourth Age

Recently, there is a surge of interest in AUC maximization for learning deep neural networks, i.e., deep AUC maximization (DAM). This problem has received much attention from the algorithmic perspective for solving the minimax objective of AUC maximization due to its advantage over the pairwise-loss-based objective for big data. Then, it is employed for solving real-world classification problems (e.g., medical image classification) and achieves great success [196]. Below, we will survey related works from algorithmic and practical perspectives. We would like to point out that all algorithms surveyed below are also stochastic algorithms. However, the differences from works in the third age in that (i) algorithms presented below are applicable to any deep neural networks; in contrast, many algorithms in the third age are developed for learning linear models by leveraging the special structure of the objective.; (ii) deep AUC maximization has faced some unique challenges, e.g., feature learning, regularization and normalization, which will be discussed in Section 8.

7.1 Non-convex Concave Min-Max Optimization

For deep learning, the prediction function \(f_\mathbf {w}({\bf x})\) is a non-linear function of the model parameter \(\mathbf {w}\), which makes the objective in Equation (9) and the minimax objective in Equations (12) and (16) non-convex. Although standard stochastic methods (e.g., SGD, Adam) can be directly applied for solving the pairwise-loss-based objective in Equation (9) with provable convergence to a stationary point, these methods are not directly applicable to the minimax objective in Equation (12), which is more suitable for online learning and distributed optimization. The minimax objective in Equations (12) and (16) is a non-convex strongly concave problem. Below, we will focus on stochastic methods for solving non-convex min-max problems, and we categorize different stochastic methods into two classes, i.e., two-loop proximal point-based methods and single-loop stochastic primal-dual methods. Without loss of generality, we consider the following min-max optimization problem for discussion:

\begin{align} \min _{\mathbf {w}} F(\mathbf {w}):= \left\lbrace \max _{\alpha \in \Omega }F(\mathbf {w}, \alpha) = \mathbb {E}_{{\bf z}}[F(\mathbf {w}, \alpha ; {\bf z})]\right\rbrace . \end{align}

(20)

Proximal Point-based Methods. . The proximal point-based methods follow a common framework as shown in Algorithm 2. This general framework has several unique features: (i) the algorithm is run in multiple stages \(k=1, \ldots , K\); (ii) at each stage a quadratic regularized function \(F_k(\mathbf {w}, \alpha)\) is constructed by adding a quadratic function \(\frac{\gamma }{2}\Vert \mathbf {w}- \mathbf {w}_0^k\Vert ^2\), where \(\gamma \gt 0\) is a proper hyperparameter; (iii) a proper stochastic algorithm \(\mathcal {A}\) is employed for solving the regularized function with a step size \(\eta _k\) and a number of iterations specified by \(T_k\), whose output denoted by \(\mathbf {w}_k, \alpha _k\) that are usually the last or the averaged solutions across all iterations in this stage; (iv) the step size \(\eta _k\) and the number of iterations \(T_k\) are changed appropriately for next stage. The following different methods differ in how to change \(\eta _k, T_k\) and how to implement the function \(\Lambda\) for computing \(\alpha _0^k\):

Rafique et al. [136] are the first to study non-convex concave min-max optimization problems and to establish the convergence rate. In particular, they assume the objective function \(F(\mathbf {w}, \alpha)\) is weakly convex in terms of the primal variable \(\mathbf {w}\) and is (strongly) concave in terms of the dual variable \(\alpha\). A function is called weakly convex if it becomes a convex function by adding a quadratic function in term of the decision variable with a proper scaling factor. This is the motivation of adding \(\frac{\gamma }{2}\Vert \mathbf {w}- \mathbf {w}_0^k\Vert ^2\) to the objective at each stage, which can make the objective convex or strongly convex with an appropriate \(\gamma \gt 0\). Since the objective function is non-convex and not necessarily smooth, they consider a convergence measure for weakly convex function, i.e., nearly stationary solution [34]. An \(\epsilon\)-level nearly stationary solution to a problem \(\min _{\mathbf {w}}F(\mathbf {w})\) is defined as a point \(\mathbf {w}\) such that there exists a point \(\widehat{\mathbf {w}}\) satisfying \(\Vert \mathbf {w}- \widehat{\mathbf {w}}\Vert \le O(\epsilon)\) and \(\text{Dist}(0, \partial F(\widehat{\mathbf {w}}))\le \epsilon\), where \(\text{Dist}(\cdot ,\cdot)\) denotes the Euclidean distance from a point to a set. In this work, the authors consider both the online setting and the offline (a.k.a. finite-sum) setting for the objective function \(F(\mathbf {w}, \alpha)\). For the online setting, they employ stochastic mirror descent (SMD) method for implementing \(\mathcal {A}\). The parameters are set as \(\eta _k\propto 1/\sqrt {k}, T_k\propto k^2\) when the objective is only concave in terms of the dual variable \(\alpha\), and are set as \(\eta _k\propto 1/k, T_k\propto k\) when the objective is strongly concave in terms of the dual variable. When the objective is weakly convex and concave, the sample complexity is in the order of \(O(1/\epsilon ^6)\) for finding an \(\epsilon\)-level nearly stationary solution to \(F(\mathbf {w}) = \max _{\alpha \in \Omega }F(\mathbf {w}, \alpha)\), and when the objective is strongly concave, they improve the sample complexity to \(O(1/\epsilon ^4 + C/\epsilon ^2)\) by considering a special class such that \(\alpha _*=\arg \max _{\alpha \in \Omega }F(\mathbf {w}, \alpha)\) can be computed, where \(C\) denotes the complexity for computing \(\alpha _*\) given \(\mathbf {w}\). To enjoy this improved complexity, they compute \(\alpha _0^k\) by solving \(\max _{\alpha \in \Omega }F(\mathbf {w}_0^k, \alpha)\) to the optimal solution for \(\alpha\). For the finite-sum setting with \(n\) components for the function \(F(\mathbf {w}, \alpha)\), they improve the complexity to \(O(n/\epsilon ^2)\) when the objective is strongly concave in terms of \(\alpha\).

Yan et al. [179] further improve the algorithm and complexity for solving weakly convex and strongly concave min-max problems. They do not assume certain structure of the objective function or the optimal dual variable can be easily computed given \(\mathbf {w}\). Their algorithm is similar to the first algorithm proposed in Reference [136], i.e., the initial solution \(\alpha ^k_0\) is simply the averaged solution from last stage of running \(\mathcal {A}\), i.e., \(\Lambda (\mathbf {w}_{k-1}, \alpha _{k-1})=\alpha _{k-1}\). They develop a novel analysis to prove the algorithm enjoys a sample complexity of \(O(1/\epsilon ^4)\) for finding an \(\epsilon\)-level nearly stationary solution to \(F(\mathbf {w})\). The key challenge lies at tackling error \(\Vert \alpha ^k_0 - \alpha (\mathbf {w}_k)\Vert ^2\) in the upper bound for solving \(\min _{\mathbf {w}}\max _{\alpha }F_k(\mathbf {w},\alpha)\), where \(\alpha (\mathbf {w}_k)= \arg \max _{\alpha \in \Omega }F(\mathbf {w}, \alpha)\). In Reference [136], the authors compute \(\alpha ^k_0=\alpha (\mathbf {w}_0^k)\), which reduces the dual error \(\Vert \alpha ^k_0 - \alpha (\mathbf {w}_k)\Vert ^2\) to \(\Vert \mathbf {w}_{k-1} - \mathbf {w}_k\Vert ^2\) due to the Lipchitz continuity of \(\alpha (\mathbf {w})\), which is decreasing to zero. In contrast, the authors of Reference [179] avoid computing \(\alpha _0^k=\alpha (\mathbf {w}_0^k)\) instead directly set \(\alpha _0^k=\alpha _{k-1}\). As a result, they need to explicitly tackle the error \(\Vert \alpha ^k_0 - \alpha (\mathbf {w}_k)\Vert ^2\). To this end, they develop a novel analysis based on a new Lyapunov function to prove the convergence. In contrast to that in Reference [136], which uses the recursion of \(F(\mathbf {w}_{k-1}) - F(\mathbf {w}_k)\), Yan et al. use both the recursions of the duality gap of the regularized function \(F_k\) and of \(F(\mathbf {w}_{k-1}) - F(\mathbf {w}_k)\). They are able to bound \(\Vert \alpha ^k_0 - \alpha (\mathbf {w}_k)\Vert ^2\) by the duality gap of the regularized function.

When the objective is just concave in terms of the dual variable, Zhao [202] develop a stagewise stochastic algorithm similar to Algorithm 1 except that the primal function is also smoothed by adding a strongly concave term on the dual variable, which has the same complexity as Reference [136].

Liu et al. [101] consider the deep AUC maximization explicitly and develop the first practical and provable stochastic algorithms for deep AUC maximization based on the min-max formulation of the pairwise square loss function, which enjoy a faster convergence rate. In particular, they assume that the primal objective function \(F(\mathbf {w})\) satisfies a PL condition, i.e., there exists \(\mu \gt 0\) such that \(\Vert \nabla F(\mathbf {w})\Vert ^2 \ge \mu (F(\mathbf {w}) - F_*)\), where \(F_*\) denotes the global minimum of \(F\). They show that two-layers neural network satisfy this PL condition. Based on this condition, they have shown that Algorithm 2 enjoys a faster convergence rate in the order of \(O(1/(\mu ^2\epsilon))\) for finding an \(\epsilon\)-level optimal solution. For \(\alpha _0^k\), they compute it similarly to that in Reference [136] except that it is approximated by sampling a number of data. For the parameters \(\eta _k, T_k\), they decrease \(\eta _k\) geometrically and increase \(T_k\) geometrically. For the stochastic algorithm \(\mathcal {A}\), they employ both stochastic primal-dual gradient method and stochastic primal-dual adaptive gradient method, where the latter one could enjoy even faster convergence when the stochastic gradients have a slow growth.

Recently, Guo et al. [57] propose a family of Proximal Epoch Stochastic (PES) methods for more generic non-convex min-max optimization under a PL condition and establish several improved rates under different conditions, e.g., near convexity condition of the primal objective, and Lipchitz condition of stochastic gradients. Under these conditions, they can reduce the sample complexity to \(O(1/(\mu \epsilon))\). They also analyze the convergence rates for multiple stochastic algorithms \(\mathcal {A}\), including stochastic gradient descent ascent, stochastic optimistic gradient descent ascent, stochastic primal-dual STORM updates, and so on. In addition, they establish the PL condition of the primal objective for AUC maximization for learning over-parameterized neural networks.

Guo et al. [55, 195] also study the federated deep AUC maximization by solving the min-max formulations in a distributed fashion and establish both computation and communication complexity under a PL condition of the objective function. It is notable that Reference [195] claims that they achieve the optimal communication complexity.

Single-loop Stochastic Primal-Dual Methods. . A generic framework of single-loop stochastic gradient descent ascent methods is shown in Algorithm 3. At each iteration, it computes a stochastic gradient estimator \(\mathbf {u}_{t+1}\) of \(\nabla _\mathbf {w}F(\mathbf {w}_t, \alpha _t)\) and then updates the primal variable based on this gradient estimator. Then it computes a stochastic gradient estimator \(\mathbf {v}_{t+1}\) of \(\nabla _\alpha F(\widehat{\mathbf {w}}_t, \alpha _t)\) and update, the dual variable based on \(\mathbf {v}_{t+1}\) for some \(\widehat{\mathbf {w}}_t\). Different methods differ from each other on how to compute the gradient estimators \(\mathbf {u}_{t+1}\) and \(\mathbf {v}_{t+1}\).

Lin et al. [100] are the first to analyze the single-loop primal-dual method (the basic stochastic gradient descent ascent method, i.e., SGDA) for non-convex concave min-max optimization problems, corresponding to Algorithm 3 with \(\widehat{\mathbf {w}}_t=\mathbf {w}_t\). In the paper, they assume the objective function \(F(\mathbf {w}, \alpha)\) is smooth in terms of both \(\mathbf {w}\) and \(\alpha\). They compute \(\mathbf {u}_{t+1} = \frac{1}{B}\sum _{{\bf z}_t\in \mathcal {B}}\nabla _\mathbf {w}F(\mathbf {w}_t, \alpha _t, {\bf z}_t)\) and \(\mathbf {v}_{t+1} =\frac{1}{B}\sum _{{\bf z}_t\in \mathcal {B} } \nabla _\alpha F(\mathbf {w}_t, \alpha _t, {\bf z}_t)\) based on a batch of \(B\) samples. However, their convergence results are unsatisfactory. In particular, for non-convex concave min-max problems, their analysis yields an \(O(1/\epsilon ^8)\) complexity for finding an \(\epsilon\)-stationary solution to \(F(\mathbf {w})\); and for non-convex strongly concave min-max problems, their analysis requires a large mini-batch size in the order of \(O(1/\epsilon ^2)\) and yields an \(O(1/\epsilon ^4)\) sample complexity. It is worth to point out that the complexity for the former case is worse than that established in Reference [136] and the complexity for the latter case matches that in Reference [136] but requires a large mini-batch size, which is not required in Reference [136]. Recently, Boţ and Böhm [14] extend the analysis to stochastic alternating (proximal) gradient descent ascent method, which uses \(\widehat{\mathbf {w}}_t = \mathbf {w}_{t+1}\) to compute the estimator \(\mathbf {v}_{t+1}\). However, this algorithm suffers from the same issue of requiring a large mini-batch size and the worse complexity for non-convex concave min-max problems.

Recently, Guo et al. [56] develop a new stochastic primal-dual method for solving non-convex strongly concave min-max problems under the smoothness assumption of \(F(\mathbf {w}, \alpha)\). They address the issue of large mini-batch size requirement in References [14, 100]. The key improvement lies at using moving average to compute the estimator \(\mathbf {u}_{t+1}\), i.e., \(\mathbf {u}_{t+1} = (1-\beta _{1,t})\mathbf {u}_t + \beta _{1,t} \mathcal {O}_\mathbf {w}(\mathbf {w}_t, \alpha _t)\) and simply use \(\mathbf {v}_{t+1} = \mathcal {O}_\alpha (\mathbf {w}_t, \alpha _t; {\bf z}_t)\), where \(\mathcal {O}_\mathbf {w}\) and \(\mathcal {O}_\alpha\) denote an unbiased stochastic estimator of \(\nabla _\mathbf {w}F(\mathbf {w}, \alpha)\) and \(\nabla _\alpha F(\mathbf {w}, \alpha)\), respectively. The authors also establish the convergence using adaptive step sizes such as the Adam-style with a sample complexity in the order of \(O(1/\epsilon ^4)\). This is the first work that establishes the convergence Adam-style updates for solving non-convex min-max problems.

An improved complexity of \(O(1/\epsilon ^3)\) is achieved in several recent works under the Lipschitz continuous assumption for the stochastic gradient \(\nabla _\mathbf {w}F(\mathbf {w}, \alpha ; {\bf z})\) and \(\nabla _\alpha F(\mathbf {w}, \alpha ; {\bf z})\) [75, 107], which is a stronger condition than the smoothness condition of the objective function. Luo et al. [107] are the first to establish such an improved rate. Their algorithm called SREDA uses the SPIDER/SARAH technique [42, 129] to update the gradient estimators \(\mathbf {u}_{t+1}\) and \(\mathbf {v}_{t+1}\), i.e., \(\mathbf {u}_{t+1} = \mathbf {u}_t + \frac{1}{B}\sum _{{\bf z}_t\in \mathcal {B}}\nabla _\mathbf {w}F(\mathbf {w}_t, \alpha _t, {\bf z}_t) - \frac{1}{B}\sum _{{\bf z}_t\in \mathcal {B}}\nabla _\mathbf {w}F(\mathbf {w}_{t-1}, \alpha _{t-1}, {\bf z}_t)\), where \(B\) is in the order of \(O(1/\epsilon)\). \(\mathbf {u}_t\) and \(\mathbf {v}_t\) are re-computed based on a large batch size in the order of \(O(1/\epsilon ^2)\) every \(q=O(1/\epsilon)\) iterations. It is worth mentioning that SREDA is a double loop algorithm, where the inner loop is to mainly update the dual variable and the estimators \(\mathbf {u}_{t+1}, \mathbf {v}_{t+1}\) with multiple iterations and the outer loop is to update the primal variable. This issue was addressed by Huang et al. [75], who propose a single-loop algorithm named AccMDA to enjoy a fast rate of \(O(1/\epsilon ^3)\) under the Lipschitz continuous assumption for the stochastic gradient. They use the STORM technique [31] to compute \(\mathbf {u}_{t+1}, \mathbf {v}_{t+1}\), i.e., \(\mathbf {u}_{t+1} = (1-\beta _t)\mathbf {u}_t + \beta \nabla _\mathbf {w}F(\mathbf {w}_t, \alpha _t, {\bf z}_t) - \beta _t\nabla _\mathbf {w}F(\mathbf {w}_{t-1}, \alpha _{t-1}, {\bf z}_t)\), similarly for \(\mathbf {v}_{t+1}\). It is notable that both SREDA and AccMDA require computing two (batch) stochastic gradients at each iteration. It is notable that AccMDA has a worse dependence on the strong concavity parameter than that in References [56, 100]. It is likely that by simply computing \(\mathbf {v}_{t+1}=\nabla _\alpha F(\mathbf {w}_t, \alpha _t, {\bf z}_t)\) in AccMDA, one should be able to improve the dependence on the strong concavity as in Reference [56].

Yang et al. [182] develop a single-loop algorithm for improving the convergence rate of non-convex min-max optimization under PL conditions. They consider a class of smooth non-convex non-concave problems, which satisfy both the dual-side PL condition (i.e., \(F(\mathbf {w}, \cdot)\) satisfies a PL condition for any \(\mathbf {w}\)) and the primal-side PL condition (i.e., \(F(\cdot , \alpha)\) satisfies a PL condition for any \(\alpha\)). They propose stochastic alternating gradient descent ascent algorithm (Stoc-AGDA) and establish a global convergence for a Lyapunov function \(F(\mathbf {w}_t) - F_* + \lambda (F(\mathbf {w}_t) - F(\mathbf {w}_t, \alpha _t))\) for a constant \(\lambda\) in the order of \(O(1/\epsilon)\), which directly implies the convergence for the primal objective gap in the same order. Their algorithm uses a polynomially decreasing or very small step sizes. It is notable that the complexity of Stoc-AGDA is worse than that of PES established in Reference [57] under similar PL conditions but requiring the strong concavity of the objective function in terms of the dual variable, which makes PES more appropriate to deep AUC maximization. Without the primal-side PL condition, Stoch-AGDA and Smoothed-AGDA are also analyzed under the dual-side PL condition with a better dependence on the condition number [183].

Improved Rates for the Offline (Finite-sum) Setting. . There are also multiple papers trying to improve the complexity of non-convex (strongly) concave min-max optimization in the finite-sum setting by leveraging the variance reduction techniques [107, 136, 182]. However, they usually require computing the gradient based on the full-batch or a large-batch that are less practical for deep learning with big data, see Table 6.

Table 6.

Method	Category	batch size	Sample Complexity	Oracle	Experiments for AUC Max.
SGDA [100]	Single-loop	\(O(1/\epsilon ^2)\)	\(O(1/\epsilon ^4)\)	General	No
PDAda [56]	Single-loop	\(O(1)\)	\(O(1/\epsilon ^4)\)	General	Yes
AccMDA [75]	Single-loop	\(O(1)\)	\(O(1/\epsilon ^3)\)	Lipschitz	No
Stoc-AGDA [182]	Single-loop	\(O(1)\)	\(O(1/(\mu ^2\epsilon))\)	General	No
PG-SMD [136]	ProximalPoint	\(O(1)\)	\(O(1/\epsilon ^4+n/\epsilon ^2)\)	General	No
Epoch-SGDA [179]	Proximal Point	\(O(1)\)	\(O(1/\epsilon ^4)\)	General	No
PPD-SG [101]	Proximal Point	\(O(1)\)	\(O(1/\epsilon ^4)\)	General	Yes
PPD-AdaGrad [101]	Proximal Point	\(O(1)\)	\(O(1/\epsilon ^4)\)	General	Yes
PES-\(\mathcal {A}\) [57]	Proximal Point	\(O(1)\)	\(O(1/\epsilon ^4)\)\(\sim\)\(O(1/(\mu \epsilon))\)	General or Lipchitz	Yes

Table 6. Comparison of Different Stochastic Methods for Solving Non-convex Strongly Concave Min-max Optimization

\(\epsilon\) denotes the target accuracy level for the objective gradient norm, i.e., \(\mathbb {E}[\Vert \nabla F(\mathbf {w})\Vert ]\le \epsilon\) or the primal objective gap, i.e., \(\mathbb {E}[F(\mathbf {w}) - F(\mathbf {w}_*)]\le \epsilon\). The column “oracle” denotes the condition on the stochastic gradient. \(\mu\) denotes a parameter in the assumed PL condition.

7.2 Deep Partial AUC Maximization

Deep pAUC maximization is challenging not only because of the non-differentiable selection operator but also due to non-convexity of the objective. Below, we discuss two classes of methods.

Naive Mini-batch Approach. . Kar et al. [85] propose mini-batch-based stochastic methods for pAUC maximization, which is applicable to deep learning. At each iteration, a gradient estimator is simply computed based on the pAUC surrogate function of the mini-batch data. However, this heuristic approach is not guaranteed to converge for minimizing the pAUC objective and its error scales as \(O(1/\sqrt {B})\), where \(B\) is the mini-batch size. Ueda and Fujino [159] consider partial AUC maximization for learning non-linear scoring functions, e.g., neural networks and probabilistic generative models. The paper claims to use the Adam optimizer [88] in Tensorflow for optimizing the partial AUC. However, it does not provide any discussion how the algorithm was implemented and what is the complexity and convergence of the optimization algorithm. We conjecture they use the naive mini-batch approach equipped with the Adam optimizer. For experiments, they have used an image dataset, namely, Hyper Suprime-Cam (HSC) dataset [116] with 487 real and 267,074 bogus optical transient objects collected with the HSC using the Subaru telescope.

Reduction Approaches. . The idea is to reduce the objective into different formulations (equivalent or approximate), which facilitate the design of large-scale optimization algorithms.

Recently, Yang et al. [187] consider optimizing two-way pAUC with FPR less than \(\beta\) and TPR larger than \(1-\alpha\). The paper focuses on simplifying the optimization problem that involves selection of top-ranked negative examples and bottom-ranked positive examples. They first formulate the problem into a bilevel optimization, where the upper-level objective function is a weighted average of pairwise surrogate loss, and the lower-level optimization problem is to compute the weights that accounts for selection of top-ranked negative examples and bottom-ranked positive examples. To address the computational challenge for solving the bilevel optimization problem, the authors propose to simplify the lower-level problem by relaxing the non-decoposable constraint on the decision variables into decomposable regularization. As a result, a simplified weighted pairwise loss minimization problem is derived, where the weights for each positive-negative pair is a product of two individual weights that are computed directly from the prediction scores of the positive and negative examples using a penalty function. Then any stochastic algorithms based on random positive-negative pairs can be employed for solving their formulation, e.g., SGD, Adam.

Zhu et al. [206] consider both pAUC maximization and two-way pAUC maximization. For pAUC maximization, they focus on that with FPR in a range \((0, \beta)\). They propose two formulations for pAUC maximization by leveraging distributionally robust optimization technique, and develop stochastic algorithms for optimizing both formulations for both pAUC and two-way pAUC. In particular, for pAUC they define a robust loss for each positive data by

\begin{align*} &\hat{L}_\phi (\mathbf {w}; {\bf x}_i)=\max _{\mathbf {p}\in \Delta _{n_-}} \sum _{{\bf x}_j\in \mathcal {S}_-} p_j\ell (f_\mathbf {w}({\bf x}_j) - f_\mathbf {w}({\bf x}_i)) - \lambda D_\phi (\mathbf {p}, 1/n_-), \end{align*}

where \(\Delta _{n_-}=\lbrace \mathbf {p}\in \mathbb {R}^{n_-}: \sum _j p_j =1, p_j\ge 0\rbrace\) is a simplex, \(D_\phi (\mathbf {p}, 1/(n_-)) = \frac{1}{n_-}\sum _i \phi (n_-p_i)\) is a divergence measure defined by a function \(\phi\). Then the following objective is used for one-way pAUC maximization:

\begin{align} \min _{\mathbf {w}}\frac{1}{n_+}\sum _{{\bf x}_i\in \mathcal {S}_+}\hat{L}_\phi (\mathbf {w}; {\bf x}_i). \end{align}

(21)

They consider two functions \(\phi\), i.e., the KL divergence \(\phi _{kl}(t) = t\log t - t + 1\), which gives \(D_\phi (\mathbf {p}, 1/n_-) =\sum _ip_i\log (n_-p_i)\), and the CVaR divergence \(\phi _{c}(t) =\mathbb {I}(0\lt t\le 1/\beta)\) with a parameter \(\beta \in (0,1)\), which gives \(D_\phi (\mathbf {p}, 1/n_-) = 0\) if \(p_i\le 1/(n_-\beta)\) and infinity otherwise. It is shown that if \(\ell (\cdot)\) is non-decreasing, when using \(\phi _{c}\) the objective (21) is equivalent to (10) for pAUC maximization, when using \(\phi _{kl}\) it gives a soft estimator of pAUC.

For solving Equation (21) with CVaR divergence, they formulate the problem as a weakly convex optimization problem by introducing another set of variables, i.e.,

\begin{align} \ \min _{\mathbf {w}}\min _{\mathbf {s}\in \mathbb {R}^{n_+}} F(\mathbf {w}, \mathbf {s})= \frac{1}{n_+}\sum _{{\bf x}_i\in \mathcal {S}_+} \left(s_i + \frac{1}{\beta } \frac{1}{n_-} \sum _{{\bf x}_j\in \mathcal {S}_-}(\ell (f_\mathbf {w}({\bf x}_j) - f_\mathbf {w}({\bf x}_i)) - s_i)_+\right). \end{align}

(22)

They develop an efficient stochastic algorithm named SOPA with a sample complexity of \(O(1/\epsilon ^4)\) for finding a nearly \(\epsilon\)-stationary point for \(F(\mathbf {w}, \mathbf {s})\).

For solving Equation (21) with KL divergence, they formulate the problem as a novel finite-sum coupled compositional optimization problem, i.e.,

\begin{align} \min _{\mathbf {w}}\frac{1}{n_+}\sum _{{\bf x}_i\sim \mathcal {S}_+}\lambda \log \mathbb {E}_{{\bf x}_j\in \mathcal {S}_-}\exp \left(\frac{\ell (f_\mathbf {w}({\bf x}_j) - f_\mathbf {w}({\bf x}_i))}{\lambda }\right)\!. \end{align}

(23)

A stochastic algorithm named SOPA-s is proposed for solving Equation (23) with a sample complexity of \(O(1/\epsilon ^4)\) for finding an \(\epsilon\)-level stationary point.

For two-way pAUC such that FPR is less than \(\beta\) and TPR is larger than \(\alpha\), the authors further define a new objective:

\begin{align} F(\mathbf {w}; \phi , \phi ^{\prime })=\max _{\mathbf {q}\in \Delta _{n_+}}\sum _{{\bf x}_i\in \mathcal {S}_+} q_i \hat{L}_\phi ({\bf x}_i, \mathbf {w}) - \lambda ^{\prime } D_{\phi ^{\prime }}\left(\mathbf {q}, \frac{1}{n_+}\right)\!. \end{align}

(24)

They prove that when \(\phi (t)=\mathbb {I}(0\lt t\le 1/\beta)\) and \(\phi ^{\prime }(t)=\mathbb {I}(0\lt t\le 1/\alpha)\) the above objective is equivalent to Equation (11) if \(\ell (\cdot)\) is non-decreasing. The authors develop two algorithms for solving the above objective with CVaR divergence and KL divergence, respectively, and establish their convergence. A sample complexity of \(O(1/\epsilon ^4)\) is established for the algorithm that optimizes the above objective with the KL-divergence to find a \(\epsilon\)-level stationary point. For optimizing the above objective with CVaR divergence, the sample complexity of their algorithm is \(O(1/\epsilon ^6)\). This is the first time that stochastic algorithms are developed for optimizing two-way pAUC for deep learning with convergence guarantee.

A concurrent work by Yao et al. [190] focuses on optimizing pAUC such that FPR is in a range \((\alpha , \beta)\). When \(\ell (\cdot)\) is a non-decreasing function, they formulate Equation (21) as non-smooth different-of-convex problems:

\begin{align} F(\mathbf {w}) = \frac{1}{n_+}\sum _{{\bf x}_i\in \mathcal {S}_+} (F(\mathbf {w}; {\bf x}_i, k_2) - F(\mathbf {w}; {\bf x}_i, k_1)), \end{align}

(25)

where

\begin{align*} F(\mathbf {w}; {\bf x}_i, k)= \frac{1}{n_-}\sum _{{\bf x}_j\in \mathcal {S}^\downarrow _-[1, k]}\ell (f_\mathbf {w}({\bf x}_j) - f_\mathbf {w}({\bf x}_i)) = \min _{\lambda } \frac{\lambda k}{n_-} + \frac{1}{n_-}\sum _{{\bf x}_j\in \mathcal {S}_-} (\ell (f_\mathbf {w}({\bf x}_j) - f_\mathbf {w}({\bf x}_i)) -\lambda)_+. \end{align*}

They develop an efficient approximated gradient descent method based on the Moreau envelope smoothing technique, inspired by recent advances in non-smooth DC optimization [155]. To increase the efficiency of large data processing, they use an efficient stochastic block coordinate update for solving each sub-problem inexactly. A sample complexity of \(O(1/\epsilon ^6)\) is established for their algorithm to find a nearly \(\epsilon\)-stationary solution.

7.3 Applications of Deep AUC Maximization (DAM)

Due to the success of deep learning in various applications, DAM has also been applied to different domains with demonstrated success. Below, we review some applications of DAM.

Medical Image Classification. . Sulam et al. [154] consider the classification of breast cancer based on imbalanced mammogram images. They learn a deep convolutional neural network by AUC maximization by using the online buffered gradient method proposed by Zhao et al [201]. Nevertheless, the issue of this approach is that it cannot scale to large datasets, as it requires a large buffer to store positive and negative samples at each iteration for computing an approximate AUC score. As a result, they only consider small-scale datasets. In particular, two datasets are used. The first one, named IMG, is a proprietary mammogram dataset comprising 796 patients, 80 of them defined as positive (164 images), and 716 negative (1,869 images) with both Cranial-Caudal (CC) and Mediolateral-Oblique (MLO) views, belonging to normal patients as well as benign findings. The second dataset is the public INbreast dataset, which consists of 115 cases with 410 images.

Yuan et al. [196] are the first to evaluate the performance of DAM on large-scale medical image data with hundreds of thousands of images for learning modern deep neural networks (e.g., ResNets, DenseNets). They propose a new minimax objective as in Equation (16) for robust AUC maximization to alleviate the issues of the square loss, namely, the sensitivity to noisy data and the adverse effect on easy data. The new objective is shown to be more robust than the commonly used square loss, while enjoying the same advantage in terms of large-scale stochastic optimization. The authors employ PESG [57] for solving the minimax objective. They conduct extensive empirical studies of DAM on four difficult medical image classification tasks discussed below.

•

CheXpert Competition. CheXpert is a large-scale chest X-ray dataset for detecting chest and lung diseases, which is released through a medical AI competition [79]. The training data consists of 224,316 high-quality X-ray images from 65,240 patients with frontal and lateral views. The validation dataset consists of 234 images from 200 patients. The testing data has images for 500 patients, which is not released to the public. The model is only evaluated for predicting five selected diseases, i.e., Cardiomegaly, Edema, Consolidation, Atelectasis, Pleural Effusion, which have an average imbalance ratio (i.e., the proportion of positive examples) of 20.21% in the training set. AUC and NRBC are used for evaluation, where NRBC refers a number of radiologists out of 3 are beaten by AI algorithms. Yuan et al. [196] achieved first place using their DAM method in this competition in August 2020. Compared with the standard deep learning approach that minimizes the cross-entropy loss, their DAM method achieves 2% improvements.

•

SIIM-ISIC Melanoma Classification. Melanoma is a skin cancer and is the major cause of skin cancer death [115]. Kaggle hold a competition for melanoma classification in 2020. The dataset consists of 33,126 training images with 584 malignant melanoma images and 10,892 testing images with an unknown number of malignant melanoma images. The testing set is split into a validation set with 30% images and the final testing set with 70% images. The raw images have high resolutions, e.g., 6,000 \(\times\) 4,000. Yuan et al. [196] demonstrate the performance DAM for Melanoma Classification. They resize the the image into lower resolutions, e.g., 384 \(\times\) 384 and also use additional 12,859 images from previous competitions in their experiments. Their method achieves the 33rd place out 3,314 teams for the competition by ensemble over 10 models. A simple ensemble of a DAM model and a standard model learned by optimizing the cross-entropy loss beats the winning team by combining 18 models [196]. Compared with the standard deep learning approach that minimizes the cross-entropy loss, the DAM method achieves 1% improvements.

•

Breast-Cancer Screening. For this task, they use the DDSM+ data, which is a combination of two datasets, namely, DDSM and CBIS-DDSM [17, 67]. The dataset consists of 55,000 mammographic images (224 \(\times\) 224) taken at lower doses than usual X-rays for training with an imbalance ratio of 13% and 13,900 images for testing with an imbalance ratio of 4%. Compared with the standard deep learning approach that minimizes the cross-entropy loss, the DAM method achieves 1.5% improvements.

•

Lymph Node Tumor Detection. They use the PathCamelyon dataset for this task, which consists of 294,912 color images (96 \(\times\) 96) extracted from histopathologic scans of lymph node section for training and 32,768 images for testing with balanced class ratio [8, 160]. The authors manually construct an imbalanced dataset with an imbalance ratio of 1% for their experiments. Compared with the standard deep learning approach that minimizes the cross-entropy loss, the DAM method achieves 5% improvements.

Recently, Reference [66] investigated DAM for COVID-19 Chest X-ray Classification. Covid-19 is a global pandemic that broke out in 2020. Early detection of Covid-19 is crucial to contain the spread of the virus and is helpful for providing early treatment for the patients with Covid-19. The authors use the dataset (COVIDx8B), which consists of 15,952 chest X-ray images for training with 13.5% Covid-19 positive samples and 400 images for testing with balanced positive and negative samples. The authors use self-supervised training method discussed below for learning a backbone network and then use the LibAUC library [196] for fine-tuning the network with significant improvements observed over the baseline method.

Molecular Property Predictions. . Molecular property prediction is one of the key tasks in cheminformatics and has applications in many fields, including quantum mechanics, physical chemistry, biophysics, and physiology. Multiple molecular datasets have been released, e.g., MoleculeNet benchmark datasets (e.g., PCBA, HIV, MUV, Tox21, ToxCast) [174], MIT AICURES Challenge dataset,² Stanford OGB benchmark datasets (e.g., OGBG-molhiv, OGBG-molpcba) [73]. Recently, Reference [171] has employed the LibAUC library [196] for solving the molecular property prediction and achieved the first place at MIT AICURES Challenge. Several research groups have also used LibAUC library for improving the performance on the OGBG-molhiv dataset.³ The authors of Reference [206] also consider pAUC maximization on some of these molecular datasets and compare different methods for pAUC maximization.

Fraud/Outlier Detection . Identifying outliers in data is referred to as outlier or anomaly detection. It can be regarded as an extremely imbalanced classification problem where the object of interest (anomaly/outlier) is the minority class. AUC maximization can naturally be applied for outlier or anomaly detection. In Reference [36], multiple online AUC maximization algorithms, including OAM [201] and OPAUC [47], are applied to the benchmark datasets for outlier/anomaly detection such as Webspam [163], Sensor Faults [114], and Malware App [205]. The performance of different stochastic AUC maximization algorithms is compared in References [97, 122] for anomaly detection. In Reference [76], the authors consider AUC maximization for fraud detection on a graph. They consider learning both the parameters of a graph neural network (GNN) and a policy of edge pruning that affects prediction outputs of the GNN. For learning the parameters of GNN, they employ the PPD-SG algorithm [101] to solve the saddle point formulation. The learning of the edge pruner is formulated as a reinfocement learning problem and a classic policy gradient is used.

Other Applications. . Besides medical image classification, molecular property prediction, and fraud/outlier detection, AUC/pAUC maximization have been investigated in other applications, which do not necessarily involve deep learning. Examples include points of interest recommendation [61], credit scoring for financial institutions [99], time series classification for medicine, manufacturing, and maintenance [177], protein disorder prediction [166], pedestrian detection [133], differentiated gene detection [104], and discovery of motifs [208].

Benchmark and Library. . We present a summary of benchmark datasets in Table 7 for DAM and include references that provide benchmark results of deep AUC maximization and deep pAUC maximization. The author T. Yang’s group has developed an open-source library for DAM called LibAUC,⁴ which implements a set of efficient stochastic algorithms for deep AUC maximization, deep one-way pAUC maximization, and deep two-way pAUC maximization.

Table 7.

Dataset	Type	Order of Data Size	Imbalance	Source	References for DAM	Benchmark Result
STL10	Natural Image	\(10^3\)	Artificial	[27]	[196, 207]	-
CIFAR10	Natural Image	\(10^4\)	Artificial	[91]	[196, 207]	-
CIFAR100	Natural Image	\(10^4\)	Artificial	[91]	[196, 207]	-
Cat vs Dog	Natural Image	\(10^4\)	Artificial	[41]	[196, 207]	-
Melanoma	Skin Lesion Image	\(10^4\)	Natural	[1]	[196, 207]	0.9505
CheXpert	Chest X-ray Image	\(10^5\)	Natural	[79]	[196, 207]	0.9305
DDSM+	Mammographic Image	\(10^4\)	Natural	[17, 67]	[196]	0.9544
PatchCamelyon	Microscopic Image	\(10^5\)	Artificial	[8, 160]	[196]	-
MoleculeNet/HIV	Molecular Graph	\(10^4\)	Natural	[174]	[207]	0.770
MoleculeNet/PCBA	Molecular Graph	\(10^5\)	Natural	[174]	-	-
MoleculeNet/MUV	Molecular Graph	\(10^4\)	Natural	[174]	[207]	0.644
MoleculeNet/Tox21	Molecular Graph	\(10^3\)	Natural	[174]	[207]	-
MoleculeNet/ToxCast	Molecular Graph	\(10^3\)	Natural	[174]	[207]	-
ogbg-molhiv	Molecular Graph	\(10^4\)	Natural	[72]	-	-
ogbg-molpcba	Molecular Graph	\(10^5\)	Natural	[72]	[71]	0.8406
ogbg-molmuv	Molecular Graph	\(10^4\)	Natural	[72]	[206]	-
ogbg-moltox21	Molecular Graph	\(10^3\)	Natural	[72]	[206]	-
ogbg-moltoxcast	Molecular Graph	\(10^3\)	Natural	[72]	[206]	-

Table 7. Benchmark Datasets for Deep AUC and pAUC Maximization

The differences between the MoleculeNet and OGBG molecular graph datasets lie at the way of training/testing split. The numbers in the last column are the best results on testing data reported in the referred papers for AUC maximization. We do not include the results for partial AUC maximization, as it requires setting a threshold for FPR/TPR. We also do not present the results on datasets with manual construction of imbalanced data denoted by “Artificial” for the “Imbalance” column.

7.4 Summary

The research of deep AUC maximization springs from the studies for solving the non-convex min-max problems. The applicability in deep AUC maximization motivates a wave of studies for algorithmic design and theoretical analysis of non-convex strongly concave min-max problems. New formulations for AUC maximization and partial AUC maximization are also developed, for which efficient stochastic algorithms are proposed. The algorithms are then employed for solving real-world applications with great success, e.g., medical image classification and molecular property prediction. However, there are still many challenges regarding deep (partial) AUC maximization to be addressed, which will be discussed in the next section.

8 Other Issues for DAM and Outlook for Future Work

Below, we discuss five remaining or emerging issues for DAM.

Large-scale Stochastic Optimization. . Although large-scale optimization algorithms for DAM have been developed, there are still many open problems to be addressed. Below, we will list several important questions. (i) How to further improve the algorithms and theories for solving the composite objectives of AUC and for solving pAUC objectives? Zhu et al. [207] employ stochastic compositional algorithms for solving the composite objectives (18). There is still much room for improving the optimization for solving pAUC objectives, e.g., Equations (22)\(\sim\)(25), or new formulations of pAUC. (ii) How to optimize AUC in the federated learning setting? Although federated deep AUC maximization for the minimax objective has been considered in References [55, 195], federated learning algorithms for optimizing other objectives remains to be developed. In particular, the objectives (22)\(\sim\)(25) for pAUC maximization are much more challenging to be optimized in the federated learning setting.

Network Structures. . Standard deep neural networks have been used for DAM in different applications. For example, deep convolutional neural networks such as VGG, ResNets, DenseNets, EfficientNets have been used for medical image classification [66, 154, 196]. Graph neural networks, e.g., graph isomorphism network (GIN) [74, 175], Message-Passing Neural Network (MPNN) [50, 171], have been used for molecular property prediction [171, 206]. It remains to be explored by using more advanced network structures, e.g., vision transformer [38] for medical image classification tasks in the context of DAM. Another interesting direction is to explore neural architecture search (NAS) [93] in the context of DAM. A natural question is if we use AUC as a performance measure of NAS, how would the found network be different from standard approaches that use accuracy as a performance measure?

Regularization and Normalization. . Regularization is an important technique for improving generalization. A standard regularization technique is to use weight decay [51]. It was shown to be effective for DAM as well [207]. Nevertheless, more algorithmic regularization techniques should be considered for DAM, including explicit and implicit regularization. Recently, Zhu et al. [207] demonstrate that the explicit regularization by adding the quadratic term in the proximal point methods discussed in Section 7.1 is helpful for improving the generalization. Implicit regularization by gradient-based methods [128] for DAM is an interesting question to be explored. Normalization is another key technique for improving the training of deep neural networks, e.g., batch normalization [78], layer normalization [5], and so on. Of particular interest to DAM is the normalization in the output layer. Yuan et al. [196] have used a batch score normalization layer, which normalizes the non-activated scores in the mini-batch such that the \(\ell _2\) norm of scores in the mini-batch is one. This is found to be better than not using any activation or normalization. Zhu et al. [207] show that using a sigmoid activation in the last layer also yields better performance than not using any activation or normalization. They also demonstrate that using the sigmoid activation is competitive with the batch score normalization, and they are better than the standard batch normalization.

Data Sampling and Augmentation. . The standard data sampling for deep learning is to use data shuffling over all examples. However, different data sampling strategies have been considered for imbalanced data, e.g., oversampling and undersampling [82]. Recently, Zhu et al. [207] demonstrate via empirical studies that oversampling for the minority class is helpful for improving the generalization performance of DAM. However, how can we incorporate more advanced oversampling or data augmentation techniques into DAM remains an interesting topic. One might consider synthetic oversampling method SMOTE [23] and MIXUP [197] for DAM.

Feature Learning. . Feature learning is an important capability of deep learning for tackling un-structured data. It is shown that directly optimizing the AUC loss from scratch does not necessarily yield better feature representations [194]. A practice of DAM uses a two-stage approach: The first stage is to learn the encoder network by optimizing the traditional cross-entropy loss, and the second stage is to fine-tune the encoder network and to learn the classifier by DAM [196]. It is still not fully understood why optimizing the AUC loss in an end-to-end fashion does not yield better feature representations, and it remains an open problem how to learn better encoder networks by using DAM. Recently, Yuan et al. [194] proposed an end-to-end training method called compositional training for DAM. The idea is to solve a compositional objective \(L_{\text{AUC}}(\mathbf {w}- \alpha \nabla L_{\text{CE}}(\mathbf {w}))\), where \(L_{\text{AUC}}\) denotes an AUC loss and \(L_{\text{CE}}\) denotes a standard cross-entropy loss. It is shown that the compositional training method for DAM yields much better feature representations than optimizing either the CE loss or the AUC loss from scratch. It remains an open problem how to understand this method theoretically. Another direction is to consider self-supervised pre-training methods on large-scale unlabeled medical datasets. This approach was recently explored in References [4, 151, 199]. The success on downstream tasks of using DAM has also been demonstrated for detecting COVID-19 based on X-ray images [66]. Nevertheless, we could consider pre-training on much larger medical datasets than those used in existing studies and demonstrate the performance of DAM on multiple downstream medical image classification tasks.

Learning fair and interpretable AI models. . Building trustworthy AI is important for many domains, e.g., healthcare, in particular medical image classification. Two issues are of foremost importance, namely, fairness [7, 10, 20, 113] and interpretability [3, 24, 64]. Although these issues have received tremendous attention in the literature for medical image classification [26, 145, 146], developing fair and interpretable DAM methods remains to be explored. Some outstanding questions and work include (i) how to develop scalable in-processing algorithms for optimizing AUC under AUC-based fairness constraints [12, 84]; (ii) how to develop scalable and interpretable DAM methods; (iii) evaluating these fairness-aware and interpretable AUC optimization methods on large-scale medical image datasets.

Out-of-distribution Robustness . An emerging issue in machine learning that has attracted great attention is how to tackle the distributional shifts, i.e., the distribution of testing data differs from that of training data. While this issue has been investigated for traditional risk minimization, it has been rarely explored for AUC maximization. Given that AUC maximization is more aggressive in pushing positive examples ranked above negative examples [196], it might cause more severe performance degradation in the presence of distributional shifts. Different types of distributional shifts have been studied, e.g., domain generalization, subpopulation shift, covariate shift, concept drift, and so on. Accordingly, various benchmark datasets following different distributional shifts have been curated [54, 72, 89, 191]. Many of these datasets use AUROC as the performance measure. It remains an open problem how robust are existing DAM methods in the presence of distributional shifts and how to make them more robust.

Finally, we would like to point out that the above list of issues is not complete. There must be some other issues related to DAM or in the context of DAM to be addressed in the future. While this issue has been studied for traditional risk optimization, it has been rarely explored for AUC maximization.

9 Conclusions

In this article, we have presented a comprehensive survey of AUC maximization methods in the past 20 years with a focus on recent research and development of stochastic AUC maximization and deep AUC maximization. We have compared different methods from different perspectives, e.g., formulations, per-iteration complexity, sample complexities, optimization error, statistical error, empirical performance, and so on. We also discuss remaining and emerging issues in deep AUC maximization and provide suggestions of topics for future work.

Acknowledgments

We thank the editors and anonymous reviewers for their constructive comments.

Footnotes

In their papers, \(\alpha (\mathbf {w})\) is given by \(b(\mathbf {w})-a(\mathbf {w})\) due to a variable change.

https://www.aicures.mit.edu/tasks.

https://ogb.stanford.edu/docs/leader_graphprop/.

⁴

www.libauc.org.

References

[1]

Kaggle Competition: SIIM-ISIC Melanoma Classification. Retrieved from https://www.kaggle.com/c/siim-isic-melanoma-clas sification.

Abstract

1 Introduction

2 Background

2.1 Definitions of AUC, Partial AUC, Two-way Partial AUC

2.2 Non-Parametric Estimators

3 Surrogate Objectives for AUC Maximization

4 Full Batch Based Methods - the First Age

4.1 AUC Maximization

4.2 Partial AUC Maximization

4.3 Summary

5 Online AUC Maximization - the Second Age

5.1 Online Buffered Gradient Descent for AUC Maximization

5.2 Online Statistics-based Methods

5.3 Online Non-linear Methods for AUC Maximization

5.4 Adaptive Online AUC Maximization

5.5 Summary

6 Stochastic AUC Maximization - the Third Age

6.1 Stochastic Batch-based Pairwise Methods

6.2 Stochastic Primal-Dual (PD) Methods

6.3 Summary

7 Deep AUC Maximization (DAM): the Fourth Age

7.1 Non-convex Concave Min-Max Optimization

7.2 Deep Partial AUC Maximization

7.3 Applications of Deep AUC Maximization (DAM)

7.4 Summary

8 Other Issues for DAM and Outlook for Future Work

9 Conclusions

Acknowledgments

Footnotes

References

Cited By

Index Terms

Recommendations

Evolving neural networks with maximum AUC for imbalanced data classification

A Brief Survey on Big Data in Healthcare

A survey on blockchain for big data: Approaches, opportunities, and future directions

Comments

Information

Published In

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Funding Sources

Contributors

Other Metrics

Bibliometrics

Article Metrics

Other Metrics

Citations

Cited By

View options

PDF

eReader

HTML Format

Login options

Full Access

Figures

Other

Share

Share this Publication link

Share on social media

Affiliations