
1 Introduction

Fig. 1.

The motivation of our method: evolving from separately performing subspace learning and classification, to joint learning, and further to joint \(\&\) progressive learning. The subspaces learned by our model show a higher feature discriminative ability, as indicated by the green bottom line. (Color figure online)

High-dimensional data are often characterized by very rich and diverse information, which enables us to classify or recognize targets more effectively and analyze data attributes more easily, but inevitably introduces some drawbacks (e.g. information redundancy, complex noise effects, high storage consumption, etc.) due to the curse of dimensionality. A general way to address this problem is to learn a low-dimensional and highly discriminative feature representation, commonly called dimensionality reduction or subspace learning. In the past decades, a large number of subspace learning techniques have been developed in the machine learning community, with successful applications to biometrics [5, 9, 10, 20], image/video analysis [26], visualization [22], and hyperspectral data analysis (e.g., dimensionality reduction and unmixing) [12,13,14]. These subspace learning techniques are generally categorized into linear and nonlinear methods. Theoretically, nonlinear approaches are capable of capturing the data structure more effectively. However, they provide no explicit mapping function (poor explainability), it is relatively hard to embed out-of-sample data into the learned subspace (weak generalization), and they incur a high computational cost (lack of cost-effectiveness). Additionally, for a multi-label classification task, the classic subspace learning techniques, such as principal component analysis (PCA) [29], linear discriminant analysis (LDA) [20], local Fisher discriminant analysis (LFDA) [23], manifold learning (e.g. Laplacian eigenmaps (LE) [1], locally linear embedding (LLE) [21]) and their linearized methods (e.g. locality preserving projection (LPP) [6], neighborhood preserving embedding (NPE) [4]), are commonly applied as a disjoint feature learning step before classification. Their main limitation lies in the weak connection between the features learned by subspace learning and the label space (see the top panel of Fig. 1); it remains unknown which learned features (or subspace) can improve the classification.

Recently, a feasible solution to the above problems has been generalized as a joint learning framework [17] that simultaneously considers linearized subspace learning and classification, as illustrated in the middle panel of Fig. 1. Following it, more advanced methods have been proposed and applied in various fields, including supervised dimensionality reduction (e.g. least-squares dimensionality reduction (LSDR) [24] and its variants, such as the least-squares quadratic mutual information derivative (LSQMID) [25]), multi-modal data matching and retrieval [27, 28], and heterogeneous feature learning for activity recognition [15, 16]. In these works, the learned features (or subspace) and the label information are effectively connected by regression techniques (e.g. linear regression) to adaptively estimate a latent and discriminative subspace. Despite this, they still fail to find an optimal subspace, as a single linear projection is hardly enough to represent the complex transformation from the original data space to the potentially optimal subspace.

Motivated by the aforementioned studies, we propose a novel joint and progressive learning strategy (J-Play) to linearly find an optimal subspace for general multi-label classification, as illustrated in the bottom panel of Fig. 1. We extend the existing joint learning framework by learning a series of subspaces instead of a single subspace, aiming at progressively converting the original data space to a potentially optimal subspace through multi-coupled intermediate transformations [18]. Theoretically, by increasing the number of subspaces, the variations between coupled subspaces are gradually narrowed down to a small range that can be represented effectively by a linear transformation. This makes it easier to find a good solution, especially when the model is complex and non-convex. We also contribute to structure learning in each latent subspace by locally embedding the manifold structure.

The main highlights of our work can be summarized as follows:

  • A linearized progressive learning strategy is proposed to describe the variations from the original data space to a potentially optimal subspace, tending to find a better solution. A joint learning framework that simultaneously estimates the subspace projections (connecting the original space and the latent subspaces) and a property-labeled projection (connecting the learned latent subspaces and the label space) is considered to find a discriminative subspace where samples are expected to be better classified.

  • Structure learning with local manifold regularization is performed in each latent subspace.

  • Based on the above techniques, a novel joint and progressive learning strategy (J-Play) is developed for multi-label classification.

  • An iterative optimization algorithm based on the alternating direction method of multipliers (ADMM) is designed to solve the proposed model.

2 Joint and Progressive Learning Strategy (J-Play)

2.1 Notations

Let \(\mathbf {X}=[\mathbf {x}_{1},...,\mathbf {x}_{k},...,\mathbf {x}_{N}]\in \mathbb {R}^{d_{0}\times N}\) be a data matrix with \(d_{0}\) dimensions and N samples, and let the matrix of corresponding class labels be \(\mathbf {Y} \in \{0,1\}^{L \times N}\). The kth column of \(\mathbf {Y}\) is \(\mathbf {y}_{k}=[\mathbf {y}_{k1},...,\mathbf {y}_{kt},...,\mathbf {y}_{kL}]^{T}\in \mathbb {R}^{L\times 1}\), whose elements are defined as follows:

$$\begin{aligned} \mathbf {y}_{kt}= {\left\{ \begin{array}{ll} \begin{aligned} 1, \quad &{} \text {if } \mathbf {y}_{k} \text { belongs to the } t\hbox {-th class;}\\ 0, \quad &{} \text {otherwise.} \end{aligned} \end{array}\right. } \end{aligned}$$
(1)
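As a concrete illustration of Eq. (1), the short sketch below (a minimal NumPy example; the function name and the integer label encoding are our own assumptions, not from the original text) builds the binary label matrix \(\mathbf {Y}\) from integer class indices.

```python
import numpy as np

def build_label_matrix(class_indices, num_classes):
    """Build the binary label matrix Y of Eq. (1).

    class_indices : length-N array with entries in {1, ..., L}
    Returns Y of shape (L, N), where Y[t-1, k] = 1 iff sample k belongs to class t.
    """
    class_indices = np.asarray(class_indices)
    Y = np.zeros((num_classes, class_indices.size))
    Y[class_indices - 1, np.arange(class_indices.size)] = 1
    return Y

# e.g. three samples with labels 2, 1, 3 among L = 3 classes
Y = build_label_matrix([2, 1, 3], num_classes=3)
```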

In our task, we aim to learn a set of coupled projections \(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}\in \mathbb {R}^{d_{l}\times d_{l-1}}\) and a property-labeled projection \(\mathbf {P} \in \mathbb {R}^{L\times d_{m}}\), where m stands for the number of subspace projections and \(\{d_{l}\}_{l=1}^{m}\) are defined as the dimensions of those latent subspaces respectively, while \(d_{0}\) is specified as the dimension of \(\mathbf {X}\).

2.2 Basic Framework of J-Play from the View of Subspace Learning

Subspace learning aims to find a low-dimensional space in which certain properties of the original data are maximized, e.g. variance (PCA), discriminative ability (LDA), or graph structure (manifold learning). Yan et al. [30] summarized these subspace learning methods in a general graph embedding framework.

Given an undirected similarity graph \(G=\{\mathbf {X},\mathbf {W}\}\) with vertices \(\mathbf {X}=\{\mathbf {x}_{1},...,\mathbf {x}_{N}\}\) and adjacency matrix \(\mathbf {W}\in \mathbb {R}^{N\times N}\), we can intuitively measure the similarities among the data. By preserving these similarity relationships, the high-dimensional data can be well embedded into a low-dimensional space. Denoting the low-dimensional data representation as \(\mathbf {Z}\in \mathbb {R}^{d\times N}\) (\(d\ll d_{0}\)), this can be formulated as

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {Z}}{{\mathrm{tr}}}(\mathbf {Z}\mathbf {L}\mathbf {Z}^{{{\mathrm{T}}}}), \quad \mathrm {s.t.} \quad \mathbf {Z}\mathbf {D}\mathbf {Z}^{{{\mathrm{T}}}}=\mathbf {I}, \end{aligned} \end{aligned}$$
(2)

where \(\mathbf {D}_{ii}=\sum _{j}\mathbf {W}_{ij}\) is a diagonal matrix, \(\mathbf {L}\) is the Laplacian matrix defined by \(\mathbf {L}=\mathbf {D}-\mathbf {W}\) [3], and \(\mathbf {I}\) is the identity matrix. In our case, we aim at learning multi-coupled linear projections to find an optimal mapping; therefore, a linearized subspace learning problem can be reformulated on the basis of Eq. (2) by substituting \(\mathbf {\Theta }\mathbf {X}\) for \(\mathbf {Z}\):

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {\Theta }}{{\mathrm{tr}}}(\mathbf {\Theta }\mathbf {X}\mathbf {L}\mathbf {X}^{{{\mathrm{T}}}}\mathbf {\Theta }^{{{\mathrm{T}}}}), \quad \mathrm {s.t.} \quad \mathbf {\Theta }\mathbf {X}\mathbf {D}\mathbf {X}^{{{\mathrm{T}}}}\mathbf {\Theta }^{{{\mathrm{T}}}}=\mathbf {I}, \end{aligned} \end{aligned}$$
(3)

which can be solved by generalized eigenvalue decomposition.
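For reference, a minimal sketch of how Eq. (3) can be solved via a generalized eigenvalue decomposition is given below (NumPy/SciPy; the small ridge added to the constraint matrix is our own numerical safeguard, not part of the formulation).

```python
import numpy as np
from scipy.linalg import eigh

def linearized_graph_embedding(X, W, d):
    """Solve Eq. (3): min tr(Theta X L X^T Theta^T) s.t. Theta X D X^T Theta^T = I.

    X : (d0, N) data matrix, W : (N, N) adjacency matrix, d : subspace dimension.
    Returns Theta of shape (d, d0).
    """
    D = np.diag(W.sum(axis=1))
    L = D - W                                    # graph Laplacian L = D - W
    A = X @ L @ X.T
    B = X @ D @ X.T + 1e-8 * np.eye(X.shape[0])  # small ridge for numerical stability (assumption)
    # generalized eigenproblem A v = lambda B v; the d smallest eigenvectors form Theta
    _, eigvecs = eigh(A, B)
    return eigvecs[:, :d].T
```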

Different from the previously mentioned subspace learning methods, a regression-based joint learning model [17] can explicitly bridge the learned latent subspace and the labels, which can be formulated in a general form:

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {P},\mathbf {\Theta }}\frac{1}{2}\mathbf {E}(\mathbf {P},\mathbf {\Theta })+\frac{\beta }{2}\mathbf {\Phi }(\mathbf {\Theta })+\frac{\gamma }{2}\mathbf {\Psi }(\mathbf {P}), \end{aligned} \end{aligned}$$
(4)

where \(\mathbf {E}(\mathbf {P},\mathbf {\Theta })\) is the error term defined as \(||\mathbf {Y}-\mathbf {P}\mathbf {\Theta }\mathbf {X}||_{{{\mathrm{F}}}}^{2}\), \(||\bullet ||_{{{\mathrm{F}}}}\) denotes the Frobenius norm, and \(\beta \) and \(\gamma \) are the corresponding penalty parameters. \(\mathbf {\Phi }\) and \(\mathbf {\Psi }\) denote regularization functions, which might be the \(l_{1}\) norm, \(l_{2}\) norm, \(l_{2,1}\) norm, or manifold regularization. Herein, the variable \(\mathbf {\Theta }\) is called the intermediate transformation, and the corresponding subspace generated by \(\mathbf {\Theta }\) is called the latent subspace, where the features can be further structurally learned and represented in a more suitable way [16].

Fig. 2.

The illustration of the proposed J-Play framework.

On the basis of Eq. (4), we further extend the framework by following a progressive learning strategy:

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m}}\frac{1}{2}\mathbf {E}(\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m})+\frac{\beta }{2}\mathbf {\Phi }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m})+\frac{\gamma }{2}\mathbf {\Psi }(\mathbf {P}), \end{aligned} \end{aligned}$$
(5)

where \(\mathbf {E}(\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m})\) is specified as \(||\mathbf {Y}-\mathbf {P}\mathbf {\Theta }_{m}...\mathbf {\Theta }_{l}...\mathbf {\Theta }_{1}\mathbf {X}||_{{{\mathrm{F}}}}^{2}\) and \(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}\) represent a set of intermediate transformations.
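To make the progressive error term concrete, a minimal sketch of evaluating the loss in Eq. (5) for a given list of projections (NumPy; shapes and names are illustrative assumptions) is:

```python
import numpy as np
from functools import reduce

def prediction_loss(Y, P, Thetas, X):
    """||Y - P Theta_m ... Theta_1 X||_F^2 for Thetas = [Theta_1, ..., Theta_m]."""
    Z = reduce(lambda Z, Theta: Theta @ Z, Thetas, X)  # map X progressively through all subspaces
    return np.linalg.norm(Y - P @ Z, 'fro') ** 2
```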

2.3 Problem Formulation

Following the general framework given in Eq. (5), the proposed J-Play can be formulated as the following constrained optimization problem:

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m}}&\frac{1}{2}\mathbf {\Upsilon }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}) +\frac{\alpha }{2}\mathbf {E}(\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m})+\frac{\beta }{2}\mathbf {\Phi }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m})+\frac{\gamma }{2}\mathbf {\Psi }(\mathbf {P})\\&\mathrm {s.t.} \quad \mathbf {X}_{l}=\mathbf {\Theta }_{l}\mathbf {X}_{l-1}, \quad \mathbf {X}_{l}\succeq 0, \quad ||\mathbf {x}_{lk}||_{2} \preceq 1, \quad \forall l=1,2,...,m, \end{aligned} \end{aligned}$$
(6)

where \(\mathbf {X}\) is assigned to \(\mathbf {X}_{0}\), and \(\alpha \), \(\beta \), and \(\gamma \) are three penalty parameters that balance the importance of the corresponding terms. Figure 2 illustrates the J-Play framework. Since Eq. (6) is an ill-posed problem, reasonable assumptions or priors need to be introduced to search for a solution effectively in a narrowed range. More specifically, we cast Eq. (6) as a least-squares regression problem with a reconstruction loss term (\(\mathbf {\Upsilon }(\bullet )\)), a prediction loss term (\(\mathbf {E}(\bullet )\)), and two regularization terms (\(\mathbf {\Phi }(\bullet )\) and \(\mathbf {\Psi }(\bullet )\)). We detail these terms one by one as follows.

(1) Reconstruction Loss Term \(\mathbf {\Upsilon }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m})\): Without any constraint or prior, directly estimating the multi-coupled projections in J-Play becomes increasingly difficult as the number of estimated projections grows. This can be explained by vanishing gradients between neighboring variables during optimization: the updates of neighboring projections become tiny or even zero. In particular, when the number of projections increases beyond a certain extent, most of the learned projections tend to zero and become meaningless. To this end, we adopt an autoencoder-like scheme that requires each learned subspace to be projected back to the preceding space as accurately as possible. The benefits of this scheme are, on one hand, to prevent over-fitting to some extent, in particular avoiding fitting excessive noise; on the other hand, to establish an effective link between the original space and the subspace, making the learned subspace more meaningful. The resulting expression is

$$\begin{aligned} \begin{aligned} \mathbf {\Upsilon }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m})=\sum \nolimits _{l=1}^{m}||\mathbf {X}_{l-1}-\mathbf {\Theta }_{l}^{T}\mathbf {\Theta }_{l}\mathbf {X}_{l-1}||_{{{\mathrm{F}}}}^{2}. \end{aligned} \end{aligned}$$
(7)

In our case, to fully utilize the advantages of this term, we apply it in each latent subspace, as shown in Eq. (7).
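A minimal sketch of the reconstruction loss in Eq. (7), accumulated over all latent subspaces (NumPy; illustrative only):

```python
import numpy as np

def reconstruction_loss(Thetas, X):
    """Sum of ||X_{l-1} - Theta_l^T Theta_l X_{l-1}||_F^2 over l = 1, ..., m (Eq. (7))."""
    loss, X_prev = 0.0, X
    for Theta in Thetas:                 # Theta_l has shape (d_l, d_{l-1})
        loss += np.linalg.norm(X_prev - Theta.T @ (Theta @ X_prev), 'fro') ** 2
        X_prev = Theta @ X_prev          # move on to the next latent subspace X_l
    return loss
```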

(2) Prediction Loss Term \(\mathbf {E}(\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m})\): This term minimizes the empirical risk between the original data and the corresponding labels through the multi-coupled projections in a progressive way, and can be formulated as

$$\begin{aligned} \begin{aligned} \mathbf {E}(\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m})=||\mathbf {Y}-\mathbf {P}\mathbf {\Theta }_{m}...\mathbf {\Theta }_{l}...\mathbf {\Theta }_{1}\mathbf {X}||_{{{\mathrm{F}}}}^{2}. \end{aligned} \end{aligned}$$
(8)

(3) Local Manifold Regularization \(\mathbf {\Phi }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m})\): As introduced in [27], the manifold structure is an important prior for subspace learning. Unlike purely vector-based feature learning, e.g. artificial neural networks (ANNs), manifold regularization can effectively capture the intrinsic structure among samples. To facilitate structure learning in J-Play, we apply local manifold regularization in each latent subspace. Specifically, this term can be expressed as

$$\begin{aligned} \begin{aligned} \mathbf {\Phi }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m})=\sum \nolimits _{l=1}^{m}{{\mathrm{tr}}}(\mathbf {\Theta }_{l}\mathbf {X}_{l-1}\mathbf {L}\mathbf {X}_{l-1}^{{{\mathrm{T}}}}\mathbf {\Theta }_{l}^{{{\mathrm{T}}}}). \end{aligned} \end{aligned}$$
(9)
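As an illustration, the regularizer in Eq. (9) can be evaluated as below; the k-nearest-neighbor heat-kernel graph used to build \(\mathbf {W}\) is a common choice and an assumption of this sketch, not prescribed by the text.

```python
import numpy as np

def knn_heat_kernel_graph(X, k=10, sigma=1.0):
    """Build a symmetric kNN adjacency matrix W from the column samples of X (assumed graph)."""
    N = X.shape[1]
    dist2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    W = np.zeros((N, N))
    for i in range(N):
        idx = np.argsort(dist2[i])[1:k + 1]    # k nearest neighbors, excluding the sample itself
        W[i, idx] = np.exp(-dist2[i, idx] / (2 * sigma ** 2))
    return np.maximum(W, W.T)                  # symmetrize

def manifold_regularizer(Thetas, X, L):
    """Sum of tr(Theta_l X_{l-1} L X_{l-1}^T Theta_l^T) over l (Eq. (9))."""
    reg, X_prev = 0.0, X
    for Theta in Thetas:
        Z = Theta @ X_prev
        reg += np.trace(Z @ L @ Z.T)
        X_prev = Z
    return reg
```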
[Algorithm 1: the global algorithm of J-Play]

(4) Regression Coefficient Regularization \(\mathbf {\Psi }(\mathbf {P})\): This regularization term helps us derive a more reasonable solution and gives the model a more reliable generalization ability; it can be written as

$$\begin{aligned} \begin{aligned} \mathbf {\Psi }(\mathbf {P})=||\mathbf {P}||_{{{\mathrm{F}}}}^{2}. \end{aligned} \end{aligned}$$
(10)

Moreover, the non-negativity constraint on each learned dimension-reduced feature (i.e. \(\{\mathbf {X}_{l}\}_{l=1}^{m} \succeq 0\)) is considered, since we aim to obtain a meaningful low-dimensional feature representation analogous to the original image data, which are acquired with non-negative values. In addition to the non-negativity constraint, we also impose a norm constraint on each sample of each subspace: \(||\mathbf {x}_{lk}||_{2} \preceq 1, \forall k=1,...,N\) and \(l=1,...,m\).
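Putting the four terms together, the overall objective of Eq. (6) can be sketched as follows (reusing the helper functions sketched above; the constraints are assumed to be handled separately by the solver):

```python
import numpy as np

def jplay_objective(P, Thetas, X, Y, L, alpha, beta, gamma):
    """Objective value of Eq. (6); reconstruction_loss, prediction_loss and
    manifold_regularizer are the sketches given earlier in this section."""
    return (0.5 * reconstruction_loss(Thetas, X)
            + 0.5 * alpha * prediction_loss(Y, P, Thetas, X)
            + 0.5 * beta * manifold_regularizer(Thetas, X, L)
            + 0.5 * gamma * np.linalg.norm(P, 'fro') ** 2)
```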

2.4 Model Optimization

Considering the complexity and non-convexity of our model, we pretrain it to obtain an initial approximation of the subspace projections \(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}\), as this greatly reduces the training time and helps find a better solution. This is a common tactic that has been successfully employed in deep autoencoders [8]. Inspired by this trick, we propose a pre-training model with respect to \(\mathbf {\Theta }_{l},\forall l=1,...,m\) by simplifying Eq. (6) as

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {\Theta }_{l}}\frac{1}{2}\mathbf {\Upsilon }(\mathbf {\Theta }_{l})+\frac{\eta }{2}\mathbf {\Phi }(\mathbf {\Theta }_{l}) \quad \mathrm {s.t.} \quad \mathbf {X}_{l}\succeq 0, \quad ||\mathbf {x}_{lk}||_{2} \preceq 1, \end{aligned} \end{aligned}$$
(11)

which we call auto-reconstructing unsupervised learning (AutoRULe). Given the outputs of AutoRULe, the problem in Eq. (6) can be solved more effectively by an alternating minimization strategy that separately solves two subproblems with respect to \(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}\) and \(\mathbf {P}\). The global algorithm of J-Play is summarized in Algorithm 1, where AutoRULe is initialized by LPP.
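A high-level skeleton of Algorithm 1 is sketched below. The routines auto_rule, update_theta, and update_p stand for the pre-training and ADMM solvers described in the remainder of this section; they are placeholders of ours, not the authors' reference implementation.

```python
def jplay_train(X, Y, dims, alpha, beta, gamma, eta, max_iter=50):
    """Skeleton of Algorithm 1: pretrain {Theta_l} with AutoRULe, then alternate updates."""
    # Pre-training: each Theta_l is initialized (e.g. by LPP) and refined by AutoRULe (Eq. (11))
    Thetas, X_prev = [], X
    for d_l in dims:
        Theta_l = auto_rule(X_prev, d_l, eta)     # hypothetical AutoRULe solver (Algorithm 2)
        Thetas.append(Theta_l)
        X_prev = Theta_l @ X_prev
    P = update_p(Thetas, X, Y, alpha, gamma)      # closed-form update, Eq. (17)
    # Alternating minimization over {Theta_l} and P
    for _ in range(max_iter):
        for l in range(len(Thetas)):
            Thetas[l] = update_theta(l, Thetas, P, X, Y, alpha, beta)  # ADMM subproblem, Eq. (18)
        P = update_p(Thetas, X, Y, alpha, gamma)
    return P, Thetas
```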

The pre-training method (AutoRULe) can be effectively solved in an ADMM-based framework. To this end, we consider an equivalent form of Eq. (11) by introducing the auxiliary variables \(\mathbf {H}\), \(\mathbf {G}\), \(\mathbf {Q}\) and \(\mathbf {S}\) to replace \(\mathbf {X}_{l}\), \(\mathbf {\Theta }_{l}\), \(\mathbf {X}_{l}^{+}\) and \(\mathbf {X}_{l}^{\sim }\), respectively, where \(()^{+}\) denotes an operator that converts each component of the matrix to its absolute value and \(()^{\sim }\) is a proximal operator enforcing the constraint \(||\mathbf {x}_{lk}||_{2} \preceq 1\) [7], written as follows

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {\Theta }_{l},\mathbf {H},\mathbf {G},\mathbf {Q},\mathbf {S}}&\frac{1}{2}\mathbf {\Upsilon }(\mathbf {G},\mathbf {H})+\frac{\eta }{2}\mathbf {\Phi }(\mathbf {\Theta }_{l}) =\frac{1}{2}||\mathbf {X}_{l-1}-\mathbf {G}^{{{\mathrm{T}}}}\mathbf {H}||_{{{\mathrm{F}}}}^{2}+\frac{\eta }{2}{{\mathrm{tr}}}(\mathbf {X}_{l}\mathbf {L}\mathbf {X}_{l}^{{{\mathrm{T}}}})\\&\mathrm {s.t.} \quad \mathbf {Q}\succeq 0, \quad ||\mathbf {s}_{k}||_{2} \preceq 1, \quad \mathbf {X}_{l}=\mathbf {\Theta }_{l}\mathbf {X}_{l-1},\\&\quad \quad \quad \mathbf {X}_{l}=\mathbf {H}, \quad \mathbf {\Theta }_{l}=\mathbf {G}, \quad \mathbf {X}_{l}=\mathbf {Q}, \quad \mathbf {X}_{l}=\mathbf {S}. \end{aligned} \end{aligned}$$
(12)

The augmented Lagrangian version of Eq. (12) is

$$\begin{aligned} \begin{aligned} \mathscr {L}&_{\mu }\left( \mathbf {\Theta }_{l},\mathbf {H},\mathbf {G},\mathbf {Q},\mathbf {S}, \{\mathbf {\Lambda }_{n}\}_{n=1}^{4} \right) \\ {}&=\frac{1}{2}||\mathbf {X}_{l-1}-\mathbf {G}^{{{\mathrm{T}}}}\mathbf {H}||_{{{\mathrm{F}}}}^{2}+\frac{\eta }{2}{{\mathrm{tr}}}(\mathbf {\Theta }_{l}\mathbf {X}_{l-1}\mathbf {L}\mathbf {X}_{l-1}^{{{\mathrm{T}}}}\mathbf {\Theta }_{l}^{{{\mathrm{T}}}}) +\mathbf {\Lambda }_{1}^{{{\mathrm{T}}}}(\mathbf {H}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1})\\&+\mathbf {\Lambda }_{2}^{{{\mathrm{T}}}}(\mathbf {G}-\mathbf {\Theta }_{l})+\mathbf {\Lambda }_{3}^{{{\mathrm{T}}}}(\mathbf {Q}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1})+\mathbf {\Lambda }_{4}^{{{\mathrm{T}}}}(\mathbf {S}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1})+\frac{\mu }{2}||\mathbf {H}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1}||_{{{\mathrm{F}}}}^{2}\\&+\frac{\mu }{2}||\mathbf {G}-\mathbf {\Theta }_{l}||_{{{\mathrm{F}}}}^{2}+\frac{\mu }{2}||\mathbf {Q}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1}||_{{{\mathrm{F}}}}^{2}+\frac{\mu }{2}||\mathbf {S}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1}||_{{{\mathrm{F}}}}^{2}+l_{R}^{+}(\mathbf {Q})+l_{R}^{\sim }(\mathbf {S}), \end{aligned} \end{aligned}$$
(13)

where \(\{\mathbf {\Lambda }_{n}\}_{n=1}^{4}\) are Lagrange multipliers and \(\mu \) is the penalty parameter. The two terms \(l_{R}^{+}(\bullet )\) and \(l_{R}^{\sim }(\bullet )\) represent two kinds of projection operators, respectively. That is, \(l_{R}^{+}(\bullet )\) is defined as

$$\begin{aligned} \max (\bullet )= {\left\{ \begin{array}{ll} \bullet , \quad &{} \bullet \succ 0\\ 0, \quad &{} \bullet \preceq 0, \end{array}\right. } \end{aligned}$$
(14)

while \(l_{R}^{\sim }(\bullet _{k})\) is a vector-based operator defined by

$$\begin{aligned} \mathrm {prox}_{f}(\bullet _{k})= {\left\{ \begin{array}{ll} \frac{\bullet _{k}}{||\bullet _{k}||_{2}}, \quad &{} ||\bullet _{k}||_{2} \succ 1\\ \bullet _{k}, \quad &{} ||\bullet _{k}||_{2} \preceq 1, \end{array}\right. } \end{aligned}$$
(15)

where \(\bullet _{k}\) is the kth column of matrix \(\bullet \). Algorithm 2 details the procedures of AutoRULe.
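The two projection operators in Eqs. (14) and (15) translate directly into code; a NumPy sketch:

```python
import numpy as np

def project_nonnegative(M):
    """l_R^+ in Eq. (14): elementwise projection onto the non-negative orthant."""
    return np.maximum(M, 0)

def project_unit_columns(M):
    """l_R^~ in Eq. (15): rescale every column whose l2 norm exceeds 1 back onto the unit ball."""
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    return M / np.maximum(norms, 1.0)
```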

[Algorithm 2: AutoRULe]

The two subproblems in Algorithm 1 can be optimized alternately as follows:

Optimization with respect to \(\mathbf {P}\): This is a typical least-squares regression problem, which can be written as

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {P}}\frac{\alpha }{2}\mathbf {E}(\mathbf {P})+\frac{\gamma }{2}\mathbf {\Psi }(\mathbf {P})=\frac{\alpha }{2}||\mathbf {Y}-\mathbf {P}\mathbf {\Theta }_{m}...\mathbf {\Theta }_{l}...\mathbf {\Theta }_{1}\mathbf {X}||_{{{\mathrm{F}}}}^{2} +\frac{\gamma }{2}||\mathbf {P}||_{{{\mathrm{F}}}}^{2}, \end{aligned} \end{aligned}$$
(16)

which has a closed-form solution

$$\begin{aligned} \begin{aligned} \mathbf {P} \leftarrow (\alpha \mathbf {Y}\mathbf {V}^{{{\mathrm{T}}}})(\alpha \mathbf {V}\mathbf {V}^{{{\mathrm{T}}}}+\gamma \mathbf {I})^{-1}, \end{aligned} \end{aligned}$$
(17)

where \(\mathbf {V}=\mathbf {\Theta }_{m}...\mathbf {\Theta }_{l}...\mathbf {\Theta }_{1}\mathbf {X}\) denotes the data progressively mapped through all learned subspaces.
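The closed form in Eq. (17) is a ridge-type solution; a direct NumPy sketch (a linear solve is used instead of an explicit matrix inverse, which is our choice for numerical stability):

```python
import numpy as np
from functools import reduce

def update_p(Thetas, X, Y, alpha, gamma):
    """Closed-form update of P (Eq. (17)) with V = Theta_m ... Theta_1 X."""
    V = reduce(lambda Z, Theta: Theta @ Z, Thetas, X)
    A = alpha * V @ V.T + gamma * np.eye(V.shape[0])   # alpha V V^T + gamma I
    # P = (alpha Y V^T) A^{-1}, computed by solving A^T P^T = (alpha Y V^T)^T
    return np.linalg.solve(A.T, (alpha * Y @ V.T).T).T
```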

Optimization with respect to \(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}\): The variables \(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}\) can be optimized individually, and hence the optimization problem for each \(\mathbf {\Theta }_{l}\) can be generally formulated as

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {\Theta }_{l}}&\frac{1}{2}\mathbf {\Upsilon }(\mathbf {\Theta }_{l})+\frac{\alpha }{2}\mathbf {E}(\mathbf {\Theta }_{l})+\frac{\beta }{2}\mathbf {\Phi }(\mathbf {\Theta }_{l}) =\frac{1}{2}||\mathbf {X}_{l-1}-\mathbf {\Theta }_{l}^{{{\mathrm{T}}}}\mathbf {\Theta }_{l}\mathbf {X}_{l-1}||_{{{\mathrm{F}}}}^{2}\\&+\frac{\alpha }{2}||\mathbf {Y}-\mathbf {P}\mathbf {\Theta }_{m}...\mathbf {\Theta }_{l}...\mathbf {\Theta }_{1}\mathbf {X}||_{{{\mathrm{F}}}}^{2} +\frac{\beta }{2}{{\mathrm{tr}}}(\mathbf {\Theta }_{l}\mathbf {X}_{l-1}\mathbf {L}\mathbf {X}_{l-1}^{{{\mathrm{T}}}}\mathbf {\Theta }_{l}^{{{\mathrm{T}}}})\\&\mathrm {s.t.} \quad \mathbf {X}_{l}=\mathbf {\Theta }_{l}\mathbf {X}_{l-1}, \quad \mathbf {X}_{l}\succeq 0, \quad ||\mathbf {x}_{lk}||_{2} \preceq 1, \end{aligned} \end{aligned}$$
(18)

which can be solved by following the framework of Algorithm 2. The only difference lies in the optimization subproblem with respect to \(\mathbf {H}\), whose solution is obtained by solving the following problem:

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {H}}&\frac{1}{2}||\mathbf {X}_{l-1}-\mathbf {G}^{{{\mathrm{T}}}}\mathbf {H}||_{{{\mathrm{F}}}}^{2}+\frac{\alpha }{2}||\mathbf {Y}-\mathbf {P}_{l}\mathbf {H}||_{{{\mathrm{F}}}}^{2} +\mathbf {\Lambda }_{1}^{{{\mathrm{T}}}}(\mathbf {H}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1})\\&+\frac{\mu }{2}||\mathbf {H}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1}||_{{{\mathrm{F}}}}^{2} \quad \mathrm {s.t.} \quad \mathbf {P}_{l}=\mathbf {P}_{l-1}\mathbf {\Theta }_{l+1}, \quad \mathbf {P}_{0}=\mathbf {P}. \end{aligned} \end{aligned}$$
(19)

The analytical solution of Eq. (19) is given by

$$\begin{aligned} \begin{aligned} \mathbf {H} \leftarrow (\alpha \mathbf {P}_{l}^{{{\mathrm{T}}}}\mathbf {P}_{l}+\mathbf {G}\mathbf {G}^{{{\mathrm{T}}}}+\mu \mathbf {I})^{-1} (\alpha \mathbf {P}_{l}^{{{\mathrm{T}}}}\mathbf {Y}+\mathbf {G}\mathbf {X}_{l-1}+\mu \mathbf {\Theta }_{l}\mathbf {X}_{l-1}-\mathbf {\Lambda }_{1}). \end{aligned} \end{aligned}$$
(20)
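The \(\mathbf {H}\)-update of Eq. (20) is again a linear system; a NumPy sketch (\(\mathbf {P}_{l}\), \(\mathbf {G}\), and \(\mathbf {\Lambda }_{1}\) are the quantities defined above, and \(\mu \) is the ADMM penalty parameter):

```python
import numpy as np

def update_h(X_prev, Theta_l, P_l, G, Y, Lambda1, alpha, mu):
    """Analytical H-update of Eq. (20)."""
    A = alpha * P_l.T @ P_l + G @ G.T + mu * np.eye(G.shape[0])
    B = alpha * P_l.T @ Y + G @ X_prev + mu * Theta_l @ X_prev - Lambda1
    return np.linalg.solve(A, B)
```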

Finally, we repeat these optimization procedures until a stopping criterion is satisfied. Please refer to Algorithms 1 and 2 for more explicit steps.

3 Experiments

In this section, we conduct classification experiments to quantitatively evaluate the performance of the proposed method (J-Play) using three popular and advanced classifiers, namely nearest neighbor (NN) based on the Euclidean distance, kernel support vector machines (KSVM), and canonical correlation forests (CCF), in comparison with previous state-of-the-art methods. Overall accuracy (OA) is reported to quantify the classification performance.

3.1 Data Description

The experiments are performed on two different types of datasets, hyperspectral datasets and face datasets, as both easily suffer from information redundancy and benefit from more representative features. We use the following two hyperspectral datasets and two face datasets:

  1. Indian Pines AVIRIS Image: The first hyperspectral cube was acquired by the AVIRIS sensor, with a size of \(145\times 145\times 220\), and consists of 16 vegetation classes. More specific classes and the arrangement of training and test samples can be found in [11]. The first image of Fig. 3 shows a false color image of the Indian Pines data.

  2. University of Houston Image: The second hyperspectral cube was provided for the 2013 IEEE GRSS Data Fusion Contest and was acquired by the ITRES CASI sensor, with a size of \(349\times 1905\times 144\). Information regarding the classes and the corresponding training and test samples can be found in [13]. A false color image of the study scene is shown in the first image of Fig. 4.

  3. Extended Yale-B Dataset: We only choose a subset of this dataset with the frontal pose under different illuminations for 38 subjects (2414 images in total), which has been widely used to evaluate the performance of subspace learning [2, 32]. These images were aligned and cropped to a size of \(32\times 32\), i.e., a 1024-dimensional vector-based representation. Each individual has 64 near-frontal images under different illuminations.

  4. AR Dataset: Similar to [31], we choose a subset of AR under varying illumination and expressions, which comprises 100 subjects. Each person has 14 images, with the seven images from Session 1 used as the training set and those from Session 2 as the testing samples. The images are resized to \(60\times 43\).

3.2 Experimental Setup

As fixed training and testing samples are given for the hyperspectral datasets, the subspace learning techniques can be directly performed on the training set to learn an optimal subspace, in which the testing set is then classified by NN, KSVM, and CCF. For the face datasets, since there are no standard training and testing sets, ten replications are performed with randomly selected training and testing samples: a random subset with 10 facial images per individual is chosen with labels as the training set, and the rest is used as the testing set. Furthermore, we compare the performance of the proposed method (J-Play) with the baseline (original features without dimensionality reduction) and six popular and advanced methods (PCA, LPP, LDA, LFDA, LSDR, and LSQMID). Depending on the number of coupled projections learned, the proposed method is denoted as J-Play\(_{1}\),...,J-Play\(_{l}\),...,J-Play\(_{m}\), \(\forall l=1,...,m\). To investigate the trend of the OAs, m is uniformly set up to 7 on the four datasets.

Fig. 3.

A false color image, ground truth and classification maps of the different algorithms obtained using CCF on the Indian Pines dataset.

Fig. 4.

A false color image, ground truth and classification maps of the different algorithms obtained using CCF on the Houston dataset. (Color figure online)

Table 1. Quantitative performance comparisons on two hyperspectral datasets. The best results for the different classifiers are shown in red.

3.3 Results of Hyperspectral Data

Initially, we conduct a 10-fold cross-validation for the different algorithms on the training set in order to estimate the optimal parameters, which are selected from \(\{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}\). Table 1 lists the classification performance of the different methods with the optimal subspace dimensions obtained by cross-validation, using three different classifiers. Correspondingly, the classification maps are given in Figs. 3 and 4 to intuitively highlight the differences.

Overall, PCA performs similarly to the baseline with the three different classifiers on the two datasets. LPP, due to its sensitivity to noise, yields poor performance on the first dataset, while on the relatively high-quality second dataset it steadily outperforms the baseline and PCA. Among the supervised algorithms, owing to the limited training samples and discriminative power, the classification accuracies of classic LDA are overall lower than those of the previously mentioned methods. With a more powerful discriminative criterion, LFDA obtains more competitive results by locally focusing on discriminative information, which are generally better than those of the baseline, PCA, LPP, and LDA. However, the features learned by LFDA are sensitive to noise and to the number of neighbors, resulting in unstable performance, particularly across the different classifiers. LSDR and LSQMID aim to find a linear projection by maximizing the mutual information between input and output from a statistical point of view. By fully considering the mutual information, they achieve good performance on the two given hyperspectral datasets.

Remarkably, the performance of the proposed method (J-Play) is superior to that of the other methods on the two hyperspectral datasets. This indicates that J-Play tends to learn a better feature representation and is robust against noise. On the other hand, with the increase of m, the performance of J-Play steadily increases, reaching its best with around 4 or 5 layers for the first dataset and 2 or 3 layers for the second one, and then gradually decreases with slight perturbations, since our model is only trained on the training set.

Fig. 5.

Visualization of partial facial features learned by the proposed J-Play on two face datasets.

Table 2. Quantitative performance comparisons on two face datasets. The best results for the different classifiers are shown in red.

3.4 Results of Face Images

As J-Play is proposed as a general subspace learning framework for multi-label classification, we additionally use two popular face datasets to further assess its generalization capability. Similarly, cross-validation on the training set is conducted to estimate the optimal parameter combination on the Extended Yale-B and AR datasets. Considering the high-dimensional vector-based face images, we first apply PCA to roughly reduce the feature redundancy, and its outputs are then fed into the dimensionality reduction methods, following previous work on face recognition (e.g. LDA (Fisherfaces) [20] and LPP (Laplacianfaces) [5]). Table 2 gives the corresponding OAs of the different methods on the two face datasets.

By comparison, the performance of PCA and LPP is steadily superior to that of the baseline, and PCA is even better than LPP. Among the supervised approaches, LDA performs better than the baseline, PCA, LPP, and even LFDA, showing an impressive result. Due to the smaller number of training samples in the face datasets, LSDR and LSQMID cannot effectively estimate the mutual information between the training samples and labels, resulting in performance degradation compared to the hyperspectral data. The proposed method outperforms the other algorithms, which indicates that it can effectively learn an optimal mapping from the original space to the label space, further improving the classification accuracy. Likewise, there is a similar trend for the proposed method with the increase of m: J-Play basically obtains the optimal OAs with around 4 or 5 layers, and more layers lead to performance degradation. We also characterize and visualize each column of the learned projection, as shown in Fig. 5, where high-level, semantically meaningful features, i.e. face features under different poses and illuminations, are learned well, making the faces easier to identify.

4 Conclusions

To effectively find an optimal subspace where samples can be semantically represented and thereby better classified or recognized, we proposed a novel linearized subspace learning framework (J-Play), which learns the feature representation from high-dimensional data in a joint and progressive way. Extensive multi-label classification experiments are conducted on two types of datasets, hyperspectral images and face images, in comparison with previously proposed state-of-the-art methods. The promising results of J-Play demonstrate its superiority and effectiveness. In the future, we will further build a unified framework based on J-Play by extending it to semi-supervised learning, transfer learning, and multi-task learning.