
1 Introduction

Fig. 1.

The motivation of our method: evolving from separately performing subspace learning and classification, to joint learning, and further to joint \(\&\) progressive learning. The subspaces learned by our model show a higher feature discriminative ability, as indicated by the green bottom line. (Color figure online)

High-dimensional data are often characterized by very rich and diverse information, which enables us to classify or recognize targets more effectively and analyze data attributes more easily, but inevitably introduces some drawbacks (e.g. information redundancy, complex noise effects, high storage consumption, etc.) due to the curse of dimensionality. A general way to address this problem is to learn a low-dimensional and highly discriminative feature representation, commonly called dimensionality reduction or subspace learning. In the past decades, a large number of subspace learning techniques have been developed in the machine learning community, with successful applications to biometrics [5, 9, 10, 20], image/video analysis [26], visualization [22], and hyperspectral data analysis (e.g., dimensionality reduction and unmixing) [12,13,14]. These subspace learning techniques are generally categorized into linear and nonlinear methods. Theoretically, nonlinear approaches are capable of capturing the data structure more effectively. However, they provide no explicit mapping function (poor explainability), it is relatively hard to embed out-of-sample data into the learned subspace (weak generalization), and they incur a high computational cost (lack of cost-effectiveness). Additionally, for a multi-label classification task, the classic subspace learning techniques, such as principal component analysis (PCA) [29], linear discriminant analysis (LDA) [20], local Fisher discriminant analysis (LFDA) [23], manifold learning (e.g. Laplacian eigenmaps (LE) [1], locally linear embedding (LLE) [21]) and their linearized methods (e.g. locality preserving projection (LPP) [6], neighborhood preserving embedding (NPE) [4]), are commonly applied as a disjoint feature learning step before classification. Their main limitation lies in the weak connection between the features learned by subspace learning and the label space (see the top panel of Fig. 1); it remains unknown which learned features (or subspace) can improve the classification.

Recently, a feasible solution to the above problems has been generalized as a joint learning framework [17] that simultaneously considers linearized subspace learning and classification, as illustrated in the middle panel of Fig. 1. Following it, more advanced methods have been proposed and applied in various fields, including supervised dimensionality reduction (e.g. least-squares dimensionality reduction (LSDR) [24] and its variants, such as the least-squares quadratic mutual information derivative (LSQMID) [25]), multi-modal data matching and retrieval [27, 28], and heterogeneous feature learning for activity recognition [15, 16]. In these works, the learned features (or subspace) and the label information are effectively connected by regression techniques (e.g. linear regression) to adaptively estimate a latent and discriminative subspace. Despite this, they still fail to find an optimal subspace, as a single linear projection is hardly enough to represent the complex transformation from the original data space to the potentially optimal subspace.

Motivated by the aforementioned studies, we propose a novel joint and progressive learning strategy (J-Play) to linearly find an optimal subspace for general multi-label classification, as illustrated in the bottom panel of Fig. 1. We extend the existing joint learning framework by learning a series of subspaces instead of a single subspace, aiming at progressively converting the original data space to a potentially optimal subspace through multi-coupled intermediate transformations [18]. Theoretically, by increasing the number of subspaces, the variations between coupled subspaces are gradually narrowed down to a small range that can be represented effectively by a linear transformation. This makes it easier to find a good solution, especially when the model is complex and non-convex. We also contribute to structure learning in each latent subspace by locally embedding the manifold structure.

The main highlights of our work can be summarized as follows:

  • A linearized progressive learning strategy is proposed to describe the variations from the original data space to a potentially optimal subspace, tending to find a better solution. A joint learning framework that simultaneously estimates the subspace projections (connecting the original space and the latent subspaces) and a property-labeled projection (connecting the learned latent subspaces and the label space) is considered to find a discriminative subspace where samples are expected to be better classified.

  • Structure learning with local manifold regularization is performed in each latent subspace.

  • Based on the above techniques, a novel joint and progressive learning strategy (J-Play) is developed for multi-label classification.

  • An iterative optimization algorithm based on the alternating direction method of multipliers (ADMM) is designed to solve the proposed model.

2 Joint and Progressive Learning Strategy (J-Play)

2.1 Notations

Let \(\mathbf {X}=[\mathbf {x}_{1},...,\mathbf {x}_{k},...,\mathbf {x}_{N}]\in \mathbb {R}^{d_{0}\times N}\) be a data matrix with \(d_{0}\) dimensions and N samples, and let the matrix of corresponding class labels be \(\mathbf {Y} \in \{0,1\}^{L \times N}\). The kth column of \(\mathbf {Y}\) is \(\mathbf {y}_{k}=[\mathbf {y}_{k1},...,\mathbf {y}_{kt},...,\mathbf {y}_{kL}]^{T}\in \mathbb {R}^{L\times 1}\), whose elements are defined as follows:

$$\begin{aligned} \mathbf {y}_{kt}= {\left\{ \begin{array}{ll} \begin{aligned} 1, \quad &{} \text {if } \mathbf {y}_{k} \text { belongs to the } t\hbox {-th class;}\\ 0, \quad &{} \text {otherwise.} \end{aligned} \end{array}\right. } \end{aligned}$$
(1)
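As a concrete illustration of Eq. (1), the short sketch below (a minimal NumPy example; the function name and the integer label encoding are our own assumptions, not from the original text) builds the binary label matrix \(\mathbf {Y}\) from integer class indices.

```python
import numpy as np

def build_label_matrix(class_indices, num_classes):
    """Build the binary label matrix Y of Eq. (1).

    class_indices : length-N array with entries in {1, ..., L}
    Returns Y of shape (L, N), where Y[t-1, k] = 1 iff sample k belongs to class t.
    """
    class_indices = np.asarray(class_indices)
    Y = np.zeros((num_classes, class_indices.size))
    Y[class_indices - 1, np.arange(class_indices.size)] = 1
    return Y

# e.g. three samples with labels 2, 1, 3 among L = 3 classes
Y = build_label_matrix([2, 1, 3], num_classes=3)
```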

In our task, we aim to learn a set of coupled projections \(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}\in \mathbb {R}^{d_{l}\times d_{l-1}}\) and a property-labeled projection \(\mathbf {P} \in \mathbb {R}^{L\times d_{m}}\), where m stands for the number of subspace projections and \(\{d_{l}\}_{l=1}^{m}\) are defined as the dimensions of those latent subspaces respectively, while \(d_{0}\) is specified as the dimension of \(\mathbf {X}\).

2.2 Basic Framework of J-Play from the View of Subspace Learning

Subspace learning aims to find a low-dimensional space in which certain properties of the original data are maximized, e.g. variance (PCA), discriminative ability (LDA), or graph structure (manifold learning). Yan et al. [30] summarized these subspace learning methods in a general graph embedding framework.

Given an undirected similarity graph \(G=\{\mathbf {X},\mathbf {W}\}\) with vertices \(\mathbf {X}=\{\mathbf {x}_{1},...,\mathbf {x}_{N}\}\) and adjacency matrix \(\mathbf {W}\in \mathbb {R}^{N\times N}\), we can intuitively measure the similarities among the data. By preserving these similarity relationships, the high-dimensional data can be well embedded into a low-dimensional space. Denoting the low-dimensional data representation as \(\mathbf {Z}\in \mathbb {R}^{d\times N}\) (\(d\ll d_{0}\)), this can be formulated as

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {Z}}{{\mathrm{tr}}}(\mathbf {Z}\mathbf {L}\mathbf {Z}^{{{\mathrm{T}}}}), \quad \mathrm {s.t.} \quad \mathbf {Z}\mathbf {D}\mathbf {Z}^{{{\mathrm{T}}}}=\mathbf {I}, \end{aligned} \end{aligned}$$
(2)

where \(\mathbf {D}_{ii}=\sum _{j}\mathbf {W}_{ij}\) is a diagonal matrix, \(\mathbf {L}\) is the Laplacian matrix defined by \(\mathbf {L}=\mathbf {D}-\mathbf {W}\) [3], and \(\mathbf {I}\) is the identity matrix. In our case, we aim at learning multi-coupled linear projections to find an optimal mapping; therefore, a linearized subspace learning problem can be reformulated on the basis of Eq. (2) by substituting \(\mathbf {\Theta }\mathbf {X}\) for \(\mathbf {Z}\):

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {\Theta }}{{\mathrm{tr}}}(\mathbf {\Theta }\mathbf {X}\mathbf {L}\mathbf {X}^{{{\mathrm{T}}}}\mathbf {\Theta }^{{{\mathrm{T}}}}), \quad \mathrm {s.t.} \quad \mathbf {\Theta }\mathbf {X}\mathbf {D}\mathbf {X}^{{{\mathrm{T}}}}\mathbf {\Theta }^{{{\mathrm{T}}}}=\mathbf {I}, \end{aligned} \end{aligned}$$
(3)

which can be solved by generalized eigenvalue decomposition.
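For reference, a minimal sketch of how Eq. (3) can be solved via a generalized eigenvalue decomposition is given below (NumPy/SciPy; the small ridge added to the constraint matrix is our own numerical safeguard, not part of the formulation).

```python
import numpy as np
from scipy.linalg import eigh

def linearized_graph_embedding(X, W, d):
    """Solve Eq. (3): min tr(Theta X L X^T Theta^T) s.t. Theta X D X^T Theta^T = I.

    X : (d0, N) data matrix, W : (N, N) adjacency matrix, d : subspace dimension.
    Returns Theta of shape (d, d0).
    """
    D = np.diag(W.sum(axis=1))
    L = D - W                                    # graph Laplacian L = D - W
    A = X @ L @ X.T
    B = X @ D @ X.T + 1e-8 * np.eye(X.shape[0])  # small ridge for numerical stability (assumption)
    # generalized eigenproblem A v = lambda B v; the d smallest eigenvectors form Theta
    _, eigvecs = eigh(A, B)
    return eigvecs[:, :d].T
```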

Different from the previously mentioned subspace learning methods, a regression-based joint learning model [17] can explicitly bridge the learned latent subspace and the labels, which can be formulated in a general form:

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {P},\mathbf {\Theta }}\frac{1}{2}\mathbf {E}(\mathbf {P},\mathbf {\Theta })+\frac{\beta }{2}\mathbf {\Phi }(\mathbf {\Theta })+\frac{\gamma }{2}\mathbf {\Psi }(\mathbf {P}), \end{aligned} \end{aligned}$$
(4)

where \(\mathbf {E}(\mathbf {P},\mathbf {\Theta })\) is the error term defined as \(||\mathbf {Y}-\mathbf {P}\mathbf {\Theta }\mathbf {X}||_{{{\mathrm{F}}}}^{2}\), \(||\bullet ||_{{{\mathrm{F}}}}\) denotes the Frobenius norm, and \(\beta \) and \(\gamma \) are the corresponding penalty parameters. \(\mathbf {\Phi }\) and \(\mathbf {\Psi }\) denote regularization functions, which might be the \(l_{1}\) norm, \(l_{2}\) norm, \(l_{2,1}\) norm, or manifold regularization. Herein, the variable \(\mathbf {\Theta }\) is called the intermediate transformation, and the corresponding subspace generated by \(\mathbf {\Theta }\) is called the latent subspace, where the features can be further structurally learned and represented in a more suitable way [16].

Fig. 2.

The illustration of the proposed J-Play framework.

On the basis of Eq. (4), we further extend the framework by following a progressive learning strategy:

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m}}\frac{1}{2}\mathbf {E}(\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m})+\frac{\beta }{2}\mathbf {\Phi }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m})+\frac{\gamma }{2}\mathbf {\Psi }(\mathbf {P}), \end{aligned} \end{aligned}$$
(5)

where \(\mathbf {E}(\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m})\) is specified as \(||\mathbf {Y}-\mathbf {P}\mathbf {\Theta }_{m}...\mathbf {\Theta }_{l}...\mathbf {\Theta }_{1}\mathbf {X}||_{{{\mathrm{F}}}}^{2}\) and \(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}\) represent a set of intermediate transformations.
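To make the progressive error term concrete, a minimal sketch of evaluating the loss in Eq. (5) for a given list of projections (NumPy; shapes and names are illustrative assumptions) is:

```python
import numpy as np
from functools import reduce

def prediction_loss(Y, P, Thetas, X):
    """||Y - P Theta_m ... Theta_1 X||_F^2 for Thetas = [Theta_1, ..., Theta_m]."""
    Z = reduce(lambda Z, Theta: Theta @ Z, Thetas, X)  # map X progressively through all subspaces
    return np.linalg.norm(Y - P @ Z, 'fro') ** 2
```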

2.3 Problem Formulation

Following the general framework given in Eq. (5), the proposed J-Play can be formulated as the following constrained optimization problem:

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m}}&\frac{1}{2}\mathbf {\Upsilon }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}) +\frac{\alpha }{2}\mathbf {E}(\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m})+\frac{\beta }{2}\mathbf {\Phi }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m})+\frac{\gamma }{2}\mathbf {\Psi }(\mathbf {P})\\&\mathrm {s.t.} \quad \mathbf {X}_{l}=\mathbf {\Theta }_{l}\mathbf {X}_{l-1}, \quad \mathbf {X}_{l}\succeq 0, \quad ||\mathbf {x}_{lk}||_{2} \preceq 1, \quad \forall l=1,2,...,m, \end{aligned} \end{aligned}$$
(6)

where \(\mathbf {X}\) is assigned to \(\mathbf {X}_{0}\), and \(\alpha \), \(\beta \), and \(\gamma \) are three penalty parameters that balance the importance of the corresponding terms. Figure 2 illustrates the J-Play framework. Since Eq. (6) is an ill-posed problem, reasonable assumptions or priors need to be introduced to search for a solution effectively in a narrowed range. More specifically, we cast Eq. (6) as a least-squares regression problem with a reconstruction loss term (\(\mathbf {\Upsilon }(\bullet )\)), a prediction loss term (\(\mathbf {E}(\bullet )\)), and two regularization terms (\(\mathbf {\Phi }(\bullet )\) and \(\mathbf {\Psi }(\bullet )\)). We detail these terms one by one as follows.

(1) Reconstruction Loss Term \(\mathbf {\Upsilon }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m})\): Without any constraint or prior, directly estimating the multi-coupled projections in J-Play becomes increasingly difficult as the number of estimated projections grows. This can be explained by vanishing gradients between neighboring variables during optimization: the updates of neighboring projections become tiny or even zero. In particular, when the number of projections increases beyond a certain extent, most of the learned projections tend to zero and become meaningless. To this end, we adopt an autoencoder-like scheme that requires each learned subspace to be projected back to the preceding space as accurately as possible. The benefits of this scheme are, on one hand, to prevent over-fitting to some extent, in particular avoiding fitting excessive noise; on the other hand, to establish an effective link between the original space and the subspace, making the learned subspace more meaningful. The resulting expression is

$$\begin{aligned} \begin{aligned} \mathbf {\Upsilon }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m})=\sum \nolimits _{l=1}^{m}||\mathbf {X}_{l-1}-\mathbf {\Theta }_{l}^{T}\mathbf {\Theta }_{l}\mathbf {X}_{l-1}||_{{{\mathrm{F}}}}^{2}. \end{aligned} \end{aligned}$$
(7)

In our case, to fully utilize the advantages of this term, we apply it in each latent subspace, as shown in Eq. (7).
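A minimal sketch of the reconstruction loss in Eq. (7), accumulated over all latent subspaces (NumPy; illustrative only):

```python
import numpy as np

def reconstruction_loss(Thetas, X):
    """Sum of ||X_{l-1} - Theta_l^T Theta_l X_{l-1}||_F^2 over l = 1, ..., m (Eq. (7))."""
    loss, X_prev = 0.0, X
    for Theta in Thetas:                 # Theta_l has shape (d_l, d_{l-1})
        loss += np.linalg.norm(X_prev - Theta.T @ (Theta @ X_prev), 'fro') ** 2
        X_prev = Theta @ X_prev          # move on to the next latent subspace X_l
    return loss
```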

(2) Prediction Loss Term \(\mathbf {E}(\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m})\): This term minimizes the empirical risk between the original data and the corresponding labels through the multi-coupled projections in a progressive way, and can be formulated as

$$\begin{aligned} \begin{aligned} \mathbf {E}(\mathbf {P},\{\mathbf {\Theta }_{l}\}_{l=1}^{m})=||\mathbf {Y}-\mathbf {P}\mathbf {\Theta }_{m}...\mathbf {\Theta }_{l}...\mathbf {\Theta }_{1}\mathbf {X}||_{{{\mathrm{F}}}}^{2}. \end{aligned} \end{aligned}$$
(8)

(3) Local Manifold Regularization \(\mathbf {\Phi }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m})\): As introduced in [27], the manifold structure is an important prior for subspace learning. Unlike purely vector-based feature learning, e.g. artificial neural networks (ANNs), manifold regularization can effectively capture the intrinsic structure among samples. To facilitate structure learning in J-Play, we apply local manifold regularization in each latent subspace. Specifically, this term can be expressed as

$$\begin{aligned} \begin{aligned} \mathbf {\Phi }(\{\mathbf {\Theta }_{l}\}_{l=1}^{m})=\sum \nolimits _{l=1}^{m}{{\mathrm{tr}}}(\mathbf {\Theta }_{l}\mathbf {X}_{l-1}\mathbf {L}\mathbf {X}_{l-1}^{{{\mathrm{T}}}}\mathbf {\Theta }_{l}^{{{\mathrm{T}}}}). \end{aligned} \end{aligned}$$
(9)
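As an illustration, the regularizer in Eq. (9) can be evaluated as below; the k-nearest-neighbor heat-kernel graph used to build \(\mathbf {W}\) is a common choice and an assumption of this sketch, not prescribed by the text.

```python
import numpy as np

def knn_heat_kernel_graph(X, k=10, sigma=1.0):
    """Build a symmetric kNN adjacency matrix W from the column samples of X (assumed graph)."""
    N = X.shape[1]
    dist2 = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)
    W = np.zeros((N, N))
    for i in range(N):
        idx = np.argsort(dist2[i])[1:k + 1]    # k nearest neighbors, excluding the sample itself
        W[i, idx] = np.exp(-dist2[i, idx] / (2 * sigma ** 2))
    return np.maximum(W, W.T)                  # symmetrize

def manifold_regularizer(Thetas, X, L):
    """Sum of tr(Theta_l X_{l-1} L X_{l-1}^T Theta_l^T) over l (Eq. (9))."""
    reg, X_prev = 0.0, X
    for Theta in Thetas:
        Z = Theta @ X_prev
        reg += np.trace(Z @ L @ Z.T)
        X_prev = Z
    return reg
```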
[Algorithm 1: the global algorithm of J-Play]

(4) Regression Coefficient Regularization \(\mathbf {\Psi }(\mathbf {P})\): This regularization term helps us derive a more reasonable solution and gives the model a more reliable generalization ability; it can be written as

$$\begin{aligned} \begin{aligned} \mathbf {\Psi }(\mathbf {P})=||\mathbf {P}||_{{{\mathrm{F}}}}^{2}. \end{aligned} \end{aligned}$$
(10)

Moreover, the non-negativity constraint on each learned dimension-reduced feature (i.e. \(\{\mathbf {X}_{l}\}_{l=1}^{m} \succeq 0\)) is considered, since we aim to obtain a meaningful low-dimensional feature representation analogous to the original image data, which are acquired with non-negative values. In addition to the non-negativity constraint, we also impose a norm constraint on each sample of each subspace: \(||\mathbf {x}_{lk}||_{2} \preceq 1, \forall k=1,...,N\) and \(l=1,...,m\).
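Putting the four terms together, the overall objective of Eq. (6) can be sketched as follows (reusing the helper functions sketched above; the constraints are assumed to be handled separately by the solver):

```python
import numpy as np

def jplay_objective(P, Thetas, X, Y, L, alpha, beta, gamma):
    """Objective value of Eq. (6); reconstruction_loss, prediction_loss and
    manifold_regularizer are the sketches given earlier in this section."""
    return (0.5 * reconstruction_loss(Thetas, X)
            + 0.5 * alpha * prediction_loss(Y, P, Thetas, X)
            + 0.5 * beta * manifold_regularizer(Thetas, X, L)
            + 0.5 * gamma * np.linalg.norm(P, 'fro') ** 2)
```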

2.4 Model Optimization

Considering the complexity and non-convexity of our model, we pretrain it to obtain an initial approximation of the subspace projections \(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}\), as this greatly reduces the training time and helps find a better solution. This is a common tactic that has been successfully employed in deep autoencoders [8]. Inspired by this trick, we propose a pre-training model with respect to \(\mathbf {\Theta }_{l},\forall l=1,...,m\) by simplifying Eq. (6) as

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {\Theta }_{l}}\frac{1}{2}\mathbf {\Upsilon }(\mathbf {\Theta }_{l})+\frac{\eta }{2}\mathbf {\Phi }(\mathbf {\Theta }_{l}) \quad \mathrm {s.t.} \quad \mathbf {X}_{l}\succeq 0, \quad ||\mathbf {x}_{lk}||_{2} \preceq 1, \end{aligned} \end{aligned}$$
(11)

which we call auto-reconstructing unsupervised learning (AutoRULe). Given the outputs of AutoRULe, the problem in Eq. (6) can be solved more effectively by an alternating minimization strategy that separately solves two subproblems with respect to \(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}\) and \(\mathbf {P}\). The global algorithm of J-Play is summarized in Algorithm 1, where AutoRULe is initialized by LPP.
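A high-level skeleton of Algorithm 1 is sketched below. The routines auto_rule, update_theta, and update_p stand for the pre-training and ADMM solvers described in the remainder of this section; they are placeholders of ours, not the authors' reference implementation.

```python
def jplay_train(X, Y, dims, alpha, beta, gamma, eta, max_iter=50):
    """Skeleton of Algorithm 1: pretrain {Theta_l} with AutoRULe, then alternate updates."""
    # Pre-training: each Theta_l is initialized (e.g. by LPP) and refined by AutoRULe (Eq. (11))
    Thetas, X_prev = [], X
    for d_l in dims:
        Theta_l = auto_rule(X_prev, d_l, eta)     # hypothetical AutoRULe solver (Algorithm 2)
        Thetas.append(Theta_l)
        X_prev = Theta_l @ X_prev
    P = update_p(Thetas, X, Y, alpha, gamma)      # closed-form update, Eq. (17)
    # Alternating minimization over {Theta_l} and P
    for _ in range(max_iter):
        for l in range(len(Thetas)):
            Thetas[l] = update_theta(l, Thetas, P, X, Y, alpha, beta)  # ADMM subproblem, Eq. (18)
        P = update_p(Thetas, X, Y, alpha, gamma)
    return P, Thetas
```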

The pre-training method (AutoRULe) can be effectively solved in an ADMM-based framework. To this end, we consider an equivalent form of Eq. (11) by introducing the auxiliary variables \(\mathbf {H}\), \(\mathbf {G}\), \(\mathbf {Q}\) and \(\mathbf {S}\) to replace \(\mathbf {X}_{l}\), \(\mathbf {\Theta }_{l}\), \(\mathbf {X}_{l}^{+}\) and \(\mathbf {X}_{l}^{\sim }\), respectively, where \(()^{+}\) denotes an operator that converts each component of the matrix to its absolute value and \(()^{\sim }\) is a proximal operator enforcing the constraint \(||\mathbf {x}_{lk}||_{2} \preceq 1\) [7], written as follows

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {\Theta }_{l},\mathbf {H},\mathbf {G},\mathbf {Q},\mathbf {S}}&\frac{1}{2}\mathbf {\Upsilon }(\mathbf {G},\mathbf {H})+\frac{\eta }{2}\mathbf {\Phi }(\mathbf {\Theta }_{l}) =\frac{1}{2}||\mathbf {X}_{l-1}-\mathbf {G}^{{{\mathrm{T}}}}\mathbf {H}||_{{{\mathrm{F}}}}^{2}+\frac{\eta }{2}{{\mathrm{tr}}}(\mathbf {X}_{l}\mathbf {L}\mathbf {X}_{l}^{{{\mathrm{T}}}})\\&\mathrm {s.t.} \quad \mathbf {Q}\succeq 0, \quad ||\mathbf {s}_{k}||_{2} \preceq 1, \quad \mathbf {X}_{l}=\mathbf {\Theta }_{l}\mathbf {X}_{l-1},\\&\quad \quad \quad \mathbf {X}_{l}=\mathbf {H}, \quad \mathbf {\Theta }_{l}=\mathbf {G}, \quad \mathbf {X}_{l}=\mathbf {Q}, \quad \mathbf {X}_{l}=\mathbf {S}. \end{aligned} \end{aligned}$$
(12)

The augmented Lagrangian version of Eq. (12) is

$$\begin{aligned} \begin{aligned} \mathscr {L}&_{\mu }\left( \mathbf {\Theta }_{l},\mathbf {H},\mathbf {G},\mathbf {Q},\mathbf {S}, \{\mathbf {\Lambda }_{n}\}_{n=1}^{4} \right) \\ {}&=\frac{1}{2}||\mathbf {X}_{l-1}-\mathbf {G}^{{{\mathrm{T}}}}\mathbf {H}||_{{{\mathrm{F}}}}^{2}+\frac{\eta }{2}{{\mathrm{tr}}}(\mathbf {\Theta }_{l}\mathbf {X}_{l-1}\mathbf {L}\mathbf {X}_{l-1}^{{{\mathrm{T}}}}\mathbf {\Theta }_{l}^{{{\mathrm{T}}}}) +\mathbf {\Lambda }_{1}^{{{\mathrm{T}}}}(\mathbf {H}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1})\\&+\mathbf {\Lambda }_{2}^{{{\mathrm{T}}}}(\mathbf {G}-\mathbf {\Theta }_{l})+\mathbf {\Lambda }_{3}^{{{\mathrm{T}}}}(\mathbf {Q}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1})+\mathbf {\Lambda }_{4}^{{{\mathrm{T}}}}(\mathbf {S}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1})+\frac{\mu }{2}||\mathbf {H}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1}||_{{{\mathrm{F}}}}^{2}\\&+\frac{\mu }{2}||\mathbf {G}-\mathbf {\Theta }_{l}||_{{{\mathrm{F}}}}^{2}+\frac{\mu }{2}||\mathbf {Q}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1}||_{{{\mathrm{F}}}}^{2}+\frac{\mu }{2}||\mathbf {S}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1}||_{{{\mathrm{F}}}}^{2}+l_{R}^{+}(\mathbf {Q})+l_{R}^{\sim }(\mathbf {S}), \end{aligned} \end{aligned}$$
(13)

where \(\{\mathbf {\Lambda }_{n}\}_{n=1}^{4}\) are Lagrange multipliers and \(\mu \) is the penalty parameter. The two terms \(l_{R}^{+}(\bullet )\) and \(l_{R}^{\sim }(\bullet )\) represent two kinds of projection operators, respectively. That is, \(l_{R}^{+}(\bullet )\) is defined as

$$\begin{aligned} \max (\bullet )= {\left\{ \begin{array}{ll} \bullet , \quad &{} \bullet \succ 0\\ 0, \quad &{} \bullet \preceq 0, \end{array}\right. } \end{aligned}$$
(14)

while \(l_{R}^{\sim }(\bullet _{k})\) is a vector-based operator defined by

$$\begin{aligned} \mathrm {prox}_{f}(\bullet _{k})= {\left\{ \begin{array}{ll} \frac{\bullet _{k}}{||\bullet _{k}||_{2}}, \quad &{} ||\bullet _{k}||_{2} \succ 1\\ \bullet _{k}, \quad &{} ||\bullet _{k}||_{2} \preceq 1, \end{array}\right. } \end{aligned}$$
(15)

where \(\bullet _{k}\) is the kth column of matrix \(\bullet \). Algorithm 2 details the procedures of AutoRULe.
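The two projection operators in Eqs. (14) and (15) translate directly into code; a NumPy sketch:

```python
import numpy as np

def project_nonnegative(M):
    """l_R^+ in Eq. (14): elementwise projection onto the non-negative orthant."""
    return np.maximum(M, 0)

def project_unit_columns(M):
    """l_R^~ in Eq. (15): rescale every column whose l2 norm exceeds 1 back onto the unit ball."""
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    return M / np.maximum(norms, 1.0)
```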

[Algorithm 2: AutoRULe]

The two subproblems in Algorithm 1 can be optimized alternately as follows:

Optimization with respect to \(\mathbf {P}\): This is a typical least-squares regression problem, which can be written as

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {P}}\frac{\alpha }{2}\mathbf {E}(\mathbf {P})+\frac{\gamma }{2}\mathbf {\Psi }(\mathbf {P})=\frac{\alpha }{2}||\mathbf {Y}-\mathbf {P}\mathbf {\Theta }_{m}...\mathbf {\Theta }_{l}...\mathbf {\Theta }_{1}\mathbf {X}||_{{{\mathrm{F}}}}^{2} +\frac{\gamma }{2}||\mathbf {P}||_{{{\mathrm{F}}}}^{2}, \end{aligned} \end{aligned}$$
(16)

which has a closed-form solution

$$\begin{aligned} \begin{aligned} \mathbf {P} \leftarrow (\alpha \mathbf {Y}\mathbf {V}^{{{\mathrm{T}}}})(\alpha \mathbf {V}\mathbf {V}^{{{\mathrm{T}}}}+\gamma \mathbf {I})^{-1}, \end{aligned} \end{aligned}$$
(17)

where \(\mathbf {V}=\mathbf {\Theta }_{m}...\mathbf {\Theta }_{l}...\mathbf {\Theta }_{1}\mathbf {X}\) denotes the data progressively mapped through all learned subspaces.
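The closed form in Eq. (17) is a ridge-type solution; a direct NumPy sketch (a linear solve is used instead of an explicit matrix inverse, which is our choice for numerical stability):

```python
import numpy as np
from functools import reduce

def update_p(Thetas, X, Y, alpha, gamma):
    """Closed-form update of P (Eq. (17)) with V = Theta_m ... Theta_1 X."""
    V = reduce(lambda Z, Theta: Theta @ Z, Thetas, X)
    A = alpha * V @ V.T + gamma * np.eye(V.shape[0])   # alpha V V^T + gamma I
    # P = (alpha Y V^T) A^{-1}, computed by solving A^T P^T = (alpha Y V^T)^T
    return np.linalg.solve(A.T, (alpha * Y @ V.T).T).T
```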

Optimization with respect to \(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}\): The variables \(\{\mathbf {\Theta }_{l}\}_{l=1}^{m}\) can be optimized individually, and hence the optimization problem for each \(\mathbf {\Theta }_{l}\) can be generally formulated as

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {\Theta }_{l}}&\frac{1}{2}\mathbf {\Upsilon }(\mathbf {\Theta }_{l})+\frac{\alpha }{2}\mathbf {E}(\mathbf {\Theta }_{l})+\frac{\beta }{2}\mathbf {\Phi }(\mathbf {\Theta }_{l}) =\frac{1}{2}||\mathbf {X}_{l-1}-\mathbf {\Theta }_{l}^{{{\mathrm{T}}}}\mathbf {\Theta }_{l}\mathbf {X}_{l-1}||_{{{\mathrm{F}}}}^{2}\\&+\frac{\alpha }{2}||\mathbf {Y}-\mathbf {P}\mathbf {\Theta }_{m}...\mathbf {\Theta }_{l}...\mathbf {\Theta }_{1}\mathbf {X}||_{{{\mathrm{F}}}}^{2} +\frac{\beta }{2}{{\mathrm{tr}}}(\mathbf {\Theta }_{l}\mathbf {X}_{l-1}\mathbf {L}\mathbf {X}_{l-1}^{{{\mathrm{T}}}}\mathbf {\Theta }_{l}^{{{\mathrm{T}}}})\\&\mathrm {s.t.} \quad \mathbf {X}_{l}=\mathbf {\Theta }_{l}\mathbf {X}_{l-1}, \quad \mathbf {X}_{l}\succeq 0, \quad ||\mathbf {x}_{lk}||_{2} \preceq 1, \end{aligned} \end{aligned}$$
(18)

which can be solved by following the framework of Algorithm 2. The only difference lies in the optimization subproblem with respect to \(\mathbf {H}\), whose solution is obtained by solving the following problem:

$$\begin{aligned} \begin{aligned} \mathop {\min }_{\mathbf {H}}&\frac{1}{2}||\mathbf {X}_{l-1}-\mathbf {G}^{{{\mathrm{T}}}}\mathbf {H}||_{{{\mathrm{F}}}}^{2}+\frac{\alpha }{2}||\mathbf {Y}-\mathbf {P}_{l}\mathbf {H}||_{{{\mathrm{F}}}}^{2} +\mathbf {\Lambda }_{1}^{{{\mathrm{T}}}}(\mathbf {H}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1})\\&+\frac{\mu }{2}||\mathbf {H}-\mathbf {\Theta }_{l}\mathbf {X}_{l-1}||_{{{\mathrm{F}}}}^{2} \quad \mathrm {s.t.} \quad \mathbf {P}_{l}=\mathbf {P}_{l-1}\mathbf {\Theta }_{l+1}, \quad \mathbf {P}_{0}=\mathbf {P}. \end{aligned} \end{aligned}$$
(19)

The analytical solution of Eq. (19) is given by

$$\begin{aligned} \begin{aligned} \mathbf {H} \leftarrow (\alpha \mathbf {P}_{l}^{{{\mathrm{T}}}}\mathbf {P}_{l}+\mathbf {G}\mathbf {G}^{{{\mathrm{T}}}}+\mu \mathbf {I})^{-1} (\alpha \mathbf {P}_{l}^{{{\mathrm{T}}}}\mathbf {Y}+\mathbf {G}\mathbf {X}_{l-1}+\mu \mathbf {\Theta }_{l}\mathbf {X}_{l-1}-\mathbf {\Lambda }_{1}). \end{aligned} \end{aligned}$$
(20)
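The \(\mathbf {H}\)-update of Eq. (20) is again a linear system; a NumPy sketch (\(\mathbf {P}_{l}\), \(\mathbf {G}\), and \(\mathbf {\Lambda }_{1}\) are the quantities defined above, and \(\mu \) is the ADMM penalty parameter):

```python
import numpy as np

def update_h(X_prev, Theta_l, P_l, G, Y, Lambda1, alpha, mu):
    """Analytical H-update of Eq. (20)."""
    A = alpha * P_l.T @ P_l + G @ G.T + mu * np.eye(G.shape[0])
    B = alpha * P_l.T @ Y + G @ X_prev + mu * Theta_l @ X_prev - Lambda1
    return np.linalg.solve(A, B)
```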

Finally, we repeat these optimization procedures until a stopping criterion is satisfied. Please refer to Algorithms 1 and 2 for more explicit steps.

3 Experiments

In this section, we conduct classification experiments to quantitatively evaluate the performance of the proposed method (J-Play) using three popular and advanced classifiers, namely nearest neighbor (NN) based on the Euclidean distance, kernel support vector machines (KSVM), and canonical correlation forests (CCF), in comparison with previous state-of-the-art methods. Overall accuracy (OA) is reported to quantify the classification performance.

3.1 Data Description

The experiments are performed on two different types of datasets, hyperspectral datasets and face datasets, as both easily suffer from information redundancy and benefit from more representative features. We use the following two hyperspectral datasets and two face datasets:

  1. Indian Pines AVIRIS Image: The first hyperspectral cube was acquired by the AVIRIS sensor, with a size of \(145\times 145\times 220\), and consists of 16 vegetation classes. More specific classes and the arrangement of training and test samples can be found in [11]. The first image of Fig. 3 shows a false color image of the Indian Pines data.

  2. University of Houston Image: The second hyperspectral cube was provided for the 2013 IEEE GRSS Data Fusion Contest and was acquired by the ITRES CASI sensor, with a size of \(349\times 1905\times 144\). Information regarding the classes and the corresponding training and test samples can be found in [13]. A false color image of the study scene is shown in the first image of Fig. 4.

  3. Extended Yale-B Dataset: We only choose a subset of this dataset with the frontal pose under different illuminations for 38 subjects (2414 images in total), which has been widely used to evaluate the performance of subspace learning [2, 32]. These images were aligned and cropped to a size of \(32\times 32\), i.e., a 1024-dimensional vector-based representation. Each individual has 64 near-frontal images under different illuminations.

  4. AR Dataset: Similar to [31], we choose a subset of AR under varying illumination and expressions, which comprises 100 subjects. Each person has 14 images, with the seven images from Session 1 used as the training set and those from Session 2 as the testing samples. The images are resized to \(60\times 43\).

3.2 Experimental Setup

As fixed training and testing samples are given for the hyperspectral datasets, the subspace learning techniques can be directly performed on the training set to learn an optimal subspace, in which the testing set is then classified by NN, KSVM, and CCF. For the face datasets, since there are no standard training and testing sets, ten replications are performed with randomly selected training and testing samples: a random subset with 10 facial images per individual is chosen with labels as the training set, and the rest is used as the testing set. Furthermore, we compare the performance of the proposed method (J-Play) with the baseline (original features without dimensionality reduction) and six popular and advanced methods (PCA, LPP, LDA, LFDA, LSDR, and LSQMID). Depending on the number of coupled projections learned, the proposed method is denoted as J-Play\(_{1}\),...,J-Play\(_{l}\),...,J-Play\(_{m}\), \(\forall l=1,...,m\). To investigate the trend of the OAs, m is uniformly set up to 7 on the four datasets.

Fig. 3.

A false color image, ground truth and classification maps of the different algorithms obtained using CCF on the Indian Pines dataset.

Fig. 4.

A false color image, ground truth and classification maps of the different algorithms obtained using CCF on the Houston dataset. (Color figure online)

Table 1. Quantitative performance comparisons on two hyperspectral datasets. The best results for the different classifiers are shown in red.

3.3 Results of Hyperspectral Data

Initially, we conduct a 10-fold cross-validation for the different algorithms on the training set in order to estimate the optimal parameters, which are selected from \(\{10^{-2}, 10^{-1}, 10^{0}, 10^{1}, 10^{2}\}\). Table 1 lists the classification performance of the different methods with the optimal subspace dimensions obtained by cross-validation, using three different classifiers. Correspondingly, the classification maps are given in Figs. 3 and 4 to intuitively highlight the differences.

Overall, PCA performs similarly to the baseline with the three different classifiers on the two datasets. LPP, due to its sensitivity to noise, yields poor performance on the first dataset, while on the relatively high-quality second dataset it steadily outperforms the baseline and PCA. Among the supervised algorithms, owing to the limited training samples and discriminative power, the classification accuracies of classic LDA are overall lower than those of the previously mentioned methods. With a more powerful discriminative criterion, LFDA obtains more competitive results by locally focusing on discriminative information, which are generally better than those of the baseline, PCA, LPP, and LDA. However, the features learned by LFDA are sensitive to noise and to the number of neighbors, resulting in unstable performance, particularly across the different classifiers. LSDR and LSQMID aim to find a linear projection by maximizing the mutual information between input and output from a statistical point of view. By fully considering the mutual information, they achieve good performance on the two given hyperspectral datasets.

Remarkably, the performance of the proposed method (J-Play) is superior to that of the other methods on the two hyperspectral datasets. This indicates that J-Play tends to learn a better feature representation and is robust against noise. On the other hand, with the increase of m, the performance of J-Play steadily increases, reaching its best with around 4 or 5 layers for the first dataset and 2 or 3 layers for the second one, and then gradually decreases with slight perturbations, since our model is only trained on the training set.

Fig. 5.

Visualization of partial facial features learned by the proposed J-Play on two face datasets.

Table 2. Quantitative performance comparisons on two face datasets. The best results for the different classifiers are shown in red.

3.4 Results of Face Images

As J-Play is proposed as a general subspace learning framework for multi-label classification, we additionally use two popular face datasets to further assess its generalization capability. Similarly, cross-validation on the training set is conducted to estimate the optimal parameter combination on the Extended Yale-B and AR datasets. Considering the high-dimensional vector-based face images, we first apply PCA to roughly reduce the feature redundancy, and its outputs are then fed into the dimensionality reduction methods, following previous work on face recognition (e.g. LDA (Fisherfaces) [20] and LPP (Laplacianfaces) [5]). Table 2 gives the corresponding OAs of the different methods on the two face datasets.

By comparison, the performance of PCA and LPP is steadily superior to that of the baseline, and PCA is even better than LPP. Among the supervised approaches, LDA performs better than the baseline, PCA, LPP, and even LFDA, showing an impressive result. Due to the smaller number of training samples in the face datasets, LSDR and LSQMID cannot effectively estimate the mutual information between the training samples and labels, resulting in performance degradation compared to the hyperspectral data. The proposed method outperforms the other algorithms, which indicates that it can effectively learn an optimal mapping from the original space to the label space, further improving the classification accuracy. Likewise, there is a similar trend for the proposed method with the increase of m: J-Play basically obtains the optimal OAs with around 4 or 5 layers, and more layers lead to performance degradation. We also characterize and visualize each column of the learned projection, as shown in Fig. 5, where high-level, semantically meaningful features, i.e. face features under different poses and illuminations, are learned well, making the faces easier to identify.

4 Conclusions

To effectively find an optimal subspace where samples can be semantically represented and thereby better classified or recognized, we proposed a novel linearized subspace learning framework (J-Play), which learns the feature representation from high-dimensional data in a joint and progressive way. Extensive multi-label classification experiments are conducted on two types of datasets, hyperspectral images and face images, in comparison with previously proposed state-of-the-art methods. The promising results of J-Play demonstrate its superiority and effectiveness. In the future, we will further build a unified framework based on J-Play by extending it to semi-supervised learning, transfer learning, and multi-task learning.