In this section, we describe how the base classifier
\(\Phi\) and our meta-model
\(\Psi\) are trained jointly. Because the meta-model takes the visual feature as input, we split the main model
\(\Phi (\cdot ; w)\) into two parts: the backbone
\(\Phi _b(\cdot ; w_b)\) and the category predictor
\(\Phi _c(\cdot ; w_c)\). The former takes an image
\(x\) as input and outputs a feature vector
\(f\); the latter takes
\(f\) as input and outputs a probability score vector
\(z\). This split makes it possible to manipulate the feature
\(f\) directly with our meta-model
\(\Psi\). The meta-model takes two inputs,
\(\Psi (f,\mathcal {L})\), and returns two weight vectors,
\(W_f\) and
\(s_k\). Our algorithm consists of four main phases, shown in Figure
2 and summarized in Algorithm 1. We describe our method in detail, starting from the
\(t\)-th iteration and stepping through each phase until we reach the
\((t+1)\)-th. Unlike the meta-learning optimization strategy described in Section
3.2, we need an additional initial phase, called Loss Pre-Calculation (Figure
2(a)). The loss value
\(\mathcal {L}^{pre}\) for the training batch
\(X^{train}\) must be computed at the beginning. This loss must depend on the original feature
\(f^{train}\), not on the weighted one
\(f^{att}\). In the second step, Virtual-Train (Figure
2(b)),
\(\Phi _b^t\) and
\(\Phi _c^t\) are the virtual clones of the backbone
\(\Phi _b(\cdot ; w_b)\) and the category predictor
\(\Phi _c(\cdot ; w_c)\) at the beginning of the
\(t\)-th iteration. We obtain the features
\(f^{train}\) by passing the batch
\(X^{train}\) through
\(\Phi _b^t\). Then the pre-calculated loss
\(\mathcal {L}^{pre}\) and the corresponding feature
\(f^{train}\) are given to
\(\Psi ^t\) (the meta-model at time
\(t\)) to obtain the two vectors of weights
\(W_f\) and
\(s_k\). The feature
\(f^{train}\) is multiplied element-wise by
\(W_f\) to obtain a new, attention-weighted feature vector
\(f^{att}\), as in Section
3.3. The modified feature is passed to the predictor
\(\Phi _c^t\) to obtain the scores
\(z^{train}\). We then compute
\(\mathcal {L}^{train}\) with the equalization loss described in Section
3.4, using the vector
\(s_k\) in Equation (
8). Then the parameters of
\(\Phi _b^t\) and
\(\Phi _c^t\) are virtually updated to minimize
\(\mathcal {L}^{train}\), excluding those of
\(\Psi ^t\). For the third step, Meta-Train (Figure
2(c)), we need a clean and balanced meta-dataset to train the meta-model
\(\Psi\). We pass a meta-batch
\(X^{meta}\) through the virtually updated
\(\Phi _b^{t+1}\) and
\(\Phi _c^{t+1}\) to obtain a validation loss
\(\mathcal {L}^{meta}\). In this step, the feature is not modified, and the loss is the standard softmax cross-entropy loss. Then only
\(\Psi ^t\) is updated, minimizing
\(\mathcal {L}^{meta}\). In this way, the meta-model is optimized to help the main model minimize its error on clean and balanced data; the optimization also takes the preceding Virtual-Train into account. In the last phase, Actual-Train (Figure
2(d)), the original
\(\Phi _b^t\) and
\(\Phi _c^t\) are optimized taking into account the updated meta-model
\(\Psi ^{t+1}\). Our meta-model is used only during the training of the main network
\(\Phi\), when external help is needed to cope with noisy or imbalanced labels. It is discarded at test time, when only the main network is retained as the final model.
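The four phases above can be sketched on a toy problem. The following is a minimal illustration under simplifying assumptions, not our implementation: the main model is a single weight applied to a scalar feature, the meta-model is reduced to one feature weight playing the role of \(W_f\) (the \(s_k\) branch and the equalization loss are omitted), and finite differences stand in for back-propagation.

```python
# Toy sketch of one training iteration with the four phases:
# (a) Loss Pre-Calculation, (b) Virtual-Train, (c) Meta-Train, (d) Actual-Train.
# All names are illustrative, not from the paper's code.

LR, META_LR, EPS = 0.1, 0.05, 1e-4

def train_loss(w, m, batch):
    # Virtual-Train loss: the feature f = x is reweighted element-wise by m
    # (the attended feature f_att) before the predictor w is applied.
    return sum((w * (m * x) - y) ** 2 for x, y in batch) / len(batch)

def pre_loss(w, batch):
    # (a) Loss Pre-Calculation: loss on the *original* feature (m = 1),
    # i.e. on f_train rather than f_att.
    return train_loss(w, 1.0, batch)

def meta_loss(w, batch):
    # Meta-Train loss on the clean, balanced meta-batch: no feature
    # reweighting here, as described above.
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def virtual_step(w, m, batch):
    # (b) One SGD step on a clone of the model (finite-difference gradient).
    g = (train_loss(w + EPS, m, batch) - train_loss(w - EPS, m, batch)) / (2 * EPS)
    return w - LR * g

def iteration(w, m, train_batch, meta_batch):
    _ = pre_loss(w, train_batch)  # (a) would be fed to the meta-model
    # (c) Meta-Train: gradient of the meta-loss at the virtually updated
    # weights, taken with respect to the meta-parameter m.
    gm = (meta_loss(virtual_step(w, m + EPS, train_batch), meta_batch)
          - meta_loss(virtual_step(w, m - EPS, train_batch), meta_batch)) / (2 * EPS)
    m = m - META_LR * gm
    # (d) Actual-Train: real update of the original model with the
    # just-updated meta-model.
    w = virtual_step(w, m, train_batch)
    return w, m

# Usage: the true relation is y = 2x; both w and w*m should approach 2.
w, m = 0.0, 1.0
train_batch = [(1.0, 2.0), (2.0, 4.0)]
meta_batch = [(1.0, 2.0), (3.0, 6.0)]
for _ in range(50):
    w, m = iteration(w, m, train_batch, meta_batch)
```

Note how the meta-update in (c) is evaluated *after* the virtual step, so the meta-parameter is judged by how much it improves the virtually updated model on the meta-batch, mirroring the dependence of Meta-Train on the preceding Virtual-Train.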
Excluding the Loss Pre-Calculation phase, the Virtual-Train, Meta-Train, and Actual-Train steps each require a backward pass in addition to a forward one. The Meta-Train backward step, in which the meta-gradient is computed from the loss on the meta-set, accounts for more than
\(80\%\) of the total computation [
53]. In this step, to update the meta-model parameters, the meta-gradient is back-propagated through every layer of the main network. Since normal training does not involve this step, the additional cost quickly becomes significant as the number of layers in deep networks grows. In addition, the amount of GPU memory required is roughly double that of traditional training: the gradients obtained in the Virtual-Train step must be kept in memory so that the meta-gradient can be computed during the Meta-Train step. These computation and memory problems are typical of many meta-learning approaches. However, methods such as [
53], which compute the meta-gradient with a faster layer-wise approximation, provide strategies to overcome them. These overhead costs arise only during training. In the test phase, the meta-model is not used, and a single forward pass through the classifier yields the prediction for the input.
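To make the cost of the Meta-Train backward step concrete, the following scalar sketch (our illustration with toy losses and hypothetical names, not code from [53]) writes out the chain rule behind the meta-gradient: it involves a mixed second derivative of the Virtual-Train loss, which is why the Virtual-Train gradients must be retained in memory until Meta-Train.

```python
# Scalar illustration of why Meta-Train is expensive: the gradient of the
# meta-loss w.r.t. a meta-parameter m must flow backward *through* the
# virtual update of the main weight w, requiring a second-order (mixed)
# derivative of the Virtual-Train loss. Toy losses; all names illustrative.

LR = 0.1

def meta_gradient(w, m):
    # Virtual-Train: L_train(w, m) = (w*m - 2)^2, one virtual SGD step on w.
    dtrain_dw = 2 * (w * m - 2) * m     # first-order gradient (kept in memory)
    w_virtual = w - LR * dtrain_dw      # virtually updated weight
    # Meta-Train: L_meta(w') = (w' - 2)^2, chain rule through the virtual step:
    # dL_meta/dm = dL_meta/dw' * dw'/dm, with dw'/dm = -LR * d2L_train/(dw dm).
    d2train_dwdm = 2 * (2 * w * m - 2)  # mixed second derivative
    dmeta_dwv = 2 * (w_virtual - 2)
    return dmeta_dwv * (-LR) * d2train_dwdm

def meta_loss_after_virtual_step(w, m):
    # The same quantity computed forward, for a finite-difference check.
    dtrain_dw = 2 * (w * m - 2) * m
    return (w - LR * dtrain_dw - 2) ** 2

# Finite-difference check of the analytic meta-gradient.
eps = 1e-5
w0, m0 = 0.5, 1.2
fd = (meta_loss_after_virtual_step(w0, m0 + eps)
      - meta_loss_after_virtual_step(w0, m0 - eps)) / (2 * eps)
```

In a deep network this mixed second derivative is not a scalar: it couples every layer of the main network with every meta-parameter, which is what makes the Meta-Train backward pass dominate the computation and roughly double the memory footprint.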