1 Introduction
The ultimate goal of Artificial Intelligence, particularly machine learning, is to empower machines with the abilities of humans. This work focuses on imitating humans’ ability to continuously learn from experience—that is,
lifelong learning. Humans can utilize previously learned skills to accomplish new tasks and retain the acquired knowledge as a basis for future learning without forgetting previous skills. However, current machine learning algorithms perform far less competitively than humans in this respect, which has prompted fast-growing research on continual recognition. Continual learning [
8] aims to learn several tasks sequentially while maintaining good performances on all the tasks learned so far, which is quite challenging. This article focuses on class incremental continual recognition, where each task is classification on several classes, and task ID is unavailable during inference.
Some continual learning methods merge the data from new tasks and existing tasks and then train the model on the combined set, which is quite time consuming in most cases. Others fine-tune the trained model on the new task, which suffers from the fact that knowledge obtained from old tasks is overwritten upon training on a new task, resulting in poor performance on old tasks. This phenomenon is referred to as catastrophic forgetting. Existing works addressing catastrophic forgetting in continual learning can be divided into three categories: (i) parameter isolation, dedicating different model parameters to each task, which is restricted to the task incremental setting; (ii) regularization, adding certain restrictions during model updates to prevent the model from “forgetting” the learned knowledge of old tasks, which, although it can mitigate catastrophic forgetting to some extent, suffers from a performance drop as the number of tasks increases; and (iii) replay, partially memorizing data from old tasks as exemplars and mixing them with the new data for training, which has the best performance in the existing literature but can still inevitably “forget” previously obtained knowledge. Since replay methods achieve state-of-the-art results, we design our own model based on replay.
Most of the preceding models focus on the training stage and design different methods to prevent forgetting. However, forgetting always happens, even for humans. The
Ebbinghaus forgetting curve [
26] shows that learners forget about 90% of what they have learned within the first month. Given that forgetting is inevitable, how to fix the model after forgetting has happened needs to be considered. To the best of our knowledge, we are the first to remedy the machine learning model after forgetting has happened. As Figure
1 shows, one major reason for forgetting in replay methods is that the stored exemplars are not sufficient to help discriminate between the incoming classes and previously learned classes. As no information is provided about future classes, we can only tackle this issue once we encounter the new classes. Inspired by babies, who are born with the ability to learn and recognize continually [
35], we assume that most learned samples are stored in long-term memory. When forgetting or interference happens, these samples are retrieved efficiently to resolve the issue.
Based on the preceding observation and assumption, we propose a continual learning method with adaptive memory update for class incremental continual learning. The proposed method only preserves a small number of exemplars in working memory for each task learned. By conducting
half-and-half validation on the exemplars, our method can adaptively determine which classes are experiencing severe forgetting or interference. By exchanging data points between the current exemplar set and long-term memory in a non-trivial manner, our method can adaptively maintain a group of exemplars that best helps memorize the forgotten classes. Meanwhile, the computational complexity of our model for a single task is not increased. Moreover, our proposed method can serve as a general plugin for any replay-based approach to further improve its performance. We report experimental results comparing our method with several baselines on CIFAR-100 and mini-ImageNet, showing that our method significantly outperforms state-of-the-art methods. Figure
2 shows the process of our proposed continual learning method.
Our contributions are summarized as follows:
•
We propose using adaptive memory update to remedy the machine learning model after forgetting, which, to the best of our knowledge, is the first work with such a mechanism.
•
We propose a novel continual recognition model that adaptively discovers classes that are being severely forgotten or interfered with, and then conducts a memory update on these classes by exchanging data points between the current exemplar set and long-term memory.
•
We conduct extensive experiments on several real-world datasets to validate our proposed method’s superiority over existing baseline approaches.
The rest of the article is organized as follows. We discuss related works in Section
2 and explain the details of our proposed model in Section
3. In Sections
4 and
5, we conduct extensive experiments on various real-world datasets to show the proposed model’s advantages over existing state-of-the-art approaches. We finally conclude the article and point out future research directions in Section
6.
2 Related Work
Continual learning, also known as lifelong learning, is the process of continuously acquiring, modifying, and transferring knowledge and skills. It remains a considerable challenge for machine learning and neural networks. In recent years, much work [
4,
21,
27,
30,
36] has been proposed to address the catastrophic forgetting problem. The existing methodologies for continual learning can be divided into three categories: parameter isolation-based methods [
1,
15,
22,
24], regularization-based methods [
10,
18,
23], and replay-based methods [
5,
7,
14,
31,
39]. Our work is based on replay.
2.1 Parameter Isolation-Based Methods
Parameter isolation-based methods [
22,
24] dedicate different subsets of the model parameters to each task. When a new task arrives, these methods train a new network branch whose parameters are initialized from the previous task (or are sometimes entirely new). After training on this task, the branch is designated for prediction on this task and can be reused in a layer-wise or neuron-wise manner for future tasks. Some methods select a new branch from the original architecture [
24], whereas others increase the model capacity to grow a new branch [
22].
Most of these works require a task identifier to activate the corresponding branch during prediction, which significantly limits the model’s applicability and impairs its performance when the task identity is unavailable. The catastrophic forgetting problem also persists when the model capacity is limited.
2.2 Regularization-Based Methods
Regularization-based methods limit how far the parameters can move from values that were optimal for previous tasks. This is usually implemented via additional terms in the loss function. Some methods discourage the updating of essential parameters for past tasks [
18]. These methods first determine the essential parameters in the current task and then penalize changes to these parameters during future training on new tasks. Other methods, in the training phase for a new task, use the old model’s output probabilities for each image as soft labels and force the updated model’s output to stay close to these soft labels [
23].
To some extent, regularization-based methods are a simple way to mitigate the problem of catastrophic forgetting. However, the constraint is still insufficient to counter the accumulation of errors in old tasks, and the drop in performance is inevitable as the number of classes increases.
2.3 Replay-Based Continual Learning
Replay-based continual learning methods, also known as rehearsal methods, need a memory component to store samples from previous tasks. The stored samples are often kept in their raw format, known as exemplars. These previous-task samples are replayed while learning a new task to alleviate forgetting. The memory component plays a role similar to that of the hippocampus in complementary learning systems theory [
25].
With the stored samples, replay-based methods perform well in continual learning. Nevertheless, deciding which samples to store remains a challenge. Several strategies [
16,
31] have been proposed for selecting the samples.
Recently, some works have tried to use generative models to generate high-quality samples instead of storing them [
34,
38]. However, this introduces the additional challenge of training the generative model itself continually.
Most replay-based methods pick samples randomly from the exemplar set while training, which is not optimal. Aljundi et al. [
2] use a selective replay technique that, at each step, retrieves from the exemplar set the samples that would be most interfered with by the model update. Shim et al. [
33] instead try to retrieve the exemplars that would be most helpful for learning. Inspired by game theory, they use the Shapley value to measure the extent to which each sample contributes to learning.
How to store the exemplar set more efficiently is of equal interest. Riemer et al. [
32] use a discretized variational autoencoder to compress the stored exemplar set and reduce storage costs. Caccia et al. [
6] use a variational autoencoder with adaptive vector quantization to compress the exemplar set.
2.4 Class Incremental Scenario
There are different settings of continual learning [
37]. In this article, we focus on class incremental continual learning, where the model does not have access to the task ID at inference time and therefore must be able to distinguish among all classes from all tasks, which makes it a much more difficult scenario.
3 Proposed Method
In this section, we first formulate the class incremental continual learning problem in Section
3.1, and then we describe our main components in Sections
3.2 through
3.4.
3.1 Problem Formulation
In class incremental continual learning, a model experiences a sequence of classification tasks denoted by \(\mathcal {T} = [(C_1,D_1), (C_2,D_2), \dots , (C_T,D_T)]\), where each task t is represented by a set of classes \(C_t = \lbrace c^1_t,c^2_t,\dots \rbrace\) and training data \(D_t\). T is the total number of tasks and is not known a priori. Each training set \(D_t\) contains a number of input-target pairs \((x_i^t, y_i^t)\) drawn independently and identically from an unknown distribution. Here, \(x_i^t\) represents the i-th input example in task t and \(y_i^t \in C_t\) represents its class label. We use \(N_t\) to represent the set of all classes in all tasks up to and including task t: \(N_t = \mathop {\cup }\nolimits _{i=1}^t C_i\). Usually, different tasks contain different classes, and hence the model needs to recognize more and more classes during the training phase.
We denote our model as \(M_\theta : \mathcal {X} \rightarrow \mathcal {Y}\), composed of a feature extractor \(f: \mathcal {X} \rightarrow \mathbb {R}^d\) and a classifier \(g_\mathcal {W} : \mathbb {R}^d \rightarrow \mathcal {Y}\). Here, f can be any convolutional neural network, depending on the complexity of the dataset. The parameters \(\mathcal {W}\) of g are a set of weight vectors \(\lbrace w_1, w_2, \dots , w_k\rbrace\), where k is the number of classes learned so far. When our model finishes training on one task and is ready to learn a new task, f and the \(w_i\) are temporarily saved as \(\hat{f}\) and \(\hat{w}_i\) for the distillation loss. Meanwhile, the classifier's parameter set \(\mathcal {W}\) is expanded, and several new weight vectors are added corresponding to the classes in the new task.
The model learns the tasks sequentially, and it is worth mentioning that the task to which a sample belongs is not provided in the inference phase.
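To make this formulation concrete, the following is a minimal PyTorch-style sketch of an expandable classifier with l2-normalized (cosine) logits as described in Section 3.5; the class and method names are ours and purely illustrative, and details such as initialization may differ from the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpandableCosineClassifier(nn.Module):
    """Classifier g_W holding one weight vector w_i per class learned so far.

    Logits are produced from l2-normalized features and weights (illustrative sketch).
    """
    def __init__(self, feat_dim: int):
        super().__init__()
        self.feat_dim = feat_dim
        self.weights = nn.ParameterList()  # one d-dimensional vector per class

    def expand(self, num_new_classes: int):
        # Called when a new task arrives: add one weight vector per new class.
        for _ in range(num_new_classes):
            self.weights.append(nn.Parameter(torch.randn(self.feat_dim) * 0.01))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        W = torch.stack(list(self.weights))   # (k, d)
        W = F.normalize(W, dim=1)             # l2-normalize weight vectors
        z = F.normalize(features, dim=1)      # l2-normalize features
        return z @ W.t()                      # (batch, k) cosine logits
```

Before training on a new task, a frozen copy of f and of these weight vectors (e.g., via copy.deepcopy) would provide the old-model outputs used by the distillation loss.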
3.2 Long-Term Memory
When forgetting happens and we want to remedy it, the first thing to determine is what has been forgotten. For humans, there are usually two situations. In one, humans can tell what they have forgotten through their own perception. In the other, they become confused or make mistakes. The latter situation is quite similar to the catastrophic forgetting of a machine learning model. After determining what has been forgotten or interfered with, humans can review it using learning resources such as a library or the Internet, or even their own memory if it is good enough. Since there are no such resources for our machine learning model, we choose to store the learning material, i.e., the training data, in a long-term memory component.
One may argue that most existing works follow the rule that the training data of previously learned tasks is unavailable. There are two main reasons for this rule: one is privacy concerns, and the other is storage constraints. However, Knoblauch et al. [
19] claim that the optimal continual learning model needs perfect memory. Bartol et al. [
3] estimate that the storage capacity per synapse is roughly 4.7 bits of information, and this implies that the total memory capacity of the brain, with its many trillions of synapses, is much larger than the size of our continual learning model.
These findings make us wonder whether this rule is still necessary in today’s world of reliable encryption and anonymization technology and low storage costs. In this article, we relax the rule so that the training data is stored in a long-term memory component and can be accessed on demand.
3.3 Exemplar Management
Our model maintains a collection of exemplars \(\mathcal {E} = \lbrace e_1, e_2, \dots , e_k\rbrace\) during training as working memory. When the model finishes training on a task, a few exemplars of each new class are selected and added to the exemplar set. Data from this collection is later involved in training on future tasks.
Some methods use an exemplar set of a fixed size. Every time new exemplars are added, the number of old-class exemplars must be reduced to meet the fixed size limit. Since the number of tasks is unknown, it is difficult to determine a proper size for the exemplar set at the very beginning. Our model maintains the exemplar set without a fixed memory-size constraint and instead uses a constant number of exemplars for each class. In this way, the size of the exemplar set increases linearly with the number of classes; for instance, with 50 exemplars per class, the 100 classes of CIFAR-100 require storing only 5,000 images. Since the number of exemplars per class is a small constant, this growth is affordable compared to the improvement brought by the exemplar set.
We use herding selection to select exemplars.
Herding selection [
31] is a greedy algorithm for selecting new exemplars for one class. It iteratively selects exemplars from the training data so that the mean feature vector of the selected exemplars stays close to the mean feature vector of the training data, until the target exemplar set size is reached.
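Below is a minimal sketch of herding selection under the assumption that per-sample features have already been extracted with f; tie-breaking and other implementation details may differ from [31].

```python
import numpy as np

def herding_selection(features: np.ndarray, m: int) -> list:
    """Greedy herding: pick m exemplar indices whose running feature mean
    stays as close as possible to the class mean feature."""
    class_mean = features.mean(axis=0)
    selected = []
    running_sum = np.zeros_like(class_mean)
    for k in range(1, m + 1):
        # Distance of the would-be mean (after adding each candidate) to the class mean.
        candidate_means = (running_sum + features) / k            # (n, d)
        dists = np.linalg.norm(class_mean - candidate_means, axis=1)
        dists[selected] = np.inf                                  # never reselect a sample
        idx = int(np.argmin(dists))
        selected.append(idx)
        running_sum += features[idx]
    return selected
```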
3.4 Adaptive Memory Update
To conduct a memory update, there are two fundamental problems to be solved. One is when to update, and the other is how to update.
As we will see later, the drop in model performance varies significantly from class to class, indicating that the probability of being forgotten differs substantially across classes. We therefore let the unit of a memory update be a single class rather than all classes in a task. In the following discussion, we focus on a particular class c.
When to Update. The time for our model to update is when it finds itself unable to distinguish a class from other classes, that is, when it performs poorly on data from this class. However, it is difficult for a model to evaluate its own performance autonomously. We tackle this issue by conducting a half-and-half validation procedure on our exemplar set.
Before training a new task, the data in each class’s exemplar set is randomly divided into two parts. One part is used as training data and participates in model training together with the data from the new task. After the training is completed, the model is tested on the other part, which we call half-and-half validation.
If the model performs poorly on the held-out part for a particular class, it updates the exemplar set for this class. The criterion we adopt for poor performance is whether the recall score is lower than a threshold \(\lambda\); any other reasonable criterion can also be used.
Based on the preceding mechanism, our model could determine whether a class c needs a memory update.
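As an illustration, here is a minimal sketch of the half-and-half split and the subsequent validation check; exemplars are assumed to be stored as a dict mapping class labels to lists of samples, and model.predict is an assumed helper that returns a predicted class label.

```python
import random
from collections import defaultdict

def split_half_and_half(exemplars):
    """Randomly split each class's exemplars into a training half (mixed with the
    new task's data) and a held-out half used for validation after training."""
    train_half, val_half = defaultdict(list), defaultdict(list)
    for cls, samples in exemplars.items():
        shuffled = random.sample(samples, len(samples))
        mid = len(shuffled) // 2
        train_half[cls], val_half[cls] = shuffled[:mid], shuffled[mid:]
    return train_half, val_half

def find_forgotten_classes(model, val_half, lam=0.7):
    """Return the classes whose recall on their held-out exemplars is below lam;
    these are the classes scheduled for a memory update."""
    forgotten = []
    for cls, samples in val_half.items():
        correct = sum(int(model.predict(x) == cls) for x in samples)
        recall = correct / max(len(samples), 1)
        if recall < lam:
            forgotten.append(cls)
    return forgotten
```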
How to Update. We assume that the training data of a task is not discarded after training but is instead stored in a long-term memory component. A simple, brute-force update approach is to bring all the data from class c into the training procedure. This method, however, requires many access operations on long-term memory and greatly increases the training complexity, especially when many classes need an update.
To address this difficulty, we propose a simple and elegant update approach: we replace the exemplar set with the same number of data samples selected uniformly at random from long-term memory. In this way, the long-term memory component only needs to support random access by class. Moreover, the memory update does not increase the training complexity at all. Figure
3 illustrates the process of the adaptive memory update.
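A minimal sketch of the update step itself follows; long_term_memory is assumed to be a per-class store supporting random sampling (here simply a dict of lists), and the function name is ours.

```python
import random

def update_exemplars(exemplars, long_term_memory, forgotten_classes):
    """Replace the exemplar set of each forgotten class with the same number of
    samples drawn uniformly at random from long-term memory, so the per-task
    training complexity stays unchanged."""
    for cls in forgotten_classes:
        # Long-term memory only needs to support random access by class.
        m = min(len(exemplars[cls]), len(long_term_memory[cls]))
        exemplars[cls] = random.sample(long_term_memory[cls], m)
    return exemplars
```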
3.5 Loss Function
Our loss function contains two terms: distillation loss \(L_{distill}\) and classification loss \(L_{clf}\).
Distillation Loss. Distillation loss was originally proposed to transfer knowledge between different neural networks [13]. Here we use it to maintain the output of our model on the old tasks while training the model on a new task. When our model is training on task t, the distillation loss \(L_{distill}\) takes the form of a binary cross-entropy between the current model's predictions and soft targets \(q_{ij}\). Here, \(p_{ij}\) represents the probability that the i-th sample belongs to class j under the current model, and the soft target \(q_{ij}\) is computed from \(h_j(x_i)\), the output of the old model saved before training on task t. We adopt an l2-normalized form of the weight vectors to produce logits, which is useful and practical for addressing the class imbalance issue in continual learning [14].
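As the display equations are not reproduced here, a plausible form consistent with the description above (binary cross-entropy with sigmoid outputs and old-model targets, summed over the previously learned classes \(N_{t-1}\)) is the following; the exact normalization or temperature used in the original may differ:

\[
L_{distill} = -\sum_{i}\sum_{j \in N_{t-1}} \Big[\, q_{ij}\log p_{ij} + (1-q_{ij})\log (1-p_{ij}) \,\Big],
\qquad q_{ij} = \sigma\!\big(h_j(x_i)\big),
\]

where \(\sigma\) denotes the sigmoid function and \(p_{ij}\) is the current model's sigmoid output for class j on sample \(x_i\), computed from the l2-normalized logits.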
Classification Loss. In the traditional multi-class classification task, cross-entropy loss with softmax activation is the most commonly used loss. When considering distillation loss, we find that binary cross-entropy with sigmoid is a better choice than cross-entropy with softmax, because the soft label in distillation loss is more analogous to a multi-label target than a multi-class one. As we can see, the form of distillation loss is binary cross-entropy with sigmoid activation. For consistency, the binary cross-entropy loss is directly performed on the exemplar set.
Here, \(\delta\) is an indicator function. Finally, our loss function combines the classification and distillation terms.
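Again as a plausible reconstruction consistent with the description above (the original display equations are not reproduced here, and a weighting coefficient between the two terms may have been used):

\[
L_{clf} = -\sum_{i}\sum_{j \in N_t} \Big[\, \delta(y_i = j)\log p_{ij} + \big(1-\delta(y_i = j)\big)\log (1-p_{ij}) \,\Big],
\qquad
L = L_{clf} + L_{distill}.
\]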
The overall incremental training process with the proposed method is presented in Algorithm
1.
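Since Algorithm 1 is referenced but not reproduced here, the following Python sketch ties together the helper functions sketched above for one incremental step; model.expand_and_train and model.extract_features are assumed, illustrative interfaces rather than the exact ones used in our implementation.

```python
def train_one_task(model, new_data, exemplars, long_term_memory,
                   lam=0.7, exemplars_per_class=50):
    """One incremental step with adaptive memory update (cf. Algorithm 1).

    new_data: dict mapping each new class label to its list of training samples.
    """
    # 1. Split old-class exemplars into a training half and a held-out half.
    train_half, val_half = split_half_and_half(exemplars)

    # 2. Expand the classifier for the new classes and train on the new data
    #    plus the training half, using L = L_clf + L_distill (Section 3.5).
    model.expand_and_train(new_data, train_half)

    # 3. Half-and-half validation: find classes whose recall falls below lam.
    forgotten = find_forgotten_classes(model, val_half, lam)

    # 4. Memory update: swap the exemplars of forgotten classes with random
    #    samples from long-term memory (Section 3.4).
    exemplars = update_exemplars(exemplars, long_term_memory, forgotten)

    # 5. Store the new task's data in long-term memory and pick its exemplars
    #    via herding selection (Section 3.3).
    for cls, samples in new_data.items():
        long_term_memory[cls] = samples
        feats = model.extract_features(samples)
        keep = herding_selection(feats, exemplars_per_class)
        exemplars[cls] = [samples[i] for i in keep]
    return exemplars
```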
5 Results and Analysis
We run our experiments on 5, 10, and 20 tasks with 20, 10, and 5 classes per task, respectively. Five random class orders are used, and the mean incremental average accuracy is reported.
Figures
4,
5,
6, and
7 show the results on CIFAR-100 and ImageNet-Subset using NME inference and CNN inference, respectively. One can see that our proposed method significantly outperforms the other approaches, and the larger the number of tasks, the larger our margin over the baselines, indicating that our model is better suited to real lifelong learning.
General Plugin. Table
3 shows the effect of using the adaptive memory update as a plugin with different models. Our method benefits from this update and obtains the largest improvement when using NME for inference. However, iCaRL-NME’s performance drops slightly with this plugin. This is because iCaRL-NME’s distillation loss uses the model’s previous outputs as targets for the exemplar set, so errors on the old classes steadily accumulate and the plugin fails to help. The other methods are improved by the plugin to varying degrees.
Memory Size. Table
2 shows the effect of the different memory sizes.
Fixed 2000 means using an exemplar set with a fixed capacity of 2,000 samples; in this setting, the more classes are stored, the fewer exemplars are reserved for each old class. All methods demonstrate a significant increase in incremental average accuracy with a larger exemplar set. Our method consistently performs best under different memory sizes.
Criterion for Update. Figure
8 shows that the drop in model performance varies significantly from class to class, indicating that the probability of being forgotten differs across classes. In our proposed model, the criterion by which a class needs an update is that the model’s recall on that class is below 0.7. Figure
9 (left) shows how this threshold influences performance. We conduct experiments on CIFAR-100 with 10 tasks and 50 exemplars per class. The slope of this curve increases, indicating that as the number of updates grows (a larger threshold tends to increase the probability that a class needs an update), the model’s gain from updating increases. If our access to long-term memory is fast enough, we can update all the classes learned so far to achieve even better performance.
Access Overheads. However, there is always an overhead for accessing the long-term memory. Sometimes this becomes a non-negligible constraint on our method. Figure
9 (right) shows the performance when we limit the number of accesses. The access limit specifies how many classes we can update after one task. As the access limit increases, performance first rises rapidly and then slowly, which shows that updating just a few classes is enough for a substantial improvement.
Update Strategy. In this work, we use a simple and elegant memory update strategy: we replace exemplars with samples randomly selected from long-term memory. In addition to this method, we tried other strategies based on different sample selection methods. The results are shown in Table
1. Random is our proposed strategy. The Nearest strategy selects samples that are close to the misclassified exemplars in feature space. The Herding strategy uses herding selection to select samples. The Testing strategy selects those samples from the external database that are misclassified. To select the most representative samples, the Kmeans strategy selects the cluster centers obtained from the standard K-means clustering algorithm. The GNG strategy selects samples via the growing neural gas algorithm [
11], which helps maximize coverage of the feature space.
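For concreteness, here are minimal sketches of two of the alternative strategies (Nearest and Kmeans), assuming features have already been extracted for the candidate samples in long-term memory; the exact distance metric and implementations used in our experiments may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def nearest_strategy(memory_feats, misclassified_feats, m):
    """Nearest: pick the m long-term-memory samples closest in feature space
    to the misclassified exemplars."""
    # Pairwise Euclidean distances, shape (n_memory, n_misclassified).
    dists = np.linalg.norm(
        memory_feats[:, None, :] - misclassified_feats[None, :, :], axis=-1)
    return np.argsort(dists.min(axis=1))[:m].tolist()

def kmeans_strategy(memory_feats, m):
    """Kmeans: cluster the memory features into m clusters and pick the sample
    closest to each cluster center as its representative."""
    centers = KMeans(n_clusters=m, n_init=10).fit(memory_feats).cluster_centers_
    dists = np.linalg.norm(
        memory_feats[:, None, :] - centers[None, :, :], axis=-1)  # (n_memory, m)
    return dists.argmin(axis=0).tolist()
```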
Exemplars play a crucial role in replay-based methods. The model needs the exemplars to find the features that best distinguish the classes learned before. Furthermore, when the model uses NME for inference, the exemplars need to approximate the class means in feature space. That is why Herding performs best, followed closely by Random; Nearest performs worst in both settings; and Testing performs poorly with NME but reasonably well with CNN. Note that all strategies other than Random require additional computation to obtain sample features, classification results, or cluster centers, yet ultimately achieve comparable or worse results. We therefore choose Random as our update strategy for efficiency.