1 Introduction
Convolutional neural networks (CNNs) have achieved remarkable results on several computer vision tasks, such as image classification and object detection in scenes. Such success can be explained by how the convolutional neuron works: it highlights features according to the spatial properties of the image. The initial layers capture simpler features, such as edges, while deeper layers can detect more complex traits, like entire objects or human faces. Nowadays, it is hard to find a computer vision application that does not rely on CNNs, from biometrics to disease detection.
One key aspect concerning CNNs is how to stack the convolutional kernels to accomplish the best result on a given task. It is common to use the same basic architecture on several different tasks, changing only the output. For instance, the basic block used for EfficientNet [
54], a neural network used for image classification, is also used on the EfficientDet [
55] architecture to tackle the object detection task.
The architecture may be the central part of a computer vision model; however, other relevant choices come before the training step. For instance, the optimization technique can influence the final result, and even the kernels’ initial random values can affect how well the model performs in the end. This study focuses on one of these aspects: the regularization algorithms. Depending on the chosen regularization strategy, some architectures can achieve a relevant gain in the final results. One important property of a good regularizer is that it does not affect the final model’s inference cost: whether or not a regularizer is used, the computational cost for inference remains the same. However, in some cases, it adds a small computational overhead or extra pre-training epochs during the training phase. Either way, the gain in the results usually compensates for this cost.
1.1 How Regularization Works
CNNs are usually used for computer vision tasks, such as image classification and object detection, to create models as powerful as human vision. If the amount of information available is considered, then it becomes clear that the training task requires more data variability than is usually possible. Considering a healthy human with a regular brain and eyes, we acquire new visual information for around 16 hours per day on average, disregarding the time we sleep. Even huge datasets such as ImageNet contain a minimal number of images compared to the quantity of data a human brain receives through the eyes. This scarcity of new data may lead to a situation known as overfitting, where the model learns to represent the training data well but does not perform well on new information, i.e., the test data. This situation usually happens when the model has been trained so exhaustively on the available training data that it cannot generalize to new information.
As an artificial neural network, the training step of a CNN can be described as an optimization problem, where the objective is to find the weight values that, given an input and a loss function, transform the information into the desired output, such as a label, with the lowest possible error. One way to achieve this goal is to minimize the following function:

\[\min_{W, Y} \; \|X - WY^T\|_F^2,\]

where \(\|\cdot\|_F^2\) is the Frobenius norm, \(X\in \mathbb{R}^{m \times n}\) defines the input data, and \(W\in \mathbb{R}^{m \times d}\) and \(Y\in \mathbb{R}^{n \times d}\) denote the weight matrix and the target labels, respectively. According to Reference [3], the Frobenius norm imposes similarity between \(X\) and \(WY^T\). This interpretation has one main advantage: the formulation enables optimization through matrix factorization, producing a structured factorization of \(X\). However, a global minimum can only be reached if either \(W\) or \(Y^T\) is kept fixed, because optimizing both matrices together turns the original equation into a non-convex formulation. This problem can be solved if the matrix factorization is changed into a matrix approximation, as follows:

\[\min_{A} \; \|X - A\|_F^2,\]

where the target is to estimate the matrix \(A\), which results in a convex optimization problem, meaning it has a global minimum that can be found via gradient descent algorithms. When regularization is used, this equation becomes:

\[\min_{A} \; \|X - A\|_F^2 + \lambda \, \Omega(A),\]

where \(\Omega(\cdot)\) describes the regularization function computed on \(A\), and \(\lambda\) is the scalar factor that controls how much influence the regularization function exerts on the objective function.
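To make the formulation above concrete, the following NumPy sketch (ours, for illustration only) minimizes the regularized approximation objective by plain gradient descent, using the hypothetical choice \(\Omega(A) = \|A\|_F^2\); the data matrix, step size, and \(\lambda\) value are arbitrary.

```python
import numpy as np

# Toy data: X plays the role of the input matrix in the objective
# min_A ||X - A||_F^2 + lambda * Omega(A), with Omega(A) = ||A||_F^2 here.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 32))

lam = 0.1          # regularization weight (lambda in the text)
lr = 0.05          # gradient-descent step size
A = np.zeros_like(X)

for step in range(500):
    # Gradient of ||X - A||_F^2 is 2(A - X); gradient of lam*||A||_F^2 is 2*lam*A.
    grad = 2.0 * (A - X) + 2.0 * lam * A
    A -= lr * grad

# For this convex problem the closed-form minimizer is X / (1 + lam);
# gradient descent converges to it.
print(np.allclose(A, X / (1.0 + lam), atol=1e-4))
```

The regularized solution shrinks every entry of \(X\), illustrating how \(\lambda\) trades reconstruction fidelity for the penalty imposed by \(\Omega\).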
One key aspect of regularization methods, regardless of the training stage at which they operate, is to prevent the model from overfitting the training data. They operate by increasing the variability of the data at different stages of a CNN. When working with images, the most straightforward approach is random image transformation, such as rotation and flipping. Several deep learning frameworks, such as Keras and TensorFlow, provide implementations of this kind of regularization, making it easy to apply and improving the results. Although this type of regularization works well, some points should be taken into consideration. For example, some transformations may distort the image into another existing class of the classification problem. The most straightforward example is baseline image classification on the MNIST dataset: if the rotation is too severe, then an input “6” may be transformed into a “9,” leading the model to learn wrong information.
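As an illustration of this input-level regularization, the sketch below assembles a typical augmentation pipeline with torchvision (one of several frameworks offering such transforms, alongside Keras and TensorFlow). The specific transforms and parameter values are illustrative and assume a CIFAR-like 32×32 natural-image input; the rotation range is deliberately small, in line with the “6” versus “9” caveat above.

```python
import torchvision.transforms as T

# A conservative augmentation pipeline: small rotations avoid distorting a class
# into another one, while flips and pad-and-crop add variability.
train_transforms = T.Compose([
    T.RandomRotation(degrees=10),      # limited rotation range
    T.RandomHorizontalFlip(p=0.5),     # reasonable for natural images (not digits)
    T.RandomCrop(32, padding=4),       # pad-and-crop, common on CIFAR-style data
    T.ToTensor(),
])

# Usage (hypothetical): pass `transform=train_transforms` when building a dataset,
# e.g., torchvision.datasets.CIFAR10(root="data", train=True, transform=train_transforms)
```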
1.2 Regularization vs. Normalization
A general problem in machine learning is to tune the parameters of a given model so it performs well on the training data and, eventually, on new information, i.e., the test set. The collection of algorithms that aim to reduce the error on data that does not belong to the training set is called regularization techniques.
One main difference between normalization and regularization techniques is that regularization is not performed after the training period, while normalization is kept in the model. For example, the original codes of Cutout [7] and MaxDropout [44] show they do not execute anything during inference, whereas BatchNormalization [25] executes its algorithm when inferring over the test set.
1.3 Scope of This Work
This study focuses on the most recent regularization techniques for CNNs. Other studies [
36,
46] focus on older and more general regularization methods. Here, we consider three main points:
•
Recently developed: Besides Dropout [
49], no other surveyed method is older than four years, making this study up-to-date;
•
Code availability: All algorithms covered in this study are available in some way, usually on GitHub. We consider this an essential point, because it avoids works with possibly inaccurate results and allows reproducibility when necessary;
•
Results: All techniques here were able to improve the results of the original models significantly.
In this work, the regularization algorithms are divided into three main categories, each one covered in a given section: The first one is called “data augmentation,” and it describes the techniques that change the input of a given CNN. The second category is called “internal changes,” and it describes the set of algorithms that change values of a neural network internally, such as kernel values or weights. The third category is called “label,” in which techniques perform their changes on the desired output. Table
1 gives a list of all methods discussed in this work.
Although we divided the methods into three different strategies, Table
1 highlights that some algorithms work on two different levels. For instance, CutMix and CutBlur work on both input and label levels. The majority of the methods work on input or internal structures, which shows a lack of research on label regularization methods.
1.4 Comparison with Other Works
In a quick search, it is possible to find a diversity of works using Convolutional Neural Networks, such as image classification [
15,
16,
24,
54], object detection [
11,
40], and image reconstruction [
71,
72,
73]. However, the number of works on regularization is very low compared to works on these other problems. As far as we are aware, there are only two recent surveys about regularization for deep neural networks.
The first one [
36] is an extensive analysis of regularization methods and their results. Although it is an interesting work, it focuses considerably on older methods, such as adding noise to the input, DropConnect [
59], and Bagging [
1]. Those methods are still broadly used and have their importance; however, they are not exactly new.
Another relevant work found was a survey focused only on dropout-based approaches [
28]. Dropout [
49] is undoubtedly an important regularization method for different types of neural networks, and it has influenced several new approaches over the years, besides being used in several different architectures.
In this work, we cover very recent developments in strategies for improving the results of Convolutional Neural Networks, presenting works published as recently as 2021 [
33,
38]. The following subsections present more insights and statistical information about the works surveyed in the manuscript.
1.5 Where Do Regularizers Work Primarily?
Even though most of the works are applied to the input, there are many studies dedicated to internal structures and the label layer. Figure
1 depicts the proportion of the scientific works presented in this survey.
Around \(44\%\) of the works rely on changes to the input, better known as data augmentation strategies. The ease of changing parameters and structures at a CNN’s input may explain such a large number of works. Image processing- and computer vision-driven applications still play a significant role in deep learning. The second most common regularization approaches are the ones that perform changes in the internal structures. Dropout [
49] contributed considerably to advances in this research area. Several works [
33,
38,
44] are mainly based on Dropout, while some of them [
9,
64] are new approaches.
1.6 Lack of Label Regularizers
We want to highlight the importance of more research on regularizers that work at a neural network’s label level. Although around 22% of the works make changes to the label as a regularization strategy, we found only two relevant works in the area [
53,
63]. Some hypotheses may be raised here.
The first one is that changing the label level is less intuitive than changing the input or the intermediate levels of a neural network. Performing changes at those two levels is more natural, since it is visually easier to understand what is going on during training and inference. However, it is harder to explain what happens when label changes are performed. Even though the original work [
53] argues that it prevents the overconfidence problem, it fails to explain why such a situation is avoided.
Another possible explanation is the lack of mathematical grounding for most approaches. Fortunately, some techniques such as Dropout [
3] and Mixup [
2] present interesting insights about their inner mechanisms. An algebraic proof that label smoothing works well may be an essential step toward the development of new strategies concerning regularization at this last level.
Finally, it is always good to remember that one of the most critical steps in developing a machine learning area is creating reliably labeled datasets. Although we focused on regularization strategies, it is worth remembering that, eventually, a breakthrough in the way we work with labels may lead to more powerful systems. Therefore, we emphasize that more works related to label-level regularization are worth pursuing.
2 Convolutional Neural Networks
Neural networks have been used since the 1950s, when the first neuron emulation, called the Perceptron [41], was developed. However, it can only address linearly separable feature spaces. In the 1980s, the development of the backpropagation algorithm [42], which updates the weights of a structure composed of several Perceptrons arranged in more than one layer, the Multilayer Perceptron (MLP), made it possible to solve nonlinear problems as well. Even with these advances, MLPs still lacked relevant results on unstructured data problems, such as images.
In the late 1990s, a new neuron structure emerged based on the 2D convolution process, the so-called Convolutional Neural Network [29]. The 2D convolution process can extract different features from an image, depending on the convolutional kernel’s size and values. What makes a CNN so valuable for image processing is the possibility of stacking convolution operations to extract different features, while training can still be accomplished using the well-known backpropagation algorithm. Figure
2 illustrates a standard structure of a Convolutional Neural Network.
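A minimal PyTorch sketch of the kind of stacked convolutional structure described above follows; the layer sizes and input resolution are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

# Stacked 2D convolutions: early layers tend to capture simple features
# (edges), while deeper layers capture more complex traits, as discussed above.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),   # an MLP "on top" performing the classification
)

x = torch.randn(1, 3, 32, 32)    # a dummy 32x32 RGB image
print(model(x).shape)            # torch.Size([1, 10])
```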
Despite being so powerful, CNNs still need lots of data to achieve relevant results, thus requiring considerable computational power. In the mid-2000s, the use of GPUs greatly accelerated the training process, making it possible to solve image processing problems in a feasible time. In the early 2010s, the first relevant result emerged: the AlexNet structure [
27] achieved first place in the ImageNet classification challenge, beating the runner-up by more than 10%. It is an eight-layer CNN with an MLP on top to perform the classification. Since then, other CNN structures have appeared, each one introducing new features.
The Visual Geometry Group developed the VGG architecture [
47], which demonstrated, for the first time, that stacking convolutional layers with smaller kernels performs better than using shallower layers with bigger kernels, even when covering the same receptive field. This architecture achieved top results in the ImageNet challenge in 2014. Another architecture family with relevant results is the Inception [
51,
52,
53], which was developed by Google by parallelizing kernel operations in the same layer and then fusing them before the next layer.
Around the same time the first Inception architecture showed up, Microsoft presented the Residual Network, better known as ResNet [16]. It works by adding the output of earlier layers with the same dimensions before the pooling operation. It looks like a simple operation at first, but it was later shown that this residual connection helps the backpropagation algorithm better handle the well-known vanishing/exploding gradient problem.
Neural Architecture Search (NAS) [
77] introduced a new way to find better CNN architectures. Using an agent trained with the Q-Learning technique [35], it searches for the CNN that achieves the best result under a given set of rules. The drawback of this technique is that it takes a considerable amount of time to discover the best neural network architecture. However, recent studies [
34] showed how to improve the search algorithm, making it faster to discover new architectures.
Later, in 2018, Google showed that NAS could be improved when the search rules are better designed, such as the computational limit, the input size, and other parameters, and when other building blocks, such as Squeeze-and-Excitation [24], are incorporated, ending up in the EfficientNet family [54]. The original work presented eight different architectures (called B0-B7), which use the same quantity of floating-point operations (FLOPs) as other architectures while achieving better results. In the same study, the EfficientNet architectures delivered state-of-the-art results on five different datasets.
All works discussed until now operate on the image classification problem. However, CNNs can be used in several other tasks. One interesting problem is object detection in natural scenes. The R-CNN [
11], for instance, works in two stages: the first finds regions of interest in the image, and the second classifies each region into the desired objects. The
You Only Look Once (YOLO) [
40] goes one step further and performs the localization and classification steps in the same stage.
Another task well handled by CNNs concerns image reconstruction. In this case, most such networks are
Fully Convolutional Networks (FCN), which means that every single layer on the neural network is a convolutional layer. One relevant work in this area is the Residual Dense Network, which has a version for super-resolution [
72] and image denoising [
73] purposes. Another significant development is the DnCNN [
71], which not only tackles image denoising, JPEG deblocking, and super-resolution but also has a version that can solve the three problems without any information about the input image, performing a blind reconstruction.
The
Generative Adversarial Network (GAN) was first developed using MLP [
12]; however, GANs are now used mainly with convolutional layers to solve diverse problems. One problem tackled by GANs is style transfer, in which the StackGAN [70] shows very good results, being able to change the style completely without losing relevant information. Another work with good results is the ESRGAN [
60], which deals with the super-resolution of images. The neural network shows outstanding results by training a
Residual-in-Residual Dense Network (RRDN) using the GAN approach.
5 Label Regularization
Revisiting the information in Table 1, other methods use label smoothing as part of their regularization strategy. For instance, Mixup [69] averages the values of the labels according to the interpolation between two different images. The same rule is applied in the Manifold Mixup technique [58]; however, the data interpolation is computed among the layers, and the same calculation is used for resetting the label values.
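A minimal sketch of the input/label interpolation that Mixup performs (the Manifold Mixup variant applies the same rule to intermediate activations); the Beta parameter and the dummy data are illustrative only.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Interpolate two inputs and their one-hot labels with the same weight."""
    lam = np.random.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2   # label values are averaged accordingly
    return x, y

# Usage with dummy data: two 32x32 RGB images and their one-hot labels.
xa, xb = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
ya, yb = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix = mixup(xa, ya, xb, yb)
```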
Another regularizer that uses label transformation is CutBlur [66]. In this case, the transformation is designed so that, during training, the label could be swapped with the input, making the input the label, and the model would still converge as expected. The reason is that the cut size of the low-resolution and high-resolution patches is not defined beforehand. It means the input can be a low-resolution image with a crop from the high-resolution image, and the label would be the high-resolution image with the crop from its low-resolution counterpart. Therefore, swapping the label and the input still makes sense.
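The sketch below illustrates this patch swap on a (previously upsampled) low-resolution image and its high-resolution counterpart; the array layout (HWC), the crop-size range, and the function name are assumptions for illustration, not the original implementation.

```python
import numpy as np

def cutblur(lr_up, hr, max_ratio=0.5):
    """Paste a random HR patch into the upsampled LR input (HWC arrays).

    The HR image remains the target; swapping the roles of lr_up and hr
    yields the inverted input/label pair discussed in the text.
    """
    h, w, _ = hr.shape
    ch = int(h * np.random.uniform(0.1, max_ratio))
    cw = int(w * np.random.uniform(0.1, max_ratio))
    y, x = np.random.randint(0, h - ch), np.random.randint(0, w - cw)
    mixed = lr_up.copy()
    mixed[y:y + ch, x:x + cw] = hr[y:y + ch, x:x + cw]
    return mixed
```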
Other methods can also have their results improved by using some rationale borrowed from label smoothing. For instance, Cutout [
7] removes parts of the input, so it makes sense to “remove” part of the label according to the crop size as well. Suppose the crop covers 25% of the image; then the active class could be reduced from 1 to 0.75. The same strategy can be applied to RandomErasing [
75]. Methods that drop neurons during training, such as Dropout [
49], could, for example, reduce the value of the hot label in proportion to the total number of neurons deactivated during training.
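The following sketch only illustrates the hypothetical suggestion above for a Cutout-style crop; it is not an established, published method, and the function name is ours.

```python
import numpy as np

def soften_label_by_crop(one_hot, crop_fraction):
    """Hypothetical: lower the active class by the fraction of the image removed.

    A crop covering 25% of the image turns the active "1" into 0.75, following
    the suggestion in the text.
    """
    softened = one_hot.astype(float)
    softened[softened == 1.0] = 1.0 - crop_fraction
    return softened

print(soften_label_by_crop(np.array([0, 1, 0]), 0.25))  # [0.   0.75 0.  ]
```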
5.1 Label Smoothing
It is common in general classification tasks to use a one-hot vector to encode the labels. Dating back to 2015 [52], label smoothing proposes a regularization technique at the label encoding level by changing the value at each position of the one-hot representation.
Label smoothing helps prevent two main problems. The first is the well-known overfitting, i.e., the situation where the model learns the training set but cannot generalize to the test set. The second, and less obvious, is overconfidence. According to the authors [52], by applying the smoothing factor to the encoded label, the softmax function produces values closer to the smoothed encoded vector, limiting the value used in the backpropagation algorithm and producing more realistic confidence values for each class.
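A minimal sketch of the standard label smoothing rule: each one-hot target is mixed with a uniform distribution over the \(K\) classes, controlled by the smoothing factor \(\epsilon\); the example values are illustrative.

```python
import numpy as np

def smooth_labels(one_hot, epsilon=0.1):
    """y_smooth = (1 - eps) * y_one_hot + eps / K, with K classes."""
    k = one_hot.shape[-1]
    return (1.0 - epsilon) * one_hot + epsilon / k

print(smooth_labels(np.array([0.0, 0.0, 1.0, 0.0])))  # [0.025 0.025 0.925 0.025]
```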
5.2 TSLA
One difficulty of using label smoothing is finding out which value of \(\epsilon\) (i.e., the smoothing factor) is ideal, either in general or for a specific dataset. The original work suggests that \(\epsilon = 0.1\) is a good default; however, the Two-Stage Label Smoothing (TSLA) [63] suggests that, in general, gradient descent combined with the label smoothing technique can only improve the results up to a certain point of training; after that, it is better to set the values back to 0 and 1 for the active class. For instance, when training ResNet-18 on the CIFAR-100 dataset for 200 epochs, results suggest the best performance is achieved when label smoothing is used until epoch 160.
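A sketch of the two-stage idea described above: smoothed labels up to a switch epoch, plain one-hot labels afterwards. The switch epoch (160 of 200, as in the ResNet-18/CIFAR-100 example) and the function name are illustrative; the original method's exact schedule may differ.

```python
import numpy as np

def tsla_target(one_hot, epoch, switch_epoch=160, epsilon=0.1):
    """Two-stage schedule: label smoothing first, hard one-hot labels afterwards."""
    if epoch < switch_epoch:
        k = one_hot.shape[-1]
        return (1.0 - epsilon) * one_hot + epsilon / k   # stage 1: smoothed labels
    return one_hot                                       # stage 2: hard 0/1 labels

y = np.array([0.0, 1.0, 0.0, 0.0])
print(tsla_target(y, epoch=10))    # smoothed: [0.025 0.925 0.025 0.025]
print(tsla_target(y, epoch=180))   # hard:     [0. 1. 0. 0.]
```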
5.3 SLS
Usually, it is not straightforward to define appropriate values for the label smoothing factor.
Structural Label Smoothing (SLS) [
31] proposes to compute such a value by estimating the Bayes Estimation Error, which, according to the authors, helps define the label’s boundaries for each instance. Several experiments show that this approach can outperform the traditional label smoothing method on different occasions. However, even though some popular datasets were used for comparison purposes, e.g., CIFAR and SVHN, the work is evaluated only on MobileNet-V2 [43] and does not consider other neural network architectures.
5.4 JoCor
This work proposes a new approach to avoid the influence of noisy labels on neural networks. JoCoR [
61] trains two similar neural networks on the same dataset and tries to make their predictions agree. The method calculates the loss by adding the cross-entropy losses of both networks plus a contrastive loss between them, and then uses only the smallest losses in the batch to update the parameters of both architectures. The authors argue that, by using the smallest values to update the parameters, both networks agree on the predictions, and the corresponding labels tend to be less noisy. Although the method was developed for weakly supervised problems, it could easily fit traditional supervised problems, such as data classification, to improve outcomes. The downside of this method is the use of two neural networks for training, which requires more processing and memory.
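A simplified PyTorch sketch of the joint loss described above: per-sample cross-entropy for both networks plus a symmetric KL term as the agreement (contrastive) component, keeping only the smallest-loss fraction of the batch for the update. The selection ratio, weighting, and function name are assumptions; the original work's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def jocor_style_loss(logits1, logits2, targets, co_lambda=0.1, keep_ratio=0.7):
    # Per-sample cross-entropy for each network.
    ce1 = F.cross_entropy(logits1, targets, reduction="none")
    ce2 = F.cross_entropy(logits2, targets, reduction="none")

    # Symmetric KL divergence encourages the two networks to agree.
    p1, p2 = F.log_softmax(logits1, dim=1), F.log_softmax(logits2, dim=1)
    kl = (F.kl_div(p1, p2.exp(), reduction="none").sum(dim=1) +
          F.kl_div(p2, p1.exp(), reduction="none").sum(dim=1))

    joint = (1.0 - co_lambda) * (ce1 + ce2) + co_lambda * kl

    # Small-loss selection: update only on the samples both networks handle best,
    # which tend to carry less noisy labels.
    n_keep = max(1, int(keep_ratio * joint.numel()))
    selected, _ = torch.topk(joint, n_keep, largest=False)
    return selected.mean()
```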
7 Experimental Results
Convolutional Neural Networks are usually designed to achieve the best possible performance in image processing, depending on the target problem. Sometimes, the same basic structure can be used for two or more problems, i.e., one only needs to change the output layer according to the labels. For instance, the EfficientNet structure [54] is re-used in the EfficientDet work [55]. Concerning regularization techniques, however, some components can be trickier to get rid of. Table 2 shows the results of several models on the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. The next sections provide an in-depth discussion of the experiments considered in this article.
7.1 State-of-the-art Regularizer?
Defining the best regularization technique is not trivial. For example, the usual criterion for determining the best image classifier is the best result on the 2012 ILSVRC image classification challenge dataset. At the time of this writing, the research with the best result on that dataset is the Meta Pseudo Labels approach [
39]. One may argue that the best result achieved by a regularization technique on a given architecture might be considered the best regularization method.
According to Table
2, which compiles the results for the most common architectures, one can observe that AutoAugment performs better than PBA using the ResNeXt architecture plus Shake-Shake regularization on the CIFAR-10 dataset. However, when both regularization algorithms are compared using PyramidNet and ShakeDrop regularization, the opposite happens: PBA achieves better results on CIFAR-10 than AutoAugment. Further analysis reveals similar variations elsewhere in the results.
The best one can do regarding state-of-the-art regularization is to analyze the results grouped by the level at which each method works. For example, the likely best regularizer concerning the input layer is RandAugment, for it does not affect the time spent on training and achieves satisfactory results. For internal regularizers, the comparison is even more challenging. Take ShakeDrop as an example: it has not been evaluated with ResNet-18, while MaxDropout was not assessed on PyramidNet. Based only on the available results, ShakeDrop appears to perform best in this category. Unfortunately, there are only two regularizers that work directly on labels; for this reason, TSLA might be considered the best one to be used at the label level.
7.2 Defining a Basic Protocol
There are several aspects to be considered for a fair evaluation of a new regularizer. The primary purpose of regularization is to improve a given baseline architecture through operations on the input data, among layers, or on the label. However, a slight difference in the training protocol may yield a better result that is not necessarily related to the operations of the regularizer. One possible protocol is to remove any other regularization method, even basic data augmentation and weight decay, so it is possible to verify how much a new regularizer improves a baseline architecture without any other influence.
Some papers [
7,
44,
75] train ResNet-18 using the same data transformations, i.e., random flipping and pixel padding, and the same weight decay values. Other works use the same neural network and claim relevant results but do not make the source code available, making the evaluation process untrustworthy, since there might be other transformations at work during training, such as Dropout [49]. Therefore, the primary condition for a work to be cited in this survey is to have its source code available, so it can be compared to other methods directly.
However, it might be crucial for different reasons to use more than one regularizer in the same evaluation. For instance, the Wide Residual Network [
68], a common architecture used for evaluating new regularizers, includes dropout regularization in its layers. Therefore, wherever a new regularization is proposed (at the input, among layers, or at the label), it should be able to work alongside dropout. Another point is that some regularizers naturally incorporate other techniques. For instance, AutoAugment [5] and Fast AutoAugment [
32] incorporate as one of their policies the Cutout [
7]. Therefore, a new regularization technique should be able to work with another regularizer and improve the results when both are used together.
7.3 Use of Minor Architectures
As a general rule, regularization adds little overhead during training time (AutoAugment [5] is, perhaps, the only one that significantly increases training time in order to find better data augmentation policies) and no overhead at all at inference. For this reason, the use of regularizers to avoid early overfitting of a neural network is strongly recommended, and sometimes using more than one at a time should be encouraged. Improving results is desirable for any problem; however, it is particularly necessary for shallower neural networks.
One point missing in all works analyzed in this survey is the lack of proper investigation concerning lightweight CNNs. Architectures like MobileNet-V3 [
23] should be emphasized in regularization works, since these smaller designs usually have fewer parameters or rely on less complex calculations. In the same direction, quantization [13] should be dissected to understand how a given regularization algorithm influences both training a quantized neural network and quantizing a network after training.
EfficientNet [
54] provides a clever calculation for defining how an efficient CNN architecture should be designed, based on width, depth, and resolution. However, for faster and less resourceful hardware, this calculation yields better results when the neural network’s resolution and depth are considered more important than its width, as can be verified in the TinyNet work [
15]. It might be a good idea to provide comparisons using this minimal and fast neural network architecture to show that new regularizers can improve results for smaller CNNs.
7.4 Use of More Complex Datasets
The most common datasets used in regularization works concern objects and animals, which humans can easily distinguish. Another characteristic of these datasets is that they are well balanced, meaning that every class has a similar number of samples in the training and test sets. In medical and other real-world problems, such balance is usually hard to obtain.
In health-related problems, any improvement in the results can lead to safer treatment or even prevent misuse of medication and death. For these reasons, datasets like the
Breast Cancer Histopathological Image Classification (BreakHis) [
48], might be used to increase a work’s relevance. In this specific case, where the results may have life-threatening implications, the idea is to use a deeper CNN, like the EfficientNet family [
54] or ResNet [
16].
7.5 Other Problems besides Classification
In the past, CNNs and other neural networks were mainly used for the image classification task. However, more recently, CNNs were also employed in other tasks, such as object detection and speech recognition. For example, the YOLO architecture [
40] is a Fully Convolutional Network, which means that every layer performs a 2D convolution. In that case, changes to the loss calculation allow the final layers to find out where the objects in a given image are located. Another domain where CNNs achieve state-of-the-art results is image reconstruction. The Residual Dense Network has outstanding results on image reconstruction from noisy [
73] and low-resolution images [
72].
There are two suggestions in this case. The first one is the use of existing regularization techniques in these different tasks or, at least, a justification for not using them in other domains. The second is the development of new regularization methods targeting these specific problems. The only work found so far that tackles a problem other than image classification is CutBlur [
66], thus highlighting the lack of works in this direction.
7.6 Source Code Links
As mentioned before, we only considered papers with the source code available. Table
3 presents the list of links concerning the source codes for every paper surveyed in this work.
8 Conclusion
Regularization is a vital tool to improve the final CNN results, since it helps prevent the model from overfitting the training data. This work aimed at presenting the most recent contributions in the area, delivering a brief overview of how they work and their main results.
This work introduced a lineup of recent regularizers that can fit most neural networks to improve their outcomes. Although some, such as AutoAugment, can drastically increase the training time, most do not require any relevant extra time, and none influences the time taken for inference. Right after the introduction, we provided a brief explanation of how CNNs work and a short history of their development, and then we divided all works analyzed in this article as follows:
•
“input regularization,” where the models work before the image is fed to the network;
•
“internal regularization,” when the regularization algorithm works after the image is fed into the model; and
•
“label regularization,” when the algorithm performs on the output layer.
In addition, we presented the most popular datasets used to evaluate regularization techniques and the most traditional CNN architectures for such a task. Such information is crucial, for it helps standardize an evaluation protocol from now on.
Along with the reported results for each work, we provided our opinion on what a state-of-the-art regularizer would be and on a simple but trustworthy evaluation protocol for new regularizers, which can help compare results and provide insights for researchers in this area. The same section highlights some issues we found in most of the works:
•
the lack of evaluation on simpler architectures, which are the ones that could benefit the most from the use of regularizers; and
•
the lack of an evaluation of methods on more complex data, such as unbalanced datasets, to provide richer information for other researchers.
Last but not least, we encourage the development of new regularization techniques on tasks other than image classification, such as object detection and image reconstruction.