1 Introduction

Convolutional Neural Networks (CNNs) have driven significant advances in deep learning over the last few years [10], demonstrating exceptional performance across a wide range of computer vision and natural language processing tasks. In computer vision, CNNs are employed for tasks such as image classification [4], image restoration [48], image enhancement [47], object detection [54], and image segmentation [37]. The convolutional layers of CNNs act as feature extractors, or as encoders and decoders in autoencoder architectures [56], and CNNs can also be combined with other architectures to form hybrid models [57]. The accuracy of CNNs has improved as networks have grown deeper and datasets larger. However, the large number of parameters in such deep networks makes inference computationally expensive and time-consuming. This paper addresses compressing CNNs to obtain computationally less expensive networks, thereby accelerating inference.

Deep CNN inference consists of forward propagation, in which the input is processed through each layer according to the trained weights and activation functions and the output is predicted. As CNNs become deeper, the number of parameters and floating point operations (FLOPs) increases, and a larger number of parameters increases the model size. Deployment of such deep networks on resource-constrained edge devices is challenging because the networks are compute- and memory-intensive, making inference time-consuming; inference time on edge devices is crucial in real-time applications. As a result, a model compression technique is necessary to compress a deep CNN before deploying it on edge devices. Choudhary et al. [2] have presented an extensive review of the techniques adopted in the literature to compress neural networks and accelerate inference. These techniques are broadly classified into quantization, knowledge distillation, low-rank factorization, and pruning. Quantization techniques reduce the precision of layer weights and activations from floating point to lower bit-widths, thereby reducing the power requirement and memory footprint and improving inference speed. However, the reduced precision can introduce quantization noise and affect model accuracy, and quantization also makes the operations more complex [27]. Low-rank factorization techniques approximate large weight matrices by lower-rank matrices using various decomposition methods, which reduces the computational cost and helps accelerate inference. However, the decompositions themselves are computationally expensive, have to be performed layer-wise [1], and cannot be applied to models with small kernels; the factorized models also require extensive retraining to regain accuracy. Knowledge distillation techniques transfer knowledge from a large teacher model to a smaller student model such that the student achieves accuracy comparable to that of the teacher. However, this requires training both teacher and student models, which increases training complexity, and the technique can only be applied to classification tasks with a softmax output [1]. Pruning techniques remove redundant connections from the network and are classified into structured and unstructured pruning. Structured pruning removes entire channels from the network layers by evaluating channel importance, while unstructured pruning zeros out less important weights, which may introduce irregular sparsity into the network [27]. Pruning reduces model size, computational cost, and memory usage and improves inference time. However, pruning may degrade model accuracy and may require retraining to regain it. Despite this limitation, pruning is straightforward: it involves estimating filter importance with little computational overhead and leads to a direct reduction in the number of parameters. Pruning can also be adapted to different model architectures and combined with other optimization techniques. Motivated by the lottery ticket hypothesis [36], this paper hypothesizes that if a sub-network is extracted from a deep CNN by selecting only the filters corresponding to relevant feature maps, then retraining of the sub-network can be avoided. Relevant feature maps can be selected by computing an importance score, performing similarity analysis, or performing sensitivity analysis [46].
Quantifying feature map importance with a suitable metric is direct and efficient, and involves less computational overhead than similarity or sensitivity analysis. The importance of feature maps can be calculated using data-dependent techniques, which use the training data to obtain importance scores, or data-independent techniques, which are based merely on the magnitude of the weights [46]. Data-dependent techniques calculate an appropriate score to retain only those feature maps that contain relevant information, thereby ensuring minimal degradation in model performance; they may require only minimal retraining after pruning, or avoid it altogether. Entropy is a measure of the amount of information: higher entropy indicates more variability and more information, while lower entropy indicates a more deterministic distribution carrying less information. Applied to CNN feature maps, low-entropy feature maps carry less information to the subsequent layers and contribute less to the model's decision. Entropy-based pruning removes feature maps that are less informative, reducing redundancy with little degradation in accuracy [35]. Hence, this paper uses entropy as the measure to quantify feature map importance.

This paper proposes a pruning approach not only for convolutional layers but also for fully connected (FC) layers, since the fully connected layers also contain many parameters. For convolutional layers, the pruning technique selects class-wise important feature maps, with importance quantified in terms of entropy, and the filters corresponding to less important feature maps are pruned from the CNN. For fully connected layers, the pruning is based on the number of zero activations. This paper also explores the need for retraining after pruning.

The contributions of this paper are as follows:

  1. A pruning technique for both convolutional and fully connected layers is proposed. Pruning all the layers of a CNN gives significant compression while maintaining accuracy.

  2. Class-wise entropy values are calculated to quantify the importance of feature maps in convolutional layers. A pruning threshold selects the filter indices of less important feature maps for every class label based on a pre-defined percentile, and the final pruning indices are obtained as the intersection over all class labels. This ensures that the threshold is different for every layer and is not influenced by any particular class.

  3. The redundant neurons in fully connected layers are identified using a number-of-incoming-zeros metric. This prevents zeros from being carried forward unchanged through the network, thereby reducing redundancy.

  4. A study on retraining the entire model versus retraining only the fully connected layers of a pruned model is presented.

  5. A feature map visualization with respect to the entropy of the feature maps demonstrates the importance of considering class labels when selecting the threshold, and an analysis of layer-wise pruning is presented.

  6. The effectiveness of the proposed work is validated on three datasets and three CNN architectures of different complexities.

This paper is organized as follows: Sect. 2 discusses the existing work on pruning, Sect. 3 describes the proposed pruning strategy, Sect. 4 presents the experiments and results, and Sect. 5 concludes the work.

2 Related work

The trained networks may be over-parameterized and may have some redundant parameters. Such redundant parameters should be eliminated from the network before deployment on resource-constrained platforms. Pruning techniques with different selection criteria have been discussed in the literature.

Ghimire et al. [9] have presented an extensive survey of pruning, quantization, tensor decomposition, knowledge distillation, and neural architecture search-based techniques for compressing neural networks and accelerating them on hardware. Vadera et al. [46] have surveyed various pruning techniques in the literature and categorized them as magnitude-based, similarity- and clustering-based, and sensitivity-analysis-based. Magnitude-based techniques prune kernels with small magnitude, similarity- and clustering-based techniques prune redundant kernels, and sensitivity-analysis-based techniques remove kernels after analysing their effect on the loss. Mondal et al. [38] have categorized and reviewed filter pruning techniques based on filter norm, feature map characteristics, reconstruction error, derivatives of the loss function, pruning tools, and modified loss.

Han et al. [11] have proposed a combined pruning and quantization approach for compressing deep neural networks. The pruning is based on removing connections corresponding to small weights. Li et al. [24] prune filters from the network which have a low absolute sum of filter weights. Paupamah et al. [39] have proposed a technique based on pruning filters having small weights and less sensitivity to the network performance. Singh et al. [45] have proposed passive pruning based on filter norm. Jayasimhan et al. [17] prune the model based on norm and redundancy. The filters after pruning are selectively restored using simulated annealing.

Luo et al. [35] have defined the importance of feature maps in terms of entropy, as entropy measures the amount of information. The authors have tested this pruning criterion on VGG-16 and ResNet-50, and their results show that significant speed-up and compression can be achieved compared to the other techniques considered in their paper. Hu et al. [14] have proposed a pruning technique based on the average percentage of zeros (APoZ) in feature maps: the higher the APoZ, the larger the amount of unimportant features. These zeros are carried forward unchanged through the network and hence do not contribute to the performance, so the kernels corresponding to feature maps with a high APoZ can be removed without affecting the accuracy of the network. Lin et al. [28] have selected filters corresponding to low-rank feature maps for pruning. Zheng et al. [58] have presented a pruning technique that calculates filter importance based on the direct and indirect effects of a filter on the current and next layers.

He et al. [13] have proposed a geometric-median-based pruning technique that considers the mutual relations between filters to remove redundancy. Shao et al. [43] have analysed the results obtained by pruning CNNs based on the similarity of filters and the similarity of feature maps, observing that for shallow networks pruning based on feature map similarity works better, while for deep networks pruning based on filter similarity works better. Wang et al. [51] have proposed a pruning technique based on the similarity of feature maps quantified in terms of structural similarity (SSIM) and peak signal-to-noise ratio (PSNR). Liu et al. [32] have proposed a technique to cluster and prune similar filters of a layer using k-means++ clustering. Geng et al. [8] have calculated the importance of a filter based on its norm, its similarity to other filters in the layer, and the parameters of the batch normalization layer. Liu et al. [33] have obtained the importance score of a feature map from the cosine similarity and an energy weighting coefficient computed between the low-frequency and high-frequency components of the feature map after applying a wavelet transform.

Wang et al. [50] have proposed a pruning technique that finds redundant filters based on structural redundancy in the graph of a layer. Lin et al. [29] have proposed a pruning approach that identifies unimportant filters during training through an optimization problem with a sparsity criterion, thereby capturing the relation between outputs and local pruning operations. Fernandes et al. [6] have proposed an evolutionary-algorithm-based pruning technique for CNNs. Jiang et al. [18] have proposed pruning based on bi-objective optimization and an evolutionary algorithm, and Chung et al. [3] have proposed a multi-objective evolutionary algorithm with weight inheritance for pruning. Huang et al. [15] have presented a sparsity-regularization-based pruning technique. Liu et al. [31] use a pre-trained model that takes the target compression rate as input and produces the compressed architecture; this technique compresses every layer with a different compression rate determined by reinforcement learning. Ding et al. [5] have proposed a pruning technique that prunes every layer using a threshold obtained by solving a constrained optimization problem. Gao et al. [7] prune the model through bi-level optimization that considers both static and dynamic channel pruning. Louati et al. [34] have presented a bi-level technique, based on an evolutionary algorithm, for pruning channels and filters from the channels.

Zhao et al. [55] have discussed a channel saliency metric obtained from the batch normalization layer for pruning networks; their results demonstrate that the retraining stage after this pruning technique can be skipped without affecting performance. Lin et al. [30] have proposed a pruning technique based on generative adversarial learning that also does not require retraining. Kim et al. [19] have proposed a technique that does not require retraining after pruning, in which the filters of the next convolutional layer take into account all the information of the filters pruned in the previous layer.

From the literature study, it can be observed that pruning thresholds are typically obtained from the entire training data without considering class-wise feature importance. In this paper, a class-specific threshold is selected for the different layers, which ensures that the threshold is not influenced by any particular class and that the layers are effectively and non-uniformly pruned, resulting in little or no drop in classification accuracy and avoiding the need to retrain the entire model. Moreover, most existing works prune only the convolutional layers of the CNN; pruning the fully connected part as well results in a more significant compression ratio. In this paper, filter importance for convolutional layers is calculated based on entropy, and the visualization results show that entropy is a suitable metric for quantifying feature maps.

3 Pruning

The detailed process of the proposed pruning techniques for convolutional and fully connected layers in a CNN is described in this section.

3.1 Preliminaries

For a CNN with L convolutional layers and F fully connected layers, the feature map of every convolutional layer (\(l \in \{1,...,L\}\)) is denoted by \(X^{c}_{i,j}\), where \(c \in \{1,...,C\}\) and C is the total number of classes in the image classification task, \(i \in S_c\) and \(S_c\) is the set of image indices belonging to class c, and \(j \in \{1,...,K\}\) where K is the total number of filters in layer l. The output of every fully connected layer (\(l \in \{1,...,F\}\)) is denoted as \(O^l_{i,j}\), where \(i \in \{1,...,D\}\) and D is the number of neurons in the layer, and \(j \in \{1,...,D+1\}\) where \(D+1\) is the number of neurons in the next layer.

3.2 Pruning strategy

For convolutional layers, the strategy is based on pruning the filters that do not extract significant features. Entropy has been selected as the measure to quantify the quality of the extracted features. The entropy of the feature maps is calculated for each image of every class in the training dataset. To calculate the entropy of feature map \(X^{c}_{i,j}\), the distribution of \(X^{c}_{:,j}\) is divided into b bins and the probability of each bin is obtained [35]. The entropy of the \(j^{th}\) feature map of the \(i^{th}\) image belonging to the \(c^{th}\) class is then calculated as shown in Eq. 1:

$$\begin{aligned} E^c_{i,j} = - \sum _{z=1}^{b} p_z log(p_z) \end{aligned}$$
(1)
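
As a minimal illustration (not the authors' implementation), the entropy of Eq. 1 can be estimated for a single feature map by binning its activation values; the helper name, the default of 8 bins, and the use of the natural logarithm are assumptions here.

```python
import numpy as np

def feature_map_entropy(feature_map: np.ndarray, bins: int = 8) -> float:
    """Estimate the entropy of one feature map (Eq. 1) from a b-bin histogram.

    `feature_map` holds the activations X^c_{i,j} of one filter for one image.
    The bin count and logarithm base are illustrative assumptions.
    """
    counts, _ = np.histogram(feature_map.ravel(), bins=bins)
    p = counts / counts.sum()      # probability of each bin
    p = p[p > 0]                   # drop empty bins, treating 0*log(0) as 0
    return float(-np.sum(p * np.log(p)))
```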

The proposed pruning methodology is illustrated in Fig. 1. The desired model is first trained on the dataset under consideration, and the trained model is then pruned. The convolutional layers of the trained model are pruned using entropy-based pruning, while the fully connected layers are pruned based on the number of incoming zeros. The entropy-based pruning blocks in Fig. 1 describe the operations for one layer. For a convolutional layer and a particular class, the feature maps are first obtained for all images of that class and the entropy of every feature map is calculated. The average entropy over all images of that class is then computed, so that the metric accounts for every image belonging to that class. The filter indices to be pruned are selected based on a percentile threshold obtained for every class. The method for class-wise threshold entropy calculation is shown diagrammatically in Fig. 2.
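
The class-wise averaging step described above could be sketched as follows for one convolutional layer, assuming a trained Keras model, a NumPy array `class_images` holding the training images of a single class, and the `feature_map_entropy` helper sketched after Eq. 1; the layer-name lookup and batch prediction are illustrative rather than the authors' code.

```python
import numpy as np
import tensorflow as tf

def class_average_entropy(model, layer_name, class_images, bins=8):
    """Average entropy per filter over all images of one class for one conv layer.

    Returns an array of shape (K,), where K is the number of filters in the layer.
    `feature_map_entropy` is the helper sketched after Eq. 1.
    """
    # Sub-model that outputs the feature maps of the chosen convolutional layer.
    extractor = tf.keras.Model(inputs=model.input,
                               outputs=model.get_layer(layer_name).output)
    maps = extractor.predict(class_images, verbose=0)        # shape (N, H, W, K)
    n_images, _, _, n_filters = maps.shape
    entropies = np.zeros((n_images, n_filters))
    for i in range(n_images):
        for j in range(n_filters):
            entropies[i, j] = feature_map_entropy(maps[i, :, :, j], bins)
    return entropies.mean(axis=0)    # average over all images of the class
```

Running this for every class label of a layer yields the per-class average entropies that the threshold selection below operates on.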

Fig. 1
figure 1

Proposed pruning methodology

Fig. 2
figure 2

Diagrammatic overview of class-wise threshold entropy calculation for a convolutional layer

The steps to obtain the threshold entropy based on percentile are as follows:

  • Step 1: For a class, sort the average entropy values of all feature maps in ascending order.

  • Step 2: Obtain the ordinal rank as shown in Eq. 2.

    $$\begin{aligned} ordinal\_rank = \frac{percentile}{100} \times K \end{aligned}$$
    (2)

    where \(ordinal\_rank\) gives the index of the threshold entropy in the sorted entropy list, percentile is pre-defined, and K is the number of feature maps in the layer.

  • Step 3: Obtain the entropy value corresponding to the ordinal rank. This entropy value is the threshold entropy. Only the feature maps having entropy below this threshold value are selected for pruning.

  • Step 4: Repeat the above steps for all classes in the dataset.

Only the channel indices common to all classes are considered for pruning that layer. This is repeated for all convolutional layers in the model.
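
Under the same assumptions, Steps 1-4 and the intersection over classes might be implemented as in the following sketch; `avg_entropy_per_class`, a dictionary mapping each class label to the per-filter average entropies of one layer (e.g. the output of `class_average_entropy`), is an assumed input.

```python
import numpy as np

def filters_to_prune(avg_entropy_per_class, percentile=50):
    """Select the filter indices of one layer that fall below the class-wise
    threshold entropy (ordinal rank of Eq. 2) for every class."""
    per_class_indices = []
    for entropies in avg_entropy_per_class.values():
        k = len(entropies)
        rank = min(int(percentile / 100 * k), k - 1)       # ordinal rank, Eq. 2
        threshold = np.sort(entropies)[rank]               # class-wise threshold entropy
        below = set(np.where(entropies < threshold)[0])    # feature maps below threshold
        per_class_indices.append(below)
    # Only filters that are unimportant for every class are pruned from this layer.
    return sorted(set.intersection(*per_class_indices))
```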

The threshold entropy selection is vital to the effectiveness of the proposed pruning strategy. A single global threshold entropy cannot be chosen for all classes because the entropy values may vary widely amongst them. This paper therefore suggests an adaptive threshold selection approach based on finding a class-wise threshold entropy, which establishes a distinct threshold for each class. Differentiating the thresholds by class ensures that classes with denser objects, which may have higher entropies than other classes, do not dominate the threshold entropy. A threshold common to all classes would not be a proper selection criterion when there are significant differences in entropy values across classes; hence the threshold is selected adaptively and class-wise so that it is not influenced by any particular class. The detailed algorithm of the proposed pruning strategy for convolutional layers is given in Algorithm 1.

For fully connected layers, the proposed strategy is based on pruning the neurons for which the number of zero incoming inputs exceeds a pre-defined threshold. A large number of incoming zeros keeps a neuron's output activation low or close to zero, so redundant zeros are carried forward through the network. The detailed algorithm of the proposed pruning strategy for fully connected layers is given in Algorithm 2.
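
The exact incoming-zero count is defined in Algorithm 2, which is not reproduced here; the sketch below shows one plausible reading as an assumption, counting for each neuron the fraction of its incoming products (previous-layer activation times weight) that are exactly zero over a batch of training images and marking neurons whose fraction exceeds a threshold.

```python
import numpy as np
import tensorflow as tf

def fc_neurons_to_prune(model, fc_layer_name, images, zero_fraction=0.5):
    """Mark neurons of a dense layer whose incoming contributions are mostly zero.

    This particular zero-counting rule and the `zero_fraction` threshold are
    illustrative assumptions, not the paper's Algorithm 2. Use a modest batch of
    `images` to keep the intermediate (N, D, D_next) tensor small.
    """
    layer = model.get_layer(fc_layer_name)
    # Activations feeding the chosen dense layer.
    feeder = tf.keras.Model(inputs=model.input, outputs=layer.input)
    acts = feeder.predict(images, verbose=0)                 # shape (N, D)
    weights = layer.get_weights()[0]                         # shape (D, D_next)
    # Fraction of zero (activation * weight) products arriving at each neuron.
    products = acts[:, :, None] * weights[None, :, :]        # shape (N, D, D_next)
    zero_share = np.mean(products == 0.0, axis=(0, 1))
    return sorted(np.where(zero_share > zero_fraction)[0])
```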

Algorithm 1
figure a

Entropy based pruning for convolutional layers

Algorithm 2
figure b

Number of zeros based pruning for fully connected layers

Once the network is pruned, this paper focuses on whether the pruned network needs to be retrained. Retraining after pruning is usually performed to recover the accuracy that may be lost as a result of pruning. The entropy of the feature maps helps to select the filters that extract meaningful features from the data; hence, pruning away filters whose feature maps have low entropy should maintain the accuracy of the network even without retraining. This paper therefore explores whether retraining is really required when only the less significant filters of the convolutional layers are pruned. The results of the pruned network without retraining, with full retraining, and with retraining of only the fully connected layers are compared and presented in the next section.
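
Retraining only the fully connected part can be approximated in Keras by freezing all non-dense layers of the pruned model before recompiling, as in the sketch below; the optimizer, learning rate, loss, and epoch count are placeholders rather than the settings of Table 1.

```python
import tensorflow as tf

def retrain_fc_only(pruned_model, train_ds, val_ds, epochs=10):
    """Fine-tune only the dense (fully connected) layers of a pruned model.

    Convolutional layers keep their pruned weights; the training hyperparameters
    below are illustrative placeholders, not the paper's settings.
    """
    for layer in pruned_model.layers:
        # Only Dense layers remain trainable; everything else is frozen.
        layer.trainable = isinstance(layer, tf.keras.layers.Dense)
    pruned_model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
                         loss="categorical_crossentropy",
                         metrics=["accuracy"])
    return pruned_model.fit(train_ds, validation_data=val_ds, epochs=epochs)
```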

4 Experiments and results

The experimentation carried out and the results obtained for the proposed technique are discussed in this section.

The proposed pruning technique is tested with three models on three datasets, namely the Intel image classification dataset [16], the CIFAR10 dataset, and the CIFAR100 dataset [20], which have different numbers of classes. CNNs of different complexities have been considered: AlexNet [21] and VGG-16 [44], which are CNNs with sequential connections, and ResNet-50 [12], which has both sequential and skip connections, are used to validate the proposed methodology. The three CNNs are first trained on the three datasets. Each trained model is then pruned with the proposed methodology, and the need for retraining after pruning is also explored. The settings of all experimental parameters are shown in Table 1. The code has been written in the Python programming language, and the Keras deep learning framework with a TensorFlow backend has been used for model training. The Kerassurgeon library has been used to prune the models. The experiments have been conducted on an NVIDIA Tesla V100 GPU.
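
For reference, removing the selected filters with Kerassurgeon could look roughly like the following; the import path and the `delete_channels` signature follow the library's documented operations module but should be verified against the installed version, and `prune_plan` (a mapping from layer names to the filter indices chosen by the entropy criterion) is an assumed input.

```python
from kerassurgeon.operations import delete_channels

def apply_pruning(model, prune_plan):
    """Delete the selected channels layer by layer and return the rebuilt model.

    `prune_plan` maps a convolutional layer name to the list of filter indices
    selected for pruning (e.g. by `filters_to_prune`); this helper is a sketch,
    not the authors' code.
    """
    for layer_name, channel_indices in prune_plan.items():
        if channel_indices:  # skip layers where nothing was selected
            model = delete_channels(model,
                                    model.get_layer(layer_name),
                                    list(channel_indices))
    return model
```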

Table 1 Settings of experimental parameters

For the entropy calculations, bin counts in powers of two (2, 4, 8, 16, 32, 64) were investigated with respect to the per-layer compression. Varying the number of bins did not make a significant difference in the per-layer compression; among the values considered, 8 bins gave the maximum compression. Hence the entropy calculations are made with 8 bins, and the entropy therefore ranges from 0 to log(8). For convolutional layer pruning, the threshold was selected by considering various percentile values with respect to compression and validation performance, and the 50th percentile is selected. The training dataset is used for quantifying the importance of feature maps based on entropy.
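
As a quick check of the stated range: with 8 bins, the entropy of Eq. 1 is maximal when all bins are equally likely, giving

$$\begin{aligned} E_{max} = - \sum _{z=1}^{8} \frac{1}{8} log\left( \frac{1}{8}\right) = log(8) \end{aligned}$$

so the per-feature-map entropies indeed lie between 0 (a single occupied bin) and log(8).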

The training and validation results of the unpruned and pruned models are shown in this section. In Tables 2-10, the number of parameters (params), inference floating point operations (FLOPs), model size (size), training loss (TL), training accuracy (TA), validation loss (VL), and validation accuracy (VA) are used as criteria for comparing the unpruned and pruned models. The columns in Tables 2-10 are denoted as unpruned model (U), pruned model without retraining (P-NR), pruned model with retraining (P-R), and pruned model with retraining of only the fully connected layers (P-RFC). The model training and retraining experiments have been carried out 5 times, and the mean accuracy and loss with standard deviation are reported. All 5 trained models are pruned, and the accuracy and loss for P-NR, P-R, and P-RFC are shown in Tables 2-10.

Table 2 AlexNet-Intel image dataset
Table 3 AlexNet-CIFAR10 dataset
Table 4 AlexNet-CIFAR100 dataset
Table 5 VGG-16-Intel image dataset
Table 6 VGG-16-CIFAR10 dataset
Table 7 VGG-16-CIFAR100 dataset
Table 8 ResNet-50-Intel image dataset
Table 9 ResNet-50-CIFAR10 dataset
Table 10 ResNet-50-CIFAR100 dataset

It can be observed from Tables 2, 3 and 4 that, for the pre-defined threshold, the proposed pruning technique compresses AlexNet in terms of parameters by 83.2% for the Intel image dataset, 87.19% for the CIFAR10 dataset, and 79.7% for the CIFAR100 dataset. Tables 5, 6 and 7 show that VGG-16 has been pruned by 83.7% for the Intel image dataset, 85.11% for CIFAR10, and 84.06% for CIFAR100, while ResNet-50 has been pruned by 62.99% for the Intel image dataset, 62.3% for CIFAR10, and 58.34% for CIFAR100, as observed in Tables 8, 9 and 10. In terms of FLOPs reduction, AlexNet achieves 66.9% for the Intel image dataset, 65.16% for CIFAR10, and 52.7% for CIFAR100. VGG-16 achieves a FLOPs reduction of 77.2% for the Intel image dataset, 80.6% for CIFAR10, and 79.53% for CIFAR100, while ResNet-50 achieves 66.76% for the Intel image dataset, 66.01% for CIFAR10, and 57.83% for CIFAR100. It can also be observed that, for the Intel image dataset, pruning without retraining does not degrade the accuracy much: since entropy selects feature maps that carry information, pruning the convolutional layers almost maintains the accuracy even without retraining. However, retraining only the fully connected part after pruning gives good accuracy for the pruned AlexNet and VGG-16 models.

Compared to AlexNet and VGG-16, pruning degrades the accuracy of ResNet-50 significantly. AlexNet and VGG-16 have sequential connections, so pruning filters from one layer does not affect many other layers. ResNet-50, however, has skip connections, and pruning filters from a layer affects all the layers connected through the residual path; hence ResNet-50 without retraining gives a very low accuracy. This also affects the output of the last convolutional layer of ResNet-50. Retraining the fully connected part regains some accuracy, as it allows the fully connected layers to adapt to the altered features, but the accuracy of ResNet-50 cannot be fully regained without retraining the entire network. Thus, the proposed pruning technique can be used with retraining of only the fully connected layers for models with sequential connections, while models with skip connections need to be retrained entirely.

Table 11 presents the time required to train the entire model, retrain the pruned model, and retrain only the fully connected layers of the pruned model. These time measurements have been made on an NVIDIA Tesla V100 graphics processing unit (GPU). It can be observed that retraining only the fully connected part results in significant time savings.

Table 11 Comparison of training times

Deep neural networks deployed in production may observe new data over time. These networks cannot adapt to changing data on their own and need to be retrained if data or concept drift is observed. Continual learning aims to continuously update models in production by retraining on new data while not forgetting earlier learnings. Applications such as autonomous vehicles [42] and smart factories [52] see huge volumes of dynamic data every day: about 32 TB of data is captured per autonomous vehicle per day, while smart factories capture about 1 PB of data per day [41]. Deep neural networks deployed to analyse such data and extract insights will not keep performing well on the dynamic data observed every day and need to be updated on the fly. The frequency of updating the model depends on the application and may be at fixed intervals, on a performance trigger, on a data-change trigger, or on demand. Models are pruned before deployment to improve their real-time performance; however, to avoid degradation in the accuracy of pruned models, retraining has to be done to regain the accuracy [22, 23, 25, 26, 40, 53]. If the model has to be updated, one update cycle consists of training on the new dataset, followed by pruning and retraining the pruned model, and finally deploying it. Hence, repeated retraining is required whenever the model is updated, and retraining the pruned model can be costly. If the need for retraining after pruning is avoided, significant time savings can be achieved in the case of such frequent updates. According to the results in Table 11, retraining only the fully connected layers instead of the entire network reduces the training time by approximately 47.41% for AlexNet, 54.74% for VGG-16, and 25.8% for ResNet-50 trained on the Intel image dataset with the proposed pruning technique. If this technique is used to compress models deployed in applications that observe dynamic data every day and demand frequent retraining, a significant amount of time will be saved. Thus the proposed pruning technique would be resource efficient and would adapt faster in continual learning scenarios.

However, pruning techniques face a few challenges in continual learning environments. The potential challenges of pruning in continual learning scenarios and possible solutions [49] are as follows:

  • Catastrophic forgetting: The model may lose performance on previously learned tasks, since pruning changes the network structure and may remove filters corresponding to features that were important for old tasks. Possible solutions include adding a regularization term to the loss function to preserve knowledge from previous tasks, and parameter isolation by generating task-specific parameters or maintaining a separate neural network for each task.

  • Knowledge transfer: Pruned models have a reduced number of parameters, so the network learned for old tasks may not be sufficient for new tasks. Possible solutions include encouraging parameter reuse, introducing new parameters when required, or building composite models that combine old and new knowledge.

  • Inefficient use of parameters: Pruning may lead to models with unbalanced capacity, where some parts of the network are important while others are not. Possible solutions are task-dependent dynamic pruning or fine-tuning some parameters for new tasks.

Most of the pruning results reported in the literature consider the VGG-16 architecture trained on the CIFAR10 and CIFAR100 datasets [46], and a few papers also present results for ResNet-50/56 on the CIFAR10 dataset. Table 12 compares the best-performing pruned model from the 5 runs with various pruning techniques in the literature with respect to percentage parameter reduction, percentage FLOPs reduction, and accuracy. Compared to other techniques, the proposed technique compresses the networks to a larger extent without degrading accuracy and requires retraining of only the fully connected layers for the AlexNet and VGG-16 architectures. For the ResNet architecture, the proposed technique gives less compression than [28, 30, 33]; however, the observed accuracy is better than that of these techniques. Compared to [18], the proposed technique gives less compression and lower accuracy for ResNet, but better compression and accuracy for VGG-16 on CIFAR10. The proposed technique compresses the network by retaining only the relevant filters and reduces the need to retrain the entire network, thereby saving time.

Table 12 Comparison of proposed pruning technique with other techniques

Figures 3, 4 and 5 show the layer-wise pruning percentages for the convolutional (Conv) layers of different blocks and the fully connected (FC) layers. It can be observed from Figs. 3 and 4 that the maximum pruning happens in the initial convolutional layers and in the fully connected layers for AlexNet and VGG-16.

Fig. 3
figure 3

Layer-wise filter pruning percentage for AlexNet

Fig. 4
figure 4

Layer-wise filter pruning percentage for VGG-16

Fig. 5
figure 5

Block-wise filter pruning percentage for ResNet-50

The visualization of the first-layer feature maps of AlexNet trained on the Intel image dataset is shown in Fig. 6. The entropy value of each feature map is given below the visualized map. The first column shows original images from the dataset, and the second to sixth columns show feature maps. It can be observed that feature maps with significant features have higher entropy values, confirming that entropy is a suitable measure for quantifying feature maps. Hence, pruning feature maps with low entropy values does not cause much loss in accuracy, and for models with only sequential connections the loss can be recovered simply by retraining the fully connected layers. Different classes may also have different object densities, so the maximum entropy value of the feature maps may differ across classes. In this study, the entropy thresholds selected by the percentile method for pruning AlexNet on the Intel image dataset are 0.668 for buildings, 0.642 for forest, 0.406 for glacier, 0.499 for mountain, 0.470 for sea, and 0.577 for street. Figure 6 (2), (8), (14), (20), (26), and (32) show the feature maps with the maximum entropy value. The maximum entropy value varies across images of different classes; hence a global common threshold cannot be selected for pruning, and the threshold should instead be selected class-wise so that it is not influenced by classes with dense objects.

Fig. 6
figure 6

Layer-1 visualization and entropy values of input image and feature maps for AlexNet—Intel

5 Conclusion

For compressing networks by pruning, this paper has defined the hypothesis that, if the metric used to quantify feature maps and the pruning threshold are robust, retraining the entire network after pruning can be avoided. The paper has proposed a novel class-wise entropy-based approach for pruning convolutional layers and a number-of-zeros-based approach for pruning fully connected layers. The class-wise approach to finding the threshold entropy ensures that a separate threshold is computed for every class and that the pruning decision is not influenced by any particular class. An analysis of the need for retraining after pruning has been carried out by comparing the accuracy of the pruned network without retraining, with full retraining, and with retraining of only the fully connected layers. The results demonstrate that the pruned AlexNet and VGG-16 models maintain accuracy when only the fully connected layers are retrained, avoiding the need to retrain the entire network and thereby saving time. The ResNet-50 model, however, has to be retrained entirely to regain accuracy after pruning. The feature map visualization along with the entropy values has also been presented; it shows that entropy is a suitable metric for representing the information in feature maps, as feature maps with less information clearly have lower entropy values. The pruning experiments and results thus support the hypothesis for AlexNet and VGG-16, which are networks with sequential connections. For the Intel image, CIFAR10, and CIFAR100 datasets, respectively, the proposed technique compresses AlexNet by 83.2%, 87.19%, and 79.7%, VGG-16 by 83.7%, 85.11%, and 84.06%, and ResNet-50 by 62.99%, 62.3%, and 58.34%.

As future work, pruning strategies for networks with residual connections such as ResNet-50 should be explored to obtain better compression than existing techniques without the need for retraining. Another possible direction is to apply the proposed method to application-specific datasets to evaluate its generalization ability.