Article

The Impact of 8- and 4-Bit Quantization on the Accuracy and Silicon Area Footprint of Tiny Neural Networks

Paweł Tumialis, Marcel Skierkowski, Jakub Przychodny and Paweł Obszarski
1 Multimedia Systems Department, ETI Faculty, Gdańsk University of Technology, 80-233 Gdańsk, Poland
2 Biomedical Engineering Department, ETI Faculty, Gdańsk University of Technology, 80-233 Gdańsk, Poland
3 Metrology and Optoelectronics Department, ETI Faculty, Gdańsk University of Technology, 80-233 Gdańsk, Poland
4 Algorithms and Systems Modelling Department, ETI Faculty, Gdańsk University of Technology, 80-233 Gdańsk, Poland
* Author to whom correspondence should be addressed.
Electronics 2025, 14(1), 14; https://doi.org/10.3390/electronics14010014
Submission received: 26 October 2024 / Revised: 12 December 2024 / Accepted: 19 December 2024 / Published: 24 December 2024

Abstract

In the field of embedded and edge devices, efforts have been made to make deep neural network models smaller due to limited memory and low computational efficiency. Typical model footprints are under 100 KB. However, for some applications, models of this size are still too large. In low-voltage sensors, signals must be processed, classified or predicted with an order of magnitude less memory. Models can be downsized by limiting the number of parameters or by quantizing their weights. Both operations have a negative impact on the accuracy of the deep network. This study tested the effect of model downscaling techniques on accuracy. The main idea was to reduce neural network models to 3 k parameters or fewer. Tests were conducted on three different neural network architectures in the context of three separate research problems that model realistic tasks for small networks. The impact of the reduction on accuracy depends mainly on the network's initial size: reducing a network from 40 k parameters caused an accuracy drop of 16 percentage points, and reducing a network from 20 k parameters a drop of 8 percentage points. To obtain the best results, knowledge distillation and quantization-aware training methods were used. Thanks to this, the accuracy of the 4-bit networks did not differ significantly from that of the 8-bit ones, and their results were approximately four percentage points worse than those of the full-precision networks. For the fully connected network, synthesis to an ASIC (application-specific integrated circuit) was also performed to demonstrate the reduction in the silicon area occupied by the model. The 4-bit quantization reduces the silicon area footprint by 90%.

1. Introduction

Currently, there is a trend of increasing model complexity in machine learning. Models with more parameters achieve higher accuracy than their smaller versions. Unfortunately, scaling models up has many negative implications. Training takes longer and requires more data. Model inference requires more computing power, which results in high energy consumption. Inference itself also takes much longer, which reduces the throughput of the solution. In some applications, neural networks must meet restrictive requirements related to performance, size and throughput. In edge, embedded, IoT and networked devices, networks must be small enough to be deployed while maintaining high accuracy. There are several methods for dealing with this problem. Individual layers of the model can be reused multiple times. This reduces the number of parameters stored in the device's memory but does not significantly affect inference. In [1], the authors showed that this technique allows the number of parameters to be reduced tenfold. The accuracy of the model decreased by only about 1.5 percentage points for classification into 10 classes and by about 4 percentage points for 100 classes. The number of parameters in a neural network can also be reduced by changing its architecture, removing layers or shrinking specific layers. Network quantization is often used for the same purpose. The methods mentioned allow training networks that fulfill their assigned tasks with only a small decrease in accuracy.

2. Related Works

Small neural network models have a wide range of applications. A small number of parameters allows them to be used on many hardware platforms without compromising their results. The authors of [2] demonstrate this: using automatic neural architecture search, they created a model that, despite being 5.28 times smaller than MobileNetV2 [3], achieved a 15.14% higher fairness score on a dermatology dataset. An example of a small architecture was also described in [4]. The small version of a model called ULEEN occupied only 16.9 kB of memory and achieved 96.2% accuracy on the MNIST [5] dataset. The authors performed model inference on an FPGA, where the latency was 0.21 μs and the power consumption was 1.1 W.
Tiny neural networks are often used in IoT. In [6], a model for processing accelerometer data to classify physical activity was described. It had only 0.24 M multiply–accumulate operations and, when run on the ISM330AILP sensor from STMicroelectronics, achieved a classification accuracy of 96%.
For edge or automotive applications, a small model is assumed to have from tens of thousands to several hundred thousand parameters. In [7], the authors managed to reduce a convolutional network to 150.84 k parameters while maintaining the same accuracy. Its task was to analyze the image taken from a car camera. The car had a central computer with an adequate amount of RAM, so a model of this size was sufficient. In a publication on sound processing [8], it was shown that a model reduced to 30.3 k parameters had accuracy similar to that of models several times larger. It was implemented on a modern smartphone.
However, the previously mentioned models are much too large for use in integrated sensors, microcontrollers or network devices. For such applications, models with a few thousand parameters are optimal. The authors of [9] created a network with only 1493 parameters that correctly classified gestures measured with a motion sensor. It fits, together with the program code, on an energy-saving 8-bit microcontroller. Convolutional networks can also come in miniature sizes. In [10], a three-layer CNN was described that efficiently detected faces in 128 × 128 pixel images with 93% accuracy. It was integrated into a sensor that consumed 80.4 μW at 50 fps. Models integrated into an ASIC can also be used in medicine. The authors of [11] created a processor for electrocardiogram monitoring. The implemented network classified health status with 96.7% accuracy while occupying only 0.33 mm² of silicon in 55 nm CMOS technology. Implementing neural networks in ASICs can yield different results depending on the methods used. Refs. [12,13] show that the differences can concern model area, power consumption and even accuracy.
In addition to reducing the number of parameters, a network can be made smaller through quantization. According to [14], reducing a network to 8 bits has minimal impact on its accuracy and significantly reduces the memory footprint. The network can be quantized even further, however. In [15], it was shown that models can be quantized down to 4 bits with a loss of accuracy of about 0.9–5.2% compared to the floating-point model. The most radical form is binary quantization. According to [16], binary convolutional networks were able to classify images with an accuracy of 80% and above. Binarizing the network allows multiply–accumulate operations to be replaced with simple logical operations in its hardware implementation, which dramatically reduces its size.
The combination of model downscaling and quantization is an important step in implementing neural networks close to hardware. This will enable the creation of complex and intelligent distributed IoT systems, the energy-efficient operation of embedded systems and the acceleration of calculations in edge devices. Small artificial intelligence models have enormous application potential.

3. Methodology

This study assumed that the change in the accuracy of a deep network caused by model reduction may depend on the type of architecture used and on the problem that the network solves. For this reason, three separate classical problems were tested. Each corresponds to a task relevant to embedded and edge systems.

3.1. Network Anomaly Detection

The first research problem was the analysis of network traffic and the detection of anomalies using traffic features. In IoT and embedded devices, it is crucial to ensure the security of Internet communication. Hackers use various techniques to intercept, steal or counterfeit sensitive data sent to an external server. To recognize such an attack, devices analyze both the characteristic features of network traffic and the content transmitted over the network. The publication [17] shows how important neural network throughput and the maximization of its bandwidth are. Neural networks must process data transmitted in network traffic in real time. The development of Internet technology and the continuous increase in network bandwidth place high demands on the models. The authors suggested using smaller models and quantization to optimize network traffic and implementing the network in parallel to process data in real time. The dataset representing this problem, NSL-KDD2 [18], was developed for a competition of the Canadian Institute for Cybersecurity. It describes 41 network traffic features that are used to detect and classify 20 anomalies. Features are divided into three categories: basic, content and traffic. Basic features contain information about the duration of the connection, the protocol used and the type of service. Content features contain information obtained from the transmitted TCP packet, for example, the number of failed login attempts or unauthorized root access. Traffic features contain patterns of network traffic, such as the number of connections to the server or the percentage of connections containing errors. The dataset also has 20 classes that indicate whether the traffic is normal or suspicious. Examples of dangerous traffic are DoS attacks, user-to-root attacks or surveillance. To solve this problem, a fully connected network was proposed, initially based on the publication [19], with one hidden layer of 61 nodes. Table 1 shows several models trained on the NSL-KDD2 dataset. Over the years, model accuracy has increased, but this has been accompanied by a significant increase in the number of network parameters.

3.2. Image Classification

An important issue in sensor systems and the IoT is image recognition and classification. Today, cameras are an indispensable part of mobile devices. This means that the deep networks used to process images must be small and energy efficient. If the network were scaled down appropriately, it could be integrated with the camera itself instead of running on an external microcontroller or processor. This approach would allow the creation of distributed vision systems while minimizing electricity consumption. To investigate the image classification problem, the CIFAR-10 dataset was used [23]. It contains 60,000 RGB images with a resolution of 32 × 32 grouped into 10 classes. The classes cover animals and vehicles and are mutually exclusive, meaning there is no overlap between the categories. The starting point for solving this problem was a residual model with convolutional layers called ThriftyNet [24]. In that publication, the authors demonstrated that their model with 40 k parameters achieves 90% accuracy on the image classification task. Table 2 compares several models classifying images from the CIFAR-10 dataset. It can be seen that model size grows rapidly with accuracy: to achieve over 90% accuracy, networks need several hundred thousand parameters or more. It is possible to create a small network that performs this task, but the choice of architecture is key. PixelHop++ [25] has 22.2 k more parameters than ThriftyNet, but its accuracy is worse by as much as 25 percentage points.

3.3. Speech Recognition

Audio and speech analysis is very important in the context of embedded devices. It includes tasks such as speech recognition, waking up devices by voice and converting speech to text. If the neural network were small enough, the audio could be processed directly in, e.g., an earpiece or Bluetooth headset instead of being sent to a higher-level processing unit. Therefore, the third dataset used was Google Speech Commands [30]. The dataset contains 105,829 one-second-long utterances recorded by 2618 speakers. It contains various recordings of voice commands used to control a device verbally. The data were encoded as linear 16-bit single-channel PCM values at a 16 kHz rate. In this study, the number of processed commands was reduced to 12. The audio signals were converted into 40-band Mel spectrograms and processed by the model in this form. The authors of [31] used the same procedure. Their base architecture was the ResNet model, or more precisely its smaller version, ResNet-8-Narrow, with 19.9 k parameters. It allows voice commands to be classified with an accuracy of 91%. The larger ResNet-26 variant, with 438 k parameters, achieved 95.2% accuracy. Table 3 shows a comparison of models over time. The sizes of the models range from about ten thousand to several hundred thousand parameters. All the collected neural networks have high accuracy. It can also be observed that the number of parameters does not directly translate into better results. This shows that matching the architecture to the type of problem is crucial.
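The sketch below illustrates this front-end, converting a one-second 16 kHz clip into a 40-band log-Mel spectrogram; it assumes the librosa library and typical 25 ms/10 ms window and hop sizes, since the paper does not state which audio-processing library or STFT parameters were used.

import librosa
import numpy as np

def wav_to_logmel(path, sr=16000, n_mels=40, n_fft=400, hop_length=160):
    # Load (and resample if necessary) to 16 kHz mono PCM.
    y, _ = librosa.load(path, sr=sr, mono=True)
    # Pad or trim to exactly one second, matching the dataset description.
    y = librosa.util.fix_length(y, size=sr)
    # Mel power spectrogram with 40 Mel bands, then log scaling.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

# Example: wav_to_logmel("command.wav").shape -> (40, 101) for a one-second clip.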

4. Network Training Pipeline

Model training consisted of several stages. The first was to reduce the models so that their number of parameters was below 3 k. This was carried out through modifications to the model architecture. Models prepared in this way were trained using two methods: QAT and knowledge distillation. In each method, different configurations of the learning rate, optimizer and batch size were tested. For knowledge distillation, the alpha parameter was also varied. The model was trained eight times for each combination, and the median was calculated from these results. In this way, results for the full-precision networks were collected.
The next step was to quantize the models. The TensorFlow Model Optimization Toolkit Python library, version 0.7.2, was used for this. The quantized models were also trained using the two methods with different combinations of parameters.
Figure 1 shows the course of the research. Each training process was repeated eight times to minimize the impact of outliers, which are likely to be observed because small models often fall into local minima. The results given in the publication are the median results of all training processes for a specific network.
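As an illustration of this protocol, the sketch below trains a given configuration eight times and reports the median accuracy; build_model and train_and_evaluate are placeholders standing in for the actual model construction and training code.

import statistics

def median_accuracy(build_model, train_and_evaluate, repeats=8):
    accuracies = []
    for _ in range(repeats):
        model = build_model()                          # fresh random initialization each run
        accuracies.append(train_and_evaluate(model))   # returns test accuracy
    # The median is robust to single runs stuck in a poor local minimum.
    return statistics.median(accuracies)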

5. Network Training Methods

In this study, it was assumed that small and quantized networks can achieve better accuracy if they are trained using methods other than training from scratch [38]. Therefore, to obtain the best model accuracy, the networks were also trained using knowledge distillation and quantization-aware training techniques. The average results of the methods were compared with each other to find the best results for the neural networks.

5.1. Knowledge Distillation

One of the training methods that minimizes the decrease in network accuracy during downscaling is knowledge distillation. The main idea of this technique is to train two neural networks. The network trained first is a full-precision network, called the teacher. The second network, the student, is quantized. The knowledge accumulated in the full-precision network is transferred to the student during training, which results in smaller degradation of the accuracy of the quantized model. In a classification problem, the most frequently used output function is the softmax, which returns values interpreted as the probability $q_i$ of class $i$, computed from its logit $z_i$ relative to all classes. The temperature $T$ is typically set to 1, but increasing its value smooths the probability distribution.

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \qquad (1)$$
Knowledge transfer occurs when calculating the cross-entropy gradient $\frac{dC}{dz_i}$ with respect to the logits entering the softmax. The probabilities $q_i$ and $p_i$ from the output layers of the two networks are subtracted from each other:

$$\frac{dC}{dz_i} = \frac{1}{T}\left(q_i - p_i\right) = \frac{1}{T}\left(\frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} - \frac{\exp(v_i / T)}{\sum_j \exp(v_j / T)}\right) \qquad (2)$$

where $v_i$ is the logit of class $i$ produced by the second network, from which the probabilities $p_i$ are computed.
An approximation of (2) is possible if the temperature is large compared to the magnitude of the logits:

$$\frac{dC}{dz_i} \approx \frac{1}{T}\left(\frac{1 + z_i / T}{N + \sum_j z_j / T} - \frac{1 + v_i / T}{N + \sum_j v_j / T}\right) \qquad (3)$$
Assuming that the logits have been zero-meaned separately for each transfer example, the sums $\sum_j z_j / T$ and $\sum_j v_j / T$ equal zero, and (3) simplifies to

$$\frac{dC}{dz_i} \approx \frac{1}{N T^2}\left(z_i - v_i\right) \qquad (4)$$
This study used the modified knowledge distillation method described in [38]. It adds an $\alpha$ parameter, which controls the weight of the teacher-derived term relative to the student's own loss. An $\alpha$ of 1 means that the distillation results are not taken into account.

$$\mathrm{loss} = \alpha \cdot \mathrm{loss}_{\mathrm{student}} + (1 - \alpha) \cdot \frac{dC}{dz_i} \qquad (5)$$
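As an illustration, the sketch below shows one possible Keras-style implementation of this weighted loss, with the distillation term computed as the cross-entropy between the temperature-softened teacher and student distributions of (1)-(2); the alpha and temperature defaults are placeholders rather than the exact settings used in the experiments.

import tensorflow as tf

def distillation_loss(teacher_logits, student_logits, labels,
                      alpha=0.5, temperature=4.0):
    # Hard-label loss: the student against the ground-truth labels.
    student_loss = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)
    # Soft-label (distillation) term: the student matches the teacher's
    # temperature-softened distribution, cf. Eq. (1).
    soft_targets = tf.nn.softmax(teacher_logits / temperature)
    soft_preds = tf.nn.softmax(student_logits / temperature)
    distill = tf.keras.losses.categorical_crossentropy(soft_targets, soft_preds)
    # Weighted combination as in Eq. (5); alpha = 1 disables distillation.
    # Some implementations additionally scale the soft term by temperature**2
    # to keep gradient magnitudes comparable; that is omitted here.
    return alpha * student_loss + (1.0 - alpha) * distill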

5.2. Quantization Aware Training (QAT)

This training technique allows quantized networks to be trained optimally. Unlike post-training quantization, QAT already takes the quantization of the weights and activations into account at the training stage. During forward propagation, the full-precision network coefficients are limited to the range of values that can be represented at the lower resolution. In the backpropagation step, gradients are calculated assuming that the gradient of the quantization function is equal to one. In this way, the weights are updated in full precision, while the gradients are calculated for the quantized values. This method was described in more detail in [39], and its potential was shown in [40], where the authors demonstrated that its use improves the accuracy of quantized deep networks.
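A minimal sketch of this procedure with the TensorFlow Model Optimization Toolkit used in this study is shown below; the optimizer, learning rate and epoch count are illustrative placeholders, and the toolkit's default scheme is 8-bit, so 4-bit training would additionally require a custom QuantizeConfig that is not shown here.

import tensorflow as tf
import tensorflow_model_optimization as tfmot

def train_qat(model, train_ds, val_ds, epochs=30, learning_rate=1e-3):
    # Insert fake-quantization nodes into the Keras model. The underlying
    # weights stay in full precision; during backpropagation the quantizer's
    # gradient is treated as one (straight-through estimator), as described
    # in the text.
    qat_model = tfmot.quantization.keras.quantize_model(model)
    qat_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
    qat_model.fit(train_ds, validation_data=val_ds, epochs=epochs)
    return qat_model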

6. Network Downscaling Methods

It was assumed that each model should be limited to 3 k parameters or fewer. For the NSL-KDD problem, the fully connected network was reduced by changing the model architecture. The model was deepened by increasing the number of hidden layers from 1 to 3, while the number of nodes in each layer was reduced from 61 to 24. The architecture, called 3 × 24, had 2.7 k parameters. The ThriftyNet model was scaled down to meet the assumptions by reducing the number of filters in its convolutional layers from 175 to 38. Thanks to this, the model had 2.9 k parameters. A similar technique was used for ResNet-8-Narrow, where the number of filters was reduced from 19 to 7. As a result, the model had only 2.8 k parameters.
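An illustrative Keras sketch of the reduced 3 × 24 fully connected architecture is given below; the 41-dimensional input, the ReLU/softmax activations and the 20 output classes are assumptions based on the dataset description, chosen so that the parameter count lands at the roughly 2.7 k reported in Table 4.

import tensorflow as tf

def build_reduced_mlp(num_features=41, num_classes=20, hidden=24):
    # Three hidden layers of 24 nodes replace the single 61-node layer
    # of the baseline model from [19].
    return tf.keras.Sequential([
        tf.keras.Input(shape=(num_features,)),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(hidden, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

# build_reduced_mlp().count_params() returns 2708, i.e., about 2.7 k parameters.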
The results of model reduction are presented in Table 4. For the NSL-KDD problem, the number of parameters was reduced by 28.9%, which resulted in an accuracy decrease of 1.5 percentage points. The model for CIFAR-10 was reduced by 92.6%, and its accuracy dropped by 5.8 pp. In the case of the Google Speech dataset, shrinking the network parameters by 85.9% resulted in an accuracy degradation of 8.2 pp. Two of the three models maintained an accuracy above 80% after downscaling. The obtained results show a clear relationship: the more the model is shrunk, the greater the accuracy degradation.

7. Quantizing Models

The previously reduced models were subjected to 8- and 4-bit quantization. For this purpose, the TensorFlow Model Optimization Toolkit was used. The results are shown in Figure 2, Figure 3 and Figure 4. For the NSL-KDD problem, accuracy worsened by 4.4 pp for the 8-bit model and by 3.4 pp for the 4-bit model relative to the baseline. The network accuracy for CIFAR-10 decreased by 19.38 pp and 67.05 pp for the 8- and 4-bit models, respectively. The very poor result for the 4-bit model is caused by the exploding gradient problem at such a low network resolution. Such a significant reduction made the model unable to learn the classification of this dataset. For the Google Speech network, quantization resulted in an accuracy degradation of 6.6 pp and 6.9 pp for the 8- and 4-bit models, respectively.
In the case of NSL-KDD and Google Speech Commands, the results of 8- and 4-bit quantization are very similar. This shows that the use of appropriate training methods and the selection of parameters can reduce the degradation of accuracy caused by quantization. Smaller 4-bit models can effectively replace 8-bit models without significantly affecting their performance.
Table 5, Table 6 and Table 7 compare models described in the literature with the results of this work. To make the comparison meaningful, the accuracy of the best training run is given for our models. For the scaled-down models, it is clear that the median of eight training runs differs significantly from the best result, an effect of the unstable training of small models. It can also be seen how quantization reduces the memory footprint of the model. The reduced and quantized models require several times less memory than the corresponding full-precision networks from the literature.
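The memory figures in these tables follow a simple weight-storage estimate, parameter count times bit width, as in the sketch below; packing overheads, activations and program code are ignored.

def weight_memory_kb(num_params, bits):
    # Weight storage only: bits per parameter converted to kilobytes.
    return num_params * bits / 8 / 1000

for bits in (32, 8, 4):
    print(f"{bits:>2}-bit ANN (2.7 k params): {weight_memory_kb(2700, bits):.1f} kB")
# 32-bit: 10.8 kB, 8-bit: 2.7 kB, 4-bit: 1.4 kB (cf. Table 5 and Table 8).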

8. Synthesis to ASIC

An important factor when creating integrated sensors with neural networks is the silicon area occupied by the model. The physical size of the network can significantly increase the sensor size, which is why quantization is so important. Each component of the network, e.g., a multiplier, is then scaled accordingly, significantly reducing the area footprint. To show the possible gain in silicon area, HDL code was generated using the HLS4ML Python library, version 0.6.0, which converts the model while maintaining its bit resolution. The code generated in this way was fed into software commonly used in the design of application-specific integrated circuits (ASICs), namely the Fusion Compiler from Synopsys. The tool had the elaboration option enabled to verify the correctness of the created ASIC design containing the neural network. The entire workflow is shown in Figure 5.
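A sketch of the HDL-generation step with hls4ml is shown below; the fixed-point precision string, the configuration granularity and the output directory are illustrative assumptions, since the exact conversion settings are not listed in the text.

import hls4ml

def keras_to_hls(model, bits=8, int_bits=3, output_dir="hls_model"):
    # Build a default conversion config from the Keras model.
    config = hls4ml.utils.config_from_keras_model(model, granularity="model")
    # Fixed-point precision chosen to match the quantized network,
    # e.g. ap_fixed<8,3> for the 8-bit variant.
    config["Model"]["Precision"] = f"ap_fixed<{bits},{int_bits}>"
    hls_model = hls4ml.converters.convert_from_keras_model(
        model, hls_config=config, output_dir=output_dir)
    hls_model.compile()  # C-model compilation; HDL synthesis follows separately
    return hls_model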
Figure 6 shows the percentage decrease in the silicon area occupied by the neural network for NSL-KDD. Reducing the number of model parameters by 28.9% shrank the area by 31.3% relative to the base model. The 8-bit quantization reduced the ASIC area by 78.3%, and the 4-bit quantization reduced it by 90.6%. The obtained results show that combining model downsizing with quantization allows tiny chips implementing neural networks to be created without a significant impact on their effectiveness.
The neural network synthesis was performed for a clock frequency of 1 GHz. Table 8 shows how quantization affected the memory required to store the model. The number of network parameters did not change, but by reducing the weights to 4 bits, the model needs about 10 times less memory for storage. Quantization did not affect the speed of the model, because the number of mathematical operations performed did not differ. Nevertheless, the system works very fast, because the operations in each layer are parallelized. Operations within each layer are performed simultaneously, and groups of flip-flops in the data path buffer intermediate results so that the system can operate with a high-frequency clock.
Table 9 shows the estimated number of MACs for all three reduced, non-quantized networks. The speed of each network will depend on the hardware used. Knowing how many clock cycles one MAC operation takes allows the data-processing time to be calculated. Despite the strong hardware dependency, it can be concluded that networks with up to 3000 parameters can efficiently process data in real time.
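For instance, the rough estimate below divides the MAC count by an assumed sustained MAC throughput; the 100 MHz, one-MAC-per-cycle target is a hypothetical example, not hardware evaluated in this work.

def inference_time_us(macs, macs_per_cycle, clock_hz):
    # Processing time = required cycles / clock frequency, in microseconds.
    cycles = macs / macs_per_cycle
    return cycles / clock_hz * 1e6

# NSL-KDD2 network (2.6 k MACs) on a hypothetical 100 MHz, 1-MAC/cycle core:
print(f"{inference_time_us(2_600, 1, 100e6):.0f} us")    # -> 26 us
# Google Speech Commands network (213 k MACs) on the same hypothetical core:
print(f"{inference_time_us(213_000, 1, 100e6):.0f} us")  # -> 2130 us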

9. Conclusions

This study shows the impact of model reduction methods on classification accuracy. For two of the research problems, NSL-KDD and Google Speech, the 8- and 4-bit networks show acceptable degradation in accuracy compared to the typical 32-bit architecture and allow a significant reduction in the memory needed to store model weights in embedded systems, or in the silicon area when designing ASICs. Both models achieved mean accuracy above 80% at every reduction and quantization step. The 4-bit quantization allowed the silicon area of the model to be reduced by 90.6%. For the CIFAR-10 dataset, there were major problems with an exploding gradient for the 4-bit network; this architecture may not be suited to such heavy downscaling. For the fully connected networks for NSL-KDD, the accuracy drops turned out to be the smallest. The knowledge distillation training method proved successful, as it most often achieved higher accuracy than the QAT method. We also observed a dependence between this technique's parameters and the resulting accuracy: increasing the alpha and temperature parameters gives better and more repeatable results.
However, there is potential for further development in this area, because such networks are very small and easy to implement in the embedded-device sector. We suggest continuing tests of 4-bit networks and possibly promoting them as a newer standard for quantized deep networks, as they are currently not supported by the hardware of any major company in the industry.

Author Contributions

Conceptualization, P.T.; Methodology, P.T. and M.S.; Validation, J.P.; Investigation, M.S. and J.P.; Writing—original draft, P.T.; Writing—review & editing, M.S. and J.P.; Supervision, P.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Azadbakht, A.; Kheradpisheh, S.R.; Khalfaoui-Hassani, I.; Masquelier, T. Drastically Reducing the Number of Trainable Parameters in Deep CNNs by Inter-layer Kernel-sharing. arXiv 2022, arXiv:2210.14151. [Google Scholar] [CrossRef]
  2. Sheng, Y.; Yang, J.; Wu, Y.; Mao, K.; Shi, Y.; Hu, J.; Jiang, W.; Yang, L. The Larger The Fairer? Small Neural Networks Can Achieve Fairness for Edge Devices. arXiv 2022, arXiv:2202.11317. [Google Scholar] [CrossRef]
  3. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
  4. Susskind, Z.; Arora, A.; Miranda, I.D.S.; Bacellar, A.T.L.; Villon, L.A.Q.; Katopodis, R.F.; de Araújo, L.S.; Dutra, D.L.C.; Lima, P.M.V.; França, F.M.G.; et al. ULEEN: A Novel Architecture for Ultra-low-energy Edge Neural Networks. ACM Trans. Arch. Code Optim. 2023, 20, 1–24. [Google Scholar] [CrossRef]
  5. Deng, L. The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web]. IEEE Signal Process. Mag. 2012, 29, 141–142. [Google Scholar] [CrossRef]
  6. Ronco, A.; Schulthess, L.; Zehnder, D.; Magno, M. Machine Learning In-Sensors: Computation-enabled Intelligent Sensors For Next Generation of IoT. In Proceedings of the 2022 IEEE Sensors, Dallas, TX, USA, 30 October–2 November 2022; pp. 1–4. [Google Scholar]
  7. Kocić, J.; Jovičić, N.; Drndarević, V. An End-to-End Deep Neural Network for Autonomous Driving Designed for Embedded Automotive Platforms. Sensors 2019, 19, 2064. [Google Scholar] [CrossRef] [PubMed]
  8. Lin, Z.Q.; Chung, A.G.; Wong, A. EdgeSpeechNets: Highly Efficient Deep Neural Networks for Speech Recognition on the Edge. arXiv 2018, arXiv:1810.08559. [Google Scholar] [CrossRef]
  9. Venzke, M.; Klisch, D.; Kubik, P.; Ali, A.; Missier, J.D.; Turau, V. Artificial Neural Networks for Sensor Data Classification on Small Embedded Systems. arXiv 2020, arXiv:2012.08403. [Google Scholar] [CrossRef]
  10. Hsu, T.-H.; Chen, G.-C.; Chen, Y.-R.; Liu, R.-S.; Lo, C.-C.; Tang, K.-T.; Chang, M.-F.; Hsieh, C.-C. A 0.8 V Intelligent Vision Sensor with Tiny Convolutional Neural Network and Programmable Weights Using Mixed-Mode Processing-in-Sensor Technique for Image Classification. IEEE J. Solid-State Circuits 2023, 58, 3266–3274. [Google Scholar] [CrossRef]
  11. Zhang, C.; Chang, J.; Guan, Y.; Li, Q.; Wang, X.; Zhang, X. A Low-Power ECG Processor ASIC Based on an Artificial Neural Network for Arrhythmia Detection. Appl. Sci. 2023, 13, 9591. [Google Scholar] [CrossRef]
  12. Lee, S.S.; Nguyen, T.D.; Meher, P.K.; Park, S.Y. Energy-Efficient High-Speed ASIC Implementation of Convolutional Neural Network Using Novel Reduced Critical-Path Design. IEEE Access 2022, 10, 34032–34045. [Google Scholar] [CrossRef]
  13. Gupta, A.; Gupta, A.; Gupta, R. Efficient ASIC Implementation of Artificial Neural Network with Posit Representation of Floating-Point Numbers. In International Conference on Next Generation Systems and Networks; Springer: Berlin/Heidelberg, Germany, 2023; Volume 641, pp. 43–56. [Google Scholar] [CrossRef]
  14. Zafrir, O.; Boudoukh, G.; Izsak, P.; Wasserblat, M. Q8BERT: Quantized 8Bit BERT. In Proceedings of the 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing—NeurIPS Edition (EMC2-NIPS), Vancouver, BC, Canada, 13 December 2019; pp. 36–39. [Google Scholar]
  15. Tadahal, S.; Bhogar, G.; Meena, S.M.; Kulkarni, U.; Gurlahosur, S.V.; Vyakaranal, S.B. Post-training 4-bit Quantization of Deep Neural Networks. In Proceedings of the 2022 3rd International Conference for Emerging Technology (INCET), Belgaum, India, 27–29 May 2022; pp. 1–5. [Google Scholar]
  16. McDanel, B.; Teerapittayanon, S.; Kung, H.T. Embedded Binarized Neural Networks. arXiv 2017, arXiv:1709.02260. [Google Scholar] [CrossRef]
  17. Swamy, T.; Rucker, A.; Shahbaz, M.; Gaur, I.; Olukotun, K. Taurus: A data plane architecture for per-packet ML. In Proceedings of the ASPLOS ’22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February–4 March 2022; pp. 1099–1114. [Google Scholar]
  18. Dhanabal, L.; Shantharajah, S.P. A Study on NSL-KDD Dataset for Intrusion Detection System Based on Classification Algorithms. Int. J. Adv. Res. Comput. Commun. Eng. 2015, 4, 446–452. [Google Scholar]
  19. Iglesias, F.; Zseby, T. Analysis of network traffic features for anomaly detection. Mach. Learn. 2014, 101, 59–84. [Google Scholar] [CrossRef]
  20. Chowdhury, M.U.; Hammond, F.; Konowicz, G.; Xin, C.; Wu, H.; Li, J. A few-shot deep learning approach for improved intrusion detection. In Proceedings of the 2017 IEEE 8th Annual Ubiquitous Computing, Electronics and Mobile Communication Conference (UEMCON), New York, NY, USA, 19–21 October 2017; pp. 456–462. [Google Scholar] [CrossRef]
  21. Hindy, H.; Atkinson, R.; Tachtatzis, C.; Colin, J.-N.; Bayne, E.; Bellekens, X. Utilising Deep Learning Techniques for Effective Zero-Day Attack Detection. Electronics 2020, 9, 1684. [Google Scholar] [CrossRef]
  22. Hizal, S.; Cavusoglu, U.; Akgun, D. A new Deep Learning Based Intrusion Detection System for Cloud Security. In Proceedings of the 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 11–13 June 2021; pp. 1–4. [Google Scholar]
  23. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. University of Toronto. 2009. Available online: https://api.semanticscholar.org/CorpusID:18268744 (accessed on 10 October 2024).
  24. Coiffier, G.; Hacene, G.B.; Gripon, V. ThriftyNets: Convolutional Neural Networks with Tiny Parameter Budget. IoT 2021, 2, 222–235. [Google Scholar] [CrossRef]
  25. Chen, Y.; Rouhsedaghat, M.; You, S.; Rao, R.; Kuo, C.-C.J. Pixelhop++: A Small Successive-Subspace-Learning-Based (Ssl-Based) Model For Image Classification. In Proceedings of the 2020 IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 25–28 October 2020; pp. 3294–3298. [Google Scholar]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar] [CrossRef]
  27. Hasanpour, S.H.; Rouhani, M.; Fayyaz, M.; Sabokrou, M.; Adeli, E. Towards Principled Design of Deep Convolutional Networks: Introducing SimpNet. arXiv 2018, arXiv:1802.06205. [Google Scholar] [CrossRef]
  28. Sharif, M.; Kausar, A.; Park, J.; Shin, D.R. Tiny Image Classification using Four-Block Convolutional Neural Network. In Proceedings of the 2019 International Conference on Information and Communication Technology Convergence (ICTC), Jeju, Republic of Korea, 16–18 October 2019; pp. 1–6. [Google Scholar]
  29. Chu, X.; Zhang, B.; Li, X. Noisy Differentiable Architecture Search. arXiv 2021, arXiv:2005.03566. [Google Scholar] [CrossRef]
  30. Warden, P. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition. arXiv 2018, arXiv:1804.03209. [Google Scholar] [CrossRef]
  31. Tang, R.; Lin, J. Deep residual learning for small-footprint keyword spotting. In Proceedings of the ICASSP, Calgary, AB, Canada, 15–20 April 2018; pp. 5484–5488. [Google Scholar]
  32. Zhang, Y.; Suda, N.; Lai, L.; Chandra, V. Hello Edge: Keyword Spotting on Microcontrollers. arXiv 2018, arXiv:1711.07128. [Google Scholar] [CrossRef]
  33. Myer, S.; Tomar, V.S. Efficient Keyword Spotting Using Time Delay Neural Networks. In Proceedings of the Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 1264–1268. [Google Scholar]
  34. Choi, S.; Seo, S.; Shin, B.; Byun, H.; Kersner, M.; Kim, B.; Kim, D.; Ha, S. Temporal Convolution for Real-Time Keyword Spotting on Mobile Devices. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019. [Google Scholar]
  35. Wong, A.; Famouri, M.; Pavlova, M.; Surana, S. TinySpeech: Attention Condensers for Deep Speech Recognition Neural Networks on Edge Devices. arXiv 2020, arXiv:2008.04245. [Google Scholar] [CrossRef]
  36. Banbury, C.; Zhou, C.; Fedorov, I.; Navarro, R.M.; Thakker, U.; Gope, D.; Reddi, V.J.; Mattina, M.; Whatmough, P. MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers. arXiv 2020, arXiv:2010.11267. [Google Scholar] [CrossRef]
  37. Ng, D.; Chen, Y.; Tian, B.; Fu, Q.; Chng, E.S. Convmixer: Feature Interactive Convolution with Curriculum Learning for Small Footprint and Noisy Far-Field Keyword Spotting. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 3603–3607. [Google Scholar]
  38. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar] [CrossRef]
  39. Roth, W.; Schindler, G.; Klein, B.; Peharz, R.; Tschiatschek, S.; Fröning, H.; Pernkopf, F.; Ghahramani, Z. Resource-Efficient Neural Networks for Embedded Systems. arXiv 2022, arXiv:2001.03048. [Google Scholar] [CrossRef]
  40. Wang, K.; Liu, Z.; Lin, Y.; Lin, J.; Han, S. HAQ: Hardware-Aware Automated Quantization with Mixed Precision. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 8604–8612. [Google Scholar]
Figure 1. Workflow of research.
Figure 2. Accuracy degradation of reduced model for NSL-KDD problem.
Figure 3. Accuracy degradation of reduced model for CIFAR-10 problem.
Figure 4. Accuracy degradation of reduced model for Google Speech Commands problem.
Figure 5. ASIC synthesis workflow.
Figure 6. Percentage decrease of silicon area due to model reduction and quantization.
Table 1. Comparison of model sizes for the NSL-KDD2 dataset.

Network         Parameters No.   Accuracy   Year
ANN [19]        3.8 k            84.9%      2015
CNN [20]        159 k            94.62%     2018
AEth 0.2 [21]   36.8 k           94.54%     2020
CNN+GRU [22]    202 k            99.86%     2021
Table 2. Comparison of model sizes for the CIFAR-10 dataset.

Network                   Parameters No.   Accuracy   Year
ResNet-110 [26]           1.7 M            93%        2015
SimpNet slimmed [27]      300 k            93%        2018
CNN 4 Conv Lyr. [28]      2.8 M            92%        2019
ThriftyNet [24]           40 k             90%        2020
PixelHop++ (Small) [25]   62.2 k           65%        2020
NoisyDARTS-A-t [29]       4.3 M            98%        2020
Table 3. Comparison of model sizes for the Google Speech Commands dataset.

Network                Parameters No.   Accuracy   Year
LSTM [32]              28 k             88.8%      2017
ResNet-8-Narrow [31]   19.9 k           91.1%      2018
TDNN [33]              251 k            94.3%      2018
TCResNet8 [34]         66 k             96.1%      2019
TinySpeech-X [35]      10.8 k           94.6%      2020
MN-KWS-M [36]          167 k            95.8%      2021
ConvMixer [37]         119 k            98.2%      2022
Table 4. Reduction methods and results.

Networks                 NSL-KDD            CIFAR-10                 Google Speech
Parameter no.            3.8 k              39.6 k                   19.9 k
Methods                  1 × 61 to 3 × 24   Filters from 175 to 38   Filters from 19 to 7
Reduced parameter no.    2.7 k              2.9 k                    2.8 k
Parameter reduction      28.9%              92.6%                    85.9%
Networks accuracy:
Before downscaling       84.9%              90.2%                    91.4%
After downscaling        83.4%              84.4%                    83.2%
Accuracy degradation     1.5 pp             5.8 pp                   8.2 pp
Table 5. Comparison of algorithm results for NSL-KDD2.

Algorithm        Quantization   Parameters   Memory     Accuracy
ANN [19]         -              3.8 k        15.2 kB    84.9%
AEth 0.2 [21]    -              36.8 k       147.2 kB   94.5%
CNN [20]         -              159 k        636 kB     94.6%
This paper:
ANN              -              2.7 k        10.8 kB    86.4%
ANN              8-bit          2.7 k        2.7 kB     84.7%
ANN              4-bit          2.7 k        1.4 kB     83.1%
Table 6. Comparison of algorithm results for CIFAR-10.

Algorithm                 Quantization   Parameters   Memory     Accuracy
ThriftyNet [24]           -              40 k         160 kB     90.2%
PixelHop++ (Small) [25]   -              62.2 k       248.8 kB   64.8%
SimpNet slimmed [27]      -              300 k        1.2 MB     93.3%
This paper:
ThriftyNet                -              2.9 k        11.6 kB    89.2%
ThriftyNet                8-bit          2.9 k        2.9 kB     74.4%
ThriftyNet                4-bit          2.9 k        1.5 kB     25.7%
Table 7. Comparison of algorithm results for Google Speech Commands.

Algorithm              Quantization   Parameters   Memory    Accuracy
LSTM [32]              -              28 k         112 kB    88.8%
ResNet-8-Narrow [31]   -              19.9 k       79.6 kB   91.1%
TinySpeech-X [35]      -              10.8 k       43.2 kB   94.6%
This paper:
ResNet-8-Narrow        -              2.8 k        11.2 kB   90.6%
ResNet-8-Narrow        8-bit          2.8 k        2.8 kB    90.1%
ResNet-8-Narrow        4-bit          2.8 k        1.4 kB    89.4%
Table 8. Impact of quantization on model memory usage.

Model Resolution   Memory    Frequency   Clock Cycles
32 bits            10.8 kB   1 GHz       18
8 bits             2.7 kB    1 GHz       18
4 bits             1.4 kB    1 GHz       18
Table 9. Estimated MACs for the reduced networks.

        NSL-KDD2   CIFAR-10   Google Speech Commands
MACs    2.6 k      78 k       213 k

