Open Access Article

Accelerating Deep Learning-Based Morphological Biometric Recognition with Field-Programmable Gate Arrays

Nourhan Zayed ¹,², Nahed Tawfik, Mervat M. A. Mahmoud, Ahmed Fawzy ⁴, Young-Im Cho ⁵,* and Mohamed S. Abdallah ⁶,⁷,*

¹ Computers and Systems Department, Electronics Research Institute (ERI), Cairo 11843, Egypt
² Mechatronics Engineering, The British University in Egypt, Cairo 11843, Egypt
³ Microelectronics Department, Electronics Research Institute (ERI), Cairo 11843, Egypt
⁴ Nanotechnology Lab, Electronics Research Institute (ERI), Cairo 11843, Egypt
⁵ Department of Computer Engineering, Gachon University, Seongnam 13415, Republic of Korea
⁶ Informatics Department, Electronics Research Institute (ERI), Cairo 11843, Egypt
⁷ AI Lab, DeltaX Co., Ltd., 5F, 590 Gyeongin-ro, Guro-gu, Seoul 08213, Republic of Korea
* Authors to whom correspondence should be addressed.

AI 2025, 6(1), 8; https://doi.org/10.3390/ai6010008

Submission received: 29 October 2024 / Revised: 2 January 2025 / Accepted: 2 January 2025 / Published: 9 January 2025

(This article belongs to the Special Issue Artificial Intelligence-Based Image Processing and Computer Vision)


Figure 1. AT&T dataset (downloaded from https://www.kaggle.com/datasets/kasikrit/att-database-of-faces; accessed on 13 October 2024).
Figure 2. PYNQ-Z2.
Figure 3. Zybo Z7-20 Zynq-7000 SoC development board.
Figure 4. Raspberry Pi 3 Model B.
Figure 5. OV7670 camera module.
Figure 6. A flowchart of the proposed model.
Figure 7. Top-level block diagram. The light blue block (3) is a regular IP, while the blue blocks (1, 2, and 4) are hierarchy blocks that group IP blocks together. Block no. 1, named camera_in, is the original data producer. It groups together the IP blocks needed to decode image data coming from the camera and to format it to suit our needs. Block no. 2, named video_out, is the ultimate data consumer. It groups the IP blocks that perform DVI encoding, so that the image data can be displayed on a monitor. We are going to look at these two hierarchy blocks later. Block no. 3 is an actual IP, named axi_vdma. It is a Xilinx IP with the full name AXI Video Direct Memory Access. VDMA sits in the middle of the video data flow, and its central role makes it an interesting addition. It is needed to decouple two incompatible video interfaces, the image sensor's MIPI CSI-2 and the monitor's DVI.
Figure 8. The hierarchy of the control block, which illustrates the input, output, and control interfaces modelled in C/C++.
Figure 9. AlexNet accuracy.
Figure 10. AlexNet loss.
Figure 11. ResNet18 accuracy.
Figure 12. ResNet18 loss.
Figure 13. Accuracy of the VGG16 network.
Figure 14. Loss curve of the VGG16 network.
Figure 15. GoogLeNet accuracy.
Figure 16. GoogLeNet loss curve.


Abstract

Convolutional neural networks (CNNs) are increasingly recognized as an important and potent artificial intelligence approach, widely employed in many computer vision applications, such as facial recognition. Their importance lies in their capacity to learn hierarchical features, which is essential for recognizing complex patterns. However, the intricate architecture of CNNs leads to significant computing requirements. To address this, CNNs can be accelerated on field-programmable gate arrays (FPGAs), which provide fast development, energy efficiency, low latency, and advanced reconfigurability. This paper proposes a facial recognition solution that leverages deep learning and is subsequently deployed on an FPGA platform. The system detects whether a person has the necessary authorization to enter or access a place. The FPGA performs all processing securely and without any internet connectivity. Several facial recognition networks were implemented, including AlexNet, ResNet, VGG-16, and GoogLeNet. The findings indicate that GoogLeNet is the best fit due to its lower computational resource requirements, speed, and accuracy. The system was deployed on three hardware kits to appraise the performance of different programming approaches in terms of accuracy, latency, cost, and power consumption. The software implementation on the Raspberry Pi 3B kit had a recognition accuracy of around 70–75% and relied on a stable internet connection for processing. This dependency on internet connectivity increases bandwidth consumption and fails to meet the required security criteria, in contrast to the hardware implementation on the ZYBO-Z7 board. The hardware/software co-design on the PYNQ-Z2 board achieved an accuracy of 85% to 87%. It operates independently of an internet connection, making it a standalone system and saving costs.

Keywords: morphological biometrics; face recognition; CNN; deep learning; FPGA machine learning

1. Introduction

Technology has spread to every aspect of our lives, and the volume of information has increased drastically as a result. Confidentiality is particularly important for institutions and corporate enterprises, since it protects them from exposure to their competitors, and these establishments must therefore safeguard their data. Several system approaches are available nowadays, including microprocessor-based systems such as microcontrollers, and field-programmable gate arrays (FPGAs), which are programmable hardware integrated circuits. FPGAs have numerous advantageous characteristics that enhance their usability in various applications. One such feature is their inherent speed: because FPGAs operate in parallel, they can execute functions simultaneously.

Recent advancements in Artificial Neural Networks (ANNs) have revolutionized various applications, particularly in speech recognition, machine translation, and scene analysis, by utilizing deep learning algorithms for effective sequence data processing. Deep learning architectures, characterized by multiple convolutional layers and pooling mechanisms, enhance feature extraction and prediction accuracy. The Tensor Processing Unit (TPU) has emerged as a powerful architecture for executing neural network models efficiently, while the integration of deep learning in medical applications highlights the need for tailored solutions. Running deep learning algorithms on central processing units (CPUs) and graphics processing units (GPUs) raises several issues, primarily with respect to speed and power efficiency. Therefore, FPGAs are being considered as a promising alternative for real-time embedded systems, offering a trade-off between performance and power efficiency [1].

Power consumption is a critical issue in cloud-based deep neural network (DNN) processing, primarily due to energy-intensive data movement within data centres. Strategies such as storing weights and intermediate results in on-chip buffers can mitigate this concern by reducing the time and power needed for data retrieval. Optimizing the utilization of processing elements (PEs) is essential for enhancing throughput, which depends on the number of PEs and on how efficiently they serve the requirements of the DNN model. Additionally, convolutional PE arrays are designed to minimise resource consumption while maintaining performance, allowing multiple convolution operations to execute simultaneously. Various hardware accelerators, including GPUs, FPGAs, and ASICs, are employed to improve computational efficiency across different industries. GPUs are optimized for parallel processing tasks, while FPGAs offer flexibility and lower power consumption. ASICs provide specialized performance for specific computations but lack the versatility of GPUs. The work in [2] discusses the importance of sensitivity, area, throughput, and latency in evaluating machine learning systems, emphasizing the need for future designs to focus on a reconfigurable architecture and parallel processing to enhance real-time inference capabilities. It proposes a novel hardware architecture for perceptrons that aims to improve accuracy and efficiency, particularly in applications like heart attack risk assessments, while also integrating self-healing mechanisms to enhance reliability in neural networks [2].

FPGAs offer significant advantages in deep learning implementations, particularly in mobile and embedded platforms, outperforming traditional GPU setups in efficiency. The flexibility of FPGAs allows for optimized resource allocations, although challenges remain in managing the network topology and size. Effective implementation strategies involve balancing flexibility and optimization, utilizing automatic mapping solutions, and considering fixed-point versus floating-point coding to manage network complexity. The development of frameworks like CNNLab facilitates the testing of various topologies, while performance evaluation metrics focus on resource consumption and energy efficiency. Future developments in programmable deep learning architectures are expected to leverage the advantages of FPGAs to meet the demands of emerging applications across diverse fields [3].
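The fixed-point versus floating-point trade-off mentioned above can be illustrated with a minimal sketch. The Q-format chosen here (16-bit words with 8 fractional bits) and the function names are assumptions made for illustration only; they are not taken from [3] or from the system described in this paper.

```python
def to_fixed(x: float, frac_bits: int = 8, total_bits: int = 16) -> int:
    """Quantize a real value to a signed fixed-point integer with saturation.

    frac_bits and total_bits are illustrative defaults, not values from the paper.
    """
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    q = round(x * scale)
    # Saturate instead of wrapping around, as hardware MAC units commonly do.
    return max(lowest, min(highest, q))


def from_fixed(q: int, frac_bits: int = 8) -> float:
    """Convert a fixed-point integer back to a real value."""
    return q / (1 << frac_bits)


def fixed_dot(weights, activations, frac_bits: int = 8) -> float:
    """Dot product computed on integers only, then rescaled once at the end."""
    qw = [to_fixed(w, frac_bits) for w in weights]
    qa = [to_fixed(a, frac_bits) for a in activations]
    # Each product carries 2 * frac_bits fractional bits.
    acc = sum(w * a for w, a in zip(qw, qa))
    return acc / float(1 << (2 * frac_bits))
```

The integer multiply-accumulate in `fixed_dot` is the part that maps naturally onto FPGA DSP resources; the price is the rounding and saturation error introduced by `to_fixed`.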

In short, our work builds upon these existing efforts by proposing a novel hardware acceleration approach for deep learning-based morphological biometric recognition. This paper introduces a novel biometric face recognition system designed to enhance security in high-security buildings. Unlike traditional systems that rely on internet connectivity, our system operates independently, ensuring robust security. The key contributions of the proposed method are summarized as follows:

Develop a Building Management System (BMS) implemented on an FPGA to control access and prevent unauthorized entries.
Optimize a deep learning model for FPGA hardware, achieving high-speed and accurate face recognition.
Explore hardware acceleration techniques to significantly improve the system’s processing speed.
Conduct extensive simulations and experiments using MATLAB (R2021b, version 9.11) and various deep learning architectures (such as GoogLeNet, SqueezeNet, AlexNet, ResNet, and VGG-16) on the AT&T Face Database to assess performance.
Deploy the system on three different platforms (Raspberry Pi, ZYBO Z7, and PYNQ Z2) to evaluate feasibility and performance.

By combining these advancements, our system offers a fast, efficient, and reliable solution for secure access control in high-security environments. The rest of this paper is organized as follows: Section 2 presents related work. Section 3 describes the materials, and Section 4 explains the methodology. The results and discussion follow, and a conclusion closes the paper.

2. Related Works

Biometric facial recognition is a technology used to identify and authenticate human faces accurately, and facial scanning can be used to confirm a person's identity. This means a person cannot possess multiple driver's licenses or state IDs and will not be identified in a law enforcement database. Facial recognition also plays a vital role in future intelligent vehicle applications by determining whether a person is authorised to drive. Building a facial recognition security application that is both efficient and precise remains difficult for technology businesses. Such an application must detect and verify a driver's identity, which is crucial for identifying suspects and improving public safety. Mobile facial recognition software helps law enforcement quickly identify nearby individuals from a safe distance. The main importance of face recognition therefore lies in security. The remainder of this section examines several studies that use different methodologies for facial recognition.

In [4], Wang et al. proposed a face detection acceleration system that uses hardware and software collaboration to maximise the benefits of both the ARM processor and the FPGA. The MTCNN cascaded deep CNN framework was accelerated, resulting in a complete face detection system based on ZYNQ. The design consists of two main components: software and hardware. Face detection uses the MTCNN framework, in which a three-layer cascaded deep convolutional network predicts facial positions and keypoint coordinates from coarse to fine. The MTCNN model was trained on the WIDER FACE dataset. The system was deployed on the ZYNQ 7020 SoC platform using standard C (C99) within the Xilinx SDK software. The hardware integrates the ADV7511 controller and a video buffer for real-time display of images from an SD card. Direct memory access of video data is achieved using an ARM Cortex-A9 processor and VDMA as an AXI slave device.

The study in [5] introduced a method for designing CNNs as hardware accelerators, focussing on optimising precision, power consumption, and area for resource-limited settings like IoT and healthcare. It highlighted the role of FPGAs in addressing the computational challenges of complex deep-learning models. The method segments input feature maps into smaller windows, which allows the multiplication units to be used efficiently through a control unit and a synchronization buffer, enhancing data processing efficiency. This method reached an accuracy of 98.79% on the CIFAR-10 dataset.

Zhang et al. [6] developed an end-to-end object detection accelerator tailored for FPGA-only devices, emphasising deep learning algorithms such as YOLO v2 and YOLO v3. It tackles the issues of deploying deep convolutional neural networks (CNNs) on resource-constrained hardware by suggesting a versatile architecture that combines CNN computation and post-processing to reduce latency. Key contributions feature a design that minimises DSP and memory use while delivering high throughput and low latency in object detection tasks. The accelerator uses optimisations like fusing batch normalisations with convolution operations and applying a quantisation method to boost performance while handling resource limitations. The hardware design includes efficient data transfer mechanisms, a processing element (PE) array for better convolution computation, and an on-chip memory system that minimises access latency. The accelerator reaches a maximum throughput of 914 GOP/s, showcasing its ability for complex object detection tasks. Performance evaluations show that the design surpasses current FPGA accelerators in throughput and resource efficiency, emphasising the need for effective post-processing algorithms to achieve low-latency requirements. The paper ends with a discussion on future work, focussing on advanced quantisation techniques and the implementation of more object detection algorithms to boost efficiency and performance. This method reached a Mean Average Precision of approximately 76.7% for YOLO v2, 53.5% for YOLO v2 tiny, 75.4% for YOLO v3, and 58.1% for YOLO v3 tiny on the VOC dataset.
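Fusing batch normalisation into the preceding convolution, as mentioned above, can be sketched for a single output channel. This is a generic illustration of the well-known folding identity, not the implementation in [6]; the function names and the use of a flat dot product to stand in for one convolution window are assumptions made for brevity.

```python
import math


def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm parameters of one output channel into conv weights and bias."""
    scale = gamma / math.sqrt(var + eps)
    fused_weights = [w * scale for w in weights]
    fused_bias = (bias - mean) * scale + beta
    return fused_weights, fused_bias


def conv_point(weights, bias, patch):
    """One convolution output: dot product of a kernel with one input window."""
    return sum(w * x for w, x in zip(weights, patch)) + bias


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalisation of a single value."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta
```

After folding, `conv_point(fused_weights, fused_bias, patch)` gives the same value as applying `batchnorm` to `conv_point(weights, bias, patch)`, so the normalisation layer costs no extra hardware at inference time.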

Yu et al. in [7] introduced an architecture for face detection with deep CNNs. The proposed method features a cascading structure of three networks (12-Net, 24-Net, and 48-Net) that enables the early rejection of non-face candidates, reducing the computational load. Models were assessed using the FDDB dataset. The paper highlights the significance of hardware factors for the effective deployment of CNNs. Using Vivado HLS, the hardware implementation shows a performance of 76.8 GFLOPS. This research connects deep learning models with FPGA implementations, achieving 83%.
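The benefit of early rejection in a cascade can be sketched as follows. The stage scoring functions and thresholds are hypothetical placeholders; in [7] the stages are the 12-Net, 24-Net, and 48-Net CNNs, which are not reproduced here.

```python
def run_cascade(candidates, stages):
    """Pass candidates through successive stages, dropping rejects early.

    stages is a list of (score_fn, threshold) pairs ordered from cheapest to
    most expensive. Returns the surviving candidates and the total number of
    stage evaluations performed.
    """
    survivors = list(candidates)
    evaluations = 0
    for score_fn, threshold in stages:
        kept = []
        for candidate in survivors:
            evaluations += 1
            if score_fn(candidate) >= threshold:
                kept.append(candidate)
        # Only candidates accepted by this stage reach the next one.
        survivors = kept
    return survivors, evaluations
```

Because later stages see only the survivors of earlier ones, the total number of evaluations stays well below the number of candidates multiplied by the number of stages.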

Said et al. [8] introduced a deep learning model architecture utilising CNNs. The model also included a Logistic Regression Classifier to classify the features learnt by CNN. This system’s performance was assessed with the ORL face and AR face datasets. The experimental results show this approach reached a validation accuracy of 98.7%. Hangaragi et al. [9] presented a framework for face detection and recognition using a face mesh and deep neural network. The model trained the deep neural network using real-time captured images and the Labelled Wild Face (LWF) dataset. If the test image’s face landmarks match those of any training images, the model identifies the person; if not, it outputs “unknown”. The study shows a precision rate of 94.23% for facial recognition using the proposed method.

Teoh et al. proposed a facial recognition system using deep learning techniques in [10]. The authors examine the challenges of face identification, such as misalignments, position variations, lighting variations, and expression fluctuations. They describe how deep learning can effectively tackle these problems and achieve enhanced accuracy in face recognition. The study examines deep learning for training neural networks and compares the performance of OpenCV and MATLAB in image processing tasks, highlighting OpenCV’s superior speed and efficiency. The authors explain how to execute the proposed face recognition system, which uses deep learning for face detection and identification. Real-time video recognition achieved an accuracy of 86.7%.

Albdairi et al. [11] proposed a method for face recognition and ethnicity identification using a deep convolutional neural network (DCNN) model grounded in deep learning. This method outperformed traditional techniques in accurately determining a person’s ethnicity through facial analysis. The authors explored high-performance computing hardware, specifically field-programmable gate arrays (FPGAs), to meet the computational needs of the DCNN model, comparing their performance with graphics processing units (GPUs). The trials utilised a dataset of 3141 facial photos from three different countries. This dataset was specifically created to identify ethnic groups. The results show that the DCNN model achieves an F1 score of 94.6% and an accuracy of 96.9% when implemented on FPGAs.

Valenzuela et al. exploited the structure of a smart imaging sensor (SIS) for real-time face identification in [12]. In the analogue domain, the SIS featured a custom smart pixel capable of calculating local spatial gradients, while image classification was performed by a digital coprocessor. The smart pixel used spatial gradients to derive a simplified version of local binary patterns (LBPs) called ringed LBPs (RLBPs). This facial recognition method involved three steps: feature extraction with RLBP, feature vector computation through RLBP histograms, and classification via linear discriminant analysis and a nearest neighbour criterion. Accuracy reached 96.0%.
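For readers unfamiliar with local binary patterns, the following minimal sketch computes the standard 3 × 3 LBP code and its histogram. It is the textbook LBP, not the ringed variant (RLBP) of [12], and the neighbour ordering (clockwise from the top-left pixel) is an arbitrary choice made for this illustration.

```python
# Neighbour offsets, clockwise from the top-left pixel (illustrative ordering).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, r, c):
    """8-bit LBP code of pixel (r, c): bit k is set if neighbour k >= centre."""
    centre = img[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if img[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """256-bin histogram of LBP codes over all interior pixels of a 2-D list."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    return hist
```

The histogram serves as the feature vector that a classifier, such as the linear discriminant analysis step described above, would consume.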

Kulesza et al. [13] suggested a facial recognition system running on a single-board computer. The efficacy of various single-board computers, such as Raspberry Pi, Banana Pi, and Nvidia Jetson Nano, was assessed. The authors compared two face detection algorithms: a Haar feature-based cascade classifier and a multitask cascaded convolutional neural network (MTCNN). For face identification they employed the FaceNet algorithm, which trains a model to map face images to a compact Euclidean space where distances represent a measure of facial similarity. The system was trained and tested using a confidential database and obtained a monitoring accuracy of more than 97% in identifying individuals entering a room.

Melzi et al. in [14] introduced FRCSyn-onGoing, an ongoing challenge designed to thoroughly evaluate real and synthetic data for enhancing face recognition algorithms. The challenge focused on addressing concerns regarding data privacy, demographic biases, the ability to generalize to new situations, and performance constraints in difficult settings, including age differences, pose changes, and occlusions. Their paper argues in favour of information fusion at different levels, including the input data, where a combination of real and synthetic domains is suggested for specific tasks. The findings obtained in FRCSyn-onGoing, in conjunction with the proposed public ongoing benchmark, make a substantial contribution to the use of synthetic data for enhancing facial recognition technology.

Hammouche et al., in [15], introduced a face recognition system combining a Gabor filter bank with a deep learning method called Sparse AutoEncoder (SAE). Features were reduced in dimensionality using Principal Component Analysis and linear discriminant analysis (PCA+LDA). The matching stage uses the cosine Mahalanobis distance. Seven publicly available face databases were used for the experiments, achieving full accuracy.

Moon et al. [16] implemented a face anti-spoofing technique using CNNs to analyse the colour and texture features of face images. This method used a local binary pattern descriptor to extract features from the brightness and colour difference channels. It also looks at the Cb, S, and V bands in the colour spaces. This methodology was assessed using the CASIA-FASD dataset. The authors evaluated the feasibility of applying the proposed method on an AI FPGA board, confirming its effectiveness for edge computing uses.

Dang [17] offers an effective deep-learning approach for a smart attendance system with improved facial recognition features. The author uses an enhanced FaceNet model architecture featuring a MobileNetV2 backbone and an SSD component that employs depth-wise separable convolutions. This design choice minimises size and computational complexity while achieving over 95% accuracy and a processing speed of 25 FPS. The solution successfully addresses the memory and storage limitations of mobile devices when identifying individuals, and it is well suited to low-capacity hardware and systems with limited resources. The author demonstrates the use of deep learning facial recognition technology in developing an advanced automatic attendance system.
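The size reduction offered by depth-wise separable convolutions can be checked with simple arithmetic. The sketch below counts weights only (biases are ignored), and the layer dimensions in the usage note are hypothetical; they are not taken from [17].

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard convolution: one k x k x c_in kernel per output channel."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depth-wise separable convolution.

    Depth-wise step: one k x k filter per input channel.
    Point-wise step: a 1 x 1 convolution mixing c_in channels into c_out.
    """
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise
```

For an illustrative 3 × 3 layer with 32 input and 64 output channels, `standard_conv_params(3, 32, 64)` gives 18,432 weights, whereas `separable_conv_params(3, 32, 64)` gives 2,336, roughly an eightfold reduction.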

Tsai et al. in [18] proposed a system based on the FaceNet method, which extracts distinctive high-dimensional features from facial images and enables similarity to be computed through distance metrics. The model integrated separable convolution layers and fire modules, replacing conventional convolution and pooling layers to enhance efficiency and optimize memory usage. Additionally, the FPGA implementation ensured low-power operation, while the control system oversaw memory preparation and task scheduling, ensuring effective communication between the hard processor system (HPS) and the FPGA through a lightweight AXI bus. This system achieves 99.2% accuracy on the LFW dataset and 94% on the VGGFace2 dataset, with a total of 1.4 million parameters and 274 million GOPs. The hardware design was executed using Intel Quartus II, with comprehensive specifications and performance metrics available in accompanying documentation.
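Matching by embedding distance, as used in FaceNet-style systems, can be sketched as a nearest-neighbour search with a rejection threshold. The embeddings, the gallery structure, and the threshold value are hypothetical; a real system would obtain embeddings from a trained network and tune the threshold on validation data.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(query, gallery, threshold):
    """Return the name of the closest enrolled embedding, or None if too far.

    gallery maps a person's name to a reference embedding (assumed layout).
    """
    best_name, best_dist = None, math.inf
    for name, embedding in gallery.items():
        dist = l2_distance(query, embedding)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None
```

Returning `None` when the nearest embedding exceeds the threshold is what lets such a system report an unknown person instead of forcing a match.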

A study by Wang et al. [19] presented an approach to designing CNNs as hardware accelerators. It emphasized the utility of FPGAs as adaptable platforms for implementing deep learning models, aiming to optimize resource utilization. This architecture featured a control unit that optimally selected input feature windows and kernels, streamlining the convolution process by applying kernels to smaller, manageable windows. This design minimises the number of multiplication units needed, thereby lowering costs and improving efficiency. Implementation on the Aletra 10 GX FPGA demonstrated effective resource management with low utilization percentages, achieving an accuracy of 98.79% on the CIFAR-10 dataset.

Al Amin et al., in [20], conducted a study on integrating deep learning techniques, particularly CNN and YOLO algorithms, into Advanced Driving Assistance Systems (ADASs) for traffic light detection and classification. It highlights the shortcomings of current algorithms, especially the Single Shot MultiBox Detector (SSD), which has difficulty detecting small objects and relies heavily on large, annotated datasets. This solution used the YOLO v3 tiny algorithm on the Xilinx Kria KV260 FPGA board to achieve real-time performance and to optimise resource use for autonomous vehicles. This system’s implementation included dataset preparation, model training, and deployment on the FPGA board. The Bosch Small Traffic Light Dataset (BSTLD) includes 13,427 images with annotated traffic light signals, crucial for training the YOLO model. The training process aimed to enhance the model’s performance, and the architecture was prepared for deployment on the FPGA. Experimental evaluations show a processing time of about 1.996 s per image, achieving a speed of 15 frames per second, which is suitable for real-time applications. Accuracy is approximately 99%.

Kabir [21] introduced a reconfigurable memory-centric array processor architecture tailored for deep learning applications on FPGAs, aiming to mitigate the von Neumann memory bottleneck and to enhance processing speeds. The architecture leveraged a single-instruction multiple-data (SIMD) design, which is particularly effective for the high operational intensity of CNNs. This work explored various FPGA-based accelerators and automated frameworks that enhance performance and energy efficiency for deep neural networks. It discussed the limitations of existing processing-in-memory (PIM) designs in utilizing Block RAM (BRAM) efficiently and presented a comprehensive design standard for evaluating and guiding future PIM developments. The architecture's scalability and optimizations, demonstrated through practical implementations like PiCaSO and IMAGine, highlight its potential to achieve high performance in deep learning tasks. This paper showcased the effectiveness of a custom architecture in addressing the demands of modern deep learning applications.

Wu et al. [22] introduced a framework for gaze estimation using an FPGA, improving its use in areas like smart classrooms and advertising research. The authors combine gaze estimation algorithms with block-wise convolutions and various convolution types to improve system performance and to address on-chip memory limitations in FPGAs. Key contributions include the use of block-wise convolutions to improve computational efficiency; a hybrid architecture that combines depthwise separable and standard convolutions to lower resource usage while preserving performance; and the incorporation of head pose information to enhance gaze estimation accuracy, especially during head movements. The system runs at 32 frames per second on the ZYNQ7035 CPU, consuming an average of 6.4 watts, showcasing its effectiveness in real-time processing with little accuracy loss compared to traditional methods. The study shows how FPGA technology improves gaze estimation algorithms.

Teboulbi et al. [23] introduced a method to enhance facial point detection (FPD) using deep CNNs on FPGA-based systems-on-chip (SoCs). The proposed method employs dynamic partial reconfiguration and a hybrid architecture to address the significant computing needs of DCNNs. This includes the GPU software for FPD, CNN acceleration through high-level synthesis, and a DPR architecture to boost performance. Accuracy reached 89.01%, precision was 91.63%, and recall stood at 90.25%.

3. Materials

3.1. AT&T Database of Faces

The Database of Faces, also known as ‘The ORL Database of Faces’, is a collection of 400 face images taken in a laboratory from 1992 to 1994 for a facial recognition study [24]. The images were taken at different times, resulting in variations in lighting conditions, facial expressions, and facial details. They were taken against a dark backdrop, with subjects in an upright, frontal position. The files are saved in PGM format and can be viewed on UNIX systems using the ‘xv’ software. Each image has dimensions of 92 × 112 pixels, with 256 grey levels per pixel. The images are organized into 40 directories, one per subject, named sX, where X is the subject number, ranging from 1 to 40. Each directory contains ten distinct images of the subject, named Y.pgm, where Y is the image number for that subject, ranging from 1 to 10. Figure 1 shows samples of the AT&T dataset (available from https://www.kaggle.com/datasets/kasikrit/att-database-of-faces; accessed on 13 October 2024).
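The directory layout described above can be enumerated with a short sketch. The root folder name `att_faces` is an assumption about where the archive was unpacked; only the sX/Y.pgm naming scheme and the image dimensions come from the dataset description.

```python
from pathlib import PurePosixPath

NUM_SUBJECTS = 40        # directories s1 .. s40
IMAGES_PER_SUBJECT = 10  # files 1.pgm .. 10.pgm
WIDTH, HEIGHT = 92, 112
# 256 grey levels fit in one byte, so the raw pixel payload per image is:
PIXEL_BYTES = WIDTH * HEIGHT


def att_image_paths(root="att_faces"):
    """Return (path, subject_label) pairs for all 400 images in the dataset.

    root is a hypothetical extraction folder, not a path defined by the dataset.
    """
    pairs = []
    for subject in range(1, NUM_SUBJECTS + 1):
        for image in range(1, IMAGES_PER_SUBJECT + 1):
            path = PurePosixPath(root) / f"s{subject}" / f"{image}.pgm"
            pairs.append((path, subject))
    return pairs
```

The subject number doubles as the class label, which gives a 40-class recognition problem with ten samples per class.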

3.2. PYNQ-Z2

The PYNQ-Z2 board is built around a single Xilinx ZYNQ SoC that integrates FPGA fabric with a dual-core ARM Cortex-A9 CPU. Programmable Logic (PL) refers to the FPGA fabric, and Processing System (PS) refers to the dual-core ARM Cortex-A9 CPU. The PS subsystem comprises numerous specialized peripherals, such as memory controllers and other peripheral interfaces. It can also be extended by including custom hardware IP (Intellectual Property) cores in the PL overlay. Overlays, sometimes referred to as hardware libraries, are programmable FPGA designs that extend user applications from the PS subsystem into the PL subsystem of the ZYNQ device. Overlays can be used to optimize a hardware platform for a particular application or to speed up the operation of a software program. Moreover, a noteworthy feature of the PYNQ-Z2 board is its Python interface. Through this interface, Python applications running in the PS can control overlays in the PL. This makes FPGAs more accessible for computer vision and machine learning applications [23]. Figure 2 shows the PYNQ-Z2 board.

3.3. Zybo Z7-20 Zynq-7000 SoC Development Board

Zybo Z7 is a fully functional development board designed for creating embedded software and digital circuits. It is based on the Xilinx Zynq™-7000 family. As seen in Figure 3, Zynq-7000 tightly integrates Xilinx 7-series FPGA logic with a dual-core ARM Cortex-A9 processor. Zybo Z7 is a powerful single-board computer that offers a wide range of multimedia and communication peripherals. It has a high DDR3L bandwidth, an HDMI input and output, and a Pcam connector that is compatible with MIPI CSI-2. These features make it a cost-effective and powerful option for advanced embedded vision applications. Zybo Z7's Pmod connectors allow extra hardware to be attached conveniently, providing access to Digilent's extensive collection of over 70 Pmod peripheral boards, which encompass motor controllers, sensors, displays, and various other devices. Zybo Z7 is compatible with Xilinx's Vivado Design Suite, which includes the free WebPACK version. Additionally, PetaLinux Tools allow for interactions with the PS of the system [24].

3.4. Raspberry Pi 3B Board

Raspberry Pi 3 Model B, as shown in Figure 4, features a 64-bit processor, built-in WiFi and Bluetooth modules, and enhanced connections. These enhancements result in improved performance, connectivity, and power management. It fully supersedes the Raspberry Pi 2 Model B. The device is equipped with 1 GB of RAM and a 1.2 GHz Broadcom BCM2837 processor, and it uses a BCM43438 module for wireless LAN and Bluetooth [25].

3.5. Camera Module (OV7670 Camera Module)

OV7670/OV7171 CAMERACHIP™, as shown in Figure 5, is a low-voltage CMOS image sensor. It is designed to work as a VGA camera and image processor in a compact device. The OV7670/OV7171 camera module captures images in many formats, including full-frame, sub-sampled, or windowed 8-bit images. The image array can operate at VGA resolution at a maximum rate of 30 frames per second (fps). Users have complete control over image quality, formatting, and output data transfer. The camera is controlled through the Serial Camera Control Bus (SCCB) interface, which can be used to program all of the key image processing functions, including hue control, exposure control, gamma adjustment, white balance, colour saturation, and others. Furthermore, OmniVision CAMERACHIPs employ proprietary sensor technology to enhance image quality by reducing or eliminating typical sources of image corruption caused by lighting or electrical factors, such as fixed pattern noise (FPN), smearing, and blooming. This results in a clean and consistently stable colour image [26].

4. Methodology

This section addresses the software and hardware implementation aspects of the proposed system. Firstly, we discuss the software framework employed for training and deploying the deep learning models, including the choice of deep learning models. Subsequently, we describe the hardware implementation on the FPGA platform.

4.1. The Suggested System

The flowchart presented in Figure 6 illustrates the proposed system, which was designed for security purposes. Operations are performed within the system components, and decisions are made by the machine learning system for face recognition. The suggested model proceeds in several steps. First, the system remains in sleep mode until it detects a person’s presence, which is achieved using machine-learning-based face detection. The purpose of this step is to determine whether there is a person who needs to be checked for authorization to enter. Once face detection is complete, the camera transmits an image of the individual to the FPGA board, which processes it internally for recognition. The FPGA functions as a self-contained system and is the primary component: it serves as the central processing unit responsible for decision making, resource management, data handling, and data processing. On receiving the input image, the FPGA performs filtering operations to remove surrounding noise and to identify the edges in the picture. After multiple convolution filters, the final processing step compares the input image with the images in the dataset. If the input image matches any of the stored photographs, the system has recognized the person; it then grants access, allowing the individual to enter through the gate, and records the time of entry. If the person is not recognized, they do not have authorization to enter or access the premises, and the system issues an alert informing them that they are not permitted to enter. If the individual attempts to enter more than twice, the system begins sending alerts to the security team and captures an image of the person to document the unauthorized entry attempts.
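The decision flow described above can be summarised as a small state machine. The following Python sketch is only an illustration under stated assumptions: the class `AccessController`, its method `handle`, and the returned action strings are hypothetical names, and resetting the failure counter after a successful entry is our assumption rather than something specified for the system.

```python
from dataclasses import dataclass, field
from datetime import datetime

MAX_FAILED_ATTEMPTS = 2  # "more than twice" escalates to the security team


@dataclass
class AccessController:
    """Hypothetical model of the access decision flow (not the FPGA design)."""
    known_ids: set
    failed_attempts: int = 0
    entry_log: list = field(default_factory=list)

    def handle(self, face_detected, recognized_id=None, now=None):
        if not face_detected:
            return "sleep"                      # nobody present: stay idle
        if recognized_id in self.known_ids:
            self.failed_attempts = 0            # assumption: reset on success
            self.entry_log.append((recognized_id, now or datetime.now()))
            return "grant"                      # open gate, record entry time
        self.failed_attempts += 1
        if self.failed_attempts > MAX_FAILED_ATTEMPTS:
            return "alert_security"             # notify team, capture image
        return "deny"                           # warn the person only


controller = AccessController(known_ids={"person_07"})
print(controller.handle(face_detected=True, recognized_id="person_07"))
```

In this sketch, the recognition result is passed in as `recognized_id`; in the actual system it would come from the CNN running on the FPGA.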

4.2. Software and Hardware Implementation

4.2.1. Convolution Layer

A fully pipelined and parameterized convolution operation was implemented in Verilog. This allowed us to construct different architectures that use different kernel sizes: 1 × 1, 3 × 3, 5 × 5, 7 × 7, or 11 × 11. Pipelining increases the achievable clock frequency and throughput. MAC (Multiply and Accumulate) units are used in the architecture so that these operations map to the FPGA’s DSP blocks. This results in significantly faster addition and multiplication operations and lower power consumption, because the DSP blocks are implemented as hard macros. When given the correct inputs, a set of MAC units and a few shift registers perform the convolution operation and, after a predetermined number of clock cycles, output the result. Assume that the dimensions of the kernel (filter/window) are $K \times K$ and that the size of the input feature map (picture) is $N \times N$. Let $s$ denote the stride with which the window moves across the feature map.

After 2-D convolution, the output feature map is smaller than the input feature map. Specifically, if the input feature map is not zero-padded, the output has dimensions $\left(\lfloor (N - K)/s \rfloor + 1\right) \times \left(\lfloor (N - K)/s \rfloor + 1\right)$, which reduces to $(N - K + 1) \times (N - K + 1)$ for $s = 1$. The kernel weights are arranged so that, during the convolution process, each MAC unit holds a constant weight value; the value assigned to each MAC unit depends on its position in the kernel. The input activations, in contrast, are broadcast: in a given clock cycle, every MAC unit receives the same activation value from the input feature map, and that value changes with each clock cycle. This significantly reduces the memory access time and storage demand, because the input feature map only has to be sent in once, and the stored activations only need to be retrieved from memory once. The shift registers act as a local cache for previously accessed values. Because of its modest computational requirements, this architecture can compute the convolution for an input of any size without splitting the input or temporarily storing it elsewhere. The stride value can also be modified for the uncommon architectures that call for a larger stride.
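As a software reference for the MAC-based convolution described above, the following Python sketch computes an unpadded 2-D convolution with a configurable stride. It is a behavioural model only, not the pipelined Verilog design; the function names `output_size` and `conv2d_valid` are hypothetical. It uses the standard output-size relation $\lfloor (N - K)/s \rfloor + 1$, which equals $N - K + 1$ for unit stride, and, as is usual in CNNs, it does not flip the kernel.

```python
def output_size(n, k, stride=1):
    """Side length of the output of an unpadded K x K convolution on N x N."""
    return (n - k) // stride + 1


def conv2d_valid(feature_map, kernel, stride=1):
    """Unpadded 2-D convolution of a square map with a square kernel."""
    n, k = len(feature_map), len(kernel)
    out = output_size(n, k, stride)
    result = [[0] * out for _ in range(out)]
    for i in range(out):
        for j in range(out):
            acc = 0  # plays the role of the MAC accumulator
            for u in range(k):
                for v in range(k):
                    # one multiply-accumulate per kernel weight
                    acc += kernel[u][v] * feature_map[i * stride + u][j * stride + v]
            result[i][j] = acc
    return result


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
diag = [[1, 0],
        [0, 1]]
print(conv2d_valid(image, diag))  # [[6, 8], [12, 14]]
```

A model of this kind is mainly useful as a golden reference against which the hardware output can be compared in simulation.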

4.2.2. Activation Function Layer

A non-linear function, called an activation function, is added to a multi-layer neural network so that the overall result is not simply a linear function. The ReLU function, which is the most commonly used one, was implemented. A pipelined max-pooling operation was used for the pooling unit. In essence, pooling reduces the number of parameters involved in a neural network by downsampling the input at different stages of the network. The max pooler’s output is a fraction of the size of its input. A 2 × 2 pooling window was used, since anything larger discards too much information. Max pooling was implemented using control statements that generate control signals for each individual unit in the pooler, based on predetermined conditions that depend on the values of N, K, and P.
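The two operations in this layer are simple enough to state in a few lines. The Python sketch below is a behavioural illustration, not the pipelined hardware; the function names are hypothetical, and it assumes that the 2 × 2 window moves in non-overlapping steps of 2, so that any odd trailing row or column is dropped.

```python
def relu(x):
    """Rectified linear unit: pass positive values, clamp the rest to zero."""
    return x if x > 0 else 0


def max_pool_2x2(feature_map):
    """2 x 2 max pooling with stride 2 (assumed non-overlapping windows)."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


activations = [[1, -2, 3, 0],
               [4, 5, -6, 7],
               [0, 1, 2, 3],
               [-1, -2, -3, -4]]
rectified = [[relu(v) for v in row] for row in activations]
print(max_pool_2x2(rectified))  # [[5, 7], [1, 3]]
```

With these assumptions, each pooling stage halves both dimensions of the feature map and therefore keeps one quarter of the values.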

4.2.3. Digit Recognition Using Deep Learning

This algorithm uses the deep learning technique of stacking neurons at each layer with their respective weights and biases, and then applying an activation function of choice. First, the weights and biases are obtained by training on the MNIST dataset with a software model of the network written in Python 3, which consists of five layers. The first is the input layer with 784 inputs (28 × 28, the resolution of the MNIST images). The second, third, fourth, and fifth are fully connected layers of 30, 30, 10, and 10 neurons, respectively. Training was performed for 30 epochs. The MNIST dataset has around 60,000 pictures, and we used 10,000 of them as validation data. The accuracy increased with each epoch: it was 70% after the first epoch and reached around 91% after the second. The resulting weights and biases are stored in a file called ‘WeightsAndBiases’. To arrange them in the form required as inputs to the neurons, we ran another script, called ‘genWeightsAndBias’, which puts each weight and bias in a file that is named according to its respective layer and the neuron number in that layer. We then designed the hardware constituents of the layers (in Verilog), as well as the wrapper for those constituents, from simple neurons to activation functions (Sigmoid and ReLU) to the top module connecting all of them. We also tested the memory usage of both activation functions to decide which is better in terms of memory consumption and accuracy, and which method of memory implementation should be used on the FPGA. The design was then implemented, and the synthesis results are given in Table 1. These results give the resources that one neuron consumes in its implementation. The neuron was implemented with two different activation functions to examine which is better in terms of resource utilization, as shown in Table 2 and Table 3. Tables S1–S5 in the supplementary file show the different deep learning architecture settings.
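The per-neuron export step can be pictured with a short sketch. The Python code below is not the ‘genWeightsAndBias’ script itself: the function names, the `w_<layer>_<neuron>.txt` and `b_<layer>_<neuron>.txt` file names, and the plain-text format are all hypothetical choices made for illustration. It also counts the trainable parameters implied by the 784-30-30-10-10 layer sizes given in the text.

```python
import os
import tempfile

LAYER_SIZES = [784, 30, 30, 10, 10]  # input layer plus four neuron layers


def parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with these sizes."""
    pairs = zip(layer_sizes, layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)


def export_per_neuron(weights, biases, out_dir):
    """Write one weight file and one bias file per neuron.

    weights[l][n] is the list of input weights of neuron n in layer l + 1;
    biases[l][n] is its bias. File names are hypothetical.
    """
    written = []
    for layer, (layer_w, layer_b) in enumerate(zip(weights, biases), start=1):
        for neuron, (w_row, bias) in enumerate(zip(layer_w, layer_b)):
            w_path = os.path.join(out_dir, f"w_{layer}_{neuron}.txt")
            b_path = os.path.join(out_dir, f"b_{layer}_{neuron}.txt")
            with open(w_path, "w") as fh:
                fh.write("\n".join(str(w) for w in w_row) + "\n")
            with open(b_path, "w") as fh:
                fh.write(f"{bias}\n")
            written += [w_path, b_path]
    return written


print(parameter_count(LAYER_SIZES))  # 24900
with tempfile.TemporaryDirectory() as tmp:
    # toy layer with two neurons of two inputs each
    files = export_per_neuron([[[0.1, 0.2], [0.3, 0.4]]], [[0.5, -0.5]], tmp)
    print([os.path.basename(p) for p in files])
```

Keeping one file per neuron matches the description in the text and would let each neuron module load its own memory contents independently; the actual file format expected by the Verilog design is not specified here.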

The resource utilization reported in Table 1, Table 2 and Table 3 offers insight into the hardware cost of a single neuron. The ReLU function is a simple thresholding operation, whereas the Sigmoid function requires a more complex exponential calculation. In the synthesis results, this difference shows up mainly in memory: the ReLU neuron uses no BRAM, while the 10-bit Sigmoid neuron occupies 0.5 BRAM. In LUT and FF counts, the three variants are close to one another, with 58 LUTs and 66 FFs for the ReLU neuron, 47 LUTs and 52 FFs for the 10-bit Sigmoid neuron, and 55 LUTs and 56 FFs for the 5-bit Sigmoid neuron. All three variants use two DSP blocks.

Reducing the bit depth of the Sigmoid function from 10 bits to 5 bits removes the BRAM usage at the cost of a small increase in LUT and FF utilization, which indicates a trade-off between precision and resource consumption; the lower bit depth may also affect the accuracy and precision of the network’s calculations. The selection of an activation function must therefore be evaluated carefully with respect to its influence on overall model accuracy and performance.
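The precision side of the bit-depth trade-off can be illustrated in software. The sketch below assumes that the Sigmoid is realised as a lookup table with $2^{d}$ entries for a bit depth $d$, and that inputs are clamped to the range $[-8, 8]$; both are assumptions made for illustration, since the memory organisation of the hardware Sigmoid is not detailed here. All function names are hypothetical.

```python
import math

X_MIN, X_MAX = -8.0, 8.0  # assumed input range covered by the table


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def build_sigmoid_lut(depth_bits):
    """Table of 2**depth_bits sigmoid samples taken at the bin centres."""
    size = 1 << depth_bits
    step = (X_MAX - X_MIN) / size
    return [sigmoid(X_MIN + (i + 0.5) * step) for i in range(size)]


def lut_sigmoid(x, lut):
    """Approximate sigmoid(x) by the nearest table entry, with clamping."""
    size = len(lut)
    index = int((x - X_MIN) / (X_MAX - X_MIN) * size)
    return lut[max(0, min(size - 1, index))]


def max_abs_error(depth_bits, samples=2001):
    """Worst-case error of the table over an even sweep of the input range."""
    lut = build_sigmoid_lut(depth_bits)
    worst = 0.0
    for n in range(samples):
        x = X_MIN + (X_MAX - X_MIN) * n / (samples - 1)
        worst = max(worst, abs(sigmoid(x) - lut_sigmoid(x, lut)))
    return worst


for depth in (10, 5):
    print(depth, len(build_sigmoid_lut(depth)), max_abs_error(depth))
```

Under these assumptions, the 5-bit table holds 32 entries instead of 1024, and its worst-case approximation error is correspondingly larger, which is the kind of precision loss that has to be weighed against the memory saved.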

4.2.4. System Integration

Figure 7 illustrates the system block diagram. Block no. 1 is the original data producer. It groups together the IP blocks needed to decode image data coming from the camera and to format it to suit our needs. To create a pass-through video pipeline using the Pcam 5C and Zybo Z7-20, Vivado HLS was used to create re-usable IP blocks. Block no. 2 is the ultimate data consumer. It groups the IP blocks that perform DVI (Digital Visual Interface) encoding, so that the image data can be displayed on a monitor. Block no. 3 is AXI Video Direct Memory Access (VDMA), a Xilinx IP. VDMA sits in the middle of the video data flow, and its central role makes it an interesting addition. It is needed to decouple two incompatible video interfaces, the image sensor’s MIPI CSI-2 and the monitor’s DVI. In our case, the sensor is capable of outputting 30 frames per second, but the monitor needs 60 frames per second for a standard resolution. VDMA assumes the slave role on its input side (S2MM), consuming data and writing it to DDR memory. It also assumes the master role on its output side (MM2S), producing data that it has first read from DDR memory. This allows both the sensor and the monitor to work at their own pace. Block no. 4 is the system control. The dual-core ARM in the Zynq chip executes the software control program, and its associated blocks give access to the main memory and control the sensor over I2C.
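The decoupling role of the frame buffer can be illustrated with a small software model. The following Python sketch is not the Xilinx VDMA API; the function `frames_shown` and its parameters are hypothetical. It assumes that the write side completes sensor frame k at time k divided by the sensor frame rate, and that the read side always fetches the most recently completed frame from memory.

```python
def frames_shown(sensor_fps, display_fps, n_refreshes):
    """Index of the sensor frame read out at each display refresh.

    Simplified model: the writer (S2MM side) completes frame k at time
    k / sensor_fps, and the reader (MM2S side) fetches the latest completed
    frame at every refresh.
    """
    # Integer arithmetic avoids floating-point rounding at frame boundaries.
    return [(tick * sensor_fps) // display_fps for tick in range(n_refreshes)]


shown = frames_shown(sensor_fps=30, display_fps=60, n_refreshes=6)
print(shown)  # [0, 0, 1, 1, 2, 2]
```

In this model, a 30 fps source feeding a 60 Hz display results in every captured frame being read out twice, which is how the two interfaces can each run at their own pace.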

Figure 8 shows the hierarchy of block no. 4. The input, output, and control interfaces were modelled in C/C++. Fortunately, a data type modelling AXI-Stream already exists in the HLS template libraries. Our task therefore involved writing a processing block (a function) that accepts an AXI-Stream RGB video input (as an argument) and outputs the similarly formatted processed video data (as another argument), so that face recognition can then be performed on the input video. The system requirements are a resolution of 1280 × 720 at 60 Hz and a stable video feed.
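The stated video requirement translates directly into a data rate that the processing block and the memory system must sustain. The short Python calculation below assumes 24-bit RGB pixels (3 bytes per pixel), which is an assumption rather than a figure given above, and it counts only active pixels, ignoring blanking intervals.

```python
WIDTH, HEIGHT, FPS = 1280, 720, 60  # system requirement from the text
BYTES_PER_PIXEL = 3                 # assumption: 24-bit RGB


def frame_bytes(width, height, bytes_per_pixel):
    """Size of one frame buffer in bytes."""
    return width * height * bytes_per_pixel


def stream_bytes_per_second(width, height, fps, bytes_per_pixel):
    """Sustained active-pixel data rate of the video stream."""
    return frame_bytes(width, height, bytes_per_pixel) * fps


print(WIDTH * HEIGHT * FPS)                                           # 55296000 pixels/s
print(frame_bytes(WIDTH, HEIGHT, BYTES_PER_PIXEL))                     # 2764800 bytes
print(stream_bytes_per_second(WIDTH, HEIGHT, FPS, BYTES_PER_PIXEL))    # 165888000 bytes/s
```

Under these assumptions, each frame buffer occupies about 2.8 MB, and the output side alone reads roughly 166 MB/s from memory.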

5. Comparison and Discussion of Results

5.1. CNN Models for Image Classification

The performance of the proposed biometric recognition system was tested using the standard AT&T database. This database consists of 10 photographs of each of 40 people, resulting in a total of 400 images. The suggested system was evaluated using various deep learning models. The evaluation findings and the system’s recognition accuracy are given in Table 4. Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16 show the accuracy and loss curves for AlexNet, ResNet18, VGG16, and GoogLeNet, respectively. The findings indicate the superiority of ResNet18 in facial recognition. The training settings were as follows: a mini-batch size of 5, a maximum of 10 epochs, and a validation frequency of 50 iterations (totalling 550 iterations, with the validation data shuffled each epoch). AlexNet achieved an accuracy of 98.33% in a time of 3 min and 5 s. It had 60 million parameters and an error rate of 15.3%. SqueezeNet achieved an accuracy of 3.33% in a time of 1 min and 31 s. It had 5 million parameters and an error rate of 16.4%. ResNet achieved an accuracy of 99.17% in a time of 2 min and 45 s. It had 23 million parameters and an error rate of 3.6%. VGG-16 achieved an accuracy of 96.67% in a time of 14 min and 21 s. It had 138 million parameters and an error rate of 7.3%. GoogLeNet achieved an accuracy of 98.33% in a time of 3 min and 8 s. It had 4 million parameters and an error rate of 6.67%.

ResNet18 demonstrated the highest accuracy of 99.17%, indicating its superior ability to learn discriminative features from the facial images. This is further evident in Figure 11, where the accuracy curve for ResNet18 consistently reaches a higher peak compared to the other models.

AlexNet achieved a high accuracy of 98.33% with a relatively moderate number of parameters, suggesting a good balance between accuracy and model complexity.

GoogLeNet also achieved an accuracy of 98.33%, while exhibiting a significantly lower number of parameters compared to AlexNet, demonstrating its efficiency in terms of parameter utilization. This is reflected in Figure 15, where GoogLeNet achieves high accuracy with a relatively simple model architecture.

VGG16, despite achieving a respectable accuracy of 96.67%, exhibited the longest training time and the highest number of parameters, suggesting a higher computational cost. This is evident in Figure 13, where the accuracy curve for VGG16 shows a slower convergence rate compared to other models.

Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15 and Figure 16 visually depict the training and validation accuracy and loss curves for each model. These figures provide insights into the training dynamics and convergence behaviour of the different CNN architectures. For instance, Figure 10 shows that AlexNet exhibits a relatively stable loss curve during training, indicating a smooth and consistent learning process.

The results demonstrate that ResNet18 achieved the highest accuracy on the AT&T dataset, indicating its effectiveness in learning robust facial representations. This superior performance can be attributed to its residual connections, which facilitate the training of deeper networks and improve the flow of information within the network.

While AlexNet and GoogLeNet achieved competitive accuracy levels, ResNet18 offered a better balance between accuracy and model complexity. GoogLeNet, with its inception modules, demonstrated efficient parameter utilization, achieving high accuracy with a relatively small number of parameters. This efficiency is reflected in the faster training time and lower number of parameters observed for GoogLeNet compared to VGG16.

VGG16, despite its high accuracy, exhibited a significantly higher computational cost due to its deeper architecture and larger number of parameters. This increased computational burden may be a concern for resource-constrained applications, especially those requiring real-time processing.
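The trade-offs discussed above can be read off the reported figures with a few lines of code. The Python sketch below only tabulates the accuracy, training time, and parameter counts stated in this section, exactly as reported; the dictionary layout and the helper `best` are hypothetical conveniences, and no new measurements are introduced.

```python
# Values as reported in the text: accuracy (%), training time (s), parameters (millions)
REPORTED = {
    "AlexNet":    {"accuracy": 98.33, "train_s": 3 * 60 + 5,   "params_m": 60},
    "SqueezeNet": {"accuracy": 3.33,  "train_s": 1 * 60 + 31,  "params_m": 5},
    "ResNet":     {"accuracy": 99.17, "train_s": 2 * 60 + 45,  "params_m": 23},
    "VGG-16":     {"accuracy": 96.67, "train_s": 14 * 60 + 21, "params_m": 138},
    "GoogLeNet":  {"accuracy": 98.33, "train_s": 3 * 60 + 8,   "params_m": 4},
}


def best(metric, highest=True):
    """Name of the model with the highest (or lowest) value of a metric."""
    pick = max if highest else min
    return pick(REPORTED, key=lambda name: REPORTED[name][metric])


print("highest accuracy:", best("accuracy"))
print("longest training:", best("train_s"))
print("fewest parameters:", best("params_m", highest=False))
for name, row in sorted(REPORTED.items(), key=lambda kv: -kv[1]["accuracy"]):
    print(f"{name:<10} {row['accuracy']:6.2f}%  {row['train_s']:4d} s  {row['params_m']:3d} M")
```

Laying the figures out in this way makes the comparison in the discussion explicit: the ResNet entry has the highest accuracy, VGG-16 has the longest training time and the most parameters, and GoogLeNet has the fewest parameters.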

5.2. A Comparison with the Literature

The performance assessment of the suggested work was compared with the findings given in the literature [27,28,29,30,31,32] using the same dataset for face recognition. This comparison is presented in Table 5. The proposed technique and CNN architecture are comparable to the works documented in the literature. The suggested work achieves an improvement in recognition accuracy by optimizing the number of convolution filters, the window size for convolution filters, and pooling.

6. Conclusions

This paper aims to develop a secure, standalone biometric system for building access control. The system leverages FPGA technology to implement a Building Management System (BMS) that can independently verify authorized personnel without relying on internet connectivity. The core component of the system is a facial recognition system, enhanced by a modified machine learning algorithm tailored for FPGA implementation. This optimization improves performance, accuracy, and speed. To select the most suitable network architecture, we evaluated GoogLeNet, SqueezeNet, AlexNet, ResNet, and VGG-16 on two datasets. AlexNet and ResNet demonstrated superior performance in terms of accuracy and efficiency. We further tested the deployment of these models on three different hardware platforms: Raspberry Pi, ZYBO Z7, and PYNQ Z2. The results of these evaluations will inform the final selection of the optimal hardware platform for our system.

Future research directions for this work encompass expanding the scope by incorporating larger and more diverse datasets, including images with varying lighting conditions, poses, occlusions, and ethnicities, to improve robustness and generalizability.

In addition, advanced FPGA architectures and hardware acceleration techniques can be implemented to further improve the real-time performance of the facial recognition system. Moreover, power optimization techniques can be investigated to minimise power consumption for deployment in resource-constrained edge computing environments.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ai6010008/s1, Table S1: Leznet Architecture; Table S2: Alex-net Architecture; Table S3: VGGNet-16 Architecture; Table S4: ResNet Architecture; Table S5: GoogLeNet Architecture.

Author Contributions

Conceptualization, N.Z., N.T., M.M.A.M., A.F. and M.S.A.; Methodology, N.Z., N.T., M.M.A.M., M.S.A. and Y.-I.C.; Software, N.Z., N.T., M.M.A.M., A.F., M.S.A. and Y.-I.C.; Validation, N.Z., N.T., M.M.A.M., M.S.A. and Y.-I.C.; Formal analysis, N.Z., N.T., M.M.A.M., M.S.A. and Y.-I.C.; Writing—original draft, N.Z., N.T., M.M.A.M., M.S.A. and Y.-I.C.; Writing—review and editing, N.Z., N.T., M.M.A.M., M.S.A. and Y.-I.C.; Visualization, N.Z., N.T., M.M.A.M., M.S.A. and Y.-I.C.; Supervision, N.Z., N.T., M.M.A.M., M.S.A. and Y.-I.C.; Project administration, M.S.A. and Y.-I.C. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the Korea Agency for Technology and Standards in 2022, project numbers 1415181629 (20022340, Development of International Standard Technologies based on AI Model Lightweighting Technologies) and 1415180835 (20020734, Development of International Standard Technologies based on AI Learning and Inference Technologies).

Data Availability Statement

Data available in a publicly accessible repository: The original data presented in the study are openly available in Kaggle at https://www.kaggle.com/datasets/kasikrit/att-database-of-faces, accessed on 1 January 2025.

Conflicts of Interest

Author Mohamed S. Abdallah was employed by the company DeltaX Co., Ltd., Republic of Korea. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

Liu, F.; Li, H.; Hu, W.; He, Y. Review of neural network model acceleration techniques based on FPGA platforms. Neurocomputing 2024, 610, 128511.
Capra, M.; Bussolino, B.; Marchisio, A.; Shafique, M.; Masera, G.; Martina, M. An Updated Survey of Efficient Hardware Architectures for Accelerating Deep Convolutional Neural Networks. Future Internet 2020, 12, 113.
Blaiech, A.G.; Khalifa, K.B.; Valderrama, C.; Fernandes, M.A.C.; Bedoui, M.H. A Survey and Taxonomy of FPGA-based Deep Learning Accelerators. J. Syst. Archit. 2019, 98, 331–345.
Wang, X.; Fan, Z.; Yang, W.; Han, M. Design of deep convolutional neural network cascade for face detection. In Proceedings of the Third International Conference on Electronics Technology and Artificial Intelligence (ETAI 2024), Guangzhou, China, 17–19 May 2024; Volume 13286, pp. 228–233.
Khalil, K.; Khan, M.R.; Bayoumi, M.; Sherif, A. Efficient Hardware Design of Convolutional Neural Networks for Accelerated Deep Learning. In Proceedings of the 2024 IEEE 67th International Midwest Symposium on Circuits and Systems (MWSCAS), Springfield, MA, USA, 11–14 August 2024; pp. 1075–1079.
Zhang, D.; Wang, A.; Mo, R.; Wang, D. End-to-end acceleration of the YOLO object detection framework on FPGA-only devices. Neural Comput. Appl. 2024, 36, 1067–1089.
Yu, B.S.; Tsao, Y.; Yang, S.W.; Chen, Y.K.; Chien, S.Y. Architecture Design of Convolutional Neural Networks for Face Detection on an FPGA Platform. In Proceedings of the 2018 IEEE International Workshop on Signal Processing Systems (SiPS), Cape Town, South Africa, 21–24 October 2018; pp. 88–93.
Said, Y.; Barr, M.; Ahmed, H.E. Design of a Face Recognition System based on Convolutional Neural Network (CNN). Eng. Technol. Appl. Sci. Res. 2020, 10, 5608–5612. Available online: www.etasr.com (accessed on 1 October 2024).
Hangaragi, S.; Singh, T.; Neelima, N. Face Detection and Recognition Using Face Mesh and Deep Neural Network. Procedia Comput. Sci. 2022, 218, 741–749.
Teoh, K.H.; Ismail, R.C.; Naziri, S.Z.M.; Hussin, R.; Isa, M.N.M.; Basir, M.S.S.M. Face Recognition and Identification using Deep Learning Approach. J. Phys. Conf. Ser. 2021, 1755, 012006.
Albdairi, A.J.A.; Xiao, Z.; Alkhayyat, A.; Humaidi, A.J.; Fadhel, M.A.; Taher, B.H.; Alzubaidi, L.; Santamaría, J.; Al-shamma, O. Face Recognition Based on Deep Learning and FPGA for Ethnicity Identification. Appl. Sci. 2022, 12, 2605.
Valenzuela, W.; Soto, J.E.; Zarkesh-Ha, P.; Figueroa, M. Face recognition on a smart image sensor using local gradients. Sensors 2021, 21, 2901.
Kulesza, Z. Face recognition system based on a single-board computer. In Proceedings of the 2020 International Conference Mechatronic Systems and Materials (MSM), Bialystok, Poland, 1–3 July 2020.
Melzi, P.; Tolosana, R.; Vera-Rodriguez, R.; Kim, M.; Rathgeb, C.; Liu, X.; DeAndres-Tame, I.; Morales, A.; Fierrez, J.; Ortega-Garcia, J.; et al. Benchmarking and comprehensive evaluation of real and synthetic data to improve face recognition systems. Inf. Fusion 2024, 107, 102322.
Hammouche, R.; Attia, A.; Akhrouf, S.; Akhtar, Z. Gabor filter bank with deep autoencoder based face recognition system. Expert Syst. Appl. 2022, 197, 116743.
Moon, Y.; Ryoo, I.; Kim, S. Face Antispoofing Method Using Color Texture Segmentation on FPGA. Secur. Commun. Netw. 2021, 2021, 9939232.
Dang, T.V. Smart Attendance System based on Improved Facial Recognition. J. Robot. Control 2023, 4, 46–53.
Tsai, T.H.; Hsu, C.W.; Chi, P.T. Hardware Design on Face Recognition System by Deep Neural Network for Access Control System. In Proceedings of the 2024 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 6–8 January 2024.
Wang, J.; Lin, J.; Wang, Z. Efficient hardware architectures for deep convolutional neural network. IEEE Trans. Circuits Syst. I Regul. Pap. 2018, 65, 1941–1953.
Al Amin, R.; Hasan, M.; Wiese, V.; Obermaisser, R. FPGA-Based Real-Time Object Detection and Classification System Using YOLO for Edge Computing. IEEE Access 2024, 12, 73268–73278.
Kabir, M.A. ReMoDeL-FPGA: Reconfigurable Memory-Centric Array Processor Architecture for Deep-Learning Acceleration on FPGA, Graduate Theses and Dissertations. 2024. Available online: https://scholarworks.uark.edu/etd/5483 (accessed on 23 October 2024).
Wu, J.; Han, P. Design of an FPGA-Accelerated Real-Time Gaze Estimation System. J. Phys. Conf. Ser. 2024, 2897, 012015.
Teboulbi, S.; Messaoud, S.; Hajjaji, M.A.; Mtibaa, A.; Atri, M. FPGA-Based SoC Design for Real-Time Facial Point Detection Using Deep Convolutional Neural Networks with Dynamic Partial Reconfiguration. Signal Image Video Process. 2024, 18, 599–615.
AT&T Database of Faces, (n.d.). Available online: https://www.kaggle.com/datasets/kasikrit/att-database-of-faces (accessed on 13 June 2024).
Huynh, T.V. FPGA-based Acceleration for Convolutional Neural Networks on PYNQ-Z2. Int. J. Comput. Digit. Syst. 2022, 11, 441–449.
Xilinx Zynq-7000 SoC Development Board—Digilent Zybo Z7, (n.d.). Available online: https://digilent.com/shop/zybo-z7-zynq-7000-arm-fpga-soc-development-board/ (accessed on 13 June 2024).
Karthikeyan, S.; Raj, R.A.; Cruz, M.V.; Chen, L.; Vishal, J.L.A.; Rohith, V.S. A Systematic Analysis on Raspberry Pi Prototyping: Uses, Challenges, Benefits, and Drawbacks. IEEE Internet Things J. 2023, 10, 14397–14417.
CMOS OV7670 Camera Module, (n.d.). Available online: www.ArduCAM.com (accessed on 13 June 2024).
Fredj, H.B.; Sghaier, S.; Souani, C. An Efficient Face Recognition Method Using CNN. In Proceedings of the 2021 International Conference of Women in Data Science at Taif University (WiDSTaif), Taif, Saudi Arabia, 30–31 March 2021; pp. 1–5.
Pranav, K.B.; Manikandan, J. Design and Evaluation of a Real-Time Face Recognition System using Convolutional Neural Networks. Procedia Comput. Sci. 2020, 171, 1651–1659.
Alsayaydeh, J.; Aziz, A.; Hossain, A.Z.; Alsayaydeh, J.A.J.; Xin, C.K.; Hossain, A.K.M.Z.; Herawan, S.G. Face Recognition System Design and Implementation using Neural Networks. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 519–526.
AbdELminaam, D.S.; Almansori, A.M.; Taha, M.; Badr, E. A deep facial recognition system using computational intelligent algorithms. PLoS ONE 2020, 15, e0242269.

Figure 1. AT & T dataset (downloaded from https://www.kaggle.com/datasets/kasikrit/att-database-of-faces; accessed on 13 October 2024).

Figure 2. PYNQ-Z2.

Figure 3. Zybo Z7-20 Zynq-7000 SoC development board.

Figure 4. Raspberry Pi 3 Model B.

Figure 5. OV7670 camera module.

Figure 6. A flowchart of the proposed model.

Figure 7. Top level block diagram. The light blue block (3) is a regular IP, while blue blocks (1, 2, and 4) are hierarchy blocks, grouping IP blocks together. Block no. 1, named camera_in, is the original data producer. It groups together the IP blocks needed to decode image data coming from the camera and to format it to suit our needs. Block no. 2, named video_out, is the ultimate data consumer. It groups IP blocks doing DVI encoding, so that the image data can be displayed on a monitor. We are going to look at these two hierarchy blocks later. Block no. 3 is an actual IP, named axi_vdma. It is a Xilinx IP with the full name AXI Video Direct Memory Access. VDMA sits in the middle of the video data flow, and its central role makes it an interesting addition. It is needed to decouple two incompatible video interfaces, the image sensor’s MIPI CSI-2 and the monitor’s DVI.

Figure 8. The hierarchy of the control block, which illustrates the input, output, and control interfaces modelled in C/C++.

Figure 9. AlexNet accuracy.

Figure 10. AlexNet loss.

Figure 11. ResNet18 accuracy.

Figure 12. ResNet18 loss.

Figure 13. Accuracy of the VGG16 network.

Figure 14. Loss curve of the VGG16 network.

Figure 15. GoogLeNet accuracy.

Figure 16. GoogLeNet loss curve.

Table 1. ReLU resource utilization table from the Vivado implementation.

Resources	Utilization	Available	Utilization, %
LUT	58	17,600	0.33
FF	66	35,200	0.19
DSP	2	80	2.50
IO	36	100	36.00
BUFG	1	32	3.13

Table 2. Sigmoid (10-bit depth) resource utilization table from the Vivado implementation.

Resources	Utilization	Available	Utilization, %
LUT	47	17,600	0.27
FF	52	35,200	0.15
BRAM	0.50	60	0.83
DSP	2	80	2.50
IO	36	100	36.00
BUFG	1	32	3.13

Table 3. Sigmoid (5-bit depth) resource utilization table from the Vivado implementation.

Resources	Utilization	Available	Utilization, %
LUT	55	17,600	0.31
FF	56	35,200	0.16
DSP	2	80	2.50
IO	36	100	36.00
BUFG	1	32	3.13

Table 4. The results of CNN models.

Model	Accuracy
AlexNet	98.33%
ResNet18	99.17%
VGG16	96.67%
GoogLeNet	98.33%

Table 5. Comparison of face recognition results reported in the literature for AT&T dataset.

Reference	Method	Recognition Accuracy
[27]	PCA-CNN	94.2%
[28]	CNN	98.75%
[29]	CNN	95.93%
[30]	CNN	87.00%
Proposed method	AlexNet	98.33%
	ResNet18	99.17%
	VGG16	96.67%
	GoogLeNet	98.33%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
