1 Introduction

Since the COVID-19 pandemic, the world’s attention has focused on keeping social distance between humans and surrounding objects to protect people from infection. Accordingly, the need for a reliable, accurate, and safe biometric identification algorithm has become urgent. One of the most robust, accurate, and secure biometric identification methods is gait, which can identify subjects from a distance based on their fundamental dynamic walking patterns and can track the same person across several fixed cameras [1]. Compared to other commonly used biometric modalities, such as fingerprint [2], iris [3], face [4], and DNA [5], gait offers unique advantages: it is noninvasive, hard to disguise, robust to low-resolution images, and does not require cooperation from the subject [1]. Owing to these advantages over other biometrics, gait identification is used in a wide range of security applications in hospitals, shopping malls, banks, military installations, airports, religious institutions, etc. Moreover, it can be used in crime prevention, forensic identification, and criminal investigation [6]. Despite the mentioned advantages, gait recognition has some drawbacks caused by changes in the subject’s clothing or carried objects [7]. Numerous studies have focused on two categories of identification methods to solve these issues: model-based approaches [8] and appearance-based approaches [9]. In this paper, after the silhouette frames have been extracted and enhanced, they pass through two main phases: (i) the first phase is built on the appearance-based model, which extracts gait features from the gait energy images (GEIs) using a proposed convolutional neural network (CNN); (ii) the second phase is designed on the model-based technique, which extracts further features from the landmarks data frame using a proposed fully connected architecture. Finally, we propose a deep neural network (DNN) to recognize the concatenated high-level features from both phases. Figure 1 shows the main components of the fused gait recognition algorithm.

Fig. 1 The proposed gait recognition model

The main contributions of this manuscript can be summarized as follows:

  1. Extracting silhouette images from human gait videos and enhancing them (phase 1).

  2. Extracting lower-body 2D pose joints from gait videos, converting them to 3D poses, computing knee angles and static and limb distances, and finally reshaping the results into a data frame (phase 2).

  3. Proposing two deep network structures (a CNN and a fully connected network) to extract the main features from the phase 1 silhouette images and the phase 2 data frames.

  4. Proposing a novel deep model that combines the two high-level features extracted from silhouette images (appearance-based) and human poses (model-based) to recognize human gait images.

  5. Comparing the performance of recent studies with the proposed algorithm.

  6. Introducing a fine-tuned CNN algorithm with the best performance metrics.

The remainder of this article is organized as follows: Sect. 2 provides an overview of the recent related works, Sect. 3 introduces the principal methodology of the proposed model, Sect. 4 presents the experimental results and discussions, and finally, Sect. 5 presents the main conclusions and future works.

2 Related works

Li et al. proposed a model called GaitSlice, which analyzes human gait images based on spatiotemporal slice features. GaitSlice combines a residual frame attention mechanism (RFAM) with inter-related slice features to form the spatiotemporal information. The experimental results show higher accuracy than six typical gait recognition algorithms [6]. Han et al. tuned the learning metrics of a gait recognition model on the CASIA-B and TUM-GAID gait datasets to improve its performance. They used angular SoftMax loss and triplet loss to make the features more separable and discriminative. Finally, they added a batch normalization layer to optimize the two mentioned losses. The experimental results reveal that the tuned model outperforms the other state-of-the-art approaches [7]. Tian et al. proposed a spatiotemporal attention mechanism to enhance structural gait features in their AGS-GCN model. They constructed a gait skeleton graph to extract multi-scale gait features from the skeleton data. Moreover, they improved the characteristics of the joint points through the spatiotemporal attention mechanism. Extensive experiments demonstrate that AGS-GCN achieves better performance metrics than other recent studies [10].

Alobaidi et al. introduced an advanced real-world smartphone-based gait recognition system to recognize human gait outside controlled environments. The gait data were captured from 44 uncontrolled subjects over 7–10 days; the subjects were simply asked to go about their normal daily activities. For each user, the experiment modeled four different forms of motion: normal walking, rapid walking, walking downstairs, and walking upstairs. The evaluation results show error rates ranging from 11.32 to 27.33% across the mentioned motion models [11]. Sanjay Kumar Gupta and Pratik Chattopadhyay proposed a model to enhance the performance of gait recognition under covariate conditions. The proposed model depends on determining a set of unique generic poses and computing gait features corresponding to these poses, called the dynamic gait energy image (DGEI). Furthermore, they employed a generative adversarial network (GAN) model to predict the corresponding DGEI images without the covariates. The experimental studies on the CASIA-B, TUM-GAID, and OU-ISIR datasets verify the effectiveness of the proposed approach compared with the other studies [9].

Martinez-Hernandez et al. presented a learning architecture for gait recognition and prediction. This model comprises a convolutional neural network (CNN), a predicted information gain (PIG) module, and an adaptive combination of information sources. The outputs of the CNN and PIG modules are blended using a proposed adaptive approach that relies on the more reliable source. The experiments on walking activity and gait period recognition scored 98.63% accuracy when the CNN model was applied alone and 99.9% accuracy when the PIG module and the adaptive combination were used [12].

Weijie Sheng and Xinde Li proposed a human gait recognition and motion prediction model called the attention-enhanced temporal graph convolutional network (AT-GCN). Thanks to spatial and temporal attention, the proposed model can represent discriminative features in both spatial dependency and temporal dynamics. A multi-task learning architecture was also presented, which can simultaneously learn representations for multiple tasks. Furthermore, they introduced a new dataset called EMOGAIT, which contains 1440 real gait sequences annotated with identity and emotion labels. Experimental results revealed the robustness of the proposed model when tested on two different datasets for identity recognition and emotion recognition [13]. Liao et al. proposed a model-based gait recognition algorithm called PoseGait. It estimates the 2D pose, converts it to a 3D pose, and then extracts spatiotemporal features to enhance the gait authentication rate. The experimental results reveal that the performance of the proposed system is more robust than that of the appearance-based studies [8].

Altilio et al. used different machine learning classification algorithms to automatically classify patient movements from their gait dataset. Their method achieved a maximum accuracy of 91% with a probabilistic neural network (PNN), targeting cost-effective, home-based rehabilitation programs [14]. In [15], Saleh et al. designed a gait recognition algorithm based on a deep convolutional neural network (DCNN) and compared its performance with and without image augmentation (IA) procedures. Their experimental results scored 82% accuracy without IA and 96.23% with IA, reflecting the main contribution of image augmentation. Wen and Wang presented a gait recognition model based on sparse linear subspace. First, they extracted gait features from frame-by-frame gait energy images (ffGEIs) and then used sparse linear subspace technology for dimension reduction. Second, a new support vector machine-based gait classification technique with Gaussian radial basis function (RBF) kernels was applied for cross-view gait detection. Finally, the proposed gait authentication approach was evaluated on the CASIA-B and OU-ISIR gait databases to reveal its performance [16].

Gao et al. proposed a skeleton-based gait recognition algorithm to solve the problem of covariate conditions. The spatial and temporal features of the gait images are extracted from the spatial and temporal relationships between body joints. The feature map is decomposed to eliminate redundant features and achieve a better recognition rate in the presence of covariate factors. Their experiments on the CASIA-B and OU-MVLP-Pose databases achieve higher recognition accuracy and remarkable robustness [17]. Hasan and Mustafa proposed a gait recognition model to learn discriminative view-invariant gait representations. The proposed model is based on a stacked autoencoder that can efficiently and gradually transform skeleton joint coordinates from any arbitrary view to a common canonical view without prior prediction of the view angle or covariate type and without losing temporal information. Finally, they fused the encoded features with two other spatiotemporal gait features to feed the main recurrent neural network. Experimental results on the CASIA-A and CASIA-B gait datasets demonstrate that the proposed approach outperforms other state-of-the-art methods on single-view gait recognition [18].

In [19], Xiao Jing et al. proposed a gait recognition model called GaitGP that learns the main details through fine-grained features and the relationship between neighboring regions through global features. The GaitGP model consists of the attention feature extractor (CAFE) and the Global and Partial Feature Combiner (GPFC), which extract the global features and learn the different fine-grained features. Experimental results on CASIA-B, the OU-ISIR gait database, and OU-MVLP show that the GaitGP model is superior to current cross-view gait recognition methods. Gul et al. presented a 3D convolutional deep neural network (3D CNN) that extracts the spatiotemporal features of a gait sequence. They used gait energy images (GEIs) as input to the 3D CNN, which captures the shape and motion characteristics of the human gait. Moreover, they applied several optimization methods on the CASIA-B and OULP datasets to tune the hyperparameters. The evaluation results scored the best values on the CASIA-B dataset [20].

Lee et al. proposed a human gender recognition model based on a support vector machine (SVM) and random forest (RF), using recursive feature elimination (RFE) to select the best feature subset. Gender classification scored 99.11% with the SVM classifier, while the RF-RFE combination achieved 98.89%, indicating a robust classifier [21].

Yusuf et al. collected gait images from 26 participants, 14 males and 12 females, and then recognized them using a convolutional recurrent neural network with long short-term memory (CRNN-LSTM) by analyzing the upper half of the gait and the whole body. The experimental results confirm that the recognition accuracy from the upper half of the gait is better than that from the full body, with a lower computational cost [22]. Zhang et al. proposed an integrated network model called SBLSTM, which combines three models, a sparse autoencoder (SAE), a bidirectional long short-term memory (BiLSTM), and a deep neural network (DNN), to recognize gait during human movement. First, the SAE model extracts the key features from the gait images. Then, the BiLSTM model learns the temporal and periodic variations in the gait images. Finally, the DNN identifies and classifies the gait phases. The experimental results prove that the proposed SBLSTM model recognizes gait more effectively than a DNN or LSTM alone [23].

Dong et al. proposed a framework based on multi-source fusion to recognize human gait phases and patterns and reduce computational costs. This framework combines four well-known models, namely support vector machines (SVM), backpropagation (BP) neural networks, AlexNet, and LeNet5, with low-cost commercial sensors to confirm the performance of the proposed methodology for gait recognition. The evaluation results scored 97.7% accuracy for gait phases and 99.2% for gait patterns using the fusion model, proving the effectiveness of the proposed framework [24].

In [25], the human gait images in CASIA-B and CASIA-C were recognized using deep learning and Bayesian optimization. The authors proposed a framework that includes both parallel and sequential steps. First, they extract the optical flow-based motion regions and then enhance the video frames, which are trained separately instead of selecting a static hyperparameter. Two models are thus obtained, the original-frames model and the motion-frames model, which are combined using a proposed parallel approach called Sq-Parallel Fusion (SqPF). The Tiger optimization algorithm is enhanced into an Entropy-controlled Tiger optimization (EVcTO). Finally, an extreme learning machine (ELM) classifier classifies the selected features. The experimental results score 92.04 and 94.97% recognition accuracy on the CASIA-A and CASIA-C datasets, respectively, outperforming the other deep learning-based networks.

Ismail et al. selected the optimal CNN architecture using a genetic algorithm (GA) for the human activity recognition (HAR) task. Furthermore, the proposed search space offers a respectable level of depth because it does not place a cap on the length of the architecture that may be created. Three datasets, namely UCI-HAR, Opportunity, and DAPHNET, were tested to confirm the effectiveness of the proposed methodology. The experimental results scored accuracy values of 98.3%, 98.5% (± 1.1), and 99.14% (± 0.8) for Opportunity, UCI-HAR, and DAPHNET, respectively [26]. Teepe et al. proposed a pose estimation model to obtain the optimal skeleton pose analysis from RGB images for model-based gait identification. Moreover, they combine the gait graph with the skeleton poses using a graph convolutional network (GCN) to obtain a strong human gait recognition (HGR) model; the gait features are extracted and combined by the GCN. When evaluated on the CASIA-B dataset, the model achieved promising results compared to the existing methods [27].

3 Frameworks for gait recognition

The typical procedure of gait recognition is demonstrated in Fig. 2 and includes four basic steps: data collection and preprocessing, feature representation, dimension reduction, and recognition or classification [28]. The following subsections fully illustrate the methodology of the proposed model and its techniques, which are trained on the CASIA gait dataset [29, 30].

Fig. 2 Gait recognition overview

3.1 Methodology of the proposed model

Gait recognition is a challenging task, especially when the videos of the dataset have been captured from people doing their normal daily activities, including regular walking, high-speed walking, ascending and descending stairs, and carrying conditions [1]. Therefore, the main aim of the proposed model is to tackle the mentioned problems, improve gait recognition using fused features, and broaden the scope of gait applications to COVID tracking [31] as an alternative to more typical biometrics such as faces, fingerprints, iris scans, and DNA. The proposed model depends on concatenation fusion between appearance-based and model-based algorithms to achieve these targets and enhance the gait recognition algorithm. It passes through six main steps: step 1: convert the videos into successive frames; step 2: preprocess the frames to create silhouette images and GEIs; step 3: extract the prominent landmarks from the gait images, including the hip, knee, ankle, knee angle, etc.; step 4: extract the first features directly from the GEIs by training the proposed CNN model; step 5: extract the second features from the landmark poses data frame by training the proposed fully connected network (FCN); and finally, step 6: recognize the resulting concatenated features by training and testing the well-designed deep learning module, and then compute the main performance metrics, including precision, recall, F1-score, specificity, and training time. Figure 3 shows the flowchart of the proposed model, and Algorithm 1 presents the proposed human authentication procedure.

Fig. 3 Flowchart of the proposed model

Algorithm 1 The proposed human authentication procedure

3.2 Dataset description

Numerous gait databases have been created for gait recognition research, including CASIA, OU-ISIR, and OU-MVLP. The dataset selected for the proposed model is CASIA because it is available as RGB images rather than silhouettes only, so the human poses can be estimated. CASIA is the only gait dataset that provides the original color images; the other datasets offer silhouettes only due to privacy issues. The Institute of Automation, Chinese Academy of Sciences, created the CASIA gait datasets, which contain four different datasets: A, B, C, and D.

The CASIA gait datasets contain roughly 20 K images, varying in walking speed, view angle, and clothing. Dataset CASIA-A [32], consisting of 20 people, was created on December 10, 2001. Each person has 12 image sequences, four for each of the three image plane directions (parallel, 45, and 90 degrees). The total size of this dataset is around 2.2 GB, with 19,139 frames in the database. In January 2005, the first large multi-view gait database, CASIA-B [29], was created. It was collected from 124 subjects and captured from 11 views, each differing in view angle. Moreover, each view angle was repeated under three variations: regular walking, clothing changes, and carrying conditions. A thermal infrared camera captured the CASIA-C [33] gait images from 153 subjects in July–August 2005. The main difference among its sequences is the walking condition: normal, fast, slow, and normal walking with a carried bag. The detailed description of the used gait datasets is tabulated in Table 1. Figure 4 shows a sample of the normal walking sequence of CASIA-B [30], and Fig. 5 shows samples of the CASIA gait datasets.

Table 1 Detailed description of CASIA gait datasets
Fig. 4 Sample of normal walking sequence of CASIA-B

Fig. 5 Sample of datasets a CASIA-A, b CASIA-B, and c CASIA-C

3.3 Preprocessing procedures

One of the overarching goals of the preprocessing steps is to enhance the image quality, suppress distortion, and remove noise so that images with clearly visible gait silhouettes can be identified for phase one. First, all gait images are scaled down to 150 × 150 and converted to grayscale to decrease the computational time. In addition, subtracting the background from the foreground gait content produces isolated images. Moreover, histogram equalization is performed to enhance the contrast according to Eq. (1), with the optimal weight given by Eq. (2) [34],

$$ H\left( l \right) = h\left( {l,i} \right) * W\left( I \right) $$
(1)
$$ W\left[ I \right] = \mathop \sum \limits_{x = 1}^{M} \mathop \sum \limits_{y = 1}^{N} \frac{{n^{k} }}{{\max \left( {n^{k} } \right)}} $$
(2)

where \({n}^{k}\) is the pixel count, \(h\left(l,i\right)\) is the input histogram, and \(H\left(l\right)\) is the modified histogram.
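
For illustration, the resizing, grayscale conversion, background subtraction, and histogram equalization described above can be sketched with OpenCV as follows; the function name and the use of cv2.equalizeHist as the equalization step are illustrative assumptions rather than the exact implementation used in this work.

```python
import cv2

def enhance_frame(frame_bgr, background_bgr, size=(150, 150)):
    """Resize, convert to grayscale, subtract the background, and equalize the histogram."""
    frame = cv2.cvtColor(cv2.resize(frame_bgr, size), cv2.COLOR_BGR2GRAY)
    background = cv2.cvtColor(cv2.resize(background_bgr, size), cv2.COLOR_BGR2GRAY)
    isolated = cv2.absdiff(frame, background)   # isolate the moving subject from its background
    return cv2.equalizeHist(isolated)           # contrast enhancement, in the spirit of Eqs. (1)-(2)
```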

The isolated (ROI) images are then binarized using the Otsu threshold, which maximizes the between-class variance of foreground and background [35], as calculated by Eqs. (3) and (4),

$$ \sigma_{B}^{2} = \omega_{0} \left( {\mu_{0} - \mu_{T} } \right)^{2} + \omega_{1} \left( {\mu_{1} - \mu_{T} } \right)^{2} $$
(3)
$$ t^{*} = \arg \max_{1 \le t \le L} \sigma_{B}^{2} $$
(4)

where \({\omega }_{0}\) and \({\omega }_{1}\) represent the weights of the foreground and background classes, \({\mu }_{0}\) and \({\mu }_{1}\) are the mean gray levels of the two classes, and \({\mu }_{T}\) is the mean gray level of the entire image.

Finally, some morphological operations are applied to the isolated binary images: dilation, which finds the local maximum values in the frames, followed by filling, which removes holes by convolving the isolated image with a disk structuring element (SE) of radius one [36]. Furthermore, all binarized frames are normalized to enhance the images by generating a new intensity range from the existing one. The dilation of a frame (set) A by a structuring element B is computed as Eq. (5) [36],

$$ A \oplus B = \left\{ {x{|}\left( {B_{x} } \right) \cap A \ne \emptyset } \right\} $$
(5)

The filling operation is applied by filling the holes inside the isolated binary image, starting from any point in the ROI and proceeding until the image boundary is reached; it is defined as Eq. (6) [37],

$$ X_{K} = (X_{K - 1} \oplus B) \cap A^{c} \quad K = 1,2,3, \ldots $$
(6)

where \({X}_{0}\) is the starting point inside the hole, B is the structuring element (SE), and \({A}^{c}\) is the complement of the frame set A.
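
A minimal sketch of the Otsu binarization (Eqs. (3) and (4)), the dilation of Eq. (5), and the hole filling of Eq. (6) is given below, assuming OpenCV and SciPy; the 3 × 3 elliptical kernel stands in for the disk structuring element of radius one, and the exact structuring elements used by the authors may differ.

```python
import cv2
import numpy as np
from scipy import ndimage

def binarize_and_clean(enhanced_gray):
    """Otsu thresholding followed by dilation and morphological hole filling."""
    # Otsu selects the threshold t* that maximizes the between-class variance (Eqs. (3)-(4)).
    _, binary = cv2.threshold(enhanced_gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Dilation with a small elliptical (disk-like) structuring element, Eq. (5).
    disk = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (3, 3))
    dilated = cv2.dilate(binary, disk, iterations=1)
    # Hole filling in the spirit of the iterative reconstruction of Eq. (6).
    filled = ndimage.binary_fill_holes(dilated > 0)
    return filled.astype(np.uint8) * 255
```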

The gait dataset is normalized to a pre-defined boundary, as calculated by Eq. (7) [38],

$$ A^{\prime} = \left( {\frac{A - \min \,value\, of\, A}{{\max \,value\, of\, A - \min \,value\, of \,A}}} \right)*\left( {R - M} \right) + M $$
(7)

where A′ is the min–max normalized data, (R, M) are the pre-defined boundaries, and A is the original data.
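
Equation (7) rescales each frame from its own intensity range to the pre-defined boundary (R, M); a minimal NumPy sketch, with the boundary values here chosen only as an example:

```python
import numpy as np

def min_max_normalize(frame, lower=0.0, upper=1.0):
    """Min-max normalization of Eq. (7): rescale frame values into [lower, upper]."""
    a_min, a_max = float(frame.min()), float(frame.max())
    if a_max == a_min:                            # flat frame: avoid division by zero
        return np.full_like(frame, lower, dtype=np.float32)
    scaled = (frame.astype(np.float32) - a_min) / (a_max - a_min)
    return scaled * (upper - lower) + lower       # (R - M) stretch plus the M offset
```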

Algorithm 2 introduces the preprocessing steps for phase 1, Fig. 6 shows the gait dataset preprocessing procedure applied in this study, and Fig. 7 shows sample gait silhouettes after the phase 1 preprocessing.

Fig. 6 Gait dataset preprocessing procedure for phase 1

Fig. 7 Gait silhouettes dataset samples after preprocessing a CASIA-A, b CASIA-B, and c CASIA-C

Algorithm 2 Preprocessing steps for phase 1

3.4 Feature extraction methods

This section describes the existing feature extraction methods for gait recognition. There are different techniques for feature representation; the common types are the appearance-based and model-based feature representation models. The main procedures of each model, as well as the proposed one, are described in detail in the following subsections [39].

3.4.1 Appearance-based feature representation model

The main aim of model-free or appearance-based feature representation is to process the human silhouette to identify a person from their gait data at different angles. Using an appearance model for feature extraction has two significant advantages: it does not need high-quality video, which allows data to be captured far from the subject [28], and it is more cost-effective than a model-based algorithm, so the appearance-based model is more popular [28]. On the other hand, model-free representation has one drawback: it depends on view and scale. For example, the gait recognition rate is reduced when the viewing angle, clothing, or carrying conditions change. The main appearance-based feature extraction methods differ in the form of the raw input data, which includes silhouettes and gait energy images (GEIs).

3.4.1.1 Silhouettes inputs

This feature representation method was the first proposed model for the gait feature; it depends on extracting the human gait from the background to focus on the region of interest (ROI) as raw input data. To obtain the silhouette dataset from RGB gait videos, the following procedures are applied [39]. First, decompose the gait videos into RGB frames, each containing foreground and background parts. Second, generate the background from the frames by computing some statistics of the background pixels, namely the covariances and the mean. Third, determine the silhouette area by computing the Mahalanobis distance [40] between each pixel and the background; based on this distance, the pixel is classified as background or foreground. Finally, the silhouette of the gait image sequence is generated. In general, the procedures described above can produce high-quality binary isolated images. However, some issues lead to segmentation errors [39]: (1) shadows; (2) the background/foreground threshold; and (3) moving objects in the background. Before using these data as input, several preprocessing procedures must be applied to make them suitable for image recognition/classification purposes. For example, Wu et al. [41] proposed a model that extracts features in two domains, spatial and temporal, from silhouette images, called the Spatial–Temporal Graph Attention Network (STGAN). This model can observe the relationship between frames and detect variations in the temporal domain to decrease the errors that affect the gait recognition rate.

3.4.1.2 Gait energy images

After the gait silhouettes have been extracted from the RGB gait videos as described in the above subsection, the gait energy images (GEIs) [39] are created by aligning and averaging the human silhouettes; recognition then relies on comparing the similarity between two GEIs [8]. A GEI is computed by Eq. (8) [42],

$$ G\left( {x,y} \right) = \frac{1}{n}\sum\nolimits_{1}^{n} {B\left( {x,y} \right)} $$
(8)

where \(n\) represents the number of silhouette frames in the gait cycle, x and y are the 2D frame coordinates, and \(B\left(x,y\right)\) is the binary gait silhouette image.
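
Equation (8) reduces to averaging the aligned binary silhouettes of one gait cycle; a short sketch, assuming the silhouettes have already been cropped and aligned to a common size:

```python
import numpy as np

def gait_energy_image(silhouettes):
    """GEI of Eq. (8) from a list of aligned binary silhouettes (H x W arrays with values 0/1)."""
    stack = np.stack([s.astype(np.float32) for s in silhouettes], axis=0)
    return stack.mean(axis=0)   # G(x, y) = (1/n) * sum_t B_t(x, y)
```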

In a GEI, a high-intensity pixel indicates that the individual’s motion frequently occurs at that location [9], which reflects the direct effect of the GEI on image representation. Figure 8 shows a gait cycle sequence of silhouettes and the corresponding GEI. Recently, some researchers have begun using human silhouettes directly as raw input data rather than GEIs, since this has outperformed previous state-of-the-art studies [43].

Fig. 8 a A cycle of gait sequence images of silhouettes and b the corresponding GEI

3.4.2 Model-based feature representation model

Despite the pros of using the appearance-based model in gait recognition, it has significant cons resulting from variations in clothing or carrying conditions. Model-based feature representation can cure these problems by modeling the human body skeleton as joint points, limbs, and static distances between joints. Moreover, it can withstand any variation in the human body and model it accurately. However, it is a complex task with a high computational cost and needs high-quality video, so it is less popular than the model-free approach [8]. The model-based or dynamic gait feature extraction model is based on identifying human gaits from the rotation patterns of the lower-body joints (hip, knee, and ankle) of both legs. These annotated points are placed at several joints [44], and the nose point refers to the center point of both sides, as shown in Fig. 9. After extracting the annotated points of the whole human body, the model-based approach also uses static distances, limb distances, and joint angles for human gait recognition [39]. For example, Sivarathinabala and Abirami [44] modeled the human skeleton as 11 joint points and then extracted four static distances: stride length, degree of toe-out, left–right ankle distance, and left–right knee distance. Moreover, they computed the dynamic angles, namely the leg–hip, leg–knee, and leg–ankle angles, and finally fused the static and dynamic features. Liao et al. [8] proposed a model-based method called PoseGait, which divides the whole human body into 18 joint points to identify gaits. They also extracted temporal features from the poses to improve the gait recognition rate.

Fig. 9 The annotated points, limb length, and knee angle extraction

3.4.3 Proposed feature representation model

After describing the standard feature representation models in the previous subsections, we present the proposed model, named the Fused Gait Feature Representation (FGFR) model, which is designed to be robust to any variation in the human body and to improve the gait recognition rate. This model is based on concatenation fusion between the appearance-based and model-based feature representations to combine features from both models. The FGFR model passes through the following procedures. First, in phase one, after the silhouette images have been segmented from the RGB gait videos, their main features are extracted by the proposed convolutional neural network (CNN). Second, in phase two, the annotated joints of the lower part of the human body are detected by the MediaPipe algorithm [45], representing seven 2D joints: LHip, LKnee, LAnkle, RHip, RKnee, RAnkle, and Nose. Then, the 3D poses are estimated from the selected 2D annotated joints to tackle the problems of changes in clothing or carrying conditions [46]; the proposed 3D human pose is defined as Eq. (9),

$$ f_{pose} = \left\{ {j_{0} ,j_{1} , \ldots j_{N} } \right\} $$
(9)

where \({j}_{i}=\left\{{x}_{i},{y}_{i},{z}_{i}\right\}\), \(i\in \left\{1,2,\dots ,N\right\}\), and N = 7 annotated points.

However, the subject size in the gait images varies according to the distance between the participant and the fixed camera. Therefore, all 3D annotated point coordinates are normalized to a fixed scale by taking the distance between the nose and the center point of the right and left hips as the unit length. The nose is selected because it lies at the origin of the human body coordinates [8]. The annotated points are thus normalized by Eq. (10),

$$ J_{N} = \frac{{j_{i} - j_{n} }}{{D_{hn} }} $$
(10)

where \({j}_{i}\in {R}^{3}\) is the location of a body joint, \({j}_{n}\) is the nose position, and \({D}_{hn}\) is the distance between the hip center and the nose point.
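
The seven landmarks and the normalization of Eq. (10) can be sketched as follows; note that MediaPipe's own depth estimate is used here as a stand-in for the 2D-to-3D lifting step of [46], so this is an illustrative approximation rather than the exact pipeline.

```python
import cv2
import numpy as np
import mediapipe as mp

mp_pose = mp.solutions.pose
JOINTS = ["NOSE", "LEFT_HIP", "RIGHT_HIP", "LEFT_KNEE",
          "RIGHT_KNEE", "LEFT_ANKLE", "RIGHT_ANKLE"]

def extract_normalized_joints(frame_bgr):
    """Detect the seven landmarks and normalize them according to Eq. (10)."""
    with mp_pose.Pose(static_image_mode=True) as pose:
        result = pose.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    if result.pose_landmarks is None:
        return None
    lm = result.pose_landmarks.landmark
    joints = np.array([[lm[getattr(mp_pose.PoseLandmark, name)].x,
                        lm[getattr(mp_pose.PoseLandmark, name)].y,
                        lm[getattr(mp_pose.PoseLandmark, name)].z] for name in JOINTS])
    nose = joints[0]
    hip_center = (joints[1] + joints[2]) / 2.0
    d_hn = np.linalg.norm(nose - hip_center)   # unit length D_hn of Eq. (10)
    return (joints - nose) / d_hn              # J_N = (j_i - j_n) / D_hn
```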

Moreover, the static distances between joints are estimated, including the right–left hip, right–left knee, and right–left ankle distances; then, the limb lengths are measured as the hip–knee and knee–ankle distances using the Euclidean distance, which is calculated as Eq. (11) [47],

$$ d_{i} = \sqrt {\left( {x_{1} - x_{2} } \right)^{2} + \left( {y_{1} - y_{2} } \right)^{2} + \left( {z_{1} - z_{2} } \right)^{2} } $$
(11)

where (\({x}_{1}\), \({y}_{1}\), \({z}_{1}\)) and (\({x}_{2}\), \({y}_{2}\), \({z}_{2}\)) refer to the 3D pose coordinates of the two corresponding points, e.g., from the hip joint to the knee joint or from the knee joint to the ankle joint.
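
Given the normalized 3D joints from the previous step, the static and limb distances reduce to Euclidean distances between joint pairs (Eq. (11)); a sketch, with the joint ordering assumed to match the array returned by the previous snippet:

```python
import numpy as np

# Joint index order assumed: 0 nose, 1 left hip, 2 right hip, 3 left knee,
# 4 right knee, 5 left ankle, 6 right ankle.
def gait_distances(joints):
    """Static joint distances and limb lengths computed with Eq. (11)."""
    d = lambda a, b: float(np.linalg.norm(joints[a] - joints[b]))
    return {
        "hip_width": d(1, 2),      # right-left hip (static)
        "knee_width": d(3, 4),     # right-left knee (static)
        "ankle_width": d(5, 6),    # right-left ankle (static)
        "l_thigh": d(1, 3), "r_thigh": d(2, 4),   # hip-knee limb lengths
        "l_shank": d(3, 5), "r_shank": d(4, 6),   # knee-ankle limb lengths
    }
```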

After computing the spatial features from the static and limb distances, the dynamic features are calculated from the knee angles using the joint trajectories of the lower limb (hip, knee, and ankle), since changes in the knee angle influence the performance of the gait recognition model [48]. The left and right knee angles are computed from Eqs. (12) to (14) [8],

$$ f_{{{\text{angle}}}} = \left\{ {\left( {\alpha_{ij} ,\beta_{ij} } \right)|\left( {i,j} \right) \in \emptyset } \right\} $$
(12)
$$ \alpha_{ij} = \left\{ {\begin{array}{*{20}l} {\arctan \frac{{y_{i} - y_{j} }}{{x_{i} - x_{j} }}} \hfill & { x_{i} \ne x_{j} } \hfill \\ {\frac{\pi }{2}} \hfill & {x_{i} = x_{j} } \hfill \\ \end{array} } \right. $$
(13)
$$ \beta_{ij} = \left\{ {\begin{array}{*{20}l} {\arctan \frac{{z_{i} - z_{j} }}{{\sqrt {\left( {x_{i} - x_{j} } \right)^{2} + \left( {y_{i} - y_{j} } \right)^{2} } }}} \hfill & { \left( {x_{i} - x_{j} } \right)^{2} + \left( {y_{i} - y_{j} } \right)^{2} \ne 0} \hfill \\ {\frac{\pi }{2}} \hfill & {\left( {x_{i} - x_{j} } \right)^{2} + \left( {y_{i} - y_{j} } \right)^{2} = 0} \hfill \\ \end{array} } \right. $$
(14)

where \({f}_{angle}\) denotes the right and left knee-angle features, \({J}_{i}=({x}_{i},{y}_{i},{z}_{i})\), \({J}_{j}=({x}_{j},{y}_{j},{z}_{j})\), \(\left(i,j\right)\) is a pair of joints from the set \(\varnothing \), and \(\varnothing \) contains the hip, knee, and ankle joints.
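
Equations (13) and (14) define two angles per joint pair from the differences of the 3D coordinates; a direct NumPy transcription is sketched below, where the choice of hip–knee and knee–ankle pairs for the set \(\varnothing \) is an assumption for illustration:

```python
import numpy as np

def joint_pair_angles(j_i, j_j):
    """Angles (alpha, beta) of Eqs. (13) and (14) for one pair of 3D joints."""
    dx, dy, dz = j_i - j_j
    alpha = np.arctan(dy / dx) if dx != 0 else np.pi / 2           # Eq. (13)
    planar = np.hypot(dx, dy)
    beta = np.arctan(dz / planar) if planar != 0 else np.pi / 2    # Eq. (14)
    return alpha, beta

def knee_angle_features(joints):
    """f_angle of Eq. (12) over the hip-knee and knee-ankle pairs of both legs."""
    pairs = [(1, 3), (3, 5),   # left hip-knee, left knee-ankle
             (2, 4), (4, 6)]   # right hip-knee, right knee-ankle
    return [joint_pair_angles(joints[i], joints[j]) for i, j in pairs]
```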

Finally, the normalized lower-limb pose coordinates, spatial features, and dynamic features are tabulated in a data frame to serve as input to the second proposed deep learning (DL) structure. The normalized joints and the spatial and dynamic features for the frontal and side views are shown in Fig. 10. Algorithm 3 represents the procedure for extracting the landmarks, static and limb distances, and knee angles.

Fig. 10 Poses landmarks, limb, static distances, and knee angles in frontal and side view gaits

Algorithm 3 Extraction of landmarks, static and limb distances, and knee angles

3.5 Proposed pre-trained models

This section highlights the key components of the proposed FGFR model; Sects. 3.5.1 and 3.5.2 describe the detailed layers of the proposed appearance-based and model-based feature extraction models for the CASIA-A and CASIA-B datasets, respectively. Figures 11 and 12 show the architectures of the proposed concatenation feature representation models for the two datasets.

Fig. 11 The proposed architecture of the concatenation model between appearance- and model-based for the CASIA-A dataset

Fig. 12 The proposed architecture of the concatenation model between appearance- and model-based for the CASIA-B dataset

3.5.1 Concatenation model for CASIA-A dataset

The proposed convolutional neural network has been created to extract the phase one features from the silhouette images. It contains four convolutional layers and two max-pooling layers, each stacked after two convolutional layers. The first two convolutional layers have 32 filters with a 3 \(\times \) 3 kernel, a stride of one, no padding, and a ReLU activation function. The other two convolutional layers contain 64 filters with the same kernel size and activation function, a stride of two, and zero padding. Moreover, a dropout layer has been added to prevent overfitting and reduce local response normalization. The convolution operation is computed from Eq. (15) [49],

$$ y_{j}^{r} = f\left( {b_{j}^{r} + \sum w_{i,j}^{r - 1} *x_{i}^{r} } \right) $$
(15)

where r refers to the layer number in the proposed network, \({w}_{i,j}\) is the convolution kernel between \({x}_{i}\) and \({y}_{j}\), f is the activation function, \({x}_{i}\) is the ith input feature map, \({y}_{j}\) is the jth output feature map, and * is the convolution operator. The ReLU activation function is computed by Eq. (16) [50],

$$ f\left( x \right)_{{{\text{ReLU}}}} = \max \left( {0,x} \right) $$
(16)

Max pooling is the most popular and widely used pooling technique; it shrinks the feature map to a smaller size. The output feature map size after the pooling operation is defined by Eqs. (17) and (18) [50],

$$ h^{\prime } = \left[ {\frac{h - f}{s}} \right] $$
(17)
$$ w^{\prime} = \left[ {\frac{w - f}{s}} \right] $$
(18)

where \(h^{\prime}\) and \(w^{\prime}\) refer to the height and width of the output feature map, h and w are the height and width of the input feature map, f is the pooling size, and s is the stride size of the pooling layer. The output size of the feature map after each convolution operation is computed by Eq. (19) [51],

$$ L_{{{\text{output}}}} = \left[ {\frac{N - F + 2P}{S}} \right] + 1 $$
(19)

where N refers to the input size, F is the kernel size of each layer, P is the padding size, and S is the stride size.

Finally, the output of the mentioned convolution and pooling layers is flattened and passed through three fully connected layers of sizes 1024, 512, and 256, which extract the features of the proposed appearance-based model and remove unnecessary data from the network. In the second phase, the data frame of normalized poses, limb and static joint distances, and knee angles is passed through four fully connected layers of sizes 512, 256, 8, and 4 to extract the main features of the proposed model-based architecture. After extracting the features of the two phases, concatenation fusion combines them through a layer of eight units, as formulated in Eq. (20) [49],

$$ {\text{RF}} = \max \left( {0,\sum\nolimits_{i}^{n} {w_{i} l_{i} } + \sum\nolimits_{i}^{m} {w_{j} h_{j} + b} } \right) $$
(20)

where Ph1_F = \(\left\{{l}_{1},{l}_{2},{l}_{3},{l}_{4},\dots ,{l}_{i},\dots ,{l}_{n}\right\}\) represents the phase one features, Ph2_F = \(\left\{{h}_{1},{h}_{2},{h}_{3},{h}_{4},\dots ,{h}_{j},\dots ,{h}_{m}\right\}\) represents the phase two features, and b is the bias.

The final fully connected layer contains the output neurons; for recognition purposes, a SoftMax layer is applied, which is formulated by Eq. (21) [52],

$$ {\text{sm}}(z)_{i} = \frac{{e^{{z_{i} }} }}{{\mathop \sum \nolimits_{j = 1}^{k} e^{{z_{j} }} }} \quad {\text{for}}\, i = 1, \ldots ,k \,{\text{and}}\, z = \left( {z_{1} , \ldots , z_{k} } \right) \in R^{k} $$
(21)

where \({z}_{i}\) is the ith element of the input vector z, k is the number of classes, and the resulting values are normalized by dividing by the sum of all the exponentials. The parameters of each layer used in the CASIA-A architecture are tabulated in Table 2.

Table 2 The architecture of the proposed concatenation model between two phases based on CASIA-A dataset
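
The two-branch architecture described above can be sketched in Keras as follows; the layer sizes follow the description in the text, while the dropout rate, the phase 2 input width, and the number of output classes (20 subjects for CASIA-A) are illustrative assumptions, with Table 2 remaining the authoritative specification.

```python
from tensorflow.keras import layers, models

def build_fgfr_casia_a(pose_feature_dim=36, num_classes=20):
    """Sketch of the two-branch (appearance + pose) concatenation model for CASIA-A."""
    # Phase 1: appearance branch operating on 150 x 150 grayscale GEIs.
    img_in = layers.Input(shape=(150, 150, 1), name="gei_input")
    x = layers.Conv2D(32, 3, strides=1, padding="valid", activation="relu")(img_in)
    x = layers.Conv2D(32, 3, strides=1, padding="valid", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Flatten()(x)
    for units in (1024, 512, 256):
        x = layers.Dense(units, activation="relu")(x)

    # Phase 2: model-based branch operating on the pose/distance/angle data frame.
    pose_in = layers.Input(shape=(pose_feature_dim,), name="pose_input")
    y = pose_in
    for units in (512, 256, 8, 4):
        y = layers.Dense(units, activation="relu")(y)

    # Concatenation fusion (Eq. (20)) followed by the SoftMax recognizer (Eq. (21)).
    fused = layers.Concatenate()([x, y])
    fused = layers.Dense(8, activation="relu")(fused)
    out = layers.Dense(num_classes, activation="softmax")(fused)
    return models.Model(inputs=[img_in, pose_in], outputs=out)
```

The CASIA-B variant of Sect. 3.5.2 would differ only in the phase two dense sizes (1024, 512, 100, 8, 4) and in the width of the final output layer.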

3.5.2 Concatenation model for CASIA-B dataset

For the CASIA-B dataset, the structure of phase one is the same as that of phase one for CASIA-A, while phase two contains five fully connected layers of sizes 1024, 512, 100, 8, and 4, since it requires a deeper network than CASIA-A. Moreover, after the concatenation of the two phases, the last fully connected layer has eleven output neurons. Table 3 describes the parameters of each layer used in the proposed CASIA-B architecture.

Table 3 The architecture of the proposed concatenation model between two phases based on CASIA-B dataset

4 Results and discussion

Before discussing the results of the proposed models, the main performance evaluation metrics used to verify the models’ robustness in the training and testing phases are introduced. These metrics are computed from the confusion matrix, in which each row represents one class of the model and which relates the predicted values to the actual values of the recognition results. To calculate the performance metric values, some quantities must first be computed from the confusion matrix, including the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values. The confusion matrix elements are computed from the following equations [53]:

$$ TP_{{{\text{Class}}\, X}} = C_{i,i} $$
(22)
$$ {\text{FN}}_{{{\text{Class}} \,X}} = \sum\nolimits_{l = 1}^{N} {C_{i,l} } - {\text{TP}}_{{{\text{Class}}\, X}} $$
(23)
$$ {\text{FP}}_{{{\text{Class}}\, X}} = \sum\nolimits_{l = 1}^{N} {C_{l,i} } - {\text{TP}}_{{{\text{Class}}\, X}} $$
(24)
$$ {\text{TN}}_{{{\text{Class }}\,X}} = \sum\nolimits_{l = 1}^{N} {\sum\nolimits_{k = 1}^{N} {C_{l,k} } } - \left( {{\text{FP}}_{{{\text{Class}} \,X}} + {\text{FN}}_{{{\text{Class}}\, X}} + {\text{TP}}_{{{\text{Class}} \,X}} } \right) $$
(25)

where \(C_{i,i}\) refers to the number of samples correctly recognized for class X, \(C_{i,l}\) counts the samples of class X mistakenly recognized as another class, \(C_{l,i}\) counts the samples of other classes mistakenly recognized as class X, and the double sum over \(C_{l,k}\) is the total number of samples.
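
Equations (22) to (25) translate directly into a few NumPy reductions over the confusion matrix; the sketch below assumes rows hold the actual classes and columns the predicted classes:

```python
import numpy as np

def per_class_elements(cm):
    """TP, FN, FP, and TN per class from an N x N confusion matrix (Eqs. (22)-(25))."""
    cm = np.asarray(cm, dtype=np.int64)
    tp = np.diag(cm)                   # Eq. (22): diagonal entries C_ii
    fn = cm.sum(axis=1) - tp           # Eq. (23): row sum minus TP
    fp = cm.sum(axis=0) - tp           # Eq. (24): column sum minus TP
    tn = cm.sum() - (tp + fn + fp)     # Eq. (25): all samples minus the rest
    return tp, fn, fp, tn
```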

These elements are then computed for the proposed concatenated models to evaluate their performance metrics, including accuracy, sensitivity, specificity, precision, false discovery rate, F1-score, recall, and training time [54]. The accuracy of the FGFR model is computed from Eq. (26),

$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} \times 100 $$
(26)

The sensitivity or true positive rate (TPR) measures the proportion of positive samples that are correctly recognized within a dataset, as given by Eq. (27) [55],

$$ {\text{TPR}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}} \times 100 $$
(27)

The specificity or true negative rate (TNR) measures the proportion of negative samples that are correctly recognized within a dataset, as calculated from Eq. (28) [55],

$$ {\text{TNR}} = \frac{{{\text{TN}}}}{{{\text{TN}} + {\text{FP}}}} \times 100 $$
(28)

Precision refers to the proportion of samples predicted as positive that are actually positive, computed from Eq. (29) [56],

$$ {\text{PPV}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FP}}}} \times 100 $$
(29)

Recall represents the proportion of actual positive samples that are correctly predicted by the model, calculated from Eq. (30) [56],

$$ R = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}}}} \times 100 $$
(30)

F1-score computes the harmonic mean of recall and precision, which is calculated from Eq. (31) [57],

$$ F1 - {\text{score}} = \frac{{2{\text{TP}}}}{{2{\text{TP}} + {\text{FP}} + {\text{FN}}}} \times 100 $$
(31)

The false discovery rate (FDR) measures the proportion of positive predictions that are false, computed from Eq. (32) [58],

$$ {\text{FDR}} = \frac{{{\text{FP}}}}{{{\text{FP}} + {\text{TP}}}} \times 100 $$
(32)

The false negative rate (FNR) value is obtained from Eq. (33) [59],

$$ {\text{FNR}} = \frac{{{\text{FN}}}}{{{\text{FN}} + {\text{TP}}}} \times 100 $$
(33)
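
The metric equations above can be grouped into a small helper that consumes the per-class TP, FN, FP, and TN arrays from the previous sketch and returns percentages; this is an illustrative utility, not the authors' evaluation code:

```python
import numpy as np

def gait_metrics(tp, fn, fp, tn):
    """Per-class metrics of Eqs. (26)-(33), returned as percentages."""
    tp, fn, fp, tn = (np.asarray(a, dtype=np.float64) for a in (tp, fn, fp, tn))
    return {
        "accuracy": 100 * (tp + tn) / (tp + tn + fp + fn),       # Eq. (26)
        "sensitivity": 100 * tp / (tp + fn),                     # Eq. (27); equals recall, Eq. (30)
        "specificity": 100 * tn / (tn + fp),                     # Eq. (28)
        "precision": 100 * tp / (tp + fp),                       # Eq. (29)
        "f1_score": 100 * 2 * tp / (2 * tp + fp + fn),           # Eq. (31)
        "fdr": 100 * fp / (fp + tp),                             # Eq. (32)
        "fnr": 100 * fn / (fn + tp),                             # Eq. (33)
    }
```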

To evaluate the proposed experiments, we used the following configuration. First, both phases of the dataset have been divided into three standard sets, the training, testing, and validation sets, with a 7:1.5:1.5 ratio. The proposed network is compiled and fitted according to the following tuning hyperparameters: batch normalization, which reduces the internal shift of the activation layers, with a batch size of 32; a learning rate of 0.001, which defines the parameter update step size; and a momentum factor of 0.5, which enhances training speed and accuracy. Furthermore, the Adam optimizer is selected to train the proposed models, as it consumes less memory and computational power than the other optimizers [50]. Because all the datasets used have multiple output classes, the chosen loss function is sparse categorical cross-entropy [60]. Table 4 describes the main hyperparameters utilized in the proposed models in this article. All the proposed models were run on a PC with the following specifications: Microsoft Windows 10 operating system, 7-core processor @ 4.0 GHz, 12 GB of RAM, and an NVIDIA Tesla GPU with 16 GB of memory.

Table 4 The tuning hyperparameters
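
An illustrative training configuration following Table 4 is sketched below; it reuses the model builder from the sketch in Sect. 3.5.1, maps the momentum factor of 0.5 onto Adam's beta_1 parameter, and uses placeholder array names and an assumed epoch count, so it should be read as an example rather than the authors' exact setup.

```python
import tensorflow as tf

# Illustrative compile/fit settings following Table 4 (batch size 32, learning rate 0.001,
# momentum 0.5, Adam, sparse categorical cross-entropy); names below are placeholders.
model = build_fgfr_casia_a()   # model builder from the sketch in Sect. 3.5.1
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.5)  # momentum mapped to beta_1
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",   # integer subject labels
              metrics=["accuracy"])

# 70% training / 15% validation / 15% testing split, batch size 32 (epoch count assumed).
# history = model.fit([gei_train, pose_train], y_train,
#                     validation_data=([gei_val, pose_val], y_val),
#                     epochs=50, batch_size=32)
```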

4.1 Evaluation of CASIA-A dataset

Using the optimal tuning parameters tabulated in Table 4, the proposed concatenated model has been evaluated on the CASIA-A dataset, which contains three different view angles: parallel, 45, and 90 degrees. The performance metrics on the CASIA-A dataset have been computed from the confusion matrix data shown in Fig. 13. The main elements extracted from the confusion matrix are a true positive value of 952.33, a true negative value of 1909.33, and false positive and false negative values of 4.66 each. Applying these values to the performance metric equations, Eqs. (26) to (33), yields the values listed in Table 5, and the detailed accuracy of each label is listed in Table 6. Figure 14 shows the loss and accuracy curves of the proposed fused model, which scored 99.6% accuracy in a total running time of nearly 4 min 25 s.

Fig. 13 The confusion matrix of the proposed model based on CASIA-A dataset

Table 5 Proposed model evaluation metrics results based on CASIA-A dataset
Table 6 The detailed accuracy of each label
Fig. 14 Accuracy and loss curves of the proposed model based on CASIA-A dataset

4.2 Evaluation of CASIA-B dataset

In addition, the proposed model has been tested on the CASIA-B dataset with the same optimal tuning parameters listed in Table 4. The experimental results of the proposed model have been extracted from the confusion matrix in Fig. 15; the model scored 99.8% accuracy in a total running time of nearly 5 min 57 s, and the other performance evaluation metrics are listed in Table 7. The results of the three hold-out experiments on the CASIA-B dataset are listed in Table 8; the view angles of 36°, 54°, and 144° scored higher accuracy values than the other view angles. Furthermore, the proposed concatenated model scores high gait authentication rates under the various conditions (nm, cl, bg), owing to the use of the fused features of human poses and silhouette images. The experimental results indicate that the FGFR model scores high accuracy for all subject view angles and walking conditions. Furthermore, Fig. 16 shows the loss and accuracy curves, with the number of epochs on the x-axis and the corresponding loss and accuracy values on the y-axis.

Fig. 15 The confusion matrix of the proposed model based on CASIA-B dataset

Table 7 Proposed model evaluation metrics results based on the CASIA-B dataset
Table 8 Detailed accuracy of the CASIA-B dataset
Fig. 16 Accuracy and loss curves of the proposed model based on CASIA-B dataset

Moreover, the proposed model has been run three times under different walking conditions, and Table 9 lists the corresponding results. Figure 17 shows the loss and accuracy curves for the three scene conditions, and Fig. 18 shows their precision–recall curves. Finally, a complete comparison between the proposed model on the CASIA-B dataset and the existing state-of-the-art studies, including [61,62,63] and [64], is listed in Table 10. As the table shows, the proposed algorithm outperforms the recent studies, scoring 99.8% recognition accuracy with a low training time.

Table 9 Proposed model evaluation metrics results based on three different conditions of CASIA-B dataset
Fig. 17 The loss and accuracy curves of the walking scene on CASIA-B dataset a normal walking, b wearing a coat, and c wearing a bag

Fig. 18 Classes precision–recall curves of the walking scene data a normal walking, b wearing a coat, and c wearing a bag

Table 10 Various studies on CASIA-B-based accuracy

In addition, we performed a complete comparison between the results of our proposed concatenated model and the commonly used pre-trained models, including LeNet, AlexNet, GoogLeNet, Xception, and ResNet, listed in Table 11, and then with the current state-of-the-art studies to show the robustness of the proposed model, listed in Table 12. The results show that the proposed system scored 99.8% on the CASIA-B dataset in 1.17 s and 99.6% on the CASIA-A dataset in 0.29 s. From these results, we note the ability of the proposed model to enhance and recognize the gait images of subjects regardless of any covariate factors (bags or clothes) with a low training time, improving the accuracy value by 0.7% over [64] and by 9.48% over the study in [68].

Table 11 Performance results on Casia (B) gait dataset
Table 12 Comparative studies between the proposed FGFR model and the recent gait articles

5 Conclusions

This paper proposes a novel model for improving the recognition rate of humans from their gait. The proposed system combines model-based and model-free features to limit the effect of covariate factors in the dataset, such as carried bags or coats. In the model-free phase, the first step is to split the RGB gait videos into frames. These frames are preprocessed, enhanced, and segmented from the background to obtain silhouette images, which are the input to the proposed CNN that forms the first feature vector. In the model-based phase, the annotated joints and the limb and static joint distances are estimated from the poses in the RGB gait frames; these are the input to the proposed fully connected deep structure that extracts the second feature vector. To build the deep recognizer, concatenation fusion is applied to the two feature vectors. A complete comparison study has been carried out between the proposed model and other recent studies. Experimental results show that the proposed model outperforms other current techniques, scoring 99.8% and 99.6% accuracy on the CASIA-B and CASIA-A datasets with a noticeably low training time. Finally, the proposed model can quickly identify and authenticate humans from their gait images, providing a faster and more accurate methodology than other recent state-of-the-art techniques.