1. Introduction
Many types of valuable data are stored as simple matrices, which can be viewed as single-channel images. Infrared (IR) images are an example of this format. IR photos are particularly useful for capturing images of wild animals, especially those active at night. For this purpose, multiple camera traps were set up in the forests of Slovakia, aiming to capture images of larger animals that pose significant risks to road traffic. Collisions with large animals can lead to costly damage to vehicles and infrastructure and can even result in the death of both animals and humans. To help prevent these collisions, several driver-assistance systems have been developed that rely on accurate nighttime recognition of animals. During the day, these animals tend to avoid noisy roads, but they may cross them at night, creating hazardous situations for drivers. Therefore, IR cameras are essential tools for enhancing road safety by detecting and identifying animals on or near roadways, especially during nocturnal hours [1,2,3].
Artificial Neural Networks (ANNs) are widely used machine learning algorithms. They can simulate several problem-solving processes, such as classification, regression, and data reconstruction. Classification is the appropriate formulation for recognizing animals in images. Most image classifiers work with three-channel Red-Green-Blue (RGB) images; however, networks optimized for RGB input may not be effective on single-channel pictures such as IR images. Furthermore, a Neural Network (NN) consists of an architecture and a model. The architecture describes the composition of layers and how data flow from input to output, while the model is the collection of parameters and neural weights that represents the solution to a specific problem. So, even with an architecture suited for IR image classification, the correct model is still needed. There are several successful NNs, such as GoogleNet, ResNet, and MobileNet, each consisting of a unique architecture and a pre-trained model for image classification. From the nature of NNs, it can be assumed that the lower layers (closer to the input) act as low-level feature detectors; for convolutional layers, these are features like edges, shapes, color patterns, and gradients. Therefore, the weights of these lower layers can be reused for IR images as well. Higher layers represent more abstract concepts, so they are trained to solve the task of animal recognition. To sum up, the networks used here are compositions of pre-trained, well-known networks and a few newly trained classification layers [4,5,6].
To identify the best NN candidate for infrared (IR) image classification, several experiments are required. In each experiment, a new model will be trained based on a selected NN architecture. After training, performance testing will be conducted, generating a confusion matrix and various precision metrics. Two datasets will be used: an IR animal dataset and the Fashion MNIST dataset as a control. This process will be repeated for each architecture until all have been evaluated. Wildlife monitoring is a crucial task in conservation biology and wildlife management. Traditional wildlife monitoring methods, such as radio tracking and visual surveys, can be time-consuming, costly, and intrusive to the animals. Infrared imaging offers a non-invasive, cost-effective alternative for detecting and identifying wildlife in their natural habitats, particularly in low-light conditions. Infrared cameras can capture the thermal signatures of animals, helping to distinguish them from their surroundings. However, manually analyzing infrared images to identify animals is a challenging and labor-intensive task, especially when working with large datasets. The application of automated deep learning methods can address these challenges, making wildlife monitoring more efficient and scalable [5,6].
Convolutional Neural Networks (CNNs) have demonstrated exceptional potential in recognizing animals from infrared images. These deep learning models utilize convolutional layers to extract meaningful features from images, followed by fully connected layers to classify those features into distinct categories. Applying CNNs to the task of wildlife recognition using infrared images offers the opportunity to greatly enhance the efficiency and accuracy of wildlife monitoring efforts [7,8].
Recent advancements in CNN architectures, data augmentation strategies, and transfer learning techniques have significantly enhanced the performance of wildlife recognition systems using infrared images (see Figure 1). This review aims to present a comprehensive overview of the current advancements in the field, highlighting the latest research, methodologies, and potential future directions. We will discuss the challenges and opportunities in wild animal recognition from infrared images and how CNNs can be used to address these challenges. We hope that this review will inspire further research in this exciting and rapidly evolving field, which has the potential to revolutionize wildlife monitoring and conservation efforts [7,8,9,10].
This study offers a detailed analysis and performance evaluation of various deep neural network architectures for wildlife recognition using infrared imagery. The primary contributions of this work include:
Evaluation of Deep Learning Models: We evaluate the performance of several cutting-edge deep neural networks, including VGG, ResNet, Xception, MobileNet, and DenseNet, on wildlife recognition tasks with infrared images, offering insights into their accuracy, computational efficiency, and suitability for real-time applications.
Infrared Image Analysis for Wildlife Monitoring: By focusing on the unique challenges posed by infrared imagery—such as limited color contrast and distinct noise characteristics—we provide targeted recommendations for preprocessing steps that enhance model performance specifically in this context.
Guidance for Practical Implementation: The study delivers practical insights into model selection and optimization tailored for real-world applications in wildlife monitoring. This includes recommendations on balancing accuracy with resource constraints for deployment in field conditions where computational resources may be limited.
Contribution to Conservation Efforts: The results support improved automated monitoring systems, which can assist conservationists in tracking and managing wildlife populations more effectively, thereby contributing to broader ecological and conservation efforts.
Together, these contributions provide a valuable foundation for advancing automated wildlife recognition in challenging environments, encouraging further exploration and innovation in this area.
The first section following this introduction presents the current state of the art. Next, the theory of neural networks is discussed, including descriptions of the different types of layers, divided into two categories: core layers and utility layers. That section also describes the neural network architectures used. The next section focuses on the experiments. It begins with an overview of the overall experiment, followed by a description of the datasets used. The results of the individual neural networks are then presented. The paper ends with a final section summarizing the results.
2. State of the Art
Wild animal recognition from infrared images is an important and challenging task in wildlife monitoring and conservation. Infrared cameras have emerged as a valuable tool for capturing wildlife activity in their natural habitats, especially during low-light conditions. However, manually analyzing infrared images to identify and track animals is both time-consuming and prone to errors, particularly when handling large datasets. Therefore, there is a need for automated and accurate methods for wild animal recognition from infrared images [10,11,12].
Recent advancements in computer vision and machine learning have resulted in the development of advanced algorithms for recognizing animals in infrared images. This state-of-the-art review explores the latest research, methodologies, and challenges in wildlife recognition using infrared imagery [13,14].
One of the major challenges in wild animal recognition from infrared images is the lack of labeled data. Training machine learning models, such as deep neural networks, requires a large amount of labeled data, which is often difficult and expensive to obtain in wildlife monitoring scenarios. To address this challenge, researchers have proposed various transfer learning techniques, where pre-trained models on large-scale datasets, such as ImageNet, are fine-tuned on smaller labeled datasets of infrared images [13,14,15,16].
Another challenge in wild animal recognition from infrared images is dealing with the variability in animal pose, lighting conditions, and background clutter. To overcome these challenges, researchers have introduced various feature extraction techniques, such as Scale-Invariant Feature Transform (SIFT) and Histogram of Oriented Gradients (HOG), which are capable of extracting robust and discriminative features from infrared images. Additionally, advanced deep learning architectures, including Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have been proposed to automatically learn high-level features from infrared images, achieving state-of-the-art performance in wildlife recognition tasks [14,15,16].
Additionally, researchers have explored unsupervised learning methods, such as clustering and dimensionality reduction techniques, to uncover hidden patterns and relationships within infrared images. For example, autoencoders, a type of neural network that learns to reconstruct input data from a compressed representation, have been proposed to learn feature representations of infrared images for wildlife recognition [17,18,19].
Wildlife recognition using Deep Neural Networks (DNNs) has rapidly evolved, providing innovative solutions for environmental monitoring, biodiversity research, and conservation efforts. Traditional methods of wildlife observation, which rely on human observation or basic automated techniques, are often limited by factors like lighting conditions and accessibility. The application of DNNs with infrared (IR) imaging has significantly advanced wildlife recognition, as infrared images enable robust detection even in low-light or nighttime conditions. This article presents an overview of the latest advancements in using DNNs for wildlife recognition with infrared images, compares leading approaches, and highlights how our methodology fits within or diverges from current research trends. Recent advancements in DNN architectures, specifically in convolutional neural networks (CNNs) and transformer models, have revolutionized image-based wildlife recognition. Key research developments in wildlife recognition focus on:
Improved Detection Accuracy: Advanced models such as EfficientNet and ResNet have shown higher accuracy and efficiency, particularly when fine-tuned with specialized datasets. These architectures are widely favored due to their depth and ability to capture complex patterns in large datasets while maintaining efficiency.
Transformer Models and Vision Transformers (ViTs): The introduction of vision transformers has enabled models to capture long-range dependencies and fine-grained details within images. ViTs have demonstrated considerable success in wildlife recognition, especially when combined with IR data, as they can adapt to the distinctive textures and features in infrared images.
Hybrid Architectures: Hybrid architectures that combine CNNs with transformers are emerging as high-performing solutions, as they harness the spatial awareness of CNNs and the contextual attention mechanisms of transformers. Research indicates that hybrid models may better capture subtle features that distinguish similar species in infrared images.
Infrared imaging is particularly useful for wildlife detection as it allows for non-invasive monitoring during nighttime, when many animals are active. The unique features of infrared data, such as differences in thermal signatures, enable better differentiation between animals and background environments. However, IR images lack the color information available in RGB images, posing challenges in accurately distinguishing similar animals. Research has shown that DNNs trained specifically on IR images with appropriate data augmentation and fine-tuning strategies can overcome some of these limitations.
Our current study explored the performance of traditional Convolutional Neural Networks (CNNs) for wildlife recognition using infrared images, highlighting the strengths and limitations of well-established architectures like VGG, ResNet, Xception, MobileNet and DenseNet. As the field of deep learning continues to evolve, it is crucial to explore more recent and advanced models to push the boundaries of infrared image classification. Two such models, HCGNet and ConvNeXt, have gained attention for their ability to achieve high accuracy with lower computational costs. These models offer promising avenues for future research in the area of wildlife recognition using infrared imagery. HCGNet, a lightweight CNN architecture, is specifically designed to reduce computational complexity while maintaining or even enhancing accuracy. In the context of wildlife recognition using infrared images, HCGNet could be particularly useful for real-time applications where inference speed is a key requirement, such as automated wildlife monitoring in remote areas with limited computational resources. ConvNeXt is another recent architecture known for its performance in large-scale image classification tasks while maintaining efficiency. With its deep architecture and innovative design, ConvNeXt is well-suited for complex image recognition tasks, such as distinguishing between species with similar thermal signatures in infrared images.
3. Materials and Methods
In this section, we first introduce the Neural Networks (NNs) used in our study. While describing these networks by their architectural design is helpful, many networks contain repeating blocks of layers. To clarify these structures, we present simplified diagrams. It’s important to note that not all layers and blocks depicted in these diagrams are composed of artificial neurons. Layers that add functionality without containing trainable neurons will be referred to as utility layers, whereas those containing trainable neurons will be identified as core layers.
3.1. Core Layers
Core layers in neural networks are the fundamental building blocks responsible for essential operations in learning and prediction. These layers enable the network to process input data, extract meaningful features, and generate accurate outputs, forming the backbone of the neural network’s functionality. By combining various types of core layers, neural networks can be customized to tackle a wide range of tasks, from image classification to sequence prediction, allowing them to learn effectively and generalize from data.
3.1.1. Convolution Layer and Deconvolution Layer
Convolution layers (CL) are an effective approach to image processing by NNs (Figure 2). To process even a small image, a significant number of neurons is needed if a Dense layer (DL) is used. For a gray-scale image with a resolution of 1920 by 1080, a DL would need more than 2 million neurons to pass data from each pixel of the image. All pixels need to be passed, as each of them can contain useful information. Moreover, several DLs are needed to extract basic image features (like shape, texture, or gradient), as one pixel does not contain information about its surroundings. On the other hand, a CL utilises a moving window to extract a cutout from the image. This part is then processed by convolution to find common features in the data. For a window of size 3 by 3 pixels, only 9 neurons are needed, but such a CL would describe only one feature (also called a filter). In practice, CLs use more than one filter. It can be reasoned that each filter represents one feature found in the input data. If a CL is used as the first layer, these filters will represent the most basic image features (like lines). Subsequent layers gradually represent more and more abstract concepts (line → shape → nose → face). A general NN for image recognition can be created with CLs at the beginning and DLs at the end of the architecture. Such an architecture is also called a Convolutional Neural Network, or CNN for short [20,21,22].
Two-dimensional data passed between layers are called maps. In general, the input map has a certain size, and the convolution window is smaller (the exact sizes are shown in the figure). There can be more than one filter; there are 3 in the figure. Therefore, the output map will have its own size (depending on the stride of the window) and the same number of “channels” as the number of filters.
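To make the map-and-filter relationship concrete, the following minimal sketch uses Keras (one of the frameworks used in our experiments); the sizes here are illustrative and do not correspond to those in the figure:

```python
import tensorflow as tf

# One convolution layer: 3 filters, a 3x3 window, stride 1.
# Input: a single-channel 8x8 map (batch, height, width, channels).
x = tf.random.normal((1, 8, 8, 1))
cl = tf.keras.layers.Conv2D(filters=3, kernel_size=(3, 3), strides=1)
y = cl(x)
print(y.shape)  # (1, 6, 6, 3): a smaller map with one "channel" per filter
```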
As its name suggests, this layer functions in opposition to the CL. The Deconvolution Layer (DCL) translates a more abstract form of the data back into a less abstract one. DCLs are mostly used in Auto-encoder-type NNs, where an image is both the input and the output of the NN. Auto-encoders generally have a mirrored structure, where the data become more and more abstract to represent the defined problem and are then reconstructed into the resulting image.
3.1.2. Pooling Layer
Pooling layers (PL) down-sample data, as shown in Figure 3. If a CL has the stride of its window set to 1, the data resolution is reduced by 1 pixel on each side. To reduce the data further, a bigger stride can be used, but more information between steps is lost. Reduction can also be done by averaging neighbouring pixels. There are multiple PL variants, based on the method used (max, min, median, or average value) and on the kernel size. An example of a MaxPooling layer can be seen in Figure 3. There are also un-pooling layers. A layer named MaxPooling2D down-samples 2D data by taking the maximal value [23,24,25].
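As an illustration, a minimal Keras sketch of MaxPooling2D follows; since the figure's exact filter size and stride are not reproduced here, the common 2 by 2 window with stride 2 (the Keras default) is assumed:

```python
import tensorflow as tf

# MaxPooling2D keeps only the maximal value in each window, halving the
# map resolution with the default 2x2 window and stride 2.
x = tf.random.normal((1, 6, 6, 3))
pl = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=2)
print(pl(x).shape)  # (1, 3, 3, 3)
```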
3.1.3. Recurrent Layer
Recurrent layers (Figure 4) are a crucial element of recurrent neural networks (RNNs), a type of artificial NN designed to identify patterns in sequential data, such as speech, text and time series. Unlike traditional feedforward NNs, recurrent layers feature connections that loop back on themselves, enabling them to retain a state that captures information from previous inputs. This capacity to preserve context over time makes RNNs especially effective for tasks where understanding temporal dynamics and context is essential [26,27].
Recurrent layers process sequences of data by maintaining an internal state that evolves over time. This state acts as a memory, capturing information about previous elements in the sequence and using it to influence the processing of current and future elements. In a recurrent layer, each neuron is connected not only to the neurons in the next layer (as in feedforward networks) but also to itself, and possibly to other neurons in the same layer, from the previous time step. These looped connections enable the network to propagate information forward in time, allowing it to “remember” previous inputs. In Recurrent Layers (later RL), neurons thus take as input not only data from the previous layer but also their own output from the previous pass, as shown in Figure 4. This functions as a memory, and an RL can therefore process relations in time. The most common usage is in video processing [28,29].
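A minimal Keras sketch of a recurrent layer is shown below; the layer type (SimpleRNN) and all sizes are illustrative only, since recurrent layers are not part of the architectures evaluated later:

```python
import tensorflow as tf

# A recurrent layer consumes a whole sequence while carrying its internal
# state from step to step. Input: 1 sequence, 10 time steps, 4 features.
x = tf.random.normal((1, 10, 4))
rl = tf.keras.layers.SimpleRNN(units=8)
print(rl(x).shape)  # (1, 8): the state after processing the full sequence
```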
3.1.4. Batch Normalisation
The main idea behind Batch Normalisation (BN) is to normalize the inputs by subtracting the mean and dividing by the standard deviation, ensuring that the inputs have zero mean and unit variance (Figure 5). This normalization is performed on mini-batches of training examples rather than on individual examples [30,31].
First, this layer selects a small batch of images, for which the mean and standard deviation are computed across the batch dimension for each feature (color channel) independently. The mean is subtracted from each value in this “mini-batch”, and the result is divided by the standard deviation. After standardization, the normalized values are further scaled by a learnable parameter called gamma and shifted by another learnable parameter called beta. These parameters allow the network to adaptively rescale and shift the values to better suit the following layers [30,31].
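In standard notation (a generic formulation of the above, not taken from the paper's figures), each value $x_i$ of a mini-batch $B$ is transformed as

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta,$$

where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance, $\epsilon$ is a small constant for numerical stability, and $\gamma$ and $\beta$ are the learnable scale and shift parameters.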
3.1.5. Fully Connected Layer
Also called the Dense layer (DL), this is the most basic type of NN layer. Mostly it has only two parameters: the number of neurons and the activation function used. The number of neurons is a self-explanatory parameter, as it defines the number of trainable neurons in this layer. Neural networks composed of only these layers are called Multi-Layer Perceptrons (MLP). In architectures for image recognition, this layer is used mostly at the “back” of the architecture. As data pass through the network, their meaning moves from raw data to more abstract concepts. Therefore, the last layer outputs (in theory) the solution to the problem. In image recognition, this is mostly the classification of the image: each neuron represents the percentage of confidence that the input data belong to a specific class [32,33,34].
A simple three-layer MLP can be seen in Figure 4. The first layer, marked as L1, is the input layer; next is one hidden layer (L2); and at the end is the output layer (L3). Each layer has its own number of neurons, and the output of each neuron (from the input and hidden layers) is connected to each neuron of the next layer [33,34].
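As a minimal Keras sketch (the layer sizes are illustrative), such a three-layer MLP can be written as:

```python
import tensorflow as tf

# Three-layer MLP: input layer L1, one hidden layer L2, output layer L3.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),             # L1: flattened 28x28 image
    tf.keras.layers.Dense(128, activation="relu"),   # L2: hidden layer
    tf.keras.layers.Dense(10, activation="softmax")  # L3: confidence per class
])
model.summary()
```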
3.2. Utility Layers
Utility layers, often referred to as auxiliary or helper layers, are components in neural networks that perform specific functions to aid in the training and performance of the network. These layers are not directly responsible for learning representations of the data but provide essential operations that facilitate the overall learning process. Utility layers are crucial for various tasks, such as normalization, dropout, and activation functions, which help improve the stability, efficiency, and generalization of neural networks.
These layers play a vital role in modern neural network architectures, providing necessary operations that support the learning process, improve performance, and enhance the robustness of models. Their proper use is crucial for building effective and efficient deep learning models. They add useful functionality to the overall structure and function of the resulting NN.
Flatten Layer and Dropout
When a CL is used, the data on the output of the layer have the same dimensionality as on the input; for a basic CNN, that is a 2D matrix. Dense layers, however, work with a single row of values stored as a 1D vector. To transition in an architecture from CLs to DLs, a transformation from 2D to 1D is needed. The Flatten layer (later FL) provides this functionality (Figure 6). In general, it reduces the dimension of the data by one [35].
The Dropout layer (OL) sets a random number of neuron outputs to zero, effectively disabling those neurons, as shown in Figure 7. This functionality simulates the ability of biological neurons to “switch” (be either ON or OFF). Dropout also provides a mechanism to prevent “over-training” of the NN, a state in which the trained model responds only to the training data and is too rigid to correctly process other data [35].
3.3. Used Architectures
Several neural network architectures have been developed to address specific tasks and challenges. For instance, Convolutional Neural Networks (CNNs) have achieved remarkable success in image classification tasks. To evaluate the presented dataset, several widely known architectures were used. They are presented in the chronological order in which they were introduced to the scientific community; a process of adding more and more complexity to achieve better results can be observed.
3.3.1. Visual Geometry Group
The Visual Geometry Group (VGG) is a CNN architecture introduced in 2014 by a team of researchers from the University of Oxford, named after the group that developed it. VGG is renowned for its simplicity and consists of a series of convolutional layers followed by fully connected layers (Figure 8). The architecture includes either 16 or 19 layers, depending on the variant, with each convolutional layer using filters of the same size. In 2014, VGG achieved state-of-the-art performance on the ImageNet classification task and has since become widely used in computer vision applications, especially for transfer learning.
The architecture starts with 13 or 16 CLs, followed by 3 DLs. After each group of CLs there is a PL, and ReLU is used as the activation function. As input, the network expects an RGB image of 224 by 224 resolution. Each CL uses a kernel of size 3 by 3 [36,37].
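A hedged sketch of the transfer-learning setup described in the introduction, using the Keras implementation of VGG16, is given below; the head layers are illustrative, as the paper does not specify its exact head:

```python
import tensorflow as tf

# Pre-trained VGG16 as a frozen low-level feature extractor plus a small
# trainable classification head.
base = tf.keras.applications.VGG16(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the pre-trained low-level feature detectors

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),   # illustrative head size
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(4, activation="softmax"),  # Bear, Boar, Deer, Fox
])
# Single-channel IR images can be replicated to three channels so they
# match the RGB input the pre-trained weights expect.
```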
3.3.2. Residual Neural Network
Residual Neural Network (ResNet) is a NN architecture introduced in 2015 by researchers at Microsoft Research Asia. ResNet was specifically designed to overcome the problem of vanishing gradients, a common challenge in very deep neural networks. The key innovation of ResNet is the use of residual blocks, which allow the network to learn residual functions that approximate the identity mapping. This design enables more effective gradient propagation through deep networks, leading to better performance across various computer vision tasks. ResNet won the ImageNet classification competition in 2015 and has since become a foundational architecture widely applied in numerous domains and applications [
38,
39].
The main feature of this architecture is its repeating blocks (see Figure 9). There are two types of blocks, depending on the output size relative to the input size. If the output size of the data differs from the input, it is a “Residual block” (CONV BLOCK in Figure 9); if they are the same, it is an “Identity block” (IDEN BLOCK in Figure 9). In each type of block, the data pass through two paths, one of which is called the “shortcut”. At the exit from the block, the data from the shortcut are added to the data that passed through several CLs [38,39,40].
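A simplified functional-API sketch of an identity block follows; real ResNet50 blocks use three bottleneck convolutions, so this two-convolution version only illustrates the shortcut mechanism:

```python
import tensorflow as tf

def identity_block(x, filters):
    """Simplified 'IDEN BLOCK': the shortcut carries the input unchanged
    and is added back to the output of the convolutional path. The number
    of filters must match the input's channel count so the shapes agree."""
    shortcut = x
    y = tf.keras.layers.Conv2D(filters, (3, 3), padding="same")(x)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.Activation("relu")(y)
    y = tf.keras.layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = tf.keras.layers.BatchNormalization()(y)
    y = tf.keras.layers.Add()([y, shortcut])  # add shortcut to the main path
    return tf.keras.layers.Activation("relu")(y)
```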
3.3.3. Xception
Xception is a convolutional neural network (CNN) architecture that was introduced in 2016 by François Chollet, the creator of the popular deep learning library Keras. Xception stands for “Extreme Inception”, which refers to its similarity with the Inception architecture while using an extreme form of depthwise separable convolutions [41,42].
The Xception architecture aims to improve the efficiency of deep neural networks by using depthwise separable convolutions (shown as “SEPAR CL” in Figure 10), which separate spatial filtering and channel-wise filtering into two separate convolutional layers. This approach significantly reduces the number of parameters in the network and the computational complexity required to train the model [42].
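In Keras, this corresponds to the SeparableConv2D layer; the small sketch below (with illustrative sizes) shows the depthwise-then-pointwise decomposition in use:

```python
import tensorflow as tf

# Depthwise separable convolution: per-channel spatial filtering (depthwise)
# followed by 1x1 channel mixing (pointwise), as in Xception's "SEPAR CL".
x = tf.random.normal((1, 32, 32, 64))
sep = tf.keras.layers.SeparableConv2D(filters=128, kernel_size=(3, 3),
                                      padding="same")
print(sep(x).shape)  # (1, 32, 32, 128), with far fewer weights than Conv2D
```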
3.3.4. MobileNet
MobileNet is a series of CNN architectures designed to be lightweight and efficient, optimized specifically for mobile devices and embedded systems with constrained computational resources. It was introduced by researchers from Google in 2017 and has since become a popular choice for various computer vision tasks on mobile devices. The MobileNet architecture employs depthwise separable convolutions (depicted as “SEPAR CL” in Figure 11), which decompose the standard convolution operation into two separate layers: a depthwise convolution and a pointwise convolution. This method significantly reduces the number of parameters and computational complexity of the network while maintaining high accuracy. MobileNet has been successfully utilized in a wide range of applications, including object detection, image classification, and facial recognition, particularly on mobile devices. MobileNet models are highly efficient, both in terms of computational cost and memory usage. This efficiency does not come at the expense of significant performance loss, as MobileNet maintains competitive accuracy rates compared to larger, more computationally intensive models [43,44].
In summary, MobileNet stands out as a highly efficient and effective neural network architecture, optimized for environments where computational resources are limited. Its innovative use of depthwise separable convolutions and tunable hyperparameters (width and resolution multipliers) allows for flexible deployment across a range of applications, particularly in mobile and embedded systems [43,44,45].
3.3.5. DenseNet
The Dense Convolutional Network (DenseNet) is a DNN architecture developed to tackle the vanishing-gradient problem that can arise in very deep networks. The core innovation of DenseNet is its dense connectivity pattern, in which each layer is directly connected to every other layer in a feedforward manner, as illustrated by the colored arrow-lines in Figure 12. This connectivity pattern enables efficient information flow between layers, improving gradient propagation and allowing feature reuse. DenseNet models have achieved state-of-the-art results on a variety of computer vision tasks, such as image classification, object detection, and semantic segmentation. Additionally, they have relatively few parameters compared to other deep neural networks, making them computationally efficient and allowing them to be trained on smaller datasets [46,47,48].
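The dense connectivity can be sketched with the Keras functional API as below (a single illustrative step, not the full DenseNet block structure):

```python
import tensorflow as tf

def dense_step(x, growth_rate=32):
    """One DenseNet-style step: the newly produced feature maps are
    concatenated with the input, so later layers see all earlier outputs."""
    y = tf.keras.layers.BatchNormalization()(x)
    y = tf.keras.layers.Activation("relu")(y)
    y = tf.keras.layers.Conv2D(growth_rate, (3, 3), padding="same")(y)
    return tf.keras.layers.Concatenate()([x, y])  # dense connection
```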
4. Experimental Results
This section presents the experimental results of using various neural network architectures for the recognition of wild animals in infrared images. The model’s performance is evaluated through a confusion matrix, offering detailed insights into its classification accuracy across different animal categories. All experiments were conducted on a computer running Windows 10, using the Keras and TensorFlow frameworks. Two datasets consisting of single-channel images were used in the analysis.
4.1. IR Animal Dataset
The first dataset contains IR images of Slovak wild animals. The animals were selected based on the potential damage of a collision between them and a road vehicle. The infrared (IR) animal dataset (Figure 13) comprises a collection of images specifically captured using infrared technology, intended to depict various species of wild animals in natural environments. The dataset is categorized into four classes (Figure 13): Bear, Boar, Deer, and Fox.
The images were resized to 224 by 224 pixels, with each class containing 200 images, resulting in a total of 800 images in the dataset. For the experiments, 175 images from each class were used for training, and 25 images were set aside for testing. Each image in the dataset captures the infrared signature of an animal, representing its thermal emissions or heat patterns rather than visible light. This unique thermal data provides valuable insights into the animals’ characteristics and behaviors in their natural environments. The dataset is a useful resource for advancing research in wildlife monitoring, conservation efforts, and the application of machine learning algorithms for animal recognition and classification using infrared imagery.
Researchers and practitioners can leverage this infrared animal dataset to train and evaluate machine learning models, improving their ability to automatically identify and classify various wildlife species based on their thermal profiles captured through infrared technology. This dataset plays a key role in advancing scientific knowledge and supporting technological applications in wildlife management and ecological research.
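For orientation, a hedged loading sketch in Keras is shown below; the directory layout and path are hypothetical, as the dataset's storage format is not published here:

```python
import tensorflow as tf

# Hypothetical layout: one sub-folder per class (bear/, boar/, deer/, fox/).
train_ds = tf.keras.utils.image_dataset_from_directory(
    "ir_animals/train",        # hypothetical path
    image_size=(224, 224),     # matches the resized dataset images
    color_mode="grayscale",    # single-channel IR images
    batch_size=32)
```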
4.2. Fashion MNIST
To provide a comparison for the proposed dataset, the Fashion MNIST dataset (Figure 14) was used. This dataset contains 10 classes of grayscale images of fashion objects. The images are sized at 28 by 28 pixels, with a total of 60,000 images for training and 10,000 images for testing. Each class contains 6000 images for training and 1000 images for testing. Examples from the dataset can be seen in Figure 14.
Each image measures 28 pixels in height and 28 pixels in width, resulting in a total of 784 pixels per image. Each pixel is represented by a single pixel-value, denoting its brightness level, where higher values indicate darker shades. These pixel-values range from 0 to 255 as integers.
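Since Fashion MNIST ships with Keras, loading it requires no external files; a minimal sketch:

```python
import tensorflow as tf

# 60,000 training and 10,000 test images, each a 28x28 gray-scale array
# with integer pixel values in the range 0-255.
(x_train, y_train), (x_test, y_test) = \
    tf.keras.datasets.fashion_mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)
```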
4.3. Evaluation Criteria
The experimental results are presented as a confusion matrix and as the Precision (P), Recall (R) and F1 score of the best configuration for each presented architecture and dataset. The confusion matrix presents results as accumulated occurrences of predicted vs. true labels. The rows of the matrix represent the true labels (here called Target) and the columns the predicted labels. If a tested image has label 1 but was predicted as label 3, the value at position (1, 3) (1st row, 3rd column) is incremented by 1 (one tested image). Data in the matrix can be one of 4 types (Figure 15):
True Positive (TP)
False Positive (FP)
True Negative (TN)
False Negative (FN)
Positive data refers to all data predicted as belonging to the target class. Data that truly belong to the target class are considered True Positives (TP), while others are False Positives (FP). Negative data refers to those predicted as not belonging to the target class. True Negatives (TN) are data that do not belong to the target class and are correctly predicted as such, while False Negatives (FN) are data from the target class that are incorrectly predicted as not belonging to it. Various ratios can be calculated from these values to provide insights into the performance of the model, reflecting different aspects of class predictions and overall testing accuracy.
The first ratio is called Precision (P) and is calculated as the ratio between the True Positives and the sum of all positives (1). Precision indicates how relevant the positive predictions are:

$$P = \frac{TP}{TP + FP} \tag{1}$$

The next ratio is Recall (2), the ratio between the True Positives and the sum of True Positives and False Negatives. Recall indicates how well the class is separated from the others:

$$R = \frac{TP}{TP + FN} \tag{2}$$

Precision and Recall can be combined into one value by several kinds of averaging. If the harmonic mean is used, the result is called the F1 score. Its formula is

$$F_1 = 2 \cdot \frac{P \cdot R}{P + R} \tag{3}$$
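The following sketch computes these three metrics per class directly from a confusion matrix laid out as described above (rows are true labels, columns are predicted labels):

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, Recall and F1 per class from a confusion matrix whose
    rows are the true labels and whose columns are the predicted labels."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)              # correctly predicted per class
    fp = cm.sum(axis=0) - tp      # predicted as the class, but not in it
    fn = cm.sum(axis=1) - tp      # in the class, but predicted otherwise
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```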
4.4. Results
Various hyperparameters were utilized in this study, including the number of epochs, batch size, and different architectural variations, which are detailed in the respective sections for each architecture. Examples of these hyperparameters include learning rate, batch size, number of layers, filter sizes, pooling methods, dropout rates, and activation functions. The selection of these hyperparameters is based on factors such as the dataset’s characteristics, the complexity of the task, and the computational resources available.
Optimizing hyperparameters (Table 1) is essential for maximizing model performance. For example, tweaking the learning rate can speed up convergence and prevent the model from settling into suboptimal solutions. Likewise, altering the number of layers and filter sizes in a CNN can enhance the model’s capacity to detect intricate visual features and patterns. The behavior and efficacy of machine learning models are heavily influenced by a range of hyperparameters, each contributing to the overall model’s ability to perform well:
Learning Rate: The learning rate controls how much the model’s parameters are adjusted after each training step. It directly impacts how quickly the model converges and whether it stabilizes in an optimal state. Selecting the right learning rate is vital to prevent underfitting or overfitting.
Batch Size: Batch size determines how many training examples are processed together in each iteration. It affects the speed of training, memory consumption, and the model’s ability to generalize. The ideal batch size depends on both the hardware constraints and the nature of the dataset.
Kernel Size: In CNNs, kernel size defines the dimensions of the filter used to extract features from the input image. This parameter influences how much detail and spatial information is captured. Striking a balance between local and global feature extraction requires careful kernel size tuning.
Activation Functions: Activation functions introduce non-linearity into the network, enabling it to model complex relationships. Common functions like ReLU, sigmoid, and tanh determine how neurons respond to input signals, influencing both the network’s capacity to learn intricate patterns and its training behavior.
Dropout Rate: Dropout is a regularization technique that randomly deactivates a subset of neurons during training. This helps prevent overfitting by encouraging the network to build more resilient features and reducing its reliance on specific units. The dropout rate controls how many neurons are dropped.
Optimization Algorithm: The choice of optimization algorithm impacts the model’s ability to converge quickly and stably. Popular algorithms such as Stochastic Gradient Descent (SGD) and Adam influence the training process. Additional hyperparameters like momentum, weight decay, and learning rate decay are also crucial for fine-tuning the optimization process.
NN Architecture: The architecture of an NN defines the structure of the layers, including their types (e.g., CL, PL, fully connected layer) and their connectivity. The architecture must be chosen based on the complexity of the task and the available computational resources to optimize both performance and efficiency.
A thorough understanding and careful tuning of these hyperparameters can significantly improve the accuracy and performance of computer vision applications.
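To show where these hyperparameters enter a typical Keras training run, a hedged sketch follows; every concrete value here is illustrative, and the settings actually used are those listed in Table 1:

```python
import tensorflow as tf

# Illustrative hyperparameter choices only (see Table 1 for the real ones).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 1)),
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),  # kernel size
    tf.keras.layers.MaxPooling2D((2, 2)),                   # pooling method
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),                           # dropout rate
    tf.keras.layers.Dense(4, activation="softmax"),         # architecture
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4), # optimizer + LR
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=30, batch_size=32)
```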
4.4.1. Results for Comparison of VGG 16 and VGG 19
In this section of the experiments, we investigate the performance of two popular CNN architectures, VGG16 and VGG19, for wild animal recognition using infrared images. The experiments are conducted with a dataset containing infrared images of various wildlife species captured in their natural environments. This dataset is split into training, validation, and testing sets to train and evaluate the neural network models. Both VGG16 and VGG19 architectures are tested, with the key difference between them being the number of layers utilized.
The experimental results are presented through confusion matrices, offering a detailed overview of the classification performance of the VGG16 and VGG19 models across various animal classes (Table 2 and Table 3). Each row in the confusion matrix represents the true labels, while each column shows the predicted labels. The VGG16 model demonstrates strong precision in Classes 3 and 4 (Deer and Fox), highlighting its high accuracy in predicting these categories (Table 4). However, it shows lower recall in Class 2 (Boar), suggesting it may have missed identifying instances of this category compared to others. Overall, the F1 scores reflect a reasonably balanced performance across the categories, with Class 3 (Deer) standing out with the highest F1 score of 88.00%. Adjustments and further analysis may be necessary to improve performance in specific categories where recall is lower.
The analysis of the confusion matrices reveals that VGG16 consistently outperforms VGG19 in wild animal recognition using infrared images. Across all animal categories, VGG16 exhibits higher average classification accuracies, as evidenced by the fewer misclassifications depicted in its confusion matrix compared to VGG19.
The VGG19 model shows varying levels of precision, recall, and F1 scores across different categories (Table 5). It performs best in Class 3 (Deer), where it achieves the highest precision (85.00%) and recall (79.00%), resulting in the highest F1 score (81.00%) among all categories. However, it exhibits lower performance in Class 1 and Class 4 (Bear and Fox), with precision and recall scores in the range of 43.00% to 55.00%. In conclusion, while the VGG19 neural network architecture demonstrates respectable performance overall, there are noticeable differences in its ability to accurately classify different categories. Adjustments and further analysis may be necessary to improve performance, especially in categories with lower precision and recall scores. These results suggest that the architectural design of VGG16 may be better suited for the complexities inherent in the infrared imagery of wild animals. The superiority of VGG16 underscores its efficacy in automated animal recognition tasks and emphasizes the importance of selecting appropriate neural network architectures for specific image processing applications.
From the results (Table 6), it is evident that the VGG16 model demonstrates varying degrees of success across different classes. Classes such as Ankle Boot (Class 10) and Trouser (Class 2) show high precision and recall, indicating that the model is highly accurate in identifying these items and does not miss many instances. Classes like T-shirt/Top (Class 1) and Sneaker (Class 8) exhibit moderate performance, with decent precision and recall but room for improvement. The Shirt (Class 7) and Dress (Class 4) classes display lower precision, recall, and F1 scores, highlighting the challenges the model faces in correctly classifying these items. The VGG16 architecture shows robust performance in several categories but also indicates areas where further fine-tuning and data augmentation could enhance classification accuracy.
The VGG19 model (Table 7) showed excellent performance in classes such as Bag (Class 9) and Ankle Boot (Class 10), with high precision, recall, and F1 scores, indicating the model’s strong capability in correctly identifying these items. Classes like Trouser (Class 2) and Sandal (Class 6) demonstrated moderate performance, with precision and recall values suggesting the model can reliably identify these items but with some room for improvement. The model struggled with classes such as T-shirt/Top (Class 1) and Shirt (Class 7), showing lower precision, recall, and F1 scores. This indicates difficulties in accurately classifying these categories, possibly due to similarities with other classes or insufficient distinctive features.
Overall, the VGG19 architecture performed well across several classes but highlighted areas where further model adjustments and data augmentation could enhance classification accuracy. The mixed results across different classes suggest the need for targeted improvements in the neural network’s ability to distinguish between visually similar items.
4.4.2. Results for ResNet 50
The performance of ResNet is summarized using a confusion matrix (Table 8), illustrating the classification results across different animal categories. Each row of the confusion matrix corresponds to the true labels, while each column represents the predicted labels. The confusion matrix reveals that ResNet performs well in recognizing wild animals in infrared images, achieving high classification accuracy across most categories.
ResNet50 demonstrates strong performance in wild animal recognition using infrared images, achieving high precision and recall rates across multiple categories (Table 9). The model’s ability to maintain high F1 scores indicates robustness in classification tasks. However, there are slight variations in performance across different categories, particularly in precision for Class 4 (Fox). Further optimization and fine-tuning of the model parameters could potentially enhance performance and address any inconsistencies observed.
These results highlight ResNet50’s effectiveness in complex visual recognition tasks such as wildlife monitoring, where accuracy and reliability are crucial for conservation efforts and ecological research.
The ResNet50 model (Table 10) showed strong performance in classes such as Trouser (Class 2), Sandal (Class 6), Bag (Class 9), and Ankle Boot (Class 10), with high precision, recall, and F1 scores. This indicates the model’s robust ability to correctly identify these items. Classes like Pullover (Class 3) and Sneaker (Class 8) demonstrated moderate performance, suggesting the model can identify these items with reasonable accuracy but with some potential for improvement. The model struggled with classes such as T-shirt/Top (Class 1) and Dress (Class 4). Although the precision for T-shirt/Top is high, the recall is notably low, indicating that while the model is precise when it makes a prediction, it misses many true instances of this class. The Dress class also has a high recall but low precision, showing it often correctly identifies dresses but also misclassifies other items as dresses.
Overall, ResNet50 performed well on several classes, achieving a good balance between precision and recall in many cases. However, the disparity in performance across different classes highlights areas where the model could benefit from further optimization and potentially more targeted training data to improve its ability to distinguish between similar items.
4.4.3. Results for Xception
This section presents the experimental results of using the Xception neural network architecture for recognizing wild animals in infrared images. The model’s performance is evaluated through a confusion matrix (Table 11), offering detailed insights into its classification accuracy across different animal categories. The experiments were carried out using a dataset of infrared images featuring various wild animal species. The dataset was split into training, validation, and testing sets to train and assess the Xception model.
The performance of Xception is summarized using a confusion matrix, which illustrates the classification results across different animal categories (Table 12). Each row in the confusion matrix represents the ground truth labels, while each column represents the predicted labels. The confusion matrix indicates that Xception achieves high performance in recognizing wild animals in infrared images, with high classification accuracy across most categories. The model’s high performance and detailed classification ability make it a suitable candidate for advanced wildlife monitoring and conservation tasks, where precise and reliable recognition is essential.
The Xception model demonstrates excellent precision across all categories, particularly in Classes 1 and 3 (Bear and Deer) where it achieves 100.00%. This indicates very accurate predictions for these classes. The model also shows strong recall across the board, with Class 1 (Bear) having the highest recall of 95.00%. The F1 scores are high overall, reflecting a balanced performance between precision and recall for most categories. In conclusion, the Xception neural network architecture shows robust performance in this classification task, achieving high accuracy, recall, and F1 scores across multiple categories.
For classes such as T-shirt/Top (Class 1), Pullover (Class 3), and Sneaker (Class 8), the model showed moderate performance. These results suggest that while the model can accurately identify these items, there is still room for improvement, especially in terms of precision and recall balance. The Xception model (Table 13) demonstrated high performance in the Trouser (Class 2) and Ankle Boot (Class 10) categories, achieving high precision, recall, and F1 scores. This indicates that the model is very effective at correctly identifying these items with few errors. On the other hand, the model struggled with classes like Sandal (Class 6) and Coat (Class 5). Although Sandal has a high precision, the recall is relatively low, indicating that the model misses many true instances of this class. Similarly, Coat shows moderate precision but lower recall, suggesting issues in identifying these items consistently.
Overall, the Xception model achieved commendable performance on several classes within the Fashion MNIST dataset, particularly excelling in the identification of trousers and ankle boots. However, the model’s performance varied across different categories, highlighting the need for further optimization to improve its accuracy and robustness across all classes.
4.4.4. Results for MobileNet
The experiments were conducted using a dataset containing infrared images of various wild animal species. The dataset was divided into training, validation, and testing sets to train and evaluate the MobileNet model. The performance of MobileNet is summarized using a confusion matrix (Table 14), which illustrates the classification results across different animal categories. Each row in the confusion matrix represents the ground truth labels, while each column represents the predicted labels.
The confusion matrix reveals that MobileNet achieves commendable performance in recognizing wild animals in infrared images, though there are some misclassifications (Table 15). The model correctly identifies a high percentage of each animal category but tends to confuse classes with similar infrared signatures more frequently.
The MobileNet neural network architecture demonstrates varying performance across different classes in this evaluation (Table 15). It shows strong precision for Deer and moderate precision for Bear and Fox, but relatively lower precision for Boar. Recall rates are generally high, particularly for Boar and Fox, indicating the model’s ability to correctly identify these classes from the dataset. The F1 scores reflect a balance between precision and recall, with the highest score achieved for Deer, indicating robust performance in distinguishing this class. These results suggest that MobileNet is effective in classifying infrared images of bears, boars, deer, and foxes, with particular strengths in detecting deer and foxes based on the provided evaluation metrics.
The results indicate that while MobileNet is effective for wild animal recognition, there is room for improvement, particularly in distinguishing between visually similar species. Despite these challenges, MobileNet’s lightweight architecture and efficient performance make it a viable option for real-time wildlife monitoring applications, especially in environments where computational resources are limited. The study highlights the potential of MobileNet in contributing to wildlife conservation efforts through automated image analysis. Although MobileNet might not achieve the highest accuracy compared to deeper and more complex models like DenseNet or ResNet, it offers a compelling balance between accuracy and efficiency, making it suitable for real-time applications.
MobileNet (Table 16) demonstrated exceptional performance in the Trouser (Class 2) and Ankle Boot (Class 10) categories, achieving very high precision and recall values. This indicates the model’s high effectiveness in accurately identifying these items. For classes such as Sandal (Class 6), Shirt (Class 7), and Bag (Class 9), the model showed moderate performance. These results suggest that while the model can reasonably identify these items, there is still variability in precision and recall that could be improved. The model struggled with classes like T-shirt/Top (Class 1) and Pullover (Class 3). Despite the high precision for T-shirt/Top, the recall is very low, indicating the model misses many true instances of this class. Similarly, Coat (Class 5) shows high recall but low precision, suggesting issues with false positives in identifying this item.
Overall, MobileNet achieved commendable performance on several classes within the Fashion MNIST dataset, particularly excelling in the identification of trousers and ankle boots. However, the model’s performance varied significantly across different categories, highlighting the need for further optimization to improve its accuracy and robustness across all classes.
4.4.5. Results for DenseNet
The performance of DenseNet is summarized using a confusion matrix (Table 17), which illustrates the classification results across different animal categories. Each row in the confusion matrix represents the ground truth labels, while each column represents the predicted labels. The confusion matrix indicates that DenseNet performs well in recognizing wild animals in infrared images, with high classification accuracy across most categories, although some misclassifications occur between classes with similar infrared patterns.
The DenseNet neural network architecture demonstrates strong precision in Class 1 (Bear), indicating accurate predictions for this class (Table 18). However, it shows lower precision in Class 4 (Fox), suggesting some misclassifications. The model performs well in terms of recall across all categories, with particularly high recall in Class 4 (Fox). The F1 scores reflect a balanced performance overall, with Class 1 (Bear) achieving the highest F1 score of 95.00%. The DenseNet architecture thus shows robust performance in this classification task, with notable accuracy, recall, and F1 scores across multiple categories. Adjustments and further analysis may be necessary to improve precision, especially in categories where it is lower.
DenseNet showed strong performance in the Trouser (Class 2) and Sandal (Class 6) categories, achieving high precision, recall, and F1 scores (Table 19). This indicates the model’s reliability in correctly identifying these items with minimal false positives and false negatives. For classes like T-shirt/Top (Class 1), Dress (Class 4), and Ankle Boot (Class 10), DenseNet displayed good but not exceptional performance. These results suggest a solid capability to correctly classify these items, though there is room for improvement. The model had lower performance in classes such as Pullover (Class 3), Shirt (Class 7), and Coat (Class 5). These lower scores indicate a higher rate of misclassification for these items, suggesting challenges in distinguishing these categories from others.
Overall, DenseNet demonstrated robust performance on several classes within the Fashion MNIST dataset, particularly excelling in the identification of trousers and sandals. However, the performance varied across different categories, highlighting areas for potential enhancement to improve the model’s accuracy and consistency across all classes.
In other words, DenseNet demonstrates a robust capability for wild animal recognition in infrared imagery, with an efficient and compact architecture suitable for deployment in resource-constrained environments. The model’s balanced performance across different categories highlights its potential for real-time wildlife monitoring and conservation applications. Future work could focus on further fine-tuning the model and expanding the dataset to improve its accuracy and generalization across a broader range of species.
5. Discussion and Conclusions
This paper provides a comprehensive analysis and comparison of various deep neural network architectures applied to the task of wildlife recognition using infrared (IR) images. The primary objective was to evaluate the performance of different neural networks, namely VGG16, VGG19, ResNet50, Xception, MobileNet, and DenseNet, on a dataset comprising infrared images of different animal species. The evaluation was based on key metrics such as precision, recall, and F1 score, and confusion matrices were utilized to gain deeper insights into the classification capabilities and errors of each model. Each of these models offers different strengths in terms of computational efficiency, accuracy, and feature extraction capabilities. For instance, MobileNet is known for its lightweight design suitable for mobile and embedded applications, whereas Xception emphasizes depthwise separable convolutions for improved performance. Each architecture was trained and tested to classify IR images into the predefined animal classes, allowing a comprehensive comparison of the models in terms of accuracy, computational efficiency, and generalization capabilities. The results showcased varying degrees of performance across the architectures, highlighting trade-offs between computational efficiency and classification accuracy. Comparative analysis revealed that while lightweight models like MobileNet offered computational advantages, they sometimes compromised on accuracy compared to more complex models such as VGG16 or Xception. This trade-off underscores the importance of selecting models based on specific application requirements, such as real-time monitoring versus high-precision classification.
VGG16 and VGG19: Both architectures showed competitive performance, with VGG16 generally outperforming VGG19 on the infrared dataset. VGG16 achieved higher precision and F1 scores in most categories, indicating that the additional depth of VGG19 did not translate into better performance on this relatively small dataset.
ResNet50: This model excelled in precision across several classes, particularly in distinguishing between animals with subtle differences in IR images. However, it showed variability in recall, indicating potential difficulties in consistently identifying all instances of certain classes.
Xception: The Xception model provided balanced performance with relatively high F1 scores. Its unique architecture, leveraging depthwise separable convolutions, proved effective in extracting intricate patterns from IR images.
MobileNet: MobileNet, designed for efficiency, showed strong results in terms of precision but faced challenges in recall for some classes. This suggests that while MobileNet is capable of accurate predictions, it may miss some instances due to its lightweight nature.
DenseNet: DenseNet achieved notable results with high precision and recall across most classes. Its densely connected layers facilitated effective feature reuse, which was particularly beneficial for the diverse and complex IR images in the dataset.
Misclassifications were observed across all models, with certain animal classes being more difficult to distinguish; classes with similar thermal signatures often led to higher confusion rates. The quality and diversity of the IR images in the dataset also played a crucial role in model performance. Variations in image resolution, angle, and environmental conditions (e.g., background temperature) affected the models’ ability to generalize. The findings indicate that deep neural networks hold significant promise for automated wildlife monitoring using infrared imaging. High-performing models like DenseNet and Xception can provide reliable identification of animal species, which is critical for ecological research and conservation efforts. The efficiency of MobileNet highlights its potential for deployment in resource-constrained environments, such as remote field stations or on-device processing in wildlife cameras.
The findings have significant implications for wildlife conservation and management practices. Accurate classification of animals based on IR imagery can enhance wildlife population assessments, habitat monitoring, and mitigation of human-wildlife conflicts. These applications are crucial for informed decision-making and policy formulation aimed at preserving biodiversity and ecosystems. The study’s success in achieving high classification accuracy for animal species using IR images signifies a promising future for integrating AI technologies into conservation efforts. However, challenges such as dataset size, model interpretability, and environmental variability remain pertinent areas for future research. Moving forward, further exploration into ensemble learning techniques, transfer learning from pre-trained models, and the integration of temporal data could enhance the robustness and scalability of deep learning models in wildlife recognition. These advancements hold the potential to address current limitations and broaden the application of AI in safeguarding wildlife populations and habitats globally.
In the case of wildlife recognition using infrared images, the ability to process large volumes of data quickly is essential. Pruned models can provide near-real-time performance for the detection of animals in infrared imagery, even in resource-limited settings. For example, infrared cameras deployed in remote areas could send images to local edge devices running pruned models to quickly identify species, without relying on cloud-based processing, which might introduce delays. In summary, pruning is an effective technique for improving the efficiency of deep neural networks used for wildlife recognition in infrared images. By reducing the model’s size and computational complexity, pruning enables faster inference speeds and makes it feasible to deploy models in environments with limited computational resources. The minimal trade-off in accuracy ensures that the model still performs well, making pruning a valuable tool for real-time, resource-constrained applications in wildlife conservation and monitoring. Likewise, quantization is a powerful technique for enhancing the efficiency of deep neural networks used in wildlife recognition with infrared images. By reducing model size and computational requirements, quantization enables faster inference and makes it feasible to deploy complex deep learning models on edge devices with limited resources. In the context of wildlife monitoring, where real-time, resource-efficient detection is critical, quantization provides a practical solution for deploying models in the field. The ability to preserve accuracy while significantly improving efficiency makes quantization an essential tool for advancing wildlife conservation efforts and ecological studies in resource-constrained environments.
While our study focuses on single-channel infrared images, future work could explore multimodal approaches that fuse infrared with visible light or depth information. This approach could offer improved performance, particularly in scenarios where one modality alone is insufficient. Additionally, the integration of pruning, quantization, and knowledge distillation techniques could be explored to enhance the efficiency and speed of the models, making them more suitable for real-time wildlife monitoring applications. Pruning strikes a balance between reducing computational complexity and maintaining high performance. Quantization, on the other hand, reduces the number of bits used to represent each parameter, which directly lowers computational complexity. For wildlife recognition, this translates to quicker detection and classification of animals in infrared imagery, enabling near real-time performance on low-power devices.
The field of wildlife recognition with DNNs and IR imaging is rapidly progressing. However, several challenges remain, including the need for larger, diverse IR datasets, especially those capturing various species across different habitats. Our work contributes to addressing these challenges by demonstrating the efficacy of DNN models in improving detection accuracy, particularly in low-contrast and varied IR environments. Our approach differs from existing studies by focusing on DNN architectures and comprehensive data augmentation, which collectively enhance the model’s ability to generalize across various wildlife habitats and conditions. Looking forward, future research could explore integrating multispectral data (e.g., combining IR and visible light) and leveraging self-supervised learning to further reduce data dependency and improve model performance.