1 Introduction

Object detection is one of the fastest growing and most widely used techniques in artificial intelligence. It plays an essential role in mobile robotics, visual SLAM [17], autonomous driving [18], and industrial inspection [15]. In recent years, with the steady maturation of deep learning theory and the growing number of application scenarios, deep learning-based object detection has become the mainstream approach in this field. However, deep learning models are often designed to pursue higher accuracy, which typically comes at the cost of larger model size and slower inference; methods such as Faster RCNN [22] are mainly deployed on servers, where limited platform resources are not a concern. In contrast, the limited computational and storage resources of mobile devices, such as robots, severely constrain the deployment of deep learning-based object detection on such platforms. The YOLO [10,11,12, 26] family of methods was proposed in this context; in particular, the lightweight variants from YOLOv5 [10] to YOLOv8 [11] are widely used on mobile platforms for their low latency and comparatively high detection accuracy. However, in some scenarios their detection accuracy is still insufficient to accomplish specific tasks. Therefore, improving detection accuracy while keeping the parameter count and model complexity unchanged, or increasing them only slightly, is an important development direction for object detection.

Considering the challenges highlighted, we have developed \(\eta\)-RepYOLO, an augmented object detection methodology. This method utilizes \(\eta\)-RepConv as the core network unit to construct the model, markedly enhancing object detection accuracy during the inference phase without adding computational load or complexity. Initially, \(\eta\)-RepConv is employed to formulate the foundational feature extraction module, \(\eta\)-RepC2f. This module aims to maximize the model’s feature fusion capabilities. \(\eta\)-RepConv, in conjunction with \(\eta\)-RepC2f, is stacked to form the backbone network \(\eta\)-EfficientRep. This configuration not only procures a more refined network model during the training phase, but also yields a streamlined model upon reparameterization during the inference phase, ensuring model accuracy is preserved while achieving a more lightweight network structure. The SiLU activation function is incorporated throughout the network, enhancing the non-linearity of the activation functions without increasing computational demands. Subsequently, we introduce modified versions of the \(\eta\)-RepPANet and \(\eta\)-RepAFPN as the model’s detection neck modules. These alterations are intended to bolster neck performance while simultaneously reducing computational intricacy. The \(\eta\)-RepC2f module is further applied to facilitate feature fusion within these neck modules. For the detection head, \(\eta\)-RepConv is leveraged to refine the object detection capabilities. This head features decoupled classification and regression branches, with the regression branch adopting the integral form presented in the distributional focal loss. This enhancement in the detection head improves performance without complicating the computation process. Our primary contributions are encapsulated as follows:

  • Our team has engineered a novel convolutional structure, \(\eta\)-RepConv, noted for its elevated accuracy and superior real-time execution. Building upon this convolution, we devised the \(\eta\)-RepC2f module. The seamless fusion of the \(\eta\)-RepC2f module with \(\eta\)-RepConv has led to the creation of a robust backbone network, which we have designated as \(\eta\)-EfficientRep.

  • Our design includes two enhanced neck modules, \(\eta\)-RepPANet and \(\eta\)-RepAFPN. Both modules integrate the \(\eta\)-RepC2f module for feature fusion, significantly augmenting the efficacy of the detection method.

  • We introduce an optimized detection head, \(\eta\)-RepHead, which utilizes the network unit \(\eta\)-RepConv to refine the decoupled head structure.

  • Our empirical investigation employs the Pascal VOC 07+12 and MS COCO 2017 datasets, enabling an extensive comparative analysis with current methods. The findings indicate that our approach yields competitive outcomes, striking an optimal balance between accuracy and processing speed.

The structure of the paper is as follows: Sect. 2 provides a comprehensive review and comparative analysis of prevailing real-time object detection algorithms and reparameterization techniques that are pertinent for platforms with limited resources. In Sect. 3, we delve into the methodology proposed, elaborating on the constituent parts of the system—namely, the backbone network, the neck network, and the detection head. Section 4 details the training and inference protocols of the model and presents a series of ablation studies, alongside comparative evaluations conducted on various datasets to affirm the efficacy of the proposed method. Finally, Sect. 5 offers concluding remarks on the study.

2 Related work

2.1 Real-time object detection

Object detection technology has branched into two distinct paths to cater to varying application scenarios: high-precision and real-time object detection, each prioritizing different performance metrics. High-precision object detection typically employs a two-stage process, exemplified by methods such as Faster RCNN [22] and FPN [13]. These techniques incorporate an intermediary step of generating region proposals before finalizing the bounding box, allowing for enhanced accuracy in detection. However, this additional step often results in a trade-off with real-time performance, which can be less than optimal.

Real-time object detection pivots on a one-stage detection strategy, with the YOLO [1, 10,11,12, 26] family, SSD [14] variants, and similar methods leading the charge. The YOLO series, in particular, has garnered widespread adoption in industrial contexts owing to its straightforward architecture, excellent real-time capabilities, and commendable accuracy. The architecture of YOLO algorithms is compartmentalized into three segments: the backbone network, the neck, and the detection head. To achieve avant-garde real-time object detection, scholars concentrate on augmenting: (1) the backbone network for swifter and more potent architectural efficiency, (2) an advanced neck network capable of higher-quality feature map fusion across diverse layers, and (3) an optimized detection head with an effective label assignment method to enhance detection accuracy. Moreover, (4) a more robust loss function is critical—although our current research will not delve further into loss functions. Instead, our investigation focuses on advancements in areas (1), (2), and (3), aiming to maximize object detection accuracy without exacerbating latency.

Fig. 1
figure 1

Schematic illustration of our proposed method. The graph consists of three parts: the backbone network, the neck network, and the head network. A Backbone: we use \(\eta\)-EfficientRep as the backbone network, and extract the feature maps at three scales from the backbone network. B Neck: we designed two neck networks, \(\eta\)-RepPANet and \(\eta\)-RepAFPN. They mainly fuse the feature maps of the three scales to get the enhanced feature maps. C Head: we use an improved decoupled head as the object detection head and replace the \(3 \times 3\) conv using \(\eta\)-RepConv to get a better performance detection head

2.2 Model reparameterization

Nowadays, numerous model lightweighting methods exist, including pruning, distillation, quantization, and model reparameterization. Among them, model reparameterization refers to fusing multiple computational modules into a single module at the inference stage, and can be regarded as a model integration technique. It has attracted renewed attention from many scholars since RepVGG [7] was proposed and widely adopted. The method splits a module into multiple branches of the same or different structures during the training phase, thus increasing the non-linearity of the model; in the inference stage, the branches are fused into one module to improve real-time performance. DBB [6] proposes to use branches with different structures, such as convolutional sequences, multi-scale convolution, and average pooling, to improve the non-linearity of the training model. Subsequently, to attenuate the sharp drop in accuracy after quantizing a reparameterized model, scholars have proposed new distillation methods. RepMLP [5] constructs convolutional layers internally during training and merges them into FC layers for inference to obtain stronger image recognition capability. RMNet [16] removes the residual structure of ResNet [9] by reserving and merging its ResBlocks into an equivalent plain network. MobileOne [25] expands the depth of the \(3 \times 3\) conv based on RepVGG to obtain greater feature extraction capability. Subsequent works such as RepOptVGG [4], DyRep [24], EfficientRep [27], and QARepVGG [2] attenuate the effect of quantization on the model after structural reparameterization. In this paper, we improve the object detection model based on the above methods by abstracting the base unit of RepVGG and introducing a hyperparameter that controls the width and depth of the multi-branch module. By expanding the width and depth of the original RepVGG block, we obtain a unit with better performance than RepVGG-based modules, use it to upgrade the whole object detection network, and ultimately improve detection accuracy as much as possible without increasing latency.

3 Methods

3.1 Overall

In \(\eta\)-RepYOLO, based on the principle of hardware-friendly network design, we propose a base network module with scalable depth and width that can be reparameterized to a standard \(3 \times 3\) conv and is therefore well supported by various hardware. We construct our detection model using the following optimization strategies based on the \(\eta\)-RepConv network unit. First, the backbone network is constructed from the \(\eta\)-RepConv network unit and the \(\eta\)-RepC2f module. Then, the AFPN [28] module is improved with the \(\eta\)-RepConv network unit and the \(\eta\)-RepC2f module to obtain a neck module with better detection accuracy. Finally, \(\eta\)-RepConv is used to improve the detection head, together with the TAL label assignment strategy, to ensure the stability of the training process and obtain excellent performance. The overall architecture of \(\eta\)-RepYOLO is shown in Fig. 1.

3.2 \(\eta\)-RepConv

Studies have indicated that multi-branch networks often outperform their single-branch counterparts in classification tasks, albeit at the cost of diminished parallelism and increased inference time. In contrast, conventional single-branch networks like VGG benefit from greater parallelism and a smaller memory footprint, which enhances inference speed. The innovative RepVGG model employs structural reparameterization to separate the multi-branch architecture used during training from the streamlined structure used during inference, optimizing the speed–accuracy balance. However, RepVGG’s reliance on simple parallel \(3\times 3\) and \(1\times 1\) convolutions only taps into a fraction of the potential of enhanced individual convolutions. While DBB attempts to improve upon this by integrating branches of varying sizes and complexities, its use of average pooling inadvertently discards valuable information, capping the performance gains that the additional branches might offer. Nonetheless, the efficacy of DBB’s diverse branch dimensions has been successfully demonstrated in VanillaNet.

Fig. 2
figure 2

Structural units underlying the proposed method. A \(\eta\)-RepConv structure for \(\eta\)=2. When \(\eta\)=2, \(\eta\)-RepConv consists of a \(1\times 1\) conv, a \(3\times 3\) conv, a BN layer, and two sequences of convolutions in parallel, which can be fused into a single \(3\times 3\) conv in the inference phase. B \(\eta\)-RepConv structure for \(\eta\) = 3. When \(\eta\) = 3, \(\eta\)-RepConv consists of a \(1\times 1\) conv, a \(3\times 3\) conv, a BN layer, and four convolution sequences in parallel, which can also be fused into a single \(3\times 3\) conv in the inference phase. C The structure of \(\eta\)-RepC2f. The \(\eta\)-RepC2f module is designed following the idea of the C2f module, allowing the proposed method to retain more spatial and semantic information while remaining lightweight in the inference phase and improving object detection performance

Drawing inspiration from prior studies, we have conceptualized an efficient, reparameterizable base network module. Taking cues from both RepVGG and DBB, we observe that RepVGG boosts the model’s non-linearity by widening its branches, whereas DBB deepens individual branches to achieve a similar effect. Our proposition is a multi-branch module with adjustable width and depth; for simplicity, both are governed by a single hyperparameter \(\eta\), as illustrated in Fig. 2A, B. We define a single unit as a combination of a \(3\times 3\) conv and a \(1\times 1\) conv. At \(\eta\)=1, the model comprises solely these units. As \(\eta\) increases to 2, two additional branches are introduced: a sequential \(1\times 1\) & \(3\times 3\) conv and a sequential \(1\times 1\) & \(1\times 1\) conv, so that the convolutions expand in both depth and width. During the inference phase, the trained \(\eta\)-RepConv is streamlined into a single \(3\times 3\) conv. This transformation requires three steps: fusion of conv and BN, fusion of sequential convolutions, and convolutional summation of the different branches.
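The following is a hedged PyTorch sketch of the training-time \(\eta\)-RepConv branch layout for \(\eta\)=2, reconstructed from Fig. 2A: a \(3\times 3\) conv, a \(1\times 1\) conv, a BN-only identity branch, plus the two sequences (\(1\times 1\)→\(3\times 3\) and \(1\times 1\)→\(1\times 1\)), summed and passed through SiLU. The class name, the intermediate channel widths inside the sequences, and the placement of the activation are our assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

def conv_bn(c_in, c_out, k, s=1):
    # conv followed by BN; bias is omitted because BN absorbs it (Sect. 3.2)
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
                         nn.BatchNorm2d(c_out))

class EtaRepConv(nn.Module):
    """Hypothetical training-time structure of eta-RepConv with eta = 2."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.branch_3x3 = conv_bn(c_in, c_out, 3, stride)
        self.branch_1x1 = conv_bn(c_in, c_out, 1, stride)
        # BN-only identity branch, only possible when input/output shapes match
        self.branch_id = nn.BatchNorm2d(c_in) if c_in == c_out and stride == 1 else None
        # the two extra sequences introduced at eta = 2
        self.branch_1x1_3x3 = nn.Sequential(conv_bn(c_in, c_in, 1), conv_bn(c_in, c_out, 3, stride))
        self.branch_1x1_1x1 = nn.Sequential(conv_bn(c_in, c_in, 1), conv_bn(c_in, c_out, 1, stride))
        self.act = nn.SiLU()

    def forward(self, x):
        y = self.branch_3x3(x) + self.branch_1x1(x) \
            + self.branch_1x1_3x3(x) + self.branch_1x1_1x1(x)
        if self.branch_id is not None:
            y = y + self.branch_id(x)
        return self.act(y)
```

At inference time all five branches collapse into one \(3\times 3\) conv via the three fusion steps described next.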

\({\textbf {Fusion of conv and BN.}}\) The convolution operation is a linear function that applies a filter to the input data, while the BN layer normalizes the output of the previous layer to improve the speed, performance, and stability of the learning process. When a BN layer immediately follows a convolution layer, the bias term of the convolution can be disregarded because the subsequent normalization negates its effect. Given the convolution kernel parameter W, the BN layer’s mean \(\mu\) and standard deviation \(\sigma\), and the learned scaling factor \(\gamma\) and bias term \(\beta\), the convolution and the BN layer can be expressed as:

$$\begin{aligned} \left\{ \begin{aligned}&Conv(x)=W(x) \\&BN(x)=\gamma \frac{(x-\mu )}{\sigma }+\beta . \\ \end{aligned} \right. \end{aligned}$$
(1)

When fusing a BN layer with its preceding convolutional layer, the normalization parameters (\(\gamma\), \(\beta\), \(\mu\), \(\sigma\)) are assimilated into the convolutional layer’s weights and biases. This fusion simplifies the network by reducing the number of operations required during inference. The transformation can be mathematically represented by adjusting the convolutional filters and biases as follows:

$$\begin{aligned} \begin{aligned} BN(Conv(x))&=\frac{\gamma }{\sigma }W(x)+\left( \frac{\gamma (-\mu )}{\sigma }+\beta \right) \\&={W}'(x)+{\beta }' . \end{aligned} \end{aligned}$$
(2)

The updated weights \(W^{'}\) and biases \(\beta ^{'}\) after fusion are calculated as:

$$\begin{aligned} {W}'=\frac{\gamma }{\sigma }W, \qquad {\beta }'=-\frac{\mu \gamma }{\sigma }+\beta . \end{aligned}$$
(3)

This fused convolutional layer now effectively incorporates the normalization directly into its operation, streamlining the network for more efficient inference without the separate normalization step.
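A minimal sketch of Eqs. (1)–(3) in PyTorch follows: folding a BatchNorm layer into the preceding convolution. The helper name `fuse_conv_bn` is ours, not from the paper's code.

```python
import torch

def fuse_conv_bn(conv: torch.nn.Conv2d, bn: torch.nn.BatchNorm2d):
    """Return (W', beta') such that W' * x + beta' == BN(Conv(x))."""
    std = torch.sqrt(bn.running_var + bn.eps)            # sigma
    scale = bn.weight / std                              # gamma / sigma
    # W' = (gamma / sigma) * W : scale each output channel's kernel (Eq. 3)
    fused_weight = conv.weight * scale.reshape(-1, 1, 1, 1)
    # beta' = beta - mu * gamma / sigma (any conv bias is folded in as well)
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused_bias = bn.bias + (conv_bias - bn.running_mean) * scale
    return fused_weight, fused_bias
```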

\({\textbf {Fusion of sequential convolutions.}}\) Incorporating sequences of convolutions is a key design feature of \(\eta\)-RepConv: a \(1\times 1\) conv followed by a \(3\times 3\) conv, as well as a sequence of two \(1\times 1\) convs. After each convolution has been fused with its BN layer, such a sequence can be combined into a single convolution for greater efficiency. The idea is that a \(1\times 1\) conv alters the channel dimension, by expansion or reduction, while the following \(K\times K\) conv determines the receptive field; once both have been fused with their BN layers, they can be merged with each other. The merging combines the weight matrices and bias terms of the individual convolutions into a single set of parameters for the resulting \(K\times K\) conv, which then performs both the channel modification and the spatial feature extraction in one operation, significantly reducing computational complexity and improving inference speed. Consider a set of sequential convolutions, \(Conv^{(1)}\) followed by \(Conv^{(2)}\):

$$\begin{aligned} \left\{ \begin{aligned}&Con{{v}^{(1)}}(x)={{W}^{(1)}}(x)+{{\beta }^{(1)}} \\&Con{{v}^{(2)}}(x)={{W}^{(2)}}(x)+{{\beta }^{(2)}} \\ \end{aligned} \right. , \end{aligned}$$
(4)

where W and \(\beta\) denote the kernel and bias parameters of the convolution, respectively. In the context of merging a \(1\times 1\) conv (\(Conv^{(1)}\)) with a \(K\times K\) convolution (\(Conv^{(2)}\)) for the inference stage:

$$\begin{aligned} \begin{aligned} Con{v}'(x)&=Con{{v}^{(2)}}\left( Con{{v}^{(1)}}(x)\right) \\&={{W}^{(2)}}\left( {{W}^{(1)}}(x)+P\left( {{\beta }^{(1)}}\right) \right) +P({{\beta }^{(2)}}) \\&={{W}^{(2)}}\left( {{W}^{(1)}}(x)\right) +{{W}^{(2)}}\left( P\left( {{\beta }^{(1)}}\right) \right) +P\left( {{\beta }^{(2)}}\right) \\&={W}'(x)+{\beta }', \end{aligned} \end{aligned}$$
(5)

where \(W^{'}\) and \(\beta ^{'}\) refer to the reorganized kernel parameters and bias parameters, respectively. The merging process involves a linear reorganization of the kernel parameters from the \(K\times K\) convolution. Begin by transposing the dimensions 0 and 1 of the kernel parameters of the \(1\times 1\) convolution. This operation rearranges the order of the filters and channels:

$$\begin{aligned} Trans\left( {{W}^{(1)}}\right) :{{W}^{(1)}}(D,C,1,1)\rightarrow {{W}^{(1)}}(C,D,1,1). \end{aligned}$$
(6)

Next, linearly reorganize the kernel parameters of the \(K\times K\) convolution using the transposed \(Trans({{W}^{(1)}})\) parameters from the \(1\times 1\) convolution. Denoting the kernel parameters of the \(K\times K\) convolution as \({W}^{(2)}\), the reorganization step can be expressed by an operation that incorporates \(Trans({{W}^{(1)}})\) into \({W}'(x)\):

$$\begin{aligned} {W}'(x)=Trans\left( {{W}^{(1)}}\right) \left( {{W}^{(2)}}(x)\right) . \end{aligned}$$
(7)

In practice, this is a convolution (or equivalent matrix multiplication) that combines the parameters of the \(1\times 1\) convolution with those of the \(K\times K\) convolution. The term \({{W}^{(2)}}(P({{\beta }^{(1)}}))\) in Eq. 5 becomes part of the single convolution bias in the inference phase: it is the convolution of a constant matrix, so its output is also a constant matrix, proportional for each channel to the sum of all kernel elements multiplied by \({{\beta }^{(1)}}\).

$$\begin{aligned} {\beta }'={{W}^{(2)}}\left( P\left( {{\beta }^{(1)}}\right) \right) +P\left( {{\beta }^{(2)}}\right) , \end{aligned}$$
(8)

where \(P(*)\) denotes padding of the bias with the value of the bias for each channel instead of 0; that is, the per-channel bias is expanded into a constant matrix.
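A minimal sketch of Eqs. (4)–(8) follows: collapsing a (BN-fused) \(1\times 1\) conv followed by a (BN-fused) \(K\times K\) conv into a single \(K\times K\) conv. It assumes groups = 1 and that the intermediate map is padded with the branch bias rather than zeros, as described by \(P(*)\) above; the function name is ours.

```python
import torch
import torch.nn.functional as F

def fuse_1x1_kxk(w1, b1, w2, b2):
    """w1: (D, C, 1, 1), b1: (D,), w2: (E, D, K, K), b2: (E,)."""
    # Eqs. (6)-(7): transpose the 1x1 kernel and convolve it into the KxK kernel.
    w_merged = F.conv2d(w2, w1.permute(1, 0, 2, 3))            # (E, C, K, K)
    # Eq. (8): the constant bias map b1, passed through w2, contributes the
    # per-output-channel sum of (w2 * b1) over all kernel elements.
    b_merged = (w2 * b1.reshape(1, -1, 1, 1)).sum(dim=(1, 2, 3)) + b2
    return w_merged, b_merged
```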

Fig. 3
figure 3

Architecture diagram of the backbone network \(\eta\)-EfficientRep

\({\textbf {Convolutional summation of branches.}}\) After performing the above two steps, \(\eta\)-RepConv is left with only parallel \(1 \times 1\) and \(3 \times 3\) convs. We then expand each \(1\times 1\) conv to a \(3\times 3\) conv and sum the \(3\times 3\) convs of all branches as follows:

$$\begin{aligned} \begin{aligned} Con{v}'(x)&=Con{{v}^{(1)}}(x)+Con{{v}^{(2)}}(x) \\&=\left( {{W}^{(1)}}+P\left( {{W}^{(2)}}\right) \right) (x)+{{\beta }^{(1)}}+{{\beta }^{(2)}} \\&={W}'(x)+{\beta }' , \end{aligned} \end{aligned}$$
(9)

where \(P(*)\) denotes padding of the kernel parameters of the \(1\times 1\) conv with a padding value of zero.
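A minimal sketch of Eq. (9) follows: zero-padding every \(1\times 1\) branch kernel to \(3\times 3\) and summing all branch kernels and biases into one \(3\times 3\) conv. Function names are ours.

```python
import torch
import torch.nn.functional as F

def pad_1x1_to_3x3(w1x1):
    # (E, C, 1, 1) -> (E, C, 3, 3) with the original value at the centre, zeros elsewhere
    return F.pad(w1x1, [1, 1, 1, 1])

def sum_branches(branch_params):
    """branch_params: list of (kernel, bias) pairs, kernels either 1x1 or 3x3."""
    kernels = [pad_1x1_to_3x3(w) if w.shape[-1] == 1 else w for w, _ in branch_params]
    biases = [b for _, b in branch_params]
    return sum(kernels), sum(biases)       # Eq. (9): W' and beta' of the single 3x3 conv
```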

Fig. 4
figure 4

Neck network structure for object detection methods. A The structure of \(\eta\)-RepPANet. This module is an improvement of PANet. Using \(\eta\)-RepC2f instead of the original C2f for feature fusion can better fuse the feature map information without increasing the number of parameters. B The structure of \(\eta\)-RepAFPN. This module is an improvement of AFPN. Reconstructing the \(\eta\)-ASFFRep feature fusion module with \(\eta\)-RepC2f can better fuse the feature map information. Also, the number of parameters of this module is less than that of \(\eta\)-RepPANet. C The structure of \(\eta\)-ASFFRep for fusing two feature maps. \(\eta\)-RepAFPN module for fusing two feature maps with different scales. D The structure of \(\eta\)-ASFFRep for fusing three feature maps. \(\eta\)-RepAFPN module for fusing three feature maps with different scales

3.3 Architecture

\({\textbf {Backbone.}}\) The architecture of the backbone network is a critical factor affecting the effectiveness and efficiency of the detection model. The base module \(\eta\)-RepC2f, inspired by the concept of C2f in YOLOv8, is designed to effectively integrate low-level and high-level features to enhance the performance of object detection tasks, as illustrated in Fig. 2C. This module leverages the \(\eta\)-RepConv design to refine feature maps during the inference stage, thereby improving detection precision without inflating the model’s parameters or computational demands. To construct an efficient and reparameterizable backbone network, named \(\eta\)-EfficientRep, the base module \(\eta\)-RepC2f is utilized in conjunction with the network unit \(\eta\)-RepConv, as shown in Fig. 3. During the inference stage, the network units throughout the entire backbone are reparameterized to merge each network unit into a single, consolidated \(3 \times 3\) convolution operation. This transformation to a single \(3 \times 3\) convolution is particularly beneficial for deployment on mobile platforms, as it harnesses the full potential of hardware acceleration available on GPUs and CPUs. The \(3 \times 3\) convolution is highly optimized for these devices, leading to faster computation times and more efficient use of resources. The end result is a backbone network that not only provides high accuracy but also meets the practical constraints of real-world applications, especially those that require operation on devices with limited computational power.

\({\textbf {Neck network.}}\) The development of object detection models has consistently emphasized the significance of multi-scale feature fusion within the model’s architecture. Efforts continue to refine the feature fusion structure of the neck by employing the network elements and base modules previously outlined. The FPN structure, as suggested in EfficientDet, accomplishes multi-scale feature fusion through bidirectional cross-scale connections and a method of weighted feature fusion. These connections allow for the fusion of relevant feature maps with either upsampled high-resolution or downsampled low-resolution maps, enhancing the feature extraction process. However, the conventional bidirectional cross-scale methods can sometimes lead to the loss or degradation of feature information, diminishing the effectiveness of fusion across non-adjacent layers. To address this, we introduce a modified \(\eta\)-RepPANet, shown in Fig. 4A, which utilizes the \(\eta\)-RepC2f module to achieve more efficient fusion of feature maps without increasing the model’s parameter count. In contrast, other methodologies that attempt to fuse feature maps across all layers may enrich the contextual semantics of the feature maps, yet they also significantly increase the model’s computational complexity. The challenge is compounded by the potential semantic discrepancy between non-adjacent layers, which could lead to incorrect fusion outcomes.

To circumvent these issues, scholars have introduced the concept of using an AFPN to fuse feature maps at different levels, starting with the fusion of two adjacent bottom-level features and progressively integrating higher-level features to minimize any significant semantic gaps. Building upon this concept, we propose the \(\eta\)-RepAFPN as the neck for the object detection network, as depicted in Fig. 4B. To further boost the performance of feature fusion, we employ the \(\eta\)-RepC2f to construct the \(\eta\)-ASFFRep module for feature fusion, as shown in Fig. 4C, D. This module utilizes adaptive spatial operations to address inconsistencies, while the network unit \(\eta\)-RepConv is reparameterized as a \(3 \times 3\) conv network to strike an optimal balance between accuracy and real-time performance during the inference stage. Our detection technique necessitates the fusion of three distinct feature maps. To this end, the neck module incorporates both two-map and three-map fusion structures, each requiring upsampling or downsampling of the original feature maps. Our approach employs bilinear interpolation for upsampling and convolutional downsampling to maintain the integrity of the feature maps during the fusion process.
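A hedged sketch of adaptively fusing two feature maps of different scales, in the spirit of the \(\eta\)-ASFFRep two-map fusion of Fig. 4C, is given below. Per-pixel fusion weights are predicted by \(1\times 1\) convs and normalised with softmax over the two inputs, and the lower-resolution map is bilinearly upsampled first, as described above. The class name, the weight-prediction details, and the use of a plain \(3\times 3\) conv in place of \(\eta\)-RepC2f are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleAdaptiveFusion(nn.Module):
    """Hypothetical adaptive spatial fusion of two feature maps (same channel count)."""
    def __init__(self, channels):
        super().__init__()
        self.weight_a = nn.Conv2d(channels, 1, 1)   # weight map for the high-res input
        self.weight_b = nn.Conv2d(channels, 1, 1)   # weight map for the upsampled input
        # In the paper the fused map would pass through eta-RepC2f;
        # a plain 3x3 conv stands in for it here.
        self.post = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat_high_res, feat_low_res):
        # Bilinear upsampling brings the low-resolution map to the same spatial size.
        up = F.interpolate(feat_low_res, size=feat_high_res.shape[-2:],
                           mode="bilinear", align_corners=False)
        # Softmax across the two inputs gives per-pixel fusion weights that sum to one.
        w = torch.softmax(torch.cat([self.weight_a(feat_high_res),
                                     self.weight_b(up)], dim=1), dim=1)
        fused = feat_high_res * w[:, 0:1] + up * w[:, 1:2]
        return self.post(fused)
```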

\({\textbf {Head.}}\) The anchor-based detection head represents a shared-parameter approach where both the classification and localization branches leverage the same set of parameters. This is in stark contrast to the detection heads found in FCOS [23] and YOLOX [8], which opt for a decoupled architecture. This separation of branches allows for the introduction of two additional \(3 \times 3\) conv layers, a design choice aimed at enhancing the model’s overall performance. On the other hand, YOLOv6 adopts a distinct strategy within its decoupled header, implementing a hybrid channel approach. This method strikes a balance by reducing the number of \(3 \times 3\) conv layers to a single layer. The rationale behind this design decision is not just to improve performance but to do so while also simplifying the network architecture, potentially leading to gains in both efficiency and speed without compromising the integrity of the detection capabilities.

Within the architecture of \(\eta\)-RepYOLO, we integrate the novel \(\eta\)-RepConv in place of traditional \(3 \times 3\) convolutional layers, ensuring an enhancement in detection performance, as visually represented in Fig. 5. This substitution is deliberately made to avoid an increase in the number of parameters and computational burden during the inference phase. The detection mechanism relies on two separate heads within the branches to carry out object localization and classification tasks. To further augment the robustness of the model during training, we employ TAL as the default strategy for label assignment. TAL is designed to enhance the stability of the training phase, ultimately contributing to superior performance of the model post-training. The strategic choice to adopt TAL reflects our commitment to achieving not only high accuracy, but also to ensuring the reliability and consistency of model training.
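A hedged sketch of the modified decoupled head (Fig. 5) follows: separate classification and regression branches whose \(3 \times 3\) convs are replaced by \(\eta\)-RepConv, with the regression branch predicting the DFL-style distribution (reg_max bins per box side), as described in Sect. 1. `EtaRepConv` refers to the sketch in Sect. 3.2; the branch depths, channel sizes, and default reg_max are our assumptions.

```python
import torch.nn as nn

class EtaRepHead(nn.Module):
    """Hypothetical per-scale decoupled head with eta-RepConv feature processing."""
    def __init__(self, in_ch, num_classes, reg_max=16):
        super().__init__()
        self.cls_branch = nn.Sequential(EtaRepConv(in_ch, in_ch), EtaRepConv(in_ch, in_ch),
                                        nn.Conv2d(in_ch, num_classes, 1))
        # 4 * reg_max logits encode the box-side distributions used by the DFL integral form
        self.reg_branch = nn.Sequential(EtaRepConv(in_ch, in_ch), EtaRepConv(in_ch, in_ch),
                                        nn.Conv2d(in_ch, 4 * reg_max, 1))

    def forward(self, x):
        return self.cls_branch(x), self.reg_branch(x)
```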

Fig. 5
figure 5

Modified decoupled head. This module uses \(\eta\)-RepConv instead of the standard \(3 \times 3\) conv for feature processing, improving object detection performance without increasing the number of parameters

4 Experimental and analysis

4.1 Training details

\({\textbf {Datasets.}}\) We selected ImageNet-1k, a prominent image classification dataset, as the foundation for pre-training our backbone network. This dataset comprises a substantial collection of 1.28 million training images and a separate set of 50,000 images for validation purposes, with the entire dataset spanning 1,000 distinct classes. Once we secured a pre-trained model from this initial phase, we proceeded to fine-tune and validate our object detection model using two other authoritative datasets—PASCAL VOC07+12 and MS COCO2017. For our ablation studies, we utilized PASCAL VOC07+12, which features a diverse array of 20 object classes, while the comprehensive comparison experiments were conducted on MS COCO2017, which encompasses a broader range of 80 object classes. This multi-stage training approach, starting from a broad image classification task to more specialized object detection tasks, ensures a robust and well-generalized model that is capable of recognizing a wide variety of object classes with precision.

\({\textbf {Details.}}\) During the image classification experiments, we adhered to a set of default hyperparameters to ensure consistency and reproducibility. The images were resized to a uniform input size of \(256\times 256\) pixels. Training was conducted over 300 epochs with batches of 128 images each. The learning rate was initially set to 0.1 and was modulated using a cosine annealing schedule for gradual reduction over time, with the first 5 epochs designated as the warm-up period. The SGD optimizer was employed for network updates, configured with a weight decay of 1e–4 and momentum of 0.9. In the pre-processing phase, data augmentation was applied through MixUp enhancement and label smoothing regularization to promote model generalization. The probability for MixUp application was set at 0.2, while label smoothing regularization was applied with a probability of 0.1. The training of the backbone network was facilitated by NVIDIA 1080Ti GPUs, providing the computational power necessary to process the extensive dataset and complex model architecture efficiently.

In our object detection experiment, we configured the training image dimensions to \(640\times 640\) and set the batch size to 8. The initial learning rate was set to 0.01 with cosine decay applied over time. We selected SGD as the optimization algorithm, with a weight decay of 5e–4. The training regimen spanned 300 epochs, during which we froze the backbone network for the initial 50 epochs. For data preprocessing, Mosaic augmentation was utilized, with a 50% likelihood of being applied to any given data instance. To align with the actual data distribution, Mosaic was employed during the first 70% of epochs and deactivated for the final 30%.
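For reference, the detection-training settings listed above are collected into a single hedged configuration sketch; the key names are ours, the values are those stated in the text.

```python
detection_train_cfg = {
    "img_size": 640,               # training image dimensions
    "batch_size": 8,
    "optimizer": "SGD",
    "lr0": 0.01,                   # initial learning rate, cosine decay
    "weight_decay": 5e-4,
    "epochs": 300,
    "freeze_backbone_epochs": 50,  # backbone frozen for the first 50 epochs
    "mosaic_prob": 0.5,            # Mosaic applied with 50% likelihood
    "mosaic_close_ratio": 0.3,     # Mosaic disabled for the final 30% of epochs
}
```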

4.2 Evaluation indicators

In our study, the efficacy of the proposed method is assessed across four metrics: model parameters, GFLOPs, FPS, and mAP. The model parameter count reflects the model’s size; a smaller model is easier to deploy on platforms with limited resources. GFLOPs measure the number of floating-point operations (in billions) required for a single forward pass and serve as an indicator of the model’s complexity. This metric is closely tied to the computational demands placed on the platform’s performance capabilities. The relationship is captured in the subsequent equation:

$$\begin{aligned} FLOPs = \left( 2 \times C_\textrm{in} \times K^2 \times H \times W \times C_\textrm{out}\right) . \end{aligned}$$
(10)

In the equation, \(C_\textrm{in}\) represents the number of input channels, \(C_\textrm{out}\) stands for the number of output channels of the convolutional layer, and K is the size of the convolutional kernel. H and W correspond to the height and width of the output feature map, respectively. Given that the entire model lacks a fully connected layer, the computational complexity associated with such a layer is excluded from the FLOPs calculation.
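As a worked instance of Eq. (10), the snippet below computes the FLOPs of a single convolutional layer; the helper name is ours.

```python
def conv_flops(c_in, c_out, k, h, w):
    """FLOPs of one KxK conv producing an HxW output map (Eq. 10)."""
    return 2 * c_in * k * k * h * w * c_out

# Example: a 3x3 conv mapping 64 -> 128 channels on an 80x80 output map
# gives 2 * 64 * 9 * 80 * 80 * 128 = 943,718,400 FLOPs, i.e. about 0.94 GFLOPs.
print(conv_flops(64, 128, 3, 80, 80) / 1e9, "GFLOPs")
```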

FPS, or frames per second, signifies the quantity of images the model can process for detection within 1 s. A higher FPS value indicates superior real-time performance of the model. The FPS can be mathematically expressed as follows:

$$\begin{aligned} FPS = \frac{1}{t}. \end{aligned}$$
(11)

Here, t represents the time required to process a single image, measured in seconds.
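A minimal timing sketch for Eq. (11) follows, estimating FPS as the reciprocal of the mean per-image inference time; `model` and `images` are placeholders, and GPU inference would additionally require device synchronization before reading the clock.

```python
import time

def measure_fps(model, images):
    start = time.perf_counter()
    for img in images:
        model(img)                                   # one forward pass per image
    t = (time.perf_counter() - start) / len(images)  # mean seconds per image
    return 1.0 / t                                   # Eq. (11): FPS = 1 / t
```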

AP measures the model’s detection accuracy for each object category. mAP is the arithmetic mean of the AP across all object categories. It reflects the overall detection accuracy of the model. The relationship between mAP and object detection accuracy is direct and can be expressed as follows:

$$\begin{aligned} {mAP=\frac{\sum _{i=1}^{N} A P_{i}}{N}=\frac{\sum _{i=1}^{N} \int _{0}^{1} \frac{T P_{i}}{T P_{i}+F P_{i}} d\left( \frac{T P_{i}}{T P_{i}+F N_{i}}\right) }{N}}, \end{aligned}$$
(12)

where \(AP_i\) denotes the detection accuracy of the model for the ith category. \(TP_i\) denotes a predicted box for the ith category that corresponds to a true object and is detected as positive. \(FP_i\) denotes a predicted box for the ith category that does not correspond to a true object but is detected as positive. \(FN_i\) denotes a true object of the ith category that the model fails to detect, i.e., a positive sample predicted as negative.
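A minimal sketch of Eq. (12) follows: AP as the area under one class's precision–recall curve (all-point integration) and mAP as the mean over classes. The inputs are assumed to be per-detection confidence scores and TP/FP flags produced by an IoU matching step (not shown); `num_gt` is the number of ground-truth boxes of that class, and the function names are ours.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    order = np.argsort(-np.asarray(scores))                  # sort detections by confidence
    tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])
    precision = tp / (tp + fp)                               # TP / (TP + FP)
    recall = tp / max(num_gt, 1)                             # TP / (TP + FN)
    return float(np.trapz(precision, recall))                # integrate precision over recall

def mean_average_precision(per_class_results):
    """per_class_results: list of (scores, is_tp, num_gt) tuples, one per class."""
    aps = [average_precision(*r) for r in per_class_results]
    return sum(aps) / len(aps)                               # Eq. (12)
```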

Table 1 Comparison of YOLOv8s for different unit based backbone networks on the PASCAL VOC07+12 dataset
Table 2 Comparison of ablation results

4.3 Results and discussion

To evaluate the efficacy of the enhancements integrated into our methodology, we employ YOLOv8 as the principal detection framework and validate it with our proposed modules. Performance evaluations, delineated in Table 1, encompass a spectrum of metrics associated with diverse backbone networks. The backbone network is constructed from three distinct foundational units: the conventional \(3\times 3\) convolution, RepVGGBlock, and \(\eta\)-RepConv. These units are assembled into the backbone following a standardized configuration, facilitating a comparative analysis of object detection metrics during both the training and inference phases. Specifically, RepVGGBlock is the instantiation of \(\eta\)-RepConv with \(\eta\)=1, while \(\eta\)-RepConv in Table 1 denotes the variant of our method with \(\eta\)=2.

Analysis of the experimental outcomes, as delineated in Experiments 1, 2, and 4, underscores a notable trend: parameters and FLOPs grow exponentially as \(\eta\) increases, while the accompanying gain in mAP shows diminishing returns. Consequently, the assertion that larger values of \(\eta\) necessarily lead to superior performance is rebutted by the empirical evidence. Notably, when \(\eta\) reaches 2, the inference time of the model before reparameterization already increases more than threefold; thus, for \(\eta > 2\), the computational and memory resources required for execution surge precipitously. In light of these observations, and in the interest of a more judicious balance between performance gains and computational overhead, the choice of this hyperparameter is paramount. Accordingly, \(\eta\)-RepConv with \(\eta\)=2 emerges as the optimal network unit for the experiments delineated in this study.

Table 1 establishes YOLOv8s as the standard for comparison, with its backbone network featuring the conventional \(3\times 3\) convolution. The accuracy of YOLOv8s stands at 81.59%, which is lower than that of detectors incorporating RepVGGBlock and \(\eta\)-RepConv into their backbones. The object detection models prior to reparameterization demonstrate comparable accuracy to those post-reparameterization, with the inference models being characterized by a smaller parameter count than their training counterparts. Moreover, the \(\eta\)-RepConv based detection method not only exhibits superior accuracy over both the original YOLOv8s and RepVGGBlock-based methods but also benefits from reduced parameter size and fewer FLOPs. In terms of detection speed, the \(\eta\)-RepConv based method surpasses the original YOLOv8s and is on par with the RepVGGBlock-based method. This performance difference is linked to the absence of reparameterization in YOLOv8s. Consequently, for our proposed object detection approach, we have selected the \(\eta\)-RepConv based backbone network.

Table 3 Comprehensive comparison with state-of-the-art methods

Table 2 delineates the performance metrics of YOLOv8s, \(\eta\)-EfficientRepYOLO, and other comparative methods, scrutinized under uniform experimental protocols on the PASCAL VOC 07+12 dataset. The findings of Experiments 1 and 2 affirm that adopting the \(\eta\)-EfficientRep backbone, with a parsimonious parameter count of 10.8 M, a computational complexity of 26.5 GFLOPs, and a detection throughput of 116 FPS, yields a palpable performance advantage over its YOLOv8s counterpart, garnering a 0.31% gain in detection precision over the benchmark YOLOv8s. A deeper look at Experiments 2 and 3 shows that the \(\eta\)-RepPANet augmentation engenders a substantial 1.31% improvement in mAP over the baseline neck network, while keeping parameter count, FLOPs, and detection velocity constant. Comparison of Experiments 3 and 4 reveals a further enhancement in detection accuracy of 4.56%, attributable to the improved detection head module, without any commensurate escalation in model complexity or computational overhead. Particularly noteworthy is the 3.22% accuracy improvement over YOLOv8s achieved by integrating the \(\eta\)-RepPANet neck with the MD-Head detection head module, a feat accomplished without any increase in model parameters or computational complexity and without sacrificing detection speed.

However, the pursuit of heightened detection precision necessitates trade-offs: as evidenced by the outcomes of Experiments 3 and 4, deploying \(\eta\)-RepAFPNYOLO yields commendable accuracy gains albeit at the expense of diminished detection velocity. A juxtaposition of Experiments 3 and 5 reveals a modest 0.73% boost in accuracy conferred by the \(\eta\)-RepAFPN network over its \(\eta\)-RepPANet counterpart, accompanied by a reduction in model parameters of 2 M and in computational complexity of 3.3 GFLOPs. Yet this comes at the cost of a precipitous 35 FPS drop in detection velocity, attributable to the heightened memory requirements of the expanded branch architecture inherent in the \(\eta\)-RepAFPN structure. Finally, a comparison between Experiment 4 and Experiment 6 underscores a salient observation: while the \(\eta\)-RepAFPNYOLO configuration has fewer model parameters and lower computational complexity than its \(\eta\)-RepPAYOLO counterpart, this reduction does not invariably translate into faster detection. The slower detection tempo is attributable to the larger memory footprint engendered by the supplementary branches of the \(\eta\)-RepAFPN neck module. These ablation analyses underscore the efficacy of the proposed network module, \(\eta\)-RepConv, and its pivotal role in fortifying model performance.

Fig. 6
figure 6

Comparison chart of the experimental results. The x-axis indicates the FPS of the detection method. The y-axis indicates the AP of the detection method, the radius of the circle in the figure indicates the number of parameters of the model, and the larger radius indicates the larger number of parameters

To validate the overarching efficacy of the proposed method, a comprehensive comparative analysis is conducted, juxtaposing the performance of \(\eta\)-RepPAYOLO and \(\eta\)-RepAFPNYOLO against contemporary state-of-the-art object detection algorithms. This evaluation encompasses renowned detectors such as the two-stage Faster RCNN, the single-stage SSD, as well as iterations of the YOLO series including YOLOv2, YOLOv3, YOLOv4, YOLOv5s, YOLOv5m, YOLOv6s, YOLOv6m, YOLOv7-T, YOLOv7, YOLOv8s, YOLOX-S, and DAMO-YOLO-S, as illustrated in Table 3. Notably, the detection accuracy of \(\eta\)-RepPAYOLO on the PASCAL VOC 07+12 and MS COCO datasets stands at 84.77% and 45.3%, respectively, while \(\eta\)-RepAFPNYOLO exhibits a corresponding accuracy of 85.65% and 45.8%. Remarkably, \(\eta\)-RepAFPNYOLO demonstrates a marginal enhancement in accuracy compared to \(\eta\)-RepPAYOLO, rivaling the performance of YOLOv6m and YOLOv7 albeit with significantly accelerated detection speeds. Despite a slight parameter increase relative to YOLOv5s and YOLOv7-T, the computational efficiency of \(\eta\)-RepPAYOLO surpasses these counterparts substantially. Further comparisons show that \(\eta\)-RepPAYOLO achieves superior detection accuracy over YOLOv7-T and YOLOv5s by 7.9% and 6.6%, respectively, with a commendable detection rate of 116 FPS, outperforming YOLOv5s at 96 FPS.

Table 4 Comprehensive comparison with state-of-the-art methods

Moreover, relative to YOLOX-S, the parameter increment in \(\eta\)-RepPAYOLO is modest, yet a notable reduction in model complexity is achieved. Noteworthy advancements are observed when contrasting \(\eta\)-RepPAYOLO against YOLOX-S on the MS COCO dataset, where detection accuracy is augmented by 5.3% alongside a substantial 22.1% boost in detection speed. Although a marginal decrease in accuracy is noted compared to DAMO-YOLO-S, the streamlined model parameters and computational complexity render \(\eta\)-RepPAYOLO more conducive for resource-constrained platforms. Furthermore, in comparison with YOLOv8s, \(\eta\)-RepPAYOLO demonstrates notable improvements, notably in parameter reduction and enhanced detection rates. Conversely, \(\eta\)-RepAFPNYOLO exhibits diminished parameterization and computational load, albeit at a slightly compromised detection rate, thus presenting a nuanced trade-off suitable for platforms with varying real-time and accuracy requirements. In summation, both proposed methodologies manifest advancements across diverse facets when measured against SOTA benchmarks, striking a pragmatic balance between real-time processing and detection accuracy, thereby underscoring their substantive contributions in the realm of object detection algorithms.

To show that the proposed method achieves a good balance between real-time and model parameter counts, and can achieve accurate object detection with excellent real-time performance on memory-constrained platforms, we describe it in the form of a scatter plot, as shown in Fig. 6. It can be seen that our method obtains relatively balanced detection results in terms of detection accuracy and real-time with a small number of parameter models.

To substantiate the capability of our proposed method to execute high-quality detection tasks across various complex scenarios, we have conducted experimental validations on a diverse array of image categories. The results illustrate that our method yields commendable outcomes in category recognition, accurate localization, and multi-target detection, as evident in the first two rows of Fig. 7. Nevertheless, our approach is not infallible. It exhibits certain limitations, particularly when dealing with diminutive targets within images, leading to instances of missed detections. Additionally, our method encounters challenges in scenarios where the image contains numerous targets and when there is an overlap of targets within the same category. This issue is depicted in the last row of Fig. 7, where missed detections occur. These insights point towards areas that necessitate further refinement within our detection algorithm.

Fig. 7
figure 7

Detection examples in different scenarios

To further fortify the evidence for the robustness and veracity of our proposed methodologies, we compare the performance of a spectrum of cutting-edge object detection frameworks. Specifically, we evaluate YOLOv3, YOLOv5s, YOLOv7-T, YOLOv8s, DC-YOLOv8, \(\eta\)-RepPAYOLO, and \(\eta\)-RepAFPNYOLO on the challenging VisDrone dataset, within the comprehensive experimental framework delineated in Table 4. The results underscore a notable achievement: the mAP of the \(\eta\)-RepPAYOLO and \(\eta\)-RepAFPNYOLO detection methods registers at 39.9% and 40.8%, respectively. The mAP over the stricter 0.5–0.95 IoU thresholds likewise attains commendable values of 23.7% and 24.3%, respectively, positioning our methods as a close second solely to the DC-YOLOv8 approach while surpassing the other methodologies under scrutiny. Further, despite this strong performance, the parameter counts of our proposed methods remain notably parsimonious, at 10.8 M and 8.8 M, respectively, a testament to their efficiency compared with the DC-YOLOv8 method. Moreover, the detection speed exhibited by the \(\eta\)-RepPAYOLO method, soaring to an impressive 116 FPS, is surpassed solely by the YOLOv7-T variant, underscoring the equilibrium achieved between computational efficiency and detection velocity. In summation, the findings herald a commendable balance struck among parameters, detection accuracy, computational complexity, and detection velocity, affirming the efficacy and versatility of our proposed methodology. The robustness of our approach is further validated through exhaustive experimentation across multiple benchmark datasets, cementing its position at the forefront of contemporary object detection paradigms.

5 Conclusion

Addressing the trade-off between model parameter count, complexity, and real-time performance inherent in deep learning-based real-time object detection methods, we have introduced a real-time object detection approach. Our method begins with the construction of a backbone network for feature extraction, utilizing the network unit \(\eta\)-RepConv and the module \(\eta\)-RepC2f, both employing SiLU as the principal activation function for the entire network. We further propose enhanced neck networks, namely \(\eta\)-RepPANet and \(\eta\)-RepAFPN, to serve as the detection neck of our model. To boost the neck module’s performance while reducing computational complexity, we incorporate the \(\eta\)-RepC2f module for feature fusion. Additionally, we refine the object detection head with the \(\eta\)-RepConv network module, which features decoupled classification and regression branches, thereby improving detection performance with decreased computational demand. Backbone network ablation studies validate the effectiveness of our foundational module in object detection, and further experimentation with various proposed modules confirms that our enhancements for each module contribute to superior detection accuracy. Testing results on the PASCAL VOC 07+12 and MS COCO datasets indicate that our methodology achieves an optimal balance between speed and accuracy.

Although our methodology has demonstrated effectiveness, we acknowledge that it is not without limitations and necessitates further refinement. Specifically, our experimentation with \(\eta\)-RepConv network units was limited to \(\eta\) = 1 and \(\eta\) = 2, leaving room for exploration of the optimal choice of \(\eta\). Future endeavors will involve extensive experimental exploration to determine the optimal hyperparameter, balancing resource consumption against performance gains across varying values of \(\eta\). Furthermore, enhancements are warranted for the base unit \(\eta\)-RepConv and, in particular, the \(\eta\)-RepC2f module derived from YOLOv8s, to address compatibility issues with certain hardware deployments and ensure seamless integration with existing models. Moreover, we recognize the importance of mitigating the accuracy degradation that results from direct model quantization. To this end, we will leverage prior research to incorporate a priori knowledge into the training optimization, enabling direct quantization during structural transformation and overcoming the associated accuracy degradation. In summary, future research will focus on hyperparameter screening, optimization of quantization methods, and exploration of diverse network unit replacements. The proposed \(\eta\)-RepYOLO offers versatility and performance improvements without increasing computational complexity or memory consumption, rendering it suitable for real-time applications requiring a balance between detection accuracy and performance.