Abstract
Human pose estimation has a wide range of applications. Existing methods perform well in conventional domains, but they exhibit two shortcomings when applied to sports activities. The first is the lack of extremity posture estimation, which makes it impossible to evaluate the movement posture comprehensively; the second is insufficient occlusion handling. Therefore, we propose a human pose compensation network based on incremental learning, which obtains shared weights to extract detailed features under the premise of limited extremity training data. We propose a higher-order feature compensator (HOF-compensator) to embed the attributes of the extremities into the torso and limb topology, building a complete higher-order feature. In addition, to improve occlusion handling, we propose an occlusion feature enhancement attention mechanism (OFE-attention) that can identify occluded keypoints and enhance attention to occluded areas. We design comparative experiments on three public datasets and a self-built sports dataset, achieving the highest mean accuracy among all comparative methods. In addition, we present a series of ablation analyses and visualizations to verify that our method performs best in sports pose estimation.
Introduction
Human pose estimation is an important technology in the field of computer vision, which aims to locate the key skeletal points of the human body to form a human pose [1]. It is widely used in pose evaluation [2], action recognition [3], human-computer interaction [4], pedestrian re-identification [5], and other fields. Human keypoint detection, the core technology of pose estimation, is mainly divided into two categories: top-down methods [6,7,8,9] and bottom-up methods [10,11,12]. The former first locates the bounding box of each person and then detects the keypoints within each box. The latter directly locates the keypoints of all persons and then aggregates and classifies the keypoints that belong to the same individual. Each has advantages and disadvantages: the former achieves higher accuracy because the bounding boxes narrow the detection range, but its accuracy depends on the quality of those boxes; the latter has a more comprehensive feature extraction range but requires more computation and is less efficient.
Fig. 1 Pose estimation in sports. In gymnastics (a) and ballet (b), the extremities usually must maintain parallel or other fixed angles with the limbs. In synchronized diving (c), the foot and leg postures of the two athletes differ. Traditional pose estimation methods stop at the wrists and ankles and cannot evaluate the above requirements
Although the above methods have excellent practicality in certain areas [1,2,3,4,5], they cannot efficiently handle pose estimation in complex sports such as gymnastics, dance, and diving (Fig. 1). There are two main reasons. Firstly, these sports consist of pre-defined standard movements, in which the relationships between the limbs and the extremities, such as the bilateral toes and fingertips, are often set as scoring criteria. The closer the movements of athletes conform to the standard, the higher they score. However, traditional methods [6,7,8,9,10,11,12] only focus on the posture of the torso and limbs, neglecting the posture of the extremities. Therefore, when these methods are used to evaluate sports movements, they cannot attend to the posture between the limbs and extremities, making it impossible to accurately evaluate whether the movements of athletes are standard. Although the detection ability of networks could be extended to the extremities, ready-made and sufficient training data for extremities is lacking. A small amount of data cannot make the network converge, and manually annotating a large amount of data is undoubtedly time-consuming and laborious. Moreover, such training would affect the original network. Therefore, it is challenging to compensate for the extremities without decreasing the accuracy of the original keypoints and with limited training data. Secondly, occlusion frequently occurs in sports, such as being blocked by other dancers in group dances, or self-occlusion caused by agile movements in individual gymnastics. In these cases, the obstacle features destroy the complete higher-order topology structure, which weakens the ability of networks to extract features of the occluded parts. Therefore, the occlusion problem needs to be improved.
To compensate for the detection of extremities with limited training data and improve the ability to handle occlusions, we propose a human pose compensation network based on incremental learning (PCNet). The aim is to achieve broader and more accurate pose evaluation in sports. The network has five components: the normal keypoints feature encoder (Normal-encoder), the extremity keypoints feature encoder (Extremity-encoder), the high-order feature compensator (HOF-compensator), the occlusion feature enhanced attention mechanism (OFE-attention), and the Decoder. The five components are connected by six operation steps to form PCNet. Step 1: Pre-train an HRNet model on the COCO2017 dataset and use its shallow layers (Stage 1–3) as the Normal-encoder and its deep layers (Stage 4–5) as the Decoder. Step 2: Transfer the weights of the Normal-encoder to the Extremity-encoder as prior knowledge. Step 3: Fine-tune the Extremity-encoder using a small amount of extremity training data from the self-built sports dataset. Step 4: Use the HOF-compensator to build a comprehensive topology, embedding the extremity topology into the existing torso and limb topology. Step 5: Identify occluded keypoints and their occlusion degrees using OFE-attention and enhance the attention of the network to the occluded areas based on the occlusion degrees. Step 6: Fine-tune the Decoder using the complete topology built by the HOF-compensator and the occlusion area features strengthened by OFE-attention, and obtain the network output. We evaluated PCNet on the COCO2017 dataset, the COCO-Wholebody dataset, the CrowdPose dataset, and our self-built sports dataset. The results show better accuracy than most comparative methods. In addition, we conducted a series of ablation experiments and visualizations to verify the effectiveness of each component in PCNet and its superior performance in sports pose estimation. In summary, PCNet is a pose estimation network primarily designed for sports. Compared to conventional networks, PCNet effectively solves two problems. The first is compensating for the pose estimation of the extremities, which is of great significance for the standard evaluation of sports actions. The second is improving the estimation accuracy of occluded body parts, which makes the estimated sports pose more complete and accurate. The contributions are summarized as follows.
(1) We propose PCNet, in which the Extremity-encoder can extract low-order detail features of the extremities with only a small amount of training data, and the HOF-compensator can construct a more complete high-order topology. The network achieves more comprehensive pose estimation and has greater application value in sports.
(2) We propose a plug-and-play OFE-attention, which can identify occluded areas and their degree of occlusion and enhance the attention of the network to the occluded regions.
Related work
Human keypoints compensation methods
The existing annotated torso and limb keypoints cannot meet the requirements of some specific pose estimation tasks, such as dance interaction, sports, and gait recognition. Therefore, it is necessary to compensate for the keypoints of the detailed parts on top of the torso and limb keypoints according to the specific task requirements. To achieve complementary detection of detailed keypoints, Xu et al. [13] proposed a three-channel network called ZoomNet for detecting keypoints of the body torso, limbs, and extremities. Yang et al. [14] proposed an instance-level detection method to improve the accuracy of hand and foot detection. Fang et al. [15] aimed to capture subtle human movements for complex behavior analysis and realized full-body pose estimation, including facial, body, hand, and foot poses. However, the above methods require manual annotation of new datasets to train the network, which is undoubtedly inefficient. To improve efficiency, Mohammadi et al. [16] introduced transfer learning by migrating weights from a network trained on a large dataset to their own network, achieving more detailed keypoint detection. Zhao et al. [17] proposed a clustering network based on incremental learning, which can add new keypoints to the original network and thus learn new knowledge. Yan et al. [18] proposed a framework based on incremental learning to increase the number of categories in gesture recognition, allowing the model to recognize new gestures without prior access to previous training data. Zhou et al. [19] proposed a shared learning method that clones and shares the fully connected layer to add new class knowledge incrementally to the model. The above methods achieve keypoint compensation to some extent, but three problems remain. The first is that the obtained prior knowledge differs substantially from the target task and cannot adapt well to it; therefore, the feature extraction ability is insufficient, the weights are difficult to converge, and more data is required for optimization and adjustment. The second is that the cost of obtaining training data is high. The above methods use hand-labeled datasets to obtain training data for the compensated keypoints, which is time-consuming and labor-intensive, and the labeled values are highly subjective with deviations. The third is that the fusion of compensated keypoints and normal keypoints is flawed. The above methods embed the features of compensated keypoints in the feature map in the form of low-order detail features. They ignore the importance of high-order topological structure features, so the feature representation is incomplete. PCNet can effectively solve the above problems. For the first one, PCNet obtains prior knowledge from the shallow weights of the normal keypoint detection network, which have an excellent ability to extract broad keypoint features. As the torso, limbs, and extremities are all human body parts whose broad features are highly similar, the prior knowledge is well adapted between the Normal-encoder and the Extremity-encoder. For the second problem, PCNet only needs a small amount of data to fine-tune the Extremity-encoder to convergence. On this basis, training the Decoder can give it the correct convergence direction, saving a large amount of training data.
For the third problem, the HOF-compensator focuses on the relationship between the extremities and the four limbs, expressing the complete torso, four limbs, and extremities in the form of high-order topological structure features, so the feature representation is more complete. In summary, PCNet can achieve higher-precision keypoint compensation with less training cost.
Fig. 2 The PCNet structure diagram. In the HOF-compensator and OFE-attention, the colored arrows represent the output variables. \(\otimes \), \(\odot \), \(\oplus \), and \(\textcircled {c}\) represent matrix multiplication, element-wise matrix multiplication, element-wise matrix summation, and feature map concatenation, respectively
Occlusion handling
To address occlusion, many researchers have combined CNNs with other feature extraction methods and achieved promising results. Zhou et al. [20] utilized the excellent global feature extraction capability of the transformer, combining it with a high-resolution network to enhance the accuracy of predicting the positions of occluded keypoints. Zhong et al. [21] introduced a self-attention module that integrates the features of each frame extracted from different layers of the network to capture the relationships between keypoints. The module reduced the negative effects of occluded keypoints on the overall performance. Kim et al. [22] proposed an end-to-end method based on projection segmentation and 3D detection. They also introduced an adaptive sampling strategy and a depth consistency constraint to effectively address occlusion and enhance the robustness of target pose estimation. Wang et al. [23] designed a joint relation extractor that takes pseudo-heatmap representations of keypoints as input and calculates their similarity, enabling the network to infer the positions of occluded keypoints. Peng et al. [24] proposed a human skeleton topology structure to extract relational information between non-adjacent keypoints in the model, obtaining topological keypoint features. While the above methods have alleviated occlusion, two problems remain. Firstly, they do not explicitly indicate the occluded keypoints and the extent of occlusion. Secondly, they do not assign varying attention to distinct regions on the feature map based on the occlusion level of each body part. Occluded areas are frequently blurred by obstacle characteristics; if the network does not enhance its attention to these areas, it often fails to capture keypoint features in the occluded region, which in turn impairs the formation of comprehensive higher-order topological features. To solve the first problem, OFE-attention clearly identifies the occluded keypoints and their relative occlusion degrees through high- and low-order feature matching. For the second problem, PCNet assigns different weights to different regions in the feature map based on the degree of occlusion, allowing the network to pay more attention to the occluded areas and extract more abstract features. In summary, compared to other methods, PCNet can more accurately identify occluded keypoints and reasonably allocate network attention to different regions based on the degree of occlusion.
The human pose compensation network (PCNet)
Compared to conventional pose estimation tasks, sports actions demand completeness and accuracy of the pose. Conventional methods can estimate the pose of the torso and limbs but neglect the pose of the extremities, resulting in an incomplete pose and making more detailed and comprehensive standardized assessment impossible. In addition, sports movements are flexible and complex, and occlusions often occur between different parts of the body. Thus, improving occlusion handling can enhance the accuracy of pose estimation. To address these two issues, we propose PCNet, a pose estimation network designed for sports movements. Compared to conventional networks, PCNet has two additional attributes. Firstly, it can achieve supplementary detection of extremity poses with only a small amount of training data. Secondly, it can identify occluded areas and enhance occlusion handling capabilities. The structure of PCNet is shown in Fig. 2, and it has five components: the Normal-encoder, Extremity-encoder, HOF-compensator, OFE-attention, and Decoder. The Normal-encoder and Extremity-encoder consist of the shallow layers (Stage 1–3) of HRNet [9], and the Decoder consists of the deep layers (Stage 4–5). The relationship among the components is as follows. Firstly, the Normal-encoder is pre-trained to acquire the network weights, output feature maps, and heatmaps. Subsequently, the shallow layer weights are transferred to the Extremity-encoder to obtain its output feature maps and heatmaps. The feature maps and heatmaps of the two encoders are then input into the HOF-compensator to construct the complete topology and extract high-order and low-order features. Following this, the two types of features are used as input to the OFE-attention mechanism to locate occluded keypoints and generate the reinforced feature map. Lastly, the reinforced feature map serves as the input to the Decoder to produce the output of PCNet. The detailed designs of all steps are as follows.
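As an illustration of this data flow, the following minimal PyTorch sketch wires stand-in modules in the order described above. The module internals are simple placeholders rather than the actual HRNet stages, and names such as StubEncoder and the 64\(\times \)48 resolution are our assumptions; only the wiring between components follows the text.

```python
# Minimal sketch of the PCNet data flow (illustrative only, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

K1, K2 = 17, 4            # torso/limb keypoints (COCO2017) and extremity keypoints
K = K1 + K2
H, W = 64, 48             # example feature-map resolution

class StubEncoder(nn.Module):
    """Stands in for HRNet Stages 1-3; emits per-keypoint feature maps and heatmaps."""
    def __init__(self, k):
        super().__init__()
        self.feat = nn.Conv2d(3, k, 3, padding=1)
        self.heat = nn.Conv2d(3, k, 3, padding=1)
    def forward(self, x):
        x = F.interpolate(x, size=(H, W))
        return self.feat(x), self.heat(x)

normal_enc = StubEncoder(K1)                 # pre-trained on COCO2017 (Step 1)
extrem_enc = StubEncoder(K2)                 # transferred weights, then fine-tuned (Steps 2-3)
hof = nn.Conv2d(K1 + K2, K, 3, padding=1)    # stand-in for the HOF-compensator (Step 4)
ofe = nn.Conv2d(K, K, 3, padding=1)          # stand-in for OFE-attention (Step 5)
decoder = nn.Conv2d(K, K, 3, padding=1)      # stand-in for HRNet Stages 4-5 (Step 6)

img = torch.randn(1, 3, 256, 192)
f_n, h_n = normal_enc(img)                       # torso/limb features and heatmaps
f_e, h_e = extrem_enc(img)                       # extremity features and heatmaps
f_topo_bar = hof(torch.cat([f_n, f_e], dim=1))   # compensated topology features
f_enhanced = f_topo_bar + ofe(f_topo_bar)        # occlusion-enhanced features
out = decoder(f_enhanced)                        # K heatmaps, one per keypoint
print(out.shape)                                 # torch.Size([1, 21, 64, 48])
```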
Step 1: Pre-training the Normal-encoder and Decoder. Since there is sufficient labeled training data for the keypoints of the torso and limbs, we use the data to train the HRNet, enabling it to detect the torso and limbs. Later, we divide the trained HRNet into two parts, with the shallow layer serving as the Normal-encoder, and the deep layer serving as the Decoder.
Steps 2 and 3: Transferring weights and fine-tuning the Extremity-encoder. The weights of the Normal-encoder, after training on sufficient data, possess the ability to detect keypoints of the torso and limbs. In a convolutional neural network, the shallow layer weights focus on local and broad features, while the deep layers, due to the gradual accumulation of convolutional layers, have a larger receptive field and thus pay more attention to global and higher-order abstract features [25]. In PCNet, the Normal-encoder, as the shallow layer of HRNet, has the ability to extract detailed features of the torso and limbs. More importantly, the torso, limbs, and extremities are all human body parts, and their detailed features, such as color and texture, are highly similar [26,27,28]. In summary, the weights of the Normal-encoder can be highly adapted to the Extremity-encoder, enabling the extraction of broad features of the extremities. Therefore, we transfer the weights of the Normal-encoder to the Extremity-encoder as the prior knowledge of the latter. Since the weight similarity between the two encoders is high, only a small amount of training data is required to fine-tune the weights for the extremities. This training should focus on the differences between the features of the extremities and those of the torso and limbs. The above can be explained by the following analysis: we use the value of the loss function to measure the convergence of the network weights and the detection accuracy, and the loss function \(L(\varvec{w})\) can be expressed as,
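A sketch of the loss in Eq. (1), assuming a mean squared error over the keypoint coordinates; the exact normalization in the published display may differ:
\[
L(\varvec{w}) = \frac{1}{n}\sum _{i=1}^{n}\big \Vert \varvec{f}(\varvec{w},\varvec{x}_i) - \varvec{y}_i \big \Vert ^2,
\]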
where \(\varvec{f}(\varvec{w},\varvec{x}_i)\) represents the network prediction of the keypoint coordinates for input image \(\varvec{x}_i\) using the weights \(\varvec{w}\) during forward propagation, \(\varvec{y}_i\) represents the ground truth keypoint coordinates, and n represents the number of training images. We assume that \(\varvec{w}_\textrm{N}\) represents the weights of the Normal-encoder; \(\varvec{w}_\textrm{E}\) represents the Extremity-encoder weights that are transferred from the Normal-encoder and fine-tuned with a small amount of extremity training data; and \(\varvec{\overline{w}}_\textrm{E}\) represents the Extremity-encoder weights that are trained only with a small amount of training data, without transfer. Based on the Taylor series expansion, we expand the loss functions of \(\varvec{w}_\textrm{E}\) and \(\varvec{\overline{w}}_\textrm{E}\) around \(\varvec{w}_\textrm{N}\) and retain the linear term as,
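A sketch of the first-order expansions in Eq. (2), written directly from the definitions above:
\[
L(\varvec{w}_\textrm{E}) \approx L(\varvec{w}_\textrm{N}) + \bigtriangledown L(\varvec{w}_\textrm{N})^{\top }(\varvec{w}_\textrm{E}-\varvec{w}_\textrm{N}),\qquad L(\varvec{\overline{w}}_\textrm{E}) \approx L(\varvec{w}_\textrm{N}) + \bigtriangledown L(\varvec{w}_\textrm{N})^{\top }(\varvec{\overline{w}}_\textrm{E}-\varvec{w}_\textrm{N}),
\]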
where \(\bigtriangledown \) represents the calculation of gradients. The local broad features of the human torso, limbs, and extremities are highly similar, and the network weights can be regarded as the ability of the network to extract features. Therefore, the closer the weights of the Extremity-encoder converge to \(\varvec{w}_\textrm{N}\), the better its performance. \(\varvec{w}_\textrm{E}\) is fine-tuned from \(\varvec{w}_\textrm{N}\) on the training data of the extremities, so the similarity between \(\varvec{w}_\textrm{E}\) and \(\varvec{w}_\textrm{N}\) is high and \(\varvec{w}_\textrm{E}\) converges toward \(\varvec{w}_\textrm{N}\). The fine-tuning of \(\varvec{w}_\textrm{E}\) is not significant because of the high degree of feature similarity among the torso, limbs, and extremities. In summary, we conclude that \(\varvec{w}_\textrm{E}\) \(\approx \) \(\varvec{w}_\textrm{N}\). However, \(\varvec{\overline{w}}_\textrm{E}\) is trained only with a small amount of data and has not converged. Therefore, we can conclude that \(\varvec{\overline{w}}_\textrm{E}\) \(\ne \) \(\varvec{w}_\textrm{N}\). Based on the above and Eq. (2), we can infer that,
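A sketch of the inference in Eq. (3), under the assumption that the linear term is negligible for \(\varvec{w}_\textrm{E}\) but not for \(\varvec{\overline{w}}_\textrm{E}\):
\[
L(\varvec{w}_\textrm{E}) \approx L(\varvec{w}_\textrm{N}) < L(\varvec{w}_\textrm{N}) + \bigtriangledown L(\varvec{w}_\textrm{N})^{\top }(\varvec{\overline{w}}_\textrm{E}-\varvec{w}_\textrm{N}) \approx L(\varvec{\overline{w}}_\textrm{E}).
\]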
In conclusion, Steps 2–3 result in a smaller value of the loss function of the Extremity-encoder, which reaches convergence with only a small amount of training data. Therefore, the Extremity-encoder has the ability to capture broad features of the extremities.
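A minimal sketch of this weight transfer and fine-tuning, assuming both encoders share the same architecture so the weights can be cloned directly; small_loader and the encoder module are placeholders, and the optimizer settings mirror those reported later in the experimental setup.

```python
# Sketch of Steps 2-3: clone the Normal-encoder weights as prior knowledge and
# fine-tune on a small extremity dataset (illustrative, not the authors' code).
import copy
import torch
import torch.nn as nn

def build_extremity_encoder(normal_encoder: nn.Module) -> nn.Module:
    # w_E is initialised at w_N; in practice the prediction head would be
    # resized to the K2 = 4 extremity keypoints before fine-tuning.
    return copy.deepcopy(normal_encoder)

def finetune(extremity_encoder: nn.Module, small_loader, epochs: int = 60) -> nn.Module:
    """Fine-tune with a small amount of extremity data using an MSE heatmap loss."""
    optim = torch.optim.Adam(extremity_encoder.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.MultiStepLR(optim, milestones=[45, 55], gamma=0.1)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for images, target_heatmaps in small_loader:
            optim.zero_grad()
            _, pred_heatmaps = extremity_encoder(images)   # encoder returns (features, heatmaps)
            loss = criterion(pred_heatmaps, target_heatmaps)
            loss.backward()
            optim.step()
        sched.step()
    return extremity_encoder
```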
Step 4: Compensating high-order topological features. The human topology is a crucial high-order feature that informs the network about the position and region of the human body in the feature map, facilitating the extraction of areas that are easier to refine. The deeper layers of the network prioritize topological features. In PCNet, the Decoder, as the deep layer of the pre-trained HRNet, has the ability to extract topological structures, but only for the torso and limbs. Therefore, it is necessary to embed the extremity topological structure into the torso and limb topology to obtain the overall structure for estimating sports actions. We propose the HOF-compensator, which aims to compensate for the structure of the extremities while preserving the topological structure of the torso and limbs. From the Normal-encoder and Extremity-encoder, we obtain feature maps \(\varvec{F}_\textrm{N}\in \mathbb {R}^{K_1\times H\times W}\) and \(\varvec{F}_\textrm{E}\in \mathbb {R}^{K_2\times H\times W}\), and heatmaps \(\varvec{H}_\textrm{N}\in \mathbb {R}^{K_1\times H\times W}\) and \(\varvec{H}_\textrm{E}\in \mathbb {R}^{K_2\times H\times W}\) for each keypoint. \(K_1=17\) is both the channel number of \(\varvec{F}_\textrm{N}\) and \(\varvec{H}_\textrm{N}\) and the number of torso and limb keypoints annotated in the COCO2017 dataset. \(K_2=4\) is the channel number of \(\varvec{F}_\textrm{E}\) and \(\varvec{H}_\textrm{E}\), and the number of keypoints of the bilateral toes and fingertips annotated in the self-built sports dataset. \(H\times W\) is the resolution of all maps. We utilize \(\varvec{F}_\textrm{N}\) and \(\varvec{H}_\textrm{N}\) to extract higher-order topological structure features for the torso and limbs. We concatenate the two maps and perform multiple convolutions and relu activations. The concatenation embeds the positional information in \(\varvec{H}_\textrm{N}\) into \(\varvec{F}_\textrm{N}\). After convolution, the topological features in \(\varvec{F}_\textrm{N}\) become more prominent. The relu function eliminates background and redundant features, thereby strengthening the topological structure features and other useful features. The above can be expressed as,
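A sketch using the operators defined just below; the published display may chain more than one convolution and relu:
\[
\varvec{F}_\textrm{topo} = Relu\big (Conv\big (Cat(\varvec{F}_\textrm{N},\varvec{H}_\textrm{N})\big )\big ),
\]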
where \(\varvec{F}_\textrm{topo}\in \mathbb {R}^{K_1\times H\times W}\) represents the high-order topology feature map, \(Cat(\cdot )\) represents concatenation, \(Conv(\cdot )\) represents convolution, and \(Relu(\cdot )\) represents relu activation. Next, we construct a feature map \(\varvec{F}_\textrm{pos}\in \mathbb {R}^{K_2\times H\times W}\) that represents the positional features of the extremities. It can be expressed as,
where \(\varvec{H}_{\textrm{E}, k_2}\in \mathbb {R}^{1\times H\times W}\) represents the heatmap of the \(k_2\)-th keypoint. We concatenate \(\varvec{F}_\textrm{topo}\) and \(\varvec{F}_\textrm{pos}\), and then perform convolution to obtain the compensated topological structure feature map \(\varvec{\overline{F}}_\textrm{topo}\in \mathbb {R}^{K\times H\times W}\) as,
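A sketch of this step with the same operators (the published display may include additional layers):
\[
\varvec{\overline{F}}_\textrm{topo} = Conv\big (Cat(\varvec{F}_\textrm{topo},\varvec{F}_\textrm{pos})\big ),
\]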
where \(K=K_1+K_2\) is the number of all keypoints. Convolving the concatenation of \(\varvec{F}_\textrm{topo}\) and \(\varvec{F}_\textrm{pos}\) establishes relationships based on the differences between the two maps. As the background pixels in \(\varvec{F}_\textrm{pos}\) have a value of 0, connections are established only among the extremities, wrists, and ankles. In summary, we embed the extremity features in \(\varvec{F}_\textrm{pos}\) into the topological structure of the torso and limbs in \(\varvec{F}_\textrm{topo}\). We can use \(\varvec{\overline{F}}_\textrm{topo}\) to fine-tune the weights of the Decoder, compensating for its ability to extract features from the extremities. However, before fine-tuning, we further improve the occlusion handling ability of PCNet in Step 5 (see Fig. 3).
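A compact PyTorch sketch of the HOF-compensator as described above, with one convolution per step and channel counts following \(K_1=17\), \(K_2=4\); the layer depths and the 64\(\times \)48 resolution in the usage example are our assumptions.

```python
# Sketch of the HOF-compensator (Step 4), illustrative only.
import torch
import torch.nn as nn

class HOFCompensator(nn.Module):
    def __init__(self, k1: int = 17, k2: int = 4):
        super().__init__()
        k = k1 + k2
        # Cat(F_N, H_N) -> Conv -> Relu: highlight the torso/limb topology.
        self.topo = nn.Sequential(nn.Conv2d(2 * k1, k1, 3, padding=1), nn.ReLU())
        # Cat(F_topo, F_pos) -> Conv: embed the extremity positions into the topology.
        self.fuse = nn.Conv2d(k1 + k2, k, 3, padding=1)

    def forward(self, f_n, h_n, f_pos):
        f_topo = self.topo(torch.cat([f_n, h_n], dim=1))        # K1 x H x W
        return self.fuse(torch.cat([f_topo, f_pos], dim=1))     # K  x H x W

# Usage with dummy tensors.
hof = HOFCompensator()
f_n, h_n = torch.randn(1, 17, 64, 48), torch.randn(1, 17, 64, 48)
f_pos = torch.randn(1, 4, 64, 48)       # positional map built from the extremity heatmaps
print(hof(f_n, h_n, f_pos).shape)       # torch.Size([1, 21, 64, 48])
```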
Step 5: Enhancing the features of occluded areas. To improve the detection accuracy for occluded parts, we propose the OFE-attention, which aims to increase the attention of the network to occluded parts. Firstly, we concatenate the heatmaps of all keypoints to obtain \(\varvec{H}\in \mathbb {R}^{K\times H\times W}\), and concatenate \(\varvec{F}_\textrm{N}\) and \(\varvec{F}_\textrm{E}\) to extract the low-order detailed feature maps of the torso, limb, and extremity keypoints, \(\varvec{F}_\textrm{low}\in \mathbb {R}^{K\times K\times H\times W}\). It is expressed as,
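One reading of this construction consistent with the stated shapes (each of the K slices masks the concatenated detail features with the k-th heatmap); the published display may differ:
\[
\varvec{F}_{\textrm{low},k} = \varvec{H}_{k} \odot Cat(\varvec{F}_\textrm{N},\varvec{F}_\textrm{E}),\quad k=1,\dots ,K,
\]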
where \(\varvec{H}_{k}\in \mathbb {R}^{1\times H\times W}\) represents the heatmap of the k-th keypoint. Then, we concatenate \(\varvec{\overline{F}}_\textrm{topo}\) with the low-order detail feature map of each keypoint, \(\varvec{F}_{\textrm{low},k}\in \mathbb {R}^{K\times H\times W}\), and perform convolutions and relu as,
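A sketch of this matching step with the operators defined above (the published display may stack more layers):
\[
\varvec{F}_{\textrm{match},k} = Relu\big (Conv\big (Cat(\varvec{\overline{F}}_\textrm{topo},\varvec{F}_{\textrm{low},k})\big )\big ),
\]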
where \(\varvec{F}_{\textrm{match},k}\in \mathbb {R}^{K\times H\times W}\) represents the feature map used for matching the k-th keypoint, as shown in Fig. 3. For normal keypoints without occlusion, the feature expression ability of \(\varvec{\overline{F}}_\textrm{topo}\) and \(\varvec{F}_\mathrm{{low},k}\) is high. After convolving the concatenation of the two, only a small number of new relationships are established in \(\varvec{F}_{\textrm{match},k}\). Therefore, \(\varvec{F}_{\textrm{match},k}\) and \(\varvec{\overline{F}}_\textrm{topo}\) have a high degree of similarity. For occluded keypoints, due to the low accuracy of \(\varvec{\overline{F}}_\textrm{topo}\) and \(\varvec{F}_\mathrm{{low},k}\), the topology of the keypoint in \(\varvec{\overline{F}}_\textrm{topo}\) deviates from the true value, forming a region around the true keypoint position with pixel values between the true keypoint value and the background value, such as the green points in Fig. 3. When the concatenated map is convolved, new connection relationships are established in \(\varvec{F}_{\textrm{match},k}\) for this region. Additionally, because the features of the occluded area are ambiguous, the convolution extracts new, less distinct features in \(\varvec{F}_{\textrm{match},k}\). Based on the above, compared to keypoints without occlusion, occluded keypoints have more relationships established in \(\varvec{F}_{\textrm{match},k}\). Therefore, \(\varvec{F}_{\textrm{match},k}\) and \(\varvec{\overline{F}}_\textrm{topo}\) have a lower degree of similarity. We subtract \(\varvec{F}_{\textrm{match},k}\) from \(\varvec{\overline{F}}_\textrm{topo}\) to obtain the difference between the two. Most of the difference is due to the occluded areas, and the rest is background noise. We apply softmax \(Softmax(\cdot )\) to convert it into a probability distribution \(\varvec{P}_{\textrm{match},k}\in \mathbb {R}^{1\times H\times W}\) as,
In \(\varvec{P}_{\textrm{match},k}\), the occluded areas are assigned higher weights. For all keypoints, \(\varvec{P}_{\textrm{match},k (k=1,\cdots , K)}\) are summed to obtain the probability distribution \(\varvec{P}_\textrm{match}\in \mathbb {R}^{1\times H\times W}\), in which the potential occluded areas of each keypoint are extracted and fused. For occluded keypoints, the probability values of the occluded areas are larger, while those of other parts are smaller. For unoccluded keypoints, the probability is distributed roughly evenly across all parts. As \(\varvec{P}_\textrm{match}\) and \(\varvec{\overline{F}}_\textrm{topo}\) have the same resolution, the probability distribution in \(\varvec{P}_\textrm{match}\) can be directly mapped onto \(\varvec{\overline{F}}_\textrm{topo}\) to represent the degree of occlusion. Then, \(\varvec{P}_\textrm{match}\) is multiplied with \(\varvec{\overline{F}}_\textrm{topo}\) and added to \(\varvec{\overline{F}}_\textrm{topo}\), enhancing the attention to occluded parts as,
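Written with the operators defined in Fig. 2, this enhancement reads (a sketch of the corresponding display):
\[
\varvec{F} = \varvec{P}_\textrm{match} \odot \varvec{\overline{F}}_\textrm{topo} \oplus \varvec{\overline{F}}_\textrm{topo},
\]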
where \(\varvec{F}\in \mathbb {R}^{K\times H\times W}\) represents the feature map with the topology of all parts and the enhanced occluded features.
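The following PyTorch sketch mirrors the mechanics of Step 5. The channel sum before the spatial softmax is our assumption (the text only specifies a \(1\times H\times W\) distribution), and all layer sizes are illustrative.

```python
# Sketch of OFE-attention (Step 5): compare high-order topology features with
# per-keypoint low-order features, turn the mismatch into a spatial attention map,
# and re-weight the topology features (illustrative, not the authors' code).
import torch
import torch.nn as nn

class OFEAttention(nn.Module):
    def __init__(self, k: int = 21):
        super().__init__()
        self.k = k
        self.match = nn.Sequential(nn.Conv2d(2 * k, k, 3, padding=1), nn.ReLU())

    def forward(self, f_topo_bar, f_low):
        # f_topo_bar: (B, K, H, W); f_low: (B, K, K, H, W), one K x H x W slice per keypoint.
        b, _, h, w = f_topo_bar.shape
        p_match = torch.zeros(b, 1, h, w, device=f_topo_bar.device)
        for k in range(self.k):
            f_match_k = self.match(torch.cat([f_topo_bar, f_low[:, k]], dim=1))
            diff = (f_topo_bar - f_match_k).sum(dim=1, keepdim=True)   # assumed channel sum
            p_match = p_match + torch.softmax(diff.flatten(2), dim=-1).view(b, 1, h, w)
        # F = P_match * F_topo_bar + F_topo_bar: enhance attention to occluded areas.
        return p_match * f_topo_bar + f_topo_bar

ofe = OFEAttention(k=21)
f_topo_bar = torch.randn(1, 21, 64, 48)
f_low = torch.randn(1, 21, 21, 64, 48)
print(ofe(f_topo_bar, f_low).shape)      # torch.Size([1, 21, 64, 48])
```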
Step 6: Fine-tuning the Decoder. We use \(\varvec{F}\) to fine-tune the weights of the Decoder. Compared to directly fusing \(\varvec{F}_\textrm{N}\) and \(\varvec{F}_\textrm{E}\) as fine-tuning data, the topology provided by \(\varvec{F}\) is better suited to the attention of the Decoder, the deep layer of HRNet. In addition, the multiple relu operations eliminate a large amount of redundant background, and the HOF-compensator embeds the extremity features better than directly fusing \(\varvec{F}_\textrm{N}\) and \(\varvec{F}_\textrm{E}\). Therefore, fine-tuning with \(\varvec{F}\) is more effective. Finally, we obtain the output \(\varvec{F}_\textrm{out}\in \mathbb {R}^{K\times H\times W}\) of the Decoder. In summary, Steps 1–3 use only a small amount of extremity training data to make the Extremity-encoder converge, so that it can extract generalized features of the extremities. In Step 4, the HOF-compensator establishes a connection between the torso, limbs, and extremities, building a complete high-order topological structure feature; therefore, PCNet achieves a supplementary estimation of the extremity pose. In Step 5, the OFE-attention identifies the occluded areas and their degrees of occlusion, and the feature map regions are weighted according to the degree of occlusion, enhancing the occlusion handling capability of PCNet. In Step 6, the Decoder is fine-tuned to make PCNet converge. Therefore, PCNet can be better applied to the pose estimation of sports actions.
Experiments
Dataset and evaluation metrics
To validate the effectiveness of our method, we conducted experiments on the COCO2017 dataset [29], the COCO-Wholebody dataset [13], the CrowdPose dataset [30], and a self-built sports dataset. The scale of each dataset is shown in Table 1. The COCO2017 dataset annotates a total of 17 human keypoints, including 5 for the head and 12 for the torso and limbs. The annotated images have rich backgrounds, with 75% regular scenes and 25% occlusion scenes, providing a comprehensive evaluation for the algorithm. The COCO-Wholebody dataset is an extension of COCO2017 that adds full-body keypoint annotations to support more accurate and comprehensive estimation of human posture. A total of 133 keypoints are annotated, including 68 for the face, 42 for the hands, and 23 for the body. Due to the denser annotations, it presents a higher challenge. The CrowdPose dataset is an image dataset for exploring multi-person pose estimation in crowded scenes, annotating a total of 14 human keypoints. The dataset covers various scenes and actions, especially occlusion. The self-built sports dataset contains a total of 4000 images, with 3000 for training and 1000 for validation. Each image is annotated with 4 keypoints for the fingertips and toes.
We use the Object Keypoint Similarity (OKS) metric as the quantitative indicator to evaluate our methods [9]. OKS is a metric that evaluates object detection and keypoint localization performance by comparing the distance between predicted keypoints and ground truth keypoints.
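For reference, the commonly used COCO form of OKS, which we assume here, is
\[
\textrm{OKS}=\frac{\sum _i \exp \left( -d_i^2/2s^2k_i^2\right) \,\delta (v_i>0)}{\sum _i \delta (v_i>0)},
\]
where \(d_i\) is the Euclidean distance between the \(i\)-th predicted and ground-truth keypoints, s is the object scale, \(k_i\) is a per-keypoint constant, and \(v_i\) is the visibility flag.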
Experimental setup
We conducted the experiments using Python 3.8 and developed a software simulation platform on the PyTorch framework. The hardware platform consisted of a Windows 11 computer equipped with an NVIDIA GeForce RTX 3060 graphics card. To prepare the input images, we resized them to 256\(\times \)192 and applied a series of image augmentation operations, such as random rotation of 45 degrees and scaling of 35%. The Adam optimizer is used for training, with an initial learning rate of 0.001 and a learning rate decay factor of 0.1. The Normal-encoder and Decoder are pre-trained on the COCO2017 dataset for a total of 210 epochs in Step 1, with learning rate decay applied at the 170th and 200th epochs. For fine-tuning, the Extremity-encoder and the Decoder are each trained for 60 epochs. The training data come from the training set of the self-built dataset. The initial learning rate and decay factor are consistent with those of the Normal-encoder, and the learning rate is adjusted at epochs 45 and 55. In the HOF-compensator and OFE-attention, we use 3\(\times \)3 convolution kernels. Additionally, during the parameter adjustment phase, we add a Dropout factor of 0.3 after each \(Linear(\cdot )\) operation to prevent overfitting.
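The pre-training schedule described above maps onto standard PyTorch components as in the following sketch; the single convolution standing in for HRNet W-32 and the training-loop placeholder are ours.

```python
# Sketch of the pre-training schedule: Adam, initial lr 0.001, decay factor 0.1
# at epochs 170 and 200, 210 epochs in total (model and data loop are placeholders).
import torch
import torch.nn as nn

model = nn.Conv2d(3, 17, 3, padding=1)      # placeholder for HRNet W-32
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[170, 200], gamma=0.1)

for epoch in range(210):
    # ... one pass over the COCO2017 training loader would go here ...
    optimizer.step()        # stands in for the per-batch parameter updates
    scheduler.step()        # applies the learning-rate decay at epochs 170 and 200
```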
Comparative experiments
To evaluate PCNet, we selected the following comparison methods. HRNeXt [3] learns high-resolution representations through rich context information to better estimate the poses of occluded humans. HRPoseFormer [4] is a parallel network with self-attention mechanisms that captures multi-scale features more effectively at lower computational complexity. Adversarial [5] takes into account the physical constraints and internal relationships of body parts and estimates complex poses better. CPN [6] is a cascaded pyramid network that initially identifies simple keypoints and then integrates features to predict occluded areas. Direction Aware [7] is a human-aware feature extractor based on global and local reasoning features; the global reasoning function uses non-local computing attributes to consider the entire body, while the local reasoning function focuses on individual body parts. HRNet [9] is a high-resolution network that combines multi-scale feature fusion to extract human features from multiple scales. Attachable Feature Corrector [10] is a keypoint detection method that combines point cloud technology. Bottom-up HRNet [11] is an extension of HRNet that utilizes deconvolution to generate higher-resolution feature maps and express poses in a bottom-up manner. DetPoseNet [12] detects keypoints in a coarse-to-fine manner, first detecting all keypoints in a large range and then eliminating less likely keypoints. ZoomNAS [13] jointly searches for the model architecture and the connections between different submodules and automatically assigns computational complexity to the searched submodules. AlphaPose [15] can accurately estimate and track full-body poses simultaneously in real time. Dual-Path Transformer [20] is a network combining a transformer and a high-resolution network. Multiscale Transformer [21] is a transformer framework with multi-scale feature extraction modules. UformPose [23] is a pyramid-based network with shared mechanisms, learning stronger low-level features in the initial stage and promoting cross-resolution commonality learning. PVNet [24] points pixel vectors to keypoints and uses these vectors to vote for keypoint locations, creating a flexible representation for locating occluded or truncated keypoints. Two Stage Net [31] is a two-stage training network that learns rough features in the first stage and improves estimation accuracy in the second stage.
For the COCO2017 dataset, we choose HRNeXt [3], HRPoseFormer [4], Adversarial [5], CPN [6], Direction Aware [7], HRNet [9], Attachable Feature Corrector [10], DetPoseNet [12], Dual-Path Transformer [20], UformPose [23], and Two Stage Net [31] for comparison with our method, since they are universal for the detection of all conventional keypoints. The COCO-Wholebody dataset is more suitable for evaluating algorithm performance in dense situations. We choose HRNet [9], Attachable Feature Corrector [10], Bottom-up HRNet [11], ZoomNAS [13], AlphaPose [15], UformPose [23], and PVNet [24] for comparison, because they can adapt to dense environments. Finally, the CrowdPose dataset provides cases with different levels of occlusion, and HRNet [9], Attachable Feature Corrector [10], ZoomNAS [13], AlphaPose [15], Multiscale Transformer [21], UformPose [23], and Two Stage Net [31] have a certain ability to handle occlusion. Therefore, we choose these methods to compare with our method on the CrowdPose dataset.
We compared the accuracy of keypoint detection on the torso and limb parts with other methods on the three public datasets. This comparison also validates the applicability of PCNet to traditional pose estimation. Since the Extremity-encoder and HOF-compensator in PCNet are specifically proposed for the extremities, they are not used in these comparisons. Therefore, we removed the above components and retained only the Normal-encoder, OFE-attention, and Decoder as the experimental network. Specifically, the experimental network uses \(\varvec{F}_\textrm{topo}\) instead of \(\varvec{\overline{F}}_\textrm{topo}\) as the input to OFE-attention.
We first evaluated the performance of the experimental network by comparing it with other algorithms on the COCO2017 keypoint detection dataset. The comparison results, shown in Table 2, reveal that PCNet achieves the highest accuracy of 78.1\(\%\) and a recall rate of 80.9\(\%\) with the W-32 version among all the algorithms. In addition, PCNet still has the best performance of 78.8\(\%\) under the deeper W-48 version. W-32 and W-48 differ only in the number of basic convolutional units (W-48 is deeper); the two versions are otherwise structurally identical [9]. This validates that PCNet can effectively detect human keypoints in normal and occluded scenarios. The complexity of PCNet is compared with all the comparison methods in terms of GFLOPs [13]. Although PCNet consists of multiple modules, the connection between these modules is facilitated solely through feature maps and heatmaps, eliminating the need for additional variables. Consequently, the complexity of PCNet is relatively low. Similar to reference [23], we used frames per second (FPS) as the measure of time cost. PCNet is competitive compared to other methods, making it suitable for practical engineering applications. In addition, the parameter quantity of PCNet is moderate.
Next, to evaluate the performance of the proposed network with the W-32 version more comprehensively, we conducted keypoint detection of the whole body, hands, face, and head on the COCO-Wholebody dataset, which has more complete annotations. The comparison results are shown in Table 3. PCNet achieves the highest accuracy among all comparative methods, except for the detection of feet in references [11] and [15]. However, for the flexible and frequently occluded hands, the accuracy of our method is at least 3.1% higher than that of the other methods. This validates the superior performance of PCNet and shows that it is better suited for complex and dense keypoint detection scenarios.
Finally, to validate the performance of the proposed network with the W-32 version in occluded scenarios, we provide the experimental results on the CrowdPose dataset in Table 4. The results indicate that PCNet achieves the highest metric scores among all the compared algorithms, confirming that PCNet has a strong competitive advantage in handling occluded scenarios.
In summary, the above experiments verify that PCNet has higher detection accuracy. The COCO2017 dataset contains 75% regular scenes, and PCNet achieves the highest accuracy among all comparative methods. Moreover, the model size, complexity, and time cost of the proposed network do not differ noticeably from those of the comparative methods. This shows that PCNet can handle traditional pose estimation tasks and achieve higher accuracy than comparative methods. The COCO-Wholebody dataset provides a denser detection environment, where PCNet achieves the highest accuracy for whole-body keypoints, although the detection accuracy of the foot keypoints is lower than that of three methods [11, 15, 24]. Therefore, PCNet can adapt to dense environments and achieve higher accuracy than comparative algorithms, but it has certain limitations in estimating foot pose. The CrowdPose dataset offers a more crowded and occluded detection environment, where PCNet achieves higher accuracy than all comparative methods by a considerable margin. The results verify that PCNet can better handle occlusion. The reasons for the performance improvement and the limitations will be analyzed in the ablation experiments.
Ablation experiments
PCNet has six steps: Step 1 is the pre-training, Steps 2–4 give the network the ability to detect the extremities using only a small amount of training data, Step 5 improves the performance in handling occlusions, and Step 6 is the fine-tuning of the Decoder. To validate the performance of each step, we conducted a series of ablation analyses. First, the weights of the Normal-encoder are transferred as prior knowledge for the Extremity-encoder in Step 2, and the encoder is fine-tuned with a small amount of training data in Step 3. To verify the effectiveness of these two steps, we conducted an experimental analysis on the self-built sports dataset using two networks: PCNet without weight transfer and PCNet with weight transfer. The experimental results are shown in Table 5. With the same amount of training data, the accuracy of PCNet with transferred weights is at least 20% higher than that of the comparative methods. At the same time, at equal accuracy, PCNet with transferred weights saves about 67% of the training data. Therefore, the Extremity-encoder can achieve high-precision detection of the extremities under the condition of insufficient training data. The reason is that the Extremity-encoder transfers the weights of the Normal-encoder, and these weights are highly adaptable between the two encoders. Therefore, only a small amount of extremity training data is needed to fine-tune the Extremity-encoder to achieve high accuracy.
The HOF-compensator in Step 4 embeds the detailed features of the extremities into the torso and limb topological structure. It constructs a more complete topology, eliminates redundant backgrounds, and provides more accurate high-order topological features for the Decoder to fine-tune on. To verify the effectiveness of the HOF-compensator, we designed a comparative method without the extremity structures, which instead uses the detailed features provided by the Extremity-encoder to describe the attributes of the extremities. Specifically, the comparative method uses \(\varvec{F}_\textrm{topo}\) instead of \(\varvec{\overline{F}}_\textrm{topo}\) as the input of OFE-attention. The experimental results of the comparative method and PCNet on the self-built sports dataset are shown in Table 6. With the HOF-compensator, AP increases by 10.1%. We attribute this to the fact that the Decoder, as the deep layer of HRNet, pays more attention to the high-order topology. Most existing traditional methods can only provide the topology of the torso and limbs and ignore the features of the extremities. The HOF-compensator utilizes the characteristics of convolution to establish a connection between the detailed features of the extremities and the high-order topology features of the torso and limbs. The Decoder already has a strong understanding of the topology of the torso and limbs; by fine-tuning with \(\varvec{\overline{F}}_\textrm{topo}\), it can enhance its focus on the various parts of the extremities. Consequently, the HOF-compensator can enhance the network's detection accuracy for the extremities.
To improve the accuracy of detecting occluded parts, we propose the OFE-attention to enhance their features. The module is plug-and-play and can be connected to any method that generates feature maps and heatmaps to improve its robustness to occlusion. To verify the effectiveness of this mechanism, we integrated it into several representative algorithms by directly fusing the output of the OFE-attention with the output feature map of each algorithm's network; the results are shown in Table 7. All methods show improved performance after integrating OFE-attention. It can enhance the detection ability of the target network for occluded parts without weakening the detection ability for non-occluded regular keypoints, because OFE-attention identifies occluded keypoints by matching the similarity between high- and low-order features and enhances the expression ability of the regional features in a weighted manner. It does not affect the expression of regional features for normal keypoints.
Visualization display
PCNet aims to achieve more comprehensive and detailed pose estimation in sports such as gymnastics and dance. To validate its pose estimation performance, we conducted a series of visualizations. Firstly, we present the detection results of PCNet for gymnastics and dance actions in the form of keypoint detection. As shown in Fig. 4, we achieve the detection of keypoints on the torso, limbs, and extremities, all with excellent accuracy and low probabilities of missed and false detections. Secondly, to verify the effectiveness of weight transfer in the Extremity-encoder, we show the extremity keypoint detection results of the Extremity-encoder without and with transferred weights in Fig. 5. PCNet without transferred weights has higher false detection and missed detection rates. In contrast, the accuracy of PCNet with transferred weights is much higher. In addition, to demonstrate the accuracy improvement after adding keypoints on the extremities, we provide a comparison of pose estimation results in Fig. 6. PCNet focuses on the extremities and achieves a joint estimation of the pose of the torso, limbs, and extremities. It can evaluate whether the foot is parallel to the leg or the hand is parallel to the forearm, as well as other crucial postures in gymnastics and dance movements. Therefore, PCNet is better suited to pose estimation in sports activities, achieving comprehensive, detailed, and high-precision evaluations.
To improve the performance in handling occlusions, we designed the OFE-attention. To demonstrate its effectiveness, we showed detection results under occlusion conditions, as shown in Fig. 7. PCNet estimates the pose of the occluded parts well. In crowded places, PCNet can also estimate the pose of people in the back row, who are severely occluded. However, in severe occlusion environments, there is a certain probability of false detection and missed detection in the pose estimation of the extremities and feet. It further confirms the need to improve the extraction ability of edge foot features in PCNet.
To verify the degree of false detection and missed detection under occlusion, we visualized the heatmaps of each keypoint in the occluded environment, as shown on the left of Fig. 8. Since the comparative methods only involve the keypoints of the torso and limbs, we do not show the extremities for comparison. PCNet generates distinct heat points, so it has a much lower false detection rate than the other methods and exhibits no missed detections. Finally, to verify that PCNet has higher detection accuracy for sports movements, we conducted a heatmap visualization, shown on the right of Fig. 8. PCNet makes the heat values of each keypoint more concentrated, indicating a lower probability of false detections and no missed detections. In summary, PCNet can be better applied to sports to achieve more comprehensive and accurate pose estimation.
Results and discussion
In sports pose estimation, we found that existing methods have two major drawbacks: insufficient estimation of the extremities and insufficient occlusion handling ability. To address these two issues, we proposed PCNet and conducted experiments on four datasets to verify its effectiveness. In this section, we provide an overall assessment of PCNet by analyzing the results on each dataset. Firstly, on our self-built dataset, PCNet used only one-third of the training data and achieved accuracy competitive with other methods. With the same amount of training data, its accuracy is at least 20% higher. Our method can therefore achieve complementary detection of extremity poses with insufficient training data. This is because the weights of the Normal-encoder and Extremity-encoder are highly adapted: the transferred weights, used as prior knowledge, provide better convergence for the two encoders and the decoder in extremity detection. In addition, the HOF-compensator can embed the extremity features into the torso and limb features by constructing higher-order topological features, which contributes to building the overall pose and improves the expressiveness of the features. Secondly, on the COCO2017, COCO-Wholebody, and CrowdPose datasets, the accuracy of PCNet is at least 0.5%, 1.0%, and 1.2% higher, respectively. Therefore, PCNet can be applied to the detection of conventional keypoints, densely located keypoints, and occluded keypoints with higher accuracy, because the OFE-attention can identify occluded keypoints and their relative occlusion degrees and strengthen the feature expression of the occluded areas. Compared to other methods, PCNet is more suitable for estimating body pose in sports: it achieves much better extremity pose estimation and has better occlusion handling ability.
However, on the COCO-Wholebody dataset, we found that PCNet has relatively insufficient accuracy for the feet, which are located at the edge of the feature map. PCNet judges occluded keypoints by calculating the similarity between higher-order features and lower-order features. When the foot detail features \(\varvec{F}_\mathrm{{low}}\) and the human topological features \(\varvec{\overline{F}}_\mathrm{{topo}}\) are convolved, edge padding operations are performed to obtain \(\varvec{F}_\mathrm{{match}}\). This introduces a number of zero-valued pixels around the edge of the feature map. Therefore, for edge keypoints, \(\varvec{F}_\mathrm{{match}}\) weakens the features compared to \(\varvec{\overline{F}}_\mathrm{{topo}}\), which increases the difference between the two maps. The keypoints on the feet are then mistaken for being occluded. More importantly, there is a significant deviation between the difference area and the true area of the feet on the feature map, so the false detection rate increases. In summary, our method has certain limitations in detecting edge parts.
Conclusion
We propose PCNet for sports pose estimation, which achieves the compensation of extremities and improves occlusion handling capabilities under the premise of limited training data. In comparative experiments, PCNet achieves 78.1%, 66.4%, and 77.3% detection accuracy on the COCO2017, COCO-Wholebody, and CrowdPose datasets, respectively, which is higher than all comparative algorithms. The results verify that PCNet performs well in conventional pose estimation tasks. Moreover, on the self-built sports dataset, PCNet achieves a detection accuracy of 78.1% for the extremities, which is about 20% higher than other comparative methods. Additionally, at equal accuracy, PCNet uses only one-third of the training data. Therefore, PCNet can obtain the ability to estimate extremity pose with only a small amount of data, and our design ideas can provide inspiration for methods based on few-shot incremental learning. In the visualizations, PCNet comprehensively evaluates the consistency of sports actions for multiple athletes. In the heatmap displays, PCNet has more concentrated heat values for occluded areas, which verifies its better occlusion handling performance. We have two future works. The first is to address the limitation of PCNet, namely its insufficient detection ability for peripheral parts such as the feet. The second is inspired by the heatmaps of sports in Fig. 8. We find that the body configurations of sport actions do not conform to the conventional human body structure, in which the head, torso, and legs appear in a top-down order; instead, they are combinations of rotations and inversions of the three parts. In this case, extracting topological structure features is difficult. Therefore, a more accurate method for constructing topological structure features is required to describe these unconventional sport actions. Finally, the idea of this paper generalizes well: according to the requirements of different tasks, the keypoints can be changed to detect diverse human body parts.
Data Availability
The data are available from the corresponding author on reasonable request.
References
Ding Y, Zhang Z, Zhao X et al (2022) Multi-feature fusion: graph neural network and cnn combining for hyperspectral image classification. Neurocomputing 501:246–257
Zeng K, Ma Q, Wu J et al (2023) Nlfftnet: a non-local feature fusion transformer network for multi-scale object detection. Neurocomputing 493:15–27
Li Q, Zhang Z, Zhang F et al (2023) HRNeXt: high-resolution context network for crowd pose estimation. IEEE Trans Multimedia 25:1521–1528
Yu X, Chen G (2022) HRPoseFormer: high-resolution transformer for human pose estimation via multi-scale token aggregation. In: IEEE 16th International Conference on solid-state and integrated circuit technology (ICSICT), pp 1-3
Zhang T, Lian J, Wen J (2023) Multi-person pose estimation in the wild: using adversarial method to train a top-down pose estimation network. IEEE Trans Syst Man Cybern Syst 53(7):3919–3929
Chen Y L, Wang Z C, Peng Y X, et al (2017) Cascaded pyramid network for multi-person pose estimation. In: IEEE/CVF Conference on computer vision and pattern recognition, pp 7103–7112
Zhou L, Chen Y, Wang J (2023) Progressive direction-aware pose grammar for human pose estimation. IEEE Trans Biomet Behav Ident Sci 5(4):593–605
Newell A, Yang K Y, Deng J (2019) Stacked hourglass networks for human pose estimation. In: Computer Vision (ECCV), pp 483–499
Wang J (2021) Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell 43(10):3349–3364
Kim G, Kim H, Kong K (2024) Human body-aware feature extractor using attachable feature corrector for human pose estimation. IEEE Trans Multimedia 25:5789–5799
Cheng B W, Xiao B, Wang J D, et al (2020) Bottom-up higher-resolution networks for multi-person pose estimation. In: IEEE International Conference on computer vision and pattern recognition (CVPR), pp 1-10
Ke L, Chang M, Qi H et al (2022) DetPoseNet: improving multi-person pose estimation via coarse-pose filtering. IEEE Trans Image Process 31:2782–2795
Xu L et al (2023) ZoomNAS: searching for whole-body human pose estimation in the wild. IEEE Trans Pattern Anal Mach Intell 45(4):5296–5313
Yang L, Song Q, Wang J (2021) Higher-CNN: instance-level human parts detection and a new benchmark. IEEE Trans Image Process 30:39–54
Fang H et al (2023) AlphaPose: whole-body regional multi-person pose estimation and tracking in real-time. IEEE Trans Pattern Anal Mach Intell 45(6):7157–7173
Mohammadi S, Enshaeifar S, Hilton A (2022) Transfer learning for clinical sleep pose detection using a single 2D IR camera. IEEE Trans Neural Syst Rehabil Eng 29(3):290–299
Zhao X, Wang Z, Gao L (2022) Incremental face clustering with optimal summary learning via graph convolutional network. Tsinghua Sci Technol 16(6):536–547
Yan Q, Xu Y, Yang X (2012) A robust homography estimation method based on keypoint consensus and appearance similarity. In: IEEE International Conference on multimedia and expo, pp 586-591
Zhou D, Ye H, Ma L (2023) Few-shot class-incremental learning by sampling multi-phase tasks. IEEE Trans Pattern Anal Mach Intell 45(11):12816–12831
Zhou L, Chen Y, Wang J (2024) Dual-path transformer for 3D human pose estimation. IEEE Trans Circuits Syst Video Technol 34(5):3260–3270
Zhong Y, Yang G, Zhong D (2024) Frame-padded multiscale transformer for monocular 3D human pose estimation. IEEE Trans Multimedia 26:6191–6201
Kim S, Kang S, Choi H et al (2023) Keypoint aware robust representation for transformer-based re-identification of occluded person. IEEE Signal Process Lett 30:65–69
Wang Y, Luo Y et al (2023) UformPose: a U-shaped hierarchical multi-scale keypoint-aware framework for human pose estimation. IEEE Trans Circuits Syst Video Technol 33(4):1697–1709
Peng S, Zhou X, Liu Y et al (2022) PVNet: pixel-wise voting network for 6DoF object pose estimation. IEEE Trans Pattern Anal Mach Intell 44(6):3212–3223
Hafiz A, Hassaballah M, Abdullah A (2023) Reinforcement learning with an ensemble of binary action deep Q-networks. Comput Syst Sci Eng 46(3):2651–2666
Lee K, Kim W, Lee S (2023) From human pose similarity metric to 3D human pose estimator: temporal propagating LSTM networks. IEEE Trans Pattern Anal Mach Intell 45(2):1781–1797
Wei L, Huang H, Yu X (2024) Intersection-over-union similarity-based nonmaximum suppression for human pose estimation in crowded scenes. IEEE Trans Cognit Dev Syst 16(2):511–520
Zhang Y, Shao Y, Luo R (2024) Multiple human activities classification based on dynamic on-body propagation characteristics using transfer learning. IEEE Internet Things J 11(5):8637–8646
Lin T (2014) Microsoft COCO: common objects in context. In: Computer Vision (ECCV), pp 740-755
Li J, Wang C, Zhu H (2019) CrowdPose: efficient crowded scenes pose estimation and a new benchmark. In: IEEE/CVF Conference on computer vision and pattern recognition (CVPR), pp 10855-10864
Zhang J, Liu M, Shen J (2024) Lightweight whole-body human pose estimation with two-stage refinement training strategy. IEEE Trans Human-Mach Syst 54(1):121–130
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose. The authors have no Conflict of interest to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript. The authors have no financial or proprietary interests in any material discussed in this article.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.