
CN110428447B - Target tracking method and system based on strategy gradient - Google Patents

Target tracking method and system based on strategy gradient

Info

Publication number
CN110428447B
CN110428447B
Authority
CN
China
Prior art keywords
target
response
tracking
template
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910638477.5A
Other languages
Chinese (zh)
Other versions
CN110428447A (en)
Inventor
殷海兵
王康豪
黄晓峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN201910638477.5A priority Critical patent/CN110428447B/en
Publication of CN110428447A publication Critical patent/CN110428447A/en
Application granted granted Critical
Publication of CN110428447B publication Critical patent/CN110428447B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target tracking method and system based on a strategy gradient, and belongs to the field of computer vision. The method comprises the following steps: (1) inputting the target image into a convolutional neural network to obtain a target appearance template Z; (2) inputting the search image into a convolutional neural network to obtain a search area feature map; (3) computing a response map ht from the template image Z and the search area feature map through a similarity measurement function f; (4) inputting the response map ht obtained in step (3) together with a historical response map hi (i = 1, ..., N) into a policy network, and adding the action with the highest score to the set Ct; (5) repeating (4) until every historical response map in the response map template pool has been traversed, and finally executing the action that occurs most frequently in the set Ct. The system includes a tracker and a decision maker. Erroneous template updates are avoided, and a lost target can be re-detected in time.

Description

Target tracking method and system based on strategy gradient
Technical Field
The invention relates to the field of computer vision, and in particular to a target tracking method and system based on a strategy gradient (policy gradient).
Background
Visual Object Tracking (VOT) is one of the most challenging problems in the field of computer vision. It has wide application in video surveillance, human-computer interaction and autonomous driving. Despite significant advances in VOT technology over the last decades, it still faces serious challenges: severe occlusion, drastic illumination changes, deformation and the like may cause tracking failures.
Visual target tracking algorithms can be broadly classified into two categories: generative methods and discriminative methods. A generative method constructs a model from the target region of the current frame and then searches the next frame for the region most similar to that model; well-known examples include Kalman filtering, particle filtering and mean shift. A discriminative method, also known as tracking-by-detection, learns a discriminative model to distinguish the target region from the surrounding background. The difference between the two is that tracking-by-detection trains a classifier with machine learning and uses background information during training. The classifier can therefore focus on separating foreground from background, so discriminative methods generally outperform generative methods.
Among discriminative methods, those based on the Discriminative Correlation Filter (DCF) are known for their high efficiency and accuracy. By exploiting the discrete Fourier transform and cyclic shifts of the training samples, the DCF-based KCF tracker can run at 292 fps on a single CPU, far exceeding real-time requirements. In recent years, DCF research has progressed through multi-channel features, scale estimation and the mitigation of boundary effects. However, as the accuracy of DCF-style trackers increases, their speed drops sharply.
In recent years, tracking algorithms based on Convolutional Neural Networks (CNN) have attracted attention for their excellent performance. Unlike conventional tracking algorithms, CNN-based algorithms use deep convolutional features rather than hand-crafted features, which lets them show superior results on multiple tracking benchmarks. Although these CNN-based trackers perform well, they either use a simple online update strategy or never update the initial appearance template, relying only on the strong representational power of the trained CNN. This may be effective for interference-free short-term tracking. However, once severe occlusion or a significant appearance change occurs, the tracker drifts into the background and loses the target. These methods also lack an effective means to re-detect the target after it is lost.
Therefore, the invention provides a strategy gradient-based target tracking algorithm, which learns an effective policy through the policy gradient algorithm in reinforcement learning to recognize unreliable tracking results, and takes measures to prevent erroneous template updates and to re-detect lost targets.
The closest prior art:
[1]Tracking-Learning-Detection
[2]Long-term correlation tracking
[3]Large Margin Object Tracking with Circulant Feature Maps
[4]Reliable Re-detection for Long-term Tracking
[5]Tracking as Online Decision-Making:Learning a Policy from Streaming Videos with Reinforcement Learning
In the above target tracking technologies, methods [1-4] solve template updating and re-detection with manually designed strategies: such methods generally compute a tracking confidence with a fixed mathematical formula and update the tracking model only when the confidence is high. However, because of the fixed-parameter formula, this kind of method has limitations and cannot adapt well to different tracking sequences. Method [5] learns a policy through the Q-learning algorithm to decide when to update the target appearance template and whether to search the entire image globally. However, it uses a single response map to represent the state and does not consider the response diversity of different tracking sequences, so it cannot accurately assess the tracking result to make reliable decisions. In addition, the global search severely impacts the speed of the algorithm.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a strategy gradient-based target tracking method which can accurately identify unreliable tracking results and decide when to update the appearance template and whether to re-detect, so as to avoid erroneous template updates and to re-detect the target in time when it is lost.
A strategy gradient-based target tracking method comprises the following steps:
(1) inputting the target image into a convolutional neural network to obtain a target appearance template Z;
(2) inputting the search image into a convolutional neural network to obtain a search area feature map;
(3) comparing the template image Z with candidate areas of the same size in the search area feature map, and computing a response map ht through a similarity measurement function f; the similarity measurement function f is a Siamese network:
f(Z, X) = φ(Z) ⋆ φ(X) + b    (1)
where φ is a convolutional embedding function, ⋆ denotes the cross-correlation of the two feature maps, and b is an offset. After the target appearance template Z and the search area feature map are input, the function f generates the response map ht;
(4) inputting the response map ht obtained in step (3) together with a historical response map hi into the policy network to obtain a regularized score for each decision action; the historical response maps hi (i = 1, ..., N) come from a response map template pool in which N historical response maps are stored, each corresponding to a recent good tracking result; the action with the highest score is then added to the set Ct;
(5) repeating step (4) until every historical response map in the response map template pool has been traversed; finally, executing the action that occurs most frequently in the set Ct.
Further, the policy network involves a state st, an action a, a learning policy π and a reward Rt. The state st is represented as a tuple (hi, ht); the action a includes updating, tracking and re-detection; the reward Rt is given according to the overlap ratio between the current bounding box and the target; the learning policy π optimizes the deep policy network by gradient descent:
Δθ = α ∇θ log πθ(at | st) Rτ    (2)
where Rτ represents the return of the whole episode. During training, action samples are drawn from the policy network, and the reward Rt is then given by evaluating the selected action; the policy is optimized by updating the parameters with this reward information so as to maximize the expected reward, resulting in a trained policy network.
Further, during training the tracking of one frame is regarded as one episode, and the back-propagation algorithm is executed using formula (2); the reward function is defined as follows:
[Reward function equations, defined in terms of the IOU between the predicted box b and the ground-truth box g.]
where the Intersection-over-Union (IOU) represents the overlap ratio between the predicted box b and the ground-truth box g;
the strategy network consists of two 516-dimensional full-connection layers and an output layer, wherein the output layer outputs three actions of updating, tracking and re-detecting, and each full-connection layer is initialized randomly and is subjected to ReLU and batch regularization processing; the whole algorithm trains 200 cycles on an object tracking reference (OTB) data set, and each cycle is finished after an agent interacts with all training images; for each cycle, after 8192 samples are collected, the policy network starts learning.
Furthermore, after each learning step the updated policy network continues sampling for the next learning step; over the whole training process the learning rate decays from 10^-6 to 10^-8, and a batch size of 64 is used.
Further, if the most frequent action in the set Ct is updating, the target position Pt is updated to Ptp, the predicted position of the current target; the response map ht is added to the response map template pool while one old response map is discarded from the pool, and the target appearance template Z is updated with the current target position information.
Further, if the most frequent action in the set Ct is tracking, the target position Pt is updated to Ptp, the predicted position of the current target.
Further, if the most frequent action in the set Ct is re-detection, a search area where the target is most likely to appear is obtained through a particle filter, the response map htc of that search area is computed and the predicted target position Ptc of the re-detection area is obtained; ht = htc and Ptp = Ptc are updated, and the result is then input into the policy network again for another decision.
Further, during re-detection the particle filter draws M candidate search regions where the target is most likely to appear; for each candidate search region the tracking network is reused to compute a response map, and the best candidate search region is then selected by the confidence score:
Ci=max(fi)·cos(γ||Pi-Pt||)
where fi is the response map of the i-th candidate search region, Pi and Pt are the center positions of the i-th candidate search region and of the target in the previous frame, and γ is a predefined distance penalty parameter.
Further, when the re-detection is performed twice or more in one frame, the re-detection result is discarded and the initial tracking result is used.
A strategy gradient-based target tracking system comprises a tracker and a decision maker. The tracker computes a response map ht from the target appearance template Z and the search area feature map through a similarity measurement function f; the similarity measurement function f is a Siamese network:
f(Z, X) = φ(Z) ⋆ φ(X) + b    (1)
where φ is a convolutional embedding function, ⋆ denotes the cross-correlation of the two feature maps, and b is an offset. After the target appearance template Z and the search area feature map are input, the function f generates the response map ht. The response map ht and a historical response map hi form the state st of the tracker, and the tracker selects an action a according to the policy π given by the decision maker.
The decision maker is a trained policy network; the state st of the tracker is input into the policy network to obtain a regularized score for each decision action. The historical response maps hi (i = 1, ..., N) come from a response map template pool in which N historical response maps are stored, each representing a recent good tracking result. The action with the highest score is then added to the set Ct; every historical response map in the template pool is traversed; finally, the action that occurs most frequently in the set Ct is executed as the decision result of the decision maker.
The strategy gradient-based target tracking technology provided by the invention learns a policy network through the policy gradient algorithm in reinforcement learning. The policy network can accurately identify unreliable tracking results; by executing the corresponding decision actions, erroneous template updates are avoided and a lost target can be re-detected in time. This effectively addresses difficulties in target tracking such as occlusion and deformation, greatly improves tracking accuracy and robustness, and maintains a high speed. Experiments show that the proposed method improves performance by 5-6% over the original tracking framework.
Drawings
FIG. 1 is a general framework diagram of a strategy gradient-based object tracking technique;
FIG. 2 is a block flow diagram of a policy gradient-based target tracking technique;
FIG. 3 shows the distance precision results on the OTB-50 benchmark dataset;
FIG. 4 shows the overlap success rate results on the OTB-50 benchmark dataset;
FIG. 5 shows the distance precision results on the OTB-100 benchmark dataset;
FIG. 6 shows the overlap success rate results on the OTB-100 benchmark dataset.
Detailed Description
The technical scheme of the invention is further explained below with reference to the accompanying drawings.
As shown in fig. 1 and 2:
(1) Based on the SiamFC tracking algorithm framework, the area where the target is located in the first frame of the video sequence is cropped and scaled to a fixed size to obtain the target image; the target image is input into a convolutional neural network to obtain the target appearance template Z, where the template image size is 127 × 127.
(2) Based on the SiamFC tracking framework, in frame t the corresponding area around the center position of the target in frame t-1 is taken as the search area, and a search image X of size 255 × 255 is obtained by cropping; the search image X is likewise scaled to a fixed size and input into the convolutional neural network to obtain the search area feature map.
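The cropping and scaling in steps (1) and (2) can be sketched as follows. This is an illustrative Python/OpenCV sketch under the sizes stated above (127 × 127 template, 255 × 255 search image); the edge-padding behaviour and the exact crop window around the previous target position are assumptions, not the patented implementation.

```python
import cv2
import numpy as np

def crop_and_resize(frame, center_xy, crop_size, out_size):
    """Crop a square region of side crop_size centered at center_xy and
    resize it to out_size x out_size. Borders are padded by replicating
    edge pixels (an assumption; SiamFC-style trackers often pad with the
    image mean instead)."""
    cx, cy = center_xy
    half = crop_size / 2.0
    x1, y1 = int(round(cx - half)), int(round(cy - half))
    x2, y2 = x1 + crop_size, y1 + crop_size

    h, w = frame.shape[:2]
    pad_l, pad_t = max(0, -x1), max(0, -y1)
    pad_r, pad_b = max(0, x2 - w), max(0, y2 - h)
    padded = cv2.copyMakeBorder(frame, pad_t, pad_b, pad_l, pad_r,
                                cv2.BORDER_REPLICATE)
    patch = padded[y1 + pad_t:y2 + pad_t, x1 + pad_l:x2 + pad_l]
    return cv2.resize(patch, (out_size, out_size))

# Target image from frame 1 and search image around the previous center in frame t:
# target_img = crop_and_resize(frame1, first_center, crop_size=127, out_size=127)
# search_img = crop_and_resize(frame_t, prev_center, crop_size=255, out_size=255)
```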
(3) The template image Z is compared with candidate regions of the same size in the search area feature map. If two image patches describe the same target, the similarity measurement function f returns a high score; in fact, the similarity measurement function f is a deep Siamese network:
f(Z, X) = φ(Z) ⋆ φ(X) + b    (1)
where φ is a convolutional embedding function, ⋆ denotes the cross-correlation of the two feature maps, and b is an offset. After the target appearance template Z and the search area feature map are input, the function f generates a 33 × 33 response map ht.
The structure is fully convolutional with respect to the search image X: the target appearance template Z from step (1) is used as a convolution kernel and convolved with the search area feature map from step (2) to obtain the response map ht, and the position of the maximum value in the response map indicates the center position of the target to be tracked. The position of the target can thus be preliminarily predicted from the response map ht, and the similarity function for all translated sub-windows of the search image is computed in a single evaluation.
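As a minimal sketch of step (3), the cross-correlation of the template embedding with the search-area embedding can be computed with a 2-D convolution, using the template features as the kernel; the backbone producing the embeddings is not shown, and the example shapes below are illustrative (the description above yields a 33 × 33 response map ht).

```python
import torch
import torch.nn.functional as F

def response_map(template_feat, search_feat, bias=0.0):
    """Cross-correlation of formula (1): f(Z, X) = phi(Z) * phi(X) + b.
    template_feat: (1, C, Hz, Wz) embedding of the target template Z
    search_feat:   (1, C, Hx, Wx) embedding of the search image X"""
    # Treating the template embedding as a convolution kernel over the
    # search embedding implements the cross-correlation.
    return F.conv2d(search_feat, template_feat) + bias

# Illustrative shapes: a 6x6 template embedding over a 22x22 search embedding
# gives a 17x17 map; the patent's backbone produces a 33x33 response map.
z = torch.randn(1, 256, 6, 6)
x = torch.randn(1, 256, 22, 22)
ht = response_map(z, x)   # shape (1, 1, 17, 17)
```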
(4) The response map ht obtained in step (3) and a historical response map hi are input into the policy network together to obtain a regularized score for each decision action. The historical response maps hi (i = 1, ..., N) come from a response map template pool in which N historical response maps are stored, each corresponding to a recent good tracking result. The action with the highest score is then added to the set Ct for the subsequent reliable decision.
The reinforcement learning problem of the policy network can be viewed as a Markov Decision Process (MDP) in which an agent interacts with the environment through states, actions and rewards. In the tracking problem, the tracker is treated as the agent. Given a state st, the agent selects an action a according to the policy π. After performing this action, the agent receives a positive or negative reward Rt based on the IOU overlap ratio between the current bounding box and the target. By maximizing the expected reward, the agent learns an optimal policy for taking actions.
The policy gradient algorithm learns the policy π and optimizes the deep policy network by gradient descent:
Δθ = α ∇θ log πθ(at | st) Rτ    (2)
where Rτ represents the return of the whole episode. During training, action samples are drawn from the policy network and a reward is then given by evaluating the selected action. With this reward information, the policy can be optimized by updating the parameters to maximize the expected reward.
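A minimal REINFORCE-style sketch of the update in formula (2), assuming a PyTorch policy network that outputs action probabilities; the optimizer choice and the way the episode return Rτ is obtained are assumptions for illustration.

```python
import torch

def policy_gradient_step(policy_net, optimizer, state, action, episode_return):
    """One step of formula (2): delta_theta = alpha * grad_theta log pi_theta(a_t|s_t) * R_tau.
    Minimising -log pi(a|s) * R with a gradient-descent optimizer is
    equivalent to ascending the policy-gradient objective."""
    probs = policy_net(state)                      # (1, num_actions) softmax output
    log_prob = torch.log(probs[0, action] + 1e-8)  # log pi_theta(a_t | s_t)
    loss = -log_prob * episode_return              # weighted by the return R_tau
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```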
The selectable actions are tracking, updating and re-detection. The updating and tracking actions determine whether the tracker uses the predicted position information to update the appearance template of the target. When the re-detection action is executed, the re-detection module uses a particle filter to draw, around the previous target position, the M candidate search regions where the target is most likely to appear. For each candidate search region, the tracking network is reused to compute a response map, and the best candidate search region is then selected by the confidence score:
Ci = max(fi) · cos(γ||Pi - Pt||)    (3)
where fi is the response map of the i-th candidate search region, Pi and Pt are the center positions of the i-th candidate search region and of the target in the previous frame, and γ is a predefined distance penalty parameter. Similarly, the position of the re-detected target is determined by the maximum value of the response map of the best candidate search region. Finally, the response map of the best candidate search region is input into the policy network again to check the reliability of the re-detection result. When re-detection is performed twice or more within one frame, the re-detection result is discarded and the initial tracking result is used.
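The candidate scoring of formula (3) can be sketched as follows; the particle-filter sampling itself is not shown, and the value of the distance penalty γ is a tunable assumption.

```python
import numpy as np

def select_best_candidate(response_maps, candidate_centers, prev_center, gamma=0.01):
    """Score each candidate region with C_i = max(f_i) * cos(gamma * ||P_i - P_t||)
    and return the index of the best one.
    response_maps:     list of 2-D arrays f_i, one per candidate region
    candidate_centers: list of (x, y) centers P_i
    prev_center:       (x, y) target center P_t in the previous frame"""
    prev = np.asarray(prev_center, dtype=float)
    scores = []
    for f_i, p_i in zip(response_maps, candidate_centers):
        dist = np.linalg.norm(np.asarray(p_i, dtype=float) - prev)
        scores.append(np.max(f_i) * np.cos(gamma * dist))
    return int(np.argmax(scores))
```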
The state st can be represented as a tuple (hi, ht), where hi is the historical response map of a good tracking result and ht is the response map of the current frame. Previous approaches typically use only a single ht to describe the state st. However, because of the uncertainty of the tracking problem, the confidence of a response map may fluctuate across different sequences: a response map may indicate a failed tracking result in video A while the same map indicates a successful tracking result in another video B that contains more challenging factors. The invention therefore combines the current response map ht with the historical response maps hi to evaluate the reliability of the tracking result. In a sense, the policy network can be viewed as a similarity metric function that produces the similarity between hi and ht, so as to judge whether the current tracking result is good or bad and to perform further actions.
During training, the tracking of one frame is regarded as one episode and the back-propagation algorithm is executed using formula (2); the reward function is defined as follows:
[Reward function equations, defined in terms of the IOU between the predicted box b and the ground-truth box g.]
where the Intersection-over-Union (IOU) represents the overlap ratio between the predicted box b and the ground-truth box g and reflects the credibility of the tracking result of the given frame.
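The reward is defined through the IOU of the predicted box b and the ground-truth box g. The exact reward equations are not reproduced above, so the thresholded reward below is purely an illustrative assumption; the IOU computation itself is standard.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x, y, w, h)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    iw = max(0.0, min(ax1 + aw, bx1 + bw) - max(ax1, bx1))
    ih = max(0.0, min(ay1 + ah, by1 + bh) - max(ay1, by1))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def reward(pred_box, gt_box, threshold=0.5):
    """Illustrative thresholded reward (assumption): positive when the
    predicted box overlaps the ground truth sufficiently, negative otherwise."""
    return 1.0 if iou(pred_box, gt_box) >= threshold else -1.0
```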
The policy network consists of two 516-dimensional fully connected layers and one output layer that outputs the 3 actions. Each fully connected layer is randomly initialized and followed by ReLU and batch normalization. The entire algorithm is trained on the Object Tracking Benchmark (OTB) dataset for 200 epochs, each epoch ending after the agent has interacted with all training images. For each epoch, the policy network starts learning after 8192 samples have been collected. After each learning step, the updated policy network continues sampling for the next learning step. Over the whole training process the learning rate decays from 10^-6 to 10^-8, and a batch size of 64 is used.
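A sketch of the policy network described above: two 516-dimensional fully connected layers with ReLU and batch normalization, and a 3-way output (update, track, re-detect). The input dimensionality, two flattened 33 × 33 response maps for the state (hi, ht), is an assumption.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Two 516-d fully connected layers followed by a 3-action output layer."""
    def __init__(self, response_size=33, num_actions=3):
        super().__init__()
        in_dim = 2 * response_size * response_size   # flattened (h_i, h_t) pair
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 516), nn.ReLU(), nn.BatchNorm1d(516),
            nn.Linear(516, 516), nn.ReLU(), nn.BatchNorm1d(516),
            nn.Linear(516, num_actions),
        )

    def forward(self, h_i, h_t):
        # h_i, h_t: (B, 33, 33) historical and current response maps
        state = torch.cat([h_i.flatten(1), h_t.flatten(1)], dim=1)
        return torch.softmax(self.fc(state), dim=1)  # regularized action scores

# optimizer = torch.optim.Adam(PolicyNet().parameters(), lr=1e-6)  # decayed to 1e-8
```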
(5) Step (4) is repeated until every historical response map in the response map template pool has been traversed, and finally the action that occurs most frequently in the set Ct is executed, as sketched below. If the decision is the updating action, the target position and the appearance template are updated according to the prediction result; if the decision is the tracking action, the position is updated according to the prediction result but the appearance template is not; if the decision is the re-detection action, neither the target position nor the appearance template is updated according to the prediction result, and the re-detection module is used to search for the lost target.
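The decision procedure of steps (4) and (5), i.e. scoring the current response map against every historical response map in the template pool and acting on the majority vote, can be sketched as follows; the PolicyNet above is assumed, and the action bookkeeping in the trailing comments is only a summary of the behaviour described in the text.

```python
from collections import Counter
import torch

ACTIONS = ["update", "track", "redetect"]

def decide(policy_net, template_pool, h_t):
    """Traverse the response-map template pool, pick the highest-scoring
    action for each (h_i, h_t) pair, and return the majority action."""
    policy_net.eval()
    votes = []
    with torch.no_grad():
        for h_i in template_pool:                       # N historical response maps
            scores = policy_net(h_i.unsqueeze(0), h_t.unsqueeze(0))
            votes.append(ACTIONS[int(scores.argmax(dim=1))])
    return Counter(votes).most_common(1)[0][0]

# action = decide(policy_net, pool, h_t)
# if action == "update":   update position, refresh template Z, replace an old map in the pool
# elif action == "track":  update position only
# else:                    keep position and template, run the particle-filter re-detection
```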
The OTB-50 and OTB-100 benchmark datasets were processed with the above tracking method; the results, shown in FIGS. 3 to 6, demonstrate that the tracking method and system improve performance by 5-6%.
A strategy gradient-based target tracking system comprises a tracker and a decision maker. The tracker computes a response map ht from the target appearance template Z and the search area feature map through a similarity measurement function f; the similarity measurement function f is a Siamese network:
f(Z, X) = φ(Z) ⋆ φ(X) + b    (1)
where φ is a convolutional embedding function, ⋆ denotes the cross-correlation of the two feature maps, and b is an offset. After the target appearance template Z and the search area feature map are input, the function f generates the response map ht. The response map ht and a historical response map hi form the state st of the tracker, and the tracker selects an action a according to the policy π given by the decision maker.
The decision maker is a trained policy network; the state st of the tracker is input into the policy network to obtain a regularized score for each decision action. The historical response maps hi (i = 1, ..., N) come from a response map template pool in which N historical response maps are stored, each corresponding to a recent good tracking result. The action with the highest score is then added to the set Ct; every historical response map in the template pool is traversed; finally, the action that occurs most frequently in the set Ct is executed as the decision result of the decision maker.

Claims (10)

1. A strategy gradient-based target tracking method is characterized by comprising the following steps:
(1) inputting the target image into a convolutional neural network to obtain a target appearance template Z;
(2) inputting the search image into a convolutional neural network to obtain a search area feature map;
(3) comparing the template image Z with candidate areas of the same size in the search area feature map, and computing a response map ht through a similarity measurement function f; the similarity measurement function f is a Siamese network:
f(Z, X) = φ(Z) ⋆ φ(X) + b    (1)
where φ is a convolutional embedding function, ⋆ denotes the cross-correlation of the two feature maps, and b is an offset; after the target appearance template Z and the search area feature map are input, the function f generates the response map ht;
(4) inputting the response map ht obtained in step (3) together with a historical response map hi into the policy network to obtain a regularized score for each decision action; the historical response maps hi (i = 1, ..., N) come from a response map template pool in which N historical response maps are stored, each corresponding to a recent good tracking result; the action with the highest score is then added to the set Ct;
(5) repeating step (4) until every historical response map in the response map template pool has been traversed; finally, executing the action that occurs most frequently in the set Ct.
2. The method of claim 1, wherein the policy network involves a state st, an action a, a learning policy π and a reward Rt; the state st is represented as a tuple (hi, ht); the action a includes updating, tracking and re-detection; the reward Rt is given according to the overlap ratio between the current bounding box and the target; the policy network is optimized by a gradient descent algorithm:
Δθ = α ∇θ log πθ(at | st) Rτ    (2)
where Rτ represents the return of the whole episode; during training, action samples are drawn from the policy network, the reward Rt is then given by evaluating the selected action, and the policy is optimized by updating the parameters with this reward information so as to maximize the expected reward, resulting in a trained policy network.
3. The method according to claim 2, wherein during training the tracking of one frame is regarded as one episode, and the back-propagation algorithm is executed using formula (2), the reward function being defined as follows:
[Reward function equations, defined in terms of the IOU between the prediction box b and the ground-truth box g.]
where the Intersection-over-Union (IOU) represents the overlap ratio between the prediction box b and the ground-truth box g;
the policy network consists of two 516-dimensional fully connected layers and an output layer; the output layer outputs the three actions of updating, tracking and re-detection, and each fully connected layer is randomly initialized and followed by ReLU and batch normalization; the whole algorithm is trained for 200 epochs on the Object Tracking Benchmark (OTB) dataset, each epoch ending after the agent has interacted with all training images; for each epoch, the policy network starts learning after 8192 samples have been collected.
4. The method as claimed in claim 3, wherein after each learning step the updated policy network continues sampling for the next learning step, the learning rate decays from 10^-6 to 10^-8 over the whole training process, and a batch size of 64 is used.
5. The method of claim 1, wherein if the most frequent action in the set Ct is updating, the target position Pt is updated to Ptp, the predicted position of the current target; the response map ht is added to the response map template pool while one old response map is discarded from the pool, and the target appearance template Z is updated with the current target position information.
6. The method according to claim 1, wherein if the most frequent action in the set Ct is tracking, the target position Pt is updated to Ptp, the predicted position of the current target.
7. The method of claim 1, wherein if the most frequent action in the set Ct is re-detection, a search area where the target is most likely to appear is obtained through a particle filter, the response map htc of that search area is computed, the predicted target position Ptc of the re-detection area is obtained, ht = htc and Ptp = Ptc are updated, and the result is then input into the policy network again for another decision.
8. The method of claim 7, wherein during re-detection the particle filter draws the M candidate search regions where the target is most likely to appear, and for each candidate search region the tracking network is reused to compute a response map, and a best candidate search region is then selected by the confidence score:
Ci = max(fi) · cos(γ||Pi - Pt||)    (3)
where fi is the response map of the i-th candidate search region, Pi and Pt are the center positions of the i-th candidate search region and of the target in the previous frame, and γ is a predefined distance penalty parameter.
9. The method of claim 7, wherein when the re-detection is performed twice or more in a frame, the re-detection result is discarded and the initial tracking result is adopted.
10. A strategy gradient-based target tracking system, characterized by comprising a tracker and a decision maker; a target image is input into a convolutional neural network to obtain a target appearance template Z, a search image is input into the convolutional neural network to obtain a search area feature map, and the tracker computes a response map ht from the target appearance template Z and the search area feature map through a similarity measurement function f; the similarity measurement function f is a Siamese network:
f(Z, X) = φ(Z) ⋆ φ(X) + b    (1)
where φ is a convolutional embedding function, ⋆ denotes the cross-correlation of the two feature maps, and b is an offset; after the target appearance template Z and the search area feature map are input, the function f generates the response map ht; the response map ht and a historical response map hi form the state st of the tracker, and the tracker selects an action a according to the policy π given by the decision maker;
the decision maker is a trained strategy network and converts the state s of the trackertInputting into a policy network to obtain each decision actionMaking a regularization score; the historical response graphs hi (i is 1-N) come from a response graph template pool, N historical response graphs are stored in the response graph template pool, and each historical response graph corresponds to the latest good tracking result; then, the action with the highest score is selected and added to the set Ct (i ═ 1 to N); traversing each historical response map in the response map template pool; and finally, executing the action with the largest occurrence number in the set Ct (i is 1 to N) as a decision result of the decision maker.
CN201910638477.5A 2019-07-15 2019-07-15 Target tracking method and system based on strategy gradient Active CN110428447B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910638477.5A CN110428447B (en) 2019-07-15 2019-07-15 Target tracking method and system based on strategy gradient

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910638477.5A CN110428447B (en) 2019-07-15 2019-07-15 Target tracking method and system based on strategy gradient

Publications (2)

Publication Number Publication Date
CN110428447A CN110428447A (en) 2019-11-08
CN110428447B true CN110428447B (en) 2022-04-08

Family

ID=68409608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910638477.5A Active CN110428447B (en) 2019-07-15 2019-07-15 Target tracking method and system based on strategy gradient

Country Status (1)

Country Link
CN (1) CN110428447B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021142571A1 (en) * 2020-01-13 2021-07-22 深圳大学 Twin dual-path target tracking method
CN117765031B (en) * 2024-02-21 2024-05-03 四川盎芯科技有限公司 Image multi-target pre-tracking method and system for edge intelligent equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846358A (en) * 2018-06-13 2018-11-20 浙江工业大学 Target tracking method for feature fusion based on twin network
CN109543559A (en) * 2018-10-31 2019-03-29 东南大学 Method for tracking target and system based on twin network and movement selection mechanism
CN109636829A (en) * 2018-11-24 2019-04-16 华中科技大学 A kind of multi-object tracking method based on semantic information and scene information
CN109784155A (en) * 2018-12-10 2019-05-21 西安电子科技大学 Visual target tracking method, intelligent robot based on verifying and mechanism for correcting errors
CN109859241A (en) * 2019-01-09 2019-06-07 厦门大学 Adaptive features select and time consistency robust correlation filtering visual tracking method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9965717B2 (en) * 2015-11-13 2018-05-08 Adobe Systems Incorporated Learning image representation by distilling from multi-task networks

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108846358A (en) * 2018-06-13 2018-11-20 浙江工业大学 Target tracking method for feature fusion based on twin network
CN109543559A (en) * 2018-10-31 2019-03-29 东南大学 Method for tracking target and system based on twin network and movement selection mechanism
CN109636829A (en) * 2018-11-24 2019-04-16 华中科技大学 A kind of multi-object tracking method based on semantic information and scene information
CN109784155A (en) * 2018-12-10 2019-05-21 西安电子科技大学 Visual target tracking method, intelligent robot based on verifying and mechanism for correcting errors
CN109859241A (en) * 2019-01-09 2019-06-07 厦门大学 Adaptive features select and time consistency robust correlation filtering visual tracking method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Asynchronous Methods for Deep Reinforcement Learning; Volodymyr Mnih et al.; arXiv; 2016-06-16; 1-19 *
Deep Reinforcement Learning Based Optimal Trajectory Tracking Control of Autonomous Underwater Vehicle; Runsheng Yu et al.; Proceedings of the 36th Chinese Control Conference (D); 2017-07-28; 4958-4965 *
Tracking as Online Decision-Making: Learning a Policy from Streaming Videos with Reinforcement Learning; James Supančič et al.; arXiv; 2017-07-17; 1-11 *
An adaptive duty-cycle target tracking strategy; 沈伟华 et al.; Journal of Nanchang University (Natural Science); 2015-02-25; vol. 39, no. 1; 39-49 *
A survey of deep reinforcement learning based on value functions and policy gradients; 刘建伟 et al.; Chinese Journal of Computers; 2018-10-22; vol. 42, no. 6; 1406-1438 *

Also Published As

Publication number Publication date
CN110428447A (en) 2019-11-08

Similar Documents

Publication Publication Date Title
CN111401201B (en) Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN109146921B (en) Pedestrian target tracking method based on deep learning
CN108470355B (en) Target tracking method fusing convolution network characteristics and discriminant correlation filter
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN111462191B (en) Non-local filter unsupervised optical flow estimation method based on deep learning
CN113052873B (en) Single-target tracking method for on-line self-supervision learning scene adaptation
CN107424177A (en) Positioning amendment long-range track algorithm based on serial correlation wave filter
CN111612817A (en) Target tracking method based on depth feature adaptive fusion and context information
CN113129336A (en) End-to-end multi-vehicle tracking method, system and computer readable medium
CN113902991A (en) Twin network target tracking method based on cascade characteristic fusion
CN114332157B (en) Long-time tracking method for double-threshold control
CN110428447B (en) Target tracking method and system based on strategy gradient
CN112233145A (en) Multi-target shielding tracking method based on RGB-D space-time context model
CN115761393B (en) Anchor-free target tracking method based on template online learning
CN112686326A (en) Target tracking method and system for intelligent sorting candidate frame
CN106485283B (en) A kind of particle filter pedestrian target tracking based on Online Boosting
Li et al. Fish trajectory extraction based on object detection
CN114627156A (en) Consumption-level unmanned aerial vehicle video moving target accurate tracking method
CN115953570A (en) Twin network target tracking method combining template updating and trajectory prediction
CN113192110A (en) Multi-target tracking method, device, equipment and storage medium
CN116958057A (en) Strategy-guided visual loop detection method
CN116385915A (en) Water surface floater target detection and tracking method based on space-time information fusion
CN111915648B (en) Long-term target motion tracking method based on common sense and memory network
CN116168060A (en) Deep twin network target tracking algorithm combining element learning
CN116245913A (en) Multi-target tracking method based on hierarchical context guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant