CN110008844A - KCF long-term gesture tracking method fused with SLIC algorithm - Google Patents
KCF long-term gesture tracking method fused with SLIC algorithm
- Publication number
- CN110008844A (application CN201910184848.7A)
- Authority
- CN
- China
- Prior art keywords
- gesture
- target
- foreground
- kcf
- tracking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/2411 — Classification techniques based on the proximity to a decision surface, e.g. support vector machines
- G06F18/28 — Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
- G06V10/507 — Summing image-intensity values; Histogram projection analysis
- G06V10/56 — Extraction of image or video features relating to colour
- G06V40/28 — Recognition of hand or arm movements, e.g. recognition of deaf sign language
Abstract
The invention discloses a KCF long-term gesture tracking method fused with the SLIC algorithm, comprising the steps of: 1) constructing a gesture training data set, extracting superpixel blocks and training an SVM model of the superpixel blocks offline to obtain a coarse classification model for gesture detection; 2) constructing a foreground-background dictionary and designing the similarity function of a KNN algorithm by combining FHOG features and CN features, so as to complete the fine classification of gesture detection; 3) obtaining a gesture detection model from the coarse classification model and the fine classification of gesture detection, and detecting the target with this model to obtain the detection box of the target gesture; 4) estimating the best-fitting rectangular box of the target gesture with a designed target scale estimator; 5) designing a confidence function and deciding whether the current tracking result is credible by comparing the similarity between the tracking results of the current frame and the previous frame, thereby achieving gesture tracking. The algorithm of the invention has low complexity, high tracking accuracy and strong robustness, and is suitable for real-time applications.
Description
Technical Field
The invention relates to gesture recognition technology, and in particular to a KCF long-term gesture tracking method fused with the SLIC algorithm.
Background
Gesture recognition technology has long been a research focus, and gesture tracking is an important part of it. Gesture tracking methods are generally classified into two types: short-term tracking, which considers the motion of a target over a short period of time (e.g. the KCF, DSST and MOSSE algorithms), and long-term tracking, which aims to keep tracking the target well over a long period of time.
The KCF target tracking algorithm is a discriminative correlation filtering algorithm. In general, such methods train a target detector during tracking, use the detector to test whether the predicted position in the next frame contains the target, and then use the new detection result to update the training set and hence the detector. The KCF algorithm collects positive and negative samples from a circulant matrix of the region around the target and trains the detector by ridge regression; using the diagonalization property of circulant matrices in Fourier space, it converts the matrix operations into Hadamard products of vectors, i.e. element-wise multiplication, which greatly reduces the amount of computation and increases the speed. For the nonlinear case, the KCF algorithm maps the linear-space ridge regression into a nonlinear space through a kernel function, solves the dual problem and some common constraints in that space, and again simplifies the computation using the diagonalization property of circulant matrices in Fourier space.
To some extent the KCF algorithm is a good real-time algorithm, but it still has the following problems:
1. depending on its circulant matrix and the initialized matrix, the KCF algorithm cannot adapt its window, so it is not ideal for multi-scale target tracking;
2. the KCF algorithm tracks poorly for fast-moving targets and targets in low-frame-rate video, because the displacement of the target between adjacent frames becomes too large and exceeds the search range of the algorithm;
3. the KCF algorithm has difficulty continuing to track a target after it has been occluded for several frames.
Disclosure of Invention
In view of the above technical problems, the invention aims to provide a KCF long-term gesture tracking method fused with the SLIC algorithm.
To achieve this purpose, the technical scheme adopted by the invention comprises the following steps:
A KCF long-term gesture tracking method fused with the SLIC algorithm comprises the following steps:
1) constructing a gesture training data set, extracting superpixel blocks of the picture through a SLIC algorithm, and training an SVM model of the superpixel blocks in an off-line mode to obtain a coarse classification model of gesture detection;
2) extracting the foreground and the background of various gesture pictures from the gesture training data set, constructing a foreground-background dictionary, and designing a similarity function of a KNN algorithm by combining FHOG characteristics and CN characteristics so as to complete fine classification of gesture detection;
3) obtaining a gesture detection model through the coarse classification model of the gesture detection and the fine classification of the gesture detection, and detecting a target by using the gesture detection model to obtain a detection frame of the target gesture; initializing a KCF filter by using a detection box of the target gesture, and then estimating the target gesture of the next frame by using the KCF filter, wherein the KCF filter takes FHOG characteristics and CN characteristics as input;
4) estimating an optimal rectangular box of the target gesture by using a designed target scale estimator, wherein the target scale estimator adopts FHOG characteristics and CN characteristics as input;
5) determining whether the current tracking result is credible by comparing the similarity between the tracking results of the current frame and the previous frame, using a confidence function designed from a perceptual hash algorithm, the FHOG-feature cosine similarity and the color-statistics cosine similarity; if the confidence is greater than a threshold, adopting the current tracking result for the next frame and repeating steps 3) to 5); if the confidence is smaller than the threshold, abandoning the current tracking result, detecting the current frame with the gesture detection model, taking the detection result as the current tracking result, re-initializing the KCF tracker, repeating steps 3) to 5), and finally updating the foreground-background dictionary with the recognition result of the current frame.
Compared with the prior art, the invention has the following advantages:
1. combined with the SLIC algorithm, superpixel blocks are generated and features are extracted on the basis of the superpixel blocks; an SVM is used for coarse classification and a KNN over the foreground-background dictionary for fine classification, thereby realizing multi-scale detection;
2. a confidence function is designed by combining the perceptual hash algorithm, the FHOG-feature cosine similarity and the color-statistics cosine similarity, and whether the current result is credible is judged by comparing the similarity between the tracking results of the current frame and the previous frame, so that loss of the tracking target is avoided;
3. HOG features and color statistical features are extracted from the superpixel blocks; the HOG feature is invariant to illumination, scale and the like, while the color statistical feature is invariant to non-rigid deformation, rotation and rapid movement, so the two are complementary and together give the features better robustness;
4. the KCF position estimator and the scale estimator adopt FHOG + CN features, which are robust for gestures, and the multi-scale estimator adapts well to changes of the target scale.
Drawings
Fig. 1 shows a schematic flow diagram of an embodiment of the invention.
Fig. 2 shows a flow chart of the KNN-foreground-background dictionary algorithm according to an embodiment of the present invention.
FIG. 3 shows a flow chart of a foreground-background dictionary update algorithm of an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings:
As shown in Fig. 1, a KCF long-term gesture tracking method fused with the SLIC algorithm includes the following steps:
Step one: construct a gesture training data set, extract the superpixel blocks of each picture through the SLIC algorithm, and train an SVM model of the superpixel blocks offline to obtain the coarse classification model for gesture detection.
Specifically, the SLIC algorithm is a superpixel generation algorithm based on clustering; its specific steps are as follows:
1. Seed point (cluster center) initialization: distribute the seed points uniformly in the image according to the set number of superpixels. Assuming the picture has N pixels in total and is pre-divided into K superpixels of the same size, each superpixel covers N/K pixels and the distance (step length) between adjacent seed points is approximately S = sqrt(N/K), where sqrt(.) denotes the square root;
2. Reselect each seed point within an n x n neighborhood of its initial position (typically n = 3): calculate the gradient values of all pixels in the neighborhood and move the seed point to the position with the minimum gradient, which keeps seeds off edges and noisy pixels;
3. Assign a class label (i.e. which cluster center it belongs to) to each pixel in the neighborhood around each seed point. Unlike standard k-means, which searches the whole image, SLIC limits the search range to 2S x 2S, which accelerates convergence: the expected superpixel size is S x S, but the search range is 2S x 2S;
4. Distance measure, including the color distance and the spatial distance. For each searched pixel, its distance to the seed point is calculated as:

d_c = sqrt( (l_j - l_i)^2 + (a_j - a_i)^2 + (b_j - b_i)^2 )
d_s = sqrt( (x_j - x_i)^2 + (y_j - y_i)^2 )

where d_c is the color distance (in CIELAB space), d_s the spatial distance, and N_s the maximum spatial distance within a class, defined as N_s = S = sqrt(N/K) and the same for every cluster. The maximum color distance N_c differs from picture to picture and from cluster to cluster, so it is replaced by a fixed constant m (value range [1, 40], typically 10). The final distance measure D' is:

D' = sqrt( (d_c / m)^2 + (d_s / N_s)^2 )

Each pixel may be searched by several seed points, so it has a distance to each of them; the seed point giving the minimum distance is taken as that pixel's cluster center;
5. Iterative optimization. In theory the above steps are iterated until the error converges (i.e. the cluster center of every pixel no longer changes); in practice 10 iterations give a satisfactory result for most pictures, so the usual number of iterations is 10;
6. Enhance connectivity. Create a new label table whose elements are all -1; traverse in a Z-shaped order (from left to right, from top to bottom), reassigning discontinuous superpixels and superpixels of improper size to neighbouring superpixels, and assign each traversed pixel to the corresponding label until all points are traversed.
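As an illustration, the superpixel stage can be sketched with the SLIC implementation in scikit-image; the frame file name and the parameter values (n_segments playing the role of K, compactness the role of the constant m) are assumptions, not values fixed by the invention:

```python
import numpy as np
from skimage import io
from skimage.segmentation import slic

# Hedged sketch of the superpixel step: K = 200 pre-set superpixels,
# compactness m = 10, both assumed values.
frame = io.imread("frame_t.png")          # H x W x 3 RGB frame (hypothetical file)
labels = slic(frame, n_segments=200, compactness=10, start_label=0)

# labels[y, x] is the superpixel index r of pixel (x, y); the mask of
# superpixel s(r, t) is obtained by comparing against r:
mask_0 = labels == 0
print(labels.max() + 1, "superpixels;", mask_0.sum(), "pixels in superpixel 0")
```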
Specifically, the superpixel blocks of the picture to be detected are obtained through the SLIC algorithm. Assume the current picture to be detected is the t-th frame, s(r, t) is the r-th superpixel of the t-th frame, and T_t = {X_t, Y_t, W_t, H_t} is the gesture target box in the t-th frame image, where (X_t, Y_t) is the center coordinate of the gesture target and (W_t, H_t) its width and height. The superpixels that coincide with the target box are marked as foreground and the rest as background, so the label of the r-th superpixel can be expressed as:

l(r, t) = +1, if s(r, t) coincides with T_t (foreground)
l(r, t) = -1, otherwise (background)

After the superpixels are obtained, the label of each superpixel block is marked according to this formula, and the HOG feature and the color statistical feature of each superpixel block are extracted.

Since the number of pixels of different superpixel blocks is not always the same, assume the number of pixels of the r-th superpixel block s(r, t) of the t-th frame is num_s(r,t). Take the number of statistic bins of the HOG feature as 18, regard a superpixel block as one cell, calculate the gradient of each pixel in the cell, and count how many pixel gradients fall into each bin, so that one superpixel block yields an 18-dimensional HOG feature vector VecH_s(r,t), normalized as follows:

N_VecH_s(r,t) = VecH_s(r,t) / ||VecH_s(r,t)|| / num_s(r,t)
before the HOG features are extracted, the image is optically corrected using gamma algorithm and grayed.
The image gradient within a superpixel cell is calculated as follows:

G_x(x, y) = I(x + 1, y) - I(x - 1, y)
G_y(x, y) = I(x, y + 1) - I(x, y - 1)
G(x, y) = sqrt( G_x(x, y)^2 + G_y(x, y)^2 )
theta(x, y) = arctan( G_y(x, y) / G_x(x, y) )

where G_x is the gradient in the horizontal direction, G_y the gradient in the vertical direction, G(x, y) the gradient magnitude at pixel (x, y) of the cell, and theta(x, y) its phase angle;
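The per-superpixel 18-bin HOG statistic described above can be sketched as follows; this is an illustration under the stated formulas, with a central-difference gradient assumed, not the patented implementation:

```python
import numpy as np

def superpixel_hog(gray, mask, bins=18):
    """18-bin orientation histogram over one superpixel block s(r, t):
    gray is the gamma-corrected grayscale image, mask the boolean mask of
    the superpixel; the result is L2- and pixel-count-normalized (N_VecH)."""
    gx = np.zeros_like(gray, dtype=np.float64)
    gy = np.zeros_like(gray, dtype=np.float64)
    gx[:, 1:-1] = gray[:, 2:].astype(np.float64) - gray[:, :-2]   # G_x
    gy[1:-1, :] = gray[2:, :].astype(np.float64) - gray[:-2, :]   # G_y
    mag = np.hypot(gx, gy)                                        # G(x, y)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)                   # phase angle
    idx = np.minimum((ang / (2 * np.pi) * bins).astype(int), bins - 1)
    hist = np.bincount(idx[mask], weights=mag[mask], minlength=bins)
    return hist / (np.linalg.norm(hist) + 1e-12) / mask.sum()     # VecH -> N_VecH
```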
for the color statistical characteristics, the image is kept in an RGB mode, r, g, b components of the RGB image can be regularly divided into 64 parts, and values of r, g, b in the image are all (0, 255), so that:
wherein,to round down, and rdiv、gdivAnd bdivThe r, g and b components are respectively taken as block values;
establishing a statistical array count [64], carrying out statistics on 64 sections divided by r, g and b, wherein the corresponding index is as follows:
index=rdiv*4*4+gdiv*4+bdiv
=>count[index];
thus, by counting the number of colors, a 64-dimensional vector VecC can be obtaineds(r,t)It was normalized as follows:
N_VecCs(r,t)=VecCs(r,t)/||VecCs(r,t)||/nums(r,t)
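A corresponding sketch of the 64-bin color statistic (again an illustration of the formulas above, not the patented code):

```python
import numpy as np

def superpixel_color_hist(rgb, mask):
    """64-bin RGB statistic of one superpixel: each channel is quantized to
    4 levels (floor(c / 64)), index = rdiv*4*4 + gdiv*4 + bdiv, and the
    count vector gets the same double normalization as the HOG vector."""
    r = (rgb[..., 0][mask] // 64).astype(int)
    g = (rgb[..., 1][mask] // 64).astype(int)
    b = (rgb[..., 2][mask] // 64).astype(int)
    count = np.bincount(r * 16 + g * 4 + b, minlength=64).astype(np.float64)
    return count / (np.linalg.norm(count) + 1e-12) / mask.sum()   # VecC -> N_VecC
```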
then, the HOG features and the color statistical features are concatenated to obtain the final features:
Vecs(r,t)=[N_VecHs(r,t),N_VecCs(r,t)]
and finally, combining the final characteristics of the super pixels and the labels into a training sample set dataSet of the svm classifier { Vec ═ Vecs(r,t)And l (r, t) }, sending the sample set into an svm classifier, and training to obtain a parameter model of the svm classifier, wherein the svm classifier adopts a Gaussian kernel.
The SVM classifier is specifically as follows:

For a hyperplane:

f(x) = ω^T φ(x) + b

where ω is the weight, b the bias, φ(.) a nonlinear function and x the feature input, the following constrained problem is solved:

min_{ω,b} (1/2) ||ω||^2
s.t. y_i ( ω^T φ(x_i) + b ) ≥ 1, i = 1, ..., N

where y is the category label and N the number of samples.

Applying the Lagrange multiplier method gives:

L(ω, b, α) = (1/2) ||ω||^2 - Σ_{i=1..N} α_i [ y_i ( ω^T φ(x_i) + b ) - 1 ]

where α_i ≥ 0 is a Lagrange multiplier.

The hyperplane may then become:

f(x) = Σ_{i=1..N} α_i y_i K(x_i, x) + b

where K(x_i, x) = <φ(x_i) · φ(x)> is the kernel function, and the α_i are solved through the following dual problem:

max_α Σ_{i=1..N} α_i - (1/2) Σ_{i=1..N} Σ_{j=1..N} α_i α_j y_i y_j K(x_i, x_j)
s.t. α_i ≥ 0, i = 1, ..., N
Σ_{i=1..N} α_i y_i = 0

The above problem can be solved by the SMO algorithm.
Step two: extract the foreground and the background of the various gesture pictures from the gesture training data set, construct a foreground-background dictionary, and design the similarity function of the KNN algorithm by combining FHOG features and CN features, so as to complete the fine classification of gesture detection.
Specifically, after gamma correction and graying of the picture to be detected, the FHOG features are extracted in the following steps:
1. extract 9-dimensional HOG features with the cell as the unit: for example, define a cell as 4 x 4 pixels and accumulate those pixels into a histogram of 9 bins;
2. normalization and truncation: normalize and truncate the cell vectors obtained above. Let C(i, j) be the 9-dimensional feature vector of the (i, j)-th cell; its adjacent feature vectors are C(i + β, j), C(i, j + γ) and C(i + β, j + γ) with β, γ ∈ {-1, 1}. Define N_{β,γ} as:

N_{β,γ} = ( ||C(i,j)||^2 + ||C(i+β,j)||^2 + ||C(i+β,j+γ)||^2 + ||C(i,j+γ)||^2 )^(-1/2)

The 4 x 9-dimensional feature vector H(i, j) is then obtained by scaling C(i, j) with each of the four factors and truncating each component at a constant T (0.2 in Felzenszwalb's formulation):

H(i, j) = [ min( C(i, j) N_{β,γ}, T ) ]_{β,γ ∈ {-1,1}}

3. PCA-style dimensionality reduction: sum the obtained 4 x 9-dimensional feature matrix H(i, j) over its rows to obtain a 9-dimensional feature vector, sum over its columns to obtain a 4-dimensional feature vector, and splice them into a 13-dimensional feature vector;
4. extract 18-dimensional HOG features: obtain an 18-bin HOG feature with the cell as the unit, normalize and truncate it into a 4 x 18-dimensional feature matrix, and sum over its rows to obtain an 18-dimensional feature vector;
5. serially splice the 18-dimensional and 13-dimensional feature vectors to obtain the 31-dimensional FHOG feature vector.
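Steps 2 and 3 of this pipeline can be sketched as below; the truncation constant 0.2 is Felzenszwalb's value and is an assumption here, as is the exact pooling layout:

```python
import numpy as np

def fhog_normalize_truncate(C, trunc=0.2):
    """Sketch of the normalization-truncation and pooling above: C is an
    (H, W, 9) array of per-cell 9-bin histograms; for each interior cell
    the four neighbourhood factors N_{beta,gamma} are applied, components
    are truncated, and the 4 x 9 block is pooled into 9 row sums plus
    4 column sums (13 dims)."""
    energy = (C ** 2).sum(axis=2)                       # ||C(i,j)||^2 per cell
    H, W = energy.shape
    out = np.zeros((H - 2, W - 2, 13))
    for i in range(1, H - 1):
        for j in range(1, W - 1):
            block = np.empty((4, 9))
            for k, (b, g) in enumerate([(-1, -1), (-1, 1), (1, -1), (1, 1)]):
                n = np.sqrt(energy[i, j] + energy[i + b, j]
                            + energy[i + b, j + g] + energy[i, j + g]) + 1e-12
                block[k] = np.minimum(C[i, j] / n, trunc)
            out[i - 1, j - 1, :9] = block.sum(axis=0)   # sum over the 4 rows
            out[i - 1, j - 1, 9:] = block.sum(axis=1)   # sum over the 9 columns
    return out
```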
Specifically, when the CN features are extracted from the picture to be detected, each color is mapped into a 10-dimensional feature vector space. The extraction steps are as follows:
1. let the size of the image to be detected be width x height x 3 and divide each of the r, g and b components of the RGB image into 32 parts, i.e.:

r_div = floor(r / 8), g_div = floor(g / 8), b_div = floor(b / 8)

2. in the designed (32 x 32 x 32) x 10-dimensional feature mapping table, map each rgb pixel of the image to a 10-dimensional feature vector through the following index, finally obtaining a width x height x 10 tensor:

index = r_div * 32 * 32 + g_div * 32 + b_div;

3. expand the width x height x 10 tensor into a feature vector of dimension (width x height x 10) x 1.
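A sketch of the CN lookup, assuming the mapping table is van de Weijer's (32 x 32 x 32) x 10 color-names table loaded from a file (the file name is hypothetical):

```python
import numpy as np

cn_table = np.load("cn_table.npy")                # assumed shape (32768, 10)

def cn_features(rgb):
    """Map each RGB pixel to its 10-dim color-name vector via the index
    rdiv*32*32 + gdiv*32 + bdiv, then flatten to (width*height*10) x 1."""
    q = rgb.astype(int) // 8                      # 32 quantization levels per channel
    index = q[..., 0] * 32 * 32 + q[..., 1] * 32 + q[..., 2]
    return cn_table[index].reshape(-1, 1)         # width x height x 10, flattened
```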
Further, the FHOG features and the CN features are concatenated in series, and the KNN algorithm is applied over the constructed foreground-background dictionary.
Specifically, the steps of the KNN algorithm are as follows (flowchart in Fig. 2):
1. the amounts of foreground and background data in the constructed foreground-background dictionary are equal; in this method there are only two categories, foreground and background. Calculate the distance between the sample to be tested and the data of both categories; the distance function of the KNN adopts the Euclidean distance;
2. sort the distances between the sample to be tested and the foreground and background entries in increasing order;
3. select the K points with the smallest distances;
4. count the frequency of each category among the first K points;
5. return the category with the highest frequency among the first K points as the predicted classification of the sample to be tested.
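The fine classification can be sketched as a two-class KNN over the dictionary (K = 5 is an assumed value, since the text does not fix K):

```python
import numpy as np

def knn_fine_classify(sample, fg_feats, bg_feats, k=5):
    """Sketch of the KNN fine classification over the foreground-background
    dictionary: Euclidean distances to both classes, take the K nearest,
    majority vote. Returns +1 for foreground, -1 for background."""
    dists = np.concatenate([
        np.linalg.norm(fg_feats - sample, axis=1),    # distances to foreground
        np.linalg.norm(bg_feats - sample, axis=1),    # distances to background
    ])
    labels = np.concatenate([np.ones(len(fg_feats)), -np.ones(len(bg_feats))])
    nearest = labels[np.argsort(dists)[:k]]           # K smallest distances
    return 1 if (nearest == 1).sum() >= (nearest == -1).sum() else -1
```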
Step three: obtain the gesture detection model from the coarse classification model and the fine classification of gesture detection, and detect the target with the gesture detection model to obtain the detection box of the target gesture. Initialize the KCF filter with the detection box of the target gesture and then estimate the target gesture of the next frame with the KCF filter, where the KCF filter adopts the FHOG and CN features.
Specifically, the KCF filter solves a ridge regression function:

min_α Σ_i ( f(x_i) - y_i )^2 + λ ||α||^2

where λ is the penalty factor, α the weight parameter and y the regression value.

1. The training process solves the Fourier transform fft(α) of the parameter α:

fft(α) = fft(y) ./ ( fft(K^{xx}) + λ );

2. the detection process solves the detection response:

response = ifft( fft(α) .* fft(K^{xz}) );

3. the kernel correlation K^{xx'} is solved as:

K^{xx'} = φ( ifft( fft(x) .* fft(x') ) )^T

where fft(.) is the Fourier transform, ifft(.) the inverse Fourier transform, φ(.) a nonlinear function and K the kernel function;
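In code, the two Fourier-domain formulas are one line each; the sketch below assumes the kernel correlations Kxx and Kxz have already been computed and transformed, and the value of λ is an assumption:

```python
import numpy as np

def kcf_train(kxx_hat, y_hat, lam=1e-4):
    """fft(alpha) = fft(y) ./ (fft(Kxx) + lambda); lam is an assumed value."""
    return y_hat / (kxx_hat + lam)

def kcf_detect(alpha_hat, kxz_hat):
    """response = ifft(fft(alpha) .* fft(Kxz)); the response peak gives the
    predicted displacement of the target gesture."""
    response = np.real(np.fft.ifft2(alpha_hat * kxz_hat))
    return np.unravel_index(np.argmax(response), response.shape)
```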
step four: the rectangular box of the most suitable target gesture is estimated using a designed target scale estimator that takes FHOG and CN features as input.
Specifically, the target scale estimator uses a one-dimensional KCF filter, which solves the following optimal filter:

ε = || Σ_{l=1..d} h^l ⋆ f^l - g ||^2 + λ Σ_{l=1..d} ||h^l||^2

where l ∈ {1, 2, ..., d} indexes the d image blocks extracted at different scales near the center of the gesture target of the previous frame, g is the Gaussian response function given according to the distance between each image block and the target center, h is the designed scale estimator, f the corresponding image feature and λ a penalty factor.

Writing the frequency responses of h, f and g as H, F and G, the above can be solved to give the scale estimator:

H^l = ( conj(G) F^l ) / ( Σ_{k=1..d} conj(F^k) F^k + λ )

where F is the frequency response of the image feature f and conj(F) its conjugate, H the frequency response of the scale estimator h and conj(H) its conjugate, λ the penalty factor, d the number of extracted image blocks and l ∈ {1, 2, ..., d}.
From the above equation, the following two processes are obtained:
1. Prediction process of the scale estimator: centered on the position estimate obtained in step three, extract 33 image blocks at different scales from the picture of the t-th frame, extract their FHOG and CN features, and use them as the input of the scale estimator:

y = ifft( ( Σ_{l=1..d} conj(A^l) Z^l ) / ( B + λ ) )

where Z_t holds the FHOG and CN features of the 33 image blocks extracted from the picture of the t-th frame, A and B are two parameters to be determined, obtained by the update below, and conj(.) denotes the conjugate; the scale with the largest response y is taken as the new target scale.
2. Update process: after the prediction target is obtained in the current frame, extract 33 image blocks at different scales near the center of the gesture target of the current t-th frame picture, extract their FHOG and CN features as the input of the scale estimator, and update the scale estimator parameters through the following process:

A_t^l = (1 - η) A_{t-1}^l + η conj(G_t) F_t^l
B_t = (1 - η) B_{t-1} + η Σ_{k=1..d} conj(F_t^k) F_t^k

where η is the parameter adjustment factor.
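A sketch of the scale estimator's predict/update cycle over the 33 scales (the η and λ values are assumptions):

```python
import numpy as np

def scale_predict(A, B, Z, lam=1e-2):
    """y = ifft( sum_l conj(A^l) * Z^l / (B + lambda) ); the argmax over
    the 33 per-scale responses selects the new target size."""
    num = sum(np.conj(a) * z for a, z in zip(A, Z))
    y = np.real(np.fft.ifft(num / (B + lam)))
    return int(np.argmax(y))

def scale_update(A_prev, B_prev, F, G_hat, eta=0.025):
    """A_t^l = (1-eta) A_{t-1}^l + eta * conj(G) * F^l
       B_t   = (1-eta) B_{t-1}   + eta * sum_k conj(F^k) * F^k"""
    A = [(1 - eta) * a + eta * np.conj(G_hat) * f for a, f in zip(A_prev, F)]
    B = (1 - eta) * B_prev + eta * sum(np.conj(f) * f for f in F)
    return A, B
```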
Step five: determine whether the current tracking result is credible by comparing the similarity between the tracking results of the current frame and the previous frame, using a confidence function designed from a perceptual hash algorithm, the FHOG-feature cosine similarity and the color-statistics cosine similarity. If the confidence is greater than a threshold, adopt the current tracking result for the next frame and repeat steps three to five; if the confidence is smaller than the threshold, abandon the current tracking result, detect the current frame with the gesture detector, take the detection result as the current tracking result, re-initialize the KCF tracker, repeat steps three to five, discard part of the data of the foreground-background dictionary according to a certain random function, and extract the foreground and background data of the current frame as a supplement.
Specifically, for the perceptual hash algorithm, the steps are as follows:
1. correcting two pictures to be compared by using a gamma correction algorithm;
2. resize the two compared pictures to 16 x 16 by interpolation or sampling;
3. carrying out graying treatment on the two pictures with the reset sizes;
4. expand the two 16 x 16 pictures row by row into 256-dimensional vectors vecHash_src and vecHash_dst, and calculate the average pixel values vecHash_src_avg and vecHash_dst_avg of each vector:

vecHash_src_avg = (1/256) Σ_{i=1..256} vecHash_src_i
vecHash_dst_avg = (1/256) Σ_{i=1..256} vecHash_dst_i
5. comparing the element value of the vector vecHash _ src with the magnitude of vecHash _ src _ avg and comparing the element value of the vector vecHash _ dst with the magnitude of vecHash _ dst _ avg, and encoding the image to obtain vecHash _ src _ code and vecHash _ dst _ code:
vecHash_src_codei=vecHash_srci≥vecHash_src_avg?1:0
vecHash_dst_codei=vecHash_dsti≥vecHash_dst_avg?1:0;
6. calculate the similarity of the codes: compare the elements of vecHash_src_code and vecHash_dst_code one by one and denote the number of equal elements by similarNum; the similarity of the perceptual hash algorithm is then given by:

similarPercent = similarNum / 256
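The six steps can be sketched with OpenCV; the gamma value 0.5 is an assumption (the text does not fix it):

```python
import cv2
import numpy as np

def phash_similarity(img_src, img_dst):
    """Sketch of the average-hash similarity above: gamma-correct, resize
    to 16 x 16, gray, threshold against the mean, compare 256-bit codes."""
    codes = []
    for img in (img_src, img_dst):
        img = np.power(img / 255.0, 0.5)                   # gamma correction (assumed gamma)
        small = cv2.resize((img * 255).astype(np.uint8), (16, 16))
        gray = cv2.cvtColor(small, cv2.COLOR_BGR2GRAY).flatten()
        codes.append(gray >= gray.mean())                  # 256-dim binary code
    return (codes[0] == codes[1]).sum() / 256.0            # similarPercent
```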
specifically, given the FHOG feature and color statistics feature vectors featureVec1 and featureVec2 for two pictures, the cosine similarity of the two pictures is calculated as follows:
cosSimilar=featureVec1*featureVec2/(||featureVec1||*||featureVec2||)
the extraction of FHOG features and the extraction of color statistics are the same as the color statistics described in step one and the FHOG feature extraction process described in step two.
In particular, the confidence is calculated in combination with the perceptual hashing algorithm, the FHOG feature cosine similarity and the color statistics feature in the following way:
setting the similarity obtained by the perceptual hash algorithm as hashSimilar, the cosine similarity obtained from the FHOG features as fhogCosSimilar, and the similarity obtained from the color statistical features as colorCosSimilar;
calculating the similarity of the two pictures according to a certain weight:
similar = α1 × hashSimilar + α2 × fhogCosSimilar + α3 × colorCosSimilar.
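A sketch of the confidence computation; the weights α1-α3 are assumptions, since the text leaves them unspecified:

```python
import numpy as np

def cos_similar(v1, v2):
    """cosSimilar = v1·v2 / (||v1|| * ||v2||)."""
    return float(v1 @ v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)

def confidence(hash_similar, fhog_cos, color_cos, a1=0.4, a2=0.3, a3=0.3):
    """similar = a1*hashSimilar + a2*fhogCosSimilar + a3*colorCosSimilar;
    the weight values are assumptions (the patent leaves them unspecified)."""
    return a1 * hash_similar + a2 * fhog_cos + a3 * color_cos
```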
specifically, for updating the foreground-background dictionary data, the steps are as follows, and the specific flow is as shown in fig. 3:
1. the foreground-background dictionary stores the FHOG and CN feature vectors of the gesture target and of background pictures, with equal amounts of the two types. Let the amount of data in the foreground-background dictionary be num_data and set a quantity threshold num_threshold. If num_data < num_threshold, cut out the target gesture picture in the current frame using the tracking or detection result, resize it to 256 x 256, extract its FHOG and CN features and store them in the foreground data set; likewise, intercept background pictures outside the target gesture with extraction boxes of the same size as the recognition result, resize them to 256 x 256, extract their FHOG and CN features and store them in the background data set. If num_data ≥ num_threshold, update through the following step 2;
2. the data stored in the foreground-background dictionary are arranged by sequence number; using a random function, each record of the foreground and the background is discarded randomly with probability 1/num_data, and the data are then supplemented in the same way as in the num_data < num_threshold case.
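The update logic can be sketched as below, assuming the dictionary is held as numpy arrays and num_threshold = 500 (an assumed value):

```python
import numpy as np

def update_dictionary(fg_set, bg_set, fg_new, bg_new, num_threshold=500):
    """Sketch of the foreground-background dictionary update: below the
    threshold, append the current frame's features; at or above it, first
    drop each stored record with probability 1/num_data."""
    num_data = len(fg_set)
    if num_data >= num_threshold:
        keep = np.random.rand(num_data) >= 1.0 / num_data   # random discard
        fg_set, bg_set = fg_set[keep], bg_set[keep]
    fg_set = np.vstack([fg_set, fg_new])                    # supplement with
    bg_set = np.vstack([bg_set, bg_new])                    # current-frame data
    return fg_set, bg_set
```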
The above examples of the invention are merely intended to illustrate the invention clearly and do not limit its embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description; they need not, and cannot, be listed exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the invention shall fall within the protection scope of the claims of the invention.
Claims (9)
1. A KCF long-term gesture tracking method fused with an SLIC algorithm is characterized by comprising the following steps:
1) constructing a gesture training data set, extracting superpixel blocks of the picture through a SLIC algorithm, and training an SVM model of the superpixel blocks in an off-line mode to obtain a coarse classification model of gesture detection;
2) extracting the foreground and the background of various gesture pictures from the gesture training data set, constructing a foreground-background dictionary, and designing a similarity function of a KNN algorithm by combining FHOG characteristics and CN characteristics so as to complete fine classification of gesture detection;
3) obtaining a gesture detection model through the coarse classification model of the gesture detection and the fine classification of the gesture detection, and detecting a target by using the gesture detection model to obtain a detection frame of the target gesture; initializing a KCF filter by using a detection box of the target gesture, and then estimating the target gesture of the next frame by using the KCF filter, wherein the KCF filter takes FHOG characteristics and CN characteristics as input;
4) estimating an optimal rectangular box of the target gesture by using a designed target scale estimator, wherein the target scale estimator adopts FHOG characteristics and CN characteristics as input;
5) determining whether the current tracking result is credible by comparing the similarity between the tracking results of the current frame and the previous frame, using a confidence function designed from a perceptual hash algorithm, the FHOG-feature cosine similarity and the color-statistics cosine similarity; if the confidence is greater than a threshold, adopting the current tracking result for the next frame and repeating steps 3) to 5); if the confidence is smaller than the threshold, abandoning the current tracking result, detecting the current frame with the gesture detection model, taking the detection result as the current tracking result, re-initializing the KCF tracker, repeating steps 3) to 5), and finally updating the foreground-background dictionary with the recognition result of the current frame.
2. The KCF long-term gesture tracking method fused with the SLIC algorithm of claim 1, wherein in step 1), extracting the superpixel blocks of the picture through the SLIC algorithm and training the SVM model of the superpixel blocks offline specifically comprises:
step 2.1) obtaining the superpixel blocks of the picture to be detected through the SLIC algorithm; assuming the current picture to be detected is the t-th frame, s(r, t) is the r-th superpixel of the t-th frame, and T_t = {X_t, Y_t, W_t, H_t} is the gesture target box in the t-th frame image, where (X_t, Y_t) is the gesture target center and (W_t, H_t) the width and height of the gesture target; marking the superpixels that overlap the target box as foreground and the others as background, so that the label of the r-th superpixel can be expressed as:

l(r, t) = +1, if s(r, t) overlaps T_t (foreground)
l(r, t) = -1, otherwise (background)

step 2.2) after the superpixels are obtained, extracting, according to the superpixel labels, the HOG feature N_VecH_s(r,t) and the color statistical feature N_VecC_s(r,t) of each superpixel block;
step 2.3) concatenating the HOG feature and the color statistical feature in series to obtain the final feature:

Vec_s(r,t) = [N_VecH_s(r,t), N_VecC_s(r,t)];

step 2.4) combining the final features and labels of the superpixels into the training sample set dataSet = {Vec_s(r,t), l(r, t)} of the SVM classifier, feeding the sample set into the SVM classifier, and training it to obtain the parameter model of the SVM classifier.
3. The KCF long-term gesture tracking method fused with the SLIC algorithm of claim 2, wherein in step 2.2), the specific process of extracting the HOG feature of each superpixel block according to the superpixel labels after the superpixels are obtained comprises:
since the numbers of pixels of different superpixel blocks may differ, assuming the number of pixels of the r-th superpixel block s(r, t) of the t-th frame is num_s(r,t), taking the number of statistic bins of the HOG feature as 18, regarding a superpixel block as one cell, and calculating the gradient of each pixel in the cell:

G_x(x, y) = I(x + 1, y) - I(x - 1, y)
G_y(x, y) = I(x, y + 1) - I(x, y - 1)
G(x, y) = sqrt( G_x(x, y)^2 + G_y(x, y)^2 )
theta(x, y) = arctan( G_y(x, y) / G_x(x, y) )

where G_x is the gradient in the horizontal direction, G_y the gradient in the vertical direction, G(x, y) the gradient magnitude within the cell and theta(x, y) its phase angle;
counting the number of pixel gradients in the cell falling into each bin, so that one superpixel block yields an 18-dimensional HOG feature vector VecH_s(r,t), normalized as follows:

N_VecH_s(r,t) = VecH_s(r,t) / ||VecH_s(r,t)|| / num_s(r,t).
4. The KCF long-term gesture tracking method fused with the SLIC algorithm of claim 2, wherein in step 2.2), before the HOG features are extracted, the gamma algorithm is used to optically correct the image, and the image is grayed.
5. The KCF long-term gesture tracking method fused with the SLIC algorithm of claim 2, wherein in step 2.2), the specific process of extracting the color statistical feature of each superpixel block according to the superpixel labels after the superpixels are obtained comprises:
for the color statistical features, the image is kept in RGB mode and the r, g, b components of the RGB image are regularly quantized into 64 bins; since the values of r, g, b in the image all lie in (0, 255):

r_div = floor(r / 64), g_div = floor(g / 64), b_div = floor(b / 64)

where floor(.) rounds down, and r_div, g_div and b_div are the block values of the r, g and b components respectively;
a statistical array count[64] is established to accumulate the 64 sections spanned by r, g and b, with the corresponding index:

index = r_div * 4 * 4 + g_div * 4 + b_div
=> count[index];

the 64-dimensional vector VecC_s(r,t) is obtained by counting the colors and normalized as follows:

N_VecC_s(r,t) = VecC_s(r,t) / ||VecC_s(r,t)|| / num_s(r,t).
6. the method for tracking the KCF long-term gesture fused with the SLIC algorithm, as claimed in claim, wherein the specific process of the step 2) is as follows:
step 3.1) extracting the foreground and the background of various gesture pictures from the gesture training data set, constructing a foreground-background dictionary, wherein the foreground data amount and the background data amount in the constructed foreground-background dictionary are equal, only the categories are divided into a foreground category and a background category, the distance between a sample to be detected and the two categories of data is calculated, and the distance function of the KNN algorithm adopts the Euclidean distance:
step 3.2) sequencing the distance between the sample to be tested and the foreground and the background according to the increasing relationship;
step 3.3) selecting K points with the minimum distance;
step 3.4) determining the occurrence frequency of the category where the front K points are located;
and 3.5) returning the category with the highest occurrence frequency in the previous K points as the prediction classification of the sample to be detected.
7. The KCF long-term gesture tracking method fused with the SLIC algorithm of claim 1, wherein in step 4), the target scale estimator adopts a one-dimensional KCF filter, which solves the following optimal filter:

ε = || Σ_{l=1..d} h^l ⋆ f^l - g ||^2 + λ Σ_{l=1..d} ||h^l||^2

where l ∈ {1, 2, ..., d} indexes the d image blocks extracted at different scales near the center of the gesture target of the previous frame picture, g is the Gaussian response function given according to the distance between each image block and the target center, h is the designed scale estimator, f the corresponding image feature and λ a penalty factor; writing the frequency responses of h, f and g as H, F and G, the above can be solved to give the scale estimator:

H^l = ( conj(G) F^l ) / ( Σ_{k=1..d} conj(F^k) F^k + λ )

where F is the frequency response of the image feature f and conj(F) its conjugate, H the frequency response of the scale estimator h and conj(H) its conjugate, λ the penalty factor, d the number of extracted image blocks and l ∈ {1, 2, ..., d}.
8. The KCF long-term gesture tracking method fused with the SLIC algorithm of claim 1, wherein in step 5), the specific process of the confidence function designed by combining the perceptual hash algorithm, the FHOG-feature cosine similarity and the color-statistics cosine similarity is as follows:
step 4.1) inputting two pictures, obtaining the similarity hashSimilar through the perceptual hash algorithm, calculating the FHOG features to obtain the FHOG cosine similarity fhogCosSimilar, and calculating the color statistical features to obtain the color cosine similarity colorCosSimilar;
step 4.2) calculating the similarity of the two pictures according to certain weights:
similar = α1 × hashSimilar + α2 × fhogCosSimilar + α3 × colorCosSimilar.
9. The KCF long-term gesture tracking method fused with the SLIC algorithm of claim 8, wherein in step 5), the specific process of updating the foreground-background dictionary using the current frame recognition result is as follows:
step 5.1) the foreground-background dictionary stores the FHOG and CN feature vectors of the gesture target and of the background picture, with equal amounts of the two types; assuming the amount of data in the foreground-background dictionary is num_data, a quantity threshold num_threshold is set;
step 5.2) if num_data < num_threshold, the target gesture picture in the current frame is cut out using the tracking or detection result, resized to 256 x 256, and its FHOG and CN features are extracted and stored in the foreground data set; background pictures outside the target gesture are intercepted with an extraction box of the same size as the recognition result, resized to 256 x 256, and their FHOG and CN features are extracted and stored in the background data set;
step 5.3) if num_data ≥ num_threshold, the data stored in the foreground-background dictionary are arranged by sequence number; using a random function, each record of the foreground and the background is randomly discarded with probability 1/num_data, and the data are then supplemented in the manner of step 5.2) for num_data < num_threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910184848.7A CN110008844B (en) | 2019-03-12 | 2019-03-12 | KCF long-term gesture tracking method fused with SLIC algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910184848.7A CN110008844B (en) | 2019-03-12 | 2019-03-12 | KCF long-term gesture tracking method fused with SLIC algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110008844A true CN110008844A (en) | 2019-07-12 |
CN110008844B CN110008844B (en) | 2023-07-21 |
Family
ID=67166900
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910184848.7A Active CN110008844B (en) | 2019-03-12 | 2019-03-12 | KCF long-term gesture tracking method fused with SLIC algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110008844B (en) |
- 2019-03-12: Application CN201910184848.7A filed in China; granted as CN110008844B (status: Active)
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160342837A1 (en) * | 2015-05-19 | 2016-11-24 | Toyota Motor Engineering & Manufacturing North America, Inc. | Apparatus and method for object tracking |
CN105825502A (en) * | 2016-03-12 | 2016-08-03 | 浙江大学 | Saliency-guidance-based weak supervision image analysis method of dictionary learning |
WO2018045626A1 (en) * | 2016-09-07 | 2018-03-15 | 深圳大学 | Super-pixel level information fusion-based hyperspectral image classification method and system |
CN107123130A (en) * | 2017-03-06 | 2017-09-01 | 华南理工大学 | Kernel correlation filtering target tracking method based on superpixel and hybrid hash |
CN107527054A (en) * | 2017-09-19 | 2017-12-29 | 西安电子科技大学 | Prospect extraction method based on various visual angles fusion |
CN108876818A (en) * | 2018-06-05 | 2018-11-23 | 国网辽宁省电力有限公司信息通信分公司 | A kind of method for tracking target based on like physical property and correlation filtering |
CN109034193A (en) * | 2018-06-20 | 2018-12-18 | 上海理工大学 | Multiple features fusion and dimension self-adaption nuclear phase close filter tracking method |
Non-Patent Citations (4)
Title |
---|
LIU Yuqing et al.: "Online discriminative superpixel tracking algorithm", Journal of Xidian University (in Chinese) *
KE Junmin et al.: "Long-term kernelized correlation filter tracking algorithm fused with color features", Computer Systems & Applications (in Chinese) *
FAN Wenbing et al.: "Adaptive correlation filter tracking algorithm with multi-feature fusion", Computer Engineering and Applications (in Chinese) *
HAO Shaohua et al.: "Kernelized correlation target tracking algorithm based on candidate region detection", Video Engineering (in Chinese) *
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110807473A (en) * | 2019-10-12 | 2020-02-18 | 浙江大华技术股份有限公司 | Target detection method, device and computer storage medium |
CN110807473B (en) * | 2019-10-12 | 2023-01-03 | 浙江大华技术股份有限公司 | Target detection method, device and computer storage medium |
CN111292355A (en) * | 2020-02-12 | 2020-06-16 | 江南大学 | Nuclear correlation filtering multi-target tracking method fusing motion information |
CN112926693A (en) * | 2021-04-12 | 2021-06-08 | 辽宁工程技术大学 | Kernel correlation filtering algorithm for fast motion and motion blur |
CN112926693B (en) * | 2021-04-12 | 2024-05-24 | 辽宁工程技术大学 | Nuclear related filtering method for fast motion and motion blur |
CN112991394A (en) * | 2021-04-16 | 2021-06-18 | 北京京航计算通讯研究所 | KCF target tracking method based on cubic spline interpolation and Markov chain |
CN112991394B (en) * | 2021-04-16 | 2024-01-19 | 北京京航计算通讯研究所 | KCF target tracking method based on cubic spline interpolation and Markov chain |
CN113608618A (en) * | 2021-08-11 | 2021-11-05 | 兰州交通大学 | Hand region tracking method and system |
CN114821764A (en) * | 2022-01-25 | 2022-07-29 | 哈尔滨工程大学 | Gesture image recognition method and system based on KCF tracking detection |
Also Published As
Publication number | Publication date |
---|---|
CN110008844B (en) | 2023-07-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110008844B (en) | KCF long-term gesture tracking method fused with SLIC algorithm | |
CN108470354B (en) | Video target tracking method and device and implementation device | |
CN107633226B (en) | Human body motion tracking feature processing method | |
CN111898547A (en) | Training method, device and equipment of face recognition model and storage medium | |
CN109522908A (en) | Image significance detection method based on area label fusion | |
US9349194B2 (en) | Method for superpixel life cycle management | |
CN106228121B (en) | Gesture feature recognition method and device | |
Zhang et al. | Road recognition from remote sensing imagery using incremental learning | |
CN107688829A (en) | A kind of identifying system and recognition methods based on SVMs | |
CN110866896A (en) | Image saliency target detection method based on k-means and level set super-pixel segmentation | |
WO2019007253A1 (en) | Image recognition method, apparatus and device, and readable medium | |
CN108509925B (en) | Pedestrian re-identification method based on visual bag-of-words model | |
CN110751027B (en) | Pedestrian re-identification method based on deep multi-instance learning | |
CN111583279A (en) | Super-pixel image segmentation method based on PCBA | |
CN109241816B (en) | Image re-identification system based on label optimization and loss function determination method | |
CN105550641B (en) | Age estimation method and system based on multi-scale linear differential texture features | |
CN113888586B (en) | Target tracking method and device based on correlation filtering | |
Etezadifar et al. | A new sample consensus based on sparse coding for improved matching of SIFT features on remote sensing images | |
CN108428220A (en) | Satellite sequence remote sensing image sea island reef region automatic geometric correction method | |
CN112329784A (en) | Correlation filtering tracking method based on space-time perception and multimodal response | |
CN116704490B (en) | License plate recognition method, license plate recognition device and computer equipment | |
CN108921872B (en) | Robust visual target tracking method suitable for long-range tracking | |
WO2015146113A1 (en) | Identification dictionary learning system, identification dictionary learning method, and recording medium | |
CN114387592B (en) | Character positioning and identifying method under complex background | |
CN110827327B (en) | Fusion-based long-term target tracking method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |