CN110569814B - Video category identification method, device, computer equipment and computer storage medium - Google Patents
Video category identification method, device, computer equipment and computer storage medium
- Publication number
- CN110569814B CN110569814B CN201910862697.6A CN201910862697A CN110569814B CN 110569814 B CN110569814 B CN 110569814B CN 201910862697 A CN201910862697 A CN 201910862697A CN 110569814 B CN110569814 B CN 110569814B
- Authority
- CN
- China
- Prior art keywords
- space
- time
- model
- video
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The application discloses a video category identification method, a video category identification device, computer equipment and a computer storage medium, and belongs to the field of video identification. The method comprises: acquiring video data to be identified; and inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature (3D-VLAD) model and a classification recognition model. The space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; the 3D-VLAD model takes as input the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers to obtain space-time local aggregation description features, and the classification recognition model obtains a classification result according to the space-time local aggregation description features. This solves the problem of low classification accuracy in the prior art and achieves the effect of improving accuracy.
Description
Technical Field
The present application relates to the field of video identification, and in particular, to a method and apparatus for identifying video categories, a computer device, and a computer storage medium.
Background
At present, video big data is growing explosively, and video-based content has become a major trend in the development of the Internet. Recognition techniques that classify videos are therefore particularly important.
In one video category recognition method, the video data to be recognized is input into a space-time convolutional neural network (3D Convolutional Neural Network, 3D-CNN) model, the feature map output by the last layer of the model is obtained, and the feature map is then input into a classification model to obtain a classification result.
However, this method has a poor ability to capture fine motion changes in the video data, so the accuracy of the classification result is low.
Disclosure of Invention
The embodiments of the application provide a video category identification method, a video category identification device, computer equipment and a computer storage medium, which can solve the problem in the related art that the accuracy of the classification result is low because fine motion changes in video data are poorly captured. The technical scheme is as follows:
according to a first aspect of the present application, there is provided a video category identification method, the method comprising:
acquiring video data to be identified;
inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature (3D-VLAD) model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model takes as input the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers to obtain space-time local aggregation description features, and the classification recognition model obtains a classification result according to the space-time local aggregation description features;
and obtaining a classification result of the video data to be identified, which is output by the video classification model.
Optionally, before the obtaining the video data to be identified, the method further includes:
obtaining a model training sample set, wherein the model training sample set comprises a plurality of types of video sets, and each type of video set comprises a plurality of video data;
optimizing the video classification model through the model training sample set;
and stopping optimizing when the video classification model converges.
Optionally, the optimizing the video classification model through the model training sample set includes:
and taking the model training sample set as training data, and optimizing the video classification model according to a loss function and a gradient descent method.
Optionally, before the model training sample set is used as training data and the video classification model is optimized according to a loss function and a gradient descent method, the method further includes:
and performing data expansion on the model training sample set through a data augmentation method that dynamically and randomly adjusts the number of extracted frames and the frame-extraction interval.
Optionally, the space-time convolutional neural network layer comprises the following formulas:

O = {O_j | j = 1, 2, ..., n_O}

O_j = f( Σ_{i=1}^{n_I} I_i ⊛ W_ij + b_j )

where I_i is the i-th space-time feature map of the input I of the space-time convolutional neural network layer; O is the output of the space-time convolutional neural network layer, and O_j is the j-th space-time feature map of O; W_ij is the convolution kernel connecting I_i with O_j; n_I is the number of space-time feature maps input to the space-time convolutional neural network layer, and n_O is the number of space-time feature maps output by the space-time convolutional neural network layer; b_j is the bias parameter of O_j; f(·) is the activation function; and ⊛ denotes the 3D convolution operation;
the space-time maximum pooling layer comprises the following formulas:

Y = {y_m | m = 1, 2, ..., N}

y_m^(i, j, t) = max{ O_m^(i+r_1, j+r_2, t+r_3) | 0 ≤ r_1 < k_1, 0 ≤ r_2 < k_2, 0 ≤ r_3 < k_3 }

where Y is the feature tensor output by the space-time maximum pooling layer; O_m^(i+r_1, j+r_2, t+r_3) is the feature value at the (i+r_1)-th frame, (j+r_2)-th row and (t+r_3)-th column of the m-th space-time feature map O_m of O; y_m^(i, j, t) is the feature value at the i-th frame, j-th row and t-th column of the m-th space-time feature map y_m in Y; p_1, p_2, p_3 are the dimensions of O_m; and k_1, k_2, k_3 are the dimensions of the pooling kernel of the space-time maximum pooling layer.
Optionally, Y is a feature tensor with dimensions N×S×H×D, where S is the width of the space-time feature map, H is the height of the space-time feature map, and D is the number of channels of the space-time feature map, and the 3D-VLAD model is used for:

converting Y into a feature map M with dimensions L×D, and converting the feature map M into a feature matrix G with dimensions K×D through conversion formulas, wherein the conversion formulas comprise:
Z = M·W + B

A = softmax(Z)

G = Aᵀ·M − sum(A, 1)ᵀ ⊙ Q

where W and B are the parameters of a full-connection layer with K output neurons, and Z denotes the output of that full-connection layer; softmax(·) is the normalized exponential function, and A is the output of the normalized exponential function; sum(·, 1) denotes summing the columns of a matrix; ⊙ denotes the element-wise (Hadamard) product between matrices; Aᵀ is the transposed matrix of matrix A; Q is a cluster-center matrix parameter with dimensions K×D;
reshaping the feature matrix G into a feature vector of length K·D;

and applying L2-norm normalization and one full-connection layer to the feature vector of length K·D to obtain the space-time local aggregation description feature.
Splicing the plurality of space-time local aggregation description features v obtained by the 3D-VLAD layers of the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers into a space-time local aggregation description fusion feature vector V = [v_1, v_2, ..., v_n].
Optionally, the classification recognition model is used for:
sequentially passing the space-time local aggregation description fusion feature vector V through three full-connection layers, wherein the number of neurons of the last full-connection layer in the three full-connection layers is C, and C is the number of video categories in the model training sample set;
determining the classification result by using the output values of the last full-connection layer and a probability formula, wherein the probability formula comprises:

p(o_t) = e^{o_t} / Σ_{k=1}^{C} e^{o_k}

where p(o_t) is the probability value that the video data to be identified belongs to the t-th category, o_t denotes the t-th output value of the last full-connection layer, o_k denotes the k-th output value of the last full-connection layer, and e denotes the natural constant.
In another aspect, there is provided a video category identification apparatus, the apparatus comprising:
the data acquisition module is used for acquiring video data to be identified;
the data processing module is used for inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature (3D-VLAD) model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking as input the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features;
and the result acquisition module is used for acquiring the classification result of the video data to be identified, which is output by the video classification model.
In one aspect, a computer device is provided, the computer device including a processor and a memory, where the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the video category identification method described above.
In one aspect, a computer storage medium having instructions stored therein that, when executed on a computer, cause the computer to perform the video category identification method described above is provided.
The technical solutions provided by the embodiments of the application have at least the following beneficial effects:
the method comprises the steps that video data to be identified are acquired and input into a video classification model, the video classification model comprises a space-time convolution neural network model, a space-time local aggregation description feature 3D-VLAD model and a classification recognition model which are sequentially connected, wherein the space-time convolution neural network model comprises a plurality of space-time convolution neural network layers and a plurality of space-time maximum pooling layers which are sequentially stacked, after the video data to be identified are input into the space-time convolution neural network model, the 3D-VLAD model is used for taking a space-time feature map output by a designated number of the reciprocal space-time maximum pooling layers in the plurality of space-time maximum pooling layers as input, so that space-time local aggregation description features are obtained, and the classification recognition model is used for obtaining classification results according to the space-time local aggregation description features; and obtaining a classification result of the video data to be identified, which is output by the video classification model. And (3) carrying out space-time local aggregation description on the plurality of space-time feature graphs, and inputting a classification recognition model to obtain a finer classification result. The problem that in the prior art, the capturing capability of fine motion change in video data is poor, and the accuracy of classification results is low is solved, and the effect of improving the accuracy is achieved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a video category identification method provided by an embodiment of the present application;
FIG. 2 is a flowchart of another video category identification method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a video classification model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the Block model of FIG. 3;
fig. 5 is a schematic structural diagram of a video category recognition device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
In the prior art, recognition techniques for classifying videos roughly include the following methods:
the identification method of the double-flow 2D convolutional neural network comprises the following steps: and respectively training two independent models of a space 2D convolutional neural network based on an RGB (red green blue) diagram and a time sequence 2D convolutional neural network based on a light flow diagram, and fusing the outputs of the two convolutional neural network models to obtain a final recognition result. However, in the method, a great amount of calculation power and time are consumed for extracting the optical flow, the double-flow network is used for separate and independent training, no space-time information interaction exists in the training process, and the characteristics on space and time sequence cannot be well fused; and because the space network adopts a key single-frame RGB image in the video segment, modeling of a long-range time context cannot be performed.
The recognition method based on the long short-term memory (LSTM) network: a trained 2D convolutional neural network (CNN) is first used to extract spatial features from the video frames, and an LSTM network is then used to model the temporal context of the extracted spatial features. However, feature extraction is completed in two separate stages without end-to-end joint training, and this method performs poorly at capturing short-term, fine-grained temporal relationships.
The recognition method based on the space-time 3D convolutional neural network: the video data to be identified is input into a space-time convolutional neural network model, the feature map output by the last layer of the model is obtained, and the feature map is then input into a classification model to obtain a classification result. However, only the last-layer feature map is input into the classification model, and that feature map has lost a large amount of space-time semantic detail through the pooling operations, so the model's ability to capture fine motion changes is weak.
The embodiment of the application provides a video category identification method, a video category identification device, computer equipment and a computer storage medium, which can solve the problems in the related art.
Fig. 1 is a flowchart of a video category identification method according to an embodiment of the present application, where the video category identification method may include the following steps:
step 101, obtaining video data to be identified.
Step 102, inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature (3D-VLAD) model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model takes as input the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers to obtain space-time local aggregation description features, and the classification recognition model obtains a classification result according to the space-time local aggregation description features.
And step 103, obtaining a classification result of the video data to be identified, which is output by the video classification model.
In summary, in the video category identification method provided by the embodiments of the application, the video data to be identified is acquired and input into a video classification model, where the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature (3D-VLAD) model and a classification recognition model which are connected in sequence; the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model takes as input the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers to obtain space-time local aggregation description features, and the classification recognition model obtains a classification result according to the space-time local aggregation description features; the classification result of the video data to be identified output by the video classification model is then obtained. A space-time local aggregation description is computed for a plurality of space-time feature maps and input into the classification recognition model, so a finer classification result is obtained. This solves the problem in the prior art that fine motion changes in video data are poorly captured and the classification accuracy is therefore low, and achieves the effect of improving accuracy.
Referring to fig. 2, a flowchart of another video category identification method according to an embodiment of the present application is shown, where the video category identification method may include the following steps:
step 201, a model training sample set is obtained, wherein the model training sample set comprises a plurality of types of video sets, and each type of video set comprises a plurality of video data.
Collect a video segment data set containing C different categories, and sample n frames of RGB images at intervals of 0.1 second to form a time-series frame sample x_i = {x_i^1, x_i^2, ..., x_i^k, ..., x_i^n},

where x_i^k represents the RGB image of the k-th frame of sample x_i.

The samples x_i form a sample set X = {x_1, x_2, ..., x_i, ..., x_N}, and R = {r_1, r_2, ..., r_i, ..., r_N} is used to record the category of each sample x_i in the sample set X, where r_i is a one-hot (one-bit valid) encoding vector of dimension C.
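As a minimal illustration of this sampling and labeling step, the following Python sketch builds one time-series frame sample and its one-hot label vector. It assumes the video frames have already been decoded into RGB arrays; the helper names and the 30 fps default are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

def build_sample(frames, n=16, interval_s=0.1, fps=30):
    """Build one time-series frame sample x_i: n RGB frames taken about 0.1 s apart."""
    step = max(int(round(interval_s * fps)), 1)           # decoded frames per 0.1 s
    idx = [min(k * step, len(frames) - 1) for k in range(n)]
    return np.stack([frames[i] for i in idx])             # shape: (n, H, W, 3)

def one_hot(label, num_classes):
    """One-hot (one-bit valid) encoding vector r_i of dimension C."""
    r = np.zeros(num_classes, dtype=np.float32)
    r[label] = 1.0
    return r

# Example (hypothetical data): frames decoded elsewhere as a list of HxWx3 arrays.
# X = [build_sample(clip) for clip in decoded_clips]
# R = [one_hot(label, C) for label in labels]
```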
Step 202, constructing a model framework of a video classification model.
Fig. 3 is a schematic structural diagram of the video classification model. The space-time convolutional neural network model 32 comprises a plurality of space-time convolutional neural network layers 321 (for example, with a receptive field of 7×7×7), a plurality of space-time maximum pooling layers 322 (for example, with a receptive field of 1×3×3 or 1×2×2), and a plurality of Blocks 323 (a Block is a processing unit in the model and may comprise a space-time convolutional neural network layer and a space-time maximum pooling layer). The 3D-VLAD model 33 comprises a plurality of 3D-VLAD layers 331, 332, 333 and 334 and a space-time local aggregation description fusion feature 335. The classification recognition model 34 comprises three full-connection layers 341, 342 and 343; for example, the number of output neurons of the full-connection layers 341 and 342 may be 1024, and the number of output neurons of the full-connection layer 343 may be C, where C denotes the number of video categories. Reference numeral 31 denotes the data input to the space-time convolutional neural network layer 321.
The process of constructing the video classification model at step 202 may include:
1) Construct the space-time convolutional neural network model. The video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature (3D-VLAD) model and a classification recognition model which are connected in sequence.
The space-time convolutional neural network is a network model formed by stacking and combining a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers; it can capture motion information in a video sequence and extract 3D features in space and time. Fig. 4 is a schematic diagram of the Block model in Fig. 3, which includes a space-time convolutional neural network layer 321 (for example, with a receptive field of 1×1×1 or 3×3×3) and a space-time maximum pooling layer 322 (for example, with a receptive field of 3×3×3). The space-time convolutional neural network layer is defined by the following formulas:
O = {O_j | j = 1, 2, ..., n_O}

O_j = f( Σ_{i=1}^{n_I} I_i ⊛ W_ij + b_j )

where I_i is the i-th space-time feature map of the input I of the space-time convolutional neural network layer; O is the output of the space-time convolutional neural network layer, and O_j is the j-th space-time feature map of O; W_ij is the convolution kernel connecting I_i with O_j; n_I is the number of space-time feature maps input to the space-time convolutional neural network layer, and n_O is the number of space-time feature maps output by the space-time convolutional neural network layer; b_j is the bias parameter of O_j; f(·) is the activation function; and ⊛ denotes the 3D convolution operation.
The space-time maximum pooling layer is defined by the following formulas:

Y = {y_m | m = 1, 2, ..., N}

y_m^(i, j, t) = max{ O_m^(i+r_1, j+r_2, t+r_3) | 0 ≤ r_1 < k_1, 0 ≤ r_2 < k_2, 0 ≤ r_3 < k_3 }

where Y is the feature tensor output by the space-time maximum pooling layer; O_m^(i+r_1, j+r_2, t+r_3) is the feature value at the (i+r_1)-th frame, (j+r_2)-th row and (t+r_3)-th column of the m-th space-time feature map O_m of O; y_m^(i, j, t) is the feature value at the i-th frame, j-th row and t-th column of the m-th space-time feature map y_m in Y; p_1, p_2, p_3 are the dimensions of O_m; and k_1, k_2, k_3 are the dimensions of the pooling kernel of the space-time maximum pooling layer.
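A minimal PyTorch sketch of one such Block is given below. The choice of PyTorch, the channel counts and the stride of the pooling layer are illustrative assumptions; the 1×1×1 and 3×3×3 convolution kernels and the 3×3×3 pooling window follow the receptive fields mentioned for Fig. 4.

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """One Block from Fig. 4: stacked 3D convolutions followed by 3D max pooling."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_channels, out_channels, kernel_size=1),             # 1x1x1 kernel
            nn.ReLU(inplace=True),                                           # activation f(.)
            nn.Conv3d(out_channels, out_channels, kernel_size=3, padding=1), # 3x3x3 kernel
            nn.ReLU(inplace=True),
        )
        # k1 = k2 = k3 = 3 pooling kernel; stride 2 halves each spatio-temporal dimension
        self.pool = nn.MaxPool3d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):             # x: (batch, channels, frames, height, width)
        return self.pool(self.conv(x))

# Example: a 16-frame 112x112 RGB clip through one Block.
# y = SpatioTemporalBlock(3, 64)(torch.randn(1, 3, 16, 112, 112))   # -> (1, 64, 8, 56, 56)
```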
2) Construct the 3D-VLAD model.

The 3D-VLAD model takes the three-dimensional feature maps of different sizes output by the last n maximum pooling layers of the space-time convolutional neural network model as the inputs of its 3D-VLAD layers, and extracts space-time local aggregation description features at n different scales, where each space-time local aggregation description feature v_i has dimension 512. These space-time local aggregation description features v_i are spliced into a space-time local aggregation description fusion feature V = [v_1, v_2, ..., v_i, ..., v_n] of length 512·n. A 3D-VLAD layer extracts space-time local aggregation description features from the feature tensor Y output by a maximum pooling layer; it can capture statistics of how the local features of the three-dimensional feature map aggregate over the global video sequence, and it is a method for representing aggregated local features as a global feature.
Y is a feature tensor with dimensions N×S×H×D, where S is the width of the space-time feature map, H is the height of the space-time feature map, and D is the number of channels of the space-time feature map. The 3D-VLAD model is used for:
converting Y into a feature map M with dimensions L×D, and converting the feature map M into a feature matrix G with dimensions K×D through conversion formulas, wherein the conversion formulas comprise:

Z = M·W + B

A = softmax(Z)

G = Aᵀ·M − sum(A, 1)ᵀ ⊙ Q

where W and B are the parameters of a full-connection layer with K output neurons, and Z denotes the output of that full-connection layer; softmax(·) is the normalized exponential function, and A is the output of the normalized exponential function; sum(·, 1) denotes summing the columns of a matrix; ⊙ denotes the element-wise (Hadamard) product between matrices; Aᵀ is the transposed matrix of matrix A; and Q is a cluster-center matrix parameter with dimensions K×D.

The feature matrix G is then reshaped into a feature vector of length K·D.

This feature vector of length K·D is passed through L2-norm normalization and one full-connection layer with 512 output neurons to obtain the space-time local aggregation description feature v_i. The space-time local aggregation description features v_i obtained by the 3D-VLAD layers of the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers are spliced into a fusion feature vector V = [v_1, v_2, ..., v_n].
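The following PyTorch sketch illustrates one possible implementation of a single 3D-VLAD layer as described above, using the soft-assignment form G = Aᵀ·M − sum(A, 1)ᵀ ⊙ Q with K cluster centers and a 512-dimensional output. Treating W, B and the cluster-center matrix Q as learnable parameters is an assumption; the patent does not prescribe a framework.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VLAD3D(nn.Module):
    """3D-VLAD layer: aggregates a pooled feature tensor into a 512-d descriptor v_i."""
    def __init__(self, feat_dim, num_clusters, out_dim=512):
        super().__init__()
        self.assign = nn.Linear(feat_dim, num_clusters)                    # Z = M.W + B
        self.centers = nn.Parameter(torch.randn(num_clusters, feat_dim))   # Q: K x D
        self.fc = nn.Linear(num_clusters * feat_dim, out_dim)

    def forward(self, y):                        # y: (batch, D, N, H, S) from MaxPool3d
        m = y.flatten(2).transpose(1, 2)         # M: (batch, L, D) with L = N*H*S
        a = F.softmax(self.assign(m), dim=-1)    # A = softmax(Z): (batch, L, K)
        # G = A^T.M - sum(A, 1)^T (x) Q, i.e. per cluster: sum_i a_ik * (m_i - q_k)
        g = a.transpose(1, 2) @ m - a.sum(dim=1).unsqueeze(-1) * self.centers
        v = F.normalize(g.flatten(1), p=2, dim=1)    # L2-normalised K.D vector
        return self.fc(v)                            # 512-d descriptor v_i

# Fusion across the last n pooling layers (hypothetical names):
# V = torch.cat([vlad_k(feat_k) for vlad_k, feat_k in zip(vlad_layers, pooled_maps)], dim=1)
```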
3) Construct the classification recognition model.

The classification recognition model is used for: sequentially passing the space-time local aggregation description fusion feature V through three full-connection layers, wherein the number of neurons of the last of the three full-connection layers is C, and C is the number of video categories in the model training sample set.
The classification result is determined by using the output values of the last full-connection layer and a probability formula, wherein the probability formula comprises:

p(o_t) = e^{o_t} / Σ_{k=1}^{C} e^{o_k}

where p(o_t) is the probability value that the video data to be identified belongs to the t-th category, o_t denotes the t-th output value of the last full-connection layer, o_k denotes the k-th output value of the last full-connection layer, and e denotes the natural constant. p(o_t) gives the probability that the video data to be identified belongs to each of the C video categories. For example, the probability that the video data to be identified belongs to the animation category may be 85%, the probability that it belongs to the comedy category 10%, and the probability that it belongs to the literature category 5%; the highest probability value can be used as the decision criterion, under which the video data to be identified is determined to belong to the animation category. Other criteria may also be used, and the embodiments of the present application are not limited in this respect.
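A hedged sketch of this classification recognition head follows, assuming the 1024-1024-C layer widths shown in Fig. 3 and a PyTorch implementation; the softmax of the last layer's output values gives the probabilities p(o_t).

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Classification recognition model: three full-connection layers ending in C outputs."""
    def __init__(self, in_dim, num_classes, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_classes),      # last layer has C neurons
        )

    def forward(self, v):                        # v: fused feature vector V
        return self.net(v)                       # output values o_1 ... o_C

# Class probabilities p(o_t) = e^{o_t} / sum_k e^{o_k}, with hypothetical sizes:
# probs = ClassifierHead(512 * 4, num_classes=10)(torch.randn(1, 512 * 4)).softmax(dim=-1)
```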
And 203, performing data expansion on the model training sample set by a data enhancement method of dynamically and randomly adjusting the extraction frame number and the extraction frame rate strategy.
When training the network model, for each batch of samples input to the network, a number μ_1 is randomly selected from [1, 2, 4] as the extraction interval and a number μ_2 is randomly selected from [4, 8, 16] as the number of frames to extract; then, for a time-series frame sample x_i, one frame is extracted every μ_1 frames until μ_2 frames have been extracted, and these μ_2 frames are used to represent the time-series frame sample x_i. This improves recognition performance.
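A small sketch of this augmentation strategy, assuming each raw sample is a list of decoded frames; the candidate sets [1, 2, 4] and [4, 8, 16] are taken from the description above.

```python
import random

def augment_sample(frames):
    """Dynamically and randomly pick the extraction interval mu1 and frame count mu2."""
    mu1 = random.choice([1, 2, 4])      # extraction interval
    mu2 = random.choice([4, 8, 16])     # number of frames to extract
    idx = [min(k * mu1, len(frames) - 1) for k in range(mu2)]
    return [frames[i] for i in idx]     # mu2 frames representing the sample x_i
```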
And 204, taking the model training sample set as training data, and optimizing the video classification model according to the loss function and the gradient descent method.
The loss function describes the loss of the system under different parameter values; minimizing the loss function means that the fit is best, and the corresponding model parameters are the optimal parameters. The minimum of the loss function and the corresponding model parameter values can be obtained by iteratively solving step by step with a gradient descent method. For details, reference may be made to the related art, which is not described in detail in the embodiments of the present application.
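A hedged training sketch under these choices is shown below. Cross-entropy as the loss function and stochastic gradient descent (SGD) with momentum as the optimizer are common instantiations rather than values fixed by the patent, and `model` and `train_loader` are assumed to exist.

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=30, lr=1e-3):
    """Optimize the video classification model with a loss function and gradient descent."""
    criterion = nn.CrossEntropyLoss()                                     # loss function
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)  # gradient descent
    for _ in range(epochs):
        for clips, labels in train_loader:    # clips: (B, 3, T, H, W); labels: class indices
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()                   # compute gradients of the loss
            optimizer.step()                  # one gradient-descent update of the parameters
```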
And 205, stopping optimizing when the video classification model converges.
End-to-end optimization is performed on the whole model until the video classification model converges, at which point optimization stops.
And detecting and verifying the trained and optimized video classification model by using a test sample, and retraining the video classification model according to a test result so as to improve the recognition effect. The test sample may be preset video data to be identified.
Steps 201 to 205 are processes of building a model, and the following steps 206 to 208 are application processes of the built video classification model.
Step 206, obtaining the video data to be identified.
The video data to be identified is acquired and passed to step 207. The video data to be identified here is different from the video data in step 201; it is arbitrary video data that the user wants to classify.
Step 207, inputting the video data to be identified into the video classification model; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model takes as input the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers to obtain space-time local aggregation description features, and the classification recognition model obtains a classification result according to the space-time local aggregation description features.
The video data to be identified is input into the model constructed in the step 202, the video classification model comprises a space-time convolution neural network model, a space-time local aggregation description characteristic 3D-VLAD model and a classification identification model which are sequentially connected, and the space-time convolution neural network model comprises a plurality of space-time convolution neural network layers and a plurality of space-time maximum pooling layers which are sequentially stacked. The model is used to process the video data to be identified.
And step 208, obtaining a classification result of the video data to be identified, which is output by the video classification model.
The classification result of the video data to be identified may include the probability that the video data to be identified belongs to each of the C video categories. Because the classification result obtained from the video classification model takes into account the value of each dimension of the space-time feature points, the local information of the video is described more finely, and fine motion changes in the video are captured well. Extracting 3D-VLAD features from the feature maps of the last n layers achieves the fusion of multi-scale features and can effectively improve the recognition rate. The data augmentation method of dynamically and randomly adjusting the number of extracted frames and the extraction interval provides a degree of dynamic adaptation to the varying pace of motion and span of content in videos, enhancing the generalization ability of the model. Combined with training and tuning, the recognition accuracy is improved and end-to-end video recognition is realized.
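Putting the pieces together at inference time, a usage sketch might look as follows; the preprocessed-tensor input, the class names and the assumption that the model outputs one value per category are illustrative, not part of the patent.

```python
import torch

def classify(model, clip, class_names):
    """Classify one preprocessed clip tensor of shape (1, 3, T, H, W)."""
    model.eval()
    with torch.no_grad():
        probs = model(clip).softmax(dim=-1)[0]   # p(o_t) for each of the C categories
    top = int(probs.argmax())
    return class_names[top], float(probs[top])   # e.g. ("animation", 0.85)
```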
In summary, in the video category identification method provided by the embodiments of the application, the video data to be identified is acquired and input into a video classification model, where the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature (3D-VLAD) model and a classification recognition model which are connected in sequence; the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model takes as input the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers to obtain space-time local aggregation description features, and the classification recognition model obtains a classification result according to the space-time local aggregation description features; the classification result of the video data to be identified output by the video classification model is then obtained. A space-time local aggregation description is computed for a plurality of space-time feature maps and input into the classification recognition model, so a finer classification result is obtained. This solves the problem in the prior art that fine motion changes in video data are poorly captured and the classification accuracy is therefore low, and achieves the effect of improving accuracy.
Fig. 5 is a schematic structural diagram of a video category recognition device according to an embodiment of the present application, and as shown in fig. 5, the video category recognition device 500 includes:
the data acquisition module 501 is configured to acquire video data to be identified.
The data processing module 502 is configured to input the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature (3D-VLAD) model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is configured to take as input the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers to obtain space-time local aggregation description features, and the classification recognition model is configured to obtain a classification result according to the space-time local aggregation description features.
And the result obtaining module 503 is configured to obtain a classification result of the video data to be identified, which is output by the video classification model.
In summary, in the video category recognition device provided by the embodiments of the application, the video data to be identified is acquired and input into a video classification model, where the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature (3D-VLAD) model and a classification recognition model which are connected in sequence; the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model takes as input the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers to obtain space-time local aggregation description features, and the classification recognition model obtains a classification result according to the space-time local aggregation description features; the classification result of the video data to be identified output by the video classification model is then obtained. A space-time local aggregation description is computed for a plurality of space-time feature maps and input into the classification recognition model, so a finer classification result is obtained. This solves the problem in the prior art that fine motion changes in video data are poorly captured and the classification accuracy is therefore low, and achieves the effect of improving accuracy.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application. Specifically:
The server 600 includes a central processing unit (CPU) 601, a system memory 604 including a random access memory (RAM) 602 and a read-only memory (ROM) 603, and a system bus 605 connecting the system memory 604 and the central processing unit 601. The server 600 also includes a basic input/output system (I/O system) 606 for facilitating the transfer of information between devices within the computer, and a mass storage device 607 for storing an operating system 613, application programs 614, and other program modules 615.
The basic input/output system 606 includes a display 608 for displaying information and an input device 609, such as a mouse, keyboard, etc., for a user to input information. Wherein both the display 608 and the input device 609 are coupled to the central processing unit 601 via an input output controller 610 coupled to the system bus 605. The basic input/output system 606 may also include an input/output controller 610 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 610 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 607 is connected to the central processing unit 601 through a mass storage controller (not shown) connected to the system bus 605. The mass storage device 607 and its associated computer-readable media provide non-volatile storage for the server 600. That is, the mass storage device 607 may include a computer readable medium (not shown) such as a hard disk or CD-ROM (Compact Disc Read-Only Memory) drive.
Computer readable media may include computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other solid-state memory technology, CD-ROM, DVD (Digital Versatile Disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 604 and the mass storage device 607 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 600 may also be operated through a remote computer connected via a network such as the Internet. That is, the server 600 may be connected to the network 612 through a network interface unit 611 coupled to the system bus 605, or the network interface unit 611 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs, which are stored in the memory and configured to be executed by the CPU.
The application also provides a computer device comprising a processor and a memory, wherein at least one instruction, at least one section of program, code set or instruction set is stored in the memory, and the at least one instruction, the at least one section of program, the code set or the instruction set is loaded and executed by the processor to realize the video category identification method.
The present application also provides a computer-readable storage medium having stored therein instructions that, when executed by video category recognition means, cause the video category recognition means to implement the video category recognition method provided by the above embodiment.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application, but rather, the application is to be construed according to the appended claims.
Claims (9)
1. A method for identifying video categories, the method comprising:
acquiring video data to be identified;
inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature (3D-VLAD) model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking as input the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features;
acquiring a classification result of the video data to be identified, which is output by the video classification model;
the space-time convolutional neural network layer comprises the following formulas:

O = {O_j | j = 1, 2, ..., n_O}

O_j = f( Σ_{i=1}^{n_I} I_i ⊛ W_ij + b_j )

where I_i is the i-th space-time feature map of the input I of the space-time convolutional neural network layer; O is the output of the space-time convolutional neural network layer, and O_j is the j-th space-time feature map of O; W_ij is the convolution kernel connecting I_i with O_j; n_I is the number of space-time feature maps input to the space-time convolutional neural network layer, and n_O is the number of space-time feature maps output by the space-time convolutional neural network layer; b_j is the bias parameter of O_j; f(·) is the activation function; and ⊛ denotes the 3D convolution operation;
the space-time maximum pooling layer comprises the following formulas:

Y = {y_m | m = 1, 2, ..., N}

y_m^(i, j, t) = max{ O_m^(i+r_1, j+r_2, t+r_3) | 0 ≤ r_1 < k_1, 0 ≤ r_2 < k_2, 0 ≤ r_3 < k_3 }

where Y is the feature tensor output by the space-time maximum pooling layer; O_m^(i+r_1, j+r_2, t+r_3) is the feature value at the (i+r_1)-th frame, (j+r_2)-th row and (t+r_3)-th column of the m-th space-time feature map O_m of O; y_m^(i, j, t) is the feature value at the i-th frame, j-th row and t-th column of the m-th space-time feature map y_m in Y; p_1, p_2, p_3 are the dimensions of O_m; and k_1, k_2, k_3 are the dimensions of the pooling kernel of the space-time maximum pooling layer.
2. The method of claim 1, wherein prior to the obtaining video data to be identified, the method further comprises:
obtaining a model training sample set, wherein the model training sample set comprises a plurality of types of video sets, and each type of video set comprises a plurality of video data;
optimizing the video classification model through the model training sample set;
and stopping optimizing when the video classification model converges.
3. The method of claim 2, wherein said optimizing the video classification model by the model training sample set comprises:
and taking the model training sample set as training data, and optimizing the video classification model according to a loss function and a gradient descent method.
4. A method according to claim 3, wherein said model training sample set is used as training data, and wherein prior to optimizing said video classification model according to a loss function and gradient descent method, said method further comprises:
and performing data expansion on the model training sample set through a data augmentation method that dynamically and randomly adjusts the number of extracted frames and the frame-extraction interval.
5. The method of claim 1, wherein Y is a feature tensor having dimensions N×S×H×D, S is the width of the space-time feature map, H is the height of the space-time feature map, D is the number of channels of the space-time feature map, and the 3D-VLAD model is configured to:
convert Y into a feature map M with dimensions L×D, and convert the feature map M into a feature matrix G with dimensions K×D through conversion formulas, wherein the conversion formulas comprise:

Z = M·W + B

A = softmax(Z)

G = Aᵀ·M − sum(A, 1)ᵀ ⊙ Q

where W and B are the parameters of a full-connection layer with K output neurons, and Z denotes the output of that full-connection layer; softmax(·) is the normalized exponential function, and A is the output of the normalized exponential function; sum(·, 1) denotes summing the columns of a matrix; ⊙ denotes the element-wise (Hadamard) product between matrices; Aᵀ is the transposed matrix of matrix A; Q is a cluster-center matrix parameter with dimensions K×D;
reshape the feature matrix G into a feature vector of length K·D;

apply L2-norm normalization and one full-connection layer to the feature vector of length K·D to obtain the space-time local aggregation description feature v;

splice the plurality of space-time local aggregation description features v obtained by the 3D-VLAD layers of the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers into a fusion feature vector V = [v_1, v_2, ..., v_n], where n is the number of different scales at which the space-time local aggregation description features are extracted.
6. The method of claim 5, wherein the classification recognition model is used to:
sequentially passing the space-time local aggregation description fusion feature vector V through three full-connection layers, wherein the number of neurons of the last full-connection layer in the three full-connection layers is C, and C is the number of video categories in the model training sample set;
determining the classification result by using the output values of the last full-connection layer and a probability formula, wherein the probability formula comprises:

p(o_t) = e^{o_t} / Σ_{k=1}^{C} e^{o_k}

where p(o_t) is the probability value that the video data to be identified belongs to the t-th category, o_t denotes the t-th output value of the last full-connection layer, o_k denotes the k-th output value of the last full-connection layer, and e denotes the natural constant.
7. A video category identification device, the device comprising:
the data acquisition module is used for acquiring video data to be identified;
the data processing module is used for inputting the video data to be identified into a video classification model, wherein the video classification model comprises a space-time convolutional neural network model, a space-time local aggregation description feature (3D-VLAD) model and a classification recognition model which are connected in sequence, and the space-time convolutional neural network model comprises a plurality of space-time convolutional neural network layers and a plurality of space-time maximum pooling layers which are stacked in sequence; after the video data to be identified is input into the space-time convolutional neural network model, the 3D-VLAD model is used for taking as input the space-time feature maps output by the last specified number of space-time maximum pooling layers among the plurality of space-time maximum pooling layers to obtain space-time local aggregation description features, and the classification recognition model is used for obtaining a classification result according to the space-time local aggregation description features, and the space-time convolutional neural network layer comprises the following formulas:
O = {O_j | j = 1, 2, ..., n_O}

O_j = f( Σ_{i=1}^{n_I} I_i ⊛ W_ij + b_j )

where I_i is the i-th space-time feature map of the input I of the space-time convolutional neural network layer; O is the output of the space-time convolutional neural network layer, and O_j is the j-th space-time feature map of O; W_ij is the convolution kernel connecting I_i with O_j; n_I is the number of space-time feature maps input to the space-time convolutional neural network layer, and n_O is the number of space-time feature maps output by the space-time convolutional neural network layer; b_j is the bias parameter of O_j; f(·) is the activation function; and ⊛ denotes the 3D convolution operation;
the space-time maximum pooling layer comprises the following formulas:

Y = {y_m | m = 1, 2, ..., N}

y_m^(i, j, t) = max{ O_m^(i+r_1, j+r_2, t+r_3) | 0 ≤ r_1 < k_1, 0 ≤ r_2 < k_2, 0 ≤ r_3 < k_3 }

where Y is the feature tensor output by the space-time maximum pooling layer; O_m^(i+r_1, j+r_2, t+r_3) is the feature value at the (i+r_1)-th frame, (j+r_2)-th row and (t+r_3)-th column of the m-th space-time feature map O_m of O; y_m^(i, j, t) is the feature value at the i-th frame, j-th row and t-th column of the m-th space-time feature map y_m in Y; p_1, p_2, p_3 are the dimensions of O_m; and k_1, k_2, k_3 are the dimensions of the pooling kernel of the space-time maximum pooling layer;
and the result acquisition module is used for acquiring the classification result of the video data to be identified, which is output by the video classification model.
8. A computer device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set, or instruction set that is loaded and executed by the processor to implement the video category identification method of any one of claims 1 to 6.
9. A computer storage medium having instructions stored therein which, when executed on a computer, cause the computer to perform the video category identification method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910862697.6A CN110569814B (en) | 2019-09-12 | 2019-09-12 | Video category identification method, device, computer equipment and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910862697.6A CN110569814B (en) | 2019-09-12 | 2019-09-12 | Video category identification method, device, computer equipment and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110569814A CN110569814A (en) | 2019-12-13 |
CN110569814B true CN110569814B (en) | 2023-10-13 |
Family
ID=68779708
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910862697.6A Active CN110569814B (en) | 2019-09-12 | 2019-09-12 | Video category identification method, device, computer equipment and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110569814B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111179246B (en) * | 2019-12-27 | 2021-01-29 | 中国科学院上海微系统与信息技术研究所 | Pixel displacement confirming method and device, electronic equipment and storage medium |
CN111400551B (en) * | 2020-03-13 | 2022-11-15 | 咪咕文化科技有限公司 | Video classification method, electronic equipment and storage medium |
CN111539289A (en) * | 2020-04-16 | 2020-08-14 | 咪咕文化科技有限公司 | Method and device for identifying action in video, electronic equipment and storage medium |
CN111598026B (en) * | 2020-05-20 | 2023-05-30 | 广州市百果园信息技术有限公司 | Action recognition method, device, equipment and storage medium |
CN112149736B (en) * | 2020-09-22 | 2024-02-09 | 腾讯科技(深圳)有限公司 | Data processing method, device, server and medium |
CN113255616B (en) * | 2021-07-07 | 2021-09-21 | 中国人民解放军国防科技大学 | Video behavior identification method based on deep learning |
CN114004645B (en) * | 2021-10-29 | 2024-07-26 | 浙江省民营经济发展中心(浙江省广告监测中心) | Intelligent monitoring platform for fused media advertisement and electronic equipment |
CN116524240A (en) * | 2023-03-30 | 2023-08-01 | 国网智能电网研究院有限公司 | Electric power operation scene violation behavior identification model, method, device and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9946933B2 (en) * | 2016-08-18 | 2018-04-17 | Xerox Corporation | System and method for video classification using a hybrid unsupervised and supervised multi-layer architecture |
2019
- 2019-09-12 CN CN201910862697.6A patent/CN110569814B/en active Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015030606A2 (en) * | 2013-08-26 | 2015-03-05 | Auckland University Of Technology | Improved method and system for predicting outcomes based on spatio / spectro-temporal data |
CN106845329A (en) * | 2016-11-11 | 2017-06-13 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | A kind of action identification method based on depth convolution feature multichannel pyramid pond |
WO2018171109A1 (en) * | 2017-03-23 | 2018-09-27 | 北京大学深圳研究生院 | Video action detection method based on convolutional neural network |
WO2019013711A1 (en) * | 2017-07-12 | 2019-01-17 | Mastercard Asia/Pacific Pte. Ltd. | Mobile device platform for automated visual retail product recognition |
CN110097000A (en) * | 2019-04-29 | 2019-08-06 | 东南大学 | Video behavior recognition methods based on local feature Aggregation Descriptor and sequential relationship network |
CN110188653A (en) * | 2019-05-27 | 2019-08-30 | 东南大学 | Activity recognition method based on local feature polymerization coding and shot and long term memory network |
Non-Patent Citations (2)
Title |
---|
Diabetic Retinopathy Classification using Deeply Supervised ResNet; Debiao Zhang et al.; 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation; 2018-06-28; full text *
Review of progress in deep-learning-based human action recognition in video (基于深度学习的视频中人体动作识别进展综述); Luo Huilan et al.; Acta Electronica Sinica (电子学报); 2019-05-31 (No. 5); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110569814A (en) | 2019-12-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110569814B (en) | Video category identification method, device, computer equipment and computer storage medium | |
CN113936339B (en) | Fighting identification method and device based on double-channel cross attention mechanism | |
CN110032926B (en) | Video classification method and device based on deep learning | |
CN114202672A (en) | Small target detection method based on attention mechanism | |
CN105138973B (en) | The method and apparatus of face authentication | |
CN115171165A (en) | Pedestrian re-identification method and device with global features and step-type local features fused | |
CN112183468A (en) | Pedestrian re-identification method based on multi-attention combined multi-level features | |
CN111340123A (en) | Image score label prediction method based on deep convolutional neural network | |
CN112580458B (en) | Facial expression recognition method, device, equipment and storage medium | |
CN109063649B (en) | Pedestrian re-identification method based on twin pedestrian alignment residual error network | |
CN111625675A (en) | Depth hash image retrieval method based on feature pyramid under attention mechanism | |
CN112434608B (en) | Human behavior identification method and system based on double-current combined network | |
CN112801063B (en) | Neural network system and image crowd counting method based on neural network system | |
CN108090472A (en) | Pedestrian based on multichannel uniformity feature recognition methods and its system again | |
CN112381763A (en) | Surface defect detection method | |
WO2022152104A1 (en) | Action recognition model training method and device, and action recognition method and device | |
CN109446897B (en) | Scene recognition method and device based on image context information | |
CN116958786A (en) | Dynamic visual identification method for chemical waste residues based on YOLOv5 and ResNet50 neural network | |
CN111275694B (en) | Attention mechanism guided progressive human body division analysis system and method | |
CN109657082A (en) | Remote sensing images multi-tag search method and system based on full convolutional neural networks | |
CN113283400B (en) | Skeleton action identification method based on selective hypergraph convolutional network | |
CN114943937A (en) | Pedestrian re-identification method and device, storage medium and electronic equipment | |
CN112560668B (en) | Human behavior recognition method based on scene priori knowledge | |
CN112613442A (en) | Video sequence emotion recognition method based on principle angle detection and optical flow conversion | |
CN112785479A (en) | Image invisible watermark universal detection method based on less-sample learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||