CN114299321A - Video classification method, device, equipment and readable storage medium - Google Patents
- Publication number
- CN114299321A (application CN202110893350.5A)
- Authority
- CN
- China
- Prior art keywords
- video
- features
- semantic
- sample
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a video classification method, apparatus, device and readable storage medium, relating to the field of machine learning. The method comprises the following steps: acquiring a target video; performing feature extraction on the target video based on video frames in the target video to obtain video features corresponding to the target video; performing semantic analysis on the target video based on the video features to obtain semantic features corresponding to the target video; fusing the video features and the semantic features to obtain fusion features corresponding to the target video; and performing classification prediction on the fusion features to obtain the video classification to which the target video belongs. In this process, not only is the classification information contained in the video features fully considered, but the semantic features corresponding to the video features are also extracted from them and included in the analysis, so that the acquisition range of video feature information is expanded, incomplete information in the video classification process is avoided, and the accuracy of video classification is improved.
Description
Technical Field
The embodiment of the application relates to the field of machine learning, in particular to a video classification method, a video classification device, video classification equipment and a readable storage medium.
Background
Video classification is an important topic in the field of computer vision; its main objective is to understand the content contained in a video and determine the key topics to which the video corresponds. Because video classification is of great significance in application fields such as video retrieval and content analysis, effective classification of videos has become a key problem to be solved urgently.
In the related art, video classification mainly includes research on behavior recognition and event detection, and generally, after video features are extracted from a video, the category of the video is determined through machine learning. This typically requires a large number of training samples to train the video classifier to distinguish the video classes.
However, in many video classification tasks, samples available for training are very few, and the amount of information contained in the extracted video features is limited, resulting in a low accuracy rate of video classification.
Disclosure of Invention
The embodiment of the application provides a video classification method, a video classification device, video classification equipment and a readable storage medium, which can effectively improve the accuracy of video classification. The technical scheme is as follows:
in one aspect, a video classification method is provided, and the method includes:
acquiring a target video, wherein the target video is a video to be classified and predicted;
performing feature extraction on the target video based on video frames in the target video to obtain video features corresponding to the target video, wherein the video features are used for indicating picture information expressed by the video frames of the target video;
performing semantic analysis on the target video based on the video features to obtain semantic features corresponding to the target video, wherein the semantic features are used for indicating entity association expressed by video frames of the target video;
fusing the video features and the semantic features to obtain fused features corresponding to the target video;
and carrying out classification prediction on the fusion characteristics to obtain the video classification of the target video.
In another aspect, there is provided a video classification apparatus, the apparatus including:
an acquisition module, configured to acquire a target video, where the target video is a video to be classified and predicted;
the extraction module is used for extracting the features of the target video based on the video frames in the target video to obtain the video features corresponding to the target video, wherein the video features are used for indicating the picture information expressed by the video frames of the target video;
the analysis module is used for carrying out semantic analysis on the target video based on the video characteristics to obtain semantic characteristics corresponding to the target video, wherein the semantic characteristics are used for indicating entity association relation expressed by video frames of the target video;
the fusion module is used for fusing the video features and the semantic features to obtain fusion features corresponding to the target video;
and the prediction module is used for carrying out classification prediction on the fusion characteristics to obtain the video classification to which the target video belongs.
In another aspect, a computer device is provided, which includes a processor and a memory, wherein the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the video classification method according to any of the embodiments of the present application.
In another aspect, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by a processor to implement a video classification method as described in any of the embodiments of the present application.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the video classification method described in any of the above embodiments.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
after feature extraction is performed on a target video to be detected, video features are obtained; semantic features are then obtained from the video features, and the semantic features and the video features are fused to obtain fusion features. The fusion features contain not only visual information but also semantic information, so the acquisition range of video feature information is expanded, incomplete information in the video classification process is avoided, and the accuracy of video classification is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 is a schematic flow chart of an implementation of an embodiment provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a video classification method provided by an exemplary embodiment of the present application;
FIG. 4 is a flow chart of a video classification method provided by another exemplary embodiment of the present application;
FIG. 5 is a flow chart of a video classification method provided by another exemplary embodiment of the present application;
FIG. 6 is a flow chart of a video classification method provided by another exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a video classification method process provided by an exemplary embodiment of the present application;
FIG. 8 is a graph illustrating the accuracy of video classification using a video classification method according to an exemplary embodiment of the present application;
fig. 9 is a block diagram illustrating a structure of a video classification apparatus according to an exemplary embodiment of the present application;
fig. 10 is a block diagram of a video classification apparatus according to another exemplary embodiment of the present application;
fig. 11 is a block diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, terms referred to in the embodiments of the present application will be briefly described.
Prototype network: the prototype network is a metric-based meta-learning method that aims to reduce overfitting caused by too small a data volume. The basic idea is as follows: a prototype representation is created for each class. For a video needing classification, the distance between the class prototype vector and the sample to be queried is calculated to determine its class. Illustratively, a sample to be queried is projected into the prototype space, in which samples of the same class are closer to the class prototype and samples of different classes are farther from it. The sample to be queried is projected onto the prototype network, and the class closest to the sample to be queried is taken as its class.
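The following is a minimal illustrative sketch of the nearest-prototype idea described above; it is not part of the original disclosure, and the function name, tensor shapes and PyTorch usage are assumptions for illustration only.

```python
import torch

def classify_by_prototype(query_emb: torch.Tensor,
                          support_embs: torch.Tensor,
                          support_labels: torch.Tensor,
                          num_classes: int) -> int:
    """Build one prototype per class as the mean of that class's support
    embeddings, then assign the query to the class of the nearest prototype."""
    prototypes = torch.stack([
        support_embs[support_labels == c].mean(dim=0)        # class prototype
        for c in range(num_classes)
    ])                                                        # (num_classes, dim)
    dists = torch.cdist(query_emb.unsqueeze(0), prototypes)   # (1, num_classes)
    return int(dists.argmin(dim=1))                           # nearest prototype's class
```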
Machine Learning (ML): a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specially studies how a computer simulates or realizes human learning behavior in order to acquire new knowledge or skills and reorganize the existing knowledge structure to continuously improve its own performance. Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Computer Vision technology (Computer Vision, CV): a science that studies how to make machines "see". Specifically, it refers to using cameras and computers instead of human eyes to perform machine vision such as identification, tracking and measurement of targets, and to further perform graphics processing so that the result is an image more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also includes common biometric technologies such as face recognition and fingerprint recognition.
In the related art, a mature classification model needs to be trained for video classification: a large amount of sample video data labeled with label information is used to train the classification model, the classification model classifies and identifies the sample video data to obtain an identification result, and the model is trained according to the difference between the labeled label information and the identification result. However, when there are few training samples, some deep learning methods cannot exert their advantages in video recognition. Meanwhile, because training samples are scarce, videos of different categories may be classified into the same category merely because their video features are similar, and the accuracy of video classification is ultimately low.
The embodiment of the application provides a video classification method that improves the training efficiency and accuracy of video classification when the number of samples is small. The video classification method provided by the application can be applied to at least one of the following scenarios.
First, a video-channel classification scenario on an Internet video platform. Schematically, for a target video newly added to a video channel, the category to which it belongs needs to be determined. In this case, features of the target video are extracted to obtain video features, semantic features are then generated from the video features through a semantic generation network, and the video features and semantic features are fused to obtain fusion features. The fusion features are input into a graph neural network for relation propagation to obtain more robust features, the target node is compared with the other feature nodes, and the category of the feature node with the minimum distance to the target node is taken as the category to which the target video belongs.
Second, a short-video push service. Illustratively, the categories of short videos favored by a user can be roughly determined from the categories of short videos the user watches most frequently in short-video software, and short videos of related categories can then be pushed according to the user's preference. This requires effective classification of short videos. In the application process, the short video for which a push decision is to be made is taken as the target video; semantic features are generated after video features are extracted from the target video, the video features and semantic features are fused to obtain fusion features, the fusion features are input into a graph neural network for relation propagation, the target node is compared with the other feature nodes, and the category of the feature node with the minimum distance to the target node is taken as the category of the short video, on the basis of which it is decided whether to push the short video to the user.
Third, classification of videos uploaded by users. Illustratively, after a user uploads a video, the video needs to be assigned to a section, that is, its category needs to be determined. In the application process, after the user uploads a video, video features are first extracted and semantic features are generated, the video features and semantic features are fused to obtain fusion features, the fusion features are input into a graph neural network for relation propagation, the target node is compared with the other feature nodes, and the category of the feature node with the minimum distance to the target node is taken as the category of the target video, thereby achieving video classification.
The target nodes are nodes of the target video in the graph neural network. It should be noted that the foregoing application scenarios are only illustrative examples, and the video classification method provided in the embodiment of the present application may also be applied to other scenarios, for example, videos of the same category are pushed according to a user search, which is not limited in the embodiment of the present application.
Next, an implementation environment related to the embodiment of the present application is described. Referring schematically to fig. 1, the environment involves a terminal 110 and a server 120, which are connected through a communication network 130.
In some embodiments, the terminal 110 is configured to send the video to be classified to the server 120. In some embodiments, the terminal 110 has an application program with a classification function installed therein, and illustratively, the terminal 110 has an application program with video feature extraction installed therein; alternatively, the terminal 110 has an application program with semantic feature extraction installed therein.
The server 120 obtains a classification result by prediction through a video classification model, classifies the video to be classified according to the classification result, outputs the classification result, and feeds it back to the terminal 110 for display.
The video classification model is obtained by training with the video classification method on sample videos in a sample video library: feature extraction is performed on video frames of a sample video to obtain video features corresponding to the sample video, semantic analysis is performed on the sample video based on the video features to obtain semantic features corresponding to the sample video, the video features and semantic features are fused to obtain fusion features corresponding to the sample video, and the video category is determined according to the fusion features. The above process is merely an illustrative, non-limiting example of the video classification model training process.
The terminal may be a mobile phone, a tablet computer, a desktop computer, a portable notebook computer, an intelligent television, and other terminal devices in various forms, which is not limited in the embodiment of the present application.
It should be noted that the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Cloud technology is a hosting technology that unifies hardware, software, network and other resources in a wide area network or local area network to realize computing, storage, processing and sharing of data. It is a general term for the network technology, information technology, integration technology, management platform technology, application technology and the like applied in the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support: background services of technical network systems, such as video websites, picture websites and other web portals, require a large amount of computing and storage resources. With the rapid development and application of the Internet industry, each item may have its own identification mark that needs to be transmitted to a background system for logic processing; data of different levels are processed separately, and all kinds of industrial data need strong system background support, which can only be realized through cloud computing.
In some embodiments, the servers described above may also be implemented as nodes in a blockchain system. The Blockchain (Blockchain) is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. The block chain, which is essentially a decentralized database, is a string of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, which is used for verifying the validity (anti-counterfeiting) of the information and generating a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
Fig. 2 is a schematic flowchart of an implementation process of an overall scheme provided by an exemplary embodiment of the present application, taking a video classification process performed by an internet video platform as an example, as shown in fig. 2, the method includes:
Step 210, a target video is acquired. The target video may be acquired in at least one of the following manners: first, a video is randomly acquired from an existing video library and classified as the target video; second, a self-recorded video is classified as the target video after being acquired; third, the server receives a video uploaded by the terminal and classifies the uploaded video as the target video. The above manners are only brief descriptions of how the target video may be obtained, and this embodiment is not limited thereto.
The target video is a video to be classified and predicted. The target video includes long videos such as movies and television programs, as well as short videos of short duration that are played and pushed through various media channels. Optionally, any video of a certain duration composed of continuous image frames can be classified by the video classification method provided by the embodiment of the present application. Illustratively, the target video is a television video, a movie video, a music video, a short video, or the like, which is not limited in the embodiments of the present application.
And step 220, extracting the features of the target video based on the video frames in the target video to obtain the video features corresponding to the target video.
Wherein the video feature is used to indicate picture information expressed by video frames of the target video.
A frame is the smallest unit of network transmission, and a relatively complete, independent piece of information is generally divided into one frame. According to the principle of persistence of vision, if continuous images change at more than 24 frames per second, human eyes cannot distinguish a single static image; a sequence of static pictures changing in order then appears to the human eye as a smooth, continuous visual effect, and such continuous pictures are called a video.
The video frame contains the visual characteristics of the video in a short time. Compared with the characteristic extraction of the whole video, the characteristic extraction of the video frame extracted from the video is more accurate, and the cost is lower. Therefore, the video features acquired from the video frames have better feature extraction effect than the video features obtained by directly extracting the features of the video, and the extracted features are more accurate.
The video frame extraction method includes various extraction methods, schematically, the video frame extraction method can adopt a random extraction method, for example, after a section of video is obtained, the video is imported into video editing software, after one or more parts in the video are randomly selected, a frame export button is clicked, and the effect of randomly extracting the video frame is achieved; or, extracting the video frames according to a preset interval, for example, after a section of video is acquired, importing the video into video editing software, setting the preset interval for extracting the video frames, and clicking a frame exporting button to obtain the video frames with the same time interval; alternatively, the extraction of the video frame may also adopt a segmented extraction manner, such as: dividing a target video into n video segments with equal length, and randomly sampling from the n video segments to obtain video frames, wherein n is a positive integer.
In addition, there are a number of ways of operation for video feature extraction. Schematically, frame extraction is carried out on a video object to obtain one or more frame images, various types of pooling are carried out on each frame image step by step to obtain image characteristics of the frame images, and then video characteristics are comprehensively determined according to the image characteristics of one or more frame images; video features may also be extracted using three-dimensional convolution operations.
And step 230, performing semantic analysis on the target video based on the video characteristics to obtain semantic characteristics corresponding to the target video.
The semantic features are used for indicating entity association relation expressed by video frames of the target video.
Semantics refers to how the information of an image, especially high-level information, is utilized, and provides a way of describing images for research. For video, semantics includes the temporal and spatial relationships between the important objects appearing in the video, as well as the content implied behind them. Semantic analysis can obtain characteristics that the pictures, sounds and the like in the video cannot directly reflect. Therefore, effectively extracting semantic features helps analyze the video from multiple angles and obtain various features in aspects such as time and space; processing the obtained features improves the accuracy of video classification.
The extraction of semantic features is based on the description of the low-level features of the video. The video contains rich information, and although the low-level features are too straightforward when describing video objects and cannot represent deeper information, the low-level features are usually obtained by directly performing statistics and calculation on video data and are an information layer closest to the video data. Therefore, the efficient extraction of semantic features can help to improve the accuracy of video classification.
In this embodiment, the entity association relationship refers to a relationship in which different objects, within the same video frame or across different video frames, are associated with each other and can express an entity meaning. Illustratively, the entity relationship is used to indicate the association between object A and object B in a video frame, for example: object A is located on top of object B.
And 240, fusing the video features and the semantic features to obtain fused features corresponding to the target video.
Wherein the video feature is used to indicate picture information expressed by video frames of the target video.
In the embodiment of the application, the video features and the semantic features are mapped onto the prototype network to obtain corresponding vectors, so that the vectors corresponding to the video features and the semantic features can be determined relative to each other. The prototype network is obtained by training in advance and can reflect the corresponding feature vectors obtained by feature extraction of the input target video.
The fusion is an element-wise addition of the vector corresponding to the video features of the target video and the vector corresponding to the semantic features. Fusing the video features and the semantic features makes the information expressed by the video frames more comprehensive and complete. The video features and the semantic features of the target video are fused to obtain a fusion feature, which is embodied in the form of a target node in the prototype network.
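A minimal sketch of the fusion described above, assuming the video feature and the semantic feature have already been mapped to vectors of the same dimension; the function name and shapes are illustrative, not part of the original disclosure.

```python
import torch

def fuse_features(video_feat: torch.Tensor, semantic_feat: torch.Tensor) -> torch.Tensor:
    """Element-wise addition of the video-feature vector and the
    semantic-feature vector gives the fusion feature (the target node)."""
    assert video_feat.shape == semantic_feat.shape, "vectors must share the same dimension"
    return video_feat + semantic_feat

# usage sketch with an assumed 512-dimensional embedding
fused = fuse_features(torch.randn(512), torch.randn(512))
```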
Because the fusion features comprise the video features and the semantic features, and the video features comprise the visual features and the non-visual features (such as the behavior features), the feature information amount is rich, and the result of predicting by subsequently adopting the fusion features is accurate.
And step 250, performing classification prediction on the fusion characteristics to obtain the video classification of the target video.
The fusion feature at this time includes not only the video feature but also the semantic feature. On the basis, classification prediction is carried out on the videos, and the videos can be better judged according to different video label information pointed by the characteristics. The video tag information referred to herein refers to a characteristic of a video that is distinguished from other videos.
Schematically, a sample video marked with a category label is obtained, sample fusion features of the sample video are extracted, the sample fusion features comprise sample semantic features and sample video features of the sample video, and the generation mode of the sample fusion features is similar to the processing mode of a target video.
Secondly, a graph neural network is constructed based on intra-class consistency requirements and inter-class difference requirements, wherein the graph neural network is obtained through training, and nodes in the graph neural network are composed of sample fusion characteristics and target fusion characteristics.
And then inputting the fusion features into a graph neural network for relationship propagation to obtain more robust fusion features, comparing the target node with other feature nodes in the graph neural network, and taking the behavior class to which the feature node with the minimum distance to the target node belongs as the class to which the target video belongs. In the embodiment of the application, the nodes of the graph neural network are fusion features, and the relationship propagation is to update the nodes in the graph neural network so as to obtain enhanced fusion features. And inputting the target fusion feature and the sample fusion feature into a graph neural network to obtain the fusion feature after the target fusion feature is enhanced and the fusion feature after the sample fusion feature is enhanced, and comparing the fusion feature and the sample fusion feature. In addition, the graph neural network is divided into similar degrees according to node distances, and therefore, distance analysis needs to be performed on a target node and other nodes in the graph neural network to obtain a feature distance between the fusion feature and the other nodes. And the target node is a fusion feature corresponding to the target video.
Then, the analysis node with the minimum distance to the feature of the fusion feature in the graph neural network is determined. Schematically, in the embodiment of the present application, distance analysis is performed on a target node and other nodes in the graph neural network, and if the target node is close to the other nodes in the graph neural network, it indicates that the similarity between the target video and the sample video corresponding to the node is high; on the contrary, if the target node is far away from the rest nodes in the graph neural network, the target video is low in similarity with the sample video corresponding to the node.
Illustratively, the clustering result can also be determined according to the distance between the fusion feature and a cluster center. The cluster center of each cluster is obtained by calculating the feature mean of the sample data input into the graph neural network, and the clustering result is determined according to the distance between the fusion feature and the cluster center of each cluster.
It should be noted that the above feature distance comparison method is only an exemplary method, and the present application is not limited thereto.
And taking the classification corresponding to the analysis node as the video classification to which the target video belongs.
And the fusion features are close to the rest nodes in the neural network of the graph, so that the similarity between the target video and the sample video is high. Therefore, the classification corresponding to the analysis node is most similar to the category of the target video. The classification corresponding to the analysis node can be taken as the video classification to which the target video belongs.
In summary, according to the video classification method provided in this embodiment, after the target video is obtained, feature extraction is performed on the target video based on the video frames in the target video to obtain video features corresponding to the target video, semantic analysis is performed on the target video based on the video features to obtain semantic features corresponding to the target video, then the video features and the semantic features are fused to obtain fusion features corresponding to the target video, and finally, classification prediction is performed on the fusion features to obtain video classification to which the target video belongs. In the process, the video characteristic and the semantic characteristic are fused, so that the acquisition range of the video characteristic information can be expanded, the acquired characteristics are richer, the identification ability is better, and the accuracy of video classification is improved.
To schematically describe how the video features in step 220 above are obtained, please refer to fig. 3. Performing feature extraction on the target video based on the video frames in the target video to obtain the video features corresponding to the target video includes the following steps:
at step 310, at least two video frames are extracted from the target video.
Wherein, a time sequence relation exists between at least two frames of video frames.
Time sequence refers to the order of events in time. The time sequence relationship in this example means that, when extracting video frames from the target video, not only the temporal order but also whether the time intervals between the extracted frames follow a certain rule should be considered, including but not limited to equal time intervals between the extracted frames.
Illustratively, the method of extracting the video frame may include the following processes: the target video is segmented to obtain at least two video segments, and video frame sampling is carried out on the at least two video segments to obtain at least two video frames. If the target video is divided into 8 sections on average, 1 frame is randomly extracted from each section, so that 8 frames of video can be acquired. Through the above operation, at least two video frames having a time sequence relationship can be acquired.
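A minimal sketch of the segmented sampling strategy described above (dividing the target video into n equal segments and randomly sampling one frame per segment); the frame-list representation and the default n = 8 are assumptions for illustration, not part of the original disclosure.

```python
import random

def sample_frames(frames: list, n_segments: int = 8) -> list:
    """Split the frame sequence into n equal segments and randomly pick one
    frame from each segment, preserving temporal order.
    Assumes len(frames) >= n_segments."""
    seg_len = len(frames) // n_segments
    picked = []
    for i in range(n_segments):
        start = i * seg_len
        end = start + seg_len if i < n_segments - 1 else len(frames)
        picked.append(frames[random.randrange(start, end)])
    return picked
```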
And step 320, extracting the video characteristics of the target video frame based on the time sequence relation and the picture between at least two frames of video frames.
The time sequence relationship and the picture features contained in the at least two video frames extracted by the previous operation interpret the video frames from the two-dimensional and three-dimensional angles, respectively. By performing multi-dimensional analysis and extraction on the target video frames, more accurate video features can be extracted, which improves the accuracy of the final video classification result.
Extracting the time sequence relationship between the at least two video frames emphasizes analysis of the video frames from the two-dimensional angle; temporal extraction can be performed based on the time sequence relationship.
Illustratively, given an input feature X ∈ R^(N×T×C×H×W), where N is the batch size, T is the number of video frames, C is the number of channels, and H and W are the height and width of the video frames, the adjacent video frames X_{t+1} and X_t can be obtained from the input. The feature X_{t+1} is first processed with a 3×3 2D convolution, and the timing difference process can then be expressed as:
H(t) = conv_2(X_{t+1}) − X_t, 1 ≤ t ≤ T−1
where H(t) denotes the behavior feature at time t and conv_2 denotes the 3×3 2D convolution operation. In particular, the behavior feature of the last time step T is set to 0, i.e. H(T) = 0. All behavior features are then concatenated, i.e. H = [H(1), …, H(T)], where [·] denotes the concatenation operation.
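A minimal sketch of the timing-difference computation H(t) = conv_2(X_{t+1}) − X_t described above; the module name, tensor layout and PyTorch usage are assumptions for illustration, not part of the original disclosure.

```python
import torch
import torch.nn as nn

class TemporalDifference(nn.Module):
    """Behavior features from differences between adjacent frames:
    H(t) = conv2(X_{t+1}) - X_t, with the last step H(T) set to zero."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, C, H, W)
        n, t, c, h, w = x.shape
        nxt = self.conv2(x[:, 1:].reshape(-1, c, h, w)).reshape(n, t - 1, c, h, w)
        diff = nxt - x[:, :-1]                    # H(1) ... H(T-1)
        last = torch.zeros_like(x[:, :1])         # H(T) = 0
        return torch.cat([diff, last], dim=1)     # concatenate along the time axis
```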
The original features are obtained by passing the video frames through a pre-trained feature extraction network.
Extraction of the pictures of the at least two video frames focuses on analysis of the video frames from the three-dimensional angle; spatial extraction can be performed based on the pictures.
In addition, different channels contain abundant visual information.
Wherein, a channel refers to an area where video frame information flows. The adjacent channel refers to a region where video frame information adjacent to a video frame to be analyzed flows. Therefore, the visual modeling capability of the video classification method can be enhanced by utilizing the difference of adjacent channels.
Illustratively, the input visual features are divided into C feature subsets. Given two adjacent features X_c and X_{c+1}, the feature X_{c+1} is first processed with a 3×3 2D convolution, and the channel difference process is defined as:
S(c) = conv_1(X_{c+1}) − X_c, 1 ≤ c ≤ C−1
where S(c) denotes the visual feature of channel c and conv_1 denotes the 3×3 2D convolution operation. In particular, to keep the resulting features consistent with the original feature dimensions, the visual feature of the last channel is set to zero, i.e. S(C) = 0.
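A minimal sketch of the channel-difference computation S(c) = conv_1(X_{c+1}) − X_c described above; it mirrors the temporal-difference sketch but operates on adjacent channel subsets. Treating each channel as one subset and using a depth-wise convolution are assumptions for illustration, not part of the original disclosure.

```python
import torch
import torch.nn as nn

class ChannelDifference(nn.Module):
    """Visual features from differences between adjacent channel subsets:
    S(c) = conv1(X_{c+1}) - X_c, with the last channel S(C) set to zero."""
    def __init__(self, channels: int):
        super().__init__()
        # depth-wise 3x3 convolution so each channel subset is processed independently
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W); each channel is treated as one feature subset
        shifted = self.conv1(x)                   # conv1(X_c) for every channel c
        diff = shifted[:, 1:] - x[:, :-1]         # S(1) ... S(C-1)
        last = torch.zeros_like(x[:, :1])         # S(C) = 0
        return torch.cat([diff, last], dim=1)     # same dimensions as the input
```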
The above spatial extraction and temporal extraction can be consolidated into a spatio-temporal difference module, namely: the at least two video frames are passed through the spatio-temporal difference module to extract the video features of the target video frame.
And step 350, fusing the behavior characteristics and the visual characteristics to obtain the video characteristics of the target video frame.
The process of fusing the behavior characteristics and the visual characteristics to acquire the video characteristics of the target video frame comprises the following two steps:
In some embodiments, the behavior features and the visual features are fused to obtain a mixed feature.
Schematically, the time sequence relation and the picture between at least two frames of video frames are analyzed by adopting a time sequence extraction method and a space extraction method respectively. Finally, behavior characteristics at different moments can be obtained after the time sequence relation of at least two frames of video frames is extracted; visual features of different channels can be obtained after the pictures of at least two video frames are extracted. The features of two dimensions of time and space are respectively extracted, so that the condition that the feature extraction is incomplete due to the fact that the features are extracted in a single dimension can be avoided, the condition that the feature extraction is inaccurate due to the fact that the features are extracted in two dimensions simultaneously can also be avoided, and the accuracy of the feature extraction is effectively improved.
And then, fusing the behavior features extracted based on the time sequence extraction mode and the visual features extracted based on the space extraction mode to obtain mixed features. The mixed features comprise the information contained in the behavior features after extraction and the information contained in the visual features after extraction, so that the information of the video frames is covered more comprehensively and accurately.
And connecting the visual features and the mixed features in a residual error connection mode to obtain the video features.
The mixed features comprise the behavior features extracted in the temporal manner and the visual features extracted in the spatial manner. Although these features are comprehensive, the gradient-vanishing problem caused by the feature extraction is inevitable. At this time, a residual connection operation between the input visual features and the mixed features can effectively weaken this phenomenon and yield the enhanced visual features, namely the video features.
Residual connection expresses the output as a linear superposition of the input and a nonlinear transformation of the input. It is used to alleviate the gradient-vanishing phenomenon caused by increasing network depth in deep learning.
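A minimal sketch of the residual connection described above: the mixed feature (behavior differences fused with channel differences) is superposed on the input visual feature to obtain the enhanced video feature. The exact composition of the mixed feature here is an assumption for illustration, not part of the original disclosure.

```python
import torch

def enhance_visual_feature(visual_feat: torch.Tensor,
                           behavior_feat: torch.Tensor,
                           channel_feat: torch.Tensor) -> torch.Tensor:
    """Fuse the behavior feature (temporal difference) and the visual feature
    (channel difference) into a mixed feature, then add it back to the input
    visual feature through a residual connection."""
    mixed = behavior_feat + channel_feat    # mixed feature
    return visual_feat + mixed              # residual connection: input + transformation
```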
Gradient vanishing refers to the phenomenon that, in a neural network, as the number of hidden layers increases, the learning rate of a preceding hidden layer becomes lower than that of a subsequent hidden layer, so the classification accuracy no longer increases and may even decrease.
In summary, the video classification method provided in this embodiment is a relevant description of how to obtain corresponding video features from a target video. Since the initial visual features obtained by extracting the features of the target video include various noises, it is necessary to fully consider a method for obtaining enhanced visual features while reducing noises. Through analysis of the target video, the video feature accuracy obtained by feature extraction of at least two frames of video frames with time sequence relation in the target video is higher.
The video frames are analyzed by considering not only the time sequence relationship among them but also their pictures. The behavior features obtained by analyzing the time sequence relationship are fused with the visual features obtained by analyzing the pictures, and a residual connection operation is then applied to obtain enhanced video visual features, namely the video features. Through the above operations, inaccuracies in visual features caused by noise can be reduced, and the accuracy of acquiring the corresponding video features from the target video is improved.
To schematically explain how the semantic features involved in step 230 above are obtained, please refer to fig. 4. The process of performing semantic analysis on the target video based on the video features to obtain the semantic features corresponding to the target video includes the following steps 410 to 420.
Step 410, the video features are input into a semantic generation network.
The semantic generation network is a semantic feature generation network obtained after training. A semantic network refers to a network containing a plurality of concepts and relationship instances connecting pairs of concepts. Semantic generation refers to the process of analyzing the described picture information. The semantic generation network is thus a network capable of generating semantics through the interrelations among a plurality of concepts.
In the present application, semantic generation refers to a process of analyzing the described video information; the semantic generation network refers to a network which expresses entity association through video frames.
The video features are input into the semantic generation network in order to learn semantic information corresponding to the video features from the video features by using the trained semantic generation network.
And 420, performing semantic analysis on the video features through a semantic generation network, and outputting to obtain semantic features corresponding to the target video.
In some embodiments, prototype extraction is performed on the video features to obtain prototype features corresponding to the video features.
Wherein, the prototype feature is a feature mapping result of the video feature in a prototype space. The prototype space is a space obtained through multiple training and capable of sufficiently reflecting the relationship between the original video and the processed video. By inputting the video features of the target video into the prototype space, the prototype features corresponding to the video features can be obtained after prototype extraction.
And performing semantic analysis on the prototype features through a semantic generation network, and outputting to obtain semantic features corresponding to the target video.
Prototype extraction is performed on the video features to obtain prototype features corresponding to the video features, and semantic analysis is then performed on the prototype features through the trained semantic generation network, which can express the entity association relationships among semantic features, to generate the semantic features corresponding to the target video.
In summary, the video classification method provided in this embodiment has been described with respect to how the semantic features are obtained. The semantic features are obtained from the video features, specifically through the semantic generation network. The semantic generation network is a feature generation network obtained after training, through which semantic features can be generated from the video features. The semantic feature extraction network is trained by constructing a training task, thereby yielding the semantic generation network. This can reduce inaccurate extraction of semantic features from the video features when there are few sample videos, and improves the prediction accuracy of the semantic generation network.
Referring schematically to fig. 5, before the video features are input into the semantic generation network as described above, a training process for the semantic generation network is further included. The training of the semantic generation network is performed synchronously with the training of the video feature extraction network, and the training process of the semantic generation network includes the following steps:
Step 510, a sample video marked with a category label is acquired. The category label is used for indicating the reference video category to which the sample video belongs. Classifying videos through the label information facilitates the specific classification of sample video categories.
And step 520, performing feature extraction on the sample video to obtain sample video features corresponding to the sample video.
Wherein the video feature is used to indicate picture information expressed by video frames of the target video.
The feature extraction of the sample video is to extract features of the video frames of the sample video to obtain features of the sample video, and the specific operation is similar to the manner of extracting video features of the target video.
The video features of the sample video are extracted by performing spatial extraction and temporal extraction on the sample video through the spatio-temporal difference module.
Illustratively, at least two video frames with a time sequence relationship are extracted from the sample video; behavior features are extracted based on the time sequence relationship between the at least two frames, and visual features are extracted based on the pictures of the at least two frames. The behavior features and visual features are then fused to obtain mixed features. Finally, the visual features and the mixed features are connected in a residual connection manner to obtain the sample video features.
The sample video features are obtained by carrying out feature extraction on the sample video, and based on the sample video features, the label information of the sample video features can be obtained. In the embodiment of the present application, the sample video features include sample visual features. Through the obtained video characteristics, the sample visual characteristics can be determined according to the category label corresponding to the sample video.
And 540, performing semantic generation on the sample video features through a semantic generation network to obtain sample semantic features corresponding to the sample video.
The semantic generation network is trained together with the whole network, and corresponding semantic features can be obtained according to the video features of the samples.
At step 550, semantic generation loss is determined based on the sample semantic features and the class labels.
The semantic features of the samples are obtained by subjecting the video features of the samples to a semantic generation network. Optionally, the information guided by the category labels is obtained through a pre-trained network, such as: word2Vec model. In the embodiment of the application, the category label is a pseudo label marked by the Word2Vec model.
At this time, with semantic differences fully considered, the semantic generation loss can be determined based on the semantic features of the sample video itself in combination with the category label of the sample video.
Illustratively, determining semantic generation loss based on the sample semantic features and the class labels comprises the operations of:
In some embodiments, the class labels are mapped into a vector space to obtain spatial mapping features.
Based on the feature distance between the sample semantic features and the spatial mapping features, a semantic generation loss is determined.
Illustratively, the semantic generation network projects the visual prototype obtained from the prototype network into the semantic generation space to generate a semantic feature:
s_m = g(p_m)
where s_m is the generated semantic feature, p_m is the visual prototype of the sample video v_m, and g(·) is the semantic generation network, a nonlinear neural network. The semantic generation network g(·) is guided by a semantic generation loss function so that the visual prototype is brought closer to the corresponding class semantic feature, which can be expressed as:
L_sem = Σ_m d(g(p_m), f_ψ(l_m))
where d(·,·) is the feature distance, f_ψ(·) represents the Word2Vec model, and l_m denotes the label information of the sample video.
Based on the feature distance between the sample semantic features and the spatially mapped features, a semantic generation loss may be determined. The semantic generation loss is obtained after the training is carried out by adopting a plurality of sample videos, the semantic generation loss is reduced as a target, the semantic generation network is trained, the semantic information generated by the semantic generation network is enabled to be as close to the real semantic information as possible, and the semantic generation accuracy of the semantic generation network can be improved.
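A minimal sketch of the semantic generation network and its loss, assuming a two-layer MLP for g(·), pre-computed Word2Vec label embeddings for f_ψ(l_m), and a squared-Euclidean form of the feature distance; these choices are assumptions for illustration, not part of the original disclosure.

```python
import torch
import torch.nn as nn

class SemanticGenerator(nn.Module):
    """Nonlinear network g(.) that maps a visual prototype into the
    semantic space, trained to approach the class word embedding."""
    def __init__(self, feat_dim: int, sem_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, sem_dim),
            nn.ReLU(),
            nn.Linear(sem_dim, sem_dim),
        )

    def forward(self, prototype: torch.Tensor) -> torch.Tensor:
        return self.net(prototype)

def semantic_generation_loss(gen_semantics: torch.Tensor,
                             label_embeddings: torch.Tensor) -> torch.Tensor:
    """Feature distance between generated semantic features g(p_m) and the
    Word2Vec embeddings f_psi(l_m) of the class labels (squared Euclidean)."""
    return ((gen_semantics - label_embeddings) ** 2).sum(dim=1).mean()
```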
In summary, the video classification method provided in this embodiment is a relevant description of a process for training a semantic generation network before inputting visual features into the semantic generation network. Firstly, a sample video marked with a category label is obtained, and then feature extraction is carried out on the sample video to obtain sample video features corresponding to the sample video. And then determining sample visual features corresponding to the category labels based on the sample video features, and performing semantic generation on the sample video features through a semantic generation network to obtain sample semantic features corresponding to the sample video. And determining semantic generation loss based on the sample semantic features and the class labels, and training the semantic generation network by the semantic generation loss. Through the process, semantic generation loss can be used for helping to improve the accuracy of semantic generation of the semantic generation network, the trained semantic generation network can extract semantic features as accurately as possible when performing semantic generation on video features, and the method is favorable for improving the accuracy of video classification when analyzing video classification after fusing the semantic features and the video features.
To schematically describe how the video classification result in step 250 above is obtained, please refer to fig. 6. Performing classification prediction on the fusion features to obtain the video classification to which the target video belongs includes the following steps:
and step 610, obtaining a sample video, wherein the sample video is correspondingly marked with a category label.
Wherein the category label is used for indicating a reference video category to which the sample video belongs. The videos are classified through the label information, and the specific classification of the categories of the sample videos is facilitated.
And step 620, extracting sample fusion characteristics of the sample video.
The sample fusion features comprise sample semantic features and sample video features of the sample video. The generation of the sample fusion features is similar to the processing of the target video. And the sample fusion feature is obtained by respectively extracting the semantic features and the video features and then fusing the semantic features and the video features.
The semantic features of the sample video are generated from the sample video features through the semantic generation network.
The semantic generation network is a semantic feature generation network obtained after semantic generation loss training.
And for the extraction of the video features in the sample video, the video features of the sample video frame are obtained by performing space extraction and time sequence extraction on the sample video frame through the space-time difference module.
Illustratively, video frames with a time sequence relationship between at least two frames are extracted from a sample video, behavior features between at least two frames are extracted based on the time sequence relationship between at least two frames, and visual features of a target video frame are extracted based on pictures of at least two frames. And then the behavior characteristics and the visual characteristics are fused to obtain mixed characteristics. And finally, connecting the visual features and the mixed features in a residual error connection mode to obtain sample video features.
Step 630, a graph neural network is constructed based on intra-class consistency requirements and inter-class difference requirements. The nodes in the graph neural network are composed of the sample fusion features and the target fusion feature.
The graph neural network combines graph data and the neural network, end-to-end calculation is carried out on the graph data, the whole calculation process is carried out along the structure of the graph, and the structure information can be learned on the basis of retaining the structure information of the graph.
The intra-class consistency requirement and the inter-class difference requirement are used, so that the belonged range of the video class can be better divided by distinguishing the consistency from the difference. The neural network of the graph constructed based on the method has higher accuracy, can better achieve the purpose of effective classification of videos, and the operation can be summarized into relationship propagation.
In the embodiment of the application, relation propagation updates the relation nodes so as to obtain fusion features with discriminative power. The target fusion feature and the sample fusion features are input into the graph neural network to obtain the enhanced target fusion feature and the enhanced sample fusion features, which are then compared.
Schematically, let the relation node matrix be $R$, with $R_m$ denoting the $m$-th relation node. Given a video $v_m$, its semantic feature is $s_m$ and its corresponding video prototype is $p_m$; the two are combined by element-wise addition to obtain the relation node $R_m = s_m + p_m$. The relation nodes are then updated through the graph convolution network to obtain relation features with discriminative power. The relation propagation rule is as follows:

$$R' = \delta\left(\hat{A} R Q\right), \qquad \hat{A} = A + E,$$

where $\hat{A}$ is the adjacency matrix of the relation graph with self-connections, the adjacency matrix $A$ is constructed by using a cosine similarity function, $E$ is an identity matrix, $Q$ is a weight matrix, and $\delta$ is the activation function. In this way, the enhanced features after relation propagation are obtained.
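Illustratively, one relation propagation step corresponding to the above rule can be sketched as follows; the cosine-similarity construction of the adjacency matrix, the choice of ReLU as the activation function $\delta$, and the single-layer update are illustrative assumptions and not a limitation of the present application.

```python
# Sketch of one relation-propagation step: relation nodes are the element-wise
# sum of semantic features and video prototypes; adjacency is built from cosine
# similarity with self-connections added via the identity matrix; a single
# graph-convolution layer with weight matrix Q and an activation updates them.
import torch
import torch.nn.functional as F

def relation_propagation(semantic_feats, video_prototypes, Q):
    """semantic_feats, video_prototypes: (N, D); Q: (D, D) learnable weights."""
    R = semantic_feats + video_prototypes                                 # relation nodes R_m
    A = F.cosine_similarity(R.unsqueeze(1), R.unsqueeze(0), dim=-1)       # cosine adjacency A
    A_hat = A + torch.eye(R.size(0))                                      # add self-connections (E)
    return torch.relu(A_hat @ R @ Q)                                      # delta(A_hat R Q): enhanced features
```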
And step 640, performing distance analysis on the target node and the other nodes in the graph neural network to obtain the feature distances between the target node and the sample nodes.
In the embodiment of the application, the nodes of the graph neural network are fusion features obtained through training. The graph neural network judges the degree of similarity by the distance between nodes, so the feature distances between the target node and the remaining nodes, which are sample fusion features and are called sample nodes, need to be compared.
When the distance between the target node and a sample node is analyzed, the comparison is schematically performed with a Euclidean distance function: the smaller the Euclidean distance, the more similar the contents of the target fusion feature and the sample fusion feature; the larger the distance, the greater the difference between them. A cosine similarity distance function may also be used for the comparison and can achieve a better experimental effect.
Illustratively, the fusion feature of the target video is compared with the fusion features of the sample videos. This process can be expressed as:

$$y = l_{\arg\min_{n} d\left(B_q, B_n\right)},$$

where $d(\cdot,\cdot)$ represents the similarity distance function, $B_q$ represents the fusion feature of the target video $q$ to be analyzed, $B_n$ represents the fusion feature of the $n$-th sample video, $l_a$ is the category to which sample video $a$ belongs, and $y$ is the predicted category of the target video.
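Illustratively, the comparison formalized above, together with the subsequent selection of the nearest node, can be sketched as follows; the function names and the two distance options are illustrative assumptions.

```python
# Sketch of distance-based prediction: compare the target fusion feature with
# every sample fusion feature (Euclidean or cosine distance) and return the
# category label of the closest sample node.
import torch

def predict_category(target_feat, sample_feats, sample_labels, metric="euclidean"):
    """target_feat: (D,); sample_feats: (N, D); sample_labels: list of N labels."""
    if metric == "euclidean":
        dists = torch.norm(sample_feats - target_feat, dim=1)
    else:  # cosine similarity turned into a distance
        dists = 1 - torch.nn.functional.cosine_similarity(
            sample_feats, target_feat.unsqueeze(0), dim=1)
    nearest = torch.argmin(dists).item()      # analysis node: smallest feature distance
    return sample_labels[nearest]             # its category is the predicted video class
```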
And step 650, determining the node with the minimum feature distance to the target node in the graph neural network as the analysis node.
Because the graph neural network judges the degree of similarity by node distance, a small distance means a high degree of similarity and a large distance means a low degree of similarity, wherein the nodes of the graph neural network are the fusion features.
In the embodiment of the application, the distance between a target node and a sample node is analyzed, and if the distance between the target node and the sample node is short, the similarity between a target video and a sample video is high; on the contrary, if the target node is far away from the sample node, it is indicated that the similarity between the target video and the sample video is low, wherein the target node is a fusion feature corresponding to the target video.
And step 660, taking the classification corresponding to the analysis node as the video classification to which the target video belongs.
When the target node is compared with the sample nodes, a sample node closer to the target node indicates a higher degree of similarity between the target video and the sample video corresponding to that node. Therefore, the sample node closest to the target node is selected, because the video category corresponding to it is closest to the category of the target video. This closest node can thus serve as the analysis node, and the classification corresponding to the analysis node is taken as the video classification to which the target video belongs.
In summary, the above describes how classification prediction is performed on the fusion features to obtain the video classification to which the target video belongs, which is also a brief description of how the model used in classification is trained. Firstly, sample videos are obtained from already-classified videos and their sample fusion features are extracted; a graph neural network is then constructed based on the intra-class consistency requirement and the inter-class difference requirement, with the sample fusion features and the target fusion feature as its nodes. The graph neural network performs relation propagation, which enhances the fusion features and makes them more discriminative. After the graph neural network is trained, the target video is mapped into it, and distance analysis between the target node and the sample nodes reveals how far the target video is from the video categories represented by those sample nodes. Finally, the node with the minimum distance to the target node is selected, and the classification corresponding to that node is taken as the video classification to which the target video belongs. Training in this way yields a video classification model, and after the target video is processed, the model can be used for comparison to obtain the category of the video. Because the model is trained on many sample videos, the video categories it covers are more comprehensive, so comparing the target video through this model makes the classification result more comprehensive and more accurate.
In combination with the above introduction of terms and application scenarios, the video classification method provided by the present application is described below. The method may be executed by a server or a terminal, or by both the server and the terminal; in this embodiment of the present application it is described as being executed by the server. As shown in fig. 7, the method includes the following process:
First, M categories of videos are extracted from the videos in the sample library 701, and K videos are extracted for each category, thereby forming a support set 702. Then 1 video is extracted from the remaining videos in the M categories to form a query set 703. Each video is divided evenly into 8 segments, 1 frame is randomly extracted from each segment, and 8 video frames are acquired in total. These video frames are input into the feature extraction network 704 to obtain video features. In order to improve the expression capability of the video features, the spatiotemporal difference module 705 is utilized to obtain improved video features. To obtain semantic information, the semantic generation network 706 is utilized to generate semantic information. In order to make the generated semantic information closer to the real semantic information, a semantic generation loss function 707 is introduced to supervise the learning of the semantic generation network 706. The visual information and the corresponding semantic information then form visual-semantic pairs, which are learned by the graph convolution network 708 using intra-class consistency and inter-class differences. Finally, a small sample classifier 709 estimates the similarity between the videos in the support set and the video in the query set, and the class with the highest estimated score is taken as the video class 710 of the query set.
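Illustratively, the construction of the support set and query set and the frame sampling described above can be sketched as follows; representing the sample library as a mapping from category names to lists of frame sequences, and drawing one query video per category, are assumptions for illustration and not a limitation of the present application.

```python
# Sketch of episode construction: M classes, K support videos per class plus
# one query video, each video split evenly into 8 segments with 1 frame
# sampled at random from each segment.
import random

def build_episode(sample_library, m_way=5, k_shot=1, num_segments=8):
    classes = random.sample(list(sample_library), m_way)
    support, query = [], []
    for c in classes:
        videos = random.sample(sample_library[c], k_shot + 1)
        support += [(c, sample_frames(v, num_segments)) for v in videos[:k_shot]]
        query.append((c, sample_frames(videos[k_shot], num_segments)))
    return support, query

def sample_frames(frames, num_segments=8):
    """Divide the frame list into equal segments and draw one random frame from each."""
    seg_len = max(len(frames) // num_segments, 1)
    return [frames[min(i * seg_len + random.randrange(seg_len), len(frames) - 1)]
            for i in range(num_segments)]
```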
In fig. 7, a plurality of reference labels are used; their meanings are shown in the label legend 711.
In summary, in the video classification method provided in this embodiment, the video frame of the target video is subjected to feature extraction to obtain the video features corresponding to the target video, then the semantic features obtained based on the video features are fused with the video features to obtain the fusion features, and finally the category to which the video belongs is divided based on the fusion features to obtain the video classification result. In the process, not only the classification information contained in the video features is fully considered, but also the semantic features corresponding to the video features are extracted from the video features and are included in the analysis process, so that the acquisition range of the video feature information can be expanded, the situation that the information is not completely contained in the video classification process is avoided, and the accuracy of video classification is improved.
The above-described classification process is only a simple example of a video classification method, and the present application is not limited thereto.
Schematically, as shown in fig. 8, the accuracy of the video classification method provided in the present application is compared with that of other commonly used existing video classification methods, where the experiments are performed on public data sets.
Illustratively, the selected video classification methods 810 include the existing video classification methods BaseNet, CMN, TARN, Embossed Learning, ARN and AMeFu-Net, as well as the video classification method SRPN of the present application, and the selected public data sets 820 include MiniKinetics, UCF101 and HMDB51.
BaseNet is a mature model serving as the backbone feature extraction network, which can directly extract video features from a video. CMN is an abbreviation of Compound Memory Network; it retains the features of sample videos by means of the accumulation characteristics of a memory structure, thereby addressing the video classification problem. TARN is an abbreviation of Temporal Attentive Relation Network; its core is a meta-learning method for few-shot and zero-shot action recognition. Embossed Learning constructs a non-real action data set and proposes a data enhancement method based on video segment replacement, which is used for small-sample behavior recognition in video classification. ARN is an abbreviation of Action Recognition Net; its encoder captures short-range action patterns, the coding blocks are aggregated through permutation-invariant pooling, and the relation between variable action length and long-range event dependency is studied to classify videos. AMeFu-Net is an abbreviation of Adaptive Meta-Fusion Network; it introduces depth information as additional visual information and proposes a temporal asynchronous enhancement mechanism to enhance the original video representation. The method specifically adopted in this application example is SRPN, a semantic-guided relation propagation network for small-sample behavior recognition; it is only one example of a video classification method, and the present embodiment is not limited thereto.
Fig. 8 is an experimental result of video classification accuracy in different data sets by using different video classification methods, where a shot represents the number of video samples used for training, and for example, 1-shot represents that the number of video samples used for training is 1. Fig. 8 is analyzed as follows:
In the vertical comparison, it can be found that for the same video classification method, the video classification accuracy increases as the number of training samples grows, and the best classification effect is achieved on the UCF101 data set;
In the horizontal comparison, it can be found that among the different video classification methods, SRPN obtains the best video classification effect. On UCF101, when the number of training sample videos is large, the accuracy even approaches the limit set by the characteristics of the videos themselves.
The above results demonstrate that the SRPN method proposed in this application example classifies videos with better precision than the existing video classification methods and can obtain higher accuracy.
It should be noted that the above embodiments compare the method described herein with a limited number of other methods on a limited number of data sets; the classification method may also be compared and analyzed with other video classification methods on other data sets, which is not limited in the embodiments of the present application.
Fig. 9 is a block diagram illustrating the structure of a video classification apparatus according to an exemplary embodiment of the present application. As shown in fig. 9, the video classification apparatus includes:
an obtaining module 910, configured to obtain a target video, where the target video is a video to be classified and predicted;
an extracting module 920, configured to perform feature extraction on a target video based on a video frame in the target video to obtain a video feature corresponding to the target video, where the video feature is used to indicate picture information expressed by the video frame of the target video;
an analysis module 930, configured to perform semantic analysis on the target video based on the video features to obtain semantic features corresponding to the target video, where the semantic features are used to indicate entity association expressed by video frames of the target video;
a fusion module 940, configured to fuse the video features and the semantic features to obtain fusion features corresponding to the target video;
and the prediction module 950 is configured to perform classification prediction on the fusion features to obtain a video classification to which the target video belongs.
In an alternative embodiment, as shown in fig. 10, the analysis module 930 includes:
the input unit 931 is configured to input the video features into a semantic generation network, where the semantic generation network is a training-obtained semantic feature extraction network;
an analyzing unit 932, configured to perform semantic analysis on the video features through the semantic generation network, and output a semantic feature corresponding to the target video.
In an optional embodiment, the analyzing module 930 further includes:
an obtaining unit 933, configured to perform prototype obtaining on the video feature to obtain a prototype feature corresponding to the video feature, where the prototype feature is a feature mapping result of the video feature in a prototype space;
the analyzing unit 932 is further configured to perform semantic analysis on the prototype feature through the semantic generation network, and output a semantic feature corresponding to the target video.
In an optional embodiment, the extracting module 920 is further configured to extract at least two video frames from the target video, where a timing relationship exists between the at least two video frames; and extracting the video characteristics of the target video frame based on the time sequence relation and the picture between the at least two frames of video frames.
In an optional embodiment, the extracting module 920 is further configured to extract a behavior feature between the at least two frames of video frames based on a timing relationship between the at least two frames of video frames; extracting visual features of the target video frame based on pictures of the at least two video frames;
the fusion module 940 is further configured to fuse the behavior feature and the visual feature to obtain the video feature of the target video frame.
In an optional embodiment, the fusion module 940 is further configured to fuse the behavior feature and the visual feature to obtain a mixed feature; and connecting the visual features and the mixed features in a residual error connection mode to obtain the video features.
In an optional embodiment, the extracting module 920 is further configured to perform segmentation processing on the target video to obtain at least two video segments; and sampling video frames from the at least two video clips to obtain the at least two video frames.
In an optional embodiment, the prediction module 950 is further configured to perform distance analysis on the fused features and nodes in the graph neural network to obtain feature distances between the fused features and the nodes; determining a target node in the graph neural network with the minimum feature distance to the fusion feature; and taking the classification corresponding to the target node as the video classification to which the target video belongs.
In an optional embodiment, the obtaining module 910 is further configured to obtain a sample video, where the sample video is correspondingly labeled with a category label, and the category label is used to indicate a reference video category to which the sample video belongs;
the extracting module 920 is further configured to perform feature extraction on the sample video to obtain sample video features corresponding to the sample video;
an analysis module 930, further configured to determine, based on the sample video features, sample visual features corresponding to the category labels;
the extracting module 920 is further configured to perform semantic extraction on the sample visual features through the semantic generation network to obtain sample semantic features corresponding to the sample video;
the device further comprises:
a determining module 960 for determining semantic extraction loss based on the sample semantic features and the class labels;
a training module 970, configured to train the semantic generation network with the semantic extraction loss.
In an optional embodiment, the determining module 960 is further configured to perform vector space mapping on the category label to obtain a space mapping feature; determining the semantic extraction loss based on a feature distance between the sample semantic features and the spatially mapped features.
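Illustratively, this optional embodiment can be sketched as follows, where the vector space mapping of the category label is realized through a pre-built word-embedding table and the feature distance is the Euclidean norm; both the embedding source and the distance choice are illustrative assumptions and not a limitation of the present application.

```python
# Minimal sketch of the semantic extraction loss: map the category label into a
# vector space and take the feature distance to the generated sample semantic
# feature as the loss.
import torch

def semantic_extraction_loss(sample_semantic_feat, category_label, word_embeddings):
    """sample_semantic_feat: (D,); word_embeddings: dict mapping label -> (D,) tensor."""
    space_mapped = word_embeddings[category_label]           # vector-space mapping of the label
    return torch.norm(sample_semantic_feat - space_mapped)   # feature distance as the loss
```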
In an optional embodiment, the obtaining module 910 is further configured to obtain a sample video, where the sample video is correspondingly labeled with a category label, and the category label is used to indicate a reference video category to which the sample video belongs;
the extracting module 920 is further configured to extract a sample fusion feature of the sample video, where the sample fusion feature includes a sample semantic feature and a sample video feature of the sample video;
the device further comprises:
a building module 980 configured to build the graph neural network based on intra-class consistency requirements and inter-class difference requirements, wherein nodes in the graph neural network are formed by the sample fusion features.
In summary, in the video classification apparatus provided in this embodiment, a target video is first obtained by the obtaining module, feature extraction is performed on the target video by the extraction module to obtain the video features corresponding to the target video, semantic analysis is then performed on the target video by the analysis module to obtain the semantic features corresponding to the target video, the video features and the semantic features are fused by the fusion module to obtain the fusion features corresponding to the target video, and finally the fusion features are classified and predicted by the prediction module to obtain the video classification to which the target video belongs. In this process, fusing the video features and the semantic features expands the acquisition range of video feature information, avoids incomplete information in the video classification process, and improves the accuracy of video classification.
It should be noted that: the video classification apparatus provided in the foregoing embodiment is only illustrated by dividing the functional modules, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the video classification apparatus and the video classification method provided in the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
Fig. 11 shows a schematic structural diagram of a server according to an exemplary embodiment of the present application. The server 1000 includes a Central Processing Unit (CPU) 1001, a system Memory 1004 including a Random Access Memory (RAM) 1002 and a Read Only Memory (ROM) 1003, and a system bus 1005 connecting the system Memory 1004 and the Central Processing Unit 1001. The server 1000 also includes a mass storage device 1006 for storing an operating system 1013, application programs 1014, and other program modules 1015.
The mass storage device 1006 is connected to the central processing unit 1001 through a mass storage controller (not shown) connected to the system bus 1005. The mass storage device 1006 and its associated computer-readable media provide non-volatile storage for the server 1000. That is, the mass storage device 1006 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read Only Memory (CD-ROM) drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash Memory or other solid state Memory technology, CD-ROM, Digital Versatile Disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 1004 and mass storage device 1006 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 1000 may also be operated by a remote computer connected through a network such as the Internet. That is, the server 1000 may be connected to the network 1012 through the network interface unit 1011 connected to the system bus 1005, or may be connected to another type of network or remote computer system (not shown) using the network interface unit 1011.
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU.
Embodiments of the present application further provide a computer device, which includes a processor and a memory, where at least one instruction, at least one program, a code set, or a set of instructions is stored in the memory, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the video classification method provided by the foregoing method embodiments.
Embodiments of the present application further provide a computer-readable storage medium, on which at least one instruction, at least one program, a code set, or a set of instructions is stored, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the video classification method provided by the above-mentioned method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the video classification method described in any of the above embodiments.
Optionally, the computer-readable storage medium may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a Solid State Drive (SSD), or an optical disc. The Random Access Memory may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM). The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (14)
1. A method for video classification, the method comprising:
acquiring a target video, wherein the target video is a video to be classified and predicted;
performing feature extraction on the target video based on video frames in the target video to obtain video features corresponding to the target video, wherein the video features are used for indicating picture information expressed by the video frames of the target video;
performing semantic analysis on the target video based on the video features to obtain semantic features corresponding to the target video, wherein the semantic features are used for indicating entity association expressed by video frames of the target video;
fusing the video features and the semantic features to obtain fused features corresponding to the target video;
and carrying out classification prediction on the fusion characteristics to obtain the video classification of the target video.
2. The method according to claim 1, wherein the semantic analysis of the target video based on the video features to obtain semantic features corresponding to the target video comprises:
inputting the video features into a semantic generation network, wherein the semantic generation network is a semantic feature extraction network obtained by training;
and performing semantic analysis on the video features through the semantic generation network, and outputting to obtain semantic features corresponding to the target video.
3. The method according to claim 2, wherein the semantic analyzing the video features through the semantic generation network and outputting semantic features corresponding to the target video includes:
performing prototype acquisition on the video features to obtain prototype features corresponding to the video features, wherein the prototype features are feature mapping results of the video features in a prototype space;
and performing semantic analysis on the prototype features through the semantic generation network, and outputting to obtain semantic features corresponding to the target video.
4. The method according to any one of claims 1 to 3, wherein the performing feature extraction on the target video based on the video frames in the target video to obtain the video features corresponding to the target video comprises:
extracting at least two frames of video frames from the target video, wherein a time sequence relation exists between the at least two frames of video frames;
and extracting the video characteristics of the target video frame based on the time sequence relation and the picture between the at least two frames of video frames.
5. The method according to claim 4, wherein the extracting the video feature of the target video frame based on the timing relationship between the at least two video frames and the picture comprises:
extracting behavior characteristics between the at least two frames of video frames based on the time sequence relation between the at least two frames of video frames;
extracting visual features of the target video frame based on pictures of the at least two video frames;
and fusing the behavior characteristics and the visual characteristics to obtain the video characteristics of the target video frame.
6. The method of claim 5, wherein said fusing the behavior feature and the visual feature to obtain the video feature of the target video frame comprises:
fusing the behavior characteristics and the visual characteristics to obtain mixed characteristics;
and connecting the visual features and the mixed features in a residual error connection mode to obtain the video features.
7. The method of claim 4, wherein the extracting at least two frames of video frames from the target video comprises:
performing segmentation processing on the target video to obtain at least two video segments;
and sampling video frames from the at least two video clips to obtain the at least two video frames.
8. The method according to any one of claims 1 to 3, wherein the performing classification prediction on the fusion features to obtain a video classification to which the target video belongs comprises:
carrying out distance analysis on the fusion features and sample nodes in the graph neural network to obtain feature distances between the fusion features and the sample nodes;
determining an analysis node with the minimum feature distance to the fusion feature in the graph neural network;
and taking the classification corresponding to the analysis node as the video classification to which the target video belongs.
9. The method of claim 2, wherein before entering the video features into a semantic generation network, further comprising:
obtaining a sample video, wherein a category label is correspondingly marked on the sample video, and the category label is used for indicating a reference video category to which the sample video belongs;
performing feature extraction on the sample video to obtain sample video features corresponding to the sample video;
determining a sample visual feature corresponding to the category label based on the sample video feature;
performing semantic extraction on the sample video features through the semantic generation network to obtain sample semantic features corresponding to the sample video;
determining a semantic extraction loss based on the sample semantic features and the class labels;
and training the semantic generation network according to the semantic extraction loss.
10. The method of claim 9, wherein determining semantic extraction loss based on the sample semantic features and the class labels comprises:
carrying out vector space mapping on the category label to obtain space mapping characteristics;
determining the semantic extraction loss based on a feature distance between the sample semantic features and the spatially mapped features.
11. The method of claim 8, wherein prior to performing the distance analysis on the fused feature from the sample nodes in the graph neural network, further comprising:
obtaining a sample video, wherein a category label is correspondingly marked on the sample video, and the category label is used for indicating a reference video category to which the sample video belongs;
extracting sample fusion characteristics of the sample video, wherein the sample fusion characteristics comprise sample semantic characteristics and sample video characteristics of the sample video;
and constructing the graph neural network based on intra-class consistency requirements and inter-class difference requirements, wherein nodes in the graph neural network are composed of the sample nodes and target nodes, the sample nodes are the sample fusion features, and the target nodes are the fusion features.
12. An apparatus for video classification, the apparatus comprising:
the system comprises an acquisition module, a prediction module and a prediction module, wherein the acquisition module is used for acquiring a target video which is a video to be classified and predicted;
the extraction module is used for extracting the features of the target video based on the video frames in the target video to obtain the video features corresponding to the target video, wherein the video features are used for indicating the picture information expressed by the video frames of the target video;
the analysis module is used for carrying out semantic analysis on the target video based on the video characteristics to obtain semantic characteristics corresponding to the target video, wherein the semantic characteristics are used for indicating entity association relation expressed by video frames of the target video;
the fusion module is used for fusing the video features and the semantic features to obtain fusion features corresponding to the target video;
and the prediction module is used for carrying out classification prediction on the fusion characteristics to obtain the video classification to which the target video belongs.
13. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement a video classification method according to any one of claims 1 to 11.
14. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement a method of video classification according to any one of claims 1 to 11.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110893350.5A | 2021-08-04 | 2021-08-04 | Video classification method, device, equipment and readable storage medium |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN114299321A | 2022-04-08 |
Family ID: 80964581

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110893350.5A (pending) | Video classification method, device, equipment and readable storage medium | 2021-08-04 | 2021-08-04 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN114299321A |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114926771A (en) * | 2022-06-01 | 2022-08-19 | 北京字节跳动网络技术有限公司 | Video identification method and device |
WO2023246641A1 (en) * | 2022-06-24 | 2023-12-28 | 华为云计算技术有限公司 | Method and apparatus for identifying object, and storage medium |
CN115100725A (en) * | 2022-08-23 | 2022-09-23 | 浙江大华技术股份有限公司 | Object recognition method, object recognition apparatus, and computer storage medium |
CN115100725B (en) * | 2022-08-23 | 2022-11-22 | 浙江大华技术股份有限公司 | Object recognition method, object recognition apparatus, and computer storage medium |
CN116347045A (en) * | 2023-05-31 | 2023-06-27 | 深圳市天龙世纪科技发展有限公司 | Monitoring method, device, equipment and storage medium based on communication and satellite technology |
CN116347045B (en) * | 2023-05-31 | 2023-08-15 | 深圳市天龙世纪科技发展有限公司 | Monitoring method, device, equipment and storage medium based on communication and satellite technology |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40070988; Country of ref document: HK |