
CN115049953A - Video processing method, device, equipment and computer readable storage medium - Google Patents

Video processing method, device, equipment and computer readable storage medium

Info

Publication number
CN115049953A
Authority
CN
China
Prior art keywords
video
feature vector
key frame
network model
frame image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210501926.3A
Other languages
Chinese (zh)
Inventor
孙祥训
程宝平
谢小燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2022-05-09
Filing date: 2022-05-09
Publication date: 2022-09-13
Application filed by China Mobile Communications Group Co Ltd and China Mobile Hangzhou Information Technology Co Ltd
Priority to CN202210501926.3A
Publication of CN115049953A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/1822: Parsing for meaning understanding
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention discloses a video processing method, a device, equipment and a computer readable storage medium, wherein the method comprises the following steps: acquiring the audio information and key frame images corresponding to a video to be identified; acquiring an audio text semantic feature vector corresponding to the audio information, and acquiring a visual semantic feature vector corresponding to the key frame images; determining, based on the visual semantic feature vector and the audio text semantic feature vector, the violation probability corresponding to the video to be identified through a fusion feature network model; and if the violation probability is greater than a preset threshold, determining that the video to be identified is a violation video. Because both the audio information and the key frame images of the video are considered, whether the video to be identified is a violation video can be determined accurately, and by also covering audio content such as spoken commentary during video review, the accuracy of violation-video identification is improved.

Description

Video processing method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a video processing method, apparatus, device, and computer readable storage medium.
Background
With the development of multimedia technology and the rise of video platforms, video has become a mainstream way for users to publish information and to be entertained. Inevitably, some lawbreakers release violation videos for profit or other motives, for example videos that are pornographic, politically sensitive, or harmful to minors' rights and interests. Video review has therefore become especially important for providing users with a positive, green, and healthy network environment.
Currently, a video is generally reviewed with a deep-learning image classification model. The specific process is as follows: the violation score of each frame of the video is obtained through the image classification model, the maximum violation score is taken as the score of the video, and videos whose score exceeds a certain threshold are pushed to manual review. This greatly reduces the manual-review workload and improves the efficiency of violation detection. However, because only the individual frames of the video are reviewed, prohibited content carried in the audio track, such as politically sensitive spoken commentary, is difficult to identify accurately, resulting in low accuracy when identifying violation videos such as pornographic videos, politically sensitive videos, and videos harming minors' rights and interests.
Disclosure of Invention
The invention mainly aims to provide a video processing method, a video processing device, video processing equipment and a computer readable storage medium, so as to solve the technical problem that existing violation-video identification has low accuracy.
In order to achieve the above object, the present invention provides a video processing method, including the steps of:
acquiring audio information and a key frame image corresponding to a video to be identified;
acquiring an audio text semantic feature vector corresponding to the audio information, and acquiring a visual semantic feature vector corresponding to the key frame image;
determining the violation probability corresponding to the video to be identified through a fusion feature network model based on the visual semantic feature vector and the audio text semantic feature vector;
and if the violation probability is greater than a preset threshold value, determining that the video to be identified is the violation video.
Further, the step of determining the violation probability corresponding to the video to be identified by fusing a feature network model based on the visual semantic feature vector and the audio text semantic feature vector includes:
determining a multi-mode fusion feature vector corresponding to the video to be recognized based on the visual semantic feature vector and the audio text semantic feature vector;
and inputting the multi-mode fusion feature vector into the fusion feature network model, and taking the output of the fusion feature network model as the violation probability.
Further, the step of obtaining the visual semantic feature vector corresponding to the key frame image includes:
acquiring image text characteristic vectors and key frame image characteristic vectors corresponding to the key frame images;
acquiring a sensitive character feature vector corresponding to the key frame image based on a sensitive character library;
and inputting the image text feature vector, the key frame image feature vector and the sensitive character feature vector into a cyclic neural network model, and taking the output of the cyclic neural network model as the visual semantic feature vector.
Further, the step of obtaining the image text feature vector and the key frame image feature vector corresponding to the key frame image includes:
performing text recognition on the key frame image to obtain image text information corresponding to the key frame image;
inputting the image text information into a convolutional neural network model, and taking the output of the convolutional neural network model as the image text feature vector;
and inputting the key frame image into a deep convolutional neural network model, and taking the output of the deep convolutional neural network model as the feature vector of the key frame image.
Further, the step of obtaining the sensitive character feature vector corresponding to the key frame image based on the sensitive character library includes:
acquiring the face features corresponding to the key frame images;
determining the similarity between each face feature and each sensitive character feature in a sensitive character library;
and acquiring the target similarity which is greater than the preset similarity in the similarities, and taking the sensitive character features corresponding to the target similarity as the sensitive character feature vector.
Further, the step of obtaining the audio text semantic feature vector corresponding to the audio information includes:
performing voice recognition on the audio information to obtain audio text information corresponding to the audio information;
and inputting the audio text information into a character-level convolutional neural network model, and taking the output of the character-level convolutional neural network model as the semantic feature vector of the audio text.
Further, before the step of obtaining the audio information and the key frame image corresponding to the video to be identified, the video processing method further includes:
acquiring a video fingerprint corresponding to a video to be detected;
if the video fingerprint is not matched with a first preset video fingerprint in a normal video fingerprint database, determining whether the video fingerprint is matched with a second preset video fingerprint in an illegal video fingerprint database;
if the video fingerprint is not matched with the second preset video fingerprint, determining that the video to be detected is the video to be identified;
and if the video fingerprint is matched with the second preset video fingerprint, determining that the video to be detected is an illegal video.
Further, to achieve the above object, the present invention also provides a video processing apparatus comprising:
the first acquisition module is used for acquiring audio information and a key frame image corresponding to a video to be identified;
the second acquisition module is used for acquiring audio text semantic feature vectors corresponding to the audio information and acquiring visual semantic feature vectors corresponding to the key frame images;
the first determining module is used for determining the violation probability corresponding to the video to be identified through a fusion feature network model based on the visual semantic feature vector and the audio text semantic feature vector;
and the second determining module is used for determining that the video to be identified is the violation video if the violation probability is greater than a preset threshold.
Further, to achieve the above object, the present invention also provides a video processing apparatus comprising: a memory, a processor and a video processing program stored on the memory and executable on the processor, the video processing program when executed by the processor implementing the steps of the video processing method as described above.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having a video processing program stored thereon, which when executed by a processor implements the steps of the aforementioned video processing method.
The method comprises: acquiring the audio information and key frame images corresponding to a video to be identified; acquiring the audio text semantic feature vector corresponding to the audio information and the visual semantic feature vector corresponding to the key frame images; determining, based on the visual semantic feature vector and the audio text semantic feature vector, the violation probability corresponding to the video to be identified through a fusion feature network model; and, if the violation probability is greater than a preset threshold, determining that the video to be identified is a violation video. Whether the video is a violation video can thus be determined accurately from both its audio information and its key frame images, and because audio content such as spoken commentary is also covered during video review, the accuracy of violation-video identification is improved.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a video processing device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a video processing method according to a first embodiment of the present invention;
fig. 3 is a functional block diagram of a video processing apparatus according to an embodiment of the invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is a schematic structural diagram of a video processing device in a hardware operating environment according to an embodiment of the present invention.
The video processing equipment of the embodiment of the invention can be a PC, and can also be terminal equipment such as a smart phone.
As shown in fig. 1, the video processing apparatus may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the video processing device may further include a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a WiFi module, and the like. Of course, the video processing device may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and so on, which are not described herein again.
Those skilled in the art will appreciate that the terminal architecture shown in fig. 1 does not constitute a limitation of video processing devices and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a video processing program.
In the video processing apparatus shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and communicating data with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be used to invoke a video processing program stored in the memory 1005.
In this embodiment, a video processing apparatus includes: the system comprises a memory 1005, a processor 1001 and a video processing program stored on the memory 1005 and capable of running on the processor 1001, wherein when the processor 1001 calls the video processing program stored in the memory 1005, the steps of the video processing method in the following embodiments are executed.
Referring to fig. 2, fig. 2 is a flowchart illustrating a video processing method according to a first embodiment of the present invention.
In this embodiment, the video processing method includes:
step S101, acquiring audio information and a key frame image corresponding to a video to be identified;
in this embodiment, a video to be recognized is obtained first, for example, each video to be detected that needs to be subjected to illegal review may be taken as the video to be recognized, or the video to be detected is matched according to the illegal video database and the normal video database that have been currently reviewed first, and if the video to be detected is not matched with the illegal video database and the normal video database, the video to be detected is determined to be the video to be recognized.
When the video to be identified is acquired, the audio information and key frame images corresponding to it are acquired. Specifically, audio-video separation is performed on the video to be identified to obtain its audio information and video information, and key frame extraction is then performed on the video information with an existing key frame extraction method to obtain the key frame images of the video to be identified, where the key frame extraction method may be shot-based, motion-analysis-based, or video-clustering-based. A minimal sketch of this step is given below.
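As a concrete illustration only, the separation and extraction step might look like the following Python sketch. It assumes ffmpeg and OpenCV are available; the shot-based strategy is one of the options named above, and the grey-level difference threshold and all paths are illustrative, not specified by the patent.

```python
import subprocess
import cv2

def separate_audio(video_path: str, audio_path: str) -> None:
    # Audio-video separation: strip the audio track with ffmpeg
    # (assumes the ffmpeg binary is on PATH).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "pcm_s16le", audio_path],
        check=True,
    )

def extract_key_frames(video_path: str, diff_threshold: float = 30.0) -> list:
    # Shot-based key-frame extraction: keep a frame whenever the mean
    # absolute grey-level difference to the previous frame is large,
    # i.e. at a likely shot boundary. The threshold value is assumed.
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_grey = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_grey is None or cv2.absdiff(grey, prev_grey).mean() > diff_threshold:
            key_frames.append(frame)
        prev_grey = grey
    cap.release()
    return key_frames
```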
Step S102, obtaining audio text semantic feature vectors corresponding to the audio information, and obtaining visual semantic feature vectors corresponding to the key frame images;
In this embodiment, when the audio information and the key frame images are obtained, the audio text semantic feature vector corresponding to the audio information and the visual semantic feature vector corresponding to the key frame images are acquired. Specifically, a trained semantic feature vector model may be preset and the audio information input into it to obtain the audio text semantic feature vector; alternatively, the audio text information of the audio information is obtained first, and the audio text semantic feature vector is derived from that text. For the key frame images, they may be input directly into a corresponding trained model to obtain the visual semantic feature vector; alternatively, the key frame images are input into a corresponding trained model to obtain the key frame image feature vectors, text recognition is performed on the key frame images to obtain image text information, the image text information is input into a corresponding trained model to obtain the image text feature vectors, and the visual semantic feature vector is determined from the image text feature vectors and the key frame image feature vectors.
Step S103, determining violation probability corresponding to the video to be identified through a fusion feature network model based on the visual semantic feature vector and the audio text semantic feature vector;
In this embodiment, when the visual semantic feature vector and the audio text semantic feature vector are obtained, the violation probability corresponding to the video to be identified is determined through the fusion feature network model based on these two vectors. Specifically, the visual semantic feature vector and the audio text semantic feature vector may be input into the fusion feature network model and its output taken as the violation probability; alternatively, the two vectors are first fused into a multi-modal fusion feature vector, which is input into the fusion feature network model, and the output of the fusion feature network model is taken as the violation probability.
It should be noted that the fusion feature network model consists of two fully-connected layers and one SoftMax classifier, and its output is the violation probability. Before video review is carried out, an initial fusion feature network model is set up, and first training samples with corresponding labels are obtained. A first training sample may be the visual semantic feature vector and audio text semantic feature vector (or the multi-modal fusion feature vector) of a violation video, labelled as a violation video, or those of a normal video, labelled as a normal video. Model training is performed on the initial fusion feature network model to obtain the predicted violation probability of each video in the first training samples, and the prediction result of each video is determined from its predicted violation probability; for example, if the predicted violation probability is greater than a preset threshold, the corresponding video is predicted to be a violation video. A loss function of the trained initial fusion feature network model is then computed from the labels and the prediction results, and when the loss falls below a preset value, the trained initial fusion feature network model is taken as the fusion feature network model. A sketch of such a network is given below.
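A minimal PyTorch sketch of such a fusion feature network follows. The two fully-connected layers and the SoftMax classifier match the description above; the input and hidden dimensions and the two-class layout (normal vs. violation) are assumptions, not fixed by the patent.

```python
import torch
import torch.nn as nn

class FusionFeatureNet(nn.Module):
    # Two fully-connected layers followed by a SoftMax classifier.
    # in_dim must match the fused feature dimension (assumed 256 here).
    def __init__(self, in_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 2)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        logits = self.fc2(torch.relu(self.fc1(fused)))
        # SoftMax over the two classes; return P(violation).
        return torch.softmax(logits, dim=-1)[..., 1]
```

During training one would typically keep the pre-softmax logits and minimise `nn.CrossEntropyLoss` against the violation/normal labels, stopping once the loss falls below the preset value mentioned above.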
And step S104, if the violation probability is greater than a preset threshold, determining that the video to be identified is a violation video.
In this embodiment, when the violation probability is obtained, it is determined whether the violation probability is greater than a preset threshold, and if the violation probability is greater than the preset threshold, it is determined that the video to be identified is a violation video, so that it is accurately determined whether the video to be identified is a violation video according to the audio information of the video to be identified and the key frame image.
In the video processing method provided by this embodiment, the audio information and key frame images corresponding to the video to be identified are acquired; the audio text semantic feature vector corresponding to the audio information and the visual semantic feature vector corresponding to the key frame images are then obtained; the violation probability corresponding to the video to be identified is then determined through a fusion feature network model based on the two vectors; and if the violation probability is greater than a preset threshold, the video to be identified is determined to be a violation video. Whether the video is a violation video can thus be determined accurately from both its audio information and its key frame images, and because audio content such as spoken commentary is also covered during video review, the accuracy of violation-video identification is improved.
Based on the first embodiment, a second embodiment of the video processing method of the present invention is proposed, in this embodiment, step S103 includes:
step S201, determining a multi-modal fusion feature vector corresponding to the video to be recognized based on the visual semantic feature vector and the audio text semantic feature vector;
step S202, inputting the multi-modal fusion feature vector into the fusion feature network model, and taking the output of the fusion feature network model as the violation probability.
In this embodiment, when the visual semantic feature vector and the audio text semantic feature vector are obtained, the multi-modal fusion feature vector corresponding to the video to be identified is determined from them. Specifically, the two vectors are fused into the multi-modal fusion feature vector; for example, the visual semantic feature vector and the audio text semantic feature vector may be concatenated (concat) to obtain the multi-modal fusion feature vector. Of course, in other embodiments, if the two vectors have the same dimension, the multi-modal fusion feature vector may also be obtained by taking the element-wise mean of the two vectors.
Then, the multi-modal fusion feature vector is input into the trained fusion feature network model, and the output of the fusion feature network model is taken as the violation probability. The fusion step is sketched below.
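The two fusion options, concatenation and the element-wise mean for same-dimension vectors, could be sketched as follows; this helper is illustrative and not part of the patent.

```python
import torch

def fuse_features(visual: torch.Tensor, audio_text: torch.Tensor,
                  mode: str = "concat") -> torch.Tensor:
    if mode == "concat":
        # Vector concatenation; output dim = sum of the input dims.
        return torch.cat([visual, audio_text], dim=-1)
    if mode == "mean":
        # Element-wise mean; requires identical dimensions.
        assert visual.shape == audio_text.shape
        return (visual + audio_text) / 2.0
    raise ValueError(f"unknown fusion mode: {mode}")
```

With the concat option, the input layer of the fusion feature network must match the summed dimension of the two vectors; with the mean option, the dimension is unchanged.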
In the video processing method provided by this embodiment, the multi-modal fusion feature vector corresponding to the video to be identified is determined from the visual semantic feature vector and the audio text semantic feature vector; the multi-modal fusion feature vector is then input into the fusion feature network model, and its output is taken as the violation probability. By fusing the visual and audio-text semantic feature vectors and judging the video on the fused result, the efficiency of video identification is improved and the accuracy of violation-video identification is further improved.
Based on the first embodiment, a third embodiment of the video processing method of the present invention is proposed, in this embodiment, step S102 includes:
step S301, acquiring image text characteristic vectors and key frame image characteristic vectors corresponding to the key frame images;
step S302, acquiring a sensitive character feature vector corresponding to the key frame image based on a sensitive character library;
step S303, inputting the image text feature vector, the key frame image feature vector and the sensitive character feature vector into a cyclic neural network model, and taking the output of the cyclic neural network model as the visual semantic feature vector.
In this embodiment, when a key frame image is acquired, the image text feature vector and key frame image feature vector corresponding to the key frame image are acquired. Specifically, the image text information corresponding to the key frame image is obtained first and the image text feature vector is derived from it through the corresponding trained model; meanwhile, the key frame image is input into its corresponding trained model to obtain the key frame image feature vector.
Then, the sensitive character feature vector corresponding to the key frame image is acquired based on the sensitive character library. Specifically, the key frame image is matched against the sensitive character features in the sensitive character library, and the matched target sensitive character features are taken as the sensitive character feature vector. If the review is a politics-related video review, face images of various categories, including national leaders, leaders of separatist organizations, leaders of cult organizations, leaders of violent terrorist organizations, and other politically sensitive figures, are collected in advance; these face images are input into a convolutional neural network model to obtain the sensitive character features, which may be 128-dimensional face features, and the sensitive character library is built from the obtained features.
Then, the image text feature vector, the key frame image feature vector, and the sensitive character feature vector are input into a recurrent neural network model, and the output of the recurrent neural network model is taken as the visual semantic feature vector. The recurrent neural network model may be an attention-based bidirectional temporal recurrent neural network trained in advance, and the visual semantic feature vector may be a 128-dimensional feature vector; a sketch is given below.
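One way the three feature vectors might pass through an attention-based bidirectional recurrent network is sketched below. Treating the three vectors as a length-3 sequence, using an LSTM cell, and projecting all three inputs to a common dimension are assumptions; the patent only fixes the 128-dimensional output.

```python
import torch
import torch.nn as nn

class VisualSemanticEncoder(nn.Module):
    # The image text, key frame image, and sensitive character feature
    # vectors are stacked into a length-3 sequence, encoded by a
    # bidirectional LSTM, and pooled with attention into a
    # 128-dimensional visual semantic vector.
    def __init__(self, in_dim: int = 128, out_dim: int = 128):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, out_dim // 2,
                           bidirectional=True, batch_first=True)
        self.attn = nn.Linear(out_dim, 1)

    def forward(self, img_text, key_frame, sensitive):
        seq = torch.stack([img_text, key_frame, sensitive], dim=1)  # (B, 3, in_dim)
        h, _ = self.rnn(seq)                    # (B, 3, out_dim)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over the 3 inputs
        return (w * h).sum(dim=1)               # (B, out_dim) visual semantic vector
```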
In the video processing method provided by this embodiment, the image text feature vector and key frame image feature vector corresponding to the key frame image are obtained; the sensitive character feature vector corresponding to the key frame image is then acquired based on the sensitive character library; the three vectors are then input into a recurrent neural network model, whose output is taken as the visual semantic feature vector. Because the visual semantic feature vector is derived from the image text, key frame image, and sensitive character feature vectors together, several kinds of prohibited content, such as sensitive events, politically sensitive figures, and hostile propaganda, are considered comprehensively during video identification, so that politics-related items, propaganda slogans, politically sensitive figures, and the like in the video can be identified accurately, further improving the accuracy of violation-video identification.
Based on the third embodiment, a fourth embodiment of the video processing method of the present invention is proposed, in this embodiment, step S301 includes:
step S401, performing text recognition on the key frame image to obtain image text information corresponding to the key frame image;
step S402, inputting the image text information into a convolutional neural network model, and taking the output of the convolutional neural network model as the image text feature vector;
step S403, inputting the key frame image into a deep convolutional neural network model, and using an output of the deep convolutional neural network model as the key frame image feature vector.
In this embodiment, when a key frame image is obtained, text recognition is performed on it to obtain the corresponding image text information. Specifically, an existing optical character recognition (OCR) algorithm may be used to perform text recognition on the key frame image to obtain the image text information.
The image text information is then input into the convolutional neural network model, whose output is taken as the image text feature vector; and the key frame image is input into the deep convolutional neural network model, whose output is taken as the key frame image feature vector.
In this embodiment, before video identification, an initial convolutional neural network model and an initial deep convolutional neural network model are built. The initial convolutional neural network model is trained with second training samples (text samples), and once the trained model meets the requirements it is taken as the convolutional neural network model; likewise, the initial deep convolutional neural network model is trained with third training samples (image samples), and once it meets the requirements it is taken as the deep convolutional neural network model. A sketch of the recognition and feature extraction passes is given below.
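For illustration, the sketch below uses pytesseract for the OCR pass and a torchvision ResNet-18 as the deep convolutional neural network; neither library nor architecture is named by the patent, so both are assumptions. The text branch (image text information to image text feature vector) would use a text CNN analogous to the character-level network sketched in the sixth embodiment below.

```python
import pytesseract
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def key_frame_text(image_path: str) -> str:
    # OCR pass: key frame image -> image text information.
    # lang="chi_sim" assumes simplified-Chinese on-screen text.
    return pytesseract.image_to_string(Image.open(image_path), lang="chi_sim")

# Deep CNN pass: key frame image -> key frame image feature vector.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose the 512-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def key_frame_feature(image_path: str) -> torch.Tensor:
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(x).squeeze(0)  # key frame image feature vector
```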
In the video processing method provided by this embodiment, text recognition is performed on the key frame image to obtain the corresponding image text information; the image text information is input into a convolutional neural network model, whose output is taken as the image text feature vector; and the key frame image is input into a deep convolutional neural network model, whose output is taken as the key frame image feature vector. The image text feature vector and the key frame image feature vector are thus obtained accurately from the image text information and the key frame image, so that politics-related items, propaganda slogans, politically sensitive figures, and the like in the video are considered comprehensively during identification, further improving the accuracy of violation-video identification.
A fifth embodiment of the video processing method of the present invention is proposed based on the third embodiment. In this embodiment, step S302 includes:
step S501, acquiring human face features corresponding to the key frame images;
step S502, determining the similarity between each human face feature and each sensitive character feature in a sensitive character library;
step S503, obtaining the target similarity greater than the preset similarity in each similarity, and taking the sensitive character features corresponding to the target similarity as the sensitive character feature vector.
In this embodiment, after the key frame image is obtained, the face features corresponding to it are acquired, where the face features include the features of all faces appearing in the key frame image. For example, the face features may be extracted with a deep convolutional neural network or with an existing face recognition algorithm, and each face feature may be a 128-dimensional vector.
Then, the similarity between each face feature and each sensitive character feature in the sensitive character library is computed, where a sensitive character feature may likewise be a 128-dimensional face feature vector. For each face feature, the cosine value between it and each sensitive character feature is calculated with the cosine formula, and this cosine value is the corresponding similarity.
Then, each similarity is compared with a preset similarity, the target similarities greater than the preset similarity are obtained, and the sensitive character features corresponding to the target similarities, namely the target sensitive character features, are determined in the sensitive character library and taken as the sensitive character feature vector. This matching step is sketched below.
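The comparison reduces to cosine matching between 128-dimensional vectors; a minimal NumPy sketch follows, in which the preset similarity value is an assumption.

```python
import numpy as np

def match_sensitive_characters(face_feats: np.ndarray,
                               library: np.ndarray,
                               preset_similarity: float = 0.8) -> np.ndarray:
    # face_feats: (n_faces, 128) features from the key frame image.
    # library:    (n_characters, 128) sensitive character features.
    # Returns the library features whose cosine similarity to any
    # detected face exceeds the preset similarity.
    f = face_feats / np.linalg.norm(face_feats, axis=1, keepdims=True)
    l = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = f @ l.T                        # pairwise cosine values
    hits = (sims > preset_similarity).any(axis=0)
    return library[hits]                  # target sensitive character features
```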
In the video processing method provided by this embodiment, the face features corresponding to the key frame image are obtained; the similarity between each face feature and each sensitive character feature in the sensitive character library is then determined; and the target similarities greater than the preset similarity are obtained, with the corresponding sensitive character features taken as the sensitive character feature vector. The sensitive character feature vector is thus obtained accurately from the similarities, further improving the accuracy of violation-video identification.
Based on the first embodiment, a sixth embodiment of the video processing method of the present invention is proposed, in this embodiment, step S102 includes:
step S601, carrying out voice recognition on the audio information to obtain audio text information corresponding to the audio information;
step S602, inputting the audio text information into a character-level convolutional neural network model, and taking the output of the character-level convolutional neural network model as the semantic feature vector of the audio text.
In this embodiment, when the audio information is acquired, speech recognition is performed on it to obtain the corresponding audio text information; for example, an existing speech recognition algorithm may be used to perform speech recognition on the audio information.
Then, the audio text information is input into a character-level convolutional neural network model, and the output of the character-level convolutional neural network model is taken as the audio text semantic feature vector. Before video identification, an initial character-level convolutional neural network model is built and trained with fourth training samples (text samples); once the trained model meets the requirements, it is taken as the character-level convolutional neural network model, so that the audio text semantic feature vector corresponding to the audio text information is obtained through it. A sketch is given below.
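A minimal character-level convolutional network of the kind this step describes might look like the following sketch; the vocabulary size, embedding width, and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    # Character-level CNN: embeds each character, applies 1-D
    # convolutions of several kernel sizes, max-pools over time, and
    # concatenates the pooled maps into one semantic feature vector.
    def __init__(self, vocab_size: int = 5000, emb_dim: int = 64,
                 out_dim: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, out_dim // 4, k) for k in (2, 3, 4, 5)]
        )

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (B, seq_len) character indices, seq_len >= 5.
        x = self.emb(char_ids).transpose(1, 2)  # (B, emb_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=-1).values for c in self.convs]
        return torch.cat(pooled, dim=-1)        # (B, out_dim) semantic vector
```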
In the video processing method provided by this embodiment, speech recognition is performed on the audio information to obtain the corresponding audio text information; the audio text information is then input into a character-level convolutional neural network model, whose output is taken as the audio text semantic feature vector. The audio text semantic feature vector is thus obtained accurately from the audio text information, and by covering audio content such as spoken commentary during video review, the accuracy of violation-video identification is further improved.
Based on the above embodiments, a seventh embodiment of the video processing method of the present invention is proposed, in this embodiment, before step S101, the video processing method further includes:
step S701, acquiring a video fingerprint corresponding to a video to be detected;
step S702, if the video fingerprint is not matched with a first preset video fingerprint in a normal video fingerprint database, determining whether the video fingerprint is matched with a second preset video fingerprint in an illegal video fingerprint database;
step S703, if the video fingerprint is not matched with the second preset video fingerprint, determining that the video to be detected is a video to be identified;
step S704, if the video fingerprint is matched with the second preset video fingerprint, determining that the video to be detected is an illegal video.
In this embodiment, a video to be detected is obtained first; for example, each video that requires violation review may be taken as a video to be detected. The video fingerprint of the video to be detected is then obtained, and it is determined whether the video fingerprint matches a first preset video fingerprint in the normal video fingerprint database.
If the video fingerprint does not match any first preset video fingerprint in the normal video fingerprint database, the video to be detected is not a known normal video, that is, it may be an illegal video or a not-yet-reviewed video; in that case, it is determined whether the video fingerprint matches a second preset video fingerprint in the illegal video fingerprint database. If the video fingerprint does match a first preset video fingerprint in the normal video fingerprint database, for example it belongs to the set of first preset video fingerprints, the video to be detected is determined to be a normal video.
If the video fingerprint matches a second preset video fingerprint, the video to be detected is determined to be an illegal video; otherwise, it is determined to be the video to be identified. The screening flow is sketched below.
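The screening flow might be sketched as follows. The patent does not specify a fingerprint algorithm; the SHA-256 file hash below is a placeholder that only matches exact duplicates, whereas a production system would use a perceptual video hash.

```python
import hashlib

def video_fingerprint(path: str) -> str:
    # Placeholder fingerprint: SHA-256 over the raw file bytes.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def screen(path: str, normal_db: set, violation_db: set) -> str:
    fp = video_fingerprint(path)
    if fp in normal_db:
        return "normal"        # matched the normal video fingerprint database
    if fp in violation_db:
        return "violation"     # matched the illegal video fingerprint database
    return "to_identify"       # pass on to the multimodal review pipeline
```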
In the video processing method provided by this embodiment, the video fingerprint corresponding to the video to be detected is obtained; if it does not match a first preset video fingerprint in the normal video fingerprint database, it is checked against the second preset video fingerprints in the illegal video fingerprint database; if it matches neither, the video to be detected is determined to be the video to be identified, and if it matches a second preset video fingerprint, it is determined to be an illegal video. By pre-screening the video to be detected against the fingerprints of known normal and illegal videos, videos that already belong to either category are not reviewed again, which further improves the efficiency of violation-video identification.
The present invention also provides a video processing apparatus, referring to fig. 3, the video processing apparatus comprising:
the first obtaining module 10 is configured to obtain audio information and a key frame image corresponding to a video to be identified;
a second obtaining module 20, configured to obtain an audio text semantic feature vector corresponding to the audio information, and obtain a visual semantic feature vector corresponding to the key frame image;
the first determining module 30 is configured to determine, based on the visual semantic feature vector and the audio text semantic feature vector, a violation probability corresponding to the video to be identified through a fusion feature network model;
and a second determining module 40, configured to determine that the video to be identified is an illegal video if the violation probability is greater than a preset threshold.
Further, the first determining module 30 is further configured to:
determining a multi-mode fusion feature vector corresponding to the video to be recognized based on the visual semantic feature vector and the audio text semantic feature vector;
and inputting the multi-mode fusion feature vector into the fusion feature network model, and taking the output of the fusion feature network model as the violation probability.
Further, the second obtaining module 20 is further configured to:
acquiring image text characteristic vectors and key frame image characteristic vectors corresponding to the key frame images;
acquiring a sensitive character feature vector corresponding to the key frame image based on a sensitive character library;
and inputting the image text feature vector, the key frame image feature vector and the sensitive character feature vector into a cyclic neural network model, and taking the output of the cyclic neural network model as the visual semantic feature vector.
Further, the second obtaining module 20 is further configured to:
performing text recognition on the key frame image to obtain image text information corresponding to the key frame image;
inputting the image text information into a convolutional neural network model, and taking the output of the convolutional neural network model as the image text feature vector;
and inputting the key frame image into a deep convolutional neural network model, and taking the output of the deep convolutional neural network model as the feature vector of the key frame image.
Further, the second obtaining module 20 is further configured to:
acquiring the face features corresponding to the key frame images;
determining the similarity between each face feature and each sensitive character feature in a sensitive character library;
and acquiring the target similarity which is greater than the preset similarity in the similarities, and taking the sensitive character features corresponding to the target similarity as the sensitive character feature vector.
Further, the second obtaining module 20 is further configured to:
performing voice recognition on the audio information to obtain audio text information corresponding to the audio information;
and inputting the audio text information into a character-level convolutional neural network model, and taking the output of the character-level convolutional neural network model as the semantic feature vector of the audio text.
Further, the video processing apparatus is further configured to:
acquiring a video fingerprint corresponding to a video to be detected;
if the video fingerprint is not matched with a first preset video fingerprint in a normal video fingerprint database, determining whether the video fingerprint is matched with a second preset video fingerprint in an illegal video fingerprint database;
if the video fingerprint is not matched with the second preset video fingerprint, determining that the video to be detected is the video to be identified;
and if the video fingerprint is matched with the second preset video fingerprint, determining that the video to be detected is an illegal video.
The methods executed by the program units can refer to various embodiments of the video processing method of the present invention, and are not described herein again.
The invention also provides a computer readable storage medium.
The computer readable storage medium of the present invention has stored thereon a video processing program which, when executed by a processor, implements the steps of the video processing method as described above.
The method implemented when the video processing program running on the processor is executed may refer to each embodiment of the video processing method of the present invention, and details thereof are not repeated herein.
Furthermore, an embodiment of the present invention further provides a computer program product, which includes a video processing program, and when the video processing program is executed by a processor, the steps of the video processing method described above are implemented.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or system that includes the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A video processing method, characterized in that the video processing method comprises the steps of:
acquiring audio information and a key frame image corresponding to a video to be identified;
acquiring an audio text semantic feature vector corresponding to the audio information, and acquiring a visual semantic feature vector corresponding to the key frame image;
determining the violation probability corresponding to the video to be identified through a fusion feature network model based on the visual semantic feature vector and the audio text semantic feature vector;
and if the violation probability is greater than a preset threshold value, determining that the video to be identified is the violation video.
2. The video processing method of claim 1, wherein the step of determining the violation probability corresponding to the video to be recognized by fusing a feature network model based on the visual semantic feature vector and the audio text semantic feature vector comprises:
determining a multi-mode fusion feature vector corresponding to the video to be recognized based on the visual semantic feature vector and the audio text semantic feature vector;
and inputting the multi-mode fusion feature vector into the fusion feature network model, and taking the output of the fusion feature network model as the violation probability.
3. The video processing method according to claim 1, wherein the step of obtaining the visual semantic feature vector corresponding to the key frame image comprises:
acquiring image text characteristic vectors and key frame image characteristic vectors corresponding to the key frame images;
acquiring a sensitive character feature vector corresponding to the key frame image based on a sensitive character library;
and inputting the image text feature vector, the key frame image feature vector and the sensitive character feature vector into a cyclic neural network model, and taking the output of the cyclic neural network model as the visual semantic feature vector.
4. The video processing method as claimed in claim 3, wherein said step of obtaining image text feature vectors and key frame image feature vectors corresponding to said key frame images comprises:
performing text recognition on the key frame image to obtain image text information corresponding to the key frame image;
inputting the image text information into a convolutional neural network model, and taking the output of the convolutional neural network model as the image text feature vector;
and inputting the key frame image into a deep convolutional neural network model, and taking the output of the deep convolutional neural network model as the feature vector of the key frame image.
5. The video processing method of claim 3, wherein the step of obtaining the feature vector of the sensitive person corresponding to the key frame image based on the sensitive person library comprises:
acquiring the face features corresponding to the key frame images;
determining the similarity between each face feature and each sensitive character feature in a sensitive character library;
and acquiring the target similarity which is greater than the preset similarity in the similarities, and taking the sensitive character features corresponding to the target similarity as the sensitive character feature vector.
6. The video processing method according to claim 1, wherein the step of obtaining the audio text semantic feature vector corresponding to the audio information comprises:
performing voice recognition on the audio information to obtain audio text information corresponding to the audio information;
and inputting the audio text information into a character-level convolutional neural network model, and taking the output of the character-level convolutional neural network model as the semantic feature vector of the audio text.
7. The video processing method according to any one of claims 1 to 6, wherein before the step of acquiring the audio information corresponding to the video to be identified and the key frame image, the video processing method further comprises:
acquiring a video fingerprint corresponding to a video to be detected;
if the video fingerprint is not matched with a first preset video fingerprint in a normal video fingerprint database, determining whether the video fingerprint is matched with a second preset video fingerprint in an illegal video fingerprint database;
if the video fingerprint is not matched with the second preset video fingerprint, determining that the video to be detected is the video to be identified;
and if the video fingerprint is matched with the second preset video fingerprint, determining that the video to be detected is an illegal video.
8. A video processing apparatus, characterized in that the video processing apparatus comprises:
the first acquisition module is used for acquiring audio information and a key frame image corresponding to a video to be identified;
the second acquisition module is used for acquiring audio text semantic feature vectors corresponding to the audio information and acquiring visual semantic feature vectors corresponding to the key frame images;
the first determining module is used for determining the violation probability corresponding to the video to be recognized through a fusion feature network model based on the visual semantic feature vector and the audio text semantic feature vector;
and the second determining module is used for determining that the video to be identified is the violation video if the violation probability is greater than a preset threshold.
9. A video processing apparatus, characterized in that the video processing apparatus comprises: memory, processor and video processing program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the video processing method according to any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a video processing program is stored thereon, which when executed by a processor implements the steps of the video processing method according to any one of claims 1 to 7.
CN202210501926.3A 2022-05-09 2022-05-09 Video processing method, device, equipment and computer readable storage medium Pending CN115049953A (en)

Priority Applications (1)

Application Number: CN202210501926.3A
Priority Date / Filing Date: 2022-05-09
Title: Video processing method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number: CN202210501926.3A
Priority Date / Filing Date: 2022-05-09
Title: Video processing method, device, equipment and computer readable storage medium

Publications (1)

Publication Number: CN115049953A
Publication Date: 2022-09-13

Family

ID=83157492

Family Applications (1)

Application Number: CN202210501926.3A (status: Pending)
Title: Video processing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115049953A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294504A (en) * 2022-09-28 2022-11-04 武汉当夏时光文化创意有限公司 Marketing video auditing system based on AI
CN115294504B (en) * 2022-09-28 2023-01-03 武汉当夏时光文化创意有限公司 Marketing video auditing system based on AI
CN115834935A (en) * 2022-12-21 2023-03-21 阿里云计算有限公司 Multimedia information auditing method, advertisement auditing method, equipment and storage medium
CN116701708A (en) * 2023-07-27 2023-09-05 上海蜜度信息技术有限公司 Multi-mode enhanced video classification method, system, storage medium and electronic equipment
CN116701708B (en) * 2023-07-27 2023-11-17 上海蜜度信息技术有限公司 Multi-mode enhanced video classification method, system, storage medium and electronic equipment
CN117829973A (en) * 2023-12-29 2024-04-05 广电运通集团股份有限公司 Risk control method and device for banking outlets
CN117829973B (en) * 2023-12-29 2024-08-16 广电运通集团股份有限公司 Risk control method and device for banking outlets

Similar Documents

Publication Publication Date Title
CN115049953A (en) Video processing method, device, equipment and computer readable storage medium
CN109784186B (en) Pedestrian re-identification method and device, electronic equipment and computer-readable storage medium
CN109117777B (en) Method and device for generating information
CN113779308B (en) Short video detection and multi-classification method, device and storage medium
CN113449725B (en) Object classification method, device, equipment and storage medium
CN111914812A (en) Image processing model training method, device, equipment and storage medium
CN111932544A (en) Tampered image detection method and device and computer readable storage medium
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN111931859B (en) Multi-label image recognition method and device
CN113221918B (en) Target detection method, training method and device of target detection model
CN114611672B (en) Model training method, face recognition method and device
CN110717407B (en) Face recognition method, device and storage medium based on lip language password
CN113255557A (en) Video crowd emotion analysis method and system based on deep learning
CN116561570A (en) Training method, device and equipment for multi-mode model and readable storage medium
CN112766284B (en) Image recognition method and device, storage medium and electronic equipment
Sathiyaprasad Ontology-based video retrieval using modified classification technique by learning in smart surveillance applications
CN117668758A (en) Dialog intention recognition method and device, electronic equipment and storage medium
Khan et al. A framework for plagiarism detection in Arabic documents
CN107329948B (en) Method, apparatus and storage medium for estimating occurrence time of event described in sentence
CN116187341A (en) Semantic recognition method and device
CN115423050A (en) False news detection method and device, electronic equipment and storage medium
CN115329171A (en) Content identification method and equipment
CN114299295A (en) Data processing method and related device
CN114186039A (en) Visual question answering method and device and electronic equipment
CN111626437A (en) Confrontation sample detection method, device and equipment and computer scale storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination