
CN115049953A - Video processing method, device, equipment and computer readable storage medium - Google Patents

Video processing method, device, equipment and computer readable storage medium

Info

Publication number
CN115049953A
Authority
CN
China
Prior art keywords
video
feature vector
key frame
network model
frame image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210501926.3A
Other languages
Chinese (zh)
Inventor
孙祥训
程宝平
谢小燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2022-05-09
Filing date: 2022-05-09
Publication date: 2022-09-13
Application filed by China Mobile Communications Group Co Ltd and China Mobile Hangzhou Information Technology Co Ltd
Priority to CN202210501926.3A
Publication of CN115049953A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60: Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/1822: Parsing for meaning understanding
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention discloses a video processing method, a device, equipment and a computer readable storage medium, wherein the method comprises the following steps: acquiring the audio information and key frame images corresponding to a video to be identified; acquiring an audio text semantic feature vector corresponding to the audio information, and acquiring a visual semantic feature vector corresponding to the key frame images; determining, based on the visual semantic feature vector and the audio text semantic feature vector, the violation probability corresponding to the video to be identified through a fusion feature network model; and if the violation probability is greater than a preset threshold, determining that the video to be identified is a violation video. Because both the audio information and the key frame images of the video are considered, whether the video to be identified is a violation video can be determined accurately, and by also covering audio content such as spoken commentary during video review, the accuracy of violation-video identification is improved.

Description

Video processing method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a video processing method, apparatus, device, and computer readable storage medium.
Background
With the development of multimedia technology and the rise of video platforms, video has become a mainstream way for users to publish information and to be entertained. Inevitably, some lawbreakers release violation videos for profit or other motives, for example videos that are pornographic, politically sensitive, or harmful to minors' rights and interests. Video review has therefore become especially important for providing users with a positive, green, and healthy network environment.
Currently, a video is generally reviewed with a deep-learning image classification model. The specific process is as follows: the violation score of each frame of the video is obtained through the image classification model, the maximum violation score is taken as the score of the video, and videos whose score exceeds a certain threshold are pushed to manual review. This greatly reduces the manual-review workload and improves the efficiency of violation detection. However, because only the individual frames of the video are reviewed, prohibited content carried in the audio track, such as politically sensitive spoken commentary, is difficult to identify accurately, resulting in low accuracy when identifying violation videos such as pornographic videos, politically sensitive videos, and videos harming minors' rights and interests.
Disclosure of Invention
The invention mainly aims to provide a video processing method, a video processing device, video processing equipment and a computer readable storage medium, so as to solve the technical problem that existing violation-video identification has low accuracy.
In order to achieve the above object, the present invention provides a video processing method, including the steps of:
acquiring audio information and a key frame image corresponding to a video to be identified;
acquiring an audio text semantic feature vector corresponding to the audio information, and acquiring a visual semantic feature vector corresponding to the key frame image;
determining the violation probability corresponding to the video to be identified through a fusion feature network model based on the visual semantic feature vector and the audio text semantic feature vector;
and if the violation probability is greater than a preset threshold value, determining that the video to be identified is the violation video.
Further, the step of determining the violation probability corresponding to the video to be identified by fusing a feature network model based on the visual semantic feature vector and the audio text semantic feature vector includes:
determining a multi-mode fusion feature vector corresponding to the video to be recognized based on the visual semantic feature vector and the audio text semantic feature vector;
and inputting the multi-mode fusion feature vector into the fusion feature network model, and taking the output of the fusion feature network model as the violation probability.
Further, the step of obtaining the visual semantic feature vector corresponding to the key frame image includes:
acquiring image text characteristic vectors and key frame image characteristic vectors corresponding to the key frame images;
acquiring a sensitive character feature vector corresponding to the key frame image based on a sensitive character library;
and inputting the image text feature vector, the key frame image feature vector and the sensitive character feature vector into a cyclic neural network model, and taking the output of the cyclic neural network model as the visual semantic feature vector.
Further, the step of obtaining the image text feature vector and the key frame image feature vector corresponding to the key frame image includes:
performing text recognition on the key frame image to obtain image text information corresponding to the key frame image;
inputting the image text information into a convolutional neural network model, and taking the output of the convolutional neural network model as the image text feature vector;
and inputting the key frame image into a deep convolutional neural network model, and taking the output of the deep convolutional neural network model as the feature vector of the key frame image.
Further, the step of obtaining the sensitive character feature vector corresponding to the key frame image based on the sensitive character library includes:
acquiring the face features corresponding to the key frame images;
determining the similarity between each face feature and each sensitive character feature in a sensitive character library;
and acquiring the target similarity which is greater than the preset similarity in the similarities, and taking the sensitive character features corresponding to the target similarity as the sensitive character feature vector.
Further, the step of obtaining the audio text semantic feature vector corresponding to the audio information includes:
performing voice recognition on the audio information to obtain audio text information corresponding to the audio information;
and inputting the audio text information into a character-level convolutional neural network model, and taking the output of the character-level convolutional neural network model as the semantic feature vector of the audio text.
Further, before the step of obtaining the audio information and the key frame image corresponding to the video to be identified, the video processing method further includes:
acquiring a video fingerprint corresponding to a video to be detected;
if the video fingerprint is not matched with a first preset video fingerprint in a normal video fingerprint database, determining whether the video fingerprint is matched with a second preset video fingerprint in an illegal video fingerprint database;
if the video fingerprint is not matched with the second preset video fingerprint, determining that the video to be detected is the video to be identified;
and if the video fingerprint is matched with the second preset video fingerprint, determining that the video to be detected is an illegal video.
Further, to achieve the above object, the present invention also provides a video processing apparatus comprising:
the first acquisition module is used for acquiring audio information and a key frame image corresponding to a video to be identified;
the second acquisition module is used for acquiring audio text semantic feature vectors corresponding to the audio information and acquiring visual semantic feature vectors corresponding to the key frame images;
the first determining module is used for determining the violation probability corresponding to the video to be identified through a fusion feature network model based on the visual semantic feature vector and the audio text semantic feature vector;
and the second determining module is used for determining that the video to be identified is the violation video if the violation probability is greater than a preset threshold.
Further, to achieve the above object, the present invention also provides a video processing apparatus comprising: a memory, a processor and a video processing program stored on the memory and executable on the processor, the video processing program when executed by the processor implementing the steps of the video processing method as described above.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having a video processing program stored thereon, which when executed by a processor implements the steps of the aforementioned video processing method.
The method comprises: acquiring the audio information and key frame images corresponding to a video to be identified; acquiring the audio text semantic feature vector corresponding to the audio information and the visual semantic feature vector corresponding to the key frame images; determining, based on the visual semantic feature vector and the audio text semantic feature vector, the violation probability corresponding to the video to be identified through a fusion feature network model; and, if the violation probability is greater than a preset threshold, determining that the video to be identified is a violation video. Whether the video is a violation video can thus be determined accurately from both its audio information and its key frame images, and because audio content such as spoken commentary is also covered during video review, the accuracy of violation-video identification is improved.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a video processing device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a video processing method according to a first embodiment of the present invention;
fig. 3 is a functional block diagram of a video processing apparatus according to an embodiment of the invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is a schematic structural diagram of a video processing device in a hardware operating environment according to an embodiment of the present invention.
The video processing equipment of the embodiment of the invention can be a PC, and can also be terminal equipment such as a smart phone.
As shown in fig. 1, the video processing apparatus may include: a processor 1001, such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, a communication bus 1002. Wherein a communication bus 1002 is used to enable connective communication between these components. The user interface 1003 may include a Display screen (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory such as a disk memory. The memory 1005 may alternatively be a storage device separate from the processor 1001.
Optionally, the video processing device may further include a camera, a Radio Frequency (RF) circuit, a sensor, an audio circuit, a WiFi module, and the like. Of course, the video processing device may also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and so on, which are not described herein again.
Those skilled in the art will appreciate that the terminal architecture shown in fig. 1 does not constitute a limitation of video processing devices and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a video processing program.
In the video processing apparatus shown in fig. 1, the network interface 1004 is mainly used for connecting to a backend server and communicating data with the backend server; the user interface 1003 is mainly used for connecting a client (user side) and performing data communication with the client; and the processor 1001 may be used to invoke a video processing program stored in the memory 1005.
In this embodiment, a video processing apparatus includes: the system comprises a memory 1005, a processor 1001 and a video processing program stored on the memory 1005 and capable of running on the processor 1001, wherein when the processor 1001 calls the video processing program stored in the memory 1005, the steps of the video processing method in the following embodiments are executed.
Referring to fig. 2, fig. 2 is a flowchart illustrating a video processing method according to a first embodiment of the present invention.
In this embodiment, the video processing method includes:
step S101, acquiring audio information and a key frame image corresponding to a video to be identified;
in this embodiment, a video to be recognized is obtained first, for example, each video to be detected that needs to be subjected to illegal review may be taken as the video to be recognized, or the video to be detected is matched according to the illegal video database and the normal video database that have been currently reviewed first, and if the video to be detected is not matched with the illegal video database and the normal video database, the video to be detected is determined to be the video to be recognized.
When the video to be identified is acquired, the audio information and key frame images corresponding to it are acquired. Specifically, audio-video separation is performed on the video to be identified to obtain its audio information and video information, and key frame extraction is then performed on the video information with an existing key frame extraction method to obtain the key frame images of the video to be identified, where the key frame extraction method may be shot-based, motion-analysis-based, or video-clustering-based. A minimal sketch of this step is given below.
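As a concrete illustration only, the separation and extraction step might look like the following Python sketch. It assumes ffmpeg and OpenCV are available; the shot-based strategy is one of the options named above, and the grey-level difference threshold and all paths are illustrative, not specified by the patent.

```python
import subprocess
import cv2

def separate_audio(video_path: str, audio_path: str) -> None:
    # Audio-video separation: strip the audio track with ffmpeg
    # (assumes the ffmpeg binary is on PATH).
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "pcm_s16le", audio_path],
        check=True,
    )

def extract_key_frames(video_path: str, diff_threshold: float = 30.0) -> list:
    # Shot-based key-frame extraction: keep a frame whenever the mean
    # absolute grey-level difference to the previous frame is large,
    # i.e. at a likely shot boundary. The threshold value is assumed.
    cap = cv2.VideoCapture(video_path)
    key_frames, prev_grey = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_grey is None or cv2.absdiff(grey, prev_grey).mean() > diff_threshold:
            key_frames.append(frame)
        prev_grey = grey
    cap.release()
    return key_frames
```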
Step S102, obtaining audio text semantic feature vectors corresponding to the audio information, and obtaining visual semantic feature vectors corresponding to the key frame images;
In this embodiment, when the audio information and the key frame images are obtained, the audio text semantic feature vector corresponding to the audio information and the visual semantic feature vector corresponding to the key frame images are acquired. Specifically, a trained semantic feature vector model may be preset and the audio information input into it to obtain the audio text semantic feature vector; alternatively, the audio text information of the audio information is obtained first, and the audio text semantic feature vector is derived from that text. For the key frame images, they may be input directly into a corresponding trained model to obtain the visual semantic feature vector; alternatively, the key frame images are input into a corresponding trained model to obtain the key frame image feature vectors, text recognition is performed on the key frame images to obtain image text information, the image text information is input into a corresponding trained model to obtain the image text feature vectors, and the visual semantic feature vector is determined from the image text feature vectors and the key frame image feature vectors.
Step S103, determining violation probability corresponding to the video to be identified through a fusion feature network model based on the visual semantic feature vector and the audio text semantic feature vector;
In this embodiment, when the visual semantic feature vector and the audio text semantic feature vector are obtained, the violation probability corresponding to the video to be identified is determined through the fusion feature network model based on these two vectors. Specifically, the visual semantic feature vector and the audio text semantic feature vector may be input into the fusion feature network model and its output taken as the violation probability; alternatively, the two vectors are first fused into a multi-modal fusion feature vector, which is input into the fusion feature network model, and the output of the fusion feature network model is taken as the violation probability.
It should be noted that the fusion feature network model consists of two fully-connected layers and one SoftMax classifier, and its output is the violation probability. Before video review is carried out, an initial fusion feature network model is set up, and first training samples with corresponding labels are obtained. A first training sample may be the visual semantic feature vector and audio text semantic feature vector (or the multi-modal fusion feature vector) of a violation video, labelled as a violation video, or those of a normal video, labelled as a normal video. Model training is performed on the initial fusion feature network model to obtain the predicted violation probability of each video in the first training samples, and the prediction result of each video is determined from its predicted violation probability; for example, if the predicted violation probability is greater than a preset threshold, the corresponding video is predicted to be a violation video. A loss function of the trained initial fusion feature network model is then computed from the labels and the prediction results, and when the loss falls below a preset value, the trained initial fusion feature network model is taken as the fusion feature network model. A sketch of such a network is given below.
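A minimal PyTorch sketch of such a fusion feature network follows. The two fully-connected layers and the SoftMax classifier match the description above; the input and hidden dimensions and the two-class layout (normal vs. violation) are assumptions, not fixed by the patent.

```python
import torch
import torch.nn as nn

class FusionFeatureNet(nn.Module):
    # Two fully-connected layers followed by a SoftMax classifier.
    # in_dim must match the fused feature dimension (assumed 256 here).
    def __init__(self, in_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 2)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        logits = self.fc2(torch.relu(self.fc1(fused)))
        # SoftMax over the two classes; return P(violation).
        return torch.softmax(logits, dim=-1)[..., 1]
```

During training one would typically keep the pre-softmax logits and minimise `nn.CrossEntropyLoss` against the violation/normal labels, stopping once the loss falls below the preset value mentioned above.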
And step S104, if the violation probability is greater than a preset threshold, determining that the video to be identified is a violation video.
In this embodiment, when the violation probability is obtained, it is determined whether the violation probability is greater than a preset threshold, and if the violation probability is greater than the preset threshold, it is determined that the video to be identified is a violation video, so that it is accurately determined whether the video to be identified is a violation video according to the audio information of the video to be identified and the key frame image.
In the video processing method provided by this embodiment, the audio information and key frame images corresponding to the video to be identified are acquired; the audio text semantic feature vector corresponding to the audio information and the visual semantic feature vector corresponding to the key frame images are then obtained; the violation probability corresponding to the video to be identified is then determined through a fusion feature network model based on the two vectors; and if the violation probability is greater than a preset threshold, the video to be identified is determined to be a violation video. Whether the video is a violation video can thus be determined accurately from both its audio information and its key frame images, and because audio content such as spoken commentary is also covered during video review, the accuracy of violation-video identification is improved.
Based on the first embodiment, a second embodiment of the video processing method of the present invention is proposed, in this embodiment, step S103 includes:
step S201, determining a multi-modal fusion feature vector corresponding to the video to be recognized based on the visual semantic feature vector and the audio text semantic feature vector;
step S202, inputting the multi-modal fusion feature vector into the fusion feature network model, and taking the output of the fusion feature network model as the violation probability.
In this embodiment, when the visual semantic feature vector and the audio text semantic feature vector are obtained, the multi-modal fusion feature vector corresponding to the video to be identified is determined from them. Specifically, the two vectors are fused into the multi-modal fusion feature vector; for example, the visual semantic feature vector and the audio text semantic feature vector may be concatenated (concat) to obtain the multi-modal fusion feature vector. Of course, in other embodiments, if the two vectors have the same dimension, the multi-modal fusion feature vector may also be obtained by taking the element-wise mean of the two vectors.
Then, the multi-modal fusion feature vector is input into the trained fusion feature network model, and the output of the fusion feature network model is taken as the violation probability. The fusion step is sketched below.
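The two fusion options, concatenation and the element-wise mean for same-dimension vectors, could be sketched as follows; this helper is illustrative and not part of the patent.

```python
import torch

def fuse_features(visual: torch.Tensor, audio_text: torch.Tensor,
                  mode: str = "concat") -> torch.Tensor:
    if mode == "concat":
        # Vector concatenation; output dim = sum of the input dims.
        return torch.cat([visual, audio_text], dim=-1)
    if mode == "mean":
        # Element-wise mean; requires identical dimensions.
        assert visual.shape == audio_text.shape
        return (visual + audio_text) / 2.0
    raise ValueError(f"unknown fusion mode: {mode}")
```

With the concat option, the input layer of the fusion feature network must match the summed dimension of the two vectors; with the mean option, the dimension is unchanged.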
In the video processing method provided by this embodiment, the multi-modal fusion feature vector corresponding to the video to be identified is determined from the visual semantic feature vector and the audio text semantic feature vector; the multi-modal fusion feature vector is then input into the fusion feature network model, and its output is taken as the violation probability. By fusing the visual and audio-text semantic feature vectors and judging the video on the fused result, the efficiency of video identification is improved and the accuracy of violation-video identification is further improved.
Based on the first embodiment, a third embodiment of the video processing method of the present invention is proposed, in this embodiment, step S102 includes:
step S301, acquiring image text characteristic vectors and key frame image characteristic vectors corresponding to the key frame images;
step S302, acquiring a sensitive character feature vector corresponding to the key frame image based on a sensitive character library;
step S303, inputting the image text feature vector, the key frame image feature vector and the sensitive character feature vector into a cyclic neural network model, and taking the output of the cyclic neural network model as the visual semantic feature vector.
In this embodiment, when a key frame image is acquired, the image text feature vector and key frame image feature vector corresponding to the key frame image are acquired. Specifically, the image text information corresponding to the key frame image is obtained first and the image text feature vector is derived from it through the corresponding trained model; meanwhile, the key frame image is input into its corresponding trained model to obtain the key frame image feature vector.
Then, the sensitive character feature vector corresponding to the key frame image is acquired based on the sensitive character library. Specifically, the key frame image is matched against the sensitive character features in the sensitive character library, and the matched target sensitive character features are taken as the sensitive character feature vector. If the review is a politics-related video review, face images of various categories, including national leaders, leaders of separatist organizations, leaders of cult organizations, leaders of violent terrorist organizations, and other politically sensitive figures, are collected in advance; these face images are input into a convolutional neural network model to obtain the sensitive character features, which may be 128-dimensional face features, and the sensitive character library is built from the obtained features.
Then, the image text feature vector, the key frame image feature vector, and the sensitive character feature vector are input into a recurrent neural network model, and the output of the recurrent neural network model is taken as the visual semantic feature vector. The recurrent neural network model may be an attention-based bidirectional temporal recurrent neural network trained in advance, and the visual semantic feature vector may be a 128-dimensional feature vector; a sketch is given below.
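One way the three feature vectors might pass through an attention-based bidirectional recurrent network is sketched below. Treating the three vectors as a length-3 sequence, using an LSTM cell, and projecting all three inputs to a common dimension are assumptions; the patent only fixes the 128-dimensional output.

```python
import torch
import torch.nn as nn

class VisualSemanticEncoder(nn.Module):
    # The image text, key frame image, and sensitive character feature
    # vectors are stacked into a length-3 sequence, encoded by a
    # bidirectional LSTM, and pooled with attention into a
    # 128-dimensional visual semantic vector.
    def __init__(self, in_dim: int = 128, out_dim: int = 128):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, out_dim // 2,
                           bidirectional=True, batch_first=True)
        self.attn = nn.Linear(out_dim, 1)

    def forward(self, img_text, key_frame, sensitive):
        seq = torch.stack([img_text, key_frame, sensitive], dim=1)  # (B, 3, in_dim)
        h, _ = self.rnn(seq)                    # (B, 3, out_dim)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over the 3 inputs
        return (w * h).sum(dim=1)               # (B, out_dim) visual semantic vector
```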
In the video processing method provided by this embodiment, the image text feature vector and key frame image feature vector corresponding to the key frame image are obtained; the sensitive character feature vector corresponding to the key frame image is then acquired based on the sensitive character library; the three vectors are then input into a recurrent neural network model, whose output is taken as the visual semantic feature vector. Because the visual semantic feature vector is derived from the image text, key frame image, and sensitive character feature vectors together, several kinds of prohibited content, such as sensitive events, politically sensitive figures, and hostile propaganda, are considered comprehensively during video identification, so that politics-related items, propaganda slogans, politically sensitive figures, and the like in the video can be identified accurately, further improving the accuracy of violation-video identification.
Based on the third embodiment, a fourth embodiment of the video processing method of the present invention is proposed, in this embodiment, step S301 includes:
step S401, performing text recognition on the key frame image to obtain image text information corresponding to the key frame image;
step S402, inputting the image text information into a convolutional neural network model, and taking the output of the convolutional neural network model as the image text feature vector;
step S403, inputting the key frame image into a deep convolutional neural network model, and using an output of the deep convolutional neural network model as the key frame image feature vector.
In this embodiment, when a key frame image is obtained, text recognition is performed on it to obtain the corresponding image text information. Specifically, an existing optical character recognition (OCR) algorithm may be used to perform text recognition on the key frame image to obtain the image text information.
The image text information is then input into the convolutional neural network model, whose output is taken as the image text feature vector; and the key frame image is input into the deep convolutional neural network model, whose output is taken as the key frame image feature vector.
In this embodiment, before video identification, an initial convolutional neural network model and an initial deep convolutional neural network model are built. The initial convolutional neural network model is trained with second training samples (text samples), and once the trained model meets the requirements it is taken as the convolutional neural network model; likewise, the initial deep convolutional neural network model is trained with third training samples (image samples), and once it meets the requirements it is taken as the deep convolutional neural network model. A sketch of the recognition and feature extraction passes is given below.
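For illustration, the sketch below uses pytesseract for the OCR pass and a torchvision ResNet-18 as the deep convolutional neural network; neither library nor architecture is named by the patent, so both are assumptions. The text branch (image text information to image text feature vector) would use a text CNN analogous to the character-level network sketched in the sixth embodiment below.

```python
import pytesseract
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

def key_frame_text(image_path: str) -> str:
    # OCR pass: key frame image -> image text information.
    # lang="chi_sim" assumes simplified-Chinese on-screen text.
    return pytesseract.image_to_string(Image.open(image_path), lang="chi_sim")

# Deep CNN pass: key frame image -> key frame image feature vector.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose the 512-d pooled features
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def key_frame_feature(image_path: str) -> torch.Tensor:
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return backbone(x).squeeze(0)  # key frame image feature vector
```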
In the video processing method provided by this embodiment, text recognition is performed on the key frame image to obtain the corresponding image text information; the image text information is input into a convolutional neural network model, whose output is taken as the image text feature vector; and the key frame image is input into a deep convolutional neural network model, whose output is taken as the key frame image feature vector. The image text feature vector and the key frame image feature vector are thus obtained accurately from the image text information and the key frame image, so that politics-related items, propaganda slogans, politically sensitive figures, and the like in the video are considered comprehensively during identification, further improving the accuracy of violation-video identification.
A fifth embodiment of the video processing method of the present invention is proposed based on the third embodiment. In this embodiment, step S302 includes:
step S501, acquiring human face features corresponding to the key frame images;
step S502, determining the similarity between each human face feature and each sensitive character feature in a sensitive character library;
step S503, obtaining the target similarity greater than the preset similarity in each similarity, and taking the sensitive character features corresponding to the target similarity as the sensitive character feature vector.
In this embodiment, after the key frame image is obtained, the face features corresponding to it are acquired, where the face features include the features of all faces appearing in the key frame image. For example, the face features may be extracted with a deep convolutional neural network or with an existing face recognition algorithm, and each face feature may be a 128-dimensional vector.
Then, the similarity between each face feature and each sensitive character feature in the sensitive character library is computed, where a sensitive character feature may likewise be a 128-dimensional face feature vector. For each face feature, the cosine value between it and each sensitive character feature is calculated with the cosine formula, and this cosine value is the corresponding similarity.
Then, each similarity is compared with a preset similarity, the target similarities greater than the preset similarity are obtained, and the sensitive character features corresponding to the target similarities, namely the target sensitive character features, are determined in the sensitive character library and taken as the sensitive character feature vector. This matching step is sketched below.
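The comparison reduces to cosine matching between 128-dimensional vectors; a minimal NumPy sketch follows, in which the preset similarity value is an assumption.

```python
import numpy as np

def match_sensitive_characters(face_feats: np.ndarray,
                               library: np.ndarray,
                               preset_similarity: float = 0.8) -> np.ndarray:
    # face_feats: (n_faces, 128) features from the key frame image.
    # library:    (n_characters, 128) sensitive character features.
    # Returns the library features whose cosine similarity to any
    # detected face exceeds the preset similarity.
    f = face_feats / np.linalg.norm(face_feats, axis=1, keepdims=True)
    l = library / np.linalg.norm(library, axis=1, keepdims=True)
    sims = f @ l.T                        # pairwise cosine values
    hits = (sims > preset_similarity).any(axis=0)
    return library[hits]                  # target sensitive character features
```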
In the video processing method provided by this embodiment, the face features corresponding to the key frame image are obtained; the similarity between each face feature and each sensitive character feature in the sensitive character library is then determined; and the target similarities greater than the preset similarity are obtained, with the corresponding sensitive character features taken as the sensitive character feature vector. The sensitive character feature vector is thus obtained accurately from the similarities, further improving the accuracy of violation-video identification.
Based on the first embodiment, a sixth embodiment of the video processing method of the present invention is proposed, in this embodiment, step S102 includes:
step S601, carrying out voice recognition on the audio information to obtain audio text information corresponding to the audio information;
step S602, inputting the audio text information into a character-level convolutional neural network model, and taking the output of the character-level convolutional neural network model as the semantic feature vector of the audio text.
In this embodiment, when the audio information is acquired, speech recognition is performed on it to obtain the corresponding audio text information; for example, an existing speech recognition algorithm may be used to perform speech recognition on the audio information.
Then, the audio text information is input into a character-level convolutional neural network model, and the output of the character-level convolutional neural network model is taken as the audio text semantic feature vector. Before video identification, an initial character-level convolutional neural network model is built and trained with fourth training samples (text samples); once the trained model meets the requirements, it is taken as the character-level convolutional neural network model, so that the audio text semantic feature vector corresponding to the audio text information is obtained through it. A sketch is given below.
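A minimal character-level convolutional network of the kind this step describes might look like the following sketch; the vocabulary size, embedding width, and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    # Character-level CNN: embeds each character, applies 1-D
    # convolutions of several kernel sizes, max-pools over time, and
    # concatenates the pooled maps into one semantic feature vector.
    def __init__(self, vocab_size: int = 5000, emb_dim: int = 64,
                 out_dim: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, out_dim // 4, k) for k in (2, 3, 4, 5)]
        )

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (B, seq_len) character indices, seq_len >= 5.
        x = self.emb(char_ids).transpose(1, 2)  # (B, emb_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=-1).values for c in self.convs]
        return torch.cat(pooled, dim=-1)        # (B, out_dim) semantic vector
```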
In the video processing method provided by this embodiment, speech recognition is performed on the audio information to obtain the corresponding audio text information; the audio text information is then input into a character-level convolutional neural network model, whose output is taken as the audio text semantic feature vector. The audio text semantic feature vector is thus obtained accurately from the audio text information, and by covering audio content such as spoken commentary during video review, the accuracy of violation-video identification is further improved.
Based on the above embodiments, a seventh embodiment of the video processing method of the present invention is proposed, in this embodiment, before step S101, the video processing method further includes:
step S701, acquiring a video fingerprint corresponding to a video to be detected;
step S702, if the video fingerprint is not matched with a first preset video fingerprint in a normal video fingerprint database, determining whether the video fingerprint is matched with a second preset video fingerprint in an illegal video fingerprint database;
step S703, if the video fingerprint is not matched with the second preset video fingerprint, determining that the video to be detected is a video to be identified;
step S704, if the video fingerprint is matched with the second preset video fingerprint, determining that the video to be detected is an illegal video.
In this embodiment, a video to be detected is obtained first; for example, each video that requires violation review may be taken as a video to be detected. The video fingerprint of the video to be detected is then obtained, and it is determined whether the video fingerprint matches a first preset video fingerprint in the normal video fingerprint database.
If the video fingerprint does not match any first preset video fingerprint in the normal video fingerprint database, the video to be detected is not a known normal video, that is, it may be an illegal video or a not-yet-reviewed video; in that case, it is determined whether the video fingerprint matches a second preset video fingerprint in the illegal video fingerprint database. If the video fingerprint does match a first preset video fingerprint in the normal video fingerprint database, for example it belongs to the set of first preset video fingerprints, the video to be detected is determined to be a normal video.
If the video fingerprint matches a second preset video fingerprint, the video to be detected is determined to be an illegal video; otherwise, it is determined to be the video to be identified. The screening flow is sketched below.
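The screening flow might be sketched as follows. The patent does not specify a fingerprint algorithm; the SHA-256 file hash below is a placeholder that only matches exact duplicates, whereas a production system would use a perceptual video hash.

```python
import hashlib

def video_fingerprint(path: str) -> str:
    # Placeholder fingerprint: SHA-256 over the raw file bytes.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def screen(path: str, normal_db: set, violation_db: set) -> str:
    fp = video_fingerprint(path)
    if fp in normal_db:
        return "normal"        # matched the normal video fingerprint database
    if fp in violation_db:
        return "violation"     # matched the illegal video fingerprint database
    return "to_identify"       # pass on to the multimodal review pipeline
```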
In the video processing method provided by this embodiment, the video fingerprint corresponding to the video to be detected is obtained; if it does not match a first preset video fingerprint in the normal video fingerprint database, it is checked against the second preset video fingerprints in the illegal video fingerprint database; if it matches neither, the video to be detected is determined to be the video to be identified, and if it matches a second preset video fingerprint, it is determined to be an illegal video. By pre-screening the video to be detected against the fingerprints of known normal and illegal videos, videos that already belong to either category are not reviewed again, which further improves the efficiency of violation-video identification.
The present invention also provides a video processing apparatus, referring to fig. 3, the video processing apparatus comprising:
the first obtaining module 10 is configured to obtain audio information and a key frame image corresponding to a video to be identified;
a second obtaining module 20, configured to obtain an audio text semantic feature vector corresponding to the audio information, and obtain a visual semantic feature vector corresponding to the key frame image;
the first determining module 30 is configured to determine, based on the visual semantic feature vector and the audio text semantic feature vector, a violation probability corresponding to the video to be identified through a fusion feature network model;
and a second determining module 40, configured to determine that the video to be identified is an illegal video if the violation probability is greater than a preset threshold.
Further, the first determining module 30 is further configured to:
determining a multi-mode fusion feature vector corresponding to the video to be recognized based on the visual semantic feature vector and the audio text semantic feature vector;
and inputting the multi-mode fusion feature vector into the fusion feature network model, and taking the output of the fusion feature network model as the violation probability.
Further, the second obtaining module 20 is further configured to:
acquiring image text characteristic vectors and key frame image characteristic vectors corresponding to the key frame images;
acquiring a sensitive character feature vector corresponding to the key frame image based on a sensitive character library;
and inputting the image text feature vector, the key frame image feature vector and the sensitive character feature vector into a cyclic neural network model, and taking the output of the cyclic neural network model as the visual semantic feature vector.
Further, the second obtaining module 20 is further configured to:
performing text recognition on the key frame image to obtain image text information corresponding to the key frame image;
inputting the image text information into a convolutional neural network model, and taking the output of the convolutional neural network model as the image text feature vector;
and inputting the key frame image into a deep convolutional neural network model, and taking the output of the deep convolutional neural network model as the feature vector of the key frame image.
Further, the second obtaining module 20 is further configured to:
acquiring the face features corresponding to the key frame images;
determining the similarity between each face feature and each sensitive character feature in a sensitive character library;
and acquiring the target similarity which is greater than the preset similarity in the similarities, and taking the sensitive character features corresponding to the target similarity as the sensitive character feature vector.
Further, the second obtaining module 20 is further configured to:
performing voice recognition on the audio information to obtain audio text information corresponding to the audio information;
and inputting the audio text information into a character-level convolutional neural network model, and taking the output of the character-level convolutional neural network model as the semantic feature vector of the audio text.
Further, the video processing apparatus is further configured to:
acquiring a video fingerprint corresponding to a video to be detected;
if the video fingerprint is not matched with a first preset video fingerprint in a normal video fingerprint database, determining whether the video fingerprint is matched with a second preset video fingerprint in an illegal video fingerprint database;
if the video fingerprint is not matched with the second preset video fingerprint, determining that the video to be detected is the video to be identified;
and if the video fingerprint is matched with the second preset video fingerprint, determining that the video to be detected is an illegal video.
The methods executed by the program units can refer to various embodiments of the video processing method of the present invention, and are not described herein again.
The invention also provides a computer readable storage medium.
The computer readable storage medium of the present invention has stored thereon a video processing program which, when executed by a processor, implements the steps of the video processing method as described above.
The method implemented when the video processing program running on the processor is executed may refer to each embodiment of the video processing method of the present invention, and details thereof are not repeated herein.
Furthermore, an embodiment of the present invention further provides a computer program product, which includes a video processing program, and when the video processing program is executed by a processor, the steps of the video processing method described above are implemented.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or system that includes the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A video processing method, characterized in that the video processing method comprises the steps of:
acquiring audio information and a key frame image corresponding to a video to be identified;
acquiring an audio text semantic feature vector corresponding to the audio information, and acquiring a visual semantic feature vector corresponding to the key frame image;
determining the violation probability corresponding to the video to be identified through a fusion feature network model based on the visual semantic feature vector and the audio text semantic feature vector;
and if the violation probability is greater than a preset threshold value, determining that the video to be identified is the violation video.
2. The video processing method of claim 1, wherein the step of determining the violation probability corresponding to the video to be recognized by fusing a feature network model based on the visual semantic feature vector and the audio text semantic feature vector comprises:
determining a multi-mode fusion feature vector corresponding to the video to be recognized based on the visual semantic feature vector and the audio text semantic feature vector;
and inputting the multi-mode fusion feature vector into the fusion feature network model, and taking the output of the fusion feature network model as the violation probability.
3. The video processing method according to claim 1, wherein the step of obtaining the visual semantic feature vector corresponding to the key frame image comprises:
acquiring image text characteristic vectors and key frame image characteristic vectors corresponding to the key frame images;
acquiring a sensitive character feature vector corresponding to the key frame image based on a sensitive character library;
and inputting the image text feature vector, the key frame image feature vector and the sensitive character feature vector into a cyclic neural network model, and taking the output of the cyclic neural network model as the visual semantic feature vector.
4. The video processing method as claimed in claim 3, wherein said step of obtaining image text feature vectors and key frame image feature vectors corresponding to said key frame images comprises:
performing text recognition on the key frame image to obtain image text information corresponding to the key frame image;
inputting the image text information into a convolutional neural network model, and taking the output of the convolutional neural network model as the image text feature vector;
and inputting the key frame image into a deep convolutional neural network model, and taking the output of the deep convolutional neural network model as the feature vector of the key frame image.
5. The video processing method of claim 3, wherein the step of obtaining the feature vector of the sensitive person corresponding to the key frame image based on the sensitive person library comprises:
acquiring the face features corresponding to the key frame images;
determining the similarity between each face feature and each sensitive character feature in a sensitive character library;
and acquiring the target similarity which is greater than the preset similarity in the similarities, and taking the sensitive character features corresponding to the target similarity as the sensitive character feature vector.
6. The video processing method according to claim 1, wherein the step of obtaining the audio text semantic feature vector corresponding to the audio information comprises:
performing voice recognition on the audio information to obtain audio text information corresponding to the audio information;
and inputting the audio text information into a character-level convolutional neural network model, and taking the output of the character-level convolutional neural network model as the semantic feature vector of the audio text.
7. The video processing method according to any one of claims 1 to 6, wherein before the step of acquiring the audio information corresponding to the video to be identified and the key frame image, the video processing method further comprises:
acquiring a video fingerprint corresponding to a video to be detected;
if the video fingerprint is not matched with a first preset video fingerprint in a normal video fingerprint database, determining whether the video fingerprint is matched with a second preset video fingerprint in an illegal video fingerprint database;
if the video fingerprint is not matched with the second preset video fingerprint, determining that the video to be detected is the video to be identified;
and if the video fingerprint is matched with the second preset video fingerprint, determining that the video to be detected is an illegal video.
8. A video processing apparatus, characterized in that the video processing apparatus comprises:
the first acquisition module is used for acquiring audio information and a key frame image corresponding to a video to be identified;
the second acquisition module is used for acquiring audio text semantic feature vectors corresponding to the audio information and acquiring visual semantic feature vectors corresponding to the key frame images;
the first determining module is used for determining the violation probability corresponding to the video to be recognized through a fusion feature network model based on the visual semantic feature vector and the audio text semantic feature vector;
and the second determining module is used for determining that the video to be identified is the violation video if the violation probability is greater than a preset threshold.
9. A video processing apparatus, characterized in that the video processing apparatus comprises: memory, processor and video processing program stored on the memory and executable on the processor, which when executed by the processor implements the steps of the video processing method according to any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a video processing program is stored thereon, which when executed by a processor implements the steps of the video processing method according to any one of claims 1 to 7.
CN202210501926.3A 2022-05-09 2022-05-09 Video processing method, device, equipment and computer readable storage medium Pending CN115049953A (en)

Priority Applications (1)

Application Number: CN202210501926.3A
Priority Date / Filing Date: 2022-05-09
Title: Video processing method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number: CN202210501926.3A
Priority Date / Filing Date: 2022-05-09
Title: Video processing method, device, equipment and computer readable storage medium

Publications (1)

Publication Number: CN115049953A
Publication Date: 2022-09-13

Family

ID=83157492

Family Applications (1)

Application Number: CN202210501926.3A (status: Pending)
Title: Video processing method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN115049953A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294504A (en) * 2022-09-28 2022-11-04 武汉当夏时光文化创意有限公司 Marketing video auditing system based on AI
CN115294504B (en) * 2022-09-28 2023-01-03 武汉当夏时光文化创意有限公司 Marketing video auditing system based on AI
CN115834935A (en) * 2022-12-21 2023-03-21 阿里云计算有限公司 Multimedia information auditing method, advertisement auditing method, equipment and storage medium
CN116701708A (en) * 2023-07-27 2023-09-05 上海蜜度信息技术有限公司 Multi-mode enhanced video classification method, system, storage medium and electronic equipment
CN116701708B (en) * 2023-07-27 2023-11-17 上海蜜度信息技术有限公司 Multi-mode enhanced video classification method, system, storage medium and electronic equipment
CN117829973A (en) * 2023-12-29 2024-04-05 广电运通集团股份有限公司 Risk control method and device for banking outlets
CN117829973B (en) * 2023-12-29 2024-08-16 广电运通集团股份有限公司 Risk control method and device for banking outlets

Similar Documents

Publication Publication Date Title
CN115049953A (en) Video processing method, device, equipment and computer readable storage medium
CN109784186B (en) Pedestrian re-identification method and device, electronic equipment and computer-readable storage medium
CN109117777B (en) Method and device for generating information
CN113779308B (en) Short video detection and multi-classification method, device and storage medium
CN113449725B (en) Object classification method, device, equipment and storage medium
CN111914812A (en) Image processing model training method, device, equipment and storage medium
CN111932544A (en) Tampered image detection method and device and computer readable storage medium
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN111931859B (en) Multi-label image recognition method and device
CN113221918B (en) Target detection method, training method and device of target detection model
CN114611672B (en) Model training method, face recognition method and device
CN110717407B (en) Face recognition method, device and storage medium based on lip language password
CN113255557A (en) Video crowd emotion analysis method and system based on deep learning
CN116561570A (en) Training method, device and equipment for multi-mode model and readable storage medium
CN112766284B (en) Image recognition method and device, storage medium and electronic equipment
Sathiyaprasad Ontology-based video retrieval using modified classification technique by learning in smart surveillance applications
CN117668758A (en) Dialog intention recognition method and device, electronic equipment and storage medium
Khan et al. A framework for plagiarism detection in Arabic documents
CN107329948B (en) Method, apparatus and storage medium for estimating occurrence time of event described in sentence
CN116187341A (en) Semantic recognition method and device
CN115423050A (en) False news detection method and device, electronic equipment and storage medium
CN115329171A (en) Content identification method and equipment
CN114299295A (en) Data processing method and related device
CN114186039A (en) Visual question answering method and device and electronic equipment
CN111626437A (en) Confrontation sample detection method, device and equipment and computer scale storage medium

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination