
WO2021082941A1 - Video figure recognition method and apparatus, and storage medium and electronic device - Google Patents

Video figure recognition method and apparatus, and storage medium and electronic device Download PDF

Info

Publication number
WO2021082941A1
WO2021082941A1 · PCT/CN2020/121259 · CN2020121259W
Authority
WO
WIPO (PCT)
Prior art keywords
feature
target video
key frame
audio
sub
Prior art date
Application number
PCT/CN2020/121259
Other languages
French (fr)
Chinese (zh)
Inventor
彭冬炜
Original Assignee
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司 filed Critical Oppo广东移动通信有限公司
Publication of WO2021082941A1 publication Critical patent/WO2021082941A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Definitions

  • the present disclosure relates to the field of artificial intelligence technology, and in particular to a video character recognition method, a video character recognition device, a computer-readable storage medium, and electronic equipment.
  • Video person recognition refers to the identification of the person's identity in the video to classify the video or add person tags, etc. It has important applications in scenarios such as security, video classification, video content review, and smart photo albums.
  • video person recognition is mainly implemented based on face recognition in video images.
  • An image containing a face is detected from the video, and then the face in the image is further accurately recognized to determine the identity of the person.
  • This method has higher requirements on the sharpness of the face image. When the face image is not clear enough or is blocked, the accuracy of the recognition result is low.
  • the present disclosure provides a video character recognition method, a video character recognition device, a computer-readable storage medium, and electronic equipment, thereby improving the accuracy of video character recognition at least to a certain extent.
  • A method for recognizing a person in a video is provided, including: acquiring a key frame image from a target video; extracting a character appearance feature from the key frame image; intercepting, according to the time of the key frame image in the target video, the sub-audio corresponding to the key frame image from the audio of the target video, and extracting a voiceprint feature from the sub-audio; and processing the character appearance feature and the voiceprint feature using a pre-trained fusion model to obtain a person recognition result of the target video.
  • A video character recognition device is provided, including a processor, wherein the processor is configured to execute the following program modules stored in a memory: an image acquisition module for acquiring a key frame image from a target video; a first extraction module for extracting a character appearance feature from the key frame image; a second extraction module for intercepting, according to the time of the key frame image in the target video, the sub-audio corresponding to the key frame image from the audio of the target video, and extracting a voiceprint feature from the sub-audio; and a feature processing module for processing the character appearance feature and the voiceprint feature using a pre-trained fusion model to obtain a person recognition result of the target video.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the above-mentioned video character recognition method and possible implementation manners thereof are realized.
  • An electronic device is provided, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the above-mentioned video character recognition method and its possible implementations by executing the executable instructions.
  • According to the above method and apparatus, the key frame image is obtained from the target video and the character appearance feature is extracted; the sub-audio is then intercepted according to the time of the key frame image in the target video and the voiceprint feature is extracted from it; finally, the pre-trained fusion model processes the appearance feature and the voiceprint feature to obtain the person recognition result of the target video.
  • On the one hand, the character appearance feature reflects the image aspect of the video and the voiceprint feature reflects the sound aspect, so the "sound" and "picture" information of the video is exploited jointly; fusing these multi-modal features and performing person recognition on them can achieve higher recognition accuracy.
  • On the other hand, the multi-modal features can compensate, to a certain extent, for the absence of either single feature, so the technical solution of the present disclosure can be applied to situations where the face image in the video has low definition or is occluded, and has high robustness.
  • Furthermore, the key frame image and the sub-audio are matched in time and correspond to each other, which reduces cases where the image feature and the sound feature are out of sync and improves the accuracy of video character recognition.
  • Fig. 1 shows a flowchart of a method for video character recognition in this exemplary embodiment
  • Fig. 2 shows a flowchart of a method for extracting facial features in this exemplary embodiment
  • Fig. 3 shows a flowchart of a method for extracting voiceprint features in this exemplary embodiment
  • Fig. 4 shows a flowchart of a method for obtaining a spectrum in this exemplary embodiment
  • Fig. 5 shows a flowchart of a method for processing appearance characteristics and voiceprint characteristics of a character in this exemplary embodiment
  • Fig. 6 shows a flowchart of another method for processing appearance characteristics and voiceprint characteristics of a person in this exemplary embodiment
  • Fig. 7 shows a structural block diagram of a video character recognition device in this exemplary embodiment
  • Fig. 8 shows a structural block diagram of another video character recognition device in this exemplary embodiment
  • Fig. 9 shows an electronic device for implementing the above method in this exemplary embodiment.
  • In the related art, face recognition based on video images places high requirements on the clarity of the face image; when the face image is not clear enough or is occluded, the accuracy of the recognition result is low.
  • In addition, a video actually contains multi-modal information, including images, speech and other aspects, while the related art identifies people only through face recognition in the image and fails to make full use of this multi-modal information, which is one of the main reasons for its low accuracy in identifying people.
  • In view of one or more of the above problems, the exemplary embodiments of the present disclosure first provide a video character recognition method. The method can be applied to the server of a video service platform, for example to perform character recognition on the videos on the platform from the server side in order to add people tags that make searching convenient for users; it can also be applied to terminal devices such as personal computers and smartphones, for example to recognize people in videos shot or downloaded by the user in order to classify them by person automatically.
  • Fig. 1 shows a process of this exemplary embodiment, which may include the following steps S110 to S140:
  • Step S110: obtain a key frame image from the target video;
  • Step S120: extract a character appearance feature from the key frame image;
  • Step S130: according to the time of the key frame image in the target video, intercept the sub-audio corresponding to the key frame image from the audio of the target video, and extract a voiceprint feature from the sub-audio;
  • Step S140: use the pre-trained fusion model to process the above character appearance feature and voiceprint feature to obtain a person recognition result of the target video.
  • In step S110, a key frame image is obtained from the target video.
  • the key frame image refers to an image containing the appearance of a human face in the target video.
  • One key frame or multiple key frames may be extracted, and the number of them is not limited in the present disclosure.
  • In a fixed-interval approach, one frame is selected as a key frame in the target video every fixed length of time or every fixed number of frames; for example, one key frame image can be extracted every 3 frames.
  • an intra-coded frame (Intra-Coded frame, I frame for short) is a frame that is independently coded based on a single frame image, which is a complete preservation of the current frame image, and only the data of the current frame is required for decoding.
  • Corresponding to the I frame, there are the forward predictive frame (Predictive frame, P frame for short) and the bi-directional predictive frame (Bi-Predictive frame, B frame for short). The P frame records the difference from the previous frame, and decoding it requires reference to the previous frame data; the B frame records the differences between it and both the previous and subsequent frames, and requires reference to both the previous and subsequent frame data to be completely decoded.
  • It follows that if a P frame or a B frame is determined to be the key frame, the I frame needs to be decoded first when acquiring the key frame image, and the target P frame or B frame is then decoded according to the differences between adjacent frames, which is less efficient. Therefore, the I frame can be directly used as the key frame: the I frame is extracted from the target video and decoded to obtain the aforementioned key frame image. In this way, only the key frame image itself needs to be decoded, no other frames need to be decoded, the number of frames to be decoded is minimal, and the key frame image is extracted at the fastest speed.
  • In addition, multiple threads can be invoked during decoding so that each thread decodes one I frame separately. Decoders are usually found in video tools such as video playback software and editing software; such a decoder can be embedded into the video character recognition program and the threading part of its code modified. For example, when N I frames are acquired as key frames, N threads are started accordingly, the decoding task of each I frame is allocated to a corresponding thread, and each thread executes its decoding task independently, so that the extraction of the key frame images is completed quickly in a concurrent manner.
  • Step S110 may obtain a fixed number of key frame images, for example 64 or 128 key frame images. When these key frames are collected, the relevant parameters can be determined according to this number, for example: calculating the interval length or number of frames in method (1) above; determining the number of key frames extracted from each sub-video in method (2) above; or determining how to extract I frames in method (3) above, where, if the number of I frames in the video to be classified is insufficient, the shortfall can be extracted from P frames or B frames.
  • This exemplary embodiment can also use the above three methods in combination, for example methods (2) and (3) can be combined so that one I frame is selected as a key frame in each sub-video, and so on.
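  • For illustration only (the disclosure does not name a decoding library), the following is a minimal Python sketch of extracting only I frames as key frame images; the PyAV package, the 64-frame cap and the use of the decoder's built-in frame threading are assumptions rather than part of the disclosure.
```python
import av  # PyAV, a Python binding to FFmpeg (assumed dependency)

def extract_key_frames(video_path, max_frames=64):
    """Decode only intra-coded frames (I frames) as key frame images."""
    key_frames = []  # list of (timestamp in seconds, RGB image array)
    with av.open(video_path) as container:
        stream = container.streams.video[0]
        stream.codec_context.skip_frame = "NONKEY"   # skip P/B frames entirely
        stream.codec_context.thread_type = "AUTO"    # let the decoder use multiple threads
        for frame in container.decode(stream):
            key_frames.append((float(frame.time), frame.to_ndarray(format="rgb24")))
            if len(key_frames) >= max_frames:        # stop at a fixed number of key frames
                break
    return key_frames
```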
  • In step S120, the character appearance feature is extracted from the key frame image.
  • A machine learning model may be used to extract the character appearance feature from the key frame image. Since the aim here is not to classify or recognize the key frame image itself, there is no limitation on what type of data the machine learning model ultimately outputs.
  • The advantage of this is that there is no restriction on the type of label used when training the convolutional neural network: whichever label is readily available or easy to obtain can be used for training.
  • For example, an open-source portrait data set containing a large number of face images and their classification labels can be used to train a convolutional neural network for image classification, which is then used to extract face features in step S120.
  • In feature extraction, the key frame image can be input into the convolutional neural network and, after a series of convolution and pooling operations, the feature is taken from a fully connected layer. The first fully connected layer, whose features are denser, may be selected, or a subsequent fully connected layer, whose data volume is usually smaller, may be chosen; this is not limited in the present disclosure.
  • The character appearance feature can include a face feature, a body shape feature, a posture feature, and the like. The face feature includes information such as the position, proportion, shape and expression of each part of the face; the body shape feature includes information such as the position, proportion and shape of the body and limbs; the posture feature includes information such as the character's actions and poses. Among these, the face feature is relatively more important for person recognition.
  • step S120 may be specifically implemented by the following steps S210 and S220:
  • Step S210: detect the face area in the key frame image so as to intercept a face sub-image from the key frame image;
  • Step S220: use a pre-trained convolutional neural network to extract a face feature from the aforementioned face sub-image.
  • the face area can be recognized by algorithms such as contour detection.
  • the key frame image can be input into the face detection network RetinaFace for face area detection, and the area where the face is located in the image and the coordinates of the key points of the face can be detected.
  • the face area is cut out from the key frame image to obtain a face sub-image, which filters out scenes, objects and other image content not related to person recognition.
  • The face sub-image is then input to a pre-trained convolutional neural network, and the face feature is obtained from a fully connected layer of the network.
  • The dimension of the face feature can be set according to actual needs. For example, if the first fully connected layer of the convolutional neural network is set to 512 dimensions, then after the face sub-image is input, a 512-dimensional face feature can be extracted from that fully connected layer.
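  • As an illustrative sketch of steps S210 and S220 (not the disclosure's exact implementation), the face area returned by a detector such as RetinaFace can be cropped and passed through a pre-trained CNN whose 512-dimensional fully connected layer serves as the face feature; the `detect_face_boxes` and `face_cnn` callables below are hypothetical placeholders.
```python
import torch
import torch.nn.functional as F

def extract_face_feature(key_frame_rgb, detect_face_boxes, face_cnn):
    """Crop the detected face sub-image and return a 512-d feature from the CNN."""
    boxes = detect_face_boxes(key_frame_rgb)           # e.g. RetinaFace-style [x1, y1, x2, y2] boxes
    if len(boxes) == 0:
        return None                                    # no face in this key frame
    x1, y1, x2, y2 = [int(v) for v in boxes[0]]
    face = key_frame_rgb[y1:y2, x1:x2]                 # face sub-image, irrelevant content removed
    face = torch.from_numpy(face).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    face = F.interpolate(face, size=(112, 112), mode="bilinear", align_corners=False)
    with torch.no_grad():
        feature = face_cnn(face)                       # 1 x 512 output of the fully connected layer
    return feature.squeeze(0)
```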
  • In step S130, according to the time of the key frame image in the target video, the sub-audio corresponding to the key frame image is intercepted from the audio of the target video, and the voiceprint feature is extracted from the sub-audio.
  • the intercepted sub audio is the audio part corresponding to the key frame image.
  • For example, if the time of the key frame image in the target video is 09.670 seconds, this time can be taken as the center, and a window of sub-audio of a certain duration can be intercepted from the audio of the target video.
  • the sub audio and key frame images should achieve "audio-visual synchronization".
  • The duration of the sub-audio is not specifically limited in the present disclosure; a fixed duration can be used, for example 3 seconds or 5 seconds according to the average time a person speaks in a typical video. Alternatively, the audio on both sides of the key frame time point can be scanned for sudden change points, such as time points at which the speech content or frequency changes abruptly, and the portion of audio between two such change points can be intercepted.
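  • A minimal sketch of the fixed-duration variant of this interception is shown below, assuming the audio track has already been loaded as a mono waveform with a known sample rate; the 3-second window is only one of the example durations mentioned above.
```python
import numpy as np

def cut_sub_audio(waveform, sample_rate, key_frame_time, window_seconds=3.0):
    """Intercept a window of audio centred on the key frame's time in the video."""
    center = int(key_frame_time * sample_rate)
    half = int(window_seconds * sample_rate / 2)
    start = max(0, center - half)
    end = min(len(waveform), center + half)
    return np.asarray(waveform[start:end])
```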
  • the voiceprint feature may include Mel-Frequency Cepstral Coefficients (MFCC).
  • Step S310 Obtain the frequency spectrum corresponding to the sub audio.
  • the frequency spectrum can be regarded as the collection of the frequency, phase and amplitude of each sinusoidal signal, and the corresponding frequency spectrum can be obtained by sampling the signal in the sub-audio.
  • Step S320 Calculate the corresponding amplitude spectrum based on the above frequency spectrum.
  • the amplitude spectrum refers to the spectral line formed by the amplitude of each different frequency sinusoidal signal in the frequency spectrum.
  • the corresponding amplitude spectrum can be obtained by disassembling and calculating the frequency spectrum of the sub-audio.
  • In step S330, Mel filter processing is performed on the amplitude spectrum to calculate the Mel frequency cepstrum coefficients of the sub-audio.
  • the amplitude spectrum can be divided into multiple mel filter banks according to the sensitivity of the human ear, and the center frequencies of each filter are linearly distributed at equal intervals.
  • By using the Mel filter bank to perform Mel filtering on the amplitude spectrum of the sub-audio, the Mel frequency cepstrum coefficients can be calculated.
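  • One possible realisation of steps S310 to S330 is sketched below using the librosa library (an assumption; the disclosure does not prescribe a library): the short-time Fourier transform gives the frequency spectrum, its magnitude gives the amplitude spectrum, and Mel filtering followed by a discrete cosine transform yields the Mel frequency cepstrum coefficients.
```python
import numpy as np
import librosa

def compute_mfcc(sub_audio, sample_rate, n_mfcc=20):
    spectrum = librosa.stft(sub_audio)                                      # step S310: frequency spectrum
    amplitude = np.abs(spectrum)                                            # step S320: amplitude spectrum
    mel = librosa.feature.melspectrogram(S=amplitude ** 2, sr=sample_rate)  # Mel filter bank processing
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=n_mfcc)  # step S330: MFCC
    return mfcc                                                             # shape: (n_mfcc, number of frames)
```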
  • step S310 may be implemented by the following steps S410 and S420:
  • Step S410: preprocess the sub-audio.
  • the preprocessing may include any one or more of the following: extracting the speech signal, pre-emphasis, framing and windowing.
  • Extracting the speech signal refers to filtering out non-human-voice signals such as background sound and noise from the sub-audio and retaining only the human voice signal; pre-emphasis is a signal processing method that compensates for the high-frequency components of the sub-audio; framing splits the sub-audio into individual frames to facilitate subsequent feature extraction; windowing limits the signal through a preset window size, where each frame is substituted into a window function and values outside the window are set to 0.
  • This exemplary embodiment may use a rectangular (square) window, a Hamming window, or the like as the window function.
  • Step S420 Perform Fourier transform on the preprocessed sub-audio to obtain the frequency spectrum corresponding to the sub-audio.
  • Performing Fourier transform on the sub-audio can extract the frequency domain features, and draw the sub-audio signal as a frequency-density curve, that is, obtain the frequency spectrum corresponding to the sub-audio.
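  • A minimal numpy sketch of the preprocessing and Fourier transform in steps S410 and S420 (pre-emphasis, framing, Hamming windowing, then a per-frame Fourier transform) follows; the frame length, hop size and pre-emphasis coefficient are illustrative choices, not values fixed by the disclosure.
```python
import numpy as np

def preprocess_and_fft(signal, frame_len=400, hop=160, alpha=0.97):
    """Return the per-frame frequency spectrum of the preprocessed sub-audio."""
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])   # pre-emphasis
    window = np.hamming(frame_len)                                        # windowing function
    frames = [emphasized[s:s + frame_len] * window                        # framing + windowing
              for s in range(0, len(emphasized) - frame_len + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)                          # step S420: frequency spectrum
```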
  • the Mel frequency cepstrum coefficients can also be converted into voiceprint feature vectors, and post-processing is performed.
  • Mel frequency cepstral coefficients are high-dimensional dense features, which are expressed in the form of vectors, that is, voiceprint feature vectors, which can be used in the processing of machine learning models.
  • the post-processing of the voiceprint feature vector can optimize the subsequent process, and the post-processing can adopt any one or more of the following methods: de-averaging, normalization, and dimensionality reduction processing.
  • For example, Principal Components Analysis (PCA) can be used for the dimensionality reduction processing.
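  • A minimal sketch of this optional post-processing over a batch of voiceprint feature vectors is given below, with scikit-learn assumed for the PCA step and the 128-dimensional output chosen only for illustration.
```python
import numpy as np
from sklearn.decomposition import PCA

def postprocess_voiceprints(vectors, n_components=128):
    """De-average, normalise and reduce the dimensionality of voiceprint vectors."""
    centered = vectors - vectors.mean(axis=0)                         # de-averaging
    normalized = centered / (centered.std(axis=0) + 1e-8)             # normalisation
    # n_components must not exceed the number of vectors or their dimension
    return PCA(n_components=n_components).fit_transform(normalized)   # dimensionality reduction
```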
  • The voiceprint features extracted in step S130 may also include any one or more of the following: generalized Mel cepstrum coefficients, spectral envelope and energy features, fundamental frequency, voiced/light tone classification features, and band aperiodic components.
  • The generalized Mel cepstrum coefficient is basically the same as the Mel frequency cepstrum coefficient in principle; both are high-dimensional features (for example, 180 dimensions) whose specific coefficient contents differ somewhat, so it can be used as a substitute for the Mel frequency cepstrum coefficient.
  • The spectral envelope and energy features are features related to the speech content, while the fundamental frequency, voiced/light tone classification features and band aperiodic components are features related to basic pronunciation information; they are usually relatively sparse features and can be used as a supplement to the Mel frequency cepstrum coefficients.
  • The richer the dimensions of the voiceprint feature, the more accurately the video character is characterized, and the more conducive this is to accurate video character recognition.
  • In step S140, a pre-trained fusion model is used to process the above character appearance feature and voiceprint feature to obtain the person recognition result of the target video.
  • the character appearance feature is the feature extracted from the image
  • the voiceprint feature is the feature extracted from the sound.
  • the fusion of the two represents the multi-modal feature of the target video.
  • After the image feature and the voiceprint feature are extracted in steps S120 and S130 to obtain feature data in vector or matrix form, multi-modal feature fusion is easy to achieve, and the fused features are then processed by the fusion model to obtain the person recognition result of the target video.
  • step S140 may include the following steps S510 to S530:
  • Step S510: input the character appearance feature and the voiceprint feature into the two input channels of the fusion model respectively.
  • Step S520: use the fusion model to process the character appearance feature and the voiceprint feature separately to obtain an intermediate feature corresponding to the appearance feature and an intermediate feature corresponding to the voiceprint feature; these two intermediate features respectively represent abstract information of the video character's appearance and voice.
  • Step S530: perform a fusion calculation on the intermediate feature corresponding to the character appearance feature and the intermediate feature corresponding to the voiceprint feature, and output the person recognition result of the target video.
  • In this way, the appearance and sound information can be combined and correlated, achieving fusion of the two modalities of information, and after further recognition processing a comprehensive person recognition result is finally output.
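  • A hedged PyTorch sketch of the two-channel fusion described in steps S510 to S530 is given below: each modality is mapped to an intermediate feature, the two intermediate features are fused, and a classifier outputs the person recognition result; the layer sizes and the number of person classes are assumptions, not values from the disclosure.
```python
import torch
import torch.nn as nn

class TwoChannelFusion(nn.Module):
    def __init__(self, face_dim=512, voice_dim=512, hidden=256, num_persons=100):
        super().__init__()
        self.face_branch = nn.Sequential(nn.Linear(face_dim, hidden), nn.ReLU())    # input channel 1
        self.voice_branch = nn.Sequential(nn.Linear(voice_dim, hidden), nn.ReLU())  # input channel 2
        self.classifier = nn.Linear(hidden * 2, num_persons)

    def forward(self, face_feat, voice_feat):
        f = self.face_branch(face_feat)       # intermediate feature of the appearance (S520)
        v = self.voice_branch(voice_feat)     # intermediate feature of the voiceprint (S520)
        fused = torch.cat([f, v], dim=-1)     # fusion calculation (S530)
        return self.classifier(fused)         # person recognition result (logits)
```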
  • step S140 may include the following steps S610 and S620:
  • Step S610: combine the character appearance feature and the voiceprint feature to obtain a comprehensive feature;
  • Step S620: input the comprehensive feature into the fusion model and output the person recognition result of the target video.
  • The character appearance feature and the voiceprint feature can be combined by splicing (concatenation). For example, a 512-dimensional face feature is obtained through the convolutional neural network and a 512-dimensional Mel frequency cepstrum coefficient (i.e., the voiceprint feature) is extracted; the two are spliced into a 1024-dimensional comprehensive feature and input into the fusion model for processing.
  • The fusion model can use a common neural network model, or it can be structurally optimized according to actual needs. For example, MobileNet (an open-source neural network for mobile terminals) can be used, in which data enhancement mechanisms such as a Dropout layer and random noise are set up to regularize the input comprehensive feature; the number of channels of the fully connected layer is set to 1024, and the PReLU (Parametric Rectified Linear Unit) activation and a BCE Loss (Binary Cross Entropy Loss) can be adopted.
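  • The concatenation variant can be sketched as below; this is only one interpretation of the description above (Dropout for regularisation, a 1024-channel fully connected layer, PReLU activation and a binary cross-entropy loss) and does not reproduce the exact MobileNet-based topology of the disclosure.
```python
import torch
import torch.nn as nn

class ConcatFusionHead(nn.Module):
    def __init__(self, in_dim=1024, num_persons=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(p=0.5),             # regularisation of the input comprehensive feature
            nn.Linear(in_dim, 1024),       # fully connected layer with 1024 channels
            nn.PReLU(),                    # Parametric Rectified Linear Unit activation
            nn.Linear(1024, num_persons),
        )

    def forward(self, face_feat, voice_feat):
        combined = torch.cat([face_feat, voice_feat], dim=-1)  # 1024-d comprehensive feature
        return self.net(combined)

criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy loss used during training
```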
  • A time feature can also be determined according to the time of the key frame image in the target video and the time interval of the sub-audio in the target video.
  • The time feature can include 2 or 3 dimensions: one dimension records the time of the key frame, and the other dimension(s) record the time interval of the sub-audio, for example the start time and the end time, or the center time and the sub-audio duration; in the case of multiple key frame images, the dimension of the time feature is set according to the number of frames.
  • The time feature is equivalent to a supplement to the multi-modal features; adding the time information helps to improve the completeness and richness of the comprehensive feature, thereby improving the accuracy of person recognition.
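  • A minimal sketch of appending the optional time feature (one dimension for the key frame time and two for the sub-audio interval) to the comprehensive feature is shown below; the single-key-frame case is assumed.
```python
import torch

def add_time_feature(comprehensive_feat, key_frame_time, audio_start, audio_end):
    """Append a 3-dimensional time feature to the comprehensive feature vector."""
    time_feat = torch.tensor([key_frame_time, audio_start, audio_end],
                             dtype=comprehensive_feat.dtype)
    return torch.cat([comprehensive_feat, time_feat], dim=-1)
```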
  • To sum up, in this exemplary embodiment the key frame image is obtained from the target video and the character appearance feature is extracted; the sub-audio is then intercepted according to the time of the key frame image in the target video and the voiceprint feature is extracted from it; finally, the pre-trained fusion model processes the appearance feature and the voiceprint feature to obtain the person recognition result of the target video.
  • On the one hand, the character appearance feature reflects the image aspect of the video and the voiceprint feature reflects the sound aspect, so the "sound" and "picture" information of the video is exploited jointly; fusing these multi-modal features and performing person recognition on them can achieve higher recognition accuracy.
  • On the other hand, the multi-modal features can compensate, to a certain extent, for the absence of either single feature, so this exemplary embodiment can be applied to situations where the face image in the video has low definition or is occluded, and has high robustness.
  • Furthermore, the key frame image and the sub-audio are matched in time and correspond to each other, which reduces cases where the image feature and the sound feature are out of sync and improves the accuracy of video character recognition.
  • the video character recognition device 700 may include a processor 710 and a memory 720.
  • the memory 720 includes the following program modules:
  • the image acquisition module 721 is used to acquire key frame images from the target video
  • the first extraction module 722 is used to extract the appearance features of the person from the key frame image
  • the second extraction module 723 is configured to intercept the sub audio corresponding to the key frame image from the audio of the target video according to the time of the key frame image in the target video, and extract voiceprint features from the sub audio;
  • the feature processing module 724 is configured to use the pre-trained fusion model to process the above-mentioned character appearance features and voiceprint features to obtain the character recognition result of the target video;
  • the processor 710 is configured to execute the foregoing program modules.
  • the image acquisition module 721 may be used to:
  • the image acquisition module 721 may be used to:
  • the above-mentioned human appearance feature may include a human face feature.
  • the first extraction module 722 may be used to:
  • a pre-trained convolutional neural network is used to extract facial features from facial sub-images.
  • the above-mentioned human appearance characteristics may further include body shape characteristics and posture characteristics.
  • the voiceprint feature described above may include Mel frequency cepstrum coefficients.
  • the second extraction module 723 may include:
  • the frequency spectrum acquisition unit is used to acquire the frequency spectrum corresponding to the sub audio
  • the amplitude spectrum conversion unit is used to calculate the corresponding amplitude spectrum according to the frequency spectrum
  • the filter processing unit is used to perform Mel filter processing on the amplitude spectrum to calculate the Mel frequency cepstrum coefficient of the sub audio.
  • the spectrum acquisition unit may include:
  • the preprocessing unit is used to preprocess the sub audio
  • the Fourier transform unit is used to perform Fourier transform on the preprocessed sub-audio to obtain the frequency spectrum corresponding to the sub-audio.
  • the foregoing preprocessing may include any one or more of the following: extracting speech signals, pre-emphasis, framing, and windowing;
  • the second extraction module 723 may further include:
  • the post-processing unit is used to convert the Mel frequency cepstrum coefficients into voiceprint feature vectors and perform post-processing.
  • the aforementioned post-processing may include any one or more of the following: de-averaging, normalization, and dimensionality reduction processing.
  • the voiceprint feature may further include any one or more of the following: generalized Mel cepstrum coefficients, spectral envelope and energy features, fundamental frequency, voiced/light tone classification features, and band aperiodic components.
  • the feature processing module 724 can be used to:
  • the feature processing module 724 can be used to:
  • the comprehensive features are input into the fusion model, and the person recognition result of the target video is output.
  • the feature processing module 724 can be used to:
  • the feature processing module 724 is further used for:
  • the above-mentioned temporal characteristics are determined.
  • the video person recognition device 800 may include:
  • the image acquisition module 810 is used to acquire key frame images from the target video
  • the first extraction module 820 is used to extract the appearance features of the person from the key frame image
  • the second extraction module 830 is configured to intercept the sub audio corresponding to the key frame image from the audio of the target video according to the time of the key frame image in the target video, and extract voiceprint features from the sub audio;
  • the feature processing module 840 is configured to process the above-mentioned character appearance features and voiceprint features by using a pre-trained fusion model to obtain a character recognition result of the target video.
  • the image acquisition module 810 may be used to:
  • the image acquisition module 810 may be used to:
  • the above-mentioned human appearance feature may include a human face feature.
  • the first extraction module 820 may be used for:
  • a pre-trained convolutional neural network is used to extract facial features from facial sub-images.
  • the above-mentioned human appearance characteristics may further include body shape characteristics and posture characteristics.
  • the voiceprint feature described above may include Mel frequency cepstrum coefficients.
  • the second extraction module 830 may include:
  • the frequency spectrum acquisition unit is used to acquire the frequency spectrum corresponding to the sub audio
  • the amplitude spectrum conversion unit is used to calculate the corresponding amplitude spectrum according to the frequency spectrum
  • the filter processing unit is used to perform Mel filter processing on the amplitude spectrum to calculate the Mel frequency cepstrum coefficient of the sub audio.
  • the spectrum acquisition unit may include:
  • the preprocessing unit is used to preprocess the sub audio
  • the Fourier transform unit is used to perform Fourier transform on the preprocessed sub-audio to obtain the frequency spectrum corresponding to the sub-audio.
  • the foregoing preprocessing may include any one or more of the following: extracting speech signals, pre-emphasis, framing, and windowing;
  • the second extraction module 830 may further include:
  • the post-processing unit is used to convert the Mel frequency cepstrum coefficients into voiceprint feature vectors and perform post-processing.
  • the aforementioned post-processing may include any one or more of the following: de-averaging, normalization, and dimensionality reduction processing.
  • the voiceprint feature may further include any one or more of the following: generalized Mel cepstrum coefficients, spectral envelope and energy features, fundamental frequency, voiced/light tone classification features, and band aperiodic components.
  • the feature processing module 840 may be used to:
  • the feature processing module 840 may be used to:
  • the comprehensive features are input into the fusion model, and the person recognition result of the target video is output.
  • the feature processing module 840 may be used to:
  • the feature processing module 840 is further used for:
  • the above-mentioned temporal characteristics are determined.
  • Exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which is stored a program product capable of implementing the above-mentioned method of this specification.
  • various aspects of the present disclosure can also be implemented in the form of a program product, which includes program code.
  • When the program product runs on an electronic device, the program code is used to make the electronic device execute the steps according to various exemplary embodiments of the present disclosure described in the above "Exemplary Method" section of this specification.
  • the program product can be implemented as a portable compact disk read-only memory (CD-ROM) and include program code, and can be run on an electronic device, such as a personal computer.
  • the program product of the present disclosure is not limited thereto.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, device, or device.
  • the program product can adopt any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
  • the program code for performing the operations of the present disclosure can be written in any combination of one or more programming languages.
  • the programming languages include object-oriented programming languages, such as Java, C++, etc., as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, via the Internet using an Internet service provider).
  • Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method.
  • the electronic device 900 according to this exemplary embodiment of the present disclosure will be described below with reference to FIG. 9.
  • the electronic device 900 shown in FIG. 9 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.
  • the electronic device 900 may be in the form of a general-purpose computing device.
  • the components of the electronic device 900 may include but are not limited to: at least one processing unit 910, at least one storage unit 920, a bus 930 connecting different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.
  • the storage unit 920 stores program codes, and the program codes can be executed by the processing unit 910 so that the processing unit 910 executes the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "Exemplary Method" section of this specification.
  • the processing unit 910 may execute any one or more method steps in FIGS. 1 to 6.
  • the storage unit 920 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 921 and/or a cache storage unit 922, and may further include a read-only storage unit (ROM) 923.
  • the storage unit 920 may also include a program/utility tool 924 having a set of (at least one) program module 925.
  • program module 925 includes but is not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples or some combination thereof may include an implementation of a network environment.
  • the bus 930 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 900 may also communicate with one or more external devices 1000 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 900 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 950.
  • the electronic device 900 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 960.
  • the network adapter 960 communicates with other modules of the electronic device 900 through the bus 930. It should be understood that although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • the exemplary implementations described here can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a portable hard disk, etc.) or on a network, and which includes several instructions to make a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the exemplary embodiments of the present disclosure.
  • Although modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

A video figure recognition method, a video figure recognition apparatus, a storage medium, and an electronic device. The method comprises: obtaining a key frame image from a target video (S110); extracting a figure appearance feature from the key frame image (S120); intercepting a sub-audio corresponding to the key frame image from the audio of the target video according to the time of the key frame image in the target video, and extracting a voiceprint feature from the sub-audio (S130); and processing the figure appearance feature and the voiceprint feature by using a pre-trained fusion model to obtain a figure recognition result of the target video (S140). The multi-modal features in the video can be fused, achieving relatively high figure recognition accuracy; the present invention is applicable to situations in which the definition of a face image in the video is not high or the face image is occluded, and has relatively high robustness.

Description

Video character recognition method, device, storage medium and electronic equipment
This application claims priority to the Chinese patent application filed on October 28, 2019, with application number 201911029707.4 and titled "Video Character Recognition Method, Device, Storage Medium and Electronic Equipment", the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and in particular to a video character recognition method, a video character recognition device, a computer-readable storage medium, and electronic equipment.
Background
Video person recognition refers to identifying the identity of a person in a video in order to classify the video, add person tags, and so on. It has important applications in scenarios such as security, video classification, video content review, and smart photo albums.
In related technologies, video person recognition is mainly implemented based on face recognition in video images: an image containing a face is detected from the video, and the face in the image is then further and precisely recognized to determine the identity of the person. This method places high requirements on the sharpness of the face image; when the face image is not clear enough or is occluded, the accuracy of the recognition result is low.
Summary of the Invention
The present disclosure provides a video character recognition method, a video character recognition device, a computer-readable storage medium, and electronic equipment, thereby improving the accuracy of video character recognition at least to a certain extent.
According to a first aspect of the present disclosure, there is provided a method for recognizing a person in a video, including: acquiring a key frame image from a target video; extracting a character appearance feature from the key frame image; intercepting, according to the time of the key frame image in the target video, the sub-audio corresponding to the key frame image from the audio of the target video, and extracting a voiceprint feature from the sub-audio; and processing the character appearance feature and the voiceprint feature using a pre-trained fusion model to obtain a person recognition result of the target video.
According to a second aspect of the present disclosure, there is provided a video character recognition device, including a processor, wherein the processor is configured to execute the following program modules stored in a memory: an image acquisition module for acquiring a key frame image from a target video; a first extraction module for extracting a character appearance feature from the key frame image; a second extraction module for intercepting, according to the time of the key frame image in the target video, the sub-audio corresponding to the key frame image from the audio of the target video, and extracting a voiceprint feature from the sub-audio; and a feature processing module for processing the character appearance feature and the voiceprint feature using a pre-trained fusion model to obtain a person recognition result of the target video.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the above-mentioned video character recognition method and possible implementations thereof are realized.
According to a fourth aspect of the present disclosure, there is provided an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the above-mentioned video character recognition method and possible implementations thereof by executing the executable instructions.
The technical solution of the present disclosure has the following beneficial effects:
According to the above video character recognition method, video character recognition device, computer-readable storage medium and electronic equipment, the key frame image is obtained from the target video and the character appearance feature is extracted; the sub-audio is then intercepted according to the time of the key frame image in the target video and the voiceprint feature is extracted from it; finally, the pre-trained fusion model processes the appearance feature and the voiceprint feature to obtain the person recognition result of the target video. On the one hand, the character appearance feature reflects the image aspect of the video and the voiceprint feature reflects the sound aspect, so the "sound" and "picture" information of the video is exploited jointly; fusing these multi-modal features and performing person recognition on them can achieve higher recognition accuracy. On the other hand, the multi-modal features can compensate, to a certain extent, for the absence of either single feature, so the technical solution of the present disclosure can be applied to situations where the face image in the video has low definition or is occluded, and has high robustness. Furthermore, the key frame image and the sub-audio are matched in time and correspond to each other, which reduces cases where the image feature and the sound feature are out of sync and improves the accuracy of video character recognition.
Description of the Drawings
Fig. 1 shows a flowchart of a video character recognition method in this exemplary embodiment;
Fig. 2 shows a flowchart of a method for extracting facial features in this exemplary embodiment;
Fig. 3 shows a flowchart of a method for extracting voiceprint features in this exemplary embodiment;
Fig. 4 shows a flowchart of a method for obtaining a frequency spectrum in this exemplary embodiment;
Fig. 5 shows a flowchart of a method for processing character appearance features and voiceprint features in this exemplary embodiment;
Fig. 6 shows a flowchart of another method for processing character appearance features and voiceprint features in this exemplary embodiment;
Fig. 7 shows a structural block diagram of a video character recognition device in this exemplary embodiment;
Fig. 8 shows a structural block diagram of another video character recognition device in this exemplary embodiment;
Fig. 9 shows an electronic device for implementing the above methods in this exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be more comprehensive and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures or characteristics can be combined in one or more embodiments in any suitable way. In the following description, many specific details are provided to give a sufficient understanding of the embodiments of the present disclosure. However, those skilled in the art will realize that the technical solutions of the present disclosure can be practiced while omitting one or more of the specific details, or that other methods, components, devices, steps, etc. can be used. In other cases, well-known technical solutions are not shown or described in detail to avoid obscuring aspects of the present disclosure.
In addition, the drawings are only schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, so their repeated description will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different network and/or processor devices and/or microcontroller devices.
In the related art, face recognition based on video images places high requirements on the clarity of the face image. When the face image is not clear enough or is occluded, the accuracy of the recognition result is low.
In addition, the inventors found that a video actually contains multi-modal information, including images, speech and other aspects, while the related art identifies people only through face recognition in the image and fails to make full use of this multi-modal information, which is one of the main reasons for its low accuracy in identifying people.
In view of one or more of the above problems, the exemplary embodiments of the present disclosure first provide a video character recognition method. The method can be applied to the server of a video service platform, for example to perform character recognition on the videos on the platform from the server side in order to add people tags that make searching convenient for users; it can also be applied to terminal devices such as personal computers and smartphones, for example to recognize people in videos shot or downloaded by the user in order to classify them by person automatically.
Fig. 1 shows a process of this exemplary embodiment, which may include the following steps S110 to S140:
Step S110: obtain a key frame image from the target video;
Step S120: extract a character appearance feature from the key frame image;
Step S130: according to the time of the key frame image in the target video, intercept the sub-audio corresponding to the key frame image from the audio of the target video, and extract a voiceprint feature from the sub-audio;
Step S140: use the pre-trained fusion model to process the above character appearance feature and voiceprint feature to obtain a person recognition result of the target video.
Each of the above steps is described in detail below.
步骤S110中,从目标视频中获取关键帧图像。In step S110, a key frame image is obtained from the target video.
其中,关键帧图像是指目标视频中包含人脸外观的图像,可以提取一个关键帧,也可以提取多个关键帧,本公开对其数量不做限定。下面就如何确定关键帧提供几个具体实施方式:Among them, the key frame image refers to an image containing the appearance of a human face in the target video. One key frame or multiple key frames may be extracted, and the number of them is not limited in the present disclosure. Several specific implementation methods are provided below on how to determine the key frame:
(1)采用固定间隔的方式,在目标视频中,每间隔固定的时长或固定的帧数,选取一帧作为关键帧,例如可以每个3帧提取一个关键帧图像。(1) Using a fixed interval method, in the target video, each interval is fixed for a fixed length of time or a fixed number of frames, and one frame is selected as a key frame. For example, a key frame image can be extracted for every 3 frames.
(2) Frames that contain a person and frames that do not are detected in the target video. Frames that do not contain a person are marked as background frames, and the background frames are used as split points to divide the target video into multiple sub-videos, each of which consists of consecutive frames containing a person. The person in each sub-video can be regarded as the same person, so at least one frame is extracted from each sub-video as a key frame.
(3) Considering that decoding of video frames is usually required when extracting a complete image from a video, intra-coded frames may also be extracted from the target video and decoded to obtain the key frame images.
An intra-coded frame (I frame) is a frame that is independently coded from a single image; it is a complete preservation of that image, and only the data of the frame itself is needed for decoding. Corresponding to the I frame are the predictive frame (P frame) and the bi-predictive frame (B frame). A P frame records its difference from a previous frame, so decoding a P frame requires reference to previous frame data; a B frame records its differences from both previous and subsequent frames, so both previous and subsequent frame data are required for complete decoding.
It can be seen from the above that if a P frame or a B frame is determined to be a key frame, the I frame needs to be decoded first when obtaining the key frame image, and then the target P frame or B frame is decoded according to the differences between frames, which is less efficient. Therefore, the I frames can be used directly as key frames: the I frames are extracted from the target video and decoded to obtain the key frame images. In this way, only the key frame images need to be decoded independently, without decoding other frames; the number of frames to be decoded is minimal, and the key frame images are extracted at the fastest speed.
To further improve efficiency, if multiple I frames are selected as key frames, multiple threads can be invoked during decoding so that each thread decodes one I frame. Video tools (such as video playback software and editing software) usually include a decoder for decoding video frames. In this exemplary embodiment, the decoder can be embedded into the video person recognition program and the threading code modified. When the video person recognition process starts and N I frames are obtained as key frames, N threads are started accordingly, and the decoding task of each I frame is assigned to a corresponding thread; each thread executes its decoding task independently, so that the extraction of key frame images is completed quickly in a concurrent manner.
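A minimal sketch of this concurrent extraction is given below (Python, assuming OpenCV). OpenCV itself does not expose picture types, so the I-frame indices are assumed to have been determined beforehand; this illustrates only the "one thread per key frame" idea and is not the patented implementation.

```python
# Hypothetical sketch: decode a list of (assumed) I-frame indices concurrently.
from concurrent.futures import ThreadPoolExecutor

import cv2  # pip install opencv-python


def decode_frame(video_path: str, frame_index: int):
    """Open the video, seek to frame_index, and decode that single frame."""
    cap = cv2.VideoCapture(video_path)           # each thread uses its own capture handle
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
    ok, frame = cap.read()                       # BGR image array, or None on failure
    cap.release()
    return frame_index, frame if ok else None


def extract_key_frames(video_path: str, key_indices):
    # One thread per key frame, mirroring the "N I frames -> N threads" idea.
    with ThreadPoolExecutor(max_workers=max(1, len(key_indices))) as pool:
        results = list(pool.map(lambda i: decode_frame(video_path, i), key_indices))
    return {idx: img for idx, img in results if img is not None}


# Usage: the indices 0, 250, 500 are placeholders for I-frame positions.
# key_frames = extract_key_frames("target_video.mp4", [0, 250, 500])
```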
It should be noted that, to facilitate subsequent processing, a fixed number of key frame images may be obtained in step S110, for example 64 or 128 key frame images. When collecting the key frames, the relevant parameters can be determined according to this number, for example: calculating the interval duration or number of frames in method (1) above; determining the number of key frames extracted from each sub-video in method (2) above; determining the number of I frames to be extracted in method (3) above, where, if the number of I frames in the video to be classified is insufficient, P frames or B frames can be extracted to make up the difference.
In addition, this exemplary embodiment may also use the above three methods in combination, for example, combining methods (2) and (3) by selecting I frames as key frames within each sub-video, and so on.
Continuing to refer to Fig. 1, in step S120, a person appearance feature is extracted from the key frame image.
In this exemplary embodiment, a machine learning model may be used to extract the person appearance feature from the key frame image. This is not classification or recognition of the key frame image, so there is no limitation on what type of data the machine learning model ultimately outputs. The advantage is that the type of label is not restricted when training the convolutional neural network: whichever label is readily available or easy to obtain can be used for training. For example, an open-source portrait dataset containing a large number of face images and their classification labels can be used to train a convolutional neural network for image classification, which is then used to extract face features in step S120. The key frame image can be input into the convolutional neural network and, after a series of convolution and pooling operations, features are extracted from a fully connected layer. The first fully connected layer, whose features are denser, may be selected, or a subsequent fully connected layer, whose data volume is usually smaller; the present disclosure does not limit this.
Person appearance features may include face features, body-shape features, posture features, and the like. Face features include information such as the position, proportion, shape, and expression of each part of the face; body-shape features include information such as the position, proportion, and shape of the body and limbs; posture features include information such as the person's movements and poses. Among these, face features are relatively more important for person recognition. In an optional implementation, when the person appearance feature includes a face feature, referring to Fig. 2, step S120 may be specifically implemented through the following steps S210 and S220:
Step S210: detect a face region in the key frame image to intercept a face sub-image from the key frame image;
Step S220: extract a face feature from the face sub-image by using a pre-trained convolutional neural network.
The face region can be recognized by algorithms such as contour detection. For example, the key frame image can be input into the face detection network RetinaFace to detect the region where the face is located in the image and the coordinates of the facial key points. The face region is cut out from the key frame image to obtain a face sub-image, which filters out scenes, objects, and other image content irrelevant to person recognition. The face sub-image is then input into the pre-trained convolutional neural network, and the face feature is obtained from a fully connected layer of the network. In this exemplary embodiment, the dimension of the face feature can be set according to actual needs; for example, if the first fully connected layer of the convolutional neural network is set to 512 dimensions, then after the face sub-image is input, a 512-dimensional face feature can be extracted from that fully connected layer.
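A minimal sketch of this step is shown below, assuming a torchvision ResNet-18 backbone as a stand-in for the pre-trained network (the text does not prescribe a specific architecture) and assuming the face bounding box has already been produced by a detector such as RetinaFace:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Backbone whose final fully connected layer outputs a 512-dimensional face feature.
# ResNet-18 is only an assumed stand-in for the pre-trained CNN mentioned in the text.
backbone = models.resnet18(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 512)
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((112, 112)),   # assumed input size for the face sub-image
    transforms.ToTensor(),
])


def extract_face_feature(key_frame, face_box):
    """Crop the detected face region and return a 512-d face feature vector."""
    x1, y1, x2, y2 = face_box                  # box assumed to come from RetinaFace
    face_sub_image = key_frame[y1:y2, x1:x2]   # crop the face region (H, W, 3 array)
    with torch.no_grad():
        inp = preprocess(face_sub_image).unsqueeze(0)  # (1, 3, 112, 112)
        feature = backbone(inp)                        # (1, 512)
    return feature.squeeze(0)
```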
Continuing to refer to Fig. 1, in step S130, according to the time of the key frame image in the target video, the sub-audio corresponding to the key frame image is intercepted from the audio of the target video, and a voiceprint feature is extracted from the sub-audio.
The intercepted sub-audio is the audio segment corresponding to the key frame image. For example, if the time of the key frame image in the target video is 9.670 seconds, a window of sub-audio can be intercepted from the audio of the target video centered on that time. In other words, the sub-audio and the key frame image should be "audio-visually synchronized". The present disclosure does not specifically limit the duration of the sub-audio. A fixed duration may be used, for example 3 or 5 seconds based on the average time a person takes to say a sentence in a typical video; alternatively, abrupt change points in the audio on both sides of the key frame time, for example points where the speech content or frequency changes suddenly, may be detected, and the audio segment between two such change points may be intercepted.
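With a fixed window, the interception amounts to slicing the waveform around the key-frame timestamp. A sketch assuming librosa for loading a separately extracted audio track follows; the 16 kHz sampling rate and 3-second window are assumptions, not values given by the text:

```python
import librosa


def cut_sub_audio(audio_path: str, key_frame_time: float, window_s: float = 3.0):
    """Return the waveform of a window centered on the key frame time."""
    y, sr = librosa.load(audio_path, sr=16000)        # mono waveform at an assumed 16 kHz
    half = window_s / 2.0
    start = max(0, int((key_frame_time - half) * sr))
    end = min(len(y), int((key_frame_time + half) * sr))
    return y[start:end], sr


# Usage: the 9.670 s timestamp matches the example in the text.
# sub_audio, sr = cut_sub_audio("target_video_audio.wav", 9.670)
```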
Since every person's speaking voice is unique, voiceprint features can strongly reflect the individual characteristics of each person's speech. In an optional implementation, the voiceprint feature may include Mel-Frequency Cepstral Coefficients (MFCC). Referring to Fig. 3, extracting the voiceprint feature from the sub-audio may specifically include the following steps S310 to S330:
Step S310: obtain the spectrum corresponding to the sub-audio.
In an audio signal, the spectrum can be regarded as the collection of the frequency, phase, and amplitude of each sinusoidal component; the corresponding spectrum can be obtained by sampling the signal in the sub-audio.
Step S320: calculate the corresponding amplitude spectrum from the spectrum.
The amplitude spectrum refers to the spectral lines formed by the amplitudes of the sinusoidal components at different frequencies in the spectrum. The corresponding amplitude spectrum can be obtained by decomposing and computing the spectrum of the sub-audio.
Step S330: perform Mel filtering on the amplitude spectrum to calculate the Mel-frequency cepstral coefficients of the sub-audio.
Since the human ear has different, nonlinearly related sensitivities to different frequencies, the amplitude spectrum can be divided among multiple Mel filter banks according to the sensitivity of the human ear, with the center frequencies of the filters distributed linearly at equal intervals. By performing Mel filtering on the amplitude spectrum of the sub-audio with the Mel filter bank, the Mel-frequency cepstral coefficients can be calculated.
In an optional implementation, referring to Fig. 4, step S310 may be implemented through the following steps S410 and S420:
Step S410: preprocess the sub-audio.
The preprocessing may include any one or more of the following: speech signal extraction, pre-emphasis, framing, and windowing. Speech signal extraction means filtering out non-human-voice signals such as background sound and noise from the sub-audio, retaining only the human speech signal; pre-emphasis is a signal processing method that compensates the high-frequency components of the sub-audio; framing splits the sub-audio frame by frame to facilitate subsequent feature extraction; windowing limits the signal through a preset window size, substituting each frame into a window function and setting values outside the window to 0 so as to eliminate the signal discontinuities that may occur at both ends of each frame. In this exemplary embodiment, a rectangular window, a Hamming window, or the like may be used as the window function.
Step S420: perform a Fourier transform on the preprocessed sub-audio to obtain the spectrum corresponding to the sub-audio.
Since it is usually difficult to see the characteristics of an audio signal from its variation in the time domain, it is converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different speech. Performing a Fourier transform on the sub-audio extracts its frequency-domain features, and plotting the sub-audio signal as a frequency-density curve yields the spectrum corresponding to the sub-audio.
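A minimal sketch of the chain described in steps S410, S420 and S310 to S330 (pre-emphasis, framing and windowing, FFT, Mel filtering, MFCC) using librosa is given below; the frame length, hop length, number of Mel bands, and number of coefficients are assumptions rather than values prescribed by the text:

```python
import librosa
import numpy as np


def compute_mfcc(sub_audio: np.ndarray, sr: int, n_mfcc: int = 20):
    # Pre-emphasis: boost high-frequency components (coefficient 0.97 is an assumption).
    emphasized = librosa.effects.preemphasis(sub_audio, coef=0.97)

    # Framing + Hamming windowing + FFT -> complex spectrum of each frame (step S420).
    stft = librosa.stft(emphasized, n_fft=512, hop_length=160, window="hamming")
    amplitude_spectrum = np.abs(stft)                  # step S320: amplitude spectrum

    # Mel filter bank applied to the power spectrum, then log + DCT -> MFCC (step S330).
    mel_spec = librosa.feature.melspectrogram(S=amplitude_spectrum ** 2, sr=sr, n_mels=40)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel_spec), n_mfcc=n_mfcc)
    return mfcc                                        # shape (n_mfcc, n_frames)
```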
In an optional implementation, after step S330, the Mel-frequency cepstral coefficients can also be converted into a voiceprint feature vector and post-processed. Mel-frequency cepstral coefficients are high-dimensional dense features; representing them in vector form, i.e., as a voiceprint feature vector, allows them to be processed by a machine learning model. In this exemplary embodiment, post-processing the voiceprint feature vector can optimize the subsequent flow. The post-processing may adopt any one or more of the following: mean removal, normalization, and dimensionality reduction. Mean removal subtracts the mean of each dimension from the data of that dimension so that the data are centered at 0, preventing adverse effects such as underfitting. Normalization unifies the data to a standard numerical scale, which benefits the subsequent feature fusion and computation. Dimensionality reduction is implemented by algorithms such as PCA (Principal Components Analysis), which discards the dimensions that carry little information and retains the main feature information, generally replacing the original large number of correlated features with a few representative, mutually uncorrelated features, thereby speeding up the subsequent processing.
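A sketch of this post-processing, assuming scikit-learn's StandardScaler and PCA over a batch of voiceprint vectors (the target dimensionality of 128 is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def postprocess_voiceprints(voiceprint_vectors: np.ndarray, out_dim: int = 128):
    """Mean removal + normalization, then PCA dimensionality reduction.

    voiceprint_vectors: array of shape (n_samples, n_features), e.g. flattened MFCCs.
    """
    scaler = StandardScaler()            # removes the mean and scales to unit variance
    normalized = scaler.fit_transform(voiceprint_vectors)

    pca = PCA(n_components=out_dim)      # keep only the principal components
    reduced = pca.fit_transform(normalized)
    return reduced
```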
Further, the voiceprint feature extracted in step S130 may also include any one or more of the following: generalized Mel cepstral coefficients, spectral envelope and energy features, fundamental frequency, voiced/unvoiced classification features, and band aperiodic components. The generalized Mel cepstral coefficients follow essentially the same principle as the Mel-frequency cepstral coefficients and are high-dimensional features (for example, 180 dimensions) with some differences in the specific coefficients; they can serve as a replacement for or a supplement to the Mel-frequency cepstral coefficients. The spectral envelope and energy features are related to the speech content; the fundamental frequency, voiced/unvoiced classification features, and band aperiodic components are related to basic pronunciation information and are usually relatively sparse features that can supplement the Mel-frequency cepstral coefficients. The richer the dimensions of the voiceprint feature, the more accurately it characterizes the video person, and the more it benefits accurate video person recognition.
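Where these supplementary features are wanted, one possible route is the WORLD vocoder via the pyworld package; this is an assumed tool choice, not one named by the text, and the sub-audio is assumed to be a mono waveform:

```python
import numpy as np
import pyworld as pw  # pip install pyworld


def extract_prosodic_features(sub_audio: np.ndarray, sr: int):
    x = sub_audio.astype(np.float64)           # pyworld expects float64 input
    f0, t = pw.dio(x, sr)                      # coarse fundamental frequency track
    f0 = pw.stonemask(x, f0, t, sr)            # refined fundamental frequency
    sp = pw.cheaptrick(x, f0, t, sr)           # spectral envelope
    ap = pw.d4c(x, f0, t, sr)                  # band aperiodic components
    voiced_flag = (f0 > 0).astype(np.float32)  # simple voiced/unvoiced indicator
    return f0, sp, ap, voiced_flag
```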
Continuing to refer to Fig. 1, in step S140, the person appearance feature and the voiceprint feature are processed by the pre-trained fusion model to obtain the person recognition result of the target video.
The person appearance feature is extracted from the image aspect, and the voiceprint feature is extracted from the sound aspect; their fusion represents the multi-modal feature of the target video. In fact, by processing the image feature and the voiceprint feature in steps S120 and S130, feature data in vector or matrix form are obtained, so the fusion of the multi-modal features can easily be achieved and then processed by the fusion model to obtain the person recognition result of the target video.
In one implementation, the fusion model can be configured with two input channels, one for inputting the person appearance feature and one for inputting the voiceprint feature. Referring to Fig. 5, step S140 may include the following steps S510 to S530:
Step S510: input the person appearance feature and the voiceprint feature into the two input channels of the fusion model, respectively.
Step S520: process the person appearance feature and the voiceprint feature separately with the fusion model to obtain an intermediate feature corresponding to the person appearance feature and an intermediate feature corresponding to the voiceprint feature; these two intermediate features respectively represent abstract information about the video person in terms of appearance and voice.
Step S530: perform fusion computation on the intermediate feature corresponding to the person appearance feature and the intermediate feature corresponding to the voiceprint feature, and output the person recognition result of the target video. In this way, the appearance and sound information can be combined and associated to realize the fusion of the two modalities, and, after further recognition processing, a comprehensive person recognition result is finally output.
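A minimal PyTorch sketch of such a two-channel fusion model follows; the layer sizes and the number of identity classes are assumptions for illustration rather than values given in the text:

```python
import torch
import torch.nn as nn


class TwoChannelFusionModel(nn.Module):
    """Each input channel maps its modality to an intermediate feature; the two are then fused."""

    def __init__(self, face_dim=512, voice_dim=512, hidden_dim=256, num_persons=100):
        super().__init__()
        self.face_branch = nn.Sequential(nn.Linear(face_dim, hidden_dim), nn.ReLU())
        self.voice_branch = nn.Sequential(nn.Linear(voice_dim, hidden_dim), nn.ReLU())
        # Fusion computation over the concatenated intermediate features.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_persons),
        )

    def forward(self, appearance_feature, voiceprint_feature):
        face_mid = self.face_branch(appearance_feature)    # intermediate feature (appearance)
        voice_mid = self.voice_branch(voiceprint_feature)  # intermediate feature (voiceprint)
        fused = torch.cat([face_mid, voice_mid], dim=-1)   # fusion of the two modalities
        return self.classifier(fused)                      # person recognition logits
```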
In another implementation, referring to Fig. 6, step S140 may include the following steps S610 and S620:
Step S610: merge the person appearance feature and the voiceprint feature to obtain a combined feature;
Step S620: input the combined feature into the fusion model and output the person recognition result of the target video.
The person appearance feature and the voiceprint feature can be merged by concatenation. For example, in step S120 a 512-dimensional face feature is obtained through the convolutional neural network, and in step S130 512-dimensional Mel-frequency cepstral coefficients (i.e., the voiceprint feature) are extracted; the two are concatenated into a 1024-dimensional combined feature and input into the fusion model for processing.
The fusion model can adopt an ordinary neural network model, or its structure can be optimized according to actual needs. For example, MobileNet (an open-source neural network for mobile terminals) may be adopted, with data augmentation mechanisms set up in it, including a Dropout layer, random noise, and the like; the input combined feature is standardized, the number of channels of the fully connected layer is set to 1024, a PReLU (Parametric Rectified Linear Unit) layer is used for activation, and BCE Loss (Binary Cross Entropy Loss) is used as the loss function.
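A sketch of the concatenation-based variant as a small fully connected head with Dropout, PReLU, and BCE loss is given below; it stands in for the MobileNet-based configuration described above, and all layer sizes, the BatchNorm-based standardization, and the multi-label training setup are assumptions:

```python
import torch
import torch.nn as nn


class ConcatFusionHead(nn.Module):
    """Scores person identities from the 1024-d combined feature (512-d face + 512-d voiceprint)."""

    def __init__(self, in_dim=1024, num_persons=100, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(in_dim),        # assumed stand-in for standardizing the combined feature
            nn.Dropout(dropout),           # data-augmentation / regularization mechanism
            nn.Linear(in_dim, 1024),       # fully connected layer with 1024 channels
            nn.PReLU(),                    # parametric ReLU activation
            nn.Linear(1024, num_persons),
        )

    def forward(self, combined_feature):
        return self.net(combined_feature)  # raw logits per person tag


# Assumed training setup with binary cross entropy over person tags:
# combined = torch.cat([face_feat, voice_feat], dim=-1)        # (batch, 1024)
# logits = ConcatFusionHead()(combined)
# loss = nn.BCEWithLogitsLoss()(logits, target_labels.float())
```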
Further, a time feature can also be determined according to the time of the key frame image in the target video and the time interval of the sub-audio in the target video. When merging the features, the person appearance feature, the voiceprint feature, and the time feature are merged to obtain the combined feature.
For the case where the key frame image is a single frame, the time feature may contain 2 or 3 dimensions: one dimension records the time of the key frame, and the remaining dimensions record the time interval of the sub-audio, for example the start time and the end time, or the center time and the sub-audio duration; for the case of multiple key frame images, the dimensionality of the time feature is set according to the number of frames. The time feature amounts to a supplement to the multi-modal features: adding time information on top of the face feature and the voiceprint feature helps improve the completeness and richness of the combined feature, thereby improving the accuracy of person recognition.
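A sketch of assembling such a combined feature for a single key frame, taking the three-dimensional layout (key-frame time, sub-audio start, sub-audio end) as one of the options the text describes:

```python
import numpy as np


def build_combined_feature(face_feature: np.ndarray,
                           voiceprint_feature: np.ndarray,
                           key_frame_time: float,
                           sub_audio_start: float,
                           sub_audio_end: float) -> np.ndarray:
    """Concatenate appearance, voiceprint, and time features into one vector."""
    time_feature = np.array([key_frame_time, sub_audio_start, sub_audio_end],
                            dtype=np.float32)
    return np.concatenate([face_feature, voiceprint_feature, time_feature])

# e.g. 512-d face feature + 512-d MFCC vector + 3-d time feature -> 1027-d combined feature
```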
In summary, in this exemplary embodiment, a key frame image is obtained from the target video and a person appearance feature is extracted; a sub-audio is then intercepted according to the time of the key frame image in the target video, and a voiceprint feature is extracted from the sub-audio; finally, a pre-trained fusion model is used to process the person appearance feature and the voiceprint feature to obtain the person recognition result of the target video. On the one hand, the person appearance feature reflects the image aspect and the voiceprint feature reflects the sound aspect, so the fact that a video carries both "sound" and "picture" information is exploited, multi-modal features are fused, and person recognition is performed based on feature processing, achieving high recognition accuracy. On the other hand, the multi-modal features can compensate, to a certain extent, for the absence of either type of feature, so this exemplary embodiment is applicable to situations where the face image in the video is of low clarity or is occluded, and it has high robustness. Furthermore, the key frame image and the sub-audio are matched in time and correspond to each other, which reduces cases where the image feature and the sound feature are out of sync and improves the accuracy of video person recognition.
Exemplary embodiments of the present disclosure also provide a video person recognition apparatus. As shown in Fig. 7, the video person recognition apparatus 700 may include a processor 710 and a memory 720. The memory 720 includes the following program modules:
an image acquisition module 721, configured to obtain a key frame image from a target video;
a first extraction module 722, configured to extract a person appearance feature from the key frame image;
a second extraction module 723, configured to intercept, according to the time of the key frame image in the target video, a sub-audio corresponding to the key frame image from the audio of the target video, and extract a voiceprint feature from the sub-audio;
a feature processing module 724, configured to process the person appearance feature and the voiceprint feature by using a pre-trained fusion model to obtain a person recognition result of the target video.
The processor 710 is configured to execute the above program modules.
In an optional implementation, the image acquisition module 721 may be configured to:
extract intra-coded frames from the target video and decode them to obtain the key frame images.
In an optional implementation, the image acquisition module 721 may be configured to:
invoke multiple threads, each of which decodes one intra-coded frame.
In an optional implementation, the person appearance feature may include a face feature.
In an optional implementation, the first extraction module 722 may be configured to:
detect a face region in the key frame image to intercept a face sub-image from the key frame image;
extract a face feature from the face sub-image by using a pre-trained convolutional neural network.
In an optional implementation, the person appearance feature may further include a body-shape feature and a posture feature.
In an optional implementation, the voiceprint feature may include Mel-frequency cepstral coefficients.
In an optional implementation, the second extraction module 723 may include:
a spectrum acquisition unit, configured to obtain the spectrum corresponding to the sub-audio;
an amplitude spectrum conversion unit, configured to calculate the corresponding amplitude spectrum from the spectrum;
a filter processing unit, configured to perform Mel filtering on the amplitude spectrum to calculate the Mel-frequency cepstral coefficients of the sub-audio.
In an optional implementation, the spectrum acquisition unit may include:
a preprocessing unit, configured to preprocess the sub-audio;
a Fourier transform unit, configured to perform a Fourier transform on the preprocessed sub-audio to obtain the spectrum corresponding to the sub-audio.
In an optional implementation, the preprocessing may include any one or more of the following: speech signal extraction, pre-emphasis, framing, and windowing.
In an optional implementation, the second extraction module 723 may further include:
a post-processing unit, configured to convert the Mel-frequency cepstral coefficients into a voiceprint feature vector and perform post-processing.
In an optional implementation, the post-processing may include any one or more of the following: mean removal, normalization, and dimensionality reduction.
In an optional implementation, the voiceprint feature may further include any one or more of the following: generalized Mel cepstral coefficients, spectral envelope and energy features, fundamental frequency, voiced/unvoiced classification features, and band aperiodic components.
In an optional implementation, the feature processing module 724 may be configured to:
input the person appearance feature and the voiceprint feature into the two input channels of the fusion model, respectively;
process the person appearance feature and the voiceprint feature separately with the fusion model to obtain an intermediate feature corresponding to the person appearance feature and an intermediate feature corresponding to the voiceprint feature;
perform fusion computation on the intermediate feature corresponding to the person appearance feature and the intermediate feature corresponding to the voiceprint feature, and output the person recognition result of the target video.
In an optional implementation, the feature processing module 724 may be configured to:
merge the person appearance feature and the voiceprint feature to obtain a combined feature;
input the combined feature into the fusion model and output the person recognition result of the target video.
In an optional implementation, the feature processing module 724 may be configured to:
merge the person appearance feature, the voiceprint feature, and a time feature to obtain the combined feature.
In an optional implementation, the feature processing module 724 is further configured to:
determine the time feature according to the time of the key frame image in the target video and the time interval of the sub-audio in the target video.
Exemplary embodiments of the present disclosure also provide another video person recognition apparatus. As shown in Fig. 8, the video person recognition apparatus 800 may include:
an image acquisition module 810, configured to obtain a key frame image from a target video;
a first extraction module 820, configured to extract a person appearance feature from the key frame image;
a second extraction module 830, configured to intercept, according to the time of the key frame image in the target video, a sub-audio corresponding to the key frame image from the audio of the target video, and extract a voiceprint feature from the sub-audio;
a feature processing module 840, configured to process the person appearance feature and the voiceprint feature by using a pre-trained fusion model to obtain a person recognition result of the target video.
In an optional implementation, the image acquisition module 810 may be configured to:
extract intra-coded frames from the target video and decode them to obtain the key frame images.
In an optional implementation, the image acquisition module 810 may be configured to:
invoke multiple threads, each of which decodes one intra-coded frame.
In an optional implementation, the person appearance feature may include a face feature.
In an optional implementation, the first extraction module 820 may be configured to:
detect a face region in the key frame image to intercept a face sub-image from the key frame image;
extract a face feature from the face sub-image by using a pre-trained convolutional neural network.
In an optional implementation, the person appearance feature may further include a body-shape feature and a posture feature.
In an optional implementation, the voiceprint feature may include Mel-frequency cepstral coefficients.
In an optional implementation, the second extraction module 830 may include:
a spectrum acquisition unit, configured to obtain the spectrum corresponding to the sub-audio;
an amplitude spectrum conversion unit, configured to calculate the corresponding amplitude spectrum from the spectrum;
a filter processing unit, configured to perform Mel filtering on the amplitude spectrum to calculate the Mel-frequency cepstral coefficients of the sub-audio.
In an optional implementation, the spectrum acquisition unit may include:
a preprocessing unit, configured to preprocess the sub-audio;
a Fourier transform unit, configured to perform a Fourier transform on the preprocessed sub-audio to obtain the spectrum corresponding to the sub-audio.
In an optional implementation, the preprocessing may include any one or more of the following: speech signal extraction, pre-emphasis, framing, and windowing.
In an optional implementation, the second extraction module 830 may further include:
a post-processing unit, configured to convert the Mel-frequency cepstral coefficients into a voiceprint feature vector and perform post-processing.
In an optional implementation, the post-processing may include any one or more of the following: mean removal, normalization, and dimensionality reduction.
In an optional implementation, the voiceprint feature may further include any one or more of the following: generalized Mel cepstral coefficients, spectral envelope and energy features, fundamental frequency, voiced/unvoiced classification features, and band aperiodic components.
In an optional implementation, the feature processing module 840 may be configured to:
input the person appearance feature and the voiceprint feature into the two input channels of the fusion model, respectively;
process the person appearance feature and the voiceprint feature separately with the fusion model to obtain an intermediate feature corresponding to the person appearance feature and an intermediate feature corresponding to the voiceprint feature;
perform fusion computation on the intermediate feature corresponding to the person appearance feature and the intermediate feature corresponding to the voiceprint feature, and output the person recognition result of the target video.
In an optional implementation, the feature processing module 840 may be configured to:
merge the person appearance feature and the voiceprint feature to obtain a combined feature;
input the combined feature into the fusion model and output the person recognition result of the target video.
In an optional implementation, the feature processing module 840 may be configured to:
merge the person appearance feature, the voiceprint feature, and a time feature to obtain the combined feature.
In an optional implementation, the feature processing module 840 is further configured to:
determine the time feature according to the time of the key frame image in the target video and the time interval of the sub-audio in the target video.
The specific details of the modules/units in the above apparatus 700 and apparatus 800 have been described in detail in the method embodiments; for details that are not disclosed here, reference may be made to the method embodiments, and they will not be repeated.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which a program product capable of implementing the above method of this specification is stored. In some possible implementations, various aspects of the present disclosure may also be implemented in the form of a program product that includes program code; when the program product runs on an electronic device, the program code causes the electronic device to execute the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Method" section of this specification. In an optional implementation, the program product may be implemented as a portable compact disc read-only memory (CD-ROM) including program code, and may be run on an electronic device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
The program product may adopt any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries readable program code. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. The readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
The program code contained on the readable medium may be transmitted over any suitable medium, including but not limited to wireless, wired, optical cable, RF, and the like, or any suitable combination of the above.
The program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user computing device, partly on the user device, as an independent software package, partly on the user computing device and partly on a remote computing device, or entirely on a remote computing device or server. In cases involving a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet by using an Internet service provider).
Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method. The electronic device 900 according to this exemplary embodiment of the present disclosure is described below with reference to Fig. 9. The electronic device 900 shown in Fig. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in Fig. 9, the electronic device 900 may take the form of a general-purpose computing device. The components of the electronic device 900 may include but are not limited to: at least one processing unit 910, at least one storage unit 920, a bus 930 connecting different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.
The storage unit 920 stores program code, which can be executed by the processing unit 910 so that the processing unit 910 performs the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Method" section of this specification. For example, the processing unit 910 may perform any one or more of the method steps in Figs. 1 to 6.
The storage unit 920 may include readable media in the form of volatile storage units, such as a random access storage unit (RAM) 921 and/or a cache storage unit 922, and may further include a read-only storage unit (ROM) 923.
The storage unit 920 may also include a program/utility 924 having a set of (at least one) program modules 925. Such program modules 925 include but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
The bus 930 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 900 may also communicate with one or more external devices 1000 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any device (such as a router, a modem, etc.) that enables the electronic device 900 to communicate with one or more other computing devices. Such communication may be performed through an input/output (I/O) interface 950. In addition, the electronic device 900 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 960. As shown in the figure, the network adapter 960 communicates with the other modules of the electronic device 900 through the bus 930. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
Through the description of the above embodiments, those skilled in the art will readily understand that the exemplary embodiments described here can be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solutions according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal apparatus, a network device, etc.) to execute the method according to the exemplary embodiments of the present disclosure.
In addition, the above drawings are merely schematic illustrations of the processing included in the methods according to the exemplary embodiments of the present disclosure and are not intended to be limiting. It is easy to understand that the processing shown in the above drawings does not indicate or limit the temporal order of these processes. It is also easy to understand that these processes may be executed, for example, synchronously or asynchronously in multiple modules.
It should be noted that although several modules or units of a device for action execution are mentioned in the above detailed description, such division is not mandatory. In fact, according to exemplary embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit; conversely, the features and functions of one module or unit described above may be further divided and embodied by multiple modules or units.
Those skilled in the art will understand that various aspects of the present disclosure may be implemented as a system, a method, or a program product. Therefore, various aspects of the present disclosure may be embodied in the following forms: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software, which may collectively be referred to here as a "circuit", "module", or "system". Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or customary technical means in the technical field not disclosed in the present disclosure. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present disclosure are indicated by the claims.
It should be understood that the present disclosure is not limited to the precise structures that have been described above and shown in the drawings, and various modifications and changes can be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (20)

  1. 一种视频人物识别方法,其特征在于,包括:A method for video character recognition, which is characterized in that it includes:
    从目标视频中获取关键帧图像;Obtain key frame images from the target video;
    从所述关键帧图像中提取人物外观特征;Extracting character appearance features from the key frame image;
    根据所述关键帧图像在所述目标视频中的时间,从所述目标视频的音频中截取所述关键帧图像对应的子音频,从所述子音频中提取声纹特征;Intercepting the sub audio corresponding to the key frame image from the audio of the target video according to the time of the key frame image in the target video, and extracting voiceprint features from the sub audio;
    利用预先训练的融合模型对所述人物外观特征和所述声纹特征进行处理,得到所述目标视频的人物识别结果。A pre-trained fusion model is used to process the appearance feature of the person and the voiceprint feature to obtain the person recognition result of the target video.
  2. 根据权利要求1所述的方法,其特征在于,所述从目标视频中获取关键帧图像,包括:The method according to claim 1, wherein said obtaining the key frame image from the target video comprises:
    从所述目标视频中提取帧内编码帧并进行解码,得到所述关键帧图像。The intra-frame coded frame is extracted from the target video and decoded to obtain the key frame image.
  3. 根据权利要求2所述的方法,其特征在于,所述从所述目标视频中提取帧内编码帧并进行解码,包括:The method according to claim 2, wherein the extracting and decoding intra-frame coded frames from the target video comprises:
    调用多个线程,通过每个线程分别解码一个所述帧内编码帧。Multiple threads are called, and one intra-frame coded frame is decoded by each thread.
  4. 根据权利要求1所述的方法,其特征在于,所述人物外观特征包括人脸特征。The method according to claim 1, wherein the appearance feature of the person comprises a face feature.
  5. 根据权利要求4所述的方法,其特征在于,所述从所述关键帧图像中提取人物外观特征,包括:The method according to claim 4, wherein said extracting the appearance characteristics of a person from the key frame image comprises:
    检测所述关键帧图像中的人脸区域,以从所述关键帧图像中截取人脸子图像;Detecting a face area in the key frame image to intercept a face sub-image from the key frame image;
    利用预先训练的卷积神经网络从所述人脸子图像中提取所述人脸特征。Using a pre-trained convolutional neural network to extract the facial features from the facial sub-images.
  6. 根据权利要求4所述的方法,其特征在于,所述人物外观特征还包括身形特征、体态特征。The method according to claim 4, wherein the appearance characteristics of the character further include body shape characteristics and posture characteristics.
  7. The method according to claim 1, wherein the voiceprint feature comprises Mel-frequency cepstral coefficients.
  8. The method according to claim 7, wherein extracting the voiceprint feature from the sub-audio comprises:
    obtaining a frequency spectrum corresponding to the sub-audio;
    calculating a corresponding amplitude spectrum from the frequency spectrum;
    performing Mel filtering on the amplitude spectrum to calculate the Mel-frequency cepstral coefficients of the sub-audio.
  9. The method according to claim 8, wherein obtaining the frequency spectrum corresponding to the sub-audio comprises:
    preprocessing the sub-audio;
    performing a Fourier transform on the preprocessed sub-audio to obtain the frequency spectrum corresponding to the sub-audio.
  10. The method according to claim 9, wherein the preprocessing comprises any one or more of the following: speech signal extraction, pre-emphasis, framing, and windowing.
  11. The method according to claim 8, wherein extracting the voiceprint feature from the sub-audio further comprises:
    converting the Mel-frequency cepstral coefficients into a voiceprint feature vector and post-processing the voiceprint feature vector.
  12. The method according to claim 11, wherein the post-processing comprises any one or more of the following: mean removal, normalization, and dimensionality reduction.
  13. The method according to claim 7, wherein the voiceprint feature further comprises any one or more of the following: generalized Mel cepstral coefficients, spectral envelope and energy features, fundamental frequency, voiced/unvoiced classification features, and band aperiodicity components.
  14. The method according to claim 1, wherein processing the person appearance feature and the voiceprint feature by using the pre-trained fusion model to obtain the person recognition result of the target video comprises:
    inputting the person appearance feature and the voiceprint feature into two input channels of the fusion model respectively;
    processing the person appearance feature and the voiceprint feature separately by using the fusion model to obtain an intermediate feature corresponding to the person appearance feature and an intermediate feature corresponding to the voiceprint feature;
    performing a fusion calculation on the intermediate feature corresponding to the person appearance feature and the intermediate feature corresponding to the voiceprint feature, and outputting the person recognition result of the target video.
  15. The method according to claim 1, wherein processing the person appearance feature and the voiceprint feature by using the pre-trained fusion model to obtain the person recognition result of the target video comprises:
    combining the person appearance feature and the voiceprint feature to obtain a comprehensive feature;
    inputting the comprehensive feature into the fusion model, and outputting the person recognition result of the target video.
  16. The method according to claim 15, wherein combining the person appearance feature and the voiceprint feature to obtain the comprehensive feature comprises:
    combining the person appearance feature, the voiceprint feature, and a time feature to obtain the comprehensive feature.
  17. The method according to claim 16, wherein, before the person appearance feature, the voiceprint feature, and the time feature are combined, the method further comprises:
    determining the time feature according to the time of the key frame image in the target video and the time interval of the sub-audio in the target video.
  18. A video person recognition apparatus, characterized by comprising a processor;
    wherein the processor is configured to execute the following program modules stored in a memory:
    an image acquisition module, configured to obtain a key frame image from a target video;
    a first extraction module, configured to extract a person appearance feature from the key frame image;
    a second extraction module, configured to intercept a sub-audio corresponding to the key frame image from the audio of the target video according to the time of the key frame image in the target video, and to extract a voiceprint feature from the sub-audio;
    a feature processing module, configured to process the person appearance feature and the voiceprint feature by using a pre-trained fusion model to obtain a person recognition result of the target video.
  19. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 17.
  20. An electronic device, characterized by comprising:
    a processor; and
    a memory for storing executable instructions of the processor;
    wherein the processor is configured to execute the method according to any one of claims 1 to 17 by executing the executable instructions.
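The key frame extraction of claims 2 and 3 can be sketched, purely for illustration, with the PyAV bindings to FFmpeg; the claims do not prescribe any particular library. The decoder is asked to skip every non-key frame so that only intra-coded frames are decoded, and FFmpeg's own frame-level threading stands in for the multi-threaded decoding of claim 3.

    # Minimal sketch, assuming PyAV is installed; the file path and pixel format are illustrative.
    import av

    def extract_key_frames(path):
        container = av.open(path)
        stream = container.streams.video[0]
        stream.thread_type = "AUTO"                  # let FFmpeg decode with multiple threads
        stream.codec_context.skip_frame = "NONKEY"   # decode intra-coded (key) frames only
        return [frame.to_ndarray(format="rgb24") for frame in container.decode(stream)]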
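A rough sketch of the face feature extraction in claim 5, assuming OpenCV's bundled Haar cascade for face detection and a hypothetical pre-trained convolutional network face_encoder for the embedding; the claim itself names neither a detector nor a specific network.

    import cv2

    def extract_face_features(key_frame_bgr, face_encoder):
        # face_encoder is a hypothetical pre-trained CNN returning one embedding per face crop
        detector = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        gray = cv2.cvtColor(key_frame_bgr, cv2.COLOR_BGR2GRAY)
        features = []
        for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5):
            face_sub_image = key_frame_bgr[y:y + h, x:x + w]    # crop the detected face region
            features.append(face_encoder(face_sub_image))       # CNN face feature
        return features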
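The MFCC pipeline of claims 8 to 12 (pre-emphasis, framing, windowing, Fourier transform, amplitude spectrum, Mel filtering, and post-processing) can be sketched with NumPy and SciPy. The frame lengths, filter counts, and the 0.97 pre-emphasis factor are common defaults assumed here, not values fixed by the claims, and the sub-audio is assumed to be at least one frame long.

    import numpy as np
    from scipy.fftpack import dct

    def mfcc(signal, sr=16000, n_fft=512, frame_len=0.025, frame_step=0.010,
             n_mels=26, n_ceps=13):
        # Pre-emphasis (claim 10)
        emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
        # Framing and Hamming windowing (claim 10)
        flen, fstep = int(frame_len * sr), int(frame_step * sr)
        n_frames = 1 + max(0, (len(emphasized) - flen) // fstep)
        frames = np.stack([emphasized[i * fstep:i * fstep + flen] for i in range(n_frames)])
        frames = frames * np.hamming(flen)
        # Fourier transform and amplitude spectrum (claims 8 and 9)
        amplitude = np.abs(np.fft.rfft(frames, n_fft))
        power = amplitude ** 2 / n_fft
        # Triangular Mel filterbank applied to the spectrum (claim 8)
        hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
        fbank = np.zeros((n_mels, n_fft // 2 + 1))
        for m in range(1, n_mels + 1):
            left, center, right = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
        log_mel = np.log(power @ fbank.T + 1e-10)
        # DCT to cepstral coefficients, then mean removal as a simple post-processing (claims 11 and 12)
        ceps = dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]
        return ceps - ceps.mean(axis=0)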
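One way to read the two-input-channel fusion model of claim 14 is as a two-branch network. The sketch below uses PyTorch with purely illustrative layer sizes, and concatenation stands in for the fusion calculation, which the claim does not fix.

    import torch
    import torch.nn as nn

    class FusionModel(nn.Module):
        # Two input channels (claim 14): each modality gets its own branch, the
        # intermediate features are fused, and a classifier outputs per-person scores.
        def __init__(self, appearance_dim=512, voiceprint_dim=13, hidden=256, num_persons=100):
            super().__init__()
            self.appearance_branch = nn.Sequential(nn.Linear(appearance_dim, hidden), nn.ReLU())
            self.voiceprint_branch = nn.Sequential(nn.Linear(voiceprint_dim, hidden), nn.ReLU())
            self.classifier = nn.Sequential(
                nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_persons))

        def forward(self, appearance_feat, voiceprint_feat):
            a = self.appearance_branch(appearance_feat)   # intermediate appearance feature
            v = self.voiceprint_branch(voiceprint_feat)   # intermediate voiceprint feature
            fused = torch.cat([a, v], dim=-1)             # fusion calculation (concatenation)
            return self.classifier(fused)                 # person recognition scores

In training, the output scores would typically be fed to a cross-entropy loss over known person identities, although the claim does not prescribe the training objective.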
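For the single-channel variant of claims 15 to 17, the comprehensive feature is a concatenation of the appearance feature, the voiceprint feature, and a time feature. The encoding of the time feature below (the key frame timestamp plus the endpoints of the sub-audio interval) is an assumption, since the claims only say that the time feature is determined from those quantities.

    import numpy as np

    def build_comprehensive_feature(appearance_feat, voiceprint_feat, frame_time, audio_interval):
        # audio_interval is the (start, end) of the sub-audio within the target video, in seconds
        time_feat = np.array([frame_time, audio_interval[0], audio_interval[1]])
        return np.concatenate([appearance_feat, voiceprint_feat, time_feat])
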
PCT/CN2020/121259 2019-10-28 2020-10-15 Video figure recognition method and apparatus, and storage medium and electronic device WO2021082941A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911029707.4 2019-10-28
CN201911029707.4A CN110909613B (en) 2019-10-28 2019-10-28 Video character recognition method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
WO2021082941A1 true WO2021082941A1 (en) 2021-05-06

Family

ID=69816174

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/121259 WO2021082941A1 (en) 2019-10-28 2020-10-15 Video figure recognition method and apparatus, and storage medium and electronic device

Country Status (2)

Country Link
CN (1) CN110909613B (en)
WO (1) WO2021082941A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909613B (en) * 2019-10-28 2024-05-31 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment
CN112434234B (en) * 2020-05-15 2023-09-01 上海哔哩哔哩科技有限公司 Frame extraction method and system based on browser
CN111767805A (en) * 2020-06-10 2020-10-13 云知声智能科技股份有限公司 Multi-mode data automatic cleaning and labeling method and system
CN111881726B (en) * 2020-06-15 2022-11-25 马上消费金融股份有限公司 Living body detection method and device and storage medium
CN111753762B (en) * 2020-06-28 2024-03-15 北京百度网讯科技有限公司 Method, device, equipment and storage medium for identifying key identification in video
CN111914742A (en) * 2020-07-31 2020-11-10 辽宁工业大学 Attendance checking method, system, terminal equipment and medium based on multi-mode biological characteristics
CN112215136B (en) * 2020-10-10 2023-09-05 北京奇艺世纪科技有限公司 Target person identification method and device, electronic equipment and storage medium
CN112364779B (en) * 2020-11-12 2022-10-21 中国电子科技集团公司第五十四研究所 Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN112364829B (en) * 2020-11-30 2023-03-24 北京有竹居网络技术有限公司 Face recognition method, device, equipment and storage medium
CN113077470B (en) * 2021-03-26 2022-01-18 天翼爱音乐文化科技有限公司 Method, system, device and medium for cutting horizontal and vertical screen conversion picture
CN113378697B (en) * 2021-06-08 2022-12-09 安徽大学 Method and device for generating speaking face video based on convolutional neural network
CN113507627B (en) * 2021-07-08 2022-03-25 北京的卢深视科技有限公司 Video generation method and device, electronic equipment and storage medium
CN115691538A (en) * 2021-07-29 2023-02-03 华为技术有限公司 Video processing method and electronic equipment
CN113992972B (en) * 2021-10-28 2024-11-08 维沃移动通信有限公司 Subtitle display method and device, electronic equipment and readable storage medium
CN114640826B (en) * 2022-03-23 2023-11-03 北京有竹居网络技术有限公司 Data processing method, device, readable medium and electronic equipment
CN114915856B (en) * 2022-05-17 2023-05-05 中国科学院半导体研究所 Video key frame identification method, device, equipment and medium
CN118101988B (en) * 2024-04-26 2024-09-24 荣耀终端有限公司 Video processing method, system and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222227A (en) * 2011-04-25 2011-10-19 中国华录集团有限公司 Video identification based system for extracting film images
CN104834849A (en) * 2015-04-14 2015-08-12 时代亿宝(北京)科技有限公司 Dual-factor identity authentication method and system based on voiceprint recognition and face recognition
CN107194229A (en) * 2017-05-22 2017-09-22 商洛学院 A kind of computer user's personal identification method
CN108399395A (en) * 2018-03-13 2018-08-14 成都数智凌云科技有限公司 The compound identity identifying method of voice and face based on end-to-end deep neural network
CN109376603A (en) * 2018-09-25 2019-02-22 北京周同科技有限公司 A kind of video frequency identifying method, device, computer equipment and storage medium
CN109446990A (en) * 2018-10-30 2019-03-08 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN110222719A (en) * 2019-05-10 2019-09-10 中国科学院计算技术研究所 A kind of character recognition method and system based on multiframe audio-video converged network
CN110909613A (en) * 2019-10-28 2020-03-24 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590439B (en) * 2017-08-18 2020-12-15 湖南文理学院 Target person identification and tracking method and device based on monitoring video
CN109409296B (en) * 2018-10-30 2020-12-01 河北工业大学 Video emotion recognition method integrating facial expression recognition and voice emotion recognition
CN109740020A (en) * 2018-12-26 2019-05-10 秒针信息技术有限公司 Data processing method, device, storage medium and processor
CN110096966A (en) * 2019-04-10 2019-08-06 天津大学 A kind of audio recognition method merging the multi-modal corpus of depth information Chinese

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254706A (en) * 2021-05-12 2021-08-13 北京百度网讯科技有限公司 Video matching method, video processing device, electronic equipment and medium
CN113223125A (en) * 2021-05-17 2021-08-06 百度在线网络技术(北京)有限公司 Face driving method, device, equipment and medium for virtual image
CN113223125B (en) * 2021-05-17 2023-09-26 百度在线网络技术(北京)有限公司 Face driving method, device, equipment and medium for virtual image
CN114283060A (en) * 2021-12-20 2022-04-05 北京字节跳动网络技术有限公司 Video generation method, device, equipment and storage medium
CN114544630A (en) * 2022-02-25 2022-05-27 河南科技大学 Group egg image segmentation fertilization information detection device and method based on deep learning
CN115022710A (en) * 2022-05-30 2022-09-06 咪咕文化科技有限公司 Video processing method and device and readable storage medium
CN115022710B (en) * 2022-05-30 2023-09-19 咪咕文化科技有限公司 Video processing method, device and readable storage medium
CN115935008A (en) * 2023-02-16 2023-04-07 杭州网之易创新科技有限公司 Video label generation method, device, medium and computing equipment
CN115935008B (en) * 2023-02-16 2023-05-30 杭州网之易创新科技有限公司 Video label generation method, device, medium and computing equipment

Also Published As

Publication number Publication date
CN110909613A (en) 2020-03-24
CN110909613B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
WO2021082941A1 (en) Video figure recognition method and apparatus, and storage medium and electronic device
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
US10109277B2 (en) Methods and apparatus for speech recognition using visual information
WO2020253051A1 (en) Lip language recognition method and apparatus
CN111009237A (en) Voice recognition method and device, electronic equipment and storage medium
CN110516083B (en) Album management method, storage medium and electronic device
TW201543467A (en) Voice input method, device and system
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN110070859B (en) Voice recognition method and device
WO2015103836A1 (en) Voice control method and device
WO2024140430A1 (en) Text classification method based on multimodal deep learning, device, and storage medium
WO2024140434A1 (en) Text classification method based on multi-modal knowledge graph, and device and storage medium
Chen et al. Multi-Modality Matters: A Performance Leap on VoxCeleb.
CN112017633B (en) Speech recognition method, device, storage medium and electronic equipment
WO2022062800A1 (en) Speech separation method, electronic device, chip and computer-readable storage medium
CN113257238B (en) Training method of pre-training model, coding feature acquisition method and related device
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN111462732B (en) Speech recognition method and device
TWI769520B (en) Multi-language speech recognition and translation method and system
CN114360491A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
KR102220964B1 (en) Method and device for audio recognition
CN117037772A (en) Voice audio segmentation method, device, computer equipment and storage medium
CN112542157B (en) Speech processing method, device, electronic equipment and computer readable storage medium
Robi et al. Active Speaker Detection using Audio, Visual and Depth Modalities: A Survey
CN114283493A (en) Artificial intelligence-based identification system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20882402

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20882402

Country of ref document: EP

Kind code of ref document: A1