
WO2021082941A1 - Video figure recognition method and apparatus, and storage medium and electronic device - Google Patents

Video figure recognition method and apparatus, and storage medium and electronic device Download PDF

Info

Publication number
WO2021082941A1
WO2021082941A1 · PCT/CN2020/121259 · CN2020121259W
Authority
WO
WIPO (PCT)
Prior art keywords
feature
target video
key frame
audio
sub
Prior art date
Application number
PCT/CN2020/121259
Other languages
French (fr)
Chinese (zh)
Inventor
彭冬炜
Original Assignee
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oppo广东移动通信有限公司 filed Critical Oppo广东移动通信有限公司
Publication of WO2021082941A1 publication Critical patent/WO2021082941A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Definitions

  • the present disclosure relates to the field of artificial intelligence technology, and in particular to a video character recognition method, a video character recognition device, a computer-readable storage medium, and electronic equipment.
  • Video person recognition refers to the identification of the person's identity in the video to classify the video or add person tags, etc. It has important applications in scenarios such as security, video classification, video content review, and smart photo albums.
  • video person recognition is mainly implemented based on face recognition in video images.
  • An image containing a face is detected from the video, and then the face in the image is further accurately recognized to determine the identity of the person.
  • This method has higher requirements on the sharpness of the face image. When the face image is not clear enough or is blocked, the accuracy of the recognition result is low.
  • the present disclosure provides a video character recognition method, a video character recognition device, a computer-readable storage medium, and electronic equipment, thereby improving the accuracy of video character recognition at least to a certain extent.
  • A method for recognizing a person in a video is provided, including: acquiring a key frame image from a target video; extracting a character appearance feature from the key frame image; intercepting, according to the time of the key frame image in the target video, the sub-audio corresponding to the key frame image from the audio of the target video, and extracting a voiceprint feature from the sub-audio; and processing the character appearance feature and the voiceprint feature using a pre-trained fusion model to obtain a person recognition result of the target video.
  • A video character recognition device is provided, including a processor, wherein the processor is configured to execute the following program modules stored in a memory: an image acquisition module for acquiring a key frame image from a target video; a first extraction module for extracting a character appearance feature from the key frame image; a second extraction module for intercepting, according to the time of the key frame image in the target video, the sub-audio corresponding to the key frame image from the audio of the target video, and extracting a voiceprint feature from the sub-audio; and a feature processing module for processing the character appearance feature and the voiceprint feature using a pre-trained fusion model to obtain a person recognition result of the target video.
  • a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the above-mentioned video character recognition method and possible implementation manners thereof are realized.
  • An electronic device is provided, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the above-mentioned video character recognition method and its possible implementations by executing the executable instructions.
  • According to the above method and apparatus, the key frame image is obtained from the target video and the character appearance feature is extracted; the sub-audio is then intercepted according to the time of the key frame image in the target video and the voiceprint feature is extracted from it; finally, the pre-trained fusion model processes the appearance feature and the voiceprint feature to obtain the person recognition result of the target video.
  • On the one hand, the character appearance feature reflects the image aspect of the video and the voiceprint feature reflects the sound aspect, so the "sound" and "picture" information of the video is exploited jointly; fusing these multi-modal features and performing person recognition on them can achieve higher recognition accuracy.
  • On the other hand, the multi-modal features can compensate, to a certain extent, for the absence of either single feature, so the technical solution of the present disclosure can be applied to situations where the face image in the video has low definition or is occluded, and has high robustness.
  • Furthermore, the key frame image and the sub-audio are matched in time and correspond to each other, which reduces cases where the image feature and the sound feature are out of sync and improves the accuracy of video character recognition.
  • Fig. 1 shows a flowchart of a method for video character recognition in this exemplary embodiment
  • Fig. 2 shows a flowchart of a method for extracting facial features in this exemplary embodiment
  • Fig. 3 shows a flowchart of a method for extracting voiceprint features in this exemplary embodiment
  • Fig. 4 shows a flowchart of a method for obtaining a spectrum in this exemplary embodiment
  • Fig. 5 shows a flowchart of a method for processing appearance characteristics and voiceprint characteristics of a character in this exemplary embodiment
  • Fig. 6 shows a flowchart of another method for processing appearance characteristics and voiceprint characteristics of a person in this exemplary embodiment
  • Fig. 7 shows a structural block diagram of a video character recognition device in this exemplary embodiment
  • Fig. 8 shows a structural block diagram of another video character recognition device in this exemplary embodiment
  • Fig. 9 shows an electronic device for implementing the above method in this exemplary embodiment.
  • In the related art, face recognition based on video images places high requirements on the clarity of the face image; when the face image is not clear enough or is occluded, the accuracy of the recognition result is low.
  • In addition, a video actually contains multi-modal information, including images, speech and other aspects, while the related art identifies people only through face recognition in the image and fails to make full use of this multi-modal information, which is one of the main reasons for its low accuracy in identifying people.
  • In view of one or more of the above problems, the exemplary embodiments of the present disclosure first provide a video character recognition method. The method can be applied to the server of a video service platform, for example to perform character recognition on the videos on the platform from the server side in order to add people tags that make searching convenient for users; it can also be applied to terminal devices such as personal computers and smartphones, for example to recognize people in videos shot or downloaded by the user in order to classify them by person automatically.
  • Fig. 1 shows a process of this exemplary embodiment, which may include the following steps S110 to S140:
  • Step S110: obtain a key frame image from the target video;
  • Step S120: extract a character appearance feature from the key frame image;
  • Step S130: according to the time of the key frame image in the target video, intercept the sub-audio corresponding to the key frame image from the audio of the target video, and extract a voiceprint feature from the sub-audio;
  • Step S140: use the pre-trained fusion model to process the above character appearance feature and voiceprint feature to obtain a person recognition result of the target video.
  • In step S110, a key frame image is obtained from the target video.
  • the key frame image refers to an image containing the appearance of a human face in the target video.
  • One key frame or multiple key frames may be extracted, and the number of them is not limited in the present disclosure.
  • In a fixed-interval approach, one frame is selected as a key frame in the target video every fixed length of time or every fixed number of frames; for example, one key frame image can be extracted every 3 frames.
  • an intra-coded frame (Intra-Coded frame, I frame for short) is a frame that is independently coded based on a single frame image, which is a complete preservation of the current frame image, and only the data of the current frame is required for decoding.
  • Corresponding to the I frame, there are the forward predictive frame (Predictive frame, P frame for short) and the bi-directional predictive frame (Bi-Predictive frame, B frame for short). The P frame records the difference from the previous frame, and decoding it requires reference to the previous frame data; the B frame records the differences between it and both the previous and subsequent frames, and requires reference to both the previous and subsequent frame data to be completely decoded.
  • It follows that if a P frame or a B frame is determined to be the key frame, the I frame needs to be decoded first when acquiring the key frame image, and the target P frame or B frame is then decoded according to the differences between adjacent frames, which is less efficient. Therefore, the I frame can be directly used as the key frame: the I frame is extracted from the target video and decoded to obtain the aforementioned key frame image. In this way, only the key frame image itself needs to be decoded, no other frames need to be decoded, the number of frames to be decoded is minimal, and the key frame image is extracted at the fastest speed.
  • In addition, multiple threads can be invoked during decoding so that each thread decodes one I frame separately. Decoders are usually found in video tools such as video playback software and editing software; such a decoder can be embedded into the video character recognition program and the threading part of its code modified. For example, when N I frames are acquired as key frames, N threads are started accordingly, the decoding task of each I frame is allocated to a corresponding thread, and each thread executes its decoding task independently, so that the extraction of the key frame images is completed quickly in a concurrent manner.
  • Step S110 may obtain a fixed number of key frame images, for example 64 or 128 key frame images. When these key frames are collected, the relevant parameters can be determined according to this number, for example: calculating the interval length or number of frames in method (1) above; determining the number of key frames extracted from each sub-video in method (2) above; or determining how to extract I frames in method (3) above, where, if the number of I frames in the video to be classified is insufficient, the shortfall can be extracted from P frames or B frames.
  • This exemplary embodiment can also use the above three methods in combination, for example methods (2) and (3) can be combined so that one I frame is selected as a key frame in each sub-video, and so on.
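  • For illustration only (the disclosure does not name a decoding library), the following is a minimal Python sketch of extracting only I frames as key frame images; the PyAV package, the 64-frame cap and the use of the decoder's built-in frame threading are assumptions rather than part of the disclosure.
```python
import av  # PyAV, a Python binding to FFmpeg (assumed dependency)

def extract_key_frames(video_path, max_frames=64):
    """Decode only intra-coded frames (I frames) as key frame images."""
    key_frames = []  # list of (timestamp in seconds, RGB image array)
    with av.open(video_path) as container:
        stream = container.streams.video[0]
        stream.codec_context.skip_frame = "NONKEY"   # skip P/B frames entirely
        stream.codec_context.thread_type = "AUTO"    # let the decoder use multiple threads
        for frame in container.decode(stream):
            key_frames.append((float(frame.time), frame.to_ndarray(format="rgb24")))
            if len(key_frames) >= max_frames:        # stop at a fixed number of key frames
                break
    return key_frames
```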
  • In step S120, the character appearance feature is extracted from the key frame image.
  • A machine learning model may be used to extract the character appearance feature from the key frame image. Since the aim here is not to classify or recognize the key frame image itself, there is no limitation on what type of data the machine learning model ultimately outputs.
  • The advantage of this is that there is no restriction on the type of label used when training the convolutional neural network: whichever label is readily available or easy to obtain can be used for training.
  • For example, an open-source portrait data set containing a large number of face images and their classification labels can be used to train a convolutional neural network for image classification, which is then used to extract face features in step S120.
  • In feature extraction, the key frame image can be input into the convolutional neural network and, after a series of convolution and pooling operations, the feature is taken from a fully connected layer. The first fully connected layer, whose features are denser, may be selected, or a subsequent fully connected layer, whose data volume is usually smaller, may be chosen; this is not limited in the present disclosure.
  • The character appearance feature can include a face feature, a body shape feature, a posture feature, and the like. The face feature includes information such as the position, proportion, shape and expression of each part of the face; the body shape feature includes information such as the position, proportion and shape of the body and limbs; the posture feature includes information such as the character's actions and poses. Among these, the face feature is relatively more important for person recognition.
  • step S120 may be specifically implemented by the following steps S210 and S220:
  • Step S210: detect the face area in the key frame image so as to intercept a face sub-image from the key frame image;
  • Step S220: use a pre-trained convolutional neural network to extract a face feature from the aforementioned face sub-image.
  • the face area can be recognized by algorithms such as contour detection.
  • the key frame image can be input into the face detection network RetinaFace for face area detection, and the area where the face is located in the image and the coordinates of the key points of the face can be detected.
  • the face area is cut out from the key frame image to obtain a face sub-image, which filters out scenes, objects and other image content not related to person recognition.
  • The face sub-image is then input to a pre-trained convolutional neural network, and the face feature is obtained from a fully connected layer of the network.
  • The dimension of the face feature can be set according to actual needs. For example, if the first fully connected layer of the convolutional neural network is set to 512 dimensions, then after the face sub-image is input, a 512-dimensional face feature can be extracted from that fully connected layer.
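  • As an illustrative sketch of steps S210 and S220 (not the disclosure's exact implementation), the face area returned by a detector such as RetinaFace can be cropped and passed through a pre-trained CNN whose 512-dimensional fully connected layer serves as the face feature; the `detect_face_boxes` and `face_cnn` callables below are hypothetical placeholders.
```python
import torch
import torch.nn.functional as F

def extract_face_feature(key_frame_rgb, detect_face_boxes, face_cnn):
    """Crop the detected face sub-image and return a 512-d feature from the CNN."""
    boxes = detect_face_boxes(key_frame_rgb)           # e.g. RetinaFace-style [x1, y1, x2, y2] boxes
    if len(boxes) == 0:
        return None                                    # no face in this key frame
    x1, y1, x2, y2 = [int(v) for v in boxes[0]]
    face = key_frame_rgb[y1:y2, x1:x2]                 # face sub-image, irrelevant content removed
    face = torch.from_numpy(face).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    face = F.interpolate(face, size=(112, 112), mode="bilinear", align_corners=False)
    with torch.no_grad():
        feature = face_cnn(face)                       # 1 x 512 output of the fully connected layer
    return feature.squeeze(0)
```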
  • In step S130, according to the time of the key frame image in the target video, the sub-audio corresponding to the key frame image is intercepted from the audio of the target video, and the voiceprint feature is extracted from the sub-audio.
  • the intercepted sub audio is the audio part corresponding to the key frame image.
  • For example, if the time of the key frame image in the target video is 09.670 seconds, this time can be taken as the center, and a window of sub-audio of a certain duration can be intercepted from the audio of the target video.
  • the sub audio and key frame images should achieve "audio-visual synchronization".
  • The duration of the sub-audio is not specifically limited in the present disclosure; a fixed duration can be used, for example 3 seconds or 5 seconds according to the average time a person speaks in a typical video. Alternatively, the audio on both sides of the key frame time point can be scanned for sudden change points, such as time points at which the speech content or frequency changes abruptly, and the portion of audio between two such change points can be intercepted.
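  • A minimal sketch of the fixed-duration variant of this interception is shown below, assuming the audio track has already been loaded as a mono waveform with a known sample rate; the 3-second window is only one of the example durations mentioned above.
```python
import numpy as np

def cut_sub_audio(waveform, sample_rate, key_frame_time, window_seconds=3.0):
    """Intercept a window of audio centred on the key frame's time in the video."""
    center = int(key_frame_time * sample_rate)
    half = int(window_seconds * sample_rate / 2)
    start = max(0, center - half)
    end = min(len(waveform), center + half)
    return np.asarray(waveform[start:end])
```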
  • the voiceprint feature may include Mel-Frequency Cepstral Coefficients (MFCC).
  • Step S310 Obtain the frequency spectrum corresponding to the sub audio.
  • the frequency spectrum can be regarded as the collection of the frequency, phase and amplitude of each sinusoidal signal, and the corresponding frequency spectrum can be obtained by sampling the signal in the sub-audio.
  • Step S320 Calculate the corresponding amplitude spectrum based on the above frequency spectrum.
  • the amplitude spectrum refers to the spectral line formed by the amplitude of each different frequency sinusoidal signal in the frequency spectrum.
  • the corresponding amplitude spectrum can be obtained by disassembling and calculating the frequency spectrum of the sub-audio.
  • In step S330, Mel filter processing is performed on the amplitude spectrum to calculate the Mel frequency cepstrum coefficients of the sub-audio.
  • the amplitude spectrum can be divided into multiple mel filter banks according to the sensitivity of the human ear, and the center frequencies of each filter are linearly distributed at equal intervals.
  • By using the Mel filter bank to perform Mel filtering on the amplitude spectrum of the sub-audio, the Mel frequency cepstrum coefficients can be calculated.
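  • One possible realisation of steps S310 to S330 is sketched below using the librosa library (an assumption; the disclosure does not prescribe a library): the short-time Fourier transform gives the frequency spectrum, its magnitude gives the amplitude spectrum, and Mel filtering followed by a discrete cosine transform yields the Mel frequency cepstrum coefficients.
```python
import numpy as np
import librosa

def compute_mfcc(sub_audio, sample_rate, n_mfcc=20):
    spectrum = librosa.stft(sub_audio)                                      # step S310: frequency spectrum
    amplitude = np.abs(spectrum)                                            # step S320: amplitude spectrum
    mel = librosa.feature.melspectrogram(S=amplitude ** 2, sr=sample_rate)  # Mel filter bank processing
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=n_mfcc)  # step S330: MFCC
    return mfcc                                                             # shape: (n_mfcc, number of frames)
```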
  • step S310 may be implemented by the following steps S410 and S420:
  • Step S410: preprocess the sub-audio.
  • the preprocessing may include any one or more of the following: extracting the speech signal, pre-emphasis, framing and windowing.
  • Extracting the speech signal refers to filtering out non-human-voice signals such as background sound and noise from the sub-audio and retaining only the human voice signal; pre-emphasis is a signal processing method that compensates for the high-frequency components of the sub-audio; framing splits the sub-audio into individual frames to facilitate subsequent feature extraction; windowing limits the signal through a preset window size, where each frame is substituted into a window function and values outside the window are set to 0.
  • This exemplary embodiment may use a rectangular (square) window, a Hamming window, or the like as the window function.
  • Step S420 Perform Fourier transform on the preprocessed sub-audio to obtain the frequency spectrum corresponding to the sub-audio.
  • Performing Fourier transform on the sub-audio can extract the frequency domain features, and draw the sub-audio signal as a frequency-density curve, that is, obtain the frequency spectrum corresponding to the sub-audio.
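  • A minimal numpy sketch of the preprocessing and Fourier transform in steps S410 and S420 (pre-emphasis, framing, Hamming windowing, then a per-frame Fourier transform) follows; the frame length, hop size and pre-emphasis coefficient are illustrative choices, not values fixed by the disclosure.
```python
import numpy as np

def preprocess_and_fft(signal, frame_len=400, hop=160, alpha=0.97):
    """Return the per-frame frequency spectrum of the preprocessed sub-audio."""
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])   # pre-emphasis
    window = np.hamming(frame_len)                                        # windowing function
    frames = [emphasized[s:s + frame_len] * window                        # framing + windowing
              for s in range(0, len(emphasized) - frame_len + 1, hop)]
    return np.fft.rfft(np.array(frames), axis=1)                          # step S420: frequency spectrum
```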
  • the Mel frequency cepstrum coefficients can also be converted into voiceprint feature vectors, and post-processing is performed.
  • Mel frequency cepstral coefficients are high-dimensional dense features, which are expressed in the form of vectors, that is, voiceprint feature vectors, which can be used in the processing of machine learning models.
  • the post-processing of the voiceprint feature vector can optimize the subsequent process, and the post-processing can adopt any one or more of the following methods: de-averaging, normalization, and dimensionality reduction processing.
  • For example, Principal Components Analysis (PCA) can be used for the dimensionality reduction processing.
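  • A minimal sketch of this optional post-processing over a batch of voiceprint feature vectors is given below, with scikit-learn assumed for the PCA step and the 128-dimensional output chosen only for illustration.
```python
import numpy as np
from sklearn.decomposition import PCA

def postprocess_voiceprints(vectors, n_components=128):
    """De-average, normalise and reduce the dimensionality of voiceprint vectors."""
    centered = vectors - vectors.mean(axis=0)                         # de-averaging
    normalized = centered / (centered.std(axis=0) + 1e-8)             # normalisation
    # n_components must not exceed the number of vectors or their dimension
    return PCA(n_components=n_components).fit_transform(normalized)   # dimensionality reduction
```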
  • The voiceprint features extracted in step S130 may also include any one or more of the following: generalized Mel cepstrum coefficients, spectral envelope and energy features, fundamental frequency, voiced/light tone classification features, and band aperiodic components.
  • The generalized Mel cepstrum coefficient is basically the same as the Mel frequency cepstrum coefficient in principle; both are high-dimensional features (for example, 180 dimensions) whose specific coefficient contents differ somewhat, so it can be used as a substitute for the Mel frequency cepstrum coefficient.
  • The spectral envelope and energy features are features related to the speech content, while the fundamental frequency, voiced/light tone classification features and band aperiodic components are features related to basic pronunciation information; they are usually relatively sparse features and can be used as a supplement to the Mel frequency cepstrum coefficients.
  • The richer the dimensions of the voiceprint feature, the more accurately the video character is characterized, and the more conducive this is to accurate video character recognition.
  • In step S140, a pre-trained fusion model is used to process the above character appearance feature and voiceprint feature to obtain the person recognition result of the target video.
  • the character appearance feature is the feature extracted from the image
  • the voiceprint feature is the feature extracted from the sound.
  • the fusion of the two represents the multi-modal feature of the target video.
  • After the image feature and the voiceprint feature are extracted in steps S120 and S130 to obtain feature data in vector or matrix form, multi-modal feature fusion is easy to achieve, and the fused features are then processed by the fusion model to obtain the person recognition result of the target video.
  • step S140 may include the following steps S510 to S530:
  • Step S510: input the character appearance feature and the voiceprint feature into the two input channels of the fusion model respectively.
  • Step S520: use the fusion model to process the character appearance feature and the voiceprint feature separately to obtain an intermediate feature corresponding to the appearance feature and an intermediate feature corresponding to the voiceprint feature; these two intermediate features respectively represent abstract information of the video character's appearance and voice.
  • Step S530: perform a fusion calculation on the intermediate feature corresponding to the character appearance feature and the intermediate feature corresponding to the voiceprint feature, and output the person recognition result of the target video.
  • In this way, the appearance and sound information can be combined and correlated, achieving fusion of the two modalities of information, and after further recognition processing a comprehensive person recognition result is finally output.
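  • A hedged PyTorch sketch of the two-channel fusion described in steps S510 to S530 is given below: each modality is mapped to an intermediate feature, the two intermediate features are fused, and a classifier outputs the person recognition result; the layer sizes and the number of person classes are assumptions, not values from the disclosure.
```python
import torch
import torch.nn as nn

class TwoChannelFusion(nn.Module):
    def __init__(self, face_dim=512, voice_dim=512, hidden=256, num_persons=100):
        super().__init__()
        self.face_branch = nn.Sequential(nn.Linear(face_dim, hidden), nn.ReLU())    # input channel 1
        self.voice_branch = nn.Sequential(nn.Linear(voice_dim, hidden), nn.ReLU())  # input channel 2
        self.classifier = nn.Linear(hidden * 2, num_persons)

    def forward(self, face_feat, voice_feat):
        f = self.face_branch(face_feat)       # intermediate feature of the appearance (S520)
        v = self.voice_branch(voice_feat)     # intermediate feature of the voiceprint (S520)
        fused = torch.cat([f, v], dim=-1)     # fusion calculation (S530)
        return self.classifier(fused)         # person recognition result (logits)
```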
  • step S140 may include the following steps S610 and S620:
  • Step S610: combine the character appearance feature and the voiceprint feature to obtain a comprehensive feature;
  • Step S620: input the comprehensive feature into the fusion model and output the person recognition result of the target video.
  • The character appearance feature and the voiceprint feature can be combined by splicing (concatenation). For example, a 512-dimensional face feature is obtained through the convolutional neural network and a 512-dimensional Mel frequency cepstrum coefficient (i.e., the voiceprint feature) is extracted; the two are spliced into a 1024-dimensional comprehensive feature and input into the fusion model for processing.
  • The fusion model can use a common neural network model, or it can be structurally optimized according to actual needs. For example, MobileNet (an open-source neural network for mobile terminals) can be used, in which data enhancement mechanisms such as a Dropout layer and random noise are set up to regularize the input comprehensive feature; the number of channels of the fully connected layer is set to 1024, and the PReLU (Parametric Rectified Linear Unit) activation and a BCE Loss (Binary Cross Entropy Loss) can be adopted.
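  • The concatenation variant can be sketched as below; this is only one interpretation of the description above (Dropout for regularisation, a 1024-channel fully connected layer, PReLU activation and a binary cross-entropy loss) and does not reproduce the exact MobileNet-based topology of the disclosure.
```python
import torch
import torch.nn as nn

class ConcatFusionHead(nn.Module):
    def __init__(self, in_dim=1024, num_persons=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Dropout(p=0.5),             # regularisation of the input comprehensive feature
            nn.Linear(in_dim, 1024),       # fully connected layer with 1024 channels
            nn.PReLU(),                    # Parametric Rectified Linear Unit activation
            nn.Linear(1024, num_persons),
        )

    def forward(self, face_feat, voice_feat):
        combined = torch.cat([face_feat, voice_feat], dim=-1)  # 1024-d comprehensive feature
        return self.net(combined)

criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy loss used during training
```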
  • A time feature can also be determined according to the time of the key frame image in the target video and the time interval of the sub-audio in the target video.
  • The time feature can include 2 or 3 dimensions: one dimension records the time of the key frame, and the other dimension(s) record the time interval of the sub-audio, for example the start time and the end time, or the center time and the sub-audio duration; in the case of multiple key frame images, the dimension of the time feature is set according to the number of frames.
  • The time feature is equivalent to a supplement to the multi-modal features; adding the time information helps to improve the completeness and richness of the comprehensive feature, thereby improving the accuracy of person recognition.
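  • A minimal sketch of appending the optional time feature (one dimension for the key frame time and two for the sub-audio interval) to the comprehensive feature is shown below; the single-key-frame case is assumed.
```python
import torch

def add_time_feature(comprehensive_feat, key_frame_time, audio_start, audio_end):
    """Append a 3-dimensional time feature to the comprehensive feature vector."""
    time_feat = torch.tensor([key_frame_time, audio_start, audio_end],
                             dtype=comprehensive_feat.dtype)
    return torch.cat([comprehensive_feat, time_feat], dim=-1)
```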
  • To sum up, in this exemplary embodiment the key frame image is obtained from the target video and the character appearance feature is extracted; the sub-audio is then intercepted according to the time of the key frame image in the target video and the voiceprint feature is extracted from it; finally, the pre-trained fusion model processes the appearance feature and the voiceprint feature to obtain the person recognition result of the target video.
  • On the one hand, the character appearance feature reflects the image aspect of the video and the voiceprint feature reflects the sound aspect, so the "sound" and "picture" information of the video is exploited jointly; fusing these multi-modal features and performing person recognition on them can achieve higher recognition accuracy.
  • On the other hand, the multi-modal features can compensate, to a certain extent, for the absence of either single feature, so this exemplary embodiment can be applied to situations where the face image in the video has low definition or is occluded, and has high robustness.
  • Furthermore, the key frame image and the sub-audio are matched in time and correspond to each other, which reduces cases where the image feature and the sound feature are out of sync and improves the accuracy of video character recognition.
  • the video character recognition device 700 may include a processor 710 and a memory 720.
  • the memory 720 includes the following program modules:
  • the image acquisition module 721 is used to acquire key frame images from the target video
  • the first extraction module 722 is used to extract the appearance features of the person from the key frame image
  • the second extraction module 723 is configured to intercept the sub audio corresponding to the key frame image from the audio of the target video according to the time of the key frame image in the target video, and extract voiceprint features from the sub audio;
  • the feature processing module 724 is configured to use the pre-trained fusion model to process the above-mentioned character appearance features and voiceprint features to obtain the character recognition result of the target video;
  • the processor 710 is configured to execute the foregoing program modules.
  • the image acquisition module 721 may be used to:
  • the image acquisition module 721 may be used to:
  • the above-mentioned human appearance feature may include a human face feature.
  • the first extraction module 722 may be used to:
  • a pre-trained convolutional neural network is used to extract facial features from facial sub-images.
  • the above-mentioned human appearance characteristics may further include body shape characteristics and posture characteristics.
  • the voiceprint feature described above may include Mel frequency cepstrum coefficients.
  • the second extraction module 723 may include:
  • the frequency spectrum acquisition unit is used to acquire the frequency spectrum corresponding to the sub audio
  • the amplitude spectrum conversion unit is used to calculate the corresponding amplitude spectrum according to the frequency spectrum
  • the filter processing unit is used to perform Mel filter processing on the amplitude spectrum to calculate the Mel frequency cepstrum coefficient of the sub audio.
  • the spectrum acquisition unit may include:
  • the preprocessing unit is used to preprocess the sub audio
  • the Fourier transform unit is used to perform Fourier transform on the preprocessed sub-audio to obtain the frequency spectrum corresponding to the sub-audio.
  • the foregoing preprocessing may include any one or more of the following: extracting speech signals, pre-emphasis, framing, and windowing;
  • the second extraction module 723 may further include:
  • the post-processing unit is used to convert the Mel frequency cepstrum coefficients into voiceprint feature vectors and perform post-processing.
  • the aforementioned post-processing may include any one or more of the following: de-averaging, normalization, and dimensionality reduction processing.
  • the voiceprint feature may further include any one or more of the following: generalized Mel cepstrum coefficients, spectral envelope and energy features, fundamental frequency, voiced/light tone classification features, and band aperiodic components.
  • the feature processing module 724 can be used to:
  • the feature processing module 724 can be used to:
  • the comprehensive features are input into the fusion model, and the person recognition result of the target video is output.
  • the feature processing module 724 can be used to:
  • the feature processing module 724 is further used for:
  • the above-mentioned temporal characteristics are determined.
  • the video person recognition device 800 may include:
  • the image acquisition module 810 is used to acquire key frame images from the target video
  • the first extraction module 820 is used to extract the appearance features of the person from the key frame image
  • the second extraction module 830 is configured to intercept the sub audio corresponding to the key frame image from the audio of the target video according to the time of the key frame image in the target video, and extract voiceprint features from the sub audio;
  • the feature processing module 840 is configured to process the above-mentioned character appearance features and voiceprint features by using a pre-trained fusion model to obtain a character recognition result of the target video.
  • the image acquisition module 810 may be used to:
  • the image acquisition module 810 may be used to:
  • the above-mentioned human appearance feature may include a human face feature.
  • the first extraction module 820 may be used for:
  • a pre-trained convolutional neural network is used to extract facial features from facial sub-images.
  • the above-mentioned human appearance characteristics may further include body shape characteristics and posture characteristics.
  • the voiceprint feature described above may include Mel frequency cepstrum coefficients.
  • the second extraction module 830 may include:
  • the frequency spectrum acquisition unit is used to acquire the frequency spectrum corresponding to the sub audio
  • the amplitude spectrum conversion unit is used to calculate the corresponding amplitude spectrum according to the frequency spectrum
  • the filter processing unit is used to perform Mel filter processing on the amplitude spectrum to calculate the Mel frequency cepstrum coefficient of the sub audio.
  • the spectrum acquisition unit may include:
  • the preprocessing unit is used to preprocess the sub audio
  • the Fourier transform unit is used to perform Fourier transform on the preprocessed sub-audio to obtain the frequency spectrum corresponding to the sub-audio.
  • the foregoing preprocessing may include any one or more of the following: extracting speech signals, pre-emphasis, framing, and windowing;
  • the second extraction module 830 may further include:
  • the post-processing unit is used to convert the Mel frequency cepstrum coefficients into voiceprint feature vectors and perform post-processing.
  • the aforementioned post-processing may include any one or more of the following: de-averaging, normalization, and dimensionality reduction processing.
  • the voiceprint feature may further include any one or more of the following: generalized Mel cepstrum coefficients, spectral envelope and energy features, fundamental frequency, voiced/light tone classification features, and band aperiodic components.
  • the feature processing module 840 may be used to:
  • the feature processing module 840 may be used to:
  • the comprehensive features are input into the fusion model, and the person recognition result of the target video is output.
  • the feature processing module 840 may be used to:
  • the feature processing module 840 is further used for:
  • the above-mentioned temporal characteristics are determined.
  • Exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which is stored a program product capable of implementing the above-mentioned method of this specification.
  • various aspects of the present disclosure can also be implemented in the form of a program product, which includes program code.
  • When the program product runs on an electronic device, the program code is used to make the electronic device execute the steps according to various exemplary embodiments of the present disclosure described in the above "Exemplary Method" section of this specification.
  • the program product can be implemented as a portable compact disk read-only memory (CD-ROM) and include program code, and can be run on an electronic device, such as a personal computer.
  • the program product of the present disclosure is not limited thereto.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, device, or device.
  • the program product can adopt any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
  • the program code for performing the operations of the present disclosure can be written in any combination of one or more programming languages.
  • the programming languages include object-oriented programming languages, such as Java, C++, etc., as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, via the Internet using an Internet service provider).
  • Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method.
  • the electronic device 900 according to this exemplary embodiment of the present disclosure will be described below with reference to FIG. 9.
  • the electronic device 900 shown in FIG. 9 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.
  • the electronic device 900 may be in the form of a general-purpose computing device.
  • the components of the electronic device 900 may include but are not limited to: at least one processing unit 910, at least one storage unit 920, a bus 930 connecting different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.
  • the storage unit 920 stores program codes, and the program codes can be executed by the processing unit 910 so that the processing unit 910 executes the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "Exemplary Method" section of this specification.
  • the processing unit 910 may execute any one or more method steps in FIGS. 1 to 6.
  • the storage unit 920 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 921 and/or a cache storage unit 922, and may further include a read-only storage unit (ROM) 923.
  • the storage unit 920 may also include a program/utility tool 924 having a set of (at least one) program module 925.
  • program module 925 includes but is not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples or some combination thereof may include an implementation of a network environment.
  • the bus 930 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
  • the electronic device 900 may also communicate with one or more external devices 1000 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 900 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 950.
  • the electronic device 900 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 960.
  • the network adapter 960 communicates with other modules of the electronic device 900 through the bus 930. It should be understood that although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • the exemplary implementations described here can be implemented by software, or by combining software with the necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a portable hard disk, etc.) or on a network, and which includes several instructions to make a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the exemplary embodiments of the present disclosure.
  • Although modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

A video figure recognition method, a video figure recognition apparatus, a storage medium, and an electronic device. The method comprises: obtaining a key frame image from a target video (S110); extracting a figure appearance feature from the key frame image (S120); intercepting a sub-audio corresponding to the key frame image from the audio of the target video according to the time of the key frame image in the target video, and extracting a voiceprint feature from the sub-audio (S130); and processing the figure appearance feature and the voiceprint feature by using a pre-trained fusion model to obtain a figure recognition result of the target video (S140). The multi-modal features in the video can be fused, achieving relatively high figure recognition accuracy; the present invention is applicable to situations in which the definition of a face image in the video is not high or the face image is occluded, and has relatively high robustness.

Description

Video character recognition method, device, storage medium and electronic equipment
This application claims priority to the Chinese patent application filed on October 28, 2019, with application number 201911029707.4 and titled "Video Character Recognition Method, Device, Storage Medium and Electronic Equipment", the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and in particular to a video character recognition method, a video character recognition device, a computer-readable storage medium, and electronic equipment.
Background
Video person recognition refers to identifying the identity of a person in a video in order to classify the video, add person tags, and so on. It has important applications in scenarios such as security, video classification, video content review, and smart photo albums.
In related technologies, video person recognition is mainly implemented based on face recognition in video images: an image containing a face is detected from the video, and the face in the image is then further and precisely recognized to determine the identity of the person. This method places high requirements on the sharpness of the face image; when the face image is not clear enough or is occluded, the accuracy of the recognition result is low.
Summary of the Invention
The present disclosure provides a video character recognition method, a video character recognition device, a computer-readable storage medium, and electronic equipment, thereby improving the accuracy of video character recognition at least to a certain extent.
According to a first aspect of the present disclosure, there is provided a method for recognizing a person in a video, including: acquiring a key frame image from a target video; extracting a character appearance feature from the key frame image; intercepting, according to the time of the key frame image in the target video, the sub-audio corresponding to the key frame image from the audio of the target video, and extracting a voiceprint feature from the sub-audio; and processing the character appearance feature and the voiceprint feature using a pre-trained fusion model to obtain a person recognition result of the target video.
According to a second aspect of the present disclosure, there is provided a video character recognition device, including a processor, wherein the processor is configured to execute the following program modules stored in a memory: an image acquisition module for acquiring a key frame image from a target video; a first extraction module for extracting a character appearance feature from the key frame image; a second extraction module for intercepting, according to the time of the key frame image in the target video, the sub-audio corresponding to the key frame image from the audio of the target video, and extracting a voiceprint feature from the sub-audio; and a feature processing module for processing the character appearance feature and the voiceprint feature using a pre-trained fusion model to obtain a person recognition result of the target video.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the above-mentioned video character recognition method and possible implementations thereof are realized.
According to a fourth aspect of the present disclosure, there is provided an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the above-mentioned video character recognition method and possible implementations thereof by executing the executable instructions.
The technical solution of the present disclosure has the following beneficial effects:
According to the above video character recognition method, video character recognition device, computer-readable storage medium and electronic equipment, the key frame image is obtained from the target video and the character appearance feature is extracted; the sub-audio is then intercepted according to the time of the key frame image in the target video and the voiceprint feature is extracted from it; finally, the pre-trained fusion model processes the appearance feature and the voiceprint feature to obtain the person recognition result of the target video. On the one hand, the character appearance feature reflects the image aspect of the video and the voiceprint feature reflects the sound aspect, so the "sound" and "picture" information of the video is exploited jointly; fusing these multi-modal features and performing person recognition on them can achieve higher recognition accuracy. On the other hand, the multi-modal features can compensate, to a certain extent, for the absence of either single feature, so the technical solution of the present disclosure can be applied to situations where the face image in the video has low definition or is occluded, and has high robustness. Furthermore, the key frame image and the sub-audio are matched in time and correspond to each other, which reduces cases where the image feature and the sound feature are out of sync and improves the accuracy of video character recognition.
Description of the Drawings
Fig. 1 shows a flowchart of a video character recognition method in this exemplary embodiment;
Fig. 2 shows a flowchart of a method for extracting facial features in this exemplary embodiment;
Fig. 3 shows a flowchart of a method for extracting voiceprint features in this exemplary embodiment;
Fig. 4 shows a flowchart of a method for obtaining a frequency spectrum in this exemplary embodiment;
Fig. 5 shows a flowchart of a method for processing character appearance features and voiceprint features in this exemplary embodiment;
Fig. 6 shows a flowchart of another method for processing character appearance features and voiceprint features in this exemplary embodiment;
Fig. 7 shows a structural block diagram of a video character recognition device in this exemplary embodiment;
Fig. 8 shows a structural block diagram of another video character recognition device in this exemplary embodiment;
Fig. 9 shows an electronic device for implementing the above methods in this exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that the present disclosure will be more comprehensive and complete and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures or characteristics can be combined in one or more embodiments in any suitable way. In the following description, many specific details are provided to give a sufficient understanding of the embodiments of the present disclosure. However, those skilled in the art will realize that the technical solutions of the present disclosure can be practiced while omitting one or more of the specific details, or that other methods, components, devices, steps, etc. can be used. In other cases, well-known technical solutions are not shown or described in detail to avoid obscuring aspects of the present disclosure.
In addition, the drawings are only schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, so their repeated description will be omitted. Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different network and/or processor devices and/or microcontroller devices.
In the related art, face recognition based on video images places high requirements on the clarity of the face image. When the face image is not clear enough or is occluded, the accuracy of the recognition result is low.
In addition, the inventors found that a video actually contains multi-modal information, including images, speech and other aspects, while the related art identifies people only through face recognition in the image and fails to make full use of this multi-modal information, which is one of the main reasons for its low accuracy in identifying people.
In view of one or more of the above problems, the exemplary embodiments of the present disclosure first provide a video character recognition method. The method can be applied to the server of a video service platform, for example to perform character recognition on the videos on the platform from the server side in order to add people tags that make searching convenient for users; it can also be applied to terminal devices such as personal computers and smartphones, for example to recognize people in videos shot or downloaded by the user in order to classify them by person automatically.
Fig. 1 shows a process of this exemplary embodiment, which may include the following steps S110 to S140:
Step S110: obtain a key frame image from the target video;
Step S120: extract a character appearance feature from the key frame image;
Step S130: according to the time of the key frame image in the target video, intercept the sub-audio corresponding to the key frame image from the audio of the target video, and extract a voiceprint feature from the sub-audio;
Step S140: use the pre-trained fusion model to process the above character appearance feature and voiceprint feature to obtain a person recognition result of the target video.
Each of the above steps is described in detail below.
步骤S110中,从目标视频中获取关键帧图像。In step S110, a key frame image is obtained from the target video.
其中,关键帧图像是指目标视频中包含人脸外观的图像,可以提取一个关键帧,也可以提取多个关键帧,本公开对其数量不做限定。下面就如何确定关键帧提供几个具体实施方式:Among them, the key frame image refers to an image containing the appearance of a human face in the target video. One key frame or multiple key frames may be extracted, and the number of them is not limited in the present disclosure. Several specific implementation methods are provided below on how to determine the key frame:
(1)采用固定间隔的方式,在目标视频中,每间隔固定的时长或固定的帧数,选取一帧作为关键帧,例如可以每个3帧提取一个关键帧图像。(1) Using a fixed interval method, in the target video, each interval is fixed for a fixed length of time or a fixed number of frames, and one frame is selected as a key frame. For example, a key frame image can be extracted for every 3 frames.
(2) Frames that contain a person and frames that do not are detected in the target video. Frames that do not contain a person are marked as background frames, and the background frames are used as split points to divide the target video into multiple sub-videos, each of which consists of consecutive frames containing a person. The person in each sub-video can be regarded as the same person, so at least one frame is extracted from each sub-video as a key frame.
(3) Considering that decoding of video frames is usually required when extracting a complete image from a video, intra-coded frames may also be extracted from the target video and decoded to obtain the key frame images.
An intra-coded frame (I frame) is a frame that is independently coded from a single image; it is a complete preservation of that image, and only the data of the frame itself is needed for decoding. Corresponding to the I frame are the predictive frame (P frame) and the bi-predictive frame (B frame). A P frame records its difference from a previous frame, so decoding a P frame requires reference to previous frame data; a B frame records its differences from both previous and subsequent frames, so both previous and subsequent frame data are required for complete decoding.
It can be seen from the above that if a P frame or a B frame is determined to be a key frame, the I frame needs to be decoded first when obtaining the key frame image, and then the target P frame or B frame is decoded according to the differences between frames, which is less efficient. Therefore, the I frames can be used directly as key frames: the I frames are extracted from the target video and decoded to obtain the key frame images. In this way, only the key frame images need to be decoded independently, without decoding other frames; the number of frames to be decoded is minimal, and the key frame images are extracted at the fastest speed.
To further improve efficiency, if multiple I frames are selected as key frames, multiple threads can be invoked during decoding so that each thread decodes one I frame. Video tools (such as video playback software and editing software) usually include a decoder for decoding video frames. In this exemplary embodiment, the decoder can be embedded into the video person recognition program and the threading code modified. When the video person recognition process starts and N I frames are obtained as key frames, N threads are started accordingly, and the decoding task of each I frame is assigned to a corresponding thread; each thread executes its decoding task independently, so that the extraction of key frame images is completed quickly in a concurrent manner.
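A minimal sketch of this concurrent extraction is given below (Python, assuming OpenCV). OpenCV itself does not expose picture types, so the I-frame indices are assumed to have been determined beforehand; this illustrates only the "one thread per key frame" idea and is not the patented implementation.

```python
# Hypothetical sketch: decode a list of (assumed) I-frame indices concurrently.
from concurrent.futures import ThreadPoolExecutor

import cv2  # pip install opencv-python


def decode_frame(video_path: str, frame_index: int):
    """Open the video, seek to frame_index, and decode that single frame."""
    cap = cv2.VideoCapture(video_path)           # each thread uses its own capture handle
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
    ok, frame = cap.read()                       # BGR image array, or None on failure
    cap.release()
    return frame_index, frame if ok else None


def extract_key_frames(video_path: str, key_indices):
    # One thread per key frame, mirroring the "N I frames -> N threads" idea.
    with ThreadPoolExecutor(max_workers=max(1, len(key_indices))) as pool:
        results = list(pool.map(lambda i: decode_frame(video_path, i), key_indices))
    return {idx: img for idx, img in results if img is not None}


# Usage: the indices 0, 250, 500 are placeholders for I-frame positions.
# key_frames = extract_key_frames("target_video.mp4", [0, 250, 500])
```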
It should be noted that, to facilitate subsequent processing, a fixed number of key frame images may be obtained in step S110, for example 64 or 128 key frame images. When collecting the key frames, the relevant parameters can be determined according to this number, for example: calculating the interval duration or number of frames in method (1) above; determining the number of key frames extracted from each sub-video in method (2) above; determining the number of I frames to be extracted in method (3) above, where, if the number of I frames in the video to be classified is insufficient, P frames or B frames can be extracted to make up the difference.
In addition, this exemplary embodiment may also use the above three methods in combination, for example, combining methods (2) and (3) by selecting I frames as key frames within each sub-video, and so on.
Continuing to refer to Fig. 1, in step S120, a person appearance feature is extracted from the key frame image.
In this exemplary embodiment, a machine learning model may be used to extract the person appearance feature from the key frame image. This is not classification or recognition of the key frame image, so there is no limitation on what type of data the machine learning model ultimately outputs. The advantage is that the type of label is not restricted when training the convolutional neural network: whichever label is readily available or easy to obtain can be used for training. For example, an open-source portrait dataset containing a large number of face images and their classification labels can be used to train a convolutional neural network for image classification, which is then used to extract face features in step S120. The key frame image can be input into the convolutional neural network and, after a series of convolution and pooling operations, features are extracted from a fully connected layer. The first fully connected layer, whose features are denser, may be selected, or a subsequent fully connected layer, whose data volume is usually smaller; the present disclosure does not limit this.
Person appearance features may include face features, body-shape features, posture features, and the like. Face features include information such as the position, proportion, shape, and expression of each part of the face; body-shape features include information such as the position, proportion, and shape of the body and limbs; posture features include information such as the person's movements and poses. Among these, face features are relatively more important for person recognition. In an optional implementation, when the person appearance feature includes a face feature, referring to Fig. 2, step S120 may be specifically implemented through the following steps S210 and S220:
Step S210: detect a face region in the key frame image to intercept a face sub-image from the key frame image;
Step S220: extract a face feature from the face sub-image by using a pre-trained convolutional neural network.
The face region can be recognized by algorithms such as contour detection. For example, the key frame image can be input into the face detection network RetinaFace to detect the region where the face is located in the image and the coordinates of the facial key points. The face region is cut out from the key frame image to obtain a face sub-image, which filters out scenes, objects, and other image content irrelevant to person recognition. The face sub-image is then input into the pre-trained convolutional neural network, and the face feature is obtained from a fully connected layer of the network. In this exemplary embodiment, the dimension of the face feature can be set according to actual needs; for example, if the first fully connected layer of the convolutional neural network is set to 512 dimensions, then after the face sub-image is input, a 512-dimensional face feature can be extracted from that fully connected layer.
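A minimal sketch of this step is shown below, assuming a torchvision ResNet-18 backbone as a stand-in for the pre-trained network (the text does not prescribe a specific architecture) and assuming the face bounding box has already been produced by a detector such as RetinaFace:

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Backbone whose final fully connected layer outputs a 512-dimensional face feature.
# ResNet-18 is only an assumed stand-in for the pre-trained CNN mentioned in the text.
backbone = models.resnet18(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 512)
backbone.eval()

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((112, 112)),   # assumed input size for the face sub-image
    transforms.ToTensor(),
])


def extract_face_feature(key_frame, face_box):
    """Crop the detected face region and return a 512-d face feature vector."""
    x1, y1, x2, y2 = face_box                  # box assumed to come from RetinaFace
    face_sub_image = key_frame[y1:y2, x1:x2]   # crop the face region (H, W, 3 array)
    with torch.no_grad():
        inp = preprocess(face_sub_image).unsqueeze(0)  # (1, 3, 112, 112)
        feature = backbone(inp)                        # (1, 512)
    return feature.squeeze(0)
```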
Continuing to refer to Fig. 1, in step S130, according to the time of the key frame image in the target video, the sub-audio corresponding to the key frame image is intercepted from the audio of the target video, and a voiceprint feature is extracted from the sub-audio.
The intercepted sub-audio is the audio segment corresponding to the key frame image. For example, if the time of the key frame image in the target video is 9.670 seconds, a window of sub-audio can be intercepted from the audio of the target video centered on that time. In other words, the sub-audio and the key frame image should be "audio-visually synchronized". The present disclosure does not specifically limit the duration of the sub-audio. A fixed duration may be used, for example 3 or 5 seconds based on the average time a person takes to say a sentence in a typical video; alternatively, abrupt change points in the audio on both sides of the key frame time, for example points where the speech content or frequency changes suddenly, may be detected, and the audio segment between two such change points may be intercepted.
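With a fixed window, the interception amounts to slicing the waveform around the key-frame timestamp. A sketch assuming librosa for loading a separately extracted audio track follows; the 16 kHz sampling rate and 3-second window are assumptions, not values given by the text:

```python
import librosa


def cut_sub_audio(audio_path: str, key_frame_time: float, window_s: float = 3.0):
    """Return the waveform of a window centered on the key frame time."""
    y, sr = librosa.load(audio_path, sr=16000)        # mono waveform at an assumed 16 kHz
    half = window_s / 2.0
    start = max(0, int((key_frame_time - half) * sr))
    end = min(len(y), int((key_frame_time + half) * sr))
    return y[start:end], sr


# Usage: the 9.670 s timestamp matches the example in the text.
# sub_audio, sr = cut_sub_audio("target_video_audio.wav", 9.670)
```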
Since every person's speaking voice is unique, voiceprint features can strongly reflect the individual characteristics of each person's speech. In an optional implementation, the voiceprint feature may include Mel-Frequency Cepstral Coefficients (MFCC). Referring to Fig. 3, extracting the voiceprint feature from the sub-audio may specifically include the following steps S310 to S330:
Step S310: obtain the spectrum corresponding to the sub-audio.
In an audio signal, the spectrum can be regarded as the collection of the frequency, phase, and amplitude of each sinusoidal component; the corresponding spectrum can be obtained by sampling the signal in the sub-audio.
Step S320: calculate the corresponding amplitude spectrum from the spectrum.
The amplitude spectrum refers to the spectral lines formed by the amplitudes of the sinusoidal components at different frequencies in the spectrum. The corresponding amplitude spectrum can be obtained by decomposing and computing the spectrum of the sub-audio.
Step S330: perform Mel filtering on the amplitude spectrum to calculate the Mel-frequency cepstral coefficients of the sub-audio.
Since the human ear has different, nonlinearly related sensitivities to different frequencies, the amplitude spectrum can be divided among multiple Mel filter banks according to the sensitivity of the human ear, with the center frequencies of the filters distributed linearly at equal intervals. By performing Mel filtering on the amplitude spectrum of the sub-audio with the Mel filter bank, the Mel-frequency cepstral coefficients can be calculated.
In an optional implementation, referring to Fig. 4, step S310 may be implemented through the following steps S410 and S420:
Step S410: preprocess the sub-audio.
The preprocessing may include any one or more of the following: speech signal extraction, pre-emphasis, framing, and windowing. Speech signal extraction means filtering out non-human-voice signals such as background sound and noise from the sub-audio, retaining only the human speech signal; pre-emphasis is a signal processing method that compensates the high-frequency components of the sub-audio; framing splits the sub-audio frame by frame to facilitate subsequent feature extraction; windowing limits the signal through a preset window size, substituting each frame into a window function and setting values outside the window to 0 so as to eliminate the signal discontinuities that may occur at both ends of each frame. In this exemplary embodiment, a rectangular window, a Hamming window, or the like may be used as the window function.
Step S420: perform a Fourier transform on the preprocessed sub-audio to obtain the spectrum corresponding to the sub-audio.
Since it is usually difficult to see the characteristics of an audio signal from its variation in the time domain, it is converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different speech. Performing a Fourier transform on the sub-audio extracts its frequency-domain features, and plotting the sub-audio signal as a frequency-density curve yields the spectrum corresponding to the sub-audio.
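A minimal sketch of the chain described in steps S410, S420 and S310 to S330 (pre-emphasis, framing and windowing, FFT, Mel filtering, MFCC) using librosa is given below; the frame length, hop length, number of Mel bands, and number of coefficients are assumptions rather than values prescribed by the text:

```python
import librosa
import numpy as np


def compute_mfcc(sub_audio: np.ndarray, sr: int, n_mfcc: int = 20):
    # Pre-emphasis: boost high-frequency components (coefficient 0.97 is an assumption).
    emphasized = librosa.effects.preemphasis(sub_audio, coef=0.97)

    # Framing + Hamming windowing + FFT -> complex spectrum of each frame (step S420).
    stft = librosa.stft(emphasized, n_fft=512, hop_length=160, window="hamming")
    amplitude_spectrum = np.abs(stft)                  # step S320: amplitude spectrum

    # Mel filter bank applied to the power spectrum, then log + DCT -> MFCC (step S330).
    mel_spec = librosa.feature.melspectrogram(S=amplitude_spectrum ** 2, sr=sr, n_mels=40)
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel_spec), n_mfcc=n_mfcc)
    return mfcc                                        # shape (n_mfcc, n_frames)
```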
In an optional implementation, after step S330, the Mel-frequency cepstral coefficients can also be converted into a voiceprint feature vector and post-processed. Mel-frequency cepstral coefficients are high-dimensional dense features; representing them in vector form, i.e., as a voiceprint feature vector, allows them to be processed by a machine learning model. In this exemplary embodiment, post-processing the voiceprint feature vector can optimize the subsequent flow. The post-processing may adopt any one or more of the following: mean removal, normalization, and dimensionality reduction. Mean removal subtracts the mean of each dimension from the data of that dimension so that the data are centered at 0, preventing adverse effects such as underfitting. Normalization unifies the data to a standard numerical scale, which benefits the subsequent feature fusion and computation. Dimensionality reduction is implemented by algorithms such as PCA (Principal Components Analysis), which discards the dimensions that carry little information and retains the main feature information, generally replacing the original large number of correlated features with a few representative, mutually uncorrelated features, thereby speeding up the subsequent processing.
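A sketch of this post-processing, assuming scikit-learn's StandardScaler and PCA over a batch of voiceprint vectors (the target dimensionality of 128 is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def postprocess_voiceprints(voiceprint_vectors: np.ndarray, out_dim: int = 128):
    """Mean removal + normalization, then PCA dimensionality reduction.

    voiceprint_vectors: array of shape (n_samples, n_features), e.g. flattened MFCCs.
    """
    scaler = StandardScaler()            # removes the mean and scales to unit variance
    normalized = scaler.fit_transform(voiceprint_vectors)

    pca = PCA(n_components=out_dim)      # keep only the principal components
    reduced = pca.fit_transform(normalized)
    return reduced
```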
Further, the voiceprint feature extracted in step S130 may also include any one or more of the following: generalized Mel cepstral coefficients, spectral envelope and energy features, fundamental frequency, voiced/unvoiced classification features, and band aperiodic components. The generalized Mel cepstral coefficients follow essentially the same principle as the Mel-frequency cepstral coefficients and are high-dimensional features (for example, 180 dimensions) with some differences in the specific coefficients; they can serve as a replacement for or a supplement to the Mel-frequency cepstral coefficients. The spectral envelope and energy features are related to the speech content; the fundamental frequency, voiced/unvoiced classification features, and band aperiodic components are related to basic pronunciation information and are usually relatively sparse features that can supplement the Mel-frequency cepstral coefficients. The richer the dimensions of the voiceprint feature, the more accurately it characterizes the video person, and the more it benefits accurate video person recognition.
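Where these supplementary features are wanted, one possible route is the WORLD vocoder via the pyworld package; this is an assumed tool choice, not one named by the text, and the sub-audio is assumed to be a mono waveform:

```python
import numpy as np
import pyworld as pw  # pip install pyworld


def extract_prosodic_features(sub_audio: np.ndarray, sr: int):
    x = sub_audio.astype(np.float64)           # pyworld expects float64 input
    f0, t = pw.dio(x, sr)                      # coarse fundamental frequency track
    f0 = pw.stonemask(x, f0, t, sr)            # refined fundamental frequency
    sp = pw.cheaptrick(x, f0, t, sr)           # spectral envelope
    ap = pw.d4c(x, f0, t, sr)                  # band aperiodic components
    voiced_flag = (f0 > 0).astype(np.float32)  # simple voiced/unvoiced indicator
    return f0, sp, ap, voiced_flag
```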
Continuing to refer to Fig. 1, in step S140, the person appearance feature and the voiceprint feature are processed by the pre-trained fusion model to obtain the person recognition result of the target video.
The person appearance feature is extracted from the image aspect, and the voiceprint feature is extracted from the sound aspect; their fusion represents the multi-modal feature of the target video. In fact, by processing the image feature and the voiceprint feature in steps S120 and S130, feature data in vector or matrix form are obtained, so the fusion of the multi-modal features can easily be achieved and then processed by the fusion model to obtain the person recognition result of the target video.
In one implementation, the fusion model can be configured with two input channels, one for inputting the person appearance feature and one for inputting the voiceprint feature. Referring to Fig. 5, step S140 may include the following steps S510 to S530:
Step S510: input the person appearance feature and the voiceprint feature into the two input channels of the fusion model, respectively.
Step S520: process the person appearance feature and the voiceprint feature separately with the fusion model to obtain an intermediate feature corresponding to the person appearance feature and an intermediate feature corresponding to the voiceprint feature; these two intermediate features respectively represent abstract information about the video person in terms of appearance and voice.
Step S530: perform fusion computation on the intermediate feature corresponding to the person appearance feature and the intermediate feature corresponding to the voiceprint feature, and output the person recognition result of the target video. In this way, the appearance and sound information can be combined and associated to realize the fusion of the two modalities, and, after further recognition processing, a comprehensive person recognition result is finally output.
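A minimal PyTorch sketch of such a two-channel fusion model follows; the layer sizes and the number of identity classes are assumptions for illustration rather than values given in the text:

```python
import torch
import torch.nn as nn


class TwoChannelFusionModel(nn.Module):
    """Each input channel maps its modality to an intermediate feature; the two are then fused."""

    def __init__(self, face_dim=512, voice_dim=512, hidden_dim=256, num_persons=100):
        super().__init__()
        self.face_branch = nn.Sequential(nn.Linear(face_dim, hidden_dim), nn.ReLU())
        self.voice_branch = nn.Sequential(nn.Linear(voice_dim, hidden_dim), nn.ReLU())
        # Fusion computation over the concatenated intermediate features.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim * 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_persons),
        )

    def forward(self, appearance_feature, voiceprint_feature):
        face_mid = self.face_branch(appearance_feature)    # intermediate feature (appearance)
        voice_mid = self.voice_branch(voiceprint_feature)  # intermediate feature (voiceprint)
        fused = torch.cat([face_mid, voice_mid], dim=-1)   # fusion of the two modalities
        return self.classifier(fused)                      # person recognition logits
```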
In another implementation, referring to Fig. 6, step S140 may include the following steps S610 and S620:
Step S610: merge the person appearance feature and the voiceprint feature to obtain a combined feature;
Step S620: input the combined feature into the fusion model and output the person recognition result of the target video.
The person appearance feature and the voiceprint feature can be merged by concatenation. For example, in step S120 a 512-dimensional face feature is obtained through the convolutional neural network, and in step S130 512-dimensional Mel-frequency cepstral coefficients (i.e., the voiceprint feature) are extracted; the two are concatenated into a 1024-dimensional combined feature and input into the fusion model for processing.
The fusion model can adopt an ordinary neural network model, or its structure can be optimized according to actual needs. For example, MobileNet (an open-source neural network for mobile terminals) may be adopted, with data augmentation mechanisms set up in it, including a Dropout layer, random noise, and the like; the input combined feature is standardized, the number of channels of the fully connected layer is set to 1024, a PReLU (Parametric Rectified Linear Unit) layer is used for activation, and BCE Loss (Binary Cross Entropy Loss) is used as the loss function.
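A sketch of the concatenation-based variant as a small fully connected head with Dropout, PReLU, and BCE loss is given below; it stands in for the MobileNet-based configuration described above, and all layer sizes, the BatchNorm-based standardization, and the multi-label training setup are assumptions:

```python
import torch
import torch.nn as nn


class ConcatFusionHead(nn.Module):
    """Scores person identities from the 1024-d combined feature (512-d face + 512-d voiceprint)."""

    def __init__(self, in_dim=1024, num_persons=100, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.BatchNorm1d(in_dim),        # assumed stand-in for standardizing the combined feature
            nn.Dropout(dropout),           # data-augmentation / regularization mechanism
            nn.Linear(in_dim, 1024),       # fully connected layer with 1024 channels
            nn.PReLU(),                    # parametric ReLU activation
            nn.Linear(1024, num_persons),
        )

    def forward(self, combined_feature):
        return self.net(combined_feature)  # raw logits per person tag


# Assumed training setup with binary cross entropy over person tags:
# combined = torch.cat([face_feat, voice_feat], dim=-1)        # (batch, 1024)
# logits = ConcatFusionHead()(combined)
# loss = nn.BCEWithLogitsLoss()(logits, target_labels.float())
```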
Further, a time feature can also be determined according to the time of the key frame image in the target video and the time interval of the sub-audio in the target video. When merging the features, the person appearance feature, the voiceprint feature, and the time feature are merged to obtain the combined feature.
For the case where the key frame image is a single frame, the time feature may contain 2 or 3 dimensions: one dimension records the time of the key frame, and the remaining dimensions record the time interval of the sub-audio, for example the start time and the end time, or the center time and the sub-audio duration; for the case of multiple key frame images, the dimensionality of the time feature is set according to the number of frames. The time feature amounts to a supplement to the multi-modal features: adding time information on top of the face feature and the voiceprint feature helps improve the completeness and richness of the combined feature, thereby improving the accuracy of person recognition.
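A sketch of assembling such a combined feature for a single key frame, taking the three-dimensional layout (key-frame time, sub-audio start, sub-audio end) as one of the options the text describes:

```python
import numpy as np


def build_combined_feature(face_feature: np.ndarray,
                           voiceprint_feature: np.ndarray,
                           key_frame_time: float,
                           sub_audio_start: float,
                           sub_audio_end: float) -> np.ndarray:
    """Concatenate appearance, voiceprint, and time features into one vector."""
    time_feature = np.array([key_frame_time, sub_audio_start, sub_audio_end],
                            dtype=np.float32)
    return np.concatenate([face_feature, voiceprint_feature, time_feature])

# e.g. 512-d face feature + 512-d MFCC vector + 3-d time feature -> 1027-d combined feature
```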
In summary, in this exemplary embodiment, a key frame image is obtained from the target video and a person appearance feature is extracted; a sub-audio is then intercepted according to the time of the key frame image in the target video, and a voiceprint feature is extracted from the sub-audio; finally, a pre-trained fusion model is used to process the person appearance feature and the voiceprint feature to obtain the person recognition result of the target video. On the one hand, the person appearance feature reflects the image aspect and the voiceprint feature reflects the sound aspect, so the fact that a video carries both "sound" and "picture" information is exploited, multi-modal features are fused, and person recognition is performed based on feature processing, achieving high recognition accuracy. On the other hand, the multi-modal features can compensate, to a certain extent, for the absence of either type of feature, so this exemplary embodiment is applicable to situations where the face image in the video is of low clarity or is occluded, and it has high robustness. Furthermore, the key frame image and the sub-audio are matched in time and correspond to each other, which reduces cases where the image feature and the sound feature are out of sync and improves the accuracy of video person recognition.
Exemplary embodiments of the present disclosure also provide a video person recognition apparatus. As shown in Fig. 7, the video person recognition apparatus 700 may include a processor 710 and a memory 720. The memory 720 includes the following program modules:
an image acquisition module 721, configured to obtain a key frame image from a target video;
a first extraction module 722, configured to extract a person appearance feature from the key frame image;
a second extraction module 723, configured to intercept, according to the time of the key frame image in the target video, a sub-audio corresponding to the key frame image from the audio of the target video, and extract a voiceprint feature from the sub-audio;
a feature processing module 724, configured to process the person appearance feature and the voiceprint feature by using a pre-trained fusion model to obtain a person recognition result of the target video.
The processor 710 is configured to execute the above program modules.
In an optional implementation, the image acquisition module 721 may be configured to:
extract intra-coded frames from the target video and decode them to obtain the key frame images.
In an optional implementation, the image acquisition module 721 may be configured to:
invoke multiple threads, each of which decodes one intra-coded frame.
In an optional implementation, the person appearance feature may include a face feature.
In an optional implementation, the first extraction module 722 may be configured to:
detect a face region in the key frame image to intercept a face sub-image from the key frame image;
extract a face feature from the face sub-image by using a pre-trained convolutional neural network.
In an optional implementation, the person appearance feature may further include a body-shape feature and a posture feature.
In an optional implementation, the voiceprint feature may include Mel-frequency cepstral coefficients.
In an optional implementation, the second extraction module 723 may include:
a spectrum acquisition unit, configured to obtain the spectrum corresponding to the sub-audio;
an amplitude spectrum conversion unit, configured to calculate the corresponding amplitude spectrum from the spectrum;
a filter processing unit, configured to perform Mel filtering on the amplitude spectrum to calculate the Mel-frequency cepstral coefficients of the sub-audio.
In an optional implementation, the spectrum acquisition unit may include:
a preprocessing unit, configured to preprocess the sub-audio;
a Fourier transform unit, configured to perform a Fourier transform on the preprocessed sub-audio to obtain the spectrum corresponding to the sub-audio.
In an optional implementation, the preprocessing may include any one or more of the following: speech signal extraction, pre-emphasis, framing, and windowing.
In an optional implementation, the second extraction module 723 may further include:
a post-processing unit, configured to convert the Mel-frequency cepstral coefficients into a voiceprint feature vector and perform post-processing.
In an optional implementation, the post-processing may include any one or more of the following: mean removal, normalization, and dimensionality reduction.
In an optional implementation, the voiceprint feature may further include any one or more of the following: generalized Mel cepstral coefficients, spectral envelope and energy features, fundamental frequency, voiced/unvoiced classification features, and band aperiodic components.
In an optional implementation, the feature processing module 724 may be configured to:
input the person appearance feature and the voiceprint feature into the two input channels of the fusion model, respectively;
process the person appearance feature and the voiceprint feature separately with the fusion model to obtain an intermediate feature corresponding to the person appearance feature and an intermediate feature corresponding to the voiceprint feature;
perform fusion computation on the intermediate feature corresponding to the person appearance feature and the intermediate feature corresponding to the voiceprint feature, and output the person recognition result of the target video.
In an optional implementation, the feature processing module 724 may be configured to:
merge the person appearance feature and the voiceprint feature to obtain a combined feature;
input the combined feature into the fusion model and output the person recognition result of the target video.
In an optional implementation, the feature processing module 724 may be configured to:
merge the person appearance feature, the voiceprint feature, and a time feature to obtain the combined feature.
In an optional implementation, the feature processing module 724 is further configured to:
determine the time feature according to the time of the key frame image in the target video and the time interval of the sub-audio in the target video.
Exemplary embodiments of the present disclosure also provide another video person recognition apparatus. As shown in Fig. 8, the video person recognition apparatus 800 may include:
an image acquisition module 810, configured to obtain a key frame image from a target video;
a first extraction module 820, configured to extract a person appearance feature from the key frame image;
a second extraction module 830, configured to intercept, according to the time of the key frame image in the target video, a sub-audio corresponding to the key frame image from the audio of the target video, and extract a voiceprint feature from the sub-audio;
a feature processing module 840, configured to process the person appearance feature and the voiceprint feature by using a pre-trained fusion model to obtain a person recognition result of the target video.
In an optional implementation, the image acquisition module 810 may be configured to:
extract intra-coded frames from the target video and decode them to obtain the key frame images.
In an optional implementation, the image acquisition module 810 may be configured to:
invoke multiple threads, each of which decodes one intra-coded frame.
In an optional implementation, the person appearance feature may include a face feature.
In an optional implementation, the first extraction module 820 may be configured to:
detect a face region in the key frame image to intercept a face sub-image from the key frame image;
extract a face feature from the face sub-image by using a pre-trained convolutional neural network.
In an optional implementation, the person appearance feature may further include a body-shape feature and a posture feature.
In an optional implementation, the voiceprint feature may include Mel-frequency cepstral coefficients.
In an optional implementation, the second extraction module 830 may include:
a spectrum acquisition unit, configured to obtain the spectrum corresponding to the sub-audio;
an amplitude spectrum conversion unit, configured to calculate the corresponding amplitude spectrum from the spectrum;
a filter processing unit, configured to perform Mel filtering on the amplitude spectrum to calculate the Mel-frequency cepstral coefficients of the sub-audio.
In an optional implementation, the spectrum acquisition unit may include:
a preprocessing unit, configured to preprocess the sub-audio;
a Fourier transform unit, configured to perform a Fourier transform on the preprocessed sub-audio to obtain the spectrum corresponding to the sub-audio.
In an optional implementation, the preprocessing may include any one or more of the following: speech signal extraction, pre-emphasis, framing, and windowing.
In an optional implementation, the second extraction module 830 may further include:
a post-processing unit, configured to convert the Mel-frequency cepstral coefficients into a voiceprint feature vector and perform post-processing.
In an optional implementation, the post-processing may include any one or more of the following: mean removal, normalization, and dimensionality reduction.
In an optional implementation, the voiceprint feature may further include any one or more of the following: generalized Mel cepstral coefficients, spectral envelope and energy features, fundamental frequency, voiced/unvoiced classification features, and band aperiodic components.
In an optional implementation, the feature processing module 840 may be configured to:
input the person appearance feature and the voiceprint feature into the two input channels of the fusion model, respectively;
process the person appearance feature and the voiceprint feature separately with the fusion model to obtain an intermediate feature corresponding to the person appearance feature and an intermediate feature corresponding to the voiceprint feature;
perform fusion computation on the intermediate feature corresponding to the person appearance feature and the intermediate feature corresponding to the voiceprint feature, and output the person recognition result of the target video.
In an optional implementation, the feature processing module 840 may be configured to:
merge the person appearance feature and the voiceprint feature to obtain a combined feature;
input the combined feature into the fusion model and output the person recognition result of the target video.
In an optional implementation, the feature processing module 840 may be configured to:
merge the person appearance feature, the voiceprint feature, and a time feature to obtain the combined feature.
In an optional implementation, the feature processing module 840 is further configured to:
determine the time feature according to the time of the key frame image in the target video and the time interval of the sub-audio in the target video.
The specific details of the modules/units in the above apparatus 700 and apparatus 800 have been described in detail in the method embodiments; for details that are not disclosed here, reference may be made to the method embodiments, and they will not be repeated.
Exemplary embodiments of the present disclosure also provide a computer-readable storage medium on which a program product capable of implementing the above method of this specification is stored. In some possible implementations, various aspects of the present disclosure may also be implemented in the form of a program product that includes program code; when the program product runs on an electronic device, the program code causes the electronic device to execute the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Method" section of this specification. In an optional implementation, the program product may be implemented as a portable compact disc read-only memory (CD-ROM) including program code, and may be run on an electronic device such as a personal computer. However, the program product of the present disclosure is not limited thereto; in this document, a readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
The program product may adopt any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries readable program code. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. The readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device.
The program code contained on the readable medium may be transmitted over any suitable medium, including but not limited to wireless, wired, optical cable, RF, and the like, or any suitable combination of the above.
The program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user computing device, partly on the user device, as an independent software package, partly on the user computing device and partly on a remote computing device, or entirely on a remote computing device or server. In cases involving a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet by using an Internet service provider).
Exemplary embodiments of the present disclosure also provide an electronic device capable of implementing the above method. The electronic device 900 according to this exemplary embodiment of the present disclosure is described below with reference to Fig. 9. The electronic device 900 shown in Fig. 9 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in Fig. 9, the electronic device 900 may take the form of a general-purpose computing device. The components of the electronic device 900 may include but are not limited to: at least one processing unit 910, at least one storage unit 920, a bus 930 connecting different system components (including the storage unit 920 and the processing unit 910), and a display unit 940.
The storage unit 920 stores program code, which can be executed by the processing unit 910 so that the processing unit 910 performs the steps according to the various exemplary embodiments of the present disclosure described in the "Exemplary Method" section of this specification. For example, the processing unit 910 may perform any one or more of the method steps in Figs. 1 to 6.
The storage unit 920 may include readable media in the form of volatile storage units, such as a random access storage unit (RAM) 921 and/or a cache storage unit 922, and may further include a read-only storage unit (ROM) 923.
The storage unit 920 may also include a program/utility 924 having a set of (at least one) program modules 925. Such program modules 925 include but are not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
The bus 930 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 900 may also communicate with one or more external devices 1000 (such as a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 900, and/or with any device (such as a router, a modem, etc.) that enables the electronic device 900 to communicate with one or more other computing devices. Such communication may be performed through an input/output (I/O) interface 950. In addition, the electronic device 900 may communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 960. As shown in the figure, the network adapter 960 communicates with the other modules of the electronic device 900 through the bus 930. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 900, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
Through the description of the above embodiments, those skilled in the art will readily understand that the exemplary embodiments described here can be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solutions according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal apparatus, a network device, etc.) to execute the method according to the exemplary embodiments of the present disclosure.
In addition, the above drawings are merely schematic illustrations of the processing included in the methods according to the exemplary embodiments of the present disclosure and are not intended to be limiting. It is easy to understand that the processing shown in the above drawings does not indicate or limit the temporal order of these processes. It is also easy to understand that these processes may be executed, for example, synchronously or asynchronously in multiple modules.
It should be noted that although several modules or units of a device for action execution are mentioned in the above detailed description, such division is not mandatory. In fact, according to exemplary embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit; conversely, the features and functions of one module or unit described above may be further divided and embodied by multiple modules or units.
Those skilled in the art will understand that various aspects of the present disclosure may be implemented as a system, a method, or a program product. Therefore, various aspects of the present disclosure may be embodied in the following forms: a complete hardware implementation, a complete software implementation (including firmware, microcode, etc.), or an implementation combining hardware and software, which may collectively be referred to here as a "circuit", "module", or "system". Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or customary technical means in the technical field not disclosed in the present disclosure. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present disclosure are indicated by the claims.
It should be understood that the present disclosure is not limited to the precise structures that have been described above and shown in the drawings, and various modifications and changes can be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (20)

  1. 一种视频人物识别方法,其特征在于,包括:A method for video character recognition, which is characterized in that it includes:
    从目标视频中获取关键帧图像;Obtain key frame images from the target video;
    从所述关键帧图像中提取人物外观特征;Extracting character appearance features from the key frame image;
    根据所述关键帧图像在所述目标视频中的时间,从所述目标视频的音频中截取所述关键帧图像对应的子音频,从所述子音频中提取声纹特征;Intercepting the sub audio corresponding to the key frame image from the audio of the target video according to the time of the key frame image in the target video, and extracting voiceprint features from the sub audio;
    利用预先训练的融合模型对所述人物外观特征和所述声纹特征进行处理,得到所述目标视频的人物识别结果。A pre-trained fusion model is used to process the appearance feature of the person and the voiceprint feature to obtain the person recognition result of the target video.
  2. 根据权利要求1所述的方法,其特征在于,所述从目标视频中获取关键帧图像,包括:The method according to claim 1, wherein said obtaining the key frame image from the target video comprises:
    从所述目标视频中提取帧内编码帧并进行解码,得到所述关键帧图像。The intra-frame coded frame is extracted from the target video and decoded to obtain the key frame image.
  3. 根据权利要求2所述的方法,其特征在于,所述从所述目标视频中提取帧内编码帧并进行解码,包括:The method according to claim 2, wherein the extracting and decoding intra-frame coded frames from the target video comprises:
    调用多个线程,通过每个线程分别解码一个所述帧内编码帧。Multiple threads are called, and one intra-frame coded frame is decoded by each thread.
  4. 根据权利要求1所述的方法,其特征在于,所述人物外观特征包括人脸特征。The method according to claim 1, wherein the appearance feature of the person comprises a face feature.
  5. 根据权利要求4所述的方法,其特征在于,所述从所述关键帧图像中提取人物外观特征,包括:The method according to claim 4, wherein said extracting the appearance characteristics of a person from the key frame image comprises:
    检测所述关键帧图像中的人脸区域,以从所述关键帧图像中截取人脸子图像;Detecting a face area in the key frame image to intercept a face sub-image from the key frame image;
    利用预先训练的卷积神经网络从所述人脸子图像中提取所述人脸特征。Using a pre-trained convolutional neural network to extract the facial features from the facial sub-images.
  6. 根据权利要求4所述的方法,其特征在于,所述人物外观特征还包括身形特征、体态特征。The method according to claim 4, wherein the appearance characteristics of the character further include body shape characteristics and posture characteristics.
  7. The method according to claim 1, wherein the voiceprint feature comprises Mel-frequency cepstral coefficients.
  8. The method according to claim 7, wherein extracting the voiceprint feature from the sub-audio comprises:
    obtaining a frequency spectrum corresponding to the sub-audio;
    calculating a corresponding amplitude spectrum from the frequency spectrum;
    performing Mel filtering on the amplitude spectrum to calculate the Mel-frequency cepstral coefficients of the sub-audio.
  9. The method according to claim 8, wherein obtaining the frequency spectrum corresponding to the sub-audio comprises:
    preprocessing the sub-audio;
    performing a Fourier transform on the preprocessed sub-audio to obtain the frequency spectrum corresponding to the sub-audio.
  10. The method according to claim 9, wherein the preprocessing comprises any one or more of the following: speech signal extraction, pre-emphasis, framing, and windowing.
  11. The method according to claim 8, wherein extracting the voiceprint feature from the sub-audio further comprises:
    converting the Mel-frequency cepstral coefficients into a voiceprint feature vector and post-processing the voiceprint feature vector.
  12. The method according to claim 11, wherein the post-processing comprises any one or more of the following: mean removal, normalization, and dimensionality reduction.
  13. The method according to claim 7, wherein the voiceprint feature further comprises any one or more of the following: generalized Mel cepstral coefficients, spectral envelope and energy features, fundamental frequency, voiced/unvoiced classification features, and band aperiodicity components.
  14. The method according to claim 1, wherein processing the person appearance feature and the voiceprint feature by using the pre-trained fusion model to obtain the person recognition result of the target video comprises:
    inputting the person appearance feature and the voiceprint feature into two input channels of the fusion model respectively;
    processing the person appearance feature and the voiceprint feature separately by using the fusion model to obtain an intermediate feature corresponding to the person appearance feature and an intermediate feature corresponding to the voiceprint feature;
    performing a fusion calculation on the intermediate feature corresponding to the person appearance feature and the intermediate feature corresponding to the voiceprint feature, and outputting the person recognition result of the target video.
  15. The method according to claim 1, wherein processing the person appearance feature and the voiceprint feature by using the pre-trained fusion model to obtain the person recognition result of the target video comprises:
    combining the person appearance feature and the voiceprint feature to obtain a comprehensive feature;
    inputting the comprehensive feature into the fusion model, and outputting the person recognition result of the target video.
  16. The method according to claim 15, wherein combining the person appearance feature and the voiceprint feature to obtain the comprehensive feature comprises:
    combining the person appearance feature, the voiceprint feature, and a time feature to obtain the comprehensive feature.
  17. The method according to claim 16, wherein, before the person appearance feature, the voiceprint feature, and the time feature are combined, the method further comprises:
    determining the time feature according to the time of the key frame image in the target video and the time interval of the sub-audio in the target video.
  18. A video person recognition apparatus, characterized by comprising a processor;
    wherein the processor is configured to execute the following program modules stored in a memory:
    an image acquisition module, configured to obtain a key frame image from a target video;
    a first extraction module, configured to extract a person appearance feature from the key frame image;
    a second extraction module, configured to intercept a sub-audio corresponding to the key frame image from the audio of the target video according to the time of the key frame image in the target video, and to extract a voiceprint feature from the sub-audio;
    a feature processing module, configured to process the person appearance feature and the voiceprint feature by using a pre-trained fusion model to obtain a person recognition result of the target video.
  19. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 17.
  20. An electronic device, characterized by comprising:
    a processor; and
    a memory for storing executable instructions of the processor;
    wherein the processor is configured to execute the method according to any one of claims 1 to 17 by executing the executable instructions.
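The key frame extraction of claims 2 and 3 can be sketched, purely for illustration, with the PyAV bindings to FFmpeg; the claims do not prescribe any particular library. The decoder is asked to skip every non-key frame so that only intra-coded frames are decoded, and FFmpeg's own frame-level threading stands in for the multi-threaded decoding of claim 3.

    # Minimal sketch, assuming PyAV is installed; the file path and pixel format are illustrative.
    import av

    def extract_key_frames(path):
        container = av.open(path)
        stream = container.streams.video[0]
        stream.thread_type = "AUTO"                  # let FFmpeg decode with multiple threads
        stream.codec_context.skip_frame = "NONKEY"   # decode intra-coded (key) frames only
        return [frame.to_ndarray(format="rgb24") for frame in container.decode(stream)]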
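A rough sketch of the face feature extraction in claim 5, assuming OpenCV's bundled Haar cascade for face detection and a hypothetical pre-trained convolutional network face_encoder for the embedding; the claim itself names neither a detector nor a specific network.

    import cv2

    def extract_face_features(key_frame_bgr, face_encoder):
        # face_encoder is a hypothetical pre-trained CNN returning one embedding per face crop
        detector = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        gray = cv2.cvtColor(key_frame_bgr, cv2.COLOR_BGR2GRAY)
        features = []
        for (x, y, w, h) in detector.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5):
            face_sub_image = key_frame_bgr[y:y + h, x:x + w]    # crop the detected face region
            features.append(face_encoder(face_sub_image))       # CNN face feature
        return features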
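The MFCC pipeline of claims 8 to 12 (pre-emphasis, framing, windowing, Fourier transform, amplitude spectrum, Mel filtering, and post-processing) can be sketched with NumPy and SciPy. The frame lengths, filter counts, and the 0.97 pre-emphasis factor are common defaults assumed here, not values fixed by the claims, and the sub-audio is assumed to be at least one frame long.

    import numpy as np
    from scipy.fftpack import dct

    def mfcc(signal, sr=16000, n_fft=512, frame_len=0.025, frame_step=0.010,
             n_mels=26, n_ceps=13):
        # Pre-emphasis (claim 10)
        emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
        # Framing and Hamming windowing (claim 10)
        flen, fstep = int(frame_len * sr), int(frame_step * sr)
        n_frames = 1 + max(0, (len(emphasized) - flen) // fstep)
        frames = np.stack([emphasized[i * fstep:i * fstep + flen] for i in range(n_frames)])
        frames = frames * np.hamming(flen)
        # Fourier transform and amplitude spectrum (claims 8 and 9)
        amplitude = np.abs(np.fft.rfft(frames, n_fft))
        power = amplitude ** 2 / n_fft
        # Triangular Mel filterbank applied to the spectrum (claim 8)
        hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        mel_points = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
        bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
        fbank = np.zeros((n_mels, n_fft // 2 + 1))
        for m in range(1, n_mels + 1):
            left, center, right = bins[m - 1], bins[m], bins[m + 1]
            fbank[m - 1, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
            fbank[m - 1, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
        log_mel = np.log(power @ fbank.T + 1e-10)
        # DCT to cepstral coefficients, then mean removal as a simple post-processing (claims 11 and 12)
        ceps = dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]
        return ceps - ceps.mean(axis=0)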
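One way to read the two-input-channel fusion model of claim 14 is as a two-branch network. The sketch below uses PyTorch with purely illustrative layer sizes, and concatenation stands in for the fusion calculation, which the claim does not fix.

    import torch
    import torch.nn as nn

    class FusionModel(nn.Module):
        # Two input channels (claim 14): each modality gets its own branch, the
        # intermediate features are fused, and a classifier outputs per-person scores.
        def __init__(self, appearance_dim=512, voiceprint_dim=13, hidden=256, num_persons=100):
            super().__init__()
            self.appearance_branch = nn.Sequential(nn.Linear(appearance_dim, hidden), nn.ReLU())
            self.voiceprint_branch = nn.Sequential(nn.Linear(voiceprint_dim, hidden), nn.ReLU())
            self.classifier = nn.Sequential(
                nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, num_persons))

        def forward(self, appearance_feat, voiceprint_feat):
            a = self.appearance_branch(appearance_feat)   # intermediate appearance feature
            v = self.voiceprint_branch(voiceprint_feat)   # intermediate voiceprint feature
            fused = torch.cat([a, v], dim=-1)             # fusion calculation (concatenation)
            return self.classifier(fused)                 # person recognition scores

In training, the output scores would typically be fed to a cross-entropy loss over known person identities, although the claim does not prescribe the training objective.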
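For the single-channel variant of claims 15 to 17, the comprehensive feature is a concatenation of the appearance feature, the voiceprint feature, and a time feature. The encoding of the time feature below (the key frame timestamp plus the endpoints of the sub-audio interval) is an assumption, since the claims only say that the time feature is determined from those quantities.

    import numpy as np

    def build_comprehensive_feature(appearance_feat, voiceprint_feat, frame_time, audio_interval):
        # audio_interval is the (start, end) of the sub-audio within the target video, in seconds
        time_feat = np.array([frame_time, audio_interval[0], audio_interval[1]])
        return np.concatenate([appearance_feat, voiceprint_feat, time_feat])
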
PCT/CN2020/121259 2019-10-28 2020-10-15 Video figure recognition method and apparatus, and storage medium and electronic device WO2021082941A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911029707.4 2019-10-28
CN201911029707.4A CN110909613B (en) 2019-10-28 2019-10-28 Video character recognition method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
WO2021082941A1 true WO2021082941A1 (en) 2021-05-06

Family

ID=69816174

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/121259 WO2021082941A1 (en) 2019-10-28 2020-10-15 Video figure recognition method and apparatus, and storage medium and electronic device

Country Status (2)

Country Link
CN (1) CN110909613B (en)
WO (1) WO2021082941A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110909613B (en) * 2019-10-28 2024-05-31 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment
CN112434234B (en) * 2020-05-15 2023-09-01 上海哔哩哔哩科技有限公司 Frame extraction method and system based on browser
CN111767805A (en) * 2020-06-10 2020-10-13 云知声智能科技股份有限公司 Multi-mode data automatic cleaning and labeling method and system
CN111881726B (en) * 2020-06-15 2022-11-25 马上消费金融股份有限公司 Living body detection method and device and storage medium
CN111753762B (en) * 2020-06-28 2024-03-15 北京百度网讯科技有限公司 Method, device, equipment and storage medium for identifying key identification in video
CN111914742A (en) * 2020-07-31 2020-11-10 辽宁工业大学 Attendance checking method, system, terminal equipment and medium based on multi-mode biological characteristics
CN112215136B (en) * 2020-10-10 2023-09-05 北京奇艺世纪科技有限公司 Target person identification method and device, electronic equipment and storage medium
CN112364779B (en) * 2020-11-12 2022-10-21 中国电子科技集团公司第五十四研究所 Underwater sound target identification method based on signal processing and deep-shallow network multi-model fusion
CN112364829B (en) * 2020-11-30 2023-03-24 北京有竹居网络技术有限公司 Face recognition method, device, equipment and storage medium
CN113077470B (en) * 2021-03-26 2022-01-18 天翼爱音乐文化科技有限公司 Method, system, device and medium for cutting horizontal and vertical screen conversion picture
CN113378697B (en) * 2021-06-08 2022-12-09 安徽大学 Method and device for generating speaking face video based on convolutional neural network
CN113507627B (en) * 2021-07-08 2022-03-25 北京的卢深视科技有限公司 Video generation method and device, electronic equipment and storage medium
CN115691538A (en) * 2021-07-29 2023-02-03 华为技术有限公司 Video processing method and electronic equipment
CN113992972B (en) * 2021-10-28 2024-11-08 维沃移动通信有限公司 Subtitle display method and device, electronic equipment and readable storage medium
CN114640826B (en) * 2022-03-23 2023-11-03 北京有竹居网络技术有限公司 Data processing method, device, readable medium and electronic equipment
CN114915856B (en) * 2022-05-17 2023-05-05 中国科学院半导体研究所 Video key frame identification method, device, equipment and medium
CN118101988B (en) * 2024-04-26 2024-09-24 荣耀终端有限公司 Video processing method, system and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102222227A (en) * 2011-04-25 2011-10-19 中国华录集团有限公司 Video identification based system for extracting film images
CN104834849A (en) * 2015-04-14 2015-08-12 时代亿宝(北京)科技有限公司 Dual-factor identity authentication method and system based on voiceprint recognition and face recognition
CN107194229A (en) * 2017-05-22 2017-09-22 商洛学院 A kind of computer user's personal identification method
CN108399395A (en) * 2018-03-13 2018-08-14 成都数智凌云科技有限公司 The compound identity identifying method of voice and face based on end-to-end deep neural network
CN109376603A (en) * 2018-09-25 2019-02-22 北京周同科技有限公司 A kind of video frequency identifying method, device, computer equipment and storage medium
CN109446990A (en) * 2018-10-30 2019-03-08 北京字节跳动网络技术有限公司 Method and apparatus for generating information
CN110222719A (en) * 2019-05-10 2019-09-10 中国科学院计算技术研究所 A kind of character recognition method and system based on multiframe audio-video converged network
CN110909613A (en) * 2019-10-28 2020-03-24 Oppo广东移动通信有限公司 Video character recognition method and device, storage medium and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590439B (en) * 2017-08-18 2020-12-15 湖南文理学院 Target person identification and tracking method and device based on monitoring video
CN109409296B (en) * 2018-10-30 2020-12-01 河北工业大学 Video emotion recognition method integrating facial expression recognition and voice emotion recognition
CN109740020A (en) * 2018-12-26 2019-05-10 秒针信息技术有限公司 Data processing method, device, storage medium and processor
CN110096966A (en) * 2019-04-10 2019-08-06 天津大学 A kind of audio recognition method merging the multi-modal corpus of depth information Chinese

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113254706A (en) * 2021-05-12 2021-08-13 北京百度网讯科技有限公司 Video matching method, video processing device, electronic equipment and medium
CN113223125A (en) * 2021-05-17 2021-08-06 百度在线网络技术(北京)有限公司 Face driving method, device, equipment and medium for virtual image
CN113223125B (en) * 2021-05-17 2023-09-26 百度在线网络技术(北京)有限公司 Face driving method, device, equipment and medium for virtual image
CN114283060A (en) * 2021-12-20 2022-04-05 北京字节跳动网络技术有限公司 Video generation method, device, equipment and storage medium
CN114544630A (en) * 2022-02-25 2022-05-27 河南科技大学 Group egg image segmentation fertilization information detection device and method based on deep learning
CN115022710A (en) * 2022-05-30 2022-09-06 咪咕文化科技有限公司 Video processing method and device and readable storage medium
CN115022710B (en) * 2022-05-30 2023-09-19 咪咕文化科技有限公司 Video processing method, device and readable storage medium
CN115935008A (en) * 2023-02-16 2023-04-07 杭州网之易创新科技有限公司 Video label generation method, device, medium and computing equipment
CN115935008B (en) * 2023-02-16 2023-05-30 杭州网之易创新科技有限公司 Video label generation method, device, medium and computing equipment

Also Published As

Publication number Publication date
CN110909613A (en) 2020-03-24
CN110909613B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
WO2021082941A1 (en) Video figure recognition method and apparatus, and storage medium and electronic device
CN107799126B (en) Voice endpoint detection method and device based on supervised machine learning
US10109277B2 (en) Methods and apparatus for speech recognition using visual information
WO2020253051A1 (en) Lip language recognition method and apparatus
CN111009237A (en) Voice recognition method and device, electronic equipment and storage medium
CN110516083B (en) Album management method, storage medium and electronic device
TW201543467A (en) Voice input method, device and system
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
CN110070859B (en) Voice recognition method and device
WO2015103836A1 (en) Voice control method and device
WO2024140430A1 (en) Text classification method based on multimodal deep learning, device, and storage medium
WO2024140434A1 (en) Text classification method based on multi-modal knowledge graph, and device and storage medium
Chen et al. Multi-Modality Matters: A Performance Leap on VoxCeleb.
CN112017633B (en) Speech recognition method, device, storage medium and electronic equipment
WO2022062800A1 (en) Speech separation method, electronic device, chip and computer-readable storage medium
CN113257238B (en) Training method of pre-training model, coding feature acquisition method and related device
CN115798459B (en) Audio processing method and device, storage medium and electronic equipment
CN111462732B (en) Speech recognition method and device
TWI769520B (en) Multi-language speech recognition and translation method and system
CN114360491A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
KR102220964B1 (en) Method and device for audio recognition
CN117037772A (en) Voice audio segmentation method, device, computer equipment and storage medium
CN112542157B (en) Speech processing method, device, electronic equipment and computer readable storage medium
Robi et al. Active Speaker Detection using Audio, Visual and Depth Modalities: A Survey
CN114283493A (en) Artificial intelligence-based identification system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20882402

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20882402

Country of ref document: EP

Kind code of ref document: A1