
CN110363148A - Method for fusion verification of face and voiceprint features - Google Patents

Method for fusion verification of face and voiceprint features

Info

Publication number
CN110363148A
CN110363148A
Authority
CN
China
Prior art keywords: frequency, image, time domain signal, cepstrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910641594.7A
Other languages
Chinese (zh)
Inventor
胡增
江大白
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Applied Technology Co Ltd
Original Assignee
China Applied Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Applied Technology Co Ltd filed Critical China Applied Technology Co Ltd
Priority to CN201910641594.7A
Publication of CN110363148A
Legal status: Pending


Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F18/00 Pattern recognition
                    • G06F18/20 Analysing
                        • G06F18/25 Fusion techniques
                            • G06F18/253 Fusion techniques of extracted features
                • G06F2218/00 Aspects of pattern recognition specially adapted for signal processing
                    • G06F2218/02 Preprocessing
                    • G06F2218/08 Feature extraction
                    • G06F2218/12 Classification; Matching
            • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
                • G06T3/00 Geometric image transformations in the plane of the image
                    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
                        • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
                    • G06T3/60 Rotation of whole images or parts thereof
                • G06T5/00 Image enhancement or restoration
                    • G06T5/10 Image enhancement or restoration using non-spatial domain filtering
                • G06T2207/00 Indexing scheme for image analysis or image enhancement
                    • G06T2207/20 Special algorithmic details
                        • G06T2207/20048 Transform domain processing
                            • G06T2207/20052 Discrete cosine transform [DCT]
                            • G06T2207/20056 Discrete and fast Fourier transform [DFT, FFT]
                    • G06T2207/30 Subject of image; Context of image processing
                        • G06T2207/30196 Human being; Person
                            • G06T2207/30201 Face
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
                    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
                        • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
                            • G06V40/168 Feature extraction; Face representation
                            • G06V40/172 Classification, e.g. identification
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L17/00 Speaker identification or verification techniques
                • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
                    • G10L25/03 characterised by the type of extracted parameters
                        • G10L25/24 the extracted parameters being the cepstrum
                    • G10L25/45 characterised by the type of analysis window

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Collating Specific Patterns (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a method for fusion verification of face and voiceprint features, comprising the following steps: parsing an input audio file into a time-domain signal of the sound; converting the time-domain signal into a frequency-domain signal by windowed framing and a short-time Fourier transform; converting the frequency, by a logarithmic spectrum transform, onto a scale on which the human ear perceives frequency linearly; separating, by cepstral analysis using a DCT transform, the direct-current component and the sinusoidal components of the converted frequency-domain signal; extracting the sound spectrum feature vector and converting the vector into an image; and fusing this image with a two-dimensional face image. Because the features are fused at the underlying feature level, the proposed method needs only a single verification and avoids the problem in application-layer joint verification where a false detection in one modality causes the entire verification to fail, thereby improving the user experience.

Description

Method for fusion verification of face and voiceprint features
Technical Field
The invention relates to the field of biometric recognition, and in particular to a method for fusion verification of face and voiceprint features.
Background
In recent years, with the continuing development and maturation of deep learning and computer vision, identity-authentication technologies based on computer vision have advanced rapidly within biometric recognition. Face recognition in particular, being contactless and fast, is widely applied in services that require identity verification. Voiceprint recognition, another form of biometric recognition, identifies a person from the acoustic characteristics of the speaker's voice; it is largely independent of accent and language and can be used for both speaker identification and speaker verification.
Although various biometric technologies are gradually being developed and applied in daily life and production, they are typically used in isolation, and any single verification modality inevitably suffers missed detections and false detections. A current focus of research and application is therefore to combine several biometric technologies, such as face recognition and voiceprint recognition, to achieve higher security and accuracy. Existing approaches, however, combine the two recognition technologies only at the application layer: face recognition plus voiceprint recognition means that the face must pass first, after which the voiceprint is verified. The face features and the voiceprint features are not fused at the underlying feature level, so the effect of a single joint verification cannot be achieved; a false detection in either modality causes the whole verification to fail, which degrades the user experience.
Disclosure of Invention
To address these shortcomings of the prior art, the present invention aims to provide a method for fusion verification of face and voiceprint features.
The technical solution adopted by the invention to achieve this aim is as follows. A method for fusion verification of face and voiceprint features comprises the following steps:
parsing an input audio file into a time-domain signal of the sound;
converting the time-domain signal into a frequency-domain signal by windowed framing and a short-time Fourier transform;
converting the frequency in the frequency-domain signal, by a logarithmic spectrum transform, onto a scale perceived linearly by the human ear;
separating the direct-current component and the sinusoidal components of the converted frequency-domain signal by cepstral analysis using a DCT (discrete cosine transform);
extracting the sound spectrum feature vector and converting the vector into an image; and
fusing the image with a two-dimensional face image.
Converting the time-domain signal into a frequency-domain signal by windowed framing and a short-time Fourier transform is specifically:
selecting a time-frequency localized window function h(t) and computing the power spectrum at each moment by the short-time Fourier transform, whose formula is
STFT(t, f) = ∫ f(τ)·h(τ - t)·e^(-j2πfτ) dτ
where f(τ) represents the time-domain signal of the input audio, τ is the integration variable, and t denotes the different time instants.
The window function is a Hamming window.
Converting the frequency, by a logarithmic spectrum transform, onto a scale perceived linearly by the human ear is specifically:
the frequency scale is converted to a logarithmic frequency scale by the following formula, so that the human ear's perception of frequency becomes linear on the new scale:
mel(f) = 2595*log10(1 + f/700)
where mel(f) represents the logarithmic frequency and f the frequency obtained by the short-time Fourier transform.
Separating the direct-current component and the sinusoidal components of the converted frequency-domain signal by cepstral analysis using a DCT is specifically:
mfcc(u) = Σ_{i=0..N-1} mel(i)·cos((2i + 1)·u·π / (2N)),  u = 0, 1, …, N-1
where mfcc(u) represents the cepstrum, mel(i) the logarithmic (mel) frequency values, N the number of frequency bins, and u the frequency bin index of the cepstrum.
Extracting the sound spectrum feature vector and converting the vector into an image is specifically:
the range of the output vector,
mfcc ∈ [min, max]
is linearly transformed to the range of an image,
pixel ∈ [0, 255]
thereby obtaining a cepstrum image of the sound, whose horizontal axis is time and whose vertical axis is frequency; here mfcc denotes the cepstrum, min and max its minimum and maximum values, and pixel the pixel value after conversion into an image.
Fusing the image with the two-dimensional face image is specifically:
the cepstrum image is rotated 90 degrees clockwise; if the horizontal-axis length of the two-dimensional face image does not match that of the rotated cepstrum image, the face image is scaled so that the two horizontal axes have the same length, and the two images are then spliced together.
The invention has the following advantages and beneficial effects: the proposed method fuses the face features and the voiceprint features at the underlying feature level, so that only a single verification is required; it avoids the problem in application-layer joint verification where a false detection in one modality causes the entire verification to fail, and it therefore improves the user experience.
Drawings
FIG. 1 is a time-domain signal diagram of sound;
FIG. 2 is a frequency-domain signal diagram of sound;
FIG. 3 is a plot of the Hamming window function of the present invention;
FIG. 4 is a cepstrum image of sound according to the present invention;
FIG. 5 is a network architecture diagram of the VGG16 used in the present invention;
FIG. 6 is a flow chart of the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The flow of the method for fusion verification of face and voiceprint features is shown in FIG. 6: an input audio file is parsed into a time-domain signal of the sound; the time-domain signal is converted into a frequency-domain signal by windowed framing and a short-time Fourier transform; the frequency in the frequency-domain signal is converted, by a logarithmic spectrum transform, onto a scale perceived linearly by the human ear; the direct-current component and the sinusoidal components of the converted frequency-domain signal are separated by cepstral analysis using a DCT; the sound spectrum feature vector is extracted and converted into an image; and the image is fused with a two-dimensional face image.
FIG. 1 shows a time-domain signal of sound and FIG. 2 the corresponding frequency-domain signal. A sound signal is a one-dimensional time-domain signal from which the variation of frequency over time is hard to see directly. If it is transformed to the frequency domain by a Fourier transform, the frequency distribution of the signal becomes visible, but the time information is lost and the change of the frequency distribution over time can no longer be observed. To solve this problem, the short-time Fourier transform (STFT) is used to determine the frequency and phase of local sections of the time-varying signal. Concretely, a time-frequency localized window function is chosen; assuming the analysis window h(t) is stationary over a short interval, f(t)h(t) is a stationary signal within each finite time window, so a power spectrum can be computed at each moment. The window function is a Hamming window, a cosine window that reflects well how the energy at a given moment decays over time. The short-time Fourier transform is formulated as
STFT(t, f) = ∫ f(τ)·h(τ - t)·e^(-j2πfτ) dτ
where f(τ) represents the time-domain signal of the input audio, τ is the integration variable, and t denotes the different time instants.
The Hamming window function, plotted in FIG. 3, is
h(n) = 0.54 - 0.46·cos(2πn / (N - 1)),  0 ≤ n ≤ N - 1
where n indexes the discrete points of the window function and N is the total number of discrete points.
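As an illustration of these two steps (not part of the original disclosure), the following Python sketch parses a WAV file into a time-domain signal and computes a Hamming-windowed STFT power spectrum; the file name and the 25 ms / 10 ms framing are assumptions.

# Minimal sketch (not from the patent): parse a WAV file into a time-domain
# signal and compute a Hamming-windowed short-time Fourier transform.
# The file name, 25 ms frame length and 10 ms hop are illustrative assumptions.
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

sr, samples = wavfile.read("speech.wav")        # time-domain signal f(t)
samples = samples.astype(np.float64)
if samples.ndim > 1:                            # mix to mono if the file is stereo
    samples = samples.mean(axis=1)

frame_len = int(0.025 * sr)                     # 25 ms analysis window
hop_len = int(0.010 * sr)                       # 10 ms frame shift
freqs, times, Zxx = stft(samples, fs=sr, window="hamming",
                         nperseg=frame_len, noverlap=frame_len - hop_len)
power_spectrum = np.abs(Zxx) ** 2               # power spectrum at each moment
print(power_spectrum.shape)                     # (n_freq_bins, n_frames)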
The unit of frequency is Hertz (Hz), and the audible range of the human ear is roughly 20-20000 Hz, but the ear does not perceive frequency linearly in Hz. The ordinary frequency scale is therefore converted to a logarithmic frequency scale, with the mapping given by the following formula:
mel(f) = 2595*log10(1 + f/700)
where mel(f) represents the logarithmic frequency and f the frequency obtained by the short-time Fourier transform. Through this formula, the human ear's perception of frequency becomes approximately linear: on this scale, if the frequencies of two segments of speech differ by a factor of two, the pitch perceived by the ear will also differ by roughly a factor of two.
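A small Python sketch of this mapping follows; the helper names and test frequencies are illustrative, not from the patent.

# Sketch of the Hz-to-mel mapping mel(f) = 2595*log10(1 + f/700) and its
# inverse; the test frequencies below are arbitrary.
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(np.array([300.0, 600.0, 1200.0])))  # mel values grow roughly logarithmically with Hz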
Based on the logarithmic spectrum, the direct-current component and the sinusoidal components of the converted frequency-domain signal are separated by a DCT (discrete cosine transform); the result is called the cepstrum:
mfcc(u) = Σ_{i=0..N-1} mel(i)·cos((2i + 1)·u·π / (2N)),  u = 0, 1, …, N-1
where mfcc(u) represents the cepstrum, mel(i) the logarithmic (mel) frequency values, N the number of frequency bins, u the frequency bin index of the cepstrum, and u = 0 corresponds to the direct-current component.
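The following minimal sketch applies this separation with a type-II DCT along the mel axis; the placeholder input matrix, its size, and the orthonormal scaling convention are assumptions rather than details given in the patent.

# Minimal sketch: separate the slowly varying (direct-current) term from the
# sinusoidal terms with a type-II DCT along the mel axis. `log_mel` stands in
# for an (n_mels, n_frames) matrix of logarithmic mel spectral values; the
# placeholder random data and the orthonormal scaling are assumptions.
import numpy as np
from scipy.fft import dct

n_mels, n_frames = 40, 200
log_mel = np.random.rand(n_mels, n_frames)         # placeholder log-mel values

mfcc = dct(log_mel, type=2, axis=0, norm="ortho")  # cepstrum mfcc(u), shape (n_mels, n_frames)
dc_component = mfcc[0]                             # u = 0: direct-current component per frame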
Since the cepstrum is output as vectors, which cannot be displayed directly as a picture, it must be converted into an image matrix. The range of the output vector,
mfcc ∈ [min, max]
is linearly transformed to the range of an image,
pixel ∈ [0, 255]
This yields the cepstrum image of the sound shown in FIG. 4, whose horizontal axis is time and whose vertical axis is frequency; the brighter a point, the larger the value (the greater the energy). Here mfcc denotes the cepstrum, min and max its minimum and maximum values, and pixel the pixel value after conversion into an image.
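A short sketch of this min-max scaling, reusing the cepstrum matrix from the previous sketch; the epsilon guard against a zero range is an added assumption.

# Sketch of the linear mapping from mfcc in [min, max] to pixel in [0, 255];
# reuses the `mfcc` matrix from the previous sketch. The small epsilon is an
# added assumption to avoid division by zero when min == max.
import numpy as np

def mfcc_to_image(mfcc):
    lo, hi = mfcc.min(), mfcc.max()
    pixel = (mfcc - lo) / (hi - lo + 1e-12) * 255.0
    return pixel.astype(np.uint8)   # rows: frequency (vertical axis), columns: time (horizontal axis)

spectro_img = mfcc_to_image(mfcc)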
In general the number of extracted frequency bins is fixed, i.e. the length of the vertical axis is fixed. The image of FIG. 4 is rotated 90 degrees clockwise and spliced behind the face image; if the horizontal-axis length of the face picture does not match that of the rotated cepstrum, the face picture is scaled so that the two horizontal axes have the same length. The picture obtained after splicing is the fused feature. Finally, a recognition model is trained with a convolutional neural network. A typical CNN classification model can be abbreviated as two steps:
z=CNN(x)
p=softmax(zW)
where x is the input picture and p is the probability output for each class. During training this is treated as a classification problem on x and its corresponding one-hot label p, but at verification time the whole model is not used; only the part CNN(x) is kept, which converts a picture into a fixed-length vector. With this conversion model (encoder), the fused face-and-voice features of any new scene can be encoded, and verification reduces to a comparison between encoding vectors, so the original classification model is no longer relied on. This completes the algorithm for biometric verification using the fused features.
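Returning to the splicing step described above, the following sketch rotates the cepstrum image, rescales a hypothetical face picture so that the horizontal axes match, and stacks the two; the file names and the choice of vertical stacking are assumptions.

# Sketch of the fusion step: rotate the cepstrum image 90 degrees clockwise,
# scale the face picture so the horizontal axes match, then splice the two.
# `spectro_img` is the grayscale cepstrum image from the earlier sketch,
# "face.jpg" is a hypothetical face picture, and vertical stacking is one
# reading of "spliced behind the face image".
import cv2
import numpy as np

face = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)

rotated = cv2.rotate(spectro_img, cv2.ROTATE_90_CLOCKWISE)   # fixed frequency axis becomes the width
if face.shape[1] != rotated.shape[1]:
    scale = rotated.shape[1] / face.shape[1]                  # keep the face's aspect ratio
    face = cv2.resize(face, (rotated.shape[1], int(face.shape[0] * scale)))

fused = np.vstack([face, rotated])     # the spliced picture is the fused feature
cv2.imwrite("fused_feature.png", fused)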
The following is one embodiment of the present invention:
A convolutional neural network is used to recognize the fused features; specifically, a VGG16 network whose structure is shown in FIG. 5. The network has 16 weight layers (excluding pooling and softmax layers); all convolution kernels are 3 × 3, pooling is 2 × 2 max pooling with a stride of 2, and the channel depths of the convolutional stages are 64 -> 128 -> 256 -> 512 -> 512. VGG16 is only one possible neural network, and the training data may also differ, both in the library from which they are extracted and in their number.
The training data consist of 1000 face pictures (50 people, 20 pictures per person) extracted from ImageNet and 1000 voice fragments (also 50 people, 20 fragments per person) extracted from AudioSet. The 50 face identities are randomly paired with the 50 voice identities, and the paired face pictures and voice fragments are randomly recombined, finally giving 50 face-voice pairs with 20 samples each. After feature fusion these are input to the VGG16 network to build a 50-class model, and training yields a CNN-based feature encoder.
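The embodiment can be sketched in Keras as follows; this is an illustrative approximation rather than the patent's actual training code, and the input size, the three-channel replication of the grayscale fused image, and the optimizer settings are assumptions.

# Illustrative sketch (not the patent's actual training code): a 50-class
# VGG16 classifier p = softmax(zW) on top of z = CNN(x), after which only the
# convolutional trunk is kept as the feature encoder. The 224x224x3 input
# (grayscale fused images replicated to three channels) and the optimizer are
# assumptions.
import tensorflow as tf

num_classes = 50
inputs = tf.keras.Input(shape=(224, 224, 3))
backbone = tf.keras.applications.VGG16(include_top=False, weights=None,
                                       input_tensor=inputs, pooling="avg")
z = backbone.output                                              # z = CNN(x), fixed-length vector
p = tf.keras.layers.Dense(num_classes, activation="softmax")(z)  # p = softmax(zW)

classifier = tf.keras.Model(inputs, p)
classifier.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
# classifier.fit(fused_images, labels, ...)   # fused_images, labels: the 50 face-voice classes

encoder = tf.keras.Model(inputs, z)   # at verification time only CNN(x) is used;
                                      # identities are compared between encoding vectors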

Claims (7)

1. A method for fusion verification of face and voiceprint features, characterized by comprising the following steps:
parsing an input audio file into a time-domain signal of the sound;
converting the time-domain signal into a frequency-domain signal by windowed framing and a short-time Fourier transform;
converting the frequency in the frequency-domain signal, by a logarithmic spectrum transform, onto a scale perceived linearly by the human ear;
separating the direct-current component and the sinusoidal components of the converted frequency-domain signal by cepstral analysis using a DCT (discrete cosine transform);
extracting the sound spectrum feature vector and converting the vector into an image; and
fusing the image with a two-dimensional face image.
2. The method for fusion verification of face and voiceprint features according to claim 1, wherein the time-domain signal is converted into a frequency-domain signal by windowed framing and a short-time Fourier transform, specifically:
selecting a time-frequency localized window function h(t) and computing the power spectrum at each moment by the short-time Fourier transform, whose formula is
STFT(t, f) = ∫ f(τ)·h(τ - t)·e^(-j2πfτ) dτ
where f(τ) represents the time-domain signal of the input audio, τ is the integration variable, and t denotes the different time instants.
3. The method of claim 2, wherein the window function is a Hamming window.
4. The method for fusion verification of face and voiceprint features according to claim 1, wherein the frequency is converted, by a logarithmic spectrum transform, onto a scale perceived linearly by the human ear, specifically:
the frequency scale is converted to a logarithmic frequency scale by the following formula, so that the human ear's perception of frequency becomes linear on the new scale:
mel(f) = 2595*log10(1 + f/700)
where mel(f) represents the logarithmic frequency and f the frequency obtained by the short-time Fourier transform.
5. The method for fusion verification of face and voiceprint features according to claim 1, wherein the direct-current component and the sinusoidal components of the converted frequency-domain signal are separated by cepstral analysis using a DCT transform, specifically:
mfcc(u) = Σ_{i=0..N-1} mel(i)·cos((2i + 1)·u·π / (2N)),  u = 0, 1, …, N-1
where mfcc(u) represents the cepstrum, mel(i) the logarithmic (mel) frequency values, N the number of frequency bins, and u the frequency bin index of the cepstrum.
6. The method for fusion verification of face and voiceprint features according to claim 1, wherein the extracting of the sound spectrum feature vector and the converting of the vector into an image are specifically:
the range of the output vector,
mfcc ∈ [min, max]
is linearly transformed to the range of an image,
pixel ∈ [0, 255]
thereby obtaining a cepstrum image of the sound, whose horizontal axis is time and whose vertical axis is frequency; where mfcc denotes the cepstrum, min and max its minimum and maximum values, and pixel the pixel value after conversion into an image.
7. The method for fusion verification of face and voiceprint features according to claim 1, wherein the fusing of the image with a two-dimensional face image is specifically:
rotating the cepstrum image 90 degrees clockwise; if the horizontal-axis length of the two-dimensional face image does not match that of the rotated cepstrum image, scaling the two-dimensional face image so that the two horizontal axes have the same length, and splicing the two images together.
CN201910641594.7A 2019-07-16 2019-07-16 Method for fusion verification of face and voiceprint features Pending CN110363148A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910641594.7A CN110363148A (en) 2019-07-16 2019-07-16 Method for fusion verification of face and voiceprint features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910641594.7A CN110363148A (en) 2019-07-16 2019-07-16 Method for fusion verification of face and voiceprint features

Publications (1)

Publication Number Publication Date
CN110363148A true CN110363148A (en) 2019-10-22

Family

ID=68219964

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910641594.7A Pending CN110363148A (en) 2019-07-16 2019-07-16 Method for fusion verification of face and voiceprint features

Country Status (1)

Country Link
CN (1) CN110363148A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104835507A (en) * 2015-03-30 2015-08-12 渤海大学 Serial-parallel combined multi-mode emotion information fusion and identification method
CN105469253A (en) * 2015-11-19 2016-04-06 桂林航天工业学院 Handset NFC safety payment method based on integrated voiceprint and face characteristic encryption
CN105976809A (en) * 2016-05-25 2016-09-28 中国地质大学(武汉) Voice-and-facial-expression-based identification method and system for dual-modal emotion fusion
CN106127156A (en) * 2016-06-27 2016-11-16 上海元趣信息技术有限公司 Robot interactive method based on vocal print and recognition of face
CN107274887A (en) * 2017-05-09 2017-10-20 重庆邮电大学 Speaker's Further Feature Extraction method based on fusion feature MGFCC
CN106952649A (en) * 2017-05-14 2017-07-14 北京工业大学 Method for distinguishing speek person based on convolutional neural networks and spectrogram
CN108124488A (en) * 2017-12-12 2018-06-05 福建联迪商用设备有限公司 A kind of payment authentication method and terminal based on face and vocal print
CN108399395A (en) * 2018-03-13 2018-08-14 成都数智凌云科技有限公司 The compound identity identifying method of voice and face based on end-to-end deep neural network
CN108847251A (en) * 2018-07-04 2018-11-20 武汉斗鱼网络科技有限公司 A kind of voice De-weight method, device, server and storage medium
CN108962231A (en) * 2018-07-04 2018-12-07 武汉斗鱼网络科技有限公司 A kind of method of speech classification, device, server and storage medium
CN109910818A (en) * 2019-02-15 2019-06-21 东华大学 A kind of VATS Vehicle Anti-Theft System based on human body multiple features fusion identification

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709004A (en) * 2020-08-19 2020-09-25 北京远鉴信息技术有限公司 Identity authentication method and device, electronic equipment and readable storage medium
CN111709004B (en) * 2020-08-19 2020-11-13 北京远鉴信息技术有限公司 Identity authentication method and device, electronic equipment and readable storage medium
CN111814128A (en) * 2020-09-01 2020-10-23 北京远鉴信息技术有限公司 Identity authentication method, device, equipment and storage medium based on fusion characteristics
CN111814128B (en) * 2020-09-01 2020-12-11 北京远鉴信息技术有限公司 Identity authentication method, device, equipment and storage medium based on fusion characteristics
CN112133311A (en) * 2020-09-18 2020-12-25 科大讯飞股份有限公司 Speaker recognition method, related device and readable storage medium
CN113114417A (en) * 2021-03-30 2021-07-13 深圳市冠标科技发展有限公司 Audio transmission method and device, electronic equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191022