CN110782877A - Speech identification method and system based on Fisher mixed feature and neural network - Google Patents
Speech identification method and system based on Fisher mixed feature and neural network
- Publication number
- CN110782877A (application CN201911130906.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- cqcc
- mfcc
- speech
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a speech identification method and system based on Fisher mixed features and a neural network, and relates to the technical field of speech recognition. First, a speech to be detected and a speech sample set comprising intelligent synthesized speech data and natural human voice database data are acquired, and the MFCC (Mel-frequency cepstral coefficient) features and CQCC (constant Q cepstral coefficient) features of the speech samples in the sample set are extracted. The MFCC-CQCC mixed features of the speech samples are then obtained based on the Fisher criterion, the MFCC features and the CQCC features; a speech identification model is obtained from the mixed features and a preset neural network; and finally, the speech identification model is used to judge whether the speech to be detected is intelligent synthesized speech or natural human voice. In the invention, the speech feature is not a single feature but a Fisher-criterion-based MFCC-CQCC mixed feature that organically combines the MFCC and CQCC features, so that speech synthesized by a variety of algorithms can be effectively identified; training a neural network with this mixed feature yields a speech identification model whose accuracy is effectively improved.
Description
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice identification method and system based on Fisher mixed features and a neural network.
Background
With the continuous development of speech signal processing technology, systems that perform identity authentication using a speaker's voice signal have been widely applied across industries. However, identity authentication based on the speaker's voice signal carries significant security risks, including impersonating a speaker with synthesized speech. Therefore, how to distinguish synthesized speech from natural human voice is the key to eliminating these potential safety hazards.
In the prior art, a common speech recognition system uses speech features to recognize whether the speech to be detected is synthetic speech or natural human voice. The speech features mainly include MFCC features and CQCC features.
However, the inventors of the present application have found that the speech identification systems in the related art do not take into account the sound quality of synthesized speech or the variety of synthesis methods, resulting in low identification accuracy.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a speech identification method and system based on Fisher mixed characteristics and a neural network, and solves the technical problem of low accuracy of the existing speech identification system.
(II) technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
the invention provides a speech identification method based on Fisher mixed characteristics and a neural network, which is executed by a computer and comprises the following steps:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of the voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
and S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
Preferably, the formula of the Fisher criterion is as follows:

$$r_F = \frac{\sigma_b}{\sigma_w}$$

where $r_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
Preferably, in S3, the method for obtaining MFCC-CQCC mixture features of voice samples in a voice sample set includes:
S301, obtaining the inter-class variance $\sigma_b$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the inter-class variance $\sigma_b$ of the feature components of the CQCC features of all the speech samples, according to the formula:

$$\sigma_b(k) = \frac{1}{M}\sum_{i=1}^{M}\left(m_{ik}^{s} - m_k\right)^2$$

where $\sigma_b$ is the inter-class variance of a feature component, namely the variance of the per-sample means of that component, which reflects the degree of difference between different speech samples; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; and $m_k$ represents the mean of the $k$-th dimensional component of all speech samples for feature class $s$;

S302, obtaining the intra-class variance $\sigma_w$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the intra-class variance $\sigma_w$ of the feature components of the CQCC features of the speech samples, according to the formula:

$$\sigma_w(k) = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{ikc} - m_{ik}^{s}\right)^2$$

where $\sigma_w$ represents the intra-class variance of a feature component, namely the mean of the per-sample variances of that component; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; $n_i$ represents the number of frames of the $i$-th speech; and $x_{ikc}$ represents the $c$-th frame parameter of the $k$-th dimension of the $i$-th speech;

S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features, selecting the 12 dimensions with the largest ratios from each, and fusing them into a 24-dimensional MFCC-CQCC mixed feature.
Preferably, before obtaining the speech discrimination model, the method further comprises: and dividing the MFCC-CQCC mixed features acquired in S303 into training data and test data.
Preferably, in S4, the preset neural network includes: one layer of LSTM and one layer of GRU.
Preferably, in S4, the method for obtaining the speech discrimination model includes:
inputting training data into a preset neural network, adjusting parameters of a neural network model, and training the neural network;
inputting test data into the trained neural network to test the accuracy of the neural network;
and when the accuracy reaches a preset value, saving the parameters of the neural network model to obtain the voice identification model.
The invention also provides a speech identification system based on the Fisher mixed characteristics and the neural network, which comprises a computer, wherein the computer comprises:
at least one memory cell;
at least one processing unit;
wherein the at least one memory unit has stored therein at least one instruction that is loaded and executed by the at least one processing unit to perform the steps of:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of the voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
and S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
Preferably, the formula of the Fisher criterion is as follows:

$$r_F = \frac{\sigma_b}{\sigma_w}$$

where $r_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
Preferably, in S3, the method for obtaining MFCC-CQCC mixture features of voice samples in a voice sample set includes:
S301, obtaining the inter-class variance $\sigma_b$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the inter-class variance $\sigma_b$ of the feature components of the CQCC features of all the speech samples, according to the formula:

$$\sigma_b(k) = \frac{1}{M}\sum_{i=1}^{M}\left(m_{ik}^{s} - m_k\right)^2$$

where $\sigma_b$ is the inter-class variance of a feature component, namely the variance of the per-sample means of that component, which reflects the degree of difference between different speech samples; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; and $m_k$ represents the mean of the $k$-th dimensional component of all speech samples for feature class $s$;

S302, obtaining the intra-class variance $\sigma_w$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the intra-class variance $\sigma_w$ of the feature components of the CQCC features of the speech samples, according to the formula:

$$\sigma_w(k) = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{ikc} - m_{ik}^{s}\right)^2$$

where $\sigma_w$ represents the intra-class variance of a feature component, namely the mean of the per-sample variances of that component; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; $n_i$ represents the number of frames of the $i$-th speech; and $x_{ikc}$ represents the $c$-th frame parameter of the $k$-th dimension of the $i$-th speech;

S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features, selecting the 12 dimensions with the largest ratios from each, and fusing them into a 24-dimensional MFCC-CQCC mixed feature.
(III) advantageous effects
The invention provides a speech identification method and system based on Fisher mixed characteristics and a neural network. Compared with the prior art, the method has the following beneficial effects:
First, a speech to be detected and a speech sample set comprising intelligent synthesized speech data and natural human voice database data are acquired, and the MFCC (Mel-frequency cepstral coefficient) features and CQCC (constant Q cepstral coefficient) features of the speech samples in the sample set are extracted; the MFCC-CQCC mixed features of the speech samples are then obtained based on the Fisher criterion, the MFCC features and the CQCC features; a speech identification model is obtained from the MFCC-CQCC mixed features and a preset neural network; and finally, the speech identification model is used to judge whether the speech to be detected is intelligent synthesized speech or natural human voice. In the invention, the speech feature is not the traditional single feature but a Fisher-criterion-based MFCC-CQCC mixed feature that organically combines the MFCC and CQCC features, so that speech synthesized by a variety of algorithms can be effectively identified; training a neural network with this mixed feature yields a speech identification model whose accuracy is effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a block diagram of a speech recognition method based on Fisher mixed features and a neural network according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the application solves the problem of low accuracy of the existing voice identification system by providing the voice identification method and system based on the Fisher mixed characteristics and the neural network, and realizes the improvement of the accuracy of voice identification.
In order to solve the technical problems, the general idea of the embodiment of the application is as follows:
Firstly, a speech to be detected and a speech sample set comprising intelligent synthesized speech data and natural human voice database data are acquired, and the MFCC (Mel-frequency cepstral coefficient) features and CQCC (constant Q cepstral coefficient) features of the speech samples in the sample set are extracted; the MFCC-CQCC mixed features of the speech samples are then obtained based on the Fisher criterion, the MFCC features and the CQCC features; a speech identification model is obtained from the MFCC-CQCC mixed features and a preset neural network; and finally, the speech identification model is used to judge whether the speech to be detected is intelligent synthesized speech or natural human voice.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The embodiment of the invention provides a speech identification method based on Fisher mixed characteristics and a neural network, which is executed by a computer and comprises the following steps of S1-S5:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
and S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
In the embodiment of the invention, the traditional single characteristic is not selected in the selection of the voice characteristic, but the MFCC-CQCC mixed characteristic based on the Fisher criterion is selected, the characteristic organically combines the MFCC characteristic and the CQCC characteristic, the voice synthesized by various algorithms can be effectively identified, and the mixed characteristic is used for training a neural network to obtain a voice identification model, so that the accuracy of the voice identification model can be effectively improved.
The individual steps are described in detail below:
In step S1, a speech sample set and a speech to be detected are acquired.

Specifically, the speech sample set includes intelligent synthesized speech data and natural human voice database data. In the embodiment of the invention, the intelligent synthesized speech data is generated through the web APIs of Baidu, Xunfei (iFLYTEK) and Aliyun; the synthesized speech is preprocessed and cut into 3-second WAV files, 48,000 in total. The natural human voice data comes from the open-source Chinese speech database AISHELL and is likewise cut into 3-second WAV files, 48,000 in total.
In step S2, MFCC features and CQCC features of the voice samples in the voice sample set are obtained.
Specifically, a 24-dimensional MFCC feature and a 24-dimensional CQCC feature are extracted for each speech sample in the set of speech samples.
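A minimal sketch of this extraction step is shown below. librosa provides 24-dimensional MFCCs directly; the CQCC computation shown here (log-power constant-Q spectrum followed by a DCT) is a simplified stand-in for the full CQCC pipeline, which additionally resamples the constant-Q spectrum uniformly, and the frame settings are assumptions rather than the patent's own configuration.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def extract_mfcc(y, sr, n_dims=24):
    # Returns a (24, n_frames) matrix of Mel-frequency cepstral coefficients.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_dims)

def extract_cqcc_like(y, sr, n_dims=24):
    # Constant-Q power spectrum -> log compression -> DCT over frequency bins,
    # keeping the first 24 cepstral coefficients per frame.
    cqt_power = np.abs(librosa.cqt(y, sr=sr)) ** 2
    log_cqt = np.log(cqt_power + 1e-10)
    return dct(log_cqt, type=2, axis=0, norm="ortho")[:n_dims]
```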
In step S3, MFCC-CQCC mixed features of the speech samples in the speech sample set are obtained based on the Fisher criterion, the MFCC (Mel-frequency cepstral coefficient) features and the CQCC (constant Q cepstral coefficient) features.
Specifically, in S301, the inter-class variance $\sigma_b$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the inter-class variance $\sigma_b$ of the feature components of the CQCC features of all the speech samples, are obtained according to the formula:

$$\sigma_b(k) = \frac{1}{M}\sum_{i=1}^{M}\left(m_{ik}^{s} - m_k\right)^2$$

where $\sigma_b$ represents the inter-class variance of a feature component (i.e. the variance of the per-sample means of that component, which reflects the degree of difference between different speech samples); $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; and $m_k$ represents the mean of the $k$-th dimensional component of all speech samples for feature class $s$.

In S302, the intra-class variance $\sigma_w$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the intra-class variance $\sigma_w$ of the feature components of the CQCC features of the speech samples, are obtained according to the formula:

$$\sigma_w(k) = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{ikc} - m_{ik}^{s}\right)^2$$

where $\sigma_w$ represents the intra-class variance of a feature component (i.e. the mean of the per-sample variances of that component); $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; $n_i$ represents the number of frames of the $i$-th speech (in the embodiment of the present invention, all speech samples have 299 frames); and $x_{ikc}$ represents the $c$-th frame parameter of the $k$-th dimension of the $i$-th speech. It should be noted that, for each speech sample, each class of features is a 24 × 299 matrix, and the matrices are operated on with the NumPy library of Python.

In S303, the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features is calculated, the 12 dimensions with the largest ratios are selected from each, and they are fused into a 24-dimensional MFCC-CQCC mixed feature.

In the specific implementation process, MFCC features are extracted from each speech sample and recorded as M, and CQCC features are extracted and recorded as C; M and C are two 24 × 299 matrices. Using the formulas given in S301 and S302, where $k$ indexes the $k$-th feature dimension, the inter-class and intra-class variances of the $k$-th dimension are computed over all speech samples, and their ratio gives the Fisher ratio of that dimension. Letting $k$ run from 1 to 24, Fisher ratios are computed for each dimension of M and of C. The Fisher ratios of the dimensions of M and C are each sorted in descending order, the first 12 dimensions of M and the first 12 dimensions of C are selected, and together they form the MFCC-CQCC mixed feature. The obtained mixed feature is saved as an npy file; the naming format of the feature file is "serial number + 0 (or 1)", where 0 indicates that the file contains the MFCC-CQCC mixed feature of intelligent synthesized speech, and 1 indicates that it contains the MFCC-CQCC mixed feature of natural human voice.
The Fisher criterion is formulated as follows:

$$r_F = \frac{\sigma_b}{\sigma_w}$$

where $r_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
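Since the description notes that the 24 × 299 feature matrices are handled with Python's NumPy library, the computation of steps S301–S303 can be sketched as follows; the array layout, helper names and the exact npy file naming are illustrative assumptions rather than the patent's own code.

```python
import numpy as np

def fisher_ratios(features):
    """features: (M, 24, 299) array, one 24 x 299 feature matrix per speech sample.
    Returns the Fisher ratio r_F = sigma_b / sigma_w of each of the 24 dimensions."""
    per_sample_mean = features.mean(axis=2)        # m_ik: mean over frames, shape (M, 24)
    sigma_b = per_sample_mean.var(axis=0)          # variance of the per-sample means
    sigma_w = features.var(axis=2).mean(axis=0)    # mean of the per-sample variances
    return sigma_b / sigma_w

def fuse_mfcc_cqcc(mfcc_feats, cqcc_feats, n_keep=12):
    """Select the n_keep highest-Fisher-ratio dimensions of M and of C and stack them
    into the 24-dimensional MFCC-CQCC mixed feature (per sample: 24 x 299)."""
    top_m = np.argsort(fisher_ratios(mfcc_feats))[::-1][:n_keep]
    top_c = np.argsort(fisher_ratios(cqcc_feats))[::-1][:n_keep]
    return np.concatenate([mfcc_feats[:, top_m, :], cqcc_feats[:, top_c, :]], axis=1)

# Saving one mixed feature per sample, following the "serial number + 0 (or 1)" convention:
# np.save(f"{serial_number}_{label}.npy", mixed[i])   # label: 0 = synthetic, 1 = natural
```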
In step S4, a speech recognition model is obtained based on the MFCC-CQCC mixture features and a preset neural network.
Specifically, in S401, 80% of the MFCC-CQCC mixed features of the intelligent synthesized speech data in the speech sample set and 80% of those of the natural human voice database data are randomly drawn from the npy files as training data. In the embodiment of the invention, there are 76,800 pieces of training data, and the remaining 19,200 MFCC-CQCC mixed features are used as test data.
In S402, the training data are input into the preset neural network, the parameters of the neural network model are adjusted, and the neural network is trained. In the embodiment of the invention, the neural network is an LSTM-GRU neural network, namely one layer of LSTM followed by one layer of GRU. The number of nodes of the LSTM layer is set to 20 with a dropout rate of 0.2, the number of nodes of the GRU layer is set to 20, training runs for 100 epochs with a batch size of 500, and the Adam optimizer is used.
In S403, the test data are input into the trained neural network to test its accuracy; when the accuracy reaches a preset value, the parameters of the neural network model are saved to obtain the speech identification model.
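A minimal training sketch consistent with the hyperparameters above is given below. The patent does not name a deep-learning framework; Keras is assumed here, and the 24 × 299 mixed-feature matrices are assumed to be transposed to (frames, dimensions) order before being fed to the recurrent layers.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_frames=299, n_dims=24):
    model = keras.Sequential([
        layers.LSTM(20, dropout=0.2, return_sequences=True,
                    input_shape=(n_frames, n_dims)),   # one LSTM layer, 20 nodes, dropout 0.2
        layers.GRU(20),                                # one GRU layer, 20 nodes
        layers.Dense(1, activation="sigmoid"),         # 1 = natural voice, 0 = synthesized voice
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# x_train, x_test: mixed features shaped (N, 299, 24); y_train, y_test: 0/1 labels.
# model = build_model()
# model.fit(x_train, y_train, epochs=100, batch_size=500)
# loss, acc = model.evaluate(x_test, y_test)
# if acc >= target_accuracy:                 # preset accuracy threshold (assumed name)
#     model.save("speech_identification_model.h5")
```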
In step S5, a type to which the speech to be detected belongs is obtained based on the speech discrimination model, where the type includes the intelligent synthesized speech and the natural human voice.
Specifically, the MFCC-CQCC mixed feature of the speech to be detected is obtained and input to the speech identification model; if the model output is 1, the speech to be detected is natural human voice, and if the model output is 0, the speech to be detected is intelligent synthesized speech.
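The decision rule of step S5 can then be sketched as below; `extract_mixed_feature` is a hypothetical helper standing in for the feature pipeline of steps S2–S3, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from tensorflow import keras

def classify_voice(wav_path, model_path="speech_identification_model.h5"):
    model = keras.models.load_model(model_path)
    feat = extract_mixed_feature(wav_path)        # hypothetical helper -> (299, 24) array
    score = float(model.predict(feat[np.newaxis, ...])[0, 0])
    return "natural human voice" if score >= 0.5 else "intelligent synthesized speech"
```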
It should be noted that, in the embodiment of the present invention, once the speech identification model has been obtained, it can be used to judge the type of any number of speeches to be detected; the model does not need to be obtained anew for each judgment. When subsequently judging the type of a speech to be detected, only the speech to be detected needs to be acquired, after which step S5 is performed.
In order to verify that the method of the embodiment of the invention can improve the accuracy of voice identification, three groups of comparison embodiments are provided, which are specifically as follows:
Example one: training the LSTM-GRU neural network with MFCC features alone and with CQCC features alone gives accuracies of 97.64% and 97.48%, respectively.
Example two: training the same neural network with the MFCC-CQCC mixed features gives an accuracy of 98.27%.
Example three: testing robustness to MP3 compression. The samples used in examples one and two are WAV files; 1,000 samples are selected and compressed to MP3 format. MFCC features, CQCC features and MFCC-CQCC mixed features are then extracted from the compressed samples and fed to the same neural network as in the previous examples. The resulting accuracies are MFCC: 90.14%, CQCC: 60.64%, MFCC-CQCC mixed features: 92.52%.
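The compression step of example three might look like the following sketch; pydub (backed by ffmpeg) and the 64 kbit/s bitrate are assumptions, since the patent only states that the WAV samples are converted to MP3 before the features are re-extracted.

```python
from pydub import AudioSegment

def wav_to_mp3(wav_path, mp3_path, bitrate="64k"):
    # Re-encode a WAV clip as MP3 before re-extracting features for the robustness test.
    AudioSegment.from_wav(wav_path).export(mp3_path, format="mp3", bitrate=bitrate)
```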
The embodiment of the invention also provides a voice identification system based on Fisher mixed characteristics and a neural network, which comprises a computer, wherein the computer comprises:
at least one memory cell;
at least one processing unit;
wherein, at least one instruction is stored in the at least one storage unit, and the at least one instruction is loaded and executed by the at least one processing unit to realize the following steps:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of voice samples in the voice sample set;
s3, acquiring MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
and S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
It is to be understood that the speech recognition system based on the Fisher mixed feature and the neural network provided in the embodiment of the present invention corresponds to the speech recognition method based on the Fisher mixed feature and the neural network, and for the explanation, examples, and beneficial effects of the relevant contents, reference may be made to the corresponding contents in the speech recognition method based on the Fisher mixed feature and the neural network, and details are not described here again.
In summary, compared with the prior art, the method has the following beneficial effects:
1. in the embodiment of the invention, the traditional single characteristic is not selected in the selection of the voice characteristic, but the MFCC-CQCC mixed characteristic based on the Fisher criterion is selected, the characteristic organically combines the MFCC characteristic and the CQCC characteristic, the voice synthesized by various algorithms can be effectively identified, and the mixed characteristic is used for training a neural network to obtain a voice identification model, so that the accuracy of the voice identification model can be effectively improved.
2. The neural network in the embodiment of the invention adopts a layer of LSTM and a layer of GRU, combines the advantages of easier convergence of GRU parameters and better LSTM expression performance, can reduce the time of training a neural network model and ensure good performance, and further ensures the accuracy of a speech recognition model.
It should be noted that, through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (9)
1. A method for speech discrimination based on Fisher mixed features and neural networks, the method being implemented by a computer and comprising the steps of:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of the voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
and S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
2. The Fisher mixed features and neural network based speech discrimination method of claim 1, wherein the formula of the Fisher criterion is as follows:

$$r_F = \frac{\sigma_b}{\sigma_w}$$

wherein $r_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
3. The Fisher mixed features and neural network based speech discrimination method of claim 2, wherein in S3, the method for obtaining the MFCC-CQCC mixed features of the speech samples in the speech sample set comprises:
S301, obtaining the inter-class variance $\sigma_b$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the inter-class variance $\sigma_b$ of the feature components of the CQCC features of all the speech samples, according to the formula:

$$\sigma_b(k) = \frac{1}{M}\sum_{i=1}^{M}\left(m_{ik}^{s} - m_k\right)^2$$

wherein $\sigma_b$ is the inter-class variance of a feature component, namely the variance of the per-sample means of that component, which reflects the degree of difference between different speech samples; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; and $m_k$ represents the mean of the $k$-th dimensional component of all speech samples for feature class $s$;
S302, obtaining the intra-class variance $\sigma_w$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the intra-class variance $\sigma_w$ of the feature components of the CQCC features of the speech samples, according to the formula:

$$\sigma_w(k) = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{ikc} - m_{ik}^{s}\right)^2$$

wherein $\sigma_w$ represents the intra-class variance of a feature component, namely the mean of the per-sample variances of that component; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; $n_i$ represents the number of frames of the $i$-th speech; and $x_{ikc}$ represents the $c$-th frame parameter of the $k$-th dimension of the $i$-th speech;
S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features, selecting the 12 dimensions with the largest ratios from each, and fusing them into a 24-dimensional MFCC-CQCC mixed feature.
4. The Fisher mixed features and neural network based speech discrimination method of claim 3, wherein before obtaining the voice identification model, the method further comprises: dividing the MFCC-CQCC mixed features obtained in S303 into training data and test data.
5. The Fisher mixed features and neural network based speech discrimination method of claim 4, wherein in S4, the preset neural network comprises: one layer of LSTM and one layer of GRU.
6. The Fisher mixed features and neural network based speech discrimination method of claim 5, wherein in S4, the method for obtaining the voice identification model comprises:
inputting training data into a preset neural network, adjusting parameters of a neural network model, and training the neural network;
inputting test data into the trained neural network to test the accuracy of the neural network;
and when the accuracy reaches a preset value, saving the parameters of the neural network model to obtain the voice identification model.
7. A speech discrimination system based on Fisher mixed features and neural networks, the system comprising a computer, the computer comprising:
at least one memory cell;
at least one processing unit;
wherein the at least one memory unit has stored therein at least one instruction that is loaded and executed by the at least one processing unit to perform the steps of:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of the voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
and S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
8. The Fisher mixed features and neural network based speech discrimination system of claim 7, wherein the formula of the Fisher criterion is as follows:

$$r_F = \frac{\sigma_b}{\sigma_w}$$

wherein $r_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
9. The Fisher mixed features and neural network based speech discrimination system of claim 8, wherein in S3, the method for obtaining the MFCC-CQCC mixed features of the speech samples in the speech sample set comprises:
S301, obtaining the inter-class variance $\sigma_b$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the inter-class variance $\sigma_b$ of the feature components of the CQCC features of all the speech samples, according to the formula:

$$\sigma_b(k) = \frac{1}{M}\sum_{i=1}^{M}\left(m_{ik}^{s} - m_k\right)^2$$

wherein $\sigma_b$ is the inter-class variance of a feature component, namely the variance of the per-sample means of that component, which reflects the degree of difference between different speech samples; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; and $m_k$ represents the mean of the $k$-th dimensional component of all speech samples for feature class $s$;
S302, obtaining the intra-class variance $\sigma_w$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the intra-class variance $\sigma_w$ of the feature components of the CQCC features of the speech samples, according to the formula:

$$\sigma_w(k) = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{ikc} - m_{ik}^{s}\right)^2$$

wherein $\sigma_w$ represents the intra-class variance of a feature component, namely the mean of the per-sample variances of that component; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; $n_i$ represents the number of frames of the $i$-th speech; and $x_{ikc}$ represents the $c$-th frame parameter of the $k$-th dimension of the $i$-th speech;
S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features, selecting the 12 dimensions with the largest ratios from each, and fusing them into a 24-dimensional MFCC-CQCC mixed feature.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911130906.4A CN110782877A (en) | 2019-11-19 | 2019-11-19 | Speech identification method and system based on Fisher mixed feature and neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911130906.4A CN110782877A (en) | 2019-11-19 | 2019-11-19 | Speech identification method and system based on Fisher mixed feature and neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110782877A true CN110782877A (en) | 2020-02-11 |
Family
ID=69391714
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911130906.4A Pending CN110782877A (en) | 2019-11-19 | 2019-11-19 | Speech identification method and system based on Fisher mixed feature and neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110782877A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113314148A (en) * | 2021-07-29 | 2021-08-27 | 中国科学院自动化研究所 | Light-weight neural network generated voice identification method and system based on original waveform |
CN113516969A (en) * | 2021-09-14 | 2021-10-19 | 北京远鉴信息技术有限公司 | Spliced voice identification method and device, electronic equipment and storage medium |
- 2019-11-19: Application CN201911130906.4A filed in China; published as CN110782877A; status Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101937678A (en) * | 2010-07-19 | 2011-01-05 | 东南大学 | Judgment-deniable automatic speech emotion recognition method for fidget |
CN103325372A (en) * | 2013-05-20 | 2013-09-25 | 北京航空航天大学 | Chinese phonetic symbol tone identification method based on improved tone core model |
CN104970773A (en) * | 2015-07-21 | 2015-10-14 | 西安交通大学 | Automatic sleep stage classification method based on dual character filtering |
CN106491143A (en) * | 2016-10-18 | 2017-03-15 | 哈尔滨工业大学深圳研究生院 | Judgment method of authenticity and device based on EEG signals |
US20180254046A1 (en) * | 2017-03-03 | 2018-09-06 | Pindrop Security, Inc. | Method and apparatus for detecting spoofing conditions |
CN107871498A (en) * | 2017-10-10 | 2018-04-03 | 昆明理工大学 | It is a kind of based on Fisher criterions to improve the composite character combinational algorithm of phonetic recognization rate |
CN108039176A (en) * | 2018-01-11 | 2018-05-15 | 广州势必可赢网络科技有限公司 | Voiceprint authentication method and device for preventing recording attack and access control system |
CN108986824A (en) * | 2018-07-09 | 2018-12-11 | 宁波大学 | A kind of voice playback detection method |
CN109754812A (en) * | 2019-01-30 | 2019-05-14 | 华南理工大学 | A kind of voiceprint authentication method of the anti-recording attack detecting based on convolutional neural networks |
Non-Patent Citations (1)
Title |
---|
于泓 (Yu Hong): Doctoral Dissertation, Beijing University of Posts and Telecommunications, 30 September 2019 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113314148A (en) * | 2021-07-29 | 2021-08-27 | 中国科学院自动化研究所 | Light-weight neural network generated voice identification method and system based on original waveform |
CN113314148B (en) * | 2021-07-29 | 2021-11-09 | 中国科学院自动化研究所 | Light-weight neural network generated voice identification method and system based on original waveform |
CN113516969A (en) * | 2021-09-14 | 2021-10-19 | 北京远鉴信息技术有限公司 | Spliced voice identification method and device, electronic equipment and storage medium |
CN113516969B (en) * | 2021-09-14 | 2021-12-14 | 北京远鉴信息技术有限公司 | Spliced voice identification method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110164452B (en) | Voiceprint recognition method, model training method and server | |
EP3719798B1 (en) | Voiceprint recognition method and device based on memorability bottleneck feature | |
CN111916111B (en) | Intelligent voice outbound method and device with emotion, server and storage medium | |
EP3599606A1 (en) | Machine learning for authenticating voice | |
US6253179B1 (en) | Method and apparatus for multi-environment speaker verification | |
CN105096955B (en) | A kind of speaker's method for quickly identifying and system based on model growth cluster | |
CN111524527A (en) | Speaker separation method, device, electronic equipment and storage medium | |
CN110136696B (en) | Audio data monitoring processing method and system | |
CN111916108B (en) | Voice evaluation method and device | |
US20160019897A1 (en) | Speaker recognition from telephone calls | |
Yücesoy et al. | A new approach with score-level fusion for the classification of a speaker age and gender | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
WO2019137392A1 (en) | File classification processing method and apparatus, terminal, server, and storage medium | |
CN111508524B (en) | Method and system for identifying voice source equipment | |
CN111508505A (en) | Speaker identification method, device, equipment and storage medium | |
CN110164417B (en) | Language vector obtaining and language identification method and related device | |
CN116564315A (en) | Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium | |
CN110782877A (en) | Speech identification method and system based on Fisher mixed feature and neural network | |
Shareef et al. | Gender voice classification with huge accuracy rate | |
CN114610840A (en) | Sensitive word-based accounting monitoring method, device, equipment and storage medium | |
JPWO2020003413A1 (en) | Information processing equipment, control methods, and programs | |
WO2021051533A1 (en) | Address information-based blacklist identification method, apparatus, device, and storage medium | |
CN111833842A (en) | Synthetic sound template discovery method, device and equipment | |
Herrera-Camacho et al. | Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE | |
CN111061909A (en) | Method and device for classifying accompaniment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200211