CN110782877A - Speech identification method and system based on Fisher mixed feature and neural network - Google Patents
Speech identification method and system based on Fisher mixed feature and neural network
- Publication number
- CN110782877A (application CN201911130906.4A)
- Authority
- CN
- China
- Prior art keywords
- voice
- cqcc
- mfcc
- speech
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a speech identification method and system based on Fisher mixed features and a neural network, and relates to the technical field of speech recognition. First, a speech to be detected and a speech sample set comprising intelligent synthesized speech data and natural human voice database data are acquired, and the MFCC (Mel-frequency cepstral coefficient) features and CQCC (constant Q cepstral coefficient) features of the speech samples in the sample set are extracted. The MFCC-CQCC mixed features of the speech samples are then obtained based on the Fisher criterion, the MFCC features and the CQCC features; a speech identification model is obtained from the mixed features and a preset neural network; and finally, the speech identification model is used to judge whether the speech to be detected is intelligent synthesized speech or natural human voice. In the invention, the speech feature is not a single feature but a Fisher-criterion-based MFCC-CQCC mixed feature that organically combines the MFCC and CQCC features, so that speech synthesized by a variety of algorithms can be effectively identified; training a neural network with this mixed feature yields a speech identification model whose accuracy is effectively improved.
Description
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice identification method and system based on Fisher mixed features and a neural network.
Background
With the continuous development of speech signal processing technology, systems that perform identity authentication using a speaker's voice signal have been widely applied across industries. However, identity authentication based on the speaker's voice signal carries significant security risks, including impersonating a speaker with synthesized speech. Therefore, how to distinguish synthesized speech from natural human voice is the key to eliminating these potential safety hazards.
In the prior art, a common speech recognition system uses speech features to recognize whether the speech to be detected is synthetic speech or natural human voice. The speech features mainly include MFCC features and CQCC features.
However, the inventors of the present application have found that the speech identification systems in the related art do not take into account the sound quality of synthesized speech or the variety of synthesis methods, resulting in low identification accuracy.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a speech identification method and system based on Fisher mixed characteristics and a neural network, and solves the technical problem of low accuracy of the existing speech identification system.
(II) technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
the invention provides a speech identification method based on Fisher mixed characteristics and a neural network, which is executed by a computer and comprises the following steps:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of the voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
and S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
Preferably, the formula of the Fisher criterion is as follows:

$$r_F = \frac{\sigma_b}{\sigma_w}$$

where $r_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
Preferably, in S3, the method for obtaining MFCC-CQCC mixture features of voice samples in a voice sample set includes:
S301, obtaining the inter-class variance $\sigma_b$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the inter-class variance $\sigma_b$ of the feature components of the CQCC features of all the speech samples, according to the formula:

$$\sigma_b(k) = \frac{1}{M}\sum_{i=1}^{M}\left(m_{ik}^{s} - m_k\right)^2$$

where $\sigma_b$ is the inter-class variance of a feature component, namely the variance of the per-sample means of that component, which reflects the degree of difference between different speech samples; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; and $m_k$ represents the mean of the $k$-th dimensional component of all speech samples for feature class $s$;

S302, obtaining the intra-class variance $\sigma_w$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the intra-class variance $\sigma_w$ of the feature components of the CQCC features of the speech samples, according to the formula:

$$\sigma_w(k) = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{ikc} - m_{ik}^{s}\right)^2$$

where $\sigma_w$ represents the intra-class variance of a feature component, namely the mean of the per-sample variances of that component; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; $n_i$ represents the number of frames of the $i$-th speech; and $x_{ikc}$ represents the $c$-th frame parameter of the $k$-th dimension of the $i$-th speech;

S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features, selecting the 12 dimensions with the largest ratios from each, and fusing them into a 24-dimensional MFCC-CQCC mixed feature.
Preferably, before obtaining the speech discrimination model, the method further comprises: and dividing the MFCC-CQCC mixed features acquired in S303 into training data and test data.
Preferably, in S4, the preset neural network includes: one layer of LSTM and one layer of GRU.
Preferably, in S4, the method for obtaining the speech discrimination model includes:
inputting training data into a preset neural network, adjusting parameters of a neural network model, and training the neural network;
inputting test data into the trained neural network to test the accuracy of the neural network;
and when the accuracy reaches a preset value, saving the parameters of the neural network model to obtain the voice identification model.
The invention also provides a speech identification system based on the Fisher mixed characteristics and the neural network, which comprises a computer, wherein the computer comprises:
at least one memory cell;
at least one processing unit;
wherein the at least one memory unit has stored therein at least one instruction that is loaded and executed by the at least one processing unit to perform the steps of:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of the voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
and S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
Preferably, the formula of the Fisher criterion is as follows:

$$r_F = \frac{\sigma_b}{\sigma_w}$$

where $r_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
Preferably, in S3, the method for obtaining MFCC-CQCC mixture features of voice samples in a voice sample set includes:
S301, obtaining the inter-class variance $\sigma_b$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the inter-class variance $\sigma_b$ of the feature components of the CQCC features of all the speech samples, according to the formula:

$$\sigma_b(k) = \frac{1}{M}\sum_{i=1}^{M}\left(m_{ik}^{s} - m_k\right)^2$$

where $\sigma_b$ is the inter-class variance of a feature component, namely the variance of the per-sample means of that component, which reflects the degree of difference between different speech samples; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; and $m_k$ represents the mean of the $k$-th dimensional component of all speech samples for feature class $s$;

S302, obtaining the intra-class variance $\sigma_w$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the intra-class variance $\sigma_w$ of the feature components of the CQCC features of the speech samples, according to the formula:

$$\sigma_w(k) = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{ikc} - m_{ik}^{s}\right)^2$$

where $\sigma_w$ represents the intra-class variance of a feature component, namely the mean of the per-sample variances of that component; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; $n_i$ represents the number of frames of the $i$-th speech; and $x_{ikc}$ represents the $c$-th frame parameter of the $k$-th dimension of the $i$-th speech;

S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features, selecting the 12 dimensions with the largest ratios from each, and fusing them into a 24-dimensional MFCC-CQCC mixed feature.
(III) advantageous effects
The invention provides a speech identification method and system based on Fisher mixed characteristics and a neural network. Compared with the prior art, the method has the following beneficial effects:
First, a speech to be detected and a speech sample set comprising intelligent synthesized speech data and natural human voice database data are acquired, and the MFCC (Mel-frequency cepstral coefficient) features and CQCC (constant Q cepstral coefficient) features of the speech samples in the sample set are extracted; the MFCC-CQCC mixed features of the speech samples are then obtained based on the Fisher criterion, the MFCC features and the CQCC features; a speech identification model is obtained from the MFCC-CQCC mixed features and a preset neural network; and finally, the speech identification model is used to judge whether the speech to be detected is intelligent synthesized speech or natural human voice. In the invention, the speech feature is not the traditional single feature but a Fisher-criterion-based MFCC-CQCC mixed feature that organically combines the MFCC and CQCC features, so that speech synthesized by a variety of algorithms can be effectively identified; training a neural network with this mixed feature yields a speech identification model whose accuracy is effectively improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a block diagram of a speech recognition method based on Fisher mixed features and a neural network according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the application solves the problem of low accuracy of the existing voice identification system by providing the voice identification method and system based on the Fisher mixed characteristics and the neural network, and realizes the improvement of the accuracy of voice identification.
In order to solve the technical problems, the general idea of the embodiment of the application is as follows:
Firstly, a speech to be detected and a speech sample set comprising intelligent synthesized speech data and natural human voice database data are acquired, and the MFCC (Mel-frequency cepstral coefficient) features and CQCC (constant Q cepstral coefficient) features of the speech samples in the sample set are extracted; the MFCC-CQCC mixed features of the speech samples are then obtained based on the Fisher criterion, the MFCC features and the CQCC features; a speech identification model is obtained from the MFCC-CQCC mixed features and a preset neural network; and finally, the speech identification model is used to judge whether the speech to be detected is intelligent synthesized speech or natural human voice.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The embodiment of the invention provides a speech identification method based on Fisher mixed characteristics and a neural network, which is executed by a computer and comprises the following steps of S1-S5:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
and S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
In the embodiment of the invention, the traditional single characteristic is not selected in the selection of the voice characteristic, but the MFCC-CQCC mixed characteristic based on the Fisher criterion is selected, the characteristic organically combines the MFCC characteristic and the CQCC characteristic, the voice synthesized by various algorithms can be effectively identified, and the mixed characteristic is used for training a neural network to obtain a voice identification model, so that the accuracy of the voice identification model can be effectively improved.
The individual steps are described in detail below:
In step S1, a speech sample set and a speech to be detected are acquired.

Specifically, the speech sample set includes intelligent synthesized speech data and natural human voice database data. In the embodiment of the invention, the intelligent synthesized speech data is generated through the web APIs of Baidu, Xunfei (iFLYTEK) and Aliyun; the synthesized speech is preprocessed and cut into 3-second WAV files, 48,000 in total. The natural human voice data comes from the open-source Chinese speech database AISHELL and is likewise cut into 3-second WAV files, 48,000 in total.
In step S2, MFCC features and CQCC features of the voice samples in the voice sample set are obtained.
Specifically, a 24-dimensional MFCC feature and a 24-dimensional CQCC feature are extracted for each speech sample in the set of speech samples.
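A minimal sketch of this extraction step is shown below. librosa provides 24-dimensional MFCCs directly; the CQCC computation shown here (log-power constant-Q spectrum followed by a DCT) is a simplified stand-in for the full CQCC pipeline, which additionally resamples the constant-Q spectrum uniformly, and the frame settings are assumptions rather than the patent's own configuration.

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def extract_mfcc(y, sr, n_dims=24):
    # Returns a (24, n_frames) matrix of Mel-frequency cepstral coefficients.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_dims)

def extract_cqcc_like(y, sr, n_dims=24):
    # Constant-Q power spectrum -> log compression -> DCT over frequency bins,
    # keeping the first 24 cepstral coefficients per frame.
    cqt_power = np.abs(librosa.cqt(y, sr=sr)) ** 2
    log_cqt = np.log(cqt_power + 1e-10)
    return dct(log_cqt, type=2, axis=0, norm="ortho")[:n_dims]
```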
In step S3, MFCC-CQCC mixed features of the speech samples in the speech sample set are obtained based on the Fisher criterion, the MFCC (Mel-frequency cepstral coefficient) features and the CQCC (constant Q cepstral coefficient) features.
Specifically, in S301, the inter-class variance $\sigma_b$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the inter-class variance $\sigma_b$ of the feature components of the CQCC features of all the speech samples, are obtained according to the formula:

$$\sigma_b(k) = \frac{1}{M}\sum_{i=1}^{M}\left(m_{ik}^{s} - m_k\right)^2$$

where $\sigma_b$ represents the inter-class variance of a feature component (i.e. the variance of the per-sample means of that component, which reflects the degree of difference between different speech samples); $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; and $m_k$ represents the mean of the $k$-th dimensional component of all speech samples for feature class $s$.

In S302, the intra-class variance $\sigma_w$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the intra-class variance $\sigma_w$ of the feature components of the CQCC features of the speech samples, are obtained according to the formula:

$$\sigma_w(k) = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{ikc} - m_{ik}^{s}\right)^2$$

where $\sigma_w$ represents the intra-class variance of a feature component (i.e. the mean of the per-sample variances of that component); $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; $n_i$ represents the number of frames of the $i$-th speech (in the embodiment of the present invention, all speech samples have 299 frames); and $x_{ikc}$ represents the $c$-th frame parameter of the $k$-th dimension of the $i$-th speech. It should be noted that, for each speech sample, each class of features is a 24 × 299 matrix, and the matrices are operated on with the NumPy library of Python.

In S303, the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features is calculated, the 12 dimensions with the largest ratios are selected from each, and they are fused into a 24-dimensional MFCC-CQCC mixed feature.

In the specific implementation process, MFCC features are extracted from each speech sample and recorded as M, and CQCC features are extracted and recorded as C; M and C are two 24 × 299 matrices. Using the formulas given in S301 and S302, where $k$ indexes the $k$-th feature dimension, the inter-class and intra-class variances of the $k$-th dimension are computed over all speech samples, and their ratio gives the Fisher ratio of that dimension. Letting $k$ run from 1 to 24, Fisher ratios are computed for each dimension of M and of C. The Fisher ratios of the dimensions of M and C are each sorted in descending order, the first 12 dimensions of M and the first 12 dimensions of C are selected, and together they form the MFCC-CQCC mixed feature. The obtained mixed feature is saved as an npy file; the naming format of the feature file is "serial number + 0 (or 1)", where 0 indicates that the file contains the MFCC-CQCC mixed feature of intelligent synthesized speech, and 1 indicates that it contains the MFCC-CQCC mixed feature of natural human voice.
The Fisher criterion is formulated as follows:

$$r_F = \frac{\sigma_b}{\sigma_w}$$

where $r_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
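Since the description notes that the 24 × 299 feature matrices are handled with Python's NumPy library, the computation of steps S301–S303 can be sketched as follows; the array layout, helper names and the exact npy file naming are illustrative assumptions rather than the patent's own code.

```python
import numpy as np

def fisher_ratios(features):
    """features: (M, 24, 299) array, one 24 x 299 feature matrix per speech sample.
    Returns the Fisher ratio r_F = sigma_b / sigma_w of each of the 24 dimensions."""
    per_sample_mean = features.mean(axis=2)        # m_ik: mean over frames, shape (M, 24)
    sigma_b = per_sample_mean.var(axis=0)          # variance of the per-sample means
    sigma_w = features.var(axis=2).mean(axis=0)    # mean of the per-sample variances
    return sigma_b / sigma_w

def fuse_mfcc_cqcc(mfcc_feats, cqcc_feats, n_keep=12):
    """Select the n_keep highest-Fisher-ratio dimensions of M and of C and stack them
    into the 24-dimensional MFCC-CQCC mixed feature (per sample: 24 x 299)."""
    top_m = np.argsort(fisher_ratios(mfcc_feats))[::-1][:n_keep]
    top_c = np.argsort(fisher_ratios(cqcc_feats))[::-1][:n_keep]
    return np.concatenate([mfcc_feats[:, top_m, :], cqcc_feats[:, top_c, :]], axis=1)

# Saving one mixed feature per sample, following the "serial number + 0 (or 1)" convention:
# np.save(f"{serial_number}_{label}.npy", mixed[i])   # label: 0 = synthetic, 1 = natural
```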
In step S4, a speech recognition model is obtained based on the MFCC-CQCC mixture features and a preset neural network.
Specifically, in S401, 80% of the MFCC-CQCC mixed features of the intelligent synthesized speech data in the speech sample set and 80% of those of the natural human voice database data are randomly drawn from the npy files as training data. In the embodiment of the invention, there are 76,800 pieces of training data, and the remaining 19,200 MFCC-CQCC mixed features are used as test data.
In S402, the training data are input into the preset neural network, the parameters of the neural network model are adjusted, and the neural network is trained. In the embodiment of the invention, the neural network is an LSTM-GRU neural network, namely one layer of LSTM followed by one layer of GRU. The number of nodes of the LSTM layer is set to 20 with a dropout rate of 0.2, the number of nodes of the GRU layer is set to 20, training runs for 100 epochs with a batch size of 500, and the Adam optimizer is used.
In S403, the test data are input into the trained neural network to test its accuracy; when the accuracy reaches a preset value, the parameters of the neural network model are saved to obtain the speech identification model.
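A minimal training sketch consistent with the hyperparameters above is given below. The patent does not name a deep-learning framework; Keras is assumed here, and the 24 × 299 mixed-feature matrices are assumed to be transposed to (frames, dimensions) order before being fed to the recurrent layers.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_model(n_frames=299, n_dims=24):
    model = keras.Sequential([
        layers.LSTM(20, dropout=0.2, return_sequences=True,
                    input_shape=(n_frames, n_dims)),   # one LSTM layer, 20 nodes, dropout 0.2
        layers.GRU(20),                                # one GRU layer, 20 nodes
        layers.Dense(1, activation="sigmoid"),         # 1 = natural voice, 0 = synthesized voice
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# x_train, x_test: mixed features shaped (N, 299, 24); y_train, y_test: 0/1 labels.
# model = build_model()
# model.fit(x_train, y_train, epochs=100, batch_size=500)
# loss, acc = model.evaluate(x_test, y_test)
# if acc >= target_accuracy:                 # preset accuracy threshold (assumed name)
#     model.save("speech_identification_model.h5")
```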
In step S5, a type to which the speech to be detected belongs is obtained based on the speech discrimination model, where the type includes the intelligent synthesized speech and the natural human voice.
Specifically, the MFCC-CQCC mixed feature of the speech to be detected is obtained and input to the speech identification model; if the model output is 1, the speech to be detected is natural human voice, and if the model output is 0, the speech to be detected is intelligent synthesized speech.
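The decision rule of step S5 can then be sketched as below; `extract_mixed_feature` is a hypothetical helper standing in for the feature pipeline of steps S2–S3, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from tensorflow import keras

def classify_voice(wav_path, model_path="speech_identification_model.h5"):
    model = keras.models.load_model(model_path)
    feat = extract_mixed_feature(wav_path)        # hypothetical helper -> (299, 24) array
    score = float(model.predict(feat[np.newaxis, ...])[0, 0])
    return "natural human voice" if score >= 0.5 else "intelligent synthesized speech"
```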
It should be noted that, in the embodiment of the present invention, once the speech identification model has been obtained, it can be used to judge the type of any number of speeches to be detected; the model does not need to be obtained anew for each judgment. When subsequently judging the type of a speech to be detected, only the speech to be detected needs to be acquired, after which step S5 is performed.
In order to verify that the method of the embodiment of the invention can improve the accuracy of voice identification, three groups of comparison embodiments are provided, which are specifically as follows:
Example one: training the LSTM-GRU neural network with MFCC features alone and with CQCC features alone gives accuracies of 97.64% and 97.48%, respectively.
Example two: training the same neural network with the MFCC-CQCC mixed features gives an accuracy of 98.27%.
Example three: testing robustness to MP3 compression. The samples used in examples one and two are WAV files; 1,000 samples are selected and compressed to MP3 format. MFCC features, CQCC features and MFCC-CQCC mixed features are then extracted from the compressed samples and fed to the same neural network as in the previous examples. The resulting accuracies are MFCC: 90.14%, CQCC: 60.64%, MFCC-CQCC mixed features: 92.52%.
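The compression step of example three might look like the following sketch; pydub (backed by ffmpeg) and the 64 kbit/s bitrate are assumptions, since the patent only states that the WAV samples are converted to MP3 before the features are re-extracted.

```python
from pydub import AudioSegment

def wav_to_mp3(wav_path, mp3_path, bitrate="64k"):
    # Re-encode a WAV clip as MP3 before re-extracting features for the robustness test.
    AudioSegment.from_wav(wav_path).export(mp3_path, format="mp3", bitrate=bitrate)
```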
The embodiment of the invention also provides a voice identification system based on Fisher mixed characteristics and a neural network, which comprises a computer, wherein the computer comprises:
at least one memory cell;
at least one processing unit;
wherein, at least one instruction is stored in the at least one storage unit, and the at least one instruction is loaded and executed by the at least one processing unit to realize the following steps:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of voice samples in the voice sample set;
s3, acquiring MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
and S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
It is to be understood that the speech recognition system based on the Fisher mixed feature and the neural network provided in the embodiment of the present invention corresponds to the speech recognition method based on the Fisher mixed feature and the neural network, and for the explanation, examples, and beneficial effects of the relevant contents, reference may be made to the corresponding contents in the speech recognition method based on the Fisher mixed feature and the neural network, and details are not described here again.
In summary, compared with the prior art, the method has the following beneficial effects:
1. in the embodiment of the invention, the traditional single characteristic is not selected in the selection of the voice characteristic, but the MFCC-CQCC mixed characteristic based on the Fisher criterion is selected, the characteristic organically combines the MFCC characteristic and the CQCC characteristic, the voice synthesized by various algorithms can be effectively identified, and the mixed characteristic is used for training a neural network to obtain a voice identification model, so that the accuracy of the voice identification model can be effectively improved.
2. The neural network in the embodiment of the invention adopts a layer of LSTM and a layer of GRU, combines the advantages of easier convergence of GRU parameters and better LSTM expression performance, can reduce the time of training a neural network model and ensure good performance, and further ensures the accuracy of a speech recognition model.
It should be noted that, through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (9)
1. A method for speech discrimination based on Fisher mixed features and neural networks, the method being implemented by a computer and comprising the steps of:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of the voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
and S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
2. The Fisher mixed features and neural network based speech discrimination method of claim 1, wherein the formula of the Fisher criterion is as follows:

$$r_F = \frac{\sigma_b}{\sigma_w}$$

wherein $r_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
3. The Fisher mixed features and neural network based speech discrimination method of claim 2, wherein in S3, the method for obtaining the MFCC-CQCC mixed features of the speech samples in the speech sample set comprises:
S301, obtaining the inter-class variance $\sigma_b$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the inter-class variance $\sigma_b$ of the feature components of the CQCC features of all the speech samples, according to the formula:

$$\sigma_b(k) = \frac{1}{M}\sum_{i=1}^{M}\left(m_{ik}^{s} - m_k\right)^2$$

wherein $\sigma_b$ is the inter-class variance of a feature component, namely the variance of the per-sample means of that component, which reflects the degree of difference between different speech samples; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; and $m_k$ represents the mean of the $k$-th dimensional component of all speech samples for feature class $s$;
S302, obtaining the intra-class variance $\sigma_w$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the intra-class variance $\sigma_w$ of the feature components of the CQCC features of the speech samples, according to the formula:

$$\sigma_w(k) = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{ikc} - m_{ik}^{s}\right)^2$$

wherein $\sigma_w$ represents the intra-class variance of a feature component, namely the mean of the per-sample variances of that component; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; $n_i$ represents the number of frames of the $i$-th speech; and $x_{ikc}$ represents the $c$-th frame parameter of the $k$-th dimension of the $i$-th speech;
S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features, selecting the 12 dimensions with the largest ratios from each, and fusing them into a 24-dimensional MFCC-CQCC mixed feature.
4. The Fisher mixed features and neural network based speech discrimination method of claim 3, wherein before obtaining the voice identification model, the method further comprises: dividing the MFCC-CQCC mixed features obtained in S303 into training data and test data.
5. The Fisher mixed features and neural network based speech discrimination method of claim 4, wherein in S4, the preset neural network comprises: one layer of LSTM and one layer of GRU.
6. The Fisher mixed features and neural network based speech discrimination method of claim 5, wherein in S4, the method for obtaining the voice identification model comprises:
inputting training data into a preset neural network, adjusting parameters of a neural network model, and training the neural network;
inputting test data into the trained neural network to test the accuracy of the neural network;
and when the accuracy reaches a preset value, saving the parameters of the neural network model to obtain the voice identification model.
7. A speech discrimination system based on Fisher mixed features and neural networks, the system comprising a computer, the computer comprising:
at least one memory cell;
at least one processing unit;
wherein the at least one memory unit has stored therein at least one instruction that is loaded and executed by the at least one processing unit to perform the steps of:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of the voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
and S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
8. The Fisher mixed features and neural network based speech discrimination system of claim 7, wherein the formula of the Fisher criterion is as follows:

$$r_F = \frac{\sigma_b}{\sigma_w}$$

wherein $r_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
9. The Fisher mixed features and neural network based speech discrimination system of claim 8, wherein in S3, the method for obtaining the MFCC-CQCC mixed features of the speech samples in the speech sample set comprises:
S301, obtaining the inter-class variance $\sigma_b$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the inter-class variance $\sigma_b$ of the feature components of the CQCC features of all the speech samples, according to the formula:

$$\sigma_b(k) = \frac{1}{M}\sum_{i=1}^{M}\left(m_{ik}^{s} - m_k\right)^2$$

wherein $\sigma_b$ is the inter-class variance of a feature component, namely the variance of the per-sample means of that component, which reflects the degree of difference between different speech samples; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; and $m_k$ represents the mean of the $k$-th dimensional component of all speech samples for feature class $s$;
S302, obtaining the intra-class variance $\sigma_w$ of the feature components of the MFCC features of all the speech samples in the speech sample set, and the intra-class variance $\sigma_w$ of the feature components of the CQCC features of the speech samples, according to the formula:

$$\sigma_w(k) = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{ikc} - m_{ik}^{s}\right)^2$$

wherein $\sigma_w$ represents the intra-class variance of a feature component, namely the mean of the per-sample variances of that component; $M$ represents the total number of speech samples; $m_{ik}^{s}$ represents the mean of the $k$-th dimensional component of feature class $s$ for the $i$-th speech sample; $n_i$ represents the number of frames of the $i$-th speech; and $x_{ikc}$ represents the $c$-th frame parameter of the $k$-th dimension of the $i$-th speech;
S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features, selecting the 12 dimensions with the largest ratios from each, and fusing them into a 24-dimensional MFCC-CQCC mixed feature.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911130906.4A CN110782877A (en) | 2019-11-19 | 2019-11-19 | Speech identification method and system based on Fisher mixed feature and neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911130906.4A CN110782877A (en) | 2019-11-19 | 2019-11-19 | Speech identification method and system based on Fisher mixed feature and neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110782877A true CN110782877A (en) | 2020-02-11 |
Family
ID=69391714
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911130906.4A Pending CN110782877A (en) | 2019-11-19 | 2019-11-19 | Speech identification method and system based on Fisher mixed feature and neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110782877A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113314148A (en) * | 2021-07-29 | 2021-08-27 | 中国科学院自动化研究所 | Light-weight neural network generated voice identification method and system based on original waveform |
CN113516969A (en) * | 2021-09-14 | 2021-10-19 | 北京远鉴信息技术有限公司 | Spliced voice identification method and device, electronic equipment and storage medium |
- 2019-11-19: Application CN201911130906.4A filed in China; published as CN110782877A; status Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101937678A (en) * | 2010-07-19 | 2011-01-05 | 东南大学 | Judgment-deniable automatic speech emotion recognition method for fidget |
CN103325372A (en) * | 2013-05-20 | 2013-09-25 | 北京航空航天大学 | Chinese phonetic symbol tone identification method based on improved tone core model |
CN104970773A (en) * | 2015-07-21 | 2015-10-14 | 西安交通大学 | Automatic sleep stage classification method based on dual character filtering |
CN106491143A (en) * | 2016-10-18 | 2017-03-15 | 哈尔滨工业大学深圳研究生院 | Judgment method of authenticity and device based on EEG signals |
US20180254046A1 (en) * | 2017-03-03 | 2018-09-06 | Pindrop Security, Inc. | Method and apparatus for detecting spoofing conditions |
CN107871498A (en) * | 2017-10-10 | 2018-04-03 | 昆明理工大学 | It is a kind of based on Fisher criterions to improve the composite character combinational algorithm of phonetic recognization rate |
CN108039176A (en) * | 2018-01-11 | 2018-05-15 | 广州势必可赢网络科技有限公司 | Voiceprint authentication method and device for preventing recording attack and access control system |
CN108986824A (en) * | 2018-07-09 | 2018-12-11 | 宁波大学 | A kind of voice playback detection method |
CN109754812A (en) * | 2019-01-30 | 2019-05-14 | 华南理工大学 | A kind of voiceprint authentication method of the anti-recording attack detecting based on convolutional neural networks |
Non-Patent Citations (1)
Title |
---|
于泓 (Yu Hong): Doctoral Dissertation, Beijing University of Posts and Telecommunications, 30 September 2019 *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113314148A (en) * | 2021-07-29 | 2021-08-27 | 中国科学院自动化研究所 | Light-weight neural network generated voice identification method and system based on original waveform |
CN113314148B (en) * | 2021-07-29 | 2021-11-09 | 中国科学院自动化研究所 | Light-weight neural network generated voice identification method and system based on original waveform |
CN113516969A (en) * | 2021-09-14 | 2021-10-19 | 北京远鉴信息技术有限公司 | Spliced voice identification method and device, electronic equipment and storage medium |
CN113516969B (en) * | 2021-09-14 | 2021-12-14 | 北京远鉴信息技术有限公司 | Spliced voice identification method and device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110164452B (en) | Voiceprint recognition method, model training method and server | |
EP3719798B1 (en) | Voiceprint recognition method and device based on memorability bottleneck feature | |
CN111916111B (en) | Intelligent voice outbound method and device with emotion, server and storage medium | |
EP3599606A1 (en) | Machine learning for authenticating voice | |
US6253179B1 (en) | Method and apparatus for multi-environment speaker verification | |
CN105096955B (en) | A kind of speaker's method for quickly identifying and system based on model growth cluster | |
CN111524527A (en) | Speaker separation method, device, electronic equipment and storage medium | |
CN110136696B (en) | Audio data monitoring processing method and system | |
CN111916108B (en) | Voice evaluation method and device | |
US20160019897A1 (en) | Speaker recognition from telephone calls | |
Yücesoy et al. | A new approach with score-level fusion for the classification of a speaker age and gender | |
CN113223536B (en) | Voiceprint recognition method and device and terminal equipment | |
WO2019137392A1 (en) | File classification processing method and apparatus, terminal, server, and storage medium | |
CN111508524B (en) | Method and system for identifying voice source equipment | |
CN111508505A (en) | Speaker identification method, device, equipment and storage medium | |
CN110164417B (en) | Language vector obtaining and language identification method and related device | |
CN116564315A (en) | Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium | |
CN110782877A (en) | Speech identification method and system based on Fisher mixed feature and neural network | |
Shareef et al. | Gender voice classification with huge accuracy rate | |
CN114610840A (en) | Sensitive word-based accounting monitoring method, device, equipment and storage medium | |
JPWO2020003413A1 (en) | Information processing equipment, control methods, and programs | |
WO2021051533A1 (en) | Address information-based blacklist identification method, apparatus, device, and storage medium | |
CN111833842A (en) | Synthetic sound template discovery method, device and equipment | |
Herrera-Camacho et al. | Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE | |
CN111061909A (en) | Method and device for classifying accompaniment |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200211