
CN110782877A - Speech identification method and system based on Fisher mixed feature and neural network

Info

Publication number: CN110782877A
Application number: CN201911130906.4A
Authority: CN (China)
Prior art keywords: voice, CQCC, MFCC, speech, features
Legal status: Pending (the listed status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 苏兆品, 季仁杰, 葛昭旭, 陈清, 郑宁军, 李顺宇, 张国富, 岳峰
Assignee (current and original): Hefei University of Technology
Application filed by Hefei University of Technology
Priority and filing date: 2019-11-19
Priority to CN201911130906.4A
Publication of CN110782877A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a voice identification method and system based on Fisher mixed features and a neural network, and relates to the technical field of voice recognition. First, a voice to be detected and a voice sample set comprising intelligent synthetic voice data and natural human voice database data are acquired, and the MFCC (Mel-frequency cepstral coefficient) features and CQCC (constant-Q cepstral coefficient) features of the voice samples in the set are extracted. MFCC-CQCC mixed features of the voice samples are then obtained based on the Fisher criterion, the MFCC features, and the CQCC features; a voice identification model is acquired based on the mixed features and a preset neural network; finally, whether the voice to be detected is intelligent synthetic voice or natural human voice is judged based on the voice identification model. In selecting the voice features, the invention does not use a single feature but the MFCC-CQCC mixed feature based on the Fisher criterion. This feature organically combines the MFCC and CQCC features, so that voice synthesized by various algorithms can be identified effectively; using the mixed feature to train a neural network yields a voice identification model whose accuracy is effectively improved.

Description

Speech identification method and system based on Fisher mixed feature and neural network
Technical Field
The invention relates to the technical field of voice recognition, in particular to a voice identification method and system based on Fisher mixed features and a neural network.
Background
With the continuous development of voice signal processing technology, systems that authenticate identity from a speaker's voice signal are widely used across industries. Such authentication carries significant security risks, including the use of synthesized voice to impersonate a speaker. How to distinguish synthesized voice from natural human voice is therefore the key to eliminating these risks.
In the prior art, a common voice identification system uses voice features to determine whether the voice to be detected is synthesized voice or natural human voice. These features are mainly MFCC features and CQCC features.
However, the inventors of the present application found that existing voice identification systems do not account for the sound quality of synthesized voice or the variety of synthesis algorithms, resulting in low identification accuracy.
Disclosure of Invention
Technical problem to be solved
To address the defects of the prior art, the invention provides a voice identification method and system based on Fisher mixed features and a neural network, solving the technical problem of the low accuracy of existing voice identification systems.
(II) technical scheme
To achieve this purpose, the invention adopts the following technical scheme:
The invention provides a voice identification method based on Fisher mixed features and a neural network, which is executed by a computer and comprises the following steps:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of the voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
Preferably, the formula of the Fisher criterion is as follows:
$$R_F = \frac{\sigma_b}{\sigma_w}$$

wherein: $R_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
Preferably, in S3, the method for obtaining MFCC-CQCC mixture features of voice samples in a voice sample set includes:
S301, obtaining the inter-class variance $\sigma_b$ of each feature component of the MFCC features of all the voice samples in the voice sample set, and the inter-class variance $\sigma_b$ of each feature component of the CQCC features of all the voice samples, by the formula:

$$\sigma_b = \frac{1}{M}\sum_{i=1}^{M}\left(m_k^{(i)} - m_k\right)^2$$

in the formula: $\sigma_b$ is the inter-class variance of the $k$-th feature component, namely the variance of the per-sample means of that component, which reflects the degree of difference between different voice samples; $M$ represents the total number of voice samples; $m_k^{(i)}$ represents the mean of the $k$-th dimension of a certain feature class $s$ over the frames of the $i$-th voice sample; and $m_k$ represents the mean of the $k$-th dimension components of all voice samples for the feature class $s$;
S302, obtaining the intra-class variance $\sigma_w$ of each feature component of the MFCC features of all the voice samples in the voice sample set, and the intra-class variance $\sigma_w$ of each feature component of the CQCC features of the voice samples, by the formula:

$$\sigma_w = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{k,c}^{(i)} - m_k^{(i)}\right)^2$$

in the formula: $\sigma_w$ represents the intra-class variance of the $k$-th feature component, namely the mean of the per-sample variances of that component; $M$ represents the total number of voice samples; $m_k^{(i)}$ represents the mean of the $k$-th dimension of a certain feature class $s$ over the frames of the $i$-th voice sample; $n_i$ represents the number of frames of the $i$-th voice; and $x_{k,c}^{(i)}$ represents the $k$-th dimension parameter of the $c$-th frame of the $i$-th voice;
S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features over all voice samples, selecting the 12 dimensions of each feature with the largest ratios, and fusing these 12 + 12 dimensions into a 24-dimensional MFCC-CQCC mixed feature.
Preferably, before obtaining the voice identification model, the method further comprises: dividing the MFCC-CQCC mixed features acquired in S303 into training data and test data.
Preferably, in S4, the preset neural network includes: one layer of LSTM and one layer of GRU.
Preferably, in S4, the method for obtaining the speech discrimination model includes:
inputting training data into a preset neural network, adjusting parameters of a neural network model, and training the neural network;
inputting test data into the trained neural network to test the accuracy of the neural network;
and when the accuracy reaches a preset value, saving the parameters of the neural network model to obtain the voice identification model.
The invention also provides a speech identification system based on the Fisher mixed characteristics and the neural network, which comprises a computer, wherein the computer comprises:
at least one memory unit;
at least one processing unit;
wherein the at least one memory unit has stored therein at least one instruction that is loaded and executed by the at least one processing unit to perform the steps of:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of the voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
Preferably, the formula of the Fisher criterion is as follows:
$$R_F = \frac{\sigma_b}{\sigma_w}$$

wherein: $R_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
Preferably, in S3, the method for obtaining MFCC-CQCC mixture features of voice samples in a voice sample set includes:
S301, obtaining the inter-class variance $\sigma_b$ of each feature component of the MFCC features of all the voice samples in the voice sample set, and the inter-class variance $\sigma_b$ of each feature component of the CQCC features of all the voice samples, by the formula:

$$\sigma_b = \frac{1}{M}\sum_{i=1}^{M}\left(m_k^{(i)} - m_k\right)^2$$

in the formula: $\sigma_b$ is the inter-class variance of the $k$-th feature component, namely the variance of the per-sample means of that component, which reflects the degree of difference between different voice samples; $M$ represents the total number of voice samples; $m_k^{(i)}$ represents the mean of the $k$-th dimension of a certain feature class $s$ over the frames of the $i$-th voice sample; and $m_k$ represents the mean of the $k$-th dimension components of all voice samples for the feature class $s$;

S302, obtaining the intra-class variance $\sigma_w$ of each feature component of the MFCC features of all the voice samples in the voice sample set, and the intra-class variance $\sigma_w$ of each feature component of the CQCC features of the voice samples, by the formula:

$$\sigma_w = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{k,c}^{(i)} - m_k^{(i)}\right)^2$$

in the formula: $\sigma_w$ represents the intra-class variance of the $k$-th feature component, namely the mean of the per-sample variances of that component; $M$ represents the total number of voice samples; $m_k^{(i)}$ represents the mean of the $k$-th dimension of a certain feature class $s$ over the frames of the $i$-th voice sample; $n_i$ represents the number of frames of the $i$-th voice; and $x_{k,c}^{(i)}$ represents the $k$-th dimension parameter of the $c$-th frame of the $i$-th voice;

S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features over all voice samples, selecting the 12 dimensions of each feature with the largest ratios, and fusing these 12 + 12 dimensions into a 24-dimensional MFCC-CQCC mixed feature.
(III) advantageous effects
The invention provides a speech identification method and system based on Fisher mixed characteristics and a neural network. Compared with the prior art, the method has the following beneficial effects:
Firstly, a voice to be detected and a voice sample set comprising intelligent synthetic voice data and natural human voice database data are acquired, and the MFCC (Mel-frequency cepstral coefficient) features and CQCC (constant-Q cepstral coefficient) features of the voice samples in the voice sample set are obtained; the MFCC-CQCC mixed features of the voice samples are then obtained based on the Fisher criterion, the MFCC features, and the CQCC features; a voice identification model is acquired based on the MFCC-CQCC mixed features and a preset neural network; finally, whether the voice to be detected is intelligent synthetic voice or natural human voice is judged based on the voice identification model. In selecting the voice features, the invention does not use a traditional single feature but the MFCC-CQCC mixed feature based on the Fisher criterion, which organically combines the MFCC and CQCC features; voice synthesized by various algorithms can thus be identified effectively, and using the mixed feature to train a neural network yields a voice identification model whose accuracy is effectively improved.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a block diagram of a speech recognition method based on Fisher mixed features and a neural network according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below. It is obvious that the described embodiments are only a part of the embodiments of the present invention rather than all of them. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort shall fall within the protection scope of the present invention.
By providing a voice identification method and system based on Fisher mixed features and a neural network, the embodiments of the application solve the problem of the low accuracy of existing voice identification systems and improve the accuracy of voice identification.
In order to solve the technical problems, the general idea of the embodiment of the application is as follows:
Firstly, a voice to be detected and a voice sample set comprising intelligent synthetic voice data and natural human voice database data are acquired, and the MFCC (Mel-frequency cepstral coefficient) features and CQCC (constant-Q cepstral coefficient) features of the voice samples in the voice sample set are obtained; the MFCC-CQCC mixed features of the voice samples are then obtained based on the Fisher criterion, the MFCC features, and the CQCC features; a voice identification model is acquired based on the MFCC-CQCC mixed features and a preset neural network; finally, whether the voice to be detected is intelligent synthetic voice or natural human voice is judged based on the voice identification model.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
The embodiment of the invention provides a speech identification method based on Fisher mixed characteristics and a neural network, which is executed by a computer and comprises the following steps of S1-S5:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
In the embodiment of the invention, a traditional single feature is not used in selecting the voice features; instead, the MFCC-CQCC mixed feature based on the Fisher criterion is selected. This feature organically combines the MFCC features and the CQCC features, so that voice synthesized by various algorithms can be identified effectively; using the mixed feature to train a neural network yields a voice identification model whose accuracy is effectively improved.
The individual steps are described in detail below:
In step S1, a voice sample set and a voice to be detected are acquired.
Specifically, the voice sample set comprises intelligent synthetic voice data and natural human voice database data. In the embodiment of the invention, the intelligent synthetic voice data are synthesized with the web APIs of Baidu, Xunfei (iFlytek), and Aliyun, then preprocessed and cut into 3 s WAV files, 48000 in total. The natural human voice database data come from the open-source AISHELL Chinese speech database and are likewise cut into 3 s WAV files, 48000 in total.
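For illustration, the 3 s segmentation step could look like the following minimal Python sketch (not part of the patent text; the librosa/soundfile calls, the 16 kHz sample rate, and the file-naming scheme are assumptions):

```python
import librosa
import soundfile as sf

def cut_into_3s_clips(wav_path, out_prefix, sr=16000, clip_seconds=3):
    """Split one recording into non-overlapping 3 s WAV clips."""
    y, _ = librosa.load(wav_path, sr=sr)      # load and resample to a fixed rate
    clip_len = sr * clip_seconds
    for i in range(len(y) // clip_len):       # the trailing remainder is dropped
        clip = y[i * clip_len:(i + 1) * clip_len]
        sf.write(f"{out_prefix}_{i:05d}.wav", clip, sr)
```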
In step S2, MFCC features and CQCC features of the voice samples in the voice sample set are obtained.
Specifically, a 24-dimensional MFCC feature and a 24-dimensional CQCC feature are extracted for each speech sample in the set of speech samples.
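A sketch of this extraction step is given below (an illustration, not the patent's code: librosa provides MFCC directly but has no CQCC routine, so the CQCC here is approximated as the cepstrum of the log-power constant-Q transform, omitting the uniform resampling step of the full CQCC algorithm):

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def extract_features(wav_path, sr=16000, n_dims=24):
    """Return 24-dimensional MFCC and CQCC-like features, each (24, n_frames)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_dims)
    # CQCC approximation: DCT of the log-power constant-Q spectrogram.
    cqt_power = np.abs(librosa.cqt(y, sr=sr)) ** 2
    cqcc = dct(np.log(cqt_power + 1e-10), type=2, axis=0, norm='ortho')[:n_dims]
    return mfcc, cqcc
```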
In step S3, the MFCC-CQCC mixed features of the voice samples in the voice sample set are obtained based on the Fisher criterion, the MFCC (Mel-frequency cepstral coefficient) features, and the CQCC (constant-Q cepstral coefficient) features.
Specifically, S301, the inter-class variance $\sigma_b$ of each feature component of the MFCC features of all voice samples in the voice sample set, and the inter-class variance $\sigma_b$ of each feature component of the CQCC features of all voice samples, is obtained by the formula:

$$\sigma_b = \frac{1}{M}\sum_{i=1}^{M}\left(m_k^{(i)} - m_k\right)^2$$

in the formula: $\sigma_b$ represents the inter-class variance of the $k$-th feature component (i.e. the variance of the per-sample means of that component, which reflects the degree of difference between different voice samples); $M$ represents the total number of voice samples; $m_k^{(i)}$ represents the mean of the $k$-th dimension of a certain feature class $s$ over the frames of the $i$-th voice sample; and $m_k$ represents the mean of the $k$-th dimension components of all voice samples for the feature class $s$.
S302, the intra-class variance $\sigma_w$ of each feature component of the MFCC features of all voice samples in the voice sample set, and the intra-class variance $\sigma_w$ of each feature component of the CQCC features of the voice samples, is obtained by the formula:

$$\sigma_w = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{k,c}^{(i)} - m_k^{(i)}\right)^2$$

in the formula: $\sigma_w$ represents the intra-class variance of the $k$-th feature component (i.e. the mean of the per-sample variances of that component); $M$ represents the total number of voice samples; $m_k^{(i)}$ represents the mean of the $k$-th dimension of a certain feature class $s$ for the $i$-th voice sample; $n_i$ represents the number of frames of the $i$-th voice (in the embodiment of the present invention, every voice sample has 299 frames); and $x_{k,c}^{(i)}$ represents the $k$-th dimension parameter of the $c$-th frame of the $i$-th voice. It should be noted that, for each voice sample, each class of features is a 24 × 299 matrix, and the matrix operations are performed with Python's NumPy library.
S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features over all voice samples, selecting the 12 dimensions of each feature with the largest ratios, and fusing these 12 + 12 dimensions into a 24-dimensional MFCC-CQCC mixed feature.
In the specific implementation, the MFCC feature of each voice sample is extracted and recorded as M, and the CQCC feature is extracted and recorded as C; M and C are both 24 × 299 matrices. The formulas for the inter-class and intra-class variances were given in S301 and S302, where k indexes the feature dimension: for each dimension k, the inter-class and intra-class variances are computed over all voice samples, and their ratio gives the Fisher ratio of the k-th dimension. Letting k run from 1 to 24, the Fisher ratio of each dimension is calculated for M and for C respectively. The Fisher ratios of the dimensions of M and C are sorted in descending order, the first 12 dimensions of M and the first 12 dimensions of C are selected, and together they form the MFCC-CQCC mixed feature. The obtained mixed feature is saved as an .npy file named 'index + 0 (or 1)', where 0 indicates the MFCC-CQCC mixed feature of intelligent synthetic voice and 1 indicates that of natural human voice.
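The per-dimension Fisher ratios and the 12 + 12 fusion of S301 to S303 can be written compactly with NumPy. The following is a sketch under the assumption that the MFCC and CQCC features of all M samples are stacked into arrays of shape (M, 24, 299); all array and function names are illustrative:

```python
import numpy as np

def fisher_ratios(feats):
    """feats: (M, K, T) array of one feature class for all M samples.
    Returns the Fisher ratio R_F = sigma_b / sigma_w for each of the K dims."""
    sample_means = feats.mean(axis=2)                    # m_k^(i), shape (M, K)
    global_mean = sample_means.mean(axis=0)              # m_k,     shape (K,)
    sigma_b = ((sample_means - global_mean) ** 2).mean(axis=0)
    sigma_w = ((feats - sample_means[:, :, None]) ** 2).mean(axis=2).mean(axis=0)
    return sigma_b / sigma_w

def fuse_mfcc_cqcc(mfcc_all, cqcc_all, n_keep=12):
    """Keep the 12 highest-ratio dims of each feature; concatenate to 24 dims."""
    top_m = np.argsort(fisher_ratios(mfcc_all))[::-1][:n_keep]
    top_c = np.argsort(fisher_ratios(cqcc_all))[::-1][:n_keep]
    return np.concatenate([mfcc_all[:, top_m, :], cqcc_all[:, top_c, :]], axis=1)

# Saving one sample's mixed feature under the naming scheme described above:
# np.save(f"{idx}_{label}.npy", mixed[i])   # label 0 = synthetic, 1 = natural
```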
The Fisher criterion is formulated as follows:
$$R_F = \frac{\sigma_b}{\sigma_w}$$

wherein: $R_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
In step S4, a speech recognition model is obtained based on the MFCC-CQCC mixture features and a preset neural network.
Specifically, S401, 80% of the MFCC-CQCC mixed features of the intelligent synthetic voice data in the voice sample set and 80% of the MFCC-CQCC mixed features of the natural human voice database data are randomly extracted from the .npy files as training data. In the embodiment of the invention this gives 76800 training items; the remaining 19200 MFCC-CQCC mixed-feature items are used as test data.
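A sketch of the per-class 80/20 split (the features/ directory and the '<index>_<label>.npy' naming follow the scheme described in S303; the fixed random seed is an assumption):

```python
import numpy as np
from glob import glob

rng = np.random.default_rng(0)

def split_class(pattern, frac=0.8):
    files = np.array(sorted(glob(pattern)))
    rng.shuffle(files)
    cut = int(frac * len(files))
    return files[:cut], files[cut:]

syn_train, syn_test = split_class("features/*_0.npy")   # intelligent synthetic voice
nat_train, nat_test = split_class("features/*_1.npy")   # natural human voice
```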
S402, the training data are input into the preset neural network, the parameters of the neural network model are adjusted, and the neural network is trained. In the embodiment of the invention, the neural network is an LSTM-GRU network, i.e. one LSTM layer followed by one GRU layer: the LSTM layer has 20 nodes with a dropout rate of 0.2, the GRU layer has 20 nodes, training runs for 100 epochs with a batch size of 500, and the optimization method is Adam.
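A minimal Keras sketch of the described LSTM-GRU network follows (the patent does not name a framework, so Keras is an assumption; each clip is fed as a (299, 24) sequence, i.e. the 24-dimensional mixed feature transposed to time-major order, and the 0.2 dropout is interpreted as dropout within the LSTM layer):

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, GRU, Dense

model = Sequential([
    LSTM(20, dropout=0.2, return_sequences=True, input_shape=(299, 24)),
    GRU(20),
    Dense(1, activation="sigmoid"),   # 1 = natural human voice, 0 = synthetic
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training as described: 100 epochs, batch size 500.
# model.fit(x_train, y_train, epochs=100, batch_size=500,
#           validation_data=(x_test, y_test))
```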
S403, the test data are input into the trained neural network to test the accuracy of the network; when the accuracy reaches a preset value, the parameters of the neural network model are saved, yielding the voice identification model.
In step S5, a type to which the speech to be detected belongs is obtained based on the speech discrimination model, where the type includes the intelligent synthesized speech and the natural human voice.
Specifically, the MFCC-CQCC mixed feature of the voice to be detected is obtained and input into the voice identification model; if the model outputs 1, the voice to be detected is natural human voice, and if the model outputs 0, the voice to be detected is intelligent synthetic voice.
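Inference on a new clip might look as follows (a sketch reusing the Keras model above; the 0.5 threshold on the sigmoid output is an assumption, since the patent only states that an output of 1 means natural voice and 0 means synthetic voice):

```python
import numpy as np

def classify(model, mixed_feature):
    """mixed_feature: (24, 299) MFCC-CQCC mixed feature of one 3 s clip."""
    x = mixed_feature.T[np.newaxis, ...]     # one (299, 24) sequence as a batch
    prob = float(model.predict(x)[0, 0])
    return "natural human voice" if prob >= 0.5 else "intelligent synthetic voice"
```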
It should be noted that, in the embodiment of the present invention, once the voice identification model has been obtained, it can be used to judge the types of any number of voices to be detected; the model does not need to be obtained anew for each detection. For subsequent detections, only the voice to be detected needs to be acquired before performing step S5.
In order to verify that the method of the embodiment of the invention can improve the accuracy of voice identification, three groups of comparison embodiments are provided, which are specifically as follows:
Example one: training the LSTM-GRU neural network with MFCC features alone and with CQCC features alone gives accuracies of 97.64% and 97.48%, respectively.
Example two: training the same neural network with the MFCC-CQCC mixed features gives an accuracy of 98.27%.
Example three: testing robustness to MP3 compression. The samples used in examples one and two were WAV files; 1000 samples were selected and compressed to MP3 format, the MFCC features, CQCC features, and MFCC-CQCC mixed features were extracted from them, and each was fed to the same neural network as in the examples above. The final accuracies are MFCC: 90.14%, CQCC: 60.64%, MFCC-CQCC mixed features: 92.52%.
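The WAV-to-MP3 re-encoding used in example three could be done, for instance, with pydub (an assumption; the patent does not name a tool, and the 128 kbit/s bitrate is illustrative; pydub requires ffmpeg for MP3 encoding):

```python
from pydub import AudioSegment   # requires ffmpeg for MP3 encoding

def wav_to_mp3(wav_path, mp3_path, bitrate="128k"):
    """Re-encode a WAV clip as MP3 to test robustness to lossy compression."""
    AudioSegment.from_wav(wav_path).export(mp3_path, format="mp3", bitrate=bitrate)
```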
The embodiment of the invention also provides a voice identification system based on Fisher mixed characteristics and a neural network, which comprises a computer, wherein the computer comprises:
at least one memory unit;
at least one processing unit;
wherein at least one instruction is stored in the at least one memory unit, and the at least one instruction is loaded and executed by the at least one processing unit to implement the following steps:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of voice samples in the voice sample set;
s3, acquiring MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
It can be understood that the voice identification system based on Fisher mixed features and a neural network provided in the embodiment of the present invention corresponds to the voice identification method described above; for the explanations, examples, and beneficial effects of the relevant contents, reference may be made to the corresponding parts of the method, which are not repeated here.
In summary, compared with the prior art, the method has the following beneficial effects:
1. In the embodiment of the invention, a traditional single feature is not used in selecting the voice features; instead, the MFCC-CQCC mixed feature based on the Fisher criterion is selected. This feature organically combines the MFCC features and the CQCC features, so that voice synthesized by various algorithms can be identified effectively; using the mixed feature to train a neural network yields a voice identification model whose accuracy is effectively improved.
2. The neural network in the embodiment of the invention adopts one LSTM layer and one GRU layer, combining the easier parameter convergence of the GRU with the stronger expressive power of the LSTM; this reduces the time needed to train the neural network model while maintaining good performance, which further ensures the accuracy of the voice identification model.
It should be noted that, through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A speech identification method based on Fisher mixed features and a neural network, the method being executed by a computer and comprising the following steps:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of the voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
2. The speech identification method based on Fisher mixed features and a neural network of claim 1, wherein the formula of the Fisher criterion is as follows:

$$R_F = \frac{\sigma_b}{\sigma_w}$$

wherein: $R_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
3. The speech identification method based on Fisher mixed features and a neural network of claim 2, wherein in S3, the method of obtaining the MFCC-CQCC mixed features of the voice samples in the voice sample set comprises:

S301, obtaining the inter-class variance $\sigma_b$ of each feature component of the MFCC features of all the voice samples in the voice sample set, and the inter-class variance $\sigma_b$ of each feature component of the CQCC features of all the voice samples, by the formula:

$$\sigma_b = \frac{1}{M}\sum_{i=1}^{M}\left(m_k^{(i)} - m_k\right)^2$$

in the formula: $\sigma_b$ is the inter-class variance of the $k$-th feature component, namely the variance of the per-sample means of that component, which reflects the degree of difference between different voice samples; $M$ represents the total number of voice samples; $m_k^{(i)}$ represents the mean of the $k$-th dimension of a certain feature class $s$ over the frames of the $i$-th voice sample; and $m_k$ represents the mean of the $k$-th dimension components of all voice samples for the feature class $s$;

S302, obtaining the intra-class variance $\sigma_w$ of each feature component of the MFCC features of all the voice samples in the voice sample set, and the intra-class variance $\sigma_w$ of each feature component of the CQCC features of the voice samples, by the formula:

$$\sigma_w = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{k,c}^{(i)} - m_k^{(i)}\right)^2$$

in the formula: $\sigma_w$ represents the intra-class variance of the $k$-th feature component, namely the mean of the per-sample variances of that component; $M$ represents the total number of voice samples; $m_k^{(i)}$ represents the mean of the $k$-th dimension of a certain feature class $s$ over the frames of the $i$-th voice sample; $n_i$ represents the number of frames of the $i$-th voice; and $x_{k,c}^{(i)}$ represents the $k$-th dimension parameter of the $c$-th frame of the $i$-th voice;

S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features over all voice samples, selecting the 12 dimensions of each feature with the largest ratios, and fusing these 12 + 12 dimensions into a 24-dimensional MFCC-CQCC mixed feature.
4. The speech identification method based on Fisher mixed features and a neural network of claim 3, wherein before obtaining the voice identification model, the method further comprises: dividing the MFCC-CQCC mixed features acquired in S303 into training data and test data.
5. The speech identification method based on Fisher mixed features and a neural network of claim 4, wherein in S4, the preset neural network comprises: one LSTM layer and one GRU layer.
6. The speech identification method based on Fisher mixed features and a neural network of claim 5, wherein in S4, the method of obtaining the voice identification model comprises:
inputting training data into the preset neural network, adjusting the parameters of the neural network model, and training the neural network;
inputting test data into the trained neural network to test the accuracy of the neural network;
and when the accuracy reaches a preset value, saving the parameters of the neural network model to obtain the voice identification model.
7. A speech identification system based on Fisher mixed features and a neural network, the system comprising a computer, the computer comprising:
at least one memory unit;
at least one processing unit;
wherein the at least one memory unit has stored therein at least one instruction that is loaded and executed by the at least one processing unit to perform the steps of:
s1, acquiring a voice sample set and a voice to be detected, wherein the voice sample set comprises intelligent synthetic voice data and natural human voice database data;
s2, obtaining MFCC characteristics and CQCC characteristics of the voice samples in the voice sample set;
s3, obtaining MFCC-CQCC mixed characteristics of the voice samples in the voice sample set based on Fisher criterion, MFCC characteristics and CQCC characteristics;
s4, acquiring a voice identification model based on the MFCC-CQCC mixed features and a preset neural network;
S5, acquiring the type of the voice to be detected based on the voice identification model, wherein the type comprises intelligent synthetic voice and natural human voice.
8. The speech identification system based on Fisher mixed features and a neural network of claim 7, wherein the formula of the Fisher criterion is as follows:

$$R_F = \frac{\sigma_b}{\sigma_w}$$

wherein: $R_F$ is the Fisher ratio of a feature component, $\sigma_b$ represents the inter-class variance of the feature component, and $\sigma_w$ represents the intra-class variance of the feature component.
9. The speech identification system based on Fisher mixed features and a neural network of claim 8, wherein in S3, the method of obtaining the MFCC-CQCC mixed features of the voice samples in the voice sample set comprises:

S301, obtaining the inter-class variance $\sigma_b$ of each feature component of the MFCC features of all the voice samples in the voice sample set, and the inter-class variance $\sigma_b$ of each feature component of the CQCC features of all the voice samples, by the formula:

$$\sigma_b = \frac{1}{M}\sum_{i=1}^{M}\left(m_k^{(i)} - m_k\right)^2$$

in the formula: $\sigma_b$ is the inter-class variance of the $k$-th feature component, namely the variance of the per-sample means of that component, which reflects the degree of difference between different voice samples; $M$ represents the total number of voice samples; $m_k^{(i)}$ represents the mean of the $k$-th dimension of a certain feature class $s$ over the frames of the $i$-th voice sample; and $m_k$ represents the mean of the $k$-th dimension components of all voice samples for the feature class $s$;

S302, obtaining the intra-class variance $\sigma_w$ of each feature component of the MFCC features of all the voice samples in the voice sample set, and the intra-class variance $\sigma_w$ of each feature component of the CQCC features of the voice samples, by the formula:

$$\sigma_w = \frac{1}{M}\sum_{i=1}^{M}\frac{1}{n_i}\sum_{c=1}^{n_i}\left(x_{k,c}^{(i)} - m_k^{(i)}\right)^2$$

in the formula: $\sigma_w$ represents the intra-class variance of the $k$-th feature component, namely the mean of the per-sample variances of that component; $M$ represents the total number of voice samples; $m_k^{(i)}$ represents the mean of the $k$-th dimension of a certain feature class $s$ over the frames of the $i$-th voice sample; $n_i$ represents the number of frames of the $i$-th voice; and $x_{k,c}^{(i)}$ represents the $k$-th dimension parameter of the $c$-th frame of the $i$-th voice;

S303, calculating the Fisher ratio of each dimensional component of the MFCC features and of the CQCC features over all voice samples, selecting the 12 dimensions of each feature with the largest ratios, and fusing these 12 + 12 dimensions into a 24-dimensional MFCC-CQCC mixed feature.
CN201911130906.4A (filed 2019-11-19, priority 2019-11-19) - Speech identification method and system based on Fisher mixed feature and neural network - Pending - CN110782877A (en)

Priority Applications (1)

Application CN201911130906.4A, priority date 2019-11-19, filing date 2019-11-19: Speech identification method and system based on Fisher mixed feature and neural network.

Publications (1)

CN110782877A, published 2020-02-11.

Family

ID=69391714

Family Applications (1)

Application CN201911130906.4A, filed 2019-11-19: Speech identification method and system based on Fisher mixed feature and neural network (pending).

Country Status (1)

CN: CN110782877A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101937678A (en) * 2010-07-19 2011-01-05 东南大学 Judgment-deniable automatic speech emotion recognition method for fidget
CN103325372A (en) * 2013-05-20 2013-09-25 北京航空航天大学 Chinese phonetic symbol tone identification method based on improved tone core model
CN104970773A (en) * 2015-07-21 2015-10-14 西安交通大学 Automatic sleep stage classification method based on dual character filtering
CN106491143A (en) * 2016-10-18 2017-03-15 哈尔滨工业大学深圳研究生院 Judgment method of authenticity and device based on EEG signals
US20180254046A1 (en) * 2017-03-03 2018-09-06 Pindrop Security, Inc. Method and apparatus for detecting spoofing conditions
CN107871498A (en) * 2017-10-10 2018-04-03 昆明理工大学 It is a kind of based on Fisher criterions to improve the composite character combinational algorithm of phonetic recognization rate
CN108039176A (en) * 2018-01-11 2018-05-15 广州势必可赢网络科技有限公司 Voiceprint authentication method and device for preventing recording attack and access control system
CN108986824A (en) * 2018-07-09 2018-12-11 宁波大学 A kind of voice playback detection method
CN109754812A (en) * 2019-01-30 2019-05-14 华南理工大学 A kind of voiceprint authentication method of the anti-recording attack detecting based on convolutional neural networks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
于泓 (Yu Hong): doctoral dissertation, Beijing University of Posts and Telecommunications, 30 September 2019 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113314148A (en) * 2021-07-29 2021-08-27 中国科学院自动化研究所 Light-weight neural network generated voice identification method and system based on original waveform
CN113314148B (en) * 2021-07-29 2021-11-09 中国科学院自动化研究所 Light-weight neural network generated voice identification method and system based on original waveform
CN113516969A (en) * 2021-09-14 2021-10-19 北京远鉴信息技术有限公司 Spliced voice identification method and device, electronic equipment and storage medium
CN113516969B (en) * 2021-09-14 2021-12-14 北京远鉴信息技术有限公司 Spliced voice identification method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110164452B (en) Voiceprint recognition method, model training method and server
EP3719798B1 (en) Voiceprint recognition method and device based on memorability bottleneck feature
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
EP3599606A1 (en) Machine learning for authenticating voice
US6253179B1 (en) Method and apparatus for multi-environment speaker verification
CN105096955B (en) A kind of speaker's method for quickly identifying and system based on model growth cluster
CN111524527A (en) Speaker separation method, device, electronic equipment and storage medium
CN110136696B (en) Audio data monitoring processing method and system
CN111916108B (en) Voice evaluation method and device
US20160019897A1 (en) Speaker recognition from telephone calls
Yücesoy et al. A new approach with score-level fusion for the classification of a speaker age and gender
CN113223536B (en) Voiceprint recognition method and device and terminal equipment
WO2019137392A1 (en) File classification processing method and apparatus, terminal, server, and storage medium
CN111508524B (en) Method and system for identifying voice source equipment
CN111508505A (en) Speaker identification method, device, equipment and storage medium
CN110164417B (en) Language vector obtaining and language identification method and related device
CN116564315A (en) Voiceprint recognition method, voiceprint recognition device, voiceprint recognition equipment and storage medium
CN110782877A (en) Speech identification method and system based on Fisher mixed feature and neural network
Shareef et al. Gender voice classification with huge accuracy rate
CN114610840A (en) Sensitive word-based accounting monitoring method, device, equipment and storage medium
JPWO2020003413A1 (en) Information processing equipment, control methods, and programs
WO2021051533A1 (en) Address information-based blacklist identification method, apparatus, device, and storage medium
CN111833842A (en) Synthetic sound template discovery method, device and equipment
Herrera-Camacho et al. Design and testing of a corpus for forensic speaker recognition using MFCC, GMM and MLE
CN111061909A (en) Method and device for classifying accompaniment

Legal Events

Code    Description
PB01    Publication
SE01    Entry into force of request for substantive examination
RJ01    Rejection of invention patent application after publication (application publication date: 20200211)