
CN118098247A - Voiceprint recognition method and system based on parallel feature extraction model - Google Patents

Voiceprint recognition method and system based on parallel feature extraction model

Info

Publication number
CN118098247A
CN118098247A
Authority
CN
China
Prior art keywords
voiceprint
features
fbank
classification model
mfcc
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410009893.XA
Other languages
Chinese (zh)
Inventor
魏江超
罗永和
刘美坪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Leelen Technology Co Ltd
Original Assignee
Xiamen Leelen Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Leelen Technology Co Ltd filed Critical Xiamen Leelen Technology Co Ltd
Priority to CN202410009893.XA priority Critical patent/CN118098247A/en
Publication of CN118098247A publication Critical patent/CN118098247A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a voiceprint recognition method and system based on parallel feature extraction, wherein the method comprises the following steps: resampling the voiceprint to be identified, and extracting features from the resampled voiceprint to be identified to obtain FBANK features and MFCC features; inputting the FBANK features and the MFCC features into a pre-trained parallel feature classification model for parallel processing, and outputting fusion voiceprint features; inputting the user's original registered voiceprint into the parallel feature classification model, and outputting comparison voiceprint features; and calculating the cosine similarity of the fusion voiceprint features and the comparison voiceprint features, and determining that the voiceprint to be identified comes from a registrant when the cosine similarity is greater than a first threshold. The system comprises an extraction module, a processing module and a comparison module. Accurate voiceprint recognition is thereby achieved.

Description

Voiceprint recognition method and system based on parallel feature extraction model
Technical Field
The invention relates to the technical field of voiceprint recognition, in particular to a voiceprint recognition method and system based on a parallel feature extraction model.
Background
Nowadays, with the rapid development of computer hardware, various technologies have become widely available. Deep-learning-based voiceprint recognition is widely applied to security authentication and personalized scenarios, for example auxiliary authentication when unlocking a smart lock, or authenticating different members of a smart home to provide personalized services.
A voiceprint recognition algorithm generally involves three aspects: audio feature extraction, model construction and scoring decision. Feature extraction options include FBANK, MFCC, logFBank, LPCC, LPC, LSF and so on, with FBANK and MFCC the most commonly used for voiceprint recognition. The deep model is generally built as a classification network, usually an improved TDNN architecture, trained with AAM-SOFTMAX as the loss function. Finally, the scoring decision judges whether the cosine similarity between the two extracted feature vectors is greater than a set threshold, so as to decide whether the speaker is an authenticated member.
The input of such a model is typically a single feature, for example MFCC features or FBANK features. A single-feature input model considers only a limited range of audio characteristics, so the algorithm is less robust in practical applications. On the other hand, manually processing and stitching features together as input may lose some information during processing, or the model may have limited capacity to handle the stitched features during training.
The invention aims to solve the above problems in the prior art by designing a voiceprint recognition method and system based on a parallel feature extraction model.
Disclosure of Invention
Accordingly, the present invention is directed to a voiceprint recognition method and system based on a parallel feature extraction model, which can solve the above-mentioned problems.
The invention provides a voiceprint recognition method based on a parallel feature extraction model, which comprises the following steps:
Resampling the voiceprint to be identified, and extracting features of the resampled voiceprint to be identified to obtain FBANK features and MFCC features;
Inputting the FBANK features and the MFCC features into a pre-trained parallel feature classification model for parallel processing, and outputting to obtain fusion voiceprint features; inputting the original registered voiceprint of the user into the parallel feature classification model, and outputting to obtain comparison voiceprint features;
And calculating cosine similarity of the fusion voiceprint features and the comparison voiceprint features, and determining that the voiceprint to be identified is from a registrant when the cosine similarity is larger than a first threshold.
Further, resampling the voiceprint to be identified, and extracting features of the resampled voiceprint to be identified to obtain FBANK features and MFCC features, including:
The high-frequency part of the voiceprint to be identified is emphasized, and the emphasized voiceprint to be identified is obtained;
Dividing the emphasized voiceprint to be identified into a plurality of short-time frames, substituting each short-time frame into a Hamming window function, and obtaining continuous short-time frames;
Performing discrete Fourier transform on each short-time frame to obtain a frequency spectrum of each short-time frame, and performing modular squaring on the frequency spectrum of each short-time frame to obtain the power spectrum of the voiceprint to be identified;
Filtering the power spectrum of the voiceprint to be identified through a Mel filter bank to obtain FBANK features;
and obtaining the MFCC characteristics through discrete cosine transformation of the FBANK characteristics.
Further, the pre-training process of the parallel feature classification model specifically includes:
collecting and marking training voiceprints, and constructing a voiceprint training set; performing data enhancement on the training voiceprint to obtain an enhanced voiceprint, and adding the enhanced voiceprint into the voiceprint training set;
Constructing an FBANK feature extraction network and an MFCC feature extraction network as the front-end input networks of the parallel feature classification model; constructing a fusion voiceprint network as the back-end output network of the parallel feature classification model;
And training by utilizing the voiceprint training set to obtain the parallel feature classification model.
Further, the data enhancement is performed on the training voiceprint to obtain an enhanced training voiceprint, which includes:
Reverberation is carried out on the training voiceprint by using the open source data set to generate a voiceprint with human voice and noise;
And carrying out random masking on the voiceprint with the human voice and noise in a time domain for 0-5 frames to obtain the masked voiceprint.
Further, the FBANK voiceprint feature extraction network includes two 1*1 convolution blocks and three residual channel attention modules;
The MFCC voiceprint feature extraction network includes two 1*1 convolution blocks and two residual channel attention modules.
Further, the residual channel attention module comprises two 1*1 convolution blocks, a channel attention module and an add module.
Further, the inputting the FBANK features and the MFCC features into a pre-trained parallel feature classification model for parallel processing, and outputting to obtain a fused voiceprint feature, including:
Extracting voiceprint features from the FBANK features through the FBANK feature extraction network to obtain FBANK voiceprint features;
extracting voiceprint features of the MFCC features through an MFCC voiceprint feature extraction network to obtain MFCC voiceprint features;
And processing the FBANK voiceprint features and the MFCC voiceprint features through a fusion voiceprint network to obtain fusion voiceprint features.
Further, the processing the FBANK voiceprint features and the MFCC voiceprint features through the fusion voiceprint network to obtain fusion voiceprint features includes:
superposing the FBANK voiceprint features and the MFCC voiceprint features to obtain superimposed voiceprint features; and giving different weights to the superimposed voiceprint features to obtain the fusion voiceprint features.
Further, the assigning different weights to the superimposed voiceprint features to obtain a fused voiceprint feature includes:
calculating the mean value and standard deviation of each frame feature dimension of the superimposed voiceprint features;
Stacking and serially connecting the mean value and the standard deviation of each feature dimension of the superimposed voiceprint features to obtain the global features of the superimposed voiceprint features;
And carrying out attention weighted calculation on the global features of the superimposed voiceprint features to obtain the average value and the standard deviation of each frame, and stacking the average value and the standard deviation of each frame of the global features of the superimposed voiceprint features to obtain the fusion voiceprint features.
The invention provides a voiceprint recognition system based on a parallel feature classification model, which comprises:
The extraction module is used for resampling the voiceprint to be identified, and extracting the characteristics of the resampled voiceprint to be identified to obtain FBANK characteristics and MFCC characteristics;
The processing module is used for inputting the FBANK features and the MFCC features into a pre-trained parallel feature classification model for parallel processing and outputting to obtain fusion voiceprint features; inputting the original registered voiceprint of the user into the parallel feature classification model, and outputting to obtain comparison voiceprint features;
And the comparison module is used for calculating cosine similarity of the fusion voiceprint characteristics and the comparison voiceprint characteristics, and determining that the voiceprint to be identified comes from a registrant when the cosine similarity is larger than a first threshold value.
The invention has the beneficial effects that:
Firstly, FBANK features and MFCC features are selected as the voiceprint extraction features. FBANK features retain higher correlation, while MFCC features provide higher discrimination; combining the advantages of the two allows the effective features in the voiceprint to be compared effectively, which improves overall robustness and makes the subsequent comparison with registered voiceprints more accurate.
Secondly, a parallel voiceprint extraction architecture is introduced, so the two extracted audio features are processed in parallel without manual fusion and the audio information is fully utilized. A pyramid multi-scale fusion structure combined with residual connections, an attention mechanism and a statistics pooling layer processes the MFCC and FBANK features in parallel inside the model, and feature fusion is finally realized inside the model to obtain the voiceprint features. In this way, the possible adverse influence of manually processed features and single-feature input is avoided, and the recognition performance and accuracy of the model are improved.
Thirdly, SE-blocks rescale the channels to expand the temporal context of the frame layer and model the channels better; SE-Res2Blocks of different layers are aggregated and propagated so that their output feature maps are connected; and a channel-dependent frame-attention statistics pooling module allows the network to focus on different frame subsets and to attend to global features during channel statistics estimation.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a parallel feature extraction model.
FIG. 3 is a flow chart of SE-Res2Block calculation.
Fig. 4 is a flow chart of attention statistics layer calculation.
Fig. 5 is a system block diagram of the present invention.
Detailed Description
For the convenience of understanding by those skilled in the art, the present invention will be described in further detail with reference to the accompanying drawings. It should be understood that, unless the order of the steps mentioned in this embodiment is specifically stated, the order of the steps may be adjusted according to actual needs, and some steps may even be performed simultaneously or partially simultaneously.
As shown in fig. 1, an embodiment of the present invention provides a voiceprint recognition method based on a parallel feature extraction model, including:
S1, resampling the voiceprint to be identified, and extracting features of the resampled voiceprint to be identified to obtain FBANK features and MFCC features;
In this step, 16 kHz single-channel 16-bit speech audio is uniformly adopted for voiceprint sampling, so different speech audio is resampled with a resampling technique to ensure that the sampled voiceprint information meets the sampling requirement and that subsequent voiceprint extraction is accurate. Since FBANK features retain higher correlation and MFCC features provide higher discrimination, both are adopted as the features of the voiceprint to be identified.
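For illustration only, the resampling step described above might be sketched in Python as follows; the use of librosa and soundfile is an assumption of this description, and the function names are illustrative:

```python
# Hedged sketch: resample arbitrary input audio to the 16 kHz, single-channel format
# assumed above; the audio is written back as 16-bit PCM. Library choice is illustrative.
import librosa
import numpy as np
import soundfile as sf

def load_voiceprint(path, target_sr=16000):
    # librosa.load resamples to target_sr and downmixes to mono by default
    audio, sr = librosa.load(path, sr=target_sr, mono=True)
    return audio.astype(np.float32), sr

def save_16bit(path, audio, sr=16000):
    # write as 16-bit PCM so the stored voiceprint meets the stated sampling requirement
    sf.write(path, audio, sr, subtype="PCM_16")
```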
FBANK (FilterBank): because the response of the human ear to the sound spectrum is nonlinear, FBANK is a front-end processing algorithm that processes audio in a manner similar to the human ear, which can improve speech recognition performance. The general steps for obtaining the FBANK features of a speech signal are: pre-emphasis, framing, windowing, short-time Fourier transform (STFT), Mel filtering, mean removal, and so on. The MFCC features are obtained by applying a discrete cosine transform (DCT) to the FBANK features.
MFCC (Mel-frequency cepstral coefficients): Mel-frequency cepstral coefficients are proposed based on the auditory characteristics of the human ear and have a nonlinear correspondence with frequency in Hz. The MFCC is the spectral feature calculated using this relationship. It is mainly used to extract the characteristics of speech data and to reduce the computation dimension.
The method comprises the following specific steps:
S101, emphasizing a high-frequency part of the voiceprint to be identified to obtain the emphasized voiceprint to be identified;
In this step, a pre-emphasis operation is performed on the human voice audio to be recognized to reduce the energy of the low-frequency components while enhancing the relatively high-frequency components. On the one hand, this balances the spectrum: since the high frequencies usually have smaller amplitudes than the lower frequencies, boosting the high-frequency part flattens the signal spectrum and keeps a comparable signal-to-noise ratio (SNR) across the whole band from low to high frequencies. On the other hand, it compensates the high-frequency part of the speech signal suppressed by the vocal system, counteracting the effect of the vocal cords and lips during voice production and highlighting the high-frequency formants. The pre-emphasis is implemented by a high-pass filter:
y[n] = x[n] - a·x[n-1];
where x[n] is the original sample value, y[n] is the pre-emphasized sample value, and a is the filter coefficient, here a = 0.97.
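A minimal Python sketch of this pre-emphasis filter is given below for illustration; the function name is illustrative:

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    # y[n] = x[n] - a*x[n-1]; the first sample is passed through unchanged
    return np.append(x[0], x[1:] - a * x[:-1])
```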
S102, dividing the emphasized voiceprint to be identified into a plurality of short-time frames, and applying a Hamming window function to each frame to obtain continuous short-time frames;
In this step: the speech signal is a non-stationary process and cannot be analyzed with signal processing techniques intended for stationary signals, but it can be regarded as short-time stationary. Therefore, before the Fourier transform, a framing process splits the pre-emphasized voice audio signal into a plurality of frames, each frame being 25 ms long with a frame shift of 10 ms. To suppress spectral leakage and eliminate the discontinuities that may appear at both ends of each frame, a Hamming window function is used to smooth the frame boundaries. The Hamming window function is:
W(n) = 0.54 - 0.46·cos(2πn/(N-1));
where n is the sample index within the frame, N is the total number of samples in the frame, and W(n) is the window value.
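A minimal Python sketch of the framing and windowing described above (25 ms frames, 10 ms shift at 16 kHz) is given below; it assumes the signal is at least one frame long:

```python
import numpy as np

def frame_and_window(x, sr=16000, frame_ms=25, shift_ms=10):
    frame_len = int(sr * frame_ms / 1000)    # 400 samples per frame at 16 kHz
    frame_shift = int(sr * shift_ms / 1000)  # 160 samples shift
    num_frames = 1 + (len(x) - frame_len) // frame_shift
    # np.hamming implements W(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    window = np.hamming(frame_len)
    frames = np.stack([x[i * frame_shift: i * frame_shift + frame_len] * window
                       for i in range(num_frames)])
    return frames  # shape [num_frames, frame_len]
```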
S103, performing discrete Fourier transform on each short-time frame to obtain a frequency spectrum of each short-time frame, and performing modular squaring on the frequency spectrum of each short-time frame to obtain a power spectrum of the voiceprint to be identified;
In this step, the characteristics of a signal are usually difficult to see from its transformation in the time domain, so it is converted into an energy distribution in the frequency domain for observation; different energy distributions in the frequency domain can represent the characteristics of different voices. The DFT of each frame of the speech signal is:
X(k) = Σ_{n=0}^{n_fft-1} x(n)·e^{-j·2πkn/n_fft}, k = 0, 1, ..., n_fft-1;
where x(n) is the input speech frame, n_fft is the number of points of the Fourier transform, and k is the frequency index.
Each frame is Fourier transformed; because the input is real-valued, the positive- and negative-frequency halves are complex conjugates of each other, so only a [1 + n_fft/2, frames] array needs to be saved, denoted as X.
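For illustration, the per-frame DFT and power spectrum might be computed as below; np.fft.rfft keeps only the 1 + n_fft/2 non-redundant bins of the real-valued input, matching the array X described above:

```python
import numpy as np

def power_spectrum(frames, n_fft=512):
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)  # [num_frames, 1 + n_fft//2]
    power = np.abs(spectrum) ** 2                    # modular squaring of the spectrum
    return power.T                                   # X with shape [1 + n_fft//2, num_frames]
```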
S104, filtering the power spectrum of the voiceprint to be identified through a Mel filter bank to obtain FBANK features;
In this step, since the human ear is sensitive to low frequencies and less sensitive to high frequencies, a Mel filter bank is used to filter the Fourier-transformed voiceprint to be identified to obtain the FBANK features. Since the sampling rate is 16000 Hz, the maximum frequency is set to 8000 Hz according to the Nyquist sampling theorem, and 80 Mel filters are used for filtering. The filter array can be expressed as Y with shape = [80, 1 + n_fft/2], and the FBANK features are Y × X with shape = [80, frames], i.e. 80-dimensional FBANK features.
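A sketch of the Mel filtering step is given below; librosa.filters.mel builds the filter matrix Y with shape [80, 1 + n_fft/2], and taking the logarithm of the filter-bank energies is a common convention assumed here rather than stated above:

```python
import librosa
import numpy as np

def fbank_features(power_spec, sr=16000, n_fft=512, n_mels=80):
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, fmax=sr // 2)  # Y
    fbank = mel_fb @ power_spec          # Y x X -> [80, num_frames]
    return np.log(fbank + 1e-10)         # log energies; small offset avoids log(0)
```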
S105 subjects the FBANK features to discrete cosine transform to obtain MFCC features.
In this step, since there is overlap between the Mel filters in the previous step, a discrete cosine transform (DCT) is used for decorrelation. The order of the discrete cosine transform is set to 13, and the first-order and second-order differences are combined to obtain the 40-dimensional MFCC features (13-dimensional MFCC coefficients + 13-dimensional first-order difference parameters + 13-dimensional second-order difference parameters + 1-dimensional frame energy).
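A sketch of the DCT and difference computation is given below; appending the log frame energy as the final dimension is an assumption used here to reach the stated 40 dimensions:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_features(log_fbank, power_spec, n_ceps=13):
    ceps = dct(log_fbank, type=2, axis=0, norm="ortho")[:n_ceps]    # 13 cepstral coefficients
    delta1 = librosa.feature.delta(ceps, order=1)                   # first-order differences
    delta2 = librosa.feature.delta(ceps, order=2)                   # second-order differences
    energy = np.log(power_spec.sum(axis=0, keepdims=True) + 1e-10)  # 1-dim frame energy
    return np.vstack([ceps, delta1, delta2, energy])                # [40, num_frames]
```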
S2, inputting the FBANK features and the MFCC features into a pre-trained parallel feature classification model for parallel processing, and outputting to obtain fusion voiceprint features;
Further, the pre-training process of the parallel feature classification model specifically includes:
collecting and marking training voiceprints, and constructing a voiceprint training set; performing data enhancement on the training voiceprint to obtain an enhanced voiceprint, and adding the enhanced voiceprint into the voiceprint training set;
Specifically, the training voiceprint is reverberated using an open source data set to generate a voiceprint with human voice and noise; random masking of 0-5 frames in the time domain is then carried out on the voiceprint with human voice and noise to obtain the masked voiceprint.
In this step, during training, a data set of 5000 speakers is used as the voiceprint training set, with 250 items of data used as the validation set and 250 as the test set. Additional data enhancement is performed on the original data: reverberation is applied in combination with the open source MUSAN data set to generate audio with human voice and noise, and random masking of 0-5 frames in the time domain is applied. The enhanced voiceprint training set improves training accuracy.
An FBANK feature extraction network and an MFCC feature extraction network are constructed as the front-end input networks of the parallel feature classification model, and a fusion voiceprint network is constructed as the back-end output network of the parallel feature classification model. In this step, as shown in fig. 2, the FBANK voiceprint feature extraction network includes two 1*1 convolution blocks and three residual channel attention modules; the MFCC voiceprint feature extraction network includes two 1*1 convolution blocks and two residual channel attention modules. Conv1D is a one-dimensional convolution, ReLU is the activation function, and BN is batch normalization; k is the convolution kernel size, d is the dilation size, and C1 and C2 are the channel numbers, with C1 = 512 and C2 = 256 adopted for voiceprint feature extraction. The FBANK voiceprint features obtained through the FBANK voiceprint feature extraction network are (3×C1) × T, and the MFCC voiceprint features obtained through the MFCC voiceprint extraction network are (2×C2) × T, where T is the number of frames.
As shown in fig. 3, the residual channel attention module includes two 1*1 convolution blocks, a channel attention module and an add module. The two 1*1 convolution blocks, both with parameters k = 1 and d = 1, rescale the channels to extend the temporal context of the frame layer, and an SE-BLOCK (channel attention module) is added to build links between features over time, forming a residual structure as a whole.
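For illustration, the residual channel attention module might be sketched in PyTorch as follows; the SE bottleneck width of 128 is an assumption, since it is not specified above:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel attention (SE-BLOCK): rescales channels from their frame-averaged statistics."""
    def __init__(self, channels, bottleneck=128):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, bottleneck), nn.ReLU(),
                                nn.Linear(bottleneck, channels), nn.Sigmoid())

    def forward(self, x):                    # x: [batch, channels, frames]
        scale = self.fc(x.mean(dim=2))       # global average over frames
        return x * scale.unsqueeze(2)

class ResidualChannelAttention(nn.Module):
    """Two 1*1 Conv1D+ReLU+BN blocks (k=1, d=1), an SE block and a residual add."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv1d(channels, channels, kernel_size=1),
                                   nn.ReLU(), nn.BatchNorm1d(channels))
        self.conv2 = nn.Sequential(nn.Conv1d(channels, channels, kernel_size=1),
                                   nn.ReLU(), nn.BatchNorm1d(channels))
        self.se = SEBlock(channels)

    def forward(self, x):
        return x + self.se(self.conv2(self.conv1(x)))   # residual add module
```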
And training by utilizing the voiceprint training set to obtain the parallel feature classification model.
In the actual training process of the parallel feature classification model, the input features are 80-dimensional FBANK and 40-dimensional MFCC features extracted from a 25-millisecond window with a 10 ms frame shift. An Adam optimizer is used with cyclical training, the learning rate varying between 1e-3 and 1e-8, with 1K iterations per cycle, and weight decay is applied to the model to prevent overfitting. Training uses a batch size of 128, and when the loss function finally approaches a stable value, the trained parallel feature classification model is obtained. The loss function is AAM-Softmax, calculated as:
L = -(1/m)·Σ_{i=1}^{m} log( e^{s·cos(θ_{yi}+t)} / ( e^{s·cos(θ_{yi}+t)} + Σ_{j=1, j≠yi}^{n} e^{s·cos θ_j} ) );
where s = 64, t = 0.2, m is the batch size (bs), n is the number of classes, yi denotes the correct class of sample i, t is the angle interval, θ_{yi} is the angle between the yi-th column vector of the last fully connected layer matrix W and the output vector, and θ_j is the angle between the j-th column vector of W and the output vector. Compared with normal Softmax, this loss adds the angle interval t to penalize the angle between the deep feature and the corresponding weight, reducing the intra-class gap and increasing the inter-class gap. Because this property of the loss function keeps the intra-class distance small, the classification head of AAM-Softmax is discarded after training, and the 192-dimensional vector of the preceding layer is taken as the final fusion voiceprint feature.
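For illustration, the AAM-Softmax loss with s = 64 and angle interval t = 0.2 might be sketched in PyTorch as follows; normalizing both the embeddings and the classifier weights is standard practice and assumed here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, emb_dim=192, num_classes=5000, s=64.0, t=0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, emb_dim))
        nn.init.xavier_normal_(self.weight)
        self.s, self.t = s, t

    def forward(self, emb, labels):
        # cos(theta_j) between the normalized embedding and each column of W
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        # add the angle interval t only to the correct class yi, then scale by s
        logits = torch.where(target, torch.cos(theta + self.t), cosine) * self.s
        return F.cross_entropy(logits, labels)
```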
S201, extracting voiceprint features from the FBANK features through the FBANK feature extraction network to obtain the FBANK voiceprint features;
S202, extracting voiceprint features of the MFCC features through an MFCC voiceprint feature extraction network to obtain the MFCC voiceprint features;
S203, processing the FBANK voiceprint features and the MFCC voiceprint features through a fusion voiceprint network to obtain fusion voiceprint features.
S2031, superposing the FBANK voiceprint features and the MFCC voiceprint features to obtain superimposed voiceprint features;
In this step, as shown in fig. 2, the FBANK voiceprint features (3×C1) × T and the MFCC voiceprint features (2×C2) × T are superimposed to give ((3×C1) + (2×C2)) × T.
S2032, giving different weights to each frame of the superimposed voiceprint features to obtain the fusion voiceprint features.
S20321, calculating the mean value and standard deviation of each feature dimension of the superimposed voiceprint features over the frames, with the following calculation formula (formula 1):
μ_c = (1/T)·Σ_{t=1}^{T} h_{c,t};  σ_c = sqrt( (1/T)·Σ_{t=1}^{T} (h_{c,t} - μ_c)² );
where h_{c,t} is the value of the superimposed voiceprint features in dimension c at frame t, and T is the number of frames.
S20322, stacking and serially connecting the mean value and standard deviation of each feature dimension of the superimposed voiceprint features to obtain the global features of the superimposed voiceprint features;
S20323, performing attention-weighted calculation on the global features of the superimposed voiceprint features to obtain the attention-weighted mean value and standard deviation of each dimension, and stacking them onto the superimposed voiceprint features to obtain the fusion voiceprint features, with the following calculation formula (formula 2):
e_{c,t} = V·tanh(W·H_t + b);  α_{c,t} = exp(e_{c,t}) / Σ_{τ=1}^{T} exp(e_{c,τ});
μ̃_c = Σ_{t=1}^{T} α_{c,t}·h_{c,t};  σ̃_c = sqrt( Σ_{t=1}^{T} α_{c,t}·h_{c,t}² - μ̃_c² );
where H_t is the global feature at frame t, W and b are the parameters of the tanh layer, V is the parameter of the softmax layer, α_{c,t} is the attention weight of dimension c at frame t, and μ̃_c and σ̃_c are the attention-weighted mean and standard deviation of dimension c.
In this step, the FBANK voiceprint features and the MFCC voiceprint features are first superimposed and fused, and global frame information is further considered. Because some frame-level features may be more important than others, an attention statistics layer is introduced to assign a different weight to each frame, and its output is independent of the number of input frames. As shown in fig. 4, according to formula 1, the mean value and standard deviation of each feature dimension of the superimposed voiceprint features are calculated, giving ((3×C1) + (2×C2), 1); the superimposed voiceprint features are stacked and concatenated with this per-dimension mean and standard deviation to obtain the global features of the superimposed voiceprint features, whose dimension is restored to ((3×C1) + (2×C2)) and denoted H. The global features of the superimposed voiceprint features are passed through the activation function tanh, where, as shown in fig. 4, W and b are network parameters, which is more conducive to weight updating; the result is then passed through the activation function softmax, where V is a network parameter. Finally, the attention-weighted calculation of formula 2 is applied to the global features of the superimposed voiceprint features to obtain the weighted mean value and standard deviation of each dimension, which are stacked onto the original superimposed voiceprint features to give (((3×C1) + (2×C2))×2, 1).
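For illustration, the attention statistics pooling described above might be sketched in PyTorch as follows; the hidden width of 128 is an assumption not given above:

```python
import torch
import torch.nn as nn

class AttentiveStatsPool(nn.Module):
    """Channel-dependent frame attention over the superimposed features h: [batch, C, T]."""
    def __init__(self, channels, hidden=128):
        super().__init__()
        self.W = nn.Conv1d(channels * 3, hidden, kernel_size=1)  # tanh layer parameters W, b
        self.V = nn.Conv1d(hidden, channels, kernel_size=1)      # softmax layer parameter V

    def forward(self, h):
        T = h.size(2)
        mean = h.mean(dim=2, keepdim=True).expand(-1, -1, T)
        std = h.std(dim=2, keepdim=True).expand(-1, -1, T)
        H = torch.cat([h, mean, std], dim=1)                     # global feature with formula-1 stats
        alpha = torch.softmax(self.V(torch.tanh(self.W(H))), dim=2)
        mu = (alpha * h).sum(dim=2)                              # attention-weighted mean
        sigma = ((alpha * h * h).sum(dim=2) - mu * mu).clamp(min=1e-9).sqrt()
        return torch.cat([mu, sigma], dim=1)                     # stacked mean and std, [batch, 2C]
```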
Inputting the original registered voiceprint of the user into the parallel feature classification model, and outputting to obtain comparison voiceprint features;
In this step, the feature extraction and fusion of the user's original registered voiceprint are the same as the feature extraction and fusion of the voiceprint to be identified, and are not repeated here. 3-4 s of valid audio of person A is extracted for registration; the corresponding audio goes through the same audio preprocessing stage as in training, and the 192-dimensional voiceprint features are extracted through the parallel feature classification model and stored.
S3, calculating cosine similarity of the fusion voiceprint features and the comparison voiceprint features, and determining that the voiceprint to be identified is from a registrant when the cosine similarity is larger than a first threshold.
In this step, the user's original registered voiceprint yields 192-dimensional comparison voiceprint features x through the parallel feature classification model, and the voiceprint to be identified yields 192-dimensional fusion voiceprint features y through the parallel feature classification model. The corresponding cosine similarity of x and y is calculated as similarity = (x · y) / (‖x‖ · ‖y‖), where ‖·‖ is the L2 norm and 0 ≤ similarity ≤ 1. The first threshold of the present invention may be 0.6; if similarity > 0.6, the voiceprint to be identified is considered to come from the registrant, otherwise it comes from an unregistered person. If there are many people in the voiceprint database, the database is traversed to find persons with similarity > 0.6, and if there are several such persons, the one with the highest similarity is selected.
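For illustration, the scoring decision might be sketched as follows, including the traversal of a multi-person voiceprint database; the variable and function names are illustrative:

```python
import numpy as np

def cosine_similarity(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def identify(query_feature, registered, threshold=0.6):
    """registered: dict mapping person name -> stored 192-dim comparison feature."""
    scores = {name: cosine_similarity(query_feature, feat) for name, feat in registered.items()}
    matches = {name: s for name, s in scores.items() if s > threshold}
    # return the registrant with the highest similarity, or None if no one exceeds the threshold
    return max(matches, key=matches.get) if matches else None
```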
The following table compares the actual recognition performance of the method of the invention.
A simple practical test was performed on the device; the running time is about 140 ms. The specific effects are as follows:
Influence of different distance factors in the same environment:
Preliminary conclusion: within a test distance of 3 m the recognition accuracy is highest; as the distance increases, the audio quality decreases and the performance degrades.
Influence of different environmental factors at the same distance:
Preliminary conclusion: the interference of the environmental factors with the voice music is relatively large, and the recognition success rate is only 64%.
Audio quality degrades as the distance increases and the performance drops, but within 3 meters the success rate can reach more than 90%. The voiceprint recognition method achieves a high success rate when applied to authenticating different members in a smart home, even when light music is playing. However, for voice interference, since voiceprint recognition extracts human voice characteristics, the performance is affected to some extent when multiple people are speaking.
As shown in fig. 5, based on the same concept, an embodiment of the present application further provides a voiceprint recognition system based on a parallel feature classification model, including:
The extraction module is used for resampling the voiceprint to be identified, and extracting the characteristics of the resampled voiceprint to be identified to obtain FBANK characteristics and MFCC characteristics;
The processing module is used for inputting the FBANK features and the MFCC features into a pre-trained parallel feature classification model for parallel processing and outputting to obtain fusion voiceprint features; inputting the original registered voiceprint of the user into the parallel feature classification model, and outputting to obtain comparison voiceprint features;
And the comparison module is used for calculating cosine similarity of the fusion voiceprint characteristics and the comparison voiceprint characteristics, and determining that the voiceprint to be identified comes from a registrant when the cosine similarity is larger than a first threshold value.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. do not denote any order. These words may be interpreted as names.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms should not be understood as necessarily being directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.

Claims (10)

1. A voiceprint recognition method based on a parallel feature classification model is characterized by comprising the following steps:
Resampling the voiceprint to be identified, and extracting features of the resampled voiceprint to be identified to obtain FBANK features and MFCC features;
Inputting the FBANK features and the MFCC features into a pre-trained parallel feature classification model for parallel processing, and outputting to obtain fusion voiceprint features; inputting the original registered voiceprint of the user into the parallel feature classification model, and outputting to obtain comparison voiceprint features;
And calculating cosine similarity of the fusion voiceprint features and the comparison voiceprint features, and determining that the voiceprint to be identified is from a registrant when the cosine similarity is larger than a first threshold.
2. The voiceprint recognition method based on the parallel feature classification model according to claim 1, wherein the resampling the voiceprint to be recognized, and performing feature extraction on the resampled voiceprint to be recognized to obtain FBANK features and MFCC features, includes:
The high-frequency part of the voiceprint to be identified is emphasized, and the emphasized voiceprint to be identified is obtained;
Dividing the emphasized voiceprint to be identified into a plurality of short-time frames, substituting each short-time frame into a Hamming window function, and obtaining continuous short-time frames;
Performing discrete Fourier transform on each short-time frame to obtain a frequency spectrum of each short-time frame, and performing modular squaring on the frequency spectrum of each short-time frame to obtain the power spectrum of the voiceprint to be identified;
Filtering the power spectrum of the voiceprint to be identified through a Mel filter bank to obtain FBANK features;
and obtaining the MFCC characteristics through discrete cosine transformation of the FBANK characteristics.
3. The voiceprint recognition method based on the parallel feature classification model according to claim 1, wherein the pre-training process of the parallel feature classification model specifically comprises:
collecting and marking training voiceprints, and constructing a voiceprint training set; performing data enhancement on the training voiceprint to obtain an enhanced voiceprint, and adding the enhanced voiceprint into the voiceprint training set;
Constructing an FBANK feature extraction network and an MFCC feature extraction network as the front-end input networks of the parallel feature classification model; constructing a fusion voiceprint network as the back-end output network of the parallel feature classification model;
And training by utilizing the voiceprint training set to obtain the parallel feature classification model.
4. The voiceprint recognition method based on the parallel feature classification model according to claim 3, wherein the performing data enhancement on the training voiceprint to obtain an enhanced training voiceprint comprises:
Reverberation is carried out on the training voiceprint by using the open source data set to generate a voiceprint with human voice and noise;
And carrying out random masking on the voiceprint with the human voice and noise in a time domain for 0-5 frames to obtain the masked voiceprint.
5. The voiceprint recognition method based on the parallel feature classification model of claim 3, wherein the FBANK voiceprint feature extraction network comprises two 1*1 convolution blocks and three residual channel attention modules;
The MFCC voiceprint feature extraction network comprises two 1*1 convolution blocks and two residual channel attention modules.
6. The method for identifying voiceprints based on the parallel feature classification model according to claim 5, wherein the residual channel attention module comprises two 1*1 convolution blocks, a channel attention module, and an add module.
7. A voiceprint recognition method based on a parallel feature classification model according to claim 1 or 3, wherein inputting the FBANK features and the MFCC features into a pre-trained parallel feature classification model for parallel processing, and outputting to obtain a fused voiceprint feature comprises:
Extracting voiceprint features from the FBANK features through the FBANK feature extraction network to obtain FBANK voiceprint features;
extracting voiceprint features of the MFCC features through an MFCC voiceprint feature extraction network to obtain MFCC voiceprint features;
And processing the FBANK voiceprint features and the MFCC voiceprint features through a fusion voiceprint network to obtain fusion voiceprint features.
8. The method for identifying voiceprint based on parallel feature classification model according to claim 7, wherein the processing the FBANK voiceprint features and MFCC voiceprint features through a fused voiceprint network to obtain fused voiceprint features comprises:
superposing the FBANK voiceprint features and the MFCC voiceprint features to obtain superimposed voiceprint features; and giving different weights to the superimposed voiceprint features to obtain the fusion voiceprint features.
9. The method for identifying voiceprint based on parallel feature classification model according to claim 8, wherein the assigning different weights to the superimposed voiceprint features to obtain the fused voiceprint features comprises:
calculating the mean value and standard deviation of each frame feature dimension of the superimposed voiceprint features;
Stacking and serially connecting the mean value and the standard deviation of each feature dimension of the superimposed voiceprint features to obtain the global features of the superimposed voiceprint features;
And carrying out attention weighted calculation on the global features of the superimposed voiceprint features to obtain the average value and the standard deviation of each frame, and stacking the average value and the standard deviation of each frame of the global features of the superimposed voiceprint features to obtain the fusion voiceprint features.
10. A voiceprint recognition system based on a parallel feature classification model, characterized in that it applies the voiceprint recognition method based on the parallel feature classification model according to any one of claims 1-9 and comprises:
The extraction module is used for resampling the voiceprint to be identified, and extracting the characteristics of the resampled voiceprint to be identified to obtain FBANK characteristics and MFCC characteristics;
The processing module is used for inputting the FBANK features and the MFCC features into a pre-trained parallel feature classification model for parallel processing and outputting to obtain fusion voiceprint features; inputting the original registered voiceprint of the user into the parallel feature classification model, and outputting to obtain comparison voiceprint features;
And the comparison module is used for calculating cosine similarity of the fusion voiceprint characteristics and the comparison voiceprint characteristics, and determining that the voiceprint to be identified comes from a registrant when the cosine similarity is larger than a first threshold value.
CN202410009893.XA 2024-01-02 2024-01-02 Voiceprint recognition method and system based on parallel feature extraction model Pending CN118098247A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410009893.XA CN118098247A (en) 2024-01-02 2024-01-02 Voiceprint recognition method and system based on parallel feature extraction model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410009893.XA CN118098247A (en) 2024-01-02 2024-01-02 Voiceprint recognition method and system based on parallel feature extraction model

Publications (1)

Publication Number Publication Date
CN118098247A true CN118098247A (en) 2024-05-28

Family

ID=91160821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410009893.XA Pending CN118098247A (en) 2024-01-02 2024-01-02 Voiceprint recognition method and system based on parallel feature extraction model

Country Status (1)

Country Link
CN (1) CN118098247A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118335092A (en) * 2024-06-12 2024-07-12 山东省计算中心(国家超级计算济南中心) Voice compression method and system based on multi-scale residual error attention
CN118335092B (en) * 2024-06-12 2024-08-30 山东省计算中心(国家超级计算济南中心) Voice compression method and system based on multi-scale residual error attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination