CN109377984A - A kind of audio recognition method and device based on ArcFace - Google Patents
A speech recognition method and device based on ArcFace
- Publication number
- CN109377984A (CN201811400260.2A)
- Authority
- CN
- China
- Prior art keywords
- preset
- voice
- arcface
- characteristic vector
- recognized
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/10—Speech classification or search using distance or distortion measures between unknown speech and reference templates
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G10L2015/0631—Creating reference templates; Clustering
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
An embodiment of the present invention provides a speech recognition method and device based on ArcFace. The method comprises: acquiring a voice to be recognized and extracting low-level frame-level features of the voice to be recognized; extracting an identity feature vector according to the low-level frame-level features; acquiring, from a preset voice library, a target identity feature vector similar to the identity feature vector, wherein the preset voice library stores in advance a correspondence between preset identity feature vectors and preset identity information, the correspondence is obtained from a pre-trained preset model, and the preset model is trained with a preset loss function obtained from the ArcFace-based algorithm expression; and determining, according to the correspondence, the target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be recognized. The device executes the above method. The method and device provided by the embodiments of the present invention can accurately recognize various types of voices.
Description
Technical Field
The embodiment of the invention relates to the technical field of voice processing, in particular to a method and a device for recognizing voice based on ArcFace.
Background
With the explosive growth of digital audio data, recognition of speakers through speech recognition technology is also receiving increasing attention.
At present, the i-vector system, which is the most widely used in speaker recognition, builds on GMM-UBM (Gaussian mixture model-universal background model) and GSV-SVM (Gaussian mean supervector-support vector machine), both of which rest on statistical model theory. The training and test voices must therefore reach a certain length; otherwise, recognition accuracy drops sharply. On the other hand, although ArcFace is widely applied in the field of face recognition, there is as yet no method for applying ArcFace in the field of speech recognition.
Therefore, how to avoid the above-mentioned drawbacks and accurately recognize various types of voices (including long voices and short voices) based on ArcFace is a problem to be solved urgently.
Disclosure of Invention
Aiming at the problems in the prior art, the embodiment of the invention provides a method and a device for recognizing a voice based on ArcFace.
In a first aspect, an embodiment of the present invention provides a speech recognition method based on ArcFace, where the method includes:
acquiring a voice to be recognized, and extracting low-level frame level features of the voice to be recognized;
extracting an identity feature vector according to the low-level frame level features;
acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression;
and determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
In a second aspect, an embodiment of the present invention provides an ArcFace-based speech recognition apparatus, where the apparatus includes:
the first acquisition unit is used for acquiring a voice to be recognized and extracting low-level frame-level features of the voice to be recognized;
the extraction unit is used for extracting an identity feature vector according to the low-level frame level features;
the second acquisition unit is used for acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, and the preset voice library stores the corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression;
and the recognition unit is used for determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation and taking the target identity information as a recognition result of the voice to be recognized.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform a method comprising:
acquiring a voice to be recognized, and extracting low-level frame level features of the voice to be recognized;
extracting an identity feature vector according to the low-level frame level features;
acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression;
and determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, including:
the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform a method comprising:
acquiring a voice to be recognized, and extracting low-level frame level features of the voice to be recognized;
extracting an identity feature vector according to the low-level frame level features;
acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression;
and determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
According to the ArcFace-based voice recognition method and device provided by the embodiment of the invention, a target identity feature vector similar to the identity feature vector of the voice to be recognized is acquired from the preset voice library; the correspondence is obtained from a preset model trained in advance with the preset loss function obtained from the ArcFace-based algorithm expression; the target identity information is then determined and taken as the recognition result of the voice to be recognized, so that various types of voices can be recognized accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a speech recognition method based on ArcFace according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an ArcFace-based speech recognition apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a speech recognition method based on ArcFace according to an embodiment of the present invention, and as shown in fig. 1, the speech recognition method based on ArcFace according to the embodiment of the present invention includes the following steps:
S101: Acquire the voice to be recognized, and extract the low-level frame-level features of the voice to be recognized.
Specifically, the device acquires a voice to be recognized and extracts its low-level frame-level features. The device may be a server or similar equipment that executes the method. To simulate a real voice environment, voices of the same speaker may be collected over different channels using equipment such as a moving-coil microphone, a condenser microphone and a MEMS (micro-electro-mechanical) microphone. Frame-level features of the voice to be recognized are extracted with a frame length of 25 ms and a frame shift of 10 ms, and silence is removed from them using VAD (voice activity detection) to obtain the low-level frame-level features. The low-level frame-level features may be Fbank features, but are not specifically limited thereto.
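The framing and silence removal of step S101 can be illustrated with a minimal sketch. It is not the implementation of the embodiment: the 16 kHz sample rate, the fixed energy threshold and the function names `frame_signal` and `energy_vad` are assumptions made for illustration, and a simple energy gate stands in for whatever VAD the device actually uses. Fbank computation is omitted.

```python
import math

def frame_signal(samples, sample_rate=16000, frame_ms=25, shift_ms=10):
    """Split a list of samples into overlapping frames (25 ms long, 10 ms shift)."""
    frame_len = sample_rate * frame_ms // 1000   # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000       # 160 samples at 16 kHz
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

def energy_vad(frames, threshold=1e-4):
    """Keep frames whose mean energy exceeds a fixed threshold.

    This is a deliberately simple stand-in for VAD (an assumption, not the
    detector used by the embodiment).
    """
    return [f for f in frames if sum(x * x for x in f) / len(f) > threshold]

# One second of audio: 0.5 s of silence followed by 0.5 s of a 440 Hz tone.
sr = 16000
signal = [0.0] * (sr // 2) + [
    0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr // 2)
]
frames = frame_signal(signal, sr)
voiced = energy_vad(frames)
print(len(frames), "frames in total,", len(voiced), "kept after silence removal")
```

Only the frames that overlap the tone survive the energy gate; in a full system each surviving frame would then be converted to a feature vector such as Fbank.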
S102: Extract an identity feature vector according to the low-level frame-level features.
Specifically, the device extracts an identity feature vector according to the low-level frame-level features. The identity feature vector can be understood as a feature vector that identifies a speaker. The low-level frame-level features may be input into an optimized GRU model, and the output of the optimized GRU model is used as the identity feature vector. The GRU (Gated Recurrent Unit) is a variant of the LSTM used as a model for learning temporal features; it retains the LSTM's ability to handle long-range dependencies while having a simpler structure and more efficient computation. The GRU model can be optimized by introducing a convolutional layer in front of the GRU layer, which models spectral correlation while reducing the dimensionality of the features in the time and frequency domains and speeding up model computation.
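As an illustration of how a GRU turns a sequence of frame-level features into a single vector, the following minimal sketch runs one GRU layer over the frames and returns the final hidden state. It rests on several assumptions that are not taken from the embodiment: the weights are random and untrained, biases and the convolutional front end are omitted, the update convention is h_t = (1 - z) * h_{t-1} + z * n, and the names `gru_step`, `init_params` and `identity_vector` are hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def gru_step(x, h, p):
    """Advance the hidden state h by one frame x (biases omitted for brevity)."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]  # update gate
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]  # reset gate
    gated_h = [ri * hi for ri, hi in zip(r, h)]
    n = [math.tanh(a + b) for a, b in zip(matvec(p["Wn"], x), matvec(p["Un"], gated_h))]
    # Convention assumed here: h_t = (1 - z) * h_{t-1} + z * n
    return [(1.0 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]

def init_params(feat_dim, hidden_dim, seed=0):
    """Random, untrained weights; a real system would learn these."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    params = {}
    for gate in "zrn":
        params["W" + gate] = mat(hidden_dim, feat_dim)
        params["U" + gate] = mat(hidden_dim, hidden_dim)
    return params

def identity_vector(frames, params, hidden_dim):
    """Run the GRU over all frames and return the final hidden state."""
    h = [0.0] * hidden_dim
    for frame in frames:
        h = gru_step(frame, h, params)
    return h

# Ten hypothetical frames of 4-dimensional features.
frames = [[math.sin(0.1 * t + k) for k in range(4)] for t in range(10)]
params = init_params(feat_dim=4, hidden_dim=3)
vector = identity_vector(frames, params, hidden_dim=3)
print(vector)
```

Whatever the number of input frames, the result has a fixed length equal to the hidden size, which is what allows utterances of different durations to be compared against the same voice library.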
S103: acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression.
Specifically, the device acquires, from a preset voice library, a target identity feature vector similar to the identity feature vector, wherein the preset voice library stores in advance a correspondence between preset identity feature vectors and preset identity information; the correspondence is obtained from a pre-trained preset model; and the preset model is trained with a preset loss function obtained from the ArcFace-based algorithm expression. A nearest-neighbor classifier may be used to calculate the Euclidean distance between the identity feature vector and each preset identity feature vector in the preset voice library, and the preset identity feature vector with the minimum Euclidean distance is determined as the target identity feature vector. The preset identity information can be understood as the speaker corresponding to a preset identity feature vector; that is, the preset model identifies which speaker a preset identity feature vector corresponds to. The embodiment of the invention does not specifically limit the preset model. The ArcFace-based algorithm expression $L_3$ can be obtained according to the following steps:
For an input sample vector $x_i$ and its corresponding label $y_i$ (i.e., the speaker to which it belongs), the loss function $L_1$ is defined as follows:
$$L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f_{y_i}}}{\sum_{j=1}^{C} e^{f_j}}$$
where $N$ is the number of samples in a training batch (i.e., the portion of all training samples fed to the device at one time), $C$ is the total number of sample classes (i.e., the total number of speakers), $f_{y_i}$ represents the posterior probability of the class to which the sample vector $x_i$ belongs, and $f_j$ represents the posterior probability of the sample vector $x_i$ for each class $j$. $f_j$ can be expressed as follows:
$$f_j = W_j^{T}x_i + b_j = \|W_j\|\,\|x_i\|\cos\theta_j + b_j$$
where $W_j$ and $b_j$ are the weight vector and bias of the fully connected layer, and $\theta_j$ is the angle between $W_j$ and $x_i$.
To simplify the expression, $b_j$ is set to 0 and, by L2 normalization, $\|W_j\|$ is set to 1, so that $f_j$ is determined only by the sample vector $x_i$ and the angle $\theta_j$:
$$f_j = \|x_i\|\cos\theta_j$$
L2 normalization of the features removes their radial variation in hypersphere space. Setting $\|x_i\|$ to a constant $s$, the loss function $L_2$ is expressed as:
$$L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}} + \sum_{j=1,\,j\neq y_i}^{C} e^{s\cos\theta_j}}$$
Because the softmax loss function focuses on correct classification, it does not account for misclassification. To solve this problem, an angular margin loss factor $m$ is added, i.e., $m$ is introduced inside $\cos\theta_{y_i}$ to add a boundary constraint on the classification boundary, which gives the ArcFace algorithm expression $L_3$:
$$L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1,\,j\neq y_i}^{C} e^{s\cos\theta_j}}$$
where $\theta_{y_i}$ falls within the range $[0, \pi - m]$.
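The effect of the angular margin m can be checked numerically with a minimal sketch. It assumes that the cosines between each normalized sample vector and each normalized class weight vector are already available; the function name `arcface_loss` and the values s = 30 and m = 0.5 are illustrative assumptions rather than values given by the embodiment. With m = 0 the function reduces to the scaled cosine softmax loss without a margin.

```python
import math

def arcface_loss(cosines, labels, s=30.0, m=0.5):
    """Additive angular margin loss over a batch.

    cosines[i][j] is cos(theta_j) for sample i, labels[i] is its class y_i.
    The margin m is added to the angle of the true class only.
    """
    total = 0.0
    for cos_row, y in zip(cosines, labels):
        # Recover the true-class angle, clamping against rounding error.
        theta_y = math.acos(max(-1.0, min(1.0, cos_row[y])))
        logits = [s * c for c in cos_row]
        logits[y] = s * math.cos(theta_y + m)
        # Numerically stable log-sum-exp of the logits.
        top = max(logits)
        log_sum = top + math.log(sum(math.exp(v - top) for v in logits))
        total += log_sum - logits[y]   # equals -log(softmax probability of class y)
    return total / len(labels)

# Two hypothetical samples scored against three classes.
cosines = [[0.8, 0.1, -0.2], [0.3, 0.7, 0.0]]
labels = [0, 1]
loss_without_margin = arcface_loss(cosines, labels, m=0.0)
loss_with_margin = arcface_loss(cosines, labels, m=0.5)
print(loss_without_margin, loss_with_margin)
```

For the same cosines, the margin lowers the true-class score and therefore raises the loss, so training must pull each sample closer to its own class weight vector than a margin-free loss would require.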
The goal of speech recognition is to determine which speaker an unknown voice belongs to. It is assumed that the posterior probability $f_{y_i}$ of the class to which the voice belongs is greater than a preset threshold $t$, and that the posterior probabilities $f_j$ of the other classes are all less than $t$. This can be expressed as follows:
$$f_{y_i} > t, \qquad f_j < t \quad (j \neq y_i)$$
During classification, if $f_{y_i} \le t$ or $f_j \ge t$, the sample is misclassified, and the loss is defined as the difference between the two. For the former case, the loss is denoted $L_+$ and is expressed as:
Similarly, for the latter case, the loss is denoted $L_-$ and is expressed as:
To express the misclassification loss function as a whole, $L_+$ and $L_-$ are merged by introducing the maximum boundary term penalty function $\delta_y$:
For all samples, the maximum edge constraint loss factor $C_{\mathrm{max\_mar}}$ is:
In summary, the preset loss function $L$ obtained based on ArcFace, i.e., the maximum edge cosine distance loss function (MMCL), is defined in terms of $L_3$ and $C_{\mathrm{max\_mar}}$ as follows:
$$L = L_3 + \lambda C_{\mathrm{max\_mar}}$$
where $\lambda$ is a weight coefficient whose value is selected from the range 0.1 to 10.
It should be noted that the maximum edge constraint loss factor $C_{\mathrm{max\_mar}}$ introduced by the embodiment of the invention includes the maximum boundary term penalty function $\delta_y$. For a correct prediction (the corresponding case in the expression of $\delta_y$), $\delta_y = 1$; for a wrong prediction (the other case in the expression of $\delta_y$), $\delta_y = -1$. The preset loss function therefore discriminates more strongly between prediction results, which makes the recognition result more accurate.
S104: Determine target identity information corresponding to the target identity feature vector according to the correspondence, and take the target identity information as the recognition result of the voice to be recognized.
Specifically, the device determines the target identity information corresponding to the target identity feature vector according to the correspondence, and takes the target identity information as the recognition result of the voice to be recognized. An example is as follows: a preset identity feature vector A corresponds to preset identity information a, and the identity feature vector of the voice to be recognized is X. If a vector similarity comparison determines that A is the target identity feature vector similar to X, then a is determined as the target identity information and is taken as the recognition result of the voice to be recognized. Table 1 compares the EER of the MMCL of the embodiment of the invention with that of softmax and ArcFace for four speech durations of 2 s, 3 s, 5 s and 8 s:
Table 1. Recognition performance (EER) of the short-voice speaker recognition methods at different durations
Loss function | 2s | 3s | 5s | 8s |
softmax | 0.0643 | 0.0437 | 0.0363 | 0.0301 |
ArcFace | 0.0602 | 0.0410 | 0.0307 | 0.0254 |
MMCL | 0.0538 | 0.0385 | 0.0272 | 0.0215 |
Therefore, the MMCL provided by the embodiment of the invention achieves a lower EER than softmax and ArcFace at every duration tested, i.e., it recognizes speech more accurately.
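To make the size of the improvement in Table 1 easier to read, the relative EER reduction of MMCL over each baseline can be computed directly from the tabulated values. The sketch below uses only the numbers in Table 1; the dictionary layout and the function name `relative_reduction` are illustrative assumptions.

```python
# EER values copied from Table 1.
eer = {
    "softmax": {"2s": 0.0643, "3s": 0.0437, "5s": 0.0363, "8s": 0.0301},
    "ArcFace": {"2s": 0.0602, "3s": 0.0410, "5s": 0.0307, "8s": 0.0254},
    "MMCL":    {"2s": 0.0538, "3s": 0.0385, "5s": 0.0272, "8s": 0.0215},
}

def relative_reduction(baseline, system, duration):
    """Fraction by which `system` lowers the EER of `baseline` at one duration."""
    base = eer[baseline][duration]
    return (base - eer[system][duration]) / base

for duration in ("2s", "3s", "5s", "8s"):
    vs_softmax = relative_reduction("softmax", "MMCL", duration)
    vs_arcface = relative_reduction("ArcFace", "MMCL", duration)
    print(f"{duration}: {vs_softmax:.1%} lower than softmax, "
          f"{vs_arcface:.1%} lower than ArcFace")
```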
According to the ArcFace-based voice recognition method provided by the embodiment of the invention, a target identity feature vector similar to the identity feature vector of the voice to be recognized is acquired from the preset voice library; the correspondence is obtained from a preset model trained in advance with the preset loss function obtained from the ArcFace-based algorithm expression; the target identity information is then determined and taken as the recognition result of the voice to be recognized, so that various types of voices can be recognized accurately.
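The lookup performed in steps S103 and S104 can be sketched as a nearest-neighbor search by Euclidean distance. The voice library below is a hypothetical in-memory dictionary mapping identity information to an enrolled identity feature vector; the speaker names, the three-dimensional vectors and the function name `identify` are assumptions made for illustration only.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(query, voice_library):
    """Return the identity whose enrolled vector is closest to the query.

    voice_library maps identity information to a preset identity feature vector.
    """
    best_id = min(voice_library, key=lambda k: euclidean(query, voice_library[k]))
    return best_id, euclidean(query, voice_library[best_id])

# Hypothetical preset voice library with two enrolled speakers.
library = {
    "speaker_a": [0.9, 0.1, 0.0],
    "speaker_b": [0.0, 1.0, 0.2],
}
who, distance = identify([0.8, 0.2, 0.1], library)
print(who, round(distance, 4))
```

A linear scan is adequate for a small library; a large library would typically use an index structure, which the embodiment does not specify.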
On the basis of the above embodiment, the preset loss function includes a maximum edge constraint loss factor, and the expression of the maximum edge constraint loss factor is:
where $C_{\mathrm{max\_mar}}$ is the maximum edge constraint loss factor, $N$ is the sample subset of the batch training, $y$ is the sample class, $C$ is the total number of sample classes, $t$ is the preset threshold, $f_{y_i}$ is the posterior probability, greater than the preset threshold, of the class to which the sample vector belongs, and $\delta_y$ is the maximum boundary term penalty function.
Specifically, the expression of the maximum edge constraint penalty factor in the device is:
where $C_{\mathrm{max\_mar}}$ is the maximum edge constraint loss factor, $N$ is the sample subset of the batch training, $y$ is the sample class, $C$ is the total number of sample classes, $t$ is the preset threshold, $f_{y_i}$ is the posterior probability, greater than the preset threshold, of the class to which the sample vector belongs, and $\delta_y$ is the maximum boundary term penalty function. Reference may be made to the above embodiments; details are not repeated here.
According to the ArcFace-based voice recognition method provided by the embodiment of the invention, the preset model is trained with a preset loss function that includes the maximum edge constraint loss factor, so that various types of voices can be recognized still more accurately.
On the basis of the above embodiment, the expression of $\delta_y$ is:
where, when $j \neq y_i$, $f_j$ is the posterior probability, smaller than the preset threshold, of the other classes to which the sample vector may belong.
Specifically, the expression of $\delta_y$ in the device is:
where, when $j \neq y_i$, $f_j$ is the posterior probability, smaller than the preset threshold, of the other classes to which the sample vector may belong. Reference may be made to the above embodiments; details are not repeated here.
The ArcFace-based voice recognition method provided by the embodiment of the invention can further accurately recognize various types of voices by calculating the penalty function of the maximum boundary term through a specific expression.
On the basis of the above embodiment, the expression of the preset loss function is:
$$L = L_3 + \lambda C_{\mathrm{max\_mar}}$$
where $L$ is the preset loss function, $L_3$ is the ArcFace-based algorithm expression, and $\lambda$ is a weight coefficient with a value from 0.1 to 10.
Specifically, the expression of the preset loss function in the device is as follows:
$$L = L_3 + \lambda C_{\mathrm{max\_mar}}$$
where $L$ is the preset loss function, $L_3$ is the ArcFace-based algorithm expression, and $\lambda$ is a weight coefficient with a value from 0.1 to 10. Reference may be made to the above embodiments; details are not repeated here.
The ArcFace-based voice recognition method provided by the embodiment of the invention can further accurately recognize various types of voices by calculating the preset loss function through the specific expression.
On the basis of the above embodiment, the extracting an identity feature vector according to the low-level frame-level feature includes:
inputting the low-level frame-level features into an optimized GRU model, and taking an output result of the optimized GRU model as the identity feature vector.
Specifically, the apparatus inputs the low-level frame-level features into an optimized GRU model, and takes the output of the optimized GRU model as the identity feature vector. Reference may be made to the above embodiments, which are not described in detail.
According to the ArcFace-based voice recognition method provided by the embodiment of the invention, the output result of the optimized GRU model is used as the identity feature vector, so that the method can be ensured to be normally carried out.
On the basis of the above embodiment, the optimized GRU model is a GRU model provided with a convolutional layer.
Specifically, the optimized GRU model in the apparatus is a GRU model provided with a convolutional layer. Reference may be made to the above embodiments, which are not described in detail.
According to the ArcFace-based voice recognition method provided by the embodiment of the invention, the optimized GRU model is selected as the GRU model provided with the convolution layer, so that the operation efficiency of the GRU model can be improved, and various types of voice can be recognized more quickly.
On the basis of the above embodiment, the low-level frame-level feature is an Fbank feature.
Specifically, the low-level frame-level feature in the device is an Fbank feature. Reference may be made to the above embodiments, which are not described in detail.
According to the ArcFace-based voice recognition method provided by the embodiment of the invention, the low-level frame level features are selected as Fbank features, so that the method can be ensured to be normally carried out.
Fig. 2 is a schematic structural diagram of an ArcFace-based speech recognition apparatus according to an embodiment of the present invention, and as shown in fig. 2, an embodiment of the present invention provides an ArcFace-based speech recognition apparatus, which includes a first obtaining unit 201, an extracting unit 202, a second obtaining unit 203, and a recognition unit 204, where:
the first obtaining unit 201 is configured to obtain a speech to be recognized, and extract low-level frame-level features of the speech to be recognized; the extracting unit 202 is configured to extract an identity feature vector according to the low-level frame-level feature; the second obtaining unit 203 is configured to obtain a target identity feature vector similar to the identity feature vector from a preset voice library, where a corresponding relationship between the preset identity feature vector and preset identity information is stored in the preset voice library in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression; the recognition unit 204 is configured to determine, according to the corresponding relationship, target identity information corresponding to the target identity feature vector, and use the target identity information as a recognition result of the speech to be recognized.
Specifically, the first obtaining unit 201 is configured to obtain a speech to be recognized, and extract low-level frame-level features of the speech to be recognized; the extracting unit 202 is configured to extract an identity feature vector according to the low-level frame-level feature; the second obtaining unit 203 is configured to obtain a target identity feature vector similar to the identity feature vector from a preset voice library, where a corresponding relationship between the preset identity feature vector and preset identity information is stored in the preset voice library in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression; the recognition unit 204 is configured to determine, according to the corresponding relationship, target identity information corresponding to the target identity feature vector, and use the target identity information as a recognition result of the speech to be recognized.
The ArcFace-based voice recognition device provided by the embodiment of the invention acquires, from the preset voice library, a target identity feature vector similar to the identity feature vector of the voice to be recognized; the correspondence is obtained from a preset model trained in advance with the preset loss function obtained from the ArcFace-based algorithm expression; the target identity information is then determined and taken as the recognition result of the voice to be recognized, so that various types of voices can be recognized accurately.
The ArcFace-based speech recognition device provided by the embodiment of the present invention may be specifically configured to execute the processing flows of the above method embodiments, and the functions thereof are not described herein again, and reference may be made to the detailed description of the above method embodiments.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 3, the electronic device includes: a processor 301, a memory 302, and a bus 303;
the processor 301 and the memory 302 complete communication with each other through a bus 303;
the processor 301 is configured to call program instructions in the memory 302 to perform the methods provided by the above-mentioned method embodiments, including: acquiring a voice to be recognized, and extracting low-level frame level features of the voice to be recognized; extracting an identity feature vector according to the low-level frame level features; acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression; and determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
The present embodiment discloses a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above-mentioned method embodiments, for example, comprising: acquiring a voice to be recognized, and extracting low-level frame level features of the voice to be recognized; extracting an identity feature vector according to the low-level frame level features; acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression; and determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including: acquiring a voice to be recognized, and extracting low-level frame level features of the voice to be recognized; extracting an identity feature vector according to the low-level frame level features; acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression; and determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
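The first step, extracting low-level frame-level features, can be illustrated with log Mel filterbank (Fbank) features, which the claims name as one option. The sketch below uses plain NumPy; the 16 kHz sample rate, 25 ms frames, 10 ms frame shift, 512-point FFT and 40 filters are common defaults assumed here, not values taken from the text, and `fbank` is a hypothetical helper name.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def fbank(signal, sample_rate=16000, frame_len=0.025, frame_shift=0.010,
          n_fft=512, n_mels=40):
    """Log Mel filterbank features, shape (n_frames, n_mels).

    All default parameter values are assumptions for illustration.
    """
    signal = np.asarray(signal, dtype=float)
    win = int(round(frame_len * sample_rate))    # samples per frame
    hop = int(round(frame_shift * sample_rate))  # samples per frame shift
    n_frames = 1 + (len(signal) - win) // hop
    # Slice the signal into overlapping frames and apply a Hamming window.
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(win)
    # Power spectrum of each frame (frames are zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the Mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / (right - center)
    # Small constant avoids log(0) for silent frames.
    return np.log(power @ fb.T + 1e-10)

# One second of a 440 Hz tone as a stand-in for the voice to be recognized.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
features = fbank(tone)
```

Each row of `features` is one frame-level feature vector; the sequence of rows is what would be passed to the model that produces the identity feature vector.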
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware controlled by program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, and magnetic or optical disks.
The above-described embodiments of the electronic device and the like are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may also be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the embodiments of the present invention, and are not limited thereto; although embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. An ArcFace-based speech recognition method is characterized by comprising the following steps:
acquiring a voice to be recognized, and extracting low-level frame level features of the voice to be recognized;
extracting an identity feature vector according to the low-level frame level features;
acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression;
and determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
2. The method of claim 1, wherein the preset loss function comprises a maximum edge constraint loss factor expressed as:
wherein $C_{max\_mar}$ is the maximum edge constraint loss factor, N is the sample subset of batch training, y is a sample class, C is the total number of sample classes, t is a preset threshold, $f_{y_i}$ is the posterior probability, greater than the preset threshold, that represents the class to which the sample vector belongs, and $\delta_y$ is the penalty function of the maximum boundary term.
3. The method of claim 2, wherein the expression of $\delta_y$ is:
wherein, when $j \neq y_i$, $f_j$ represents the posterior probability, smaller than the preset threshold, that represents the other classes to which the sample vector belongs.
4. A method according to claim 2 or 3, wherein the preset loss function is expressed by:
$L = L_3 + \lambda C_{max\_mar}$
wherein L is the preset loss function, $L_3$ is the ArcFace-based algorithm expression, and $\lambda$ is a weight coefficient with a value of 0.1 to 10.
5. The method of claim 1, wherein extracting an identity feature vector from the low-level frame-level features comprises:
inputting the low-level frame-level features into an optimized GRU model, and taking an output result of the optimized GRU model as the identity feature vector.
6. The method of claim 5, wherein the optimized GRU model is a GRU model with convolutional layers.
7. The method of claim 1, wherein the low-level frame-level feature is an Fbank feature.
8. An ArcFace-based speech recognition device, comprising:
the first acquisition unit is used for acquiring a voice to be recognized and extracting low-level frame-level features of the voice to be recognized;
the extraction unit is used for extracting an identity feature vector according to the low-level frame level features;
the second acquisition unit is used for acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, and the preset voice library stores the corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression;
and the recognition unit is used for determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation and taking the target identity information as a recognition result of the voice to be recognized.
9. An electronic device, comprising: a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 7.
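The combined loss of claim 4 can be illustrated with a short NumPy sketch. The ArcFace term below follows the commonly published additive angular margin formulation; the scale `s = 64` and margin `m = 0.5` are the usual defaults from the ArcFace literature and are assumptions here, since the text does not give them. The maximum edge constraint loss factor is passed in as a precomputed number because its expression is not reproduced in this text; `arcface_loss` and `combined_loss` are hypothetical names.

```python
import numpy as np

def arcface_loss(embeddings, weights, labels, s=64.0, m=0.5):
    """ArcFace (additive angular margin) softmax loss, averaged over the batch.

    embeddings: (N, D) identity feature vectors; weights: (C, D) class centres.
    s and m are assumed defaults, not values taken from the source text.
    """
    labels = np.asarray(labels)
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(x @ w.T, -1.0, 1.0)   # cosine of angle to every class centre
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    logits = s * cos
    # Add the angular margin m only to the angle of the true class.
    logits[rows, labels] = s * np.cos(theta[rows, labels] + m)
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, labels].mean())

def combined_loss(l3, c_max_mar, lam=1.0):
    """L = L_3 + lambda * C_max_mar, with lambda restricted to 0.1 to 10.

    c_max_mar is supplied by the caller; its formula is not reproduced here.
    """
    if not 0.1 <= lam <= 10.0:
        raise ValueError("lambda is expected to lie between 0.1 and 10")
    return l3 + lam * c_max_mar

# Toy batch: three samples sitting exactly on three orthogonal class centres.
emb = np.eye(3)
centres = np.eye(3)
l3 = arcface_loss(emb, centres, [0, 1, 2], s=1.0, m=0.5)
total = combined_loss(l3, c_max_mar=2.0, lam=0.5)
```

With the margin applied, the true-class logit is reduced, so the loss stays higher than a plain normalised softmax for the same inputs; this is what pushes embeddings of the same speaker closer together during training.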
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811400260.2A CN109377984B (en) | 2018-11-22 | 2018-11-22 | ArcFace-based voice recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109377984A (en) | 2019-02-22 |
CN109377984B (en) | 2022-05-03 |
Family
ID=65377103
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811400260.2A (Active), CN109377984B (en) | ArcFace-based voice recognition method and device | 2018-11-22 | 2018-11-22 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109377984B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110047468A (en) * | 2019-05-20 | 2019-07-23 | 北京达佳互联信息技术有限公司 | Audio recognition method, device and storage medium |
CN111582354A (en) * | 2020-04-30 | 2020-08-25 | 中国平安财产保险股份有限公司 | Picture identification method, device, equipment and storage medium |
CN112669827A (en) * | 2020-12-28 | 2021-04-16 | 清华大学 | Joint optimization method and system for automatic speech recognizer |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104732978A (en) * | 2015-03-12 | 2015-06-24 | 上海交通大学 | Text-dependent speaker recognition method based on joint deep learning |
CN105575394A (en) * | 2016-01-04 | 2016-05-11 | 北京时代瑞朗科技有限公司 | Voiceprint identification method based on global change space and deep learning hybrid modeling |
CN105632502A (en) * | 2015-12-10 | 2016-06-01 | 江西师范大学 | Weighted pairwise constraint metric learning algorithm-based speaker recognition method |
CN105931646A (en) * | 2016-04-29 | 2016-09-07 | 江西师范大学 | Speaker identification method base on simple direct tolerance learning algorithm |
CN106022380A (en) * | 2016-05-25 | 2016-10-12 | 中国科学院自动化研究所 | Individual identity identification method based on deep learning |
US20180197547A1 (en) * | 2017-01-10 | 2018-07-12 | Fujitsu Limited | Identity verification method and apparatus based on voiceprint |
US20180261236A1 (en) * | 2017-03-10 | 2018-09-13 | Baidu Online Network Technology (Beijing) Co., Ltd. | Speaker recognition method and apparatus, computer device and computer-readable medium |
Non-Patent Citations (1)
Title |
---|
CHEN, SHENG et al.: "MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices", 《13TH CHINESE CONFERENCE ON BIOMETRIC RECOGNITION (CCBR)》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110047468A (en) * | 2019-05-20 | 2019-07-23 | 北京达佳互联信息技术有限公司 | Audio recognition method, device and storage medium |
CN110047468B (en) * | 2019-05-20 | 2022-01-25 | 北京达佳互联信息技术有限公司 | Speech recognition method, apparatus and storage medium |
CN111582354A (en) * | 2020-04-30 | 2020-08-25 | 中国平安财产保险股份有限公司 | Picture identification method, device, equipment and storage medium |
CN112669827A (en) * | 2020-12-28 | 2021-04-16 | 清华大学 | Joint optimization method and system for automatic speech recognizer |
CN112669827B (en) * | 2020-12-28 | 2022-08-02 | 清华大学 | Joint optimization method and system for automatic speech recognizer |
Also Published As
Publication number | Publication date |
---|---|
CN109377984B (en) | 2022-05-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110853666B (en) | Speaker separation method, device, equipment and storage medium | |
US12112757B2 (en) | Voice identity feature extractor and classifier training | |
US11264044B2 (en) | Acoustic model training method, speech recognition method, acoustic model training apparatus, speech recognition apparatus, acoustic model training program, and speech recognition program | |
CN108346436B (en) | Voice emotion detection method and device, computer equipment and storage medium | |
JP7266674B2 (en) | Image classification model training method, image processing method and apparatus | |
CN110472675B (en) | Image classification method, image classification device, storage medium and electronic equipment | |
CN112435673B (en) | Model training method and electronic terminal | |
CN110033760A (en) | Modeling method, device and the equipment of speech recognition | |
CN106169295B (en) | Identity vector generation method and device | |
CN109377984B (en) | ArcFace-based voice recognition method and device | |
CN114678030B (en) | Voiceprint recognition method and device based on depth residual error network and attention mechanism | |
CN111091809B (en) | Regional accent recognition method and device based on depth feature fusion | |
CN106991312B (en) | Internet anti-fraud authentication method based on voiceprint recognition | |
CN111583906A (en) | Role recognition method, device and terminal for voice conversation | |
CN111508505A (en) | Speaker identification method, device, equipment and storage medium | |
CN111477219A (en) | Keyword distinguishing method and device, electronic equipment and readable storage medium | |
CN116153330B (en) | Intelligent telephone voice robot control method | |
Shivakumar et al. | Simplified and supervised i-vector modeling for speaker age regression | |
CN112632248A (en) | Question answering method, device, computer equipment and storage medium | |
CN111930885B (en) | Text topic extraction method and device and computer equipment | |
Tong et al. | Graph convolutional network based semi-supervised learning on multi-speaker meeting data | |
CN117235137B (en) | Professional information query method and device based on vector database | |
CN110956981B (en) | Speech emotion recognition method, device, equipment and storage medium | |
Imoto et al. | Acoustic scene analysis from acoustic event sequence with intermittent missing event | |
Bohra et al. | Language Identification using Stacked Convolutional Neural Network (SCNN) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||