
CN109377984A - A kind of audio recognition method and device based on ArcFace - Google Patents

A kind of audio recognition method and device based on ArcFace Download PDF

Info

Publication number
CN109377984A
CN109377984A
Authority
CN
China
Prior art keywords
preset
voice
arcface
characteristic vector
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811400260.2A
Other languages
Chinese (zh)
Other versions
CN109377984B (en)
Inventor
李鹏
吉瑞芳
蔡新元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wisdom And Technology Co Ltd
Original Assignee
Beijing Wisdom And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wisdom And Technology Co Ltd filed Critical Beijing Wisdom And Technology Co Ltd
Priority to CN201811400260.2A priority Critical patent/CN109377984B/en
Publication of CN109377984A publication Critical patent/CN109377984A/en
Application granted granted Critical
Publication of CN109377984B publication Critical patent/CN109377984B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L2015/0631 Creating reference templates; Clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the present invention provides an ArcFace-based voice recognition method and device. The method comprises: acquiring a voice to be recognized and extracting low-level frame-level features of the voice to be recognized; extracting an identity feature vector according to the low-level frame-level features; acquiring, from a preset voice library, a target identity feature vector similar to the identity feature vector, the preset voice library storing in advance the correspondence between preset identity feature vectors and preset identity information, the correspondence being obtained from a pre-trained preset model, and the preset model being trained with a preset loss function derived from an ArcFace-based algorithm expression; and determining, according to the correspondence, target identity information corresponding to the target identity feature vector, and taking the target identity information as the recognition result of the voice to be recognized. The device executes the above method. The method and device provided in the embodiment of the present invention can accurately recognize various types of voices.

Description

ArcFace-based voice recognition method and device
Technical Field
The embodiment of the invention relates to the technical field of voice processing, in particular to a method and a device for recognizing voice based on ArcFace.
Background
With the explosive growth of digital audio data, recognition of speakers through speech recognition technology is also receiving increasing attention.
At present, the i-vector system most widely applied to speaker recognition, together with GMM-UBM (Gaussian mixture model and universal background model) and GSV-SVM (Gaussian supervector with support vector machine), is built on the theory of statistical models, so the training and test speech must reach a certain length; otherwise, recognition accuracy drops sharply. On the other hand, although ArcFace is widely applied in the field of face recognition, there is as yet no method applying ArcFace to the field of speech recognition.
Therefore, how to avoid the above-mentioned drawbacks and accurately recognize various types of voices (including long voices and short voices) based on ArcFace is a problem to be solved urgently.
Disclosure of Invention
Aiming at the problems in the prior art, the embodiment of the invention provides a method and a device for recognizing a voice based on ArcFace.
In a first aspect, an embodiment of the present invention provides a speech recognition method based on ArcFace, where the method includes:
acquiring a voice to be recognized, and extracting low-level frame level features of the voice to be recognized;
extracting an identity feature vector according to the low-level frame level features;
acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression;
and determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
In a second aspect, an embodiment of the present invention provides an ArcFace-based speech recognition apparatus, where the apparatus includes:
the device comprises a first acquisition unit, a second acquisition unit and a processing unit, wherein the first acquisition unit is used for acquiring a voice to be recognized and extracting low-level frame-level features of the voice to be recognized;
the extraction unit is used for extracting an identity feature vector according to the low-level frame level features;
the second acquisition unit is used for acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, and the preset voice library stores the corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression;
and the recognition unit is used for determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation and taking the target identity information as a recognition result of the voice to be recognized.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory, and a bus, wherein,
the processor and the memory communicate with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform a method comprising:
acquiring a voice to be recognized, and extracting low-level frame level features of the voice to be recognized;
extracting an identity feature vector according to the low-level frame level features;
acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression;
and determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, including:
the non-transitory computer readable storage medium stores computer instructions that cause the computer to perform a method comprising:
acquiring a voice to be recognized, and extracting low-level frame level features of the voice to be recognized;
extracting an identity feature vector according to the low-level frame level features;
acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression;
and determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
According to the ArcFace-based voice recognition method and device provided by the embodiment of the invention, the target identity characteristic vector similar to the identity characteristic vector corresponding to the voice to be recognized is obtained from the preset voice library, the corresponding relation is obtained according to the preset model trained by the preset loss function obtained in advance based on the ArcFace algorithm expression, the target identity information is further obtained, and then the target identity information is used as the recognition result of the voice to be recognized, so that various types of voices can be recognized accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a speech recognition method based on ArcFace according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an ArcFace-based speech recognition apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a speech recognition method based on ArcFace according to an embodiment of the present invention, and as shown in fig. 1, the speech recognition method based on ArcFace according to the embodiment of the present invention includes the following steps:
s101: and acquiring the voice to be recognized, and extracting the low-level frame-level features of the voice to be recognized.
Specifically, the device acquires a speech to be recognized and extracts its low-level frame-level features. The device may be a server or the like that executes the method, and may acquire the voice of the same speaker over different channels through equipment such as moving-coil, condenser, and MEMS microphones to simulate an actual voice environment. With a frame length of 25 ms and a frame shift of 10 ms, the frame-level features of the speech to be recognized are extracted, and silent frames are removed with VAD (voice activity detection) to obtain the low-level frame-level features. The low-level frame-level feature may be the Fbank feature, without specific limitation.
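The framing and silence-removal step above can be sketched as follows. This is a minimal illustration, assuming a 16 kHz sample rate and a crude energy-based VAD; the patent's actual VAD and Fbank filterbank are not specified, so the function names and threshold here are illustrative:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_len_ms=25, frame_shift_ms=10):
    """Slice a waveform into overlapping frames (25 ms window, 10 ms shift)."""
    frame_len = int(sample_rate * frame_len_ms / 1000)      # 400 samples at 16 kHz
    frame_shift = int(sample_rate * frame_shift_ms / 1000)  # 160 samples at 16 kHz
    num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    return np.stack([signal[i * frame_shift: i * frame_shift + frame_len]
                     for i in range(num_frames)])

def energy_vad(frames, threshold_ratio=0.1):
    """Crude energy-based VAD: keep frames above a fraction of the peak frame energy."""
    energy = np.sum(frames ** 2, axis=1)
    return frames[energy > threshold_ratio * energy.max()]

# one second of noise-like "speech" followed by half a second of silence
rng = np.random.default_rng(0)
sig = np.concatenate([rng.standard_normal(8000), np.zeros(8000)])
frames = frame_signal(sig)
voiced = energy_vad(frames)
print(frames.shape, voiced.shape)
```

A real implementation would then apply a mel filterbank to each voiced frame to obtain the Fbank features mentioned in the text.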
S102: and extracting an identity feature vector according to the low-level frame level features.
Specifically, the device extracts an identity feature vector according to the low-level frame-level features. The identity feature vector can be understood as a feature vector that identifies a speaker; the low-level frame-level features may be input into an optimized GRU model, and the output of the optimized GRU model taken as the identity feature vector. The GRU (Gated Recurrent Unit) is a variant of the LSTM used as a model for learning temporal features; it retains the LSTM's ability to handle long-range dependencies while having a simpler structure and more efficient computation. A convolutional layer can be introduced before the GRU layer to optimize the GRU model, so that spectral correlation is modelled while the feature dimensionality in the time and frequency domains is reduced and model computation is accelerated.
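A GRU cell of the kind described above can be sketched in a few lines of numpy. This is a single-layer toy (no convolutional front-end, random untrained weights) meant only to show the gate structure and how the final hidden state serves as the utterance-level identity vector; dimensions and initialization are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell: update gate z, reset gate r, candidate state h_tilde."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.Wz = rng.standard_normal((hidden_dim, input_dim + hidden_dim)) * 0.1
        self.Wr = rng.standard_normal((hidden_dim, input_dim + hidden_dim)) * 0.1
        self.Wh = rng.standard_normal((hidden_dim, input_dim + hidden_dim)) * 0.1
        self.hidden_dim = hidden_dim

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                               # update gate
        r = sigmoid(self.Wr @ xh)                               # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h])) # candidate state
        return (1 - z) * h + z * h_tilde

    def run(self, frames):
        """Fold a sequence of frame-level features into one utterance embedding."""
        h = np.zeros(self.hidden_dim)
        for x in frames:
            h = self.step(x, h)
        return h

cell = GRUCell(input_dim=40, hidden_dim=8)
embedding = cell.run(np.random.default_rng(1).standard_normal((100, 40)))
print(embedding.shape)  # (8,)
```

In the patent's setting this embedding would be trained with the MMCL loss described below, not used with random weights.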
S103: acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression.
Specifically, the device acquires a target identity feature vector similar to the identity feature vector from a preset voice library, wherein the preset voice library stores in advance the correspondence between preset identity feature vectors and preset identity information; the correspondence is obtained from a pre-trained preset model; and the preset model is trained with a preset loss function derived from an ArcFace-based algorithm expression. A nearest-neighbour classifier can be adopted to calculate the Euclidean distance between the identity feature vector and each preset identity feature vector in the preset voice library, and the preset identity feature vector with the minimum Euclidean distance is determined as the target identity feature vector. The preset identity information can be understood as the speaker corresponding to the preset identity feature vector; that is, by identifying a preset identity feature vector, the preset model identifies which speaker that vector corresponds to. The embodiment of the invention does not specifically limit the preset model. The ArcFace-based algorithm expression L_3 can be obtained according to the following steps:
For an input sample vector x_i and its corresponding label y_i (i.e., which speaker it belongs to), the loss function L_1 is defined as follows:

    L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{f_{y_i}}}{\sum_{j=1}^{C}e^{f_j}}

where N is the number of samples trained in a batch (i.e., a fraction of the total number of samples input to the device per batch), C is the total number of sample classes (i.e., the total number of speakers), f_{y_i} represents the posterior probability of the class to which sample vector x_i belongs, and f_j represents the posterior probabilities of all classes. f_j can be expressed as follows:

    f_j = W_j^{T} x_i + b_j

wherein W_j and b_j are the weight vector and bias of the fully-connected layer, and \theta_j is the angle between W_j and x_i.
For simplification, the bias b_j is set to 0 and the weight is L2-normalised so that \lVert W_j \rVert = 1; f_j is then determined only by the sample vector x_i and the angle \theta_j:

    f_j = \lVert x_i \rVert \cos\theta_j

L2 regularisation of the features removes their radial variation in the hypersphere space. Setting \lVert x_i \rVert to a constant s, the loss function L_2 is expressed as:

    L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{\sum_{j=1}^{C}e^{s\cos\theta_j}}

Since this soft-boundary loss function focuses on correct classification, it does not consider the misclassification case. To address this, an angular-margin loss factor m is added, i.e., m is introduced inside \cos(\cdot) for the target class to increase the boundary constraint on the classification boundary, which yields the ArcFace algorithm expression L_3:

    L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)}+\sum_{j\ne y_i}e^{s\cos\theta_j}}

wherein \theta_{y_i} falls within the range [0, \pi - m].
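The ArcFace expression L_3 can be computed directly as below. This is a numpy sketch under the standard ArcFace formulation the text describes (normalised embeddings and class weights, scale s, additive angular margin m); the batch data, s and m values are illustrative, not the patent's:

```python
import numpy as np

def arcface_loss(embeddings, weights, labels, s=30.0, m=0.5):
    """L_3: softmax cross-entropy over scaled cosines, with an additive
    angular margin m applied to the target-class angle theta_{y_i}."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)  # ||x_i|| = 1
    W = weights / np.linalg.norm(weights, axis=0, keepdims=True)        # ||W_j|| = 1
    cos = np.clip(x @ W, -1.0, 1.0)          # cos(theta_j) for every class
    theta = np.arccos(cos)
    rows = np.arange(len(labels))
    logits = s * cos
    logits[rows, labels] = s * np.cos(theta[rows, labels] + m)  # margin on target
    logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()

rng = np.random.default_rng(0)
emb = rng.standard_normal((4, 8))    # 4 utterance embeddings, dim 8
W = rng.standard_normal((8, 3))      # 3 speaker classes
loss = arcface_loss(emb, W, np.array([0, 1, 2, 0]))
print(np.isfinite(loss))
```

With m = 0 the function reduces to the normalised softmax loss L_2 above.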
The goal of speaker recognition is to determine which speaker an unknown utterance belongs to. Assume that the posterior probability f_{y_i} of the class to which the speech belongs is greater than a preset threshold t, while the posterior probabilities f_j of all other classes are less than t. This can be expressed as follows:

    f_{y_i} > t, \qquad f_j < t \ (j \ne y_i)

In the classification process, if f_{y_i} \le t, or if f_j \ge t for some j \ne y_i, the result is a misclassification, and the loss is defined as the difference between the probability and the threshold. For the former case, let the loss be L^{+}, expressed as:

    L^{+} = t - f_{y_i}

For the same reason, the latter loss L^{-} is expressed as:

    L^{-} = f_j - t

To represent the misclassification loss as a whole, L^{+} and L^{-} are merged by introducing a penalty function \delta_y of the maximum boundary term, giving the per-class term \max(0, \delta_y (t - f_y)).
For all samples, the maximum edge constraint loss factor is:

    C_{max\_mar} = \frac{1}{N}\sum_{i=1}^{N}\sum_{y=1}^{C}\max\bigl(0, \delta_y (t - f_y)\bigr)
In summary, the preset loss function L, i.e., the maximum edge cosine distance loss function (MMCL), is obtained on the basis of ArcFace and is defined as the combination of L_3 and C_{max\_mar}, as follows:

    L = L_3 + \lambda C_{max\_mar}

where \lambda is a weight coefficient whose value is selected from 0.1 to 10.
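The maximum edge constraint factor C_max_mar can be sketched as below. This is an assumption-laden reconstruction from the text: it takes δ_y = +1 for the target class and −1 otherwise, so that max(0, δ_y(t − f_y)) recovers both L+ (target posterior below t) and L− (a non-target posterior above t); the threshold and the toy batches are illustrative:

```python
import numpy as np

def max_margin_constraint(posteriors, labels, t=0.6):
    """C_max_mar sketch: hinge penalty whenever the target-class posterior
    falls below threshold t, or a non-target posterior rises above it."""
    N, C = posteriors.shape
    delta = -np.ones((N, C))                 # delta_y = -1 for non-target classes
    delta[np.arange(N), labels] = 1.0        # delta_y = +1 for the target class
    return np.maximum(0.0, delta * (t - posteriors)).sum(axis=1).mean()

# a confident batch should incur less penalty than a confused one
labels = np.array([0, 1])
confident = np.array([[0.90, 0.05, 0.05], [0.10, 0.80, 0.10]])
confused = np.array([[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]])
print(max_margin_constraint(confident, labels),
      max_margin_constraint(confused, labels))
```

The total MMCL loss is then `arcface_l3 + lam * max_margin_constraint(...)` for a weight `lam` in the 0.1 to 10 range the text gives.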
It should be noted that the maximum edge constraint loss factor C_{max\_mar} introduced by the embodiment of the invention includes the penalty function \delta_y of the maximum boundary term: for a correct prediction (the case f_{y_i} > t in the expression of \delta_y), \delta_y = 1; for a wrong prediction (the case f_j \ge t in the expression of \delta_y), \delta_y = -1. The preset loss function therefore discriminates prediction results more strongly, making the recognition result more accurate.
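The nearest-neighbour lookup used at the start of this step (minimum Euclidean distance against the preset voice library) can be sketched as follows; the library contents and speaker IDs here are illustrative, not from the patent:

```python
import numpy as np

def nearest_speaker(query, library):
    """Return the speaker ID whose enrolled identity vector has minimum
    Euclidean distance to `query` (the patent's "preset voice library")."""
    return min(library, key=lambda sid: np.linalg.norm(query - library[sid]))

library = {
    "speaker_a": np.array([1.0, 0.0, 0.0]),
    "speaker_b": np.array([0.0, 1.0, 0.0]),
}
print(nearest_speaker(np.array([0.9, 0.1, 0.0]), library))  # speaker_a
```

Mapping the matched vector back to its speaker ID is exactly the stored correspondence between preset identity feature vectors and preset identity information.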
S104: and determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
Specifically, the device determines the target identity information corresponding to the target identity feature vector according to the correspondence, and takes the target identity information as the recognition result of the speech to be recognized. An example is as follows: a preset identity feature vector A corresponds to preset identity information a, and the identity feature vector corresponding to the speech to be recognized is X; if vector-similarity comparison determines that the preset identity feature vector A is the target identity feature vector similar to X, the preset identity information a is determined as the target identity information and taken as the recognition result of the speech to be recognized. The EER of the MMCL of the embodiment of the invention is compared with that of softmax and ArcFace under four speech-length conditions (2 s, 3 s, 5 s and 8 s) in Table 1:
TABLE 1 Recognition performance (EER) of the short-utterance speaker recognition methods at different durations

    Method     2s       3s       5s       8s
    softmax    0.0643   0.0437   0.0363   0.0301
    ArcFace    0.0602   0.0410   0.0307   0.0254
    MMCL       0.0538   0.0385   0.0272   0.0215
Therefore, the MMCL provided by the embodiment of the invention achieves a smaller EER, i.e., it can recognize speech more accurately.
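For reference, the EER metric reported in Table 1 is the operating point where the false-acceptance and false-rejection rates are equal. A simple threshold-sweep sketch is shown below; the score distributions are synthetic and the grid search is illustrative, not the evaluation protocol actually used for Table 1 (which the patent does not specify):

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: sweep candidate thresholds and return the point
    where false-rejection and false-acceptance rates are closest."""
    best_gap, best_eer = 1.0, 0.0
    for t in np.sort(np.concatenate([target_scores, nontarget_scores])):
        frr = np.mean(target_scores < t)        # genuine trials rejected
        far = np.mean(nontarget_scores >= t)    # impostor trials accepted
        if abs(frr - far) < best_gap:
            best_gap, best_eer = abs(frr - far), (frr + far) / 2
    return best_eer

rng = np.random.default_rng(0)
targets = rng.normal(2.0, 1.0, 1000)      # similarity scores for genuine trials
impostors = rng.normal(0.0, 1.0, 1000)    # similarity scores for impostor trials
rate = eer(targets, impostors)
print(round(rate, 3))
```

Lower values, as in the MMCL row of Table 1, indicate better separation between genuine-speaker and impostor scores.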
According to the ArcFace-based voice recognition method provided by the embodiment of the invention, the target identity characteristic vector similar to the identity characteristic vector corresponding to the voice to be recognized is obtained from the preset voice library, the corresponding relation is obtained according to the preset model obtained by the algorithm expression based on the ArcFace in advance and trained by the preset loss function, the target identity information is further obtained, and then the target identity information is used as the recognition result of the voice to be recognized, so that various types of voices can be recognized accurately.
On the basis of the above embodiment, the preset loss function includes a maximum edge constraint loss factor, whose expression is:

    C_{max\_mar} = \frac{1}{N}\sum_{i=1}^{N}\sum_{y=1}^{C}\max\bigl(0, \delta_y (t - f_y)\bigr)

wherein C_{max\_mar} is the maximum edge constraint loss factor, N is the number of samples in the training batch, y is the sample class, C is the total number of sample classes, t is the preset threshold, f_{y_i} is the posterior probability of the class to which the sample vector belongs, which is greater than the preset threshold, and \delta_y is the penalty function of the maximum boundary term.
Specifically, the expression of the maximum edge constraint loss factor in the device is:

    C_{max\_mar} = \frac{1}{N}\sum_{i=1}^{N}\sum_{y=1}^{C}\max\bigl(0, \delta_y (t - f_y)\bigr)

wherein C_{max\_mar} is the maximum edge constraint loss factor, N is the number of samples in the training batch, y is the sample class, C is the total number of sample classes, t is the preset threshold, f_{y_i} is the posterior probability of the class to which the sample vector belongs, which is greater than the preset threshold, and \delta_y is the penalty function of the maximum boundary term. Reference may be made to the above embodiments, which are not described in detail here.
According to the ArcFace-based voice recognition method provided by the embodiment of the invention, the preset loss function comprising the maximum edge constraint loss factor is adopted to train the preset model, so that various types of voices can be further accurately recognized.
On the basis of the above embodiment, the expression of \delta_y is:

    \delta_y = \begin{cases} 1, & f_{y_i} > t \\ -1, & f_j \ge t,\ j \ne y_i \end{cases}

wherein, when j \ne y_i, f_j denotes the posterior probability of another class to which the sample vector could be assigned, which is less than the preset threshold.
Specifically, the expression of \delta_y in the device is:

    \delta_y = \begin{cases} 1, & f_{y_i} > t \\ -1, & f_j \ge t,\ j \ne y_i \end{cases}

wherein, when j \ne y_i, f_j denotes the posterior probability of another class to which the sample vector could be assigned, which is less than the preset threshold. Reference may be made to the above embodiments, which are not described in detail here.
The ArcFace-based voice recognition method provided by the embodiment of the invention can further accurately recognize various types of voices by calculating the penalty function of the maximum boundary term through a specific expression.
On the basis of the above embodiment, the expression of the preset loss function is:
    L = L_3 + \lambda C_{max\_mar}

wherein L is the preset loss function, L_3 is the ArcFace-based algorithm expression, and \lambda is a weight coefficient whose value ranges from 0.1 to 10.
Specifically, the expression of the preset loss function in the device is as follows:
    L = L_3 + \lambda C_{max\_mar}

wherein L is the preset loss function, L_3 is the ArcFace-based algorithm expression, and \lambda is a weight coefficient whose value ranges from 0.1 to 10. Reference may be made to the above embodiments, which are not described in detail here.
The ArcFace-based voice recognition method provided by the embodiment of the invention can further accurately recognize various types of voices by calculating the preset loss function through the specific expression.
On the basis of the above embodiment, the extracting an identity feature vector according to the low-level frame-level feature includes:
inputting the low-level frame-level features into an optimized GRU model, and taking an output result of the optimized GRU model as the identity feature vector.
Specifically, the apparatus inputs the low-level frame-level features into an optimized GRU model, and takes the output of the optimized GRU model as the identity feature vector. Reference may be made to the above embodiments, which are not described in detail.
According to the ArcFace-based voice recognition method provided by the embodiment of the invention, the output result of the optimized GRU model is used as the identity feature vector, so that the method can be ensured to be normally carried out.
On the basis of the above embodiment, the optimized GRU model is a GRU model provided with a convolutional layer.
Specifically, the optimized GRU model in the apparatus is a GRU model provided with a convolutional layer. Reference may be made to the above embodiments, which are not described in detail.
According to the ArcFace-based voice recognition method provided by the embodiment of the invention, the optimized GRU model is selected as the GRU model provided with the convolution layer, so that the operation efficiency of the GRU model can be improved, and various types of voice can be recognized more quickly.
On the basis of the above embodiment, the low-level frame-level feature is an Fbank feature.
Specifically, the low-level frame-level feature in the device is an Fbank feature. Reference may be made to the above embodiments, which are not described in detail.
According to the ArcFace-based voice recognition method provided by the embodiment of the invention, the low-level frame level features are selected as Fbank features, so that the method can be ensured to be normally carried out.
Fig. 2 is a schematic structural diagram of an ArcFace-based speech recognition apparatus according to an embodiment of the present invention, and as shown in fig. 2, an embodiment of the present invention provides an ArcFace-based speech recognition apparatus, which includes a first obtaining unit 201, an extracting unit 202, a second obtaining unit 203, and a recognition unit 204, where:
the first obtaining unit 201 is configured to obtain a speech to be recognized, and extract low-level frame-level features of the speech to be recognized; the extracting unit 202 is configured to extract an identity feature vector according to the low-level frame-level feature; the second obtaining unit 203 is configured to obtain a target identity feature vector similar to the identity feature vector from a preset voice library, where a corresponding relationship between the preset identity feature vector and preset identity information is stored in the preset voice library in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression; the recognition unit 204 is configured to determine, according to the corresponding relationship, target identity information corresponding to the target identity feature vector, and use the target identity information as a recognition result of the speech to be recognized.
Specifically, the first obtaining unit 201 is configured to obtain a speech to be recognized, and extract low-level frame-level features of the speech to be recognized; the extracting unit 202 is configured to extract an identity feature vector according to the low-level frame-level feature; the second obtaining unit 203 is configured to obtain a target identity feature vector similar to the identity feature vector from a preset voice library, where a corresponding relationship between the preset identity feature vector and preset identity information is stored in the preset voice library in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression; the recognition unit 204 is configured to determine, according to the corresponding relationship, target identity information corresponding to the target identity feature vector, and use the target identity information as a recognition result of the speech to be recognized.
The ArcFace-based voice recognition device provided by the embodiment of the invention obtains the target identity characteristic vector similar to the identity characteristic vector corresponding to the voice to be recognized from the preset voice library, obtains the corresponding relation according to the preset model obtained by the algorithm expression based on the ArcFace in advance and trained by the preset loss function, further obtains the target identity information, and takes the target identity information as the recognition result of the voice to be recognized, so that various types of voices can be recognized accurately.
The ArcFace-based speech recognition device provided by the embodiment of the present invention may be specifically configured to execute the processing flows of the above method embodiments, and the functions thereof are not described herein again, and reference may be made to the detailed description of the above method embodiments.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 3, the electronic device includes: a processor (processor)301, a memory (memory)302, and a bus 303;
the processor 301 and the memory 302 complete communication with each other through a bus 303;
the processor 301 is configured to call program instructions in the memory 302 to perform the methods provided by the above-mentioned method embodiments, including: acquiring a voice to be recognized, and extracting low-level frame level features of the voice to be recognized; extracting an identity feature vector according to the low-level frame level features; acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, wherein the preset voice library stores a corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression; and determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
The present embodiment discloses a computer program product, comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above method embodiments, for example comprising: acquiring a voice to be recognized, and extracting low-level frame-level features of the voice to be recognized; extracting an identity feature vector according to the low-level frame-level features; acquiring a target identity feature vector similar to the identity feature vector from a preset voice library, wherein the preset voice library stores in advance a corresponding relation between preset identity feature vectors and preset identity information; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression; and determining target identity information corresponding to the target identity feature vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
The present embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example comprising: acquiring a voice to be recognized, and extracting low-level frame-level features of the voice to be recognized; extracting an identity feature vector according to the low-level frame-level features; acquiring a target identity feature vector similar to the identity feature vector from a preset voice library, wherein the preset voice library stores in advance a corresponding relation between preset identity feature vectors and preset identity information; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression; and determining target identity information corresponding to the target identity feature vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
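The lookup step described above — retrieving from the preset voice library the target identity vector most similar to the query identity vector — can be sketched as a nearest-neighbor search. The code below is an illustrative sketch, not the patented implementation: the use of cosine similarity as the similarity measure and the dictionary form of the voice library are assumptions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two identity feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognize(identity_vector: np.ndarray, voice_library: dict) -> str:
    """Return the preset identity whose stored vector is most similar to
    the query identity vector (nearest neighbor under cosine similarity)."""
    best_identity, best_score = None, -1.0
    for identity, preset_vector in voice_library.items():
        score = cosine_similarity(identity_vector, preset_vector)
        if score > best_score:
            best_identity, best_score = identity, score
    return best_identity
```

A voice library mapping each enrolled speaker's identity information to a preset identity vector would thus return, as the recognition result, the enrolled speaker whose vector points closest to the query embedding.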
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by program instructions controlling the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The above-described embodiments of the electronic device and the like are merely illustrative. The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which one of ordinary skill in the art can understand and implement without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the embodiments of the present invention, not to limit them. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, without such modifications or substitutions departing from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. An ArcFace-based speech recognition method is characterized by comprising the following steps:
acquiring a voice to be recognized, and extracting low-level frame-level features of the voice to be recognized;
extracting an identity feature vector according to the low-level frame-level features;
acquiring a target identity feature vector similar to the identity feature vector from a preset voice library, wherein the preset voice library stores in advance a corresponding relation between preset identity feature vectors and preset identity information; the corresponding relation is obtained according to a pre-trained preset model; and the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression;
and determining target identity information corresponding to the target identity feature vector according to the corresponding relation, and taking the target identity information as the recognition result of the voice to be recognized.
2. The method of claim 1, wherein the preset loss function comprises a maximum edge constraint loss factor, expressed as:
wherein C_max_mar is the maximum edge constraint loss factor, N is the sample subset of the batch training, y is the sample class, C is the total number of sample classes, t is a preset threshold, f_{y_i} is the posterior probability, greater than the preset threshold, of the class to which the sample vector belongs, and δ_y is the penalty function of the maximum boundary term.
3. The method of claim 2, wherein the expression of δ_y is:
wherein, when j ≠ y_i, f_j denotes the posterior probability, smaller than the preset threshold, of the other classes to which the sample vector belongs.
4. The method according to claim 2 or 3, wherein the expression of the preset loss function is:
L = L_3 + λ·C_max_mar
wherein L is the preset loss function, L_3 is the ArcFace-based algorithm expression, and λ is a weight coefficient with a value in the range of 0.1 to 10.
5. The method of claim 1, wherein extracting an identity feature vector from the low-level frame-level features comprises:
inputting the low-level frame-level features into an optimized GRU model, and taking an output result of the optimized GRU model as the identity feature vector.
6. The method of claim 5, wherein the optimized GRU model is a GRU model with convolutional layers.
7. The method of claim 1, wherein the low-level frame-level feature is an Fbank feature.
8. An ArcFace-based speech recognition device, comprising:
the device comprises a first acquisition unit, a second acquisition unit and a processing unit, wherein the first acquisition unit is used for acquiring a voice to be recognized and extracting low-level frame-level features of the voice to be recognized;
the extraction unit is used for extracting an identity feature vector according to the low-level frame level features;
the second acquisition unit is used for acquiring a target identity characteristic vector similar to the identity characteristic vector from a preset voice library, and the preset voice library stores the corresponding relation between the preset identity characteristic vector and preset identity information in advance; the corresponding relation is obtained according to a pre-trained preset model; the preset model is trained through a preset loss function obtained through an ArcFace-based algorithm expression;
and the recognition unit is used for determining target identity information corresponding to the target identity characteristic vector according to the corresponding relation and taking the target identity information as a recognition result of the voice to be recognized.
9. An electronic device, comprising: a processor, a memory, and a bus, wherein,
the processor and the memory are communicated with each other through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 7.
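Claims 2 to 4 combine an ArcFace-style angular-margin softmax term L_3 with a weighted maximum-edge constraint penalty, L = L_3 + λ·C_max_mar. The patent gives the expressions for L_3, C_max_mar, and δ_y only as images, so the sketch below is an assumption-laden illustration: the scale s, the margin m, and the treatment of C_max_mar as an externally supplied penalty value are all hypothetical placeholders; only the weighted combination in claim 4 is taken from the text.

```python
import numpy as np

def arcface_loss(embeddings, weights, labels, s=30.0, m=0.5):
    """Illustrative ArcFace-style additive angular-margin softmax loss.
    embeddings: (N, d) feature vectors; weights: (C, d) class centers;
    labels: (N,) integer class labels."""
    # Normalize both sides so dot products become cosines of angles.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = np.clip(e @ w.T, -1.0, 1.0)                 # (N, C) cosine logits
    theta = np.arccos(cos)
    target = cos.copy()
    rows = np.arange(len(labels))
    target[rows, labels] = np.cos(theta[rows, labels] + m)  # add angular margin
    logits = s * target
    # Numerically stable cross-entropy over the margin-adjusted logits.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[rows, labels].mean())

def total_loss(embeddings, weights, labels, c_max_mar, lam=0.1):
    # L = L_3 + λ·C_max_mar as in claim 4; C_max_mar is passed in here
    # because the patent's penalty expression is not reproduced in the text.
    return arcface_loss(embeddings, weights, labels) + lam * c_max_mar
```

In this sketch, mislabeling a sample raises the loss sharply, since the margin-adjusted target logit then competes against an unpenalized cosine close to 1.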
CN201811400260.2A 2018-11-22 2018-11-22 ArcFace-based voice recognition method and device Active CN109377984B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811400260.2A CN109377984B (en) 2018-11-22 2018-11-22 ArcFace-based voice recognition method and device


Publications (2)

Publication Number Publication Date
CN109377984A true CN109377984A (en) 2019-02-22
CN109377984B CN109377984B (en) 2022-05-03

Family

ID=65377103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811400260.2A Active CN109377984B (en) 2018-11-22 2018-11-22 ArcFace-based voice recognition method and device

Country Status (1)

Country Link
CN (1) CN109377984B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732978A (en) * 2015-03-12 2015-06-24 上海交通大学 Text-dependent speaker recognition method based on joint deep learning
CN105575394A (en) * 2016-01-04 2016-05-11 北京时代瑞朗科技有限公司 Voiceprint identification method based on global change space and deep learning hybrid modeling
CN105632502A (en) * 2015-12-10 2016-06-01 江西师范大学 Weighted pairwise constraint metric learning algorithm-based speaker recognition method
CN105931646A (en) * 2016-04-29 2016-09-07 江西师范大学 Speaker identification method base on simple direct tolerance learning algorithm
CN106022380A (en) * 2016-05-25 2016-10-12 中国科学院自动化研究所 Individual identity identification method based on deep learning
US20180197547A1 (en) * 2017-01-10 2018-07-12 Fujitsu Limited Identity verification method and apparatus based on voiceprint
US20180261236A1 (en) * 2017-03-10 2018-09-13 Baidu Online Network Technology (Beijing) Co., Ltd. Speaker recognition method and apparatus, computer device and computer-readable medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN, SHENG 等: "MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices", 《13TH CHINESE CONFERENCE ON BIOMETRIC RECOGNITION (CCBR)》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047468A (en) * 2019-05-20 2019-07-23 北京达佳互联信息技术有限公司 Audio recognition method, device and storage medium
CN110047468B (en) * 2019-05-20 2022-01-25 北京达佳互联信息技术有限公司 Speech recognition method, apparatus and storage medium
CN111582354A (en) * 2020-04-30 2020-08-25 中国平安财产保险股份有限公司 Picture identification method, device, equipment and storage medium
CN112669827A (en) * 2020-12-28 2021-04-16 清华大学 Joint optimization method and system for automatic speech recognizer
CN112669827B (en) * 2020-12-28 2022-08-02 清华大学 Joint optimization method and system for automatic speech recognizer


Similar Documents

Publication Publication Date Title
CN110853666B (en) Speaker separation method, device, equipment and storage medium
US12112757B2 (en) Voice identity feature extractor and classifier training
US11264044B2 (en) Acoustic model training method, speech recognition method, acoustic model training apparatus, speech recognition apparatus, acoustic model training program, and speech recognition program
CN108346436B (en) Voice emotion detection method and device, computer equipment and storage medium
JP7266674B2 (en) Image classification model training method, image processing method and apparatus
CN110472675B (en) Image classification method, image classification device, storage medium and electronic equipment
CN112435673B (en) Model training method and electronic terminal
CN110033760A (en) Modeling method, device and the equipment of speech recognition
CN106169295B (en) Identity vector generation method and device
CN109377984B (en) ArcFace-based voice recognition method and device
CN114678030B (en) Voiceprint recognition method and device based on depth residual error network and attention mechanism
CN111091809B (en) Regional accent recognition method and device based on depth feature fusion
CN106991312B (en) Internet anti-fraud authentication method based on voiceprint recognition
CN111583906A (en) Role recognition method, device and terminal for voice conversation
CN111508505A (en) Speaker identification method, device, equipment and storage medium
CN111477219A (en) Keyword distinguishing method and device, electronic equipment and readable storage medium
CN116153330B (en) Intelligent telephone voice robot control method
Shivakumar et al. Simplified and supervised i-vector modeling for speaker age regression
CN112632248A (en) Question answering method, device, computer equipment and storage medium
CN111930885B (en) Text topic extraction method and device and computer equipment
Tong et al. Graph convolutional network based semi-supervised learning on multi-speaker meeting data
CN117235137B (en) Professional information query method and device based on vector database
CN110956981B (en) Speech emotion recognition method, device, equipment and storage medium
Imoto et al. Acoustic scene analysis from acoustic event sequence with intermittent missing event
Bohra et al. Language Identification using Stacked Convolutional Neural Network (SCNN)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant