
WO2019196305A1 - Electronic device, identity verification method, and storage medium - Google Patents

Electronic device, identity verification method, and storage medium Download PDF

Info

Publication number
WO2019196305A1
WO2019196305A1 (PCT/CN2018/102208)
Authority
WO
WIPO (PCT)
Prior art keywords
user
feature vector
voiceprint feature
voice
acoustic model
Application number
PCT/CN2018/102208
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
于夕畔
李瑾瑾
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2019196305A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/06: Decision making techniques; pattern matching strategies
    • G10L 17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L 17/16: Hidden Markov models [HMMs]
    • G10L 17/18: Artificial neural networks; connectionist approaches
    • G10L 17/22: Interactive procedures; man-machine interfaces
    • G10L 17/24: Interactive procedures in which the user is prompted to utter a password or a predefined phrase
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum

Definitions

  • The present application relates to the field of communications technologies, and in particular to an electronic device, an identity verification method, and a storage medium.
  • Currently, in the interactive voice response (IVR) scenario, schemes exist that combine IVR with voiceprint recognition to authenticate a customer, for example when a customer receives a credit card and uses the telephone to activate it or to change the password, the customer's identity needs to be verified. Because the two parties in remote voiceprint verification are not face to face, a fraudster may play a synthesized voice prepared in advance, so the customer's identity cannot be confirmed accurately and the security of such verification is low.
  • The purpose of the present application is to provide an electronic device, an identity verification method, and a storage medium that double-verify the user's identity and can accurately confirm who the user is.
  • To achieve the above object, the present application provides an electronic device including a memory and a processor connected to the memory, the memory storing a processing system operable on the processor. When executed by the processor, the processing system implements the following steps:
  • Acoustic model establishing step: in the IVR scenario, when a user handles a service, a random code of a first preset number of digits is broadcast for the user to read aloud, and after the reading, a preset type of acoustic model is established for the broadcast random code and for the speech the user read, respectively;
  • Forced overall alignment step: a forced overall alignment operation is performed on the acoustic model of the broadcast random code and the acoustic model of the speech the user read, and a predetermined algorithm is used to calculate the probability that the two aligned acoustic models are the same;
  • Identity verification step: if the probability that the two aligned acoustic models are the same is greater than a preset first threshold, the voiceprint feature vector of the speech the user read is extracted, the standard voiceprint feature vector pre-stored after the user successfully registered is acquired, and the distance between the two vectors is calculated to verify the user's identity.
  • To achieve the above object, the present application further provides an identity verification method that includes the same three steps: broadcasting a random code of a first preset number of digits for the user to read aloud and establishing a preset type of acoustic model for the broadcast random code and for the user's speech; force-aligning the two acoustic models and calculating the probability that they are the same; and, if that probability exceeds the preset first threshold, extracting the voiceprint feature vector of the user's speech and comparing it against the pre-stored standard voiceprint feature vector.
  • The present application also provides a computer-readable storage medium storing a processing system which, when executed by a processor, implements the steps of the identity verification method described above.
  • The beneficial effects of the present application are as follows: when identity recognition is performed in the IVR scenario, having the user read back a random code effectively prevents fraud with a synthesized voice prepared in advance, and combining the random code with voiceprint recognition realizes double verification of the user's identity, so the user's identity can be confirmed accurately and the security of identity verification in the IVR scenario is improved. In addition, performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read reduces the amount of computation and improves recognition efficiency.
  • FIG. 1 is a schematic diagram of an optional application environment of the embodiments of the present application.
  • FIG. 2 is a schematic flowchart of an embodiment of the identity verification method of the present application.
  • The application environment includes an electronic device 1 and a terminal device. The electronic device 1 can exchange data with the terminal device through a suitable technology such as a network or near-field communication. In this embodiment, the user logs in to the IVR system of the electronic device 1 through the terminal device to perform voiceprint registration and voiceprint recognition.
  • The terminal device includes, but is not limited to, any electronic product that can interact with a user through a keyboard, mouse, remote controller, touch panel, or voice control device, for example mobile devices such as personal computers, tablet computers, smart phones, personal digital assistants (PDAs), game consoles, Internet Protocol television (IPTV) devices, smart wearable devices, and navigation devices, or fixed terminals such as digital TVs, desktop computers, notebooks, and servers.
  • The electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions set or stored in advance. It may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
  • In this embodiment, the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to one another through a system bus, with the memory 11 storing a processing system operable on the processor 12. It should be noted that FIG. 1 shows only an electronic device 1 with components 11-13; not all illustrated components are required, and more or fewer components may be implemented instead.
  • The memory 11 includes internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1. The readable storage medium may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, or optical disk. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk; in other embodiments, it may instead be an external storage device, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card fitted to the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the various application software installed in the electronic device 1, for example the program code of the processing system in an embodiment of the present application. The memory 11 can also be used to temporarily store data that has been output or is to be output.
  • The processor 12 may in some embodiments be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 12 is typically used to control the overall operation of the electronic device 1, for example performing control and processing related to data exchange or communication with the terminal device. In this embodiment, the processor 12 is configured to run the program code or process the data stored in the memory 11, for example to run the processing system.
  • The network interface 13 may comprise a wireless network interface or a wired network interface and is typically used to establish communication connections between the electronic device 1 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices and to establish data transmission channels and communication connections between them.
  • The processing system is stored in the memory 11 and includes at least one computer-readable instruction stored there, which can be executed by the processor 12 to implement the methods of the embodiments of the present application; the at least one computer-readable instruction can also be divided into different logic modules according to the functions implemented by its parts.
  • In an embodiment, when executed by the processor 12, the processing system implements the following steps. Acoustic model establishing step: in the IVR scenario, when a user handles a service, a random code of a first preset number of digits is broadcast for the user to read aloud, and after the reading, a preset type of acoustic model is established for the broadcast random code and for the speech the user read, respectively.
  • In the IVR scenario, a user requesting to handle a service sends an identity code, for example an ID card number. After receiving the request, the system analyzes whether the requested service needs further identity verification and, from the identity code, whether the user has a registered voiceprint. If further verification is needed and a voiceprint is registered, a random code of the first preset number of digits (for example, 8 digits) is generated and broadcast in voice form using speech synthesis technology, and the user is guided to read it aloud (a sketch of this gating logic follows below).
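As an illustration of the gating and challenge-generation logic just described, here is a minimal sketch; the function and field names are hypothetical, and the 8-digit length is only the example value given in the text.

```python
import secrets

FIRST_PRESET_DIGITS = 8  # example value from the text

def start_verification(user, requires_verification, registered_voiceprints):
    """Hypothetical IVR entry point: decide whether the user must be
    challenged and, if so, generate the random code to broadcast via TTS."""
    if not requires_verification(user.service):
        return None  # the requested service needs no further verification
    if user.id_number not in registered_voiceprints:
        return None  # no enrolled voiceprint to compare against
    # secrets (rather than random) so the challenge is not predictable,
    # which matters for a security-sensitive read-back code.
    return ''.join(secrets.choice('0123456789')
                   for _ in range(FIRST_PRESET_DIGITS))
```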
  • After the user reads, a preset type of acoustic model is established for the speech of the broadcast random code and another for the speech the user read. In a preferred embodiment, the preset type of acoustic model is a deep neural network-hidden Markov model, i.e., a DNN-HMM acoustic model. In other embodiments it may also be another acoustic model, for example a hidden Markov acoustic model.
  • In a specific example, take the DNN-HMM acoustic model: the HMM describes the dynamic change of the speech signal, and each output node of the DNN estimates the posterior probability of a state of a continuous-density HMM, which yields the DNN-HMM model. The speech of the broadcast random code and the speech the user read are each a series of syllables, and the text to be recognized is a series of characters. When building the DNN-HMM acoustic models, the model for the broadcast random code's speech and the model for the user's speech are obtained through global character acoustic adaptive training based on a predetermined character speech library (see the conversion sketch below).
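In a hybrid DNN-HMM system of this kind, the per-frame posteriors P(state | frame) that the DNN outputs are conventionally converted into the scaled likelihoods P(frame | state) that the HMM needs, by dividing by the state priors (Bayes' rule, dropping the constant P(frame)). A minimal sketch of that standard conversion, assuming the posteriors and priors are already available:

```python
import numpy as np

def posteriors_to_log_likelihoods(dnn_posteriors, state_priors, eps=1e-10):
    """dnn_posteriors: (T, S) array of P(state | frame) from the DNN.
    state_priors: (S,) array of P(state). Returns (T, S) scaled
    log-likelihoods log P(frame | state) for HMM decoding/alignment."""
    return np.log(dnn_posteriors + eps) - np.log(state_priors + eps)
```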
  • Forced overall alignment step: a forced overall alignment operation is performed on the acoustic model of the broadcast random code and the acoustic model of the speech the user read, and a predetermined algorithm is used to calculate the probability that the two aligned acoustic models are the same. Compared with the traditional word-by-word comparison, the force alignment operation greatly reduces the amount of computation and helps improve the efficiency of identity recognition (a Viterbi-style sketch follows below).
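Force alignment constrains decoding to the single left-to-right state sequence spelled out by the known random code, so only stay-or-advance transitions need scoring rather than a full search, which is where the saving over word-by-word comparison comes from. A minimal Viterbi-style sketch, assuming per-frame emission log-likelihoods for the transcript's states have already been computed (for example with the conversion above) and treating transition probabilities as uniform:

```python
import numpy as np

def force_align(emission_loglik):
    """emission_loglik: (T, S) array of log P(frame t | state s) for the S
    states of the known transcript, in order. Returns the best state for
    each frame and the total path log-likelihood."""
    T, S = emission_loglik.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = emission_loglik[0, 0]          # must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            advance = score[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = s if stay >= advance else s - 1
            score[t, s] = max(stay, advance) + emission_loglik[t, s]
    path = [S - 1]                               # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1], score[T - 1, S - 1]
```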
  • In one embodiment the predetermined algorithm is a prior-posterior probability algorithm; in other embodiments it may be a similarity algorithm. For example, the similarity algorithm may compute the edit distance between the character sequences of the two aligned acoustic models: the smaller the edit distance, the greater the probability that the two aligned models are the same. The similarity algorithm may also be the longest-common-subsequence algorithm: the smaller the difference between the length of the longest common subsequence and the lengths of the character sequences of the two aligned models, the greater the probability that the two aligned models are the same (both are sketched below).
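Both similarity measures named above are standard dynamic-programming routines; a minimal sketch over the decoded character sequences:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; smaller means the sequences more likely match."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def lcs_length(a: str, b: str) -> int:
    """Longest common subsequence; the closer its length is to len(a)
    and len(b), the more likely the two aligned sequences match."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if ca == cb
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]
```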
  • Identity verification step: if the probability that the two aligned acoustic models are the same is greater than the preset first threshold, the voiceprint feature vector of the speech the user read is extracted, the standard voiceprint feature vector pre-stored after the user successfully registered is acquired, and the distance between the two vectors is calculated to verify the user's identity.
  • In this embodiment, if the probability that the two aligned acoustic models are the same is greater than the preset first threshold, for example 0.985, the characters the user read are considered consistent with the broadcast random code. Since what is broadcast is a random code, fraud with a synthesized voice prepared in advance is effectively prevented, improving the security of identity recognition.
  • In this embodiment, the step of extracting the voiceprint feature vector of the speech the user read includes: framing the speech, then pre-emphasizing and windowing the framed speech data and performing a Fourier transform on each window to obtain the corresponding spectrum; feeding the spectrum into a mel filter bank to output the mel spectrum; and performing cepstral analysis on the mel spectrum to obtain the mel-frequency cepstral coefficients (MFCCs), on which the voiceprint feature vector is based. Pre-emphasis is in effect high-pass filtering that removes low-frequency data so that the high-frequency characteristics of the speech are more prominent.
  • The cepstral analysis on the mel spectrum consists, for example, of taking the logarithm and an inverse transform; the inverse transform is generally implemented as a DCT (discrete cosine transform), and the 2nd through 13th coefficients after the DCT are taken as the MFCCs. The MFCCs of a frame are the voiceprint features of that frame of speech data, and the MFCCs of all frames are composed into a feature data matrix, which is the voiceprint feature vector of the speech the user read. Because the mel frequency scale is closer to the human auditory system than the linearly spaced frequency bands used in the normal cepstrum, composing the voiceprint feature vector from MFCCs can improve the accuracy of identity verification (the pipeline is sketched below).
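A minimal sketch of exactly that pipeline (framing, pre-emphasis, windowing, FFT, mel filter bank, log, DCT, coefficients 2-13); the sampling rate, frame sizes, and filter-bank size are illustrative, since the text does not fix them, and the signal is assumed to be at least one frame long:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_voiceprint(signal, sr=8000, frame_len=0.025, frame_step=0.01,
                    n_filt=26, nfft=512):
    """Return the per-frame MFCC feature matrix of a speech signal,
    keeping the 2nd-13th DCT coefficients as described in the text."""
    # Pre-emphasis: first-order high-pass filter boosting high frequencies.
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and Hamming windowing.
    flen, fstep = int(sr * frame_len), int(sr * frame_step)
    n_frames = 1 + (len(x) - flen) // fstep
    frames = np.stack([x[i * fstep:i * fstep + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)
    # Per-frame power spectrum via FFT.
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Triangular mel filter bank, evenly spaced on the mel scale.
    hi_mel = 2595 * np.log10(1 + (sr / 2) / 700)
    hz = 700 * (10 ** (np.linspace(0, hi_mel, n_filt + 2) / 2595) - 1)
    bins = np.floor((nfft + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_spec = np.maximum(power @ fbank.T, 1e-10)
    # Cepstral analysis: log, then DCT; keep coefficients 2-13.
    return dct(np.log(mel_spec), type=2, axis=1, norm='ortho')[:, 1:13]
```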
  • In this embodiment, the distance calculated between the voiceprint feature vector of the speech the user read and the standard voiceprint feature vector is the cosine distance of the two. With A denoting the standard voiceprint feature vector and B the voiceprint feature vector of the speech the user read, the cosine similarity is cos(A, B) = (A · B) / (|A| |B|), and the cosine distance is derived from it.
  • If the cosine distance is less than or equal to a preset distance threshold, the identity verification passes; if the cosine distance is greater than the preset distance threshold, the identity verification fails.
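A minimal sketch of the decision, assuming each feature matrix has first been collapsed to one fixed-length vector (the frame-wise mean here, an assumption the text leaves open) and using an illustrative distance threshold, since the patent only says "preset":

```python
import numpy as np

def to_vector(mfcc_matrix):
    # Collapse the (frames x coefficients) matrix to one fixed-length
    # vector; the frame-wise mean is one common, simple choice.
    return np.asarray(mfcc_matrix).mean(axis=0)

def cosine_distance(standard_vec, probe_vec):
    a, b = np.ravel(standard_vec), np.ravel(probe_vec)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def verify(standard_vec, probe_vec, dist_threshold=0.2):  # illustrative value
    # Pass only when the distance does not exceed the preset threshold.
    return cosine_distance(standard_vec, probe_vec) <= dist_threshold
```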
  • In this embodiment, the standard voiceprint feature vector pre-stored after the user successfully registers is obtained as follows: a random code of a second preset number of digits is broadcast for the user to read aloud a preset number of times, and after each reading a preset type of acoustic model is established for the broadcast random code and for the user's speech, the two models are force-aligned, and a predetermined algorithm calculates the probability that they are the same;
  • If, for every reading, the probability that the two aligned acoustic models are the same is greater than a preset second threshold, the voiceprint feature vector of each reading is extracted, the distances between pairs of voiceprint feature vectors are calculated, and it is analyzed whether every reading came from the same user;
  • If not, the user is prompted to re-record and the voiceprint registration step is performed again.
  • Specifically, during registration the user sends an identity code, such as an ID card number; a random code of the second preset number of digits (for example, 8 digits) is generated and broadcast in voice form using speech synthesis technology, and the user is guided to read it a preset number of times (for example, 3 times).
  • After each reading, a preset type of acoustic model is established for the speech of each broadcast random code and for the speech the user read each time. In a preferred embodiment the preset type is the DNN-HMM acoustic model; in other embodiments it may also be another acoustic model, such as a hidden Markov acoustic model. For specific examples, reference may be made to the foregoing embodiment; details are not repeated here. Taking the DNN-HMM acoustic model as an example, the HMM describes the dynamic change of the speech signal, each DNN output node estimates the posterior probability of a state of a continuous-density HMM, and the DNN-HMM acoustic models of the broadcast random code's speech and of the user's speech are obtained through global character acoustic adaptive training. Performing a forced overall alignment operation on the acoustic model of each broadcast random code and the acoustic model of the corresponding reading, compared with the traditional word-by-word comparison, greatly reduces the amount of computation and helps improve the efficiency of identity recognition.
  • The predetermined algorithm may be a prior-posterior probability algorithm in one embodiment and a similarity algorithm in other embodiments, as described above.
  • If, for each reading, the probability that the two aligned acoustic models are the same is greater than the preset second threshold, for example 0.985, the characters the user read are considered consistent with the broadcast random code. Since what is broadcast is a random code, fraud with a synthesized voice prepared in advance is effectively prevented, improving the security of identity recognition.
  • The step of extracting the voiceprint feature vector of each reading is substantially the same as the voiceprint feature extraction described in the foregoing embodiment, and the step of calculating the distance between two voiceprint feature vectors is substantially the same as the cosine-distance calculation described above; details are not repeated here.
  • If each cosine distance is less than or equal to the preset distance threshold, every reading is judged to come from the same user and the voiceprint feature vector is stored as the user's standard voiceprint feature vector; if a cosine distance is greater than the preset distance threshold, the readings are judged not to come from the same user and the user is prompted to re-register (the enrollment decision is sketched below).
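Putting the registration checks together, a sketch of the enrollment decision that reuses the helpers above; the pairwise comparison, the mean-vector storage, and the threshold value are assumptions for illustration, since the text only says the voiceprint feature vector is stored:

```python
from itertools import combinations
import numpy as np

def enroll(voiceprint_vectors, dist_threshold=0.2):
    """voiceprint_vectors: one fixed-length vector per accepted reading
    (e.g., 3). Returns the standard voiceprint feature vector to store,
    or None if the readings do not all appear to be the same speaker."""
    for a, b in combinations(voiceprint_vectors, 2):
        if cosine_distance(a, b) > dist_threshold:
            return None  # not the same user: prompt re-registration
    # Store the mean of the readings as the standard vector (assumption).
    return np.mean(voiceprint_vectors, axis=0)
```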
  • In summary, when the present application performs identity recognition in the IVR scenario, having the user read back a random code effectively prevents fraud with a synthesized voice prepared in advance, and combining the random code with voiceprint recognition realizes double verification of the user's identity, so the user's identity can be confirmed accurately and the security of identity verification in the IVR scenario is improved. In addition, the forced overall alignment of the acoustic model of the broadcast random code with the acoustic model of the speech the user read reduces the amount of computation and improves recognition efficiency.
  • FIG. 2 is a schematic flowchart of an embodiment of the identity verification method of the present application. The identity verification method includes the following steps:
  • Step S1: in the IVR scenario, when a user handles a service, a random code of a first preset number of digits is broadcast for the user to read aloud, and after the reading, a preset type of acoustic model is established for the broadcast random code and for the speech the user read, respectively.
  • The specific implementation of this step is the same as that of the acoustic model establishing step described in the electronic device embodiment above and is not repeated here.
  • Step S2: a forced overall alignment operation is performed on the acoustic model of the broadcast random code and the acoustic model of the speech the user read, and a predetermined algorithm is used to calculate the probability that the two aligned acoustic models are the same.
  • The specific implementation of this step, including the choice of prior-posterior probability, edit-distance, or longest-common-subsequence algorithms, is the same as described above and is not repeated here.
  • Step S3: if the probability that the two aligned acoustic models are the same is greater than the preset first threshold, the voiceprint feature vector of the speech the user read is extracted, the standard voiceprint feature vector pre-stored after the user successfully registered is acquired, and the distance between the two vectors is calculated to verify the user's identity.
  • The specific implementation of this step, including the first threshold of 0.985, the MFCC-based extraction of the voiceprint feature vector, the cosine-distance calculation and decision, and the voiceprint registration procedure that pre-stores the standard voiceprint feature vector, is the same as described in the electronic device embodiment above and is not repeated here.
  • The present application also provides a computer-readable storage medium storing a processing system which, when executed by a processor, implements the steps of the identity verification method described above.
  • Through the description of the foregoing embodiments, it is clear that the above method embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) that includes a number of instructions for causing a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to perform the methods described in the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Collating Specific Patterns (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present application relates to an electronic device, an identity verification method, and a storage medium. The method comprises: when a user performs transactions in an IVR scenario, broadcasting a random code of a first preset number of digits for the user to read, and upon reading, establishing acoustic models of preset types for the broadcasted random code and the voice read out by the user, respectively; performing a forced overall alignment operation on the acoustic model of the broadcasted random code and the acoustic model of the voice read out by the user, and using a predetermined algorithm to calculate the probability that the two aligned acoustic models are the same; if the probability is greater than a preset first threshold, extracting a voiceprint feature vector of the voice read out by the user, acquiring a standard voiceprint feature vector of the user pre-stored after successful registration, and calculating the distance between the voiceprint feature vector of the voice read out by the user and the standard voiceprint feature vector, so as to perform identity verification on the user. The present application performs dual verification of the user's identity and can accurately determine who the user is.

Description

Electronic device, identity verification method, and storage medium
Priority Claim
This application claims priority under the Paris Convention to Chinese Patent Application No. CN2018103117212, filed on April 9, 2018 and entitled "Electronic device, identity verification method, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of communications technologies, and in particular to an electronic device, an identity verification method, and a storage medium.
Background
Currently, in the interactive voice response (IVR) scenario, schemes exist that combine IVR with voiceprint recognition to authenticate a customer, for example when a customer receives a credit card and uses the telephone to activate it or to change the password, the customer's identity needs to be verified. In the prior art, because the two parties in remote voiceprint verification do not verify face to face, a customer may commit fraud using a synthesized voice prepared in advance, the customer's identity cannot be confirmed accurately, and the security of identity verification is low.
Summary of the Invention
The purpose of the present application is to provide an electronic device, an identity verification method, and a storage medium that double-verify the user's identity and can accurately confirm who the user is.
To achieve the above object, the present application provides an electronic device including a memory and a processor connected to the memory, the memory storing a processing system operable on the processor. When executed by the processor, the processing system implements the following steps:
Acoustic model establishing step: in the IVR scenario, when a user handles a service, a random code of a first preset number of digits is broadcast for the user to read aloud, and after the reading, a preset type of acoustic model is established for the broadcast random code and for the speech the user read, respectively;
Forced overall alignment step: a forced overall alignment operation is performed on the acoustic model of the broadcast random code and the acoustic model of the speech the user read, and a predetermined algorithm is used to calculate the probability that the two aligned acoustic models are the same;
Identity verification step: if the probability that the two aligned acoustic models are the same is greater than a preset first threshold, the voiceprint feature vector of the speech the user read is extracted, the standard voiceprint feature vector pre-stored after the user successfully registered is acquired, and the distance between the two vectors is calculated to verify the user's identity.
To achieve the above object, the present application further provides an identity verification method, which includes:
S1: in the IVR scenario, when a user handles a service, broadcasting a random code of a first preset number of digits for the user to read aloud, and after the reading, establishing a preset type of acoustic model for the broadcast random code and for the speech the user read, respectively;
S2: performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read, and using a predetermined algorithm to calculate the probability that the two aligned acoustic models are the same;
S3: if the probability that the two aligned acoustic models are the same is greater than the preset first threshold, extracting the voiceprint feature vector of the speech the user read, acquiring the standard voiceprint feature vector pre-stored after the user successfully registered, and calculating the distance between the two vectors to verify the user's identity.
The present application also provides a computer-readable storage medium storing a processing system which, when executed by a processor, implements the steps of the identity verification method described above.
The beneficial effects of the present application are as follows: when identity recognition is performed in the IVR scenario, having the user read back a random code effectively prevents fraud with a synthesized voice prepared in advance, and combining the random code with voiceprint recognition realizes double verification of the user's identity, so the user's identity can be confirmed accurately and the security of identity verification in the IVR scenario is improved. In addition, performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read reduces the amount of computation and improves recognition efficiency.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an optional application environment of the embodiments of the present application;
FIG. 2 is a schematic flowchart of an embodiment of the identity verification method of the present application.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely explain the application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
It should be noted that descriptions involving "first", "second", and the like in the present application are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, but only on the basis that a person of ordinary skill in the art can realize the combination; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that the combination does not exist and is not within the scope of protection claimed by this application.
Referring to FIG. 1, which is a schematic diagram of the application environment of a preferred embodiment of the identity verification method of the present application, the application environment includes an electronic device 1 and a terminal device. The electronic device 1 can exchange data with the terminal device through a suitable technology such as a network or near-field communication. In this embodiment, the user logs in to the IVR system of the electronic device 1 through the terminal device to perform voiceprint registration and voiceprint recognition.
The terminal device includes, but is not limited to, any electronic product that can interact with a user through a keyboard, mouse, remote controller, touch panel, or voice control device, for example mobile devices such as personal computers, tablet computers, smart phones, personal digital assistants (PDAs), game consoles, Internet Protocol television (IPTV) devices, smart wearable devices, and navigation devices, or fixed terminals such as digital TVs, desktop computers, notebooks, and servers.
The electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions set or stored in advance. It may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
In this embodiment, the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to one another through a system bus, with the memory 11 storing a processing system operable on the processor 12. It should be noted that FIG. 1 shows only an electronic device 1 with components 11-13; not all illustrated components are required, and more or fewer components may be implemented instead.
The memory 11 includes internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1. The readable storage medium may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, or optical disk. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk; in other embodiments, it may instead be an external storage device, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card fitted to the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the various application software installed in the electronic device 1, for example the program code of the processing system in an embodiment of the present application. The memory 11 can also be used to temporarily store data that has been output or is to be output.
The processor 12 may in some embodiments be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 12 is typically used to control the overall operation of the electronic device 1, for example performing control and processing related to data exchange or communication with the terminal device. In this embodiment, the processor 12 is configured to run the program code or process the data stored in the memory 11, for example to run the processing system.
The network interface 13 may comprise a wireless network interface or a wired network interface and is typically used to establish communication connections between the electronic device 1 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices and to establish data transmission channels and communication connections between them.
The processing system is stored in the memory 11 and includes at least one computer-readable instruction stored there, which can be executed by the processor 12 to implement the methods of the embodiments of the present application; the at least one computer-readable instruction can also be divided into different logic modules according to the functions implemented by its parts.
在一实施例中,上述处理系统被所述处理器12执行时实现如下步骤:In an embodiment, when the processing system is executed by the processor 12, the following steps are implemented:
声学模型建立步骤,在互动式语音应答IVR场景下用户办理业务时,播报第一预设位数的随机码供该用户跟读,并在跟读后分别为本次播报的随机 码及该用户本次跟读的语音建立预设类型的声学模型;The acoustic model establishing step, in the interactive voice response IVR scenario, when the user handles the service, the random number of the first preset digit is broadcasted for the user to follow, and after the follow-up, the random code of the broadcast and the user respectively The acoustic model of the preset type is established this time;
在互动式语音应答IVR场景下,用户请求办理业务时发送身份识别码,例如身份证号,在接收到用户的请求后,分析用户所办理的业务是否需要进一步的身份验证,且根据用户的身份识别码分析该用户是否已注册有声纹,若需要进一步的身份验证且该用户已注册有声纹,则生成第一预设位数的随机码并采用语音合成技术以语音形式播报该随机码,引导用户进行跟读,该第一预设位数例如为8位。In the interactive voice response IVR scenario, when the user requests to handle the service, the identity code is sent, for example, the ID number. After receiving the user's request, the user analyzes whether the service handled by the user needs further identity verification, and according to the identity of the user. The identification code analyzes whether the user has registered voiceprints. If further authentication is required and the user has registered voiceprints, a random code of the first preset number of bits is generated and the random code is broadcasted by voice synthesis technology to guide the random code. The user performs a follow-up, and the first preset number of bits is, for example, 8 bits.
After the user reads back, an acoustic model of the preset type is established for the speech of the broadcast random code and for the speech the user read back this time. In a preferred embodiment, the preset type of acoustic model is a deep neural network-hidden Markov model, i.e., a DNN-HMM acoustic model. In other embodiments, the preset type may also be another acoustic model, for example a hidden Markov acoustic model.
In a specific example, take the DNN-HMM acoustic model: the HMM describes the dynamic changes of the speech signal, while each output node of the DNN estimates the posterior probability of a state of the continuous-density HMM, yielding the DNN-HMM model. Both the speech of the broadcast random code and the speech the user read back this time are sequences of syllables, and the text to be recognized is a sequence of characters. In this embodiment, when the DNN-HMM acoustic model is established, the DNN-HMM acoustic model of the broadcast random code's speech and that of the user's read-back speech are obtained by global character acoustic adaptive training based on a predetermined character speech library.
Forced overall alignment step: a forced overall alignment operation is performed on the acoustic model of the broadcast random code and the acoustic model of the speech the user read back this time, and the probability that the two aligned acoustic models are identical is computed with a predetermined algorithm.
Performing a forced overall alignment (force alignment) on the acoustic model of the broadcast random code and the acoustic model of the user's read-back speech, rather than the traditional word-by-word comparison, greatly reduces the amount of computation and helps improve the efficiency of identity recognition.
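To make the force-alignment idea concrete, here is a hedged toy sketch in Python: given frame-level log posteriors from a DNN and the left-to-right HMM state sequence implied by the broadcast code, a single dynamic program scores the best monotonic alignment over the whole utterance instead of comparing word by word. The array shapes and the averaging of the path score are illustrative assumptions, not the exact procedure of this application.

```python
import numpy as np

def forced_alignment_score(log_post: np.ndarray, state_seq: list[int]) -> float:
    """Score the best monotonic alignment of frames to a left-to-right
    HMM state sequence.
    log_post: (n_frames, n_states) frame-level log posteriors from the DNN.
    Returns the average per-frame log posterior along the best path."""
    n_frames, n_seq = log_post.shape[0], len(state_seq)
    neg = -np.inf
    dp = np.full((n_frames, n_seq), neg)
    dp[0, 0] = log_post[0, state_seq[0]]
    for t in range(1, n_frames):
        for s in range(n_seq):
            stay = dp[t - 1, s]                       # remain in state s
            advance = dp[t - 1, s - 1] if s > 0 else neg  # move from s-1
            dp[t, s] = max(stay, advance) + log_post[t, state_seq[s]]
    return float(dp[n_frames - 1, n_seq - 1]) / n_frames
```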
In one embodiment, the predetermined algorithm is a prior-posterior probability algorithm. In other embodiments it may also be a similarity algorithm. For example, the similarity algorithm may compute the edit distance between the characters of the two aligned acoustic models: the smaller the edit distance, the higher the probability that the two aligned models are identical. The similarity algorithm may also be a longest-common-subsequence algorithm: the smaller the difference between the length of the longest common subsequence and the lengths of the character sequences of the two aligned models, the higher the probability that the two models are identical.
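As a sketch of the two similarity measures just mentioned, the following Python functions compute the edit distance and the longest-common-subsequence length between two decoded character strings; mapping the raw edit distance to a normalized [0, 1] score is an assumption for illustration, not a formula given in this application.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two character strings."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def similarity_score(broadcast: str, read_back: str) -> float:
    """Illustrative mapping of edit distance to a [0, 1] score:
    the smaller the distance, the closer the score is to 1."""
    longest = max(len(broadcast), len(read_back), 1)
    return 1.0 - edit_distance(broadcast, read_back) / longest
```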
Identity verification step: if the probability that the two aligned acoustic models are identical is greater than a preset first threshold, the voiceprint feature vector of the speech the user read back this time is extracted, the standard voiceprint feature vector pre-stored after the user's successful registration is retrieved, and the distance between the voiceprint feature vector of the read-back speech and the standard voiceprint feature vector is computed to verify the user's identity.
In this embodiment, if the probability that the two aligned acoustic models are identical exceeds the preset first threshold, for example 0.985, the characters the user read back this time are deemed consistent with the broadcast random code. Because a random code is broadcast, fraud using synthesized speech prepared in advance by a user is effectively prevented, improving the security of identity recognition.
In an embodiment, the step of extracting the voiceprint feature vector of the speech the user read back this time comprises: performing pre-emphasis and windowing on the speech, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum; then performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the read-back speech from the MFCCs.
Specifically, the speech the user read back this time is first divided into frames, and the framed speech data is then pre-emphasized. Pre-emphasis is in effect high-pass filtering that attenuates the low-frequency content so that the high-frequency characteristics of the speech data stand out. The transfer function of the high-pass filter is H(z) = 1 - αz⁻¹, where z is the z-domain variable and α is a constant coefficient, preferably 0.97. Because framing causes the signal to deviate from the original speech to some extent, the speech data also needs to be windowed.
In this embodiment, the cepstral analysis performed on the Mel spectrum consists of taking the logarithm and applying an inverse transform; the inverse transform is generally implemented with a discrete cosine transform (DCT), and the 2nd through 13th DCT coefficients are taken as the MFCCs. The MFCCs are the voiceprint features of one frame of speech data; the MFCCs of all frames are assembled into a feature data matrix, which is the voiceprint feature vector of the speech the user read back this time.
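A minimal end-to-end sketch of this feature pipeline follows, in Python with NumPy, assuming a mono 16 kHz signal that is at least one frame long; the frame length, hop size, window choice, and number of Mel filters are illustrative assumptions, and a production system would typically rely on a tested library such as librosa or python_speech_features.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_features(signal, sample_rate=16000, frame_len=400, hop=160,
                  n_fft=512, n_mels=26, alpha=0.97):
    """Compute an (n_frames, 12) MFCC matrix: DCT coefficients 2..13."""
    signal = np.asarray(signal, dtype=np.float64)

    # Pre-emphasis: y[t] = x[t] - alpha * x[t-1]  (H(z) = 1 - alpha z^-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing and Hamming windowing
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])

    # FFT magnitude -> power spectrum
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft

    # Triangular Mel filter bank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    mel_spec = np.dot(power, fbank.T)
    log_mel = np.log(np.maximum(mel_spec, 1e-10))  # take the logarithm

    # Inverse transform via DCT-II; keep coefficients 2..13
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, 1:13]
```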
In this embodiment, the MFCCs of the speech data are used to compose the corresponding voiceprint feature vector. Because the Mel scale approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal cepstrum, this improves the accuracy of identity verification.
In an embodiment, the distance computed between the voiceprint feature vector of the speech the user read back this time and the standard voiceprint feature vector is the cosine distance between the two:
dist(x, y) = 1 - (x · y) / (‖x‖ ‖y‖)
where x is the standard voiceprint feature vector and y is the voiceprint feature vector of the speech the user read back this time.
If the cosine distance is less than or equal to a preset distance threshold, identity verification passes; if the cosine distance is greater than the preset distance threshold, identity verification fails.
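The decision rule above can be sketched in a few lines of Python with NumPy. Collapsing the per-frame MFCC matrix to one fixed-length vector by averaging over frames, and using 0.3 as the distance threshold, are illustrative assumptions, since the application fixes neither detail.

```python
import numpy as np

def utterance_vector(mfcc_matrix: np.ndarray) -> np.ndarray:
    """Collapse an (n_frames, 12) MFCC matrix to a single fixed-length
    vector by averaging over frames (an illustrative pooling choice)."""
    return mfcc_matrix.mean(axis=0)

def cosine_distance(x: np.ndarray, y: np.ndarray) -> float:
    """1 - cosine similarity; 0 means the vectors point the same way."""
    x, y = x.ravel(), y.ravel()
    return 1.0 - float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def verify(standard_vec: np.ndarray, current_vec: np.ndarray,
           threshold: float = 0.3) -> bool:
    """Pass only if the cosine distance does not exceed the threshold."""
    return cosine_distance(standard_vec, current_vec) <= threshold
```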
In an embodiment, the standard voiceprint feature vector is pre-stored after the user registers successfully. The voiceprint registration step comprises:
when the user performs voiceprint registration in the IVR scenario, broadcasting a random code with a second preset number of digits for the user to read back a preset number of times, and after each read-back, establishing acoustic models of the preset type for the broadcast random code and for the speech the user read back;
performing a forced overall alignment operation on the acoustic model of each broadcast random code and the acoustic model of the corresponding read-back speech, and computing with the predetermined algorithm the probability that the two aligned acoustic models are identical;
if the probabilities for all aligned model pairs are greater than a preset second threshold, extracting the voiceprint feature vector of each read-back speech and computing the pairwise distances between the voiceprint feature vectors, to analyze whether every read-back came from the same user;
if so, storing the voiceprint feature vector as the user's standard voiceprint feature vector;
if not, prompting the user to record again and repeating the voiceprint registration step.
In the IVR scenario, the user sends an identification code, for example an ID card number, when requesting registration. After the user's request is received, a random code with the second preset number of digits (for example, 8 digits) is generated and broadcast in speech form using speech synthesis, guiding the user to read it back a preset number of times (for example, 3 times).
After each read-back, an acoustic model of the preset type is established for the speech of the broadcast random code and for the user's read-back speech. In a preferred embodiment, the preset type of acoustic model is the deep neural network-hidden Markov model, i.e., the DNN-HMM acoustic model; in other embodiments it may be another acoustic model, for example a hidden Markov acoustic model. For specific examples, refer to the foregoing embodiments, which are not repeated here.
In a specific example, take the DNN-HMM acoustic model: the HMM describes the dynamic changes of the speech signal, while each output node of the DNN estimates the posterior probability of a state of the continuous-density HMM, yielding the DNN-HMM model. The speech of each broadcast random code and the user's read-back speech are sequences of syllables, and the text to be recognized is a sequence of characters. When the DNN-HMM acoustic model is established, the DNN-HMM acoustic models of the broadcast random code's speech and of the user's read-back speech are obtained by global character acoustic adaptive training based on a predetermined character speech library.
Performing a forced overall alignment (force alignment) on the acoustic model of each broadcast random code and the acoustic model of the user's read-back speech, rather than the traditional word-by-word comparison, greatly reduces the amount of computation and helps improve the efficiency of identity recognition.
In one embodiment, the predetermined algorithm is a prior-posterior probability algorithm; in other embodiments it may be a similarity algorithm. For specific examples, refer to the foregoing embodiments, which are not repeated here.
In this embodiment, if the probabilities that the aligned acoustic model pairs are identical all exceed the preset second threshold, for example 0.985, the characters the user read back each time are deemed consistent with the broadcast random code. Because a random code is broadcast, fraud using synthesized speech prepared in advance by a user is effectively prevented, improving the security of identity recognition.
In an embodiment, the step of extracting the voiceprint feature vector of each read-back speech is substantially the same as the voiceprint feature vector extraction method of the foregoing embodiment and is not repeated here.
In an embodiment, the step of computing the pairwise distances between the voiceprint feature vectors is substantially the same as the cosine distance computation described above and is not repeated here.
If each cosine distance is less than or equal to the preset distance threshold, every read-back came from the same user, and the voiceprint feature vector is stored as the user's standard voiceprint feature vector; if a cosine distance is greater than the preset distance threshold, the read-backs did not all come from the same user, and the user is prompted to register again.
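A minimal sketch of this registration consistency check follows, reusing the `cosine_distance` helper from the earlier sketch; averaging the enrollment vectors into the stored template is an illustrative choice, as the application does not state how the stored vector is derived from the multiple read-backs.

```python
from itertools import combinations
import numpy as np

def enroll(vectors: list[np.ndarray], threshold: float = 0.3):
    """Return the standard voiceprint vector if all pairwise cosine
    distances stay within the threshold, else None (re-register)."""
    for a, b in combinations(vectors, 2):
        if cosine_distance(a, b) > threshold:
            return None  # read-backs inconsistent: prompt re-registration
    # Illustrative pooling: average the enrollment vectors into one template.
    return np.mean(vectors, axis=0)
```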
Compared with the prior art, when the present application performs identity recognition in the IVR scenario, having the user read back a random code effectively prevents fraud using pre-prepared synthesized speech, and combining the random code with voiceprint recognition achieves double verification of the user's identity, so that the user's identity can be confirmed accurately and the security of identity verification in the IVR scenario is improved. In addition, performing the forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the user's read-back speech reduces the amount of computation and improves the efficiency of identity recognition.
As shown in FIG. 2, FIG. 2 is a schematic flowchart of an embodiment of the identity verification method of the present application. The identity verification method comprises the following steps:
Step S1: when a user transacts business in the interactive voice response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read back, and after the read-back, establishing an acoustic model of a preset type for the broadcast random code and another for the speech the user read back this time.
In the IVR scenario, the user sends an identification code, for example an ID card number, when requesting to transact business. After the user's request is received, the system analyzes whether the requested business requires further identity verification and, based on the user's identification code, whether the user has already registered a voiceprint. If further verification is required and the user has a registered voiceprint, a random code with the first preset number of digits (for example, 8 digits) is generated and broadcast in speech form using speech synthesis, guiding the user to read it back.
After the user reads back, an acoustic model of the preset type is established for the speech of the broadcast random code and for the speech the user read back this time. In a preferred embodiment, the preset type of acoustic model is the deep neural network-hidden Markov model, i.e., the DNN-HMM acoustic model; in other embodiments it may be another acoustic model, for example a hidden Markov acoustic model.
In a specific example, take the DNN-HMM acoustic model: the HMM describes the dynamic changes of the speech signal, while each output node of the DNN estimates the posterior probability of a state of the continuous-density HMM, yielding the DNN-HMM model. Both the speech of the broadcast random code and the speech the user read back this time are sequences of syllables, and the text to be recognized is a sequence of characters. When the DNN-HMM acoustic model is established, the DNN-HMM acoustic models of the broadcast random code's speech and of the user's read-back speech are obtained by global character acoustic adaptive training based on a predetermined character speech library.
Step S2: performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read back this time, and computing with a predetermined algorithm the probability that the two aligned acoustic models are identical.
Performing a forced overall alignment (force alignment) on the two acoustic models, rather than the traditional word-by-word comparison, greatly reduces the amount of computation and helps improve the efficiency of identity recognition.
In one embodiment, the predetermined algorithm is a prior-posterior probability algorithm. In other embodiments it may also be a similarity algorithm. For example, the similarity algorithm may compute the edit distance between the characters of the two aligned acoustic models: the smaller the edit distance, the higher the probability that the two aligned models are identical. The similarity algorithm may also be a longest-common-subsequence algorithm: the smaller the difference between the length of the longest common subsequence and the lengths of the character sequences of the two aligned models, the higher the probability that the two models are identical.
Step S3: if the probability that the two aligned acoustic models are identical is greater than a preset first threshold, extracting the voiceprint feature vector of the speech the user read back this time, retrieving the standard voiceprint feature vector pre-stored after the user's successful registration, and computing the distance between the two vectors to verify the user's identity.
In this embodiment, if the probability that the two aligned acoustic models are identical exceeds the preset first threshold, for example 0.985, the characters the user read back this time are deemed consistent with the broadcast random code. Because a random code is broadcast, fraud using synthesized speech prepared in advance by a user is effectively prevented, improving the security of identity recognition.
In an embodiment, the step of extracting the voiceprint feature vector of the speech the user read back this time comprises: performing pre-emphasis and windowing on the speech, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum; then performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the read-back speech from the MFCCs.
Specifically, the speech the user read back this time is first divided into frames, and the framed speech data is then pre-emphasized. Pre-emphasis is in effect high-pass filtering that attenuates the low-frequency content so that the high-frequency characteristics of the speech data stand out. The transfer function of the high-pass filter is H(z) = 1 - αz⁻¹, where z is the z-domain variable and α is a constant coefficient, preferably 0.97. Because framing causes the signal to deviate from the original speech to some extent, the speech data also needs to be windowed.
In this embodiment, the cepstral analysis performed on the Mel spectrum consists of taking the logarithm and applying an inverse transform; the inverse transform is generally implemented with a discrete cosine transform (DCT), and the 2nd through 13th DCT coefficients are taken as the MFCCs. The MFCCs are the voiceprint features of one frame of speech data; the MFCCs of all frames are assembled into a feature data matrix, which is the voiceprint feature vector of the speech the user read back this time.
In this embodiment, the MFCCs of the speech data are used to compose the corresponding voiceprint feature vector. Because the Mel scale approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal cepstrum, this improves the accuracy of identity verification.
In an embodiment, the distance computed between the voiceprint feature vector of the speech the user read back this time and the standard voiceprint feature vector is the cosine distance between the two:
dist(x, y) = 1 - (x · y) / (‖x‖ ‖y‖)
where x is the standard voiceprint feature vector and y is the voiceprint feature vector of the speech the user read back this time.
If the cosine distance is less than or equal to a preset distance threshold, identity verification passes; if the cosine distance is greater than the preset distance threshold, identity verification fails.
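Putting steps S1 through S3 together, the following hedged sketch shows one possible control flow for the verification pass. The `record_read_back`, `align_and_score`, and `load_standard_voiceprint` helpers are hypothetical names standing in for the IVR audio capture, the DNN-HMM forced alignment, and the voiceprint store described above, not interfaces defined by this application; `prompt_user`, `mfcc_features`, `utterance_vector`, and `verify` come from the earlier sketches.

```python
def authenticate(user_id: str, first_threshold: float = 0.985,
                 distance_threshold: float = 0.3) -> bool:
    """One possible S1-S3 control flow for IVR voiceprint verification."""
    # S1: broadcast a random code and capture the user's read-back.
    code = prompt_user(num_digits=8)
    audio = record_read_back()                # hypothetical IVR capture

    # S2: force-align the two acoustic models and score their agreement.
    same_prob = align_and_score(code, audio)  # hypothetical DNN-HMM step
    if same_prob <= first_threshold:
        return False  # read-back does not match the broadcast code

    # S3: compare voiceprints against the enrolled template.
    current = utterance_vector(mfcc_features(audio))
    standard = load_standard_voiceprint(user_id)  # hypothetical store lookup
    return verify(standard, current, threshold=distance_threshold)
```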
In an embodiment, the standard voiceprint feature vector is pre-stored after the user registers successfully. The voiceprint registration step comprises:
when the user performs voiceprint registration in the IVR scenario, broadcasting a random code with a second preset number of digits for the user to read back a preset number of times, and after each read-back, establishing acoustic models of the preset type for the broadcast random code and for the speech the user read back;
performing a forced overall alignment operation on the acoustic model of each broadcast random code and the acoustic model of the corresponding read-back speech, and computing with the predetermined algorithm the probability that the two aligned acoustic models are identical;
if the probabilities for all aligned model pairs are greater than a preset second threshold, extracting the voiceprint feature vector of each read-back speech and computing the pairwise distances between the voiceprint feature vectors, to analyze whether every read-back came from the same user;
if so, storing the voiceprint feature vector as the user's standard voiceprint feature vector;
if not, prompting the user to record again and repeating the voiceprint registration step.
In the IVR scenario, the user sends an identification code, for example an ID card number, when requesting registration. After the user's request is received, a random code with the second preset number of digits (for example, 8 digits) is generated and broadcast in speech form using speech synthesis, guiding the user to read it back a preset number of times (for example, 3 times).
After each read-back, an acoustic model of the preset type is established for the speech of the broadcast random code and for the user's read-back speech. In a preferred embodiment, the preset type of acoustic model is the deep neural network-hidden Markov model, i.e., the DNN-HMM acoustic model; in other embodiments it may be another acoustic model, for example a hidden Markov acoustic model. For specific examples, refer to the foregoing embodiments, which are not repeated here.
In a specific example, take the DNN-HMM acoustic model: the HMM describes the dynamic changes of the speech signal, while each output node of the DNN estimates the posterior probability of a state of the continuous-density HMM, yielding the DNN-HMM model. The speech of each broadcast random code and the user's read-back speech are sequences of syllables, and the text to be recognized is a sequence of characters. When the DNN-HMM acoustic model is established, the DNN-HMM acoustic models of the broadcast random code's speech and of the user's read-back speech are obtained by global character acoustic adaptive training based on a predetermined character speech library.
Performing a forced overall alignment (force alignment) on the acoustic model of each broadcast random code and the acoustic model of the user's read-back speech, rather than the traditional word-by-word comparison, greatly reduces the amount of computation and helps improve the efficiency of identity recognition.
In one embodiment, the predetermined algorithm is a prior-posterior probability algorithm; in other embodiments it may be a similarity algorithm. For specific examples, refer to the foregoing embodiments, which are not repeated here.
In this embodiment, if the probabilities that the aligned acoustic model pairs are identical all exceed the preset second threshold, for example 0.985, the characters the user read back each time are deemed consistent with the broadcast random code. Because a random code is broadcast, fraud using synthesized speech prepared in advance by a user is effectively prevented, improving the security of identity recognition.
In an embodiment, the step of extracting the voiceprint feature vector of each read-back speech is substantially the same as the voiceprint feature vector extraction method of the foregoing embodiment and is not repeated here.
In an embodiment, the step of computing the pairwise distances between the voiceprint feature vectors is substantially the same as the cosine distance computation described above and is not repeated here.
If each cosine distance is less than or equal to the preset distance threshold, every read-back came from the same user, and the voiceprint feature vector is stored as the user's standard voiceprint feature vector; if a cosine distance is greater than the preset distance threshold, the read-backs did not all come from the same user, and the user is prompted to register again.
The present application further provides a computer-readable storage medium storing a processing system which, when executed by a processor, implements the steps of the identity verification method described above.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
Through the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. An electronic device, characterized in that the electronic device comprises a memory and a processor connected to the memory, the memory storing a processing system operable on the processor, the processing system, when executed by the processor, implementing the following steps:
    an acoustic model establishment step: when a user transacts business in an interactive voice response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read back, and after the read-back, establishing an acoustic model of a preset type for the broadcast random code and another for the speech the user read back this time;
    a forced overall alignment step: performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read back this time, and computing with a predetermined algorithm the probability that the two aligned acoustic models are identical;
    an identity verification step: if the probability that the two aligned acoustic models are identical is greater than a preset first threshold, extracting the voiceprint feature vector of the speech the user read back this time, retrieving the standard voiceprint feature vector pre-stored after the user's successful registration, and computing the distance between the voiceprint feature vector of the read-back speech and the standard voiceprint feature vector to verify the user's identity.
  2. The electronic device according to claim 1, characterized in that the processing system, when executed by the processor, further implements the following steps:
    when the user performs voiceprint registration in the IVR scenario, broadcasting a random code with a second preset number of digits for the user to read back a preset number of times, and after each read-back, establishing acoustic models of the preset type for the broadcast random code and for the speech the user read back;
    performing a forced overall alignment operation on the acoustic model of each broadcast random code and the acoustic model of the corresponding read-back speech, and computing with the predetermined algorithm the probability that the two aligned acoustic models are identical;
    if the probabilities for all aligned model pairs are greater than a preset second threshold, extracting the voiceprint feature vector of each read-back speech and computing the pairwise distances between the voiceprint feature vectors, to analyze whether every read-back came from the same user;
    if so, storing the voiceprint feature vector as the user's standard voiceprint feature vector.
  3. The electronic device according to claim 1, characterized in that the preset type of acoustic model is a deep neural network-hidden Markov model.
  4. The electronic device according to claim 2, characterized in that the preset type of acoustic model is a deep neural network-hidden Markov model.
  5. The electronic device according to claim 1, characterized in that the step of extracting the voiceprint feature vector of the speech the user read back this time comprises:
    performing pre-emphasis and windowing on the speech the user read back this time, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the speech the user read back this time from the MFCCs.
  6. The electronic device according to claim 2, characterized in that the step of extracting the voiceprint feature vector of the speech the user read back this time comprises:
    performing pre-emphasis and windowing on the speech the user read back this time, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the speech the user read back this time from the MFCCs.
  7. The electronic device according to claim 1 or 2, characterized in that the step of computing the distance between the voiceprint feature vector of the speech the user read back this time and the standard voiceprint feature vector comprises computing the cosine distance:
    dist(x, y) = 1 - (x · y) / (‖x‖ ‖y‖)
    where x is the standard voiceprint feature vector and y is the voiceprint feature vector of the speech the user read back this time.
  8. An identity verification method, characterized in that the identity verification method comprises:
    S1: when a user transacts business in an interactive voice response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read back, and after the read-back, establishing an acoustic model of a preset type for the broadcast random code and another for the speech the user read back this time;
    S2: performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read back this time, and computing with a predetermined algorithm the probability that the two aligned acoustic models are identical;
    S3: if the probability that the two aligned acoustic models are identical is greater than a preset first threshold, extracting the voiceprint feature vector of the speech the user read back this time, retrieving the standard voiceprint feature vector pre-stored after the user's successful registration, and computing the distance between the voiceprint feature vector of the read-back speech and the standard voiceprint feature vector to verify the user's identity.
  9. The identity verification method according to claim 8, characterized in that before step S1, the method further comprises:
    S01: when the user performs voiceprint registration in the IVR scenario, broadcasting a random code with a second preset number of digits for the user to read back a preset number of times, and after each read-back, establishing acoustic models of the preset type for the broadcast random code and for the speech the user read back;
    S02: performing a forced overall alignment operation on the acoustic model of each broadcast random code and the acoustic model of the corresponding read-back speech, and computing with the predetermined algorithm the probability that the two aligned acoustic models are identical;
    S03: if the probabilities for all aligned model pairs are greater than a preset second threshold, extracting the voiceprint feature vector of each read-back speech and computing the pairwise distances between the voiceprint feature vectors, to analyze whether every read-back came from the same user;
    S04: if so, storing the voiceprint feature vector as the user's standard voiceprint feature vector.
  10. The identity verification method according to claim 8, characterized in that the preset type of acoustic model is a deep neural network-hidden Markov model.
  11. The identity verification method according to claim 9, characterized in that the preset type of acoustic model is a deep neural network-hidden Markov model.
  12. The identity verification method according to claim 8, characterized in that the step of extracting the voiceprint feature vector of the speech the user read back this time comprises:
    performing pre-emphasis and windowing on the speech the user read back this time, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the speech the user read back this time from the MFCCs.
  13. The identity verification method according to claim 9, characterized in that the step of extracting the voiceprint feature vector of the speech the user read back this time comprises:
    performing pre-emphasis and windowing on the speech the user read back this time, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the speech the user read back this time from the MFCCs.
  14. The identity verification method according to claim 8 or 9, characterized in that the step of computing the distance between the voiceprint feature vector of the speech the user read back this time and the standard voiceprint feature vector comprises computing the cosine distance:
    dist(x, y) = 1 - (x · y) / (‖x‖ ‖y‖)
    where x is the standard voiceprint feature vector and y is the voiceprint feature vector of the speech the user read back this time.
  15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a processing system which, when executed by a processor, implements the following steps:
    an acoustic model establishment step: when a user transacts business in an interactive voice response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read back, and after the read-back, establishing an acoustic model of a preset type for the broadcast random code and another for the speech the user read back this time;
    a forced overall alignment step: performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read back this time, and computing with a predetermined algorithm the probability that the two aligned acoustic models are identical;
    an identity verification step: if the probability that the two aligned acoustic models are identical is greater than a preset first threshold, extracting the voiceprint feature vector of the speech the user read back this time, retrieving the standard voiceprint feature vector pre-stored after the user's successful registration, and computing the distance between the voiceprint feature vector of the read-back speech and the standard voiceprint feature vector to verify the user's identity.
  16. The computer-readable storage medium according to claim 15, characterized in that the processing system, when executed by the processor, further implements the following steps:
    when the user performs voiceprint registration in the IVR scenario, broadcasting a random code with a second preset number of digits for the user to read back a preset number of times, and after each read-back, establishing acoustic models of the preset type for the broadcast random code and for the speech the user read back;
    performing a forced overall alignment operation on the acoustic model of each broadcast random code and the acoustic model of the corresponding read-back speech, and computing with the predetermined algorithm the probability that the two aligned acoustic models are identical;
    if the probabilities for all aligned model pairs are greater than a preset second threshold, extracting the voiceprint feature vector of each read-back speech and computing the pairwise distances between the voiceprint feature vectors, to analyze whether every read-back came from the same user;
    if so, storing the voiceprint feature vector as the user's standard voiceprint feature vector.
  17. The computer-readable storage medium according to claim 15, characterized in that the preset type of acoustic model is a deep neural network-hidden Markov model.
  18. The computer-readable storage medium according to claim 16, characterized in that the preset type of acoustic model is a deep neural network-hidden Markov model.
  19. The computer-readable storage medium according to claim 15, characterized in that the step of extracting the voiceprint feature vector of the speech the user read back this time comprises:
    performing pre-emphasis and windowing on the speech the user read back this time, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the speech the user read back this time from the MFCCs.
  20. The computer-readable storage medium according to claim 16, characterized in that the step of extracting the voiceprint feature vector of the speech the user read back this time comprises:
    performing pre-emphasis and windowing on the speech the user read back this time, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the speech the user read back this time from the MFCCs.
PCT/CN2018/102208 2018-04-09 2018-08-24 Electronic device, identity verification method, and storage medium WO2019196305A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810311721.2 2018-04-09
CN201810311721.2A CN108694952B (en) 2018-04-09 2018-04-09 Electronic device, identity authentication method and storage medium

Publications (1)

Publication Number Publication Date
WO2019196305A1 true WO2019196305A1 (en) 2019-10-17

Family

ID=63844884

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102208 WO2019196305A1 (en) 2018-04-09 2018-08-24 Electronic device, identity verification method, and storage medium

Country Status (2)

Country Link
CN (1) CN108694952B (en)
WO (1) WO2019196305A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109448732B (en) * 2018-12-27 2021-06-08 科大讯飞股份有限公司 Digital string voice processing method and device
CN110536029B (en) * 2019-08-15 2021-11-16 咪咕音乐有限公司 Interaction method, network side equipment, terminal equipment, storage medium and system
CN110491393B (en) * 2019-08-30 2022-04-22 科大讯飞股份有限公司 Training method and related device for voiceprint representation model
CN111161746B (en) * 2019-12-31 2022-04-15 思必驰科技股份有限公司 Voiceprint registration method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103680497A (en) * 2012-08-31 2014-03-26 百度在线网络技术(北京)有限公司 Voice recognition system and voice recognition method based on video
CN103986725A (en) * 2014-05-29 2014-08-13 中国农业银行股份有限公司 Client side, server side and identity authentication system and method
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium

Also Published As

Publication number Publication date
CN108694952A (en) 2018-10-23
CN108694952B (en) 2020-04-28

Similar Documents

Publication Publication Date Title
JP6621536B2 (en) Electronic device, identity authentication method, system, and computer-readable storage medium
WO2018166187A1 (en) Server, identity verification method and system, and a computer-readable storage medium
WO2019100606A1 (en) Electronic device, voiceprint-based identity verification method and system, and storage medium
JP6429945B2 (en) Method and apparatus for processing audio data
WO2019196305A1 (en) Electronic device, identity verification method, and storage medium
WO2019136912A1 (en) Electronic device, identity authentication method and system, and storage medium
WO2021051572A1 (en) Voice recognition method and apparatus, and computer device
CN107610709A (en) A kind of method and system for training Application on Voiceprint Recognition model
WO2019136911A1 (en) Voice recognition method for updating voiceprint data, terminal device, and storage medium
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
KR20160011709A (en) Method, apparatus and system for payment validation
EP3373177B1 (en) Methods and systems for determining user liveness
WO2019179033A1 (en) Speaker authentication method, server, and computer-readable storage medium
CN108650266B (en) Server, voiceprint verification method and storage medium
KR20210050884A (en) Registration method and apparatus for speaker recognition
CN105224844B (en) Verification method, system and device
CN108447489B (en) A continuous voiceprint authentication method and system with feedback
WO2019218515A1 (en) Server, voiceprint-based identity authentication method, and storage medium
CN113112992B (en) Voice recognition method and device, storage medium and server
US20230153815A1 (en) Methods and systems for training a machine learning model and authenticating a user with the model
CN113436633B (en) Speaker recognition method, speaker recognition device, computer equipment and storage medium
US20230153408A1 (en) Methods and systems for training a machine learning model and authenticating a user with the model
CN117238297A (en) Method, apparatus, device, medium and program product for sound signal processing
CN112382296A (en) Method and device for voiceprint remote control of wireless audio equipment
US20250046317A1 (en) Methods and systems for authenticating users

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18914251

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18914251

Country of ref document: EP

Kind code of ref document: A1