
WO2019196305A1 - Electronic device, identity verification method, and storage medium - Google Patents

Electronic device, identity verification method, and storage medium Download PDF

Info

Publication number
WO2019196305A1
WO2019196305A1 (PCT/CN2018/102208)
Authority
WO
WIPO (PCT)
Prior art keywords
user
feature vector
voiceprint feature
voice
acoustic model
Application number
PCT/CN2018/102208
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
于夕畔
李瑾瑾
肖京
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司
Publication of WO2019196305A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00: Speaker identification or verification techniques
    • G10L 17/02: Preprocessing operations, e.g. segment selection; pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; feature selection or extraction
    • G10L 17/04: Training, enrolment or model building
    • G10L 17/06: Decision making techniques; pattern matching strategies
    • G10L 17/08: Use of distortion metrics or a particular distance between probe pattern and reference templates
    • G10L 17/16: Hidden Markov models [HMMs]
    • G10L 17/18: Artificial neural networks; connectionist approaches
    • G10L 17/22: Interactive procedures; man-machine interfaces
    • G10L 17/24: Interactive procedures in which the user is prompted to utter a password or a predefined phrase
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00-G10L 21/00
    • G10L 25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24: Speech or voice analysis techniques in which the extracted parameters are the cepstrum

Definitions

  • The present application relates to the field of communications technologies, and in particular to an electronic device, an identity verification method, and a storage medium.
  • Currently, in the interactive voice response (IVR) scenario, schemes exist that combine IVR with voiceprint recognition to authenticate a customer, for example when a customer receives a credit card and uses the telephone to activate it or to change the password, the customer's identity needs to be verified. Because the two parties in remote voiceprint verification are not face to face, a fraudster may play a synthesized voice prepared in advance, so the customer's identity cannot be confirmed accurately and the security of such verification is low.
  • The purpose of the present application is to provide an electronic device, an identity verification method, and a storage medium that double-verify the user's identity and can accurately confirm who the user is.
  • To achieve the above object, the present application provides an electronic device including a memory and a processor connected to the memory, the memory storing a processing system operable on the processor. When executed by the processor, the processing system implements the following steps:
  • Acoustic model establishing step: in the IVR scenario, when a user handles a service, a random code of a first preset number of digits is broadcast for the user to read aloud, and after the reading, a preset type of acoustic model is established for the broadcast random code and for the speech the user read, respectively;
  • Forced overall alignment step: a forced overall alignment operation is performed on the acoustic model of the broadcast random code and the acoustic model of the speech the user read, and a predetermined algorithm is used to calculate the probability that the two aligned acoustic models are the same;
  • Identity verification step: if the probability that the two aligned acoustic models are the same is greater than a preset first threshold, the voiceprint feature vector of the speech the user read is extracted, the standard voiceprint feature vector pre-stored after the user successfully registered is acquired, and the distance between the two vectors is calculated to verify the user's identity.
  • To achieve the above object, the present application further provides an identity verification method that includes the same three steps: broadcasting a random code of a first preset number of digits for the user to read aloud and establishing a preset type of acoustic model for the broadcast random code and for the user's speech; force-aligning the two acoustic models and calculating the probability that they are the same; and, if that probability exceeds the preset first threshold, extracting the voiceprint feature vector of the user's speech and comparing it against the pre-stored standard voiceprint feature vector.
  • The present application also provides a computer-readable storage medium storing a processing system which, when executed by a processor, implements the steps of the identity verification method described above.
  • The beneficial effects of the present application are as follows: when identity recognition is performed in the IVR scenario, having the user read back a random code effectively prevents fraud with a synthesized voice prepared in advance, and combining the random code with voiceprint recognition realizes double verification of the user's identity, so the user's identity can be confirmed accurately and the security of identity verification in the IVR scenario is improved. In addition, performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read reduces the amount of computation and improves recognition efficiency.
  • FIG. 1 is a schematic diagram of an optional application environment of the embodiments of the present application.
  • FIG. 2 is a schematic flowchart of an embodiment of the identity verification method of the present application.
  • The application environment includes an electronic device 1 and a terminal device. The electronic device 1 can exchange data with the terminal device through a suitable technology such as a network or near-field communication. In this embodiment, the user logs in to the IVR system of the electronic device 1 through the terminal device to perform voiceprint registration and voiceprint recognition.
  • The terminal device includes, but is not limited to, any electronic product that can interact with a user through a keyboard, mouse, remote controller, touch panel, or voice control device, for example mobile devices such as personal computers, tablet computers, smart phones, personal digital assistants (PDAs), game consoles, Internet Protocol television (IPTV) devices, smart wearable devices, and navigation devices, or fixed terminals such as digital TVs, desktop computers, notebooks, and servers.
  • The electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions set or stored in advance. It may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
  • In this embodiment, the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to one another through a system bus, with the memory 11 storing a processing system operable on the processor 12. It should be noted that FIG. 1 shows only an electronic device 1 with components 11-13; not all illustrated components are required, and more or fewer components may be implemented instead.
  • The memory 11 includes internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1. The readable storage medium may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, or optical disk. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk; in other embodiments, it may instead be an external storage device, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card fitted to the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the various application software installed in the electronic device 1, for example the program code of the processing system in an embodiment of the present application. The memory 11 can also be used to temporarily store data that has been output or is to be output.
  • The processor 12 may in some embodiments be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 12 is typically used to control the overall operation of the electronic device 1, for example performing control and processing related to data exchange or communication with the terminal device. In this embodiment, the processor 12 is configured to run the program code or process the data stored in the memory 11, for example to run the processing system.
  • The network interface 13 may comprise a wireless network interface or a wired network interface and is typically used to establish communication connections between the electronic device 1 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices and to establish data transmission channels and communication connections between them.
  • The processing system is stored in the memory 11 and includes at least one computer-readable instruction stored there, which can be executed by the processor 12 to implement the methods of the embodiments of the present application; the at least one computer-readable instruction can also be divided into different logic modules according to the functions implemented by its parts.
  • In an embodiment, when executed by the processor 12, the processing system implements the following steps. Acoustic model establishing step: in the IVR scenario, when a user handles a service, a random code of a first preset number of digits is broadcast for the user to read aloud, and after the reading, a preset type of acoustic model is established for the broadcast random code and for the speech the user read, respectively.
  • In the IVR scenario, a user requesting to handle a service sends an identity code, for example an ID card number. After receiving the request, the system analyzes whether the requested service needs further identity verification and, from the identity code, whether the user has a registered voiceprint. If further verification is needed and a voiceprint is registered, a random code of the first preset number of digits (for example, 8 digits) is generated and broadcast in voice form using speech synthesis technology, and the user is guided to read it aloud (a sketch of this gating logic follows below).
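As an illustration of the gating and challenge-generation logic just described, here is a minimal sketch; the function and field names are hypothetical, and the 8-digit length is only the example value given in the text.

```python
import secrets

FIRST_PRESET_DIGITS = 8  # example value from the text

def start_verification(user, requires_verification, registered_voiceprints):
    """Hypothetical IVR entry point: decide whether the user must be
    challenged and, if so, generate the random code to broadcast via TTS."""
    if not requires_verification(user.service):
        return None  # the requested service needs no further verification
    if user.id_number not in registered_voiceprints:
        return None  # no enrolled voiceprint to compare against
    # secrets (rather than random) so the challenge is not predictable,
    # which matters for a security-sensitive read-back code.
    return ''.join(secrets.choice('0123456789')
                   for _ in range(FIRST_PRESET_DIGITS))
```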
  • After the user reads, a preset type of acoustic model is established for the speech of the broadcast random code and another for the speech the user read. In a preferred embodiment, the preset type of acoustic model is a deep neural network-hidden Markov model, i.e., a DNN-HMM acoustic model. In other embodiments it may also be another acoustic model, for example a hidden Markov acoustic model.
  • In a specific example, take the DNN-HMM acoustic model: the HMM describes the dynamic change of the speech signal, and each output node of the DNN estimates the posterior probability of a state of a continuous-density HMM, which yields the DNN-HMM model. The speech of the broadcast random code and the speech the user read are each a series of syllables, and the text to be recognized is a series of characters. When building the DNN-HMM acoustic models, the model for the broadcast random code's speech and the model for the user's speech are obtained through global character acoustic adaptive training based on a predetermined character speech library (see the conversion sketch below).
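In a hybrid DNN-HMM system of this kind, the per-frame posteriors P(state | frame) that the DNN outputs are conventionally converted into the scaled likelihoods P(frame | state) that the HMM needs, by dividing by the state priors (Bayes' rule, dropping the constant P(frame)). A minimal sketch of that standard conversion, assuming the posteriors and priors are already available:

```python
import numpy as np

def posteriors_to_log_likelihoods(dnn_posteriors, state_priors, eps=1e-10):
    """dnn_posteriors: (T, S) array of P(state | frame) from the DNN.
    state_priors: (S,) array of P(state). Returns (T, S) scaled
    log-likelihoods log P(frame | state) for HMM decoding/alignment."""
    return np.log(dnn_posteriors + eps) - np.log(state_priors + eps)
```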
  • Forced overall alignment step: a forced overall alignment operation is performed on the acoustic model of the broadcast random code and the acoustic model of the speech the user read, and a predetermined algorithm is used to calculate the probability that the two aligned acoustic models are the same. Compared with the traditional word-by-word comparison, the force alignment operation greatly reduces the amount of computation and helps improve the efficiency of identity recognition (a Viterbi-style sketch follows below).
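Force alignment constrains decoding to the single left-to-right state sequence spelled out by the known random code, so only stay-or-advance transitions need scoring rather than a full search, which is where the saving over word-by-word comparison comes from. A minimal Viterbi-style sketch, assuming per-frame emission log-likelihoods for the transcript's states have already been computed (for example with the conversion above) and treating transition probabilities as uniform:

```python
import numpy as np

def force_align(emission_loglik):
    """emission_loglik: (T, S) array of log P(frame t | state s) for the S
    states of the known transcript, in order. Returns the best state for
    each frame and the total path log-likelihood."""
    T, S = emission_loglik.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = emission_loglik[0, 0]          # must start in state 0
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1, s]
            advance = score[t - 1, s - 1] if s > 0 else -np.inf
            back[t, s] = s if stay >= advance else s - 1
            score[t, s] = max(stay, advance) + emission_loglik[t, s]
    path = [S - 1]                               # must end in the last state
    for t in range(T - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1], score[T - 1, S - 1]
```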
  • In one embodiment the predetermined algorithm is a prior-posterior probability algorithm; in other embodiments it may be a similarity algorithm. For example, the similarity algorithm may compute the edit distance between the character sequences of the two aligned acoustic models: the smaller the edit distance, the greater the probability that the two aligned models are the same. The similarity algorithm may also be the longest-common-subsequence algorithm: the smaller the difference between the length of the longest common subsequence and the lengths of the character sequences of the two aligned models, the greater the probability that the two aligned models are the same (both are sketched below).
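Both similarity measures named above are standard dynamic-programming routines; a minimal sketch over the decoded character sequences:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; smaller means the sequences more likely match."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def lcs_length(a: str, b: str) -> int:
    """Longest common subsequence; the closer its length is to len(a)
    and len(b), the more likely the two aligned sequences match."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = (dp[i - 1][j - 1] + 1 if ca == cb
                        else max(dp[i - 1][j], dp[i][j - 1]))
    return dp[len(a)][len(b)]
```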
  • Identity verification step: if the probability that the two aligned acoustic models are the same is greater than the preset first threshold, the voiceprint feature vector of the speech the user read is extracted, the standard voiceprint feature vector pre-stored after the user successfully registered is acquired, and the distance between the two vectors is calculated to verify the user's identity.
  • In this embodiment, if the probability that the two aligned acoustic models are the same is greater than the preset first threshold, for example 0.985, the characters the user read are considered consistent with the broadcast random code. Since what is broadcast is a random code, fraud with a synthesized voice prepared in advance is effectively prevented, improving the security of identity recognition.
  • In this embodiment, the step of extracting the voiceprint feature vector of the speech the user read includes: framing the speech, then pre-emphasizing and windowing the framed speech data and performing a Fourier transform on each window to obtain the corresponding spectrum; feeding the spectrum into a mel filter bank to output the mel spectrum; and performing cepstral analysis on the mel spectrum to obtain the mel-frequency cepstral coefficients (MFCCs), on which the voiceprint feature vector is based. Pre-emphasis is in effect high-pass filtering that removes low-frequency data so that the high-frequency characteristics of the speech are more prominent.
  • The cepstral analysis on the mel spectrum consists, for example, of taking the logarithm and an inverse transform; the inverse transform is generally implemented as a DCT (discrete cosine transform), and the 2nd through 13th coefficients after the DCT are taken as the MFCCs. The MFCCs of a frame are the voiceprint features of that frame of speech data, and the MFCCs of all frames are composed into a feature data matrix, which is the voiceprint feature vector of the speech the user read. Because the mel frequency scale is closer to the human auditory system than the linearly spaced frequency bands used in the normal cepstrum, composing the voiceprint feature vector from MFCCs can improve the accuracy of identity verification (the pipeline is sketched below).
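A minimal sketch of exactly that pipeline (framing, pre-emphasis, windowing, FFT, mel filter bank, log, DCT, coefficients 2-13); the sampling rate, frame sizes, and filter-bank size are illustrative, since the text does not fix them, and the signal is assumed to be at least one frame long:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_voiceprint(signal, sr=8000, frame_len=0.025, frame_step=0.01,
                    n_filt=26, nfft=512):
    """Return the per-frame MFCC feature matrix of a speech signal,
    keeping the 2nd-13th DCT coefficients as described in the text."""
    # Pre-emphasis: first-order high-pass filter boosting high frequencies.
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Framing and Hamming windowing.
    flen, fstep = int(sr * frame_len), int(sr * frame_step)
    n_frames = 1 + (len(x) - flen) // fstep
    frames = np.stack([x[i * fstep:i * fstep + flen] for i in range(n_frames)])
    frames = frames * np.hamming(flen)
    # Per-frame power spectrum via FFT.
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
    # Triangular mel filter bank, evenly spaced on the mel scale.
    hi_mel = 2595 * np.log10(1 + (sr / 2) / 700)
    hz = 700 * (10 ** (np.linspace(0, hi_mel, n_filt + 2) / 2595) - 1)
    bins = np.floor((nfft + 1) * hz / sr).astype(int)
    fbank = np.zeros((n_filt, nfft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    mel_spec = np.maximum(power @ fbank.T, 1e-10)
    # Cepstral analysis: log, then DCT; keep coefficients 2-13.
    return dct(np.log(mel_spec), type=2, axis=1, norm='ortho')[:, 1:13]
```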
  • In this embodiment, the distance calculated between the voiceprint feature vector of the speech the user read and the standard voiceprint feature vector is the cosine distance of the two. With A denoting the standard voiceprint feature vector and B the voiceprint feature vector of the speech the user read, the cosine similarity is cos(A, B) = (A · B) / (|A| |B|), and the cosine distance is derived from it.
  • If the cosine distance is less than or equal to a preset distance threshold, the identity verification passes; if the cosine distance is greater than the preset distance threshold, the identity verification fails.
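A minimal sketch of the decision, assuming each feature matrix has first been collapsed to one fixed-length vector (the frame-wise mean here, an assumption the text leaves open) and using an illustrative distance threshold, since the patent only says "preset":

```python
import numpy as np

def to_vector(mfcc_matrix):
    # Collapse the (frames x coefficients) matrix to one fixed-length
    # vector; the frame-wise mean is one common, simple choice.
    return np.asarray(mfcc_matrix).mean(axis=0)

def cosine_distance(standard_vec, probe_vec):
    a, b = np.ravel(standard_vec), np.ravel(probe_vec)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def verify(standard_vec, probe_vec, dist_threshold=0.2):  # illustrative value
    # Pass only when the distance does not exceed the preset threshold.
    return cosine_distance(standard_vec, probe_vec) <= dist_threshold
```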
  • In this embodiment, the standard voiceprint feature vector pre-stored after the user successfully registers is obtained as follows: a random code of a second preset number of digits is broadcast for the user to read aloud a preset number of times, and after each reading a preset type of acoustic model is established for the broadcast random code and for the user's speech, the two models are force-aligned, and a predetermined algorithm calculates the probability that they are the same;
  • If, for every reading, the probability that the two aligned acoustic models are the same is greater than a preset second threshold, the voiceprint feature vector of each reading is extracted, the distances between pairs of voiceprint feature vectors are calculated, and it is analyzed whether every reading came from the same user;
  • If not, the user is prompted to re-record and the voiceprint registration step is performed again.
  • Specifically, during registration the user sends an identity code, such as an ID card number; a random code of the second preset number of digits (for example, 8 digits) is generated and broadcast in voice form using speech synthesis technology, and the user is guided to read it a preset number of times (for example, 3 times).
  • After each reading, a preset type of acoustic model is established for the speech of each broadcast random code and for the speech the user read each time. In a preferred embodiment the preset type is the DNN-HMM acoustic model; in other embodiments it may also be another acoustic model, such as a hidden Markov acoustic model. For specific examples, reference may be made to the foregoing embodiment; details are not repeated here. Taking the DNN-HMM acoustic model as an example, the HMM describes the dynamic change of the speech signal, each DNN output node estimates the posterior probability of a state of a continuous-density HMM, and the DNN-HMM acoustic models of the broadcast random code's speech and of the user's speech are obtained through global character acoustic adaptive training. Performing a forced overall alignment operation on the acoustic model of each broadcast random code and the acoustic model of the corresponding reading, compared with the traditional word-by-word comparison, greatly reduces the amount of computation and helps improve the efficiency of identity recognition.
  • The predetermined algorithm may be a prior-posterior probability algorithm in one embodiment and a similarity algorithm in other embodiments, as described above.
  • If, for each reading, the probability that the two aligned acoustic models are the same is greater than the preset second threshold, for example 0.985, the characters the user read are considered consistent with the broadcast random code. Since what is broadcast is a random code, fraud with a synthesized voice prepared in advance is effectively prevented, improving the security of identity recognition.
  • The step of extracting the voiceprint feature vector of each reading is substantially the same as the voiceprint feature extraction described in the foregoing embodiment, and the step of calculating the distance between two voiceprint feature vectors is substantially the same as the cosine-distance calculation described above; details are not repeated here.
  • If each cosine distance is less than or equal to the preset distance threshold, every reading is judged to come from the same user and the voiceprint feature vector is stored as the user's standard voiceprint feature vector; if a cosine distance is greater than the preset distance threshold, the readings are judged not to come from the same user and the user is prompted to re-register (the enrollment decision is sketched below).
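Putting the registration checks together, a sketch of the enrollment decision that reuses the helpers above; the pairwise comparison, the mean-vector storage, and the threshold value are assumptions for illustration, since the text only says the voiceprint feature vector is stored:

```python
from itertools import combinations
import numpy as np

def enroll(voiceprint_vectors, dist_threshold=0.2):
    """voiceprint_vectors: one fixed-length vector per accepted reading
    (e.g., 3). Returns the standard voiceprint feature vector to store,
    or None if the readings do not all appear to be the same speaker."""
    for a, b in combinations(voiceprint_vectors, 2):
        if cosine_distance(a, b) > dist_threshold:
            return None  # not the same user: prompt re-registration
    # Store the mean of the readings as the standard vector (assumption).
    return np.mean(voiceprint_vectors, axis=0)
```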
  • In summary, when the present application performs identity recognition in the IVR scenario, having the user read back a random code effectively prevents fraud with a synthesized voice prepared in advance, and combining the random code with voiceprint recognition realizes double verification of the user's identity, so the user's identity can be confirmed accurately and the security of identity verification in the IVR scenario is improved. In addition, the forced overall alignment of the acoustic model of the broadcast random code with the acoustic model of the speech the user read reduces the amount of computation and improves recognition efficiency.
  • FIG. 2 is a schematic flowchart of an embodiment of the identity verification method of the present application. The identity verification method includes the following steps:
  • Step S1: in the IVR scenario, when a user handles a service, a random code of a first preset number of digits is broadcast for the user to read aloud, and after the reading, a preset type of acoustic model is established for the broadcast random code and for the speech the user read, respectively.
  • The specific implementation of this step is the same as that of the acoustic model establishing step described in the electronic device embodiment above and is not repeated here.
  • Step S2: a forced overall alignment operation is performed on the acoustic model of the broadcast random code and the acoustic model of the speech the user read, and a predetermined algorithm is used to calculate the probability that the two aligned acoustic models are the same.
  • The specific implementation of this step, including the choice of prior-posterior probability, edit-distance, or longest-common-subsequence algorithms, is the same as described above and is not repeated here.
  • Step S3: if the probability that the two aligned acoustic models are the same is greater than the preset first threshold, the voiceprint feature vector of the speech the user read is extracted, the standard voiceprint feature vector pre-stored after the user successfully registered is acquired, and the distance between the two vectors is calculated to verify the user's identity.
  • The specific implementation of this step, including the first threshold of 0.985, the MFCC-based extraction of the voiceprint feature vector, the cosine-distance calculation and decision, and the voiceprint registration procedure that pre-stores the standard voiceprint feature vector, is the same as described in the electronic device embodiment above and is not repeated here.
  • The present application also provides a computer-readable storage medium storing a processing system which, when executed by a processor, implements the steps of the identity verification method described above.
  • Through the description of the foregoing embodiments, it is clear that the above method embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) that includes a number of instructions for causing a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to perform the methods described in the embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Collating Specific Patterns (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present application relates to an electronic device, an identity verification method, and a storage medium. The method comprises: when a user performs transactions in an IVR scenario, broadcasting a random code of a first preset number of digits for the user to read, and upon reading, establishing acoustic models of preset types for the broadcasted random code and the voice read out by the user, respectively; performing a forced overall alignment operation on the acoustic model of the broadcasted random code and the acoustic model of the voice read out by the user, and using a predetermined algorithm to calculate the probability that the two aligned acoustic models are the same; if the probability is greater than a preset first threshold, extracting a voiceprint feature vector of the voice read out by the user, acquiring a standard voiceprint feature vector of the user pre-stored after successful registration, and calculating the distance between the voiceprint feature vector of the voice read out by the user and the standard voiceprint feature vector, so as to perform identity verification on the user. The present application performs dual verification of the user's identity and can accurately determine who the user is.

Description

Electronic device, identity verification method, and storage medium
Priority Claim
This application claims priority under the Paris Convention to Chinese Patent Application No. CN2018103117212, filed on April 9, 2018 and entitled "Electronic device, identity verification method, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of communications technologies, and in particular to an electronic device, an identity verification method, and a storage medium.
Background
Currently, in the interactive voice response (IVR) scenario, schemes exist that combine IVR with voiceprint recognition to authenticate a customer, for example when a customer receives a credit card and uses the telephone to activate it or to change the password, the customer's identity needs to be verified. In the prior art, because the two parties in remote voiceprint verification do not verify face to face, a customer may commit fraud using a synthesized voice prepared in advance, the customer's identity cannot be confirmed accurately, and the security of identity verification is low.
Summary of the Invention
The purpose of the present application is to provide an electronic device, an identity verification method, and a storage medium that double-verify the user's identity and can accurately confirm who the user is.
To achieve the above object, the present application provides an electronic device including a memory and a processor connected to the memory, the memory storing a processing system operable on the processor. When executed by the processor, the processing system implements the following steps:
Acoustic model establishing step: in the IVR scenario, when a user handles a service, a random code of a first preset number of digits is broadcast for the user to read aloud, and after the reading, a preset type of acoustic model is established for the broadcast random code and for the speech the user read, respectively;
Forced overall alignment step: a forced overall alignment operation is performed on the acoustic model of the broadcast random code and the acoustic model of the speech the user read, and a predetermined algorithm is used to calculate the probability that the two aligned acoustic models are the same;
Identity verification step: if the probability that the two aligned acoustic models are the same is greater than a preset first threshold, the voiceprint feature vector of the speech the user read is extracted, the standard voiceprint feature vector pre-stored after the user successfully registered is acquired, and the distance between the two vectors is calculated to verify the user's identity.
To achieve the above object, the present application further provides an identity verification method, which includes:
S1: in the IVR scenario, when a user handles a service, broadcasting a random code of a first preset number of digits for the user to read aloud, and after the reading, establishing a preset type of acoustic model for the broadcast random code and for the speech the user read, respectively;
S2: performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read, and using a predetermined algorithm to calculate the probability that the two aligned acoustic models are the same;
S3: if the probability that the two aligned acoustic models are the same is greater than the preset first threshold, extracting the voiceprint feature vector of the speech the user read, acquiring the standard voiceprint feature vector pre-stored after the user successfully registered, and calculating the distance between the two vectors to verify the user's identity.
The present application also provides a computer-readable storage medium storing a processing system which, when executed by a processor, implements the steps of the identity verification method described above.
The beneficial effects of the present application are as follows: when identity recognition is performed in the IVR scenario, having the user read back a random code effectively prevents fraud with a synthesized voice prepared in advance, and combining the random code with voiceprint recognition realizes double verification of the user's identity, so the user's identity can be confirmed accurately and the security of identity verification in the IVR scenario is improved. In addition, performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read reduces the amount of computation and improves recognition efficiency.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an optional application environment of the embodiments of the present application;
FIG. 2 is a schematic flowchart of an embodiment of the identity verification method of the present application.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here merely explain the application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
It should be noted that descriptions involving "first", "second", and the like in the present application are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, but only on the basis that a person of ordinary skill in the art can realize the combination; when a combination of technical solutions is contradictory or cannot be realized, it should be considered that the combination does not exist and is not within the scope of protection claimed by this application.
Referring to FIG. 1, which is a schematic diagram of the application environment of a preferred embodiment of the identity verification method of the present application, the application environment includes an electronic device 1 and a terminal device. The electronic device 1 can exchange data with the terminal device through a suitable technology such as a network or near-field communication. In this embodiment, the user logs in to the IVR system of the electronic device 1 through the terminal device to perform voiceprint registration and voiceprint recognition.
The terminal device includes, but is not limited to, any electronic product that can interact with a user through a keyboard, mouse, remote controller, touch panel, or voice control device, for example mobile devices such as personal computers, tablet computers, smart phones, personal digital assistants (PDAs), game consoles, Internet Protocol television (IPTV) devices, smart wearable devices, and navigation devices, or fixed terminals such as digital TVs, desktop computers, notebooks, and servers.
The electronic device 1 is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions set or stored in advance. It may be a computer, a single network server, a server group composed of multiple network servers, or a cloud composed of a large number of hosts or network servers based on cloud computing, where cloud computing is a type of distributed computing: a super virtual computer consisting of a group of loosely coupled computers.
In this embodiment, the electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a network interface 13 that are communicably connected to one another through a system bus, with the memory 11 storing a processing system operable on the processor 12. It should be noted that FIG. 1 shows only an electronic device 1 with components 11-13; not all illustrated components are required, and more or fewer components may be implemented instead.
The memory 11 includes internal memory and at least one type of readable storage medium. The internal memory provides a cache for the operation of the electronic device 1. The readable storage medium may be a non-volatile storage medium such as a flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, or optical disk. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, such as its hard disk; in other embodiments, it may instead be an external storage device, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card fitted to the electronic device 1. In this embodiment, the readable storage medium of the memory 11 is generally used to store the operating system and the various application software installed in the electronic device 1, for example the program code of the processing system in an embodiment of the present application. The memory 11 can also be used to temporarily store data that has been output or is to be output.
The processor 12 may in some embodiments be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 12 is typically used to control the overall operation of the electronic device 1, for example performing control and processing related to data exchange or communication with the terminal device. In this embodiment, the processor 12 is configured to run the program code or process the data stored in the memory 11, for example to run the processing system.
The network interface 13 may comprise a wireless network interface or a wired network interface and is typically used to establish communication connections between the electronic device 1 and other electronic devices. In this embodiment, the network interface 13 is mainly used to connect the electronic device 1 with one or more terminal devices and to establish data transmission channels and communication connections between them.
The processing system is stored in the memory 11 and includes at least one computer-readable instruction stored there, which can be executed by the processor 12 to implement the methods of the embodiments of the present application; the at least one computer-readable instruction can also be divided into different logic modules according to the functions implemented by its parts.
在一实施例中,上述处理系统被所述处理器12执行时实现如下步骤:In an embodiment, when the processing system is executed by the processor 12, the following steps are implemented:
声学模型建立步骤,在互动式语音应答IVR场景下用户办理业务时,播报第一预设位数的随机码供该用户跟读,并在跟读后分别为本次播报的随机 码及该用户本次跟读的语音建立预设类型的声学模型;The acoustic model establishing step, in the interactive voice response IVR scenario, when the user handles the service, the random number of the first preset digit is broadcasted for the user to follow, and after the follow-up, the random code of the broadcast and the user respectively The acoustic model of the preset type is established this time;
在互动式语音应答IVR场景下,用户请求办理业务时发送身份识别码,例如身份证号,在接收到用户的请求后,分析用户所办理的业务是否需要进一步的身份验证,且根据用户的身份识别码分析该用户是否已注册有声纹,若需要进一步的身份验证且该用户已注册有声纹,则生成第一预设位数的随机码并采用语音合成技术以语音形式播报该随机码,引导用户进行跟读,该第一预设位数例如为8位。In the interactive voice response IVR scenario, when the user requests to handle the service, the identity code is sent, for example, the ID number. After receiving the user's request, the user analyzes whether the service handled by the user needs further identity verification, and according to the identity of the user. The identification code analyzes whether the user has registered voiceprints. If further authentication is required and the user has registered voiceprints, a random code of the first preset number of bits is generated and the random code is broadcasted by voice synthesis technology to guide the random code. The user performs a follow-up, and the first preset number of bits is, for example, 8 bits.
After the user reads back, an acoustic model of the preset type is established for the speech of the broadcast random code and for the speech the user read back this time. In a preferred embodiment, the preset type of acoustic model is a deep neural network-hidden Markov model, i.e., a DNN-HMM acoustic model. In other embodiments, the preset type may also be another acoustic model, for example a hidden Markov acoustic model.
In a specific example, take the DNN-HMM acoustic model: the HMM describes the dynamic changes of the speech signal, while each output node of the DNN estimates the posterior probability of a state of the continuous-density HMM, yielding the DNN-HMM model. Both the speech of the broadcast random code and the speech the user read back this time are sequences of syllables, and the text to be recognized is a sequence of characters. In this embodiment, when the DNN-HMM acoustic model is established, the DNN-HMM acoustic model of the broadcast random code's speech and that of the user's read-back speech are obtained by global character acoustic adaptive training based on a predetermined character speech library.
Forced overall alignment step: a forced overall alignment operation is performed on the acoustic model of the broadcast random code and the acoustic model of the speech the user read back this time, and the probability that the two aligned acoustic models are identical is computed with a predetermined algorithm.
Performing a forced overall alignment (force alignment) on the acoustic model of the broadcast random code and the acoustic model of the user's read-back speech, rather than the traditional word-by-word comparison, greatly reduces the amount of computation and helps improve the efficiency of identity recognition.
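To make the force-alignment idea concrete, here is a hedged toy sketch in Python: given frame-level log posteriors from a DNN and the left-to-right HMM state sequence implied by the broadcast code, a single dynamic program scores the best monotonic alignment over the whole utterance instead of comparing word by word. The array shapes and the averaging of the path score are illustrative assumptions, not the exact procedure of this application.

```python
import numpy as np

def forced_alignment_score(log_post: np.ndarray, state_seq: list[int]) -> float:
    """Score the best monotonic alignment of frames to a left-to-right
    HMM state sequence.
    log_post: (n_frames, n_states) frame-level log posteriors from the DNN.
    Returns the average per-frame log posterior along the best path."""
    n_frames, n_seq = log_post.shape[0], len(state_seq)
    neg = -np.inf
    dp = np.full((n_frames, n_seq), neg)
    dp[0, 0] = log_post[0, state_seq[0]]
    for t in range(1, n_frames):
        for s in range(n_seq):
            stay = dp[t - 1, s]                       # remain in state s
            advance = dp[t - 1, s - 1] if s > 0 else neg  # move from s-1
            dp[t, s] = max(stay, advance) + log_post[t, state_seq[s]]
    return float(dp[n_frames - 1, n_seq - 1]) / n_frames
```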
In one embodiment, the predetermined algorithm is a prior-posterior probability algorithm. In other embodiments it may also be a similarity algorithm. For example, the similarity algorithm may compute the edit distance between the characters of the two aligned acoustic models: the smaller the edit distance, the higher the probability that the two aligned models are identical. The similarity algorithm may also be a longest-common-subsequence algorithm: the smaller the difference between the length of the longest common subsequence and the lengths of the character sequences of the two aligned models, the higher the probability that the two models are identical.
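As a sketch of the two similarity measures just mentioned, the following Python functions compute the edit distance and the longest-common-subsequence length between two decoded character strings; mapping the raw edit distance to a normalized [0, 1] score is an assumption for illustration, not a formula given in this application.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two character strings."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def similarity_score(broadcast: str, read_back: str) -> float:
    """Illustrative mapping of edit distance to a [0, 1] score:
    the smaller the distance, the closer the score is to 1."""
    longest = max(len(broadcast), len(read_back), 1)
    return 1.0 - edit_distance(broadcast, read_back) / longest
```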
Identity verification step: if the probability that the two aligned acoustic models are identical is greater than a preset first threshold, the voiceprint feature vector of the speech the user read back this time is extracted, the standard voiceprint feature vector pre-stored after the user's successful registration is retrieved, and the distance between the voiceprint feature vector of the read-back speech and the standard voiceprint feature vector is computed to verify the user's identity.
In this embodiment, if the probability that the two aligned acoustic models are identical exceeds the preset first threshold, for example 0.985, the characters the user read back this time are deemed consistent with the broadcast random code. Because a random code is broadcast, fraud using synthesized speech prepared in advance by a user is effectively prevented, improving the security of identity recognition.
In an embodiment, the step of extracting the voiceprint feature vector of the speech the user read back this time comprises: performing pre-emphasis and windowing on the speech, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum; then performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the read-back speech from the MFCCs.
Specifically, the speech the user read back this time is first divided into frames, and the framed speech data is then pre-emphasized. Pre-emphasis is in effect high-pass filtering that attenuates the low-frequency content so that the high-frequency characteristics of the speech data stand out. The transfer function of the high-pass filter is H(z) = 1 - αz⁻¹, where z is the z-domain variable and α is a constant coefficient, preferably 0.97. Because framing causes the signal to deviate from the original speech to some extent, the speech data also needs to be windowed.
In this embodiment, the cepstral analysis performed on the Mel spectrum consists of taking the logarithm and applying an inverse transform; the inverse transform is generally implemented with a discrete cosine transform (DCT), and the 2nd through 13th DCT coefficients are taken as the MFCCs. The MFCCs are the voiceprint features of one frame of speech data; the MFCCs of all frames are assembled into a feature data matrix, which is the voiceprint feature vector of the speech the user read back this time.
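A minimal end-to-end sketch of this feature pipeline follows, in Python with NumPy, assuming a mono 16 kHz signal that is at least one frame long; the frame length, hop size, window choice, and number of Mel filters are illustrative assumptions, and a production system would typically rely on a tested library such as librosa or python_speech_features.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_features(signal, sample_rate=16000, frame_len=400, hop=160,
                  n_fft=512, n_mels=26, alpha=0.97):
    """Compute an (n_frames, 12) MFCC matrix: DCT coefficients 2..13."""
    signal = np.asarray(signal, dtype=np.float64)

    # Pre-emphasis: y[t] = x[t] - alpha * x[t-1]  (H(z) = 1 - alpha z^-1)
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])

    # Framing and Hamming windowing
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])

    # FFT magnitude -> power spectrum
    power = (np.abs(np.fft.rfft(frames, n_fft)) ** 2) / n_fft

    # Triangular Mel filter bank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)

    mel_spec = np.dot(power, fbank.T)
    log_mel = np.log(np.maximum(mel_spec, 1e-10))  # take the logarithm

    # Inverse transform via DCT-II; keep coefficients 2..13
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, 1:13]
```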
In this embodiment, the MFCCs of the speech data are used to compose the corresponding voiceprint feature vector. Because the Mel scale approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal cepstrum, this improves the accuracy of identity verification.
In an embodiment, the distance computed between the voiceprint feature vector of the speech the user read back this time and the standard voiceprint feature vector is the cosine distance between the two:
dist(x, y) = 1 - (x · y) / (‖x‖ ‖y‖)
where x is the standard voiceprint feature vector and y is the voiceprint feature vector of the speech the user read back this time.
If the cosine distance is less than or equal to a preset distance threshold, identity verification passes; if the cosine distance is greater than the preset distance threshold, identity verification fails.
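The decision rule above can be sketched in a few lines of Python with NumPy. Collapsing the per-frame MFCC matrix to one fixed-length vector by averaging over frames, and using 0.3 as the distance threshold, are illustrative assumptions, since the application fixes neither detail.

```python
import numpy as np

def utterance_vector(mfcc_matrix: np.ndarray) -> np.ndarray:
    """Collapse an (n_frames, 12) MFCC matrix to a single fixed-length
    vector by averaging over frames (an illustrative pooling choice)."""
    return mfcc_matrix.mean(axis=0)

def cosine_distance(x: np.ndarray, y: np.ndarray) -> float:
    """1 - cosine similarity; 0 means the vectors point the same way."""
    x, y = x.ravel(), y.ravel()
    return 1.0 - float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def verify(standard_vec: np.ndarray, current_vec: np.ndarray,
           threshold: float = 0.3) -> bool:
    """Pass only if the cosine distance does not exceed the threshold."""
    return cosine_distance(standard_vec, current_vec) <= threshold
```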
In an embodiment, the standard voiceprint feature vector is pre-stored after the user registers successfully. The voiceprint registration step comprises:
when the user performs voiceprint registration in the IVR scenario, broadcasting a random code with a second preset number of digits for the user to read back a preset number of times, and after each read-back, establishing acoustic models of the preset type for the broadcast random code and for the speech the user read back;
performing a forced overall alignment operation on the acoustic model of each broadcast random code and the acoustic model of the corresponding read-back speech, and computing with the predetermined algorithm the probability that the two aligned acoustic models are identical;
if the probabilities for all aligned model pairs are greater than a preset second threshold, extracting the voiceprint feature vector of each read-back speech and computing the pairwise distances between the voiceprint feature vectors, to analyze whether every read-back came from the same user;
if so, storing the voiceprint feature vector as the user's standard voiceprint feature vector;
if not, prompting the user to record again and repeating the voiceprint registration step.
In the IVR scenario, the user sends an identification code, for example an ID card number, when requesting registration. After the user's request is received, a random code with the second preset number of digits (for example, 8 digits) is generated and broadcast in speech form using speech synthesis, guiding the user to read it back a preset number of times (for example, 3 times).
After each read-back, an acoustic model of the preset type is established for the speech of the broadcast random code and for the user's read-back speech. In a preferred embodiment, the preset type of acoustic model is the deep neural network-hidden Markov model, i.e., the DNN-HMM acoustic model; in other embodiments it may be another acoustic model, for example a hidden Markov acoustic model. For specific examples, refer to the foregoing embodiments, which are not repeated here.
In a specific example, take the DNN-HMM acoustic model: the HMM describes the dynamic changes of the speech signal, while each output node of the DNN estimates the posterior probability of a state of the continuous-density HMM, yielding the DNN-HMM model. The speech of each broadcast random code and the user's read-back speech are sequences of syllables, and the text to be recognized is a sequence of characters. When the DNN-HMM acoustic model is established, the DNN-HMM acoustic models of the broadcast random code's speech and of the user's read-back speech are obtained by global character acoustic adaptive training based on a predetermined character speech library.
Performing a forced overall alignment (force alignment) on the acoustic model of each broadcast random code and the acoustic model of the user's read-back speech, rather than the traditional word-by-word comparison, greatly reduces the amount of computation and helps improve the efficiency of identity recognition.
In one embodiment, the predetermined algorithm is a prior-posterior probability algorithm; in other embodiments it may be a similarity algorithm. For specific examples, refer to the foregoing embodiments, which are not repeated here.
In this embodiment, if the probabilities that the aligned acoustic model pairs are identical all exceed the preset second threshold, for example 0.985, the characters the user read back each time are deemed consistent with the broadcast random code. Because a random code is broadcast, fraud using synthesized speech prepared in advance by a user is effectively prevented, improving the security of identity recognition.
In an embodiment, the step of extracting the voiceprint feature vector of each read-back speech is substantially the same as the voiceprint feature vector extraction method of the foregoing embodiment and is not repeated here.
In an embodiment, the step of computing the pairwise distances between the voiceprint feature vectors is substantially the same as the cosine distance computation described above and is not repeated here.
If each cosine distance is less than or equal to the preset distance threshold, every read-back came from the same user, and the voiceprint feature vector is stored as the user's standard voiceprint feature vector; if a cosine distance is greater than the preset distance threshold, the read-backs did not all come from the same user, and the user is prompted to register again.
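A minimal sketch of this registration consistency check follows, reusing the `cosine_distance` helper from the earlier sketch; averaging the enrollment vectors into the stored template is an illustrative choice, as the application does not state how the stored vector is derived from the multiple read-backs.

```python
from itertools import combinations
import numpy as np

def enroll(vectors: list[np.ndarray], threshold: float = 0.3):
    """Return the standard voiceprint vector if all pairwise cosine
    distances stay within the threshold, else None (re-register)."""
    for a, b in combinations(vectors, 2):
        if cosine_distance(a, b) > threshold:
            return None  # read-backs inconsistent: prompt re-registration
    # Illustrative pooling: average the enrollment vectors into one template.
    return np.mean(vectors, axis=0)
```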
Compared with the prior art, when the present application performs identity recognition in the IVR scenario, having the user read back a random code effectively prevents fraud using pre-prepared synthesized speech, and combining the random code with voiceprint recognition achieves double verification of the user's identity, so that the user's identity can be confirmed accurately and the security of identity verification in the IVR scenario is improved. In addition, performing the forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the user's read-back speech reduces the amount of computation and improves the efficiency of identity recognition.
As shown in FIG. 2, FIG. 2 is a schematic flowchart of an embodiment of the identity verification method of the present application. The identity verification method comprises the following steps:
Step S1: when a user transacts business in the interactive voice response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read back, and after the read-back, establishing an acoustic model of a preset type for the broadcast random code and another for the speech the user read back this time.
In the IVR scenario, the user sends an identification code, for example an ID card number, when requesting to transact business. After the user's request is received, the system analyzes whether the requested business requires further identity verification and, based on the user's identification code, whether the user has already registered a voiceprint. If further verification is required and the user has a registered voiceprint, a random code with the first preset number of digits (for example, 8 digits) is generated and broadcast in speech form using speech synthesis, guiding the user to read it back.
After the user reads back, an acoustic model of the preset type is established for the speech of the broadcast random code and for the speech the user read back this time. In a preferred embodiment, the preset type of acoustic model is the deep neural network-hidden Markov model, i.e., the DNN-HMM acoustic model; in other embodiments it may be another acoustic model, for example a hidden Markov acoustic model.
In a specific example, take the DNN-HMM acoustic model: the HMM describes the dynamic changes of the speech signal, while each output node of the DNN estimates the posterior probability of a state of the continuous-density HMM, yielding the DNN-HMM model. Both the speech of the broadcast random code and the speech the user read back this time are sequences of syllables, and the text to be recognized is a sequence of characters. When the DNN-HMM acoustic model is established, the DNN-HMM acoustic models of the broadcast random code's speech and of the user's read-back speech are obtained by global character acoustic adaptive training based on a predetermined character speech library.
Step S2: performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read back this time, and computing with a predetermined algorithm the probability that the two aligned acoustic models are identical.
Performing a forced overall alignment (force alignment) on the two acoustic models, rather than the traditional word-by-word comparison, greatly reduces the amount of computation and helps improve the efficiency of identity recognition.
In one embodiment, the predetermined algorithm is a prior-posterior probability algorithm. In other embodiments it may also be a similarity algorithm. For example, the similarity algorithm may compute the edit distance between the characters of the two aligned acoustic models: the smaller the edit distance, the higher the probability that the two aligned models are identical. The similarity algorithm may also be a longest-common-subsequence algorithm: the smaller the difference between the length of the longest common subsequence and the lengths of the character sequences of the two aligned models, the higher the probability that the two models are identical.
Step S3: if the probability that the two aligned acoustic models are identical is greater than a preset first threshold, extracting the voiceprint feature vector of the speech the user read back this time, retrieving the standard voiceprint feature vector pre-stored after the user's successful registration, and computing the distance between the two vectors to verify the user's identity.
In this embodiment, if the probability that the two aligned acoustic models are identical exceeds the preset first threshold, for example 0.985, the characters the user read back this time are deemed consistent with the broadcast random code. Because a random code is broadcast, fraud using synthesized speech prepared in advance by a user is effectively prevented, improving the security of identity recognition.
In an embodiment, the step of extracting the voiceprint feature vector of the speech the user read back this time comprises: performing pre-emphasis and windowing on the speech, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum; then performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the read-back speech from the MFCCs.
Specifically, the speech the user read back this time is first divided into frames, and the framed speech data is then pre-emphasized. Pre-emphasis is in effect high-pass filtering that attenuates the low-frequency content so that the high-frequency characteristics of the speech data stand out. The transfer function of the high-pass filter is H(z) = 1 - αz⁻¹, where z is the z-domain variable and α is a constant coefficient, preferably 0.97. Because framing causes the signal to deviate from the original speech to some extent, the speech data also needs to be windowed.
In this embodiment, the cepstral analysis performed on the Mel spectrum consists of taking the logarithm and applying an inverse transform; the inverse transform is generally implemented with a discrete cosine transform (DCT), and the 2nd through 13th DCT coefficients are taken as the MFCCs. The MFCCs are the voiceprint features of one frame of speech data; the MFCCs of all frames are assembled into a feature data matrix, which is the voiceprint feature vector of the speech the user read back this time.
In this embodiment, the MFCCs of the speech data are used to compose the corresponding voiceprint feature vector. Because the Mel scale approximates the human auditory system more closely than the linearly spaced frequency bands used in the normal cepstrum, this improves the accuracy of identity verification.
In an embodiment, the distance computed between the voiceprint feature vector of the speech the user read back this time and the standard voiceprint feature vector is the cosine distance between the two:
dist(x, y) = 1 - (x · y) / (‖x‖ ‖y‖)
where x is the standard voiceprint feature vector and y is the voiceprint feature vector of the speech the user read back this time.
If the cosine distance is less than or equal to a preset distance threshold, identity verification passes; if the cosine distance is greater than the preset distance threshold, identity verification fails.
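Putting steps S1 through S3 together, the following hedged sketch shows one possible control flow for the verification pass. The `record_read_back`, `align_and_score`, and `load_standard_voiceprint` helpers are hypothetical names standing in for the IVR audio capture, the DNN-HMM forced alignment, and the voiceprint store described above, not interfaces defined by this application; `prompt_user`, `mfcc_features`, `utterance_vector`, and `verify` come from the earlier sketches.

```python
def authenticate(user_id: str, first_threshold: float = 0.985,
                 distance_threshold: float = 0.3) -> bool:
    """One possible S1-S3 control flow for IVR voiceprint verification."""
    # S1: broadcast a random code and capture the user's read-back.
    code = prompt_user(num_digits=8)
    audio = record_read_back()                # hypothetical IVR capture

    # S2: force-align the two acoustic models and score their agreement.
    same_prob = align_and_score(code, audio)  # hypothetical DNN-HMM step
    if same_prob <= first_threshold:
        return False  # read-back does not match the broadcast code

    # S3: compare voiceprints against the enrolled template.
    current = utterance_vector(mfcc_features(audio))
    standard = load_standard_voiceprint(user_id)  # hypothetical store lookup
    return verify(standard, current, threshold=distance_threshold)
```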
In an embodiment, the standard voiceprint feature vector is pre-stored after the user registers successfully. The voiceprint registration step comprises:
when the user performs voiceprint registration in the IVR scenario, broadcasting a random code with a second preset number of digits for the user to read back a preset number of times, and after each read-back, establishing acoustic models of the preset type for the broadcast random code and for the speech the user read back;
performing a forced overall alignment operation on the acoustic model of each broadcast random code and the acoustic model of the corresponding read-back speech, and computing with the predetermined algorithm the probability that the two aligned acoustic models are identical;
if the probabilities for all aligned model pairs are greater than a preset second threshold, extracting the voiceprint feature vector of each read-back speech and computing the pairwise distances between the voiceprint feature vectors, to analyze whether every read-back came from the same user;
if so, storing the voiceprint feature vector as the user's standard voiceprint feature vector;
if not, prompting the user to record again and repeating the voiceprint registration step.
In the IVR scenario, the user sends an identification code, for example an ID card number, when requesting registration. After the user's request is received, a random code with the second preset number of digits (for example, 8 digits) is generated and broadcast in speech form using speech synthesis, guiding the user to read it back a preset number of times (for example, 3 times).
After each read-back, an acoustic model of the preset type is established for the speech of the broadcast random code and for the user's read-back speech. In a preferred embodiment, the preset type of acoustic model is the deep neural network-hidden Markov model, i.e., the DNN-HMM acoustic model; in other embodiments it may be another acoustic model, for example a hidden Markov acoustic model. For specific examples, refer to the foregoing embodiments, which are not repeated here.
In a specific example, take the DNN-HMM acoustic model: the HMM describes the dynamic changes of the speech signal, while each output node of the DNN estimates the posterior probability of a state of the continuous-density HMM, yielding the DNN-HMM model. The speech of each broadcast random code and the user's read-back speech are sequences of syllables, and the text to be recognized is a sequence of characters. When the DNN-HMM acoustic model is established, the DNN-HMM acoustic models of the broadcast random code's speech and of the user's read-back speech are obtained by global character acoustic adaptive training based on a predetermined character speech library.
Performing a forced overall alignment (force alignment) on the acoustic model of each broadcast random code and the acoustic model of the user's read-back speech, rather than the traditional word-by-word comparison, greatly reduces the amount of computation and helps improve the efficiency of identity recognition.
In one embodiment, the predetermined algorithm is a prior-posterior probability algorithm; in other embodiments it may be a similarity algorithm. For specific examples, refer to the foregoing embodiments, which are not repeated here.
In this embodiment, if the probabilities that the aligned acoustic model pairs are identical all exceed the preset second threshold, for example 0.985, the characters the user read back each time are deemed consistent with the broadcast random code. Because a random code is broadcast, fraud using synthesized speech prepared in advance by a user is effectively prevented, improving the security of identity recognition.
In an embodiment, the step of extracting the voiceprint feature vector of each read-back speech is substantially the same as the voiceprint feature vector extraction method of the foregoing embodiment and is not repeated here.
In an embodiment, the step of computing the pairwise distances between the voiceprint feature vectors is substantially the same as the cosine distance computation described above and is not repeated here.
If each cosine distance is less than or equal to the preset distance threshold, every read-back came from the same user, and the voiceprint feature vector is stored as the user's standard voiceprint feature vector; if a cosine distance is greater than the preset distance threshold, the read-backs did not all come from the same user, and the user is prompted to register again.
The present application further provides a computer-readable storage medium storing a processing system which, when executed by a processor, implements the steps of the identity verification method described above.
The serial numbers of the above embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
Through the description of the above embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to perform the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit the patent scope of the present application. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of the present application.

Claims (20)

  1. An electronic device, characterized in that the electronic device comprises a memory and a processor connected to the memory, the memory storing a processing system operable on the processor, the processing system, when executed by the processor, implementing the following steps:
    an acoustic model establishment step: when a user transacts business in an interactive voice response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read back, and after the read-back, establishing an acoustic model of a preset type for the broadcast random code and another for the speech the user read back this time;
    a forced overall alignment step: performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read back this time, and computing with a predetermined algorithm the probability that the two aligned acoustic models are identical;
    an identity verification step: if the probability that the two aligned acoustic models are identical is greater than a preset first threshold, extracting the voiceprint feature vector of the speech the user read back this time, retrieving the standard voiceprint feature vector pre-stored after the user's successful registration, and computing the distance between the voiceprint feature vector of the read-back speech and the standard voiceprint feature vector to verify the user's identity.
  2. The electronic device according to claim 1, characterized in that the processing system, when executed by the processor, further implements the following steps:
    when the user performs voiceprint registration in the IVR scenario, broadcasting a random code with a second preset number of digits for the user to read back a preset number of times, and after each read-back, establishing acoustic models of the preset type for the broadcast random code and for the speech the user read back;
    performing a forced overall alignment operation on the acoustic model of each broadcast random code and the acoustic model of the corresponding read-back speech, and computing with the predetermined algorithm the probability that the two aligned acoustic models are identical;
    if the probabilities for all aligned model pairs are greater than a preset second threshold, extracting the voiceprint feature vector of each read-back speech and computing the pairwise distances between the voiceprint feature vectors, to analyze whether every read-back came from the same user;
    if so, storing the voiceprint feature vector as the user's standard voiceprint feature vector.
  3. The electronic device according to claim 1, characterized in that the preset type of acoustic model is a deep neural network-hidden Markov model.
  4. The electronic device according to claim 2, characterized in that the preset type of acoustic model is a deep neural network-hidden Markov model.
  5. The electronic device according to claim 1, characterized in that the step of extracting the voiceprint feature vector of the speech the user read back this time comprises:
    performing pre-emphasis and windowing on the speech the user read back this time, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the speech the user read back this time from the MFCCs.
  6. The electronic device according to claim 2, characterized in that the step of extracting the voiceprint feature vector of the speech the user read back this time comprises:
    performing pre-emphasis and windowing on the speech the user read back this time, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the speech the user read back this time from the MFCCs.
  7. The electronic device according to claim 1 or 2, characterized in that the step of computing the distance between the voiceprint feature vector of the speech the user read back this time and the standard voiceprint feature vector comprises computing the cosine distance:
    dist(x, y) = 1 - (x · y) / (‖x‖ ‖y‖)
    where x is the standard voiceprint feature vector and y is the voiceprint feature vector of the speech the user read back this time.
  8. An identity verification method, characterized in that the identity verification method comprises:
    S1: when a user transacts business in an interactive voice response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read back, and after the read-back, establishing an acoustic model of a preset type for the broadcast random code and another for the speech the user read back this time;
    S2: performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read back this time, and computing with a predetermined algorithm the probability that the two aligned acoustic models are identical;
    S3: if the probability that the two aligned acoustic models are identical is greater than a preset first threshold, extracting the voiceprint feature vector of the speech the user read back this time, retrieving the standard voiceprint feature vector pre-stored after the user's successful registration, and computing the distance between the voiceprint feature vector of the read-back speech and the standard voiceprint feature vector to verify the user's identity.
  9. The identity verification method according to claim 8, characterized in that before step S1, the method further comprises:
    S01: when the user performs voiceprint registration in the IVR scenario, broadcasting a random code with a second preset number of digits for the user to read back a preset number of times, and after each read-back, establishing acoustic models of the preset type for the broadcast random code and for the speech the user read back;
    S02: performing a forced overall alignment operation on the acoustic model of each broadcast random code and the acoustic model of the corresponding read-back speech, and computing with the predetermined algorithm the probability that the two aligned acoustic models are identical;
    S03: if the probabilities for all aligned model pairs are greater than a preset second threshold, extracting the voiceprint feature vector of each read-back speech and computing the pairwise distances between the voiceprint feature vectors, to analyze whether every read-back came from the same user;
    S04: if so, storing the voiceprint feature vector as the user's standard voiceprint feature vector.
  10. The identity verification method according to claim 8, characterized in that the preset type of acoustic model is a deep neural network-hidden Markov model.
  11. The identity verification method according to claim 9, characterized in that the preset type of acoustic model is a deep neural network-hidden Markov model.
  12. The identity verification method according to claim 8, characterized in that the step of extracting the voiceprint feature vector of the speech the user read back this time comprises:
    performing pre-emphasis and windowing on the speech the user read back this time, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the speech the user read back this time from the MFCCs.
  13. The identity verification method according to claim 9, characterized in that the step of extracting the voiceprint feature vector of the speech the user read back this time comprises:
    performing pre-emphasis and windowing on the speech the user read back this time, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the speech the user read back this time from the MFCCs.
  14. The identity verification method according to claim 8 or 9, characterized in that the step of computing the distance between the voiceprint feature vector of the speech the user read back this time and the standard voiceprint feature vector comprises computing the cosine distance:
    dist(x, y) = 1 - (x · y) / (‖x‖ ‖y‖)
    where x is the standard voiceprint feature vector and y is the voiceprint feature vector of the speech the user read back this time.
  15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a processing system which, when executed by a processor, implements the following steps:
    an acoustic model establishment step: when a user transacts business in an interactive voice response (IVR) scenario, broadcasting a random code with a first preset number of digits for the user to read back, and after the read-back, establishing an acoustic model of a preset type for the broadcast random code and another for the speech the user read back this time;
    a forced overall alignment step: performing a forced overall alignment operation on the acoustic model of the broadcast random code and the acoustic model of the speech the user read back this time, and computing with a predetermined algorithm the probability that the two aligned acoustic models are identical;
    an identity verification step: if the probability that the two aligned acoustic models are identical is greater than a preset first threshold, extracting the voiceprint feature vector of the speech the user read back this time, retrieving the standard voiceprint feature vector pre-stored after the user's successful registration, and computing the distance between the voiceprint feature vector of the read-back speech and the standard voiceprint feature vector to verify the user's identity.
  16. The computer-readable storage medium according to claim 15, characterized in that the processing system, when executed by the processor, further implements the following steps:
    when the user performs voiceprint registration in the IVR scenario, broadcasting a random code with a second preset number of digits for the user to read back a preset number of times, and after each read-back, establishing acoustic models of the preset type for the broadcast random code and for the speech the user read back;
    performing a forced overall alignment operation on the acoustic model of each broadcast random code and the acoustic model of the corresponding read-back speech, and computing with the predetermined algorithm the probability that the two aligned acoustic models are identical;
    if the probabilities for all aligned model pairs are greater than a preset second threshold, extracting the voiceprint feature vector of each read-back speech and computing the pairwise distances between the voiceprint feature vectors, to analyze whether every read-back came from the same user;
    if so, storing the voiceprint feature vector as the user's standard voiceprint feature vector.
  17. The computer-readable storage medium according to claim 15, characterized in that the preset type of acoustic model is a deep neural network-hidden Markov model.
  18. The computer-readable storage medium according to claim 16, characterized in that the preset type of acoustic model is a deep neural network-hidden Markov model.
  19. The computer-readable storage medium according to claim 15, characterized in that the step of extracting the voiceprint feature vector of the speech the user read back this time comprises:
    performing pre-emphasis and windowing on the speech the user read back this time, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the speech the user read back this time from the MFCCs.
  20. The computer-readable storage medium according to claim 16, characterized in that the step of extracting the voiceprint feature vector of the speech the user read back this time comprises:
    performing pre-emphasis and windowing on the speech the user read back this time, applying a Fourier transform to each windowed frame to obtain the corresponding spectrum, and feeding the spectrum into a Mel filter bank to obtain the Mel spectrum;
    performing cepstral analysis on the Mel spectrum to obtain the Mel-frequency cepstral coefficients (MFCC), and composing the voiceprint feature vector of the speech the user read back this time from the MFCCs.
PCT/CN2018/102208 2018-04-09 2018-08-24 Electronic device, identity verification method, and storage medium WO2019196305A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810311721.2 2018-04-09
CN201810311721.2A CN108694952B (en) 2018-04-09 2018-04-09 Electronic device, identity authentication method and storage medium

Publications (1)

Publication Number Publication Date
WO2019196305A1 true WO2019196305A1 (en) 2019-10-17

Family

ID=63844884

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102208 WO2019196305A1 (en) 2018-04-09 2018-08-24 Electronic device, identity verification method, and storage medium

Country Status (2)

Country Link
CN (1) CN108694952B (en)
WO (1) WO2019196305A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109448732B (en) * 2018-12-27 2021-06-08 科大讯飞股份有限公司 Digital string voice processing method and device
CN110536029B (en) * 2019-08-15 2021-11-16 咪咕音乐有限公司 Interaction method, network side equipment, terminal equipment, storage medium and system
CN110491393B (en) * 2019-08-30 2022-04-22 科大讯飞股份有限公司 Training method and related device for voiceprint representation model
CN111161746B (en) * 2019-12-31 2022-04-15 思必驰科技股份有限公司 Voiceprint registration method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103680497A (en) * 2012-08-31 2014-03-26 百度在线网络技术(北京)有限公司 Voice recognition system and voice recognition method based on video
CN103986725A (en) * 2014-05-29 2014-08-13 中国农业银行股份有限公司 Client side, server side and identity authentication system and method
CN107517207A (en) * 2017-03-13 2017-12-26 平安科技(深圳)有限公司 Server, auth method and computer-readable recording medium

Also Published As

Publication number Publication date
CN108694952A (en) 2018-10-23
CN108694952B (en) 2020-04-28

Similar Documents

Publication Publication Date Title
JP6621536B2 (en) Electronic device, identity authentication method, system, and computer-readable storage medium
WO2018166187A1 (en) Server, identity verification method and system, and a computer-readable storage medium
WO2019100606A1 (en) Electronic device, voiceprint-based identity verification method and system, and storage medium
JP6429945B2 (en) Method and apparatus for processing audio data
WO2019196305A1 (en) Electronic device, identity verification method, and storage medium
WO2019136912A1 (en) Electronic device, identity authentication method and system, and storage medium
WO2021051572A1 (en) Voice recognition method and apparatus, and computer device
CN107610709A (en) A kind of method and system for training Application on Voiceprint Recognition model
WO2019136911A1 (en) Voice recognition method for updating voiceprint data, terminal device, and storage medium
EP3989217B1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
KR20160011709A (en) Method, apparatus and system for payment validation
EP3373177B1 (en) Methods and systems for determining user liveness
WO2019179033A1 (en) Speaker authentication method, server, and computer-readable storage medium
CN108650266B (en) Server, voiceprint verification method and storage medium
KR20210050884A (en) Registration method and apparatus for speaker recognition
CN105224844B (en) Verification method, system and device
CN108447489B (en) A continuous voiceprint authentication method and system with feedback
WO2019218515A1 (en) Server, voiceprint-based identity authentication method, and storage medium
CN113112992B (en) Voice recognition method and device, storage medium and server
US20230153815A1 (en) Methods and systems for training a machine learning model and authenticating a user with the model
CN113436633B (en) Speaker recognition method, speaker recognition device, computer equipment and storage medium
US20230153408A1 (en) Methods and systems for training a machine learning model and authenticating a user with the model
CN117238297A (en) Method, apparatus, device, medium and program product for sound signal processing
CN112382296A (en) Method and device for voiceprint remote control of wireless audio equipment
US20250046317A1 (en) Methods and systems for authenticating users

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18914251

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18914251

Country of ref document: EP

Kind code of ref document: A1