US11948582B2 - Systems and methods for speaker verification - Google Patents
Systems and methods for speaker verification
- Publication number
- US11948582B2 (Application US 16/363,658)
- Authority
- US
- United States
- Prior art keywords
- user
- voiceprint
- speech
- speaker
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L17/24—Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/30—Authentication, i.e. establishing the identity or authorisation of security principals
- G06F21/31—User authentication
- G06F21/32—User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/06—Decision making techniques; Pattern matching strategies
- G10L17/14—Use of phonemic categorisation or speech recognition prior to speaker recognition or verification
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Definitions
- the present application relates to systems, devices, apparatuses and methods of using voice biometrics. More particularly, the application relates to voice biometrics applied to live natural speech and employing context aware models to verify users.
- Computing systems, as well as devices, apparatuses and services, often have a controlled access mechanism for regulating user access to their resources and data. These mechanisms may be implemented in hardware or software or a combination of the two.
- the most commonly used controlled access mechanism is the use of credentials, often in the form of a username and password pair.
- more complex mechanisms may be employed where the user must be in possession of a physical token (e.g. a token generation device for e-banking code entry), key, or card (e.g. a bank card for use at Automated Teller Machines (ATMs)), etc.
- biometric methods have been tested and employed in Voice Biometric (VB) systems, in fingerprint reading, iris scanning, voice identification, face identification, etc. Their adoption and performance have had various degrees of success. Some issues related to their success are the varying level of correct user identification as a result of environmental conditions (e.g. poor lighting, noise, multiple concurrent speakers), robustness to users pretending to be someone else or using duplicates of the real user's characteristics to fool the identification/verification system (e.g. playback of the target user's voice, showing a photo of the target user's face or fingerprint), simplicity of use, and the necessity for special User Interface (UI) hardware.
- voice-based user verification has been gaining acceptance as either a standalone verification mechanism or in combination with other mechanisms.
- Existing systems often perform either speech recognition or voice recognition, but not both processes at the same time. Speech recognition identifies particular words that are spoken, while voice recognition identifies the user or speaker that is speaking.
- voice recognition is performed using voice biometrics, i.e. the processing of certain voice characteristics.
- This mechanism can operate as “user identification”, where the system identifies a user based on his voice without previously knowing who the user is or without the user claiming an identity, and as “user verification”, where the claimed or estimated identity of the user is known and the system verifies whether the claimed or estimated identity is correct.
- Existing systems use voice biometrics for speaker verification and/or identification. They frequently use special text or phrases that the user dictates during an enrolment phase. Alternatively, there are examples of conversational scenarios where the users must say a password and (upon correct identification of said password) are then asked to provide an answer to a challenge question from a set of questions previously stored by the system, which may relate to the specific user.
- Some of these systems maintain voiceprint databases and, when the speaker utters any phrase in any of the above scenarios, one or more “live” voiceprints are created and compared against the corresponding voiceprints stored in the database. Following correct verification (and in some cases identification) of the user from his voiceprint, an action is taken by the system, usually granting the said user access to a system or service.
- Variations of the voiceprint matching mechanisms include adaptation of relevant thresholds, statistical processing, and use of multistage matching with an initial match providing a group of potential users from those whose data are stored in the database, and subsequent stage(s) refining the matching to get the most probable match.
- Such systems typically employ Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU). An NLU module is used to understand what the user says by identifying words in the user's speech which correspond to passwords or to answers to the system's questions, i.e. items which the user has to provide and the system has to capture from the user's speech and compare against the corresponding data stored in the user's records in one or more databases.
- this is a simple example of a user uttering a credential such as a PIN number (“does the spoken password match the stored password?”).
- the proposed speaker biometric verification is based on the notion of Active Words (AW) where the speaker's live speech is analyzed and compared against stored voiceprints.
- the present innovative solution analyzes the uttered speech to identify words, classified as “active words”, that is, words that are frequently used under a specific context (e.g. a financial transaction).
- the present solution searches for these active words and compares them with stored active word voiceprints for the same speaker, previously derived during past interactions of the user with the system, and created from a similar model built from the previous interactions of sets of users with the system.
- the use of Active Words solves the problem encountered in voice biometrics where very short speech segments are not enough to provide an accurate user identification and where the user's interaction routine with the system has to be changed.
- a speaker initiates a communication session with the VB system, which transcribes the speaker's voice to text and identifies uttered words.
- the system captures and uses session related data to pre-identify the user and then limit the similarity-score calculations to comparisons between the live Active Word voiceprints and only a subset of the archived voiceprints, namely those which belong to the same user.
- the system identifies the current context, where the content and/or context is derived from transcribed speech analysis and/or identification data relating to the specific system application and user session. Transcribed words are analyzed to select frequently used words (AWs), a live speaker voiceprint is produced for each AW, and this voiceprint is compared to one or more stored voiceprints for the same AW of the same speaker.
- the stored voiceprint(s) has been constructed during enrolment and may have been updated during previous user sessions.
- the stored voiceprint(s) may be stored in a compact way by exploiting correlations inside the voiceprint and with AW models representing average representations of uttered active words (and other words) of a set of system speakers-users. The system then creates similarity scores with the speaker's archived voiceprints for the identified active words.
- a text-independent speaker identification or verification model is also included, which extracts one voiceprint per utterance.
- the similarity score provided by the text-independent model is combined with the similarity scores provided by the AW-based model using score-level fusion techniques.
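By way of illustration only, and not as part of the disclosed system, a minimal sketch of such score-level fusion, assuming a simple weighted combination of the best AW-based score with the text-independent score (the weight, the fallback behaviour and the use of the maximum are assumptions, not values specified above):

```python
import numpy as np

def fuse_scores(aw_scores, ti_score, aw_weight=0.7):
    """Combine AW-based similarity scores with a text-independent score.

    aw_scores : per-active-word similarity scores from the AW-based model
    ti_score  : single similarity score from the text-independent model
    aw_weight : relative weight given to the AW-based evidence (assumed value)
    """
    if not aw_scores:                        # no active words detected in the utterance
        return ti_score                      # fall back to the text-independent score
    aw_component = float(np.max(aw_scores))  # use the closest (highest) AW match
    return aw_weight * aw_component + (1.0 - aw_weight) * ti_score

# Example: two AW matches plus a text-independent score
print(fuse_scores([0.62, 0.81], ti_score=0.70))  # approximately 0.777
```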
- the present innovative solution and all its exemplary aspects allow the speaker to speak in natural language, without the need to change his pace, complexity, intonation, or any other characteristic of his speech. They also eliminate the need (present in the prior art) for the user to enroll by uttering enrolment words and phrases, or to repeat predefined words and phrases every time they are to be identified. Also, the present innovative solution eliminates the need for long uttered phrases, thereby avoiding altering the user's interaction pattern and adding unnecessary dialogue turns merely to extend the user's utterances so that the system can accurately verify them. As a result, simpler, faster and more accurate voice biometric verification is achieved.
- the system includes an interactive voice recognition (IVR) module arranged to perform a speech conversation with a first user and receive a first user identifier, where the speech conversation has an interaction context based on a subject matter of the speech conversation.
- the system includes a datastore arranged to store a group of active words associated with the interaction context where each active word is selected based on one or more selection criterion derived from conversations of a population of users.
- the datastore also stores the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user where each first user voiceprint corresponds to each active word of the group of active words.
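By way of illustration only, a hypothetical in-memory sketch of such a datastore, with active words grouped per interaction context and one stored voiceprint per active word per enrolled user; all class names, field names, identifiers and vector contents are placeholders, not part of the disclosure:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class UserRecord:
    user_id: str
    # active word -> stored voiceprint (e.g. an i-vector/x-vector as a list of floats)
    voiceprints: Dict[str, List[float]] = field(default_factory=dict)

@dataclass
class Datastore:
    # interaction context (e.g. "banking") -> group of active words for that context
    active_words: Dict[str, List[str]] = field(default_factory=dict)
    # first user identifier -> that user's per-active-word voiceprints
    users: Dict[str, UserRecord] = field(default_factory=dict)

store = Datastore(
    active_words={"banking": ["account", "balance"]},
    users={"user-42": UserRecord("user-42", {"account": [0.1, 0.3], "balance": [0.2, 0.4]})},
)
```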
- the system also includes an automated speech recognition (ASR) module arranged to perform speech recognition of the first user audio provided during the speech conversation, where the ASR module converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements;
- the system further includes a voice biometric (VB) module arranged to: i) receive the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements, ii) receive the first user identifier, iii) compare the plurality of corresponding text elements with each active word of the group of active words, iv) identify text elements matching each active word of the group of active words, and v) generate a captured voiceprint for each captured audio element corresponding to each text element matching each active word, vi) compare each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words, vii) generate a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and viii) if the similarity score is greater than or equal to a threshold value, indicate that the first user identifier is verified or, if the similarity score is less than the threshold value, indicate that the first user identifier is not verified.
- the similarity score is based on the closest comparison of one of the captured voiceprints with one of the first user voiceprints.
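By way of illustration only, a minimal sketch of steps iii) through viii), under the assumptions that voiceprints are plain vectors compared with cosine similarity, that the transcribed text elements are aligned index-for-index with their captured voiceprints, and that the similarity score is the closest (maximum) per-word match as stated above; the function names and the threshold value are placeholders:

```python
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(text_elements, captured_voiceprints, stored_voiceprints, active_words, threshold=0.75):
    """Sketch of claim steps iii)-viii).

    text_elements        : transcribed words, aligned with captured_voiceprints
    captured_voiceprints : captured voiceprint per transcribed element (placeholder vectors)
    stored_voiceprints   : {active word: stored first-user voiceprint}
    """
    scores = []
    for word, live_vp in zip(text_elements, captured_voiceprints):
        if word in active_words and word in stored_voiceprints:        # steps iii)-iv)
            scores.append(cosine(live_vp, stored_voiceprints[word]))   # steps v)-vii)
    if not scores:
        return False, None          # no active word uttered; cannot verify from AWs alone
    best = max(scores)              # similarity score from the closest comparison
    return best >= threshold, best  # step viii): threshold decision
```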
- An audio element may include at least one of a phoneme, syllable, word, subword, or phrase.
- a text element may include at least one of a word, subword, and phrase.
- Speech recognition may be implemented using at least one speech recognition model including vector analyses, Hidden Markov models (HMMs), Dynamic time warping based speech recognition, a back-off model, Neural networks, Deep feedforward and recurrent neural networks, Long short-term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and/or end to end automatic speech recognition.
- the one or more of the comparisons of each captured voiceprint with each first user voiceprint may include at least one voice recognition model selected from the group of vector analyses, Hidden Markov models (HMMs), Dynamic time warping based speech recognition, a back-off model, Neural networks, Deep feedforward and recurrent neural networks, Long short-term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and end to end automatic speech recognition.
- the one or more selection criterion for active words includes at least one selected from the group of frequency of use, type of word, amount of phonemes in a word, combination of phonemes, and amount of syllables in a word or phrase.
- the interaction context may include a type of interaction that the first user has with the IVR.
- the interaction context may include a banking application, a healthcare application, a frequent flyer rewards application, a utility provider application, a mobile service provider application, or any IVR-based application that enables users to interface with a service or product provider.
- a server configured to perform speaker verification includes a communications interface arranged to perform a speech conversation with a first user and receive a first user identifier, where the speech conversation has an interaction context based on a subject matter of the speech conversation.
- the communications interface may be arranged to receive, from a datastore, a group of active words associated with the interaction context, where each active word is selected based on one or more selection criterion derived from conversations of a population of users.
- the communications interface may receive, from the datastore, the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, where each first user voiceprint corresponds to each active word of the group of active words.
- the server also includes a processor arranged to perform speech recognition of the first user audio provided during the speech conversation, where the processor converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements.
- the processor is also arranged to: i) receive the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements, ii) receive the first user identifier, iii) compare the plurality of corresponding text elements with each active word of the group of active words, iv) identify text elements matching each active word of the group of active words, and v) generate a captured voiceprint for each captured audio element corresponding to each text element matching each active word, vi) compare each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words, vii) generate a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and viii) if the similarity score is greater than or equal to a threshold value, indicate that the first user identifier is verified or, if the similarity score is less than the threshold value, indicate that the first user identifier is not verified.
- a further aspect includes a method for performing speaker verification including performing a speech conversation with a first user and receiving a first user identifier, where the speech conversation has an interaction context based on a subject matter of the speech conversation.
- the method includes receiving a group of active words associated with the interaction context, where each active word is selected based on one or more selection criterion derived from conversations of a population of users and receiving the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, where each first user voiceprint corresponds to each active word of the group of active words.
- the method further includes: performing speech recognition of the first user audio provided during the speech conversation, where a processor converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements; receiving the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements; comparing the plurality of corresponding text elements with each active word of the group of active words; identifying text elements matching each active word of the group of active words; generating a captured voiceprint for each captured audio element corresponding to each text element matching each active word; comparing each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words; generating a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and if the similarity score is greater than or equal to a threshold value, indicating that the first user identifier is verified or, if the similarity score is less than the threshold value, indicating that the first user identifier is not verified.
- the above aspects should not be considered directed to an abstract idea. Instead, the above aspects should be considered directed to an Internet-centric problem or improvement of computer technology related to more efficient voice or speaker recognition that advantageously reduces memory and processing demands on a voice biometric system.
- a voice biometric system is able to more efficiently focus on a subset of user audio or speech.
- the system also advantageously combines speech recognition with voice recognition in a technically innovative way to enable rapid identification of the active words using speech recognition, to which voice recognition is then applied.
- even if the above aspects could be considered to involve an abstract idea, the aspects are not directed to that idea standing alone.
- a long-standing problem with voice recognition is how to quickly, efficiently, and reliably verify the identity of speakers.
- the above aspects are directed to technically improving the speed, efficiency, and reliability, while reducing the cost in processing and memory of speaker recognition.
- FIG. 1 shows a schematic diagram of a system implementing aspects of the present inventions.
- FIG. 2 shows basic speaker verification using natural speech as performed in the prior art.
- FIG. 3 shows a schematic diagram of the components and the flow of data in an innovative exemplary implementation of a voice biometric system using Active Words.
- FIG. 4 A shows a first example of a database storing speaker voiceprint and related information.
- FIG. 4 B shows an example of a database storing Active Words speaker voiceprint and related information.
- FIG. 4 C illustrates a database associating users with user identifiers.
- FIG. 4 D illustrates a database associating active words with various user population contexts.
- FIG. 4 E illustrates a database that associates voiceprints of active words by various users for a particular interaction context associated with an IVR system.
- FIG. 5 shows the basic hardware architecture of an application server.
- FIG. 6 shows the basic software components running on an application server.
- IVR Interactive Voice Response
- NLU Natural Language Understanding
- ASR Automatic Speech Recognition
- DM Dialogue Manager
- PSTN Public Switched Telephone Network
- PLMN Public Land Mobile Network
- VAD Voice Activity Detector
- VA Voiceprint Adaptation
- UI User Interface
- OS Operating System
- the term “mobile device” may be used interchangeably with “client device” and “device with wireless capabilities”.
- a mobile device may include, without limitation, a cellular telephone, a mobile handset, a personal digital assistant, a wireless tablet, a wireless computing device, and the like.
- the term “user” may be used interchangeably with “regular user” and “ordinary user” and “speaker”. It may also be used to mean “caller” in a telephone or VOIP call or conferencing session, “user of an application” or “user of a service”, and “participant” in a text chat, audio chat, video chat, email, audio-conference or video-conference session.
- the term “system” may be used interchangeably with “device”, “apparatus”, and “service”, except where it is obvious to a reader of ordinary skill in the related art that these terms refer to different things, as is apparent from the context of the discussion in which they appear. In any circumstance, and unless otherwise explicitly or implicitly indicated in the description, these four terms should be considered to have the broadest meaning, i.e. that of encompassing all four.
- the present invention addresses the problem of identifying a speaker using voice biometrics applied to natural language, free speech dialogue of the speaker with a system, device, apparatus, or service. It aims to present a solution to many speaker verification problems, including avoiding having the speaker dictate predefined text extracts, repeat text prompts, alter his intonation, use simplified speech, emphasize predefined keywords, use non-verbal means of identification, or add unnecessary complexity to the identification process. It also targets the use of fully automated systems, apparatuses and devices which support natural language processing and understanding, compatibility with existing Interactive Voice Response (IVR) systems, support of natural, real-time Dialogue Manager (DM) systems, and Automatic Speech Recognition (ASR) systems.
- FIG. 1 shows a schematic diagram of a system implementing aspects of the present inventions.
- a speaker can use the identification service via any type of device, comprising an analogue telephone 10 , a digital telephone or mobile (cellular) telephone or smartphone 20 , or a computer, laptop or tablet 30 with voice capture and audio playback capabilities. Regardless of the device or apparatus of choice, the speaker's speech is converted into either a voice signal (if an analogue telephone 10 is used) or a data stream and sent to a network 40 , namely a Public Switched Telephone Network (PSTN) or one of a Public Land Mobile Network (PLMN), the Internet, a Private Network, or a Cloud, respectively. The voice signal is converted to data (not shown in FIG. 1 ).
- an optional Cache and/or Proxy Server 60 handles the incoming voice data.
- the inclusion of such a server or servers is optional and can serve to further enhance data security in a mode analogous to secure browsing, by isolating an Application Server 70 and the data stored in a Database 80 so that a potential unauthorized intruder can access and modify only a copy of these data and not the original sensitive data. It can also serve as a faster means of accessing commonly used data that may be dispersed across a number of servers and databases as opposed to a single server 70 and database 80 .
- It will be appreciated that a firewall 50 may be added, that some of the elements of FIG. 1 may be omitted (e.g. firewall 50 , Cache and/or Proxy Server 60 ), or that their position and connections to each other may be modified without altering the scope, usability, essence, operation, result and purpose of the present invention.
- a speaker is verified based on his natural speech.
- FIG. 2 shows basic speaker verification using natural speech as performed in prior art.
- the method starts with the speaker initiating a communication session 100 with the exemplary system of FIG. 1 (or a VB server).
- the user may initiate a telephone session using an analogue telephone 10 , a digital (Voice Over Internet Protocol (VOIP)) phone or mobile phone or smartphone 20 or tablet, laptop, or desktop computer 30 .
- the telephone session may also be implemented as a teleconferencing—multi-party session.
- the speaker may use a software application running on his smartphone 20 or tablet, laptop, or desktop computer 30 .
- This software application may provide VOIP telephone or teleconferencing—multi-party session, video call, video conferencing, text or voice or video chat session, e-mail session, or a combination of the preceding.
- the devices 10 - 30 , apparatuses, subsystems and applications perform no processing other than forwarding voice signals, or first digitizing voice signals, converting them into data and then forwarding them via any of the networks 40 shown in FIG. 1 to back-end servers 60 , 70 , and receiving voice signals or data representing voice signals for playback to the speaker.
- the back-end server or servers 60 , 70 receive the voice data corresponding to the user's live natural speech and calculate his live voiceprint 130 .
- a voiceprint is a set of measurable characteristics of a human voice that can be used to uniquely identify the speaker. These characteristics, which are based on the physical configuration of the speaker's mouth and throat, can be expressed as a mathematical formula or as a vector.
- the voiceprint can be calculated with any algorithm and contain any number of features. By means of example, the voiceprint can be calculated at a backend server by creating an i-vector (or an x-vector or other representation), i.e. a fixed-length vector representation of the speaker's utterance.
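By way of illustration only, a crude stand-in for such a voiceprint extractor; a deployed system would use a trained i-vector/x-vector extractor, whereas this toy function (its name, frame sizes and feature choice are all assumptions) merely shows "utterance in, fixed-length feature vector out":

```python
import numpy as np

def toy_voiceprint(samples, frame=400, hop=160):
    """Crude fixed-length 'voiceprint': mean and std of log-magnitude spectra.
    Only illustrative; not the i-vector/x-vector computation itself."""
    samples = np.asarray(samples, dtype=float)
    frames = [samples[i:i + frame] for i in range(0, len(samples) - frame, hop)]
    spectra = np.log1p(np.abs(np.fft.rfft(np.stack(frames) * np.hanning(frame), axis=1)))
    return np.concatenate([spectra.mean(axis=0), spectra.std(axis=0)])

# one second of placeholder 16 kHz audio -> a fixed-length vector usable for scoring
print(toy_voiceprint(np.random.randn(16000)).shape)  # (402,)
```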
- a first similarity score is calculated 150 for each pair of the speaker's live voiceprint and each fetched voiceprint and the pair with the largest similarity score is selected 155 along with the corresponding similarity score. This similarity score is then compared to a predefined threshold T 1 160 . If this score is greater than or equal to T 1 , then the speaker is correctly identified 170 . If not, then the speaker identification has failed and the system may, optionally, prompt the speaker to repeat 180 what he just uttered or may start a dialogue to request clarifications or additional information that can be used to correctly verify the speaker against the partially-identified speaker from the call ID and other available data.
- Failure of voice recognition and/or speaker verification may be due to lack of a stored voiceprint corresponding to this particular speaker, or due to noise in the voice signal, changes in speaker's voice as a result of e.g. a sore throat, aging if the speaker has not used the user identification service for a long time, or other reasons beyond the scope of this invention.
- the system cannot identify the speaker unless the speaker provides additional information beyond that used for voice biometrics.
- the repetition of the uttered speech by the speaker, or his participation in a natural speech dialogue with the system could lead to a correct identification result.
- This method produces acceptable results assuming the speaker's voiceprint is among those stored at the backend servers 60 , 70 and/or the database 80 and the live speaker's uttered speech has a sufficiently long duration (typically several seconds) for a useful live voiceprint to be constructed.
- the database is populated with voiceprints corresponding to speakers of the system. This database population process may be repeated during the user's new interactions with the system or for new speakers-users. Manual intervention of an operator or support staff may optionally be used.
- Real time operation and response times depend on the processing power of the application server 70 undertaking the processing, which in turn depends on the amount of stored voiceprints.
- when the number of potential speakers (and consequently the corresponding stored voiceprints) is large, in the order of hundreds of thousands or millions, the delays introduced by processing all these data and verifying the speaker can be beyond what users may consider acceptable. For this reason, the step of reducing the set of stored voiceprints used in the VB calculations is essential, so as to shrink the stored voiceprint set against which the live speaker's voiceprint is compared.
- the reduced stored voiceprint set may still comprise significantly different phonemic content as compared to that of the live voiceprint, making an effective comparison and subsequent scoring sub-optimal.
- the live voiceprint data is comprised of at least a minimum duration of speech, which ensures that the phonemic content of the live voiceprint will be enriched enough so as to statistically be not significantly different from the stored voiceprint phonemic content.
- FIG. 3 shows a schematic diagram of the components and the flow of data in an innovative exemplary implementation of a voice biometric system using AW.
- the innovative Voice biometric system 1100 comprises an Interactive Voice Response (IVR) module 1105 , responsible for performing live speech conversation with a live speaker 1101 , based on uttered questions and replies. The speech communication between speaker 1101 and IVR 1105 is done according to what is described in FIG. 1 .
- IVR 1105 is connected to an Automatic Speech Recognition (ASR) module 1120 , which analyzes uttered speech and transcribes it into text.
- IVR 1105 streams voice received from speaker 1101 to ASR 1120 together with identification data.
- Identification data may include, without limitation, a unique subscriber identifier such as an International mobile subscriber identity (IMSI), a mobile identification number (MIN), a mobile subscription identification number (MSID or MSIN), temporary mobile subscriber identity (TMSI), Mobile Subscriber ISDN Number (MSISDN), Mobile Station International ISDN Number, Mobile International ISDN Number, Subscriber Number (SN), and/or a Packet temporary mobile subscriber identity (P-TMSI), a unique electronic serial number (ESN), a mobile device identifier, a mobile equipment identifier (MEID), an International Mobile Equipment Identifier (IMEI), a media access control (MAC) address, Android ID, a Unique Device Identifier (UDID), Universally Unique Identifier (UUID), a Core Foundation Universally Unique Identifier (CFUUID), a globally unique identifier (GUID), an OpenUDID, a SecureUDID, an unique Identifier (UIDevice), LTE static IP address (UEIP), Tracking Area Identity (TAI), Temporary
- ASR 1120 receives the streamed speech and identification data and uses an acoustic model and optionally a language model (both stored locally at ASR 1120 , or at a local database, or at the cloud) to identify phonemes, syllables, words and sentences in the speaker's 1101 speech.
- the acoustic and language models used by ASR 1120 may be produced by the proposed innovative system. In some implementations, either one or both of the two models may be imported from other applications and systems, external to system 1100 , where their creation involves analyzing speech files and their transcribed text content and timing information, and words, phrases, and their relationships in a given language, respectively.
- the models may be automatically created by system 1100 or external systems. In yet another implementation, manual intervention may be employed during the creation of the models to determine a configuration of and/or selection of parameters for a model.
- Speech recognition may include automatic speech recognition (ASR), computer speech recognition, and/or speech to text (STT).
- Speech recognition and/or voice recognition models herein may include, without limitation Hidden Markov models (HMMs), Dynamic time warping based speech recognition, a back-off model, Neural networks, Deep feedforward and recurrent neural networks, Long short-term memory (LSTM), acoustic modeling, language modeling, a Gaussian mixture model, and/or end to end automatic speech recognition.
- the transcribed speech may optionally be fed to a Natural Language Understanding (NLU) unit.
- the NLU uses a Semantic Model of the used language and creates metadata that describe what the words in the transcribed text mean. For instance, the phrase “how much money did I spend at Starbucks last month” is processed by the NLU, which assigns the tag “merchant” to “starbucks” and the tag “date” to “last month”.
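By way of illustration only, one possible (assumed) shape for the metadata such an NLU could attach to the transcribed text, mirroring the "merchant"/"date" example above; the field and tag names are placeholders, not a defined interface:

```python
# Hypothetical structure of NLU output for the example phrase above.
nlu_output = {
    "text": "how much money did I spend at Starbucks last month",
    "entities": [
        {"tag": "merchant", "value": "starbucks"},
        {"tag": "date", "value": "last month"},
    ],
}
```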
- the semantic model is independent of and unrelated to the current invention and its aspects and can be imported from other systems or services.
- ASR 1120 continues by streaming speech, transcribed speech (i.e. text) and control data (e.g. word and/or letter and/or sentence boundaries and tags identifying word and/or letter and/or sentence boundaries in the streamed speech) to Voice Biometric module 1130 , so as to allow association of the streamed (i.e. uttered) speech with its content (i.e. transcribed text).
- the alignment of words may be done by the ASR module 1120 or the VB module 1130 .
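By way of illustration only, a sketch of how the word-boundary control data could be used to cut per-word audio segments out of the streamed speech; the tuple layout of the boundaries and the function name are assumptions, not the actual message format used between ASR 1120 and VB 1130:

```python
def cut_word_audio(samples, sr, word_boundaries):
    """Slice the streamed audio into per-word segments using word boundaries
    (word, start time in seconds, end time in seconds) from the ASR control data."""
    segments = {}
    for word, start, end in word_boundaries:
        segments.setdefault(word, []).append(samples[int(start * sr):int(end * sr)])
    return segments

# e.g. control data saying "balance" was spoken between 1.20 s and 1.85 s:
# segments = cut_word_audio(samples, 16000, [("balance", 1.20, 1.85)])
```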
- VB 1130 receives the speech, text and control data and uses them to identify speaker 1101 . To reach an identification result, VB 1130 analyzes the transcribed text and control data to deduce the content and/or context of speech (e.g. a request to get the balance of a bank account). In some configurations, VB 1130 communicates with an NLU module (not shown in FIG. 3 ) to receive an estimation of the context of the speech, as this context is derived from the content of the transcribed speech by analyzing the natural language using a language model and rules to understand the conveyed meaning.
- VB 1130 uses the content and/or context of the speech to check a lookup table (or other data representation, either locally or remotely stored) which associates a context (e.g. “I want my bank account balance please”) with a set of used words (e.g. “account” and “balance”) based on one or more selection criterion.
- a selection criterion may be frequency of use.
- Other selection criteria may be used such as, without limitation, type of word, amount of phonemes in a word, combination of phonemes, and/or amount of syllables in a word or phrase.
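By way of illustration only, a sketch of selecting active words for a context from a population of transcripts, using frequency of use plus a crude word-length filter as a stand-in for the phoneme/syllable criteria; the function name, thresholds and example transcripts are assumptions:

```python
import re
from collections import Counter

def select_active_words(transcripts, top_n=10, min_len=5):
    """Pick candidate active words for a context from a population of transcripts,
    ranked by frequency of use; min_len is a crude proxy for the phoneme/syllable
    criteria and top_n bounds the size of the AW set."""
    counts = Counter(
        w for t in transcripts for w in re.findall(r"[a-z']+", t.lower()) if len(w) >= min_len
    )
    return [w for w, _ in counts.most_common(top_n)]

calls = ["I want my account balance please", "what is the balance of my savings account"]
print(select_active_words(calls, top_n=3))  # ['account', 'balance', 'please']
```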
- a selected word is referred to as an “Active Word” (AW).
- An AW is used by VB 1130 to improve the performance of speaker 1101 verification.
- VB 1130 uses AWs to limit the search of speaker voiceprints stored in a database 1140 , accessed by VB 1130 , to only those speaker voiceprints that correspond to AWs, and thus improve the phonemic content match between the live speaker voiceprint and the stored speaker voiceprints.
- This process of matching the phonemic content of live and stored speaker voiceprints using active words alleviates the requirement for long speaker samples and long utterances, and allows VB 1130 to perform accurate voiceprint scoring with minimal live speech samples.
- This search limiting action also improves database access times, as hundreds or thousands of voiceprints may be stored for the same speaker, and, more importantly, significantly reduces the processing needed to compare the live speaker's speech voiceprint (computed by VB 1130 in real time) with the voiceprints in database 1140 . As a result, the number of comparisons is reduced from a scale of hundreds or thousands to merely a few dozen comparisons or fewer.
- voiceprints may, in some implementations, be stored together with metadata describing their content and/or context.
- metadata may be created using speech recognition and natural language understanding analysis, as well as speaker intent analysis. They may also be captured by analyzing speaker interaction with the system, or use of an application running at a user or speaker's computing device 20 or 30 which may provide a multimodal (voice, text, graphics, video, chat, email) user interface, while enabling all voice processing at the application server 70 .
- Each of the stored voiceprints may correspond to a word, phrase, or paragraph uttered by a user at an earlier session with the system. For instance, “my name is John” creates a different voiceprint from “thank you”. This is not only due to the different words in these two phrases but also due to the different phonemes and their combinations, leading to different intonations and linking between them, and so on. These characteristics may have an effect on the number of correct user identification results and failures, especially in the presence of noise. For instance, if the live user's voiceprint has a different context (e.g. is derived from different words or phonemes) than the stored voiceprints selected according to the pre-identification result, then the live and the stored voiceprints may not be similar enough to correctly identify the user, leading to a failure or false negative result.
- the choice of the amount of AWs can be based on their frequency of occurrence and on AW models generated by VB 1130 and stored in an AW database 1160 .
- These AW models may be created as an average model from the recorded speech of all or a subset of the speakers of system 1100 , i.e., a user or speaker population.
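By way of illustration only, a minimal sketch of such an average AW model, assuming it is simply the mean of the per-speaker voiceprints for that word; the averaging choice and function name are assumptions, as the exact modelling is left open here:

```python
import numpy as np

def build_aw_model(aw_voiceprints):
    """Average model for one active word, built from the voiceprints of many
    speakers (a user or speaker population) who uttered that word."""
    return np.mean(np.stack(aw_voiceprints), axis=0)

# population_vps = {"balance": [vp_user1, vp_user2, ...], ...}
# aw_models = {word: build_aw_model(vps) for word, vps in population_vps.items()}
```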
- Example active words may include “balance”, “account” and “please” 1160 . In this example, two active words are defined for the selected context.
- VB 1130 could have selected more AWs or even a single AW.
- the choice of the number of AWs may be based on performance metrics. The AW set may be structured to reflect, first, the frequency distribution of the words in a collection of different speakers' speech, or of the same speaker 1101 in past uses of the system for the same context, and, second, the possibility that in the presence of noise, or of a strange accent and intonation of the speaker 1101 (e.g. when the speaker has a sore throat), some of the uttered words may be erroneously transcribed, so that using a larger set of AWs makes it likely that the set contains at least one of the words uttered by the speaker and correctly transcribed by ASR 1120 .
- a balance between the size of the set of AWs and VB 1130 performance may be calculated in real time by VB 1130 , which may periodically adjust the size of the AW set for speaker 1101 and for the specific context in consideration.
- a system administrator may define the size of the AW sets.
- VB 1130 compares the live user's speech voiceprint with the AW speaker's voiceprint(s) for the detected content and/or context, stored in database 1140 , and produces a first similarity score for each AW speaker's voiceprint.
- comparison is done in a vector space, where both the live speaker's speech voiceprint and the selected AW speaker's voiceprint(s) are i-vectors (or x-vectors, or another representation).
- i-vectors may be stored in a compact form where correlations between the coefficients in the i-vector or between the i-vector and the corresponding AW i-vector (in database 1150 ) are exploited, so as to significantly reduce the dimensions of the i-vectors that need to be compared with the aim to reduce processing time during real-time operation.
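By way of illustration only, one way to exploit such correlations is a PCA-style projection learned over the stored i-vectors; the patent does not prescribe this particular compaction scheme, and the retained dimensionality, function names and use of SVD are assumptions:

```python
import numpy as np

def fit_projection(voiceprints, keep=50):
    """Learn a low-rank projection from a matrix of stored voiceprints (one per row),
    folding correlated dimensions together; 'keep' is an assumed target size."""
    X = np.asarray(voiceprints, dtype=float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:keep]                               # mean and principal directions

def compress(vp, mean, basis):
    return (np.asarray(vp, float) - mean) @ basis.T      # short vector to store instead

def decompress(code, mean, basis):
    return code @ basis + mean                           # approximate reconstruction for scoring
```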
- VB 1130 uses the first similarity scores of the comparison results and selects the stored speaker voiceprint corresponding to the highest score result, which it then compares against threshold T 1 ; if the first similarity score equals or exceeds T 1 , then the speaker is correctly verified.
- AWs enable the word-by-word (or even subword-by-subword) comparison between the live user's speech and the stored (e.g. in i-vector form) user's sample for each AW, producing more accurate results at a fraction of the processing time required without AWs, and using live speech segments of very short duration.
- AWs can provide accurate identification results using live speech of only a single word or of word segments (e.g. syllables or phoneme combinations) which is very useful in situations when the speaker is very “stingy” with his speech conversation with system 1100 or when segments of his uttered speech are corrupted beyond ASR 1120 recognition due to noise or other reasons related to the speaker or to external influences.
- System 1100 is managed by a Dialogue Manager (DM) module 1110 which controls the operation of all other modules 1105 , 1120 , 1130 , 1150 .
- DM 1110 is controlled by an application developer or system administrator and exchanges control data with the other system components.
- DM 1110 controls the operation of IVR 1105 to start and perform a dialogue with speaker 1101 in order to get his request to be serviced by system 1100 (e.g. to provide the speaker with the balance of his bank account), or to request further information (e.g. which account, credit or savings account), or to prompt the speaker to get more speech input in cases where an accurate speaker identification or verification is not possible using the already uttered speech by the speaker (e.g. very noisy sample, truncated, etc.).
- Communication between IVR 1105 and DM 1110 may be done using any publicly available or proprietary protocol.
- DM module 1110 also communicates with VB module 1130 to initiate and control a VB session using the streamed and transcribed speech and identification data received at VB module 1130 from ASR module 1120 .
- VB 1130 signals DM 1110 that speaker identification is not possible with high accuracy using the available live speech specimen.
- DM 1110 may then instruct VB 1130 to store the initial speech, text and identification data and initiate a second round of identification, also involving IVR 1105 to get better scoring that will increase the accuracy of user identification. If identification is achieved, the result is signaled by VB 1130 to DM 1110 and DM 1110 signals all system modules to end current session. DM 1110 may then output the identification result to external systems (not shown in FIG. 3 ).
- the efficient operation of the system and the improvements brought by the use of AWs and pre-identification information allow, in an exemplary scenario of use, verification of a user who calls the system from his fixed line or from his smartphone and utters only the phrase “account balance” if he has a single account with the bank he is calling, or “balance” if he has only one account and no credit or other cards with the same bank.
- the proposed innovative method and system are able to verify the user's identity and then service his request using the pre-identification information, the derived context of the speech session and a live speech sample of minimal duration (even as low as 1 second or even less).
- a speaker calls into system 1100 , listens to a first speech prompt from IVR 1105 , and speaks a first utterance.
- DM 1110 receives identification data (e.g. Automatic Number Identification Data (ANI), caller ID, or MAC address, or other) from IVR 1105 and partially identifies (pre-identifies) the speaker from the identification information.
- the speaker cannot be fully identified by the identification information provided by IVR 1105 because a device, calling number, etc. can be associated with or used by more than one speaker (e.g. a phone number is used by a family, or a smart phone of a user may be used by a third person either under the owner's consent or without it).
- DM 1110 contacts and instructs VB 1130 to perform a lookup for an existing voiceprint referenced by the identification information, i.e. a voiceprint associated with the identification information.
- VB 1130 initializes a new (empty) voiceprint, receives the utterance transcription from ASR 1120 , identifies any AW occurrences, extracts and stores in database 1140 the audio corresponding to the AW, creates in database 1160 an AW model for the speaker for each detected AW, stores alongside the AW model the speech samples corresponding to the detected AW for use in the system's AW models, and returns back to DM 1110 a control message (e.g. voiceprint_does_not_exist).
- DM 1110 instructs IVR 1105 to play a next speech prompt to the speaker.
- the speaker in response, speaks a second utterance, and his speech is streamed by IVR 1105 to ASR 1120 .
- ASR 1120 then relays the streamed speech to VB 1130 .
- the streamed speech is sent by IVR 1105 to DM 1110 , and it is DM 1110 that relays the streamed speech to VB 1130 .
- ASR 1120 produces a transcription of the second utterance, which is sent to VB 1130 .
- VB 1130 evaluates the suitability of the speech sample for the creation of the voiceprint. If the sample quality is not good (e.g. one or more quality metrics are below a corresponding threshold), the sample is discarded and a “bad_sample” response is returned by VB 1130 to DM 1110 .
- VB 1130 identifies any AW occurrences, extracts and stores in database 1140 the audio corresponding to the AW, updates in database 1160 an AW model for the speaker for each detected AW, stores alongside the AW model the speech samples corresponding to the detected AW for use in the system's AW models, and also stores the AW speech samples (that have been segmented and isolated from the speaker's speech using alignment of the speech with the transcribed text used to detect the AW) in database 1140 .
- the process is repeated until enough samples are stored in database 1140 to create a voiceprint for each AW, at which point a voiceprint is created for each AW.
- the collection of enough samples to create the speaker voiceprint for all AWs may be done during a single speaker interaction with IVR 1105 , or during several interactions.
- speaker voice biometric verification cannot be performed until at least one speaker voiceprint for an AW has been saved in database 1140 . So, until such voiceprint(s) have been stored, all speaker interactions with IVR 1105 need to involve other types of user (full, not partial) identification than VB (e.g. keying in or uttering a password).
- the AW voiceprint is stored in database 1140 alongside the speech AW samples (i.e. audio) that were used to create the AW voiceprint, and a “voiceprint_creation_success” message is sent by VB 1130 to DM 1110 .
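By way of illustration only, a sketch of the enrolment bookkeeping implied above, assuming the AW voiceprint is derived as a mean of per-sample vectors once a minimum number of samples has been collected; the sample count, averaging step and function names are assumptions, not values given here:

```python
import numpy as np

def maybe_create_aw_voiceprint(aw_samples, extract_voiceprint, min_samples=3):
    """Enrolment helper for one active word: once enough stored audio samples exist,
    derive and return its voiceprint; otherwise return None and keep collecting
    across further interactions (non-VB identification is used in the meantime)."""
    if len(aw_samples) < min_samples:
        return None
    vectors = [extract_voiceprint(s) for s in aw_samples]
    return np.mean(np.stack(vectors), axis=0)  # stored in database 1140 on success

# usage sketch: voiceprint = maybe_create_aw_voiceprint(samples_for_balance, toy_voiceprint)
```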
- a speaker calls into system 1100 , listens to a first speech prompt from IVR 1105 , and speaks a first utterance.
- DM 1110 receives identification data (e.g. Automatic Number Identification Data (ANI), caller ID, or MAC address, or other) from IVR 1105 and partially identifies the speaker from the identification information.
- the speaker cannot be fully identified by the identification information provided by IVR 1105 because a device, calling number, etc. can be associated with or used by more than one speaker (e.g. a phone number is used by a family, or a smart phone of a user may be used by a third person either under the owner's consent or without it).
- DM 1110 contacts and instructs VB 1130 to perform a lookup for an existing voiceprint referenced by the identification information, i.e. a voiceprint associated with the identification information.
- VB 1130 returns back to DM 1110 a “voiceprint_exists” message and DM 1110 instructs IVR 1105 to play a next speech prompt to the speaker.
- the speaker in response, speaks a second utterance, and his speech is streamed by IVR 1105 to ASR 1120 .
- ASR 1120 then relays the streamed speech to VB 1130 .
- the streamed speech is sent by IVR 1105 to DM 1110 , and it is DM 1110 that relays the streamed speech to VB 1130 .
- ASR 1120 transcribes the utterance and forwards it to VB 1130 (directly or via DM 1110 ).
- VB 1130 evaluates the suitability of the speech sample for the creation of the voiceprint. If the sample quality is not good (e.g. one or more quality metric is below a corresponding threshold), the sample is discarded and a “bad sample” response is returned by VB 1130 to DM 1110 .
- VB 1130 identifies any AW occurrences, extracts and stores in database 1140 the audio corresponding to each AW, updates in database 1160 an AW model for the speaker for each detected AW, stores alongside the AW model the speech samples corresponding to the detected AW for use in the system's AW models, and also stores the AW speech samples (that have been segmented and isolated from the user's speech using alignment of the speech with the transcribed text used to detect the AW) in database 1140 .
- the process is repeated until enough samples are stored in database 1140 to create a voiceprint for each AW, at which point a voiceprint is created for each AW.
- the collection of enough samples to create the speaker voiceprint for all AWs may be done during a single speaker interaction with IVR 1105 , or during several interactions.
- voice biometric verification cannot be performed until at least one speaker voiceprint for an AW has been saved in database 1140 . So, until such voiceprint(s) have been stored, all speaker interactions with IVR 1105 need to involve other types of speaker (full, not partial) identification than VB (e.g. keying in or uttering a password).
- the AW voiceprint is stored in database 1140 alongside the speech AW samples (i.e. audio) that were used to create the AW voiceprint, and a “voiceprint_creation_success” message is sent by VB 1130 to DM 1110 .
- System 1100 may subsequently perform voice recognition and/or speaker verification by comparing voiceprints of AWs captured during a user call with stored user-specific voiceprints associated with AWs.
- the voiceprints may be created using a model that creates i-vectors (or x-vectors or other representation).
- VB 1130 may compare the live user's speech voiceprint with the stored AW speaker's voiceprint(s) for the detected content stored in database 1140 and/or 460 , and produce a first similarity score for each voiceprint associated with the user.
- VB 1130 may perform comparisons in a vector space, where both the detected live speaker's (or user's) speech voiceprint and the selected AW speaker's voiceprint(s) are first converted into i-vectors (or x-vectors, or another representation). The i-vectors (or other representations) may then be stored in a compact form that exploits correlations between the coefficients in the i-vector, or between the i-vector and the corresponding AW i-vector (in database 1150 ), so as to significantly reduce the dimensions of the i-vectors that need to be compared, with the aim of reducing processing time during real-time operation.
- VB 1130 uses the first similarity scores of the comparison results and selects the stored speaker voiceprint corresponding to the highest score, which it then compares against threshold T 1 ; if the first similarity score equals or exceeds T 1 , the speaker is verified.
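- For illustration only, a minimal sketch of this selection-and-threshold step is given below, assuming that voiceprints are fixed-length embeddings (e.g. i-vectors or x-vectors) and that similarity is measured by cosine similarity; both the similarity measure and the example value of T 1 are assumptions made for the sketch.

```python
# Hypothetical sketch: score the live voiceprint against the stored AW
# voiceprints, keep the highest first similarity score, and compare it to T1.
import math
from typing import Dict, List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify(live_vp: List[float],
           stored_vps: Dict[str, List[float]],  # AW label -> stored voiceprint
           t1: float = 0.7) -> bool:            # T1 value is an assumption
    """Select the highest-scoring stored voiceprint and test it against T1."""
    if not stored_vps:
        return False
    best_score = max(cosine_similarity(live_vp, vp) for vp in stored_vps.values())
    return best_score >= t1
```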
- VB 1130 may determine similarity scores by comparing a voiceprint from a detected live AW with a stored voiceprint associated with an AW of a user.
- a single uttered speech segment is used to identify the speaker and no second or subsequent uttered speech segments are needed.
- FIG. 4 A shows a first example of a database storing speaker voiceprint and related information.
- the database 700 stores entries corresponding to speakers of the speaker identification service.
- a first speaker entry 710 corresponding to a first speaker, which entry may comprise random speech (e.g. speaker name, debit account, and card number). These are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the first speaker (“Maria Schneider”, “Debit Account”, “Card Number”, “1237890”, “6543”).
- the first speaker entry is also associated with a stored Voiceprint 1 715 and metadata describing the content and context of the voiceprint or other information associated with the first speaker.
- the content of the speech may be either random speech, requiring text-independent voice biometrics or passphrase-specific (e.g., “my voice is my password”), requiring text-dependent voice biometrics.
- the present disclosure describes systems and methods that advantageously enable voice recognition and/or speaker recognition by applying voice biometrics to words that a user is likely to use, e.g., active words, depending on the context of the user's interaction with a system or service. In this way, the processing power and memory required by the system are substantially reduced with respect to text-independent voice biometric systems, while user acceptance is substantially increased with respect to text-dependent voice biometric systems because users are no longer required to recite the same phrase each time they access the system or service.
- Database 700 also contains an n th speaker entry 780 , corresponding to an n th speaker, which entry may comprise speaker name, account number, and address. These are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the n th user (“George Smith”, “Account Number”, “Address”, “123454”).
- the n th speaker entry is also associated with a stored Voiceprint n 785 and metadata describing the content and context of the voiceprint or other information associated with the n th speaker.
- Database 700 may also contain a second speaker entry 720 corresponding to the first speaker, which entry may comprise speaker name, credit card, and home city. These are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the first speaker (“Maria Schneider”, “Credit Card”, “2378”, “New York”).
- the second speaker entry is also associated with a stored voiceprint 2 725 and metadata describing the content and context of the voiceprint or other information associated with the first speaker. Additional entries (and associated voiceprints and metadata) may exist for any of the users, where each corresponds to different content (i.e. different uttered words and phrases) and different context (e.g. the different meanings of the word pair “New York” as part of an address and as part of the name “New York Commercial Bank”).
- a user and/or speaker may be asked to utter a specific passphrase (e.g. “my voice is my password . . . ”). Other speakers are also asked to utter the same passphrase.
- Database 700 contains entries for all speakers, where these entries correspond to the same passphrase. Voiceprints are created and stored for each speaker uttering the same passphrase, together with metadata describing the content and context of the voiceprint or other information associated with each speaker.
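- For illustration only, the entries of database 700 might be modeled along the following lines; the field names and types are assumptions made for the sketch rather than a prescribed schema.

```python
# Hypothetical sketch of a database-700 entry: audio files of uttered words,
# a stored voiceprint, and metadata describing its content and context.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class SpeakerEntry:
    speaker_id: str                # internal identifier for the speaker
    audio_files: List[str]         # e.g. WAV/MP3/MP4 files of the uttered words
    voiceprint: List[float]        # stored voiceprint (e.g. an i-vector)
    metadata: Dict[str, str] = field(default_factory=dict)  # content/context info

# Example resembling first speaker entry 710 (placeholder values throughout):
entry_710 = SpeakerEntry(
    speaker_id="speaker-1",
    audio_files=["maria_schneider.wav", "debit_account.wav", "card_number.wav"],
    voiceprint=[0.12, -0.03, 0.87],
    metadata={"content": "random speech", "context": "banking IVR"},
)
```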
- FIG. 4 B shows an example of a database storing speaker Active Words voiceprint and related information.
- the database 700 b stores entries corresponding to speakers of the speaker identification service.
- a first speaker entry 710 b corresponding to a first speaker, which entry may comprise speaker name, debit account, and card number.
- These are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the first speaker (“Maria Schneider”, “Debit Account”, “Card Number”, “1237890”, “6543”).
- system 1100 uses historical data (e.g. word frequency in previous speaker interactions with the system) and selects “Maria”, “Schneider”, “1237890”, and “6543” as AW 1 , AW 2 , AW 3 , and AW 4 , respectively.
- system 1100 creates associated Voiceprints 715 b, 725 b, 735 b, and 745 b, where each AW Voiceprint is stored along with metadata describing the content and context of the voiceprint or other information associated with the first speaker.
- Database 700 b also contains speaker entries for other speakers, which entries may include speaker name, account number, and address, which are stored as audio files (e.g. WAV, MP4, MP3, and the like) that were uttered by the other speakers. These entries are also associated with a stored Voiceprint and metadata describing the content and context of the voiceprint or other information associated with the corresponding speaker.
- the speaker may be asked to utter a specific passphrase (e.g. “my voice is my password . . . ”). The same passphrase is uttered by all other speakers. All speakers' utterances of the passphrase are stored in database 700 b together with metadata and associated voiceprints.
- FIG. 4 C illustrates a database 400 associating users with user identifiers (IDs).
- the database includes a column 402 having a list of known users such as User A, User B, User C, and User X among other users. The number of users may be limited only by the processing and memory capacity of, for example, System 1100 .
- Column 404 includes user identifiers (e.g., ID A , ID B , ID C , and ID X , among others) corresponding to each user.
- a user ID may include one or more items of the identification information previously discussed herein.
- column 404 may include multiple IDs that correspond to a particular user.
- FIG. 4 D illustrates a database 440 or datastore associating active words with various interaction contexts.
- Column 442 includes a list of various interaction contexts.
- An interaction context includes a type of interaction by a user with a system such as System 1100 .
- Interaction Context A may represent a banking application where users interface with System 1100 to enroll, access their banking account information, or initiate banking transactions via interactive voice responses with System 1100 .
- Interaction Context B may, for example, include an insurance application that enables users to submit claims via an IVR system.
- Interaction Context C, other Interaction Contexts, and Interaction Context X may include a healthcare application, a frequent flyer rewards application, a utility provider application, a mobile service provider application, or any IVR-based application that enables users to interface with a service or product provider.
- Column 444 lists groups of Active Words that are associated with a corresponding Interaction Context. For example, Interaction Context A is associated with three Active Words AW A1 , AW A2 , and AW A3 . While three Active Words are shown, any number of Active Words may be associated with an Interaction Context as previously discussed herein.
- System 1100 may select AWs that are associated with a particular Interaction Context based on one or more selection criterion, which may include a frequency of use for a particular word or phrase by a population of users that interact with System 1100 via a particular Interaction Context. For example, System 1100 may gather data regarding all users during Interaction Context A, i.e., during calls to a banking application. As previously discussed, System 1100 may identify those words, subwords, or phrases most frequently spoken by all users of the user population that use the banking application.
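- One way such a frequency-based selection criterion could be realized is sketched below; counting whole-word frequencies over a population's transcripts and keeping the N most frequent words is an assumption made for the sketch, not the only criterion contemplated herein.

```python
# Hypothetical sketch: choose the N most frequent words across transcripts of a
# user population for one interaction context as that context's Active Words.
from collections import Counter
from typing import Iterable, List

STOPWORDS = {"the", "a", "to", "and", "of", "i", "my"}  # assumed filter list

def select_active_words(transcripts: Iterable[str], n: int = 3) -> List[str]:
    counts: Counter = Counter()
    for text in transcripts:
        for word in text.lower().split():
            if word.isalpha() and word not in STOPWORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(n)]

# Example for an assumed banking context (Interaction Context A):
print(select_active_words([
    "i want my account balance",
    "my card was declined on account one",
    "block my card please",
]))  # e.g. ['account', 'card', ...]
```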
- System 1100 may designate and/or identify one or more AWs (e.g., AW A1 , AW A2 , and AW A3 ) for Interaction Context A (e.g., a banking application).
- AW A1 may be the word “account.”
- AW A2 may be the word “card.”
- AW A3 may be the word “one.”
- for Interaction Context C (e.g., an auto insurance application), AW C1 may be the word “car” and AW C2 may be the word “accident.”
- System 1100 may determine that only two AWs are needed to verify a user identity using Interaction Context C.
- FIG. 4 E illustrates a database 460 that associates voiceprints of active words of various users for a particular interaction context associated with an IVR system.
- Column 462 includes a list of users while columns 464 include voiceprints of recorded AWs of corresponding users for Interaction Context A.
- for other Interaction Contexts, other databases like database 460 will include voiceprints of AWs of corresponding users.
- captured audio of AWs may be stored in addition to, or instead of, storing voiceprints.
- VB 1130 may generate voiceprints of stored audio of AWs in real-time or near real-time, while also generating voiceprints of detected live audio of AWs to perform speaker recognition.
- User A has corresponding stored voiceprints VP USERA AW A1 , VP USERA AW A2 , and VP USERA AW A3 . These voiceprints may have been pre-generated based on a prior enrollment session and/or other IVR session between User A and System 1100 .
- ASR 1120 may detect AW A1 in User A's speech using speech recognition of User A's conversation.
- VB 1130 may then process the captured audio of AW A1 to generate a voiceprint of the captured audio of AW A1 , and compare the voiceprint of the captured AW A1 with the stored voiceprint VP USERA AW A1 to perform voice recognition and confirm the identity of User A.
- column 462 may include identification information associated with each User A, User B, User C, User X, and so on.
- FIG. 5 shows the basic hardware architecture of an application server.
- Application Server 5100 comprises a microprocessor 5110 , a memory 5120 , a screen adapter 5130 , a hard-disk 5140 , a graphics processor 5150 , a communications interface adapter 5160 , and a UI adapter 5170 .
- Application Server 5100 may also contain other components that are not shown in FIG. 5 , or may lack some of the components shown in FIG. 5 .
- FIG. 6 shows functional elements running on an application server 6200 .
- the functional elements may be implemented as hardware, software, firmware, or a combination thereof.
- the functional elements may include an Operating System (OS) 6210 , Utilities 6220 , an Application Server Software 6230 , at least one Application or Web Service 6240 , and at least one Hardware driver 6250 . Additional hardware and/or software components may run on the application server, while some of those shown in FIG. 6 may optionally be omitted.
- the method described in FIG. 2 is modified by omitting the user pre-identification step 120 .
- the Pre-identified User's Live Voiceprint Calculation step 130 is modified to calculate the user's live voiceprint (i.e. without knowledge of the user's identity), and the Fetch Pre-Identified User's Stored Voiceprints step 140 is modified to fetch all stored voiceprints (i.e. belonging to all users), or all stored voiceprints of the same category or of similar characteristics.
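- For illustration only, a minimal sketch of this modified, text-independent flow is given below, assuming cosine similarity over fixed-length voiceprints and a flat scan over all stored voiceprints (a real deployment could restrict the scan to voiceprints of the same category or of similar characteristics, as noted above); the threshold value is an assumption made for the sketch.

```python
# Hypothetical sketch: without pre-identification, score the live voiceprint
# against every stored voiceprint (1:N search) and return the best match.
import math
from typing import Dict, List, Optional, Tuple

def _cos(a: List[float], b: List[float]) -> float:
    n = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / n if n else 0.0

def identify(live_vp: List[float],
             all_vps: Dict[str, List[float]],   # user_id -> stored voiceprint
             threshold: float = 0.7) -> Optional[Tuple[str, float]]:
    """Return (user_id, score) of the best match at or above threshold, else None."""
    if not all_vps:
        return None
    best_user, best_score = max(((u, _cos(live_vp, vp)) for u, vp in all_vps.items()),
                                key=lambda t: t[1])
    return (best_user, best_score) if best_score >= threshold else None
```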
- Additional exemplary aspects include a method for performing speaker verification including performing a speech conversation with a first user and receiving a first user identifier.
- the speech conversation may have an interaction context based on a subject matter of the speech conversation.
- the method also includes receiving a group of active words associated with the interaction context, where each active word is selected based on one or more selection criterion derived from conversations of a population of users.
- the method includes receiving the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user, where each first user voiceprint corresponds to each active word of the group of active words.
- the method further includes performing speech recognition of the first user audio provided during the speech conversation where the processor converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements.
- the method also includes receiving the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements.
- the method also includes: comparing the plurality of corresponding text elements with each active word of the group of active words, identifying text elements matching each active word of the group of active words, generating a captured voiceprint for each captured audio element corresponding to each text element matching each active word, comparing each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words, generating a similarity score based one or more of the comparisons of each captured voiceprint with each first user voiceprint; and if the similarity score is greater than or equal to a threshold value, indicating that the first user identifier is verified or if the similarity score is less than the threshold value, indicating that the first user identifier is not verified.
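- For illustration only, the sequence of steps just described might be sketched as follows; make_voiceprint() and compare() stand in for whatever voiceprint model and similarity measure the system uses, and averaging the per-active-word scores into a single similarity score is an assumption made for the sketch.

```python
# Hypothetical end-to-end sketch of the verification steps described above.
from typing import Callable, Dict, List

def verify_user(text_elements: List[str],           # transcribed text elements
                audio_elements: List[bytes],         # captured audio, aligned 1:1 with text
                active_words: List[str],             # AW group for the interaction context
                stored_vps: Dict[str, List[float]],  # AW -> first user voiceprint
                make_voiceprint: Callable[[bytes], List[float]],
                compare: Callable[[List[float], List[float]], float],
                threshold: float = 0.7) -> bool:
    scores: List[float] = []
    for text, audio in zip(text_elements, audio_elements):
        aw = text.lower()
        if aw in active_words and aw in stored_vps:  # text element matches an active word
            captured_vp = make_voiceprint(audio)     # voiceprint of the captured audio element
            scores.append(compare(captured_vp, stored_vps[aw]))
    if not scores:
        return False                                 # no active words detected in the speech
    return sum(scores) / len(scores) >= threshold    # verified iff score >= threshold
```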
- a further exemplary aspect includes a system for performing speaker verification including: means for performing a speech conversation with a first user and means for receiving a first user identifier where the speech conversation having an interaction context based on a subject matter of the speech conversation; means for storing a group of active words associated with the interaction context where each active word being selected based on one or more selection criterion derived from conversations of a population of users; means for storing the first user identifier and a plurality of first user voiceprints derived from pre-captured audio of the first user where each first user voiceprint corresponding to each active word of the group of active words; means for performing speech recognition of the first user audio provided during the speech conversation where the processor converts the first user audio including a plurality of captured audio elements into transcribed text including a corresponding plurality of text elements; means for receiving the first user audio including the plurality of captured audio elements and the transcribed text including the plurality of corresponding text elements; means for comparing the plurality of corresponding text elements with each active word of the group of active words; means for identifying text elements matching each active word of the group of active words; means for generating a captured voiceprint for each captured audio element corresponding to each text element matching each active word; means for comparing each captured voiceprint corresponding to each active word of the group of active words with each first user voiceprint corresponding to each active word of the group of active words; means for generating a similarity score based on one or more of the comparisons of each captured voiceprint with each first user voiceprint; and means for indicating that the first user identifier is verified if the similarity score is greater than or equal to a threshold value, or that the first user identifier is not verified if the similarity score is less than the threshold value.
- Another exemplary aspect includes a method for verifying a speaker using natural speech including: initiating a session with one of a system or a computing device or a computer apparatus or a telephone; capturing session related data; pre-identifying the user using the session related data; capturing the user's live speech; identifying at least one frequent word in the user's speech; calculating the user's live voiceprint; retrieving at least one stored voiceprint, where the at least one stored voiceprint is associated with the user and with the at least one frequent word, and where the retrieval is characterized by the session related data and the user's pre-identification data; comparing the live voiceprint with the at least one retrieved voiceprint; and verifying the user.
- a further exemplary aspect includes a system for verifying a speaker using natural speech the system including: means for initiating a session with one of a system or a computing device or a computer apparatus or a telephone; means for capturing session related data; means for pre-identifying the user using the session related data; means for capturing the user's live speech; means for identifying at least one frequent word in the user's speech; means for calculating the user's live voiceprint; means for retrieving at least one stored voiceprint, where the at least one stored voiceprint is associated with the user and with the at least one frequent word, and where the retrieval is characterized by the session related data and the user's pre-identification data; means for comparing the live voiceprint with the at least one retrieved voiceprint; and means for verifying the user.
- Yet another exemplary aspect includes a non-transitory computer program product that causes a system to verify a speaker using natural speech, the non-transitory computer program product having instructions to: initiate a session with one of a system or a computing device or a computer apparatus or a telephone; capture session related data; pre-identify the user using the session related data; capture the user's live speech; identify at least one frequent word in the user's speech; calculate the user's live voiceprint; retrieve at least one stored voiceprint, where the at least one stored voiceprint is associated with the user and with the at least one frequent word, and where the retrieval is characterized by the session related data and the user's pre-identification data; compare the live voiceprint with the at least one retrieved voiceprint; and verify the user.
- the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored in a non-transitory manner on, or transmitted over as, one or more instructions or code on a computer-readable medium.
- Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another.
- a storage media may be any available media that can be accessed by a computer.
- such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer or any other device or apparatus operating as a computer.
- Such storage mediums and/or databases on them may be referred to as datastores.
- any connection is properly termed a computer-readable medium.
- Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.