
WO2019073235A1 - Detection of liveness - Google Patents

Detection of liveness

Info

Publication number
WO2019073235A1
Authority
WO
WIPO (PCT)
Prior art keywords
signal
ultrasound
audio band
audio
component
Prior art date
Application number
PCT/GB2018/052907
Other languages
French (fr)
Inventor
John Paul Lesso
Original Assignee
Cirrus Logic International Semiconductor Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GBGB1801663.4A (GB201801663D0)
Priority claimed from GBGB1801661.8A (GB201801661D0)
Priority claimed from GBGB1801664.2A (GB201801664D0)
Priority claimed from GBGB1801874.7A (GB201801874D0)
Application filed by Cirrus Logic International Semiconductor Limited
Priority to CN201880066346.8A (CN111201568A)
Priority to KR1020207013319A (KR20200062320A)
Priority to GB2004477.2A (GB2581594B)
Publication of WO2019073235A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L17/10 - Multimodal systems, i.e. based on the integration of multiple recognition engines or fusion of expert systems
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S15/00 - Systems using the reflection or reradiation of acoustic waves, e.g. sonar systems
    • G01S15/02 - Systems using the reflection or reradiation of acoustic waves, e.g. sonar systems using reflection of acoustic waves
    • G01S15/50 - Systems of measurement, based on relative movement of the target
    • G01S15/52 - Discriminating between fixed and moving objects or between objects moving at different speeds
    • G01S15/523 - Discriminating between fixed and moving objects or between objects moving at different speeds for presence detection
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S15/00 - Systems using the reflection or reradiation of acoustic waves, e.g. sonar systems
    • G01S15/88 - Sonar systems specially adapted for specific applications
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00 - Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/52 - Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S15/00
    • G01S7/52004 - Means for monitoring or calibrating
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00 - Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/52 - Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S15/00
    • G01S7/539 - Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S15/00 using analysis of echo signal for target characterisation; Target signature; Target cross-section
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 - Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30 - Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31 - User authentication
    • G06F21/32 - User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G - PHYSICS
    • G07 - CHECKING-DEVICES
    • G07C - TIME OR ATTENDANCE REGISTERS; REGISTERING OR INDICATING THE WORKING OF MACHINES; GENERATING RANDOM NUMBERS; VOTING OR LOTTERY APPARATUS; ARRANGEMENTS, SYSTEMS OR APPARATUS FOR CHECKING NOT PROVIDED FOR ELSEWHERE
    • G07C9/00 - Individual registration on entry or exit
    • G07C9/30 - Individual registration on entry or exit not involving the use of a pass
    • G07C9/32 - Individual registration on entry or exit not involving the use of a pass in combination with an identity check
    • G07C9/37 - Individual registration on entry or exit not involving the use of a pass in combination with an identity check using biometric data, e.g. fingerprints, iris scans or voice recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0216 - Noise filtering characterised by the method used for estimating noise
    • A - HUMAN NECESSITIES
    • A61 - MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B - DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B8/00 - Diagnosis using ultrasonic, sonic or infrasonic waves
    • A61B8/48 - Diagnostic techniques
    • A61B8/488 - Diagnostic techniques involving Doppler signals
    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01S - RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S7/00 - Details of systems according to groups G01S13/00, G01S15/00, G01S17/00
    • G01S7/52 - Details of systems according to groups G01S13/00, G01S15/00, G01S17/00 of systems according to group G01S15/00
    • G01S7/523 - Details of pulse systems
    • G01S7/524 - Transmitters
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/22 - Interactive procedures; Man-machine interfaces
    • G10L17/24 - Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase

Definitions

  • the liveness detection can be used for detecting a replay attack on a voice biometrics system.
  • embodiments described herein relate to methods and devices for improving the robustness of a speech processing system.
  • Biometrics systems are becoming widely used.
  • a user trains the system by providing samples of their speech during an enrolment phase.
  • the system is able to discriminate between the enrolled user and non-registered speakers.
  • Voice biometrics systems can in principle be used to control access to a wide range of services and systems.
  • One way for a malicious party to attempt to defeat a voice biometrics system is to obtain a recording of the enrolled user's speech, and to play back the recording in an attempt to impersonate the enrolled user and to gain access to services that are intended to be restricted to the enrolled user. This is referred to as a replay attack, or as a spoofing attack.
  • the system recognises a characteristic of the user.
  • a biometrics system recognises a characteristic of the user.
  • many devices include microphones, which can be used to detect ambient sounds.
  • the ambient sounds include the speech of one or more nearby speakers.
  • Audio signals generated by the microphones can be used in many ways. For example, audio signals representing speech can be used as the input to a speech recognition system, allowing a user to control a device or system using spoken commands.
  • a method of liveness detection comprises: receiving a speech signal; generating an ultrasound signal; detecting a reflection of the generated ultrasound signal; detecting Doppler shifts in the reflection of the generated ultrasound signal; and identifying whether the received speech signal is indicative of the liveness of a speaker based on the detected Doppler shifts. Identifying whether the received speech signal is indicative of liveness based on the detected Doppler shifts comprises determining whether the detected Doppler shifts correspond to a speech articulation rate.
  • a system configured for performing the method of the first aspect.
  • a device comprising such a system.
  • the device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
  • a computer program product comprising a computer-readable tangible medium, and instructions for performing a method according to the first aspect.
  • a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the first aspect.
  • a method of detecting liveness of a speaker which comprises: generating an ultrasound signal; receiving an audio signal comprising a reflection of the ultrasound signal; using the received audio signal comprising the reflection of the ultrasound signal to detect the liveness of a speaker; monitoring ambient ultrasound noise; and adjusting the operation of a system receiving the audio signal, based on a level of the reflected ultrasound and the monitored ambient ultrasound noise.
  • the method can be used in a voice biometrics system, in which case detecting the liveness of a speaker comprises determining whether a received speech signal may be a product of a replay attack. The operation of the voice biometrics system may be adjusted based on a level of the reflected ultrasound and the monitored ambient ultrasound noise.
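The adjustment described above could, for instance, take the form of a simple signal-to-noise rule on the reflected-ultrasound level versus the monitored ambient ultrasound noise. The decision labels and the 6 dB threshold here are hypothetical:

```python
def liveness_check_mode(reflected_level_db, ambient_ultrasound_db,
                        min_snr_db=6.0):
    """Decide how much weight a voice biometrics system should give the
    ultrasound liveness check (thresholds are illustrative only)."""
    snr_db = reflected_level_db - ambient_ultrasound_db
    if snr_db >= min_snr_db:
        return "use"          # reflection clearly above the noise floor
    if snr_db >= 0.0:
        return "downweight"   # usable but unreliable evidence
    return "ignore"           # ambient ultrasound dominates; bypass check
```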
  • a system for liveness detection the system being configured for performing the method of the second aspect.
  • a device comprising such a system.
  • the device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
  • a computer program product comprising a computer-readable tangible medium, and instructions for performing a method according to the second aspect.
  • a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the second aspect.
  • a method of liveness detection in a device comprises: receiving a speech signal from a voice source; generating and transmitting an ultrasound signal through a transducer of the device; detecting a reflection of the transmitted ultrasound signal; detecting Doppler shifts in the reflection of the generated ultrasound signal; and identifying whether the received speech signal is indicative of liveness of a speaker based on the detected Doppler shifts.
  • the method further comprises: obtaining information about a position of the device; and adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device.
  • a system for liveness detection the system being configured for performing the method of the third aspect.
  • a device comprising such a system.
  • the device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
  • a computer program product comprising a computer-readable tangible medium, and instructions for performing a method according to the third aspect.
  • a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the third aspect.
  • a method for improving the robustness of a speech processing system having at least one speech processing module comprising: receiving an input sound signal comprising audio and non-audio frequencies; separating the input sound signal into an audio band component and a non-audio band component; identifying possible interference within the audio band from the non-audio band component; and adjusting the operation of a downstream speech processing module based on said identification.
  • a system for improving the robustness of a speech processing system configured for operating in accordance with the method of the fourth aspect.
  • a device comprising such a system.
  • the device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
  • a computer program product comprising a computer-readable tangible medium, and instructions for performing a method according to the fourth aspect.
  • a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the fourth aspect.
  • Figure 2 is a schematic diagram, illustrating the form of the smartphone;
  • Figure 3 illustrates a situation in which a replay attack is being performed;
  • Figure 4 is a flow chart illustrating a method of detecting liveness;
  • Figure 5 illustrates a speech processing system, including a system for detecting liveness;
  • Figure 6 is a flow chart illustrating a part of the method of detecting liveness;
  • Figure 7 illustrates various possible uses of smartphones;
  • Figure 8 is a flow chart illustrating a part of the method of detecting liveness;
  • Figure 9 is a flow chart illustrating a part of the method of detecting liveness;
  • Figure 10 is a block diagram, illustrating a part of the system for detecting liveness;
  • Figure 11 illustrates results of the method of detecting liveness;
  • Figure 12 illustrates a smartphone;
  • Figure 13 is a schematic diagram, illustrating the form of the smartphone;
  • Figure 14 illustrates a speech processing system;
  • Figure 15 illustrates an effect of using a speech processing system;
  • Figure 16 is a flow chart illustrating a method of handling an audio signal;
  • Figure 17 is a block diagram illustrating a system using the method of Figure 16;
  • Figure 18 is a block diagram illustrating a system using the method of Figure 16;
  • Figure 19 is a block diagram of a system using the method of Figure 16;
  • Figure 20 is a block diagram of a system using the method of Figure 16;
  • Figure 21 is a block diagram of a system using the method of Figure 16;
  • Figure 22 is a block diagram of a system using the method of Figure 16;
  • Figure 23 is a block diagram of a system using the method of Figure 16;
  • Figure 24 is a block diagram of a system using the method of Figure 16.
  • Detailed Description of Embodiments The description below sets forth example embodiments according to this disclosure. Further example embodiments and implementations will be apparent to those having ordinary skill in the art. Further, those having ordinary skill in the art will recognize that various equivalent techniques may be applied in lieu of, or in conjunction with, the embodiments discussed below, and all such equivalents should be deemed as being encompassed by the present disclosure.
  • Figure 1 illustrates a smartphone 110, having a microphone 112 for detecting ambient sounds.
  • the microphone is of course used for detecting the speech of a user who is holding the smartphone 110.
  • the smartphone 110 also has two loudspeakers 114, 116.
  • the first loudspeaker 114 is located at the top of the smartphone 110, when it is held in its normal operating position for making a voice call, and is used for playing the sounds that are received from the remote party to the call.
  • the second loudspeaker 116 is located at the bottom of the smartphone 110, and is used for playing back media content from local or remote sources.
  • the second loudspeaker 116 is used for playing back music that is stored on the smartphone 110 or sounds associated with videos that are being accessed over the internet.
  • the illustrated smartphone 110 also has two additional microphones 112a, 112b.
  • the additional microphones, if present in the device, may be provided at any suitable location.
  • one microphone 112a is located at the top end of the front of the device, while another microphone 112b is located at the top end of the side of the device.
  • Figure 2 is a schematic diagram, illustrating the form of the smartphone 110. Specifically, Figure 2 shows various interconnected components of the smartphone 110. It will be appreciated that the smartphone 110 will in practice contain many other components, but the following description is sufficient for an understanding of the present invention.
  • Figure 2 shows the microphone 112 mentioned above.
  • the smartphone 110 is provided with multiple microphones 112, 112a, 112b, etc.
  • Figure 2 also shows the loudspeakers 114, 116.
  • Figure 2 also shows a memory 118, which may in practice be provided as a single component or as multiple components.
  • the memory 118 is provided for storing data and program instructions.
  • Figure 2 also shows a processor 120, which again may in practice be provided as a single component or as multiple components.
  • one component of the processor 120 may be an applications processor of the smartphone 110.
  • Figure 2 also shows a transceiver 122, which is provided for allowing the smartphone 110 to communicate with external networks.
  • the transceiver 122 may include circuitry for establishing an internet connection over a WiFi local area network and/or over a cellular network.
  • Figure 2 also shows audio processing circuitry 124, for performing operations on the audio signals detected by the microphone 112 as required.
  • the audio processing circuitry 124 may filter the audio signals or perform other signal processing operations.
  • the audio signal processing circuitry is also able to generate audio signals for playback through the loudspeakers 114, 116, as discussed in more detail below.
  • the smartphone 110 may include one or more sensors 126.
  • the sensor(s) may include any combination of the following: gyroscopes, accelerometers, proximity sensors, light level sensors, touch sensors, and a camera.
  • the smartphone 110 is provided with voice biometric functionality, and with control functionality.
  • voice biometric functionality is able to distinguish between spoken commands from the enrolled user, and the same commands when spoken by a different person.
  • certain embodiments of the invention relate to operation of a smartphone or another portable electronic device with some sort of voice operability, for example a tablet or laptop computer, a games console, a home control system, a home entertainment system, an in-vehicle entertainment system, a domestic appliance, or the like, in which the voice biometric functionality is performed in the device that is intended to carry out the spoken command.
  • Certain other embodiments relate to systems in which the voice biometric functionality is performed on a smartphone or other device, which then transmits the commands to a separate device if the voice biometric functionality is able to confirm that the speaker was the enrolled user.
  • the spoken commands are transmitted using the transceiver 122 to a remote speech recognition system, which determines the meaning of the spoken commands.
  • the speech recognition system may be located on one or more remote server in a cloud computing environment. Signals based on the meaning of the spoken commands are then returned to the smartphone 1 10 or other local device. In other embodiments, the speech recognition system is also located on the device 1 10.
  • One attempt to deceive a voice biometric system is to play a recording of an enrolled user's voice in a so-called replay or spoof attack.
  • Figure 3 shows an example of a situation in which a replay attack is being performed.
  • the smartphone 110 is provided with voice biometric functionality.
  • the smartphone 110 is in the possession, at least temporarily, of an attacker, who has another smartphone 130.
  • the smartphone 130 has been used to record the voice of the enrolled user of the smartphone 110.
  • the smartphone 130 is brought close to the microphone inlet 112 of the smartphone 110, and the recording of the enrolled user's voice is played back. If the voice biometric system is unable to determine that the enrolled user's voice that it recognises is a recording, the attacker will gain access to one or more services that are intended to be accessible only by the enrolled user.
  • when the smartphone 110 is provided with a camera-based biometric functionality, such as a facial recognition system, an attacker may use the display of the smartphone 130 to show a photo or video of the enrolled user, in an attempt to defeat the facial recognition system.
  • Embodiments described herein therefore attempt to perform liveness detection, for example detecting the presence of a live person speaking any voice sounds that are detected.
  • Figure 4 is a flow chart, illustrating a method of liveness detection, for example for use in a biometrics system, and in this illustrated example used for detecting a replay attack on a voice biometrics system.
  • Figure 5 is a block diagram illustrating functional blocks in one example of a speech processing system that includes the voice biometrics system.
  • a signal is received on an input 170 of the system shown in Figure 5.
  • the input 170 may be connected to the microphone 112 shown in Figure 1 or the multiple microphones 112, 112a, 112b, etc. shown in Figure 2.
  • the received signal is passed to a voice activity detector (VAD) 172, which detects when the received signal contains speech.
  • the received signal is also passed to a keyword detection block 174. If it is determined by the voice activity detector 172 that the received signal contains speech, the keyword detection block 174 is activated, and it acts to detect the presence of a predetermined keyword in the detected speech. For example, the speech processing system of a smartphone might as a default operate in a low power mode, reflecting the fact that speech processing will be required for only a small fraction of the operating life of the device. The speech processing system may then be taken out of the low-power mode by the user uttering the predetermined keyword or phrase, such as "Hello phone".
  • the received signal is also passed to a speaker recognition block 176.
  • the speaker recognition block 176 attempts to determine whether the person who uttered the predetermined keyword is the registered user of the device and/or of a particular application on the device. Suitable biometric techniques are known for determining whether the speaker of the speech that is present in the received signal is the registered user.
  • the received signal is passed to a speech processing block 178, which may be present on the device or may be located remotely, in the cloud.
  • the speech processing block 178 determines the content of the speech. If the speech contains a command, for example, then the speech processing block 178 generates a suitable signal for causing that command to be acted upon.
  • the system shown in Figure 5 includes a mechanism for performing liveness detection, and hence for detecting whether the received signal containing speech has originated from a replay attack, as illustrated in Figure 3.
  • an ultrasound signal is generated and transmitted, by the ultrasound generate and transmit block 180 shown in Figure 5.
  • the ultrasound transmit block 180 may operate at all times. In other embodiments, the ultrasound transmit block 180 operates only when it receives an enable signal on its input 182.
  • the enable signal may be generated, for example, when the voice activity detector 172 determines that the received signal contains speech, or when the keyword detection block 174 detects the presence of the predetermined keyword, or when the speaker recognition block 176 starts to perform a biometric technique to determine whether the person who uttered the predetermined keyword is the registered user.
  • the ultrasound signal may be a single tone sine wave, or other configurations may be used, for example a chirp signal.
  • the frequency of the ultrasound signal may be selected to be relatively close to 20kHz for transmittability reasons, while being high enough to ensure that it is not audible.
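The two probe-signal options mentioned above (a single tone, or alternatively a chirp) can be sketched in a few lines. The specific frequencies, amplitude, and durations used here are arbitrary examples, not parameters from the disclosure:

```python
import math

def ultrasound_tone(freq_hz, duration_s, sample_rate_hz, amplitude=0.5):
    """Single-tone probe signal, e.g. a little above 20 kHz."""
    n = int(duration_s * sample_rate_hz)
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate_hz)
            for t in range(n)]

def ultrasound_chirp(f0_hz, f1_hz, duration_s, sample_rate_hz,
                     amplitude=0.5):
    """Linear chirp sweeping from f0_hz to f1_hz; the phase term
    integrates the linearly rising instantaneous frequency."""
    n = int(duration_s * sample_rate_hz)
    rate = (f1_hz - f0_hz) / duration_s
    out = []
    for t in range(n):
        ts = t / sample_rate_hz
        phase = 2 * math.pi * (f0_hz * ts + 0.5 * rate * ts * ts)
        out.append(amplitude * math.sin(phase))
    return out
```

A chirp spreads the probe energy over a small band, which can make the reflection easier to separate from narrowband interferers than a single tone.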
  • a reflection of the generated ultrasound signal is detected.
  • a signal is received on an input 184, and passed to an ultrasound detection block 186.
  • the input 184 may be connected to one or more of the multiple microphones 112, 112a, 112b, etc. shown in Figure 2, to receive any signal detected thereby.
  • the received signal is passed to the ultrasound detection block 186, which may for example comprise one or more filter for selecting signals having a frequency that is close to the frequency of the ultrasound signal transmitted by the ultrasound transmit block 180.
  • Reflected ultrasound signals may be Doppler shifted in their frequency, but the Doppler shifts are unlikely to be much more than 100Hz, and so the ultrasound detection block 186 may comprise a filter for selecting signals having a frequency that is within 100Hz of the frequency of the ultrasound signal transmitted by the ultrasound transmit block 180.
  • the received ultrasound signal detected by the ultrasound detection block 186 is passed to a Doppler detect block 188, to detect Doppler shifts in the reflection of the generated ultrasound signal.
  • the received reflected ultrasound signal is compared with the generated ultrasound signal to identify frequency shifts in the reflected signal that are caused by reflections off a moving surface, such as the face, and in particular the lips, of a person who is speaking to generate the detected speech signal.
  • In step 158 of the method shown in Figure 4, it is determined, based on the detected Doppler shifts, whether these Doppler shifts provide good evidence for the liveness of a person generating the detected speech.
  • the output of the Doppler detect block 188 is applied to one input of a correlation block 190.
  • the received audio signal on the input 170 is applied to another input of the correlation block 190.
  • a signal generated by the voice activity detect block 172 is applied to the other input of the correlation block 190.
  • the output of the correlation block 190 is applied to a determination block 192 shown in Figure 5.
  • If it is found by the correlation block 190 that there is a correlation between time periods in which Doppler shifts are detected in the reflection of the generated ultrasound signal, and time periods in which speech content is identified in the received speech signal, this indicates that the detected speech is generated by a live person moving their lips to generate the sound. If the degree of correlation is low, one possible reason for this may be that the detected speech is not generated by a live person moving their lips to generate the sound. One possible cause of this is that the detected speech is in fact generated by a replay attack. Therefore, the determination block 192 produces an output signal that contains information about the liveness of the speaker, and hence about the likelihood that the detected speech was generated by a replay attack.
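The correlation between Doppler-shift periods and speech periods could be scored, in the simplest illustrative form, as per-frame agreement between two boolean activity tracks (one from the Doppler detector, one from the voice activity detector). The 0.8 decision threshold is hypothetical:

```python
def doppler_speech_correlation(doppler_active, speech_active):
    """Fraction of frames on which the Doppler-shift detector and the
    speech detector agree; near 1.0 supports a live, lip-moving
    speaker, while 0.5 or below suggests the speech and any detected
    motion are unrelated."""
    assert len(doppler_active) == len(speech_active)
    agree = sum(1 for d, s in zip(doppler_active, speech_active) if d == s)
    return agree / len(doppler_active)

def likely_live(doppler_active, speech_active, threshold=0.8):
    """Hypothetical decision rule on top of the agreement score."""
    return doppler_speech_correlation(doppler_active, speech_active) >= threshold
```

In a replay attack the speech track is active while the Doppler track stays flat, so the agreement score collapses during the spoken segments.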
  • This output signal is applied, in this illustrated embodiment, to the speaker recognition block 176, which is performing one or more voice biometrics process to determine whether the speaker is the registered user of the device.
  • the speaker recognition block 176 can then use the output signal as one of several factors that it uses to determine whether the speaker is in fact the registered user of the device. For example, there may be one or more factors which indicate whether the detected speech is the speech of the registered user, and one or more factors which indicate whether the detected speech may have resulted from a replay attack.
  • the liveness detection can be used for other purposes, for example for detecting an attempt to defeat a facial recognition system by presenting a still or moving image of an enrolled user.
  • the purpose of generating the ultrasound signal is to detect the movement of a speaker's face, and in particular the lips, while speaking. For this to operate successfully, it is advantageous for the ultrasound signal to be varied depending on information about the use of the device.
  • step 152 of the process shown in Figure 4 involves generating and transmitting the ultrasound signal.
  • Figure 6 is a flow chart, giving more detail about this step, in some embodiments.
  • the system obtains information about a position of the device 110.
  • obtaining information about a position of the device may comprise obtaining information about an orientation of the device.
  • Information about the orientation of the device may for example be obtained from gyroscopes and/or accelerometers provided as sensors 126 in the device 110.
  • obtaining information about a position of the device may comprise obtaining information about a distance of the device from the voice source.
• Information about a distance of the device from the voice source may for example be obtained by detecting the levels of signals generated by the microphones 112, 112a, 112b. For example, a higher signal level from one microphone may indicate that the voice source is closer to that microphone than to one or more other microphones.
  • obtaining information about a position of the device may comprise obtaining information about a position of the device relative to a supposed speaker.
• Information about the position of the device relative to a supposed speaker may for example be obtained from one or more proximity sensors provided as sensors 126 in the device 110.
• Information about the position of the device relative to a supposed speaker may also be obtained from one or more light level sensors provided as sensors 126 in the device 110.
• Information about the position of the device relative to a supposed speaker may also be obtained from one or more touch sensors provided as sensors 126 in the device 110, indicating how the device 110 is being held by a user.
• Information about the position of the device relative to a supposed speaker may also be obtained from a camera provided as a sensor 126 in the device 110, which can track the position of a user's face relative to the device 110. Then, in step 1112, the method involves adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device.
  • Adapting the generating and transmitting of the ultrasound signal may for example comprise adjusting a transmit power of the ultrasound signal.
• adapting the generating and transmitting of the ultrasound signal may comprise selecting the one or more transducers from which the ultrasound signal is generated, with the intention that the ultrasound signal should be generated from a transducer that is close to the user's mouth in order to be able to detect movement of the user's lips.
  • obtaining information about a position of the device may comprise obtaining information about a distance of the device from the voice source, and adapting the generating and transmitting of the ultrasound signal may comprise adjusting a transmit power of the ultrasound signal, with a higher power being used when the device is further from the voice source, at least for distances below a certain limit. This allows the device to generate ultrasound signals that produce clearly detectable reflections, without risking transmitting ultrasound energy when the device is close to the user's ear.
• obtaining information about a position of the device may comprise obtaining information as to which of multiple loudspeaker transducers is closest to the voice source (for example based on signal levels at microphones located close to those transducers), and adapting the generating and transmitting of the ultrasound signal may comprise transmitting the ultrasound signal mainly or entirely from that transducer. This allows the device to generate ultrasound signals from the transducer that is closest to the sound source, and thereby increase the chance of detecting usable reflection signals.
• adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device may comprise transmitting the ultrasound signal at a relatively low power from the first transducer 116 if the information about the position of the device indicates that the device 110 is being used in a close talk mode.
  • Close talk will be understood as a use of a phone where the phone is positioned adjacent the side of a user's face, and where communication is using the close-range earpiece speaker, e.g. as with a "traditional" phone handset positioning.
  • the ultrasound signal may be transmitted at a level of 70-90dB SPL at 1 cm in this mode.
• the information about the position of the device may be considered to indicate that the device is being used in a close talk mode if, for example, accelerometers indicate that the device 110 is in an upright position, and proximity sensors detect that the device 110 is being held close to a surface that might be a user's face 1120, as shown in Figure 7(a).
  • adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device may comprise transmitting the ultrasound signal from the second transducer if the information about the position of the device indicates that the device is being used in a generally vertical orientation.
• adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device may comprise transmitting the ultrasound signal at a relatively high power from the transducer 116 at the lower end of the device, if the information about the position of the device indicates that the device 110 may be being held by the user in front of their face 1130, with the lower microphone 112 pointing towards them, i.e. in a "pizza slice" version of a near talk mode, as shown in Figure 7(b).
• Near-talk mode will be understood as where a phone is positioned in front of the user's face, and where use may be made of near-field loudspeakers and microphones. This position may be suitable for the purposes of a video call, e.g. using software products such as Skype™ from Microsoft or FaceTime™ from Apple.
• "Pizza slice" mode will be understood as a variation of near-talk mode, but where the phone is held in a relatively horizontal position (such that a microphone positioned at the lower end of the phone faces the user directly).
• the ultrasound signal may be transmitted at a level of 90-110dB SPL at 1 cm in this mode.
• the information about the position of the device may be considered to indicate that the device is being used in a "pizza slice" mode if, for example, accelerometers indicate that the device is in a horizontal position, and the signal level detected by the microphone 112 is higher than the signal level detected by the microphones 112a, 112b.
  • adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device may comprise transmitting the ultrasound signal from the first transducer if the information about the position of the device indicates that the device is being used in a generally horizontal orientation.
• adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device may comprise transmitting the ultrasound signal at a relatively high power from the transducer 114 at the upper end of the device, or from transducers at both ends of the device.
• adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device may comprise preventing transmission of the ultrasound signal if the information about the position of the device indicates that the device is being used in a far field mode, for example with the device 110 being placed on a surface 1140 some distance from the user 1142, as shown in Figure 7(c).
  • the information about the position of the device may indicate that the device is located more than a threshold distance (for example 50cm) from the source of the sound.
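The position-dependent adaptation described in the bullets above can be summarised as a simple decision function. The sketch below is illustrative only: the argument names, the transducer labels, the exact power values (picked from the 70-90dB and 90-110dB SPL ranges given above) and the 50cm far-field cut-off are assumptions rather than a prescribed implementation.

```python
def adapt_ultrasound_tx(orientation, proximity_near, lower_mic_level,
                        upper_mic_level, distance_cm):
    """Choose a transducer and transmit level (dB SPL at 1 cm) for the
    ultrasound probe signal, given hypothetical position information.
    Returns (transducer, level), or (None, 0) to suppress transmission."""
    # Far-field mode: device well away (e.g. > 50 cm) from the voice source.
    if distance_cm is not None and distance_cm > 50:
        return (None, 0)                  # prevent transmission entirely
    # Close-talk mode: upright and held against a surface (the user's face).
    if orientation == "vertical" and proximity_near:
        return ("transducer_116", 80)     # relatively low power
    # "Pizza slice" mode: horizontal, with the lower microphone loudest.
    if orientation == "horizontal" and lower_mic_level > upper_mic_level:
        return ("transducer_116", 100)    # relatively high power
    # Otherwise assume a near-talk position facing the upper transducer.
    return ("transducer_114", 90)
```

For example, a vertical device held near the face selects the low-power close-talk setting, while a horizontal device whose lower microphone is dominant selects the high-power "pizza slice" setting.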
  • the output of the Doppler detect block 188 is applied to one input of a correlation block 190.
  • the received audio signal on the input 170 is applied to another input of the correlation block 190.
  • the correlation block 190 determines whether there is a correlation between time periods in which Doppler shifts are detected in the reflection of the generated ultrasound signal, and periods in which there is speech.
  • the aim is to confirm that any Doppler shifts that are detected in the received reflection of the generated ultrasound signal do result from facial movements of a speaker, and are not the result of spurious reflections from other moving objects.
  • FIG 8 is a flow chart, illustrating a method performed in the correlation block 190.
• in step 1150, it is determined whether the detected Doppler shifts correspond to a general speech articulation rate.
• the articulation rate is the rate at which syllables are produced during speech, and it has been found that, for most speech, a typical articulation rate is in the range of 4-10Hz.
• the facial movements of the speaker (for example movements of the speaker's lips, cheeks, and nostrils) occur at generally the same rate.
• thus, in step 1150, it is determined whether the detected Doppler shifts correspond to facial movements at a frequency in the range of 4-10Hz.
• in step 1152, it is determined whether the detected Doppler shifts correspond to an articulation rate of the current speech.
  • the articulation rate of the speech contained in the received audio signal is extracted in the correlation block 190. It is then determined whether the detected Doppler shifts correspond to facial movements at a frequency that corresponds to that extracted articulation rate.
• in step 1154 of the method shown in Figure 8, it is determined whether there is a correlation between detected Doppler shifts in the reflection of the generated ultrasound signal, and speech content of the received speech signal. It is recognised that one issue with using ultrasound as described herein is that there may be interfering sources of ambient ultrasound noise. Therefore, Figure 9 is a flow chart, illustrating one method performed in the Doppler detect block 188 and correlation block 190. Specifically, in step 1170, a level of ambient ultrasound noise is monitored. Then, in step 1172, the operation of the voice biometrics system is adjusted based on the levels of the reflected ultrasound and monitored ambient ultrasound noise.
  • Figure 10 is a block diagram, illustrating schematically the operation of the Doppler detect block 188 and correlation block 190.
  • Figure 11 illustrates signals obtained at different stages of the operation.
• the signal from the one or more microphones 112 is passed to a low pass filter 1180, for isolating the audio frequency components (for example, below 20kHz) of the detected signal.
• the resulting audio signal, in one example, is shown in Figure 11(a).
• the signal level of the audio signal is found in a block 1182 that finds the absolute value of the signal.
• the resulting envelope signal, in the same example, is shown in Figure 11(b).
• the signal from the one or more microphones 112 is also passed to a high pass filter 1184, for isolating the ultrasound components (for example, above 20kHz) of the detected signal.
  • This may contain the wanted reflection of the generated ultrasound signal, but may also contain interfering ambient ultrasound noise.
• the level of the ultrasound signal is determined by a level detector 1186.
• the ultrasound signal is then passed to a demodulation block 1188, where it is downconverted to the audio band, and any Doppler shifted reflections are found. This is achieved by mixing the received ultrasound signal with the ultrasound signal that was generated and transmitted.
  • the received ultrasound signal can be passed through a band pass filter before downconversion if required, in order to remove other ultrasound signals not originating from the transmitted signal.
  • the output of the mixing step can be low-pass filtered.
• the resulting signal, in one example, is shown in Figure 11(c).
• the signal level of the Doppler shifted reflected signal is found in a block 1190 that finds the absolute value of the signal. It can thus be seen from Figure 11 that there is a correlation between the detected Doppler shifts in the reflection of the generated ultrasound signal, and speech content of the received speech signal.
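The chain described above (high pass filtering, mixing with the transmitted carrier in the demodulation block, low-pass filtering, then taking the absolute value) can be sketched as follows. The one-pole low-pass filter, its coefficient, and the 24kHz carrier used in the example are assumptions for illustration only.

```python
import math

def demodulate(rx, carrier_hz, fs, lpf_alpha=0.05):
    """Down-convert a received ultrasound fragment to baseband by mixing
    it with the transmitted carrier, then apply a crude one-pole low-pass
    filter and rectify. Doppler-shifted reflections appear as
    low-frequency oscillations in the returned envelope."""
    mixed = [s * math.cos(2 * math.pi * carrier_hz * n / fs)
             for n, s in enumerate(rx)]
    y, out = 0.0, []
    for m in mixed:
        y += lpf_alpha * (m - y)   # one-pole low-pass filter
        out.append(abs(y))         # rectification (absolute value)
    return out
```

A reflection shifted by 50Hz from a 24kHz carrier, for example, produces a 50Hz oscillation in the demodulated envelope, which can then be compared against the audio-band speech envelope.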
• either the audio signal is differentiated (for example by passing through a block 1194 in the form of a band pass filter with a pass-band of, say, 10-200Hz, an envelope block, or a differentiator), or the ultrasound signal is integrated (for example by passing through a block 1196 in the form of a leaky integrator or a band pass filter with a pass-band of, say, 10-200Hz).
• the correlator 1192 then performs a frame-by-frame cross correlation on the signals. If the correlation result Rxy is above a threshold then it is determined that there is enough of a correlation, between the detected Doppler shifts and the speech content of the received speech signal, to conclude that there is evidence of a live speaker, and hence that the speech may not result from a replay attack. If there is not good evidence of liveness of a speaker, this may be an indication that the received speech signal may be a product of a replay attack.
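The frame-by-frame cross correlation can be sketched as a per-frame normalised correlation at zero lag, with the mean correlation compared against a threshold; the frame length and the threshold value below are assumed tuning parameters, not values from the text.

```python
def frame_correlation(x, y, frame_len):
    """Frame-by-frame normalised correlation (at zero lag) between the
    audio envelope stream x and the demodulated ultrasound stream y.
    Returns one coefficient per complete frame."""
    coeffs = []
    for start in range(0, min(len(x), len(y)) - frame_len + 1, frame_len):
        fx, fy = x[start:start + frame_len], y[start:start + frame_len]
        mx, my = sum(fx) / frame_len, sum(fy) / frame_len
        num = sum((a - mx) * (b - my) for a, b in zip(fx, fy))
        den = (sum((a - mx) ** 2 for a in fx) *
               sum((b - my) ** 2 for b in fy)) ** 0.5
        coeffs.append(num / den if den else 0.0)
    return coeffs

def is_live(coeffs, threshold=0.5):
    """Declare evidence of a live speaker when the mean correlation
    result Rxy exceeds the threshold."""
    return sum(coeffs) / len(coeffs) > threshold
```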
• the operation of the system may be adjusted, based on a level of the reflected ultrasound and the monitored ambient ultrasound noise, as detected by the level detector 1186.
  • the reliance that is placed on the determination as to whether the received speech signal may be the result of a replay attack may be adjusted, based on the level of the monitored ambient ultrasound noise.
  • the determination, as to whether the received speech signal may be the result of a replay attack will typically be made based on more than one factor. It is recognised that the presence of large ambient ultrasound signals will impact on the reliability of this system, and so the reliance that is placed on the determination may be reduced, as the level of the monitored ambient ultrasound noise increases. More specifically, if the level of the monitored ambient ultrasound noise exceeds a first threshold level, the result of the correlation may be ignored completely, or the correlation may not be performed.
  • the adjustment of the operation of the system may involve adapting the threshold correlation value that is used in determining whether there is enough of a correlation, between the detected Doppler shifts and the speech content of the received speech signal, to conclude that there is evidence of a live speaker.
  • a high threshold correlation value can be used for low levels of ultrasound interference.
• lower threshold correlation values can be used, to take account of the fact that the presence of interference will automatically reduce the correlation values obtained from the correlator 1192.
  • Figure 12 illustrates a smartphone 210, having a microphone 212 for detecting ambient sounds.
  • the microphone is of course used for detecting the speech of a user who is holding the smartphone 210 close to their face.
  • Figure 13 is a schematic diagram, illustrating the form of the smartphone 210. Specifically, Figure 13 shows various interconnected components of the smartphone 210. It will be appreciated that the smartphone 210 will in practice contain many other components, but the following description is sufficient for an understanding of the present invention.
  • Figure 13 shows the microphone 212 mentioned above.
  • the smartphone 210 is provided with multiple microphones 212, 212a, 212b, etc.
  • Figure 13 also shows a memory 214, which may in practice be provided as a single component or as multiple components.
  • the memory 214 is provided for storing data and program instructions.
  • Figure 13 also shows a processor 216, which again may in practice be provided as a single component or as multiple components.
  • one component of the processor 216 may be an applications processor of the smartphone 210.
  • Figure 13 also shows a transceiver 218, which is provided for allowing the smartphone 210 to communicate with external networks.
  • the transceiver 218 may include circuitry for establishing an internet connection either over a WiFi local area network or over a cellular network.
  • Figure 13 also shows audio processing circuitry 220, for performing operations on the audio signals detected by the microphone 212 as required.
  • the audio processing circuitry 220 may filter the audio signals or perform other signal processing operations.
  • the smartphone 210 is provided with voice biometric functionality, and with control functionality.
  • the smartphone 210 is able to perform various functions in response to spoken commands from an enrolled user.
  • the biometric functionality is able to distinguish between spoken commands from the enrolled user, and the same commands when spoken by a different person.
  • certain embodiments of the invention relate to operation of a smartphone or another portable electronic device with some sort of voice operability, for example a tablet or laptop computer, a games console, a home control system, a home entertainment system, an in-vehicle entertainment system, a domestic appliance, or the like, in which the voice biometric functionality is performed in the device that is intended to carry out the spoken command.
  • Certain other embodiments relate to systems in which the voice biometric functionality is performed on a smartphone or other device, which then transmits the commands to a separate device if the voice biometric functionality is able to confirm that the speaker was the enrolled user.
  • voice biometric functionality is performed on the smartphone 210 or other device that is located close to the user
  • the spoken commands are transmitted using the transceiver 218 to a remote speech recognition system, which determines the meaning of the spoken commands.
• the speech recognition system may be located on one or more remote servers in a cloud computing environment. Signals based on the meaning of the spoken commands are then returned to the smartphone 210 or other local device.
  • Figure 14 is a block diagram illustrating the basic form of a speech processing system in a device 210.
  • signals received at a microphone 212 are passed to a speech processing block 230.
  • the speech processing block 230 may comprise a voice activity detector, a speaker recognition block for performing a speaker identification or speaker verification process, and/or a speech recognition block for identifying the speech content of the signals.
  • the speech processing block 230 may also comprise signal conditioning circuitry, such as a pre-amplifier, analog-digital conversion circuitry, and the like.
  • the non- linearity may be in the microphone 212, or may be in signal conditioning circuitry in the speech processing block 230.
• the effect of this non-linearity in the circuitry is that ultrasonic tones may mix down into the audio band.
• Figure 15 illustrates this schematically. Specifically, Figure 15 shows a situation where there are interfering signals at two frequencies F1 and F2 in the ultrasound frequency range (i.e. at frequencies > 20kHz), which mix down as a result of the circuit non-linearity to form a signal at a frequency F3 in the audio frequency range (i.e. at frequencies between about 20Hz and 20kHz).
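This mixing-down can be demonstrated numerically: passing two ultrasonic tones through a quadratic non-linearity produces a difference tone at F3 = F2 - F1 inside the audio band. In the sketch below the 21kHz and 22kHz tones, the 48kHz sample rate and the non-linearity coefficient are all illustrative choices, not values from the text.

```python
import math

def tone_magnitudes(signal, freqs_hz, fs):
    """Single-bin DFT magnitude of `signal` at each frequency in
    freqs_hz (exact when the window spans a whole number of cycles)."""
    n = len(signal)
    mags = {}
    for f in freqs_hz:
        re = sum(s * math.cos(2 * math.pi * f * i / fs)
                 for i, s in enumerate(signal))
        im = sum(s * math.sin(2 * math.pi * f * i / fs)
                 for i, s in enumerate(signal))
        mags[f] = math.hypot(re, im) / n
    return mags
```

Feeding a two-tone ultrasonic signal through y = x + 0.5x² yields a clear component at the 1kHz difference frequency, well inside the audio band, even though both input tones lie above 20kHz.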
  • Figure 16 is a flow chart, illustrating a method of analysing an audio signal.
  • the method comprises receiving an input sound signal comprising audio and non-audio frequencies.
  • the method comprises separating the input sound signal into an audio band component and a non-audio band component.
  • the non-audio component may be an ultrasonic component.
  • the method comprises identifying possible interference within the audio band from the non-audio band. Identifying possible interference within the audio band from the non-audio band component may comprise determining whether a power level of the non-audio band component exceeds a threshold value and, if so, identifying possible interference within the audio band from the non-audio band component. Alternatively, identifying possible interference within the audio band from the non-audio band component may comprise comparing the audio band and non-audio band components.
  • problematic signals may be present accidentally, as the result of relatively high levels of background sound signals, such as ultrasonic signals from ultrasonic sensor devices or modems.
  • the problematic signals may be generated by a malicious actor in an attempt to interfere with or spoof the operation of a speech processing system, for example by generating ultrasonic signals that mix down as a result of circuit non-linearities to form audio band signals that can be misinterpreted as speech, or by generating ultrasonic signals that interfere with other aspects of the processing.
  • step 258 the method comprises adjusting the operation of a downstream speech processing module based on said identification of possible interference.
  • the adjusting of the operation of the speech processing module may take the form of modifications to the speech processing that is performed by the speech processing module, or may take the form of modifications to the signal that is applied to the speech processing module.
  • modifications to the speech processing that is performed by the speech processing module may involve placing less (or zero) reliance on the speech signal during time periods when possible interference is identified, or warning a user that there is possible interference.
  • modifications to the signal that is applied to the speech processing module may take the form of attempting to remove the effect of the interference.
  • FIG 17 is a block diagram illustrating the basic form of a speech processing system in a device 210.
  • signals received at a microphone 212 are passed to a speech processing block 230.
  • the speech processing block 230 may comprise a voice activity detector, a speaker recognition block for performing a speaker identification or speaker verification process, and/or a speech recognition block for identifying the speech content of the signals.
  • the speech processing block 230 may also comprise signal conditioning circuitry, such as a pre-amplifier, analog- digital conversion circuitry, and the like.
  • the non-linearity may be in the microphone 212, or may be in signal conditioning circuitry in the speech processing block 230.
  • the received signals are also passed to an ultrasound monitoring block 262, which separates the input sound signal into an audio band component and a non-audio band component, which may be an ultrasonic component, and identifies possible interference within the audio band from the non-audio band component.
  • the speech processing that is performed by the speech processing module may be modified appropriately.
  • FIG 18 is a block diagram illustrating the basic form of a speech processing system in a device 210.
  • signals received at a microphone 212 are passed to an ultrasound monitoring block 266, which separates the input sound signal into an audio band component and a non-audio band component, which may be an ultrasonic component, and identifies possible interference within the audio band from the non-audio band component, resulting for example from non-linearity in the microphone 212. If a source of possible interference is identified, the received signal may be modified appropriately, and the modified signal may then be applied to the speech processing module 230.
  • the speech processing block 230 may comprise a voice activity detector, a speaker recognition block for performing a speaker identification or speaker verification process, and/or a speech recognition block for identifying the speech content of the signals.
  • the speech processing block 230 may also comprise signal conditioning circuitry, such as a pre-amplifier, analog-digital conversion circuitry, and the like.
  • FIG 19 is a block diagram, illustrating the form of the ultrasound monitoring block 262 or 266, in some embodiments.
  • signals received from the microphone 212 are separated into an audio band component and a non-audio band component.
• the received signals are passed to a low-pass filter (LPF) 282, for example a low-pass filter with a cut-off frequency at or below 20kHz, which filters the input sound signal to obtain an audio band component of the input sound signal.
• the received signals are also passed to a high-pass filter (HPF) 284, for example a high-pass filter with a cut-off frequency at or above 20kHz, to obtain a non-audio band component of the input sound signal, which will be an ultrasound signal when the high-pass filter has a cut-off frequency at or above 20kHz.
• the HPF 284 may be replaced by a band-pass filter, for example with a pass-band from 20kHz to 90kHz.
• the non-audio band component of the input sound signal will be an ultrasound signal when the low frequency end of the pass band of the band-pass filter is at or above 20kHz.
  • the non-audio band component of the input sound signal is passed to a power level detect block 2150, which determines whether a power level of the non-audio band component exceeds a threshold value.
  • the power level detect block 2150 may determine whether the peak non-audio band (e.g. ultrasound) power level exceeds a threshold. For example, it may determine whether the peak ultrasound power level exceeds -30dBFS (decibels relative to full scale). Such a level of ultrasound may result from an attack by a malicious party. In any event, if the ultrasound power level exceeds the threshold value, it could be identified that this may result in interference in the audio band due to non-linearities.
  • the threshold value may be set based on knowledge of the effect of the non-linearity in the circuit.
• for example, if the effect of the non-linearity is known to be a value A(nl), for example a 40dB mixdown, then, starting from a threshold A(bb) for a power level in the audio base band which could affect system operation, for example 30dB SPL, the threshold value for the non-audio band power level may be set at A(bb) + A(nl).
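With those example values the threshold follows by simple addition: ultrasound below A(bb) + A(nl) = 30dB + 40dB = 70dB SPL cannot, after a 40dB mixdown loss, exceed the 30dB SPL level that matters in the audio base band. A minimal sketch, assuming those example values:

```python
def ultrasound_alert_threshold_db(a_bb_db=30.0, a_nl_db=40.0):
    """Ultrasound level (dB SPL) above which mixed-down energy could
    exceed the audio-band significance level: A(bb) + A(nl). The default
    values (30dB SPL, 40dB mixdown loss) follow the example in the text."""
    return a_bb_db + a_nl_db

def ultrasound_flag(ultrasound_spl_db, threshold_db):
    """Raise the interference flag when the monitored ultrasound level
    exceeds the derived threshold."""
    return ultrasound_spl_db > threshold_db
```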
• the output of the power level detect block 2150 may be a flag, to be sent to the downstream speech processing module in step 258 of the method of Figure 16, in order to control the operation thereof.
  • Figure 20 is a block diagram, illustrating the form of the ultrasound monitoring block 262 or 266, in some embodiments.
  • signals received from the microphone 212 are separated into an audio band component and a non-audio band component.
• the received signals are passed to a low-pass filter (LPF) 282, for example a low-pass filter with a cut-off frequency at or below 20kHz, which filters the input sound signal to obtain an audio band component of the input sound signal.
• the received signals are also passed to a high-pass filter (HPF) 284, for example a high-pass filter with a cut-off frequency at or above 20kHz, to obtain a non-audio band component of the input sound signal.
• the HPF 284 may be replaced by a band-pass filter, for example with a pass-band from 20kHz to 90kHz.
• the non-audio band component of the input sound signal will be an ultrasound signal when the low frequency end of the pass band of the band-pass filter is at or above 20kHz.
  • the non-audio band component of the input sound signal is passed to a power level compare block 2160.
• identifying possible interference within the audio band from the non-audio band component may comprise: measuring a signal power in the audio band component, Pa; and measuring a signal power in the non-audio band component, Pb. Then, if (Pa/Pb) is less than a threshold limit, it could be identified that this may result in interference in the audio band due to non-linearities.
  • the output of the power level compare block 2160 may be a flag, to be sent to the downstream speech processing module in step 258 of the method of Figure 16, in order to control the operation thereof. More specifically, this flag may indicate to the speech processing module that the quality of the input sound signal is unreliable for speech processing. The operation of the downstream speech processing module may then be controlled based on the flagged unreliable quality.
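The comparison performed in the power level compare block 2160 might be sketched as below, using mean-square power estimates for Pa and Pb; the ratio limit is an assumed tuning parameter.

```python
def audio_band_unreliable(audio_band, ultra_band, ratio_limit=1.0):
    """Flag the input as unreliable for speech processing when the ratio
    of audio band power Pa to non-audio band power Pb is below a limit."""
    pa = sum(s * s for s in audio_band) / len(audio_band)
    pb = sum(s * s for s in ultra_band) / len(ultra_band)
    if pb == 0.0:
        return False            # no ultrasound energy, nothing to flag
    return (pa / pb) < ratio_limit
```

The returned flag would drive step 258, for instance by telling the speech processing module to place less reliance on the affected frames.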
  • Figure 21 is a block diagram, illustrating the form of the ultrasound monitoring block 262 or 266, in some embodiments.
  • Signals received from the microphone 212 are separated into an audio band component and a non-audio band component.
• the received signals are passed to a low-pass filter (LPF) 282, for example a low-pass filter with a cut-off frequency at or below 20kHz, which filters the input sound signal to obtain an audio band component of the input sound signal.
• the received signals are also passed to a high-pass filter (HPF) 284, for example a high-pass filter with a cut-off frequency at or above 20kHz, to obtain a non-audio band component of the input sound signal.
• the HPF 284 may be replaced by a band-pass filter, for example with a pass-band from 20kHz to 90kHz.
• the non-audio band component of the input sound signal will be an ultrasound signal when the low frequency end of the pass band of the band-pass filter is at or above 20kHz.
  • the non-audio band component of the input sound signal may be passed to a block 286 that simulates the effect of a non-linearity on the signal, and then to a low-pass filter 288.
  • the audio band component generated by the low-pass filter 282 and the simulated non-linear signal generated by the block 286 and the low-pass filter 288 are then passed to a comparison block 290.
  • the comparison block 290 measures a signal power in the audio band component, measures a signal power in the non-audio band component, and calculates a ratio of the signal power in the audio band component to the signal power in the non-audio band component. If this ratio is below a threshold limit, this is taken to indicate that the input sound signal may contain too high a level of ultrasound to be reliably used for speech processing.
  • the output of the comparison block 290 may be a flag, to be sent to the downstream speech processing module in step 258 of the method of Figure 16, in order to control the operation thereof.
  • the comparison block 290 detects the envelope of the signal of the non-audio band component, and detects a level of correlation between the envelope of the signal and the audio band component. Detecting the level of correlation may comprise measuring a time-domain correlation between identified signal envelopes of the non-audio band component, and speech components of the audio band component. In this situation, some or all of the audio band component may result from ultrasound signals in the ambient sound, that have been downconverted into the audio band by non-linearities in the microphone 212. This will lead to a correlation with the non-audio band component that is selected by the filter 284. Therefore, the presence of such a correlation exceeding a threshold value is taken as an indication that there may be non-audio band interference within the audio band.
  • the output of the comparison block 290 may be a flag, to be sent to the downstream speech processing module in step 258 of the method of Figure 16, in order to control the operation thereof.
  • the block 286 simulates the effect of a non-linearity on the signal, to provide a simulated non-linear signal.
  • the block 286 may attempt to model the non-linearity in the system that may be causing the interference by non-linear downconversion of the input sound signal.
  • the non-linearities simulated by the block 286 may be second-order and/or third-order non-linearities.
  • the comparison block 290 then detects a level of correlation between the simulated non-linear signal and the audio band component. If the level of correlation exceeds a threshold value, then it is determined that there may be interference within the audio band caused by signals from the non-audio band. Again, in that case, the output of the comparison block 290 may be a flag, to be sent to the downstream speech processing module in step 258 of the method of Figure 16, in order to control the operation thereof.
  • Figure 22 is a block diagram, illustrating the form of the ultrasound monitoring block 266, in some other embodiments.
  • Signals received from the microphone 212 are separated into an audio band component and a non-audio band component.
• the received signals are passed to a low-pass filter (LPF) 282, for example a low-pass filter with a cut-off frequency at or below 20kHz, which filters the input sound signal to obtain an audio band component of the input sound signal.
  • the HPF 284 may be replaced by a band-pass filter, for example with a pass-band from 20 kHz to 90 kHz.
  • the non-audio band component of the input sound signal will be an ultrasound signal when the low frequency end of the pass band of the band-pass filter is at or above 20 kHz.
  • the non-audio band component of the input sound signal may be passed to a block 286 that simulates the effect of a non-linearity on the signal, and then to a low-pass filter 288.
  • the adjustment of the operation of the downstream speech processing module in step 258 of the method of Figure 16 comprises providing a compensated sound signal to the downstream speech processing module.
  • the step of providing the compensated sound signal may comprise subtracting the simulated non-linear signal from the audio band component to provide the compensated output signal, which is then provided to the downstream speech processing module.
  • the simulated non-linear signal generated by the block 286 and the low-pass filter 288 are passed to a further filter 2100.
  • the audio band component generated by the low-pass filter 282 is passed to a subtractor 2102, and the output of the further filter 2100 is subtracted from the audio band component, in order to remove from the audio band signal any component caused by downconversion of ultrasound signals.
  • the further filter 2100 may be an adaptive filter, and in its simplest form it may be an adaptive gain.
  • the further filter 2100 is adapted such that the component of the filtered simulated non-linearity signal in the compensated output signal is minimised.
  • the resulting compensated audio band signal is passed to the downstream speech processing module.
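In its simplest form, the further filter 2100 is a single adaptive gain. The least-squares value of that gain, which minimises the residual power and hence the component of the simulated-non-linearity signal left in the output, can be computed in closed form. The sketch below makes that single-gain assumption:

```python
import numpy as np

def compensate(audio_band, sim_nl_lp):
    """Subtract the best-fitting scaled copy of the (low-pass filtered)
    simulated non-linear signal from the audio band component.

    The gain g is the least-squares solution, i.e. the value an adaptive
    gain would converge to when adapted to minimise the residual power.
    """
    g = np.dot(sim_nl_lp, audio_band) / np.dot(sim_nl_lp, sim_nl_lp)
    return audio_band - g * sim_nl_lp
```

After compensation the residual is orthogonal to the filtered simulated-non-linearity signal, so (to the extent the single-gain model holds) no component of it remains in the signal passed to the downstream speech processing module.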
  • Figure 23 is a block diagram, illustrating the form of the ultrasound monitoring block 266, in some other embodiments.
  • the signals from the microphone 212 may be analog signals, and they may be passed to an analog-digital converter for conversion to digital form before being passed to the respective filters.
  • analog-digital converters have not been shown in the figures.
  • Figure 23 shows a case in which the analog-digital conversion is not ideal, and so Figure 23 shows signals received from the microphone 212 being passed to an analog-digital converter (ADC) 2120. Again, the resulting signal is separated into an audio band component and a non-audio band component.
  • the received signals are passed to a low-pass filter (LPF) 282, for example a low-pass filter with a cut-off frequency at or below 20 kHz, which filters the input sound signal to obtain an audio band component of the input sound signal.
  • Figure 23 shows the output of the ADC 2120 being passed not to a high-pass filter, but to a band-pass filter (BPF) 2122.
  • the lower end of the pass-band may for example be at 20 kHz, with the upper end of the pass-band being at a frequency that excludes the frequencies that are corrupted by quantization noise, for example at 90 kHz.
  • the non-audio band component of the input sound signal may be passed to a block 286 that simulates the effect of a non-linearity on the signal, and then to a low-pass filter 288.
  • the adjustment of the operation of the downstream speech processing module in step 258 of the method of Figure 16 comprises providing a compensated sound signal to the downstream speech processing module.
  • the step of providing the compensated sound signal may comprise subtracting the simulated non-linear signal from the audio band component to provide the compensated output signal, which is then provided to the downstream speech processing module.
  • the audio band component generated by the low-pass filter 282 is passed to a subtractor 2102, and the simulated non-linear signal generated by the block 286 and the low-pass filter 288 is subtracted from the audio band component. This attempts to remove from the audio band signal any component caused by downconversion of ultrasound signals.
  • the resulting compensated audio band signal is passed to the downstream speech processing module.
  • Figure 24 is a block diagram, illustrating the form of the ultrasound monitoring block 266, in some other embodiments, where the non-linearity in the microphone 212 or elsewhere is unknown (for example the magnitude of the non-linearity and/or the relative strengths of second-order non-linearity and third-order non-linearity).
  • the step of simulating a non-linearity comprises providing the non-audio band component to an adaptive non-linearity module, and the method comprises controlling the adaptive non-linearity module such that the component of the simulated non-linearity signal in the compensated output signal is minimised.
  • Figure 24 shows the received signal being passed to a low-pass filter (LPF) 282, for example a low-pass filter with a cut-off frequency at or below 20 kHz, which filters the input sound signal to obtain an audio band component of the input sound signal.
  • Figure 24 shows the received signal being passed to a band-pass filter (BPF) 2122.
  • the lower end of the pass-band may for example be at 20 kHz, with the upper end of the pass-band being at a frequency that excludes the frequencies that are corrupted by quantization noise, for example at 90 kHz.
  • the non-audio band component of the input sound signal may be passed to an adaptive block 2140 that simulates the effect of a non-linearity on the signal.
  • the output of the block 2140 is passed to a low-pass filter 288.
  • the adjustment of the operation of the downstream speech processing module in step 258 of the method of Figure 16 comprises providing a compensated sound signal to the downstream speech processing module.
  • the step of providing the compensated sound signal may comprise subtracting the simulated non-linear signal from the audio band component to provide the compensated output signal, which is then provided to the downstream speech processing module.
  • the audio band component generated by the low-pass filter 282 is passed to a subtractor 2102, and the simulated non-linear signal generated by the block 2140 and the low-pass filter 288 is subtracted from the audio band component. This attempts to remove from the audio band signal any component caused by downconversion of ultrasound signals.
  • the resulting compensated audio band signal is passed to the downstream speech processing module.
  • the non-linearity may be modelled in the block 2140 with a polynomial p(x), with the error being fed back from the output of the subtractor 2102.
  • the Least Mean Squares algorithm may update the m-th polynomial term as: p_m ← p_m + μ · ε · ⟨x^m⟩, where ⟨·⟩ is a filter function.
  • a simple Boxcar filter could be used.
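One plausible reading of this adaptive update can be sketched as follows. The exact form of the update and the use of a frame-mean (boxcar) as the filter function ⟨·⟩ are assumptions drawn from the surrounding text, not a definitive implementation:

```python
import numpy as np

def lms_polynomial_step(p, x_frame, err_frame, mu):
    """One LMS update of the polynomial model p(x) = sum_m p[m] * x**m.

    Each coefficient is nudged along the boxcar-filtered gradient
    estimate mu * <err * x**m>, where err is the error fed back from
    the output of the subtractor 2102.
    """
    p = p.copy()
    for m in range(len(p)):
        p[m] += mu * np.mean(err_frame * x_frame**m)
    return p
```

Iterating this step frame by frame drives the modelled polynomial towards the actual non-linearity, since the error (and therefore the update) shrinks as the model improves.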
  • any of the embodiments described above can be used in a two-stage system, in which the first stage corresponds to that shown in Figure 19. That is, the received signal is filtered to obtain an audio band component and a non-audio band (for example, ultrasound) component of the input signal. It is then determined whether the signal power in the non-audio band component is below or above a threshold value. If there is a low power level in the ultrasound band, this indicates that there is unlikely to be a problem caused by downconversion of ultrasound signals to the audio band. If there is a higher power level in the ultrasound band, there is a possibility of a problem, and so the further processing described above with reference to Figure 21, 22, 23 or 24 is performed to determine if interference is likely, and to take mitigating action if required.
  • the input sound signal may be flagged as free of non-audio band interference, and, if the measured signal power level in the non-audio band component is above a threshold level X, the audio band and non-audio band components may be compared to identify possible interference within the audio band from the non-audio band.
  • This allows for low-power operation, as the comparison step will only be performed in situations where the non-audio band component has a signal power above the threshold level. For a non-audio band component having signal power below such a threshold, it can be assumed that no interference will be present in the input sound signal used for downstream speech processing.
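The first-stage gate can be sketched as a simple power measurement; the threshold value here is an illustrative assumption:

```python
import numpy as np

def ultrasound_power_gate(ultra_band, threshold_db=-50.0):
    """Return True when the power in the non-audio band component is high
    enough that the comparison stage should run at all; below the
    threshold the input sound signal can be flagged as free of non-audio
    band interference without further processing."""
    power = np.mean(np.asarray(ultra_band) ** 2)
    power_db = 10.0 * np.log10(power + 1e-30)   # dB relative to full scale
    return power_db > threshold_db
```

Only when this gate fires does the (more expensive) correlation or compensation processing run, which is what makes the two-stage arrangement suitable for low-power operation.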
  • processor control code, for example on a non-volatile carrier medium such as a disk, CD- or DVD-ROM, programmed memory such as read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. For many applications, embodiments will be implemented on a DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array).
  • the code may comprise conventional program code or microcode or, for example, code for setting up or controlling an ASIC or FPGA.
  • the code may also comprise code for dynamically configuring re-configurable apparatus such as re-programmable logic gate arrays.
  • the code may comprise code for a hardware description language such as Verilog™ or VHDL (Very high speed integrated circuit Hardware Description Language).
  • the code may be distributed between a plurality of coupled components in communication with one another.
  • the embodiments may also be implemented using code running on a field-(re)programmable analogue array or similar device in order to configure analogue hardware.
  • module shall be used to refer to a functional unit or block which may be implemented at least partly by dedicated hardware components such as custom defined circuitry and/or at least partly be implemented by one or more software processors or appropriate code running on a suitable general purpose processor or the like.
  • a module may itself comprise other modules or functional units.
  • a module may be provided by multiple components or sub-modules which need not be co-located and could be provided on different integrated circuits and/or running on different processors.
  • Embodiments may be implemented in a host device, especially a portable and/or battery powered host device such as a mobile computing device for example a laptop or tablet computer, a games console, a remote control device, a home automation controller or a domestic appliance including a domestic temperature or lighting control system, a toy, a machine such as a robot, an audio player, a video player, or a mobile telephone for example a smartphone.

Abstract

Detecting a replay attack on a voice biometrics system comprises: receiving a speech signal; generating an ultrasound signal; detecting a reflection of the generated ultrasound signal; detecting Doppler shifts in the reflection of the generated ultrasound signal; and identifying whether the received speech signal is indicative of the liveness of a speaker based on the detected Doppler shifts. Identifying whether the received speech signal is indicative of liveness based on the detected Doppler shifts comprises determining whether the detected Doppler shifts correspond to a speech articulation rate.

Description

DETECTION OF LIVENESS

Technical Field

Embodiments described herein relate to methods and devices for detecting liveness of a speaker. As one example, the liveness detection can be used for detecting a replay attack on a voice biometrics system.
In addition, embodiments described herein relate to methods and devices for improving the robustness of a speech processing system.
Background
Biometrics systems are becoming widely used. In a voice biometrics system, a user trains the system by providing samples of their speech during an enrolment phase. In subsequent use, the system is able to discriminate between the enrolled user and non-registered speakers. Voice biometrics systems can in principle be used to control access to a wide range of services and systems. One way for a malicious party to attempt to defeat a voice biometrics system is to obtain a recording of the enrolled user's speech, and to play back the recording in an attempt to impersonate the enrolled user and to gain access to services that are intended to be restricted to the enrolled user. This is referred to as a replay attack, or as a spoofing attack.
In a facial recognition, or other type of biometrics system, the system recognises a characteristic of the user. Again, one way for a malicious party to attempt to defeat such a biometrics system is to present the system with a photograph or video recording of the enrolled user.
In addition, many devices include microphones, which can be used to detect ambient sounds. In many situations, the ambient sounds include the speech of one or more nearby speakers. Audio signals generated by the microphones can be used in many ways. For example, audio signals representing speech can be used as the input to a speech recognition system, allowing a user to control a device or system using spoken commands.
It has been suggested that it is possible to interfere with the operation of such a system by transmitting an ultrasound signal, which is by definition inaudible to the user of the device, but which is converted into a signal in the audio frequency band by non-linear components of the electronic circuitry in the device, and which will be recognised as speech by the speech recognition system. Such a malicious ultrasonics-based attack is sometimes referred to as a "dolphin attack", due to the similarity with how dolphins communicate in ultrasonic audio bands.
Summary
According to a first aspect of the present invention, there is provided a method of liveness detection. The method comprises: receiving a speech signal; generating an ultrasound signal; detecting a reflection of the generated ultrasound signal; detecting Doppler shifts in the reflection of the generated ultrasound signal; and identifying whether the received speech signal is indicative of the liveness of a speaker based on the detected Doppler shifts. Identifying whether the received speech signal is indicative of liveness based on the detected Doppler shifts comprises determining whether the detected Doppler shifts correspond to a speech articulation rate.
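As a sketch of the final step of this method — checking that the detected Doppler shifts vary at a speech articulation rate — one could examine the modulation spectrum of a per-frame Doppler-shift track. The frame rate and the articulation-rate band used here (roughly the syllable rate of natural speech) are illustrative assumptions, not values from this disclosure:

```python
import numpy as np

def matches_articulation_rate(doppler_track, frame_rate_hz, band=(2.0, 15.0)):
    """Return True when the dominant modulation frequency of the
    Doppler-shift track (one measured shift per analysis frame) falls
    inside the assumed articulation-rate band.

    A live talker's articulators move at a rate of a few Hz; a static
    loudspeaker replaying a recording would not produce such modulation.
    """
    d = np.asarray(doppler_track, dtype=float)
    d = (d - d.mean()) * np.hanning(d.size)
    spec = np.abs(np.fft.rfft(d))
    freqs = np.fft.rfftfreq(d.size, d=1.0 / frame_rate_hz)
    peak = freqs[np.argmax(spec[1:]) + 1]   # ignore the DC bin
    return band[0] <= peak <= band[1]
```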
According to another aspect of the present invention, there is provided a system configured for performing the method of the first aspect.
According to another aspect of the present invention, there is provided a device comprising such a system. The device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
According to another aspect of the present invention, there is provided a computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to the first aspect.

According to another aspect of the present invention, there is provided a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the first aspect.
According to a second aspect of the present invention, there is provided a method of detecting liveness of a speaker, which comprises: generating an ultrasound signal; receiving an audio signal comprising a reflection of the ultrasound signal; using the received audio signal comprising the reflection of the ultrasound signal to detect the liveness of a speaker; monitoring ambient ultrasound noise; and adjusting the operation of a system receiving the audio signal, based on a level of the reflected ultrasound and the monitored ambient ultrasound noise. The method can be used in a voice biometrics system, in which case detecting the liveness of a speaker comprises determining whether a received speech signal may be a product of a replay attack. The operation of the voice biometrics system may be adjusted based on a level of the reflected ultrasound and the monitored ambient ultrasound noise.
According to another aspect of the present invention, there is provided a system for liveness detection, the system being configured for performing the method of the second aspect.
According to another aspect of the present invention, there is provided a device comprising such a system. The device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
According to another aspect of the present invention, there is provided a computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to the second aspect.
According to another aspect of the present invention, there is provided a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the second aspect.

According to a third aspect of the present invention, there is provided a method of liveness detection in a device. The method comprises: receiving a speech signal from a voice source; generating and transmitting an ultrasound signal through a transducer of the device; detecting a reflection of the transmitted ultrasound signal; detecting Doppler shifts in the reflection of the generated ultrasound signal; and identifying whether the received speech signal is indicative of liveness of a speaker based on the detected Doppler shifts. The method further comprises: obtaining information about a position of the device; and adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device.
According to another aspect of the present invention, there is provided a system for liveness detection, the system being configured for performing the method of the third aspect.

According to another aspect of the present invention, there is provided a device comprising such a system. The device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
According to another aspect of the present invention, there is provided a computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to the third aspect.

According to another aspect of the present invention, there is provided a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the third aspect.

According to a fourth aspect of the present invention, there is provided a method for improving the robustness of a speech processing system having at least one speech processing module, the method comprising: receiving an input sound signal comprising audio and non-audio frequencies; separating the input sound signal into an audio band component and a non-audio band component; identifying possible interference within the audio band from the non-audio band component; and adjusting the operation of a downstream speech processing module based on said identification.

According to another aspect of the present invention, there is provided a system for improving the robustness of a speech processing system, configured for operating in accordance with the method of the fourth aspect.
According to another aspect of the present invention, there is provided a device comprising such a system. The device may comprise a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
According to another aspect of the present invention, there is provided a computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to the fourth aspect.
According to another aspect of the present invention, there is provided a non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to the fourth aspect.
Brief Description of Drawings
For a better understanding of the present invention, and to show how it may be put into effect, reference will now be made to the accompanying drawings, in which:
Figure 1 illustrates a smartphone;
Figure 2 is a schematic diagram, illustrating the form of the smartphone.
Figure 3 illustrates a situation in which a replay attack is being performed;
Figure 4 is a flow chart illustrating a method of detecting liveness;
Figure 5 illustrates a speech processing system, including a system for detecting liveness;
Figure 6 is a flow chart illustrating a part of the method of detecting liveness;
Figure 7 illustrates various possible uses of smartphones;
Figure 8 is a flow chart illustrating a part of the method of detecting liveness;
Figure 9 is a flow chart illustrating a part of the method of detecting liveness;
Figure 10 is a block diagram, illustrating a part of the system for detecting liveness;
Figure 11 illustrates results of the method of detecting liveness;
Figure 12 illustrates a smartphone;
Figure 13 is a schematic diagram, illustrating the form of the smartphone;
Figure 14 illustrates a speech processing system;
Figure 15 illustrates an effect of using a speech processing system;
Figure 16 is a flow chart illustrating a method of handling an audio signal;
Figure 17 is a block diagram illustrating a system using the method of Figure 16;
Figure 18 is a block diagram illustrating a system using the method of Figure 16;
Figure 19 is a block diagram of a system using the method of Figure 16;
Figure 20 is a block diagram of a system using the method of Figure 16;
Figure 21 is a block diagram of a system using the method of Figure 16;
Figure 22 is a block diagram of a system using the method of Figure 16;
Figure 23 is a block diagram of a system using the method of Figure 16; and
Figure 24 is a block diagram of a system using the method of Figure 16.

Detailed Description of Embodiments

The description below sets forth example embodiments according to this disclosure. Further example embodiments and implementations will be apparent to those having ordinary skill in the art. Further, those having ordinary skill in the art will recognize that various equivalent techniques may be applied in lieu of, or in conjunction with, the embodiments discussed below, and all such equivalents should be deemed as being encompassed by the present disclosure.
One example of the invention is illustrated with reference to its use in a smartphone, though it will be appreciated that it may be implemented in any suitable device, as described in more detail below.
Figure 1 illustrates a smartphone 110, having a microphone 112 for detecting ambient sounds. In normal use, the microphone is of course used for detecting the speech of a user who is holding the smartphone 110.

The smartphone 110 also has two loudspeakers 114, 116. The first loudspeaker 114 is located at the top of the smartphone 110, when it is held in its normal operating position for making a voice call, and is used for playing the sounds that are received from the remote party to the call. The second loudspeaker 116 is located at the bottom of the smartphone 110, and is used for playing back media content from local or remote sources. Thus, the second loudspeaker 116 is used for playing back music that is stored on the smartphone 110 or sounds associated with videos that are being accessed over the internet.

The illustrated smartphone 110 also has two additional microphones 112a, 112b. The additional microphones, if present in the device, may be provided at any suitable location. In this illustrated device, one microphone 112a is located at the top end of the front of the device, while another microphone 112b is located at the top end of the side of the device.
Figure 2 is a schematic diagram, illustrating the form of the smartphone 110. Specifically, Figure 2 shows various interconnected components of the smartphone 110. It will be appreciated that the smartphone 110 will in practice contain many other components, but the following description is sufficient for an understanding of the present invention.
Thus, Figure 2 shows the microphone 112 mentioned above. In this particular illustrated embodiment, the smartphone 110 is provided with multiple microphones 112, 112a, 112b, etc. Figure 2 also shows the loudspeakers 114, 116.
Figure 2 also shows a memory 118, which may in practice be provided as a single component or as multiple components. The memory 118 is provided for storing data and program instructions.

Figure 2 also shows a processor 120, which again may in practice be provided as a single component or as multiple components. For example, one component of the processor 120 may be an applications processor of the smartphone 110.
Figure 2 also shows a transceiver 122, which is provided for allowing the smartphone 110 to communicate with external networks. For example, the transceiver 122 may include circuitry for establishing an internet connection over a WiFi local area network and/or over a cellular network.
Figure 2 also shows audio processing circuitry 124, for performing operations on the audio signals detected by the microphone 112 as required. For example, the audio processing circuitry 124 may filter the audio signals or perform other signal processing operations.
The audio signal processing circuitry is also able to generate audio signals for playback through the loudspeakers 114, 116, as discussed in more detail below.
Figure 2 also shows that the smartphone 110 may include one or more sensors 126. In certain embodiments, the sensor(s) may include any combination of the following: gyroscopes, accelerometers, proximity sensors, light level sensors, touch sensors, and a camera.

In this illustrated embodiment, the smartphone 110 is provided with voice biometric functionality, and with control functionality. Thus, the smartphone 110 is able to perform various functions in response to spoken commands from an enrolled user. The biometric functionality is able to distinguish between spoken commands from the enrolled user, and the same commands when spoken by a different person.

Thus, certain embodiments of the invention relate to operation of a smartphone or another portable electronic device with some sort of voice operability, for example a tablet or laptop computer, a games console, a home control system, a home entertainment system, an in-vehicle entertainment system, a domestic appliance, or the like, in which the voice biometric functionality is performed in the device that is intended to carry out the spoken command. Certain other embodiments relate to systems in which the voice biometric functionality is performed on a smartphone or other device, which then transmits the commands to a separate device if the voice biometric functionality is able to confirm that the speaker was the enrolled user.
In some embodiments, while voice biometric functionality is performed on the smartphone 110 or other device that is located close to the user, the spoken commands are transmitted using the transceiver 122 to a remote speech recognition system, which determines the meaning of the spoken commands. For example, the speech recognition system may be located on one or more remote servers in a cloud computing environment. Signals based on the meaning of the spoken commands are then returned to the smartphone 110 or other local device. In other embodiments, the speech recognition system is also located on the device 110.

One attempt to deceive a voice biometric system is to play a recording of an enrolled user's voice in a so-called replay or spoof attack.
Figure 3 shows an example of a situation in which a replay attack is being performed. Thus, in Figure 3, the smartphone 110 is provided with voice biometric functionality. In this example, the smartphone 110 is in the possession, at least temporarily, of an attacker, who has another smartphone 130. The smartphone 130 has been used to record the voice of the enrolled user of the smartphone 110. The smartphone 130 is brought close to the microphone inlet 112 of the smartphone 110, and the recording of the enrolled user's voice is played back. If the voice biometric system is unable to determine that the enrolled user's voice that it recognises is a recording, the attacker will gain access to one or more services that are intended to be accessible only by the enrolled user.
At the same time, or separately, when the smartphone 110 is provided with a camera-based biometric functionality, such as a facial recognition system, an attacker may use the display of the smartphone 130 to show a photo or video of the enrolled user, in an attempt to defeat the facial recognition system.
Embodiments described herein therefore attempt to perform liveness detection, that is, to detect whether any voice sounds that are detected were produced by a person who is actually present and speaking.
Figure 4 is a flow chart, illustrating a method of liveness detection, for example for use in a biometrics system, and in this illustrated example used for detecting a replay attack on a voice biometrics system, and Figure 5 is a block diagram illustrating functional blocks in one example of a speech processing system that includes the voice biometrics system.

Specifically, in step 150 in the method of Figure 4, a signal is received on an input 170 of the system shown in Figure 5. Thus, the input 170 may be connected to the microphone 112 shown in Figure 1 or the multiple microphones 112, 112a, 112b, etc. shown in Figure 2. The received signal is passed to a voice activity detector (VAD) 172, which detects when the received signal contains speech.
The received signal is also passed to a keyword detection block 174. If it is determined by the voice activity detector 172 that the received signal contains speech, the keyword detection block 174 is activated, and it acts to detect the presence of a predetermined keyword in the detected speech. For example, the speech processing system of a smartphone might as a default operate in a low power mode, reflecting the fact that speech processing will be required for only a small fraction of the operating life of the device. The speech processing system may then be taken out of the low-power mode by the user uttering the predetermined keyword or phrase, such as "Hello phone". The received signal is also passed to a speaker recognition block 176. If it is determined by the keyword detection block 174 that the predetermined keyword is present in the detected speech, the speaker recognition block 176 then attempts to determine whether the person who uttered the predetermined keyword is the registered user of the device and/or of a particular application on the device. Suitable biometric techniques are known for determining whether the speaker of the speech that is present in the received signal is the registered user.
If it is determined by the speaker recognition block 176 that the person who uttered the predetermined keyword is the registered user of the device and/or of the particular application on the device, then the received signal is passed to a speech processing block 178, which may be present on the device or may be located remotely, in the cloud. The speech processing block 178 then determines the content of the speech. If the speech contains a command, for example, then the speech processing block 178 generates a suitable signal for causing that command to be acted upon.
The system shown in Figure 5 includes a mechanism for performing liveness detection, and hence for detecting whether the received signal containing speech has originated from a replay attack, as illustrated in Figure 3.
Thus, in step 152 of the method shown in Figure 4, an ultrasound signal is generated and transmitted, by the ultrasound generate and transmit block 180 shown in Figure 5. The ultrasound transmit block 180 may operate at all times. In other embodiments, the ultrasound transmit block 180 operates only when it receives an enable signal on its input 182. The enable signal may be generated, for example, when the voice activity detector 172 determines that the received signal contains speech, or when the keyword detection block 174 detects the presence of the predetermined keyword, or when the speaker recognition block 176 starts to perform a biometric technique to determine whether the person who uttered the predetermined keyword is the registered user.
The ultrasound signal may be a single tone sine wave, or other configurations may be used, for example a chirp signal. The frequency of the ultrasound signal may be selected to be relatively close to 20kHz for ease of transmission, while being high enough to ensure that it is not audible. In step 154 of the method shown in Figure 4, a reflection of the generated ultrasound signal is detected.
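The two probe signals mentioned above, a single tone and a chirp, can be sketched as follows. This is a minimal illustration rather than part of the disclosed embodiments: the 96kHz sample rate, the amplitude, and the frequency values used in the example are assumptions chosen purely for demonstration.

```python
import math

def ultrasound_tone(freq_hz, duration_s, sample_rate_hz=96000.0, amplitude=0.1):
    """Single-tone sine probe just above the audible band."""
    n_samples = int(duration_s * sample_rate_hz)
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * k / sample_rate_hz)
            for k in range(n_samples)]

def ultrasound_chirp(f_start_hz, f_end_hz, duration_s,
                     sample_rate_hz=96000.0, amplitude=0.1):
    """Linear chirp sweeping between two ultrasonic frequencies;
    instantaneous frequency is f_start_hz + rate * t."""
    n_samples = int(duration_s * sample_rate_hz)
    rate = (f_end_hz - f_start_hz) / duration_s
    return [amplitude * math.sin(2.0 * math.pi *
                                 (f_start_hz * t + 0.5 * rate * t * t))
            for t in (k / sample_rate_hz for k in range(n_samples))]
```

For instance, `ultrasound_tone(21000.0, 0.01)` yields 10ms of a 21kHz probe at a 96kHz sample rate.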
In the system shown in Figure 5, a signal is received on an input 184, and passed to an ultrasound detection block 186. For example, the input 184 may be connected to one or more of the multiple microphones 112, 112a, 112b, etc. shown in Figure 2, to receive any signal detected thereby.
The received signal is passed to the ultrasound detection block 186, which may for example comprise one or more filters for selecting signals having a frequency that is close to the frequency of the ultrasound signal transmitted by the ultrasound transmit block 180. Reflected ultrasound signals may be Doppler shifted in frequency, but the Doppler shifts are unlikely to be much more than 100Hz, and so the ultrasound detection block 186 may comprise a filter for selecting signals having a frequency that is within 100Hz of the frequency of the ultrasound signal transmitted by the ultrasound transmit block 180.
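The 100Hz figure quoted above can be checked against the standard two-way Doppler relation for a reflection off a moving surface, f_shift ≈ (2v/c)·f0. The sketch below assumes a 22kHz carrier and a lip velocity of around 0.5 m/s; both values are illustrative assumptions, not figures taken from the embodiments.

```python
SPEED_OF_SOUND_M_S = 343.0  # in air at roughly 20 degrees C

def reflection_doppler_shift_hz(carrier_hz, surface_velocity_m_s):
    """Two-way Doppler shift for a reflection off a surface moving at the
    given velocity towards (positive) or away from (negative) the device."""
    return 2.0 * surface_velocity_m_s / SPEED_OF_SOUND_M_S * carrier_hz
```

With a 22kHz carrier and a surface velocity of 0.5 m/s, the shift works out to roughly 64Hz, which is consistent with a detection filter that passes frequencies within 100Hz of the carrier.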
In step 156 of the method shown in Figure 4, the received ultrasound signal detected by the ultrasound detection block 186 is passed to a Doppler detect block 188, to detect Doppler shifts in the reflection of the generated ultrasound signal. Thus, the received reflected ultrasound signal is compared with the generated ultrasound signal to identify frequency shifts in the reflected signal that are caused by reflections off a moving surface, such as the face, and in particular the lips, of a person who is speaking to generate the detected speech signal.
In step 158 of the method shown in Figure 4, it is determined based on the detected Doppler shifts whether these Doppler shifts provide good evidence for the liveness of a person generating the detected speech. In the illustrated embodiment shown in Figure 5, the output of the Doppler detect block 188 is applied to one input of a correlation block 190. The received audio signal on the input 170 is applied to another input of the correlation block 190. In an alternative embodiment, a signal generated by the voice activity detect block 172 is applied to the other input of the correlation block 190. The output of the correlation block 190 is applied to a determination block 192 shown in Figure 5. If it is found by the correlation block 190 that there is a correlation between time periods in which Doppler shifts are detected in the reflection of the generated ultrasound signal, and time periods in which speech content is identified in the received speech signal, this indicates that the detected speech is generated by a live person moving their lips to generate the sound. If the degree of correlation is low, one possible reason for this may be that the detected speech is not generated by a live person moving their lips to generate the sound. One possible cause of this is that the detected speech is in fact generated by a replay attack. Therefore, the determination block 192 produces an output signal that contains information about the liveness of the speaker, and hence about the likelihood that the detected speech was generated by a replay attack. This output signal is applied, in this illustrated embodiment, to the speaker recognition block 176, which is performing one or more voice biometrics processes to determine whether the speaker is the registered user of the device. The speaker recognition block 176 can then use the output signal as one of several factors that it uses to determine whether the speaker is in fact the registered user of the device.
For example, there may be one or more factors which indicate whether the detected speech is the speech of the registered user, and one or more factors which indicate whether the detected speech may have resulted from a replay attack.
In other examples, the liveness detection can be used for other purposes, for example for detecting an attempt to defeat a facial recognition system by presenting a still or moving image of an enrolled user.
As discussed in more detail below, the purpose of generating the ultrasound signal is to detect the movement of a speaker's face, and in particular the lips, while speaking. For this to operate successfully, it is advantageous for the ultrasound signal to be varied depending on information about the use of the device.
Thus, as described above, step 152 of the process shown in Figure 4 involves generating and transmitting the ultrasound signal. Figure 6 is a flow chart, giving more detail about this step, in some embodiments. Specifically, in step 1110 of the method, the system obtains information about a position of the device 110. For example, obtaining information about a position of the device may comprise obtaining information about an orientation of the device. Information about the orientation of the device may for example be obtained from gyroscopes and/or accelerometers provided as sensors 126 in the device 110. As one alternative, obtaining information about a position of the device may comprise obtaining information about a distance of the device from the voice source. Information about a distance of the device from the voice source may for example be obtained by detecting the levels of signals generated by the microphones 112, 112a, 112b. For example, a higher signal level from one microphone may indicate that the voice source is closer to that microphone than to one or more other microphones.
As another alternative, obtaining information about a position of the device may comprise obtaining information about a position of the device relative to a supposed speaker. Information about the position of the device relative to a supposed speaker may for example be obtained from one or more proximity sensors provided as a sensor 126 in the device 110. Information about the position of the device relative to a supposed speaker may also be obtained from one or more light level sensors provided as a sensor 126 in the device 110. Information about the position of the device relative to a supposed speaker may also be obtained from one or more touch sensors provided as a sensor 126 in the device 110, indicating how the device 110 is being held by a user. Information about the position of the device relative to a supposed speaker may also be obtained from a camera provided as a sensor 126 in the device 110, which can track the position of a user's face relative to the device 110. Then, in step 1112, the method involves adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device.
Adapting the generating and transmitting of the ultrasound signal may for example comprise adjusting a transmit power of the ultrasound signal. As another example, when the device has multiple transducers 114, 116, adapting the generating and transmitting of the ultrasound signal may comprise selecting the one or more transducers in which the ultrasound signal is generated, with the intention that the ultrasound signal should be generated from a transducer that is close to the user's mouth in order to be able to detect movement of the user's lips. For example, obtaining information about a position of the device may comprise obtaining information about a distance of the device from the voice source, and adapting the generating and transmitting of the ultrasound signal may comprise adjusting a transmit power of the ultrasound signal, with a higher power being used when the device is further from the voice source, at least for distances below a certain limit. This allows the device to generate ultrasound signals that produce clearly detectable reflections, without risking transmitting ultrasound energy when the device is close to the user's ear.
As another example, obtaining information about a position of the device may comprise obtaining information as to which of multiple loudspeaker transducers is closest to the voice source (for example based on signal levels at microphones located close to those transducers), and adapting the generating and transmitting of the ultrasound signal may comprise transmitting the ultrasound signal mainly or entirely from that transducer. This allows the device to generate ultrasound signals from the transducer that is closest to the sound source, and thereby increase the chance of detecting usable reflection signals.
Other possibilities relate to specific ways in which speakers may use the device. Thus, for example, when the device 110 is a mobile phone comprising at least a first transducer 116 at a lower end of the device and a second transducer 114 at an upper end of the device, adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device may comprise transmitting the ultrasound signal at a relatively low power from the first transducer 116 if the information about the position of the device indicates that the device 110 is being used in a close talk mode. Close talk will be understood as a use of a phone where the phone is positioned adjacent the side of a user's face, and where communication is using the close-range earpiece speaker, e.g. as with a "traditional" phone handset positioning. For example, the ultrasound signal may be transmitted at a level of 70-90dB SPL at 1 cm in this mode.
The information about the position of the device may be considered to indicate that the device is being used in a close talk mode if, for example, accelerometers indicate that the device 110 is in an upright position, and proximity sensors detect that the device 110 is being held close to a surface that might be a user's face 1120, as shown in Figure 7(a). More generally, adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device may comprise transmitting the ultrasound signal from the second transducer if the information about the position of the device indicates that the device is being used in a generally vertical orientation. As another example, when the device 110 is a mobile phone comprising at least a first transducer 116 at a lower end of the device and a second transducer 114 at an upper end of the device, adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device may comprise transmitting the ultrasound signal at a relatively high power from the transducer 116 at the lower end of the device, if the information about the position of the device indicates that the device 110 may be being held by the user in front of their face 1130, with the lower microphone 112 pointing towards them, i.e. in a "pizza slice" version of a near talk mode, as shown in Figure 7(b). Near-talk mode will be understood as where a phone is positioned in front of the user's face, and where use may be made of near-field loudspeakers and microphones. This position may be suitable for the purposes of a video call, e.g. using software products such as Skype™ from Microsoft or FaceTime™ from Apple. "Pizza slice" mode will be understood as a variation of near-talk mode, but where the phone is held in a relatively horizontal position (such that a microphone positioned at the lower end of the phone faces the user directly).
For example, the ultrasound signal may be transmitted at a level of 90-110dB SPL at 1 cm in this mode. The information about the position of the device may be considered to indicate that the device is being used in a "pizza slice" mode if, for example, accelerometers indicate that the device is in a horizontal position, and the signal level detected by the microphone 112 is higher than the signal level detected by the microphones 112a, 112b.
More generally, adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device may comprise transmitting the ultrasound signal from the first transducer if the information about the position of the device indicates that the device is being used in a generally horizontal orientation.
In the variant of the near talk mode, in which the device is held by the user in front of their face, for example so that they can see the screen on the device while speaking, adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device may comprise transmitting the ultrasound signal at a relatively high power from the transducer 114 at the upper end of the device, or from transducers at both ends of the device.
As another example, adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device may comprise preventing transmission of the ultrasound signal if the information about the position of the device indicates that the device is being used in a far field mode, for example with the device 110 being placed on a surface 1140 some distance from the user 1142, as shown in Figure 7(c). In this example, the information about the position of the device may indicate that the device is located more than a threshold distance (for example 50cm) from the source of the sound.
This is because it may be determined that detecting the movement of a speaker's lips is only reliable enough for use when the indications are that the device may be being held close to the user's face.
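The usage modes described above can be collected into a simple decision table. The sketch below is a hypothetical policy, not the patented scheme: the transducer names, the 50cm far-field cut-off, and the dB SPL levels (chosen from within the 70-90 and 90-110 dB ranges mentioned above) are illustrative assumptions.

```python
def ultrasound_policy(orientation, near_surface, distance_cm):
    """Hypothetical transmit policy following the modes described above.
    Returns None when transmission should be suppressed (far-field mode),
    otherwise a (transducer, level_db_spl_at_1cm) pair.  All thresholds,
    names and levels here are illustrative assumptions."""
    if distance_cm is not None and distance_cm > 50.0:
        return None                      # far-field mode: do not transmit
    if orientation == "vertical" and near_surface:
        return ("lower", 80)             # close talk: relatively low power
    if orientation == "horizontal":
        return ("lower", 100)            # "pizza slice": high power, lower end
    return ("upper", 100)                # near talk in front of the face
```

For example, a horizontal device held 30cm away would be driven from the lower-end transducer at high power, matching the "pizza slice" behaviour described above.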
As shown in Figure 5, and as described above, the output of the Doppler detect block 188 is applied to one input of a correlation block 190. The received audio signal on the input 170 is applied to another input of the correlation block 190. The correlation block 190 determines whether there is a correlation between time periods in which Doppler shifts are detected in the reflection of the generated ultrasound signal, and periods in which there is speech.
The aim is to confirm that any Doppler shifts that are detected in the received reflection of the generated ultrasound signal do result from facial movements of a speaker, and are not the result of spurious reflections from other moving objects.
Figure 8 is a flow chart, illustrating a method performed in the correlation block 190. First, it is determined in step 1150, whether the detected Doppler shifts correspond to a general speech articulation rate. The articulation rate is the rate at which syllables are produced during speech, and it has been found that, for most speech, a typical articulation rate is in the range of 4-10Hz. The facial movements of the speaker (for example movements of the speaker's lips, cheeks, and nostrils) will typically occur at the same rate. Thus, in step 1150, it is determined whether the detected Doppler shifts correspond to facial movements at a frequency in the range of 4-10Hz.
In step 1152, it is determined whether the detected Doppler shifts correspond to an articulation rate of the current speech.
Thus, the articulation rate of the speech contained in the received audio signal is extracted in the correlation block 190. It is then determined whether the detected Doppler shifts correspond to facial movements at a frequency that corresponds to that extracted articulation rate.
If it is determined that the detected Doppler shifts correspond to facial movements at a frequency that corresponds to that extracted articulation rate, this can be taken as good evidence of liveness. In a further possible step, in step 1154 of the method shown in Figure 8, it is determined whether there is a correlation between detected Doppler shifts in the reflection of the generated ultrasound signal, and speech content of the received speech signal. It is recognised that one issue with using ultrasound as described herein is that there may be interfering sources of ambient ultrasound noise. Therefore, Figure 9 is a flow chart, illustrating one method performed in the Doppler detect block 188 and correlation block 190. Specifically, in step 1170, a level of ambient ultrasound noise is monitored. Then, in step 1172, the operation of the voice biometrics system is adjusted based on the levels of the reflected ultrasound and monitored ambient ultrasound noise.
Figure 10 is a block diagram, illustrating schematically the operation of the Doppler detect block 188 and correlation block 190. Figure 11 illustrates signals obtained at different stages of the operation.
Specifically, the signal from one or more microphones 112 is passed to a low pass filter 1180, for isolating the audio frequency components (for example, below 20kHz) of the detected signal. The resulting audio signal, in one example, is shown in Figure 11(a).
The signal level of the audio signal is found in a block 1182 that finds the absolute value of the signal. The resulting envelope signal, in the same example, is shown in Figure 11(b).
The signal from the one or more microphones 112 is also passed to a high pass filter 1184, for isolating the ultrasound components (for example, above 20kHz) of the detected signal. This may contain the wanted reflection of the generated ultrasound signal, but may also contain interfering ambient ultrasound noise.
The level of the ultrasound signal is determined by a level detector 1186.
The ultrasound signal is then passed to a demodulation block 1188, where it is downconverted to the audio band, and any Doppler shifted reflections are found. This is achieved by mixing the received ultrasound signal with the ultrasound signal that was generated and transmitted. The received ultrasound signal can be passed through a band pass filter before downconversion if required, in order to remove other ultrasound signals not originating from the transmitted signal. In addition, the output of the mixing step can be low-pass filtered.
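The mix-then-low-pass downconversion just described can be sketched as follows. This is a simplified illustration: multiplying the received signal by the transmitted carrier places the Doppler difference term at baseband and the sum term near twice the carrier, and a crude moving-average low-pass filter (an assumption; a real implementation would use a properly designed filter) then suppresses the high-frequency image.

```python
import math

def demodulate_to_audio(received, carrier_hz, sample_rate_hz):
    """Mix the received ultrasound with the transmitted carrier, then apply
    a causal moving-average low-pass filter to keep only the baseband
    (Doppler) component.  The factor of 2 restores the baseband amplitude."""
    mixed = [2.0 * s * math.cos(2.0 * math.pi * carrier_hz * n / sample_rate_hz)
             for n, s in enumerate(received)]
    window = int(sample_rate_hz / 1000.0)   # roughly 1kHz-wide moving average
    out, acc = [], 0.0
    for n, m in enumerate(mixed):
        acc += m
        if n >= window:
            acc -= mixed[n - window]
        out.append(acc / window)
    return out
```

Feeding in a tone offset 50Hz from a 22kHz carrier recovers a 50Hz baseband component, i.e. the Doppler shift survives the downconversion while the image near 44kHz is removed.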
The resulting signal, in one example, is shown in Figure 11(c). The signal level of the Doppler shifted reflected signal is found in a block 1190 that finds the absolute value of the signal. It can thus be seen from Figure 11 that there is a correlation between the detected Doppler shifts in the reflection of the generated ultrasound signal, and speech content of the received speech signal.
In order to obtain a robust result, a correlation operation is performed, as shown at block 1192 of Figure 10.
However, before performing the correlation, it is noted that, while the audio signal is effectively the result of the facial movements of the speaker, the Doppler shifts in the reflected ultrasound signal will result from the velocity of the facial movements.
Therefore, in some embodiments, either the audio signal is differentiated (for example by passing through a block 1194 in the form of a band pass filter with a pass-band of, say, 10-200Hz, an envelope block, or a differentiator), or the ultrasound signal is integrated (for example by passing through a block 1196 in the form of a leaky integrator or a band pass filter with a pass-band of, say, 10-200Hz).
The correlator 1192 then performs a frame-by-frame cross correlation on the signals. If the correlation result Rxy is above a threshold then it is determined that there is enough of a correlation, between the detected Doppler shifts and the speech content of the received speech signal, to conclude that there is evidence of a live speaker, and hence that the speech may not result from a replay attack. If there is not good evidence of liveness of a speaker, this may be an indication that the received speech signal may be a product of a replay attack.
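The frame-by-frame decision just described can be sketched as follows. Using the Pearson correlation coefficient as the per-frame Rxy, and a 0.6 decision threshold, are assumptions made for illustration; the embodiments leave the exact correlation measure and threshold open.

```python
def frame_correlation(x, y):
    """Pearson correlation coefficient of two equal-length frames."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    dx = sum((a - mx) ** 2 for a in x) ** 0.5
    dy = sum((b - my) ** 2 for b in y) ** 0.5
    return num / (dx * dy) if dx > 0 and dy > 0 else 0.0

def evidence_of_liveness(audio_frames, doppler_frames, threshold=0.6):
    """Average the per-frame correlation Rxy and compare with a threshold."""
    scores = [frame_correlation(a, d)
              for a, d in zip(audio_frames, doppler_frames)]
    return sum(scores) / len(scores) > threshold
```

Identical audio and Doppler envelopes correlate perfectly and pass the test, while a flat Doppler envelope (no detected facial movement) fails it.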
The operation of the system may be adjusted, based on a level of the reflected ultrasound and the monitored ambient ultrasound noise, as detected by the level detector 1186.
For example, the reliance that is placed on the determination as to whether the received speech signal may be the result of a replay attack may be adjusted, based on the level of the monitored ambient ultrasound noise. The determination, as to whether the received speech signal may be the result of a replay attack, will typically be made based on more than one factor. It is recognised that the presence of large ambient ultrasound signals will impact on the reliability of this system, and so the reliance that is placed on the determination may be reduced, as the level of the monitored ambient ultrasound noise increases. More specifically, if the level of the monitored ambient ultrasound noise exceeds a first threshold level, the result of the correlation may be ignored completely, or the correlation may not be performed.
For lower levels of interference, the adjustment of the operation of the system may involve adapting the threshold correlation value that is used in determining whether there is enough of a correlation, between the detected Doppler shifts and the speech content of the received speech signal, to conclude that there is evidence of a live speaker. Specifically, for low levels of ultrasound interference, a high threshold correlation value can be used. For somewhat higher levels of ultrasound interference (still below the first threshold mentioned above), lower threshold correlation values can be used, to take account of the fact that the presence of interference will automatically reduce the correlation values obtained from the correlator 1 192.
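The threshold adaptation described above can be sketched as a piecewise policy: above a hard interference limit the correlation result is ignored entirely, and between the two operating points the decision threshold is relaxed as interference grows. All dB levels and threshold values in this sketch are illustrative assumptions, not figures from the embodiments.

```python
def correlation_threshold(ambient_ultrasound_db, ignore_above_db=60.0,
                          low_db=30.0, max_thresh=0.8, min_thresh=0.4):
    """Adapt the liveness-decision threshold to the measured ambient
    ultrasound noise level.  Returns None when the correlation result
    should be ignored completely; otherwise a threshold that relaxes
    linearly between the two operating points."""
    if ambient_ultrasound_db >= ignore_above_db:
        return None
    if ambient_ultrasound_db <= low_db:
        return max_thresh
    frac = (ambient_ultrasound_db - low_db) / (ignore_above_db - low_db)
    return max_thresh - frac * (max_thresh - min_thresh)
```

In this sketch, quiet conditions keep the strict 0.8 threshold, mid-level interference relaxes it towards 0.4, and heavy interference disables the check.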
The following methods can be implemented in a wide range of devices and systems. However, for ease of explanation of one embodiment, an illustrative example will be described, in which the implementation occurs in a smartphone.
Figure 12 illustrates a smartphone 210, having a microphone 212 for detecting ambient sounds. In normal use, the microphone is of course used for detecting the speech of a user who is holding the smartphone 210 close to their face.
Figure 13 is a schematic diagram, illustrating the form of the smartphone 210. Specifically, Figure 13 shows various interconnected components of the smartphone 210. It will be appreciated that the smartphone 210 will in practice contain many other components, but the following description is sufficient for an understanding of the present invention.
Thus, Figure 13 shows the microphone 212 mentioned above. In certain embodiments, the smartphone 210 is provided with multiple microphones 212, 212a, 212b, etc. Figure 13 also shows a memory 214, which may in practice be provided as a single component or as multiple components. The memory 214 is provided for storing data and program instructions.
Figure 13 also shows a processor 216, which again may in practice be provided as a single component or as multiple components. For example, one component of the processor 216 may be an applications processor of the smartphone 210. Figure 13 also shows a transceiver 218, which is provided for allowing the smartphone 210 to communicate with external networks. For example, the transceiver 218 may include circuitry for establishing an internet connection either over a WiFi local area network or over a cellular network. Figure 13 also shows audio processing circuitry 220, for performing operations on the audio signals detected by the microphone 212 as required. For example, the audio processing circuitry 220 may filter the audio signals or perform other signal processing operations. In this embodiment, the smartphone 210 is provided with voice biometric functionality, and with control functionality. Thus, the smartphone 210 is able to perform various functions in response to spoken commands from an enrolled user. The biometric functionality is able to distinguish between spoken commands from the enrolled user, and the same commands when spoken by a different person. Thus, certain embodiments of the invention relate to operation of a smartphone or another portable electronic device with some sort of voice operability, for example a tablet or laptop computer, a games console, a home control system, a home entertainment system, an in-vehicle entertainment system, a domestic appliance, or the like, in which the voice biometric functionality is performed in the device that is intended to carry out the spoken command. Certain other embodiments relate to systems in which the voice biometric functionality is performed on a smartphone or other device, which then transmits the commands to a separate device if the voice biometric functionality is able to confirm that the speaker was the enrolled user. 
In some embodiments, while voice biometric functionality is performed on the smartphone 210 or other device that is located close to the user, the spoken commands are transmitted using the transceiver 218 to a remote speech recognition system, which determines the meaning of the spoken commands. For example, the speech recognition system may be located on one or more remote servers in a cloud computing environment. Signals based on the meaning of the spoken commands are then returned to the smartphone 210 or other local device.
Figure 14 is a block diagram illustrating the basic form of a speech processing system in a device 210. Thus, signals received at a microphone 212 are passed to a speech processing block 230. For example, the speech processing block 230 may comprise a voice activity detector, a speaker recognition block for performing a speaker identification or speaker verification process, and/or a speech recognition block for identifying the speech content of the signals. The speech processing block 230 may also comprise signal conditioning circuitry, such as a pre-amplifier, analog-digital conversion circuitry, and the like.
In such a system, there may be a non-linearity. For example, the non-linearity may be in the microphone 212, or may be in signal conditioning circuitry in the speech processing block 230. The effect of this non-linearity in the circuitry is that ultrasonic tones may mix down into the audio band.
Figure 15 illustrates this schematically. Specifically, Figure 15 shows a situation where there are interfering signals at two frequencies F1 and F2 in the ultrasound frequency range (i.e. at frequencies > 20kHz), which mix down as a result of the circuit non-linearity to form a signal at a frequency F3 in the audio frequency range (i.e. at frequencies between about 20Hz and 20kHz).
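The mechanism in Figure 15 is second-order intermodulation: a quadratic non-linearity acting on two tones produces components at the difference and sum frequencies. The difference term is the one that can land in the audio band. The helper below is a worked illustration of that arithmetic, not part of the disclosed system.

```python
def second_order_products(f1_hz, f2_hz):
    """Frequencies generated when a quadratic non-linearity acts on two
    tones: the difference |f1 - f2| and the sum f1 + f2 (the harmonics
    2*f1 and 2*f2 are omitted here)."""
    return abs(f1_hz - f2_hz), f1_hz + f2_hz

def falls_in_audio_band(freq_hz, lo_hz=20.0, hi_hz=20000.0):
    """True if the given frequency lies in the nominal audio band."""
    return lo_hz <= freq_hz <= hi_hz
```

For example, ultrasonic tones at 25kHz and 30kHz produce a difference product at 5kHz, squarely inside the audio band, while the 55kHz sum product remains inaudible.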
Figure 16 is a flow chart, illustrating a method of analysing an audio signal.
In step 252, the method comprises receiving an input sound signal comprising audio and non-audio frequencies.
In step 254, the method comprises separating the input sound signal into an audio band component and a non-audio band component. The non-audio component may be an ultrasonic component. In step 256, the method comprises identifying possible interference within the audio band from the non-audio band component. Identifying possible interference within the audio band from the non-audio band component may comprise determining whether a power level of the non-audio band component exceeds a threshold value and, if so, identifying possible interference within the audio band from the non-audio band component. Alternatively, identifying possible interference within the audio band from the non-audio band component may comprise comparing the audio band and non-audio band components.
Separating the input sound signal into an audio component and a non-audio component, such as an ultrasonic component, makes it possible to identify the presence of potentially problematic non-audio band components which may result in interference in the audio band. Such problematic signals may be present accidentally, as the result of relatively high levels of background sound signals, such as ultrasonic signals from ultrasonic sensor devices or modems. Alternatively, the problematic signals may be generated by a malicious actor in an attempt to interfere with or spoof the operation of a speech processing system, for example by generating ultrasonic signals that mix down as a result of circuit non-linearities to form audio band signals that can be misinterpreted as speech, or by generating ultrasonic signals that interfere with other aspects of the processing.
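The band separation and power-threshold test described above can be sketched as follows. The one-pole low-pass split and the power threshold value are assumptions chosen for illustration; a practical implementation would use properly designed crossover filters around 20kHz.

```python
import math

def split_bands(samples, sample_rate_hz, cutoff_hz=20000.0):
    """One-pole low-pass gives an audio-band estimate; subtracting it
    from the input leaves a non-audio (ultrasonic) residual."""
    alpha = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)
    low, audio, ultra = 0.0, [], []
    for s in samples:
        low = alpha * low + (1.0 - alpha) * s
        audio.append(low)
        ultra.append(s - low)
    return audio, ultra

def interference_suspected(ultra, power_threshold=1e-3):
    """Flag possible audio-band interference when the non-audio band
    component carries significant power."""
    return sum(u * u for u in ultra) / len(ultra) > power_threshold
```

A 1kHz speech-band tone leaves only a small residual in the non-audio path and is not flagged, whereas a 25kHz tone concentrates its power there and triggers the interference flag.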
In step 258, the method comprises adjusting the operation of a downstream speech processing module based on said identification of possible interference.
The adjusting of the operation of the speech processing module may take the form of modifications to the speech processing that is performed by the speech processing module, or may take the form of modifications to the signal that is applied to the speech processing module.
For example, modifications to the speech processing that is performed by the speech processing module may involve placing less (or zero) reliance on the speech signal during time periods when possible interference is identified, or warning a user that there is possible interference.
For example, modifications to the signal that is applied to the speech processing module may take the form of attempting to remove the effect of the interference.
Figure 17 is a block diagram illustrating the basic form of a speech processing system in a device 210. As in Figure 14, signals received at a microphone 212 are passed to a speech processing block 230. Again, as in Figure 14, the speech processing block 230 may comprise a voice activity detector, a speaker recognition block for performing a speaker identification or speaker verification process, and/or a speech recognition block for identifying the speech content of the signals. The speech processing block 230 may also comprise signal conditioning circuitry, such as a pre-amplifier, analog-digital conversion circuitry, and the like.
As mentioned with respect to Figure 14, there may be a non-linearity in the system. For example, the non-linearity may be in the microphone 212, or may be in signal conditioning circuitry in the speech processing block 230. In the system of Figure 17, the received signals are also passed to an ultrasound monitoring block 262, which separates the input sound signal into an audio band component and a non-audio band component, which may be an ultrasonic component, and identifies possible interference within the audio band from the non-audio band component.
If a source of possible interference is identified, the speech processing that is performed by the speech processing module may be modified appropriately.
Figure 18 is a block diagram illustrating the basic form of a speech processing system in a device 210. In the system of Figure 18, signals received at a microphone 212 are passed to an ultrasound monitoring block 266, which separates the input sound signal into an audio band component and a non-audio band component, which may be an ultrasonic component, and identifies possible interference within the audio band from the non-audio band component, resulting for example from non-linearity in the microphone 212. If a source of possible interference is identified, the received signal may be modified appropriately, and the modified signal may then be applied to the speech processing module 230. As in Figure 14, the speech processing block 230 may comprise a voice activity detector, a speaker recognition block for performing a speaker identification or speaker verification process, and/or a speech recognition block for identifying the speech content of the signals. The speech processing block 230 may also comprise signal conditioning circuitry, such as a pre-amplifier, analog-digital conversion circuitry, and the like.
Figure 19 is a block diagram, illustrating the form of the ultrasound monitoring block 262 or 266, in some embodiments. In this embodiment, signals received from the microphone 212 are separated into an audio band component and a non-audio band component. The received signals are passed to a low-pass filter (LPF) 282, for example a low-pass filter with a cut-off frequency at or below ~20kHz, which filters the input sound signal to obtain an audio band component of the input sound signal. The received signals are also passed to a high-pass filter (HPF) 284, for example a high-pass filter with a cut-off frequency at or above ~20kHz, to obtain a non-audio band component of the input sound signal, which will be an ultrasound signal when the high-pass filter has a cut-off frequency at or above ~20kHz. In other embodiments, the HPF 284 may be replaced by a band-pass filter, for example with a pass-band from ~20kHz to ~90kHz. Again, the non-audio band component of the input sound signal will be an ultrasound signal when the low frequency end of the pass band of the band-pass filter is at or above ~20kHz.
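The band-splitting stage just described can be sketched in Python. This is an illustrative sketch only: a single-pole recursive low-pass filter stands in for the LPF 282, and the non-audio band component is derived as the complementary high-pass branch (standing in for the HPF 284); the function name, sample rate and filter order are illustrative choices, not part of the disclosure, and a practical implementation would use a sharper filter.

```python
import math

def split_bands(samples, fs, cutoff_hz=20000.0):
    """Split an input signal into an audio band component (low-pass
    branch) and a non-audio band component (complementary high-pass
    branch). A single-pole recursive filter is used purely for
    illustration."""
    alpha = math.exp(-2.0 * math.pi * cutoff_hz / fs)
    audio, ultra, y = [], [], 0.0
    for x in samples:
        y = alpha * y + (1.0 - alpha) * x   # one-pole low-pass
        audio.append(y)
        ultra.append(x - y)                 # complementary high-pass
    return audio, ultra
```

With a 192 kHz sample rate, a 1 kHz tone then lands almost entirely in the audio branch, while a 60 kHz tone lands mostly in the non-audio branch.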
The non-audio band component of the input sound signal is passed to a power level detect block 2150, which determines whether a power level of the non-audio band component exceeds a threshold value. For example, the power level detect block 2150 may determine whether the peak non-audio band (e.g. ultrasound) power level exceeds a threshold. For example, it may determine whether the peak ultrasound power level exceeds -30dBFS (decibels relative to full scale). Such a level of ultrasound may result from an attack by a malicious party. In any event, if the ultrasound power level exceeds the threshold value, it could be identified that this may result in interference in the audio band due to non-linearities. The threshold value may be set based on knowledge of the effect of the non-linearity in the circuit. Thus, if the effect of the non-linearity is known to be a value A(nl), for example a 40dB mixdown, it is possible to set a threshold A(bb) for a power level in the audio base band which could affect system operation, for example 30dB SPL.
Then, an ultrasonic signal at or above A(us), where A(us) = A(bb) + A(nl), would cause problems in the audio band, because the non-linearity would cause it to generate a base band signal above the threshold at which system operation could be affected. With the examples given above, where A(nl) = 40dB and A(bb) = 30dB SPL, this gives a threshold value of 70dB for the ultrasound power level.
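The threshold arithmetic above amounts to a single addition in decibel space; a minimal sketch (the function name is an illustrative choice):

```python
def ultrasound_threshold_db(audio_band_threshold_db, mixdown_loss_db):
    """Minimum ultrasound level A(us) that could matter in the audio
    band: A(us) = A(bb) + A(nl), where A(bb) is the audio-band level
    that could affect system operation and A(nl) is the attenuation
    introduced by the non-linear mixdown."""
    return audio_band_threshold_db + mixdown_loss_db

# With the figures from the text, a 30 dB SPL audio-band threshold and
# a 40 dB mixdown give a 70 dB ultrasound threshold.
```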
If it is determined that the ultrasound power level exceeds the threshold value, the output of the power level detect block 2150 may be a flag, to be sent to the downstream speech processing module in step 258 of the method of Figure 16, in order to control the operation thereof.
Figure 20 is a block diagram, illustrating the form of the ultrasound monitoring block 262 or 266, in some embodiments.
In this embodiment, signals received from the microphone 212 are separated into an audio band component and a non-audio band component. The received signals are passed to a low-pass filter (LPF) 282, for example a low-pass filter with a cut-off frequency at or below ~20kHz, which filters the input sound signal to obtain an audio band component of the input sound signal. The received signals are also passed to a high-pass filter (HPF) 284, for example a high-pass filter with a cut-off frequency at or above ~20kHz, to obtain a non-audio band component of the input sound signal, which will be an ultrasound signal when the high-pass filter has a cut-off frequency at or above ~20kHz. In other embodiments, the HPF 284 may be replaced by a band-pass filter, for example with a pass-band from ~20kHz to ~90kHz. Again, the non-audio band component of the input sound signal will be an ultrasound signal when the low frequency end of the pass band of the band-pass filter is at or above ~20kHz.
The non-audio band component of the input sound signal is passed to a power level compare block 2160. This compares the audio band and non-audio band components. For example, in this case, identifying possible interference within the audio band from the non-audio band component may comprise: measuring a signal power Pa in the audio band component; and measuring a signal power Pb in the non-audio band component. Then, if (Pa/Pb) is less than a threshold limit, it could be identified that this may result in interference in the audio band due to non-linearities.
In that case, the output of the power level compare block 2160 may be a flag, to be sent to the downstream speech processing module in step 258 of the method of Figure 16, in order to control the operation thereof. More specifically, this flag may indicate to the speech processing module that the quality of the input sound signal is unreliable for speech processing. The operation of the downstream speech processing module may then be controlled based on the flagged unreliable quality.
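The power-ratio check performed by the power level compare block 2160 might be sketched as follows. The function name and the default ratio limit of 10 are illustrative assumptions, not values from the disclosure:

```python
def flag_possible_interference(audio_band, non_audio_band, ratio_limit=10.0):
    """Return True (flag raised) when Pa/Pb falls below the threshold
    limit, i.e. the non-audio band power is high relative to the audio
    band power and mixdown interference is plausible."""
    pa = sum(s * s for s in audio_band) / len(audio_band)
    pb = sum(s * s for s in non_audio_band) / len(non_audio_band)
    if pb == 0.0:
        return False            # no ultrasound energy at all
    return (pa / pb) < ratio_limit
```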
Figure 21 is a block diagram, illustrating the form of the ultrasound monitoring block 262 or 266, in some embodiments.
Signals received from the microphone 212 are separated into an audio band component and a non-audio band component. The received signals are passed to a low-pass filter (LPF) 282, for example a low-pass filter with a cut-off frequency at or below ~20kHz, which filters the input sound signal to obtain an audio band component of the input sound signal. The received signals are also passed to a high-pass filter (HPF) 284, for example a high-pass filter with a cut-off frequency at or above ~20kHz, to obtain a non-audio band component of the input sound signal, which will be an ultrasound signal when the high-pass filter has a cut-off frequency at or above ~20kHz. In other embodiments, the HPF 284 may be replaced by a band-pass filter, for example with a pass-band from ~20kHz to ~90kHz. Again, the non-audio band component of the input sound signal will be an ultrasound signal when the low frequency end of the pass band of the band-pass filter is at or above ~20kHz. The non-audio band component of the input sound signal may be passed to a block 286 that simulates the effect of a non-linearity on the signal, and then to a low-pass filter 288.
The audio band component generated by the low-pass filter 282 and the simulated non-linear signal generated by the block 286 and the low-pass filter 288 are then passed to a comparison block 290. In one embodiment, the comparison block 290 measures a signal power in the audio band component, measures a signal power in the non-audio band component, and calculates a ratio of the signal power in the audio band component to the signal power in the non-audio band component. If this ratio is below a threshold limit, this is taken to indicate that the input sound signal may contain too high a level of ultrasound to be reliably used for speech processing. In that case, the output of the comparison block 290 may be a flag, to be sent to the downstream speech processing module in step 258 of the method of Figure 16, in order to control the operation thereof.
In another embodiment, the comparison block 290 detects the envelope of the signal of the non-audio band component, and detects a level of correlation between the envelope of the signal and the audio band component. Detecting the level of correlation may comprise measuring a time-domain correlation between identified signal envelopes of the non-audio band component, and speech components of the audio band component. In this situation, some or all of the audio band component may result from ultrasound signals in the ambient sound, that have been downconverted into the audio band by non-linearities in the microphone 212. This will lead to a correlation with the non-audio band component that is selected by the filter 284. Therefore, the presence of such a correlation exceeding a threshold value is taken as an indication that there may be non-audio band interference within the audio band.
In that case, the output of the comparison block 290 may be a flag, to be sent to the downstream speech processing module in step 258 of the method of Figure 16, in order to control the operation thereof.
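The envelope-and-correlation approach can be sketched with two small helpers. These are illustrative: an exponentially smoothed absolute value stands in for the envelope detector, and a zero-lag normalised cross-correlation stands in for the correlation measure; a real implementation might use rectification plus filtering and a lagged, time-domain correlation as described above.

```python
import math

def envelope(signal, alpha=0.99):
    """Crude envelope follower: exponentially smoothed absolute value."""
    env, y = [], 0.0
    for x in signal:
        y = alpha * y + (1.0 - alpha) * abs(x)
        env.append(y)
    return env

def correlation(a, b):
    """Zero-lag normalised (Pearson) cross-correlation of two sequences."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = math.sqrt(sum((x - ma) ** 2 for x in a))
    db = math.sqrt(sum((y - mb) ** 2 for y in b))
    return num / (da * db) if da > 0 and db > 0 else 0.0
```

A correlation between the non-audio band envelope and the audio band component exceeding a threshold would then be taken as the indication of possible interference.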
In another embodiment, the block 286 simulates the effect of a non-linearity on the signal, to provide a simulated non-linear signal. For example, the block 286 may attempt to model the non-linearity in the system that may be causing the interference by non-linear downconversion of the input sound signal. The non-linearities simulated by the block 286 may be second-order and/or third-order non-linearities.
In that embodiment, the comparison block 290 then detects a level of correlation between the simulated non-linear signal and the audio band component. If the level of correlation exceeds a threshold value, then it is determined that there may be interference within the audio band caused by signals from the non-audio band. Again, in that case, the output of the comparison block 290 may be a flag, to be sent to the downstream speech processing module in step 258 of the method of Figure 16, in order to control the operation thereof.
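A memoryless second- and third-order non-linearity of the kind block 286 might simulate can be sketched as below. The coefficients a2 and a3 are illustrative placeholders; in practice they would be chosen, or adapted, to match the non-linearity of the actual signal chain.

```python
def simulate_nonlinearity(samples, a2=0.1, a3=0.05):
    """Apply a memoryless 2nd/3rd-order non-linearity: y = a2*x^2 + a3*x^3.
    Squaring a pair of ultrasonic tones at f1 and f2 produces, among
    other terms, a difference tone at f2 - f1, which can fall in the
    audio band; this is the mixdown mechanism discussed in the text."""
    return [a2 * x * x + a3 * x ** 3 for x in samples]
```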
Figure 22 is a block diagram, illustrating the form of the ultrasound monitoring block 266, in some other embodiments.
Signals received from the microphone 212 are separated into an audio band component and a non-audio band component. The received signals are passed to a low-pass filter (LPF) 282, for example a low-pass filter with a cut-off frequency at or below ~20kHz, which filters the input sound signal to obtain an audio band component of the input sound signal. The received signals are also passed to a high-pass filter (HPF) 284, for example a high-pass filter with a cut-off frequency at or above ~20kHz, to obtain a non-audio band component of the input sound signal, which will be an ultrasound signal when the high-pass filter has a cut-off frequency at or above ~20kHz. In other embodiments, the HPF 284 may be replaced by a band-pass filter, for example with a pass-band from ~20kHz to ~90kHz. Again, the non-audio band component of the input sound signal will be an ultrasound signal when the low frequency end of the pass band of the band-pass filter is at or above ~20kHz.
The non-audio band component of the input sound signal may be passed to a block 286 that simulates the effect of a non-linearity on the signal, and then to a low-pass filter 288.
In the case of the embodiments shown in Figure 22, the adjustment of the operation of the downstream speech processing module, in step 258 of the method of Figure 16, comprises providing a compensated sound signal to the downstream speech processing module.
The step of providing the compensated sound signal may comprise subtracting the simulated non-linear signal from the audio band component to provide the compensated output signal, which is then provided to the downstream speech processing module. In the embodiment of Figure 22, the simulated non-linear signal generated by the block 286 and the low-pass filter 288 is passed to a further filter 2100.
The audio band component generated by the low-pass filter 282 is passed to a subtractor 2102, and the output of the further filter 2100 is subtracted from the audio band component, in order to remove from the audio band signal any component caused by downconversion of ultrasound signals. The further filter 2100 may be an adaptive filter, and in its simplest form it may be an adaptive gain. The further filter 2100 is adapted such that the component of the filtered simulated non-linearity signal in the compensated output signal is minimised.
The resulting compensated audio band signal is passed to the downstream speech processing module.
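The simplest form of the further filter 2100, a single adaptive gain, can be sketched as below, with the gain adapted by an LMS rule so that the component of the reference (the filtered simulated-non-linearity signal) remaining in the compensated output is minimised. The function name and step size mu are illustrative assumptions.

```python
def adaptive_gain_cancel(audio_band, reference, mu=0.1):
    """Subtract g * reference from the audio band, adapting the single
    gain g by LMS so that the residual correlated with the reference is
    minimised (the simplest form of the further filter 2100)."""
    g = 0.0
    out = []
    for x, r in zip(audio_band, reference):
        e = x - g * r        # compensated output sample (subtractor 2102)
        g += mu * e * r      # LMS update of the adaptive gain
        out.append(e)
    return out
```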
Figure 23 is a block diagram, illustrating the form of the ultrasound monitoring block 266, in some other embodiments.
In the embodiments illustrated above, the signals from the microphone 212 may be analog signals, and they may be passed to an analog-digital converter for conversion to digital form before being passed to the respective filters. However, for ease of illustration, in cases where it is assumed that the analog-digital conversion is not the source of non-linearity that causes ultrasound signals to be mixed down into the audio band, the analog-digital converters have not been shown in the figures.
However, Figure 23 shows a case in which the analog-digital conversion is not ideal, and so Figure 23 shows signals received from the microphone 212 being passed to an analog-digital converter (ADC) 2120. Again, the resulting signal is separated into an audio band component and a non-audio band component. The received signals are passed to a low-pass filter (LPF) 282, for example a low-pass filter with a cut-off frequency at or below ~20kHz, which filters the input sound signal to obtain an audio band component of the input sound signal.
In general the bandwidth of the ADC must be large enough to be able to handle the ultrasonic components of the received signal. However, in any real ADC, there will be a frequency at which the quantization noise of the ADC will start to rise. This places an upper limit on the frequencies that can be allowed into the non-linearity. Therefore, Figure 23 shows the output of the ADC 2120 being passed not to a high-pass filter, but to a band-pass filter (BPF) 2122. The lower end of the pass-band may for example be at ~20kHz, with the upper end of the pass-band being at a frequency that excludes the frequencies that are corrupted by quantization noise, for example at ~90kHz.
As in other embodiments, the non-audio band component of the input sound signal may be passed to a block 286 that simulates the effect of a non-linearity on the signal, and then to a low-pass filter 288.
In the case of the embodiments shown in Figure 23, the adjustment of the operation of the downstream speech processing module, in step 258 of the method of Figure 16, comprises providing a compensated sound signal to the downstream speech processing module.
In this illustrated example, the step of providing the compensated sound signal may comprise subtracting the simulated non-linear signal from the audio band component to provide the compensated output signal, which is then provided to the downstream speech processing module.
Thus, in Figure 23, the audio band component generated by the low-pass filter 282 is passed to a subtractor 2102, and the simulated non-linear signal generated by the block 286 and the low-pass filter 288 is subtracted from the audio band component. This attempts to remove from the audio band signal any component caused by downconversion of ultrasound signals.
The resulting compensated audio band signal is passed to the downstream speech processing module.
Figure 24 is a block diagram, illustrating the form of the ultrasound monitoring block 266, in some other embodiments, where the non-linearity in the microphone 212 or elsewhere is unknown (for example the magnitude of the non-linearity and/or the relative strengths of 2nd order non-linearity and 3rd order non-linearity). In this case, the step of simulating a non-linearity comprises providing the non-audio band component to an adaptive non-linearity module, and the method comprises controlling the adaptive non-linearity module such that the component of the simulated non-linearity signal in the compensated output signal is minimised.
Thus, Figure 24 shows the received signal being passed to a low-pass filter (LPF) 282, for example a low-pass filter with a cut-off frequency at or below ~20kHz, which filters the input sound signal to obtain an audio band component of the input sound signal.
Figure 24 shows the received signal being passed to a band-pass filter (BPF) 2122. The lower end of the pass-band may for example be at ~20kHz, with the upper end of the pass-band being at a frequency that excludes the frequencies that are corrupted by quantization noise, for example at ~90kHz.
In these embodiments, the non-audio band component of the input sound signal may be passed to an adaptive block 2140 that simulates the effect of a non-linearity on the signal. The output of the block 2140 is passed to a low-pass filter 288.
As before, the adjustment of the operation of the downstream speech processing module, in step 258 of the method of Figure 16, comprises providing a compensated sound signal to the downstream speech processing module.
More specifically, in this illustrated example, the step of providing the compensated sound signal may comprise subtracting the simulated non-linear signal from the audio band component to provide the compensated output signal, which is then provided to the downstream speech processing module.
Thus, in Figure 24, the audio band component generated by the low-pass filter 282 is passed to a subtractor 2102, and the simulated non-linear signal generated by the block 2140 and the low-pass filter 288 is subtracted from the audio band component. This attempts to remove from the audio band signal any component caused by downconversion of ultrasound signals.
The resulting compensated audio band signal is passed to the downstream speech processing module.
In one example, the non-linearity may be modelled in the block 2140 with a polynomial p(x), with the error being fed back from the output of the subtractor 2102. The Least Mean Squares algorithm may update the m-th polynomial term as per:

p_m → p_m + μ · (x − d) · x^m

An alternative version applies a filtering to the error signal:

p_m → p_m + μ · λ(x − d) · x^m

where λ is a filter function. For example a simple boxcar filter could be used.
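One LMS step of the polynomial update can be sketched as follows. This is a sketch under the reading that (x − d) is the error fed back from the subtractor 2102 and x is the sample driving the polynomial terms; the text leaves some of these roles implicit, so the variable roles here are assumptions.

```python
def lms_polynomial_step(p, x, d, mu=0.01):
    """One LMS update of the polynomial non-linearity model of block
    2140: p_m -> p_m + mu * (x - d) * x**m for each coefficient p_m."""
    e = x - d                                  # error from subtractor 2102
    return [pm + mu * e * x ** m for m, pm in enumerate(p)]
```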
Any of the embodiments described above can be used in a two-stage system, in which the first stage corresponds to that shown in Figure 19. That is, the received signal is filtered to obtain an audio band component and a non-audio band (for example, ultrasound) component of the input signal. It is then determined whether the signal power in the non-audio band component is below or above a threshold value. If there is a low power level in the ultrasound band, this indicates that there is unlikely to be a problem caused by downconversion of ultrasound signals to the audio band. If there is a higher power level in the ultrasound band, there is a possibility of a problem, and so the further processing described above with reference to Figure 21, 22, 23 or 24 is performed to determine if interference is likely, and to take mitigating action if required. For example, if the measured signal power level in the non-audio band component is below a threshold level X, the input sound signal may be flagged as free of non-audio band interference, and, if the measured signal power level in the non-audio band component is above the threshold level X, the audio band and non-audio band components may be compared to identify possible interference within the audio band from the non-audio band. This allows for low-power operation, as the comparison step will only be performed in situations where the non-audio band component has a signal power above the threshold level. For a non-audio band component having signal power below such a threshold, it can be assumed that no interference will be present in the input sound signal used for downstream speech processing.
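The two-stage gating just described can be sketched as a small wrapper: a cheap ultrasound power check first, with the more expensive comparison deferred until it is needed. The function names and threshold argument are illustrative.

```python
def two_stage_monitor(audio_band, non_audio_band, power_threshold, compare_fn):
    """Stage 1: measure non-audio band power; below the threshold, flag
    the input as free of non-audio band interference without further
    work. Stage 2: only above the threshold, run the (costlier)
    comparison, e.g. one of the Figure 21-24 style checks."""
    pb = sum(s * s for s in non_audio_band) / len(non_audio_band)
    if pb < power_threshold:
        return False                      # no interference suspected
    return compare_fn(audio_band, non_audio_band)
```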
The skilled person will recognise that some aspects of the above-described apparatus and methods may be embodied as processor control code, for example on a non-volatile carrier medium such as a disk, CD- or DVD-ROM, programmed memory such as read only memory (Firmware), or on a data carrier such as an optical or electrical signal carrier. For many applications embodiments of the invention will be implemented on a DSP (Digital Signal Processor), ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array). Thus the code may comprise conventional program code or microcode or, for example, code for setting up or controlling an ASIC or FPGA. The code may also comprise code for dynamically configuring re-configurable apparatus such as re-programmable logic gate arrays. Similarly the code may comprise code for a hardware description language such as Verilog™ or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, the code may be distributed between a plurality of coupled components in communication with one another. Where appropriate, the embodiments may also be implemented using code running on a field-(re)programmable analogue array or similar device in order to configure analogue hardware.
Note that as used herein the term module shall be used to refer to a functional unit or block which may be implemented at least partly by dedicated hardware components such as custom defined circuitry and/or at least partly be implemented by one or more software processors or appropriate code running on a suitable general purpose processor or the like. A module may itself comprise other modules or functional units. A module may be provided by multiple components or sub-modules which need not be co-located and could be provided on different integrated circuits and/or running on different processors.
Embodiments may be implemented in a host device, especially a portable and/or battery powered host device such as a mobile computing device for example a laptop or tablet computer, a games console, a remote control device, a home automation controller or a domestic appliance including a domestic temperature or lighting control system, a toy, a machine such as a robot, an audio player, a video player, or a mobile telephone for example a smartphone.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single feature or other unit may fulfil the functions of several units recited in the claims. Any reference numerals or labels in the claims shall not be construed so as to limit their scope.

Claims

1. A method of detecting liveness, the method comprising:
receiving a speech signal;
generating an ultrasound signal;
detecting a reflection of the generated ultrasound signal;
detecting Doppler shifts in the reflection of the generated ultrasound signal; and identifying whether the received speech signal is indicative of the liveness of a speaker based on the detected Doppler shifts,
wherein identifying whether the received speech signal is indicative of liveness based on the detected Doppler shifts comprises:
determining whether the detected Doppler shifts correspond to a speech articulation rate.
2. A method according to claim 1, wherein determining whether the detected Doppler shifts correspond to a speech articulation rate comprises:
determining whether the detected Doppler shifts correspond to facial movements at a frequency in the range of 4-10Hz.
3. A method according to claim 1 or 2, wherein determining whether the detected Doppler shifts correspond to a speech articulation rate comprises:
determining an articulation rate associated with the speech signal; and determining whether the detected Doppler shifts correspond to facial movements at the articulation rate associated with the speech signal.
4. A method according to claim 2, further comprising,
if it is determined that the detected Doppler shifts correspond to facial movements at a frequency in the range of 4-10Hz:
determining an articulation rate associated with the speech signal;
determining whether the detected Doppler shifts correspond to lip movements at the articulation rate associated with the speech signal; and
determining that the received speech signal is indicative of liveness if the detected Doppler shifts correspond to lip movements at the articulation rate associated with the speech signal.
5. A method according to any of claims 1 to 4, for use in a voice biometrics system, wherein identifying whether the received speech signal is indicative of liveness comprises determining whether the received speech signal may be a product of a replay attack.
6. A system for liveness detection, the system comprising:
at least one microphone input, for receiving an audio signal from a microphone; and
at least one transducer output, for transmitting a signal to an ultrasound transducer, and the system being configured for:
receiving a speech signal at the at least one microphone input;
generating an ultrasound signal by transmitting a signal at the at least one transducer output;
detecting a reflection of the generated ultrasound signal;
detecting Doppler shifts in the reflection of the generated ultrasound signal; and identifying whether the received speech signal is indicative of the liveness of a speaker based on the detected Doppler shifts,
wherein identifying whether the received speech signal is indicative of liveness based on the detected Doppler shifts comprises:
determining whether the detected Doppler shifts correspond to a speech articulation rate.
7. A device comprising a system as claimed in claim 6.
8. A device as claimed in claim 7, wherein the device comprises a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
9. A computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to any one of claims 1 to 5.
10. A non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to any one of claims 1 to 5.
11. A device comprising the non-transitory computer readable storage medium as claimed in claim 10.
12. A device as claimed in claim 11, wherein the device comprises a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
13. A method of liveness detection, the method comprising:
generating an ultrasound signal;
receiving an audio signal comprising a reflection of the ultrasound signal;
using the received audio signal comprising the reflection of the ultrasound signal to detect the liveness of a speaker;
monitoring ambient ultrasound noise; and
adjusting the operation of a system receiving the audio signal, based on a level of the reflected ultrasound and the monitored ambient ultrasound noise.
14. A method according to claim 13, for use in a voice biometrics system, wherein detecting the liveness of a speaker comprises determining whether a received speech signal may be a product of a replay attack, and comprising:
adjusting the operation of the voice biometrics system based on a level of the reflected ultrasound and the monitored ambient ultrasound noise.
15. A method according to claim 14, comprising:
detecting Doppler shifts in the reflection of the generated ultrasound signal; and identifying whether the received speech signal may be the result of a replay attack on the voice biometrics system based on the detected Doppler shifts,
the method further comprising:
determining a reliance to be placed on the identification whether the received speech signal may be the result of a replay attack, based on the level of the monitored ambient ultrasound noise.
16. A method according to claim 15, wherein determining the reliance to be placed on the identification comprises not performing the identification if the level of the monitored ambient ultrasound noise exceeds a first threshold level.
17. A method according to claim 14, comprising:
detecting Doppler shifts in the reflection of the generated ultrasound signal; and identifying whether the received speech signal may be the result of a replay attack on the voice biometrics system based on the detected Doppler shifts,
wherein identifying whether the received speech signal may result from a replay attack based on the detected Doppler shifts comprises:
determining a correlation between the detected Doppler shifts and the received speech signal; and
adapting a threshold correlation value to be used in identifying whether the received speech signal may result from a replay attack, based on the level of the monitored ambient ultrasound noise.
18. A system for liveness detection, the system comprising:
at least one microphone input, for receiving an audio signal from a microphone; and
at least one transducer output, for transmitting a signal to an ultrasound transducer, and the system being configured for:
generating an ultrasound signal;
receiving an audio signal comprising a reflection of the ultrasound signal;
using the received audio signal comprising the reflection of the ultrasound signal to detect the liveness of a speaker;
monitoring ambient ultrasound noise; and
adjusting the operation of a system receiving the audio signal, based on a level of the reflected ultrasound and the monitored ambient ultrasound noise.
19. A device comprising a system as claimed in claim 18.
20. A device as claimed in claim 19, wherein the device comprises a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
21. A computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to any one of claims 13 to 17.
22. A non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to any one of claims 13 to 17.
23. A device comprising the non-transitory computer readable storage medium as claimed in claim 22.
24. A device as claimed in claim 23, wherein the device comprises a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
25. A method of liveness detection in a device, the method comprising:
receiving a speech signal from a voice source;
generating and transmitting an ultrasound signal through a transducer of the device;
detecting a reflection of the transmitted ultrasound signal;
detecting Doppler shifts in the reflection of the generated ultrasound signal; and
identifying whether the received speech signal is indicative of liveness of a speaker based on the detected Doppler shifts,
and the method further comprising:
obtaining information about a position of the device; and
adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device.
26. A method according to claim 25, wherein adapting the generating and transmitting of the ultrasound signal comprises:
adjusting a transmit power of the ultrasound signal.
27. A method according to claim 25 or 26, wherein the device has multiple transducers, and wherein adapting the generating and transmitting of the ultrasound signal comprises:
selecting the transducer in which the ultrasound signal is generated.
28. A method according to claim 25, 26 or 27, wherein obtaining information about a position of the device comprises obtaining information about an orientation of the device.
29. A method according to claim 25, 26, 27 or 28, wherein obtaining information about a position of the device comprises obtaining information about a distance of the device from the voice source.
30. A method according to claim 25, wherein the device is a mobile phone comprising at least a first transducer at a lower end of the device and a second transducer at an upper end of the device, and wherein adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device comprises:
transmitting the ultrasound signal from the first transducer at an intensity in the range of 70-90dB SPL at 1 cm if the information about the position of the device indicates that the device is being used in a close talk mode.
31. A method according to claim 25, wherein the device is a mobile phone comprising at least a first transducer at a lower end of the device and a second transducer at an upper end of the device, and wherein adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device comprises:
transmitting the ultrasound signal at an intensity in the range of 90-110dB SPL at 1 cm if the information about the position of the device indicates that the device is being used in a near talk mode.
32. A method according to claim 27, wherein adapting the generating and
transmitting of the ultrasound signal based on the information about the position of the device comprises:
transmitting the ultrasound signal from the first transducer if the information about the position of the device indicates that the device is being used in a generally horizontal orientation.
33. A method according to claim 27 or 32, wherein adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device comprises:
transmitting the ultrasound signal from the second transducer if the information about the position of the device indicates that the device is being used in a generally vertical orientation.
34. A method according to any of claims 25 to 33, wherein adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device comprises:
preventing transmission of the ultrasound signal if the information about the position of the device indicates that the device is being used in a far talk mode.
35. A method according to claim 29, wherein adapting the generating and transmitting of the ultrasound signal comprises adjusting a transmit power of the ultrasound signal, with a higher power being used when the device is further from the voice source, for distances below a predetermined maximum distance.
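A minimal sketch of the distance-dependent power rule of claims 34-35 follows. The 50 cm far-talk limit and the linear dB ramp are assumptions invented for the example; the 70 and 110 dB SPL endpoints merely echo the ranges mentioned in claims 30-31 and are not the claimed behaviour.

```python
def ultrasound_tx_power_db(distance_cm, max_distance_cm=50.0,
                           min_db=70.0, max_db=110.0):
    """Choose transmit power from the device-to-source distance: higher
    power when the device is further from the voice source (claim 35),
    and no transmission at all beyond the far-talk limit (claim 34).
    Returns dB SPL at 1 cm, or None to disable transmission."""
    if distance_cm > max_distance_cm:
        return None  # far-talk mode: prevent transmission
    fraction = distance_cm / max_distance_cm
    return min_db + fraction * (max_db - min_db)
```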
36. A method according to any of claims 25 to 35, wherein obtaining information about a position of the device comprises obtaining information as to which of multiple loudspeaker transducers is closest to the voice source, and adapting the generating and transmitting of the ultrasound signal comprises transmitting the ultrasound signal primarily or entirely from that loudspeaker.
37. A method according to any of claims 25 to 36, comprising obtaining information about the position of the device from one or more of the following:
gyroscopes, accelerometers, proximity sensors, light level sensors, touch sensors, sound level sensors, and a camera.
38. A method according to any of claims 25 to 37, for use in a voice biometrics system, wherein identifying whether the received speech signal is indicative of liveness comprises determining whether the received speech signal may be a product of a replay attack.
39. A system for liveness detection in a device, the system comprising:
at least one microphone input, for receiving an audio signal from a microphone; and
at least one transducer output, for transmitting a signal to an ultrasound transducer, and the system being configured for:
receiving a speech signal from the at least one microphone input;
generating a control signal through the transducer output, for transmitting an ultrasound signal through a transducer of the device;
detecting a reflection of the transmitted ultrasound signal;
detecting Doppler shifts in the reflection of the generated ultrasound signal; and
identifying whether the received speech signal is indicative of liveness of a speaker based on the detected Doppler shifts,
and the system being further configured for:
obtaining information about a position of the device; and
adapting the generating and transmitting of the ultrasound signal based on the information about the position of the device.
40. A device comprising a system as claimed in claim 39.
41. A device as claimed in claim 40, wherein the device comprises a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
42. A computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to any one of claims 25 to 38.
43. A non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to any one of claims 25 to 38.
44. A device comprising the non-transitory computer readable storage medium as claimed in claim 43.
45. A device as claimed in claim 44, wherein the device comprises a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
46. A method for improving the robustness of a speech processing system having at least one speech processing module, the method comprising:
receiving an input sound signal comprising audio and non-audio frequencies;
separating the input sound signal into an audio band component and a non-audio band component;
identifying possible interference within the audio band from the non-audio band component; and
adjusting the operation of a downstream speech processing module based on said identification.
47. The method of claim 46, wherein identifying possible interference within the audio band from the non-audio band component comprises determining whether a power level of the non-audio band component exceeds a threshold value and, if so, identifying possible interference within the audio band from the non-audio band component.
48. The method of claim 46, wherein identifying possible interference within the audio band from the non-audio band component comprises comparing the audio band and non-audio band components.
49. The method of claim 48, wherein the step of identifying possible interference within the audio band from the non-audio band component comprises:
measuring a signal power in the audio band component Pa;
measuring a signal power in the non-audio band component Pb; and
if (Pa/Pb) < a threshold limit, flagging the quality of the input sound signal as unreliable for speech processing; and
wherein the step of adjusting comprises controlling the operation of a
downstream speech processing module based on the flagged unreliable quality.
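The Pa/Pb test of claim 49 reduces to a few lines. This is a sketch only; the ratio threshold of 10 is an arbitrary example value, not one given in the application.

```python
import numpy as np

def flag_unreliable(audio_band, non_audio_band, ratio_threshold=10.0):
    """Per claim 49: measure the signal power in the audio band component
    (Pa) and in the non-audio band component (Pb); if Pa/Pb falls below a
    threshold limit, flag the input sound signal as unreliable for speech
    processing."""
    pa = np.mean(np.square(audio_band))      # audio band power Pa
    pb = np.mean(np.square(non_audio_band))  # non-audio band power Pb
    return (pa / pb) < ratio_threshold
```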
50. The method of claim 48, wherein the step of comparing comprises:
detecting the envelope of the signal of the non-audio band component;
detecting a level of correlation between the envelope of the signal and the audio band component; and
determining possible non-audio band interference within the audio band if the level of correlation exceeds a threshold value.
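Claim 50's envelope-correlation test can be sketched as below. The rectify-and-smooth envelope detector, the smoothing length, and the 0.5 correlation threshold are all assumptions chosen for the example, not details from the application.

```python
import numpy as np

def envelope_interference_detected(non_audio, audio, corr_threshold=0.5,
                                   smooth=8):
    """Per claim 50: estimate the envelope of the non-audio band component
    (here by rectifying and smoothing), then measure its correlation with
    the audio band component. High correlation suggests the audio band
    content was demodulated from an amplitude-modulated ultrasound
    carrier."""
    env = np.convolve(np.abs(non_audio), np.ones(smooth) / smooth, mode="same")
    env = env - env.mean()
    aud = audio - audio.mean()
    denom = np.linalg.norm(env) * np.linalg.norm(aud) + 1e-12
    corr = float(np.dot(env, aud) / denom)
    return abs(corr) > corr_threshold
```

An amplitude-modulated carrier whose envelope tracks the audio trips the detector; an unmodulated carrier does not.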
51. The method of claim 48, wherein the step of comparing comprises:
simulating the effect of a non-linearity on the non-audio band component to provide a simulated non-linear signal;
detecting a level of correlation between the simulated non-linear signal and the audio band component; and
determining possible non-audio band interference within the audio band if the level of correlation exceeds a threshold value.
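The "simulated non-linearity" of claim 51 can be illustrated with the simplest possible model, a memoryless squarer: squaring two ultrasound tones at f1 and f2 produces an intermodulation tone at f2 - f1, which can land in the audio band. The normalised frequencies and coefficient below are invented for the demonstration.

```python
import numpy as np

def simulate_nonlinearity(non_audio, a2=1.0):
    """Memoryless second-order non-linearity: the cross term of the square
    of a two-tone signal contains a difference-frequency component."""
    return a2 * non_audio ** 2

# Two "ultrasound" tones (normalised frequencies 0.35 and 0.40 cycles per
# sample) whose difference, 0.05, falls in the low (audio) band once the
# signal is squared.
n = np.arange(1000)
ultra = np.sin(2 * np.pi * 0.35 * n) + np.sin(2 * np.pi * 0.40 * n)
spectrum = np.abs(np.fft.rfft(simulate_nonlinearity(ultra)))
```

The spectrum of the squared signal shows a strong line at bin 50 (frequency 0.05), where the original two-tone signal had none.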
52. The method of claim 50 or 51, wherein the step of adjusting comprises flagging a detection of possible non-audio band interference within the audio band to a downstream speech processing module.
53. The method of any of claims 46 to 52, wherein the step of adjusting comprises providing a compensated sound signal to a downstream speech processing module.
54. The method of claim 53, wherein the step of providing a compensated sound signal comprises subtracting a simulated non-linear signal from the audio band component to provide a compensated output signal; and
providing the compensated output signal to a downstream speech processing module.
55. The method of claim 48, wherein the steps of comparing and adjusting comprise:
simulating the effect of a non-linearity on the non-audio band component to provide a simulated non-linear signal;
subtracting the simulated non-linear signal from the audio band component to provide a compensated output signal; and
providing the compensated output signal to a downstream speech processing module.
56. The method of claim 54 or 55, wherein the step of subtracting comprises:
applying the simulated non-linearity signal to a filter; and
subtracting the filtered simulated non-linearity signal from the audio band component of the input sound signal to provide a compensated output signal.
57. A method according to claim 56, wherein the filter is an adaptive filter, and the method comprises adapting the adaptive filter such that the component of the filtered simulated non-linearity signal in the compensated output signal is minimised.
58. The method of claim 57, wherein adapting the adaptive filter comprises adapting a gain of the filter.
59. The method of claim 57 or 58, wherein adapting the adaptive filter comprises adapting filter coefficients of the filter.
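Claims 56-59 describe passing the simulated non-linearity signal through an adaptive filter and subtracting the result from the audio band component. A textbook LMS loop behaves as described, driving the residual reference component in the compensated output towards zero; the tap count and step size below are invented example values, not the claimed configuration.

```python
import numpy as np

def lms_cancel(reference, primary, taps=8, mu=0.01):
    """Adaptively filter the simulated non-linearity signal (reference) and
    subtract it from the audio band component (primary). The LMS update
    adapts the filter coefficients so that the component of the filtered
    reference in the compensated output is minimised."""
    w = np.zeros(taps)                          # adaptive filter coefficients
    out = np.zeros(len(primary))
    for i in range(taps - 1, len(primary)):
        x = reference[i - taps + 1:i + 1][::-1]  # current and past reference samples
        e = primary[i] - w @ x                   # subtract filtered reference
        out[i] = e                               # compensated output sample
        w += 2.0 * mu * e * x                    # LMS coefficient update
    return out
```

When the interference in the primary signal is a filtered copy of the reference, the output power collapses after convergence.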
60. The method of claim 54 or 55, wherein the step of simulating a non-linearity comprises providing the non-audio band component to an adaptive non-linearity module, and wherein the method comprises controlling the adaptive non-linearity module such that the component of the simulated non-linearity signal in the
compensated output signal is minimised.
61. The method of any of claims 46 to 60, further comprising the step of:
measuring a signal power in the non-audio band component Pb, wherein the method is responsive to the step of measuring the signal power, such that:
if the measured signal power level Pb is below a threshold level X, the method comprises flagging the input sound signal as free of non-audio band interference, and
if the measured signal power level Pb is above a threshold level X, the method performs the step of identifying possible interference within the audio band from the non-audio band component.
62. The method of any of claims 46 to 61, wherein the step of separating comprises:
filtering the input sound signal to obtain an audio band component of the input sound signal; and
filtering the input sound signal to obtain a non-audio band component of the input sound signal.
63. The method of any of claims 46 to 62, wherein the speech processing system is a voice biometrics system.
64. A method of detecting an ultrasound interference signal, the method comprising:
filtering an input signal to obtain an audio band component of the input signal;
filtering the input signal to obtain an ultrasound component of the input signal;
detecting an envelope of the ultrasound component of the input signal;
detecting a degree of correlation between the audio band component of the input signal and the envelope of the ultrasound component of the input signal; and
detecting a presence of an ultrasound interference signal if the degree of correlation between the audio band component of the input signal and the envelope of the ultrasound component of the input signal exceeds a threshold level.
65. A method of detecting an ultrasound interference signal, the method comprising:
filtering an input signal to obtain an audio band component of the input signal;
filtering the input signal to obtain an ultrasound component of the input signal;
modifying the ultrasound component to simulate an effect of a non-linear downconversion of the input signal;
detecting a degree of correlation between the audio band component of the input signal and the modified ultrasound component of the input signal; and
detecting a presence of an ultrasound interference signal if the degree of correlation between the audio band component of the input signal and the modified ultrasound component of the input signal exceeds a threshold level.
66. A method of processing a signal containing an ultrasound interference signal, the method comprising:
filtering an input signal to obtain an audio band component of the input signal;
filtering the input signal to obtain an ultrasound component of the input signal;
modifying the ultrasound component to simulate an effect of a non-linear downconversion of the input signal; and
comparing the audio band component of the input signal and the modified ultrasound component.
67. A method according to claim 66, wherein comparing the audio band component of the input signal and the modified ultrasound component comprises:
detecting a degree of correlation between the audio band component of the input signal and the modified ultrasound component of the input signal; and
detecting a presence of an ultrasound interference signal if the degree of correlation between the audio band component of the input signal and the modified ultrasound component of the input signal exceeds a threshold level.
68. A method according to claim 67, further comprising sending the audio band component of the input signal to a speech processing module only if no ultrasound interference signal is detected.
69. A method according to claim 66, wherein comparing the audio band component of the input signal and the modified ultrasound component comprises:
applying the modified ultrasound component of the input signal to a filter; and
subtracting the filtered modified ultrasound component of the input signal from the audio band component of the input signal to obtain an output signal.
70. A method according to claim 69, wherein the filter is an adaptive filter, and the method comprises adapting the adaptive filter such that the component of the filtered modified ultrasound component in the output signal is minimised.
71. A system for improving the robustness of a speech processing system having at least one speech processing module, the system comprising:
an input for receiving an input sound signal comprising audio and non-audio frequencies; and
a filter for separating a non-audio band component from the input sound signal,
and the system being configured for:
receiving an input sound signal comprising audio and non-audio frequencies;
separating the input sound signal into an audio band component and a non-audio band component;
identifying possible interference within the audio band from the non-audio band component; and
adjusting the operation of a downstream speech processing module based on said identification.
72. A device comprising a system as claimed in claim 71.
73. A device as claimed in claim 72, wherein the device comprises a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
74. A computer program product, comprising a computer-readable tangible medium, and instructions for performing a method according to any one of claims 46 to 70.
75. A non-transitory computer readable storage medium having computer-executable instructions stored thereon that, when executed by processor circuitry, cause the processor circuitry to perform a method according to any one of claims 46 to 70.
76. A device comprising the non-transitory computer readable storage medium as claimed in claim 75.
77. A device as claimed in claim 76, wherein the device comprises a mobile telephone, an audio player, a video player, a mobile computing platform, a games device, a remote controller device, a toy, a machine, or a home automation controller or a domestic appliance.
PCT/GB2018/052907 2017-10-13 2018-10-11 Detection of liveness WO2019073235A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201880066346.8A CN111201568A (en) 2017-10-13 2018-10-11 Detection in situ
KR1020207013319A KR20200062320A (en) 2017-10-13 2018-10-11 Detection of vitality
GB2004477.2A GB2581594B (en) 2017-10-13 2018-10-11 Detection of liveness

Applications Claiming Priority (16)

Application Number Priority Date Filing Date Title
US201762572033P 2017-10-13 2017-10-13
US201762572016P 2017-10-13 2017-10-13
US201762572001P 2017-10-13 2017-10-13
US201762571944P 2017-10-13 2017-10-13
US62/572,033 2017-10-13
US62/572,016 2017-10-13
US62/571,944 2017-10-13
US62/572,001 2017-10-13
GBGB1801663.4A GB201801663D0 (en) 2017-10-13 2018-02-01 Detection of liveness
GB1801664.2 2018-02-01
GBGB1801661.8A GB201801661D0 (en) 2017-10-13 2018-02-01 Detection of liveness
GBGB1801664.2A GB201801664D0 (en) 2017-10-13 2018-02-01 Detection of liveness
GB1801661.8 2018-02-01
GB1801663.4 2018-02-01
GBGB1801874.7A GB201801874D0 (en) 2017-10-13 2018-02-06 Improving robustness of speech processing system against ultrasound and dolphin attacks
GB1801874.7 2018-02-06

Publications (1)

Publication Number Publication Date
WO2019073235A1 (en) 2019-04-18

Family

ID=66100447

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2018/052907 WO2019073235A1 (en) 2017-10-13 2018-10-11 Detection of liveness

Country Status (4)

Country Link
KR (1) KR20200062320A (en)
CN (1) CN111201568A (en)
GB (1) GB2581594B (en)
WO (1) WO2019073235A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022145835A1 (en) * 2020-12-30 2022-07-07 삼성전자 주식회사 Device and method for detecting voice attack against voice assistant service
WO2023079005A1 (en) * 2021-11-05 2023-05-11 Elliptic Laboratories Asa Proximity and distance detection

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080071532A1 (en) * 2006-09-12 2008-03-20 Bhiksha Ramakrishnan Ultrasonic doppler sensor for speech-based user interface
US20100204991A1 (en) * 2009-02-06 2010-08-12 Bhiksha Raj Ramakrishnan Ultrasonic Doppler Sensor for Speaker Recognition
EP3156978A1 (en) * 2015-10-14 2017-04-19 Samsung Electronics Polska Sp. z o.o. A system and a method for secure speaker verification

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7386372B2 (en) * 1995-06-07 2008-06-10 Automotive Technologies International, Inc. Apparatus and method for determining presence of objects in a vehicle
US20090046538A1 (en) * 1995-06-07 2009-02-19 Automotive Technologies International, Inc. Apparatus and method for Determining Presence of Objects in a Vehicle
CN105446474B (en) * 2014-09-26 2018-08-10 中芯国际集成电路制造(上海)有限公司 Wearable smart machine and its method of interaction, wearable smart machine system


Also Published As

Publication number Publication date
GB202004477D0 (en) 2020-05-13
KR20200062320A (en) 2020-06-03
GB2581594A (en) 2020-08-26
CN111201568A (en) 2020-05-26
GB2581594B (en) 2022-08-10

Similar Documents

Publication Publication Date Title
US11705135B2 (en) Detection of liveness
US11017252B2 (en) Detection of liveness
US11023755B2 (en) Detection of liveness
US10832702B2 (en) Robustness of speech processing system against ultrasound and dolphin attacks
US11704397B2 (en) Detection of replay attack
US11631402B2 (en) Detection of replay attack
US11276409B2 (en) Detection of replay attack
US20190115030A1 (en) Detection of replay attack
KR101255404B1 (en) Configuration of echo cancellation
US20140341386A1 (en) Noise reduction
US10529356B2 (en) Detecting unwanted audio signal components by comparing signals processed with differing linearity
WO2019073235A1 (en) Detection of liveness
KR101659895B1 (en) Method And Apparatus for Noise Reduction And Inducement thereto
US11705109B2 (en) Detection of live speech
US12142259B2 (en) Detection of live speech
US20230343359A1 (en) Live speech detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18788842

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 202004477

Country of ref document: GB

Kind code of ref document: A

Free format text: PCT FILING DATE = 20181011

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20207013319

Country of ref document: KR

Kind code of ref document: A

122 Ep: pct application non-entry in european phase

Ref document number: 18788842

Country of ref document: EP

Kind code of ref document: A1