GB2526980A - Sensor input recognition - Google Patents
- Publication number
- GB2526980A GB2526980A GB1515868.6A GB201515868A GB2526980A GB 2526980 A GB2526980 A GB 2526980A GB 201515868 A GB201515868 A GB 201515868A GB 2526980 A GB2526980 A GB 2526980A
- Authority
- GB
- United Kingdom
- Prior art keywords
- signal
- recognition system
- speech
- sensor input
- input recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/22—Interactive procedures; Man-machine interfaces
- G10L17/24—Interactive procedures; Man-machine interfaces the user being prompted to utter a password or a predefined phrase
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L2015/088—Word spotting
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Telephone Function (AREA)
Abstract
In e.g. a speech recognition system (e.g. in a mobile phone), a sensor 208 detects an input (e.g. sound) signal and determines whether it contains a selectable pattern (e.g. a trigger phrase such as "hello phone") via analysers 206 coupled to an analysis signal memory and a controllable pattern detector 220 acting on e.g. detected keywords (from 120, 5a). Aspects of the invention include the subsequent validation 120 of the trigger word or pass phrase which, when spoken by an authorised speaker (identified by speaker verification), wakes up 212 the speech recognition engine (SRE 132). Noise reduction may be tuned for speech recognition 134 or communication 136 depending on whether it is applied before or after the validation.
Description
SENSOR INPUT RECOGNITION
This invention relates to sensor input recognition, and in particular to a system for handling various sensor inputs. For example, in a situation in which the sensor is a microphone, the system can detect that a pass phrase has been spoken, and may also be able to validate that the pass phrase was spoken by a specified speaker, allowing the system to be used as a hands-free and low power consumption means of activating higher power consumption functions such as speech recognition in consumer devices.
Smartphones are one example of such consumer devices.
In speech recognition, it is known to provide circuitry which is able to continually listen for voice commands while in stand-by mode. This removes the requirement for a button or other mechanical trigger to 'wake up' the device from stand-by mode, for instance to activate a speech recognition function. One possible way of initiating hands-free operation is for the user of the phone to say a key phrase, for example "Hello phone". The device is then able to recognise that the key phrase has been spoken, and to wake up the speech recognition function and potentially the rest of the device. Furthermore, the hands-free command may be programmed to be user specific, in which case only a previously registered user (or users) can utter the key phrase; the device will be able to verify that it is that specific user speaking (speaker recognition) and progress to wake up the speech recognition function.
According to the present invention, there is provided a sensor input recognition system, comprising: at least one sensor for generating an input signal; a plurality of analysers, for generating respective analysis signals from the at least one input signal; a buffer memory, for storing the analysis signals; and a controllable pattern detector, for detecting a selectable pattern in the analysis signals.
For a better understanding of the present invention, and to show how it may be put into effect, reference will now be made, by way of example, to the accompanying drawings, in which:
Figure 1 shows a mobile telephone and various peripheral devices;
Figure 2 shows components of the audio processing circuitry in the mobile telephone of Figure 1;
Figure 3 shows a simplified schematic of the components of Figure 2 which relate to the voice recognition functionality of the device;
Figure 4 illustrates the functional blocks in a first sensor input recognition system; and
Figure 5 illustrates the functional blocks in a second sensor input recognition system.
Figure 1 shows a consumer device according to an aspect of the invention, in this example a communications device in the form of a mobile telephone 1, more specifically in the form of a smartphone. In this example, the mobile telephone 1 has a screen 3 and a keypad 5, although of course the invention is equally applicable to devices with touchscreens and/or other user interfaces, to devices such as tablet computers, and to devices with more limited communications capability, such as purely Bluetooth™-enabled devices, or with no communication capability at all. The mobile telephone 1 also has an inbuilt speaker 7 and an inbuilt main microphone 9, which are both analog transducers. The mobile telephone 1 also has a plurality of microphones 11 (four in this particular example, which may be analog or digital microphones), allowing multiple acoustic signals to be received and converted to respective electrical signals, for example to provide multiple electrical ambient noise signals for use in a noise cancellation system, or to provide multiple signals for beamforming in order to enhance the signal input to a speech recognition system.
As shown in Figure 1, the mobile telephone 1 may have a jack socket (not illustrated) or similar connection means, such as a USB socket or a multi-pin connector socket, allowing a headset, comprising a pair of stereo earpieces 13 and possibly a microphone 15, to be connected to it by respective wires and a jack plug (not illustrated) or similar connection means, such as a USB plug or a multi-pin connector plug. Alternatively, the mobile telephone 1 may be connected wirelessly, for example using the Bluetooth™ communications protocol, to a wireless headset 17, having earpieces 19 and possibly a microphone 21. Although not illustrated, the earpieces 13, 19 may also comprise one or more ambient noise microphones (which may be analog or digital microphones), allowing one or more ambient noise signals to be received, for example for use in a noise cancellation system.
Figure 2 shows components of the sensor input handling system in the mobile telephone 1. Communication with the cellular telephone network 29 is handled by a baseband processor (sometimes referred to as a communications processor) 31. An applications processor 33 handles, amongst other processes, processes in which audio data is reproduced from or stored into a memory 35 (which may be solid-state or on a disk, and which may be built-in or attachable, for example, either permanently in the mobile telephone or on a removable memory device) and other processes in which audio data is generated internally within the telephone 1. For example, the applications processor 33 may handle: voice recognition; the reproduction of stereo music stored digitally in the memory 35; recording of telephone conversations and other audio data into the memory 35; the generation of satellite navigation commands; and the generation of tones to confirm the pressing of any button on the keypad 5. A wireless transceiver (or wireless codec) 37 handles communications using the Bluetooth™ protocol or another short-range communications protocol, for example with the wireless headset 17.
The baseband processor 31, the applications processor 33, and the wireless transceiver 37 all send audio data to, and receive audio data from, switching circuitry in the form of an audio hub 39, i.e. an audio codec. The audio hub 39 takes the form of an integrated circuit in this described embodiment.
In the embodiment described above, the audio signals between the audio hub 39 and the baseband processor 31, the applications processor 33, and the wireless transceiver 37 are all digital, and some of them may be in stereo, comprising a left audio data stream and a right audio data stream. Additionally, at least in the case of communication with the applications processor 33, further data streams may be multiplexed into the audio signals, for example to enable the applications processor 33 to provide stereo music and also other audio signals such as key press confirmation tones simultaneously.
The audio hub 39 communicates with the baseband processor 31, the applications processor 33, and the wireless transceiver 37 over respective audio data links, i.e. buses, 38b, 38a, 38c, and the audio hub 39 has respective digital interfaces 40b, 40a, 40c for these data links.
It will be appreciated that, in applications where there is no requirement for a wireless transceiver 37 for example, the audio hub 39 need only have two audio data links and two respective digital interfaces.
The audio hub 39 also provides audio signals to, and receives audio signals from, the built-in analog audio transducers of the mobile telephone 1. As shown in Figure 2, the audio hub 39 provides output audio signals to the speaker 7, and receives input audio signals from the microphones 9, 11.
The audio hub 39 can also be connected to other output transducers 43, which may be analog or digital transducers, and which may be built in to the mobile telephone 1 (for example in the case of a haptic output transducer) or in devices external to the mobile telephone 1 (for example the earpieces 13 of the wired headset shown in Figure 1).
The audio hub 39 can also be connected to other input transducers 45, which again may be analog or digital transducers, and which again may be built in to the mobile telephone 1 (for example an ultrasound microphone) or in devices external to the mobile telephone 1 (for example the microphone 15 of the wired headset). The input transducers 45 may include some or all of: accelerometers, temperature sensors, pressure sensors, or the like, as described in more detail below.
It is to be appreciated that Figure 2 shows just one possible device that can be controlled by voice recognition, and that generally similar architectures, for example based around audio hub integrated circuits as described here, are usable in an extremely wide range of electronic devices, including industrial, professional or consumer devices, such as cameras (DSC and/or video), portable media players, PDAs, games consoles, satellite navigation devices, tablets, notebook computers, TVs or the like. Devices according to other embodiments or aspects of the invention may have different architectures, for example with only a single data interface, or even with no audio data interfaces to other processors.
Figure 3 is a block diagram showing components of the system which may be involved in the sensor input handling functionality. The microphone, or multiple microphones, 11, audio hub 39, and applications processor 33 are located in the mobile telephone 1, whereas the peripheral audio input devices 46 are connected to the mobile telephone 1 by either a wired or wireless connection. As mentioned previously, other sensors may be provided in addition to, or instead of, the microphone(s) 11, and the device 1 may be any type of electronic device rather than a mobile telephone.
The electrical signals which are continuously generated in response to respective acoustic stimuli by either one microphone or multiple microphones 11, or the peripheral audio input devices 46, or the other input transducers 45, are inputted into the audio hub 39. The generated audio signal or signals are then routed through the audio hub 39, wherein the signals can be processed by one or more digital signal processing (DSP) elements. Inside the audio hub 39 the audio signals are not restricted to one route and can be processed in many different ways. As described in more detail below, this processing can include key phrase detection, noise reduction, altering the frequency response, and altering the gain of the signal. Audio signal analysis and processing can take place in the audio hub 39 when other components, such as the applications processor 33 of the audio system, are in stand-by mode, i.e. in a low-power mode.
In this described example, the voice recognition functionality operates using a multi-phase process.
Figure 4 is a block diagram, illustrating in more detail the functionality of the sensor input handling system as described herein. The required functionality may be provided in hardware or software as required, and in particular any of the functions described herein may be provided as computer-readable code, possibly stored on a non-transitory medium, for running on any suitable computational resources for providing the required function. It will be appreciated that this functionality may be distributed amongst multiple separate integrated circuits, or even across several larger devices, as required. For example, in one embodiment based around the architecture shown in Figure 2, the first and second phase operations might be performed in a digital signal processor within the audio hub integrated circuit, while a third phase operation might be performed in the applications processor, and further operations might be performed in a server computer accessed over the Internet. Other possibilities exist, particularly in devices with different processor architectures.
In one alternative architecture, the third phase operation is also carried out in the audio hub.
In another alternative architecture, all of the processing is performed in a single host processor, in which case the first phase (and possibly also the second phase) processing can be carried out in a power island of the processor that is always powered up, with the remainder of the processor only being powered up or enabled to perform the third phase when the second phase determines that the trigger phrase has been spoken.
Figure 4 shows multiple possible sources of the sound signal for the system. Figure 4 shows an internal microphone 100, but there might be multiple such microphones. For example, a handset might be provided with one microphone on its front surface and one microphone on its rear surface, although of course other configurations are quite possible. In a system with multiple microphones, it may be sufficient for at least the initial phases of the speech detection to use the signal from only one microphone, as described below.
In addition, the system shown in Figure 4 has the possibility to have at least one peripheral device connected thereto. For example, the peripheral device may be a headset, with a wired or wireless (for example Bluetooth™) connection. When such a headset is being worn, the microphone on the headset will usually pick up the user's speech better than a microphone on the handset, and so it will typically be preferred to use the signal detected by a microphone on the headset for the purposes of speech recognition whenever the headset is connected to the handset. Therefore, a source selection block may be connected to receive the signals from the internal microphone(s) 100 and the microphones on the peripheral device, and to select one of these signals for further processing. For example, the source selection block may select a signal from the peripheral device when it is detected that a headset is plugged into the handset, or when it is detected that the handset has a Bluetooth™ connection to a headset.
As mentioned above, at least the initial phases of a multi-phase speech recognition system can advantageously use the input from a single microphone, even when multiple microphones are available. However, it may be preferred not to rely on the input from a predetermined microphone, because one or more of the microphones on a handset may be occluded, for example if the handset is placed on a flat surface or is being carried in a bag or pocket. The system may therefore include a microphone polling function, which detects whether one or more of the microphones is occluded, and selects the signal from a microphone that is determined not to be occluded.
For example, an algorithm running on the audio hub 39 (or the host processor 33) could periodically enable each microphone in turn (including a headset if connected), compare the magnitude of the output of each microphone across different parts of the spectrum, determine which microphone has the strongest and "flattest" signal (i.e. a spectral shape most similar to likely or desired speech signals), and select this microphone as a source for Phase 1 operation, disabling the remainder of the microphones.
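Purely by way of illustration, the polling routine described above might be sketched as follows. The sketch assumes that each candidate microphone has already been enabled in turn and a short frame captured from it (something the audio hub firmware would handle); the eight-band split and the flatness measure are illustrative choices, not requirements of the design.

```python
import numpy as np

def select_microphone(frames):
    """Pick the microphone whose frame is strongest and spectrally flattest.

    `frames` maps a microphone identifier to a short 1-D array of samples
    captured while that microphone alone was enabled (the enabling and
    disabling itself would be handled by the audio hub firmware).
    """
    best_id, best_score = None, -np.inf
    for mic, frame in frames.items():
        windowed = np.asarray(frame, dtype=float) * np.hanning(len(frame))
        spectrum = np.abs(np.fft.rfft(windowed))
        band_energy = np.array([np.mean(b ** 2)
                                for b in np.array_split(spectrum, 8)])

        level = band_energy.sum()                       # overall signal strength
        flatness = (np.exp(np.mean(np.log(band_energy + 1e-12)))
                    / (np.mean(band_energy) + 1e-12))   # 0..1, 1 = perfectly flat
        score = level * flatness
        if score > best_score:
            best_id, best_score = mic, score
    return best_id  # used as the Phase 1 source; the remaining mics stay disabled
```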
The signal received from the microphone 100 (or other selected input source) is passed to a buffer 110, which typically is able to store signals representing a period of sound, say 2-10 seconds for example: clearly the buffer can be sized so as to store the required period of time varying signal or signals. It will be appreciated that the buffer 110 may store signals generated by selected multiple microphones, or all available microphones, if multiple signals are provided concurrently.
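As a minimal sketch of such a buffer (the 16 kHz sample rate and 5 second depth below are arbitrary illustrative values, not taken from the description):

```python
import numpy as np

class AudioRingBuffer:
    """Fixed-size circular buffer holding the most recent few seconds of audio."""

    def __init__(self, seconds=5.0, sample_rate=16000):
        self.sample_rate = sample_rate
        self.data = np.zeros(int(seconds * sample_rate), dtype=np.float32)
        self.write_pos = 0

    def write(self, samples):
        """Append new samples, silently overwriting the oldest ones."""
        samples = np.asarray(samples, dtype=np.float32)
        idx = (self.write_pos + np.arange(len(samples))) % len(self.data)
        self.data[idx] = samples
        self.write_pos = (self.write_pos + len(samples)) % len(self.data)

    def read_last(self, seconds):
        """Return the most recent `seconds` of audio in chronological order."""
        n = min(int(seconds * self.sample_rate), len(self.data))
        idx = (self.write_pos - n + np.arange(n)) % len(self.data)
        return self.data[idx]
```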
The signal received from the microphone 100 (or other selected input source) is also passed to first phase processing circuitry 112, and specifically to one or more filter blocks 114. The purpose of the first phase processing is to detect, within the received signal Dout, signals that might represent speech.
The filter blocks 114 can for example remove or attenuate the components of the signal at respective frequency bands. These frequency bands can be relatively narrow, for example to remove disturbance signals at specific frequencies, or can be relatively wide, for example to ensure that signals in frequency bands that would typically not be contained in speech are not passed through. Thus, in one example, the filter blocks 114 include a bandpass filter that passes signals in a frequency range that is typical of speech, such as 300 Hz to 3 kHz.
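For illustration only, a speech-band filter of this kind could be realised as below; the Butterworth design and fourth-order choice are assumptions for the sketch, not requirements of the text.

```python
from scipy.signal import butter, lfilter

def speech_bandpass(x, fs=16000, low_hz=300.0, high_hz=3000.0, order=4):
    """Pass only the roughly 300 Hz to 3 kHz band typical of speech."""
    b, a = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs)
    return lfilter(b, a, x)
```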
The filtered signal is passed to a signal activity detection (SAD) block 116. As mentioned above, the purpose of this block 116 is to identify received microphone signals Dout that might represent speech, in order that such signals can then be examined in more detail to determine whether they contain the predetermined trigger phrase. Many signal or voice activity detection (VAD) circuits 116 already exist, for example for use in noise cancellation systems or voice wireless communication protocols, and any suitable voice/signal activity detection block/circuit 116 may be used here. However, it should be noted that some activity detection blocks/circuits 116 aim to detect the user's voice with high reliability and are thus relatively complex, and therefore require relatively high power consumption.
In this case, the result of a positive determination by the signal activity detection block 116 is an Enable signal that enables the phase 2 processing and controls the operational mode of the buffer 110.
As such, it might be advantageous to use a relatively simple form of activity detection block 116 that has correspondingly lower power consumption, and to tolerate a larger number of false detection events. For example, the activity detection block 116 might simply determine whether its received signal exceeds a threshold level. Such a determination should be made based on the overall envelope of the signal rather than on a single sample being above a threshold level.
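A minimal sketch of such an envelope-based detector is given below; the smoothing constant and threshold are illustrative assumptions.

```python
import numpy as np

def simple_signal_activity(frame, envelope, threshold=0.01, alpha=0.05):
    """Very low-cost activity detection: smooth the rectified signal into an
    envelope and compare the envelope, rather than any single sample, with a
    threshold. Returns (active, updated_envelope)."""
    for sample in np.abs(np.asarray(frame, dtype=float)):
        envelope = (1.0 - alpha) * envelope + alpha * sample
    return envelope > threshold, envelope
```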
When the signal activity detection block 116 determines that the signal might contain speech, the phase 2 processing (block 118) is enabled. In particular, the phase 2 processing includes a trigger phrase detection block 120.
In this example, the signal Dout received from the selected microphone is passed through filters 114, before it is determined whether it might represent speech. The signal Dout from the selected microphone is also stored in a buffer 110. When the signal activity detection block 116 determines that a specific part of the signal Dout might represent speech, the unfiltered signal Dout generated by the selected microphone during the same time period is retrieved from the buffer 110 and passed to the trigger phrase detection block/circuit 120. (Here the term "unfiltered" is used to refer to a signal that has not passed through the filter block 114: such a signal may have passed through some filter included in a part of the path from the microphone 100 to the buffer 110.) More specifically, the signal that is passed to the trigger phrase detection block 120 contains the unfiltered signal from the time period corresponding to the signal identified by the signal activity detection block, but also contains the unfiltered signal from a short period (for example 200ms) before and a short period (for example 100ms) after that time period. This allows the trigger phrase detection block 120 to detect the ambient noise, and to take that into account when attempting to detect the trigger phrase. This also allows for any delay in the signal detection within phase 1. The general operation of various forms of trigger phrase detection block is known to the person skilled in the art, and is not described further herein.
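The retrieval of that padded window from the buffer might look like the following sketch, working in sample indices; the function and variable names are illustrative only.

```python
def extract_with_padding(buffered, fs, start_idx, end_idx, pre_s=0.2, post_s=0.1):
    """Return the unfiltered samples flagged by the activity detector, extended
    by roughly 200 ms before and 100 ms after the flagged region so that the
    trigger phrase detector can also observe the surrounding ambient noise.

    `buffered` is the raw (unfiltered) signal held in the buffer, and
    start_idx / end_idx are the sample indices of the flagged region.
    """
    start = max(0, start_idx - int(pre_s * fs))
    end = min(len(buffered), end_idx + int(post_s * fs))
    return buffered[start:end]
```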
When the trigger phrase detection block 120 determines that the received signal contains speech representing the trigger phrase, an enable signal (03 EN) is sent to a trigger phrase validation block 122 for phase 3 processing. It will be noted that the trigger phrase detection block 120 in the phase 2 processing simply attempts to recognise the presence of the predetermined trigger word or phrase in the received signal. It does not attempt to confirm that the trigger phrase is being spoken by the authorised user of the device.
A trigger phrase validation function is performed by the trigger phrase validation block 122, operating on the same section of the original signal that was used by the trigger phrase detection block 120, and that was stored in the buffer 110. This allows successive stages of trigger phrase validation to take place transparently to the user, without the need for the user to repeat the phrase, providing security without compromising a relatively natural communication style, which is advantageous.
The trigger phrase validation function therefore needs to be trained by the user, who might for example need to speak the trigger phrase multiple times and under multiple conditions as part of the initialization of the system. Then, when the phase 3 processing is enabled, the trigger phrase validation function can compare the speech data with the stored data obtained during this initialization, in order to judge whether the trigger phrase has been spoken by the user. Techniques for performing this function are known to the person skilled in the art, and so they are not described in further detail here, as they are not relevant for an understanding of the present invention.
When it is determined by the trigger phrase validation block 122 that the trigger phrase was spoken by an authorised user, an enable signal (SR EN) is sent to the speech recognition engine 132 which might be provided in a specialist processor, and might, as mentioned previously, be provided in a separate device altogether. The purpose of the speech recognition function 132 is to identify commands spoken by the user after speaking the trigger phrase. These commands can then be acted upon for control purposes, for example to control an aspect of the operation of the mobile telephone 1 or other device. By way of example, the command may be an instruction to place a phone call to another person specified in the command.
In one example, the system is configured so that certain functions can be performed by any person, without waiting for the trigger phrase validation block 122 to complete its analysis of the current speech sample or to make its decision. As noted above, the normal operation is that the second phase processing will recognise that a specified trigger phrase has been spoken, and the third phase processing will recognise whether it has been spoken by the specified user. Only if the third phase processing recognises that the trigger phrase was spoken by the specified user is the subsequent speech sent to the speech recognition engine for interpretation and processing.
However, if the subsequent speech contains a predetermined phrase (which may for example be a phrase from a list of "emergency response" type phrases, such as "Call 999", "Call ambulance", or the like), then this is recognised, and the appropriate action is taken, without first determining whether it was an authorised user that spoke the trigger phrase. In order to achieve this, this recognition step may take place in the trigger phrase detection block 120. Alternatively, whenever the trigger phrase is detected in the phase 2 processing, the subsequent speech may always be sent to the speech recognition engine 132 (in parallel with trigger phrase validation block 122) to determine whether it contains one of the specified emergency call phrases.
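The routing decision described in this paragraph could be summarised, purely as an illustration, by the following sketch; the phrase list and the return labels are hypothetical.

```python
EMERGENCY_PHRASES = {"call 999", "call ambulance"}   # illustrative list only

def route_after_trigger(subsequent_phrase, speaker_validated):
    """Decide how to handle speech following a detected trigger phrase.

    `subsequent_phrase` is a provisional recognition of the speech after the
    trigger; `speaker_validated` is the (possibly still pending) phase 3
    result. Emergency phrases are acted on without waiting for validation.
    """
    if subsequent_phrase.strip().lower() in EMERGENCY_PHRASES:
        return "act_immediately"              # no speaker validation required
    if speaker_validated:
        return "send_to_speech_recognition"
    return "discard"
```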
In order to be able to perform speech recognition with a high degree of accuracy, it may be advantageous to perform noise reduction on the speech signal before performing the speech recognition.
Thus, the signal output by the microphone(s) 100 may be passed to a noise reduction block 134 so that it generates a noise-reduced output. In the noise reduction block 134, the noise reduction is specifically optimised for automatic speech recognition. The output signal of this noise reduction block 134 is ultimately passed to a speech recognition function, as described in more detail below. In order to save power, it may be advantageous for the noise reduction block 134 to be switched on only once the trigger phrase detection block 120 has determined that the trigger phrase has been spoken, i.e. the signal 03 EN enables the operation of the noise reduction block 134.
At the same time, the signal output by the microphone 100 may be passed to a second noise reduction block 136, in which the noise reduction is specifically optimised for human communication or the characteristics of the network voice communication channel to be used. In the case where the device is a mobile telephone, the output of this second noise reduction block 136 is ultimately transmitted over the mobile communications link. The operation of a suitable second noise reduction block 136 is known to the person skilled in the art, and will not be described further herein.
It will therefore be noted that the functions performed by the first noise reduction block 134 and second noise reduction block 136 are different. In one example, the functions performed by the second noise reduction block 136 are a subset of the functions performed by the first noise reduction block 134. More specifically, noise reduction that is performed for human communication tends to introduce distortion and other artefacts which have an adverse impact on speech recognition. Therefore, a lower-distortion form of processing is used in the first noise reduction block 134, whose output is used for speech recognition.
The output of the first noise reduction block 134, which is optimised for speech recognition, and the output of the buffer 110, that is the buffered unfiltered digital input speech signal Dout, are both capable of being passed to a path selection block 140, for example in the form of a multiplexer, which is controlled by a source selection driver 142. The signal selected by the path selection block 140 may then be passed to the speech recognition engine 132, and possibly also to the trigger phrase validation block 122.
In some situations, it is difficult to perform trigger phrase validation and/or full speech recognition, because of the ambient noise levels. The use of the noise reduction block 134 as described above means that ambient noise reduction can be performed, and the output of the noise reduction block 134 can be sent to the speech recognition engine 132.
In this illustrated embodiment, noise reduction is also performed while in standby mode. That is, noise reduction is activated whenever the signal generated by the microphone(s) 100 is representative of a noisy environment. Specifically, in this embodiment, the signal generated by the microphone(s) 100 is passed to a voice activity detection block 150 and a second signal activity detection block 152. The voice activity detection block 150 generates a signal to indicate that the received signal has the characteristics of speech, while the second signal activity detection block 152 generates a signal to indicate that the received signal represents a high noise level. (In this embodiment, the second signal detection block 152 operates with different parameters from the first signal detection block 116, but it would equally be possible to use the signal generated by the first signal detection block 116 as an indication that the signal contains a high level of noise.) The output signals of the voice activity detection block 150 and the second signal activity detection block 152 are passed to a block 154 that determines when these signals indicate that there is a high level of ambient noise (that is, a high sound level that is not the result of nearby speech). In this event, a control signal is sent to the source selection manager 142. The source selection manager 142 then sends an enable signal to begin operation of the noise reduction block 134, and also controls the multiplexer 140 so that it is the output of the noise reduction block 134 that is supplied to the trigger phrase validation block 122.
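The control flow described above might be summarised by the following sketch; the threshold value and the return labels are illustrative assumptions rather than values from the description.

```python
def choose_validation_source(voice_active, noise_level, noise_threshold=0.1):
    """Select the signal that feeds trigger-phrase validation.

    High ambient noise (a high level that is not nearby speech) enables the
    ASR-optimised noise reduction block and routes its output to validation;
    otherwise the raw buffered signal is used and noise reduction stays off
    to save power.
    """
    high_ambient_noise = (not voice_active) and noise_level > noise_threshold
    if high_ambient_noise:
        return {"noise_reduction_enabled": True, "validation_input": "noise_reduced"}
    return {"noise_reduction_enabled": False, "validation_input": "buffered_raw"}
```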
Thus, trigger phrase validation is improved by the use of noise reduction when the noisy environment requires this, but power is saved by disabling the noise reduction in less noisy environments.
In one example, as discussed above, the phase 2 processing 118 and the associated functions, including the buffer 110 and the path select block 140, are provided in one integrated circuit such as an audio hub, i.e. audio codec, while the phase 3 processing is provided in another integrated circuit such as an applications processor of a mobile phone.
Thus, a multi-phase approach is used.
The first phase can operate using very low power, and so the fact that it is always on does not lead to high continual power consumption.
The second phase operates using relatively low power, and is on for a relatively small fraction of time, and so again this does not lead to high power consumption when averaged over a time interval comprising a high fraction of inactivity.
The third phase uses a relatively high power, but is expected to be operating for only a very small fraction of time, and so again this does not lead to high average power consumption.
In an audio system of the general type shown in Figure 3, in which there are two or more processors, the first (and also the second) phase processing may be carried out in one integrated circuit, such as the audio hub 39, while the third phase processing may be carried out in another integrated circuit, such as the applications processor 33 in the mobile telephone 1. This has the advantage that, while the handset is in a standby mode, the applications processor 33 does not even need to be woken up unless the second phase processing determines that the trigger phrase has been spoken.
Further, especially to provide more sophisticated algorithms for speech recognition (or even other applications such as real-time language translation) than may be implemented in real time with the computing and data-bank resources of the device, the actual speech recognition may advantageously not be carried out in the mobile telephone 1 at all, but might be carried out using cloud-based processing, by establishing a network connection from the mobile telephone. As this will be triggered only rarely, and when actually required, the power consumption involved with the network connection will not contribute greatly to the total average power consumption.
Thus, this progressive commitment of processing power means that the system as a whole can operate in an apparently "always on" fashion, while its average power consumption remains relatively low.
Similar monitoring of signals in an apparently "always on" fashion, while maintaining a low average power consumption, can be achieved with other sensor input signals. This can be used to bring about a wide range of additional functionality.
Figure 4 also shows various data and signals from the speech recognition system, and elsewhere, being stored for such monitoring purposes. Specifically, a second buffer 200 is provided, with the intention that this should be able to store metadata (or data relating to the raw microphone data, rather than the raw data itself) at a relatively low data rate for a relatively long period of time, for example greater than 10 minutes, greater than 30 minutes, greater than 2 hours, or greater than 8 hours.
The data stored in the second buffer 200 can be derived from the raw microphone data. For example, Figure 4 shows the output of the voice activity detection block 150 being stored in the buffer 200. Thus, the buffer 200 may for example store a low bandwidth signal that indicates in perhaps just one bit, at a sampling rate of say 10Hz, whether the signal generated by the microphone 100 represents speech.
Figure 4 also shows the output of the trigger phrase detection block 120 being stored in the buffer 200. Thus, the buffer 200 may for example store a signal whenever the trigger phrase is detected.
Figure 4 also shows the output of the second signal activity detection block 152 being stored in the buffer 200. Thus, the buffer 200 may for example store a low bandwidth signal that indicates in a small number of bits, at a sampling rate of say 10Hz, whether the signal generated by the microphone 100 represents noise above one or more threshold values.
Figure 4 also shows the microphone signal being passed to a Fast Fourier Transform (FFT) block, which can determine the amplitude of the microphone signal in each of several frequency ranges. After being passed through a compression or downsampling block 204, the FFT output data is stored in the buffer 200. Thus, for example, the buffer 200 may store several low bandwidth signals that each indicate in a small number of bits, at a sampling rate of say 10Hz, whether the signal generated by the microphone 100 represents sound in a respective frequency range above one or more threshold levels.
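As a sketch of how such band metadata might be produced, the following reduces one audio frame (for example 100 ms, giving a 10 Hz metadata rate) to one bit per band; the number of bands and the thresholds are illustrative assumptions.

```python
import numpy as np

def band_metadata(frame, thresholds, n_bands=8):
    """Reduce one audio frame to one bit per frequency band: is the energy in
    that band above its threshold? The result is a small array of 0/1 flags
    suitable for storage in the low-rate metadata buffer."""
    windowed = np.asarray(frame, dtype=float) * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed)) ** 2
    energies = np.array([b.mean() for b in np.array_split(spectrum, n_bands)])
    return (energies > np.asarray(thresholds)).astype(np.uint8)
```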
Figure 4 also shows the microphone signal being passed to at least one additional analyser block 206, which can perform a specific analysis function on the microphone signal.
For example, one possible analyser block 206 detects a specific form of user interaction with the device, such as tapping the device, swiping it in a particular way, or picking it up. Another possible analyser block 206 is an ultrasonic analyser. That is, when the microphone 100 is able to detect signals at ultrasonic frequencies, the ultrasonic analyser may look for specific features of such signals. For example, the ultrasonic analyser may detect features of such signals that are representative of a situation in which the device is about to be picked up, or may perform gesture recognition to detect specific patterns of signals received from multiple ultrasonic sensors.
In addition, while Figure 4 shows signals coming from a microphone 100 that is integrated in the device that incorporates the system, it has been mentioned that signals can be used from microphones provided on accessories such as headphones.
In addition, signals generated by microphones on other separate devices can be routed to the analysis blocks, so that the relevant metadata can be stored in the buffer 200.
In addition to, or as an alternative to, the microphone 100, signals can also be received from other sensors 208. Suitable sensors may include, for example, accelerometers, magnetometers, gyroscopes, ambient light (visible, infrared and ultraviolet) sensors, proximity detectors, pressure sensors, touch sensors, haptic sensors, fingerprint readers, medical sensors (such as heart rate monitors and blood sugar monitors), and environmental sensors (such as pollutant sensors, for example carbon monoxide sensors, and humidity detectors).
The sensors 208 may generate low bandwidth data that can be stored in the buffer 200, or they may generate data that is analysed before storage in the buffer 200. For example, the sensor may be connected to an analysis block (not shown in Figure 4) that is adapted to perform some initial analysis on the data (such as determining whether a particular criterion is met, for example whether a threshold value is exceeded), with the binary or low-bit-precision output being stored in the buffer 200.
Thus, the low bandwidth data may relate to specific features of the data, such as: when the audio input contains speech; when the audio input contains loud noises; when there are bright flashes of light; when the device is in the presence of an infrared controller such as a set top box remote control unit; when the device is close to other devices; or the like.
The audio hub also has an interface to allow software applications 210 running on the host processor to access the data from the buffer 200. These software applications may be downloaded by the user after purchase of the electronic device, or may be provided by the manufacturer of the electronic device. Thus, the software applications can look for patterns in the long-term buffered data and produce results that are useful to that user. The software applications can also have access to the raw audio data stored in the first buffer 110 and/or the noise-reduced audio data generated by the noise reduction block 134, and/or to the output of the speech recognition engine 132.
One function of the software applications 210 may be to wake up other functionality in the electronic device, so that the other functionality can appear to be "always on". That is, the user does not need to activate the specific functionality because it is activated automatically in response to specific criteria being met, although the power consumption required is considerably less than would be required in order to maintain that functionality operating permanently.
Thus, each of the applications 210 is able to control a wake up criteria manager 212, and set the criteria that are of interest. The wake up criteria manager 212 controls a pattern detector or change detector 214, which examines the data (or, more specifically, the low bandwidth metadata) being stored in the buffer 200. When the pattern detector or change detector 214 determines that the required criterion has been met, it can generate an interrupt request (IRO) signal on the interface 216, which can then be supplied to the relevant additional functionality.
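A very small sketch of this arrangement follows; representing a criterion as a predicate over the recent metadata history, and standing in for the hardware interrupt with a callback, are assumptions made for illustration.

```python
class WakeUpCriteriaManager:
    """Lets applications register wake-up criteria and checks them whenever
    new metadata arrives; a matching criterion raises the wake-up signal."""

    def __init__(self, raise_irq):
        self.criteria = {}          # application name -> predicate over metadata
        self.raise_irq = raise_irq  # stands in for the IRQ on the interface

    def set_criterion(self, app_name, predicate):
        self.criteria[app_name] = predicate

    def on_new_metadata(self, history):
        """Called when a new metadata record lands in the buffer."""
        for app_name, predicate in self.criteria.items():
            if predicate(history):
                self.raise_irq(app_name)   # wake the relevant functionality


# Example: wake a music-identification app when recent frames look like music.
manager = WakeUpCriteriaManager(raise_irq=lambda app: print("wake ->", app))
manager.set_criterion(
    "music_id",
    lambda history: sum(rec.get("music_like", 0) for rec in history[-50:]) > 40,
)
```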
For example, there exist various possibilities for performing analysis on the movement of a person carrying an electronic device, but this analysis might involve a power consumption of tens or hundreds of milliwatts. By waking up this functionality only when the low bandwidth analysis data from a sensor indicates that the person is actually moving at a specified speed, the power consumption might be reduced by a factor of ten or more, even though the functionality appears to the user to be always on.
As another example, analysis of the metadata resulting from a microphone signal might reveal that the ambient noise has the characteristics of music. When the suitable criteria are met, a signal may be sent to start running a different application, for example an application that identifies the title and performer of music from an audio input. Again, therefore, by waking up this application only when music is audible, the power consumption can be greatly reduced, even though the application appears to be always on. This leads to the possibility that a full list of performers and artists heard by the user over a long period can be generated by the application without requiring additional action by the user.
Figure 5 illustrates a further sensor input recognition system. Various aspects of the system of Figure 5 are the same as those of the system of Figure 4. These aspects are illustrated by the same reference numerals, and will not be described further herein.
In Figure 5, the data, or metadata, generated by the analysers such as the activity detectors, trigger phrase detectors, etc. are passed through a trainable pattern recognition engine 220, which operates under the control of the wake up criteria manager and also receives training inputs from a user of the device. The trainable pattern recognition engine 220 may for example take the form of a neural network, though other pattern detectors are available.
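Purely as a stand-in for the trainable engine, the sketch below shows a minimal online-trained detector over a fixed-length metadata vector; the logistic scoring, learning rate and threshold are assumptions, and a real implementation might use a small neural network as suggested above.

```python
import numpy as np

class TrainablePatternDetector:
    """Tiny online-trained detector over a fixed-length metadata vector.
    User-provided labels (e.g. 'I am on a train right now') drive the updates."""

    def __init__(self, n_features, learning_rate=0.05):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.lr = learning_rate

    def score(self, features):
        return 1.0 / (1.0 + np.exp(-(np.asarray(features, float) @ self.w + self.b)))

    def train(self, features, label):
        """label is 1 when the user indicates the situation applies, else 0."""
        error = label - self.score(features)
        self.w += self.lr * error * np.asarray(features, dtype=float)
        self.b += self.lr * error

    def detect(self, features, threshold=0.8):
        return self.score(features) > threshold
```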
Particular applications 210 might be context specific, and might therefore need to operate only in particular circumstances. However, it might be difficult to define which sensor inputs are associated with those circumstances. By suitable training, though, the pattern recognition engine can detect combinations of sensor inputs that are associated with the particular circumstances. Then, when the particular circumstances arise, the application can record the fact, or can take specific action, such as waking up particular functionality in the device, or controlling the operation of functionality in the device.
For example, when the user of the device is travelling by train, the audio detected by the microphone 100 will have specific patterns, and the movement detected by any accelerometers or the like will also have specific patterns, that might well be different from the patterns detected when travelling by car. However, it will be very difficult to establish in advance exactly what these patterns will be. Therefore, the user can be asked to provide an input (e.g. by pressing a key on the device or the like) when travelling by train. Over time, the pattern recognition engine 220 will recognise combinations of inputs that are associated with train travel and will become able to recognise this situation.
Similarly, when the user of the device is speaking with another person, that other person's speech will have specific characteristics, but it will be very difficult to define those characteristics. However, if the user, prompted by a suitable application 210, provides a particular input whenever that other person is speaking, the pattern recognition engine 220 will eventually recognise combinations of inputs that are associated with that other person speaking, and will become able to recognise this.
This will allow the application to initiate a specific reaction in response to such a determination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim, "a" or "an" does not exclude a plurality, and a single feature or other unit may fulfil the functions of several units recited in the claims. The terms "trigger phrase" and "trigger word" are interchangeable throughout the description. Any reference signs in the claims shall not be construed so as to limit their scope.
Claims (18)
- CLAIMS
- 1. A sensor input recognition system, comprising: at least one sensor for generating an input signal; a plurality of analysers, for generating respective analysis signals from the at least one input signal; a buffer memory, for storing the analysis signals; and a controllable pattern detector, for detecting a selectable pattern in the analysis signals.
- 2. A sensor input recognition system as claimed in claim 1, adapted to generate a wake up signal in response to detection of the selectable pattern in the analysis signals.
- 3. A sensor input recognition system as claimed in claim 2, comprising an interface for outputting the generated wake up signal.
- 4. A sensor input recognition system as claimed in claim 2, adapted to wake up a part of the system in response to the generated wake up signal.
- 5. A sensor input recognition system as claimed in one of claims 1 to 4, comprising an interface for receiving inputs from at least one application program, and for controlling the selectable pattern to be detected by the controllable pattern detector in response to the inputs.
- 6. A sensor input recognition system as claimed in one of claims 1 to 5, wherein the at least one sensor includes a microphone.
- 7. A sensor input recognition system as claimed in claim 6, wherein the plurality of analysers include an activity detector.
- 8. A sensor input recognition system as claimed in claim 6 or 7, wherein the plurality of analysers include a keyword detector.
- 9. A sensor input recognition system as claimed in one of claims 6 to 8, wherein the plurality of analysers include a frequency analyser.
- 10. A sensor input recognition system as claimed in one of claims 1 to 9, wherein the at least one sensor includes at least one sensor selected from the group comprising: accelerometers, magnetometers, gyroscopes, light sensors, invisible electromagnetic radiation sensors, proximity detectors, pressure sensors, touch sensors, haptic sensors, fingerprint readers, medical sensors, and environmental sensors.
- 11. A sensor input recognition system as claimed in one of claims 1 to 10, adapted to output selected stored analysis signals.
- 12. A sensor input recognition system as claimed in one of claims 1 to 11, wherein the controllable pattern detector is trainable.
- 13. A method of speech recognition, comprising: receiving and storing a signal; determining whether the stored signal represents speech containing a trigger phrase, if so, performing a validation to determine whether the trigger phrase was spoken by a registered user, and, if so, performing a speech recognition operation, the method comprising: detecting an ambient noise level and: in first ambient noise conditions: performing the validation on the stored signal and, if it is determined that the trigger phrase was spoken by a registered user, applying noise reduction to the received signal and performing the speech recognition operation on the noise-reduced received signal; and in second ambient noise conditions: applying noise reduction to the received signal and performing the validation on the noise-reduced received signal and, if it is determined that the trigger phrase was spoken by a registered user, performing the speech recognition operation on the noise-reduced received signal.
- 14. A method as claimed in claim 13, wherein the second ambient noise conditions represent a higher ambient noise level than the first ambient noise conditions.
- 15. A method as claimed in claim 13 or 14, wherein the step of detecting the ambient noise level comprises detecting a noise level during a period when the detected noise does not contain speech.
- 16. A method as claimed in claim 13 or 14, wherein the step of detecting the ambient noise level comprises performing speech detection on detected sounds; determining a noise level during a period when speech is not detected in the detected sounds; and determining whether the noise level during the period when speech is not detected in the detected sounds exceeds a threshold level.
- 17. A speech recognition system, for performing a method as claimed in any of claims 13 to 16.
- 18. A speech recognition system as claimed in claim 17, comprising a buffer for storing the received signal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1515868.6A GB2526980B (en) | 2013-07-10 | 2013-07-10 | Sensor input recognition |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1312395.5A GB2516075B (en) | 2013-07-10 | 2013-07-10 | Sensor input recognition |
GB1515868.6A GB2526980B (en) | 2013-07-10 | 2013-07-10 | Sensor input recognition |
Publications (3)
Publication Number | Publication Date |
---|---|
GB201515868D0 GB201515868D0 (en) | 2015-10-21 |
GB2526980A true GB2526980A (en) | 2015-12-09 |
GB2526980B GB2526980B (en) | 2017-04-12 |
Family
ID=49033631
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1515868.6A Expired - Fee Related GB2526980B (en) | 2013-07-10 | 2013-07-10 | Sensor input recognition |
GB1312395.5A Expired - Fee Related GB2516075B (en) | 2013-07-10 | 2013-07-10 | Sensor input recognition |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1312395.5A Expired - Fee Related GB2516075B (en) | 2013-07-10 | 2013-07-10 | Sensor input recognition |
Country Status (1)
Country | Link |
---|---|
GB (2) | GB2526980B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017143299A1 (en) * | 2016-02-19 | 2017-08-24 | Invensense, Inc | Adaptive buffering |
CN107767861A (en) * | 2016-08-22 | 2018-03-06 | 科大讯飞股份有限公司 | voice awakening method, system and intelligent terminal |
US10169266B2 (en) | 2016-02-19 | 2019-01-01 | Invensense, Inc. | Adaptive buffering of data received from a sensor |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018090252A1 (en) * | 2016-11-16 | 2018-05-24 | 深圳达闼科技控股有限公司 | Voice instruction recognition method for robot, and related robot device |
US20220020387A1 (en) * | 2020-07-17 | 2022-01-20 | Apple Inc. | Interrupt for noise-cancelling audio devices |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013049358A1 (en) * | 2011-09-30 | 2013-04-04 | Google Inc. | Systems and methods for continual speech recognition and detection in mobile computing devices |
US20130289994A1 (en) * | 2012-04-26 | 2013-10-31 | Michael Jack Newman | Embedded system for construction of small footprint speech recognition with user-definable constraints |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8775179B2 (en) * | 2010-05-06 | 2014-07-08 | Senam Consulting, Inc. | Speech-based speaker recognition systems and methods |
US8818810B2 (en) * | 2011-12-29 | 2014-08-26 | Robert Bosch Gmbh | Speaker verification in a health monitoring system |
-
2013
- 2013-07-10 GB GB1515868.6A patent/GB2526980B/en not_active Expired - Fee Related
- 2013-07-10 GB GB1312395.5A patent/GB2516075B/en not_active Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013049358A1 (en) * | 2011-09-30 | 2013-04-04 | Google Inc. | Systems and methods for continual speech recognition and detection in mobile computing devices |
US20130289994A1 (en) * | 2012-04-26 | 2013-10-31 | Michael Jack Newman | Embedded system for construction of small footprint speech recognition with user-definable constraints |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2017143299A1 (en) * | 2016-02-19 | 2017-08-24 | Invensense, Inc | Adaptive buffering |
US10133690B2 (en) | 2016-02-19 | 2018-11-20 | Invensense, Inc. | Adaptive buffering of data received from a sensor |
US10169266B2 (en) | 2016-02-19 | 2019-01-01 | Invensense, Inc. | Adaptive buffering of data received from a sensor |
US10628346B2 (en) | 2016-02-19 | 2020-04-21 | Invensense, Inc. | Adaptive buffering of data received from a sensor |
CN107767861A (en) * | 2016-08-22 | 2018-03-06 | 科大讯飞股份有限公司 | voice awakening method, system and intelligent terminal |
Also Published As
Publication number | Publication date |
---|---|
GB2516075A (en) | 2015-01-14 |
GB201312395D0 (en) | 2013-08-21 |
GB2516075B (en) | 2018-08-22 |
GB2526980B (en) | 2017-04-12 |
GB201515868D0 (en) | 2015-10-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114080589B (en) | Automatic Active Noise Reduction (ANR) control to improve user interaction | |
US10586534B1 (en) | Voice-controlled device control using acoustic echo cancellation statistics | |
CN104252860B (en) | Speech recognition | |
CN107105367B (en) | Audio signal processing method and terminal | |
CN109346075A (en) | Identify user speech with the method and system of controlling electronic devices by human body vibration | |
KR102565882B1 (en) | the Sound Outputting Device including a plurality of microphones and the Method for processing sound signal using the plurality of microphones | |
US11605372B2 (en) | Time-based frequency tuning of analog-to-information feature extraction | |
US11437021B2 (en) | Processing audio signals | |
CN111210021A (en) | Audio signal processing method, model training method and related device | |
CN113949956B (en) | Noise reduction processing method and device, electronic equipment, earphone and storage medium | |
CN109429132A (en) | Earphone system | |
CN110364156A (en) | Voice interactive method, system, terminal and readable storage medium storing program for executing | |
US10783903B2 (en) | Sound collection apparatus, sound collection method, recording medium recording sound collection program, and dictation method | |
GB2526980A (en) | Sensor input recognition | |
TWI692253B (en) | Controlling headset method and headset | |
WO2019228329A1 (en) | Personal hearing device, external sound processing device, and related computer program product | |
US11488606B2 (en) | Audio system with digital microphone | |
KR102652553B1 (en) | Electronic device and method for detecting block of microphone | |
US11705109B2 (en) | Detection of live speech | |
WO2015131634A1 (en) | Audio noise reduction method and terminal | |
GB2553040A (en) | Sensor input recognition | |
WO2023136835A1 (en) | Method, apparatus and system for neural network hearing aid | |
US12149881B2 (en) | Automatic active noise reduction (ANR) control to improve user interaction | |
KR20160102713A (en) | Output Audio Size Automatic Adjustment Apparatus According to the Nosie using the DSP Codec Built | |
CN116320872A (en) | Earphone mode switching method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PCNP | Patent ceased through non-payment of renewal fee |
Effective date: 20230710 |