
CN108735209B - Wake-up word binding method, intelligent device and storage medium - Google Patents


Info

Publication number
CN108735209B
CN108735209B
Authority
CN
China
Prior art keywords: word, awakening, information, voiceprint, user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810407844.6A
Other languages
Chinese (zh)
Other versions
CN108735209A (en)
Inventor
何瑞澄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Midea Group Co Ltd
GD Midea Air Conditioning Equipment Co Ltd
Original Assignee
Midea Group Co Ltd
GD Midea Air Conditioning Equipment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Midea Group Co Ltd and GD Midea Air Conditioning Equipment Co Ltd
Priority to CN201810407844.6A
Publication of CN108735209A
Application granted
Publication of CN108735209B
Legal status: Active


Classifications

    All under G (PHYSICS) > G10 (MUSICAL INSTRUMENTS; ACOUSTICS) > G10L (SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING):
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 Speech classification or search
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 2015/025 Phonemes, fenemes or fenones being the recognition units
    • G10L 2015/027 Syllables being the recognition units
    • G10L 2015/088 Word spotting
    • G10L 2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)
  • Telephone Function (AREA)

Abstract

The invention discloses a wake-up word binding method, which comprises the following steps: step S1, collecting a voice signal uttered by a user; step S2, extracting wake-up word information and user information from the voice signal; and step S3, binding the user information and the wake-up word information with the user. The invention also provides an intelligent device and a storage medium. The invention does not require recording a large amount of voice, reduces operation, is convenient to use, and improves the degree of intelligence.

Description

Wake-up word binding method, intelligent device and storage medium
Technical Field
The invention relates to the technical field of voice recognition, and in particular to a wake-up word binding method, an intelligent device, and a storage medium.
Background
Speech recognition technology is a high technology that lets a machine convert speech signals into corresponding text or commands through recognition and understanding, that is, it lets the machine understand human speech. Also known as Automatic Speech Recognition (ASR), its goal is to convert the lexical content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences. In recent years, speech recognition has entered fields such as home appliances, communications, electronic products, and home services to provide near-field or far-field control of home appliances or electronic products, and wake-up word binding technology is a prerequisite for such near-field or far-field control.
The main technique for wake-up word binding is software wake-up. However, software operation presupposes that the system has started; to ensure that the user's voice commands can be received anytime and anywhere, a speech recognition engine must run and listen in the background at all times, so the system cannot enter a dormant standby power-saving state, and power consumption is high. To reduce system power consumption, low-power voice wake-up techniques have been developed, in which a large amount of voice data is recorded and trained into a fixed wake-up word, so that the system is woken when the wake-up word is recognized in the user's voice command.
However, the inventors have found that the above techniques have at least the following technical problem:
user-defined wake-up words require recording a very large amount of voice data, and the method is complex to operate, inconvenient to use, and poor in degree of intelligence.
Disclosure of Invention
An embodiment of the invention provides a wake-up word binding method, which solves the technical problems that existing user-defined wake-up words require recording a very large amount of voice data and are complex to operate, inconvenient to use, and poor in degree of intelligence.
An embodiment of the invention provides a wake-up word binding method, which comprises the following steps:
step S1, collecting a voice signal uttered by a user;
step S2, extracting wake-up word information and user information from the voice signal;
and step S3, binding the user information and the wake-up word information with the user.
Optionally, step S3 includes:
step S31, obtaining a wake-up word model registered by the user with the voice recognition system, and binding the user information and the wake-up word with the wake-up word model.
Optionally, when the user information is voiceprint information, step S31 includes:
step S311, collecting wake-up word voice signals input by the user multiple times;
step S312, acquiring the rhythm features, tone features, and phoneme features in each input wake-up word voice signal;
step S313, performing acoustic feature processing on the rhythm and tone features acquired each time, and registering the processed rhythm and tone feature information as voiceprint data of the user;
step S314, ordering and combining the phoneme features acquired each time based on a preset acoustic model to obtain the wake-up word model;
and step S315, storing the voiceprint data and the wake-up word in association with the wake-up word model.
Optionally, step S2 includes:
step S21, when a voice signal is received, judging whether the volume value of the voice signal is greater than a preset volume value;
and step S22, if so, acquiring the wake-up word information in the voice signal based on an acoustic model and grammatical structure, and acquiring the voiceprint information in the voice signal based on voiceprint recognition.
Optionally, after step S3, the method further includes:
step S4, receiving a wake-up voice signal, and extracting the wake-up word in the wake-up voice signal;
and step S5, when the wake-up word matches a preset wake-up word in the voice recognition system, responding to the wake-up word voice signal.
Optionally, after step S4, the method further includes:
step S6, adjusting the recognition threshold of the preset wake-up word in the voice recognition system;
and step S7, when the wake-up word matches the adjusted preset wake-up word, responding to the wake-up word voice signal.
Optionally, the user information is voiceprint information, and step S6 includes:
step S61, extracting the voiceprint information in the wake-up word voice signal;
step S62, when no voiceprint data matching the voiceprint information exists in the voice recognition system, raising the wake-up word recognition threshold of the voice recognition system;
and step S63, when voiceprint data matching the voiceprint information exists in the voice recognition system, lowering the wake-up word recognition threshold of the voice recognition system.
Optionally, after step S61, the method further includes:
step S64, calculating the similarity between the voiceprint information and the voiceprint data registered in the voice recognition system according to a preset voiceprint model;
step S65, when the similarity is within a preset range, determining that voiceprint data matching the voiceprint information exists in the voice recognition system;
and step S66, when the similarity is outside the preset range, determining that no voiceprint data matching the voiceprint information exists in the voice recognition system.
The invention further provides a storage medium storing a wake-up word binding program which, when executed by a processor, implements the steps of the wake-up word binding method described above.
According to the invention, the wake-up word is bound with the user by acquiring the wake-up word information in the received voice signal, instead of blindly recording a large amount of voice. After the wake-up word is recorded it is bound with the user information, so the user and the wake-up word can be identified directly during subsequent recognition. This improves recognition accuracy, removes the need to record a large amount of voice, reduces operation, is convenient to use, and improves the degree of intelligence.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a schematic structural diagram of a hardware operating environment related to a smart device according to the present invention;
FIG. 2 is a flowchart illustrating a method for binding wake words according to a first embodiment of the present invention;
FIG. 3 is a schematic flow chart of acquiring a wake-up word model registered by the user with the voice recognition system and binding the user information and the wake-up word with the wake-up word model according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a detailed process of step S20 according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating a wake-up word binding method according to a second embodiment of the present invention;
FIG. 6 is a flowchart illustrating a method for binding wake words according to a third embodiment of the present invention;
FIG. 7 is a flow chart illustrating adjusting recognition threshold according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating a process of determining voiceprint information according to an embodiment of the invention;
FIG. 9 is a flowchart illustrating a detailed process of step S203 according to an embodiment of the present invention;
FIG. 10 is a flowchart illustrating a detailed process of step S70 according to an embodiment of the present invention.
The reference numerals illustrate:
Reference numeral   Name                  Reference numeral   Name
100                 Intelligent device    101                 Radio frequency unit
102                 WiFi module           103                 Audio output unit
104                 A/V input unit        1041                Graphics processor
1042                Microphone            105                 Sensor
106                 Display unit          1061                Display interface
107                 User input unit       1071                Control interface
1072                Other input devices   108                 Interface unit
109                 Memory                110                 Processor
111                 Power supply
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the following description, suffixes such as "module", "component", or "unit" are used to denote elements only to facilitate the description of the invention and have no specific meaning in themselves. Thus, "module", "component", and "unit" may be used interchangeably.
Smart devices may be implemented in various forms. For example, the smart device described in the present invention may be implemented by a mobile terminal having a display interface, such as a mobile phone, a tablet computer, a notebook computer, a palm top computer, a Personal Digital Assistant (PDA), a Portable Media Player (PMP), a navigation device, a wearable device, a smart band, a pedometer, a smart speaker, or the like, or may be implemented by a fixed terminal having a display interface, such as a Digital TV, a desktop computer, an air conditioner, a refrigerator, a water heater, a dust collector, or the like.
While the following description will be given by way of example of a smart device, it will be appreciated by those skilled in the art that the configuration according to the embodiment of the present invention can be applied to a fixed type smart device, in addition to elements particularly used for mobile purposes.
Referring to fig. 1, which is a schematic diagram of a hardware structure of an intelligent device for implementing various embodiments of the present invention, the intelligent device 100 may include: RF (Radio Frequency) unit 101, WiFi module 102, audio output unit 103, a/V (audio/video) input unit 104, sensor 105, display area 106, user input unit 107, interface unit 108, memory 109, processor 110, and power supply 111. Those skilled in the art will appreciate that the smart device architecture shown in FIG. 1 does not constitute a limitation of a smart device, which may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
The following describes each component of the smart device in detail with reference to fig. 1:
the radio frequency unit 101 may be configured to receive and transmit signals during information transmission and reception or during a call, and specifically, receive downlink information of a base station and then process the downlink information to the processor 110; in addition, the uplink data is transmitted to the base station. Typically, radio frequency unit 101 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 101 can also communicate with a network and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to GSM (Global System for Mobile communications), GPRS (General Packet Radio Service), CDMA2000(Code Division Multiple Access2000 ), WCDMA (Wideband Code Division Multiple Access), TD-SCDMA (Time Division-Synchronous Code Division Multiple Access), FDD-LTE (Frequency Division duplex Long Term Evolution), and TDD-LTE (Time Division duplex Long Term Evolution).
WiFi is a short-range wireless transmission technology. Through the WiFi module 102, the intelligent device can help the user send and receive e-mail, browse web pages, access streaming media, and the like, providing wireless broadband Internet access. Although fig. 1 shows the WiFi module 102, it is not an essential part of the intelligent device and may be omitted as needed without changing the essence of the invention. For example, in this embodiment, the intelligent device 100 may establish a synchronization association with an App terminal based on the WiFi module 102.
The audio output unit 103 may convert audio data received by the radio frequency unit 101 or the WiFi module 102 or stored in the memory 109 into an audio signal and output as sound when the smart device 100 is in a call signal reception mode, a call mode, a recording mode, a voice recognition mode, a broadcast reception mode, or the like. Also, the audio output unit 103 may also provide audio output related to a specific function performed by the smart device 100 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 103 may include a speaker, a buzzer, and the like. As in the present embodiment, when a prompt to re-input a voice signal is output, the prompt may be a voice prompt, a vibration prompt based on a buzzer, or the like.
The A/V input unit 104 is used to receive audio or video signals. The A/V input unit 104 may include a graphics processor (GPU) 1041 and a microphone 1042. The graphics processor 1041 processes image data of still pictures or video obtained by an image capture device (e.g., a camera) in video capture mode or image capture mode, and the processed image frames may be displayed on the display area 106. The image frames processed by the graphics processor 1041 may be stored in the memory 109 (or another storage medium) or transmitted via the radio frequency unit 101 or the WiFi module 102. The microphone 1042 may receive sound (audio data) in phone call mode, recording mode, voice recognition mode, and the like, and can process such sound into audio data. In phone call mode, the processed audio (voice) data may be converted into a format transmittable to a mobile communication base station via the radio frequency unit 101. The microphone 1042 may implement various noise cancellation (or suppression) algorithms to cancel (or suppress) noise or interference generated while receiving and transmitting audio signals.
The smart device 100 also includes at least one sensor 105, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor includes an ambient light sensor that can adjust the brightness of the display interface 1061 according to the brightness of ambient light, and a proximity sensor that can turn off the display interface 1061 and/or backlight when the smart device 100 is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
The display area 106 is used to display information input by the user or information provided to the user. The Display area 106 may include a Display interface 1061, and the Display interface 1061 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 107 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the smart device. In particular, the user input unit 107 may include a manipulation interface 1071 and other input devices 1072. The control interface 1071, also referred to as a touch screen, may collect touch operations by a user (e.g., operations by a user on or near the control interface 1071 using a finger, a stylus, or any other suitable object or attachment) and drive the corresponding connection device according to a predetermined program. The manipulation interface 1071 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 110, and can receive and execute commands sent by the processor 110. In addition, the manipulation interface 1071 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the manipulation interface 1071, the user input unit 107 may include other input devices 1072. In particular, other input devices 1072 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like, and are not limited to these specific examples.
Further, the manipulation interface 1071 may overlay the display interface 1061, and when the manipulation interface 1071 detects a touch operation thereon or nearby, transmit to the processor 110 to determine the type of the touch event, and then the processor 110 provides a corresponding visual output on the display interface 1061 according to the type of the touch event. Although in fig. 1, the control interface 1071 and the display interface 1061 are two separate components to implement the input and output functions of the smart device, in some embodiments, the control interface 1071 and the display interface 1061 may be integrated to implement the input and output functions of the smart device, which is not limited herein.
The interface unit 108 serves as an interface through which at least one external device is connected to the smart device 100. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 108 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the smart device 100 or may be used to transmit data between the smart device 100 and the external device.
The memory 109 may be used to store software programs as well as various data. The memory 109 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a voice recognition system) required for at least one function, and the like; the storage data area may store data created according to the use of the smart device (such as voiceprint data, a wakeup word model, user information, etc.), and the like. Further, the memory 109 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 110 is a control center of the smart device, connects various parts of the entire smart device using various interfaces and lines, and performs various functions of the smart device and processes data by operating or executing software programs and/or modules stored in the memory 109 and calling data stored in the memory 109, thereby performing overall monitoring of the smart device. Processor 110 may include one or more processing units; preferably, the processor 110 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 110.
The smart device 100 may further include a power source 111 (such as a battery) for supplying power to various components, and preferably, the power source 111 may be logically connected to the processor 110 through a power management system, so as to implement functions of managing charging, discharging, and power consumption through the power management system.
Although not shown in fig. 1, the smart device 100 may further include a bluetooth module and the like capable of establishing a communication connection with other terminals, which will not be described herein.
Based on the above hardware structure, the intelligent device provided by the embodiment of the invention carries a voice recognition system; the wake-up word is bound with the user by acquiring the wake-up word information in the received voice signal, instead of blindly recording a large amount of voice, and after the wake-up word is recorded it is bound with the user information.
As shown in fig. 1, the memory 109, as a computer storage medium, may include an operating system and a wake-up word binding program.
In the intelligent device 100 shown in fig. 1, the WiFi module 102 is mainly used to connect to a background server or big-data cloud, perform data communication with it, and establish communication connections with other terminal devices; the processor 110 may be configured to call the wake-up word binding program stored in the memory 109 and perform the following operations:
step S1, collecting a voice signal uttered by a user;
step S2, extracting wake-up word information and user information from the voice signal;
and step S3, binding the user information and the wake-up word information with the user.
Optionally, step S3 includes:
step S31, obtaining a wake-up word model registered by the user with the voice recognition system, and binding the user information and the wake-up word with the wake-up word model.
Further, when the user information is voiceprint information, the processor 110 may be configured to call the wake-up word binding program stored in the memory 109 and perform the following operations:
step S311, collecting wake-up word voice signals input by the user multiple times;
step S312, acquiring the rhythm features, tone features, and phoneme features in each input wake-up word voice signal;
step S313, performing acoustic feature processing on the rhythm and tone features acquired each time, and registering the processed rhythm and tone feature information as voiceprint data of the user;
step S314, ordering and combining the phoneme features acquired each time based on a preset acoustic model to obtain the wake-up word model;
and step S315, storing the voiceprint data and the wake-up word in association with the wake-up word model.
Further, the processor 110 may be configured to call the wake-up word binding program stored in the memory 109 and perform the following operations:
step S21, when a voice signal is received, judging whether the volume value of the voice signal is greater than a preset volume value;
and step S22, if so, acquiring the wake-up word information in the voice signal based on an acoustic model and grammatical structure, and acquiring the voiceprint information in the voice signal based on voiceprint recognition.
Further, after step S3, the processor 110 may be configured to call the wake-up word binding program stored in the memory 109 and perform the following operations:
step S4, receiving a wake-up voice signal, and extracting the wake-up word in the wake-up voice signal;
and step S5, when the wake-up word matches a preset wake-up word in the voice recognition system, responding to the wake-up word voice signal.
Further, after step S4, the processor 110 may be configured to call the wake-up word binding program stored in the memory 109 and perform the following operations:
step S6, adjusting the recognition threshold of the preset wake-up word in the voice recognition system;
and step S7, when the wake-up word matches the adjusted preset wake-up word, responding to the wake-up word voice signal.
Further, the user information is voiceprint information, and the processor 110 may be configured to call the wake-up word binding program stored in the memory 109 and perform the following operations:
step S61, extracting the voiceprint information in the wake-up word voice signal;
step S62, when no voiceprint data matching the voiceprint information exists in the voice recognition system, raising the wake-up word recognition threshold of the voice recognition system;
and step S63, when voiceprint data matching the voiceprint information exists in the voice recognition system, lowering the wake-up word recognition threshold of the voice recognition system.
Further, after step S61, the processor 110 may be configured to call the wake-up word binding program stored in the memory 109 and perform the following operations:
step S64, calculating the similarity between the voiceprint information and the voiceprint data registered in the voice recognition system according to a preset voiceprint model;
step S65, when the similarity is within a preset range, determining that voiceprint data matching the voiceprint information exists in the voice recognition system;
and step S66, when the similarity is outside the preset range, determining that no voiceprint data matching the voiceprint information exists in the voice recognition system.
The invention further provides a wake-up word binding method, applied to waking a voice recognition system or an intelligent device loaded with a voice recognition system.
Referring to fig. 2, fig. 2 is a flowchart illustrating a method for binding a wakeup word according to a first embodiment of the present invention.
In this embodiment, the method for binding the wake word includes the following steps:
s10: collecting voice signals sent by a user;
in this embodiment, when the user wakes up the speech recognition system with the self-defined wake-up word for the first time or needs to input the wake-up word of the user, in order to avoid wake-up failure and improve the wake-up rate, the user-defined wake-up word model needs to be trained, so as to respond when receiving the user input including the wake-up word corresponding to the wake-up word model. The method comprises the steps that a user sends out voice signals, the voice signals sent out by the user are collected, the voice signals can comprise an air conditioner, a dehumidifier, a fan and the like, and can also comprise a power-on state, a temperature increasing state, a first-gear wind speed increasing state and the like, and information serving as a wake-up word is set in advance.
S20, extracting awakening word information and user information in the voice signal;
after a voice signal input by a user is acquired, awakening word information and user information in the voice signal are extracted; the user information may be user identity information, user voiceprint data, and the like, which may be used to identify the user. And the awakening words and the user information are extracted, the voice signals are converted into text information through conversion, and the awakening words and the sentences carrying the user information are extracted from the text information.
S30, binding the user information and the awakening word information with the user.
Specifically, a user-defined awakening word sound signal is collected, for example, a user can input a voice signal of an 'air conditioner' for multiple times, and after the voice signal of the 'air conditioner' is picked up by the intelligent device based on a microphone or an audio sensor, an awakening word model of the user registered to a voice recognition system is obtained, and the user information and the awakening word are bound with the awakening word model.
In order to facilitate more accurate adjustment of the awakening word identification threshold value according to the identified voiceprint data in the follow-up process, after the voiceprint data of the registered user and the registered awakening word model are obtained, the voiceprint data and the awakening word model are further associated, and an association relation between the voiceprint data and the awakening word model is established.
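As a minimal illustration of this flow, the sketch below strings steps S10-S30 together in Python; the helper names and data shapes are assumptions for illustration, since the patent does not specify an implementation.

```python
# Minimal sketch of steps S10-S30 with stubbed capture and extraction;
# helper names and the registry layout are assumptions, not the patent's API.

def collect_voice_signal():
    # S10: stand-in for microphone capture of the user's voice signal
    return {"text": "air conditioner", "voiceprint": [0.12, 0.55, 0.31]}

def extract_info(signal):
    # S20: extract wake-up word information and user (voiceprint) information
    return signal["text"], signal["voiceprint"]

def bind(registry, user_id, wake_word, voiceprint):
    # S30: bind the user information and wake-up word information with the user
    registry[user_id] = {"wake_word": wake_word, "voiceprint": voiceprint}

registry = {}
word, vp = extract_info(collect_voice_signal())
bind(registry, "user-001", word, vp)
print(registry["user-001"]["wake_word"])   # air conditioner
```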
According to this embodiment, the wake-up word is bound with the user by acquiring the wake-up word information in the received voice signal, instead of blindly recording a large amount of voice. After the wake-up word is recorded it is bound with the user information, so the user and the wake-up word can be identified directly during subsequent recognition. This improves recognition accuracy, removes the need to record a large amount of voice, reduces operation, is convenient to use, and improves the degree of intelligence.
Further, referring to fig. 3, based on the wake-up word binding method of the foregoing embodiment, the step of obtaining the wake-up word model registered by the user with the voice recognition system and binding the user information and the wake-up word with the wake-up word model includes:
S100: collecting wake-up word voice signals input by the user multiple times;
In this embodiment, the user information is described taking user voiceprint data as an example. To improve the binding accuracy of the wake-up word, in the sampling stage the method may collect the wake-up word voice signal input by the user multiple times, and then derive an optimal wake-up word model and voiceprint data from the signals collected over those inputs.
S200: acquiring the rhythm features, tone features, and phoneme features in each input wake-up word voice signal;
When obtaining the user's voiceprint data and the wake-up word model registered with the voice recognition system from the wake-up word voice signals collected multiple times, specifically: after each wake-up word voice signal input by the same user is converted into a digital voice signal, the rhythm and tone features in the signal are acquired based on voiceprint recognition, and the phoneme features in the signal are obtained based on an acoustic model and grammatical structure, for example by locating the start and end points of the signal's segments (such as phonemes, syllables, and morphemes) through end-point detection and excluding the unvoiced segments from the signal.
S300: performing acoustic feature processing on the rhythm and tone features acquired each time, and registering the processed rhythm and tone feature information as voiceprint data of the user;
After the wake-up word voice signal input the first time is obtained, rhythm feature 1 and tone feature 1 are obtained based on voiceprint recognition; then rhythm feature 2 and tone feature 2 in the signal input the second time are obtained. When the difference between them is large, rhythm feature 1 is optimized using rhythm feature 2 and tone feature 1 is optimized using tone feature 2, and so on, until the differences between the newly acquired rhythm feature n and tone feature n and the current rhythm feature n-1 and tone feature n-1 fall within a preset range; the current rhythm and tone features, after acoustic feature processing, are then registered as the user's voiceprint data.
S400: ordering and combining the phoneme features acquired each time based on a preset acoustic model to obtain the wake-up word model;
Similarly, after the wake-up word voice signal input the first time is obtained, phoneme features 1 are derived based on the acoustic model and grammatical structure; then phoneme features 2 in the signal input the second time are obtained, together with the position of each shared phoneme in the arrangement. When the first and second inputs differ, phoneme features 3 in the signal input the third time are obtained, and so on, until the position of every phoneme in the model's preset phoneme arrangement is determined, yielding the wake-up word model.
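As a sketch of this step, the code below fixes each phoneme's position by majority vote across the repeated inputs; position-wise voting is an assumption, since the patent only says positions are resolved over successive inputs.

```python
# Sketch of S400: resolve each position of the wake-word model by majority
# vote over repeated inputs (the voting rule itself is an assumption).
from collections import Counter

def build_wake_word_model(phoneme_sequences):
    model = []
    for position in zip(*phoneme_sequences):     # phonemes at the same position
        model.append(Counter(position).most_common(1)[0][0])
    return model

inputs = [
    ["k", "o", "n", "g", "t", "i", "a", "o"],    # "kongtiao" (air conditioner)
    ["k", "o", "n", "g", "t", "i", "a", "o"],
    ["k", "o", "n", "k", "t", "i", "a", "o"],    # one noisy sample
]
print(build_wake_word_model(inputs))             # the noisy phoneme is outvoted
```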
S500: storing the voiceprint data and the wake-up word in association with the wake-up word model.
After the user's voiceprint data and the registered wake-up word model are obtained, the voiceprint data, the wake-up word, and the wake-up word model are stored in the voice recognition system in association with the user's information, such as a user account or user number, so that during a subsequent wake-up the wake-up word model corresponding to the user can be determined from the recognized voiceprint data and then recognized. Associating the voiceprint data and the wake-up word with the wake-up word model makes recognition and wake-up via voiceprint data more accurate.
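A minimal sketch of this associated storage, assuming a plain in-memory mapping keyed by user account (the patent does not prescribe a storage layout):

```python
# Sketch of S500: store voiceprint data and the wake-up word in association
# with the wake-word model, keyed by user information such as an account.
speech_recognition_system = {}

def register(account, voiceprint, wake_word, wake_word_model):
    speech_recognition_system[account] = {
        "voiceprint": voiceprint,
        "wake_word": wake_word,
        "model": wake_word_model,
    }

register("account-42", [0.12, 0.55], "air conditioner",
         ["k", "o", "n", "g", "t", "i", "a", "o"])
print(speech_recognition_system["account-42"]["model"])
```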
Further, referring to fig. 4, based on the method for binding a wake word in the foregoing embodiment, step S20 includes:
S20a: when a voice signal is received, judging whether the volume value of the voice signal is greater than a preset volume value;
In this embodiment, since a voiceprint is a sound-wave spectrum carrying speech information, the voiceprint itself is closely related to amplitude, frequency, pitch contour, formant bandwidth, and the like. Because sound attenuates over the transmission distance, the volume of the received voice signal falls as the distance grows, and the amplitude varies with the volume value, so the voiceprint is related to the volume of the received signal. In addition, the speech recognition engine of the voice recognition system only recognizes speech whose volume reaches a preset threshold. Therefore, to improve the accuracy of voiceprint recognition and speech recognition, it is necessary to judge whether the volume of the received voice signal is greater than a preset volume value, namely the minimum volume required for voiceprint recognition and speech recognition.
S20b: if so, acquiring the wake-up word information in the voice signal based on an acoustic model and grammatical structure, and acquiring the voiceprint information in the voice signal based on voiceprint recognition.
When the volume of the received voice signal is greater than the preset volume value, the signal is judged valid, and voiceprint recognition and acoustic-model analysis can proceed: for example, the silent sections among segments such as phonemes, syllables, and morphemes are excluded based on end-point detection, the voiceprint information is then obtained from the syllable features of the signal, and the wake-up word information is obtained from the morpheme and phoneme features together with the acoustic model and grammatical structure.
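The volume gate itself is simple; the sketch below uses a peak-amplitude proxy for volume and an assumed threshold (neither the proxy nor the value 0.2 comes from the patent).

```python
# Sketch of S20a/S20b: only extract features when the signal volume exceeds
# the preset minimum needed for voiceprint and speech recognition.
PRESET_VOLUME = 0.2          # assumed scale of 0.0-1.0

def volume_of(samples):
    return max(abs(s) for s in samples)     # simple peak-volume proxy

def handle_signal(samples):
    if volume_of(samples) <= PRESET_VOLUME:
        return None                          # too quiet: not a valid signal
    # S20b would run here: acoustic model + grammar for the wake-up word,
    # voiceprint recognition for the user information.
    return "extract wake-up word and voiceprint"

print(handle_signal([0.01, -0.05, 0.03]))    # None
print(handle_signal([0.01, -0.45, 0.30]))    # proceeds to extraction
```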
Further, referring to fig. 5, after step S30, the method for binding a wake word according to the foregoing embodiment further includes:
S40: receiving a wake-up voice signal, and extracting the wake-up word in the wake-up voice signal;
S50: when the wake-up word matches a preset wake-up word in the voice recognition system, responding to the wake-up word voice signal.
Once the user has a bound wake-up word, a received wake-up word voice signal triggers the wake-up operation: the wake-up word is extracted from the wake-up voice signal, and when it matches the preset wake-up word stored for that user in the voice recognition system, the response operation is executed, achieving accurate wake-up.
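A one-line matching rule is enough to sketch this check (the stored-preset lookup is assumed):

```python
# Sketch of S40/S50: respond only when the extracted wake-up word matches a
# preset wake-up word stored for the user in the voice recognition system.
def on_wake_signal(extracted_word, preset_words):
    return "respond" if extracted_word in preset_words else "ignore"

print(on_wake_signal("air conditioner", {"air conditioner", "fan"}))  # respond
print(on_wake_signal("television", {"air conditioner", "fan"}))       # ignore
```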
Further, in order to better perform the wake-up and reduce the error rate, referring to fig. 6, after the step S40, the method further includes:
S60: adjusting the recognition threshold of a preset wake-up word in the voice recognition system;
S70: when the wake-up word matches the adjusted preset wake-up word, responding to the wake-up word voice signal and executing the response operation.
The recognition threshold is adjusted rather than fixed, and the adjustment follows different user conditions. Specifically, referring to fig. 7, the adjusting process includes:
S201: extracting the voiceprint information in the wake-up word voice signal;
After the wake-up word information is extracted, voiceprint information is extracted from the wake-up word voice signal. This addresses the low wake-up rate that occurs when a user wakes the voice recognition system, or an intelligent device loaded with it, using a personalized or customized wake-up word. Since the core of wake-up word binding and speech recognition is a training model and a recognition model, improving the wake-up rate requires that the corresponding wake-up word model and voiceprint data be registered in the voice recognition system in advance, so the user can wake the system by inputting a matching voice signal. To further improve the wake-up rate and avoid false wake-ups caused by environmental noise, the system may first judge whether voiceprint data matching the voiceprint information exists in the voice recognition system. If it exists, step S202 is executed; if not, step S203 is executed.
S202: lowering the wake-up word recognition threshold of the voice recognition system;
When voiceprint data matching the voiceprint information exists in the voice recognition system, the current user of the intelligent device can be determined to be a registered user from the voiceprint data registered in the system, ruling out false wake-up by environmental noise or other sounds. The wake-up word recognition threshold for the user corresponding to that voiceprint data is therefore lowered, raising the probability that this user wakes the voice recognition system.
S203: raising the wake-up word recognition threshold of the voice recognition system.
When no voiceprint data matching the voiceprint information exists in the voice recognition system, it can be inferred that the voice signal may be environmental noise or may come from an unregistered user. To avoid false wake-ups caused by environmental noise and to improve the security of the voice recognition system, the wake-up word recognition threshold can be raised accordingly, increasing the wake-up difficulty.
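The two branches reduce to a small adjustment rule; the step size and bounds below are assumptions for illustration.

```python
# Sketch of S201-S203: lower the threshold for a matched (registered)
# voiceprint, raise it otherwise. Step size and [0, 1] bounds are assumed.
def adjust_threshold(threshold, voiceprint_matched, step=0.05):
    if voiceprint_matched:
        return max(0.0, threshold - step)    # registered user: easier to wake
    return min(1.0, threshold + step)        # noise/unregistered: harder to wake

print(adjust_threshold(0.6, True))    # lowered: easier to wake
print(adjust_threshold(0.6, False))   # raised: harder to wake
```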
Further, referring to fig. 8, after step S201, the wake-up word binding method of the foregoing embodiment further includes:
S204: calculating the similarity between the voiceprint information and the voiceprint data registered in the voice recognition system according to a preset voiceprint model;
In this embodiment, when judging whether voiceprint data matching the voiceprint information in the voice signal exists in the voice recognition system, the similarity between the two may be calculated based on a preset voiceprint model, to improve the accuracy of voiceprint recognition and thus the subsequent wake-up rate. Specifically, tone A in the voiceprint information may be segmented into syllable states based on the preset voiceprint model, tone S in the voiceprint data is segmented the same way, and the coincidence degree of the corresponding state syllables of tone A and tone S is compared; that coincidence degree is the similarity. In other embodiments, the similarity may instead be calculated by comparing rhythm B in the voiceprint information of the voice signal with rhythm D in the voiceprint data.
S205: when the similarity is within a preset range, determining that voiceprint data matching the voiceprint information exists in the voice recognition system;
When the coincidence degree of the state syllables of tone A and tone S is within the preset range, it is determined that voiceprint data matching the voiceprint information exists in the voice recognition system.
S206: when the similarity is outside the preset range, determining that no voiceprint data matching the voiceprint information exists in the voice recognition system.
When the coincidence degree of the state syllables of tone A and tone S is outside the preset range, it is determined that no voiceprint data matching the voiceprint information exists in the voice recognition system.
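Read as code, the coincidence-of-states comparison might look like the sketch below; the state segmentation is stubbed out, standing in for the preset voiceprint model.

```python
# Sketch of S204-S206: per-state coincidence of two syllable-state sequences
# as the similarity; the segmentation into states is assumed already done.
def coincidence(states_a, states_s):
    hits = sum(1 for a, s in zip(states_a, states_s) if a == s)
    return hits / max(len(states_a), len(states_s))

def voiceprint_matches(info_states, data_states, low=0.8, high=1.0):
    return low <= coincidence(info_states, data_states) <= high   # preset range

print(voiceprint_matches(["s1", "s2", "s3", "s4"],
                         ["s1", "s2", "s3", "s5"]))   # 0.75 -> False
```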
Further, referring to fig. 9, in the wake-up word binding method of the foregoing embodiment, step S203 includes:
S2031: when no voiceprint data matching the voiceprint information exists in the voice recognition system, acquiring the current user's state information and image information;
In this embodiment, when the coincidence degree of the state syllables of tone A and tone S is outside the preset range, it is determined that no voiceprint data matching the voiceprint information exists in the voice recognition system. At this point the user may have input an unregistered wake-up word, or the signal may be environmental noise, so the current user's state information and image information must be acquired to determine whether the current user is registered and whether the received voice signal is environmental noise.
S2032: when it is detected that the current user is not speaking, is outside the recognition range of the voice recognition system, or is not registered, raising the wake-up word recognition threshold of the voice recognition system.
When the acquired state information shows that the user is not speaking, or that the user is outside the recognition range of the voice recognition system, the received voice signal is judged to be environmental noise; to reduce false wake-ups caused by such noise, the wake-up word recognition threshold is raised, increasing the wake-up difficulty and lowering the false wake-up rate. Likewise, when the acquired image information shows that the current user is unregistered, the wake-up word recognition threshold is raised to increase the wake-up difficulty and improve the security of voice recognition.
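The decision can be sketched as below; the three boolean checks stand in for the state and image analysis, which the patent leaves unspecified.

```python
# Sketch of S2031/S2032: with no voiceprint match, classify the signal as
# noise or an unregistered user and raise the threshold in either case.
def on_no_voiceprint_match(threshold, user_speaking, in_range, registered):
    if not user_speaking or not in_range:
        reason = "environmental noise"
    elif not registered:
        reason = "unregistered user"
    else:
        return threshold, "inconclusive"       # checks did not settle the case
    return min(1.0, threshold + 0.05), reason  # assumed step size and bound

print(on_no_voiceprint_match(0.6, user_speaking=False,
                             in_range=True, registered=True))
```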
Further, referring to fig. 10, based on the wake-up word binding method of the foregoing embodiment, step S70 includes:
S71: counting the matching degree between the wake-up word information in the received voice signal and a wake-up word model registered with the voice recognition system;
In this embodiment, the wake-up word information in the voice signal is mainly matched against the wake-up word model, and the specific matching may be the degree of agreement between phoneme arrangements. For example, when the wake-up word model contains 48 phonemes, the wake-up word information in the received voice signal is counted, that is, its phoneme features are tallied; it is then checked whether the number of matching phonemes reaches a preset count, and the arrangement of the phonemes is further compared.
S72: when the matching degree reaches the lowered or raised wake-up word recognition threshold, waking the voice recognition system or the intelligent device where it is located.
When the phonemes in the wake-up word information reach the preset count and the coincidence rate of their arrangement is greater than the preset threshold, the matching degree between the wake-up word information and the wake-up word model has reached the lowered or raised recognition threshold. The voice signal can then be responded to, for example by waking the voice recognition system or the intelligent device where it is located, so that subsequent voice control or interaction instructions from the user are recognized and the corresponding control or interaction actions are performed, improving the intelligence of the intelligent device.
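A sketch of this final check follows; the preset-count rule and the threshold comparison are modeled on the description above, with the exact formulas assumed.

```python
# Sketch of S71/S72: count matching phonemes and their ordering agreement,
# then wake only if the matching degree reaches the adjusted threshold.
def matching_degree(heard, model):
    count_ok = len(set(heard) & set(model)) >= len(set(model)) - 1  # assumed preset count
    order_hits = sum(1 for h, m in zip(heard, model) if h == m)
    return order_hits / len(model) if count_ok else 0.0

def should_wake(heard, model, adjusted_threshold):
    return matching_degree(heard, model) >= adjusted_threshold

model = ["k", "o", "n", "g", "t", "i", "a", "o"]
print(should_wake(["k", "o", "n", "g", "t", "i", "a", "o"], model, 0.55))  # True
```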
In addition, an embodiment of the present invention further provides a storage medium storing a wake-up word binding program which, when executed by a processor, implements the steps of the wake-up word binding method described above.
For the methods implemented when the wake-up word binding program is executed, reference may be made to the embodiments of the wake-up word binding method of the present invention, which are not repeated here.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be noted that, in the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.
While preferred embodiments of the present invention have been described, additional variations and modifications to those embodiments may occur to those skilled in the art once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made to the present invention without departing from its spirit and scope. Thus, if such modifications and variations fall within the scope of the claims of the present invention and their equivalents, the present invention is intended to include them as well.

Claims (5)

1. A method for binding a wake-up word, the method comprising:
step S1, collecting a voice signal uttered by a user;
step S2, extracting wake-up word information and user information from the voice signal;
step S3, binding the user information and the wake-up word information to the user;
the step S3 includes:
step S31, acquiring a wake-up word model registered by the user in a voice recognition system, and binding the user information and the wake-up word to the wake-up word model;
when the user information is voiceprint information, the step S31 includes:
step S311, collecting wake-up word voice signals input by the user multiple times;
step S312, acquiring the rhythm features, tone features, and phoneme features in the wake-up word voice signal of each input;
step S313, performing acoustic feature processing on the rhythm features and tone features acquired each time, and registering the processed rhythm feature information and tone feature information as voiceprint data of the user; specifically, after the first input of the wake-up word voice signal, rhythm feature 1 and tone feature 1 are obtained by voiceprint recognition, and rhythm feature 2 and tone feature 2 are obtained from the second input; when the difference between them is large, rhythm feature 1 is optimized using rhythm feature 2 and tone feature 1 is optimized using tone feature 2, and this continues until the difference between the newly acquired rhythm feature n and the current rhythm feature n-1, and between tone feature n and tone feature n-1, falls within a preset range; the current rhythm feature and tone feature, after acoustic feature processing, are then registered as the voiceprint data of the user;
step S314, ordering and combining the phoneme features acquired from each input based on a preset acoustic model to obtain the wake-up word model;
step S315, storing the voiceprint data and the wake-up word in association with the wake-up word model;
after the step S3, the method further includes:
step S4, receiving a wake-up voice signal, and extracting the wake-up word from the wake-up voice signal;
step S5, when the wake-up word matches a preset wake-up word in the voice recognition system, responding to the wake-up voice signal;
after the step S4, the method further includes:
step S6, adjusting the recognition threshold of the preset wake-up word in the voice recognition system, so that the wake-up word is not fixed but adapts to different users and conditions;
step S7, when the wake-up word matches the adjusted preset wake-up word, responding to the wake-up voice signal;
the step S6 includes:
step S61, extracting voiceprint information from the wake-up voice signal;
step S62, when no voiceprint data matching the voiceprint information exists in the voice recognition system, raising the wake-up word recognition threshold of the voice recognition system;
and step S63, when voiceprint data matching the voiceprint information exists in the voice recognition system, lowering the wake-up word recognition threshold of the voice recognition system.
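For illustration only (not part of the claim language), the following is a minimal sketch of the iterative registration loop of steps S311 to S315. The functions extract_features and build_wake_word_model, the constants DIFF_THRESHOLD and ALPHA, and the blending update rule are all hypothetical stand-ins; the claim does not fix a particular feature extractor or optimization method.

```python
# Minimal sketch of the step S311-S315 registration loop, under assumptions:
# extract_features, build_wake_word_model, DIFF_THRESHOLD, and ALPHA are
# illustrative placeholders, not the patented acoustic processing itself.
import numpy as np

DIFF_THRESHOLD = 0.1  # hypothetical "preset range" for feature convergence
ALPHA = 0.5           # hypothetical blending weight used to "optimize" features

def extract_features(audio: np.ndarray):
    """Placeholder for step S312: returns (rhythm_vec, tone_vec, phonemes)."""
    raise NotImplementedError("replace with a real prosody/voiceprint front end")

def build_wake_word_model(phoneme_seqs):
    """Placeholder for step S314: order and combine phoneme features
    with a preset acoustic model to form the wake-up word model."""
    raise NotImplementedError

def register_wake_word(utterances):
    rhythm = tone = None
    phoneme_seqs = []
    for audio in utterances:                      # step S311: repeated inputs
        r, t, phonemes = extract_features(audio)  # step S312
        phoneme_seqs.append(phonemes)
        if rhythm is None:                        # first input seeds the features
            rhythm, tone = r, t
            continue
        # step S313: while successive inputs still differ too much, blend the
        # new features into the current estimate and keep collecting
        if (np.linalg.norm(r - rhythm) > DIFF_THRESHOLD
                or np.linalg.norm(t - tone) > DIFF_THRESHOLD):
            rhythm = (1 - ALPHA) * rhythm + ALPHA * r
            tone = (1 - ALPHA) * tone + ALPHA * t
    voiceprint = np.concatenate([rhythm, tone])   # registered voiceprint data
    model = build_wake_word_model(phoneme_seqs)   # step S314
    return voiceprint, model                      # stored together, step S315
```

The point the claim captures is convergence: each new recording refines the stored rhythm and tone features until consecutive inputs agree to within the preset range, and only then is the voiceprint registered alongside the wake-up word model.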
2. The wake-up word binding method according to claim 1, wherein the step S2 comprises:
step S21, when a voice signal is received, judging whether the volume value of the voice signal is greater than a preset volume value;
and step S22, if so, acquiring wake-up word information from the voice signal based on an acoustic model and a grammar structure, and acquiring voiceprint information from the voice signal based on voiceprint recognition.
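A minimal sketch of the volume gate of steps S21 and S22 follows; it is illustrative only. The RMS-in-dBFS loudness measure, the -30 dBFS preset value, and the two recognizer stubs are assumptions, since the claim only requires that extraction proceeds when the volume exceeds a preset value.

```python
# Minimal sketch of the step S21-S22 volume gate; the dBFS metric, the
# preset value, and the recognizer stubs are illustrative assumptions.
import numpy as np

PRESET_VOLUME_DB = -30.0   # hypothetical preset volume value (dBFS)

def recognize_wake_word(samples: np.ndarray) -> str:
    raise NotImplementedError("acoustic model + grammar structure, step S22")

def extract_voiceprint(samples: np.ndarray) -> np.ndarray:
    raise NotImplementedError("voiceprint recognition, step S22")

def extract_if_audible(samples: np.ndarray):
    rms = np.sqrt(np.mean(np.square(samples)))
    level_db = 20 * np.log10(max(rms, 1e-10))      # guard against log(0)
    if level_db <= PRESET_VOLUME_DB:               # step S21: volume gate
        return None                                # too quiet, ignore signal
    return recognize_wake_word(samples), extract_voiceprint(samples)
```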
3. The wake-up word binding method according to claim 1, wherein after the step S61, the method further comprises:
step S64, calculating the similarity between the voiceprint information and the voiceprint data registered in the voice recognition system according to a preset voiceprint model;
step S65, when the similarity is within a preset range, determining that voiceprint data matching the voiceprint information exists in the voice recognition system;
and step S66, when the similarity is outside the preset range, determining that no voiceprint data matching the voiceprint information exists in the voice recognition system.
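As an illustration of steps S64 to S66 combined with the threshold adjustment of steps S62 and S63, the sketch below uses cosine similarity and concrete threshold values; all of these are assumptions, since the claims leave the "preset voiceprint model", the "preset range", and the threshold magnitudes open.

```python
# Minimal sketch of steps S64-S66 plus the S62/S63 threshold adjustment;
# cosine similarity, the similarity range, and the threshold values are
# hypothetical stand-ins for the unspecified preset voiceprint model.
import numpy as np

SIM_LOW, SIM_HIGH = 0.8, 1.0   # hypothetical preset similarity range
STRICT, RELAXED = 0.70, 0.35   # hypothetical wake-up word recognition thresholds

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def adjusted_threshold(voiceprint: np.ndarray,
                       registered: list[np.ndarray]) -> float:
    """Lower the threshold for registered speakers, raise it for unknown ones."""
    matched = any(SIM_LOW <= cosine_similarity(voiceprint, v) <= SIM_HIGH
                  for v in registered)     # steps S64-S65 (else step S66)
    return RELAXED if matched else STRICT  # step S63 / step S62
```

The effect is that a registered speaker wakes the device more easily (lower threshold), while an unknown speaker must match the preset wake-up word more strictly (higher threshold).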
4. A smart device loaded with a voice recognition system, the smart device further comprising a memory, a processor, and a wake-up word binding application program stored in the memory and executable on the processor, the voice recognition system being coupled to the processor, wherein:
the voice recognition system is configured to respond to voice signals that meet a wake-up condition;
and the wake-up word binding application program, when executed by the processor, performs the steps of the wake-up word binding method according to any one of claims 1 to 3.
5. A storage medium storing a wake-up word binding application program which, when executed by a processor, implements the steps of the wake-up word binding method according to any one of claims 1 to 3.
CN201810407844.6A 2018-04-28 2018-04-28 Wake-up word binding method, intelligent device and storage medium Active CN108735209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810407844.6A CN108735209B (en) 2018-04-28 2018-04-28 Wake-up word binding method, intelligent device and storage medium

Publications (2)

Publication Number Publication Date
CN108735209A (en) 2018-11-02
CN108735209B (en) 2021-01-08

Family

ID=63939486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810407844.6A Active CN108735209B (en) 2018-04-28 2018-04-28 Wake-up word binding method, intelligent device and storage medium

Country Status (1)

Country Link
CN (1) CN108735209B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109273005B (en) * 2018-12-11 2024-10-01 胡应章 Sound control output device
CN109887508A (en) * 2019-01-25 2019-06-14 广州富港万嘉智能科技有限公司 A kind of meeting automatic record method, electronic equipment and storage medium based on vocal print
CN110062309B (en) * 2019-04-28 2021-04-27 百度在线网络技术(北京)有限公司 Method and device for controlling intelligent loudspeaker box
CN110556107A (en) * 2019-08-23 2019-12-10 宁波奥克斯电气股份有限公司 control method and system capable of automatically adjusting voice recognition sensitivity, air conditioner and readable storage medium
CN110600029A (en) * 2019-09-17 2019-12-20 苏州思必驰信息科技有限公司 User-defined awakening method and device for intelligent voice equipment
CN111128155B (en) * 2019-12-05 2020-12-01 珠海格力电器股份有限公司 Awakening method, device, equipment and medium for intelligent equipment
CN111124512B (en) * 2019-12-10 2020-12-08 珠海格力电器股份有限公司 Awakening method, device, equipment and medium for intelligent equipment
CN111240634A (en) * 2020-01-08 2020-06-05 百度在线网络技术(北京)有限公司 Sound box working mode adjusting method and device
CN111261171A (en) * 2020-01-17 2020-06-09 厦门快商通科技股份有限公司 Method and system for voiceprint verification of customizable text
CN112053689A (en) * 2020-09-11 2020-12-08 深圳市北科瑞声科技股份有限公司 Method and system for operating equipment based on eyeball and voice instruction and server
CN112382288B (en) * 2020-11-11 2024-04-02 湖南常德牌水表制造有限公司 Method, system, computer device and storage medium for voice debugging device
CN112530424A (en) * 2020-11-23 2021-03-19 北京小米移动软件有限公司 Voice processing method and device, electronic equipment and storage medium
CN112700782A (en) * 2020-12-25 2021-04-23 维沃移动通信有限公司 Voice processing method and electronic equipment
CN113327593B (en) * 2021-05-25 2024-04-30 上海明略人工智能(集团)有限公司 Device and method for corpus acquisition, electronic equipment and readable storage medium
CN113380257B (en) * 2021-06-08 2024-09-24 深圳市同行者科技有限公司 Response method, device, equipment and storage medium of multi-terminal intelligent home
CN113808585A (en) * 2021-08-16 2021-12-17 百度在线网络技术(北京)有限公司 Earphone awakening method, device, equipment and storage medium
CN114333828A (en) * 2022-03-08 2022-04-12 深圳市华方信息产业有限公司 Quick voice recognition system for digital product
CN115132195B (en) * 2022-05-12 2024-03-12 腾讯科技(深圳)有限公司 Voice wakeup method, device, equipment, storage medium and program product
CN117153166B (en) * 2022-07-18 2024-07-12 荣耀终端有限公司 Voice wakeup method, equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1567431A (en) * 2003-07-10 2005-01-19 上海优浪信息科技有限公司 Method and system for identifying status of speaker
CN105575395A (en) * 2014-10-14 2016-05-11 中兴通讯股份有限公司 Voice wake-up method and apparatus, terminal, and processing method thereof
CN106338924A (en) * 2016-09-23 2017-01-18 广州视源电子科技股份有限公司 Method and device for automatically adjusting equipment operation parameter threshold
CN106358061A (en) * 2016-11-11 2017-01-25 四川长虹电器股份有限公司 Television voice remote control system and television voice remote control method
CN107945806A (en) * 2017-11-10 2018-04-20 北京小米移动软件有限公司 User identification method and device based on sound characteristic

Also Published As

Publication number Publication date
CN108735209A (en) 2018-11-02

Similar Documents

Publication Publication Date Title
CN108735209B (en) Wake-up word binding method, intelligent device and storage medium
CN108711430B (en) Speech recognition method, intelligent device and storage medium
CN109427333B (en) Method for activating speech recognition service and electronic device for implementing said method
CN108320742B (en) Voice interaction method, intelligent device and storage medium
CN110890093B (en) Intelligent equipment awakening method and device based on artificial intelligence
CN107360327B (en) Speech recognition method, apparatus and storage medium
US10825453B2 (en) Electronic device for providing speech recognition service and method thereof
CN110570840B (en) Intelligent device awakening method and device based on artificial intelligence
CN108494947B (en) Image sharing method and mobile terminal
CN108712566B (en) Voice assistant awakening method and mobile terminal
CN109509473B (en) Voice control method and terminal equipment
CN112735418B (en) Voice interaction processing method, device, terminal and storage medium
CN109065060B (en) Voice awakening method and terminal
CN106203235B (en) Living body identification method and apparatus
KR20180047801A (en) Electronic apparatus and controlling method thereof
CN111522592A (en) Intelligent terminal awakening method and device based on artificial intelligence
WO2022227507A1 (en) Wake-up degree recognition model training method and speech wake-up degree acquisition method
CN109754823A (en) A kind of voice activity detection method, mobile terminal
CN110830368A (en) Instant messaging message sending method and electronic equipment
CN113782012A (en) Wake-up model training method, wake-up method and electronic equipment
CN111292727B (en) Voice recognition method and electronic equipment
CN111526244A (en) Alarm clock processing method and electronic equipment
CN114093357A (en) Control method, intelligent terminal and readable storage medium
CN111880988B (en) Voiceprint wake-up log collection method and device
CN111479005B (en) Volume adjusting method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant