WO2021103712A1 - Neural network-based voice keyword detection method and device, and system - Google Patents
- Publication number: WO2021103712A1 (application PCT/CN2020/111940)
- Authority: WO (WIPO, PCT)
- Prior art keywords: voice, score, basic, keyword, neural network
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Definitions
- the invention belongs to the technical field of computer speech recognition, and specifically relates to a method, device and system for detecting speech keywords based on a neural network.
- For the task of speech keyword detection, the traditional method is to introduce a complete speech-recognition decoder, decode the input speech in which keywords are to be detected, generate multiple candidate results, and save them in some form, such as a Lattice structure; an inverted index is then generated from the Lattice, and this inverted index is quickly searched to determine whether the speech to be detected contains the specified keywords.
- Because multiple candidates can all be represented in the Lattice, this Lattice-based keyword strategy generally has a high recall rate.
- The disadvantage is that it is too complicated: it requires introducing the entire recognition system and processing complex Lattices, and generating the inverted index generally involves Finite-State Transducer (FST) operations, which are difficult to master, so the deployment complexity is also high.
- Under the latest neural network-based keyword detection frameworks, a neural network is generally built for each keyword, and each network judges whether its keyword is activated by accumulating the output score of each frame.
- Building a separate neural network for each keyword has two problems: first, a large amount of speech containing that keyword is needed to train the model, and collecting the data is very troublesome; second, when a keyword is added or modified, the data must be collected again and the model retrained, so the whole process is also very complicated.
- Moreover, such models generally have a high false-alarm rate, and the system is often activated by mistake in undesired situations.
- In view of these defects of the prior art, the present invention proposes a neural network-based method, device, and system for voice keyword detection.
- The present invention can reduce the network model resources required by a keyword retrieval system; in addition, the model does not need to be retrained when keywords are modified, which saves the time and cost of retraining the model.
- One aspect of the present invention is to disclose a voice keyword detection method based on neural network, which includes the following steps:
- said outputting the basic phoneme corresponding to each frame of speech feature includes:
- the basic phonemes corresponding to each frame of speech feature are output according to an N ⁇ M matrix; wherein, N is equal to the number of frames of the speech, and M is equal to the number of basic phonemes of the target language.
- the output of the basic phoneme corresponding to each frame of speech feature includes:
- the neural network model is obtained through the following steps:
- acquiring a training sample data set, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;
- extracting sample voice features of the sample voice;
- using the sample voice features as input and the sample basic phoneme labeling result corresponding to the sample voice as output to train the neural network model.
- the inputting the voice features into a pre-trained neural network model of the target language by frame, and outputting the basic phonemes corresponding to the voice features of each frame includes:
- the voice features are input into a pre-trained GMM-HMM model of the target language in frames to perform forced alignment on the voice features to obtain at least one basic phoneme corresponding to each frame of voice features.
- the calculating the voice score for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword includes:
- multiple scores are calculated through multiple score calculation strategies, and the multiple scores are merged to obtain a final score.
- the score calculation strategy includes: at least two of dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N ⁇ M matrix space.
- the judging whether a keyword is activated according to the score includes:
- Another aspect of the present invention is to disclose a voice keyword detection device based on a neural network, the device comprising:
- the voice feature extraction unit is used to receive the voice to be detected and extract the voice feature of the voice
- the basic phoneme prediction unit is configured to input the voice features into a pre-trained neural network model of the target language according to frames, and output the basic phonemes corresponding to the voice features of each frame;
- Candidate word mapping unit for mapping each candidate keyword to a corresponding basic phoneme
- a score calculation unit, configured to calculate, according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword, the score of the voice for each candidate keyword;
- the judging unit is used to judge whether a keyword is activated according to the score.
- the basic phoneme prediction unit is configured to output the basic phoneme corresponding to the speech feature of each frame in the manner of an N ⁇ M matrix; wherein, N is equal to the number of frames of the speech, and M is equal to the basis of the target language Number of phonemes
- Another aspect of the present invention is to disclose a computer system, including:
- one or more processors; and
- a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, cause the one or more processors to perform the method described above.
- the product of the present invention only needs to have one of the above-mentioned effects.
- FIG. 1 is a flowchart of the voice keyword detection method of the present invention
- FIG. 2 is a flowchart of a method in Embodiment 1 of the present invention.
- Fig. 3 is a structural diagram of a device according to the second embodiment of the present invention.
- Figure 4 is a structural diagram of the computer system of the present invention.
- the invention uses a neural network-based method to solve the task of voice keyword detection.
- the modeling unit of the neural network of the present invention is not a complete keyword or a word in the keyword, but a basic phoneme unit of the language to which the keyword belongs.
- For example, for Chinese, the output nodes of the neural network of the present invention model all the initials and finals of Hanyu Pinyin, and the desired keywords are spliced together through sequence combinations of initials and finals.
- In addition, because the neural network of the present invention is relatively small, the scores obtained for the same voice through multiple neural networks can be further fused, which further improves performance, makes the score better reflect the keyword confidence, increases the recall rate of the keyword detection system, and reduces false alarms.
- Fig. 1 shows a flowchart of the voice keyword detection method of the present invention.
- the voice keyword detection method of the present invention can be divided into two parts, one is training a neural network model, and the other is using a trained neural network model to detect voice keywords.
- Training the neural network model includes the following steps.
- Step 1: Obtain a sample training set, including the sample speech used for training and the sample basic phoneme labeling result of that speech.
- For the target language, collect a certain amount of annotated speech, preferably a speech training set of more than 500 hours.
- Step 2: Extract sample voice features.
- Step 3: Train the neural network model. Use the sample speech with basic phoneme annotations to train the GMM-HMM model required for speech recognition, and use this model to force-align the speech, obtaining for each frame of the feature-extracted speech which basic phoneme (or phonemes) of the target language it belongs to (if a frame belongs to multiple basic phonemes, the probabilities of those phonemes sum to 1).
- In practice, the phoneme sequence corresponding to a sentence can be obtained by mapping through an existing dictionary resource, but the phonemes of a specific frame cannot be determined directly; therefore a GMM-HMM model is trained, and this model is used to obtain the phoneme information of each frame, as sketched below.
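As a minimal sketch of this labeling step (the data layout and function names are assumptions, and no particular toolkit is implied), the code below converts a forced-alignment result, given as one or more (phoneme id, probability) pairs per frame, into an N×M frame-level target matrix whose rows sum to 1:

```python
import numpy as np

def alignment_to_targets(alignment, num_phonemes):
    """Build an N x M frame-level target matrix from a forced alignment.

    alignment: list of length N (one entry per frame); each entry is a list
               of (phoneme_id, probability) pairs whose probabilities sum to 1.
    num_phonemes: M, the number of basic phoneme output nodes.
    """
    targets = np.zeros((len(alignment), num_phonemes))
    for frame_index, frame_labels in enumerate(alignment):
        for phoneme_id, prob in frame_labels:
            targets[frame_index, phoneme_id] = prob
    return targets

# Example: three frames; the middle frame is shared between phonemes 4 and 5.
alignment = [[(4, 1.0)], [(4, 0.6), (5, 0.4)], [(5, 1.0)]]
print(alignment_to_targets(alignment, num_phonemes=8))
```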
- The output nodes of the neural network represent the basic phonemes of the target language, so the number of output nodes can simply equal the total number of basic phonemes of that language. For Chinese, for example, it can be the number of all initials plus all finals; for English, it is the number of International Phonetic Alphabet symbols. The scheme is also extensible: for a tonal language such as Chinese, the finals can carry tones, five in total (the four tones plus the neutral tone), so the total number of nodes becomes the number of initials plus five times the number of finals. In addition, some extra nodes can be added to absorb parts of the speech that do not belong to any phoneme, such as noise, abnormal sounds, coughing, and so on.
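The node count described above can be written down directly. In the sketch below the phoneme inventory sizes are illustrative assumptions (roughly 23 pinyin initials and 39 finals), not values fixed by the patent:

```python
# Illustrative counts only; the exact phoneme inventory is an assumption.
NUM_INITIALS = 23   # Hanyu Pinyin initials (approximate)
NUM_FINALS = 39     # Hanyu Pinyin finals (approximate)
NUM_TONES = 5       # four tones plus the neutral tone
NUM_EXTRA = 3       # e.g. nodes absorbing noise, abnormal sounds, coughing

# Toneless modeling: one output node per initial and per final.
toneless_nodes = NUM_INITIALS + NUM_FINALS + NUM_EXTRA

# Tonal modeling: each final is split by tone, as described above.
tonal_nodes = NUM_INITIALS + NUM_TONES * NUM_FINALS + NUM_EXTRA

print(toneless_nodes, tonal_nodes)  # 65 and 221 here, well under 500
```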
- The modeling units of the neural network model of the present invention are not complete keywords or the individual characters in a keyword, but the basic phoneme units of the language to which the keywords belong.
- For example, for Chinese, the output nodes of the neural network of the present invention model all the initials and finals of Hanyu Pinyin, and the desired keywords are spliced together through sequence combinations of initials and finals.
- For example, if the keyword is "小伙小伙" ("xiaohuo xiaohuo"), the corresponding initial-plus-final sequence combination is "xiao3 huo3 xiao3 huo3".
- A language generally has no more than about 100 basic phoneme units; even for a tonal language such as Chinese, counting the tone variants, the total number of modeling units is generally no more than 500. The neural network model therefore does not become too large and can conveniently be deployed in embedded devices such as mobile phones and cameras.
- The above network can be a simple fully connected feedforward neural network, or a more complex network such as a time-delay neural network, a convolutional neural network, or a recurrent neural network; all of these fall within the protection scope of the present invention.
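As a minimal sketch of the simplest option above, the code below builds a toy fully connected feedforward network with random, untrained weights (illustrative only); it maps one frame's feature vector to a posterior over the M phoneme nodes and stacks the per-frame outputs into the N×M phoneme distribution matrix used in the detection steps that follow:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class FeedforwardPhonemeNet:
    """Toy fully connected network: one frame's features -> phoneme posteriors."""

    def __init__(self, feat_dim, hidden_dim, num_phonemes):
        # Random weights stand in for a trained model.
        self.w1 = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, num_phonemes)) * 0.1
        self.b2 = np.zeros(num_phonemes)

    def forward(self, frame_feat):
        hidden = np.tanh(frame_feat @ self.w1 + self.b1)
        return softmax(hidden @ self.w2 + self.b2)

def phoneme_distribution_matrix(net, frames):
    """Run every frame through the network; the result is N x M."""
    return np.stack([net.forward(f) for f in frames])

# Example: 200 frames of 40-dimensional features, 221 phoneme output nodes.
net = FeedforwardPhonemeNet(feat_dim=40, hidden_dim=128, num_phonemes=221)
frames = rng.standard_normal((200, 40))
posteriors = phoneme_distribution_matrix(net, frames)
print(posteriors.shape)  # (200, 221); each row sums to 1
```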
- Using the trained neural network model to detect speech keywords includes the following steps:
- Step 4: Receive the voice to be detected that is input by the user, and extract the voice features of that voice.
- Step 5: Input the speech features, frame by frame, into the neural network model trained in the steps above, and output the corresponding phonemes. For each frame, the neural network produces a vector whose length equals the number of network output nodes. Assuming the speech has N frames and the network has M output nodes, a phoneme distribution matrix of size N×M is obtained.
- For different target languages, the values of N and M differ.
- Step 6: Calculate the score of each candidate keyword, that is, use the above N×M matrix to compute the possible score of each candidate keyword.
- Each candidate keyword is mapped into a phoneme sequence through its pronunciation dictionary; since each phoneme corresponds to a node of the network output, the score of the candidate keyword's phoneme sequence in the N×M matrix can then be calculated.
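To make this mapping concrete, the sketch below uses a made-up miniature pronunciation dictionary (a real system would load a full lexicon, and the phoneme decompositions shown are only illustrative) to turn a candidate keyword into the sequence of output-node indices used for scoring:

```python
# Hypothetical miniature pronunciation dictionary: keyword -> phoneme units.
# "打开收钱码" = "open money collection code", "打开付钱码" = "open payment code".
PRONUNCIATION_DICT = {
    "打开收钱码": ["d", "a3", "k", "ai1", "sh", "ou1", "q", "ian2", "m", "a3"],
    "打开付钱码": ["d", "a3", "k", "ai1", "f", "u4", "q", "ian2", "m", "a3"],
}

# Hypothetical mapping from each phoneme unit to a network output-node index.
PHONEME_TO_NODE = {p: i for i, p in enumerate(
    sorted({p for seq in PRONUNCIATION_DICT.values() for p in seq}))}

def keyword_to_node_sequence(keyword):
    """Map a candidate keyword to its sequence of output-node indices."""
    return [PHONEME_TO_NODE[p] for p in PRONUNCIATION_DICT[keyword]]

print(keyword_to_node_sequence("打开收钱码"))
```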
- This scoring method includes, but is not limited to, dynamic programming, the longest sequence score subject to constraints, or the optimal path score after brute force exhaustion in the N ⁇ M matrix space. For the convenience of discussion, all score calculation methods that may be used in this process are collectively referred to as "score calculation strategy".
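One possible realization of the dynamic-programming strategy mentioned above is sketched below; it scores the best monotonic alignment of the keyword's phoneme-node sequence against the N×M matrix of per-frame posteriors. The exact scoring formula is an assumption made for illustration, not the patent's prescribed calculation.

```python
import numpy as np

def dp_keyword_score(posteriors, node_sequence):
    """Best monotonic-alignment score of a phoneme sequence in an N x M matrix.

    posteriors: N x M matrix of per-frame phoneme posteriors.
    node_sequence: list of K output-node indices for the candidate keyword.
    Returns the average log-posterior along the best path (higher is better).
    """
    log_p = np.log(posteriors + 1e-10)
    n_frames = posteriors.shape[0]
    k = len(node_sequence)
    if k == 0 or k > n_frames:
        return float("-inf")

    # dp[t, j]: best score ending at frame t while emitting the j-th keyword phoneme.
    dp = np.full((n_frames, k), -np.inf)
    dp[0, 0] = log_p[0, node_sequence[0]]
    for t in range(1, n_frames):
        for j in range(k):
            stay = dp[t - 1, j]                               # keep emitting phoneme j
            advance = dp[t - 1, j - 1] if j > 0 else -np.inf  # move on to the next phoneme
            dp[t, j] = log_p[t, node_sequence[j]] + max(stay, advance)
    return dp[-1, -1] / n_frames

# Example with random posteriors (illustration only).
rng = np.random.default_rng(1)
raw = rng.random((50, 20))
posteriors = raw / raw.sum(axis=1, keepdims=True)
print(dp_keyword_score(posteriors, node_sequence=[3, 7, 7, 12, 5]))
```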
- The present invention can train multiple neural networks for score calculation; for a given candidate keyword, the different score-calculation networks use different "score calculation strategies" to obtain multiple scores, and these scores can be fused with different methods, such as a weighted average, to obtain a better score representation.
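The weighted average mentioned above as one possible fusion method is simple to write down; the weights here are placeholder values that would in practice be tuned on development data.

```python
def fuse_scores(scores, weights=None):
    """Weighted average of scores from several networks / scoring strategies."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)  # plain average by default
    return sum(w * s for w, s in zip(weights, scores))

# Example: three strategies, with more trust placed in the first one.
print(fuse_scores([88.0, 92.0, 85.0], weights=[0.5, 0.3, 0.2]))
```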
- It should be noted that, because a candidate keyword can always be mapped to a phoneme sequence, any candidate keyword can always be scored in Step 6, and the neural network does not need to be retrained.
- In addition, since only the phoneme sequences of candidate keywords are considered here, candidate keywords with the same pronunciation but different written words are treated identically.
- Step 7: Determine whether any candidate keyword is activated.
- From the candidate keyword set, select the candidate keyword with the highest score. If that score exceeds the threshold pre-defined for this candidate keyword, the candidate keyword is activated; otherwise, consider the candidate keyword with the second-highest score and check whether it exceeds its own pre-defined threshold, and so on in order. As soon as one candidate keyword is activated, control information indicating that this candidate keyword has been activated is returned, completing the recognition of the utterance. If the scores of all candidate keywords are below their thresholds, a result indicating that no candidate keyword was activated is returned, and the whole process ends. Take a financial payment app on a mobile phone as an example: after opening the app, the user says "open the money collection code" or "open the payment code", the system determines from the user's voice that it has received a specific keyword, and it then automatically opens the corresponding QR code for the user.
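A minimal sketch of this activation decision is shown below; the candidate keywords, scores, and the shared threshold of 80 mirror the worked example given later in this description, and all values are illustrative.

```python
def pick_activated_keyword(scores, thresholds):
    """Check candidates in descending score order and return the first one
    whose score exceeds its own pre-defined threshold, or None."""
    for keyword in sorted(scores, key=scores.get, reverse=True):
        if scores[keyword] > thresholds[keyword]:
            return keyword   # activated; stop checking further candidates
    return None              # no candidate keyword was activated

# Worked example ("打开收钱码" = open money collection code,
# "打开付钱码" = open payment code); both thresholds are 80.
scores = {"打开收钱码": 90, "打开付钱码": 40}
thresholds = {"打开收钱码": 80, "打开付钱码": 80}
print(pick_activated_keyword(scores, thresholds))  # -> 打开收钱码
```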
- Because this example is a Chinese scenario, a certain amount of Chinese corpus is collected first; it is easy to find more than 500 hours of annotated Chinese corpus on the Internet. Open-source tools are used to train a Chinese GMM-HMM model, and the trained model is then used to force-align this Chinese corpus, obtaining Chinese phoneme labels at the Hanyu Pinyin level, that is, the phoneme information of each frame.
- Next, the phoneme-level annotations and the corpus are used to train one or more neural networks, which can be fully connected feedforward neural networks or time-delay neural networks.
- The output nodes of the network correspond to the total number of phonemes; with this, the training of the neural network is complete.
- The neural network resources are saved offline, packaged together with the mobile app, deployed on the mobile phone, and loaded into the phone's memory when the app is opened.
- The app also stores the voice feature extraction strategy and the candidate keyword set, such as "open money collection code", "open payment code", and so on.
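The description later notes that the supported keywords can simply be written to a file that is read into memory, so keywords can be changed without retraining. A hypothetical keyword configuration of this kind (the JSON layout, phoneme decompositions, and threshold values are all assumptions) might look like the following:

```python
import json

# Hypothetical keyword configuration shipped with the app; editing this file
# is all that is needed to add, remove, or change candidate keywords.
KEYWORD_CONFIG = """
{
  "打开收钱码": {
    "phonemes": ["d", "a3", "k", "ai1", "sh", "ou1", "q", "ian2", "m", "a3"],
    "threshold": 80
  },
  "打开付钱码": {
    "phonemes": ["d", "a3", "k", "ai1", "f", "u4", "q", "ian2", "m", "a3"],
    "threshold": 80
  }
}
"""

candidate_keywords = json.loads(KEYWORD_CONFIG)
print(list(candidate_keywords), candidate_keywords["打开收钱码"]["threshold"])
```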
- When the user finishes saying a sentence, such as "Please open my money collection code", the phone's microphone collects the samples of that sentence, feature extraction is performed, and the features are sent to the neural network in memory to obtain a phoneme distribution matrix as output; the scores of the different candidate keywords are then computed from this phoneme distribution matrix. When multiple neural networks are used, their outputs are fused by some strategy, such as a weighted average, to obtain a more accurate score.
- For example, suppose the computed score of the user's utterance "Please open my money collection code" against the candidate keyword "open money collection code" is 90, its score against the candidate keyword "open payment code" is 40, and the threshold of each candidate keyword is 80. Checking, in descending order of score, whether each keyword's score exceeds the threshold set in advance for that keyword, it is found that the score of the candidate keyword "open money collection code" exceeds its threshold; that keyword is therefore activated, and the activated keyword is used to perform the subsequent operations.
- the first embodiment of the present application discloses a method for detecting voice keywords based on a neural network, as shown in FIG. 2, including the following steps:
- S21: Receive the voice to be detected, and extract the voice features of the voice.
- S22: Input the voice features, frame by frame, into a pre-trained neural network model of the target language, and output the basic phonemes corresponding to each frame of voice features. Specifically, this step includes:
- inputting the voice features, frame by frame, into a pre-trained GMM-HMM model of the target language to force-align the voice features, obtaining at least one basic phoneme corresponding to each frame of voice features, and outputting the basic phonemes corresponding to each frame of voice features in the form of an N×M matrix, where N equals the number of frames of the voice and M equals the number of basic phonemes of the target language.
- S23: Map each preset candidate keyword to its corresponding basic phonemes; this can be done through a pronunciation dictionary.
- S24: Calculate, according to the basic phonemes of the voice features and the basic phonemes of the candidate keywords, the score of the voice for each candidate keyword. Specifically, this step includes:
- calculating multiple scores through multiple score calculation strategies and merging the multiple scores to obtain a final score.
- The score calculation strategy includes at least two of: dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N×M matrix space.
- S25: Determine, according to the score, whether any keyword is activated. Specifically, the relationship between each candidate keyword's score and that candidate keyword's pre-defined score threshold is judged in descending order of score, until a candidate keyword whose score is greater than its pre-defined threshold is found; that candidate keyword is activated and the judgment stops.
- The above-mentioned neural network model can be obtained through the following steps:
- acquiring a training sample data set, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;
- extracting sample voice features of the sample voice; and
- using the sample voice features as input and the sample basic phoneme labeling result corresponding to the sample voice as output to train the neural network model.
- the second embodiment of the present application also discloses a voice keyword detection device based on a neural network.
- the device includes:
- the voice feature extraction unit 31 is used to receive the voice to be detected and extract the voice feature of the voice.
- the basic phoneme prediction unit 32 is configured to input the voice features into a pre-trained neural network model of the target language according to frames, and output the basic phonemes corresponding to the voice features of each frame.
- Specifically, the basic phoneme prediction unit 32 is used to input the voice features, frame by frame, into a pre-trained GMM-HMM model of the target language to force-align the voice features, obtain at least one basic phoneme corresponding to each frame of voice features, and output the basic phonemes corresponding to each frame of voice features in the form of an N×M matrix, where N equals the number of frames of the voice and M equals the number of basic phonemes of the target language.
- the candidate word mapping unit 33 is used to map each candidate keyword to a corresponding basic phoneme. Specifically, it can be mapped through a pronunciation dictionary.
- the score calculation unit 34 is configured to calculate the score of each candidate keyword for the voice according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword.
- the score calculation unit 34 is configured to calculate multiple scores through multiple score calculation strategies based on the basic phonemes of the voice feature and the basic phonemes of the candidate keywords, and merge the multiple scores to obtain a final score.
- the score calculation strategy includes at least two of dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N ⁇ M matrix space.
- the judging unit 35 is used for judging whether a keyword is activated according to the score.
- Specifically, the judging unit 35 is used to judge, in descending order of score, the relationship between each candidate keyword's score and that candidate keyword's pre-defined score threshold, until a candidate keyword whose score is greater than its pre-defined score threshold is found, and to stop the judgment after activating that candidate keyword.
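Purely as an illustration of how the five units described above might be wired together in software (the class and method names are invented, and each helper stands in for one unit), a skeleton could look like this:

```python
class KeywordDetectionDevice:
    """Skeleton wiring of the five units described above (illustrative only)."""

    def __init__(self, feature_extractor, phoneme_net, pronunciation_dict, thresholds):
        self.feature_extractor = feature_extractor    # voice feature extraction unit 31
        self.phoneme_net = phoneme_net                # basic phoneme prediction unit 32
        self.pronunciation_dict = pronunciation_dict  # candidate word mapping unit 33
        self.thresholds = thresholds                  # used by the judging unit 35

    def detect(self, waveform, score_fn):
        frames = self.feature_extractor(waveform)                    # unit 31
        posteriors = self.phoneme_net(frames)                        # unit 32: N x M matrix
        scores = {kw: score_fn(posteriors, phonemes)                 # unit 34
                  for kw, phonemes in self.pronunciation_dict.items()}
        for keyword in sorted(scores, key=scores.get, reverse=True):  # unit 35
            if scores[keyword] > self.thresholds[keyword]:
                return keyword
        return None
```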
- Embodiment 3 of the present invention discloses a computer system, including:
- one or more processors; and
- a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, cause the one or more processors to perform the method described above.
- the fourth embodiment of the present application provides a computer system, including:
- one or more processors; and
- a memory associated with the one or more processors where the memory is used to store program instructions, and when the program instructions are read and executed by the one or more processors, perform the following operations:
- said outputting the basic phoneme corresponding to each frame of speech feature includes:
- the basic phonemes corresponding to each frame of speech feature are output according to an N ⁇ M matrix; wherein, N is equal to the number of frames of the speech, and M is equal to the number of basic phonemes of the target language.
- the neural network model is obtained through the following steps:
- acquiring a training sample data set, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;
- extracting sample voice features of the sample voice;
- using the sample voice features as input and the sample basic phoneme labeling result corresponding to the sample voice as output to train the neural network model.
- the inputting the voice features into a pre-trained neural network model of the target language by frame, and outputting the basic phonemes corresponding to the voice features of each frame includes:
- the voice features are input into a pre-trained GMM-HMM model of the target language in frames to perform forced alignment on the voice features to obtain at least one basic phoneme corresponding to each frame of voice features.
- the calculating the voice score for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword includes:
- multiple scores are calculated through multiple score calculation strategies, and the multiple scores are merged to obtain a final score.
- the score calculation strategy includes: at least two of dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N ⁇ M matrix space.
- the judging whether a keyword is activated according to the score includes:
- the relationship between each candidate keyword's score and that candidate keyword's pre-defined score threshold is judged in turn, in descending order of score, until a candidate keyword whose score is greater than its pre-defined score threshold is found; the judgment stops after that candidate keyword is activated.
- FIG. 4 exemplarily shows the architecture of the computer system, which may specifically include a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520.
- the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 may be communicatively connected through the communication bus 1530.
- The processor 1510 can be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes relevant programs to implement the technical solutions provided in this application.
- the memory 1520 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory, random access memory), static storage device, dynamic storage device, etc.
- the memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500, and a basic input output system (BIOS) for controlling the low-level operation of the computer system 1500.
- a web browser 1523, a data storage management system 1524, and an icon font processing system 1525 can also be stored.
- the foregoing icon font processing system 1525 may be an application program that specifically implements the foregoing steps in the embodiment of the present application.
- the related program code is stored in the memory 1520 and is called and executed by the processor 1510.
- the input/output interface 1513 is used to connect input/output modules to realize information input and output.
- The input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide the corresponding functions.
- the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and an output device may include a display, a speaker, a vibrator, an indicator light, and the like.
- the network interface 1514 is used to connect a communication module (not shown in the figure) to realize communication interaction between the device and other devices.
- the communication module can realize communication through wired means (such as USB, network cable, etc.), or through wireless means (such as mobile network, WIFI, Bluetooth, etc.).
- the bus 1530 includes a path to transmit information between various components of the device (for example, the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).
- the computer system 1500 can also obtain information about specific receiving conditions from the virtual resource object receiving condition information database 1541 for condition judgment, and so on.
- Although the above device only shows the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, and so on, in a specific implementation the device may also include other components necessary for normal operation.
- the above-mentioned device may also include only the components necessary to implement the solution of the present application, and not necessarily include all the components shown in the figure.
- The above provides a detailed introduction to the neural network-based speech keyword detection method, device, and system provided in this application. Specific examples are used herein to illustrate the principles and implementation of this application; the description of the above embodiments is only intended to help in understanding the methods and core ideas of this application. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and scope of application based on the ideas of this application. In summary, the content of this specification should not be construed as a limitation on this application.
- In summary, the present invention achieves the same function in a very simple way: instead of the traditional approach in which multiple keywords require multiple neural network models, multiple keywords now require only one neural network model, so the neural network model can be kept very small. A model of about 10 MB can achieve excellent performance, is suitable for deployment in embedded devices, and occupies very few resources while completing the function.
- Keywords can be configured arbitrarily; there is no need to re-collect data for specific keywords and retrain the model, and likewise no need to retrain the model when keywords are modified, which removes the troublesome step of collecting keyword-specific corpora and saves the time required for model retraining.
Abstract
A neural network-based voice keyword detection method and device, and a system. The method comprises the following steps: receiving a voice to be detected, and extracting voice features of the voice (S21); inputting the voice features into a pre-trained neural network model of a target language by frames, and outputting a basic phoneme corresponding to each frame of voice features (S22); mapping each preset candidate keyword to the corresponding basic phoneme (S23); according to the basic phoneme of the voice features and the basic phoneme of the candidate keyword, calculating the score of the voice being each candidate keyword (S24); and determining whether a keyword is activated according to the score (S25). The voice keyword detection method saves system resources and reduces the time and costs required for retraining a model.
Description
The invention belongs to the technical field of computer speech recognition and specifically relates to a neural network-based method, device, and system for detecting speech keywords.
For the task of speech keyword detection, the traditional method is to introduce a complete speech-recognition decoder, decode the input speech in which keywords are to be detected, generate multiple candidate results, and save them in some form, such as a Lattice structure; an inverted index is then generated, and this inverted index is quickly searched to determine whether the speech to be detected contains the specified keywords. Because multiple candidates can all be represented in the Lattice, this Lattice-based keyword strategy generally has a high recall rate. The disadvantage is that it is too complicated: it requires introducing the entire recognition system and processing complex Lattices, and generating the inverted index generally involves Finite-State Transducer (FST) operations, which are difficult to master, so the deployment complexity is also high.
Under the latest neural network-based keyword detection frameworks, a neural network is generally built for each keyword, and each network judges whether its keyword is activated by accumulating the output score of each frame. Building a separate neural network for each keyword has two problems: first, a large amount of speech containing that keyword is needed to train the model, and collecting the data is very troublesome; second, when a keyword is added or modified, the data must be collected again and the model retrained, so the whole process is also very complicated. Moreover, such models generally have a high false-alarm rate, and the system is often activated by mistake in undesired situations.
Summary of the invention
In view of the excessive complexity of the prior art, the present invention proposes a neural network-based method, device, and system for voice keyword detection. The present invention can reduce the network model resources required by a keyword retrieval system; on the other hand, the model does not need to be retrained when keywords are modified, which saves the time and cost of retraining the model.
One aspect of the present invention discloses a neural network-based voice keyword detection method, which includes the following steps:
receiving the voice to be detected, and extracting voice features of the voice;
inputting the voice features, frame by frame, into a pre-trained neural network model of the target language, and outputting the basic phonemes corresponding to each frame of voice features;
mapping each preset candidate keyword to its corresponding basic phonemes;
calculating, according to the basic phonemes of the voice features and the basic phonemes of the candidate keywords, the score of the voice for each candidate keyword; and
determining, according to the scores, whether any keyword is activated.
Preferably, outputting the basic phoneme corresponding to each frame of voice features includes:
outputting the basic phonemes corresponding to each frame of voice features in the form of an N×M matrix, where N equals the number of frames of the voice and M equals the number of basic phonemes of the target language.
Preferably, outputting the basic phoneme corresponding to each frame of voice features includes:
Preferably, the neural network model is obtained through the following steps:
acquiring a training sample data set, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;
extracting sample voice features of the sample voice;
using the sample voice features as input and the sample basic phoneme labeling result corresponding to the sample voice as output to train the neural network model.
Preferably, inputting the voice features, frame by frame, into a pre-trained neural network model of the target language and outputting the basic phonemes corresponding to each frame of voice features includes:
inputting the voice features, frame by frame, into a pre-trained GMM-HMM model of the target language to force-align the voice features, obtaining at least one basic phoneme corresponding to each frame of voice features.
Preferably, calculating, according to the basic phonemes of the voice features and the basic phonemes of the candidate keywords, the score of the voice for each candidate keyword includes:
calculating multiple scores through multiple score calculation strategies according to the basic phonemes of the voice features and the basic phonemes of the candidate keywords, and merging the multiple scores to obtain a final score.
Preferably, the score calculation strategy includes at least two of: dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N×M matrix space.
Preferably, determining, according to the scores, whether any keyword is activated includes:
judging, in descending order of score, the relationship between each candidate keyword's score and that candidate keyword's pre-defined score threshold, until a candidate keyword whose score is greater than its pre-defined score threshold is found, and stopping the judgment after activating that candidate keyword.
Another aspect of the present invention discloses a neural network-based voice keyword detection device, the device including:
a voice feature extraction unit, used to receive the voice to be detected and extract the voice features of the voice;
a basic phoneme prediction unit, used to input the voice features, frame by frame, into a pre-trained neural network model of the target language and to output the basic phonemes corresponding to each frame of voice features;
a candidate word mapping unit, used to map each candidate keyword to its corresponding basic phonemes;
a score calculation unit, used to calculate, according to the basic phonemes of the voice features and the basic phonemes of the candidate keywords, the score of the voice for each candidate keyword; and
a judging unit, used to determine, according to the scores, whether any keyword is activated.
Preferably, the basic phoneme prediction unit is used to output the basic phonemes corresponding to each frame of voice features in the form of an N×M matrix, where N equals the number of frames of the voice and M equals the number of basic phonemes of the target language.
Another aspect of the present invention discloses a computer system, including:
one or more processors; and
a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, cause the one or more processors to perform the method described above.
According to the specific embodiments provided in this application, this application discloses the following technical effects:
1. For different keywords, there is no need to train different neural network models; a single model can complete the detection of all keywords. Under the traditional strategy, each keyword requires its own neural network model, which consumes a great deal of resources.
2. When keywords are modified, the model does not need to be retrained either; only the corresponding phoneme sequences need to be changed. Under the traditional strategy, whenever a keyword is modified, the model must be retrained with speech specific to that keyword, whereas the present invention only needs to train the network once with speech covering all phonemes of the target language, which greatly reduces the cost of retraining the model, is simple to operate, and is convenient to deploy.
The product of the present invention only needs to achieve one of the above-mentioned effects.
The features and advantages of the present invention will become clear with reference to the following drawings and the detailed description of specific embodiments of the present invention.
Figure 1 is a flowchart of the voice keyword detection method of the present invention;
Figure 2 is a flowchart of the method of Embodiment 1 of the present invention;
Figure 3 is a structural diagram of the device of Embodiment 2 of the present invention;
Figure 4 is a structural diagram of the computer system of the present invention.
In order to make the technical solution of the present invention clearer, it is further described below in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.
The invention uses a neural network-based approach to solve the task of voice keyword detection. In particular, the modeling units of the neural network of the present invention are not complete keywords or the individual characters in a keyword, but the basic phoneme units of the language to which the keywords belong. For example, for Chinese, the output nodes of the neural network of the present invention model all the initials and finals of Hanyu Pinyin, and the desired keywords are spliced together through sequence combinations of initials and finals.
In addition, because the neural network of the present invention is relatively small, the scores obtained for the same voice through multiple neural networks can be further fused, which further improves performance, makes the score better reflect the keyword confidence, increases the recall rate of the keyword detection system, and reduces false alarms.
Figure 1 shows a flowchart of the voice keyword detection method of the present invention. As shown in Figure 1, the method can be divided into two parts: training a neural network model, and using the trained neural network model to detect voice keywords.
Training the neural network model includes the following steps.
Step 1: Obtain a sample training set, including the sample speech used for training and the sample basic phoneme labeling result of that speech. For the target language, collect a certain amount of annotated speech, preferably a speech training set of more than 500 hours.
Step 2: Extract sample voice features.
Step 3: Train the neural network model. Use the sample speech with basic phoneme annotations to train the GMM-HMM model required for speech recognition, and use this model to force-align the speech, obtaining for each frame of the feature-extracted speech which basic phoneme (or phonemes) of the target language it belongs to (if a frame belongs to multiple basic phonemes, the probabilities of those phonemes sum to 1). In practice, the phoneme sequence corresponding to a sentence can be obtained by mapping through an existing dictionary resource, but the phonemes of a specific frame cannot be determined directly; therefore a GMM-HMM model is trained, and this model is used to obtain the phoneme information of each frame.
The output nodes of the neural network represent the basic phonemes of the target language, so the number of output nodes can simply equal the total number of basic phonemes of that language. For Chinese, for example, it can be the number of all initials plus all finals; for English, it is the number of International Phonetic Alphabet symbols. The scheme is also extensible: for a tonal language such as Chinese, the finals can carry tones, five in total (the four tones plus the neutral tone), so the total number of nodes becomes the number of initials plus five times the number of finals. In addition, some extra nodes can be added to absorb parts of the speech that do not belong to any phoneme, such as noise, abnormal sounds, coughing, and so on.
The neural network model of the present invention does not model complete keywords or the individual characters in a keyword, but the basic phoneme units of the language to which the keywords belong. For example, for Chinese, the output nodes of the neural network of the present invention model all the initials and finals of Hanyu Pinyin, and the desired keywords are spliced together through sequence combinations of initials and finals.
For example, if the keyword is "小伙小伙" ("xiaohuo xiaohuo"), the corresponding initial-plus-final sequence combination is "xiao3 huo3 xiao3 huo3". A language generally has no more than about 100 basic phoneme units; even for a tonal language such as Chinese, counting the tone variants, the total number of modeling units is generally no more than 500, so the neural network model does not become too large and can conveniently be deployed in embedded devices such as mobile phones and cameras. The network can be a simple fully connected feedforward neural network or a more complex network, such as a time-delay neural network, a convolutional neural network, or a recurrent neural network; all of these fall within the protection scope of the present invention.
利用训练好的神经网络模型对语音关键词进行检测包括以下步骤:Using the trained neural network model to detect speech keywords includes the following steps:
步骤四:接收用户输入的待检测的语音信息,提取该语音的语音特征。Step 4: Receive the voice information to be detected input by the user, and extract the voice features of the voice.
步骤五:将语音特征按帧输入上述步骤训练好的神经网络模型中,输出对应的音素。对于每一帧,神经网络都会得到一个网络输出节点个数大小的向量。假设语音共有N帧,网络输出节点是M个,那么就会得到一个N×M大小的音素分布矩阵。Step 5: Input the speech features into the neural network model trained in the above step by frame, and output the corresponding phonemes. For each frame, the neural network will get a vector of the number of network output nodes. Assuming there are N frames of speech and M output nodes of the network, a phoneme distribution matrix of size N×M will be obtained.
对应不同的目标语种,N、M个数不同。Corresponding to different target languages, the number of N and M are different.
步骤六:计算每一候选关键词得分即计算上述N×M矩阵为每一候选关键词可能的得分。将每个候选关键词通过其发音词典映射成一个音素序列,由于每个音素都能对应到网络输出的一个节点,从而可以计算出该候选关键词的音素序列在N×M矩阵里的得分。这种得分方式包括但不限于动态规划、受限制约束的最长序列得分、或者是在N×M矩阵空间里暴力穷举后的最优路径得分。为方便讨论,将此过程可能用到的所有得分计算方法统一称为“得分计算策略”。Step 6: Calculate the score of each candidate keyword, that is, calculate the above N×M matrix as the possible score of each candidate keyword. Each candidate keyword is mapped into a phoneme sequence through its pronunciation dictionary. Since each phoneme can correspond to a node of the network output, the score of the phoneme sequence of the candidate keyword in the N×M matrix can be calculated. This scoring method includes, but is not limited to, dynamic programming, the longest sequence score subject to constraints, or the optimal path score after brute force exhaustion in the N×M matrix space. For the convenience of discussion, all score calculation methods that may be used in this process are collectively referred to as "score calculation strategy".
本发明可以训练多个用于得分计算的神经网络,对于一个候选关键词,在不同的得分计算神经网络中采用不同的“得分计算策略”获得多个得分,这些得分可以采用不同的方法融合,如加权平均等,以获得更好的得分表示。The present invention can train multiple neural networks for score calculation. For a candidate keyword, different score calculation neural networks adopt different "score calculation strategies" to obtain multiple scores. These scores can be merged using different methods. Such as weighted average, etc., to get a better score representation.
需要注意的是,由于候选关键词一定可以映射为音素序列,因此任意候选关键词在步骤六里都一定可以计算出得分,而且神经网络也不需要被重新训练。另外,由于这里只考虑候选关键词的音素序列,因此发音相同但字不同的候选关键词是等价对待的。It should be noted that because candidate keywords must be mapped to phoneme sequences, any candidate keywords must be scored in step 6, and the neural network does not need to be retrained. In addition, since only the phoneme sequence of candidate keywords is considered here, candidate keywords with the same pronunciation but different words are treated equally.
步骤七:判断是否有候选关键词激活。在候选关键词候选集合中,挑选得分最大的候选关键词,如果该得分超过此候选关键词预先定义的门限,则该候选关键词激活;否则,考虑得分次大的候选关键词,其得分是否超过此候选关键词预先定义的门限。依次进行下去。只要有一个候选关键词被激活,则返回候选关键词被激活的控制信息,完成一句话的识别。若所有候选关键词得分均低于门限,则返回没有候选关键词被激活。整个流程结束。以一个手机上金融支付的app为例,在打开该app后,用户说出“打开收钱码”、“打开付钱码”,系统根据用户的语音判断接收到特定的关键词,之后自动打开对应的二维码供 用户使用。Step 7: Determine whether there are candidate keywords to activate. In the candidate keyword candidate set, select the candidate keyword with the largest score. If the score exceeds the threshold defined by the candidate keyword, the candidate keyword will be activated; otherwise, consider the candidate keyword with the second highest score. Exceeds the pre-defined threshold of this candidate keyword. Continue in order. As long as one candidate keyword is activated, the control information that the candidate keyword is activated is returned to complete the recognition of a sentence. If the scores of all candidate keywords are lower than the threshold, it returns that no candidate keywords are activated. The whole process ends. Take a financial payment app on a mobile phone as an example. After opening the app, the user says "open the payment code" and "open the payment code", the system judges that it receives a specific keyword based on the user's voice, and then automatically opens it The corresponding QR code is for users to use.
因为本例是中文场景,因此首先收集一定量的中文语料,网上很容易找到500小时以上标注好的中文语料。利用开源工具,训练中文的GMM-HMM模型,用训练好的模型对这批中文语料做进一步的强制对齐,得到中文音素,也就是汉语拼音级别的标注,即每一帧的音素信息。Because this example is a Chinese scene, a certain amount of Chinese corpus is collected first. It is easy to find Chinese corpus marked with more than 500 hours on the Internet. Use open source tools to train the Chinese GMM-HMM model, and use the trained model to further compulsively align the batch of Chinese corpus to obtain the Chinese phoneme, which is the label of the Hanyu Pinyin level, that is, the phoneme information of each frame.
接下来,利用音素级的标注和语料,训练一个或者多个神经网络,可以取全连接的前馈神经网络和时延神经网络,网络输出节点即是音素总的个数。这样,神经网络就算训练完成了。将神经网络资源离线保存,和手机app打包在一起部署在手机上,在app被打开时加载进手机的内存。App中同时还存储有语音特征的提取策略、候选关键词集合如“打开收钱码”、“打开付钱码”等。Next, use phoneme-level annotations and corpus to train one or more neural networks, which can be fully connected feedforward neural networks and time-delay neural networks. The output nodes of the network are the total number of phonemes. In this way, the neural network is even trained. The neural network resources are saved offline, packaged with the mobile app and deployed on the mobile phone, and loaded into the memory of the mobile phone when the app is opened. At the same time, the App also stores voice feature extraction strategies and candidate keyword sets such as "open money collection code", "open payment code" and so on.
当用户说完一句话时,如“请打开我的收钱码”,手机麦克风采集到这句话的采样点,进行特征提取,送入内存里的神经网络,得到一个音素分布矩阵输出,再计算出该句话的音素分布矩阵与不同候选关键词的得分。对于多神经网络的输出,通过某种策略融合,如加权平均,得到更准确的得分。如根据计算得到用户所说的“请打开我的收钱码”与候选关键词“打开收钱码”的得分为90,与候选关键词“打开付钱码”的得分为40,且候选关键词的阈值均为80,那么按关键词得分从高到低,考察每个关键词得分是否超过此关键词预先设定的门限会发现候选关键词“打开收钱码”的得分超过阈值,则该关键词被激活,利用该激活的关键词执行后续操作即可。When the user finishes saying a sentence, such as "Please open my payment code", the microphone of the mobile phone collects the sampling points of the sentence, performs feature extraction, and sends it to the neural network in the memory to obtain a phoneme distribution matrix output. Calculate the phoneme distribution matrix of the sentence and the scores of different candidate keywords. For the output of multiple neural networks, through a certain strategy fusion, such as weighted average, a more accurate score can be obtained. For example, according to the calculation, the score of "Please open my money collection code" and the candidate keyword "open money collection code" is 90, and the score of the candidate keyword "open payment code" is 40, and the candidate key The threshold of the word is 80, then according to the keyword score from high to low, check whether the score of each keyword exceeds the threshold set in advance for this keyword, and it will be found that the score of the candidate keyword "open the money collection code" exceeds the threshold, then If the keyword is activated, use the activated keyword to perform subsequent operations.
具体来说,将应用支持的所有关键词写入一个文件,由系统内存读取即可。当需要修改或增加关键词时,不需要重新收集语音或重新训练模型,只需要修改文件即可。而一般的关键词策略,都需要用修改后的关键词或新增的关键词语音重新训练模型,而本发明无需此操作,大大节约了成本和时间。Specifically, all keywords supported by the application are written into a single file that the system simply reads into memory. When keywords need to be modified or added, there is no need to collect new speech or retrain the model; only the file has to be edited. Conventional keyword schemes require the model to be retrained with speech for the modified or newly added keywords, whereas the present invention needs no such operation, greatly saving cost and time.
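A possible form of such a keyword file and its loading code is sketched below; the file name, format, and per-keyword thresholds are assumptions made for illustration.

```python
# keywords.txt (hypothetical file, one keyword and its activation threshold per line):
#   打开收钱码 80
#   打开付钱码 80

def load_keywords(path="keywords.txt"):
    """Read the configurable keyword list; changing keywords means editing
    this file, with no model retraining."""
    thresholds = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            keyword, threshold = line.split()
            thresholds[keyword] = float(threshold)
    return thresholds
```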
实施例一Embodiment One
对应上述描述,本申请实施例一公开一种基于神经网络的语音关键词检测方法,如图2所示,包括以下步骤:Corresponding to the above description, the first embodiment of the present application discloses a method for detecting voice keywords based on a neural network, as shown in FIG. 2, including the following steps:
S21、接收待检测的语音,并提取所述语音的语音特征。S21: Receive a voice to be detected, and extract voice features of the voice.
S22、将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素。S22. Input the voice features into a pre-trained neural network model of the target language in frames, and output the basic phonemes corresponding to the voice features of each frame.
具体的该步骤包括:The specific steps include:
将所述语音特征按帧输入预先训练好的目标语种的GMM-HMM模型对所述语音特征做强制对齐,得到每帧语音特征对应的至少一个基础音素,并按照N×M矩阵的方式输出所述每帧语音特征对应的基础音素;其中,N等于所述语音的帧数,M等于所述目标语种的基础音素的个数。The voice features are input frame by frame into a pre-trained GMM-HMM model of the target language, which force-aligns the voice features to obtain at least one basic phoneme corresponding to each frame of voice features; the basic phonemes corresponding to each frame are output as an N×M matrix, where N equals the number of frames of the voice and M equals the number of basic phonemes of the target language.
S23、将预先设置的每一候选关键词映射为对应的基础音素。该步骤可通过发音词典进行映射。S23. Map each preset candidate keyword to its corresponding basic phonemes. This mapping can be performed using a pronunciation dictionary.
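The mapping through a pronunciation dictionary can be pictured as follows; the toy lexicon and its pinyin-style phoneme inventory are illustrative and not taken from the original disclosure.

```python
# A toy pronunciation lexicon with pinyin-style phonemes (illustrative only).
LEXICON = {
    "打": ["d", "a"],
    "开": ["k", "ai"],
    "收": ["sh", "ou"],
    "钱": ["q", "ian"],
    "码": ["m", "a"],
    "付": ["f", "u"],
}

def keyword_to_phonemes(keyword):
    """Map a candidate keyword to its basic phoneme sequence via the lexicon."""
    phones = []
    for char in keyword:
        phones.extend(LEXICON[char])
    return phones

print(keyword_to_phonemes("打开收钱码"))
# ['d', 'a', 'k', 'ai', 'sh', 'ou', 'q', 'ian', 'm', 'a']
```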
S24、根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分。S24. Calculate the score of each candidate keyword for the voice according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword.
具体的,该步骤包括:Specifically, this step includes:
根据所述语音特征的基础音素和所述候选关键词的基础音素,通过多种得分计算策略计算获得多个得分,对多个得分进行融合得到最终得分。According to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword, multiple scores are calculated through multiple score calculation strategies, and the multiple scores are merged to obtain a final score.
优选的,所述得分计算策略包括:动态规划、受限制约束的最长序列得分、在N×M矩阵空间里暴力穷举后的最优路径得分中的至少两个。Preferably, the score calculation strategy includes: at least two of dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N×M matrix space.
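As one possible instantiation of the dynamic-programming strategy over the N×M matrix (the patent does not fix the exact recurrence), the sketch below scores the best monotonic alignment of a keyword's phoneme sequence against the per-frame phoneme posteriors; the smoothing constant and the length normalization are assumptions.

```python
import numpy as np

def keyword_dp_score(posteriors, phone_ids):
    """Best monotonic alignment of a keyword's phoneme id sequence to the
    frame axis of an N x M posterior matrix; each phoneme covers at least
    one frame, and the result is length-normalized."""
    N, _ = posteriors.shape
    K = len(phone_ids)
    logp = np.log(posteriors + 1e-10)          # small constant avoids log(0)
    dp = np.full((N, K), -np.inf)
    dp[0, 0] = logp[0, phone_ids[0]]
    for t in range(1, N):
        for k in range(K):
            stay = dp[t - 1, k]                                 # stay on phoneme k
            move = dp[t - 1, k - 1] if k > 0 else -np.inf       # advance from k-1
            dp[t, k] = max(stay, move) + logp[t, phone_ids[k]]
    return dp[N - 1, K - 1] / N

posts = np.random.dirichlet(np.ones(218), size=100)   # 100 frames, 218 phonemes
print(keyword_dp_score(posts, phone_ids=[3, 17, 52, 9]))
```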
S25、根据所述得分判断是否有关键词被激活。S25: Determine whether any keyword is activated according to the score.
具体的,可按照得分从大到小的顺序,依次判断候选关键词得分与该候选关键词预先定义的得分阈值的关系,直至判断到有候选关键词的得分大于该候选关键词预先定义的得分阈值,将该候选关键词激活后停止判断。Specifically, the candidate keywords may be examined in descending order of score, comparing each candidate keyword's score with its pre-defined score threshold, until a candidate keyword whose score is greater than its pre-defined score threshold is found; that candidate keyword is activated and the checking stops.
其中,上述的神经网络模型可通过如下步骤获得:Among them, the above-mentioned neural network model can be obtained through the following steps:
获取训练用样本数据集,所述样本数据集包括样本语音以及与所述样本语音对应的样本基础音素标注结果;Acquiring a sample data set for training, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;
提取所述样本语音的样本语音特征;Extracting sample voice features of the sample voice;
将所述样本语音特征作为输入,将所述样本语音对应的样本基础音素标注结果作为输出训练神经网络模型。The sample voice feature is used as an input, and the sample basic phoneme labeling result corresponding to the sample voice is used as an output to train a neural network model.
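Preparing the frame-level training pairs from such a sample data set might look like the following sketch; the feature dimension, phoneme count, and utterance sizes are illustrative.

```python
import numpy as np

def build_training_pairs(utterances):
    """Flatten (features, frame_labels) utterances into per-frame pairs:
    each row of X is one frame's feature vector, y[i] its phoneme id."""
    X = np.concatenate([feats for feats, _ in utterances])
    y = np.concatenate([labels for _, labels in utterances])
    return X, y

# two hypothetical utterances with 40-dimensional features and 218 phoneme ids
utts = [(np.random.randn(120, 40), np.random.randint(0, 218, 120)),
        (np.random.randn(95, 40), np.random.randint(0, 218, 95))]
X, y = build_training_pairs(utts)   # X: (215, 40), y: (215,)
```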
实施例二Embodiment Two
对应上述方法,本申请实施例二还公开一种基于神经网络的语音关键词检测装置,如图3所示,所述装置包括:Corresponding to the above method, the second embodiment of the present application also discloses a voice keyword detection device based on a neural network. As shown in FIG. 3, the device includes:
语音特征提取单元31,用于接收待检测的语音,并提取所述语音的语音特征。The voice feature extraction unit 31 is used to receive the voice to be detected and extract the voice feature of the voice.
基础音素预测单元32,用于将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素。The basic phoneme prediction unit 32 is configured to input the voice features into a pre-trained neural network model of the target language according to frames, and output the basic phonemes corresponding to the voice features of each frame.
具体的该基础音素预测单元32用于:Specifically, the basic phoneme prediction unit 32 is used for:
将所述语音特征按帧输入预先训练好的目标语种的GMM-HMM模型对所述语音特征做强制对齐,得到每帧语音特征对应的至少一个基础音素,并按照N×M矩阵的方式输出所述每帧语音特征对应的基础音素;其中,N等于所述语音的帧数,M等于所述目标语种的基础音素的个数。The voice features are input frame by frame into a pre-trained GMM-HMM model of the target language, which force-aligns the voice features to obtain at least one basic phoneme corresponding to each frame of voice features; the basic phonemes corresponding to each frame are output as an N×M matrix, where N equals the number of frames of the voice and M equals the number of basic phonemes of the target language.
候选词映射单元33,用于将每一候选关键词映射为对应的基础音素。具体可通过发音词典进行映射。The candidate word mapping unit 33 is used to map each candidate keyword to a corresponding basic phoneme. Specifically, it can be mapped through a pronunciation dictionary.
得分计算单元34,用于根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分。The score calculation unit 34 is configured to calculate the score of each candidate keyword for the voice according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword.
具体的,该得分计算单元34用于根据所述语音特征的基础音素和所述候选关键词的基础音素,通过多种得分计算策略计算获得多个得分,对多个得分进行融合得到最终得分。Specifically, the score calculation unit 34 is configured to calculate multiple scores through multiple score calculation strategies based on the basic phonemes of the voice feature and the basic phonemes of the candidate keywords, and merge the multiple scores to obtain a final score.
其中,所述得分计算策略包括:动态规划、受限制约束的最长序列得分、在N×M矩阵空间里暴力穷举后的最优路径得分中的至少两个。Wherein, the score calculation strategy includes at least two of dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N×M matrix space.
判断单元35,用于根据所述得分判断是否有关键词被激活。The judging unit 35 is used for judging whether a keyword is activated according to the score.
具体的,判断单元35用于按照得分从大到小的顺序,依次判断候选关键词得分与该候选关键词预先定义的得分阈值的关系,直至判断到有候选关键词的得分大于该候选关键词预先定义的得分阈值,将该候选关键词激活后停止判断。Specifically, the judging unit 35 is configured to examine the candidate keywords in descending order of score, comparing each candidate keyword's score with its pre-defined score threshold, until a candidate keyword whose score is greater than its pre-defined score threshold is found; that candidate keyword is activated and the checking stops.
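The cooperation of the five units can be pictured with the following skeleton; the constructor arguments are assumed callables standing in for the units described above, not interfaces defined by the original disclosure.

```python
class KeywordSpotter:
    """Sketch of the device in Embodiment Two; all components are assumed callables."""

    def __init__(self, feature_extractor, phoneme_net, lexicon, scorer, thresholds):
        self.feature_extractor = feature_extractor   # voice feature extraction unit 31
        self.phoneme_net = phoneme_net               # basic phoneme prediction unit 32
        self.lexicon = lexicon                       # candidate word mapping unit 33
        self.scorer = scorer                         # score calculation unit 34
        self.thresholds = thresholds                 # thresholds used by judging unit 35

    def detect(self, audio, keywords):
        feats = self.feature_extractor(audio)                 # frame-level features
        posteriors = self.phoneme_net(feats)                  # N x M phoneme matrix
        scores = {kw: self.scorer(posteriors, self.lexicon(kw)) for kw in keywords}
        for kw in sorted(scores, key=scores.get, reverse=True):
            if scores[kw] > self.thresholds[kw]:
                return kw                                     # activated keyword
        return None                                           # no keyword activated
```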
实施例三Embodiment Three
对应上述方法,本发明实施例三公开一种计算机系统,包括:Corresponding to the foregoing method, Embodiment 3 of the present invention discloses a computer system, including:
一个或多个处理器;以及One or more processors; and
与所述一个或多个处理器关联的存储器,所述存储器用于存储程序指令,所述程序指令在被所述一个或多个处理器读取执行时,执行一种终端,包括存储器和处理器,处理器读取存储器中存储的计算机程序指令,从而使处理器执行如上所述的方法。A memory associated with the one or more processors, the memory being used to store program instructions; when the program instructions are read and executed by the one or more processors, they realize a terminal including a memory and a processor, where the processor reads the computer program instructions stored in the memory so that the processor performs the method described above.
本申请实施例四提供一种计算机系统,包括:The fourth embodiment of the present application provides a computer system, including:
一个或多个处理器;以及One or more processors; and
与所述一个或多个处理器关联的存储器,所述存储器用于存储程序指令,所述程序指令在被所述一个或多个处理器读取执行时,执行如下操作:A memory associated with the one or more processors, where the memory is used to store program instructions, and when the program instructions are read and executed by the one or more processors, perform the following operations:
接收待检测的语音,并提取所述语音的语音特征;Receiving the voice to be detected, and extracting voice features of the voice;
将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素;Inputting the voice features into a pre-trained neural network model of the target language by frame, and outputting the basic phonemes corresponding to the voice features of each frame;
将预先设置的每一候选关键词映射为对应的基础音素;Map each candidate keyword set in advance to the corresponding basic phoneme;
根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分;Calculating the voice score for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword;
根据所述得分判断是否有关键词被激活。Determine whether any keywords are activated according to the score.
优选的,所述输出所述每帧语音特征对应的基础音素包括:Preferably, said outputting the basic phoneme corresponding to each frame of speech feature includes:
按照N×M矩阵的方式输出所述每帧语音特征对应的基础音素;其中,N等于所述语音的帧数,M等于所述目标语种的基础音素的个数。The basic phonemes corresponding to each frame of speech feature are output according to an N×M matrix; wherein, N is equal to the number of frames of the speech, and M is equal to the number of basic phonemes of the target language.
优选的,所述神经网络模型通过如下步骤获得:Preferably, the neural network model is obtained through the following steps:
获取训练用样本数据集,所述样本数据集包括样本语音以及与所述样本语音对应的样本基础音素标注结果;Acquiring a sample data set for training, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;
提取所述样本语音的样本语音特征;Extracting sample voice features of the sample voice;
将所述样本语音特征作为输入,将所述样本语音对应的样本基础音素标注结果作为输出训练神经网络模型。The sample voice feature is used as an input, and the sample basic phoneme labeling result corresponding to the sample voice is used as an output to train a neural network model.
优选的,所述将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素包括:Preferably, the inputting the voice features into a pre-trained neural network model of the target language by frame, and outputting the basic phonemes corresponding to the voice features of each frame includes:
将所述语音特征按帧输入预先训练好的目标语种的GMM-HMM模型对所述语音特征做强制对齐,得到每帧语音特征对应的至少一个基础音素。The voice features are input into a pre-trained GMM-HMM model of the target language in frames to perform forced alignment on the voice features to obtain at least one basic phoneme corresponding to each frame of voice features.
优选的,所述根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分包括:Preferably, the calculating the voice score for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword includes:
根据所述语音特征的基础音素和所述候选关键词的基础音素,通过多种得分计算策略计算获得多个得分,对多个得分进行融合得到最终得分。According to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword, multiple scores are calculated through multiple score calculation strategies, and the multiple scores are merged to obtain a final score.
优选的,所述得分计算策略包括:动态规划、受限制约束的最长序列得分、在N×M矩阵空间里暴力穷举后的最优路径得分中的至少两个。Preferably, the score calculation strategy includes: at least two of dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N×M matrix space.
优选的,所述根据所述得分判断是否有关键词被激活包括:Preferably, the judging whether a keyword is activated according to the score includes:
按照得分从大到小的顺序,依次判断候选关键词得分与该候选关键词预先定义的得分阈值的关系,直至判断到有候选关键词的得分大于该候选关键词预先定义的得分阈值,将该候选关键词激活后停止判断。The candidate keywords are examined in descending order of score, comparing each candidate keyword's score with its pre-defined score threshold, until a candidate keyword whose score is greater than its pre-defined score threshold is found; that candidate keyword is activated and the checking stops.
其中,图4示例性的展示出了计算机系统的架构,具体可以包括处理器1510,视频显示适配器1511,磁盘驱动器1512,输入/输出接口1513,网络接口1514,以及存储器1520。上述处理器1510、视频显示适配器1511、磁盘驱动器1512、输入/输出接口1513、网络接口1514,与存储器1520之间可以通过通信总线1530进行通信连接。Among them, FIG. 4 exemplarily shows the architecture of the computer system, which may specifically include a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 may be communicatively connected through the communication bus 1530.
其中,处理器1510可以采用通用的CPU(Central Processing Unit,中央处理器)、微处理器、应用专用集成电路(Application Specific Integrated Circuit,ASIC)、或者一个或多个集成电路等方式实现,用于执行相关程序,以实现本申请所提供的技术方案。The processor 1510 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in this application.
存储器1520可以采用ROM(Read Only Memory,只读存储器)、RAM(Random Access Memory,随机存取存储器)、静态存储设备,动态存储设备等形式实现。存储器1520可以存储用于控制计算机系统1500运行的操作系统1521,用于 控制计算机系统1500的低级别操作的基本输入输出系统(BIOS)。另外,还可以存储网页浏览器1523,数据存储管理系统1524,以及图标字体处理系统1525等等。上述图标字体处理系统1525就可以是本申请实施例中具体实现前述各步骤操作的应用程序。总之,在通过软件或者固件来实现本申请所提供的技术方案时,相关的程序代码保存在存储器1520中,并由处理器1510来调用执行。The memory 1520 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory, random access memory), static storage device, dynamic storage device, etc. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500, and a basic input output system (BIOS) for controlling the low-level operation of the computer system 1500. In addition, a web browser 1523, a data storage management system 1524, and an icon font processing system 1525 can also be stored. The foregoing icon font processing system 1525 may be an application program that specifically implements the foregoing steps in the embodiment of the present application. In short, when the technical solution provided by the present application is implemented through software or firmware, the related program code is stored in the memory 1520 and is called and executed by the processor 1510.
输入/输出接口1513用于连接输入/输出模块,以实现信息输入及输出。输入/输出模块可以作为组件配置在设备中(图中未示出),也可以外接于设备以提供相应功能。其中输入设备可以包括键盘、鼠标、触摸屏、麦克风、各类传感器等,输出设备可以包括显示器、扬声器、振动器、指示灯等。The input/output interface 1513 is used to connect input/output modules to realize information input and output. The input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide corresponding functions. The input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output device may include a display, a speaker, a vibrator, an indicator light, and the like.
网络接口1514用于连接通信模块(图中未示出),以实现本设备与其他设备的通信交互。其中通信模块可以通过有线方式(例如USB、网线等)实现通信,也可以通过无线方式(例如移动网络、WIFI、蓝牙等)实现通信。The network interface 1514 is used to connect a communication module (not shown in the figure) to realize communication interaction between the device and other devices. The communication module can realize communication through wired means (such as USB, network cable, etc.), or through wireless means (such as mobile network, WIFI, Bluetooth, etc.).
总线1530包括一通路,在设备的各个组件(例如处理器1510、视频显示适配器1511、磁盘驱动器1512、输入/输出接口1513、网络接口1514,与存储器1520)之间传输信息。The bus 1530 includes a path to transmit information between various components of the device (for example, the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).
另外,该计算机系统1500还可以从虚拟资源对象领取条件信息数据库1541中获得具体领取条件的信息,以用于进行条件判断,等等。In addition, the computer system 1500 can also obtain information about specific receiving conditions from the virtual resource object receiving condition information database 1541 for condition judgment, and so on.
需要说明的是,尽管上述设备仅示出了处理器1510、视频显示适配器1511、磁盘驱动器1512、输入/输出接口1513、网络接口1514,存储器1520,总线1530等,但是在具体实施过程中,该设备还可以包括实现正常运行所必需的其他组件。此外,本领域的技术人员可以理解的是,上述设备中也可以仅包含实现本申请方案所必需的组件,而不必包含图中所示的全部组件。It should be noted that although the above device only shows the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, and so on, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may also contain only the components necessary to implement the solution of the present application, and need not contain all the components shown in the figure.
通过以上的实施方式的描述可知,本领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件平台的方式来实现。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,云服务器,或者网络设备等)执行本申请各个实施例或者实施例的某些部分所述的方法。From the description of the foregoing implementations, those skilled in the art can clearly understand that the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions to enable a computer device (which may be a personal computer, a cloud server, a network device, or the like) to execute the methods described in the various embodiments, or in parts of the embodiments, of the present application.
本说明书中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于系统或系统实施例而言,由于其基本相似于方法实施例,所以描述得比较简单,相关之处参见方法实施例的部分说明即可。以上所描述的系统及系统实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性劳动的情况下,即可以理解并实施。The embodiments in this specification are described in a progressive manner; for the parts that are the same or similar between embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the system or system embodiment is basically similar to the method embodiment, its description is relatively brief, and for the relevant points reference may be made to the description of the method embodiment. The systems and system embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
以上对本申请所提供的基于神经网络的语音关键词检测方法、装置及系统,进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处。综上所述,本说明书内容不应理解为对本申请的限制。综上所述,本发明采用非常简单的方式实现同样的功能,本发明将传统的多个关键词需要多个神经网络模型,改变为多个关键词只需要1个神经网络模型,可以将神经网络的尺寸做的非常小,模型10M即可取得非常优异的性能,从而适合在嵌入式设备部署,占用非常低的资源完成功能。另外,关键词可以任意配置,不需要重新针对特定的关键词搜集数据并且重新训练模型;同时,修改关键词时不需要重新训练模型,减少了麻烦的搜集特定关键词语料的步骤,节约了模型重新训练所需要的时间。The neural-network-based voice keyword detection method, device, and system provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, those of ordinary skill in the art may, based on the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application. To sum up, the present invention achieves the same function in a very simple way: whereas multiple keywords traditionally require multiple neural network models, in the present invention multiple keywords need only one neural network model. The network can be made very small; a model of about 10M in size already achieves excellent performance, making it suitable for deployment on embedded devices and able to perform its function with very low resource usage. In addition, keywords can be configured arbitrarily, without collecting data for specific keywords again or retraining the model; likewise, modifying keywords does not require retraining the model, which removes the troublesome step of collecting corpus for specific keywords and saves the time needed for retraining.
以上所述仅为本发明的优选实施例,并非因此限制本发明的专利范围,凡是在本发明的构思下,利用本发明说明书及附图内容所作的等效结构变换,或直接/间接运用在其他相关的技术领域均包括在本发明的专利保护范围内。The above are only preferred embodiments of the present invention and do not therefore limit its patent scope; any equivalent structural transformation made under the concept of the present invention using the contents of the specification and drawings, or any direct or indirect application in other related technical fields, is included in the scope of patent protection of the present invention.
Claims (10)
- 一种基于神经网络的语音关键词检测方法,其特征在于,包括以下步骤:A method for detecting speech keywords based on neural network is characterized in that it comprises the following steps:接收待检测的语音,并提取所述语音的语音特征;Receiving the voice to be detected, and extracting voice features of the voice;将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素;Inputting the voice features into a pre-trained neural network model of the target language by frame, and outputting the basic phonemes corresponding to the voice features of each frame;将预先设置的每一候选关键词映射为对应的基础音素;Map each candidate keyword set in advance to the corresponding basic phoneme;根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分;Calculating the voice score for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword;根据所述得分判断是否有关键词被激活。Determine whether any keywords are activated according to the score.
- 如权利要求1所述的方法,其特征在于,所述输出所述每帧语音特征对应的基础音素包括:The method according to claim 1, wherein said outputting the basic phoneme corresponding to the speech feature of each frame comprises:按照N×M矩阵的方式输出所述每帧语音特征对应的基础音素;其中,N等于所述语音的帧数,M等于所述目标语种的基础音素的个数。The basic phonemes corresponding to each frame of speech feature are output according to an N×M matrix; wherein, N is equal to the number of frames of the speech, and M is equal to the number of basic phonemes of the target language.
- 如权利要求1所述的方法,其特征在于,所述神经网络模型通过如下步骤获得:The method of claim 1, wherein the neural network model is obtained through the following steps:获取训练用样本数据集,所述样本数据集包括样本语音以及与所述样本语音对应的样本基础音素标注结果;Acquiring a sample data set for training, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;提取所述样本语音的样本语音特征;Extracting sample voice features of the sample voice;将所述样本语音特征作为输入,将所述样本语音对应的样本基础音素标注结果作为输出训练神经网络模型。The sample voice feature is used as an input, and the sample basic phoneme labeling result corresponding to the sample voice is used as an output to train a neural network model.
- 如权利要求1所述的方法,其特征在于,所述将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素包括:The method according to claim 1, wherein said inputting said voice features into a pre-trained neural network model of the target language by frame, and outputting the basic phonemes corresponding to said voice features of each frame comprises:将所述语音特征按帧输入预先训练好的目标语种的GMM-HMM模型对所 述语音特征做强制对齐,得到每帧语音特征对应的至少一个基础音素。The voice features are input into a pre-trained GMM-HMM model of the target language in frames, and the voice features are forcibly aligned to obtain at least one basic phoneme corresponding to each frame of voice features.
- 如权利要求1所述的方法,其特征在于,所述根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分包括:The method of claim 1, wherein the calculating the voice score for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword comprises:根据所述语音特征的基础音素和所述候选关键词的基础音素,通过多种得分计算策略计算获得多个得分,对多个得分进行融合得到最终得分。According to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword, multiple scores are calculated through multiple score calculation strategies, and the multiple scores are merged to obtain a final score.
- 如权利要求5所述的方法,其特征在于,所述得分计算策略包括:动态规划、受限制约束的最长序列得分、在N×M矩阵空间里暴力穷举后的最优路径得分中的至少两个。The method according to claim 5, wherein the score calculation strategy comprises at least two of: dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute-force exhaustion in the N×M matrix space.
- 如权利要求1-6所述的方法,其特征在于,所述根据所述得分判断是否有关键词被激活包括:The method according to any one of claims 1 to 6, wherein the judging whether a keyword is activated according to the score comprises:按照得分从大到小的顺序,依次判断候选关键词得分与该候选关键词预先定义的得分阈值的关系,直至判断到有候选关键词的得分大于该候选关键词预先定义的得分阈值,将该候选关键词激活后停止判断。Examining the candidate keywords in descending order of score, comparing each candidate keyword's score with its pre-defined score threshold, until a candidate keyword whose score is greater than its pre-defined score threshold is found, activating that candidate keyword, and then stopping the checking.
- 一种基于神经网络的语音关键词检测装置,其特征在于,所述装置包括:A voice keyword detection device based on neural network, characterized in that, the device comprises:语音特征提取单元,用于接收待检测的语音,并提取所述语音的语音特征;The voice feature extraction unit is used to receive the voice to be detected and extract the voice feature of the voice;基础音素预测单元,用于将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素;The basic phoneme prediction unit is configured to input the voice features into a pre-trained neural network model of the target language according to frames, and output the basic phonemes corresponding to the voice features of each frame;候选词映射单元,用于将每一候选关键词映射为对应的基础音素;Candidate word mapping unit for mapping each candidate keyword to a corresponding basic phoneme;得分计算单元,用于根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分;A score calculation unit, configured to calculate the voice for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword;判断单元,用于根据所述得分判断是否有关键词被激活。The judging unit is used to judge whether a keyword is activated according to the score.
- 如权利要求8所述的装置,其特征在于,所述基础音素预测单元,用于按照N×M矩阵的方式输出所述每帧语音特征对应的基础音素;其中,N等于所述语音的帧数,M等于所述目标语种的基础音素的个数。The device according to claim 8, wherein the basic phoneme prediction unit is configured to output the basic phonemes corresponding to each frame of voice features as an N×M matrix, wherein N is equal to the number of frames of the voice and M is equal to the number of basic phonemes of the target language.
- 一种计算机系统,其特征在于,包括:A computer system, characterized in that it comprises:一个或多个处理器;以及与所述一个或多个处理器关联的存储器,所述存储器用于存储程序指令,所述程序指令在被所述一个或多个处理器读取执行时,执行如权利要求1-7所述的方法。One or more processors; and a memory associated with the one or more processors, where the memory is used to store program instructions that, when read and executed by the one or more processors, perform the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA3162745A CA3162745A1 (en) | 2019-11-26 | 2020-08-28 | Method of detecting speech keyword based on neutral network, device and system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911173619.1A CN110992929A (en) | 2019-11-26 | 2019-11-26 | Voice keyword detection method, device and system based on neural network |
CN201911173619.1 | 2019-11-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021103712A1 true WO2021103712A1 (en) | 2021-06-03 |
Family
ID=70087106
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/111940 WO2021103712A1 (en) | 2019-11-26 | 2020-08-28 | Neural network-based voice keyword detection method and device, and system |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN110992929A (en) |
CA (1) | CA3162745A1 (en) |
WO (1) | WO2021103712A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113506584A (en) * | 2021-07-06 | 2021-10-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Data processing method and device |
CN114978866A (en) * | 2022-05-25 | 2022-08-30 | 北京天融信网络安全技术有限公司 | Detection method, detection device and electronic equipment |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110992929A (en) * | 2019-11-26 | 2020-04-10 | 苏宁云计算有限公司 | Voice keyword detection method, device and system based on neural network |
CN111489737B (en) * | 2020-04-13 | 2020-11-10 | 深圳市友杰智新科技有限公司 | Voice command recognition method and device, storage medium and computer equipment |
CN111797607B (en) * | 2020-06-04 | 2024-03-29 | 语联网(武汉)信息技术有限公司 | Sparse noun alignment method and system |
CN111933124B (en) * | 2020-09-18 | 2021-04-30 | 电子科技大学 | Keyword detection method capable of supporting self-defined awakening words |
CN113724710A (en) * | 2021-10-19 | 2021-11-30 | 广东优碧胜科技有限公司 | Voice recognition method and device, electronic equipment and computer readable storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8321218B2 (en) * | 2009-06-19 | 2012-11-27 | L.N.T.S. Linguistech Solutions Ltd | Searching in audio speech |
US20170148429A1 (en) * | 2015-11-24 | 2017-05-25 | Fujitsu Limited | Keyword detector and keyword detection method |
US20180068653A1 (en) * | 2016-09-08 | 2018-03-08 | Intel IP Corporation | Method and system of automatic speech recognition using posterior confidence scores |
CN108182937A (en) * | 2018-01-17 | 2018-06-19 | 出门问问信息科技有限公司 | Keyword recognition method, device, equipment and storage medium |
CN108305617A (en) * | 2018-01-31 | 2018-07-20 | 腾讯科技(深圳)有限公司 | The recognition methods of voice keyword and device |
US10199037B1 (en) * | 2016-06-29 | 2019-02-05 | Amazon Technologies, Inc. | Adaptive beam pruning for automatic speech recognition |
CN110364142A (en) * | 2019-06-28 | 2019-10-22 | 腾讯科技(深圳)有限公司 | Phoneme of speech sound recognition methods and device, storage medium and electronic device |
CN110992929A (en) * | 2019-11-26 | 2020-04-10 | 苏宁云计算有限公司 | Voice keyword detection method, device and system based on neural network |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103971678B (en) * | 2013-01-29 | 2015-08-12 | 腾讯科技(深圳)有限公司 | Keyword spotting method and apparatus |
CN105374352B (en) * | 2014-08-22 | 2019-06-18 | 中国科学院声学研究所 | A kind of voice activated method and system |
CN106297776B (en) * | 2015-05-22 | 2019-07-09 | 中国科学院声学研究所 | A kind of voice keyword retrieval method based on audio template |
CN108615525B (en) * | 2016-12-09 | 2020-10-09 | 中国移动通信有限公司研究院 | Voice recognition method and device |
CN107331384B (en) * | 2017-06-12 | 2018-05-04 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN107358951A (en) * | 2017-06-29 | 2017-11-17 | 阿里巴巴集团控股有限公司 | A kind of voice awakening method, device and electronic equipment |
CN108932943A (en) * | 2018-07-12 | 2018-12-04 | 广州视源电子科技股份有限公司 | Command word sound detection method, device, equipment and storage medium |
CN109243460A (en) * | 2018-08-15 | 2019-01-18 | 浙江讯飞智能科技有限公司 | A method of automatically generating news or interrogation record based on the local dialect |
CN110223673B (en) * | 2019-06-21 | 2020-01-17 | 龙马智芯(珠海横琴)科技有限公司 | Voice processing method and device, storage medium and electronic equipment |
- 2019-11-26 CN CN201911173619.1A patent/CN110992929A/en active Pending
- 2020-08-28 CA CA3162745A patent/CA3162745A1/en active Pending
- 2020-08-28 WO PCT/CN2020/111940 patent/WO2021103712A1/en active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8321218B2 (en) * | 2009-06-19 | 2012-11-27 | L.N.T.S. Linguistech Solutions Ltd | Searching in audio speech |
US20170148429A1 (en) * | 2015-11-24 | 2017-05-25 | Fujitsu Limited | Keyword detector and keyword detection method |
US10199037B1 (en) * | 2016-06-29 | 2019-02-05 | Amazon Technologies, Inc. | Adaptive beam pruning for automatic speech recognition |
US20180068653A1 (en) * | 2016-09-08 | 2018-03-08 | Intel IP Corporation | Method and system of automatic speech recognition using posterior confidence scores |
CN108182937A (en) * | 2018-01-17 | 2018-06-19 | 出门问问信息科技有限公司 | Keyword recognition method, device, equipment and storage medium |
CN108305617A (en) * | 2018-01-31 | 2018-07-20 | 腾讯科技(深圳)有限公司 | The recognition methods of voice keyword and device |
CN110364142A (en) * | 2019-06-28 | 2019-10-22 | 腾讯科技(深圳)有限公司 | Phoneme of speech sound recognition methods and device, storage medium and electronic device |
CN110992929A (en) * | 2019-11-26 | 2020-04-10 | 苏宁云计算有限公司 | Voice keyword detection method, device and system based on neural network |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113506584A (en) * | 2021-07-06 | 2021-10-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Data processing method and device |
CN113506584B (en) * | 2021-07-06 | 2024-05-14 | 腾讯音乐娱乐科技(深圳)有限公司 | Data processing method and device |
CN114978866A (en) * | 2022-05-25 | 2022-08-30 | 北京天融信网络安全技术有限公司 | Detection method, detection device and electronic equipment |
CN114978866B (en) * | 2022-05-25 | 2024-02-20 | 北京天融信网络安全技术有限公司 | Detection method, detection device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CA3162745A1 (en) | 2021-06-03 |
CN110992929A (en) | 2020-04-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021103712A1 (en) | Neural network-based voice keyword detection method and device, and system | |
CN107134279B (en) | Voice awakening method, device, terminal and storage medium | |
US10192545B2 (en) | Language modeling based on spoken and unspeakable corpuses | |
US9842585B2 (en) | Multilingual deep neural network | |
CN108831439B (en) | Voice recognition method, device, equipment and system | |
CN108847241B (en) | Method for recognizing conference voice as text, electronic device and storage medium | |
CN110033760B (en) | Modeling method, device and equipment for speech recognition | |
US9805718B2 (en) | Clarifying natural language input using targeted questions | |
CN103956169B (en) | A kind of pronunciation inputting method, device and system | |
US20140257804A1 (en) | Exploiting heterogeneous data in deep neural network-based speech recognition systems | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
US20150134332A1 (en) | Speech recognition method and device | |
WO2017127296A1 (en) | Analyzing textual data | |
CN108399914B (en) | Voice recognition method and device | |
CN110782880B (en) | Training method and device for prosody generation model | |
KR20220004224A (en) | Context biasing for speech recognition | |
WO2018192186A1 (en) | Speech recognition method and apparatus | |
CN111883121A (en) | Awakening method and device and electronic equipment | |
JP7400112B2 (en) | Biasing alphanumeric strings for automatic speech recognition | |
CN112562723B (en) | Pronunciation accuracy determination method and device, storage medium and electronic equipment | |
KR20200095947A (en) | Electronic device and Method for controlling the electronic device thereof | |
CN110853669B (en) | Audio identification method, device and equipment | |
JP2004094257A (en) | Method and apparatus for generating question of decision tree for speech processing | |
CN112559725A (en) | Text matching method, device, terminal and storage medium | |
CN111862963A (en) | Voice wake-up method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20891808; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 3162745; Country of ref document: CA |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 20891808; Country of ref document: EP; Kind code of ref document: A1 |