WO2021103712A1 - Neural network-based voice keyword detection method and device, and system - Google Patents
- Publication number: WO2021103712A1 (application PCT/CN2020/111940)
- Authority: WO (WIPO, PCT)
- Prior art keywords: voice, score, basic, keyword, neural network
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/144—Training of HMMs
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
Definitions
- the invention belongs to the technical field of computer speech recognition, and specifically relates to a method, device and system for detecting speech keywords based on a neural network.
- For the task of speech keyword detection, the traditional method is to introduce a complete speech-recognition decoder, decode the input speech in which keywords are to be detected, generate multiple candidate results, and save them in some form, such as a Lattice structure; an inverted index is then generated from the Lattice, and this inverted index is quickly searched to determine whether the speech to be detected contains the specified keywords.
- Because multiple candidates can all be represented in the Lattice, this Lattice-based keyword strategy generally has a high recall rate.
- The disadvantage is that it is too complicated: it requires introducing the entire recognition system and processing complex Lattices, and generating the inverted index generally involves Finite-State Transducer (FST) operations, which are difficult to master, so the deployment complexity is also high.
- Under the latest neural network-based keyword detection frameworks, a neural network is generally built for each keyword, and each network judges whether its keyword is activated by accumulating the output score of each frame.
- Building a separate neural network for each keyword has two problems: first, a large amount of speech containing that keyword is needed to train the model, and collecting the data is very troublesome; second, when a keyword is added or modified, the data must be collected again and the model retrained, so the whole process is also very complicated.
- Moreover, such models generally have a high false-alarm rate, and the system is often activated by mistake in undesired situations.
- In view of these defects of the prior art, the present invention proposes a neural network-based method, device, and system for voice keyword detection.
- The present invention can reduce the network model resources required by a keyword retrieval system; in addition, the model does not need to be retrained when keywords are modified, which saves the time and cost of retraining the model.
- One aspect of the present invention is to disclose a voice keyword detection method based on neural network, which includes the following steps:
- said outputting the basic phoneme corresponding to each frame of speech feature includes:
- the basic phonemes corresponding to each frame of speech feature are output according to an N ⁇ M matrix; wherein, N is equal to the number of frames of the speech, and M is equal to the number of basic phonemes of the target language.
- the output of the basic phoneme corresponding to each frame of speech feature includes:
- the neural network model is obtained through the following steps:
- acquiring a training sample data set, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;
- extracting sample voice features of the sample voice;
- using the sample voice features as input and the sample basic phoneme labeling result corresponding to the sample voice as output to train the neural network model.
- the inputting the voice features into a pre-trained neural network model of the target language by frame, and outputting the basic phonemes corresponding to the voice features of each frame includes:
- the voice features are input into a pre-trained GMM-HMM model of the target language in frames to perform forced alignment on the voice features to obtain at least one basic phoneme corresponding to each frame of voice features.
- the calculating the voice score for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword includes:
- multiple scores are calculated through multiple score calculation strategies, and the multiple scores are merged to obtain a final score.
- the score calculation strategy includes: at least two of dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N ⁇ M matrix space.
- the judging whether a keyword is activated according to the score includes:
- Another aspect of the present invention is to disclose a voice keyword detection device based on a neural network, the device comprising:
- the voice feature extraction unit is used to receive the voice to be detected and extract the voice feature of the voice
- the basic phoneme prediction unit is configured to input the voice features into a pre-trained neural network model of the target language according to frames, and output the basic phonemes corresponding to the voice features of each frame;
- Candidate word mapping unit for mapping each candidate keyword to a corresponding basic phoneme
- a score calculation unit, configured to calculate, according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword, the score of the voice for each candidate keyword;
- the judging unit is used to judge whether a keyword is activated according to the score.
- the basic phoneme prediction unit is configured to output the basic phoneme corresponding to the speech feature of each frame in the manner of an N ⁇ M matrix; wherein, N is equal to the number of frames of the speech, and M is equal to the basis of the target language Number of phonemes
- Another aspect of the present invention is to disclose a computer system, including:
- one or more processors; and
- a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, cause the one or more processors to perform the method described above.
- the product of the present invention only needs to have one of the above-mentioned effects.
- FIG. 1 is a flowchart of the voice keyword detection method of the present invention
- FIG. 2 is a flowchart of a method in Embodiment 1 of the present invention.
- Fig. 3 is a structural diagram of a device according to the second embodiment of the present invention.
- Figure 4 is a structural diagram of the computer system of the present invention.
- the invention uses a neural network-based method to solve the task of voice keyword detection.
- the modeling unit of the neural network of the present invention is not a complete keyword or a word in the keyword, but a basic phoneme unit of the language to which the keyword belongs.
- For example, for Chinese, the output nodes of the neural network of the present invention model all the initials and finals of Hanyu Pinyin, and the desired keywords are spliced together through sequence combinations of initials and finals.
- In addition, because the neural network of the present invention is relatively small, the scores obtained for the same voice through multiple neural networks can be further fused, which further improves performance, makes the score better reflect the keyword confidence, increases the recall rate of the keyword detection system, and reduces false alarms.
- Fig. 1 shows a flowchart of the voice keyword detection method of the present invention.
- the voice keyword detection method of the present invention can be divided into two parts, one is training a neural network model, and the other is using a trained neural network model to detect voice keywords.
- Training the neural network model includes the following steps.
- Step 1: Obtain a sample training set, including the sample speech used for training and the sample basic phoneme labeling result of that speech.
- For the target language, collect a certain amount of annotated speech, preferably a speech training set of more than 500 hours.
- Step 2: Extract sample voice features.
- Step 3: Train the neural network model. Use the sample speech with basic phoneme annotations to train the GMM-HMM model required for speech recognition, and use this model to force-align the speech, obtaining for each frame of the feature-extracted speech which basic phoneme (or phonemes) of the target language it belongs to (if a frame belongs to multiple basic phonemes, the probabilities of those phonemes sum to 1).
- In practice, the phoneme sequence corresponding to a sentence can be obtained by mapping through an existing dictionary resource, but the phonemes of a specific frame cannot be determined directly; therefore a GMM-HMM model is trained, and this model is used to obtain the phoneme information of each frame, as sketched below.
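As a minimal sketch of this labeling step (the data layout and function names are assumptions, and no particular toolkit is implied), the code below converts a forced-alignment result, given as one or more (phoneme id, probability) pairs per frame, into an N×M frame-level target matrix whose rows sum to 1:

```python
import numpy as np

def alignment_to_targets(alignment, num_phonemes):
    """Build an N x M frame-level target matrix from a forced alignment.

    alignment: list of length N (one entry per frame); each entry is a list
               of (phoneme_id, probability) pairs whose probabilities sum to 1.
    num_phonemes: M, the number of basic phoneme output nodes.
    """
    targets = np.zeros((len(alignment), num_phonemes))
    for frame_index, frame_labels in enumerate(alignment):
        for phoneme_id, prob in frame_labels:
            targets[frame_index, phoneme_id] = prob
    return targets

# Example: three frames; the middle frame is shared between phonemes 4 and 5.
alignment = [[(4, 1.0)], [(4, 0.6), (5, 0.4)], [(5, 1.0)]]
print(alignment_to_targets(alignment, num_phonemes=8))
```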
- The output nodes of the neural network represent the basic phonemes of the target language, so the number of output nodes can simply equal the total number of basic phonemes of that language. For Chinese, for example, it can be the number of all initials plus all finals; for English, it is the number of International Phonetic Alphabet symbols. The scheme is also extensible: for a tonal language such as Chinese, the finals can carry tones, five in total (the four tones plus the neutral tone), so the total number of nodes becomes the number of initials plus five times the number of finals. In addition, some extra nodes can be added to absorb parts of the speech that do not belong to any phoneme, such as noise, abnormal sounds, coughing, and so on.
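The node count described above can be written down directly. In the sketch below the phoneme inventory sizes are illustrative assumptions (roughly 23 pinyin initials and 39 finals), not values fixed by the patent:

```python
# Illustrative counts only; the exact phoneme inventory is an assumption.
NUM_INITIALS = 23   # Hanyu Pinyin initials (approximate)
NUM_FINALS = 39     # Hanyu Pinyin finals (approximate)
NUM_TONES = 5       # four tones plus the neutral tone
NUM_EXTRA = 3       # e.g. nodes absorbing noise, abnormal sounds, coughing

# Toneless modeling: one output node per initial and per final.
toneless_nodes = NUM_INITIALS + NUM_FINALS + NUM_EXTRA

# Tonal modeling: each final is split by tone, as described above.
tonal_nodes = NUM_INITIALS + NUM_TONES * NUM_FINALS + NUM_EXTRA

print(toneless_nodes, tonal_nodes)  # 65 and 221 here, well under 500
```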
- The modeling units of the neural network model of the present invention are not complete keywords or the individual characters in a keyword, but the basic phoneme units of the language to which the keywords belong.
- For example, for Chinese, the output nodes of the neural network of the present invention model all the initials and finals of Hanyu Pinyin, and the desired keywords are spliced together through sequence combinations of initials and finals.
- For example, if the keyword is "小伙小伙" ("xiaohuo xiaohuo"), the corresponding initial-plus-final sequence combination is "xiao3 huo3 xiao3 huo3".
- A language generally has no more than about 100 basic phoneme units; even for a tonal language such as Chinese, counting the tone variants, the total number of modeling units is generally no more than 500. The neural network model therefore does not become too large and can conveniently be deployed in embedded devices such as mobile phones and cameras.
- The above network can be a simple fully connected feedforward neural network, or a more complex network such as a time-delay neural network, a convolutional neural network, or a recurrent neural network; all of these fall within the protection scope of the present invention.
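As a minimal sketch of the simplest option above, the code below builds a toy fully connected feedforward network with random, untrained weights (illustrative only); it maps one frame's feature vector to a posterior over the M phoneme nodes and stacks the per-frame outputs into the N×M phoneme distribution matrix used in the detection steps that follow:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class FeedforwardPhonemeNet:
    """Toy fully connected network: one frame's features -> phoneme posteriors."""

    def __init__(self, feat_dim, hidden_dim, num_phonemes):
        # Random weights stand in for a trained model.
        self.w1 = rng.standard_normal((feat_dim, hidden_dim)) * 0.1
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, num_phonemes)) * 0.1
        self.b2 = np.zeros(num_phonemes)

    def forward(self, frame_feat):
        hidden = np.tanh(frame_feat @ self.w1 + self.b1)
        return softmax(hidden @ self.w2 + self.b2)

def phoneme_distribution_matrix(net, frames):
    """Run every frame through the network; the result is N x M."""
    return np.stack([net.forward(f) for f in frames])

# Example: 200 frames of 40-dimensional features, 221 phoneme output nodes.
net = FeedforwardPhonemeNet(feat_dim=40, hidden_dim=128, num_phonemes=221)
frames = rng.standard_normal((200, 40))
posteriors = phoneme_distribution_matrix(net, frames)
print(posteriors.shape)  # (200, 221); each row sums to 1
```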
- Using the trained neural network model to detect speech keywords includes the following steps:
- Step 4: Receive the voice to be detected that is input by the user, and extract the voice features of that voice.
- Step 5: Input the speech features, frame by frame, into the neural network model trained in the steps above, and output the corresponding phonemes. For each frame, the neural network produces a vector whose length equals the number of network output nodes. Assuming the speech has N frames and the network has M output nodes, a phoneme distribution matrix of size N×M is obtained.
- For different target languages, the values of N and M differ.
- Step 6: Calculate the score of each candidate keyword, that is, use the above N×M matrix to compute the possible score of each candidate keyword.
- Each candidate keyword is mapped into a phoneme sequence through its pronunciation dictionary; since each phoneme corresponds to a node of the network output, the score of the candidate keyword's phoneme sequence in the N×M matrix can then be calculated.
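To make this mapping concrete, the sketch below uses a made-up miniature pronunciation dictionary (a real system would load a full lexicon, and the phoneme decompositions shown are only illustrative) to turn a candidate keyword into the sequence of output-node indices used for scoring:

```python
# Hypothetical miniature pronunciation dictionary: keyword -> phoneme units.
# "打开收钱码" = "open money collection code", "打开付钱码" = "open payment code".
PRONUNCIATION_DICT = {
    "打开收钱码": ["d", "a3", "k", "ai1", "sh", "ou1", "q", "ian2", "m", "a3"],
    "打开付钱码": ["d", "a3", "k", "ai1", "f", "u4", "q", "ian2", "m", "a3"],
}

# Hypothetical mapping from each phoneme unit to a network output-node index.
PHONEME_TO_NODE = {p: i for i, p in enumerate(
    sorted({p for seq in PRONUNCIATION_DICT.values() for p in seq}))}

def keyword_to_node_sequence(keyword):
    """Map a candidate keyword to its sequence of output-node indices."""
    return [PHONEME_TO_NODE[p] for p in PRONUNCIATION_DICT[keyword]]

print(keyword_to_node_sequence("打开收钱码"))
```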
- This scoring method includes, but is not limited to, dynamic programming, the longest sequence score subject to constraints, or the optimal path score after brute force exhaustion in the N ⁇ M matrix space. For the convenience of discussion, all score calculation methods that may be used in this process are collectively referred to as "score calculation strategy".
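One possible realization of the dynamic-programming strategy mentioned above is sketched below; it scores the best monotonic alignment of the keyword's phoneme-node sequence against the N×M matrix of per-frame posteriors. The exact scoring formula is an assumption made for illustration, not the patent's prescribed calculation.

```python
import numpy as np

def dp_keyword_score(posteriors, node_sequence):
    """Best monotonic-alignment score of a phoneme sequence in an N x M matrix.

    posteriors: N x M matrix of per-frame phoneme posteriors.
    node_sequence: list of K output-node indices for the candidate keyword.
    Returns the average log-posterior along the best path (higher is better).
    """
    log_p = np.log(posteriors + 1e-10)
    n_frames = posteriors.shape[0]
    k = len(node_sequence)
    if k == 0 or k > n_frames:
        return float("-inf")

    # dp[t, j]: best score ending at frame t while emitting the j-th keyword phoneme.
    dp = np.full((n_frames, k), -np.inf)
    dp[0, 0] = log_p[0, node_sequence[0]]
    for t in range(1, n_frames):
        for j in range(k):
            stay = dp[t - 1, j]                               # keep emitting phoneme j
            advance = dp[t - 1, j - 1] if j > 0 else -np.inf  # move on to the next phoneme
            dp[t, j] = log_p[t, node_sequence[j]] + max(stay, advance)
    return dp[-1, -1] / n_frames

# Example with random posteriors (illustration only).
rng = np.random.default_rng(1)
raw = rng.random((50, 20))
posteriors = raw / raw.sum(axis=1, keepdims=True)
print(dp_keyword_score(posteriors, node_sequence=[3, 7, 7, 12, 5]))
```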
- The present invention can train multiple neural networks for score calculation; for a given candidate keyword, the different score-calculation networks use different "score calculation strategies" to obtain multiple scores, and these scores can be fused with different methods, such as a weighted average, to obtain a better score representation.
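The weighted average mentioned above as one possible fusion method is simple to write down; the weights here are placeholder values that would in practice be tuned on development data.

```python
def fuse_scores(scores, weights=None):
    """Weighted average of scores from several networks / scoring strategies."""
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)  # plain average by default
    return sum(w * s for w, s in zip(weights, scores))

# Example: three strategies, with more trust placed in the first one.
print(fuse_scores([88.0, 92.0, 85.0], weights=[0.5, 0.3, 0.2]))
```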
- It should be noted that, because a candidate keyword can always be mapped to a phoneme sequence, any candidate keyword can always be scored in Step 6, and the neural network does not need to be retrained.
- In addition, since only the phoneme sequences of candidate keywords are considered here, candidate keywords with the same pronunciation but different written words are treated identically.
- Step 7: Determine whether any candidate keyword is activated.
- From the candidate keyword set, select the candidate keyword with the highest score. If that score exceeds the threshold pre-defined for this candidate keyword, the candidate keyword is activated; otherwise, consider the candidate keyword with the second-highest score and check whether it exceeds its own pre-defined threshold, and so on in order. As soon as one candidate keyword is activated, control information indicating that this candidate keyword has been activated is returned, completing the recognition of the utterance. If the scores of all candidate keywords are below their thresholds, a result indicating that no candidate keyword was activated is returned, and the whole process ends. Take a financial payment app on a mobile phone as an example: after opening the app, the user says "open the money collection code" or "open the payment code", the system determines from the user's voice that it has received a specific keyword, and it then automatically opens the corresponding QR code for the user.
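A minimal sketch of this activation decision is shown below; the candidate keywords, scores, and the shared threshold of 80 mirror the worked example given later in this description, and all values are illustrative.

```python
def pick_activated_keyword(scores, thresholds):
    """Check candidates in descending score order and return the first one
    whose score exceeds its own pre-defined threshold, or None."""
    for keyword in sorted(scores, key=scores.get, reverse=True):
        if scores[keyword] > thresholds[keyword]:
            return keyword   # activated; stop checking further candidates
    return None              # no candidate keyword was activated

# Worked example ("打开收钱码" = open money collection code,
# "打开付钱码" = open payment code); both thresholds are 80.
scores = {"打开收钱码": 90, "打开付钱码": 40}
thresholds = {"打开收钱码": 80, "打开付钱码": 80}
print(pick_activated_keyword(scores, thresholds))  # -> 打开收钱码
```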
- Because this example is a Chinese scenario, a certain amount of Chinese corpus is collected first; it is easy to find more than 500 hours of annotated Chinese corpus on the Internet. Open-source tools are used to train a Chinese GMM-HMM model, and the trained model is then used to force-align this Chinese corpus, obtaining Chinese phoneme labels at the Hanyu Pinyin level, that is, the phoneme information of each frame.
- Next, the phoneme-level annotations and the corpus are used to train one or more neural networks, which can be fully connected feedforward neural networks or time-delay neural networks.
- The output nodes of the network correspond to the total number of phonemes; with this, the training of the neural network is complete.
- The neural network resources are saved offline, packaged together with the mobile app, deployed on the mobile phone, and loaded into the phone's memory when the app is opened.
- The app also stores the voice feature extraction strategy and the candidate keyword set, such as "open money collection code", "open payment code", and so on.
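The description later notes that the supported keywords can simply be written to a file that is read into memory, so keywords can be changed without retraining. A hypothetical keyword configuration of this kind (the JSON layout, phoneme decompositions, and threshold values are all assumptions) might look like the following:

```python
import json

# Hypothetical keyword configuration shipped with the app; editing this file
# is all that is needed to add, remove, or change candidate keywords.
KEYWORD_CONFIG = """
{
  "打开收钱码": {
    "phonemes": ["d", "a3", "k", "ai1", "sh", "ou1", "q", "ian2", "m", "a3"],
    "threshold": 80
  },
  "打开付钱码": {
    "phonemes": ["d", "a3", "k", "ai1", "f", "u4", "q", "ian2", "m", "a3"],
    "threshold": 80
  }
}
"""

candidate_keywords = json.loads(KEYWORD_CONFIG)
print(list(candidate_keywords), candidate_keywords["打开收钱码"]["threshold"])
```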
- When the user finishes saying a sentence, such as "Please open my money collection code", the phone's microphone collects the samples of that sentence, feature extraction is performed, and the features are sent to the neural network in memory to obtain a phoneme distribution matrix as output; the scores of the different candidate keywords are then computed from this phoneme distribution matrix. When multiple neural networks are used, their outputs are fused by some strategy, such as a weighted average, to obtain a more accurate score.
- For example, suppose the computed score of the user's utterance "Please open my money collection code" against the candidate keyword "open money collection code" is 90, its score against the candidate keyword "open payment code" is 40, and the threshold of each candidate keyword is 80. Checking, in descending order of score, whether each keyword's score exceeds the threshold set in advance for that keyword, it is found that the score of the candidate keyword "open money collection code" exceeds its threshold; that keyword is therefore activated, and the activated keyword is used to perform the subsequent operations.
- the first embodiment of the present application discloses a method for detecting voice keywords based on a neural network, as shown in FIG. 2, including the following steps:
- S21: Receive the voice to be detected, and extract the voice features of the voice.
- S22: Input the voice features, frame by frame, into a pre-trained neural network model of the target language, and output the basic phonemes corresponding to each frame of voice features. Specifically, this step includes:
- inputting the voice features, frame by frame, into a pre-trained GMM-HMM model of the target language to force-align the voice features, obtaining at least one basic phoneme corresponding to each frame of voice features, and outputting the basic phonemes corresponding to each frame of voice features in the form of an N×M matrix, where N equals the number of frames of the voice and M equals the number of basic phonemes of the target language.
- S23: Map each preset candidate keyword to its corresponding basic phonemes; this can be done through a pronunciation dictionary.
- S24: Calculate, according to the basic phonemes of the voice features and the basic phonemes of the candidate keywords, the score of the voice for each candidate keyword. Specifically, this step includes:
- calculating multiple scores through multiple score calculation strategies and merging the multiple scores to obtain a final score.
- The score calculation strategy includes at least two of: dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N×M matrix space.
- S25: Determine, according to the score, whether any keyword is activated. Specifically, the relationship between each candidate keyword's score and that candidate keyword's pre-defined score threshold is judged in descending order of score, until a candidate keyword whose score is greater than its pre-defined threshold is found; that candidate keyword is activated and the judgment stops.
- The above-mentioned neural network model can be obtained through the following steps:
- acquiring a training sample data set, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;
- extracting sample voice features of the sample voice; and
- using the sample voice features as input and the sample basic phoneme labeling result corresponding to the sample voice as output to train the neural network model.
- the second embodiment of the present application also discloses a voice keyword detection device based on a neural network.
- the device includes:
- the voice feature extraction unit 31 is used to receive the voice to be detected and extract the voice feature of the voice.
- the basic phoneme prediction unit 32 is configured to input the voice features into a pre-trained neural network model of the target language according to frames, and output the basic phonemes corresponding to the voice features of each frame.
- Specifically, the basic phoneme prediction unit 32 is used to input the voice features, frame by frame, into a pre-trained GMM-HMM model of the target language to force-align the voice features, obtain at least one basic phoneme corresponding to each frame of voice features, and output the basic phonemes corresponding to each frame of voice features in the form of an N×M matrix, where N equals the number of frames of the voice and M equals the number of basic phonemes of the target language.
- the candidate word mapping unit 33 is used to map each candidate keyword to a corresponding basic phoneme. Specifically, it can be mapped through a pronunciation dictionary.
- the score calculation unit 34 is configured to calculate the score of each candidate keyword for the voice according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword.
- the score calculation unit 34 is configured to calculate multiple scores through multiple score calculation strategies based on the basic phonemes of the voice feature and the basic phonemes of the candidate keywords, and merge the multiple scores to obtain a final score.
- the score calculation strategy includes at least two of dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N ⁇ M matrix space.
- the judging unit 35 is used for judging whether a keyword is activated according to the score.
- Specifically, the judging unit 35 is used to judge, in descending order of score, the relationship between each candidate keyword's score and that candidate keyword's pre-defined score threshold, until a candidate keyword whose score is greater than its pre-defined score threshold is found, and to stop the judgment after activating that candidate keyword.
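Purely as an illustration of how the five units described above might be wired together in software (the class and method names are invented, and each helper stands in for one unit), a skeleton could look like this:

```python
class KeywordDetectionDevice:
    """Skeleton wiring of the five units described above (illustrative only)."""

    def __init__(self, feature_extractor, phoneme_net, pronunciation_dict, thresholds):
        self.feature_extractor = feature_extractor    # voice feature extraction unit 31
        self.phoneme_net = phoneme_net                # basic phoneme prediction unit 32
        self.pronunciation_dict = pronunciation_dict  # candidate word mapping unit 33
        self.thresholds = thresholds                  # used by the judging unit 35

    def detect(self, waveform, score_fn):
        frames = self.feature_extractor(waveform)                    # unit 31
        posteriors = self.phoneme_net(frames)                        # unit 32: N x M matrix
        scores = {kw: score_fn(posteriors, phonemes)                 # unit 34
                  for kw, phonemes in self.pronunciation_dict.items()}
        for keyword in sorted(scores, key=scores.get, reverse=True):  # unit 35
            if scores[keyword] > self.thresholds[keyword]:
                return keyword
        return None
```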
- Embodiment 3 of the present invention discloses a computer system, including:
- one or more processors; and
- a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, cause the one or more processors to perform the method described above.
- the fourth embodiment of the present application provides a computer system, including:
- one or more processors; and
- a memory associated with the one or more processors where the memory is used to store program instructions, and when the program instructions are read and executed by the one or more processors, perform the following operations:
- said outputting the basic phoneme corresponding to each frame of speech feature includes:
- the basic phonemes corresponding to each frame of speech feature are output according to an N ⁇ M matrix; wherein, N is equal to the number of frames of the speech, and M is equal to the number of basic phonemes of the target language.
- the neural network model is obtained through the following steps:
- acquiring a training sample data set, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;
- extracting sample voice features of the sample voice;
- using the sample voice features as input and the sample basic phoneme labeling result corresponding to the sample voice as output to train the neural network model.
- the inputting the voice features into a pre-trained neural network model of the target language by frame, and outputting the basic phonemes corresponding to the voice features of each frame includes:
- the voice features are input into a pre-trained GMM-HMM model of the target language in frames to perform forced alignment on the voice features to obtain at least one basic phoneme corresponding to each frame of voice features.
- the calculating the voice score for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword includes:
- multiple scores are calculated through multiple score calculation strategies, and the multiple scores are merged to obtain a final score.
- the score calculation strategy includes: at least two of dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N ⁇ M matrix space.
- the judging whether a keyword is activated according to the score includes:
- the relationship between each candidate keyword's score and that candidate keyword's pre-defined score threshold is judged in turn, in descending order of score, until a candidate keyword whose score is greater than its pre-defined score threshold is found; the judgment stops after that candidate keyword is activated.
- FIG. 4 exemplarily shows the architecture of the computer system, which may specifically include a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520.
- the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 may be communicatively connected through the communication bus 1530.
- The processor 1510 can be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and executes relevant programs to implement the technical solutions provided in this application.
- the memory 1520 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory, random access memory), static storage device, dynamic storage device, etc.
- the memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500, and a basic input output system (BIOS) for controlling the low-level operation of the computer system 1500.
- a web browser 1523, a data storage management system 1524, and an icon font processing system 1525 can also be stored.
- the foregoing icon font processing system 1525 may be an application program that specifically implements the foregoing steps in the embodiment of the present application.
- the related program code is stored in the memory 1520 and is called and executed by the processor 1510.
- the input/output interface 1513 is used to connect input/output modules to realize information input and output.
- The input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide the corresponding functions.
- the input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and an output device may include a display, a speaker, a vibrator, an indicator light, and the like.
- the network interface 1514 is used to connect a communication module (not shown in the figure) to realize communication interaction between the device and other devices.
- the communication module can realize communication through wired means (such as USB, network cable, etc.), or through wireless means (such as mobile network, WIFI, Bluetooth, etc.).
- the bus 1530 includes a path to transmit information between various components of the device (for example, the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).
- the computer system 1500 can also obtain information about specific receiving conditions from the virtual resource object receiving condition information database 1541 for condition judgment, and so on.
- Although the above device only shows the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, and so on, in a specific implementation the device may also include other components necessary for normal operation.
- the above-mentioned device may also include only the components necessary to implement the solution of the present application, and not necessarily include all the components shown in the figure.
- The above provides a detailed introduction to the neural network-based speech keyword detection method, device, and system provided in this application. Specific examples are used herein to illustrate the principles and implementation of this application; the description of the above embodiments is only intended to help in understanding the methods and core ideas of this application. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and scope of application based on the ideas of this application. In summary, the content of this specification should not be construed as a limitation on this application.
- In summary, the present invention achieves the same function in a very simple way: instead of the traditional approach in which multiple keywords require multiple neural network models, multiple keywords now require only one neural network model, so the neural network model can be kept very small. A model of about 10 MB can achieve excellent performance, is suitable for deployment in embedded devices, and occupies very few resources while completing the function.
- Keywords can be configured arbitrarily; there is no need to re-collect data for specific keywords and retrain the model, and likewise no need to retrain the model when keywords are modified, which removes the troublesome step of collecting keyword-specific corpora and saves the time required for model retraining.
Abstract
A neural network-based voice keyword detection method and device, and a system. The method comprises the following steps: receiving a voice to be detected, and extracting voice features of the voice (S21); inputting the voice features into a pre-trained neural network model of a target language by frames, and outputting a basic phoneme corresponding to each frame of voice features (S22); mapping each preset candidate keyword to the corresponding basic phoneme (S23); according to the basic phoneme of the voice features and the basic phoneme of the candidate keyword, calculating the score of the voice being each candidate keyword (S24); and determining whether a keyword is activated according to the score (S25). The voice keyword detection method saves system resources and reduces the time and costs required for retraining a model.
Description
The invention belongs to the technical field of computer speech recognition and specifically relates to a neural network-based method, device, and system for detecting speech keywords.
For the task of speech keyword detection, the traditional method is to introduce a complete speech-recognition decoder, decode the input speech in which keywords are to be detected, generate multiple candidate results, and save them in some form, such as a Lattice structure; an inverted index is then generated, and this inverted index is quickly searched to determine whether the speech to be detected contains the specified keywords. Because multiple candidates can all be represented in the Lattice, this Lattice-based keyword strategy generally has a high recall rate. The disadvantage is that it is too complicated: it requires introducing the entire recognition system and processing complex Lattices, and generating the inverted index generally involves Finite-State Transducer (FST) operations, which are difficult to master, so the deployment complexity is also high.
Under the latest neural network-based keyword detection frameworks, a neural network is generally built for each keyword, and each network judges whether its keyword is activated by accumulating the output score of each frame. Building a separate neural network for each keyword has two problems: first, a large amount of speech containing that keyword is needed to train the model, and collecting the data is very troublesome; second, when a keyword is added or modified, the data must be collected again and the model retrained, so the whole process is also very complicated. Moreover, such models generally have a high false-alarm rate, and the system is often activated by mistake in undesired situations.
Summary of the invention
In view of the excessive complexity of the prior art, the present invention proposes a neural network-based method, device, and system for voice keyword detection. The present invention can reduce the network model resources required by a keyword retrieval system; on the other hand, the model does not need to be retrained when keywords are modified, which saves the time and cost of retraining the model.
One aspect of the present invention discloses a neural network-based voice keyword detection method, which includes the following steps:
receiving the voice to be detected, and extracting voice features of the voice;
inputting the voice features, frame by frame, into a pre-trained neural network model of the target language, and outputting the basic phonemes corresponding to each frame of voice features;
mapping each preset candidate keyword to its corresponding basic phonemes;
calculating, according to the basic phonemes of the voice features and the basic phonemes of the candidate keywords, the score of the voice for each candidate keyword; and
determining, according to the scores, whether any keyword is activated.
Preferably, outputting the basic phoneme corresponding to each frame of voice features includes:
outputting the basic phonemes corresponding to each frame of voice features in the form of an N×M matrix, where N equals the number of frames of the voice and M equals the number of basic phonemes of the target language.
Preferably, outputting the basic phoneme corresponding to each frame of voice features includes:
Preferably, the neural network model is obtained through the following steps:
acquiring a training sample data set, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;
extracting sample voice features of the sample voice;
using the sample voice features as input and the sample basic phoneme labeling result corresponding to the sample voice as output to train the neural network model.
Preferably, inputting the voice features, frame by frame, into a pre-trained neural network model of the target language and outputting the basic phonemes corresponding to each frame of voice features includes:
inputting the voice features, frame by frame, into a pre-trained GMM-HMM model of the target language to force-align the voice features, obtaining at least one basic phoneme corresponding to each frame of voice features.
Preferably, calculating, according to the basic phonemes of the voice features and the basic phonemes of the candidate keywords, the score of the voice for each candidate keyword includes:
calculating multiple scores through multiple score calculation strategies according to the basic phonemes of the voice features and the basic phonemes of the candidate keywords, and merging the multiple scores to obtain a final score.
Preferably, the score calculation strategy includes at least two of: dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N×M matrix space.
Preferably, determining, according to the scores, whether any keyword is activated includes:
judging, in descending order of score, the relationship between each candidate keyword's score and that candidate keyword's pre-defined score threshold, until a candidate keyword whose score is greater than its pre-defined score threshold is found, and stopping the judgment after activating that candidate keyword.
Another aspect of the present invention discloses a neural network-based voice keyword detection device, the device including:
a voice feature extraction unit, used to receive the voice to be detected and extract the voice features of the voice;
a basic phoneme prediction unit, used to input the voice features, frame by frame, into a pre-trained neural network model of the target language and to output the basic phonemes corresponding to each frame of voice features;
a candidate word mapping unit, used to map each candidate keyword to its corresponding basic phonemes;
a score calculation unit, used to calculate, according to the basic phonemes of the voice features and the basic phonemes of the candidate keywords, the score of the voice for each candidate keyword; and
a judging unit, used to determine, according to the scores, whether any keyword is activated.
Preferably, the basic phoneme prediction unit is used to output the basic phonemes corresponding to each frame of voice features in the form of an N×M matrix, where N equals the number of frames of the voice and M equals the number of basic phonemes of the target language.
Another aspect of the present invention discloses a computer system, including:
one or more processors; and
a memory associated with the one or more processors, the memory being used to store program instructions that, when read and executed by the one or more processors, cause the one or more processors to perform the method described above.
According to the specific embodiments provided in this application, this application discloses the following technical effects:
1. For different keywords, there is no need to train different neural network models; a single model can complete the detection of all keywords. Under the traditional strategy, each keyword requires its own neural network model, which consumes a great deal of resources.
2. When keywords are modified, the model does not need to be retrained either; only the corresponding phoneme sequences need to be changed. Under the traditional strategy, whenever a keyword is modified, the model must be retrained with speech specific to that keyword, whereas the present invention only needs to train the network once with speech covering all phonemes of the target language, which greatly reduces the cost of retraining the model, is simple to operate, and is convenient to deploy.
The product of the present invention only needs to achieve one of the above-mentioned effects.
The features and advantages of the present invention will become clear with reference to the following drawings and the detailed description of specific embodiments of the present invention.
Figure 1 is a flowchart of the voice keyword detection method of the present invention;
Figure 2 is a flowchart of the method of Embodiment 1 of the present invention;
Figure 3 is a structural diagram of the device of Embodiment 2 of the present invention;
Figure 4 is a structural diagram of the computer system of the present invention.
In order to make the technical solution of the present invention clearer, it is further described below in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.
The invention uses a neural network-based approach to solve the task of voice keyword detection. In particular, the modeling units of the neural network of the present invention are not complete keywords or the individual characters in a keyword, but the basic phoneme units of the language to which the keywords belong. For example, for Chinese, the output nodes of the neural network of the present invention model all the initials and finals of Hanyu Pinyin, and the desired keywords are spliced together through sequence combinations of initials and finals.
In addition, because the neural network of the present invention is relatively small, the scores obtained for the same voice through multiple neural networks can be further fused, which further improves performance, makes the score better reflect the keyword confidence, increases the recall rate of the keyword detection system, and reduces false alarms.
Figure 1 shows a flowchart of the voice keyword detection method of the present invention. As shown in Figure 1, the method can be divided into two parts: training a neural network model, and using the trained neural network model to detect voice keywords.
Training the neural network model includes the following steps.
Step 1: Obtain a sample training set, including the sample speech used for training and the sample basic phoneme labeling result of that speech. For the target language, collect a certain amount of annotated speech, preferably a speech training set of more than 500 hours.
Step 2: Extract sample voice features.
Step 3: Train the neural network model. Use the sample speech with basic phoneme annotations to train the GMM-HMM model required for speech recognition, and use this model to force-align the speech, obtaining for each frame of the feature-extracted speech which basic phoneme (or phonemes) of the target language it belongs to (if a frame belongs to multiple basic phonemes, the probabilities of those phonemes sum to 1). In practice, the phoneme sequence corresponding to a sentence can be obtained by mapping through an existing dictionary resource, but the phonemes of a specific frame cannot be determined directly; therefore a GMM-HMM model is trained, and this model is used to obtain the phoneme information of each frame.
The output nodes of the neural network represent the basic phonemes of the target language, so the number of output nodes can simply equal the total number of basic phonemes of that language. For Chinese, for example, it can be the number of all initials plus all finals; for English, it is the number of International Phonetic Alphabet symbols. The scheme is also extensible: for a tonal language such as Chinese, the finals can carry tones, five in total (the four tones plus the neutral tone), so the total number of nodes becomes the number of initials plus five times the number of finals. In addition, some extra nodes can be added to absorb parts of the speech that do not belong to any phoneme, such as noise, abnormal sounds, coughing, and so on.
The neural network model of the present invention does not model complete keywords or the individual characters in a keyword, but the basic phoneme units of the language to which the keywords belong. For example, for Chinese, the output nodes of the neural network of the present invention model all the initials and finals of Hanyu Pinyin, and the desired keywords are spliced together through sequence combinations of initials and finals.
For example, if the keyword is "小伙小伙" ("xiaohuo xiaohuo"), the corresponding initial-plus-final sequence combination is "xiao3 huo3 xiao3 huo3". A language generally has no more than about 100 basic phoneme units; even for a tonal language such as Chinese, counting the tone variants, the total number of modeling units is generally no more than 500, so the neural network model does not become too large and can conveniently be deployed in embedded devices such as mobile phones and cameras. The network can be a simple fully connected feedforward neural network or a more complex network, such as a time-delay neural network, a convolutional neural network, or a recurrent neural network; all of these fall within the protection scope of the present invention.
利用训练好的神经网络模型对语音关键词进行检测包括以下步骤:Using the trained neural network model to detect speech keywords includes the following steps:
步骤四:接收用户输入的待检测的语音信息,提取该语音的语音特征。Step 4: Receive the voice information to be detected input by the user, and extract the voice features of the voice.
步骤五:将语音特征按帧输入上述步骤训练好的神经网络模型中,输出对应的音素。对于每一帧,神经网络都会得到一个网络输出节点个数大小的向量。假设语音共有N帧,网络输出节点是M个,那么就会得到一个N×M大小的音素分布矩阵。Step 5: Input the speech features into the neural network model trained in the above step by frame, and output the corresponding phonemes. For each frame, the neural network will get a vector of the number of network output nodes. Assuming there are N frames of speech and M output nodes of the network, a phoneme distribution matrix of size N×M will be obtained.
对应不同的目标语种,N、M个数不同。Corresponding to different target languages, the number of N and M are different.
步骤六:计算每一候选关键词得分即计算上述N×M矩阵为每一候选关键词可能的得分。将每个候选关键词通过其发音词典映射成一个音素序列,由于每个音素都能对应到网络输出的一个节点,从而可以计算出该候选关键词的音素序列在N×M矩阵里的得分。这种得分方式包括但不限于动态规划、受限制约束的最长序列得分、或者是在N×M矩阵空间里暴力穷举后的最优路径得分。为方便讨论,将此过程可能用到的所有得分计算方法统一称为“得分计算策略”。Step 6: Calculate the score of each candidate keyword, that is, calculate the above N×M matrix as the possible score of each candidate keyword. Each candidate keyword is mapped into a phoneme sequence through its pronunciation dictionary. Since each phoneme can correspond to a node of the network output, the score of the phoneme sequence of the candidate keyword in the N×M matrix can be calculated. This scoring method includes, but is not limited to, dynamic programming, the longest sequence score subject to constraints, or the optimal path score after brute force exhaustion in the N×M matrix space. For the convenience of discussion, all score calculation methods that may be used in this process are collectively referred to as "score calculation strategy".
本发明可以训练多个用于得分计算的神经网络,对于一个候选关键词,在不同的得分计算神经网络中采用不同的“得分计算策略”获得多个得分,这些得分可以采用不同的方法融合,如加权平均等,以获得更好的得分表示。The present invention can train multiple neural networks for score calculation. For a candidate keyword, different score calculation neural networks adopt different "score calculation strategies" to obtain multiple scores. These scores can be merged using different methods. Such as weighted average, etc., to get a better score representation.
需要注意的是,由于候选关键词一定可以映射为音素序列,因此任意候选关键词在步骤六里都一定可以计算出得分,而且神经网络也不需要被重新训练。另外,由于这里只考虑候选关键词的音素序列,因此发音相同但字不同的候选关键词是等价对待的。It should be noted that because candidate keywords must be mapped to phoneme sequences, any candidate keywords must be scored in step 6, and the neural network does not need to be retrained. In addition, since only the phoneme sequence of candidate keywords is considered here, candidate keywords with the same pronunciation but different words are treated equally.
步骤七:判断是否有候选关键词激活。在候选关键词候选集合中,挑选得分最大的候选关键词,如果该得分超过此候选关键词预先定义的门限,则该候选关键词激活;否则,考虑得分次大的候选关键词,其得分是否超过此候选关键词预先定义的门限。依次进行下去。只要有一个候选关键词被激活,则返回候选关键词被激活的控制信息,完成一句话的识别。若所有候选关键词得分均低于门限,则返回没有候选关键词被激活。整个流程结束。以一个手机上金融支付的app为例,在打开该app后,用户说出“打开收钱码”、“打开付钱码”,系统根据用户的语音判断接收到特定的关键词,之后自动打开对应的二维码供 用户使用。Step 7: Determine whether there are candidate keywords to activate. In the candidate keyword candidate set, select the candidate keyword with the largest score. If the score exceeds the threshold defined by the candidate keyword, the candidate keyword will be activated; otherwise, consider the candidate keyword with the second highest score. Exceeds the pre-defined threshold of this candidate keyword. Continue in order. As long as one candidate keyword is activated, the control information that the candidate keyword is activated is returned to complete the recognition of a sentence. If the scores of all candidate keywords are lower than the threshold, it returns that no candidate keywords are activated. The whole process ends. Take a financial payment app on a mobile phone as an example. After opening the app, the user says "open the payment code" and "open the payment code", the system judges that it receives a specific keyword based on the user's voice, and then automatically opens it The corresponding QR code is for users to use.
因为本例是中文场景,因此首先收集一定量的中文语料,网上很容易找到500小时以上标注好的中文语料。利用开源工具,训练中文的GMM-HMM模型,用训练好的模型对这批中文语料做进一步的强制对齐,得到中文音素,也就是汉语拼音级别的标注,即每一帧的音素信息。Because this example is a Chinese scene, a certain amount of Chinese corpus is collected first. It is easy to find Chinese corpus marked with more than 500 hours on the Internet. Use open source tools to train the Chinese GMM-HMM model, and use the trained model to further compulsively align the batch of Chinese corpus to obtain the Chinese phoneme, which is the label of the Hanyu Pinyin level, that is, the phoneme information of each frame.
接下来,利用音素级的标注和语料,训练一个或者多个神经网络,可以取全连接的前馈神经网络和时延神经网络,网络输出节点即是音素总的个数。这样,神经网络就算训练完成了。将神经网络资源离线保存,和手机app打包在一起部署在手机上,在app被打开时加载进手机的内存。App中同时还存储有语音特征的提取策略、候选关键词集合如“打开收钱码”、“打开付钱码”等。Next, use phoneme-level annotations and corpus to train one or more neural networks, which can be fully connected feedforward neural networks and time-delay neural networks. The output nodes of the network are the total number of phonemes. In this way, the neural network is even trained. The neural network resources are saved offline, packaged with the mobile app and deployed on the mobile phone, and loaded into the memory of the mobile phone when the app is opened. At the same time, the App also stores voice feature extraction strategies and candidate keyword sets such as "open money collection code", "open payment code" and so on.
当用户说完一句话时,如“请打开我的收钱码”,手机麦克风采集到这句话的采样点,进行特征提取,送入内存里的神经网络,得到一个音素分布矩阵输出,再计算出该句话的音素分布矩阵与不同候选关键词的得分。对于多神经网络的输出,通过某种策略融合,如加权平均,得到更准确的得分。如根据计算得到用户所说的“请打开我的收钱码”与候选关键词“打开收钱码”的得分为90,与候选关键词“打开付钱码”的得分为40,且候选关键词的阈值均为80,那么按关键词得分从高到低,考察每个关键词得分是否超过此关键词预先设定的门限会发现候选关键词“打开收钱码”的得分超过阈值,则该关键词被激活,利用该激活的关键词执行后续操作即可。When the user finishes saying a sentence, such as "Please open my payment code", the microphone of the mobile phone collects the sampling points of the sentence, performs feature extraction, and sends it to the neural network in the memory to obtain a phoneme distribution matrix output. Calculate the phoneme distribution matrix of the sentence and the scores of different candidate keywords. For the output of multiple neural networks, through a certain strategy fusion, such as weighted average, a more accurate score can be obtained. For example, according to the calculation, the score of "Please open my money collection code" and the candidate keyword "open money collection code" is 90, and the score of the candidate keyword "open payment code" is 40, and the candidate key The threshold of the word is 80, then according to the keyword score from high to low, check whether the score of each keyword exceeds the threshold set in advance for this keyword, and it will be found that the score of the candidate keyword "open the money collection code" exceeds the threshold, then If the keyword is activated, use the activated keyword to perform subsequent operations.
具体来说,将应用支持的所有关键词写入一个文件,由系统内存读取即可。当需要修改或增加关键词时,不需要重新收集语音或重新训练模型,只需要修改文件即可。而一般的关键词策略,都需要用修改后的关键词或新增的关键词语音重新训练模型,而本发明无需此操作,大大节约了成本和时间。Specifically, all keywords supported by the application are written into a single file that the system simply reads into memory. When keywords need to be modified or added, there is no need to collect new speech or retrain the model; only the file has to be edited. Conventional keyword schemes require the model to be retrained with speech for the modified or newly added keywords, whereas the present invention needs no such operation, greatly saving cost and time.
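A possible form of such a keyword file and its loading code is sketched below; the file name, format, and per-keyword thresholds are assumptions made for illustration.

```python
# keywords.txt (hypothetical file, one keyword and its activation threshold per line):
#   打开收钱码 80
#   打开付钱码 80

def load_keywords(path="keywords.txt"):
    """Read the configurable keyword list; changing keywords means editing
    this file, with no model retraining."""
    thresholds = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            keyword, threshold = line.split()
            thresholds[keyword] = float(threshold)
    return thresholds
```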
实施例一Embodiment One
对应上述描述,本申请实施例一公开一种基于神经网络的语音关键词检测方法,如图2所示,包括以下步骤:Corresponding to the above description, the first embodiment of the present application discloses a method for detecting voice keywords based on a neural network, as shown in FIG. 2, including the following steps:
S21、接收待检测的语音,并提取所述语音的语音特征。S21: Receive a voice to be detected, and extract voice features of the voice.
S22、将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素。S22. Input the voice features into a pre-trained neural network model of the target language in frames, and output the basic phonemes corresponding to the voice features of each frame.
具体的该步骤包括:The specific steps include:
将所述语音特征按帧输入预先训练好的目标语种的GMM-HMM模型对所述语音特征做强制对齐,得到每帧语音特征对应的至少一个基础音素,并按照N×M矩阵的方式输出所述每帧语音特征对应的基础音素;其中,N等于所述语音的帧数,M等于所述目标语种的基础音素的个数。The voice features are input frame by frame into a pre-trained GMM-HMM model of the target language, which force-aligns the voice features to obtain at least one basic phoneme corresponding to each frame of voice features; the basic phonemes corresponding to each frame are output as an N×M matrix, where N equals the number of frames of the voice and M equals the number of basic phonemes of the target language.
S23、将预先设置的每一候选关键词映射为对应的基础音素。该步骤可通过发音词典进行映射。S23. Map each preset candidate keyword to its corresponding basic phonemes. This mapping can be performed using a pronunciation dictionary.
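The mapping through a pronunciation dictionary can be pictured as follows; the toy lexicon and its pinyin-style phoneme inventory are illustrative and not taken from the original disclosure.

```python
# A toy pronunciation lexicon with pinyin-style phonemes (illustrative only).
LEXICON = {
    "打": ["d", "a"],
    "开": ["k", "ai"],
    "收": ["sh", "ou"],
    "钱": ["q", "ian"],
    "码": ["m", "a"],
    "付": ["f", "u"],
}

def keyword_to_phonemes(keyword):
    """Map a candidate keyword to its basic phoneme sequence via the lexicon."""
    phones = []
    for char in keyword:
        phones.extend(LEXICON[char])
    return phones

print(keyword_to_phonemes("打开收钱码"))
# ['d', 'a', 'k', 'ai', 'sh', 'ou', 'q', 'ian', 'm', 'a']
```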
S24、根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分。S24. Calculate the score of each candidate keyword for the voice according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword.
具体的,该步骤包括:Specifically, this step includes:
根据所述语音特征的基础音素和所述候选关键词的基础音素,通过多种得分计算策略计算获得多个得分,对多个得分进行融合得到最终得分。According to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword, multiple scores are calculated through multiple score calculation strategies, and the multiple scores are merged to obtain a final score.
优选的,所述得分计算策略包括:动态规划、受限制约束的最长序列得分、在N×M矩阵空间里暴力穷举后的最优路径得分中的至少两个。Preferably, the score calculation strategy includes: at least two of dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N×M matrix space.
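As one possible instantiation of the dynamic-programming strategy over the N×M matrix (the patent does not fix the exact recurrence), the sketch below scores the best monotonic alignment of a keyword's phoneme sequence against the per-frame phoneme posteriors; the smoothing constant and the length normalization are assumptions.

```python
import numpy as np

def keyword_dp_score(posteriors, phone_ids):
    """Best monotonic alignment of a keyword's phoneme id sequence to the
    frame axis of an N x M posterior matrix; each phoneme covers at least
    one frame, and the result is length-normalized."""
    N, _ = posteriors.shape
    K = len(phone_ids)
    logp = np.log(posteriors + 1e-10)          # small constant avoids log(0)
    dp = np.full((N, K), -np.inf)
    dp[0, 0] = logp[0, phone_ids[0]]
    for t in range(1, N):
        for k in range(K):
            stay = dp[t - 1, k]                                 # stay on phoneme k
            move = dp[t - 1, k - 1] if k > 0 else -np.inf       # advance from k-1
            dp[t, k] = max(stay, move) + logp[t, phone_ids[k]]
    return dp[N - 1, K - 1] / N

posts = np.random.dirichlet(np.ones(218), size=100)   # 100 frames, 218 phonemes
print(keyword_dp_score(posts, phone_ids=[3, 17, 52, 9]))
```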
S25、根据所述得分判断是否有关键词被激活。S25: Determine whether any keyword is activated according to the score.
具体的,可按照得分从大到小的顺序,依次判断候选关键词得分与该候选关键词预先定义的得分阈值的关系,直至判断到有候选关键词的得分大于该候选关键词预先定义的得分阈值,将该候选关键词激活后停止判断。Specifically, the candidate keywords may be examined in descending order of score, comparing each candidate keyword's score with its pre-defined score threshold, until a candidate keyword whose score is greater than its pre-defined score threshold is found; that candidate keyword is activated and the checking stops.
其中,上述的神经网络模型可通过如下步骤获得:Among them, the above-mentioned neural network model can be obtained through the following steps:
获取训练用样本数据集,所述样本数据集包括样本语音以及与所述样本语音对应的样本基础音素标注结果;Acquiring a sample data set for training, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;
提取所述样本语音的样本语音特征;Extracting sample voice features of the sample voice;
将所述样本语音特征作为输入,将所述样本语音对应的样本基础音素标注结果作为输出训练神经网络模型。The sample voice feature is used as an input, and the sample basic phoneme labeling result corresponding to the sample voice is used as an output to train a neural network model.
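Preparing the frame-level training pairs from such a sample data set might look like the following sketch; the feature dimension, phoneme count, and utterance sizes are illustrative.

```python
import numpy as np

def build_training_pairs(utterances):
    """Flatten (features, frame_labels) utterances into per-frame pairs:
    each row of X is one frame's feature vector, y[i] its phoneme id."""
    X = np.concatenate([feats for feats, _ in utterances])
    y = np.concatenate([labels for _, labels in utterances])
    return X, y

# two hypothetical utterances with 40-dimensional features and 218 phoneme ids
utts = [(np.random.randn(120, 40), np.random.randint(0, 218, 120)),
        (np.random.randn(95, 40), np.random.randint(0, 218, 95))]
X, y = build_training_pairs(utts)   # X: (215, 40), y: (215,)
```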
实施例二Embodiment Two
对应上述方法,本申请实施例二还公开一种基于神经网络的语音关键词检测装置,如图3所示,所述装置包括:Corresponding to the above method, the second embodiment of the present application also discloses a voice keyword detection device based on a neural network. As shown in FIG. 3, the device includes:
语音特征提取单元31,用于接收待检测的语音,并提取所述语音的语音特征。The voice feature extraction unit 31 is used to receive the voice to be detected and extract the voice feature of the voice.
基础音素预测单元32,用于将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素。The basic phoneme prediction unit 32 is configured to input the voice features into a pre-trained neural network model of the target language according to frames, and output the basic phonemes corresponding to the voice features of each frame.
具体的该基础音素预测单元32用于:Specifically, the basic phoneme prediction unit 32 is used for:
将所述语音特征按帧输入预先训练好的目标语种的GMM-HMM模型对所述语音特征做强制对齐,得到每帧语音特征对应的至少一个基础音素,并按照N×M矩阵的方式输出所述每帧语音特征对应的基础音素;其中,N等于所述语音的帧数,M等于所述目标语种的基础音素的个数。The voice features are input frame by frame into a pre-trained GMM-HMM model of the target language, which force-aligns the voice features to obtain at least one basic phoneme corresponding to each frame of voice features; the basic phonemes corresponding to each frame are output as an N×M matrix, where N equals the number of frames of the voice and M equals the number of basic phonemes of the target language.
候选词映射单元33,用于将每一候选关键词映射为对应的基础音素。具体可通过发音词典进行映射。The candidate word mapping unit 33 is used to map each candidate keyword to a corresponding basic phoneme. Specifically, it can be mapped through a pronunciation dictionary.
得分计算单元34,用于根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分。The score calculation unit 34 is configured to calculate the score of each candidate keyword for the voice according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword.
具体的,该得分计算单元34用于根据所述语音特征的基础音素和所述候选关键词的基础音素,通过多种得分计算策略计算获得多个得分,对多个得分进行融合得到最终得分。Specifically, the score calculation unit 34 is configured to calculate multiple scores through multiple score calculation strategies based on the basic phonemes of the voice feature and the basic phonemes of the candidate keywords, and merge the multiple scores to obtain a final score.
其中,所述得分计算策略包括:动态规划、受限制约束的最长序列得分、在N×M矩阵空间里暴力穷举后的最优路径得分中的至少两个。Wherein, the score calculation strategy includes at least two of dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N×M matrix space.
判断单元35,用于根据所述得分判断是否有关键词被激活。The judging unit 35 is used for judging whether a keyword is activated according to the score.
具体的,判断单元35用于按照得分从大到小的顺序,依次判断候选关键词得分与该候选关键词预先定义的得分阈值的关系,直至判断到有候选关键词的得分大于该候选关键词预先定义的得分阈值,将该候选关键词激活后停止判断。Specifically, the judging unit 35 is configured to examine the candidate keywords in descending order of score, comparing each candidate keyword's score with its pre-defined score threshold, until a candidate keyword whose score is greater than its pre-defined score threshold is found; that candidate keyword is activated and the checking stops.
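The cooperation of the five units can be pictured with the following skeleton; the constructor arguments are assumed callables standing in for the units described above, not interfaces defined by the original disclosure.

```python
class KeywordSpotter:
    """Sketch of the device in Embodiment Two; all components are assumed callables."""

    def __init__(self, feature_extractor, phoneme_net, lexicon, scorer, thresholds):
        self.feature_extractor = feature_extractor   # voice feature extraction unit 31
        self.phoneme_net = phoneme_net               # basic phoneme prediction unit 32
        self.lexicon = lexicon                       # candidate word mapping unit 33
        self.scorer = scorer                         # score calculation unit 34
        self.thresholds = thresholds                 # thresholds used by judging unit 35

    def detect(self, audio, keywords):
        feats = self.feature_extractor(audio)                 # frame-level features
        posteriors = self.phoneme_net(feats)                  # N x M phoneme matrix
        scores = {kw: self.scorer(posteriors, self.lexicon(kw)) for kw in keywords}
        for kw in sorted(scores, key=scores.get, reverse=True):
            if scores[kw] > self.thresholds[kw]:
                return kw                                     # activated keyword
        return None                                           # no keyword activated
```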
实施例三Embodiment Three
对应上述方法,本发明实施例三公开一种计算机系统,包括:Corresponding to the foregoing method, Embodiment 3 of the present invention discloses a computer system, including:
一个或多个处理器;以及One or more processors; and
与所述一个或多个处理器关联的存储器,所述存储器用于存储程序指令,所述程序指令在被所述一个或多个处理器读取执行时,执行一种终端,包括存储器和处理器,处理器读取存储器中存储的计算机程序指令,从而使处理器执行如上所述的方法。A memory associated with the one or more processors, the memory being used to store program instructions; when the program instructions are read and executed by the one or more processors, they realize a terminal including a memory and a processor, where the processor reads the computer program instructions stored in the memory so that the processor performs the method described above.
本申请实施例四提供一种计算机系统,包括:The fourth embodiment of the present application provides a computer system, including:
一个或多个处理器;以及One or more processors; and
与所述一个或多个处理器关联的存储器,所述存储器用于存储程序指令,所述程序指令在被所述一个或多个处理器读取执行时,执行如下操作:A memory associated with the one or more processors, where the memory is used to store program instructions, and when the program instructions are read and executed by the one or more processors, perform the following operations:
接收待检测的语音,并提取所述语音的语音特征;Receiving the voice to be detected, and extracting voice features of the voice;
将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素;Inputting the voice features into a pre-trained neural network model of the target language by frame, and outputting the basic phonemes corresponding to the voice features of each frame;
将预先设置的每一候选关键词映射为对应的基础音素;Map each candidate keyword set in advance to the corresponding basic phoneme;
根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分;Calculating the voice score for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword;
根据所述得分判断是否有关键词被激活。Determine whether any keywords are activated according to the score.
优选的,所述输出所述每帧语音特征对应的基础音素包括:Preferably, said outputting the basic phoneme corresponding to each frame of speech feature includes:
按照N×M矩阵的方式输出所述每帧语音特征对应的基础音素;其中,N等于所述语音的帧数,M等于所述目标语种的基础音素的个数。The basic phonemes corresponding to each frame of speech feature are output according to an N×M matrix; wherein, N is equal to the number of frames of the speech, and M is equal to the number of basic phonemes of the target language.
优选的,所述神经网络模型通过如下步骤获得:Preferably, the neural network model is obtained through the following steps:
获取训练用样本数据集,所述样本数据集包括样本语音以及与所述样本语音对应的样本基础音素标注结果;Acquiring a sample data set for training, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;
提取所述样本语音的样本语音特征;Extracting sample voice features of the sample voice;
将所述样本语音特征作为输入,将所述样本语音对应的样本基础音素标注结果作为输出训练神经网络模型。The sample voice feature is used as an input, and the sample basic phoneme labeling result corresponding to the sample voice is used as an output to train a neural network model.
优选的,所述将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素包括:Preferably, the inputting the voice features into a pre-trained neural network model of the target language by frame, and outputting the basic phonemes corresponding to the voice features of each frame includes:
将所述语音特征按帧输入预先训练好的目标语种的GMM-HMM模型对所述语音特征做强制对齐,得到每帧语音特征对应的至少一个基础音素。The voice features are input into a pre-trained GMM-HMM model of the target language in frames to perform forced alignment on the voice features to obtain at least one basic phoneme corresponding to each frame of voice features.
优选的,所述根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分包括:Preferably, the calculating the voice score for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword includes:
根据所述语音特征的基础音素和所述候选关键词的基础音素,通过多种得分计算策略计算获得多个得分,对多个得分进行融合得到最终得分。According to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword, multiple scores are calculated through multiple score calculation strategies, and the multiple scores are merged to obtain a final score.
优选的,所述得分计算策略包括:动态规划、受限制约束的最长序列得分、在N×M矩阵空间里暴力穷举后的最优路径得分中的至少两个。Preferably, the score calculation strategy includes: at least two of dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute force exhaustion in the N×M matrix space.
优选的,所述根据所述得分判断是否有关键词被激活包括:Preferably, the judging whether a keyword is activated according to the score includes:
按照得分从大到小的顺序,依次判断候选关键词得分与该候选关键词预先定义的得分阈值的关系,直至判断到有候选关键词的得分大于该候选关键词预先定义的得分阈值,将该候选关键词激活后停止判断。The candidate keywords are examined in descending order of score, comparing each candidate keyword's score with its pre-defined score threshold, until a candidate keyword whose score is greater than its pre-defined score threshold is found; that candidate keyword is activated and the checking stops.
其中,图4示例性的展示出了计算机系统的架构,具体可以包括处理器1510,视频显示适配器1511,磁盘驱动器1512,输入/输出接口1513,网络接口1514,以及存储器1520。上述处理器1510、视频显示适配器1511、磁盘驱动器1512、输入/输出接口1513、网络接口1514,与存储器1520之间可以通过通信总线1530进行通信连接。Among them, FIG. 4 exemplarily shows the architecture of the computer system, which may specifically include a processor 1510, a video display adapter 1511, a disk drive 1512, an input/output interface 1513, a network interface 1514, and a memory 1520. The processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520 may be communicatively connected through the communication bus 1530.
其中,处理器1510可以采用通用的CPU(Central Processing Unit,中央处理器)、微处理器、应用专用集成电路(Application Specific Integrated Circuit,ASIC)、或者一个或多个集成电路等方式实现,用于执行相关程序,以实现本申请所提供的技术方案。The processor 1510 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in this application.
存储器1520可以采用ROM(Read Only Memory,只读存储器)、RAM(Random Access Memory,随机存取存储器)、静态存储设备,动态存储设备等形式实现。存储器1520可以存储用于控制计算机系统1500运行的操作系统1521,用于 控制计算机系统1500的低级别操作的基本输入输出系统(BIOS)。另外,还可以存储网页浏览器1523,数据存储管理系统1524,以及图标字体处理系统1525等等。上述图标字体处理系统1525就可以是本申请实施例中具体实现前述各步骤操作的应用程序。总之,在通过软件或者固件来实现本申请所提供的技术方案时,相关的程序代码保存在存储器1520中,并由处理器1510来调用执行。The memory 1520 may be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory, random access memory), static storage device, dynamic storage device, etc. The memory 1520 may store an operating system 1521 for controlling the operation of the computer system 1500, and a basic input output system (BIOS) for controlling the low-level operation of the computer system 1500. In addition, a web browser 1523, a data storage management system 1524, and an icon font processing system 1525 can also be stored. The foregoing icon font processing system 1525 may be an application program that specifically implements the foregoing steps in the embodiment of the present application. In short, when the technical solution provided by the present application is implemented through software or firmware, the related program code is stored in the memory 1520 and is called and executed by the processor 1510.
输入/输出接口1513用于连接输入/输出模块,以实现信息输入及输出。输入/输出模块可以作为组件配置在设备中(图中未示出),也可以外接于设备以提供相应功能。其中输入设备可以包括键盘、鼠标、触摸屏、麦克风、各类传感器等,输出设备可以包括显示器、扬声器、振动器、指示灯等。The input/output interface 1513 is used to connect input/output modules to realize information input and output. The input/output module can be configured in the device as a component (not shown in the figure), or can be externally connected to the device to provide corresponding functions. The input device may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output device may include a display, a speaker, a vibrator, an indicator light, and the like.
网络接口1514用于连接通信模块(图中未示出),以实现本设备与其他设备的通信交互。其中通信模块可以通过有线方式(例如USB、网线等)实现通信,也可以通过无线方式(例如移动网络、WIFI、蓝牙等)实现通信。The network interface 1514 is used to connect a communication module (not shown in the figure) to realize communication interaction between the device and other devices. The communication module can realize communication through wired means (such as USB, network cable, etc.), or through wireless means (such as mobile network, WIFI, Bluetooth, etc.).
总线1530包括一通路,在设备的各个组件(例如处理器1510、视频显示适配器1511、磁盘驱动器1512、输入/输出接口1513、网络接口1514,与存储器1520)之间传输信息。The bus 1530 includes a path to transmit information between various components of the device (for example, the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, and the memory 1520).
另外,该计算机系统1500还可以从虚拟资源对象领取条件信息数据库1541中获得具体领取条件的信息,以用于进行条件判断,等等。In addition, the computer system 1500 can also obtain information about specific receiving conditions from the virtual resource object receiving condition information database 1541 for condition judgment, and so on.
需要说明的是,尽管上述设备仅示出了处理器1510、视频显示适配器1511、磁盘驱动器1512、输入/输出接口1513、网络接口1514,存储器1520,总线1530等,但是在具体实施过程中,该设备还可以包括实现正常运行所必需的其他组件。此外,本领域的技术人员可以理解的是,上述设备中也可以仅包含实现本申请方案所必需的组件,而不必包含图中所示的全部组件。It should be noted that although the above device only shows the processor 1510, the video display adapter 1511, the disk drive 1512, the input/output interface 1513, the network interface 1514, the memory 1520, the bus 1530, and so on, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may also contain only the components necessary to implement the solution of the present application, and need not contain all the components shown in the figure.
通过以上的实施方式的描述可知,本领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件平台的方式来实现。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,云服务器,或者网络设备等)执行本申请各个实施例或者实施例的某些部分所述的方法。From the description of the foregoing implementations, those skilled in the art can clearly understand that the present application can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions to enable a computer device (which may be a personal computer, a cloud server, a network device, or the like) to execute the methods described in the various embodiments, or in parts of the embodiments, of the present application.
本说明书中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分互相参见即可,每个实施例重点说明的都是与其他实施例的不同之处。尤其,对于系统或系统实施例而言,由于其基本相似于方法实施例,所以描述得比较简单,相关之处参见方法实施例的部分说明即可。以上所描述的系统及系统实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性劳动的情况下,即可以理解并实施。The embodiments in this specification are described in a progressive manner; for the parts that are the same or similar between embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the system or system embodiment is basically similar to the method embodiment, its description is relatively brief, and for the relevant points reference may be made to the description of the method embodiment. The systems and system embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
以上对本申请所提供的基于神经网络的语音关键词检测方法、装置及系统,进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处。综上所述,本说明书内容不应理解为对本申请的限制。综上所述,本发明采用非常简单的方式实现同样的功能,本发明将传统的多个关键词需要多个神经网络模型,改变为多个关键词只需要1个神经网络模型,可以将神经网络的尺寸做的非常小,模型10M即可取得非常优异的性能,从而适合在嵌入式设备部署,占用非常低的资源完成功能。另外,关键词可以任意配置,不需要重新针对特定的关键词搜集数据并且重新训练模型;同时,修改关键词时不需要重新训练模型,减少了麻烦的搜集特定关键词语料的步骤,节约了模型重新训练所需要的时间。The neural-network-based voice keyword detection method, device, and system provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, those of ordinary skill in the art may, based on the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application. To sum up, the present invention achieves the same function in a very simple way: whereas multiple keywords traditionally require multiple neural network models, in the present invention multiple keywords need only one neural network model. The network can be made very small; a model of about 10M in size already achieves excellent performance, making it suitable for deployment on embedded devices and able to perform its function with very low resource usage. In addition, keywords can be configured arbitrarily, without collecting data for specific keywords again or retraining the model; likewise, modifying keywords does not require retraining the model, which removes the troublesome step of collecting corpus for specific keywords and saves the time needed for retraining.
以上所述仅为本发明的优选实施例,并非因此限制本发明的专利范围,凡是在本发明的构思下,利用本发明说明书及附图内容所作的等效结构变换,或直接/间接运用在其他相关的技术领域均包括在本发明的专利保护范围内。The above are only preferred embodiments of the present invention and do not therefore limit its patent scope; any equivalent structural transformation made under the concept of the present invention using the contents of the specification and drawings, or any direct or indirect application in other related technical fields, is included in the scope of patent protection of the present invention.
Claims (10)
- 一种基于神经网络的语音关键词检测方法,其特征在于,包括以下步骤:A method for detecting speech keywords based on neural network is characterized in that it comprises the following steps:接收待检测的语音,并提取所述语音的语音特征;Receiving the voice to be detected, and extracting voice features of the voice;将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素;Inputting the voice features into a pre-trained neural network model of the target language by frame, and outputting the basic phonemes corresponding to the voice features of each frame;将预先设置的每一候选关键词映射为对应的基础音素;Map each candidate keyword set in advance to the corresponding basic phoneme;根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分;Calculating the voice score for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword;根据所述得分判断是否有关键词被激活。Determine whether any keywords are activated according to the score.
- 如权利要求1所述的方法,其特征在于,所述输出所述每帧语音特征对应的基础音素包括:The method according to claim 1, wherein said outputting the basic phoneme corresponding to the speech feature of each frame comprises:按照N×M矩阵的方式输出所述每帧语音特征对应的基础音素;其中,N等于所述语音的帧数,M等于所述目标语种的基础音素的个数。The basic phonemes corresponding to each frame of speech feature are output according to an N×M matrix; wherein, N is equal to the number of frames of the speech, and M is equal to the number of basic phonemes of the target language.
- 如权利要求1所述的方法,其特征在于,所述神经网络模型通过如下步骤获得:The method of claim 1, wherein the neural network model is obtained through the following steps:获取训练用样本数据集,所述样本数据集包括样本语音以及与所述样本语音对应的样本基础音素标注结果;Acquiring a sample data set for training, the sample data set including a sample voice and a sample basic phoneme labeling result corresponding to the sample voice;提取所述样本语音的样本语音特征;Extracting sample voice features of the sample voice;将所述样本语音特征作为输入,将所述样本语音对应的样本基础音素标注结果作为输出训练神经网络模型。The sample voice feature is used as an input, and the sample basic phoneme labeling result corresponding to the sample voice is used as an output to train a neural network model.
- 如权利要求1所述的方法,其特征在于,所述将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素包括:The method according to claim 1, wherein said inputting said voice features into a pre-trained neural network model of the target language by frame, and outputting the basic phonemes corresponding to said voice features of each frame comprises:将所述语音特征按帧输入预先训练好的目标语种的GMM-HMM模型对所 述语音特征做强制对齐,得到每帧语音特征对应的至少一个基础音素。The voice features are input into a pre-trained GMM-HMM model of the target language in frames, and the voice features are forcibly aligned to obtain at least one basic phoneme corresponding to each frame of voice features.
- 如权利要求1所述的方法,其特征在于,所述根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分包括:The method of claim 1, wherein the calculating the voice score for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword comprises:根据所述语音特征的基础音素和所述候选关键词的基础音素,通过多种得分计算策略计算获得多个得分,对多个得分进行融合得到最终得分。According to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword, multiple scores are calculated through multiple score calculation strategies, and the multiple scores are merged to obtain a final score.
- 如权利要求5所述的方法,其特征在于,所述得分计算策略包括:动态规划、受限制约束的最长序列得分、在N×M矩阵空间里暴力穷举后的最优路径得分中的至少两个。The method according to claim 5, wherein the score calculation strategy comprises at least two of: dynamic programming, the longest sequence score subject to constraints, and the optimal path score after brute-force exhaustion in the N×M matrix space.
- 如权利要求1-6所述的方法,其特征在于,所述根据所述得分判断是否有关键词被激活包括:The method according to any one of claims 1 to 6, wherein the judging whether a keyword is activated according to the score comprises:按照得分从大到小的顺序,依次判断候选关键词得分与该候选关键词预先定义的得分阈值的关系,直至判断到有候选关键词的得分大于该候选关键词预先定义的得分阈值,将该候选关键词激活后停止判断。Examining the candidate keywords in descending order of score, comparing each candidate keyword's score with its pre-defined score threshold, until a candidate keyword whose score is greater than its pre-defined score threshold is found, activating that candidate keyword, and then stopping the checking.
- 一种基于神经网络的语音关键词检测装置,其特征在于,所述装置包括:A voice keyword detection device based on neural network, characterized in that, the device comprises:语音特征提取单元,用于接收待检测的语音,并提取所述语音的语音特征;The voice feature extraction unit is used to receive the voice to be detected and extract the voice feature of the voice;基础音素预测单元,用于将所述语音特征按帧输入预先训练好的目标语种的神经网络模型,输出所述每帧语音特征对应的基础音素;The basic phoneme prediction unit is configured to input the voice features into a pre-trained neural network model of the target language according to frames, and output the basic phonemes corresponding to the voice features of each frame;候选词映射单元,用于将每一候选关键词映射为对应的基础音素;Candidate word mapping unit for mapping each candidate keyword to a corresponding basic phoneme;得分计算单元,用于根据所述语音特征的基础音素和所述候选关键词的基础音素计算所述语音为所述每一候选关键词的得分;A score calculation unit, configured to calculate the voice for each candidate keyword according to the basic phoneme of the voice feature and the basic phoneme of the candidate keyword;判断单元,用于根据所述得分判断是否有关键词被激活。The judging unit is used to judge whether a keyword is activated according to the score.
- 如权利要求8所述的装置,其特征在于,所述基础音素预测单元,用于按照N×M矩阵的方式输出所述每帧语音特征对应的基础音素;其中,N等于所述语音的帧数,M等于所述目标语种的基础音素的个数。The device according to claim 8, wherein the basic phoneme prediction unit is configured to output the basic phonemes corresponding to each frame of voice features as an N×M matrix, wherein N is equal to the number of frames of the voice and M is equal to the number of basic phonemes of the target language.
- 一种计算机系统,其特征在于,包括:A computer system, characterized in that it comprises:一个或多个处理器;以及与所述一个或多个处理器关联的存储器,所述存储器用于存储程序指令,所述程序指令在被所述一个或多个处理器读取执行时,执行如权利要求1-7所述的方法。One or more processors; and a memory associated with the one or more processors, where the memory is used to store program instructions that, when read and executed by the one or more processors, perform the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CA3162745A CA3162745A1 (en) | 2019-11-26 | 2020-08-28 | Method of detecting speech keyword based on neutral network, device and system |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911173619.1A CN110992929A (en) | 2019-11-26 | 2019-11-26 | Voice keyword detection method, device and system based on neural network |
CN201911173619.1 | 2019-11-26 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021103712A1 true WO2021103712A1 (en) | 2021-06-03 |
Family
ID=70087106
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/111940 WO2021103712A1 (en) | 2019-11-26 | 2020-08-28 | Neural network-based voice keyword detection method and device, and system |
Country Status (3)
Country | Link |
---|---|
CN (1) | CN110992929A (en) |
CA (1) | CA3162745A1 (en) |
WO (1) | WO2021103712A1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113506584A (en) * | 2021-07-06 | 2021-10-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Data processing method and device |
CN114978866A (en) * | 2022-05-25 | 2022-08-30 | 北京天融信网络安全技术有限公司 | Detection method, detection device and electronic equipment |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110992929A (en) * | 2019-11-26 | 2020-04-10 | 苏宁云计算有限公司 | Voice keyword detection method, device and system based on neural network |
CN111489737B (en) * | 2020-04-13 | 2020-11-10 | 深圳市友杰智新科技有限公司 | Voice command recognition method and device, storage medium and computer equipment |
CN111797607B (en) * | 2020-06-04 | 2024-03-29 | 语联网(武汉)信息技术有限公司 | Sparse noun alignment method and system |
CN111933124B (en) * | 2020-09-18 | 2021-04-30 | 电子科技大学 | Keyword detection method capable of supporting self-defined awakening words |
CN113724710A (en) * | 2021-10-19 | 2021-11-30 | 广东优碧胜科技有限公司 | Voice recognition method and device, electronic equipment and computer readable storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8321218B2 (en) * | 2009-06-19 | 2012-11-27 | L.N.T.S. Linguistech Solutions Ltd | Searching in audio speech |
US20170148429A1 (en) * | 2015-11-24 | 2017-05-25 | Fujitsu Limited | Keyword detector and keyword detection method |
US20180068653A1 (en) * | 2016-09-08 | 2018-03-08 | Intel IP Corporation | Method and system of automatic speech recognition using posterior confidence scores |
CN108182937A (en) * | 2018-01-17 | 2018-06-19 | 出门问问信息科技有限公司 | Keyword recognition method, device, equipment and storage medium |
CN108305617A (en) * | 2018-01-31 | 2018-07-20 | 腾讯科技(深圳)有限公司 | The recognition methods of voice keyword and device |
US10199037B1 (en) * | 2016-06-29 | 2019-02-05 | Amazon Technologies, Inc. | Adaptive beam pruning for automatic speech recognition |
CN110364142A (en) * | 2019-06-28 | 2019-10-22 | 腾讯科技(深圳)有限公司 | Phoneme of speech sound recognition methods and device, storage medium and electronic device |
CN110992929A (en) * | 2019-11-26 | 2020-04-10 | 苏宁云计算有限公司 | Voice keyword detection method, device and system based on neural network |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103971678B (en) * | 2013-01-29 | 2015-08-12 | 腾讯科技(深圳)有限公司 | Keyword spotting method and apparatus |
CN105374352B (en) * | 2014-08-22 | 2019-06-18 | 中国科学院声学研究所 | A kind of voice activated method and system |
CN106297776B (en) * | 2015-05-22 | 2019-07-09 | 中国科学院声学研究所 | A kind of voice keyword retrieval method based on audio template |
CN108615525B (en) * | 2016-12-09 | 2020-10-09 | 中国移动通信有限公司研究院 | Voice recognition method and device |
CN107331384B (en) * | 2017-06-12 | 2018-05-04 | 平安科技(深圳)有限公司 | Audio recognition method, device, computer equipment and storage medium |
CN107358951A (en) * | 2017-06-29 | 2017-11-17 | 阿里巴巴集团控股有限公司 | A kind of voice awakening method, device and electronic equipment |
CN108932943A (en) * | 2018-07-12 | 2018-12-04 | 广州视源电子科技股份有限公司 | Command word sound detection method, device, equipment and storage medium |
CN109243460A (en) * | 2018-08-15 | 2019-01-18 | 浙江讯飞智能科技有限公司 | A method of automatically generating news or interrogation record based on the local dialect |
CN110223673B (en) * | 2019-06-21 | 2020-01-17 | 龙马智芯(珠海横琴)科技有限公司 | Voice processing method and device, storage medium and electronic equipment |
- 2019-11-26 CN CN201911173619.1A patent/CN110992929A/en active Pending
- 2020-08-28 CA CA3162745A patent/CA3162745A1/en active Pending
- 2020-08-28 WO PCT/CN2020/111940 patent/WO2021103712A1/en active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8321218B2 (en) * | 2009-06-19 | 2012-11-27 | L.N.T.S. Linguistech Solutions Ltd | Searching in audio speech |
US20170148429A1 (en) * | 2015-11-24 | 2017-05-25 | Fujitsu Limited | Keyword detector and keyword detection method |
US10199037B1 (en) * | 2016-06-29 | 2019-02-05 | Amazon Technologies, Inc. | Adaptive beam pruning for automatic speech recognition |
US20180068653A1 (en) * | 2016-09-08 | 2018-03-08 | Intel IP Corporation | Method and system of automatic speech recognition using posterior confidence scores |
CN108182937A (en) * | 2018-01-17 | 2018-06-19 | 出门问问信息科技有限公司 | Keyword recognition method, device, equipment and storage medium |
CN108305617A (en) * | 2018-01-31 | 2018-07-20 | 腾讯科技(深圳)有限公司 | The recognition methods of voice keyword and device |
CN110364142A (en) * | 2019-06-28 | 2019-10-22 | 腾讯科技(深圳)有限公司 | Phoneme of speech sound recognition methods and device, storage medium and electronic device |
CN110992929A (en) * | 2019-11-26 | 2020-04-10 | 苏宁云计算有限公司 | Voice keyword detection method, device and system based on neural network |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113506584A (en) * | 2021-07-06 | 2021-10-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Data processing method and device |
CN113506584B (en) * | 2021-07-06 | 2024-05-14 | 腾讯音乐娱乐科技(深圳)有限公司 | Data processing method and device |
CN114978866A (en) * | 2022-05-25 | 2022-08-30 | 北京天融信网络安全技术有限公司 | Detection method, detection device and electronic equipment |
CN114978866B (en) * | 2022-05-25 | 2024-02-20 | 北京天融信网络安全技术有限公司 | Detection method, detection device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CA3162745A1 (en) | 2021-06-03 |
CN110992929A (en) | 2020-04-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021103712A1 (en) | Neural network-based voice keyword detection method and device, and system | |
CN107134279B (en) | Voice awakening method, device, terminal and storage medium | |
US10192545B2 (en) | Language modeling based on spoken and unspeakable corpuses | |
US9842585B2 (en) | Multilingual deep neural network | |
CN108831439B (en) | Voice recognition method, device, equipment and system | |
CN108847241B (en) | Method for recognizing conference voice as text, electronic device and storage medium | |
CN110033760B (en) | Modeling method, device and equipment for speech recognition | |
US9805718B2 (en) | Clarifying natural language input using targeted questions | |
CN103956169B (en) | A kind of pronunciation inputting method, device and system | |
US20140257804A1 (en) | Exploiting heterogeneous data in deep neural network-based speech recognition systems | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
US20150134332A1 (en) | Speech recognition method and device | |
WO2017127296A1 (en) | Analyzing textual data | |
CN108399914B (en) | Voice recognition method and device | |
CN110782880B (en) | Training method and device for prosody generation model | |
KR20220004224A (en) | Context biasing for speech recognition | |
WO2018192186A1 (en) | Speech recognition method and apparatus | |
CN111883121A (en) | Awakening method and device and electronic equipment | |
JP7400112B2 (en) | Biasing alphanumeric strings for automatic speech recognition | |
CN112562723B (en) | Pronunciation accuracy determination method and device, storage medium and electronic equipment | |
KR20200095947A (en) | Electronic device and Method for controlling the electronic device thereof | |
CN110853669B (en) | Audio identification method, device and equipment | |
JP2004094257A (en) | Method and apparatus for generating question of decision tree for speech processing | |
CN112559725A (en) | Text matching method, device, terminal and storage medium | |
CN111862963A (en) | Voice wake-up method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20891808; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 3162745; Country of ref document: CA |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 20891808; Country of ref document: EP; Kind code of ref document: A1 |