CN106601229A - Voice awakening method based on soc chip
- Publication number
- CN106601229A (application CN201611003861.0A)
- Authority
- CN
- China
- Prior art keywords
- mfcc
- model
- frame
- likelihood value
- voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
All classifications fall under G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING:
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
- G10L15/142—Hidden Markov Models [HMMs]
- G10L15/148—Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
- G10L19/26—Pre-filtering or post-filtering
- G10L25/18—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
- G10L25/24—Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/78—Detection of presence or absence of voice signals
Abstract
The invention discloses a voice wake-up method based on an SoC chip, comprising the following steps: S1, the chip collects voice data and samples it, converting the analog signal into a digital signal; S2, MFCC feature extraction is performed on the digitized voice data; S3, voice activity detection is performed on the MFCC feature values, judging whether the new frame of MFCC data is a speech frame; if not, the method returns to step S2 and discards the data; if so, the MFCC feature values are passed to the next step; S4, the MFCC feature values are recognized by a speech recognition algorithm based on an HMM model, and the controlled device is woken if the recognition result is a valid instruction, otherwise the method returns to step S2. The real-time system of the invention, implemented with a highly robust algorithm, achieves a high recognition rate and meets the requirements of low power consumption and high performance.
Description
Technical field
The present invention relates to the technical field of speech recognition, and more particularly to a voice wake-up method based on an SoC chip.
Background technology
With the development of the times, more and more electronic devices are entering daily life. While enjoying the convenience these devices bring, people expect them to be more intelligent and to support interaction without touch controls.
Voice wake-up means that the user speaks a preset voice instruction and a device in a dormant state enters the command-ready state directly. With this technology, anyone can activate a device in any environment and at any time simply by speaking the preset wake-up word, achieving low power consumption and touch-free interaction.
However, most existing voice wake-up technology is implemented on computers and mobile phones and requires a powerful processor, making it unsuitable for industrial applications. Voice wake-up based on an MCU is inexpensive, but the limited processor performance prevents it from achieving good results.
The content of the invention
The technical problem to be solved by the present invention is to provide a voice wake-up method based on an SoC chip in which a real-time system, implemented with a highly robust algorithm, achieves a high recognition rate and meets the requirements of low power consumption and high performance.
To solve the above technical problem, the present invention provides the following technical scheme: a voice wake-up method based on an SoC chip, comprising the following steps:
S1, the chip collects voice data, samples it, and converts the analog signal into a digital signal;
S2, MFCC feature extraction is performed on the digitized voice data;
S3, voice activity detection is performed on the MFCC feature values: it is judged whether the new frame of MFCC data is a speech frame; if not, the method returns to step S2 and the data is discarded; if so, the MFCC feature values enter the next step of processing;
S4, the MFCC feature values are recognized by a speech recognition algorithm based on an HMM model; if the recognition result is a valid instruction, the controlled device is woken; otherwise the method returns to step S2.
Further, the MFCC feature extraction in step S2 is specifically:
1) preprocessing of the digital signal, including pre-emphasis, framing and windowing;
2) performing an FFT on each frame signal to obtain the spectrum, and from it the amplitude spectrum |Xn(k)|;
3) applying the Mel filter bank Wl(k) to the amplitude spectrum |Xn(k)|, each triangular filter being given by

$$W_l(k)=\begin{cases}\dfrac{k-o(l)}{c(l)-o(l)}, & o(l)\le k\le c(l)\\ \dfrac{h(l)-k}{h(l)-c(l)}, & c(l)<k\le h(l)\\ 0, & \text{otherwise}\end{cases}$$

where k is the k-th FFT point and o(l), c(l), h(l) are respectively the lower limit, center and upper limit frequency of the l-th triangular filter;
4) taking the logarithm of all filter outputs m(l) and applying a discrete cosine transform to obtain the MFCC values:

$$\mathrm{mfcc}(i)=\sqrt{\frac{2}{N}}\sum_{l=1}^{L}\ln m(l)\,\cos\!\left(\frac{\pi i}{L}\Big(l-\frac{1}{2}\Big)\right)$$

where N = L = 26 is the number of filters and i is the MFCC coefficient order, with i running from 1 to 12, giving 12 cepstral features; in addition, the log energy of the frame is appended as the 13th feature parameter, defined as

$$E=\ln\sum_{k}|X_n(k)|^{2}$$

where Xn(k) is the amplitude; this yields 13 feature parameters, namely 12 cepstral features plus 1 log energy;
5) the 13 standard cepstral parameters (MFCC) reflect only the static characteristics of the speech; the dynamic characteristics are described by differences of these static features; the first-order difference dtm(i) and the second-order difference dtmm(i) of the 13 MFCC features are computed as

$$\mathrm{dtm}_t(i)=\frac{\sum_{k=1}^{2}k\,\big(c_{t+k}(i)-c_{t-k}(i)\big)}{2\sum_{k=1}^{2}k^{2}}$$

with dtmm(i) obtained by applying the same formula to dtm(i).
The 13 standard MFCC features, their 13 first-order differences and 13 second-order differences form the 39-dimensional MFCC feature parameters, completing the MFCC feature extraction.
Further, the voice activity detection performed on the feature values in step S3 uses a GMM-based voice activity detection method, which assumes that speech and background noise follow Gaussian mixture distributions in a specific feature space, and builds a silence model and a non-silence model in that feature space; the new frame of MFCC data is then scored, computing the likelihood value P1 of the silence model and the likelihood value P2 of the non-silence model; P1 and P2 are compared, and the current MFCC frame is a speech frame if P2 is greater than P1, otherwise a silence frame.
Further, after the current MFCC frame is judged a speech frame, when judging the next MFCC frame the likelihood values P1 and P2 are each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the frame is a speech frame, otherwise a silence frame. Likewise, after the current MFCC frame is judged a silence frame, the likelihood values P1 and P2 are each multiplied by the corresponding transition probability when judging the next frame, and the frame is a speech frame if the product for P2 exceeds the product for P1, otherwise a silence frame.
The corresponding transition probabilities are pre-set model data.
Further, the likelihood value P1 of the silence model and the likelihood value P2 of the non-silence model are computed as follows.
The silence model and the non-silence model each consist of 13 39-dimensional Gaussian models. The probability density function of an M-order Gaussian mixture model is the weighted sum of M Gaussian probability density functions, as in formula 3.1:

$$P(X)=\sum_{i=1}^{M}\omega_i\,b_i(X)\qquad(3.1)$$

where M is the number of multidimensional Gaussian models (M = 13), X is a D-dimensional random vector (the 39-dimensional MFCC feature vector), bi(X) is a component distribution and ωi is its mixture weight. Each component is a D-dimensional joint Gaussian probability distribution, as in formula 3.2:

$$b(X)=\frac{1}{(2\pi)^{D/2}\Big(\prod_{i=1}^{D}\sigma_i^{2}\Big)^{1/2}}\exp\!\left(-\frac{1}{2}\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}\right)\qquad(3.2)$$

where μi is the mean of the i-th dimension, σi² its variance, xi the i-th dimension of the input MFCC feature vector, and D the total dimension (D = 39).
Since formula 3.2 is too costly to compute directly, it is simplified. Taking the logarithm of both sides gives

$$\ln b(X)=-\frac{1}{2}\Big(D\ln 2\pi+\sum_{i=1}^{D}\ln\sigma_i^{2}\Big)-\frac{1}{2}\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}$$

The term to the left of the plus sign involves only parameters that are known once the model is trained, so it can be precomputed and stored as the model parameter gconst:

$$\mathrm{gconst}=D\ln 2\pi+\sum_{i=1}^{D}\ln\sigma_i^{2}$$

Formula 3.2 is thus transformed into formula 3.3:

$$\ln b(X)=-\frac{1}{2}\Big(\mathrm{gconst}+\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}\Big)\qquad(3.3)$$

and formula 3.1 reduces to formula 3.4:

$$P=\sum_{i=1}^{M}\omega_i\,e^{\ln b_i(X)}\qquad(3.4)$$

Substituting the MFCC frame and the model parameters into these formulas yields the frame's likelihood value for the silence model and for the non-silence model.
Further, substituting the MFCC frame and the model parameters to obtain the likelihood values of the silence model and the non-silence model proceeds in the following concrete steps:
1) the MFCC feature vector of each speech frame is matched against the silence model and the non-silence model: first (xi-μi)²/σi² is computed and accumulated over the dimensions, giving the exponential parts fa0 and fa1 of the multidimensional Gaussian distributions of the two models:

$$\mathrm{fa}=\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}$$

where the means μi and variances σi² are read directly from the model data;
2) from the result of the previous step, the log-likelihood of the multidimensional Gaussian distribution is obtained as

$$\ln b(X)=-\frac{1}{2}\,(\mathrm{gconst}+\mathrm{fa})$$

where gconst is pre-trained data read directly from the model, completing the computation of the multidimensional Gaussian log-likelihood ln bi(X) of formula 3.3;
3) as noted above, the silence model and the non-silence model each contain 13 multidimensional Gaussian distributions, so 13 iterations of steps 1 and 2 produce the 13 log-likelihoods ln bi(X); substituting these together with the corresponding weights ωi into formula 3.4 gives the likelihood P1 of the current frame under the silence model and the likelihood P2 under the non-silence model.
Further, the speech recognition algorithm based on an HMM model in step S4 is specifically:
S41, load the HMM model and construct the recognition network of HMM chains;
S42, match the MFCC feature values against the recognition network of the HMM model and compute initial likelihood values;
S43, from the initial likelihood values, find the optimal path in the HMM chain network with the Token Passing algorithm, completing the decoding;
S44, judge whether the voice instruction matches an HMM chain; if so it is valid voice, otherwise invalid voice.
With the above technical scheme, the present invention has at least the following advantages:
(1) by moving part of the computation of the original algorithm into the log domain, a large number of multiplications are converted into additions, reducing latency when running on the microprocessor; the complex parts of the algorithm are accelerated by dedicated hardware, further lowering latency and finally achieving real-time recognition;
(2) the real-time system, implemented with a highly robust algorithm, achieves a high recognition rate;
(3) the algorithm is easy to upgrade: it is divided into three independent modules, feature extraction, voice activity detection and speech recognition, so the system can later be improved by replacing a single submodule with a better-performing algorithm.
Description of the drawings
Fig. 1 is the overall flow chart of the voice wake-up method based on an SoC chip of the present invention;
Fig. 2 is a schematic diagram of a triangular filter;
Fig. 3 is a schematic diagram of the triangular filter bank;
Fig. 4 is the flow chart of voice activity detection;
Fig. 5 is a schematic diagram of the parameter layout of a 39-dimensional Gaussian model;
Fig. 6 is the step-by-step flow chart of voice activity detection;
Fig. 7 is a schematic diagram of the pre-trained model data used in voice activity detection;
Fig. 8 is the overall step flow chart of the speech recognition algorithm;
Fig. 9 is a schematic diagram of an example HMM chain used in the speech recognition algorithm.
Specific embodiments
It should be noted that, provided there is no conflict, the embodiments of this application and the features within them may be combined with one another. The application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Fig. 1 is the overall algorithm flow chart of the present invention; the processing flow of each module is as follows:
1. Speech front-end processing:
Front-end processing converts the analog voice signal into a digital signal by sampling. In this scheme the sample rate is 16 kHz. The digital voice signal is in PCM (Pulse Code Modulation) format, the most basic and original voice format, obtained by sampling and quantizing the analog speech signal. In the present invention the ADC is integrated in the SoC chip; speech detection processing is performed every 10 ms, the sampling frequency is 16 k samples per second, and the data width is 16 bits. A minimal sketch of this chunking is given below.
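As an illustration of these front-end parameters, the following minimal Python sketch segments a 16 kHz, 16-bit PCM buffer into 10 ms processing chunks; the zero-filled buffer stands in for real ADC output, and the use of NumPy is an assumption for the example.

```python
import numpy as np

FS = 16000            # sampling frequency: 16 k samples per second
BITS = 16             # data width: 16-bit PCM
CHUNK = FS // 100     # 10 ms of audio = 160 samples

# stand-in for the on-chip ADC output: one second of 16-bit PCM
pcm = np.zeros(FS, dtype=np.int16)

# hand one 10 ms chunk at a time to the detection pipeline
for start in range(0, len(pcm) - CHUNK + 1, CHUNK):
    chunk = pcm[start:start + CHUNK].astype(np.float32) / 2.0 ** (BITS - 1)
    # ... MFCC extraction and VAD follow, as described in the next sections
```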
2. MFCC feature extraction:
1) preprocessing of the signal, including pre-emphasis (Preemphasis), frame blocking (Frame Blocking) and windowing (Windowing); the sampling frequency of the voice signal is fs = 16 kHz, and since a voice signal can be considered stationary over 10-30 ms, each frame is set to 10 ms, giving a frame length of 160 points and a frame shift of half a frame length, i.e. 80 points;
2) a 256-point FFT is applied to each frame to obtain the spectrum, and from it the amplitude spectrum |Xn(k)|;
3) the Mel filter bank Wl(k) is applied to the amplitude spectrum |Xn(k)|, each triangular filter being given by

$$W_l(k)=\begin{cases}\dfrac{k-o(l)}{c(l)-o(l)}, & o(l)\le k\le c(l)\\ \dfrac{h(l)-k}{h(l)-c(l)}, & c(l)<k\le h(l)\\ 0, & \text{otherwise}\end{cases}$$

where k is the k-th FFT point and o(l), c(l), h(l) are the lower limit, center and upper limit frequency of the l-th triangular filter, as shown in Fig. 2. In the present invention the Mel filter bank consists of 26 triangular filters whose parameters are computed in advance; the triangular filter bank is shown in Fig. 3, where the abscissa corresponds to the FFT points and the ordinate is Wl(k). Since the FFT of a real signal is symmetric, only the first half of the points is used to compute the spectrum, which is then fed into the triangular filters;
4) the logarithm (Logarithm) of every filter output m(l) is taken, and a discrete cosine transform then yields the MFCCs:

$$\mathrm{mfcc}(i)=\sqrt{\frac{2}{N}}\sum_{l=1}^{L}\ln m(l)\,\cos\!\left(\frac{\pi i}{L}\Big(l-\frac{1}{2}\Big)\right)$$

where N = L = 26 is the number of filters and i is the MFCC coefficient order; the present invention takes i up to 12, giving 12 cepstral features. In addition, the log energy of the frame is appended as the 13th feature parameter, defined as

$$E=\ln\sum_{k}|X_n(k)|^{2}$$

This yields 13 feature parameters (12 cepstral features plus 1 log energy);
5) these 13 standard cepstral parameters (MFCC) reflect only the static characteristics of the speech; the dynamic characteristics can be described by differences of these static features. The first-order difference dtm(i) and second-order difference dtmm(i) of the 13 MFCC features are computed as

$$\mathrm{dtm}_t(i)=\frac{\sum_{k=1}^{2}k\,\big(c_{t+k}(i)-c_{t-k}(i)\big)}{2\sum_{k=1}^{2}k^{2}}$$

with dtmm(i) obtained by applying the same formula to dtm(i). The 13 standard MFCC features, their 13 first-order differences and 13 second-order differences form the 39-dimensional MFCC feature vector, completing MFCC feature extraction. A minimal sketch of this pipeline is given below.
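The extraction steps above can be illustrated with the following minimal NumPy sketch. The Mel-scale filter-edge computation, the 0.97 pre-emphasis coefficient and the Hamming window are common choices assumed here, since the patent does not specify them; the 26-band triangular filter bank, 12-coefficient DCT, log energy and difference features follow the formulas above.

```python
import numpy as np

def mel_filterbank(n_filters=26, n_fft=256, fs=16000):
    """Build the 26 triangular filters Wl(k); the edges o(l), c(l), h(l) are
    taken from equally spaced points on the Mel scale (an assumed choice)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
    bins = np.floor(edges / (fs / 2.0) * (n_fft // 2)).astype(int)
    W = np.zeros((n_filters, n_fft // 2 + 1))
    for l in range(n_filters):
        o, c, h = bins[l], bins[l + 1], bins[l + 2]
        W[l, o:c + 1] = (np.arange(o, c + 1) - o) / max(c - o, 1)  # rising edge
        W[l, c:h + 1] = (h - np.arange(c, h + 1)) / max(h - c, 1)  # falling edge
    return W

def mfcc_frame(frame, W, n_ceps=12):
    """12 cepstra plus log energy for one 160-sample frame, per steps 2)-4)."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), 256))  # |Xn(k)|
    logm = np.log(np.maximum(W @ spec, 1e-10))            # log filter outputs
    L = W.shape[0]
    i = np.arange(1, n_ceps + 1)[:, None]
    dct = np.cos(np.pi * i * (np.arange(L)[None, :] + 0.5) / L)
    ceps = np.sqrt(2.0 / L) * dct @ logm                  # discrete cosine transform
    log_e = np.log(np.maximum(np.sum(spec ** 2), 1e-10))  # 13th parameter
    return np.append(ceps, log_e)

def add_deltas(feats):
    """Append first- and second-order differences: 13 -> 39 dimensions."""
    def delta(x):
        pad = np.pad(x, ((2, 2), (0, 0)), mode='edge')
        return ((pad[3:-1] - pad[1:-3]) + 2.0 * (pad[4:] - pad[:-4])) / 10.0
    d = delta(feats)
    return np.hstack([feats, d, delta(d)])

# 160-sample frames with an 80-sample shift, 0.97 pre-emphasis per frame
pcm = np.zeros(16000)                        # stand-in for one second of audio
W = mel_filterbank()
frames = [pcm[s:s + 160] for s in range(0, len(pcm) - 159, 80)]
feats13 = np.array([mfcc_frame(np.append(f[0], f[1:] - 0.97 * f[:-1]), W)
                    for f in frames])
feats39 = add_deltas(feats13)                # the 39-dim MFCC feature vectors
```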
3. Voice activity detection (VAD):
The present invention uses a voice activity detection method based on GMM models. The method assumes that speech and background noise each follow a Gaussian mixture distribution in a specific feature space; their GMM models are built in that feature space, and model matching is then used to detect the valid speech segments in the measured signal. The algorithm flow is shown in Fig. 4.
The models are trained in advance with the HTK toolkit. A single 39-dimensional Gaussian model consists of 1 weight (MIXTURE), 39 means (MEAN), 39 variances (VARIANCE) and 1 gconst, as shown in Fig. 5; a sketch of this parameter layout is given after this paragraph.
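The parameter layout of Fig. 5 can be illustrated by a small data structure; the field names mirror the MIXTURE/MEAN/VARIANCE/gconst layout described above, the gconst formula follows the derivation given earlier, and the random initial values are placeholders.

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class GaussComponent:
    """One 39-dimensional Gaussian of Fig. 5: weight, means, variances, gconst."""
    weight: float                 # MIXTURE weight
    mean: np.ndarray              # 39 MEAN values
    var: np.ndarray               # 39 VARIANCE values
    gconst: float = field(init=False)

    def __post_init__(self):
        # gconst = D*ln(2*pi) + sum(ln(sigma_i^2)), precomputed at training time
        d = len(self.mean)
        self.gconst = d * np.log(2.0 * np.pi) + float(np.sum(np.log(self.var)))

# a silence model or a non-silence model is a set of 13 such components
rng = np.random.default_rng(0)
silence_model = [GaussComponent(1.0 / 13, rng.normal(size=39), np.ones(39))
                 for _ in range(13)]
```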
The silence model and the non-silence model are each made up of 13 multidimensional Gaussian models of the kind shown in Fig. 5. When a new frame of speech data enters the system, its new 39-dimensional MFCC feature vector is scored against the silence model and the non-silence model; the two likelihood values are compared, the model with the larger likelihood is the matching model of the current frame, and the frame is accordingly judged to be a speech frame or not. The detailed VAD process is shown in Fig. 6.
The transition probabilities a11, a12, a21, a22 are pre-trained model data, as shown in Fig. 7: a11 is the probability that the current frame is a silence frame given that the previous frame was a silence frame; a12 is the probability that the current frame is a speech frame given that the previous frame was a silence frame; a21 is the probability that the current frame is a silence frame given that the previous frame was a speech frame; a22 is the probability that the current frame is a speech frame given that the previous frame was a speech frame.
The most complex computation in the whole process is the likelihood computation, introduced below.
The probability density function of the 13-order multidimensional Gaussian mixture model is the weighted sum of 13 multidimensional Gaussian probability density functions, as in formula 3.1:

$$P(X)=\sum_{i=1}^{M}\omega_i\,b_i(X)\qquad(3.1)$$

where M is the number of multidimensional Gaussian models (13 in the present invention), X is a D-dimensional random vector (the 39-dimensional MFCC feature vector mentioned above), bi(X) is a component distribution and ωi is its mixture weight. Each component is a D-dimensional joint Gaussian probability distribution:

$$b(X)=\frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}}\exp\!\left(-\frac{1}{2}(X-\mu)^{T}\Sigma^{-1}(X-\mu)\right)\qquad(3.2)$$

In one dimension, μ is the expectation and σ² the variance; in D dimensions, D is the dimension of X and Σ is the D×D covariance matrix, defined as Σ = E[(X-μ)(X-μ)^T], with |Σ| the value of its determinant. With the diagonal covariance used here, formula 3.2 reduces to a per-dimension form with means μi and variances σi²; taking the logarithm of both sides and precomputing the constant term as gconst gives formula 3.3:

$$\ln b(X)=-\frac{1}{2}\Big(\mathrm{gconst}+\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}\Big),\qquad \mathrm{gconst}=D\ln 2\pi+\sum_{i=1}^{D}\ln\sigma_i^{2}\qquad(3.3)$$

and formula 3.1 correspondingly reduces to formula 3.4:

$$P=\sum_{i=1}^{M}\omega_i\,e^{\ln b_i(X)}\qquad(3.4)$$
The concrete computation steps of the VAD algorithm are then:
1) the 39-dimensional MFCC feature vector of each speech frame is matched against the silence model and the non-silence model: first (xi-μi)²/σi² is computed and the 39 results are accumulated, giving the exponential parts fa0 and fa1 of the multidimensional Gaussian distributions of the two models (this computation is completed by a hardware acceleration IP):

$$\mathrm{fa}=\sum_{i=1}^{39}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}$$

where the means μi and variances σi² are read directly from the model data;
2) from the result of the previous step, the log-likelihood of each multidimensional Gaussian distribution is obtained as

$$\ln b(X)=-\frac{1}{2}\,(\mathrm{gconst}+\mathrm{fa})$$

where gconst is pre-trained data read directly from the model; this completes the multidimensional Gaussian log-likelihood of formula 3.3;
3) as noted above, the silence model and the non-silence model each contain 13 multidimensional Gaussian distributions, so 13 iterations of steps 1 and 2 produce the 13 log-likelihoods ln bi(X); substituting these together with the corresponding weights ωi into formula 3.4 gives the likelihood P1 of the current frame under the silence model and the likelihood P2 under the non-silence model;
4) finally the transition probabilities a are applied:
if the previous frame was a speech frame, the probability that the current frame is a speech frame is a22·P2, and the probability that it is a silence frame is a21·P1;
if the previous frame was a silence frame, the probability that the current frame is a speech frame is a12·P2, and the probability that it is a silence frame is a11·P1;
the speech-frame probability is compared with the silence-frame probability: if the speech-frame probability is larger, the current frame is a speech frame; otherwise it is a silence frame. This completes the VAD algorithm; a minimal sketch of the per-frame decision is given below.
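The per-frame VAD decision can be summarized in the following sketch, assuming the model layout above; the toy models and transition values are placeholders, and the computation follows formulas 3.3 and 3.4 with the final transition-probability comparison of step 4.

```python
import numpy as np

def log_gauss(x, mean, var, gconst):
    """Formula 3.3: ln b(X) = -(gconst + sum((x-mu)^2 / sigma^2)) / 2."""
    fa = np.sum((x - mean) ** 2 / var)      # exponential part (fa0 or fa1)
    return -0.5 * (gconst + fa)

def gmm_likelihood(x, weights, means, variances, gconsts):
    """Formula 3.4: P = sum_i w_i * exp(ln b_i(X)) over the 13 components."""
    log_bs = np.array([log_gauss(x, m, v, g)
                       for m, v, g in zip(means, variances, gconsts)])
    return float(np.sum(weights * np.exp(log_bs)))

def vad_step(x, silence, speech, a, prev_is_speech):
    """One frame of the Fig. 6 decision; a holds the probabilities a11..a22."""
    p1 = gmm_likelihood(x, *silence)        # silence model likelihood P1
    p2 = gmm_likelihood(x, *speech)         # non-silence model likelihood P2
    if prev_is_speech:
        p_speech, p_silence = a['a22'] * p2, a['a21'] * p1
    else:
        p_speech, p_silence = a['a12'] * p2, a['a11'] * p1
    return p_speech > p_silence             # True means speech frame

# toy models: 13 components of 39 dimensions, uniform weights, unit variances
rng = np.random.default_rng(1)
make = lambda shift: (np.full(13, 1.0 / 13),
                      rng.normal(shift, 1.0, (13, 39)),
                      np.ones((13, 39)),
                      np.full(13, 39.0 * np.log(2.0 * np.pi)))
silence, speech = make(0.0), make(3.0)
a = {'a11': 0.9, 'a12': 0.1, 'a21': 0.2, 'a22': 0.8}
frame = rng.normal(3.0, 1.0, 39)            # a frame resembling speech
print(vad_step(frame, silence, speech, a, prev_is_speech=False))
```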
4. Speech recognition algorithm:
The flow of this module is shown in Fig. 8. The loading of the model and the construction of the HMM chains are completed at program initialization and need not be repeated afterwards; this module is entered for computation only when the upstream VAD module detects valid voice. Each state of the HMM model used by this module consists of 24 GMMs. The flow is as follows:
(1) load the HMM model and construct the recognition network of HMM chains;
(2) match the MFCC feature values against the recognition network of the HMM model and compute initial likelihood values;
(3) from the initial likelihood values, find the optimal path in the HMM chain network with the Token Passing algorithm, completing the decoding;
(4) judge whether the voice instruction matches an HMM chain; if so it is valid voice, otherwise invalid voice.
The whole flow is described below, taking "shutdown" (Chinese "guan ji") as an example; the HMM chain corresponding to "shutdown" is as follows (a real HMM chain is longer and each syllable consists of multiple states; it is simplified here for ease of explanation). "Shutdown" is split into the syllables "g", "uan", "j", "i"; the 4 syllables are described by 4 HMM states, which are connected to obtain the HMM chain shown in Fig. 9.
a. The starting point of the network (state "g") initializes its token value Pg = 0.
b. When the first frame of MFCC data arrives, token passing starts. For the first frame only the token value Pg exists, and it is passed to states "g" and "uan":
Pg = Pg + a11 + log(GMMg)
Puan = Pg + a12 + log(GMMuan)
where log(GMMg) is the likelihood of the MFCC data for state "g" and log(GMMuan) is the likelihood for state "uan"; the likelihood computation is the same as in the VAD (see formulas 3.3 and 3.4), and transitions and likelihoods are added because the computation is carried out in the log domain.
c. When the second frame of data arrives, states "g" and "uan" both hold token values, so tokens are passed to the states each of them connects to.
The token value of state "g" is updated:
Pg = Pg + a11 + log(GMMg)
The token value of state "uan" is updated:
Pg→uan = Pg + a12
Puan→uan = Puan + a22
after the update: Puan = max(Pg→uan, Puan→uan) + log(GMMuan)
Since the left side of state "uan" is connected to "g" and the state is also connected to itself, two token values arrive; they are compared and the larger one is kept.
The token of state "j" is updated:
Pj = Puan + a23 + log(GMMj)
d. When the third frame arrives, the token value of state "g" is updated:
Pg = Pg + a11 + log(GMMg)
The token value of state "uan" is updated:
Pg→uan = Pg + a12
Puan→uan = Puan + a22
after the update: Puan = max(Pg→uan, Puan→uan) + log(GMMuan)
The token of state "j" is updated:
Puan→j = Puan + a23
Pj→j = Pj + a33
after the update: Pj = max(Puan→j, Pj→j) + log(GMMj)
The token of state "i" is updated:
Pi = Pj + a34 + log(GMMi)
e. When the fourth frame arrives, the token value of state "g" is updated:
Pg = Pg + a11 + log(GMMg)
The token value of state "uan" is updated:
Pg→uan = Pg + a12
Puan→uan = Puan + a22
after the update: Puan = max(Pg→uan, Puan→uan) + log(GMMuan)
The token of state "j" is updated:
Puan→j = Puan + a23
Pj→j = Pj + a33
after the update: Pj = max(Puan→j, Pj→j) + log(GMMj)
The token of state "i" is updated:
Pj→i = Pj + a34
Pi→i = Pi + a44
after the update: Pi = max(Pj→i, Pi→i) + log(GMMi)
At this point all frames of the voice instruction have been input and token comparison starts: the token values of the four states are sorted by size. If the token value of the last state of the HMM chain (state "i") is the largest, the input voice instruction matches this HMM chain and the decoding result is "shutdown"; otherwise the input is considered invalid voice.
The whole decoding process shows that, as the frame count grows, tokens diffuse steadily from the left end to the right end; during this process each state holds a token, and tokens are passed to adjacent states and updated. When the specified number of frames is reached (the frame count is determined by the length of the preset voice instruction: it is small for a short instruction such as "shutdown", and larger for a longer instruction such as "open sesame"), the tokens of all states are sorted; if the largest token value lies in the end state of an HMM chain, the currently input voice matches that HMM chain. In practical applications the number of recognizable voice instructions can be increased, giving multiple HMM chains; at the last frame, all states of all HMM chains are sorted to determine which specific instruction was spoken. A minimal sketch of this token-passing decoder is given below.
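The decoding described above can be illustrated by the following sketch of log-domain token passing over a single left-to-right chain; the frame labels and the scoring function are stand-ins for real MFCC frames scored by the per-state GMMs, and the transition values are placeholders.

```python
import numpy as np

def token_passing(frames, n_states, log_a_self, log_a_next, log_gmm):
    """Left-to-right token passing over an HMM chain such as g-uan-j-i.

    log_gmm(state, frame) stands in for the per-state GMM log-likelihood
    log(GMM_state) of the patent; transition values and likelihoods are
    added in the log domain, exactly as in the update equations above.
    """
    tokens = np.full(n_states, -np.inf)
    tokens[0] = 0.0                               # token starts at state "g"
    for frame in frames:
        new = np.full(n_states, -np.inf)
        for j in range(n_states):
            stay = tokens[j] + log_a_self[j]      # e.g. P_uan->uan = P_uan + a22
            enter = tokens[j - 1] + log_a_next[j - 1] if j > 0 else -np.inf
            best = max(stay, enter)               # keep the larger incoming token
            if best > -np.inf:
                new[j] = best + log_gmm(j, frame)
        tokens = new
    return tokens  # the instruction is valid if the last state holds the largest token

# toy example: each frame is labelled with the state it matches best,
# a stand-in for real MFCC frames scored by the per-state GMMs
states = ["g", "uan", "j", "i"]
score = lambda j, f: 0.0 if states[j] == f else -5.0   # hypothetical scorer
frames = ["g", "g", "uan", "uan", "j", "j", "i", "i"]  # a spoken "guan ji"
toks = token_passing(frames, 4, np.log(np.full(4, 0.6)),
                     np.log(np.full(4, 0.4)), score)
print("matched 'shutdown' chain" if toks.argmax() == len(states) - 1
      else "invalid voice")
```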
Although embodiments of the present invention have been shown and described, those of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the invention is defined by the appended claims and their equivalents.
Claims (7)
1. A voice wake-up method based on an SoC chip, characterised by comprising the following steps:
S1, the chip collects voice data, samples it, and converts the analog signal into a digital signal;
S2, MFCC feature extraction is performed on the digitized voice data;
S3, voice activity detection is performed on the MFCC feature values: it is judged whether the new frame of MFCC data is a speech frame; if not, the method returns to step S2 and the data is discarded; if so, the MFCC feature values enter the next step of processing;
S4, the MFCC feature values are recognized by a speech recognition algorithm based on an HMM model; if the recognition result is a valid instruction, the controlled device is woken; otherwise the method returns to step S2.
2. The voice wake-up method based on an SoC chip as claimed in claim 1, characterised in that the MFCC feature extraction in step S2 is specifically:
1) preprocessing of the digital signal, including pre-emphasis, framing and windowing;
2) performing an FFT on each frame signal to obtain the spectrum, and from it the amplitude spectrum |Xn(k)|;
3) applying the Mel filter bank Wl(k) to the amplitude spectrum |Xn(k)|, each triangular filter being given by

$$W_l(k)=\begin{cases}\dfrac{k-o(l)}{c(l)-o(l)}, & o(l)\le k\le c(l)\\ \dfrac{h(l)-k}{h(l)-c(l)}, & c(l)<k\le h(l)\\ 0, & \text{otherwise}\end{cases}$$

where k is the k-th FFT point and o(l), c(l), h(l) are respectively the lower limit, center and upper limit frequency of the l-th triangular filter;
4) taking the logarithm of all filter outputs m(l) and applying a discrete cosine transform to obtain the MFCC values:

$$\mathrm{mfcc}(i)=\sqrt{\frac{2}{N}}\sum_{l=1}^{L}\ln m(l)\,\cos\!\left(\frac{\pi i}{L}\Big(l-\frac{1}{2}\Big)\right)$$

where N = L = 26 is the number of filters and i is the MFCC coefficient order, with i running from 1 to 12, giving 12 cepstral features; in addition, the log energy of the frame is appended as the 13th feature parameter, defined as

$$E=\ln\sum_{k}|X_n(k)|^{2}$$

where Xn(k) is the amplitude; this yields 13 feature parameters, namely 12 cepstral features plus 1 log energy;
5) the 13 standard cepstral parameters (MFCC) reflect only the static characteristics of the speech; the dynamic characteristics are described by differences of these static features; the first-order difference dtm(i) and the second-order difference dtmm(i) of the 13 MFCC features are computed as

$$\mathrm{dtm}_t(i)=\frac{\sum_{k=1}^{2}k\,\big(c_{t+k}(i)-c_{t-k}(i)\big)}{2\sum_{k=1}^{2}k^{2}}$$

with dtmm(i) obtained by applying the same formula to dtm(i);
the 13 standard MFCC features, their 13 first-order differences and 13 second-order differences form the 39-dimensional MFCC feature parameters, completing the MFCC feature extraction.
3. The voice wake-up method based on an SoC chip as claimed in claim 1, characterised in that the voice activity detection performed on the feature values in step S3 uses a GMM-based voice activity detection method, which assumes that speech and background noise follow Gaussian mixture distributions in a specific feature space and builds a silence model and a non-silence model in that feature space; the new frame of MFCC data is then scored, computing the likelihood value P1 of the silence model and the likelihood value P2 of the non-silence model; P1 and P2 are compared, and the current MFCC frame is a speech frame if P2 is greater than P1, otherwise a silence frame.
4. The voice wake-up method based on an SoC chip as claimed in claim 3, characterised in that, after the current MFCC frame is judged a speech frame, when judging the next MFCC frame the likelihood values P1 and P2 are each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the frame is a speech frame, otherwise a silence frame;
after the current MFCC frame is judged a silence frame, when judging the next MFCC frame the likelihood values P1 and P2 are likewise each multiplied by the corresponding transition probability and the two products are compared: if the product for P2 exceeds the product for P1, the frame is a speech frame, otherwise a silence frame;
the corresponding transition probabilities are pre-set model data.
5. The voice wake-up method based on an SoC chip as claimed in claim 3, characterised in that the likelihood value P1 of the silence model and the likelihood value P2 of the non-silence model are computed as follows:
the silence model and the non-silence model each consist of 13 39-dimensional Gaussian models; the probability density function of an M-order Gaussian mixture model is the weighted sum of M Gaussian probability density functions, as in formula 3.1:

$$P(X)=\sum_{i=1}^{M}\omega_i\,b_i(X)\qquad(3.1)$$

where M is the number of multidimensional Gaussian models (M = 13), X is a D-dimensional random vector (the 39-dimensional MFCC feature vector), bi(X) is a component distribution and ωi is its mixture weight; each component is a D-dimensional joint Gaussian probability distribution, as in formula 3.2:

$$b(X)=\frac{1}{(2\pi)^{D/2}\Big(\prod_{i=1}^{D}\sigma_i^{2}\Big)^{1/2}}\exp\!\left(-\frac{1}{2}\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}\right)\qquad(3.2)$$

where μi is the mean of the i-th dimension, σi² its variance, xi the i-th dimension of the input MFCC feature vector, and D the total dimension (D = 39);
since formula 3.2 is too costly to compute directly, it is simplified by taking the logarithm of both sides:

$$\ln b(X)=-\frac{1}{2}\Big(D\ln 2\pi+\sum_{i=1}^{D}\ln\sigma_i^{2}\Big)-\frac{1}{2}\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}$$

the term to the left of the plus sign involves only parameters known once the model is trained, so it is precomputed and stored as the model parameter gconst:

$$\mathrm{gconst}=D\ln 2\pi+\sum_{i=1}^{D}\ln\sigma_i^{2}$$

so that formula 3.2 is transformed into formula 3.3:

$$\ln b(X)=-\frac{1}{2}\Big(\mathrm{gconst}+\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}\Big)\qquad(3.3)$$

and formula 3.1 reduces to formula 3.4:

$$P=\sum_{i=1}^{M}\omega_i\,e^{\ln b_i(X)}\qquad(3.4)$$

substituting the MFCC frame and the model parameters into these formulas yields the frame's likelihood value for the silence model and for the non-silence model.
6. The voice wake-up method based on an SoC chip as claimed in claim 5, characterised in that substituting the MFCC frame and the model parameters to obtain the likelihood values of the silence model and the non-silence model comprises the concrete steps of:
1) matching the MFCC feature vector of each speech frame against the silence model and the non-silence model: first computing (xi-μi)²/σi² and accumulating the results, giving the exponential parts fa0 and fa1 of the multidimensional Gaussian distributions of the two models:

$$\mathrm{fa}=\sum_{i=1}^{D}\frac{(x_i-\mu_i)^{2}}{\sigma_i^{2}}$$

where the means μi and variances σi² are read directly from the model data;
2) computing from the result of the previous step the log-likelihood of the multidimensional Gaussian distribution:

$$\ln b(X)=-\frac{1}{2}\,(\mathrm{gconst}+\mathrm{fa})$$

where gconst is pre-trained data read directly from the model, completing the multidimensional Gaussian log-likelihood ln bi(X) of formula 3.3;
3) since the silence model and the non-silence model each contain 13 multidimensional Gaussian distributions, 13 iterations of steps 1 and 2 yield the 13 log-likelihoods ln bi(X); substituting these together with the corresponding weights ωi into formula 3.4 gives the likelihood P1 of the current frame under the silence model and the likelihood P2 under the non-silence model.
7. The voice wake-up method based on an SoC chip as claimed in claim 1, characterised in that the speech recognition algorithm based on an HMM model in step S4 is specifically:
S41, loading the HMM model and constructing the recognition network of HMM chains;
S42, matching the MFCC feature values against the recognition network of the HMM model and computing initial likelihood values;
S43, finding, from the initial likelihood values, the optimal path in the HMM chain network with the Token Passing algorithm, completing the decoding;
S44, judging whether the voice instruction matches an HMM chain; if so it is valid voice, otherwise invalid voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201611003861.0A | 2016-11-15 | 2016-11-15 | Voice awakening method based on soc chip
Publications (1)
Publication Number | Publication Date
---|---
CN106601229A | 2017-04-26

Family
ID=58590197

Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201611003861.0A | Voice awakening method based on soc chip | 2016-11-15 | 2016-11-15

Country Status (1)
Country | Link
---|---
CN | CN106601229A (en), Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1455387A (en) * | 2002-11-15 | 2003-11-12 | 中国科学院声学研究所 | Rapid decoding method for voice identifying system |
CN101051462A (en) * | 2006-04-07 | 2007-10-10 | 株式会社东芝 | Feature-vector compensating apparatus and feature-vector compensating method |
CN203253172U (en) * | 2013-03-18 | 2013-10-30 | 北京承芯卓越科技有限公司 | Intelligent voice communication toy |
CN105096939A (en) * | 2015-07-08 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice wake-up method and device |
CN105206271A (en) * | 2015-08-25 | 2015-12-30 | 北京宇音天下科技有限公司 | Intelligent equipment voice wake-up method and system for realizing method |
CN105869628A (en) * | 2016-03-30 | 2016-08-17 | 乐视控股(北京)有限公司 | Voice endpoint detection method and device |
Non-Patent Citations (1)
Title
---|
Jiang Nan, "Research and Implementation of a Voice Activity Detection Algorithm for a Mobile-Phone Speech Recognition System", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107886957A (en) * | 2017-11-17 | 2018-04-06 | 广州势必可赢网络科技有限公司 | Voice wake-up method and device combined with voiceprint recognition |
CN111868825B (en) * | 2018-03-12 | 2024-05-28 | 赛普拉斯半导体公司 | Dual pipeline architecture for wake phrase detection with speech start detection |
CN111868825A (en) * | 2018-03-12 | 2020-10-30 | 赛普拉斯半导体公司 | Dual pipeline architecture for wake phrase detection with voice onset detection |
CN108615535A (en) * | 2018-05-07 | 2018-10-02 | 腾讯科技(深圳)有限公司 | Sound enhancement method, device, intelligent sound equipment and computer equipment |
CN108986822A (en) * | 2018-08-31 | 2018-12-11 | 出门问问信息科技有限公司 | Audio recognition method, device, electronic equipment and non-transient computer storage medium |
CN109088611A (en) * | 2018-09-28 | 2018-12-25 | 咪付(广西)网络技术有限公司 | A kind of auto gain control method and device of acoustic communication system |
CN112102848A (en) * | 2019-06-17 | 2020-12-18 | 华为技术有限公司 | Method, chip and terminal for identifying music |
CN112102848B (en) * | 2019-06-17 | 2024-04-26 | 华为技术有限公司 | Method, chip and terminal for identifying music |
CN110580919A (en) * | 2019-08-19 | 2019-12-17 | 东南大学 | voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene |
CN110580919B (en) * | 2019-08-19 | 2021-09-28 | 东南大学 | Voice feature extraction method and reconfigurable voice feature extraction device under multi-noise scene |
CN111028831B (en) * | 2019-11-11 | 2022-02-18 | 云知声智能科技股份有限公司 | Voice awakening method and device |
CN111028831A (en) * | 2019-11-11 | 2020-04-17 | 云知声智能科技股份有限公司 | Voice awakening method and device |
CN111124511A (en) * | 2019-12-09 | 2020-05-08 | 浙江省北大信息技术高等研究院 | Wake-up chip and wake-up system |
CN114141272A (en) * | 2020-08-12 | 2022-03-04 | 瑞昱半导体股份有限公司 | Sound event detection system and method |
CN115132231A (en) * | 2022-08-31 | 2022-09-30 | 安徽讯飞寰语科技有限公司 | Voice activity detection method, device, equipment and readable storage medium |
CN115132231B (en) * | 2022-08-31 | 2022-12-13 | 安徽讯飞寰语科技有限公司 | Voice activity detection method, device, equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106601229A (en) | Voice awakening method based on soc chip | |
US12080315B2 (en) | Audio signal processing method, model training method, and related apparatus | |
CN102800316B (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN105976812B (en) | A kind of audio recognition method and its equipment | |
CN107767861B (en) | Voice awakening method and system and intelligent terminal | |
CN103117059B (en) | Voice signal characteristics extracting method based on tensor decomposition | |
US20170154640A1 (en) | Method and electronic device for voice recognition based on dynamic voice model selection | |
CN110675859B (en) | Multi-emotion recognition method, system, medium, and apparatus combining speech and text | |
CN109754790B (en) | Speech recognition system and method based on hybrid acoustic model | |
CN106653056A (en) | Fundamental frequency extraction model based on LSTM recurrent neural network and training method thereof | |
CN112786004A (en) | Speech synthesis method, electronic device, and storage device | |
CN111210807A (en) | Speech recognition model training method, system, mobile terminal and storage medium | |
CN105895082A (en) | Acoustic model training method and device as well as speech recognition method and device | |
CN106782502A (en) | A kind of speech recognition equipment of children robot | |
CN112382301B (en) | Noise-containing voice gender identification method and system based on lightweight neural network | |
CN106297769B (en) | A kind of distinctive feature extracting method applied to languages identification | |
CN113823323A (en) | Audio processing method and device based on convolutional neural network and related equipment | |
Sagi et al. | A biologically motivated solution to the cocktail party problem | |
CN111243621A (en) | Construction method of GRU-SVM deep learning model for synthetic speech detection | |
CN110580897A (en) | audio verification method and device, storage medium and electronic equipment | |
CN114400006B (en) | Speech recognition method and device | |
Zhu et al. | Continuous speech recognition based on DCNN-LSTM | |
CN115132170A (en) | Language classification method and device and computer readable storage medium | |
Hu et al. | Speaker Recognition Based on 3DCNN-LSTM. | |
CN114333790A (en) | Data processing method, device, equipment, storage medium and program product |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170426