CN111986661A - Deep neural network speech recognition method based on speech enhancement in complex environment - Google Patents
Deep neural network speech recognition method based on speech enhancement in complex environment
- Publication number
- CN111986661A CN111986661A CN202010880777.7A CN202010880777A CN111986661A CN 111986661 A CN111986661 A CN 111986661A CN 202010880777 A CN202010880777 A CN 202010880777A CN 111986661 A CN111986661 A CN 111986661A
- Authority
- CN
- China
- Prior art keywords
- speech
- frame
- signal
- voice
- speech recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Images
Classifications
- G10L15/063 Training (Speech recognition; Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G06N3/08 Learning methods (Computing arrangements based on biological models; Neural networks)
- G10L15/16 Speech classification or search using artificial neural networks
- G10L21/0208 Noise filtering (Speech enhancement, e.g. noise reduction or echo cancellation)
Abstract
Description
Technical Field
The invention belongs to the field of speech recognition, and in particular relates to a deep neural network speech recognition method based on speech enhancement in complex environments.
Background
In recent years, technological innovation has advanced rapidly, the economy has prospered, and society has progressed. Having met the basic needs of food, clothing, housing, and transportation, people now place greater demands on building a better life. This vision has driven the emergence of virtual social applications such as QQ and WeChat that integrate daily life, work, and entertainment. These applications bring great convenience to people's lives, work, and communication, in particular through their speech recognition features. Speech recognition frees users from traditional interaction methods such as keyboard and mouse, letting them convey information through the most natural form of communication, speech. Speech recognition has also gradually found wide application in industry, communications, home appliances, home services, medical care, consumer electronics, and other fields.
Most of today's social applications achieve very high speech recognition accuracy on clean speech, that is, speech with no background noise and no interfering sound sources. When the speech to be recognized contains noise, interference, or reverberation, however, the accuracy of existing recognition systems drops sharply. The main reason is that existing systems consider neither denoising nor interference suppression, either in the speech signal preprocessing stage at the recognition front end or when building the acoustic model.
Existing Chinese speech recognition algorithms place strict requirements on speech signal quality and lack robustness: when speech quality is poor or the audio is heavily contaminated, recognition fails, so they are usable only in a narrow range of clean, ideal speech conditions. To improve the applicability of speech recognition in real-life environments and address these shortcomings, the present invention proposes a deep neural network speech recognition method based on speech enhancement in complex environments, built on deep learning neural networks and speech enhancement. First, speech enhancement is applied at the recognition front end to the speech signals recorded under the various complex conditions to be recognized; a language text data set is established, a language model is built and trained; a Chinese dictionary file is established; and a neural network acoustic model is built and trained on the enhanced speech training set, with the help of the language model and dictionary, to obtain the acoustic model weight file, thereby establishing a well-performing speech recognition system for complex speech environments.
In view of the practical applications of speech recognition technology, the complex-environment speech recognition addressed by the present invention covers four types of speech environments: clean speech, Gaussian white noise, background noise or interfering sound sources, and reverberation. The method of the invention achieves high recognition accuracy, generalizes well, and is robust to a wide range of environmental factors.
Summary of the Invention
The purpose of the present invention is to provide a deep neural network speech recognition method based on speech enhancement in complex environments.
To achieve the above purpose, the present invention adopts the following technical solution:
The deep neural network speech recognition method based on speech enhancement in complex environments builds its model on deep learning neural networks and speech enhancement; the overall flow of the technical solution is shown in Fig. 1. First, a complex-environment speech data set is built, and speech enhancement is applied to the speech signals to be recognized in the front-end speech signal preprocessing stage. A language text data set is then established, a language model is built and trained, and a Chinese dictionary file is established. Finally, a neural network acoustic model is built and trained on the enhanced speech training set, with the help of the language model and dictionary, to obtain the acoustic model weight file, thereby achieving accurate recognition of Chinese speech in complex environments. This addresses the problems of existing speech recognition algorithms: sensitivity to noise, high demands on speech quality, and a single application scenario. The method comprises the following steps:
Step 1: Build and process the speech data set for complex environments. Speech recorded in clean conditions, in Gaussian white noise, with background noise or interfering sound sources, and in reverberant environments is collected to form the speech data set C of the recognition system. The speech data of each environment in C is then split into a training set and a test set at a ratio of 5:1 (training utterances to test utterances). The training and test portions of all environments are pooled and shuffled to form the training set X and the test set T. The i-th utterance in X is denoted x_i, and the j-th utterance in T is denoted t_j. For each utterance in X, a label file in .txt format is prepared containing the name of the utterance and its correct Hanyu Pinyin sequence. A partial view of the training set label file is shown in Fig. 2.
Step 2: Apply speech enhancement to the training set X and the test set T, obtaining the enhanced training set X̃ and test set T̃; the i-th utterance of X̃ is denoted x̃_i and the j-th utterance of T̃ is denoted t̃_j. Taking the i-th training utterance x_i as an example, the enhancement proceeds as follows. The signal is read with MATLAB's built-in audioread function, which returns the sampling rate f_s and the matrix x_i(n) of speech samples, where x_i(n) is the sample value at time n. Pre-emphasis is applied to x_i(n) to obtain y_i(n); y_i(n) is then windowed with a Hamming window and split into frames, giving the per-frame information y_{i,r}(n), where y_{i,r}(n) is the speech information matrix of the r-th frame of the pre-emphasized i-th utterance. An FFT of y_{i,r}(n) gives the short-time spectrum of the r-th frame of the i-th signal. The gammatone weighting functions H_l are then applied band by band to this spectrum to obtain the power P_{i,r,l}(r,l) of the l-th band of the r-th frame, where l takes the values 0, ..., 39; the powers of all bands of the r-th frame are computed in the same way. Noise reduction, dereverberation, and spectral integration then yield the enhanced short-time spectrum of the r-th frame of the i-th signal. The same processing is applied to the other frames in turn, and an IFFT followed by frame synthesis in the time domain gives the enhanced signal x̃_i, which is placed in the enhanced training set X̃. The speech enhancement pipeline is shown in Fig. 3.
Step 3: Build the acoustic model for speech recognition. The acoustic model of this patent is built with a CNN and CTC. The input layer takes the speech signals x̃_i of the enhanced training set X̃, and the MFCC feature extraction algorithm is applied to each signal to obtain a 200-dimensional feature sequence. The hidden layers alternate convolutional and pooling layers, with Dropout layers introduced to prevent overfitting; the convolution kernel size is 3 and the pooling window size is 2. The output layer is a fully connected layer of 1423 neurons activated with softmax, and the CTC loss is used as the loss function to realize connectionist temporal multi-output; the 1423-dimensional output corresponds exactly to the 1423 common Hanyu Pinyin syllables in the Chinese dictionary file dict.txt built in Step 4. The network structure of the acoustic model is shown in Fig. 4, where the parameters of the convolutional, pooling, Dropout, and fully connected layers are marked.
Step 4: Build the 2-gram language model and the dictionary for speech recognition. This includes establishing the language text data set, building the 2-gram language model, and collecting and building the Chinese dictionary. The language text data set takes the form of an electronic .txt file containing newspapers, middle-school texts, and well-known novels. The dictionary of a language is stable; in the present invention the Chinese dictionary is a dict.txt file listing the Chinese characters corresponding to the 1423 Hanyu Pinyin syllables commonly used in daily life, taking into account that one syllable may correspond to several characters. A partial view of the dictionary is shown in Fig. 5.
Step 5: Train the 2-gram language model on the established language text data set to obtain the word occurrence table and the state transition table of the language model. The training proceeds as follows: the text of the language data set is read in a loop, the number of occurrences of each single word and of each pair of consecutive words is counted, and the counts are aggregated into a single-word occurrence table and a two-word state transition table. The language model training procedure is shown in Fig. 6.
Step 6: Train the acoustic model with the trained language model, the established dictionary, and the enhanced training set X̃, obtaining the acoustic model weight file and other parameter configuration files. The training proceeds as follows. The weights of the acoustic network are initialized, and the utterances of the enhanced training set X̃ are imported in turn. Each speech signal x̃_i is first processed by the MFCC feature extraction algorithm to obtain its 200-dimensional feature sequence; as listed in Fig. 7, this sequence is passed through the convolutional, pooling, Dropout, and fully connected layers in turn, and the output layer of 1423 neurons with softmax activation yields the 1423-dimensional acoustic features of the signal. These features are decoded with the language model and the dictionary, and the Hanyu Pinyin sequence of the recognized signal is output. The recognized Pinyin sequence is compared with the Pinyin label sequence of the utterance in the training set to compute the error, which is back-propagated to update the weights throughout the acoustic model; the CTC loss is used as the loss function and is optimized with the Adam algorithm. The batch size is set to 16, the number of epochs to 50, and the weight file is saved after every 500 training utterances. Each utterance of the training set is processed in this way until the acoustic model loss converges, at which point training is complete and the weight file and configuration files of the acoustic model are saved. The acoustic model training procedure is shown in Fig. 7.
Step 7: Use the trained speech-enhancement-based Chinese speech recognition system to recognize the speech of the test set, compute the recognition accuracy, and compare its performance with the traditional algorithm. The flow of the recognition test system is shown in Fig. 8; the recognition accuracy of this patent and its comparison with the traditional algorithm are shown in Figs. 9 and 10.
Advantages of the Invention
The deep neural network speech recognition method based on speech enhancement in complex environments solves the problems of existing speech recognition algorithms: sensitivity to noise and other complex environmental factors, high demands on speech quality, and a single application scenario. In addition, the proposed method uses deep neural network learning for acoustic modeling, giving the model strong transfer learning ability, and the introduction of speech enhancement makes the recognition system highly robust to interference from complex environmental factors.
Description of the Drawings
To illustrate the technical solution of the present invention more clearly, the drawings used in the description are briefly introduced below.
Fig. 1 is the flow chart of the speech recognition technical solution of the present invention;
Fig. 2 is a partial view of the speech labels of the speech recognition training set;
Fig. 3 is the framework of the speech enhancement pipeline for speech recognition;
Fig. 4 is the network framework of the speech recognition acoustic model;
Fig. 5 is a partial view of the dictionary built by the present invention;
Fig. 6 is the flow chart of language model training;
Fig. 7 is the training diagram of the acoustic model;
Fig. 8 is the flow chart of the speech recognition test system;
Fig. 9 compares the speech recognition algorithm of the present invention with the traditional algorithm in noisy environments;
Fig. 10 compares the speech recognition algorithm of the present invention with the traditional algorithm in reverberant environments.
Detailed Description
The deep neural network speech recognition method based on speech enhancement in complex environments is implemented in the following steps:
Step 1: Build and process the speech data set for complex environments. Speech recorded in clean conditions, in Gaussian white noise, with background noise or interfering sound sources, and in reverberant environments is collected to form the speech data set C of the recognition system. The speech data of each environment in C is then split into a training set and a test set at a ratio of 5:1 (training utterances to test utterances). The training and test portions of all environments are pooled and shuffled to form the training set X and the test set T. The i-th utterance in X is denoted x_i, and the j-th utterance in T is denoted t_j. For each utterance in X, a label file in .txt format is prepared containing the name of the utterance and its correct Hanyu Pinyin sequence. A partial view of the training set label file is shown in Fig. 2.
The collection methods are as follows. For clean speech, multiple speakers record under ideal laboratory conditions, using Chinese newspapers, novels, and student texts as material; each utterance is at most 10 seconds long, and 3000 clean utterances are recorded in total. Speech in the Gaussian white noise and reverberant environments is synthesized with Adobe Audition: the recorded clean speech is mixed with Gaussian white noise, and the reverberant speech is re-synthesized directly with the reverberation environments provided by the software; 3000 utterances are produced for each of these two environments. Speech with background noise or interfering sound sources is mainly recorded in the field, by multiple speakers in noisy places such as factories and restaurants, again 3000 utterances in total. All collected files are in .wav format. The collected speech is then partitioned: for each of the four environments, 2500 utterances form the training portion and the remaining 500 the test portion. In total the training set X contains 10000 utterances and the test set T contains 2000 utterances; the training and test sets are each shuffled to avoid overfitting of the trained model.
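As an illustration only, the 5:1 split, pooling, and shuffling described above can be scripted as in the following sketch; the directory layout, file names, and transcript source are assumptions, not details given in the patent.

```python
import os
import random

# Hypothetical layout: one folder of .wav files per environment.
ENVIRONMENTS = ["clean", "white_noise", "background_noise", "reverb"]  # assumed names
DATA_ROOT = "speech_data"                                              # assumed path

train_set, test_set = [], []
for env in ENVIRONMENTS:
    wavs = sorted(f for f in os.listdir(os.path.join(DATA_ROOT, env)) if f.endswith(".wav"))
    random.shuffle(wavs)
    # 5:1 split as in Step 1: 2500 training utterances, 500 test utterances per environment.
    train_set += [(env, w) for w in wavs[:2500]]
    test_set += [(env, w) for w in wavs[2500:3000]]

# Pool and shuffle across environments so each batch mixes all four conditions.
random.shuffle(train_set)
random.shuffle(test_set)

# Label file: one line per utterance, "<name> <pinyin sequence>" (transcripts assumed given).
transcripts = {}  # assumed mapping from utterance name to its Hanyu Pinyin sequence
with open("train_labels.txt", "w", encoding="utf-8") as f:
    for env, wav in train_set:
        name = os.path.splitext(wav)[0]
        f.write(f"{name} {transcripts.get(name, '')}\n")
```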
Step 2: Apply speech enhancement to the training set X and the test set T, obtaining the enhanced training set X̃ and test set T̃; the i-th utterance of X̃ is denoted x̃_i and the j-th utterance of T̃ is denoted t̃_j. Taking the i-th training utterance x_i as an example, the enhancement proceeds as follows. The signal is read with MATLAB's built-in audioread function, which returns the sampling rate f_s and the matrix x_i(n) of speech samples, where x_i(n) is the sample value at time n. Pre-emphasis is applied to x_i(n) to obtain y_i(n); y_i(n) is then windowed with a Hamming window and split into frames, giving the per-frame information y_{i,r}(n), where y_{i,r}(n) is the speech information matrix of the r-th frame of the pre-emphasized i-th utterance. An FFT of y_{i,r}(n) gives the short-time spectrum of the r-th frame of the i-th signal. The gammatone weighting functions H_l are then applied band by band to this spectrum to obtain the power P_{i,r,l}(r,l) of the l-th band of the r-th frame, where l takes the values 0, ..., 39; the powers of all bands of the r-th frame are computed in the same way. Noise reduction, dereverberation, and spectral integration then yield the enhanced short-time spectrum of the r-th frame of the i-th signal. The same processing is applied to the other frames in turn, and an IFFT followed by frame synthesis in the time domain gives the enhanced signal x̃_i, which is placed in the enhanced training set X̃. The speech enhancement pipeline is shown in Fig. 3.
Each step of the speech enhancement is detailed below:
(1) Pre-emphasis of the speech signal
Pre-emphasis is applied to the i-th speech signal matrix x_i(n) of the training set X to obtain y_i(n), where y_i(n) = x_i(n) - αx_i(n-1); α is a constant, set to α = 0.98 in this patent, and x_i(n-1) is the sample matrix of the i-th training utterance at time n-1.
(2) Windowing and framing
A Hamming window w(n) is applied to the pre-emphasized signal y_i(n) to window it and split it into frames, turning the continuous signal into frame-by-frame discrete signals y_{i,r}(n),
where w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1, is the Hamming window function and N is the window length. In this patent the frame length is 50 ms and the frame shift is 10 ms. Windowing and framing the pre-emphasized signal y_i(n) gives the speech signal matrix y_{i,r}(n) of each frame, where y_{i,r}(n) is the speech information matrix of the r-th frame of the i-th utterance after pre-emphasis, windowing, and framing.
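A minimal numpy sketch of the pre-emphasis, windowing, and framing of steps (1) and (2); the 16 kHz sampling rate comes from the later FFT step, and the wav reader and file name are assumptions.

```python
import numpy as np
import soundfile as sf  # any wav reader would do; the MATLAB audioread call plays the same role

fs_expected = 16000                      # sampling rate assumed later (N = 1024 for 50 ms frames)
alpha = 0.98                             # pre-emphasis constant from step (1)
frame_len = int(0.050 * fs_expected)     # 50 ms frame length
frame_shift = int(0.010 * fs_expected)   # 10 ms frame shift

x, fs = sf.read("utterance.wav")         # hypothetical file name
assert fs == fs_expected

# (1) Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
y = np.append(x[0], x[1:] - alpha * x[:-1])

# (2) Framing with a Hamming window
window = np.hamming(frame_len)
num_frames = 1 + (len(y) - frame_len) // frame_shift
frames = np.stack([
    y[r * frame_shift : r * frame_shift + frame_len] * window
    for r in range(num_frames)
])  # shape (num_frames, frame_len)
```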
(3) FFT
The speech information matrix y_{i,r}(n) of the r-th frame of the i-th utterance is transformed from the time domain to the frequency domain by an FFT, giving the short-time spectrum Y_{i,r}(e^{jω_k}) of the r-th frame of the i-th signal.
(4) Computing the signal power P_{i,r,l}(r,l)
The short-time spectrum of each frame is weighted with the gammatone weighting functions to obtain the power of each band of each frame:
P_{i,r,l}(r,l) = Σ_{k=0}^{N-1} |Y_{i,r}(e^{jω_k}) H_l(e^{jω_k})|^2
where P_{i,r,l}(r,l) is the power of the l-th band of the r-th frame of the signal y_i(n), k is a dummy variable indexing the discrete frequencies, and ω_k is the discrete frequency; N = 1024 because the FFT uses a 50 ms frame length and a 16 kHz sampling rate. H_l(e^{jω_k}) is the spectrum of the gammatone filter bank for the l-th band evaluated at frequency index k, computed with a built-in MATLAB speech processing function whose input parameter is the band index l; Y_{i,r}(e^{jω_k}) is the short-time spectrum of the r-th frame, and L = 40 is the total number of channels.
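A sketch of the band-power computation in step (4). The gammatone filter bank magnitude response H is assumed to be precomputed (the patent obtains it from a MATLAB built-in), so its construction is outside this sketch.

```python
import numpy as np

N_FFT = 1024      # FFT length for 50 ms frames at 16 kHz
N_BANDS = 40      # L = 40 gammatone channels

def band_powers(frames: np.ndarray, H: np.ndarray) -> np.ndarray:
    """frames: (num_frames, frame_len) windowed time-domain frames.
    H: (N_BANDS, N_FFT//2 + 1) magnitude response of the gammatone filter bank,
       assumed precomputed (e.g., with an external gammatone package).
    Returns P of shape (num_frames, N_BANDS): P[r, l] is the power of band l in frame r."""
    spectrum = np.fft.rfft(frames, n=N_FFT)                  # short-time spectra Y_{i,r}
    filtered = np.abs(spectrum[:, None, :] * H[None, :, :])  # apply each band weight
    return np.sum(filtered ** 2, axis=-1)                    # sum |Y * H|^2 over frequency bins
```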
(5) Noise reduction and dereverberation of the speech signal
After the power P_{i,r,l}(r,l) has been obtained, noise reduction and dereverberation are performed in the following steps:
(1) Compute the low-pass power M_{i,r,l}[r,l] of the l-th band of the r-th frame:
M_{i,r,l}[r,l] = λM_{i,r,l}[r-1,l] + (1-λ)P_{i,r,l}[r,l]
where M_{i,r,l}[r-1,l] is the low-pass power of the l-th band of frame r-1, and λ is a forgetting factor that depends on the bandwidth of the low-pass filter; λ = 0.4 in this patent.
(2) Remove the slowly varying components and the falling-edge envelope of the power: the power P_{i,r,l}[r,l] of the speech signal is processed to obtain the enhanced power P̃_{i,r,l}[r,l] of the l-th band of the r-th frame, where c_0 is a constant factor, set to c_0 = 0.01 in this patent.
(3) Steps (1) and (2) are applied in turn to every band of every frame of the signal.
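The exact enhancement rule of step (2) appears only in the figure; the sketch below assumes a common PNCC-style form, subtracting the slowly varying low-pass power and bounding the result below by c_0 times the original power. That specific formula is an assumption, not taken from the text.

```python
import numpy as np

LAMBDA = 0.4   # forgetting factor from step (1)
C0 = 0.01      # constant lower-bound factor from step (2)

def enhance_band_powers(P: np.ndarray) -> np.ndarray:
    """P: (num_frames, N_BANDS) band powers. Returns the enhanced band powers."""
    M = np.zeros_like(P)
    M[0] = P[0]                                   # initialize the low-pass tracker
    for r in range(1, P.shape[0]):
        # First-order low-pass power tracking: M[r,l] = lambda*M[r-1,l] + (1-lambda)*P[r,l]
        M[r] = LAMBDA * M[r - 1] + (1 - LAMBDA) * P[r]
    # Assumed enhancement rule: remove the slowly varying envelope, floor at c0 * P.
    return np.maximum(P - M, C0 * P)
```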
(6) Spectral integration
Once the enhanced power P̃_{i,r,l}[r,l] of every band of every frame has been obtained, spectral integration yields the enhanced short-time spectrum of each frame of the signal. The integration is
Ỹ_{i,r}(e^{jω_k}) = μ_{i,r}[r,k]·Y_{i,r}(e^{jω_k})
where μ_{i,r}[r,k] is the spectral weight coefficient at the k-th frequency index of the r-th frame, Y_{i,r}(e^{jω_k}) is the short-time spectrum of the r-th frame of the unenhanced i-th signal, and Ỹ_{i,r}(e^{jω_k}) is the short-time spectrum of the r-th frame of the enhanced i-th signal.
The weight μ_{i,r}[r,k] is obtained from the per-band gains averaged over the filter responses:
μ_{i,r}[r,k] = Σ_{l=0}^{L-1} ω_{i,r,l}[r,l]·|H_l(e^{jω_k})| / Σ_{l=0}^{L-1} |H_l(e^{jω_k})|, 0 ≤ k ≤ N/2
μ_{i,r}[r,k] = μ_{i,r}[r,N-k], N/2 ≤ k ≤ N-1
where H_l(e^{jω_k}) is the spectrum of the gammatone filter bank for the l-th band at frequency index k, and ω_{i,r,l}[r,l] is the weight coefficient of the l-th band of the r-th frame of the i-th signal, defined as the ratio of the enhanced band power to the original band power:
ω_{i,r,l}[r,l] = P̃_{i,r,l}[r,l] / P_{i,r,l}[r,l]
The enhanced short-time spectrum of the r-th frame of the i-th signal is thus obtained after spectral integration, and the same operation is applied to every frame in turn to obtain the enhanced short-time spectrum of each frame of the i-th signal. Each enhanced frame Ỹ_{i,r}(e^{jω_k}) is transformed back to the time domain by an IFFT, and the frames are spliced in the time domain to obtain the enhanced speech signal x̃_i:
x̃_{i,r}(n) = IFFT{Ỹ_{i,r}(e^{jω_k})}, x̃_i(n) = frame splicing of x̃_{i,r}(n) over r = 1, ..., g, where g is the total number of frames
where x̃_i(n) is the enhanced speech signal matrix, x̃_{i,r}(n) is the enhanced speech signal matrix of the r-th frame, and g is the total number of frames of the signal, which varies with the duration of the signal. The resulting sample matrix x̃_i(n) of the enhanced signal at time n is then written out at the sampling rate f_s = 16 kHz with MATLAB's built-in audio writing function (audiowrite, the counterpart of the audioread call used above), giving the enhanced speech signal x̃_i.
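A sketch of the spectral integration and time-domain resynthesis of step (6). The per-bin weight follows the weighted-average form reconstructed above, and the overlap-add synthesis is one reasonable reading of the frame-splicing step; both should be treated as assumptions where the original relies on figures.

```python
import numpy as np

def integrate_and_resynthesize(frames, P, P_enh, H, frame_shift, n_fft=1024):
    """frames: (g, frame_len) windowed time-domain frames (before enhancement).
    P, P_enh: (g, N_BANDS) original and enhanced band powers.
    H: (N_BANDS, n_fft//2 + 1) gammatone filter bank magnitude response.
    Returns the enhanced time-domain signal via overlap-add."""
    g, frame_len = frames.shape
    spectrum = np.fft.rfft(frames, n=n_fft)                  # Y_{i,r}
    omega = P_enh / np.maximum(P, 1e-12)                     # per-band gain, omega = P~ / P
    # Per-bin weight mu[r, k]: average of band gains weighted by |H_l| at bin k (assumed form).
    mu = (omega @ H) / np.maximum(H.sum(axis=0), 1e-12)      # (g, n_fft//2 + 1)
    enhanced_spec = mu * spectrum                            # enhanced short-time spectrum
    enhanced_frames = np.fft.irfft(enhanced_spec, n=n_fft)[:, :frame_len]
    # Overlap-add frame synthesis in the time domain.
    out = np.zeros((g - 1) * frame_shift + frame_len)
    for r in range(g):
        out[r * frame_shift : r * frame_shift + frame_len] += enhanced_frames[r]
    return out
```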
This completes the enhancement of one utterance of the speech training set; the training set X and the test set T are then processed utterance by utterance following the same steps. The enhanced training utterances are stored in the set X̃ and the enhanced test utterances in the set T̃.
Step 3: Build the acoustic model for speech recognition. The acoustic model of this patent is built with a CNN and CTC. The input layer takes the 200-dimensional feature sequence of each speech signal x̃_i of the enhanced training set X̃, extracted with the MFCC feature extraction algorithm. The hidden layers alternate convolutional and pooling layers, with Dropout layers introduced to prevent overfitting; the convolution kernel size is 3 and the pooling window size is 2. The output layer is a fully connected layer of 1423 neurons activated with softmax, and the CTC loss is used as the loss function to realize connectionist temporal multi-output; the 1423-dimensional output corresponds exactly to the 1423 common Hanyu Pinyin syllables in the Chinese dictionary file dict.txt built in Step 4. The network structure of the acoustic model is shown in Fig. 4, where the parameters of the convolutional, pooling, Dropout, and fully connected layers are marked.
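A minimal Keras-style sketch of an acoustic model with the stated structure: kernel size 3, pooling size 2, Dropout, and a 1423-unit softmax output trained with the CTC loss. The number of convolutional blocks, their channel widths, and the dropout rate are not given in the text (they appear only in Fig. 4), so those values are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 1423   # common Hanyu Pinyin syllables in dict.txt
FEATURE_DIM = 200    # MFCC-derived feature dimension per frame

def build_acoustic_model():
    inputs = layers.Input(shape=(None, FEATURE_DIM))         # (time, features)
    x = inputs
    for filters in (64, 128, 256):                           # assumed channel widths
        x = layers.Conv1D(filters, kernel_size=3, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.Dropout(0.2)(x)                           # assumed dropout rate
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs, outputs)

def ctc_loss(y_true, y_pred, input_length, label_length):
    # Connectionist Temporal Classification loss over the softmax outputs.
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```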
Step 4: Build the language model for speech recognition. This includes establishing the language text data set, designing the 2-gram language model, and collecting the Chinese dictionary.
(1) Establishment of the language text database
First, the text data set needed to train the language model is established. The language text data set takes the form of an electronic .txt file containing newspapers, middle-school texts, and well-known novels. Electronic .txt versions of newspapers, middle-school texts, and well-known novels are collected to build the language text database; the selected text must be representative and reflect everyday Chinese usage.
(2) Construction of the 2-gram language model
This patent builds the language model with the 2-gram algorithm, a language model training method in which the text is divided by the words themselves. The 2 in 2-gram means that the probability of the current word is conditioned only on the word immediately preceding it, i.e., statistics are collected over pairs of consecutive words; 2 is the constraint on the memory length of the word sequence. The 2-gram formula can be written as
S(W) = P(w_1)·Π_{d=2}^{q} P(w_d | w_{d-1})
where W is a word sequence, w_1, w_2, ..., w_q are the words of the sequence, q is the length of the sequence, S(W) is the probability that the sequence conforms to linguistic usage, and d indexes the d-th word.
(3) Establishment of the Chinese dictionary
The dictionary of the language model of the speech recognition system is built. The dictionary of a language is stable; in the present invention the Chinese dictionary is a dict.txt file listing the Chinese characters corresponding to the 1423 Hanyu Pinyin syllables commonly used in daily life, taking into account that one syllable may correspond to several characters. A partial view of the dictionary built by the present invention is shown in Fig. 5.
Step 5: Train the 2-gram language model on the established language text data set to obtain the word occurrence table and the state transition table of the language model. The language model training procedure is shown in Fig. 6. The training is as follows:
(1) The text of the language data set is read in a loop and the occurrences of each single word are counted, producing the single-word occurrence table.
(2) The number of times each pair of consecutive words appears together is counted in a loop, producing the two-word state transition table.
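A sketch of the two counting passes of Step 5; the tokenization used here (one token per character) is an assumption, since the text does not name a segmenter, and the derived bigram probability shows how the two tables are later combined.

```python
from collections import Counter

def train_2gram(corpus_files):
    """Count single-word occurrences and consecutive word-pair occurrences."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for path in corpus_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                words = list(line.strip())   # assumed: one token per character
                unigram_counts.update(words)
                bigram_counts.update(zip(words, words[1:]))
    return unigram_counts, bigram_counts

# Bigram probability from the two tables: P(w_d | w_{d-1}) = count(w_{d-1}, w_d) / count(w_{d-1}).
def bigram_prob(prev, word, unigram_counts, bigram_counts):
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]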
Step 6: Train the acoustic model with the trained language model, the established dictionary, and the enhanced speech training set X̃, obtaining the acoustic model weight file and other parameter configuration files. The training proceeds as follows:
(1) Initialize the weights throughout the acoustic network model;
(2) Import the utterances of the enhanced training set X̃ in turn for training. Each speech signal x̃_i is first processed by the MFCC feature extraction algorithm to obtain its 200-dimensional feature sequence; as listed in Fig. 7, this sequence is passed through the convolutional, pooling, Dropout, and fully connected layers in turn, and the output layer of 1423 neurons with softmax activation yields the 1423-dimensional acoustic features of the speech signal;
(3) The 1423-dimensional acoustic features are then decoded with the language model and the dictionary, and the Hanyu Pinyin sequence of the recognized signal x̃_i is output;
(4) The Pinyin sequence recognized by the acoustic model is compared with the Pinyin label sequence of the i-th utterance x_i in the training set to compute the error, which is back-propagated to update the weights throughout the acoustic model; the CTC loss is used as the loss function and is optimized with the Adam algorithm. The batch size is set to 16, the number of epochs to 50, and the weight file is saved after every 500 training utterances. The CTC loss is
L_CTC = -Σ ln F(z|e)
where L_CTC is the total loss produced by training on the training set, e is the input speech, i.e., a signal x̃_i of the enhanced training set X̃, z is the output Chinese character sequence, and F(z|e) is the probability of producing the output sequence z given the input e (a training-step sketch is given after step (5) below).
(5) The acoustic model for speech recognition is trained by repeating the above steps until its loss converges, at which point training is complete. The weight file and the configuration files of the acoustic model are saved. The acoustic model training diagram is shown in Fig. 7.
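The sketch below wires the CTC loss and the Adam optimizer into a single training step with the stated batch size of 16, using a tiny stand-in model and synthetic data so that it runs on its own; the real model is the CNN of Step 3, and the stated checkpointing schedule (every 500 utterances over 50 epochs) would wrap this step.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 1423                       # Pinyin inventory size
FEATURE_DIM = 200                        # MFCC feature dimension

# Tiny stand-in model so the example is self-contained; the real model is the CNN of Step 3.
inputs = layers.Input(shape=(None, FEATURE_DIM))
x = layers.Conv1D(64, 3, padding="same", activation="relu")(inputs)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(inputs, outputs)

optimizer = tf.keras.optimizers.Adam()   # optimizer named in step (4)

def train_step(features, labels, input_len, label_len):
    with tf.GradientTape() as tape:
        y_pred = model(features, training=True)
        # CTC loss: negative log-probability of the label sequence given the input.
        loss = tf.reduce_mean(
            tf.keras.backend.ctc_batch_cost(labels, y_pred, input_len, label_len))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# One synthetic batch with the stated batch size of 16 (real training runs 50 epochs,
# periodically saving the weight file with model.save_weights).
feats = tf.random.normal((16, 100, FEATURE_DIM))
labs = tf.random.uniform((16, 20), 0, NUM_CLASSES - 1, dtype=tf.int32)
loss = train_step(feats, labs, tf.fill((16, 1), 100), tf.fill((16, 1), 20))
```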
Step 7: Use the trained speech-enhancement-based Chinese speech recognition system to recognize the speech of the test set, compute the recognition accuracy, and compare its performance with the traditional algorithm. The flow of the recognition test system is shown in Fig. 8. The recognition accuracy of this patent and its comparison with the traditional algorithm in noisy environments are shown in Fig. 9, and in reverberant environments in Fig. 10.
The test is carried out as follows:
(1) With the traditional speech recognition system, run the recognition test on the 2000 unenhanced utterances of the test set T of the complex-environment speech database and compute the recognition accuracy. Representative recognition results are shown in Figs. 9 and 10.
(2) With the speech-enhancement-based recognition system of the present invention, run the recognition test on the 2000 enhanced utterances of the test set T̃ and compute the recognition accuracy of the method of the present invention. Representative recognition results are shown in Figs. 9 and 10.
(3) Finally, analyze the performance of the speech-enhancement-based recognition system proposed by the present invention.
The statistics show that the speech-enhancement-based recognition algorithm proposed by the present invention greatly improves the recognition accuracy for speech in Gaussian white noise, in the presence of background noise or interfering sound sources, and in reverberation, with a performance gain of roughly 30%. Compared with the traditional speech recognition algorithm, the recognition accuracy of the proposed algorithm is also much higher: the traditional algorithm performs poorly in these environments, whereas the proposed algorithm performs very well. A comparison of the recognition results of the proposed algorithm and the traditional algorithm in selected noisy environments is shown in Fig. 9, and in selected reverberant environments in Fig. 10.
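The text does not state how the accuracy is computed; a common choice, sketched here purely as an assumption, is syllable-level accuracy derived from the edit distance between the recognized and reference Pinyin sequences.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two Pinyin token sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)] for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)]

def accuracy(references, hypotheses):
    """Syllable accuracy = 1 - total edit distance / total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 1.0 - errors / total

# Example: two test utterances with their recognized Pinyin sequences.
refs = [["ni3", "hao3"], ["tian1", "qi4", "hen3", "hao3"]]
hyps = [["ni3", "hao3"], ["tian1", "qi4", "hen3", "hao4"]]
print(accuracy(refs, hyps))  # 5 correct of 6 syllables -> ~0.833
```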
It can thus be seen that the deep neural network speech recognition method based on speech enhancement in complex environments of the present invention solves the problems of existing speech recognition algorithms, namely sensitivity to noisy environments, high demands on speech quality, and a single applicable scenario, and realizes speech recognition in complex speech environments.
In the above steps, the symbol i denotes the i-th speech signal of the training and test sets undergoing speech enhancement, i = 1, 2, ..., 12000; the symbol r denotes the r-th frame of the speech signal, r = 1, 2, 3, ..., g, where g is the total number of frames after framing and varies with the duration of the processed speech; the symbol l denotes the l-th frequency band of the speech signal, l = 0, 1, 2, ..., 39; and k is a dummy variable indexing the discrete frequencies, k = 0, 1, 2, ..., 1023.
The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Although the invention has been presented above with a preferred embodiment, this is not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make minor changes or modifications into equivalent embodiments; any simple modification, equivalent change, or refinement made to the above embodiment in accordance with the technical essence of the invention, without departing from the technical solution of the invention, still falls within the scope of the technical solution of the present invention.
Advantages of the Invention
The present invention builds its model on deep learning neural networks and speech enhancement. First, a complex-environment speech data set is built, and speech enhancement is applied, in the front-end speech signal preprocessing stage, to the speech signals recorded under the various complex conditions to be recognized. A language text data set is then established, a language model is built and trained, and a Chinese dictionary file is established. A neural network acoustic model is then built and trained on the enhanced speech training set, with the help of the language model and dictionary, to obtain the acoustic model weight file, thereby achieving accurate recognition of Chinese speech in complex environments. This addresses the problems of existing speech recognition algorithms: sensitivity to noise factors, high demands on speech quality, and a single application scenario.
Claims (1)
Priority Applications (1)
- CN202010880777.7A (granted as CN111986661B), priority and filing date 2020-08-28: Deep neural network voice recognition method based on voice enhancement in complex environment
Publications (2)
- CN111986661A, published 2020-11-24
- CN111986661B, granted 2024-02-09
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant