CN111986661A - Deep neural network speech recognition method based on speech enhancement in complex environment - Google Patents
Deep neural network speech recognition method based on speech enhancement in complex environment
- Publication number
- CN111986661A CN111986661A CN202010880777.7A CN202010880777A CN111986661A CN 111986661 A CN111986661 A CN 111986661A CN 202010880777 A CN202010880777 A CN 202010880777A CN 111986661 A CN111986661 A CN 111986661A
- Authority
- CN
- China
- Prior art keywords
- speech
- frame
- signal
- voice
- speech recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Images
Classifications
- G10L15/063 Training (Speech recognition; Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G06N3/08 Learning methods (Computing arrangements based on biological models; Neural networks)
- G10L15/16 Speech classification or search using artificial neural networks
- G10L21/0208 Noise filtering (Speech enhancement, e.g. noise reduction or echo cancellation)
Abstract
Description
Technical Field
The invention belongs to the field of speech recognition, and in particular relates to a deep neural network speech recognition method based on speech enhancement in complex environments.
Background
In recent years, technological innovation has advanced rapidly, the economy has prospered, and society has progressed. Having met the basic needs of food, clothing, housing, and transportation, people now place greater demands on building a better life. This vision has driven the emergence of virtual social applications such as QQ and WeChat that integrate daily life, work, and entertainment. These applications bring great convenience to people's lives, work, and communication, in particular through their speech recognition features. Speech recognition frees users from traditional interaction methods such as keyboard and mouse, letting them convey information through the most natural form of communication, speech. Speech recognition has also gradually found wide application in industry, communications, home appliances, home services, medical care, consumer electronics, and other fields.
Most of today's social applications achieve very high speech recognition accuracy on clean speech, that is, speech with no background noise and no interfering sound sources. When the speech to be recognized contains noise, interference, or reverberation, however, the accuracy of existing recognition systems drops sharply. The main reason is that existing systems consider neither denoising nor interference suppression, either in the speech signal preprocessing stage at the recognition front end or when building the acoustic model.
Existing Chinese speech recognition algorithms place strict requirements on speech signal quality and lack robustness: when speech quality is poor or the audio is heavily contaminated, recognition fails, so they are usable only in a narrow range of clean, ideal speech conditions. To improve the applicability of speech recognition in real-life environments and address these shortcomings, the present invention proposes a deep neural network speech recognition method based on speech enhancement in complex environments, built on deep learning neural networks and speech enhancement. First, speech enhancement is applied at the recognition front end to the speech signals recorded under the various complex conditions to be recognized; a language text data set is established, a language model is built and trained; a Chinese dictionary file is established; and a neural network acoustic model is built and trained on the enhanced speech training set, with the help of the language model and dictionary, to obtain the acoustic model weight file, thereby establishing a well-performing speech recognition system for complex speech environments.
In view of the practical applications of speech recognition technology, the complex-environment speech recognition addressed by the present invention covers four types of speech environments: clean speech, Gaussian white noise, background noise or interfering sound sources, and reverberation. The method of the invention achieves high recognition accuracy, generalizes well, and is robust to a wide range of environmental factors.
Summary of the Invention
The purpose of the present invention is to provide a deep neural network speech recognition method based on speech enhancement in complex environments.
To achieve the above purpose, the present invention adopts the following technical solution:
The deep neural network speech recognition method based on speech enhancement in complex environments builds its model on deep learning neural networks and speech enhancement; the overall flow of the technical solution is shown in Fig. 1. First, a complex-environment speech data set is built, and speech enhancement is applied to the speech signals to be recognized in the front-end speech signal preprocessing stage. A language text data set is then established, a language model is built and trained, and a Chinese dictionary file is established. Finally, a neural network acoustic model is built and trained on the enhanced speech training set, with the help of the language model and dictionary, to obtain the acoustic model weight file, thereby achieving accurate recognition of Chinese speech in complex environments. This addresses the problems of existing speech recognition algorithms: sensitivity to noise, high demands on speech quality, and a single application scenario. The method comprises the following steps:
Step 1: Build and process the speech data set for complex environments. Speech recorded in clean conditions, in Gaussian white noise, with background noise or interfering sound sources, and in reverberant environments is collected to form the speech data set C of the recognition system. The speech data of each environment in C is then split into a training set and a test set at a ratio of 5:1 (training utterances to test utterances). The training and test portions of all environments are pooled and shuffled to form the training set X and the test set T. The i-th utterance in X is denoted x_i, and the j-th utterance in T is denoted t_j. For each utterance in X, a label file in .txt format is prepared containing the name of the utterance and its correct Hanyu Pinyin sequence. A partial view of the training set label file is shown in Fig. 2.
Step 2: Apply speech enhancement to the training set X and the test set T, obtaining the enhanced training set X̃ and test set T̃; the i-th utterance of X̃ is denoted x̃_i and the j-th utterance of T̃ is denoted t̃_j. Taking the i-th training utterance x_i as an example, the enhancement proceeds as follows. The signal is read with MATLAB's built-in audioread function, which returns the sampling rate f_s and the matrix x_i(n) of speech samples, where x_i(n) is the sample value at time n. Pre-emphasis is applied to x_i(n) to obtain y_i(n); y_i(n) is then windowed with a Hamming window and split into frames, giving the per-frame information y_{i,r}(n), where y_{i,r}(n) is the speech information matrix of the r-th frame of the pre-emphasized i-th utterance. An FFT of y_{i,r}(n) gives the short-time spectrum of the r-th frame of the i-th signal. The gammatone weighting functions H_l are then applied band by band to this spectrum to obtain the power P_{i,r,l}(r,l) of the l-th band of the r-th frame, where l takes the values 0, ..., 39; the powers of all bands of the r-th frame are computed in the same way. Noise reduction, dereverberation, and spectral integration then yield the enhanced short-time spectrum of the r-th frame of the i-th signal. The same processing is applied to the other frames in turn, and an IFFT followed by frame synthesis in the time domain gives the enhanced signal x̃_i, which is placed in the enhanced training set X̃. The speech enhancement pipeline is shown in Fig. 3.
Step 3: Build the acoustic model for speech recognition. The acoustic model of this patent is built with a CNN and CTC. The input layer takes the speech signals x̃_i of the enhanced training set X̃, and the MFCC feature extraction algorithm is applied to each signal to obtain a 200-dimensional feature sequence. The hidden layers alternate convolutional and pooling layers, with Dropout layers introduced to prevent overfitting; the convolution kernel size is 3 and the pooling window size is 2. The output layer is a fully connected layer of 1423 neurons activated with softmax, and the CTC loss is used as the loss function to realize connectionist temporal multi-output; the 1423-dimensional output corresponds exactly to the 1423 common Hanyu Pinyin syllables in the Chinese dictionary file dict.txt built in Step 4. The network structure of the acoustic model is shown in Fig. 4, where the parameters of the convolutional, pooling, Dropout, and fully connected layers are marked.
Step 4: Build the 2-gram language model and the dictionary for speech recognition. This includes establishing the language text data set, building the 2-gram language model, and collecting and building the Chinese dictionary. The language text data set takes the form of an electronic .txt file containing newspapers, middle-school texts, and well-known novels. The dictionary of a language is stable; in the present invention the Chinese dictionary is a dict.txt file listing the Chinese characters corresponding to the 1423 Hanyu Pinyin syllables commonly used in daily life, taking into account that one syllable may correspond to several characters. A partial view of the dictionary is shown in Fig. 5.
Step 5: Train the 2-gram language model on the established language text data set to obtain the word occurrence table and the state transition table of the language model. The training proceeds as follows: the text of the language data set is read in a loop, the number of occurrences of each single word and of each pair of consecutive words is counted, and the counts are aggregated into a single-word occurrence table and a two-word state transition table. The language model training procedure is shown in Fig. 6.
Step 6: Train the acoustic model with the trained language model, the established dictionary, and the enhanced training set X̃, obtaining the acoustic model weight file and other parameter configuration files. The training proceeds as follows. The weights of the acoustic network are initialized, and the utterances of the enhanced training set X̃ are imported in turn. Each speech signal x̃_i is first processed by the MFCC feature extraction algorithm to obtain its 200-dimensional feature sequence; as listed in Fig. 7, this sequence is passed through the convolutional, pooling, Dropout, and fully connected layers in turn, and the output layer of 1423 neurons with softmax activation yields the 1423-dimensional acoustic features of the signal. These features are decoded with the language model and the dictionary, and the Hanyu Pinyin sequence of the recognized signal is output. The recognized Pinyin sequence is compared with the Pinyin label sequence of the utterance in the training set to compute the error, which is back-propagated to update the weights throughout the acoustic model; the CTC loss is used as the loss function and is optimized with the Adam algorithm. The batch size is set to 16, the number of epochs to 50, and the weight file is saved after every 500 training utterances. Each utterance of the training set is processed in this way until the acoustic model loss converges, at which point training is complete and the weight file and configuration files of the acoustic model are saved. The acoustic model training procedure is shown in Fig. 7.
Step 7: Use the trained speech-enhancement-based Chinese speech recognition system to recognize the speech of the test set, compute the recognition accuracy, and compare its performance with the traditional algorithm. The flow of the recognition test system is shown in Fig. 8; the recognition accuracy of this patent and its comparison with the traditional algorithm are shown in Figs. 9 and 10.
Advantages of the Invention
The deep neural network speech recognition method based on speech enhancement in complex environments solves the problems of existing speech recognition algorithms: sensitivity to noise and other complex environmental factors, high demands on speech quality, and a single application scenario. In addition, the proposed method uses deep neural network learning for acoustic modeling, giving the model strong transfer learning ability, and the introduction of speech enhancement makes the recognition system highly robust to interference from complex environmental factors.
Description of the Drawings
To illustrate the technical solution of the present invention more clearly, the drawings used in the description are briefly introduced below.
Fig. 1 is the flow chart of the speech recognition technical solution of the present invention;
Fig. 2 is a partial view of the speech labels of the speech recognition training set;
Fig. 3 is the framework of the speech enhancement pipeline for speech recognition;
Fig. 4 is the network framework of the speech recognition acoustic model;
Fig. 5 is a partial view of the dictionary built by the present invention;
Fig. 6 is the flow chart of language model training;
Fig. 7 is the training diagram of the acoustic model;
Fig. 8 is the flow chart of the speech recognition test system;
Fig. 9 compares the speech recognition algorithm of the present invention with the traditional algorithm in noisy environments;
Fig. 10 compares the speech recognition algorithm of the present invention with the traditional algorithm in reverberant environments.
Detailed Description
The deep neural network speech recognition method based on speech enhancement in complex environments is implemented in the following steps:
Step 1: Build and process the speech data set for complex environments. Speech recorded in clean conditions, in Gaussian white noise, with background noise or interfering sound sources, and in reverberant environments is collected to form the speech data set C of the recognition system. The speech data of each environment in C is then split into a training set and a test set at a ratio of 5:1 (training utterances to test utterances). The training and test portions of all environments are pooled and shuffled to form the training set X and the test set T. The i-th utterance in X is denoted x_i, and the j-th utterance in T is denoted t_j. For each utterance in X, a label file in .txt format is prepared containing the name of the utterance and its correct Hanyu Pinyin sequence. A partial view of the training set label file is shown in Fig. 2.
The collection methods are as follows. For clean speech, multiple speakers record under ideal laboratory conditions, using Chinese newspapers, novels, and student texts as material; each utterance is at most 10 seconds long, and 3000 clean utterances are recorded in total. Speech in the Gaussian white noise and reverberant environments is synthesized with Adobe Audition: the recorded clean speech is mixed with Gaussian white noise, and the reverberant speech is re-synthesized directly with the reverberation environments provided by the software; 3000 utterances are produced for each of these two environments. Speech with background noise or interfering sound sources is mainly recorded in the field, by multiple speakers in noisy places such as factories and restaurants, again 3000 utterances in total. All collected files are in .wav format. The collected speech is then partitioned: for each of the four environments, 2500 utterances form the training portion and the remaining 500 the test portion. In total the training set X contains 10000 utterances and the test set T contains 2000 utterances; the training and test sets are each shuffled to avoid overfitting of the trained model.
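As an illustration only, the 5:1 split, pooling, and shuffling described above can be scripted as in the following sketch; the directory layout, file names, and transcript source are assumptions, not details given in the patent.

```python
import os
import random

# Hypothetical layout: one folder of .wav files per environment.
ENVIRONMENTS = ["clean", "white_noise", "background_noise", "reverb"]  # assumed names
DATA_ROOT = "speech_data"                                              # assumed path

train_set, test_set = [], []
for env in ENVIRONMENTS:
    wavs = sorted(f for f in os.listdir(os.path.join(DATA_ROOT, env)) if f.endswith(".wav"))
    random.shuffle(wavs)
    # 5:1 split as in Step 1: 2500 training utterances, 500 test utterances per environment.
    train_set += [(env, w) for w in wavs[:2500]]
    test_set += [(env, w) for w in wavs[2500:3000]]

# Pool and shuffle across environments so each batch mixes all four conditions.
random.shuffle(train_set)
random.shuffle(test_set)

# Label file: one line per utterance, "<name> <pinyin sequence>" (transcripts assumed given).
transcripts = {}  # assumed mapping from utterance name to its Hanyu Pinyin sequence
with open("train_labels.txt", "w", encoding="utf-8") as f:
    for env, wav in train_set:
        name = os.path.splitext(wav)[0]
        f.write(f"{name} {transcripts.get(name, '')}\n")
```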
Step 2: Apply speech enhancement to the training set X and the test set T, obtaining the enhanced training set X̃ and test set T̃; the i-th utterance of X̃ is denoted x̃_i and the j-th utterance of T̃ is denoted t̃_j. Taking the i-th training utterance x_i as an example, the enhancement proceeds as follows. The signal is read with MATLAB's built-in audioread function, which returns the sampling rate f_s and the matrix x_i(n) of speech samples, where x_i(n) is the sample value at time n. Pre-emphasis is applied to x_i(n) to obtain y_i(n); y_i(n) is then windowed with a Hamming window and split into frames, giving the per-frame information y_{i,r}(n), where y_{i,r}(n) is the speech information matrix of the r-th frame of the pre-emphasized i-th utterance. An FFT of y_{i,r}(n) gives the short-time spectrum of the r-th frame of the i-th signal. The gammatone weighting functions H_l are then applied band by band to this spectrum to obtain the power P_{i,r,l}(r,l) of the l-th band of the r-th frame, where l takes the values 0, ..., 39; the powers of all bands of the r-th frame are computed in the same way. Noise reduction, dereverberation, and spectral integration then yield the enhanced short-time spectrum of the r-th frame of the i-th signal. The same processing is applied to the other frames in turn, and an IFFT followed by frame synthesis in the time domain gives the enhanced signal x̃_i, which is placed in the enhanced training set X̃. The speech enhancement pipeline is shown in Fig. 3.
Each step of the speech enhancement is detailed below:
(1) Pre-emphasis of the speech signal
Pre-emphasis is applied to the i-th speech signal matrix x_i(n) of the training set X to obtain y_i(n), where y_i(n) = x_i(n) - αx_i(n-1); α is a constant, set to α = 0.98 in this patent, and x_i(n-1) is the sample matrix of the i-th training utterance at time n-1.
(2) Windowing and framing
A Hamming window w(n) is applied to the pre-emphasized signal y_i(n) to window it and split it into frames, turning the continuous signal into frame-by-frame discrete signals y_{i,r}(n),
where w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1, is the Hamming window function and N is the window length. In this patent the frame length is 50 ms and the frame shift is 10 ms. Windowing and framing the pre-emphasized signal y_i(n) gives the speech signal matrix y_{i,r}(n) of each frame, where y_{i,r}(n) is the speech information matrix of the r-th frame of the i-th utterance after pre-emphasis, windowing, and framing.
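A minimal numpy sketch of the pre-emphasis, windowing, and framing of steps (1) and (2); the 16 kHz sampling rate comes from the later FFT step, and the wav reader and file name are assumptions.

```python
import numpy as np
import soundfile as sf  # any wav reader would do; the MATLAB audioread call plays the same role

fs_expected = 16000                      # sampling rate assumed later (N = 1024 for 50 ms frames)
alpha = 0.98                             # pre-emphasis constant from step (1)
frame_len = int(0.050 * fs_expected)     # 50 ms frame length
frame_shift = int(0.010 * fs_expected)   # 10 ms frame shift

x, fs = sf.read("utterance.wav")         # hypothetical file name
assert fs == fs_expected

# (1) Pre-emphasis: y[n] = x[n] - alpha * x[n-1]
y = np.append(x[0], x[1:] - alpha * x[:-1])

# (2) Framing with a Hamming window
window = np.hamming(frame_len)
num_frames = 1 + (len(y) - frame_len) // frame_shift
frames = np.stack([
    y[r * frame_shift : r * frame_shift + frame_len] * window
    for r in range(num_frames)
])  # shape (num_frames, frame_len)
```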
(3) FFT
The speech information matrix y_{i,r}(n) of the r-th frame of the i-th utterance is transformed from the time domain to the frequency domain by an FFT, giving the short-time spectrum Y_{i,r}(e^{jω_k}) of the r-th frame of the i-th signal.
(4) Computing the signal power P_{i,r,l}(r,l)
The short-time spectrum of each frame is weighted with the gammatone weighting functions to obtain the power of each band of each frame:
P_{i,r,l}(r,l) = Σ_{k=0}^{N-1} |Y_{i,r}(e^{jω_k}) H_l(e^{jω_k})|^2
where P_{i,r,l}(r,l) is the power of the l-th band of the r-th frame of the signal y_i(n), k is a dummy variable indexing the discrete frequencies, and ω_k is the discrete frequency; N = 1024 because the FFT uses a 50 ms frame length and a 16 kHz sampling rate. H_l(e^{jω_k}) is the spectrum of the gammatone filter bank for the l-th band evaluated at frequency index k, computed with a built-in MATLAB speech processing function whose input parameter is the band index l; Y_{i,r}(e^{jω_k}) is the short-time spectrum of the r-th frame, and L = 40 is the total number of channels.
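A sketch of the band-power computation in step (4). The gammatone filter bank magnitude response H is assumed to be precomputed (the patent obtains it from a MATLAB built-in), so its construction is outside this sketch.

```python
import numpy as np

N_FFT = 1024      # FFT length for 50 ms frames at 16 kHz
N_BANDS = 40      # L = 40 gammatone channels

def band_powers(frames: np.ndarray, H: np.ndarray) -> np.ndarray:
    """frames: (num_frames, frame_len) windowed time-domain frames.
    H: (N_BANDS, N_FFT//2 + 1) magnitude response of the gammatone filter bank,
       assumed precomputed (e.g., with an external gammatone package).
    Returns P of shape (num_frames, N_BANDS): P[r, l] is the power of band l in frame r."""
    spectrum = np.fft.rfft(frames, n=N_FFT)                  # short-time spectra Y_{i,r}
    filtered = np.abs(spectrum[:, None, :] * H[None, :, :])  # apply each band weight
    return np.sum(filtered ** 2, axis=-1)                    # sum |Y * H|^2 over frequency bins
```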
(5) Noise reduction and dereverberation of the speech signal
After the power P_{i,r,l}(r,l) has been obtained, noise reduction and dereverberation are performed in the following steps:
(1) Compute the low-pass power M_{i,r,l}[r,l] of the l-th band of the r-th frame:
M_{i,r,l}[r,l] = λM_{i,r,l}[r-1,l] + (1-λ)P_{i,r,l}[r,l]
where M_{i,r,l}[r-1,l] is the low-pass power of the l-th band of frame r-1, and λ is a forgetting factor that depends on the bandwidth of the low-pass filter; λ = 0.4 in this patent.
(2) Remove the slowly varying components and the falling-edge envelope of the power: the power P_{i,r,l}[r,l] of the speech signal is processed to obtain the enhanced power P̃_{i,r,l}[r,l] of the l-th band of the r-th frame, where c_0 is a constant factor, set to c_0 = 0.01 in this patent.
(3) Steps (1) and (2) are applied in turn to every band of every frame of the signal.
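The exact enhancement rule of step (2) appears only in the figure; the sketch below assumes a common PNCC-style form, subtracting the slowly varying low-pass power and bounding the result below by c_0 times the original power. That specific formula is an assumption, not taken from the text.

```python
import numpy as np

LAMBDA = 0.4   # forgetting factor from step (1)
C0 = 0.01      # constant lower-bound factor from step (2)

def enhance_band_powers(P: np.ndarray) -> np.ndarray:
    """P: (num_frames, N_BANDS) band powers. Returns the enhanced band powers."""
    M = np.zeros_like(P)
    M[0] = P[0]                                   # initialize the low-pass tracker
    for r in range(1, P.shape[0]):
        # First-order low-pass power tracking: M[r,l] = lambda*M[r-1,l] + (1-lambda)*P[r,l]
        M[r] = LAMBDA * M[r - 1] + (1 - LAMBDA) * P[r]
    # Assumed enhancement rule: remove the slowly varying envelope, floor at c0 * P.
    return np.maximum(P - M, C0 * P)
```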
(6) Spectral integration
Once the enhanced power P̃_{i,r,l}[r,l] of every band of every frame has been obtained, spectral integration yields the enhanced short-time spectrum of each frame of the signal. The integration is
Ỹ_{i,r}(e^{jω_k}) = μ_{i,r}[r,k]·Y_{i,r}(e^{jω_k})
where μ_{i,r}[r,k] is the spectral weight coefficient at the k-th frequency index of the r-th frame, Y_{i,r}(e^{jω_k}) is the short-time spectrum of the r-th frame of the unenhanced i-th signal, and Ỹ_{i,r}(e^{jω_k}) is the short-time spectrum of the r-th frame of the enhanced i-th signal.
The weight μ_{i,r}[r,k] is obtained from the per-band gains averaged over the filter responses:
μ_{i,r}[r,k] = Σ_{l=0}^{L-1} ω_{i,r,l}[r,l]·|H_l(e^{jω_k})| / Σ_{l=0}^{L-1} |H_l(e^{jω_k})|, 0 ≤ k ≤ N/2
μ_{i,r}[r,k] = μ_{i,r}[r,N-k], N/2 ≤ k ≤ N-1
where H_l(e^{jω_k}) is the spectrum of the gammatone filter bank for the l-th band at frequency index k, and ω_{i,r,l}[r,l] is the weight coefficient of the l-th band of the r-th frame of the i-th signal, defined as the ratio of the enhanced band power to the original band power:
ω_{i,r,l}[r,l] = P̃_{i,r,l}[r,l] / P_{i,r,l}[r,l]
The enhanced short-time spectrum of the r-th frame of the i-th signal is thus obtained after spectral integration, and the same operation is applied to every frame in turn to obtain the enhanced short-time spectrum of each frame of the i-th signal. Each enhanced frame Ỹ_{i,r}(e^{jω_k}) is transformed back to the time domain by an IFFT, and the frames are spliced in the time domain to obtain the enhanced speech signal x̃_i:
x̃_{i,r}(n) = IFFT{Ỹ_{i,r}(e^{jω_k})}, x̃_i(n) = frame splicing of x̃_{i,r}(n) over r = 1, ..., g, where g is the total number of frames
where x̃_i(n) is the enhanced speech signal matrix, x̃_{i,r}(n) is the enhanced speech signal matrix of the r-th frame, and g is the total number of frames of the signal, which varies with the duration of the signal. The resulting sample matrix x̃_i(n) of the enhanced signal at time n is then written out at the sampling rate f_s = 16 kHz with MATLAB's built-in audio writing function (audiowrite, the counterpart of the audioread call used above), giving the enhanced speech signal x̃_i.
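A sketch of the spectral integration and time-domain resynthesis of step (6). The per-bin weight follows the weighted-average form reconstructed above, and the overlap-add synthesis is one reasonable reading of the frame-splicing step; both should be treated as assumptions where the original relies on figures.

```python
import numpy as np

def integrate_and_resynthesize(frames, P, P_enh, H, frame_shift, n_fft=1024):
    """frames: (g, frame_len) windowed time-domain frames (before enhancement).
    P, P_enh: (g, N_BANDS) original and enhanced band powers.
    H: (N_BANDS, n_fft//2 + 1) gammatone filter bank magnitude response.
    Returns the enhanced time-domain signal via overlap-add."""
    g, frame_len = frames.shape
    spectrum = np.fft.rfft(frames, n=n_fft)                  # Y_{i,r}
    omega = P_enh / np.maximum(P, 1e-12)                     # per-band gain, omega = P~ / P
    # Per-bin weight mu[r, k]: average of band gains weighted by |H_l| at bin k (assumed form).
    mu = (omega @ H) / np.maximum(H.sum(axis=0), 1e-12)      # (g, n_fft//2 + 1)
    enhanced_spec = mu * spectrum                            # enhanced short-time spectrum
    enhanced_frames = np.fft.irfft(enhanced_spec, n=n_fft)[:, :frame_len]
    # Overlap-add frame synthesis in the time domain.
    out = np.zeros((g - 1) * frame_shift + frame_len)
    for r in range(g):
        out[r * frame_shift : r * frame_shift + frame_len] += enhanced_frames[r]
    return out
```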
This completes the enhancement of one utterance of the speech training set; the training set X and the test set T are then processed utterance by utterance following the same steps. The enhanced training utterances are stored in the set X̃ and the enhanced test utterances in the set T̃.
Step 3: Build the acoustic model for speech recognition. The acoustic model of this patent is built with a CNN and CTC. The input layer takes the 200-dimensional feature sequence of each speech signal x̃_i of the enhanced training set X̃, extracted with the MFCC feature extraction algorithm. The hidden layers alternate convolutional and pooling layers, with Dropout layers introduced to prevent overfitting; the convolution kernel size is 3 and the pooling window size is 2. The output layer is a fully connected layer of 1423 neurons activated with softmax, and the CTC loss is used as the loss function to realize connectionist temporal multi-output; the 1423-dimensional output corresponds exactly to the 1423 common Hanyu Pinyin syllables in the Chinese dictionary file dict.txt built in Step 4. The network structure of the acoustic model is shown in Fig. 4, where the parameters of the convolutional, pooling, Dropout, and fully connected layers are marked.
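A minimal Keras-style sketch of an acoustic model with the stated structure: kernel size 3, pooling size 2, Dropout, and a 1423-unit softmax output trained with the CTC loss. The number of convolutional blocks, their channel widths, and the dropout rate are not given in the text (they appear only in Fig. 4), so those values are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 1423   # common Hanyu Pinyin syllables in dict.txt
FEATURE_DIM = 200    # MFCC-derived feature dimension per frame

def build_acoustic_model():
    inputs = layers.Input(shape=(None, FEATURE_DIM))         # (time, features)
    x = inputs
    for filters in (64, 128, 256):                           # assumed channel widths
        x = layers.Conv1D(filters, kernel_size=3, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=2)(x)
        x = layers.Dropout(0.2)(x)                           # assumed dropout rate
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inputs, outputs)

def ctc_loss(y_true, y_pred, input_length, label_length):
    # Connectionist Temporal Classification loss over the softmax outputs.
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)
```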
Step 4: Build the language model for speech recognition. This includes establishing the language text data set, designing the 2-gram language model, and collecting the Chinese dictionary.
(1) Establishment of the language text database
First, the text data set needed to train the language model is established. The language text data set takes the form of an electronic .txt file containing newspapers, middle-school texts, and well-known novels. Electronic .txt versions of newspapers, middle-school texts, and well-known novels are collected to build the language text database; the selected text must be representative and reflect everyday Chinese usage.
(2) Construction of the 2-gram language model
This patent builds the language model with the 2-gram algorithm, a language model training method in which the text is divided by the words themselves. The 2 in 2-gram means that the probability of the current word is conditioned only on the word immediately preceding it, i.e., statistics are collected over pairs of consecutive words; 2 is the constraint on the memory length of the word sequence. The 2-gram formula can be written as
S(W) = P(w_1)·Π_{d=2}^{q} P(w_d | w_{d-1})
where W is a word sequence, w_1, w_2, ..., w_q are the words of the sequence, q is the length of the sequence, S(W) is the probability that the sequence conforms to linguistic usage, and d indexes the d-th word.
(3) Establishment of the Chinese dictionary
The dictionary of the language model of the speech recognition system is built. The dictionary of a language is stable; in the present invention the Chinese dictionary is a dict.txt file listing the Chinese characters corresponding to the 1423 Hanyu Pinyin syllables commonly used in daily life, taking into account that one syllable may correspond to several characters. A partial view of the dictionary built by the present invention is shown in Fig. 5.
Step 5: Train the 2-gram language model on the established language text data set to obtain the word occurrence table and the state transition table of the language model. The language model training procedure is shown in Fig. 6. The training is as follows:
(1) The text of the language data set is read in a loop and the occurrences of each single word are counted, producing the single-word occurrence table.
(2) The number of times each pair of consecutive words appears together is counted in a loop, producing the two-word state transition table.
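A sketch of the two counting passes of Step 5; the tokenization used here (one token per character) is an assumption, since the text does not name a segmenter, and the derived bigram probability shows how the two tables are later combined.

```python
from collections import Counter

def train_2gram(corpus_files):
    """Count single-word occurrences and consecutive word-pair occurrences."""
    unigram_counts = Counter()
    bigram_counts = Counter()
    for path in corpus_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                words = list(line.strip())   # assumed: one token per character
                unigram_counts.update(words)
                bigram_counts.update(zip(words, words[1:]))
    return unigram_counts, bigram_counts

# Bigram probability from the two tables: P(w_d | w_{d-1}) = count(w_{d-1}, w_d) / count(w_{d-1}).
def bigram_prob(prev, word, unigram_counts, bigram_counts):
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]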
Step 6: Train the acoustic model with the trained language model, the established dictionary, and the enhanced speech training set X̃, obtaining the acoustic model weight file and other parameter configuration files. The training proceeds as follows:
(1) Initialize the weights throughout the acoustic network model;
(2) Import the utterances of the enhanced training set X̃ in turn for training. Each speech signal x̃_i is first processed by the MFCC feature extraction algorithm to obtain its 200-dimensional feature sequence; as listed in Fig. 7, this sequence is passed through the convolutional, pooling, Dropout, and fully connected layers in turn, and the output layer of 1423 neurons with softmax activation yields the 1423-dimensional acoustic features of the speech signal;
(3) The 1423-dimensional acoustic features are then decoded with the language model and the dictionary, and the Hanyu Pinyin sequence of the recognized signal x̃_i is output;
(4) The Pinyin sequence recognized by the acoustic model is compared with the Pinyin label sequence of the i-th utterance x_i in the training set to compute the error, which is back-propagated to update the weights throughout the acoustic model; the CTC loss is used as the loss function and is optimized with the Adam algorithm. The batch size is set to 16, the number of epochs to 50, and the weight file is saved after every 500 training utterances. The CTC loss is
L_CTC = -Σ ln F(z|e)
where L_CTC is the total loss produced by training on the training set, e is the input speech, i.e., a signal x̃_i of the enhanced training set X̃, z is the output Chinese character sequence, and F(z|e) is the probability of producing the output sequence z given the input e (a training-step sketch is given after step (5) below).
(5) The acoustic model for speech recognition is trained by repeating the above steps until its loss converges, at which point training is complete. The weight file and the configuration files of the acoustic model are saved. The acoustic model training diagram is shown in Fig. 7.
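The sketch below wires the CTC loss and the Adam optimizer into a single training step with the stated batch size of 16, using a tiny stand-in model and synthetic data so that it runs on its own; the real model is the CNN of Step 3, and the stated checkpointing schedule (every 500 utterances over 50 epochs) would wrap this step.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 1423                       # Pinyin inventory size
FEATURE_DIM = 200                        # MFCC feature dimension

# Tiny stand-in model so the example is self-contained; the real model is the CNN of Step 3.
inputs = layers.Input(shape=(None, FEATURE_DIM))
x = layers.Conv1D(64, 3, padding="same", activation="relu")(inputs)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(inputs, outputs)

optimizer = tf.keras.optimizers.Adam()   # optimizer named in step (4)

def train_step(features, labels, input_len, label_len):
    with tf.GradientTape() as tape:
        y_pred = model(features, training=True)
        # CTC loss: negative log-probability of the label sequence given the input.
        loss = tf.reduce_mean(
            tf.keras.backend.ctc_batch_cost(labels, y_pred, input_len, label_len))
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# One synthetic batch with the stated batch size of 16 (real training runs 50 epochs,
# periodically saving the weight file with model.save_weights).
feats = tf.random.normal((16, 100, FEATURE_DIM))
labs = tf.random.uniform((16, 20), 0, NUM_CLASSES - 1, dtype=tf.int32)
loss = train_step(feats, labs, tf.fill((16, 1), 100), tf.fill((16, 1), 20))
```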
Step 7: Use the trained speech-enhancement-based Chinese speech recognition system to recognize the speech of the test set, compute the recognition accuracy, and compare its performance with the traditional algorithm. The flow of the recognition test system is shown in Fig. 8. The recognition accuracy of this patent and its comparison with the traditional algorithm in noisy environments are shown in Fig. 9, and in reverberant environments in Fig. 10.
The test is carried out as follows:
(1) With the traditional speech recognition system, run the recognition test on the 2000 unenhanced utterances of the test set T of the complex-environment speech database and compute the recognition accuracy. Representative recognition results are shown in Figs. 9 and 10.
(2) With the speech-enhancement-based recognition system of the present invention, run the recognition test on the 2000 enhanced utterances of the test set T̃ and compute the recognition accuracy of the method of the present invention. Representative recognition results are shown in Figs. 9 and 10.
(3) Finally, analyze the performance of the speech-enhancement-based recognition system proposed by the present invention.
The statistics show that the speech-enhancement-based recognition algorithm proposed by the present invention greatly improves the recognition accuracy for speech in Gaussian white noise, in the presence of background noise or interfering sound sources, and in reverberation, with a performance gain of roughly 30%. Compared with the traditional speech recognition algorithm, the recognition accuracy of the proposed algorithm is also much higher: the traditional algorithm performs poorly in these environments, whereas the proposed algorithm performs very well. A comparison of the recognition results of the proposed algorithm and the traditional algorithm in selected noisy environments is shown in Fig. 9, and in selected reverberant environments in Fig. 10.
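The text does not state how the accuracy is computed; a common choice, sketched here purely as an assumption, is syllable-level accuracy derived from the edit distance between the recognized and reference Pinyin sequences.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two Pinyin token sequences."""
    d = [[i + j if i * j == 0 else 0 for j in range(len(hyp) + 1)] for i in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)]

def accuracy(references, hypotheses):
    """Syllable accuracy = 1 - total edit distance / total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return 1.0 - errors / total

# Example: two test utterances with their recognized Pinyin sequences.
refs = [["ni3", "hao3"], ["tian1", "qi4", "hen3", "hao3"]]
hyps = [["ni3", "hao3"], ["tian1", "qi4", "hen3", "hao4"]]
print(accuracy(refs, hyps))  # 5 correct of 6 syllables -> ~0.833
```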
It can thus be seen that the deep neural network speech recognition method based on speech enhancement in complex environments of the present invention solves the problems of existing speech recognition algorithms, namely sensitivity to noisy environments, high demands on speech quality, and a single applicable scenario, and realizes speech recognition in complex speech environments.
In the above steps, the symbol i denotes the i-th speech signal of the training and test sets undergoing speech enhancement, i = 1, 2, ..., 12000; the symbol r denotes the r-th frame of the speech signal, r = 1, 2, 3, ..., g, where g is the total number of frames after framing and varies with the duration of the processed speech; the symbol l denotes the l-th frequency band of the speech signal, l = 0, 1, 2, ..., 39; and k is a dummy variable indexing the discrete frequencies, k = 0, 1, 2, ..., 1023.
The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Although the invention has been presented above with a preferred embodiment, this is not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make minor changes or modifications into equivalent embodiments; any simple modification, equivalent change, or refinement made to the above embodiment in accordance with the technical essence of the invention, without departing from the technical solution of the invention, still falls within the scope of the technical solution of the present invention.
Advantages of the Invention
The present invention builds its model on deep learning neural networks and speech enhancement. First, a complex-environment speech data set is built, and speech enhancement is applied, in the front-end speech signal preprocessing stage, to the speech signals recorded under the various complex conditions to be recognized. A language text data set is then established, a language model is built and trained, and a Chinese dictionary file is established. A neural network acoustic model is then built and trained on the enhanced speech training set, with the help of the language model and dictionary, to obtain the acoustic model weight file, thereby achieving accurate recognition of Chinese speech in complex environments. This addresses the problems of existing speech recognition algorithms: sensitivity to noise factors, high demands on speech quality, and a single application scenario.
Claims (1)
Priority Applications (1)
- CN202010880777.7A (granted as CN111986661B), priority and filing date 2020-08-28: Deep neural network voice recognition method based on voice enhancement in complex environment
Publications (2)
- CN111986661A, published 2020-11-24
- CN111986661B, granted 2024-02-09
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant