
CN111986661A - Deep neural network speech recognition method based on speech enhancement in complex environment - Google Patents

Deep neural network speech recognition method based on speech enhancement in complex environment

Info

Publication number
CN111986661A
CN111986661A
Authority
CN
China
Prior art keywords
speech
frame
signal
voice
speech recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010880777.7A
Other languages
Chinese (zh)
Other versions
CN111986661B (en)
Inventor
王兰美
梁涛
朱衍波
廖桂生
王桂宝
孙长征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Shaanxi University of Technology
Original Assignee
Xidian University
Shaanxi University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University, Shaanxi University of Technology filed Critical Xidian University
Priority to CN202010880777.7A priority Critical patent/CN111986661B/en
Publication of CN111986661A publication Critical patent/CN111986661A/en
Application granted granted Critical
Publication of CN111986661B publication Critical patent/CN111986661B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208 Noise filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A deep neural network speech recognition method based on speech enhancement in complex environments builds its model on deep learning neural networks and speech enhancement. First, a complex speech environment data set is built, and speech enhancement is applied in the front-end speech signal preprocessing stage to the speech signals to be recognized under various complex speech conditions. A language text data set is then established, a language model is built and trained with an algorithm, and a Chinese dictionary file is created. Next, a neural network acoustic model is built and trained on the enhanced speech training set with the help of the language model and the dictionary to obtain an acoustic model weight file, thereby achieving accurate recognition of Chinese speech in complex environments. The method addresses the problems that existing speech recognition algorithms are sensitive to noise factors, demand high speech quality, and serve only a single application scenario.

Description

Deep neural network speech recognition method based on speech enhancement in complex environments

Technical Field

The invention belongs to the field of speech recognition, and in particular relates to a deep neural network speech recognition method based on speech enhancement in complex environments.

Background

In recent years, technological innovation has repeatedly broken through difficult problems, the economy has prospered, and society has progressed. Having solved the basic problems of food, clothing, housing, and transportation, people now place greater demands on building a better life. This vision has driven the emergence of a large number of virtual social applications such as QQ and WeChat that combine life, work, and entertainment. Virtual social software brings great convenience to people's lives, work, and communication, especially the speech recognition functions built into the major social applications. Speech recognition frees people from the constraints of traditional interaction methods such as the keyboard and mouse, allowing information to be conveyed through the most natural form of communication: speech. At the same time, speech recognition has gradually found wide application in industry, communications, home appliances, home services, medical care, consumer electronics, and other fields.

Most of today's social software already achieves very high speech recognition accuracy on clean speech with no background noise and no interfering sound sources. When the speech signal to be recognized contains noise, interference, or reverberation, the accuracy of existing speech recognition systems drops sharply. The main reason is that existing systems consider neither denoising nor interference suppression, either in the speech signal preprocessing stage at the recognition front end or when building the acoustic model.

Existing Chinese speech recognition algorithms place strict demands on speech signal quality and have poor robustness; when the speech quality is poor or the audio is severely corrupted, recognition fails. They therefore see only limited use under clean, ideal speech conditions. To improve the applicability of speech recognition in real-life environments and address the shortcomings of existing algorithms, the present invention proposes a deep neural network speech recognition method based on speech enhancement in complex environments. The method takes deep learning neural networks and speech enhancement as its technical background. First, speech enhancement is applied at the recognition front end to the speech signals to be recognized under various complex speech conditions; a language text data set is established, a language model is built and trained with an algorithm; a Chinese dictionary file is created; a neural network acoustic model is then built and trained on the enhanced speech training set with the help of the language model and the dictionary, yielding an acoustic model weight file and thereby establishing a speech recognition system that performs well in complex speech environments.

In view of the real-life applications of speech recognition technology, the complex environment speech recognition technology proposed by the present invention covers four combined speech environments: clean speech, Gaussian white noise, background noise or interfering sound sources, and reverberation. The method of the present invention achieves high recognition accuracy, strong model generalization, and good robustness to all of these environmental factors.

Summary of the Invention

The purpose of the present invention is to provide a deep neural network speech recognition method based on speech enhancement in complex environments.

In order to achieve the above object, the present invention adopts the following technical solution:

The deep neural network speech recognition method based on speech enhancement in complex environments builds its model on deep learning neural networks and speech enhancement; the overall flow of the speech recognition technical scheme is shown in Fig. 1. First, a complex speech environment data set is built, and speech enhancement is applied in the front-end speech signal preprocessing stage to the speech signals to be recognized under complex speech conditions. A language text data set is then established, a language model is built and trained with an algorithm, and a Chinese dictionary file is created. Finally, a neural network acoustic model is built and trained on the enhanced speech training set with the help of the language model and the dictionary, yielding an acoustic model weight file and enabling accurate recognition of Chinese speech in complex environments. This resolves the problems that existing speech recognition algorithms are sensitive to noise, demand high speech quality, and serve only a single application scenario. The steps of the deep neural network speech recognition method based on speech enhancement in complex environments are as follows:

Step 1. Construction and processing of the speech data set for complex environments. Clean speech, speech in Gaussian white noise, speech with background noise or interfering sound sources, and speech in reverberant environments are collected to form the speech data set C of the speech recognition system. The speech data for each environment in C are then split into a training set and a test set at a ratio of 5:1 (training utterances : test utterances). The training and test portions from all environments are pooled and shuffled to form the training set X and the test set T. The i-th utterance in X is denoted x_i; the j-th utterance in T is denoted t_j. For each utterance in X, a label file in .txt format is created containing the name of the utterance and its correct Hanyu Pinyin sequence. A partial view of the training-set label file is shown in Fig. 2.

Step 2. Speech enhancement is applied to the constructed training set X and test set T, giving the enhanced training set X̂ and the enhanced test set T̂. The i-th utterance in the enhanced training set X̂ is denoted x̂_i, and the j-th utterance in the enhanced test set T̂ is denoted t̂_j. Taking the enhancement of the i-th utterance x_i of the training set as an example, the specific steps are as follows. The speech signal x_i to be enhanced is read with the audioread speech processing function built into the matlab software, giving the sampling rate f_s of the signal and the matrix x_i(n) containing the speech information, where x_i(n) is the speech sample value at time n. Pre-emphasis is applied to x_i(n) to obtain y_i(n). A Hamming window is then applied to y_i(n) and the signal is split into frames, giving the information y_{i,r}(n) of each frame of the speech signal, where y_{i,r}(n) denotes the speech information matrix of the r-th frame of the i-th pre-emphasized speech signal. An FFT of y_{i,r}(n) then gives the short-time spectrum Y_{i,r}(e^{jω_k}) of the r-th frame of the i-th speech signal. The spectrum is processed band by band with the gammatone weighting function H_l to obtain the power P_{i,r,l}(r,l) in the l-th frequency band of the r-th frame of the i-th speech signal, where l takes the values 0, ..., 39; the powers of all frequency bands of the r-th frame are obtained by the same steps. Noise reduction, dereverberation, and spectral integration then give the enhanced short-time spectrum X̂_{i,r}(e^{jω_k}) of the r-th frame of the i-th speech signal. The speech signals of the other frames are processed in the same way to obtain the short-time spectrum of every frame, and an IFFT is applied to synthesize the speech signal frames in the time domain, giving the enhanced speech signal x̂_i, which is placed in the enhanced training set X̂. The overall speech data enhancement flow is shown in Fig. 3.

Step 3. Building the speech recognition acoustic model. The acoustic model built in this patent uses a CNN+CTC architecture. The input layer takes the speech signals x̂_i of the training set X̂ enhanced in Step 2; the MFCC feature extraction algorithm is applied to each training utterance x̂_i to obtain a 200-dimensional feature value sequence. The hidden layers alternate convolutional and pooling layers, and Dropout layers are introduced to prevent overfitting; the convolution kernel size is 3 and the pooling window size is 2. The output layer is a fully connected layer of 1423 neurons activated with the softmax function, and the CTC loss function is used as the loss function to realize connectionist temporal multi-output. The 1423-dimensional output corresponds exactly to the 1423 commonly used Hanyu Pinyin syllables in the Chinese dictionary dict.txt built in Step 4. The network structure of the speech recognition acoustic model is shown in Fig. 4, where the specific parameters of the convolutional, pooling, Dropout, and fully connected layers are marked.

Step 4. Building the 2-gram language model and the dictionary for speech recognition. Building the language model comprises establishing the language text data set, constructing the 2-gram language model, and collecting and building the Chinese dictionary. The language text data set takes the form of an electronic .txt file containing newspapers, middle school texts, and well-known novels. The dictionary of a language is stable and unchanging; in the present invention the Chinese dictionary takes the form of a dict.txt file listing the Chinese characters corresponding to 1423 Hanyu Pinyin syllables commonly used in daily life, while also accounting for the fact that one Pinyin syllable may correspond to several characters. A partial view of the dictionary built by the present invention is shown in Fig. 5.

Step 5. The constructed 2-gram language model is trained on the established language text data set, yielding the word occurrence count table and the state transition table of the language model. The training proceeds as follows: the text content of the language text data set is read in a loop, the number of occurrences of each single word and the number of co-occurrences of each pair of adjacent words are counted, and these counts are aggregated into a single-word occurrence table and a two-word state transition table. The language model training flow is shown in Fig. 6.

Step 6. The constructed acoustic model is trained with the trained language model, the established dictionary, and the enhanced speech training set X̂, yielding the acoustic model weight file and the other parameter configuration files. The training procedure is as follows. The weights throughout the acoustic network model are initialized. The utterances of the training set X̂ are imported in turn; for any speech signal x̂_i, the MFCC feature extraction algorithm first produces the 200-dimensional feature value sequence of the signal, which is then passed, as listed in Fig. 7, through the convolutional, pooling, Dropout, and fully connected layers in turn; the output layer, a fully connected layer of 1423 neurons activated with the softmax function, produces the 1423-dimensional acoustic features of the speech signal. The 1423-dimensional acoustic feature values are then decoded under the action of the language model and the dictionary, and the Hanyu Pinyin sequence of the recognized speech signal x̂_i is output. The Pinyin sequence recognized by the acoustic model is compared with the Pinyin label sequence of x̂_i in the training set X̂; the error is computed and back-propagated to update the weights throughout the acoustic model. The CTC loss function is used as the loss function and is optimized with the Adam algorithm. Training uses batchsize = 16 and epoch = 50 iterations, and the weight file is saved once for every 500 utterances trained. Every utterance of the training set X̂ is processed according to the above steps until the acoustic model loss converges, at which point the acoustic model is fully trained. The weight file and the configuration files of the acoustic model are saved. The acoustic model training block diagram is shown in Fig. 7.

Step 7. The trained speech-enhancement-based Chinese speech recognition system is used to recognize the utterances of the test set T̂; the speech recognition accuracy is computed and compared against the traditional algorithm. The flow chart of the speech recognition test system is shown in Fig. 8. Partial results for the speech recognition accuracy of this patent and its performance comparison with the traditional algorithm are shown in Figs. 9 and 10.

Advantages of the Invention

The deep neural network speech recognition method based on speech enhancement in complex environments resolves the problems that existing speech recognition algorithms are sensitive to noise and other complex environmental factors, demand high speech quality, and serve only a single application scenario. In addition, the speech recognition method proposed by the present invention uses deep learning neural networks for acoustic modeling, giving the model strong transfer learning ability, and the introduction of the speech enhancement method makes the speech recognition system of the present invention highly robust to interference from complex environmental factors.

Brief Description of the Drawings

In order to explain the technical solution of the present invention more clearly, the drawings used in the description of the invention are briefly introduced below, so that the content of the invention can be better understood.

Fig. 1 is the detailed flow chart of the speech recognition technical scheme of the present invention;

Fig. 2 is a partial view of the speech labels of the speech recognition training set of the present invention;

Fig. 3 is the flow chart of the speech enhancement process for speech recognition of the present invention;

Fig. 4 is the network structure diagram of the speech recognition acoustic model of the present invention;

Fig. 5 is a partial view of the dictionary built by the present invention;

Fig. 6 is the language model training flow chart of the present invention;

Fig. 7 is the training diagram of the acoustic model of the present invention;

Fig. 8 is the flow chart of the speech recognition test system of the present invention;

Fig. 9 compares the effect of the speech recognition algorithm of the present invention with the traditional algorithm in noisy environments;

Fig. 10 compares the effect of the speech recognition algorithm of the present invention with the traditional algorithm in reverberant environments;

Detailed Description of the Embodiments

The specific implementation steps of the deep neural network speech recognition method based on speech enhancement in complex environments are as follows:

Step 1. Construction and processing of the speech data set for complex environments. Clean speech, speech in Gaussian white noise, speech with background noise or interfering sound sources, and speech in reverberant environments are collected to form the speech data set C of the speech recognition system. The speech data for each environment in C are then split into a training set and a test set at a ratio of 5:1 (training utterances : test utterances). The training and test portions from all environments are pooled and shuffled to form the training set X and the test set T. The i-th utterance in X is denoted x_i; the j-th utterance in T is denoted t_j. For each utterance in X, a label file in .txt format is created containing the name of the utterance and its correct Hanyu Pinyin sequence. A partial view of the training-set label file is shown in Fig. 2.

The specific collection methods are as follows. For clean speech, recordings are made by multiple speakers under ideal laboratory conditions, using Chinese newspapers, novels, and student texts as material; each recording lasts no more than 10 seconds, and 3000 clean utterances are recorded in total. For the Gaussian white noise and reverberation environments, Adobe Audition software is used for synthesis: the recorded clean speech is mixed with Gaussian white noise, and the reverberant speech is resynthesized directly with the reverberation environments provided by the software; 3000 utterances are produced for the Gaussian white noise environment and 3000 for the reverberation environment. Finally, speech with background noise or interfering sound sources is mainly recorded in the field by multiple speakers in relatively noisy places such as factories and restaurants, giving another 3000 utterances. All collected audio files are in .wav format. The collected speech is then partitioned as follows: for each of the four speech environments, 2500 utterances are used as the training set of the speech recognition system and the remaining 500 as the test set. In total, the speech recognition training set X contains 10000 utterances and the test set T contains 2000; the training and test sets are each shuffled to avoid overfitting of the trained model.
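
As a concrete illustration of the 5:1 split, pooling, and shuffling described above, the following is a minimal Python sketch; the directory layout, environment folder names, and the label-writing helper are assumptions made for illustration and are not specified by the patent.

```python
import random
from pathlib import Path

ENVIRONMENTS = ["clean", "white_noise", "background_noise", "reverb"]  # assumed folder names

def split_dataset(root, seed=0):
    """Split each environment's .wav files 5:1 into train/test, then pool and shuffle both sets."""
    rng = random.Random(seed)
    train, test = [], []
    for env in ENVIRONMENTS:
        wavs = sorted(Path(root, env).glob("*.wav"))
        rng.shuffle(wavs)
        n_test = len(wavs) // 6          # 5:1 ratio, e.g. 2500 train / 500 test per environment
        test.extend(wavs[:n_test])
        train.extend(wavs[n_test:])
    rng.shuffle(train)                   # shuffle the pooled sets to avoid overfitting
    rng.shuffle(test)
    return train, test

def write_labels(utterances, labels, path="train_labels.txt"):
    """Write one line per training utterance: its name and its correct Pinyin sequence."""
    with open(path, "w", encoding="utf-8") as f:
        for wav in utterances:
            f.write(f"{wav.stem}\t{labels[wav.stem]}\n")   # labels: dict name -> Pinyin string (assumed)
```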

Step 2. Speech enhancement is applied to the constructed training set X and test set T, giving the enhanced training set X̂ and the enhanced test set T̂. The i-th utterance in the enhanced training set X̂ is denoted x̂_i, and the j-th utterance in the enhanced test set T̂ is denoted t̂_j. Taking the enhancement of the i-th utterance x_i of the training set as an example, the specific steps are as follows. The speech signal x_i to be enhanced is read with the audioread speech processing function built into the matlab software, giving the sampling rate f_s of the signal and the matrix x_i(n) containing the speech information, where x_i(n) is the speech sample value at time n. Pre-emphasis is applied to x_i(n) to obtain y_i(n). A Hamming window is then applied to y_i(n) and the signal is split into frames, giving the information y_{i,r}(n) of each frame of the speech signal, where y_{i,r}(n) denotes the speech information matrix of the r-th frame of the i-th pre-emphasized speech signal. An FFT of y_{i,r}(n) then gives the short-time spectrum Y_{i,r}(e^{jω_k}) of the r-th frame of the i-th speech signal. The spectrum is processed band by band with the gammatone weighting function H_l to obtain the power P_{i,r,l}(r,l) in the l-th frequency band of the r-th frame of the i-th speech signal, where l takes the values 0, ..., 39; the powers of all frequency bands of the r-th frame are obtained by the same steps. Noise reduction, dereverberation, and spectral integration then give the enhanced short-time spectrum X̂_{i,r}(e^{jω_k}) of the r-th frame of the i-th speech signal. The speech signals of the other frames are processed in the same way to obtain the short-time spectrum of every frame, and an IFFT is applied to synthesize the speech signal frames in the time domain, giving the enhanced speech signal x̂_i, which is placed in the enhanced training set X̂. The overall speech data enhancement flow is shown in Fig. 3.

Each step of the speech enhancement is described in detail below:

(1) Pre-emphasis of the speech signal

Pre-emphasis is applied to the i-th speech signal matrix x_i(n) in the training set X to obtain y_i(n), where y_i(n) = x_i(n) - α·x_i(n-1); α is a constant, and in this patent α = 0.98. x_i(n-1) is the sample matrix of the i-th training utterance at time n-1.
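
A minimal NumPy sketch of this pre-emphasis step, operating on a sample array that has already been read from the .wav file (the patent reads it with MATLAB's audioread):

```python
import numpy as np

def pre_emphasis(x, alpha=0.98):
    """y(n) = x(n) - alpha * x(n-1); the first sample, which has no predecessor, is kept as-is."""
    x = np.asarray(x, dtype=np.float64)
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```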

(2) Windowing and framing

The Hamming window w(n) is used to window the pre-emphasized speech signal y_i(n) and split it into frames, dividing the continuous speech signal into frame-by-frame discrete signals y_{i,r}(n);

where w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1, is the Hamming window function and N is the window length; in this patent the frame length is 50 ms and the frame shift is 10 ms. Windowing and framing the pre-emphasized speech signal y_i(n) gives the speech signal matrix information y_{i,r}(n) of each frame, where y_{i,r}(n) denotes the speech information matrix of the r-th frame of the i-th speech signal after pre-emphasis, windowing, and framing.
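
A sketch of the windowing and framing step with the stated 50 ms frame length and 10 ms frame shift at f_s = 16 kHz; the zero-padding of the final partial frame is an assumption of this illustration.

```python
import numpy as np

def frame_signal(y, fs=16000, frame_ms=50.0, shift_ms=10.0):
    """Split y into overlapping frames and apply a Hamming window to each frame."""
    frame_len = int(fs * frame_ms / 1000)        # 800 samples at 16 kHz
    hop = int(fs * shift_ms / 1000)              # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(y) - frame_len + hop - 1) // hop)
    need = (n_frames - 1) * hop + frame_len
    y = np.pad(np.asarray(y, dtype=np.float64), (0, max(0, need - len(y))))
    window = np.hamming(frame_len)               # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    return np.stack([y[r * hop: r * hop + frame_len] * window for r in range(n_frames)])
```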

(3) FFT

An FFT is applied to the speech information matrix y_{i,r}(n) of the r-th frame of the i-th speech signal, transforming it from the time domain to the frequency domain and giving the short-time signal spectrum Y_{i,r}(e^{jω_k}) of the r-th frame of the i-th speech signal.

(4) Computing the speech signal power P_{i,r,l}(r,l)

The short-time spectrum Y_{i,r}(e^{jω_k}) of each frame is processed with the gammatone weighting function to obtain the power of the speech signal in every frequency band of every frame:

P_{i,r,l}(r,l) = Σ_{k=0}^{N-1} |Y_{i,r}(e^{jω_k}) H_l(e^{jω_k})|²,   ω_k = 2πk/N

P_{i,r,l}(r,l) denotes the power in the l-th frequency band of the r-th frame of the speech signal y_i(n); k is a dummy variable indexing the discrete frequencies and ω_k is the discrete frequency. Because a 50 ms frame length is used in the FFT and the sampling rate of the speech signal is 16 kHz, N = 1024. H_l denotes the spectrum of the gammatone filterbank of the l-th frequency band evaluated at frequency index k; it is a built-in speech processing function of the matlab software whose input parameter is the frequency band l. Y_{i,r}(e^{jω_k}) denotes the short-time spectrum of the r-th frame of the speech signal, and L = 40 is the total number of channels.
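
A sketch of this band-power computation. The gammatone filterbank magnitude responses H (one row per band, evaluated on the N FFT bins) are assumed to be precomputed, since the patent obtains them from a MATLAB built-in that is not reproduced here.

```python
import numpy as np

def band_powers(frames, H, n_fft=1024):
    """P[r, l] = sum_k |Y_r(k) * H_l(k)|^2 over the FFT bins k.

    frames: (n_frames, frame_len) windowed time-domain frames
    H:      (n_bands, n_fft) gammatone magnitude responses (assumed given), n_bands = 40
    """
    Y = np.fft.fft(frames, n=n_fft, axis=1)                        # short-time spectra Y_{i,r}
    return np.einsum("rk,lk->rl", np.abs(Y) ** 2, np.abs(H) ** 2)  # (n_frames, n_bands)
```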

(5) Noise reduction and dereverberation of the speech signal

After the speech signal power P_{i,r,l}(r,l) has been obtained, noise reduction and dereverberation are performed. The specific steps are:

(1) Compute the low-pass power M_{i,r,l}[r,l] of the l-th frequency band of the r-th frame according to the following formula:

M_{i,r,l}[r,l] = λ·M_{i,r,l}[r-1,l] + (1-λ)·P_{i,r,l}[r,l]

M_{i,r,l}[r-1,l] denotes the low-pass power of the l-th frequency band of the (r-1)-th frame; λ is a forgetting factor that varies with the bandwidth of the low-pass filter, and in this patent λ = 0.4.

(2) The slowly varying components of the signal and the falling-edge envelope of the power are removed: the power P_{i,r,l}[r,l] of the speech signal is processed to obtain the enhanced power P̂_{i,r,l}[r,l] of the l-th frequency band of the r-th frame, in which c_0 is a constant factor; this patent takes c_0 = 0.01.

(3) Following steps (1) and (2), the enhancement processing is applied in turn to every frequency band of every frame of the signal.
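
The rectification that removes the slowly varying component appears only as a figure in the original; a common form consistent with the description (subtract the low-pass power M, then floor the result at c_0 times the original power) is sketched below. The max-based rectification and the initialization of M are assumptions of this sketch, not the patent's exact formula.

```python
import numpy as np

def suppress_slow_and_falling(P, lam=0.4, c0=0.01):
    """Noise/reverberation suppression on band powers P of shape (n_frames, n_bands).

    M[r, l]     = lam * M[r-1, l] + (1 - lam) * P[r, l]      (low-pass power, as in the patent)
    P_hat[r, l] = max(P[r, l] - M[r, l], c0 * P[r, l])        (assumed rectification form)
    """
    M = np.zeros_like(P)
    P_hat = np.empty_like(P)
    for r in range(P.shape[0]):
        prev = M[r - 1] if r > 0 else P[0]                    # assumed: start the recursion at the first frame's power
        M[r] = lam * prev + (1 - lam) * P[r]
        P_hat[r] = np.maximum(P[r] - M[r], c0 * P[r])
    return P_hat
```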

(6) Spectral integration

Once the enhanced power P̂_{i,r,l}[r,l] of every frequency band of every frame of the speech signal has been obtained, spectral integration of the speech signal gives the short-time spectrum of each frame of the enhanced speech signal. The spectral integration formula is:

X̂_{i,r}(e^{jω_k}) = μ_{i,r}[r,k] · Y_{i,r}(e^{jω_k})

In the above formula, μ_{i,r}[r,k] denotes the spectral weight coefficient at the k-th index of the r-th frame; Y_{i,r}(e^{jω_k}) is the short-time signal spectrum of the r-th frame of the unenhanced i-th speech signal, and X̂_{i,r}(e^{jω_k}) is the short-time signal spectrum of the r-th frame of the i-th speech signal after enhancement.

where μ_{i,r}[r,k] is obtained from the following formulas:

μ_{i,r}[r,k] = ( Σ_{l=0}^{L-1} ω_{i,r,l}[r,l]·H_l(e^{jω_k}) ) / ( Σ_{l=0}^{L-1} H_l(e^{jω_k}) ),   0 ≤ k ≤ N/2-1

μ_{i,r}[r,k] = μ_{i,r}[r,N-k],   N/2 ≤ k ≤ N-1

In these formulas H_l denotes the spectrum of the gammatone filterbank of the l-th frequency band evaluated at frequency index k; ω_{i,r,l}[r,l] is the weight coefficient of the l-th frequency band of the r-th frame of the i-th speech signal. The weight coefficient is the ratio of the enhanced to the original frequency-domain power of the signal in that band, obtained from the following formula:

ω_{i,r,l}[r,l] = P̂_{i,r,l}[r,l] / P_{i,r,l}[r,l]

The enhanced short-time spectrum of the r-th frame of the i-th speech signal after spectral integration has thus been obtained; each frame is processed in turn as above to obtain the enhanced short-time spectrum of every frame of the i-th speech signal. Each frame's enhanced spectrum X̂_{i,r}(e^{jω_k}) is transformed by IFFT to obtain the time-domain speech signal of each frame, and the frames are spliced in the time domain to obtain the enhanced speech signal x̂_i. The IFFT and the time-domain frame splicing of the speech signal are as follows:

x̂_{i,r}(n) = (1/N) Σ_{k=0}^{N-1} X̂_{i,r}(e^{jω_k}) e^{jω_k n}

x̂_i(n) is obtained by splicing (overlap-adding) the time-domain frames x̂_{i,r}(n), r = 1, 2, ..., g, at their original frame positions, where g is the total number of frames.

In the above expressions, x̂_i(n) is the enhanced speech signal matrix and x̂_{i,r}(n) is the enhanced speech signal matrix of the r-th frame; g is the total number of frames of the speech signal, and its value varies with the duration of the speech signal. Having obtained the sample matrix x̂_i(n) of the enhanced speech signal at time n, the matlab software's built-in audiowrite function is used to write it out at the sampling rate f_s = 16 kHz of the speech signal, giving the enhanced speech signal x̂_i.

At this point, the enhancement of one utterance in the speech training set is complete. The training set X and the test set T are then processed in turn according to the above steps; the enhanced training-set utterances are saved in the set X̂, and the enhanced test-set utterances are saved in the set T̂.

Step 3. Building the speech recognition acoustic model. The acoustic model built in this patent uses a CNN+CTC architecture. The input layer takes the 200-dimensional feature value sequence of each speech signal x̂_i in the training set X̂ enhanced in Step 2, extracted with the MFCC feature extraction algorithm. The hidden layers alternate convolutional and pooling layers, and Dropout layers are introduced to prevent overfitting; the convolution kernel size is 3 and the pooling window size is 2. The output layer is a fully connected layer of 1423 neurons activated with the softmax function, and the CTC loss function is used as the loss function to realize connectionist temporal multi-output; the 1423-dimensional output corresponds exactly to the 1423 commonly used Hanyu Pinyin syllables in the Chinese dictionary dict.txt built in Step 4. The network structure of the speech recognition acoustic model is shown in Fig. 4, where the specific parameters of the convolutional, pooling, Dropout, and fully connected layers are marked.

Step 4. Building the speech recognition language model. Building the language model comprises establishing the language text data set, designing the 2-gram language model, and collecting the Chinese dictionary.

(1) Building the language text database

First, the text data set needed to train the language model is built. The language text data set takes the form of an electronic .txt file containing newspapers, middle school texts, and well-known novels. Electronic .txt files of newspapers, middle school texts, and well-known novels are collected to build the language text database; note that the text selected for the database must be representative and reflect the Chinese usage habits of daily life.

(2) Building the 2-gram language model

This patent builds the language model with the 2-gram algorithm, a language model training method that partitions the text by the words themselves. The 2 in 2-gram means that the probability of the current word is considered to depend only on the words immediately preceding it; 2 is the constraint on the memory length of the word sequence. The 2-gram algorithm can be expressed as:

S(W) = p(w_1) · Π_{d=2}^{q} p(w_d | w_{d-1})

In the above formula, W denotes a word sequence, w_1, w_2, ..., w_q denote the individual words of the sequence, and q is the length of the sequence; S(W) is the probability that this word sequence conforms to linguistic usage. d indexes the d-th word.

(3) Building the Chinese dictionary

The dictionary of the language model for the speech recognition system is built. The dictionary of a language is stable and unchanging; in the present invention the Chinese character dictionary takes the form of a dict.txt file listing the Chinese characters corresponding to 1423 Hanyu Pinyin syllables commonly used in daily life, while also accounting for the fact that one Pinyin syllable may correspond to several characters. A partial view of the dictionary built by the present invention is shown in Fig. 5.

Step 5. The constructed 2-gram language model is trained on the established language text data set, yielding the word occurrence count table and the state transition table of the language model. The language model training block diagram is shown in Fig. 6. The specific training procedure is as follows:

(1) The text content of the language text data set is read in a loop and the number of occurrences of each single word is counted; the counts are aggregated into the single-word occurrence table.

(2) The number of times each pair of words appears together in the language text data set is counted in a loop; the counts are aggregated into the two-word state transition table.
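
A minimal sketch of the two counting passes and of scoring a word sequence with the resulting tables; the whitespace tokenization and the unsmoothed count ratios are simplifying assumptions of this illustration.

```python
from collections import Counter

def train_2gram(corpus_lines):
    """Build the single-word occurrence table and the two-word state transition table."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus_lines:
        words = line.split()                   # assumed: the corpus is already segmented into words
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))  # counts of adjacent word pairs
    return unigrams, bigrams

def sequence_score(words, unigrams, bigrams):
    """Approximate S(W) as the product of p(w_d | w_{d-1}) estimated from the count tables."""
    score = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        score *= bigrams[(prev, cur)] / unigrams[prev]
    return score
```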

Step 6. The constructed acoustic model is trained with the trained language model, the established dictionary, and the enhanced speech training set X̂, yielding the weight file of the acoustic model and the other parameter configuration files. The specific acoustic model training procedure is as follows:

(1) Initialize the weights throughout the acoustic network model;

(2) The utterances of the speech training set X̂ are imported in turn for training. Any speech signal x̂_i is first processed by the MFCC feature extraction algorithm to obtain the 200-dimensional feature value sequence of the speech signal; then, as listed in Fig. 7, the 200-dimensional feature value sequence of the speech signal is passed in turn through the convolutional, pooling, Dropout, and fully connected layers, and the output layer, a fully connected layer of 1423 neurons activated with the softmax function, outputs the 1423-dimensional acoustic features of the speech signal;

(3) After the feature values are obtained, the 1423-dimensional acoustic feature values are decoded under the action of the language model and the dictionary, and the Hanyu Pinyin sequence of the recognized speech signal x̂_i is output;

(4) The Hanyu Pinyin sequence recognized by the acoustic model is compared with the Hanyu Pinyin label sequence of the i-th utterance x̂_i in the training set X̂; the error is computed and back-propagated to update the weights throughout the acoustic model. The CTC loss function is used as the loss function and is optimized with the Adam algorithm. Training is set to batchsize = 16 and epoch = 50 iterations, and the weight file is saved once for every 500 utterances trained. The CTC loss function is as follows:

L_CTC = - Σ_{(e,z)} ln F(z|e)

In the above formula, L_CTC denotes the total loss produced by training on the training set; e denotes the input speech, i.e. a speech signal x̂_i from the enhanced training set X̂; z is the output Chinese character sequence; and F(z|e) denotes the probability that the output sequence is z when the input is e.
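
A sketch of the training loop under the stated settings (CTC loss, Adam, batch size 16, 50 epochs), reusing the PinyinCNN sketch above; the data loader, label encoding, and periodic checkpointing are assumptions of this illustration.

```python
import torch
import torch.nn as nn

def train(model, loader, n_epochs=50, lr=1e-3):
    """loader yields (feats, feat_lens, targets, target_lens) batches of 16 utterances (assumed collation)."""
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)   # blank index matches the extra class of the model sketch
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    seen = 0
    for epoch in range(n_epochs):
        for feats, feat_lens, targets, target_lens in loader:
            log_probs = model(feats)                # (time', batch, n_pinyin + 1)
            in_lens = feat_lens // 4                # two MaxPool1d(2) layers halve the time axis twice
            loss = ctc(log_probs, targets, in_lens, target_lens)
            opt.zero_grad()
            loss.backward()
            opt.step()
            seen += feats.shape[0]
            if seen >= 500:                         # save a weight file after every 500 training utterances
                torch.save(model.state_dict(), "acoustic_model.pt")
                seen = 0
```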

(5) The acoustic model for speech recognition is trained by repeating the above steps until the loss of the acoustic model converges, at which point the acoustic model is fully trained. The weight file and the various configuration files of the acoustic model are saved. The specific training diagram of the speech recognition acoustic model is shown in Fig. 7.

Step 7. The trained speech-enhancement-based Chinese speech recognition system is used to recognize the utterances of the test set T̂; the speech recognition accuracy is computed and compared against the traditional algorithm. The flow chart of the speech recognition test system is shown in Fig. 8. Partial results for the speech recognition accuracy of this patent and its performance comparison with the traditional algorithm in noisy environments are shown in Fig. 9, and in reverberant environments in Fig. 10.

The specific procedure is as follows:

(1) The traditional speech recognition system is used to run a speech recognition test on the 2000 unenhanced utterances of the test set T of the constructed complex environment speech database, and its speech recognition accuracy is computed. Representative speech recognition results are shown in Figs. 9 and 10.

(2) The speech-enhancement-based speech recognition system of the present invention is used to run a speech recognition test on the 2000 enhanced utterances of the test set T̂ of the constructed speech database, and the speech recognition accuracy of the method of the present invention is computed. Representative speech recognition results are shown in Figs. 9 and 10.
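
The patent reports accuracy figures but does not state the scoring formula; one common choice for pinyin-sequence output, sketched below as an assumption, is 1 minus the edit distance between the recognized and reference sequences, normalized by the reference length.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def recognition_accuracy(references, hypotheses):
    """Average per-utterance accuracy = 1 - edit_distance / reference length."""
    scores = [1.0 - edit_distance(ref, hyp) / max(len(ref), 1)
              for ref, hyp in zip(references, hypotheses)]
    return sum(scores) / len(scores)
```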

(3) Finally, the performance of the speech-enhancement-based speech recognition system proposed by the present invention is analyzed.

After the statistics are compiled, it is found that the speech-enhancement-based speech recognition algorithm proposed by the present invention greatly improves the recognition accuracy for speech in Gaussian white noise, in the presence of background noise or interfering sound sources, and in reverberant environments, with a performance gain of roughly 30%. Compared with the traditional speech recognition algorithm, the recognition accuracy of the proposed algorithm is also greatly improved; in particular, for speech recognition in Gaussian white noise, background noise or interfering sound sources, and reverberation, the traditional algorithm performs poorly while the proposed algorithm performs very well. Comparisons of the recognition results of the speech recognition algorithm of the present invention and the traditional algorithm in selected noisy environments are shown in Fig. 9, and in selected reverberant environments in Fig. 10.

It can thus be seen that the deep neural network speech recognition method based on speech enhancement in complex environments of the present invention resolves the problems that existing speech recognition algorithms are sensitive to noisy environments, demand high speech quality, and are applicable to only a single scenario, and realizes speech recognition in complex speech environments.

In the above steps, the symbol i denotes the i-th speech signal of the training and test sets undergoing speech enhancement, i = 1, 2, ..., 12000; the symbol r denotes the r-th frame of the speech signal, r = 1, 2, 3, ..., g, where g is the total number of frames after framing and varies with the duration of the processed speech; the symbol l denotes the l-th frequency band of the speech signal, l = 0, 1, 2, ..., 39; and k is a dummy variable indexing the discrete frequencies, k = 0, 1, 2, ..., 1023.

The above are only preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, make minor changes or modifications to the technical content disclosed above to obtain equivalent embodiments; any simple modification, equivalent change, or modification made to the above embodiments in accordance with the technical essence of the present invention, without departing from the content of the technical solution of the present invention, still falls within the scope of the technical solution of the present invention.

Invention Advantages

The present invention builds its model on the technical background of deep learning neural networks and speech enhancement. First, a complex-speech-environment dataset is built, and in the front-end speech signal preprocessing stage of speech recognition, speech enhancement is applied to the speech signals to be recognized under the various complex speech conditions. A language text dataset is then established, a language model is built and trained with the corresponding algorithm, and a Chinese dictionary file is created. A neural network acoustic model is then built and trained with the enhanced speech training set, with the help of the language model and the dictionary, to obtain the acoustic model weight file, thereby achieving accurate recognition of Chinese speech in complex environments. This effectively solves the problems that existing speech recognition algorithms are sensitive to noise factors, place high demands on speech quality, and support only a single application scenario.

Claims (1)

1. The specific implementation steps of the deep neural network speech recognition method based on speech enhancement in complex environments are as follows:

Step 1: establishment and processing of the speech dataset in complex environments. In this part, speech recorded in a clean environment, in a Gaussian white noise environment, in an environment with background noise or interfering sound sources, and in a reverberant environment is collected to form the speech dataset C of the speech recognition system. The speech data of each environment in the dataset C is then divided into a training set and a test set, with a ratio of training utterances to test utterances of 5:1. The training and test portions of all environments are pooled separately and their distributions shuffled, forming the training set X and the test set T; the i-th utterance of the training set X is denoted x_i, and the j-th utterance of the test set T is denoted t_j. At the same time, for each utterance of the training set X a label document in .txt format is edited, whose content comprises the name of the utterance and the corresponding correct Chinese pinyin sequence; a partial view of the training-set label documents is shown in FIG. 2.

The specific collection methods are as follows. For speech under clean conditions, multiple speakers record in ideal laboratory conditions, using Chinese newspapers, novels, and students' texts as material; each utterance is at most 10 seconds long, and 3000 clean utterances are recorded in total. For speech in the Gaussian white noise and reverberant environments, Adobe Audition software is used for synthesis: the recorded clean speech is mixed with Gaussian white noise, and the reverberant speech is re-synthesized directly with the reverberation environments provided by the software; 3000 utterances are produced for the Gaussian white noise environment and 3000 for the reverberant environment. Finally, speech with background noise or interfering sound sources is mainly recorded in the field by multiple speakers in relatively noisy places such as factories and restaurants, for a total of 3000 utterances. All speech files collected above are in .wav format. The collected speech is divided as follows: 2500 utterances of each speech environment are used as the training set of the speech recognition system and the remaining 500 as the test set; in summary, the speech recognition training set X contains 10000 utterances and the test set T contains 2000 utterances, and the training and test sets are each shuffled to prevent the trained model from overfitting.
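For illustration only (this is not part of the claim), a minimal Python sketch of the per-environment 5:1 split and shuffling described in Step 1 could look as follows; the directory layout, environment names, and helper name are assumptions introduced here.

```python
import os
import random

ENVIRONMENTS = ["clean", "white_noise", "background_noise", "reverb"]  # four recording conditions

def split_dataset(root_dir, train_parts=5, test_parts=1, seed=0):
    """Split each environment's .wav files 5:1 into training/test lists and shuffle them."""
    rng = random.Random(seed)
    train, test = [], []
    for env in ENVIRONMENTS:
        env_dir = os.path.join(root_dir, env)
        wavs = sorted(f for f in os.listdir(env_dir) if f.endswith(".wav"))
        rng.shuffle(wavs)
        n_train = len(wavs) * train_parts // (train_parts + test_parts)  # 2500 of 3000 per environment
        train += [os.path.join(env_dir, w) for w in wavs[:n_train]]
        test += [os.path.join(env_dir, w) for w in wavs[n_train:]]
    rng.shuffle(train)   # scramble the distributions so environments are not grouped together
    rng.shuffle(test)
    return train, test
```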
Step 2: speech enhancement is performed on the established speech training set X and test set T to obtain the enhanced speech training set X̃ and test set T̃; the i-th utterance of the enhanced training set X̃ is denoted x̃_i, and the j-th utterance of the enhanced test set T̃ is denoted t̃_j.

Taking the enhancement of the i-th utterance x_i of the speech training set as an example, the specific speech enhancement steps are as follows. For the speech signal x_i to be enhanced, the audioread speech-processing function built into the matlab software is used to read x_i, giving the sampling rate f_s of the speech signal and the matrix x_i(n) containing the speech information, where x_i(n) is the speech sample value at time n. Then x_i(n) is pre-emphasized to obtain y_i(n), and y_i(n) is divided into frames with a Hamming window, giving the information y_i,r(n) of each frame of the speech signal, where y_i,r(n) denotes the speech information matrix of the r-th frame of the i-th pre-emphasized speech signal. An FFT is then applied to y_i,r(n) to obtain the short-time spectrum Y_i,r(e^(jω_k)) of the r-th frame of the i-th speech signal, and the gammatone weighting function H_l is applied band by band to Y_i,r(e^(jω_k)) to obtain the power P_i,r,l(r,l) of the l-th frequency band of the r-th frame of the i-th speech signal, where l takes the values 0, ..., 39; the power of every band of the r-th frame is obtained in turn in the same way. Noise reduction, de-reverberation, and spectral integration are then performed to obtain the enhanced short-time spectrum Ỹ_i,r(e^(jω_k)) of the r-th frame of the i-th speech signal. The speech signals of the other frames are processed in the same way in turn to obtain the short-time spectrum of each frame, and the enhanced speech signal x̃_i is then obtained by IFFT and frame synthesis in the time domain; x̃_i is placed into the enhanced speech training set X̃. The flow diagram of the speech data enhancement procedure is shown in FIG. 3.
Each step of the speech enhancement is detailed as follows.

(1) Pre-emphasis of the speech signal
The i-th speech signal matrix x_i(n) in the training set X is pre-emphasized to obtain y_i(n), where y_i(n) = x_i(n) - αx_i(n-1); α is a constant, taken as α = 0.98 in this patent, and x_i(n-1) is the sample matrix of the i-th training-set speech at time n-1.

(2) Windowing and framing
A Hamming window w(n) is used to window and frame the pre-emphasized speech signal y_i(n), dividing the continuous speech signal into frame-by-frame discrete signals y_i,r(n), where

w(n) = 0.54 - 0.46·cos(2πn/(N-1)), 0 ≤ n ≤ N-1

is the Hamming window function and N is the window length; in this patent the frame length is 50 ms and the frame shift is 10 ms. Windowing and framing the pre-emphasized speech signal y_i(n) yields the matrix information y_i,r(n) of each frame of the speech signal; y_i,r(n) denotes the speech information matrix of the r-th frame of the i-th speech signal after pre-emphasis, windowing, and framing.
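As a hedged illustration of sub-steps (1) and (2) (the patent itself works with matlab; this is not its code), the pre-emphasis and Hamming-window framing could be written in Python/numpy as follows; the function names and the zero-padding of a trailing partial frame are assumptions made here.

```python
import numpy as np

def pre_emphasize(x, alpha=0.98):
    """y(n) = x(n) - alpha * x(n-1), with alpha = 0.98 as in the claim."""
    y = x.astype(np.float64)
    y[1:] -= alpha * x[:-1]
    return y

def frame_and_window(y, fs, frame_ms=50, shift_ms=10):
    """Cut y into 50 ms frames with a 10 ms shift and apply a Hamming window."""
    frame_len = int(fs * frame_ms / 1000)      # 800 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)          # 160 samples at 16 kHz
    n_frames = max(1, 1 + (len(y) - frame_len) // shift)
    frames = np.zeros((n_frames, frame_len))
    for r in range(n_frames):
        chunk = y[r * shift : r * shift + frame_len]
        frames[r, :len(chunk)] = chunk         # zero-pad a trailing partial frame
    return frames * np.hamming(frame_len)      # w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
```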
(3) FFT
The speech information matrix y_i,r(n) of the r-th frame of the i-th speech signal is transformed by FFT from the time domain to the frequency domain, giving the short-time spectrum Y_i,r(e^(jω_k)) of the r-th frame of the i-th speech signal.

(4) Computing the power P_i,r,l(r,l) of the speech signal
The short-time spectrum Y_i,r(e^(jω_k)) of each frame is processed with the gammatone weighting function to obtain the power of each frequency band of each frame:

P_i,r,l(r,l) = Σ_{k=0}^{N-1} |H_l(e^(jω_k))|² |Y_i,r(e^(jω_k))|²

where P_i,r,l(r,l) denotes the power in the l-th frequency band of the r-th frame of the speech signal y_i(n), k is a dummy variable indexing the discrete frequencies, and ω_k = 2πk/N is the discrete frequency. Since a 50 ms frame length is used for the FFT and the sampling rate of the speech signal is 16 kHz, N = 1024. H_l denotes the spectrum of the l-th band of the gammatone filter bank evaluated at frequency index k; it is computed with a built-in speech-processing function of the matlab software whose input parameter is the band index l. Y_i,r(e^(jω_k)) denotes the short-time spectrum of the r-th frame of the speech signal, and L = 40 is the total number of channels.
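For illustration, a numpy sketch of the band-power computation of sub-step (4) follows. The patent relies on a matlab built-in for the gammatone filter bank; the ERB-spaced approximation of |H_l|² below, the default band edges, and the function names are assumptions introduced here, not the patent's filter bank.

```python
import numpy as np

def gammatone_weights(fs, n_fft=1024, n_bands=40, f_lo=100.0, order=4):
    """Approximate squared magnitude responses |H_l|^2 of an ERB-spaced,
    4th-order gammatone filter bank on the rfft frequency grid (assumption:
    a stand-in for the matlab built-in used in the patent)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)        # Glasberg-Moore ERB scale
    inv_erb = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    centres = inv_erb(np.linspace(erb_rate(f_lo), erb_rate(fs / 2.0 - 1.0), n_bands))
    H2 = np.zeros((n_bands, freqs.size))
    for l, fc in enumerate(centres):
        b = 1.019 * 24.7 * (4.37e-3 * fc + 1.0)                    # gammatone bandwidth
        H2[l] = (1.0 + ((freqs - fc) / b) ** 2) ** (-order)        # |H_l(f)|^2
    return H2

def gammatone_band_power(spec, fs, n_bands=40):
    """P[r, l] = sum_k |H_l|^2 |Y_r(k)|^2 over the rfft bins (sub-step 4)."""
    n_fft = 2 * (spec.shape[1] - 1)
    H2 = gammatone_weights(fs, n_fft=n_fft, n_bands=n_bands)
    return (np.abs(spec) ** 2) @ H2.T                              # shape (n_frames, 40)
```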
(5) Noise reduction and de-reverberation of the speech signal
After the speech signal power P_i,r,l(r,l) has been obtained, noise reduction and de-reverberation are performed; the specific steps are:

(1) Compute the low-pass power M_i,r,l[r,l] of the l-th frequency band of the r-th frame according to

M_i,r,l[r,l] = λM_i,r,l[r-1,l] + (1-λ)P_i,r,l[r,l]

where M_i,r,l[r-1,l] denotes the low-pass power of the l-th frequency band of the (r-1)-th frame, and λ is a forgetting factor that varies with the bandwidth of the low-pass filter; in this patent λ = 0.4.

(2) Remove the slowly varying components and the falling edge of the power envelope: the speech signal power P_i,r,l[r,l] is processed to obtain the enhanced power P̃_i,r,l[r,l] of the l-th frequency band of the r-th frame,

P̃_i,r,l[r,l] = max(P_i,r,l[r,l] - M_i,r,l[r,l], c_0·P_i,r,l[r,l])

where c_0 is a constant factor, taken as c_0 = 0.01 in this patent.

(3) Each frequency band of each frame of the signal is enhanced in turn according to steps (1) and (2).
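An illustrative numpy rendering of sub-step (5) is given below. The low-pass recursion follows the claim directly; the max(P − M, c0·P) suppression rule and the initialization of the recursion at the first frame are assumptions (the claim gives the suppression formula only as an image).

```python
import numpy as np

def ssf_suppress(P, lam=0.4, c0=0.01):
    """Per-band removal of slowly varying components / falling power edges.

    M[r,l]     = lam*M[r-1,l] + (1-lam)*P[r,l]      (low-pass power, as in the claim)
    P_hat[r,l] = max(P[r,l] - M[r,l], c0*P[r,l])    (assumed SSF-style suppression rule)
    """
    M = np.zeros_like(P)
    P_hat = np.zeros_like(P)
    for r in range(P.shape[0]):
        prev = M[r - 1] if r > 0 else P[0]          # assumption: seed the recursion with the first frame
        M[r] = lam * prev + (1.0 - lam) * P[r]
        P_hat[r] = np.maximum(P[r] - M[r], c0 * P[r])
    return P_hat
```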
Figure FDA0002654055750000043
进行语音信号谱整合,可得到增强之后语音信号各帧的短时信号频谱,谱整合的公式如下:
Obtain the enhanced power in each frequency band of each frame of the speech signal
Figure FDA0002654055750000043
By integrating the spectrum of the speech signal, the short-term signal spectrum of each frame of the enhanced speech signal can be obtained. The formula for spectrum integration is as follows:
Figure FDA0002654055750000044
Figure FDA0002654055750000044
上式中μi,r[r,k]表示第r帧第k个索引处的谱权重系数;
Figure FDA0002654055750000045
为未增强的第i个语音信号第r帧的短时信号频谱,
Figure FDA0002654055750000046
为增强后的第i个语音信号第r帧的短时信号频谱;
In the above formula μ i,r [r,k] represents the spectral weight coefficient at the kth index of the rth frame;
Figure FDA0002654055750000045
is the short-term signal spectrum of the rth frame of the unenhanced ith speech signal,
Figure FDA0002654055750000046
is the short-term signal spectrum of the rth frame of the ith speech signal after enhancement;
其中μi,r[r,k]的求解公式如下:where μ i,r [r,k] is solved by the following formula:
Figure FDA0002654055750000047
Figure FDA0002654055750000047
μi,r[r,k]=μi,r[r,N-k],N/2≤k≤N-1μ i,r [r,k]=μ i,r [r,Nk],N/2≤k≤N-1 公式中的Hl表示是在频率索引k处计算得到的第l个频带的伽马通滤波器组的频谱;ωi,r,l[r,l]为第i个语音信号第r帧第l个频带的权重系数,权重系数是增强之后的频域与信号的原始频域的比值,求解公式如下:H l in the formula represents the spectrum of the gamma pass filter bank of the lth frequency band calculated at the frequency index k; ω i,r,l [r,l] is the rth frame of the ith speech signal The weight coefficient of l frequency bands, the weight coefficient is the ratio of the enhanced frequency domain to the original frequency domain of the signal. The solution formula is as follows:
Figure FDA0002654055750000051
Figure FDA0002654055750000051
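A numpy illustration of the spectral integration of sub-step (6) follows; it reuses the gammatone_weights helper sketched under sub-step (4). The exact form of the per-bin weight μ (averaging the band gains over the gammatone magnitude responses) is an assumption consistent with the claim's description, not a verbatim copy of it.

```python
import numpy as np

def spectral_integrate(spec, P, P_hat, fs, n_bands=40):
    """Scale each rfft bin of the original spectrum by a weight mu obtained by
    smoothing the per-band gains omega = P_hat / P across the gammatone responses."""
    n_fft = 2 * (spec.shape[1] - 1)
    H = np.sqrt(gammatone_weights(fs, n_fft=n_fft, n_bands=n_bands))  # |H_l| on the rfft grid
    omega = P_hat / np.maximum(P, 1e-12)                              # omega[r, l] = P_hat / P
    mu = (omega @ H) / np.maximum(H.sum(axis=0), 1e-12)               # mu[r, k]
    return mu * spec                                                  # enhanced short-time spectrum
```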
With the above formulas, the enhanced short-time spectrum of the r-th frame of the i-th speech signal after spectral integration is obtained; each frame is processed in turn in the same way to obtain the enhanced short-time spectrum of every frame of the i-th speech signal. Each enhanced frame spectrum Ỹ_i,r(e^(jω_k)) is transformed by IFFT to obtain the time-domain speech signal of each frame, and the frames are spliced in the time domain to obtain the enhanced speech signal x̃_i(n). The IFFT and the time-domain frame splicing of the speech signal are carried out as

x̃_i,r(n) = (1/N)·Σ_{k=0}^{N-1} Ỹ_i,r(e^(jω_k))·e^(jω_k·n)

x̃_i(n) = Σ_{r=1}^{g} x̃_i,r(n),  g being the total number of frames,

where each frame contributes at its own time positions determined by the 10 ms frame shift.
In the above formulas, x̃_i(n) is the enhanced speech signal matrix and x̃_i,r(n) denotes the enhanced speech signal matrix of the r-th frame; g is the total number of frames of the speech signal, and its value varies with the duration of the speech signal. Having obtained the sample matrix x̃_i(n) of the enhanced speech signal at time n, the speech-processing audiowrite function built into the matlab software is used to write x̃_i(n) at the speech signal sampling rate f_s = 16 kHz, giving the enhanced speech signal x̃_i.
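A Python sketch of the IFFT and time-domain frame splicing follows. The claim does not spell the splicing formula out beyond a sum over the g frames, so the overlap-add with analysis-window compensation below is one reasonable reading, not the patent's exact operation.

```python
import numpy as np

def overlap_add(frames_td, out_len, fs, frame_ms=50, shift_ms=10):
    """Splice the IFFT'd frames back into one waveform (illustrative overlap-add)."""
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    window = np.hamming(frame_len)
    x_hat = np.zeros(out_len)
    norm = np.zeros(out_len)
    for r, frame in enumerate(frames_td):
        start = r * shift
        stop = min(start + frame_len, out_len)
        n = stop - start
        if n <= 0:
            break
        x_hat[start:stop] += frame[:n]
        norm[start:stop] += window[:n]          # compensate the Hamming analysis window
    return x_hat / np.maximum(norm, 1e-8)
```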
At this point the enhancement of one utterance of the speech training set is complete; the training set X and the test set T are then processed in turn according to the above steps, the enhanced training-set utterances are saved in the set X̃, and the enhanced test set is saved in the set T̃.
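Putting the pieces together, an end-to-end sketch of the per-utterance enhancement of Step 2 could look as follows; it reuses the illustrative helpers defined above (pre_emphasize, frame_and_window, gammatone_band_power, ssf_suppress, spectral_integrate, overlap_add), the scipy wavfile functions stand in for matlab's audioread/audiowrite, and the 16-bit PCM scaling is an assumption.

```python
import numpy as np
from scipy.io import wavfile

def enhance_utterance(in_path, out_path, n_fft=1024):
    """Read -> pre-emphasize -> frame/window -> FFT -> band power -> suppression
    -> spectral integration -> IFFT -> frame splicing -> write (Step 2 sketch)."""
    fs, x = wavfile.read(in_path)                        # counterpart of matlab's audioread
    x = x.astype(np.float64) / 32768.0                   # assume 16-bit PCM input
    y = pre_emphasize(x)                                 # sub-step (1)
    frames = frame_and_window(y, fs)                     # sub-step (2)
    spec = np.fft.rfft(frames, n=n_fft, axis=1)          # sub-step (3), N = 1024
    P = gammatone_band_power(spec, fs)                   # sub-step (4)
    P_hat = ssf_suppress(P)                              # sub-step (5)
    spec_hat = spectral_integrate(spec, P, P_hat, fs)    # sub-step (6)
    frames_td = np.fft.irfft(spec_hat, n=n_fft, axis=1)  # IFFT of each frame
    x_hat = overlap_add(frames_td, len(x), fs)           # time-domain frame splicing
    wavfile.write(out_path, fs, (np.clip(x_hat, -1, 1) * 32767).astype(np.int16))  # like audiowrite
```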
Step 3: build the speech recognition acoustic model. The acoustic model built in this patent is modeled with CNN+CTC. The input layer is the 200-dimensional feature value sequence of a speech signal x̃_i of the training set X̃ enhanced in Step 2, extracted with the MFCC feature extraction algorithm. The hidden layers consist of convolutional and pooling layers connected alternately and repeatedly, with Dropout layers introduced to prevent overfitting; the convolution kernel size is 3 and the pooling window size is 2. The output layer is a fully connected layer of 1423 neurons activated with the softmax function, and the CTC loss function is used as the loss function to realize connectionist temporal multi-output; the 1423-dimensional output corresponds exactly to the 1423 commonly used Chinese pinyin syllables of the Chinese dictionary file dict.txt built in Step 4. The network structure of the speech recognition acoustic model is shown in FIG. 4, in which the specific parameters of the convolutional, pooling, Dropout, and fully connected layers of the acoustic model are marked.
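The claim does not name a deep-learning framework, so the Keras sketch below is only one possible realization of the Step 3 architecture. The number of convolution blocks, the filter counts, the fixed maximum frame count, the intermediate dense width, and the dropout rate are assumptions; the claim fixes only the kernel size 3, pooling size 2, the use of Dropout, and the 1423-way softmax output, and the patent's actual layer parameters are those marked in FIG. 4.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_acoustic_model(n_classes=1423, feat_dim=200, max_frames=1600):
    """Illustrative CNN acoustic model producing per-frame pinyin posteriors for CTC."""
    inp = layers.Input(shape=(max_frames, feat_dim, 1), name="mfcc_features")  # (time, 200, 1)
    x = inp
    for filters in (32, 64, 128):                                   # assumed filter counts
        x = layers.Conv2D(filters, kernel_size=3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(pool_size=(1, 2))(x)                # pool features, keep time steps
        x = layers.Dropout(0.2)(x)                                  # assumed dropout rate
    x = layers.Reshape((max_frames, (feat_dim // 8) * 128))(x)      # flatten the feature axis
    x = layers.Dense(256, activation="relu")(x)                     # assumed intermediate width
    out = layers.Dense(n_classes, activation="softmax", name="pinyin_posteriors")(x)
    return tf.keras.Model(inp, out)
```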
Step 4: build the speech recognition language model. Building the language model includes establishing the language text dataset, designing the 2-gram language model, and collecting the Chinese dictionary.

(1) Establishment of the language text database
First, the text dataset needed to train the language model is established. The language text dataset takes the form of electronic .txt files whose content consists of newspapers, middle-school texts, and famous novels; these electronic .txt files are collected to build the language text database. Note that the text data selected for the language text database must be representative and able to reflect everyday Chinese usage.

(2) Construction of the 2-gram language model
This patent builds the language model with the 2-gram algorithm, a language model training method that partitions by the words themselves; the 2 in 2-gram indicates that the probability of the current word is considered to depend only on the words immediately preceding it, 2 being the constraint on the memory length of the word sequence. The 2-gram algorithm can be expressed as

S(W) = S(w_1, w_2, ..., w_q) ≈ S(w_1)·Π_{d=2}^{q} S(w_d | w_{d-1})
In the above formula, W denotes a text sequence, w_1, w_2, ..., w_q denote the individual words of the text sequence, and q denotes the length of the text sequence; S(W) denotes the probability that this text sequence conforms to linguistic usage; d denotes the d-th word.

(3) Establishment of the Chinese dictionary
The dictionary of the language model of the speech recognition system is built. For a given language the dictionary is stable and unchanging; for the Chinese character dictionary of the present invention, the dictionary takes the form of a dict.txt file that lists the Chinese characters corresponding to the 1423 Chinese pinyin syllables commonly used in daily life, taking into account that one pinyin syllable may map to several characters. A partial view of the dictionary built by the present invention is shown in FIG. 5.

Step 5: train the constructed 2-gram language model with the established language text dataset to obtain the word occurrence table and the state transition table of the language model; the block diagram of the language model training is shown in FIG. 6. The language model is trained as follows:

(1) loop over the text content of the language text dataset and count the number of occurrences of each single word, summarizing the counts into a single-word occurrence table;

(2) loop over the text content and count the number of times each pair of words occurs together, summarizing the counts into a two-word state transition table.
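As an illustration of the Step 5 counting procedure and of how the 2-gram score of Step 4 can be evaluated from the two tables, a small Python sketch follows; the assumption that the text files are already word-segmented (whitespace-delimited) and the absence of smoothing are simplifications introduced here.

```python
from collections import Counter

def train_bigram_lm(corpus_files):
    """Build the single-word occurrence table and the two-word state transition table."""
    unigrams, bigrams = Counter(), Counter()
    for path in corpus_files:
        with open(path, encoding="utf-8") as f:
            for line in f:
                words = line.split()                    # assumes pre-segmented text
                unigrams.update(words)
                bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sentence_score(words, unigrams, bigrams):
    """S(W): product of conditional probabilities estimated as count ratios (no smoothing)."""
    score = 1.0
    for prev, cur in zip(words, words[1:]):
        score *= bigrams[(prev, cur)] / unigrams[prev] if unigrams[prev] else 0.0
    return score
```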
Step 6: use the trained language model, the established dictionary, and the enhanced speech training set X̃ to train the constructed acoustic model, obtaining the weight file of the acoustic model and the other parameter configuration files. The specific acoustic model training procedure is as follows:
(1) Initialize all weights of the acoustic network model.

(2) Import the utterances of the speech training set X̃ in turn for training. Any speech signal x̃_i is first processed by the MFCC feature extraction algorithm to obtain the 200-dimensional feature value sequence of the speech signal; then, as listed in FIG. 7, the 200-dimensional feature value sequence is passed in turn through the convolutional, pooling, Dropout, and fully connected layers, and the output layer, a fully connected layer of 1423 neurons activated with the softmax function, produces the 1423-dimensional acoustic features of the speech signal.

(3) After the feature values are obtained, the 1423-dimensional acoustic feature values are decoded under the action of the language model and the dictionary, and the recognized Chinese pinyin sequence of the speech signal x̃_i is output.

(4) The Chinese pinyin sequence recognized by the acoustic model is compared with the Chinese pinyin label sequence of the i-th utterance x̃_i of the training set X̃ to compute the error, which is back-propagated to update the weights throughout the acoustic model. The CTC loss function is used as the loss function and optimized with the Adam algorithm; the training batch size is set to batchsize = 16, the number of iterations to epoch = 50, and the weight file is saved once every 500 training utterances. The CTC loss function is

L_CTC = -Σ_{(e,z)∈X̃} ln F(z|e)

where L_CTC denotes the total loss produced by training on the training set, e denotes the input speech, i.e. a speech signal x̃_i of the enhanced training set X̃, z is the output Chinese character sequence, and F(z|e) denotes the probability that the output sequence is z given the input e.

(5) The acoustic model of speech recognition is trained by repeating the above steps until the loss of the acoustic model converges, at which point the acoustic model training is complete; the weight file and the configuration files of the acoustic model are saved. The training diagram of the speech recognition acoustic model is shown in FIG. 7.
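For illustration, a hedged Keras training sketch matching the hyper-parameters named in step (4) (Adam, batch size 16, 50 epochs, CTC loss) is shown below; the data pipeline, the padded batch shapes, and the checkpoint cadence are assumptions, and build_acoustic_model is the sketch given under Step 3.

```python
import tensorflow as tf

model = build_acoustic_model()                       # illustrative CNN from the Step 3 sketch
optimizer = tf.keras.optimizers.Adam()

def ctc_loss(labels, y_pred, input_len, label_len):
    """CTC negative log-likelihood, averaged over the batch."""
    return tf.reduce_mean(
        tf.keras.backend.ctc_batch_cost(labels, y_pred, input_len, label_len))

@tf.function
def train_step(feats, labels, input_len, label_len):
    with tf.GradientTape() as tape:
        y_pred = model(feats, training=True)         # per-frame pinyin posteriors
        loss = ctc_loss(labels, y_pred, input_len, label_len)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# train_dataset is assumed to yield padded batches of 16:
#   features [16, T, 200, 1], pinyin labels [16, U],
#   model-output lengths [16, 1], label lengths [16, 1]
# for epoch in range(50):
#     for batch in train_dataset:
#         loss = train_step(*batch)   # save the weight file periodically, e.g. every 500 utterances
```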
Step 7: use the trained speech-enhancement-based Chinese speech recognition system to recognize the speech of the test set T̃, count the speech recognition accuracy, and perform a comparative performance analysis against the traditional algorithm. The flow diagram of the speech recognition test system is shown in FIG. 8; partial results comparing the speech recognition accuracy of this patent with the traditional algorithm in noisy environments are shown in FIG. 9, and in reverberant environments in FIG. 10.
The specific implementation is as follows:

(1) With the traditional speech recognition system, a speech recognition test is carried out on the 2000 unenhanced utterances of the test set T of the established complex-environment speech database, and its speech recognition accuracy is counted; representative speech recognition results are listed in the description of the drawings, see FIG. 9 and FIG. 10.

(2) With the speech-enhancement-based speech recognition system of the present invention, a speech recognition test is carried out on the 2000 enhanced utterances of the test set T̃ of the established speech database, and the speech recognition accuracy of the method of the present invention is counted; representative speech recognition results are listed in the description of the drawings, see FIG. 9 and FIG. 10.
(3) Finally, the performance of the speech-enhancement-based speech recognition system proposed by the present invention is analyzed. After the statistics are completed, it is found that the speech-enhancement-based speech recognition algorithm proposed by the present invention greatly improves the recognition accuracy for speech in Gaussian white noise environments, in environments with background noise or interfering sound sources, and in reverberant environments, with a performance improvement of roughly 30%; compared with the traditional speech recognition algorithm, the recognition accuracy of the algorithm of the present invention is also greatly improved; in particular, for speech recognition in Gaussian white noise environments, environments with background noise or interfering sound sources, and reverberant environments, the traditional algorithm performs poorly while the algorithm of the present invention performs excellently. A comparison of the recognition results of the speech recognition algorithm of the present invention and the traditional algorithm in selected noisy environments is shown in FIG. 9, and a comparison in selected reverberant environments is shown in FIG. 10.

It can thus be seen that the deep neural network speech recognition method based on speech enhancement in complex environments of the present invention effectively solves the problems that existing speech recognition algorithms are sensitive to noisy environments, place high demands on speech quality, and are applicable to only a single scenario, and it realizes speech recognition in complex speech environments.

The symbol i appearing in the above steps denotes the i-th speech signal of the training and test sets subjected to speech enhancement, i = 1, 2, ..., 12000; the symbol r denotes the r-th frame of a speech signal, r = 1, 2, 3, ..., g, where g is the total number of frames after the speech signal is divided into frames and varies with the duration of the processed speech; the symbol l denotes the l-th frequency band of the speech signal, l = 0, 1, 2, ..., 39; k is a dummy variable indexing the discrete frequencies, k = 0, 1, 2, ..., 1023.
CN202010880777.7A 2020-08-28 2020-08-28 Deep neural network voice recognition method based on voice enhancement in complex environment Active CN111986661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010880777.7A CN111986661B (en) 2020-08-28 2020-08-28 Deep neural network voice recognition method based on voice enhancement in complex environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010880777.7A CN111986661B (en) 2020-08-28 2020-08-28 Deep neural network voice recognition method based on voice enhancement in complex environment

Publications (2)

Publication Number Publication Date
CN111986661A true CN111986661A (en) 2020-11-24
CN111986661B CN111986661B (en) 2024-02-09

Family

ID=73440031

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010880777.7A Active CN111986661B (en) 2020-08-28 2020-08-28 Deep neural network voice recognition method based on voice enhancement in complex environment

Country Status (1)

Country Link
CN (1) CN111986661B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160240190A1 (en) * 2015-02-12 2016-08-18 Electronics And Telecommunications Research Institute Apparatus and method for large vocabulary continuous speech recognition
KR20190032868A (en) * 2017-09-20 2019-03-28 현대자동차주식회사 Method and apparatus for voice recognition
CN109272990A (en) * 2018-09-25 2019-01-25 江南大学 Speech recognition method based on convolutional neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PAN Yuecheng; LIU Zhuo; PAN Wenhao; CAI Dianlun; WEI Zhengsong: "An end-to-end Mandarin speech recognition method based on CNN/CTC", Modern Information Technology, no. 05

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633175A (en) * 2020-12-24 2021-04-09 哈尔滨理工大学 Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
CN112786051B (en) * 2020-12-28 2023-08-01 问问智能信息科技有限公司 Voice data recognition method and device
CN112786051A (en) * 2020-12-28 2021-05-11 出门问问(苏州)信息科技有限公司 Voice data identification method and device
CN113257262A (en) * 2021-05-11 2021-08-13 广东电网有限责任公司清远供电局 Voice signal processing method, device, equipment and storage medium
CN113808581B (en) * 2021-08-17 2024-03-12 山东大学 Chinese voice recognition method based on acoustic and language model training and joint optimization
CN113808581A (en) * 2021-08-17 2021-12-17 山东大学 Chinese speech recognition method for acoustic and language model training and joint optimization
CN114444609A (en) * 2022-02-08 2022-05-06 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and computer readable storage medium
CN114444609B (en) * 2022-02-08 2024-10-01 腾讯科技(深圳)有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN114743544A (en) * 2022-04-19 2022-07-12 南京大学 Pinyin-based two-stage decoupling Chinese speech recognition model
CN114743544B (en) * 2022-04-19 2025-01-03 南京大学 A two-stage decoupled Chinese speech recognition model based on pinyin
CN116312517A (en) * 2023-02-28 2023-06-23 四川航天电液控制有限公司 A voice recognition control system for centralized control commands in fully mechanized mining face
CN116580708A (en) * 2023-05-30 2023-08-11 中国人民解放军61623部队 Intelligent voice processing method and system
CN118136022A (en) * 2024-04-09 2024-06-04 海识(烟台)信息科技有限公司 Intelligent voice recognition system and method
CN119601011A (en) * 2025-02-10 2025-03-11 北京海百川科技有限公司 Multilingual speech recognition and interaction method for humanoid robots
CN120032659A (en) * 2025-04-17 2025-05-23 清枫(北京)科技有限公司 Noise filtering method, device and medium based on deep learning to build noise model

Also Published As

Publication number Publication date
CN111986661B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN111986661A (en) Deep neural network speech recognition method based on speech enhancement in complex environment
CN108319666A (en) A kind of electric service appraisal procedure based on multi-modal the analysis of public opinion
CN108550375A (en) A kind of emotion identification method, device and computer equipment based on voice signal
US20210043197A1 (en) Intent recognition method based on deep learning network
CN106952649A (en) Speaker Recognition Method Based on Convolutional Neural Network and Spectrogram
CN109829058A (en) A kind of classifying identification method improving accent recognition accuracy rate based on multi-task learning
CN109949799B (en) A semantic parsing method and system
CN105702251B (en) Speech emotion recognition method based on Top-k enhanced audio bag-of-words model
CN114863937B (en) Hybrid bird song recognition method based on deep transfer learning and XGBoost
CN105895082A (en) Acoustic model training method and device as well as speech recognition method and device
CN113611285B (en) Language recognition method based on cascading bidirectional temporal pooling
CN118038851B (en) A multi-dialect speech recognition method, system, device and medium
CN115602165B (en) Digital employee intelligent system based on financial system
Chourasia et al. Emotion recognition from speech signal using deep learning
CN109325238A (en) A method for multi-entity sentiment analysis in long texts
CN113837299A (en) Network training method and device based on artificial intelligence and electronic equipment
Almekhlafi et al. A classification benchmark for Arabic alphabet phonemes with diacritics in deep neural networks
CN114420108A (en) A speech recognition model training method, device, computer equipment and medium
CN111401069A (en) Intention recognition method and intention recognition device for conversation text and terminal
Jie Speech emotion recognition based on convolutional neural network
CN113808620B (en) Tibetan language emotion recognition method based on CNN and LSTM
CN115359778A (en) Confrontation and meta-learning method based on speaker emotion voice synthesis model
KR20120117297A (en) Method for summerizing meeting minutes based on sentence network
Li et al. Intelligibility enhancement via normal-to-lombard speech conversion with long short-term memory network and bayesian Gaussian mixture model
CN113593537B (en) Speech emotion recognition method and device based on complementary feature learning framework

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant