Disclosure of Invention
In view of the above, it is desirable to provide a speech recognition method, apparatus, computer device, storage medium, and computer program product capable of improving the accuracy of speech recognition.
In a first aspect, the present application provides a speech recognition method. The method comprises the following steps:
acquiring audio sets with the same audio sampling rate;
extracting time sequence features and frequency features of each audio of the audio set to obtain feature matrix data which corresponds to the audio and comprises time sequence and frequency feature information;
in the process of using the audio set to iteratively train a speech recognition model, for each audio, coding the feature matrix data corresponding to the audio through the speech recognition model to obtain a phoneme probability distribution of each time frame in the audio, so as to generate a phoneme probability distribution matrix corresponding to the audio;
determining a phoneme alignment loss value of the speech recognition model in forward propagation based on the phoneme probability distribution matrix and the text sentence labeled for the audio;
and adjusting model parameters of the speech recognition model based on the phoneme alignment loss value to continue iteration until an iteration stop condition is met, so as to obtain a trained speech recognition model.
In one embodiment, before the performing the timing feature and frequency feature extraction process on each audio of the audio set, the method further includes:
randomly extracting partial audio from the audio set;
performing at least one of the following data enhancement processes on the randomly extracted audio:
simulating a first difference between the sound magnitudes of different speaker utterances, for the randomly extracted audio, enhancing or attenuating the volume of the audio based on the first difference;
simulating a second difference between speaking speech rates of different speakers, and for the randomly extracted audio, speeding up or slowing down the speech rate of the audio based on the second difference;
simulating a third difference of speech rate prosody change of different speakers in the speaking process, and, for the randomly extracted audio, warping audio waveform data over a preset time frame based on the third difference;
simulating a fourth difference between timbre frequency magnitudes of different speakers, warping audio frequencies over a preset frequency range based on the fourth difference for the randomly extracted audio.
In one embodiment, before the performing the timing feature and frequency feature extraction process on each audio of the audio set, the method further includes:
counting the pronunciation conditions of different speakers in a certain service scene;
extracting, from the counted pronunciation conditions, the occurrence probabilities with which different speakers in the service scene produce different levels of sound volume, speaking speech rate, speech rate prosody and timbre frequency;
and selecting audios in the audio set on which to perform volume perturbation, speed perturbation, time warping and frequency warping data enhancement, with matching those occurrence probabilities as the aim and comprehensive coverage as the principle.
In one embodiment, the performing timing characteristic and frequency characteristic extraction processing on each audio of the audio set to obtain characteristic matrix data corresponding to the audio and including timing and frequency characteristic information includes:
calculating characteristic values of different frequencies of the audio on each time frame based on a plurality of band-pass filters with triangular filtering characteristics to obtain Mel frequency spectrum matrix data comprising frequency characteristic information;
and performing dimensionality reduction on the Mel frequency spectrum matrix data to obtain feature matrix data.
In one embodiment, the speech recognition model to be trained comprises a multi-layer convolutional neural network; the performing dimension reduction processing on the mel-frequency spectrum matrix data to obtain feature matrix data comprises:
inputting the Mel frequency spectrum matrix data into each layer of the convolutional neural network, triggering each layer of the convolutional neural network to perform two-dimensional convolution calculation, and obtaining feature matrix data after dimension reduction based on a calculation result;
the adjusting model parameters of the speech recognition model based on the phoneme alignment loss value to continue iteration comprises:
adjusting parameters of each layer of the convolutional neural network of the speech recognition model based on the phoneme alignment loss value to continue iteration.
In one embodiment, the encoding, for each audio, the feature matrix data corresponding to the audio through the speech recognition model to obtain a phoneme probability distribution of each time frame in the audio, so as to generate a phoneme probability distribution matrix corresponding to the audio includes:
for each audio, inputting the feature matrix data corresponding to the audio into a multi-layer multi-head attention model with a convolution network of the speech recognition model, and sequentially performing feature extraction on the information of interest of each layer based on each layer of the multi-head attention model; the information of interest differs from layer to layer in the multi-head attention model;
splicing the features extracted by each layer of the multi-head attention model to obtain a phoneme probability distribution of each time frame in the audio;
and generating a phoneme probability distribution matrix comprising all the time frames according to the phoneme probability distribution of each time frame in the audio.
In one embodiment, the determining the phoneme alignment loss value forward propagated by the speech recognition model based on the phoneme probability distribution matrix and the text sentence labeled for the audio includes:
performing phoneme alignment on the text sentence labeled by the audio and the phoneme probability distribution matrix;
after performing the phoneme alignment process, determining a phoneme alignment loss value of the speech recognition model for forward propagation.
In a second aspect, the present application further provides a speech recognition apparatus. The device comprises:
an obtaining module, configured to obtain audio sets with the same audio sampling rate;
a feature calculation module, configured to extract time sequence features and frequency features of each audio of the audio set to obtain feature matrix data which corresponds to the audio and comprises time sequence and frequency feature information;
a loss value calculation module, configured to, in a process of iteratively training a speech recognition model using the audio set, encode, by using the speech recognition model, the feature matrix data corresponding to the audio for each audio to obtain a phoneme probability distribution of each time frame in the audio, so as to generate a phoneme probability distribution matrix corresponding to the audio; and determine a phoneme alignment loss value of the speech recognition model in forward propagation based on the phoneme probability distribution matrix and the text sentence labeled for the audio;
and the optimization module is used for adjusting the model parameters of the speech recognition model based on the phoneme alignment loss value so as to continue iteration until an iteration stop condition is met, and obtaining the trained speech recognition model.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, implements the steps of the above speech recognition method.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium has a computer program stored thereon which, when executed by a processor, implements the steps of the above speech recognition method.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the steps of the above speech recognition method.
According to the speech recognition method, apparatus, computer device, storage medium and computer program product, audio sets with the same audio sampling rate are acquired, and time sequence features and frequency features are extracted from each audio of the audio set to obtain feature matrix data which corresponds to the audio and comprises time sequence and frequency feature information. In the process of iteratively training the speech recognition model using the audio set, for each audio, the feature matrix data corresponding to the audio is encoded through the speech recognition model to obtain the phoneme probability distribution of each time frame in the audio, so as to generate a phoneme probability distribution matrix corresponding to the audio; the recognition of the audio by the speech recognition model can therefore be accurate to the phoneme level. A phoneme alignment loss value of the speech recognition model in forward propagation is determined based on the phoneme probability distribution matrix and the text sentence labeled for the audio, and model parameters of the speech recognition model are adjusted based on the loss value to continue iteration until an iteration stop condition is met, so as to obtain the trained speech recognition model. Therefore, when the trained speech recognition model is used, it recognizes audio with phoneme-level precision, which improves semantic fluency and recognition accuracy.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The speech recognition method provided by the application can be applied to the application environment shown in fig. 1. Wherein the terminal 110 communicates with the server 120 through a network. The terminal 110 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices, and the server 120 may be implemented by an independent server or a server cluster formed by a plurality of servers.
The terminal 110 may sample rate convert the audio of the audio set to obtain an audio set with the same sample rate and send the audio set to the server 120. The server 120 obtains audio sets with the same audio sampling rate; and extracting time sequence characteristics and frequency characteristics of each audio frequency of the audio frequency set to obtain characteristic matrix data which corresponds to the audio frequency and comprises time sequence and frequency characteristic information. In the process of iteratively training the speech recognition model by using the audio set, the server 120 encodes feature matrix data corresponding to the audio through the speech recognition model for each audio to obtain a phoneme probability distribution of each time frame in the audio, so as to generate a phoneme probability distribution matrix corresponding to the audio. The server 120 determines to obtain a phoneme alignment loss value of the speech recognition model in forward propagation based on the phoneme probability distribution matrix and the text sentence labeled for the audio, and adjusts model parameters of the speech recognition model based on the phoneme alignment loss value to continue iteration until an iteration stop condition is met, so as to finally obtain the trained speech recognition model. The server 120 recognizes the voice transmitted by the terminal 110 using the trained voice recognition model, and transmits a corresponding recognition result to the terminal 110.
In an embodiment, as shown in fig. 2, a speech recognition method is provided, and this embodiment is illustrated by applying the method to a server, and it is to be understood that the method may also be applied to a terminal, and may also be applied to a system including the terminal and the server, and is implemented by interaction between the terminal and the server. In this embodiment, the method includes the steps of:
S202, acquiring an audio set with the same audio sampling rate; and extracting time sequence features and frequency features of each audio of the audio set to obtain feature matrix data which corresponds to the audio and comprises time sequence and frequency feature information.
The audio sampling rate refers to the frequency at which the analog signal of the audio is converted into a digital signal when the audio is generated, such as 44,100 Hz or 48,000 Hz. The feature matrix data includes feature values of different frequencies of the audio over respective time frames as well as time sequence feature values.
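As a non-limiting illustration, the following sketch shows how audio in the set might be resampled to one common sampling rate before feature extraction; the use of torchaudio and the 16 kHz target rate are assumptions for the example and are not prescribed by the method.

```python
# Minimal sketch: unify the sampling rate of an audio set (assumed tooling: torchaudio).
import torch
import torchaudio
import torchaudio.transforms as T

TARGET_SR = 16000  # hypothetical common sampling rate for the audio set

def load_with_common_rate(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)              # waveform: (channels, samples)
    if sr != TARGET_SR:
        waveform = T.Resample(orig_freq=sr, new_freq=TARGET_SR)(waveform)
    return waveform
```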
In one embodiment, the server may perform at least one data enhancement method including volume perturbation, speed perturbation, time warping, frequency warping, random noise, etc. on the audio of the audio set before performing the timing feature and frequency feature extraction process on each audio of the audio set.
In one embodiment, the server randomly extracts audio for data enhancement for the audio of the audio set.
In one embodiment, the server may obtain mel-frequency spectrum matrix data including frequency feature information based on feature values of different frequencies of the audio in each time frame, and perform dimension reduction processing on the mel-frequency spectrum matrix data to obtain the feature matrix data.
In one embodiment, the server may perform dimensionality reduction processing on the mel-frequency spectrum matrix based on three convolutional neural network layers of the speech recognition model to obtain feature matrix data.
Specifically, the server acquires audio sets with the same audio sampling rate, and performs time sequence feature and frequency feature extraction processing on each audio of the audio set over each time frame to obtain feature matrix data which corresponds to the audio and comprises the time sequence and frequency feature information of each time frame.
S204, in the process of using the audio set to iteratively train the voice recognition model, for each audio, encoding feature matrix data corresponding to the audio through the voice recognition model to obtain phoneme probability distribution of each time frame in the audio, so as to generate a phoneme probability distribution matrix corresponding to the audio.
Wherein the encoding is to process the feature matrix data to obtain a phoneme probability distribution for each time frame of the audio. The encoding is performed by an encoder of a speech recognition model.
The phoneme is the smallest unit of speech divided according to the natural attributes of the speech. The phoneme probability distribution refers to the probability distribution of the phoneme values, and the phoneme probability distribution matrix refers to the phoneme probability distribution of the audio over all time frames.
Specifically, in the process of iteratively training the speech recognition model by using the audio set, the server may encode, for each audio, feature matrix data corresponding to the audio by using an encoder of the speech recognition model to obtain a phoneme probability distribution of each time frame in the audio, and generate a phoneme probability distribution matrix corresponding to the audio based on the phoneme probability distributions of all the time frames.
In one embodiment, the speech recognition model is a Transformer model (a model that incorporates a mechanism of attention). In other embodiments, the speech recognition model may also be other models capable of obtaining a phoneme probability distribution value for the audio over each time frame.
In one embodiment, the server may obtain a phoneme probability distribution matrix corresponding to the audio based on a multi-head attention model with seven-layer convolution of the Transformer model.
S206, determining a phoneme alignment loss value of the speech recognition model in forward propagation based on the phoneme probability distribution matrix and the text sentence labeled for the audio.
The text sentence is a text form of the content expressed by the audio, and the phoneme alignment loss value refers to a loss value calculated after aligning a real text phoneme label and all correct paths corresponding to the label in the phoneme probability distribution matrix.
In one embodiment, the phoneme alignment loss value is calculated by the server by performing a phoneme alignment process on the phoneme probability distribution matrix and the phonemes of the text sentence through a neural-network time series class classifier.
In one embodiment, the phoneme alignment loss value is calculated by the server through a CTC (Connectionist Temporal Classification, an algorithm for solving the classification problem of time series data that avoids manual alignment of input and output) algorithm.
Specifically, the server performs a phoneme alignment process on the phoneme probability distribution matrix and the text sentence labeled with the audio by using an algorithm model capable of performing phoneme alignment, and obtains a phoneme alignment loss value of the speech recognition model in forward propagation.
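As a non-limiting illustration, one way to compute such a phoneme alignment loss is with the CTC loss provided by PyTorch; the tensor shapes, the number of phoneme classes and the blank index below are assumptions for the example only.

```python
# Minimal sketch: phoneme alignment loss via CTC (assumed framework: PyTorch).
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

# log_probs: (time_frames, batch, phoneme_classes), i.e. the log of the
# phoneme probability distribution matrix produced by the encoder.
log_probs = torch.randn(100, 4, 60).log_softmax(dim=-1)
targets = torch.randint(1, 60, (4, 20), dtype=torch.long)   # phoneme labels of the text sentences
input_lengths = torch.full((4,), 100, dtype=torch.long)     # time frames per audio
target_lengths = torch.full((4,), 20, dtype=torch.long)     # phonemes per labeled sentence

loss = ctc(log_probs, targets, input_lengths, target_lengths)
```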
And S208, adjusting model parameters of the speech recognition model based on the phoneme alignment loss value to continue iteration until an iteration stop condition is met, so as to obtain the trained speech recognition model.
Specifically, the server differentiates the phoneme alignment loss value with respect to each parameter of the speech recognition model to obtain an update gradient of the speech recognition model, so that the speech recognition model is updated and parameter learning is performed. The server iteratively trains the speech recognition model with the audio set in a loop, and stops training when the phoneme alignment loss value is stable, so as to obtain the trained speech recognition model.
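As a non-limiting illustration, one such iteration of gradient computation and parameter update might look as follows; the linear layer stands in for the actual speech recognition model, and the optimizer choice and all shapes are assumptions for the example.

```python
# Minimal sketch: one gradient-based parameter update driven by the phoneme
# alignment loss (assumed framework: PyTorch; the linear layer is a placeholder).
import torch
import torch.nn as nn

model = nn.Linear(80, 60)                          # stands in for the speech recognition model
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

feats = torch.randn(100, 4, 80)                    # (time_frames, batch, feature_dim)
targets = torch.randint(1, 60, (4, 20), dtype=torch.long)
in_len = torch.full((4,), 100, dtype=torch.long)
tgt_len = torch.full((4,), 20, dtype=torch.long)

optimizer.zero_grad()
log_probs = model(feats).log_softmax(dim=-1)       # forward propagation
loss = ctc(log_probs, targets, in_len, tgt_len)    # phoneme alignment loss
loss.backward()                                    # differentiate w.r.t. each parameter
optimizer.step()                                   # adjust the model parameters
```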
In one embodiment, the server may re-perform the data enhancement processing on the audio set before each iterative training.
In one embodiment, the server may gradually decrease the learning rate of the speech recognition model during each iterative training process.
In one embodiment, the server may gradually decrease the learning rate of a multi-headed attention model in the speech recognition model during each iterative training process.
In one embodiment, the server may gradually decrease the learning rate of the three convolutional neural network layers in the speech recognition model during each iterative training process.
In one embodiment, the learning rate of the speech recognition model is gradually decreased as the number of training steps increases.
In one embodiment, the learning rate of the speech recognition model is updated based on the inverse of the current number of training steps.
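As a non-limiting illustration, a learning rate updated from the inverse of the current training step could be realized as follows; the base learning rate, the placeholder model and the scheduler choice are assumptions for the example.

```python
# Minimal sketch: learning rate decaying with the inverse of the training step
# (assumed framework: PyTorch; values are illustrative).
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(80, 60)                          # placeholder for the speech recognition model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# LambdaLR multiplies the base lr by the returned factor; using the inverse of the
# current step makes the effective learning rate 1e-3 / step.
scheduler = LambdaLR(optimizer, lr_lambda=lambda step: 1.0 / max(step, 1))

for step in range(1, 6):
    optimizer.step()       # normally preceded by loss.backward()
    scheduler.step()       # learning rate decreases as the number of training steps increases
```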
According to the speech recognition method, apparatus, computer device and storage medium, audio sets with the same audio sampling rate are acquired, and time sequence features and frequency features are extracted from each audio of the audio set to obtain feature matrix data which corresponds to the audio and comprises time sequence and frequency feature information. In the process of iteratively training the speech recognition model using the audio set, for each audio, the feature matrix data corresponding to the audio is encoded through the speech recognition model to obtain the phoneme probability distribution of each time frame in the audio, so as to generate a phoneme probability distribution matrix corresponding to the audio; the recognition of the audio by the speech recognition model can therefore be accurate to the phoneme level. A phoneme alignment loss value of the speech recognition model in forward propagation is determined based on the phoneme probability distribution matrix and the text sentence labeled for the audio, and model parameters of the speech recognition model are adjusted based on the loss value to continue iteration until an iteration stop condition is met, so as to obtain the trained speech recognition model. Therefore, when the trained speech recognition model is used, it recognizes audio with phoneme-level precision, which improves semantic fluency and recognition accuracy.
In one embodiment, before the time-series feature and frequency feature extraction process is performed on each audio of the set of audios, the method further comprises: randomly extracting partial audio from the audio set; performing at least one of the following data enhancement processes on the randomly extracted audio: simulating a first difference between the sound magnitudes of different speaker utterances, and, for the randomly extracted audio, enhancing or attenuating the volume of the audio based on the first difference; simulating a second difference between speaking speech rates of different speakers, and, for the randomly extracted audio, speeding up or slowing down the speech rate of the audio based on the second difference; simulating a third difference of speech rate prosody change of different speakers in the speaking process, and, for the randomly extracted audio, warping audio waveform data over a preset time frame based on the third difference; and simulating a fourth difference between timbre frequency magnitudes of different speakers, and, for the randomly extracted audio, warping audio frequencies over a preset frequency range based on the fourth difference.
Specifically, before performing the time sequence feature and frequency feature extraction processing on each audio of the audio set, the server randomly extracts a part of the audio set, and performs data enhancement processing including at least one of volume disturbance, speed disturbance, time warping, frequency warping, random noise and the like on the part of the audio.
In one embodiment, the process of simulating the first difference, the second difference, the third difference and the fourth difference includes grading the corresponding target objects into levels, obtaining the differences between target objects of different levels, and simulating the difference characteristics in the data enhancement process of the audio, wherein the target objects correspond to the sound volume, the speaking speech rate, the speech rate prosody and the timbre frequency, respectively. For example, simulating the first difference between the speaking volumes of different speakers includes grading the speaking volumes of different speakers into levels, obtaining the first difference between volumes of different levels, and simulating the first difference characteristic in the volume data enhancement process of the audio.
In one embodiment, the volume perturbation process includes: the server can grade the speaking sound sizes of different speakers, obtain a first difference between the sound sizes of different levels, and increase or decrease the volume of the audio frequency through the first difference. It will be appreciated that the ability of the speech recognition model to recognize different volume audio frequencies may be enhanced by increasing and decreasing the volume.
In one embodiment, the process of speed perturbation comprises: the server can grade the speaking speed of different speakers to obtain a second difference between the speaking speeds of different grades, and the speed of the audio is increased or reduced through the second difference. It will be appreciated that the speech recognition model's ability to recognize different speech rates may be enhanced by increasing and decreasing the speech rate.
In one embodiment, the process of time warping comprises: the server may rank the speech rate prosody size during the speaking process to obtain a third difference between the speech rate prosody of different levels, and perform nonlinear distortion on the audio waveform data over a time frame of one or more random segments based on the third difference. It is understood that the recognition ability of the speech recognition model for the variation of the speech rate and rhythm can be enhanced by the distortion of the audio waveform.
In one embodiment, the process of frequency warping comprises: the server can grade the tone frequency sizes of different speakers, obtain a fourth difference between the tone frequencies of different grades, and distort the audio frequency over the preset frequency range based on the fourth difference. It will be appreciated that the ability of the speech recognition model to recognize different timbres may be enhanced by warping the frequencies.
In one embodiment, the process of random noise comprises: the server may randomly take a piece of background noise from an existing background noise library to be superimposed on the audio data. It is understood that the noise-rejection capability of the speech recognition model under noise can be enhanced by superimposing noise on the audio.
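As a non-limiting illustration, the volume perturbation, speed perturbation and random noise enhancements described above could be sketched on raw waveform data as follows; the perturbation ranges are assumptions, and time warping and frequency warping (which operate on selected time frames and frequency ranges) are omitted for brevity.

```python
# Minimal sketch of three of the data enhancement operations (assumed tooling: NumPy).
import numpy as np

rng = np.random.default_rng()

def perturb_volume(wave: np.ndarray) -> np.ndarray:
    gain = rng.uniform(0.5, 1.5)                 # simulate louder or softer speakers
    return wave * gain

def perturb_speed(wave: np.ndarray) -> np.ndarray:
    rate = rng.uniform(0.9, 1.1)                 # simulate faster or slower speech
    idx = np.arange(0, len(wave), rate)          # resample the time axis
    return np.interp(idx, np.arange(len(wave)), wave)

def add_random_noise(wave: np.ndarray, noise: np.ndarray) -> np.ndarray:
    # assumes `noise` is at least as long as `wave`
    start = rng.integers(0, max(len(noise) - len(wave), 1))
    scale = rng.uniform(0.01, 0.1)               # noise level relative to the speech
    return wave + scale * noise[start:start + len(wave)]
```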
In this embodiment, at least one of volume disturbance, speed disturbance, time warping, frequency warping, random noise, and the like is randomly performed on the audio to obtain richer audio, so that richer audio feature matrix data can be obtained for training the speech recognition model, and the generalization performance and the recognition accuracy of the speech recognition model are improved. Also, in the present embodiment, since diversified data enhancement is performed on audio, so that a diversified data set with an increased scale can be obtained based on a small-scale data set, the model generalization performance and the recognition accuracy can be improved.
In one embodiment, before the time-series feature and frequency feature extraction process is performed on each audio of the set of audios, the method further comprises: calculating a first pronunciation range characteristic of the voice of a target speaker in a target service scene, the first pronunciation range characteristic comprising the target occurrence probabilities of different levels of sound volume, speaking speech rate, speech rate prosody and timbre frequency in the voice of the target speaker; and, for each audio of the set of audios, performing volume perturbation, speed perturbation, time warping and frequency warping data enhancement on the audio, so that the second pronunciation range characteristic of the audio and of each correspondingly enhanced audio is the same as the first pronunciation range characteristic.
Wherein the target speaker may be one person or a plurality of persons. The target speaker's voice includes multiple tones and is the smallest set of different levels of voice size, speaking speech rate, speech rate prosody, and timbre frequency in the target business scenario.
In one embodiment, the different levels of voice magnitude, speaking rate, speech rate prosody, and timbre frequency are based on a ranking of voice magnitude, speaking rate, speech rate prosody, and timbre frequency. For example, in terms of sound level, level 1 is assigned to a volume of 1 dB, level 2 to a volume of 10 dB, level 3 to a volume of 18 dB, and so on.
In this embodiment, the first pronunciation range characteristic of the voice of the target speaker in the target service scene is calculated, and volume perturbation, speed perturbation, time warping and frequency warping data enhancement is performed on each audio of the audio set so that the second pronunciation range characteristic of the audio and of each correspondingly enhanced audio is the same as the first pronunciation range characteristic. Richer and more comprehensive audio is thereby obtained, and more comprehensive audio feature matrix data can be obtained for training the speech recognition model, which improves the generalization performance and recognition accuracy of the speech recognition model. Moreover, when voice acquisition is difficult or costly, a rich and comprehensive audio set can be obtained by performing data enhancement on an audio set of small data size, which reduces the manual acquisition cost and alleviates the difficulty of voice acquisition.
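As a non-limiting illustration, matching the enhanced audio to the first pronunciation range characteristic can be sketched by sampling enhancement levels according to the target occurrence probabilities; the level table and probabilities below are hypothetical values, not statistics from any real service scene.

```python
# Minimal sketch: sample a volume level according to its target occurrence probability
# in the service scene, then map it to a gain used for enhancement (all numbers hypothetical).
import numpy as np

rng = np.random.default_rng()

target_level_probs = {1: 0.2, 2: 0.5, 3: 0.3}   # occurrence probability per volume level
level_to_gain = {1: 0.5, 2: 1.0, 3: 1.5}        # illustrative mapping from level to gain

levels = list(target_level_probs)
level = rng.choice(levels, p=list(target_level_probs.values()))
gain = level_to_gain[int(level)]                # apply this gain when enhancing an audio
```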
In one embodiment, the extracting of the time sequence feature and the frequency feature is performed on each audio of the audio set, and obtaining feature matrix data corresponding to the audio and including the time sequence and frequency feature information includes: calculating characteristic values of different frequencies of the audio on each time frame based on a plurality of band-pass filters with triangular filtering characteristics to obtain Mel frequency spectrum matrix data comprising frequency characteristic information; and performing dimensionality reduction on the Mel frequency spectrum matrix data to obtain feature matrix data.
A band-pass filter is a device that allows waves in a particular frequency band to pass through while shielding other frequency bands. The Mel frequency spectrum matrix data refers to matrix data obtained by processing the audio based on the Mel spectrum.
Specifically, the server sets a plurality of band-pass filters in the spectral range of the voice, each filter having a triangular filtering characteristic, with the center frequencies uniformly distributed over the frequency range perceived by the human ear. The server calculates feature values of different frequencies of the audio on each time frame based on each filter to obtain Mel frequency spectrum matrix data including frequency feature information, and then performs dimension reduction processing on the Mel frequency spectrum matrix data to obtain the feature matrix data.
In one embodiment, the number of filters may be 80.
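As a non-limiting illustration, the 80 triangular band-pass filters could be applied with an off-the-shelf Mel spectrogram transform as follows; the window and hop sizes and the 16 kHz sampling rate are assumptions for the example.

```python
# Minimal sketch: Mel frequency spectrum matrix data with 80 triangular filters
# (assumed tooling: torchaudio).
import torch
import torchaudio.transforms as T

mel_extractor = T.MelSpectrogram(
    sample_rate=16000,   # assumed common sampling rate of the audio set
    n_fft=400,           # 25 ms analysis window at 16 kHz
    hop_length=160,      # 10 ms time frames
    n_mels=80,           # number of triangular band-pass filters
)

waveform = torch.randn(1, 16000)        # one second of (placeholder) audio
mel = mel_extractor(waveform)           # (channels, 80, time_frames)
```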
In this embodiment, mel-frequency spectrum matrix data including frequency feature information is obtained by calculating feature values of different frequencies of an audio frequency in each time frame, and the mel-frequency spectrum matrix data is subjected to dimension reduction processing to obtain feature matrix data. Therefore, the server can obtain data including the audio frequency characteristic information and perform dimensionality reduction on the data, so that characteristic matrix data with comprehensive information and small data volume is obtained, the training load of a voice recognition model is reduced, and the accuracy is improved.
In one embodiment, the speech recognition model to be trained comprises a multi-layer convolutional neural network; the method for obtaining the feature matrix data by performing dimensionality reduction on the Mel frequency spectrum matrix data comprises the following steps: inputting Mel frequency spectrum matrix data into each layer of convolutional neural network, triggering each layer of convolutional neural network to perform two-dimensional convolutional calculation, and obtaining feature matrix data after dimension reduction based on the calculation result; adjusting model parameters of the speech recognition model based on the phoneme alignment loss value to continue the iteration comprises: parameters of each layer of the convolutional neural network of the speech recognition model are adjusted based on the phoneme alignment loss value to continue the iteration.
Specifically, the speech recognition model to be trained comprises a plurality of layers of convolutional neural networks, and each layer of convolutional neural network can be used for performing two-dimensional convolution calculation. The server inputs the Mel frequency spectrum matrix data into a convolution neural network of the voice recognition model, each layer of convolution neural network carries out two-dimensional convolution calculation on the input Mel frequency spectrum matrix data, and feature matrix data after dimension reduction are obtained based on calculation results.
Specifically, after obtaining the phoneme alignment loss value through step S206, the server may further adjust parameters of each layer of the convolutional neural network of the speech recognition model based on the phoneme alignment loss value.
In one embodiment, the server may set the convolution kernel of each convolutional neural network layer of the speech recognition model to (3, 3) with a stride of (2, 2).
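As a non-limiting illustration, a multi-layer two-dimensional convolutional front end with a (3, 3) kernel and a (2, 2) stride could be sketched as follows; the channel width, the use of three layers and the input layout are assumptions for the example.

```python
# Minimal sketch: dimension reduction of the Mel spectrum matrix data with stacked
# 2D convolutions, kernel (3, 3) and stride (2, 2) (assumed framework: PyTorch).
import torch
import torch.nn as nn

class ConvSubsampling(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(3, 3), stride=(2, 2)), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=(3, 3), stride=(2, 2)), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=(3, 3), stride=(2, 2)), nn.ReLU(),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, time_frames, n_mels) -> reduced feature matrix data
        return self.layers(mel)

features = ConvSubsampling()(torch.randn(2, 1, 100, 80))   # (2, 64, 11, 9)
```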
In this embodiment, the Mel frequency spectrum matrix data is input to each layer of the convolutional neural network of the speech recognition model to perform dimension reduction, and the parameters of each layer of the convolutional neural network are adjusted based on the phoneme alignment loss value to meet the requirement of iterative training, so that the accuracy of the speech recognition model is improved.
In one embodiment, for each audio, encoding the feature matrix data corresponding to the audio through the speech recognition model to obtain a phoneme probability distribution of each time frame in the audio, so as to generate a phoneme probability distribution matrix corresponding to the audio includes: for each audio, inputting the feature matrix data corresponding to the audio into a multi-layer multi-head attention model with a convolution network of the speech recognition model, and sequentially performing feature extraction on the information of interest of each layer based on each layer of the multi-head attention model, wherein the information of interest differs from layer to layer in the multi-head attention model; splicing the features extracted by each layer of the multi-head attention model to obtain the phoneme probability distribution of each time frame in the audio; and generating a phoneme probability distribution matrix including all the time frames from the phoneme probability distribution of each time frame in the audio.
The multi-head attention model refers to a model which has a plurality of different attention processes on input Mel frequency spectrum matrix data and correspondingly has a multilayer convolution network.
Specifically, the speech recognition model includes a multi-head attention model in which the information of interest differs from layer to layer. For each audio, the server inputs the feature matrix data corresponding to the audio into the multi-layer multi-head attention model with a convolution network of the speech recognition model, and sequentially performs, based on each layer of the multi-head attention model, calculations including layer normalization, a feed-forward network, convolution, multi-head attention and a feed-forward network, so that feature extraction is performed on the information of interest of that layer; the extracted features are then spliced to obtain the phoneme probability distribution of each time frame in the audio. The server generates a phoneme probability distribution matrix including all the time frames from the phoneme probability distribution of each time frame in the audio.
In one embodiment, the speech recognition model comprises a seven-layer multi-head attention model with a convolution network.
In one embodiment, the computation of the layer normalization performed by each layer of the multi-head attention model is used to normalize the data input to all neurons of the current layer.
In one embodiment, the feedforward network calculations performed by each layer of the multi-headed attention model are to achieve a linear connection between layers.
In one embodiment, the convolution calculations performed by each layer of the multi-head attention model are based on a one-dimensional convolution with the convolution kernel set to 32 and the stride set to 1.
In one embodiment, the multi-head attention calculation performed by each layer of the multi-head attention model selects a plurality of feature information for parallel calculation.
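As a non-limiting illustration, one layer combining layer normalization, a feed-forward network, a one-dimensional convolution with kernel size 32 and stride 1, and multi-head attention could be sketched as follows; the hidden size, head count and residual wiring are assumptions rather than the exact architecture of the model. Seven such layers stacked, followed by a projection to the phoneme classes and a softmax per time frame, would yield the phoneme probability distribution matrix.

```python
# Minimal sketch: one attention layer with convolution (assumed framework: PyTorch).
import torch
import torch.nn as nn

class AttentionConvLayer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                        # layer normalization
        self.ff1 = nn.Linear(dim, dim)                       # feed-forward network
        self.conv = nn.Conv1d(dim, dim, kernel_size=32, stride=1, padding=16)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff2 = nn.Linear(dim, dim)                       # feed-forward network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time_frames, dim)
        x = self.norm(x)
        x = x + self.ff1(x)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)     # 1D convolution over time
        x = x + c[:, : x.size(1)]                            # trim padding to keep the frame count
        a, _ = self.attn(x, x, x)                            # multi-head attention
        x = x + a
        return x + self.ff2(x)

out = AttentionConvLayer()(torch.randn(2, 100, 256))         # (2, 100, 256)
```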
In this embodiment, a multi-head attention model in which the information of interest differs from layer to layer is used to extract features of the information of interest, and the features extracted by each layer are spliced to obtain the phoneme probability distribution of each time frame in the audio, so as to generate a phoneme probability distribution matrix including all time frames; an accurate and comprehensive phoneme probability matrix is thereby generated, and the accuracy of the speech recognition model is improved. In this embodiment, sequential encoding is realized by processing each time frame, so that streaming speech recognition is realized.
In one embodiment, determining the phoneme alignment loss value resulting from forward propagation of the speech recognition model based on the phoneme probability distribution matrix and the text sentence for the audio annotation comprises: performing phoneme alignment on the text sentence labeled by the audio and the phoneme probability distribution matrix; after performing the phoneme alignment process, a phoneme alignment loss value of the speech recognition model forward propagation is determined.
Specifically, in the process of calculating the loss value between the text sentence of the audio annotation and the phoneme probability distribution matrix, the server executes phoneme alignment processing, thereby calculating the phoneme alignment loss value.
In one embodiment, the server may process the audio tagged text sentences and the phoneme probability distribution matrix based on a CTC algorithm, perform a phoneme alignment process, and thereby calculate a phoneme alignment loss value.
In the embodiment, phoneme alignment is carried out on the text sentence labeled by the audio and the phoneme probability distribution matrix; after performing the phoneme alignment process, a phoneme alignment loss value of the speech recognition model forward propagation is determined. Therefore, the calculation result of the loss value is more accurate, so that a more accurate reference standard object can be obtained when the parameters of the speech recognition model are adjusted based on the phoneme alignment loss value, and the accuracy of the speech recognition model is improved.
In one embodiment, as shown in FIG. 3, a schematic diagram of a speech recognition method is provided. Specifically, the server resamples the audio in the audio set to obtain audio with the same sampling rate. The server may also perform data enhancement on the audio in the audio set, the data enhancement mainly including at least one of volume perturbation, speed perturbation, time warping, frequency warping, random noise and the like, so as to improve the diversity of the audio data and the generalization capability of the model, and so that a diversified and comprehensive data set can be obtained from an audio set of small data size. The server extracts feature data from the audio in the audio set and performs down-sampling through the three convolutional neural network layers of the speech recognition model to obtain feature matrix data which corresponds to the audio and comprises time sequence and frequency feature information, so that the server can generate training data meeting preset requirements while reducing the system load of the speech recognition model. The server further inputs the feature matrix data into the seven-layer multi-head attention model with a convolution network of the speech recognition model (only one layer is drawn in the figure; the seven layers are indicated by "7 x"), so that each layer sequentially performs calculations including layer normalization, a feed-forward network, convolution, multi-head attention and a feed-forward network, each layer extracts different information of interest, and a phoneme probability distribution matrix is finally generated after splicing. The server performs a phoneme alignment process on the phoneme probability distribution matrix and the text sentences labeled for the audio by using a time series class classifier which, as can be understood, may be a CTC algorithm. After obtaining the phoneme alignment loss value based on the time series class classifier, the server adjusts the parameters of the speech recognition model based on the phoneme alignment loss value and iteratively trains the speech recognition model.
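As a non-limiting illustration, once the model is trained, the phoneme probability distribution matrix it produces for an audio can be turned into a phoneme sequence, for example with greedy CTC decoding; the patent does not prescribe a particular decoding strategy, so this is only one possible choice, and the blank index is an assumption.

```python
# Minimal sketch: greedy CTC decoding of one audio's phoneme probability distribution
# matrix (assumed framework: PyTorch; blank index 0 is an assumption).
import torch

def greedy_ctc_decode(log_probs: torch.Tensor, blank: int = 0) -> list:
    # log_probs: (time_frames, phoneme_classes) for a single audio
    best = log_probs.argmax(dim=-1).tolist()
    phonemes, prev = [], blank
    for p in best:
        if p != prev and p != blank:     # collapse repeats and drop blanks
            phonemes.append(p)
        prev = p
    return phonemes

phoneme_ids = greedy_ctc_decode(torch.randn(100, 60).log_softmax(dim=-1))
```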
It should be understood that, although the steps in the flowcharts in the embodiments of the present application are shown in sequence as indicated by the arrows, the steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated otherwise herein, the steps are not strictly limited to being performed in the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and the order of their execution is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the present application further provides a speech recognition apparatus for implementing the speech recognition method. The implementation scheme for solving the problem provided by the device is similar to the implementation scheme recorded in the method, so specific limitations in one or more embodiments of the speech recognition device provided below can be referred to the limitations of the speech recognition method in the above, and are not described herein again.
In one embodiment, as shown in fig. 4, there is provided a speech recognition apparatus 400 comprising: an obtaining module 402, a feature calculating module 404, a loss value calculating module 406, and an optimizing module 408, wherein:
an obtaining module 402 is configured to obtain audio sets with the same audio sampling rate.
The feature calculating module 404 is configured to perform time sequence feature and frequency feature extraction processing on each audio of the audio set to obtain feature matrix data corresponding to the audio and including time sequence and frequency feature information.
A loss value calculation module 406, configured to, in a process of iteratively training a speech recognition model using an audio set, encode, by using the speech recognition model, feature matrix data corresponding to an audio for each audio to obtain a phoneme probability distribution of each time frame in the audio, so as to generate a phoneme probability distribution matrix corresponding to the audio; and determine a phoneme alignment loss value of the speech recognition model in forward propagation based on the phoneme probability distribution matrix and the text sentence labeled for the audio.
And the optimizing module 408 is configured to adjust model parameters of the speech recognition model based on the phoneme alignment loss value to continue iteration until an iteration stop condition is met, so as to obtain a trained speech recognition model.
In one embodiment, before the time-series feature and frequency feature extraction process is performed on each audio of the set of audios, the feature calculation module 404 is further configured to: randomly extract partial audio from the audio set; and perform at least one of the following data enhancement processes on the randomly extracted audio: simulating a first difference between the sound magnitudes of different speaker utterances, and, for the randomly extracted audio, enhancing or attenuating the volume of the audio based on the first difference; simulating a second difference between speaking speech rates of different speakers, and, for the randomly extracted audio, speeding up or slowing down the speech rate of the audio based on the second difference; simulating a third difference of speech rate prosody change of different speakers in the speaking process, and, for the randomly extracted audio, warping audio waveform data over a preset time frame based on the third difference; and simulating a fourth difference between timbre frequency magnitudes of different speakers, and, for the randomly extracted audio, warping audio frequencies over a preset frequency range based on the fourth difference.
In one embodiment, as shown in FIG. 5, the feature calculation module 404 includes: a feature extraction module 404a, and a dimension reduction module 404b, wherein:
the feature extraction module 404a is configured to calculate feature values of different frequencies of the audio over each time frame based on a plurality of band pass filters with triangular filtering features, so as to obtain mel-frequency spectrum matrix data including timing sequence and frequency feature information.
The dimension reduction module 404b is configured to perform dimension reduction processing on the mel-frequency spectrum matrix data to obtain feature matrix data.
In one embodiment, the speech recognition model to be trained comprises a multi-layer convolutional neural network; the loss value calculation module 406 is further configured to: input the Mel frequency spectrum matrix data into each layer of the convolutional neural network, trigger each layer of the convolutional neural network to perform two-dimensional convolution calculation, and obtain the feature matrix data after dimension reduction based on the calculation result; the optimization module 408 is further configured to adjust the parameters of each layer of the convolutional neural network of the speech recognition model based on the phoneme alignment loss value to continue the iteration.
In one embodiment, the loss value calculation module 406 is further configured to, for each audio, input feature matrix data corresponding to the audio into a multi-head attention model of a multi-layer convolutional network of a speech recognition model, so as to perform feature extraction on information of interest of each layer in turn based on each layer of the multi-head attention model; the information of each layer in the multi-head attention model is different; splicing the extracted features of each layer in the multi-head attention model to obtain the phoneme probability distribution of each time frame in the audio; from the phoneme probability distribution for each time frame in the audio, a phoneme probability distribution matrix is generated that includes all the time frames.
In one embodiment, the loss value calculation module 406 is further configured to: performing phoneme alignment on the text sentence labeled by the audio and the phoneme probability distribution matrix; after performing the phoneme alignment process, a phoneme alignment loss value of the speech recognition model forward propagation is determined.
The speech recognition apparatus acquires audio sets with the same audio sampling rate, and extracts time sequence features and frequency features from each audio of the audio set to obtain feature matrix data which corresponds to the audio and comprises time sequence and frequency feature information. In the process of iteratively training the speech recognition model using the audio set, for each audio, the feature matrix data corresponding to the audio is encoded through the speech recognition model to obtain the phoneme probability distribution of each time frame in the audio, so as to generate a phoneme probability distribution matrix corresponding to the audio; the recognition of the audio by the speech recognition model can therefore be accurate to the phoneme level. A phoneme alignment loss value of the speech recognition model in forward propagation is determined based on the phoneme probability distribution matrix and the text sentence labeled for the audio, and model parameters of the speech recognition model are adjusted based on the loss value to continue iteration until an iteration stop condition is met, so as to obtain the trained speech recognition model. Therefore, when the trained speech recognition model is used, it recognizes audio with phoneme-level precision, which improves semantic fluency and recognition accuracy.
For specific limitations of the speech recognition apparatus, reference may be made to the limitations of the speech recognition method above, which are not repeated here. The respective modules in the above speech recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in hardware form in, or be independent of, a processor in the computer device, or may be stored in software form in a memory of the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing audio collection data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech recognition method.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of part of the structure associated with the solution of the present application and does not limit the computer devices to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In an embodiment, a computer program product is provided, comprising a computer program which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical storage, or the like. Volatile Memory can include Random Access Memory (RAM) or external cache Memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples only express several embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.