CN111599340A - Polyphone pronunciation prediction method and device and computer readable storage medium
- Publication number: CN111599340A
- Application number: CN202010727658.8A
- Authority: CN (China)
- Prior art keywords: pronunciation, polyphone, text, training, vector
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L13/02: Methods for producing synthetic speech; speech synthesisers
- G06N3/045: Combinations of networks
- G06N3/08: Learning methods (neural networks)
- G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
The invention discloses a polyphone pronunciation prediction method, relates to the technical field of computer speech processing, and aims to solve the low accuracy of polyphone pronunciation labeling in the prior art. In the technical scheme, a large amount of text containing polyphones, together with the full pinyin of those polyphones, is obtained; a polyphone prediction model is obtained by batch iterative training on the designed model; and in a text pronunciation labeling system, the text input by a user is obtained, the pronunciations of its polyphones are predicted with the polyphone prediction model, the pinyin of single-tone characters is obtained by table lookup, and the pinyin corresponding to the text is spliced and output. The method and the device use a deep neural network to learn the context information of the text for predicting polyphone pronunciations, thereby improving the accuracy of polyphone pronunciation prediction.
Description
Technical Field
The invention relates to the technical field of computer speech processing, and in particular to a polyphone pronunciation prediction method.
Background
Speech synthesis, the technique of having a computer synthesize speech corresponding to text content, enables machines to speak and is key to improving the human-computer interaction experience. Deep learning techniques have now entered the field of speech synthesis and achieved good results. The invention converts Chinese text containing polyphones into correct pinyin, a key step of speech synthesis.
Existing pronunciation prediction methods for polyphones fall into three main categories: 1. labeling each polyphone with its most frequent pronunciation, which clearly yields low accuracy; 2. compiling a polyphone lexicon and corpus and then resolving polyphones by phrase matching, which is limited by the size of the corpus: a lexicon alone cannot handle a single character or word having multiple pronunciations, while an overly large corpus introduces matching-ambiguity errors; 3. having linguists formulate rules and then training recognition models that combine the rules with machine learning methods such as decision trees, where the rule formulation itself is difficult. Consequently, the accuracy of existing polyphone pronunciation prediction is low.
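To make the weakness of the first category concrete, the sketch below shows the most-frequent-pronunciation baseline as a one-line dictionary lookup; the table contents and function name are illustrative assumptions, not data from the patent.

```python
# Prior-art category 1: always emit the statistically most frequent reading,
# ignoring context entirely. The frequency table here is illustrative only.
MOST_FREQUENT = {"还": "hai2", "长": "chang2", "行": "xing2"}

def baseline_pinyin(char: str) -> str:
    """Return the most frequent pronunciation of a polyphone."""
    return MOST_FREQUENT.get(char, "<unknown>")

print(baseline_pinyin("还"))  # always "hai2", wrong wherever 还 reads huan2
```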
Disclosure of Invention
The invention aims to provide a polyphone pronunciation prediction method, a polyphone pronunciation prediction device and a computer readable storage medium.
The above object of the present invention is achieved by the following technical solutions:
a polyphone pronunciation prediction method comprises the following steps:
importing the input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the input text;
performing single-tone character pronunciation labeling on an input text to acquire single-tone character pronunciation;
combining the single-tone character pronunciation and the polyphone pronunciation according to the text sequence, and outputting the whole text pronunciation; wherein,
the training of the polyphone prediction model comprises the following steps:
inputting a training text containing polyphones, marking corresponding correct pronunciation, and outputting a data text corresponding to the training text; inputting a data text into a pre-training language model to obtain vector representation of data; inputting the vector into a deep learning model to perform batch iterative training to obtain a polyphone prediction model;
marking the corresponding correct pronunciation comprises labeling the polyphones in the training text with their correct pronunciations and labeling the single-tone characters with a symbol.
The invention is further configured to: the deep learning model cyclically convolves convolution kernels with the input vectors to obtain two vectors, one from convolving the context on each side of the polyphone's position, splices the two vectors and inputs them into a GRU network for resetting and updating, randomly inactivates the output vector of the GRU network to output a multi-dimensional vector, converts the multi-dimensional vector into a one-dimensional vector, maps each element of the one-dimensional vector through a function to the probability of the corresponding pronunciation, and outputs the pronunciation with the highest probability.
The invention is further configured to: the pre-training model is a Word2Vec or BERT model.
The invention is further configured to: the training of the polyphone prediction model comprises training the model with a stochastic gradient descent algorithm in each iteration and evaluating the model's degree of fit with a cross-entropy loss function.
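As a minimal sketch of one such training iteration in PyTorch, assuming a stand-in linear model, random placeholder data, and arbitrary sizes (none of these values come from the patent):

```python
import torch
import torch.nn as nn

# Batch-iterative training with SGD and cross-entropy, as described above.
model = nn.Linear(240, 5)               # stands in for the polyphone predictor
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):                 # the patent reports 20 training rounds
    for _ in range(10):                 # number of batches is a placeholder
        x = torch.randn(32, 240)        # a batch of spliced context vectors
        y = torch.randint(0, 5, (32,))  # gold pronunciation class indices
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)     # cross-entropy measures the fit
        loss.backward()
        optimizer.step()                # stochastic gradient descent update
```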
The second aim of the invention is realized by the following technical scheme:
a polyphone pronunciation prediction device comprising:
the polyphone prediction module is used for importing the input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the input text;
the single-tone character pronunciation marking module is used for marking the pronunciation of the input text to obtain the pronunciation of the single-tone character;
and the pronunciation combination module is used for combining the pronunciation of the single-tone character and the pronunciation of the polyphone character according to the text sequence and outputting the whole text pronunciation.
The invention is further configured to: the polyphone prediction module comprises:
the input layer is used for inputting a training text containing polyphones, marking corresponding correct pronunciation and outputting a data text corresponding to the training text;
the pre-training layer is used for inputting the marked text into a pre-training language model and acquiring vector representation of data;
the convolution layer is used for cyclically convolving convolution kernels with the output vectors of the pre-training layer to obtain two vectors, one from convolving the context on each side of the polyphone's position;
the splicing layer is used for splicing the two vectors output by the convolution layer;
the GRU network layer is used for selectively resetting and updating the vectors output by the splicing layer;
the Dropout layer is used for randomly inactivating the output vector of the GRU network layer;
the fully connected layer is used for converting the multi-dimensional vector output by the Dropout layer into a one-dimensional vector;
and the output layer is used for mapping each vector element output by the fully connected layer, through a function, to the probability of the corresponding pronunciation and outputting the pronunciation with the highest probability.
The invention is further configured to: the polyphone prediction module trains the model with a stochastic gradient descent algorithm in each iteration and evaluates the model's degree of fit with a cross-entropy loss function.
The invention is further configured to: the device further comprises a speech synthesis module for synthesizing the pronunciations output by the pronunciation combination module into speech and outputting audio.
The third object of the invention is realized by the following technical scheme:
a computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform a polyphonic pronunciation prediction method as described above.
In conclusion, the beneficial technical effects of the invention are as follows: the text input by a user is obtained, the positions of polyphones in the text are detected and their pronunciations predicted, the pinyin of single-tone characters is obtained by table lookup, and the pinyin corresponding to the text is spliced and output; by combining a lexicon with deep learning, the accuracy of converting polyphones into pinyin is improved.
Drawings
FIG. 1 is an overall flow chart of a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating the training of a polyphonic prediction model according to an embodiment of the present invention;
FIG. 3 is a block diagram showing the overall structure of a second embodiment of the present invention;
fig. 4 is a block diagram of a polyphone prediction module according to a second embodiment of the present invention.
Detailed Description
Example one
The invention discloses a polyphone pronunciation prediction method. It can be used for front-end text processing in speech recognition as well as in speech synthesis and other fields that require polyphone pronunciation labeling, and can be applied on electronic equipment such as computers, servers and vehicle-mounted terminals; the applicable scenarios are not limited in this respect.
Referring to fig. 1, the method includes the steps of: importing an input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the text; performing single-tone character pronunciation labeling on an input text to acquire single-tone character pronunciation; and combining the single-tone character pronunciation and the polyphone pronunciation according to the text sequence, and outputting the whole text pronunciation.
In this embodiment, the input text may be entered by a user through a device, such as a mobile device (a smartphone or tablet) or an input device (a mouse or keyboard), or may be text obtained through automatic speech recognition.
It should be noted that the two steps of importing the input text into the polyphone prediction model and labeling the single-tone character pronunciations have no required order and may even be performed simultaneously.
There are roughly more than six hundred polyphones among the Chinese characters, of which over one hundred are frequently used; single-tone characters and some rare polyphones can be looked up in a word stock without model prediction. Therefore, in an alternative, the polyphone prediction model covers only the common polyphones, which improves the model's practicality, while the rare polyphones are recorded in the word stock for lookup-based labeling. When a covered common polyphone appears in the text, its position is marked to facilitate the subsequent prediction.
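A minimal sketch of this position-marking step, assuming a hypothetical set of covered common polyphones (the characters listed are illustrative, not the patent's inventory):

```python
# Mark positions of covered common polyphones; everything else is resolved
# later by word-stock lookup. The character set is an illustrative subset.
COMMON_POLYPHONES = {"还", "长", "行", "重", "乐"}

def mark_polyphone_positions(text: str) -> list[int]:
    """Return the indices of characters that need model prediction."""
    return [i for i, ch in enumerate(text) if ch in COMMON_POLYPHONES]

print(mark_polyphone_positions("请尽快还钱"))  # -> [3]
```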
Referring to fig. 2, the training of the polyphonic prediction model includes the steps of:
s1, inputting a training text containing polyphone characters, marking the correct pronunciation corresponding to the training text, and outputting a data text corresponding to the training text, wherein the data text is a character text without the pronunciation;
s2, inputting the data text into a pre-training language model, obtaining vector representation of the data, and obtaining prior knowledge of the pre-training model;
and S3, inputting the vectors into the deep learning model to perform batch iterative training to obtain a polyphone prediction model.
In step S1, the training text is obtained as follows: a text data set A1 is collected from real speech recognition application scenarios, and some open corpora such as Sogou News and microblog corpora are selected to obtain a corpus data set A2; the sentences containing polyphones are selected to form a corpus set and given initial pronunciation labels from a dictionary; the final training corpus set A is then obtained through manual inspection and correction.
In particular, in this scheme the single-tone characters in the training text are marked with a special symbol and the polyphones are marked with their normal pronunciations. For example, a sentence meaning "please return the money as soon as possible" is labeled "hai2 NA huan2 NA", where "2" denotes the tone; the same rule applies hereinafter. This labeling method is simple, clear and easy to apply; training text labeled in this way can be used directly for the subsequent polyphone model training, and comparing the pinyin predicted by the model against the labeled pinyin to compute the loss makes training more efficient.
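The rule can be sketched in a few lines; the four-character sentence, its readings, and the polyphone set below are an assumed reconstruction of the patent's example, shown only to reproduce the label shape:

```python
# Labeling rule sketch: a polyphone keeps its toned pinyin, every
# single-tone character becomes the placeholder symbol "NA".
PLACEHOLDER = "NA"

def label_sentence(chars, readings, polyphones):
    """chars: the sentence; readings: gold pinyin per character."""
    return " ".join(reading if char in polyphones else PLACEHOLDER
                    for char, reading in zip(chars, readings))

# Assumed reconstruction of the example (还 read hai2, then huan2):
print(label_sentence("还是还钱", ["hai2", "shi4", "huan2", "qian2"], {"还"}))
# -> "hai2 NA huan2 NA"
```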
In step S2, the pre-training model is preferably a Word2Vec or BERT model. These models can be trained directly on large amounts of text data, learning the co-occurrence and ordering knowledge of the words and characters in those texts. After training, the pre-trained language model outputs a vector representation for each input word or expression, and the differences between vector values reflect the relationships between the meanings of different words. In particular, if another pre-training model is used, the format of the training text must be converted into labeled data before being input into that model. This scheme adopts a pre-training model to improve the training effect: before pre-training models were adopted, text was encoded with the one-hot method, which gives all characters the same weight and carries no further prior information. A pre-training model is trained on a large amount of unlabeled data; using it for feature extraction brings in context information learned from that data and yields distinct vectors for distinct characters, which greatly alleviates the problem of limited labeled training data.
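A minimal sketch of the lookup performed by such a pre-training layer, using gensim's Word2Vec KeyedVectors; the model file path is a placeholder assumption:

```python
import numpy as np
from gensim.models import KeyedVectors

# Load 300-dimensional vectors trained beforehand (path is a placeholder).
kv = KeyedVectors.load_word2vec_format("zh_chars_300d.vec")

def sentence_to_matrix(chars: str) -> np.ndarray:
    """Stack per-character vectors into a (sentence length, 300) matrix;
    characters missing from the vocabulary fall back to zero vectors."""
    return np.stack([kv[c] if c in kv else np.zeros(kv.vector_size)
                     for c in chars])
```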
In step S3, batch iterative training is a training method commonly used in deep learning: the training data are fed into the neural network in batches, and the specific batch size and number of iterations must be determined experimentally according to the machine performance and the actual training effect.
In the deep learning model, convolution kernels are cyclically convolved with the input vectors to obtain two vectors, one for the context on each side of the polyphone's position, and the two are spliced and input into a GRU (Gated Recurrent Unit) network for resetting and updating. The GRU is a recurrent network whose two gating mechanisms, the reset gate and the update gate, selectively reset and update the input vectors during model training; compared with network models such as the multilayer perceptron and the convolutional neural network, the GRU handles the long-term memory and gradient vanishing problems well and learns sequence features better.
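For reference, one common formulation of the GRU gating equations, with $\sigma$ the logistic sigmoid and $\odot$ elementwise multiplication; this is the textbook definition, not notation from the patent, and some references swap the roles of $z_t$ and $1-z_t$:

```latex
% Reset gate r_t, update gate z_t, candidate state and new hidden state:
\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
\tilde{h}_t &= \tanh\bigl(W x_t + U\,(r_t \odot h_{t-1})\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```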
Then the output vector of the GRU network is randomly inactivated, and a multi-dimensional vector is output. When the network is too complex and the training data too scarce, over-learning occurs easily: the accuracy is high during training but low in the application scenario. Random inactivation zeroes some vector components at random during training, which reduces the complexity of the network and effectively prevents overfitting.
Finally, the randomly inactivated multi-dimensional vector is converted into a one-dimensional vector, each element of the one-dimensional vector is mapped through a function to the probability of the corresponding pronunciation, and the pronunciation with the highest probability is output.
The implementation principle of the above embodiment is as follows: training a polyphone prediction model, comprising:
1. and acquiring a training corpus A.
2. Use a Word2Vec pre-trained language model; it can be trained on a Sogou News corpus that covers the common Chinese characters, with a word-vector dimension of 300. Input the training corpus A into the word-vector matrix W to obtain the vector matrix X of the input corpus. The matrix X is three-dimensional: the first dimension is the number of samples, the second is the length of each sentence, and the third is 300, the word-vector dimension.
3. Perform the convolution operations. The convolution kernel sizes are 3, 4 and 5, the kernel size being the window size. The convolution kernels are cyclically convolved with the input vector matrix X; each convolution yields a value that is the feature within the corresponding window, and averaging all the values of each kernel then gives a feature value for the whole sentence, so the feature is not local information confined to a single window but feature information of the entire sentence. 120 convolution kernels are used to extract different features, giving 120 feature values that are spliced into a vector C.
4. The pronunciation of a polyphone in a sentence is determined by its context. The context on either side of the polyphone is processed as in step 3, yielding two feature vectors C1 and C2 of 120 dimensions each, which are spliced into a 240-dimensional vector P at the splicing layer. The special cases at the beginning and end of a sentence are handled by zero padding: if the polyphone is at the beginning of the sentence, a 120-dimensional zero vector is used in front; if it is at the end, a 120-dimensional zero vector is used behind.
5. The vector P is input into a bidirectional GRU network, a network suited to sequence learning problems that handles the long-term memory and gradient vanishing problems well. The GRU uses 256 hidden units; since it is bidirectional, this is multiplied by 2, so the output vector G has dimension 512.
6. The vector G is randomly inactivated, randomly discarding a certain proportion of network connections to reduce the network complexity and the risk of overfitting; this layer acts only during training and is disabled in formal use. The vector D is output.
7. The vector D has dimension 512. For the final pronunciation prediction one last conversion is needed: a conversion matrix S converts the vector's dimension to m, the total number of all polyphone pronunciations, each dimension corresponding to one pronunciation. The vector Q is output.
8. The Softmax function converts the vector Q into a set of numbers between 0 and 1, each representing the probability of the corresponding pronunciation. The Softmax function is defined as

$\mathrm{Softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j=1}^{m} e^{z_j}}$

where e is the natural constant, $z_i$ is the i-th component of the vector Q, m is the dimension of Q, and i takes integer values from 1 to m.
9. The model is trained with a stochastic gradient descent algorithm, and the degree of model fitting is evaluated with the cross-entropy loss (Cross Entropy Loss) function

$L(p, q) = -\sum_{i=1}^{m} q_i \log p_i$

where p is the predicted value and q is the true value; the smaller the value of the function, the better the model fits. The model is trained with stochastic gradient descent based on this loss function. A code sketch assembling steps 3 to 8 follows.
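A minimal PyTorch sketch of the network in steps 3 to 8; the class name, the even split of the 120 kernels across the three window sizes, the dropout rate, and the ten-pronunciation demo output size are assumptions for illustration (the patent publishes no code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolyphoneNet(nn.Module):
    """Sketch of steps 3-8: per-side convolutions, splicing, bidirectional
    GRU, random inactivation, conversion matrix S, Softmax."""

    def __init__(self, emb_dim=300, n_kernels=120, n_pronunciations=10):
        super().__init__()
        self.n_kernels = n_kernels
        # Step 3: 120 kernels, assumed split evenly over window sizes 3, 4, 5.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_kernels // 3, k, padding=k // 2)
            for k in (3, 4, 5))
        # Step 5: 256 hidden units, doubled by bidirectionality -> 512.
        self.gru = nn.GRU(2 * n_kernels, 256, bidirectional=True,
                          batch_first=True)
        self.dropout = nn.Dropout(0.5)              # step 6 (rate assumed)
        self.fc = nn.Linear(512, n_pronunciations)  # step 7: matrix S

    def context_features(self, ctx):
        """Steps 3-4 for one side: ctx is (batch, context length, emb_dim)."""
        if ctx.size(1) == 0:   # polyphone at a sentence edge: zero padding
            return ctx.new_zeros(ctx.size(0), self.n_kernels)
        x = ctx.transpose(1, 2)                     # Conv1d wants (B, C, L)
        # Average every kernel's outputs to get one whole-span feature value.
        return torch.cat([conv(x).mean(dim=2) for conv in self.convs], dim=1)

    def forward(self, left_ctx, right_ctx):
        # Step 4: splice C1 and C2 into the 240-dimensional vector P.
        p = torch.cat([self.context_features(left_ctx),
                       self.context_features(right_ctx)], dim=1)
        g, _ = self.gru(p.unsqueeze(1))             # step 5: vector G (512)
        q = self.fc(self.dropout(g.squeeze(1)))     # steps 6-7: D, then Q
        return F.softmax(q, dim=1)                  # step 8: probabilities

# Tiny demo: one polyphone with 4 characters of left and 3 of right context.
net = PolyphoneNet()
probs = net(torch.randn(1, 4, 300), torch.randn(1, 3, 300))
print(probs.shape, float(probs.sum()))  # torch.Size([1, 10]) ~1.0
```

In actual training one would feed the pre-Softmax vector Q to the cross-entropy loss of step 9 rather than the probabilities; the sketch applies Softmax only to mirror the narration above.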
The training set adopted by the invention contains 500,000 sentences covering 150 common polyphones; 475,000 sentences are used as the training set and 25,000 as the test set. After 20 rounds of training, the accuracy on the test set reaches 96%.
After training, the input text is obtained; if it contains polyphones, it is input into the polyphone prediction model to obtain the polyphone pronunciations. Meanwhile, the single-tone pronunciations of the input text are labeled by means of a dictionary program and spliced with the polyphone pronunciations to obtain the complete pronunciation labeling of the input text.
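A sketch of this splicing step, with an illustrative dictionary and a stub standing in for the trained model (all names and table contents here are assumptions):

```python
# Hypothetical end-to-end labeling: the model handles the polyphones, a
# dictionary lookup handles single-tone characters, and the results are
# spliced in text order.
MONO_DICT = {"请": "qing3", "尽": "jin3", "快": "kuai4", "钱": "qian2"}

def annotate(text, polyphones, predict):
    """predict(text, i) -> model-predicted pinyin for the polyphone at i."""
    return " ".join(
        predict(text, i) if ch in polyphones else MONO_DICT.get(ch, "<oov>")
        for i, ch in enumerate(text))

# Stub prediction standing in for the trained model:
print(annotate("请尽快还钱", {"还"}, lambda t, i: "huan2"))
# -> "qing3 jin3 kuai4 huan2 qian2"
```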
Example two
The invention discloses a polyphone pronunciation prediction device. Referring to fig. 3, the device comprises a polyphone prediction module for importing the input text into a trained polyphone prediction model and acquiring the pronunciations of the polyphones in the text;
the single-tone character pronunciation marking module is used for marking the pronunciation of the input text to obtain the pronunciation of the single-tone character;
and the pronunciation combination module is used for combining the pronunciation of the single-tone character and the pronunciation of the polyphone character according to the text sequence and outputting the whole text pronunciation.
Referring to fig. 4, the polyphone prediction module includes:
the input layer is used for inputting a training text containing polyphones and outputting a labeled data text;
the pre-training layer is used for inputting the labeled data text into a pre-training language model and acquiring vector representation of the data;
the convolution layer is used for cyclically convolving convolution kernels with the output vectors of the pre-training layer to obtain two vectors, one from convolving the context on each side of the polyphone's position;
the splicing layer is used for splicing the two vectors output by the convolution layer;
the GRU network layer is used for selectively resetting and updating the vectors output by the splicing layer;
the Dropout layer is used for randomly inactivating the output vector of the GRU network layer;
the fully connected layer is used for converting the multi-dimensional vector output by the Dropout layer into a one-dimensional vector;
and the output layer is used for mapping each vector element output by the fully connected layer, through a function, to the probability of the corresponding pronunciation and outputting the pronunciation with the highest probability. In this embodiment, the output layer comprises a Softmax function.
The polyphone prediction module trains the model with a stochastic gradient descent algorithm in each iteration and evaluates the model's degree of fit with a cross-entropy loss function.
In the field of speech synthesis, the device further comprises a speech synthesis module for synthesizing the pronunciations output by the pronunciation combination module into speech and outputting audio. The device can be widely applied in fields such as intelligent customer service voice interaction, audiobook reading and barrier-free broadcasting.
In the polyphone prediction module, the input layer obtains the training text containing polyphones, and the pre-training layer processes the training text into vector data, which is used for batch iterative training. Specifically, the convolution layer abstracts text features: several convolution kernels are cyclically convolved with the input vectors to produce output vectors, with different kernels learning different features. The characters before and after a polyphone are the key information influencing its pronunciation, so the text on each side of the polyphone is input into the convolution layer separately, producing two output vectors. The splicing layer splices the two vectors together. The GRU network layer selectively resets and updates the vectors output by the splicing layer; it handles the long-term memory and gradient vanishing problems and learns sequence features well. The Dropout layer randomly zeroes some vector components during training, reducing the complexity of the network and effectively preventing overfitting. After these operations a multi-dimensional vector is output; the fully connected layer converts it into a one-dimensional vector, mapping all the features into one vector. The Softmax function, commonly used in the output layer of multi-class neural networks, maps the input vector to values between 0 and 1 that represent the probability of each class; the output layer uses this function to give the probability of each pronunciation of the polyphone and outputs the pronunciation with the highest probability.
Example three
The invention discloses a computer-readable storage medium comprising a set of computer-executable instructions which, when executed, perform the polyphone pronunciation prediction method of the first embodiment.
The embodiments above are preferred embodiments of the present invention, and the protection scope of the present invention is not limited by them; therefore, all equivalent changes made according to the structure, shape and principle of the invention are covered by its protection scope.
Claims (7)
1. A polyphone pronunciation prediction method is characterized by comprising the following steps:
importing the input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the input text;
performing single-tone character pronunciation labeling on an input text to acquire single-tone character pronunciation;
combining the single-tone character pronunciation and the polyphone pronunciation according to the text sequence, and outputting the whole text pronunciation; wherein,
the training of the polyphone prediction model comprises the following steps:
inputting a training text containing polyphones, marking corresponding correct pronunciation, and outputting a data text corresponding to the training text; inputting a data text into a pre-training language model to obtain vector representation of data; inputting the vector into a deep learning model to perform batch iterative training to obtain a polyphone prediction model;
marking the corresponding correct pronunciation comprises labeling the polyphones in the training text with their correct pronunciations and labeling the single-tone characters with a symbol;
the deep learning model cyclically convolves convolution kernels with the input vectors to obtain two vectors, one from convolving the context on each side of the polyphone's position, splices the two vectors and inputs them into a GRU network for resetting and updating, randomly inactivates the output vector of the GRU network to output a multi-dimensional vector, converts the multi-dimensional vector into a one-dimensional vector, maps each element of the one-dimensional vector through a function to the probability of the corresponding pronunciation, and outputs the pronunciation with the highest probability.
2. A polyphonic pronunciation prediction method according to claim 1, characterized by: the pre-training model is a Word2Vec or BERT model.
3. A polyphonic pronunciation prediction method according to claim 2, characterized by: the training of the polyphone prediction model comprises training the model with a stochastic gradient descent algorithm in each iteration and evaluating the model's degree of fit with a cross-entropy loss function.
4. A polyphone pronunciation prediction device, comprising:
the polyphone prediction module is used for importing the input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the input text;
the single-tone character pronunciation marking module is used for marking the pronunciation of the input text to obtain the pronunciation of the single-tone character;
the pronunciation combination module is used for combining the pronunciation of the single-tone character and the pronunciation of the polyphone character according to the text sequence and outputting the whole text pronunciation;
the polyphone prediction module comprises:
the input layer is used for inputting a training text containing polyphones, marking corresponding correct pronunciation and outputting a data text corresponding to the training text;
the pre-training layer is used for inputting the data text into the pre-training language model and acquiring the vector representation of the data;
the convolution layer is used for cyclically convolving convolution kernels with the output vectors of the pre-training layer to obtain two vectors, one from convolving the context on each side of the polyphone's position;
the splicing layer is used for splicing the two vectors output by the convolution layer;
the GRU network layer is used for selectively resetting and updating the vectors output by the splicing layer;
the Dropout layer is used for randomly inactivating the output vector of the GRU network layer;
the fully connected layer is used for converting the multi-dimensional vector output by the Dropout layer into a one-dimensional vector;
and the output layer is used for mapping each vector element output by the fully connected layer, through a function, to the probability of the corresponding pronunciation and outputting the pronunciation with the highest probability.
5. A polyphonic pronunciation prediction device according to claim 4, wherein: the polyphone prediction module trains the model with a stochastic gradient descent algorithm in each iteration and evaluates the model's degree of fit with a cross-entropy loss function.
6. A polyphonic pronunciation prediction device according to claim 5, wherein: the device further comprises a speech synthesis module for synthesizing the pronunciations output by the pronunciation combination module into speech and outputting audio.
7. A computer-readable storage medium characterized by: comprising a set of computer executable instructions for performing a polyphonic pronunciation prediction method as claimed in any one of claims 1 to 3 when executed.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202010727658.8A | 2020-07-27 | 2020-07-27 | Polyphone pronunciation prediction method and device and computer readable storage medium |
Publications (1)
| Publication Number | Publication Date |
| --- | --- |
| CN111599340A | 2020-08-28 |
Family (ID=72186722)
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | Status |
| --- | --- | --- | --- | --- |
| CN202010727658.8A | Polyphone pronunciation prediction method and device and computer readable storage medium | 2020-07-27 | 2020-07-27 | Pending |

Country Status (1)
| Country | Link |
| --- | --- |
| CN | CN111599340A (en) |
Citations (5)
| Publication Number | Priority Date | Publication Date | Title |
| --- | --- | --- | --- |
| CN103578464A (en) * | 2013-10-18 | 2014-02-12 | Language model establishing method, speech recognition method and electronic device |
| JP2017208097A (en) * | 2016-05-20 | 2017-11-24 | Ambiguity avoidance method of polyphonic entity and ambiguity avoidance device of polyphonic entity |
| CN110277085A (en) * | 2019-06-25 | 2019-09-24 | Determine the method and device of polyphone pronunciation |
| CN110782870A (en) * | 2019-09-06 | 2020-02-11 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
| CN110909879A (en) * | 2019-12-09 | 2020-03-24 | Auto-regressive neural network disambiguation model, training and using method, device and system |
Cited By (16)
| Publication Number | Priority Date | Publication Date | Title |
| --- | --- | --- | --- |
| CN111737957B (en) * | 2020-08-25 | 2021-06-01 | Chinese character pinyin conversion method and device, electronic equipment and storage medium |
| CN111737957A (en) * | 2020-08-25 | 2020-10-02 | Chinese character pinyin conversion method and device, electronic equipment and storage medium |
| CN112348073A (en) * | 2020-10-30 | 2021-02-09 | Polyphone recognition method and device, electronic equipment and storage medium |
| CN112348073B (en) * | 2020-10-30 | 2024-05-17 | Multi-tone character recognition method and device, electronic equipment and storage medium |
| JP7441864B2 (en) | 2020-12-10 | 2024-03-01 | Methods, devices, equipment, and storage media for predicting polyphonic pronunciation |
| WO2022121166A1 (en) * | 2020-12-10 | 2022-06-16 | Method, apparatus and device for predicting heteronym pronunciation, and storage medium |
| JP2023509257A (en) | 2020-12-10 | 2023-03-08 | Method, apparatus, equipment, and storage medium for predicting polyphonic pronunciation |
| CN112580335A (en) * | 2020-12-28 | 2021-03-30 | Method and device for disambiguating polyphone |
| CN112580335B (en) * | 2020-12-28 | 2023-03-24 | Method and device for disambiguating polyphone |
| CN112735376A (en) * | 2020-12-29 | 2021-04-30 | Self-learning platform |
| CN112966476A (en) * | 2021-04-19 | 2021-06-15 | Text processing method and device, electronic equipment and storage medium |
| CN112966476B (en) * | 2021-04-19 | 2022-03-25 | Text processing method and device, electronic equipment and storage medium |
| CN114742044A (en) * | 2022-03-18 | 2022-07-12 | Information processing method and device and electronic equipment |
| CN115273809A (en) * | 2022-06-22 | 2022-11-01 | Training method of polyphone pronunciation prediction network, and speech generation method and device |
| CN116266266B (en) * | 2022-11-08 | 2024-02-20 | Multi-tone word disambiguation method, device, equipment and storage medium |
| CN116266266A (en) * | 2022-11-08 | 2023-06-20 | Multi-tone word disambiguation method, device, equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
| --- | --- | --- | --- |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200828 |