CN111599340A - Polyphone pronunciation prediction method and device and computer readable storage medium
- Publication number: CN111599340A
- Application number: CN202010727658.8A
- Authority: CN (China)
- Prior art keywords: pronunciation, polyphone, text, training, vector
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L13/02: Methods for producing synthetic speech; speech synthesisers
- G06N3/045: Combinations of networks
- G06N3/08: Learning methods (neural networks)
- G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Abstract
The invention discloses a polyphone pronunciation prediction method, relates to the technical field of computer speech processing, and aims to solve the low accuracy of polyphone pronunciation labeling in the prior art. In the technical scheme, a large amount of text containing polyphones, together with the full pinyin of those polyphones, is obtained; a polyphone prediction model is obtained by batch iterative training on the designed model; and in a text pronunciation labeling system, the text input by a user is obtained, the pronunciations of its polyphones are predicted with the polyphone prediction model, the pinyin of single-tone characters is obtained by table lookup, and the pinyin corresponding to the text is spliced and output. The method and the device use a deep neural network to learn the context information of the text for predicting polyphone pronunciations, thereby improving the accuracy of polyphone pronunciation prediction.
Description
Technical Field
The invention relates to the technical field of computer speech processing, and in particular to a polyphone pronunciation prediction method.
Background
Speech synthesis, the technique of having a computer synthesize speech corresponding to text content, enables machines to speak and is key to improving the human-computer interaction experience. Deep learning techniques have now entered the field of speech synthesis and achieved good results. The invention converts Chinese text containing polyphones into correct pinyin, a key step of speech synthesis.
Existing pronunciation prediction methods for polyphones fall into three main categories: 1. labeling each polyphone with its most frequent pronunciation, which clearly yields low accuracy; 2. compiling a polyphone lexicon and corpus and then resolving polyphones by phrase matching, which is limited by the size of the corpus: a lexicon alone cannot handle a single character or word having multiple pronunciations, while an overly large corpus introduces matching-ambiguity errors; 3. having linguists formulate rules and then training recognition models that combine the rules with machine learning methods such as decision trees, where the rule formulation itself is difficult. Consequently, the accuracy of existing polyphone pronunciation prediction is low.
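To make the weakness of the first category concrete, the sketch below shows the most-frequent-pronunciation baseline as a one-line dictionary lookup; the table contents and function name are illustrative assumptions, not data from the patent.

```python
# Prior-art category 1: always emit the statistically most frequent reading,
# ignoring context entirely. The frequency table here is illustrative only.
MOST_FREQUENT = {"还": "hai2", "长": "chang2", "行": "xing2"}

def baseline_pinyin(char: str) -> str:
    """Return the most frequent pronunciation of a polyphone."""
    return MOST_FREQUENT.get(char, "<unknown>")

print(baseline_pinyin("还"))  # always "hai2", wrong wherever 还 reads huan2
```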
Disclosure of Invention
The invention aims to provide a polyphone pronunciation prediction method, a polyphone pronunciation prediction device and a computer readable storage medium.
The above object of the present invention is achieved by the following technical solutions:
a polyphone pronunciation prediction method comprises the following steps:
importing the input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the input text;
performing single-tone character pronunciation labeling on an input text to acquire single-tone character pronunciation;
combining the single-tone character pronunciation and the polyphone pronunciation according to the text sequence, and outputting the whole text pronunciation; wherein,
the training of the polyphone prediction model comprises the following steps:
inputting a training text containing polyphones, marking corresponding correct pronunciation, and outputting a data text corresponding to the training text; inputting a data text into a pre-training language model to obtain vector representation of data; inputting the vector into a deep learning model to perform batch iterative training to obtain a polyphone prediction model;
marking the corresponding correct pronunciation comprises labeling the polyphones in the training text with their correct pronunciations and labeling the single-tone characters with a symbol.
The invention is further configured to: the deep learning model cyclically convolves convolution kernels with the input vectors to obtain two vectors, one from convolving the context on each side of the polyphone's position, splices the two vectors and inputs them into a GRU network for resetting and updating, randomly inactivates the output vector of the GRU network to output a multi-dimensional vector, converts the multi-dimensional vector into a one-dimensional vector, maps each element of the one-dimensional vector through a function to the probability of the corresponding pronunciation, and outputs the pronunciation with the highest probability.
The invention is further configured to: the pre-training model is a Word2Vec or BERT model.
The invention is further configured to: the training of the polyphone prediction model comprises training the model with a stochastic gradient descent algorithm in each iteration and evaluating the model's degree of fit with a cross-entropy loss function.
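As a minimal sketch of one such training iteration in PyTorch, assuming a stand-in linear model, random placeholder data, and arbitrary sizes (none of these values come from the patent):

```python
import torch
import torch.nn as nn

# Batch-iterative training with SGD and cross-entropy, as described above.
model = nn.Linear(240, 5)               # stands in for the polyphone predictor
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):                 # the patent reports 20 training rounds
    for _ in range(10):                 # number of batches is a placeholder
        x = torch.randn(32, 240)        # a batch of spliced context vectors
        y = torch.randint(0, 5, (32,))  # gold pronunciation class indices
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)     # cross-entropy measures the fit
        loss.backward()
        optimizer.step()                # stochastic gradient descent update
```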
The second aim of the invention is realized by the following technical scheme:
a polyphone pronunciation prediction device comprising:
the polyphone prediction module is used for importing the input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the input text;
the single-tone character pronunciation marking module is used for marking the pronunciation of the input text to obtain the pronunciation of the single-tone character;
and the pronunciation combination module is used for combining the pronunciation of the single-tone character and the pronunciation of the polyphone character according to the text sequence and outputting the whole text pronunciation.
The invention is further configured to: the polyphone prediction module comprises:
the input layer is used for inputting a training text containing polyphones, marking corresponding correct pronunciation and outputting a data text corresponding to the training text;
the pre-training layer is used for inputting the marked text into a pre-training language model and acquiring vector representation of data;
the convolution layer is used for cyclically convolving convolution kernels with the output vectors of the pre-training layer to obtain two vectors, one from convolving the context on each side of the polyphone's position;
the splicing layer is used for splicing the two vectors output by the convolution layer;
the GRU network layer is used for selectively resetting and updating the vectors output by the splicing layer;
the Dropout layer is used for randomly inactivating the output vector of the GRU network layer;
the fully connected layer is used for converting the multi-dimensional vector output by the Dropout layer into a one-dimensional vector;
and the output layer is used for mapping each vector element output by the fully connected layer, through a function, to the probability of the corresponding pronunciation and outputting the pronunciation with the highest probability.
The invention is further configured to: the polyphone prediction module trains the model with a stochastic gradient descent algorithm in each iteration and evaluates the model's degree of fit with a cross-entropy loss function.
The invention is further configured to: the device further comprises a speech synthesis module for synthesizing the pronunciations output by the pronunciation combination module into speech and outputting audio.
The third object of the invention is realized by the following technical scheme:
a computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform a polyphonic pronunciation prediction method as described above.
In conclusion, the beneficial technical effects of the invention are as follows: the text input by a user is obtained, the positions of polyphones in the text are detected and their pronunciations predicted, the pinyin of single-tone characters is obtained by table lookup, and the pinyin corresponding to the text is spliced and output; by combining a lexicon with deep learning, the accuracy of converting polyphones into pinyin is improved.
Drawings
FIG. 1 is an overall flow chart of a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating the training of a polyphonic prediction model according to an embodiment of the present invention;
FIG. 3 is a block diagram showing the overall structure of a second embodiment of the present invention;
fig. 4 is a block diagram of a polyphone prediction module according to a second embodiment of the present invention.
Detailed Description
Example one
The invention discloses a polyphone pronunciation prediction method. It can be used for front-end text processing in speech recognition as well as in speech synthesis and other fields that require polyphone pronunciation labeling, and can be applied on electronic equipment such as computers, servers and vehicle-mounted terminals; the applicable scenarios are not limited in this respect.
Referring to fig. 1, the method includes the steps of: importing an input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the text; performing single-tone character pronunciation labeling on an input text to acquire single-tone character pronunciation; and combining the single-tone character pronunciation and the polyphone pronunciation according to the text sequence, and outputting the whole text pronunciation.
In this embodiment, the input text may be entered by a user through a device, such as a mobile device (a smartphone or tablet) or an input device (a mouse or keyboard), or may be text obtained through automatic speech recognition.
It should be noted that the two steps of importing the input text into the polyphone prediction model and labeling the single-tone character pronunciations have no required order and may even be performed simultaneously.
There are roughly more than six hundred polyphones among the Chinese characters, of which over one hundred are frequently used; single-tone characters and some rare polyphones can be looked up in a word stock without model prediction. Therefore, in an alternative, the polyphone prediction model covers only the common polyphones, which improves the model's practicality, while the rare polyphones are recorded in the word stock for lookup-based labeling. When a covered common polyphone appears in the text, its position is marked to facilitate the subsequent prediction.
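A minimal sketch of this position-marking step, assuming a hypothetical set of covered common polyphones (the characters listed are illustrative, not the patent's inventory):

```python
# Mark positions of covered common polyphones; everything else is resolved
# later by word-stock lookup. The character set is an illustrative subset.
COMMON_POLYPHONES = {"还", "长", "行", "重", "乐"}

def mark_polyphone_positions(text: str) -> list[int]:
    """Return the indices of characters that need model prediction."""
    return [i for i, ch in enumerate(text) if ch in COMMON_POLYPHONES]

print(mark_polyphone_positions("请尽快还钱"))  # -> [3]
```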
Referring to fig. 2, the training of the polyphonic prediction model includes the steps of:
s1, inputting a training text containing polyphone characters, marking the correct pronunciation corresponding to the training text, and outputting a data text corresponding to the training text, wherein the data text is a character text without the pronunciation;
s2, inputting the data text into a pre-training language model, obtaining vector representation of the data, and obtaining prior knowledge of the pre-training model;
and S3, inputting the vectors into the deep learning model to perform batch iterative training to obtain a polyphone prediction model.
In step S1, the training text is obtained as follows: a text data set A1 is collected from real speech recognition application scenarios, and some open corpora such as Sogou News and microblog corpora are selected to obtain a corpus data set A2; the sentences containing polyphones are selected to form a corpus set and given initial pronunciation labels from a dictionary; the final training corpus set A is then obtained through manual inspection and correction.
In particular, in this scheme the single-tone characters in the training text are marked with a special symbol and the polyphones are marked with their normal pronunciations. For example, a sentence meaning "please return the money as soon as possible" is labeled "hai2 NA huan2 NA", where "2" denotes the tone; the same rule applies hereinafter. This labeling method is simple, clear and easy to apply; training text labeled in this way can be used directly for the subsequent polyphone model training, and comparing the pinyin predicted by the model against the labeled pinyin to compute the loss makes training more efficient.
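The rule can be sketched in a few lines; the four-character sentence, its readings, and the polyphone set below are an assumed reconstruction of the patent's example, shown only to reproduce the label shape:

```python
# Labeling rule sketch: a polyphone keeps its toned pinyin, every
# single-tone character becomes the placeholder symbol "NA".
PLACEHOLDER = "NA"

def label_sentence(chars, readings, polyphones):
    """chars: the sentence; readings: gold pinyin per character."""
    return " ".join(reading if char in polyphones else PLACEHOLDER
                    for char, reading in zip(chars, readings))

# Assumed reconstruction of the example (还 read hai2, then huan2):
print(label_sentence("还是还钱", ["hai2", "shi4", "huan2", "qian2"], {"还"}))
# -> "hai2 NA huan2 NA"
```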
In step S2, the pre-training model is preferably a Word2Vec or BERT model. These models can be trained directly on large amounts of text data, learning the co-occurrence and ordering knowledge of the words and characters in those texts. After training, the pre-trained language model outputs a vector representation for each input word or expression, and the differences between vector values reflect the relationships between the meanings of different words. In particular, if another pre-training model is used, the format of the training text must be converted into labeled data before being input into that model. This scheme adopts a pre-training model to improve the training effect: before pre-training models were adopted, text was encoded with the one-hot method, which gives all characters the same weight and carries no further prior information. A pre-training model is trained on a large amount of unlabeled data; using it for feature extraction brings in context information learned from that data and yields distinct vectors for distinct characters, which greatly alleviates the problem of limited labeled training data.
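A minimal sketch of the lookup performed by such a pre-training layer, using gensim's Word2Vec KeyedVectors; the model file path is a placeholder assumption:

```python
import numpy as np
from gensim.models import KeyedVectors

# Load 300-dimensional vectors trained beforehand (path is a placeholder).
kv = KeyedVectors.load_word2vec_format("zh_chars_300d.vec")

def sentence_to_matrix(chars: str) -> np.ndarray:
    """Stack per-character vectors into a (sentence length, 300) matrix;
    characters missing from the vocabulary fall back to zero vectors."""
    return np.stack([kv[c] if c in kv else np.zeros(kv.vector_size)
                     for c in chars])
```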
In step S3, batch iterative training is a training method commonly used in deep learning: the training data are fed into the neural network in batches, and the specific batch size and number of iterations must be determined experimentally according to the machine performance and the actual training effect.
In the deep learning model, convolution kernels are cyclically convolved with the input vectors to obtain two vectors, one for the context on each side of the polyphone's position, and the two are spliced and input into a GRU (Gated Recurrent Unit) network for resetting and updating. The GRU is a recurrent network whose two gating mechanisms, the reset gate and the update gate, selectively reset and update the input vectors during model training; compared with network models such as the multilayer perceptron and the convolutional neural network, the GRU handles the long-term memory and gradient vanishing problems well and learns sequence features better.
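For reference, one common formulation of the GRU gating equations, with $\sigma$ the logistic sigmoid and $\odot$ elementwise multiplication; this is the textbook definition, not notation from the patent, and some references swap the roles of $z_t$ and $1-z_t$:

```latex
% Reset gate r_t, update gate z_t, candidate state and new hidden state:
\begin{aligned}
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
\tilde{h}_t &= \tanh\bigl(W x_t + U\,(r_t \odot h_{t-1})\bigr) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```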
Then the output vector of the GRU network is randomly inactivated, and a multi-dimensional vector is output. When the network is too complex and the training data too scarce, over-learning occurs easily: the accuracy is high during training but low in the application scenario. Random inactivation zeroes some vector components at random during training, which reduces the complexity of the network and effectively prevents overfitting.
Finally, the randomly inactivated multi-dimensional vector is converted into a one-dimensional vector, each element of the one-dimensional vector is mapped through a function to the probability of the corresponding pronunciation, and the pronunciation with the highest probability is output.
The implementation principle of the above embodiment is as follows: training a polyphone prediction model, comprising:
1. and acquiring a training corpus A.
2. Use a Word2Vec pre-trained language model; it can be trained on a Sogou News corpus that covers the common Chinese characters, with a word-vector dimension of 300. Input the training corpus A into the word-vector matrix W to obtain the vector matrix X of the input corpus. The matrix X is three-dimensional: the first dimension is the number of samples, the second is the length of each sentence, and the third is 300, the word-vector dimension.
3. Perform the convolution operations. The convolution kernel sizes are 3, 4 and 5, the kernel size being the window size. The convolution kernels are cyclically convolved with the input vector matrix X; each convolution yields a value that is the feature within the corresponding window, and averaging all the values of each kernel then gives a feature value for the whole sentence, so the feature is not local information confined to a single window but feature information of the entire sentence. 120 convolution kernels are used to extract different features, giving 120 feature values that are spliced into a vector C.
4. The pronunciation of a polyphone in a sentence is determined by its context. The context on either side of the polyphone is processed as in step 3, yielding two feature vectors C1 and C2 of 120 dimensions each, which are spliced into a 240-dimensional vector P at the splicing layer. The special cases at the beginning and end of a sentence are handled by zero padding: if the polyphone is at the beginning of the sentence, a 120-dimensional zero vector is used in front; if it is at the end, a 120-dimensional zero vector is used behind.
5. The vector P is input into a bidirectional GRU network, a network suited to sequence learning problems that handles the long-term memory and gradient vanishing problems well. The GRU uses 256 hidden units; since it is bidirectional, this is multiplied by 2, so the output vector G has dimension 512.
6. The vector G is randomly inactivated, randomly discarding a certain proportion of network connections to reduce the network complexity and the risk of overfitting; this layer acts only during training and is disabled in formal use. The vector D is output.
7. The vector D has dimension 512. For the final pronunciation prediction one last conversion is needed: a conversion matrix S converts the vector's dimension to m, the total number of all polyphone pronunciations, each dimension corresponding to one pronunciation. The vector Q is output.
8. The Softmax function converts the vector Q into a set of numbers between 0 and 1, each representing the probability of the corresponding pronunciation. The Softmax function is defined as

$\mathrm{Softmax}(z_i) = \dfrac{e^{z_i}}{\sum_{j=1}^{m} e^{z_j}}$

where e is the natural constant, $z_i$ is the i-th component of the vector Q, m is the dimension of Q, and i takes integer values from 1 to m.
9. The model is trained with a stochastic gradient descent algorithm, and the degree of model fitting is evaluated with the cross-entropy loss (Cross Entropy Loss) function

$L(p, q) = -\sum_{i=1}^{m} q_i \log p_i$

where p is the predicted value and q is the true value; the smaller the value of the function, the better the model fits. The model is trained with stochastic gradient descent based on this loss function. A code sketch assembling steps 3 to 8 follows.
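A minimal PyTorch sketch of the network in steps 3 to 8; the class name, the even split of the 120 kernels across the three window sizes, the dropout rate, and the ten-pronunciation demo output size are assumptions for illustration (the patent publishes no code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolyphoneNet(nn.Module):
    """Sketch of steps 3-8: per-side convolutions, splicing, bidirectional
    GRU, random inactivation, conversion matrix S, Softmax."""

    def __init__(self, emb_dim=300, n_kernels=120, n_pronunciations=10):
        super().__init__()
        self.n_kernels = n_kernels
        # Step 3: 120 kernels, assumed split evenly over window sizes 3, 4, 5.
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_kernels // 3, k, padding=k // 2)
            for k in (3, 4, 5))
        # Step 5: 256 hidden units, doubled by bidirectionality -> 512.
        self.gru = nn.GRU(2 * n_kernels, 256, bidirectional=True,
                          batch_first=True)
        self.dropout = nn.Dropout(0.5)              # step 6 (rate assumed)
        self.fc = nn.Linear(512, n_pronunciations)  # step 7: matrix S

    def context_features(self, ctx):
        """Steps 3-4 for one side: ctx is (batch, context length, emb_dim)."""
        if ctx.size(1) == 0:   # polyphone at a sentence edge: zero padding
            return ctx.new_zeros(ctx.size(0), self.n_kernels)
        x = ctx.transpose(1, 2)                     # Conv1d wants (B, C, L)
        # Average every kernel's outputs to get one whole-span feature value.
        return torch.cat([conv(x).mean(dim=2) for conv in self.convs], dim=1)

    def forward(self, left_ctx, right_ctx):
        # Step 4: splice C1 and C2 into the 240-dimensional vector P.
        p = torch.cat([self.context_features(left_ctx),
                       self.context_features(right_ctx)], dim=1)
        g, _ = self.gru(p.unsqueeze(1))             # step 5: vector G (512)
        q = self.fc(self.dropout(g.squeeze(1)))     # steps 6-7: D, then Q
        return F.softmax(q, dim=1)                  # step 8: probabilities

# Tiny demo: one polyphone with 4 characters of left and 3 of right context.
net = PolyphoneNet()
probs = net(torch.randn(1, 4, 300), torch.randn(1, 3, 300))
print(probs.shape, float(probs.sum()))  # torch.Size([1, 10]) ~1.0
```

In actual training one would feed the pre-Softmax vector Q to the cross-entropy loss of step 9 rather than the probabilities; the sketch applies Softmax only to mirror the narration above.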
The training set adopted by the invention contains 500,000 sentences covering 150 common polyphones; 475,000 sentences are used as the training set and 25,000 as the test set. After 20 rounds of training, the accuracy on the test set reaches 96%.
After training, the input text is obtained; if it contains polyphones, it is input into the polyphone prediction model to obtain the polyphone pronunciations. Meanwhile, the single-tone pronunciations of the input text are labeled by means of a dictionary program and spliced with the polyphone pronunciations to obtain the complete pronunciation labeling of the input text.
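A sketch of this splicing step, with an illustrative dictionary and a stub standing in for the trained model (all names and table contents here are assumptions):

```python
# Hypothetical end-to-end labeling: the model handles the polyphones, a
# dictionary lookup handles single-tone characters, and the results are
# spliced in text order.
MONO_DICT = {"请": "qing3", "尽": "jin3", "快": "kuai4", "钱": "qian2"}

def annotate(text, polyphones, predict):
    """predict(text, i) -> model-predicted pinyin for the polyphone at i."""
    return " ".join(
        predict(text, i) if ch in polyphones else MONO_DICT.get(ch, "<oov>")
        for i, ch in enumerate(text))

# Stub prediction standing in for the trained model:
print(annotate("请尽快还钱", {"还"}, lambda t, i: "huan2"))
# -> "qing3 jin3 kuai4 huan2 qian2"
```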
Example two
The invention discloses a polyphone pronunciation prediction device. Referring to fig. 3, the device comprises a polyphone prediction module for importing the input text into a trained polyphone prediction model and acquiring the pronunciations of the polyphones in the text;
the single-tone character pronunciation marking module is used for marking the pronunciation of the input text to obtain the pronunciation of the single-tone character;
and the pronunciation combination module is used for combining the pronunciation of the single-tone character and the pronunciation of the polyphone character according to the text sequence and outputting the whole text pronunciation.
Referring to fig. 4, the polyphone prediction module includes:
the input layer is used for inputting a training text containing polyphones and outputting a labeled data text;
the pre-training layer is used for inputting the labeled data text into a pre-training language model and acquiring vector representation of the data;
the convolution layer is used for cyclically convolving convolution kernels with the output vectors of the pre-training layer to obtain two vectors, one from convolving the context on each side of the polyphone's position;
the splicing layer is used for splicing the two vectors output by the convolution layer;
the GRU network layer is used for selectively resetting and updating the vectors output by the splicing layer;
the Dropout layer is used for randomly inactivating the output vector of the GRU network layer;
the fully connected layer is used for converting the multi-dimensional vector output by the Dropout layer into a one-dimensional vector;
and the output layer is used for mapping each vector element output by the fully connected layer, through a function, to the probability of the corresponding pronunciation and outputting the pronunciation with the highest probability. In this embodiment, the output layer comprises a Softmax function.
The polyphone prediction module trains the model with a stochastic gradient descent algorithm in each iteration and evaluates the model's degree of fit with a cross-entropy loss function.
In the field of speech synthesis, the device further comprises a speech synthesis module for synthesizing the pronunciations output by the pronunciation combination module into speech and outputting audio. The device can be widely applied in fields such as intelligent customer service voice interaction, audiobook reading and barrier-free broadcasting.
In the polyphone prediction module, the input layer obtains the training text containing polyphones, and the pre-training layer processes the training text into vector data, which is used for batch iterative training. Specifically, the convolution layer abstracts text features: several convolution kernels are cyclically convolved with the input vectors to produce output vectors, with different kernels learning different features. The characters before and after a polyphone are the key information influencing its pronunciation, so the text on each side of the polyphone is input into the convolution layer separately, producing two output vectors. The splicing layer splices the two vectors together. The GRU network layer selectively resets and updates the vectors output by the splicing layer; it handles the long-term memory and gradient vanishing problems and learns sequence features well. The Dropout layer randomly zeroes some vector components during training, reducing the complexity of the network and effectively preventing overfitting. After these operations a multi-dimensional vector is output; the fully connected layer converts it into a one-dimensional vector, mapping all the features into one vector. The Softmax function, commonly used in the output layer of multi-class neural networks, maps the input vector to values between 0 and 1 that represent the probability of each class; the output layer uses this function to give the probability of each pronunciation of the polyphone and outputs the pronunciation with the highest probability.
Example three
The invention discloses a computer-readable storage medium comprising a set of computer-executable instructions which, when executed, perform the polyphone pronunciation prediction method of the first embodiment.
The embodiments above are preferred embodiments of the present invention, and the protection scope of the present invention is not limited by them; therefore, all equivalent changes made according to the structure, shape and principle of the invention are covered by its protection scope.
Claims (7)
1. A polyphone pronunciation prediction method is characterized by comprising the following steps:
importing the input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the input text;
performing single-tone character pronunciation labeling on an input text to acquire single-tone character pronunciation;
combining the single-tone character pronunciation and the polyphone pronunciation according to the text sequence, and outputting the whole text pronunciation; wherein,
the training of the polyphone prediction model comprises the following steps:
inputting a training text containing polyphones, marking corresponding correct pronunciation, and outputting a data text corresponding to the training text; inputting a data text into a pre-training language model to obtain vector representation of data; inputting the vector into a deep learning model to perform batch iterative training to obtain a polyphone prediction model;
marking the corresponding correct pronunciation comprises labeling the polyphones in the training text with their correct pronunciations and labeling the single-tone characters with a symbol;
the deep learning model cyclically convolves convolution kernels with the input vectors to obtain two vectors, one from convolving the context on each side of the polyphone's position, splices the two vectors and inputs them into a GRU network for resetting and updating, randomly inactivates the output vector of the GRU network to output a multi-dimensional vector, converts the multi-dimensional vector into a one-dimensional vector, maps each element of the one-dimensional vector through a function to the probability of the corresponding pronunciation, and outputs the pronunciation with the highest probability.
2. A polyphonic pronunciation prediction method according to claim 1, characterized by: the pre-training model is a Word2Vec or BERT model.
3. A polyphonic pronunciation prediction method according to claim 2, characterized by: the training of the polyphone prediction model comprises training the model with a stochastic gradient descent algorithm in each iteration and evaluating the model's degree of fit with a cross-entropy loss function.
4. A polyphone pronunciation prediction device, comprising:
the polyphone prediction module is used for importing the input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the input text;
the single-tone character pronunciation marking module is used for marking the pronunciation of the input text to obtain the pronunciation of the single-tone character;
the pronunciation combination module is used for combining the pronunciation of the single-tone character and the pronunciation of the polyphone character according to the text sequence and outputting the whole text pronunciation;
the polyphone prediction module comprises:
the input layer is used for inputting a training text containing polyphones, marking corresponding correct pronunciation and outputting a data text corresponding to the training text;
the pre-training layer is used for inputting the data text into the pre-training language model and acquiring the vector representation of the data;
the convolution layer is used for cyclically convolving convolution kernels with the output vectors of the pre-training layer to obtain two vectors, one from convolving the context on each side of the polyphone's position;
the splicing layer is used for splicing the two vectors output by the convolution layer;
the GRU network layer is used for selectively resetting and updating the vectors output by the splicing layer;
the Dropout layer is used for randomly inactivating the output vector of the GRU network layer;
the fully connected layer is used for converting the multi-dimensional vector output by the Dropout layer into a one-dimensional vector;
and the output layer is used for mapping each vector element output by the fully connected layer, through a function, to the probability of the corresponding pronunciation and outputting the pronunciation with the highest probability.
5. A polyphonic pronunciation prediction device according to claim 4, wherein: the polyphone prediction module trains the model with a stochastic gradient descent algorithm in each iteration and evaluates the model's degree of fit with a cross-entropy loss function.
6. A polyphonic pronunciation prediction device according to claim 5, wherein: the device further comprises a speech synthesis module for synthesizing the pronunciations output by the pronunciation combination module into speech and outputting audio.
7. A computer-readable storage medium characterized by: comprising a set of computer executable instructions for performing a polyphonic pronunciation prediction method as claimed in any one of claims 1 to 3 when executed.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202010727658.8A | 2020-07-27 | 2020-07-27 | Polyphone pronunciation prediction method and device and computer readable storage medium |
Publications (1)
| Publication Number | Publication Date |
| --- | --- |
| CN111599340A | 2020-08-28 |
Family (ID=72186722)
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | Status |
| --- | --- | --- | --- | --- |
| CN202010727658.8A | Polyphone pronunciation prediction method and device and computer readable storage medium | 2020-07-27 | 2020-07-27 | Pending |

Country Status (1)
| Country | Link |
| --- | --- |
| CN | CN111599340A (en) |
Citations (5)
| Publication Number | Priority Date | Publication Date | Title |
| --- | --- | --- | --- |
| CN103578464A (en) * | 2013-10-18 | 2014-02-12 | Language model establishing method, speech recognition method and electronic device |
| JP2017208097A (en) * | 2016-05-20 | 2017-11-24 | Ambiguity avoidance method of polyphonic entity and ambiguity avoidance device of polyphonic entity |
| CN110277085A (en) * | 2019-06-25 | 2019-09-24 | Determine the method and device of polyphone pronunciation |
| CN110782870A (en) * | 2019-09-06 | 2020-02-11 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
| CN110909879A (en) * | 2019-12-09 | 2020-03-24 | Auto-regressive neural network disambiguation model, training and using method, device and system |
Cited By (16)
| Publication Number | Priority Date | Publication Date | Title |
| --- | --- | --- | --- |
| CN111737957B (en) * | 2020-08-25 | 2021-06-01 | Chinese character pinyin conversion method and device, electronic equipment and storage medium |
| CN111737957A (en) * | 2020-08-25 | 2020-10-02 | Chinese character pinyin conversion method and device, electronic equipment and storage medium |
| CN112348073A (en) * | 2020-10-30 | 2021-02-09 | Polyphone recognition method and device, electronic equipment and storage medium |
| CN112348073B (en) * | 2020-10-30 | 2024-05-17 | Multi-tone character recognition method and device, electronic equipment and storage medium |
| JP7441864B2 (en) | 2020-12-10 | 2024-03-01 | Methods, devices, equipment, and storage media for predicting polyphonic pronunciation |
| WO2022121166A1 (en) * | 2020-12-10 | 2022-06-16 | Method, apparatus and device for predicting heteronym pronunciation, and storage medium |
| JP2023509257A (en) | 2020-12-10 | 2023-03-08 | Method, apparatus, equipment, and storage medium for predicting polyphonic pronunciation |
| CN112580335A (en) * | 2020-12-28 | 2021-03-30 | Method and device for disambiguating polyphone |
| CN112580335B (en) * | 2020-12-28 | 2023-03-24 | Method and device for disambiguating polyphone |
| CN112735376A (en) * | 2020-12-29 | 2021-04-30 | Self-learning platform |
| CN112966476A (en) * | 2021-04-19 | 2021-06-15 | Text processing method and device, electronic equipment and storage medium |
| CN112966476B (en) * | 2021-04-19 | 2022-03-25 | Text processing method and device, electronic equipment and storage medium |
| CN114742044A (en) * | 2022-03-18 | 2022-07-12 | Information processing method and device and electronic equipment |
| CN115273809A (en) * | 2022-06-22 | 2022-11-01 | Training method of polyphone pronunciation prediction network, and speech generation method and device |
| CN116266266B (en) * | 2022-11-08 | 2024-02-20 | Multi-tone word disambiguation method, device, equipment and storage medium |
| CN116266266A (en) * | 2022-11-08 | 2023-06-20 | Multi-tone word disambiguation method, device, equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
| --- | --- | --- | --- |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200828 |