
CN111599340A - Polyphone pronunciation prediction method and device and computer readable storage medium - Google Patents

Polyphone pronunciation prediction method and device and computer readable storage medium

Info

Publication number
CN111599340A
CN111599340A (application CN202010727658.8A)
Authority
CN
China
Prior art keywords
pronunciation
polyphone
text
training
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010727658.8A
Other languages
Chinese (zh)
Inventor
司马华鹏
王培雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Guiji Intelligent Technology Co ltd
Original Assignee
Nanjing Guiji Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Guiji Intelligent Technology Co ltd filed Critical Nanjing Guiji Intelligent Technology Co ltd
Priority to CN202010727658.8A
Publication of CN111599340A
Legal status: Pending


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a polyphone pronunciation prediction method in the technical field of computer speech processing, aiming to solve the low accuracy of polyphone pronunciation labeling in the prior art. The technical scheme is as follows: a large amount of text containing polyphones, together with the full pinyin of the text, is collected; a polyphone prediction model is obtained by batch iterative training on the designed model; in a text pronunciation labeling system, the text input by a user is obtained, the pronunciation of its polyphones is predicted with the polyphone prediction model, the pinyin of single-tone characters is obtained by table lookup, and the pinyin corresponding to the text is spliced and output. By using a deep neural network to learn the context information of the text when predicting polyphone pronunciations, the method and device improve the accuracy of polyphone pronunciation prediction.

Description

Polyphone pronunciation prediction method and device and computer readable storage medium
Technical Field
The invention relates to the technical field of computer speech processing, and in particular to a polyphone pronunciation prediction method.
Background
Speech synthesis, a technique that lets a computer synthesize speech from text content, enables machines to speak and is key to improving the human-computer interaction experience. Deep learning techniques have entered the field of speech synthesis and achieved good results. The invention converts Chinese text containing polyphones into the correct pinyin, a key step in speech synthesis.
Existing pronunciation prediction methods for polyphones mainly include: 1. always choosing the pronunciation with the highest frequency, which has obviously low accuracy; 2. building a polyphone lexicon and corpus and resolving polyphones by phrase matching, which is limited by the size of the corpus: a pure lexicon cannot handle a single character or word with multiple pronunciations, and an overly large corpus introduces ambiguous-match errors; 3. having linguists write rules and then training recognizers that combine the rules with machine-learning methods such as decision trees, where the rules are difficult to formulate. The accuracy of existing polyphone pronunciation prediction is therefore low.
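To illustrate why the first approach above falls short, a minimal sketch of a frequency-based baseline follows; the table contents and function name are hypothetical, for illustration only:

```python
# Hypothetical frequency-based baseline for polyphone labeling (illustrative only).
# Each polyphone maps to its most frequent pronunciation, ignoring context --
# which is exactly why this method's accuracy is low.
MOST_FREQUENT = {
    "还": "hai2",   # also read huan2 ("to return/repay"), lost by this baseline
    "行": "xing2",  # also read hang2
}

def baseline_pinyin(char: str) -> str:
    """Return the single most frequent pronunciation, context ignored."""
    return MOST_FREQUENT.get(char, "<unk>")
```

For example, `baseline_pinyin("还")` always returns "hai2", even in a context where "huan2" is the correct reading.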
Disclosure of Invention
The invention aims to provide a polyphone pronunciation prediction method, a polyphone pronunciation prediction device and a computer readable storage medium.
The above object of the present invention is achieved by the following technical solutions:
a polyphone pronunciation prediction method comprises the following steps:
importing the input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the input text;
performing single-tone character pronunciation labeling on an input text to acquire single-tone character pronunciation;
combining the single-tone character pronunciation and the polyphone pronunciation according to the text sequence, and outputting the whole text pronunciation; wherein,
the training of the polyphone prediction model comprises the following steps:
inputting a training text containing polyphones, marking corresponding correct pronunciation, and outputting a data text corresponding to the training text; inputting a data text into a pre-training language model to obtain vector representation of data; inputting the vector into a deep learning model to perform batch iterative training to obtain a polyphone prediction model;
marking the corresponding correct pronunciation comprises labeling the polyphonic characters in the training text with their correct pronunciations and labeling the monophonic characters with a placeholder symbol.
The invention is further configured to: in the deep learning model, convolution kernels are slid over the input vectors so that the contexts before and after the polyphone each yield a convolution feature vector; the two vectors are spliced, the spliced vector is input into a GRU network for resetting and updating, the output vector of the GRU network is randomly inactivated to give a multi-dimensional vector, the multi-dimensional vector is converted into a one-dimensional vector, each element of the one-dimensional vector is mapped by a function to the probability of the corresponding pronunciation, and the pronunciation with the highest probability is output.
The invention is further configured to: the pre-training model is a Word2Vec or BERT model.
The invention is further configured to: the training of the polyphone prediction model comprises the steps of training the model by adopting a random gradient descent algorithm in each iteration, and evaluating the fitting degree of the model by adopting a cross entropy loss function.
The second aim of the invention is realized by the following technical scheme:
a polyphone pronunciation prediction device comprising:
the polyphone prediction module is used for importing the input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the input text;
the single-tone character pronunciation marking module is used for marking the pronunciation of the input text to obtain the pronunciation of the single-tone character;
and the pronunciation combination module is used for combining the pronunciation of the single-tone character and the pronunciation of the polyphone character according to the text sequence and outputting the whole text pronunciation.
The invention is further configured to: the polyphone prediction module comprises:
the input layer is used for inputting a training text containing polyphones, marking corresponding correct pronunciation and outputting a data text corresponding to the training text;
the pre-training layer is used for inputting the marked text into a pre-training language model and acquiring vector representation of data;
the convolution layer is used for performing convolution operation on the circulation and the output vector of the pre-training layer to obtain two vectors obtained by performing convolution operation on the context of the position where the polyphone is located;
the splicing layer is used for splicing the two vectors output by the convolution layer;
the GRU network layer is used for selectively resetting and updating the vectors output by the splicing layer;
the Dropout layer is used for randomly inactivating the output vector of the GRU network layer;
the full connection layer is used for converting the multidimensional vector output by the Dropout layer into a one-dimensional vector;
and the output layer is used for mapping the vector elements output by the full connection layer to the corresponding probabilities of the pronunciations by using the function and outputting the pronunciations with the maximum probabilities.
The invention is further configured to: the polyphone prediction module adopts a random gradient descent algorithm to train the model in each iteration and adopts a cross entropy loss function to evaluate the fitting degree of the model.
The invention is further configured to: the voice synthesis module is used for synthesizing the pronunciation output by the pronunciation combination module into voice and outputting audio.
The third object of the invention is realized by the following technical scheme:
a computer-readable storage medium comprising a set of computer-executable instructions that, when executed, perform a polyphonic pronunciation prediction method as described above.
In conclusion, the beneficial technical effects of the invention are as follows: the text input by the user is obtained, the positions of polyphones in the text are detected, the pronunciations of the polyphones are predicted, the pinyin of the single-tone characters is obtained by table lookup, and the pinyin corresponding to the text is spliced and output; combining a lexicon with deep learning improves the accuracy of converting polyphones into pinyin.
Drawings
FIG. 1 is an overall flow chart of a first embodiment of the present invention;
FIG. 2 is a flowchart illustrating the training of a polyphonic prediction model according to an embodiment of the present invention;
FIG. 3 is a block diagram showing the overall structure of a second embodiment of the present invention;
fig. 4 is a block diagram of a polyphone prediction module according to a second embodiment of the present invention.
Detailed Description
Example one
The invention discloses a polyphone pronunciation prediction method. It can be used for front-end text processing in speech recognition and in fields such as speech synthesis that require phonetic labeling of polyphones, and can be applied to electronic equipment such as computers, servers and vehicle-mounted terminals. Further, the invention may be applied in a Direct Memory Access (DMA) link or in other connected scenarios, which is not limited here.
Referring to fig. 1, the method includes the steps of: importing an input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the text; performing single-tone character pronunciation labeling on an input text to acquire single-tone character pronunciation; and combining the single-tone character pronunciation and the polyphone pronunciation according to the text sequence, and outputting the whole text pronunciation.
In this embodiment, the input text may be input by a user through a device, such as a mobile device like a smart phone or a tablet, or an input device like a mouse or a keyboard, or may be text obtained by an automatic speech recognition technology.
It should be noted that the two steps, importing the input text into the polyphone prediction model and performing phonetic annotation of the single-tone characters, have no fixed order and may even be performed simultaneously.
There are roughly more than six hundred polyphones among Chinese characters, of which more than one hundred are in frequent use; single-tone characters and some rare polyphones can be resolved by lexicon lookup without model prediction. Therefore, in an alternative, the polyphone prediction model records only the common polyphones, which improves the practicability of the model, while rare polyphones are recorded in the lexicon for lookup-based labeling. When a recorded common polyphone appears in the text, its position is marked to facilitate subsequent prediction.
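The split described above (model prediction for common polyphones, lexicon lookup for everything else) can be sketched as follows; the character sets and table contents are hypothetical examples, not the patent's actual lexicon:

```python
# Sketch of the lexicon split: common polyphones go to the prediction model,
# single-tone and rare polyphonic characters are resolved by table lookup.
# All table contents here are hypothetical examples.
COMMON_POLYPHONES = {"还", "行", "长"}       # handled by the prediction model
LEXICON = {"钱": "qian2", "请": "qing3"}     # single-tone / rare polyphones

def mark_polyphone_positions(text: str):
    """Return indices of characters the model must predict."""
    return [i for i, ch in enumerate(text) if ch in COMMON_POLYPHONES]
```

For a sentence such as "请还钱", only index 1 (the character "还") would be routed to the model.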
Referring to fig. 2, the training of the polyphonic prediction model includes the steps of:
s1, inputting a training text containing polyphone characters, marking the correct pronunciation corresponding to the training text, and outputting a data text corresponding to the training text, wherein the data text is a character text without the pronunciation;
s2, inputting the data text into a pre-training language model, obtaining vector representation of the data, and obtaining prior knowledge of the pre-training model;
and S3, inputting the vectors into the deep learning model to perform batch iterative training to obtain a polyphone prediction model.
In step S1, the training text is obtained as follows: a text data set A1 is collected from real speech-recognition application scenes; open corpora such as Sogou news and microblog text are selected to obtain a corpus data set A2; sentences containing polyphones are selected from both to form a corpus set; initial pronunciation labels are produced with a dictionary; and the final training corpus set A is obtained after manual inspection and correction.
In particular, in this scheme, single-tone characters in the training text are labeled with a special symbol and polyphonic characters are labeled with their correct pronunciation. For example, a sentence meaning "please return the money as soon as possible" is labeled "hai2 NA huan2 NA", where "2" denotes the tone (the same rule applies below and is not repeated). This labeling method is simple, clear and easy to apply; training text labeled in this way can be used for subsequent polyphone model training, and comparing the pinyin predicted by the model with the labeled pinyin to compute the loss makes training more efficient.
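A minimal sketch of parsing this labeling scheme, assuming the labels are space-separated and "NA" marks a single-tone slot (the function name is an invented helper):

```python
# Sketch of the labelling scheme described above: polyphonic characters carry
# their pinyin, single-tone characters carry the placeholder symbol "NA".
def parse_labels(label_line: str):
    """Split a space-separated label line; None marks a single-tone slot."""
    return [None if tok == "NA" else tok for tok in label_line.split()]

labels = parse_labels("hai2 NA huan2 NA")
```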
In step S2, the pre-training model is preferably a Word2Vec or BERT model. These models can be trained directly on large amounts of text data, learning co-occurrence and ordering knowledge of the words and characters in the text. After training, the pre-trained language model outputs a vector representation for an input word or phrase, and differences between vector values reflect relationships between the meanings of different words. In particular, if another pre-training model is used, the training text must first be converted into that model's labeled-data format before being input. Before pre-trained models were adopted, text was encoded with one-hot vectors, which give all characters the same weight and carry no further prior information; this scheme instead uses a pre-trained model to improve the training effect. The pre-trained model is obtained by training on a large amount of unlabeled data; using it for feature extraction supplies context information learned from massive text, represents different characters with different vectors, and greatly alleviates the problem of limited labeled training data.
In step S3, batch iterative training is a common training method in deep learning: training data are fed into the neural network in batches, and the batch size and number of iterations are chosen experimentally according to machine performance and the observed training effect.
In the deep learning model, convolution kernels are slid over the input vectors, the contexts before and after the polyphone each yield a feature vector, and the two vectors are spliced and input into a GRU (Gated Recurrent Unit) network for resetting and updating. The GRU is a recurrent network; its two gating mechanisms, the reset gate and the update gate, selectively reset and update the input vectors during model training. Compared with network models such as the multilayer perceptron and the convolutional neural network, the GRU handles long-term memory and vanishing gradients well and learns sequence features better.
The GRU output vector is then randomly inactivated, yielding a multi-dimensional vector. When the network is too complex and the training data too scarce, the model easily over-learns: accuracy is high during training but low in the application scene. Random inactivation sets some vector components to zero at random during training, reducing the complexity of the network and effectively preventing overfitting.
Finally, the randomly inactivated multi-dimensional vector is converted into a one-dimensional vector, each element of the one-dimensional vector is mapped by a function to the probability of the corresponding pronunciation, and the pronunciation with the highest probability is output.
The implementation principle of the above embodiment is as follows: training a polyphone prediction model, comprising:
1. and acquiring a training corpus A.
2. A Word2Vec pre-trained language model is used; it can be trained on the Sogou news corpus, which covers the common Chinese characters, with a word-vector dimension of 300. The training corpus A is passed through the word-vector matrix W to obtain the vector matrix X of the input corpus. X is three-dimensional: the first dimension is the number of samples, the second is the sentence length, and the third is 300, the word-vector dimension.
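Step 2 can be sketched in numpy as a simple embedding lookup; the vocabulary size, token ids and random matrix W are stand-ins for a trained Word2Vec matrix:

```python
import numpy as np

# Sketch of step 2: turning token ids into the three-dimensional matrix X
# (samples x sentence length x embedding dim).  W is a random stand-in for
# a trained Word2Vec word-vector matrix.
rng = np.random.default_rng(0)
vocab_size, embed_dim = 1000, 300
W = rng.normal(size=(vocab_size, embed_dim))          # pre-trained word vectors

token_ids = np.array([[3, 17, 42, 0], [5, 9, 0, 0]])  # 2 padded sentences
X = W[token_ids]                                       # shape (2, 4, 300)
```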
3. Convolution is performed with kernel sizes 3, 4 and 5, the kernel size being the window size. Each kernel is slid over the input vector matrix X; one convolution yields the feature value for the corresponding window, and the values of each kernel over all positions are then averaged to obtain a feature value for the whole sentence, so the feature is not just local information within a single window but feature information of the whole sentence. 120 convolution kernels extract different features, giving 120 feature values that are spliced into a vector C.
4. The pronunciation of a polyphone in a sentence is determined by its context. The text before and after the polyphone is processed as in step 3 to obtain two 120-dimensional feature vectors C1 and C2, which the splicing layer concatenates into a 240-dimensional vector P. The special cases at the beginning and end of a sentence are zero-padded: if the polyphone is at the beginning of the sentence, a 120-dimensional zero vector is used in front; if it is at the end, a 120-dimensional zero vector is used behind.
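Steps 3 and 4 can be sketched as follows, assuming for brevity a single window size of 3 rather than the 3/4/5 mix, with random stand-in kernels and context vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, n_kernels, win = 300, 120, 3

def context_features(ctx: np.ndarray) -> np.ndarray:
    """Convolve one context (length x embed_dim) with n_kernels windows of
    size `win` and average over positions -> one 120-dim sentence-level
    feature.  A zero vector is returned when the context is too short,
    mirroring the zero-padding at sentence boundaries described above.
    Kernels are random stand-ins redrawn per call (a shape demo only)."""
    if ctx.shape[0] < win:
        return np.zeros(n_kernels)
    kernels = rng.normal(size=(n_kernels, win, embed_dim))
    T = ctx.shape[0] - win + 1
    feats = np.array([[np.sum(kernels[k] * ctx[t:t + win]) for t in range(T)]
                      for k in range(n_kernels)])
    return feats.mean(axis=1)          # average over window positions

left = rng.normal(size=(6, embed_dim))    # characters before the polyphone
right = rng.normal(size=(4, embed_dim))   # characters after it
P = np.concatenate([context_features(left), context_features(right)])  # 240-dim
```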
5. The vector P is input into a bidirectional GRU network, a network suited to sequence-learning problems that handles long-term memory and vanishing gradients well. The GRU uses 256 hidden units; since it is bidirectional, the output dimension is doubled, giving a vector G of dimension 512.
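A single GRU step with the standard gating equations (Cho et al. convention) can be sketched as below; the weights are random stand-ins, and a real implementation would also use bias terms and iterate over a sequence:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 240, 256   # input = spliced vector P, 256 hidden units per direction

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Random stand-in weights for one GRU step (biases omitted for brevity).
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(d_in, d_h)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(scale=0.1, size=(d_h, d_h)) for _ in range(3))

def gru_step(x, h):
    z = sigmoid(x @ Wz + h @ Uz)            # update gate
    r = sigmoid(x @ Wr + h @ Ur)            # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)
    return (1 - z) * h + z * h_cand         # selectively keep / update state

h_fwd = gru_step(rng.normal(size=d_in), np.zeros(d_h))
h_bwd = gru_step(rng.normal(size=d_in), np.zeros(d_h))
G = np.concatenate([h_fwd, h_bwd])          # bidirectional output, 512-dim
```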
6. The vector G is randomly inactivated: a certain proportion of network connections are randomly dropped to reduce network complexity and the possibility of overfitting. This layer acts only during training and is disabled in formal use. The output is a vector D.
7. The vector D has dimension 512. For the final pronunciation prediction one last conversion is needed: a conversion matrix S maps the vector to dimension m, the total number of polyphone pronunciations, each dimension corresponding to one pronunciation, and a vector Q is output.
8. The Softmax function converts the vector Q into a set of numbers between 0 and 1, each representing the probability of the corresponding pronunciation. The Softmax function is defined as:

softmax(z_i) = e^(z_i) / Σ_{j=1..m} e^(z_j)

where e is the natural constant, z_i is the i-th component of the vector Q, m is the dimension of Q, and i takes integer values from 1 to m.
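A numerically stable numpy sketch of this Softmax step (the input values are illustrative stand-ins for the vector Q):

```python
import numpy as np

def softmax(q: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: shift by the max before exponentiating."""
    e = np.exp(q - q.max())
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
best = int(np.argmax(probs))   # index of the most probable pronunciation
```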
9. The model is trained with a stochastic gradient descent algorithm, and the degree of model fitting is evaluated with the cross-entropy loss function:

L(p, q) = -Σ_{i=1..m} q_i · log(p_i)

where p is the predicted distribution and q the true distribution; the smaller the value of the function, the better the model fits. The model is trained with stochastic gradient descent based on this loss function.
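A sketch of one stochastic-gradient-descent update under this loss, restricted for brevity to the output matrix S; shapes follow the description above, and all values are random stand-ins:

```python
import numpy as np

def cross_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """-sum(q * log p): q is the one-hot true label, p the predicted probs."""
    return float(-np.sum(q * np.log(p + 1e-12)))

rng = np.random.default_rng(0)
d, m = 512, 5                        # GRU output dim, number of pronunciations
S = rng.normal(scale=0.1, size=(d, m))   # conversion matrix, random stand-in
D = rng.normal(size=d)               # dropout-layer output for one sample
q = np.eye(m)[2]                     # one-hot true pronunciation

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

p = softmax(D @ S)
loss_before = cross_entropy(p, q)
S -= 0.01 * np.outer(D, p - q)       # SGD step: dL/dS = D^T (p - q) here
loss_after = cross_entropy(softmax(D @ S), q)
```

For this single sample the loss is convex in S, so the gradient step strictly reduces it.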
The training set adopted by the invention contains 500,000 sentences covering 150 common polyphones; 475,000 sentences are used for training and 25,000 for testing. After 20 rounds of training, the accuracy on the test set reaches 96%.
After training, the input text is obtained; if it contains polyphones, they are fed into the polyphone prediction model to obtain the polyphone pronunciations. Meanwhile, the single-tone characters of the input text are labeled via a dictionary lookup, and the single-tone pronunciations are spliced with the polyphone pronunciations to obtain the complete pronunciation labels of the input text.
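The splicing of model predictions and dictionary lookups can be sketched end to end; the dictionary contents and the stand-in model below are hypothetical:

```python
# End-to-end labelling sketch: model output for polyphones, dictionary lookup
# for single-tone characters, then splice in text order.  All tables and the
# fake model are hypothetical stand-ins for the trained components.
MONO_DICT = {"请": "qing3", "钱": "qian2"}
COMMON_POLYPHONES = {"还"}

def fake_model(text: str, i: int) -> str:
    return "huan2"   # stand-in for the trained polyphone prediction model

def annotate(text: str) -> str:
    pinyin = []
    for i, ch in enumerate(text):
        if ch in COMMON_POLYPHONES:
            pinyin.append(fake_model(text, i))          # model prediction
        else:
            pinyin.append(MONO_DICT.get(ch, "<unk>"))   # table lookup
    return " ".join(pinyin)
```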
Example two
The invention discloses a polyphone pronunciation prediction device; referring to fig. 3, it comprises a polyphone prediction module for importing the input text into a trained polyphone prediction model and acquiring the pronunciation of the polyphones in the text;
the single-tone character pronunciation marking module is used for marking the pronunciation of the input text to obtain the pronunciation of the single-tone character;
and the pronunciation combination module is used for combining the pronunciation of the single-tone character and the pronunciation of the polyphone character according to the text sequence and outputting the whole text pronunciation.
Referring to fig. 4, the polyphone prediction module includes:
the input layer is used for inputting a training text containing polyphones and outputting a labeled data text;
the pre-training layer is used for inputting the labeled data text into a pre-training language model and acquiring vector representation of the data;
the convolution layer is used for performing convolution operation on the circulation and the output vector of the pre-training layer to obtain two vectors respectively corresponding to the position context of the polyphone;
the splicing layer is used for splicing the two vectors output by the convolution layer;
the GRU network layer is used for selectively resetting and updating the vectors output by the splicing layer;
the Dropout layer is used for randomly inactivating the output vector of the GRU network layer;
the full connection layer is used for converting the multidimensional vector output by the Dropout layer into a one-dimensional vector;
and the output layer is used for mapping the vector elements output by the full connection layer to the corresponding probabilities of the pronunciations by using the function and outputting the pronunciations with the maximum probabilities. In this embodiment, the output layer includes a Softmax function.
The polyphone prediction module adopts a random gradient descent algorithm to train the model in each iteration and adopts a cross entropy loss function to evaluate the fitting degree of the model.
For the speech synthesis field, the device also comprises a speech synthesis module for synthesizing the pronunciation output by the pronunciation combination module into speech and outputting audio; the device can be widely applied in fields such as intelligent customer-service voice interaction, audio reading, and accessible broadcasting.
In the polyphone prediction module, the input layer obtains a training text with polyphones, the pre-training layer processes the training text into vector data, and batch iterative training is carried out with this vector data. Specifically, the convolutional layer abstracts text features: several convolution kernels are slid over the input vectors to produce output vectors, with different kernels learning different features. The characters before and after the polyphone are the key information influencing its pronunciation, so they are input into the convolutional layer separately, giving two output vectors. The splicing layer splices the two vectors together. The GRU network layer selectively resets and updates the vectors output by the splicing layer; it handles long-term memory and vanishing gradients well and learns sequence features well. The Dropout layer randomly zeroes some vector components during training, reducing the complexity of the network and effectively preventing overfitting. After these operations a multi-dimensional vector is output; the fully connected layer converts it into a one-dimensional vector, mapping all the features into one vector. The Softmax function, commonly used in the output layer of multi-class neural networks, maps the input vector to values between 0 and 1 representing the probability of each class; the output layer uses it to give the probability of each pronunciation of the polyphone and outputs the pronunciation with the highest probability.
EXAMPLE III
The invention discloses a computer-readable storage medium, which comprises a set of computer-executable instructions, and when the instructions are executed, the computer-readable storage medium is used for executing a polyphonic pronunciation prediction method in the first embodiment.
The embodiments above are preferred embodiments of the invention, and the protection scope of the invention is not limited by them: all equivalent changes made according to the structure, shape and principle of the invention fall within the protection scope of the invention.

Claims (7)

1. A polyphone pronunciation prediction method is characterized by comprising the following steps:
importing the input text into a trained polyphone prediction model to obtain the pronunciation of the polyphone in the input text;
performing single-tone character pronunciation labeling on an input text to acquire single-tone character pronunciation;
combining the single-tone character pronunciation and the polyphone pronunciation according to the text sequence, and outputting the whole text pronunciation; wherein,
the training of the polyphone prediction model comprises the following steps:
inputting a training text containing polyphones, marking corresponding correct pronunciation, and outputting a data text corresponding to the training text; inputting a data text into a pre-training language model to obtain vector representation of data; inputting the vector into a deep learning model to perform batch iterative training to obtain a polyphone prediction model;
marking the corresponding correct pronunciation comprises marking polyphonic characters in the training text according to the correct pronunciation, and marking the monophonic characters by using symbols;
in the deep learning model, convolution kernels are slid over the input vectors so that the contexts before and after the polyphone each yield a convolution feature vector; the two vectors are spliced, the spliced vector is input into a GRU network for resetting and updating, the output vector of the GRU network is randomly inactivated to output a multi-dimensional vector, the output multi-dimensional vector is converted into a one-dimensional vector, each element of the one-dimensional vector is mapped by a function to the probability corresponding to each pronunciation, and the pronunciation with the maximum probability is output.
2. A polyphonic pronunciation prediction method according to claim 1, characterized by: the pre-training model is a Word2Vec or BERT model.
3. A polyphonic pronunciation prediction method according to claim 2, characterized by: the training of the polyphone prediction model comprises the steps of training the model by adopting a random gradient descent algorithm in each iteration, and evaluating the fitting degree of the model by adopting a cross entropy loss function.
4. A polyphone pronunciation prediction device, comprising:
a polyphone prediction module, configured to feed the input text into a trained polyphone prediction model to obtain the pronunciations of the polyphones in the input text;
a monophone pronunciation labelling module, configured to label the pronunciation of the input text to obtain the pronunciations of the monophonic characters;
a pronunciation combination module, configured to combine the monophone pronunciations and the polyphone pronunciations in text order and output the pronunciation of the whole text;
wherein the polyphone prediction module comprises:
an input layer, configured to receive a training text containing polyphones, label the corresponding correct pronunciations, and output a data text corresponding to the training text;
a pre-training layer, configured to input the data text into the pre-trained language model and obtain a vector representation of the data;
a convolution layer, configured to cyclically perform convolution operations on the output vectors of the pre-training layer to obtain two vectors, one from convolving the context on each side of the position of the polyphone;
a splicing layer, configured to splice the two vectors output by the convolution layer;
a GRU network layer, configured to selectively reset and update the vectors output by the splicing layer;
a Dropout layer, configured to randomly deactivate the output vector of the GRU network layer;
a fully connected layer, configured to convert the multi-dimensional vector output by the Dropout layer into a one-dimensional vector;
and an output layer, configured to map, through a function, the vector elements output by the fully connected layer to the probabilities of the corresponding pronunciations and output the pronunciation with the maximum probability.
5. The polyphone pronunciation prediction device according to claim 4, wherein the polyphone prediction module trains the model with a stochastic gradient descent algorithm in each iteration and evaluates the fit of the model with a cross-entropy loss function.
6. The polyphone pronunciation prediction device according to claim 5, further comprising: a speech synthesis module, configured to synthesize the pronunciation output by the pronunciation combination module into speech and output the audio.
7. A computer-readable storage medium, comprising a set of computer-executable instructions which, when executed, perform the polyphone pronunciation prediction method according to any one of claims 1 to 3.
CN202010727658.8A 2020-07-27 2020-07-27 Polyphone pronunciation prediction method and device and computer readable storage medium Pending CN111599340A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010727658.8A CN111599340A (en) 2020-07-27 2020-07-27 Polyphone pronunciation prediction method and device and computer readable storage medium


Publications (1)

Publication Number Publication Date
CN111599340A true CN111599340A (en) 2020-08-28

Family

ID=72186722




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200828