CN112786002B - Voice synthesis method, device, equipment and storage medium - Google Patents
Voice synthesis method, device, equipment and storage medium
- Publication number
- CN112786002B (application CN202011597100.9A)
- Authority
- CN
- China
- Prior art keywords
- unit
- voice
- text
- character
- sub
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The application provides a voice synthesis method, a device, equipment and a storage medium, wherein the voice synthesis method comprises the following steps: acquiring a text unit sequence according to a target text, wherein each text unit in the text unit sequence comprises at least one character; segmenting each text unit in the text unit sequence according to syllables to obtain a sub-text unit sequence; determining a voice unit corresponding to each sub-text unit in the sub-text unit sequence to obtain a voice unit sequence; and performing voice synthesis based on the sub-text unit sequence and the voice unit sequence to obtain a synthesized voice corresponding to the target text. The voice synthesis method has a good voice synthesis effect, and is low in labor cost and time cost.
Description
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, apparatus, device, and storage medium.
Background
Speech synthesis technology, also called text-to-speech technology, converts text into speech. It gives a computer the ability to speak as freely as a human, making information exchange between users and machines more comfortable and natural.
Most current speech synthesis schemes are based on a pronunciation dictionary. The general idea is that an expert builds, in advance, a pronunciation dictionary containing the pronunciation sequences of a large number of entries; when speech synthesis is performed, the pronunciation sequences of the entries in the target text to be synthesized are obtained from the pronunciation dictionary, and speech is synthesized according to these pronunciation sequences.
Because such a scheme requires the pronunciation dictionary to be constructed manually, and the dictionary usually has to contain the pronunciation sequences of a very large number of entries, the labor cost and time cost of constructing it are extremely high. Manual construction is also heavily influenced by subjective factors, so wrong pronunciation sequences easily appear in the dictionary, and these wrong pronunciation sequences inevitably degrade the speech synthesis effect.
Disclosure of Invention
In view of this, the present application provides a speech synthesis method, apparatus, device and storage medium, so as to solve the problems that the speech synthesis scheme in the prior art is high in labor cost and time cost and that its synthesis effect is greatly affected by subjective factors. The technical scheme is as follows:
a method of speech synthesis comprising:
acquiring a text unit sequence according to a target text, wherein each text unit in the text unit sequence comprises at least one character;
segmenting each text unit in the text unit sequence according to syllables to obtain a sub-text unit sequence;
determining a voice unit corresponding to each sub-text unit in the sub-text unit sequence to obtain a voice unit sequence;
and performing voice synthesis based on the sub-text unit sequence and the voice unit sequence to obtain a synthesized voice corresponding to the target text.
Optionally, the segmenting each text unit in the text unit sequence according to syllables to obtain sub-text unit sequences includes:
and segmenting each text unit in the text unit sequence according to a preset segmentation principle to obtain a sub-text unit sequence, wherein the segmentation principle is that each sub-text unit corresponds to one syllable.
Optionally, the segmenting each text unit in the text unit sequence includes:
for each text unit to be divided in the target text:
determining the voice attribute corresponding to each character in the text unit to be divided;
and segmenting the text unit to be segmented according to the voice attribute corresponding to each character in the text unit to be segmented.
Optionally, the determining the voice attribute corresponding to each character in the text unit to be divided includes:
and determining the voice attribute corresponding to each character in the text unit to be divided based on the pre-established corresponding relationship between the characters and the voice attributes.
Optionally, the segmenting the text unit to be segmented based on the voice attribute corresponding to each character in the text unit to be segmented includes:
based on the character with the corresponding voice attribute as the first attribute in the text unit to be divided, dividing the text unit to be divided to obtain a first division result;
based on the character of which the corresponding voice attribute is the second attribute in the first cutting result, further cutting the first cutting result;
wherein the first attribute is one of a vowel and a consonant, and the second attribute is the other of the vowel and the consonant.
Optionally, the segmenting the text unit to be segmented based on the character of which the corresponding voice attribute in the text unit to be segmented is the first attribute to obtain a first segmentation result includes:
based on the character of which the corresponding voice attribute in the text unit to be segmented is vowel, segmenting the text unit to be segmented to obtain a first segmentation result;
the further segmentation of the first segmentation result based on the character of which the corresponding voice attribute is the second attribute in the first segmentation result comprises:
and further segmenting the first segmentation result based on the character of which the corresponding voice attribute is consonant in the first segmentation result.
Optionally, the segmenting the text unit to be segmented based on the character of which the corresponding voice attribute in the text unit to be segmented is a vowel to obtain a first segmentation result, including:
adding a segmentation symbol after the corresponding voice attribute in the text unit to be segmented is the character of the vowel to obtain a primary segmentation result;
and adjusting the preliminary segmentation result based on the voice attributes respectively corresponding to the forward adjacent characters and the backward adjacent characters of the segmentation symbol, wherein the adjusted segmentation result is used as a first segmentation result.
Optionally, the further segmenting the first segmentation result based on the character of which the corresponding voice attribute is a consonant in the first segmentation result includes:
if a first target character string consisting of a plurality of continuous characters of which the corresponding voice attributes are consonants exists in the first cutting result, determining whether a second target character string matched with the consonant character string in a preset consonant character string set exists in the first target character string;
if yes, adding a segmentation character in front of the second target character string;
and if not, segmenting the first target character string according to the number of characters contained in the first target character string.
Optionally, the determining the voice unit corresponding to each sub-text unit in the sequence of sub-text units includes:
aiming at each sub-text unit of the voice unit to be determined in the sub-text unit sequence:
determining a voice unit corresponding to the characters in the sub-text unit of the voice unit to be determined based on the characters and the attributes corresponding to the characters in the sub-text unit of the voice unit to be determined and the pre-established corresponding relationship among the voice attributes, the characters and the voice unit;
and acquiring the voice unit corresponding to the sub-text unit of the voice unit to be determined based on the voice unit corresponding to the character in the sub-text unit of the voice unit to be determined.
Optionally, the determining, based on the attributes corresponding to the characters and the characters in the sub-text unit of the voice unit to be determined and the pre-established correspondence between the voice attributes, the characters, and the voice unit, the voice unit corresponding to the characters in the sub-text unit of the voice unit to be determined includes:
aiming at the characters of the voice unit to be determined in the sub-text units of the voice unit to be determined:
according to the voice attribute corresponding to the character of the voice unit to be determined, determining the corresponding relation between the character corresponding to the character of the voice unit to be determined and the voice unit from the corresponding relations among the voice attribute, the character and the voice unit, and taking the corresponding relation as a target corresponding relation;
and determining a voice unit corresponding to the character of the voice unit to be determined according to the corresponding relation between the character of the voice unit to be determined and the target.
Optionally, the determining, according to the correspondence between the characters of the voice unit to be determined and the target, the voice unit corresponding to the characters of the voice unit to be determined includes:
acquiring a voice unit corresponding to the character which is the same as the character of the voice unit to be determined from the target corresponding relation;
and if the number of the acquired voice units is multiple, determining the voice unit corresponding to the character of the voice unit to be determined from the acquired multiple voice units based on the context information of the character of the voice unit to be determined in the sub-text unit of the voice unit to be determined.
Optionally, the target correspondence includes a correspondence between a single character and a speech unit, and a correspondence between a special combined character and a speech unit, where a part of the special combined character is not pronounced;
the determining the voice unit corresponding to the character of the voice unit to be determined according to the corresponding relation between the character of the voice unit to be determined and the target comprises the following steps:
if the characters of the voice unit to be determined are special combination characters, determining the voice unit corresponding to the characters of the voice unit to be determined based on the corresponding relation between the special combination characters and the voice unit in the target corresponding relation;
and if the characters of the voice unit to be determined are single characters, determining the voice unit corresponding to the characters of the voice unit to be determined based on the corresponding relation between the single characters in the target corresponding relation and the voice unit.
Optionally, the determining the voice unit corresponding to each sub-text unit in the sequence of sub-text units includes:
for each sub-text unit of the speech unit to be determined in the sequence of sub-text units:
determining a voice unit corresponding to the characters in the sub-text unit of the voice unit to be determined based on the pre-established corresponding relationship between the characters and the voice unit;
and acquiring the voice unit corresponding to the sub-text unit of the voice unit to be determined based on the voice unit corresponding to the character in the sub-text unit of the voice unit to be determined.
Optionally, performing speech synthesis based on the sub text unit sequence and the speech unit sequence to obtain a synthesized speech corresponding to the target text, including:
performing voice synthesis based on the sub-text unit sequence, the voice unit sequence and a pre-established voice synthesis model to obtain a synthesized voice corresponding to the target text;
the speech synthesis model is obtained by training with training sub-text unit sequences obtained by segmenting text units in a training text according to syllables and training speech unit sequences corresponding to the training sub-text unit sequences as training samples and with real speech corresponding to the training texts as sample labels.
Optionally, the training sub-text unit sequence is marked with languages;
and each sub-text unit in the training sub-text unit sequence is marked with a reading attribute used for indicating whether the corresponding voice unit in the corresponding training voice unit sequence needs to be read weakly.
A speech synthesis apparatus comprising: the system comprises a text unit sequence acquisition module, a text segmentation module, a voice unit determination module and a voice synthesis module;
the text unit sequence acquisition module is used for acquiring a text unit sequence according to a target text, wherein each text unit in the text unit sequence comprises at least one character;
the text segmentation module is used for segmenting each text unit in the text unit sequence according to syllables to obtain a sub-text unit sequence;
the voice unit determining module is used for determining a voice unit corresponding to each sub-text unit in the sub-text unit sequence to obtain a voice unit sequence;
and the voice synthesis module is used for carrying out voice synthesis on the basis of the sub-text unit sequence and the voice unit sequence to obtain the synthetic voice corresponding to the target text.
A speech synthesis apparatus comprising: a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the speech synthesis method described in any one of the above.
A readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the speech synthesis method of any one of the preceding claims.
According to the scheme, the method, the device, the equipment and the storage medium for voice synthesis are characterized in that a text unit sequence is obtained according to a target text, each text unit in the text unit sequence is segmented according to syllables to obtain a sub-text unit sequence formed by sub-text units with definite voice meanings, a voice unit corresponding to each sub-text unit in the sub-text unit sequence is determined, namely, each sub-text unit with definite voice meanings in the sub-text unit sequence is converted into a voice unit to obtain a voice unit sequence, and finally voice synthesis is carried out on the basis of the sub-text unit sequence and the voice unit sequence to obtain synthetic voice corresponding to the target text. Compared with the speech synthesis scheme based on the pronunciation dictionary in the prior art, the speech synthesis method provided by the application does not need to construct pronunciation sequences for a large number of entries, so that labor cost and time cost are greatly reduced, and influence of subjective factors on speech synthesis effect is greatly reduced because pronunciation information of the entries does not need to be manually distinguished.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flowchart of a speech synthesis method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of segmenting a text unit to be segmented into sub-text units corresponding to a syllable according to an embodiment of the present application;
fig. 3 is a schematic flowchart of an implementation manner of determining a speech unit corresponding to a sub-text unit of a speech unit to be determined in a sub-text unit sequence according to an embodiment of the present application;
fig. 4 is a schematic flowchart of another implementation manner of determining a speech unit corresponding to a sub-text unit of a speech unit to be determined in a sequence of sub-text units according to the embodiment of the present application;
FIG. 5 is a schematic flowchart of establishing a speech synthesis model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the prior-art speech synthesis scheme based on a pronunciation dictionary, pronunciation sequences need to be established manually for a large number of entries, so the labor cost and time cost are extremely high; moreover, because the pronunciation information of each entry has to be judged manually, it is greatly influenced by subjective factors.
In view of the problems of the speech synthesis scheme based on the pronunciation dictionary, the inventor of the present invention has studied, and the original idea is:
considering that the problem of the pronunciation dictionary-based speech synthesis scheme is caused by manually constructing the pronunciation dictionary, the inventor tries to abandon the pronunciation dictionary and provide a completely new speech synthesis method, along with the idea that the inventor thinks of a speech synthesis scheme based on a speech synthesis model, the general idea of the scheme is as follows:
modeling is directly carried out between texts and voices, namely, a training text and corresponding real voice data are used for training a voice synthesis model in advance, so that when the target text is subjected to voice synthesis, the trained voice synthesis model is used for directly carrying out voice synthesis on the target text, and synthetic voice corresponding to the target text is obtained.
The inventor of the present invention has found through research that, although the above speech synthesis scheme based on the speech synthesis model solves the problems of the speech synthesis scheme based on the pronunciation dictionary to some extent, it brings new problems:
it can be understood that, in order to obtain a better speech synthesis model, a large amount of training texts are usually required to be used for training, however, the number of training texts is still larger, all terms cannot be covered, and for terms that cannot be covered, the speech synthesis model is likely to be wrong, that is, the above speech synthesis scheme based on the speech synthesis model is not stable enough and is not robust.
In view of the above problems of the speech synthesis scheme based on the speech synthesis model, the present inventors have further studied and finally proposed a more stable speech synthesis method without constructing a pronunciation dictionary (i.e. without manually labeling the pronunciations of the phonemes in the vocabulary entry), which is based on the following concept:
the method comprises the steps of segmenting a text unit in a target text of voice to be synthesized into sub-text units with voice significance according to syllables to obtain a sequence of the sub-text units, converting each sub-text unit with the voice significance in the sequence of the sub-text units into a voice unit to obtain a sequence of the voice unit, and finally performing voice synthesis according to the sequence of the sub-text units and the sequence of the voice unit.
The speech synthesis method provided by the application can be applied to a terminal with data processing capacity, the terminal acquires a target text, and speech synthesis is carried out on the target text according to the speech synthesis method provided by the application, the terminal can comprise a processing component, a memory, an input/output interface and a power supply component, and optionally, the terminal can further comprise a multimedia component, an audio component, a sensor component, a communication component and the like. Wherein:
the processing component is used for data processing, and can perform speech synthesis processing in the present case, and the processing component may include one or more processors, and may further include one or more modules for facilitating interaction with other components.
The memory is configured to store various types of data and may be implemented with any type or combination of volatile or non-volatile memory devices, such as a combination of one or more of Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disks, optical disks, and the like.
The power components provide power to the various components of the terminal, which may include a power management system, one or more power supplies, and the like.
The multimedia component may include a screen, which may preferably be a touch display screen that may receive input signals from a user. The multimedia component may also include a front facing camera and/or a rear facing camera.
The audio component is configured to output and/or input audio signals, e.g., the audio component may include a microphone configured to receive external audio signals, the audio component may further include a speaker configured to output audio signals, and the terminal synthesized speech may be output through the speaker.
The input/output interface is an interface between the processing component and a peripheral interface module, which may be a keyboard, buttons, etc., wherein the buttons may include, but are not limited to, a home button, a volume button, a start button, a lock button, etc.
The sensor assembly may include one or more sensors for providing various aspects of state assessment for the terminal, e.g., the sensor assembly may detect an open/closed state of the terminal, whether a user is in contact with the terminal, orientation, speed, temperature, etc. of the device. The sensor components may include, but are not limited to, one or a combination of more of image sensors, acceleration sensors, gyroscope sensors, pressure sensors, temperature sensors, and the like.
The communication component is configured to facilitate wired or wireless communication between the terminal and other devices. The terminal may access a wireless network based on a communication standard, such as a combination of one or more of WiFi, 2G, 3G, 4G, 5G.
Alternatively, the terminal may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the speech synthesis methods provided herein.
The voice synthesis method provided by the application can also be applied to a server, the server obtains the target text, and voice synthesis is carried out on the target text according to the voice synthesis method provided by the application. The server may include one or more central processors and memory configured to store various types of data, and the memory may be implemented with any type of volatile or non-volatile memory device or combination thereof, such as one or more of a Static Random Access Memory (SRAM), an electrically erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a programmable read-only memory (PROM), a read-only memory (ROM), a magnetic memory, a flash memory, a magnetic disk, an optical disk, and so forth. The server may also include one or more power supplies, one or more wired network interfaces, and/or one or more wireless network interfaces, one or more operating systems.
The speech synthesis method provided by the present application is described next by the following embodiments.
First embodiment
Referring to fig. 1, a schematic flow chart of a speech synthesis method provided in an embodiment of the present application is shown, where the method may include:
step S101: and acquiring a text unit sequence according to the target text.
The target text is a text to be subjected to speech synthesis, the target text may be a text of any language, such as a chinese text, an english text, a german text, a spanish text, and the like, a text unit sequence obtained according to the target text includes at least one text unit, and each text unit includes at least one character.
The process of obtaining the text unit sequence according to the target text is as follows. If the target text is composed of characters (such as an English, German or Spanish text), the target text is segmented directly according to the preset text unit; for example, if the preset text unit is a word, word segmentation is performed on the target text and each obtained word is one text unit, giving the text unit sequence. If the target text is composed of non-characters (for example, a Chinese text, which is generally composed of Chinese characters), the target text is first converted into a text composed of characters, and the converted text is then segmented according to the preset text unit to obtain the text unit sequence. It should be noted that the characters mentioned here are letters with voice attributes (for example, Latin letters, English letters, Pinyin letters, etc.), special symbols, numbers, and the like.
Illustratively, the target text is "I am very happy", and the preset text unit is a word, a text unit sequence "I", "am", "very", "happy" is obtained by segmenting the target text according to the preset text unit, where the text unit "I" includes one character, the text unit "am" includes two characters, the text unit "very" includes four characters, and the text unit "happy" includes five characters.
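As a rough illustration of step S101, the following sketch splits a character-based target text into a word-level text unit sequence; the function name and the whitespace-based word segmentation are illustrative assumptions, not part of the patent.

```python
# Minimal sketch of step S101 (obtaining the text unit sequence), assuming the
# preset text unit is a word and the target text is character-based (e.g. English).
# A non-character text such as Chinese would first be converted into characters
# (e.g. pinyin) before this step; that conversion is not shown here.
def get_text_unit_sequence(target_text: str) -> list[str]:
    return target_text.split()

print(get_text_unit_sequence("I am very happy"))
# ['I', 'am', 'very', 'happy']
```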
Step S102: and segmenting each text unit in the text unit sequence according to syllables to obtain a sub-text unit sequence.
In this embodiment, the number of syllables may be preset, and when each text unit in the text unit sequence is segmented according to a syllable, each text unit in the text unit sequence may be segmented according to the preset number of syllables.
The preset number of syllables is smaller than the number of syllables of the text unit having the most syllables in the text unit sequence, for example, if the number of syllables of the text unit having the most syllables in the target text is 4, the number of syllables may be set to 3, 2 or 1.
Preferably, the number of the preset syllables is 1, that is, each text unit in the text unit sequence is segmented on the principle that each sub-text unit corresponds to one syllable, or each text unit in the text unit sequence is segmented into sub-text units corresponding to one syllable, so that the sub-text unit sequence is obtained, and each sub-text unit in the sub-text unit sequence obtained through segmentation corresponds to one syllable.
Illustratively, suppose the text unit sequence includes three text units, where the first text unit includes two syllables, the second includes one syllable, and the third includes three syllables. Segmenting the first text unit yields two sub-text units, each corresponding to one syllable; because the second text unit includes only one syllable, it does not need to be segmented; segmenting the third text unit yields three sub-text units. A sub-text unit sequence composed of six sub-text units is finally obtained, and each sub-text unit in the sequence corresponds to one syllable.
It should be noted that the sub-text unit corresponding to a syllable is usually a morpheme or sub-morpheme. The morpheme is the smallest combination of voice and semantic meaning and is a meaningful language unit, the word is composed of morphemes, one word may include one morpheme or may include a plurality of morphemes, the morphemes have a text embodiment and a voice embodiment, the voice embodiment of the morphemes refers to corresponding syllables, the voice embodiment of one morpheme may correspond to one syllable or may correspond to a plurality of syllables. Illustratively, "unhappy" includes two morphemes, "un" and "happy," respectively, where "un" corresponds to one syllable and "happy" corresponds to two syllables.
It should be noted that, although the morphemes in a text unit are not easy to be determined, the corresponding syllables are easy to be determined, and a syllable corresponds to a morpheme or sub-morpheme, so this embodiment can use the syllable as the smallest semantic unit, and segment each text unit in the text unit sequence into sub-text units corresponding to a syllable.
Step S103: and determining the voice unit corresponding to each sub-text unit in the sub-text unit sequence to obtain a voice unit sequence.
And the voice unit corresponding to one sub-text unit can represent the pronunciation of the sub-text unit.
Specifically, for each sub-text unit of the to-be-determined speech unit in the sequence of the sub-text units, the speech unit corresponding to each character in the sub-text unit of the to-be-determined speech unit can be determined first, and then the speech unit corresponding to the sub-text unit of the to-be-determined speech unit can be obtained according to the speech unit corresponding to each character in the sub-text unit of the to-be-determined speech unit.
Step S104: and performing voice synthesis based on the sub-text unit sequence and the voice unit sequence to obtain a synthesized voice corresponding to the target text.
Specifically, speech synthesis may be performed based on the sub-text unit sequence, the speech unit sequence, and a pre-established speech synthesis model to obtain a synthesized speech corresponding to the target text.
The speech synthesis model is obtained by taking a training sub-text unit sequence obtained by segmenting text units in a training text and a training speech unit sequence corresponding to the training sub-text unit sequence as training samples and taking real speech corresponding to the training text as a sample label for training.
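A hedged sketch of how one training sample for such a speech synthesis model might be assembled from a training text is given below; the dataclass layout and the helper parameters (segment_by_syllable, to_speech_unit) are illustrative assumptions, not the patent's actual training pipeline.

```python
# Illustrative sketch only: a training sample pairs the training sub-text unit
# sequence and training speech unit sequence (inputs) with the real speech (label).
from dataclasses import dataclass
from typing import Callable

@dataclass
class TrainingSample:
    sub_text_units: list[str]   # e.g. ['em', 'ploy', 'ment']
    speech_units: list[str]     # one speech unit per sub-text unit
    real_speech: bytes          # recorded speech for the training text (label)

def build_sample(training_text: str,
                 real_speech: bytes,
                 segment_by_syllable: Callable[[str], list[str]],
                 to_speech_unit: Callable[[str], str]) -> TrainingSample:
    text_units = training_text.split()
    sub_units = [s for unit in text_units for s in segment_by_syllable(unit)]
    return TrainingSample(sub_units, [to_speech_unit(s) for s in sub_units], real_speech)
```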
The voice synthesis method provided by the embodiment of the application comprises the steps of firstly obtaining a text unit sequence according to a target text, then segmenting each text unit in the text unit sequence according to syllables to obtain a sub-text unit sequence formed by sub-text units with definite voice meanings, then determining a voice unit corresponding to each sub-text unit in the sub-text unit sequence, namely converting each sub-text unit with definite voice meanings in the sub-text unit sequence into a voice unit to obtain a voice unit sequence, and finally performing voice synthesis based on the sub-text unit sequence and the voice unit sequence to obtain synthetic voice corresponding to the target text. Compared with the speech synthesis scheme based on the pronunciation dictionary in the prior art, the speech synthesis method provided by the embodiment of the application does not need to construct pronunciation sequences for a large number of entries, so that the labor cost and the time cost are greatly reduced, and the influence of subjective factors on the speech synthesis effect is greatly reduced because the pronunciation information of the entries does not need to be manually judged.
Second embodiment
The above embodiments mentioned that, when each text unit in the text unit sequence is segmented, each text unit in the text unit sequence can be segmented into sub-text units corresponding to one syllable, and this embodiment focuses on this process.
Referring to fig. 2, a schematic flow chart of segmenting a text unit to be segmented into sub-text units corresponding to a syllable is shown for each text unit to be segmented in a text unit sequence, and may include:
step S201: and determining the voice attribute corresponding to each character in the text unit to be divided.
Regardless of the language, the speech is divided into two speech attributes, namely vowel and consonant, according to pronunciation. The vowel is the sound produced by the airflow passing through the oral cavity without being blocked during the pronunciation process, and the consonant is the sound produced by the airflow being blocked in the oral cavity or the pharynx. In this embodiment, a voice attribute corresponding to each character in the text unit to be cut is determined, that is, whether each character in the text unit to be cut is a vowel character or a consonant character is determined.
Specifically, the process of determining the voice attribute corresponding to each character in the text unit to be divided may include: and determining the voice attribute corresponding to each character in the text unit to be divided based on the pre-established corresponding relationship between the characters and the voice attributes.
The voice synthesis method provided by the application can be applied to a single language scene and can also be applied to a multi-language scene, if the scene is the multi-language scene, the corresponding relation between the characters and the voice attributes is respectively established for a plurality of languages when the corresponding relation is established, and correspondingly, when the voice attribute corresponding to each character in the text unit to be divided is determined, the voice attribute corresponding to each character in the text unit to be divided is determined based on the corresponding relation between the characters and the voice attributes established for the language to which the target text belongs.
It should be noted that there may be a case where one character corresponds to multiple voice attributes in the correspondence relationship between the character and the voice attribute, because some characters may correspond to different voice attributes under different contexts, for example, the voice attribute corresponding to "y" in the english word boy is a vowel, and the voice attribute corresponding to "y" in the english word yellow is a consonant.
In view of this, the process of determining the voice attribute corresponding to the character in the text unit to be divided based on the pre-established correspondence between the character and the voice attribute may include: and aiming at each character with voice attributes to be determined in the text unit to be divided, searching characters which are the same as the characters with the voice attributes to be determined from the corresponding relationship between the pre-established characters and the voice attributes, if the voice attributes corresponding to the searched characters are one, determining the voice attributes corresponding to the searched characters as the voice attributes corresponding to the characters with the voice attributes to be determined, and if the voice attributes corresponding to the searched characters are multiple, determining the voice attributes corresponding to the characters with the voice attributes to be determined from the multiple voice attributes according to the context information of the characters with the voice attributes to be determined. The context information of the character with the voice attribute to be determined may be a voice attribute corresponding to a character before or after the character with the voice attribute to be determined.
Illustratively, the language to which the target text belongs is english, english includes 26 characters, the voice attributes include vowels and consonants, and assuming that vowels are represented by V and consonants are represented by C, the correspondence relationship between characters and voice attributes established in advance for english is shown in the following table:
TABLE 1 correspondence of characters to phonetic attributes
| Character | Voice attribute |
| --- | --- |
| a | V |
| b | C |
| c | C |
| d | C |
| e | V |
| f | C |
| g | C |
| h | C |
| i | V |
| … | … |
| y | V, C |
| … | … |
Assuming that a text unit to be split is "unhappy", it can be obtained that a voice attribute corresponding to "u" is "V", a voice attribute corresponding to "n" is "C", a voice attribute corresponding to "h" is "C", a voice attribute corresponding to "a" is "V", a voice attribute corresponding to "p" is "C", and there are two voice attributes corresponding to "y" in the above table.
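A minimal sketch of step S201 for English is given below; the mapping follows Table 1, and the context rule used for the ambiguous character "y" (consonant word-initially, vowel otherwise) is an illustrative assumption rather than the patent's actual disambiguation logic.

```python
# Sketch of determining the voice attribute (V = vowel, C = consonant) of each
# character in a text unit to be divided, assuming the Table 1 correspondence.
CHAR_TO_ATTRS = {
    **{c: {"C"} for c in "bcdfghjklmnpqrstvwxz"},
    **{c: {"V"} for c in "aeiou"},
    "y": {"V", "C"},            # "y" can be either, depending on context
}

def voice_attributes(text_unit: str) -> list[str]:
    attrs = []
    for idx, ch in enumerate(text_unit.lower()):
        candidates = CHAR_TO_ATTRS[ch]
        if len(candidates) == 1:
            attrs.append(next(iter(candidates)))
        else:
            # Ambiguous character: decide from context. Toy rule: "y" in
            # "yellow" (word-initial) is a consonant, "y" in "boy" is a vowel.
            attrs.append("C" if idx == 0 else "V")
    return attrs

print(voice_attributes("unhappy"))   # ['V', 'C', 'C', 'V', 'C', 'C', 'V']
```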
Step S202: and segmenting the text unit to be segmented according to the voice attribute corresponding to each character in the text unit to be segmented.
Specifically, the specific implementation process of step S202 may include:
step S2021, segmenting the text unit to be segmented based on the character of which the corresponding voice attribute in the text unit to be segmented is the first attribute to obtain a first segmentation result.
The first attribute may be a vowel or a consonant.
Step S2022, based on the character whose voice attribute is the second attribute in the first segmentation result, further segmenting the first segmentation result.
And if the first attribute is vowel, the second attribute is consonant, and if the first attribute is consonant, the second attribute is vowel.
Next, taking the first attribute as a vowel and the second attribute as a consonant as an example, a process of "segmenting the text unit to be segmented based on the voice attribute corresponding to each character in the text unit to be segmented" is introduced:
step a1, segmenting the text unit to be segmented based on the character of which the corresponding voice attribute in the text unit to be segmented is vowel to obtain a first segmentation result.
Specifically, the specific implementation process of step a1 may include:
step a11, adding a segmentation symbol after the corresponding voice attribute in the text unit to be segmented is the character of the vowel, and obtaining a preliminary segmentation result.
Exemplarily, the text unit to be divided is the English word "employment", whose corresponding voice attribute sequence is "VCCCVVCVCC". A segmentation symbol is added after each character whose corresponding voice attribute is V (namely the first "e", "o", "y" and the second "e") in "employment", obtaining "e/mplo/y/me/nt" ("V/CCCV/V/CV/CC").
Step a12, adjusting the preliminary segmentation result based on the voice attributes respectively corresponding to the forward adjacent character and the backward adjacent character of the segmentation symbol, and taking the adjusted segmentation result as a first segmentation result.
Considering that a segmentation symbol may be added at some positions where the segmentation symbol is not necessarily added in the manner of step a11, in order to obtain a more reasonable segmentation result, after obtaining the preliminary segmentation result, in this embodiment, it may further be determined whether the segmentation symbol is an unnecessary segmentation symbol based on the voice attributes corresponding to the forward adjacent character and the backward adjacent character of the segmentation symbol, specifically, if the voice attributes corresponding to the forward adjacent character and the backward adjacent character of a segmentation symbol are vowels, and a vowel string formed by the forward adjacent character and the backward adjacent character of the segmentation symbol is located in a preset vowel string set, it may be determined that the segmentation symbol is an unnecessary segmentation symbol, and the segmentation symbol is removed, where the vowel string in the preset vowel string set has a combined pronunciation, and is a vowel string which does not need to be segmented.
For example, after the "employee" is segmented according to the step a11, an "e/mplo/y/me/nt" ("V/CCCV/V/CV/CC") is obtained, since the voice attributes respectively corresponding to the forward adjacent character "o" and the backward adjacent character "y" of the second segmentation character are vowels, and the "oy" is located in the preset vowel character string set ("oy" has a combined pronunciation) Therefore, the second delimiter (i.e. the delimiter between "oy") is an unnecessary delimiter, and is removed to obtain "e/moment/me/nt" ("V/CCCVV/CV/CC").
In addition, if the voice attribute corresponding to the last character of the text unit to be cut is consonant, and the character before the last segmentation symbol and the character after the last segmentation symbol belong to a syllable, removing the last segmentation symbol.
For example, the voice attribute corresponding to the last character of "employment" is a consonant. For the segmentation result "e/mploy/me/nt" ("V/CCCVV/CV/CC"), "me" before the last segmentation symbol and "nt" after it belong to one syllable, so the last segmentation symbol (i.e., the symbol between "me" and "nt") is removed, and "e/mploy/ment" ("V/CCCVV/CVCC") is obtained.
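The two sub-steps above can be sketched as follows for "employment"; the small vowel-string set and the helper name are assumptions for illustration, and the separate last-delimiter rule (which finally merges "me" and "nt" into "ment") is only noted in a comment.

```python
# Sketch of steps a11-a12: insert a boundary after every vowel character, then
# merge boundaries whose surrounding characters form a vowel string with a
# combined pronunciation. VOWEL_STRINGS is a tiny illustrative sample.
VOWEL_STRINGS = {"oy", "ou", "ai", "ea"}

def vowel_split(unit: str, attrs: list[str]) -> list[str]:
    pieces, cur = [], ""
    for ch, attr in zip(unit, attrs):        # step a11
        cur += ch
        if attr == "V":
            pieces.append(cur)
            cur = ""
    if cur:
        pieces.append(cur)
    merged = [pieces[0]]                     # step a12
    for piece in pieces[1:]:
        if merged[-1][-1] + piece[0] in VOWEL_STRINGS:
            merged[-1] += piece
        else:
            merged.append(piece)
    return merged

attrs = list("VCCCVVCVCC")                   # attributes of "employment"
print(vowel_split("employment", attrs))      # ['e', 'mploy', 'me', 'nt']
# The last-delimiter rule described above then merges 'me' and 'nt' into 'ment'.
```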
And a2, further segmenting the first segmentation result based on the character of which the corresponding voice attribute is consonant in the first segmentation result.
Specifically, the specific implementation process of step a2 may include:
step a21, if a first target character string composed of a plurality of continuous characters of which the corresponding voice attributes are consonants exists in the first segmentation result, determining whether a second target character string matched with the consonant character string in a preset consonant character string set exists in the first target character string, if so, executing the step a21-1, and if not, executing the step a21-2.
The consonant character strings in the pre-established consonant character string set are the consonant character strings which are required to be in the same syllable. It should be noted that the second target character string may be the first target character string, and may also be a character string composed of partial continuous characters in the first target character string.
Step a21-1, adding a segmentation symbol in front of the second target character string.
Illustratively, "employment" ("VCCCVVCVCC") is segmented according to step S2021 to obtain "e/mploy/ment" ("V/CCCVV/CVCC"), which includes two first target character strings, namely "mpl" ("CCC") and "nt" ("CC"). The pre-established consonant character string set includes "pl", and since the first target character string "mpl" contains the second target character string "pl" that matches the consonant character string "pl" in the set, a segmentation symbol is added before "pl" to obtain "e/m/ploy/ment" ("V/C/CCVV/CVCC").
Step a21-2, segmenting the first target character string according to the number of characters it contains.
If the first target character string does not have a second target character string matched with the consonant character string in the pre-established consonant character string set, the segmentation position cannot be determined based on the consonant character string set, and at the moment, the first target character string can be segmented by adopting a forced half-and-half segmentation principle.
It can be understood that if the first target character string includes an even number of characters, half-and-half segmentation can be implemented, and if the first target character string includes an odd number of characters, half-and-half segmentation cannot be implemented, and in view of this, the embodiment provides the following segmentation strategy:
for a first target character string containing three or more characters, if the number of characters contained in the first target character string is an even number, the first target character string is divided into half and half, namely a segmentation character is added after the n/2 (n is the number of characters contained in the first target character string), if the number of characters contained in the first target character string is an odd number, the segmentation character is added after the (n-1)/2 characters, for example, if the first target character string contains three characters, the segmentation character is added after the first character, if the first target character string contains four characters, the segmentation character is added after the second character, if the first target character string contains five characters, the segmentation character is added after the second character; for the first target character string containing two characters, if the two characters are required to be in the same syllable, no processing is carried out, otherwise, a segmentation character is added between the two characters.
It should be noted that a single consonant character cannot form a syllable on its own, so if a single consonant character lies between two segmentation symbols, the segmentation symbol before that consonant character is removed. For example, the segmentation symbol before the single consonant character "m" in "e/m/ploy/ment" ("V/C/CCVV/CVCC") is removed, giving "em/ploy/ment" ("VC/CCVV/CVCC").
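A sketch of step a2 over one consonant run is shown below; the consonant-string set is a small illustrative sample and the helper name is an assumption. The later rule that merges a lone consonant leftwards (turning "e/m/ploy/ment" into "em/ploy/ment") is noted only in a comment.

```python
# Sketch of further segmenting a run of consecutive consonant characters
# (a "first target character string") from the first segmentation result.
CONSONANT_STRINGS = {"pl", "tr", "st"}      # clusters that must share a syllable

def split_consonant_run(run: str) -> list[str]:
    # Cut in front of a matching cluster if one occurs inside the run.
    for i in range(1, len(run)):
        if any(run.startswith(cluster, i) for cluster in CONSONANT_STRINGS):
            return [run[:i], run[i:]]
    # Otherwise cut roughly in half: after n/2 characters for an even length,
    # after (n-1)/2 characters for an odd length of three or more.
    n = len(run)
    if n >= 3:
        cut = n // 2 if n % 2 == 0 else (n - 1) // 2
        return [run[:cut], run[cut:]]
    return [run]                            # one- or two-character runs stay whole

print(split_consonant_run("mpl"))   # ['m', 'pl'] -> "e/m/ploy/ment"
# A lone consonant cannot form a syllable, so the delimiter before "m" is then
# removed, giving "em/ploy/ment".
print(split_consonant_run("ndst"))  # ['nd', 'st']
```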
Third embodiment
This embodiment introduces the concrete implementation process of step S103: "determining the voice unit corresponding to each sub-text unit in the sub-text unit sequence to obtain a voice unit sequence".
There are various implementations for determining the speech unit corresponding to each sub-text unit in the sequence of sub-text units, and this embodiment provides the following two alternative implementations:
the first implementation mode comprises the following steps:
as shown in fig. 3, the process of determining a corresponding speech unit for each sub-text unit of the speech unit to be determined in the sub-text unit sequence includes:
step S301: and determining a voice unit corresponding to the character in the sub-text unit of the voice unit to be determined based on the pre-established corresponding relation between the character and the voice unit.
It should be noted that, if the speech synthesis method provided by the present application is applied to a single language scene, the correspondence between the characters and the speech units is only established for the single language, and if the speech synthesis method provided by the present application is applied to a multilingual scene, the correspondence between the characters and the speech units is respectively established for multiple languages when the correspondence is established, and correspondingly, when the speech units corresponding to the characters in the sub-text units of the speech units to be determined are determined, the speech units corresponding to the characters in the sub-text units of the speech units to be determined are determined based on the correspondence between the characters and the speech units established for the language to which the target text belongs.
The correspondence between the characters and the phonetic units is shown in table 2, and GP in table 2 represents a phonetic unit, which is specific pronunciation information of the characters.
TABLE 2 correspondence of characters to phonetic units
| Character | Speech unit(s) |
| --- | --- |
| Character 1 | Gp1 |
| Character 2 | Gp21, Gp22 |
| Character 3 | Gp3 |
| Character 4 | Gp41, Gp42, Gp43 |
| … | … |
| Character N | … |
Because the same character may have a plurality of different pronunciations, some characters correspond to one phonetic unit and some characters correspond to a plurality of phonetic units in the corresponding relationship between the characters and the phonetic units established in advance. In addition, the correspondence between characters and phonetic units generally includes the correspondence between single characters and phonetic units, and the correspondence between special combination characters and phonetic units.
Generally, each character has its own pronunciation information, but in some text units several characters may be pronounced as one sound, or some characters may not be pronounced at all; for example, the German combined character "ch" is pronounced as the single sound /x/. For such cases, correspondences between special combined characters and speech units are set within the correspondence between characters and speech units, so that a more accurate speech unit can be obtained when determining the speech unit corresponding to a text unit containing a special combined character.
Considering that the same character (possibly a single character or a combined character) may emit different sounds in different text units, for example, the english character a has a different pronunciation in "arm" than in "happy", the present embodiment may adopt the following two optional strategies when constructing the corresponding relationship for characters with multiple pronunciations:
firstly, a voice unit is set in the corresponding relation aiming at each pronunciation of the character, namely, the number of voice units is set according to the number of pronunciations of the character;
secondly, only setting a voice unit aiming at the existence of opposite pronunciations in the corresponding relation, for example, if a certain character has two opposite pronunciations, two voice units need to be set, it needs to be noted that the opposite pronunciations of the character need to be put into syllables for judgment, and the character has opposite pronunciations, which means that the character pronunciations are different in the same context environment; or in another embodiment, to more rationally set the voice sheetFor similar pronunciations without opposition, when corresponding phonetic units are set for the characters, only one phonetic unit can be set, for example, when a German combination character ch is used as a syllable tail, two pronunciations are set, and the character is read after the character a, o, au and u/x/, and the character is read after the character i and eDue to/x/andhave certain similarity and no opposition, so when setting the corresponding phonetic unit for the composite character ch, only one phonetic unit may be set, for example, it may be setThe phonetic unit corresponding to the german combination character ch, however, the phonetic unit/x/corresponding to the german combination character ch may be set. In addition, if a certain character (single character or combined character) has multiple dissimilar pronunciations and does not belong to a sound variation, different phonetic units need to be set, for example, a combined character gu in portuguese reads two sounds/gw/, before vowel characters a, o, u, and reads one sound/g/, before vowel characters i, e, the phonetic units need to be set for the two cases respectively.
In addition, if the same character has different pronunciations in different languages, then when the correspondence between characters and speech units is established, different speech units need to be set for the character in the correspondences of the different languages; that is, the correspondences of different languages need to reflect the different pronunciations of the same character in those languages.
Specifically, the process of determining, based on the correspondence between characters and speech units established in advance for the language to which the target text belongs, the speech unit corresponding to a character in a sub-text unit of the speech unit to be determined includes: finding, in the pre-established correspondence between characters and speech units, the character that is the same as the character of the speech unit to be determined, and obtaining the speech unit(s) corresponding to that character; if one speech unit is obtained, determining the obtained speech unit as the speech unit corresponding to the character of the speech unit to be determined; and if a plurality of speech units are obtained, determining, from the obtained speech units, the speech unit corresponding to the character of the speech unit to be determined according to the context information of that character within the sub-text unit of the speech unit to be determined.
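The following Python sketch is one hedged illustration of the lookup-and-disambiguate procedure above; the table contents and the context rule used to pick among multiple candidate speech units are assumptions made only for the example.

```python
# Minimal sketch: look up a character in the (assumed) correspondence table and,
# when several speech units are recorded, pick one using context inside the
# sub-text unit. The disambiguation rule below is purely illustrative.
CHAR_TO_UNITS = {"a": ["æ", "ɑː"], "r": ["r"], "m": ["m"]}

def unit_for_char(char: str, sub_text_unit: str, pos: int) -> str:
    candidates = CHAR_TO_UNITS.get(char, [])
    if len(candidates) == 1:
        return candidates[0]
    # several candidates: use context information (characters before/after)
    following = sub_text_unit[pos + 1: pos + 2]
    # assumed rule: 'a' followed by 'r' takes the long vowel, otherwise the first candidate
    return "ɑː" if char == "a" and following == "r" else candidates[0]

print([unit_for_char(c, "arm", i) for i, c in enumerate("arm")])  # ['ɑː', 'r', 'm']
```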
In view of the above, when determining the speech unit corresponding to a character in a sub-text unit of the speech unit to be determined based on the correspondence between characters and speech units, it may first be determined, based on the special combination characters in the correspondence, whether a special combination character exists in that sub-text unit. If a special combination character exists, the special combination character as a whole is treated as one character of the speech unit to be determined, and each single character in the sub-text unit other than the special combination character is treated as a character of the speech unit to be determined. Then, when determining the speech unit corresponding to a character of the speech unit to be determined, if the character is a single character, the speech unit is determined based on the correspondence between single characters and speech units; if the character is a special combination character, the speech unit is determined based on the correspondence between special combination characters and speech units.
Step S302: and acquiring the voice unit corresponding to the sub-text unit of the voice unit to be determined according to the voice unit corresponding to the character in the sub-text unit of the voice unit to be determined.
There are multiple implementations for obtaining the speech unit corresponding to the sub-text unit of the speech unit to be determined from the speech units corresponding to the characters in that sub-text unit. In one possible implementation, the speech units corresponding to the characters in the sub-text unit are combined in order, and the combined speech unit is used as the speech unit corresponding to the sub-text unit. In another possible implementation, the speech units corresponding to the characters in the sub-text unit are combined in order, the combined speech unit is then adjusted according to the pronunciation rules of the language to which the target text belongs (for example, redundant pronunciation units are removed), and the adjusted pronunciation unit is used as the speech unit corresponding to the sub-text unit of the speech unit to be determined.
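A minimal sketch of the two implementation manners just described might look as follows; the adjustment step stands in for language-specific pronunciation rules with a simple, assumed rule that removes immediately repeated units.

```python
# Sketch of obtaining the speech unit of a sub-text unit from its characters'
# speech units: concatenate in order, then optionally apply a (hypothetical)
# adjustment that removes redundant pronunciation units.
def combine_units(char_units: list[str], adjust: bool = True) -> list[str]:
    combined = list(char_units)                  # keep the original order
    if adjust:
        deduped = []
        for u in combined:
            if not deduped or deduped[-1] != u:  # drop immediate repeats (assumed rule)
                deduped.append(u)
        combined = deduped
    return combined

print(combine_units(["g", "g", "e"]))   # ['g', 'e'] after removing the redundant unit
```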
In order to determine the speech unit more quickly, this embodiment provides a second implementation manner for determining the speech unit corresponding to each sub-text unit in the sequence of sub-text units:
as shown in fig. 4, the process of determining a corresponding speech unit for each sub-text unit of the speech unit to be determined in the sub-text unit sequence includes:
step S401: and determining the voice unit corresponding to the characters in the sub-text unit of the voice unit to be determined based on the characters in the sub-text unit of the voice unit to be determined and the voice attribute corresponding to the characters as well as the pre-established corresponding relationship among the voice attribute, the characters and the voice unit.
It should be noted that if the speech synthesis method provided by the present application is applied to a single-language scene, the correspondence among speech attributes, characters and speech units is established only for that language. If the method is applied to a multilingual scene, the correspondence is established for each of the plurality of languages; correspondingly, when determining the speech unit corresponding to a character in a sub-text unit of the speech unit to be determined, the determination is made based on the correspondence among speech attributes, characters and speech units established for the language to which the target text belongs.
The correspondence among the voice attributes, the characters, and the voice units is shown in table 3.
TABLE 3 correspondence of phonetic attributes, characters and phonetic units
It should be noted that all the characters listed under a given speech attribute correspond to that speech attribute; for example, in the above table, the speech attributes corresponding to characters 11, 12, …, 1N are all speech attribute 1, and the speech attributes corresponding to characters 21, 22, …, 2M are all speech attribute 2.
The correspondence between the voice attributes, the characters and the voice units in the present implementation is equivalent to classifying the correspondence between the characters and the voice units in the first implementation according to the voice attributes, and putting together the correspondence between the characters and the voice units belonging to the same voice attribute. The corresponding relations of the characters and the voice units belonging to the same voice attribute are put together, so that when a voice unit corresponding to one character is searched, the searching range can be reduced to the corresponding relation under the voice attribute corresponding to the character based on the voice attribute corresponding to the character, the voice unit corresponding to the character does not need to be searched in the whole corresponding relation, and the searching efficiency of the voice unit can be improved.
Specifically, based on the characters in the sub-text unit of the speech unit to be determined and the speech attributes corresponding to those characters, and on the correspondence among speech attributes, characters and speech units established in advance for the language to which the target text belongs, the process of determining the speech unit corresponding to the characters in the sub-text unit of the speech unit to be determined includes executing the following steps for each character of the speech unit to be determined in the sub-text unit of the speech unit to be determined:
Step S4011: according to the speech attribute corresponding to the character of the speech unit to be determined, determine, from the correspondence among speech attributes, characters and speech units, the correspondence between characters and speech units under the speech attribute corresponding to the character of the speech unit to be determined, as the target correspondence.
Specifically, the speech attribute that is the same as the speech attribute corresponding to the character of the speech unit to be determined is found in the correspondence among speech attributes, characters and speech units, and the correspondence between characters and speech units under that speech attribute is determined as the target correspondence.
Through step S4011, the search range of the speech unit can be narrowed from the entire corresponding relationship to the target corresponding relationship, so that the search efficiency of the speech unit can be improved.
For example, if the voice attribute corresponding to the character of the voice unit to be determined is the same as the voice attribute 1 in the table above, the corresponding relationship between the character corresponding to the voice attribute 1 and the voice unit is determined as the target corresponding relationship, and the target corresponding relationship is shown in table 4 below:
TABLE 4 target correspondences
Character 11 | Gp11 |
Character 12 | Gp12-1, Gp12-2 |
… | … |
Character 1N | … |
Step S4012: determine the speech unit corresponding to the character of the speech unit to be determined according to the character of the speech unit to be determined and the target correspondence.
Specifically, the process of determining the speech unit corresponding to the character of the speech unit to be determined according to the character of the speech unit to be determined and the target correspondence includes: acquiring, from the target correspondence, the speech unit(s) corresponding to the character that is the same as the character of the speech unit to be determined; and if a plurality of speech units are acquired, determining, from the acquired speech units, the speech unit corresponding to the character of the speech unit to be determined according to the context information of that character within the sub-text unit of the speech unit to be determined.
For example, in the target correspondence shown in table 4, if the character that is the same as the character of the speech unit to be determined is character 11, then since character 11 corresponds to only one speech unit, Gp11 is directly determined as the speech unit corresponding to the character of the speech unit to be determined. If the character that is the same as the character of the speech unit to be determined is character 12, then since character 12 corresponds to two speech units, it is necessary to determine whether the speech unit corresponding to the character of the speech unit to be determined should be Gp12-1 or Gp12-2 according to the context information of that character, that is, the character before and/or the character after the character of the speech unit to be determined.
It should be noted that, in this implementation, the correspondence between characters and speech units under each speech attribute also includes the correspondence between single characters and speech units and the correspondence between special combination characters and speech units. In view of this, when determining the speech unit corresponding to a character in a sub-text unit of the speech unit to be determined based on the correspondence among speech attributes, characters and speech units, it is first determined, based on the special combination characters in the correspondence, whether a special combination character exists in that sub-text unit; if one exists, the special combination character as a whole is treated as a character of the speech unit to be determined, and each single character in the sub-text unit other than the special combination character is treated as a character of the speech unit to be determined. Then, when determining the speech unit corresponding to a character of the speech unit to be determined according to that character and the target correspondence, if the character is a single character, the speech unit is determined based on the correspondence between single characters and speech units in the target correspondence; if the character is a special combination character, the speech unit is determined based on the correspondence between special combination characters and speech units in the target correspondence.
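For illustration, the following sketch combines steps S4011 and S4012 into a two-level lookup; the attribute names, table entries and context rule are assumptions introduced only for the example.

```python
# Sketch of the two-step lookup in steps S4011/S4012: first narrow the search to
# the correspondence under the character's speech attribute (the "target
# correspondence"), then resolve the character within it.
ATTR_CHAR_TO_UNITS = {
    "attribute 1": {"character 11": ["Gp11"], "character 12": ["Gp12-1", "Gp12-2"]},
    "attribute 2": {"character 21": ["Gp21"]},
}

def lookup(char: str, attr: str, context: str = "") -> str:
    target = ATTR_CHAR_TO_UNITS[attr]      # step S4011: the target correspondence
    candidates = target[char]              # step S4012: resolve the character
    if len(candidates) == 1:
        return candidates[0]
    # several candidates: choose by context information (illustrative rule)
    return candidates[0] if context == "" else candidates[1]

print(lookup("character 12", "attribute 1", context="e"))   # 'Gp12-2'
```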
Step S402: and acquiring the voice unit corresponding to the sub-text unit of the voice unit to be determined based on the voice unit corresponding to the character in the sub-text unit of the voice unit to be determined.
Optionally, the speech units corresponding to the characters in the sub-text units of the speech unit to be determined may be combined in sequence, and the combined speech unit is used as the speech unit corresponding to the sub-text unit of the speech unit to be determined.
Fourth embodiment
The sub-text unit sequence can be obtained through the implementation manner provided by the second embodiment, the speech unit sequence can be obtained through the implementation manner provided by the third embodiment, and the implementation process of "performing speech synthesis based on the sub-text unit sequence and the speech unit sequence to obtain synthesized speech corresponding to the target text" is described in this embodiment.
Performing speech synthesis based on the sub-text unit sequence and the speech unit sequence to obtain the synthesized speech corresponding to the target text may include: performing speech synthesis based on the sub-text unit sequence, the speech unit sequence and a pre-established speech synthesis model to obtain the synthesized speech corresponding to the target text.
Specifically, the feature vectors of the sub-text unit sequences and the feature vectors of the speech unit sequences may be determined first, and then the feature vectors of the sub-text unit sequences and the feature vectors of the speech unit sequences are input into a pre-established speech synthesis model for speech synthesis, so that the synthesized speech corresponding to the target text can be obtained.
Optionally, the speech synthesis model in this embodiment may be a model based on an encoder-decoder framework, and the speech synthesis model may be, but is not limited to, a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), and the like.
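As a rough illustration only, the sketch below uses PyTorch (an assumption; the embodiment does not prescribe a framework) to build a toy encoder-decoder model that consumes the two sequences and predicts mel-spectrogram frames; the vocabulary sizes, dimensions, the assumption that the two sequences are already aligned to the same length, and the choice of mel frames as output are all illustrative, not part of the method.

```python
# Toy encoder-decoder sketch of a speech synthesis model that consumes the
# sub-text unit sequence and the speech unit sequence and predicts mel frames.
import torch
import torch.nn as nn

class SimpleTTS(nn.Module):
    def __init__(self, n_subtext=500, n_units=200, dim=128, n_mels=80):
        super().__init__()
        self.subtext_emb = nn.Embedding(n_subtext, dim)   # feature vectors of sub-text units
        self.unit_emb = nn.Embedding(n_units, dim)        # feature vectors of speech units
        self.encoder = nn.LSTM(2 * dim, dim, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(2 * dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, subtext_ids, unit_ids):
        # assumes the two sequences are aligned to the same length for simplicity
        x = torch.cat([self.subtext_emb(subtext_ids), self.unit_emb(unit_ids)], dim=-1)
        enc, _ = self.encoder(x)
        dec, _ = self.decoder(enc)
        return self.to_mel(dec)                           # (batch, time, n_mels)

model = SimpleTTS()
mel = model(torch.randint(0, 500, (1, 7)), torch.randint(0, 200, (1, 7)))
print(mel.shape)   # torch.Size([1, 7, 80])
```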
Since the speech synthesis is implemented based on the pre-established speech synthesis model, the present embodiment will be described with emphasis on the process of establishing the speech synthesis model.
Referring to fig. 5, a process for establishing a speech synthesis model is shown, which may include:
step S501, training texts are obtained from the training text set, and real voices corresponding to the training texts are obtained.
Step S502: a training text unit sequence is obtained according to the training text.
The implementation of obtaining the training text unit sequence according to the training text is similar to that of step S101: obtaining the text unit sequence according to the target text; for details, refer to the relevant part of the first embodiment, which is not repeated here.
Step S503: each training text unit in the training text unit sequence is segmented according to syllables to obtain a training sub-text unit sequence.
The implementation of segmenting each training text unit according to syllables is similar to that of step S102: segmenting each text unit in the text unit sequence according to syllables, and is not repeated here.
Step S504: the speech unit corresponding to each sub-text unit in the training sub-text unit sequence is determined to obtain a training speech unit sequence.
The implementation of determining the speech unit corresponding to each sub-text unit in the training sub-text unit sequence is similar to that of step S103: determining the speech unit corresponding to each sub-text unit in the sub-text unit sequence; for details, refer to the third embodiment, which is not repeated here.
Step S505: a speech synthesis model is trained by using the training sub-text unit sequence, the training speech unit sequence and the real speech corresponding to the training text.
Specifically, the characterization vector of the training sub-text unit sequence and the characterization vector of the training speech unit sequence are determined and input into the speech synthesis model for speech synthesis to obtain the synthesized speech corresponding to the training text; the prediction loss of the speech synthesis model is then determined according to the synthesized speech corresponding to the training text and the real speech corresponding to the training text, and the parameters of the speech synthesis model are updated according to the prediction loss.
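A hedged sketch of one training iteration corresponding to step S505, reusing the SimpleTTS sketch above, is given below; the L1 loss on mel features and the Adam optimizer are illustrative stand-ins for whatever prediction loss and update rule the embodiment actually uses.

```python
# Sketch of one training step: synthesize, compute a prediction loss against the
# real speech features, and update the model parameters.
import torch
import torch.nn as nn

def train_step(model, optimizer, subtext_ids, unit_ids, real_mel):
    model.train()
    pred_mel = model(subtext_ids, unit_ids)               # synthesized speech features
    loss = nn.functional.l1_loss(pred_mel, real_mel)      # prediction loss vs. real speech
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                      # update model parameters
    return loss.item()

# example usage with random tensors standing in for one training sample
model = SimpleTTS()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = train_step(model, opt,
                  torch.randint(0, 500, (1, 7)),
                  torch.randint(0, 200, (1, 7)),
                  torch.randn(1, 7, 80))
print(f"loss: {loss:.4f}")
```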
And repeating the steps S501 to S505 for multiple times until the preset execution times are reached or the performance of the speech synthesis model meets the requirement.
In a possible implementation manner, the training texts in the training text set may only include texts of one language, that is, texts of a language to which the target text belongs, and the speech synthesis model thus constructed may only perform speech synthesis on the texts of the language to which the target text belongs; in another possible implementation manner, the training texts in the training text set may include texts of multiple languages, where the multiple languages include the language to which the target text belongs and other languages, and the speech synthesis model thus constructed may perform speech synthesis on the texts of the multiple languages.
If the training texts in the training text set include texts of multiple languages, in one possible implementation manner, the training texts in the training text set may be unlabeled texts, and in order to improve the training speed and the training effect of the speech synthesis model, in another preferred implementation manner, the training texts in the training text set may be texts labeled with languages and pronunciation attributes.
It should be noted that each text unit in the training text is labeled with a reading attribute. The reading attribute includes a weak-reading attribute and a re-reading attribute: the weak-reading attribute indicates whether the corresponding text unit needs to be read weakly, and the re-reading attribute indicates the sound in the corresponding text unit that needs to be stressed. After the training text labeled with language and reading attributes is segmented, a training sub-text unit sequence labeled with language and reading attributes can be obtained, and the reading attribute of each sub-text unit in the training sub-text unit sequence is the reading attribute of the text unit in which it is located.
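Purely as an illustration of such labeling, a training text unit carrying language and reading attributes could be represented as follows; the field names and label values are hypothetical.

```python
# Hypothetical representation of a training text unit labeled with language and
# reading attributes; the field names and label values are illustrative only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LabeledTextUnit:
    text: str                      # the text unit, e.g. a word
    language: str                  # labeled language
    weak_reading: bool             # whether the unit should be read weakly
    stressed_sound: Optional[str]  # the sound in the unit that should be stressed, if any

sample = [
    LabeledTextUnit("happy", language="en", weak_reading=False, stressed_sound="hæ"),
    LabeledTextUnit("a", language="en", weak_reading=True, stressed_sound=None),
]
print(sample[0])
```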
It should be noted that, when the speech synthesis model is trained with the training sub-text unit sequence labeled with language and reading attributes, the model can determine the reading attributes of the training speech units in the training speech unit sequence according to the labeled language, determine whether the corresponding training speech units need to be read weakly according to the weak-reading attribute of each sub-text unit, and determine the sounds that need to be stressed according to the re-reading attribute of each sub-text unit. Therefore, after training, the speech synthesis model can also determine reading attributes from the input text, so that in the later application stage, namely the speech synthesis process, the model can take weak reading and stress into account, further improving the synthesis effect.
Alternatively, in another embodiment, the reading attributes may also include a tone attribute; the speech synthesis model may determine the reading attributes of the training speech units in the training speech unit sequence according to the labeled language, and may further determine the tone of each sub-text unit according to the tone attribute in the reading attributes of that sub-text unit, thereby further improving the synthesis effect of the model.
In the speech synthesis methods provided in the first to fifth embodiments, since pronunciation sequences do not need to be constructed for a large number of entries, labor cost and time cost are greatly reduced; and since the pronunciation information of entries does not need to be judged manually, the influence of subjective factors on the speech synthesis effect is greatly reduced. It should be noted that, although establishing the correspondences involved in the present application requires manual participation, the correspondences are relatively simple and their establishment is not influenced by subjective factors (for example, the speech attribute corresponding to the English character a being a vowel is well defined in the language and requires no manual judgment); therefore, the labor cost and time cost of the speech synthesis method provided by the present application are far lower than those of the prior-art speech synthesis scheme based on a pronunciation dictionary, and the synthesis effect of the method is essentially not influenced by subjective factors. In addition, modeling is not performed directly between the text and the speech; instead, each text unit in the text is first segmented into sub-text units each corresponding to one syllable, the sub-text units are converted into speech units, and modeling is then performed between the sub-text unit sequence and the speech unit sequence on the one hand and the speech on the other.
Sixth embodiment
The following describes the speech synthesis apparatus provided in the embodiments of the present application, and the speech synthesis apparatus described below and the speech synthesis method described above may be referred to in correspondence.
Referring to fig. 6, a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application is shown, which may include: a text unit sequence acquisition module 601, a text segmentation module 602, a speech unit determination module 603, and a speech synthesis module 604.
The text unit sequence obtaining module 601 is configured to obtain a text unit sequence according to a target text, where each text unit in the text unit sequence includes at least one character.
A text segmentation module 602, configured to segment each text unit in the text unit sequence according to syllables, so as to obtain a sub-text unit sequence.
A speech unit determining module 603, configured to determine a speech unit corresponding to each sub-text unit in the sub-text unit sequence, so as to obtain a speech unit sequence.
A speech synthesis module 604, configured to perform speech synthesis based on the sub-text unit sequence and the speech unit sequence, so as to obtain a synthesized speech corresponding to the target text.
Optionally, the text segmentation module 602 is specifically configured to segment each text unit in the text unit sequence according to a preset segmentation rule to obtain a sub-text unit sequence, where the segmentation rule is that each sub-text unit corresponds to a syllable.
Optionally, the text segmentation module 602 may include: a voice attribute determination submodule and a text segmentation submodule.
And the voice attribute determining submodule is used for determining the voice attribute corresponding to each character in the text unit to be cut aiming at each text unit to be cut in the target text.
And the text segmentation sub-module is used for segmenting the text unit to be segmented according to the voice attribute corresponding to each character in the text unit to be segmented.
Optionally, the voice attribute determining sub-module is specifically configured to determine, based on a pre-established correspondence between characters and voice attributes, a voice attribute corresponding to each character in the text unit to be divided.
Optionally, the text segmentation sub-module includes: a first segmentation sub-module and a second segmentation sub-module.
The first segmentation sub-module is configured to segment the text unit to be segmented based on the characters in the text unit to be segmented whose corresponding speech attribute is the first attribute, to obtain a first segmentation result.
And the second segmentation submodule is used for further segmenting the first segmentation result based on the character of which the corresponding voice attribute in the first segmentation result is the second attribute.
Wherein the first attribute is one of a vowel and a consonant, and the second attribute is the other of a vowel and a consonant.
Optionally, the first segmentation sub-module is specifically configured to segment the text unit to be segmented based on the characters in the text unit to be segmented whose corresponding speech attribute is a vowel, to obtain a first segmentation result.
The second segmentation submodule is specifically configured to further segment the first segmentation result based on the character of which the corresponding voice attribute is a consonant in the first segmentation result.
Optionally, the first segmentation sub-module is specifically configured to add a segmentation symbol after each character in the text unit to be segmented whose corresponding speech attribute is a vowel to obtain a preliminary segmentation result, adjust the preliminary segmentation result based on the speech attributes corresponding to the characters immediately before and after each segmentation symbol, and use the adjusted segmentation result as the first segmentation result.
Optionally, the second segmentation sub-module is specifically configured to: if a first target character string composed of a plurality of consecutive characters whose corresponding speech attributes are consonants exists in the first segmentation result, determine whether the first target character string contains a second target character string that matches a consonant character string in a pre-established consonant character string set; if so, add a segmentation symbol in front of the second target character string; and if not, segment the first target character string according to the number of characters it contains.
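The following sketch illustrates, under simplifying assumptions, the behavior described for the two segmentation sub-modules: a segmentation symbol is added after each vowel character, and consonant runs are then split using a small consonant-string set. The vowel set and consonant-string set are invented for the example, and the neighbor-based adjustment of the preliminary result is omitted.

```python
# Hedged sketch of syllable segmentation: separate after each vowel, then split
# remaining consonant runs against an (assumed) consonant-string set.
VOWELS = set("aeiou")
CONSONANT_STRINGS = {"str", "pl", "tr"}   # hypothetical pre-established set

def preliminary_split(word: str) -> list[str]:
    pieces, current = [], ""
    for ch in word:
        current += ch
        if ch in VOWELS:                  # segmentation symbol after each vowel
            pieces.append(current)
            current = ""
    if current:
        pieces.append(current)            # trailing consonants kept for later handling
    return pieces

def split_consonant_run(run: str) -> list[str]:
    # if a known consonant string appears, cut in front of it; otherwise split by length
    for cs in CONSONANT_STRINGS:
        idx = run.find(cs)
        if idx > 0:
            return [run[:idx], run[idx:]]
    mid = len(run) // 2
    return [run[:mid], run[mid:]] if len(run) > 1 else [run]

print(preliminary_split("happy"))   # ['ha', 'ppy'] before further adjustment
print(split_consonant_run("ppr"))   # ['p', 'pr'] — cut in front of the assumed string 'tr'... 'pr'
```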
Optionally, the phonetic unit determining module 603 includes: a first phonetic unit determination submodule and a second phonetic unit determination submodule.
The first voice unit determining submodule is configured to, for each sub-text unit of the to-be-determined voice unit in the sub-text unit sequence, determine the voice unit corresponding to the characters in that sub-text unit based on the characters in the sub-text unit of the to-be-determined voice unit, the voice attributes corresponding to those characters, and the pre-established correspondence among the voice attribute, the characters and the voice unit.
And the second voice unit determining submodule is used for obtaining the voice unit corresponding to the sub-text unit of the voice unit to be determined based on the voice unit corresponding to the character in the sub-text unit of the voice unit to be determined.
Optionally, the first speech unit determining submodule is specifically configured to, for a character of the speech unit to be determined in the sub-text unit of the speech unit to be determined: determine, according to the speech attribute corresponding to the character of the speech unit to be determined, the correspondence between characters and speech units under that speech attribute from the correspondence among speech attributes, characters and speech units, as the target correspondence; and determine the speech unit corresponding to the character of the speech unit to be determined according to the character of the speech unit to be determined and the target correspondence.
Optionally, when determining the speech unit corresponding to the character of the speech unit to be determined according to the character of the speech unit to be determined and the target corresponding relationship, the first speech unit determining submodule is specifically configured to obtain, from the target corresponding relationship, a speech unit corresponding to a character that is the same as the character of the speech unit to be determined, and if a plurality of obtained speech units are provided, determine, from the obtained plurality of speech units, the speech unit corresponding to the character of the speech unit to be determined based on context information of the character of the speech unit to be determined in the sub-text unit of the speech unit to be determined.
Optionally, the target correspondence includes a correspondence between a single character and a speech unit, and a correspondence between a special combined character and a speech unit, where some characters in the special combined character are not pronounced.
When the first voice unit determining submodule determines the voice unit corresponding to the character of the voice unit to be determined according to the character of the voice unit to be determined and the target corresponding relationship, the first voice unit determining submodule is specifically configured to determine the voice unit corresponding to the character of the voice unit to be determined based on the corresponding relationship between the special combination character and the voice unit in the target corresponding relationship when the character of the voice unit to be determined is the special combination character, and determine the voice unit corresponding to the character of the voice unit to be determined based on the corresponding relationship between the single character and the voice unit in the target corresponding relationship when the character of the voice unit to be determined is the single character.
Optionally, the phonetic unit determining module 603 includes: a third speech unit determination submodule and a fourth speech unit determination submodule.
And the third voice unit determining submodule is used for determining the voice unit corresponding to the character in the sub-text unit of the voice unit to be determined based on the pre-established corresponding relationship between the character and the voice unit for each sub-text unit of the voice unit to be determined in the sub-text unit sequence.
And the fourth voice unit determining submodule is used for obtaining the voice unit corresponding to the sub-text unit of the voice unit to be determined based on the voice unit corresponding to the character in the sub-text unit of the voice unit to be determined.
Optionally, the speech synthesis module 604 is specifically configured to perform speech synthesis based on the sub-text unit sequence, the speech unit sequence, and a pre-established speech synthesis model to obtain a synthesized speech corresponding to the target text.
The voice synthesis model is obtained by training with training sub-text unit sequences obtained by segmenting text units in a training text according to syllables and training voice unit sequences corresponding to the training sub-text unit sequences as training samples and with real voice corresponding to the training texts as sample labels.
Optionally, the training sub-text unit sequence is marked with languages; and each sub-text unit in the training sub-text unit sequence is marked with a reading attribute used for indicating whether the corresponding voice unit in the corresponding training voice unit sequence needs to be read weakly.
The speech synthesis device provided by the embodiment of the application firstly acquires a text unit sequence according to a target text, then segments each text unit in the text unit sequence according to syllables to obtain a sub-text unit sequence formed by sub-text units with definite speech significance, then determines the speech unit corresponding to each sub-text unit in the sub-text unit sequence, namely, converts each sub-text unit with definite speech significance in the sub-text unit sequence into a speech unit to obtain the speech unit sequence, and finally performs speech synthesis based on the sub-text unit sequence and the speech unit sequence to obtain the synthesized speech corresponding to the target text. Compared with a speech synthesis scheme based on a pronunciation dictionary in the prior art, the speech synthesis device provided by the embodiment of the application does not need to construct pronunciation sequences for a large number of entries, so that labor cost and time cost are greatly reduced, and influence of subjective factors on speech synthesis effect is greatly reduced because pronunciation information of the entries does not need to be manually judged.
Seventh embodiment
An embodiment of the present application further provides a speech synthesis device. Referring to fig. 7, which shows a schematic structural diagram of the speech synthesis device, the speech synthesis device may include: at least one processor 701, at least one communication interface 702, at least one memory 703 and at least one communication bus 704;
in the embodiment of the present application, the number of the processor 701, the communication interface 702, the memory 703 and the communication bus 704 is at least one, and the processor 701, the communication interface 702 and the memory 703 complete mutual communication through the communication bus 704;
the processor 701 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention, and the like;
the memory 703 may include a high-speed RAM memory, and may also include a non-volatile memory, for example, at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring a text unit sequence according to a target text, wherein each text unit in the text unit sequence comprises at least one character;
segmenting each text unit in the text unit sequence according to syllables to obtain a sub-text unit sequence;
determining a voice unit corresponding to each sub-text unit in the sub-text unit sequence to obtain a voice unit sequence;
and performing voice synthesis based on the sub-text unit sequence and the voice unit sequence to obtain a synthesized voice corresponding to the target text.
Alternatively, the detailed function and the extended function of the program may be as described above.
Eighth embodiment
Embodiments of the present application further provide a readable storage medium, where a program suitable for being executed by a processor may be stored, where the program is configured to:
acquiring a text unit sequence according to a target text, wherein each text unit in the text unit sequence comprises at least one character;
segmenting each text unit in the text unit sequence according to syllables to obtain a sub-text unit sequence;
determining a voice unit corresponding to each sub-text unit in the sub-text unit sequence to obtain a voice unit sequence;
and performing voice synthesis based on the sub-text unit sequence and the voice unit sequence to obtain a synthesized voice corresponding to the target text.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (17)
1. A method of speech synthesis, comprising:
acquiring a text unit sequence according to a target text, wherein each text unit in the text unit sequence comprises at least one character;
segmenting each text unit in the text unit sequence according to syllables to obtain a sub-text unit sequence, wherein the sub-text unit sequence consists of sub-text units with voice significance, the sub-text unit corresponding to one syllable is a morpheme or a sub-morpheme, and the morpheme is a voice and semantic combination;
determining a voice unit corresponding to each sub-text unit in the sub-text unit sequence to obtain a voice unit sequence;
performing voice synthesis based on the sub-text unit sequence and the voice unit sequence to obtain a synthesized voice corresponding to the target text;
wherein, the segmenting each text unit in the text unit sequence according to syllables to obtain a sub-text unit sequence comprises:
based on characters with the corresponding voice attribute as the first attribute in the text unit to be divided, dividing the text unit to be divided to obtain a first division result;
based on the character of which the corresponding voice attribute is the second attribute in the first cutting result, further cutting the first cutting result;
and if the first attribute is vowel, the second attribute is consonant, and if the first attribute is consonant, the second attribute is vowel.
2. The method of claim 1, wherein the segmenting each text unit in the sequence of text units according to syllables to obtain a sequence of sub-text units, further comprises:
and segmenting each text unit in the text unit sequence according to a preset segmentation principle to obtain a sub-text unit sequence, wherein the segmentation principle is that each sub-text unit corresponds to one syllable.
3. The method of claim 2, wherein the segmenting each text unit in the sequence of text units comprises:
for each text unit to be cut in the target text:
determining the voice attribute corresponding to each character in the text unit to be divided;
and segmenting the text unit to be segmented according to the voice attribute corresponding to each character in the text unit to be segmented.
4. The method according to claim 3, wherein the determining the voice attribute corresponding to each character in the text unit to be divided comprises:
and determining the voice attribute corresponding to each character in the text unit to be divided based on the pre-established corresponding relationship between the characters and the voice attributes.
5. The speech synthesis method according to claim 1, wherein the step of segmenting the text unit to be segmented based on the character of which the corresponding speech attribute in the text unit to be segmented is the first attribute to obtain a first segmentation result comprises:
based on the character of which the corresponding voice attribute in the text unit to be segmented is vowel, segmenting the text unit to be segmented to obtain a first segmentation result;
the further segmentation of the first segmentation result based on the character of which the corresponding voice attribute is the second attribute in the first segmentation result comprises:
and further segmenting the first segmentation result based on the character of which the corresponding voice attribute is consonant in the first segmentation result.
6. The speech synthesis method according to claim 5, wherein the segmenting the text unit to be segmented based on the character of which the corresponding speech attribute in the text unit to be segmented is vowel to obtain a first segmentation result, comprises:
adding a segmentation symbol after the corresponding voice attribute in the text unit to be segmented is the character of the vowel to obtain a primary segmentation result;
and adjusting the preliminary segmentation result based on the voice attributes respectively corresponding to the forward adjacent characters and the backward adjacent characters of the segmentation symbol, wherein the adjusted segmentation result is used as a first segmentation result.
7. The method according to claim 5, wherein the further segmenting the first segmentation result based on the character of which the corresponding voice attribute is a consonant in the first segmentation result comprises:
if a first target character string consisting of a plurality of continuous characters of which the corresponding voice attributes are consonants exists in the first cut result, determining whether a second target character string matched with the consonant character string in a pre-established consonant character string set exists in the first target character string;
if yes, adding a segmentation character in front of the second target character string;
and if the first target character string does not exist, segmenting the first target character string according to the number of characters contained in the first target character string.
8. The method according to claim 1, wherein the determining the speech unit corresponding to each sub-text unit in the sequence of sub-text units comprises:
for each sub-text unit of the speech unit to be determined in the sequence of sub-text units:
determining a voice unit corresponding to the characters in the sub-text unit of the voice unit to be determined based on the characters and the attributes corresponding to the characters in the sub-text unit of the voice unit to be determined and the pre-established corresponding relationship among the voice attributes, the characters and the voice unit;
and acquiring the voice unit corresponding to the sub-text unit of the voice unit to be determined based on the voice unit corresponding to the character in the sub-text unit of the voice unit to be determined.
9. The method according to claim 8, wherein the determining the phonetic unit corresponding to the character in the sub-text unit of the phonetic unit to be determined based on the attributes corresponding to the character and the character in the sub-text unit of the phonetic unit to be determined and the pre-established correspondence among the phonetic attributes, the character and the phonetic unit comprises:
aiming at the characters of the voice unit to be determined in the sub-text units of the voice unit to be determined:
according to the voice attribute corresponding to the character of the voice unit to be determined, determining the corresponding relation between the character corresponding to the character of the voice unit to be determined and the voice unit from the corresponding relations among the voice attribute, the character and the voice unit, and taking the corresponding relation as a target corresponding relation;
and determining a voice unit corresponding to the character of the voice unit to be determined according to the corresponding relation between the character of the voice unit to be determined and the target.
10. The method according to claim 9, wherein the determining a phonetic unit corresponding to the character of the phonetic unit to be determined according to the character of the phonetic unit to be determined and the target corresponding relationship comprises:
acquiring a voice unit corresponding to the character which is the same as the character of the voice unit to be determined from the target corresponding relation;
and if the number of the acquired voice units is multiple, determining the voice unit corresponding to the character of the voice unit to be determined from the acquired multiple voice units based on the context information of the character of the voice unit to be determined in the sub-text unit of the voice unit to be determined.
11. The method according to claim 9, wherein the target correspondence includes a correspondence between a single character and a speech unit, and a correspondence between a special combination character and a speech unit, wherein the special combination character has a part of characters that are silent;
the determining the voice unit corresponding to the character of the voice unit to be determined according to the corresponding relation between the character of the voice unit to be determined and the target comprises the following steps:
if the character of the voice unit to be determined is a special combination character, determining a voice unit corresponding to the character of the voice unit to be determined based on the corresponding relation between the special combination character and the voice unit in the target corresponding relation;
and if the characters of the voice unit to be determined are single characters, determining the voice unit corresponding to the characters of the voice unit to be determined based on the corresponding relation between the single characters in the target corresponding relation and the voice unit.
12. The method according to claim 1, wherein the determining the speech unit corresponding to each sub-text unit in the sequence of sub-text units comprises:
aiming at each sub-text unit of the voice unit to be determined in the sub-text unit sequence:
determining a voice unit corresponding to the characters in the sub-text unit of the voice unit to be determined based on the pre-established corresponding relationship between the characters and the voice unit;
and acquiring the voice unit corresponding to the sub-text unit of the voice unit to be determined based on the voice unit corresponding to the character in the sub-text unit of the voice unit to be determined.
13. The method according to claim 1, wherein performing speech synthesis based on the sequence of sub-text units and the sequence of speech units to obtain a synthesized speech corresponding to the target text comprises:
performing voice synthesis based on the sub-text unit sequence, the voice unit sequence and a pre-established voice synthesis model to obtain a synthesized voice corresponding to the target text;
the speech synthesis model is obtained by training with training sub-text unit sequences obtained by segmenting text units in a training text according to syllables and training speech unit sequences corresponding to the training sub-text unit sequences as training samples and with real speech corresponding to the training texts as sample labels.
14. The method of claim 13, wherein the sequence of training sub-text units is labeled with a language;
and each sub-text unit in the training sub-text unit sequence is marked with a reading attribute used for indicating whether the corresponding voice unit in the corresponding training voice unit sequence needs to be read weakly.
15. A speech synthesis apparatus, comprising: the system comprises a text unit sequence acquisition module, a text segmentation module, a voice unit determination module and a voice synthesis module;
the text unit sequence acquisition module is used for acquiring a text unit sequence according to a target text, wherein each text unit in the text unit sequence comprises at least one character;
the text segmentation module is used for segmenting each text unit in the text unit sequence according to syllables to obtain a sub-text unit sequence, wherein the sub-text unit sequence consists of sub-text units with voice significance, the sub-text unit corresponding to one syllable is a morpheme or a sub-morpheme, and the morpheme is a voice and semantic combination;
the voice unit determining module is used for determining a voice unit corresponding to each sub-text unit in the sub-text unit sequence to obtain a voice unit sequence;
the voice synthesis module is used for carrying out voice synthesis based on the sub-text unit sequence and the voice unit sequence to obtain a synthetic voice corresponding to the target text;
wherein, the segmenting each text unit in the text unit sequence according to syllables to obtain sub-text unit sequences comprises:
based on characters with the corresponding voice attribute as the first attribute in the text unit to be divided, dividing the text unit to be divided to obtain a first division result;
based on the character of which the corresponding voice attribute is the second attribute in the first cutting result, further cutting the first cutting result;
and if the first attribute is vowel, the second attribute is consonant, and if the first attribute is consonant, the second attribute is vowel.
16. A speech synthesis apparatus, characterized by comprising: a memory and a processor;
the memory is used for storing programs;
the processor, which executes the program, implements the respective steps of the speech synthesis method according to any one of claims 1 to 14.
17. A readable storage medium, having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, carries out the steps of the speech synthesis method according to any one of claims 1 to 14.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011597100.9A CN112786002B (en) | 2020-12-28 | 2020-12-28 | Voice synthesis method, device, equipment and storage medium |
PCT/CN2021/073871 WO2022141710A1 (en) | 2020-12-28 | 2021-01-27 | Speech synthesis method, apparatus and device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011597100.9A CN112786002B (en) | 2020-12-28 | 2020-12-28 | Voice synthesis method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112786002A CN112786002A (en) | 2021-05-11 |
CN112786002B true CN112786002B (en) | 2022-12-06 |
Family
ID=75751444
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011597100.9A Active CN112786002B (en) | 2020-12-28 | 2020-12-28 | Voice synthesis method, device, equipment and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112786002B (en) |
WO (1) | WO2022141710A1 (en) |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
IT1263756B (en) * | 1993-01-15 | 1996-08-29 | Alcatel Italia | AUTOMATIC METHOD FOR IMPLEMENTATION OF INTONATIVE CURVES ON VOICE MESSAGES CODED WITH TECHNIQUES THAT ALLOW THE ASSIGNMENT OF THE PITCH |
JP4053440B2 (en) * | 2003-02-26 | 2008-02-27 | 富士通株式会社 | Text-to-speech synthesis system and method |
CN1308908C (en) * | 2003-09-29 | 2007-04-04 | 摩托罗拉公司 | Transformation from characters to sound for synthesizing text paragraph pronunciation |
CN101261831B (en) * | 2007-03-05 | 2011-11-16 | 凌阳科技股份有限公司 | A phonetic symbol decomposition and its synthesis method |
JP2009098292A (en) * | 2007-10-15 | 2009-05-07 | Toshiba Corp | Speech symbol sequence creation method, speech synthesis method and speech synthesis device |
JP6069211B2 (en) * | 2010-12-02 | 2017-02-01 | アクセシブル パブリッシング システムズ プロプライアタリー リミテッド | Text conversion and expression system |
CN103324607B (en) * | 2012-03-20 | 2016-11-23 | 北京百度网讯科技有限公司 | Word method and device cut by a kind of Thai text |
CN104637482B (en) * | 2015-01-19 | 2015-12-09 | 孔繁泽 | A kind of audio recognition method, device, system and language exchange system |
CN106205601B (en) * | 2015-05-06 | 2019-09-03 | 科大讯飞股份有限公司 | Determine the method and system of text voice unit |
CN109036377A (en) * | 2018-07-26 | 2018-12-18 | 中国银联股份有限公司 | A kind of phoneme synthesizing method and device |
CN109102796A (en) * | 2018-08-31 | 2018-12-28 | 北京未来媒体科技股份有限公司 | A kind of phoneme synthesizing method and device |
CN110264993B (en) * | 2019-06-27 | 2020-10-09 | 百度在线网络技术(北京)有限公司 | Speech synthesis method, device, equipment and computer readable storage medium |
CN111369971B (en) * | 2020-03-11 | 2023-08-04 | 北京字节跳动网络技术有限公司 | Speech synthesis method, device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
WO2022141710A1 (en) | 2022-07-07 |
CN112786002A (en) | 2021-05-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10891928B2 (en) | Automatic song generation | |
JP6493866B2 (en) | Information processing apparatus, information processing method, and program | |
US6490563B2 (en) | Proofreading with text to speech feedback | |
US8185376B2 (en) | Identifying language origin of words | |
CN105404621B (en) | A kind of method and system that Chinese character is read for blind person | |
TWI610294B (en) | Speech recognition system and method thereof, vocabulary establishing method and computer program product | |
CN112818089B (en) | Text phonetic notation method, electronic equipment and storage medium | |
JP3992348B2 (en) | Morphological analysis method and apparatus, and Japanese morphological analysis method and apparatus | |
JP6941494B2 (en) | End-to-end Japanese speech recognition model learning device and program | |
CN113327574A (en) | Speech synthesis method, device, computer equipment and storage medium | |
CN112216267B (en) | Prosody prediction method, device, equipment and storage medium | |
CN112151019A (en) | Text processing method and device and computing equipment | |
KR101097186B1 (en) | System and method for synthesizing voice of multi-language | |
CN115101042B (en) | Text processing method, device and equipment | |
JP2009258293A (en) | Speech recognition vocabulary dictionary creator | |
Lin et al. | Hierarchical prosody modeling for Mandarin spontaneous speech | |
CN112002304B (en) | Speech synthesis method and device | |
JP5623380B2 (en) | Error sentence correcting apparatus, error sentence correcting method and program | |
CN112786002B (en) | Voice synthesis method, device, equipment and storage medium | |
JP4738847B2 (en) | Data retrieval apparatus and method | |
CN111489742B (en) | Acoustic model training method, voice recognition device and electronic equipment | |
JP2006107353A (en) | Information processor, information processing method, recording medium and program | |
JP4674609B2 (en) | Information processing apparatus and method, program, and recording medium | |
JP5533853B2 (en) | Reading judgment device, method, program, and speech synthesizer | |
KR101777141B1 (en) | Apparatus and method for inputting chinese and foreign languages based on hun min jeong eum using korean input keyboard |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |