
WO2018072543A1 - Model generation method, speech synthesis method and apparatus - Google Patents

Model generation method, speech synthesis method and apparatus

Info

Publication number
WO2018072543A1
WO2018072543A1 (PCT/CN2017/097314)
Authority
WO
WIPO (PCT)
Prior art keywords
splicing
speech segment
speech
candidate
training
Prior art date
Application number
PCT/CN2017/097314
Other languages
English (en)
French (fr)
Inventor
袁豪磊
吴富章
钱柄桦
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to US16/318,889 (published as US10832652B2)
Publication of WO2018072543A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 - Complex mathematical operations
    • G06F17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • Embodiments of the present invention relate to the field of voice synthesis technologies, and in particular, to a model generation method, a voice synthesis method, and an apparatus.
  • Speech synthesis technology, also known as Text To Speech (TTS), is used to convert text information into voice information.
  • At present, the widely used speech synthesis technology is speech synthesis based on waveform splicing, in which speech synthesis is performed on a speech segment sequence V (v 1 , v 2 , ..., v n ), where each v i is a speech segment.
  • The target cost is used to characterize the similarity between the predicted acoustic features corresponding to a text primitive w i and the acoustic features of the candidate speech segments in the corpus.
  • For example, the text primitive "Good Morning" corresponds to three candidate speech segments a in the corpus, and the text primitive "China" corresponds to two candidate speech segments b in the corpus, so there are six candidate splicing schemes.
  • The target cost is used to characterize the similarity between the predicted acoustic features corresponding to the text primitive "Good Morning" and the candidate speech segments a, and the similarity between the predicted acoustic features corresponding to the text primitive "China" and the candidate speech segments b; the splicing cost is used to characterize the continuity between a candidate speech segment a and a candidate speech segment b. For the six candidate splicing schemes, the target cost and the splicing cost of each candidate splicing scheme are calculated, the candidate splicing scheme with the smallest total cost is selected as the final splicing scheme, and it is synthesized to obtain the final voice information.
  • The complete splicing cost model consists of two parts: the algorithm model and the weights.
  • In the related art, these weights are adjusted manually according to the designer's experience, by trial and error. Specifically, after speech synthesis is performed on the input text information through a splicing cost model with initial weights, the continuity effect of the resulting voice information has to be assessed by manual listening. If the continuity effect is unsatisfactory, the weights in the splicing cost model are adjusted manually, the input text information is synthesized again by using the splicing cost model with the adjusted weights, and the above process is repeated on the newly synthesized voice information until a satisfactory continuity effect is obtained.
  • In view of this, the embodiments of the present invention provide a model generation method, a speech synthesis method, and corresponding apparatuses.
  • The technical solutions are as follows:
  • According to a first aspect, a model generation method is provided, the method comprising: acquiring training voice data, where the training voice data is voice data obtained by splicing the speech segments with the lowest target cost.
  • According to a second aspect, a speech synthesis method is provided, which uses the splicing cost model generated by the model generation method according to the first aspect, the method comprising: selecting k candidate speech segments from the corpus for each text primitive, where k is a positive integer.
  • According to a third aspect, a model generating apparatus is provided, comprising:
  • an acquiring module, configured to acquire training voice data, where the training voice data is voice data obtained by splicing the speech segments with the lowest target cost;
  • an extraction module, configured to extract training speech segments having a first annotation type from the training voice data, where the first annotation type is used to mark that the speech continuity of a training speech segment is better than a preset condition;
  • a first calculating module, configured to calculate an average difference matrix according to the adjacent candidate speech segments corresponding, before splicing, to the training speech segments having the first annotation type, where each average difference matrix corresponds to one type of splicing combination relationship and is used to characterize the average difference in acoustic features of multiple sets of adjacent candidate speech segments belonging to the same type of splicing combination relationship; and
  • a generating module, configured to generate, according to the average difference matrix, a splicing cost model having target splicing weights, where each splicing cost model corresponds to one type of splicing combination relationship.
  • According to a fourth aspect, a speech synthesis apparatus is provided, which uses the splicing cost model generated by the model generating apparatus according to the third aspect, the apparatus comprising:
  • a splitting module, configured to split the input text information to obtain a text primitive sequence (w 1 , w 2 , ..., w n ), where w i is the i-th text primitive and 1 ≤ i ≤ n;
  • a selection module, configured to select k candidate speech segments from the corpus for each text primitive w i , where k is a positive integer;
  • a second calculating module, configured to calculate, according to the target cost model, a target cost between each text primitive w i and its corresponding candidate speech segments, and to calculate, according to the splicing cost model, a splicing cost between adjacent candidate speech segments, where the target cost is used to characterize the similarity between the predicted acoustic feature corresponding to the text primitive w i and the acoustic feature of a candidate speech segment, and the splicing cost is used to characterize the continuity between adjacent candidate speech segments; and
  • a synthesizing module, configured to select a set of target speech segment sequences (v 1 , v 2 , ..., v n ) for which the total cost corresponding to the target cost and the splicing cost is the smallest, perform speech synthesis, and obtain the voice information corresponding to the input text information.
  • According to another aspect, a server is provided, comprising a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the model generation method of the first aspect above.
  • According to another aspect, a server is provided, comprising a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the speech synthesis method of the second aspect above.
  • According to another aspect, a terminal is provided, comprising a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor to implement the speech synthesis method of the second aspect above.
  • According to another aspect, a computer-readable storage medium is provided, which stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the model generation method of the first aspect above.
  • According to another aspect, a computer-readable storage medium is provided, which stores at least one instruction, at least one program, a code set, or a set of instructions, and the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by a processor to implement the speech synthesis method of the second aspect above.
  • An average difference matrix is calculated according to the adjacent candidate speech segments corresponding, before splicing, to the training speech segments, and a splicing cost model having target splicing weights is generated according to the average difference matrix; each average difference matrix is used to characterize the average difference in acoustic features of multiple sets of adjacent candidate speech segments belonging to the same splicing combination relationship.
  • Because the splicing cost model is calculated based on the average difference matrix, the generated splicing cost model has accurate weights, which reduces the number of manual adjustments and avoids the situation in which the weights in the splicing cost model have to be adjusted manually many times while the resulting weights are still not accurate enough.
  • FIG. 1A is a schematic diagram of the principle of a speech synthesis method based on waveform splicing;
  • FIG. 1B is a schematic diagram of the principle of a speech synthesis method according to another embodiment of the present invention;
  • FIG. 2 is a flowchart of a speech synthesis method according to an embodiment of the present invention;
  • FIG. 3 is a flowchart of a speech synthesis method according to another embodiment of the present invention;
  • FIG. 4A is a flowchart of a speech synthesis method according to another embodiment of the present invention;
  • FIG. 4B is a flowchart of a speech synthesis method according to another embodiment of the present invention;
  • FIG. 5 is a schematic diagram of the principle of a speech synthesis method according to another embodiment of the present invention;
  • FIG. 6 is a schematic diagram of the principle of a speech synthesis method according to another embodiment of the present invention;
  • FIG. 7 is a flowchart of a speech synthesis method according to another embodiment of the present invention;
  • FIG. 8 is a schematic diagram of an interface of a speech synthesis method according to another embodiment of the present invention;
  • FIG. 9 is a schematic structural diagram of a model generating apparatus according to an embodiment of the present invention;
  • FIG. 10 is a schematic structural diagram of a model generating apparatus according to another embodiment of the present invention;
  • FIG. 11 is a schematic structural diagram of a voice synthesizing apparatus according to an embodiment of the present invention;
  • FIG. 12 is a block diagram of a terminal according to an embodiment of the present invention;
  • FIG. 13 is a block diagram of a server according to an embodiment of the present invention.
  • Text primitive sequence: the input text information is split to obtain a text primitive sequence (w 1 , w 2 , ..., w n ), where w i is the i-th text primitive, 1 ≤ i ≤ n, and i and n are positive integers.
  • Target cost: used to characterize the similarity between the predicted acoustic features corresponding to the text primitive w i and the acoustic features of the candidate speech segments. The smaller the target cost, the more similar the two are.
  • Optionally, the predicted acoustic feature is represented by an acoustic parameter value corresponding to the text primitive w i , or by a probability model corresponding to the text primitive w i .
  • Optionally, the predicted acoustic features include at least one of the fundamental frequency, the spectral feature, the first-order and higher-order rates of change of the fundamental frequency, the first-order and higher-order rates of change of the spectrum, the energy of the signal, and the zero-crossing rate of the signal.
  • the candidate speech segments are a plurality of speech segments stored in the corpus.
  • Splicing cost: used to characterize the continuity between adjacent candidate speech segments.
  • Training voice data: the voice data obtained by splicing the speech segments with the lowest target cost.
  • In other words, the training voice data is speech information to be trained that is related to the target cost and independent of the splicing cost; in the process of synthesizing the training voice data, the influence of the splicing cost is not considered (the splicing cost is treated as 0) and only the target cost is considered. That is, the splicing process of the model generation method assumes that the splicing cost is 0 and ignores its influence on the speech synthesis process.
  • The training voice data includes at least one training speech segment; a training speech segment is a speech segment obtained by splicing a first candidate speech segment and a second candidate speech segment.
  • The annotation type of a training speech segment includes a first annotation type and a second annotation type.
  • The first annotation type is used to mark that the speech continuity of the training speech segment is better than a preset condition, that is, a training speech segment with a better speech continuity effect;
  • the second annotation type is used to mark that the speech continuity of the training speech segment is lower than the preset condition, that is, a training speech segment with a poorer speech continuity effect.
  • Optionally, the annotation type of each training speech segment is obtained by manual listening. If the result of manual listening is that the continuity of the training speech segment is good, the training speech segment is marked with the first annotation type; if the result of manual listening is that the continuity of the training speech segment is poor, the training speech segment is marked with the second annotation type. The speech continuity corresponding to the first annotation type is better than the speech continuity corresponding to the second annotation type.
  • Average difference matrix: used to characterize the average difference in acoustic features of multiple sets of adjacent candidate speech segments belonging to the same type of splicing combination relationship; each average difference matrix corresponds to one type of splicing combination relationship.
  • A splicing difference matrix of the first candidate speech segment and the second candidate speech segment can be obtained from the difference in acoustic features between the first candidate speech segment and the second candidate speech segment.
  • The splicing difference matrices belonging to the same type of splicing combination relationship are averaged to obtain the average difference matrix corresponding to that splicing combination relationship.
  • Splicing combination relationship: includes a combination relationship between at least two phonemes.
  • For example, the splicing combination relationship is a combination relationship in which the phoneme unit a comes first and the phoneme unit b comes second; the combination relationship between the pinyin "y" and the pinyin "i" is one splicing combination relationship.
  • Splicing cost model: a splicing cost model having target splicing weights; each splicing cost model corresponds to one type of splicing combination relationship.
  • The target splicing weights include a first weight and a second weight: the first weight is the weight corresponding to the n-th acoustic feature of the two candidate speech segments, and the second weight is the weight corresponding to the acoustic features of the t-th overlapping frame of the two candidate speech segments.
  • FIG. 1A illustrates a schematic diagram of a speech synthesis method based on waveform stitching.
  • the user inputs a text message to the server, and the server splits the input text information to obtain a set of text primitive sequences (w 1 , w 2 , . . . , w n ).
  • Finally, the server converts the text primitive sequence into a set of target speech segment sequences (v 1 , v 2 , ..., v n ) for speech synthesis, and obtains the voice information corresponding to the input text information.
  • Taking the text primitive w 1 and the text primitive w 2 as an example, the server performs front-end processing on them according to a preset acoustic model, and obtains the predicted acoustic feature 1 corresponding to the text primitive w 1 and the predicted acoustic feature 2 corresponding to the text primitive w 2 , respectively.
  • According to the predicted acoustic feature 1 corresponding to the text primitive w 1 , three first candidate speech segments are selected from the corpus, namely the candidate speech segment a1, the candidate speech segment a2, and the candidate speech segment a3; according to the predicted acoustic feature 2 corresponding to the text primitive w 2 , two second candidate speech segments are selected from the corpus, namely the candidate speech segment b1 and the candidate speech segment b2.
  • There are therefore six candidate splicing schemes: the first candidate splicing scheme splices the candidate speech segment a1 with the candidate speech segment b1; the second splices a2 with b1; the third splices a3 with b1; the fourth splices a1 with b2; the fifth splices a2 with b2; and the sixth splices a3 with b2.
  • According to the target cost model, the server calculates the first target cost TC11 between the text primitive w 1 and the candidate speech segment a1 and the second target cost TC50 between the text primitive w 2 and the candidate speech segment b1; according to the splicing cost model, the server calculates the splicing cost CC11 between the candidate speech segment a1 and the candidate speech segment b1, and obtains the total cost RC1 corresponding to the first candidate splicing scheme, which includes the first target cost TC11, the second target cost TC50, and the first splicing cost CC11.
  • By analogy, the server calculates the total cost RC2 corresponding to the second candidate splicing scheme, the total cost RC3 corresponding to the third candidate splicing scheme, the total cost RC4 corresponding to the fourth candidate splicing scheme, the total cost RC5 corresponding to the fifth candidate splicing scheme, and the total cost RC6 corresponding to the sixth candidate splicing scheme.
  • The server compares the total costs corresponding to the six candidate splicing schemes. If the comparison result is that the total cost RC2 corresponding to the second candidate splicing scheme is the smallest, the candidate speech segments of that scheme are determined as the target speech segments, and the final speech splicing and synthesis are performed on them to obtain the final synthesized speech.
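  • As a small illustration of this selection step (the cost functions and feature values below are toy placeholders, not the models defined in this application), the candidate splicing scheme with the smallest total cost can be found by enumerating all combinations:
```python
# Illustrative sketch: brute-force selection of the candidate splicing scheme
# with the smallest total cost for the two-primitive example of FIG. 1A.
# target_cost and splicing_cost are stand-ins for the cost models above.
from itertools import product

def target_cost(predicted_feature, candidate_feature):
    # Stand-in target cost: absolute difference of scalar "features".
    return abs(predicted_feature - candidate_feature)

def splicing_cost(candidate_a, candidate_b):
    # Stand-in splicing cost: absolute difference at the splicing boundary.
    return abs(candidate_a - candidate_b)

# Predicted acoustic features for text primitives w1 and w2 (toy scalars).
predicted = [0.9, 0.4]
# Candidate speech segments: three for w1 (a1, a2, a3), two for w2 (b1, b2).
candidates = [[1.0, 0.7, 0.2], [0.5, 0.1]]

best_scheme, best_total = None, float("inf")
for scheme in product(*[range(len(c)) for c in candidates]):
    total = sum(target_cost(predicted[i], candidates[i][j])
                for i, j in enumerate(scheme))
    total += sum(splicing_cost(candidates[i][scheme[i]],
                               candidates[i + 1][scheme[i + 1]])
                 for i in range(len(scheme) - 1))
    if total < best_total:
        best_scheme, best_total = scheme, total

print("selected candidate indices:", best_scheme, "total cost:", best_total)
```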
  • Taking the splicing of the candidate speech segment a1 and the candidate speech segment b2 as an example, the splicing cost model may be defined by a formula of the following form:
  • CC = \sum_{t=1}^{T} w_t \sum_{n=1}^{N} w_n \, d\big(f_{a1,t,n}, f_{b2,t-T+1,n}\big)
  • where CC is the splicing cost, used to characterize the continuity of the candidate speech segment a1 and the candidate speech segment b2; T is the number of overlapping frames of the candidate speech segment a1 or the candidate speech segment b2; w t is the second weight corresponding to the acoustic features of the t-th overlapping frame of the candidate speech segment a1 and the candidate speech segment b2; N is the number of acoustic features included in the candidate speech segment a1 or the candidate speech segment b2; w n is the first weight corresponding to the n-th acoustic feature of the candidate speech segment a1 and the candidate speech segment b2; d(f_{a1,t,n}, f_{b2,t-T+1,n}) is the acoustic distance measure of the n-th acoustic feature of the candidate speech segment a1 and the candidate speech segment b2; and F is the splicing difference matrix of the candidate speech segment a1 and the candidate speech segment b2, whose entries are these acoustic distance measures.
  • When the candidate speech segment a1 and the candidate speech segment b2 are spliced, assume for simplicity that they have only one overlapping frame; the candidate speech segment a1 has N acoustic features (or an N-dimensional acoustic feature) on the overlapping frame, and the candidate speech segment b2 correspondingly has N acoustic features (or an N-dimensional acoustic feature) on the overlapping frame.
  • Because the lip transition and the pitch transition are different for different adjacent candidate speech segments, the first weight w n corresponding to the n-th acoustic feature and the second weight w t corresponding to the acoustic features of the t-th overlapping frame are also different for different adjacent candidate speech segments.
  • That is, according to the number of acoustic features included in the candidate speech segment a1 or the candidate speech segment b2, the acoustic distance measure of each acoustic feature of the candidate speech segment a1 and the candidate speech segment b2 is multiplied by the corresponding first weight w n and second weight w t and summed, and the first weight w n and the second weight w t are determined as the target splicing weights.
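  • The doubly weighted sum above can be written compactly as w_n^T F w_t. A minimal sketch, assuming this form and an absolute-difference distance measure (the feature layout and weight values are illustrative only):
```python
# Sketch of the splicing cost CC = sum_t w_t * sum_n w_n * d(f_a[n,t], f_b[n,t]),
# which equals w_n^T F w_t when F[n, t] holds the per-feature, per-frame distances.
import numpy as np

def splicing_cost(feats_a, feats_b, w_n, w_t):
    """feats_a, feats_b: arrays of shape (N, T) holding the N acoustic features
    of the T overlapping frames of two adjacent candidate segments, with column t
    of feats_a already aligned with column t of feats_b.
    w_n: first weights, shape (N,); w_t: second weights, shape (T,)."""
    F = np.abs(feats_a - feats_b)          # splicing difference matrix, shape (N, T)
    return float(w_n @ F @ w_t)            # doubly weighted sum = w_n^T F w_t

# Toy example: N = 4 acoustic features, T = 3 overlapping frames.
rng = np.random.default_rng(0)
feats_a1 = rng.normal(size=(4, 3))
feats_b2 = rng.normal(size=(4, 3))
w_n = np.full(4, 0.25)                     # uniform first weights (placeholder)
w_t = np.full(3, 1.0 / 3)                  # uniform second weights (placeholder)
print("CC =", splicing_cost(feats_a1, feats_b2, w_n, w_t))
```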
  • the present invention provides the following embodiments.
  • FIG. 2 is a flowchart of a method for a speech synthesis method according to an embodiment of the present invention.
  • the speech synthesis method may be performed by a server or a terminal having voice processing capability, the speech synthesis method comprising:
  • Step 202 Acquire training voice data.
  • the server acquires training voice data to be trained, where the training voice data includes multiple training voice segments.
  • Step 204 Extract a training speech segment having a first annotation type from the training speech data.
  • Optionally, the server determines at least two training speech segments included in the training voice data, where the annotation type of each of the at least two training speech segments is the first annotation type or the second annotation type, and extracts x training speech segments having the first annotation type from the at least two training speech segments, x being a positive integer.
  • Step 206 Calculate an average difference matrix according to the adjacent candidate speech segments corresponding to the training speech segments having the first annotation type before splicing.
  • Optionally, after extracting the training speech segments having the first annotation type, for each training speech segment having the first annotation type the server calculates a splicing difference matrix according to the adjacent candidate speech segments corresponding to the training speech segment before splicing. For the splicing difference matrices belonging to the same type of splicing combination relationship, the average difference matrix corresponding to that splicing combination relationship is calculated.
  • Step 208 Generate a splicing cost model having a target splicing weight according to the average difference matrix.
  • the server calculates, according to the calculated average difference matrix, a stitching cost model by using a preset formula, where the stitching cost model has a target stitching weight.
  • Step 210 Perform speech synthesis by using a splicing cost model with target splicing weights to obtain synthesized speech information.
  • Optionally, when the server determines the text information on which speech synthesis needs to be performed, the server performs speech synthesis on the determined text information by using the splicing cost model to obtain the synthesized voice information.
  • the server transmits the generated splicing cost model to the terminal, so that the terminal can apply the splicing cost model.
  • the terminal obtains the generated splicing cost model from the server.
  • When the terminal receives the text information that needs to be synthesized, the terminal synthesizes the input text information through the splicing cost model to obtain the synthesized voice information.
  • It should be noted that steps 202 to 208 can be separately implemented as a model generation method, which is generally completed by a server and is used to generate a splicing cost model with target splicing weights; step 210 can be separately implemented as a speech synthesis method, which is generally performed by a server or a terminal and is used to synthesize the input text information by using the splicing cost model generated in steps 202 to 208 to obtain the synthesized voice information.
  • The following embodiments are described by taking the case in which the model generation method is completed by the server and the speech synthesis method is completed by the terminal as an example.
  • In summary, in this embodiment the average difference matrix is calculated according to the adjacent candidate speech segments corresponding, before splicing, to the training speech segments having the first annotation type, and a splicing cost model having target splicing weights is generated according to the average difference matrix, where each average difference matrix is used to characterize the average difference in acoustic features of multiple sets of adjacent candidate speech segments belonging to the same splicing combination relationship.
  • Because the splicing cost model is calculated based on the average difference matrix, the generated splicing cost model has accurate weights, which reduces the number of manual adjustments and avoids the situation in which the weights in the splicing cost model have to be adjusted manually many times while the resulting weights are still not accurate enough.
  • FIG. 3 is a flowchart of a method for a speech synthesis method according to another embodiment of the present invention.
  • the speech synthesis method includes:
  • step 301 the server acquires training voice data.
  • step 301 can be implemented as step 301a, step 301b, step 301c, and step 301d instead, as shown in FIG. 4A:
  • step 301a, the server splits the text information to be trained to obtain a text primitive sequence (w 1 , w 2 , ..., w n ), where w i is the i-th text primitive, 1 ≤ i ≤ n.
  • Optionally, the server splits the text information to be trained based on the phoneme or the syllable to obtain a text primitive sequence (w 1 , w 2 , ..., w n ), where w i is the i-th text primitive, 1 ≤ i ≤ n.
  • step 301b the server obtains a predicted acoustic feature corresponding to each text primitive w i according to the preset acoustic model.
  • Optionally, the server inputs the linguistic representation corresponding to each text primitive w i into the preset acoustic model, and the preset acoustic model outputs the predicted acoustic feature corresponding to each text primitive w i .
  • Step 301c: For each text primitive w i , the server selects the speech segment v i with the smallest target cost from the corpus.
  • Optionally, the server calculates the target cost of each candidate speech segment corresponding to the text primitive w i , and selects the speech segment v i with the smallest target cost from the corpus.
  • Optionally, the server calculates the corresponding target cost by a formula of the following form:
  • TC_i = \sum_{n=1}^{N} w_n \, d\big(f'_{i,n}, f_{c,n}\big)
  • where TC i is the target cost corresponding to the text primitive w i ; w n is a preset first weight; f'_{i,n} is the n-th predicted acoustic feature corresponding to the text primitive w i ; f_{c,n} is the n-th acoustic feature of a candidate speech segment c in the corpus; and d(·,·) is the acoustic distance measure, which may take the Euclidean distance or the absolute value of the difference.
  • the server selects 10 speech segments v i with the smallest target cost from the corpus.
  • Step 301d The server performs speech synthesis according to the training speech segment sequence (v 1 , v 2 , . . . , v n ) composed of the selected speech segments v i to obtain training speech data corresponding to the text information to be trained.
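  • A minimal sketch of this training-data construction, assuming per-primitive predicted feature vectors, a weighted absolute-difference target cost, and a toy corpus layout (all illustrative assumptions, not details taken from this application):
```python
# Sketch of steps 301a-301d: pick, for each text primitive, the corpus candidate
# with the smallest target cost while ignoring the splicing cost entirely.
import numpy as np

def target_cost(predicted, candidate, w_n):
    # TC = sum_n w_n * |f'_n - f_n| (one of the distance choices named above).
    return float(np.sum(w_n * np.abs(predicted - candidate)))

def build_training_sequence(predicted_feats, corpus, w_n):
    """predicted_feats: list of (N,) arrays, one per text primitive.
    corpus: list of lists of (N,) arrays, candidate features per text primitive.
    Returns the index of the minimum-target-cost segment for each primitive;
    the splicing cost is treated as 0 here, as in the model generation method."""
    sequence = []
    for predicted, candidates in zip(predicted_feats, corpus):
        costs = [target_cost(predicted, c, w_n) for c in candidates]
        sequence.append(int(np.argmin(costs)))
    return sequence

rng = np.random.default_rng(1)
N = 5                                       # number of acoustic features
predicted_feats = [rng.normal(size=N) for _ in range(3)]
corpus = [[rng.normal(size=N) for _ in range(10)] for _ in range(3)]
w_n = np.ones(N)                            # preset first weights (placeholder)
print("selected segment indices:", build_training_sequence(predicted_feats, corpus, w_n))
```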
  • Step 302 The server extracts a training speech segment having a first annotation type from the training speech data.
  • step 302 can be implemented as step 302a and step 302b instead, as shown in FIG. 4B:
  • Step 302a The server acquires an annotation type of at least one training speech segment in the training speech data, where the annotation type of each training speech segment is a first annotation type or a second annotation type, and the voice continuity corresponding to the first annotation type is better than the first The continuity of speech corresponding to the two annotation types.
  • step 302b the server extracts the training speech segment having the first annotation type.
  • Optionally, the annotation type of each training speech segment, that is, the first annotation type or the second annotation type, is marked by manually listening to the training voice data.
  • After the annotation type of each training speech segment is obtained, the server extracts the training speech segments having the first annotation type from the training voice data.
  • Step 303 The server calculates, for each training speech segment having the first annotation type, a splicing difference matrix according to the adjacent candidate speech segments corresponding to the training speech segment before splicing.
  • Generally, there are multiple training speech segments, for example several hundred, several thousand, or tens of thousands.
  • the server calculates a splicing difference matrix corresponding to the training speech segment according to the adjacent candidate speech segment corresponding to the training speech segment before splicing.
  • the steps for the server to calculate the splicing difference matrix include:
  • Optionally, for each training speech segment having the first annotation type, the server acquires the candidate speech segment a and the candidate speech segment b corresponding to the training speech segment before splicing.
  • the server acquires a first set of acoustic features corresponding to each overlapping frame of the candidate speech segment a and a second set of acoustic features corresponding to each overlapping frame of the candidate speech segments b.
  • the number of frames of the overlapping frames of the candidate speech segment a and the candidate speech segment b may be one frame or multiple frames.
  • For example, suppose the current time is t0, the time of the last frame of the candidate speech segment a is t0, and the time of the first frame of the candidate speech segment b is t0, and let the length T of the splicing window be an arbitrary value.
  • Then the t0-th frame to the (t0+T-1)-th frame of the candidate speech segment a overlap with the (t0-T+1)-th frame to the t0-th frame of the candidate speech segment b, respectively, that is, "a(t0 : t0+T-1) + b(t0-T+1 : t0)".
  • The number of frames T of the overlapping frames is not limited in the embodiments of the present invention; for example, the number of frames T of the overlapping frames is 20.
  • Each overlapping frame of the candidate speech segment a corresponds to a first set of acoustic features, and the first set of acoustic features includes n acoustic features (or an n-dimensional acoustic feature); each overlapping frame of the candidate speech segment b correspondingly has a second set of acoustic features, and the second set of acoustic features includes n acoustic features (or an n-dimensional acoustic feature).
  • Optionally, the acoustic features include at least one of the fundamental frequency, the spectral feature, the first-order and higher-order rates of change of the fundamental frequency, the first-order and higher-order rates of change of the spectrum, the energy of the signal, and the zero-crossing rate of the signal.
  • The server calculates the splicing difference matrix F from the first set of acoustic features and the second set of acoustic features by a formula of the following form:
  • F(n, t) = d\big(f_{a,t,n}, f_{b,t-T+1,n}\big)
  • where F is the splicing difference matrix corresponding to the candidate speech segment a and the candidate speech segment b; the element in the n-th row and the t-th column of the splicing difference matrix is the acoustic distance measure between the n-th acoustic feature of the t-th overlapping frame of the candidate speech segment a and the n-th acoustic feature of the (t-T+1)-th overlapping frame of the candidate speech segment b; f_{a,t,n} is the n-th acoustic feature corresponding to the t-th overlapping frame of the candidate speech segment a; and f_{b,t-T+1,n} is the n-th acoustic feature corresponding to the (t-T+1)-th overlapping frame of the candidate speech segment b.
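  • A minimal sketch of this step, assuming the features of the aligned overlapping frames are already stacked into (N, T) arrays and using the absolute difference as the acoustic distance measure (both assumptions are illustrative):
```python
# Sketch of step 303: one row per acoustic feature, one column per overlapping
# frame; each entry is the distance between the aligned frames of segments a and b.
import numpy as np

def splicing_difference_matrix(feats_a, feats_b):
    """feats_a: (N, T) features of the T overlapping frames of candidate a
    (frames t0 .. t0+T-1); feats_b: (N, T) features of the T overlapping frames
    of candidate b (frames t0-T+1 .. t0), already in aligned column order.
    Returns F with F[n, t] = |f_a[n, t] - f_b[n, t]|."""
    assert feats_a.shape == feats_b.shape
    return np.abs(feats_a - feats_b)

rng = np.random.default_rng(2)
T, N = 20, 6                                # e.g. 20 overlapping frames, 6 features
F = splicing_difference_matrix(rng.normal(size=(N, T)), rng.normal(size=(N, T)))
print("F shape:", F.shape)
```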
  • Step 304 The server classifies the splicing difference matrix according to the splicing combination relationship of the adjacent candidate speech segments, and obtains a splicing difference matrix set corresponding to each splicing combination relationship.
  • the splicing difference matrix set includes m splicing difference matrices belonging to the same splicing combination relationship, and m is a positive integer.
  • One splicing difference matrix can be calculated from the adjacent candidate speech segments corresponding to each training speech segment; if there are 10,000 training speech segments, 10,000 splicing difference matrices can be calculated.
  • The candidate speech segments have different phoneme or syllable types. If a training speech segment is spliced from an a-type speech segment in front and a b-type speech segment behind, the splicing combination relationship corresponding to the training speech segment is: the a-type speech segment in front and the b-type speech segment behind.
  • For example, the candidate speech segments are divided into phoneme units: the candidate speech segment a is the speech segment corresponding to the pinyin "y" and the candidate speech segment b is the speech segment corresponding to the pinyin "i", so the combination relationship formed by the pinyin "y" and the pinyin "i" is one splicing combination relationship.
  • For the splicing combination relationship formed by the pinyin "y" and the pinyin "i", there may be several hundred splicing difference matrices, and these splicing difference matrices are classified into the splicing difference matrix set corresponding to the splicing combination relationship "y+i".
  • Step 305 The server calculates an average value for the splicing difference matrix in each spliced difference matrix set, and obtains an average difference matrix corresponding to each splicing combination relationship.
  • For example, the mean is calculated over all the splicing difference matrices F ab,i in the set, and the average difference matrix F ab corresponding to the splicing combination relationship between the candidate speech segment a and the candidate speech segment b is obtained; that is, the average difference matrix F ab is the mean of the splicing difference matrix set.
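  • A minimal sketch of steps 304 and 305, assuming each splicing difference matrix is tagged with the label of its splicing combination relationship (e.g. "y+i"); the labels and matrix sizes are illustrative:
```python
# Sketch of steps 304-305: group the splicing difference matrices by splicing
# combination relationship, then average each group to get its average matrix.
from collections import defaultdict
import numpy as np

def average_difference_matrices(labeled_matrices):
    """labeled_matrices: iterable of (combination_label, F) pairs, where F is
    an (N, T) splicing difference matrix. Returns {label: average matrix}."""
    groups = defaultdict(list)
    for label, F in labeled_matrices:
        groups[label].append(F)
    return {label: np.mean(np.stack(mats), axis=0) for label, mats in groups.items()}

rng = np.random.default_rng(3)
data = [("y+i", rng.random((6, 20))) for _ in range(300)] + \
       [("n+i", rng.random((6, 20))) for _ in range(150)]
averages = average_difference_matrices(data)
print({label: F_ab.shape for label, F_ab in averages.items()})
```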
  • Step 306 the server performs singular value decomposition on the average difference matrix F ab for each average difference matrix F ab to obtain a first decomposition matrix U and a second decomposition matrix V.
  • ab represents a splicing combination relationship of a speech segment of the a type before and a speech segment of the b type; schematically, the type refers to a phoneme type.
  • Step 307 The server generates the orthogonal matrix of the first decomposition matrix U as the first weight w n and the orthogonal matrix of the second decomposition matrix V as the second weight w t .
  • Optionally, the server defines the splicing cost in a bilinear form such as CC = w_n^{\top} F_{ab} \, w_t and looks for the first weight w n and the second weight w t that make the splicing cost the smallest.
  • The first weight w n and the second weight w t at this time are determined as the target splicing weights.
  • Step 308 the server generates a stitching cost model having a first weight w n and a second weight w t .
  • Optionally, the server generates the splicing cost model in a form such as:
  • CC = \sum_{t=1}^{T} w_t \sum_{n=1}^{N} w_n \, d\big(f_{a,t,n}, f_{b,t-T+1,n}\big)
  • where CC is the splicing cost, and the splicing cost is used to characterize the continuity between adjacent candidate speech segments; T is the number of overlapping frames of the adjacent candidate speech segments; w t is the second weight corresponding to the acoustic features of the t-th overlapping frame of the adjacent candidate speech segments; N is the number of acoustic features included in each candidate speech segment; w n is the first weight corresponding to the n-th acoustic feature of the adjacent candidate speech segments; and d(f_{a,t,n}, f_{b,t-T+1,n}) is the acoustic distance measure of the n-th acoustic feature of the adjacent candidate speech segments.
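  • A minimal sketch of steps 306 to 308 under one reading of the text: taking the singular vectors of the average difference matrix F ab associated with its smallest singular value as the first weights w n and the second weights w t (this particular choice, and the sign normalization, are assumptions rather than details stated in this application):
```python
# Sketch of steps 306-308: SVD of the average difference matrix F_ab, then use a
# pair of singular vectors as weights; the pair for the smallest singular value
# minimizes w_n^T F_ab w_t over unit-norm weight vectors.
import numpy as np

def derive_target_splicing_weights(F_ab):
    U, S, Vt = np.linalg.svd(F_ab, full_matrices=False)  # F_ab = U @ diag(S) @ Vt
    w_n = np.abs(U[:, -1])                  # left singular vector, smallest sigma
    w_t = np.abs(Vt[-1, :])                 # right singular vector, smallest sigma
    return w_n, w_t

def splicing_cost_model(w_n, w_t):
    # Returns CC(F) = sum_t w_t * sum_n w_n * F[n, t] for this combination relationship.
    return lambda F: float(w_n @ F @ w_t)

rng = np.random.default_rng(4)
F_ab = rng.random((6, 20))                  # average difference matrix for "a+b"
w_n, w_t = derive_target_splicing_weights(F_ab)
cc_model = splicing_cost_model(w_n, w_t)
print("splicing cost of a new segment pair:", cc_model(rng.random((6, 20))))
```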
  • Step 309 The terminal performs speech synthesis by using a splicing cost model with a target splicing weight to obtain synthesized speech information.
  • In summary, in this embodiment the average difference matrix corresponding to each splicing combination relationship is obtained by calculating the mean of the splicing difference matrices in each splicing difference matrix set, and singular value decomposition is performed on each average difference matrix to determine the first weight and the second weight, so that the calculated weights are more accurate.
  • step 309 can be implemented as step 309a, step 309b, step 309c, step 309d and step 309e instead, as shown in FIG. 7:
  • step 309a, the terminal splits the input text information to obtain a text primitive sequence (w 1 , w 2 , ..., w n ), where w i is the i-th text primitive, and 1 ≤ i ≤ n.
  • the input text information is text information input by the user, such as news text or novel text.
  • The terminal splits the input text information to obtain a text primitive sequence (w 1 , w 2 , ..., w n ), where w i is the i-th text primitive, 1 ≤ i ≤ n.
  • Step 309b the terminal obtains a predicted acoustic feature corresponding to each text primitive w i according to the preset acoustic model.
  • step 309c the terminal selects k candidate speech segments from the corpus for each text primitive w i , and k is a positive integer.
  • Step 309d The terminal calculates a target cost between each text primitive w i and a corresponding candidate speech segment according to the target cost model; and calculates a splicing cost between adjacent candidate speech segments according to the splicing cost model.
  • Optionally, the terminal calculates the target cost between each text primitive w i and the corresponding candidate speech segment according to the target cost model by a formula of the following form:
  • TC = \sum_{n=1}^{N} w_n \, \lVert f_{a,n} - f_{a',n} \rVert
  • where TC is the target cost corresponding to the candidate speech segment a for the input text primitive; w n is the first weight corresponding to the n-th acoustic feature of the candidate speech segment in the splicing cost model generated by the model generation method; and \lVert f_{a,n} - f_{a',n} \rVert is the acoustic distance measure between the n-th acoustic feature of the candidate speech segment a and the n-th acoustic feature of the predicted acoustic feature a'.
  • Optionally, the terminal calculates the splicing cost between adjacent candidate speech segments according to the splicing cost model by a formula of the following form:
  • CC_T = \sum_{t=1}^{T} w_t \sum_{n=1}^{N} w_n \, d\big(f_{a,t,n}, f_{b,t-T+1,n}\big)
  • where CC T is the splicing cost corresponding to the adjacent candidate speech segment a and candidate speech segment b; w t is the second weight corresponding to the acoustic features of the t-th overlapping frame of the candidate speech segment a or the candidate speech segment b; w n is the first weight corresponding to the n-th acoustic feature of the candidate speech segment a or the candidate speech segment b; and d(f_{a,t,n}, f_{b,t-T+1,n}) is the acoustic distance measure between the n-th acoustic feature of the t-th overlapping frame of the candidate speech segment a and the n-th acoustic feature of the corresponding overlapping frame of the candidate speech segment b.
  • Step 309e: The terminal selects a set of target speech segment sequences (v 1 , v 2 , ..., v n ) for which the total cost corresponding to the target cost and the splicing cost is the smallest, performs speech synthesis, and obtains the voice information corresponding to the input text information.
  • the terminal selects, from all the candidate splicing modes, a set of target speech segment sequences (v 1 , v 2 , . . . , v n ) with the lowest total cost corresponding to the target cost and the splicing cost for speech synthesis, and obtains The voice information corresponding to the input text information.
  • the target cost and the splicing cost corresponding to all candidate splicing modes can form a matrix, and a path with the smallest value from left to right in the matrix can be obtained through a dynamic programming algorithm, and the path is Corresponding individual speech segments constitute a sequence of target speech segments with the lowest total cost.
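  • A minimal sketch of this dynamic-programming selection, assuming the target costs and splicing costs have already been computed and stored in nested lists (the data layout and toy values are illustrative):
```python
# Sketch of step 309e: given target costs tc[i][j] (text primitive i, candidate j)
# and splicing costs cc[i][j][k] (candidate j of primitive i followed by candidate
# k of primitive i+1), find the candidate sequence with the smallest total cost.
def min_total_cost_path(tc, cc):
    n = len(tc)
    best = list(tc[0])                        # best total cost ending at (0, j)
    back = [[-1] * len(tc[0])]                # back-pointers per primitive
    for i in range(1, n):
        new_best, new_back = [], []
        for k in range(len(tc[i])):
            costs = [best[j] + cc[i - 1][j][k] for j in range(len(tc[i - 1]))]
            j_star = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j_star] + tc[i][k])
            new_back.append(j_star)
        best, back = new_best, back + [new_back]
    # Trace back the lowest-total-cost sequence of candidate indices.
    k = min(range(len(best)), key=best.__getitem__)
    total, path = best[k], [k]
    for i in range(n - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    return list(reversed(path)), total

# Toy example: 3 text primitives with 2, 3, 2 candidates respectively.
tc = [[1.0, 2.0], [0.5, 1.5, 0.2], [0.3, 0.9]]
cc = [[[0.1, 0.4, 0.2], [0.3, 0.1, 0.5]],    # between primitives 1 and 2
      [[0.2, 0.6], [0.1, 0.3], [0.4, 0.1]]]  # between primitives 2 and 3
path, total = min_total_cost_path(tc, cc)
print("target speech segment indices:", path, "total cost:", round(total, 3))
```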
  • In summary, this embodiment further calculates the target cost between each text primitive w i and the corresponding candidate speech segments according to the target cost model, and calculates the splicing cost between adjacent candidate speech segments according to the splicing cost model; that is, the influence of the splicing cost is considered together with the target cost.
  • Because the splicing cost is used to characterize the continuity of adjacent candidate speech segments after splicing, the synthesized voice information has a better continuity effect.
  • In a specific example, the speech synthesis method is applied to a terminal application such as "Penguin FM": when a user inputs a piece of news text or novel text into an application having the speech synthesis function, the application synthesizes the voice information corresponding to the input news text or novel text.
  • FIG. 9 is a schematic structural diagram of a model generating apparatus according to an embodiment of the present invention.
  • the device can be implemented as all or part of a server by software, hardware or a combination of both.
  • The model generating device includes an obtaining module 910, an extracting module 920, a first calculating module 930, and a generating module 940.
  • the obtaining module 910 is configured to acquire training voice data, and the training voice data is voice data obtained by splicing the voice segments with the lowest target cost.
  • the extraction module 920 is configured to extract a training speech segment having a first annotation type from the training speech data, where the first annotation type is used to label the speech continuity of the training speech segment to be better than a preset condition.
  • the first calculating module 930 is configured to calculate an average difference matrix according to the adjacent candidate speech segments corresponding to the training speech segments having the first annotation type before splicing; the average difference matrix corresponds to a type of splicing combination relationship, and the average difference matrix It is used to characterize the average difference in acoustic characteristics of multiple sets of adjacent candidate speech segments belonging to the same type of splicing combination relationship.
  • the generating module 940 is configured to generate a splicing cost model with a target splicing weight according to the average difference matrix, and the splicing cost model corresponds to a splicing combination relationship.
  • FIG. 10 is a schematic structural diagram of a model generating apparatus according to another embodiment of the present invention.
  • the generating module 940 includes: a decomposing unit 941, a first generating unit 942, and a second generating unit 943.
  • the first generating unit 942 is configured to generate the orthogonal matrix of the first decomposition matrix U as the first weight w n and the orthogonal matrix of the second decomposition matrix V as the second weight w t .
  • the second generating unit 943 is configured to generate a stitching cost model having a first weight w n and a second weight w t .
  • ab represents the splicing combination relationship of the speech segment of the a type before and the speech segment of the b type.
  • The second generating unit 943 is specifically configured to generate the splicing cost model in a form such as:
  • CC = \sum_{t=1}^{T} w_t \sum_{n=1}^{N} w_n \, d\big(f_{a,t,n}, f_{b,t-T+1,n}\big)
  • where CC is the splicing cost, and the splicing cost is used to characterize the continuity between adjacent candidate speech segments; T is the number of frames of the overlapping frames of the adjacent candidate speech segments; w t is the second weight corresponding to the acoustic features of the t-th overlapping frame of the adjacent candidate speech segments; N is the number of acoustic features included in each candidate speech segment; w n is the first weight corresponding to the n-th acoustic feature of the adjacent candidate speech segments; and d(f_{a,t,n}, f_{b,t-T+1,n}) is the acoustic distance measure of the n-th acoustic feature of the adjacent candidate speech segments.
  • the first calculation module 930 includes: a first calculation unit 931, a classification unit 932, and a second calculation unit 933.
  • the first calculating unit 931 is configured to calculate, for each training speech segment having the first annotation type, a mosaic difference matrix according to the adjacent candidate speech segments corresponding to the training speech segment before the splicing.
  • the categorization unit 932 is configured to classify the splicing difference matrix according to the splicing combination relationship of the adjacent candidate speech segments, and obtain a splicing difference matrix set corresponding to each splicing combination relationship, and the splicing difference matrix set includes the same splicing combination relationship.
  • the second calculating unit 933 is configured to calculate an average value of the splicing difference matrix in each spliced difference matrix set, and obtain an average difference matrix corresponding to each splicing combination relationship.
  • the first calculating unit 931 includes:
  • a first acquisition subunit 931a, a second acquisition subunit 931b, and a calculation subunit 931c.
  • The first acquisition subunit 931a is configured to obtain, for each training speech segment having the first annotation type, the candidate speech segment a and the candidate speech segment b corresponding to the training speech segment before splicing.
  • The second acquisition subunit 931b is configured to acquire the first set of acoustic features corresponding to each overlapping frame of the candidate speech segment a and the second set of acoustic features corresponding to each overlapping frame of the candidate speech segment b, where the first set of acoustic features includes n acoustic features and the second set of acoustic features includes n acoustic features.
  • The calculating subunit 931c is configured to calculate the splicing difference matrix F from the first set of acoustic features and the second set of acoustic features according to the first formula, for example F(n, t) = d\big(f_{a,t,n}, f_{b,t-T+1,n}\big),
  • where F is the splicing difference matrix corresponding to the candidate speech segment a and the candidate speech segment b; the element in the n-th row and the t-th column of the splicing difference matrix is the acoustic distance measure between the n-th acoustic feature of the t-th overlapping frame of the candidate speech segment a and the n-th acoustic feature of the (t-T+1)-th overlapping frame of the candidate speech segment b; f_{a,t,n} is the n-th acoustic feature corresponding to the t-th overlapping frame of the candidate speech segment a; and f_{b,t-T+1,n} is the n-th acoustic feature corresponding to the (t-T+1)-th overlapping frame of the candidate speech segment b.
  • the extraction module 920 includes an obtaining unit 921 and an extracting unit 922.
  • An obtaining unit 921, configured to acquire the annotation type of at least one training speech segment in the training voice data, where the annotation type of each training speech segment is the first annotation type or the second annotation type, and the speech continuity corresponding to the first annotation type is better than the speech continuity corresponding to the second annotation type.
  • the extracting unit 922 is configured to extract a training speech segment having a first annotation type.
  • the obtaining module 910 includes:
  • the splitting unit 911 is configured to split the text information to be trained to obtain a text primitive sequence (w 1 , w 2 , ..., w n ), where w i is the i-th text primitive, and 1 ≤ i ≤ n.
  • the obtaining unit 912 is configured to obtain a predicted acoustic feature corresponding to each text primitive w i according to the preset acoustic model.
  • the selecting unit 913 is configured to select, for each text primitive w i , a speech segment vi having the smallest target cost from the corpus, and the target cost is used to represent the predicted acoustic features corresponding to the text primitive w i and the candidate speech segments in the corpus The similarity between the acoustic characteristics.
  • the synthesizing unit 914 is configured to perform speech synthesis according to the training speech segment sequence (v 1 , v 2 , . . . , v n ) composed of the selected speech segments vi, and obtain training speech data corresponding to the text information to be trained.
  • FIG. 11 is a schematic structural diagram of a voice synthesizing apparatus according to an embodiment of the present invention.
  • The voice synthesizing device adopts the splicing cost model provided in the embodiment shown in FIG. 9 or FIG. 10, and includes: a splitting module 1100, an obtaining module 1110, a selecting module 1120, a second calculating module 1130, and a synthesizing module 1140.
  • the splitting module 1100 is configured to split the input text information to obtain a text primitive sequence (w 1 , w 2 , ..., w n ), where w i is the i-th text primitive, and 1 ≤ i ≤ n.
  • The obtaining module 1110 is configured to obtain the predicted acoustic feature corresponding to each text primitive w i according to a preset acoustic model.
  • The selection module 1120 is configured to select, for each text primitive w i , a plurality of candidate speech segments from the corpus.
  • the second calculating module 1130 is configured to calculate a target cost between each text primitive w i and a corresponding candidate speech segment according to the target cost model.
  • the splicing cost between adjacent candidate speech segments is calculated according to the splicing cost model.
  • The synthesizing module 1140 is configured to select a set of target speech segment sequences (v 1 , v 2 , ..., v n ) for which the total cost corresponding to the target cost and the splicing cost is the smallest, perform speech synthesis, and obtain the voice information corresponding to the input text information.
  • Specifically, the terminal 1200 may include an RF (Radio Frequency) circuit 1210, a memory 1220 including one or more computer-readable storage media, an input unit 1230, a display unit 1240, a sensor 1250, an audio circuit 1260, a WiFi (wireless fidelity) module 1270, a processor 1280 including one or more processing cores, a power supply 1290, and other components.
  • The device structure illustrated in FIG. 12 does not constitute a limitation on the device, and the device may include more or fewer components than those illustrated, a combination of certain components, or a different component arrangement. Among them:
  • The RF circuit 1210 can be used for receiving and transmitting signals in the process of transmitting and receiving information or during a call. Specifically, after receiving the downlink information of a base station, it hands the information to one or more processors 1280 for processing; in addition, it sends uplink-related data to the base station.
  • the RF circuit 1210 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a Subscriber Identity Module (SIM) card, a transceiver, a coupler, an LNA (Low Noise Amplifier). , duplexer, etc.
  • RF circuitry 1210 can also communicate with the network and other devices via wireless communication.
  • Wireless communication can use any communication standard or protocol, including but not limited to GSM (Global System of Mobile communication), GPRS (General Packet Radio Service), CDMA (Code Division Multiple Access), WCDMA (Wideband Code Division Multiple Access), LTE (Long Term Evolution), e-mail, SMS (Short Messaging Service), and the like.
  • Memory 1220 can be used to store software programs as well as modules.
  • the processor 1280 executes various functional applications and data processing by running software programs and modules stored in the memory 1220.
  • the memory 1220 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may be stored according to The data created by the use of the terminal 1200 (such as audio data, phone book, etc.) and the like.
  • memory 1220 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, memory 1220 can also include a memory controller to provide access to memory 1220 by processor 1280 and input unit 1230.
  • The input unit 1230 can be configured to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
  • input unit 1230 can include touch-sensitive surface 1231 as well as other input devices 1232.
  • The touch-sensitive surface 1231, also known as a touch display or touchpad, can collect touch operations performed by the user on or near it (for example, operations performed by the user on or near the touch-sensitive surface 1231 using a finger, a stylus, or any other suitable object) and drive the corresponding connecting device according to a preset program.
  • the touch-sensitive surface 1231 may include two parts of a touch detection device and a touch controller.
  • Optionally, the touch detection device detects the user's touch orientation, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and sends the coordinates to the processor 1280, and it can also receive commands from the processor 1280 and execute them.
  • the touch sensitive surface 1231 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic waves.
  • the input unit 1230 can also include other input devices 1232.
  • other input devices 1232 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackballs, mice, joysticks, and the like.
  • Display unit 1240 can be used to display information entered by the user or information provided to the user and various graphical user interfaces of device 120, which can be composed of graphics, text, icons, video, and any combination thereof.
  • the display unit 1240 may include a display panel 1241.
  • the display panel 1241 may be configured in the form of an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or the like.
  • Further, the touch-sensitive surface 1231 may be overlaid on the display panel 1241. When the touch-sensitive surface 1231 detects a touch operation on or near it, the operation is transmitted to the processor 1280 to determine the type of the touch event, and the processor 1280 then provides a corresponding visual output on the display panel 1241 according to the type of the touch event.
  • Although in FIG. 12 the touch-sensitive surface 1231 and the display panel 1241 are implemented as two separate components to provide the input and output functions, in some embodiments the touch-sensitive surface 1231 can be integrated with the display panel 1241 to provide both input and output functions.
  • Terminal 1200 can also include at least one type of sensor 1250, such as a light sensor, motion sensor, and other sensors.
  • The light sensors may include an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display panel 1241 according to the ambient light, and the proximity sensor can turn off the display panel 1241 and/or the backlight when the terminal 1200 is moved close to the ear.
  • As one kind of motion sensor, the gravity acceleration sensor can detect the magnitude of acceleration in all directions (generally on three axes) and, when stationary, the magnitude and direction of gravity; it can be used in applications that recognize the posture of the mobile phone (such as switching between landscape and portrait modes, related games, and magnetometer attitude calibration) and in vibration-recognition functions (such as a pedometer or tap detection).
  • Other sensors that may also be configured on the terminal 1200, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described here.
  • Audio circuit 1260, speaker 1221, and microphone 1222 can provide an audio interface between the user and terminal 1200.
  • The audio circuit 1260 can convert received audio data into an electrical signal and transmit it to the speaker 1221, which converts it into an output sound signal.
  • Conversely, the microphone 1222 converts a collected sound signal into an electrical signal, which the audio circuit 1260 receives and converts into audio data; the audio data is then processed by the processor 1280 and either transmitted to another device via the RF circuit 1210 or output to the memory 1220 for further processing.
  • The audio circuit 1260 may also include an earbud jack to allow a peripheral earphone to communicate with the terminal 1200.
  • WiFi is a short-range wireless transmission technology. Through the WiFi module 1270, the terminal 1200 can help the user send and receive e-mail, browse web pages, access streaming media, and so on, providing the user with wireless broadband Internet access.
  • Although FIG. 12 shows the WiFi module 1270, it will be appreciated that it is not an essential part of the terminal 1200 and can be omitted as needed without changing the essence of the invention.
  • The processor 1280 is the control center of the terminal 1200. It connects the various parts of the entire device using various interfaces and lines, and it performs the various functions of the terminal 1200 and processes data by running or executing the software programs and/or modules stored in the memory 1220 and by recalling the data stored in the memory 1220, thereby monitoring the device as a whole.
  • The processor 1280 may include one or more processing cores. Optionally, the processor 1280 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, applications, and the like, and the modem processor mainly handles wireless communication.
  • It will be appreciated that the modem processor may alternatively not be integrated into the processor 1280.
  • the terminal 1200 also includes a power source 1290 (such as a battery) for powering various components.
  • the power source can be logically coupled to the processor 1280 through a power management system to manage functions such as charging, discharging, and power management through the power management system.
  • the power supply 1290 may also include any one or more of a DC or AC power source, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
  • the terminal 1200 may further include a camera, a Bluetooth module, and the like, and details are not described herein.
  • The memory 1220 stores at least one instruction, at least one program, a code set, or an instruction set.
  • the at least one instruction, the at least one program, the code set, or the set of instructions is loaded and executed by the processor 1280 to implement the speech synthesis method as described in the various method embodiments above.
  • The server 1300 includes a central processing unit (CPU) 1301, a system memory 1304 including a random access memory (RAM) 1302 and a read-only memory (ROM) 1303, and a system bus 1305 connecting the system memory 1304 and the central processing unit 1301.
  • The server 1300 also includes a basic input/output system (I/O system) 1306 that facilitates the transfer of information between the devices within the computer, and a mass storage device 1307 for storing an operating system 1313, applications 1314, and other program modules 1315.
  • the basic input/output system 1306 includes a display 1308 for displaying information and an input device 1309 such as a mouse or keyboard for user input of information.
  • the display 1308 and the input device 1309 are both connected to the central processing unit 1301 via an input/output controller 1310 connected to the system bus 1305.
  • the basic input/output system 1306 can also include an input output controller 1310 for receiving and processing input from a plurality of other devices, such as a keyboard, mouse, or electronic stylus.
  • the input and output controller 1310 also provides output to a display screen, printer, or other type of output device.
  • the mass storage device 1307 is connected to the central processing unit 1301 by a mass storage controller (not shown) connected to the system bus 1305.
  • the mass storage device 1307 and its associated computer readable medium provide non-volatile storage for the server 1300. That is, the mass storage device 1307 can include a computer readable medium (not shown) such as a hard disk or a CD-ROM drive.
  • the computer readable medium can include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory or other solid state storage technologies, CD-ROM, DVD or other optical storage, tape cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices.
  • RAM: random access memory
  • ROM: read-only memory
  • EPROM: erasable programmable read-only memory
  • EEPROM: electrically erasable programmable read-only memory
  • According to various embodiments of the present invention, the server 1300 may also run by connecting, through a network such as the Internet, to a remote computer on the network. That is, the server 1300 can be connected to the network 1312 through the network interface unit 1311 connected to the system bus 1305, or the network interface unit 1311 can be used to connect to other types of networks or remote computer systems (not shown).
  • The memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the model generation method and/or the speech synthesis method described in the method embodiments above.
  • A person of ordinary skill in the art will understand that all or some of the steps of the model generation method and the speech synthesis method of the foregoing embodiments may be completed by hardware, or by a program instructing the related hardware; the program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
  • Alternatively, the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the model generation method and/or the speech synthesis method described in the method embodiments above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Machine Translation (AREA)

Abstract

A model generation method, a speech synthesis method, and corresponding apparatuses, belonging to the technical field of speech synthesis. The model generation method includes: acquiring training speech data (202); extracting, from the training speech data, training speech segments having a first annotation type (204); calculating an average difference matrix according to the adjacent candidate speech segments that correspond, before splicing, to the training speech segments having the first annotation type (206); generating, according to the average difference matrix, a splicing cost model having target splicing weights (208); and performing speech synthesis by means of the splicing cost model having the target splicing weights to obtain synthesized speech information (210). Because the splicing cost model with the target splicing weights is generated from the average difference matrix, the weights in the splicing cost model no longer need to be adjusted by hand many times only to remain insufficiently accurate; the number of manual adjustments is reduced, and comparatively precise target splicing weights are computed directly from the average difference matrix.

Description

模型生成方法、语音合成方法及装置
本申请要求于2016年10月17日提交中国专利局、申请号为201610901099.1、发明名称为“语音合成方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明实施例涉及语音合成技术领域,特别涉及一种模型生成方法、语音合成方法及装置。
背景技术
语音合成技术,又称文语转换(Text to Speech)技术,用于将文字信息转化为语音信息。目前使用较为广泛的语音合成技术是基于波形拼接的语音合成技术。
基于波形拼接的语音合成技术的核心思想是:预先构建一个语料库,该语料库中包含各种语音片段;对于输入的文本信息,从语料库中选择合适的多个语音片段拼接得到最终的语音信息。具体来讲,对于已知的一个文本信息W=(w1,w2,…,wn),wi为文本基元,采用目标代价和拼接代价从语料库中选择出总代价最小的目标语音片段序列V=(v1,v2,…,vn)进行语音合成,vi为语音片段。其中,目标代价用于表征文本基元wi对应的预测声学特征与语料库中的候选语音片段的声学特征之间的相似性,目标代价越小,两者越相似;拼接代价用于表征相邻候选语音片段在拼接后的连续性,拼接代价越小,拼接后的语音连续性效果越好。
比如,对于已知的一个文本信息“早安中国”,文本基元“早安”在语料库中对应3个候选语音片段a,文本基元“中国”在语料库中对应2个候选语音片段b,共存在6组候选拼接方案;目标代价用于表征文本基元“早安”对应的预测声学特征与候选语音片段a之间的相似性,以及用于文本基元“中国”对应的预测声学特征与候选语音片段b之间的相似性;而拼接代价用于表征候选语音片段a与候选语音片段b之间的连续性;对于6种候选拼接方案,计算出每种候选拼接方案各自的目标代价和拼接代价,选择出总代价最小的一种候选拼接方 案作为最终的拼接方案,合成得到最终的语音信息。
完整的拼接代价模型由算法模型和权值两部分组成,为了获得较好的连续性效果,这些权值是根据设计者的经验和试错进行手工调整的。具体来讲,在通过具有初始权值的拼接代价模型为输入的文字信息进行语音合成后,需要人工测听语音信息的连续性效果,如果获得不满意的连续性效果,则需要手工调整拼接代价模型中的这些权值;通过使用具有调整后权值的拼接代价模型,将输入的文字信息再次进行语音合成,再一次对合成的语音信息重复上述过程,直至获得满意的连续性效果。
每次手工调整这些权值后,都需要重新进行语音合成并对合成的语音信息的连续性效果进行人工测听,而每次调整后的连续性效果不一定比上一次的连续性结果更优,通常需要很多次的人工测听和手工调整操作才能获得较优的权值和满意的连续性效果。即便如此,最终得到的权值仍然不够准确。
发明内容
为了解决相关技术中在语音合成过程中多次手工调整得到的权值仍然不准确的问题,本发明实施例提供了一种模型生成方法、语音合成方法及装置。所述技术方案如下:
第一方面,提供了一种模型生成方法,所述方法包括:
获取训练语音数据,所述训练语音数据是将目标代价最小的语音片段进行拼接所得到的语音数据;
从所述训练语音数据中提取具有第一标注类型的训练语音片段,所述第一标注类型用于标注所述训练语音片段的语音连续性优于预设条件;
根据具有所述第一标注类型的训练语音片段在拼接前所对应的相邻候选语音片段,计算得到平均差异矩阵;所述平均差异矩阵与一类拼接组合关系对应,所述平均差异矩阵用于表征属于同一类所述拼接组合关系的多组相邻候选语音片段在声学特征上的平均差异;
根据所述平均差异矩阵,生成具有目标拼接权值的拼接代价模型,所述拼接代价模型与一类所述拼接组合关系对应。
第二方面,提供了一种语音合成方法,采用如第一方面所述的模型生成方法所生成的所述拼接代价模型,所述方法包括:
对输入的文本信息进行拆分,得到文本基元序列(w1,w2,…,wn),wi 为第i个文本基元,1≤i≤n;
根据预设声学模型,得到与每个所述文本基元wi对应的预测声学特征;
对于每个所述文本基元wi,从语料库中选择出k个候选语音片段,所述k为正整数;
根据目标代价模型计算每个所述文本基元wi与对应的候选语音片段之间的目标代价;根据所述拼接代价模型计算相邻所述候选语音片段之间的拼接代价,所述目标代价用于表征所述文本基元wi对应的所述预测声学特征与所述候选语音片段的声学特征之间的相似性,所述拼接代价用于表征所述相邻候选语音片段之间的连续性;
选择出所述目标代价和所述拼接代价所对应的总代价最小的一组目标语音片段序列(v1,v2,…,vn)进行语音合成,得到与输入的所述文本信息对应的所述语音信息。
第三方面,提供了一种模型生成装置,所述装置包括:
获取模块,用于获取训练语音数据,所述训练语音数据是将目标代价最小的语音片段进行拼接所得到的语音数据;
提取模块,用于从所述训练语音数据中提取具有第一标注类型的训练语音片段,所述第一标注类型用于标注所述训练语音片段的语音连续性优于预设条件;
第一计算模块,用于根据具有所述第一标注类型的训练语音片段在拼接前所对应的相邻候选语音片段,计算得到平均差异矩阵;所述平均差异矩阵与一类拼接组合关系对应,所述平均差异矩阵用于表征属于同一类所述拼接组合关系的多组相邻候选语音片段在声学特征上的平均差异;
生成模块,用于根据所述平均差异矩阵,生成具有目标拼接权值的拼接代价模型,所述拼接代价模型与一类所述拼接组合关系对应。
第四方面,提供了一种语音合成装置,采用如第三方面所述的模型生成装置所生成的所述拼接代价模型,所述装置包括:
拆分模块,用于对输入的文本信息进行拆分,得到文本基元序列(w1,w2,…,wn),wi为第i个文本基元,1≤i≤n;
得到模块,用于根据预设声学模型,得到与每个所述文本基元wi对应的预测声学特征;
选择模块,用于对于每个所述文本基元wi,从语料库中选择出k个候选语 音片段,所述k为正整数;
第二计算模块,用于根据目标代价模型计算每个所述文本基元wi与对应的候选语音片段之间的目标代价;根据所述拼接代价模型计算相邻的所述候选语音片段之间的拼接代价,所述目标代价用于表征所述文本基元wi对应的所述预测声学特征与所述候选语音片段的声学特征之间的相似性,所述拼接代价用于表征相邻所述候选语音片段之间的连续性;
合成模块,用于选择出所述目标代价和所述拼接代价所对应的总代价最小的一组目标语音片段序列(v1,v2,…,vn)进行语音合成,得到与输入的所述文本信息对应的所述语音信息。
根据本发明实施例的第五方面,提供了一种服务器,所述服务器包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如上第一方面所述的模型生成方法。
根据本发明实施例的第六方面,提供了一种服务器,所述服务器包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如上第二方面所述的语音合成方法。
根据本发明实施例的第七方面,提供了一种终端,所述终端包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如上第二方面所述的语音合成方法。
根据本发明实施例的第八方面,提供了一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如上第一方面所述的模型生成方法。
根据本发明实施例的第九方面,提供了一种计算机可读存储介质,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如上第二方面所述的语音合成方法。
本发明实施例提供的技术方案至少具有如下有益效果:
通过根据具有第一标注类型的多个训练语音片段在拼接前所对应的相邻 候选语音片段,计算得到平均差异矩阵,根据平均差异矩阵生成具有目标拼接权值的拼接代价模型;其中,每个平均差异矩阵用于表征属于同一类拼接组合关系的多组相邻候选语音片段在声学特征上的平均差异,由于拼接代价模型是根据平均差异矩阵计算得到的,因此使得生成的拼接代价模型具有精准的权值,减少了手工调整次数,避免了需要多次手工调整拼接代价模型中的权值,且最终得到的权值仍然不够准确的情况。
附图说明
为了更清楚地说明本发明实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1A是一种基于波形拼接的语音合成方法的原理示意图;
图1B是本发明另一个实施例提供的语音合成方法的原理示意图;
图2是本发明一个实施例提供的语音合成方法的方法流程图;
图3是本发明另一个实施例提供的语音合成方法的方法流程图;
图4A是本发明另一个实施例提供的语音合成方法的方法流程图;
图4B是本发明另一个实施例提供的语音合成方法的方法流程图;
图5是本发明另一个实施例提供的语音合成方法的原理示意图;
图6是本发明另一个实施例提供的语音合成方法的原理示意图;
图7是本发明另一个实施例提供的语音合成方法的方法流程图;
图8是本发明另一个实施例提供的语音合成方法的界面示意图;
图9是本发明一个实施例提供的模块生成装置的结构示意图;
图10是本发明另一个实施例提供的模块生成装置的结构示意图;
图11是本发明一个实施例提供的语音合成装置的结构示意图;
图12是本发明一个实施例提供的终端的框图;
图13是本发明一个实施例提供的服务器的框图。
具体实施方式
为使本发明的目的、技术方案和优点更加清楚,下面将结合附图对本发明实施方式作进一步地详细描述。
首先,对本发明实施例涉及到的一些名词进行解释:
文本基元序列:对输入的文本信息进行拆分,得到一组文本基元序列(w1,w2,…,wn),wi为第i个文本基元,1≤i≤n,i和n为正整数。
目标代价:用于表征文本基元wi对应的预测声学特征与候选语音片段的声学特征之间的相似性。目标代价越小,代表两者越相似。
可选的,预测声学特征由与文本基元wi对应的声学参数数值来表示,或者预测声学特征由与文本基元wi对应的概率模型来表示。预测声学特征是基频、频谱特征、基频的一阶变化率以及高阶变化率、频谱的一阶变化率以及高阶变化率、信号的能量、信号的过零率中的至少一种。
可选的,候选语音片段是语料库中存储的若干个语音片段。
拼接代价:用于表征相邻候选语音片段之间的连续性。
训练语音数据:是将目标代价最小的语音片段进行拼接所得到的语音数据。
训练语音数据是与目标代价有关且与拼接代价无关的待训练的语音信息。即在训练语音数据的语音合成过程中,不考虑拼接代价的影响(设拼接代价为0),只考虑目标代价。在本发明实施例中,模型生成方法的拼接过程中假设拼接代价为0,即不考虑拼接代价对语音合成过程的影响。
可选的,训练语音数据包括至少一个训练语音片段,一个训练语音片段是由第一候选语音片段和第二候选语音片段拼接得到的训练语音片段。
训练语音片段的标注类型:包括第一标注类型和第二标注类型。第一标注类型用于标注训练语音片段的语音连续性优于预设条件即语音连续性效果较好的训练语音片段,第二标注类型用于标注训练语音片段的语音连续性低于预设条件即语音连续性效果较差的训练语音片段。
可选的,每个训练语音片段的标注类型由人工测听后标注得到。若人工测听结果为该训练语音片段的连续性较优,则将该训练语音片段标注为第一标注类型;若人工测听结果为该训练语音片段的连续性较差,则将该训练语音片段标注为第二标识类型,第一标注类型所对应的语音连续性优于第二标注类型所对应的语音连续性。
平均差异矩阵:用于表征属于同一类拼接组合关系的多组相邻候选语音片段在声学特征上的平均差异。其中,平均差异矩阵与一类拼接组合关系对应。
由于一个训练语音片段是由第一候选语音片段和第二候选语音片段拼接得到的,因此通过第一候选语音片段和第二候选语音片段在声学特征上的差异,能够求得第一候选语音片段和第二候选语音片段的拼接差异矩阵。对多组属于同一类拼接组合关系的拼接差异矩阵求均值,从而得到该类拼接组合关系所对应的平均差异矩阵。
可选的,若语音片段采用音素为单位进行划分,拼接组合关系包括至少两个音素之间的组合关系。示意性的,拼接组合关系是音素单元a在前且音素单元b在后所组成的组合关系。
比如,拼音“y”和拼音“i”所形成的组合关系是一种拼接组合关系。
拼接代价模型:是具有目标拼接权值的拼接代价模型。其中,拼接代价模型与一类拼接组合关系对应。
其中,目标拼接权值包括第一权值和第二权值。第一权值是拼接的两个候选语音片段中的第n个声学特征对应的权值,第二权值是两个候选语音片段中的第t个重叠帧的声学特征对应的第二权值。
在介绍本发明实施例提供的模型生成方法以及语音合成方法之前,先介绍一下相关技术中基于波形拼接的语音合成过程。请参考图1A,其示出了一种基于波形拼接的语音合成方法的原理示意图。
用户向服务器输入一个文本信息,服务器对输入的文本信息进行拆分,得到一组文本基元序列(w1,w2,…,wn),经过一系列的步骤,最终服务器将该组文本基元序列转化为一组目标语音片段序列(v1,v2,…,vn)进行语音合成,得到与输入的文本信息对应的语音信息。以两个前后相邻的文本基元即文本基元w1、文本基元w2为例进行具体说明,服务器根据预设声学模型,将文本基元w1和文本基元w2进行前端处理,分别得到与文本基元w1对应的预测声学特征1,与文本基元w2对应的预测声学特征2。对于文本基元w1对应的预测声学特征1,从语料库中选择出三个第一候选语音片段,三个第一候选语音片段包括候选语音片段a1、候选语音片段a2、候选语音片段a3;对于文本基元w2对应的预测声学特征2,从语料库中选择出两个第二候选语音片段,两个第二候选语音片段包括候选语音片段b1、候选语音片段b2。
当将三个第一候选语音片段和两个第二候选语音片段进行拼接时,一共存在6组候选拼接方案,如表一所示。第一组候选拼接方案为候选语音片段a1 与候选语音片段b1拼接,第二组候选拼接方案为候选语音片段a2与候选语音片段b1拼接,第三组候选拼接方案为候选语音片段a3与候选语音片段b1拼接,第四组候选拼接方案为候选语音片段a1与候选语音片段b2拼接,第五组候选拼接方案为候选语音片段a2与候选语音片段b2拼接,第六组候选拼接方案为候选语音片段a3与候选语音片段b2拼接。其中,对于第一组候选拼接方案,服务器根据目标代价模型计算文本基元w1与对应的候选语音片段a1之间的第一目标代价TC11,文本基元w2与对应的候选语音片段b1之间的第二目标代价TC50,根据拼接代价模型计算候选语音片段a1与候选语音片段b1之间的拼接代价CC11,计算得到与第一组候选拼接方案对应的总代价RC1,总代价RC1包括第一目标代价TC11、第二目标代价TC50和第一拼接代价CC11;依次类推,分别计算得到与第二组候选拼接方案对应的总代价RC2,与第三组候选拼接方案对应的总代价RC3,与第四组候选拼接方案对应的总代价RC4,与第五组候选拼接方案对应的总代价RC5,与第六组候选拼接方案对应的总代价RC6。服务器将这六组候选拼接方案对应的总代价进行比较,比较结果为第二组候选拼接方案所对应的总代价RC2最小,即确定出候选语音片段a1与候选语音片段b2属于目标语音片段,进行最终的语音拼接,并得到最终的合成语音。
表一
Figure PCTCN2017097314-appb-000001
Figure PCTCN2017097314-appb-000002
在本发明实施例中,以上述的第四组候选拼接方案即候选语音片段a1与候选语音片段b2为例,拼接代价模型可以采用如下公式定义:
Figure PCTCN2017097314-appb-000003
wn=[wn=1 wn=2 … wn=N]T
Figure PCTCN2017097314-appb-000004
Figure PCTCN2017097314-appb-000005
CC为拼接代价,该CC用于表征候选语音片段a1和候选语音片段b2的连续性,T为候选语音片段a1或候选语音片段b2的重叠帧的帧数,wt为候选语音片段a1和候选语音片段b2的第t个重叠帧的声学特征对应的第二权值,N为候选语音片段a1或候选语音片段b2包含的声学特征的个数,wn为候选语音片段a1和候选语音片段b2的第n个声学特征对应的第一权值,|Δf|为候选语音片段a1和候选语音片段b2的第n个声学特征的声学距离测度,F为候选语音片段a1和候选语音片段b2对应的拼接差异矩阵。可选的,|Δf|为候选语音片段a1的第n个声学特征与候选语音片段b2的第n个声学特征之间的声学距离测度。
结合参考图1B,当候选语音片段a1和候选语音片段b2拼接时,假设候选语音片段a1和候选语音片段b2只有1个重叠帧,候选语音片段a1在该重叠帧上具有N个声学特征(或者说N维声学特征),候选语音片段b2在该重叠帧上对应存在N个声学特征(或者说N维声学特征)文本基元w1文本基元w2。由于用户发音时,对于不同的相邻候选语音片段,口型过渡和音调过渡是不同的,即不同的相邻候选语音片段所对应的第n个声学特征对应的第一权值wn和第t个重叠帧(图1B中假设只有1个重叠帧)的声学特征对应的第二权值wt也是不同的。根据候选语音片段a1或候选语音片段b2包含的声学特征的个数,将候选语音片段a1和候选语音片段b2的每个声学特征的声学距离测度 与相对应的第一权值wn相乘求和,再根据候选语音片段a1或候选语音片段b2的重叠帧的帧数,将与第i个重叠帧相对应的第一权值wn相乘求和的结果再与相对应的第二权值wt相乘求和得到拼接代价。
其中,通过奇异值矩阵分解,可以将拼接代价的计算公式进行如下变形:
Figure PCTCN2017097314-appb-000006
根据上述的几个公式可知,服务器可以预先通过训练语音数据(相当于训练样本)计算得到拼接差异矩阵F,根据拼接差异矩阵F,计算得到第一权值wn和第二权值wt,即当第一权值wn与第一分解矩阵U正交且第二权值wt与第二分解矩阵V正交,即u=0、v=0时,拼接代价最小,将此时的第一权值wn和第二权值wt确定为目标拼接权值。为此,本发明提供如下实施例。
请参考图2,其示出了本发明一个实施例提供的语音合成方法的方法流程图。该语音合成方法可由具有语音处理能力的服务器或终端来执行,该语音合成方法包括:
步骤202,获取训练语音数据。
可选的,服务器获取待训练的训练语音数据,该训练语音数据包括多个训练语音片段。
步骤204,从训练语音数据中提取具有第一标注类型的训练语音片段。
可选的,服务器确定该训练语音数据所包括的至少两个训练语音片段,至少两个训练语音片段的标注类型包括第一标注类型和/或第二标注类型,从至少两个训练语音片段中提取x个具有第一标注类型的训练语音片段,x为正整数。
步骤206,根据具有第一标注类型的训练语音片段在拼接前所对应的相邻候选语音片段,计算得到平均差异矩阵。
可选的,在提取出x个具有第一标注类型的训练语音片段之后,对于每个具有第一标注类型的训练语音片段,服务器根据该训练语音片段在拼接前所对应的相邻候选语音片段,计算得到拼接差异矩阵。对于多组属于同一类拼接组合关系的拼接差异矩阵求均值,计算得到该类拼接组合关系所对应的平均差异矩阵。
步骤208,根据平均差异矩阵,生成具有目标拼接权值的拼接代价模型。
可选的,服务器根据计算得到的平均差异矩阵,通过预设公式计算得到拼接代价模型,该拼接代价模型具有目标拼接权值。
步骤210,通过具有目标拼接权值的拼接代价模型进行语音合成,得到合成的语音信息。
可选的,当服务器确定需要进行语音合成的文本信息时,服务器通过该拼接代价模型将确定的文本信息进行语音合成,得到合成的语音信息。
在实际的语音合成过程中,服务器将生成的拼接代价模型传输给终端,使得终端能够采用拼接代价模型进行应用。
可选的,终端从服务器中获取生成的拼接代价模型,当终端接收到需要进行语音合成的文本信息时,终端通过该拼接代价模型将输入的文本信息进行语音合成,得到合成的语音信息。
需要说明的是,步骤202至步骤208可以单独实现成为一种模型生成方法,该模型生成方法通常由服务器来完成,用于生成具有目标拼接权值的拼接代价模型;步骤210为一种语音合成方法,该语音合成方法通常由服务器或终端来完成,用于采用步骤202至步骤208所生成的拼接代价模型将输入的文本信息进行语音合成,得到合成的语音信息。下面,仅以服务器完成模型生成方法,且终端完成语音合成方法为例进行说明,
综上所述,本实施例通过根据具有第一标注类型的多个训练语音片段在拼接前所对应的相邻候选语音片段,计算得到平均差异矩阵,根据平均差异矩阵生成具有目标拼接权值的拼接代价模型;其中,每个平均差异矩阵用于表征属于同一类拼接组合关系的多组相邻候选语音片段在声学特征上的平均差异,由于拼接代价模型是根据平均差异矩阵计算得到的,因此使得生成的拼接代价模型具有精准的权值,减少了手工调整次数,避免了需要多次手工调整拼接代价模型中的权值,且最终得到的权值仍然不够准确的情况。
请参考图3,其示出了本发明另一个实施例提供的语音合成方法的方法流程图。该语音合成方法包括:
步骤301,服务器获取训练语音数据。
可选的,步骤301可以被替代实现为步骤301a、步骤301b、步骤301c和步骤301d,如图4A所示:
步骤301a,服务器对待训练的文本信息进行拆分,得到文本基元序列(w1, w2,…,wn),wi为第i个文本基元,1≤i≤n。
可选的,服务器基于音素或音节对待训练的文本信息进行拆分,得到文本基元序列(w1,w2,…,wn),wi为第i个文本基元,1≤i≤n。
步骤301b,服务器根据预设声学模型,得到与每个文本基元wi对应的预测声学特征。
可选的,服务器将每个文本基元wi对应的语言学模型输出预设的声学模型中,由该预设的声学模型输出与每个文本基元wi对应的预测声学特征。
步骤301c,服务器对于每个文本基元wi,从语料库中选择出目标代价最小的语音片段vi
可选的,服务器对于每个文本基元wi,计算得到与每个文本基元wi对应的候选语音片段的目标代价,从语料库中选择出目标代价最小的语音片段viwt
可选的,对于每个文本基元wi,服务器通过如下公式计算对应的目标代价:
Figure PCTCN2017097314-appb-000007
其中,TCi为文本基元wi对应的目标代价,wn为预设的第一权值,|fa,n-fa',n|为文本基元wi对应的预测声学特征a’中的第n个声学特征与候选语音片段a的第n个声学特征之间的声学距离测度。
可选的,若声学特征采用具体的声学参数取值来表示,则声学距离测度可以取欧几里德距离或差值绝对值。
示意性的,若存在10个文本基元wi,则服务器从语料库中对应选择出10个具有最小目标代价的语音片段vi
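A minimal Python sketch of the selection in step 301c, assuming each candidate speech segment and the predicted acoustic feature are represented as length-N feature vectors and that w_n holds the preset first weights (array names and layout are illustrative assumptions, not part of the original disclosure):

```python
import numpy as np

def target_cost(predicted, candidate, w_n):
    """TC = sum_n w_n * |f_candidate[n] - f_predicted[n]|, per the formula above."""
    return float(np.sum(w_n * np.abs(np.asarray(candidate) - np.asarray(predicted))))

def pick_min_target_cost(predicted, candidates, w_n):
    """Step 301c: return the index and cost of the candidate with the smallest target cost."""
    costs = [target_cost(predicted, c, w_n) for c in candidates]
    best = int(np.argmin(costs))
    return best, costs[best]
```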
步骤301d,服务器根据选择出的语音片段vi所组成的训练语音片段序列(v1,v2,…,vn)进行语音合成,得到与待训练的文本信息对应的训练语音数据。
步骤302,服务器从训练语音数据中提取具有第一标注类型的训练语音片段。
可选的,步骤302可以被替代实现为步骤302a和步骤302b,如图4B所示:
步骤302a,服务器获取训练语音数据中至少一个训练语音片段的标注类型,每个训练语音片段的标注类型为第一标注类型或第二标注类型,第一标注类型所对应的语音连续性优于第二标注类型所对应的语音连续性。
步骤302b,服务器提取具有第一标注类型的训练语音片段。
可选的,通过对训练语音数据进行人工测听,标注出第一标注类型或第二标注类型的训练语音片段。在服务器提取出具有第一标注类型的训练语音片段时,获取每个训练语音片段的标注类型。服务器从训练语音数据中提取具有第一标注类型的训练语音片段。
步骤303,服务器对于每个具有第一标注类型的训练语音片段,根据训练语音片段在拼接前所对应的相邻候选语音片段计算得到拼接差异矩阵。
可选的,训练语音片段为多个,比如几百个、几千个或者上万个。服务器对于每个具有第一标注类型的训练语音片段,根据该训练语音片段在拼接前所对应的相邻候选语音片段计算得到与该训练语音片段所对应的拼接差异矩阵。
服务器计算拼接差异矩阵的步骤包括:
1)对于每个具有第一标注类型的训练语音片段,服务器获取训练语音片段在拼接前所对应的候选语音片段a和候选语音片段b。
2)服务器获取候选语音片段a的每个重叠帧对应的第一组声学特征和候选语音片段b的每个重叠帧对应的第二组声学特征。
可选的,候选语音片段a和候选语音片段b的重叠帧的帧数可以是一帧,也可以是多帧。示意性的,如图5所示,设当前时刻为t0,候选语音片段a的最后一帧所在时刻为t0,候选语音片段b的第一帧所在时刻为t0,当拼接窗口长度T=1帧时,候选语音片段a的最后一帧与候选语音片段b的第一帧重叠,即“a(t0)+b(t0)”;也即,在拼接过程中,候选语音片段a和候选语音片段b存在一个重叠帧。
示意性的,如图6所示,设当前时刻为t0,候选语音片段a的最后一帧所在时刻为t0,候选语音片段b的第一帧所在时刻为t0,当拼接窗口长度T取任意值时,候选语音片段a的第t0帧至第t0+T-1帧分别与候选语音片段b的第t0-T+1帧至第t0帧重叠,即“a(t0:t0+T-1)+b(t0-T+1:t0)”,本发明实施例对重叠帧的帧数T不加以限定,示意性的,该重叠帧的帧数T为20帧。
可选的,候选语音片段a的每个重叠帧上对应第一组声学特征,该第一组声学特征包含n个声学特征(或者说n维声学特征),候选语音片段b的每个重叠帧上对应第二组声学特征,该第二组声学特征包含n个声学特征(或者说n维声学特征)。该声学特征是基频、频谱特征、基频的一阶变化率以及高阶变化率、频谱的一阶变化率以及高阶变化率、信号的能量、信号的过零率中的至 少一种。
3)服务器根据第一组声学特征和第二组声学特征,按照如下公式计算得到拼接差异矩阵F。
Figure PCTCN2017097314-appb-000008
其中,F为候选语音片段a和候选语音片段b对应的拼接差异矩阵,拼接差异矩阵中的第n行第t列表示候选语音片段a中的第t个重叠帧的第n个声学特征与候选语音片段b中的第t-T+1个重叠帧的第n个声学特征的声学距离测度,fa,t是与候选语音片段a的第t个重叠帧对应的第n个声学特征,fb,t-T+1是与候选语音片段b的第t-T+1个重叠帧对应的第n个声学特征。
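A minimal sketch of the computation of F described above, assuming the T overlapping frames of candidate segments a and b have already been extracted and aligned frame by frame, with each frame carrying N acoustic features; the array layout is an illustrative assumption:

```python
import numpy as np

def splice_difference_matrix(overlap_a, overlap_b):
    """Build the N x T splicing difference matrix F for one adjacent candidate pair.

    overlap_a, overlap_b: (T, N) arrays holding the acoustic features of the T
    overlapping frames of candidate segments a and b, aligned so that row t of
    overlap_a corresponds to the frame of b indexed t-T+1 in the formula above.
    F[n, t] is the absolute acoustic distance between the n-th features of that
    aligned frame pair.
    """
    return np.abs(np.asarray(overlap_a) - np.asarray(overlap_b)).T  # shape (N, T)
```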
步骤304,服务器根据相邻候选语音片段的拼接组合关系对拼接差异矩阵进行分类,得到与每一种拼接组合关系所对应的拼接差异矩阵集合。
其中,拼接差异矩阵集合包括属于同一种拼接组合关系的m个拼接差异矩阵,m为正整数。
可选的,每个测量语音片段所对应的相邻候选语音片段能够计算出一个拼接差异矩阵,若测量语音片段为一万个,则可以计算出一万个拼接差异矩阵。
候选语音片段具有不同的音素或音节类型,若一个训练语音片段是由a类型的语音片段在前且b类型的语音片段所拼接得到的,则该训练语音片段所对应的拼接组合关系是:a类型的语音片段在前且b类型的语音片段在后。
示意性的,若候选语音片段采用音素为单位进行划分,比如候选语音片段a是拼音“y”所对应的语音片段,候选语音片段b是拼音“i”所对应的语音片段,则拼音“y”和拼音“i”所形成的组合关系就是一种拼接组合关系。对于拼音“y”和拼音“i”所形成的拼接组合关系,可能存在几百个拼接差异矩阵,则这几百个拼接差异矩阵都归类至与拼接组合关系“y+i”所对应的拼接差异矩阵集合。
步骤305,服务器对每个拼接差异矩阵集合中的拼接差异矩阵计算均值,得到与每一种拼接组合关系所对应的平均差异矩阵。
示意性的,当拼接差异矩阵集合为Fab,i时,对Fab,i中的所有拼接差异矩阵计算均值,得到与选语音片段a和候选语音片段b的拼接组合关系所对应的平均差异矩阵Fab
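Steps 304 and 305 can be sketched as follows; the (unit_type_a, unit_type_b) keys such as ("y", "i") and the container layout are assumptions made for illustration:

```python
from collections import defaultdict
import numpy as np

def average_difference_matrices(labelled_pairs):
    """Group the splicing difference matrices by splice-combination class and average them.

    labelled_pairs: iterable of ((unit_type_a, unit_type_b), F) tuples, one per
    adjacent candidate pair taken from a training segment with the first
    annotation type, where F is the N x T difference matrix of that pair.
    Returns one average difference matrix F_ab per combination class.
    """
    groups = defaultdict(list)
    for key, F in labelled_pairs:
        groups[key].append(np.asarray(F, dtype=float))
    return {key: np.mean(np.stack(mats), axis=0) for key, mats in groups.items()}
```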
步骤306,服务器对于每个平均差异矩阵Fab,对平均差异矩阵Fab进行奇异值分解,得到第一分解矩阵U和第二分解矩阵V。
服务器对于每个平均差异矩阵Fab,对平均差异矩阵Fab进行奇异值分解Fab=U∑V,得到第一分解矩阵U和第二分解矩阵V。
其中,ab代表由a类型的语音片段在前且b类型的语音片段在后的拼接组合关系;示意性的,该类型是指音素类型。
步骤307,服务器将第一分解矩阵U的正交矩阵生成为第一权值wn,将第二分解矩阵V的正交矩阵生成为第二权值wt
可选的,服务器通过如下公式定义拼接代价:
Figure PCTCN2017097314-appb-000009
根据上述公式可知,当第一权值wn与第一分解矩阵U正交且第二权值wt与第二分解矩阵V正交,即u=0、v=0时,拼接代价最小,将此时的第一权值wn和第二权值wt确定为目标拼接权值。
步骤308,服务器生成具有第一权值wn和第二权值wt的拼接代价模型。
服务器生成拼接代价模型如下:
Figure PCTCN2017097314-appb-000010
其中,CC为拼接代价,拼接代价用于表征相邻候选语音片段之间的连续性,T为相邻候选语音片段的重叠帧的帧数,wt为相邻候选语音片段的第t个重叠帧的声学特征对应的第二权值,N为每个候选语音片段包含的声学特征的个数,wn为相邻候选语音片段的第n个声学特征对应的第一权值,|Δf|为相邻候选语音片段的第n个声学特征的声学距离测度。
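The weight derivation of steps 306 to 308 can be read, in one simple interpretation, as keeping the singular directions of F_ab along which the average difference acts most weakly; the sketch below follows that reading, and the absolute value and renormalisation of the singular vectors are illustrative assumptions so that the results behave as non-negative weights:

```python
import numpy as np

def weights_from_average_difference(F_ab):
    """Derive (w_n, w_t) from the N x T average difference matrix F_ab.

    F_ab = U S Vt by singular value decomposition; the weights are taken from the
    singular directions with the smallest singular value, i.e. the directions
    closest to orthogonal to the dominant structure of F_ab, which keeps the
    bilinear form w_n^T F_ab w_t small.
    """
    U, S, Vt = np.linalg.svd(np.asarray(F_ab, dtype=float), full_matrices=False)
    w_n = np.abs(U[:, -1])    # left direction of the smallest singular value
    w_t = np.abs(Vt[-1, :])   # matching right direction
    return w_n / w_n.sum(), w_t / w_t.sum()

def splicing_cost_model(w_n, w_t):
    """CC(F) = sum_t w_t * sum_n w_n * |delta f|, i.e. w_n^T F w_t, per the formula above."""
    return lambda F: float(w_n @ np.asarray(F) @ w_t)
```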
步骤309,终端通过具有目标拼接权值的拼接代价模型进行语音合成,得到合成的语音信息。
综上所述,本实施例还通过对每个拼接差异矩阵集合中的拼接差异矩阵计算均值,得到与每一种拼接组合关系所对应的平均差异矩阵,通过对每个平均差异矩阵进行奇异值分解确定第一权值和第二权值,从而使得计算出的权值更加精确。
在一种可能的实施例中,上述实施例中,由服务器得到的拼接代价模型可 以传输给终端在实际的语音合成过程中进行应用。此时,步骤309可以被替代实现为步骤309a、步骤309b、步骤309c、步骤309d和步骤309e,如图7所示:
步骤309a,终端对输入的文本信息进行拆分,得到文本基元序列(w1,w2,…,wn),wi为第i个文本基元,1≤i≤n。
可选的,输入的文本信息是由用户输入的文本信息,比如,新闻文本或者小说文本。终端对输入的文本信息进行拆分,得到文本基元序列(w1,w2,…,wn),wi为第i个文本基元,1≤i≤n。
步骤309b,终端根据预设声学模型,得到与每个文本基元wi对应的预测声学特征。
步骤309c,终端对于每个文本基元wi,从语料库中选择出k个候选语音片段,k为正整数。
步骤309d,终端根据目标代价模型计算每个文本基元wi与对应的候选语音片段之间的目标代价;根据拼接代价模型计算相邻候选语音片段之间的拼接代价。
可选的,终端根据目标代价模型,通过如下公式计算每个文本基元wi与对应的候选语音片段之间的目标代价:
Figure PCTCN2017097314-appb-000011
其中,TC为输入的文本基元a对应的目标代价,wn为采用模型生成方法生成的拼接代价模型中的候选语音片段第n个声学特征对应的第一权值,|fa,n-fa',n|为候选语音片段a和预测声学特征a’的第n个声学特征的声学距离测度。
可选的,终端根据拼接代价模型,通过如下公式计算相邻候选语音片段之间的拼接代价:
Figure PCTCN2017097314-appb-000012
其中,CCT为相邻的候选语音片段a和候选语音片段b对应的拼接代价,wt为候选语音片段a或候选语音片段b的第t个重叠帧的声学特征对应的第二权值,wn为候选语音片段a或候选语音片段b的第n个声学特征对应的第一权值,|fa,t-fb,t-T+1|为候选语音片段a的第t个重叠帧和候选语音片段b的第t-T+1个重叠帧的第n个声学特征的声学距离测度。
步骤309e,终端选择出目标代价和拼接代价所对应的总代价最小的一组目标语音片段序列(v1,v2,…,vn)进行语音合成,得到与输入的文本信息对应的语音信息。
可选的,终端从所有候选拼接方式中,选择出目标代价和拼接代价所对应的总代价最小的一组目标语音片段序列(v1,v2,…,vn)进行语音合成,得到与输入的文本信息对应的语音信息。
可选的,所有候选拼接方式所对应的目标代价和拼接代价,能够形成一个矩阵,通过动态规划算法,能够求出该矩阵中从左到右的取值最小的一条路径,则该条路径所对应的各个语音片段,构成总代价最小的一组目标语音片段序列。
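The minimum-total-cost selection described above is a standard dynamic-programming (Viterbi-style) search over the candidate lattice; a sketch follows, with the layout of the cost tables assumed for illustration:

```python
import numpy as np

def best_unit_sequence(target_costs, concat_costs):
    """Dynamic-programming search over the candidate lattice (step 309e).

    target_costs: list of length n; target_costs[i][j] is the target cost of the
        j-th candidate segment for text primitive w_i.
    concat_costs: list of length n-1; concat_costs[i][j][k] is the splicing cost
        between the j-th candidate of w_i and the k-th candidate of w_{i+1}.
    Returns the candidate indices forming the minimum total cost sequence (v_1, ..., v_n).
    """
    n = len(target_costs)
    best = [np.asarray(target_costs[0], dtype=float)]
    back = []
    for i in range(1, n):
        tc = np.asarray(target_costs[i], dtype=float)        # shape (k_i,)
        cc = np.asarray(concat_costs[i - 1], dtype=float)    # shape (k_{i-1}, k_i)
        total = best[-1][:, None] + cc + tc[None, :]         # accumulated cost per transition
        back.append(np.argmin(total, axis=0))                # best predecessor per candidate
        best.append(np.min(total, axis=0))
    path = [int(np.argmin(best[-1]))]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```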
综上所述,本实施例还通过终端根据目标代价模型计算每个文本基元wi与对应的候选语音片段之间的目标代价,根据拼接代价模型计算相邻候选语音片段之间的拼接代价,选择出目标代价和拼接代价所对应的总代价最小的一组目标语音片段序列(v1,v2,…,vn)进行语音合成,得到与输入的文本信息对应的语音信息;在考虑目标代价因素的同时考虑拼接代价的影响,由于拼接代价用于表征相邻候选语音片段在拼接后的连续性,从而使得合成的语音信息具有较好的连续性效果。
结合参考图8,在一个示意性的例子中,语音合成方法应用于终端的应用程序如“企鹅FM”上,当用户在具有语音合成功能的应用程序中输入一段新闻文本或者小说文本,应用程序将合成与输入的新闻文本或者小说文本相对应的语音信息。
下面为本发明中的装置实施例,对于装置实施例中未详尽描述的细节,可以结合参考上述一一对应的方法实施例。
请参考图9,其示出了本发明一个实施例提供的模块生成装置的结构示意图。
该装置可以通过软件、硬件或者两者的结合,实现成为服务器的全部或一部分。该模块生成装置包括:获取模块910、提取模块920、第一计算模块930和生成模块940。
获取模块910,用于获取训练语音数据,训练语音数据是将目标代价最小的语音片段进行拼接所得到的语音数据。
提取模块920,用于从训练语音数据中提取具有第一标注类型的训练语音片段,第一标注类型用于标注训练语音片段的语音连续性优于预设条件。
第一计算模块930,用于根据具有第一标注类型的训练语音片段在拼接前所对应的相邻候选语音片段,计算得到平均差异矩阵;平均差异矩阵与一类拼接组合关系对应,平均差异矩阵用于表征属于同一类拼接组合关系的多组相邻候选语音片段在声学特征上的平均差异。
生成模块940,用于根据平均差异矩阵,生成具有目标拼接权值的拼接代价模型,拼接代价模型与一类拼接组合关系对应。
基于图9提供的实施例,在一种可能的实现方式中,请参考图10,其示出了本发明另一个实施例提供的模块生成装置的结构示意图。
生成模块940,包括:分解单元941、第一生成单元942和第二生成单元943。
分解单元941,用于对于每个平均差异矩阵Fab,对平均差异矩阵Fab进行奇异值分解Fab=U∑V,得到第一分解矩阵U和第二分解矩阵V。
第一生成单元942,用于将第一分解矩阵U的正交矩阵生成为第一权值wn,将第二分解矩阵V的正交矩阵生成为第二权值wt
第二生成单元943,用于生成具有第一权值wn和第二权值wt的拼接代价模型。
其中,ab代表由a类型的语音片段在前且b类型的语音片段在后的拼接组合关系。
第二生成单元943,具体用于生成所述拼接代价模型如下:
Figure PCTCN2017097314-appb-000013
其中,CC为拼接代价,所述拼接代价用于表征相邻候选语音片段之间的连续性,T为相邻候选语音片段的重叠帧的帧数,wt为相邻候选语音片段的第t个所述重叠帧的所述声学特征对应的第二权值,N为每个所述候选语音片段包含的所述声学特征的个数,wn为相邻候选语音片段的第n个所述声学特征对 应的第一权值,|Δf|为相邻候选语音片段的第n个所述声学特征的声学距离测度。
第一计算模块930,包括:第一计算单元931、分类单元932和第二计算单元933。
第一计算单元931,用于对于每个具有第一标注类型的训练语音片段,根据训练语音片段在拼接前所对应的相邻候选语音片段计算得到拼接差异矩阵。
分类单元932,用于根据相邻候选语音片段的拼接组合关系对拼接差异矩阵进行分类,得到与每一种拼接组合关系所对应的拼接差异矩阵集合,拼接差异矩阵集合包括属于同一种拼接组合关系的若干个拼接差异矩阵。
第二计算单元933,用于对每个拼接差异矩阵集合中的拼接差异矩阵计算均值,得到与每一种拼接组合关系所对应的平均差异矩阵。
第一计算单元931,包括:
第一获取子单元931a、第二获取子单元931b和计算子单元931c。
第一获取子单元931a,用于对于每个具有第一标注类型的训练语音片段,获取训练语音片段在拼接前所对应的候选语音片段a和候选语音片段b。
第二获取子单元931b,用于获取候选语音片段a的重叠帧对应的第一组声学特征和候选语音片段b的重叠帧对应的第二组声学特征,第一组声学特征包含n个声学特征,第二组声学特征包含n个声学特征。
计算子单元931c,用于根据第一组声学特征和第二组声学特征,按照如下公式计算得到拼接差异矩阵F。
Figure PCTCN2017097314-appb-000014
其中,F为候选语音片段a和候选语音片段b对应的拼接差异矩阵,拼接差异矩阵中的第n行第t列表示候选语音片段a中的第t个重叠帧的第n个声学特征与候选语音片段b中的第t-T+1个重叠帧的第n个声学特征的声学距离测度,fa,t是与候选语音片段a的第t个重叠帧对应的第n个声学特征,fb,t-T+1是与候选语音片段b的第t-T+1个重叠帧对应的第n个声学特征。
提取模块920,包括:获取单元921和提取单元922。
获取单元921,用于获取训练语音数据中至少一个训练语音片段的标注类 型,每个训练语音片段的标注类型为第一标注类型或第二标注类型,第一标注类型所对应的语音连续性优于第二标注类型所对应的语音连续性。
提取单元922,用于提取出具有第一标注类型的训练语音片段。
获取模块910,包括:
拆分单元911、得到单元912、选择单元913和合成单元914。
拆分单元911,用于对待训练的文本信息进行拆分,得到文本基元序列(w1,w2,…,wn),wi为第i个文本基元,1≤i≤n。
得到单元912,用于根据预设声学模型,得到与每个文本基元wi对应的预测声学特征。
选择单元913,用于对于每个文本基元wi,从语料库中选择出目标代价最小的语音片段vi,目标代价用于表征文本基元wi对应的预测声学特征与语料库中的候选语音片段的声学特征之间的相似性。
合成单元914,用于根据选择出的语音片段vi所组成的训练语音片段序列(v1,v2,…,vn)进行语音合成,得到与待训练的文本信息对应的训练语音数据。
请参考图11,其示出了本发明一个实施例提供的语音合成装置的结构示意图。该语音合成装置采用如图9或图10所示实施例中提供的拼接代价模型,该语音合成装置包括:拆分模块1100、得到模块1110、选择模块1120、第二计算模块1130和合成模块1140。
拆分模块1100,用于对输入的文本信息进行拆分,得到文本基元序列(w1,w2,…,wn),wi为第i个文本基元,1≤i≤n。
得到模块1110,用于根据预设声学模型,得到与每个文本基元wi对应的预测声学特征。
选择模块1120,用于对于每个文本基元wi,从语料库中选择出若干个候选语音片段。
第二计算模块1130,用于根据目标代价模型计算每个文本基元wi与对应的候选语音片段之间的目标代价。根据拼接代价模型计算相邻的候选语音片段之间的拼接代价。
合成模块1140,用于选择出目标代价和拼接代价所对应的总代价最小的一组目标语音片段序列(v1,v2,…,vn)进行语音合成,得到与输入的文本信 息对应的语音信息。
请参考图12,其示出了本发明一个实施例提供的终端的框图。具体来讲:终端1200可以包括RF(Radio Frequency,射频)电路1210、包括有一个或一个以上计算机可读存储介质的存储器1220、输入单元1230、显示单元1240、传感器1250、音频电路1260、WiFi(wireless fidelity,无线保真)模块1270、包括有一个或者一个以上处理核心的处理器1280、以及电源1290等部件。本领域技术人员可以理解,图12中示出的设备结构并不构成对设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。其中:
RF电路1210可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,交由一个或者一个以上处理器1280处理;另外,将涉及上行的数据发送给基站。通常,RF电路1210包括但不限于天线、至少一个放大器、调谐器、一个或多个振荡器、用户身份模块(SIM)卡、收发信机、耦合器、LNA(Low Noise Amplifier,低噪声放大器)、双工器等。此外,RF电路1210还可以通过无线通信与网络和其他设备通信。无线通信可以使用任一通信标准或协议,包括但不限于GSM(Global System of Mobile communication,全球移动通讯系统)、GPRS(General Packet Radio Service,通用分组无线服务)、CDMA(Code Division Multiple Access,码分多址)、WCDMA(Wideband Code Division Multiple Access,宽带码分多址)、LTE(Long Term Evolution,长期演进)、电子邮件、SMS(Short Messaging Service,短消息服务)等。存储器1220可用于存储软件程序以及模块。处理器1280通过运行存储在存储器1220的软件程序以及模块,从而执行各种功能应用以及数据处理。存储器1220可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据终端1200的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器1220可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。相应地,存储器1220还可以包括存储器控制器,以提供处理器1280和输入单元1230对存储器1220的访问。
输入单元1230可用于接收输入的数字或字符信息,以及产生与用户设置 以及功能控制有关的键盘、鼠标、操作杆、光学或者轨迹球信号输入。具体地,输入单元1230可包括触敏表面1231以及其他输入设备1232。触敏表面1231,也称为触摸显示屏或者触控板,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触敏表面1231上或在触敏表面1231附近的操作),并根据预先设定的程式驱动相应的连接装置。可选的,触敏表面1231可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器1280,并能接收处理器1280发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触敏表面1231。除了触敏表面1231,输入单元1230还可以包括其他输入设备1232。具体地,其他输入设备1232可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元1240可用于显示由用户输入的信息或提供给用户的信息以及设备120的各种图形用户接口,这些图形用户接口可以由图形、文本、图标、视频和其任意组合来构成。显示单元1240可包括显示面板1241,可选的,可以采用LCD(Liquid Crystal Display,液晶显示器)、OLED(Organic Light-Emitting Diode,有机发光二极管)等形式来配置显示面板1241。进一步的,触敏表面1231可覆盖在显示面板1241之上,当触敏表面1231检测到在其上或附近的触摸操作后,传送给处理器1280以确定触摸事件的类型,随后处理器1280根据触摸事件的类型在显示面板1241上提供相应的视觉输出。虽然在图12中,触敏表面1231与显示面板1241是作为两个独立的部件来实现输入和输入功能,但是在某些实施例中,可以将触敏表面1231与显示面板1241集成而实现输入和输出功能。
终端1200还可包括至少一种传感器1250,比如光传感器、运动传感器以及其它传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板1241的亮度,接近传感器可在终端1200移动到耳边时,关闭显示面板1241和/或背光。作为运动传感器的一种,重力加速度传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击) 等;至于终端1200还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其它传感器,在此不再赘述。
音频电路1260、扬声器1221,传声器1222可提供用户与终端1200之间的音频接口。音频电路1260可将接收到的音频数据转换后的电信号,传输到扬声器1221,由扬声器1221转换为声音信号输出;另一方面,传声器1222将收集的声音信号转换为电信号,由音频电路1260接收后转换为音频数据,再将音频数据输出处理器1280处理后,经RF电路1210以发送给另一设备,或者将音频数据输出至存储器1220以便进一步处理。音频电路1260还可能包括耳塞插孔,以提供外设耳机与终端1200的通信。
WiFi属于短距离无线传输技术,终端1200通过WiFi模块1270可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图12示出了WiFi模块1270,但是可以理解的是,其并不属于终端1200的必须构成,完全可以根据需要在不改变发明的本质的范围内而省略。
处理器1280是终端1200的控制中心,利用各种接口和线路连接整个设备的各个部分,通过运行或执行存储在存储器1220内的软件程序和/或模块,以及调用存储在存储器1220内的数据,执行终端1200的各种功能和处理数据,从而对设备进行整体监控。可选的,处理器1280可包括一个或多个处理核心;可选的,处理器1280可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器1280中。
终端1200还包括给各个部件供电的电源1290(比如电池),优选的,电源可以通过电源管理系统与处理器1280逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。电源1290还可以包括一个或一个以上的直流或交流电源、再充电系统、电源故障检测电路、电源转换器或者逆变器、电源状态指示器等任意组件。
尽管未示出,终端1200还可以包括摄像头、蓝牙模块等,在此不再赘述。具体在本实施例中,存储器1202中存储有至少一条指令、至少一段程序、代码集或指令集。该至少一条指令、至少一段程序、代码集或指令集由处理器1280加载并执行以实现如上述各个方法实施例中所述的语音合成方法。
请参考图13,其示出了本发明一个实施例提供的服务器的框图。具体来讲:所述服务器1300包括中央处理单元(CPU)1301、包括随机存取存储器(RAM)1302和只读存储器(ROM)1303的系统存储器1304,以及连接系统存储器1304和中央处理单元1301的系统总线1305。所述服务器1300还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统(I/O系统)1306,和用于存储操作系统1313、应用程序1314和其他程序模块1315的大容量存储设备1307。
所述基本输入/输出系统1306包括有用于显示信息的显示器1308和用于用户输入信息的诸如鼠标、键盘之类的输入设备1309。其中所述显示器1308和输入设备1309都通过连接到系统总线1305的输入输出控制器1310连接到中央处理单元1301。所述基本输入/输出系统1306还可以包括输入输出控制器1310以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器1310还提供输出到显示屏、打印机或其他类型的输出设备。
所述大容量存储设备1307通过连接到系统总线1305的大容量存储控制器(未示出)连接到中央处理单元1301。所述大容量存储设备1307及其相关联的计算机可读介质为服务器1300提供非易失性存储。也就是说,所述大容量存储设备1307可以包括诸如硬盘或者CD-ROM驱动器之类的计算机可读介质(未示出)。
不失一般性,所述计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、EPROM、EEPROM、闪存或其他固态存储其技术,CD-ROM、DVD或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知所述计算机存储介质不局限于上述几种。上述的系统存储器1304和大容量存储设备1307可以统称为存储器。
根据本发明的各种实施例,所述服务器1300还可以通过诸如因特网等网络连接到网络上的远程计算机运行。也即服务器1300可以通过连接在所述系统总线1305上的网络接口单元1311连接到网络1312,或者说,也可以使用网络接口单元1311来连接到其他类型的网络或远程计算机系统(未示出)。具体在本实施例中,存储器中存储有至少一条指令、至少一段程序、代码集或指令集。该 至少一条指令、至少一段程序、代码集或指令集由处理器加载并执行以实现如上述各个方法实施例中所述的模型生成方法和/或语音合成方法。
上述本发明实施例序号仅仅为了描述,不代表实施例的优劣。
本领域普通技术人员可以理解实现上述实施例的模型生成方法和语音合成方法中全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。或者说,该存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,该至少一条指令、至少一段程序、代码集或指令集由处理器加载并执行以实现如上述各个方法实施例中所述的模型生成方法和/或语音合成方法。
以上所述仅为本发明的较佳实施例,并不用以限制本发明,凡在本发明的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本发明的保护范围之内。

Claims (21)

  1. 一种模型生成方法,其特征在于,所述方法包括:
    获取训练语音数据,所述训练语音数据是将目标代价最小的语音片段进行拼接所得到的语音数据;
    从所述训练语音数据中提取具有第一标注类型的训练语音片段,所述第一标注类型用于标注所述训练语音片段的语音连续性优于预设条件;
    根据具有所述第一标注类型的训练语音片段在拼接前所对应的相邻候选语音片段,计算得到平均差异矩阵;所述平均差异矩阵与一类拼接组合关系对应,所述平均差异矩阵用于表征属于同一类所述拼接组合关系的多组所述相邻候选语音片段在声学特征上的平均差异;
    根据所述平均差异矩阵,生成具有目标拼接权值的拼接代价模型,所述拼接代价模型与一类所述拼接组合关系对应。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述平均差异矩阵,生成具有目标拼接权值的拼接代价模型,包括:
    对于每个所述平均差异矩阵Fab,对所述平均差异矩阵Fab进行奇异值分解Fab=U∑V,得到第一分解矩阵U和第二分解矩阵V;
    将所述第一分解矩阵U的正交矩阵生成为第一权值wn,将所述第二分解矩阵V的正交矩阵生成为第二权值wt
    生成具有所述第一权值wn和所述第二权值wt的所述拼接代价模型;
    其中,ab代表由a类型的语音片段在前且b类型的语音片段在后的拼接组合关系。
  3. 根据权利要求2所述的方法,其特征在于,所述生成具有所述第一权值wn和所述第二权值wt的所述拼接代价模型,包括:
    生成所述拼接代价模型如下:
    Figure PCTCN2017097314-appb-100001
    其中,CC为拼接代价,所述拼接代价用于表征所述相邻候选语音片段之间 的连续性,T为所述相邻候选语音片段的重叠帧的帧数,wt为所述相邻候选语音片段的第t个所述重叠帧的所述声学特征对应的所述第二权值,N为每个所述候选语音片段包含的所述声学特征的个数,wn为所述相邻候选语音片段的第n个所述声学特征对应的所述第一权值,|Δf|为所述相邻候选语音片段的第n个所述声学特征的声学距离测度。
  4. 根据权利要求1至3任一所述的方法,其特征在于,所述根据具有所述第一标注类型的训练语音片段在拼接前所对应的相邻候选语音片段,计算得到平均差异矩阵,包括:
    对于每个具有所述第一标注类型的所述训练语音片段,根据所述训练语音片段在拼接前所对应的所述相邻候选语音片段计算得到拼接差异矩阵;
    根据所述相邻候选语音片段的拼接组合关系对所述拼接差异矩阵进行分类,得到与每一种拼接组合关系所对应的拼接差异矩阵集合,所述拼接差异矩阵集合包括属于同一种拼接组合关系的m个所述拼接差异矩阵,所述m为正整数;
    对每个所述拼接差异矩阵集合中的所述拼接差异矩阵计算均值,得到与每一种所述拼接组合关系所对应的所述平均差异矩阵。
  5. 根据权利要求4所述的方法,其特征在于,所述对于每个具有所述第一标注类型的训练语音片段,根据所述训练语音片段在拼接前所对应的所述相邻候选语音片段计算得到拼接差异矩阵,包括:
    对于每个具有所述第一标注类型的训练语音片段,获取所述训练语音片段在拼接前所对应的候选语音片段a和候选语音片段b;
    获取所述候选语音片段a的重叠帧对应的第一组声学特征和所述候选语音片段b的重叠帧对应的第二组声学特征,所述第一组声学特征包含n个所述声学特征,所述第二组声学特征包含n个所述声学特征;
    根据所述第一组声学特征和所述第二组声学特征,按照如下公式计算得到所述拼接差异矩阵F;
    Figure PCTCN2017097314-appb-100002
    其中,F为所述候选语音片段a和所述候选语音片段b对应的所述拼接差异矩阵,所述拼接差异矩阵中的第n行第t列表示所述候选语音片段a中的第t个所述重叠帧的第n个所述声学特征与所述候选语音片段b中的第t-T+1个所述重叠帧的第n个所述声学特征的声学距离测度,fa,t是与所述候选语音片段a的第t个所述重叠帧对应的第n个所述声学特征,fb,t-T+1是与所述候选语音片段b的第t-T+1个所述重叠帧对应的第n个所述声学特征。
  6. 根据权利要求1至3任一所述的方法,其特征在于,所述从所述训练语音数据中提取具有第一标注类型的训练语音片段,包括:
    获取所述训练语音数据中至少一个训练语音片段的标注类型,每个所述训练语音片段的标注类型为所述第一标注类型或第二标注类型,所述第一标注类型所对应的语音连续性优于所述第二标注类型所对应的语音连续性;
    提取具有所述第一标注类型的所述训练语音片段。
  7. 根据权利要求1至3任一所述的方法,其特征在于,所述获取训练语音数据,包括:
    对待训练的文本信息进行拆分,得到文本基元序列(w1,w2,…,wn),wi为第i个文本基元,1≤i≤n;
    根据预设声学模型,得到与每个所述文本基元wi对应的预测声学特征;
    对于每个所述文本基元wi,从语料库中选择所述目标代价最小的语音片段vi,所述目标代价用于表征所述文本基元wi对应的预测声学特征与所述候选语音片段的声学特征之间的相似性;
    根据选择出的所述语音片段vi所组成的训练语音片段序列(v1,v2,…,vn)进行语音合成,得到与待训练的所述文本信息对应的所述训练语音数据。
  8. 一种语音合成方法,其特征在于,采用如权利要求1至7任一所述的模型生成方法所生成的所述拼接代价模型,所述方法包括:
    对输入的文本信息进行拆分,得到文本基元序列(w1,w2,…,wn),wi为第i个文本基元,1≤i≤n;
    根据预设声学模型,得到与每个所述文本基元wi对应的预测声学特征;
    对于每个所述文本基元wi,从语料库中选择出k个候选语音片段,所述k为正整数;
    根据目标代价模型计算每个所述文本基元wi与对应的候选语音片段之间的目标代价;根据所述拼接代价模型计算相邻所述候选语音片段之间的拼接代价,所述目标代价用于表征所述文本基元wi对应的所述预测声学特征与所述候选语音片段的声学特征之间的相似性,所述拼接代价用于表征所述相邻候选语音片段之间的连续性;
    选择出所述目标代价和所述拼接代价所对应的总代价最小的一组目标语音片段序列(v1,v2,…,vn)进行语音合成,得到与输入的所述文本信息对应的所述语音信息。
  9. 一种模型生成装置,其特征在于,所述装置包括:
    获取模块,用于获取训练语音数据,所述训练语音数据是将目标代价最小的语音片段进行拼接所得到的语音数据;
    提取模块,用于从所述训练语音数据中提取具有第一标注类型的训练语音片段,所述第一标注类型用于标注所述训练语音片段的语音连续性优于预设条件;
    第一计算模块,用于根据具有所述第一标注类型的训练语音片段在拼接前所对应的相邻候选语音片段,计算得到平均差异矩阵;所述平均差异矩阵与一类拼接组合关系对应,所述平均差异矩阵用于表征属于同一类所述拼接组合关系的多组所述相邻候选语音片段在声学特征上的平均差异;
    生成模块,用于根据所述平均差异矩阵,生成具有目标拼接权值的拼接代价模型,所述拼接代价模型与一类所述拼接组合关系对应。
  10. 根据权利要求9所述的装置,其特征在于,所述生成模块,包括:
    分解单元、第一生成单元和第二生成单元;
    所述分解单元,用于对于每个所述平均差异矩阵Fab,对所述平均差异矩阵Fab进行奇异值分解Fab=U∑V,得到第一分解矩阵U和第二分解矩阵V;
    所述第一生成单元,用于将所述第一分解矩阵U的正交矩阵生成为第一权值wn,将所述第二分解矩阵V的正交矩阵生成为第二权值wt
    所述第二生成单元,用于生成具有所述第一权值wn和所述第二权值wt的所述拼接代价模型;
    其中,ab代表由a类型的语音片段在前且b类型的语音片段在后的拼接组合关系。
  11. 根据权利要求10所述的装置,其特征在于,所述第二生成单元,具体用于生成所述拼接代价模型如下:
    Figure PCTCN2017097314-appb-100003
    其中,CC为拼接代价,所述拼接代价用于表征所述相邻候选语音片段之间的连续性,T为所述相邻候选语音片段的重叠帧的帧数,wt为所述相邻候选语音片段的第t个所述重叠帧的所述声学特征对应的所述第二权值,N为每个所述候选语音片段包含的所述声学特征的个数,wn为所述相邻候选语音片段的第n个所述声学特征对应的所述第一权值,|Δf|为所述相邻候选语音片段的第n个所述声学特征的声学距离测度。
  12. 根据权利要求9至11任一所述的装置,其特征在于,所述第一计算模块,包括:
    第一计算单元、分类单元和第二计算单元;
    所述第一计算单元,用于对于每个具有所述第一标注类型的所述训练语音片段,根据所述训练语音片段在拼接前所对应的所述相邻候选语音片段计算得到拼接差异矩阵;
    所述分类单元,用于根据所述相邻候选语音片段的拼接组合关系对所述拼接差异矩阵进行分类,得到与每一种拼接组合关系所对应的拼接差异矩阵集合,所述拼接差异矩阵集合包括属于同一种拼接组合关系的m个所述拼接差异矩阵,所述m为正整数;
    所述第二计算单元,用于对每个所述拼接差异矩阵集合中的所述拼接差异矩阵计算均值,得到与每一种所述拼接组合关系所对应的所述平均差异矩阵。
  13. 根据权利要求12所述的装置,其特征在于,所述第一计算单元,包括:
    第一获取子单元、第二获取子单元和计算子单元;
    所述第一获取子单元,用于对于每个具有所述第一标注类型的训练语音片段,获取所述训练语音片段在拼接前所对应的候选语音片段a和候选语音片段b;
    所述第二获取子单元,用于获取所述候选语音片段a的重叠帧对应的第一组声学特征和所述候选语音片段b的重叠帧对应的第二组声学特征,所述第一组声学特征包含n个所述声学特征,所述第二组声学特征包含n个所述声学特征;
    所述计算子单元,用于根据所述第一组声学特征和所述第二组声学特征,按照如下公式计算得到所述拼接差异矩阵F;
    Figure PCTCN2017097314-appb-100004
    其中,F为所述候选语音片段a和所述候选语音片段b对应的所述拼接差异矩阵,所述拼接差异矩阵中的第n行第t列表示所述候选语音片段a中的第t个所述重叠帧的第n个所述声学特征与所述候选语音片段b中的第t-T+1个所述重叠帧的第n个所述声学特征的声学距离测度,fa,t是与所述候选语音片段a的第t个所述重叠帧对应的第n个所述声学特征,fb,t-T+1是与所述候选语音片段b的第t-T+1个所述重叠帧对应的第n个所述声学特征。
  14. 根据权利要求9至11任一所述的装置,其特征在于,所述提取模块,包括:
    获取单元和提取单元;
    所述获取单元,用于获取所述训练语音数据中至少一个训练语音片段的标注类型,每个所述训练语音片段的标注类型为所述第一标注类型或第二标注类型,所述第一标注类型所对应的语音连续性优于所述第二标注类型所对应的语音连续性;
    所述提取单元,用于提取具有所述第一标注类型的所述训练语音片段。
  15. 根据权利要求9至11任一所述的装置,其特征在于,所述获取模块, 包括:
    拆分单元、得到单元、选择单元和合成单元;
    所述拆分单元,用于对待训练的文本信息进行拆分,得到文本基元序列(w1,w2,…,wn),wi为第i个文本基元,1≤i≤n;
    所述得到单元,用于根据预设声学模型,得到与每个所述文本基元wi对应的预测声学特征;
    所述选择单元,用于对于每个所述文本基元wi,从语料库中选择所述目标代价最小的语音片段vi,所述目标代价用于表征所述文本基元wi对应的预测声学特征与所述候选语音片段的声学特征之间的相似性;
    所述合成单元,用于根据选择出的所述语音片段vi所组成的训练语音片段序列(v1,v2,…,vn)进行语音合成,得到与待训练的所述文本信息对应的所述训练语音数据。
  16. 一种语音合成装置,其特征在于,采用如权利要求9至15任一所述的模型生成装置所生成的所述拼接代价模型,所述装置包括:
    拆分模块,用于对输入的文本信息进行拆分,得到文本基元序列(w1,w2,…,wn),wi为第i个文本基元,1≤i≤n;
    得到模块,用于根据预设声学模型,得到与每个所述文本基元wi对应的预测声学特征;
    选择模块,用于对于每个所述文本基元wi,从语料库中选择出k个候选语音片段,所述k为正整数;
    第二计算模块,用于根据目标代价模型计算每个所述文本基元wi与对应的候选语音片段之间的目标代价;根据所述拼接代价模型计算相邻的所述候选语音片段之间的拼接代价,所述目标代价用于表征所述文本基元wi对应的所述预测声学特征与所述候选语音片段的声学特征之间的相似性,所述拼接代价用于表征相邻所述候选语音片段之间的连续性;
    合成模块,用于选择出所述目标代价和所述拼接代价所对应的总代价最小的一组目标语音片段序列(v1,v2,…,vn)进行语音合成,得到与输入的所述文本信息对应的所述语音信息。
  17. 一种服务器,其特征在于,所述服务器包括处理器和存储器,所述存 储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求1至7任一所述的模型生成方法。
  18. 一种服务器,其特征在于,所述服务器包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求8所述的语音合成方法。
  19. 一种终端,其特征在于,所述终端包括处理器和存储器,所述存储器中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求8所述的语音合成方法。
  20. 一种计算机可读存储介质,其特征在于,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求1至7任一所述的模型生成方法。
  21. 一种计算机可读存储介质,其特征在于,所述存储介质中存储有至少一条指令、至少一段程序、代码集或指令集,所述至少一条指令、所述至少一段程序、所述代码集或指令集由所述处理器加载并执行以实现如权利要求8所述的语音合成方法。
PCT/CN2017/097314 2016-10-17 2017-08-14 模型生成方法、语音合成方法及装置 WO2018072543A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/318,889 US10832652B2 (en) 2016-10-17 2017-08-14 Model generating method, and speech synthesis method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610901099.1A CN106356052B (zh) 2016-10-17 2016-10-17 语音合成方法及装置
CN201610901099.1 2016-10-17

Publications (1)

Publication Number Publication Date
WO2018072543A1 true WO2018072543A1 (zh) 2018-04-26

Family

ID=57866682

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/097314 WO2018072543A1 (zh) 2016-10-17 2017-08-14 模型生成方法、语音合成方法及装置

Country Status (3)

Country Link
US (1) US10832652B2 (zh)
CN (1) CN106356052B (zh)
WO (1) WO2018072543A1 (zh)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106356052B (zh) 2016-10-17 2019-03-15 腾讯科技(深圳)有限公司 语音合成方法及装置
CN106920547B (zh) * 2017-02-21 2021-11-02 腾讯科技(上海)有限公司 语音转换方法和装置
CN108109633A (zh) * 2017-12-20 2018-06-01 北京声智科技有限公司 无人值守的云端语音库采集与智能产品测试的系统与方法
CN108172211B (zh) * 2017-12-28 2021-02-12 云知声(上海)智能科技有限公司 可调节的波形拼接系统及方法
CN110288682B (zh) * 2019-06-28 2023-09-26 北京百度网讯科技有限公司 用于控制三维虚拟人像口型变化的方法和装置
CN110473552A (zh) * 2019-09-04 2019-11-19 平安科技(深圳)有限公司 语音识别认证方法及系统
CN111508471B (zh) * 2019-09-17 2021-04-20 马上消费金融股份有限公司 语音合成方法及其装置、电子设备和存储装置
US11295721B2 (en) * 2019-11-15 2022-04-05 Electronic Arts Inc. Generating expressive speech audio from text data
US11335326B2 (en) * 2020-05-14 2022-05-17 Spotify Ab Systems and methods for generating audible versions of text sentences from audio snippets
CN111653263B (zh) * 2020-06-12 2023-03-31 百度在线网络技术(北京)有限公司 音量调节方法、装置、电子设备以及存储介质
CN112309405A (zh) * 2020-10-29 2021-02-02 平安科技(深圳)有限公司 多种声音事件的检测方法、装置、计算机设备及存储介质
CN114827657A (zh) * 2022-04-28 2022-07-29 腾讯音乐娱乐科技(深圳)有限公司 一种音频拼接方法、设备及存储介质
CN117765926B (zh) * 2024-02-19 2024-05-14 上海蜜度科技股份有限公司 语音合成方法、系统、电子设备及介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6366883B1 (en) * 1996-05-15 2002-04-02 Atr Interpreting Telecommunications Concatenation of speech segments by use of a speech synthesizer
CN101131818A (zh) * 2006-07-31 2008-02-27 株式会社东芝 语音合成装置与方法
CN104112444A (zh) * 2014-07-28 2014-10-22 中国科学院自动化研究所 一种基于文本信息的波形拼接语音合成方法
CN104575488A (zh) * 2014-12-25 2015-04-29 北京时代瑞朗科技有限公司 一种基于文本信息的波形拼接语音合成方法
CN106356052A (zh) * 2016-10-17 2017-01-25 腾讯科技(深圳)有限公司 语音合成方法及装置

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1787072B (zh) * 2004-12-07 2010-06-16 北京捷通华声语音技术有限公司 基于韵律模型和参数选音的语音合成方法
CN101004909A (zh) * 2007-02-16 2007-07-25 黑龙江大学 基于韵律特征的汉语语音合成基元的选取方法
US8321222B2 (en) * 2007-08-14 2012-11-27 Nuance Communications, Inc. Synthesis by generation and concatenation of multi-form segments
JP5238205B2 (ja) * 2007-09-07 2013-07-17 ニュアンス コミュニケーションズ,インコーポレイテッド 音声合成システム、プログラム及び方法
US8571871B1 (en) * 2012-10-02 2013-10-29 Google Inc. Methods and systems for adaptation of synthetic speech in an environment
US9082401B1 (en) * 2013-01-09 2015-07-14 Google Inc. Text-to-speech synthesis
CN103531196B (zh) * 2013-10-15 2016-04-13 中国科学院自动化研究所 一种波形拼接语音合成的选音方法
WO2015092936A1 (ja) * 2013-12-20 2015-06-25 株式会社東芝 音声合成装置、音声合成方法およびプログラム
WO2016042659A1 (ja) * 2014-09-19 2016-03-24 株式会社東芝 音声合成装置、音声合成方法およびプログラム
KR20160058470A (ko) * 2014-11-17 2016-05-25 삼성전자주식회사 음성 합성 장치 및 그 제어 방법
CN105654940B (zh) * 2016-01-26 2019-12-24 百度在线网络技术(北京)有限公司 一种语音合成方法和装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6366883B1 (en) * 1996-05-15 2002-04-02 Atr Interpreting Telecommunications Concatenation of speech segments by use of a speech synthesizer
CN101131818A (zh) * 2006-07-31 2008-02-27 株式会社东芝 语音合成装置与方法
CN104112444A (zh) * 2014-07-28 2014-10-22 中国科学院自动化研究所 一种基于文本信息的波形拼接语音合成方法
CN104575488A (zh) * 2014-12-25 2015-04-29 北京时代瑞朗科技有限公司 一种基于文本信息的波形拼接语音合成方法
CN106356052A (zh) * 2016-10-17 2017-01-25 腾讯科技(深圳)有限公司 语音合成方法及装置

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SYRDAL, ANN K.: "DATA-DRIVEN PERCEPTUALLY BASED JOIN COSTS", 5TH ISCA SPEECH SYNTHESIS WORKSHOP, 14 June 2004 (2004-06-14) - 16 June 2004 (2004-06-16), Pittsburg, PA, USA, pages 49 - 54, XP055603262 *
SYRDAL, ANN K.: "Perceptually-based Data-driven Join Costs: Comparing Join Types", INTERSPEECH 2005, 31 December 2005 (2005-12-31), Lisbon, Portugal, pages 2813 - 2816, XP055603269 *

Also Published As

Publication number Publication date
US10832652B2 (en) 2020-11-10
US20190189109A1 (en) 2019-06-20
CN106356052A (zh) 2017-01-25
CN106356052B (zh) 2019-03-15

Similar Documents

Publication Publication Date Title
WO2018072543A1 (zh) 模型生成方法、语音合成方法及装置
CN108305296B (zh) 图像描述生成方法、模型训练方法、设备和存储介质
EP4064276A1 (en) Method and device for speech recognition, terminal and storage medium
EP2821992B1 (en) Method for updating voiceprint feature model and terminal
WO2019047971A1 (zh) 图像识别方法、终端及存储介质
WO2018219105A1 (zh) 语音识别方法及相关产品
US11720814B2 (en) Method and system for classifying time-series data
CN112820299B (zh) 声纹识别模型训练方法、装置及相关设备
CN104281394A (zh) 智能选词的方法和装置
TW201512865A (zh) 一種網頁數據搜索方法、裝置和系統
WO2017088434A1 (zh) 人脸模型矩阵训练方法、装置及存储介质
CN111738100B (zh) 一种基于口型的语音识别方法及终端设备
US11308965B2 (en) Voice information processing method and apparatus, and terminal
CN104038832A (zh) 一种播放视频的方法及装置
CN110335629B (zh) 音频文件的音高识别方法、装置以及存储介质
CN109389977B (zh) 一种语音交互方法及装置
CN112947890B (zh) 一种归并排序方法及装置
WO2018214760A1 (zh) 对焦方法及相关产品
CN109032482B (zh) 分屏控制方法、装置、存储介质和电子设备
WO2019072193A1 (zh) 一种信息智能检索的方法、装置及存储介质
CN109799994B (zh) 一种终端组件生成方法及装置
CN108958505B (zh) 一种显示候选信息的方法及终端
CN116935883B (zh) 声源定位方法、装置、存储介质及电子设备
CN113806532B (zh) 比喻句式判断模型的训练方法、装置、介质及设备
CN109522071B (zh) 一种照片管理方法及终端设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17862829

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17862829

Country of ref document: EP

Kind code of ref document: A1