EP1846918B1 - Method of estimating a voice conversion function - Google Patents
Method of estimating a voice conversion function
- Publication number
- EP1846918B1 (application EP05850632A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- voice
- speaker
- recorded
- conversion
- voice message
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Not-in-force
Links
- 238000006243 chemical reaction Methods 0.000 title claims abstract description 70
- 238000000034 method Methods 0.000 title claims abstract description 56
- 230000015572 biosynthetic process Effects 0.000 claims abstract description 45
- 238000003786 synthesis reaction Methods 0.000 claims abstract description 45
- 238000012549 training Methods 0.000 claims abstract description 7
- 230000006870 function Effects 0.000 claims description 40
- 239000000203 mixture Substances 0.000 claims description 7
- 238000004590 computer program Methods 0.000 claims description 3
- 238000004458 analytical method Methods 0.000 description 13
- 230000001755 vocal effect Effects 0.000 description 13
- 230000003595 spectral effect Effects 0.000 description 10
- 239000013598 vector Substances 0.000 description 10
- 230000009466 transformation Effects 0.000 description 8
- 238000004422 calculation algorithm Methods 0.000 description 7
- 238000012545 processing Methods 0.000 description 6
- 230000006978 adaptation Effects 0.000 description 5
- 238000013139 quantization Methods 0.000 description 5
- 238000012986 modification Methods 0.000 description 4
- 230000004048 modification Effects 0.000 description 4
- 238000010586 diagram Methods 0.000 description 3
- 238000001228 spectrum Methods 0.000 description 3
- 238000013459 approach Methods 0.000 description 2
- 238000013528 artificial neural network Methods 0.000 description 2
- 238000004891 communication Methods 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 230000007246 mechanism Effects 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 230000008569 process Effects 0.000 description 2
- 230000011218 segmentation Effects 0.000 description 2
- 238000007476 Maximum Likelihood Methods 0.000 description 1
- 238000003066 decision tree Methods 0.000 description 1
- 230000000593 degrading effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000005284 excitation Effects 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000012417 linear regression Methods 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 238000004377 microelectronic Methods 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 230000004044 response Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 238000010187 selection method Methods 0.000 description 1
- 238000007619 statistical method Methods 0.000 description 1
- 230000001360 synchronised effect Effects 0.000 description 1
- 230000002194 synthesizing effect Effects 0.000 description 1
- 238000012360 testing method Methods 0.000 description 1
- 238000011426 transformation method Methods 0.000 description 1
- 230000001052 transient effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Definitions
- It also relates to a method for estimating a voice conversion function between, on the one hand, the voice of a source speaker defined from a first voice message recorded by said source speaker, and, on the other hand, the voice of a target speaker defined from a second voice message recorded by said target speaker.
- The invention finds an advantageous application whenever it is desired to have a speaker utter a voice message recorded by another speaker. It is thus possible, for example, to diversify the voices used in speech synthesis systems or, conversely, to render anonymously messages recorded by different speakers. The method according to the invention may also be used for film dubbing.
- Voice conversion consists in estimating a transformation function, or conversion function, which, applied to the voice of a first speaker defined from a recorded voice message, makes it possible to reproduce as faithfully as possible the voice of a second speaker.
- Said second speaker may be a reference speaker whose voice is defined by a voice synthesis database, or a so-called "target" speaker whose voice is also defined from a recorded voice message, the first speaker then being called the "source".
- segmental characteristics: timbre, voice pitch, vocal quality
- suprasegmental characteristics: speech style
- The principle of voice conversion consists, in a known manner, of a learning operation which aims to estimate a function relating the timbre of the voice of the first speaker to that of the voice of the second speaker.
- Two parallel recordings of the two speakers, that is to say recordings containing the same voice message, are necessary.
- An analysis is conducted on each of the recordings in order to extract parameters representative of the voice timbre.
- Many transformation methods based on this principle have been proposed, for example, conversion by vector quantization ( M. Abe, S. Nakamura, K. Shikano and H.
- The speaker adaptation module makes it possible to customize an HMM synthesis system.
- A decision-tree classification of the context-dependent HMM models is carried out to build an "average voice" model.
- The parameters of these HMM models are adapted to the target speaker. Both objective and subjective tests have shown the usefulness of the method in the context of HMM synthesis, but the quality of the converted speech achievable with HMM-based synthesis systems nevertheless remains rather poor.
- A technical problem to be solved by the object of the present invention is to propose a method of estimating a voice conversion function between, on the one hand, the voice of a speaker defined from a voice message recorded by said speaker, and, on the other hand, the voice of a reference speaker defined by a voice synthesis database, which would provide converted speech of better quality than that provided by the known non-parallel-corpus methods.
- The document US 2002/0173962 discloses a method for synthesizing a personalized voice from text, in which the learning operation is performed between a synthetic voice message obtained from the text and a corresponding voice message spoken by the target speaker.
- Said voice synthesis database is a database of a concatenation-based speech synthesis system.
- Said voice synthesis database is a database of a corpus-based speech synthesis system.
- the acoustic database is not restricted to a dictionary of mono-represented diphones, but contains these same elements recorded in different contexts (grammatical, syntactic, phonemic, phonological or prosodic). Each element thus manipulated, also called “unit”, is thus a segment of speech characterized by a set of symbolic descriptors relative to the context in which it was recorded.
- The problem of synthesis changes radically: it is no longer a matter of distorting the speech signal while degrading the timbre quality as little as possible, but rather of having a sufficiently rich database.
- the selection of units can therefore be likened to a problem of minimizing a cost function composed of two types of metrics: a "target cost” which measures the adequacy of the units with the symbolic parameters resulting from the language processing modules of the system and a "concatenation cost” which accounts for the acoustic compatibility of two consecutive units.
- Figure 1 is a block diagram showing the steps of a voice conversion method between a speaker and a reference speaker.
- Figure 3 is a diagram of a voice conversion system implementing the estimation method according to the invention.
- Figure 1 illustrates a method of estimating a voice conversion function between a speaker and a reference speaker.
- The voice of said speaker is defined from a recorded voice message, while the voice of said reference speaker is defined from an acoustic database of a concatenation-based speech synthesis system, preferably corpus-based, although a mono-represented diphone synthesis system can also be used.
- A synthetic recording parallel to the voice message recorded by the speaker is generated from said voice synthesis database.
- A first block required for this generation is intended to extract, from the recording of the speaker concerned, symbolic information relating to the message contained in said recording.
- A first type of processing envisaged consists in extracting, from the voice recording, the spoken message in text form. This can be obtained automatically by a speech recognition system, or manually by listening to and transcribing the voice messages. In this case, the text thus recognized directly feeds the speech synthesis system 30, thereby generating the desired synthetic reference recording.
- a prosodic annotation algorithm can be integrated in the method or a manual annotation phase of the corpus can be considered to take into account melodic markers deemed relevant.
- The acoustic analysis is carried out, for example, by means of the HNM ("Harmonic plus Noise Model") model, which assumes that a voiced segment (also called a frame) of the speech signal s(n) can be decomposed into a harmonic part h(n), representing the quasi-periodic component of the signal and consisting of a sum of L harmonic sinusoids of amplitudes A_l and phases φ_l, and a noise part b(n), representing the friction noise and the variation of the glottal excitation from one period to the next, modeled by a noise-excited LPC (Linear Prediction Coefficients) filter.
- the harmonic part is absent and the signal is simply modeled by a white noise shaped by auto-regressive filtering (AR).
- AR: auto-regressive filtering
- The fundamental frequency F0 and the maximum voicing frequency, that is to say the frequency beyond which the signal is considered to consist solely of noise, are first determined. Then, an analysis synchronized on F0 makes it possible to estimate the parameters of the harmonic part (the amplitudes and the phases) as well as the noise parameters.
- The harmonic parameters are calculated by minimizing a weighted least squares criterion (see the article by Y. Stylianou cited above).
- the parts of the spectrum corresponding to noise are modeled using a simple linear prediction.
- The frequency response of the AR model thus estimated is then sampled at a constant frequency step, which provides an estimate of the spectral envelope in the noisy regions.
- The parameters modeling this spectral envelope are deduced using the regularized discrete cepstrum method (O. Cappé, E. Moulines, "Regularization techniques for discrete cepstrum estimation", IEEE Signal Processing Letters, Vol. 3(4), pp. 100-102, April 1996).
- the order of cepstral modeling was set at 20.
- a Bark scale transformation is performed.
- dynamic alignment (DTW, for "Dynamic Time Warping")
- the alignment path can be constrained so as to respect the segmentation marks.
- a joint classification of the acoustic vectors of the two aligned recordings is performed.
- x_1:N = [x_1, x_2, …, x_N] and y_1:N = [y_1, y_2, …, y_N] denote the sequences of aligned acoustic vectors.
- the random variable z is modeled by a mixture of Gaussian laws (in English GMM for "Gaussian Mixture Model") of order Q.
- The estimation of the parameters of the model is carried out by applying a classical iterative procedure, namely the EM (Expectation-Maximization) algorithm (A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society B, vol. 39, pp. 1-38, 1977).
- the determination of the initial parameters of the GMM model is obtained using a standard vector quantization technique.
- Figure 2 illustrates a method for estimating a voice conversion function between a source speaker and a target speaker whose voices are respectively defined from voice messages recorded by each of the speakers, these recordings being non-parallel.
- Synthetic reference recordings are generated from said recorded voice messages according to a procedure similar to that just described with reference to Figure 1.
- A voice conversion system incorporating the described estimation method is shown in Figure 3.
- The analysis step still relies on HNM modeling, but this time it is conducted in a pitch-synchronous manner, since this allows better modifications of the pitch and of the spectral envelope (see the article by Y. Stylianou cited above).
- The extracted spectral parameters are then transformed by a conversion module 80 performing the conversion determined by relation (6).
- The modified parameters, as well as the residual information necessary for sound generation, are transmitted to an HNM synthesis module.
- the harmonic component of the signal defined by equation (2) and present for the voiced signal frames is generated by summation of sinusoids previously tabulated whose amplitudes are calculated from the converted spectral parameters.
- the stochastic portion is determined by inverse Fourier Transform (IFFT) on the spectrum calculated from the spectral parameters.
- IFFT: inverse fast Fourier transform
- As a variant, the HNM model can be replaced by other models known to those skilled in the art, such as linear predictive coding (LPC) models, sinusoidal models or MBE ("Multi-Band Excited") models.
- LPC: linear predictive coding
- MBE: Multi-Band Excited
- the GMM conversion method can be replaced by conventional vector quantization (VQ) or fuzzy vector quantization (Fuzzy VQ) techniques.
- the steps of the method are determined by the instructions of a program for estimating a voice conversion function incorporated in a server, and the method according to the invention is implemented when this program is loaded into a computer whose operation is then controlled by the execution of the program.
- the information carrier may be any entity or device capable of storing the program.
- The medium may comprise a storage means, such as a ROM, for example a CD-ROM or a microelectronic circuit ROM, or a magnetic recording means, for example a floppy disk or a hard disk.
- the information medium may be a transmissible medium such as an electrical or optical signal, which may be conveyed via an electrical or optical cable, by radio or by other means.
- the program according to the invention can be downloaded in particular on an Internet type network.
- the information carrier may be an integrated circuit in which the program is incorporated, the circuit being adapted to execute or to be used in the execution of the method in question.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
- Devices For Executing Special Programs (AREA)
Abstract
Description
The present invention relates to a method for estimating a voice conversion function between, on the one hand, the voice of a speaker defined from a voice message recorded by said speaker, and, on the other hand, the voice of a reference speaker defined by a speech synthesis database.
It also relates to a method for estimating a voice conversion function between, on the one hand, the voice of a source speaker defined from a first voice message recorded by said source speaker, and, on the other hand, the voice of a target speaker defined from a second voice message recorded by said target speaker.
The invention finds an advantageous application whenever it is desired to have a speaker utter a voice message recorded by another speaker. It thus becomes possible, for example, to diversify the voices used in speech synthesis systems or, conversely, to render anonymously messages recorded by different speakers. The method according to the invention may also be used for film dubbing.
In general, voice conversion consists in estimating a transformation function, or conversion function, which, applied to the voice of a first speaker defined from a recorded voice message, makes it possible to reproduce as faithfully as possible the voice of a second speaker. In the context of the invention, said second speaker may be a reference speaker whose voice is defined by a speech synthesis database, or a so-called "target" speaker whose voice is also defined from a recorded voice message, the first speaker then being called the "source".
The vocal identity of a speaker depends on many characteristics, whether segmental (timbre, voice pitch, vocal quality) or suprasegmental (speech style). Among these, timbre remains the most important piece of information, which is why most work in the field of voice conversion deals mainly with the modification of timbre. Nevertheless, during the conversion, a modification of the fundamental frequency, also called "pitch", can also be performed in order to globally respect the voice pitch of the second speaker.
In essence, the principle of voice conversion consists, in a known manner, of a learning operation which aims to estimate a function relating the timbre of the voice of the first speaker to that of the voice of the second speaker. For this, two parallel recordings of the two speakers, that is to say recordings containing the same voice message, are necessary. An analysis is carried out on each of the recordings in order to extract parameters representative of the voice timbre. Then, after alignment of the two recordings, a classification, that is to say a partition of the acoustic spaces of the two speakers, is first performed. This classification is then used for the estimation of the conversion function. Many transformation methods based on this principle have been proposed, for example conversion by vector quantization (…).
The voice conversion function estimation methods just presented use recordings, or corpora, of parallel messages of the two speakers. However, it is not always possible to obtain such recordings. This is why, in parallel with the development of conversion methods based on the use of parallel corpora, other work has been carried out to make conversion possible in the case where the source and target corpora are not parallel. This work is largely inspired by the speaker adaptation techniques classically used in speech recognition based on hidden Markov models (HMM). An interesting application has been proposed (…).
A speaker adaptation technique has also been proposed (…), in which a speaker adaptation module makes it possible to customize an HMM synthesis system: a decision-tree classification of the context-dependent HMM models is carried out to build an "average voice" model, and the parameters of these HMM models are then adapted to the target speaker. Both objective and subjective tests have shown the usefulness of the method in the context of HMM synthesis, but the quality of the converted speech achievable with HMM-based synthesis systems nevertheless remains rather poor.
Also, a technical problem to be solved by the present invention is to propose a method of estimating a voice conversion function between, on the one hand, the voice of a speaker defined from a voice message recorded by said speaker, and, on the other hand, the voice of a reference speaker defined by a speech synthesis database, which would provide converted speech of better quality than that provided by the known non-parallel-corpus methods.
The document US 2002/0173962 discloses a method for synthesizing a personalized voice from text, in which the learning operation is performed between a synthetic voice message obtained from the text and a corresponding voice message spoken by the target speaker.
The solution to the technical problem posed consists, according to the present invention, in that said method comprises the steps of:
- generating, from said voice message recorded by the speaker and from said speech synthesis database, a synthetic recording of said voice message,
- estimating said voice conversion function by a learning operation performed on said recorded voice message and said synthetic recording.
Thus, it will be understood that the method according to the invention makes it possible to obtain two parallel recordings of the same voice message, one recorded directly by the speaker, which constitutes in a way the basic message, and the other being a synthetic reproduction of this basic message. The estimation of the sought conversion function is then carried out by a conventional learning operation performed on two parallel recordings. The various steps of this processing will be described in detail below.
Two applications of the method according to the invention can be envisaged, namely, on the one hand, an application to the conversion of voice messages recorded by a source speaker into corresponding messages reproduced by said reference speaker and, on the other hand, an application to the conversion of synthetic messages recorded by a reference speaker into corresponding messages reproduced by a target speaker. The first application makes it possible to render messages recorded by different speakers anonymous, since they are reproduced by the same reference speaker. The second application aims, on the contrary, to diversify the voices used in speech synthesis.
The same principle of parallelizing messages via a reference speaker can be applied to voice conversion between two speakers, in accordance with a method of estimating a voice conversion function between, on the one hand, the voice of a source speaker defined from a first voice message recorded by said source speaker, and, on the other hand, the voice of a target speaker defined from a second voice message recorded by said target speaker, which, according to the invention, is remarkable in that said method comprises the steps of:
- generating, from said first voice message recorded by the source speaker and from a speech synthesis database, a synthetic recording of said first voice message,
- estimating a first voice conversion function between the voice of the source speaker and the voice of a reference speaker defined by said speech synthesis database, by a learning operation performed on said first voice message recorded by the source speaker and said synthetic recording of the first voice message,
- generating, from said second voice message recorded by the target speaker and from said speech synthesis database, a synthetic recording of said second voice message,
- estimating a second voice conversion function between the voice of said reference speaker and the voice of the target speaker, by a learning operation performed on said synthetic recording of the second voice message and said second voice message recorded by the target speaker,
- estimating said voice conversion function by composition of said first and second voice conversion functions.
According to a first embodiment of the invention, said speech synthesis database is a database of a concatenation-based speech synthesis system.
According to a second embodiment of the invention, said speech synthesis database is a database of a corpus-based speech synthesis system.
It should be recalled that concatenation-based synthesis systems can use databases of mono-represented diphones. The choice of the diphone, rather than of the phone (the acoustic realization of a phoneme), stems from the importance, for the intelligibility of the speech signal, of the transient zone between two phones, which is thus preserved. Diphone synthesis generally leads to a synthetic signal whose intelligibility is fairly good. On the other hand, the modifications performed by the TD-PSOLA algorithm (…).
The recent availability of substantial computing resources has allowed the emergence of new solutions grouped under the name of corpus-based synthesis. In this approach, the acoustic database is not restricted to a dictionary of mono-represented diphones, but contains these same elements recorded in different contexts (grammatical, syntactic, phonemic, phonological or prosodic). Each element thus manipulated, also called a "unit", is therefore a segment of speech characterized by a set of symbolic descriptors relating to the context in which it was recorded. In this corpus-based approach, the problem of synthesis changes radically: it is no longer a matter of distorting the speech signal while degrading the timbre quality as little as possible, but rather of having a sufficiently rich database and a fine selection algorithm allowing the units best suited to the context to be chosen while minimizing artifacts at the concatenation instants. The selection of units can therefore be likened to a problem of minimizing a cost function composed of two types of metrics: a "target cost", which measures the adequacy of the units with the symbolic parameters produced by the linguistic processing modules of the system, and a "concatenation cost", which accounts for the acoustic compatibility of two consecutive units.
For reasons of algorithmic complexity, enumerating and processing from the outset all the combinations of units corresponding to the phonetization of a given text is hardly feasible. The data must therefore be filtered before deciding on the optimal sequence. For this reason, the unit selection module generally operates in two steps: first a "pre-selection", which consists in selecting sets of candidate units for each target sequence, then a "final selection", which aims to determine the optimal sequence according to a certain predetermined cost function. The pre-selection methods are, for the most part, variants of the method called "Context Oriented Clustering" introduced by Nakajima (…).
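To make the cost-minimization formulation of unit selection concrete, here is a minimal sketch, not taken from the patent: the dictionary-based unit representation, the Euclidean costs and the weights are illustrative assumptions. It picks one unit per target position by dynamic programming over the candidate sets, combining a target cost and a concatenation cost.

```python
import numpy as np

def select_units(candidates, target_feats, w_target=1.0, w_concat=1.0):
    """Pick one candidate unit per position by minimizing the sum of
    target costs and concatenation costs with dynamic programming.

    candidates: list over positions; each item is a list of units, a unit
                being a dict with 'symbolic' and 'join' feature vectors.
    target_feats: list of symbolic target feature vectors (one per position).
    """
    T = len(candidates)
    # target cost: distance between unit descriptors and target descriptors
    tcost = [np.array([np.linalg.norm(u["symbolic"] - target_feats[t])
                       for u in candidates[t]]) for t in range(T)]
    best = [w_target * tcost[0]]
    back = []
    for t in range(1, T):
        prev_units, cur_units = candidates[t - 1], candidates[t]
        # concatenation cost: acoustic mismatch at the join of two units
        ccost = np.array([[np.linalg.norm(p["join"] - c["join"])
                           for p in prev_units] for c in cur_units])
        total = best[t - 1][None, :] + w_concat * ccost   # (cur, prev)
        back.append(total.argmin(axis=1))
        best.append(total.min(axis=1) + w_target * tcost[t])
    # backtrack the optimal sequence of unit indices
    path = [int(best[-1].argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t - 1][path[-1]]))
    return path[::-1]
```

In a real system the two costs would of course be computed from the symbolic descriptors and acoustic join features actually stored in the database, rather than from plain Euclidean distances.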
The following description, given with reference to the appended drawings by way of non-limiting example, will make clear what the invention consists of and how it can be carried out.
Figure 1 is a block diagram showing the steps of a voice conversion method between a speaker and a reference speaker.
Figure 2 is a block diagram showing the steps of a method of estimating a voice conversion function between a source speaker and a target speaker.
Figure 3 is a diagram of a voice conversion system implementing the estimation method according to the invention.
Figure 1 illustrates a method of estimating a voice conversion function between a speaker and a reference speaker. The voice of said speaker is defined from a recorded voice message, while the voice of said reference speaker is defined from an acoustic database of a concatenation-based speech synthesis system, preferably corpus-based, although a mono-represented diphone synthesis system can also be used.
In a first step, a synthetic recording parallel to the voice message recorded by the speaker is generated from said speech synthesis database 10.
To this end, a first block required for this generation, called the analysis and annotation block 20, is intended to extract from the recording of the speaker concerned symbolic information relating to the message contained in said recording.
A first type of processing envisaged consists in extracting, from the voice recording, the spoken message in text form. This can be obtained automatically by a speech recognition system, or manually by listening to and transcribing the voice messages. In this case, the text thus recognized directly feeds the speech synthesis system 30, thereby generating the desired synthetic reference recording.
However, it may be advantageous to determine the phonetic string actually produced by the speaker concerned. For this purpose, standard acoustic-phonetic decoding procedures, for example based on HMM models, can be used. With this variant, it is possible to constrain the speech synthesizer to reproduce exactly the phonetization thus determined.
More generally, it is desirable to introduce a mechanism for annotating the recording in order to extract as much information as possible that can be taken into account by the concatenation-based synthesis system. Among this information, the intonation information appears particularly relevant, because it makes it possible to better control the speaking style of the speaker. Thus, a prosodic annotation algorithm can be integrated into the method, or a manual annotation phase of the corpus can be considered, in order to take into account melodic markers deemed relevant.
It is then possible to estimate the sought conversion function by applying, to the two available parallel recordings, namely the recorded voice message and the synthetic reference recording, a learning operation which will now be described in detail.
As can be seen in Figure 1, this learning operation comprises the following steps:
- acoustic analysis 40,
- alignment 50 of the corpora,
- acoustic classification 60,
- estimation 70 of the conversion function.
The acoustic analysis is carried out, for example, by means of the HNM ("Harmonic plus Noise Model") model, which assumes that a voiced segment (also called a frame) of the speech signal s(n) can be decomposed into a harmonic part h(n), representing the quasi-periodic component of the signal and consisting of a sum of L harmonic sinusoids of amplitudes A_l and phases φ_l, and a noise part b(n), representing the friction noise and the variation of the glottal excitation from one period to the next, modeled by an LPC ("Linear Prediction Coefficients") filter excited by Gaussian white noise.
For an unvoiced frame, the harmonic part is absent and the signal is simply modeled by white noise shaped by auto-regressive (AR) filtering.
The first step of the HNM analysis consists in deciding whether the analyzed frame is voiced or not. This processing is performed in asynchronous mode with an analysis step set at 10 ms.
For a voiced frame, the fundamental frequency F0 and the maximum voicing frequency, that is to say the frequency beyond which the signal is considered to consist solely of noise, are first determined. Then, an analysis synchronized on F0 makes it possible to estimate the parameters of the harmonic part (the amplitudes and the phases) as well as the noise parameters. The harmonic parameters are calculated by minimizing a weighted least squares criterion (see the article by Y. Stylianou cited above) of the form
ε = Σ_{n = −T0_i}^{+T0_i} w²(n) · ( s(n) − h(n) )²
where s(n) is the original signal, h(n) is the harmonic part defined by relation (5), w(n) is the analysis window, and T0_i is the fundamental period of the current frame. It should be noted that the analysis frame has a duration equal to twice the fundamental period (see the article by Y. Stylianou cited above). This harmonic analysis is important insofar as it provides reliable information on the value of the spectrum at the harmonic frequencies. Such information is necessary to obtain a robust estimate of the spectral envelope.
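As an illustration of this analysis step, the following sketch estimates the harmonic amplitudes A_l and phases φ_l of one voiced frame by weighted least squares at multiples of F0. It is a simplified reading of the criterion above; the window choice and frame conventions are assumptions, not taken from the patent.

```python
import numpy as np

def estimate_harmonics(frame, f0, fs, num_harmonics, window=None):
    """Weighted least-squares fit of harmonic amplitudes/phases for one
    voiced frame (frame assumed centered on the analysis instant, with a
    length of about two fundamental periods)."""
    n = np.arange(len(frame)) - len(frame) // 2
    if window is None:
        window = np.hanning(len(frame))
    # design matrix: cosine and sine at each multiple of F0
    cols = []
    for l in range(1, num_harmonics + 1):
        cols.append(np.cos(2 * np.pi * l * f0 * n / fs))
        cols.append(np.sin(2 * np.pi * l * f0 * n / fs))
    M = np.stack(cols, axis=1)
    # weighted least squares: minimize sum w(n)^2 * (s(n) - h(n))^2
    coeffs, *_ = np.linalg.lstsq(window[:, None] * M, window * frame, rcond=None)
    a, b = coeffs[0::2], coeffs[1::2]
    amplitudes = np.hypot(a, b)   # A_l
    phases = np.arctan2(-b, a)    # phi_l such that h(n) = sum A_l cos(2*pi*l*f0*n/fs + phi_l)
    return amplitudes, phases
```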
The parts of the spectrum corresponding to noise (whether the noise component of a voiced frame or an unvoiced frame) are modeled using a simple linear prediction. The frequency response of the AR model thus estimated is then sampled at a constant frequency step, which provides an estimate of the spectral envelope in the noisy regions.
In the proposed embodiment, given this sampling of the spectral envelope, the parameters modeling this spectral envelope are deduced using the regularized discrete cepstrum method (O. Cappé, E. Moulines, "Regularization techniques for discrete cepstrum estimation", IEEE Signal Processing Letters, Vol. 3(4), pp. 100-102, April 1996). The order of the cepstral modeling was set to 20, and a Bark scale transformation is performed.
It should also be noted that other types of parameters modeling the spectral envelope can be used, for example LSF (Line Spectral Frequencies) or LAR (Log Area Ratios).
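A possible, simplified implementation of such a cepstral envelope fit could look as follows. The ridge-style penalty below only loosely stands in for the regularization of the cited Cappé-Moulines method, and the parameter values are assumptions.

```python
import numpy as np

def discrete_cepstrum(freqs_hz, amplitudes, fs, order=20, lam=5e-4):
    """Fit cepstral coefficients c_0..c_order to log-amplitude samples
    (e.g. harmonic amplitudes A_l at frequencies l*F0) by regularized
    least squares, so that log|S(f)| ~ c_0 + 2 * sum_k c_k cos(2*pi*k*f/fs)."""
    f = np.asarray(freqs_hz, dtype=float)
    logA = np.log(np.maximum(np.asarray(amplitudes, dtype=float), 1e-12))
    k = np.arange(order + 1)
    M = np.cos(2 * np.pi * np.outer(f, k) / fs)
    M[:, 1:] *= 2.0
    # simple ridge penalty growing with quefrency, to keep the envelope smooth
    R = lam * np.diag(1.0 + k.astype(float) ** 2)
    return np.linalg.solve(M.T @ M + R, M.T @ logA)

def envelope_from_cepstrum(c, freqs_hz, fs):
    """Evaluate the modeled log spectral envelope at given frequencies."""
    k = np.arange(len(c))
    M = np.cos(2 * np.pi * np.outer(np.asarray(freqs_hz, dtype=float), k) / fs)
    M[:, 1:] *= 2.0
    return M @ c
```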
After the acoustic analysis, the different acoustic vectors of the two recordings must be put into correspondence. For this, a conventional dynamic alignment algorithm (DTW, for "Dynamic Time Warping") is used.
Advantageously, if an annotation and a segmentation of the two recordings are available (for example a division into phonemes) and if this information is consistent between the two recordings, then the alignment path can be constrained so as to respect the segmentation marks.
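For illustration, a plain DTW alignment of two sequences of acoustic vectors can be sketched as below; this unconstrained version omits the segmentation constraints mentioned above.

```python
import numpy as np

def dtw_align(X, Y):
    """Align two sequences of acoustic vectors X (N x d) and Y (M x d)
    with plain dynamic time warping; returns the matched (i, j) pairs."""
    N, M = len(X), len(Y)
    dist = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)  # local costs
    acc = np.full((N + 1, M + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    # backtrack from (N, M) to (1, 1)
    path, i, j = [], N, M
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```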
In the proposed embodiment, a joint classification of the acoustic vectors of the two aligned recordings is performed. Let x_1:N = [x_1, x_2, …, x_N] and y_1:N = [y_1, y_2, …, y_N] be the sequences of aligned acoustic vectors. Let x and y be the random variables associated with the acoustic vectors of each of the recordings, and let z = (x, y) be the associated pair. In the acoustic classification described here, the random variable z is modeled by a Gaussian mixture model (GMM) of order Q. Its probability density is then written in the following form:
p(z) = Σ_{i=1}^{Q} α_i · N(z; μ_i, Σ_i)
where N(z; μ, Σ) is the probability density of the normal law with mean μ and covariance matrix Σ, and where the α_i are the mixture coefficients (α_i is the a priori probability that z is generated by the i-th Gaussian).
The estimation of the model parameters is carried out by applying a classical iterative procedure, namely the EM (Expectation-Maximization) algorithm (A.P. Dempster, N.M. Laird, D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society B, vol. 39, pp. 1-38, 1977). The determination of the initial parameters of the GMM model is obtained using a standard vector quantization technique.
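A minimal sketch of this joint GMM training, using scikit-learn and assuming the aligned feature matrices produced by the DTW step, could be as follows; the k-means initialization plays the role of the vector quantization initialization, and the order Q and other settings are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(x_frames, y_frames, order_q=16, seed=0):
    """Fit a GMM of order Q on joint vectors z = (x, y) built from the
    DTW-aligned acoustic vectors of the two parallel recordings."""
    z = np.hstack([x_frames, y_frames])          # shape (N, dx + dy)
    gmm = GaussianMixture(
        n_components=order_q,
        covariance_type="full",
        init_params="kmeans",   # k-means initialization ~ vector quantization
        max_iter=200,
        random_state=seed,
    )
    gmm.fit(z)                  # EM algorithm
    return gmm
```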
Once the GMM model has been learned, it can be used to determine by regression a conversion function between the speaker and the reference speaker. In the case of a conversion from a speaker x to a speaker y, this function takes the classical form of a GMM-based regression:
F(x) = Σ_{i=1}^{Q} p_i(x) · [ μ_i^y + Σ_i^{yx} (Σ_i^{xx})^{-1} (x − μ_i^x) ]     (4)
where p_i(x) is the a posteriori probability that the vector x is generated by the i-th Gaussian, and where μ_i^x, μ_i^y, Σ_i^{xx} and Σ_i^{yx} denote the sub-vectors and sub-matrices of the mean μ_i and covariance matrix Σ_i of the i-th component corresponding to x and y.
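The regression of relation (4) can be sketched as follows from the joint GMM fitted above; the helper assumes full covariance matrices and the `fit_joint_gmm` sketch given earlier, and it is an illustration rather than the patent's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_convert(x, gmm, dx):
    """Apply the GMM regression F(x) of relation (4) to one source
    vector x, using a joint GMM fitted on z = (x, y); dx is the
    dimension of the source part of the joint vectors."""
    Q = gmm.n_components
    mu_x = gmm.means_[:, :dx]
    mu_y = gmm.means_[:, dx:]
    cov_xx = gmm.covariances_[:, :dx, :dx]
    cov_yx = gmm.covariances_[:, dx:, :dx]
    # a posteriori probabilities p_i(x) from the marginal GMM on x
    log_w = np.log(gmm.weights_)
    log_px = np.array([multivariate_normal.logpdf(x, mu_x[i], cov_xx[i],
                                                  allow_singular=True)
                       for i in range(Q)])
    post = np.exp(log_w + log_px - np.logaddexp.reduce(log_w + log_px))
    # conditional means mu_i^y + Sigma_i^yx (Sigma_i^xx)^-1 (x - mu_i^x)
    y = np.zeros(mu_y.shape[1])
    for i in range(Q):
        cond = mu_y[i] + cov_yx[i] @ np.linalg.solve(cov_xx[i], x - mu_x[i])
        y += post[i] * cond
    return y
```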
Figure 2 illustrates a method for estimating a voice conversion function between a source speaker and a target speaker whose voices are respectively defined from voice messages recorded by each of the speakers, these recordings being non-parallel.
In a first step, synthetic reference recordings are generated from said recorded voice messages, according to a procedure similar to that just described with reference to Figure 1.
Two conversion steps are then necessary to convert the voice of the source speaker into that of the target speaker. First, the parameters of the source speaker must be converted into those of the reference speaker, and the latter must then be transformed so as to reproduce the desired target speaker. Thus, a function performing the desired source-to-target conversion can be estimated by composing two transformation functions given by (4):
F_source→target(x) = F_reference→target( F_source→reference(x) )     (6)
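Assuming the two conversion functions have been learned as above (for example as partial applications of the hypothetical `gmm_convert` helper sketched earlier), the composition of relation (6) is simply:

```python
def compose_conversions(f_source_to_ref, f_ref_to_target):
    """Relation (6): build the source->target conversion by composing the
    two learned conversion functions."""
    return lambda x: f_ref_to_target(f_source_to_ref(x))
```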
A voice conversion system incorporating the described estimation method is shown in Figure 3. The analysis step still relies on the HNM modeling, but this time it is conducted in a pitch-synchronous manner, since this allows better modifications of the pitch and of the spectral envelope (see the article by Y. Stylianou cited above). The extracted spectral parameters are then transformed by a conversion module 80 performing the conversion determined by relation (6).
These modified parameters, as well as the residual information necessary for sound generation (fundamental frequency, phase of the harmonics, gain of the noise part, maximum voicing frequency), are transmitted to an HNM synthesis module. The harmonic component of the signal, defined by equation (2) and present for the voiced signal frames, is generated by summation of previously tabulated sinusoids whose amplitudes are calculated from the converted spectral parameters. The stochastic part is determined by inverse Fourier transform (IFFT) on the spectrum calculated from the spectral parameters.
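For illustration, one frame of HNM synthesis could be sketched as follows; the stochastic part is generated here by an inverse FFT of a magnitude envelope with random phases, which is a simplification of the AR-filtered noise described in the text, and all parameter names are assumptions.

```python
import numpy as np

def hnm_synthesize_frame(amplitudes, phases, f0, fs, frame_len,
                         noise_env=None, seed=0):
    """Generate one HNM frame: harmonic part as a sum of sinusoids at
    multiples of F0, plus a stochastic part obtained by inverse FFT of a
    magnitude envelope with random phases."""
    n = np.arange(frame_len)
    harmonic = np.zeros(frame_len)
    for l, (A, phi) in enumerate(zip(amplitudes, phases), start=1):
        harmonic += A * np.cos(2 * np.pi * l * f0 * n / fs + phi)
    noise = np.zeros(frame_len)
    if noise_env is not None:
        rng = np.random.default_rng(seed)
        mag = np.asarray(noise_env, dtype=float)          # one value per FFT bin
        spec = mag * np.exp(1j * rng.uniform(0, 2 * np.pi, len(mag)))
        noise = np.fft.irfft(spec, n=frame_len)
    return harmonic + noise
```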
As a variant, the HNM model can be replaced by other models known to those skilled in the art, such as linear predictive coding (LPC) models, sinusoidal models or MBE ("Multi-Band Excited") models. The GMM-based conversion method can be replaced by conventional vector quantization (VQ) or fuzzy vector quantization (Fuzzy VQ) techniques.
The description that has just been given of the estimation method according to the invention has referred only to the transformation of parameters relating to the timbre. It is however understood that the same method can also be applied to the transformation of other types of parameters, such as the fundamental frequency ("pitch") or parameters related to vocal quality.
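As an illustration of the vector quantization alternative mentioned above, the following sketch (the codebook size and the choice of library are assumptions) learns a source codebook with k-means and maps each code word to the mean of the target vectors aligned with it.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_vq_mapping(x_frames, y_frames, codebook_size=64, seed=0):
    """Learn a VQ-based conversion: cluster the source vectors, then
    associate each code word with the mean of the aligned target vectors."""
    km = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed)
    labels = km.fit_predict(x_frames)
    mapping = np.stack([y_frames[labels == k].mean(axis=0)
                        if np.any(labels == k) else np.zeros(y_frames.shape[1])
                        for k in range(codebook_size)])
    return km, mapping

def vq_convert(x, km, mapping):
    """Convert a source vector by table lookup of its nearest code word."""
    return mapping[km.predict(x.reshape(1, -1))[0]]
```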
According to a preferred implementation of the invention, the steps of the method are determined by the instructions of a program for estimating a voice conversion function incorporated in a server, and the method according to the invention is implemented when this program is loaded into a computer whose operation is then controlled by the execution of the program.
Consequently, the invention also applies to a computer program, in particular a computer program on or in an information medium, suitable for implementing the invention. This program may use any programming language and be in the form of source code, object code, or intermediate code between source code and object code, such as in a partially compiled form, or in any other form desirable for implementing the method according to the invention.
The information medium may be any entity or device capable of storing the program. For example, the medium may comprise a storage means, such as a ROM, for example a CD-ROM or a microelectronic circuit ROM, or a magnetic recording means, for example a floppy disk or a hard disk.
On the other hand, the information medium may be a transmissible medium such as an electrical or optical signal, which may be conveyed via an electrical or optical cable, by radio or by other means. The program according to the invention may in particular be downloaded over an Internet-type network.
Alternatively, the information medium may be an integrated circuit in which the program is incorporated, the circuit being adapted to execute, or to be used in the execution of, the method in question.
Claims (8)
- Method of estimating a voice conversion function for converting between, on the one hand, the voice of a speaker defined on the basis of a voice message recorded by said speaker, and, on the other hand, the voice of a reference speaker defined by a voice synthesis database, characterized in that said method comprises the steps consisting in:
- generating, on the basis of said voice message recorded by the speaker and of said voice synthesis database, a synthetic recording of said voice message,
- estimating said voice conversion function by a training operation performed on said recorded voice message and said synthetic recording.
- Method of estimating a voice conversion function for converting between, on the one hand, the voice of a source speaker defined on the basis of a first voice message recorded by said source speaker, and, on the other hand, the voice of a target speaker defined on the basis of a second voice message recorded by said target speaker, characterized in that said method comprises the steps consisting in:
- generating, on the basis of said first voice message recorded by the source speaker and of a voice synthesis database, a synthetic recording of said first voice message,
- estimating a first voice conversion function for converting between the voice of the source speaker and the voice of a reference speaker defined by said voice synthesis database, by a training operation performed on said first voice message recorded by the source speaker and said synthetic recording of the first voice message,
- generating, on the basis of said second voice message recorded by the target speaker and of said voice synthesis database, a synthetic recording of said second voice message,
- estimating a second voice conversion function for converting between the voice of said reference speaker and the voice of the target speaker, by a training operation performed on said synthetic recording of the second voice message and said second voice message recorded by the target speaker,
- estimating said voice conversion function by composition of said first and said second voice conversion functions.
- Method according to one of Claims 1 or 2, characterized in that said voice synthesis database is a database of a concatenation-based speech synthesis system.
- Method according to one of Claims 1 or 2, characterized in that said voice synthesis database is a database of a corpus-based speech synthesis system.
- Application of the method according to Claim 1 to the conversion of voice messages recorded by a source speaker into corresponding messages reproduced by said reference speaker.
- Application of the method according to Claim 1 to the conversion of synthetic messages recorded by a reference speaker into corresponding messages reproduced by a target speaker.
- Voice conversion system, characterized in that it comprises a voice conversion module comprising means for implementing the method according to any one of Claims 1 to 4.
- Computer program on an information medium, said program comprising program instructions suitable for implementing a method according to any one of Claims 1 to 4, when said program is loaded and executed in a computer system.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
FR0550278 | 2005-01-31 | ||
PCT/FR2005/003308 WO2006082287A1 (en) | 2005-01-31 | 2005-12-28 | Method of estimating a voice conversion function |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1846918A1 (en) | 2007-10-24 |
EP1846918B1 (en) | 2009-02-25 |
Family
ID=34954674
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP05850632A Not-in-force EP1846918B1 (en) | 2005-01-31 | 2005-12-28 | Method of estimating a voice conversion function |
Country Status (5)
Country | Link |
---|---|
EP (1) | EP1846918B1 (en) |
AT (1) | ATE424022T1 (en) |
DE (1) | DE602005012998D1 (en) |
ES (1) | ES2322909T3 (en) |
WO (1) | WO2006082287A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101015522B1 (en) * | 2005-12-02 | 2011-02-16 | 아사히 가세이 가부시키가이샤 | Voice quality conversion system |
JP4241736B2 (en) * | 2006-01-19 | 2009-03-18 | 株式会社東芝 | Speech processing apparatus and method |
CN108780643B (en) | 2016-11-21 | 2023-08-25 | 微软技术许可有限责任公司 | Automatic dubbing method and device |
CN111179902B (en) * | 2020-01-06 | 2022-10-28 | 厦门快商通科技股份有限公司 | Speech synthesis method, equipment and medium for simulating resonance cavity based on Gaussian model |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1156819C (en) * | 2001-04-06 | 2004-07-07 | 国际商业机器公司 | Method of producing individual characteristic speech sound from text |
2005
- 2005-12-28 AT AT05850632T patent/ATE424022T1/en not_active IP Right Cessation
- 2005-12-28 WO PCT/FR2005/003308 patent/WO2006082287A1/en active Application Filing
- 2005-12-28 ES ES05850632T patent/ES2322909T3/en active Active
- 2005-12-28 EP EP05850632A patent/EP1846918B1/en not_active Not-in-force
- 2005-12-28 DE DE602005012998T patent/DE602005012998D1/en active Active
Also Published As
Publication number | Publication date |
---|---|
WO2006082287A1 (en) | 2006-08-10 |
ES2322909T3 (en) | 2009-07-01 |
DE602005012998D1 (en) | 2009-04-09 |
EP1846918A1 (en) | 2007-10-24 |
ATE424022T1 (en) | 2009-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP1944755B1 (en) | Modification of a voice signal | |
EP1730729A1 (en) | Improved voice signal conversion method and system | |
EP1970894A1 (en) | Method and device for modifying an audio signal | |
FR2553555A1 (en) | SPEECH CODING METHOD AND DEVICE FOR IMPLEMENTING IT | |
LU88189A1 (en) | Speech segment coding and pitch control methods for speech synthesis | |
EP1769489B1 (en) | Voice recognition method and system adapted to non-native speakers' characteristics | |
EP1593116A1 (en) | Method for differentiated digital voice and music processing, noise filtering, creation of special effects and device for carrying out said method | |
EP1730728A1 (en) | Method and system for the quick conversion of a voice signal | |
EP1606792B1 (en) | Method for analyzing fundamental frequency information and voice conversion method and system implementing said analysis method | |
Muralishankar et al. | Modification of pitch using DCT in the source domain | |
Meyer et al. | Effect of speech-intrinsic variations on human and automatic recognition of spoken phonemes | |
Türk | New methods for voice conversion | |
EP1526508B1 (en) | Method for the selection of synthesis units | |
EP1789953B1 (en) | Method and device for selecting acoustic units and a voice synthesis device | |
EP1846918B1 (en) | Method of estimating a voice conversion function | |
Mary et al. | Automatic syllabification of speech signal using short time energy and vowel onset points | |
Kakouros et al. | Comparison of spectral tilt measures for sentence prominence in speech—Effects of dimensionality and adverse noise conditions | |
Csapó et al. | Modeling irregular voice in statistical parametric speech synthesis with residual codebook based excitation | |
Orphanidou et al. | Wavelet-based voice morphing | |
US11302300B2 (en) | Method and apparatus for forced duration in neural speech synthesis | |
Gupta et al. | G-Cocktail: An Algorithm to Address Cocktail Party Problem of Gujarati Language Using Cat Boost | |
Xiao et al. | Speech intelligibility enhancement by non-parallel speech style conversion using CWT and iMetricGAN based CycleGAN | |
Gupta et al. | A new framework for artificial bandwidth extension using H∞ filtering | |
Bous | A neural voice transformation framework for modification of pitch and intensity | |
Alrige et al. | End-to-End Text-to-Speech Systems in Arabic: A Comparative Study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20070831 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR |
|
17Q | First examination report despatched |
Effective date: 20080129 |
|
RIN1 | Information on inventor provided before grant (corrected) |
Inventor name: EN-NAJJARY, TAOUFIK Inventor name: ROSEC, OLIVIER |
|
DAX | Request for extension of the european patent (deleted) | ||
GRAP | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOSNIGR1 |
|
GRAS | Grant fee paid |
Free format text: ORIGINAL CODE: EPIDOSNIGR3 |
|
GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
AK | Designated contracting states |
Kind code of ref document: B1 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC NL PL PT RO SE SI SK TR |
|
REG | Reference to a national code |
Ref country code: GB Ref legal event code: FG4D Free format text: NOT ENGLISH |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: EP |
|
REG | Reference to a national code |
Ref country code: IE Ref legal event code: FG4D Free format text: LANGUAGE OF EP DOCUMENT: FRENCH |
|
REF | Corresponds to: |
Ref document number: 602005012998 Country of ref document: DE Date of ref document: 20090409 Kind code of ref document: P |
|
REG | Reference to a national code |
Ref country code: ES Ref legal event code: FG2A Ref document number: 2322909 Country of ref document: ES Kind code of ref document: T3 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: NL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 Ref country code: LT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 Ref country code: FI Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 Ref country code: SI Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 |
|
NLV1 | Nl: lapsed or annulled due to failure to fulfill the requirements of art. 29p and 29m of the patents act | ||
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: SE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090525 Ref country code: LV Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 Ref country code: PL Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 Ref country code: AT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 Ref country code: IS Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090625 |
|
REG | Reference to a national code |
Ref country code: IE Ref legal event code: FD4D |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: DK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 Ref country code: IE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 Ref country code: CZ Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 Ref country code: PT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090812 Ref country code: EE Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: RO Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 Ref country code: SK Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 |
|
PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: BG Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090525 |
|
26N | No opposition filed |
Effective date: 20091126 |
|
BERE | Be: lapsed |
Owner name: FRANCE TELECOM Effective date: 20091231 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: MC Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20100701 |
|
REG | Reference to a national code |
Ref country code: CH Ref legal event code: PL |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: GR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090526 Ref country code: LI Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20091231 Ref country code: BE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20091231 Ref country code: CH Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20091231 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: IT Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: LU Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20091228 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: HU Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090826 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: TR Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: CY Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT Effective date: 20090225 |
|
REG | Reference to a national code |
Ref country code: FR Ref legal event code: PLFP Year of fee payment: 11 |
|
REG | Reference to a national code |
Ref country code: FR Ref legal event code: PLFP Year of fee payment: 12 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: DE Payment date: 20161121 Year of fee payment: 12 Ref country code: FR Payment date: 20161121 Year of fee payment: 12 Ref country code: GB Payment date: 20161128 Year of fee payment: 12 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: ES Payment date: 20161125 Year of fee payment: 12 |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R119 Ref document number: 602005012998 Country of ref document: DE |
|
GBPC | Gb: european patent ceased through non-payment of renewal fee |
Effective date: 20171228 |
|
REG | Reference to a national code |
Ref country code: FR Ref legal event code: ST Effective date: 20180831 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: FR Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20180102 Ref country code: DE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20180703 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: GB Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20171228 |
|
REG | Reference to a national code |
Ref country code: ES Ref legal event code: FD2A Effective date: 20190704 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: ES Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20171229 |