
CN115620748B - Comprehensive training method and device for speech synthesis and false identification evaluation - Google Patents

Comprehensive training method and device for speech synthesis and false identification evaluation

Info

Publication number
CN115620748B
CN115620748B
Authority
CN
China
Prior art keywords
voice
loss function
speech
conversion
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211552858.XA
Other languages
Chinese (zh)
Other versions
CN115620748A
Inventor
Zheng Rong (郑榕)
Meng Fanqin (孟凡芹)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yuanjian Information Technology Co Ltd
Original Assignee
Beijing Yuanjian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yuanjian Information Technology Co Ltd filed Critical Beijing Yuanjian Information Technology Co Ltd
Priority to CN202211552858.XA priority Critical patent/CN115620748B/en
Publication of CN115620748A publication Critical patent/CN115620748A/en
Application granted granted Critical
Publication of CN115620748B publication Critical patent/CN115620748B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/60 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a comprehensive training method and device for speech synthesis and counterfeit-identification evaluation. Source speech and target speech are obtained as input corpora; voice conversion is performed by training a preset sound converter; inverse voice conversion is performed by training a preset sound inverse converter; voice counterfeit identification is performed by training a preset speech counterfeit discriminator; and voice quality evaluation is performed by training a preset speech quality evaluator. A voice conversion loss function corresponding to the conversion-inverse-conversion process, a counterfeit-identification loss function corresponding to the counterfeit discriminator, and a quality evaluation loss function corresponding to the quality evaluator are fused to construct a target loss function for minimization iteration. The method can jointly train and optimize the three tasks of voice conversion, voice evaluation, and voice counterfeit detection, thereby improving the voice conversion effect while making the converted speech detectable and traceable, and defending and hardening voice processing and voiceprint recognition against potential malicious attacks.

Description

Comprehensive training method and device for speech synthesis and false identification evaluation
Technical Field
The disclosure relates to the technical field of audio processing, in particular to a comprehensive training method and device for speech synthesis and counterfeit detection evaluation.
Background
With the continuous development of deep synthesis technology, it can now support application forms such as speech synthesis, video generation, and even digital virtual humans. Voice conversion is a technology that alters the personal voice characteristics of a source speaker, such as spectrum and prosody, so that the speech takes on the personal characteristics of a target speaker while the semantic content remains unchanged. Based on voice conversion technology, the voice of a real game player can be converted into the voice of a game character, or a real voice in social interaction can be converted into the voice of an entertainment character or another specific target. Typical conversions include male-to-female, female-to-male, and male-to-male voice conversion.
Currently, voice conversion, voice evaluation, and voice counterfeit detection are usually handled as independent tasks. The processing flow of a voice conversion task therefore cannot simultaneously satisfy the requirements of evaluating the conversion effect from source-speaker speech to target-speaker speech and of detecting and supervising the converted speech, namely the controllability of counterfeit detection and the traceability of the speech. As a result, converted speech is poorly controlled in terms of voice quality, conversion effect, and counterfeit-identification traceability.
Disclosure of Invention
The embodiments of the disclosure provide at least a comprehensive training method and device for speech synthesis and counterfeit-identification evaluation, which can jointly train and optimize the three tasks of voice conversion, voice evaluation, and voice counterfeit detection, thereby improving the voice conversion effect while making the converted speech detectable and traceable, and defending and hardening voice processing and voiceprint recognition against potential malicious attacks.
The embodiment of the disclosure provides a comprehensive training method for speech synthesis and counterfeit detection evaluation, which comprises the following steps:
acquiring source voice and target voice as input corpora;
converting the input corpus into corresponding conversion voice information through a preset voice converter, and converting the conversion voice information into corresponding inversion voice information through a preset voice inverse converter;
determining counterfeit-identification scores corresponding to the converted voice information and the inverted voice information through a preset speech counterfeit discriminator, and determining a MOS (Mean Opinion Score) value between the inverted voice information and the input corpus through a preset speech quality evaluator;
respectively determining a voice conversion loss function corresponding to a voice conversion-inverse conversion process, a voice counterfeit identification loss function corresponding to the voice counterfeit identifier and a quality evaluation loss function corresponding to the voice quality evaluator;
and constructing a target loss function according to the voice conversion loss function, the voice authentication loss function and the quality evaluation loss function, and performing minimization iteration aiming at the target loss function.
In an optional implementation manner, the converting the input corpus into corresponding converted speech information through a preset sound converter, and converting the converted speech information into corresponding inverted speech information through a preset sound inverse converter specifically includes:
determining a source voiceprint embedding vector corresponding to the source speech and a target voiceprint embedding vector corresponding to the target speech;
inputting the source voiceprint embedding vector and the source voice to the voice converter, and determining conversion source voice corresponding to the source voice and conversion target voice corresponding to the target voice;
determining the conversion source speech and the conversion target speech as the conversion speech information;
inputting the converted source speech and the target voiceprint embedded vector to the sound inverse converter, and determining inverse source speech corresponding to the converted source speech and inverse target speech corresponding to the converted target speech;
and determining the inverse source voice and the inverse target voice as the inverse voice information.
In an optional implementation manner, the constructing a target loss function according to the voice conversion loss function, the voice authentication loss function, and the quality assessment loss function, and performing a minimization iteration on the target loss function specifically includes:
configuring corresponding learning hyper-parameters to be optimized for the voice conversion loss function, the voice authentication loss function and the quality evaluation loss function respectively;
according to the learning hyper-parameter to be optimized, carrying out weighted summation on the voice conversion loss function, the voice identification loss function and the quality evaluation loss function to determine the target loss function;
and performing minimum iterative computation aiming at the target loss function to realize the joint training optimization of a sound conversion-inverse conversion process, a voice identification process and a voice quality evaluation process.
In an alternative embodiment, the speech conversion loss function is determined based on the following steps:
determining an inverse source voiceprint embedding vector corresponding to the inverse source speech and an inverse target voiceprint embedding vector corresponding to the inverse target speech;
determining a first mean square error between the source voiceprint embedding vector and the inverted source voiceprint embedding vector and a second mean square error between the target voiceprint embedding vector and the inverted target voiceprint embedding vector;
defining a sum of the first mean square error and the second mean square error as the speech conversion loss function, wherein the speech conversion loss function is used to describe speaker similarity between the source speech and the target speech.
In an alternative embodiment, the speech discrimination loss function is determined based on the following steps:
determining a first counterfeit-identification score output by the speech counterfeit discriminator for the converted source speech and the inverted source speech, and a second counterfeit-identification score output for the converted target speech and the inverted target speech;
and performing a normalized exponential (softmax) operation on the first counterfeit-identification score and the second counterfeit-identification score respectively, and defining the sum of the two scores after this operation as the voice counterfeit-identification loss function, wherein the voice counterfeit-identification loss function is used to describe the detectability of voice counterfeit identification.
In an alternative embodiment, the quality assessment loss function is determined based on the following steps:
determining, by the speech quality evaluator, based on a perceptual objective listening quality analysis algorithm, a first MOS score between the source voiceprint embedding vector and the inverted source voiceprint embedding vector, and a second MOS score between the target voiceprint embedding vector and the inverted target voiceprint embedding vector;
and summing the first MOS score and the second MOS score after taking their negative values, and defining the result as the quality assessment loss function, wherein the quality assessment loss function is used to describe the assessability of the voice quality.
The embodiment of the present disclosure further provides a comprehensive training device for speech synthesis and counterfeit detection evaluation, where the device includes:
the acquisition module is used for acquiring source speech and target speech as input corpora;
the conversion inversion module is used for converting the input corpus into corresponding conversion voice information through a preset sound converter and converting the conversion voice information into corresponding inversion voice information through a preset sound inverse converter;
the anti-counterfeiting evaluation module is used for determining the anti-counterfeiting scores corresponding to the converted voice information and the inverted voice information through a preset voice anti-counterfeiting device and determining the MOS score between the inverted voice information and the input corpus through a preset voice quality evaluator;
a loss function constructing module, configured to determine a voice conversion loss function corresponding to a voice conversion-inverse conversion process, a voice counterfeit discrimination loss function corresponding to the voice counterfeit discriminator, and a quality evaluation loss function corresponding to the voice quality evaluator, respectively;
and the training module is used for constructing a target loss function according to the voice conversion loss function, the voice identification loss function and the quality evaluation loss function and performing minimum iteration aiming at the target loss function.
An embodiment of the present disclosure further provides an electronic device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the electronic device runs, the processor and the memory communicate through the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the above comprehensive training method for speech synthesis and counterfeit-detection evaluation, or of any possible implementation thereof.
The disclosed embodiments also provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the steps of the above comprehensive training method for speech synthesis and counterfeit-detection evaluation, or of any possible implementation thereof.
Embodiments of the present disclosure further provide a computer program product, which includes a computer program/instructions, and the computer program/instructions, when executed by a processor, implement the above-mentioned comprehensive training method for speech synthesis and counterfeit detection evaluation, or the steps in any possible implementation manner of the above-mentioned comprehensive training method for speech synthesis and counterfeit detection evaluation.
According to the comprehensive training method and device for speech synthesis and counterfeit-identification evaluation, source speech and target speech are obtained as input corpora; the input corpus is converted into corresponding converted voice information through a preset sound converter, and the converted voice information is converted into corresponding inverted voice information through a preset sound inverse converter; counterfeit-identification scores corresponding to the converted voice information and the inverted voice information are determined through a preset speech counterfeit discriminator, and a MOS score between the inverted voice information and the input corpus is determined through a preset speech quality evaluator; a voice conversion loss function corresponding to the conversion-inverse-conversion process, a counterfeit-identification loss function corresponding to the speech counterfeit discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator are determined respectively; and a target loss function is constructed from these three loss functions and iteratively minimized. In this way, the three tasks of voice conversion, voice evaluation, and voice counterfeit detection can be trained and optimized jointly, improving the voice conversion effect while making the converted speech detectable and traceable, and defending and hardening voice processing and voiceprint recognition against potential malicious attacks.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required by the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive further related drawings from them without inventive effort.
FIG. 1 is a flow chart illustrating a method for integrated training of speech synthesis and authentication evaluation provided by an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating another method for integrated training of speech synthesis and authentication evaluation provided by embodiments of the present disclosure;
FIG. 3 is a schematic diagram of a comprehensive training device for speech synthesis and authentication evaluation according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, as generally described and illustrated in the figures herein, could be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of a, B, C, and may mean including any one or more elements selected from the group consisting of a, B, and C.
Research shows that voice conversion, voice evaluation, and voice counterfeit detection are currently handled as independent tasks, and the processing flow of a voice conversion task cannot simultaneously satisfy the requirements of evaluating the conversion effect from source-speaker speech to target-speaker speech and of detecting and supervising the converted speech, namely the controllability of counterfeit detection and the traceability of the speech; as a result, converted speech is poorly controlled in terms of voice quality, conversion effect, and traceability.
Based on this research, the present disclosure provides a comprehensive training method and device for speech synthesis and counterfeit-identification evaluation: source speech and target speech are obtained as input corpora; the input corpus is converted into corresponding converted voice information through a preset sound converter, and the converted voice information is converted into corresponding inverted voice information through a preset sound inverse converter; counterfeit-identification scores corresponding to the converted voice information and the inverted voice information are determined through a preset speech counterfeit discriminator, and a MOS score between the inverted voice information and the input corpus is determined through a preset speech quality evaluator; a voice conversion loss function corresponding to the conversion-inverse-conversion process, a counterfeit-identification loss function corresponding to the speech counterfeit discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator are determined respectively; and a target loss function is constructed from these three loss functions and iteratively minimized. In this way, the three tasks of voice conversion, voice evaluation, and voice counterfeit detection can be trained and optimized jointly, improving the voice conversion effect while making the converted speech detectable and traceable, and defending and hardening voice processing and voiceprint recognition against potential malicious attacks.
To facilitate understanding of the embodiments, a comprehensive training method for speech synthesis and counterfeit-detection evaluation disclosed in the embodiments of the present disclosure is first described in detail. The execution subject of this method is generally a computer device with certain computing power, such as a terminal device, a server, or another processing device; the terminal device may be user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, or a wearable device. In some possible implementations, the method may be implemented by a processor calling computer-readable instructions stored in a memory.
Referring to fig. 1, a flowchart of a comprehensive training method for speech synthesis and counterfeit detection evaluation provided by the embodiment of the present disclosure is shown, where the method includes steps S101 to S105, where:
s101, obtaining source voice and target voice as input corpora.
In specific implementation, a source speech corresponding to a source speaker and a target speech corresponding to a target speaker are obtained, and the source speech and the target speech are used as input corpora of speech conversion.
Here, the source speech and the target speech are used as input corpora, and the input corpora are used for inputting to a speech conversion system for speech conversion, so that the speech characteristics of the source speaker, such as frequency spectrum, rhythm and the like, are converted into the speech characteristics of the target speaker while the semantic information of the source speech of the speaker is kept unchanged.
In the training stage, features are extracted from the source speech of the source speaker and the target speech of the target speaker, and a speech conversion model is trained on the extracted features. In the inference stage, the source speech to be converted is analyzed and its features are extracted and mapped; the mapped features are then transformed by the trained speech conversion model, and the converted features are finally used for speech synthesis to obtain the converted speech.
Optionally, the source speech and the target speech may be collected for the source speaker and the target speaker respectively through audio collection equipment.
S102, converting the input corpus into corresponding conversion voice information through a preset sound converter, and converting the conversion voice information into corresponding inversion voice information through a preset sound inverse converter.
In specific implementation, source speech and target speech are input to a preset sound converter as input corpora, the input corpora are converted into corresponding converted speech information, and the converted speech information is input to a preset sound inverse converter, so that inverse speech information for performing inverse speech conversion on the converted speech information is obtained.
Here, the converted speech information includes converted speech information output after the source speech is converted by the sound converter, and converted speech information output after the target speech information is converted by the sound converter; the sound inverse converter outputs inverse voice information corresponding to the source voice and inverse voice information corresponding to the target voice.
The preset sound converter may be a conditional variational autoencoder with adversarial learning for speech synthesis (VITS_VC). VITS_VC is a highly expressive voice conversion model that combines variational inference, normalizing flows, and adversarial training. Instead of the common spectrogram cascade, VITS_VC connects the acoustic model and the vocoder through a stochastically modeled latent variable and uses a stochastic duration predictor to increase the diversity of converted speech, so that the same input speech can yield outputs with different pitches and rhythms. The VITS_VC algorithm adopts a non-autoregressive network structure; compared with a traditional autoregressive network, its generation speed is significantly higher, meeting the high-throughput conversion requirements of practical applications.
Furthermore, the sound inverse converter can evaluate the voice similarity between the input corpus and the inverse voice information, so that the controllability and the traceability of the sound are realized.
It should be noted that after the speech counterfeit discriminator described in step S103 detects the converted voice information and the inverted voice information, it can further determine whether the voice conversion task was performed by the system's own sound converter, after which the inverse conversion process is applied to the converted voice information.
As a possible implementation, step S102 may be implemented by steps S1021 to S1025 as follows:
s1021, determining a source voiceprint embedding vector corresponding to the source speech and a target voiceprint embedding vector corresponding to the target speech.
S1022, inputting the source voiceprint embedding vector and the source speech to the sound converter, and determining a conversion source speech corresponding to the source speech and a conversion target speech corresponding to the target speech.
S1023, determining the conversion source voice and the conversion target voice as the conversion voice information.
S1024, inputting the converted source speech and the target voiceprint embedding vector to the sound inverse converter, and determining the inverse source speech corresponding to the converted source speech and the inverse target speech corresponding to the converted target speech.
S1025, the inverse source voice and the inverse target voice are determined to be the inverse voice information.
In a specific implementation, in order to improve the voice conversion effect under few-shot or short-utterance conditions, a pre-trained voiceprint extractor is used as the speaker encoder for extracting voiceprint embedding vectors from the input corpus.
Specifically, the voiceprint extractor converts the input corpus into a 512-dimensional speaker embedding, which is then fed to the sound converter.
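As a minimal sketch of this embedding step (the `speaker_encoder` module and its call signature are assumptions for illustration, not components named by this disclosure):

```python
import torch

def extract_voiceprint(speaker_encoder: torch.nn.Module,
                       wav: torch.Tensor) -> torch.Tensor:
    """Map one utterance to a fixed 512-dimensional speaker embedding.
    The pre-trained voiceprint extractor is treated as frozen here
    (an assumption; the text only says it is pre-trained)."""
    with torch.no_grad():
        emb = speaker_encoder(wav)   # assumed output shape: (1, 512)
    return emb

# e_s = extract_voiceprint(speaker_encoder, source_wav)  # source voiceprint
# e_t = extract_voiceprint(speaker_encoder, target_wav)  # target voiceprint
```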
Optionally, when VITS_VC is selected as the sound converter, VITS supports multi-speaker voice conversion; when applied to a multi-speaker model, the source voiceprint embedding vector corresponding to each source speaker's speech is added to the corresponding module of VITS.
Further, given the source speech and the source voiceprint embedding vector of a given source speaker, the sound converter outputs through the vocoder the converted source speech obtained by converting the source speech, and likewise the converted target speech obtained by converting the target speech.
Further, given the converted source speech output by the sound converter and the target voiceprint embedding vector corresponding to the target speech, the sound inverse converter outputs through the vocoder the inverted source speech obtained by inversely converting the converted source speech, and likewise the inverted target speech.
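Putting steps S1021 to S1025 together, the conversion and inverse-conversion round trip can be sketched as below; `converter` and `inverse_converter` stand in for the preset sound converter and sound inverse converter, and their call signatures are assumptions rather than an actual VITS API:

```python
def round_trip(converter, inverse_converter, source_wav, target_wav, e_s, e_t):
    """Sketch of the S1021-S1025 round trip. converter and inverse_converter
    are assumed callables of the form (waveform, voiceprint_embedding) -> waveform."""
    s_conv = converter(source_wav, e_s)      # converted source speech (S1022)
    t_conv = converter(target_wav, e_t)      # converted target speech
    s_inv = inverse_converter(s_conv, e_t)   # inverted source speech (S1024)
    # Conditioning for the target branch is not fully specified in the text;
    # the target voiceprint is assumed here as well.
    t_inv = inverse_converter(t_conv, e_t)   # inverted target speech
    return s_conv, t_conv, s_inv, t_inv
```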
S103, determining counterfeit-identification scores corresponding to the converted voice information and the inverted voice information through a preset speech counterfeit discriminator, and determining a MOS score between the inverted voice information and the input corpus through a preset speech quality evaluator.
In a specific implementation, the speech counterfeit discriminator adopts an end-to-end audio anti-spoofing method based on a graph-convolution attention network. It extracts full-band and sub-band embedding features of the speech and introduces a fused attention mechanism that effectively exploits the information of three attention sub-modules covering the temporal region, the spectral region, and the channel region. The converted voice information and the inverted voice information are input to the speech counterfeit discriminator, which outputs the corresponding authenticity judgment scores.
The authenticity judgment score of the speech counterfeit discriminator is normalized to the range [0,1] and represents the likelihood that the audio is genuine: a score close to 0 indicates forged audio, and a score close to 1 indicates genuine audio.
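For illustration only: assuming the discriminator's raw output is a pair of logits for the (forged, genuine) classes (the text specifies only the normalized [0,1] output), the score can be computed as:

```python
import torch

def authenticity_score(logits: torch.Tensor) -> torch.Tensor:
    """logits: tensor of shape (batch, 2), column 0 = forged, column 1 = genuine
    (an assumed layout). Returns a score in [0, 1]: values near 0 indicate
    forged audio and values near 1 indicate genuine audio."""
    return torch.softmax(logits, dim=-1)[:, 1]
```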
Further, the speech quality evaluator may estimate the MOS score between the inverted voice information and the input corpus through the Perceptual Objective Listening Quality Analysis (POLQA) algorithm.
Here, the most intuitive measure of a voice conversion system is the quality of the converted audio, commonly expressed as a MOS (Mean Opinion Score) value. However, MOS scoring requires human raters from many backgrounds, which is expensive in both human resources and time. Therefore, in the training process of the embodiments of the present application, the original input corpus before conversion by the sound converter and the inverted voice information after inverse conversion by the sound inverse converter can be obtained, and the POLQA algorithm is used to evaluate speech quality automatically.
Specifically, the POLQA algorithm filters the reference signal and the degraded signal, aligns them in time, estimates the sampling rate, performs objective perceptual scoring to obtain a POLQA score, and finally maps that score to a MOS score. POLQA is a full-reference objective evaluation method: given a reference (lossless) signal (here, the original speech before conversion), it quantifies the degree of impairment of the corrupted signal (here, the inversely converted speech) and produces an objective speech quality score close to the subjective score.
The MOS value derived from POLQA is at most 4.5 in narrow-band mode and at most 4.75 in super-wideband mode. To define the quality assessment loss function, the POLQA value is taken with a negative sign.
S104, respectively determining a voice conversion loss function corresponding to the voice conversion-inverse conversion process, a voice identification loss function corresponding to the voice identifier, and a quality evaluation loss function corresponding to the voice quality evaluator.
In the implementation, a voice conversion loss function corresponding to the voice conversion-inverse conversion process, a voice identification loss function corresponding to the voice identifier, and a quality evaluation loss function corresponding to the voice quality evaluator are respectively determined.
Here, the voice conversion loss function corresponds to a loss function corresponding to the sound converter and the sound inverse converter.
The voice conversion loss function describes the similarity between speakers, the counterfeit-identification loss function describes the detectability of voice counterfeit identification, and the quality assessment loss function describes the assessability of speech quality.
As a possible implementation, the speech conversion loss function may be determined based on the following steps 1-3:
step 1, determining an inverse source voiceprint embedded vector corresponding to the inverse source voice and an inverse target voiceprint embedded vector corresponding to the inverse target voice;
step 2, determining a first mean square error between the source voiceprint embedding vector and the inverse source voiceprint embedding vector and a second mean square error between the target voiceprint embedding vector and the inverse target voiceprint embedding vector;
and 3, defining the sum of the first mean square error and the second mean square error as the voice conversion loss function, wherein the voice conversion loss function is used for describing the speaker similarity between the source voice and the target voice.
Specifically, the voice conversion loss function may be constructed based on the following formulas:

$$L_{MSE} = L_{MSE\_source} + L_{MSE\_target}$$

$$L_{MSE\_source} = \mathrm{MSE}\big(E(s), E(\hat{s})\big), \qquad L_{MSE\_target} = \mathrm{MSE}\big(E(t), E(\hat{t})\big)$$

where $L_{MSE}$ denotes the voice conversion loss function; $L_{MSE\_source}$ denotes the loss contributed by the source speech in the conversion and inverse-conversion processes; $L_{MSE\_target}$ denotes the loss contributed by the target speech; $\mathrm{MSE}(\cdot)$ denotes the mean square error; $E(\cdot)$ denotes computation of the voiceprint embedding vector; $s$ and $t$ denote the source speech and the target speech respectively; and $\hat{s}$ and $\hat{t}$ denote the inverted source speech and the inverted target speech output by the sound inverse converter.
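A brief PyTorch sketch of this loss, where `embed` plays the role of $E(\cdot)$ and the `*_inv` waveforms are the outputs of the sound inverse converter:

```python
import torch.nn.functional as F

def voice_conversion_loss(embed, s, t, s_inv, t_inv):
    """L_MSE = MSE(E(s), E(s_inv)) + MSE(E(t), E(t_inv)): mean square error in
    voiceprint-embedding space between each input utterance and its
    conversion/inverse-conversion round trip."""
    l_source = F.mse_loss(embed(s), embed(s_inv))
    l_target = F.mse_loss(embed(t), embed(t_inv))
    return l_source + l_target
```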
As another possible implementation, the speech discrimination loss function may be determined based on the following steps 1-2:
step 1, determining a first authentication score of the voice authenticator for the converted source voice and the inverted source voice and a second authentication score of the voice authenticator for the converted target voice and the inverted target voice.
And 2, performing a normalized exponential (softmax) operation on the first counterfeit-identification score and the second counterfeit-identification score respectively, and defining the sum of the two scores after this operation as the voice counterfeit-identification loss function, wherein this loss function is used to describe the detectability of voice counterfeit identification.
Specifically, the speech counterfeit-identification loss function may be constructed based on the following formulas:

$$L_{SPOOF} = L_{SPOOF\_vc} + L_{SPOOF\_ivc}$$

$$L_{SPOOF\_vc} = \mathrm{softmax}\big(D(\tilde{s}, \hat{s})\big), \qquad L_{SPOOF\_ivc} = \mathrm{softmax}\big(D(\tilde{t}, \hat{t})\big)$$

where $L_{SPOOF}$ denotes the speech counterfeit-identification loss function; $L_{SPOOF\_vc}$ denotes the discriminator's loss for processing the converted source speech and the inverted source speech; $L_{SPOOF\_ivc}$ denotes the discriminator's loss for processing the converted target speech and the inverted target speech; $\mathrm{softmax}(\cdot)$ denotes the normalized exponential operation; $D(\cdot)$ denotes the counterfeit-identification score output by the speech counterfeit discriminator; $\tilde{s}$ and $\hat{s}$ denote the converted source speech and the inverted source speech respectively; and $\tilde{t}$ and $\hat{t}$ denote the converted target speech and the inverted target speech respectively.
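The exact form of the softmax term is not spelled out beyond the formulas above, so the following sketch takes one plausible reading: normalize the discriminator's two-class logits per utterance and accumulate the genuine-class probabilities of the synthesized audio, so that minimizing the loss keeps converted speech detectable as fake:

```python
import torch

def spoof_loss(discriminator, s_conv, s_inv, t_conv, t_inv):
    """One plausible reading of L_SPOOF = L_SPOOF_vc + L_SPOOF_ivc; the
    discriminator is assumed to return two logits (forged, genuine) per utterance."""
    def genuine_prob(wav):
        logits = discriminator(wav)              # assumed shape: (2,)
        return torch.softmax(logits, dim=-1)[1]
    l_vc = genuine_prob(s_conv) + genuine_prob(s_inv)    # source branch
    l_ivc = genuine_prob(t_conv) + genuine_prob(t_inv)   # target branch
    return l_vc + l_ivc
```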
As another possible implementation, the quality assessment loss function may be determined based on steps 1-2 below:
Step 1, determining, through the speech quality evaluator based on a perceptual objective listening quality analysis algorithm, a first MOS (Mean Opinion Score) value between the source voiceprint embedding vector and the inverted source voiceprint embedding vector, and a second MOS value between the target voiceprint embedding vector and the inverted target voiceprint embedding vector.
And 2, summing the first MOS value and the second MOS value after taking negative values, and defining the sum as the quality evaluation loss function, wherein the quality evaluation loss function is used for describing the evaluability of the voice quality.
Specifically, the quality assessment loss function may be constructed based on the following formulas:

$$L_{POLQA} = L_{POLQA\_source} + L_{POLQA\_target}$$

$$L_{POLQA\_source} = -\mathrm{POLQA}\big(s, \hat{s}\big), \qquad L_{POLQA\_target} = -\mathrm{POLQA}\big(t, \hat{t}\big)$$

where $L_{POLQA}$ denotes the quality assessment loss function; $L_{POLQA\_source}$ denotes the speech quality evaluator's loss for the source speech and the inverted source speech; $L_{POLQA\_target}$ denotes the evaluator's loss for the target speech and the inverted target speech; $\mathrm{POLQA}(\cdot)$ denotes the computation of a full-reference objective evaluation score based on POLQA; $s$ and $\hat{s}$ denote the source speech and the inverted source speech respectively; and $t$ and $\hat{t}$ denote the target speech and the inverted target speech respectively.
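POLQA (ITU-T P.863) is a licensed algorithm without a standard open-source implementation, so `polqa_mos(reference, degraded)` below is an assumed black-box scorer returning a MOS value; negating it matches the definition above:

```python
def quality_assessment_loss(polqa_mos, s, s_inv, t, t_inv):
    """L_POLQA = -POLQA(s, s_inv) - POLQA(t, t_inv). polqa_mos is an assumed
    external scorer (MOS capped at 4.5 narrow-band, 4.75 super-wideband);
    the negative sign turns a quality measure into a minimizable loss."""
    return -polqa_mos(s, s_inv) - polqa_mos(t, t_inv)
```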
S105, constructing a target loss function according to the voice conversion loss function, the voice identification loss function and the quality evaluation loss function, and performing minimum iteration aiming at the target loss function.
In a specific implementation, the voice conversion loss function describes the similarity between speakers, the counterfeit-identification loss function describes the detectability of voice counterfeit identification, and the quality assessment loss function describes the assessability of speech quality. The three loss functions, defined along different dimensions, are combined into a target loss function, and joint optimization of the whole system is achieved through iterative minimization training of this target loss function.
According to the comprehensive training method for speech synthesis and counterfeit-identification evaluation, source speech and target speech are obtained as input corpora; the input corpus is converted into corresponding converted voice information through a preset sound converter, and the converted voice information is converted into corresponding inverted voice information through a preset sound inverse converter; counterfeit-identification scores corresponding to the converted voice information and the inverted voice information are determined through a preset speech counterfeit discriminator, and a MOS score between the inverted voice information and the input corpus is determined through a preset speech quality evaluator; a voice conversion loss function corresponding to the conversion-inverse-conversion process, a counterfeit-identification loss function corresponding to the speech counterfeit discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator are determined respectively; and a target loss function is constructed from these three loss functions and iteratively minimized. In this way, the three tasks of voice conversion, voice evaluation, and voice counterfeit detection can be trained and optimized jointly, improving the voice conversion effect while making the converted speech detectable and traceable, and defending and hardening voice processing and voiceprint recognition against potential malicious attacks.
Referring to fig. 2, a flow chart of another speech synthesis and counterfeit detection evaluation comprehensive training method provided in the embodiment of the present disclosure is shown, where the method includes steps S201 to S203, where:
s201, configuring corresponding learning hyper-parameters to be optimized for the voice conversion loss function, the voice authentication loss function and the quality evaluation loss function respectively.
In the specific implementation, corresponding learning hyper-parameters to be optimized are configured for the voice conversion loss function, the voice identification loss function and the quality evaluation loss function respectively.
It should be noted that the learning hyper-parameter to be optimized corresponding to each loss function may be selected according to actual needs, and is not limited specifically here. Preferably, the initial value of the learning hyper-parameter to be optimized may be set to 1.
S202, according to the learning hyper-parameter to be optimized, carrying out weighted summation on the voice conversion loss function, the voice identification loss function and the quality evaluation loss function, and determining the target loss function.
Specifically, the target loss function can be constructed by the following formula:

$$L_{total} = \alpha \cdot L_{MSE} + \beta \cdot L_{SPOOF} + \lambda \cdot L_{POLQA}$$

where $L_{total}$ denotes the target loss function; $L_{MSE}$ denotes the voice conversion loss function; $L_{SPOOF}$ denotes the counterfeit-identification loss function; $L_{POLQA}$ denotes the quality assessment loss function; and $\alpha$, $\beta$ and $\lambda$ respectively denote the learning hyper-parameters to be optimized.
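One way to realize this weighted combination, treating the three weights as learnable parameters initialized to 1 as suggested above (a sketch, not the disclosed implementation):

```python
import torch

class TargetLoss(torch.nn.Module):
    """L_total = alpha*L_MSE + beta*L_SPOOF + lambda*L_POLQA, with the
    weighting hyper-parameters learned jointly during training."""
    def __init__(self):
        super().__init__()
        self.alpha = torch.nn.Parameter(torch.tensor(1.0))
        self.beta = torch.nn.Parameter(torch.tensor(1.0))
        self.lam = torch.nn.Parameter(torch.tensor(1.0))  # 'lambda' is a reserved word

    def forward(self, l_mse, l_spoof, l_polqa):
        return self.alpha * l_mse + self.beta * l_spoof + self.lam * l_polqa
```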
S203, performing minimum iterative computation aiming at the target loss function to realize the joint training optimization of a sound conversion-inverse conversion process, a voice identification process and a voice quality evaluation process.
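Tying the sketches above together, one minimization iteration might look as follows; note that a true POLQA scorer is not differentiable, so in practice the quality term would need a differentiable proxy, a point this disclosure does not address:

```python
import torch

def train_step(optimizer, target_loss, embed, discriminator, polqa_mos,
               converter, inverse_converter, s, t, e_s, e_t):
    """One minimization iteration over the fused target loss, reusing the
    helper sketches defined earlier (round_trip, voice_conversion_loss, ...)."""
    s_conv, t_conv, s_inv, t_inv = round_trip(
        converter, inverse_converter, s, t, e_s, e_t)
    l_mse = voice_conversion_loss(embed, s, t, s_inv, t_inv)
    l_spoof = spoof_loss(discriminator, s_conv, s_inv, t_conv, t_inv)
    l_polqa = quality_assessment_loss(polqa_mos, s, s_inv, t, t_inv)
    loss = target_loss(l_mse, l_spoof, l_polqa)
    optimizer.zero_grad()
    loss.backward()   # joint training of conversion, counterfeit detection, quality
    optimizer.step()
    return loss.item()
```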
According to the comprehensive training method for speech synthesis and counterfeit-identification evaluation, source speech and target speech are obtained as input corpora; the input corpus is converted into corresponding converted voice information through a preset sound converter, and the converted voice information is converted into corresponding inverted voice information through a preset sound inverse converter; counterfeit-identification scores corresponding to the converted voice information and the inverted voice information are determined through a preset speech counterfeit discriminator, and a MOS score between the inverted voice information and the input corpus is determined through a preset speech quality evaluator; a voice conversion loss function corresponding to the conversion-inverse-conversion process, a counterfeit-identification loss function corresponding to the speech counterfeit discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator are determined respectively; and a target loss function is constructed from these three loss functions and iteratively minimized. In this way, the three tasks of voice conversion, voice evaluation, and voice counterfeit detection can be trained and optimized jointly, improving the voice conversion effect while making the converted speech detectable and traceable, and defending and hardening voice processing and voiceprint recognition against potential malicious attacks.
It will be understood by those skilled in the art that, in the method of the present invention, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation; the specific execution order of the steps should be determined by their functions and possible internal logic.
Based on the same inventive concept, the embodiments of the present disclosure further provide a comprehensive training device for speech synthesis and counterfeit-detection evaluation corresponding to the above method. Since the principle by which the device solves the problem is similar to that of the method, the implementation of the device may refer to the implementation of the method, and repeated parts are not described again.
Referring to fig. 3, fig. 3 is a schematic diagram of a speech synthesis and counterfeit detection evaluation integrated training device according to an embodiment of the present disclosure. As shown in fig. 3, a speech synthesis and counterfeit detection comprehensive training apparatus 300 according to an embodiment of the present disclosure includes:
the obtaining module 310 is configured to obtain source speech and target speech as input corpora.
The conversion and inversion module 320 is configured to convert the input corpus into corresponding conversion voice information through a preset voice converter, and convert the conversion voice information into corresponding inversion voice information through a preset voice inverse converter.
The counterfeit-identification evaluation module 330 is configured to determine counterfeit-identification scores corresponding to the converted speech information and the inverted speech information through a preset speech counterfeit discriminator, and to determine a MOS score between the inverted speech information and the input corpus through a preset speech quality evaluator.
The loss function constructing module 340 is configured to determine a voice conversion loss function corresponding to a voice conversion-inverse conversion process, a voice counterfeit detection loss function corresponding to the voice counterfeit detector, and a quality evaluation loss function corresponding to the voice quality evaluator, respectively.
A training module 350, configured to construct a target loss function according to the speech conversion loss function, the speech discrimination loss function, and the quality evaluation loss function, and perform a minimization iteration on the target loss function.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
According to the comprehensive training device for speech synthesis and counterfeit-identification evaluation, source speech and target speech are obtained as input corpora; the input corpus is converted into corresponding converted voice information through a preset sound converter, and the converted voice information is converted into corresponding inverted voice information through a preset sound inverse converter; counterfeit-identification scores corresponding to the converted voice information and the inverted voice information are determined through a preset speech counterfeit discriminator, and a MOS score between the inverted voice information and the input corpus is determined through a preset speech quality evaluator; a voice conversion loss function corresponding to the conversion-inverse-conversion process, a counterfeit-identification loss function corresponding to the speech counterfeit discriminator, and a quality evaluation loss function corresponding to the speech quality evaluator are determined respectively; and a target loss function is constructed from these three loss functions and iteratively minimized. In this way, the three tasks of voice conversion, voice evaluation, and voice counterfeit detection can be trained and optimized jointly, improving the voice conversion effect while making the converted speech detectable and traceable, and defending and hardening voice processing and voiceprint recognition against potential malicious attacks.
Corresponding to the comprehensive training method for speech synthesis and counterfeit-detection evaluation of FIG. 1 and FIG. 2, an embodiment of the present disclosure further provides an electronic device 400. As shown in FIG. 4, which is a schematic structural diagram of the electronic device 400 provided by the embodiment of the present disclosure, the device includes:
a processor 41, a memory 42, and a bus 43. The memory 42 is used for storing execution instructions and includes an internal memory 421 and an external memory 422; the internal memory 421 temporarily stores operation data of the processor 41 and data exchanged with the external memory 422, such as a hard disk, and the processor 41 exchanges data with the external memory 422 through the internal memory 421. When the electronic device 400 operates, the processor 41 communicates with the memory 42 through the bus 43, causing the processor 41 to execute the steps of the comprehensive training method for speech synthesis and counterfeit-detection evaluation of FIG. 1 and FIG. 2.
The embodiment of the present disclosure further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the comprehensive training method for speech synthesis and counterfeit identification evaluation described in the above method embodiments are performed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiment of the present disclosure further provides a computer program product, where the computer program product includes computer instructions, and when the computer instructions are executed by a processor, the steps of the comprehensive training method for speech synthesis and false identification evaluation in the foregoing method embodiments may be executed.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, it is embodied as a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus described above may refer to the corresponding process in the foregoing method embodiments and is not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division into units is only a logical division, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in software functional units and sold or used as a stand-alone product, may be stored in a non-transitory computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are merely specific embodiments of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not restricted to them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features therein; such modifications, changes, and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present disclosure and shall all be covered by its protection. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (9)

1. A comprehensive training method for speech synthesis and false identification evaluation, characterized by comprising the following steps:
obtaining source speech and target speech as input corpora;
converting the input corpora into corresponding converted speech information through a preset voice converter, and converting the converted speech information into corresponding inverted speech information through a preset voice inverse converter;
determining, through a preset speech discriminator, discrimination scores corresponding to the converted speech information and the inverted speech information, and determining, through a preset voice quality evaluator, MOS (Mean Opinion Score) values between the inverted speech information and the input corpora;
respectively determining a speech conversion loss function corresponding to the voice conversion-inverse conversion process, a speech discrimination loss function corresponding to the speech discriminator, and a quality evaluation loss function corresponding to the voice quality evaluator;
constructing a target loss function from the speech conversion loss function, the speech discrimination loss function and the quality evaluation loss function, and performing minimization iteration on the target loss function;
wherein the constructing of the target loss function and the minimization iteration on the target loss function specifically comprise:
configuring corresponding learning hyper-parameters to be optimized for the speech conversion loss function, the speech discrimination loss function and the quality evaluation loss function respectively;
performing weighted summation of the speech conversion loss function, the speech discrimination loss function and the quality evaluation loss function according to the learning hyper-parameters to be optimized, so as to determine the target loss function;
performing minimization iterative computation on the target loss function to realize joint training optimization of the voice conversion-inverse conversion process, the speech discrimination process and the voice quality evaluation process;
wherein the target loss function is constructed by the following formula:
$L_{total} = \alpha L_{VC} + \beta L_{AD} + \gamma L_{QA}$

wherein $L_{VC}$ represents the speech conversion loss function, $L_{AD}$ represents the speech discrimination loss function, $L_{QA}$ represents the quality evaluation loss function, and $\alpha$, $\beta$ and $\gamma$ respectively represent the learning hyper-parameters to be optimized.
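To make the weighted-summation construction concrete, the following is a minimal PyTorch sketch (illustrative only, not part of the claims) of assembling the target loss from the three loss terms and running one minimization step. All names (alpha, beta, gamma, vc_loss, ad_loss, qa_loss) are hypothetical labels chosen for readability, and the per-batch loss values are placeholders.

```python
# Minimal sketch, assuming PyTorch; names and values are illustrative.
import torch

# Learning hyper-parameters to be optimized (alpha, beta, gamma).
alpha = torch.nn.Parameter(torch.tensor(1.0))
beta = torch.nn.Parameter(torch.tensor(1.0))
gamma = torch.nn.Parameter(torch.tensor(1.0))

# In a full system the parameters of the voice converter, inverse
# converter, discriminator and quality evaluator would be included here.
optimizer = torch.optim.Adam([alpha, beta, gamma], lr=1e-3)

def target_loss(vc_loss, ad_loss, qa_loss):
    """Weighted summation of the three loss terms."""
    return alpha * vc_loss + beta * ad_loss + gamma * qa_loss

# Placeholder per-batch loss values standing in for the outputs of the
# conversion-inversion, discrimination and quality-evaluation branches.
loss = target_loss(torch.tensor(0.8), torch.tensor(0.5), torch.tensor(-3.9))

optimizer.zero_grad()
loss.backward()   # one minimization iteration over the target loss
optimizer.step()
```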
2. The method according to claim 1, wherein the converting of the input corpora into corresponding converted speech information through a preset voice converter, and the converting of the converted speech information into corresponding inverted speech information through a preset voice inverse converter, specifically comprises:
determining a source voiceprint embedding vector corresponding to the source speech and a target voiceprint embedding vector corresponding to the target speech;
inputting the source voiceprint embedding vector and the source speech to the voice converter, and determining converted source speech corresponding to the source speech and converted target speech corresponding to the target speech;
determining the converted source speech and the converted target speech as the converted speech information;
inputting the converted source speech and the target voiceprint embedding vector to the voice inverse converter, and determining inverted source speech corresponding to the converted source speech and inverted target speech corresponding to the converted target speech;
and determining the inverted source speech and the inverted target speech as the inverted speech information.
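The conversion / inverse-conversion flow of claim 2 can be pictured with the sketch below. SpeakerEncoder and VoiceConverter are hypothetical stand-in modules (the claim fixes no architecture), the 80-dimensional feature sequences are assumed shapes, and the inverse converter reuses the same interface.

```python
# Minimal sketch of the flow in claim 2; module names and shapes are
# illustrative assumptions, not taken from the patent.
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Maps a feature sequence to a fixed-size voiceprint embedding."""
    def __init__(self, feat_dim=80, emb_dim=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True)

    def forward(self, x):
        _, h = self.rnn(x)          # h: (1, batch, emb_dim)
        return h.squeeze(0)

class VoiceConverter(nn.Module):
    """Transforms speech features conditioned on a voiceprint embedding."""
    def __init__(self, feat_dim=80, emb_dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim + emb_dim, feat_dim)

    def forward(self, x, emb):
        cond = emb.unsqueeze(1).expand(-1, x.size(1), -1)
        return self.proj(torch.cat([x, cond], dim=-1))

encoder, converter, inverter = SpeakerEncoder(), VoiceConverter(), VoiceConverter()

source = torch.randn(2, 100, 80)        # source speech features
target = torch.randn(2, 100, 80)        # target speech features

src_emb, tgt_emb = encoder(source), encoder(target)   # voiceprint embeddings

conv_src = converter(source, src_emb)   # converted source speech
conv_tgt = converter(target, tgt_emb)   # converted target speech

inv_src = inverter(conv_src, tgt_emb)   # inverted source speech
inv_tgt = inverter(conv_tgt, tgt_emb)   # inverted target speech
```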
3. The method according to claim 2, wherein the speech conversion loss function is determined based on the following steps:
determining an inverted source voiceprint embedding vector corresponding to the inverted source speech and an inverted target voiceprint embedding vector corresponding to the inverted target speech;
determining a first mean square error between the source voiceprint embedding vector and the inverted source voiceprint embedding vector, and a second mean square error between the target voiceprint embedding vector and the inverted target voiceprint embedding vector;
defining the sum of the first mean square error and the second mean square error as the speech conversion loss function, wherein the speech conversion loss function is used to describe the speaker similarity between the source speech and the target speech.
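As a sketch of claim 3 (illustrative only): the loss is the sum of two mean square errors over voiceprint embeddings. The random tensors are placeholders for the four embeddings produced by the encoder in the previous sketch.

```python
# Minimal sketch of the speech conversion loss in claim 3.
import torch
import torch.nn.functional as F

src_emb = torch.randn(2, 256)       # source voiceprint embedding
inv_src_emb = torch.randn(2, 256)   # inverted-source voiceprint embedding
tgt_emb = torch.randn(2, 256)       # target voiceprint embedding
inv_tgt_emb = torch.randn(2, 256)   # inverted-target voiceprint embedding

mse_first = F.mse_loss(src_emb, inv_src_emb)    # first mean square error
mse_second = F.mse_loss(tgt_emb, inv_tgt_emb)   # second mean square error

vc_loss = mse_first + mse_second    # speech conversion loss
```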
4. The method according to claim 2, wherein the speech discrimination loss function is determined based on the following steps:
determining a first discrimination score output by the speech discriminator for the converted source speech and the inverted source speech, and a second discrimination score output by the speech discriminator for the converted target speech and the inverted target speech;
and performing a normalized exponential (softmax) operation on the first discrimination score and the second discrimination score respectively, and defining the sum of the first discrimination score and the second discrimination score after the normalized exponential operation as the speech discrimination loss function, wherein the speech discrimination loss function is used to describe the detectability of speech counterfeits.
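One plausible reading of claim 4, sketched below: the discriminator emits a logit pair per converted/inverted input pair, each pair is passed through a normalized exponential (softmax), and the normalized scores are summed. The two-logit layout and the choice of which component to sum are assumptions, not fixed by the claim.

```python
# Minimal sketch of the speech discrimination loss in claim 4, assuming
# the discriminator emits a (genuine, fake) logit pair per input pair.
import torch
import torch.nn.functional as F

score_first = torch.tensor([2.1, -0.7])   # logits for converted/inverted source pair
score_second = torch.tensor([1.4, 0.3])   # logits for converted/inverted target pair

# Normalized exponential (softmax) over each logit pair.
p_first = F.softmax(score_first, dim=0)
p_second = F.softmax(score_second, dim=0)

# Sum of the softmax-normalized scores, taking the "fake" component.
ad_loss = p_first[1] + p_second[1]   # speech discrimination loss
```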
5. The method according to claim 3, wherein the quality evaluation loss function is determined based on the following steps:
determining, through the voice quality evaluator and based on a perceptual objective listening quality assessment algorithm, a first MOS score between the source voiceprint embedding vector and the inverted source voiceprint embedding vector, and a second MOS score between the target voiceprint embedding vector and the inverted target voiceprint embedding vector;
and negating the first MOS score and the second MOS score, and defining their sum as the quality evaluation loss function, wherein the quality evaluation loss function is used to describe the evaluability of speech quality.
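Claim 5 then reduces to negating and summing the two MOS scores, so that minimizing the target loss pushes perceived quality up. A sketch with placeholder MOS values on the usual 1-5 scale:

```python
# Minimal sketch of the quality evaluation loss in claim 5; the MOS
# values are placeholders for the quality evaluator's outputs.
mos_first = 3.8    # first MOS score
mos_second = 4.1   # second MOS score

qa_loss = (-mos_first) + (-mos_second)   # quality evaluation loss
print(qa_loss)                           # ≈ -7.9
```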
6. A comprehensive training device for speech synthesis and false identification evaluation, characterized by comprising:
an acquisition module, configured to obtain source speech and target speech as input corpora;
a conversion-inversion module, configured to convert the input corpora into corresponding converted speech information through a preset voice converter, and to convert the converted speech information into corresponding inverted speech information through a preset voice inverse converter;
a discrimination and evaluation module, configured to determine, through a preset speech discriminator, discrimination scores corresponding to the converted speech information and the inverted speech information, and to determine, through a preset voice quality evaluator, MOS (Mean Opinion Score) values between the inverted speech information and the input corpora;
a loss function construction module, configured to respectively determine a speech conversion loss function corresponding to the voice conversion-inverse conversion process, a speech discrimination loss function corresponding to the speech discriminator, and a quality evaluation loss function corresponding to the voice quality evaluator;
a training module, configured to construct a target loss function from the speech conversion loss function, the speech discrimination loss function and the quality evaluation loss function, and to perform minimization iteration on the target loss function;
wherein the training module is specifically configured to:
configure corresponding learning hyper-parameters to be optimized for the speech conversion loss function, the speech discrimination loss function and the quality evaluation loss function respectively;
perform weighted summation of the speech conversion loss function, the speech discrimination loss function and the quality evaluation loss function according to the learning hyper-parameters to be optimized, so as to determine the target loss function;
perform minimization iterative computation on the target loss function to realize joint training optimization of the voice conversion-inverse conversion process, the speech discrimination process and the voice quality evaluation process;
wherein the target loss function is constructed by the following formula:
$L_{total} = \alpha L_{VC} + \beta L_{AD} + \gamma L_{QA}$

wherein $L_{VC}$ represents the speech conversion loss function, $L_{AD}$ represents the speech discrimination loss function, $L_{QA}$ represents the quality evaluation loss function, and $\alpha$, $\beta$ and $\gamma$ respectively represent the learning hyper-parameters to be optimized.
7. The device according to claim 6, wherein the conversion-inversion module is specifically configured to:
determine a source voiceprint embedding vector corresponding to the source speech and a target voiceprint embedding vector corresponding to the target speech;
input the source voiceprint embedding vector and the source speech to the voice converter, and determine converted source speech corresponding to the source speech and converted target speech corresponding to the target speech;
determine the converted source speech and the converted target speech as the converted speech information;
input the converted source speech and the target voiceprint embedding vector to the voice inverse converter, and determine inverted source speech corresponding to the converted source speech and inverted target speech corresponding to the converted target speech;
and determine the inverted source speech and the inverted target speech as the inverted speech information.
8. An electronic device, comprising: a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor and the memory communicate via the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the comprehensive training method for speech synthesis and false identification evaluation according to any one of claims 1 to 5.
9. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, performs the steps of the comprehensive training method for speech synthesis and false identification evaluation according to any one of claims 1 to 5.
CN202211552858.XA 2022-12-06 2022-12-06 Comprehensive training method and device for speech synthesis and false identification evaluation Active CN115620748B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211552858.XA CN115620748B (en) 2022-12-06 2022-12-06 Comprehensive training method and device for speech synthesis and false identification evaluation


Publications (2)

Publication Number Publication Date
CN115620748A (en) 2023-01-17
CN115620748B (en) 2023-03-28

Family

ID=84879698

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211552858.XA Active CN115620748B (en) 2022-12-06 2022-12-06 Comprehensive training method and device for speech synthesis and false identification evaluation

Country Status (1)

Country Link
CN (1) CN115620748B (en)


Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6076057A (en) * 1997-05-21 2000-06-13 At&T Corp Unsupervised HMM adaptation based on speech-silence discrimination
CN110060701B (en) * 2019-04-04 2023-01-31 南京邮电大学 Many-to-many voice conversion method based on VAWGAN-AC
US11854562B2 (en) * 2019-05-14 2023-12-26 International Business Machines Corporation High-quality non-parallel many-to-many voice conversion
CN111445900A (en) * 2020-03-11 2020-07-24 平安科技(深圳)有限公司 Front-end processing method and device for voice recognition and terminal equipment
WO2021229643A1 (en) * 2020-05-11 2021-11-18 日本電信電話株式会社 Sound signal conversion model learning device, sound signal conversion device, sound signal conversion model learning method, and program
CN114360583A (en) * 2022-01-05 2022-04-15 新疆大学 Voice quality evaluation method based on neural network
CN114882897A (en) * 2022-05-13 2022-08-09 平安科技(深圳)有限公司 Training of voice conversion model, voice conversion method, device and related equipment
CN115273804A (en) * 2022-07-29 2022-11-01 平安科技(深圳)有限公司 Voice conversion method and device based on coding model, electronic equipment and medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021137754A1 (en) * 2019-12-31 2021-07-08 National University Of Singapore Feedback-controlled voice conversion
WO2022142115A1 (en) * 2020-12-31 2022-07-07 平安科技(深圳)有限公司 Adversarial learning-based speaker voice conversion method and related device
CN113555023A (en) * 2021-09-18 2021-10-26 中国科学院自动化研究所 Method for joint modeling of voice authentication and speaker recognition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Miao Xiaokong; Sun Meng; Zhang Xiongwei; Li Jiakang; Zhang Xingyu. Speech deepfake based on parameter conversion and its threat assessment to voiceprint authentication. Journal of Cyber Security (信息安全学报), 2020, Vol. 5, No. 6, pp. 53-59. *

Also Published As

Publication number Publication date
CN115620748A (en) 2023-01-17


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant