
CN112309426B - Voice processing model training method and device and voice processing method and device

Info

Publication number: CN112309426B
Application number: CN202011330109.3A
Authority: CN (China)
Prior art keywords: signal, estimated, amplitude mask, ideal amplitude, audio
Legal status: Active (application granted)
Other languages: Chinese (zh)
Other versions: CN112309426A
Inventors: 郑羲光, 李楠, 任新蕾, 张晨
Current Assignee: Beijing Dajia Internet Information Technology Co Ltd
Original Assignee: Beijing Dajia Internet Information Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0324 Details of processing therefor
    • G10L21/034 Automatic adjustment
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Control Of Amplification And Gain Control (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure provides a training method and apparatus for a speech processing model, and a speech processing method and apparatus. The training method comprises the following steps: generating a mixed signal and a target signal based on at least one of a speech signal, a noise signal, and a specific signal; inputting the mixed signal into the speech processing model to obtain estimated data; determining a loss function based on the target signal and the estimated data; and training the speech processing model based on the loss function to adjust parameters of the speech processing model.

Description

Voice processing model training method and device and voice processing method and device
Technical Field
The disclosure relates to the technical field of audio, and in particular relates to a training method and device of a speech processing model, and a speech processing method and device.
Background
With the rapid development of electronic technology and network technology, electronic devices can process audio signals in a time-frequency domain based on a voice processing algorithm of a neural network.
Although neural network-based speech enhancement and noise reduction have achieved performance beyond conventional signal processing methods and can operate efficiently on electronic devices, for the problems of speech enhancement (increasing the speech component while keeping the non-speech components unchanged) and speech denoising (reducing the non-speech components while keeping the speech component unchanged), two separate neural networks are typically trained to achieve enhancement and denoising respectively. In addition, when speech is processed with these two types of neural networks, one type of signal is always amplified or attenuated while the other type is kept unchanged.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a speech processing model, and a speech processing method and apparatus, to at least solve the problem of performing speech enhancement and denoising simultaneously with a single neural network.
According to a first aspect of embodiments of the present disclosure, there is provided a training method of a speech processing model, the method may include: generating a mixed signal and a target signal based on at least one of a speech signal, a noise signal, and a specific signal; inputting the mixed signal into the speech processing model to obtain estimated data; determining a loss function based on the target signal and the estimated data; and training the speech processing model based on the loss function to adjust parameters of the speech processing model.
Alternatively, the step of generating the mixed signal based on at least one of the voice signal, the noise signal, and the specific signal may include: multiplying the specific signal by a first gain to obtain a first signal and multiplying the noise signal by a second gain to obtain a second signal; the mixed signal is generated by mixing the first signal, the second signal and the speech signal.
Alternatively, the first gain may be determined based on a first predetermined signal-to-noise ratio and the second gain may be determined based on a second signal-to-noise ratio and the first gain.
Alternatively, the step of generating the target signal based on at least one of the voice signal, the noise signal, and the specific signal may include: multiplying the speech signal by a third gain to obtain a third signal; the target signal is generated by mixing the third signal and the first signal.
Alternatively, the estimated data may be an estimated target signal or an estimated ideal amplitude mask, wherein the ideal amplitude mask is related to the signal energy.
Optionally, in the case that the estimated data is an estimated ideal amplitude mask, the step of determining a loss function based on the target signal and the estimated data may comprise: calculating a target ideal amplitude mask based on the target signal and the mixed signal; a loss function is determined based on the target ideal amplitude mask and the estimated data.
Alternatively, the target ideal amplitude mask may be an amplitude ratio of the target signal and the mixed signal in a time-frequency domain.
According to a second aspect of embodiments of the present disclosure, there is provided a speech processing method, the method may comprise: acquiring an audio signal, wherein the audio signal comprises at least one of a speech signal, a noise signal, and a specific signal, the specific signal belonging to an audio type that needs to be neither enhanced nor suppressed; obtaining an ideal amplitude mask based on the audio signal using a speech processing model; and processing the audio signal differently according to the magnitude of the ideal amplitude mask to obtain a desired signal.
Alternatively, the speech processing model may be trained by the training method described above.
Optionally, the step of differently processing the audio signal to obtain the desired signal according to the size of the ideal amplitude mask may include: the desired signal is obtained by comparing the ideal amplitude mask with a predetermined threshold value to determine whether to obtain the desired signal based on an estimated signal resulting from multiplication of the audio signal with the ideal amplitude mask.
Optionally, the step of determining whether to obtain the desired signal based on an estimated signal resulting from multiplication of the audio signal with the ideal amplitude mask may comprise: multiplying the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; otherwise, using the audio signal as the desired signal.
Optionally, the step of determining whether to obtain the desired signal based on an estimated signal resulting from multiplication of the audio signal with the ideal amplitude mask may comprise: taking the estimated signal as the desired signal if the ideal amplitude mask is less than the predetermined threshold; otherwise, using the audio signal as the desired signal.
Optionally, the step of determining whether to obtain the desired signal based on an estimated signal resulting from multiplication of the audio signal with the ideal amplitude mask may comprise: multiplying the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; taking the estimated signal as the desired signal if the ideal amplitude mask is less than the predetermined threshold; otherwise, using the audio signal as the desired signal.
Optionally, the output of the speech processing model is the ideal amplitude mask or the estimated target signal, wherein the step of obtaining the ideal amplitude mask may comprise, in case the output of the speech processing model is the estimated target signal: obtaining an estimated target signal by applying the audio signal to a speech processing model; the ideal amplitude mask is obtained based on the estimated target signal and the audio signal.
According to a third aspect of embodiments of the present disclosure, there is provided a training apparatus for a speech processing model, the apparatus may include: a data generation module configured to: generate a mixed signal and a target signal based on at least one of a speech signal, a noise signal, and a specific signal; and a data training module configured to: input the mixed signal into the speech processing model to obtain estimated data; determine a loss function based on the target signal and the estimated data; and train the speech processing model based on the loss function to adjust parameters of the speech processing model.
Alternatively, the data generation module may be configured to: multiplying the specific signal by a first gain to obtain a first signal and multiplying the noise signal by a second gain to obtain a second signal; and generating the mixed signal by mixing the first signal, the second signal and the speech signal.
Alternatively, the first gain may be determined based on a first predetermined signal-to-noise ratio and the second gain may be determined based on a second signal-to-noise ratio and the first gain.
Alternatively, the data generation module may be configured to: multiplying the speech signal by a third gain to obtain a third signal; the target signal is generated by mixing the third signal and the first signal.
Alternatively, the estimated data may be an estimated target signal or an estimated ideal amplitude mask, wherein the ideal amplitude mask may be related to signal energy.
Optionally, in the case where the estimated data is an estimated ideal amplitude mask, the data training module may be configured to: calculating a target ideal amplitude mask based on the target signal and the mixed signal; a loss function is determined based on the target ideal amplitude mask and the estimated data.
Alternatively, the target ideal amplitude mask may be an amplitude ratio of the target signal and the mixed signal in a time-frequency domain.
According to a fourth aspect of embodiments of the present disclosure, there is provided a speech processing apparatus, the apparatus may comprise: a data acquisition module configured to acquire an audio signal, wherein the audio signal includes at least one of a speech signal, a noise signal, and a specific signal belonging to an audio type that needs to be neither enhanced nor suppressed; and a data processing module configured to: obtain an ideal amplitude mask based on the audio signal using a speech processing model; and process the audio signal differently according to the magnitude of the ideal amplitude mask to obtain a desired signal.
Alternatively, the data processing module may be configured to: the desired signal is obtained by comparing the ideal amplitude mask with a predetermined threshold value to determine whether to obtain the desired signal based on an estimated signal resulting from multiplication of the audio signal with the ideal amplitude mask.
Alternatively, the data processing module may be configured to: multiply the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; otherwise, use the audio signal as the desired signal.
Alternatively, the data processing module may be configured to: take the estimated signal as the desired signal if the ideal amplitude mask is less than the predetermined threshold; otherwise, use the audio signal as the desired signal.
Alternatively, the data processing module may be configured to: multiply the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; take the estimated signal as the desired signal if the ideal amplitude mask is less than the predetermined threshold; otherwise, use the audio signal as the desired signal.
Alternatively, the output of the speech processing model may be the ideal amplitude mask or the estimated target signal, wherein, in case the output of the speech processing model is the estimated target signal, the data processing module may be configured to: obtaining an estimated target signal by applying the audio signal to a speech processing model; the ideal amplitude mask is obtained based on the estimated target signal and the audio signal.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, which may include: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a speech processing method and a model training method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the speech processing method and the model training method as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, instructions in which are executed by at least one processor in an electronic device to perform the speech processing method and model training method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
Speech enhancement and denoising are integrated into a single deep neural network for training, and speech enhancement and denoising can then be performed separately, or simultaneously, through post-processing based on the ideal amplitude mask (IRM). In addition, the training targets are divided into three categories in the model design (i.e., speech (requiring enhancement), noise (requiring suppression), and other audio such as music (requiring neither enhancement nor suppression)). A speech processing model trained with such training data differs from a model for single-purpose speech enhancement or speech noise reduction, so that the model better matches practical application requirements and speech processing can be performed more efficiently.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flow chart of a speech processing method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of training a speech processing model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a training speech processing model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a trained speech processing model according to another embodiment of the present disclosure;
FIG. 5 is a flow diagram of a speech processing method according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Throughout the drawings, it should be noted that the same reference numerals are used to designate the same or similar elements, features and structures.
Detailed Description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure defined by the claims and their equivalents. Various specific details are included to aid understanding, but are merely to be considered exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The embodiments described in the examples below are not representative of all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be noted that, in this disclosure, "at least one of the items" covers three parallel cases: "any one of the items", "any combination of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "at least one of the first and second steps is executed" covers the following three parallel cases: (1) executing step one; (2) executing step two; (3) executing step one and step two.
In related speech enhancement and denoising applications, using two separate neural networks, one for speech enhancement and one for noise cancellation, doubles the complexity, which is disadvantageous for applications running on electronic devices. Accordingly, the present disclosure proposes a method that performs speech enhancement and denoising simultaneously with a single neural network, that is, noise suppression and speech enhancement are guaranteed at the same time.
In addition, a new category is introduced in the model design, namely audio types such as music that should be neither amplified nor attenuated, which makes the model design better match practical application requirements. Thus, the speech processing model of the present disclosure can simultaneously ensure noise suppression, speech enhancement, and unchanged loudness of other types of sound.
Hereinafter, according to various embodiments of the present disclosure, the method, apparatus, and system of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of a speech processing method according to an embodiment of the present disclosure. The speech processing method shown in fig. 1 may be executed at a network end connected to the electronic device or locally at the electronic device.
The electronic device may be any electronic device having the functions of speech/text reception, speech processing, and executing commands. In exemplary embodiments of the present disclosure, the electronic apparatus may include, for example, but not limited to, a portable communication device (e.g., a smart phone), a computer device, a portable multimedia device, a portable medical device, a camera, a wearable device, or a server, etc. According to the embodiments of the present disclosure, the electronic device is not limited to the above.
Referring to fig. 1, in step S101, an audio signal is acquired. Because a new audio class, i.e., a type that should be neither amplified nor attenuated (such as a music signal), is introduced during the model training phase of the present disclosure, the speech processing model of the present disclosure is not used purely for speech enhancement and speech noise reduction. Thus, the present disclosure can perform speech processing on multiple types of signals. For example, the audio signal may include at least one of a speech signal, a noise signal, and a specific signal. Here, the specific signal belongs to an audio type that needs to be neither enhanced nor suppressed. For example, the specific signal is a music signal. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
In step S102, an ideal amplitude mask (IRM) is obtained using a speech processing model based on the acquired audio signal. How the speech processing model is obtained will be described in detail below with reference to fig. 2. Fig. 2 is a flow chart of a method of training a speech processing model according to an embodiment of the present disclosure. The execution subject of the model training method provided in the embodiments of the present disclosure may be the model training apparatus provided in the embodiments of the present disclosure, or may be an electronic device including the model training apparatus. This may be determined according to actual use requirements, and embodiments of the present disclosure are not limited in this respect.
Referring to fig. 2, in step S201, a mixed signal and a target signal are generated based on at least one of a speech signal, a noise signal, and a specific signal, where the specific signal belongs to an audio type that needs to be neither enhanced nor suppressed. For example, the specific signal may be a music signal. According to an embodiment of the present disclosure, other types of signals may also be used in generating the mixed signal and the target signal; that is, the training data is not limited to the three categories above and may include more types of audio signals.
As an example, three data sources may be included in the mixed signal, such as a speech signal S (t), a specific signal M (t), and a noise signal N (t). Where t represents time. The speech signal S (t) may refer to a signal requiring enhancement, the specific signal M (t) may refer to an audio type requiring neither enhancement nor suppression, and the noise signal N (t) may refer to a signal requiring suppression.
In generating the mixed signal, the specific signal M (t) may be multiplied by a first gain to obtain a first signal and the noise signal N (t) may be multiplied by a second gain to obtain a second signal, and then the mixed signal may be generated by mixing the first signal, the second signal, and the voice signal S (t). For example, the mixed signal may be represented by the following equation (1):
Mix(t) = S(t) + M(t)*g_SNR1 + N(t)*g_SNR2 (1)
where Mix(t) is the mixed signal, g_SNR1 is the first gain, and g_SNR2 is the second gain.
In generating the target signal, the speech signal S(t) may be multiplied by a third gain to obtain a third signal, and then the target signal may be generated by mixing the third signal and the first signal. For example, the target signal may be represented by the following equation (2):
Tar(t) = S(t)*g_tar + M(t)*g_SNR1 (2)
where Tar(t) is the target signal and g_tar is the third gain. Here, the third gain may be a target voice amplification gain.
According to the embodiment of the disclosure, the first gain, the second gain and the third gain can be determined according to the preset signal-to-noise ratio, so that the generated mixed signal and the target signal are more in line with the actual situation, and the trained voice processing model is more accurate. The third gain may be adjusted by the user according to actual needs, or may be a predetermined value, to which the present disclosure is not limited.
As an example, the first gain may be determined based on a first predetermined signal-to-noise ratio, and the second gain may be determined based on a second signal-to-noise ratio and the first gain, for example according to equations (3) and (4), where TargetSNR1 is the first predetermined signal-to-noise ratio and TargetSNR2 is the second signal-to-noise ratio. TargetSNR1 is the energy ratio between the speech signal and the specific signal, and TargetSNR2 is the energy ratio of the speech signal plus the specific signal to the noise signal. The above examples are merely exemplary, and the present disclosure is not limited thereto. Alternatively, different signal-to-noise ratios may be set according to actual requirements.
In addition, if the training data includes other types of audio signals in addition to the above speech signal, noise signal, and specific signal when generating the mixed signal and the target signal, the signal types can be distinguished by applying a different target gain to each type of signal so as to satisfy the signal-to-noise ratios actually required. A sketch of this data generation procedure is given below.
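As an illustration only, the following Python sketch shows one plausible way to generate the mixed signal of equation (1) and the target signal of equation (2). It assumes that the predetermined signal-to-noise ratios are energy ratios expressed in dB and that the gains play the role of g_SNR1 and g_SNR2 in equations (3) and (4), which are not reproduced above; the function and variable names (e.g., make_training_pair, g_tar) are illustrative and do not appear in the original disclosure.

```python
import numpy as np

def energy(x):
    """Average energy of a time-domain signal."""
    return np.mean(x ** 2) + 1e-12

def make_training_pair(s, m, n, target_snr1_db, target_snr2_db, g_tar=2.0):
    """Build one (mixed, target) training pair from speech s, specific signal m
    (e.g., music) and noise n, following equations (1) and (2).

    Assumption: TargetSNR1/TargetSNR2 are energy ratios given in dB."""
    # First gain: scale the specific signal so that the speech-to-specific
    # energy ratio equals TargetSNR1.
    g_snr1 = np.sqrt(energy(s) / (energy(m) * 10 ** (target_snr1_db / 10)))
    first = m * g_snr1

    # Second gain: scale the noise so that the (speech + specific)-to-noise
    # energy ratio equals TargetSNR2.
    g_snr2 = np.sqrt(energy(s + first) / (energy(n) * 10 ** (target_snr2_db / 10)))
    second = n * g_snr2

    mixed = s + first + second     # equation (1)
    target = s * g_tar + first     # equation (2): speech amplified, specific signal kept
    return mixed, target

# Usage: one second of dummy 16 kHz signals.
rng = np.random.default_rng(0)
s, m, n = (rng.standard_normal(16000) for _ in range(3))
mixed, target = make_training_pair(s, m, n, target_snr1_db=5.0, target_snr2_db=10.0)
```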
In step S202, the mixed signal is input to a speech processing model to obtain estimation data. Here, the speech processing model may be obtained by training a deep neural network.
According to embodiments of the present disclosure, different speech processing models may be obtained from different training data. Here, the estimation data may be an estimated target signal or an estimated ideal amplitude mask.
In step S203, a loss function is determined based on the target signal and the estimation data. As an example, in the case where the estimated data is an estimated ideal amplitude mask, the target ideal amplitude mask may be calculated first based on the target signal and the mixed signal, and then the loss function may be determined based on the target ideal amplitude mask and the estimated data.
In step S204, the speech processing model is trained based on the loss function to adjust parameters of the speech processing model. The training process of the speech processing model in the case where the output of the speech processing model is the estimated target signal will be described in detail with reference to fig. 3, and the training process of the speech processing model in the case where the output of the speech processing model is the estimated ideal amplitude mask will be described in detail with reference to fig. 4.
In the case where the output of the speech processing model is the estimated target signal, the speech processing model may be trained with reference to fig. 3. FIG. 3 is a schematic diagram of a trained speech processing model according to an embodiment of the disclosure.
Referring to fig. 3, the mixed signal Mix(t) and the target signal Tar(t) are respectively transformed to the time-frequency domain by the short-time Fourier transform (STFT) to obtain the time-frequency-domain mixed signal Mix(n,k) and target signal Tar(n,k). For example, if the target signal Tar and the mixed signal Mix of length T are Tar(t) and Mix(t) in the time domain, where t represents time and 0 < t ≤ T, then after the STFT, Tar(t) and Mix(t) can be expressed as:
Tar(n,k)=STFT(Tar(t)) (5)
Mix(n,k)=STFT(Mix(t)) (6)
where n is the frame index, 0 < n ≤ N, and N is the total number of frames; k is the center-frequency index, 0 < k ≤ K, and K is the total number of frequencies.
Next, the time-frequency-domain mixed signal Mix(n,k) is input to the deep neural network (DNN), and the estimated target signal Tar_est(n,k) is output from the DNN. A loss function is constructed based on the target signal Tar(n,k) and the estimated target signal Tar_est(n,k), the DNN is iteratively optimized based on the loss function, and training is completed upon convergence to obtain the speech processing model. However, the above examples of constructing the loss function are merely exemplary, and the present disclosure is not limited thereto.
After inputting the audio signal into the speech processing model trained as shown in fig. 3, an estimated target signal can be obtained.
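A minimal PyTorch-style sketch of the training loop of fig. 3 is given below, assuming the mixed and target signals have already been generated as above. For simplicity the network here estimates only the magnitude spectrum of the target signal; the two-layer network, the mean-squared-error loss, and all hyper-parameters are illustrative assumptions, since the disclosure does not fix a particular architecture or loss function.

```python
import torch

def stft_mag(x, n_fft=512, hop=256):
    """Magnitude spectrogram |X(n, k)| of a batch of time-domain signals."""
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs()

# Illustrative DNN: maps the mixed magnitude spectrum to an estimated target spectrum.
n_freq = 257  # n_fft // 2 + 1
dnn = torch.nn.Sequential(
    torch.nn.Linear(n_freq, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, n_freq), torch.nn.ReLU(),
)
optimizer = torch.optim.Adam(dnn.parameters(), lr=1e-3)

def train_step(mixed, target):
    """One optimization step: loss between Tar_est(n, k) and Tar(n, k)."""
    mix_mag = stft_mag(mixed).transpose(1, 2)    # (batch, frames, freq)
    tar_mag = stft_mag(target).transpose(1, 2)
    tar_est = dnn(mix_mag)                       # estimated target magnitude
    loss = torch.mean((tar_est - tar_mag) ** 2)  # illustrative MSE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```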
In the case where the output of the speech processing model is the estimated ideal amplitude mask, the speech processing model may be trained with reference to fig. 4. FIG. 4 is a schematic diagram of a trained speech processing model according to another embodiment of the disclosure.
Referring to fig. 4, the mixed signal Mix(t) and the target signal Tar(t) are respectively transformed to the time-frequency domain by the short-time Fourier transform (STFT) to obtain the time-frequency-domain mixed signal Mix(n,k) and target signal Tar(n,k). For example, if the target signal Tar and the mixed signal Mix of length T are Tar(t) and Mix(t) in the time domain, where t represents time and 0 < t ≤ T, then after the STFT, Tar(t) and Mix(t) can be expressed in the time-frequency domain as equations (5) and (6):
Tar(n,k)=STFT(Tar(t)) (5)
Mix(n,k)=STFT(Mix(t)) (6)
where n is the frame index, 0 < n ≤ N, and N is the total number of frames; k is the center-frequency index, 0 < k ≤ K, and K is the total number of frequencies.
The target ideal amplitude mask is calculated based on the mixed signal Mix(n,k) and the target signal Tar(n,k). For example, the target ideal amplitude mask may be calculated using the following equation (7):
IRM_obj(n,k) = |Tar(n,k)| / |Mix(n,k)| (7)
As can be seen from equation (7), the target ideal amplitude mask is the amplitude ratio of the target signal to the mixed signal in the time-frequency domain.
Next, the time-frequency-domain mixed signal Mix(n,k) is input to the deep neural network DNN, and the estimated ideal amplitude mask IRM_est(n,k) is output from the DNN. Then, a loss function is constructed based on the target ideal amplitude mask IRM_obj(n,k) and the estimated ideal amplitude mask IRM_est(n,k), and the DNN is optimized based on the loss function to adjust the network parameters, thereby obtaining the speech processing model. However, the above examples of constructing the loss function are merely exemplary, and the present disclosure is not limited thereto.
After inputting the audio signal into the speech processing model trained as shown in fig. 4, an estimated ideal amplitude mask can be obtained.
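For the variant of fig. 4, only the training target changes: the target ideal amplitude mask of equation (7) is computed from the target and mixed spectra, and the loss is taken against the mask estimated by the network. A hedged sketch, reusing the stft_mag, dnn, and optimizer names assumed in the previous snippet:

```python
def train_step_irm(mixed, target, eps=1e-8):
    """One optimization step for a model that outputs IRM_est(n, k)."""
    mix_mag = stft_mag(mixed).transpose(1, 2)
    tar_mag = stft_mag(target).transpose(1, 2)
    irm_obj = tar_mag / (mix_mag + eps)          # equation (7): target ideal amplitude mask
    irm_est = dnn(mix_mag)                       # network outputs the estimated mask
    loss = torch.mean((irm_est - irm_obj) ** 2)  # illustrative MSE loss on the mask
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```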
Referring back to fig. 1, in step S102, in the case where the output of the speech processing model is an estimated target signal, the estimated target signal may be obtained by applying the acquired audio signal to the speech processing model, and then the ideal amplitude mask may be obtained based on the estimated target signal and the audio signal. For example, when the estimated target signal is output from the speech processing model, the ideal amplitude mask may be calculated using the following equation (8):
IRM(n,k) = |Tar_est(n,k)| / |Aud(n,k)| (8)
where Tar_est(n,k) is the estimated target signal output from the speech processing model, and Aud(n,k) is the time-frequency-domain representation of the acquired audio signal after the short-time Fourier transform.
In step S103, the audio signal is processed differently according to the magnitude of the obtained ideal amplitude mask to obtain a desired signal. Here, the desired signal may be a speech-enhanced signal, a noise-reduced signal, or a signal that has been both speech-enhanced and noise-reduced. The ideal amplitude mask is compared with a predetermined threshold to determine whether the desired signal is obtained from an estimated signal, where the estimated signal is obtained by multiplying the acquired audio signal by the ideal amplitude mask.
As an example, if the ideal amplitude mask is greater than the predetermined threshold, the estimated signal is multiplied by a user-defined gain to obtain the desired signal; otherwise, the acquired audio signal is taken as the desired signal. For example, the desired signal may be obtained according to the following equation (9):
Est(n,k) = Aud(n,k)*IRM(n,k)*g_add, if IRM(n,k) > threshold; Est(n,k) = Aud(n,k), otherwise (9)
where Est(n,k) is the desired signal, Aud(n,k) is the audio signal after the short-time Fourier transform, and g_add is an additional gain adjustable by the user. Here, the predetermined threshold may be 1 or any value set by the user.
After the desired signal Est(n,k) in the time-frequency domain is obtained, the desired signal Est(t) in the time domain is obtained by the inverse short-time Fourier transform.
By the above-described processing, the speech portion in the obtained audio signal can be further enhanced, and the gain of the speech portion desired to be enhanced can be arbitrarily adjusted according to the user's needs.
As another example, if the ideal amplitude mask is less than the predetermined threshold, the estimated signal is taken as the desired signal; otherwise, the acquired audio signal is taken as the desired signal. For example, the desired signal may be obtained according to the following equation (10):
Est(n,k) = Aud(n,k)*IRM(n,k), if IRM(n,k) < threshold; Est(n,k) = Aud(n,k), otherwise (10)
where Est(n,k) is the desired signal and Aud(n,k) is the audio signal. Here, the predetermined threshold may be 1 or any value set by the user.
Through the above processing, a denoising effect on the obtained audio signal can be achieved.
As another example, if the ideal amplitude mask is greater than the predetermined threshold, the estimated signal is multiplied by a user-defined gain to obtain the desired signal; if the ideal amplitude mask is less than the predetermined threshold, the estimated signal is taken as the desired signal; otherwise, the acquired audio signal is taken as the desired signal. For example, the desired signal may be obtained according to the following equation (11):
Est(n,k) = Aud(n,k)*IRM(n,k)*g_add, if IRM(n,k) > threshold; Est(n,k) = Aud(n,k)*IRM(n,k), if IRM(n,k) < threshold; Est(n,k) = Aud(n,k), otherwise (11)
where Est(n,k) is the desired signal, Aud(n,k) is the audio signal, and g_add is an additional gain adjustable by the user. Here, the predetermined threshold may be 1 or any value set by the user.
By the above processing, the voice portion in the obtained audio signal can be further enhanced, and the gain of the voice portion desired to be enhanced can be arbitrarily adjusted according to the user's demand, while the noise reduction processing can be performed on the audio signal.
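The three post-processing modes described by equations (9), (10), and (11) can be summarized in a single routine. The sketch below is illustrative: it operates on time-frequency-domain signals, uses a threshold of 1 by default, and the mode argument ('enhance', 'denoise', 'both') is a naming convention introduced here rather than taken from the disclosure.

```python
import numpy as np

def post_process(aud, irm, g_add=2.0, threshold=1.0, mode="both"):
    """Apply equations (9)-(11) to the time-frequency audio signal Aud(n, k).

    aud, irm: arrays of the same shape (frames, freq).
    mode: 'enhance' -> eq. (9), 'denoise' -> eq. (10), 'both' -> eq. (11)."""
    est = np.array(aud, copy=True)   # default branch: keep Aud(n, k) unchanged
    boost = irm > threshold          # bins where speech dominates
    attenuate = irm < threshold      # bins where noise dominates
    if mode in ("enhance", "both"):
        est[boost] = aud[boost] * irm[boost] * g_add      # amplify speech bins
    if mode in ("denoise", "both"):
        est[attenuate] = aud[attenuate] * irm[attenuate]  # suppress noise bins
    return est
```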
In the above-described embodiments, in both the model training stage and the speech processing stage, the time-domain signals obtained first may be converted into time-frequency-domain signals via the short-time Fourier transform, model training or speech processing may then be performed, and the finally obtained time-frequency-domain signals may be converted back into time-domain signals via the inverse short-time Fourier transform.
Fig. 5 is a flow diagram of a speech processing method according to an embodiment of the present disclosure. In the present embodiment, it is assumed that the output of the speech processing model is an estimated target signal.
Referring to fig. 5, the obtained audio signal Aud (t) is transformed into a signal Aud (n, k) on a time-frequency domain through a short-time fourier transform STFT, and then the signal Aud (n, k) is input to a trained speech processing model.
The estimated target signal Tar_est(n,k) is output from the speech processing model, an ideal amplitude mask IRM(n,k) can be calculated using equation (8), and a determination is then made as to how to post-process the acquired audio signal based on a comparison of the calculated ideal amplitude mask with a predetermined threshold.
The predetermined threshold may be set to 1, the desired signal Est(n,k) in the time-frequency domain is obtained using equation (9), equation (10), or equation (11), and the desired signal Est(n,k) in the time-frequency domain is then subjected to the inverse short-time Fourier transform (ISTFT) to obtain the signal in the time domain.
Further, in the case where the output of the speech processing model is an estimated ideal amplitude mask, the conversion from the estimated target signal to the ideal amplitude mask in fig. 5 may be omitted: the estimated ideal amplitude mask is obtained directly from the speech processing model, and different post-processing operations are then performed based on the comparison of the ideal amplitude mask with the preset threshold.
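Putting the pieces together, the flow of fig. 5 can be sketched as follows for the case where the model outputs an estimated target signal: STFT, model inference, mask computation per equation (8), post-processing per equations (9)-(11), and inverse STFT. The snippet reuses the hypothetical dnn and post_process helpers assumed above and is a sketch rather than the disclosure's reference implementation.

```python
import torch

def process_audio(aud_t, n_fft=512, hop=256, mode="both", g_add=2.0):
    """Fig. 5 inference path for a model whose output is Tar_est(n, k).

    aud_t: 1-D float tensor containing the time-domain audio signal Aud(t)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(aud_t, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)   # Aud(n, k)
    mag = spec.abs().transpose(0, 1)                         # (frames, freq)
    with torch.no_grad():
        tar_est = dnn(mag)                                   # estimated target magnitude
    irm = tar_est / (mag + 1e-8)                             # equation (8)
    est_mag = torch.from_numpy(
        post_process(mag.numpy(), irm.numpy(), g_add=g_add, mode=mode)
    ).float()
    # Re-attach the phase of the input audio and return to the time domain (ISTFT).
    est_spec = torch.polar(est_mag.transpose(0, 1), spec.angle())
    return torch.istft(est_spec, n_fft=n_fft, hop_length=hop, window=window)
```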
Fig. 6 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure. Referring to fig. 6, a speech processing apparatus 600 may include a data acquisition module 601, a data processing module 602, and a model training module 603. Each module in the speech processing apparatus 600 may be implemented by one or more modules, and the names of the corresponding modules may vary according to the types of the modules. In various embodiments, some modules in the speech processing apparatus 600 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus functions of the respective modules/elements prior to combination may be equivalently performed.
The data acquisition module 601 may acquire an audio signal, wherein the audio signal may include at least one of a speech signal, a noise signal, and a specific signal belonging to an audio type that does not need to be enhanced and suppressed. Since a new audio class, i.e., a type that is neither desired to be amplified nor desired to be attenuated (such as a music signal, etc.), is introduced during the model training phase of the present disclosure, the speech processing model of the present disclosure is not purely used for speech enhancement and speech noise reduction, and such a design is more in line with practical application requirements. Thus, the present disclosure can perform speech processing on multiple types of signals.
The data processing module 602 may obtain an ideal amplitude mask using a speech processing model based on the acquired audio signal, and then process the audio signal differently according to the magnitude of the ideal amplitude mask to obtain a desired signal.
As an example, the data processing module 602 may determine whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal with the ideal amplitude mask by comparing the ideal amplitude mask with a predetermined threshold.
For example, if the ideal amplitude mask is greater than a predetermined threshold, the data processing module 602 may multiply an estimated signal resulting from multiplication of the audio signal and the ideal amplitude mask with a gain defined by the user to obtain the desired signal; otherwise the audio signal may be taken as the desired signal. Here, the preset threshold value may be set to 1, or an arbitrary value set by the user. The post-speech processing operation may be performed with reference to equation (9).
For another example, if the ideal amplitude mask is less than a predetermined threshold, the data processing module 602 may treat the estimated signal as the desired signal; otherwise the audio signal may be taken as the desired signal. The post-speech processing operation may be performed with reference to equation (10).
For another example, if the ideal amplitude mask is greater than a predetermined threshold, the data processing module 602 may multiply the estimated signal with a gain defined by the user to obtain the desired signal. If the ideal amplitude mask is less than the predetermined threshold, the data processing module 602 may treat the estimated signal as the desired signal; otherwise the data processing module 602 may treat the audio signal as the desired signal. The post-speech processing operation may be performed with reference to equation (11).
Different speech processing models can be trained due to different training data. In the present disclosure, the output of the speech processing model may be an ideal amplitude mask or an estimated target signal.
In the case where the output of the speech processing model is an estimated target signal, the data processing module 602 may obtain the estimated target signal by applying the obtained audio signal to the speech processing model, and then obtain an ideal amplitude mask based on the estimated target signal and the audio signal. A speech post-processing operation is performed based on the obtained ideal amplitude mask.
Optionally, the speech processing apparatus can further comprise a model training module 603. Model training module 603 may train the speech processing model based on the following method: generating a mixed signal and a target signal based on at least one of the voice signal, the noise signal, and the specific signal, inputting the mixed signal into a voice processing model to obtain estimated data; determining a loss function based on the target signal and the estimated data; the speech processing model is trained based on the loss function to adjust parameters of the speech processing model.
Alternatively, the model training module 603 may multiply the specific signal by a first gain to obtain a first signal and multiply the noise signal by a second gain to obtain a second signal, and generate the mixed signal by mixing the first signal, the second signal, and the voice signal. For example, equation (1) may be utilized to generate a mixed signal.
Alternatively, the model training module 603 may multiply the speech signal by a third gain to obtain a third signal and generate the target signal by mixing the third signal with the first signal. For example, equation (2) may be utilized to generate the target signal.
Alternatively, the first gain may be determined based on a first predetermined signal-to-noise ratio and the second gain may be determined based on the second signal-to-noise ratio and the first gain. Through the design, the generated mixed signal and target signal serving as training data are more in line with the actual application requirements.
Since the speech processing models are different, the estimated data output by the speech processing models may be different. For example, the estimated data output by the speech processing model may be an estimated target signal or an estimated ideal amplitude mask.
In the case where the estimated data is an estimated ideal amplitude mask, the model training module 603 may calculate a target ideal amplitude mask based on the target signal and the mixed signal, and then determine a loss function based on the target ideal amplitude mask and the estimated data. Here, the target ideal amplitude mask may be an energy ratio of the target signal to the mixed signal.
Fig. 7 is a block diagram of a model training apparatus according to an embodiment of the present disclosure. Referring to fig. 7, a model training apparatus 700 may include a data generation module 701 and a data training module 702. Each module in model training apparatus 700 may be implemented by one or more modules, and the name of the corresponding module may vary depending on the type of module. In various embodiments, some modules in model training apparatus 700 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus functions of the respective modules/elements prior to combination may be equivalently performed.
In the model design stage, the training data is divided into three categories, which distinguishes this approach from single-purpose speech enhancement and speech noise reduction. Accordingly, the mixed signal has three input sources, e.g., speech (which requires enhancement), music (an audio type that requires neither enhancement nor suppression), and noise (which requires suppression).
The data generation module 701 may generate the mixed signal and the target signal based on at least one of a voice signal, a noise signal, and a specific signal. Specifically, the data generation module 701 may multiply a specific signal by a first gain to obtain a first signal and multiply a noise signal by a second gain to obtain a second signal, and generate a mixed signal by mixing the first signal, the second signal, and the voice signal. For example, a mixed signal as shown in equation (1) may be generated.
The data generation module 701 may multiply the voice signal by a third gain to obtain a third signal, and generate the target signal by mixing the third signal and the first signal. For example, a target signal may be generated as shown in equation (2).
Here, the first gain may be determined based on a first predetermined signal-to-noise ratio, and the second gain may be determined based on the second signal-to-noise ratio and the first gain. Through the design, the generated mixed signal and target signal serving as training data are more in line with the actual application requirements. For example, equations (3) and (4) may be utilized to determine gain values for different signals.
The data training module 702 may input the mixed signal into a speech processing model (such as a deep neural network) to obtain estimated data, determine a loss function based on the target signal and the estimated data, and train the speech processing model based on the loss function to adjust parameters of the speech processing model.
According to embodiments of the present disclosure, different training data may be used to obtain different speech processing models. Assuming the trained model outputs the target signal, the data training module 702 inputs the time-frequency-domain mixed signal Mix(n,k) into the deep neural network DNN and obtains the estimated target signal Tar_est(n,k) from the DNN. A loss function is constructed based on the target signal Tar(n,k) and the estimated target signal Tar_est(n,k), the DNN is iteratively optimized based on the loss function, and training is completed upon convergence to obtain the speech processing model.
Assuming the trained model outputs an ideal amplitude mask, the data training module 702 may calculate the target ideal amplitude mask based on the target signal Tar(n,k) and the mixed signal Mix(n,k), input the time-frequency-domain mixed signal Mix(n,k) into the deep neural network DNN, and obtain the estimated ideal amplitude mask IRM_est(n,k) from the DNN. The loss function is then determined based on the target ideal amplitude mask IRM_obj(n,k) and the estimated ideal amplitude mask IRM_est(n,k). Here, the target ideal amplitude mask may be an energy ratio of the target signal to the mixed signal.
According to embodiments of the present disclosure, an electronic device may be provided. Fig. 8 is a block diagram of an electronic device according to an embodiment of the present disclosure, the electronic device 800 may include at least one memory 802 and at least one processor 801, the at least one memory 802 storing a set of computer-executable instructions that, when executed by the at least one processor 801, perform a speech processing method or training method of a speech processing model according to an embodiment of the present disclosure.
Processor 801 may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 801 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The memory 802, which is a storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, a video playback parameter determination program, and a database.
The memory 802 may be integrated with the processor 801, for example, RAM or flash memory may be disposed within an integrated circuit microprocessor or the like. In addition, the memory 802 may include a stand-alone device, such as an external disk drive, a storage array, or other storage device usable by any database system. The memory and the processor may be operatively coupled or may communicate with each other, for example, through an I/O port, a network connection, etc., such that the processor is able to read files stored in the memory.
In addition, the electronic device 800 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 800 may be connected to each other via a bus and/or a network.
By way of example, electronic device 800 may be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the above-described set of instructions. Here, the electronic device 800 is not necessarily a single electronic device, but may be any apparatus or a collection of circuits capable of executing the above-described instructions (or instruction set) individually or in combination. The electronic device 800 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with either locally or remotely (e.g., via wireless transmission).
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is not limiting and may include more or fewer components than shown, or certain components may be combined, or a different arrangement of components.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform a speech processing method or a training method of a speech processing model according to the present disclosure. Examples of the computer readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, nonvolatile memory, CD-ROM, CD-R, CD + R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD + R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, blu-ray or optical disk storage, hard Disk Drives (HDD), solid State Disks (SSD), card-type memories (such as multimedia cards, secure Digital (SD) cards or ultra-fast digital (XD) cards), magnetic tapes, floppy disks, magneto-optical data storage devices, hard disks, solid state disks, and any other devices configured to store computer programs and any associated data, data files and data structures in a non-transitory manner and to provide the computer programs and any associated data, data files and data structures to a processor or computer to enable the processor or computer to execute the programs. The computer programs in the computer readable storage media described above can be run in an environment deployed in a computer device, such as a client, host, proxy device, server, etc., and further, in one example, the computer programs and any associated data, data files, and data structures are distributed across networked computer systems such that the computer programs and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
In accordance with embodiments of the present disclosure, a computer program product may also be provided, in which instructions are executable by a processor of a computer device to perform the above-described speech processing method or the training method of a speech processing model.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (23)

1. A method of training a speech processing model, the method comprising:
multiplying the specific signal by a first gain to obtain a first signal and multiplying the noise signal by a second gain to obtain a second signal; wherein the specific signal belongs to an audio type that does not need to be enhanced or suppressed, the noise signal belongs to an audio type that needs to be suppressed, the first gain is determined based on a first predetermined signal-to-noise ratio, and the second gain is determined based on a second signal-to-noise ratio and the first gain, wherein the first predetermined signal-to-noise ratio is an energy ratio between the speech signal and the first signal, and the second signal-to-noise ratio is an energy ratio of the speech signal plus the first signal to the second signal;
generating a mixed signal by mixing the first signal, the second signal, and the speech signal; wherein the speech signal belongs to an audio type that needs to be enhanced;
multiplying the speech signal by a third gain to obtain a third signal;
generating a target signal by mixing the third signal and the first signal;
inputting the mixed signal into a speech processing model to obtain estimated data;
determining a loss function based on the target signal and the estimated data;
training the speech processing model based on the loss function to adjust parameters of the speech processing model.
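For illustration only (not part of the claims): the Python sketch below shows one way the mixing of claim 1 could be realized. It assumes the two signal-to-noise ratios are given in dB and that the gains are derived as square roots of energy ratios; the names (mix_training_example, snr1_db, snr2_db, gain3) are hypothetical, since the claim only states that the gains are determined based on the signal-to-noise ratios.

```python
import numpy as np

def mix_training_example(speech, specific, noise, snr1_db, snr2_db, gain3):
    """Illustrative construction of one training pair (mixed input, target).

    speech:   signal of the audio type to be enhanced
    specific: signal of the audio type to be neither enhanced nor suppressed
    noise:    signal of the audio type to be suppressed
    snr1_db:  first predetermined SNR, energy ratio of speech to the first signal
    snr2_db:  second SNR, energy ratio of (speech + first signal) to the second signal
    gain3:    third gain applied to the speech when building the target
    """
    eps = 1e-12

    def energy(x):
        return np.sum(x ** 2) + eps

    # first gain: scale the specific signal so that E(speech) / E(first) matches snr1_db
    g1 = np.sqrt(energy(speech) / (energy(specific) * 10 ** (snr1_db / 10.0)))
    first = g1 * specific

    # second gain: scale the noise so that E(speech + first) / E(second) matches snr2_db
    g2 = np.sqrt(energy(speech + first) / (energy(noise) * 10 ** (snr2_db / 10.0)))
    second = g2 * noise

    mixed = speech + first + second   # model input (claim 1, mixing step)
    target = gain3 * speech + first   # training target: scaled speech plus untouched specific signal
    return mixed, target
```

Note that the first signal appears unchanged in both the mixed input and the target, which is what drives the model to pass the specific signal through rather than suppress or enhance it.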
2. The method of claim 1, wherein the estimated data is an estimated target signal or an estimated ideal amplitude mask,
wherein the ideal amplitude mask is related to signal energy.
3. The method of claim 1, wherein, in a case where the estimated data is an estimated ideal amplitude mask, the step of determining a loss function based on the target signal and the estimated data comprises:
calculating a target ideal amplitude mask based on the target signal and the mixed signal; and
determining a loss function based on the target ideal amplitude mask and the estimated data.
4. The method of claim 3, wherein the target ideal amplitude mask is an amplitude ratio of the target signal to the mixed signal in the time-frequency domain.
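As a non-authoritative sketch of claims 3 and 4: the target ideal amplitude mask can be computed as the amplitude ratio of the target signal to the mixed signal in the time-frequency domain, and one possible loss is the mean squared error between the estimated mask and this target mask. The STFT parameters (fs, nperseg) and the MSE choice are assumptions not fixed by the claims.

```python
import numpy as np
from scipy.signal import stft

def target_ideal_amplitude_mask(target, mixed, fs=16000, nperseg=512):
    """Target IAM: amplitude ratio of target to mixed signal per time-frequency bin."""
    _, _, T = stft(target, fs=fs, nperseg=nperseg)
    _, _, M = stft(mixed, fs=fs, nperseg=nperseg)
    return np.abs(T) / (np.abs(M) + 1e-8)

def mask_loss(estimated_mask, target_mask):
    """One possible loss: mean squared error between estimated and target masks."""
    return np.mean((estimated_mask - target_mask) ** 2)
```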
5. A method of speech processing, the method comprising:
acquiring an audio signal, wherein the audio signal comprises at least one of a speech signal, a noise signal, and a specific signal, wherein the speech signal belongs to an audio type that needs to be enhanced, the noise signal belongs to an audio type that needs to be suppressed, and the specific signal belongs to an audio type that does not need to be enhanced or suppressed;
obtaining an ideal amplitude mask based on the audio signal using a speech processing model trained by the training method of any one of claims 1-4; and
processing the audio signal according to a result of comparing the ideal amplitude mask with a predetermined threshold to obtain a desired signal.
6. The method of claim 5, wherein the step of processing the audio signal according to the result of comparing the ideal amplitude mask with the predetermined threshold to obtain the desired signal comprises:
determining, according to the result of the comparison, whether to obtain the desired signal based on an estimated signal obtained by multiplying the audio signal by the ideal amplitude mask.
7. The method of claim 6, wherein determining whether to obtain the desired signal based on the estimated signal obtained by multiplying the audio signal by the ideal amplitude mask comprises: if the ideal amplitude mask is greater than the predetermined threshold, multiplying the estimated signal by a user-defined gain to obtain the desired signal; otherwise, taking the audio signal as the desired signal.
8. The method of claim 6, wherein determining whether to obtain the desired signal based on the estimated signal obtained by multiplying the audio signal by the ideal amplitude mask comprises: if the ideal amplitude mask is less than the predetermined threshold, taking the estimated signal as the desired signal; otherwise, taking the audio signal as the desired signal.
9. The method of claim 6, wherein determining whether to obtain the desired signal based on the estimated signal obtained by multiplying the audio signal by the ideal amplitude mask comprises: if the ideal amplitude mask is greater than the predetermined threshold, multiplying the estimated signal by a user-defined gain to obtain the desired signal; if the ideal amplitude mask is less than the predetermined threshold, taking the estimated signal as the desired signal; otherwise, taking the audio signal as the desired signal.
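Claims 7-9 can be read as a decision applied to the mask; the sketch below assumes the comparison is made per time-frequency bin (the claims do not say whether it is per bin, per frame, or global) and uses hypothetical names (apply_mask_with_threshold, user_gain).

```python
import numpy as np

def apply_mask_with_threshold(audio_tf, mask, threshold, user_gain=1.5):
    """Illustrative per-bin decision combining the three branches of claim 9.

    audio_tf:  complex STFT of the input audio signal
    mask:      ideal amplitude mask estimated by the model (same shape as audio_tf)
    threshold: the predetermined threshold
    user_gain: the user-defined gain applied where the mask exceeds the threshold
    """
    estimated = audio_tf * mask  # estimated signal = audio signal x ideal amplitude mask
    desired = np.where(
        mask > threshold, user_gain * estimated,           # mask above threshold: apply user-defined gain
        np.where(mask < threshold, estimated, audio_tf))   # below: keep masked estimate; equal: pass audio through
    return desired
```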
10. The method of claim 5, wherein the output of the speech processing model is the ideal amplitude mask or an estimated target signal,
wherein, in a case where the output of the speech processing model is the estimated target signal, the step of obtaining the ideal amplitude mask comprises:
obtaining the estimated target signal by inputting the audio signal into the speech processing model; and
obtaining the ideal amplitude mask based on the estimated target signal and the audio signal.
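If the model outputs an estimated target signal rather than a mask (claim 10), the ideal amplitude mask can then be recovered from it. Mirroring the amplitude-ratio form of claim 4 is one plausible reading, shown below as an assumption; the claim only says the mask is obtained based on the two signals.

```python
import numpy as np

def mask_from_estimated_target(estimated_target_tf, audio_tf, eps=1e-8):
    """Recover an ideal amplitude mask from an estimated target signal (illustrative)."""
    return np.abs(estimated_target_tf) / (np.abs(audio_tf) + eps)
```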
11. A training apparatus for a speech processing model, the apparatus comprising:
a data generation module configured to: multiply the specific signal by a first gain to obtain a first signal and multiply the noise signal by a second gain to obtain a second signal, wherein the specific signal belongs to an audio type that does not need to be enhanced or suppressed, the noise signal belongs to an audio type that needs to be suppressed, the first gain is determined based on a first predetermined signal-to-noise ratio, and the second gain is determined based on a second signal-to-noise ratio and the first gain, wherein the first predetermined signal-to-noise ratio is an energy ratio between the speech signal and the first signal, and the second signal-to-noise ratio is an energy ratio of the speech signal plus the first signal to the second signal; generate a mixed signal by mixing the first signal, the second signal, and the speech signal, wherein the speech signal belongs to an audio type that needs to be enhanced; multiply the speech signal by a third gain to obtain a third signal; and generate a target signal by mixing the third signal and the first signal; and
a data training module configured to: input the mixed signal into the speech processing model to obtain estimated data; determine a loss function based on the target signal and the estimated data; and train the speech processing model based on the loss function to adjust parameters of the speech processing model.
12. The apparatus of claim 11, wherein the estimated data is an estimated target signal or an estimated ideal amplitude mask, wherein the ideal amplitude mask is related to signal energy.
13. The apparatus of claim 11, wherein, in the case where the estimated data is an estimated ideal amplitude mask, the data training module is configured to:
calculate a target ideal amplitude mask based on the target signal and the mixed signal; and
determine a loss function based on the target ideal amplitude mask and the estimated data.
14. The apparatus of claim 13, wherein the target ideal amplitude mask is an amplitude ratio of the target signal to the mixed signal in a time-frequency domain.
15. A speech processing apparatus, the apparatus comprising:
a data acquisition module configured to acquire an audio signal, wherein the audio signal includes at least one of a speech signal, a noise signal, and a specific signal, wherein the speech signal belongs to an audio type that needs to be enhanced, the noise signal belongs to an audio type that needs to be suppressed, and the specific signal belongs to an audio type that does not need to be enhanced or suppressed; and
a data processing module configured to:
obtain an ideal amplitude mask based on the audio signal using a speech processing model trained by the training method of any one of claims 1-4; and
process the audio signal according to a result of comparing the ideal amplitude mask with a predetermined threshold to obtain a desired signal.
16. The apparatus of claim 15, wherein the data processing module is configured to:
determine, according to the result of comparing the ideal amplitude mask with the predetermined threshold, whether to obtain the desired signal based on an estimated signal obtained by multiplying the audio signal by the ideal amplitude mask.
17. The apparatus of claim 16, wherein the data processing module is configured to:
if the ideal amplitude mask is greater than the predetermined threshold, multiply the estimated signal by a user-defined gain to obtain the desired signal; otherwise, take the audio signal as the desired signal.
18. The apparatus of claim 16, wherein the data processing module is configured to:
if the ideal amplitude mask is less than the predetermined threshold, take the estimated signal as the desired signal; otherwise, take the audio signal as the desired signal.
19. The apparatus of claim 16, wherein the data processing module is configured to:
if the ideal amplitude mask is greater than the predetermined threshold, multiply the estimated signal by a user-defined gain to obtain the desired signal;
if the ideal amplitude mask is less than the predetermined threshold, take the estimated signal as the desired signal;
otherwise, take the audio signal as the desired signal.
20. The apparatus of claim 15, wherein the output of the speech processing model is the ideal amplitude mask or an estimated target signal,
wherein, in a case where the output of the speech processing model is the estimated target signal, the data processing module is configured to:
obtain the estimated target signal by inputting the audio signal into the speech processing model; and
obtain the ideal amplitude mask based on the estimated target signal and the audio signal.
21. An electronic device, comprising:
At least one processor;
At least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the training method of any of claims 1 to 4 and the speech processing method of any of claims 5 to 10.
22. A computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the training method of any one of claims 1 to 4 and the speech processing method of any one of claims 5 to 10.
23. A computer program product having instructions that are executable by at least one processor of an electronic device to perform the training method of any one of claims 1 to 4 and the speech processing method of any one of claims 5 to 10.
CN202011330109.3A 2020-11-24 2020-11-24 Voice processing model training method and device and voice processing method and device Active CN112309426B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011330109.3A CN112309426B (en) 2020-11-24 2020-11-24 Voice processing model training method and device and voice processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011330109.3A CN112309426B (en) 2020-11-24 2020-11-24 Voice processing model training method and device and voice processing method and device

Publications (2)

Publication Number Publication Date
CN112309426A CN112309426A (en) 2021-02-02
CN112309426B true CN112309426B (en) 2024-07-12

Family

ID=74335596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011330109.3A Active CN112309426B (en) 2020-11-24 2020-11-24 Voice processing model training method and device and voice processing method and device

Country Status (1)

Country Link
CN (1) CN112309426B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992168B (en) * 2021-02-26 2024-04-19 平安科技(深圳)有限公司 Speech noise reducer training method, device, computer equipment and storage medium
CN113035221B (en) * 2021-02-26 2023-12-19 北京达佳互联信息技术有限公司 Training method and device for voice processing model and voice processing method and device
CN113192536B (en) * 2021-04-28 2023-07-28 北京达佳互联信息技术有限公司 Training method of voice quality detection model, voice quality detection method and device
CN113470124B (en) * 2021-06-30 2023-09-22 北京达佳互联信息技术有限公司 Training method and device for special effect model, and special effect generation method and device
CN113990343A (en) * 2021-11-18 2022-01-28 北京达佳互联信息技术有限公司 Training method and device of voice noise reduction model and voice noise reduction method and device
CN114121031A (en) * 2021-12-08 2022-03-01 思必驰科技股份有限公司 Device voice noise reduction, electronic device, and storage medium
CN114974277A (en) * 2022-03-07 2022-08-30 云知声智能科技股份有限公司 Training method of voice noise reduction model, voice noise reduction method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108615535A (en) * 2018-05-07 2018-10-02 腾讯科技(深圳)有限公司 Sound enhancement method, device, intelligent sound equipment and computer equipment
CN111933164A (en) * 2020-06-29 2020-11-13 北京百度网讯科技有限公司 Training method and device of voice processing model, electronic equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000231399A (en) * 1999-02-10 2000-08-22 Oki Electric Ind Co Ltd Noise reducing device
CN101154383B (en) * 2006-09-29 2010-10-06 株式会社东芝 Method and device for noise suppression, phonetic feature extraction, speech recognition and training voice model
US8639516B2 (en) * 2010-06-04 2014-01-28 Apple Inc. User-specific noise suppression for voice quality improvements
US9633671B2 (en) * 2013-10-18 2017-04-25 Apple Inc. Voice quality enhancement techniques, speech recognition techniques, and related systems
CN108346433A (en) * 2017-12-28 2018-07-31 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN108831500B (en) * 2018-05-29 2023-04-28 平安科技(深圳)有限公司 Speech enhancement method, device, computer equipment and storage medium
KR102085739B1 (en) * 2018-10-29 2020-03-06 광주과학기술원 Speech enhancement method
CN110956957B (en) * 2019-12-23 2022-05-17 思必驰科技股份有限公司 Training method and system of speech enhancement model
CN111554321B (en) * 2020-04-20 2023-12-05 北京达佳互联信息技术有限公司 Noise reduction model training method and device, electronic equipment and storage medium
CN111627458B (en) * 2020-05-27 2023-11-17 北京声智科技有限公司 Sound source separation method and equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108615535A (en) * 2018-05-07 2018-10-02 腾讯科技(深圳)有限公司 Sound enhancement method, device, intelligent sound equipment and computer equipment
CN111933164A (en) * 2020-06-29 2020-11-13 北京百度网讯科技有限公司 Training method and device of voice processing model, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112309426A (en) 2021-02-02

Similar Documents

Publication Publication Date Title
CN112309426B (en) Voice processing model training method and device and voice processing method and device
CN113436643B (en) Training and application method, device and equipment of voice enhancement model and storage medium
CN110265064B (en) Audio frequency crackle detection method, device and storage medium
US10262680B2 (en) Variable sound decomposition masks
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
CN113314147B (en) Training method and device of audio processing model, audio processing method and device
CN113241088A (en) Training method and device of voice enhancement model and voice enhancement method and device
US20140358534A1 (en) General Sound Decomposition Models
US20230267947A1 (en) Noise reduction using machine learning
CN112712816B (en) Training method and device for voice processing model and voice processing method and device
JP2019078864A (en) Musical sound emphasis device, convolution auto encoder learning device, musical sound emphasis method, and program
CN113284507A (en) Training method and device of voice enhancement model and voice enhancement method and device
US9601124B2 (en) Acoustic matching and splicing of sound tracks
CN113921022A (en) Audio signal separation method, device, storage medium and electronic equipment
CN114283833A (en) Speech enhancement model training method, speech enhancement method, related device and medium
US11393443B2 (en) Apparatuses and methods for creating noise environment noisy data and eliminating noise
US9318106B2 (en) Joint sound model generation techniques
CN113035221B (en) Training method and device for voice processing model and voice processing method and device
US20190385590A1 (en) Generating device, generating method, and non-transitory computer readable storage medium
CN112652290A (en) Method for generating reverberation audio signal and training method of audio processing model
CN106847299B (en) Time delay estimation method and device
CN113707163B (en) Speech processing method and device and model training method and device
WO2023093029A1 (en) Wake-up word energy calculation method and system, and voice wake-up system and storage medium
US20140140519A1 (en) Sound processing device, sound processing method, and program
Li et al. Robust Non‐negative matrix factorization with β‐divergence for speech separation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant