
CN112309426B - Voice processing model training method and device and voice processing method and device

Info

Publication number: CN112309426B
Application number: CN202011330109.3A
Authority: CN (China)
Prior art keywords: signal, estimated, amplitude mask, ideal amplitude, audio
Legal status: Active (application granted)
Other languages: Chinese (zh)
Other versions: CN112309426A
Inventors: 郑羲光, 李楠, 任新蕾, 张晨
Current Assignee: Beijing Dajia Internet Information Technology Co Ltd
Original Assignee: Beijing Dajia Internet Information Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0324 Details of processing therefor
    • G10L21/034 Automatic adjustment
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/21 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being power information
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Control Of Amplification And Gain Control (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The disclosure provides a training method and apparatus for a speech processing model, and a speech processing method and apparatus. The training method comprises the following steps: generating a mixed signal and a target signal based on at least one of a speech signal, a noise signal, and a specific signal; inputting the mixed signal into the speech processing model to obtain estimated data; determining a loss function based on the target signal and the estimated data; and training the speech processing model based on the loss function to adjust parameters of the speech processing model.

Description

Voice processing model training method and device and voice processing method and device
Technical Field
The disclosure relates to the technical field of audio, and in particular relates to a training method and device of a speech processing model, and a speech processing method and device.
Background
With the rapid development of electronic technology and network technology, electronic devices can process audio signals in a time-frequency domain based on a voice processing algorithm of a neural network.
Although neural network-based speech enhancement and noise reduction have achieved performance beyond conventional signal processing methods and can operate efficiently on electronic devices, for the problems of speech enhancement (increasing the speech component while keeping the non-speech components unchanged) and speech denoising (reducing the non-speech components while keeping the speech component unchanged), two separate neural networks are typically trained to achieve enhancement and denoising respectively. In addition, when speech is processed with these two types of neural networks, one type of signal is always amplified or attenuated while the other type is kept unchanged.
Disclosure of Invention
The present disclosure provides a training method and apparatus for a speech processing model, and a speech processing method and apparatus, to at least solve the problem of performing speech enhancement and denoising simultaneously with a single neural network.
According to a first aspect of embodiments of the present disclosure, there is provided a training method of a speech processing model, the method may include: generating a mixed signal and a target signal based on at least one of a speech signal, a noise signal, and a specific signal; inputting the mixed signal into the speech processing model to obtain estimated data; determining a loss function based on the target signal and the estimated data; and training the speech processing model based on the loss function to adjust parameters of the speech processing model.
Alternatively, the step of generating the mixed signal based on at least one of the voice signal, the noise signal, and the specific signal may include: multiplying the specific signal by a first gain to obtain a first signal and multiplying the noise signal by a second gain to obtain a second signal; the mixed signal is generated by mixing the first signal, the second signal and the speech signal.
Alternatively, the first gain may be determined based on a first predetermined signal-to-noise ratio and the second gain may be determined based on a second signal-to-noise ratio and the first gain.
Alternatively, the step of generating the target signal based on at least one of the voice signal, the noise signal, and the specific signal may include: multiplying the speech signal by a third gain to obtain a third signal; the target signal is generated by mixing the third signal and the first signal.
Alternatively, the estimated data may be an estimated target signal or an estimated ideal amplitude mask, wherein the ideal amplitude mask is related to the signal energy.
Optionally, in the case that the estimated data is an estimated ideal amplitude mask, the step of determining a loss function based on the target signal and the estimated data may comprise: calculating a target ideal amplitude mask based on the target signal and the mixed signal; a loss function is determined based on the target ideal amplitude mask and the estimated data.
Alternatively, the target ideal amplitude mask may be an amplitude ratio of the target signal and the mixed signal in a time-frequency domain.
According to a second aspect of embodiments of the present disclosure, there is provided a speech processing method, the method may comprise: acquiring an audio signal, wherein the audio signal comprises at least one of a speech signal, a noise signal, and a specific signal, the specific signal belonging to an audio type that needs to be neither enhanced nor suppressed; obtaining an ideal amplitude mask based on the audio signal using a speech processing model; and processing the audio signal differently according to the magnitude of the ideal amplitude mask to obtain a desired signal.
Alternatively, the speech processing model may be trained by the training method described above.
Optionally, the step of differently processing the audio signal to obtain the desired signal according to the size of the ideal amplitude mask may include: the desired signal is obtained by comparing the ideal amplitude mask with a predetermined threshold value to determine whether to obtain the desired signal based on an estimated signal resulting from multiplication of the audio signal with the ideal amplitude mask.
Optionally, the step of determining whether to obtain the desired signal based on an estimated signal resulting from multiplication of the audio signal with the ideal amplitude mask may comprise: multiplying the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; otherwise, using the audio signal as the desired signal.
Optionally, the step of determining whether to obtain the desired signal based on an estimated signal resulting from multiplication of the audio signal with the ideal amplitude mask may comprise: taking the estimated signal as the desired signal if the ideal amplitude mask is less than the predetermined threshold; otherwise, using the audio signal as the desired signal.
Optionally, the step of determining whether to obtain the desired signal based on an estimated signal resulting from multiplication of the audio signal with the ideal amplitude mask may comprise: multiplying the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; taking the estimated signal as the desired signal if the ideal amplitude mask is less than the predetermined threshold; otherwise, using the audio signal as the desired signal.
Optionally, the output of the speech processing model is the ideal amplitude mask or the estimated target signal, wherein the step of obtaining the ideal amplitude mask may comprise, in case the output of the speech processing model is the estimated target signal: obtaining an estimated target signal by applying the audio signal to a speech processing model; the ideal amplitude mask is obtained based on the estimated target signal and the audio signal.
According to a third aspect of embodiments of the present disclosure, there is provided a training apparatus for a speech processing model, the apparatus may include: a data generation module configured to: generate a mixed signal and a target signal based on at least one of a speech signal, a noise signal, and a specific signal; and a data training module configured to: input the mixed signal into the speech processing model to obtain estimated data; determine a loss function based on the target signal and the estimated data; and train the speech processing model based on the loss function to adjust parameters of the speech processing model.
Alternatively, the data generation module may be configured to: multiplying the specific signal by a first gain to obtain a first signal and multiplying the noise signal by a second gain to obtain a second signal; and generating the mixed signal by mixing the first signal, the second signal and the speech signal.
Alternatively, the first gain may be determined based on a first predetermined signal-to-noise ratio and the second gain may be determined based on a second signal-to-noise ratio and the first gain.
Alternatively, the data generation module may be configured to: multiplying the speech signal by a third gain to obtain a third signal; the target signal is generated by mixing the third signal and the first signal.
Alternatively, the estimated data may be an estimated target signal or an estimated ideal amplitude mask, wherein the ideal amplitude mask may be related to signal energy.
Optionally, in the case where the estimated data is an estimated ideal amplitude mask, the data training module may be configured to: calculating a target ideal amplitude mask based on the target signal and the mixed signal; a loss function is determined based on the target ideal amplitude mask and the estimated data.
Alternatively, the target ideal amplitude mask may be an amplitude ratio of the target signal and the mixed signal in a time-frequency domain.
According to a fourth aspect of embodiments of the present disclosure, there is provided a speech processing apparatus, the apparatus may comprise: a data acquisition module configured to acquire an audio signal, wherein the audio signal includes at least one of a speech signal, a noise signal, and a specific signal belonging to an audio type that needs to be neither enhanced nor suppressed; and a data processing module configured to: obtain an ideal amplitude mask based on the audio signal using a speech processing model; and process the audio signal differently according to the magnitude of the ideal amplitude mask to obtain a desired signal.
Alternatively, the data processing module may be configured to: the desired signal is obtained by comparing the ideal amplitude mask with a predetermined threshold value to determine whether to obtain the desired signal based on an estimated signal resulting from multiplication of the audio signal with the ideal amplitude mask.
Alternatively, the data processing module may be configured to: multiply the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; otherwise, use the audio signal as the desired signal.
Alternatively, the data processing module may be configured to: take the estimated signal as the desired signal if the ideal amplitude mask is less than the predetermined threshold; otherwise, use the audio signal as the desired signal.
Alternatively, the data processing module may be configured to: multiply the estimated signal by a user-defined gain to obtain the desired signal if the ideal amplitude mask is greater than the predetermined threshold; take the estimated signal as the desired signal if the ideal amplitude mask is less than the predetermined threshold; otherwise, use the audio signal as the desired signal.
Alternatively, the output of the speech processing model may be the ideal amplitude mask or the estimated target signal, wherein, in case the output of the speech processing model is the estimated target signal, the data processing module may be configured to: obtaining an estimated target signal by applying the audio signal to a speech processing model; the ideal amplitude mask is obtained based on the estimated target signal and the audio signal.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, which may include: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a speech processing method and a model training method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the speech processing method and the model training method as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product, instructions in which are executed by at least one processor in an electronic device to perform the speech processing method and model training method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
Speech enhancement and denoising are integrated into a single deep neural network for training, and speech enhancement and denoising can then be performed separately, or simultaneously, through post-processing based on the ideal amplitude mask (IRM). In addition, the training targets are divided into three categories in the model design (i.e., speech (requiring enhancement), noise (requiring suppression), and other audio such as music (requiring neither enhancement nor suppression)). A speech processing model trained with such training data differs from a model for single-purpose speech enhancement or speech noise reduction, so that the model better matches practical application requirements and speech processing can be performed more efficiently.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
FIG. 1 is a flow chart of a speech processing method according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a method of training a speech processing model according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a training speech processing model according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a trained speech processing model according to another embodiment of the present disclosure;
FIG. 5 is a flow diagram of a speech processing method according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of a model training apparatus according to an embodiment of the present disclosure;
fig. 8 is a block diagram of an electronic device according to an embodiment of the present disclosure.
Throughout the drawings, it should be noted that the same reference numerals are used to designate the same or similar elements, features and structures.
Detailed Description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure defined by the claims and their equivalents. Various specific details are included to aid understanding, but are merely to be considered exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The embodiments described in the examples below are not representative of all embodiments consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
It should be noted that, in this disclosure, "at least one of the items" covers three parallel cases: "any one of the items", "any combination of the items", and "all of the items". For example, "including at least one of A and B" covers the following three parallel cases: (1) including A; (2) including B; (3) including A and B. Likewise, "at least one of the first and second steps is executed" covers the following three parallel cases: (1) executing step one; (2) executing step two; (3) executing step one and step two.
In related speech enhancement and denoising applications, using two separate neural networks, one for speech enhancement and one for noise cancellation, doubles the complexity, which is disadvantageous for applications running on electronic devices. Accordingly, the present disclosure proposes a method that performs speech enhancement and denoising simultaneously with a single neural network, that is, noise suppression and speech enhancement are guaranteed at the same time.
In addition, a new category is introduced in the model design, namely audio types such as music that should be neither amplified nor attenuated, which makes the model design better match practical application requirements. Thus, the speech processing model of the present disclosure can simultaneously ensure noise suppression, speech enhancement, and unchanged loudness of other types of sound.
Hereinafter, according to various embodiments of the present disclosure, the method, apparatus, and system of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of a speech processing method according to an embodiment of the present disclosure. The speech processing method shown in fig. 1 may be executed at a network end connected to the electronic device or locally at the electronic device.
The electronic device may be any electronic device having the functions of speech/text reception, speech processing, and executing commands. In exemplary embodiments of the present disclosure, the electronic apparatus may include, for example, but not limited to, a portable communication device (e.g., a smart phone), a computer device, a portable multimedia device, a portable medical device, a camera, a wearable device, or a server, etc. According to the embodiments of the present disclosure, the electronic device is not limited to the above.
Referring to fig. 1, in step S101, an audio signal is acquired. Because a new audio class, i.e., a type that should be neither amplified nor attenuated (such as a music signal), is introduced during the model training phase of the present disclosure, the speech processing model of the present disclosure is not used purely for speech enhancement and speech noise reduction. Thus, the present disclosure can perform speech processing on multiple types of signals. For example, the audio signal may include at least one of a speech signal, a noise signal, and a specific signal. Here, the specific signal belongs to an audio type that needs to be neither enhanced nor suppressed. For example, the specific signal is a music signal. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
In step S102, an ideal amplitude mask (IRM) is obtained using a speech processing model based on the acquired audio signal. How the speech processing model is obtained will be described in detail below with reference to fig. 2. Fig. 2 is a flow chart of a method of training a speech processing model according to an embodiment of the present disclosure. The execution subject of the model training method provided in the embodiments of the present disclosure may be the model training apparatus provided in the embodiments of the present disclosure, or may be an electronic device including the model training apparatus. This may be determined according to actual use requirements, and embodiments of the present disclosure are not limited in this respect.
Referring to fig. 2, in step S201, a mixed signal and a target signal are generated based on at least one of a speech signal, a noise signal, and a specific signal, where the specific signal belongs to an audio type that needs to be neither enhanced nor suppressed. For example, the specific signal may be a music signal. According to an embodiment of the present disclosure, other types of signals may also be used in generating the mixed signal and the target signal; that is, the training data is not limited to the three categories above and may include more types of audio signals.
As an example, three data sources may be included in the mixed signal, such as a speech signal S (t), a specific signal M (t), and a noise signal N (t). Where t represents time. The speech signal S (t) may refer to a signal requiring enhancement, the specific signal M (t) may refer to an audio type requiring neither enhancement nor suppression, and the noise signal N (t) may refer to a signal requiring suppression.
In generating the mixed signal, the specific signal M (t) may be multiplied by a first gain to obtain a first signal and the noise signal N (t) may be multiplied by a second gain to obtain a second signal, and then the mixed signal may be generated by mixing the first signal, the second signal, and the voice signal S (t). For example, the mixed signal may be represented by the following equation (1):
Mix(t) = S(t) + M(t)*g_SNR1 + N(t)*g_SNR2 (1)
where Mix(t) is the mixed signal, g_SNR1 is the first gain, and g_SNR2 is the second gain.
In generating the target signal, the speech signal S(t) may be multiplied by a third gain to obtain a third signal, and then the target signal may be generated by mixing the third signal and the first signal. For example, the target signal may be represented by the following equation (2):
Tar(t) = S(t)*g_tar + M(t)*g_SNR1 (2)
where Tar(t) is the target signal and g_tar is the third gain. Here, the third gain may be a target voice amplification gain.
According to the embodiment of the disclosure, the first gain, the second gain and the third gain can be determined according to the preset signal-to-noise ratio, so that the generated mixed signal and the target signal are more in line with the actual situation, and the trained voice processing model is more accurate. The third gain may be adjusted by the user according to actual needs, or may be a predetermined value, to which the present disclosure is not limited.
As an example, the first gain may be determined based on a first predetermined signal-to-noise ratio, and the second gain may be determined based on a second signal-to-noise ratio and the first gain, for example according to equations (3) and (4), where TargetSNR1 is the first predetermined signal-to-noise ratio and TargetSNR2 is the second signal-to-noise ratio. TargetSNR1 is the energy ratio between the speech signal and the specific signal, and TargetSNR2 is the energy ratio of the speech signal plus the specific signal to the noise signal. The above examples are merely exemplary, and the present disclosure is not limited thereto. Alternatively, different signal-to-noise ratios may be set according to actual requirements.
In addition, if the training data includes other types of audio signals in addition to the above speech signal, noise signal, and specific signal when generating the mixed signal and the target signal, the signal types can be distinguished by applying a different target gain to each type of signal so as to satisfy the signal-to-noise ratios actually required. A sketch of this data generation procedure is given below.
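As an illustration only, the following Python sketch shows one plausible way to generate the mixed signal of equation (1) and the target signal of equation (2). It assumes that the predetermined signal-to-noise ratios are energy ratios expressed in dB and that the gains play the role of g_SNR1 and g_SNR2 in equations (3) and (4), which are not reproduced above; the function and variable names (e.g., make_training_pair, g_tar) are illustrative and do not appear in the original disclosure.

```python
import numpy as np

def energy(x):
    """Average energy of a time-domain signal."""
    return np.mean(x ** 2) + 1e-12

def make_training_pair(s, m, n, target_snr1_db, target_snr2_db, g_tar=2.0):
    """Build one (mixed, target) training pair from speech s, specific signal m
    (e.g., music) and noise n, following equations (1) and (2).

    Assumption: TargetSNR1/TargetSNR2 are energy ratios given in dB."""
    # First gain: scale the specific signal so that the speech-to-specific
    # energy ratio equals TargetSNR1.
    g_snr1 = np.sqrt(energy(s) / (energy(m) * 10 ** (target_snr1_db / 10)))
    first = m * g_snr1

    # Second gain: scale the noise so that the (speech + specific)-to-noise
    # energy ratio equals TargetSNR2.
    g_snr2 = np.sqrt(energy(s + first) / (energy(n) * 10 ** (target_snr2_db / 10)))
    second = n * g_snr2

    mixed = s + first + second     # equation (1)
    target = s * g_tar + first     # equation (2): speech amplified, specific signal kept
    return mixed, target

# Usage: one second of dummy 16 kHz signals.
rng = np.random.default_rng(0)
s, m, n = (rng.standard_normal(16000) for _ in range(3))
mixed, target = make_training_pair(s, m, n, target_snr1_db=5.0, target_snr2_db=10.0)
```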
In step S202, the mixed signal is input to a speech processing model to obtain estimation data. Here, the speech processing model may be obtained by training a deep neural network.
According to embodiments of the present disclosure, different speech processing models may be obtained from different training data. Here, the estimation data may be an estimated target signal or an estimated ideal amplitude mask.
In step S203, a loss function is determined based on the target signal and the estimation data. As an example, in the case where the estimated data is an estimated ideal amplitude mask, the target ideal amplitude mask may be calculated first based on the target signal and the mixed signal, and then the loss function may be determined based on the target ideal amplitude mask and the estimated data.
In step S204, the speech processing model is trained based on the loss function to adjust parameters of the speech processing model. The training process of the speech processing model in the case where the output of the speech processing model is the estimated target signal will be described in detail with reference to fig. 3, and the training process of the speech processing model in the case where the output of the speech processing model is the estimated ideal amplitude mask will be described in detail with reference to fig. 4.
In the case where the output of the speech processing model is the estimated target signal, the speech processing model may be trained with reference to fig. 3. FIG. 3 is a schematic diagram of a trained speech processing model according to an embodiment of the disclosure.
Referring to fig. 3, the mixed signal Mix(t) and the target signal Tar(t) are respectively transformed to the time-frequency domain by the short-time Fourier transform (STFT) to obtain the time-frequency-domain mixed signal Mix(n,k) and target signal Tar(n,k). For example, if the target signal Tar and the mixed signal Mix of length T are Tar(t) and Mix(t) in the time domain, where t represents time and 0 < t ≤ T, then after the STFT, Tar(t) and Mix(t) can be expressed as:
Tar(n,k)=STFT(Tar(t)) (5)
Mix(n,k)=STFT(Mix(t)) (6)
where n is the frame index, 0 < n ≤ N, and N is the total number of frames; k is the center-frequency index, 0 < k ≤ K, and K is the total number of frequencies.
Next, the time-frequency-domain mixed signal Mix(n,k) is input to the deep neural network (DNN), and the estimated target signal Tar_est(n,k) is output from the DNN. A loss function is constructed based on the target signal Tar(n,k) and the estimated target signal Tar_est(n,k), the DNN is iteratively optimized based on the loss function, and training is completed upon convergence to obtain the speech processing model. However, the above examples of constructing the loss function are merely exemplary, and the present disclosure is not limited thereto.
After inputting the audio signal into the speech processing model trained as shown in fig. 3, an estimated target signal can be obtained.
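A minimal PyTorch-style sketch of the training loop of fig. 3 is given below, assuming the mixed and target signals have already been generated as above. For simplicity the network here estimates only the magnitude spectrum of the target signal; the two-layer network, the mean-squared-error loss, and all hyper-parameters are illustrative assumptions, since the disclosure does not fix a particular architecture or loss function.

```python
import torch

def stft_mag(x, n_fft=512, hop=256):
    """Magnitude spectrogram |X(n, k)| of a batch of time-domain signals."""
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs()

# Illustrative DNN: maps the mixed magnitude spectrum to an estimated target spectrum.
n_freq = 257  # n_fft // 2 + 1
dnn = torch.nn.Sequential(
    torch.nn.Linear(n_freq, 512), torch.nn.ReLU(),
    torch.nn.Linear(512, n_freq), torch.nn.ReLU(),
)
optimizer = torch.optim.Adam(dnn.parameters(), lr=1e-3)

def train_step(mixed, target):
    """One optimization step: loss between Tar_est(n, k) and Tar(n, k)."""
    mix_mag = stft_mag(mixed).transpose(1, 2)    # (batch, frames, freq)
    tar_mag = stft_mag(target).transpose(1, 2)
    tar_est = dnn(mix_mag)                       # estimated target magnitude
    loss = torch.mean((tar_est - tar_mag) ** 2)  # illustrative MSE loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```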
In the case where the output of the speech processing model is the estimated ideal amplitude mask, the speech processing model may be trained with reference to fig. 4. FIG. 4 is a schematic diagram of a trained speech processing model according to another embodiment of the disclosure.
Referring to fig. 4, the mixed signal Mix(t) and the target signal Tar(t) are respectively transformed to the time-frequency domain by the short-time Fourier transform (STFT) to obtain the time-frequency-domain mixed signal Mix(n,k) and target signal Tar(n,k). For example, if the target signal Tar and the mixed signal Mix of length T are Tar(t) and Mix(t) in the time domain, where t represents time and 0 < t ≤ T, then after the STFT, Tar(t) and Mix(t) can be expressed in the time-frequency domain as equations (5) and (6):
Tar(n,k)=STFT(Tar(t)) (5)
Mix(n,k)=STFT(Mix(t)) (6)
where n is the frame index, 0 < n ≤ N, and N is the total number of frames; k is the center-frequency index, 0 < k ≤ K, and K is the total number of frequencies.
The target ideal amplitude mask is calculated based on the mixed signal Mix(n,k) and the target signal Tar(n,k). For example, the target ideal amplitude mask may be calculated using the following equation (7):
IRM_obj(n,k) = |Tar(n,k)| / |Mix(n,k)| (7)
As can be seen from equation (7), the target ideal amplitude mask is the amplitude ratio of the target signal to the mixed signal in the time-frequency domain.
Next, the time-frequency-domain mixed signal Mix(n,k) is input to the deep neural network DNN, and the estimated ideal amplitude mask IRM_est(n,k) is output from the DNN. Then, a loss function is constructed based on the target ideal amplitude mask IRM_obj(n,k) and the estimated ideal amplitude mask IRM_est(n,k), and the DNN is optimized based on the loss function to adjust the network parameters, thereby obtaining the speech processing model. However, the above examples of constructing the loss function are merely exemplary, and the present disclosure is not limited thereto.
After inputting the audio signal into the speech processing model trained as shown in fig. 4, an estimated ideal amplitude mask can be obtained.
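For the variant of fig. 4, only the training target changes: the target ideal amplitude mask of equation (7) is computed from the target and mixed spectra, and the loss is taken against the mask estimated by the network. A hedged sketch, reusing the stft_mag, dnn, and optimizer names assumed in the previous snippet:

```python
def train_step_irm(mixed, target, eps=1e-8):
    """One optimization step for a model that outputs IRM_est(n, k)."""
    mix_mag = stft_mag(mixed).transpose(1, 2)
    tar_mag = stft_mag(target).transpose(1, 2)
    irm_obj = tar_mag / (mix_mag + eps)          # equation (7): target ideal amplitude mask
    irm_est = dnn(mix_mag)                       # network outputs the estimated mask
    loss = torch.mean((irm_est - irm_obj) ** 2)  # illustrative MSE loss on the mask
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```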
Referring back to fig. 1, in step S102, in the case where the output of the speech processing model is an estimated target signal, the estimated target signal may be obtained by applying the acquired audio signal to the speech processing model, and then the ideal amplitude mask may be obtained based on the estimated target signal and the audio signal. For example, when the estimated target signal is output from the speech processing model, the ideal amplitude mask may be calculated using the following equation (8):
IRM(n,k) = |Tar_est(n,k)| / |Aud(n,k)| (8)
where Tar_est(n,k) is the estimated target signal output from the speech processing model, and Aud(n,k) is the time-frequency-domain representation of the acquired audio signal after the short-time Fourier transform.
In step S103, the audio signal is processed differently according to the magnitude of the obtained ideal amplitude mask to obtain a desired signal. Here, the desired signal may be a speech-enhanced signal, a noise-reduced signal, or a signal that has been both speech-enhanced and noise-reduced. The ideal amplitude mask is compared with a predetermined threshold to determine whether the desired signal is obtained from an estimated signal, where the estimated signal is obtained by multiplying the acquired audio signal by the ideal amplitude mask.
As an example, if the ideal amplitude mask is greater than the predetermined threshold, the estimated signal is multiplied by a user-defined gain to obtain the desired signal; otherwise, the acquired audio signal is taken as the desired signal. For example, the desired signal may be obtained according to the following equation (9):
Est(n,k) = Aud(n,k)*IRM(n,k)*g_add, if IRM(n,k) > threshold; Est(n,k) = Aud(n,k), otherwise (9)
where Est(n,k) is the desired signal, Aud(n,k) is the audio signal after the short-time Fourier transform, and g_add is an additional gain adjustable by the user. Here, the predetermined threshold may be 1 or any value set by the user.
After the desired signal Est(n,k) in the time-frequency domain is obtained, the desired signal Est(t) in the time domain is obtained by the inverse short-time Fourier transform.
By the above-described processing, the speech portion in the obtained audio signal can be further enhanced, and the gain of the speech portion desired to be enhanced can be arbitrarily adjusted according to the user's needs.
As another example, if the ideal amplitude mask is less than the predetermined threshold, the estimated signal is taken as the desired signal; otherwise, the acquired audio signal is taken as the desired signal. For example, the desired signal may be obtained according to the following equation (10):
Est(n,k) = Aud(n,k)*IRM(n,k), if IRM(n,k) < threshold; Est(n,k) = Aud(n,k), otherwise (10)
where Est(n,k) is the desired signal and Aud(n,k) is the audio signal. Here, the predetermined threshold may be 1 or any value set by the user.
Through the above processing, a denoising effect on the obtained audio signal can be achieved.
As another example, if the ideal amplitude mask is greater than the predetermined threshold, the estimated signal is multiplied by a user-defined gain to obtain the desired signal; if the ideal amplitude mask is less than the predetermined threshold, the estimated signal is taken as the desired signal; otherwise, the acquired audio signal is taken as the desired signal. For example, the desired signal may be obtained according to the following equation (11):
Est(n,k) = Aud(n,k)*IRM(n,k)*g_add, if IRM(n,k) > threshold; Est(n,k) = Aud(n,k)*IRM(n,k), if IRM(n,k) < threshold; Est(n,k) = Aud(n,k), otherwise (11)
where Est(n,k) is the desired signal, Aud(n,k) is the audio signal, and g_add is an additional gain adjustable by the user. Here, the predetermined threshold may be 1 or any value set by the user.
By the above processing, the voice portion in the obtained audio signal can be further enhanced, and the gain of the voice portion desired to be enhanced can be arbitrarily adjusted according to the user's demand, while the noise reduction processing can be performed on the audio signal.
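The three post-processing modes described by equations (9), (10), and (11) can be summarized in a single routine. The sketch below is illustrative: it operates on time-frequency-domain signals, uses a threshold of 1 by default, and the mode argument ('enhance', 'denoise', 'both') is a naming convention introduced here rather than taken from the disclosure.

```python
import numpy as np

def post_process(aud, irm, g_add=2.0, threshold=1.0, mode="both"):
    """Apply equations (9)-(11) to the time-frequency audio signal Aud(n, k).

    aud, irm: arrays of the same shape (frames, freq).
    mode: 'enhance' -> eq. (9), 'denoise' -> eq. (10), 'both' -> eq. (11)."""
    est = np.array(aud, copy=True)   # default branch: keep Aud(n, k) unchanged
    boost = irm > threshold          # bins where speech dominates
    attenuate = irm < threshold      # bins where noise dominates
    if mode in ("enhance", "both"):
        est[boost] = aud[boost] * irm[boost] * g_add      # amplify speech bins
    if mode in ("denoise", "both"):
        est[attenuate] = aud[attenuate] * irm[attenuate]  # suppress noise bins
    return est
```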
In the above-described embodiments, in both the model training stage and the speech processing stage, the time-domain signals obtained first may be converted into time-frequency-domain signals via the short-time Fourier transform, model training or speech processing may then be performed, and the finally obtained time-frequency-domain signals may be converted back into time-domain signals via the inverse short-time Fourier transform.
Fig. 5 is a flow diagram of a speech processing method according to an embodiment of the present disclosure. In the present embodiment, it is assumed that the output of the speech processing model is an estimated target signal.
Referring to fig. 5, the obtained audio signal Aud (t) is transformed into a signal Aud (n, k) on a time-frequency domain through a short-time fourier transform STFT, and then the signal Aud (n, k) is input to a trained speech processing model.
The estimated target signal Tar_est(n,k) is output from the speech processing model, an ideal amplitude mask IRM(n,k) can be calculated using equation (8), and a determination is then made as to how to post-process the acquired audio signal based on a comparison of the calculated ideal amplitude mask with a predetermined threshold.
The predetermined threshold may be set to 1, the desired signal Est(n,k) in the time-frequency domain is obtained using equation (9), equation (10), or equation (11), and the desired signal Est(n,k) in the time-frequency domain is then subjected to the inverse short-time Fourier transform (ISTFT) to obtain the signal in the time domain.
Further, in the case where the output of the speech processing model is an estimated ideal amplitude mask, the conversion from the estimated target signal to the ideal amplitude mask in fig. 5 may be omitted: the estimated ideal amplitude mask is obtained directly from the speech processing model, and different post-processing operations are then performed based on the comparison of the ideal amplitude mask with the preset threshold.
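Putting the pieces together, the flow of fig. 5 can be sketched as follows for the case where the model outputs an estimated target signal: STFT, model inference, mask computation per equation (8), post-processing per equations (9)-(11), and inverse STFT. The snippet reuses the hypothetical dnn and post_process helpers assumed above and is a sketch rather than the disclosure's reference implementation.

```python
import torch

def process_audio(aud_t, n_fft=512, hop=256, mode="both", g_add=2.0):
    """Fig. 5 inference path for a model whose output is Tar_est(n, k).

    aud_t: 1-D float tensor containing the time-domain audio signal Aud(t)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(aud_t, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)   # Aud(n, k)
    mag = spec.abs().transpose(0, 1)                         # (frames, freq)
    with torch.no_grad():
        tar_est = dnn(mag)                                   # estimated target magnitude
    irm = tar_est / (mag + 1e-8)                             # equation (8)
    est_mag = torch.from_numpy(
        post_process(mag.numpy(), irm.numpy(), g_add=g_add, mode=mode)
    ).float()
    # Re-attach the phase of the input audio and return to the time domain (ISTFT).
    est_spec = torch.polar(est_mag.transpose(0, 1), spec.angle())
    return torch.istft(est_spec, n_fft=n_fft, hop_length=hop, window=window)
```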
Fig. 6 is a block diagram of a speech processing apparatus according to an embodiment of the present disclosure. Referring to fig. 6, a speech processing apparatus 600 may include a data acquisition module 601, a data processing module 602, and a model training module 603. Each module in the speech processing apparatus 600 may be implemented by one or more modules, and the names of the corresponding modules may vary according to the types of the modules. In various embodiments, some modules in the speech processing apparatus 600 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus functions of the respective modules/elements prior to combination may be equivalently performed.
The data acquisition module 601 may acquire an audio signal, wherein the audio signal may include at least one of a speech signal, a noise signal, and a specific signal belonging to an audio type that does not need to be enhanced and suppressed. Since a new audio class, i.e., a type that is neither desired to be amplified nor desired to be attenuated (such as a music signal, etc.), is introduced during the model training phase of the present disclosure, the speech processing model of the present disclosure is not purely used for speech enhancement and speech noise reduction, and such a design is more in line with practical application requirements. Thus, the present disclosure can perform speech processing on multiple types of signals.
The data processing module 602 may obtain an ideal amplitude mask using a speech processing model based on the acquired audio signal, and then process the audio signal differently according to the magnitude of the ideal amplitude mask to obtain a desired signal.
As an example, the data processing module 602 may determine whether to obtain the desired signal based on an estimated signal resulting from multiplying the audio signal with the ideal amplitude mask by comparing the ideal amplitude mask with a predetermined threshold.
For example, if the ideal amplitude mask is greater than a predetermined threshold, the data processing module 602 may multiply an estimated signal resulting from multiplication of the audio signal and the ideal amplitude mask with a gain defined by the user to obtain the desired signal; otherwise the audio signal may be taken as the desired signal. Here, the preset threshold value may be set to 1, or an arbitrary value set by the user. The post-speech processing operation may be performed with reference to equation (9).
For another example, if the ideal amplitude mask is less than a predetermined threshold, the data processing module 602 may treat the estimated signal as the desired signal; otherwise the audio signal may be taken as the desired signal. The post-speech processing operation may be performed with reference to equation (10).
For another example, if the ideal amplitude mask is greater than a predetermined threshold, the data processing module 602 may multiply the estimated signal with a gain defined by the user to obtain the desired signal. If the ideal amplitude mask is less than the predetermined threshold, the data processing module 602 may treat the estimated signal as the desired signal; otherwise the data processing module 602 may treat the audio signal as the desired signal. The post-speech processing operation may be performed with reference to equation (11).
Different speech processing models can be trained due to different training data. In the present disclosure, the output of the speech processing model may be an ideal amplitude mask or an estimated target signal.
In the case where the output of the speech processing model is an estimated target signal, the data processing module 602 may obtain the estimated target signal by applying the obtained audio signal to the speech processing model, and then obtain an ideal amplitude mask based on the estimated target signal and the audio signal. A speech post-processing operation is performed based on the obtained ideal amplitude mask.
Optionally, the speech processing apparatus can further comprise a model training module 603. Model training module 603 may train the speech processing model based on the following method: generating a mixed signal and a target signal based on at least one of the voice signal, the noise signal, and the specific signal, inputting the mixed signal into a voice processing model to obtain estimated data; determining a loss function based on the target signal and the estimated data; the speech processing model is trained based on the loss function to adjust parameters of the speech processing model.
Alternatively, the model training module 603 may multiply the specific signal by a first gain to obtain a first signal and multiply the noise signal by a second gain to obtain a second signal, and generate the mixed signal by mixing the first signal, the second signal, and the voice signal. For example, equation (1) may be utilized to generate a mixed signal.
Alternatively, the model training module 603 may multiply the speech signal by a third gain to obtain a third signal and generate the target signal by mixing the third signal with the first signal. For example, equation (2) may be utilized to generate the target signal.
Alternatively, the first gain may be determined based on a first predetermined signal-to-noise ratio and the second gain may be determined based on the second signal-to-noise ratio and the first gain. Through the design, the generated mixed signal and target signal serving as training data are more in line with the actual application requirements.
Since the speech processing models are different, the estimated data output by the speech processing models may be different. For example, the estimated data output by the speech processing model may be an estimated target signal or an estimated ideal amplitude mask.
In the case where the estimated data is an estimated ideal amplitude mask, the model training module 603 may calculate a target ideal amplitude mask based on the target signal and the mixed signal, and then determine a loss function based on the target ideal amplitude mask and the estimated data. Here, the target ideal amplitude mask may be an energy ratio of the target signal to the mixed signal.
Fig. 7 is a block diagram of a model training apparatus according to an embodiment of the present disclosure. Referring to fig. 7, a model training apparatus 700 may include a data generation module 701 and a data training module 702. Each module in model training apparatus 700 may be implemented by one or more modules, and the name of the corresponding module may vary depending on the type of module. In various embodiments, some modules in model training apparatus 700 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus functions of the respective modules/elements prior to combination may be equivalently performed.
In the model design stage, the training data is divided into three categories, which distinguishes this approach from single-purpose speech enhancement and speech noise reduction. Accordingly, the mixed signal has three input sources, e.g., speech (which requires enhancement), music (an audio type that requires neither enhancement nor suppression), and noise (which requires suppression).
The data generation module 701 may generate the mixed signal and the target signal based on at least one of a voice signal, a noise signal, and a specific signal. Specifically, the data generation module 701 may multiply a specific signal by a first gain to obtain a first signal and multiply a noise signal by a second gain to obtain a second signal, and generate a mixed signal by mixing the first signal, the second signal, and the voice signal. For example, a mixed signal as shown in equation (1) may be generated.
The data generation module 701 may multiply the voice signal by a third gain to obtain a third signal, and generate the target signal by mixing the third signal and the first signal. For example, a target signal may be generated as shown in equation (2).
Here, the first gain may be determined based on a first predetermined signal-to-noise ratio, and the second gain may be determined based on the second signal-to-noise ratio and the first gain. Through the design, the generated mixed signal and target signal serving as training data are more in line with the actual application requirements. For example, equations (3) and (4) may be utilized to determine gain values for different signals.
The data training module 702 may input the mixed signal into a speech processing model (such as a deep neural network) to obtain estimated data, determine a loss function based on the target signal and the estimated data, and train the speech processing model based on the loss function to adjust parameters of the speech processing model.
According to embodiments of the present disclosure, different training data may be used to obtain different speech processing models. Assuming the trained model outputs the target signal, the data training module 702 inputs the time-frequency-domain mixed signal Mix(n,k) into the deep neural network DNN and obtains the estimated target signal Tar_est(n,k) from the DNN. A loss function is constructed based on the target signal Tar(n,k) and the estimated target signal Tar_est(n,k), the DNN is iteratively optimized based on the loss function, and training is completed upon convergence to obtain the speech processing model.
Assuming the trained model outputs an ideal amplitude mask, the data training module 702 may calculate the target ideal amplitude mask based on the target signal Tar(n,k) and the mixed signal Mix(n,k), input the time-frequency-domain mixed signal Mix(n,k) into the deep neural network DNN, and obtain the estimated ideal amplitude mask IRM_est(n,k) from the DNN. The loss function is then determined based on the target ideal amplitude mask IRM_obj(n,k) and the estimated ideal amplitude mask IRM_est(n,k). Here, the target ideal amplitude mask may be an energy ratio of the target signal to the mixed signal.
According to embodiments of the present disclosure, an electronic device may be provided. Fig. 8 is a block diagram of an electronic device according to an embodiment of the present disclosure, the electronic device 800 may include at least one memory 802 and at least one processor 801, the at least one memory 802 storing a set of computer-executable instructions that, when executed by the at least one processor 801, perform a speech processing method or training method of a speech processing model according to an embodiment of the present disclosure.
Processor 801 may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 801 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The memory 802, which is a storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, a video playback parameter determination program, and a database.
The memory 802 may be integrated with the processor 801, for example, RAM or flash memory may be disposed within an integrated circuit microprocessor or the like. In addition, the memory 802 may include a stand-alone device, such as an external disk drive, a storage array, or other storage device usable by any database system. The memory and the processor may be operatively coupled or may communicate with each other, for example, through an I/O port, a network connection, etc., such that the processor is able to read files stored in the memory.
In addition, the electronic device 800 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 800 may be connected to each other via a bus and/or a network.
By way of example, electronic device 800 may be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the above-described set of instructions. Here, the electronic device 800 is not necessarily a single electronic device, but may be any apparatus or a collection of circuits capable of executing the above-described instructions (or instruction set) individually or in combination. The electronic device 800 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with either locally or remotely (e.g., via wireless transmission).
It will be appreciated by those skilled in the art that the structure shown in fig. 8 is not limiting and may include more or fewer components than shown, or certain components may be combined, or a different arrangement of components.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform a speech processing method or a training method of a speech processing model according to the present disclosure. Examples of the computer readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, nonvolatile memory, CD-ROM, CD-R, CD + R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD + R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, blu-ray or optical disk storage, hard Disk Drives (HDD), solid State Disks (SSD), card-type memories (such as multimedia cards, secure Digital (SD) cards or ultra-fast digital (XD) cards), magnetic tapes, floppy disks, magneto-optical data storage devices, hard disks, solid state disks, and any other devices configured to store computer programs and any associated data, data files and data structures in a non-transitory manner and to provide the computer programs and any associated data, data files and data structures to a processor or computer to enable the processor or computer to execute the programs. The computer programs in the computer readable storage media described above can be run in an environment deployed in a computer device, such as a client, host, proxy device, server, etc., and further, in one example, the computer programs and any associated data, data files, and data structures are distributed across networked computer systems such that the computer programs and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
In accordance with embodiments of the present disclosure, a computer program product may also be provided, in which instructions are executable by a processor of a computer device to perform the above-described speech processing method or the training method of a speech processing model.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (23)

1. A method of training a speech processing model, the method comprising:
multiplying the specific signal by a first gain to obtain a first signal and multiplying the noise signal by a second gain to obtain a second signal; wherein the specific signal belongs to an audio type that does not need to be enhanced or suppressed, the noise signal belongs to an audio type that needs to be suppressed, the first gain is determined based on a first predetermined signal-to-noise ratio, and the second gain is determined based on a second signal-to-noise ratio and the first gain, wherein the first predetermined signal-to-noise ratio is an energy ratio between the speech signal and the first signal, and the second signal-to-noise ratio is an energy ratio of the speech signal plus the first signal to the second signal;
generating a mixed signal by mixing the first signal, the second signal, and the speech signal; wherein the speech signal belongs to an audio type that needs to be enhanced;
multiplying the speech signal by a third gain to obtain a third signal;
generating a target signal by mixing the third signal and the first signal;
inputting the mixed signal into a speech processing model to obtain estimated data;
determining a loss function based on the target signal and the estimated data;
training the speech processing model based on the loss function to adjust parameters of the speech processing model.
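For illustration only (not part of the claims): the Python sketch below shows one way the mixing of claim 1 could be realized. It assumes the two signal-to-noise ratios are given in dB and that the gains are derived as square roots of energy ratios; the names (mix_training_example, snr1_db, snr2_db, gain3) are hypothetical, since the claim only states that the gains are determined based on the signal-to-noise ratios.

```python
import numpy as np

def mix_training_example(speech, specific, noise, snr1_db, snr2_db, gain3):
    """Illustrative construction of one training pair (mixed input, target).

    speech:   signal of the audio type to be enhanced
    specific: signal of the audio type to be neither enhanced nor suppressed
    noise:    signal of the audio type to be suppressed
    snr1_db:  first predetermined SNR, energy ratio of speech to the first signal
    snr2_db:  second SNR, energy ratio of (speech + first signal) to the second signal
    gain3:    third gain applied to the speech when building the target
    """
    eps = 1e-12

    def energy(x):
        return np.sum(x ** 2) + eps

    # first gain: scale the specific signal so that E(speech) / E(first) matches snr1_db
    g1 = np.sqrt(energy(speech) / (energy(specific) * 10 ** (snr1_db / 10.0)))
    first = g1 * specific

    # second gain: scale the noise so that E(speech + first) / E(second) matches snr2_db
    g2 = np.sqrt(energy(speech + first) / (energy(noise) * 10 ** (snr2_db / 10.0)))
    second = g2 * noise

    mixed = speech + first + second   # model input (claim 1, mixing step)
    target = gain3 * speech + first   # training target: scaled speech plus untouched specific signal
    return mixed, target
```

Note that the first signal appears unchanged in both the mixed input and the target, which is what drives the model to pass the specific signal through rather than suppress or enhance it.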
2. The method of claim 1, wherein the estimated data is an estimated target signal or an estimated ideal amplitude mask,
wherein the ideal amplitude mask is related to signal energy.
3. The method of claim 1, wherein, in a case where the estimated data is an estimated ideal amplitude mask, the step of determining a loss function based on the target signal and the estimated data comprises:
calculating a target ideal amplitude mask based on the target signal and the mixed signal; and
determining a loss function based on the target ideal amplitude mask and the estimated data.
4. The method of claim 3, wherein the target ideal amplitude mask is an amplitude ratio of the target signal to the mixed signal in the time-frequency domain.
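As a non-authoritative sketch of claims 3 and 4: the target ideal amplitude mask can be computed as the amplitude ratio of the target signal to the mixed signal in the time-frequency domain, and one possible loss is the mean squared error between the estimated mask and this target mask. The STFT parameters (fs, nperseg) and the MSE choice are assumptions not fixed by the claims.

```python
import numpy as np
from scipy.signal import stft

def target_ideal_amplitude_mask(target, mixed, fs=16000, nperseg=512):
    """Target IAM: amplitude ratio of target to mixed signal per time-frequency bin."""
    _, _, T = stft(target, fs=fs, nperseg=nperseg)
    _, _, M = stft(mixed, fs=fs, nperseg=nperseg)
    return np.abs(T) / (np.abs(M) + 1e-8)

def mask_loss(estimated_mask, target_mask):
    """One possible loss: mean squared error between estimated and target masks."""
    return np.mean((estimated_mask - target_mask) ** 2)
```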
5. A method of speech processing, the method comprising:
acquiring an audio signal, wherein the audio signal comprises at least one of a speech signal, a noise signal, and a specific signal, wherein the speech signal belongs to an audio type that needs to be enhanced, the noise signal belongs to an audio type that needs to be suppressed, and the specific signal belongs to an audio type that does not need to be enhanced or suppressed;
obtaining an ideal amplitude mask based on the audio signal using a speech processing model trained by the training method of any one of claims 1-4; and
processing the audio signal according to a result of comparing the ideal amplitude mask with a predetermined threshold to obtain a desired signal.
6. The method of claim 5, wherein the step of processing the audio signal according to the result of comparing the ideal amplitude mask with the predetermined threshold to obtain the desired signal comprises:
determining, according to the result of the comparison, whether to obtain the desired signal based on an estimated signal obtained by multiplying the audio signal by the ideal amplitude mask.
7. The method of claim 6, wherein determining whether to obtain the desired signal based on the estimated signal obtained by multiplying the audio signal by the ideal amplitude mask comprises: if the ideal amplitude mask is greater than the predetermined threshold, multiplying the estimated signal by a user-defined gain to obtain the desired signal; otherwise, taking the audio signal as the desired signal.
8. The method of claim 6, wherein determining whether to obtain the desired signal based on the estimated signal obtained by multiplying the audio signal by the ideal amplitude mask comprises: if the ideal amplitude mask is less than the predetermined threshold, taking the estimated signal as the desired signal; otherwise, taking the audio signal as the desired signal.
9. The method of claim 6, wherein determining whether to obtain the desired signal based on the estimated signal obtained by multiplying the audio signal by the ideal amplitude mask comprises: if the ideal amplitude mask is greater than the predetermined threshold, multiplying the estimated signal by a user-defined gain to obtain the desired signal; if the ideal amplitude mask is less than the predetermined threshold, taking the estimated signal as the desired signal; otherwise, taking the audio signal as the desired signal.
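Claims 7-9 can be read as a decision applied to the mask; the sketch below assumes the comparison is made per time-frequency bin (the claims do not say whether it is per bin, per frame, or global) and uses hypothetical names (apply_mask_with_threshold, user_gain).

```python
import numpy as np

def apply_mask_with_threshold(audio_tf, mask, threshold, user_gain=1.5):
    """Illustrative per-bin decision combining the three branches of claim 9.

    audio_tf:  complex STFT of the input audio signal
    mask:      ideal amplitude mask estimated by the model (same shape as audio_tf)
    threshold: the predetermined threshold
    user_gain: the user-defined gain applied where the mask exceeds the threshold
    """
    estimated = audio_tf * mask  # estimated signal = audio signal x ideal amplitude mask
    desired = np.where(
        mask > threshold, user_gain * estimated,           # mask above threshold: apply user-defined gain
        np.where(mask < threshold, estimated, audio_tf))   # below: keep masked estimate; equal: pass audio through
    return desired
```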
10. The method of claim 5, wherein the output of the speech processing model is the ideal amplitude mask or an estimated target signal,
wherein, in a case where the output of the speech processing model is the estimated target signal, the step of obtaining the ideal amplitude mask comprises:
obtaining the estimated target signal by inputting the audio signal into the speech processing model; and
obtaining the ideal amplitude mask based on the estimated target signal and the audio signal.
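If the model outputs an estimated target signal rather than a mask (claim 10), the ideal amplitude mask can then be recovered from it. Mirroring the amplitude-ratio form of claim 4 is one plausible reading, shown below as an assumption; the claim only says the mask is obtained based on the two signals.

```python
import numpy as np

def mask_from_estimated_target(estimated_target_tf, audio_tf, eps=1e-8):
    """Recover an ideal amplitude mask from an estimated target signal (illustrative)."""
    return np.abs(estimated_target_tf) / (np.abs(audio_tf) + eps)
```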
11. A training apparatus for a speech processing model, the apparatus comprising:
a data generation module configured to: multiply the specific signal by a first gain to obtain a first signal and multiply the noise signal by a second gain to obtain a second signal, wherein the specific signal belongs to an audio type that does not need to be enhanced or suppressed, the noise signal belongs to an audio type that needs to be suppressed, the first gain is determined based on a first predetermined signal-to-noise ratio, and the second gain is determined based on a second signal-to-noise ratio and the first gain, wherein the first predetermined signal-to-noise ratio is an energy ratio between the speech signal and the first signal, and the second signal-to-noise ratio is an energy ratio of the speech signal plus the first signal to the second signal; generate a mixed signal by mixing the first signal, the second signal, and the speech signal, wherein the speech signal belongs to an audio type that needs to be enhanced; multiply the speech signal by a third gain to obtain a third signal; and generate a target signal by mixing the third signal and the first signal; and
a data training module configured to: input the mixed signal into the speech processing model to obtain estimated data; determine a loss function based on the target signal and the estimated data; and train the speech processing model based on the loss function to adjust parameters of the speech processing model.
12. The apparatus of claim 11, wherein the estimated data is an estimated target signal or an estimated ideal amplitude mask, wherein the ideal amplitude mask is related to signal energy.
13. The apparatus of claim 11, wherein, in the case where the estimated data is an estimated ideal amplitude mask, the data training module is configured to:
calculate a target ideal amplitude mask based on the target signal and the mixed signal; and
determine a loss function based on the target ideal amplitude mask and the estimated data.
14. The apparatus of claim 13, wherein the target ideal amplitude mask is an amplitude ratio of the target signal to the mixed signal in a time-frequency domain.
15. A speech processing apparatus, the apparatus comprising:
a data acquisition module configured to acquire an audio signal, wherein the audio signal includes at least one of a speech signal, a noise signal, and a specific signal, wherein the speech signal belongs to an audio type that needs to be enhanced, the noise signal belongs to an audio type that needs to be suppressed, and the specific signal belongs to an audio type that does not need to be enhanced or suppressed; and
a data processing module configured to:
obtain an ideal amplitude mask based on the audio signal using a speech processing model trained by the training method of any one of claims 1-4; and
process the audio signal according to a result of comparing the ideal amplitude mask with a predetermined threshold to obtain a desired signal.
16. The apparatus of claim 15, wherein the data processing module is configured to:
determine, according to the result of comparing the ideal amplitude mask with the predetermined threshold, whether to obtain the desired signal based on an estimated signal obtained by multiplying the audio signal by the ideal amplitude mask.
17. The apparatus of claim 16, wherein the data processing module is configured to:
if the ideal amplitude mask is greater than the predetermined threshold, multiply the estimated signal by a user-defined gain to obtain the desired signal; otherwise, take the audio signal as the desired signal.
18. The apparatus of claim 16, wherein the data processing module is configured to:
if the ideal amplitude mask is less than the predetermined threshold, take the estimated signal as the desired signal; otherwise, take the audio signal as the desired signal.
19. The apparatus of claim 16, wherein the data processing module is configured to:
if the ideal amplitude mask is greater than the predetermined threshold, multiply the estimated signal by a user-defined gain to obtain the desired signal;
if the ideal amplitude mask is less than the predetermined threshold, take the estimated signal as the desired signal;
otherwise, take the audio signal as the desired signal.
20. The apparatus of claim 15, wherein the output of the speech processing model is the ideal amplitude mask or an estimated target signal,
wherein, in a case where the output of the speech processing model is the estimated target signal, the data processing module is configured to:
obtain the estimated target signal by inputting the audio signal into the speech processing model; and
obtain the ideal amplitude mask based on the estimated target signal and the audio signal.
21. An electronic device, comprising:
At least one processor;
At least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the training method of any of claims 1 to 4 and the speech processing method of any of claims 5 to 10.
22. A computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the training method of any one of claims 1 to 4 and the speech processing method of any one of claims 5 to 10.
23. A computer program product having instructions that are executable by at least one processor of an electronic device to perform the training method of any one of claims 1 to 4 and the speech processing method of any one of claims 5 to 10.
CN202011330109.3A 2020-11-24 2020-11-24 Voice processing model training method and device and voice processing method and device Active CN112309426B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011330109.3A CN112309426B (en) 2020-11-24 2020-11-24 Voice processing model training method and device and voice processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011330109.3A CN112309426B (en) 2020-11-24 2020-11-24 Voice processing model training method and device and voice processing method and device

Publications (2)

Publication Number Publication Date
CN112309426A CN112309426A (en) 2021-02-02
CN112309426B true CN112309426B (en) 2024-07-12

Family

ID=74335596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011330109.3A Active CN112309426B (en) 2020-11-24 2020-11-24 Voice processing model training method and device and voice processing method and device

Country Status (1)

Country Link
CN (1) CN112309426B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992168B (en) * 2021-02-26 2024-04-19 平安科技(深圳)有限公司 Speech noise reducer training method, device, computer equipment and storage medium
CN113035221B (en) * 2021-02-26 2023-12-19 北京达佳互联信息技术有限公司 Training method and device for voice processing model and voice processing method and device
CN113192536B (en) * 2021-04-28 2023-07-28 北京达佳互联信息技术有限公司 Training method of voice quality detection model, voice quality detection method and device
CN113470124B (en) * 2021-06-30 2023-09-22 北京达佳互联信息技术有限公司 Training method and device for special effect model, and special effect generation method and device
CN113990343A (en) * 2021-11-18 2022-01-28 北京达佳互联信息技术有限公司 Training method and device of voice noise reduction model and voice noise reduction method and device
CN114121031A (en) * 2021-12-08 2022-03-01 思必驰科技股份有限公司 Device voice noise reduction, electronic device, and storage medium
CN114974277A (en) * 2022-03-07 2022-08-30 云知声智能科技股份有限公司 Training method of voice noise reduction model, voice noise reduction method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108615535A (en) * 2018-05-07 2018-10-02 腾讯科技(深圳)有限公司 Sound enhancement method, device, intelligent sound equipment and computer equipment
CN111933164A (en) * 2020-06-29 2020-11-13 北京百度网讯科技有限公司 Training method and device of voice processing model, electronic equipment and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000231399A (en) * 1999-02-10 2000-08-22 Oki Electric Ind Co Ltd Noise reducing device
CN101154383B (en) * 2006-09-29 2010-10-06 株式会社东芝 Method and device for noise suppression, phonetic feature extraction, speech recognition and training voice model
US8639516B2 (en) * 2010-06-04 2014-01-28 Apple Inc. User-specific noise suppression for voice quality improvements
US9633671B2 (en) * 2013-10-18 2017-04-25 Apple Inc. Voice quality enhancement techniques, speech recognition techniques, and related systems
CN108346433A (en) * 2017-12-28 2018-07-31 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN108831500B (en) * 2018-05-29 2023-04-28 平安科技(深圳)有限公司 Speech enhancement method, device, computer equipment and storage medium
KR102085739B1 (en) * 2018-10-29 2020-03-06 광주과학기술원 Speech enhancement method
CN110956957B (en) * 2019-12-23 2022-05-17 思必驰科技股份有限公司 Training method and system of speech enhancement model
CN111554321B (en) * 2020-04-20 2023-12-05 北京达佳互联信息技术有限公司 Noise reduction model training method and device, electronic equipment and storage medium
CN111627458B (en) * 2020-05-27 2023-11-17 北京声智科技有限公司 Sound source separation method and equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108615535A (en) * 2018-05-07 2018-10-02 腾讯科技(深圳)有限公司 Sound enhancement method, device, intelligent sound equipment and computer equipment
CN111933164A (en) * 2020-06-29 2020-11-13 北京百度网讯科技有限公司 Training method and device of voice processing model, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112309426A (en) 2021-02-02

Similar Documents

Publication Publication Date Title
CN112309426B (en) Voice processing model training method and device and voice processing method and device
CN113436643B (en) Training and application method, device and equipment of voice enhancement model and storage medium
CN110265064B (en) Audio frequency crackle detection method, device and storage medium
US10262680B2 (en) Variable sound decomposition masks
CN113555031B (en) Training method and device of voice enhancement model, and voice enhancement method and device
CN113314147B (en) Training method and device of audio processing model, audio processing method and device
CN113241088A (en) Training method and device of voice enhancement model and voice enhancement method and device
US20140358534A1 (en) General Sound Decomposition Models
US20230267947A1 (en) Noise reduction using machine learning
CN112712816B (en) Training method and device for voice processing model and voice processing method and device
JP2019078864A (en) Musical sound emphasis device, convolution auto encoder learning device, musical sound emphasis method, and program
CN113284507A (en) Training method and device of voice enhancement model and voice enhancement method and device
US9601124B2 (en) Acoustic matching and splicing of sound tracks
CN113921022A (en) Audio signal separation method, device, storage medium and electronic equipment
CN114283833A (en) Speech enhancement model training method, speech enhancement method, related device and medium
US11393443B2 (en) Apparatuses and methods for creating noise environment noisy data and eliminating noise
US9318106B2 (en) Joint sound model generation techniques
CN113035221B (en) Training method and device for voice processing model and voice processing method and device
US20190385590A1 (en) Generating device, generating method, and non-transitory computer readable storage medium
CN112652290A (en) Method for generating reverberation audio signal and training method of audio processing model
CN106847299B (en) Time delay estimation method and device
CN113707163B (en) Speech processing method and device and model training method and device
WO2023093029A1 (en) Wake-up word energy calculation method and system, and voice wake-up system and storage medium
US20140140519A1 (en) Sound processing device, sound processing method, and program
Li et al. Robust Non‐negative matrix factorization with β‐divergence for speech separation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant