Background
Currently, speech processing technologies are applied in various fields, including speech recognition, speech synthesis, and human-computer speech interaction. In the prior art, when voice information obtained through speech synthesis is output, the resulting voice sounds stiff and noticeably robotic and lacks emotion, so the listener's sense of realism is weak.
Disclosure of Invention
In view of this, embodiments of the present invention provide an audio processing method, an audio processing apparatus, a readable storage medium, and an electronic device, which aim to add emotional color and improve realism in the speech synthesis process.
In a first aspect, an embodiment of the present invention provides an audio processing method, where the method includes:
determining first audio data;
determining an audio segment sequence according to the first audio data, wherein the audio segment sequence comprises at least one audio segment with corresponding audio attributes;
adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust the audio attribute corresponding to each audio segment and determine a target audio segment sequence;
and splicing each target audio segment in the target audio segment sequence to determine second audio data.
Further, the determining an audio segment sequence according to the first audio data comprises:
determining a word vector sequence corresponding to the first audio data;
performing word segmentation processing based on the word vector sequence to determine audio segments corresponding to a plurality of word vectors;
and determining an audio segment sequence from each of the audio segments.
Further, the audio attribute includes at least one of pitch, volume, and speech rate.
Further, the adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule to adjust an audio attribute corresponding to each audio segment to determine a target audio segment sequence includes:
determining a disturbance coefficient corresponding to each audio segment according to the corresponding audio attribute;
and adding a disturbance to the corresponding audio segment according to each disturbance coefficient, so as to adjust the audio attribute corresponding to each audio segment and determine a target audio segment sequence.
Further, the determining the disturbance coefficient corresponding to each audio segment according to the corresponding audio attribute includes:
determining a current audio segment;
determining a difference between the audio attribute value corresponding to the current audio segment and the audio attribute value corresponding to at least one adjacent audio segment in the audio segment sequence;
and determining a disturbance coefficient corresponding to the current audio segment according to the difference.
Further, the adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule to adjust the audio attribute corresponding to each audio segment to determine a target audio segment sequence specifically includes:
randomly disturbing each audio segment in the audio segment sequence to adjust the audio attribute corresponding to each audio segment and determine a target audio segment sequence.
Further, the splicing each target audio segment in the target audio segment sequence to determine second audio data includes:
splicing each target audio segment in the target audio segment sequence to determine candidate audio data;
and smoothing the candidate audio data to determine second audio data.
In a second aspect, an embodiment of the present invention provides an audio processing apparatus, where the apparatus includes:
a first audio determining module for determining first audio data;
a word segmentation module for determining an audio segment sequence according to the first audio data, wherein the audio segment sequence comprises at least one audio segment with corresponding audio attributes;
an adjusting module for adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust the audio attribute corresponding to each audio segment and determine a target audio segment sequence;
and the second audio determining module is used for splicing each target audio segment in the target audio segment sequence to determine second audio data.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium for storing computer program instructions which, when executed by a processor, implement the method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, the memory being configured to store one or more computer program instructions, wherein the one or more computer program instructions, when executed by the processor, implement the method according to any one of the first aspect.
In the embodiments of the present invention, first audio data is determined and divided to obtain an audio segment sequence comprising at least one audio segment. A disturbance is added to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust the audio attribute corresponding to each audio segment and determine a target audio segment sequence. Each target audio segment in the target audio segment sequence is then spliced to determine second audio data. By segmenting the audio data into a plurality of audio segments with corresponding audio attributes and adding a disturbance to each audio segment to adjust audio attributes such as pitch, volume, and speech rate, the embodiments of the present invention add emotional color to the audio data determined from the adjusted audio segments and improve the realism of the synthesized speech.
Detailed Description
The present invention will be described below based on examples, but is not limited to these examples. In the following detailed description, certain specific details are set forth; it will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present invention.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the embodiment of the present invention, the audio processing method may be implemented by a server or a terminal device. That is, the terminal device or server performing the audio processing determines first audio data (for example, by generating or receiving it), determines an audio segment sequence according to the first audio data so as to adjust audio attributes, and determines second audio data based on the adjusted audio segment sequence. The terminal device may be a general-purpose data processing terminal capable of running a computer program and having a communication function, such as a smartphone, tablet computer, or notebook computer. The server may be a single server or a server cluster configured in a distributed manner. The first audio data may be acquired by an audio acquisition device arranged on the terminal device or connected to the server, transmitted to the terminal device or server by another device, or generated directly by the terminal device or server itself, so that audio processing can be performed by the terminal device or server. The following description takes audio processing performed by a server as an example.
Fig. 1 is a flowchart of an audio processing method according to an embodiment of the invention. As shown in fig. 1, the audio processing method includes the steps of:
and step S100, determining first audio data.
Specifically, the first audio data is the audio data to be processed and may be determined by a server. In the embodiment of the present invention, the first audio data may be stored as text data with corresponding audio attributes, or directly as a time-domain waveform. In the latter case, the first audio data may be determined by converting text information generated, received, or previously stored by the server into speech information through a speech synthesis technique. For example, when the embodiment of the present invention is applied to children's literacy software, the text information may be a text to be recognized that a child or parent inputs through a user terminal; after receiving the text, the server of the literacy software converts it into corresponding voice information by speech synthesis as the first audio data. When the embodiment of the present invention is applied to software with a voice interaction function, the text information may be system information preset in the server for guiding the user through the software; when the user selects voice prompts, the system information is converted into corresponding voice information by speech synthesis as the first audio data.
Alternatively, the speech synthesis may be performed by a speech synthesis system such as WaveNet, DeepVoice, or Tacotron. The speech synthesis process comprises three stages: text analysis, prosody analysis, and acoustic analysis. That is, text features are extracted through text analysis; on this basis, prosodic features such as fundamental frequency, duration, and rhythm are predicted to obtain phonetically annotated text corresponding to the input text; an acoustic model then maps these front-end parameters to speech parameters, and finally the speech is synthesized by a vocoder.
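For illustration of this three-stage structure only, the following minimal Python sketch wires together placeholder stage functions. Every function name and body here is a hypothetical stand-in, not the API of WaveNet, DeepVoice, or Tacotron; real systems learn these stages from data.

```python
# Structural sketch of the TTS pipeline; every stage is a hypothetical stand-in.
import numpy as np

def analyze_text(text: str) -> list[str]:
    # Text analysis: extract linguistic units (here, simply the characters).
    return list(text)

def predict_prosody(units: list[str]) -> list[dict]:
    # Prosody analysis: predict fundamental frequency and duration per unit;
    # fixed values stand in for a learned prosody model.
    return [{"f0": 220.0, "duration": 0.15} for _ in units]

def acoustic_model(prosody: list[dict]) -> np.ndarray:
    # Acoustic analysis: map front-end parameters to speech parameters.
    return np.array([[p["f0"], p["duration"]] for p in prosody])

def vocoder(params: np.ndarray, sr: int = 16000) -> np.ndarray:
    # Vocoder: render the speech parameters to a waveform (sine tones here).
    chunks = []
    for f0, dur in params:
        t = np.arange(int(sr * dur)) / sr
        chunks.append(np.sin(2 * np.pi * f0 * t))
    return np.concatenate(chunks)

waveform = vocoder(acoustic_model(predict_prosody(analyze_text("hello"))))
```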
Step S200, determining an audio segment sequence according to the first audio data.
Specifically, after determining the first audio data to be processed, the server segments it to obtain an audio segment sequence including a plurality of audio segments. Each audio segment has corresponding audio attributes, which represent attributes such as the pitch, speech rate, and volume of that segment. For example, when the content corresponding to the first audio data is "the weather today is really good", the server may determine from the first audio data that the contents of the audio segments in the audio segment sequence are "today", "weather", "really", and "good". The first audio data may be stored in different forms: as text data together with its corresponding audio attributes, or directly as a time-domain waveform carrying the audio information. Accordingly, for first audio data stored in different formats, embodiments of the present invention determine the audio segment sequence in different ways.
Based on the storage forms described above, in an optional implementation of the embodiment of the present invention, when the first audio data is stored in a time-domain waveform format, both the audio content and the audio attributes of the first audio data are carried by the waveform. The server may segment the first audio data in the time domain according to a preset segmentation rule to obtain a plurality of waveform files, each carrying its corresponding audio content and audio attributes, as the audio segments, and may then determine the audio segment sequence according to the position of each audio segment in the first audio data. For example, when the first audio data is the time-domain waveform corresponding to "the weather today is really good", the vertical axis of the waveform represents the volume, and the horizontal axis relates to the speech rate and pitch. The server directly divides the first audio data to obtain time-domain waveforms respectively carrying "today", "weather", "really", and "good", where the vertical axis of each waveform represents the volume of the corresponding content and the horizontal axis relates to its speech rate and pitch.
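As a minimal sketch of this time-domain segmentation (assuming the sample rate and the split points are given, rather than derived by the preset segmentation rule):

```python
import numpy as np

def split_waveform(waveform: np.ndarray, boundaries_s: list[float],
                   sr: int = 16000) -> list[np.ndarray]:
    """Split a time-domain waveform at the given boundaries (in seconds)."""
    indices = [int(b * sr) for b in boundaries_s]
    return np.split(waveform, indices)

# Example: a 2-second dummy waveform split into four segments
# (stand-ins for "today", "weather", "really", "good").
audio = np.random.randn(2 * 16000)
segments = split_waveform(audio, boundaries_s=[0.5, 1.0, 1.4])
print([len(s) for s in segments])  # sample counts of the four segments
```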
In another optional implementation of the embodiment of the present invention, the first audio data is stored as text data together with the corresponding audio attributes. In this case, the audio segment sequence is determined from the first audio data by performing word segmentation on the text data and then converting the word segmentation results into time-domain waveforms based on the corresponding audio attributes to obtain the audio segments; the audio segment sequence is then determined according to the position of each audio segment in the first audio data. The word segmentation may be implemented by natural language processing techniques, such as dictionary-based, neural-network-based, or character-based word segmentation.
Optionally, the server may train a word segmentation model in advance on a word vector training set, where the training set includes a plurality of word vector sequences and the plurality of word vectors corresponding to each sequence. The word segmentation model is trained by taking a word vector sequence in the training set as the model input and the corresponding word vectors as the model output. After the first audio data is determined, the text data corresponding to the first audio data is converted into a word vector sequence, the word vector sequence is input into the trained word segmentation model, and a plurality of word vectors are output. Corresponding characters or words are then determined from the word vectors and converted into time-domain waveforms through a speech synthesis technique to determine the audio segments, and the audio segment sequence is determined from the audio segments. For example, when the first audio data is the text data "the weather today is really good" with corresponding audio attributes including volume, speech rate, and pitch, the server inputs the text data into the trained word segmentation model and obtains the word segmentation results "today", "weather", "really", and "good", each with its corresponding volume, speech rate, and pitch. The server then determines the corresponding audio segments from the word segmentation results and their audio attributes to obtain the audio segment sequence.
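A small sketch of the word segmentation step, substituting the open-source jieba segmenter for the trained word segmentation model described above; the per-word attribute table is an illustrative assumption standing in for the attributes stored with the text:

```python
import jieba  # open-source dictionary-based Chinese word segmentation

text = "今天天气真好"  # "the weather today is really good"
words = list(jieba.cut(text))  # e.g. ["今天", "天气", "真", "好"]; the exact
                               # split depends on the segmenter's dictionary

# Illustrative per-word audio attributes (volume in dB, relative rate/pitch);
# in the patent these are stored alongside the text data.
attributes = {w: {"volume_db": 70.0, "rate": 1.0, "pitch": 1.0} for w in words}

# Each (word, attributes) pair would then be synthesized into a waveform
# segment, yielding the audio segment sequence.
audio_segment_spec = [(w, attributes[w]) for w in words]
print(audio_segment_spec)
```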
Fig. 2 is a schematic diagram of an audio segment according to an embodiment of the present invention. As shown in fig. 2, the audio segment 20 in the embodiment of the present invention is stored as a time-domain waveform, which records the information corresponding to the audio segment 20. The vertical axis is in decibels and represents the volume of the audio segment 20: the higher the peak decibel level of the waveform, the higher the volume; the lower the decibel level, the lower the volume. The horizontal axis is time and relates to the speech rate and pitch of the audio segment 20: the shorter the duration of the waveform, the faster the speech rate and the higher the pitch; the longer the duration, the slower the speech rate and the lower the pitch.
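A small sketch of reading these two quantities off a segment's waveform; taking full scale 1.0 as the 0 dB reference is an assumption:

```python
import numpy as np

def segment_attributes(segment: np.ndarray, sr: int = 16000) -> dict:
    """Estimate a segment's volume (peak dB re full scale) and duration."""
    peak = np.max(np.abs(segment)) + 1e-12  # epsilon avoids log(0) on silence
    return {
        "peak_db": 20 * np.log10(peak),   # vertical axis: volume
        "duration_s": len(segment) / sr,  # horizontal axis: rate and pitch
    }

# Half-amplitude 220 Hz tone, 0.5 s long: peak_db near -6 dB, duration 0.5 s.
tone = 0.5 * np.sin(2 * np.pi * 220 * np.arange(8000) / 16000)
print(segment_attributes(tone))
```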
Step S300, adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust the audio attribute corresponding to each audio segment and determine a target audio segment sequence.
Specifically, since each audio segment in the audio segment sequence is determined directly or indirectly through speech synthesis, the sound output by combining the audio segments usually sounds mechanical and lacks emotion. Therefore, after determining the audio segment sequence, the server adds a disturbance to each audio segment according to a preset disturbance rule so as to adjust the corresponding audio attribute, so that emotional color is added when the audio segments are combined and output, improving the realism of the synthesized speech. In the embodiment of the present invention, the server may add the disturbance to each audio segment in a plurality of different ways.
In an optional implementation of the embodiment of the present invention, the server may add the disturbance by directly applying a random disturbance to each audio segment, so as to randomly adjust audio attributes such as volume, speech rate, and pitch without changing the content carried by the segment. The random disturbance may be added, for example, as random noise applied to the time-domain waveform signal of the audio segment. The server determines each adjusted audio segment as a target audio segment and determines the target audio segment sequence based on the position of each audio segment in the audio segment sequence. Because adding a random disturbance to a signal is fast, this disturbance method can quickly add emotional color to each audio segment.
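A minimal sketch of this random-disturbance variant. The jitter ranges are illustrative assumptions; volume is jittered with a random gain, and speech rate/pitch with naive resampling, which shifts both together, consistent with the time-domain picture above:

```python
import numpy as np

rng = np.random.default_rng()

def random_disturb(segment: np.ndarray) -> np.ndarray:
    """Randomly jitter a segment's volume and its speech rate/pitch."""
    gain = rng.uniform(0.9, 1.1)       # small random volume change
    stretch = rng.uniform(0.95, 1.05)  # small random duration change
    n_out = max(1, int(len(segment) * stretch))
    # Naive linear resampling: changes speech rate and pitch together.
    idx = np.linspace(0, len(segment) - 1, n_out)
    return gain * np.interp(idx, np.arange(len(segment)), segment)

# Dummy stand-ins for the segments produced by the splitting sketch above.
segment_sequence = [np.random.randn(8000) for _ in range(4)]
target_sequence = [random_disturb(s) for s in segment_sequence]
```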
In another optional implementation of the embodiment of the present invention, the manner in which the server adds the disturbance to each audio segment in the audio segment sequence may further include the following steps:
and S310, determining a disturbance coefficient corresponding to each audio clip according to the corresponding audio attribute.
Specifically, the server may add a tailored disturbance to each audio segment according to the audio attribute corresponding to that segment, so as to improve the accuracy of the disturbance result. To perturb each audio segment in this targeted way, a disturbance coefficient corresponding to each audio segment is determined according to the corresponding audio attribute. In the embodiment of the present invention, determining the disturbance coefficient of each audio segment may further include the following steps:
step S311, determining the current audio segment.
Specifically, when determining the disturbance coefficient corresponding to each audio segment in the audio segment sequence, one audio segment in the sequence is first selected as the current audio segment, and its disturbance coefficient is determined. After the disturbance coefficient of the current audio segment is determined, another audio segment whose disturbance coefficient has not yet been determined is selected from the sequence as the new current audio segment, until the disturbance coefficients of all audio segments in the sequence have been determined. Further, the server may preset the order in which current audio segments are selected, for example, taking the audio segments in the sequence as the current audio segment in order from front to back.
Optionally, the server may also take a preset number of audio segments in the audio segment sequence, or even all of the audio segments, as current audio segments simultaneously, so as to determine their disturbance coefficients in parallel and improve the data processing speed.
Step S312, determining a difference between the audio attribute value corresponding to the current audio segment and the audio attribute value corresponding to at least one adjacent audio segment in the audio segment sequence.
Specifically, for the current audio segment, the server determines its audio attribute value and the audio attribute value of at least one adjacent audio segment in the audio segment sequence. The audio attribute value is the value of at least one audio attribute to be adjusted for the segment, and the audio attribute to be adjusted may be predetermined. For example, when the volume of the audio segment is to be adjusted, the audio attribute value is the volume value of the segment; when the pitch is to be adjusted, the audio attribute value is the pitch value of the segment.
After determining the audio attribute value of the current audio segment, the server also determines the audio attribute value of at least one audio segment adjacent to the current audio segment in the audio segment sequence. For example, the attribute value of the segment immediately before the current segment, the segment immediately after it, or both may be determined. The server then calculates the difference between the attribute value of the current audio segment and that of the determined adjacent audio segment. For example, when the audio attribute value is volume, the value of the current segment is 80 decibels, and the value of the adjacent segment in the audio segment sequence is 70 decibels, the calculated difference is 10 decibels.
Step S313, determining a disturbance coefficient corresponding to the current audio segment according to the difference.
Specifically, the server takes the difference obtained in step S312 between the audio attribute value of the current audio segment and that of the adjacent audio segment, and determines the corresponding disturbance coefficient according to the magnitude of that difference. In the embodiment of the present invention, the server may preset a set of disturbance coefficients for each audio attribute, where the set contains a disturbance coefficient for each range of attribute-value differences. Take as an example a volume coefficient set of { "-10 to -5: 0.8", "-4 to 0: 0.9", "0 to 5: 1", "6 to 10: 1.1", "11 to 15: 1.2" }. When the audio attribute value is volume, the value of the current segment is 80 decibels, and the value of the adjacent segment in the audio segment sequence is 70 decibels, the calculated difference is 10 decibels, and the corresponding disturbance coefficient is determined to be 1.1.
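A minimal sketch of this lookup. The ranges and coefficients mirror the example set above; treating each range as inclusive of both endpoints (first match wins) and falling back to 1.0 outside all ranges are assumptions:

```python
# Volume-difference ranges (dB) -> disturbance coefficient, as in the example.
COEFF_TABLE = [(-10, -5, 0.8), (-4, 0, 0.9), (0, 5, 1.0), (6, 10, 1.1), (11, 15, 1.2)]

def disturbance_coefficient(current_db: float, neighbor_db: float) -> float:
    """Map the volume difference with a neighbor segment to a coefficient."""
    diff = current_db - neighbor_db
    for low, high, coeff in COEFF_TABLE:
        if low <= diff <= high:
            return coeff
    return 1.0  # outside all ranges: leave the segment unchanged (assumption)

print(disturbance_coefficient(80.0, 70.0))  # -> 1.1, as in the example above
```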
Step S320, adding a disturbance to the corresponding audio segment according to each of the disturbance coefficients, so as to adjust the audio attribute corresponding to each of the audio segments to determine a target audio segment sequence.
Specifically, after the disturbance coefficient corresponding to each audio segment is determined, a disturbance is added to the corresponding audio segment according to each disturbance coefficient so as to adjust the corresponding audio attribute. After each audio segment is disturbed, a corresponding target audio segment is obtained, and the target audio segment sequence is determined. The disturbance may be applied by directly multiplying the attribute value by the disturbance coefficient. For example, when the server determines that the disturbance coefficient corresponding to the volume of an audio segment is 1.1, it multiplies the volume of that segment by 1.1 as a whole.
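A sketch of applying the coefficient at the signal level, continuing the earlier sketches (it reuses `segments` from the splitting sketch, `segment_attributes()` from the fig. 2 sketch, and `disturbance_coefficient()` from the lookup sketch); scaling the raw amplitude is an assumption about what multiplying the volume value means for a waveform:

```python
import numpy as np

def apply_volume_coefficient(segment: np.ndarray, coeff: float) -> np.ndarray:
    """Scale the whole segment's amplitude by its disturbance coefficient."""
    return coeff * segment

# Compare each segment's peak volume with that of the preceding segment;
# the first segment has no predecessor and keeps coefficient 1.0.
volumes = [segment_attributes(s)["peak_db"] for s in segments]
target_segments = []
for i, seg in enumerate(segments):
    coeff = disturbance_coefficient(volumes[i], volumes[i - 1]) if i > 0 else 1.0
    target_segments.append(apply_volume_coefficient(seg, coeff))
```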
This disturbance method determines the disturbance coefficient of each audio segment in a targeted manner before adding the disturbance to each segment, and can therefore improve the effect of adding emotional color to the audio segments.
Step S400, splicing each target audio segment in the target audio segment sequence to determine second audio data.
Specifically, the second audio data is a time-domain waveform signal. After the target audio segment sequence is determined, the target audio segments are spliced in the order of the corresponding segments in the audio segment sequence to determine the second audio data. Because the time-domain waveform of each target audio segment has been processed multiple times after being generated by speech synthesis, the junctions between target audio segments may not be smooth. Therefore, in the embodiment of the present invention, the second audio data may be determined by first splicing the target audio segments in the target audio segment sequence to obtain candidate audio data, and then smoothing the candidate audio data. Optionally, the smoothing may be performed by first-order low-pass filtering, complementary filtering, Kalman filtering, or the like.
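A minimal sketch of the splice-then-smooth step using a first-order low-pass filter, one of the options named above; the smoothing factor alpha is an illustrative assumption:

```python
import numpy as np

def splice_and_smooth(target_segments: list[np.ndarray],
                      alpha: float = 0.3) -> np.ndarray:
    """Concatenate the target segments, then apply the first-order low-pass
    filter y[n] = alpha * x[n] + (1 - alpha) * y[n - 1] to smooth the joins."""
    candidate = np.concatenate(target_segments)  # candidate audio data
    smoothed = np.empty_like(candidate)
    smoothed[0] = candidate[0]
    for n in range(1, len(candidate)):
        smoothed[n] = alpha * candidate[n] + (1 - alpha) * smoothed[n - 1]
    return smoothed  # second audio data

second_audio = splice_and_smooth(target_segments)  # from the previous sketch
```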
Fig. 3 is a schematic diagram of an audio segment splicing process according to an embodiment of the present invention. As shown in fig. 3, when the determined target audio segment sequence includes a first target audio segment 30 and a second target audio segment 31, the first target audio segment 30 and the second target audio segment 31 are first spliced to obtain candidate audio data 32, and then the candidate audio data 32 is further smoothed to obtain corresponding second audio data 33.
Thus, through step S400, the embodiment of the present invention obtains a second audio signal output with emotional color. Take the human-computer voice interaction process in the online education field as an example. A student inputs a text through a student terminal; after receiving the text, the server of the online education platform determines a corresponding answer text, determines first audio data corresponding to the answer text through speech synthesis, and then adds a disturbance to the audio segment sequence determined from the first audio data. Finally, a second audio signal is determined from the disturbed target audio segment sequence, returned to the student terminal, and output through the loudspeaker of the student terminal.
According to the audio processing method of the embodiments of the present invention, the audio data is segmented into a plurality of audio segments with corresponding audio attributes, and a disturbance is added to each audio segment to adjust audio attributes such as pitch, volume, and speech rate, thereby adding emotional color to the audio data determined from the adjusted audio segments and improving the realism of the synthesized speech. Moreover, the audio processing method is simple and fast enough to perform speech processing in real time during human-computer voice interaction.
Fig. 4 is a schematic diagram of an audio processing apparatus according to an embodiment of the present invention. As shown in fig. 4, the audio processing apparatus includes a first audio determining module 40, a word segmentation module 41, an adjusting module 42, and a second audio determining module 43.
In particular, the first audio determining module 40 is configured to determine first audio data. The word segmentation module 41 is configured to determine an audio segment sequence according to the first audio data, where the audio segment sequence includes at least one audio segment with a corresponding audio attribute. The adjusting module 42 is configured to add a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust an audio attribute corresponding to each audio segment to determine a target audio segment sequence. The second audio determining module 43 is configured to splice each of the target audio segments in the sequence of target audio segments to determine second audio data.
The audio processing apparatus of the embodiment of the present invention divides the audio data into a plurality of audio segments with corresponding audio attributes and adds a disturbance to each audio segment to adjust audio attributes such as pitch, volume, and speech rate, thereby adding emotional color to the audio data determined from the adjusted audio segments and improving the realism of the synthesized speech.
Fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 5, the electronic device has a generic computer hardware structure that includes at least a processor 50 and a memory 51, connected by a bus 52. The memory 51 is adapted to store instructions or programs executable by the processor 50. The processor 50 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 50 implements the processing of data and the control of other devices by executing the instructions stored in the memory 51, thereby performing the method flows of the embodiments of the present invention described above. The bus 52 connects the above components together and also connects them to a display controller 53, a display device, and input/output (I/O) devices 54. The input/output (I/O) devices 54 may be a mouse, keyboard, modem, network interface, touch input device, motion-sensing input device, printer, or other devices known in the art. Typically, the input/output devices 54 are connected to the system through an input/output (I/O) controller 55.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the invention is directed to a non-transitory storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.
That is, as those skilled in the art can understand, all or part of the steps of the methods in the embodiments described above may be accomplished by a program instructing the relevant hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.