
CN112185338A - Audio processing method and device, readable storage medium and electronic equipment - Google Patents

Audio processing method and device, readable storage medium and electronic equipment

Info

Publication number
CN112185338A
Authority
CN
China
Prior art keywords: audio, sequence, segments, target, segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011062271.1A
Other languages
Chinese (zh)
Other versions
CN112185338B (en)
Inventor
梁光
杨惠
吴雨璇
舒景辰
周鼎皓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Yudi Technology Co ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd
Priority to CN202011062271.1A
Publication of CN112185338A
Application granted
Publication of CN112185338B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract


Embodiments of the present invention disclose an audio processing method and apparatus, a readable storage medium, and an electronic device. First audio data is determined and divided to determine an audio segment sequence including at least one audio segment. A disturbance is added to each audio segment in the sequence according to a preset disturbance rule, so as to adjust the audio attribute corresponding to each segment and determine a target audio segment sequence. The target audio segments in the target audio segment sequence are spliced to determine second audio data. By dividing the audio data into a plurality of audio segments with corresponding audio attributes and adding a disturbance to each segment to adjust audio attributes such as pitch, volume, and speech rate, the embodiments add emotional color to the audio data determined from the adjusted audio segments and improve the realism of the synthesized speech.


Description

Audio processing method and device, readable storage medium and electronic equipment
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an audio processing method and apparatus, a readable storage medium, and an electronic device.
Background
Currently, speech processing technologies are applied in various fields, including speech recognition, speech synthesis, and human-computer speech interaction. In the prior art, when speech obtained through speech synthesis is output, the resulting voice sounds stiff and noticeably robotic and lacks emotion, so it does not sound realistic to the listener.
Disclosure of Invention
In view of this, embodiments of the present invention provide an audio processing method, an audio processing apparatus, a readable storage medium, and an electronic device, which aim to add emotional color and improve realism in the speech synthesis process.
In a first aspect, an embodiment of the present invention provides an audio processing method, where the method includes:
determining first audio data;
determining an audio segment sequence according to the first audio data, wherein the audio segment sequence comprises at least one audio segment with a corresponding audio attribute;
adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust the audio attribute corresponding to each audio segment to determine a target audio segment sequence;
splicing each target audio segment in the target audio segment sequence to determine second audio data.
Further, the determining a sequence of audio segments from the first audio data comprises:
determining a word vector sequence corresponding to the first audio data;
performing word segmentation processing based on the word vector sequence to determine audio segments corresponding to a plurality of word vectors;
an audio segment sequence is determined from each of the audio segments.
Further, the audio attribute includes at least one of a pitch, a volume, and a speech rate.
Further, the adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule to adjust an audio attribute corresponding to each audio segment to determine a target audio segment sequence includes:
determining a disturbance coefficient corresponding to each audio segment according to the corresponding audio attribute;
and adding a disturbance to the corresponding audio segment according to each disturbance coefficient, so as to adjust the audio attribute corresponding to each audio segment to determine a target audio segment sequence.
Further, the determining the disturbance coefficient corresponding to each audio segment according to the corresponding audio attribute includes:
determining a current audio segment;
determining a difference between the audio attribute value corresponding to the current audio segment and the audio attribute value corresponding to at least one adjacent audio segment in the audio segment sequence;
and determining a disturbance coefficient corresponding to the current audio segment according to the difference.
Further, the adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule to adjust the audio attribute corresponding to each audio segment to determine a target audio segment sequence specifically includes:
randomly disturbing each audio segment in the audio segment sequence to adjust the audio attribute corresponding to each audio segment to determine a target audio segment sequence.
Further, the splicing each of the target audio segments in the sequence of target audio segments to determine second audio data includes:
splicing each target audio segment in the target audio segment sequence to determine candidate audio data;
smoothing the candidate audio data to determine second audio data.
In a second aspect, an embodiment of the present invention provides an audio processing apparatus, where the apparatus includes:
a first audio determining module for determining first audio data;
the word segmentation module is used for determining an audio segment sequence according to the first audio data, wherein the audio segment sequence comprises at least one audio segment with a corresponding audio attribute;
the adjusting module is used for adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust the audio attribute corresponding to each audio segment to determine a target audio segment sequence;
and the second audio determining module is used for splicing each target audio segment in the target audio segment sequence to determine second audio data.
In a third aspect, the present invention provides a computer-readable storage medium for storing computer program instructions, which when executed by a processor implement the method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, the memory being configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method according to any one of the first aspect.
The embodiment of the invention determines first audio data and divides it to determine an audio segment sequence comprising at least one audio segment, adds a disturbance to each audio segment in the sequence according to a preset disturbance rule so as to adjust the audio attribute corresponding to each segment and determine a target audio segment sequence, and splices each target audio segment in the target audio segment sequence to determine second audio data. By segmenting the audio data into a plurality of audio segments with corresponding audio attributes and adding a disturbance to each segment to adjust audio attributes such as pitch, volume, and speech rate, emotional color is added to the audio data determined from the adjusted segments, and the realism of the synthesized speech is improved.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of an audio processing method according to an embodiment of the invention;
FIG. 2 is a diagram of an audio clip according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an audio segment splicing process according to an embodiment of the present invention;
FIG. 4 is a diagram of an audio processing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present invention.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout the description, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present invention, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the embodiment of the present invention, the audio processing method may be implemented by a server or a terminal device: the terminal device or server performing the audio processing determines first audio data (for example, by generating or receiving it), determines an audio segment sequence from the first audio data in order to adjust audio attributes, and determines second audio data based on the adjusted audio segment sequence. The terminal device may be a general data processing terminal capable of running a computer program and having a communication function, such as a smartphone, a tablet computer, or a notebook computer. The server may be a single server or a server cluster configured in a distributed manner. The first audio data may be collected by an audio collection device provided on the terminal device or connected to the server, transmitted to the terminal device or server by another device, or generated directly by the terminal device or server, so that audio processing can be performed by either the terminal device or the server. The following description takes audio processing performed by a server as an example.
Fig. 1 is a flowchart of an audio processing method according to an embodiment of the invention. As shown in fig. 1, the audio processing method includes the steps of:
and step S100, determining first audio data.
Specifically, the first audio data is the audio data to be processed and may be determined by the server. In the embodiment of the present invention, the first audio data may be stored as text data together with corresponding audio attributes, or directly stored as a time-domain waveform. In the latter case, the first audio data may be determined by converting text information generated, received, or pre-stored by the server into speech through a speech synthesis technique. For example, when the embodiment of the present invention is applied to children's text recognition software, the text information may be a text to be recognized that a child or parent inputs through a user terminal; after receiving the text, the server of the software converts it into corresponding speech information as the first audio data by means of speech synthesis. When the embodiment of the invention is applied to software with a voice interaction function, the text information may be system information preset in the server for guiding the user to operate the software; when the user selects voice prompts, the system information is converted into corresponding speech information as the first audio data by means of speech synthesis.
Alternatively, the speech synthesis process may be performed by a speech synthesis system such as WaveNet, DeepVoice, or Tacotron. The speech synthesis process comprises three parts: text analysis, prosody analysis, and acoustic analysis. Text features are extracted through text analysis; on that basis, prosodic features such as fundamental frequency, duration, and rhythm are predicted to obtain the phonetic annotation corresponding to the text. An acoustic model then maps these front-end parameters to speech parameters, and finally a vocoder synthesizes the speech.
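For orientation, the following is a minimal structural sketch of that three-stage pipeline in Python. Every function here is an illustrative stub (the names, shapes, and placeholder values are assumptions, not the API of WaveNet, DeepVoice, or Tacotron); a real system would substitute trained models for each stage.

```python
import numpy as np

def text_analysis(text):
    # Text analysis: extract text features; characters serve as placeholder tokens.
    return list(text)

def prosody_analysis(tokens):
    # Prosody analysis: predict prosodic features (fundamental frequency,
    # duration, rhythm) per token; constant values stand in for a predictor.
    return [{"f0_hz": 220.0, "duration_s": 0.12} for _ in tokens]

def acoustic_model(prosody):
    # Acoustic analysis: map front-end parameters to speech parameters
    # (here, dummy 80-bin spectral frames, one frame per token).
    return np.zeros((len(prosody), 80))

def vocoder(speech_params, sr=16000):
    # Vocoder: convert speech parameters to a time-domain waveform
    # (silence here; a trained vocoder would generate real audio).
    return np.zeros(speech_params.shape[0] * sr // 100, dtype=np.float32)

first_audio = vocoder(acoustic_model(prosody_analysis(text_analysis("今天天气真好"))))
```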
Step S200, determining an audio segment sequence according to the first audio data.
Specifically, after determining the first audio data to be processed, the server segments it to obtain an audio segment sequence including a plurality of audio segments. Each audio segment has corresponding audio attributes, which characterize the pitch, speech rate, volume, and similar properties of that segment. For example, when the content of the first audio data is "The weather today is really good", the server may determine from the first audio data that the contents of the audio segments in the sequence are "today", "weather", "really", and "good". The first audio data may be stored in different forms: as text data together with the corresponding audio attributes, or directly as a time-domain waveform carrying both the audio content and the audio information. Accordingly, the embodiment of the present invention determines the audio segment sequence differently for first audio data stored in different formats.
Based on the storage forms described above, in an optional implementation of the embodiment of the present invention, when the first audio data is stored as a time-domain waveform, both the audio content and the audio attributes of the first audio data are carried by the waveform. The server can segment the first audio data in the time domain according to a preset segmentation rule to obtain a plurality of waveform files, each carrying its corresponding audio content and audio attributes, as the audio segments, and determine the audio segment sequence according to the position of each audio segment within the first audio data. For example, when the first audio data is the time-domain waveform corresponding to "The weather today is really good", the vertical axis of the waveform represents volume and the horizontal axis reflects speech rate and pitch. The server directly divides the first audio data to obtain time-domain waveforms carrying "today", "weather", "really", and "good" respectively, where the vertical axis of each waveform represents the volume of the corresponding content and the horizontal axis reflects its speech rate and pitch.
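The patent does not fix the "preset segmentation rule"; the sketch below assumes one simple possibility, splitting the time-domain waveform at near-silent stretches. The window length and energy threshold are illustrative values, not taken from the patent.

```python
import numpy as np

def split_on_silence(waveform, sr, win_s=0.02, threshold=1e-3):
    # Split a time-domain waveform into audio segments wherever the
    # short-time energy falls below a threshold (assumed silence).
    win = max(1, int(win_s * sr))
    n_win = len(waveform) // win
    energy = np.array([np.mean(waveform[i * win:(i + 1) * win] ** 2)
                       for i in range(n_win)])
    voiced = energy > threshold
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * win                           # a voiced run begins
        elif not v and start is not None:
            segments.append(waveform[start:i * win])  # run ends at silence
            start = None
    if start is not None:
        segments.append(waveform[start:])
    return segments  # audio segments in positional order
```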
In another optional implementation of the embodiment of the present invention, the first audio data is stored as text data together with the audio attributes corresponding to the text data. In this case, the audio segment sequence is determined from the first audio data by performing word segmentation on the text data and then converting the word segmentation results into time-domain waveforms based on the corresponding audio attributes to obtain the audio segments; the audio segment sequence is then determined according to the position of each audio segment within the first audio data. The word segmentation may be implemented with natural language processing (NLP) techniques, such as dictionary-based, neural-network-based, or character-based word segmentation.
Optionally, the server may train a word segmentation model in advance on a word vector training set, where the training set includes a plurality of word vector sequences and the word vectors corresponding to each sequence. The word segmentation model is trained by taking a word vector sequence from the training set as input and the corresponding word vectors as output. After the first audio data is determined, the text data corresponding to the first audio data is converted into a word vector sequence, the word vector sequence is input into the trained word segmentation model, and a plurality of word vectors are output. The corresponding characters or words are determined from the word vectors and converted into time-domain waveforms through speech synthesis to obtain the audio segments, and the audio segment sequence is determined from the audio segments. For example, when the first audio data is the text data "The weather today is really good" with corresponding audio attributes including volume, speech rate, and pitch, the server inputs the text data into the trained word segmentation model and obtains the word segmentation results "today", "weather", "really", and "good", each with its own volume, speech rate, and pitch. The server then determines the corresponding audio segments from the word segmentation results and their audio attributes to obtain the audio segment sequence.
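As a concrete stand-in for this flow, the sketch below uses the off-the-shelf dictionary-based segmenter jieba in place of the trained word segmentation model, and stubs the synthesis step. Note that jieba's output may differ from the patent's example split (it may, for instance, keep "真好" as one word).

```python
import numpy as np
import jieba  # dictionary-based word segmentation, standing in for the trained model

def synthesize(word, attrs, sr=16000):
    # Stub: a real system would render the word into a time-domain waveform
    # with the given volume / speech-rate / pitch attributes.
    return np.zeros(int(sr * attrs.get("duration_s", 0.3)), dtype=np.float32)

def text_to_segments(text, attrs):
    # Segment the text into words, then convert each word into an audio
    # segment carrying the corresponding audio attributes.
    return [synthesize(word, attrs) for word in jieba.lcut(text)]

segments = text_to_segments("今天天气真好", {"duration_s": 0.3})
```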
Fig. 2 is a schematic diagram of an audio segment according to an embodiment of the present invention. As shown in fig. 2, the audio segment 20 in the embodiment of the present invention is stored as a time-domain waveform, which records the information corresponding to the audio segment 20. The vertical axis is in decibels and represents the volume of the audio segment 20: the higher the peak decibel level of the waveform, the louder the segment; the lower the decibel level, the quieter it is. The horizontal axis is time and reflects the speech rate and pitch of the audio segment 20: the shorter the waveform's duration, the faster the speech rate and the higher the pitch; the longer the duration, the slower the speech rate and the lower the pitch.
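Reading these attributes off a segment stored this way is straightforward; a minimal sketch, assuming peak level measured in dB relative to full scale and duration as a proxy for speech rate:

```python
import numpy as np

def peak_db(segment):
    # Volume: peak level of the time-domain waveform in dBFS
    # (higher peak decibels correspond to a louder segment).
    peak = np.max(np.abs(segment))
    return 20.0 * np.log10(peak) if peak > 0 else -np.inf

def duration_s(segment, sr):
    # Speech rate / pitch proxy: for the same content, a shorter duration
    # implies faster speech and higher pitch, per the description above.
    return len(segment) / sr
```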
Step S300, adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust the audio attribute corresponding to each audio segment to determine a target audio segment sequence.
Specifically, since each audio segment in the audio segment sequence is determined directly or indirectly through speech synthesis, the sound produced by combining the audio segments usually has a mechanical pronunciation and lacks emotion. Therefore, after determining the audio segment sequence, the server adds a disturbance to each audio segment in the sequence according to a preset disturbance rule so as to adjust the corresponding audio attributes; this adds emotional color when the audio segments are combined and output, and improves the realism of the synthesized speech. In the embodiment of the present invention, the server may add the disturbance to each audio segment in the sequence in several different ways.
In an optional implementation of the embodiment of the present invention, the server may add a random disturbance directly to each audio segment, so as to randomly adjust audio attributes such as volume, speech rate, and pitch without changing the content carried by the segment. The random disturbance may be added, for example, as random noise on the time-domain waveform signal of the audio segment. The server determines each adjusted audio segment as a target audio segment and determines the target audio segment sequence based on the position of each audio segment in the audio segment sequence. Because adding random disturbance to a signal is fast, this disturbance method can quickly add emotional color to each audio segment.
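A minimal sketch of this random disturbance, assuming volume is perturbed with a random gain and the waveform with low-level additive noise; the gain range and noise level are illustrative, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng()

def random_perturb(segment, gain_range=(0.9, 1.1), noise_std=1e-3):
    # Randomly adjust volume (gain) and add random noise to the
    # time-domain signal without changing the carried content.
    gain = rng.uniform(*gain_range)
    noise = rng.normal(0.0, noise_std, segment.shape)
    return gain * segment + noise

# Example: a 0.5 s, 220 Hz tone standing in for one audio segment.
sr = 16000
segments = [np.sin(2 * np.pi * 220 * np.arange(sr // 2) / sr).astype(np.float32)]

# Each adjusted segment becomes a target segment; positions are preserved,
# so the list order already gives the target audio segment sequence.
target_sequence = [random_perturb(seg) for seg in segments]
```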
In another optional implementation of the embodiment of the present invention, the server may add the disturbance to each audio segment in the sequence through the following steps:
and S310, determining a disturbance coefficient corresponding to each audio clip according to the corresponding audio attribute.
Specifically, the server may add a targeted disturbance to each audio segment according to its corresponding audio attribute, so as to improve the accuracy of the disturbance result. To disturb each audio segment in a targeted manner, the disturbance coefficient corresponding to each audio segment needs to be determined according to the corresponding audio attribute. In the embodiment of the present invention, determining the disturbance coefficient of each audio segment may further include the following steps:
step S311, determining the current audio segment.
Specifically, when determining the disturbance coefficient corresponding to each audio segment in the sequence, one audio segment in the sequence is first selected as the current audio segment, and its disturbance coefficient is determined. After the disturbance coefficient of the current audio segment is determined, another audio segment whose disturbance coefficient has not yet been determined is selected from the sequence as the new current audio segment, until the disturbance coefficients of all audio segments in the sequence have been determined. Further, the server may preset the order in which current audio segments are selected; for example, the audio segments in the sequence may be selected as the current audio segment in order from front to back.
Optionally, the server may also select a preset number of audio segments in the sequence at the same time, or select all audio segments as current audio segments, so as to determine their disturbance coefficients in parallel and improve the data processing speed.
Step S312, determining a difference between the audio attribute value corresponding to the current audio segment and the audio attribute value corresponding to at least one adjacent audio segment in the audio segment sequence.
Specifically, for the current audio segment, the server determines its corresponding audio attribute value and the audio attribute value of at least one adjacent audio segment in the audio segment sequence. The audio attribute value is the value of at least one audio attribute to be adjusted for the audio segment, and the attribute to be adjusted may be predetermined. For example, when the volume of the audio segment needs to be adjusted, the audio attribute value is the volume value of the segment; when the pitch needs to be adjusted, the audio attribute value is the pitch value of the segment.
After determining the audio attribute value of the current audio segment, the server also determines the audio attribute value of at least one audio segment adjacent to the current audio segment in the sequence. For example, it may determine the audio attribute value of the audio segment positioned before the current segment, of the segment after it, or of the segments both before and after it. The server then calculates the difference between the audio attribute value of the current audio segment and the determined audio attribute value of the adjacent audio segment. For example, when the audio attribute value is volume, the value for the current segment is 80 decibels, and the value for the adjacent audio segment in the sequence is 70 decibels, the calculated difference is 10 decibels.
Step S313, determining a disturbance coefficient corresponding to the current audio segment according to the difference.
Specifically, the server may take the difference obtained in step S312 between the audio attribute value of the current audio segment and that of the adjacent audio segment, and determine the corresponding disturbance coefficient according to the magnitude of this difference. In the embodiment of the present invention, the server may preset a set of disturbance coefficients for each audio attribute, where the set contains a disturbance coefficient for each range of attribute-value differences. Take as an example the set of disturbance coefficients for volume {"-10 to -5: 0.8", "-4 to 0: 0.9", "0 to 5: 1", "6 to 10: 1.1", "11 to 15: 1.2"}. When the audio attribute value is volume, the value for the current segment is 80 decibels, and the value for the adjacent audio segment in the sequence is 70 decibels, the calculated difference is 10 decibels, and the corresponding disturbance coefficient is determined to be 1.1.
Step S320, adding a disturbance to the corresponding audio segment according to each of the disturbance coefficients, so as to adjust the audio attribute corresponding to each of the audio segments to determine a target audio segment sequence.
Specifically, after the disturbance coefficient of each audio segment is determined, a disturbance is added to the corresponding audio segment according to each coefficient so as to adjust the corresponding audio attribute. After each audio segment has been disturbed, the corresponding target audio segment is obtained, and the target audio segment sequence is determined. The disturbance may be applied by directly multiplying the attribute value by the disturbance coefficient. For example, when the server determines that the disturbance coefficient corresponding to the volume value of an audio segment is 1.1, it multiplies the volume value of that segment by 1.1.
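Putting steps S311 to S320 together for the volume example above: the difference to an adjacent segment selects a coefficient from the preset set, and the coefficient is applied as a gain. Treating the range bounds as inclusive (with the first match winning where ranges touch at 0) is an assumption, as is leaving differences outside every range unchanged.

```python
import numpy as np

# Preset disturbance-coefficient set for volume, from the example above:
# each entry maps a difference range (in dB) to a coefficient.
VOLUME_COEFFS = [((-10, -5), 0.8), ((-4, 0), 0.9), ((0, 5), 1.0),
                 ((6, 10), 1.1), ((11, 15), 1.2)]

def disturbance_coefficient(current_db, neighbor_db, coeff_set=VOLUME_COEFFS):
    diff = current_db - neighbor_db  # e.g. 80 dB - 70 dB = 10 dB
    for (lo, hi), coeff in coeff_set:
        if lo <= diff <= hi:
            return coeff
    return 1.0  # assumption: out-of-range differences leave the segment unchanged

def apply_volume_disturbance(segment, coeff):
    # Adjust volume by multiplying the whole waveform by the coefficient.
    return coeff * segment

coeff = disturbance_coefficient(80.0, 70.0)  # -> 1.1, matching the example
```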
This disturbance method determines the disturbance coefficient of each audio segment in a targeted manner before adding the disturbance, and can therefore improve the effect of adding emotional color to each audio segment.
Step S400, splicing each target audio segment in the target audio segment sequence to determine second audio data.
Specifically, the second audio data is a time-domain waveform signal. After the target audio segment sequence is determined, the target audio segments are spliced according to their order in the audio segment sequence to determine the second audio data. Because the time-domain waveform of each target audio segment has been processed multiple times after being generated through speech synthesis, the junctions between target audio segments may not be smooth. Therefore, in the embodiment of the present invention, the second audio data may be determined by first splicing the target audio segments in the sequence to obtain candidate audio data and then smoothing the candidate audio data. Optionally, the smoothing may be performed by first-order low-pass filtering, complementary filtering, Kalman filtering, or the like.
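A minimal sketch of this splicing-and-smoothing step, using the first-order low-pass option named above (y[n] = a*x[n] + (1 - a)*y[n-1]); the filter coefficient is an illustrative assumption, and complementary or Kalman filtering could be substituted:

```python
import numpy as np

def splice_and_smooth(target_segments, alpha=0.2):
    # Splice the target segments in sequence order to form candidate
    # audio data, then smooth the junctions with a first-order low-pass
    # filter: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1].
    candidate = np.concatenate(target_segments)
    smoothed = np.empty_like(candidate)
    y = 0.0
    for n, x in enumerate(candidate):
        y = alpha * x + (1.0 - alpha) * y
        smoothed[n] = y
    return smoothed  # second audio data as a time-domain waveform
```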
Fig. 3 is a schematic diagram of an audio segment splicing process according to an embodiment of the present invention. As shown in fig. 3, when the determined target audio segment sequence includes a first target audio segment 30 and a second target audio segment 31, the first target audio segment 30 and the second target audio segment 31 are first spliced to obtain candidate audio data 32, and then the candidate audio data 32 is further smoothed to obtain corresponding second audio data 33.
Thus, through step S400, the embodiment of the present invention can output a second audio signal with emotional color. Take human-computer voice interaction in online education as an example: a student inputs a text through a student terminal; after receiving the text, the server of the online education platform determines the corresponding answer text, determines the first audio data corresponding to the answer text through speech synthesis, and adds disturbances to the audio segment sequence determined from the first audio data. Finally, the server determines the second audio signal from the disturbed target audio segment sequence, returns it to the student terminal, and the student terminal outputs it through its loudspeaker.
According to the audio processing method described above, the audio data is segmented into a plurality of audio segments with corresponding audio attributes, a disturbance is added to each audio segment to adjust audio attributes such as pitch, volume, and speech rate, emotional color is added to the audio data determined from the adjusted audio segments, and the realism of the synthesized speech is improved. Moreover, the method is simple and fast, and can perform speech processing in real time during human-computer voice interaction.
Fig. 4 is a schematic diagram of an audio processing apparatus according to an embodiment of the present invention, as shown in fig. 4, the audio processing apparatus includes a first audio determining module 40, a word segmentation module 41, an adjusting module 42, and a second audio determining module 43.
In particular, the first audio determining module 40 is configured to determine first audio data. The word segmentation module 41 is configured to determine an audio segment sequence according to the first audio data, where the audio segment sequence includes at least one audio segment with a corresponding audio attribute. The adjusting module 42 is configured to add a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust an audio attribute corresponding to each audio segment to determine a target audio segment sequence. The second audio determining module 43 is configured to splice each of the target audio segments in the sequence of target audio segments to determine second audio data.
The audio processing apparatus of the embodiment of the present invention segments the audio data into a plurality of audio segments with corresponding audio attributes, adds a disturbance to each audio segment to adjust audio attributes such as pitch, volume, and speech rate, adds emotional color to the audio data determined from the adjusted audio segments, and improves the realism of the synthesized speech.
Fig. 5 is a schematic diagram of an electronic device according to an embodiment of the invention. As shown in fig. 5, the electronic device is a general-purpose data processing device comprising a general computer hardware structure, which includes at least a processor 50 and a memory 51 connected by a bus 52. The memory 51 is adapted to store instructions or programs executable by the processor 50. The processor 50 may be a stand-alone microprocessor or a collection of one or more microprocessors. Thus, the processor 50 executes the instructions stored in the memory 51 to process data and control other devices, thereby performing the method flows of the embodiments of the present invention described above. The bus 52 connects the above components together and also connects them to a display controller 53, a display device, and input/output (I/O) devices 54. The input/output (I/O) devices 54 may be a mouse, keyboard, modem, network interface, touch input device, motion sensing input device, printer, or other devices known in the art. Typically, the input/output devices 54 are connected to the system through an input/output (I/O) controller 55.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus (device) or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may employ a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations of methods, apparatus (devices) and computer program products according to embodiments of the application. It will be understood that each flow in the flow diagrams can be implemented by computer program instructions.
These computer program instructions may be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in the flowchart flow or flows.
These computer program instructions may also be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus create means for implementing the functions specified in the flowchart flow or flows.
Another embodiment of the invention is directed to a non-transitory storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described method embodiments.
That is, as those skilled in the art will understand, all or part of the steps of the methods in the embodiments described above may be accomplished by instructing the relevant hardware through a program. The program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An audio processing method, characterized in that the method comprises:
determining first audio data;
determining an audio segment sequence according to the first audio data, wherein the audio segment sequence includes at least one audio segment with a corresponding audio attribute;
adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust the audio attribute corresponding to each audio segment to determine a target audio segment sequence;
splicing each target audio segment in the target audio segment sequence to determine second audio data.

2. The method according to claim 1, characterized in that the determining an audio segment sequence according to the first audio data comprises:
determining a word vector sequence corresponding to the first audio data;
performing word segmentation processing based on the word vector sequence to determine audio segments corresponding to a plurality of word vectors;
determining the audio segment sequence according to each of the audio segments.

3. The method according to claim 1, characterized in that the audio attribute includes at least one of pitch, volume, and speech rate.

4. The method according to claim 1, characterized in that the adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust the audio attribute corresponding to each audio segment to determine a target audio segment sequence comprises:
determining a disturbance coefficient corresponding to each audio segment according to the corresponding audio attribute;
adding a disturbance to the corresponding audio segment according to each disturbance coefficient, so as to adjust the audio attribute corresponding to each audio segment to determine the target audio segment sequence.

5. The method according to claim 4, characterized in that the determining a disturbance coefficient corresponding to each audio segment according to the corresponding audio attribute comprises:
determining a current audio segment;
determining a difference between the audio attribute value corresponding to the current audio segment and the audio attribute value corresponding to at least one adjacent audio segment in the audio segment sequence;
determining the disturbance coefficient corresponding to the current audio segment according to the difference.

6. The method according to claim 1, characterized in that the adding a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust the audio attribute corresponding to each audio segment to determine a target audio segment sequence is specifically:
randomly disturbing each audio segment in the audio segment sequence to adjust the audio attribute corresponding to each audio segment to determine the target audio segment sequence.

7. The method according to claim 1, characterized in that the splicing each target audio segment in the target audio segment sequence to determine second audio data comprises:
splicing each target audio segment in the target audio segment sequence to determine candidate audio data;
smoothing the candidate audio data to determine the second audio data.

8. An audio processing apparatus, characterized in that the apparatus comprises:
a first audio determining module, configured to determine first audio data;
a word segmentation module, configured to determine an audio segment sequence according to the first audio data, wherein the audio segment sequence includes at least one audio segment with a corresponding audio attribute;
an adjusting module, configured to add a disturbance to each audio segment in the audio segment sequence according to a preset disturbance rule, so as to adjust the audio attribute corresponding to each audio segment to determine a target audio segment sequence;
a second audio determining module, configured to splice each target audio segment in the target audio segment sequence to determine second audio data.

9. A computer-readable storage medium for storing computer program instructions, characterized in that the computer program instructions, when executed by a processor, implement the method according to any one of claims 1-7.

10. An electronic device, comprising a memory and a processor, characterized in that the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method according to any one of claims 1-7.
CN202011062271.1A 2020-09-30 2020-09-30 Audio processing method, device, readable storage medium and electronic equipment Active CN112185338B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011062271.1A CN112185338B (en) 2020-09-30 2020-09-30 Audio processing method, device, readable storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011062271.1A CN112185338B (en) 2020-09-30 2020-09-30 Audio processing method, device, readable storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN112185338A 2021-01-05
CN112185338B 2024-01-23

Family

ID=73948833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011062271.1A Active CN112185338B (en) 2020-09-30 2020-09-30 Audio processing method, device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112185338B (en)


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0635492A (en) * 1992-05-20 1994-02-10 Sanyo Electric Co Ltd Speech synthesizing method
KR20010027891A (en) * 1999-09-16 2001-04-06 정선종 A method for analyzing synthetic speech by using graphic user interface
US20040024600A1 (en) * 2002-07-30 2004-02-05 International Business Machines Corporation Techniques for enhancing the performance of concatenative speech synthesis
US20070192105A1 (en) * 2006-02-16 2007-08-16 Matthias Neeracher Multi-unit approach to text-to-speech synthesis
US20160260425A1 (en) * 2015-03-05 2016-09-08 Yamaha Corporation Voice Synthesis Method, Voice Synthesis Device, Medium for Storing Voice Synthesis Program
WO2017028003A1 (en) * 2015-08-14 2017-02-23 华侃如 Hidden markov model-based voice unit concatenation method
CN107566952A (en) * 2016-07-01 2018-01-09 北京小米移动软件有限公司 Acoustic signal processing method and device
WO2019033943A1 (en) * 2017-08-18 2019-02-21 Oppo广东移动通信有限公司 Volume adjusting method and device, mobile terminal and storage medium
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109599090A (en) * 2018-10-29 2019-04-09 阿里巴巴集团控股有限公司 A kind of method, device and equipment of speech synthesis
CN109616094A (en) * 2018-12-29 2019-04-12 百度在线网络技术(北京)有限公司 Speech synthesis method, device, system and storage medium
CN110189754A (en) * 2019-05-29 2019-08-30 腾讯科技(深圳)有限公司 Voice interactive method, device, electronic equipment and storage medium
CN111276119A (en) * 2020-01-17 2020-06-12 平安科技(深圳)有限公司 Voice generation method and system and computer equipment
CN111415650A (en) * 2020-03-25 2020-07-14 广州酷狗计算机科技有限公司 Text-to-speech method, device, equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114203150A (en) * 2021-11-26 2022-03-18 南京星云数字技术有限公司 Voice data processing method and device

Also Published As

Publication number Publication date
CN112185338B (en) 2024-01-23

Similar Documents

Publication Publication Date Title
KR102582291B1 (en) Emotion information-based voice synthesis method and device
CN106898340B (en) Song synthesis method and terminal
KR20240081458A (en) Method and system for generating synthesis voice for text via user interface
CN104538024B (en) Phoneme synthesizing method, device and equipment
KR102493141B1 (en) Method and system for generating object-based audio content
CN108847215B (en) Method and device for voice synthesis based on user timbre
CN112309365A (en) Training method, device, storage medium and electronic device for speech synthesis model
US10217454B2 (en) Voice synthesizer, voice synthesis method, and computer program product
Bulut et al. On the robustness of overall F0-only modifications to the perception of emotions in speech
CN105957515A (en) Voice Synthesis Method, Voice Synthesis Device, Medium for Storing Voice Synthesis Program
CN112992109B (en) Auxiliary singing system, auxiliary singing method and non-transient computer readable recording medium
US11842719B2 (en) Sound processing method, sound processing apparatus, and recording medium
KR20220165666A (en) Method and system for generating synthesis voice using style tag represented by natural language
CN112185341A (en) Dubbing method, apparatus, device and storage medium based on speech synthesis
JP7036014B2 (en) Speech processing equipment and methods
Story et al. A model of speech production based on the acoustic relativity of the vocal tract
JP2021101252A (en) Information processing method, information processing apparatus, and program
US11195511B2 (en) Method and system for creating object-based audio content
CN112185338B (en) Audio processing method, device, readable storage medium and electronic equipment
JP6728116B2 (en) Speech recognition device, speech recognition method and program
CN112750422B (en) Singing voice synthesis method, device and equipment
JP6314884B2 (en) Reading aloud evaluation device, reading aloud evaluation method, and program
CN116825085A (en) Speech synthesis method, device, computer equipment and medium based on artificial intelligence
Lu et al. Automatic stress exaggeration by prosody modification to assist language learners perceive sentence stress
CN112634861A (en) Data processing method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20250109
Address after: No. 902, 9th Floor, Unit 2, Building 1, No. 333 Jiqing 3rd Road, Chengdu High tech Zone, Chengdu Free Trade Zone, Sichuan Province 610000
Patentee after: Chengdu Yudi Technology Co.,Ltd.
Country or region after: China
Address before: 2223, 2nd floor, building 23, 18 anningzhuang East Road, Qinghe, Haidian District, Beijing, 100142
Patentee before: BEIJING DA MI TECHNOLOGY Co.,Ltd.
Country or region before: China