
CN118918912A - Singing voice synthesizing method, singing voice synthesizing equipment and computer readable storage medium - Google Patents


Info

Publication number
CN118918912A
CN118918912A
Authority
CN
China
Prior art keywords
voice
data
target
subtask
voice data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411153076.8A
Other languages
Chinese (zh)
Inventor
李博文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202411153076.8A priority Critical patent/CN118918912A/en
Publication of CN118918912A publication Critical patent/CN118918912A/en
Pending legal-status Critical Current

Landscapes

  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The application discloses a singing voice synthesis method, device and computer readable storage medium in the field of computer technology, comprising the following steps: splitting voice data according to a fixed frame length to obtain voice subtasks, where the fixed frame length is smaller than a preset maximum frame-length threshold, that threshold is the longest singing voice reasoning synthesis duration supported by the edge device, and the voice data is the data of the vocal parts of a song; and sequentially carrying out reasoning synthesis on the target tone and the voice subtasks to obtain target tone voice data. Unlike the prior art, in which singing voice synthesis cannot be performed on an edge device, the application divides the voice data into multiple voice subtasks according to the fixed frame length, so that the whole singing voice synthesis can run on the edge device; and because the target tone is reasoning-synthesized with the voice subtasks in sequence, each subtask can be played as soon as its reasoning synthesis is complete, enabling playback while synthesis continues and effectively reducing the playback waiting time.

Description

Singing voice synthesizing method, singing voice synthesizing equipment and computer readable storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a singing voice synthesis method, apparatus, and computer readable storage medium.
Background
Singing voice synthesis is a technique that replaces the timbre of target audio with the user's timbre. With the development of the technology, singing voice synthesis has evolved from early harsh, mechanical-sounding output to increasingly natural and realistic AI (Artificial Intelligence) voices. The technique allows users to apply their own timbre to a selected music track, enabling a personalized singing experience.
Because model loading and reasoning for AI singing require GPU (Graphics Processing Unit) support, the initial approach is to perform reasoning synthesis in the background. The user initiates a request on the mobile phone, the background schedules and queues the request according to the currently available device resources, and the synthesized result is delivered to the user. Given the current shortage of GPU computing power, the complete background playback link almost always takes more than 8 s, and the waiting time grows further when the network is poor or the number of users is large.
Therefore, when singing voice is synthesized on the GPU, the technical problems of long synthesis time and long playback waiting time arise whenever GPU computing power is strained or the network is poor.
Disclosure of Invention
Accordingly, the present invention is directed to a singing voice synthesizing method, apparatus, device and computer readable storage medium, which solve the technical problem in the prior art that the playback waiting time is long when computing power is strained and the network is poor.
In order to solve the technical problems, the invention provides a singing voice synthesizing method, which comprises the following steps:
Splitting the voice data according to the fixed frame length to obtain a voice subtask; the fixed frame length is smaller than a preset frame length maximum threshold, and the preset frame length maximum threshold is the longest singing voice reasoning synthesis time supported by the edge equipment; the voice data is the data of voice parts in songs;
Sequentially carrying out reasoning synthesis on the target tone and the voice subtasks to obtain target tone voice data;
And inserting a mute task between each segment of voice data in the target tone voice data to obtain the target synthesized singing voice.
Optionally, before splitting the voice data according to the fixed frame length to obtain the voice subtask, the method further includes:
Cutting the resource file according to the audio characteristics to obtain a voice part;
And merging the voice parts to obtain the voice data.
Optionally, after splitting the voice data according to the fixed frame length to obtain the voice subtask, the method further includes:
Combining the voice subtasks to obtain combined voice data;
performing fade-in fade-out processing on the spliced part in the combined voice data to obtain a target voice subtask;
Correspondingly, the step of sequentially carrying out reasoning synthesis on the target tone and the voice subtask to obtain target tone voice data comprises the following steps:
and carrying out reasoning synthesis on the target tone and the target voice subtasks in sequence to obtain the target tone voice data.
Optionally, the step of sequentially performing reasoning synthesis on the target tone and the voice subtask to obtain target tone voice data includes:
Sequentially carrying out reasoning synthesis on the target tone and each voice subtask to obtain a plurality of initial segment tone data;
Performing root mean square processing on the initial segment tone data according to the timestamp information of the voice subtasks until the root mean square processing of all the voice subtasks is completed, to obtain an inference root mean square; the root mean square processing squares the voice subtask element by element over a preset time length, sums the squares, and takes the square root;
According to the comparison result of the target root mean square and the reasoning root mean square, adjusting the plurality of initial segment tone data to obtain the target tone voice data; the target root mean square is the root mean square corresponding to the original singing voice data.
Optionally, the adjusting the plurality of initial segment timbre data according to the comparison result of the target root mean square and the reasoning root mean square to obtain the target timbre voice data includes:
determining a scale factor according to the target root mean square and the reasoning root mean square;
and carrying out loudness balance adjustment on the plurality of initial segment tone data according to the scale factors to obtain the target tone voice data.
Optionally, after splitting the voice data according to the fixed frame length to obtain the voice subtask, the method further includes:
and connecting each voice subtask by using a circular doubly linked list to obtain a voice subtask doubly linked list.
Optionally, the method further comprises:
When a file pointer moving operation occurs, moving the file pointer to the corresponding target voice subtask in the voice subtask doubly-linked list according to the time stamp; the file pointer is a variable that marks the position of a time stamp, each subtask in the voice subtask doubly-linked list corresponds to a time stamp, and the time stamp is a number marking the time node where the voice subtask is located.
Optionally, splitting the voice data according to the fixed frame length to obtain a voice subtask includes:
and splitting the voice data in the half-precision floating point number data format according to the fixed frame length to obtain the voice subtask.
Optionally, the fixed frame length is greater than or equal to a preset shortest frame length.
The application also provides a singing voice synthesizing device which is applied to the edge equipment and comprises:
the voice data splitting module is used for splitting voice data according to the fixed frame length to obtain voice subtasks; the fixed frame length is smaller than a preset frame length maximum threshold, and the preset frame length maximum threshold is the longest singing voice reasoning synthesis time supported by the edge equipment; the voice data is the data of voice parts in songs;
The successive reasoning synthesis module is used for carrying out reasoning synthesis on the target tone and the voice subtasks in sequence to obtain target tone voice data;
And the mute task inserting module is used for inserting a mute task between each segment of voice data in the target tone voice data to obtain the target synthesized singing voice.
The present application also provides a singing voice synthesizing apparatus including:
A memory for storing a computer program;
and a processor for implementing the steps of the singing voice synthesizing method as described above when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the singing voice synthesizing method as described above.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the singing voice synthesis method described above.
Therefore, the voice subtasks are obtained by splitting the voice data according to a fixed frame length; the fixed frame length is smaller than a preset maximum frame-length threshold, which is the longest singing voice reasoning synthesis duration supported by the edge device; the voice data is the data of the vocal parts in a song; the target tone is then reasoning-synthesized with the voice subtasks in sequence to obtain target tone voice data; and a mute task is inserted between the segments of voice data in the target tone voice data to obtain the target synthesized singing voice. Because the voice data is divided into multiple voice subtasks whose fixed frame length is smaller than the preset maximum frame-length threshold, the memory peak stays below the maximum memory peak of the edge device, so the reasoning of the whole singing voice synthesis can be moved onto the edge device; and because the target tone is reasoning-synthesized with the voice subtasks in sequence, playback can start as soon as one subtask has been synthesized, i.e. playing while synthesizing, which effectively reduces the playback waiting time and increases the playback startup speed.
In addition, the invention also provides singing voice synthesizing device, equipment and a computer readable storage medium, which also have the beneficial effects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a singing voice synthesizing method provided by an embodiment of the invention;
FIG. 2 is a diagram showing a comparison of reasoning results provided by the embodiment of the present invention;
fig. 3 is a schematic diagram of a fade-in and fade-out process according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a singing voice synthesizing method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of merging of vocal parts according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a singing voice synthesizing apparatus according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of singing voice synthesizing apparatus according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Some of the terms appearing in describing embodiments of the present application are applicable to the following explanation:
SVC: singing Voice Conversion singing voice conversion algorithm;
Limiter: performing pressure limiting treatment;
VUV voice unvoiced, speech and silence;
RMS: root Mean Square, root Mean Square;
rate: rate is the ratio of rms_ infer, representing the scale factor;
FP16: a half-precision floating point number data format.
Because model loading and reasoning for singing synthesis require GPU support, the initial approach was to synthesize in the background. The user initiates a request on the mobile phone, the background schedules and queues the request according to the currently available device resources, and the result is delivered to the user. Given the current shortage of GPU computing power, moving the reasoning onto the edge device effectively removes this part's dependence on GPU computing power and reduces cost.
Meanwhile, the complete background playback link almost always takes more than 8 s, and the waiting time is even longer when network conditions are poor or the number of users is large.
User initiates request -> background scheduling -> reasoning synthesis -> user downloads the work -> play.
When the whole reasoning process is performed on the edge device, the waiting time from the user's click to playback can be compressed to about 1-3 s:
User initiates request -> downloads the resource -> infers the first segment -> plays [background continues reasoning].
The main difficulty of reasoning synthesis on edge devices (devices close to the data source that can perform local computation and data processing) is that their computing power is far lower than that of a GPU, so the computation cannot be done in one pass as in the background. Meanwhile, the resource files on which edge-device reasoning depends grow in proportion to song length; the template of a single song can reach tens of megabytes, and if downloaded directly, the bandwidth cost would far exceed the GPU cost saved.
Referring to fig. 1, fig. 1 is a flowchart of a singing voice synthesizing method according to an embodiment of the invention. The method may include:
S101, splitting voice data according to a fixed frame length to obtain a voice subtask; the fixed frame length is smaller than a preset frame length maximum threshold, and the preset frame length maximum threshold is the longest singing voice reasoning synthesis time supported by the edge equipment; the voice data is data of a voice portion in the song.
The execution subject of this embodiment is an edge device, i.e. a device near the data source at the edge of the network that can perform local computation and data processing. It corresponds to edge computing, in which data collected from terminals is analyzed directly on local devices or in the network close to where the data is generated, without being transmitted to a cloud data-processing center. This embodiment does not limit the specific method of obtaining the voice data; it may be obtained according to audio characteristics, i.e. features that distinguish the human voice from other parts of the audio. For example, the voice data may be obtained according to pitch, or according to loudness. A voice subtask in this embodiment is a task obtained by dividing a continuous voice signal into short segments (frames) of fixed length when processing the voice data. To facilitate traversal of the subtasks, they may be connected using an ordered data structure such as an array or a linked list. The voice data may be converted into the half-precision floating-point format after being split by the fixed frame length, or converted first and then split, so as to reduce the resource volume. The fixed frame length may be greater than a minimum frame-length threshold, which is the duration that keeps the singing voice reasoning synthesis stable on the edge device, and must be smaller than the preset maximum frame-length threshold, which is the longest singing voice synthesis duration supported by the edge device; for example, the preset maximum frame-length threshold in this embodiment may be 800 frames (8 s). The voice data in this embodiment is the data of the vocal parts in a song. The fixed frame length is greater than or equal to a preset shortest frame length, which is not limited here; for example, it may be 300 frames, or 400 frames.
It should be noted that splitting the voice data by a fixed frame length decomposes the task, and the results are merged gradually through segment-by-segment reasoning. The background reasoning scheme splits a song into 25 s pieces and splices them into a full song after reasoning synthesis. The edge device cannot simply copy this logic: inferring 25 s of data at once would push the memory peak of the edge device above 1.8 GB and raise the CPU (Central Processing Unit) usage by far more than 50%, well beyond the online performance baseline. Since the reasoning overhead of the model is proportional to the length of the input tensor, a single long task is split into multiple subtasks for reasoning. A problem then arises: the separately synthesized subtasks show discontinuities when spliced directly, which introduces noise. One candidate solution is to cut at sentence granularity with reference to vuv (voiced/unvoiced) boundaries, so that every subtask is separated by silence; two subtasks can then be concatenated directly without the splice introducing noise. This option was eventually discarded because: a. the existing resource files cannot be segmented accurately by sentence, and the time-axis labels contain errors; b. sentence length is not controllable: a typical single sentence lasts about 6-10 s, but a sentence ending in a long vowel may stretch longer. Combined with the performance-test requirements, the edge device supports at most about 8 s of content per reasoning pass; the 8 s here refers to the maximum duration of a single vocal reasoning pass and does not include silence insertion, and post-processing such as limiting, applied after the vocal has been inferred, has no duration requirement. The final choice is therefore to cut directly by a fixed frame length. In a real scenario the user clicks a button to start playback, and with all resource files ready the waiting time is approximately the reasoning time of the first subtask; in theory, the shorter a single task, the faster playback starts. In practice, however, when subtasks are too short the reasoning result becomes unstable. For ease of understanding, please refer to fig. 2, a comparison of reasoning results provided by an embodiment of the invention, in which the first row shows one-shot reasoning and the following rows show reasoning with 100-frame, 200-frame and 300-frame segments; the spectrogram comparison in the last column shows that the longer the duration, the more stable the result. Comparison tests found that the length of a single reasoning pass is preferably not shorter than 300 frames (3 s), i.e. greater than or equal to 300 frames.
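As a concrete illustration of the splitting step, the sketch below divides a per-frame vocal feature sequence into fixed-length subtasks; the 300/800-frame bounds come from the constraints described above, while the function name, the default length and the feature layout are illustrative assumptions, not the patent's actual implementation.

```python
import numpy as np

MIN_FRAMES = 300   # below this, reasoning quality becomes unstable (about 3 s at 10 ms/frame)
MAX_FRAMES = 800   # longest single reasoning pass supported by the edge device (about 8 s)

def split_into_subtasks(voice_frames: np.ndarray, fixed_len: int = 500):
    """Split a per-frame vocal feature sequence into fixed-length subtasks.

    voice_frames: array of shape (num_frames, feature_dim), e.g. ppg features.
    fixed_len:    chosen frame length; must satisfy MIN_FRAMES <= fixed_len <= MAX_FRAMES.
    """
    assert MIN_FRAMES <= fixed_len <= MAX_FRAMES, "fixed frame length out of supported range"
    subtasks = []
    for start in range(0, len(voice_frames), fixed_len):
        chunk = voice_frames[start:start + fixed_len]
        # each subtask keeps its start frame so timestamps can be recovered later
        subtasks.append({"start_frame": start, "frames": chunk})
    return subtasks
```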
It should be further noted that, to reduce the volume of the resource files, splitting the voice data according to the fixed frame length to obtain the voice subtasks may include: splitting voice data already in the half-precision floating-point format according to the fixed frame length to obtain the voice subtasks. The voice data in this embodiment may include the pitch file and the ppg file used for singing voice synthesis; the resource files each song depends on for singing voice synthesis are the ppg (Phonetic Posteriorgram, speech posterior probability) and pitch files. The quantization is applied to the whole resource file; experiments were run with float32 (32-bit floating point) and float16 (FP16, a half-precision floating-point type that stores each value in 16 bits, i.e. 2 bytes). Statistics over 100 randomly sampled background songs show the float32 values lie within [-64, 64], so a fixed-point format would need at least 7 integer bits, e.g. [sign: 1 bit][int: 6 bit][fraction: 9 bit]; float16 performs better than such fixed-point quantization and retains higher accuracy. After switching to float16 the audible effect hardly changes, the amplitude and frequency statistics are essentially consistent, and the spectral differences are small. The existing compression already removes 15%-25% of the volume through splitting, float16 halves it again on that basis, and a ppg file of about 50 MB can be compressed to 15-20 MB.
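A minimal sketch of the float16 quantization described above, assuming the ppg/pitch resources are stored as flat float32 arrays; the file-handling details and function name are illustrative.

```python
import numpy as np

def compress_resource(path_in: str, path_out: str) -> None:
    """Convert a float32 resource file (e.g. ppg) to float16, roughly halving its size."""
    data = np.fromfile(path_in, dtype=np.float32)
    # sampled songs stayed within [-64, 64], well inside float16's representable range
    data.astype(np.float16).tofile(path_out)
```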
It should be further noted that, to improve the quality of the determined voice data, before splitting the voice data according to the fixed frame length the method may further include: clipping the resource file according to an audio characteristic to obtain the vocal parts, and merging the vocal parts to obtain the voice data. Pitch extraction exploits the frequency difference between the human voice and background music or other sounds. The specific audio characteristic is not limited, as long as the vocal part can be obtained by clipping according to it. For example, the audio characteristic in this embodiment may be pitch, one of the basic attributes of sound, namely how high or low a sound is. Clipping the resource file according to pitch means distinguishing the vocal part from the silent part according to pitch and cutting out the vocal part. Advantages of this approach include: improved clarity, because extracting sound in a specific pitch range highlights the voice and reduces interfering background sounds; preserved quality, because unwanted frequency components can be removed or reduced without harming the detail of the original vocal recording; flexibility, because the pitch extraction can be adjusted for different audio material and suits many sound scenes and musical styles; and time savings, because automatically extracting the vocal part by pitch is much faster than manually editing and separating the voice. Extracting the vocal part by pitch therefore improves clarity while preserving tone quality.
It should be further noted that, to improve the singing voice synthesis result, after splitting the voice data according to the fixed frame length to obtain the voice subtasks the method may further include: merging the voice subtasks to obtain merged voice data, and performing fade-in/fade-out processing on the splices in the merged voice data to obtain target voice subtasks. Correspondingly, the step of reasoning-synthesizing the target tone with the voice subtasks in sequence then becomes reasoning-synthesizing the target tone with the target voice subtasks in sequence to obtain the target tone voice data. A splice in this embodiment is the junction between two adjacent voice subtasks. A certain overlap is left at the splice for fading: the preceding and following vocal segments share an overlapping region of a certain length, and fading over that overlap makes the join sound natural, whereas without it the voice contains noise. Fading has two modes, Soft (smooth) and Hard (abrupt). Let the lengths of fadeIn (the transition from transparent to opaque, i.e. fade-in) and fadeOut (the transition from opaque to transparent, i.e. fade-out) both be N, where N is a non-negative integer and i = 0..N; for ease of understanding, please refer to fig. 3, a schematic diagram of the fade-in/fade-out processing provided by an embodiment of the invention. The Soft and Hard modes each define a pair of fade coefficient arrays, where fadeIn denotes the array of fade-in coefficients, fadeIn[i] its i-th value, fadeOut the array of fade-out coefficients, N the length (i.e. number of values) of the fade array, and the Soft form uses the constant pi. By fading the splices in and out, this embodiment improves the naturalness of the transitions and thereby the quality of the synthesized singing voice.
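The exact Soft/Hard coefficient formulas are not reproduced in the text above, so the sketch below uses a common choice purely as an assumption: a raised-cosine ramp for Soft (hence the pi) and a linear ramp for Hard, applied to the overlapped samples of two adjacent subtasks.

```python
import numpy as np

def fade_coefficients(n: int, mode: str = "soft"):
    """Return fadeIn/fadeOut arrays of length n + 1 (i = 0..N).

    'soft' uses a raised-cosine ramp and 'hard' a linear ramp -- assumed shapes,
    since the patent's exact formulas are not given here.
    """
    i = np.arange(n + 1)
    if mode == "soft":
        fade_in = 0.5 * (1.0 - np.cos(np.pi * i / n))
    else:  # "hard"
        fade_in = i / n
    fade_out = 1.0 - fade_in
    return fade_in, fade_out

def crossfade(prev_tail: np.ndarray, next_head: np.ndarray, mode: str = "soft"):
    """Blend the overlapping samples of two adjacent vocal segments."""
    assert len(prev_tail) == len(next_head) and len(prev_tail) > 1
    fade_in, fade_out = fade_coefficients(len(prev_tail) - 1, mode)
    return prev_tail * fade_out + next_head * fade_in
```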
It should be further noted that, to improve the traversal efficiency of the singing voice synthesis, after splitting the voice data according to the fixed frame length to obtain the voice subtasks the method may further include: connecting the voice subtasks with a circular doubly-linked list to obtain a voice subtask doubly-linked list. A circular doubly-linked list in this embodiment is a doubly-linked list with a special structure: the next pointer of the tail node points to the first node, and the prev pointer of the first node points to the tail node, forming a closed loop. Thanks to this bidirectionality, the list can be traversed from either direction, which improves traversal efficiency.
It should be further noted that, to improve the efficiency of locating a target voice subtask, the singing voice synthesis method may further include: when a file pointer moving operation occurs, moving the file pointer to the corresponding target voice subtask in the voice subtask doubly-linked list according to the time stamp. The file pointer in this embodiment is a variable that marks the position of a time stamp; a moving operation means that the position of the file pointer has changed. Each subtask in the voice subtask doubly-linked list corresponds to a time stamp, and the time stamp is a number marking the time node where the voice subtask is located. Because the file pointer can be moved to the corresponding target voice subtask in the doubly-linked list according to the time stamp, the efficiency of locating the target voice subtask is improved.
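To make the subtask organization in the two paragraphs above concrete, here is a minimal sketch of a circular doubly-linked list of vocal subtasks with a timestamp-based seek; the class and field names are assumptions for illustration, not the patent's data structures.

```python
class SubtaskNode:
    def __init__(self, timestamp: float, data):
        self.timestamp = timestamp  # time node (seconds) where this vocal subtask starts
        self.data = data
        self.prev = self            # a single node is its own circular list
        self.next = self

def build_circular_list(subtasks):
    """Link (timestamp, data) pairs into a circular doubly-linked list; return the head node."""
    head = None
    for ts, data in subtasks:
        node = SubtaskNode(ts, data)
        if head is None:
            head = node
        else:
            tail = head.prev
            tail.next, node.prev = node, tail
            node.next, head.prev = head, node
    return head

def seek(head: SubtaskNode, target_ts: float) -> SubtaskNode:
    """Move the 'file pointer' to the subtask covering target_ts."""
    node = head
    while node.next is not head and node.next.timestamp <= target_ts:
        node = node.next
    return node
```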
S102, sequentially carrying out reasoning synthesis on the target tone and the voice subtasks to obtain target tone voice data.
The target tone in this embodiment is the determined tone parameter to be reasoning-synthesized with the vocal part of the corresponding song; it refers to the character and quality of a sound, determined by factors such as frequency, amplitude, waveform and timing. The specific source of the target tone is not limited: it may come from reference audio, from specified tone parameters, or be generated by a machine learning model. By reasoning-synthesizing the target tone with the voice subtasks in sequence, the subtasks can be inferred and played segment by segment. The preset maximum frame-length threshold used for the reasoning synthesis in this embodiment is the longest reasoning-synthesis frame length supported by the edge device.
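A schematic sketch of the "synthesize one subtask, then play it while the next is inferred" behaviour described above; infer_segment and play_async are placeholders standing in for the actual model call and audio sink, so the whole block is an assumption-level illustration rather than the patent's implementation.

```python
import queue
import threading

def synthesize_and_play(subtasks, target_timbre, infer_segment, play_async):
    """Infer subtasks one by one and hand each finished segment to the player.

    infer_segment(timbre, subtask) -> waveform   (placeholder for the model call)
    play_async(waveform)                         (placeholder for the audio sink)
    """
    buffer = queue.Queue()

    def producer():
        for task in subtasks:
            buffer.put(infer_segment(target_timbre, task))  # reasoning synthesis per subtask
        buffer.put(None)  # end marker

    threading.Thread(target=producer, daemon=True).start()
    while True:
        segment = buffer.get()
        if segment is None:
            break
        play_async(segment)  # playback starts as soon as the first segment is ready
```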
It should be further noted that, in order to optimize the effect of singing voice synthesis, the above-mentioned reasoning synthesis is performed on the target timbre and the voice subtask in sequence to obtain target timbre voice data, which may include:
s1021, sequentially carrying out reasoning synthesis on the target tone and each voice subtask to obtain a plurality of initial segment tone data;
S1022, carrying out root mean square processing on the initial segment tone data according to the timestamp information of the voice subtasks until the root mean square processing of all the voice subtasks is completed, to obtain an inference root mean square; the root mean square processing squares the voice subtask element by element over the preset time length, sums the squares, and takes the square root;
S1023, adjusting the tone data of the plurality of initial segments according to the comparison result of the target root mean square and the reasoning root mean square to obtain target tone voice data; the target root mean square is the root mean square corresponding to the original singing voice data.
The overall steps of this embodiment are as follows. Reasoning synthesis: the target tone is combined with each voice subtask in turn and synthesized by an inference model (for example a deep learning model) to obtain multiple pieces of initial segment tone data, one per voice subtask. Root mean square processing: root mean square (RMS) processing is applied to the initial segment tone data according to the timestamp information of the voice subtasks; timestamp information records the specific time at which data was generated or an event occurred, and a timestamp can be computed as the index of a sampling point divided by the sampling rate. This step computes the energy level of each segment so that segment loudness can be adjusted to match the energy distribution of the whole song: each sample value in the segment is squared (element-by-element squaring), all squared values are added to obtain the segment's total energy (summing), and the square root of that sum is taken to obtain the segment's RMS value. The process continues until all voice subtasks have been RMS-processed, yielding a series of inference root mean square values. Adjusting the initial segment tone data: the pieces of initial segment tone data are adjusted according to the comparison of the target root mean square with the inference root mean square. The target root mean square is the RMS of the original singing voice data and represents the energy level the synthesized singing voice is expected to reach. The comparison determines whether each segment needs gain (increase) or attenuation (decrease) to better match the target; the adjustment then brings the loudness of each segment closer to the target root mean square, so that the energy distribution stays uniform across the whole song. The result is target tone voice data with consistent loudness and energy distribution throughout the song, which improves the naturalness and overall listening quality of the synthesized singing voice, i.e. improves the synthesis result.
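A small sketch of the per-segment root mean square computation just described (element-wise square, average, square root); the function name is illustrative.

```python
import numpy as np

def segment_rms(samples: np.ndarray) -> float:
    """Root mean square of one synthesized segment: square element-wise, average, take the root."""
    return float(np.sqrt(np.mean(np.square(samples))))
```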
It should be further noted that, to improve the user experience, adjusting the pieces of initial segment tone data according to the comparison of the target root mean square and the inference root mean square to obtain the target tone voice data may include: determining a scale factor according to the target root mean square and the inference root mean square, and performing loudness equalization on the pieces of initial segment tone data according to the scale factor to obtain the target tone voice data. The scale factor in this embodiment can be derived from the difference between the target root mean square and the inference root mean square; Rate (the scale factor weighting) is a number between 0 and 1 expressing how much weight is given to the original versus the target vocal RMS, where 0 means following the original entirely and 1 means following the target vocal entirely. In this embodiment, loudness equalization computes a scale factor for each piece of initial segment tone data from its difference with the corresponding template segment tone data, and multiplies that difference by the scale factor to obtain a loudness adjustment parameter, on the basis of which the loudness is equalized. For example, if the loudness of a piece of initial segment audio data is 3 and the loudness of the corresponding target template segment (the audio corresponding to the target root mean square) is 5, the difference is 2; multiplying 2 by the computed scale factor yields the loudness adjustment parameter used for the equalization. Through loudness equalization, users obtain a consistent listening experience across devices and can enjoy similar loudness and quality on headphones, a car audio system or a home theater system, which improves the user experience.
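The exact blending formula for the scale factor is not fully reproduced above, so the sketch below assumes a linear interpolation between the inferred RMS and the template RMS controlled by Rate in [0, 1]; treat it as an assumption rather than the patent's definitive formula.

```python
import numpy as np

def loudness_equalize(segment: np.ndarray, rms_infer: float, rms_global: float, rate: float = 1.0):
    """Scale one inferred segment toward the template loudness.

    rate = 0 keeps the inferred loudness, rate = 1 matches the template RMS
    (assumed interpretation of Rate).
    """
    desired_rms = (1.0 - rate) * rms_infer + rate * rms_global
    scale = desired_rms / max(rms_infer, 1e-8)  # avoid division by zero on near-silent segments
    return segment * scale                      # the scale factor multiplies every sample point
```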
It should be further noted that, to improve the quality of the target tone voice data, the step of reasoning-synthesizing the target tone with the voice subtasks in sequence to obtain the target tone voice data may include: carrying out reasoning synthesis on the target tone and the voice subtasks in sequence and applying limiting processing to obtain the target tone voice data. Limiting compresses the dynamic range of the audio signal, i.e. reduces the gap between the maximum and minimum amplitude; this prevents the audio signal from exceeding the maximum level the system can bear at its peaks, which prevents distortion and improves the quality of the resulting data.
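A toy sketch of the limiting idea: a soft saturation that keeps peaks below a ceiling. Real limiters use look-ahead and gain smoothing, so this is only an assumption-level illustration, not the limiter used in the embodiment.

```python
import numpy as np

def simple_limiter(samples: np.ndarray, ceiling: float = 0.98) -> np.ndarray:
    """Keep the waveform below the ceiling with a soft tanh saturation near the peaks."""
    # near-linear for small values, smoothly clamped toward +/- ceiling for large peaks
    return ceiling * np.tanh(samples / ceiling)
```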
S103, inserting a mute task between each segment of voice data in the target tone voice data to obtain the target synthesized singing voice.
The mute task in this embodiment is a segment containing no voice; it is obtained by setting the audio signal in the corresponding time range to 0. The mute portions are filled directly with silence data without model synthesis, which improves the efficiency of the singing voice synthesis. A whole song contains a certain proportion of silence, and that portion really can be filled directly with silence data instead of being synthesized by the model. Boundary detection is performed directly with the existing pitch resource file: the pitch sequence is segmented by a demarcation threshold thresholdDb. The resulting fragments are very fine-grained, essentially the regions where vowels lie, so the fine fragments are then merged. With this compression, most songs can shed 15%-25% of their volume. For ease of understanding, please refer to table 1, an example of song volumes after silence/voice splitting provided by an embodiment of the invention, in which npy is a file format storing the ppg/pitch information, ppg denotes the speech posterior probability, and pitch denotes the pitch information.
Table 1: example song volumes after silence/voice splitting

Song name | npy (MB) | After splitting (ppg + pitch) | Size ratio after splitting
A         | 47.3     | 33.7 MB + 66 kB               | 71.2%
B         | 53.9     | 40.1 MB + 78 kB               | 74.3%
C         | 40.1     | 33 MB + 64 kB                 | 82.5%
D         | 47.3     | 39.9 MB + 78 kB               | 84.3%
E         | 52.2     | 40.3 MB + 79 kB               | 77.2%
F         | 61.7     | 45.9 MB + 90 kB               | 74.4%
G         | 53       | 43.7 MB + 85 kB               | 82.4%
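To illustrate the boundary detection discussed before Table 1, the following sketch splits a pitch sequence at a demarcation threshold and merges very short fragments; thresholdDb is treated here as a generic threshold on the pitch sequence, and all names and the merge granularity are illustrative assumptions.

```python
import numpy as np

def split_voiced_regions(pitch: np.ndarray, threshold: float, min_gap: int = 20):
    """Return (start, end) frame ranges whose pitch value exceeds the demarcation threshold.

    Adjacent voiced fragments separated by fewer than min_gap silent frames are merged,
    since the raw split is very fine-grained (roughly one fragment per vowel).
    """
    voiced = pitch > threshold
    regions, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i
        elif not v and start is not None:
            regions.append([start, i])
            start = None
    if start is not None:
        regions.append([start, len(voiced)])
    merged = []
    for seg in regions:
        if merged and seg[0] - merged[-1][1] < min_gap:
            merged[-1][1] = seg[1]   # bridge a short silent gap
        else:
            merged.append(seg)
    return merged
```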
The singing voice synthesis method provided by this embodiment of the application may include: S101, splitting voice data according to a fixed frame length to obtain voice subtasks, where the fixed frame length is smaller than the preset maximum frame-length threshold, that threshold is the longest singing voice reasoning-synthesis duration supported by the edge device, and the voice data is the data of the vocal parts in a song; S102, sequentially carrying out reasoning synthesis on the target tone and the voice subtasks to obtain target tone voice data; S103, inserting a mute task between the segments of voice data in the target tone voice data to obtain the target synthesized singing voice. Because the voice data is divided into multiple voice subtasks by a fixed frame length no shorter than the preset shortest frame length, the synthesis stays stable; because the fixed frame length is smaller than the preset maximum threshold, the memory peak stays below the edge device's maximum memory peak, so the reasoning of the whole singing voice synthesis can be moved onto the edge device; and because the target tone is reasoning-synthesized with the voice subtasks in sequence, playback can start as soon as one subtask has been synthesized, i.e. playing while synthesizing, which effectively reduces the playback waiting time, increases the playback startup speed and saves background GPU reasoning cost. Furthermore, this embodiment splits the voice data with a fixed frame length of no less than 300 frames, so singing voice synthesis can run on the edge device; converting the voice data to the half-precision floating-point format reduces the resource volume; extracting the vocal part by pitch improves the quality of the determined voice data; fading the splices in and out improves the naturalness of transitions and thus the synthesis quality; moving the file pointer to the corresponding target voice subtask in the voice subtask doubly-linked list according to the timestamp improves the efficiency of locating the target voice subtask; adjusting the pieces of initial segment tone data according to the comparison of the target and inference root mean squares improves the synthesis quality; loudness equalization gives users a consistent listening experience, with similar loudness and quality on headphones, car audio or a home theater system; and limiting prevents the audio signal from exceeding the maximum level the system can bear at its peaks, preventing distortion and improving the quality of the output.
In order to facilitate understanding of the present invention, referring to fig. 4, fig. 4 is a flowchart illustrating a singing voice synthesizing method according to an embodiment of the present invention, which may specifically include:
S401, clipping the resource file according to pitch to obtain a voice part, and merging the voice parts to obtain voice data.
The resource files in this embodiment are the ppg (Phonetic Posteriorgram, speech posterior probability) and pitch files. The background model runs reasoning as a whole; before moving it to the edge side, the uncoupled parts are split apart, giving three submodules: NSF (Neural Source Filter), VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech, a conditional variational autoencoder with adversarial learning for speech synthesis) and the decoder (vocoder). This is done mainly for two reasons. a. It prevents operators that do not support hardware acceleration from dragging down the efficiency of the whole model. Conventional models use GPU/TPU (tensor processing unit) acceleration, but such heterogeneous acceleration only brings clear gains for multiply-add-dominated operators such as convolutions and MLPs (multi-layer perceptrons); element-wise and reduce operators are not well supported. A general reasoning framework then has two options: fall the whole model back to the CPU, or fall only the unsupported operators back to the CPU; the problem with the latter is that CPU-to-GPU/TPU communication is expensive, and the compute time saved can be far less than the communication time spent. Similar situations appeared during subsequent verification: for some modules the CPU+GPU scheme was slower than the pure CPU scheme. Splitting the system into several decoupled submodules lets each one be matched to suitable hardware acceleration without affecting the others. b. It reduces the performance peak during execution. The problem with edge devices is that little computing power is allocated to applications, and multiple programs may run in the foreground and background simultaneously; the operating system monitors resources and may reclaim them and crash the process if usage is excessive. Edge-device heat dissipation is also no match for a machine room, and heavy computation quickly causes overheating and frequency throttling. Keeping CPU and memory overhead smooth while the program runs is therefore an important challenge in edge-device development. For ease of understanding, please refer to fig. 5, a schematic diagram of the merging of vocal parts provided by an embodiment of the invention: the voice data depicted there is the data used to determine the split points of the vocal data, and its format may be JSON (JavaScript Object Notation, a lightweight data-exchange format); the solid boxes are the mute portions obtained by splitting the unsplit template audio (the audio used for singing synthesis), the dashed boxes are the vocal portions, and merging the individual vocal portions yields the whole voice data.
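The decomposition into NSF / VITS / vocoder submodules can be pictured as a simple pipeline in which each stage is an independently accelerated callable; the stage interfaces below are assumptions for illustration, not the actual model signatures.

```python
def run_pipeline(ppg, pitch, timbre_embedding, nsf, vits, vocoder):
    """Chain the three decoupled submodules; each callable may use its own hardware backend."""
    excitation = nsf(pitch)                             # assumed: source signal from the pitch contour
    acoustic = vits(ppg, excitation, timbre_embedding)  # assumed: acoustic features for the target timbre
    return vocoder(acoustic)                            # final waveform
```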
S402, converting the voice data into an FP16 format to obtain voice data in a target format.
This embodiment converts the voice data into the FP16 format. The resource files each song depends on for singing synthesis are the ppg and pitch files: the pitch file holds one float32 (4 bytes) per 10 ms, and the ppg file holds 1024 float32 values (4096 bytes) per 20 ms. As a rough reference, 10 s of singing synthesis needs about 2 kB of pitch and 2 MB of ppg; the resources for song C total 40.1 MB, whereas the m4a file delivered after background synthesis is only about 5 MB. This causes two major problems: a. the bandwidth cost introduced by the oversized resource files (mainly the ppg) is very high; b. with a current average download speed of about 2 MB/s, the oversized resources mean 15-20 s more download time than background synthesis. To compress the resource volume, two aspects are addressed. First, a whole song contains a certain proportion of silence, and that silence can be filled directly with silence data without model synthesis: boundary detection is performed with the existing pitch resource file, and a threshold thresholdDb is defined to segment the pitch sequence; the resulting fragments are very fine-grained, essentially the regions where vowels lie, and are merged on that basis. Second, the resources of the vocal parts are merged, marked and delivered, and the edge device expands them on that basis.
S403, splitting the target format voice data according to the fixed frame length to obtain subtasks; wherein the fixed frame length is greater than 300 frames.
S404, connecting each subtask by using a circular double-chain table to obtain a subtask linked list.
S405, performing reasoning according to the subtask linked list and the target tone: after the current task has been inferred, traverse backwards in sequence to find the next reasoning task to execute.
The reasoning in this embodiment refers to reasoning synthesis of the target tone with the subtasks corresponding to the resource file.
S406, when the number of the executed tasks is determined to be equal to the total number of the tasks, the tasks are ended.
S407, after each reasoning task is completed, obtaining the sums of squares of the corresponding ranges of the rms (root mean square) template according to the timestamp information, summing them again and taking the square root to obtain the target root mean square, while also computing the inference root mean square.
S408, determining a scale factor according to the target root mean square and the inference root mean square, and performing loudness equalization on the inferred data according to the scale factor to obtain loudness-adjusted singing voice data.
After the voice has been synthesized, the background in this embodiment refers to the original vocal-to-accompaniment ratio before the accompaniment remix (remixing or re-editing) to adjust the loudness of the synthesized voice against the accompaniment, and computing loudness requires the complete audio. The end side, however, processes in a streaming manner: the first segment is played as soon as it is synthesized while the background performs the reasoning for the next subtask, so the full-song audio is not available during playback and the background's loudness-equalization logic cannot be copied. Moreover, because the edge device's segments are small, loudness can also be inconsistent between segments. To solve this, with reference to how the ppg for the singing guide is produced, the RMS is aligned piecewise using the RMS formula rms = sqrt((1/N) * sum_{i=1..N} x_i^2), where N is the number of samples and x_i is the i-th sample; as the formula shows, the RMS is obtained by squaring element by element, summing, averaging and taking the square root. To make adjustment on the edge device more flexible, the sum of squares of every 10 ms window is precomputed, using the 10 ms granularity of the pitch file as the reference. Using the existing end-side slicing logic, at each slice synthesis the precomputed square sums are fetched at the offset corresponding to the pitch frames and accumulated again to obtain the target rms. The final scaling coefficient Scale is then derived from rms_infer (the RMS of each piece of inferred data), rms_global (the RMS of the template) and Rate (the proportion given to rms_infer, with a value range of [0,1]); Rate is user-defined, describing the weighting between the original and the target dry vocal, and Scale is the coefficient applied to every sample point, i.e. each point is multiplied by Scale. Because the end side infers and plays at the same time, the full vocal data is not available when playback starts and the overall loudness cannot be computed; since the voice cannot be pulled to the target loudness in one pass as in the background, the rms template is extracted so that rms is adjusted segment by segment, simply ensuring the final synthesized voice reaches the intended loudness rather than drifting into an uncontrolled state. For example, if the reference vocal of the extracted rms template is -17 dB, each segment aligns its rms to it during synthesis and the loudness of the final synthesized vocal is also around -17 dB; without alignment the loudness would fluctuate over a wide range.
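A sketch of the piecewise RMS alignment described above: the template's squared sums are precomputed at 10 ms granularity and re-accumulated over whichever windows a subtask covers. The exact Scale formula is not reproduced in the text, so combining this target RMS with rms_infer and Rate (as in the earlier loudness_equalize sketch) remains an assumption; the function and variable names are illustrative.

```python
import numpy as np

def precompute_square_sums(template: np.ndarray, samples_per_10ms: int) -> np.ndarray:
    """Precompute the sum of squares of every 10 ms window of the reference vocal."""
    n = len(template) // samples_per_10ms
    windows = template[: n * samples_per_10ms].reshape(n, samples_per_10ms)
    return np.sum(np.square(windows), axis=1)

def target_rms_for_segment(square_sums: np.ndarray, first_window: int, num_windows: int,
                           samples_per_10ms: int) -> float:
    """Accumulate the precomputed sums over the segment's windows and take the root mean."""
    total = np.sum(square_sums[first_window:first_window + num_windows])
    return float(np.sqrt(total / (num_windows * samples_per_10ms)))
```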
S409, fade-in and fade-out processing is carried out on adjacent subtasks in the loudness adjustment singing voice data, meanwhile, mute tasks are inserted between each segment of voice data, and pressure limiting processing is carried out.
In this embodiment the mute tasks may be inserted after the limiting processing has been applied to the loudness-adjusted singing voice data, which ensures the mute tasks are inserted correctly. To avoid the overload that the gain might introduce, limiter post-processing is required.
S410, the limited data is resampled for output and written into the dry-vocal memory at the corresponding offset.
The technical scheme of the embodiment of the invention has the following beneficial effects: 1. the startup speed is improved, with the waiting time reduced from at least about 8 s to about 1-3 s; 2. background GPU reasoning cost is saved.
The following describes an singing voice synthesizing apparatus according to an embodiment of the present invention, and the singing voice synthesizing apparatus described below and the singing voice synthesizing method described above may be referred to correspondingly.
Referring to fig. 6 specifically, fig. 6 is a schematic structural diagram of an singing voice synthesizing apparatus according to an embodiment of the present invention, which may include:
The voice data splitting module 100 is configured to split voice data according to a fixed frame length to obtain a voice subtask; the fixed frame length is smaller than a preset frame length maximum threshold, and the preset frame length maximum threshold is the longest singing voice reasoning synthesis time supported by the edge equipment; the voice data is the data of voice parts in songs;
The successive reasoning synthesis module 200 is used for sequentially reasoning and synthesizing the target tone with the voice subtasks to obtain target tone voice data;
And the mute task inserting module 300 is used for inserting a mute task between each segment of voice data in the target tone voice data to obtain the target synthesized singing voice.
Further, based on the above embodiment, the singing voice synthesizing apparatus may further include:
the clipping module is used for clipping the resource file according to the audio characteristics to obtain a voice part;
and the merging module is used for merging the voice parts to obtain the voice data.
Further, based on any of the above embodiments, the singing voice synthesizing apparatus may further include:
the voice subtask merging module is used for merging the voice subtasks to obtain merged voice data;
the fade-in fade-out processing module is used for carrying out fade-in fade-out processing on the spliced part in the combined voice data to obtain a target voice subtask;
accordingly, the successive inference synthesis module 200 includes:
a successive reasoning synthesis unit for sequentially carrying out reasoning synthesis on the target tone and the target voice subtasks to obtain the target tone voice data.
Further, based on any of the above embodiments, the successive inference synthesis module 200 may include:
the initial segment tone data determining unit is used for sequentially carrying out reasoning synthesis on the target tone and each voice subtask to obtain a plurality of initial segment tone data;
the root mean square determining unit is used for carrying out root mean square processing on the initial segment tone data according to the timestamp information of the voice subtask until the root mean square processing of all the voice subtasks is completed, and obtaining an inference root mean square; the root mean square processing is to square the human voice subtasks element by element according to preset time length, sum the square, and open the square;
The voice accompaniment data adjusting unit is used for adjusting the plurality of initial segment tone data according to the comparison result of the target root mean square and the reasoning root mean square to obtain the target tone voice data; the target root mean square is the root mean square corresponding to the original singing voice data.
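For illustration, the windowed root-mean-square processing described above may be sketched as follows; the window length is an assumption, and whether the sum of squares is divided by the window length before the root is taken is an implementation choice.

```python
import numpy as np

def windowed_rms(samples: np.ndarray, window: int = 1024) -> np.ndarray:
    """Square element by element over each preset-length window, sum, and take the root."""
    n = (len(samples) // window) * window
    frames = samples[:n].astype(np.float64).reshape(-1, window)
    return np.sqrt((frames ** 2).sum(axis=1))  # add "/ window" here for a true mean square
```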
Further, based on the above embodiment, the above voice accompaniment data adjusting unit may include:
A scale factor determination subunit configured to determine a scale factor according to the target root mean square and the inferred root mean square;
and a loudness balance adjusting subunit, configured to carry out loudness balance adjustment on the plurality of initial segment tone data according to the scale factor to obtain the target tone voice data.
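A minimal sketch of the scale factor determination and the loudness balance adjustment is shown below; the epsilon guard against division by zero and the per-segment application are assumptions of the sketch.

```python
import numpy as np

def loudness_balance(segments, target_rms: float, reasoning_rms: float, eps: float = 1e-8):
    """Scale every synthesized segment so its loudness tracks the original singing voice."""
    scale = target_rms / max(reasoning_rms, eps)   # scale factor from the RMS comparison
    return [np.asarray(seg) * scale for seg in segments]
```

Scaling every synthesized segment by the same target-to-reasoning RMS ratio keeps the synthesized vocal at a loudness comparable to the original singing voice.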
Further, based on any of the above embodiments, the singing voice synthesizing apparatus may further include:
And the voice subtask doubly-linked list determining module is used for connecting each voice subtask by utilizing a circular doubly-linked list to obtain the voice subtask doubly-linked list.
Further, based on the above embodiment, the singing voice synthesizing apparatus may further include:
the target voice subtask searching module is used for moving the file pointer to a corresponding target voice subtask in the voice subtask doubly-linked list according to the time stamp when the file pointer moving operation occurs; the file pointer is a variable for marking the position of a time stamp, each subtask in the voice subtask doubly-linked list corresponds to a time stamp, and the time stamp is a number for marking a time node where the voice subtask is located.
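The circular doubly-linked list of voice subtasks and the timestamp-based movement of the file pointer can be illustrated with the sketch below; the node fields, the seconds-based timestamps and the bounds of the seek loops are assumptions made for the example and do not limit the embodiment.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SubtaskNode:
    timestamp: float                       # time node (seconds) where this subtask starts
    frames: np.ndarray                     # vocal frames belonging to the subtask
    prev: "SubtaskNode | None" = None
    next: "SubtaskNode | None" = None

def build_circular_list(subtasks, timestamps) -> SubtaskNode:
    """Connect each voice subtask into a circular doubly linked list keyed by timestamp."""
    nodes = [SubtaskNode(ts, frames) for ts, frames in zip(timestamps, subtasks)]
    for i, node in enumerate(nodes):
        node.next = nodes[(i + 1) % len(nodes)]
        node.prev = nodes[(i - 1) % len(nodes)]
    return nodes[0]                        # head of the circular list

def seek(pointer: SubtaskNode, target_ts: float) -> SubtaskNode:
    """Move the file pointer to the subtask whose timestamp covers the target time."""
    node = pointer
    if target_ts >= pointer.timestamp:
        while node.next.timestamp <= target_ts and node.next is not pointer:
            node = node.next               # walk forward, at most one full lap
    else:
        while node.timestamp > target_ts and node.prev is not pointer:
            node = node.prev               # walk backward, at most one full lap
    return node
```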
Further, based on any of the above embodiments, the voice data splitting module 100 may include:
and the voice data splitting unit is used for splitting the voice data in the half-precision floating point data format according to the fixed frame length to obtain the voice subtask.
Further, based on any of the above embodiments, the fixed frame length is 300 frames or more.
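By way of example, storing the voice data in half precision and splitting it into fixed-length subtasks might look like the following; treating the voice data as a one-dimensional array of frames and the 300-frame default are simplifications for the sketch.

```python
import numpy as np

def split_half_precision(voice_data: np.ndarray, frame_len: int = 300):
    """Convert the voice data to float16 and split it into fixed-length subtasks."""
    fp16 = voice_data.astype(np.float16)   # half precision roughly halves the resource size
    return [fp16[i:i + frame_len] for i in range(0, len(fp16), frame_len)]
```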
It should be noted that the order of the modules and units in the singing voice synthesizing apparatus may be changed without affecting the logic.
The singing voice synthesizing device provided by the embodiment of the application may comprise: the voice data splitting module 100, configured to split voice data according to a fixed frame length to obtain voice subtasks, wherein the fixed frame length is smaller than a preset frame length maximum threshold, the preset frame length maximum threshold is the longest singing voice reasoning synthesis time supported by the edge device, and the voice data is the data of the voice parts in a song; the successive reasoning synthesis module 200, configured to sequentially carry out reasoning synthesis on the target tone and the voice subtasks to obtain target tone voice data; and the mute task inserting module 300, configured to insert a mute task between each segment of voice data in the target tone voice data to obtain the target synthesized singing voice. According to the application, the voice data is divided into a plurality of voice subtasks according to the fixed frame length; because the fixed frame length is no shorter than 300 frames, singing voice synthesis stability can be maintained, and because it is smaller than the preset frame length maximum threshold, the memory peak value stays below the maximum memory peak value of the edge device, so that the reasoning for the whole singing voice synthesis can be carried out on the edge device. Because the target tone is sequentially reasoning-synthesized with the voice subtasks, each subtask can be played as soon as its reasoning synthesis is finished, that is, playing proceeds while synthesizing, which effectively reduces the playing waiting time, increases the start-up speed, and saves background GPU reasoning cost. Moreover, this embodiment splits the voice data by a fixed frame length of not shorter than 300 frames, so that singing voice synthesis can be performed on the edge device; converts the voice data into the half-precision floating point data format, which reduces the resource volume; and extracts the voice parts through pitch, which improves the effect of determining the voice data. In addition, this embodiment improves the naturalness of transitions by carrying out fade-in fade-out processing on the spliced parts, thereby improving the singing voice synthesis effect; the file pointer can be moved to the corresponding target voice subtask in the voice subtask doubly-linked list according to the timestamp, which improves the efficiency of locating the target voice subtask; and the plurality of initial segment tone data are adjusted according to the comparison result of the target root mean square and the reasoning root mean square, which further improves the singing voice synthesis effect. Furthermore, through loudness balance adjustment this embodiment can provide a consistent listening experience on different devices, so that users can enjoy similar sound intensity and quality on headphones, in-vehicle audio, or a home theater system, which improves the user experience; and this embodiment can perform limiting processing, so that the audio signal is prevented from exceeding the maximum volume that the system can bear at its peaks, thereby preventing distortion and improving the quality of the voice accompaniment data.
An apparatus for synthesizing singing voice according to an embodiment of the present invention will be described below, and the apparatus for synthesizing singing voice described below and the method for synthesizing singing voice described above may be referred to correspondingly.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a singing voice synthesizing apparatus according to an embodiment of the present invention, which may include:
A memory 10 for storing a computer program;
A processor 20 for executing a computer program to implement the singing voice synthesizing method described above.
The memory 10, the processor 20, and the communication interface 30 all communicate with each other via a communication bus 40.
In the embodiment of the present invention, the memory 10 is used for storing one or more programs; a program may include program code, and the program code includes computer operation instructions. In the embodiment of the present invention, the memory 10 may store a program for implementing the following functions:
Splitting the voice data according to the fixed frame length to obtain a voice subtask; the fixed frame length is smaller than a preset frame length maximum threshold, and the preset frame length maximum threshold is the longest singing voice reasoning synthesis time supported by the edge equipment; the voice data is the data of voice parts in songs;
Sequentially carrying out reasoning synthesis on the target tone and the voice subtasks to obtain target tone voice data;
and inserting a mute task between each segment of voice data in the target tone voice data to obtain the target synthesized singing voice.
In one possible implementation, the memory 10 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, and at least one application program required for functions, etc.; the storage data area may store data created during use.
In addition, memory 10 may include read only memory and random access memory and provide instructions and data to the processor. A portion of the memory may also include NVRAM. The memory stores an operating system and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for performing various operations. The operating system may include various system programs for implementing various basic tasks as well as handling hardware-based tasks.
The processor 20 may be a central processing unit (Central Processing Unit, CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or other programmable logic device; the processor 20 may also be a microprocessor or any conventional processor. The processor 20 may call the program stored in the memory 10.
The communication interface 30 may be an interface of a communication module for connecting with other devices or systems.
Of course, it should be noted that the structure shown in fig. 7 does not limit the singing voice synthesizing apparatus according to the embodiment of the present invention; in practical applications, the singing voice synthesizing apparatus may include more or fewer components than those shown in fig. 7, or may combine some of the components.
The following describes a computer-readable storage medium provided in an embodiment of the present invention, and the computer-readable storage medium described below and the singing voice synthesizing method described above may be referred to correspondingly.
The present invention also provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements the steps of the singing voice synthesizing method described above.
The computer readable storage medium may include: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The present invention also provides a computer program product comprising computer programs/instructions which when executed by a processor implement the steps of the singing voice synthesis method as described above.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other. For the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Finally, it is further noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The singing voice synthesizing method, apparatus and computer readable storage medium provided by the present invention have been described above in detail, and specific examples have been applied herein to illustrate the principles and embodiments of the present invention; the above examples are only used to help understand the method and core ideas of the present invention. Meanwhile, those skilled in the art may make changes to the specific embodiments and application scope in accordance with the ideas of the present invention. In view of the above, the contents of this description should not be construed as limiting the present invention.

Claims (10)

1. A singing voice synthesizing method, characterized by being applied to an edge device, comprising:
Splitting the voice data according to the fixed frame length to obtain a voice subtask; the fixed frame length is smaller than a preset frame length maximum threshold, and the preset frame length maximum threshold is the longest singing voice reasoning synthesis time supported by the edge equipment; the voice data is the data of voice parts in songs;
Sequentially carrying out reasoning synthesis on the target tone and the voice subtasks to obtain target tone voice data;
And inserting a mute task between each segment of voice data in the target tone voice data to obtain the target synthesized singing voice.
2. The singing voice synthesizing method of claim 1, further comprising, before splitting the voice data by a fixed frame length to obtain a voice subtask:
Cutting the resource file according to the audio characteristics to obtain a voice part;
And merging the voice parts to obtain the voice data.
3. The singing voice synthesizing method of claim 1, further comprising, after splitting the voice data by a fixed frame length to obtain a voice subtask:
Combining the voice subtasks to obtain combined voice data;
performing fade-in fade-out processing on the spliced part in the combined voice data to obtain a target voice subtask;
Correspondingly, the step of sequentially carrying out reasoning synthesis on the target tone and the voice subtask to obtain target tone voice data comprises the following steps:
and carrying out reasoning synthesis on the target tone and the target voice subtasks in sequence to obtain the target tone voice data.
4. A singing voice synthesizing method as in any one of claims 1 to 3, wherein the sequentially carrying out reasoning synthesis on the target tone and the voice subtasks to obtain target tone voice data comprises:
Sequentially carrying out reasoning synthesis on the target tone and each voice subtask to obtain a plurality of initial segment tone data;
Performing root mean square processing on the initial segment tone data according to the timestamp information of the voice subtasks until the root mean square processing of all the voice subtasks is completed, to obtain a reasoning root mean square; the root mean square processing is to square the voice subtask element by element over a preset time length, sum the squares, and take the square root;
According to the comparison result of the target root mean square and the reasoning root mean square, adjusting the plurality of initial segment tone data to obtain the target tone voice data; the target root mean square is the root mean square corresponding to the original singing voice data.
5. The singing voice synthesizing method of claim 4, wherein the adjusting the plurality of initial segment tone data according to the comparison result of the target root mean square and the reasoning root mean square to obtain the target tone voice data comprises:
determining a scale factor according to the target root mean square and the reasoning root mean square;
and carrying out loudness balance adjustment on the plurality of initial segment tone data according to the scale factors to obtain the target tone voice data.
6. A singing voice synthesizing method as in any one of claims 1 to 3, further comprising, after splitting the voice data by a fixed frame length to obtain a voice subtask:
and connecting each voice subtask by using a circular doubly linked list to obtain a voice subtask doubly linked list.
7. The singing voice synthesizing method of claim 6, further comprising:
When a file pointer moving operation occurs, moving the file pointer to a corresponding target voice subtask in the voice subtask doubly-linked list according to the timestamp; the file pointer is a variable for marking the position of a timestamp, each subtask in the voice subtask doubly-linked list corresponds to a timestamp, and the timestamp is a number for marking the time node where the voice subtask is located.
8. The singing voice synthesizing method of claim 1, wherein the splitting the voice data according to the fixed frame length to obtain the voice subtask comprises:
and splitting the voice data in the half-precision floating point number data format according to the fixed frame length to obtain the voice subtask.
9. A singing voice synthesizing apparatus, characterized by comprising:
A memory for storing a computer program;
A processor for implementing the steps of the singing voice synthesizing method as recited in any one of claims 1 to 8 when executing the computer program.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the singing voice synthesizing method as recited in any one of claims 1 to 8.