
CN117075772A - Multimedia content display method and device, electronic equipment and storage medium - Google Patents

Multimedia content display method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN117075772A
CN117075772A
Authority
CN
China
Prior art keywords
audio
dry
sound
multimedia content
control
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311126243.5A
Other languages
Chinese (zh)
Inventor
张颖
李轩
曲贺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202311126243.5A priority Critical patent/CN117075772A/en
Publication of CN117075772A publication Critical patent/CN117075772A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F3/04815Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The embodiments of the present disclosure provide a multimedia content display method and apparatus, an electronic device, and a storage medium, relating to the field of computer technology. The method includes the following steps: after recording of dry audio for multimedia content is completed, displaying a multimedia editing page, the multimedia editing page including an audio matching control; in response to a trigger operation on the audio matching control, obtaining mixed audio matching the dry audio and displaying an audio presentation page, where the mixed audio is formed by mixing a plurality of audios generated for the multimedia content according to a plurality of sound objects, and the audio presentation page includes a play control; and in response to a trigger operation on the play control, playing the dry audio corresponding to at least part of the multimedia content superimposed with the mixed audio corresponding to the at least part of the multimedia content. The method achieves the effect of intuitively playing the dry audio and the chorus audio superimposed on each other, enriches the audio presentation forms of multimedia content, and improves human-machine interaction efficiency.

Description

Multimedia content display method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a multimedia content display method, a multimedia content display device, an electronic device, and a computer-readable storage medium.
Background
With the development of internet technology and smart devices, audio-oriented social software has become increasingly popular, and users can record and publish audio through platforms such as online singing platforms, live-streaming platforms, and short-video platforms.
In the related art, after a user records audio, the recorded audio is beautified; however, the audio obtained in this way contains only a single timbre, so its playback effect is poor.
Disclosure of Invention
The embodiments of the present disclosure provide a multimedia content display method and apparatus, an electronic device, and a computer-readable storage medium. The method superimposes mixed audio on dry audio, so that the superimposed audio can contain multiple timbres, improving the playback effect of the audio; at the same time, it achieves the effect of intuitively playing the dry audio and the chorus audio superimposed on each other, enriches the audio presentation forms of multimedia content, and improves human-machine interaction efficiency.
The embodiments of the present disclosure provide a multimedia content display method including the following steps: after recording of dry audio for multimedia content is completed, displaying a multimedia editing page, the multimedia editing page including an audio matching control used to match mixed audio for the dry audio; in response to a trigger operation on the audio matching control, obtaining mixed audio matching the dry audio and displaying an audio presentation page, where the mixed audio is formed by mixing a plurality of audios generated for the multimedia content according to a plurality of sound objects, and the audio presentation page includes a play control; and in response to a trigger operation on the play control, playing the dry audio corresponding to at least part of the multimedia content superimposed with the mixed audio corresponding to the at least part of the multimedia content.
In some exemplary embodiments of the present disclosure, the audio presentation page further includes a plurality of pieces of text information of the multimedia content and selection controls in one-to-one correspondence with the pieces of text information, and the state of a selection control includes a to-be-selected state and a selected state. Playing the dry audio corresponding to at least part of the multimedia content superimposed with the mixed audio corresponding to the at least part of the multimedia content in response to the trigger operation on the play control includes: in response to a trigger operation on a selection control in the to-be-selected state, updating the state of that selection control to the selected state; in response to the trigger operation on the play control, sequentially playing the audio clip corresponding to each piece of text information; and when the text information corresponding to a selection control in the selected state is played, playing the dry audio and the mixed audio corresponding to the currently played text information superimposed on each other.
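The per-line selection behaviour described above can be sketched in a few lines of Python. This is an illustrative model only, not the patented implementation; the names `LyricLine`, `toggle`, and `tracks_for` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LyricLine:
    text: str
    selected: bool = False  # False = to-be-selected state, True = selected state

def toggle(line: LyricLine) -> None:
    # A trigger operation on the selection control flips its state.
    line.selected = not line.selected

def tracks_for(line: LyricLine) -> list:
    # Selected lines play dry and mixed audio superimposed; others play dry only.
    return ["dry", "mixed"] if line.selected else ["dry"]
```

For example, a line that starts in the to-be-selected state would play only the dry audio; after one trigger operation it would play both tracks superimposed.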
In some exemplary embodiments of the present disclosure, the method further includes: in response to a trigger operation on a selection control in the selected state, updating the state of that selection control to the to-be-selected state; and when the text information corresponding to a selection control in the to-be-selected state is played, playing only the dry audio corresponding to the currently played text information.
In some exemplary embodiments of the present disclosure, the audio presentation page further includes a volume adjustment control, and the method further includes: determining a target volume for playing the mixed audio in response to an adjustment operation on the volume adjustment control. Superimposed playing of the dry audio and the mixed audio corresponding to at least part of the multimedia content then includes: playing the audio clips of the dry audio corresponding to the at least part of the multimedia content at a preset volume, and playing the audio clips of the mixed audio corresponding to the at least part of the multimedia content at the target volume.
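Superimposed playback at two independent volumes reduces to scaling each track by its own gain before summing sample-wise. The sketch below assumes equal-length mono sample buffers and linear gains; the function name and default values are illustrative, not taken from the patent.

```python
def superimpose(dry, mixed, preset_volume=1.0, target_volume=0.5):
    # Play the dry track at the preset volume and the mixed track at the
    # user-chosen target volume by scaling each sample, then summing.
    return [d * preset_volume + m * target_volume for d, m in zip(dry, mixed)]
```

In a real player the sum would also be clipped or normalised to the sample format's range; that step is omitted here for brevity.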
In some exemplary embodiments of the present disclosure, the audio presentation page further includes a completion control, and the method further includes: in response to a trigger operation on the completion control, generating a target multimedia file from the dry audio and the mixed audio.
In some exemplary embodiments of the present disclosure, the method further includes: in response to the trigger operation on the completion control, jumping from the audio presentation page to the multimedia editing page, the multimedia editing page including a publish control; and in response to a trigger operation on the publish control, publishing the target multimedia file.
In some exemplary embodiments of the present disclosure, obtaining the mixed audio matching the dry audio includes: obtaining the vocal range of the dry audio; matching, from a plurality of candidate sound objects, at least one first sound object whose vocal range is the same as that of the dry audio and at least one second sound object whose vocal range differs from it; generating at least one first audio for the multimedia content from the dry audio and the at least one first sound object, and generating at least one second audio for the multimedia content from the dry audio and the at least one second sound object; and mixing the at least one first audio and the at least one second audio to obtain the mixed audio.
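The vocal-range matching step above amounts to partitioning the candidate sound objects by whether their range equals that of the dry audio. A minimal sketch, with hypothetical field names (`"range"`) and caps on how many objects of each kind are kept:

```python
def match_sound_objects(dry_range, candidates, max_first=2, max_second=2):
    # First sound objects share the dry audio's vocal range;
    # second sound objects have a different vocal range.
    first = [c for c in candidates if c["range"] == dry_range][:max_first]
    second = [c for c in candidates if c["range"] != dry_range][:max_second]
    return first, second
```

A production system would presumably rank candidates rather than take the first few, but the patent text does not specify a ranking, so none is assumed here.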
In some exemplary embodiments of the present disclosure, generating the at least one first audio for the multimedia content from the dry audio and the at least one first sound object includes: performing feature extraction on the dry audio to obtain the text content corresponding to the dry audio and the time information of the text content; performing fundamental-frequency extraction on the dry audio to obtain the singing melody corresponding to the dry audio; acquiring the timbre features of the at least one first sound object; and inputting the text content and its time information, the singing melody, and the timbre features of the at least one first sound object into a singing voice conversion model to obtain the at least one first audio. Generating the at least one second audio for the multimedia content from the dry audio and the at least one second sound object includes: acquiring the timbre features of the at least one second sound object; and inputting the text content and its time information, the singing melody, and the timbre features of the at least one second sound object into the singing voice conversion model to obtain the at least one second audio.
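The fundamental-frequency extraction step mentioned above recovers the singing melody from the dry audio. As a toy illustration only (real systems use far more robust estimators such as YIN or pYIN, and operate frame by frame rather than on the whole buffer), here is a naive autocorrelation pitch estimate on a synthetic tone:

```python
import math

def extract_f0(samples, sr):
    # Naive whole-buffer autocorrelation pitch estimate (illustrative only).
    # Searches lags corresponding to roughly 50 Hz .. 1000 Hz.
    best_lag, best_corr = 0, 0.0
    for lag in range(sr // 1000, sr // 50):
        corr = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sr / best_lag if best_lag else 0.0

# A 200 Hz test tone sampled at 8 kHz; the estimator should recover about 200 Hz.
sr = 8000
tone = [math.sin(2 * math.pi * 200.0 * n / sr) for n in range(2000)]
estimated = extract_f0(tone, sr)
```

The autocorrelation peaks at the lag equal to the waveform's period (here 40 samples), from which the fundamental frequency follows as `sr / lag`.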
In some exemplary embodiments of the present disclosure, generating a first audio for the multimedia content from the dry audio and a first sound object includes: performing feature extraction on the dry audio to obtain the text content corresponding to the dry audio and the time information of the text content; performing fundamental-frequency extraction on the dry audio to obtain the singing melody corresponding to the dry audio; acquiring the timbre features of the first sound object; and inputting the text content and its time information, the singing melody, and the timbre features of the first sound object into a singing voice conversion model to obtain the first audio.
In some exemplary embodiments of the present disclosure, before mixing the first audio and the second audio to obtain the mixed audio, the method includes: adjusting the volumes of the first audio and the second audio so that the volume of the first audio is lower than that of the second audio; and adjusting the sound images of the first audio and the second audio so that the first audio is farther from the virtual center position than the second audio, thereby obtaining mixed audio with a stereo effect.
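The volume and sound-image adjustment described above can be sketched as per-track gain plus a pan position, with panned tracks summed channel-wise into a stereo pair. The linear pan law below is one simple choice (equal-power panning is another common one); all names are illustrative.

```python
def pan_track(samples, gain, pan):
    # Linear pan law: pan in [-1.0, 1.0], 0.0 = virtual centre position.
    left_gain = gain * (1.0 - pan) / 2.0
    right_gain = gain * (1.0 + pan) / 2.0
    return [s * left_gain for s in samples], [s * right_gain for s in samples]

def mix_stereo(tracks):
    # tracks: list of (samples, gain, pan) with equal-length mono buffers.
    n = len(tracks[0][0])
    left, right = [0.0] * n, [0.0] * n
    for samples, gain, pan in tracks:
        l, r = pan_track(samples, gain, pan)
        for i in range(n):
            left[i] += l[i]
            right[i] += r[i]
    return left, right
```

Following the embodiment, the second (same-range) audio would be given the higher gain and a pan near 0.0, while the first audio gets a lower gain and a pan pushed away from the centre, e.g. `mix_stereo([(second, 1.0, 0.0), (first, 0.5, 0.7)])`.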
The embodiments of the present disclosure provide a multimedia content display apparatus, including: a display module configured to display a multimedia editing page after recording of dry audio for multimedia content is completed, the multimedia editing page including an audio matching control used to match mixed audio for the dry audio; the display module being further configured to, in response to a trigger operation on the audio matching control, obtain mixed audio matching the dry audio and display an audio presentation page, the mixed audio being formed by mixing a plurality of audios generated for the multimedia content according to a plurality of sound objects, and the audio presentation page including a play control; and a playing module configured to, in response to a trigger operation on the play control, play the dry audio corresponding to at least part of the multimedia content superimposed with the mixed audio corresponding to the at least part of the multimedia content.
An embodiment of the present disclosure provides an electronic device, including: a processor; and a memory for storing instructions executable by the processor, wherein the processor is configured to execute the instructions to implement any one of the multimedia content display methods above.
The embodiments of the present disclosure provide a computer-readable storage medium, instructions in which, when executed by a processor of an electronic device, enable the electronic device to perform any one of the multimedia content display methods above.
The embodiments of the present disclosure provide a computer program product including a computer program which, when executed by a processor, implements any one of the multimedia content display methods above.
According to the multimedia content display method provided by the embodiments of the present disclosure, on the one hand, in response to a trigger operation on the audio matching control in the multimedia editing page, mixed audio is obtained that is formed by mixing a plurality of audios generated for the multimedia content by a plurality of sound objects matching the dry audio, and the mixed audio is superimposed on the dry audio, so that the superimposed audio can contain multiple timbres, improving the playback effect of the audio; on the other hand, in response to a trigger operation on the play control in the audio presentation page, the dry audio and the mixed audio corresponding to at least part of the multimedia content are played superimposed on each other, achieving the effect of intuitively previewing the superimposed dry audio and chorus audio, enriching the audio presentation forms of multimedia content, and improving human-machine interaction efficiency.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the multimedia content presentation method of the embodiments of the present disclosure may be applied.
Fig. 2 is a flowchart illustrating a multimedia content presentation method according to an exemplary embodiment.
Fig. 3 is a schematic diagram of a multimedia editing page shown according to an example.
Fig. 4 is a schematic diagram of an audio presentation page shown according to an example.
Fig. 5 is a flowchart illustrating another multimedia content presentation method according to an exemplary embodiment.
Fig. 6 is an interaction diagram of a client corresponding to a multimedia content presentation method according to an example.
Fig. 7 is a block diagram illustrating a multimedia content presentation device according to an exemplary embodiment.
Fig. 8 is a schematic diagram illustrating a structure of an electronic device suitable for use in implementing an exemplary embodiment of the present disclosure, according to an exemplary embodiment.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments can be embodied in many forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted.
The described features, structures, or characteristics of the disclosure may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the present disclosure. However, those skilled in the art will recognize that the aspects of the present disclosure may be practiced with one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The drawings are merely schematic illustrations of the present disclosure, in which like reference numerals denote like or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in at least one hardware module or integrated circuit or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and not necessarily all of the elements or steps are included or performed in the order described. For example, some steps may be decomposed, and some steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
In the present specification, the terms "a," "an," "the," "said" and "at least one" are used to indicate the presence of at least one element/component/etc.; the terms "comprising," "including," and "having" are intended to be inclusive and mean that there may be additional elements/components/etc., in addition to the listed elements/components/etc.; the terms "first," "second," and "third," etc. are used merely as labels, and do not limit the number of their objects.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the multimedia content presentation method of the embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture may include a server 101, a network 102, a terminal device 103, a terminal device 104, and a terminal device 105. Network 102 is the medium used to provide communication links between terminal device 103, terminal device 104, or terminal device 105 and server 101. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
The server 101 may be a server providing various services, such as a background management server providing support for devices operated by a user with the terminal device 103, the terminal device 104, or the terminal device 105. The background management server may perform analysis and other processing on the received data such as the request, and feed back the processing result to the terminal device 103, the terminal device 104, or the terminal device 105.
The terminal device 103, the terminal device 104, and the terminal device 105 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a wearable smart device, a virtual reality device, an augmented reality device, and the like.
In the embodiment of the disclosure, a user may record dry audio for multimedia content using a terminal device, and after the recording of the dry audio for multimedia content is completed, the terminal device may: displaying a multimedia editing page, wherein the multimedia editing page comprises an audio matching control; responding to the triggering operation of the audio matching control, obtaining mixed audio formed by mixing a plurality of audio generated by a plurality of sound objects matched with the dry audio aiming at the multimedia content, and displaying an audio display page, wherein the audio display page comprises a play control; and in response to the triggering operation for the playing control, performing overlapped playing on the dry audio corresponding to at least part of the multimedia content and the mixed audio corresponding to at least part of the multimedia content.
It should be understood that the numbers of the terminal device 103, the terminal device 104, the terminal device 105, the network 102 and the server 101 in fig. 1 are only illustrative, and the server 101 may be a server of one entity, may be a server cluster formed by a plurality of servers, may be a cloud server, and may have any number of terminal devices, networks and servers according to actual needs.
Hereinafter, respective steps of the multimedia content presentation method in the exemplary embodiment of the present disclosure will be described in more detail with reference to the accompanying drawings and embodiments.
Fig. 2 is a flowchart illustrating a multimedia content presentation method according to an exemplary embodiment. The method provided by the embodiment of fig. 2 may be performed by any electronic device, for example, the terminal device in fig. 1, or the server in fig. 1, or a combination of the terminal device and the server in fig. 1, which is not limited in this disclosure.
As shown in fig. 2, the method provided by the embodiments of the present disclosure may include the following steps.
In step S210, after the recording of the dry audio for the multimedia content is completed, a multimedia editing page is displayed, which includes an audio matching control.
In the embodiments of the present disclosure, the multimedia content may be text content, image content, or audio content; when the multimedia content is audio content, it may be song content, dubbing content, podcast content, and the like. In the following explanation the multimedia content is taken to be song content by way of example, but the present disclosure is not limited thereto.
In the embodiments of the present disclosure, dry sound refers to the original audio recorded by a device, i.e., sound that has not been processed with reverberation or other sound effects; it may also refer to a human voice without a musical background. Dry audio refers to the user's voice audio for the multimedia content recorded with a terminal device, such as the user's singing of song A recorded with a mobile phone. It should be noted that the dry audio recorded in the embodiments of the present disclosure may contain noise, as long as the signal-to-noise ratio of the dry audio is greater than a preset value.
In the embodiments of the present disclosure, the multimedia editing page refers to a page for editing the multimedia content and the dry audio; for example, it may be a preview and editing page for lyrics and sound. The multimedia editing page may include, but is not limited to, an audio matching control, an intelligent pitch-correction control, a clip editing control, a noise-reduction and alignment control, an accompaniment volume control bar, a vocal volume control bar, a publish control, and the like. The audio matching control is a control for matching mixed audio for the dry audio; the mixed audio may be, for example, chorus audio generated by AI (Artificial Intelligence), in which case the audio matching control may also be called an AI chorus control.
In the embodiment of the disclosure, a user may record dry audio for multimedia content using a terminal device, and after the recording of the dry audio is completed, a multimedia editing page is displayed.
Fig. 3 is a schematic diagram of a multimedia editing page shown according to an example. Referring to fig. 3, the multimedia editing page 300 may include text information 301 of the multimedia content as a background, and may further include an audio matching control 302, an intelligent pitch-correction control, a clip editing control, a noise-reduction and alignment control, an accompaniment volume control bar, a vocal volume control bar, and a publish control.
In an embodiment of the present disclosure, the audio matching control included in the multimedia editing page may be in an available state or an unavailable state. After the user records the dry audio, it may be detected whether the user is wearing earphones and whether the signal-to-noise ratio of the dry audio is greater than a preset value: when the user wears earphones or the signal-to-noise ratio is greater than the preset value, the audio matching control is in the available state; when the user does not wear earphones and the signal-to-noise ratio is less than or equal to the preset value, the audio matching control is in the unavailable state.
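The availability check above is a simple boolean condition. A minimal sketch, where the function name and the 20 dB default threshold are assumptions (the patent only says "a preset value"):

```python
def audio_match_available(wearing_earphones, snr_db, snr_threshold_db=20.0):
    # The control is available if earphones are worn OR the dry audio
    # is clean enough (SNR strictly above the preset threshold).
    return wearing_earphones or snr_db > snr_threshold_db
```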
In step S220, in response to the triggering operation for the audio matching control, a mixed audio matching the dry audio is obtained, and an audio presentation page is displayed, wherein the mixed audio is formed by mixing a plurality of audios generated for the multimedia content according to a plurality of sound objects, and the audio presentation page includes a play control.
In the embodiments of the present disclosure, the mixed audio may be obtained by mixing a plurality of audios respectively generated for the multimedia content from a plurality of sound objects; for example, the mixed audio may be chorus audio obtained by mixing a plurality of simulated singing audios generated for song A by a plurality of sound objects.
In the embodiments of the present disclosure, the audio presentation page refers to a page for presenting the superimposed playback effect of the dry audio and the mixed audio; it may be, for example, a chorus editing page. The audio presentation page may include a play control, a plurality of pieces of text information of the multimedia content, selection controls in one-to-one correspondence with the pieces of text information, a volume adjustment control, and a completion control. The play control is used to play the audio corresponding to the multimedia content, which may contain only the dry audio or the dry audio superimposed with the mixed audio. The pieces of text information of the multimedia content may be, for example, lyric information or line information, where each line of text may be one piece of text information. Each selection control corresponds to one line of text information and is used to select whether the mixed audio is used: when the selection control is in the to-be-selected state, the corresponding text information does not use the mixed audio, i.e., only the dry audio is played for that text; when the selection control is in the selected state, the corresponding text information uses the mixed audio, i.e., the dry audio and the mixed audio are played superimposed for that text. The initial state of a selection control may be either the to-be-selected state or the selected state, and the user can change the state by clicking the control. The volume adjustment control is a control for adjusting the volume of the mixed audio, and the completion control is a control for generating the target multimedia file after the adjustment of the mixed audio is completed.
Fig. 4 is a schematic diagram of an audio presentation page shown according to an example, referring to fig. 4, the audio presentation page 400 may include a play control 401, a plurality of text information of multimedia content, a selection control corresponding to each text information one-to-one, a volume adjustment control 404, and a completion control 405, where the plurality of text information may include first text information 4021, second text information 4022, third text information 4023, and the like, and the selection control may include a first selection control 4031 corresponding to the first text information 4021, a second selection control 4032 corresponding to the second text information 4022, and a third selection control 4033 corresponding to the third text information 4023; wherein the state of the first selection control 4031 is the to-be-selected state, and the states of the second selection control 4032 and the third selection control 4033 are the selected states.
In the embodiments of the present disclosure, the user may click the audio matching control; in response to the trigger operation on the audio matching control, mixed audio for the multimedia content may be obtained by matching against the dry audio, and the audio presentation page is displayed at the same time.
In step S230, in response to the triggering operation for the play control, the dry audio corresponding to at least part of the multimedia content and the mixed audio corresponding to at least part of the multimedia content are played in a superimposed manner.
In the embodiment of the disclosure, a user may click the play control in the audio presentation page, and in response to the click operation, the multimedia content and its corresponding audio are played from the current position of the play progress bar. Taking the case where the current position of the play progress bar is the initial position as an example, after the user clicks the play control, each piece of text information and the audio corresponding to it are played in sequence according to the order of the pieces of text information of the multimedia content, where the audio corresponding to at least part of the text information may be formed by superposition of the dry audio and the mixed audio.
For example, when the user wants to preview the chorus effect of the dry audio and the chorus audio, the user may click on the play control in the audio presentation page, and in response to the click operation, sequentially play the lyrics of the song and the dry audio and the chorus audio corresponding to the currently played lyrics, so that the user may intuitively preview the effect after the superposition of the dry audio and the chorus audio.
According to the multimedia content display method provided by the embodiment of the disclosure, on one hand, in response to the triggering operation of the audio matching control in the multimedia editing page, the mixed audio formed by mixing the plurality of audio generated by the plurality of sound objects matched with the dry audio for the multimedia content is obtained, and the mixed audio is overlapped for the dry audio, so that the audio after the mixed audio and the dry audio are overlapped can contain various tone colors, and the playing effect of the audio is improved; on the other hand, in response to the triggering operation of the playing control in the audio display page, the dry audio corresponding to at least part of the multimedia content and the mixed audio corresponding to at least part of the multimedia content are overlapped and played, the effect of intuitively previewing the overlapped dry audio and chorus audio is achieved, the audio display form of the multimedia content is enriched, and the man-machine interaction efficiency is improved.
In an exemplary embodiment, in response to a triggering operation for a play control, performing superposition play on at least part of dry audio corresponding to the multimedia content and at least part of mixed audio corresponding to the multimedia content, including: responding to a triggering operation of a selection control with a state to be selected, and updating the state of the selection control into a selected state; sequentially playing the audio clips corresponding to each text message in response to the triggering operation for the playing control; and when the text information corresponding to the selection control with the selected state is played, performing superposition playing on the dry audio and the mixed audio corresponding to the currently played text information.
In an exemplary embodiment, the method may further include: in response to a triggering operation on a selection control whose state is the selected state, updating the state of the selection control to the to-be-selected state; and when the text information corresponding to a selection control whose state is the to-be-selected state is played, playing the dry audio corresponding to the currently played text information.
In the embodiment of the disclosure, a user can update the state of the selection control by clicking the selection control to determine whether text information corresponding to the selection control uses mixed audio; when the state of the selection control is the to-be-selected state, the text information corresponding to the selection control does not use mixed audio, namely only dry audio is played when the text information is played; when the state of the selection control is the selected state, the text information corresponding to the selection control uses mixed audio, namely, the dry audio and the mixed audio are overlapped and played when the text information is played.
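The selection-control behavior described above can be sketched as follows (an illustrative, non-limiting example; the class name, state names, and track labels are hypothetical, not part of the disclosed implementation):

```python
# Hypothetical sketch of the per-line selection state described above.
# "pending" = to-be-selected (dry audio only); "selected" = dry + mixed audio.

PENDING, SELECTED = "pending", "selected"

class LyricLine:
    def __init__(self, text, state=PENDING):
        self.text = text
        self.state = state

    def toggle(self):
        # Clicking the selection control flips between the two states.
        self.state = SELECTED if self.state == PENDING else PENDING

    def tracks_to_play(self):
        # A selected line plays the dry and mixed audio superimposed;
        # a to-be-selected line plays the dry audio alone.
        return ("dry", "mixed") if self.state == SELECTED else ("dry",)

lines = [LyricLine("line 1"), LyricLine("line 2", SELECTED)]
lines[0].toggle()  # the user clicks the first line's selection control
```

During playback, the player would call `tracks_to_play()` for each line in order and superimpose whichever tracks it returns.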
For example, referring to fig. 4, the state of the first selection control 4031 is the to-be-selected state, and the states of the second selection control 4032 and the third selection control 4033 are the selected states; after the user clicks the play control 401, the audio clips corresponding to the first text information 4021, the second text information 4022, and the third text information 4023 are played in sequence; when the first text information 4021 is played, since the state of the first selection control 4031 is the to-be-selected state, the dry audio corresponding to the first text information 4021 is played; when the second text information 4022 is played, since the state of the second selection control 4032 is the selected state, the dry audio and the mixed audio corresponding to the second text information 4022 are played in superposition; when the third text information 4023 is played, since the state of the third selection control 4033 is the selected state, the dry audio and the mixed audio corresponding to the third text information 4023 are played in superposition.
For example, when the user hears the dry audio corresponding to the first text information 4021, if the user wants to superimpose the mixed audio on the dry audio of the first text information 4021, the user can click the first selection control 4031 to update its state to the selected state; for another example, when the user hears the superposition of the dry audio and the mixed audio corresponding to the second text information 4022, if the user wants to cancel the mixed audio superimposed on the dry audio of the second text information 4022, the user can click the second selection control 4032 to update its state to the to-be-selected state.
In the embodiment of the disclosure, when the text information corresponding to a selection control whose state is the selected state is played, the dry audio and the mixed audio corresponding to the currently played text information are played in superposition; when the text information corresponding to a selection control whose state is the to-be-selected state is played, the dry audio corresponding to the currently played text information is played; by this method, whether the mixed audio is superimposed on the text information corresponding to a selection control can be determined according to the state of that selection control, which enriches the audio superposition modes and improves the flexibility of audio generation.
In an exemplary embodiment, the method may further include: determining a target volume for playing the mixed audio in response to an adjustment operation on the volume adjustment control; wherein the superposition playing of the dry audio corresponding to at least part of the multimedia content and the mixed audio corresponding to at least part of the multimedia content includes: playing the audio clips corresponding to at least part of the multimedia content in the dry audio at a preset volume, and playing the audio clips corresponding to at least part of the multimedia content in the mixed audio at the target volume.
In the embodiment of the disclosure, the volume adjustment control of the audio presentation page is a control for adjusting the volume of the mixed audio, and the user can adjust the volume of the played mixed audio on the audio presentation page. The dry audio in the audio presentation page is played at a preset volume, which can be set according to actual conditions; the mixed audio is played at the target volume obtained after adjustment. When the user hears the superimposed playing effect of the dry audio and the mixed audio, the user can adjust the volume of the mixed audio so that the superposition effect of the two is better, for example, closer to that of a chorus group.
In the embodiment of the disclosure, when the dry audio and the mixed audio are played in superposition, the dry audio is played at a preset volume, and the mixed audio is played at an adjustable target volume; this method allows the volume of the mixed audio to be adjusted so that the superposition effect of the dry audio and the mixed audio is better, thereby improving the superposition effect of the audio and the flexibility of audio generation.
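As a minimal sketch of this volume scheme, assuming floating-point sample lists and a simple linear gain model (both assumptions for illustration, not the disclosed implementation), superimposed playback scales each track by its own volume before summing:

```python
# Superimpose a dry segment (fixed preset volume) and a mixed segment
# (user-adjusted target volume) sample by sample.

DRY_PRESET_VOLUME = 1.0  # the dry audio always plays at a preset volume

def superimpose(dry, mixed, target_volume):
    """Scale the dry segment by the preset volume and the mixed segment
    by the target volume, then sum the two sample streams."""
    return [d * DRY_PRESET_VOLUME + m * target_volume
            for d, m in zip(dry, mixed)]

out = superimpose([0.5, -0.5], [0.2, 0.2], target_volume=0.5)
```

Changing `target_volume` in response to the volume adjustment control changes only the mixed audio's contribution, leaving the dry audio untouched, as described above.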
In an exemplary embodiment, the method may further include: in response to a triggering operation for the completion control, a target multimedia file is generated from the dry audio and the mixed audio.
In the embodiment of the disclosure, after the user completes the adjustment of the audio on the audio display page, the completion control may be clicked, and in response to the clicking operation, the target multimedia file may be generated according to the dry audio and the mixed audio.
For example, for song A, after determining, for each line of lyrics, whether the mixed audio is used and, when it is used, the volume of the mixed audio, a target chorus file with a chorus effect is generated from the lines of lyrics that do not use the mixed audio together with their corresponding dry audio, and from the lines of lyrics that use the mixed audio together with their corresponding dry audio and mixed audio.
In the embodiment of the disclosure, the target multimedia file is generated according to the dry audio and the mixed audio in response to the triggering operation of the completion control, and the target multimedia file with the superposition effect can be automatically generated by the method, so that the man-machine interaction efficiency is improved.
In an exemplary embodiment, the method may further include: in response to a triggering operation for the completion control, jumping from the audio presentation page to a multimedia editing page, the multimedia editing page including a release control; and responding to the triggering operation for the release control, and releasing the target multimedia file.
In the embodiment of the disclosure, after the target multimedia file is generated in response to the triggering operation for the completion control, the multimedia editing page can be jumped back, and the target multimedia file can be published through the publishing control in the multimedia editing page.
In an embodiment of the disclosure, the multimedia editing page may also include a draft storing control, and in response to a triggering operation for the draft storing control, the target multimedia file may be stored as a draft.
In the embodiment of the disclosure, in response to the triggering operation for the completion control, the page can jump to the multimedia editing page, and the target multimedia file is released through the release control of the multimedia editing page.
Fig. 5 is a flowchart illustrating another multimedia content presentation method according to an exemplary embodiment, and fig. 5 illustrates specific steps of obtaining mixed audio matching with dry audio.
In the embodiment of fig. 5, the step S220 in the embodiment of fig. 2 described above may further include the following steps.
In step S221, the vocal range of the dry audio is obtained.
In the embodiment of the present disclosure, the vocal range of the dry audio refers to the span between the lowest tone and the highest tone in the dry audio.
In step S222, at least one first sound object whose vocal range is the same as that of the dry audio and at least one second sound object whose vocal range is different from that of the dry audio are obtained by matching from the plurality of candidate sound objects.
In the embodiment of the disclosure, the candidate sound object may be a sound in a sound library, where the sound in the sound library may be collected sound of singing of different users meeting a certain quality condition, or may be a sound automatically generated through a neural network model.
In the embodiment of the disclosure, according to the vocal range of the dry audio and the vocal range of each candidate sound object, and following a preset range-searching and matching principle, at least one sound object whose vocal range is the same as that of the dry audio (which may be referred to as a first sound object) and at least one sound object whose vocal range is different from that of the dry audio (which may be referred to as a second sound object) are obtained by matching.
For example, if the vocal range of the dry audio is [60, 80], then 2 first sound objects with the vocal range [60, 80] may be matched, and 2 second sound objects with the vocal range [48, 68] may be matched (e.g., the lowest tone and the highest tone of the dry audio's vocal range each reduced by 12).
In the embodiment of the disclosure, when matching a second sound object whose vocal range is different from that of the dry audio, if the vocal range of the dry audio is higher than a preset range (i.e., the dry audio's range is high), a second sound object whose vocal range is lower than that of the dry audio may be matched; if the vocal range of the dry audio is lower than or equal to the preset range (i.e., the dry audio's range is low), a second sound object whose vocal range is higher than that of the dry audio may be matched.
For example, when the vocal range of the dry audio is high, the lowest tone and the highest tone of that range may each be reduced by a preset value, and the resulting range is used to match the second sound object; when the vocal range of the dry audio is low, the lowest tone and the highest tone may each be increased by the preset value, and the resulting range is used to match the second sound object.
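The range-matching step can be illustrated with a small sketch, assuming vocal ranges are expressed as note-number intervals and using the shift of 12 from the example above; the threshold, candidate library, and all names here are hypothetical, not the disclosed implementation:

```python
# Match first sound objects (same vocal range as the dry audio) and
# second sound objects (range shifted up or down depending on whether
# the dry audio's range is high or low).

HIGH_RANGE_THRESHOLD = 70  # hypothetical "preset range" boundary
SHIFT = 12                 # example offset from the disclosure's [60,80] -> [48,68]

def match_sound_objects(dry_range, candidates):
    low, high = dry_range
    # High dry range -> shift down; low dry range -> shift up.
    offset = -SHIFT if high > HIGH_RANGE_THRESHOLD else SHIFT
    shifted = (low + offset, high + offset)
    first = [c for c in candidates if c["range"] == dry_range]
    second = [c for c in candidates if c["range"] == shifted]
    return first, second

candidates = [
    {"name": "voice_a", "range": (60, 80)},
    {"name": "voice_b", "range": (48, 68)},
]
first, second = match_sound_objects((60, 80), candidates)
```

A real candidate library would of course hold many voices per range; the exact-equality comparison here stands in for whatever nearest-range search the preset matching principle uses.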
In general, the male vocal range is lower and the female vocal range is higher; when the dry audio is a male voice, the first sound object matched with the same vocal range is also a male voice, and the second sound object matched with a different vocal range is a female voice; when the dry audio is a female voice, the first sound object matched with the same vocal range is also a female voice, and the second sound object matched with a different vocal range is a male voice.
In step S223, at least one first audio for the multimedia content is generated from the dry audio and the at least one first sound object, and at least one second audio for the multimedia content is generated from the dry audio and the at least one second sound object.
In the embodiment of the disclosure, the dry audio and each first sound object can be processed by the trained singing voice changing model to generate each first audio for the multimedia content, and the dry audio and each second sound object can be processed by the same trained singing voice changing model to generate each second audio for the multimedia content. The process of generating the first audio and the process of generating the second audio are similar; in the following illustration, one first audio and one second audio are taken as examples.
Singing voice changing refers to changing the timbre of a piece of singing audio without changing its singing content.
In the embodiment of the present disclosure, the first audio generated from the dry audio and the first sound object is the audio of the multimedia content (e.g., lyrics and melody) corresponding to the dry audio, sung with the timbre of the first sound object.
In an exemplary embodiment, generating the first audio for the multimedia content from the dry audio and the first sound object includes: performing feature extraction on the dry audio to obtain the text content corresponding to the dry audio and the time information of the text content; performing fundamental frequency extraction on the dry audio to obtain the singing melody corresponding to the dry audio; acquiring the timbre features of the first sound object; and inputting the text content corresponding to the dry audio, the time information of the text content, the singing melody corresponding to the dry audio, and the timbre features of the first sound object into the singing voice changing model to obtain the first audio. Generating the second audio for the multimedia content from the dry audio and the second sound object includes: acquiring the timbre features of the second sound object; and inputting the text content corresponding to the dry audio, the time information of the text content, the singing melody corresponding to the dry audio, and the timbre features of the second sound object into the singing voice changing model to obtain the second audio.
In the embodiment of the disclosure, a pre-trained self-supervised feature decoupling model may be used to extract the text content and the time information of the text content from the dry audio, where the text content may be, for example, lyric content, and the time information may be the singing duration of each piece of lyric content. A fundamental frequency extraction algorithm, such as the YIN algorithm or another algorithm, may be used to extract fundamental frequency features representing the singing melody in the dry audio. The timbre features of the first sound object can be extracted from the audio data of the first sound object. The text content corresponding to the dry audio and its time information, the singing melody corresponding to the dry audio, and the timbre features of the first sound object are input into the trained singing voice changing model to obtain the first audio, in which the lyrics and melody corresponding to the dry audio are sung with the timbre of the first sound object; the first audio is voice-changed audio at the sampling-point level. Likewise, the text content corresponding to the dry audio and its time information, the singing melody corresponding to the dry audio, and the timbre features of the second sound object are input into the trained singing voice changing model to obtain the second audio, in which the lyrics and melody corresponding to the dry audio are sung with the timbre of the second sound object.
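The data flow of this conversion step can be sketched as follows; the extractors and the model here are trivial stand-ins (a real system would use a self-supervised feature decoupling model, a pitch tracker such as YIN, and the trained singing voice changing model), so only the orchestration is illustrative:

```python
# Structural sketch of the conversion pipeline: extract lyrics/timing
# and melody from the dry audio, then render them with a target timbre.

def extract_lyrics_and_timing(dry_audio):
    # Stand-in for the self-supervised feature decoupling model.
    return dry_audio["lyrics"], dry_audio["timing"]

def extract_melody(dry_audio):
    # Stand-in for a fundamental frequency (f0) tracker such as YIN.
    return dry_audio["f0"]

def convert(dry_audio, timbre, model):
    lyrics, timing = extract_lyrics_and_timing(dry_audio)
    melody = extract_melody(dry_audio)
    # The model renders the same lyrics and melody in the target timbre.
    return model(lyrics, timing, melody, timbre)

# Toy model: just records its inputs so the flow is visible.
toy_model = lambda lyrics, timing, melody, timbre: {
    "lyrics": lyrics, "melody": melody, "timbre": timbre}

dry = {"lyrics": ["la"], "timing": [(0.0, 1.0)], "f0": [220.0]}
first_audio = convert(dry, timbre="voice_a", model=toy_model)
```

Generating a second audio only swaps the timbre argument, which is why the disclosure treats the two generation processes as similar.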
In the embodiment of the disclosure, the singing voice changing model can be obtained through training by the following steps: acquiring first training audio of a first training sound object for a training song, and acquiring second training audio of a second training sound object for the same training song; performing feature extraction on the first training audio to obtain the text content corresponding to the first training audio and the time information of the text content; performing fundamental frequency extraction on the first training audio to obtain the singing melody corresponding to the first training audio; acquiring the timbre features of the second training sound object according to the second training audio; inputting the text content corresponding to the first training audio, the time information of the text content, the singing melody corresponding to the first training audio, and the timbre features of the second training sound object into the singing voice changing model to obtain predicted audio; and taking the second training audio as a training label, and adjusting the model parameters of the singing voice changing model through the loss value between the predicted audio and the second training audio, so as to obtain the trained singing voice changing model.
In the embodiment of the disclosure, the singing voice changing model can also be obtained through training by the following steps: acquiring first training audio of a first training sound object for a training song; performing feature extraction on the first training audio to obtain the text content corresponding to the first training audio and the time information of the text content; performing fundamental frequency extraction on the first training audio to obtain the singing melody corresponding to the first training audio; acquiring the timbre features of the first training sound object according to the first training audio; inputting the text content corresponding to the first training audio, the time information of the text content, the singing melody corresponding to the first training audio, and the timbre features of the first training sound object into the singing voice changing model to obtain predicted audio; and taking the first training audio as a training label, and adjusting the model parameters of the singing voice changing model through the loss value between the predicted audio and the first training audio, so as to obtain the trained singing voice changing model.
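The training objective shared by both variants — adjust the model parameters to reduce the loss between the predicted audio and the label audio — can be illustrated with a toy one-parameter model standing in for the neural singing voice changing model (purely a sketch of the loss-driven parameter update, not the disclosed model):

```python
# Toy loss-driven training loop: fit pred = gain * melody to the
# reference audio used as the training label, by gradient descent
# on the squared-error loss.

def l2_loss(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target))

def train_gain(melody, target, gain=0.0, lr=0.01, steps=500):
    """Adjust the single parameter `gain` to minimize the loss
    between the predicted samples and the label samples."""
    for _ in range(steps):
        pred = [gain * m for m in melody]
        # dL/dgain = sum(2 * (pred - target) * melody)
        grad = sum(2 * (p - t) * m for p, t, m in zip(pred, target, melody))
        gain -= lr * grad
    return gain

melody = [1.0, 2.0, 3.0]
target = [0.5, 1.0, 1.5]  # the "second training audio" as the label
gain = train_gain(melody, target)
```

In the real model the parameter vector is high-dimensional and the predictor is a neural network, but the update rule — compute the loss against the label audio, then step the parameters against its gradient — has the same shape.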
In the embodiment of the disclosure, by inputting the text content corresponding to the dry audio and its time information, the singing melody corresponding to the dry audio, and the timbre features of the first sound object into the singing voice changing model, the first audio can be obtained in which the text content and melody corresponding to the dry audio are sung with the timbre of the first sound object; by inputting the text content corresponding to the dry audio and its time information, the singing melody corresponding to the dry audio, and the timbre features of the second sound object into the singing voice changing model, the second audio can be obtained in which the text content and melody corresponding to the dry audio are sung with the timbre of the second sound object. This method can automatically generate audio that is sung with different timbres but has the same text content and melody as the dry audio, thereby improving the efficiency and flexibility of audio generation.
In step S224, at least one first audio and at least one second audio are mixed to obtain mixed audio.
In the embodiment of the disclosure, the at least one first audio and the at least one second audio obtained by matching can be mixed to obtain the mixed audio; since the vocal range of the first audio is the same as that of the dry audio, the vocal range of the second audio is different from that of the dry audio, and the timbres of each first audio, each second audio, and the dry audio all differ, the obtained mixed audio has the chorus effect of different voice parts. A voice part refers to one of the melodic or harmonic parts, such as the tenor part, when a plurality of persons or a plurality of musical instruments sound simultaneously.
In the embodiment of the disclosure, on one hand, a first sound object whose vocal range is the same as that of the dry audio and a second sound object whose vocal range is different from that of the dry audio are obtained by matching according to the vocal range of the dry audio, so that the subsequently obtained mixed audio has the chorus effect of different voice parts; on the other hand, the first audio for the multimedia content is generated from the dry audio and the first sound object, and the second audio for the multimedia content is generated from the dry audio and the second sound object, that is, the method can automatically generate audio that is sung with different timbres but has the same text content and melody as the dry audio, thereby improving the efficiency and flexibility of audio generation.
In an exemplary embodiment, before mixing at least one first audio and at least one second audio to obtain mixed audio, the method comprises: adjusting the volume of the at least one first audio and the at least one second audio so that the volume of the at least one first audio is smaller than the volume of the at least one second audio; and performing sound image adjustment on the at least one first audio and the at least one second audio so that the at least one first audio is far away from the virtual center position relative to the at least one second audio, thereby obtaining mixed audio with a stereo effect.
In the embodiment of the disclosure, before the first audio and the second audio are mixed, volume adjustment can be performed on the first audio and the second audio, and sound image adjustment can be performed on the first audio and the second audio; the sequence of volume adjustment and sound image adjustment is not limited.
In the embodiment of the disclosure, when the volume adjustment is performed, the volume of the first audio may be reduced, and/or the volume of the second audio may be increased, so that the volume of the first audio in the same range as the range of the dry audio is smaller than the volume of the second audio in a different range from the range of the dry audio, thereby creating the effect of chorus group.
In the embodiment of the disclosure, when the sound image adjustment is performed, the virtual sound image position of the first audio and/or the second audio may be adjusted, so that the first audio with the same range as the range of the dry audio is far away from the virtual center position relative to the second audio with a different range from the range of the dry audio, thereby creating a stereo effect; the virtual center position may be, for example, a midpoint of a straight line where the simulated left and right ears are located.
In the embodiment of the disclosure, before the first audio and the second audio are mixed, volume adjustment and sound image adjustment can be performed on the first audio and the second audio, so that the obtained mixed audio has the effects of chorus and stereo.
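These pre-mix adjustments can be sketched as follows, using a simple linear pan law and illustrative gain/pan values (all hypothetical; the disclosure does not specify a pan law or concrete gains):

```python
# Pre-mix sketch: same-range (first) voices are attenuated and panned
# wide; different-range (second) voices stay louder and nearer the
# virtual centre. Everything is then summed into a stereo chorus track.

def pan_and_gain(samples, gain, pan):
    """pan in [-1, 1] (0 = virtual centre); linear pan into (left, right)."""
    left = [s * gain * (1 - pan) / 2 for s in samples]
    right = [s * gain * (1 + pan) / 2 for s in samples]
    return left, right

def mix_stereo(tracks):
    """Sum a list of (left, right) track pairs sample by sample."""
    n = len(tracks[0][0])
    left = [sum(t[0][i] for t in tracks) for i in range(n)]
    right = [sum(t[1][i] for t in tracks) for i in range(n)]
    return left, right

first_voice = pan_and_gain([1.0, 1.0], gain=0.4, pan=-0.8)  # quieter, far left
second_voice = pan_and_gain([1.0, 1.0], gain=0.8, pan=0.2)  # louder, near centre
left, right = mix_stereo([first_voice, second_voice])
```

Pushing the same-range voices outward and down in level keeps the user's own dry voice dominant at the centre, which is what produces the chorus-group impression described above.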
Fig. 6 is an interaction diagram of a client corresponding to a multimedia content presentation method according to an example. In the embodiment shown in fig. 6, the multimedia content is taken as an example of a target song, and the audio matching control is taken as an AI chorus control, but the disclosure is not limited thereto.
Referring to fig. 6, the multimedia content presentation method may include:
in step S61, the user records dry audio for the target song.
In step S62, after the recording of the dry audio for the target song is completed, it is determined whether the availability condition of the AI chorus control is met.
For example, whether the availability condition of the AI chorus control is met can be judged by judging whether the signal-to-noise ratio of the dry audio is larger than a preset value; when the availability condition of the AI chorus control is not met, the state of the AI chorus control is the unavailable state; when the availability condition of the AI chorus control is met, the state of the AI chorus control is the available state.
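This availability gate can be sketched as follows, with an illustrative SNR computation and threshold (the preset value and the power-based SNR estimate are assumptions, not the disclosed values):

```python
# Enable the AI chorus control only when the recorded dry audio's
# signal-to-noise ratio exceeds a preset threshold.
import math

SNR_THRESHOLD_DB = 20.0  # hypothetical preset value

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels from power estimates."""
    return 10.0 * math.log10(signal_power / noise_power)

def chorus_control_state(signal_power, noise_power):
    ok = snr_db(signal_power, noise_power) > SNR_THRESHOLD_DB
    return "available" if ok else "unavailable"

state = chorus_control_state(signal_power=1.0, noise_power=0.001)
```

In practice the foreground would compute the power estimates from the recorded waveform right after step S61 and grey out the control when the result is "unavailable".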
In step S63, the user clicks the AI chorus control.
In step S64, in response to the user clicking the AI chorus control, the client foreground uploads the user' S dry audio to the client background.
In step S65, the client background obtains a plurality of sound objects from the singing voice model library according to the vocal range of the dry audio, and generates a plurality of audios according to the timbre features of each sound object and the dry audio.
For example, the plurality of audios may include audio 1, audio 2, audio 3, and audio 4, and these 4 audios may correspond to different voice parts.
In step S66, the generated plurality of audios are mixed according to the mixing policy to obtain a chorus track.
In step S67, the user' S dry audio and chorus audio are played in superposition in the client foreground.
In step S68, the user previews the effect of the superimposed playback of the dry audio and the chorus audio and performs editing operations.
In step S69, a target chorus file is generated from the target song, the dry audio, and the chorus audio, and the target chorus file is distributed.
It should also be understood that the above is only intended to assist those skilled in the art in better understanding the embodiments of the present disclosure, and is not intended to limit the scope of the embodiments of the present disclosure. It will be apparent to those skilled in the art from the foregoing examples that various equivalent modifications or variations can be made; for example, some steps of the methods described above may not be necessary, some steps may be newly added, or any two or more of the above modifications may be combined. Such modifications, variations, or combinations are also within the scope of the embodiments of the present disclosure.
It should also be understood that the foregoing description of the embodiments of the present disclosure focuses on highlighting differences between the various embodiments and that the same or similar elements not mentioned may be referred to each other and are not repeated here for brevity.
It should also be understood that the sequence numbers of the above processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure.
It is also to be understood that in the various embodiments of the disclosure, terms and/or descriptions of the various embodiments are consistent and may be referenced to one another in the absence of a particular explanation or logic conflict, and that the features of the various embodiments may be combined to form new embodiments in accordance with their inherent logic relationships.
Examples of the multimedia content presentation method provided by the present disclosure are described above in detail. It will be appreciated that the computer device, in order to carry out the functions described above, comprises corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 7 is a block diagram illustrating a multimedia content presentation device according to an exemplary embodiment. Referring to fig. 7, the apparatus 700 may include a display module 710 and a play module 720.
The display module 710 is configured to display a multimedia editing page after recording of the dry audio for the multimedia content is completed, the multimedia editing page including an audio matching control for matching mixed audio to the dry audio. The display module 710 is further configured to, in response to a triggering operation on the audio matching control, obtain mixed audio matching the dry audio and display an audio presentation page, the mixed audio being a mixture of a plurality of audios generated for the multimedia content according to a plurality of sound objects, and the audio presentation page including a play control. The play module 720 is configured to, in response to a triggering operation on the play control, play, in an overlapped manner, the dry audio corresponding to at least part of the multimedia content and the mixed audio corresponding to the at least part of the multimedia content.
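For illustration only, the overlapped playback performed by the play module can be sketched as a sample-wise sum of two time-aligned streams. The function name, the gain parameters, and the clamping strategy below are assumptions for the sketch, not part of the disclosed embodiment:

```python
def overlay(dry, mixed, dry_gain=1.0, mixed_gain=1.0):
    """Sum two equal-rate sample streams for overlapped playback.

    `dry` and `mixed` are lists of float samples in [-1.0, 1.0]; the
    shorter stream is padded with silence so the segments stay aligned.
    """
    n = max(len(dry), len(mixed))
    dry = dry + [0.0] * (n - len(dry))
    mixed = mixed + [0.0] * (n - len(mixed))
    # Sum sample-wise and clamp to the valid range to avoid hard clipping.
    return [max(-1.0, min(1.0, d * dry_gain + m * mixed_gain))
            for d, m in zip(dry, mixed)]

combined = overlay([0.5, 0.5], [0.25, -0.25, 0.1])  # → [0.75, 0.25, 0.1]
```

In a real player the summed buffer would be handed to the audio output; only the alignment-and-sum idea is relevant here.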
In some exemplary embodiments of the present disclosure, the audio presentation page further includes a plurality of pieces of text information of the multimedia content and a selection control in one-to-one correspondence with each piece of text information, the states of the selection control including a to-be-selected state and a selected state. The play module 720 is configured to: in response to a triggering operation on a selection control whose state is the to-be-selected state, update the state of that selection control to the selected state; in response to a triggering operation on the play control, sequentially play the audio clips corresponding to each piece of text information; and, when the text information corresponding to a selection control in the selected state is being played, play the dry audio and the mixed audio corresponding to the currently played text information in an overlapped manner.
In some exemplary embodiments of the present disclosure, the play module 720 is configured to: in response to a triggering operation on a selection control whose state is the selected state, update the state of that selection control to the to-be-selected state; and, when the text information corresponding to a selection control in the to-be-selected state is being played, play the dry audio corresponding to the currently played text information.
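The two selection-control behaviors above reduce to a state toggle plus a per-segment playback decision; a minimal sketch (all names are hypothetical, not from the embodiment):

```python
TO_BE_SELECTED, SELECTED = "to-be-selected", "selected"

def toggle(state):
    # Each triggering operation flips the control to the other state.
    return SELECTED if state == TO_BE_SELECTED else TO_BE_SELECTED

def playback_mode(state):
    # Segments whose control is selected play dry audio and mixed audio
    # overlapped; to-be-selected segments play the dry audio alone.
    return "dry+mixed" if state == SELECTED else "dry"
```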
In some exemplary embodiments of the present disclosure, the audio presentation page further includes a volume adjustment control, and the apparatus is further configured to determine, in response to an adjustment operation on the volume adjustment control, a target volume for playing the mixed audio. The play module 720 is configured to: play the audio clips of the dry audio corresponding to the at least part of the multimedia content at a preset volume, and play the audio clips of the mixed audio corresponding to the at least part of the multimedia content at the target volume.
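One plausible way to derive the target volume from a volume adjustment control is to map a slider position to a linear gain through a decibel curve; the curve and its bounds below are illustrative assumptions, not disclosed by the embodiment:

```python
def slider_to_gain(slider, min_db=-40.0, max_db=0.0):
    """Map a 0..1 slider position to a linear playback gain.

    A dB-linear taper is assumed; position 0 is treated as mute.
    """
    if slider <= 0.0:
        return 0.0
    db = min_db + (max_db - min_db) * min(slider, 1.0)
    return 10.0 ** (db / 20.0)

target_volume = slider_to_gain(0.75)   # user-set gain for the mixed audio
preset_volume = 1.0                    # fixed gain for the dry audio
```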
In some exemplary embodiments of the present disclosure, the audio presentation page further includes a completion control, and the apparatus further includes a generation module configured to, in response to a triggering operation on the completion control, generate a target multimedia file from the dry audio and the mixed audio.
In some exemplary embodiments of the present disclosure, the apparatus further includes: a skip module configured to, in response to a triggering operation on the completion control, jump from the audio presentation page to the multimedia editing page, the multimedia editing page including a release control; and a release module configured to, in response to a triggering operation on the release control, release the target multimedia file.
In some exemplary embodiments of the present disclosure, the apparatus further includes an obtaining module configured to: obtain a vocal range of the dry audio; match, from a plurality of candidate sound objects, at least one first sound object whose vocal range is the same as that vocal range and at least one second sound object whose vocal range is different from it; generate at least one first audio for the multimedia content from the dry audio and the at least one first sound object, and generate at least one second audio for the multimedia content from the dry audio and the at least one second sound object; and mix the at least one first audio and the at least one second audio to obtain the mixed audio.
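The range-matching step above could be realized as follows. This is a sketch only: the semitone (MIDI) representation, the tolerance, and all names are assumptions, since the embodiment does not specify how "same range" is decided:

```python
from math import log2

def to_midi(f0_hz):
    # Frequency → MIDI note number (A4 = 440 Hz = MIDI 69).
    return 69.0 + 12.0 * log2(f0_hz / 440.0)

def vocal_range(f0_track_hz):
    # Lowest and highest voiced pitch of the dry audio, in MIDI numbers;
    # zero entries stand for unvoiced frames and are ignored.
    voiced = [to_midi(f) for f in f0_track_hz if f and f > 0]
    return min(voiced), max(voiced)

def match_sound_objects(dry_range, candidates, tolerance=2.0):
    """Split candidates into same-range (first) and different-range
    (second) sound objects; `candidates` maps name -> (low, high)."""
    lo, hi = dry_range
    first, second = [], []
    for name, (c_lo, c_hi) in candidates.items():
        same = abs(c_lo - lo) <= tolerance and abs(c_hi - hi) <= tolerance
        (first if same else second).append(name)
    return first, second
```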
In some exemplary embodiments of the present disclosure, the generation module is configured to: perform feature extraction on the dry audio to obtain the text content corresponding to the dry audio and the time information of the text content; perform fundamental frequency extraction on the dry audio to obtain the singing melody corresponding to the dry audio; acquire timbre features of the at least one first sound object; input the text content corresponding to the dry audio and its time information, the singing melody corresponding to the dry audio, and the timbre features of the at least one first sound object into a singing voice changing model to obtain the at least one first audio; acquire timbre features of the at least one second sound object; and input the text content corresponding to the dry audio and its time information, the singing melody corresponding to the dry audio, and the timbre features of the at least one second sound object into the singing voice changing model to obtain the at least one second audio.
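The generation pipeline above can be outlined as follows. The embodiment does not disclose concrete models, so the extraction and conversion steps are stand-in stubs (in practice, an ASR model with alignment might supply text and timing, and a pitch tracker such as pYIN might supply the fundamental frequency):

```python
def extract_features(dry_audio):
    # Stand-in for feature and fundamental-frequency extraction; in a
    # real system this would analyze the recorded waveform, but here
    # the dry audio is a pre-analyzed dict for illustration.
    return {"text": dry_audio["lyrics"],
            "timings": dry_audio["timings"],   # (start_s, end_s) per line
            "melody": dry_audio["f0_track"]}   # fundamental frequency, Hz

def sing(features, timbre):
    # Stand-in for the singing voice changing model: the same words and
    # melody, re-rendered with the target sound object's timbre.
    return {"timbre": timbre, **features}

def generate_audios(dry_audio, first_timbres, second_timbres):
    # First audios share the dry vocal's range; second audios differ.
    features = extract_features(dry_audio)
    first = [sing(features, t) for t in first_timbres]
    second = [sing(features, t) for t in second_timbres]
    return first, second
```

The key point the sketch captures is that the features are extracted once and re-rendered per sound object, yielding one audio per timbre.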
In some exemplary embodiments of the present disclosure, the apparatus further includes an adjustment module configured to: adjust the volumes of the at least one first audio and the at least one second audio so that the volume of the at least one first audio is lower than the volume of the at least one second audio; and perform sound image (pan) adjustment on the at least one first audio and the at least one second audio so that the at least one first audio is placed farther from a virtual center position than the at least one second audio, thereby obtaining mixed audio with a stereo effect.
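A minimal sketch of the volume and sound image adjustment described above, using constant-power panning; the specific gain and pan values are illustrative assumptions:

```python
import math

def pan_gains(position):
    # Constant-power panning: -1 = hard left, 0 = center, +1 = hard right.
    angle = (position + 1.0) * math.pi / 4.0
    return math.cos(angle), math.sin(angle)

def mix_stereo(parts):
    """Each part is (samples, volume, pan_position); returns (left, right)."""
    n = max(len(samples) for samples, _, _ in parts)
    left, right = [0.0] * n, [0.0] * n
    for samples, volume, position in parts:
        lg, rg = pan_gains(position)
        for i, s in enumerate(samples):
            left[i] += s * volume * lg
            right[i] += s * volume * rg
    return left, right

# First (same-range) audio: quieter and panned away from the center;
# second (different-range) audio: louder and near the virtual center.
stereo = mix_stereo([
    ([0.2, 0.4], 0.5, -0.8),       # first audio
    ([0.5, 0.5, 0.3], 1.0, 0.0),   # second audio
])
```

Constant-power panning keeps perceived loudness roughly constant as a part moves across the stereo field, which is why it is a common default for this kind of placement.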
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method and will not be repeated here.
An electronic device 800 according to such an embodiment of the present disclosure is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is merely an example and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 8, the electronic device 800 is embodied in the form of a general purpose computing device. Components of electronic device 800 may include, but are not limited to: the at least one processing unit 810, the at least one storage unit 820, a bus 830 connecting the different system components (including the storage unit 820 and the processing unit 810), and a display unit 840.
Wherein the storage unit stores program code that is executable by the processing unit 810 such that the processing unit 810 performs steps according to various exemplary embodiments of the present disclosure described in the above section of the present description of exemplary methods. For example, the processing unit 810 may perform the various steps as shown in fig. 2.
Storage unit 820 may include readable media in the form of volatile storage units such as Random Access Memory (RAM) 821 and/or cache memory unit 822, and may further include Read Only Memory (ROM) 823.
The storage unit 820 may also include a program/utility 824 having a set (at least one) of program modules 825, such program modules 825 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 830 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 870 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 800, and/or any device (e.g., router, modem, etc.) that enables the electronic device 800 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 850. Also, electronic device 800 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 860. As shown, network adapter 860 communicates with other modules of electronic device 800 over bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 800, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of the embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions for causing a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In an exemplary embodiment, a computer-readable storage medium is also provided, e.g., a memory, including instructions executable by a processor of an apparatus to perform the above method. Alternatively, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In an exemplary embodiment, a computer program product is also provided, comprising a computer program/instruction which, when executed by a processor, implements the multimedia content presentation method in the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure that follow the general principles of the disclosure and include such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with the true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A method of multimedia content presentation, comprising:
after recording of the dry audio for the multimedia content is completed, displaying a multimedia editing page, wherein the multimedia editing page comprises an audio matching control, and the audio matching control is used for matching mixed audio for the dry audio;
in response to a triggering operation for the audio matching control, obtaining mixed audio matching the dry audio and displaying an audio presentation page, wherein the mixed audio is a mixture of a plurality of audios generated for the multimedia content according to a plurality of sound objects, and the audio presentation page comprises a play control;
and in response to a triggering operation for the play control, playing, in an overlapped manner, the dry audio corresponding to at least part of the multimedia content and the mixed audio corresponding to the at least part of the multimedia content.
2. The method of claim 1, wherein the audio presentation page further comprises a plurality of pieces of text information of the multimedia content and a selection control in one-to-one correspondence with each piece of text information, and the states of the selection control comprise a to-be-selected state and a selected state;
wherein, in response to the triggering operation for the play control, playing, in an overlapped manner, the dry audio corresponding to at least part of the multimedia content and the mixed audio corresponding to the at least part of the multimedia content comprises:
in response to a triggering operation for a selection control whose state is the to-be-selected state, updating the state of the selection control to the selected state;
in response to the triggering operation for the play control, sequentially playing the audio clips corresponding to each piece of text information;
and when the text information corresponding to the selection control in the selected state is played, playing, in an overlapped manner, the dry audio and the mixed audio corresponding to the currently played text information.
3. The method of claim 2, further comprising:
in response to a triggering operation for a selection control whose state is the selected state, updating the state of the selection control to the to-be-selected state;
and when the text information corresponding to the selection control in the to-be-selected state is played, playing the dry audio corresponding to the currently played text information.
4. The method of claim 1, wherein the audio presentation page further comprises a volume adjustment control;
Wherein the method further comprises:
determining a target volume for playing the mixed audio in response to an adjustment operation for the volume adjustment control;
wherein playing, in an overlapped manner, the dry audio corresponding to at least part of the multimedia content and the mixed audio corresponding to the at least part of the multimedia content comprises:
playing the audio clips of the dry audio corresponding to the at least part of the multimedia content at a preset volume, and playing the audio clips of the mixed audio corresponding to the at least part of the multimedia content at the target volume.
5. The method of any one of claims 1-4, wherein the audio presentation page further comprises a completion control;
wherein the method further comprises:
in response to a triggering operation for the completion control, generating a target multimedia file from the dry audio and the mixed audio.
6. The method of claim 5, further comprising:
in response to a triggering operation for the completion control, jumping from the audio presentation page to the multimedia editing page, wherein the multimedia editing page comprises a release control;
And responding to the triggering operation for the release control, and releasing the target multimedia file.
7. The method of claim 1, wherein obtaining mixed audio matching the dry audio comprises:
obtaining a vocal range of the dry audio;
matching, from a plurality of candidate sound objects, at least one first sound object whose vocal range is the same as the vocal range and at least one second sound object whose vocal range is different from the vocal range;
generating at least one first audio for the multimedia content from the dry audio and the at least one first sound object, and generating at least one second audio for the multimedia content from the dry audio and the at least one second sound object;
mixing the at least one first audio and the at least one second audio to obtain the mixed audio.
8. The method of claim 7, wherein generating at least one first audio for the multimedia content from the dry audio and the at least one first sound object comprises:
performing feature extraction on the dry audio to obtain text content corresponding to the dry audio and time information of the text content;
performing fundamental frequency extraction on the dry audio to obtain a singing melody corresponding to the dry audio;
acquiring timbre features of the at least one first sound object;
and inputting the text content corresponding to the dry audio and its time information, the singing melody corresponding to the dry audio, and the timbre features of the at least one first sound object into a singing voice changing model to obtain the at least one first audio;
wherein generating at least one second audio for the multimedia content from the dry audio and the at least one second sound object comprises:
acquiring timbre features of the at least one second sound object;
and inputting the text content corresponding to the dry audio and its time information, the singing melody corresponding to the dry audio, and the timbre features of the at least one second sound object into the singing voice changing model to obtain the at least one second audio.
9. The method of claim 7, wherein, prior to mixing the at least one first audio and the at least one second audio to obtain the mixed audio, the method further comprises:
adjusting the volumes of the at least one first audio and the at least one second audio so that the volume of the at least one first audio is lower than the volume of the at least one second audio;
and performing sound image adjustment on the at least one first audio and the at least one second audio so that the at least one first audio is farther from a virtual center position than the at least one second audio, to obtain the mixed audio with a stereo effect.
10. A multimedia content presentation device, comprising:
a display module configured to display a multimedia editing page after recording of dry audio for multimedia content is completed, wherein the multimedia editing page comprises an audio matching control, and the audio matching control is used for matching mixed audio for the dry audio;
wherein the display module is further configured to, in response to a triggering operation for the audio matching control, obtain mixed audio matching the dry audio and display an audio presentation page, wherein the mixed audio is a mixture of a plurality of audios generated for the multimedia content according to a plurality of sound objects, and the audio presentation page comprises a play control;
and a play module configured to, in response to a triggering operation for the play control, play, in an overlapped manner, the dry audio corresponding to at least part of the multimedia content and the mixed audio corresponding to the at least part of the multimedia content.
11. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the multimedia content presentation method of any one of claims 1 to 9.
12. A computer-readable storage medium storing instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the multimedia content presentation method of any one of claims 1 to 9.
CN202311126243.5A 2023-09-01 2023-09-01 Multimedia content display method and device, electronic equipment and storage medium Pending CN117075772A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311126243.5A CN117075772A (en) 2023-09-01 2023-09-01 Multimedia content display method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311126243.5A CN117075772A (en) 2023-09-01 2023-09-01 Multimedia content display method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117075772A true CN117075772A (en) 2023-11-17

Family

ID=88713300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311126243.5A Pending CN117075772A (en) 2023-09-01 2023-09-01 Multimedia content display method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117075772A (en)

Similar Documents

Publication Publication Date Title
JP7283496B2 (en) Information processing method, information processing device and program
CN107832434A (en) Method and apparatus based on interactive voice generation multimedia play list
CN112309365B (en) Training method and device of speech synthesis model, storage medium and electronic equipment
US11511200B2 (en) Game playing method and system based on a multimedia file
TW201238279A (en) Semantic audio track mixer
JP2019091014A (en) Method and apparatus for reproducing multimedia
US20190103084A1 (en) Singing voice edit assistant method and singing voice edit assistant device
CN114023301A (en) Audio editing method, electronic device and storage medium
CN110968289A (en) Audio playing method and device and computer storage medium
JP2021101252A (en) Information processing method, information processing apparatus, and program
Taylor et al. Using music to interact with a virtual character
CN109241329A (en) For the music retrieval method of AR scene, device, equipment and storage medium
WO2024193227A1 (en) Voice editing method and apparatus, and storage medium and electronic apparatus
Goto Music listening in the future: augmented music-understanding interfaces and crowd music listening
CN117075772A (en) Multimedia content display method and device, electronic equipment and storage medium
CN115346503A (en) Song creation method, song creation apparatus, storage medium, and electronic device
CN112685000B (en) Audio processing method, device, computer equipment and storage medium
CN112015945B (en) Method, system and device for displaying expression image on sound box in real time
Nika et al. Composing structured music generation processes with creative agents
CN115209211A (en) Subtitle display method, subtitle display apparatus, electronic device, storage medium, and program product
KR102534870B1 (en) Method and apparatus for providing an audio mixing interface using a plurality of audio stems
WO2024075422A1 (en) Musical composition creation method and program
KR102608935B1 (en) Method and apparatus for providing real-time audio mixing service based on user information
US20240184515A1 (en) Vocal Attenuation Mechanism in On-Device App
WO2023112534A1 (en) Information processing device, information processing method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination