WO2024001661A1

WO2024001661A1 - Video synthesis method and apparatus, device, and storage medium

Info

Publication number: WO2024001661A1
Application number: PCT/CN2023/097738
Authority: WO
Inventors: 谢炜航
Original assignee: 北京新唐思创教育科技有限公司
Priority date: 2022-06-28
Filing date: 2023-06-01
Publication date: 2024-01-04
Also published as: CN114845136A; CN114845136B

Abstract

The present disclosure relates to the technical field of computers. Disclosed are a video synthesis method and apparatus, a device, and a storage medium. The method is applied to a server and comprises: receiving a user video stream, the user video stream being a video stream filmed by means of a camera of a user terminal; recording a target virtual scene by using a target-view-angle camera independent of a user-view-angle camera so as to generate a scene video stream at a target view angle, the target virtual scene being a virtual scene corresponding to a theme virtual space displayed on the user terminal; and fusing the user video stream and the scene video stream to generate a synthesized video stream. The technical solution lowers the requirements for the device performance of the user terminal and a network, solves the problems of slow uploading and frame loss of the scene video stream and the like and thus improves the efficiency of video synthesis and the smoothness of the synthesized video stream, and also improves the consistency of the content of the synthesized video stream and the content of the target virtual scene.

Description

Video synthesis method, device, equipment and storage medium

This application claims priority to the invention patent application No. 202210740529.1, filed on June 28, 2022 and entitled "Video synthesis method, device, equipment and storage medium", the entire content of which is incorporated herein by reference.

Technical field

The present disclosure relates to the field of computer technology, and in particular, to a video synthesis method, device, equipment and storage medium.

Background technique

With the development of Internet technology, various resource sharing platforms provide many video-related functions. For example, the user's real camera footage is video-fused with the virtual scene content in a specific theme scene to generate a synthetic video for the user's later consumption.

Current video synthesis solutions mainly include manual editing and server-side automatic synthesis. In the manual editing method, a person uses video editing software to combine and edit the user's real camera footage and the virtual scene content. In the server-side automatic synthesis method, the user terminal obtains both the user's real camera footage and the virtual scene content, and sends both to the server for automatic synthesis.

However, the manual editing method is time-consuming and labor-intensive and cannot meet the needs of batch video synthesis, while the server-side automatic synthesis method places high demands on the network and on user terminal performance, which can easily cause the synthesized video to freeze.

Contents of the invention

In order to solve the above technical problems, the present disclosure provides a video synthesis method, device, equipment and storage medium.

In a first aspect, the present disclosure provides a video synthesis method applied to a server. The method includes: receiving a user video stream, wherein the user video stream is a video stream captured by a camera of a user terminal; recording a target virtual scene using a target perspective camera that is independent of a user perspective camera, and generating a scene video stream from the target perspective, wherein the target virtual scene is a virtual scene corresponding to a theme virtual space displayed in the user terminal; and fusing the user video stream and the scene video stream to generate a synthesized video stream.

In a second aspect, the present disclosure provides a video synthesis device configured on a server. The device includes: a user video stream receiving module, configured to receive a user video stream, wherein the user video stream is a video stream captured by a camera of a user terminal; a scene video stream generation module, configured to record a target virtual scene using a target perspective camera that is independent of a user perspective camera, and to generate a scene video stream from the target perspective, wherein the target virtual scene is a virtual scene corresponding to a theme virtual space displayed in the user terminal; and a first synthesized video stream generation module, configured to fuse the user video stream and the scene video stream to generate a synthesized video stream.

In a third aspect, the present disclosure provides an electronic device, the electronic device including: a processor; and a memory storing a program, wherein the program includes instructions that, when executed by the processor, cause the processor to execute the video synthesis method described in any embodiment of the present disclosure.

In a fourth aspect, the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause the computer to execute the video synthesis method described in any embodiment of the present disclosure.

One or more technical solutions provided in the embodiments of the present disclosure receive a user video stream captured by the camera of the user terminal, use a target perspective camera that is independent of the user perspective camera to record the target virtual scene corresponding to the theme virtual space displayed in the user terminal so as to generate a scene video stream from the target perspective, and fuse the user video stream and the scene video stream to generate a synthesized video stream. First, the synthesized video stream is generated automatically in the server, which avoids the time-consuming and labor-intensive manual synthesis of videos. Second, recording the scene video stream on the server avoids the freezing of the synthesized video that device performance and network conditions cause when the scene video stream is recorded at the user terminal and uploaded to the server; this lowers the requirements on the device performance and network of the user terminal, solves the problems of slow upload and frame loss of the scene video stream, and improves the efficiency of video synthesis and the smoothness of the synthesized video stream. Third, the scene video stream is obtained by recording the target virtual scene, which improves the content consistency between the synthesized video stream and the target virtual scene.

Description of drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.

Figure 1 is a flow chart of a video synthesis method provided by an embodiment of the present disclosure;

Figure 2 is a schematic display diagram of a user video stream provided by an embodiment of the present disclosure;

Figure 3 is a schematic display diagram of a synthetic video stream provided by an embodiment of the present disclosure;

Figure 4 is a flow chart of another video synthesis method provided by an embodiment of the present disclosure;

Figure 5 is a flow chart of yet another video synthesis method provided by an embodiment of the present disclosure;

Figure 6 is a schematic structural diagram of a video synthesis device provided by an embodiment of the present disclosure;

Figure 7 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure.

Detailed description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of this disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of the present disclosure.

It should be understood that various steps described in the method implementations of the present disclosure may be executed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performance of illustrated steps. The scope of the present disclosure is not limited in this regard.

As used herein, the term "include" and its variations are open-ended, i.e., "including but not limited to." The term "based on" means "based at least in part on." The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; and the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below. It should be noted that concepts such as "first" and "second" mentioned in this disclosure are only used to distinguish different devices, modules or units, and are not used to limit the order or interdependence of the functions performed by these devices, modules or units.

It should be noted that the modifiers "one" and "plurality" mentioned in this disclosure are illustrative and not restrictive. Those skilled in the art will understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".

The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not used to limit the scope of these messages or information.

The video synthesis method provided by the embodiments of the present disclosure is mainly suitable for synthesizing a user video stream collected by a camera of a user terminal with a scene video stream corresponding to a virtual scene. In some embodiments, the video synthesis method can be applied to fuse the user's real camera footage with special effects audio and video content in the theme scene of a short video to generate a synthesized special effects video. In other embodiments, the video synthesis method can be used to seamlessly integrate the user's real camera footage into the virtual scene of an educational theme, a game theme, or a live broadcast room theme, and generate a synthesized video under the corresponding theme (such as a playback video containing user images).

The video synthesis method provided by the embodiments of the present disclosure can be executed by a video synthesis device. The device can be implemented in software and/or hardware and can be integrated in an electronic device on the server side, such as a laptop computer, a desktop computer, a server or a server cluster.

Figure 1 is a flow chart of a video synthesis method provided by an embodiment of the present disclosure. Referring to Figure 1, the video synthesis method specifically includes:

S110. Receive user video stream.

Here, the user video stream is a video stream captured by the camera of the user terminal.

Specifically, according to the above description, the video synthesis in the embodiment of the present disclosure is to fuse the real picture captured by the camera of the user terminal with the scene picture corresponding to the virtual scene. Therefore, the server will receive the user video stream sent by the user terminal.

In some embodiments, S110 includes: receiving a user video stream from a user terminal through a real-time communication transport protocol.

Specifically, in the related art, the user video stream is transmitted from the user terminal to the server over the Transmission Control Protocol (TCP). However, because the user video stream carries a relatively large amount of data and TCP requires a three-way handshake, transmission delays and even frame loss can easily occur. Therefore, in this embodiment, a Real-time Communications (RTC) transport protocol is used to transmit the user video stream. The RTC transport protocol carries redundant fields that can be used to accurately determine whether packets have been lost, and the UDP transmission on its link is one-way and requires no three-way handshake, so the protocol places lower demands on the network. This makes the transmission of the user video stream highly resistant to weak networks, reducing the network delay of user video stream transmission and avoiding frame loss to a certain extent.
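The loss-detection idea described above can be illustrated with a minimal sketch. It assumes that each packet carries a monotonically increasing sequence number, which is one common form such a redundant field can take; the disclosure does not specify the field layout. The names `Packet` and `find_missing` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    seq: int        # assumed per-packet sequence number (hypothetical field)
    payload: bytes  # encoded video data


def find_missing(packets):
    """Return the sequence numbers absent between the lowest and highest seen."""
    seen = {p.seq for p in packets}
    if not seen:
        return []
    return [s for s in range(min(seen), max(seen) + 1) if s not in seen]


# Packets may arrive out of order over UDP; packet 3 never arrives.
received = [Packet(1, b"a"), Packet(2, b"b"), Packet(5, b"e"), Packet(4, b"d")]
missing = find_missing(received)
print(missing)
```

This sketch ignores sequence-number wraparound and retransmission, which a real transport would have to handle.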

S120. Use a target perspective camera that is independent of the user perspective camera to record the target virtual scene and generate a scene video stream from the target perspective.

Here, the user perspective camera is a virtual camera in the rendering engine that corresponds to the viewing perspective from which the user views the target virtual scene through the user terminal. The target perspective is the viewing perspective required for the synthesized video stream; for example, it may be the perspective of a bystander other than the user. The target perspective camera is a virtual camera in the rendering engine corresponding to the target perspective. The target virtual scene is the virtual scene corresponding to the theme virtual space displayed in the user terminal. The theme virtual space is the network space corresponding to the application scenario; for example, the theme virtual space includes an online live broadcast room, a virtual game room or a virtual education space. The scene video stream is a video stream generated by recording the target virtual scene.

Specifically, in the related art, the scene video stream is recorded by the user terminal, which requires the user terminal to upload the scene video stream. This causes the upload delay and frame loss problems described above, which in turn lead to video freezes. Therefore, in the embodiment of the present disclosure, a target perspective camera is started directly in the server corresponding to the user terminal, and the target perspective camera is used to record, from the target perspective, the target virtual scene running in the server, so as to generate a scene video stream from the target perspective.

For example, for application scenarios where the application main body runs in the cloud (such as cloud games, cloud live broadcasts, cloud classrooms, etc.), the corresponding server in the cloud already runs a target virtual scene synchronized with the user terminal. In this case, the target perspective camera can be started directly in the corresponding cloud server to record the target virtual scene and obtain the scene video stream.

For another example, for application scenarios where the application main body runs on the user terminal (such as ordinary games, online education, etc.), the target virtual scene may not be running on the server because the main part of the application does not run there. In this case, a service needs to be started in the server corresponding to the user terminal to run the target virtual scene, and a target perspective camera is started in that service. When the server receives a scene recording instruction, it starts to record the target virtual scene using the target perspective camera to obtain the scene video stream.

It should be noted that, to prevent the recording of the scene video stream from affecting the user's normal use of the application functions in the application scenario, the server can record and render the target virtual scene as a background process; that is, the process of generating the scene video stream is independent of the running process of the application main body corresponding to the application scenario. The process of generating the scene video stream can be executed by an independent thread started in the server that runs the application main body, or by a separately started server.

Referring to Figure 2, taking the online lecture application scenario in online education as an example, the user terminal displays the video stream of the three-dimensional virtual lecture scene rendered by the user perspective camera, and the real user footage captured by the camera of the user terminal is displayed in the upper left corner. In addition to responding to the display request of the user terminal, the server can also record the target virtual scene from the target perspective, as shown in Figure 3. In Figure 3, the server records the three-dimensional virtual lecture hall scene with the target perspective camera corresponding to the audience perspective, and generates a scene video stream from the audience perspective.

S130. Fuse the user video stream and the scene video stream to generate a synthesized video stream.

Specifically, the server embeds the user video stream at a certain position in the scene video stream to generate a synthetic video stream containing the user's real picture and the virtual scene picture.

In some embodiments, the target virtual scene includes a preset view. The preset view is a view layer set in advance in the target virtual scene to carry the user video stream. The position of the preset view can be customized, or it can be determined based on the type and/or spatial position of each virtual object contained in the target virtual scene. For example, in the three-dimensional virtual lecture scene above, the target virtual scene contains a virtual screen for playing lecture-related information, so the preset view can be set at the position of the virtual screen. For another example, a preset view can be set in a free area of the target virtual scene that contains fewer virtual objects.

Correspondingly, S130 includes: fusing the user video stream to a preset view in the scene video stream to generate a composite video stream.

Specifically, the server can input the user video stream into the preset view to embed the user video stream into the scene video stream, and the result is a composite video stream. As shown in Figure 3, the virtual screen is set as the preset view, and then the server embeds the user video stream on the virtual screen in the three-dimensional virtual lecture hall scene to generate an online lecture playback video from the audience's perspective.
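The embedding step described above can be sketched as a per-frame paste of the user picture into the rectangle occupied by the preset view. This is only an illustrative sketch: frames are modelled as nested lists of pixel values, the preset view is assumed to be an axis-aligned rectangle given by its top-left corner, and `embed_frame` is a hypothetical helper rather than anything defined in the disclosure.

```python
def embed_frame(scene, user, top, left):
    """Return a copy of `scene` with `user` pasted at row `top`, column `left`.

    Both frames are non-empty lists of rows; each row is a list of pixels.
    """
    if top < 0 or left < 0:
        raise ValueError("preset view must start inside the scene frame")
    if top + len(user) > len(scene) or left + len(user[0]) > len(scene[0]):
        raise ValueError("user frame does not fit in the preset view")
    out = [row[:] for row in scene]  # copy so the scene frame stays untouched
    for r, user_row in enumerate(user):
        out[top + r][left:left + len(user_row)] = user_row
    return out


# "S" marks scene pixels and "U" marks user pixels.
scene_frame = [["S"] * 6 for _ in range(4)]
user_frame = [["U"] * 2 for _ in range(2)]
composite = embed_frame(scene_frame, user_frame, top=1, left=3)
for row in composite:
    print("".join(row))
```

In a real renderer the preset view would typically be a textured surface in the 3D scene rather than a flat rectangle, so the paste would be replaced by a perspective-correct texture update.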

The above video synthesis method provided by the embodiment of the present disclosure receives the user video stream captured by the camera of the user terminal, uses a target perspective camera that is independent of the user perspective camera to record the target virtual scene corresponding to the theme virtual space displayed in the user terminal so as to generate a scene video stream from the target perspective, and fuses the user video stream and the scene video stream to generate a synthesized video stream. First, the synthesized video stream is generated automatically in the server, which avoids the time-consuming and labor-intensive manual synthesis of videos. Second, recording the scene video stream on the server avoids the lagging of the synthesized video that device performance and network conditions cause when the scene video stream is recorded on the user terminal and uploaded to the server; this lowers the requirements on the device performance and network of the user terminal, solves the problems of slow upload and frame loss of the scene video stream, and improves the efficiency of video synthesis and the smoothness of the synthesized video stream. Third, the scene video stream is obtained by recording the target virtual scene, which improves the content consistency between the synthesized video stream and the target virtual scene.

Figure 4 is a flow chart of another video synthesis method provided by an embodiment of the present disclosure. It adds steps for generating a scene video stream that contains virtual object action responses based on user operation instructions. Terms that are the same as or correspond to those in the above embodiments are not explained again here. Referring to Figure 4, the video synthesis method includes:

S410. Receive user video stream.

S420. Receive user operation instructions.

Here, user operation instructions are operation instructions that the user generates in the theme virtual space by manipulating the user terminal. They are used to control the actions, such as moving or jumping, performed by the user's corresponding virtual character in the theme virtual space.

Specifically, while the application is running, the user operates the user terminal to control virtual objects in the theme virtual space. The user terminal converts the user's operations into corresponding user operation instructions and, according to those instructions, triggers the application to control the virtual object to perform the corresponding action responses (i.e., virtual object action responses).

Based on the above description, it can be seen that the process of recording scene video streams on the server side and the process of the application responding to user operation instructions are independent of each other. Then, in order to make the recorded scene video stream consistent with the running results of the application viewed by the user, the server can pull the user operation instructions to restore the same virtual object action response in the target virtual scene.

In some embodiments, the server can establish a communication connection between the process that records the scene video stream and the process in which the application responds to user operation instructions, so as to transmit the user operation instructions generated in the application to the process that records the scene video stream.

For example, for an application scenario where the application main body runs in the cloud, the server can establish a communication connection between the services or threads that run the above two processes, so as to transmit the user operation instructions generated by the application to the process that records the scene video stream.

For another example, for an application scenario where the main application program runs on the user terminal, a communication connection can be established between the user terminal and the server running the target virtual scene to send user operation instructions generated in the user terminal to the server.

In other embodiments, the server creates virtual users, associates the virtual users with the theme virtual space, and shares user operation instructions from the theme virtual space.

Specifically, in order to improve the efficiency and synchronization of obtaining user operation instructions, the server can create a new virtual user and associate it with the theme virtual space corresponding to the user terminal, for example by adding the virtual user to the virtual game room as a bystander. In this way, the virtual user corresponding to the user terminal and the new virtual user are in the same theme virtual space, so the server can share user operation instructions from the theme virtual space in real time.
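The bystander approach described above resembles a publish/subscribe arrangement: the server-side virtual user joins the same space and receives every instruction the real user submits. The sketch below is a simplified in-process model under that assumption; `ThemeSpace`, `join` and `submit` are hypothetical names, and a real system would deliver instructions over the network.

```python
class ThemeSpace:
    """Hypothetical stand-in for a theme virtual space such as a game room."""

    def __init__(self):
        self._members = {}  # user id -> callback invoked for each instruction

    def join(self, user_id, on_instruction):
        self._members[user_id] = on_instruction

    def submit(self, sender_id, instruction):
        # Every member other than the sender receives the instruction.
        for user_id, callback in self._members.items():
            if user_id != sender_id:
                callback(sender_id, instruction)


room = ThemeSpace()
recorded = []  # instructions collected by the server-side virtual user

room.join("player-1", lambda sender, ins: None)
room.join("server-observer", lambda sender, ins: recorded.append((sender, ins)))

room.submit("player-1", "move_left")
room.submit("player-1", "jump")
print(recorded)
```

Because the observer is just another member of the space, no dedicated channel between the application and the recording process is needed in this model.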

S430. Execute the virtual object action response corresponding to the user operation instruction in the target virtual scene.

Specifically, while recording the scene video stream, the server executes the corresponding virtual object action response in the target virtual scene according to the obtained user operation instructions, so that the target virtual scene presents the same virtual object action response as the application.

S440: Use the target perspective camera to record the target virtual scene and generate a scene video stream containing the virtual object's action response.

Specifically, the server uses the target perspective camera to record the target virtual scene in which the virtual object's action response is executed, and can obtain a scene video stream from the target perspective that includes the virtual object's action response.

S450: Fuse the user video stream and the scene video stream to generate a synthesized video stream.

The above video synthesis method provided by the embodiment of the present disclosure executes, in the target virtual scene, the virtual object action response corresponding to the user operation instruction generated by the user terminal, so that the target virtual scene also contains the virtual object action response, and uses the target perspective camera to record the target virtual scene to generate a scene video stream containing the virtual object action response. This further improves the consistency between the scene video stream and the running results of the application viewed by the user, and thus further improves the content consistency between the synthesized video stream and the target virtual scene.

In some embodiments, the user video stream carries a first timestamp, and the user operation instructions carry a second timestamp. Both the first timestamp and the second timestamp are the moment at which the user operation instruction is generated (also called the instruction timestamp); the first timestamp is the instruction timestamp recorded in the user video stream, and the second timestamp is the instruction timestamp recorded in the user operation instruction. Timestamps are needed because the user video stream and the user operation instructions differ in data volume, so the user operation instructions arrive at the server before the user video stream. If each instruction were responded to as soon as it reached the server, the virtual object action response restored in the target virtual scene would not match the user video stream, resulting in confusing content in the synthesized video stream. Therefore, in this embodiment, both the user video stream and the user operation instructions carry instruction timestamps, so that subsequent virtual object action responses can be executed based on the timestamps.

Correspondingly, after S420, the video synthesis method also includes: caching user operation instructions. As described above, the server cannot respond to a user operation instruction immediately after it arrives, so the server caches the user operation instruction first.

Correspondingly, S430 includes: filtering out target operation instructions whose second timestamp is less than or equal to the first timestamp from each user operation instruction; and executing a virtual object action response corresponding to the target operation instruction in the target virtual scene.

Specifically, after receiving the user video stream, the server extracts the first timestamp from it. The server then obtains the second timestamp of each user operation instruction from the cache space, compares the first timestamp with each second timestamp, and filters out at least one second timestamp that is less than or equal to the first timestamp. The server then takes the user operation instructions corresponding to the filtered second timestamps as the target operation instructions, and executes the corresponding virtual object action responses in the target virtual scene, so as to restore in the target virtual scene the virtual object action responses at the moment of the user video stream and at earlier moments. This not only ensures that the subsequently recorded scene video stream contains the same virtual object action responses as the running result viewed by the user, but also ensures temporal consistency between the virtual object action responses in the scene video stream and those in the running result viewed by the user, thereby further improving synchronization between the scene video stream and the user video stream.
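The caching and filtering steps described above can be sketched as a small priority queue keyed by the second timestamp. This is a minimal sketch under stated assumptions: timestamps are plain integers, instructions are removed from the cache once selected (the disclosure only says they are filtered), and `InstructionCache`, `add` and `pop_due` are hypothetical names.

```python
import heapq
from itertools import count


class InstructionCache:
    """Holds user operation instructions until the video stream catches up."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker that preserves arrival order

    def add(self, second_timestamp, instruction):
        heapq.heappush(self._heap, (second_timestamp, next(self._order), instruction))

    def pop_due(self, first_timestamp):
        """Remove and return instructions whose timestamp <= first_timestamp."""
        due = []
        while self._heap and self._heap[0][0] <= first_timestamp:
            ts, _, instruction = heapq.heappop(self._heap)
            due.append((ts, instruction))
        return due


cache = InstructionCache()
cache.add(120, "jump")       # instructions arrive ahead of the video stream
cache.add(100, "move_left")
cache.add(180, "wave")

# A video segment carrying first timestamp 150 arrives: replay everything up to it.
due_now = cache.pop_due(first_timestamp=150)
# A later segment with first timestamp 200 releases the remaining instruction.
due_later = cache.pop_due(first_timestamp=200)
print(due_now, due_later)
```

Returning the instructions in timestamp order lets the server execute the virtual object action responses in the same sequence in which the user produced them.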

Figure 5 is a flow chart of yet another video synthesis method provided by an embodiment of the present disclosure. The video synthesis method adds relevant steps to generate a synthesized video stream based on a video template. The explanations of terms that are the same as or corresponding to the above embodiments will not be repeated here. Referring to Figure 5, the video synthesis method includes:

S510. Receive user video stream.

Specifically, the server may continue to execute S520-S530, or execute S540-S550 according to application requirements (such as video synthesis speed, video synthesis accuracy, etc.).

S520: Use a target perspective camera that is independent of the user perspective camera to record the target virtual scene and generate a scene video stream from the target perspective.

S530: Fuse the user video stream and the scene video stream to generate a synthesized video stream.

S540. Based on the template filtering conditions, determine the target video template corresponding to the target virtual scene from each preset video template.

Here, the template filtering conditions are preset dimensions for filtering the preset video templates. A preset video template is a video template set in advance; it contains a blank part into which an external video can be fused and an immutable video part, and the immutable video part can contain preset characters, preset special effects components, etc. In the embodiment of the present disclosure, the template filtering conditions include at least one of the video duration of the user video stream, user information, user operation instructions, and playing audio. User information is information related to the user; for example, it includes the user's emotion and/or the user's age, and it is used to match the character image in the preset video template. User operation instructions are used to match the recording angle in the preset video template. Playing audio is used to match the special effects components in the preset video template.

Specifically, multiple preset video templates are pre-stored in the server. After receiving the user's video stream, the server can select an adapted preset video template from multiple preset video templates according to the template filtering conditions as the target video template.

For example, if the template filter condition includes the video duration of the user's video stream, then you can match the video based on the video duration. Set the duration of the blank part in the preset video template to ensure that the user's video stream can be integrated into the filtered target video template.

For another example, if the template filtering conditions include user information, then the server can filter out the target video templates whose video style matches the user's emotions from each preset video template based on the user's emotion and/or user's age in the user information. , and/or, select a target video template that matches the characters in the video and the user's age from each preset video template.

For another example, if the template filtering conditions include a user operation instruction, the server can determine the recording perspective based on the user perspective corresponding to the user operation instruction, and select from the preset video templates a target video template consistent with that recording perspective. For example, in the three-dimensional virtual lecture scene described above, user operation instructions are collected during the recording process. When the user operation instruction indicates that the user has walked to a specific area, the server switches to the recording perspective corresponding to that area and switches to the preset video template corresponding to that recording perspective, thereby completing the transition in the video.

For another example, if the template filtering conditions include playing audio, the server selects a target video template with the same or similar audio characteristics based on the pause positions and pause durations of the played audio, and can add special effects components such as fireworks and applause at the corresponding positions in the target video template to optimize it.

S550: Fuse the user video stream and the target video template to generate a synthetic video stream.

Specifically, the user video stream is added to a blank part of the target video template, or the user video stream is embedded in a certain position of the target video template to generate a synthesized video stream.

In some embodiments, S550 may be implemented through the following step A and/or step B.

Step A. Fuse the user video stream to the green screen position in the target video template to generate a synthetic video stream.

Specifically, the green screen position is preset in the target video template. Then the server can embed the user video stream at the green screen position in the target video template to generate a synthetic video stream.
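As an illustration of Step A, the following minimal sketch replaces green-screen pixels of one template frame with the corresponding pixels of a user frame. It assumes that frames are equally sized grids of RGB tuples and that the green screen position is identified by color keying with a tolerance; the key color, the tolerance value, and all function names are hypothetical. The disclosure only states that the green screen position is preset in the template.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Frame = List[List[Pixel]]

GREEN: Pixel = (0, 255, 0)  # assumed key color for the green screen


def is_key_color(pixel: Pixel, key: Pixel = GREEN, tolerance: int = 60) -> bool:
    """True when every channel is within `tolerance` of the key color."""
    return all(abs(c - k) <= tolerance for c, k in zip(pixel, key))


def composite_green_screen(template: Frame, user: Frame) -> Frame:
    """Replace green-screen pixels of the template with user pixels."""
    same_shape = len(template) == len(user) and all(
        len(t_row) == len(u_row) for t_row, u_row in zip(template, user)
    )
    if not same_shape:
        raise ValueError("template and user frames must have the same size")
    return [
        [u if is_key_color(t) else t for t, u in zip(t_row, u_row)]
        for t_row, u_row in zip(template, user)
    ]
```

A production implementation would more likely operate on decoded frames with an image or video library and scale the user frame to the green screen region first; the per-pixel loop here only shows the keying rule.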

Step B: Determine the video synthesis position in the target video template based on at least one preset time point in the target video template, and fuse the user video stream to the video synthesis position in the target video template to generate a synthesized video stream.

Specifically, at least one preset time point can be preset in the target video template, such as the beginning time point, the middle time point and the end time point, and each preset time point can be assigned a corresponding position for embedding the video stream (that is, the video synthesis position). For example, the beginning time point corresponds to the video synthesis position in the upper left corner, the middle time point corresponds to the video synthesis position in the middle, and the end time point corresponds to the video synthesis position in the lower right corner. In each time period, the server embeds the user video stream into the video synthesis position corresponding to the respective preset time point to generate a synthesized video stream.
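The mapping from preset time points to video synthesis positions in Step B can be sketched as a sorted schedule lookup. The schedule values, position labels, and the `position_at` function are hypothetical. The sketch assumes that each preset time point opens a time period that lasts until the next preset time point, which is one possible reading of "in each time period".

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical schedule: (preset time point in seconds, synthesis position),
# sorted by time point.
SCHEDULE: List[Tuple[float, str]] = [
    (0.0, "top-left"),        # beginning time point
    (10.0, "center"),         # middle time point
    (50.0, "bottom-right"),   # end time point
]


def position_at(schedule: List[Tuple[float, str]], t: float) -> str:
    """Return the synthesis position in effect at playback time `t`.

    Assumption: a preset time point stays in effect until the next one.
    """
    starts = [start for start, _ in schedule]
    idx = bisect_right(starts, t) - 1  # last time point that is <= t
    if idx < 0:
        raise ValueError("t precedes the first preset time point")
    return schedule[idx][1]
```

With this schedule, a frame at 9.9 seconds is embedded in the upper left corner, while a frame at exactly 10.0 seconds already uses the middle position.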

The above video synthesis method provided by the embodiments of the present disclosure determines the target video template corresponding to the target virtual scene from each preset video template according to the template filtering conditions, and fuses the user video stream and the target video template to generate a synthesized video stream. The preset video template is thus used to synthesize the user's real picture with the virtual scene picture, which reduces the resource consumption of the server and further improves the efficiency of generating the synthesized video stream.

FIG. 6 is a schematic structural diagram of a video synthesis device provided by an embodiment of the present disclosure. The video synthesis device is configured in the server. Referring to Figure 6, the video synthesis device 600 specifically includes:

The user video stream receiving module 610 is used to receive the user video stream; wherein the user video stream is a video stream captured by the camera of the user terminal;

The scene video stream generation module 620 is used to record the target virtual scene using a target perspective camera that is independent of the user perspective camera, and generate a scene video stream from the target perspective; wherein the target virtual scene is the virtual scene corresponding to the theme virtual space displayed in the user terminal;

The first synthetic video stream generation module 630 is used to fuse the user video stream and the scene video stream to generate a synthetic video stream.

The above video synthesis device provided by the embodiment of the present disclosure can receive the user video stream captured by the camera of the user terminal, use a target perspective camera that is independent of the user perspective camera to record the target virtual scene corresponding to the theme virtual space displayed in the user terminal and generate the scene video stream from the target perspective, and fuse the user video stream and the scene video stream to generate a synthetic video stream. First, synthetic video streams are generated automatically in the server, which avoids the time-consuming and labor-intensive work of manually synthesizing videos. Second, recording the scene video stream on the server avoids the lag in the synthesized video that device performance and network conditions cause when the scene video stream is recorded on the user terminal and uploaded to the server; this not only lowers the requirements on the device performance and network of the user terminal, but also solves problems such as slow uploading and frame loss of the scene video stream, and improves the efficiency of video synthesis and the smoothness of the synthesized video stream. Third, obtaining the scene video stream by recording the target virtual scene improves the content consistency between the synthesized video stream and the target virtual scene.

In some embodiments, the video synthesis device 600 further includes a user operation instruction receiving module for:

Receive user operation instructions before merging the user video stream and the scene video stream to generate a composite video stream;

Correspondingly, the scene video stream generation module 620 includes:

The action response execution submodule is used to execute the virtual object action response corresponding to the user operation instruction in the target virtual scene;

The scene video stream generation submodule is used to record the target virtual scene using the target perspective camera and generate a scene video stream containing the virtual object's action response.

In some embodiments, the user video stream carries a first timestamp, and the user operation instructions carry a second timestamp;

Correspondingly, the video synthesis device 600 also includes a user operation instruction cache module for:

After receiving the user operation instructions, cache the user operation instructions;

Correspondingly, the action response execution sub-module is specifically used to:

Filter out target operation instructions whose second timestamp is less than or equal to the first timestamp from each user operation instruction;

Execute the virtual object action response corresponding to the target operation instruction in the target virtual scene.
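The timestamp-based selection performed by the action response execution sub-module can be sketched as follows. The `Instruction` record and the `take_due_instructions` function are hypothetical. Applying the selected instructions in timestamp order and keeping the rest in the cache are assumptions of this sketch; the disclosure only specifies the condition that the second timestamp is less than or equal to the first timestamp.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Instruction:
    """Hypothetical cached user operation instruction."""
    timestamp_ms: int  # the "second timestamp" carried by the instruction
    action: str


def take_due_instructions(
    cache: List[Instruction], frame_timestamp_ms: int
) -> Tuple[List[Instruction], List[Instruction]]:
    """Split the cache into (target instructions, still-pending instructions).

    `frame_timestamp_ms` is the "first timestamp" carried by the user video
    stream. Target instructions are those whose timestamp is less than or
    equal to it; they are returned in timestamp order (an assumption).
    """
    due = sorted(
        (i for i in cache if i.timestamp_ms <= frame_timestamp_ms),
        key=lambda i: i.timestamp_ms,
    )
    pending = [i for i in cache if i.timestamp_ms > frame_timestamp_ms]
    return due, pending
```

Selecting only instructions that are not later than the current video timestamp keeps the virtual object's action responses from running ahead of the user picture they are fused with.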

In some embodiments, the user operation instruction receiving module is specifically used to:

Create virtual users and associate the virtual users to the theme virtual space;

Share user operation instructions from the theme virtual space.

In some embodiments, the target virtual scene includes a preset view;

Correspondingly, the first synthesized video stream generating module 630 is specifically used to:

Fuse the user video stream into the preset view in the scene video stream to generate a composite video stream.

In some embodiments, the video synthesis device 600 further includes:

The target video template determination module is used to determine the target video template corresponding to the target virtual scene from each preset video template based on template filtering conditions after receiving the user video stream; wherein the template filtering conditions include at least one of the video duration of the user video stream, user information, user operation instructions and playing audio; the user information includes user emotion and/or user age, and the user information is used to match the character image in the preset video template; the user operation instructions are used to match the recording perspective in the preset video template; the playing audio is used to match the special effects components in the preset video template;

The second synthetic video stream generation module is used to fuse the user video stream and the target video template to generate a synthetic video stream.

Further, the second synthetic video stream generation module is specifically used to:

Fuse the user video stream to the green screen position in the target video template to generate a synthetic video stream;

And/or, based on at least one preset time point in the target video template, determine the video synthesis position in the target video template, and fuse the user video stream to the video synthesis position in the target video template to generate a synthesized video stream.

In some embodiments, the user video stream receiving module 610 is specifically used to:

Receive user video streams from user terminals through real-time communication transmission protocols.

In some embodiments, the theme virtual space includes an online live broadcast room, a virtual game room or a virtual education space.

The video synthesis device provided by the embodiments of the present disclosure can execute the video synthesis method provided by any embodiment of the present disclosure, and has functional modules and beneficial effects corresponding to the execution method.

It is worth noting that in the above embodiments of the video synthesis device, the various modules and sub-modules included are only divided according to functional logic, but are not limited to the above divisions, as long as the corresponding functions can be realized; in addition, the specific names of the functional modules/sub-modules are only for the convenience of distinguishing them from each other and are not used to limit the scope of protection of the present disclosure.

Exemplary embodiments of the present disclosure also provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor. The memory stores a computer program that can be executed by at least one processor. When executed by at least one processor, the computer program is used to cause the electronic device to perform a video synthesis method, including:

Receive a user video stream; wherein the user video stream is a video stream captured by a camera of the user terminal; use a target perspective camera that is independent of the user perspective camera to record the target virtual scene and generate a scene video stream from the target perspective; where , the target virtual scene is the virtual scene corresponding to the theme virtual space displayed in the user terminal; the user video stream and the scene video stream are merged to generate a composite video stream.

In some embodiments of the present disclosure, the computer program, when executed by at least one processor, is also used to cause the electronic device to: receive user operation instructions; recording the target virtual scene using a target perspective camera that is independent of the user perspective camera and generating a scene video stream from the target perspective includes: executing the virtual object action response corresponding to the user operation instruction in the target virtual scene; using the target perspective camera to record the target virtual scene, and generating a scene video stream containing the virtual object action response.

In some embodiments of the present disclosure, the user video stream carries a first timestamp, and the user operation instruction carries a second timestamp; when the computer program is executed by at least one processor, it is also used to cause the electronic device to execute: caching the user operation instructions; executing the virtual object action response corresponding to the user operation instruction in the target virtual scene includes: filtering out, from each user operation instruction, target operation instructions whose second timestamp is less than or equal to the first timestamp; and executing a virtual object action response corresponding to the target operation instruction in the target virtual scene.

In some embodiments of the present disclosure, receiving user operation instructions includes: creating a virtual user and associating the virtual user to the theme virtual space; and sharing the user operation instructions from the theme virtual space.

In some embodiments of the present disclosure, the target virtual scene includes a preset view; fusing the user video stream and the scene video stream to generate a synthetic video stream includes: fusing the user video stream to the preset view in the scene video stream to generate a synthetic video stream.

In some embodiments of the present disclosure, the computer program, when executed by at least one processor, is also used to cause the electronic device to execute: based on the template filtering conditions, determine a target video template corresponding to the target virtual scene from each preset video template; wherein the template filtering conditions include at least one of the video duration of the user video stream, user information, user operation instructions and playing audio; the user information includes user emotion and/or user age, and the user information is used to match the character image in the preset video template; the user operation instructions are used to match the recording perspective in the preset video template; the playing audio is used to match the special effects components in the preset video template; the user video stream and the target video template are fused to generate a synthetic video stream.

In some embodiments of the present disclosure, fusing the user video stream and the target video template to generate a synthetic video stream includes: fusing the user video stream to a green screen position in the target video template to generate a synthetic video stream; and/or, based on at least one preset time point in the target video template, determining the video synthesis position in the target video template, and fusing the user video stream to the video synthesis position in the target video template to generate a synthesized video stream.

In some embodiments of the present disclosure, receiving the user video stream includes: receiving the user video stream from the user terminal through a real-time communication transmission protocol.

In some embodiments of the present disclosure, the theme virtual space includes an online live broadcast room, a virtual game room or a virtual education space.

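Exemplary embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of the computer, is used to cause the computer to perform a video synthesis method, including:

Receive a user video stream, wherein the user video stream is a video stream captured by a camera of the user terminal; use a target perspective camera that is independent of the user perspective camera to record the target virtual scene and generate a scene video stream from the target perspective, wherein the target virtual scene is the virtual scene corresponding to the theme virtual space displayed in the user terminal; and fuse the user video stream and the scene video stream to generate a composite video stream.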

In some embodiments of the present disclosure, the computer program, when executed by the processor of the computer, is also used to cause the computer to: receive user operation instructions; record the target virtual scene using a target perspective camera that is independent of the user perspective camera, Generating a scene video stream from the target perspective includes: executing the virtual object action response corresponding to the user operation instruction in the target virtual scene; using the target perspective camera to record the target virtual scene and generating a scene video stream containing the virtual object action response.

In some embodiments of the present disclosure, the user video stream carries a first timestamp, and the user operation instructions carry a second timestamp; when the computer program is executed by the processor of the computer, it is also used to cause the computer to execute: cache the user operation instructions; executing the virtual object action response corresponding to the user operation instructions in the target virtual scene includes: filtering out the target operation instructions whose second timestamp is less than or equal to the first timestamp from each user operation instruction; and executing, in the target virtual scene, the virtual object action response corresponding to the target operation instruction.

In some embodiments of the present disclosure, the computer program, when executed by the processor of the computer, is also used to cause the computer to execute: based on the template filtering conditions, determine the target video template corresponding to the target virtual scene from each preset video template; wherein the template filtering conditions include at least one of the video duration of the user video stream, user information, user operation instructions and playing audio; the user information includes user emotion and/or user age, and the user information is used to match the character image in the preset video template; the user operation instructions are used to match the recording perspective in the preset video template; the playing audio is used to match the special effects components in the preset video template; the user video stream and the target video template are fused to generate a composite video stream.

In some embodiments of the present disclosure, fusing the user video stream and the target video template to generate a synthetic video stream includes: fusing the user video stream to a green screen position in the target video template to generate a synthetic video stream; and/or, based on at least one preset time point in the target video template, determining the video synthesis position in the target video template, and fusing the user video stream to the video synthesis position in the target video template to generate a synthesized video stream.

Exemplary embodiments of the present disclosure also provide a computer program product, including a computer program, wherein the computer program, when executed by a processor of the computer, is used to cause the computer to execute the video synthesis method described in any embodiment of the present disclosure.

Referring to FIG. 7, a structural block diagram of an electronic device 700 that may serve as a server or client of the present disclosure will now be described; it is an example of a hardware device that may be applied to aspects of the present disclosure. Electronic devices are intended to refer to various forms of digital electronic computing equipment, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 702 or loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 can also store various programs and data required for the operation of the electronic device 700. Computing unit 701, ROM 702 and RAM 703 are connected to each other via bus 704. An input/output (I/O) interface 705 is also connected to bus 704.

Multiple components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the electronic device 700. The input unit 706 may receive input numeric or character information and generate key signal input related to user settings and/or function control of the electronic device. Output unit 707 may be any type of device capable of presenting information, and may include, but is not limited to, a display, speakers, video/audio output terminal, vibrator, and/or printer. The storage unit 708 may include, but is not limited to, a magnetic disk or an optical disk. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunications networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver and/or a chipset, such as Bluetooth™ devices, WiFi devices, WiMax devices, cellular communications devices and/or the like.

Computing unit 701 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, etc. The computing unit 701 performs the various methods and processes described above. For example, in some embodiments, the video synthesis method described in any embodiment of the present disclosure may be implemented as a computer software program, which is tangibly included in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. In some embodiments, the computing unit 701 may be configured in any other suitable manner (e.g., by means of firmware) to perform the video synthesis method described in any embodiment of the present disclosure.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include electrical connections based on one or more wires, portable computer disks, hard drives, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.

As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (for example, a magnetic disk, optical disk, memory, or programmable logic device (PLD)) used to provide machine instructions and/or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.

To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end components, middleware components, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include: local area network (LAN), wide area network (WAN), and the Internet.

Computer systems may include clients and servers. Clients and servers are generally remote from each other and typically interact over a communications network. The relationship of client and server is created by computer programs running on corresponding computers and having a client-server relationship with each other.

Claims

A video synthesis method, applied to the server, including:

Receive a user video stream; wherein the user video stream is a video stream captured by a camera of a user terminal;

Using a target perspective camera that is independent of the user perspective camera, record the target virtual scene and generate a scene video stream from the target perspective; wherein the target virtual scene is a virtual scene corresponding to the theme virtual space displayed in the user terminal;

The user video stream and the scene video stream are fused to generate a composite video stream.
The method according to claim 1, wherein before fusing the user video stream and the scene video stream to generate a composite video stream, the method further includes:

Receive user operation instructions;

The use of a target perspective camera that is independent of the user perspective camera to record the target virtual scene and generate the scene video stream from the target perspective includes:

Execute the virtual object action response corresponding to the user operation instruction in the target virtual scene;

The target virtual scene is recorded using the target perspective camera, and the scene video stream containing the action response of the virtual object is generated.
The method according to claim 2, wherein the user video stream carries a first timestamp, and the user operation instruction carries a second timestamp;

After receiving the user operation instruction, the method further includes:

Cache the user operation instructions;

The virtual object action response corresponding to the execution of the user operation instruction in the target virtual scene includes:

Filter out target operation instructions whose second timestamp is less than or equal to the first timestamp from each of the user operation instructions;

Execute the virtual object action response corresponding to the target operation instruction in the target virtual scene.
The method according to claim 2 or 3, wherein receiving user operation instructions includes:

Create a virtual user and associate the virtual user to the theme virtual space;

The user operation instructions are shared from the theme virtual space.
The method according to any one of claims 1-4, wherein the target virtual scene includes a preset view;

The fusing of the user video stream and the scene video stream to generate a composite video stream includes:

The user video stream is merged into the preset view in the scene video stream to generate the composite video stream.
The method according to any one of claims 1-5, wherein after receiving the user video stream, the method further includes:

Based on template filtering conditions, determine the target video template corresponding to the target virtual scene from each preset video template; wherein the template filtering conditions include at least one of the video duration of the user video stream, user information, user operation instructions and playing audio; the user information includes user emotion and/or user age, and the user information is used to match the character image in the preset video template; the user operation instructions are used to match the recording perspective in the preset video template; the playing audio is used to match the special effects components in the preset video template;

The user video stream and the target video template are fused to generate the synthesized video stream.
The method according to claim 6, wherein said fusing the user video stream and the target video template to generate the synthetic video stream includes:

Fuse the user video stream to the green screen position in the target video template to generate the synthetic video stream;

and / or,

Determine a video synthesis position in the target video template based on at least one preset time point in the target video template, and fuse the user video stream to the video synthesis position in the target video template to generate the composite video stream.
The method according to any one of claims 1-7, wherein receiving the user video stream includes:

The user video stream is received from the user terminal through a real-time communication transmission protocol.
The method according to any one of claims 1 to 8, wherein the theme virtual space includes an online live broadcast room, a virtual game room or a virtual education space.
A video synthesis device, configured on the server side, including:

A user video stream receiving module, configured to receive a user video stream; wherein the user video stream is a video stream captured by a camera of a user terminal;

A scene video stream generation module, configured to use a target perspective camera that is independent of the user perspective camera to record the target virtual scene and generate a scene video stream from the target perspective; wherein the target virtual scene is the virtual scene corresponding to the theme virtual space displayed in the user terminal;

The first synthetic video stream generation module is used to fuse the user video stream and the scene video stream to generate a synthetic video stream.
The device of claim 10, wherein the device further comprises:

A user operation instruction receiving module is used to receive user operation instructions;

Wherein, the scene video stream generation module includes:

The action response execution submodule is used to execute the virtual object action response corresponding to the user operation instruction in the target virtual scene;

The scene video stream generation submodule is used to record the target virtual scene using the target perspective camera and generate a scene video stream containing the virtual object's action response.
The device according to claim 11, wherein the user video stream carries a first timestamp, and the user operation instruction carries a second timestamp, and the device further includes:

The user operation instruction cache module is used to cache user operation instructions;

The action response execution submodule is also used to:

Select, from the cached user operation instructions, target operation instructions whose second timestamp is less than or equal to the first timestamp;

Execute the virtual object action response corresponding to the target operation instruction in the target virtual scene.
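The timestamp comparison recited above can be sketched as follows. This is a minimal sketch under stated assumptions: timestamps are integer milliseconds on a shared clock, the cache is a plain list, and selected instructions are executed in timestamp order. The names `OperationInstruction` and `take_due_instructions` are hypothetical, and the ordering step is an assumption that the claim does not state.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OperationInstruction:
    timestamp_ms: int  # the "second timestamp" carried by the instruction
    action: str        # hypothetical label for the virtual object action


def take_due_instructions(
    cache: List[OperationInstruction], frame_timestamp_ms: int
) -> Tuple[List[OperationInstruction], List[OperationInstruction]]:
    """Split the cache into instructions due at this frame and those still pending.

    An instruction is due when its timestamp is less than or equal to the
    "first timestamp" carried by the current user video frame.
    """
    due = sorted(
        (ins for ins in cache if ins.timestamp_ms <= frame_timestamp_ms),
        key=lambda ins: ins.timestamp_ms,  # assumed: replay in chronological order
    )
    pending = [ins for ins in cache if ins.timestamp_ms > frame_timestamp_ms]
    return due, pending
```

Keeping the pending instructions in the cache means an instruction that arrives ahead of its matching video frame is not dropped; it becomes due once a frame with a later or equal timestamp arrives.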
The device according to claim 11 or 12, wherein the user operation instruction receiving module is used to:

Create a virtual user and associate the virtual user with the theme virtual space;

Receive the user operation instructions shared from the theme virtual space.
The device according to any one of claims 10-13, wherein the target virtual scene includes a preset view, and the first synthetic video stream generation module is also used to:

Fuse the user video stream into the preset view in the scene video stream to generate the synthetic video stream.
The device according to any one of claims 10-14, wherein the device further includes:

A target video template determination module, configured to determine a target video template corresponding to the target virtual scene from each preset video template based on template filtering conditions; wherein the template filtering conditions include at least one of the video duration of the user video stream, user information, user operation instructions and playback audio; the user information includes user emotion and/or user age, and the user information is used to match the character image in the preset video template; the user operation instruction is used to match the recording angle in the preset video template; the playback audio is used to match the special effects components in the preset video template;

The second synthetic video stream generation module is used to fuse the user video stream and the target video template to generate the synthetic video stream.
The device according to claim 15, wherein the second synthetic video stream generation module is further configured to:

Fuse the user video stream into the green screen position in the target video template to generate the synthetic video stream;

and/or,

Determine a video synthesis position in the target video template based on at least one preset time point in the target video template, and fuse the user video stream into the video synthesis position in the target video template to generate the synthetic video stream.
The device according to any one of claims 10-16, wherein the user video stream receiving module is also used for:

The user video stream is received from the user terminal through a real-time communication transmission protocol.
The device according to any one of claims 10 to 17, wherein the theme virtual space includes an online live broadcast room, a virtual game room or a virtual education space.
An electronic device including:

a processor; and

a memory for storing a program,

Wherein, the program includes instructions that, when executed by the processor, cause the processor to execute the video synthesis method according to any one of claims 1-9.
A non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the video synthesis method according to any one of claims 1-9.