CN114222077A - Video processing method and device, storage medium and electronic equipment - Google Patents
- Publication number
- CN114222077A (application number CN202111531779.6A)
- Authority
- CN
- China
- Prior art keywords
- face image
- voice
- video
- face
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H04N5/262—Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
- H04N5/272—Means for inserting a foreground image in a background image, i.e. inlay, outlay
- G10L13/047—Architecture of speech synthesisers
- G10L15/26—Speech to text systems
- H04N21/439—Processing of audio elementary streams
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Studio Devices (AREA)
Abstract
The application provides a video processing method and device, a storage medium, and an electronic device. The video processing method includes: acquiring a video to be processed, where the video to be processed includes a first face image and a first voice; acquiring a second face image and a second voice; generating a target face image according to the first face image and the second face image, and generating a target voice according to the first voice and the second voice; and replacing the first face image in the video to be processed with the target face image and the first voice with the target voice to obtain a target video. Because the acquired second face image is not substituted into the video directly, but is first combined with the first face image in the video to generate the target face image, and because the first voice in the video is likewise replaced with the generated target voice, the flexibility of face replacement is improved.
Description
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video processing method and apparatus, a storage medium, and an electronic device.
Background
With the development of networks, social applications oriented toward mass entertainment keep increasing. In many applications that offer live streaming, video shooting, image editing, and similar functions, face swapping has gradually become a new entertainment hotspot with increasingly wide application scenarios. Face swapping, or face changing, refers to changing one person's face into another person's face in an image or video.
In face changing, in order to achieve a good face-changing effect, certain requirements are placed on the face material. For example, the faces to be exchanged should be of similar size, should cover various expressions such as joy, anger, and sadness, and should include views of the face in various poses such as raised head, lowered head, and side profile. However, most face material consists of videos or images of specific persons taken from the Internet, and the acquired face material is pasted directly into the video, so the flexibility of face replacement is poor.
Disclosure of Invention
The embodiment of the application provides a video processing method, a video processing device, a storage medium and electronic equipment, which can improve the flexibility of face replacement.
An embodiment of the present application provides a video processing method, including:
acquiring a video to be processed, wherein the video to be processed comprises a first face image and a first voice;
acquiring a second face image and a second voice;
generating a target face image according to the first face image and the second face image, and generating a target voice according to the first voice and the second voice;
and replacing the first face image in the video to be processed with the target face image, and replacing the first voice in the video to be processed with the target voice to obtain the target video.
An embodiment of the present application further provides a video processing apparatus, including:
the first acquisition module is used for acquiring a video to be processed, wherein the video to be processed comprises a first face image and a first voice;
the second acquisition module is used for acquiring a second face image and acquiring a second voice;
the generating module is used for generating a target face image according to the first face image and the second face image and generating a target voice according to the first voice and the second voice;
and the replacing module is used for replacing the first face image in the video to be processed with the target face image and replacing the first voice in the video to be processed with the target voice to obtain the target video.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the storage medium, and the computer program is executed by a processor to implement the steps in any video processing method provided in the embodiment of the present application.
The embodiment of the present application further provides an electronic device, where the electronic device includes a processor, a memory, and a computer program stored in the memory and capable of running on the processor, and the processor executes the computer program to implement the steps in any video processing method provided in the embodiment of the present application.
The video processing method provided by the embodiment of the application acquires a video to be processed, where the video to be processed includes a first face image and a first voice; acquires a second face image and a second voice; generates a target face image according to the first face image and the second face image, and generates a target voice according to the first voice and the second voice; and replaces the first face image in the video to be processed with the target face image and the first voice with the target voice to obtain the target video. Because the acquired second face image is not substituted into the video directly, but is first combined with the first face image in the video to generate the target face image, and because the first voice in the video is likewise replaced with the generated target voice, the flexibility of face replacement is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the application; for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic flowchart of a video processing method according to an embodiment of the present disclosure.
Fig. 2 is a schematic flowchart of a second video processing method according to an embodiment of the present disclosure.
Fig. 3 is a schematic view of a first structure of a video processing apparatus according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of a second structure of a video processing apparatus according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of protection of the present application.
The terms "first," "second," "third," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the objects so described are interchangeable under appropriate circumstances. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, or apparatus, terminal, system comprising a list of steps is not necessarily limited to those steps or modules or elements expressly listed, and may include other steps or modules or elements not expressly listed, or inherent to such process, method, apparatus, terminal, or system.
The execution subject of the video processing method may be the video processing apparatus provided in the embodiment of the present application, or an electronic device integrating the video processing apparatus; the video processing apparatus may be implemented in hardware or software.
Detailed descriptions are given below. The order of the following embodiments is not intended to limit their preferred order.
Referring to fig. 1, fig. 1 is a first flowchart illustrating a video processing method according to an embodiment of the present disclosure. The video processing method may include:
and 110, acquiring a video to be processed, wherein the video to be processed comprises a first face image and a first voice.
In the embodiment of the application, the video to be processed can be a video in a live broadcast process, and can also be a short video, a small video and other videos. In some possible embodiments, the video image to be processed may also be a video recorded or shot manually, or may also be a video downloaded from a network, such as a video of a television show, a movie, and the like on video software, and the details may not be limited. The video to be processed may include faces of a plurality of objects and corresponding sounds, where the faces include a first face image and a first voice. The first person sound may be a person sound corresponding to a face object of the first face image.
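As a minimal sketch of this step (not part of the claimed method), the video to be processed can be decomposed into video frames and an audio track. Python with OpenCV and moviepy (the moviepy 1.x import path) is assumed, and the file names are illustrative:

```python
# Sketch: split a to-be-processed video into frames and an audio track.
import cv2
from moviepy.editor import VideoFileClip

def load_video(path: str):
    """Return all frames (BGR arrays) and write the audio track to a WAV file."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    # The audio track carries the first voice together with the background sound.
    VideoFileClip(path).audio.write_audiofile("audio.wav")
    return frames

frames = load_video("to_be_processed.mp4")
```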
120. Acquire a second face image and a second voice.

The step of acquiring the second face image may include: starting a camera; and capturing the second face image of the photographed object in real time through the camera.
In the embodiment of the application, the face of the photographer can be captured freely, and the freely captured face is then substituted into the video to be processed.

To replace the first face image in the video to be processed with a freely captured face, the camera of the device may be started, and the person in front of the lens photographed through the camera to obtain the second face image. The user whose face image is captured is the photographed object.

For example, a user can directly enter a self-shooting mode and photograph his or her own face with the camera to obtain the second face image; the face image contained in the video is then replaced according to the second face image. The user's own face can thus be substituted into the video, as if the user had personally taken part in shooting it, which improves the interest and the sense of participation when watching the video.
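A minimal capture sketch, assuming Python with OpenCV and camera device index 0 (both are illustrative assumptions, not part of the disclosure):

```python
# Sketch: capture a second face image of the photographed object in real time.
import cv2

cap = cv2.VideoCapture(0)            # start the camera
ok, second_face_image = cap.read()   # one frame of the person in front of the lens
cap.release()
if ok:
    cv2.imwrite("second_face.png", second_face_image)
```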
In addition to face replacement, in order to further improve the user's sense of participation and interest when watching the video, the embodiment of the application also processes the sound of the video to be processed and replaces the corresponding voice with a voice derived from the photographed object.

The step of acquiring the second voice may include: guiding the photographed object to read text information aloud; and collecting the second voice while the photographed object reads the text information aloud.

That is, to obtain sound material of the photographed object, the photographed object is first guided to read text information aloud; the text information may be characters, letters, or the like. While the user reads the text information aloud, the second voice of the photographed object is collected, for example as sketched below.
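A minimal recording sketch, assuming the sounddevice and scipy packages; the prompt text, sample rate, and duration are illustrative:

```python
# Sketch: record the second voice while the photographed object reads the prompt aloud.
import sounddevice as sd
from scipy.io import wavfile

PROMPT = "Please read this sentence aloud."   # the guiding text information
RATE, SECONDS = 16000, 10

print(PROMPT)
recording = sd.rec(int(RATE * SECONDS), samplerate=RATE, channels=1)
sd.wait()                                     # block until recording finishes
wavfile.write("second_voice.wav", RATE, recording)
```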
130. Generate a target face image according to the first face image and the second face image, and generate a target voice according to the first voice and the second voice.

After the second face image of the photographed object is captured in real time by the camera, the second face image is input into a face feature extraction model to obtain second face parameters of the face features in the second face image, and the second face parameters are stored, together with the tag of the photographed object, in a face image library for that object.

When the face image contained in the video is replaced according to the second face image, the shooting angle, lighting, expression, and so on of the second face image may differ from those of the first face image in the video. Therefore, after the second face image is obtained, it is not substituted into the video directly; instead, a target face image is generated from it. The target face image has the face features of the second face image while remaining consistent with the shooting angle, lighting, expression, and so on of the first face image in the video.

The captured second face image carries the second face parameters of the photographed object, which represent the face features of the photographed object. To generate the target face image from the second face image, the second face parameters of the photographed object may be acquired, for example by inputting the second face image into the trained face feature extraction model.

When the target face image is generated from the first face image and the second face image, the first face image is input into the face feature extraction model to obtain first face parameters of the face features in the first face image, and the first face parameters are replaced by the corresponding second face parameters to obtain the target face image, as sketched below.
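The patent does not specify the face feature extraction model, so the sketch below uses hypothetical stubs for the extractor and decoder; only the parameter-replacement logic follows the description above:

```python
# Sketch: replace identity-related first face parameters with second face
# parameters while keeping the pose/lighting/expression of the first image.
import numpy as np

def extract_face_params(face_image: np.ndarray) -> dict:
    # Hypothetical stand-in for the trained face feature extraction model.
    return {"identity": np.zeros(128), "pose_light_expr": np.zeros(32)}

def decode_face(params: dict) -> np.ndarray:
    # Hypothetical stand-in for a decoder that renders a face from parameters.
    return np.zeros((256, 256, 3), dtype=np.uint8)

first_face_image = np.zeros((256, 256, 3), np.uint8)    # placeholder inputs
second_face_image = np.zeros((256, 256, 3), np.uint8)

first_params = extract_face_params(first_face_image)    # from the video
second_params = extract_face_params(second_face_image)  # from the camera

target_params = {"identity": second_params["identity"],             # swapped in
                 "pose_light_expr": first_params["pose_light_expr"]}
target_face_image = decode_face(target_params)
```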
In the embodiment of the application, a face image library is established for each photographed object. Different face image libraries correspond to different tags, so as to distinguish the photographed objects to which the libraries belong.

After the second face parameters of a photographed object are obtained, they are stored in the face image library of that object, together with the object's tag, so that when the face image of the photographed object is later used to achieve the face-changing effect, the object's face image library and second face parameters can be found quickly through the tag.
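A minimal sketch of such a tag-keyed face image library; an in-memory dictionary stands in for whatever persistent store an implementation would actually use:

```python
# Sketch: store and look up second face parameters by the object's tag.
import numpy as np

face_image_library: dict[str, np.ndarray] = {}

def store_face_params(tag: str, second_face_params: np.ndarray) -> None:
    face_image_library[tag] = second_face_params   # one entry per photographed object

def lookup_face_params(tag: str) -> np.ndarray:
    return face_image_library[tag]                 # fast lookup through the tag

store_face_params("user_42", np.zeros(128))        # tag and vector are illustrative
```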
For a video to be processed, face changing is performed on a particular object in the video, which may be called the object to be face-changed. First, the video frames containing the face image of the object to be face-changed are determined. Then, from these video frames, the first face image of the object to be face-changed and the remaining background image are stripped, in preparation for the subsequent face-changing processing; see the sketch after this paragraph.
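A minimal stripping sketch; the OpenCV Haar cascade detector is only a stand-in for whatever face detector an implementation would actually use:

```python
# Sketch: separate the first face image from the remaining background image.
import cv2
import numpy as np

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def strip_face(frame: np.ndarray):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None, frame, None                 # no face in this frame
    x, y, w, h = faces[0]                        # first detected face
    first_face_image = frame[y:y + h, x:x + w].copy()
    background_image = frame.copy()
    background_image[y:y + h, x:x + w] = 0       # blank out the face region
    return first_face_image, background_image, (x, y, w, h)
```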
After the first face image of the object to be face-changed is stripped from the video to be processed, the first face image is input into the trained face feature extraction model to obtain the first face parameters of the face features in the first face image; the first face parameters represent the face features of the object to be face-changed.
According to the tags corresponding to the face image libraries, the face image library of the photographed object is determined from among the face image libraries, and the second face parameters stored in it are acquired. From the second face parameters, the face features of the photographed object can be obtained.

According to the first face image in the video to be processed and the second face image captured in real time, the face in the second face image can be migrated into the first face image of the video to be processed on the basis of the first face parameters of the first face image and the second face parameters of the second face image. The resulting target face image is used to replace the first face image in the video frames of the video to be processed.

Specifically, the acquired first face parameters of the face features in the first face image are replaced by the second face parameters of the face features in the second face image; that is, the face features of the object to be face-changed in the video are replaced with the face features of the photographed object captured in real time, so that the face of the object to be face-changed in the first face image becomes the face of the photographed object. In other words, a target face image is generated from the first face image and the second face image that shows a different face but keeps the same shooting angle, lighting, expression, and so on, and therefore matches the video content and shooting effect.
In the embodiment of the application, after the second voice of the photographed object reading the text information is collected, the second sound parameters of the sound features in the second voice are extracted, and the second sound parameters are stored, together with the tag of the photographed object, in a voice library for that object.

The second sound parameters extracted from the collected second voice are used to reflect the sound characteristics of the photographed object, such as phoneme characteristics and tone characteristics.

In the embodiment of the application, a voice library is established for each photographed object. Different voice libraries correspond to different tags, so as to distinguish the photographed objects to which the voice libraries belong.

After the second sound parameters of the photographed object are obtained, they are stored in the voice library of that object, together with the object's tag, so that when the second voice of the photographed object is later used to achieve the voice-exchange effect, the object's voice library and second sound parameters can be found quickly through the tag. A minimal extraction-and-storage sketch follows.
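In the sketch below, MFCC features computed with librosa stand in for the second sound parameters, which the patent does not define concretely; the tag and file path are illustrative:

```python
# Sketch: extract sound parameters from the second voice and file them by tag.
import librosa
import numpy as np

voice_library: dict[str, np.ndarray] = {}

def store_second_voice(tag: str, wav_path: str) -> None:
    audio, sr = librosa.load(wav_path, sr=16000)
    second_sound_params = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)
    voice_library[tag] = second_sound_params       # keyed by the object's tag

store_second_voice("user_42", "second_voice.wav")
```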
When the target voice is generated from the first voice and the second voice, the text information contained in the first voice is recognized, and the target voice corresponding to that text information is obtained by encoding according to the text information and the second sound parameters stored in the voice library.

For a video to be processed whose sound is to be changed, the change is made for the object to be face-changed in the video. First, the audio passages in which the object to be face-changed speaks are located in the audio of the video to be processed, and the first voice of that object and the remaining background sound are stripped from these passages, in preparation for the subsequent processing. The background sound may be all sound other than the first voice.

To obtain the target voice, encoding is performed according to the text information and the second sound parameters stored in the voice library, yielding the target voice corresponding to the text information. The encoded target voice replaces the original first voice and is combined with the background sound of the audio, so that the processed audio is obtained. A pipeline sketch is given below.
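In this sketch the speech recognizer and voice encoder are hypothetical stubs, since the patent names neither; only the recognize-then-encode flow follows the description above:

```python
# Sketch: recognize the text in the first voice, then encode a target voice
# with the second sound parameters stored in the voice library.
import numpy as np

voice_library = {"user_42": np.zeros((20, 100))}   # placeholder library entry

class StubASR:
    def transcribe(self, wav_path: str) -> str:
        return "line spoken in the video"          # placeholder transcript

class StubVoiceEncoder:
    def synthesize(self, text: str, sound_params: np.ndarray) -> np.ndarray:
        return np.zeros(16000)                     # placeholder waveform

asr_model, encoder = StubASR(), StubVoiceEncoder()

def change_voice(first_voice_wav: str, tag: str) -> np.ndarray:
    text = asr_model.transcribe(first_voice_wav)   # text information in first voice
    params = voice_library[tag]                    # second sound parameters
    return encoder.synthesize(text, params)        # the encoded target voice

target_voice = change_voice("first_voice.wav", "user_42")
```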
140. Replace the first face image in the video to be processed with the target face image, and replace the first voice in the video to be processed with the target voice to obtain the target video.

After the target face image is obtained, it is combined with the stripped background image to obtain a complete video frame, for example as sketched below.
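A minimal compositing sketch, assuming OpenCV; seamlessClone blends the face boundary so the composite matches the surrounding lighting (the blending method is an assumption, not part of the disclosure):

```python
# Sketch: paste the target face back over the stripped background image.
import cv2
import numpy as np

def recombine(target_face: np.ndarray, background_image: np.ndarray, box):
    x, y, w, h = box                                  # where the face was stripped
    face = cv2.resize(target_face, (w, h))
    mask = 255 * np.ones(face.shape[:2], np.uint8)    # clone the whole patch
    center = (x + w // 2, y + h // 2)
    return cv2.seamlessClone(face, background_image, mask, center, cv2.NORMAL_CLONE)
```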
By processing all video frames of the video to be processed that contain the object to be face-changed in this way, the face of a specific object in the video can be changed into the face of the photographed object. The photographed object may be the user or another person: the user can take whoever he or she wants to appear in the video as the photographed object and shoot that person in real time, so that the desired face appears in the video, which enhances the user's sense of participation and interest when watching the video.

After the target face image is combined with the background image of each video frame and the target voice is combined with the background sound of the audio, the face of the object to be face-changed in the video to be processed has been replaced with the face of the photographed object, and the corresponding voice has been replaced with the voice derived from the photographed object, giving the target video after face changing and voice changing. The frames and the processed audio can then be muxed into the target video, for example as sketched below.
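A minimal muxing sketch, assuming OpenCV and the ffmpeg command-line tool; file names, frame rate, and codec are illustrative:

```python
# Sketch: write the processed frames to a video file and mux in the processed audio.
import subprocess
import cv2

def write_target_video(frames, audio_path: str, fps: float = 25.0) -> None:
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter("silent.mp4",
                             cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)
    writer.release()
    # Attach the processed audio track to the face-swapped frames.
    subprocess.run(["ffmpeg", "-y", "-i", "silent.mp4", "-i", audio_path,
                    "-c:v", "copy", "-map", "0:v:0", "-map", "1:a:0",
                    "target_video.mp4"], check=True)
```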
On the basis of the original video to be processed, the target video retains the original shooting effect and sound effect, only replacing the features of the original object. The processing is performed intelligently by pre-trained models, and the user only needs to provide his or her own image and voice as material, without taking part in re-recording the video picture and sound, which improves the interest and participation of the video and is convenient for the user to operate.
The method according to the previous embodiment is described in further detail below by way of example.
Referring to fig. 2, fig. 2 is a second flowchart illustrating a video processing method according to an embodiment of the present disclosure. The video processing method can comprise the following steps:
201. Acquire a video to be processed, where the video to be processed includes a first face image and a first voice.

In the embodiment of the application, the video to be processed may be a video from a live broadcast, or a short video, mini video, or the like. In some possible embodiments, the video to be processed may also be a manually recorded or shot video, or a video downloaded from a network, such as an episode of a television series or a movie on a video application; this is not specifically limited. The video to be processed may contain the faces and the corresponding voices of multiple objects, among which are the first face image and the first voice. The first voice may be the voice corresponding to the face object of the first face image.
202. Start the camera.

203. Capture a second face image of the photographed object in real time through the camera.

In the embodiment of the application, the face of the photographer can be captured freely, and the freely captured face is then substituted into the video to be processed.

To replace the first face image in the video to be processed with a freely captured face, the camera of the device may be started, and the person in front of the lens photographed through the camera to obtain the second face image. The user whose face image is captured is the photographed object.

For example, a user can directly enter a self-shooting mode and photograph his or her own face with the camera to obtain the second face image; the face image contained in the video is then replaced according to the second face image. The user's own face can thus be substituted into the video, as if the user had personally taken part in shooting it, which improves the interest and the sense of participation when watching the video.
204. Input the second face image into the face feature extraction model to obtain second face parameters of the face features in the second face image.

When the face image contained in the video is replaced according to the second face image, the shooting angle, lighting, expression, and so on of the second face image may differ from those of the first face image in the video. Therefore, after the second face image is obtained, it is not substituted into the video directly; instead, a target face image is generated from it. The target face image has the face features of the second face image while remaining consistent with the shooting angle, lighting, expression, and so on of the first face image in the video.

The captured second face image carries the second face parameters of the photographed object, which represent the face features of the photographed object. To generate the target face image from the second face image, the second face parameters of the photographed object may be acquired, for example by inputting the second face image into the trained face feature extraction model.
205. Store the second face parameters, together with the tag of the photographed object, in the face image library of the photographed object.

In the embodiment of the application, a face image library is established for each photographed object. Different face image libraries correspond to different tags, so as to distinguish the photographed objects to which the libraries belong.

After the second face parameters of a photographed object are obtained, they are stored in the face image library of that object, together with the object's tag, so that when the face image of the photographed object is later used to achieve the face-changing effect, the object's face image library and second face parameters can be found quickly through the tag.
206. Strip the first face image and the background image from the video frames of the video to be processed.

For a video to be processed, face changing is performed on a particular object in the video, which may be called the object to be face-changed. First, the video frames containing the face image of the object to be face-changed are determined. Then, from these video frames, the first face image of the object to be face-changed and the remaining background image are stripped, in preparation for the subsequent face-changing processing.
207. Input the first face image into the face feature extraction model to obtain first face parameters of the face features in the first face image.

After the first face image of the object to be face-changed is stripped from the video to be processed, the first face image is input into the trained face feature extraction model to obtain the first face parameters of the face features in the first face image; the first face parameters represent the face features of the object to be face-changed.
208. Acquire the second face parameters from the face image library of the photographed object.

According to the tags corresponding to the face image libraries, the face image library of the photographed object is determined from among the face image libraries, and the second face parameters stored in it are acquired. From the second face parameters, the face features of the photographed object can be obtained.
209. Replace the first face parameters of the face features in the first face image with the corresponding second face parameters to obtain the target face image.

According to the first face image in the video to be processed and the second face image captured in real time, the face in the second face image can be migrated into the first face image of the video to be processed on the basis of the first face parameters of the first face image and the second face parameters of the second face image. The resulting target face image is used to replace the first face image in the video frames of the video to be processed.

Specifically, the acquired first face parameters of the face features in the first face image are replaced by the second face parameters of the face features in the second face image; that is, the face features of the object to be face-changed in the video are replaced with the face features of the photographed object captured in real time, so that the face of the object to be face-changed in the first face image becomes the face of the photographed object. In other words, a target face image is generated from the first face image and the second face image that shows a different face but keeps the same shooting angle, lighting, expression, and so on, and therefore matches the video content and shooting effect.
210. Combine the target face image with the background image of the video frame.

After the target face image is obtained, it is combined with the stripped background image to obtain a complete video frame.

By processing all video frames of the video to be processed that contain the object to be face-changed in this way, the face of a specific object in the video can be changed into the face of the photographed object. The photographed object may be the user or another person: the user can take whoever he or she wants to appear in the video as the photographed object and shoot that person in real time, so that the desired face appears in the video, which enhances the user's sense of participation and interest when watching the video.
211. Guide the photographed object to read text information aloud.

In addition to face replacement, in order to further improve the user's sense of participation and interest when watching the video, the embodiment of the application also processes the sound of the video to be processed and replaces the corresponding voice with a voice derived from the photographed object.

To obtain sound material of the photographed object, the photographed object is first guided to read text information aloud; the text information may be characters, letters, or the like.

212. Collect the second voice while the photographed object reads the text information aloud.

While the user reads the text information aloud, the second voice of the photographed object is collected.
213. Extract the second sound parameters of the sound features in the second voice.

The second sound parameters extracted from the collected second voice are used to reflect the sound characteristics of the photographed object, such as phoneme characteristics and tone characteristics.
214. Store the second sound parameters, together with the tag of the photographed object, in the voice library of the photographed object.

In the embodiment of the application, a voice library is established for each photographed object. Different voice libraries correspond to different tags, so as to distinguish the photographed objects to which the voice libraries belong.

After the second sound parameters of the photographed object are obtained, they are stored in the voice library of that object, together with the object's tag, so that when the second voice of the photographed object is later used to achieve the voice-exchange effect, the object's voice library and second sound parameters can be found quickly through the tag.
215. Strip the first voice and the background sound from the audio of the video to be processed.

For a video to be processed whose sound is to be changed, the change is made for the object to be face-changed in the video. First, the audio passages in which the object to be face-changed speaks are located in the audio of the video to be processed, and the first voice of that object and the remaining background sound are stripped from these passages, in preparation for the subsequent processing. The background sound may be all sound other than the first voice.
216. Recognize the text information contained in the first voice.

A video often contains a large amount of speech, and making the user rehearse the lines in the video in order to achieve voice conversion would be far too cumbersome. In the embodiment of the application, the text information contained in the first voice is therefore recognized, and a target voice that contains the same text information as the first voice but carries the sound characteristics of the second voice is obtained from that text information and the collected second voice.
217. Encode according to the text information and the second sound parameters stored in the voice library to obtain the target voice corresponding to the text information.

That is, to obtain the target voice, encoding is performed according to the text information and the second sound parameters stored in the voice library, yielding the target voice corresponding to the text information.
218. Combine the target voice with the background sound of the audio.

The encoded target voice replaces the original first voice and is combined with the background sound of the audio, so that the processed audio is obtained, for example as sketched below.
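A minimal mixing sketch, assuming pydub (which shells out to ffmpeg); the file names and the overlay position are illustrative:

```python
# Sketch: overlay the encoded target voice onto the stripped background sound.
from pydub import AudioSegment

background_sound = AudioSegment.from_file("background.wav")
target_voice = AudioSegment.from_file("target_voice.wav")

# Place the target voice where the first voice was removed (position in ms).
processed_audio = background_sound.overlay(target_voice, position=0)
processed_audio.export("processed_audio.wav", format="wav")
```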
219. Obtain the target video.

After the target face image is combined with the background image of each video frame and the target voice is combined with the background sound of the audio, the face of the object to be face-changed in the video to be processed has been replaced with the face of the photographed object, and the corresponding voice has been replaced with the voice derived from the photographed object, giving the target video after face changing and voice changing.

On the basis of the original video to be processed, the target video retains the original shooting effect and sound effect, only replacing the features of the original object. The processing is performed intelligently by pre-trained models, and the user only needs to provide his or her own image and voice as material, without taking part in re-recording the video picture and sound, which improves the interest and participation of the video and is convenient for the user to operate.
As can be seen from the above, the video processing method provided by the embodiment of the application acquires a video to be processed containing a first face image and a first voice; acquires a second face image and a second voice; generates a target face image according to the first face image and the second face image, and a target voice according to the first voice and the second voice; and replaces the first face image in the video to be processed with the target face image and the first voice with the target voice to obtain the target video. Because the acquired second face image is not substituted into the video directly, but is first combined with the first face image in the video to generate the target face image, and because the first voice in the video is likewise replaced with the generated target voice, the flexibility of face replacement is improved.
In order to better implement the video processing method provided by the embodiment of the present application, an apparatus based on the video processing method is further provided. The meanings of the terms are the same as those in the video processing method above, and for implementation details reference may be made to the description in the method embodiments.
Referring to fig. 3, fig. 3 is a first structural diagram of a video processing apparatus according to an embodiment of the present disclosure. The video processing apparatus 300 may include a first obtaining module 301, a second obtaining module 302, a generating module 303, and a replacing module 304:
the first obtaining module 301 is configured to obtain a to-be-processed video, where the to-be-processed video includes a first face image and a first voice;
a second obtaining module 302, configured to obtain a second face image and obtain a second voice;
a generating module 303, configured to generate a target face image according to the first face image and the second face image, and generate a target voice according to the first voice and the second voice;
the replacing module 304 is configured to replace a first face image in the video to be processed with a target face image, and replace a first person voice in the video to be processed with a target person voice to obtain a target video.
In this embodiment of the application, when acquiring the second face image, the second acquiring module 302 may be configured to:
starting a camera;
and acquiring a second face image of the shot object in real time through the camera.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating a second structure of a video processing apparatus according to an embodiment of the present disclosure. In this embodiment of the application, after the second face image of the photographed object is acquired by the camera in real time, the video processing apparatus 300 may further include a first storage module 305, and the first storage module 305 may be configured to:
inputting the second face image into a face feature extraction model to obtain a second face parameter of the face feature in the second face image;
and correspondingly storing the second face parameters and the labels of the shot objects into a face image library of the shot objects.
In this embodiment of the application, when generating the target face image according to the first face image and the second face image, the generating module 303 may be configured to:
inputting the first face image into a face feature extraction model to obtain first face parameters of face features in the first face image;
acquiring a second face parameter from a face image library of the shot object;
and correspondingly replacing the first face parameters of the face features in the first face image with the second face parameters to obtain a target face image.
With continued reference to fig. 4, in the embodiment of the present application, before generating the target face image according to the first face image and the second face image, the video processing apparatus 300 may further include:
the first stripping module 306 is configured to strip the first face image and the background image from the video frame of the video to be processed.
When replacing the first face image in the video to be processed with the target face image, the replacing module 304 may be configured to:

combining the target face image with the background image of the video frame.
In this embodiment of the application, when the second voice is obtained, the second obtaining module 302 may be configured to:
guiding the shot object to read text information;
and collecting a second voice when the shot object reads the text information aloud.
Referring to fig. 4, in the embodiment of the present application, after the second voice of the photographed object in reading the text information is collected, the video processing apparatus 300 may further include a second storage module 307, where the second storage module 307 may be configured to:
extracting second sound parameters of the sound features in the second voice;

and storing the second sound parameters, together with the tag of the photographed object, in the voice library of the photographed object.
In this embodiment of the application, when generating the target voice according to the first voice and the second voice, the generating module 303 may be configured to:
recognizing the text information contained in the first voice;

and encoding according to the text information and the second sound parameters stored in the voice library to obtain the target voice corresponding to the text information.
Referring to fig. 4, in the embodiment of the present application, before generating the target voice according to the first voice and the second voice, the video processing apparatus 300 may further include:
and a second stripping module 308, configured to strip the first human voice and the background sound from the audio of the video to be processed.
Wherein, when replacing the first person voice of the video to be processed with the target person voice, the replacing module 304 may be configured to:
and combining the target human voice with the background voice of the audio.
As can be seen from the above, in the video processing apparatus 300 provided by the embodiment of the application, the first obtaining module 301 acquires a video to be processed containing a first face image and a first voice; the second obtaining module 302 acquires a second face image and a second voice; the generating module 303 generates a target face image according to the first face image and the second face image, and a target voice according to the first voice and the second voice; and the replacing module 304 replaces the first face image in the video to be processed with the target face image and the first voice with the target voice to obtain the target video. Because the acquired second face image is not substituted into the video directly, but is first combined with the first face image in the video to generate the target face image, and because the first voice in the video is likewise replaced with the generated target voice, the flexibility of face replacement is improved.
An electronic device is further provided in the embodiment of the present application; please refer to fig. 5, which is a schematic structural diagram of the electronic device provided in the embodiment of the present application. The electronic device 400 includes a processor 401 and a memory 402, and the processor 401 is electrically connected to the memory 402.
The processor 401 is the control center of the electronic device 400. It connects the various parts of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device 400 and processes data by running or loading the computer program stored in the memory 402 and invoking the data stored in the memory 402, thereby monitoring the electronic device 400 as a whole.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and performs data processing by running the computer programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, a computer program required by at least one function (such as a sound playing function or an image playing function), and the like, and the data storage area may store data created according to the use of the electronic device, and the like. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
In the embodiment of the present application, the processor 401 in the electronic device 400 stores a computer program executable on the processor 401 in the memory 402, and the processor 401 executes the computer program stored in the memory 402, thereby implementing various functions as follows:
acquiring a video to be processed, wherein the video to be processed comprises a first face image and a first voice;
acquiring a second face image and a second voice;
generating a target face image according to the first face image and the second face image, and generating a target voice according to the first voice and the second voice;
and replacing the first face image in the video to be processed with the target face image, and replacing the first voice in the video to be processed with the target voice to obtain the target video.
As can be seen from the above, the electronic device 400 provided by the embodiment of the application acquires a video to be processed containing a first face image and a first voice; acquires a second face image and a second voice; generates a target face image according to the first face image and the second face image, and a target voice according to the first voice and the second voice; and replaces the first face image in the video to be processed with the target face image and the first voice with the target voice to obtain the target video. Because the acquired second face image is not substituted into the video directly, but is first combined with the first face image in the video to generate the target face image, and because the first voice in the video is likewise replaced with the generated target voice, the flexibility of face replacement is improved.
The present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the video processing method in any of the foregoing embodiments, such as:
acquiring a video to be processed, wherein the video to be processed comprises a first face image and a first voice;
acquiring a second face image and a second voice;
generating a target face image according to the first face image and the second face image, and generating a target voice according to the first voice and the second voice;
and replacing the first face image in the video to be processed with the target face image, and replacing the first voice in the video to be processed with the target voice to obtain the target video.
In the embodiment of the present application, the computer-readable storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It should be noted that, for the video processing method of the embodiments of the present application, a person of ordinary skill in the art can understand that all or part of the process of implementing the method can be completed by controlling the relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium, such as the memory of an electronic device, and executed by at least one processor in the electronic device, and the execution process can include the process of the embodiments of the video processing method. The computer-readable storage medium may be a magnetic disk, an optical disk, a read-only memory, a random access memory, or the like.
In the video processing apparatus according to the embodiment of the present application, each functional module may be integrated into one processing chip, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, or the like.
The term "module" as used herein may be considered a software object executing on the computing system. The various components, modules, engines, and services described herein may be viewed as objects implemented on the computing system. The apparatus and method described herein are preferably implemented in software, but may also be implemented in hardware, and are within the scope of the present application.
The foregoing detailed description has provided a video processing method, an apparatus, a storage medium, and an electronic device according to embodiments of the present application, and specific examples are applied herein to illustrate principles and implementations of the present application, and the above descriptions of the embodiments are only used to help understand methods and core ideas of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.
Claims (12)
1. A video processing method, comprising:
acquiring a video to be processed, wherein the video to be processed comprises a first face image and a first voice;
acquiring a second face image and a second voice;
generating a target face image according to the first face image and the second face image, and generating a target voice according to the first voice and the second voice;
and replacing the first face image in the video to be processed with the target face image, and replacing the first voice in the video to be processed with the target voice to obtain a target video.
2. The video processing method according to claim 1, wherein said obtaining a second face image comprises:
starting a camera;
and acquiring a second face image of a photographed object in real time through the camera.
3. The video processing method according to claim 2, wherein after the acquiring of the second face image of the photographed object in real time through the camera, the method further comprises:
inputting the second face image into a face feature extraction model to obtain a second face parameter of the face feature in the second face image;
and storing the second face parameter, together with a tag of the photographed object, in a face image library of the photographed object.
4. The video processing method according to claim 3, wherein the generating a target face image according to the first face image and the second face image comprises:
inputting the first face image into the face feature extraction model to obtain first face parameters of the face features in the first face image;
acquiring the second face parameters from the face image library of the photographed subject;
and replacing the first face parameters of the face features in the first face image with the corresponding second face parameters, to obtain the target face image.
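Claims 3 and 4 together describe enrolling a subject's face parameters and later substituting them for those of the source face. A hedged sketch of that bookkeeping follows; the feature extractor here is a deliberately crude stand-in, and the parameter names ("identity", "pose") are assumptions, not terms from the patent.

```python
import numpy as np

# label -> stored second face parameters (the face image library of claim 3)
FACE_LIBRARY: dict = {}

def extract_face_params(face_image: np.ndarray) -> dict:
    # Crude stand-in for the face feature extraction model of claims 3-4;
    # a real model would output identity and pose/expression embeddings.
    flat = face_image.astype(np.float32).ravel()
    return {"identity": flat[:4].copy(),
            "pose": np.array([flat.mean(), flat.std()])}

def enroll(label: str, second_face_image: np.ndarray) -> None:
    # Claim 3: store the second face parameters under the subject's label.
    FACE_LIBRARY[label] = extract_face_params(second_face_image)

def generate_target_params(first_face_image: np.ndarray, label: str) -> dict:
    # Claim 4: extract the first face parameters, fetch the enrolled second
    # face parameters, and replace the identity features correspondingly
    # while keeping the source frame's pose. Rendering the swapped
    # parameters back into pixels is out of scope here.
    params = extract_face_params(first_face_image)
    params["identity"] = FACE_LIBRARY[label]["identity"]
    return params
```

Keeping pose parameters from the source frame while swapping only identity parameters is one plausible reading of "correspondingly replacing"; the claim does not commit to a particular split.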
5. The video processing method according to any one of claims 1 to 4, wherein before the generating a target face image according to the first face image and the second face image, the method further comprises:
stripping the first face image and the background image from the video frames of the video to be processed;
and the replacing the first face image in the video to be processed with the target face image comprises:
combining the target face image with the background image of the video frame.
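Claim 5's strip-and-recombine step can be pictured as mask-based compositing. Below is a sketch under the assumption that a per-frame face mask is already available; how the mask is obtained is not specified by the claim.

```python
import numpy as np

def split_frame(frame: np.ndarray, face_mask: np.ndarray):
    # Strip the first face image and the background image apart (claim 5).
    mask = face_mask[..., None].astype(bool)
    return np.where(mask, frame, 0), np.where(mask, 0, frame)

def recombine(target_face: np.ndarray, background: np.ndarray,
              face_mask: np.ndarray) -> np.ndarray:
    # Combine the target face image with the background image of the frame.
    mask = face_mask[..., None].astype(bool)
    return np.where(mask, target_face, background)

# Tiny smoke test on a 2x2 RGB frame.
frame = np.arange(12).reshape(2, 2, 3)
mask = np.array([[1, 0], [0, 0]])
face, bg = split_frame(frame, mask)
assert np.array_equal(recombine(face, bg, mask), frame)
```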
6. The video processing method according to claim 2, wherein the acquiring a second voice comprises:
guiding the photographed subject to read text information aloud;
and collecting the second voice of the photographed subject while the text information is read.
7. The video processing method according to claim 6, wherein after the collecting of the second voice of the photographed subject while the text information is read, the method further comprises:
extracting second sound parameters of the sound features in the second voice;
and storing the second sound parameters, in correspondence with the label of the photographed subject, into a voice library of the photographed subject.
8. The video processing method according to claim 7, wherein the generating a target voice according to the first voice and the second voice comprises:
recognizing text information contained in the first voice;
and encoding, according to the text information and the second sound parameters stored in the voice library, to obtain the target voice corresponding to the text information.
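Claims 7 and 8 amount to a voice-cloning loop: enroll the subject's sound parameters, transcribe the original line, then re-synthesize it in the enrolled voice. A toy sketch, with string stand-ins for the recognizer and the encoder (real systems would use speech-recognition and TTS models, which the claims do not name):

```python
# label -> stored second sound parameters (the voice library of claim 7)
VOICE_LIBRARY: dict = {}

def extract_sound_params(voice_clip: str) -> dict:
    # Stand-in for claim 7's extraction of sound features (e.g. timbre).
    return {"timbre": f"timbre_of({voice_clip})"}

def enroll_voice(label: str, second_voice: str) -> None:
    VOICE_LIBRARY[label] = extract_sound_params(second_voice)

def recognize_text(first_voice: str) -> str:
    # Stand-in for the speech recognition step of claim 8.
    return f"text_of({first_voice})"

def generate_target_voice(first_voice: str, label: str) -> str:
    # Claim 8: encode the recognized text with the stored second sound
    # parameters to obtain the target voice.
    text = recognize_text(first_voice)
    params = VOICE_LIBRARY[label]
    return f"speech({text}, {params['timbre']})"

enroll_voice("subject", "reading_sample")
print(generate_target_voice("original_line", "subject"))
```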
9. The video processing method according to any one of claims 6 to 8, wherein before the generating a target voice according to the first voice and the second voice, the method further comprises:
stripping the first voice and the background sound from the audio of the video to be processed;
and the replacing the first voice in the video to be processed with the target voice comprises:
combining the target voice with the background sound of the audio.
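Claim 9 mirrors claim 5 on the audio track. Assuming the separation itself is done elsewhere (e.g. by a source-separation model), the remaining bookkeeping is a subtract-and-mix, sketched below with naive waveform arithmetic:

```python
import numpy as np

def strip_voice(audio_mix: np.ndarray, voice: np.ndarray) -> np.ndarray:
    # Claim 9 (first half): what remains after stripping the first voice
    # is the background sound. Real separation is far less literal.
    return audio_mix - voice

def remix(target_voice: np.ndarray, background: np.ndarray) -> np.ndarray:
    # Claim 9 (second half): combine the target voice with the background
    # sound of the audio, truncating to the shorter of the two.
    n = min(len(target_voice), len(background))
    return target_voice[:n] + background[:n]

mix = np.array([0.5, 0.2, -0.1])
voice = np.array([0.4, 0.1, 0.0])
new_voice = np.array([0.3, 0.3, 0.3])
print(remix(new_voice, strip_voice(mix, voice)))
```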
10. A video processing apparatus, comprising:
a first acquisition module, configured to acquire a video to be processed, wherein the video to be processed comprises a first face image and a first voice;
a second acquisition module, configured to acquire a second face image and a second voice;
a generation module, configured to generate a target face image according to the first face image and the second face image, and to generate a target voice according to the first voice and the second voice;
and a replacement module, configured to replace the first face image in the video to be processed with the target face image, and to replace the first voice in the video to be processed with the target voice, to obtain a target video.
11. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the video processing method according to any one of claims 1 to 9.
12. An electronic device, comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the video processing method according to any one of claims 1 to 9.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111531779.6A | 2021-12-14 | 2021-12-14 | Video processing method and device, storage medium and electronic equipment
Publications (1)

Publication Number | Publication Date
---|---
CN114222077A (en) | 2022-03-22
Family ID: 80702145
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202111531779.6A (pending, published as CN114222077A) | Video processing method and device, storage medium and electronic equipment | 2021-12-14 | 2021-12-14

Country Status (1)

Country | Link
---|---
CN | CN114222077A (en)
Cited By (2)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN115589453A (en) * | 2022-09-27 | 2023-01-10 | Vivo Mobile Communication Co., Ltd. | Video processing method and device, electronic equipment and storage medium
CN116137673A (en) * | 2023-02-22 | 2023-05-19 | Guangzhou Huanju Shidai Information Technology Co., Ltd. | Digital human expression driving method and device, equipment and medium thereof
Citations (7)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN106599817A (en) * | 2016-12-07 | 2017-04-26 | Tencent Technology (Shenzhen) Co., Ltd. | Face replacement method and device
CN108780643A (en) * | 2016-11-21 | 2018-11-09 | Microsoft Technology Licensing, LLC | Automatic dubbing method and apparatus
CN110688911A (en) * | 2019-09-05 | 2020-01-14 | Shenzhen Zhuiyi Technology Co., Ltd. | Video processing method, device, system, terminal equipment and storage medium
CN110889381A (en) * | 2019-11-29 | 2020-03-17 | Guangzhou Huaduo Network Technology Co., Ltd. | Face changing method and device, electronic equipment and storage medium
CN110968736A (en) * | 2019-12-04 | 2020-04-07 | Shenzhen Zhuiyi Technology Co., Ltd. | Video generation method and device, electronic equipment and storage medium
CN113223555A (en) * | 2021-04-30 | 2021-08-06 | Beijing Youzhuju Network Technology Co., Ltd. | Video generation method and device, storage medium and electronic equipment
CN113486785A (en) * | 2021-07-01 | 2021-10-08 | Shenzhen Inveno Technology Co., Ltd. | Video face changing method, device, equipment and storage medium based on deep learning
2021-12-14: Application CN202111531779.6A filed in China; published as CN114222077A (en); status: Pending
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20220322