CN115914834A - Video processing method and device - Google Patents
- Publication number
- CN115914834A (application CN202111168114.3A)
- Authority
- CN
- China
- Prior art keywords
- video frame
- video
- motion
- transformation
- transformation matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
Embodiments of the present disclosure provide a video processing method, apparatus, device, and computer-readable storage medium. The method determines a motion transformation matrix for the motion of a specific target by matching keypoints related to that target acquired from adjacent video frames, and smooths the video frames based on that motion, thereby generating a more stable and smooth stabilized video. With the method of the embodiments of the present disclosure, when the video device undergoes involuntary movement (such as shaking) during operations such as face video shooting or instant video communication, unstable video images can be converted into coherent and smooth images efficiently and in real time, improving the quality of the video and the viewing experience of the viewer.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence and video processing, and more particularly, to a video processing method, apparatus, device, and storage medium.
Background
With the development of media technology, video devices have become increasingly common, and more and more ordinary households and individuals can shoot and process videos on their own, so expectations for video quality keep rising. For videos shot with handheld video devices in which a human face is the main subject (hereinafter "face videos", which may include self-shot face videos and face videos shot by others), video jitter is a particular concern, and such videos need to be stabilized, that is, processed with Video Stabilization technology. The purpose of video stabilization is to make the video comfortable for the human eye and easier to observe; it can also serve as a preprocessing stage for many subsequent tasks such as detection, tracking, or compression.
Existing video stabilization techniques can be broadly divided into two categories: optical-flow-based techniques and statistical-learning-based techniques. Optical-flow-based video stabilization requires no training samples and is simple and efficient, but for face videos it cannot effectively handle occlusion of the background by the face, so the keypoints tracked by optical flow are easily lost and the stabilization effect on face videos is limited. Statistical-learning-based video stabilization can generate high-quality stabilized videos, but it requires a large number of paired shaky/stabilized videos for training; such paired training data must be captured under strictly consistent scenes and are difficult to acquire at scale, and the trained models are generally large and hard to run in real time.
Therefore, a real-time and effective video stabilization method is needed that can perform stable and efficient stabilization processing on face videos.
Disclosure of Invention
To solve the above problem, the present disclosure estimates the motion of the video device by matching the face keypoints detected in adjacent video frames, and smooths the video image based on that motion, thereby generating a more stable and smooth stabilized video.
Embodiments of the present disclosure provide a video processing method, apparatus, device, and computer-readable storage medium.
An embodiment of the present disclosure provides a video processing method, including: acquiring a first video frame and a second video frame, wherein the first video frame is an adjacent video frame before the second video frame; acquiring a first number of key points of the first video frame and a first number of key points of the second video frame, wherein the first number of key points of the first video frame and the first number of key points of the second video frame are related to a specific target in the video, and the first number of key points of the second video frame and the first number of key points of the first video frame are in one-to-one correspondence; determining a motion transformation matrix for the particular object based on the first number of keypoints for the second video frame and the first number of keypoints for the first video frame; and generating a processed second video frame based on the first video frame and the motion transformation matrix.
An embodiment of the present disclosure provides a video processing apparatus, including: a video frame acquisition module configured to acquire a first video frame and a second video frame, wherein the first video frame is an adjacent video frame before the second video frame; a key point obtaining module configured to obtain a first number of key points of the first video frame and a first number of key points of the second video frame, where the first number of key points of the first video frame and the first number of key points of the second video frame are related to a specific target in the video, and the first number of key points of the second video frame and the first number of key points of the first video frame correspond to each other one by one; a motion transformation determination module configured to determine a motion transformation matrix for the particular object based on the first number of keypoints for the second video frame and the first number of keypoints for the first video frame; and a video frame generation module configured to generate a processed second video frame based on the first video frame and the motion transformation matrix.
An embodiment of the present disclosure provides a video processing apparatus including: one or more processors; and one or more memories, wherein the one or more memories have stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 1-13.
Embodiments of the present disclosure provide a computer-readable storage medium having stored thereon computer-executable instructions for implementing the method of any one of claims 1-13 when executed by a processor.
Embodiments of the present disclosure provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the video processing method according to the embodiment of the present disclosure.
Compared with the traditional video image stabilization technology based on optical flow, the method provided by the embodiment of the disclosure can select more appropriate key points for the face video, thereby effectively estimating the motion of the video equipment and generating the smooth image stabilization video. Compared with the traditional video image stabilization technology based on statistical learning, the method provided by the embodiment of the disclosure can perform efficient video image stabilization processing on the face video without a large amount of video sample training, improves the real-time performance of the video image stabilization processing, and reduces the processing complexity.
The method provided by the embodiments of the present disclosure determines a motion transformation matrix for the motion of a specific target by matching keypoints related to that target acquired from adjacent video frames, and smooths the video frames based on that motion, thereby generating a more stable and smooth stabilized video. With the method of the embodiments of the present disclosure, when the video device undergoes involuntary movement (such as shaking) during operations such as face video shooting or instant video communication, unstable video images can be converted into coherent and smooth images efficiently and in real time, improving the quality of the video and the viewing experience of the viewer.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments will be briefly described below. It is apparent that the drawings in the following description are only exemplary embodiments of the disclosure, and that other drawings may be derived from those drawings by a person of ordinary skill in the art without inventive effort.
Fig. 1A is an exemplary schematic diagram illustrating a scene in which video captured by a video device is processed according to an embodiment of the disclosure;
FIG. 1B is a comparison diagram illustrating example video frames before and after video stabilization processing according to an embodiment of the present disclosure;
fig. 2 is a flow diagram illustrating a video processing method according to an embodiment of the present disclosure;
FIG. 3A is a schematic diagram illustrating detection of a particular target in a video frame according to an embodiment of the present disclosure;
FIG. 3B is a schematic diagram illustrating a first number of keypoints for acquiring a video frame from a video frame, according to an embodiment of the disclosure;
FIG. 4A is a schematic diagram illustrating keypoint matching according to an embodiment of the present disclosure;
FIG. 4B is a flow diagram illustrating determining a motion transformation matrix according to an embodiment of the disclosure;
FIG. 5 is a schematic diagram showing a comparison before and after filtering a plurality of transformation parameters according to an embodiment of the disclosure;
FIG. 6 is a schematic diagram illustrating the generation of a processed second video frame according to an embodiment of the present disclosure;
FIG. 7 is a comparison diagram of example video frames between optical-flow-based video stabilization and video stabilization according to embodiments of the present disclosure;
fig. 8 is a schematic diagram illustrating a video processing apparatus according to an embodiment of the present disclosure;
fig. 9 shows a schematic diagram of a video processing device according to an embodiment of the present disclosure;
FIG. 10 shows a schematic diagram of an architecture of an exemplary computing device, according to an embodiment of the present disclosure; and
FIG. 11 shows a schematic diagram of a storage medium according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It should be understood that the described embodiments are only some of the embodiments of the present disclosure, and not all of the embodiments of the present disclosure, and it is to be understood that the present disclosure is not limited by the example embodiments described herein.
In the present specification and the drawings, steps and elements having substantially the same or similar characteristics are denoted by the same or similar reference numerals, and repeated description of the steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance or order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
For the purpose of describing the present disclosure, concepts related to the present disclosure are introduced below.
The video processing method of the present disclosure may be based on Artificial Intelligence (AI). Artificial intelligence is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive discipline of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. For example, an artificial-intelligence-based video processing method can stabilize a video in a manner similar to how a human recognizes image shake in a video viewed with the naked eye and tries to eliminate it. By studying the design principles and implementation methods of various intelligent machines, artificial intelligence enables the video processing method of the present disclosure to convert unstable video images into coherent and smooth images efficiently and in real time.
The video processing method of the present disclosure may be based on video stabilization techniques. Video stabilization refers to processing a sequence of original video frames captured by a video device to remove the effects of involuntary movement of the video device. Video stabilization technology converts a video whose stability has been degraded by involuntary movement of the video device into a stabilized video; in this processing, the video frame sequence is smoothed according to the movement of the video device to generate a coherent and smooth video frame sequence, thereby relieving the discomfort caused by the involuntary movement and improving the quality of the video. By their mechanism of action, video stabilization techniques can be divided into optical, mechanical, and electronic video stabilization. Electronic (digital) video stabilization is a comprehensive stabilization technology combining optics, mechanics, electronics, and computation: it estimates the motion between successive video images and then applies motion compensation to each frame of the video to obtain a stable image.
Thus, the video processing method of the present disclosure may also be based on motion compensation techniques. Motion compensation removes unwanted motion according to a global motion vector obtained by motion estimation, thereby stabilizing the video image. A video frame sequence acquired by a video device may contain both random jitter of the video device and intentional motion, so stabilizing the image sequence requires compensating the current image using the result of motion estimation, retaining the normal, intentional scanning (panning) motion of the video device while removing the random jitter, so that the output video sequence is coherent and smooth and the purpose of stabilization is achieved. In general, random jitter is high-frequency noise, while intentional scanning motion is smooth, low-frequency motion; the two differ in frequency. Thus, the jitter component can be extracted and compensated by applying a filtering process (e.g., Kalman filtering) to the motion vector.
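As a minimal illustration of this idea (not taken from the patent text), the following Python sketch separates a one-dimensional motion trajectory into a low-frequency intentional component and a high-frequency jitter component using a simple moving-average filter; the window size and the example trajectory are hypothetical.

```python
import numpy as np

def smooth_trajectory(trajectory, window=15):
    """Low-pass a 1-D motion trajectory with a moving average.

    The smoothed curve approximates the intentional (low-frequency)
    camera motion; the residual is the high-frequency jitter that
    motion compensation aims to remove.
    """
    kernel = np.ones(window) / window
    smoothed = np.convolve(trajectory, kernel, mode="same")  # aligned with input frames
    jitter = trajectory - smoothed
    return smoothed, jitter

# Hypothetical cumulative horizontal displacement of the camera, per frame (pixels)
raw_dx = np.cumsum(np.random.default_rng(0).normal(0.5, 2.0, size=100))
intended, shake = smooth_trajectory(raw_dx)
```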
For the case in which the specific target in the present disclosure is a face, the video processing method of the present disclosure may also be based on face keypoint detection technology. Face keypoint detection techniques automatically identify the locations of face keypoints in a face image or video; these keypoints may include dominant points that describe unique locations of a facial feature (e.g., the corners of the eyes) and interpolated points that connect these dominant points along the facial feature contours and the face contour. In general, face keypoint detection methods can be roughly divided into three types: conventional methods based on the ASM (Active Shape Model) and the AAM (Active Appearance Model), methods based on cascaded shape regression, and methods based on deep learning. The video processing method of the present disclosure relies on keypoints acquired by a face keypoint detection technique but does not limit the manner in which the keypoints are acquired; any of the above face keypoint detection methods, or any other face keypoint detection method, may be adopted.
In summary, the solutions provided by the embodiments of the present disclosure relate to technologies such as artificial intelligence, video image stabilization, face key point detection, and the like, and the embodiments of the present disclosure will be further described with reference to the accompanying drawings.
Fig. 1A is an exemplary schematic diagram illustrating a scene 100 in which video captured by a video device is processed according to an embodiment of the disclosure. Fig. 1B is a comparison diagram illustrating example video frames before and after video stabilization processing according to an embodiment of the present disclosure.
As shown in fig. 1A, a user may capture various videos including, but not limited to, a face video, a landscape video, a food video, and the like, through a video device in his/her possession. The captured video may be transmitted over a network to a processing end (e.g., a server) for video processing (e.g., video stabilization processing in embodiments of the present disclosure).
Optionally, the processed video may be output to any other video playing device accessing the network through the network, so as to be watched by other users through the video playing device in the possession.
Alternatively, the video device may specifically include any device having video capture functionality, such as a smartphone, tablet, laptop, handheld camera, vehicle camera, wearable device, and the like. For example, the video playback device may specifically include any device having video playback functionality, such as a smartphone, tablet, laptop, wearable device, television (e.g., internet television), and so on.
Optionally, the processed video may also be returned to the video device for viewing by the user through the video device he or she holds. Thus, the video device may also have video playback functionality, which may be any device having both video acquisition and video playback functionality, such as a smartphone, tablet, laptop portable computer, wearable device, and the like. In this case, the other video devices can also view the processed video through the network.
The video processing may be synchronized with the capturing of the video frames (e.g., in a scenario where the user performs instant video communication with other users through the video device) or may be unified processing after the video capturing is completed (e.g., the user takes a facial video offline using the video device), depending on the actual usage scenario of the user.
Alternatively, the network may be an Internet-based and/or telecommunication-based Internet of Things (IoT), which may be a wired network or a wireless network, for example a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a cellular data communication network, or another electronic network capable of exchanging information.
Optionally, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, a middleware service, a domain name service, a security service, a CDN, a big data and artificial intelligence platform, and the like.
In the embodiments of the present disclosure, the video device may be any device having a video capture function, and with the popularization and wide application of video devices, they are becoming increasingly diverse. Because platforms such as handheld and vehicle-mounted video devices are affected by posture changes and vibration, the video signals obtained through these devices usually contain irregular random motions of the device (such as rotation, translation, and zooming). These irregular random motions cause blurring and shaking of the acquired video frame sequence, which makes observation inconvenient, seriously degrades the viewing experience, and hinders effective use of the image information. For face videos in particular, when the relative position of the face in the video image shakes severely (high-frequency shaking), the video can cause a strong sense of discomfort for viewers.
For example, in the video frames in fig. 1B (a) (first column), a video image directly captured by a video device appears blurred, and the position of the face in the image changes abruptly, due to the irregular random motion of the user during capture. Referring, for example, to the chin of the face in the image, it can be seen that the chin position changes greatly over a short video frame interval; when the video is watched, such high-frequency blurring and shaking become even more noticeable and cause discomfort to the viewer.
Therefore, it is necessary to smooth the acquired video frame sequence with respect to the motion of the video device to remove the influence of the irregular random motion of the video device.
To address this problem, solutions can be broadly divided into two types of video stabilization methods: optical-flow-based video stabilization and statistical-learning-based video stabilization. The optical-flow-based method is simple and efficient and needs no training samples, but for videos in which faces are the main subject it cannot effectively handle the occlusion of the background by the face, so the keypoints tracked by optical flow are easily lost and its stabilization effect on such videos is limited. The statistical-learning-based method uses a neural network to directly generate the stabilized video by learning the conversion relationship between the video to be stabilized and the stabilized video, and can therefore generate high-quality stabilized videos; however, it requires a large number of paired shaky/stabilized videos for training, such paired training data must be captured under strictly consistent scenes and are difficult to acquire at scale, and the training models are generally large and cannot meet real-time requirements.
Based on this, the present disclosure provides a video processing method that uses face keypoints detected in adjacent video frames, obtained with a face keypoint detection method, to estimate the motion of the video device and to smooth the video image based on that motion, thereby generating a more stable and smooth stabilized video.
As shown in the video frames in fig. 1B (b) (the second column), after the video processing method of the present disclosure is applied to the video to which the frames in fig. 1B (a) (the first column) belong, the change in the position of the face in the image is significantly reduced. Referring, for example, to the chin of the face in the image, it can be seen that over the same video frame interval the change in chin position is markedly smaller than before the stabilization processing.
Compared with the traditional video image stabilization technology based on optical flow, the method provided by the embodiment of the disclosure can select more appropriate key points for the face video, thereby effectively estimating the motion of the video equipment and generating the smooth image stabilization video. Compared with the traditional video image stabilization technology based on statistical learning, the method provided by the embodiment of the disclosure can perform efficient video image stabilization processing on the face video without a large amount of video sample training, improves the real-time performance of the video image stabilization processing, and reduces the processing complexity.
The method provided by the embodiments of the present disclosure determines a motion transformation matrix for the motion of a specific target by matching keypoints related to that target acquired from adjacent video frames, and smooths the video frames based on that motion, thereby generating a more stable and smooth stabilized video. With the method of an embodiment of the present disclosure, when the video device undergoes involuntary movement (such as shaking) during operations such as face video shooting or instant video communication, unstable video images can be converted into coherent and smooth images efficiently and in real time.
fig. 2 is a flow diagram illustrating a video processing method 200 according to an embodiment of the present disclosure.
As shown in fig. 2, in step 201, a first video frame and a second video frame may be acquired, where the first video frame is an adjacent video frame before the second video frame.
Optionally, the video may be acquired in different manners according to actual usage scenarios of the user, and thus the first video frame and the second video frame may be acquired in different manners.
Optionally, in a scene in which the user performs instant video communication with other users through the video device, the video may be acquired and acquired frame by frame in real time for processing, that is, when the first video frame is acquired, the second video frame and subsequent video frames may not be acquired yet, and at this time, the acquisition of the second video frame may be subsequent to the acquisition of the first video frame.
For example, when a user is in a video conference with other users via application software such as Tencent Meeting on a smartphone, video frames captured in real time may be acquired for processing. It should be understood that the video conference is used only as an example of instant video communication and is not limiting; the application scenarios of the video processing method of the present disclosure may also include, for example, webcasting and the like.
Alternatively, in a scenario where a user uses a video device to capture a video, the video may be processed after capture is completed, i.e., when the first video frame is acquired, the capture of the second video frame and its subsequent video frames has already been completed, and the acquisition of the second video frame may be simultaneous with or subsequent to the acquisition of the first video frame. For example, the captured video may be processed after the user finishes capturing the video with the video device.
According to an embodiment of the present disclosure, the video may be a video with a human face as a specific target.
For example, the video may be the above-mentioned face video; the face video may be a self-shot face video taken by the user, or a face video of the user taken by another person, and the present disclosure does not limit this.
In step 202, a first number of key points of the first video frame and a first number of key points of the second video frame may be obtained, where the first number of key points of the first video frame and the first number of key points of the second video frame are related to a specific target in the video, and the first number of key points of the second video frame and the first number of key points of the first video frame are in one-to-one correspondence.
Alternatively, the keypoints in the first and second video frames may be determined based on the relationship of the first number of keypoints in the first and second video frames to a particular target in the video.
Thus, before determining the keypoints, the specific target may be determined in the first video frame and the second video frame, respectively, to determine the keypoints based on the specific target.
According to an embodiment of the present disclosure, the specific target may be detected in the first video frame, and a first number of key points of the first video frame may be acquired based on a detection result.
According to an embodiment of the present disclosure, the specific target may be detected in the second video frame, and a first number of key points of the second video frame may be acquired based on a detection result.
Wherein, according to an embodiment of the present disclosure, a detection result of detecting the specific target in the first video frame may be a first target frame of the specific target, and a detection result of detecting the specific target in the second video frame may be a second target frame of the specific target.
Alternatively, the first object box and the second object box may be regions where the specific object is located in the first video frame and the second video frame, respectively, from which the position coordinate range of the specific object in the corresponding video frame may be roughly determined.
According to an embodiment of the present disclosure, a first number of keypoints of the particular target in the first video frame may be obtained based on the first target frame.
According to an embodiment of the present disclosure, a first number of keypoints of the specific target in the second video frame may be acquired based on the second target frame.
Optionally, after determining the first target frame and the second target frame in which the specific target is located, the keypoints in the first video frame and the second video frame may be determined based on the approximate range defined by the first target frame and the second target frame and the relationship between the first number of keypoints in the first video frame and the first number of keypoints in the second video frame and the specific target in the video.
According to an embodiment of the present disclosure, the particular target may have a first number of keypoints, and the first number of keypoints of the particular target may be associated with a physical feature of the particular target.
Optionally, a first number of keypoints may be selected according to the physical characteristics of the particular target, and these keypoints may be respectively associated with at least a part of the physical characteristics of the particular target.
For example, where the particular object is a human face, its physical features may be physical features (e.g., physical shape, location, etc.) of portions of its face, and thus a first number of its keypoints may include points that are respectively associated with portions on the face, including but not limited to eyebrows, eyes, nose, mouth, and chin, etc.
According to an embodiment of the present disclosure, the first number of keypoints of the first video frame and the first number of keypoints of the second video frame may be determined according to the first number of keypoints of the specific target.
Alternatively, the first number of keypoints of the first video frame and the first number of keypoints of the second video frame may each have a one-to-one correspondence with the first number of keypoints of the particular target. Thus, the keypoints may also be respectively associated with at least a portion of the physical features of the particular target.
For example, where the particular object is a human face, the first number of keypoints in the first video frame and the first number of keypoints in the second video frame may be associated with respective locations on the human face (including, but not limited to, eyebrows, eyes, nose, mouth, jaw, and the like).
Thus, based on the determination of the range of position coordinates of the particular object in the respective video frame and the first number of keypoints of the particular object, the first number of keypoints of the first video frame and the first number of keypoints of the second video frame may be determined.
Note that the acquiring of the first video frame and the acquiring of the first number of key points of the first video frame may be before the acquiring of the second video frame and the acquiring of the first number of key points of the second video frame, that is, when the second video frame is processed, the first number of key points of the first video frame may have already been acquired, so that the image stabilization processing on the second video frame may be directly performed using the first number of key points of the first video frame.
Alternatively, in the case where the video is a video targeted specifically to a face, the keypoints in the video frames may be determined based on the detection of the face and the keypoint localization.
According to an embodiment of the present disclosure, the face may be detected in the first video frame using a predetermined face detection algorithm to determine a face region in the first video frame, and a first number of keypoints associated with the face may be obtained in the face region of the first video frame using a predetermined face keypoint detection algorithm.
According to an embodiment of the present disclosure, the face may be detected in the second video frame by using a predetermined face detection algorithm to determine a face region in the second video frame, and a first number of key points related to the face may be acquired in the face region of the second video frame by using a predetermined face key point detection algorithm.
Alternatively, the predetermined face detection algorithm may be any face detection algorithm capable of detecting a face in a video image to determine a face region according to an embodiment of the present disclosure, including, for example and without limitation, the VJ (Viola-Jones) algorithm, the MTCNN (Multi-task Cascaded Convolutional Neural Network) algorithm, and the like.
Also, the predetermined face keypoint detection algorithm may be any face keypoint detection algorithm capable of achieving the obtaining of the first number of keypoints associated with the face according to embodiments of the present disclosure, including, for example and without limitation, ASM and AAM-based methods, cascading shape regression-based methods, deep learning-based methods, and the like.
Optionally, in the case of a failure in face detection or face keypoint detection (for example, no face is detected or a sufficient number of face keypoints are not acquired), another suitable video stabilization method (for example, an optical-flow-based method) may be selected to stabilize the corresponding video frames, so as to handle the situation in which severe shaking of the video device causes the face to temporarily move out of the frame or to be only partially displayed, which would otherwise prevent normal face-based stabilization. Once the video image returns to normal (the face is fully displayed again), the subsequent steps of the video processing method of the present disclosure may be continued.
Through the face detection and the face key point detection, the position coordinates of the key points of the first number of the first video frame and the key points of the first number of the second video frame in the corresponding video frames can be determined. Specifically, the processing in step 202 described above, such as determining the position coordinates of the keypoints, may be as follows with reference to fig. 3A-3B.
In step 203, a motion transformation matrix for the particular object may be determined based on the first number of keypoints for the second video frame and the first number of keypoints for the first video frame.
Optionally, the second video frame is adjacent to and behind the first video frame, and the movement of all the pixel points in the video frame due to the high-frequency jitter of the video device may be considered as applying motion transformation to each pixel point in the first video frame, so that the pixel point is moved from the original position to the corresponding position in the second video frame.
According to an embodiment of the present disclosure, step 203 may include: matching the first number of key points of the second video frame with the first number of key points of the first video frame to generate a first number of key point pairs; and determining a motion transformation matrix for the particular object based on the first number of keypoint pairs.
Alternatively, step 203 may be performed in two parts, namely, generating key point pairs and determining motion transformation matrices.
First, in the generating a keypoint pair section, according to an embodiment of the present disclosure, each keypoint pair may include two keypoints that may correspond to a particular keypoint of the particular target and belong to the second video frame and the first video frame, respectively.
Alternatively, as described above, the key point pair may be composed of two key points before and after the motion transform is applied. Wherein the two keypoints may be associated with the same physical feature of the particular target. For example, where the particular object is a human face, the two keypoints may be associated with one keypoint of a particular location on the human face (including, but not limited to, eyebrows, eyes, nose, mouth, or chin, etc.).
Alternatively, after all keypoints pairs of the first video frame and the second video frame and the keypoints in each keypoint pair are determined, a motion transformation matrix for the particular object may be estimated based on the movement of these keypoints.
Alternatively, based on the estimated motion transformation matrix, high-frequency noise caused by high-frequency jitter of the video device and the like can be filtered, so that the obtained video frame sequence is smoothed to reduce high-frequency blurring and jitter generated by irregular random motion of the video device when the video is viewed. Specifically, the process of determining the motion transformation matrix portion may be as described below with reference to fig. 4B.
In step 204, a processed second video frame may be generated based on the first video frame and the motion transformation matrix.
According to an embodiment of the present disclosure, step 204 may include: applying the motion transformation matrix to all pixel points in the first video frame to generate the processed second video frame, wherein all pixel points in the first video frame include at least a first number of keypoints of the first video frame.
Optionally, since random jitter of the video device causes uniform movement of all pixels in the video image, the motion transformation matrix may be applied to all pixels including key points and non-key points in the entire video image to perform motion transformation on all pixels, so as to generate a second video frame from the first video frame, where high-frequency noise in the second video frame is filtered out.
According to an embodiment of the present disclosure, the size of the processed second video frame is not larger than the size of the second video frame, e.g. the size of the processed second video frame is smaller than the size of the second video frame.
Alternatively, the processed second video frame may be obtained after performing the cropping process on the motion-transformed first video frame. For example, due to irregular random motion (such as rotation, translation, and zooming) of the video device, the video image after the motion transformation may not be suitable for the display screen of the current video device (for example, there may be content that part of the image corners are not displayable), and therefore, the video image after the motion transformation needs to be adaptively trimmed to adapt to the screen size of the video image.
Figs. 3A-3B take a face video as an example and show schematic diagrams of acquiring keypoints from video frames according to an embodiment of the disclosure. Fig. 3A is a schematic diagram illustrating detection of a specific target in a video frame according to an embodiment of the present disclosure. Fig. 3B is a schematic diagram illustrating acquisition of a first number of keypoints of a video frame from the video frame according to an embodiment of the disclosure.
As shown in fig. 3A, when the face detection algorithm is performed on the video frame, a thick line box in fig. 3A may be generated, where the thick line box may represent the above-mentioned first target box or second target box (hereinafter, referred to as a face target box), and a defined area of the thick line box is an area where a specific target (in this example, a face) is located in the video frame, and determines a position coordinate range of a key point of the face in the video frame.
For example, in fig. 3A, the obtained position coordinates of the face are box_t = (x_t, y_t, w_t, h_t), where (x_t, y_t) are the coordinates of the upper-left corner of the face target box, and w_t and h_t are the width and the height of the face target box, respectively.
After the face target frame is obtained, the first number of face keypoints within the face target frame can be detected and localized by the face keypoint detection algorithm:

    F_t = G(box_t)            (1)
    F_{t+1} = G(box_{t+1})    (2)

where box_t is the face target frame in the t-th frame, G is the function of the face keypoint detection algorithm, and F_t denotes the position coordinates of the face keypoints in the t-th frame.
As shown by the black dots in fig. 3B, the first number of face keypoints are respectively associated with a specific part (including but not limited to eyebrows, eyes, nose, mouth, or chin, etc.) in the face. For example, in the example shown in fig. 3B, the number of key points of the face is 65, where the eyebrow portions of the face are represented by 10 key points (where the eyebrows on the left and right sides are represented by 5 key points, respectively).
Based on the above-described face detection and face keypoint detection processing, the position coordinates of a first number of keypoints in the image frame may be obtained for subsequent determination of the motion transformation matrix.
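The text does not prescribe concrete implementations, but as one hedged sketch of this step in Python with OpenCV, a Viola-Jones cascade (one of the detectors named above) can produce the face target box, while the keypoint function G is left as a stand-in that any ASM/AAM, cascaded-regression, or deep-learning landmark detector could fill; the helper names are illustrative only.

```python
import cv2
import numpy as np

# Viola-Jones (VJ) face detector shipped with OpenCV, used here as the
# "predetermined face detection algorithm"; any other detector (e.g. MTCNN) works too.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_face_box(frame):
    """Return the largest detected face box (x, y, w, h), or None on failure."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # caller may fall back to e.g. optical-flow stabilization
    return max(faces, key=lambda f: f[2] * f[3])

def detect_keypoints(frame, box):
    """Stand-in for the keypoint function G(box_t): returns an (N, 2) array of
    face keypoint coordinates inside `box`. The concrete landmark detector
    (ASM/AAM, cascaded shape regression, deep learning, ...) is not specified
    by the description, so this helper is hypothetical."""
    raise NotImplementedError
```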
Fig. 4A and 4B illustrate a process of determining a motion transformation matrix based on a first number of keypoints for each of a first video frame and a second video frame. Fig. 4A is a schematic diagram illustrating keypoint matching according to an embodiment of the present disclosure. Fig. 4B is a flow diagram illustrating determining a motion transformation matrix according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, each keypoint pair may comprise a first point belonging to the first video frame and a second point belonging to the second video frame.
As shown in fig. 4A, after the first number of keypoints of each of the first video frame and the second video frame are acquired, keypoints of the two frames that are associated with the same physical feature (for example, a particular eyebrow keypoint of the face, such as the second keypoint counting from the middle of the face toward the side of the face) can be matched (connected by straight lines in fig. 4A) based on the association of the keypoints with the physical features of the specific target, to form keypoint pairs, that is:

    M_t(i) = (F_t(i), F_{t+1}(i))    (3)

where F_t(i) is the position coordinate of the i-th face keypoint in the t-th frame, and M_t(i) denotes the i-th keypoint pair from the t-th frame to the (t+1)-th frame.
It should be understood that in fig. 4A, for clarity of illustration, only key point pairs in the left half of the face in the image are shown, and the right half may be matched and connected in the same manner. Thereby, a first number of key point pairs may be formed.
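Because the same landmark detector is run on both frames and typically returns its keypoints in a fixed canonical order, matching reduces to pairing keypoints by index; a small sketch (an assumption about detector behavior, not a statement from the patent) is shown below.

```python
import numpy as np

def match_keypoints(kpts_t, kpts_t1):
    """Form keypoint pairs M_t(i) = (F_t(i), F_{t+1}(i)).

    Both inputs are (N, 2) arrays from the same landmark detector, so the
    i-th row of each array refers to the same facial feature and matching
    is simply row-wise pairing.
    """
    kpts_t, kpts_t1 = np.asarray(kpts_t), np.asarray(kpts_t1)
    assert kpts_t.shape == kpts_t1.shape
    return list(zip(kpts_t, kpts_t1))
```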
According to an embodiment of the present disclosure, determining the motion transformation matrix of the specific target based on the first number of key point pairs may include the steps as shown in fig. 4B.
According to an embodiment of the present disclosure, the motion transformation matrix may include a plurality of transformation parameters.
Alternatively, the motion of the specific object may be characterized by the motion transformation matrix, and thus to determine the motion of the specific object, the motion transformation matrix needs to be determined, i.e. the form of the motion transformation matrix (i.e. the motion model) and the transformation parameters included in the matrix need to be determined.
Optionally, for video stabilization, since the interval between two adjacent video frames is short, it can be assumed that no twisting or distortion occurs, and the irregular random motion of the video device can generally be represented by motions such as rotation, translation, and zooming; the motion of the specific target can therefore generally be modeled with a translation, rotation, and scaling model.
According to an embodiment of the present disclosure, the motion of the specific object may be an affine motion, and the motion transformation matrix may be an affine transformation matrix.
According to an embodiment of the present disclosure, the plurality of transformation parameters includes parameters related to at least one of scaling, rotation angle, horizontal displacement, and vertical displacement.
Thus, the motion model H can be represented as:

        [ s·cosθ   -s·sinθ   T_x ]
    H = [ s·sinθ    s·cosθ   T_y ]    (4)
        [   0          0       1 ]

where s denotes the scale factor, θ denotes the rotation angle, T_x denotes the horizontal displacement, and T_y denotes the vertical displacement. These four variables are the transformation parameters to be determined, and this transformation matrix can represent rotation, translation, and uniform scaling of a rigid body.
Alternatively, the determination process of the motion transformation matrix and its transformation parameters is described below by taking the video frame sequence I as an example.
Let z_t = [x_t, y_t, 1] denote the pixel point located at position (x_t, y_t) in the t-th frame of the video frame sequence I; the motion of the specific target can then be expressed through the positions of the keypoints in different frames. Considering two adjacent video frames I_t and I_{t+1}, the coordinates in frame t+1 of a pixel point z_t belonging to frame t can be expressed as:

    z_{t+1} = H_t z_t    (5)

where H_t is the two-dimensional motion transformation matrix described above; it describes the motion of the specific target (or the motion of the video device) from frame t to frame t+1 and can be expressed as:

          [ s_t·cosθ_t   -s_t·sinθ_t   T_x,t ]
    H_t = [ s_t·sinθ_t    s_t·cosθ_t   T_y,t ]    (6)
          [      0             0          1  ]

Therefore, the plurality of transformation parameters of the motion transformation matrix need to be determined.
In step 401, current estimated values of the plurality of transformation parameters may be determined.
Alternatively, the estimation of the transformation parameters can be regarded as an optimization over the transformation parameters, in which the estimated values should make the error of the transformation from the t-th frame to the (t+1)-th frame, based on the motion transformation matrix, minimal or below a preset threshold.
In step 402, for each keypoint pair of the first number of keypoint pairs, a position estimate for its corresponding keypoint may be determined based on the position of the first point of the keypoint pair and the current estimates of the plurality of transformation parameters, and a point transformation error corresponding to the keypoint pair may be determined based on the position of the second point of the keypoint pair and the position estimate.
For example, based on the position F_t(i) of the first point in the i-th keypoint pair and the current estimates H_t of the plurality of transformation parameters, the position estimate of the corresponding keypoint can be determined as H_t F_t(i); based on the position F_{t+1}(i) of the second point in the keypoint pair, the point transformation error can then be expressed as F_{t+1}(i) - H_t F_t(i), or a norm thereof (for example, the squared norm ||F_{t+1}(i) - H_t F_t(i)||^2), or another similar form, and the present disclosure is not limited in this respect.
In step 403, a target transformation error corresponding to the current estimation values of the plurality of transformation parameters is determined based on a first number of point transformation errors corresponding to the first number of key point pairs, respectively.
As described above, the overall transformation errors corresponding to the current estimates of the plurality of transformation parameters may be determined in association with the respective point transformation errors for the first number of key point pairs and optimized to determine the transformation parameter that best fits the motion of the particular object.
According to an embodiment of the present disclosure, step 403 may comprise taking a sum of the first number of point transformation errors as a target transformation error corresponding to the current estimate of the plurality of transformation parameters.
For example, taking the first number of 65 as an example, the form of the target transformation error can be determined based on the squared form of the point transformation error, that is:

    H_t* = argmin_{H_t} Σ_{i=0}^{64} ||F_{t+1}(i) - H_t F_t(i)||^2    (7)

where argmin_{H_t}(·) denotes the value of the variable H_t that minimizes the expression in (·), and Σ_{i=0}^{64} denotes summation of the expression in (·) from i = 0 to i = 64.
Wherein, according to an embodiment of the present disclosure, the target transformation error satisfying a predetermined condition includes at least one of: the target transformation error is less than a predetermined threshold; or the target transformation error is a minimum value obtained within a predetermined range of values of the plurality of transformation parameters.
Alternatively, the optimal estimation of the transformation parameters may be to minimize the target transformation error or reach a preset threshold, which is not limited by the present disclosure.
In step 404, in case the target transformation error satisfies a predetermined condition, a pending estimated value of a plurality of transformation parameters of the motion transformation matrix is determined based on the current estimated values of the plurality of transformation parameters.
Alternatively, the undetermined estimated values of the plurality of determined transformation parameters may be used as an optimal estimation of the motion of the specific target, and the smoothing process may be performed based on the optimal estimation.
Optionally, since the determined pending estimated values of the plurality of transformation parameters are not affected when the input video frame is uniformly scaled, the video frame may be scaled before the plurality of transformation parameters of the motion transformation matrix are estimated, so as to reduce the amount of computation.
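One practical way to obtain such an estimate, sketched below under the assumption that OpenCV is available, is cv2.estimateAffinePartial2D, which fits exactly this restricted (rotation, uniform scale, translation) transform to the matched point sets; it is used here as a convenient solver for the least-squares problem in equation (7), not as the solver prescribed by the patent.

```python
import cv2
import numpy as np

def estimate_motion(kpts_t, kpts_t1):
    """Estimate the 4-DOF transform mapping frame-t keypoints onto frame-(t+1)
    keypoints, i.e. an approximate minimizer of sum_i ||F_{t+1}(i) - H_t F_t(i)||^2.

    Returns a 3x3 homogeneous matrix, or None if estimation fails."""
    M, _inliers = cv2.estimateAffinePartial2D(
        np.asarray(kpts_t, dtype=np.float32),
        np.asarray(kpts_t1, dtype=np.float32),
        method=cv2.LMEDS)
    if M is None:
        return None
    return np.vstack([M, [0.0, 0.0, 1.0]])

def decompose(H):
    """Recover (s, theta, tx, ty) from the estimated matrix for later filtering."""
    s = float(np.hypot(H[0, 0], H[1, 0]))
    theta = float(np.arctan2(H[1, 0], H[0, 0]))
    return s, theta, float(H[0, 2]), float(H[1, 2])
```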
Determining a motion transformation matrix may further include the following operations, according to embodiments of the present disclosure.
In step 405, the motion transformation matrix may be motion compensated based on the pending estimated values of the plurality of transformation parameters of the motion transformation matrix.
According to an embodiment of the present disclosure, step 405 may comprise filtering each transformation parameter of the plurality of transformation parameters of the motion transformation matrix, respectively, to determine a filtered plurality of transformation parameters.
Optionally, the trajectory (i.e., the cumulative sum of the preceding values) of each of the plurality of transformation parameters may be filtered to remove high-frequency noise in the transformation parameters and thereby smooth out high-frequency jitter in the video frames.
Fig. 5 is a diagram illustrating comparison before and after filtering a plurality of transform parameters according to an embodiment of the present disclosure.
For example, in a face video, the video picture is unstable due to interference from involuntary motion such as shake of the video device; as shown in fig. 5, high-frequency noise is clearly present in the original trajectories of the transformation parameters.
Thus, the motion $H_t$ of the video device may be motion compensated, i.e., the trajectory of each transformation parameter in $H_t$ may be filtered (for example, Kalman filtered) to obtain the smoothed transformation parameters:

$$\hat{H}_t = \mathrm{Kalman}(H_t)$$

where $\mathrm{Kalman}(\cdot)$ denotes Kalman filtering.
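As a hedged illustration of this step, the sketch below applies a simple one-dimensional constant-position Kalman filter to the trajectory of a single transformation parameter; the process and measurement variances are illustrative assumptions, and another smoothing filter (e.g., a moving-average or Gaussian low-pass filter) could be substituted without changing the overall scheme.

```python
import numpy as np

def kalman_smooth(trajectory, process_var=1e-3, measurement_var=1e-1):
    """One-dimensional Kalman filter over the trajectory (cumulative sum over
    frames) of a single transformation parameter, suppressing high-frequency
    noise caused by involuntary motion of the video device.

    trajectory: 1-D sequence of the accumulated parameter value per frame.
    Returns an array of the same length holding the smoothed trajectory.
    """
    trajectory = np.asarray(trajectory, dtype=np.float64)
    x_est = trajectory[0]   # state estimate (smoothed parameter value)
    p_est = 1.0             # estimate covariance
    smoothed = [x_est]
    for z in trajectory[1:]:
        # Predict: the smoothed trajectory is assumed locally constant.
        p_pred = p_est + process_var
        # Update with the observed (noisy) trajectory value z.
        k_gain = p_pred / (p_pred + measurement_var)
        x_est = x_est + k_gain * (z - x_est)
        p_est = (1.0 - k_gain) * p_pred
        smoothed.append(x_est)
    return np.asarray(smoothed)

# Applied independently to the trajectory of each parameter of H_t, e.g.:
#   smoothed_dx = kalman_smooth(np.cumsum(per_frame_dx))
```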
Through this motion compensation processing, the high-frequency jitter produced by the specific target (or the video device) is smoothed and the high-frequency noise in the motion of the specific target is reduced. Therefore, when the video device undergoes involuntary motion (such as shake) during operations such as face video shooting or instant video communication, high-frequency blur and jitter in the video can be reduced, and a viewer obtains a better experience when watching the processed video.
According to an embodiment of the present disclosure, the motion compensated motion transformation matrix may contain the filtered plurality of transformation parameters.
In step 406, the motion compensated motion transformation matrix is determined as the motion transformation matrix for the specific object.
Thus, by substituting the motion-compensated plurality of transformation parameters into the motion transformation matrix, the motion transformation matrix to be finally applied to the first video frame may be determined. Applying the smoothed motion transformation matrix $\hat{H}_t$ to the t-th frame $F_t$ of the sequence of video frames then generates the processed (t+1)-th frame:

$$\hat{F}_{t+1} = \hat{H}_t\, F_t$$

where $F_t$ denotes the t-th video frame and $\hat{F}_{t+1}$ denotes the processed (t+1)-th video frame.
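In pixel terms, applying $\hat{H}_t$ means warping every pixel of the first video frame by the same transform. A minimal sketch is given below, assuming OpenCV is available and that the matrix is the 2x3 similarity matrix produced by the estimation sketch above; names are illustrative.

```python
import cv2
import numpy as np

def apply_motion_transform(first_frame, H_t):
    """Warp every pixel of the first video frame (key points and non-key
    points alike) with the smoothed 2x3 motion transformation matrix to
    obtain the processed second video frame."""
    h, w = first_frame.shape[:2]
    return cv2.warpAffine(first_frame, np.asarray(H_t, dtype=np.float32), (w, h))
```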
it should be understood that the above-described motion models are used by way of example only and not by way of limitation, and that motion of a video device may be characterized in other forms of motion models in addition to the above-described motion models, such as using transform matrices with more degrees of freedom to fit non-uniform scaling transforms of the video device.
Fig. 6 is a schematic diagram illustrating the generation of a processed second video frame according to an embodiment of the present disclosure.
As shown in fig. 6, generating a processed second video frame based on a first video frame and an original second video frame may include the following four main steps.
First, a first number of key point pairs between a first video frame and an original second video frame are obtained through face detection and face key point positioning and matching.
Then, based on the first number of key point pairs, motion estimation is performed on a plurality of transformation parameters of the motion transformation matrix to obtain undetermined estimation values of the plurality of transformation parameters.
Next, the trajectories of the plurality of transformation parameters (their cumulative sums over the preceding frames) are motion compensated to remove high-frequency noise therein, resulting in a filtered plurality of transformation parameters, which are used as the final transformation parameters of the motion transformation matrix to be applied to the first video frame.
Finally, the determined motion transformation matrix is applied to all pixel points of the first video frame to generate a processed second video frame.
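The first of these four steps relies on a face detector and a facial key point localizer. Below is a minimal sketch using dlib's off-the-shelf 68-point landmark predictor as one possible realization; the patent does not name a specific detector, landmark model, or key point count (its example above uses 65 key points), and the model file path is an assumption.

```python
import dlib
import numpy as np

# Detector choice and model file path are illustrative assumptions.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_face_keypoints(frame_gray):
    """Step 1 of Fig. 6: detect the face region, then locate facial key
    points inside it. Returns an (N, 2) array of (x, y) coordinates, or
    None if no face is found."""
    faces = detector(frame_gray, 1)   # upsample once to catch small faces
    if len(faces) == 0:
        return None
    shape = predictor(frame_gray, faces[0])
    return np.array([[shape.part(i).x, shape.part(i).y]
                     for i in range(shape.num_parts)])
```

Because the i-th landmark refers to the same facial location in every frame, the key point pairs of step 1 can then be formed simply by index, as shown later in the apparatus description.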
FIG. 7 is a comparison diagram illustrating example video frames of optical-flow-based video stabilization and of video stabilization according to an embodiment of the present disclosure.
As shown in fig. 7, fig. 7 (a) (first column) is a video stabilization processing result based on optical flow, and fig. 7 (b) (second column) is a video stabilization result according to an embodiment of the present disclosure.
It can be seen that, for the face video, the position of the face in the image still fluctuates up and down in the optical-flow-based video stabilization result. For example, referring to the chin portion of the face in the image, it can be found that the position of the chin changes greatly within a short video frame interval; when the video is watched, such high-frequency blur and shake become more noticeable and cause discomfort to the viewer. For the same face video, in the video stabilization result according to the embodiment of the present disclosure, the change in the position of the face in the image is significantly reduced: referring to the chin portion of the face, the change in the chin position over the same video frame interval is significantly smaller than in the optical-flow-based video stabilization processing.
In addition, image shake caused by shake of the video device causes the face and the background in the image to move together. Since the video processing method of the present disclosure performs video stabilization on the whole image including both the face and the background, shake of the background portion is also smoothed. For example, in the comparison results shown in fig. 7, the change in the position of the tree in the background is significantly reduced in the video stabilization result according to the embodiment of the present disclosure compared with the optical-flow-based video stabilization processing result.
Therefore, the video stabilization method of the present disclosure can select more appropriate key points for the face video, so that the motion of the video device is effectively estimated and a coherent, smooth stabilized video is generated.
Fig. 8 is a schematic diagram illustrating a video processing apparatus 800 according to an embodiment of the present disclosure.
The video processing apparatus 800 may include a video frame acquisition module 801, a keypoint acquisition module 802, a motion transformation determination module 803, and a video frame generation module 804.
According to an embodiment of the present disclosure, the video frame acquiring module 801 may be configured to acquire a first video frame and a second video frame, where the first video frame is an adjacent video frame before the second video frame.
Optionally, the video may be acquired in different manners according to actual usage scenarios of the user, and thus the first video frame and the second video frame may be acquired in different manners.
Optionally, the first video frame and the second video frame may be video frames in a face video, and the face video may be a self-portrait face video shot by the user himself or a face video shot of a non-specific person, which is not limited in the present disclosure.
According to an embodiment of the present disclosure, the keypoint acquisition module 802 may be configured to acquire a first number of keypoints of the first video frame and a first number of keypoints of the second video frame, which are associated with a specific target in the video, wherein the first number of keypoints of the second video frame correspond one-to-one to the first number of keypoints of the first video frame.
Alternatively, the keypoints in the first and second video frames may be determined based on a relationship of the first number of keypoints in the first and second video frames to a particular target in the video.
Alternatively, the first number of keypoints of the first video frame and the first number of keypoints of the second video frame may respectively correspond one-to-one to the first number of keypoints of the specific target. Thus, the keypoints may also be respectively associated with at least a portion of the physical features of the particular target.
For example, where the particular object is a human face, the first number of keypoints in the first video frame and the first number of keypoints in the second video frame may be associated with respective locations on the human face (including but not limited to eyebrows, eyes, nose, mouth, jaw, etc.), respectively.
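As one hedged illustration of this association, the mapping below groups the indices of the widely used 68-point facial landmark annotation by facial region; it is given only as an example of how keypoint indices can be tied to physical features, since the patent does not fix a particular annotation scheme or keypoint count.

```python
# Index ranges of the standard 68-point facial landmark annotation, as one
# illustrative mapping between keypoint indices and facial regions.
FACE_REGIONS_68 = {
    "jaw":           range(0, 17),
    "right_eyebrow": range(17, 22),
    "left_eyebrow":  range(22, 27),
    "nose":          range(27, 36),
    "right_eye":     range(36, 42),
    "left_eye":      range(42, 48),
    "mouth":         range(48, 68),
}
```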
According to an embodiment of the present disclosure, the motion transformation determining module 803 may be configured to determine the motion transformation matrix of the specific object based on the first number of keypoints of the second video frame and the first number of keypoints of the first video frame.
Optionally, the second video frame is adjacent to and behind the first video frame, and the movement of all the pixel points in the video frame due to the high-frequency jitter of the video device may be considered as applying motion transformation to each pixel point in the first video frame, so that the pixel point is moved from the original position to the corresponding position in the second video frame.
According to an embodiment of the present disclosure, the motion transformation determining module 803 may be further configured to: matching the first number of key points of the second video frame with the first number of key points of the first video frame to generate a first number of key point pairs; and determining a motion transformation matrix for the particular object based on the first number of keypoint pairs.
Alternatively, two keypoints before and after applying the motion transformation may constitute one keypoint pair. Wherein the two keypoints may be associated with the same physical feature of the particular target. For example, where the particular object is a human face, the two keypoints may be associated with one keypoint of a particular location on the human face (including, but not limited to, eyebrows, eyes, nose, mouth, or chin, etc.).
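Since the two keypoints of a pair carry the same landmark index in the first and second video frames, the matching itself can be as simple as pairing by index, as in the illustrative sketch below (names are assumptions, not taken from the patent).

```python
import numpy as np

def match_keypoints(kp_first, kp_second):
    """Form keypoint pairs between adjacent frames. The i-th keypoint of each
    frame is produced by the same landmark index (the same physical feature of
    the face), so matching reduces to pairing by index."""
    kp_first = np.asarray(kp_first)
    kp_second = np.asarray(kp_second)
    assert kp_first.shape == kp_second.shape
    # Each element is ((x, y) in the first frame, (x, y) in the second frame).
    return list(zip(kp_first, kp_second))
```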
Alternatively, after all keypoint pairs between the first video frame and the second video frame, and the keypoints in each keypoint pair, are determined, a motion transformation matrix for the specific target may be estimated based on the movement of these keypoints.
Alternatively, based on the estimated motion transformation matrix, high-frequency noise due to high-frequency jitter of the video device and the like may be filtered therein, so that the acquired sequence of video frames is smoothed to mitigate high-frequency blur and jitter generated due to irregular random motion of the video device when viewing the video.
According to an embodiment of the present disclosure, the video frame generation module 804 may be configured to generate a processed second video frame based on the first video frame and the motion transformation matrix.
Optionally, since random jitter of the video device causes uniform movement of all pixels in the video image, the motion transformation matrix may be applied to all pixels including key points and non-key points in the entire video image to perform motion transformation on all pixels, so as to generate a second video frame from the first video frame, where high-frequency noise in the second video frame is filtered out.
According to still another aspect of the present disclosure, there is also provided a video processing apparatus. Fig. 9 shows a schematic diagram of a video processing device 2000 according to an embodiment of the present disclosure.
As shown in fig. 9, the video processing device 2000 may include one or more processors 2010, and one or more memories 2020. Wherein the memory 2020 has stored therein computer readable code that, when executed by the one or more processors 2010, may perform a video processing method as described above.
The processor in the embodiments of the present disclosure may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the methods, steps, and logical blocks disclosed in the embodiments of the present disclosure. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor of, for example, the X86 architecture or the ARM architecture.
In general, the various example embodiments of this disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of embodiments of the disclosure have been illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
For example, a method or apparatus in accordance with embodiments of the present disclosure may also be implemented by way of the architecture of computing device 3000 shown in fig. 10. As shown in fig. 10, computing device 3000 may include a bus 3010, one or more CPUs 3020, a read-only memory (ROM) 3030, a random access memory (RAM) 3040, a communication port 3050 to connect to a network, input/output components 3060, a hard disk 3070, and the like. A storage device in the computing device 3000, such as the ROM 3030 or the hard disk 3070, may store various data or files used in the processing and/or communication of the video processing method provided by the present disclosure, as well as program instructions executed by the CPU. Computing device 3000 may also include a user interface 3080. Of course, the architecture shown in fig. 10 is merely exemplary, and one or more components of the computing device shown in fig. 10 may be omitted as desired when implementing different devices.
According to yet another aspect of the present disclosure, there is also provided a computer-readable storage medium. FIG. 11 shows a schematic diagram 4000 of a storage medium according to the present disclosure.
As shown in fig. 11, the computer storage medium 4020 has computer readable instructions 4010 stored thereon. The computer readable instructions 4010, when executed by a processor, may perform the video processing method according to an embodiment of the present disclosure described with reference to the above figures. The computer-readable storage medium in embodiments of the present disclosure may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM). It should be noted that the memories of the methods described herein are intended to comprise, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the video processing method according to the embodiment of the present disclosure.
Embodiments of the present disclosure provide a video processing method, apparatus, device, and computer-readable storage medium.
Compared with the traditional video image stabilization technology based on optical flow, the method provided by the embodiment of the disclosure can select more appropriate key points for the face video, thereby effectively estimating the motion of the video equipment and generating the smooth image stabilization video. Compared with the traditional video image stabilization technology based on statistical learning, the method provided by the embodiment of the disclosure can perform efficient video image stabilization processing on the face video without a large amount of video sample training, improves the real-time performance of the video image stabilization processing, and reduces the processing complexity.
The method provided by the embodiment of the disclosure determines the motion transformation matrix of the motion of the specific target by matching the key points related to the specific target and acquired from the adjacent video frames, so as to smooth the video frames based on the motion of the specific target, thereby generating a more stable and smooth image-stabilized video. By the method of the embodiment of the disclosure, when the video device makes involuntary movement (such as shaking) in operations such as face video shooting or instant video communication, unstable video images can be converted into coherent and smooth images in real time and efficiently, so that the quality of the video is improved, and the viewing experience of a viewer is improved.
It is to be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises at least one executable instruction for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The exemplary embodiments of the present disclosure, which are described in detail above, are merely illustrative, and not restrictive. It will be appreciated by those skilled in the art that various modifications and combinations of these embodiments or features thereof may be made without departing from the principles and spirit of the disclosure, and that such modifications are intended to be within the scope of the disclosure.
Claims (20)
1. A video processing method, comprising:
acquiring a first video frame and a second video frame, wherein the first video frame is an adjacent video frame before the second video frame;
acquiring a first number of key points of the first video frame and a first number of key points of the second video frame, wherein the first number of key points of the first video frame and the first number of key points of the second video frame are related to a specific target in the video, and the first number of key points of the second video frame and the first number of key points of the first video frame are in one-to-one correspondence;
determining a motion transformation matrix for the particular target based on the first number of keypoints for the second video frame and the first number of keypoints for the first video frame; and
generating a processed second video frame based on the first video frame and the motion transformation matrix.
2. The method of claim 1, wherein determining the motion transformation matrix for the particular object based on the first number of keypoints for the second video frame and the first number of keypoints for the first video frame comprises:
matching the first number of key points of the second video frame with the first number of key points of the first video frame to generate a first number of key point pairs; and
determining a motion transformation matrix for the particular target based on the first number of keypoint pairs.
3. The method of claim 2, wherein the particular target has a first number of keypoints and the first number of keypoints for the particular target are associated with a physical feature of the particular target,
the first number of keypoints of the first video frame and the first number of keypoints of the second video frame are determined from the first number of keypoints of the particular target, an
Each keypoint pair comprises two keypoints corresponding to a particular keypoint of the particular target and belonging respectively to the second video frame and to the first video frame.
4. The method of claim 2, wherein the motion transformation matrix comprises a plurality of transformation parameters, and each keypoint pair comprises a first point and a second point, the first point belonging to the first video frame and the second point belonging to the second video frame,
wherein determining the motion transformation matrix for the particular target based on the first number of key point pairs comprises:
determining current estimates of the plurality of transformation parameters;
for each keypoint pair of the first number of keypoint pairs, determining a position estimate of its corresponding keypoint based on the position of a first point of the keypoint pair and the current estimates of the plurality of transformation parameters, and determining a point transformation error corresponding to the keypoint pair based on the position of a second point of the keypoint pair and the position estimate;
determining a target transformation error corresponding to the current estimation values of the plurality of transformation parameters based on a first number of point transformation errors corresponding to the first number of key point pairs, respectively; and
determining pending estimated values of a plurality of transformation parameters of the motion transformation matrix based on the current estimated values of the plurality of transformation parameters if the target transformation error satisfies a predetermined condition.
5. The method of claim 4, wherein,
determining a target transformation error corresponding to the current estimate of the plurality of transformation parameters based on a first number of point transformation errors corresponding to the first number of key point pairs, respectively, comprises:
taking a sum of the first number of point transform errors as a target transform error corresponding to the current estimate of the plurality of transform parameters;
wherein the target transformation error satisfying a predetermined condition comprises at least one of:
the target transformation error is less than a predetermined threshold; or
The target transformation error is a minimum value obtained within a predetermined range of values of the plurality of transformation parameters.
6. The method of claim 4, wherein determining the motion transformation matrix for the particular target based on the first number of keypoint pairs further comprises:
performing motion compensation on the motion transformation matrix based on undetermined estimated values of a plurality of transformation parameters of the motion transformation matrix; and
determining the motion transformation matrix after motion compensation as the motion transformation matrix of the specific target.
7. The method of claim 6, wherein motion compensating the motion transformation matrix comprises:
filtering each of a plurality of transformation parameters of the motion transformation matrix to determine a filtered plurality of transformation parameters;
wherein the motion compensated motion transform matrix comprises the filtered plurality of transform parameters.
8. The method of any one of claims 4-5 and 7, wherein the motion of the particular object is affine motion, the motion transformation matrix is an affine transformation matrix, and
wherein the plurality of transformation parameters include parameters related to at least one of scaling, rotation angle, horizontal displacement, and vertical displacement.
9. The method of claim 1, further comprising:
detecting the specific target in the first video frame, and acquiring a first number of key points of the first video frame based on the detection result; and
detecting the specific target in the second video frame, and acquiring a first number of key points of the second video frame based on the detection result.
10. The method of claim 9, wherein the detection result of the specific target in the first video frame is a first target frame of the specific target, and the detection result of the specific target in the second video frame is a second target frame of the specific target;
wherein a first number of keypoints of the particular target in the first video frame is obtained based on the first target frame; and
based on the second target frame, a first number of keypoints of the particular target in the second video frame is obtained.
11. The method of claim 1, wherein the video is a video in which the specific target is a human face, wherein,
detecting the face in the first video frame by using a predetermined face detection algorithm to determine a face region in the first video frame, and acquiring a first number of key points related to the face in the face region of the first video frame by using a predetermined face key point detection algorithm; and
detecting the face in the second video frame by the predetermined face detection algorithm to determine a face region in the second video frame, and acquiring a first number of key points related to the face in the face region of the second video frame by the predetermined face key point detection algorithm.
12. The method of claim 11, wherein generating a processed second video frame based on the first video frame and the motion transformation matrix comprises:
applying the motion transformation matrix to all pixel points in the first video frame to generate the processed second video frame, wherein all pixel points in the first video frame include at least a first number of keypoints of the first video frame.
13. The method of claim 12, wherein the size of the processed second video frame is smaller than the size of the second video frame.
14. A video processing apparatus comprising:
a video frame acquisition module configured to acquire a first video frame and a second video frame, wherein the first video frame is an adjacent video frame before the second video frame;
a key point obtaining module configured to obtain a first number of key points of the first video frame and a first number of key points of the second video frame, where the first number of key points of the first video frame and the first number of key points of the second video frame are related to a specific target in the video, and the first number of key points of the second video frame and the first number of key points of the first video frame correspond to each other one by one;
a motion transformation determination module configured to determine a motion transformation matrix for the particular object based on the first number of keypoints for the second video frame and the first number of keypoints for the first video frame; and
a video frame generation module configured to generate a processed second video frame based on the first video frame and the motion transformation matrix.
15. The apparatus of claim 14, wherein the motion transformation determining module is further configured to:
matching the first number of key points of the second video frame with the first number of key points of the first video frame to generate a first number of key point pairs; and
determining a motion transformation matrix for the particular object based on the first number of keypoint pairs.
16. The apparatus of claim 15, wherein the motion transformation matrix comprises a plurality of transformation parameters, and each keypoint pair comprises a first point and a second point, the first point belonging to the first video frame and the second point belonging to the second video frame,
wherein determining the motion transformation matrix for the particular target based on the first number of key point pairs comprises:
determining current estimates of the plurality of transformation parameters;
for each keypoint pair of the first number of keypoint pairs, determining a position estimate of its corresponding keypoint based on the position of a first point of the keypoint pair and the current estimates of the plurality of transformation parameters, and determining a point transformation error corresponding to the keypoint pair based on the position of a second point of the keypoint pair and the position estimate;
determining a target transformation error corresponding to the current estimation values of the plurality of transformation parameters based on a first number of point transformation errors corresponding to the first number of key point pairs, respectively; and
determining pending estimated values of a plurality of transformation parameters of the motion transformation matrix based on the current estimated values of the plurality of transformation parameters if the target transformation error satisfies a predetermined condition.
17. The apparatus of claim 16, wherein determining the motion transformation matrix for the particular target based on the first number of keypoint pairs further comprises:
performing motion compensation on the motion transformation matrix based on undetermined estimated values of a plurality of transformation parameters of the motion transformation matrix; and
determining the motion transformation matrix after motion compensation as the motion transformation matrix of the specific target.
18. A video processing apparatus comprising:
one or more processors; and
one or more memories having stored therein a computer-executable program that, when executed by the processor, performs the method of any of claims 1-13.
19. A computer program product comprising computer instructions which, when executed by a processor, cause a computer device to perform the method of any one of claims 1-13.
20. A computer-readable storage medium having stored thereon computer-executable instructions for implementing the method of any one of claims 1-13 when executed by a processor.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111168114.3A CN115914834A (en) | 2021-09-29 | 2021-09-29 | Video processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111168114.3A CN115914834A (en) | 2021-09-29 | 2021-09-29 | Video processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115914834A true CN115914834A (en) | 2023-04-04 |
Family
ID=86482860
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111168114.3A Pending CN115914834A (en) | 2021-09-29 | 2021-09-29 | Video processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115914834A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116112761A (en) * | 2023-04-12 | 2023-05-12 | 海马云(天津)信息技术有限公司 | Method and device for generating virtual image video, electronic equipment and storage medium |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40084289; Country of ref document: HK