
Video processing method, device and storage medium

Info

Publication number
CN111209897A
CN111209897A
Authority
CN
China
Prior art keywords
human body
video
body region
training
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010157708.3A
Other languages
Chinese (zh)
Other versions
CN111209897B (en)
Inventor
吴韬
徐叙远
刘孟洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yayue Technology Co ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010157708.3A
Publication of CN111209897A
Application granted
Publication of CN111209897B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/47 - Detecting features for summarising video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 - Details of television systems
    • H04N 5/222 - Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 - Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/265 - Mixing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video processing method, a video processing device and a storage medium. The method comprises the following steps: acquiring a video to be processed and a target human body region; detecting a plurality of human body regions in the video to be processed; inputting the plurality of human body regions into a trained feature extraction network to obtain a plurality of first features respectively describing the plurality of human body regions, and inputting the target human body region into the trained feature extraction network to obtain a second feature describing the target human body region; comparing the plurality of first features with the second feature respectively to obtain at least one first matching feature matched with the second feature; determining the corresponding time points of the at least one first matching feature in the video to be processed; and processing the video to be processed based on the respective time points to obtain a video portion associated with the target object. The feature extraction network is trained using a data set constructed based on human body region sample sets, and the human body region sample sets are generated respectively for a plurality of video segments divided according to video shooting shots.

Description

Video processing method, device and storage medium
Technical Field
The invention relates to the technical field of deep learning and computer vision, in particular to a video processing method, a video processing device and a storage medium.
Background
With the development of multimedia technology, images, audio and video add much enjoyment to people's lives. When watching video files, people usually select the segments they are interested in. Current video clip editing generally clips videos based on certain specific categories or specific scenes, for example based on specific shots or text cues in sports videos and game videos (e.g., goals and shots in sports videos, kills and similar events in game videos) to determine whether a segment is a highlight. Users may also wish to view only the portions of a video that feature a particular person. In this case, the related art typically identifies the person in the video frames by face recognition in order to complete the clipping task.
Disclosure of Invention
In the technical solution of identifying video segments containing a specific character through face recognition, the video segments cannot be identified, or cannot be identified accurately, in some cases: for example, when the face of the specific character is unclear or incomplete, when the character is shown from the side or from behind, or when the character's motion is large (e.g., fighting), clipping segments of the specific character based on face recognition is less effective. Embodiments of the present invention at least partially address the above-mentioned problems.
According to an aspect of the present invention, a video processing method is provided. The method comprises the following steps: acquiring a video to be processed and a target human body region representing a target object; detecting a plurality of human body regions in the video to be processed; inputting the plurality of human body regions into a trained feature extraction network to obtain a plurality of first features respectively describing the plurality of human body regions, and inputting the target human body region into the trained feature extraction network to obtain a second feature describing the target human body region; comparing the plurality of first features with the second feature respectively to obtain at least one first matching feature among the first features that matches the second feature; determining the corresponding time points of the at least one first matching feature in the video to be processed; and processing the video to be processed based on the respective time points to obtain a video portion associated with the target object. The feature extraction network is trained using a data set constructed based on human body region sample sets, and the human body region sample sets are generated respectively for a plurality of video segments divided according to video shooting shots.
In some embodiments, the data set is constructed by: acquiring a training video for the feature extraction network; dividing the training video into a plurality of training video segments according to the video shooting shots; for each of the plurality of training video segments, creating one or more human body region sample sets of the training video segment; determining whether the one or more human body region sample sets contain human faces; and, in response to determining that each body region in the one or more body region sample sets contains a face, merging the one or more body region sample sets based on features of the face to construct a training data set.
In some embodiments, for each of the plurality of training video segments, creating one or more human body region sample sets of the training video segment comprises: detecting human body regions in a plurality of video frames of each training video segment, each training video segment comprising a plurality of video frames belonging to the same video shooting shot; determining a similarity between the detected two or more human body regions; and adding two or more human body regions whose similarity satisfies a predetermined threshold range to the same set to generate one or more human body region sample sets of the training video segment.
In some embodiments, in response to determining that a human face is included in each of the one or more human body region sample sets, merging the one or more human body region sample sets based on features of the face to construct the training data set comprises: in response to determining that each human body region in the one or more human body region sample sets contains a human face, selecting the same predetermined number of human faces from each human body region sample set; comparing the face similarity of the faces selected from each human body region sample set; and merging the human body region sample sets whose face similarity is higher than a first predetermined threshold to construct the training data set.
In some embodiments, the data set is further constructed by: determining, using pedestrian re-identification (ReID), the human body regions in the same human body region sample set whose human body region similarity is lower than a second predetermined threshold; and removing the human body regions whose human body region similarity is lower than the second predetermined threshold from the human body region sample set.
In some embodiments, determining the similarity between the detected two or more human body regions comprises: determining the similarity between the detected two or more human body regions based on hand-crafted features.
In some embodiments, the plurality of human body regions in the video to be processed are detected by a single shot multibox detector (SSD).
In some embodiments, processing the video to be processed based on the respective time points to obtain the video portion associated with the target object comprises: splicing the video to be processed based on the timestamps of the respective time points to obtain the video portion associated with the target object.
According to another aspect of the invention, a method for constructing a data set for training a feature extraction network is presented. The method comprises the following steps: acquiring a training video for the feature extraction network; dividing the training video into a plurality of training video segments according to the video shooting shots; for each of the plurality of training video segments, creating one or more human body region sample sets of the training video segment; determining whether the one or more human body region sample sets contain human faces; and, in response to determining that each body region in the one or more body region sample sets contains a face, merging the one or more body region sample sets based on features of the face to construct a training data set.
In some embodiments, for each of the plurality of training video segments, creating one or more human body region sample sets of the training video segment comprises: detecting human body regions in a plurality of video frames of each training video segment, each training video segment comprising a plurality of video frames belonging to the same video shooting shot; determining a similarity between the detected two or more human body regions; and adding two or more human body regions whose similarity satisfies a predetermined threshold range to the same set to generate one or more human body region sample sets of the training video segment.
In some embodiments, in response to determining that a human face is included in each of the one or more human body region sample sets, merging the one or more human body region sample sets based on features of the face to construct the training data set comprises: in response to determining that each human body region in the one or more human body region sample sets contains a human face, selecting the same predetermined number of human faces from each human body region sample set; comparing the face similarity of the faces selected from each human body region sample set; and merging the human body region sample sets whose face similarity is higher than a first predetermined threshold to construct the training data set.
In some embodiments, the data set is further constructed by: determining, using pedestrian re-identification (ReID), the human body regions in the same human body region sample set whose human body region similarity is lower than a second predetermined threshold; and removing the human body regions whose human body region similarity is lower than the second predetermined threshold from the human body region sample set.
According to another aspect of the present invention, a training method for a feature extraction network is provided, including: acquiring a training video for a feature extraction network, constructing a training data set based on the acquired training video using the method of constructing a data set as in the preceding aspect, and training the feature extraction network using the data set to extract features describing a human body region.
According to another aspect of the present invention, a video processing apparatus is provided. The apparatus comprises an acquisition module, a human body detection module, a feature extraction module, a comparison module, a time point determination module and a video processing module. The acquisition module is configured to acquire a video to be processed and a target human body region representing a target object. The human body detection module is configured to detect a plurality of human body regions in the video to be processed. The feature extraction module is configured to input the plurality of human body regions into a trained feature extraction network, obtaining a plurality of first features that respectively describe the plurality of human body regions, and to input the target human body region into the trained feature extraction network, obtaining a second feature that describes the target human body region, wherein the feature extraction network is trained using a data set constructed based on human body region sample sets, and the human body region sample sets are generated separately for a plurality of video segments divided according to video shooting shots. The comparison module is configured to compare the plurality of first features with the second feature respectively and obtain at least one first matching feature among the first features that matches the second feature. The time point determination module is configured to determine the corresponding time points of the at least one first matching feature in the video to be processed. The video processing module is configured to process the video to be processed based on the respective time points to obtain a video portion associated with the target object.
According to another aspect of the present invention, a device for constructing a data set for training a feature extraction network is provided. The device comprises an acquisition module, a video segmentation module, a set creation module, a determination module and a set merging module. The acquisition module is configured to acquire a training video for the feature extraction network. The video segmentation module is configured to divide the training video into a plurality of training video segments according to the video shooting shots. The set creation module is configured to create, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment. The determination module is configured to determine whether a human face is contained in the one or more human body region sample sets. The set merging module is configured to, in response to determining that each body region in the one or more body region sample sets contains a face, merge the one or more body region sample sets based on features of the face to construct a training data set.
According to another aspect of the present invention, there is provided a training apparatus for a feature extraction network, including: an acquisition module configured to acquire a training video for the feature extraction network, a data set construction module configured to construct a training data set based on the acquired training video using the method of constructing a data set as above, and a training module configured to train the feature extraction network using the data set to extract features describing a human body region.
According to some embodiments of the invention, there is provided a computer device comprising: a processor; and a memory having instructions stored thereon, the instructions, when executed on the processor, causing the processor to perform any of the above methods.
According to some embodiments of the invention, there is provided a computer readable storage medium having stored thereon instructions which, when executed on a processor, cause the processor to perform any of the above methods.
The video processing method, the video processing device and the storage medium analyze the characters in the video content by utilizing deep learning, and clip the segments of the same character in the video through a trained feature extraction network. The video processing method can automatically segment segments featuring the same character in videos (such as movies, TV shows and variety shows), saving a large amount of labor and time cost, improving editing efficiency, facilitating later video production and enhancing the user experience.
Drawings
Embodiments of the invention will now be described in more detail, by way of non-limiting examples only, with reference to the accompanying drawings, in which like reference numerals refer to like parts throughout, and in which:
FIG. 1 schematically shows a graphical user interface diagram according to an embodiment of the invention;
FIG. 2 illustrates an example application scenario according to one embodiment of the present invention;
FIG. 3 schematically illustrates a network framework diagram for target character video processing according to one embodiment of the present invention;
FIG. 4 schematically illustrates the structure of a single shot multibox detector (SSD);
FIG. 5 schematically shows a flow diagram of a video processing method according to an embodiment of the invention;
FIG. 6 schematically shows a flow diagram of a method of constructing a data set according to another embodiment of the invention;
FIG. 7 schematically shows a schematic view of a video processing apparatus according to an embodiment of the invention;
FIG. 8 schematically shows a schematic view of an apparatus for constructing a data set according to another embodiment of the present invention; and
FIG. 9 schematically shows a schematic diagram of an example computer device for video processing and/or constructing a data set.
Detailed Description
The following description provides specific details for a thorough understanding and enabling description of various embodiments of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these details. In some instances, well-known structures and functions have not been shown or described in detail to avoid unnecessarily obscuring the description of the embodiments of the disclosure. The terminology used in the present disclosure is to be understood in its broadest reasonable manner, even though it is being used in conjunction with a particular embodiment of the present disclosure.
First, some terms related to the embodiments of the present disclosure are explained so that they can be understood by those skilled in the art:
deep Learning (DL): a multi-layer perceptron with a plurality of hidden layers is a deep learning structure. Deep learning forms a more abstract class or feature of high-level representation properties by combining low-level features to discover a distributed feature representation of the data. The motivation for studying deep learning is to build neural networks that simulate the human brain for analytical learning, which mimics the mechanisms of the human brain to interpret data such as images, sounds, text, and the like.
Computer Vision technology (Computer Vision, CV): computer vision is the science of studying how to make machines "see". More specifically, computer vision refers to using cameras and computers instead of human eyes to perform machine vision such as identifying, tracking and measuring targets, and further performing image processing so that the result is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, and simultaneous localization and mapping, and also include common biometric technologies such as face recognition and fingerprint recognition.
Convolutional Neural Networks (CNNs) are a class of feed-forward neural networks that contain convolution computations and have a deep structure, and are one of the representative algorithms of deep learning. A convolutional neural network has feature learning capability and can perform translation-invariant classification of input information according to its hierarchical structure.
A Single Shot MultiBox Detector (SSD) is a method of detecting objects in an image based on a single deep neural network. It discretizes the output space of bounding boxes by placing a series of default bounding boxes with different aspect ratios and different scales at each feature map location. During prediction, the neural network generates a score for each default bounding box for each object category and regresses corrections to the boxes so that they better fit the shape of the object.
Scale-invariant feature transform (SIFT) is a feature descriptor with scale invariance and illumination invariance, together with an accompanying theory of feature extraction. It was first published by D. G. Lowe in 2004 and has been implemented, extended and used in the open-source algorithm library OpenCV. The SIFT feature remains invariant under rotation, scale change, brightness variation, etc., and is a very stable local feature.
Pedestrian Re-identification (ReID) is a technique that uses computer vision to determine whether a particular pedestrian is present in an image or video sequence. It is a sub-problem of image retrieval: given an image of a monitored pedestrian, the same pedestrian is retrieved in images captured by other devices. For example, it can retrieve the same pedestrian under different cameras.
Triplet loss function: a triplet contains three samples, e.g. (anchor, pos, neg), where anchor denotes the target sample, pos a positive sample and neg a negative sample. The triplet loss is an objective function that requires the distance from the target to the negative sample to be greater than the distance from the target to the positive sample plus a predetermined margin.
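For illustration only, the triplet loss described above can be written as the minimal sketch below; the margin value and the function itself are assumptions made for this sketch rather than details taken from the present disclosure.

```python
import numpy as np

def triplet_loss(anchor, pos, neg, margin=0.2):
    """Hinge-form triplet loss: the anchor should be closer to the positive
    sample than to the negative sample by at least `margin` (assumed value)."""
    d_pos = np.linalg.norm(anchor - pos)  # distance from target to positive sample
    d_neg = np.linalg.norm(anchor - neg)  # distance from target to negative sample
    return max(d_pos - d_neg + margin, 0.0)
```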
The invention mainly aims to analyze characters in video content by utilizing deep learning and to perform video segment editing of the same character through a feature extraction network. Since a human body in a video can appear with multiple poses, multiple angles, multiple scales and the like, distinguishing the same human body region across a video segment is a complex task. The method utilizes a convolutional neural network (such as a single shot multibox detector) to detect the human body regions in the video and further extracts the corresponding human body features. The invention utilizes the human body features to locate the same human body in the video and can automatically and effectively segment the portions of the video featuring the same character.
FIG. 1 schematically shows a schematic view of a graphical user interface 100 according to an embodiment of the invention. The graphical user interface 100 may be displayed on various user terminals, such as a laptop, a personal computer, a tablet, a cell phone, a television, and so forth. The video 101 is a video viewed by a user through a user terminal. The video 101 can be automatically clipped into a video clip about a selected target object, such as a target person, in the video 101 by the video processing method provided by the embodiment of the present invention. The selected target person may be one or more. For example, the target persona may be a particular star or a particular character. An icon 102 of the automatically clipped generated character video clip is also displayed on the graphical user interface 100. When viewing the video 101, the user can easily view a video clip of the corresponding person of interest by clicking on the corresponding icon 102.
FIG. 2 illustrates an example application scenario 200 according to one embodiment of this disclosure. The server 201 is connected to a user terminal 203 via a network 202. The user terminal 203 may be, for example, a notebook computer, a personal computer, a tablet computer, a mobile phone, a television, or the like. The network 202 may include wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, etc. An application program for viewing a video is installed on the user terminal 203. When the user views a video through the application installed on the user terminal 203 and wishes to view a video clip of a person of interest, the user may click on an icon of the corresponding person clip presented by the application. In response to the user clicking on the icon of the corresponding character clip, the application presents the clip segment of the corresponding character. It is to be noted that the clipped segment of the corresponding character is obtained at the server 201 or the user terminal 203 (or at both the user terminal 203 and the server 201) by performing the video processing method proposed by the present invention.
Fig. 3 schematically illustrates a network framework 300 for target character video processing according to one embodiment of the invention. First, for the target character to be edited, the human body region 302 of the target character is input to the feature extraction network 310 to extract human body features, thereby obtaining the human body feature F 304 of the target character. The video 301 to be processed is input into the human body detection network 309 to obtain all detected human body regions 303 in the video 301 to be processed. The human body detection network 309 and the feature extraction network 310 are described in further detail below. The individual body regions 303 are then input into the feature extraction network 310, a feature P_i is extracted for each body region, and the time point T_i of the human body in the video (which may be a timestamp, for example) is recorded. The features P_i of all human body regions in the video to be processed are then combined into a feature pool 305. The character feature F 304 of the target character and the features P_i of the individual body regions in the feature pool 305 are input into the matching calculation module 311 for similarity calculation, obtaining all features P_k 306 among the P_i that match the character feature F. Illustratively, feature matching may be achieved by calculating Euclidean distances between features: the distance d between a feature P_i in the feature pool and the human body feature F is calculated, and if d is less than a predetermined threshold, P_i is determined to match the body feature F, i.e. the character corresponding to the human body region of P_i conforms to the target character. The matched features P_k 306 are input to a temporal aggregation module 312, which finds the time points T_k of the matched features P_k 306 and aggregates them in chronological order to give multiple aggregated time points 307. The video clipping module 313 splices the video based on the timestamps of the aggregated time points to form a segment corresponding to the target character, that is, all video frames containing the target character in the video to be processed.
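A minimal sketch of the matching calculation described above is given below, assuming precomputed feature vectors; the distance threshold and the function names are illustrative assumptions rather than values from the present disclosure.

```python
import numpy as np

def match_target(target_feature, frame_features, frame_times, threshold=1.0):
    """Return the time points whose body-region feature matches the target feature F.

    frame_features: list of feature vectors P_i extracted from detected body regions.
    frame_times:    list of time points T_i (e.g. timestamps) for those regions.
    """
    matched_times = []
    for p_i, t_i in zip(frame_features, frame_times):
        d = np.linalg.norm(p_i - target_feature)  # Euclidean distance to F
        if d < threshold:                          # P_i matches the target character
            matched_times.append(t_i)
    return sorted(matched_times)                   # aggregate in temporal order
```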
In the above, in the case that the target role is a single role, how to process the target role by the video processing method provided by the present invention to obtain the segment for the target role is described. It should be understood that in other embodiments, the target persona may be multiple target personas.
Fig. 4 schematically shows the structure of a single shot multibox detector (SSD). The human body detection network used herein employs the single shot multibox detector (SSD) structure. The SSD detection network performs very well in both detection speed and detection precision. Specifically, the human body detection efficiency of the SSD detection network can reach 100 frames per second on a GPU while ensuring a detection rate higher than 85%. The structure of the SSD is based on VGG-16, because VGG-16 can provide high-quality image classification and transfer learning to improve results. Here, the SSD adjusts VGG-16 by replacing the original fully connected layers with a series of auxiliary convolutional layers starting from the Conv6 layer. By using the auxiliary convolutional layers, features can be extracted at multiple scales of the image while the size of each successive convolutional layer is gradually reduced.
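A toy PyTorch-style sketch of such auxiliary convolutional layers is shown below; the layer sizes and channel counts are illustrative assumptions, not the exact SSD configuration of the present embodiment, and torchvision is assumed to be available.

```python
import torch
import torch.nn as nn
import torchvision

class SSDBackboneSketch(nn.Module):
    """VGG-16 features followed by extra convolutional layers that gradually
    shrink the spatial size, so detections can use feature maps at multiple scales."""
    def __init__(self):
        super().__init__()
        self.vgg = torchvision.models.vgg16(weights=None).features  # base network
        # Auxiliary layers standing in for the replaced fully connected layers.
        self.extras = nn.ModuleList([
            nn.Conv2d(512, 1024, kernel_size=3, padding=1),              # "Conv6"
            nn.Conv2d(1024, 1024, kernel_size=1),                        # "Conv7"
            nn.Sequential(nn.Conv2d(1024, 256, 1), nn.ReLU(),
                          nn.Conv2d(256, 512, 3, stride=2, padding=1)),  # "Conv8"
            nn.Sequential(nn.Conv2d(512, 128, 1), nn.ReLU(),
                          nn.Conv2d(128, 256, 3, stride=2, padding=1)),  # "Conv9"
        ])

    def forward(self, x):
        x = self.vgg(x)
        feature_maps = []
        for layer in self.extras:
            x = torch.relu(layer(x))
            feature_maps.append(x)   # one multi-scale source per auxiliary stage
        return feature_maps
```

In a full SSD, small convolutional prediction heads would be attached to each of these feature maps to output the per-box class scores and box offsets mentioned above.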
Fig. 5 schematically shows a flow diagram of a video processing method 500 according to an embodiment of the invention. The method may be executed by a user terminal or a server, or jointly by both. In step 501, a video to be processed and a target human body region representing a target object are acquired. Here, the target human body region may be obtained by inputting an image sample of the target object, or a video sample containing the target object, into a human body detection network (e.g., an SSD). In step 502, a plurality of human body regions in the video to be processed are detected using the human body detection network. In step 503, the plurality of human body regions are input to the trained feature extraction network to obtain a plurality of first features respectively describing the plurality of human body regions, and the target human body region is input to the trained feature extraction network to obtain a second feature describing the target human body region. How this feature extraction network is trained is described in detail below. Note that the feature extraction network is trained using a data set constructed based on human body region sample sets, and the human body region sample sets are respectively generated for a plurality of video segments divided according to video shooting shots. In step 504, the plurality of first features are respectively compared with the second feature, and at least one first matching feature among the first features that matches the second feature is obtained. For example, if the first features are P_i and the second feature is F, each P_i in the feature pool composed of the P_i is compared with F to find the features P_k that match F. Here, feature matching is achieved by calculating Euclidean distances between features: the distance d between a P_i in the feature pool and the feature F is computed, and if d is less than a predetermined threshold, P_i is determined to match F, i.e. the character corresponding to the human body region of P_i conforms to the target character. In step 505, the respective time points corresponding to the at least one first matching feature in the video to be processed are determined; that is, the time points T_k at which the features P_k matching F appear in the video are determined. In step 506, the video to be processed is processed based on the respective time points to obtain the video portion associated with the target object. In one embodiment, the time points T_k are aggregated chronologically, thereby obtaining the set of all time points for the same character. In one embodiment, the chronological aggregation of the finally obtained time points T_k comprises: for any two time points, if the interval between them is less than a certain threshold, they are considered to belong to one continuous segment; otherwise they are considered separate segments. Through this processing, the selected video frames are more coherent and the picture does not jump, and a plurality of video clips are thereby obtained. For the starting and ending time points of each clip, an optical flow method is used to search forward from the starting time point of the clip for the nearest shot switching point, and backward from the ending time point for the nearest scene switching point, so as to ensure the integrity of the extracted clip.
Here, optical flow is the instantaneous velocity of the pixel motion of a spatially moving object on the observation imaging plane. The optical flow method computes the motion information of objects between adjacent frames by finding the correspondence between the previous frame and the current frame, using the change of pixels in an image sequence over time and the correlation between adjacent frames. After this operation is performed on all the segments, the different clipped segments of the same target object (e.g., the same character) in the video are obtained. The video processing method 500 can automatically segment the portions featuring the same character in videos (such as movies, TV shows and variety shows), saving a large amount of labor and time cost, improving editing efficiency, facilitating later video production and enhancing the user experience.
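The aggregation of matched time points into continuous clips described in step 506 above can be sketched roughly as follows; the gap threshold is an assumed parameter.

```python
def group_time_points(time_points, max_gap=2.0):
    """Group sorted time points (in seconds) into (start, end) segments.
    Two consecutive points closer than max_gap belong to the same segment."""
    segments = []
    for t in sorted(time_points):
        if segments and t - segments[-1][1] < max_gap:
            segments[-1][1] = t           # extend the current segment
        else:
            segments.append([t, t])       # start a new segment
    return [(start, end) for start, end in segments]
```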
In the video processing method, the feature extraction network is trained using a data set constructed based on human body region sample sets. The data set used to train the feature extraction network is constructed using the temporal and spatial correlations of the video, together with techniques such as face recognition and pedestrian re-identification (ReID). The data set is constructed by the following steps of the method 600 of constructing a data set shown in FIG. 6.
In step 601, a training video for a feature extraction network is obtained.
In step 602, the training video is divided into a plurality of training video segments according to the video shooting shots. Each of the plurality of training video segments contains a plurality of video frames belonging to the same video shot. For example, whether there is a shot cut in the training video may be determined by an optical flow method: if a shot switch exists, the video is divided at the video frame where the shot switch occurs, so that the complete training video is divided into segments corresponding to different shots.
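A rough OpenCV sketch of shot-cut detection with dense optical flow is given below; the flow-magnitude threshold and the frame-reading loop are illustrative assumptions, and other shot-boundary detectors would serve equally well.

```python
import cv2
import numpy as np

def find_shot_cuts(video_path, flow_threshold=8.0):
    """Return frame indices where the mean optical-flow magnitude spikes,
    used here as a simple indicator of a shot switch."""
    cap = cv2.VideoCapture(video_path)
    cuts, prev_gray, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None:
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            if np.linalg.norm(flow, axis=2).mean() > flow_threshold:
                cuts.append(idx)          # abrupt global change -> likely shot cut
        prev_gray, idx = gray, idx + 1
    cap.release()
    return cuts
```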
In step 603, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment are created. In one embodiment, for each training video segment, the human body regions in the plurality of video frames it contains are detected; the similarity between the detected two or more human body regions is determined; and two or more human body regions whose similarity satisfies a predetermined threshold range are added to the same set to generate one or more human body region sample sets of the training video segment. The detection of the human body regions in the plurality of video frames is performed by the human body detection network SSD. Here, the similarity between human body regions is judged using hand-crafted features, for example scale-invariant SIFT features. In one embodiment, the predetermined threshold range is set to be higher than a first threshold and lower than a second threshold, and two or more body regions satisfying this range are added to the same body region sample set as a set of positive sample pairs. Requiring the similarity to be higher than the first threshold ensures that the human body regions are sufficiently similar, i.e. that the two human body regions belong to the same character; requiring it to be lower than the second threshold removes human body regions that are excessively similar, because two nearly identical frames hardly change, which is not beneficial for training the network model. In another embodiment, the predetermined threshold range is set below a third threshold, and two or more body regions satisfying this range are added to the same body region sample set as a set of negative sample pairs, i.e. such body regions do not belong to the same character.
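A simplified sketch of grouping detected body-region crops by hand-crafted-feature similarity is shown below; the similarity measure (a SIFT match ratio), the thresholds, and the greedy grouping strategy are assumptions for illustration, and cv2.SIFT_create requires OpenCV 4.4 or later.

```python
import cv2

sift = cv2.SIFT_create()               # hand-crafted local features (SIFT)
matcher = cv2.BFMatcher(cv2.NORM_L2)

def region_similarity(region_a, region_b):
    """Similarity of two 8-bit body-region crops as the fraction of good SIFT matches."""
    _, des_a = sift.detectAndCompute(region_a, None)
    _, des_b = sift.detectAndCompute(region_b, None)
    if des_a is None or des_b is None:
        return 0.0
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = [m for m in matches
            if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]  # ratio test
    return len(good) / max(len(des_a), 1)

def build_sample_sets(regions, low=0.3, high=0.9):
    """Add a region to an existing set when its similarity to that set's first member
    lies in (low, high): similar enough to be the same person, but not a near-duplicate
    frame. Otherwise it starts a new set."""
    sample_sets = []
    for region in regions:
        for s in sample_sets:
            if low < region_similarity(s[0], region) < high:
                s.append(region)
                break
        else:
            sample_sets.append([region])
    return sample_sets
```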
In step 604, it is determined whether the human body regions in the one or more human body region sample sets contain human faces. This step is implemented by a face recognition technique. In step 605, in response to determining that each body region in the one or more body region sample sets contains a face, the one or more body region sample sets are merged based on features of the face to construct a training data set. In one embodiment, in response to determining that each body region in the one or more body region sample sets contains a face, the same predetermined number of faces are respectively selected from each body region sample set; the face similarity of the faces selected from each human body region sample set is compared; and the human body region sample sets whose face similarity meets a predetermined threshold are merged. Specifically, the human faces in each human body region sample set are compared using a face recognition technique. For example, in each human body region sample set in which a human face is determined to exist, N human faces are selected, where N is a positive integer. The selected N faces are then cross-compared. If the proportion of the N face matches between two or more body region sample sets exceeds a predetermined threshold (e.g., 50%), the two or more body region sample sets are merged into the same body region sample set; that is, the body regions in the two sample sets essentially belong to the same person. In some cases this situation arises from switching from a first shot to a second shot and then back again.
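The face-based merging test can be sketched as follows; face_match(a, b) is a hypothetical helper standing in for whatever face-recognition comparison is used, and N and the 50% ratio follow the example in the text.

```python
import itertools
import random

def should_merge(faces_a, faces_b, face_match, n=5, ratio=0.5):
    """Cross-compare n faces sampled from each sample set and report whether the
    fraction of matching pairs exceeds the given ratio.
    face_match(a, b) -> bool is a hypothetical face-recognition comparison."""
    sample_a = random.sample(faces_a, min(n, len(faces_a)))
    sample_b = random.sample(faces_b, min(n, len(faces_b)))
    pairs = list(itertools.product(sample_a, sample_b))
    matched = sum(1 for a, b in pairs if face_match(a, b))
    return matched / len(pairs) > ratio
```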
In one embodiment, the method of constructing the data set further comprises: determining, using pedestrian re-identification (ReID), the human body regions in the same human body region sample set whose human body region similarity is lower than a predetermined threshold; and removing those human body regions from the human body region sample set. The ReID here is an open-source pre-trained ReID network; that is, whether dissimilar human body regions exist in a constructed human body region sample set is judged by an open-source method.
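Cleaning a sample set with a pre-trained ReID model could look roughly like the sketch below; reid_embed is a hypothetical wrapper around an open-source ReID network, the centroid comparison is only one possible way to score similarity within a set, and the threshold is an assumed value.

```python
import numpy as np

def remove_outliers(sample_set, reid_embed, sim_threshold=0.5):
    """Drop body regions whose ReID similarity to the set centroid is too low.
    reid_embed(region) -> unit-norm embedding vector (hypothetical helper)."""
    embeddings = np.stack([reid_embed(r) for r in sample_set])
    centroid = embeddings.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    sims = embeddings @ centroid          # cosine similarity to the set centroid
    return [r for r, s in zip(sample_set, sims) if s >= sim_threshold]
```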
In addition, on the basis of the above method for constructing a data set, since the same person may appear with various pose angles and backgrounds in the video, manual screening may be required after the above steps to ensure that the human body regions in each set belong to the same person.
The invention also provides a training method for the feature extraction network, which trains the feature extraction network based on the data set obtained by the method 600. It should be noted that during training, perturbations including random cropping, blurring, rotation, etc. are applied to these samples, thereby improving the robustness of the feature extraction network.
The feature extraction network of the invention is further improved and optimized on the basis of existing deep network structures as follows, so as to improve its effect for this task. First, the shallow layers of the network adopt larger convolution kernels and strides, which increases the receptive field and speeds up the network. As the network deepens, the feature dimensionality keeps increasing; to improve computational efficiency, the convolution kernel size is gradually reduced, finally down to 3×3. In addition, the feature extraction network adopts a triplet loss function as the final loss function. This loss function reduces the distance between positive sample pairs and increases the distance between negative sample pairs, which is very effective for the subsequent judgment of human body similarity. Here, a positive sample pair refers to a pair of human body regions determined, by the similarity between the regions, to belong to the same person; a negative sample pair refers to a pair of human body regions determined to belong to different persons. In addition, the final feature is a superposition of deep and shallow features. The shallow features of a deep network represent structural information of the image, while the deep features are richer in semantic information. The method uses an attention model to combine the deep and shallow information of the network, which improves accuracy considerably compared with using shallow or deep features alone.
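A toy PyTorch sketch of fusing shallow and deep features with an attention weight and training with a triplet loss is given below; the layer sizes, the form of the attention, the margin and the input resolution are all illustrative assumptions rather than the exact architecture of this embodiment.

```python
import torch
import torch.nn as nn

class FeatureExtractorSketch(nn.Module):
    """Shallow layers with larger kernels/strides, deeper 3x3 layers, and an
    attention-weighted fusion of shallow (structural) and deep (semantic) features."""
    def __init__(self, dim=256):
        super().__init__()
        self.shallow = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2), nn.ReLU())
        self.deep = nn.Sequential(
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj_shallow = nn.Linear(128, dim)
        self.proj_deep = nn.Linear(256, dim)
        self.attn = nn.Linear(2 * dim, 2)     # weights for shallow vs. deep features

    def forward(self, x):
        s = self.shallow(x)
        d = self.deep(s)
        s_vec = self.proj_shallow(self.pool(s).flatten(1))
        d_vec = self.proj_deep(self.pool(d).flatten(1))
        w = torch.softmax(self.attn(torch.cat([s_vec, d_vec], dim=1)), dim=1)
        return w[:, :1] * s_vec + w[:, 1:] * d_vec   # fused body-region feature

model = FeatureExtractorSketch()
criterion = nn.TripletMarginLoss(margin=0.3)    # positives pulled closer than negatives
anchor, pos, neg = (torch.randn(4, 3, 128, 64) for _ in range(3))  # augmented crops
loss = criterion(model(anchor), model(pos), model(neg))
loss.backward()
```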
Fig. 7 schematically shows a schematic diagram of a video processing apparatus 700 according to an embodiment of the present invention. The video processing apparatus 700 comprises an acquisition module 701, a human body detection module 702, a feature extraction module 703, a comparison module 704, a time point determination module 705 and a video processing module 706. The acquisition module 701 is configured to acquire a video to be processed and a target human body region representing a target object. The human body detection module 702 is configured to detect a plurality of human body regions in the video to be processed. The feature extraction module 703 is configured to input the plurality of human body regions into a trained feature extraction network, obtaining a plurality of first features that respectively describe the plurality of human body regions, and to input the target human body region into the trained feature extraction network, obtaining a second feature that describes the target human body region, the feature extraction network being trained using a data set constructed based on human body region sample sets, and the human body region sample sets being generated separately for a plurality of video segments divided according to video shooting shots. The comparison module 704 is configured to compare the plurality of first features with the second feature respectively and obtain at least one first matching feature among the first features that matches the second feature. The time point determination module 705 is configured to determine the corresponding time points of the at least one first matching feature in the video to be processed. The video processing module 706 is configured to process the video to be processed based on the respective time points to obtain a video portion associated with the target object. The video processing apparatus 700 can automatically segment the portions featuring the same character in a video (e.g., a movie, a TV show or a variety show), save a lot of manpower and time costs, improve editing efficiency, facilitate later video production, and enhance user experience.
Fig. 8 schematically shows a schematic diagram of an apparatus 800 for constructing a data set for training a feature extraction network according to another embodiment of the present invention. The data set building apparatus 800 includes: an acquisition module 801, a video segmentation module 802, a set creation module 803, a determination module 804, a set merging module 805, and a data set construction module 806. The acquisition module 801 is configured to acquire a training video for the feature extraction network. The video segmentation module 802 is configured to divide the training video into a plurality of training video segments according to the video shooting shots, each of the plurality of training video segments containing a plurality of video frames belonging to the same video shot. The set creation module 803 is configured to create, for each training video segment, one or more human body region sample sets of the training video segment. The determination module 804 is configured to determine whether a human face is included in the one or more human body region sample sets. The set merging module 805 is configured to merge the one or more body region sample sets based on features of the human face in response to determining that each body region in the one or more body region sample sets contains a human face.
Fig. 9 schematically shows a schematic diagram illustrating an example computer device 900 for video processing and/or constructing a data set. The computer device 900 may be a variety of different types of devices, such as a server computer (e.g., server 201 shown in fig. 2), a device associated with an application (e.g., user terminal 203 shown in fig. 2), a system on a chip, and/or any other suitable computer device or computing system.
The computer device 900 may include at least one processor 902, memory 904, communication interface(s) 906, display device 908, other input/output (I/O) devices 910, and one or more mass storage devices 912, which may be capable of communicating with each other, such as through a system bus 914 or other appropriate connection.
The processor 902 may be a single processing unit or multiple processing units, all of which may include single or multiple computing units or multiple cores. The processor 902 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuitry, and/or any devices that manipulate signals based on operational instructions. Among other capabilities, the processor 902 can be configured to retrieve and execute computer-readable instructions, such as program code for an operating system 916, program code for an application 918, program code for other programs 920, etc., stored in the memory 904, mass storage 912, or other computer-readable medium to implement a method for video processing and/or constructing a data set as provided by one embodiment of the present invention.
Memory 904 and mass storage device 912 are examples of computer storage media for storing instructions that are executed by processor 902 to perform the various functions described above. By way of example, the memory 904 may generally include both volatile and nonvolatile memory (e.g., RAM, ROM, and the like). In addition, the mass storage device 912 may generally include a hard disk drive, solid state drive, removable media including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network attached storage, storage area networks, and the like. Memory 904 and mass storage device 912 may both be referred to herein collectively as memory or computer storage media, and may be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code that may be executed by processor 902 as a particular machine configured to implement the operations and functions described in the examples herein.
A number of program modules may be stored on the mass storage device 912. These programs include an operating system 916, one or more application programs 918, other programs 920, and program data 922, which can be loaded into memory 904 for execution. Examples of such applications or program modules may include, for instance, computer program logic (e.g., computer program code or instructions) for implementing the following components/functions: an acquisition module 701, a human detection module 702, a feature extraction module 703, a comparison module 704, a point in time determination module 705 and a video processing module 706 as well as an acquisition module 801, a video segmentation module 802, a set creation module 803, a determination module 804, a set merging module 805 and a data set construction module 806 and/or further embodiments described herein.
Although illustrated in fig. 9 as being stored in memory 904 of computer device 900, modules 916, 918, 920, and 922, or portions thereof, may be implemented using any form of computer-readable media that is accessible by computer device 900. As used herein, "computer-readable media" includes at least two types of computer-readable media, namely computer storage media and communication media.
Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information for access by a computer device.
Computer device 900 may also include one or more communication interfaces 906 for exchanging data with other devices, such as over a network, a direct connection, and so forth, as previously discussed. One or more communication interfaces 906 can facilitate communication within a variety of networks and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, and so forth. Communication interface 906 may also provide for communication with external storage devices (not shown), such as in storage arrays, network attached storage, storage area networks, and the like.
In some examples, a display device 908, such as a monitor, may be included for displaying information and images. Other I/O devices 910 may be devices that take various inputs from a user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input/output devices, and so forth.
Variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed subject matter, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" as used herein does not exclude a plurality. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (15)

1. A method of video processing, the method comprising:
acquiring a video to be processed and a target human body area representing a target object;
detecting a plurality of human body regions in the video to be processed;
inputting the plurality of human body regions into a trained feature extraction network to obtain a plurality of first features respectively describing the plurality of human body regions, and inputting the target human body region into the trained feature extraction network to obtain a second feature describing the target human body region;
comparing the plurality of first features with the second features respectively to obtain at least one first matching feature in the first features matched with the second features;
determining corresponding time points of the at least one first matching feature in the video to be processed;
processing the video to be processed based on the various time points to obtain a video part associated with the target object;
the feature extraction network is trained by using a data set constructed based on a human body region sample set, and the human body region sample set is generated respectively for a plurality of video segments divided according to a video shooting shot.
2. The video processing method of claim 1, wherein the data set is constructed by:
acquiring a training video for the feature extraction network;
dividing the training video into a plurality of training video segments according to the video shooting shots;
for each of the plurality of training video segments, creating one or more human region sample sets of the training video segment;
determining whether one or more of the human body region sample sets contain human faces;
in response to determining that each body region in the one or more body region sample sets contains a face, the one or more body region sample sets are merged to construct a training data set based on features of the face.
3. The video processing method of claim 2, wherein the creating, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment comprises:
for each training video segment of the plurality of training video segments, each training video segment comprising a plurality of video frames belonging to the same video shot, detecting human body regions in the plurality of video frames;
determining a similarity between two or more of the detected human body regions;
adding two or more human body regions whose similarity satisfies a predetermined threshold range to a same set, to generate the one or more human body region sample sets of the training video segment.
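
A sketch of the grouping step of claim 3, assuming region_similarity() is any pairwise similarity function (for example the hand-crafted one of claim 6) and the [0.6, 1.0] threshold range is an assumed value:

    def build_sample_sets(regions, region_similarity, low=0.6, high=1.0):
        """Group human body regions detected within one shot into sample sets.

        Two regions are placed in the same set when their similarity falls
        inside the [low, high] threshold range.
        """
        sample_sets = []
        for region in regions:
            for sample_set in sample_sets:
                if all(low <= region_similarity(region, r) <= high for r in sample_set):
                    sample_set.append(region)
                    break
            else:
                sample_sets.append([region])   # start a new sample set
        return sample_sets
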
4. The video processing method of claim 2 or 3, wherein the merging the one or more human body region sample sets based on features of the human faces to construct the training data set, in response to determining that each human body region in the one or more human body region sample sets contains a human face, comprises:
in response to determining that each human body region in the one or more human body region sample sets contains a human face, respectively selecting a same preset number of human faces from each human body region sample set;
comparing face similarities of the human faces selected from the respective human body region sample sets;
merging the human body region sample sets whose face similarity is higher than a first preset threshold value to construct the training data set.
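
A sketch of the face-based merging of claim 4, assuming the hypothetical helpers face_of() (crops the face out of a human body region) and face_feature() (a face embedding model); the preset number of faces (3) and the first preset threshold (0.7) are illustrative values:

    import random
    import numpy as np

    def _cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def merge_sets_by_face(sample_sets, face_of, face_feature,
                           num_faces=3, first_threshold=0.7):
        """Merge human body region sample sets whose sampled faces look alike."""
        # Select the same preset number of faces from each sample set.
        picked = [[face_feature(face_of(r))
                   for r in random.sample(s, min(num_faces, len(s)))]
                  for s in sample_sets]

        def set_similarity(feats_a, feats_b):
            sims = [_cos(x, y) for x in feats_a for y in feats_b]
            return sum(sims) / len(sims)

        merged = []
        for feats, regions in zip(picked, sample_sets):
            for group in merged:
                if set_similarity(feats, group["feats"]) > first_threshold:
                    group["regions"].extend(regions)   # same person: merge the sets
                    group["feats"].extend(feats)
                    break
            else:
                merged.append({"feats": list(feats), "regions": list(regions)})
        return [group["regions"] for group in merged]
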
5. The video processing method of claim 2, wherein the data set is further constructed by:
determining, by using pedestrian re-identification (ReID), human body regions in a same human body region sample set whose human body region similarity is lower than a second preset threshold value;
removing the human body regions whose human body region similarity is lower than the second preset threshold value from the human body region sample set.
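
A sketch of the ReID-based cleaning of claim 5, assuming reid_feature() is a hypothetical pedestrian re-identification embedding; comparing every region to its set centroid is one possible reading of "human body region similarity", and 0.5 is an assumed second preset threshold:

    import numpy as np

    def clean_sample_set(sample_set, reid_feature, second_threshold=0.5):
        """Drop regions whose ReID similarity to the rest of their set is too low."""
        feats = [reid_feature(r) for r in sample_set]
        centroid = np.mean(feats, axis=0)
        kept = []
        for region, feat in zip(sample_set, feats):
            sim = float(np.dot(feat, centroid) /
                        (np.linalg.norm(feat) * np.linalg.norm(centroid) + 1e-8))
            if sim >= second_threshold:   # keep only regions consistent with the set
                kept.append(region)
        return kept
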
6. The video processing method of claim 3, wherein the determining a similarity between two or more of the detected human body regions comprises: determining the similarity between two or more of the detected human body regions based on hand-crafted features.
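
Claim 6 leaves the hand-crafted features unspecified; an HSV colour-histogram comparison is one common choice, and an illustrative version could look like this:

    import cv2

    def handcrafted_similarity(region_a, region_b):
        """Similarity of two human body region crops from HSV colour histograms."""
        def hsv_hist(img):
            hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
            cv2.normalize(hist, hist)
            return hist
        # Histogram correlation lies in [-1, 1]; higher means more similar.
        return cv2.compareHist(hsv_hist(region_a), hsv_hist(region_b), cv2.HISTCMP_CORREL)
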
7. A method for constructing a data set for training a feature extraction network, the method comprising:
acquiring a training video for the feature extraction network;
dividing the training video into a plurality of training video segments according to video shots;
for each of the plurality of training video segments, creating one or more human body region sample sets of the training video segment;
determining whether the human body regions in the one or more human body region sample sets contain human faces;
in response to determining that each human body region in the one or more human body region sample sets contains a human face, merging the one or more human body region sample sets based on features of the human faces to construct a training data set.
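
The face-presence check recited in claims 2 and 7 could, for instance, use OpenCV's bundled Haar-cascade frontal-face detector; this is an illustrative sketch, not the detector prescribed by the disclosure:

    import cv2

    _face_detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def contains_face(region_bgr):
        """True if a human face is detected inside a human body region crop."""
        gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
        faces = _face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        return len(faces) > 0

    def all_sets_contain_faces(sample_sets):
        """Check the claim condition that every region in every sample set contains a face."""
        return all(contains_face(r) for s in sample_sets for r in s)
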
8. The method for constructing a data set according to claim 7, wherein the creating, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment comprises:
for each training video segment of the plurality of training video segments, each training video segment comprising a plurality of video frames belonging to the same video shot, detecting human body regions in the plurality of video frames;
determining a similarity between two or more of the detected human body regions;
adding two or more human body regions whose similarity satisfies a predetermined threshold range to a same set, to generate the one or more human body region sample sets of the training video segment.
9. The method for constructing a data set according to claim 7 or 8, wherein the merging the one or more human body region sample sets based on features of the human faces to construct a training data set, in response to determining that each human body region in the one or more human body region sample sets contains a human face, comprises:
in response to determining that each human body region in the one or more human body region sample sets contains a human face, respectively selecting a same preset number of human faces from each human body region sample set;
comparing face similarities of the human faces selected from the respective human body region sample sets;
merging the human body region sample sets whose face similarity is higher than a first preset threshold value to construct the training data set.
10. A training method of a feature extraction network, the method comprising:
acquiring a training video for the feature extraction network;
constructing a training data set, based on the acquired training video, using the method of constructing a data set according to any one of claims 7-9;
training the feature extraction network using the constructed training data set to extract features describing a human body region.
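
The claims do not prescribe a training objective; the sketch below assumes the merged sample sets are converted into (anchor, positive, negative) triplets, with anchor and positive drawn from the same sample set, and that a ResNet-18 backbone is trained with a triplet margin loss in PyTorch:

    import torch
    import torch.nn as nn
    from torchvision import models

    def build_feature_extractor(feat_dim=256):
        """Randomly initialised backbone whose output embedding describes a human body region."""
        net = models.resnet18()
        net.fc = nn.Linear(net.fc.in_features, feat_dim)
        return net

    def train_feature_extractor(net, triplet_loader, epochs=10, lr=1e-3):
        """triplet_loader is assumed to yield (anchor, positive, negative) image batches."""
        criterion = nn.TripletMarginLoss(margin=0.3)
        optimizer = torch.optim.Adam(net.parameters(), lr=lr)
        net.train()
        for _ in range(epochs):
            for anchor, positive, negative in triplet_loader:
                optimizer.zero_grad()
                loss = criterion(net(anchor), net(positive), net(negative))
                loss.backward()
                optimizer.step()
        return net
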
11. A video processing device, the device comprising:
an acquisition module configured to acquire a video to be processed and a target human body region representing a target object;
a human body detection module configured to detect a plurality of human body regions in the video to be processed;
a feature extraction module configured to input the plurality of human body regions into a trained feature extraction network to obtain a plurality of first features respectively describing the plurality of human body regions, and to input the target human body region into the trained feature extraction network to obtain a second feature describing the target human body region, wherein the feature extraction network is trained using a data set constructed based on a human body region sample set, and the human body region sample set is respectively generated for a plurality of video segments divided according to video shots;
a comparison module configured to compare the plurality of first features with the second features respectively to obtain at least one first matching feature of the first features matching with the second features;
a time point determination module configured to determine respective time points of the at least one first matching feature in the video to be processed;
a video processing module configured to process the video to be processed based on the respective time points to obtain a video portion associated with the target object.
12. An apparatus for constructing a data set for training a feature extraction network, the apparatus comprising:
an acquisition module configured to acquire a training video for the feature extraction network;
a video segmentation module configured to divide the training video into a plurality of training video segments according to video shots;
a set creation module configured to create, for each of the plurality of training video segments, one or more human body region sample sets of the training video segment;
a determination module configured to determine whether the human body regions in the one or more human body region sample sets contain human faces;
a set merging module configured to, in response to determining that each human body region in the one or more human body region sample sets contains a human face, merge the one or more human body region sample sets based on features of the human faces to construct a training data set.
13. A training apparatus for a feature extraction network, comprising:
an acquisition module configured to acquire a training video for the feature extraction network;
a data set construction module configured to construct a training data set, based on the acquired training video, using the method of constructing a data set according to any one of claims 7-9;
a training module configured to train the feature extraction network using the constructed data set to extract features describing a human body region.
14. A computer device, comprising a memory and a processor, wherein the memory stores a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of any one of claims 1-10.
15. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1-10.
CN202010157708.3A 2020-03-09 2020-03-09 Video processing method, device and storage medium Active CN111209897B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010157708.3A CN111209897B (en) 2020-03-09 2020-03-09 Video processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN111209897A true CN111209897A (en) 2020-05-29
CN111209897B CN111209897B (en) 2023-06-20

Family

ID=70788826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010157708.3A Active CN111209897B (en) 2020-03-09 2020-03-09 Video processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN111209897B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140129942A1 (en) * 2011-05-03 2014-05-08 Yogesh Chunilal Rathod System and method for dynamically providing visual action or activity news feed
CN108509457A (en) * 2017-02-28 2018-09-07 阿里巴巴集团控股有限公司 A kind of recommendation method and apparatus of video data
CN110366050A (en) * 2018-04-10 2019-10-22 北京搜狗科技发展有限公司 Processing method, device, electronic equipment and the storage medium of video data
CN109063667A (en) * 2018-08-14 2018-12-21 视云融聚(广州)科技有限公司 A kind of video identification method optimizing and method for pushing based on scene
CN109284729A (en) * 2018-10-08 2019-01-29 北京影谱科技股份有限公司 Method, apparatus and medium based on video acquisition human face recognition model training data
CN109922373A (en) * 2019-03-14 2019-06-21 上海极链网络科技有限公司 Method for processing video frequency, device and storage medium
CN110119711A (en) * 2019-05-14 2019-08-13 北京奇艺世纪科技有限公司 A kind of method, apparatus and electronic equipment obtaining video data personage segment
CN110087144A (en) * 2019-05-15 2019-08-02 深圳市商汤科技有限公司 Video file processing method, device, electronic equipment and computer storage medium
CN110516572A (en) * 2019-08-16 2019-11-29 咪咕文化科技有限公司 Method for identifying sports event video clip, electronic equipment and storage medium
CN110505498A (en) * 2019-09-03 2019-11-26 腾讯科技(深圳)有限公司 Processing, playback method, device and the computer-readable medium of video

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JONGHUN PARK ET AL: "Online Video Recommendation through Tag-Cloud Aggregation", IEEE COMPUTER SOCIETY *
WANG FENGLING ET AL: "A Video Segmentation and Key Frame Extraction Method for Compressed Video Streams", INTELLIGENT COMPUTER AND APPLICATIONS *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861981A (en) * 2021-02-22 2021-05-28 每日互动股份有限公司 Data set labeling method, electronic device and medium
CN113158867A (en) * 2021-04-15 2021-07-23 微马科技有限公司 Method and device for determining human face features and computer-readable storage medium
CN113190713A (en) * 2021-05-06 2021-07-30 百度在线网络技术(北京)有限公司 Video searching method and device, electronic equipment and medium
CN113283381A (en) * 2021-06-15 2021-08-20 南京工业大学 Human body action detection method suitable for mobile robot platform
CN113283381B (en) * 2021-06-15 2024-04-05 南京工业大学 Human body action detection method suitable for mobile robot platform
CN114189754A (en) * 2021-12-08 2022-03-15 湖南快乐阳光互动娱乐传媒有限公司 Video plot segmentation method and system
CN114363720A (en) * 2021-12-08 2022-04-15 广州海昇计算机科技有限公司 Video slicing method, system, equipment and medium based on computer vision
CN114363720B (en) * 2021-12-08 2024-03-12 广州海昇计算机科技有限公司 Video slicing method, system, equipment and medium based on computer vision
CN114189754B (en) * 2021-12-08 2024-06-28 湖南快乐阳光互动娱乐传媒有限公司 Video scenario segmentation method and system
CN114286198A (en) * 2021-12-30 2022-04-05 北京爱奇艺科技有限公司 Video association method and device, electronic equipment and storage medium
CN114286198B (en) * 2021-12-30 2023-11-10 北京爱奇艺科技有限公司 Video association method, device, electronic equipment and storage medium
CN114973612A (en) * 2022-03-28 2022-08-30 深圳市揽讯科技有限公司 Automatic alarm monitoring system and method for faults of LED display screen

Also Published As

Publication number Publication date
CN111209897B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
Ding et al. A long video caption generation algorithm for big video data retrieval
Zhou et al. Violence detection in surveillance video using low-level features
CN111209897B (en) Video processing method, device and storage medium
Jiao et al. Three-dimensional attention-based deep ranking model for video highlight detection
Wang et al. A robust and efficient video representation for action recognition
US9176987B1 (en) Automatic face annotation method and system
US10963504B2 (en) Zero-shot event detection using semantic embedding
Dhall et al. Automatic group happiness intensity analysis
Guan et al. Keypoint-based keyframe selection
Xu et al. Video event recognition using kernel methods with multilevel temporal alignment
Hou et al. Content-attention representation by factorized action-scene network for action recognition
Mahmood et al. Automatic player detection and identification for sports entertainment applications
CN112597341A (en) Video retrieval method and video retrieval mapping relation generation method and device
CN113766330A (en) Method and device for generating recommendation information based on video
Li et al. Videography-based unconstrained video analysis
Guo et al. Spatial and temporal scoring for egocentric video summarization
Ullah et al. Event-oriented 3D convolutional features selection and hash codes generation using PCA for video retrieval
Zhang et al. Detecting and removing visual distractors for video aesthetic enhancement
Fei et al. Creating memorable video summaries that satisfy the user’s intention for taking the videos
Tseng et al. Person retrieval in video surveillance using deep learning–based instance segmentation
Wang et al. Ranking video salient object detection
Nemade et al. Image segmentation using convolutional neural network for image annotation
Sowmyayani et al. Content based video retrieval system using two stream convolutional neural network
Cao et al. Mining spatiotemporal video patterns towards robust action retrieval
Zhou et al. Modeling perspective effects in photographic composition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20221128

Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518133

Applicant after: Shenzhen Yayue Technology Co.,Ltd.

Address before: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors

Applicant before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

TA01 Transfer of patent application right
GR01 Patent grant
GR01 Patent grant