CN116962817B - Video processing method, device, electronic equipment and storage medium - Google Patents
- Publication number
- CN116962817B (application CN202311218175.5A)
- Authority
- CN
- China
- Prior art keywords
- implanted
- feature vector
- target
- video frame
- target video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/18—Extraction of features or characteristics of the image
- G06V30/1801—Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44016—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/812—Monomedia components thereof involving advertisement data
Abstract
The application provides a video processing method, a video processing device, electronic equipment and a storage medium, wherein the method comprises the following steps: identifying a first object existing in a target video frame, and extracting a first feature vector from the first object, wherein the first object is the most salient object in the target video frame; reading a plurality of second objects from a database, and extracting a corresponding second feature vector from each second object, wherein the second objects are candidate multimedia files to be implanted into the target video frame; selecting a second object associated with the first object from the plurality of second objects as an object to be implanted based on the first feature vector and the second feature vectors; and detecting an implantation position in the target video frame, and implanting the object to be implanted into the implantation position in the target video frame. The application thereby solves the technical problem that video-embedded advertisements lack flexibility.
Description
Technical Field
The present application relates to the field of video processing technologies, and in particular, to a video processing method, a device, an electronic apparatus, and a storage medium.
Background
Existing video advertisement implantation techniques typically rely on heavy post-production tracking and compositing to achieve high-quality implantation results. In these techniques, the video is first analyzed and marked with post-processing tools, such as specialized video editing software, to determine the locations and points in time suitable for advertisement placement. Post tracking is then performed on the basis of these markers: tracking algorithms locate key objects or scene elements in the video so that the advertising content can be implanted accurately at the corresponding locations. After tracking is completed, a compositing operation fuses the advertising content with the video content, ensuring that the embedded advertisement appears visually natural and seamless.
However, in the prior art, once post-production of the embedded advertisement is completed and the encoded video has been output and published, the embedded information can no longer be altered. If the advertisement content needs to be replaced, the implantation position adjusted, or a new advertising strategy adopted, post-production and compositing must be carried out again, which consumes time, labor and resources. This limits the flexibility and adaptability of the advertisement: changes cannot be made in time after the advertisement is implanted, which affects its actual effect and efficiency.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the invention provide a video processing method, a device, electronic equipment and a storage medium, to at least solve the technical problem that video-embedded advertisements lack flexibility.
According to an aspect of an embodiment of the present invention, there is provided a video processing method including: identifying a first object existing in a target video frame, and extracting a first feature vector from the first object, wherein the first object is the most salient object in the target video frame; reading a plurality of second objects from a database, and extracting a corresponding second feature vector from each second object, wherein the second objects are candidate multimedia files to be implanted into the target video frame; selecting an object to be implanted associated with the first object from the plurality of second objects based on the first feature vector and the second feature vectors; and detecting an implantation position in the target video frame, and implanting the object to be implanted into the implantation position in the target video frame.
According to another aspect of the embodiments of the present invention, there is also provided a video processing apparatus including: a feature extraction module configured to identify a first object existing in a target video frame and extract a first feature vector from the first object, wherein the first object is the most salient object in the target video frame, and to read a plurality of second objects from a database and extract a corresponding second feature vector from each second object, wherein the second objects are candidate multimedia files to be implanted into the target video frame; a selection module configured to select an object to be implanted associated with the first object from the plurality of second objects based on the first feature vector and the second feature vectors; and an implantation module configured to detect an implantation position in the target video frame and implant the object to be implanted into the implantation position in the target video frame.
In the embodiments of the application, a first object existing in a target video frame is identified, and a first feature vector is extracted from the first object; a plurality of second objects are read from the database, and a corresponding second feature vector is extracted from each second object; an object to be implanted associated with the first object is selected from the plurality of second objects based on the first feature vector and the second feature vectors; and an implantation position is detected in the target video frame, and the object to be implanted is implanted into the implantation position in the target video frame. This scheme solves the technical problem that video-embedded advertisements lack flexibility.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a flow chart of a video processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of another video processing method according to an embodiment of the application;
FIG. 3 is a flow chart of a method of detecting an implant location according to an embodiment of the present application;
FIG. 4 is a flow chart of a method of determining an object to be implanted according to an embodiment of the application;
FIG. 5 is a flow chart of yet another video processing method according to an embodiment of the present application;
FIG. 6 is a schematic structural view of a video processing apparatus according to an embodiment of the present application;
FIG. 7 shows a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
The relative arrangement of the components and steps, numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present application unless it is specifically stated otherwise. Meanwhile, it should be understood that the sizes of the respective parts shown in the drawings are not drawn in actual scale for convenience of description. Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail. In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values. It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further discussion thereof is necessary in subsequent figures.
Example 1
The embodiment of the application provides a video processing method, as shown in fig. 1, comprising the following steps:
step S102, identifying a first object existing in a target video frame, and extracting a first feature vector from the first object, where the first object is the most significant object in the target video frame.
First, a first object is identified. Computer vision techniques and object detection algorithms may generally be used to identify the first object. These algorithms can detect and locate different objects in the target frame for further analysis and extraction of features.
Then, a first feature vector is extracted. For example, text information corresponding to the first object is determined based on the first object, and each word in the text information is mapped into a multi-dimensional vector space, wherein each dimension of the multi-dimensional vector space represents one semantic feature; a first word feature vector of each word is acquired from the multi-dimensional vector space and weighted by term frequency–inverse document frequency (TF-IDF); and the weighted first word feature vectors are taken as the first feature vector.
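A minimal sketch of this weighting step is given below. It assumes scikit-learn for the TF-IDF statistics and stands in a small toy embedding table for a real pretrained word-vector model; the corpus, vocabulary and vector dimensionality are illustrative only, not details fixed by the method.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy word embeddings standing in for a real multi-dimensional semantic space
# (in practice these would come from a pretrained model such as word2vec/GloVe).
EMBEDDINGS = {
    "red":    np.array([0.9, 0.1, 0.0]),
    "sports": np.array([0.2, 0.8, 0.3]),
    "car":    np.array([0.1, 0.3, 0.9]),
}

def first_feature_vector(object_text, corpus):
    """TF-IDF-weighted average of word vectors for the recognized object's text."""
    vectorizer = TfidfVectorizer()
    vectorizer.fit(corpus)                      # IDF statistics from a reference corpus
    tfidf = vectorizer.transform([object_text])
    vocab = vectorizer.vocabulary_

    weighted, total_weight = np.zeros(3), 0.0
    for word in object_text.lower().split():
        if word in vocab and word in EMBEDDINGS:
            w = tfidf[0, vocab[word]]           # TF-IDF weight of this word
            weighted += w * EMBEDDINGS[word]
            total_weight += w
    return weighted / total_weight if total_weight > 0 else weighted

corpus = ["red sports car", "blue car", "red dress", "sports news"]
print(first_feature_vector("red sports car", corpus))
```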
In the embodiment, by mapping each word in the text information into the multidimensional vector space, not only the frequency of the words is captured, but also the semantic relation among the words is captured. Such processing can more fully express textual information, helping to identify semantic associations between a first object and other objects. The importance of the vocabulary can be better reflected through the weighted feature vectors, and noise is reduced, so that the matching accuracy is improved.
Step S104, a plurality of second objects are read from the database, and corresponding second feature vectors are extracted from each second object, wherein the second objects are candidate multimedia files to be implanted in the target video frame.
Acquiring a tag of each second object, and acquiring text description information based on the tag, wherein the tag comprises at least one of the following: title, genre, and keywords; converting the text description information into a second word feature vector, carrying out scale normalization processing on the second word feature vector, and taking the normalized second word feature vector as the second feature vector.
The embodiment can quantize the text information into the numeric feature vector by converting the text description information into the second word feature vector and performing scale normalization processing, which is helpful for comparing and matching the text information with the feature vector of the first object. In addition, the scale normalization processing ensures that the magnitude orders of different text information can not influence the matching result, and improves the matching stability. Finally, the implementation can also capture specific semantic features of the second object, thereby improving the accuracy and reliability of matching and helping to better identify and implant the second object into the target video frame.
Step S106, selecting an object to be implanted associated with the first object from the plurality of second objects based on the first feature vector and the second feature vector.
First, a semantic association is determined. Based on the first feature vector and the second feature vectors, a natural language processing method is used to analyze the semantic relatedness between the first object and each second object. Specifically, text embedding is performed on the first feature vector and the second feature vectors to map them into a shared semantic space; a semantic distance between the first object and each second object in the shared semantic space is calculated based on the Euclidean distance; and the semantic association between the first object and each second object is determined based on the semantic distance. Then, an object to be implanted is selected based on the semantic association. For example, the second objects are ranked by semantic association degree, and the second object with the highest semantic association degree is selected as the object to be implanted.
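The following sketch illustrates the ranking step, assuming both feature vectors have already been mapped into the same shared semantic space; the candidate names and numbers are illustrative.

```python
import numpy as np

def select_object_to_implant(first_vec, second_vecs):
    """Rank candidate second objects by Euclidean distance to the first object's
    vector in the shared semantic space and return the closest one."""
    distances = {name: float(np.linalg.norm(first_vec - vec))
                 for name, vec in second_vecs.items()}
    # Smaller distance = higher semantic association
    ranked = sorted(distances.items(), key=lambda kv: kv[1])
    return ranked[0][0]

first = np.array([0.40, 0.55, 0.30])
candidates = {
    "ad_sports_drink": np.array([0.42, 0.60, 0.25]),
    "ad_luxury_watch": np.array([0.90, 0.10, 0.05]),
}
print(select_object_to_implant(first, candidates))   # -> "ad_sports_drink"
```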
The present embodiment can more precisely measure the correlation between the first object and the second object by analyzing the semantic correlation between them using a natural language processing method. Text embedding maps feature vectors of the first object and the second object to a shared semantic space, so that their semantic representations can be compared in a unified semantic space, avoiding differences between different feature vectors. By calculating the semantic distance by using the euclidean distance, the similarity or the difference between the first object and the second object can be quantified, so that the measurement of the semantic association degree is more accurate. In addition, the embodiment can better match advertisement content with target video frames by quantifying semantic association, and improve the accuracy and naturalness of advertisement implantation, thereby enhancing the attraction and effect of advertisements. This helps to improve the flexibility and efficiency of the advertisement implantation technique, providing a better user experience.
Step S108, detecting an implantation position in the target video frame, and implanting the object to be implanted into the implantation position in the target video frame.
First, the implantation location is detected.
In some embodiments, the target video frame may be subjected to target detection, and a bounding box containing a target object in the target video frame is determined; and determining straight lines of all sides of the boundary frame, and detecting whether implantation positions capable of being implanted into the object to be implanted exist in the target video frame based on straight line equation parameters corresponding to the straight lines.
In still other embodiments, a plurality of candidate edges in the target video frame may be identified, together with a bounding box containing a target object in the target video frame; target edges meeting a preset condition are screened out from the plurality of candidate edges, and the bounding box is corrected based on the target edges to obtain an implantation position at which an object can be implanted into the target video frame, wherein the preset condition is that the edges can be connected to form a polygon whose similarity to the bounding box is greater than a preset similarity threshold.
Then, implantation is performed. For example, a perspective transformation matrix is determined based on the corner coordinates of the object to be implanted and the corner coordinates of the implantation position; perspective transformation is performed on the object to be implanted based on the perspective transformation matrix; the pixels of the edge area near the contour of the perspective-transformed object are obtained; and anti-aliasing is applied to the pixels of the edge area, after which the anti-aliased object is implanted into the implantation position.
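The sketch below illustrates this warp-and-blend step with OpenCV. The function name, the clockwise corner ordering, the kernel size, and the use of a feathered alpha mask to approximate the anti-aliasing of edge pixels are assumptions of the sketch rather than details fixed by the method.

```python
import cv2
import numpy as np

def implant(frame, obj, dst_corners):
    """Warp `obj` onto `frame` at the quadrilateral `dst_corners`
    (4x2 float32, clockwise from top-left) and feather the edges."""
    h, w = obj.shape[:2]
    src_corners = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    M = cv2.getPerspectiveTransform(src_corners, np.float32(dst_corners))

    # Project the object to the implantation position and angle
    warped = cv2.warpPerspective(obj, M, (frame.shape[1], frame.shape[0]))

    # Binary mask of the warped region, blurred so boundary pixels blend smoothly
    mask = np.zeros(frame.shape[:2], dtype=np.float32)
    cv2.fillConvexPoly(mask, np.int32(dst_corners), 1.0)
    mask = cv2.GaussianBlur(mask, (7, 7), 0)[..., None]

    return (frame * (1 - mask) + warped * mask).astype(frame.dtype)
```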
The embodiment can realize accurate positioning of the target object in the target video frame through target detection and bounding box determination, which helps ensure that the multimedia object to be implanted is placed at the correct position, so as not to impair the viewing experience of the audience. Secondly, performing perspective calibration on the object to be implanted by adopting a perspective transformation matrix, and ensuring that the position of the object to be implanted in the target video frame is consistent with the environment and the angle. In addition, the anti-aliasing treatment is carried out on the object to be implanted after the perspective transformation, which is helpful for improving the quality of the implantation effect and leading the implantation effect to be more natural and smooth. In summary, the present embodiment can ensure the accuracy of the implantation position and the high quality of the implantation effect.
Example 2
An embodiment of the present application provides another video processing method, as shown in fig. 2, including the following steps:
step S202, detecting an implantation position.
The method for detecting the implantation position is shown in fig. 3, and comprises the following steps:
in step S2022, a target video frame in the target video is acquired.
The target video frame may be any one of a set of key frame images in the target video. Key frame images can be divided into three types: a first frame image, a last frame image, and a specified frame image.
The first frame image is the first frame of the target video. It plays a key role in the introduction of video content, typically for previews, thumbnails or pictures before the video starts. The first frame image may convey important information about video content and subject matter and is thus widely used in the fields of video sharing platforms, advertising, movies, and the like.
The last frame image is the final frame of the target video. It plays a key role at the end of the video, typically containing the ending scene, closing credits, brand identification, or other important information. The last frame image helps the audience remember the video content and also provides opportunities for sharing and promoting the video. In movies and advertisements, last frame images are typically used to present the production team, production company, or brand information.
The specified frame images are key frames that are well defined in the video, they may not be the first or last frame, but have special significance in video editing and analysis. These frames are typically selected because they contain important scenes, critical information, or specific actions. The selection of the designated frame image may be based on a timecode, content analysis, subject matter relevance, or other criteria. In video analysis and processing, these images may be used for object detection, emotion analysis, advertisement positioning, and video summary generation applications.
Step S2024 identifies a plurality of candidate edges in the target video frame and identifies a bounding box in the target video frame containing the target object.
First, candidate edges are identified.
The line segment detection model may be invoked to extract feature information of the target frame image and identify line segments in the image based on the feature information. The feature information may include gray values, position information, pixel values, and the like of the pixel points in the target frame image. Line segment detection may employ different techniques, including a conventional method based on the Hough transform and a neural-network-based method.
The network structure of the neural network-based segment detection model may include four main modules: the system comprises a trunk module, a connection point prediction module, a line segment sampling module and a line segment correction module. The backbone module is responsible for feature extraction, takes the input image as input, and provides a shared convolution feature map for subsequent modules. These feature maps contain a high-level representation of the image, helping subsequent modules to better understand the image content. The task of the connection point prediction module is to output candidate connection points, which are image locations that may contain line segments. The connection point prediction module predicts the location of the connection point using the feature information extracted by the backbone module. The line segment sampling module receives the connection point information output by the connection point prediction module and predicts a candidate line segment therefrom. The task of the segment sampling module is to combine the connection points into candidate segments. The line segment correction module is responsible for classifying the candidate line segments to determine which candidate line segments are actually straight line segments in the image. This module includes a pooling layer for extracting segment features for each candidate segment. By combining the convolution feature map extracted by the trunk module, the line segment correction module may determine which candidate line segments are valid and output information of the straight line segments, such as endpoint coordinates. The embodiment effectively identifies the line segments in the image through the modularized structure of the neural network, which is helpful for improving the accuracy and efficiency of line segment detection.
Next, a bounding box is identified. A dataset is prepared comprising images of the target object and accurate bounding box annotations of the target object in each image. These annotations are typically provided as rectangular boxes, including the coordinates of the upper-left and lower-right corners. Next, a target detection model appropriate for the task is selected. Many models are available in the field of object detection, such as YOLO, Fast R-CNN, and SSD. Subsequently, model training is performed: the selected detection model is trained using the annotation data. During training, the model learns how to locate the target object in an image and generate a corresponding bounding box. Once training is complete, the model can be applied to the target video frame. The video frame is input into the model, and the model performs inference. The model analyzes the image and outputs bounding boxes of the detected target objects, together with other information related to each bounding box, such as confidence scores. In some cases, post-processing the bounding boxes output by the model can improve accuracy. Post-processing operations may include removing overlapping bounding boxes, filtering out bounding boxes with low confidence, or merging similar bounding boxes using non-maximum suppression (NMS). This post-processing improves the accuracy and usability of the detection result and ensures that only the most relevant bounding boxes are retained.
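As an illustration of the NMS step mentioned above, a standard greedy implementation is sketched below; the IoU threshold of 0.5 is an assumed value, not one specified by the method.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    """Keep the highest-scoring boxes and drop boxes that overlap a kept box
    by more than `iou_thresh`. Boxes are rows of [x1, y1, x2, y2]."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]
    return keep
```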
Step S2026, screening out target edges meeting a preset condition from the plurality of candidate edges, and correcting the bounding box based on the target edges to obtain an implantation position at which the object to be implanted can be placed in the target video frame, where the preset condition is that the edges can be connected to form a polygon and the similarity between the polygon and the bounding box is greater than a preset similarity threshold.
First, the target edges are screened out. Specifically, connectivity among the plurality of candidate edges is detected, and edges that can be connected to form a polygon are screened out; the similarity between the polygon and the bounding box is calculated, and the edges of the polygon are taken as the target edges when the similarity is greater than the preset similarity threshold. This helps reduce false detections and improves the accuracy of the implantation position, especially in complex scenes.
In some embodiments, the similarity may be calculated as follows: the overlapping area is calculated based on the contour functions of the polygon and the bounding box; the degree of overlap is calculated based on the distance between the center points of the polygon and the bounding box and the overlapping area; the area difference between the polygon and the bounding box is calculated and normalized to obtain the relative size value; and the spatial relationship value is calculated based on the depth values of the polygon and the bounding box and the distance between their center points. After the overlapping area, the degree of overlap, the relative size value and the spatial relationship value have been calculated, the similarity between the polygon and the bounding box is calculated from these quantities.
For example, the similarity can be calculated using the following formula: similarity = w1 × IoU + w2 × (1 − relative size value) + w3 × spatial relationship value, where IoU denotes the overlap (Intersection over Union), i.e. the ratio of the overlapping area to the union area of the polygon and the bounding box. The relative size value is a normalized value of the area difference between the polygon and the bounding box, and subtracting it from 1 measures the similarity of the sizes. The spatial relationship value combines information such as the depth values of the polygon and the bounding box and the distance between their center points. w1, w2 and w3 are preset weights.
In some embodiments, the degree of overlap may be calculated as follows. The intersection points are found by computing the intersections of the polygon boundary with the bounding box boundary. These points are connected to form a new polygon representing the intersection of the polygon with the bounding box. Next, the area of this intersection polygon is calculated using a polygon area algorithm. Then, the areas of the polygon and of the bounding box are calculated separately, and the union area is obtained as the area of the polygon plus the area of the bounding box minus the area of the intersection polygon. The intersection area and the union area are then used to calculate IoU, i.e. IoU = intersection area / union area. This IoU computation accounts more accurately for complex interactions between polygons and bounding boxes, and is particularly useful where complex shape matching and overlap metrics must be handled.
In some embodiments, the relative size value may be calculated as: relative size value = (|area of polygon − area of bounding box| / max(area of polygon, area of bounding box))². Squaring the result in this embodiment makes the contribution of the relative size value to the similarity more pronounced.
In some embodiments, the spatial relationship value may be calculated as: spatial relationship value = (1 − distance / maximum distance) × (1 − degree of overlap) × (1 − depth value), where distance denotes the distance between the center points of the polygon and the bounding box, and maximum distance denotes the furthest spatial separation between the polygon and the bounding box, typically the furthest distance from a point of the polygon to the bounding box or from a point of the bounding box to the polygon. This embodiment introduces depth values so that the spatial relationship between the polygon and the bounding box is considered more fully: the relative positions of the polygon and the bounding box can be measured with depth information, further improving the accuracy of the spatial relationship value. By considering distance, degree of overlap and depth together, the spatial relationship between them is measured more accurately.
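Putting these pieces together, the sketch below computes the similarity score with Shapely. The weights, the reading of the degree of overlap as intersection area over the smaller area, the use of the Hausdorff distance as the "maximum distance", and the assumption that the depth value is normalized to [0, 1] are choices of the sketch rather than details fixed by the method.

```python
from shapely.geometry import Polygon, box

def similarity(poly_pts, bbox, depth_value, w1=0.5, w2=0.3, w3=0.2):
    """similarity = w1*IoU + w2*(1 - relative size value) + w3*spatial relationship.
    `bbox` is (x1, y1, x2, y2); `depth_value` is assumed to lie in [0, 1]."""
    poly = Polygon(poly_pts)
    rect = box(*bbox)

    inter = poly.intersection(rect).area
    iou = inter / poly.union(rect).area

    rel_size = (abs(poly.area - rect.area) / max(poly.area, rect.area)) ** 2

    dist = poly.centroid.distance(rect.centroid)
    max_dist = poly.hausdorff_distance(rect)          # furthest separation proxy
    overlap = inter / min(poly.area, rect.area)       # one reading of "degree of overlap"
    spatial = ((1 - dist / max_dist) if max_dist > 0 else 1.0) \
              * (1 - overlap) * (1 - depth_value)

    return w1 * iou + w2 * (1 - rel_size) + w3 * spatial

print(similarity([(0, 0), (10, 0), (11, 9), (0, 10)], (0, 0, 10, 10), depth_value=0.2))
```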
The bounding box is then modified based on the target edges. For example, geometric features of the target edges are identified, including the length, angle and curvature of each edge; the relative position between the target edges and the bounding box is analyzed based on these geometric features; and the position and shape of the bounding box are adjusted according to the relative position to correct the bounding box. By identifying geometric characteristics such as the length, angle and curvature of the target edges, the system gains a more complete picture of the shape and position of the target. This helps capture the appearance characteristics of the target object accurately, and performs particularly well in complex scenes or with irregular shapes. Next, based on the analysis of these geometric features, the relative positional relationship between the target edges and the existing bounding box can be studied in depth. Finally, according to the result of this relative-position analysis, the position and shape of the bounding box can be adjusted intelligently so that it contains the target object better, reducing possible deviations and errors of the bounding box. This fine adjustment of the bounding box makes target detection more accurate.
Specifically, when the relative position indicates that the target edge intersects the bounding box, the intersection angle between the target edge and the bounding box is detected, and when the intersection angle is larger than a preset angle threshold, the bounding box is shrunk to avoid the intersection. This helps remove redundant portions of the bounding box and ensures that it conforms better to the shape of the target object, improving the accuracy of the bounding box. In addition, when the relative position indicates that the target edge does not intersect the bounding box, the gap distance between the target edge and the bounding box is detected, and when the gap distance is smaller than a preset gap threshold, the side of the bounding box is translated toward the target edge so that the bounding box lies closer to it. In this way, the gap between the target edge and the bounding box is reduced, the bounding box better encloses the target object, and the adaptability of the bounding box is improved.
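A simplified sketch of these two correction rules for an axis-aligned bounding box is given below. The angle and gap thresholds, the shrink factor, and the use of the edge midpoint as a proxy for the intersection test are assumptions of the sketch.

```python
import numpy as np

def correct_bbox(bbox, edge_p1, edge_p2,
                 angle_thresh_deg=30.0, gap_thresh=10.0, shrink=0.9):
    """Adjust a bbox (x1, y1, x2, y2) against a detected target edge given by
    its two endpoints; the threshold values are illustrative."""
    x1, y1, x2, y2 = bbox
    vec = np.array(edge_p2, float) - np.array(edge_p1, float)
    angle = abs(np.degrees(np.arctan2(vec[1], vec[0]))) % 180
    angle = min(angle, 180 - angle)               # angle to the horizontal box side

    mid = (np.array(edge_p1, float) + np.array(edge_p2, float)) / 2
    inside = x1 <= mid[0] <= x2 and y1 <= mid[1] <= y2

    if inside and angle > angle_thresh_deg:
        # Edge cuts through the box at a steep angle: shrink the box around its center
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        w, h = (x2 - x1) * shrink, (y2 - y1) * shrink
        return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

    if not inside:
        # Translate the box toward the edge when the gap is small
        dx = mid[0] - np.clip(mid[0], x1, x2)
        dy = mid[1] - np.clip(mid[1], y1, y2)
        if np.hypot(dx, dy) < gap_thresh:
            return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)

    return bbox
```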
In step S204, the object to be implanted is determined.
As shown in fig. 4, the method of determining an object to be implanted includes the steps of:
in step S2042, a first object existing in the target video frame is identified, and a first feature vector is extracted from the first object.
First, a first object is identified. Various objects in the target video frame, including the first object, the most salient object in the target frame, can be automatically detected and located using computer vision techniques and target detection algorithms. Computer vision techniques are capable of analyzing pixel information in an image, detecting different objects, and then determining their locations and bounding boxes.
Then, a first feature vector is extracted. Each word may be mapped into a multi-dimensional vector space based on the textual information determined for the first object, where each dimension represents a semantic feature. Such a vector space, which may also be referred to as an embedding space, converts text information into a numerical form the computer can work with. The word vectors may then be weighted using term frequency–inverse document frequency (TF-IDF) or a similar scheme to emphasize words with greater semantic importance. Finally, the weighted word vectors are combined into a first feature vector representing the semantic features of the first object.
The present embodiment associates a first object in an image with related text information so that the computer can understand the image content more deeply. Thus, by combining computer vision and natural language processing, a higher level of analysis and processing of video content may be achieved, thereby providing a more accurate and efficient solution to the task of advertisement placement and the like.
In step S2044, a plurality of second objects are read from the database, and a corresponding second feature vector is extracted from each second object.
First, tag information of each second object including a title, a genre, a keyword, and the like is acquired. These tags are used to describe the second object. These tag information are then converted into a numerical form for comparison and matching with the feature vector of the first object. To convert text information into a digital form, text embedding techniques are employed. In particular, text information is converted into a multi-dimensional vector, where each dimension represents a semantic feature. This vector representation allows the computer to better understand the text information because it provides a structured way to capture the semantic information of the text.
Then, scale normalization processing is performed. The scale normalization process is performed to ensure that the magnitude of the different text information does not unnecessarily interfere with the matching result. The scale normalization process maps the values of the text feature vectors into a uniform range, which eliminates problems due to differences in text information length or feature value size. This approach helps to improve the stability of the matching, making the algorithm more robust.
And finally, taking the normalized text feature vector as a second feature vector so as to facilitate subsequent semantic association analysis and matching. This process enables a more accurate comparison of the degree of semantic association between the first object and the second object, thereby selecting the second object that is most relevant to the first object.
The text information is converted into the operable feature vectors through the text embedding and scale normalization technology, so that subsequent matching and analysis are facilitated. This helps to improve the accuracy and stability of the algorithm, ensuring the effect and reliability of the video processing method.
Step S2046 selects an object to be implanted associated with the first object from the plurality of second objects.
First, a semantic association between a second object and a first object is determined. The determination of the semantic association is realized by a natural language processing method. The feature vector of the first object and the feature vector of each second object are mapped into a shared semantic space using text embedding techniques. This semantic space is represented in the form of a multi-dimensional vector, where each dimension represents a semantic feature.
Next, the semantic distance between the first object and each second object is measured in this shared semantic space using the Euclidean distance. This distance reflects the semantic similarity between them, i.e. their proximity in semantic space, and thus helps to quantify the semantic relationship between the first object and each second object.
Thereafter, based on the calculated semantic distance, each second object may be ranked, ranking them from high to low in terms of semantic relevance. In this way it is possible to determine which second object is most relevant to the first object, i.e. has the highest degree of semantic association.
Finally, the second object with the highest semantic association is selected as the object to be implanted. This procedure ensures that the selected object is more semantically related to the first object, thereby improving the accuracy and consistency of implantation. In this way, the object to be implanted can be better selected to achieve the high quality effect of video implantation.
Step S206, implanting the object to be implanted into the implantation position in the target video frame.
First, a perspective transformation matrix is determined. The perspective transformation matrix may be determined by calculating the transformation from the corner coordinates of the object to be implanted to the corner coordinates of the implantation position. For example, the corner coordinates of the object to be implanted and of the implantation position may be extracted using a corner detection algorithm (e.g., Harris corner detection). These coordinates are then used to calculate the perspective transformation matrix.
Then, perspective transformation is performed. Once the perspective transformation matrix is obtained, the object to be implanted can be perspective-transformed. Perspective transformation is a projective transformation (linear in homogeneous coordinates) that projects an object from its original position to a new position and viewing angle. The perspective transformation matrix is applied to each pixel of the object to be implanted, adjusting it to the implantation position.
Finally, implantation is carried out at the implantation position. After perspective transformation, the object to be implanted has been adapted to the angle and environment of the implantation site. The transformed pixel data of the object is accurately embedded into the implantation location of the target video frame, ensuring that the implanted object is compatible with the surrounding environment. Some color correction and brightness adjustment may also be performed during this process to better match the implanted object to the target video frame.
Example 3
An embodiment of the present application provides a video processing method, as shown in fig. 5, including the following steps:
step S502, detecting implantation positions in the target video and setting implantation marks.
1) The target video is divided into a plurality of video frames.
2) And carrying out target detection on target video frames in the plurality of video frames, and determining a boundary box containing a target object in the target video frames.
Target detection is performed on the target video frame; the previous video frame of the target video frame is acquired, and target detection is performed on that previous video frame; the bounding box containing the target object in the target video frame is then determined based on the detection result of the previous video frame and the detection result of the target video frame.
For example, dense optical flow is used to estimate the displacement of each pixel from the previous video frame to the target video frame; the shape change and motion state of the target object are deduced from these displacements, and target detection is carried out based on the shape change and motion state.
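A minimal sketch of the dense-flow step using OpenCV's Farneback method is shown below; the parameter values are conventional defaults, not values specified by the method.

```python
import cv2

def estimate_motion(prev_frame, curr_frame):
    """Dense optical flow (Farneback) from the previous frame to the target frame;
    the returned field gives the per-pixel displacement used to infer how the
    target object has moved or deformed between the two detections."""
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2,
                                        flags=0)
    return flow   # shape (H, W, 2): per-pixel (dx, dy)
```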
The present embodiment can improve the accuracy of target detection by performing target detection at two different time points (the current frame and the previous frame). The multi-frame detection strategy can reduce false detection or missed detection caused by factors such as shielding, illumination change or noise in a single frame. By comparing the detection results of the two frames, the bounding box of the target object can be determined more reliably.
3) And determining straight lines of all sides of the boundary frame, and detecting whether implantation positions capable of being implanted into an object to be implanted exist in the target video frame based on straight line equation parameters corresponding to the straight lines.
First, the line equation parameters are converted into a parameter matrix, where the parameter matrix describes the positions of all pixel points in the bounding box. For example, the coordinates of each pixel point are substituted into the equation of the straight line, and the coordinates of the pixel's projection onto the straight line are calculated to obtain the position information of each pixel point relative to the straight line; the parameter matrix is then constructed from this position information, with the elements of the parameter matrix corresponding to the position information of each pixel point.
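The following sketch illustrates one way to build such a parameter matrix for a bounding-box side expressed as a*x + b*y + c = 0; storing the signed distance together with the projection coordinates is an assumption of the sketch.

```python
import numpy as np

def line_parameter_matrix(a, b, c, height, width):
    """For the line a*x + b*y + c = 0, compute each pixel's signed distance to the
    line and the coordinates of its projection onto the line; the stacked result
    plays the role of the parameter matrix describing pixel positions relative
    to the bounding-box side."""
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    norm = np.hypot(a, b)
    dist = (a * xs + b * ys + c) / norm          # signed distance to the line
    proj_x = xs - a * dist / norm                # foot of the perpendicular
    proj_y = ys - b * dist / norm
    return np.stack([dist, proj_x, proj_y], axis=-1)   # (H, W, 3)

params = line_parameter_matrix(a=1.0, b=-1.0, c=0.0, height=4, width=4)
print(params[2, 3])   # pixel (x=3, y=2): distance to y = x and its projection
```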
The present embodiment can more accurately determine the implantation position by modeling the border of the bounding box as a straight line and calculating the straight line equation parameters. Thus, the position of the object to be implanted can be more accurately matched, so that the quality and the sense of reality of the implantation effect are improved. Furthermore, by converting the linear equation parameters into a parameter matrix, the system can adapt to implant location detection in different scenarios. This can help the system work in a variety of contexts and environmental conditions and accommodate a variety of bounding box shapes and sizes.
Then, based on the parameter matrix, it is detected whether an implantation position suitable for the object to be implanted exists in the target video frame. For example, the pixel points in the parameter matrix are screened, and pixel points meeting a preset condition are marked as candidate positions; feature matching is performed between the candidate positions and the object to be implanted, specifically by matching the pixel positions of the object to be implanted against the candidate positions; when the positions match, feature descriptors are extracted from the object to be implanted and from the pixel points of the candidate positions, and whether the feature attributes of the object and of the candidate position match is determined based on these descriptors. It is then determined whether an implantation position suitable for the object to be implanted exists in the target video frame. Finally, an implantation identifier is placed at the implantation position.
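As an illustration of the descriptor-matching step, the sketch below compares ORB descriptors of the object to be implanted with those of a candidate region; the descriptor type, the distance cut-off and the minimum match count are assumptions of the sketch.

```python
import cv2

def attributes_match(obj_img, frame, candidate_box, min_matches=10):
    """Decide whether a candidate position matches the object to be implanted by
    comparing ORB descriptors; `candidate_box` is (x, y, w, h) in BGR `frame`."""
    x, y, w, h = candidate_box
    region = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    obj_gray = cv2.cvtColor(obj_img, cv2.COLOR_BGR2GRAY)

    orb = cv2.ORB_create()
    _, desc_obj = orb.detectAndCompute(obj_gray, None)
    _, desc_reg = orb.detectAndCompute(region, None)
    if desc_obj is None or desc_reg is None:
        return False

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc_obj, desc_reg)
    good = [m for m in matches if m.distance < 50]   # descriptor distance cut-off
    return len(good) >= min_matches
```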
According to the embodiment, the pixel points can be screened based on the implantation position detection of the parameter matrix, and the pixel points which only meet the preset condition are marked as candidate positions, so that false alarms can be reduced, and the reliability of the system is improved. In addition, the matching degree can be more reliably determined by matching the pixel position of the object to be implanted with the candidate position and extracting the characteristic descriptors of the pixel position and the candidate position, so that the matching of the implantation position and the characteristic attribute of the object to be implanted is facilitated, and the implantation reality is improved. Finally, the embodiment is not only suitable for different types of objects to be implanted, but also suitable for implantation under various background conditions. It can be applied to the implantation of virtual objects, such as virtual characters, objects or effects, as well as a variety of different video contexts.
Step S504, a preset implantation identifier is obtained from a target video, wherein the implantation identifier is used for identifying the implantation position of an object to be implanted in the target video.
The target video is divided into a plurality of video frames, and each of the video frames is converted into a grayscale image; Gaussian blur is applied to the grayscale image with a preset blur kernel size, and the blurred grayscale image is thresholded to convert it into a binary image; a contour corresponding to a preset tracking pattern is then extracted from the binary image, and the implantation identifier is recognized from the image within the contour.
In this embodiment, dividing the target video into video frames, converting them to grayscale, and then applying Gaussian blur and thresholding reduces noise and interference in the video frames, making extraction of the implantation identifier more stable. Extracting the contour from the binary image based on the preset tracking pattern further helps to recognize the implantation identifier.
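A minimal sketch of this preprocessing pipeline with OpenCV is shown below; the kernel size and threshold are illustrative, and matching the extracted contours against the preset tracking pattern (for example with cv2.matchShapes) would follow as the next step.

```python
import cv2

def find_marker_contours(frame, blur_ksize=5, thresh=127):
    """Locate candidate implantation-identifier contours in a frame:
    grayscale -> Gaussian blur -> binary threshold -> contour extraction."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (blur_ksize, blur_ksize), 0)
    _, binary = cv2.threshold(blurred, thresh, 255, cv2.THRESH_BINARY)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    return contours
```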
Step S506, determining the object to be implanted based on the identity information carried in the implantation identifier, and generating a pose transformation matrix of the object to be implanted based on the pose information carried in the implantation identifier.
Firstly, determining the object to be implanted based on the identity information carried in the implantation identification. For example, extracting a plurality of tracking points from the image within the contour, and calculating a center point of the image within the contour using positions of the plurality of tracking points in the image within the contour; connecting the plurality of tracking points with the center point based on the angles of the plurality of tracking points and the angles of the center point to obtain a region to be filled; the identity information is determined based on a combination of the geometry of the region to be filled and the color of the center point, and the object to be implanted is determined based on the identity information.
The embodiment considers the geometric characteristics and the color characteristics of the implantation mark, which is helpful for accurately identifying the identity of the object to be implanted, ensures that the selected object is consistent with the video environment, and improves the authenticity of the implantation effect. Second, the present embodiment also helps to precisely locate the position of the object to be implanted. By calculating the geometric shape of the region to be filled and combining the color information of the center point, the position of the object can be determined more accurately, the object to be implanted is ensured to be accurately placed at the expected position in the video, and the problems of position deviation and incompatibility are avoided. In summary, the present embodiment contributes to improvement of the sense of realism of the implantation effect. By comprehensively considering information such as color, shape, angle and the like, the object to be implanted is ensured to be consistent with the visual characteristics of the video environment, so that the implanted object looks more natural and real, the implantation effect is enhanced, and the implanted object is better integrated into the video scene.
Then, a pose transformation matrix of the object to be implanted is generated based on the pose information carried in the implantation identifier. For example, the position, rotation angle, and size of the object to be implanted are determined from the pose information, and the pose transformation matrix is determined based on these three quantities.
By jointly considering position, rotation angle, and size, this embodiment precisely defines where and how the object to be implanted appears in the target video, ensuring that the implanted object is consistent with the video environment and does not look uncoordinated, distorted, or unnatural. Moreover, generating the pose transformation matrix is programmable and controllable: by modifying the position, rotation angle, and size parameters, the implanted object can be adjusted at any time to suit different video scenes or creative requirements, without recreating or re-editing the object. This provides a high degree of customization and flexibility, since the appearance and position of the implanted object can be adjusted as desired without reprocessing the entire implantation procedure. In summary, generating the pose transformation matrix from the pose information is accurate, customizable, and automatic, which helps keep the implanted object coordinated with the target video while improving processing efficiency and controllability.
Specifically, determining the pose transformation matrix based on the position, rotation angle, and size of the object to be implanted may include: generating a translation transformation matrix for translating the object to be implanted to the position based on the position of the object to be implanted; generating a rotation transformation matrix for rotating the object to be implanted according to the rotation angle based on the rotation angle of the object to be implanted; generating a scaling transformation matrix for scaling the object to be implanted according to the size based on the size of the object to be implanted; wherein the pose transformation matrix comprises the translation transformation matrix, the rotation transformation matrix and the scaling transformation matrix.
By using a pose transformation matrix, this embodiment provides precise control over the object to be implanted, covering translation, rotation, and scaling. These exact transformations keep the implanted object coordinated with the video environment and improve the realism of the implantation effect. The matrix representation not only makes the mathematical calculation efficient but also provides flexibility and customizability, allowing quick adjustment to different requirements. At the same time, composing the pose transformation from individual matrices keeps the transformation process controllable and predictable, improving processing accuracy and consistency. This embodiment therefore offers clear advantages in the coordination, accuracy, and efficiency of the implantation effect.
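A minimal sketch of such a composition with 3x3 homogeneous 2-D transforms is shown below; the parameter names and the scale-rotate-translate order are illustrative assumptions, since the embodiment does not fix a particular composition order.

```python
import numpy as np

def pose_matrix(tx, ty, angle_deg, sx, sy):
    """Compose translation, rotation and scaling into one homogeneous pose matrix."""
    T = np.array([[1, 0, tx], [0, 1, ty], [0, 0, 1]], dtype=float)       # translation transformation
    a = np.deg2rad(angle_deg)
    R = np.array([[np.cos(a), -np.sin(a), 0],
                  [np.sin(a),  np.cos(a), 0],
                  [0,          0,         1]], dtype=float)              # rotation transformation
    S = np.array([[sx, 0, 0], [0, sy, 0], [0, 0, 1]], dtype=float)       # scaling transformation
    return T @ R @ S    # one possible order: scale, then rotate, then translate
```

Applying the resulting matrix to the object's homogeneous corner coordinates yields the adjusted placement; keeping separate `sx` and `sy` factors already accommodates the anisotropic scaling discussed below.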
In some embodiments, different portions of the object to be implanted may need different dimensions, including local scaling or deformation. To achieve this, the pose transformation matrix may apply non-uniform scaling, that is, different scale factors on different axes, for example by introducing anisotropic scaling factors. Sometimes the object to be implanted also needs to be warped so that it can adapt to a particular scene or shape. By applying a warping transformation, the object can be deformed non-linearly on its surface to match the target scene. This is useful for cases such as muscle simulation of a virtual character or morphological adjustment of a deformable object. The warping may be achieved with a non-linear transformation, such as a Bezier curve or a B-spline, to introduce local shape variations at different parts of the object to be implanted.
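One way to realize such a Bezier-style local warp, under simplifying assumptions, is sketched below: the object is represented by 2-D sample points, and a quadratic Bezier profile displaces them along one axis; the strength parameter is a hypothetical control, not something defined by the embodiment.

```python
import numpy as np

def bezier_warp(points, strength=10.0):
    """Apply a quadratic-Bezier-shaped displacement along the y axis."""
    pts = np.asarray(points, dtype=float)
    span = max(np.ptp(pts[:, 0]), 1e-6)
    t = (pts[:, 0] - pts[:, 0].min()) / span       # position along the object, in [0, 1]
    offset = strength * 2.0 * (1.0 - t) * t        # Bezier bump: zero at both ends, peak mid-way
    warped = pts.copy()
    warped[:, 1] += offset                         # local, non-linear deformation of the surface
    return warped
```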
In addition, changes in the material and texture of the object to be implanted should be considered during pose transformation: not only does the geometry of the object change, but material properties such as color, transparency, and reflectivity may change as well. Combining these texture attributes with the pose transformation when generating the pose transformation matrix helps achieve a realistic appearance.
In some embodiments, an adaptive pose transformation may be used, depending on the requirements of the scene: the pose transformation matrix can be adjusted according to the interaction and constraints between the object to be implanted and its surroundings. For example, when a virtual character interacts with real-world objects, the pose may be adjusted dynamically to better simulate the interaction. If changes in camera viewing angle are involved, camera parameters such as the intrinsic and extrinsic parameters also need to be considered; these parameters can be combined with the pose transformation matrix to ensure that the object to be implanted is rendered realistically from different viewpoints.
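As a sketch of how camera parameters could be combined with the pose, the function below projects posed 3-D points through a pinhole model; the intrinsic matrix K and the extrinsics R, t are assumed inputs, not values specified by the embodiment.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project posed 3-D points into pixel coordinates with a pinhole camera model."""
    P = K @ np.hstack([R, t.reshape(3, 1)])                       # 3x4 projection matrix
    homog = np.hstack([points_3d, np.ones((len(points_3d), 1))])  # homogeneous coordinates
    uv = (P @ homog.T).T
    return uv[:, :2] / uv[:, 2:3]                                 # perspective divide -> pixels
```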
Step S508, the object to be implanted is adjusted based on the pose transformation matrix, and the adjusted object to be implanted is implanted into the target video.
A target video frame carrying the implantation identifier is acquired from the plurality of video frames, and the target video frame is fused with the adjusted object to be implanted so as to implant the adjusted object into the target video. For example, perspective transformation information is obtained from the target video frame and the adjusted object to be implanted, and the adjusted object is perspective-transformed based on that information; the perspective-transformed object is superimposed at the implantation position indicated by the implantation identifier in the target video frame, and the transparency of the superimposed object is adjusted based on the target video frame; finally, edge smoothing is applied to the boundary of the superimposed object.
In this embodiment, obtaining the target video frame that carries the implantation identifier allows the implanted object to be placed precisely at a specific position in the video, so it stays coordinated with the scene and the realism of the implantation effect is enhanced. The perspective transformation information accounts for the deformation and projection of the object under different viewing angles, so the object to be implanted appears more realistic and coherent in the video. Adjusting the transparency lets the implanted object blend naturally with the background without looking out of place, and the edge smoothing removes hard boundaries between the implanted object and the background so the transition is smoother. Together, these effects improve the overall consistency between the implanted object and the target video environment and enhance the implantation effect.
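The sketch below strings these fusion steps together with OpenCV under stated assumptions: the object is supplied as an RGBA image, the source and destination quadrilaterals are already known, and the alpha and feathering values are placeholders rather than values prescribed by the embodiment.

```python
import cv2
import numpy as np

def composite(frame, obj_rgba, src_quad, dst_quad, alpha=0.9, feather=7):
    """Warp an RGBA object onto the marked region and blend it into the frame."""
    H = cv2.getPerspectiveTransform(np.float32(src_quad), np.float32(dst_quad))
    h, w = frame.shape[:2]
    warped = cv2.warpPerspective(obj_rgba, H, (w, h))               # perspective transformation
    weight = warped[:, :, 3].astype(np.float32) / 255.0 * alpha     # transparency of the overlay
    weight = cv2.GaussianBlur(weight, (feather, feather), 0)        # edge feathering / smoothing
    mask = weight[:, :, None]
    blended = frame.astype(np.float32) * (1.0 - mask) + warped[:, :, :3].astype(np.float32) * mask
    return blended.astype(np.uint8)
```

Here `src_quad` would typically be the corners of the object image and `dst_quad` the corners of the implantation position detected in the target video frame.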
Example 4
An embodiment of the present application provides a video processing apparatus, as shown in fig. 6, including a feature extraction module 62, a selection module 64, and an implantation module 66.
The feature extraction module 62 is configured to identify a first object present in a target video frame and extract a first feature vector from the first object, wherein the first object is the most salient object in the target video frame; reading a plurality of second objects from a database, and extracting corresponding second feature vectors from each second object, wherein the second objects are candidate multimedia files to be implanted into the target video frame; the selection module 64 is configured to select an object to be implanted associated with the first object from the plurality of second objects based on the first feature vector and the second feature vector; the implantation module 66 is configured to detect an implantation location in the target video frame and implant the object to be implanted into the implantation location in the target video frame.
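A purely structural sketch of how these three modules might be wired together is given below; the method names and interfaces are inferred for illustration and are not defined by the patent.

```python
class VideoProcessingApparatus:
    """Illustrative wiring of the feature extraction, selection and implantation modules."""

    def __init__(self, feature_extractor, selector, implanter):
        self.feature_extraction_module = feature_extractor   # produces first/second feature vectors
        self.selection_module = selector                      # picks the object to be implanted
        self.implantation_module = implanter                  # detects the position and implants

    def process(self, target_frame, candidate_objects):
        first_vec = self.feature_extraction_module.extract_first(target_frame)
        second_vecs = [self.feature_extraction_module.extract_second(o) for o in candidate_objects]
        obj = self.selection_module.select(first_vec, second_vecs, candidate_objects)
        position = self.implantation_module.detect_position(target_frame)
        return self.implantation_module.implant(target_frame, obj, position)
```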
It should be noted that the division into the functional modules described above is merely an example for the video processing apparatus provided in this embodiment; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the video processing apparatus and the video processing method provided in the foregoing embodiments belong to the same concept; their detailed implementation is described in the method embodiments and is not repeated here.
Example 5
Fig. 7 shows a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure. It should be noted that the electronic device shown in fig. 7 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device includes a Central Processing Unit (CPU) 1001 that can execute various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. The RAM 1003 also stores various programs and data required for system operation. The CPU 1001, the ROM 1002, and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output portion 1007 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), etc., and a speaker, etc.; a storage portion 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in the drive 1010, so that a computer program read out therefrom is installed as needed in the storage section 1008.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 1009, and/or installed from the removable medium 1011. When the computer program is executed by the Central Processing Unit (CPU) 1001, it performs the various functions defined in the method and apparatus of the present application. In some embodiments, the electronic device may further include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
It should be noted that the computer readable medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by software or by hardware, and the described units may also be provided in a processor. In some cases, the names of the units do not constitute a limitation on the units themselves.
As another aspect, the present application also provides a computer-readable medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device.
The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments. For example, the electronic device may implement the steps of the method embodiments described above.
The integrated units in the above embodiments may be stored in the above-described computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application.
In the foregoing embodiments of the present application, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal device may be implemented in other manners. The apparatus embodiments described above are merely exemplary; for example, the division of the units is merely a logical function division, and other divisions may be used in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make several modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations shall also fall within the scope of protection of the present application.
Claims (9)
1. A video processing method, comprising:
identifying a first object existing in a target video frame, and extracting a first feature vector from the first object, wherein the first object is the most obvious object in the target video frame;
reading a plurality of second objects from a database, and extracting corresponding second feature vectors from each second object, wherein the second objects are candidate multimedia files to be implanted into the target video frame;
selecting a second object associated with the first object from the plurality of second objects as an object to be implanted based on the first feature vector and the second feature vector;
detecting an implantation position in the target video frame, and implanting the object to be implanted into the implantation position in the target video frame;
wherein detecting the implantation position in the target video frame comprises: identifying a plurality of candidate edges in the target video frame, and identifying a bounding box in the target video frame that contains a target object; screening, from the plurality of candidate edges, target edges meeting a preset condition, correcting the bounding box based on the target edges, and taking the corrected bounding box as the implantation position, wherein the preset condition is that the edges can be connected to form a polygon and the similarity between the polygon and the bounding box is greater than a preset similarity threshold;
wherein correcting the bounding box based on the target edges comprises: identifying geometric features of the target edges, the geometric features including a length, an angle, and a curvature of each target edge; analyzing a relative position between the target edges and the bounding box based on the geometric features; and adjusting the position and shape of the bounding box based on the relative position to obtain the corrected bounding box.
2. The method of claim 1, wherein extracting a first feature vector from the first object comprises:
determining text information corresponding to the first object based on the first object, and mapping each word in the text information into a multidimensional vector space, wherein each dimension in the multidimensional vector space represents a semantic feature;
acquiring a first word feature vector of each word from the multidimensional vector space, and weighting the first word feature vector by term frequency-inverse document frequency (TF-IDF);
and taking the weighted first word characteristic vector as the first characteristic vector.
3. The method of claim 1, wherein extracting a respective second feature vector from each second object comprises:
acquiring a tag of each second object, and acquiring text description information based on the tag, wherein the tag comprises at least one of the following: title, genre, and keywords;
converting the text description information into a second word feature vector, carrying out scale normalization processing on the second word feature vector, and taking the normalized second word feature vector as the second feature vector.
4. The method of claim 1, wherein selecting a second object associated with the first object from the plurality of second objects as an object to be implanted based on the first feature vector and the second feature vector comprises:
analyzing a semantic association degree between the first object and each of the second objects using a natural language processing method based on the first feature vector and the second feature vector;
and sorting each second object based on the semantic association degree, and selecting the second object with the highest semantic association degree as the object to be implanted.
5. The method of claim 4, wherein analyzing the semantic association degree between the first object and each of the second objects using the natural language processing method based on the first feature vector and the second feature vector comprises:
text embedding the first feature vector and the second feature vector to map the first feature vector and the second feature vector into a shared semantic space;
calculating a semantic distance between the first object and each of the second objects in the shared semantic space based on a Euclidean distance;
and determining the semantic association degree between the first object and each second object based on the semantic distance.
6. The method of claim 1, wherein implanting the object to be implanted into the target video frame at the implantation location comprises:
determining a perspective transformation matrix based on the corner coordinates of the object to be implanted and the corner coordinates of the implantation position;
performing perspective transformation on the object to be implanted based on the perspective transformation matrix;
and implanting the object to be implanted after perspective transformation into the implantation position.
7. A video processing apparatus, comprising:
a feature extraction module configured to: identify a first object existing in a target video frame, and extract a first feature vector from the first object, wherein the first object is the most obvious object in the target video frame; and read a plurality of second objects from a database, and extract a corresponding second feature vector from each second object, wherein the second objects are candidate multimedia files to be implanted into the target video frame;
a selection module configured to select an object to be implanted associated with the first object from the plurality of second objects based on the first feature vector and the second feature vector;
an implantation module configured to detect an implantation position in the target video frame and implant the object to be implanted into the implantation position in the target video frame;
wherein the implantation module is further configured to: identify a plurality of candidate edges in the target video frame, and identify a bounding box in the target video frame that contains a target object; screen, from the plurality of candidate edges, target edges meeting a preset condition, correct the bounding box based on the target edges, and take the corrected bounding box as the implantation position, wherein the preset condition is that the edges can be connected to form a polygon and the similarity between the polygon and the bounding box is greater than a preset similarity threshold;
wherein the implantation module is further configured to: identify geometric features of the target edges, the geometric features including a length, an angle, and a curvature of each target edge; analyze a relative position between the target edges and the bounding box based on the geometric features; and adjust the position and shape of the bounding box based on the relative position to obtain the corrected bounding box.
8. An electronic device, comprising:
a memory configured to store a computer program;
a processor configured to cause a computer to perform the method of any one of claims 1 to 6 when the program is run.
9. A computer-readable storage medium, on which a program is stored, characterized in that the program, when run, causes a computer to perform the method of any one of claims 1 to 6.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202311218175.5A (CN116962817B) | 2023-09-21 | 2023-09-21 | Video processing method, device, electronic equipment and storage medium |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN116962817A | 2023-10-27 |
| CN116962817B | 2023-12-08 |
Family: ID=88456856

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202311218175.5A (Active) | CN116962817B | 2023-09-21 | 2023-09-21 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN116962817B |
Citations (5)

| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107150347A | 2017-06-08 | 2017-09-12 | 华南理工大学 | Robot perception and understanding method based on man-machine collaboration |
| CN108111478A | 2017-11-07 | 2018-06-01 | 中国互联网络信息中心 | Phishing recognition method and device based on semantic understanding |
| CN109409366A | 2018-10-30 | 2019-03-01 | 四川长虹电器股份有限公司 | Distorted image correction method and device based on corner detection |
| CN111741327A | 2019-03-25 | 2020-10-02 | 华为技术有限公司 | Media processing method and media server |
| CN116761037A | 2023-08-23 | 2023-09-15 | 星河视效科技(北京)有限公司 | Method, device, equipment and medium for video implantation of multimedia information |
Also Published As

| Publication Number | Publication Date |
|---|---|
| CN116962817A | 2023-10-27 |
Legal Events

| Code | Title | Description |
|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |
| CP03 | Change of name, title or address | Address after: Building 60, 1st Floor, No.7 Jiuxianqiao North Road, Chaoyang District, Beijing 021; Patentee after: Shiyou (Beijing) Technology Co.,Ltd.; Country or region after: China. Address before: 4017, 4th Floor, Building 2, No.17 Ritan North Road, Chaoyang District, Beijing; Patentee before: 4U (BEIJING) TECHNOLOGY CO.,LTD.; Country or region before: China |