CN118381980B - Intelligent video editing and abstract generating method and device based on semantic segmentation - Google Patents
- Publication number
- CN118381980B (application CN202410807809.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- semantic
- information
- frequency control
- index
- Prior art date
- Legal status
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/44008—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/472—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
- H04N21/47205—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for manipulating displayed content, e.g. interacting with MPEG-4 objects, editing locally
Abstract
The application discloses an intelligent video editing and abstract generating method and device based on semantic segmentation. The method includes: preprocessing a plurality of video data to obtain a video data set comprising a plurality of pieces of key frame information; performing continuous iterative decomposition on the video data set to obtain a video iterative signal component set comprising a plurality of iterative signal groups, and calculating the information quantity of each iterative signal group to determine the video feature distortion degree; generating a semantic topic label, and determining a video frequency control stationarity index and a video frequency control distortion index according to the video feature distortion degree and the semantic topic label; training a semantic embedding model according to the semantic topic label and the key frame information to obtain a semantic embedding vector; constructing a video semantic index structure according to the video frequency control stationarity index, the video frequency control distortion index and the semantic embedding vector; and obtaining, from the video semantic index structure, a target abstract segment corresponding to a video editing request and outputting it as the video abstract.
Description
Technical Field
The application relates to the technical field of deep learning, in particular to an intelligent video editing and abstract generating method and device based on semantic segmentation.
Background
With the continuous popularization of video equipment and the rapid development of network technology, video has become an important medium for people to acquire information, record life and present themselves. However, with the explosive growth of video content, how to quickly and accurately extract key information from massive video data and generate a concise video summary has become an urgent problem. Traditional video editing and summary generation mainly depend on manual operation, which is time-consuming and labor-intensive, and it is difficult to comprehensively grasp the semantic content and structural context of the video.
In recent years, artificial intelligence technology represented by deep learning has made great progress, and brings new opportunities for intelligent analysis and processing of video content. By performing pixel-level semantic segmentation on the video picture, the deep learning model can automatically identify key objects, characters and scenes in the video, so that the video content is more finely and comprehensively understood. The video analysis method based on semantic segmentation not only can extract semantic information of the video, but also can characterize the time-space evolution rule of the video content, and provides important basis for video structuring and abstract generation.
Therefore, an intelligent video editing and abstract generating method is needed, which can fully mine the semantic information of the video, automatically extract key content, generate a concise and clear video abstract with ordered structure, and support flexible and efficient video retrieval and personalized recommendation.
Disclosure of Invention
The embodiment of the application provides an intelligent video editing and abstract generating method and device based on semantic segmentation, which can fully mine semantic information of videos, automatically extract key contents, generate simple and clear video abstracts with ordered structures, and support flexible and efficient video retrieval and personalized recommendation.
In a first aspect, an embodiment of the present application provides a method for intelligent video editing and summary generation based on semantic segmentation, where the method includes:
Acquiring video data acquired by a plurality of video acquisition devices, preprocessing the video data, and acquiring a video data set which comprises a plurality of key frame information;
Performing continuous iterative decomposition on the video data set to obtain a video iterative signal component set, wherein the video iterative signal component set comprises a plurality of iterative signal groups, calculating the information quantity of each iterative signal group, and determining the video characteristic distortion degree corresponding to the video data set according to the plurality of information quantities;
Obtaining a video influence factor and a candidate topic label corresponding to a video data set, generating a semantic topic label according to the video influence factor and the candidate topic label, and determining a video frequency control stationarity index and a video frequency control distortion index according to the video feature distortion degree and the semantic topic label;
training a semantic embedding model to be trained according to the semantic topic labels and each key frame information, and acquiring a semantic embedding vector corresponding to a video data set output by the semantic embedding model;
constructing a video semantic index structure according to the video frequency control stationarity index, the video frequency control distortion index and the semantic embedding vector;
When a video editing request is received, a target abstract segment corresponding to the video editing request is obtained in a video semantic index structure, and the target abstract segment is used as a video abstract corresponding to the video editing request to be output.
In a second aspect, the present application further provides an intelligent video editing and summary generating device, including:
The data acquisition module is used for acquiring video data acquired by the video acquisition devices, preprocessing the video data, and acquiring a video data set, wherein the video data set comprises a plurality of key frame information.
The distortion determining module is used for carrying out continuous iterative decomposition on the video data set to obtain a video iterative signal component set, wherein the video iterative signal component set comprises a plurality of iterative signal groups, the information quantity of each iterative signal group is calculated, and the video characteristic distortion degree corresponding to the video data set is determined according to the plurality of information quantities.
The tag acquisition module is used for acquiring the video influence factors and the candidate topic tags corresponding to the video data set, generating semantic topic tags according to the video influence factors and the candidate topic tags, and determining video frequency control stability indexes and video frequency control distortion indexes according to the video feature distortion degrees and the semantic topic tags.
The vector acquisition module is used for completing training of the semantic embedding model to be trained according to the semantic topic label and each key frame information, and acquiring the semantic embedding vector corresponding to the video data set output by the semantic embedding model.
The index construction module is used for constructing a video semantic index structure according to the video frequency control stability index, the video frequency control distortion index and the semantic embedding vector.
And the abstract output module is used for acquiring a target abstract segment corresponding to the video editing request from the video semantic index structure when the video editing request is received, and outputting the target abstract segment as a video abstract corresponding to the video editing request.
In a third aspect, the present application also provides a computer device, including a processor and a memory, where the memory is configured to store a computer program, where the computer program, when executed by the processor, implements the intelligent video editing and summary generating method based on semantic segmentation according to the first aspect.
In a fourth aspect, the present application also provides a computer readable storage medium storing a computer program, which when executed by a processor implements the intelligent video editing and summary generating method based on semantic segmentation according to the first aspect.
Compared with the prior art, the application has at least the following beneficial effects:
1. The efficiency and the accuracy of video editing and abstract generation are improved. The invention automatically identifies key objects, characters and scenes in the video through multi-level semantic segmentation and key frame extraction technology, and extracts the most representative video segments to generate a concise and clear video abstract. Meanwhile, by calculating the information quantity of the video iterative signal components and the video characteristic distortion degree, the method and the device can control the information redundancy and the distortion degree of the abstract while guaranteeing the video semantic integrity, and further improve the quality and the accuracy of abstract generation.
2. The multi-dimensional video semantic representation and flexible video retrieval are realized. The invention introduces a video semantic embedding model, maps video content to a low-dimensional semantic space through the learning of semantic topic labels and key frame information, and forms a compact and high-discrimination vectorization representation. Based on the video semantic embedded vector, the invention constructs a multi-level video semantic index structure, supports a user to search videos in a flexible way such as text description, example fragments and the like, rapidly positions related shots and fragments, and greatly improves the accessibility and the searching efficiency of video data.
3. And supporting global optimal video abstract generation and online tuning. According to the method, the candidate abstract segment relation diagram is constructed, the iterative edge cutting algorithm is introduced, and on the premise that constraints of abstract duration, semantic consistency and theme diversity are met, global optimal selection and combination of the candidate abstract segments are achieved, and the video abstract with rich content, reasonable structure and vivid theme is generated.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
Fig. 1 is a schematic flow chart of an intelligent video editing and abstract generating method based on semantic segmentation according to an embodiment of the application;
fig. 2 is a schematic structural diagram of an intelligent video editing and summary generating device according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted in context as "when …" or "once" or "in response to a determination" or "in response to detection. Similarly, the phrase "if a determination" or "if a [ described condition or event ] is detected" may be interpreted in the context of meaning "upon determination" or "in response to determination" or "upon detection of a [ described condition or event ]" or "in response to detection of a [ described condition or event ]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The following describes the technical scheme of the embodiment of the application.
With the continuous popularization of video equipment and the rapid development of network technology, video has become an important medium for people to acquire information, record life and present themselves. However, with the explosive growth of video content, how to quickly and accurately extract key information from massive video data and generate a concise video summary has become an urgent problem. Traditional video editing and summary generation mainly depend on manual operation, which is time-consuming and labor-intensive, and it is difficult to comprehensively grasp the semantic content and structural context of the video.
In recent years, artificial intelligence technology represented by deep learning has made great progress, and brings new opportunities for intelligent analysis and processing of video content. By performing pixel-level semantic segmentation on the video picture, the deep learning model can automatically identify key objects, characters and scenes in the video, so that the video content is more finely and comprehensively understood. The video analysis method based on semantic segmentation not only can extract semantic information of the video, but also can characterize the time-space evolution rule of the video content, and provides important basis for video structuring and abstract generation.
At the same time, higher demands are also put on the retrieval and recommendation of video content by users. The traditional video retrieval mode based on keywords or labels is difficult to accurately describe the semantic meaning of the video, and the relevance and diversity of retrieval results are still to be improved. Users desire to be able to quickly find video content of interest and obtain personalized recommendations and customization services in a more natural and flexible manner. The video analysis and summarization technology is required to accurately understand video content, capture search intention and preference characteristics of users, and realize intelligent and personalized video retrieval and recommendation.
Therefore, an intelligent video editing and abstract generating method is needed, which can fully mine the semantic information of the video, automatically extract key content, generate a concise and clear video abstract with ordered structure, and support flexible and efficient video retrieval and personalized recommendation.
In order to solve the above problems, please refer to fig. 1, fig. 1 is a flow chart of an intelligent video editing and summary generating method based on semantic segmentation according to an embodiment of the present application. The intelligent video editing and abstract generating method based on semantic segmentation can be applied to computer equipment, wherein the computer equipment comprises, but is not limited to, intelligent mobile phones, notebook computers, tablet computers, desktop computers, physical servers, cloud servers and the like. As shown in fig. 1, the intelligent video editing and summary generating method based on semantic segmentation of the present embodiment includes steps S101 to S106, which are described in detail as follows:
Step S101, video data acquired by a plurality of video acquisition devices are acquired, the video data are preprocessed, a video data set is acquired, and the video data set comprises a plurality of key frame information.
In particular, the video acquisition device is a main data source of the scheme, such as adopting a camera as the video acquisition device. The application adopts a plurality of cameras with different functions, such as a monitoring camera, a network camera, a vehicle-mounted camera and the like. The application needs to extract the key frames of the video data so as to further reduce the data quantity and improve the efficiency of subsequent processing. The key frames are frames in the video which can represent the video content most, and the extraction of the key frames can obviously reduce the video data volume, and simultaneously, the main semantic information of the video is reserved.
In some embodiments, the preprocessing the plurality of video data to obtain a video data set includes: converting the format of each video data into a preset format; acquiring quality evaluation information corresponding to the converted video data, and determining target video data in a plurality of video data according to the quality evaluation information; and extracting key frames of each target video data, obtaining the key frame information corresponding to each target video data, and forming the video data set by a plurality of key frame information.
Different cameras may output video data in different formats, such as AVI, MP4, WMV, MOV, etc. The present application requires the conversion of these heterogeneous video data into a unified preset format for the convenience of subsequent processing and analysis. In the scheme, the application selects to convert all videos into the MPEG-4 format, which is a widely used video coding standard and has the advantages of high compression rate, good compatibility, higher quality and the like. The conversion process may be implemented using an FFmpeg or other open source video processing library, e.g., using commands
"ffmpeg -i input.avi -c:v libx264 -preset medium -b:v 1000k -c:a aac -b:a 128k output.mp4"
to convert AVI-format video to MPEG-4 format, where the -preset parameter sets the encoding preset to medium (a balance between quality and encoding speed), the -b:v parameter sets the video bit rate to 1000k (i.e., 1 Mbps), and the -b:a parameter sets the audio bit rate to 128k.
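As a purely illustrative sketch (Python; the directory names are hypothetical and the FFmpeg binary with the options shown above is assumed to be available), the batch conversion step could be scripted as follows:

import subprocess
from pathlib import Path

def convert_to_mp4(src: Path, dst: Path) -> None:
    # Re-encode one video to MPEG-4 (H.264/AAC) with the parameters cited above.
    cmd = [
        "ffmpeg", "-y", "-i", str(src),
        "-c:v", "libx264", "-preset", "medium", "-b:v", "1000k",
        "-c:a", "aac", "-b:a", "128k",
        str(dst),
    ]
    subprocess.run(cmd, check=True)

# hypothetical batch usage over a folder of AVI files
for src in Path("raw_videos").glob("*.avi"):
    convert_to_mp4(src, Path("converted") / (src.stem + ".mp4"))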
After the video format is converted, the quality evaluation is required to be carried out on the video data so as to reject the video with poor quality, and the target video data is determined in a plurality of video data, so that the reliability and the accuracy of subsequent analysis and processing are ensured. Video quality assessment may be performed from multiple dimensions, such as resolution, frame rate, brightness, contrast, sharpness, noise, and so forth. In the scheme, the application mainly adopts two indexes of peak signal-to-noise ratio (PSNR) and Structural Similarity (SSIM) to evaluate video quality. The PSNR measures the difference between the video frame and the reference frame, i.e. the noise level, and its calculation formula is:
PSNR = 10 * log10(MAX^2 / MSE);
Where MAX is the maximum value of a video frame pixel (typically 255), MSE is the mean square error between the video frame and the reference frame. The higher the PSNR value, the better the video quality. The SSIM is used for measuring the structural similarity between the video frame and the reference frame, and takes factors such as brightness, contrast, structure and the like into consideration, and the calculation formula is as follows:
SSIM(x, y) = ((2 * μx * μy + C1) * (2 * σxy + C2)) / ((μx^2 + μy^2 + C1) * (σx^2 + σy^2 + C2));
wherein μx and μy are the mean values of the two image frames, σx and σy are their standard deviations, σxy is their covariance, and C1 and C2 are constants used to avoid a zero denominator. The value range of SSIM is [0, 1]; the closer to 1, the better the video quality. In quality assessment, thresholds can be set for PSNR and SSIM: for example, video frames with PSNR below 30 dB or SSIM below 0.8 are considered to be of poor quality and need to be rejected or further processed.
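A minimal sketch of this quality screening, assuming color frames as H×W×3 uint8 arrays and the scikit-image library for SSIM (the thresholds are the example values quoted above):

import numpy as np
from skimage.metrics import structural_similarity  # assumed available

def psnr(frame: np.ndarray, reference: np.ndarray, max_val: float = 255.0) -> float:
    # PSNR = 10 * log10(MAX^2 / MSE), per the formula above.
    mse = np.mean((frame.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

def frame_is_acceptable(frame, reference, psnr_th=30.0, ssim_th=0.8) -> bool:
    # Reject frames below the example thresholds (PSNR < 30 dB or SSIM < 0.8).
    ssim = structural_similarity(frame, reference, channel_axis=-1, data_range=255)
    return psnr(frame, reference) >= psnr_th and ssim >= ssim_th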
In this scheme, the application adopts a method based on color histograms and motion vector analysis to extract key frames. First, video frames are converted into the HSV color space, the H (hue) and S (saturation) channels are extracted, and the HS histogram of each frame is calculated. The HSV color space better reflects human perception of color: the H channel represents the type of color and the S channel represents its purity, so the HS histogram describes the color distribution characteristics of a frame. Then, the HS histogram difference between two adjacent frames is calculated as a measure of the inter-frame content variation. The histogram difference may be measured using the Euclidean distance, the χ² (chi-square) distance, etc.; if the difference is greater than a preset threshold (e.g., 0.6), a significant change between the two frames is assumed and the previous frame is marked as a key frame candidate. For each key frame candidate, the application further computes the motion vectors between it and its preceding and following frames to reflect the motion and changes between video frames. The motion vectors may be calculated using classical algorithms such as Lucas-Kanade optical flow; if the motion vector magnitude is greater than a preset threshold (e.g., 10 pixels), the frame is determined to be a key frame. To avoid key frames that are too sparse or too dense, the application sets a key frame interval range, such as extracting one key frame every 50-100 frames. Meanwhile, the application evaluates and screens the quality of the extracted key frames and eliminates low-quality or redundant ones to obtain the final key frame set. For example, for a video with a duration of 5 minutes and a frame rate of 30 frames/second, i.e., 9000 frames in total, if the key frame interval is set to 75 frames, about 120 key frames can be extracted and the data volume is significantly reduced. Finally, through the three preprocessing steps of video format conversion, video quality evaluation and key frame extraction, the original camera data are converted into a standardized, high-quality and information-condensed video data set.
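The following is an illustrative Python/OpenCV sketch of this key frame selection; the thresholds mirror the examples above, and dense Farneback optical flow is used here as a stand-in for the Lucas-Kanade step:

import cv2
import numpy as np

def hs_histogram(frame_bgr, bins=32):
    # Normalized 2-D hue/saturation histogram of one frame.
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [bins, bins], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def extract_key_frames(path, diff_th=0.6, motion_th=10.0, min_gap=50, max_gap=100):
    # Mark a frame as a key frame when its HS histogram differs strongly from the
    # previous frame and the mean optical-flow magnitude exceeds motion_th, while
    # keeping the spacing between key frames within [min_gap, max_gap].
    cap = cv2.VideoCapture(path)
    keys, prev_hist, prev_gray, last_key, idx = [], None, None, -max_gap, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = hs_histogram(frame)
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_hist is not None and idx - last_key >= min_gap:
            diff = np.linalg.norm(hist - prev_hist)
            flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                                0.5, 3, 15, 3, 5, 1.2, 0)
            motion = np.mean(np.linalg.norm(flow, axis=2))
            if (diff > diff_th and motion > motion_th) or idx - last_key >= max_gap:
                keys.append(idx)
                last_key = idx
        prev_hist, prev_gray, idx = hist, gray, idx + 1
    cap.release()
    return keys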
Step S102, carrying out continuous iterative decomposition on a video data set to obtain a video iterative signal component set, wherein the video iterative signal component set comprises a plurality of iterative signal groups, calculating the information quantity of each iterative signal group, and determining the video characteristic distortion degree corresponding to the video data set according to the plurality of information quantities.
Specifically, the basic idea of continuous iterative decomposition is to consider video data as a composite signal formed by superimposing signal components of different frequencies, amplitudes and phases, and to gradually decompose signal components of different scales and levels of abstraction by iteratively performing operations such as filtering, downsampling and upsampling on the video data. This process can be expressed by a mathematical formula:
X(t) = A_n(t) + Σ_{i=1}^{n} D_i(t);
where X(t) denotes the original video signal, A_i(t) and D_i(t) denote the low-frequency and high-frequency signal components obtained by the i-th iterative decomposition, respectively, and n is the number of iterative decompositions.
In particular implementations, the present application may employ classical signal processing tools such as wavelet transform (Wavelet Transform) to perform successive iterative decomposition of video data. Wavelet transformation is a time-frequency analysis tool that is capable of extracting local features of a signal at different scales by scaling and shifting the signal. Wavelet transforms have unique advantages in processing non-stationary and abrupt signals compared to conventional fourier transforms. The application can carry out multi-scale decomposition on video signals in two dimensions of time and space to extract signal components with different time-space granularity.
For example, the present application may use the Haar wavelet to perform two-dimensional wavelet decomposition of video frames, decomposing each frame into four subbands: a low-frequency approximation subband (LL), a horizontal high-frequency subband (LH), a vertical high-frequency subband (HL) and a diagonal high-frequency subband (HH), where the LL subband represents a low-frequency approximation of the original frame and the LH, HL and HH subbands represent the high-frequency details of the original frame in different directions. By recursively decomposing the LL subband, a lower-frequency approximation subband and higher-frequency detail subbands can be obtained, forming a multi-scale wavelet coefficient pyramid. By analyzing and processing the different subbands and scales in the wavelet coefficient pyramid, the key information and features of the video data can be extracted at different spatio-temporal granularities.
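An illustrative sketch of this recursive Haar decomposition, assuming the PyWavelets library (subband naming follows the LL/LH/HL/HH convention above):

import numpy as np
import pywt  # PyWavelets, assumed available

def haar_pyramid(frame_gray: np.ndarray, levels: int = 3):
    # Recursively decompose a frame with the Haar wavelet: at each level the
    # current LL band is split again into (LL, (LH, HL, HH)), yielding the pyramid.
    pyramid, ll = [], frame_gray.astype(np.float64)
    for _ in range(levels):
        ll, (lh, hl, hh) = pywt.dwt2(ll, "haar")
        pyramid.append({"LL": ll, "LH": lh, "HL": hl, "HH": hh})
    return pyramid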
In the continuous iterative decomposition process, the method and the device need to reasonably set the number of times and parameters of iterative decomposition according to the complexity and the application requirement of video content. In general, the more iterations, the more signal components are decomposed, the finer the characterization of the video content, but at the same time the computational complexity increases. In practical application, the optimal iteration times can be determined through methods such as experiments, cross-validation and the like. For example, for a video with a duration of 5 minutes and a resolution of 1920×1080, the application may first perform 2-3 wavelet decompositions on the keyframes and then perform more recursive decompositions on the low frequency approximation subbands until a preset reconstruction error threshold is met or a maximum number of iterations (e.g., 5-6) is reached.
In addition, the application can also combine the semantic structure and content characteristics of the video, and adopts different iterative decomposition strategies for different video clips. For example, for fragments with abundant semantic information and rapid detail change, the application can increase the iteration times and extract more high-frequency detail information, while for fragments with sparse semantic information and slow content change, the application can reduce the iteration times and only extract main low-frequency trend information. Therefore, the application can realize self-adaptive signal decomposition and feature extraction among different video clips, and further improve the precision and efficiency of video semantic analysis and abstract generation. Through continuous iterative decomposition, the application finally obtains a video iterative signal component set which contains multi-level representation of video data on different time-space scales and semantic levels.
In some embodiments, the iterative signal group includes a low-frequency approximation subband and a high-frequency detail subband; the calculating the information quantity of each iterative signal group includes: acquiring first edge probability distribution information of the low-frequency approximation subband of each iterative signal group; acquiring second edge probability distribution information of the high-frequency detail subband of each iterative signal group; acquiring joint probability distribution information corresponding to the low-frequency approximation subband and the high-frequency detail subband of each iterative signal group; and calculating the information quantity of each iterative signal group according to the first edge probability distribution information, the second edge probability distribution information and the joint probability distribution information.
The information quantity is an index for measuring the complexity and information richness of a signal component; it reflects the proportion of the component in the video data and the semantic information it carries. In this scheme, mutual information (Mutual Information) from information theory is adopted to measure the information quantity of the video iterative signal components. Mutual information is a non-negative, symmetric measure of the correlation between two random variables; it represents the average amount of information obtained about one random variable by observing the other, and is defined as follows:
I(X; Y) = Σ_x Σ_y P(x, y) * log(P(x, y) / (P(x) * P(y)));
Wherein X and Y represent two discrete random variables, P (X) and P (Y) represent the edge probability distributions of X and Y, respectively, and P (X, Y) represents the joint probability distribution of X and Y. The larger the value of the mutual information I (X; Y) is, the stronger the correlation between X and Y is, and the more information X obtains by observing Y.
In a set of iterative video signal components, the present application may consider each signal component as a discrete random variable that takes the value of the coefficient value of that component over different video frames or segments. In order to calculate the mutual information between the signal components, the present application needs to first estimate their edge probability distribution and joint probability distribution.
Specifically, for the low-frequency approximation subband LLi and the high-frequency detail subbands LHi, HLi and HHi obtained by the i-th iterative decomposition, their coefficient values can be normalized and quantized in a manner similar to the computation of information entropy. For example, the coefficient value range of the LLi subband can be divided into k intervals, the number of coefficient values falling in each interval counted, and the counts divided by the total number of coefficients to obtain the edge probability distribution estimate of the LLi subband:
P_LLi(j) = n_j / m;
where n_j represents the number of coefficient values falling in the j-th interval and m represents the total number of coefficients of the LLi subband. Similarly, the edge probability distributions of the LHi, HLi and HHi subbands can be estimated:
P_LHi(j) = n_j / m;
P_HLi(j) = n_j / m;
P_HHi(j) = n_j / m;
Next, the present application entails estimating a joint probability distribution between different subbands. Taking LLi and LHi subbands as examples, the application can count the occurrence frequency of coefficient value pairs (j, k) of LLi and LHi subbands in different quantization intervals and divide the occurrence frequency by the total coefficient pairs to obtain joint probability distribution estimation of the LLi and LHi subbands:
P(LLi = j, LHi = k) = n_jk / m;
where n_jk represents the number of times the coefficient value of the LLi subband falls in the j-th interval while the coefficient value of the LHi subband falls in the k-th interval, and m represents the total number of coefficient pairs of the LLi and LHi subbands. Using the estimated edge probability distributions and joint probability distribution, the mutual information between the LLi and LHi subbands can be calculated:
I(LLi; LHi) = Σ_j Σ_k P(LLi = j, LHi = k) * log(P(LLi = j, LHi = k) / (P_LLi(j) * P_LHi(k)));
The above formula represents the average amount of information obtained by observing the LHi sub-bands for the LLi sub-bands, reflecting the correlation and redundancy of information between the two sub-bands. The larger the value of I (LLi; LHi), the more strongly correlated the LLi and LHi subbands, the greater the amount of information they contain.
Similarly, the present application can calculate the mutual information I(LLi; HLi) and I(LLi; HHi) between the LLi subband and the HLi and HHi subbands, as well as the mutual information I(LHi; HLi), I(LHi; HHi) and I(HLi; HHi) among the LHi, HLi and HHi subbands. These mutual information values characterize the correlation and information redundancy between subbands at different scales and orientations, providing an important reference for subsequent video semantic analysis and abstract generation.
By calculating the mutual information of each component in the video iterative signal component set, the application can quantitatively evaluate the correlation and information redundancy among different components and understand the internal structure and information distribution mode of video data.
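A minimal sketch of this mutual-information estimate, assuming the coefficients are quantized into k intervals as described above (histogram-based estimation of the marginal and joint distributions):

import numpy as np

def mutual_information(a: np.ndarray, b: np.ndarray, k: int = 16) -> float:
    # Estimate I(A; B) by quantizing both coefficient arrays into k intervals,
    # building marginal and joint histograms, and applying
    # I = sum_{j,l} P(j, l) * log(P(j, l) / (P(j) * P(l))).
    a, b = a.ravel(), b.ravel()
    ja = np.digitize(a, np.linspace(a.min(), a.max(), k + 1)[1:-1])
    jb = np.digitize(b, np.linspace(b.min(), b.max(), k + 1)[1:-1])
    joint = np.zeros((k, k))
    np.add.at(joint, (ja, jb), 1.0)
    joint /= joint.sum()
    pa, pb = joint.sum(axis=1), joint.sum(axis=0)
    mask = joint > 0
    outer = pa[:, None] * pb[None, :]
    return float(np.sum(joint[mask] * np.log(joint[mask] / outer[mask])))

# e.g., I(LL_i; LH_i) for one pyramid level (assuming the pyramid sketch above):
# mi = mutual_information(pyramid[i]["LL"], pyramid[i]["LH"])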
Illustratively, the determining the distortion degree of the video feature corresponding to the video data set according to a plurality of information amounts includes: calculating a first root mean square error for the low frequency approximation subband and the video dataset; calculating a second root mean square error for the high frequency approximation subband and the video dataset; and determining the video characteristic distortion degree corresponding to the video data set according to the information quantity, the first root mean square error and the second root mean square error.
The distortion degree of the video features measures the degree of information loss introduced in the processes of compressing, transmitting, storing and the like of the video data, and reflects the degradation of the video quality and the loss condition of key features. By evaluating the distortion degree of the video characteristics, the application can judge the performance and effect of video processing and guide corresponding optimization and improvement measures.
In order to determine the distortion degree of the video features, the application needs to comprehensively consider the information quantity distribution of the video iterative signal components and the video quality evaluation index. Specifically, the application can adopt a weighted average method to weight and sum the information amounts of different components and the corresponding distortion degrees to obtain the characteristic distortion degrees of the whole video.
First, the present application calculates distortion degrees of the low-frequency approximation subband LLi and the high-frequency detail subband LHi, HLi, HHi obtained by the ith iterative decomposition, respectively. Here, the present application may employ a mean square error (Mean Squared Error, MSE) as a measure of the distortion degree, which represents the degree of difference between the original video signal and the processed video signal. The formula for the MSE is as follows:
MSE = (1/n) * Σ_{i=1}^{n} (x_i - x̂_i)^2;
where n represents the total number of samples of the video signal, and x_i and x̂_i represent the i-th sample values of the original video signal and the processed video signal, respectively. The larger the MSE value, the higher the distortion of the video signal and the more severe the quality degradation.
For LLi subbands, the application can compare them to the original video signal, calculating the MSE value:
MSE_LLi = (1/m) * Σ_{k=1}^{m} (x_k - LLi(k))^2;
where LLi(k) denotes the k-th coefficient value of the LLi subband.
Similarly, for LHi, HLi, HHi subbands, the present application can calculate their MSE values from the original video signal, respectively:
MSE_LHi = (1/m) * Σ_{k=1}^{m} (x_k - LHi(k))^2;
MSE_HLi = (1/m) * Σ_{k=1}^{m} (x_k - HLi(k))^2;
MSE_HHi = (1/m) * Σ_{k=1}^{m} (x_k - HHi(k))^2;
where LHi(k), HLi(k) and HHi(k) denote the k-th coefficient values of the LHi, HLi and HHi subbands, respectively. After obtaining the MSE value of each subband, the MSE values can be weighted by the corresponding information amounts and averaged to obtain the total distortion of the i-th iterative decomposition:
D_i = (I(LLi) * MSE_LLi + I(LHi) * MSE_LHi + I(HLi) * MSE_HLi + I(HHi) * MSE_HHi) / (I(LLi) + I(LHi) + I(HLi) + I(HHi));
wherein I(LLi), I(LHi), I(HLi) and I(HHi) respectively represent the information amounts of the LLi, LHi, HLi and HHi subbands, and D_i represents the total distortion of the i-th iterative decomposition. This weighted averaging takes into account both the information amount and the distortion of the different subbands, so that subbands with large information amounts and high distortion carry a greater weight in the total distortion, and vice versa.
Finally, the application can accumulate the total distortion of all iterative decomposition to obtain the characteristic distortion of the whole video:
D = Σ_{i=1}^{m} D_i;
where m represents the total number of iterative decompositions and D represents the feature distortion degree of the video. A larger D indicates that more information loss has been introduced during processing and the quality reduction is more obvious; a smaller D indicates that more key features and detail information are preserved and the quality loss is smaller.
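An illustrative sketch of the weighted distortion computation described above; the per-subband MSE values and information amounts are assumed to have been computed already:

def level_distortion(mse: dict, info: dict) -> float:
    # D_i: information-weighted average of the per-subband MSE values.
    bands = ["LL", "LH", "HL", "HH"]
    num = sum(info[b] * mse[b] for b in bands)
    den = sum(info[b] for b in bands)
    return num / den if den > 0 else 0.0

def video_feature_distortion(per_level_mse, per_level_info) -> float:
    # D: sum of the per-level weighted distortions D_i over all iterations.
    return sum(level_distortion(m, i) for m, i in zip(per_level_mse, per_level_info))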
In practical applications, a distortion threshold D_th can be set and compared with the video feature distortion degree D to judge the performance and effect of the video processing. For example, when D ≤ D_th, the video processing result is considered acceptable and the key features and semantic information are well preserved; when D > D_th, the processing is considered to have introduced excessive information loss and the processing algorithm needs to be further optimized and improved.
By adjusting parameters such as the number of iterative decompositions, the quantization intervals and the information amount calculation method, the calculation and evaluation of the video feature distortion degree can be flexibly controlled for different video processing tasks and application scenarios. For example, in video compression and transmission, D_th can be suitably increased to reduce the data rate and transmission overhead while still ensuring video quality; in video semantic analysis and abstract generation, D_th can be suitably reduced to ensure the integrity of key features and semantic information and improve the accuracy of analysis and generation.
Step S103, obtaining video influence factors and candidate topic labels corresponding to the video data set, generating semantic topic labels according to the video influence factors and the candidate topic labels, and determining video frequency control stability indexes and video frequency control distortion indexes according to video feature distortion degrees and the semantic topic labels.
Specifically, the video influence factors corresponding to the preprocessed video data set obtained in the steps are obtained, target detection and scene recognition are carried out on video key frames to generate candidate topic labels, and the video influence factors and the candidate topic labels are combined to generate semantic topic labels of videos. And determining a video frequency control sequence according to the video characteristic distortion degree determined by the steps and the video semantic topic label generated by the steps, and further determining a video frequency control stability index and a video frequency control distortion degree index according to the video frequency control sequence.
In some embodiments, the video impact factor comprises a plurality of impact factor parameters; the obtaining the video influence factor and the candidate topic label corresponding to the video data set, and generating the semantic topic label according to the video influence factor and the candidate topic label, includes: acquiring a plurality of influence factor parameters of the video data set, normalizing the plurality of influence factor parameters, and forming the video influence factor according to the normalized influence factor parameters; performing target detection and scene recognition on each piece of key frame information of the video data set, and acquiring semantic information corresponding to the key frame information; the semantic information comprises object information and scene categories; generating candidate topic labels corresponding to the object information and the scene category respectively; and generating the semantic topic label according to a plurality of candidate topic labels and the video influence factors.
By establishing a connection between the low-level video influence factors and the high-level semantic information, support is provided for subsequent video abstract generation and semantic retrieval.
Firstly, the application needs to acquire the video influence factors corresponding to the preprocessed video data set obtained in the steps. The video influence factor is a set of attributes reflecting the importance and influence of video content, and can comprise a plurality of influence factor parameters such as the playing amount, the praise number, the comment number, the sharing number and other user interaction data of the video, and the duration, the definition, the shooting equipment and other intrinsic attributes of the video. These impact factors characterize the popularity and quality level of the video from different perspectives, which is important for understanding the semantic topic of the video and generating accurate labels.
The application can extract these influencing factors from the metadata of the video dataset and normalize and weight them. For example, for the ith video, the present application may define its influence factor vector if_i as:
IF_i = [v_i, l_i, c_i, s_i, d_i, r_i, ...];
where v_i represents the play amount of the video, l_i represents the praise number, c_i represents the comment number, s_i represents the share number, d_i represents the duration, r_i represents the sharpness, etc. In order to eliminate the dimensional differences of different influence factors, the application can normalize the maximum-minimum value of each influence factor:
IF_i_norm = (IF_i - min(IF)) / (max(IF) - min(IF));
wherein min (IF) and max (IF) represent the minimum and maximum values of all video impact factors, respectively. After normalization, the value range of each influence factor is between 0 and 1, so that comparison and weighting are facilitated.
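A minimal sketch of this column-wise min-max normalization (the column ordering is the hypothetical one listed above: play count, likes, comments, shares, duration, sharpness, ...):

import numpy as np

def normalize_impact_factors(IF: np.ndarray) -> np.ndarray:
    # Column-wise min-max normalization of the impact-factor matrix
    # (rows = videos, columns = individual impact factor parameters).
    lo, hi = IF.min(axis=0), IF.max(axis=0)
    rng = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return (IF - lo) / rng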
Then, the application carries out semantic analysis on the preprocessed video data set, and extracts semantic information in the video key frame through technologies such as target detection, scene recognition and the like. Object detection aims at identifying important objects and people appearing in video pictures, such as faces, vehicles, animals, etc. Scene recognition is intended to determine the scene category to which the video frame belongs, such as indoor, outdoor, city, country, etc. Such semantic information may assist the present application in understanding the content subject matter and context of the video.
For target detection, methods based on deep learning can be adopted, such as convolutional neural networks (CNN) and region proposal networks (RPN): the video key frames are input into a pre-trained detection model, which identifies the target regions and categories within them. For example, using the YOLO (You Only Look Once) algorithm, the bounding box coordinates and class labels of the objects in a key frame can be obtained:
[x1, y1, x2, y2, class_id, confidence];
where (x1, y1) and (x2, y2) represent the upper-left and lower-right corner coordinates of the target bounding box, class_id represents the class number of the target, and confidence represents the confidence of the detection result. A confidence threshold can be set to filter out unlikely detections, and the occurrence frequency and duration of targets of different categories can be counted to generate candidate topic labels.
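An illustrative sketch of this filtering-and-counting step; the detection tuples follow the [x1, y1, x2, y2, class_id, confidence] layout above, and the number of classes is an assumed parameter:

from collections import Counter

def candidate_labels_from_detections(detections, conf_th=0.5, num_classes=80):
    # detections: list of (x1, y1, x2, y2, class_id, confidence) tuples for the
    # key frames of one video, as produced by a YOLO-style detector. Low-confidence
    # boxes are discarded and the remaining class frequencies are turned into a
    # normalized candidate-label vector L_o.
    counts = Counter(int(c) for *_box, c, conf in detections if conf >= conf_th)
    total = sum(counts.values()) or 1
    return [counts.get(cid, 0) / total for cid in range(num_classes)]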
For scene recognition, methods based on image classification, such as CNN or ResNet, can be adopted: the video key frame is input into a pre-trained classification model, and the scene category to which it belongs is determined. For example, using a ResNet model trained on the Places365 dataset, the scene category and confidence of a key frame can be obtained:
[scene_id, confidence];
where scene_id represents the class number of the scene and confidence represents the confidence of the classification result. The method can count the occurrence frequency and duration of different scene categories in the video and generate the candidate theme labels.
After the results of target detection and scene recognition are obtained, the method and the device need to combine the video influence factors and the candidate theme labels to generate the final video semantic theme labels. The process can be regarded as multi-modal information fusion, and needs to comprehensively consider factors such as visual content, user interaction data, video attributes and the like.
One simple fusion method is weighted averaging, i.e., giving different weights to candidate topic labels of different sources, followed by linear combination. For example, the application may define weights of object detection, scene recognition, and impact factors as w_o, w_s, and w_i, respectively, and the semantic topic label of the i-th video may be expressed as:
L_i = w_o * L_oi + w_s * L_si + w_i * L_ii;
Where l_oi, l_si, and l_ii represent candidate topic label vectors resulting from target detection, scene recognition, and impact factors, respectively. These tag vectors may be one-hot coded or word embedded representations reflecting the degree of relevance of different topics or semantic concepts. Weights w_o, w_s, and w_i may be determined by cross-validation or expert knowledge to balance the contributions of the different information sources.
The following is a specific example of how the semantic topic labels of a video are generated. It is assumed that the application has a video of a football match, and the influence factor vector of the video is normalized as follows:
IF_norm = [0.8, 0.6, 0.7, 0.5, 0.9, 0.8, ...];
Through target detection, the application identifies objects such as football, athlete, court and the like in the video key frame, and the generated candidate topic label vector is as follows:
L_o = [0.9, 0.8, 0.7, 0.2, 0.1, ...];
Through scene recognition, the method judges that the video mainly occurs in scenes such as stadium, grassland and the like, and the generated candidate topic label vector is as follows:
L_s = [0.8, 0.6, 0.3, 0.1, 0.2, ...];
according to the influence factors, the application discovers that the playing quantity, the praise number and the sharing number of the video are higher, which indicates that the content is welcomed and focused by users, and the generated candidate topic label vector is:
L_i = [0.7, 0.5, 0.4, 0.2, 0.1, ...];
Assuming that the weights of target detection, scene recognition and influence factors are respectively 0.4, 0.3 and 0.3, the finally generated video semantic topic labels are as follows:
L = 0.4 * L_o + 0.3 * L_s + 0.3 * L_i = 0.4 * [0.9, 0.8, 0.7, 0.2, 0.1, ...] + 0.3 * [0.8, 0.6, 0.3, 0.1, 0.2, ...] + 0.3 * [0.7, 0.5, 0.4, 0.2, 0.1, ...] = [0.81, 0.65, 0.49, 0.17, 0.13, ...];
From the result, the semantic topic labels of the video have higher relevance scores of concepts such as football, athlete, court, stadium and the like, and are consistent with the actual content of the video. These semantic topic labels provide important guiding information for subsequent video summary generation and semantic retrieval.
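For readers who want to reproduce the fusion step, a minimal sketch follows; the vectors and the 0.4/0.3/0.3 weights are the ones from the worked example above, and NumPy is assumed as the numerical library.

```python
import numpy as np

w_o, w_s, w_i = 0.4, 0.3, 0.3          # weights for detection, scene, impact factors
L_o = np.array([0.9, 0.8, 0.7, 0.2, 0.1])
L_s = np.array([0.8, 0.6, 0.3, 0.1, 0.2])
L_i = np.array([0.7, 0.5, 0.4, 0.2, 0.1])

# Linear combination of the candidate topic label vectors
L = w_o * L_o + w_s * L_s + w_i * L_i
print(L)   # approximately [0.81 0.65 0.49 0.17 0.13]
```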
In some embodiments, the determining a video frequency control stationarity indicator and a video frequency control distortion indicator from the video feature distortion and the semantic topic label includes: calculating a video frequency control factor of a video segment corresponding to each key frame information according to the video feature distortion degree and the semantic topic label; generating a video frequency control sequence according to a plurality of video frequency control factors; calculating the mean value and variance corresponding to the first-order difference of the video frequency control sequence, and generating the video frequency control stationarity index according to the mean value and variance; and obtaining the peak value and the dynamic range of the video frequency control sequence, and generating the video frequency control distortion index according to the peak value and the dynamic range.
In the steps, the application generates the label reflecting the video semantic theme by analyzing the video content and the influence factors. These semantic tags provide a high level conceptual description for understanding the content and subject matter of the video. However, semantic tags alone are not sufficient to fully characterize video dynamics and temporal patterns. In the above steps, the application combines the video characteristic distortion degree and the semantic topic label, further analyzes the change rule of the video in the time dimension, and generates a corresponding video frequency control sequence.
Firstly, the application needs to determine the video frequency control sequence according to the video characteristic distortion degree and the semantic topic label. The video characteristic distortion degree reflects the information loss degree of the video in different time periods, and the semantic topic label characterizes the semantic change of the video content. The application can combine the two indexes to generate a comprehensive video frequency control factor.
Specifically, for the ith video clip, the present application may define its frequency control factor as:
FC_i = α * D_i + β * ΔL_i;
Wherein D_i represents the characteristic distortion degree of the segment, ΔL_i represents the semantic topic label difference between the segment and the previous segment, and α and β are balance factors used for adjusting the influence of the distortion degree and the semantic change on frequency control. ΔL_i can be measured by calculating the Euclidean distance or cosine similarity of the semantic topic label vectors of the two segments.
After obtaining the frequency control factor of each video clip, the application can generate a video frequency control sequence:
FC = [FC_1, FC_2, ..., FC_n];
where n is the total number of video segments. The frequency control sequence FC reflects the pattern of variation of the video content in the time dimension; the larger the value of FC_i, the more drastic the content variation of the i-th clip, and the higher the frequency control required to ensure video quality and semantic consistency.
Next, the present application calculates a smoothness index and a distortion index of video frequency control according to the frequency control sequence FC. The smoothness index measures the smoothness of the video content changes, while the distortion index measures the intensity of the video content changes. These two indicators characterize the effect and quality of video frequency control from different angles.
For the stationarity index, the application can use the mean and variance of the first-order difference of the frequency control sequence as the measure. The first-order difference reflects the amplitude of the variation of the frequency control factor between adjacent segments; its mean and variance represent the average level of variation and the degree of fluctuation, respectively.
Specifically, the first-order difference of the frequency control sequence FC is:
ΔFC = [ΔFC_1, ΔFC_2, ..., ΔFC_{n-1}];
where ΔFC_i = FC_{i+1} - FC_i represents the difference in frequency control factor between the i-th segment and the (i+1)-th segment.
The mean of the first-order difference ΔFC is μ_ΔFC, and its variance is σ²_ΔFC. The mean μ_ΔFC reflects the average level of the frequency control factor variation, and the variance σ²_ΔFC reflects the degree of fluctuation of that variation. The smaller the mean and the variance, the higher the smoothness of the video frequency control and the more gradual the changes in the video content.
For the distortion indicator, the application can use the peak value and dynamic range of the frequency control factor as the measure. The peak value represents the maximum value of the frequency control factor and reflects the most intense degree of video content variation. The dynamic range represents the difference between the maximum and minimum values of the frequency control factor, reflecting the range of the amplitude of video content variation.
Specifically, the peak value of the frequency control sequence FC is FC_max = max(FC); the dynamic range of the frequency control sequence FC is FC_range = max(FC) - min(FC). The larger the peak value FC_max and the dynamic range FC_range, the higher the distortion degree of the video frequency control and the more drastically the video content changes.
The application can comprehensively evaluate the effect and quality of video frequency control by integrating the stability index and the distortion index. Ideally, the application expects that the video frequency control has higher smoothness index and lower distortion index, which means that the video content changes smoothly and the distortion introduced by the frequency control is smaller.
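A possible sketch of the frequency control computation described above is given below; the balance factors α = 0.6 and β = 0.4, the use of the Euclidean distance for ΔL_i and the toy inputs are assumptions made for illustration only.

```python
import numpy as np

def frequency_control_indices(distortions, label_vectors, alpha=0.6, beta=0.4):
    """Compute the frequency control sequence FC and derive the
    smoothness indicators (mean/variance of the first-order difference)
    and the distortion indicators (peak/dynamic range)."""
    labels = np.asarray(label_vectors, dtype=float)
    # Semantic change ΔL_i: Euclidean distance to the previous segment's label vector
    delta_L = np.r_[0.0, np.linalg.norm(np.diff(labels, axis=0), axis=1)]
    fc = alpha * np.asarray(distortions, dtype=float) + beta * delta_L

    diff = np.diff(fc)                       # first-order difference ΔFC
    smoothness = {"mean": diff.mean(), "variance": diff.var()}
    distortion = {"peak": fc.max(), "dynamic_range": fc.max() - fc.min()}
    return fc, smoothness, distortion

fc, smooth, dist = frequency_control_indices(
    distortions=[0.2, 0.3, 0.8, 0.4],
    label_vectors=[[1, 0], [0.9, 0.1], [0.1, 0.9], [0.2, 0.8]])
print(fc, smooth, dist)
```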
Step S104, training of the semantic embedding model to be trained is completed according to the semantic topic labels and each key frame information, and semantic embedding vectors corresponding to the video data set output by the semantic embedding model are obtained.
Specifically, the semantic topic labels generated in the steps and the key frame information extracted in the steps are utilized to train a video semantic embedding model to obtain semantic embedding vectors of video contents, so that low-dimensional and high-discrimination vectorization representation of the video contents is realized.
In some embodiments, the training of the semantic embedded model to be trained is completed according to the semantic topic label and each piece of key frame information, and the training comprises the steps of inputting the key frame information into the semantic embedded model to be trained, and obtaining a predicted semantic embedded vector output by the semantic embedded model; coding the semantic topic label to obtain a semantic coding vector; calculating a cross entropy loss function of the prediction semantic embedded vector and the semantic coding vector; and training the semantic embedding model according to the cross entropy loss function.
Specifically, the application first requires the construction of a training data set of a video semantic embedding model. The training data set consists of two parts, namely a video key frame and a corresponding semantic topic label. Wherein the video key frames are obtained by a key frame extraction algorithm in the above steps, and can represent the main content and scene change of the video. The semantic topic labels are generated by analyzing the video content and the influence factors in the steps, and they characterize the semantic topic and attribute of the video in a high-level conceptual form.
For each video, the present application can represent its key frame sequence as:
F = [f_1, f_2, ..., f_m]; where f_i represents the i-th key frame and m is the total number of key frames. Meanwhile, the application can express the semantic topic label of the video as L = [l_1, l_2, ..., l_n]; where l_j represents the j-th semantic topic label and n is the total number of semantic topic labels.
By utilizing the key frame sequence F and the semantic topic label L, the application can construct training samples (F, L) of a video semantic embedding model. The objective of the model is to learn a mapping function F -> E that maps the key frame sequence F into a low-dimensional semantic embedding space E, such that in the E space semantically similar videos have similar embedding vectors and semantically different videos have embedding vectors that are far apart.
In practice, the present application may employ a variety of video semantic embedding models and training algorithms, such as deep learning based Convolutional Neural Networks (CNNs), recurrent Neural Networks (RNNs), attention mechanisms, and the like. These models are able to efficiently learn the spatio-temporal characteristics and semantic information of the video and generate a compact, robust embedded vector representation.
Taking a CNN-based video semantic embedding model as an example, the application can design the following network structure:
1. Input layer: receives a video key frame sequence F, where each key frame is an image of fixed size.
2. Convolution layers: extract the local features and semantic information of the key frames through multi-layer convolution and pooling operations. The size and number of convolution kernels may be adjusted according to the particular task.
3. Fully connected layers: flatten the features extracted by the convolution layers and perform feature transformation and semantic mapping through a multi-layer fully connected network. The number of neurons in the fully connected layers controls the dimension of the embedding vector.
4. Output layer: generates a video semantic embedding vector of fixed dimension as the vectorized representation of the video content.
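The four-layer structure above could be sketched in PyTorch as follows; the kernel sizes, channel counts, the 224×224 input resolution, the 128-dimensional embedding and the averaging over key frames are illustrative assumptions rather than parameters fixed by the application.

```python
import torch
import torch.nn as nn

class KeyframeEmbeddingCNN(nn.Module):
    """Maps a sequence of key frames (m, 3, 224, 224) to one fixed-dimension embedding."""
    def __init__(self, embed_dim=128):
        super().__init__()
        self.features = nn.Sequential(           # convolution + pooling layers
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(7))
        self.fc = nn.Sequential(                  # fully connected semantic mapping
            nn.Flatten(), nn.Linear(64 * 7 * 7, 256), nn.ReLU(),
            nn.Linear(256, embed_dim))            # output layer: embedding vector

    def forward(self, frames):                    # frames: (m, 3, 224, 224)
        per_frame = self.fc(self.features(frames))
        return per_frame.mean(dim=0)              # average over the key frames

model = KeyframeEmbeddingCNN()
print(model(torch.randn(5, 3, 224, 224)).shape)   # torch.Size([128])
```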
In the training process, the application optimizes model parameters by minimizing the loss function between the embedded vector and the semantic topic label. Common loss functions include cross entropy loss, contrast loss, triplet loss, etc., which can measure similarity and variability between embedded vectors and semantic tags.
For example, using a cross entropy loss function, the present application can convert the semantic topic label L into one-hot encoded form and compare it to the embedded vector of the model output:
L_onehot = [0, 0, ..., 1, ..., 0];
Wherein, at the position corresponding to the semantic topic label L, the value is 1, and the remaining positions are 0. The application then converts the embedding vector into a probability distribution through the Softmax function, P = Softmax(E); where E is the video semantic embedding vector and P is the probability distribution output by the Softmax function. The cross entropy loss function is calculated as:
Loss = -Σ_{i=1}^{n} L_onehot_i * log(P_i);
Wherein n is the total number of semantic topic labels, and L_onehot_i and P_i respectively represent the one-hot encoding value and the probability value of the i-th semantic topic label. By minimizing the cross entropy loss, the application can make the embedding vector generated by the model have a higher probability value at the position corresponding to the semantic topic label, thereby realizing the alignment and mapping of the embedding vector and the semantic label.
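A small numerical sketch of the one-hot/Softmax/cross-entropy computation just described is given below; the five-label vocabulary and the example output values are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # subtract the max for numerical stability
    return e / e.sum()

E = np.array([1.2, 0.3, -0.5, 2.1, 0.0])   # model output for five topic labels (illustrative)
L_onehot = np.array([0, 0, 0, 1, 0])        # ground-truth topic at position 3

P = softmax(E)                              # probability distribution
cross_entropy = -np.sum(L_onehot * np.log(P))
print(P, cross_entropy)
```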
After model training is completed, the video key frame sequence can be input into the trained semantic embedding model to obtain the corresponding semantic embedding vector. The embedded vectors represent semantic content and theme properties of the video in a low-dimensional and high-discrimination mode, and can be used for subsequent tasks such as video analysis, retrieval and recommendation.
Step S105, a video semantic index structure is constructed according to the video frequency control stationarity index, the video frequency control distortion index and the semantic embedding vector.
Specifically, according to the video frequency control smoothness index and the video frequency control distortion index determined by the steps, a multi-level video semantic indexing structure is constructed by combining the video content semantic embedded vectors obtained by the steps, and quick and accurate video fragment retrieval and positioning according to semantic similarity are supported.
Step S106, when receiving the video editing request, obtaining a target abstract segment corresponding to the video editing request from the video semantic index structure, and outputting the target abstract segment as a video abstract corresponding to the video editing request.
Specifically, according to the video editing request of the user, retrieving the video segments related to the request from the video semantic indexing structure constructed in the above steps, and generating and outputting the target abstract segments with the subject as the core through semantic similarity calculation and time sequence optimization.
In some embodiments, the constructing a video semantic index structure according to the video frequency control smoothness index, the video frequency control distortion index, and the semantic embedded vector comprises: hierarchical clustering is carried out on the semantic embedded vectors of the video segments corresponding to the key frame information, so that a semantic cluster hierarchical structure is obtained, and the semantic cluster hierarchical structure comprises a plurality of semantic clusters; acquiring a cluster center vector corresponding to each semantic cluster and the video fragment positioned in the semantic cluster; constructing a mapping relation corresponding to the cluster center vector and the video fragment; and constructing the video semantic index structure according to the mapping relation and the semantic cluster hierarchical structure.
Specifically, the application constructs the video semantic index by adopting a method based on hierarchical clustering and inverted indexing. The method comprises the steps of firstly carrying out hierarchical clustering on semantic embedded vectors of video fragments to obtain a tree-shaped semantic cluster hierarchical structure, and then constructing a local inverted index in each semantic cluster to realize hierarchical retrieval and positioning from top to bottom.
The algorithm flow is as follows:
Input: a video clip set V = {v_1, v_2, ..., v_n}, where each video clip v_i contains a semantic embedded vector E_i, a frequency control stationarity index μ_i and a distortion index max_i. Output: a multi-level video semantic index structure I.
1. Hierarchical clustering is carried out on all semantic embedded vectors {E_1, E_2, ..., E_n} in the video fragment set V to obtain a tree-shaped semantic cluster hierarchical structure T. The clustering process can adopt a bottom-up agglomerative clustering algorithm, such as the AGNES (Agglomerative Nesting) algorithm, to gradually merge video segments with similar semantics into a hierarchical nested structure. During clustering, distance measures such as cosine similarity can be used to evaluate the similarity between semantic embedded vectors.
2. Each semantic cluster is processed recursively starting from the root node of the tree structure T. For the current semantic cluster C_j, the following steps are performed:
2.1 If C_j is a leaf node, i.e. the cluster contains only one video clip v_i, v_i is directly added to the inverted index I_j of the current layer, and its frequency control stationarity index μ_i and distortion index max_i are recorded.
2.2 If C_j is a non-leaf node, the above steps are recursively performed on its child nodes (child semantic clusters).
2.3 For all video clips {v_1, v_2, ..., v_m} inside C_j, a local inverted index I_j is constructed. Specifically, for each video segment v_i, its semantic embedded vector E_i is extracted and quantized into a discrete set of semantic words {w_1, w_2, ..., w_k}. Then, an inverted mapping relationship from semantic words to video clips is established, i.e. I_j(w_l) = {(v_i, μ_i, max_i) | w_l ∈ v_i}.
2.4 The video segments within C_j are ordered and filtered to generate a representative cluster center vector E_j. The arithmetic mean or geometric center of the semantic embedded vectors of the video segments within the cluster may be selected as the cluster center vector.
2.5 The cluster center vector E_j is taken as the index item of the current semantic cluster C_j, and the mapping relation from the cluster center to the video segments in the cluster is established, namely:
I(E_j) = {(v_i, μ_i, max_i) | v_i ∈ C_j}.
3. And returning a multi-level video semantic index structure I, wherein the multi-level video semantic index structure I comprises a tree-shaped semantic cluster hierarchical structure T and a local inverted index I_j in each semantic cluster.
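As a hedged illustration of steps 1, 2.4 and 2.5, the following sketch uses SciPy's agglomerative clustering as a stand-in for AGNES and a flat two-cluster cut; the toy embedding vectors and the stationarity/distortion values are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy semantic embedding vectors for 6 video clips (illustrative)
E = np.array([[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
              [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]])
mu = [0.8, 0.7, 0.9, 0.6, 0.75, 0.85]      # stationarity indices μ_i
mx = [0.1, 0.2, 0.05, 0.3, 0.15, 0.1]      # distortion indices max_i

# Bottom-up (agglomerative) clustering with cosine distance
Z = linkage(E, method="average", metric="cosine")
clusters = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 semantic clusters

index = {}
for c in set(clusters):
    members = [i for i in range(len(E)) if clusters[i] == c]
    centroid = E[members].mean(axis=0)               # cluster center vector E_j
    index[c] = {"centroid": centroid,
                "clips": [(f"v_{i+1}", mu[i], mx[i]) for i in members]}
print(index)
```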
When retrieving and positioning video clips, the application can utilize a multi-level video semantic indexing structure I and adopts a top-down hierarchical search strategy:
1. Given a query semantic vector Q, a search is first performed at the top level (root node) of the index structure I. The similarity between Q and each top-level index item (cluster center vector) is calculated, and the several semantic clusters most similar to Q are selected as the candidate set.
2. For each candidate semantic cluster c_j, a search is recursively performed in its subtree. If C_j is a leaf node, the similarity of Q and video clips in the cluster is directly compared, and the most similar clip is selected as a candidate result. If C_j is a non-leaf node, searching in its child nodes continues until a leaf node is reached.
3. In the searching process, the frequency control stationarity index mu_i and the distortion index max_i of the video clips can be utilized for sorting and screening. For example, video clips with higher smoothness and lower distortion may be preferentially selected to ensure stability and quality of the search results.
4. And ordering all candidate video fragments according to the similarity with the query semantic vector Q, and selecting the Top-N most similar fragments as a final retrieval result.
5. And returning metadata information of the Top-N video clips, including start and stop time stamps, belonging semantic clusters and the like in the original video, so as to realize accurate positioning and semantic association of the video clips.
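A simplified sketch of this top-down strategy follows; the dictionary-based tree structure, the beam width of 2 and the toy data are assumptions made for the example (the stationarity and distortion indices are carried along but not used for re-ranking here).

```python
import numpy as np

def cos(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def search(node, query, top_n=3, beam=2):
    """Top-down search: at internal nodes follow the `beam` most similar
    children; at leaf nodes compare the query with the clips directly."""
    if "clips" in node:                      # leaf semantic cluster
        return [(cos(query, e), vid, mu, mx) for vid, e, mu, mx in node["clips"]]
    ranked = sorted(node["children"], key=lambda c: cos(query, c["centroid"]), reverse=True)
    results = []
    for child in ranked[:beam]:
        results += search(child, query, top_n, beam)
    return sorted(results, reverse=True)[:top_n]

tree = {"centroid": [0.5, 0.5], "children": [
    {"centroid": [0.9, 0.1], "clips": [("v_2", [1, 0], 0.8, 0.1), ("v_5", [0.8, 0.2], 0.7, 0.2)]},
    {"centroid": [0.1, 0.9], "clips": [("v_55", [0, 1], 0.85, 0.08)]}]}
print(search(tree, [0.95, 0.05]))
```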
The following is a simple example of how video clip retrieval can be performed using a multi-level video semantic indexing structure:
Assume that the present application has a video data set V containing 100 video segments v_1, v_2, ..., v_100. Through semantic embedding and clustering, the application obtains a three-layer semantic cluster hierarchical structure T:
First layer (root node): C_0 = {v_1, v_2, ..., v_100};
Second layer: C_1 = {v_1, v_2, ..., v_50}, C_2 = {v_51, v_52, ..., v_100};
Third layer: C_3 = {v_1, v_2, ..., v_20}, C_4 = {v_21, v_22, ..., v_50}, C_5 = {v_51, v_52, ..., v_80}, C_6 = {v_81, v_82, ..., v_100}.
Inside each semantic cluster, the application builds a local inverted index. For example, for semantic cluster C_3, its inverted index I_3 may be as follows:
I_3(w_1) = {(v_2, 0.8, 0.1), (v_5, 0.7, 0.2), (v_10, 0.9, 0.05)}; I_3(w_2) = {(v_1, 0.6, 0.3), (v_8, 0.75, 0.15)}; ... where w_1, w_2 and the like represent semantic words, and each semantic word corresponds to a group of video clips together with their stationarity indices and distortion indices.
Now, given a query semantic vector Q, the present application expects to find the Top-3 video segments in the video dataset V that are most similar to Q.
First, search is performed at the top level (root node c_0) of the index structure I, and the similarity between Q and the cluster center vector e_0 of c_0 is calculated. Assuming that the similarity between Q and e_0 is high, two child nodes c_1 and c_2 of c_0 are selected as candidate semantic clusters.
Next, a recursive search is performed in the subtrees of c_1 and c_2. Assuming that video segments v_2 and v_5 highly similar to Q are found in child node c_3 of c_1, video segment v_55 highly similar to Q is found in child node c_5 of c_2.
According to the similarity score and the frequency control index of the video clips, the application can obtain a sorted candidate result list:
v_2 (similarity: 0.95, smoothness: 0.8, distortion: 0.1);
v_55 (similarity: 0.90, smoothness: 0.85, distortion: 0.08);
v_5 (similarity: 0.88, smoothness: 0.7, distortion: 0.2);
Finally, the Top-3 video clips v_2, v_55 and v_5 are selected as the final search results, and their metadata information, such as start and stop time stamps in the original video, is returned.
Exemplary, the obtaining, in the video semantic indexing structure, the target abstract segment corresponding to the video editing request includes: encoding the video editing request to obtain a request text vector corresponding to the video editing request; obtaining the similarity between the request text vector and a plurality of cluster center vectors in the video semantic index structure; determining at least one target cluster center vector from the plurality of cluster center vectors according to a plurality of the similarities; and extracting a target video segment corresponding to the target cluster center vector from the semantic index structure according to the mapping relation corresponding to the target cluster center vector, and taking the target video segment as the target abstract segment.
First, a user's video editing request is represented as a request text vector. Given the user's request text Q, it is encoded with a pre-trained BERT model. Specifically, the request text Q is input into the BERT model, and a context vector representation of the request text is obtained through the multi-layer Transformer encoder. Then, the vector corresponding to the [CLS] tag of the last layer of the BERT model is taken as the semantic embedded vector E_Q of the whole request text.
The request text may be represented as a word sequence {w_1, w_2, ..., w_M}, where M is the length of the request text. The word sequence is input into the BERT model to obtain a context vector {h_1, h_2, ..., h_M}, where h_i is a D-dimensional vector representing the contextual semantic representation of the i-th word in the request text. Next, the vector h_CLS corresponding to the [CLS] tag of the last layer of the BERT model is taken as the semantic embedded vector E_Q of the whole request text, namely:
E_Q = h_CLS; where E_Q is a D-dimensional vector representing the semantic embedded representation of the user request text Q.
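A possible encoding sketch using the Hugging Face transformers library is shown below; the bert-base-uncased checkpoint (so D = 768) is an assumed stand-in for whatever pre-trained BERT model is actually deployed.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint
bert = AutoModel.from_pretrained("bert-base-uncased")

def encode_request(text: str) -> torch.Tensor:
    """Return E_Q: the last-layer [CLS] vector of the request text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state[:, 0, :].squeeze(0)   # D-dimensional (768 here)

E_Q = encode_request("highlights of the football match goals")
print(E_Q.shape)   # torch.Size([768])
```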
Then, the video semantic index structure is searched. The video clips related to the requested semantics are retrieved in the video semantic indexing structure I using the semantic embedded vector e_q of the user request. The index structure I adopts a tree structure, each node represents a semantic cluster, and the leaf nodes correspond to the original video segments.
The retrieval process employs a breadth-first search (BFS) strategy to traverse each level of the index tree layer by layer, starting from the root node. For each node of a layer, the cosine similarity between the request vector E_Q and the cluster center vector E_n of the node is calculated:
Sim(E_Q, E_n) = (E_Q · E_n) / (||E_Q|| * ||E_n||);
Wherein the numerator represents the inner product of the two vectors and the denominator represents the product of the L2 norms of the two vectors. The cosine similarity has a value range of [ -1, 1], and a larger value indicates that the directions of the two vectors are closer, and the semantic similarity is higher.
In calculating the semantic similarity, a similarity threshold ξ is set. For the nodes of each layer, the nodes whose similarity with the request vector E_Q is greater than or equal to the threshold ξ are selected and added to the queue of nodes to be searched at the next layer. This process is repeated until the leaf node layer is reached. For each node of the leaf node layer, if its similarity with the request vector E_Q is greater than or equal to the threshold ξ, the video segment corresponding to the node is added as a target abstract segment to the candidate segment set S. Finally, a candidate segment set S related to the semantics of the user request is obtained, where the target abstract segments come from different videos and time periods.
Then, semantic similarity calculation and sorting are performed. For each target abstract segment s_i in the candidate segment set S, its semantic embedding vector E_i is extracted, and the cosine similarity Sim(E_Q, E_i) between E_i and the request vector E_Q is calculated. The candidate fragments are then sorted in descending order according to the semantic similarity score. After sorting, the candidate set S becomes {s_1, s_2, ..., s_N}, where Sim(E_Q, E_1) ≥ Sim(E_Q, E_2) ≥ ... ≥ Sim(E_Q, E_N). An upper limit K on the number of candidate fragments is set, and the first K fragments are selected from the sorted candidate set S to form a preliminary target abstract fragment set S'.
Further, timing optimization and topic division are performed. The segments in the target summary segment set S' are sorted in ascending order according to their time stamps in the original video, resulting in a time-ordered segment list {s'_1, s'_2, ..., s'_K}. The semantic similarity Sim(E'_i, E'_{i+1}) between each pair of adjacent segments (s'_i, s'_{i+1}) is calculated, where E'_i and E'_{i+1} represent the semantic embedding vectors of the segments s'_i and s'_{i+1}, respectively. A semantic similarity threshold δ is set. Traversing the segment list in order, for each pair of adjacent segments (s'_i, s'_{i+1}), if their semantic similarity Sim(E'_i, E'_{i+1}) is smaller than the threshold δ, a topic boundary is inserted between segments s'_i and s'_{i+1}, dividing them into different sub-topics. For each topic segment, its segment center vector C_j is calculated, i.e., the semantic embedding vectors of all segments within the topic segment are average-pooled:
C_j = (1 / |T_j|) * Σ_{s_i ∈ T_j} E_i;
Where T_j represents the j-th topic segment, |T_j| represents the number of segments within the topic segment, and E_i represents the semantic embedding vector of the i-th segment within the topic segment. For each topic segment T_j, the segment with the largest cosine similarity to the segment center vector C_j is selected as the representative segment s_j of the topic segment. Finally, a candidate summary segment set is generated: the representative target summary segments {s_1, s_2, ..., s_L} of all topic segments are spliced according to the order of their time stamps in the original video to form the final candidate summary segment set S'', where L represents the number of topic segments. Thus, a structured, thematically distinct candidate summary segment set S'' is obtained, in which the segments are not only highly relevant to the user request but also temporally continuous and thematically consistent. The subsequent generation process mainly comprises two stages, namely the construction of a candidate summary segment relation graph and the generation of a globally optimal summary based on iterative edge clipping.
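Before moving to the graph construction, a compact sketch of the timing optimization and topic division just described is given; the threshold δ = 0.7, the (id, timestamp, embedding) tuple format and the toy segments are assumptions.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_topics(segments, delta=0.7):
    """segments: (segment_id, timestamp, embedding) tuples. Sort by timestamp,
    insert a topic boundary wherever adjacent similarity drops below delta,
    then pick the segment closest to each topic's center vector as its representative."""
    segments = sorted(segments, key=lambda s: s[1])
    topics, current = [], [segments[0]]
    for prev, cur in zip(segments, segments[1:]):
        if cos(np.asarray(prev[2], float), np.asarray(cur[2], float)) < delta:
            topics.append(current)
            current = []
        current.append(cur)
    topics.append(current)

    reps = []
    for topic in topics:
        center = np.mean([np.asarray(s[2], float) for s in topic], axis=0)   # C_j
        reps.append(max(topic, key=lambda s: cos(np.asarray(s[2], float), center)))
    return [r[0] for r in reps]   # representative segment ids, in timestamp order

segs = [("s1", 0, [1, 0]), ("s2", 5, [0.9, 0.1]), ("s3", 12, [0.1, 0.9]), ("s4", 20, [0.2, 0.8])]
print(split_topics(segs))   # e.g. ['s1', 's3'], one representative per topic
```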
In the construction phase of the candidate summary segment relation graph, the application represents the candidate summary segment set S'' as a directed weighted graph G = (V, E), where the node set V represents the target summary segments and the edge set E represents the correlations between segments. In order to calculate the edge weight w(i, j) between any two candidate digest segments s_i and s_j, the following four factors need to be considered:
1. Semantic similarity sim(s_i, s_j): the cosine similarity between the target summary segments s_i and s_j is calculated using their semantic embedding vectors E_i and E_j. The cosine similarity formula is:
sim(s_i, s_j) = (E_i · E_j) / (||E_i|| * ||E_j||);
where E_i and E_j represent the semantic embedding vectors of segments s_i and s_j, respectively, and ||·|| represents the L2 norm (Euclidean norm) of a vector. The value range of the semantic similarity is [-1, 1], and the larger the value, the more similar the two fragments are semantically.
2. Temporal adjacency temp(s_i, s_j): measured by the inverse of the time stamp difference of the target digest segments s_i and s_j in the original video. The temporal adjacency formula is temp(s_i, s_j) = 1 / |t_i - t_j|, where t_i and t_j represent the start time stamps of the segments s_i and s_j, respectively, in the original video. The value range of the temporal adjacency is (0, +∞), and a larger value indicates that the two segments are closer in time.
3. Topic relevance theme(s_i, s_j): the cosine similarity between segments s_i and s_j is calculated using the center vectors C_i and C_j of the topic paragraphs to which they belong. The topic relevance formula is:
theme(s_i, s_j) = (C_i · C_j) / (||C_i|| * ||C_j||);
where C_i and C_j represent the center vectors of the topic paragraphs to which the segments s_i and s_j belong, respectively. The value range of the topic relevance is [-1, 1], and the larger the value, the more relevant the two target abstract segments are on the topic.
4. User preference matching degree pref(s_i, s_j): calculated as the weighted cosine similarity between the segments s_i and s_j and the user preference vector P, where P represents a user preference vector learned from the user's historical behavior data. The value range of the user preference matching degree is [-1, 1], and the larger the value, the better the two fragments match the user's preference.
After the construction of the candidate summary segment relation graph G is completed, the method enters the stage of globally optimal summary generation based on iterative edge clipping. The goal of this stage is to select a node subset V' from the graph G, so that the candidate summary segments corresponding to the nodes in V' form a globally optimal video summary while satisfying the following three constraints:
1. Summary duration constraint: the total duration of all fragments in V' does not exceed the preset upper limit L on the summary duration.
2. Semantic consistency constraint: the semantic similarity between adjacent fragments in V' is larger than a preset similarity threshold β.
3. Topic diversity constraint: the proportion of the number of topics contained in V' to the total number of topics of the original video is larger than a preset topic coverage threshold θ.
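A hedged sketch of the edge-weight computation combining the four factors above is given below; the 0.4/0.2/0.2/0.2 weights, the averaging of the two segments' preference similarities and the small epsilon added to the temporal term are assumptions not specified in the text.

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def edge_weight(seg_i, seg_j, P, w=(0.4, 0.2, 0.2, 0.2)):
    """Combine the four factors into one edge weight w(i, j).
    Each segment is a dict with 'E' (embedding), 't' (start time, s)
    and 'C' (center vector of its topic paragraph)."""
    sim   = cos(seg_i["E"], seg_j["E"])                      # semantic similarity
    temp  = 1.0 / (abs(seg_i["t"] - seg_j["t"]) + 1e-6)      # temporal adjacency
    theme = cos(seg_i["C"], seg_j["C"])                      # topic relevance
    pref  = 0.5 * (cos(seg_i["E"], P) + cos(seg_j["E"], P))  # user preference match
    a, b, c, d = w
    return a * sim + b * temp + c * theme + d * pref

s1 = {"E": np.array([1.0, 0.0]), "t": 3.0,  "C": np.array([0.9, 0.1])}
s2 = {"E": np.array([0.8, 0.2]), "t": 10.0, "C": np.array([0.9, 0.1])}
P  = np.array([0.7, 0.3])                                    # learned preference vector
print(edge_weight(s1, s2, P))
```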
In order to solve the global optimal problem efficiently, the application designs an approximation algorithm based on iterative edge clipping. The core idea of the algorithm is to iteratively clip the edges in the candidate summary segment relationship graph G until no more clipping is possible, on the premise that the constraint condition is satisfied. At this time, the maximum connected subgraph of G is the globally optimal video abstract.
The specific flow of the algorithm is as follows:
1. Initialization: let G' = G, V' = V and E' = E, where G' represents the dynamic relation graph during the iteration, V' represents the dynamic node set during the iteration, and E' represents the dynamic edge set during the iteration.
2. Edge clipping: for each edge e(i, j) ∈ E' in G', calculate the total duration L(V') of all fragments in V' after clipping the edge, the minimum semantic similarity β(V') between adjacent fragments, and the topic coverage θ(V'). If, after clipping edge e(i, j), L(V') ≤ L, β(V') ≥ β and θ(V') ≥ θ hold, edge e(i, j) is removed from E'; otherwise, edge e(i, j) is retained. The above process is repeated until no edge can be clipped any further.
3. Connected subgraph extraction: the connected subgraph with the largest number of nodes in the clipped graph G' is extracted and denoted G' = (V', E'), where V' is the candidate abstract segment set corresponding to the globally optimal video abstract.
4. Summary generation: the target summary fragments in V' are arranged according to the order of their time stamps in the original video to generate the final video summary.
By iterative edge clipping, the algorithm can efficiently generate the globally optimal video abstract on the premise of ensuring that constraint conditions are met. The time complexity of the algorithm is O (|E|log|E|), wherein |E| is the edge number of the candidate abstract segment relation graph G. The method based on graph optimization fully considers various correlations among candidate abstract segments, and achieves efficient approximation of a global optimal solution through iterative edge clipping, and the generated video abstract has rich content, reasonable structure and consistent semantics.
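One possible reading of the iterative edge clipping procedure is sketched below with networkx, checking the constraints on the largest connected component after each tentative cut; the undirected-graph simplification, the weakest-edge-first order and the helper inputs (durations, topic_of) are assumptions. Read literally, the clipping rule keeps cutting as long as the constraints remain satisfied, so in practice an additional stopping criterion (for example a minimum summary duration) would likely be combined with it.

```python
import networkx as nx

def prune_summary_graph(G, durations, L_max, beta, theta, total_topics, topic_of):
    """Iteratively remove edges while the duration, semantic-consistency and
    topic-coverage constraints stay satisfied on the largest connected
    component, then keep that component as the summary node set."""
    H = G.copy()

    def ok(nodes):
        dur = sum(durations[n] for n in nodes)
        topics = {topic_of[n] for n in nodes}
        sims = [H.edges[edge]["w"] for edge in H.subgraph(nodes).edges]
        return (dur <= L_max and
                (not sims or min(sims) >= beta) and
                len(topics) / total_topics >= theta)

    changed = True
    while changed:
        changed = False
        for e in sorted(H.edges, key=lambda e: H.edges[e]["w"]):   # weakest edges first
            H.remove_edge(*e)
            component = max(nx.connected_components(H), key=len)
            if ok(component):
                changed = True
                break                      # re-evaluate after each successful cut
            H.add_edge(*e, **G.edges[e])   # cutting would violate a constraint: restore
    best = max(nx.connected_components(H), key=len)
    return sorted(best)                    # summary segments, to be ordered by timestamp

G = nx.Graph()
G.add_edge("s1", "s2", w=0.9); G.add_edge("s2", "s3", w=0.4); G.add_edge("s3", "s4", w=0.8)
print(prune_summary_graph(G, durations={"s1": 10, "s2": 8, "s3": 12, "s4": 9},
                          L_max=30, beta=0.5, theta=0.5, total_topics=2,
                          topic_of={"s1": 0, "s2": 0, "s3": 1, "s4": 1}))
```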
In order to execute the intelligent video editing and abstract generating method based on semantic segmentation corresponding to the above method embodiment, and to realize the corresponding functions and technical effects, an apparatus is provided. Referring to fig. 2, fig. 2 shows a block diagram of an intelligent video editing and summary generating apparatus 200 according to an embodiment of the present application. For convenience of explanation, only the portions related to this embodiment are shown. The intelligent video editing and summary generating apparatus 200 provided in the embodiment of the present application includes:
The data acquisition module 201 is configured to acquire video data acquired by a plurality of video acquisition devices, perform preprocessing on the plurality of video data, and acquire a video data set, where the video data set includes a plurality of key frame information.
The distortion determining module 202 is configured to perform continuous iterative decomposition on the video data set to obtain a video iterative signal component set, where the video iterative signal component set includes a plurality of iterative signal groups, calculate an information amount of each iterative signal group, and determine a video feature distortion degree corresponding to the video data set according to the plurality of information amounts.
The tag obtaining module 203 is configured to obtain a video impact factor and a candidate topic tag corresponding to the video data set, generate a semantic topic tag according to the video impact factor and the candidate topic tag, and determine a video frequency control stability index and a video frequency control distortion index according to the video feature distortion and the semantic topic tag.
The vector obtaining module 204 is configured to complete training of the semantic embedding model to be trained according to the semantic topic label and each key frame information, and obtain a semantic embedding vector corresponding to the video data set output by the semantic embedding model.
The index construction module 205 is configured to construct a video semantic index structure according to the video frequency control smoothness index, the video frequency control distortion index, and the semantic embedded vector.
The summary output module 206 is configured to, when receiving a video editing request, obtain a target summary segment corresponding to the video editing request from the video semantic index structure, and output the target summary segment as a video summary corresponding to the video editing request.
The intelligent video editing and summary generating device 200 can implement the intelligent video editing and summary generating method based on semantic segmentation in the method embodiment. The options in the method embodiments described above are also applicable to this embodiment and will not be described in detail here. The rest of the embodiments of the present application may refer to the content of the above method embodiments, and in this embodiment, no further description is given.
Fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 3, the computer device 3 of this embodiment includes: at least one processor 30 (only one is shown in fig. 3), a memory 31 and a computer program 32 stored in the memory 31 and executable on the at least one processor 30, the processor 30 implementing the steps in any of the method embodiments described above when executing the computer program 32.
The computer device 3 may be a smart phone, a tablet computer, a desktop computer, a cloud server, or other computing devices. The computer device may include, but is not limited to, a processor 30, a memory 31. It will be appreciated by those skilled in the art that fig. 3 is merely an example of the computer device 3 and is not meant to be limiting as the computer device 3, and may include more or fewer components than shown, or may combine certain components, or different components, such as may also include input-output devices, network access devices, etc.
The processor 30 may be a central processing unit (CPU), and the processor 30 may also be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 31 may in some embodiments be an internal storage unit of the computer device 3, such as a hard disk or a memory of the computer device 3. The memory 31 may in other embodiments also be an external storage device of the computer device 3, such as a plug-in hard disk, a smart media card (SMC), a Secure Digital (SD) card or a flash card, which is provided on the computer device 3. Further, the memory 31 may also include both an internal storage unit and an external storage device of the computer device 3. The memory 31 is used for storing an operating system, application programs, a boot loader (BootLoader), data and other programs, such as program codes of the computer program. The memory 31 may also be used for temporarily storing data that has been output or is to be output.
In addition, the embodiment of the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program is executed by a processor to implement the steps in any of the above-mentioned method embodiments.
Embodiments of the present application provide a computer program product which, when run on a computer device, causes the computer device to perform the steps of the method embodiments described above.
In several embodiments provided by the present application, it will be understood that each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present application, and are not to be construed as limiting the scope of the application. It should be noted that any modifications, equivalent substitutions, improvements, etc. made by those skilled in the art without departing from the spirit and principles of the present application are intended to be included in the scope of the present application.
Claims (6)
1. An intelligent video editing and abstract generating method based on semantic segmentation is characterized by comprising the following steps:
Acquiring video data acquired by a plurality of video acquisition devices, preprocessing a plurality of video data, and acquiring a video data set, wherein the video data set comprises a plurality of key frame information;
Performing continuous iterative decomposition on the video data set to obtain a video iterative signal component set, wherein the video iterative signal component set comprises a plurality of iterative signal groups, calculating the information quantity of each iterative signal group, and determining the video characteristic distortion degree corresponding to the video data set according to the information quantity;
Obtaining a video influence factor and a candidate topic label corresponding to the video data set, generating a semantic topic label according to the video influence factor and the candidate topic label, and determining a video frequency control stability index and a video frequency control distortion index according to the video characteristic distortion degree and the semantic topic label;
training a semantic embedding model to be trained according to the semantic topic label and each key frame information, and acquiring a semantic embedding vector corresponding to the video data set output by the semantic embedding model;
constructing a video semantic index structure according to the video frequency control stationarity index, the video frequency control distortion index and the semantic embedded vector;
when a video editing request is received, acquiring a target abstract segment corresponding to the video editing request from the video semantic index structure, and outputting the target abstract segment as a video abstract corresponding to the video editing request;
The iterative signal group comprises a low-frequency approximate sub-band and a high-frequency detail sub-band; the calculating the information quantity of each iterative signal group comprises the following steps:
acquiring first edge probability distribution information of low-frequency approximate subbands of each iterative signal group;
acquiring second edge probability distribution information of high-frequency detail sub-bands of each iterative signal group;
Acquiring joint probability distribution information corresponding to the low-frequency approximate sub-band and the high-frequency detail sub-band of each iterative signal group;
calculating the information quantity of each iterative signal group according to the first edge probability distribution information, the second edge probability distribution information and the joint probability distribution information;
The determining the video characteristic distortion degree corresponding to the video data set according to the information amounts comprises the following steps:
Calculating a first root mean square error for the low frequency approximation subband and the video dataset;
calculating a second root mean square error for the high frequency detail sub-band and the video dataset;
Determining a video characteristic distortion degree corresponding to the video data set according to the information quantity, the first root mean square error and the second root mean square error;
The determining a video frequency control stationarity index and a video frequency control distortion index according to the video feature distortion and the semantic topic label comprises the following steps:
calculating a video frequency control factor of a video segment corresponding to each key frame information according to the video feature distortion degree and the semantic topic label;
Generating a video frequency control sequence according to a plurality of video frequency control factors;
Calculating the mean value and variance corresponding to the first-order difference of the video frequency control sequence, and generating the video frequency control stability index according to the mean value and variance;
Obtaining a peak value and a dynamic range of the video frequency control sequence, and generating the video frequency control distortion index according to the peak value and the dynamic range;
The constructing a video semantic index structure according to the video frequency control stationarity index, the video frequency control distortion index and the semantic embedded vector comprises the following steps:
hierarchical clustering is carried out on the semantic embedded vectors of the video segments corresponding to the key frame information, so that a semantic cluster hierarchical structure is obtained, and the semantic cluster hierarchical structure comprises a plurality of semantic clusters;
acquiring a cluster center vector corresponding to each semantic cluster and the video fragment positioned in the semantic cluster;
constructing a mapping relation corresponding to the cluster center vector and the video fragment;
and constructing the video semantic index structure according to the mapping relation and the semantic cluster hierarchical structure.
2. The method of claim 1, wherein preprocessing a plurality of the video data to obtain a video data set comprises:
converting the format of each video data into a preset format;
Acquiring quality evaluation information corresponding to the converted video data, and determining target video data in a plurality of video data according to the quality evaluation information;
and extracting key frames of each target video data, obtaining the key frame information corresponding to each target video data, and forming the video data set by a plurality of key frame information.
3. The method of claim 1, wherein the video impact factor comprises a plurality of impact factor parameters; the obtaining the video influence factor and the candidate topic label corresponding to the video data set, and generating the semantic topic label according to the video influence factor and the candidate topic label, includes:
Acquiring a plurality of influence factor parameters of the video data set, normalizing the plurality of influence factor parameters, and forming the video influence factor according to the normalized influence factor parameters;
Performing target detection and scene recognition on each piece of key frame information of the video data set, and acquiring semantic information corresponding to the key frame information; the semantic information comprises object information and scene categories;
Generating candidate topic labels corresponding to the object information and the scene category respectively;
and generating the semantic topic label according to a plurality of candidate topic labels and the video influence factors.
4. The method of claim 1, wherein the training of the semantic embedded model to be trained based on the semantic topic labels and each of the key frame information is accomplished, comprising:
Inputting the key frame information into the semantic embedded model to be trained, and obtaining a predicted semantic embedded vector output by the semantic embedded model;
coding the semantic topic label to obtain a semantic coding vector;
Calculating a cross entropy loss function of the prediction semantic embedded vector and the semantic coding vector;
and training the semantic embedding model according to the cross entropy loss function.
5. The method according to claim 1, wherein the obtaining, in the video semantic indexing structure, the target digest segment corresponding to the video editing request includes:
encoding the video editing request to obtain a request text vector corresponding to the video editing request;
obtaining the similarity between the request text vector and a plurality of cluster center vectors in the video semantic index structure;
determining at least one target cluster center vector from the plurality of cluster center vectors according to a plurality of the similarities;
and extracting a target video segment corresponding to the target cluster center vector from the semantic index structure according to the mapping relation corresponding to the target cluster center vector, and taking the target video segment as the target abstract segment.
6. A computer device, comprising: a processor and a memory for storing a computer program which, when executed by the processor, implements the steps of the semantic segmentation based intelligent video editing and summarization method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410807809.9A CN118381980B (en) | 2024-06-21 | 2024-06-21 | Intelligent video editing and abstract generating method and device based on semantic segmentation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410807809.9A CN118381980B (en) | 2024-06-21 | 2024-06-21 | Intelligent video editing and abstract generating method and device based on semantic segmentation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118381980A CN118381980A (en) | 2024-07-23 |
CN118381980B true CN118381980B (en) | 2024-08-27 |
Family
ID=91902207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410807809.9A Active CN118381980B (en) | 2024-06-21 | 2024-06-21 | Intelligent video editing and abstract generating method and device based on semantic segmentation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118381980B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118518984B (en) * | 2024-07-24 | 2024-09-27 | 新疆西部明珠工程建设有限公司 | Intelligent fault positioning system and method for power transmission and distribution line |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298252A (en) * | 2019-05-30 | 2019-10-01 | 平安科技(深圳)有限公司 | Meeting summary generation method, device, computer equipment and storage medium |
CN114943921A (en) * | 2022-05-31 | 2022-08-26 | 西安电子科技大学 | Video text description method fusing multi-granularity video semantic information |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2009035764A2 (en) * | 2007-07-16 | 2009-03-19 | Novafora, Inc. | Method and apparatus for video digest generation |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110298252A (en) * | 2019-05-30 | 2019-10-01 | 平安科技(深圳)有限公司 | Meeting summary generation method, device, computer equipment and storage medium |
CN114943921A (en) * | 2022-05-31 | 2022-08-26 | 西安电子科技大学 | Video text description method fusing multi-granularity video semantic information |
Also Published As
Publication number | Publication date |
---|---|
CN118381980A (en) | 2024-07-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |