Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present invention; all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of the present invention.
In the following description, reference is made to "some embodiments", which describe a subset of all possible embodiments. It is to be understood that "some embodiments" may refer to the same subset or to different subsets of all possible embodiments, and that these may be combined with one another where no conflict arises.
In the following description, the terms "first", "second", "third" and the like are used merely to distinguish similar objects and do not denote a particular order of the objects. It is to be understood that "first", "second" and "third" may, where permitted, be interchanged in a particular order or sequence so that the embodiments of the invention described herein can be practiced in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
In the embodiments of the present application, when the examples are applied, the collection and processing of relevant data should strictly comply with the requirements of relevant laws and regulations: the informed consent or independent consent of the personal information subject should be obtained, and subsequent data use and processing should be carried out within the scope authorized by the laws and regulations and by the personal information subject.
Before the embodiments of the present invention are described in further detail, the terms involved in the embodiments of the present invention are explained; the following definitions apply to these terms as used hereinafter.
1) Video definition: an important indicator for measuring video quality. Definition refers to how clearly each fine detail and its boundary appear in an image, so that image quality can be compared by observing how clear the played-back image is. In the present application, the definition of a video identified in an artificial intelligence manner is referred to as a definition recognition result.
2) Convolutional Neural Network (CNN, Convolutional Neural Networks): a type of feed-forward neural network (FNN, Feedforward Neural Networks) that includes convolution calculations and has a deep structure, and one of the representative algorithms of deep learning. Convolutional neural networks have a representation learning capability and can perform shift-invariant classification of input images according to their hierarchical structure.
3) Foreground: the person or object located in front of, or near the front of, the subject in the video; the definition of the foreground is referred to as foreground definition.
4) Background: the scenery located behind the subject and far from the camera in the video, used to enrich the spatial content of the picture and to reflect characteristics such as place, environment, season and time. The definition of the background is referred to as background definition.
In the related art, methods for determining the definition of a video include: determining the definition of a video according to the definition of target frames; determining the definition of a video based on a 3D convolutional neural network (a deep learning method); and determining the definition of a video based on a 2D convolutional neural network (a deep learning method) combined with a time series model such as a Long Short-Term Memory (LSTM) network. These methods are described below.
(1) Judging the definition of the video according to the definition of target frames: frames are extracted from the video at fixed time points, or some transition frames are filtered out by conventional operators, so as to select the target frames of the video; after the target frames are obtained, gradient features of the target frames are extracted using conventional operators (the Canny operator, the Sobel operator, the Laplacian operator, and the like), a weighted value of the features is calculated, and the weighted value is compared with a preset threshold to obtain the definition of the video.
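By way of a purely illustrative sketch of this target-frame approach (the choice of operators, the weights and the threshold value below are assumptions for illustration and are not prescribed by the related art described above):

```python
import cv2
import numpy as np

def frame_sharpness_score(frame_bgr: np.ndarray) -> float:
    """Weighted gradient score from classical operators (Laplacian + Sobel); weights are assumed."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    lap_var = cv2.Laplacian(gray, cv2.CV_64F).var()        # Laplacian response variance
    sobel = cv2.Sobel(gray, cv2.CV_64F, 1, 1, ksize=3)      # Sobel gradient
    return 0.7 * lap_var + 0.3 * float(np.abs(sobel).mean())

def video_is_clear(target_frames, threshold: float = 100.0) -> bool:
    """Compare the averaged score of the target frames with a preset threshold (assumed value)."""
    scores = [frame_sharpness_score(f) for f in target_frames]
    return float(np.mean(scores)) > threshold
```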
(2) Judging the definition of the video based on a 3D convolutional neural network (a deep learning method): a common 3D convolutional neural network model such as 3D-ResNet is built, the labeled video data are fed into the model for training, and the trained model is finally used to predict the video definition.
(3) Judging the definition of the video based on a 2D convolutional neural network (a deep learning method) combined with a time series model such as an LSTM: a common convolutional neural network model such as 2D-ResNet is built to obtain the features of each frame, the features of the frames are fused, and the video definition is predicted from the fused features.
In implementing the embodiments of the present invention, the following technical problems were found in the practical application of the methods of the related art:
(1) Because videos have characteristics such as rich scenes and rapidly changing content, especially in frequently occurring everyday scenes such as square dancing and street skateboarding, motion frames are easily extracted during frame extraction. If the obtained target frames are all motion frames, the recognition result of the target frames cannot accurately represent the definition of the whole video. Therefore, the recognition result of method (1) depends entirely on the acquired target frames, and the definition of videos of various categories cannot be accurately recognized.
(2) Methods (2) and (3) above take the continuity between frames into consideration, but the processing capacity of the backend is limited in actual service scenarios, so the recognition speed of these two time-series-model-based methods is relatively slow, resulting in low real-time processing efficiency of the backend.
In view of the above problems, embodiments of the present invention provide an artificial-intelligence-based video definition processing method and apparatus, and an electronic device, which can efficiently and accurately recognize the definition of a video.
An exemplary application of the electronic device provided by the embodiments of the present invention is described below. The electronic device provided by the embodiments of the present invention may be a server, for example a server deployed in the cloud: according to a video remotely uploaded by a terminal device, the server extracts a plurality of image frames from the video, performs definition recognition on the foreground of the plurality of image frames to obtain a video definition recognition result, and, depending on the judgment of that recognition result, performs definition recognition on the background of the corresponding image frames to obtain an updated video definition recognition result. The electronic device may also be a terminal device, for example a handheld terminal device, which performs the series of video definition recognition processes on a video input on the terminal device to obtain the video definition recognition result. By running the artificial-intelligence-based video definition processing scheme provided by the embodiments of the present invention, the electronic device can improve the accuracy of video definition recognition, enhance the applicability of video definition processing in actual service scenarios, and improve the processing efficiency with which the electronic device recognizes definition. The scheme is applicable to many application scenarios; for example, a recommendation system can preferentially recommend videos with high definition.
Referring to fig. 1, fig. 1 is a schematic diagram of the architecture of an artificial-intelligence-based video definition processing system 100 according to an embodiment of the present invention. The artificial-intelligence-based video definition processing system 100 includes: a server 200, a network 300, and terminal devices 400 (terminal device 400-1 and terminal device 400-2 are shown as examples). The terminal devices 400 are connected to the server 200 via the network 300, and the network 300 may be a wide area network, a local area network, or a combination of the two.
The terminal device 400 is used to obtain videos, for example when a user (e.g., user A or user B in fig. 1) inputs a video through a video input interface (e.g., selects a local video file or captures a video).
In some embodiments, the terminal device 400 locally executes the artificial-intelligence-based video definition processing method provided in the embodiments of the present invention to obtain a video definition recognition result for a video input by the user. For example, the user opens a video input interface on the terminal device 400 and inputs a video in the video input interface; the terminal device 400 performs a series of definition recognition processes on the video to obtain a video definition recognition result, and the obtained video definition recognition result is displayed on the video input interface 410 of the terminal device 400 (video input interface 410-1 and video input interface 410-2 are shown as examples).
In some embodiments, the terminal device 400 may also send, through the network 300, a video input by the user on the terminal device 400 to the server 200 and invoke the video definition processing function provided by the server 200. The server 200 performs a series of recognition processes on the input video based on the video definition processing method provided by the embodiments of the present invention to obtain a video definition recognition result. For example, the user opens a video input interface on the terminal device 400 and inputs a video in the video input interface; the terminal device sends the video to the server 200 through the network 300; after receiving the video, the server 200 recognizes the video definition and returns the obtained video definition recognition result to the terminal device, which displays the video definition result on the display interface 410 of the terminal device 400; alternatively, the server 200 directly outputs the video definition result of the video.
The embodiments of the present invention can be widely applied to video definition processing scenarios. For example, when the backend of a video APP checks the basic information of a video (whether the video content is clear), a strategy is formulated in combination with the characteristics of the video: a plurality of image frames are extracted from the video, definition recognition is performed on the foreground of the image frames to obtain a video definition recognition result, and, depending on the judgment of that recognition result, definition recognition is performed on the background of the corresponding image frames to obtain an updated video definition recognition result. In this way, the definition of the video is recognized efficiently and accurately, the goal of approximating the definition of the video as perceived by human senses is finally achieved, and real-time processing is accelerated. The artificial-intelligence-based video definition processing system 100 can also be applied to a recommendation system: the obtained video definition result is input into the recommendation system so that the recommendation system recommends videos with higher definition to users, increasing the video click-through rate and the viewing time of users; the obtained video definition result can also be stored in a server for later offline use by the recommendation system. In addition, other scenarios involving video definition processing are potential application scenarios of the present invention.
In the following description, a terminal device is taken as an example of the electronic device. Referring to fig. 2, fig. 2 is a schematic architecture diagram of a terminal device 400 (for example, the terminal device 400-1 or the terminal device 400-2 shown in fig. 1) provided in an embodiment of the present invention. The terminal device 400 shown in fig. 2 includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The various components in the terminal device 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable connection and communication between these components. In addition to the data bus, the bus system 440 includes a power bus, a control bus and a status signal bus. However, for clarity of illustration, the various buses are labeled in fig. 2 as the bus system 440.
The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor (for example, a microprocessor or any conventional processor), a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable presentation of the media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or non-volatile memory, and may also include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 450 described in embodiments of the present invention is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, e.g., a framework layer, a core library layer and a driver layer, used to implement various basic services and to handle hardware-based tasks;
A network communication module 452 for reaching other computing devices via one or more (wired or wireless) network interfaces 420; exemplary network interfaces 420 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), and the like;
A presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with the user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the artificial-intelligence-based video definition processing apparatus provided in the embodiments of the present invention may be implemented in software. Fig. 2 shows the artificial-intelligence-based video definition processing apparatus 455 stored in the memory 450, which may be software in the form of a program, a plug-in or the like, and includes the following software modules: an extraction module 4551, a first recognition module 4552, a first determination module 4553, a second recognition module 4554, a second determination module 4555 and a recommendation module 4556. The extraction module 4551, the first recognition module 4552, the first determination module 4553, the second recognition module 4554 and the second determination module 4555 are used to implement the artificial-intelligence-based video definition processing method provided by the embodiments of the present invention, and the recommendation module 4556 is used to implement recommendation according to the video definition recognition result in the embodiments of the present invention. These modules are logical and can therefore be combined arbitrarily or further split according to the functions implemented. The functions of the respective modules will be described below.
The artificial-intelligence-based video definition processing method provided by the embodiments of the present invention may be executed by the server, by a terminal device (for example, the terminal device 400-1 or the terminal device 400-2 shown in fig. 1), or jointly by the server and the terminal device.
The video definition processing method based on artificial intelligence provided by the embodiment of the invention will be described below in connection with exemplary application and implementation of the terminal device provided by the embodiment of the invention.
Referring to fig. 3 and fig. 4A, fig. 3 is a schematic architecture diagram of the artificial-intelligence-based video definition processing apparatus 455 according to an embodiment of the present invention, showing the flow of implementing video definition processing through a series of modules, and fig. 4A is a schematic flow diagram of an artificial-intelligence-based video definition processing method according to an embodiment of the present invention. The steps shown in fig. 4A will be described with reference to fig. 3.
In step S101, the server extracts a plurality of image frames to be identified from the video.
The user can input the video on the input interface of the terminal; after the input is completed, the terminal forwards the video to the server, and after receiving the video the server extracts a plurality of image frames to be identified from it so as to obtain the definition recognition result of the video from these image frames.
In some embodiments, referring to fig. 3, the server extracting a plurality of image frames to be identified from the video includes: performing equally spaced frame extraction on the video to obtain a first image frame set; clustering the image frames in the first image frame set to obtain a plurality of similar-image-frame subsets, randomly extracting one image frame from each similar-image-frame subset, and combining these with the image frames in the first image frame set that were not clustered into any similar-image-frame subset to form a second image frame set; and filtering out the image frames that satisfy a blurring condition from the second image frame set, and using the remaining image frames in the second image frame set as the image frames to be identified.
As an example, the equally spaced frame extraction performed by the server on the video to obtain the first image frame set may be implemented by a multimedia video processing tool (FFmpeg, Fast Forward MPEG). That is, after receiving the video, the server reads the stream information in the video file, calls a corresponding decoder in the FFmpeg decoding library to open the stream, and decodes a plurality of video frames from the video according to a set number of extracted frames per second, thereby obtaining the first image frame set.
As an example, the filtering of the image frames by the server is implemented by clustering. The clustering of the image frames is specifically described below: the first image frame set is projected into a feature space to obtain an image feature vector corresponding to each image frame; the distance (Euclidean distance or cosine distance) between each feature vector and the other feature vectors is calculated; the feature vectors whose calculated distances fall within a numerical threshold range are classified into the same similar-image-frame category to obtain a plurality of similar-image-frame subsets, and each feature vector that is not clustered into any similar-image-frame subset is treated as a new similar-image-frame category; one image frame is randomly extracted from each similar-image-frame subset to represent that category; and the image frames of all similar-image-frame categories are combined to form the second image frame set.
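As a non-limiting illustration of the frame selection described above (the feature space, the distance threshold and the blur filter below are simplified assumptions for illustration, not the prescribed implementation), a possible Python sketch is:

```python
import glob
import os
import subprocess

import cv2
import numpy as np

def extract_equally_spaced_frames(video_path, fps=1, out_dir="frames"):
    """Decode a fixed number of frames per second with FFmpeg (first image frame set)."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path, "-vf", f"fps={fps}", f"{out_dir}/frame_%04d.png"],
        check=True,
    )
    return [cv2.imread(p) for p in sorted(glob.glob(f"{out_dir}/frame_*.png"))]

def frame_feature(frame):
    """Project a frame into a simple feature space (a normalized color histogram)."""
    hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

def select_frames(frames, dist_thresh=0.3, blur_thresh=50.0):
    """Cluster near-duplicate frames, keep one per subset, then drop overly blurred frames."""
    feats = [frame_feature(f) for f in frames]
    kept, assigned = [], set()
    for i, fi in enumerate(feats):
        if i in assigned:
            continue
        # frames whose feature distance falls within the threshold form one similar subset
        for j in range(i + 1, len(feats)):
            if j not in assigned and np.linalg.norm(fi - feats[j]) < dist_thresh:
                assigned.add(j)
        kept.append(frames[i])  # one representative per similar-image-frame subset
    # filter out frames that satisfy the blurring condition (low Laplacian variance)
    return [
        f for f in kept
        if cv2.Laplacian(cv2.cvtColor(f, cv2.COLOR_BGR2GRAY), cv2.CV_64F).var() > blur_thresh
    ]
```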
The above method for extracting image frames is suitable for various types of videos and can filter out blurred frames introduced by frame extraction, so that the subsequent video definition processing can be performed accurately.
In step S102, the server performs definition recognition on the foreground in the plurality of image frames to obtain the foreground definition of each image frame.
In some embodiments, referring to fig. 3, the server performing definition recognition on the foreground in the plurality of image frames to obtain the foreground definition of each image frame includes performing the following processing for each image frame based on a foreground definition model: mapping the image features of the image frame to confidences corresponding to different foreground definition categories through the forward propagation process among the layers of the foreground definition model, and taking the foreground definition category corresponding to the maximum confidence as the foreground definition of the image frame.
As an example, referring to fig. 3, the foreground definition model includes an input layer, a hidden layer and an output layer. Through the forward propagation process among the input layer, the hidden layer and the output layer of the foreground definition model, the server outputs the confidence of the foreground definition category to which each image frame belongs, and takes the foreground definition category corresponding to the maximum confidence as the foreground definition of the image frame, wherein the foreground definition categories include: foreground clear, foreground general and foreground blurred.
For example, the foreground definition of an image frame is classified into the three categories of foreground blurred, foreground general and foreground clear, and the output for one image frame is: foreground blurred 2%, foreground general 7%, foreground clear 91%; the foreground definition of the image frame is therefore foreground clear. It should be noted that the closer the confidence is to 1, the better the prediction effect.
The forward propagation process of the foreground definition model is described here: the process in which sample data is propagated from lower layers to higher layers is the forward propagation process of the foreground definition model. In forward propagation, the image is fed into the input layer, image features are extracted by the hidden layers, the features are fed into the output layer for classification, and the foreground definition category is obtained; when the output result meets expectations, the result is output.
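A minimal sketch of this forward pass, assuming a PyTorch implementation of the foreground definition model (the model object and category names are illustrative placeholders, not a prescribed interface), could look like:

```python
import torch
import torch.nn.functional as F

FOREGROUND_CATEGORIES = ["foreground_clear", "foreground_general", "foreground_blurred"]

@torch.no_grad()
def foreground_definition(model: torch.nn.Module, frame_tensor: torch.Tensor):
    """Map image features to a confidence per foreground definition category and
    return the category with the maximum confidence."""
    logits = model(frame_tensor.unsqueeze(0))           # forward propagation through all layers
    confidences = F.softmax(logits, dim=1).squeeze(0)   # confidence for each definition category
    best = int(confidences.argmax())
    return FOREGROUND_CATEGORIES[best], confidences.tolist()
```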
In step S103, the server determines a first video definition of the video based on the foreground definition of each image frame, and uses it as the definition recognition result of the video.
Referring to fig. 4B, fig. 4B is a schematic flow chart of an artificial-intelligence-based video definition processing method according to an embodiment of the present invention. In some embodiments, as shown in fig. 4B, step S103 in fig. 4A may be implemented through steps S1031-S1032.
In step S1031, the server determines, based on the foreground definition category to which each image frame belongs, the number of image frames to be identified included in each foreground definition category; in step S1032, the server determines the first video definition of the video according to the proportion of the number of image frames included in each foreground definition category to the total number of image frames, where the total number is the count of the plurality of image frames to be identified.
In some embodiments, the foreground clear category corresponds to a first proportion threshold, the foreground general category corresponds to a second proportion threshold, and the foreground blurred category corresponds to a third proportion threshold, the second proportion threshold, the first proportion threshold and the third proportion threshold being arranged in descending order. The first video definition of the video is determined according to the proportion of the number of image frames included in each foreground definition category to the total number, and can be identified according to corresponding conditions, as exemplified below.
Condition 1) when the proportion of the number of image frames in the foreground clear category to the total number is greater than the first proportion threshold, the first video definition of the video is determined to be clear; when the proportion of the number of image frames in the foreground general category to the total number is greater than the second proportion threshold and the proportion of the number of image frames in the foreground blurred category to the total number is smaller than the third proportion threshold, the first video definition of the video is determined to be general.
Condition 2) when the proportion of the number of image frames in the foreground clear category to the total number is smaller than the first proportion threshold and the proportion of the number of image frames in the foreground blurred category to the total number is zero, the first video definition of the video is determined to be general.
Condition 3) when the proportion of the number of image frames in the foreground blurred category to the total number is greater than the third proportion threshold, the first video definition of the video is determined to be blurred.
For example, assume that the total count of the plurality of image frames to be identified is m frames, where m is a natural number greater than zero. The server invokes the foreground definition model to perform definition recognition on the foreground in the plurality of image frames, obtaining a image frames in the foreground clear category, b image frames in the foreground general category and c image frames in the foreground blurred category, where a, b and c are natural numbers. The following cases can be distinguished:
Case 1) if a/m is greater than the first proportion threshold, the first video definition of the video is judged to be clear;
Case 2) if b/m is greater than the second proportion threshold and c/m is smaller than the third proportion threshold, the first video definition of the video is judged to be general; if a/m is smaller than the first proportion threshold and c is zero, the first video definition of the video is likewise judged to be general;
Case 3) if c/m is greater than the third proportion threshold, the first video definition of the video is judged to be blurred.
As another example of step S1032, the server determining the first video definition of the video according to the proportion of the number of image frames included in each foreground definition category to the total number may further include: determining that the first video definition of the video is clear when the proportion of the number of image frames in the foreground clear category to the total number is greater than the first proportion threshold; determining that the first video definition of the video is general when the proportion of the number of image frames in the foreground clear category to the total number is smaller than the first proportion threshold and the proportion of the number of image frames in the foreground general category to the total number is greater than the second proportion threshold; and determining that the first video definition of the video is blurred when the proportion of the number of image frames in the foreground blurred category to the total number is greater than the third proportion threshold.
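The proportion-based decision of steps S1031-S1032 can be sketched as follows; the three proportion thresholds are left as parameters because their concrete values are not fixed by this description:

```python
def first_video_definition(frame_categories, t1, t2, t3):
    """Decide the first video definition from the per-frame foreground definition categories."""
    m = len(frame_categories)
    a = frame_categories.count("foreground_clear") / m     # proportion of clear frames
    b = frame_categories.count("foreground_general") / m   # proportion of general frames
    c = frame_categories.count("foreground_blurred") / m   # proportion of blurred frames
    if a > t1:
        return "clear"
    if (b > t2 and c < t3) or (a < t1 and c == 0):
        return "general"
    if c > t3:
        return "blurred"
    return "general"  # fallback for combinations not specified above (an assumption)
```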
In step S104, when the first video definition of the video does not meet the definition condition, the server performs definition recognition on the backgrounds of the plurality of image frames, and obtains the background definition of each image frame.
In some embodiments, the server performing definition recognition on the background of the plurality of image frames to obtain the background definition of each image frame includes performing the following processing for each image frame: mapping the image features of the image frame to confidences of different background definition categories, wherein the background definition categories include: background clear and background blurred.
It should be noted that the background definition model is a binary classification model: for each image frame, it outputs background clear and its corresponding confidence, and background blurred and its corresponding confidence; only the background blurred confidence is used in the present application. As an example, the background definition model may employ a ResNet-50 network.
It should be noted that the first video definition of the video may be a qualitative definition category, such as clear, general or blurred, or may be a quantized definition score, for example any score from 0 to 10.
As an example of step S104, definition recognition is performed on the backgrounds of the plurality of image frames when the first video definition of the video is blurred, or when the first video definition of the video is a score lower than a definition score threshold. For example, when the score of the first video definition of the video is 0 to 2, it is determined that the first video definition of the video does not satisfy the definition condition.
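A short sketch of this gate follows; the 0-to-2 score band mirrors the example above, and the assumption that the first result is either a category string or a numeric score is illustrative:

```python
def needs_background_recognition(first_video_definition):
    """Return True when the first video definition fails the definition condition."""
    if isinstance(first_video_definition, str):
        return first_video_definition == "blurred"   # qualitative category
    return first_video_definition <= 2                # quantized 0-10 score in the 0-2 band
```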
In step S105, a second video definition of the video is determined based on the background definition of each image frame and is used as the updated definition recognition result of the video.
Referring to fig. 4C, fig. 4C is a schematic flow chart of an artificial-intelligence-based video definition processing method according to an embodiment of the present invention. In some embodiments, as shown in fig. 4C, step S105 in fig. 4A may be implemented by steps S1051-S1052. In step S1051, the confidences of the background blurred category of the image frames are accumulated and averaged to obtain a mean confidence; in step S1052, when the mean confidence is greater than a confidence threshold, the second video definition of the video is determined to be blurred, and when the mean confidence is less than or equal to the confidence threshold, the second video definition of the video is determined to be general.
In some embodiments, referring to fig. 4C and based on fig. 3, the following steps may be further performed before step S105:
In step S201, category information of a video is acquired;
In step S202, the confidence threshold corresponding to the category information of the video is searched for in the correspondence between a plurality of video categories and confidence thresholds.
It should be noted that the confidence threshold may differ for different categories of video. For example, for dance videos, the person may be ghosted while the background is clear, and slight ghosting of the person does not affect how the video looks to a human; for close-up videos, even slight blurring noticeably worsens how the video looks to a human. The confidence threshold for dance videos may therefore be set higher than that for close-up videos, so the confidence threshold setting may differ across video categories. As an example, the correspondence between video category information and confidence thresholds may be stored in a database of the server or terminal device for the server or terminal device to invoke.
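Steps S1051-S1052, together with the category-specific threshold lookup of steps S201-S202, can be sketched as follows; the threshold table and its values are purely hypothetical:

```python
# Hypothetical per-category confidence thresholds (illustrative values only).
CATEGORY_CONFIDENCE_THRESHOLDS = {"dance": 0.8, "close_up": 0.6}

def second_video_definition(background_blur_confidences, video_category, default_thresh=0.7):
    """Average the per-frame background-blurred confidences and compare the mean with
    the confidence threshold looked up for the video category."""
    mean_conf = sum(background_blur_confidences) / len(background_blur_confidences)
    thre = CATEGORY_CONFIDENCE_THRESHOLDS.get(video_category, default_thresh)
    return "blurred" if mean_conf > thre else "general"
```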
According to the artificial-intelligence-based video definition processing method described above, different recognition modes can be selected automatically for videos of different categories, achieving efficient and accurate recognition of various videos. For clear and general videos, only the foreground definition model needs to be invoked to obtain the definition recognition result efficiently and accurately, which improves the recall and precision for clear videos and improves processing efficiency. For blurred videos, the background definition model is further invoked to perform definition recognition and update the video definition recognition result. In this way, videos judged to be blurred only because of a blurred foreground can be recalled, improving the precision for blurred videos. Finally, the goal of approximating the definition of a video as perceived by human senses can be achieved.
In some embodiments, referring to fig. 4B, based on fig. 4A, following step S105, the following steps may also be performed:
In step S106, the video definition recognition result is sent to the recommendation system, so that the recommendation system executes a corresponding recommendation operation according to the video definition.
In some embodiments, referring to fig. 6, fig. 6 is a schematic architecture diagram of a recommendation system according to an embodiment of the present invention. In fig. 6, the recommendation system includes a definition module, a personalization module, a recall module, a sorting module, a diversity module, and a recommendation module based on diversity and definition. The personalization module is used to compute a user profile from user behaviors so as to obtain interest preferences in different dimensions according to user attributes, historical behaviors, content of interest and the like. The definition module implements the video definition processing procedure so as to obtain candidate videos with higher definition; the definition recognition results obtained by the definition module can be stored locally and used directly. The recall module includes recall models of multiple channels such as collaborative filtering, topic models, content-based recall and social network services (SNS, Social Network Software), which guarantee the diversity of the candidate videos during recall. The sorting module uniformly scores and sorts the recalled results and selects, from the candidate videos, the videos that are of most interest to the user and have high definition, that is, it selects the optimal videos from the candidate videos so as to obtain videos satisfying both the diversity and definition conditions. The recommendation system takes into account multiple dimensions of the recommendation result, such as diversity, definition and personalization, and can meet users' demand for diversity.
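Purely as an illustration of how the sorting module could combine interest and definition (the field names and weights below are assumptions, not part of the system described above):

```python
def rank_candidates(candidates, w_interest=0.7, w_definition=0.3):
    """Sort recalled candidate videos by a weighted combination of interest and definition."""
    return sorted(
        candidates,
        key=lambda v: w_interest * v["interest_score"] + w_definition * v["definition_score"],
        reverse=True,
    )

# Usage (illustrative): top = rank_candidates(recalled_videos)[:20]
```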
The artificial-intelligence-based video definition model enables the backend to process the definition of various videos and saves a great deal of labor cost. Meanwhile, the obtained video definition result can be applied to a recommendation system, in which videos with higher definition are recommended to users to increase the video click-through rate and the viewing time of users; the video definition result can also be stored in a server for subsequent offline use by the recommendation system.
In the following, an exemplary application of the embodiment of the present invention in a practical application scenario will be described.
In implementing the embodiments of the present invention, the related art was found to have the following problems. Taking the application scenario of judging the definition of short videos as an example: with the continuous development of the mobile internet, mobile platforms such as smartphones have risen rapidly, and short videos carried by smartphones/tablets have become a new form of content distribution in recent years. With the explosive growth of short video data, quickly and accurately judging the definition of short videos in a backend system has become important. However, short videos have numerous categories and most of their frames are motion frames, which increases the difficulty of judging their definition. If a target frame is a motion frame, the recognition result cannot accurately represent the definition of the whole short video; on the other hand, recognizing fused inter-frame features through a time series model results in a slower recognition speed. Therefore, how to recognize video definition efficiently and accurately is the problem addressed by the present invention.
As an example, referring to fig. 7, fig. 7 is a schematic diagram of two image frames extracted from a short video according to an embodiment of the present invention; fig. 7 shows a foreground 301, a foreground 302, a background 303 and a background 304. In the related art, because the foregrounds of the two image frames contain ghosting, the recognition result of a related-art model is blurred; however, the backgrounds of the two image frames are relatively clear, and by human senses the definition of the short video containing these two image frames should be judged to be general. For example, for category videos such as dancing and sports, the person in a video frame may be ghosted while the background is relatively clear; for long-shot videos, the subject in a single video frame is too small to be clearly distinguished. Videos of these categories have a good overall appearance, and human senses would judge the definition of such short videos to be general. That is, the definition recognition methods in the related art are not adapted to various business scenarios.
In view of the above problems, the embodiments of the present invention provide an artificial-intelligence-based video definition processing method that not only provides, in combination with the service scenario, video definition processing better suited to that scenario, but also has higher processing efficiency, effectively improving both the efficiency and the precision of video definition processing.
The artificial-intelligence-based video definition processing method provided by the embodiments of the present invention extracts a plurality of image frames from a video, performs definition recognition on the foreground of the plurality of image frames to obtain a video definition recognition result, and, depending on the judgment of that recognition result, performs definition recognition on the background of the corresponding image frames to obtain an updated video definition recognition result.
Referring to fig. 8, fig. 8 is a schematic flow chart of an artificial intelligence-based video definition processing method according to an embodiment of the present invention, and an implementation scheme of the embodiment of the present invention is specifically as follows:
First, k frames are extracted from the video at equal intervals using the multimedia video processing tool FFmpeg; the video frames are clustered according to features such as the color histogram and the Canny operator to filter out repeated frames, and at the same time the frames are pre-screened mainly to filter out frames that are too blurry; finally, m frames are selected from the k frames, where k and m are fixed constants.
Frames are extracted from the short video uploaded by a user, each of the m frames is normalized, and the normalized images are sent into the foreground definition model to judge the definition of the foreground content of each frame; the foreground definition model supports images of any size as input, and its output consists of the three categories foreground clear, foreground general and foreground blurred, together with the corresponding confidences.
In some examples, the definition of the short video is identified from the foreground definition categories according to the following conditions:
Condition 1) if the number of clear frames among the m frames exceeds a first threshold, the short video is clear;
Condition 2) if the number of clear frames does not exceed the first threshold and the remaining frames are general, the short video is general;
Condition 3) if the number of general frames exceeds a second threshold and the number of blurred frames does not exceed a third threshold, the short video is general;
Condition 4) if the number of blurred frames exceeds the third threshold, the frames are input into the background definition model.
If the foreground definition model judges that the video is blurred, the category information of the video is acquired, and the definition of the short video is given based on the confidences of the m-frame results given by the background definition model and the category information of the video.
As an example, assume that the output of the background definition model is (cof_1, cof_2), where cof_1 represents the confidence that the background is blurred and cof_2 represents the confidence that the background is not blurred, with cof_1 + cof_2 = 1. The cof_1 values of the m frames are added and averaged. Because the acceptable range of blur may differ for different categories of video, the threshold for cof_1 is given per category: if cof_1_avg of the m frames is greater than thre, the short video is given a blurred label; otherwise, it is given a general label, where thre is the confidence threshold corresponding to the video category.
In some examples, the background definition model is mainly used to determine the background definition of transition frames (e.g., motion frames in which the subject is blurred but the background is clear). The model is a binary classification model used as an auxiliary judgment to the foreground definition model, determining whether the background of the image is clear. The background definition model mainly adopts a ResNet-50 network.
In some examples, referring to fig. 5, fig. 5 is a schematic structural diagram of the foreground definition model provided by an embodiment of the present invention. The backbone network of the foreground definition model mainly includes convolution layers, pooling layers, residual modules, downsampling layers, an adaptive downsampling layer, a random inactivation (dropout) layer, and fully connected layers. The residual module mainly uses convolution layers with 5×5, 3×3 and 1×1 convolution kernels, followed by a direct (shortcut) connection operation; the downsampling layer downsamples the image mainly by using a convolution layer or pooling layer with a stride of 2; and the adaptive downsampling layer can convert feature maps of any scale with the same number of channels into feature vectors of the same dimension, so that the convolutional neural network model can take images of any scale as input.
In some examples, the framework of the foreground definition model is described with continued reference to fig. 5. As an example, the input layer of the foreground definition model performs normalization processing on the input image, and the hidden layers of the foreground definition model may include: convolution layers, pooling layers, residual modules, downsampling layers, an adaptive pooling layer, a random inactivation layer, and fully connected layers.
Convolution layer: performs convolutional linear mapping on the input image to extract image features. It should be noted that certain features of the image can be extracted from the input image through mathematical operations with convolution kernels, and different convolution kernels extract different features; therefore, by training the foreground definition model, the best-performing image features can be extracted, which reduces the complexity of the model and saves a large amount of computing resources and computation time.
Pooling layer: selects mean pooling or max pooling to obtain the main image features. Mean pooling averages the values in each pooling region, while max pooling divides the feature map into a plurality of rectangular regions and takes the maximum value of each region. After the pooling operation, unimportant image features in the convolution layer's feature map are removed and the number of parameters is reduced, which reduces overfitting.
Downsampling layer: a nonlinear downsampling method. Through four serial downsampling operations, the features extracted by each downsampling can be output in parallel, extracting four groups of feature maps of different sizes; a residual module is added before each downsampling operation. It should be noted that the direct connection in the residual module provides direct transmission through a simple identity mapping, maintains the spatial structure of the gradient and alleviates the problem of gradient shattering in the model.
Adaptive pooling layer: converts the four groups of feature maps of different sizes output by the downsampling layers into four groups of feature maps of the same size, and integrates them into one group of feature maps through a connection process. The connection process adds the four groups of feature maps pairwise through addition operations and outputs all of the summed feature maps. The adaptive downsampling layer automatically calculates the convolution kernel size and the stride of each movement according to the configured input and output image sizes so as to produce the configured output size; that is, the adaptive downsampling layer can convert feature maps of any size with the same number of channels into feature vectors of the same dimension, so that the foreground definition model supports processing images of any size.
Fully connected layer: integrates all of the previously obtained convolutional features into an N-dimensional feature vector. A random inactivation (dropout) layer between the two fully connected layers discards some neuron nodes with a certain probability, weakening the joint adaptability between neuron nodes; for example, the dropout rate may be 50%, i.e., half of the neuron nodes are discarded.
Output layer: classifies the N-dimensional feature vector using a softmax function so as to output the definition category of each frame image and the confidence corresponding to each definition category, where N is a natural number greater than zero.
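A minimal PyTorch sketch of a backbone in the spirit of fig. 5 is given below; the channel counts, the number of stages, the dropout rate and the use of concatenation for the connection processing are simplifying assumptions, not the exact structure of fig. 5:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual module: 5x5, 3x3 and 1x1 convolutions plus a direct (identity) connection."""
    def __init__(self, channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(channels, channels, 5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        return torch.relu(x + self.branch(x))  # identity mapping keeps the gradient structure

class ForegroundDefinitionModel(nn.Module):
    """Sketch: 4 serial downsampling stages, adaptive pooling, dropout between FC layers."""
    def __init__(self, num_classes=3, channels=32):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        # each stage: residual module followed by stride-2 downsampling
        self.stages = nn.ModuleList([
            nn.Sequential(ResidualBlock(channels),
                          nn.Conv2d(channels, channels, 3, stride=2, padding=1))
            for _ in range(4)
        ])
        self.adaptive_pool = nn.AdaptiveAvgPool2d(1)  # any input size -> fixed-length features
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(4 * channels, 128), nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),            # random inactivation between the two FC layers
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        x = self.stem(x)
        pooled = []
        for stage in self.stages:
            x = stage(x)
            pooled.append(self.adaptive_pool(x))   # 4 groups of same-size features
        fused = torch.cat(pooled, dim=1)            # fuse the 4 groups (simplified connection)
        return torch.softmax(self.classifier(fused), dim=1)  # confidences per definition category
```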
In some examples, before the model is used, the definition classification of short videos is formulated based on the nature of the videos themselves and the requirements of the business side: quantitative standards are made for the three categories clear, general and blurred, and the training samples are labeled accordingly.
The artificial-intelligence-based video definition model directly enables the backend to process the definition of short videos and can save a great deal of labor cost. Meanwhile, the obtained result can be applied to a recommendation system, in which short videos with higher definition are recommended to users so as to increase their click-through rate and viewing time.
Continuing with the description of an exemplary structure in which the artificial-intelligence-based video definition processing apparatus 455 provided by the embodiments of the present invention is implemented as software modules, in some embodiments, as shown in fig. 2, the software modules of the artificial-intelligence-based video definition processing apparatus 455 stored in the memory 450 may include:
An extraction module 4551, configured to extract a plurality of image frames to be identified from a video; a first recognition module 4552, configured to perform definition recognition on the foreground in the plurality of image frames to obtain the foreground definition of each image frame; a first determination module 4553, configured to determine a first video definition of the video based on the foreground definition of each image frame and use it as the definition recognition result of the video; a second recognition module 4554, configured to perform definition recognition on the background of the plurality of image frames when the first video definition of the video does not meet the definition condition, to obtain the background definition of each image frame; and a second determination module 4555, configured to determine a second video definition of the video based on the background definition of each image frame and use it as the updated definition recognition result of the video.
In the above solution, the extraction module 4551 is configured to: perform equally spaced frame extraction on the video to obtain a first image frame set; cluster the image frames in the first image frame set to obtain a plurality of similar-image-frame subsets, randomly extract one image frame from each similar-image-frame subset, and combine these with the image frames in the first image frame set that were not clustered into any similar-image-frame subset to form a second image frame set; and filter out the image frames that satisfy a blurring condition from the second image frame set, and use the remaining image frames in the second image frame set as the image frames to be identified.
The first recognition module 4552 is configured to: map the image features of each image frame to confidences corresponding to different foreground definition categories, and take the foreground definition category corresponding to the maximum confidence as the foreground definition of the image frame.
The first determination module 4553 is configured to: determine, based on the foreground definition category to which each image frame belongs, the number of image frames to be identified included in each foreground definition category, wherein the foreground definition categories include foreground clear, foreground general and foreground blurred; and determine the first video definition of the video according to the proportion of the number of image frames included in each foreground definition category to the total number, where the total number is the count of the plurality of image frames to be identified.
The first determination module 4553 is further configured to: operate with the foreground clear category corresponding to a first proportion threshold, the foreground general category corresponding to a second proportion threshold, and the foreground blurred category corresponding to a third proportion threshold, the second proportion threshold, the first proportion threshold and the third proportion threshold being arranged in descending order; determine that the first video definition of the video is clear when the proportion of the number of image frames in the foreground clear category to the total number is greater than the first proportion threshold; determine that the first video definition of the video is general when the proportion of the number of image frames in the foreground general category to the total number is greater than the second proportion threshold and the proportion of the number of image frames in the foreground blurred category to the total number is smaller than the third proportion threshold; determine that the first video definition of the video is general when the proportion of the number of image frames in the foreground clear category to the total number is smaller than the first proportion threshold and the proportion of the number of image frames in the foreground blurred category to the total number is zero; and determine that the first video definition of the video is blurred when the proportion of the number of image frames in the foreground blurred category to the total number is greater than the third proportion threshold.
The second recognition module 4554 is configured to perform the following processing for each image frame: map the image features of the image frame to confidences of different background definition categories, wherein the background definition categories include background clear and background blurred.
The second determination module 4555 is configured to: accumulate the confidences with which each image frame belongs to the background blurred category and take the average to obtain a mean confidence; and determine that the second video definition of the video is blurred when the mean confidence is greater than the confidence threshold, and determine that the second video definition of the video is general when the mean confidence is less than or equal to the confidence threshold.
The second determination module 4555 is further configured to: acquire the category information of the video; and search for the confidence threshold corresponding to the video category information in the correspondence between the plurality of video categories and confidence thresholds.
The recommendation module 4556 is configured to send the definition recognition result of the video to a recommendation system so that the recommendation system performs a corresponding recommendation operation according to the definition of the video.
Embodiments of the present invention provide a storage medium having stored therein executable instructions which, when executed by a processor, cause the processor to perform the method provided by the embodiments of the present invention, for example the artificial-intelligence-based video definition processing method illustrated in fig. 4A, 4B or 4C.
In some embodiments, the storage medium may be an FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disc, or CD-ROM; it may also be any of various devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, and may be stored as part of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or distributed across multiple sites and interconnected by a communication network.
In summary, according to the embodiments of the present invention, a plurality of image frames are extracted from the video, definition recognition is performed on the foreground of the plurality of image frames to obtain a video definition recognition result, and, depending on the judgment of that recognition result, definition recognition is performed on the background of the corresponding image frames to obtain an updated video definition recognition result. In this way, different recognition modes can be selected automatically for videos of different categories, the definition of short videos is recognized efficiently and accurately, and the goal of approximating the definition of a video as perceived by human senses is achieved. For clear and general videos, only the foreground definition model needs to be invoked for definition recognition, so the definition recognition result can be obtained efficiently and accurately, improving the recall and precision for clear videos and improving processing efficiency. For blurred videos, the background definition model is further invoked to perform definition recognition and update the video definition recognition result; in this way, videos judged to be blurred only because of a relatively blurred foreground can be recalled, improving the precision for blurred videos. The obtained video definition result is input into the recommendation system so that the recommendation system recommends videos with higher definition to users, increasing the video click-through rate and the viewing time of users.
The foregoing is merely exemplary embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present invention are included in the protection scope of the present invention.