
CN113766299B - Video data playing method, device, equipment and medium - Google Patents


Info

Publication number
CN113766299B
Authority
CN
China
Prior art keywords
video
segment
target
sample
speed
Prior art date
Legal status
Active
Application number
CN202110492275.1A
Other languages
Chinese (zh)
Other versions
CN113766299A
Inventor
陈小帅
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110492275.1A priority Critical patent/CN113766299B/en
Publication of CN113766299A publication Critical patent/CN113766299A/en
Application granted granted Critical
Publication of CN113766299B publication Critical patent/CN113766299B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/433: Content storage operation, e.g. storage operation in response to a pause request, caching operations
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83: Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845: Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8456: Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The embodiment of the application provides a video data playing method, a device, equipment and a medium. The method relates to the field of artificial intelligence and comprises the following steps: displaying a video playing interface for playing a target video; responding to a first triggering operation for the video playing interface, and displaying N double-speed controls associated with a double-speed mode of the target video, N being a positive integer; and responding to a second trigger operation for the N double-speed controls, determining first double-speed information indicated by the double-speed control corresponding to the second trigger operation, and playing a key video snippet of the target video in the video playing interface, where the key video snippet is a video snippet associated with the first double-speed information and selected from the video snippets of the target video. By adopting the application, personalized double-speed playing can be realized for each user, and the accuracy of double-speed playing can be further improved.

Description

Video data playing method, device, equipment and medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a video data playing method, apparatus, device, and medium.
Background
With the development of multimedia technology, video has become a main carrier for people to acquire information and enjoy entertainment in daily life. It will be appreciated that, when the duration of the video is long, in order to view the video in a limited time, the user (e.g., user Y) will typically view the video using a double speed play mode (e.g., 2 times speed).
In the conventional double-speed play mode, playback is accelerated mainly by mechanically compressing the time points of the image frames, audio frames and the like of the video. For example, at 2x playback the time coordinates of the video frames and audio frames are compressed by a factor of 2 (i.e., halved). Because this time-point compression is applied indiscriminately to all users, scenario content that the user Y may be interested in is inevitably omitted, which lowers the accuracy of double-speed playing.
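For context, the conventional approach described above can be summarized by the following minimal sketch. It is an illustration only, not code from the application; the frame structure and field names are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Frame:
    pts_ms: int      # presentation timestamp in milliseconds
    payload: bytes   # encoded frame data (video or audio)

def compress_timestamps(frames: List[Frame], speed: float) -> List[Frame]:
    # Conventional double-speed playback: every frame is kept and its
    # presentation timestamp is divided by the speed factor, so 2x speed
    # halves the wall-clock duration regardless of the content of the video.
    return [Frame(pts_ms=int(f.pts_ms / speed), payload=f.payload) for f in frames]
```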
Disclosure of Invention
The embodiment of the application provides a video data playing method, a device, equipment and a medium, which can realize personalized double-speed playing of a user and further improve the accuracy of double-speed playing.
In one aspect, an embodiment of the present application provides a video data playing method, including:
displaying a video playing interface for playing the target video;
Responding to a first triggering operation for a video playing interface, and displaying N double-speed controls associated with a double-speed mode of a target video; n is a positive integer;
Responding to a second trigger operation for the N double-speed controls, determining first double-speed information indicated by the double-speed control corresponding to the second trigger operation, and playing a key video snippet of the target video in the video playing interface; the key video snippet is a video snippet associated with the first double-speed information and selected from the video snippets of the target video.
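A hypothetical client-side sketch of these three steps is given below. The function and parameter names (request_key_snippets, play_snippets, and so on) are illustrative assumptions and do not come from the application.

```python
from typing import Callable, List

def on_speed_control_selected(video_id: str,
                              selected_speed: float,
                              current_progress_s: float,
                              request_key_snippets: Callable[[str, float, float], List[dict]],
                              play_snippets: Callable[[List[dict]], None]) -> None:
    # Second trigger operation: the user picked one of the N double-speed controls.
    # The selected control indicates the first double-speed information.
    first_speed_info = selected_speed                  # e.g. 2.0 for "intelligent 2.0x"
    # Ask the server for the key video snippets associated with this speed and the
    # current playing progress, then play them in the video playing interface.
    key_snippets = request_key_snippets(video_id, first_speed_info, current_progress_s)
    play_snippets(key_snippets)
```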
An aspect of an embodiment of the present application provides a video data playing device, including:
the interface display module is used for displaying a video playing interface for playing the target video;
The first response module is used for responding to a first triggering operation for the video playing interface and displaying N double-speed controls associated with the double-speed mode of the target video; n is a positive integer;
The second response module is used for responding to a second trigger operation for the N double-speed controls, determining first double-speed information indicated by the double-speed control corresponding to the second trigger operation, and playing the key video snippet of the target video in the video playing interface; the key video snippet is a video snippet associated with the first double-speed information and selected from the video snippets of the target video.
The video playing interface is a full-screen playing interface for playing the target video;
The first response module includes:
The first display unit is used for responding to a first triggering operation aiming at the full-screen playing interface, triggering a double-speed mode of the target video and displaying a first control display interface independent of the full-screen playing interface based on the double-speed mode; the interface size of the first control display interface is smaller than the interface size of the full-screen playing interface;
and the second display unit is used for displaying N double-speed controls associated with the double-speed mode in the first control display interface.
The video playing interface is a non-full-screen playing interface for playing the target video;
The first response module further includes:
the video playing unit is used for playing the target video in a video playing area of the non-full-screen playing interface;
The third display unit is used for responding to the first triggering operation aiming at the non-full-screen playing interface and displaying a control display area of the target video in the non-full-screen playing interface; the control display area is an area suspended above the video playing area, or the control display area is an area which is not overlapped with the video playing area;
And the fourth display unit is used for responding to the double-speed selection operation aiming at the control display area, triggering the double-speed mode of the target video, and displaying N double-speed controls associated with the double-speed mode in the second control display interface based on the double-speed mode.
Wherein the second response module comprises:
The first determining unit is used for responding to second trigger operations of the N speed-doubling controls, determining first speed-doubling information indicated by the speed-doubling control corresponding to the second trigger operations, and determining the playing progress of the target video as a first playing progress in the video playing interface;
The first checking unit is used for acquiring double-speed playing fragment identification associated with the first double-speed information and the first playing progress from the server based on the first network state when the network state of the application client for playing the target video is checked to belong to the first network state; a double-speed playing segment identifier is used for representing the segment position of a key video segment in the target video;
the segment acquisition unit is used for acquiring the key video segment matched with the double-speed playing segment identifier from the server;
the first playing unit is used for playing the key video snippets of the target video in the video playing interface based on the snippet positions of the key video snippets in the target video.
Wherein the second response module further comprises:
The second determining unit is used for responding to second trigger operations for the N speed-doubling controls, determining first speed-doubling information indicated by the speed-doubling control corresponding to the second trigger operations, and determining the playing progress of the target video as a first playing progress in the video playing interface;
The second checking unit is used for acquiring the key video clip related to the first double-speed information and the first playing progress from the server based on the second network state when the network state of the application client for playing the target video is checked to belong to the second network state; the key video snippets are determined by the server from a key snippet set of the target video based on the first double-speed information and the first playing progress; the key segment set is determined by the server based on the L video segments; the L video clips are determined based on clip interest attributes of K video clips of the target video; l is a positive integer less than K; k is a positive integer;
and the second playing unit is used for playing the key video fragment of the target video in the video playing interface.
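The two groups of units above branch on the network state of the application client: in the first network state the client fetches double-speed playing segment identifiers and then the matching snippets, while in the second network state the server returns the key video snippets directly. A hypothetical sketch of that branching follows; what the two network states correspond to in practice, and all callable names, are assumptions.

```python
from typing import Callable, List

def fetch_key_snippets(network_state: str,
                       speed: float,
                       progress_s: float,
                       get_snippet_ids: Callable[[float, float], List[str]],
                       get_snippet_by_id: Callable[[str], dict],
                       get_snippets_directly: Callable[[float, float], List[dict]]) -> List[dict]:
    # First network state: request only the double-speed playing segment identifiers,
    # each of which marks a key snippet's position in the target video, then fetch
    # each key snippet by its identifier.
    if network_state == "first":
        snippet_ids = get_snippet_ids(speed, progress_s)
        return [get_snippet_by_id(sid) for sid in snippet_ids]
    # Second network state: the server selects and returns the key snippets directly.
    return get_snippets_directly(speed, progress_s)
```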
Wherein the apparatus further comprises:
The third response module is used for taking the double-speed control corresponding to the second trigger operation as a first double-speed control, and responding to the third trigger operation aiming at the video playing interface when the key video clip is played in the video playing interface, and displaying N double-speed controls associated with the double-speed mode; the N speed-doubling controls comprise second speed-doubling controls;
The fourth response module is used for responding to a fourth triggering operation aiming at the second double-speed control, and switching the double-speed information used for double-speed playing of the target video from the first double-speed information indicated by the first double-speed control to the second double-speed information indicated by the second double-speed control;
The progress determining module is used for determining the playing progress of the key video snippet in the target video as a second playing progress, and determining a switching video snippet used for playing in the video playing interface based on the second double-speed information and the second playing progress; the switching video snippet is a video snippet which is selected from the video snippets of the target video and is associated with the second double-speed information and the second playing progress;
And the segment switching module is used for playing the switched video segments in the video playing interface.
In one aspect, an embodiment of the present application provides a video data playing method, including:
Receiving a double-speed playing request which is sent by an application client based on first double-speed information and is associated with a target video; the first double-speed information is used for indicating the application client to play the target video at double speed in a double-speed mode;
Screening L video clips matched with a target user of an application client from K video clips of a target video based on a double-speed playing request, and taking the L video clips as key video clips for double-speed playing of the target video in a double-speed mode; l is a positive integer less than K; k is a positive integer;
And returning the key video snippets to the application client so that the application client plays the key video snippets of the target video.
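A minimal server-side sketch of this flow is shown below, assuming each of the K video segments already has a double-speed evaluation score (how the score is produced is described in the later embodiments). The names and the exact ranking rule are assumptions; the sketch mirrors the later description of ranking the to-be-played segments and keeping roughly 1/S of them.

```python
import math
from typing import Dict, List

def screen_key_segments(request: Dict, segments: List[dict], scores: List[float]) -> List[dict]:
    # The request carries the first double-speed information and the first playing
    # progress, simplified here to a speed factor and the index of the segment
    # that contains the current progress.
    speed = request["speed"]                     # e.g. 2.0
    progress_idx = request["progress_index"]
    candidates = list(range(progress_idx, len(segments)))    # segments still to be played
    candidates.sort(key=lambda i: scores[i], reverse=True)   # rank by evaluation result
    keep = max(1, math.ceil(len(candidates) / speed))        # keep roughly 1/speed of them
    selected = sorted(candidates[:keep])                     # restore playback order
    return [segments[i] for i in selected]                   # L key video segments, L < K
```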
An aspect of an embodiment of the present application provides a video data playing device, including:
The request receiving module is used for receiving a double-speed playing request which is sent by the application client based on the first double-speed information and is associated with the target video; the first double-speed information is used for indicating the application client to play the target video at double speed in a double-speed mode;
The segment determining module is used for screening L video segments matched with a target user of the application client from K video segments of the target video based on the double-speed playing request, and taking the L video segments as key video segments for double-speed playing of the target video in a double-speed mode; l is a positive integer less than K; k is a positive integer;
And the segment returning module is used for returning the key video segments to the application client so that the application client plays the key video segments of the target video.
Wherein the fragment determination module comprises:
The video dividing unit is used for acquiring video identifications of target videos from the double-speed playing request, determining the target videos in the application client based on the video identifications, and dividing the target videos into K video fragments based on the video segmentation parameters;
the first prediction unit is used for obtaining a target network model associated with the target video, predicting a first segment attribute of each of the K video segments through the target network model, and determining the segment highlight degree of each video segment based on the first segment attribute of each video segment;
A second prediction unit for predicting a second segment attribute of each of the K video segments by the target network model, and determining a segment heat of each video segment based on the second segment attribute of each video segment;
A third prediction unit, configured to predict a third segment attribute of each of the K video segments through the target network model, and determine a segment interest level of each video segment based on the third segment attribute of each video segment;
the segment screening unit is used for determining the segment highlight degree, the segment heat and the segment interest degree of each video segment as the segment interest attribute of each video segment, screening L video segments matching the target user of the application client from the K video segments based on the segment interest attribute of each video segment and the double-speed playing request, and taking the L video segments as key video segments for double-speed playing of the target video in the double-speed mode.
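The units above combine three per-segment signals (highlight degree, heat, interest degree) into a single segment interest attribute. The application does not fix a particular combination rule, so the weighted sum below is only an assumed illustration of how the screening unit might turn the attribute into one rankable number.

```python
def segment_evaluation(highlight_degree: float, heat: float, interest_degree: float,
                       weights=(0.4, 0.3, 0.3)) -> float:
    # Segment interest attribute = (highlight degree, heat, interest degree).
    # A weighted sum is one simple way to produce a double-speed evaluation result
    # that the screening unit can rank by; the weights here are assumptions.
    w_highlight, w_heat, w_interest = weights
    return w_highlight * highlight_degree + w_heat * heat + w_interest * interest_degree
```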
Wherein the K video clips comprise video clips S i, and i is a positive integer less than or equal to K;
The first prediction unit includes:
A model acquisition subunit, configured to acquire a target network model associated with a target video; the target network model comprises a first target pre-estimated network for predicting a first segment attribute of the video segment S i;
The first determining subunit is configured to determine, through the first target prediction network, a first image feature vector, a first audio feature vector, and a first text feature vector of the video segment S i;
The first fusion subunit is configured to perform feature fusion on the first image feature vector, the first audio feature vector and the first text feature vector to obtain a first fusion feature vector of the video segment S i, input the first fusion feature vector to a first fully-connected network in the first target prediction network, and perform feature extraction on the first fusion feature vector by using the first fully-connected network to obtain a first target feature vector corresponding to the video segment S i;
The highlight determining subunit is configured to input a first target feature vector into a first classifier in the first target prediction network for classifying attributes of the first segment, output, by the first classifier, a first matching degree between the first target feature vector and first sample feature vectors corresponding to a plurality of first sample attributes in the first classifier, determine a first segment attribute of the video segment S i based on the first matching degree, and determine a segment highlight of the video segment S i based on the first segment attribute.
The first target estimating network comprises a first image processing network, a first audio processing network and a first text processing network;
the first determination subunit includes:
The first extraction subunit is configured to take an image frame in the video segment S i as a first image frame, input the first image frame into the first image processing network, and perform image feature extraction on the first image frame by using the first image processing network to obtain a first image feature vector of the video segment S i;
The second extraction subunit is configured to take an audio frame in the video segment S i as a first audio frame, input the first audio frame to the first audio processing network, and perform audio feature extraction on the first audio frame by the first audio processing network to obtain a first audio feature vector of the video segment S i;
And the third extraction subunit is configured to use the text information associated with the video segment S i as first text information, input the first text information into the first text processing network, and perform text feature extraction on the first text information by using the first text processing network to obtain a first text feature vector of the video segment S i.
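A compact sketch of such a three-modality network is given below using PyTorch. The layer types, dimensions, and number of attributes are assumptions made for illustration and are not specified by the application.

```python
import torch
import torch.nn as nn

class FirstPredictionNet(nn.Module):
    # Sketch of the first target prediction network: an image, an audio and a text
    # processing branch, feature fusion by concatenation, a fully-connected network,
    # and a classifier over the first sample attributes (e.g. highlight / non-highlight).
    def __init__(self, img_dim=512, aud_dim=128, txt_dim=256, hidden=256, n_attrs=2):
        super().__init__()
        self.image_net = nn.Linear(img_dim, 128)   # stand-in for the first image processing network
        self.audio_net = nn.Linear(aud_dim, 128)   # stand-in for the first audio processing network
        self.text_net = nn.Linear(txt_dim, 128)    # stand-in for the first text processing network
        self.fc = nn.Linear(3 * 128, hidden)       # first fully-connected network
        self.classifier = nn.Linear(hidden, n_attrs)

    def forward(self, img_feat, aud_feat, txt_feat):
        fused = torch.cat([self.image_net(img_feat),
                           self.audio_net(aud_feat),
                           self.text_net(txt_feat)], dim=-1)   # first fusion feature vector
        target = torch.relu(self.fc(fused))                    # first target feature vector
        return torch.softmax(self.classifier(target), dim=-1)  # matching degree per attribute
```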
Wherein the first prediction unit further comprises:
The interaction amount determination subunit is used for taking the video segments used for training the first initial pre-estimated network as training segments and determining the bullet-screen interaction amount of each training segment; the bullet-screen interaction amount of a training segment is used for describing the real segment highlight degree of the training segment;
the segment dividing subunit is used for taking a training segment with a bullet-screen interaction amount larger than the interaction threshold as a positive sample segment and taking the real segment highlight degree of the positive sample segment as a first sample label, and taking a training segment with a bullet-screen interaction amount smaller than or equal to the interaction threshold as a negative sample segment and taking the real segment highlight degree of the negative sample segment as a second sample label;
A first association subunit configured to determine a first sample segment for training a first initial pre-estimated network based on the positive sample segment and the negative sample segment, and determine a plurality of first sample attributes based on the first sample tag and the second sample tag;
The second fusion subunit is configured to determine, through the first initial pre-estimation network, a first sample image vector, a first sample audio vector, and a first sample text vector of the first sample segment, perform feature fusion on the first sample image vector, the first sample audio vector, and the first sample text vector to obtain a first sample fusion vector of the first sample segment, and determine a first prediction attribute of the first sample segment based on the first sample fusion vector;
the first training subunit is configured to perform iterative training on the first initial estimated network based on the predicted sample highlight degree corresponding to the first prediction attribute and the real sample highlight degree corresponding to the first sample attribute, so as to obtain the first target estimated network.
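The labeling rule described above (positive and negative samples split by a bullet-screen interaction threshold) could be sketched as follows; the field names are assumptions.

```python
def build_highlight_labels(training_segments, interaction_threshold: int):
    # Segments whose bullet-screen interaction amount exceeds the threshold become
    # positive samples (label 1, "highlight"); the rest become negative samples (label 0).
    samples = []
    for seg in training_segments:
        label = 1 if seg["bullet_interactions"] > interaction_threshold else 0
        samples.append({"segment_id": seg["id"], "highlight_label": label})
    return samples
```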
Wherein the K video clips comprise video clips S i, and i is a positive integer less than or equal to K; the target network model comprises a second target pre-estimated network for predicting a second segment attribute of the video segment S i;
the second prediction unit includes:
the second determining subunit is configured to determine a second image feature vector, a second audio feature vector, and a second text feature vector of the video segment S i through a second target prediction network;
The third fusion subunit is configured to perform feature fusion on the second image feature vector, the second audio feature vector, and the second text feature vector to obtain a second fusion feature vector of the video segment S i, determine a second segment attribute of the video segment S i based on the second fusion feature vector, and determine a first segment heat of the video segment S i based on the second segment attribute;
The average processing subunit is used for acquiring auxiliary video segments of service videos on the platform to which the target video belongs, and determining the average bullet-screen amount corresponding to the video segment S i based on the bullet-screen interaction amounts of the auxiliary video segments and the first double-speed information;
the bullet-screen amount obtaining subunit is used for obtaining the segment bullet-screen amount of the video segment S i, and determining the second segment heat of the video segment S i based on the segment bullet-screen amount and the average bullet-screen amount;
And the heat determining subunit is used for determining the segment heat of the video segment S i according to the first segment heat and the second segment heat of the video segment S i.
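The heat determining subunit combines two signals: a model-predicted heat (the first segment heat) and a bullet-screen-based heat (the second segment heat). The exact combination is not spelled out, so the weighted average below is only an assumed illustration.

```python
def segment_heat(model_heat: float, segment_bullets: int, average_bullets: float,
                 alpha: float = 0.5) -> float:
    # First segment heat: predicted by the second target prediction network.
    # Second segment heat: ratio of the segment's bullet-screen amount to the
    # platform-average bullet-screen amount. The mixing weight alpha is an assumption.
    ratio_heat = segment_bullets / average_bullets if average_bullets > 0 else 0.0
    return alpha * model_heat + (1.0 - alpha) * ratio_heat
```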
Wherein the second prediction unit further comprises:
the play amount determining subunit is used for taking the sample video for training the second initial pre-estimated network as a second sample fragment and determining the video play amount and the play completion amount of the second sample fragment; the video play quantity and the play completion quantity of one second sample fragment are used for describing the real sample heat of one second sample fragment;
a second association subunit, configured to determine, based on a product of the video play amount and the play completion amount, a real sample heat of the second sample segment, and use the determined real sample heat as a plurality of second sample attributes associated with the second sample segment;
The fourth fusion subunit is configured to determine a second sample image vector, a second sample audio vector and a second sample text vector of the second sample segment through a second initial pre-estimation network, perform feature fusion on the second sample image vector, the second sample audio vector and the second sample text vector to obtain a second sample fusion vector of the second sample segment, and determine a second prediction attribute of the second sample segment based on the second sample fusion vector;
and the second training subunit is used for carrying out iterative training on the second initial estimated network based on the predicted sample heat corresponding to the second predicted attribute and the real sample heat corresponding to the second sample attribute to obtain a second target estimated network.
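As described above, the training target for the second network is taken from the product of the video play amount and the play completion amount. A minimal sketch follows; the normalization is an assumption added purely so the value stays in a convenient range.

```python
def real_sample_heat(play_count: int, completion_rate: float, max_play_count: int) -> float:
    # Real sample heat of a second sample segment: product of its video play amount
    # and its play completion amount, scaled to [0, 1] here only for illustration.
    raw = play_count * completion_rate
    return raw / max(max_play_count, 1)
```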
Wherein the K video clips comprise video clips S i, and i is a positive integer less than or equal to K; the target network model comprises a third target pre-estimation network for predicting a third segment attribute of the video segment S i;
The third prediction unit includes:
the first video determining subunit is used for acquiring target associated videos associated with target users in the application client, acquiring target video tags of the target associated videos and taking the target video tags as target interest tags of the target users;
The third determining subunit is configured to determine a target segment feature vector of the video segment S i through a third target pre-estimation network, determine a target associated feature vector of the target associated video through the third target pre-estimation network, and determine a target interest feature vector of the target interest tag through the third target pre-estimation network;
The interest degree determining subunit is configured to determine a third fusion feature vector of the video segment S i based on the target segment feature vector, the target association feature vector, and the target interest feature vector, determine a third segment attribute of the video segment S i based on the third fusion feature vector, and determine a segment interest degree of the video segment S i based on the third segment attribute.
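A hypothetical sketch of the interest-degree prediction follows: the target segment feature vector is fused with a feature vector of the user's associated videos and a feature vector of the user's interest tags, and the fused vector is scored. Dimensions and the fusion rule are assumptions.

```python
import torch
import torch.nn as nn

class InterestNet(nn.Module):
    # Sketch of the third target prediction network: fuse the target segment feature
    # vector with the target associated-video feature vector and the target interest-tag
    # feature vector, then score how interesting the segment is to this particular user.
    def __init__(self, dim=128):
        super().__init__()
        self.fuse = nn.Linear(3 * dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, seg_vec, assoc_vec, tag_vec):
        third_fusion = torch.relu(self.fuse(torch.cat([seg_vec, assoc_vec, tag_vec], dim=-1)))
        return torch.sigmoid(self.score(third_fusion))  # segment interest degree in (0, 1)
```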
Wherein the third prediction unit further comprises:
the completion degree determining subunit is used for taking the sample video used for training the third initial pre-estimated network as a training video and determining the watching completion degree of the sample user on the training video; the watching completion degree of one training video is used for describing the real sample interest degree of one sample user on one training video;
the video dividing subunit is used for taking the training video with the watching completion degree being greater than the completion threshold value as a positive sample video and taking the real sample interest degree of the positive sample video as a first video label, taking the training video with the watching completion degree being less than or equal to the completion threshold value as a negative sample video and taking the real sample interest degree of the negative sample video as a second video label;
a third association subunit configured to determine a third sample segment for training a third initial pre-estimated network based on the positive sample video and the negative sample video, and determine a plurality of third sample attributes based on the first video tag and the second video tag;
the second video determining subunit is used for taking the positive sample video as a sample associated video associated with the sample user, acquiring a sample video tag of the sample associated video, and taking the sample video tag as a sample interest tag of the sample user;
A fourth determining subunit, configured to determine, through the third initial pre-estimation network, a sample segment feature vector of the third sample segment, determine, through the third initial pre-estimation network, a sample association feature vector of the sample associated video, and determine, through the third initial pre-estimation network, a sample interest feature vector of the sample interest tag;
a fifth fusion subunit, configured to determine a third sample fusion vector for a third sample segment based on the sample segment feature vector, the sample association feature vector, and the sample interest feature vector, and determine a third prediction attribute for the third sample segment based on the third sample fusion vector;
And the third training subunit is used for carrying out iterative training on the third initial estimated network based on the predicted sample interest degree corresponding to the third predicted attribute and the real sample interest degree corresponding to the third sample attribute to obtain a third target estimated network.
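The positive/negative split by viewing completion, together with the derivation of the sample user's interest tags from the positive sample videos, could be sketched as follows; the field names are assumptions.

```python
def build_interest_training_data(training_videos, completion_threshold: float):
    # Training videos watched beyond the completion threshold become positive samples;
    # the rest are negative. The tags of the positive (i.e. sample associated) videos
    # double as the sample user's interest tags.
    samples, interest_tags = [], set()
    for video in training_videos:
        positive = video["watch_completion"] > completion_threshold
        samples.append({"video_id": video["id"], "interest_label": 1 if positive else 0})
        if positive:
            interest_tags.update(video["tags"])
    return samples, sorted(interest_tags)
```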
Wherein the fragment screening unit comprises:
the result determining subunit is used for obtaining a double-speed evaluation result of each video clip based on the clip interest attribute of each video clip;
The segment screening subunit is configured to obtain a first playing progress and first double-speed information of the target video in the double-speed playing request, screen L video segments matching with a target user of the application client from K video segments based on the first double-speed information, the first playing progress and the double-speed evaluation result, and use the L video segments as key video segments for double-speed playing of the target video in the double-speed mode.
In one aspect, an embodiment of the present application provides a computer device, including: a processor and a memory;
The processor is connected to the memory, wherein the memory is configured to store a computer program, and when the computer program is executed by the processor, the computer device is caused to execute the method provided by the embodiment of the application.
In one aspect, the present application provides a computer readable storage medium storing a computer program adapted to be loaded and executed by a processor, so that a computer device having the processor performs the method provided by the embodiment of the present application.
In one aspect, embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method provided by the embodiment of the present application.
In the embodiment of the application, when the computer device acquires a double-speed playing request associated with a certain video (for example, a target video requested by a target user) in the application client, the computer device can acquire K video segments of the target video, and then select, from the K video segments, a plurality of video segments associated with the first double-speed information carried by the double-speed playing request, so that the selected video segments are used as key video snippets, where K may be a positive integer. Based on this, the computer device may return the key video snippets to the application client so that the application client plays the key video snippets in the video playing interface. In other words, by determining the relation between the target user of the application client and the K video segments, the embodiment of the application can determine, among the K video segments, the key video snippets matching the target user, where the key video snippets may be the video segments among the K video segments that the target user is interested in. Therefore, for different users, the key video snippets matching each user can be intelligently selected, so that the respectively acquired key video snippets are played in the application clients of the different users. Obviously, by introducing the key video snippets, a personalized double-speed playing mode is provided for the target user when the target user uses the double-speed mode of the target video, and the accuracy of double-speed playing can be further improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a scenario for data interaction according to an embodiment of the present application;
fig. 3 is a flowchart of a video data playing method according to an embodiment of the present application;
fig. 4a is a schematic view of a scene showing a video playing interface according to an embodiment of the present application;
fig. 4b is a schematic view of a scene showing a video playing interface according to an embodiment of the present application;
FIG. 5 is a schematic view of a scenario in which a double-speed selection control is indirectly displayed according to an embodiment of the present application;
FIG. 6a is a schematic diagram of a scenario featuring an indirect display of a speed control according to an embodiment of the present application;
FIG. 6b is a schematic diagram of a scenario featuring an indirect display of a speed control according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a scenario in which a speed control is directly displayed according to an embodiment of the present application;
fig. 8 is a schematic view of a scene for playing a key video snippet according to an embodiment of the present application;
Fig. 9 is a flowchart of a video data playing method according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a first target prediction network according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a second target prediction network according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a third target prediction network according to an embodiment of the present application;
fig. 13 is a schematic view of a scenario of an intelligent double-speed playing method according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a video data playing device according to an embodiment of the present application;
fig. 15 is a schematic structural diagram of a video data playing device according to an embodiment of the present application;
Fig. 16 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be appreciated that Artificial Intelligence (AI) is the theory, method, technique, and application that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a way similar to human intelligence. Artificial intelligence, i.e., research on the design principles and implementation methods of various intelligent machines, enables the machines to have the functions of sensing, reasoning, and decision-making.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The scheme provided by the embodiment of the application mainly relates to an artificial intelligence Computer Vision (CV) technology and a Machine Learning (ML) technology.
Computer Vision (CV) is a science that studies how to make a machine "see"; more specifically, it replaces human eyes with cameras and computers to perform machine vision tasks such as recognition and measurement on a target, and further performs graphic processing so that the result is more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Machine learning is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning.
Specifically, referring to fig. 1, fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application. As shown in fig. 1, the network architecture may include a service server 2000 and a cluster of user terminals. Wherein the user terminal cluster may in particular comprise one or more user terminals, the number of user terminals in the user terminal cluster will not be limited here. As shown in fig. 1, the plurality of user terminals may specifically include a user terminal 3000a, a user terminal 3000b, user terminals 3000c, …, a user terminal 3000n; the user terminals 3000a, 3000b, 3000c, …, 3000n may be directly or indirectly connected to the service server 2000 through a wired or wireless communication manner, respectively, so that each user terminal may perform data interaction with the service server 2000 through the network connection.
The service server 2000 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligence platforms.
Wherein, each user terminal in the user terminal cluster may include: smart terminals with video data processing functions such as smart phones, tablet computers, notebook computers, desktop computers, smart home, wearable devices, vehicle-mounted devices and the like. It should be understood that each user terminal in the user terminal cluster shown in fig. 1 may be integrally provided with a target application (i.e., an application client), and when the application client runs in each user terminal, the application client may interact with the service server 2000 shown in fig. 1 respectively. The application client may specifically include: vehicle clients, smart home clients, entertainment clients (e.g., game clients), multimedia clients (e.g., video clients), social clients, and information-based clients (e.g., news clients), etc.
For easy understanding, the embodiment of the present application may select one user terminal from the plurality of user terminals shown in fig. 1 as the target user terminal. For example, the embodiment of the present application may use the user terminal 3000a shown in fig. 1 as a target user terminal, and the target user terminal may integrate a target application (i.e., an application client) having a video encoding function. At this time, the target user terminal may implement data interaction between the application client and the service server 2000.
For easy understanding, the embodiment of the application can collectively refer to the video (such as a television play) which is selected by a certain user (for example, the user Y) in the video recommendation interface of the application client and is attached to the own interest as a target video.
It should be understood that the service scenario applicable to the network framework may specifically include: the network frame can realize double-speed playing of target videos in service scenes such as entertainment program on-demand scenes, online cinema movie scenes, online classroom class scenes and the like, and the service scenes applicable to the network frame are not listed one by one. For example, in an entertainment program on-demand scenario, the target video may be an entertainment program selected by the user Y in a video recommendation interface (e.g., a video program recommendation list) and fitting the user's own interests. For another example, in an on-line cinema movie viewing scenario, the target video may be a movie selected by the user Y in a video recommendation interface (such as a movie recommendation list) and attached to the user's own interest. For another example, in an on-line classroom class listening scenario, the target video may be a course selected by the user Y in a video recommendation interface (for example, a course recommendation list) and fitting the user's own interests.
For ease of understanding, further, please refer to fig. 2, fig. 2 is a schematic diagram of a scenario for data interaction according to an embodiment of the present application. The server shown in fig. 2 may be the service server 2000 in the embodiment corresponding to fig. 1, and the terminal Z shown in fig. 2 may be any one of the user terminals in the user terminal cluster in the embodiment corresponding to fig. 1, for convenience of understanding, the embodiment of the present application uses the user terminal 3000a shown in fig. 1 as the terminal Z, to describe a specific process of performing data interaction between the terminal Z and the server shown in fig. 2. An application client is installed on the terminal Z, where the application client may be used to play a target video that is interested by a user corresponding to the terminal Z, and the user corresponding to the terminal Z may be the user Y.
The video playing interface 2a and the video playing interface 2b shown in fig. 2 may be video playing interfaces of the application client at different moments. It should be understood that, the video playing interface 2a may be a video playing interface for playing the target video at time T1, and the video playing interface 2b may be a video playing interface for playing the key video snippet at time T2. The key video snippets herein are a plurality of video snippets selected from the video snippets of the target video.
It may be appreciated that the double speed selection control 20c may be included in the video playing interface 2a, the application client may display a first control display interface 20b independent of the video playing interface 2a in response to the first trigger operation for the double speed selection control 20c (i.e., the first control display interface 20b is a sub-interface suspended from the video playing interface 2 a), where the first control display interface 20b may include a number of intelligent double speed controls associated with the double speed mode and a number of universal double speed controls associated with the universal mode, where the number of intelligent double speed controls may be N, the number of universal double speed controls may be M, where N may be equal to 2, where M may be equal to 4. The N intelligent speed control elements may specifically include: the speed-doubling controls corresponding to the intelligent speed-doubling 2.0x and the intelligent speed-doubling 1.5x can specifically include: the speed doubling controls corresponding to '0.5 x', '1.0 x', '1.5 x' and '2.0 x'.
As shown in fig. 2, the intelligent double-speed control corresponding to the "intelligent double-speed 2.0x" may be a first double-speed control 20a, and the user Y may perform a second trigger operation on the first double-speed control 20a in the application client, so that the application client may determine, in response to the second trigger operation performed on the first double-speed control 20a by the user Y, a playing progress (for example, a first playing progress) of the target video in the video playing interface 2a, and send a double-speed playing request to the server based on the first double-speed information (i.e., 2 times speed) and the first playing progress indicated by the "intelligent double-speed 2.0 x".
As shown in fig. 2, the server may receive a double-speed play request sent by a user Y of the application client, and further obtain K video segments of the target video associated with the double-speed play request, where the K video segments are obtained by dividing the target video, for example, the target video is divided into K video segments of 5 seconds, where K may be a positive integer, and the K video segments specifically may include: video clip 1, video clips 2, …, video clip k. Further, the server may perform correlation analysis on the user Y and the K video clips to obtain a double-speed evaluation result of each video clip in the K video clips, for example, the double-speed evaluation result of the video clip 1 may be a double-speed evaluation result 1, the double-speed evaluation result of the video clip 2 may be a double-speed evaluation result 2, …, and the double-speed evaluation result of the video clip K may be a double-speed evaluation result K.
As shown in fig. 2, the server may select L video segments from the K video segments based on the double-speed evaluation result of each of the K video segments, the first double-speed information and the first playing progress, so that the selected L video segments are used as key video segments, where L may be a positive integer less than K. For example, when the user Y is watching the episode content in the video segment 2 of the target video, the user Y may select, in the double-speed mode, the intelligent double-speed control corresponding to the desired speed multiple. Here the first playing progress of the target video falls within the video segment 2, so the server may obtain, from the K video segments, the video segment where the first playing progress is located and the video segments after the first playing progress, that is, obtain the video segments 2, … and the video segment k (which may be called the video segments to be played), rank the video segments 2, … and the video segment k based on the double-speed evaluation results 2, … and the double-speed evaluation result k, and further select, from the video segments 2, … and the video segment k, the 1/S of the video segments with the higher double-speed evaluation results as the key video segments. Here S is determined by the first double-speed information, and S is equal to 2 when the first double-speed information is 2x speed. For example, the server may select the video segment 2, the video segment 4 (not shown in the figure), …, and the video segment k from the video segments 2, …, and the video segment k as the key video segments, where the number of key video segments is L and L = (K-1)/2. When K is equal to 20, K-1 is equal to 19, and L is equal to 9 or 10 (depending on rounding). Alternatively, when K is equal to 21, K-1 is equal to 20, and L is equal to 10.
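The arithmetic in the scene above can be checked with a short helper; a 1-based segment index and ceiling rounding are assumed here (floor rounding would give 9 in the first case).

```python
import math

def num_key_segments(total_segments_k: int, current_segment: int, speed_s: float) -> int:
    # Segments still to be played: the one containing the progress plus all later ones.
    to_play = total_segments_k - current_segment + 1
    return math.ceil(to_play / speed_s)

# K = 20 segments, progress in segment 2, 2x speed: 19 segments remain, about half are kept.
assert num_key_segments(20, 2, 2.0) == 10
# K = 21 segments, progress in segment 2, 2x speed: 20 segments remain, 10 are kept.
assert num_key_segments(21, 2, 2.0) == 10
```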
As shown in fig. 2, when the server determines a key video snippet among K video snippets of the target video, the key video snippet associated with the double-speed play request may be returned to the application client. Therefore, after receiving the key video snippet, the application client may play the key video snippet in the video play interface 2b of the application client. It can be appreciated that, when the key video snippet includes the video snippet 2, the video snippet 4 (not shown in the figure), …, and the video snippet k, the application client may play the key video snippet in the video playing interface 2b based on the playing order (i.e., the snippet position) of the key video snippet in the target video, that is, play the video snippet 2 first, play the video snippet 4 (not shown in the figure) after playing the video snippet 2, …, and play the video snippet k last. Optionally, the application client may also play the key video snippets in the video playing interface 2b based on the multiple speed evaluation result of the key video snippets, that is, first play the video snippet (for example, video snippet k) with the highest multiple speed evaluation result in the key video snippets, …, and finally play the video snippet (for example, video snippet 2) with the lowest multiple speed evaluation result in the key video snippets.
It can be understood that, in the double-speed mode, the application client can replace playing the target video with playing the key video snippets. For example, when the first double-speed information is 2x speed, the number of key video snippets is half (i.e., 1/2) of the number of video snippets to be played, so the time for playing the key video snippets is half of the time for playing the video snippets to be played; that is, the application client shortens the playing time of the target video (to half of the original time) while playing the key video snippets that the target user is more interested in among the video snippets to be played.
Therefore, by performing deep understanding modeling of the video content and the user interest, the embodiment of the application determines, based on the double-speed evaluation result of each video segment, how much the target user (for example, the user Y) expects that segment of the target video, and can keep, when the target user plays the target video in the double-speed play mode, the video segments of the target video that better match the target user's expectations (namely, the key video segments) and present them to the target user. In this way, the embodiment of the application plays the key video segments in the application client to realize the playing of the target video, so that the target user can watch the highlight content within the expected time, thereby improving the viewing experience of the target user in the double-speed playing scene and improving the accuracy of double-speed playing.
The specific implementation manner of the data interaction between the application client and the server can be referred to the embodiments corresponding to fig. 3 to 13 below.
Further, referring to fig. 3, fig. 3 is a flowchart of a video data playing method according to an embodiment of the present application. The method may be executed by an application client, by a server, or by both the application client and the server, where the application client may be the application client in the embodiment corresponding to fig. 2, and the server may be the server in the embodiment corresponding to fig. 2. For ease of understanding, this embodiment is described by taking the method being performed by the application client as an example. The video data playing method at least comprises the following steps S101-S103:
Step S101, displaying a video playing interface for playing a target video;
It can be appreciated that when the target user (e.g., user Y in the embodiment corresponding to fig. 2) needs to watch the target video in the application client, the video recommendation interface of the application client may be obtained, and further, a playing operation is performed for the target video in the plurality of recommended videos of the video recommendation interface. At this time, the application client may respond to the playing operation performed by the target user on the target video, and display a video playing interface (i.e., a non-full-screen playing interface) corresponding to the target video in the application client.
It can be appreciated that when the target user needs to watch the target video in the full-screen mode in the application client, the above-mentioned non-full-screen playing interface may be obtained, and then the first conversion operation is performed for the screen conversion control (for example, the first conversion control) in the non-full-screen playing interface, at this time, the application client may respond to the first conversion operation performed by the target user for the first conversion control, and display the video playing interface corresponding to the target video (i.e., the full-screen playing interface) in the application client.
It may be understood that the full-screen playing interface may also include a screen conversion control (e.g., a second conversion control), and when the target user needs to watch the target video in the application client in a non-full-screen mode, the full-screen playing interface may be acquired, and further, the second conversion operation is performed for the second conversion control in the full-screen playing interface. At this time, the application client may respond to the second conversion operation performed by the target user for the second conversion control, and display, in the application client, that the target video corresponds to the non-full-screen playing interface.
The target video may be a video such as a program, a movie, a television show, or a short video taken from a long video, which is not limited herein.
The playing operation may include clicking, long pressing, sliding, or other contact operations, or may include non-contact operations such as voice and gesture, which are not limited herein.
The first conversion operation and the second conversion operation may include touch operations such as clicking, long pressing, sliding, and the like, and may also include non-touch operations such as voice and gesture, and the application is not limited herein. Wherein the first conversion operation and the second conversion operation may be collectively referred to as a conversion operation.
For easy understanding, please refer to fig. 4a, fig. 4a is a schematic view of a scene showing a video playing interface according to an embodiment of the present application. The video recommendation interface 4a shown in fig. 4a may be a video recommendation interface of an application client, where the video recommendation interface 4a may include a plurality of recommended videos, where the plurality of recommended videos may specifically include: video 40a, video 40b, video 40c, and video 40d.
As shown in fig. 4a, when the user Y needs to watch a recommended video (for example, the video 40b is the target video), a playing operation may be performed on the video 40b, so that the application client may respond to the playing operation performed on the video 40b by the user Y, and send a video playing request to a server corresponding to the application client, so that a video playing interface 4b corresponding to the video 40b (i.e., a non-full-screen playing interface) is displayed in the application client, that is, the video recommending interface 4a is switched to the video playing interface 4b.
For easy understanding, please refer to fig. 4b, fig. 4b is a schematic view of a scene showing a video playing interface according to an embodiment of the present application. The video playing interface 4c shown in fig. 4b may be the video playing interface 4b in the embodiment corresponding to fig. 4a, and the video playing interface 4c may include a screen conversion control. Wherein user Y views video 40b (i.e., the target video) in a non-full screen mode in video playback interface 4 c.
As shown in fig. 4b, when the user Y needs to view the video 40b in the full-screen mode in the application client, the conversion operation may be performed for the screen conversion control (i.e., the first conversion operation is performed for the first conversion control), so that the application client may display the video playing interface 4d (i.e., the full-screen playing interface) corresponding to the video 40b in the application client in response to the conversion operation performed for the screen conversion control by the user Y, that is, switch the video playing interface 4c to the video playing interface 4d.
Step S102, responding to a first triggering operation for a video playing interface, and displaying N double-speed controls associated with a double-speed mode of a target video;
It may be understood that the video playing interface may include a double-speed selection control, and the application client responds to the first trigger operation for the video playing interface and may be understood as responding to the first trigger operation for the double-speed selection control. The first triggering operation may include a touch operation such as clicking, long pressing, sliding, or a non-touch operation such as voice or gesture, which is not limited herein.
It will be appreciated that the application client may directly display a video playback interface containing the speed selection control. Optionally, the application client may further display a video playing interface that does not include the double-speed selection control, and when the target user performs a control display operation on the video playing interface that does not include the double-speed selection control, the application client may display the double-speed selection control in the video playing interface in response to a control display operation performed by the target user on the video playing interface that does not include the double-speed selection control. The video playing interface may be a full-screen playing interface or a non-full-screen playing interface.
The control display operation may include touch operations such as clicking, long pressing, sliding, and the like, and may also include non-touch operations such as voice and gesture, which are not limited herein.
For ease of understanding, please refer to fig. 5, fig. 5 is a schematic diagram of a scenario in which a double-speed selection control is indirectly displayed according to an embodiment of the present application. The video playing interface 5a shown in fig. 5 may be a video playing interface that does not include a double-speed selection control, and when the user Y performs a control display operation (for example, a clicking operation) on the video playing interface 5a, the application client may display a video playing interface 5b that includes a double-speed selection control 50a, and the video playing interface 5b may be the video playing interface 4d in the embodiment corresponding to fig. 4 b.
It may be appreciated that the application client, in response to the first trigger operation for the video playback interface, may display intelligent speed-doubling controls associated with the speed-doubling mode, which may be N in number. Where N may be a positive integer. Optionally, when responding to the first trigger operation for the video playing interface, the application client may further display a universal multiple speed control associated with the universal mode, where the number of universal multiple speed controls may be M. Where M may be a positive integer. The double-speed mode and the universal mode are two different modes for double-speed playing of the target video, and the number of N and M is not limited in the embodiment of the application.
It should be appreciated that where the video playback interface is a non-full screen playback interface for playing the target video, the application client may play the target video in the video playback area of the non-full screen playback interface. Further, the application client may display a control display area of the target video in the non-full-screen playing interface in response to the first trigger operation for the non-full-screen playing interface. The control display area is an area suspended above the video playing area, or the control display area is an area not overlapped with the video playing area. Further, the application client may trigger a double speed mode of the target video in response to a double speed selection operation for the control display area, and display N double speed controls associated with the double speed mode in the second control display interface based on the double speed mode. The second control display interface is an interface suspended above the video playing area, or the second control display interface is an interface not overlapped with the video playing area.
The control display area can be suspended in the video playing area and overlapped with the video playing area. Optionally, the control display area may also be suspended in the video playing area and not overlap the video playing area. Optionally, the control display area may also be a non-floating area in the non-full screen playing interface, which is not overlapped with the video playing area.
Similarly, the second control display interface may be suspended in the video playing area and overlap the video playing area. Optionally, the second control display interface may also be suspended in the video playing area and not overlap the video playing area. Optionally, the second control display interface may also be a non-floating interface that does not overlap the video playing area in the non-full screen playing interface.
The double-speed selection operation may include a touch operation such as clicking, long pressing, sliding, or a non-touch operation such as voice or gesture, and the present application is not limited thereto.
Optionally, when the video playing interface is a non-full-screen playing interface for playing the target video, the application client side responds to a first triggering operation aiming at the non-full-screen playing interface to trigger a double-speed mode of the target video, and displays a second control display interface independent of the non-full-screen playing interface based on the double-speed mode. The interface size of the second control display interface is smaller than that of the non-full-screen playing interface. Further, the application client displays the N speed-doubling controls associated with the speed-doubling mode in the second control display interface.
For ease of understanding, fig. 6a is a schematic diagram of a scenario in which a speed control is indirectly displayed according to an embodiment of the present application. The video playing interface 6a shown in fig. 6a may be the video playing interface 4b in the embodiment corresponding to fig. 4a, and the video playing interface 6a may include a video playing area 60a for playing the target video. Wherein video playback area 60a may include video control controls therein.
As shown in fig. 6a, when the user Y needs to watch the target video in the double-speed mode, a first trigger operation may be performed on the video control (i.e., a first trigger operation is performed on the non-full-screen playing interface), so that the application client may display the control display area 60b in the video playing interface 6a in response to the first trigger operation performed on the video control by the user Y, to obtain the video playing interface 6b. The control display area 60b includes a plurality of controls, where the plurality of controls may include a control 61a, a control 61b, a control 61c, and a control 61d, the control 61a may be a double-speed selection control a, and the control 61b, the control 61c, and the control 61d may be other control controls.
For ease of understanding, please refer to fig. 6b, which is a schematic view of a scenario in which a speed control is indirectly displayed according to an embodiment of the present application. The video playing interface 6c shown in fig. 6b may be the video playing interface 6b in the embodiment corresponding to fig. 6a, and the control display area 60c in the video playing interface 6c may be the control display area 60b in the video playing interface 6b. Among them, a double-speed selection control b (i.e., control 61a) may be included in the control display area 60c.
As shown in fig. 6b, when the user Y needs to watch the target video in the double speed mode, a double speed selection operation may be performed on the double speed selection control b (i.e. a double speed selection operation is performed on the control display area), so that the application client may display a second control display interface 60d in the video playing interface 6c in response to the double speed selection operation performed on the double speed selection control b by the user Y, and further display an intelligent double speed control associated with the double speed mode and a universal double speed control associated with the universal mode in the second control display interface 60d, so as to obtain a video playing interface 6d, where the intelligent double speed controls may be N (e.g. 2), and the universal double speed controls may be M (e.g. 4).
Optionally, it should be understood that, when the video playing interface is a full-screen playing interface for playing the target video, the application client triggers a double-speed mode of the target video in response to a first trigger operation for the full-screen playing interface, and displays a first control display interface independent of the full-screen playing interface based on the double-speed mode. The interface size of the first control display interface is smaller than that of the full-screen playing interface. Further, the application client displays N speed-doubling controls associated with the speed-doubling mode in the first control display interface.
For ease of understanding, please refer to fig. 7, fig. 7 is a schematic diagram of a scenario in which a speed control is directly displayed according to an embodiment of the present application. The video playing interface 7a shown in fig. 7 may be the video playing interface 4d in the embodiment corresponding to fig. 4 b. The video playing interface 7a may include a double-speed selection control c.
As shown in fig. 7, when the user Y needs to watch the target video in the double-speed mode, a first trigger operation may be performed on the double-speed selection control c (i.e., a first trigger operation is performed on the full-screen playing interface), so that the application client may display the first control display interface 7c in the video playing interface 7a in response to the first trigger operation performed on the double-speed selection control c by the user Y, and further display the intelligent double-speed control associated with the double-speed mode and the universal double-speed control associated with the universal mode in the first control display interface 7c, so as to obtain a video playing interface 7b, where the intelligent double-speed controls may be N (e.g., 2), and the universal double-speed controls may be M (e.g., 4).
Step S103, in response to second trigger operation for the N speed-doubling controls, determining first speed-doubling information indicated by the speed-doubling control corresponding to the second trigger operation, and playing the key video clip of the target video in the video playing interface.
The key video snippet is a video snippet which is selected from video snippets of the target video and is associated with the first speed information. It can be understood that the value of the multiple speed of the first multiple speed information is greater than or equal to 1, and when the multiple speed is equal to 1, the application client can play the target video at a normal speed; when the multiple speed is greater than 1, the application client can play the key video clip in the multiple speed mode.
When the target user uses the intelligent double-speed function (i.e., when the target user executes the second trigger operation for the N double-speed controls in the double-speed mode), the application client may send a double-speed play request to the server corresponding to the application client based on the first double-speed information indicated by the double-speed control corresponding to the second trigger operation. It may be understood that after receiving the double-speed play request, the server may obtain J video segments from the K video segments of the target video based on the first playing progress carried by the double-speed play request (i.e., the playing progress of the target video in the application client), where J may be a positive integer, and the J video segments are the video segments after the first playing progress among the K video segments. Further, the server may screen the video segments whose overall double-speed score ranks in the first 1/S from the J video segments based on the double-speed evaluation results (i.e., the overall double-speed scores) of the J video segments and the first double-speed information carried by the double-speed play request, where S may be equal to the speed multiple of the first double-speed information. For example, when the speed multiple of the first double-speed information is 2, S may be equal to 2, and the server may use the video segments whose overall double-speed score ranks in the first 1/2 among the J video segments as the key video segments, and then return the key video segments to the application client.
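For ease of understanding, the following is a minimal sketch of the server-side screening logic described above; the function and field names (select_key_segments, first_play_progress, score, etc.) are illustrative assumptions rather than part of the embodiment.

    def select_key_segments(segments, first_play_progress, speed):
        """Keep the segments whose overall double-speed score ranks in the first 1/S.

        segments: list of dicts like {"id": ..., "start": seconds, "score": float},
        ordered by segment position in the target video (field names are illustrative).
        """
        # the J video segments located after the first playing progress
        remaining = [seg for seg in segments if seg["start"] >= first_play_progress]
        if speed <= 1 or not remaining:
            return remaining  # at normal speed every remaining segment is kept
        # S equals the speed multiple of the first double-speed information
        keep_count = max(1, int(len(remaining) / speed))
        top = sorted(remaining, key=lambda seg: seg["score"], reverse=True)[:keep_count]
        # return the key video segments in playing order (by position in the target video)
        return sorted(top, key=lambda seg: seg["start"])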
The second triggering operation may include a touch operation such as clicking, long pressing, sliding, or a non-touch operation such as voice or gesture, which is not limited herein.
It can be understood that the application client can take the double-speed control corresponding to the second trigger operation as the first double-speed control, and the application client responds to the second trigger operation for the N double-speed controls and can be understood as responding to the second trigger operation for the first double-speed control. Optionally, the N multiple speed controls may further include a second multiple speed control, where the second multiple speed control may also be used to perform multiple speed playing on the target video in the multiple speed mode.
Optionally, the application client may also directly display the first speed control in the video playing interface, so that the application client may directly respond to the second triggering operation for the first speed control without responding to the first triggering operation for the video playing interface, and further play the key video clip of the target video in the video playing interface.
Optionally, the application client may further integrate N speed-doubling controls associated with the speed-doubling mode on one integrated speed-doubling control, and display the integrated speed-doubling control directly in the video playing interface. In this way, the application client may use the speed doubling control corresponding to the second trigger operation as the first speed doubling control when responding to the second trigger operation for the integrated speed doubling control, and further use the speed doubling control corresponding to the fourth trigger operation as the second speed doubling control when responding to the further second trigger operation (for example, the fourth trigger operation) for the integrated speed doubling control, and so on. The M speed doubling controls associated with the universal mode and the N speed doubling controls associated with the speed doubling mode may be integrated on the same integrated speed doubling control or may be integrated on different integrated speed doubling controls, which is not limited in the present application.
It should be appreciated that the application client may indirectly acquire the key video segments and play the indirectly acquired key video segments in the video playing interface. The specific process of the application client indirectly acquiring the key video segments can be described as: the application client can respond to the second trigger operation for the N double-speed controls, determine the first double-speed information indicated by the double-speed control corresponding to the second trigger operation, and determine the playing progress of the target video in the video playing interface as the first playing progress. Further, when it is detected that the network state of the application client for playing the target video belongs to the first network state, the application client may acquire, from the server, the double-speed playing segment identifiers associated with the first double-speed information and the first playing progress based on the first network state. Wherein one double-speed playing segment identifier is used to characterize the segment position of one key video segment in the target video. Further, the application client can obtain the key video segments matched with the double-speed playing segment identifiers from the server. Further, the application client may play the key video segments of the target video in the video playing interface based on the segment positions of the key video segments in the target video.
The first network state represents a weak network state. When the application client is in the weak network state, each time the application client receives a double-speed playing segment identifier returned by the server, it may send a segment acquisition request to the server based on that identifier, so as to acquire the key video segment matched with the double-speed playing segment identifier. For example, the double-speed playing segment identifiers acquired by the application client may be identifier B1, identifier B2, …, and identifier BL; the application client may then send a segment acquisition request to the server based on identifier B1 to acquire the video segment P1 matching identifier B1, send a segment acquisition request to the server based on identifier B2 to acquire the video segment P2 matching identifier B2, …, and send a segment acquisition request to the server based on identifier BL to acquire the video segment PL matching identifier BL. The video segment P1, the video segment P2, …, and the video segment PL may be collectively referred to as the key video segments. Optionally, the application client may also send a single segment acquisition request to the server based on multiple double-speed playing segment identifiers at the same time, so as to acquire the video segments matching these identifiers.
When the key video segments are the video segment P1, the video segment P2, …, and the video segment PL, the application client may play the acquired key video segments according to their segment positions in the target video, that is, play the video segment P1 first, then the video segment P2, …, and finally the video segment PL.
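A minimal sketch of the weak-network interaction described above is given below, assuming a generic HTTP-style interface; the endpoint paths and field names are illustrative assumptions and not the actual interface of the embodiment.

    import requests  # assumed HTTP client; any transport layer could be used

    def play_at_speed_weak_network(server, video_id, progress, speed):
        # 1. ask the server only for the double-speed playing segment identifiers
        resp = requests.get(f"{server}/speed_play_ids",
                            params={"video_id": video_id, "progress": progress, "speed": speed})
        segment_ids = resp.json()["segment_ids"]  # e.g. [B1, B2, ..., BL]
        # 2. fetch the matching key video segments one by one (or in a single batched request)
        for seg_id in segment_ids:
            seg = requests.get(f"{server}/segment", params={"video_id": video_id, "id": seg_id})
            yield seg.content  # played in order of segment position in the target video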
Alternatively, it should be understood that the application client may also directly acquire the key video snippets, and play the directly acquired key video snippets in the video play interface. The specific process of directly acquiring the key video snippets by the application client may be described as: the application client can respond to second trigger operations for the N speed-doubling controls, determine first speed-doubling information indicated by the speed-doubling control corresponding to the second trigger operations, and determine the playing progress of the target video as the first playing progress in the video playing interface. Further, the application client may acquire, when it is checked that the network state of the application client for playing the target video belongs to the second network state, a key video clip associated with the first double-speed information indicated by the first double-speed control and the first playing progress from the server based on the second network state. The key video clips are determined by the server from a key clip set of the target video based on the first double-speed information and the first playing progress; the key segment set is determined by the server based on the L video segments; the L video clips are determined based on clip interest attributes of K video clips of the target video; where L may be a positive integer less than K; where K may be a positive integer. Further, the application client can play the key video snippets of the target video in the video play interface.
The second network state represents a strong network state, namely a state with a better network, and when the application client is in the strong network state, the application client can directly acquire the key video clip returned by the server.
When the key video segments include the video segment P1, the video segment P2, …, and the video segment PL, the key segment set may include these segments as well. The server may determine the sequence of the video segment P1, the video segment P2, …, and the video segment PL in the key segment set according to their segment positions in the target video, and then return these segments (i.e., the key video segments) to the application client according to their sequence in the key segment set.
For easy understanding, please refer to fig. 8, fig. 8 is a schematic diagram of a scene of playing a key video snippet according to an embodiment of the present application. The video playing interface 8a shown in fig. 8 may be the video playing interface 7b in the embodiment corresponding to fig. 7, the playing progress of the target video in the video playing interface 8a may be a first playing progress, where the first playing progress is "00:10:15" (i.e., "00:10:15/01:10:20"), the application client may include a first control display interface 80a, the first control display interface 80a may include a speed-doubling control corresponding to "intelligent speed-doubling 2.0x", and the speed-doubling control corresponding to "intelligent speed-doubling 2.0x" may be a first speed-doubling control 80b.
As shown in fig. 8, when the user Y performs the second trigger operation on the first speed-doubling control 80b (i.e., performs the second trigger operation on the N speed-doubling controls), the application client may obtain the key video snippet based on the first playing progress and the first speed-doubling information indicated by the first speed-doubling control 80b, and play the key video snippet in the video playing interface 8 b. The key video snippet may be a video snippet with a higher multiple speed evaluation result after the first playing progress, where the key video snippet specifically may include: video clip 1 (e.g., video clip P1 described above), video clip 2 (e.g., video clip P2 described above), …, video clip L (e.g., video clip PL described above).
As shown in fig. 8, the application client may first play the video clip 1 in the video playing interface 8b, where the video clip 1 is the first video clip in the key video clips matched with the user Y, and the initial playing progress of the video clip 1 is "00:22:45" (i.e., "00:22:45/01:10:20"), and the application client may play the key video clip directly based on the initial playing progress of the video clip 1, i.e., the application client may switch the playing progress from the first playing progress to the initial playing progress.
It should be appreciated that when playing a key video snippet in a video playback interface, the application client may display N speed-doubling controls associated with the speed-doubling mode in response to a third trigger operation for the video playback interface. The N speed-doubling controls comprise second speed-doubling controls. Further, the application client may switch the double-speed information for double-speed playing of the target video from the first double-speed information indicated by the first double-speed control to the second double-speed information indicated by the second double-speed control in response to a fourth trigger operation for the second double-speed control. Further, the application client may determine a playing progress of the key video snippet in the target video as a second playing progress, and determine a switching video snippet for playing in the video playing interface based on the second double-speed information and the second playing progress. The video clips are selected from the video clips of the target video and are associated with the second double-speed information and the second playing progress. Further, the application client may play the switched video clip in the video play interface.
The third triggering operation and the fourth triggering operation may include touch operations such as clicking, long pressing, sliding, and the like, and may also include non-touch operations such as voice and gesture, which are not limited herein.
It should be understood that, the specific process of the application client responding to the third trigger operation for the video playing interface may be referred to the description of responding to the first trigger operation for the video playing interface, the specific process of the application client responding to the fourth trigger operation for the second speed doubling control may be referred to the description of responding to the second trigger operation for the first speed doubling control, and will not be described herein. It should be appreciated that, for a specific process that the application client obtains the switching video segment based on the second playing progress and the second double-speed information, reference may be made to the above description of obtaining the key video segment based on the first playing progress and the first double-speed information, which will not be described herein.
It will be appreciated that the application client may implement a mutual switching of the speed-doubling mode and the universal mode. Optionally, when playing the key video snippet in the video playing interface, the application client may display M speed-doubling controls associated with the universal mode in response to a fifth trigger operation for the video playing interface. The M speed-doubling controls comprise third speed-doubling controls. Further, the application client may switch the first double-speed information indicated by the first double-speed control to the third double-speed information indicated by the third double-speed control for double-speed playing the target video in response to a sixth trigger operation for the third double-speed control. Further, the application client may determine a playing progress of the key video snippet in the target video as a second playing progress, and determine a third playing progress for the target video played in the video playing interface based on the third speed information and the second playing progress. The application client may determine the second playing progress of the key video snippet as the third playing progress of the target video. Further, the application client may play the target video from the third play progress in the video play interface.
Therefore, when the target user plays a video with the double-speed function, the embodiment of the application can, based on the speed multiple selected by the target user, retain and present to the target user the video segments (i.e., the key video segments) that the target user most expects to watch. It should be appreciated that the number of key video segments acquired by the target user may vary with the speed multiple selected by the target user, and at different multiples the target user can view the highlight video segments within a predictable time. In addition, the key video segments acquired by different users are different, and each user acquires the key video segments that better fit that user's interests. Based on the above, the embodiment of the application can provide a personalized double-speed playing mode for the user, and can improve the accuracy of double-speed playing while improving the double-speed playing experience of the user.
Further, referring to fig. 9, fig. 9 is a flowchart of a video data playing method according to an embodiment of the present application. The method may be executed by an application client, or may be executed by a server, or may be executed by both the application client and the server, where the application client may be the application client in the embodiment corresponding to fig. 2, and the server may be the server in the embodiment corresponding to fig. 2. For ease of understanding, this embodiment is described with the method being performed by a server as an example. The video data playing method at least comprises the following steps S201 to S203:
Step S201, receiving a double-speed playing request which is sent by an application client based on first double-speed information and is associated with a target video;
the first double-speed information is used for indicating the application client to double-speed play the target video in the double-speed mode. It will be appreciated that the first speed information is determined by the application client in response to a second trigger operation for the N speed controls; the N speed-doubling controls are determined by the application client-side responding to a first triggering operation aiming at the video playing interface and triggering a speed-doubling mode associated with the target video; the video playing interface is used for playing the target video. The first double-speed information may be double-speed information indicated by a first double-speed control in the N double-speed controls.
Optionally, the double-speed playing request may further include a playing progress (for example, a first playing progress) of the target video in the video playing interface, so as to obtain L video segments (i.e., key video segments) from the video segments of the target video based on the first playing progress and the first double-speed information in step S202 described below.
Step S202, screening L video clips matched with a target user of an application client from K video clips of a target video based on a double-speed playing request, and taking the L video clips as key video clips for double-speed playing of the target video in a double-speed mode;
specifically, the server may obtain a video identifier of the target video from the double-speed play request, determine the target video in the application client based on the video identifier, and divide the target video into K video segments based on a video slicing parameter. Further, the server may obtain a target network model associated with the target video, predict a first segment attribute of each of the K video segments through the target network model, and determine the segment precision (i.e., the highlight degree) of each video segment based on the first segment attribute of each video segment. Further, the server may predict a second segment attribute of each of the K video segments through the target network model, and determine the segment heat of each video segment based on the second segment attribute of each video segment. Further, the server may predict a third segment attribute of each of the K video segments through the target network model, and determine the segment interest of each video segment based on the third segment attribute of each video segment. Further, the server may determine the segment precision, the segment heat, and the segment interest of each video segment as the segment interest attribute of that video segment, screen L video segments matching the target user of the application client from the K video segments based on the segment interest attribute of each video segment and the double-speed play request, and use the L video segments as the key video segments for playing the target video at double speed in the double-speed mode. Where L may be a positive integer less than K, and K may be a positive integer.
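As a rough illustration only, the three predicted scores could be folded into a single per-segment value along the following lines; the equal-weight aggregation and the function name are assumptions for illustration, since the embodiment does not fix a specific combination formula at this point.

    def segment_interest_attribute(precision, heat, interest, weights=(1.0 / 3, 1.0 / 3, 1.0 / 3)):
        """Illustrative aggregation of segment precision, segment heat and segment interest."""
        w1, w2, w3 = weights
        return w1 * precision + w2 * heat + w3 * interest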
It can be appreciated that when the video slicing parameter is a time length, the server may slice the target video into K video segments with the same time length according to the time length, for example, divide the target video into K video segments according to a time length of 5s (i.e. 5 seconds). Wherein the length of the first video segment or the last video segment may not meet the time length requirement. It should be understood that embodiments of the present application are not limited to specific values of time length. Optionally, the server may segment the target video into K video segments by means of video frame clustering (i.e., the video segmentation parameter may be a similarity between video frames), and may segment the target video into K video segments by means of uniform partitioning (i.e., the video segmentation parameter may be the number of video segments obtained by segmentation). It should be understood that the embodiment of the present application is not limited to a specific manner of acquiring K video segments of the target video.
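A minimal sketch of slicing by time length is shown below, assuming the target video is described only by its total duration; the 5-second segment length matches the example above, and the last segment may be shorter, as noted.

    def slice_by_duration(total_seconds, segment_seconds=5.0):
        """Split [0, total_seconds) into K segments of equal time length."""
        segments = []
        start = 0.0
        while start < total_seconds:
            end = min(start + segment_seconds, total_seconds)
            segments.append((start, end))  # (segment start, segment end) in seconds
            start = end
        return segments  # K = len(segments)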
It should be appreciated that the K video segments include a video segment S i, where i may be a positive integer less than or equal to K. The specific process of determining the segment precision of the video segment S i by the server can be described as: the server may obtain a target network model associated with the target video. The target network model comprises a first target pre-estimated network for predicting the first segment attribute of the video segment S i. Further, the server may determine a first image feature vector, a first audio feature vector, and a first text feature vector of the video segment S i through the first target prediction network. Further, the server may perform feature fusion on the first image feature vector, the first audio feature vector, and the first text feature vector to obtain a first fused feature vector of the video segment S i, input the first fused feature vector into a first fully-connected network in the first target pre-estimated network, and perform feature extraction on the first fused feature vector by the first fully-connected network to obtain a first target feature vector corresponding to the video segment S i. Further, the server may input the first target feature vector into a first classifier in the first target prediction network for classifying the first segment attribute; the first classifier outputs first matching degrees between the first target feature vector and the first sample feature vectors corresponding to a plurality of first sample attributes in the first classifier, the first segment attribute of the video segment S i is determined based on the first matching degrees, and the segment precision of the video segment S i is determined based on the first segment attribute.
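The following is a simplified PyTorch-style sketch of the fusion and classification steps described above; the modality encoders are abstracted away, the feature dimensions and the choice of index 1 for the "highlight" attribute are assumptions, so this is a sketch rather than the actual network of the embodiment.

    import torch
    import torch.nn as nn

    class FirstPredictionNetwork(nn.Module):
        def __init__(self, img_dim=1280, aud_dim=128, txt_dim=768, hidden_dim=1000):
            super().__init__()
            # first fully-connected network: compresses the fused feature vector
            self.fc = nn.Sequential(
                nn.Linear(img_dim + aud_dim + txt_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
            )
            # first classifier over the two first sample attributes (highlight / not highlight)
            self.classifier = nn.Linear(hidden_dim, 2)

        def forward(self, img_vec, aud_vec, txt_vec):
            # feature fusion by vector splicing (concatenation)
            fused = torch.cat([img_vec, aud_vec, txt_vec], dim=-1)      # first fused feature vector
            target = self.fc(fused)                                     # first target feature vector
            matching = torch.softmax(self.classifier(target), dim=-1)   # first matching degrees
            # segment precision: matching degree of the assumed "highlight" attribute (index 1)
            return matching[..., 1]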
It can be understood that the manner in which the server performs feature fusion on the first image feature vector, the first audio feature vector and the first text feature vector may be a weighted average manner or a vector splicing manner. It should be understood that embodiments of the application are not limited to a particular manner of feature fusion.
It will be appreciated that the first fully-connected network may be a multi-layer fully-connected network that may perform a non-linear transformation on the input features (i.e., the first fused feature vector) to obtain the output features (i.e., the first target feature vector), and may further perform a dimensional compression on the input features, for example, compress the 4000-dimensional first fused feature vector to the 1000-dimensional first target feature vector.
It will be appreciated that when the number of first sample attributes is 2 (i.e., two classifications), the number of first sample feature vectors in the first classifier is 2, so that the server may determine the first matching degrees between the first target feature vector and the 2 first sample feature vectors, for example, the first matching degree U1 corresponding to the first sample attribute O1 and the first matching degree U2 corresponding to the first sample attribute O2. The server may use the first sample attribute corresponding to the first matching degree with the higher value as the first segment attribute of the video segment S i based on the first matching degree U1 and the first matching degree U2; for example, when the first matching degree U1 is greater than the first matching degree U2, the server may use the first sample attribute O1 as the first segment attribute of the video segment S i, and determine the segment precision of the video segment S i based on the first segment attribute. When determining the segment precision of the video segment S i based on the first segment attribute, the server may use the first matching degree U1 as the segment precision of the video segment S i, or use the first matching degree U2 as the segment precision of the video segment S i.
It may be appreciated that the plurality of first sample attributes may include a target first sample attribute, and the first sample feature vector corresponding to the target first sample attribute may be a target first sample feature vector. In other words, the server may input the first target feature vector into the first classifier, and the first classifier outputs the target matching degree between the first target feature vector and the target first sample feature vector in the first classifier, so that the target matching degree can be directly determined as the segment precision of the video segment S i.
It is understood that the first target prediction network may include a first image processing network, a first audio processing network, and a first text processing network. The specific process of determining the first image feature vector, the first audio feature vector, and the first text feature vector of the video segment S i by the server through the first target prediction network may be described as follows: the server may use the image frames in the video segment S i as the first image frame, input the first image frame to the first image processing network, and perform image feature extraction on the first image frame by the first image processing network to obtain the first image feature vector of the video segment S i. Further, the server may use the audio frames in the video segment S i as the first audio frame, input the first audio frame to the first audio processing network, and perform audio feature extraction on the first audio frame by the first audio processing network to obtain the first audio feature vector of the video segment S i. Further, the server may use the text information associated with the video segment S i as the first text information, input the first text information to the first text processing network, and perform text feature extraction on the first text information by the first text processing network to obtain the first text feature vector of the video segment S i.
It will be appreciated that the first image processing network may comprise a first image sub-network and a second image sub-network. The server may input each image frame in the first image frame of the video segment S i to the first image sub-network, and perform image feature extraction on each image frame by the first image sub-network to obtain an image feature vector of each image frame (i.e., of the first image frame); it may then input the image feature vectors of the first image frame to the second image sub-network, and perform weighted fusion on the image feature vectors of the first image frame by the second image sub-network to obtain the first image feature vector of the video segment S i. It should be appreciated that the first image sub-network may be an EfficientNet model (from "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks", a multi-dimensional hybrid model scaling method) and the second image sub-network may be a Self-Attention model; embodiments of the present application are not limited to the specific types of the first image sub-network and the second image sub-network.
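As an illustration of the two-stage image processing described above (per-frame feature extraction followed by self-attention weighted fusion), here is a hedged sketch: the frame encoder is left as a placeholder standing in for the EfficientNet-style first image sub-network, nn.MultiheadAttention stands in for the Self-Attention fusion, and the feature dimension is an arbitrary assumption.

    import torch
    import torch.nn as nn

    class FirstImageProcessingNetwork(nn.Module):
        def __init__(self, frame_encoder: nn.Module, feat_dim=1280, num_heads=8):
            super().__init__()
            self.frame_encoder = frame_encoder  # first image sub-network (e.g. an EfficientNet-style encoder)
            # second image sub-network: self-attention over per-frame features
            self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

        def forward(self, frames):  # frames: (batch, num_frames, C, H, W)
            b, t = frames.shape[:2]
            # per-frame image feature vectors; assumes the encoder outputs (b * t, feat_dim)
            per_frame = self.frame_encoder(frames.flatten(0, 1)).view(b, t, -1)
            # weighted fusion of the frame features across time
            fused, _ = self.attn(per_frame, per_frame, per_frame)
            return fused.mean(dim=1)  # first image feature vector of the video segment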
It is understood that the first audio processing network may comprise a first audio sub-network and a second audio sub-network. The server may input each audio frame in the first audio frame of the video clip S i to the first audio sub-network, and perform audio feature extraction on each audio frame by the first audio sub-network to obtain an audio feature vector of each audio frame (i.e., the first audio frame), and further may input the audio feature vector of the first audio frame to the second audio sub-network, and perform weighted fusion on the audio feature vector of the first audio frame by the second audio sub-network to obtain the first audio feature vector of the video clip S i. It should be appreciated that the first audio sub-network may be a VGGish model (i.e., model pre-trained on the AudioSet data of YouTube) and the second audio sub-network may be a Self-Attention (i.e., self-Attention) model, and embodiments of the present application are not limited to a particular type of first audio sub-network and second audio sub-network.
It is understood that the first text processing network may be a lightweight BERT model (A Lite BERT for Self-supervised Learning of Language Representations, ALBERT for short). Alternatively, the first text processing network may also be a transformer-based bidirectional encoder representation (Bidirectional Encoder Representations from Transformers, abbreviated BERT) model. It should be understood that embodiments of the present application are not limited to a specific type of first text processing network.
The server may acquire X1 image frames in the video segment S i and take the X1 image frames as the first image frame, where X1 may be a positive integer (e.g., 20); the embodiment of the present application does not limit the specific value of X1. Similarly, the server may acquire X2 audio frames in the video segment S i and take the X2 audio frames as the first audio frame, where X2 may be a positive integer (e.g., 20); the embodiment of the present application does not limit the specific value of X2.
The server may obtain the voice text information and the subtitle text information of the video segment S i, and take the voice text information (i.e., the dialogue or voice-over text) and the subtitle text information (i.e., the subtitle text) as the first text information. It will be appreciated that the server may recognize the voice text information of each video segment through ASR (Automatic Speech Recognition) and recognize the subtitle text information of each video segment through OCR (Optical Character Recognition). In a portion of the video segments, the subtitle text information may include the voice text information.
Optionally, the server may further obtain bullet screen text information and object text information of the video clip S i, and use the voice text information, the subtitle text information, the bullet screen text information and the object text information as the first text information. It can be appreciated that the server can determine the bullet screen text information of the video clip S i through the bullet screen time stamp of the bullet screen information, and identify the object text information of the video clip S i through the face detection model and the face recognition model. It should be understood that embodiments of the present application are not limited to a face detection model and a specific type of face recognition model.
It should be appreciated that the specific process of training the first initial pre-estimated network by the server to obtain the first target pre-estimated network may be described as follows: the server may use the video segments for training the first initial pre-estimated network as training segments, and determine the barrage interaction amount of each training segment. Wherein the barrage interaction amount of a training segment is used to describe the true segment precision of that training segment. Further, the server may use a training segment whose barrage interaction amount is greater than the interaction threshold as a positive sample segment and use the true segment precision of the positive sample segment as a first sample label, and use a training segment whose barrage interaction amount is less than or equal to the interaction threshold as a negative sample segment and use the true segment precision of the negative sample segment as a second sample label. Further, the server may determine first sample segments (i.e., the highlight pre-estimation training data set) for training the first initial pre-estimated network based on the positive sample segments and the negative sample segments, and determine a plurality of first sample attributes based on the first sample label and the second sample label. Further, the server may determine a first sample image vector, a first sample audio vector, and a first sample text vector of the first sample segment through the first initial pre-estimated network, perform feature fusion on the first sample image vector, the first sample audio vector, and the first sample text vector to obtain a first sample fusion vector of the first sample segment, and determine a first prediction attribute of the first sample segment based on the first sample fusion vector. Further, the server may perform iterative training on the first initial pre-estimated network based on the predicted sample precision corresponding to the first prediction attribute and the true sample precision corresponding to the first sample attribute, to obtain the first target pre-estimated network.
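A minimal sketch of constructing the highlight pre-estimation training set described above follows; it assumes each training segment carries its barrage interaction amount, and the threshold value of 50 is purely an illustrative assumption (the embodiment does not fix a value for the interaction threshold).

    def build_first_sample_set(training_segments, interaction_threshold=50):
        """Label training segments by their barrage (bullet-screen) interaction amount.

        training_segments: iterable of (segment, barrage_interaction_amount) pairs.
        Returns (segment, label) pairs: 1 for positive sample segments, 0 for negative ones.
        """
        samples = []
        for segment, barrage_count in training_segments:
            label = 1 if barrage_count > interaction_threshold else 0  # first / second sample label
            samples.append((segment, label))
        return samples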
Wherein, the first sample label may be "yes" (be a highlight, i.e. 1), the second sample label may be "no" (not be a highlight, i.e. 0), the first sample label and the second sample label may be collectively referred to as a first sample attribute, the number of the first sample attributes may be 2, and the correspondence between the first sample fragment and the first sample attribute may be as shown in the following table 1:
TABLE 1

    First sample segment      First sample attribute (sample label)
    Video clip 1              1 (positive sample segment)
    Video clip 2              0 (negative sample segment)
    …                         …
    Video clip (p-1)          0 (negative sample segment)
    Video clip p              1 (positive sample segment)
As shown in table 1, the number of first sample segments used for training the first initial pre-estimated network may be p, where p may be a positive integer. The p first sample segments may specifically include: video clip 1, video clip 2, …, video clip (p-1), and video clip p, where video clip 1, …, and video clip p may be positive sample segments, and video clip 2, …, and video clip (p-1) may be negative sample segments.
It should be appreciated that, for the specific process of determining the first sample image vector, the first sample audio vector, and the first text sample vector of the first sample segment by the server through the first initial pre-estimation network, reference may be made to the above description of determining the first image feature vector, the first audio feature vector, and the first text feature vector of the video segment S i by the first target pre-estimation network, which will not be repeated herein.
It should be appreciated that, for the specific process of feature fusion of the first sample image vector, the first sample audio vector, and the first sample text vector by the server, reference may be made to the above description of feature fusion of the first image feature vector, the first audio feature vector, and the first text feature vector, which will not be repeated here.
It should be appreciated that, for a specific process of determining the first prediction attribute of the first sample segment based on the first sample fusion vector, the server may refer to the above description of determining the first segment attribute of the video segment S i based on the first fusion feature vector, which will not be described in detail herein.
For ease of understanding, fig. 10 is a schematic structural diagram of a first target prediction network according to an embodiment of the present application. The video clip shown in fig. 10 may be the video clip S i, the plurality of image frames of the video clip may be the first image frame, the plurality of audio frames of the video clip may be the first audio frame, and the subtitle/caption text may be the first text information.
As shown in fig. 10, the first image processing network may include a first image sub-network and a second image sub-network, the first audio processing network may include a first audio sub-network and a second audio sub-network, a first image feature vector of the video clip may be extracted through the first image processing network, a first text feature vector of the video clip may be extracted through the first text processing network, and a first audio feature vector of the video clip may be extracted through the first audio processing network.
As shown in fig. 10, a first fusion feature vector may be obtained by performing vector stitching on a first image feature vector, a first text feature vector, and a first audio feature vector, and the first fusion feature vector is input to a multi-layer fully-connected network (i.e., the first fully-connected network described above), so that a video multi-dimensional depth representation (i.e., the first target feature vector) corresponding to a video segment may be output, and then, based on the video multi-dimensional depth representation, highlight prediction may be performed on the video segment, so as to obtain a segment chroma of the video segment.
It should be appreciated that the first initial pre-estimated network and the first target pre-estimated network may be collectively referred to as a first generalized network, where the first initial pre-estimated network and the first target pre-estimated network belong to names of the first generalized network at different times. The first generalized network may be referred to as a first initial predicted network during the training phase and as a first target predicted network during the prediction phase.
It should be appreciated that the K video segments include the video segment S i, where i may be a positive integer less than or equal to K; the target network model includes a second target prediction network for predicting the second segment attribute of the video segment S i. The specific process of the server determining the segment heat of the video segment S i can be described as: the server may determine a second image feature vector, a second audio feature vector, and a second text feature vector of the video segment S i through the second target prediction network. Further, the server may perform feature fusion on the second image feature vector, the second audio feature vector, and the second text feature vector to obtain a second fused feature vector of the video segment S i, determine the second segment attribute of the video segment S i based on the second fused feature vector, and determine a first segment heat (i.e., a prior heat) of the video segment S i based on the second segment attribute. Further, the server may obtain auxiliary video segments of the service videos on the platform to which the target video belongs, and determine an average barrage amount corresponding to the video segment S i based on the barrage interaction amount of the auxiliary video segments and the first double-speed information. Further, the server may obtain the segment barrage amount of the video segment S i, and determine a second segment heat (i.e., a posterior heat) of the video segment S i based on the segment barrage amount and the average barrage amount. Further, the server may determine the segment heat of the video segment S i based on the first segment heat and the second segment heat of the video segment S i.
It should be appreciated that, for the specific process of determining the second image feature vector, the second audio feature vector, and the second text feature vector of the video segment S i by the server through the second target prediction network, reference may be made to the above description of determining the first image feature vector, the first audio feature vector, and the first text feature vector of the video segment S i by the first target prediction network, which will not be described herein.
It should be appreciated that, for a specific process of feature fusion of the second image feature vector, the second audio feature vector and the second text feature vector by the server, reference may be made to the above description of feature fusion of the first image feature vector, the first audio feature vector and the first text feature vector, which will not be repeated herein.
It should be appreciated that, for a specific process of determining the second segment attribute of the video segment S i based on the second fusion feature vector, reference may be made to the above description of determining the first segment attribute of the video segment S i based on the first fusion feature vector, which will not be described in detail herein.
It will be appreciated that the second target prediction network may comprise a second image processing network, a second audio processing network and a second text processing network, the model structure of the second image processing network may be referred to the model structure of the first image processing network, the model structure of the second audio processing network may be referred to the model structure of the first audio processing network, and the model structure of the second text processing network may be referred to the model structure of the first text processing network.
It is understood that the business video may be a video on a platform to which the target video belongs within a specified time range, where the units of the time range may be year, month, day, etc. For example, the business video may be a long video on the platform within 2019.
The manner of determining the second segment heat of the video segment S i may be referred to as the following formula (1):
second segment heat = min(1.0, bullet-screen amount of the current segment / DN)    (1);
wherein the bullet-screen amount of the current segment represents the segment bullet-screen amount of the video segment S i, and DN represents the average bullet-screen amount corresponding to the video segment S i, where DN = average bullet-screen amount of the auxiliary video / speed multiple. That is, the second segment heat of the video segment S i is equal to the smaller of 1.0 and (bullet-screen amount of the current segment / DN).
The segment heat of the video segment S i may be determined according to the following formula (2):

P_total = x1 × P_prior + x2 × P_posterior    (2)

where P_total represents the overall heat value (i.e., the segment heat), P_prior represents the first segment heat (i.e., the prior heat), P_posterior represents the second segment heat (i.e., the posterior heat), x1 represents the weight coefficient corresponding to the first segment heat, x2 represents the weight coefficient corresponding to the second segment heat, and the sum of x1 and x2 equals 1.
It should be appreciated that embodiments of the present application are not limited to a particular manner of calculating the segment heats from the first segment heats and the second segment heats. Alternatively, the server may directly use the first segment heat as the segment heat, and the server may directly use the second segment heat as the segment heat.
It should be appreciated that the specific process by which the server trains the second initial pre-estimated network to obtain the second target pre-estimated network may be described as follows: the server may take the sample videos used for training the second initial pre-estimated network as second sample segments (i.e., the heat estimation training data set), and determine the video play amount and the play completion amount of each second sample segment. The video play amount and play completion amount of a second sample segment are used to describe the true sample heat of that second sample segment. Further, the server may determine the true sample heat of the second sample segment based on the product of the video play amount and the play completion amount, and use the determined true sample heat as the plurality of second sample attributes associated with the second sample segment. Further, the server may determine a second sample image vector, a second sample audio vector, and a second sample text vector of the second sample segment through the second initial pre-estimation network, perform feature fusion on the second sample image vector, the second sample audio vector, and the second sample text vector to obtain a second sample fusion vector of the second sample segment, and determine a second prediction attribute of the second sample segment based on the second sample fusion vector. Further, the server may iteratively train the second initial pre-estimated network based on the predicted sample heat corresponding to the second prediction attribute and the true sample heat corresponding to the second sample attribute, to obtain the second target pre-estimated network.
The true sample heat may be determined according to the following formula (3):

H_true = min(1.0, (play_count / N) × completion)    (3)

where play_count / N represents the video play amount, play_count denotes the number of times the second sample segment is clicked and played, N denotes the average play count of short videos on the platform to which the target video belongs, and completion represents the play completion amount, i.e., a weighted average of the play completion degrees of the second sample segment over a plurality of users. That is, the true sample heat is bounded above by 1.0, consistent with formula (1).
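As a hedged illustration of the heat label in formula (3) (the helper names and the cap at 1.0 follow the reconstruction above and are not verbatim from the embodiment):

```python
def true_sample_heat(play_count: int,
                     platform_avg_play_count: float,
                     play_completion: float) -> float:
    """Ground-truth heat label of a second sample segment, per formula (3)."""
    # Normalised video play amount: plays of this segment relative to the
    # platform-average play count of short videos
    play_amount = play_count / platform_avg_play_count
    # Product of play amount and completion, capped at 1.0
    return min(1.0, play_amount * play_completion)
```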
It should be appreciated that, for the specific process of determining, by the server, the second sample image vector, the second sample audio vector, and the second sample text vector of the second sample segment through the second initial pre-estimation network, reference may be made to the description of determining, by the first target pre-estimation network, the first image feature vector, the first audio feature vector, and the first text feature vector of the video segment S i, which will not be described in detail herein.
It should be appreciated that, for a specific process of feature fusion of the second sample image vector, the second sample audio vector and the second sample text vector by the server, reference may be made to the above description of feature fusion of the first image feature vector, the first audio feature vector and the first text feature vector, which will not be repeated here.
It should be appreciated that, for a specific process of determining the second prediction attribute of the second sample segment based on the second sample fusion vector, reference may be made to the above description of determining the first segment attribute of the video segment S i based on the first fusion feature vector, which will not be described in detail herein.
It should be appreciated that the specific process of determining the plurality of second sample attributes associated with the second sample segment by the server may be referred to as determining the plurality of first sample attributes associated with the first sample segment, and will not be described in detail herein.
For ease of understanding, please refer to fig. 11, fig. 11 is a schematic diagram of a second target estimation network according to an embodiment of the present application. The model structure of the second target predicted network shown in fig. 11 may be the same as the model structure of the first target predicted network shown in fig. 10.
As shown in fig. 11, the content representations of a video segment in each dimension may be the second image feature vector, the second audio feature vector, and the second text feature vector of the video segment, and fusing these multi-dimensional representations yields the second fusion feature vector. The second fusion feature vector is input into a multi-layer fully connected network (i.e., the second fully connected network in the second target estimation network), which outputs a multi-dimensional video depth representation (i.e., a second target feature vector) corresponding to the video segment; the heat of the video segment can then be estimated based on this depth representation, so as to obtain the segment heat (i.e., the first segment heat) of the video segment.
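By way of a non-authoritative sketch, the forward pass just described could be organised roughly as follows in plain NumPy; the concatenation-based fusion, the ReLU activations, the sigmoid output, and all shapes are assumptions rather than details stated in the embodiment.

```python
import numpy as np

def predict_first_segment_heat(img_vec, aud_vec, txt_vec, hidden_layers, out_w, out_b):
    """Fuse per-modality feature vectors, run a small fully connected stack,
    and squash the result into a heat score in [0, 1]."""
    fused = np.concatenate([img_vec, aud_vec, txt_vec])   # multi-dimensional representation fusion
    h = fused
    for w, b in hidden_layers:                            # multi-layer fully connected network
        h = np.maximum(0.0, w @ h + b)                    # ReLU hidden activation
    logit = float(out_w @ h + out_b)                      # scalar output unit
    return 1.0 / (1.0 + np.exp(-logit))                   # sigmoid -> first segment heat
```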
It should be appreciated that the second initial pre-estimated network and the second target pre-estimated network may be collectively referred to as a second generalized network; that is, they are the names of the second generalized network at different stages. In the training phase, the second generalized network may be referred to as the second initial pre-estimated network, and in the prediction phase, it may be referred to as the second target pre-estimated network.
It should be appreciated that the K video clips include a video segment S i, where i may be a positive integer less than or equal to K; the target network model includes a third target prediction network for predicting a third segment attribute of the video segment S i. The specific process by which the server determines the segment interest degree of the video segment S i can be described as follows: the server can acquire the target associated video associated with the target user in the application client, acquire the target video tag of the target associated video, and take the target video tag as the target interest tag of the target user. Further, the server may determine a target segment feature vector of the video segment S i through the third target pre-estimation network, determine a target associated feature vector of the target associated video through the third target pre-estimation network, and determine a target interest feature vector of the target interest tag through the third target pre-estimation network. Further, the server may determine a third fused feature vector of the video segment S i based on the target segment feature vector, the target associated feature vector, and the target interest feature vector, determine a third segment attribute of the video segment S i based on the third fused feature vector, and determine a segment interest degree of the video segment S i based on the third segment attribute.
It can be understood that the server can acquire the business videos on the platform to which the target video belongs, determine the watching completion degree of the target user for these business videos, and further use, among them, the business videos whose watching completion degree is greater than the completion threshold as the target associated videos associated with the target user.
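A minimal sketch of this associated-video selection is shown below; the field names and the 0.8 completion threshold are illustrative assumptions.

```python
def target_associated_videos(business_videos, completions, completion_threshold=0.8):
    """Pick the business videos the target user (mostly) finished watching as the
    target associated videos, and collect their tags as the target interest tags."""
    associated = [v for v in business_videos
                  if completions.get(v["video_id"], 0.0) > completion_threshold]
    interest_tags = sorted({tag for v in associated for tag in v.get("tags", [])})
    return associated, interest_tags
```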
It will be appreciated that the third target predictive network may include: a first sub-network, a second sub-network, and a third sub-network, the first sub-network may be used to extract a target segment feature vector of the video segment S i, the second sub-network may be used to extract a target associated feature vector of the target associated video, and the third sub-network may be used to extract a target interest feature vector of the target interest tag.
The first sub-network may include a third image processing network, a third audio processing network, and a third text processing network; the first sub-network has the same model structure as the first target prediction network and the second target prediction network. It should be appreciated that, for the specific process of determining the target segment feature vector of the video segment S i by the server through the first sub-network, reference may be made to the description of determining the first target feature vector corresponding to the video segment S i through the first target prediction network, which will not be described in detail herein.
The second sub-network may include a weighting sub-network and a feature extraction sub-network; the feature extraction sub-network has the same model structure as the first target pre-estimation network and the second target pre-estimation network, and the weighting sub-network has the same model structure as the second image sub-network in the first image processing network (or the second audio sub-network in the first audio processing network). The specific process by which the server determines the target associated feature vector of the target associated video through the second sub-network can be described as follows: the server may determine the associated feature vectors of the target associated videos through the feature extraction sub-network in the second sub-network, and then perform a weighted sum of the associated feature vectors through the weighting sub-network in the second sub-network to obtain the target associated feature vector of the target associated videos. It should be understood that, for the specific process of determining the associated feature vector of a target associated video through the feature extraction sub-network in the second sub-network, reference may be made to the description of determining the first target feature vector corresponding to the video segment S i, which will not be described herein.
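A hedged sketch of the weighting sub-network's role follows; the embodiment only states that the associated feature vectors are weighted and summed, so the normalised weights here are an assumption.

```python
import numpy as np

def weighted_user_representation(assoc_vecs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Collapse per-video associated feature vectors (shape [num_videos, dim]) into a
    single target associated feature vector via a normalised weighted sum."""
    w = weights / weights.sum()              # normalise the per-video weights
    return (w[:, None] * assoc_vecs).sum(axis=0)
```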
The third sub-network may have the same model structure as the first text processing network in the first target pre-estimation network. For a specific process of determining, by the server, the target interest feature vector of the target interest tag through the third sub-network, reference may be made to the description of determining, by the first text processing network, the first text feature vector of the video segment S i, which will not be described in detail herein.
It will be appreciated that the specific process of determining the third fused feature vector of the video segment S i by the server based on the target segment feature vector, the target associated feature vector, and the target interest feature vector may be described as: the server can conduct first feature fusion on the target associated feature vector and the target interest feature vector to obtain a first fusion vector. Further, the server may perform a second feature fusion on the first fusion vector and the target segment feature vector, and determine a third fusion feature vector (i.e., a second fusion vector) of the video segment S i.
Alternatively, it may be appreciated that the specific process of determining the third fusion feature vector of the video segment S i by the server based on the target segment feature vector, the target associated feature vector, and the target interest feature vector may be described as: the server can conduct first feature fusion on the target associated feature vector and the target interest feature vector to obtain a first fusion vector. Further, the server may input the first fusion vector to a third full-connection layer in the third target prediction network, and the third full-connection layer performs feature extraction on the first fusion vector to obtain the first target vector. Further, the server may perform a second feature fusion on the first target vector and the target segment feature vector, and determine a third fused feature vector (i.e., a second fused vector) of the video segment S i.
It is understood that the first feature fusion and the second feature fusion may be performed in a weighted-average manner or in a vector concatenation (splicing) manner. It should be understood that embodiments of the present application are not limited to a specific manner of performing the first feature fusion and the second feature fusion.
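The two fusion options can be sketched as below; which of them the model actually uses is deliberately left open by the embodiment.

```python
import numpy as np

def fuse_weighted_average(vec_a: np.ndarray, vec_b: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Weighted-average fusion of two equally sized feature vectors."""
    return alpha * vec_a + (1.0 - alpha) * vec_b

def fuse_concat(vec_a: np.ndarray, vec_b: np.ndarray) -> np.ndarray:
    """Vector concatenation ("splicing") fusion; both inputs are kept end to end."""
    return np.concatenate([vec_a, vec_b])
```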
It should be appreciated that, for a specific process of determining the third segment attribute of the video segment S i by the server based on the third fusion feature vector, reference may be made to a description of determining the first segment attribute of the video segment S i based on the first fusion feature vector, which will not be described in detail herein.
It should be appreciated that the specific process by which the server trains the third initial pre-estimated network to obtain the third target pre-estimated network may be described as follows: the server may take the sample videos used for training the third initial pre-estimated network as training videos and determine the watching completion degree of a sample user for each training video. The watching completion degree of a training video is used to describe the real sample interest degree of a sample user in that training video. Further, the server may take a training video whose watching completion degree is greater than a completion threshold as a positive sample video and take the real sample interest degree of the positive sample video as a first video tag, and take a training video whose watching completion degree is less than or equal to the completion threshold as a negative sample video and take the real sample interest degree of the negative sample video as a second video tag. Further, the server may determine, based on the positive sample videos and the negative sample videos, a third sample segment (i.e., the correlation data set) for training the third initial pre-estimated network, and determine a plurality of third sample attributes based on the first video tag and the second video tag. Further, the server may take the positive sample videos as the sample associated videos associated with the sample user, obtain the sample video tags of the sample associated videos, and use the sample video tags as the sample interest tags of the sample user. Further, the server may determine a sample segment feature vector of the third sample segment through the third initial pre-estimation network, determine a sample associated feature vector of the sample associated video through the third initial pre-estimation network, and determine a sample interest feature vector of the sample interest tag through the third initial pre-estimation network. Further, the server may determine a third sample fusion vector of the third sample segment based on the sample segment feature vector, the sample associated feature vector, and the sample interest feature vector, and determine a third prediction attribute of the third sample segment based on the third sample fusion vector. Further, the server may iteratively train the third initial pre-estimated network based on the predicted sample interest degree corresponding to the third prediction attribute and the real sample interest degree corresponding to the third sample attribute, to obtain the third target pre-estimated network.
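A minimal sketch of how the positive/negative interest samples might be assembled; the 0.8 threshold and the 1.0/0.0 label values are illustrative assumptions.

```python
def build_interest_training_set(training_videos, completions, completion_threshold=0.8):
    """Split training videos into positive/negative samples by watching completion
    and attach the corresponding real-sample interest labels (first/second video tag)."""
    samples = []
    for video in training_videos:
        completion = completions.get(video["video_id"], 0.0)
        label = 1.0 if completion > completion_threshold else 0.0
        samples.append((video, label))
    return samples
```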
It should be appreciated that, for the specific process of determining the sample fragment feature vector, the sample association feature vector, and the sample interest feature vector by the server through the third initial pre-estimation network, reference may be made to the description of determining the target fragment feature vector, the target association feature vector, and the target interest feature vector by the third target pre-estimation network, which will not be described in detail herein.
It should be appreciated that, for the specific process of determining the third sample fusion vector by the server based on the sample segment feature vector, the sample association feature vector and the sample interest feature vector, reference may be made to the description of determining the third fusion feature vector based on the target segment feature vector, the target association feature vector and the target interest feature vector, which will not be described in detail herein.
It should be appreciated that, for a specific process of determining the third prediction attribute of the third sample segment based on the third sample fusion vector, the server may refer to a description of determining the third segment attribute of the video segment S i based on the third fusion feature vector, which will not be described in detail herein.
For ease of understanding, fig. 12 is a schematic structural diagram of a third target prediction network according to an embodiment of the present application. As shown in fig. 12, the user interest tag sequence text may be the target interest tags, the multidimensional vectors corresponding to the video sequence watched by the user may be the associated feature vectors of the target associated videos, and the video clip may be the video segment S i.
As shown in fig. 12, the user interest tag sequence text is input to the third sub-network, which can output the user's explicit interest tag depth representation (i.e., the target interest feature vector); the multidimensional vectors corresponding to the video sequence watched by the user are input to the weighting sub-network in the second sub-network, which can output the user's implicit interest tag depth representation (i.e., the target associated feature vector); and the video segment is input to the first sub-network, which can output the multi-dimensional depth representation on the video segment side (i.e., the target segment feature vector).
As shown in fig. 12, a first feature fusion is performed on the user's explicit interest tag depth representation and implicit interest tag depth representation to obtain the user-side interest depth representation (i.e., the first fusion vector); a second feature fusion is then performed on the user-side interest depth representation and the multi-dimensional depth representation on the video segment side to obtain the third fusion feature vector (i.e., the second fusion vector); and the interest degree of the video segment can then be estimated based on the third fusion feature vector, so as to obtain the segment interest degree of the video segment for the user.
It should be appreciated that the third initial pre-estimated network and the third target pre-estimated network may be collectively referred to as a third generalized network, where the third initial pre-estimated network and the third target pre-estimated network belong to names of the third generalized network at different times. In the training phase, the third generalization network may be referred to as a third initial predicted network, and in the predicting phase, the third generalization network may be referred to as a third target predicted network.
It should be understood that the target network model is obtained by performing iterative training on an initial network model, where the initial network model and the target network model may be collectively referred to as a generalization model, and the initial network model and the target network model belong to names of the generalization model at different times. In the training stage, the generalization model can be called an initial network model, and at this time, the generalization model can comprise a first initial pre-estimated network, a second initial pre-estimated network and a third initial pre-estimated network; in the prediction stage, the generalization model may be referred to as a target network model, where the generalization model may include a first target prediction network, a second target prediction network, and a third target prediction network.
It should be appreciated that the specific process of the server screening for key video snippets may be described as: the server can obtain the double-speed evaluation result of each video clip based on the clip interest attribute of each video clip. The segment interest attribute of each video segment may include a segment precision of each video segment, a segment heat of each video segment, and a segment interest of each video segment. Further, the server may obtain the first playing progress and the first double-speed information of the target video in the double-speed playing request, screen L video clips matching with the target user of the application client from the K video clips based on the first double-speed information, the first playing progress and the double-speed evaluation result, and use the L video clips as key video clips for double-speed playing of the target video in the double-speed mode.
The overall double-speed score of a video segment may be determined according to the following formula (4):

Score = w1 × P_precision + w2 × P_heat + w3 × P_interest    (4)

where Score, the overall double-speed score, is the double-speed evaluation result; P_precision represents the segment precision, P_heat represents the segment heat, and P_interest represents the segment interest degree; w1, w2, and w3 are the weight coefficients corresponding to the segment precision, the segment heat, and the segment interest degree respectively, and the sum of w1, w2, and w3 equals 1. It should be appreciated that embodiments of the present application are not limited to this particular manner of calculating the overall double-speed score based on segment precision, segment heat, and segment interest degree.
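A small hedged sketch of formula (4) follows; the default weights are placeholders, since the embodiment only requires that they sum to 1.

```python
def overall_speed_score(precision: float, heat: float, interest: float,
                        w1: float = 0.4, w2: float = 0.3, w3: float = 0.3) -> float:
    """Overall double-speed score of a video segment, per formula (4)."""
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9, "weight coefficients must sum to 1"
    return w1 * precision + w2 * heat + w3 * interest
```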
Optionally, the server may also obtain the double-speed evaluation result of each video segment from the segment precision alone, that is, directly use the segment precision as the double-speed evaluation result.
Optionally, the server may also obtain a double-speed evaluation result of each video clip according to the clip heat, that is, directly use the clip heat as the double-speed evaluation result.
Optionally, the server may also obtain a double-speed evaluation result of each video clip according to the clip interest, i.e. directly use the clip interest as a double-speed evaluation result.
Optionally, the manner in which the server determines the double-speed evaluation result of each video segment includes, but is not limited to, the segment precision, segment heat, and segment interest degree described above. Alternatively, the server may determine the double-speed evaluation result from any combination of two or more of the above parameters.
It can be understood that the server may sort the K video segments based on their double-speed evaluation results, so that when the target user uses the intelligent double-speed function, the server obtains, from the K video segments, the J video segments located after the first playing progress and selects from them the 1/S of the segments with the highest double-speed evaluation results, where S may represent the speed multiple of the first double-speed information. For example, when the first double-speed information is 2x speed, the server may select 1/2 of the J video segments, i.e., J/2 video segments. Optionally, the double-speed information carried by the double-speed playing request may also be second double-speed information (the double-speed information indicated by a second double-speed control among the N double-speed controls); for example, when the second double-speed information is 1.5x speed, the server may select 1/1.5 of the J video segments, that is, 2J/3 video segments.
Optionally, after receiving the double-speed playing request, the server may directly select, from the K video segments, the 1/S of the segments with the highest overall double-speed scores, based on the first double-speed information carried by the request and the double-speed evaluation results of the K video segments of the target video, where S may be equal to the speed multiple of the first double-speed information. For example, when the first double-speed information is 2x speed, the server may select 1/2 of the K video segments, that is, K/2 video segments.
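As a hedged sketch of the server-side selection (the data shape with "start" and "score" fields is an assumption), keeping the top 1/S of the candidate segments could look like:

```python
def select_key_segments(scored_segments, first_progress: float, speed_multiple: float):
    """Keep the top 1/S fraction (by double-speed evaluation result) of the segments
    located after the current playing progress, returned in playback order."""
    remaining = [s for s in scored_segments if s["start"] >= first_progress]   # the J segments
    keep_count = max(1, round(len(remaining) / speed_multiple))
    top = sorted(remaining, key=lambda s: s["score"], reverse=True)[:keep_count]
    return sorted(top, key=lambda s: s["start"])   # restore chronological order for playback
```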
Step S203, the key video snippets are returned to the application client so that the application client plays the key video snippets of the target video.
It can be understood that when receiving a double-speed playing request which is sent by the application client based on the first double-speed information and is associated with the target video, the server can determine double-speed evaluation results of K video fragments of the target video in real time so as to screen key video fragments matched with a target user of the application client from the K video fragments.
Optionally, the server may further analyze the target user to determine a video (for example, a target video) that the target user may watch, and further determine multiple speed evaluation results of K video segments of the target video in advance, so when receiving a multiple speed play request associated with the target video and sent by the application client based on the first multiple speed information, the server may obtain multiple speed evaluation results of K video segments determined in advance, so as to screen key video segments matching the target user of the application client from the K video segments.
It will be appreciated that the server may return the key video snippets to the target user of the application client to enable the target user to view the key video snippets in the video playback interface. The target user may be an individual user, in which case the target user watches the key video segments associated with that individual user in the video playing interface.
Optionally, the target user may instead represent a type of user, that is, the target user may belong to a user category, in which case the target user views the key video segments associated with that category in the video playing interface. Specifically, the server may perform interest clustering on all users of the platform to obtain a plurality of user clusters (for example, 256 user clusters), and further determine, from the K video segments of the target video, the key video segments matched with each user cluster. Therefore, when the server receives a double-speed playing request sent by the target user, it can determine the user cluster to which the target user belongs and directly return the key video segments matched with that user cluster to the application client.
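A minimal sketch of this cluster-level precomputation is given below; scikit-learn's KMeans and the callback for scoring segments per cluster are purely illustrative stand-ins, since the embodiment does not name a clustering algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

def precompute_cluster_key_segments(user_interest_vectors: np.ndarray,
                                    key_segments_for_center,
                                    n_clusters: int = 256):
    """Cluster users by interest, then cache the key video segments per cluster so a
    double-speed request only needs a cluster lookup."""
    kmeans = KMeans(n_clusters=n_clusters, random_state=0)
    cluster_ids = kmeans.fit_predict(user_interest_vectors)   # one cluster id per user
    cluster_key_segments = {
        cid: key_segments_for_center(kmeans.cluster_centers_[cid])
        for cid in range(n_clusters)
    }
    return kmeans, cluster_ids, cluster_key_segments
```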
For easy understanding, please refer to fig. 13, fig. 13 is a schematic diagram of a scenario of an intelligent double-speed playing method according to an embodiment of the present application. As shown in fig. 13, the server may segment the target video to obtain video segments (i.e., K video segments) of the target video, and further evaluate the point of view precision (i.e., segment precision) and the heat (i.e., segment heat) of each of the K video segments.
As shown in fig. 13, after determining the point-of-view precision and heat of each video segment, the server may analyze the user interests of the target user to obtain, based on those interests, the target user's interest in each video segment (i.e., the segment interest degree). Further, based on the point-of-view precision, the heat, and the interest degree, the server may select from the video segments those that are more highlight-worthy and popular and that better match the user's interests; in other words, the server selects the key video segments, where a key video segment is determined by the double-speed evaluation result of the video segment. Therefore, when the target user performs the second trigger operation on the N double-speed controls (i.e., on the first double-speed control among the N double-speed controls), the application client may intelligently play, at the intelligent speed (i.e., the first double-speed information) associated with the first double-speed control, the key video segments associated with that intelligent speed in the application client.
Therefore, when a double-speed playing request associated with the target video is received, the embodiment of the present application can select, based on the first double-speed information in the request, the key video segments matching the target user of the application client from the video segments of the target video. The embodiment of the present application can build deep representations of the video segments of the target video based on a depth model: a segment point-of-view (precision) prediction model (namely the first target pre-estimated network) is constructed from bullet screen amounts, a segment heat prediction model (namely the second target pre-estimated network) is constructed from the effective playing data of a large number of short videos, and a segment interest prediction model (namely the third target pre-estimated network) is constructed from deep representations of users' explicit and implicit interests. Based on this, the embodiment of the present application may determine the segment precision, the segment heat, and the segment interest degree of each video segment through the three prediction models constructed above, so as to obtain the double-speed evaluation result of each video segment, and then select, based on the first double-speed information, a specified number of video segments with higher double-speed evaluation results from the video segments of the target video (i.e., determine the video segments of greater interest to the target user as the key video segments), so as to play the key video segments for the target user in the application client, thereby realizing double-speed play based on the first double-speed information. Therefore, the embodiment of the present application can select different key video segments for different users and provide personalized double-speed playing modes for different users based on those selections, thereby improving the accuracy of double-speed playing in the application client.
Further, referring to fig. 14, fig. 14 is a schematic structural diagram of a video data playing device according to an embodiment of the present application. The video data playback apparatus 1 may include: the interface display module 100, the first response module 200, the second response module 300; further, the video data playback apparatus 1 may further include: the third response module 400, the fourth response module 500, the progress determination module 600, the segment switching module 700;
An interface display module 100 for displaying a video playing interface for playing a target video;
The first response module 200 is configured to display N multiple speed controls associated with the multiple speed mode of the target video in response to a first trigger operation for the video playing interface; n is a positive integer;
the video playing interface is a full-screen playing interface for playing the target video;
The first response module 200 includes: a first display unit 201, a second display unit 202;
a first display unit 201, configured to trigger a double-speed mode of a target video in response to a first trigger operation for a full-screen playing interface, and display a first control display interface independent of the full-screen playing interface based on the double-speed mode; the interface size of the first control display interface is smaller than the interface size of the full-screen playing interface;
The second display unit 202 is configured to display N double-speed controls associated with the double-speed mode in the first control display interface.
The specific implementation manner of the first display unit 201 and the second display unit 202 may refer to the description of step S102 in the embodiment corresponding to fig. 3, and will not be described herein.
The video playing interface is a non-full-screen playing interface for playing the target video;
The first response module 200 further includes: a video playing unit 203, a third display unit 204, and a fourth display unit 205;
A video playing unit 203, configured to play the target video in a video playing area of the non-full screen playing interface;
A third display unit 204, configured to display a control display area of the target video in the non-full-screen playing interface in response to a first trigger operation for the non-full-screen playing interface; the control display area is an area suspended above the video playing area, or the control display area is an area which is not overlapped with the video playing area;
And a fourth display unit 205, configured to trigger a double-speed mode of the target video in response to a double-speed selection operation for the control display area, and display N double-speed controls associated with the double-speed mode in the second control display interface based on the double-speed mode.
The specific implementation manner of the video playing unit 203, the third display unit 204, and the fourth display unit 205 may be referred to the description of step S102 in the embodiment corresponding to fig. 3, and will not be described herein.
The second response module 300 is configured to determine, in response to a second trigger operation for the N multiple speed controls, first multiple speed information indicated by the multiple speed control corresponding to the second trigger operation, and play a key video segment of the target video in the video play interface; the key video snippet is a video snippet associated with the first speed information selected from video snippets of the target video.
Wherein the second response module 300 includes: a first determination unit 301, a first check unit 302, a clip acquisition unit 303, a first playback unit 304;
the first determining unit 301 is configured to determine, in response to a second trigger operation for the N multiple speed controls, first multiple speed information indicated by the multiple speed control corresponding to the second trigger operation, and determine a playing progress of the target video as a first playing progress in the video playing interface;
A first checking unit 302, configured to, when it is checked that the network state of the application client for playing the target video belongs to a first network state, acquire, from the server, a double-speed playing fragment identifier associated with the first double-speed information and the first playing progress based on the first network state; a double-speed playing segment identifier is used for representing the segment position of a key video segment in the target video;
A segment obtaining unit 303, configured to obtain, from a server, a key video segment that matches with the identifier of the double-speed play segment;
the first playing unit 304 is configured to play the key video snippet of the target video in the video playing interface based on the snippet position of the key video snippet in the target video.
The specific implementation manner of the first determining unit 301, the first checking unit 302, the segment obtaining unit 303, and the first playing unit 304 may refer to the description of step S103 in the embodiment corresponding to fig. 3, and will not be described herein.
Wherein the second response module 300 further comprises: a second determination unit 305, a second check unit 306, a second play unit 307;
A second determining unit 305, configured to determine, in response to a second trigger operation for the N multiple speed controls, first multiple speed information indicated by the multiple speed control corresponding to the second trigger operation, and determine, in the video playing interface, a playing progress of the target video as a first playing progress;
A second checking unit 306, configured to, when it is checked that the network state of the application client for playing the target video belongs to a second network state, acquire, from the server, a key video segment associated with the first double-speed information and the first playing progress based on the second network state; the key video snippets are determined by the server from a key snippet set of the target video based on the first double-speed information and the first playing progress; the key segment set is determined by the server based on the L video segments; the L video clips are determined based on clip interest attributes of K video clips of the target video; l is a positive integer less than K; k is a positive integer;
a second playing unit 307, configured to play the key video snippets of the target video in the video playing interface.
The specific implementation manner of the second determining unit 305, the second checking unit 306, and the second playing unit 307 may refer to the description of step S103 in the embodiment corresponding to fig. 3, and will not be repeated here.
Optionally, the third response module 400 is configured to use the multiple speed control corresponding to the second trigger operation as a first multiple speed control, and when the key video clip is played in the video playing interface, respond to the third trigger operation for the video playing interface, and display N multiple speed controls associated with the multiple speed mode; the N speed-doubling controls comprise second speed-doubling controls;
The fourth response module 500 is configured to switch, in response to a fourth trigger operation for the second double-speed control, the first double-speed information indicated by the first double-speed control to the second double-speed information indicated by the second double-speed control, where the double-speed information is used for double-speed playing the target video;
The progress determining module 600 is configured to determine a playing progress of the key video snippet in the target video as a second playing progress, and determine a switching video snippet for playing in the video playing interface based on the second double-speed information and the second playing progress; switching video clips to be video clips which are selected from the video clips of the target video and are associated with second double-speed information and second playing progress;
The clip switching module 700 is configured to play the switched video clip in the video playing interface.
The specific implementation manners of the interface display module 100, the first response module 200, the second response module 300, the third response module 400, the fourth response module 500, the progress determination module 600 and the segment switching module 700 may be referred to the description of step S101 to step S103 in the embodiment corresponding to fig. 3, and will not be repeated here. In addition, the description of the beneficial effects of the same method is omitted.
Further, referring to fig. 15, fig. 15 is a schematic structural diagram of a video data playing device according to an embodiment of the present application. The video data playback apparatus 2 may include: a request receiving module 10, a fragment determining module 20, a fragment returning module 30;
a request receiving module 10, configured to receive a double-speed play request associated with a target video, which is sent by an application client based on first double-speed information; the first double-speed information is used for indicating the application client to play the target video at double speed in a double-speed mode;
the segment determining module 20 is configured to screen, based on the double-speed play request, L video segments matching with a target user of the application client from K video segments of the target video, and use the L video segments as key video segments for double-speed playing of the target video in the double-speed mode; l is a positive integer less than K; k is a positive integer;
wherein the segment determination module 20 comprises: a video dividing unit 21, a first prediction unit 22, a second prediction unit 23, a third prediction unit 24, and a segment screening unit 25;
a video dividing unit 21, configured to obtain a video identifier of a target video from the double-speed play request, determine the target video in the application client based on the video identifier, and divide the target video into K video segments based on the video slicing parameter;
A first prediction unit 22, configured to obtain a target network model associated with the target video, predict, by using the target network model, a first segment attribute of each of the K video segments, and determine a segment precision of each video segment based on the first segment attribute of each video segment;
Wherein the K video clips comprise video clips S i, and i is a positive integer less than or equal to K;
The first prediction unit 22 includes: a model acquisition subunit 221, a first determination subunit 222, a first fusion subunit 223, a highlight determination subunit 224; optionally, the first prediction unit 22 may further include: an interaction volume determination subunit 225, a segment partitioning subunit 226, a first association subunit 227, a second fusion subunit 228, and a first training subunit 229;
A model acquisition subunit 221, configured to acquire a target network model associated with the target video; the target network model comprises a first target pre-estimated network for predicting a first segment attribute of the video segment S i;
A first determining subunit 222, configured to determine, through a first target prediction network, a first image feature vector, a first audio feature vector, and a first text feature vector of the video segment S i;
The first target estimating network comprises a first image processing network, a first audio processing network and a first text processing network;
the first determination subunit 222 includes: a first extraction sub-unit 2221, a second extraction sub-unit 2222, and a third extraction sub-unit 2223;
the first extraction subunit 2221 is configured to take an image frame in the video segment S i as a first image frame, input the first image frame to the first image processing network, and perform image feature extraction on the first image frame by the first image processing network to obtain a first image feature vector of the video segment S i;
The second extraction subunit 2222 is configured to use the audio frame in the video segment S i as a first audio frame, input the first audio frame to the first audio processing network, and perform audio feature extraction on the first audio frame by the first audio processing network to obtain a first audio feature vector of the video segment S i;
The third extraction subunit 2223 is configured to take the text information associated with the video segment S i as first text information, input the first text information into the first text processing network, and perform text feature extraction on the first text information by the first text processing network to obtain a first text feature vector of the video segment S i.
The specific implementation manner of the first extraction subunit 2221, the second extraction subunit 2222, and the third extraction subunit 2223 may be referred to the description of step S202 in the embodiment corresponding to fig. 9, which will not be described herein.
The first fusion subunit 223 is configured to perform feature fusion on the first image feature vector, the first audio feature vector, and the first text feature vector to obtain a first fusion feature vector of the video segment S i, input the first fusion feature vector to a first fully-connected network in the first target prediction network, and perform feature extraction on the first fusion feature vector by the first fully-connected network to obtain a first target feature vector corresponding to the video segment S i;
The highlight determining subunit 224 is configured to input the first target feature vector into a first classifier in the first target pre-estimation network for classifying the attribute of the first segment, output, by the first classifier, a first matching degree between the first target feature vector and a first sample feature vector corresponding to a plurality of first sample attributes in the first classifier, determine a first segment attribute of the video segment S i based on the first matching degree, and determine a segment highlight of the video segment S i based on the first segment attribute.
Optionally, the interaction amount determining subunit 225 is configured to take a video segment used for training the first initial pre-estimated network as a training segment and determine the bullet screen interaction amount of the training segment; the bullet screen interaction amount of a training segment is used to describe the true segment precision of that training segment;
the segment dividing subunit 226 is configured to take a training segment whose bullet screen interaction amount is greater than an interaction threshold as a positive sample segment and take the true segment precision of the positive sample segment as a first sample label, and to take a training segment whose bullet screen interaction amount is less than or equal to the interaction threshold as a negative sample segment and take the true segment precision of the negative sample segment as a second sample label;
the first association subunit 227 is configured to determine, based on the positive sample segment and the negative sample segment, a first sample segment for training the first initial pre-estimated network, and to determine a plurality of first sample attributes based on the first sample label and the second sample label;
the second fusion subunit 228 is configured to determine, through the first initial pre-estimation network, a first sample image vector, a first sample audio vector, and a first sample text vector of the first sample segment, perform feature fusion on the first sample image vector, the first sample audio vector, and the first sample text vector to obtain a first sample fusion vector of the first sample segment, and determine a first prediction attribute of the first sample segment based on the first sample fusion vector;
the first training subunit 229 is configured to iteratively train the first initial pre-estimated network based on the predicted sample precision corresponding to the first prediction attribute and the true sample precision corresponding to the first sample attribute, to obtain the first target pre-estimated network.
The specific implementation manner of the model obtaining subunit 221, the first determining subunit 222, the first merging subunit 223, the highlight determining subunit 224, the interaction amount determining subunit 225, the segment dividing subunit 226, the first associating subunit 227, the second merging subunit 228 and the first training subunit 229 may be referred to the above description of step S202 in the corresponding embodiment of fig. 9, and will not be repeated herein.
A second prediction unit 23, configured to predict a second segment attribute of each of the K video segments by using the target network model, and determine a segment heat of each video segment based on the second segment attribute of each video segment;
wherein the K video clips comprise video clips S i, and i is a positive integer less than or equal to K; the target network model comprises a second target pre-estimated network for predicting a second segment attribute of the video segment S i;
the second prediction unit 23 includes: a second determination subunit 231, a third fusion subunit 232, an average processing subunit 233, a bullet screen amount acquisition subunit 234, and a heat determination subunit 235; optionally, the second prediction unit 23 may further include: a play amount determination subunit 236, a second association subunit 237, a fourth fusion subunit 238, and a second training subunit 239;
A second determining subunit 231 configured to determine, through a second target pre-estimation network, a second image feature vector, a second audio feature vector, and a second text feature vector of the video segment S i;
The third fusion subunit 232 is configured to perform feature fusion on the second image feature vector, the second audio feature vector, and the second text feature vector to obtain a second fusion feature vector of the video segment S i, determine a second segment attribute of the video segment S i based on the second fusion feature vector, and determine a first segment heat of the video segment S i based on the second segment attribute;
The average processing subunit 233 is configured to obtain an auxiliary video segment of a service video on a platform to which the target video belongs, and determine an average barrage amount corresponding to the video segment S i based on the barrage interaction amount and the first double speed information of the auxiliary video segment;
A bullet screen amount obtaining subunit 234, configured to obtain a segment bullet screen amount of the video segment S i, and determine a second segment heat of the video segment S i based on the segment bullet screen amount and the average bullet screen amount;
The heat determining subunit 235 is configured to determine the segment heat of the video segment S i according to the first segment heat and the second segment heat of the video segment S i.
Optionally, the play amount determining subunit 236 is configured to take the sample video used for training the second initial pre-estimated network as a second sample segment and determine the video play amount and the play completion amount of the second sample segment; the video play amount and play completion amount of a second sample segment are used to describe the real sample heat of that second sample segment;
a second association subunit 237, configured to determine, based on a product of the video play amount and the play completion amount, a real sample heat of the second sample segment, and use the determined real sample heat as a plurality of second sample attributes associated with the second sample segment;
A fourth fusion subunit 238, configured to determine, through a second initial pre-estimation network, a second sample image vector, a second sample audio vector, and a second sample text vector of the second sample segment, perform feature fusion on the second sample image vector, the second sample audio vector, and the second sample text vector, obtain a second sample fusion vector of the second sample segment, and determine a second prediction attribute of the second sample segment based on the second sample fusion vector;
And a second training subunit 239, configured to perform iterative training on the second initial estimated network based on the predicted sample heat corresponding to the second predicted attribute and the real sample heat corresponding to the second sample attribute, to obtain a second target estimated network.
The specific implementation manner of the second determining subunit 231, the third fusing subunit 232, the averaging processing subunit 233, the bullet screen volume obtaining subunit 234, the heat determining subunit 235, the play volume determining subunit 236, the second associating subunit 237, the fourth fusing subunit 238 and the second training subunit 239 may be referred to the above description of step S202 in the corresponding embodiment of fig. 9, and will not be repeated here.
A third prediction unit 24, configured to predict a third segment attribute of each of the K video segments by using the target network model, and determine a segment interest level of each video segment based on the third segment attribute of each video segment;
wherein the K video clips comprise video clips S i, and i is a positive integer less than or equal to K; the target network model comprises a third target pre-estimation network for predicting a third segment attribute of the video segment S i;
The third prediction unit 24 includes: a first video determination subunit 241, a third determination subunit 242, and an interestingness determination subunit 243; optionally, the third prediction unit 24 may further include: a completion determination subunit 244, a video scoring subunit 245, a third association subunit 246, a second video determination subunit 247, a fourth determination subunit 248, a fifth fusion subunit 249, a third training subunit 250;
A first video determining subunit 241, configured to obtain a target associated video associated with a target user in the application client, and obtain a target video tag of the target associated video, and take the target video tag as a target interest tag of the target user;
A third determining subunit 242, configured to determine a target segment feature vector of the video segment S i through a third target pre-estimation network, determine a target associated feature vector of the target associated video through the third target pre-estimation network, and determine a target interest feature vector of the target interest tag through the third target pre-estimation network;
The interestingness determining subunit 243 is configured to determine a third fusion feature vector of the video segment S i based on the target segment feature vector, the target association feature vector, and the target interest feature vector, determine a third segment attribute of the video segment S i based on the third fusion feature vector, and determine a segment interestingness of the video segment S i based on the third segment attribute.
Optionally, the completion degree determining subunit 244 is configured to determine, using the sample video for training the third initial pre-estimation network as a training video, a viewing completion degree of the sample user for the training video; the watching completion degree of one training video is used for describing the real sample interest degree of one sample user on one training video;
The video dividing sub-unit 245 is configured to take a training video with a watching completion degree greater than a completion threshold value as a positive sample video, and take a real sample interest degree of the positive sample video as a first video tag, take a training video with a watching completion degree less than or equal to the completion threshold value as a negative sample video, and take a real sample interest degree of the negative sample video as a second video tag;
A third association subunit 246 for determining a third sample segment for training a third initial pre-estimated network based on the positive sample video and the negative sample video, determining a plurality of third sample attributes based on the first video tag and the second video tag;
A second video determining subunit 247, configured to take the positive sample video as a sample related video associated with the sample user, obtain a sample video tag of the sample related video, and take the sample video tag as a sample interest tag of the sample user;
A fourth determining subunit 248, configured to determine a sample segment feature vector of the third sample segment through the third initial pre-estimation network, determine a sample associated feature vector of the sample associated video through the third initial pre-estimation network, and determine a sample interest feature vector of the sample interest tag through the third initial pre-estimation network;
A fifth fusion subunit 249 configured to determine a third sample fusion vector for the third sample segment based on the sample segment feature vector, the sample association feature vector, and the sample interest feature vector, and determine a third prediction attribute for the third sample segment based on the third sample fusion vector;
The third training subunit 250 is configured to perform iterative training on the third initial pre-estimation network based on the predicted sample interestingness corresponding to the third prediction attribute and the real sample interestingness corresponding to the third sample attribute, so as to obtain the third target pre-estimation network.
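As a minimal, hypothetical sketch of the training procedure described above (Python with PyTorch; the completion threshold of 0.8 and the binary-cross-entropy objective are assumptions, not features of the embodiment), positive and negative labels could be built from viewing completion degrees and used to update a fusion head such as the InterestHead sketched earlier:

    import torch
    import torch.nn as nn

    def build_interest_labels(completion_degrees, threshold=0.8):
        # Hypothetical label construction: videos watched beyond the completion
        # threshold become positive samples (label 1.0), the rest negative (label 0.0).
        return torch.tensor([1.0 if c > threshold else 0.0 for c in completion_degrees])

    def train_step(model, optimizer, seg_vecs, assoc_vecs, interest_vecs, labels):
        # One iteration of training the (hypothetical) third initial pre-estimation head;
        # model is assumed to output one interestingness probability per sample.
        optimizer.zero_grad()
        predicted = model(seg_vecs, assoc_vecs, interest_vecs)   # predicted sample interestingness
        loss = nn.functional.binary_cross_entropy(predicted, labels)
        loss.backward()
        optimizer.step()
        return loss.item()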
The specific implementation manner of the first video determination subunit 241, the third determination subunit 242, the interestingness determination subunit 243, the completion determination subunit 244, the video dividing subunit 245, the third association subunit 246, the second video determination subunit 247, the fourth determination subunit 248, the fifth fusion subunit 249 and the third training subunit 250 may refer to the description of step S202 in the corresponding embodiment of fig. 9, and will not be repeated here.
The segment screening unit 25 is configured to determine the segment precision of each video segment, the segment heat of each video segment and the segment interestingness of each video segment as the segment interest attribute of each video segment, screen, from the K video segments, L video segments matching the target user of the application client based on the segment interest attribute of each video segment and the double-speed play request, and use the L video segments as key video segments for double-speed playing the target video in the double-speed mode.
The segment screening unit 25 includes: a result determination subunit 251 and a segment screening subunit 252;
A result determining subunit 251, configured to obtain a double-speed evaluation result of each video segment based on the segment interest attribute of each video segment;
The segment screening subunit 252 is configured to obtain a first playing progress and first double-speed information of the target video in the double-speed playing request, screen L video segments matching the target user of the application client from the K video segments based on the first double-speed information, the first playing progress and the double-speed evaluation results, and use the L video segments as key video segments for double-speed playing of the target video in the double-speed mode.
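For illustration, a simple sketch of how the double-speed evaluation result and the screening step could be combined is given below (plain Python; the equal weighting of the three attributes and the rule that roughly 1/speed of the remaining segments are kept are assumptions made only for this example):

    def rank_and_screen(segments, speed_factor, play_progress):
        # segments: list of dicts with hypothetical keys
        # {"start": float, "end": float, "precision": float, "heat": float, "interest": float}
        # Combine the three attributes into a single double-speed evaluation score,
        # keep only segments after the current play progress, and retain roughly
        # 1/speed_factor of them as key video segments.
        candidates = [s for s in segments if s["end"] > play_progress]
        for s in candidates:
            s["score"] = (s["precision"] + s["heat"] + s["interest"]) / 3.0
        ranked = sorted(candidates, key=lambda s: s["score"], reverse=True)
        keep = max(1, round(len(ranked) / speed_factor))
        key_segments = sorted(ranked[:keep], key=lambda s: s["start"])  # restore play order
        return key_segments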
The specific implementation manner of the result determining subunit 251 and the segment screening subunit 252 may refer to the description of step S202 in the embodiment corresponding to fig. 9, which will not be described herein again.
The specific implementation manner of the video dividing unit 21, the first predicting unit 22, the second predicting unit 23, the third predicting unit 24 and the segment screening unit 25 may refer to the description of step S202 in the embodiment corresponding to fig. 9, and will not be repeated here.
The clip returning module 30 is configured to return the key video clip to the application client, so that the application client plays the key video clip of the target video in the video playing interface based on the first double-speed information.
The specific implementation manner of the request receiving module 10, the fragment determining module 20 and the fragment returning module 30 may refer to the description of steps S201 to S203 in the embodiment corresponding to fig. 9, and will not be repeated here. In addition, the description of the beneficial effects of the same method is omitted.
Further, referring to fig. 16, fig. 16 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 16, the computer device 1000 may include: a processor 1001, a network interface 1004 and a memory 1005; in addition, the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may further include a standard wired interface and a wireless interface. Optionally, the network interface 1004 may include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), for example, at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 16, the memory 1005, which is a computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 16, the network interface 1004 may provide a network communication function; the user interface 1003 is primarily used to provide an input interface for a user; and the processor 1001 may be used to invoke the device control application program stored in the memory 1005.
It should be understood that the computer device 1000 described in the embodiment of the present application may perform the video data playing method described in the embodiment corresponding to fig. 3 or fig. 9, and may also perform the functions of the video data playing apparatus 1 described in the embodiment corresponding to fig. 14 or the video data playing apparatus 2 described in the embodiment corresponding to fig. 15, which are not described herein again. In addition, the description of the beneficial effects of the same method is omitted.
Furthermore, it should be noted here that: the embodiment of the present application further provides a computer-readable storage medium, in which a computer program executed by the video data playing apparatus 1 or the video data playing apparatus 2 mentioned above is stored; the computer program includes program instructions which, when executed by a processor, can perform the video data playing method described in the embodiment corresponding to fig. 3 or fig. 9, and therefore a description will not be repeated here. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the embodiments of the computer-readable storage medium according to the present application, please refer to the description of the method embodiments of the present application.
In addition, it should be noted that: embodiments of the present application also provide a computer program product or computer program that may include computer instructions that may be stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and may execute the computer instructions, so that the computer device performs the video data playing method described in the embodiment corresponding to fig. 3 or fig. 9; therefore, a detailed description will not be given here. In addition, the description of the beneficial effects of the same method is omitted. For technical details not disclosed in the computer program product or the computer program embodiments according to the present application, reference is made to the description of the method embodiments according to the present application.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of a computer program stored in a computer-readable storage medium, which when executed may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (Random Access Memory, RAM), or the like.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (20)

1. A video data playback method, comprising:
receiving a double-speed playing request which is sent by an application client based on first double-speed information and is associated with a target video; the first double-speed information is used for indicating the application client to double-speed play the target video in a double-speed mode;
Acquiring a video identifier of the target video from the double-speed playing request, determining the target video in the application client based on the video identifier, and dividing the target video into K video segments based on video segmentation parameters;
Obtaining a target network model associated with the target video, predicting a first segment attribute of each video segment in the K video segments through the target network model, and determining segment precision of each video segment based on the first segment attribute of each video segment;
Predicting a second segment attribute of each video segment in the K video segments through the target network model, and determining the segment heat of each video segment based on the second segment attribute of each video segment;
predicting a third segment attribute of each video segment in the K video segments through the target network model, and determining the segment interestingness of each video segment based on the third segment attribute of each video segment;
Determining the segment precision of each video segment, the segment heat of each video segment and the segment interestingness of each video segment as the segment interest attribute of each video segment, and obtaining a double-speed evaluation result of each video segment based on the segment interest attribute of each video segment;
Sorting the K video segments based on the K double-speed evaluation results to obtain K sequenced video segments; the K is a positive integer;
Screening L video segments matched with a target user of the application client from the K sequenced video segments based on the double-speed playing request, and taking the L video segments as key video segments for double-speed playing of the target video in the double-speed mode; the L is a positive integer less than K;
And returning the key video segments to the application client so that the application client plays the key video segments of the target video.
2. The method of claim 1, wherein the K video clips include video clip S i, and wherein i is a positive integer less than or equal to K;
The obtaining a target network model associated with the target video, predicting, by the target network model, a first segment attribute of each of the K video segments, determining the segment precision of each video segment based on the first segment attribute of each video segment, including:
acquiring a target network model associated with the target video; the target network model comprises a first target pre-estimated network for predicting a first segment attribute of the video segment S i;
determining a first image feature vector, a first audio feature vector and a first text feature vector of the video segment S i through the first target pre-estimation network;
Performing feature fusion on the first image feature vector, the first audio feature vector and the first text feature vector to obtain a first fusion feature vector of the video segment S i, inputting the first fusion feature vector into a first full-connection network in the first target pre-estimation network, and performing feature extraction on the first fusion feature vector by the first full-connection network to obtain a first target feature vector corresponding to the video segment S i;
Inputting the first target feature vector into a first classifier used for classifying the first segment attribute in the first target pre-estimated network, outputting a first matching degree between the first target feature vector and first sample feature vectors corresponding to a plurality of first sample attributes in the first classifier by the first classifier, determining the first segment attribute of the video segment S i based on the first matching degree, and determining the segment precision of the video segment S i based on the first segment attribute.
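By way of a non-limiting illustration of the fusion, fully-connected and matching steps recited above, the following Python/PyTorch sketch (the class name, vector dimension and two-attribute classifier are hypothetical and not taken from the claims) shows one possible arrangement:

    import torch
    import torch.nn as nn

    class FirstEstimationHead(nn.Module):
        # Hypothetical head mirroring the fusion / fully-connected / matching steps:
        # the three modality vectors are concatenated, projected by a fully-connected
        # layer into the first target feature vector, and then matched against learned
        # sample feature vectors (one per first sample attribute) to obtain a first
        # matching degree, from which a segment precision is derived.
        def __init__(self, dim: int = 256, num_attributes: int = 2):
            super().__init__()
            self.fuse_fc = nn.Linear(dim * 3, dim)
            self.sample_vectors = nn.Parameter(torch.randn(num_attributes, dim))

        def forward(self, image_vec, audio_vec, text_vec):
            fused = torch.cat([image_vec, audio_vec, text_vec], dim=-1)  # first fusion feature vector
            target = torch.relu(self.fuse_fc(fused))                     # first target feature vector
            matching = target @ self.sample_vectors.T                    # first matching degree per attribute
            probs = torch.softmax(matching, dim=-1)
            return probs[..., 1]                                         # segment precision as the "high precision" score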
3. The method of claim 2, wherein the first target prediction network comprises a first image processing network, a first audio processing network, and a first text processing network;
The determining, by the first target prediction network, the first image feature vector, the first audio feature vector, and the first text feature vector of the video segment S i includes:
Taking the image frame in the video segment S i as a first image frame, inputting the first image frame into the first image processing network, and extracting image features of the first image frame by the first image processing network to obtain a first image feature vector of the video segment S i;
Taking the audio frames in the video segment S i as first audio frames, inputting the first audio frames into the first audio processing network, and extracting audio features of the first audio frames by the first audio processing network to obtain a first audio feature vector of the video segment S i;
And taking the text information associated with the video segment S i as first text information, inputting the first text information into the first text processing network, and extracting text characteristics of the first text information by the first text processing network to obtain a first text characteristic vector of the video segment S i.
4. The method according to claim 2, wherein the method further comprises:
taking a video segment for training a first initial pre-estimated network as a training segment, and determining a barrage interaction quantity of the training segment; the barrage interaction quantity of one training segment is used for describing the true segment precision of the training segment;
Taking a training segment with the barrage interaction quantity larger than an interaction threshold value as a positive sample segment, taking the true segment precision of the positive sample segment as a first sample label, taking a training segment with the barrage interaction quantity smaller than or equal to the interaction threshold value as a negative sample segment, and taking the true segment precision of the negative sample segment as a second sample label;
Determining a first sample segment for training the first initial pre-estimated network based on the positive sample segment and the negative sample segment, determining the plurality of first sample attributes based on the first sample tag and the second sample tag;
Determining a first sample image vector, a first sample audio vector and a first sample text vector of the first sample segment through the first initial pre-estimation network, performing feature fusion on the first sample image vector, the first sample audio vector and the first sample text vector to obtain a first sample fusion vector of the first sample segment, and determining a first prediction attribute of the first sample segment based on the first sample fusion vector;
And performing iterative training on the first initial pre-estimated network based on the predicted sample precision corresponding to the first prediction attribute and the real sample precision corresponding to the first sample attribute, to obtain the first target pre-estimated network.
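A minimal sketch of the label construction described above is shown below (plain Python; the interaction threshold of 100 is an arbitrary illustrative value, not a value taken from the claims):

    def build_precision_labels(barrage_counts, interaction_threshold=100):
        # Hypothetical label construction for the first initial pre-estimated network:
        # training segments whose barrage interaction quantity exceeds the threshold
        # are treated as positive samples (high segment precision), the rest as negative.
        labels = []
        for count in barrage_counts:
            labels.append(1.0 if count > interaction_threshold else 0.0)
        return labels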
5. The method of claim 1, wherein the K video clips include video clip S i, and wherein i is a positive integer less than or equal to K; the target network model comprises a second target pre-estimation network for predicting a second segment attribute of the video segment S i;
The predicting, by the target network model, the second segment attribute of each of the K video segments, determining the segment popularity of each video segment based on the second segment attribute of each video segment, including:
Determining a second image feature vector, a second audio feature vector and a second text feature vector of the video segment S i through the second target pre-estimation network;
Performing feature fusion on the second image feature vector, the second audio feature vector and the second text feature vector to obtain a second fusion feature vector of the video segment S i, determining the second segment attribute of the video segment S i based on the second fusion feature vector, and determining the first segment heat of the video segment S i based on the second segment attribute;
Acquiring an auxiliary video segment of a business video on a platform to which the target video belongs, and determining an average barrage amount corresponding to the video segment S i based on the barrage interaction quantity of the auxiliary video segment and the first double-speed information;
Acquiring a segment barrage amount of the video segment S i, and determining a second segment heat of the video segment S i based on the segment barrage amount and the average barrage amount;
And determining the segment heat of the video segment S i according to the first segment heat and the second segment heat of the video segment S i.
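For illustration, the combination of the two heat estimates could be organised as in the following Python sketch; the equal weighting, the capping of the second segment heat at 1.0 and the way the speed factor scales the average barrage amount are assumptions made only for this example:

    def segment_heat(first_heat, segment_barrage, auxiliary_barrages, speed_factor,
                     model_weight=0.5):
        # Hypothetical combination of the two heat estimates described above:
        # - first_heat: heat predicted by the second target pre-estimation network
        # - second heat: the segment's own barrage amount relative to an average
        #   barrage amount derived from auxiliary segments and the speed information
        average_barrage = (sum(auxiliary_barrages) / len(auxiliary_barrages)) * speed_factor
        second_heat = min(segment_barrage / average_barrage, 1.0) if average_barrage > 0 else 0.0
        return model_weight * first_heat + (1.0 - model_weight) * second_heat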
6. The method of claim 5, wherein the method further comprises:
Taking a sample video for training a second initial pre-estimated network as a second sample segment, and determining a video play amount and a play completion amount of the second sample segment; the video play amount and the play completion amount of one second sample segment are used for describing the real sample heat of one second sample segment;
Determining the real sample heat of the second sample segment based on a product of the video play amount and the play completion amount, and taking the determined real sample heat as a plurality of second sample attributes associated with the second sample segment;
Determining a second sample image vector, a second sample audio vector and a second sample text vector of the second sample segment through the second initial pre-estimation network, performing feature fusion on the second sample image vector, the second sample audio vector and the second sample text vector to obtain a second sample fusion vector of the second sample segment, and determining a second prediction attribute of the second sample segment based on the second sample fusion vector;
And performing iterative training on the second initial pre-estimated network based on the predicted sample heat corresponding to the second prediction attribute and the real sample heat corresponding to the second sample attribute, to obtain the second target pre-estimation network.
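A minimal, hypothetical sketch of how the real sample heat labels could be derived from the play amount and play completion amount is given below (plain Python; the normalisation to [0, 1] by the largest product is an assumption made only for this example):

    def build_heat_labels(play_counts, completion_amounts):
        # Hypothetical real-sample-heat labels for the second initial pre-estimated
        # network: the product of play amount and play completion amount, normalised
        # by the largest product in the batch.
        products = [p * c for p, c in zip(play_counts, completion_amounts)]
        peak = max(products) if products else 1.0
        peak = peak or 1.0  # avoid division by zero when every product is zero
        return [v / peak for v in products]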
7. The method of claim 1, wherein the K video clips include video clip S i, and wherein i is a positive integer less than or equal to K; the target network model comprises a third target pre-estimation network for predicting a third segment attribute of the video segment S i;
the predicting, by the target network model, a third segment attribute of each of the K video segments, and determining, based on the third segment attribute of each video segment, a segment interest level of each video segment includes:
acquiring a target associated video associated with a target user in an application client, acquiring a target video tag of the target associated video, and taking the target video tag as a target interest tag of the target user;
Determining a target segment feature vector of the video segment S i through the third target pre-estimation network, determining a target associated feature vector of the target associated video through the third target pre-estimation network, and determining a target interest feature vector of the target interest tag through the third target pre-estimation network;
A third fused feature vector of the video segment S i is determined based on the target segment feature vector, the target associated feature vector, and the target interest feature vector, the third segment attribute of the video segment S i is determined based on the third fused feature vector, and a segment interest level of the video segment S i is determined based on the third segment attribute.
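By way of illustration only, the following Python sketch (using PyTorch; the class name InterestHead, the vector dimension and the two-class attribute space are hypothetical and not taken from the claims) shows one possible way to fuse the target segment feature vector, the target associated feature vector and the target interest feature vector and derive a segment interestingness from the resulting third segment attribute. A comparable sketch appears after the description of the interestingness determining subunit above.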
8. The method of claim 7, wherein the method further comprises:
taking a sample video for training a third initial pre-estimated network as a training video, and determining the watching completion degree of a sample user on the training video; the watching completion degree of one training video is used for describing the real sample interestingness of one sample user on one training video;
Taking the training video with the watching completion degree larger than a completion threshold value as a positive sample video, taking the real sample interestingness of the positive sample video as a first video tag, taking the training video with the watching completion degree smaller than or equal to the completion threshold value as a negative sample video, and taking the real sample interestingness of the negative sample video as a second video tag;
determining a third sample segment for training the third initial pre-estimated network based on the positive sample video and the negative sample video, determining a plurality of third sample attributes based on the first video tag and the second video tag;
Taking the positive sample video as a sample associated video associated with the sample user, acquiring a sample video tag of the sample associated video, and taking the sample video tag as a sample interest tag of the sample user;
Determining sample segment feature vectors of the third sample segments through the third initial pre-estimation network, determining sample associated feature vectors of the sample associated video through the third initial pre-estimation network, and determining sample interest feature vectors of the sample interest tags through the third target pre-estimation network;
Determining a third sample fusion vector for the third sample segment based on the sample segment feature vector, the sample association feature vector, and the sample interest feature vector, and determining a third prediction attribute for the third sample segment based on the third sample fusion vector;
And performing iterative training on the third initial pre-estimated network based on the predicted sample interestingness corresponding to the third prediction attribute and the real sample interestingness corresponding to the third sample attribute, to obtain the third target pre-estimation network.
9. The method of claim 1, wherein the screening L video clips matching the target user of the application client from the K sequenced video clips based on the double-speed play request comprises:
Acquiring a first playing progress and the first double-speed information of the target video in the double-speed playing request;
And screening L video clips matched with a target user of the application client from the K sequenced video clips based on the first double-speed information and the first playing progress, and taking the L video clips as key video clips for double-speed playing of the target video in the double-speed mode.
10. A video data playback method, comprising:
displaying a video playing interface for playing the target video;
responding to a first triggering operation for the video playing interface, and displaying N double-speed controls associated with the double-speed mode of the target video; the N is a positive integer;
Responding to a second trigger operation for the N double-speed controls, determining first double-speed information indicated by the double-speed control corresponding to the second trigger operation, and playing the key video segments of the target video in the video playing interface; the key video segments are L video segments which are selected from the K sequenced video segments of the target video and are associated with the first double-speed information; the K sequenced video segments are obtained by sequencing the K video segments based on the double-speed evaluation results respectively corresponding to the K video segments of the target video; the K is a positive integer; the L is a positive integer less than K; the double-speed evaluation result of each video segment is obtained based on the segment interest attribute of each video segment; the segment interest attribute of each video segment is determined by the segment precision of each video segment, the segment heat of each video segment and the segment interestingness of each video segment; the segment precision of each video segment is determined based on a first segment attribute of each video segment, the first segment attribute of each video segment being predicted by a target network model associated with the target video; the segment heat of each video segment is determined based on a second segment attribute of each video segment, the second segment attribute of each video segment being predicted by the target network model; the segment interestingness of each video segment is determined based on a third segment attribute of each video segment, the third segment attribute of each video segment being predicted by the target network model.
11. The method of claim 10, wherein the video playback interface is a full screen playback interface for playing the target video;
The responding to the first triggering operation for the video playing interface displays N times speed controls associated with the times speed mode of the target video, and the responding comprises the following steps:
responding to a first triggering operation for the full-screen playing interface, triggering a double-speed mode of the target video, and displaying a first control display interface independent of the full-screen playing interface based on the double-speed mode; the interface size of the first control display interface is smaller than the interface size of the full-screen playing interface;
and displaying the N double-speed controls associated with the double-speed mode in the first control display interface.
12. The method of claim 10, wherein the video playback interface is a non-full screen playback interface for playing the target video;
The responding to the first triggering operation for the video playing interface displays N times speed controls associated with the times speed mode of the target video, and the responding comprises the following steps:
playing the target video in a video playing area of the non-full screen playing interface;
Responding to a first triggering operation for the non-full-screen playing interface, and displaying a control display area of the target video in the non-full-screen playing interface; the control display area is an area which is suspended above the video playing area, or the control display area is an area which is not overlapped with the video playing area;
And responding to double-speed selection operation aiming at the control display area, triggering a double-speed mode of the target video, and displaying N double-speed controls associated with the double-speed mode in a second control display interface based on the double-speed mode.
13. The method of claim 10, wherein determining, in response to the second trigger operation for the N double-speed controls, the first double-speed information indicated by the double-speed control corresponding to the second trigger operation, and playing the key video clip of the target video in the video playing interface includes:
responding to the second trigger operation for the N double-speed controls, determining the first double-speed information indicated by the double-speed control corresponding to the second trigger operation, and determining the playing progress of the target video in the video playing interface as a first playing progress;
When the network state of the application client for playing the target video is detected to belong to a first network state, acquiring a double-speed playing fragment identifier associated with the first double-speed information and the first playing progress from a server based on the first network state; a double-speed playing segment identifier is used for representing the segment position of a key video segment in the target video;
acquiring a key video clip matched with the double-speed playing clip identifier from a server;
and playing the key video snippet of the target video in the video playing interface based on the snippet position of the key video snippet in the target video.
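By way of illustration, the client-side flow for the first network state could resemble the following Python sketch; the HTTP endpoints, the use of the requests library and the player call are hypothetical stand-ins, since the claim does not specify a transport or player interface:

    import requests  # hypothetical HTTP client; the actual transport is not specified by the claim

    def player_seek_and_play(position, url):
        # Placeholder for the application client's player; a real client would seek to
        # the segment position in the target video and start playback of the segment.
        print(f"play {url} at {position}s")

    def play_at_double_speed(server, video_id, speed_info, play_progress):
        # Hypothetical client-side flow for the first network state:
        # 1) ask the server for the double-speed playing segment identifiers,
        # 2) fetch the matching key video segments,
        # 3) hand each segment to the player at its position in the target video.
        ids = requests.get(f"{server}/speed-segments", params={
            "video_id": video_id, "speed": speed_info, "progress": play_progress,
        }).json()["segment_ids"]
        for segment_id in ids:
            segment = requests.get(f"{server}/segments/{segment_id}").json()
            player_seek_and_play(segment["position"], segment["url"])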
14. The method of claim 10, wherein determining, in response to the second trigger operation for the N double-speed controls, the first double-speed information indicated by the double-speed control corresponding to the second trigger operation, and playing the key video clip of the target video in the video playing interface includes:
responding to the second trigger operation for the N double-speed controls, determining the first double-speed information indicated by the double-speed control corresponding to the second trigger operation, and determining the playing progress of the target video in the video playing interface as a first playing progress;
When the network state of the application client for playing the target video is detected to be in a second network state, acquiring a key video segment associated with the first double-speed information and the first playing progress from a server based on the second network state; the key video segment is determined by the server from a key segment set of the target video based on the first double-speed information and the first playing progress; the key segment set is determined by the server based on L video segments; the L video segments are determined based on the segment interest attributes of K video segments of the target video; the L is a positive integer less than K; the K is a positive integer;
And playing the key video snippet of the target video in the video playing interface.
15. The method according to claim 10, wherein the method further comprises:
taking the double-speed control corresponding to the second trigger operation as a first double-speed control, and when the key video clip is played in the video playing interface, responding to a third trigger operation for the video playing interface, and displaying the N double-speed controls associated with the double-speed mode; the N double-speed controls comprise a second double-speed control;
Responding to a fourth triggering operation for the second double-speed control, and switching double-speed information for double-speed playing the target video from the first double-speed information indicated by the first double-speed control to second double-speed information indicated by the second double-speed control;
Determining the playing progress of the key video segment in the target video as a second playing progress, and determining a switching video segment for playing in the video playing interface based on the second double-speed information and the second playing progress; the switching video clips are video clips which are selected from the video clips of the target video and are associated with the second double-speed information and the second playing progress;
and playing the switching video clip in the video playing interface.
16. A video data playback apparatus, comprising:
The request receiving module is used for receiving a double-speed playing request which is sent by the application client based on the first double-speed information and is associated with the target video; the first double-speed information is used for indicating the application client to double-speed play the target video in a double-speed mode;
The segment determining module is used for acquiring the video identification of the target video from the double-speed playing request, determining the target video in the application client based on the video identification, and dividing the target video into K video segments based on video segmentation parameters;
A segment determining module, configured to obtain a target network model associated with the target video, predict, by using the target network model, a first segment attribute of each of the K video segments, and determine the segment precision of each video segment based on the first segment attribute of each video segment;
A segment determining module, configured to predict, by using the target network model, a second segment attribute of each of the K video segments, and determine a segment heat of each of the video segments based on the second segment attribute of each of the video segments;
A segment determining module, configured to predict, by using the target network model, a third segment attribute of each video segment of the K video segments, and determine a segment interest level of each video segment based on the third segment attribute of each video segment;
The segment determining module is used for determining the segment precision of each video segment, the segment heat of each video segment and the segment interestingness of each video segment as the segment interest attribute of each video segment, and obtaining a double-speed evaluation result of each video segment based on the segment interest attribute of each video segment;
The segment determining module is used for sequencing the K video segments based on the K multiple speed evaluation results to obtain sequenced K video segments; the K is a positive integer;
The segment determining module is used for screening L video segments matched with a target user of the application client from the K sequenced video segments based on the double-speed playing request, and taking the L video segments as key video segments for double-speed playing of the target video in the double-speed mode; l is a positive integer less than K;
and the segment returning module is used for returning the key video segment to the application client so that the application client plays the key video segment of the target video.
17. A video data playback apparatus, comprising:
the interface display module is used for displaying a video playing interface for playing the target video;
The first response module is used for responding to a first triggering operation aiming at the video playing interface and displaying N double-speed controls associated with the double-speed mode of the target video; the N is a positive integer;
The second response module is used for responding to a second trigger operation for the N double-speed controls, determining the first double-speed information indicated by the double-speed control corresponding to the second trigger operation, and playing the key video segments of the target video in the video playing interface; the key video segments are L video segments which are selected from the K sequenced video segments of the target video and are associated with the first double-speed information; the K sequenced video segments are obtained by sequencing the K video segments based on the double-speed evaluation results respectively corresponding to the K video segments of the target video; the K is a positive integer; the L is a positive integer less than K; the double-speed evaluation result of each video segment is obtained based on the segment interest attribute of each video segment; the segment interest attribute of each video segment is determined by the segment precision of each video segment, the segment heat of each video segment and the segment interestingness of each video segment; the segment precision of each video segment is determined based on a first segment attribute of each video segment, the first segment attribute of each video segment being predicted by a target network model associated with the target video; the segment heat of each video segment is determined based on a second segment attribute of each video segment, the second segment attribute of each video segment being predicted by the target network model; the segment interestingness of each video segment is determined based on a third segment attribute of each video segment, the third segment attribute of each video segment being predicted by the target network model.
18. A computer device, comprising: a processor and a memory;
The processor is connected to the memory; the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any one of claims 1-15.
19. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-15.
20. A computer program product, characterized in that it comprises computer instructions stored in a computer-readable storage medium and adapted to be read and executed by a processor to cause a computer device with the processor to perform the method of any of claims 1-15.
CN202110492275.1A 2021-05-06 2021-05-06 Video data playing method, device, equipment and medium Active CN113766299B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110492275.1A CN113766299B (en) 2021-05-06 2021-05-06 Video data playing method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN113766299A CN113766299A (en) 2021-12-07
CN113766299B true CN113766299B (en) 2024-04-19

Family

ID=78787093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110492275.1A Active CN113766299B (en) 2021-05-06 2021-05-06 Video data playing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN113766299B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114327433A (en) * 2022-01-11 2022-04-12 北京字节跳动网络技术有限公司 Media data playing control method and device, electronic equipment and storage medium
CN114419527B (en) * 2022-04-01 2022-06-14 腾讯科技(深圳)有限公司 Data processing method, equipment and computer readable storage medium
CN115002551A (en) * 2022-05-31 2022-09-02 深圳市艾酷通信软件有限公司 Video playing method and device, electronic equipment and medium
CN115474086B (en) * 2022-09-14 2023-07-18 北京字跳网络技术有限公司 Play control method, device, electronic equipment and storage medium
CN116992073B (en) * 2023-09-28 2023-12-08 北京小糖科技有限责任公司 Video clip ordering method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103686411A (en) * 2013-12-11 2014-03-26 深圳Tcl新技术有限公司 Method for playing video and multimedia device
CN110113666A (en) * 2019-05-10 2019-08-09 腾讯科技(深圳)有限公司 A kind of method for broadcasting multimedia file, device, equipment and storage medium
CN110121098A (en) * 2018-02-05 2019-08-13 腾讯科技(深圳)有限公司 Video broadcasting method, device, storage medium and electronic device
CN110730387A (en) * 2019-11-13 2020-01-24 腾讯科技(深圳)有限公司 Video playing control method and device, storage medium and electronic device
CN110769314A (en) * 2019-11-20 2020-02-07 三星电子(中国)研发中心 Video playing method and device and computer readable storage medium
CN111385606A (en) * 2018-12-28 2020-07-07 Tcl集团股份有限公司 Video preview method and device and intelligent terminal
CN111935503A (en) * 2020-06-28 2020-11-13 百度在线网络技术(北京)有限公司 Short video generation method and device, electronic equipment and storage medium
CN112004117A (en) * 2020-09-02 2020-11-27 维沃移动通信有限公司 Video playing method and device

Also Published As

Publication number Publication date
CN113766299A (en) 2021-12-07

Similar Documents

Publication Publication Date Title
CN111143610B (en) Content recommendation method and device, electronic equipment and storage medium
CN112565825B (en) Video data processing method, device, equipment and medium
CN110781347B (en) Video processing method, device and equipment and readable storage medium
CN113766299B (en) Video data playing method, device, equipment and medium
CN111209440B (en) Video playing method, device and storage medium
CN113569088B (en) Music recommendation method and device and readable storage medium
CN112163122B (en) Method, device, computing equipment and storage medium for determining label of target video
CN109871736B (en) Method and device for generating natural language description information
CN111491187B (en) Video recommendation method, device, equipment and storage medium
CN111611436A (en) Label data processing method and device and computer readable storage medium
CN103686344A (en) Enhanced video system and method
CN111432282B (en) Video recommendation method and device
CN113395578A (en) Method, device and equipment for extracting video theme text and storage medium
CN113392270A (en) Video processing method, video processing device, computer equipment and storage medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN114332679A (en) Video processing method, device, equipment, storage medium and computer program product
CN114339360B (en) Video processing method, related device and equipment
CN116977701A (en) Video classification model training method, video classification method and device
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN111954087B (en) Method and device for intercepting images in video, storage medium and electronic equipment
CN114845149B (en) Video clip method, video recommendation method, device, equipment and medium
CN116935170A (en) Processing method and device of video processing model, computer equipment and storage medium
CN113395584B (en) Video data processing method, device, equipment and medium
CN112445921B (en) Digest generation method and digest generation device
CN112333554B (en) Multimedia data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant