CN112905839A - Model training method, model using device, storage medium and equipment

Info

Publication number: CN112905839A
Application number: CN202110185069.6A
Authority: CN
Inventors: 宋治勋
Original assignee: Beijing Youzhuju Network Technology Co Ltd
Current assignee: Beijing Youzhuju Network Technology Co Ltd
Priority date: 2021-02-10
Filing date: 2021-02-10
Publication date: 2021-06-04

Abstract

The embodiments of the present disclosure disclose a model training method, a model using method, corresponding apparatuses, a storage medium and a device. The model training method comprises: determining a target video sample and a user sample; obtaining historical behavior data samples of the user sample for recommended videos that appeared before the target video sample; determining corresponding target historical video samples according to the historical behavior data samples; determining a training sample set according to the feature data of the target historical video samples, the feature data of the target video sample, and the behavior data of the user sample for the target video sample; and training a preset neural network model with the training sample set to obtain a user behavior prediction model for a video to be recommended. With this technical solution, the content understanding capability of the user behavior prediction model with respect to the target video sample can be enhanced, and the prediction accuracy of the user behavior prediction model for video recommendation can be improved.

Description

Model training method, model using device, storage medium and equipment

Technical Field

The embodiments of the present disclosure relate to the field of computer technologies, and in particular to a model training method, a model using method, corresponding apparatuses, a storage medium, and a device.

Background

With the rapid development of information technology and the popularization of mobile internet, a video recommendation platform is favored by more and more users. However, as the number of videos in the video library rapidly increases, the difficulty of rapidly and accurately selecting the videos which need to be recommended to the user from a large number of videos is increasing.

When watching videos, a user typically generates behaviors such as clicking, liking, commenting, collecting, and forwarding. These behaviors reflect the user's degree of interest in the videos, and, with the user's authorization, the data related to these behaviors can be obtained to provide reference information for video recommendation. For example, for a video in a video library, a user behavior prediction model may be used to predict the probability that the user performs a corresponding behavior on the video, and that probability is then used to decide whether to recommend the video to the user.

Disclosure of Invention

The embodiment of the disclosure provides a model training method, a model using device, a storage medium and a device, which can solve the problem of insufficient prediction accuracy of a user behavior prediction model for video recommendation.

In a first aspect, an embodiment of the present disclosure provides a model training method, including:

determining a target video sample and a user sample, wherein the user sample comprises a recommended user corresponding to the target video sample;

acquiring historical behavior data samples of recommended videos of the user samples before the target video samples, and determining corresponding target historical video samples according to the historical behavior data samples;

determining a training sample set according to the feature data of the target historical video sample, the feature data of the target video sample and the behavior data of the user sample aiming at the target video sample;

and training a preset neural network model by adopting the training sample set to obtain a user behavior prediction model aiming at the video to be recommended.

In a second aspect, an embodiment of the present disclosure provides a model using method, including:

acquiring characteristic data of a video to be recommended;

inputting the characteristic data of the video to be recommended into a user behavior prediction model for the video to be recommended, wherein the user behavior prediction model for the video to be recommended is obtained by adopting the model training method provided by the embodiment of the invention;

and predicting the user behavior of the current user for the video to be recommended according to the output result of the user behavior prediction model for the video to be recommended.

In a third aspect, an embodiment of the present disclosure provides a model using method, including:

acquiring characteristic data of a video to be recommended;

acquiring historical behavior data of a recommended video which appears before the video to be recommended by a current user, and determining a corresponding target historical video according to the historical behavior data;

inputting the feature data of the video to be recommended and the feature data of the target historical video into a user behavior prediction model for the video to be recommended, wherein the user behavior prediction model for the video to be recommended is obtained by adopting the model training method provided by the embodiment of the invention;

In a fourth aspect, an embodiment of the present disclosure provides a model training apparatus, including:

a first sample determining module, configured to determine a target video sample and a user sample, wherein the user sample comprises a recommended user corresponding to the target video sample;

the second sample determining module is used for acquiring historical behavior data samples of recommended videos of the user samples before the target video samples, and determining corresponding target historical video samples according to the historical behavior data samples;

a training sample set determining module, configured to determine a training sample set according to the feature data of the target historical video sample, the feature data of the target video sample, and behavior data of the user sample for the target video sample;

and the model training module is used for training a preset neural network model by adopting the training sample set to obtain a user behavior prediction model aiming at the video to be recommended.

In a fifth aspect, an embodiment of the present disclosure provides a model using apparatus, including:

the characteristic data acquisition module is used for acquiring the characteristic data of the video to be recommended;

the data input module is used for inputting the characteristic data of the video to be recommended into a user behavior prediction model aiming at the video to be recommended, wherein the user behavior prediction model aiming at the video to be recommended is obtained by adopting the model training method provided by the embodiment of the invention;

and the behavior prediction module is used for predicting the user behavior of the current user for the video to be recommended according to the output result of the user behavior prediction model for the video to be recommended.

In a sixth aspect, an embodiment of the present disclosure provides a model using apparatus, including:

the historical video determining module is used for acquiring historical behavior data of a recommended video which appears before the video to be recommended by a current user, and determining a corresponding target historical video according to the historical behavior data;

the data input module is used for inputting the characteristic data of the video to be recommended and the characteristic data of the target historical video into a user behavior prediction model aiming at the video to be recommended, wherein the user behavior prediction model aiming at the video to be recommended is obtained by adopting the model training method provided by the embodiment of the invention;

In a seventh aspect, the disclosed embodiments provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the model training method and/or the model using method provided by the disclosed embodiments.

In an eighth aspect, the present disclosure provides a computer device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the model training method and/or the model using method provided by the embodiments of the present disclosure when executing the computer program.

According to the model training scheme provided by the embodiments of the present disclosure, a target video sample and a user sample are determined, wherein the user sample comprises a recommended user corresponding to the target video sample; historical behavior data samples of the user sample for recommended videos that appeared before the target video sample are obtained, and corresponding target historical video samples are determined according to the historical behavior data samples; a training sample set is determined according to the feature data of the target historical video samples, the feature data of the target video sample, and the behavior data of the user sample for the target video sample; and a preset neural network model is trained with the training sample set to obtain a user behavior prediction model for a video to be recommended. With this technical solution, when the user behavior prediction model is trained, historical video samples are selected according to the historical behavior data of the user sample, and the feature data of the historical video samples is input into the neural network model together with the feature data of the target video sample as training samples. This adds content attribute features related to user behavior to the target video sample, enhances the content understanding capability of the user behavior prediction model with respect to the target video sample, and can improve the prediction accuracy of the user behavior prediction model for video recommendation.

Drawings

Fig. 1 is a schematic flow chart of a model training method according to an embodiment of the present disclosure;

Fig. 2 is a schematic flow chart of another model training method according to an embodiment of the present disclosure;

Fig. 3 is a schematic structural diagram of a neural network model according to an embodiment of the present disclosure;

Fig. 4 is a schematic flow chart of a model using method according to an embodiment of the present disclosure;

Fig. 5 is a schematic flow chart of another model using method according to an embodiment of the present disclosure;

Fig. 6 is a block diagram of a model training apparatus according to an embodiment of the present disclosure;

Fig. 7 is a block diagram of a model using apparatus according to an embodiment of the present disclosure;

Fig. 8 is a block diagram of another model using apparatus according to an embodiment of the present disclosure;

Fig. 9 is a block diagram of a computer device according to an embodiment of the present disclosure.

Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.

It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It should be noted that the modifiers "a", "an", and "the" in the present disclosure are illustrative rather than limiting, and those skilled in the art will understand that they mean "one or more" unless the context clearly dictates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

In the following embodiments, optional features and examples are provided in each embodiment, and various features described in the embodiments may be combined to form a plurality of alternatives, and each numbered embodiment should not be regarded as only one technical solution.

Fig. 1 is a flowchart of a model training method provided in an embodiment of the present disclosure. The method is applicable to a scenario in which a user behavior prediction model for a video to be recommended is trained, and may be executed by a model training apparatus, which may be implemented by software and/or hardware and may generally be integrated in a computer device. As shown in Fig. 1, the method includes:

Step 101, determining a target video sample and a user sample, wherein the user sample comprises a recommended user corresponding to the target video sample.

In the embodiment of the present disclosure, the specific type of the video is not limited. Illustratively, the video can be a long video or a short video; the video can be a video shot by a photographer or a video made by a video platform. For short video recommendation platforms, the video may also be an online promotional video (e.g., an advertising video).

For example, the target video sample and the user sample may come from a test scenario or from a real usage scenario. In a test scenario, test participants can serve as user samples: a certain number of videos are recommended to the participants, and, with their authorization, their behaviors while watching the videos are recorded. After a period of time, the videos currently recommended to all participants are selected as target video samples. Here, recommending can be understood as showing and playing the video to the participant. In a real usage scenario, background data of a video recommendation platform can be obtained with user authorization. Suppose a first video that has been recommended to users is selected as the target video sample; a certain number of users are then selected as user samples from among the users to whom the first video was recommended. Here, recommending can be understood as pushing the video to the client device used by the user so that the video can be played on that device.

Step 102, obtaining historical behavior data samples of the user sample for recommended videos that appeared before the target video sample, and determining corresponding target historical video samples according to the historical behavior data samples.

For example, the behaviors involved in the embodiments of the present disclosure may include behaviors that a user voluntarily applies to a video. These include behaviors, embodied in various operation forms, that indicate the user likes the video (referred to below as selecting the video), such as clicking, liking, commenting, collecting, and forwarding. They may also include behaviors, embodied in various operation forms, that indicate the user dislikes the video (referred to below as ignoring the video), such as closing, switching (for example, switching to the next video through a sliding operation), downvoting (an operation expressing dislike), or reporting.

For example, videos are generally recommended to a user one after another in sequence, and the user may exhibit various behaviors toward them. For a target video sample, a video that was recommended to the user sample before the target video sample is called a historical video sample (that is, a recommended video that appeared before the target video sample). The number of historical video samples can be set according to the actual situation and may be the same or different for different user samples. The same user sample may behave differently toward different historical video samples, and different user samples may behave differently toward the same historical video sample. Data recording a user's behavior toward a video is called behavior data, and the behavior data of a user sample for a historical video sample is a historical behavior data sample. In the embodiments of the present disclosure, the type of behavior of interest can be determined in advance. Taking selecting a video as an example, the historical video samples whose historical behavior data samples contain a selecting behavior can be determined as target historical video samples; taking ignoring a video as an example, the historical video samples whose historical behavior data samples contain an ignoring behavior can be determined as target historical video samples.
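By way of illustration only, the selection of target historical video samples by behavior type described above could be sketched as follows. The record layout, the behavior names, and the function `target_history_videos` are assumptions made for this sketch and are not defined by the present disclosure.

```python
from dataclasses import dataclass

# Hypothetical behavior categories; the labels are illustrative only.
SELECT_BEHAVIORS = {"click", "like", "comment", "collect", "forward"}
IGNORE_BEHAVIORS = {"close", "switch", "downvote", "report"}


@dataclass(frozen=True)
class BehaviorRecord:
    """One historical behavior data sample: what a user did to a recommended video."""
    user_id: str
    video_id: str
    behavior: str
    position: int  # order in which the video was recommended to the user


def target_history_videos(records, user_id, target_position, behaviors_of_interest):
    """Return ids of videos recommended before the target video on which the
    user performed a behavior of interest, in recommendation order."""
    hits = [
        r for r in records
        if r.user_id == user_id
        and r.position < target_position
        and r.behavior in behaviors_of_interest
    ]
    hits.sort(key=lambda r: r.position)
    # dict.fromkeys drops duplicate video ids while keeping first-seen order
    return list(dict.fromkeys(r.video_id for r in hits))


records = [
    BehaviorRecord("u1", "v1", "click", 0),
    BehaviorRecord("u1", "v2", "switch", 1),
    BehaviorRecord("u1", "v3", "like", 2),
    BehaviorRecord("u1", "v4", "click", 5),  # recommended after the target, excluded
]
selected = target_history_videos(records, "u1", 4, SELECT_BEHAVIORS)
ignored = target_history_videos(records, "u1", 4, IGNORE_BEHAVIORS)
```

Passing a different behavior set switches between the selecting and ignoring cases without changing the rest of the pipeline.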

Step 103, determining a training sample set according to the feature data of the target historical video sample, the feature data of the target video sample and the behavior data of the user sample aiming at the target video sample.

For example, the feature data of a video sample (including the feature data of the target historical video sample and the feature data of the target video sample) may include one or more data types, such as text sample data, image sample data, and sound sample data; the specific manner of obtaining the data is not limited. The text sample data may include, for example, the title of the video sample, the classification category of the video sample, or the speech recognition text of the video sample. The image sample data may include, for example, the cover image of the video sample or frames extracted from the video sample. The sound sample data may include, for example, the audio data in the video sample. The data types included in the feature data of the target historical video sample and of the target video sample may be the same or different, and are not specifically limited. In general, because the target video sample is the subject of prediction and the feature data of the target historical video sample serves to add content attribute features related to user behavior to the target video sample, the number of data types in the feature data of the target video sample is greater than or equal to the number of data types in the feature data of the target historical video sample. For example, the feature data of the target video sample includes text sample data and image sample data, and the feature data of the target historical video sample includes image sample data.
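As a non-limiting sketch, the feature data of a video sample and the data-type relationship described above might be represented as follows. The class `VideoFeatureData`, its field names, and the use of file paths as stand-ins for image and audio data are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VideoFeatureData:
    """Feature data of one video sample; every data type is optional."""
    text: Optional[List[str]] = None    # e.g. title, category, speech recognition text
    images: Optional[List[str]] = None  # e.g. cover image or extracted frames (paths here)
    audio: Optional[str] = None         # e.g. audio track (path here)

    def data_types(self):
        """Names of the data types actually present in this feature data."""
        return {name for name, value in vars(self).items() if value}


# The example from the text: text plus image data for the target video sample,
# image data for the target historical video sample.
target = VideoFeatureData(text=["example title"], images=["target_cover.jpg"])
history = VideoFeatureData(images=["history_cover.jpg"])
target_has_at_least_as_many = len(target.data_types()) >= len(history.data_types())
```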

For example, for a user sample, the feature data of the target historical video sample and the feature data of the target video sample corresponding to the user sample may be used as training sample data, the behavior data of the user sample for the target video sample may be used as a sample label corresponding to the training sample data, a complete training sample corresponding to the user sample is obtained, training samples corresponding to a plurality of user samples are collected, and a training sample set is obtained.
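By way of example, the assembly of the training sample set described above could be sketched as follows, with one training sample per user sample. The names `TrainingSample` and `build_training_set`, the dictionary-based feature data, and the 0/1 label encoding are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingSample:
    history_features: List[dict]  # feature data of the target historical video samples
    target_features: dict         # feature data of the target video sample
    label: int                    # sample label: 1 if the behavior occurred, else 0


def build_training_set(user_ids, history_by_user, video_features,
                       target_video_id, behavior_by_user):
    """Build one training sample per user sample and collect them into a set."""
    samples = []
    for user_id in user_ids:
        history = [video_features[v] for v in history_by_user.get(user_id, [])]
        label = 1 if behavior_by_user.get(user_id) else 0
        samples.append(TrainingSample(history, video_features[target_video_id], label))
    return samples


video_features = {
    "h1": {"cover": "h1.jpg"},
    "h2": {"cover": "h2.jpg"},
    "t": {"cover": "t.jpg", "title": "target"},
}
training_set = build_training_set(
    user_ids=["u1", "u2"],
    history_by_user={"u1": ["h1", "h2"], "u2": ["h2"]},
    video_features=video_features,
    target_video_id="t",
    behavior_by_user={"u1": True, "u2": False},
)
```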

Step 104, training a preset neural network model by using the training sample set to obtain a user behavior prediction model for the video to be recommended.

The preset neural network model can be understood as a neural network model to be trained, and can be specifically understood as an initial model designed according to actual requirements. The network structure, parameters and the like in the preset neural network model are not specifically limited and can be set according to actual requirements.

Illustratively, the value of the loss function corresponding to the preset neural network model is optimized iteratively through training means such as back propagation of network parameters, so that the preset neural network model is continuously optimized until a training cutoff condition is met, yielding a trained user behavior prediction model for the video to be recommended. The specific training cutoff condition may be set according to actual requirements and is not limited by the embodiments of the present disclosure. The video to be recommended can be understood as a video input into the user behavior prediction model, and the user's behavior toward the video to be recommended can be predicted according to the output of the user behavior prediction model. Taking selecting a video as the user behavior, for example, the output of the user behavior prediction model can be used to predict whether the user will select the video to be recommended, or the probability that the user will select it.
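The iterative optimization with a training cutoff condition can be illustrated with a deliberately small stand-in. The sketch below replaces the preset neural network model with a single logistic unit trained by gradient descent, and assumes a cutoff condition of "epoch budget exhausted or loss improvement below a tolerance"; the disclosure prescribes neither the model nor the cutoff condition, and all names and values here are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce(pred, label, eps=1e-12):
    """Binary cross-entropy for one sample, clamped for numerical safety."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train(samples, lr=0.5, max_epochs=1000, tol=1e-6):
    """Full-batch gradient descent on a single logistic unit (stand-in model).
    Assumed cutoff: max_epochs reached, or the loss improved by less than tol."""
    dim = len(samples[0][0])
    n = len(samples)
    w, b = [0.0] * dim, 0.0
    prev_loss = float("inf")
    loss = prev_loss
    for _ in range(max_epochs):
        grad_w, grad_b, loss = [0.0] * dim, 0.0, 0.0
        for x, y in samples:
            p = predict(w, b, x)
            loss += bce(p, y)
            err = p - y  # gradient of the loss with respect to the logit
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        loss /= n
        if prev_loss - loss < tol:  # training cutoff condition met
            break
        prev_loss = loss
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b, loss


# Toy feature vectors with labels (1 = the user selected the video).
samples = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.1, 0.9], 0), ([0.0, 1.0], 0)]
w, b, final_loss = train(samples)
```

After training, `predict(w, b, x)` plays the role of the model output from which the probability of selecting a video to be recommended is read.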

According to the model training method provided by the embodiment of the disclosure, when the user behavior prediction model is trained, the historical video sample is selected according to the historical behavior data of the user sample, and the feature data of the historical video sample and the feature data of the target video sample are input into the neural network model as the training sample for training, so that the content attribute features related to the user behavior can be added to the target video sample, the content understanding capability of the user behavior prediction model on the target video sample is enhanced, and the prediction accuracy of the user behavior prediction model for video recommendation can be improved.

In some embodiments, the behavior includes selecting a recommended video. Obtaining the historical behavior data samples of the user sample for recommended videos that appeared before the target video sample, and determining the corresponding target historical video samples according to the historical behavior data samples, includes: obtaining historical selection condition data samples of the user sample for recommended videos that appeared before the target video sample; and determining, according to the historical selection condition data samples, a first preset number of historical video samples most recently selected by the user sample before the target video sample appeared, to obtain the target historical video samples. The advantage of this arrangement is that a user's interests generally change over time: the closer in time a selected historical video sample is to the target video sample, the closer the user's preference at that time is to the preference when watching the target video sample. Therefore, the most recently selected historical video samples are chosen as the target historical video samples. The first preset number can be set according to actual requirements.
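A minimal sketch of this selection, assuming the history is available as a list of `(video_id, was_selected)` pairs in recommendation order that contains only videos recommended before the target video sample (the function name and data layout are hypothetical):

```python
def recent_selected_videos(history, first_preset_number):
    """Return up to first_preset_number of the most recently selected
    historical videos, ordered from oldest to newest."""
    if first_preset_number <= 0:
        return []
    selected = [video_id for video_id, was_selected in history if was_selected]
    return selected[-first_preset_number:]


history = [("v1", True), ("v2", False), ("v3", True), ("v4", True), ("v5", False)]
latest_two = recent_selected_videos(history, 2)
```

If fewer selected videos exist than the first preset number, the sketch simply returns all of them; how such a shortfall is handled in practice (for example, by padding) is left open here.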

In some embodiments, determining the training sample set according to the feature data of the target historical video sample, the feature data of the target video sample, and the behavior data of the user sample for the target video sample comprises: for each target historical video sample, acquiring a second preset number of historical video frame data in the current target historical video sample; acquiring a third preset number of target video frame data in the target video sample; and determining training sample data according to the historical video frame data and the target video frame data, and determining a sample label corresponding to the training sample data according to whether the user sample selected the target video sample, so as to obtain the training sample set. The advantage of this arrangement is that video frame data can reflect video content characteristics better than text, speech, and the like, and that the data volume of the training samples is kept under control while the training effect is ensured, which helps improve training efficiency.

The second preset number and the third preset number may be the same or different, and are not specifically limited. Acquiring video frame data from a video sample can be understood as performing frame extraction on the video sample and taking the extracted video frames as the acquired video frame data. The frame extraction rule can be set according to actual requirements, for example extraction at equal intervals or at unequal intervals. After frame extraction, the extracted video frames may also be preprocessed, for example by adjusting their height and width so that the image size fits the preset neural network model, and the processed video frames are then used as the acquired video frame data. In addition, when the second preset number or the third preset number is greater than 1, the training sample data determined from the historical video frame data and the target video frame data can also carry the temporal relationship among the video frames, so that the training sample data expresses the video content more accurately.
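Equal-interval frame extraction can be sketched as a choice of frame indices, as below. Taking the middle frame of each equal-length segment is one possible rule assumed for this sketch; the function name and the example frame counts are hypothetical, and decoding and resizing the frames are omitted.

```python
def equal_interval_frame_indices(total_frames, num_frames):
    """Pick num_frames indices spread evenly over [0, total_frames)."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    num_frames = min(num_frames, total_frames)
    step = total_frames / num_frames
    # Take the middle of each of the num_frames equal-length segments.
    return [int(step * i + step / 2) for i in range(num_frames)]


# Hypothetical preset numbers: 2 frames per historical video, 4 for the target.
history_indices = equal_interval_frame_indices(total_frames=100, num_frames=2)
target_indices = equal_interval_frame_indices(total_frames=100, num_frames=4)
```

Because the indices are returned in ascending order, the extracted frames keep their temporal order.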

In some embodiments, the second preset number is less than the third preset number. The advantage of this arrangement is that, as described above, the target video sample is the subject of prediction and the feature data of the target historical video sample serves to add content attribute features related to user behavior to the target video sample; extracting fewer video frames from the target historical video samples therefore further limits the data volume of the training samples and improves training efficiency. Further, the historical video frame data includes historical video cover frame data. When the second preset number is 1, the historical video frame data may be the historical video cover frame data.

In some embodiments, training the preset neural network model with the training sample set to obtain the user behavior prediction model for the video to be recommended includes: inputting the historical video frame data into a first image feature extraction network in the preset neural network model to obtain corresponding historical video image features; inputting the target video frame data into a second image feature extraction network in the preset neural network model to obtain corresponding target video image features; outputting, through a feature fusion network in the preset neural network model, corresponding fusion features according to the historical video image features arranged in time order and the target video image features arranged in time order, wherein the feature fusion network performs feature fusion on the internal correlation of the historical video image features, the internal correlation of the target video image features, and the correlation between the historical video image features and the target video image features, the feature fusion network being constructed, for example, based on a self-attention mechanism; passing the fusion features through a preset network in the preset neural network model to output a predicted value of the user sample selecting the target video sample; and calculating a loss function according to the predicted value and the corresponding sample label, and adjusting the preset neural network model based on the loss function to obtain the user behavior prediction model for the video to be recommended.

The advantage of this arrangement is that, with a feature fusion network constructed based on the self-attention mechanism, the model can learn the features of the target video sample and of the user behavior individually and can, at the same time, learn the relationship between the two, thereby indirectly achieving information interaction between different input sources of the same modality. Here, the same modality means that the historical video image features and the target video image features are both image features of videos, and different input sources means that the historical video image features and the target video image features come from different videos. Because the feature information of the user behavior is implicit in the target historical video samples, the trained user behavior prediction model can better understand the content of the target video sample, which further improves the prediction accuracy.
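Purely to illustrate the data flow of this embodiment, the sketch below wires the stages together: two feature extractors, time ordering, a fusion step over the combined sequence, a prediction head, and a loss against the sample label. Every component passed in is a trivial placeholder (mean pixel value, mean pooling, a fixed linear head), and binary cross-entropy is an assumed choice of loss; none of these stand-ins is the network or loss of the present disclosure.

```python
import math


def forward(history_frames, target_frames, extract_history, extract_target, fuse, head):
    """history_frames and target_frames are lists of (timestamp, frame) pairs."""
    # Arrange each group in time order before fusion.
    hist = [extract_history(f) for _, f in sorted(history_frames, key=lambda p: p[0])]
    tgt = [extract_target(f) for _, f in sorted(target_frames, key=lambda p: p[0])]
    fused = fuse(hist + tgt)  # one sequence, so fusion sees both input sources
    return 1.0 / (1.0 + math.exp(-head(fused)))  # predicted selection probability


def binary_cross_entropy(pred, label, eps=1e-12):
    pred = min(max(pred, eps), 1.0 - eps)
    return -(label * math.log(pred) + (1 - label) * math.log(1.0 - pred))


# Placeholder components so the sketch runs; real ones would be trained networks.
def mean_pixel(frame):
    return [sum(frame) / len(frame)]


def mean_pool(features):
    return [sum(column) / len(column) for column in zip(*features)]


def linear_head(vector):
    return 2.0 * vector[0] - 1.0


history = [(2, [0.8, 0.6]), (1, [0.2, 0.4])]  # deliberately out of time order
target = [(1, [1.0, 1.0]), (2, [0.5, 0.5])]
pred = forward(history, target, mean_pixel, mean_pixel, mean_pool, linear_head)
loss = binary_cross_entropy(pred, label=1)
```

In a trainable implementation, the loss value would be back-propagated through the head, the fusion network, and both feature extraction networks.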

For example, the first image feature extraction network and the second image feature extraction network may be the same or different, and each may be, for example, a residual network (ResNet) model or a Visual Geometry Group (VGG) model. The feature fusion network may be, for example, a BERT model, whose basic encoder uses the Transformer structure. The Transformer structure mainly uses a self-attention mechanism to encode the input data in parallel, which improves the global representation capability of each element. Self-attention learns a weight for each feature in the input sequence and produces the upper-layer feature representation through weighted summation. Specifically, the feature fusion network can fuse the historical video image features and the target video image features by means of the Transformer structure in BERT rather than by simple concatenation, so that the resulting fused features better express the target video sample and enable more accurate content understanding.
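The weighted summation performed by self-attention can be illustrated with a single-head, scaled dot-product version in plain Python. As a simplifying assumption, the query, key, and value projections are the identity here, whereas a Transformer layer learns them and adds multiple heads, residual connections, and feed-forward sublayers; the feature values are made up for the example.

```python
import math


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(sequence):
    """Single-head scaled dot-product self-attention with identity projections.
    Returns the fused sequence and the attention weights of every position."""
    dim = len(sequence[0])
    outputs, weights = [], []
    for query in sequence:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in sequence
        ]
        attn = softmax(scores)
        weights.append(attn)
        # Each output is a weighted sum over every element of the input sequence.
        outputs.append(
            [sum(a * value[d] for a, value in zip(attn, sequence)) for d in range(dim)]
        )
    return outputs, weights


history_feats = [[1.0, 0.0], [0.9, 0.1]]  # historical video image features
target_feats = [[0.0, 1.0], [1.0, 0.0]]   # target video image features
fused, attn = self_attention(history_feats + target_feats)
```

Because every position attends to the whole combined sequence, each fused feature mixes information from within the historical features, within the target features, and across the two, which plain concatenation alone would not do.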

In some embodiments, the video to be recommended includes an advertisement video to be recommended in a short video platform, and the user behavior prediction model includes a click-through rate (CTR) prediction model. The advantage of this arrangement is that, for advertisement videos to be recommended in a short video platform, high-quality advertisement video content features can be produced and the click-through rate estimation improved, so that advertisement video recommendation is more accurate and better matches users' real needs.
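As one hypothetical way in which predicted click-through rates might feed a recommendation decision, the sketch below keeps advertisement videos whose predicted CTR reaches a threshold and returns the highest-scoring ones. The threshold, the `top_k` limit, the function name, and the scores are assumptions of this sketch; the disclosure does not specify a decision rule.

```python
def pick_ads(predicted_ctr, threshold=0.05, top_k=2):
    """predicted_ctr maps an advertisement id to its predicted click probability.
    Keep ads at or above the threshold and return the top_k, highest first."""
    eligible = [(ad, p) for ad, p in predicted_ctr.items() if p >= threshold]
    eligible.sort(key=lambda item: item[1], reverse=True)
    return [ad for ad, _ in eligible[:top_k]]


chosen = pick_ads({"ad_a": 0.12, "ad_b": 0.03, "ad_c": 0.08, "ad_d": 0.30})
```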

Fig. 2 is a schematic flow chart of another model training method provided by an embodiment of the present disclosure, which is optimized based on various alternatives in the foregoing embodiments, and is described by taking training of a CTR prediction model for an advertisement video to be recommended in a short video platform as an example, specifically, the method includes the following steps:

Step 201, determining a target video sample and a user sample.

The user sample comprises recommended users corresponding to the target video sample. In a short video platform, short videos are generally recommended to a user one by one, and an advertisement video may be recommended to the user between two short videos. When the user is interested in the advertisement video, the user can enter the commodity detail page or live broadcast room corresponding to the advertisement video, for example by clicking the advertisement video, to learn more about the advertised content or to perform operations such as placing an online order for a commodity. The click-through rate prediction model in the embodiment of the present disclosure can be used to predict the probability that a user clicks a given advertisement video to be recommended, or whether the user will click it.

In an actual training process, a large amount of training data is often needed. For convenience of description, the embodiment of the present disclosure gives a brief introduction using 2 target video samples and 3 user samples as an example. Assume that the target video samples comprise advertisement video a and advertisement video b, that the user samples are user A, user B and user C, and that advertisement video a and advertisement video b are each recommended to every user sample. Then 6 training samples can be constructed: user A with advertisement video a, user A with advertisement video b, user B with advertisement video a, user B with advertisement video b, user C with advertisement video a, and user C with advertisement video b.
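The pairing in this example can be sketched as a Cartesian product of users and advertisement videos. This is only an illustration of the counting; the variable names are hypothetical, and a real pipeline would pair each user only with videos actually recommended to that user.

```python
from itertools import product

users = ["user A", "user B", "user C"]
ads = ["advertisement video a", "advertisement video b"]

# One training sample per (user, recommended advertisement video) pair.
pairs = list(product(users, ads))
print(len(pairs))  # 6
```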

Step 202, obtaining historical click condition data samples of recommended videos which appear before the user samples aim at the target video samples.

The following description takes user A and advertisement video a as an example; the construction process of the other training samples is similar. User A has also watched many advertisement videos before watching advertisement video a, and user A and the other two users may have watched different advertisement videos before watching advertisement video a, which is not specifically limited. A video watched by user A before watching advertisement video a is called a historical video sample. With the authorization of user A, whether the user performed a click operation on each historical video sample can be recorded, yielding the corresponding historical click condition data samples, which are obtained in this step.

Step 203, determining a first preset number of history video samples clicked by the user sample before the target video sample appears according to the history click condition data samples, and obtaining the target history video sample.

Illustratively, still taking user A and advertisement video a as an example, the first preset number may be 10; that is, 10 history video samples that user A clicked before advertisement video a appeared may be determined according to the history click condition data samples, and these 10 history video samples are taken as the target history video samples.
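Step 203 can be sketched as a filter over the user's recommendation log. The sketch below makes two assumptions that the text does not state: each log record is a dictionary with the hypothetical fields `video_id`, `shown_at` and `clicked`, and when more than the first preset number of clicked videos exist, the most recent ones are kept.

```python
def select_target_history(history, target_time, first_preset_number=10):
    """Return up to `first_preset_number` video ids clicked before `target_time`.

    Results are in time order, oldest first. Keeping the most recent clicks
    is an assumption made for this sketch.
    """
    clicked = [r for r in history if r["clicked"] and r["shown_at"] < target_time]
    clicked.sort(key=lambda r: r["shown_at"])
    return [r["video_id"] for r in clicked[-first_preset_number:]]


# Hypothetical log: 30 recommended videos, every second one clicked.
history = [{"video_id": f"v{i}", "shown_at": i, "clicked": i % 2 == 0}
           for i in range(30)]
target_history = select_target_history(history, target_time=25)
print(target_history[0], target_history[-1], len(target_history))  # v6 v24 10
```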

Step 204, for each target historical video sample, acquiring a second preset number of historical video frame data in the current target historical video sample.

For example, in consideration of the model size and the training efficiency, the second preset number may be set to 1, and the historical video frame data is the historical video cover frame data. Continuing with the above example, in this step, for each of the 10 target history video samples, a video cover frame is obtained for constructing a training sample, which is denoted as cover frame 1, cover frame 2, …, and cover frame 10, respectively.

Step 205, acquiring a third preset number of target video frame data in the target video sample.

Illustratively, continuing with the above example, in this step, a third preset number of target video frame data in the advertisement video a is obtained. Assuming that the third preset number is 10, frame extraction processing is performed on the advertisement video a, the frame extraction number is 10, the frame extraction rule is not limited, for example, equal-interval extraction may be performed, and the obtained video frames may be denoted as [ picture frame 1, picture frame 2, …, picture frame 10 ].
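Equal-interval extraction, mentioned above as one possible frame extraction rule, might look like the following sketch. The function name `equal_interval_indices` and the handling of videos shorter than the requested number of frames are assumptions; decoding the frames themselves is out of scope here.

```python
def equal_interval_indices(total_frames, num_frames=10):
    """Return the indices of `num_frames` frames spread evenly over a video."""
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        # Short video: keep every frame rather than repeating any.
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


indices = equal_interval_indices(300, 10)
print(indices)  # [0, 30, 60, 90, 120, 150, 180, 210, 240, 270]
```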

Step 206, determining training sample data according to the historical video frame data and the target video frame data, and determining a sample label corresponding to the training sample data according to whether the user sample clicks the target video sample, so as to obtain a training sample set.

Illustratively, continuing with the above example, training sample data is constructed in this step from the 10 cover frames and the 10 picture frames. Optionally, the pictures may be preprocessed: assuming that the pictures received by the CTR prediction model have height H and width W, and that a color picture has 3 RGB (red, green, blue) channels, the training sample data may be processed into 20 pixel matrices of size H × W × 3, with each pixel normalized. While watching advertisement video a, user A may or may not click on it, and the sample label of the training sample data is determined according to whether user A clicks; for example, 0 indicates that the user does not click and 1 indicates that the user clicks. Correspondingly, when the CTR prediction model is used, an output of 0 indicates that the user is predicted not to click, and an output of 1 indicates that the user is predicted to click.
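The construction of one training sample can be sketched as follows. The sketch assumes the frames have already been resized to H × W and are given as nested lists of 8-bit RGB values, and it assumes normalization means dividing by 255; the text does not specify the normalization, and the function names are hypothetical.

```python
def normalize_frame(frame):
    """Scale 8-bit RGB values to [0, 1]; `frame` is an H x W x 3 nested list."""
    return [[[channel / 255.0 for channel in pixel] for pixel in row]
            for row in frame]


def build_training_sample(cover_frames, picture_frames, clicked):
    """Combine history cover frames and target picture frames with a label."""
    # History cover frames first, then target video frames: 10 + 10 = 20 matrices.
    frames = [normalize_frame(f) for f in cover_frames + picture_frames]
    label = 1 if clicked else 0  # 1: the user clicked, 0: the user did not click
    return frames, label


# Tiny dummy frame with H = W = 2, standing in for a real decoded picture.
dummy = [[[255, 0, 51], [0, 0, 0]], [[0, 0, 0], [255, 255, 255]]]
frames, label = build_training_sample([dummy] * 10, [dummy] * 10, clicked=True)
print(len(frames), label)  # 20 1
```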

For the six combinations above, 6 training samples can be obtained, one per combination, forming a training sample set. It should be noted that the above is only an example; the number of training samples in the training sample set is not limited in practical application and may be much larger than 6.

Step 207, training the preset neural network model by adopting the training sample set to obtain a click through rate prediction model aiming at the advertisement video to be recommended.

To facilitate understanding of the present disclosure, a specific structure of the neural network model is schematically described below. Fig. 3 is a schematic structural diagram of a preset neural network model provided in the embodiment of the present disclosure. As shown in fig. 3, the preset neural network model includes an image feature extraction network (ResNet) and a feature fusion network (Video BERT); that is, in the example shown in fig. 3, the first image feature extraction network and the second image feature extraction network are the same, and the feature fusion network may be a BERT model that performs fusion over video images. The image feature extraction network can perform convolution processing on each picture to identify features contained in different receptive fields. The feature fusion network can model the time-sequence characteristics of the image features output by the image feature extraction network and output the fused features using operations such as pooling. In addition, the preset neural network model may further include other network structures (preset networks), such as a depth layer, which may specifically be a fully connected layer.
When the preset neural network model is trained, the historical video frame data and the target video frame data in a training sample are input into the image feature extraction network to obtain the corresponding video image features; the feature fusion network outputs the corresponding fused features from the historical video image features arranged in time order and the target video image features arranged in time order; the depth layer outputs, from the fused features, a predicted value of whether the user sample selects the target video sample; a click-through rate loss function (CTR Loss) is calculated from the predicted value and the corresponding sample label; and the preset neural network model is adjusted based on the CTR Loss to obtain a CTR prediction model for the advertisement video to be recommended. That is, the network parameters are adjusted continuously during training, and by optimizing the CTR Loss the model learns the correlation between the current advertisement video and the videos the user has clicked historically, so that the model can predict the click-through rate more accurately.
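The text does not give the form of the CTR Loss. One common choice for a 0/1 click label is binary cross-entropy (log loss), and the sketch below uses it purely as an assumption; the function name `ctr_loss` and the clipping constant `eps` are likewise hypothetical.

```python
import math


def ctr_loss(predictions, labels, eps=1e-7):
    """Mean binary cross-entropy between click probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(predictions, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Predictions that agree with the labels give a lower loss than ones that do not.
good = ctr_loss([0.9, 0.1], [1, 0])
bad = ctr_loss([0.1, 0.9], [1, 0])
```

During training, the network parameters would be updated in the direction that reduces this value over the training sample set.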

By adopting the technical scheme, the model can simultaneously complete respective feature modeling and interactive feature coding of the historical video frame data and the target video frame data, and high-quality advertisement video content features are generated, namely the model can learn which advertisement video users like to watch and which features in the advertisement video attract the users, so that the accuracy of the model for predicting the click rate is improved, and the click rate estimation effect is improved.

Fig. 4 is a flowchart of a model using method provided by an embodiment of the present disclosure. The method is applicable to a scenario in which a user's behavior toward a video to be recommended is predicted, and can be executed by a model using apparatus, which can be implemented by software and/or hardware and can generally be integrated in a computer device. As shown in fig. 4, the method includes:

Step 401, obtaining feature data of a video to be recommended.

In the embodiment of the present disclosure, the video to be recommended may be understood as a video that is desired to be recommended to a user, and may be specifically determined by a current application scenario, for example, the video may be determined by operations such as recall and rough ranking on the basis of a video library. In this step, the determination method of the feature data of the video to be recommended may be the same as the determination method of the feature data of the target video sample used for model training, for example, a third preset number of video frame data in the video to be recommended may be obtained, and the above related content may be specifically referred to, and is not described herein again.

Step 402, inputting the feature data of the video to be recommended into a user behavior prediction model aiming at the video to be recommended.

The user behavior prediction model for the video to be recommended is obtained by adopting the method provided by the embodiment of the disclosure.

Step 403, predicting the user behavior of the current user for the video to be recommended according to the output result of the user behavior prediction model for the video to be recommended.

For example, the output result of the user behavior prediction model may be the user behavior that the user may generate for the video to be recommended, or the probability of generating a certain user behavior. The current user may be any user, and the embodiment of the present invention is not limited.

For example, taking the case where the video to be recommended is an advertisement video to be recommended, whether the user will click the advertisement video to be recommended, or the probability that the user will click it, can be predicted according to the output result.
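When the model outputs a click probability, turning it into a click or no-click prediction requires a decision threshold. The sketch below assumes a threshold of 0.5, which the text does not specify; the function name `interpret_output` is hypothetical.

```python
def interpret_output(click_probability, threshold=0.5):
    """Map a predicted click probability to a 0/1 click prediction."""
    if not 0.0 <= click_probability <= 1.0:
        raise ValueError("click_probability must lie in [0, 1]")
    return {
        "click_probability": click_probability,
        # 1: the user is predicted to click, 0: predicted not to click.
        "predicted_click": 1 if click_probability >= threshold else 0,
    }


result = interpret_output(0.73)
print(result["predicted_click"])  # 1
```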

According to the model using method provided by the embodiment of the disclosure, the feature data of the video to be recommended, for which user behavior is to be predicted, is input into the user behavior prediction model for the video to be recommended. Data about videos the user has watched does not need to be input, so the model can output a result quickly, and the user behavior of the current user for the video to be recommended is predicted according to the output result of the model.

Fig. 5 is a flowchart of another model using method provided by the embodiment of the present disclosure. The method is applicable to a scenario in which a user's behavior toward a video to be recommended is predicted, and can be executed by a model using apparatus, which can be implemented by software and/or hardware and can generally be integrated in a computer device. As shown in fig. 5, the method includes:

Step 501, acquiring characteristic data of a video to be recommended.

Step 502, obtaining historical behavior data of a recommended video which appears before the video to be recommended by a current user, and determining a corresponding target historical video according to the historical behavior data.

For example, the determination method of the target history video may be the same as the determination method of the target history video sample used for model training, and reference may be specifically made to the above related contents, which is not described herein again. It should be noted that before acquiring the behavior data of the current user, the authorization of the user needs to be acquired.

Step 503, inputting the feature data of the video to be recommended and the feature data of the target historical video into a user behavior prediction model aiming at the video to be recommended.

For example, for each target history video, obtaining a second preset number of historical video frame data in the current target history video may specifically refer to the above related content, and details are not repeated here.

Step 504, predicting the user behavior of the current user for the video to be recommended according to the output result of the user behavior prediction model for the video to be recommended.

The model using method provided by the embodiment of the disclosure inputs the feature data of the video to be recommended, which needs to be subjected to user behavior prediction, and the feature data of the target historical video corresponding to the current user into the user behavior prediction model for the video to be recommended, which is provided by the embodiment of the disclosure, and predicts the user behavior of the current user for the video to be recommended according to the output result of the model.
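The two ways of using the model, with and without the current user's target historical videos, can be sketched as a single function with an optional history argument. Everything here is illustrative: `predict_user_behavior` and `stub_model` are hypothetical names, the model is a placeholder callable rather than a trained network, and placing the history frames before the candidate frames is an assumption.

```python
def predict_user_behavior(model, candidate_frames, history_cover_frames=None):
    """Run one prediction for the current user and one video to be recommended.

    model: any callable mapping a list of frames to a click probability
        (a stand-in for the trained user behavior prediction model).
    candidate_frames: frame data of the video to be recommended.
    history_cover_frames: optional frame data of the target historical videos;
        omit it for the faster variant that uses only the candidate video.
    """
    frames = list(history_cover_frames or []) + list(candidate_frames)
    return model(frames)


def stub_model(frames):
    # Placeholder only: a real model would run the trained network here.
    return 0.5


with_history = predict_user_behavior(stub_model, ["f1", "f2"], ["c1", "c2", "c3"])
without_history = predict_user_behavior(stub_model, ["f1", "f2"])
```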

Fig. 6 is a block diagram of a model training apparatus provided in an embodiment of the present disclosure, which may be implemented by software and/or hardware, and may be generally integrated in a computer device, and may perform model training by executing a model training method. As shown in fig. 6, the apparatus includes:

a first sample determining module 601, configured to determine a target video sample and a user sample, where the user sample includes a recommended user corresponding to the target video sample;

a second sample determining module 602, configured to obtain historical behavior data samples of recommended videos that have appeared before the target video sample for the user sample, and determine a corresponding target historical video sample according to the historical behavior data samples;

a training sample set determining module 603, configured to determine a training sample set according to the feature data of the target historical video sample, the feature data of the target video sample, and the behavior data of the user sample for the target video sample;

and the model training module 604 is configured to train a preset neural network model by using the training sample set to obtain a user behavior prediction model for the video to be recommended.

According to the model training device provided by the embodiment of the disclosure, when the user behavior prediction model is trained, the historical video sample is selected according to the historical behavior data of the user sample, and the feature data of the historical video sample and the feature data of the target video sample are input into the neural network model as the training sample for training, so that the content attribute features related to the user behavior can be added to the target video sample, the content understanding capability of the user behavior prediction model on the target video sample is enhanced, and the prediction accuracy of the user behavior prediction model for video recommendation can be improved.

Optionally, the behavior includes selecting a recommended video; the obtaining of the historical behavior data sample of the recommended video of the user sample before the target video sample, and determining the corresponding target historical video sample according to the historical behavior data sample, includes:

acquiring historical selection condition data samples of recommended videos of the user samples before the target video samples;

and determining a first preset number of historical video samples which are selected by the user sample before the target video sample appears according to the historical selection condition data samples to obtain the target historical video sample.

Optionally, the determining a training sample set according to the feature data of the target historical video sample, the feature data of the target video sample, and the behavior data of the user sample for the target video sample includes:

acquiring a second preset number of historical video frame data in the current target historical video sample aiming at each target historical video sample;

acquiring a third preset number of target video frame data in the target video sample;

determining training sample data according to the historical video frame data and the target video frame data, and determining a sample label corresponding to the training sample data according to whether the user sample selects the target video sample, so as to obtain a training sample set.

Optionally, the second preset number is smaller than the third preset number, and the historical video frame data includes historical video cover frame data.

Optionally, the training of the preset neural network model by using the training sample set to obtain the user behavior prediction model for the video to be recommended includes:

inputting the historical video frame data into a first image feature extraction network in a preset neural network model to obtain corresponding historical video image features;

inputting the target video frame data into a second image feature extraction network in the preset neural network model to obtain corresponding target video image features;

outputting corresponding fusion characteristics according to the historical video image characteristics arranged according to the time sequence and the target video image characteristics arranged according to the time sequence through a characteristic fusion network in the preset neural network model, wherein the characteristic fusion network is constructed based on a self-attention mechanism;

outputting, from the fusion features and through a preset network in the preset neural network model, a predicted value of the user sample selecting the target video sample;

and calculating a loss function according to the predicted value and the corresponding sample label, and adjusting the preset neural network model based on the loss function to obtain a user behavior prediction model for the video to be recommended.

Optionally, the video to be recommended includes an advertisement video to be recommended in a short video platform, and the user behavior prediction model includes a click through rate prediction model.

Fig. 7 is a block diagram of a model using apparatus provided in an embodiment of the present disclosure, which may be implemented by software and/or hardware, may generally be integrated in a computer device, and may perform user behavior prediction by executing a model using method. As shown in fig. 7, the apparatus includes:

a feature data obtaining module 701, configured to obtain feature data of a video to be recommended;

the data input module 702 is configured to input feature data of the video to be recommended into a user behavior prediction model for the video to be recommended, where the user behavior prediction model for the video to be recommended is obtained by using the model training method provided in the embodiment of the present invention;

a behavior prediction module 703, configured to predict, according to an output result of the user behavior prediction model for the video to be recommended, a user behavior of a current user for the video to be recommended.

The model using device provided by the embodiment of the disclosure inputs the feature data of the video to be recommended, which needs to be subjected to user behavior prediction, into the user behavior prediction model for the video to be recommended, which is provided by the embodiment of the disclosure, without inputting the related data of the video observed by the user, so that the model can quickly output a result, and predict the user behavior of the current user for the video to be recommended according to the output result of the model.

Fig. 8 is a block diagram of another model using apparatus provided in the embodiment of the present disclosure, which may be implemented by software and/or hardware, may generally be integrated in a computer device, and may perform user behavior prediction by executing a model using method. As shown in fig. 8, the apparatus includes:

a characteristic data obtaining module 801, configured to obtain characteristic data of a video to be recommended;

a historical video determining module 802, configured to obtain historical behavior data of a recommended video that appears before the video to be recommended by a current user, and determine a corresponding target historical video according to the historical behavior data;

the data input module 803 is configured to input the feature data of the video to be recommended and the feature data of the target historical video into a user behavior prediction model for the video to be recommended, where the user behavior prediction model for the video to be recommended is obtained by using the model training method provided in the embodiment of the present invention;

and the behavior prediction module 804 is used for predicting the user behavior of the current user for the video to be recommended according to the output result of the user behavior prediction model for the video to be recommended.

The model using device provided by the embodiment of the disclosure inputs the feature data of the video to be recommended, which needs to be subjected to user behavior prediction, and the feature data of the target historical video corresponding to the current user into the user behavior prediction model for the video to be recommended, which is provided by the embodiment of the disclosure, and predicts the user behavior of the current user for the video to be recommended according to the output result of the model.

Referring now to FIG. 9, shown is a schematic block diagram of a computer device 900 suitable for use in implementing embodiments of the present disclosure. The computer device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The computer device shown in fig. 9 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present disclosure.

As shown in fig. 9, the computer device 900 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 901 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage means 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the computer apparatus 900 are also stored. The processing apparatus 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.

Generally, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. The communication means 909 may allow the computer apparatus 900 to communicate with other apparatuses wirelessly or by wire to exchange data. While fig. 9 illustrates a computer device 900 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 909, or installed from the storage device 908, or installed from the ROM 902. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing apparatus 901.

It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the computer device; or may exist separately and not be incorporated into the computer device.

The computer readable medium carries one or more programs which, when executed by the computer device, enable the computer device to perform any of the model training methods or model using method steps provided by embodiments of the present disclosure.

Computer program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including but not limited to an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a module does not in some cases constitute a limitation of the module itself, for example, the first sample determination module may also be described as a "module that determines a target video sample and a user sample".

The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to one or more embodiments of the present disclosure, there is provided a model training method including:

Further, the behavior includes selecting a recommended video;

the obtaining a historical behavior data sample, for recommended videos, generated by the user sample before the behavior directed at the target video sample, and determining a corresponding target historical video sample according to the historical behavior data sample, includes:

Further, the determining a training sample set according to the feature data of the target historical video sample, the feature data of the target video sample, and the behavior data of the user sample for the target video sample includes:

Further, the second preset number is smaller than the third preset number, and the historical video frame data includes historical video cover frame data.

Further, the training a preset neural network model by using the training sample set to obtain a user behavior prediction model for a video to be recommended, includes:

Further, the videos to be recommended comprise advertisement videos to be recommended in a short video platform, and the user behavior prediction model comprises a click-through rate prediction model.
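As an illustration only, the assembly of one training sample from the feature data of the target historical video samples, the feature data of the target video sample, and the behavior data of the user sample for the target video sample might be sketched in Python as follows. The record layout, the names BehaviorRecord, TrainingSample and build_training_sample, the max_history parameter, and the rule of keeping the most recently selected videos are assumptions made for this sketch; they are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BehaviorRecord:
    """One behavior of a user toward a recommended video (hypothetical layout)."""
    user_id: str
    video_id: str
    timestamp: int
    selected: bool  # whether the user selected (e.g. clicked) the recommended video


@dataclass
class TrainingSample:
    history_features: List[List[float]]  # target historical videos, oldest first
    target_features: List[float]         # feature data of the target video sample
    label: int                           # behavior of the user toward the target video


def build_training_sample(
    target: BehaviorRecord,
    history: List[BehaviorRecord],
    video_features: Dict[str, List[float]],
    max_history: int = 3,  # assumed cap on the number of historical videos
) -> TrainingSample:
    # Keep only this user's behaviors that precede the behavior toward the
    # target video and in which a recommended video was selected.
    earlier = [
        r for r in history
        if r.user_id == target.user_id
        and r.timestamp < target.timestamp
        and r.selected
    ]
    earlier.sort(key=lambda r: r.timestamp)
    recent = earlier[-max_history:]  # assumption: most recent selections
    return TrainingSample(
        history_features=[video_features[r.video_id] for r in recent],
        target_features=video_features[target.video_id],
        label=int(target.selected),
    )
```

A collection of such samples, one per pair of user sample and target video sample, would form the training sample set supplied to the preset neural network model.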

According to one or more embodiments of the present disclosure, there is provided a model using method including:

acquiring feature data of a video to be recommended;

According to one or more embodiments of the present disclosure, there is provided a model training apparatus including:

According to one or more embodiments of the present disclosure, there is provided a model using apparatus including:

The foregoing description is merely illustrative of the preferred embodiments of the disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure herein is not limited to technical solutions formed by the particular combination of features described above, but also encompasses other technical solutions formed by any combination of the features described above or their equivalents without departing from the spirit of the disclosure. For example, a technical solution may be formed by replacing the above features with features having similar functions disclosed in (but not limited to) this disclosure.

Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.

Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

1. A method of model training, comprising:

2. The method of claim 1, wherein the behavior comprises selecting a recommended video;

3. The method of claim 2, wherein determining a set of training samples from the feature data of the target historical video sample, the feature data of the target video sample, and the behavior data of the user sample for the target video sample comprises:

4. The method of claim 3, wherein the second preset number is smaller than the third preset number, and wherein the historical video frame data comprises historical video cover frame data.

5. The method according to claim 3, wherein the training a preset neural network model by using the training sample set to obtain a user behavior prediction model for a video to be recommended comprises:

outputting, through a feature fusion network in the preset neural network model, corresponding fused features according to the historical video image features arranged in time order and the target video image features arranged in time order, wherein the feature fusion network performs feature fusion based on the internal correlation of the historical video image features, the internal correlation of the target video image features, and the correlation between the historical video image features and the target video image features;

6. The method according to any one of claims 1 to 5, wherein the video to be recommended comprises an advertisement video to be recommended in a short video platform, and the user behavior prediction model comprises a click-through rate prediction model.

7. A method of using a model, comprising:

acquiring feature data of a video to be recommended;

inputting the feature data of the video to be recommended into a user behavior prediction model for the video to be recommended, wherein the user behavior prediction model for the video to be recommended is obtained by adopting the method of any one of claims 1 to 6;

8. A method of using a model, comprising:

acquiring feature data of a video to be recommended;

inputting the feature data of the video to be recommended and the feature data of the target historical video into a user behavior prediction model for the video to be recommended, wherein the user behavior prediction model for the video to be recommended is obtained by adopting the method of any one of claims 1 to 6;

9. A model training apparatus, comprising:

10. A model using apparatus, comprising:

a data input module, configured to input the feature data of the video to be recommended into a user behavior prediction model for the video to be recommended, wherein the user behavior prediction model for the video to be recommended is obtained by adopting the method of any one of claims 1 to 6;

11. A model using apparatus, comprising:

a data input module, configured to input the feature data of the video to be recommended and the feature data of the target historical video into a user behavior prediction model for the video to be recommended, wherein the user behavior prediction model for the video to be recommended is obtained by adopting the method of any one of claims 1 to 6;

12. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 8.

13. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 8 when executing the computer program.
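For illustration only, the feature fusion recited in claim 5 (fusion based on the internal correlation of the historical video image features, the internal correlation of the target video image features, and the correlation between the two) could be approximated by self-attention over the two time-ordered sequences joined together. The following Python sketch makes that assumption explicit: the use of attention, the absence of learned projection weights, the mean pooling, and the names fuse_features, dot and softmax are all hypothetical and are not stated in the claims.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    # Numerically stable softmax over a list of scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def fuse_features(history: List[Vector], target: List[Vector]) -> Vector:
    """Fuse time-ordered history and target image features into one vector.

    Assumption: unparameterized scaled dot-product self-attention stands in
    for the feature fusion network; a trained network would add learned
    query/key/value projections.
    """
    tokens = history + target  # history first, then target, each in time order
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    attended: List[Vector] = []
    for query in tokens:
        # Each token attends to every token, so the weights cover
        # history-history, target-target and history-target correlations.
        weights = softmax([dot(query, key) / scale for key in tokens])
        attended.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
        )
    # Mean-pool the attended tokens into a single fused feature vector.
    return [sum(vec[i] for vec in attended) / len(attended) for i in range(dim)]
```

Because each token attends over the joined sequence, a single attention pass covers all three kinds of correlation; a practical network would typically also add positional information so that the time order of the frames is preserved.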