CN112784165A - Training method of incidence relation estimation model and method for estimating file popularity - Google Patents
- Publication number
- CN112784165A (application CN202110132791.3A)
- Authority
- CN
- China
- Prior art keywords
- file
- group
- user
- degree
- estimation model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The disclosure provides a training method for an incidence relation estimation model and relates to the field of artificial intelligence, in particular to machine learning and big data. The specific implementation is as follows: acquiring sample data, wherein the sample data comprises the features of a plurality of first user groups, the features of a plurality of first file groups, and the association degree between each first user group and each first file group; and training a neural-network-based incidence relation estimation model using the sample data to obtain a trained incidence relation estimation model. The disclosure also provides a training apparatus for the incidence relation estimation model, a method and an apparatus for estimating file popularity, an electronic device, and a storage medium.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to machine learning and big data technology. More specifically, the disclosure provides a training method and device for an incidence relation estimation model, a method and device for estimating file popularity, an electronic device and a storage medium.
Background
As the internet continues to grow, data volumes are growing explosively and the pressure on data storage keeps increasing.
Data is generally stored as files. The popularity (heat) of a file can be determined from its access frequency, and files of different popularity can be stored under different strategies, so that hot and cold data are stored separately and storage resources are allocated reasonably.
However, different user groups access different types of files to different degrees, so determining file popularity from access frequency alone yields low accuracy.
Disclosure of Invention
The disclosure provides a training method and apparatus for an incidence relation estimation model, a method and apparatus for estimating file popularity, a device, and a storage medium.
According to an aspect of the present disclosure, a method for training an incidence relation estimation model is provided, including: acquiring sample data, wherein the sample data includes the features of a plurality of first user groups, the features of a plurality of first file groups, and the association degree between each first user group and each first file group; and training a neural-network-based incidence relation estimation model using the sample data to obtain a trained incidence relation estimation model.
According to another aspect of the present disclosure, a method for estimating file popularity is provided, including: acquiring input data, wherein the input data includes the features of a target file group and the features of a target user group; estimating, using an incidence relation estimation model, the association degree between the target file group and the target user group according to the features of the target file group and the features of the target user group; and determining the popularity of a target file in the target file group according to the estimated association degree between the target file group and the target user group.
According to another aspect of the present disclosure, there is provided a training apparatus for an incidence relation estimation model, including:
a first acquisition module configured to acquire sample data, wherein the sample data includes the features of a plurality of first user groups, the features of a plurality of first file groups, and the association degree between each first user group and each first file group;
and a training module configured to train a neural-network-based incidence relation estimation model using the sample data to obtain a trained incidence relation estimation model.
According to another aspect of the present disclosure, there is provided an apparatus for estimating file popularity, including: a second acquisition module configured to acquire input data, wherein the input data includes the features of a target file group and the features of a target user group; a first estimation module configured to estimate, using an incidence relation estimation model, the association degree between the target file group and the target user group according to the features of the target file group and the features of the target user group; and a first determining module configured to determine the popularity of a target file in the target file group according to the estimated association degree between the target file group and the target user group.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided in accordance with the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method provided according to the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method provided according to the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. In the drawings:
FIG. 1 is a schematic diagram of an exemplary system architecture to which the training method and apparatus for an incidence relation estimation model may be applied, according to one embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method of training an incidence relation estimation model according to one embodiment of the present disclosure;
FIG. 3 is a flow diagram of a method of training an incidence relation estimation model according to another embodiment of the present disclosure;
FIG. 4 is a flow diagram of a method of determining a target incidence relation estimation model according to one embodiment of the present disclosure;
FIG. 5 is a schematic flow diagram of a method of updating a target incidence relation estimation model according to one embodiment of the present disclosure;
FIG. 6 is a schematic flow diagram of a method of obtaining sample data according to one embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a storage system for user data, file data, and behavior data according to one embodiment of the present disclosure;
FIG. 8 is a system architecture diagram of a method of training an incidence relation estimation model according to an embodiment of the present disclosure;
FIG. 9 is a schematic flow diagram of a method of estimating file popularity according to one embodiment of the present disclosure;
FIG. 10 is a block diagram of a training apparatus for an incidence relation estimation model according to one embodiment of the present disclosure;
FIG. 11 is a block diagram of an apparatus for estimating file popularity according to one embodiment of the present disclosure;
FIG. 12 is a block diagram of an electronic device for implementing a method of training an incidence relation estimation model according to one embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As the internet continues to grow, data volumes are growing explosively. Acquiring data faster and more completely, and fully mining its value, has become a consensus across industries in the big data era.
FIG. 1 is a schematic diagram of an exemplary system architecture to which the training method and apparatus for an incidence relation estimation model may be applied, according to one embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied, intended to help those skilled in the art understand the technical content of the present disclosure. It does not mean that the embodiments of the present disclosure cannot be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the system architecture 100 according to this embodiment may include a terminal device 101, a network 102, and a server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired and/or wireless communication links, and so forth.
A file log system may run on the terminal device 101 and be configured to obtain the file logs generated while a related application is running. The related application may be, for example, a network disk application, and a file log may include, for example, records of the operations a user performs on files.
Specifically, the server 103 may obtain the file logs generated by the file log system on the terminal device 101, preprocess the large number of file logs (for example, by cleaning and filtering), and extract from them user data, file data, and behavior data describing users' operations on files. The behavior data may include operation behaviors such as uploading, downloading, browsing, following, and forwarding, together with the number of times each operation was performed. The server 103 may perform feature extraction on the user data, the file data, and the behavior data, and cluster the extracted features to obtain the features of a plurality of user groups, the features of a plurality of file groups, and the association degree between each user group and each file group. The features of each user group represent a certain class of users, and the features of each file group represent a certain class of files. The association degree between a user group and a file group may be determined according to the number of operations that the class of users performs on the class of files. For example, the association degree may represent the degree of interest of the user group in the file group (or the consumption rate of the user group on the file group), or may represent the popularity of the files (also referred to as heat or activity).
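As a minimal sketch of how an association degree could be derived from operation counts, the following Python example sums the counts per pair of user group and file group and scales them to the range 0 to 1. The record layout, the group labels, and the scaling by the largest total are illustrative assumptions; the disclosure states only that the association degree may be determined from the number of operations.

```python
from collections import Counter

# Hypothetical behavior records: (user_group, file_group, operation, count).
# The group labels are assumed to come from an earlier clustering step.
behavior_records = [
    ("U1", "F1", "download", 30),
    ("U1", "F1", "browse", 50),
    ("U1", "F2", "upload", 5),
    ("U2", "F2", "forward", 15),
]


def association_degrees(records):
    """Sum operation counts per (user group, file group) and scale to [0, 1]."""
    totals = Counter()
    for user_group, file_group, _operation, count in records:
        totals[(user_group, file_group)] += count
    if not totals:
        return {}
    largest = max(totals.values())
    # Scaling by the largest total is an assumption made for illustration.
    return {pair: total / largest for pair, total in totals.items()}


degrees = association_degrees(behavior_records)
```

In this sketch every operation type carries the same weight; a real pipeline might weight downloads and browsing differently.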
According to the embodiment of the disclosure, the server 103 may train a neural network model on the features of the user groups and the features of the file groups to obtain a trained neural network model. The trained model can estimate the association degree between a user group and a file group, and the popularity of the files in the file group can then be determined according to that association degree.
According to the embodiment of the present disclosure, the operation of obtaining and processing the file logs to obtain the features of the plurality of user groups, the features of the plurality of file groups, and the association degree between each user group and each file group, and the process of training the neural network with this feature data, may be executed on the same electronic device (e.g., the server 103) or on different electronic devices. For example, the training may be executed on the server 103 while the processing of the file logs to obtain the feature data is executed on other electronic devices. The embodiment of the present disclosure is not limited in this respect.
According to the embodiment of the present disclosure, after acquiring the features of the plurality of user groups, the features of the plurality of file groups, and the association degree between each user group and each file group, the server 103 may store this data, for example, in a data warehouse such as UDW. When the feature data is needed for training the neural network, it can be obtained directly from the data warehouse.
FIG. 2 is a flow diagram of a method of training an incidence relation estimation model according to one embodiment of the present disclosure.
As shown in fig. 2, the method 200 for training the incidence relation estimation model may include operations S210 to S220.
In operation S210, sample data is acquired.
According to an embodiment of the present disclosure, the sample data includes the features of a plurality of first user groups, the features of a plurality of first file groups, and the association degree between each first user group and each first file group. The features of a first user group may include, for example, region, gender, age, profession, and the like. The features of a first file group may include, for example, type, size, extension, and the like. The features of each first user group represent a certain class of users, and the features of each first file group represent a certain class of files. The association degree between a first user group and a first file group may be determined according to the number of operations the first user group performs on the first file group. For example, the association degree may represent the degree of interest of the first user group in the first file group (or the consumption rate of the first user group on the first file group), or may represent the popularity (also referred to as heat or activity) of the files in the first file group.
In operation S220, the neural-network-based incidence relation estimation model is trained using the sample data, resulting in a trained incidence relation estimation model.
According to the embodiment of the disclosure, the neural-network-based incidence relation estimation model may include a CNN (Convolutional Neural Network), an LSTM (Long Short-Term Memory) network, and the like.
According to an embodiment of the present disclosure, the training process may include: estimating, using the incidence relation estimation model, the association degree between any first user group and any first file group according to the features of that first user group and the features of that first file group; and adjusting the parameters of the incidence relation estimation model based on the difference between the estimated association degree and the association degree between that first user group and that first file group in the sample data.
According to the embodiment of the disclosure, the incidence relation estimation model can be used to represent the features of each first user group and the features of each first file group as vectors. The vector distance between the features of any first user group and the features of any first file group can then be calculated, for example as a Euclidean distance or a cosine similarity. This vector distance may characterize the association degree between the first user group and the first file group.
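The two measures named above can be sketched in Python as follows. The example vectors are made-up values, and the function names are illustrative rather than taken from the disclosure.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical feature vectors for one user group and one file group.
user_group_vector = [1.0, 0.0, 1.0]
file_group_vector = [1.0, 1.0, 0.0]

distance = euclidean_distance(user_group_vector, file_group_vector)
similarity = cosine_similarity(user_group_vector, file_group_vector)
```

Note that the two measures point in opposite directions: a smaller Euclidean distance means the vectors are closer, whereas a larger cosine similarity does.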
According to the embodiment of the disclosure, the parameters of the incidence relation estimation model can be adjusted according to the difference between the association degree estimated by the model for any first user group and any first file group and the association degree between that first user group and that first file group in the sample data, so as to obtain an updated incidence relation estimation model. The method then returns to the step of calculating the vector distance for the features of the next first user group and the next first file group, until a preset stop condition is reached.
According to the embodiment of the disclosure, a preset loss function can be used to calculate the difference between the estimated association degree and the association degree in the sample data for any first user group and any first file group. This difference represents the loss of the current incidence relation estimation model. An optimization algorithm can be used to calculate the parameters that optimize the loss function, and the parameters of the current incidence relation estimation model are adjusted to those values to obtain the updated incidence relation estimation model. The loss function may be, for example, a mean square error loss, a mean absolute error loss, a cross entropy loss, and the like. The optimization algorithm may be, for example, any of various adaptive learning rate algorithms, such as the Adam (adaptive moment estimation) algorithm. Embodiments of the present disclosure do not limit the type of loss function or the type of optimization algorithm.
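A minimal sketch of this loss-and-update cycle is given below in plain Python, using a mean square error loss and the Adam update rule. The linear model stands in for the neural network purely for brevity, and the feature vectors, targets, learning rate, and step count are illustrative assumptions, not values from the disclosure.

```python
import math


def predict(weights, features):
    """Linear stand-in for the model: estimated association degree."""
    return sum(w * x for w, x in zip(weights, features))


def mse_loss_and_gradient(weights, samples):
    """Mean square error over all samples and its gradient w.r.t. the weights."""
    n = len(samples)
    loss = 0.0
    grad = [0.0] * len(weights)
    for features, target in samples:
        error = predict(weights, features) - target
        loss += error * error / n
        for i, x in enumerate(features):
            grad[i] += 2.0 * error * x / n
    return loss, grad


def train_with_adam(samples, steps=1000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Fit the weights with the Adam update rule for a fixed number of steps."""
    dim = len(samples[0][0])
    weights = [0.0] * dim
    m = [0.0] * dim  # running mean of gradients (first moment)
    v = [0.0] * dim  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        _, grad = mse_loss_and_gradient(weights, samples)
        for i in range(dim):
            m[i] = beta1 * m[i] + (1 - beta1) * grad[i]
            v[i] = beta2 * v[i] + (1 - beta2) * grad[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            weights[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return weights


# Hypothetical samples: (concatenated group features plus a bias term, association degree).
samples = [
    ([1.0, 0.0, 1.0], 0.9),
    ([0.0, 1.0, 1.0], 0.2),
    ([1.0, 1.0, 1.0], 0.6),
]

initial_loss, _ = mse_loss_and_gradient([0.0, 0.0, 0.0], samples)
trained_weights = train_with_adam(samples)
final_loss, _ = mse_loss_and_gradient(trained_weights, samples)
```

Here the fixed step count plays the role of a preset stop condition; stopping once the loss stops decreasing would be an equally valid choice.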
According to an embodiment of the present disclosure, the preset stop condition may be a preset number of returns, that is, a number of training iterations. For example, training stops after a preset 100 iterations. The preset stop condition may also be that the loss function satisfies a preset condition. For example, training stops when the loss function converges.
According to the embodiment of the disclosure, the trained incidence relation estimation model is obtained once training stops. The trained model can estimate the association degree between a user group and a file group from the input features of the user group and of the file group. The popularity of the files in the file group can be determined according to the association degree, and the files can be stored under a strategy that matches their popularity. For example, if the popularity of a file is greater than a certain preset threshold, the file is determined to be a hot file and may be stored in a data center with relatively high egress bandwidth. If the popularity of the file is not greater than the preset threshold, the file is determined to be a cold file and may be stored on inexpensive equipment, or compressed and then stored, so that storage resources are allocated reasonably.
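The threshold rule described above could be sketched as follows. The threshold value of 0.5 and the policy labels are illustrative assumptions; the disclosure specifies only that a preset threshold separates hot files from cold files.

```python
HOT_THRESHOLD = 0.5  # assumed value; the disclosure only says "preset threshold"


def storage_policy(popularity, threshold=HOT_THRESHOLD):
    """Return a storage policy label for a file with the given popularity."""
    if popularity > threshold:
        # Hot file: keep it where egress bandwidth is relatively high.
        return "hot: high-bandwidth storage"
    # Cold file (popularity not greater than the threshold).
    return "cold: low-cost or compressed storage"
```

A file whose popularity equals the threshold is treated as cold, matching the "not greater than" wording above.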
According to the embodiment of the disclosure, the neural-network-based incidence relation estimation model is trained using the features of the first user groups, the features of the first file groups, and the association degrees between them, to obtain the trained incidence relation estimation model. The trained model can estimate the association degree between a user group and a file group from their input features, and the popularity of the files in the file group can be determined from the estimated association degree. Compared with estimating file popularity from access frequency, this can improve the accuracy of the estimate.
Furthermore, accurately estimating file popularity enables fine-grained classification of files and tiered storage of hot and cold data, so that storage resources are allocated reasonably.
According to an embodiment of the present disclosure, the features of the plurality of first user groups may be represented, for example, as A1, A2, ..., An, and the features of the plurality of first file groups may be represented as B1, B2, ..., Bn. The association degree between the features of a first user group and the features of a first file group may characterize the association degree between the two groups. For example, the association degree between A1 and B1 may characterize the association degree between the first user group of A1 and the first file group of B1, the association degree between A2 and B2 may characterize the association degree between the first user group of A2 and the first file group of B2, and so on. For another example, if the feature A8 of a first user group has no association with the features of any first file group, then the first user group A8 has no association with any first file group. Likewise, if the feature B11 of a first file group has no association with the features of any first user group, then the first file group B11 has no association with any first user group.
FIG. 3 is a flow diagram of a method of training an incidence relation estimation model according to another embodiment of the present disclosure.
According to the embodiment of the disclosure, the features of any first user group and the features of any first file group may be selected from the sample data. Features of a first user group and a first file group that have a certain association degree (such as A1 and B1, A2 and B2, etc.) may be used for training as positive samples, and features of a first user group and a first file group that have no association (such as A8 and B2, A1 and B11, etc.) may be used for training as negative samples.
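A minimal sketch of splitting group pairs into positive and negative samples is shown below. The table of known association degrees, its values, and the convention that any pair missing from the table counts as having no association are assumptions made for illustration.

```python
# Hypothetical association degrees taken from the sample data.
# Pairs missing from this table are assumed to have no association.
known_degrees = {("A1", "B1"): 0.8, ("A2", "B2"): 0.6}
user_groups = ["A1", "A2", "A8"]
file_groups = ["B1", "B2", "B11"]


def split_samples(user_groups, file_groups, known_degrees):
    """Return (positives, negatives) as lists of (user, file, degree) tuples."""
    positives, negatives = [], []
    for user_group in user_groups:
        for file_group in file_groups:
            degree = known_degrees.get((user_group, file_group), 0.0)
            if degree > 0:
                positives.append((user_group, file_group, degree))
            else:
                negatives.append((user_group, file_group, 0.0))
    return positives, negatives


positives, negatives = split_samples(user_groups, file_groups, known_degrees)
```

In practice the negatives usually far outnumber the positives, so a subset of them might be sampled rather than using all of them.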
As shown in fig. 3, the training method of the incidence relation estimation model may include operations S321 to S326.
In operation S321, a vector of the features of each first user group is determined, and a vector of the features of each first file group is determined.
According to the embodiment of the disclosure, the features of each first user group can be represented as a vector to obtain the vector of the features of each first user group, and the features of each first file group can be represented as a vector to obtain the vector of the features of each first file group.
In operation S322, a vector distance between the features of any first user group and the features of any first file group is calculated.
According to the embodiment of the disclosure, taking A1 as the feature of the first user group and B1 as the feature of the first file group, the vector distance between the vector of A1 and the vector of B1 is calculated. The vector distance may be, for example, a Euclidean distance or a cosine similarity.
In operation S323, the association degree between the first user group and the first file group is estimated according to the vector distance.
According to the embodiment of the disclosure, the association degree between the first user group of A1 and the first file group of B1 can be estimated according to the vector distance between the vector of A1 and the vector of B1. A correspondence between vector distance and association degree can be preset, and the association degree can be determined from the vector distance using this correspondence. For example, when the vector distance between the vector of A1 and the vector of B1 falls in a first interval, the association degree between the first user group of A1 and the first file group of B1 is determined to be a first association degree; when the vector distance falls in a second interval, the association degree is determined to be a second association degree, and so on.
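The preset correspondence between distance intervals and association degrees could be sketched as a lookup using Python's standard `bisect` module. The interval bounds and the degree assigned to each interval are illustrative assumptions; the disclosure does not give concrete values.

```python
import bisect

# Assumed interval boundaries for the vector distance, in ascending order.
DISTANCE_BOUNDS = [0.5, 1.0, 2.0]
# Assumed association degree per interval:
# [0, 0.5), [0.5, 1.0), [1.0, 2.0), and [2.0, infinity).
DEGREES = [1.0, 0.7, 0.3, 0.0]


def degree_from_distance(distance):
    """Look up the association degree for the interval containing distance."""
    return DEGREES[bisect.bisect_right(DISTANCE_BOUNDS, distance)]
```

This sketch assumes a distance-like measure where smaller values mean a stronger association; with cosine similarity the order of the degrees would be reversed.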
In operation S324, the loss of the incidence relation estimation model is calculated using a preset loss function, based on the estimated association degree and the association degree between the first user group and the first file group in the sample data.
According to the embodiment of the disclosure, the association degree between the first user group A1 and the first file group B1 estimated by the incidence relation estimation model may be, for example, D1, and the association degree between the first user group A1 and the first file group B1 in the sample data may be, for example, C1. The loss of the incidence relation estimation model may then be calculated from D1 and C1 using the preset loss function. The loss function may be, for example, a mean square error loss, a mean absolute error loss, a cross entropy loss, and the like, and the embodiments of the present disclosure do not limit the type of the loss function.
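For a single pair, the three loss functions named above can be sketched as follows. The values chosen for D1 and C1 are made up, and the binary form of the cross entropy assumes that both values lie between 0 and 1.

```python
import math


def squared_error(predicted, target):
    """Per-sample term of the mean square error loss."""
    return (predicted - target) ** 2


def absolute_error(predicted, target):
    """Per-sample term of the mean absolute error loss."""
    return abs(predicted - target)


def binary_cross_entropy(predicted, target, eps=1e-12):
    """Cross entropy for a target in [0, 1]; predicted is clipped away from 0 and 1."""
    p = min(max(predicted, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


# Hypothetical values: D1 is the estimated degree, C1 the degree in the sample data.
d1, c1 = 0.7, 1.0

losses = {
    "mse": squared_error(d1, c1),
    "mae": absolute_error(d1, c1),
    "cross_entropy": binary_cross_entropy(d1, c1),
}
```

Averaging these per-sample terms over a batch of pairs gives the corresponding mean losses.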
In operation S325, parameters of the incidence relation estimation model are adjusted according to the loss.
According to the embodiment of the disclosure, an optimization algorithm can be used for calculating the parameters of the incidence relation pre-estimation model when the loss function is optimal, and the parameters of the current incidence relation pre-estimation model are adjusted to the parameters when the loss function is optimal, so that the updated incidence relation pre-estimation model is obtained. The optimization algorithm may be, for example, various adaptive learning rate algorithms, such as Adam (adaptive moment estimation) algorithm, and the like. Embodiments of the present disclosure do not limit the type of optimization algorithm.
In operation S326, it is determined whether the number of returns has reached a first preset threshold. If so, the training is stopped; otherwise, the method returns to operation S322 for the features of any next first user group and the features of any next first file group.
According to the embodiment of the disclosure, in the case that the number of times of returning reaches a first preset threshold (for example, 100 times), the training is stopped, and a trained incidence relation estimation model is obtained.
According to the embodiment of the disclosure, in the case that the number of returns has not reached the first preset threshold (for example, 100), the updated incidence relation estimation model is used to calculate the vector distance between the features of any next first user group and the features of any next first file group. For example, if the feature of the next first user group is A2 and the feature of the next first file group is B2, the vector distance between A2 and B2 is calculated by using the updated incidence relation estimation model.
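The loop formed by operations S322 to S326 can be sketched as below. This is not the model of the disclosure: a one-parameter function exp(-w * distance) stands in for the neural network, plain gradient descent is swapped in for an adaptive algorithm such as Adam, and the loss is assumed to be the squared error. The names train, max_returns, and learning_rate are hypothetical.

```python
import math
from itertools import cycle


def train(samples, max_returns=100, learning_rate=0.5, w=1.0):
    """Train the stand-in model exp(-w * distance) with plain gradient descent.

    samples: list of (user_group_vector, file_group_vector, recorded_degree).
    Returns the adjusted parameter w and the number of returns performed.
    """
    returns = 0
    for user_vec, file_vec, recorded in cycle(samples):
        distance = math.dist(user_vec, file_vec)      # S322: vector distance
        estimated = math.exp(-w * distance)           # S323: estimated degree
        error = estimated - recorded
        # S324: the loss is error ** 2; this is its derivative with respect to w
        gradient = 2.0 * error * (-distance) * estimated
        w -= learning_rate * gradient                 # S325: adjust the parameter
        if returns >= max_returns:                    # S326: threshold reached
            break
        returns += 1                                  # return to S322
    return w, returns


# Hypothetical samples whose recorded degrees were generated with w = 2.0.
samples = [((0.0,), (d,), math.exp(-2.0 * d)) for d in (0.5, 1.0, 1.5)]
trained_w, returns_done = train(samples)
```

Starting from w = 1.0, each pass moves w toward the value 2.0 that generated the samples, and the loop stops once the number of returns reaches max_returns.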
FIG. 4 is a flow diagram of a method of determining a target association prediction model according to one embodiment of the present disclosure.
As shown in fig. 4, the method for determining the target association relation pre-estimation model may include operations S410 to S440.
In operation S410, the sample data is divided into a plurality of batches, each of which includes at least one user group feature and at least one file group feature.
According to the embodiment of the disclosure, the data volume of the file log is large, the data volume of the extracted sample data is also large, the sample data can be divided into a plurality of batches, and the sample data of each batch is used for training a plurality of incidence relation pre-estimation models respectively.
For example, the sample data may be divided into 10 batches. The first batch of sample data may include the features A1, ..., A100 of first user groups and the features B1, ..., B100 of first file groups, the second batch of sample data may include the features A101, ..., A200 of first user groups and the features B101, ..., B200 of first file groups, and so on.
In operation S420, a plurality of incidence relation estimation models are trained using the plurality of batches of sample data, respectively, to obtain a plurality of trained incidence relation estimation models.
According to the embodiment of the disclosure, the first user group features A1, ..., A100 and the first file group features B1, ..., B100 are used for training to obtain the incidence relation estimation model of the first batch, the first user group features A101, ..., A200 and the first file group features B101, ..., B200 are used for training to obtain the incidence relation estimation model of the second batch, and so on. In this way, training on the 10 batches yields 10 trained incidence relation estimation models.
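The batch division of operation S410 can be illustrated with the following sketch, which splits a list of (user group feature, file group feature) pairs into consecutive batches. The function name split_into_batches and the string labels used as stand-ins for features are assumptions.

```python
import math


def split_into_batches(samples, num_batches):
    """Split samples into at most num_batches consecutive batches of near-equal size."""
    if num_batches < 1:
        raise ValueError("num_batches must be at least 1")
    if not samples:
        return []
    batch_size = math.ceil(len(samples) / num_batches)
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


# Hypothetical sample data: pairs (A1, B1), ..., (A1000, B1000).
pairs = [(f"A{i}", f"B{i}") for i in range(1, 1001)]
batches = split_into_batches(pairs, 10)
```

With 1000 pairs and 10 batches, the first batch holds A1 to A100 with B1 to B100 and the second holds A101 to A200 with B101 to B200, matching the example above. Each batch would then be passed to a separate training run.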
In operation S430, the precision of each trained incidence relation estimation model is calculated.
According to an embodiment of the present disclosure, for the obtained plurality of trained incidence relation estimation models, the precision of each trained incidence relation estimation model may be calculated using verification data. The verification data may include features of a plurality of second user groups, features of a plurality of second file groups, and a degree of association between each second user group and each second file group. The verification data may be data acquired separately from the data warehouse that differs from the sample data. Alternatively, the verification data may be a part extracted from the sample data, in which case the features of the plurality of second user groups and the features of the plurality of second file groups are the features of the first user groups and the features of the first file groups in that part of the sample data.
According to the embodiment of the disclosure, each trained incidence relation estimation model is used to estimate the degree of association between each second user group and each second file group according to the features of each second user group and the features of each second file group. The precision of the trained incidence relation estimation model is then calculated based on the estimated degree of association between each second user group and each second file group and the degree of association between each second user group and each second file group in the verification data.
It can be understood that the accuracy of the degrees of association estimated by the incidence relation estimation model can be calculated based on the estimated degree of association between each second user group and each second file group and the degree of association between each second user group and each second file group in the verification data, and the precision of the incidence relation estimation model can be determined according to this accuracy.
In operation S440, the trained incidence relation estimation model with the highest precision is determined as the target incidence relation estimation model.
According to the embodiment of the disclosure, the incidence relation estimation model with the highest precision in the 10 trained incidence relation estimation models can be selected as the final target incidence relation estimation model, and the target incidence relation estimation model can be used for online estimation of the incidence degree between the characteristics of the specific user group and the characteristics of the specific file group.
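Operations S430 and S440 can be sketched as follows. The disclosure does not define how precision is computed; here it is assumed to be the fraction of verification pairs whose estimated degree lies within a tolerance of the recorded degree. Models are represented as plain callables, and the names precision, select_target_model, and tolerance are hypothetical.

```python
def precision(model, verification_data, tolerance=0.1):
    """Fraction of verification pairs estimated within `tolerance` of the record."""
    if not verification_data:
        raise ValueError("verification_data must not be empty")
    hits = sum(
        1
        for user_feat, file_feat, recorded in verification_data
        if abs(model(user_feat, file_feat) - recorded) <= tolerance
    )
    return hits / len(verification_data)


def select_target_model(models, verification_data):
    """Return the model with the highest precision on the verification data."""
    return max(models, key=lambda m: precision(m, verification_data))


# Hypothetical verification data: (second user group feature,
# second file group feature, recorded degree of association).
verification = [((0.0,), (0.0,), 1.0), ((0.0,), (1.0,), 0.5)]


def model_a(user_feat, file_feat):
    return 1.0  # always estimates the maximum degree


def model_b(user_feat, file_feat):
    return 1.0 - 0.5 * abs(user_feat[0] - file_feat[0])


target_model = select_target_model([model_a, model_b], verification)
```

In this toy setup model_b matches both verification pairs and model_a matches only one, so model_b is selected as the target model.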
FIG. 5 is a flowchart illustrating a method for updating a target association prediction model according to an embodiment of the disclosure.
As shown in FIG. 5, the method for updating the target incidence relation estimation model includes operations S510 to S540.
In operation S510, test data is acquired.
According to an embodiment of the present disclosure, the test data may include features of a third group of users and features of a third group of files, the third group of files may be a group of files in the actual application scenario, and the third group of users may be a group of users targeted in the actual application scenario. An online test system can be arranged to verify the correctness of the relevance between the third user group and the third file group estimated by the target relevance estimation model according to the relevance between the third user group and the third file group in the actual application scene.
In operation S520, a degree of association between the third user group and the third file group is estimated according to the features of the third user group and the features of the third file group using the target incidence relation estimation model.
According to the embodiment of the disclosure, the degree of association between the third user group and the third file group is estimated by using the target incidence relation estimation model. The third file group may be stored according to the estimated degree of association. For example, if the estimated degree of association between the third user group and the third file group is small, the popularity of the third file group is determined to be low, and the third file group may be compressed and then stored.
In operation S530, an actual degree of association between the third group of users and the third group of files is tested.
According to the embodiment of the disclosure, the actual association degree between the third user group and the third file group can be determined according to the number of times that the third user group operates the third file group in the actual application scene.
In operation S540, in a case that a difference between the estimated degree of association between the third user group and the third file group and the actual degree of association between the third user group and the third file group exceeds a second preset threshold, sample data is updated based on the features of the third user group, the features of the third file group, and the actual degree of association between the third user group and the third file group.
According to the embodiment of the disclosure, if the difference between the estimated degree of association between the third user group and the third file group and the actual degree of association between the third user group and the third file group exceeds the second preset threshold (for example, 50%), the estimate of the target incidence relation estimation model is of low correctness. For example, the third user group may in fact operate on the third file group frequently, so that compressing and storing the third file group as cold data is not appropriate. Therefore, the features of the third user group, the features of the third file group, and the actual degree of association between them are used as new sample data to train the incidence relation estimation model, so as to update the incidence relation estimation model.
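The check in operation S540 can be sketched as below. The disclosure does not state whether the difference is absolute or relative; this sketch assumes an absolute difference between degrees expressed on a 0 to 1 scale, with 0.5 standing for the 50% example. The function name maybe_update_samples is hypothetical.

```python
def maybe_update_samples(samples, user_feat, file_feat, estimated, actual,
                         threshold=0.5):
    """Append a corrected sample when the estimate misses by more than threshold.

    Returns True if the sample data was updated, so that the caller knows
    the model should be retrained.
    """
    if abs(estimated - actual) > threshold:
        samples.append((user_feat, file_feat, actual))
        return True
    return False


# Hypothetical test data: the model judged the file group cold (0.1),
# but the user group actually operated on it frequently (0.9).
sample_data = []
needs_retraining = maybe_update_samples(
    sample_data, "third_user_group", "third_file_group", estimated=0.1, actual=0.9
)
```

Storing the actual degree rather than the estimated one is what lets the next training run correct the misestimate.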
Fig. 6 is a flowchart of a method of obtaining sample data according to one embodiment of the present disclosure.
As shown in fig. 6, the method for acquiring sample data includes operations S610 to S650.
In operation S610, a plurality of file operation records are obtained from the file log, and each file operation record includes user data, file data, and user behavior data for the file.
According to the embodiment of the disclosure, the file log may be an operation record generated by a user operating a file, and the operation record includes user data, file data, and behavior data of the user on the file. The behavior data may include operation behaviors such as upload, download, browse, focus, and forward, and the number of times each operation behavior is performed. The user data, file data, and behavior data can be extracted from the operation log.
In operation S620, feature extraction is performed on the user data and the file data in the file operation records, respectively, to obtain features of the first users and features of the first files.
According to the embodiment of the disclosure, preprocessing operations such as cleaning, filtering and normalization can be performed on the user data, the file data and the behavior data, so that normalized user data, file data and behavior data are obtained. The normalized user data may be subjected to feature extraction to obtain features of a plurality of first users, where the features of each first user may include region, age, occupation, and the like. The normalized file data is subjected to feature extraction, so that the features of a plurality of first files can be obtained, and the features of each first file can comprise type, size, extension name and the like. An association relationship between the first user and the first file may be determined based on the behavioral data. The behavior data can be extracted to obtain the operation times of the first user on the first file and the like.
In operation S630, the features of the plurality of first users are clustered to obtain features of a plurality of first user groups.
According to the embodiment of the disclosure, the features of the plurality of first users may be clustered according to feature dimensions, for example, clustering according to regions or clustering according to ages, so as to obtain the features of the plurality of first user groups, and each first user group may represent a certain class of user group.
In operation S640, the features of the plurality of first files are clustered to obtain features of a plurality of first file groups.
According to the embodiment of the disclosure, the features of the plurality of first files may be clustered according to feature dimensions, for example, clustering according to types or clustering according to sizes, so as to obtain the features of the plurality of first file groups, and each first file group may represent a certain class of file groups.
In operation S650, according to the behavior data of the user on the file, a degree of association between each first user group and each first file group is determined.
According to the embodiment of the disclosure, the number of operations performed by each first user group on each first file group can be determined according to the behavior data. This number may be the total number of operations performed by all users in a certain class of user group on all files in a certain class of file group, or the average number of such operations. The degree of association between each first user group and each first file group can be determined according to the number of operations performed by each first user group on each first file group.
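Operations S630 to S650 can be illustrated with the sketch below. Grouping by a single feature dimension (user region, file type) is used as a simple stand-in for clustering, and the degree of association is assumed to be the total operation count of a group pair divided by the largest total. The record fields and function names are hypothetical.

```python
from collections import defaultdict


def group_operation_counts(records):
    """Total operations per (user group, file group) pair.

    Users are grouped by region and files by type.
    """
    totals = defaultdict(int)
    for record in records:
        totals[(record["region"], record["file_type"])] += record["operations"]
    return dict(totals)


def association_degrees(totals):
    """Scale totals into [0, 1] by dividing by the largest total (an assumption)."""
    peak = max(totals.values(), default=0)
    if peak == 0:
        return {pair: 0.0 for pair in totals}
    return {pair: count / peak for pair, count in totals.items()}


# Hypothetical file operation records extracted from a file log.
records = [
    {"region": "north", "file_type": "video", "operations": 3},
    {"region": "north", "file_type": "video", "operations": 5},
    {"region": "south", "file_type": "document", "operations": 2},
]
totals = group_operation_counts(records)
degrees = association_degrees(totals)
```

An average count per user or per file could be substituted for the total by dividing each total by the size of the corresponding group.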
FIG. 7 is a schematic diagram of a storage system for user data, file data, and behavioral data according to one embodiment of the present disclosure.
As shown in FIG. 7, the system may include an AFS (Andrew File System) cluster 710 and a data warehouse 720. The data warehouse 720 includes a first storage space 721, a second storage space 722, and a third storage space 723.
According to the embodiment of the disclosure, a collection task may be set, and when the collection task is executed, the collection task may obtain a file log from a file log system according to different collection periods, and store the file log to the AFS cluster 710. The different acquisition periods may include days, months and years. As shown in fig. 7, the file logs acquired according to different acquisition cycles may be, for example, a file log 1, a file log 2, a file log 3, and the like, where the file log 1, the file log 2, and the file log 3 all include user data, file data, and behavior data.
According to the embodiment of the disclosure, for the file log in the AFS cluster 710, preprocessing such as cleaning, filtering, extracting, mapping and the like of data can be performed according to different ETL (Extract-Transform-Load) rules, so as to obtain user data, file data and behavior data, and store the obtained user data, file data and behavior data into the data warehouse 720. User data, file data and behavior data obtained by preprocessing file logs of different acquisition cycles can be stored in different storage spaces in the data warehouse 720 according to different acquisition cycles, illustratively, the user data, the file data and the behavior data obtained by preprocessing the file logs acquired by days can be stored in the first storage space 721, the user data, the file data and the behavior data obtained by preprocessing the file logs acquired by months can be stored in the second storage space 722, and the user data, the file data and the behavior data obtained by preprocessing the file logs acquired by years can be stored in the third storage space 723.
Fig. 8 is a system architecture diagram of a training method of an incidence relation pre-estimation model according to an embodiment of the disclosure.
As shown in FIG. 8, the system architecture may include a data processing system 810, an incidence relation prediction model training system 820, and an incidence relation prediction model testing system 830, where the incidence relation prediction model training system 820 includes a model training subsystem 821 and a model validation subsystem 822.
According to an embodiment of the present disclosure, the data processing system 810 is configured to perform feature extraction and clustering on the user data, the file data, and the behavior data to obtain features of a plurality of user groups, features of a plurality of file groups, and a degree of association between each user group and each file group. The features of the N first user groups and the features of the N first file groups may be selected from the features of the plurality of user groups and the features of the plurality of file groups as training data.
According to the embodiment of the present disclosure, the model training subsystem 821 in the incidence relation estimation model training system 820 is configured to perform model training using the training data. Specifically, the model training subsystem 821 may calculate a vector distance between the features of any first user group and the features of any first file group, estimate the degree of association between the first user group and the first file group according to the vector distance, calculate a loss of the incidence relation estimation model according to the estimated degree of association and the actual degree of association between the first user group and the first file group, and adjust the parameters of the incidence relation estimation model according to the loss to obtain an updated incidence relation estimation model. The training process is repeated with the updated incidence relation estimation model for the features of any next first user group and the features of any next first file group until the number of training iterations reaches a preset value, so as to obtain a trained incidence relation estimation model.
According to the embodiment of the disclosure, the features of the N first user groups and the features of the N first file groups as training data may be divided into a plurality of batches for training, for example, the batches are divided into M batches for training, and the training data of each batch may be trained to obtain a trained incidence relation estimation model, so that M trained incidence relation estimation models may be obtained.
According to an embodiment of the present disclosure, the features of k second user groups and the features of k second file groups may be selected from the features of the plurality of user groups and the features of the plurality of file groups in the data processing system 810 as the verification data. The model verification subsystem 822 in the incidence relation estimation model training system 820 is configured to use the verification data to select the model with the highest precision from the M trained incidence relation estimation models as the optimal incidence relation estimation model. Specifically, each trained incidence relation estimation model is used to estimate the degree of association between each second user group and each second file group according to the features of each second user group and the features of each second file group, and the precision of the trained incidence relation estimation model is calculated based on the estimated degree of association between each second user group and each second file group and the degree of association between each second user group and each second file group in the verification data. The incidence relation estimation model with the highest precision among the M trained incidence relation estimation models is taken as the optimal incidence relation estimation model.
According to the embodiment of the disclosure, the incidence relation estimation model test system 830 may be used to test the accuracy of the degrees of association estimated by the incidence relation estimation model. Specifically, the features of S third user groups and the features of S third file groups may be obtained, where the third file groups may be file groups in an actual application scenario (e.g., a webdisk application scenario), and the third user groups may be the user groups targeted in the actual application scenario. The incidence relation estimation model test system 830 may estimate the degree of association between a third user group and a third file group using the optimal incidence relation estimation model, and verify the accuracy of the estimated degree of association against the degree of association between the third user group and the third file group in the actual application scenario. The features of third user groups and third file groups for which the estimate is of low accuracy are used as feedback data to update the incidence relation estimation model. Specifically, the features of the third user group and the features of the third file group may be used as updated training data to train the model again.
FIG. 9 is a flowchart illustrating a method for predicting file popularity according to one embodiment of the present disclosure.
As shown in FIG. 9, the method 900 for estimating the popularity of a file includes operations S910-S930.
In operation S910, input data is acquired.
According to an embodiment of the present disclosure, the input data may include characteristics of a target file group and characteristics of a target user group, the target file group may be a file group in a real service scene, and the target user group may be a user group targeted in the real service scene.
In operation S920, the association degree between the target file group and the target user group is estimated according to the characteristics of the target file group and the characteristics of the target user group using the association relationship estimation model.
According to the embodiment of the disclosure, the vector distance between the features of the target file group and the features of the target user group can be calculated by using the incidence relation estimation model, and the degree of association between the target file group and the target user group can be estimated according to the vector distance.
In operation S930, a popularity of a target file in the target file group is determined according to the estimated degree of association between the target file group and the target user group.
According to the embodiment of the present disclosure, a correspondence between degree of association and popularity may be preset. For example, if the degree of association between the target file group and the target user group is in a first interval (e.g., 0-10%), the popularity of the target file in the target file group is a first popularity (e.g., 2%); if the degree of association is in a second interval (e.g., 11%-20%), the popularity of the target file is a second popularity (e.g., 4%); and so on.
According to the embodiment of the disclosure, if the determined popularity of the target file in the target file group is greater than a third preset threshold (for example, 4), the target file can be determined to be hot data and can be preloaded and then stored, which increases the access speed of the file and improves the user experience. If the determined popularity of the target file in the target file group is not greater than the third preset threshold (for example, 4), the target file can be determined to be cold data and can be compressed and then stored, or stored in an infrequently used storage device, so that storage resources are used reasonably.
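The mapping from degree of association to popularity, followed by the storage decision, can be sketched as below. The first two intervals and popularity values follow the examples above, read as 0.10, 0.20 and 2, 4; the third popularity value of 6, the treatment of interval upper bounds as inclusive, and the names popularity_from_association and storage_action are assumptions.

```python
from bisect import bisect_left

# Upper bounds of the association intervals (inclusive) and the popularity
# assigned to each interval. The last popularity value is an assumption.
ASSOCIATION_UPPER_BOUNDS = [0.10, 0.20]
POPULARITY_VALUES = [2, 4, 6]


def popularity_from_association(degree: float) -> int:
    """Look up the preset popularity for an estimated degree of association."""
    # bisect_left counts the bounds strictly below degree, so a degree
    # equal to a bound stays in the lower interval.
    return POPULARITY_VALUES[bisect_left(ASSOCIATION_UPPER_BOUNDS, degree)]


def storage_action(popularity: int, threshold: int = 4) -> str:
    """Hot data is preloaded before storage; cold data is compressed."""
    return "preload" if popularity > threshold else "compress"


# Hypothetical estimated degree for a target file group and target user group.
estimated_degree = 0.35
file_popularity = popularity_from_association(estimated_degree)
action = storage_action(file_popularity)
```

A popularity equal to the threshold is treated as cold, which matches the "not greater than" wording of the text.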
FIG. 10 is a block diagram of a training apparatus for an incidence relation pre-estimation model according to one embodiment of the present disclosure.
As shown in fig. 10, the training apparatus 1000 for the incidence relation pre-estimation model may include a first obtaining module 1001 and a training module 1002.
The first obtaining module 1001 is configured to obtain sample data, where the sample data includes features of a plurality of first user groups, features of a plurality of first file groups, and association degrees between each first user group and each first file group.
The training module 1002 is configured to train the incidence relation prediction model based on the neural network using the sample data, so as to obtain a trained incidence relation prediction model.
According to an embodiment of the present disclosure, training module 1002 includes an estimation unit and an adjustment unit.
The estimation unit is used for estimating the association degree between any first user group and any first file group according to the characteristics of any first user group and the characteristics of any first file group by using the association relation estimation model.
The adjusting unit is used for adjusting parameters of the incidence relation estimation model based on the estimated incidence degree and the difference between the incidence degrees of any first user group and any first file group in the sample data.
According to an embodiment of the present disclosure, the pre-estimation unit includes a first determination subunit, a second determination subunit, a first calculation subunit, and a pre-estimation subunit.
The first determining subunit is configured to determine a vector of features of any of the first user groups.
The second determining subunit is configured to determine a vector of features of any of the first group of files.
The first calculating subunit is configured to calculate a vector distance between the features of any first user group and the features of any first file group.
The estimation subunit is used for estimating the association degree between any first user group and any first file group according to the vector distance.
According to an embodiment of the present disclosure, the adjusting unit includes a second calculating subunit and an adjusting subunit.
The second calculating subunit is configured to calculate, by using a preset loss function, a loss of the association relationship prediction model based on the predicted association degree and the association degree between any first user group and any first file group in the sample data.
The adjusting subunit is configured to adjust parameters of the association relation estimation model according to the loss, and return to the estimation unit for the features of any next first user group and the features of any next first file group until the number of returns reaches a first preset threshold.
According to the embodiment of the disclosure, there are a plurality of incidence relation estimation models. The training module 1002 includes a dividing unit and a training unit.
The dividing unit is used for dividing the sample data into a plurality of batches, and each batch comprises the characteristics of at least one first user group and the characteristics of at least one first file group.
The training unit is used for training the incidence relation estimation models by using a plurality of batches of sample data respectively to obtain a plurality of trained incidence relation estimation models.
According to the embodiment of the disclosure, the training apparatus 1000 of the incidence relation pre-estimation model further includes a calculation module and a second determination module.
The calculation module is used for calculating the precision of each trained incidence relation estimation model.
The second determination module is used for determining the incidence relation estimation model with the highest precision as the target incidence relation estimation model.
According to an embodiment of the present disclosure, a calculation module includes an acquisition unit and a first verification unit.
The obtaining unit is used for obtaining verification data, and the verification data comprises the characteristics of a plurality of second user groups, the characteristics of a plurality of second file groups and the association degree between each second user group and each second file group.
For each trained incidence relation estimation model, the first verification unit is used for estimating, by using the trained incidence relation estimation model, the degree of association between each second user group and each second file group according to the features of each second user group and the features of each second file group, and for calculating the precision of the trained incidence relation estimation model based on the estimated degree of association between each second user group and each second file group and the degree of association between each second user group and each second file group in the verification data.
According to the embodiment of the present disclosure, the training apparatus 1000 of the incidence relation pre-estimation model further includes a third obtaining module, a second pre-estimation module, a testing module and an updating module.
The third obtaining module is used for obtaining test data, and the test data comprises characteristics of a third user group and characteristics of a third file group.
The second estimation module is used for estimating the association degree between the third user group and the third file group according to the characteristics of the third user group and the characteristics of the third file group by using the target association relation estimation model.
The testing module is used for testing the actual association degree between the third user group and the third file group.
The updating module is used for updating the sample data based on the characteristics of the third user group, the characteristics of the third file group and the actual association degree between the third user group and the third file group under the condition that the difference between the estimated association degree between the third user group and the third file group and the actual association degree between the third user group and the third file group exceeds a second preset threshold value.
According to an embodiment of the present disclosure, the first obtaining module 1001 is specifically configured to obtain a plurality of file operation records from a file log, where each file operation record includes user data, file data, and behavior data of a user on a file; respectively extracting the characteristics of the user data and the file data in the file operation records to obtain the characteristics of a plurality of first users and the characteristics of a plurality of first files; clustering the characteristics of the first users to obtain the characteristics of a plurality of first user groups; clustering the characteristics of the first files to obtain the characteristics of a plurality of first file groups; and determining the association degree between each first user group and each first file group according to the behavior data of the user to the file.
FIG. 11 is a block diagram of an apparatus for estimating file popularity according to one embodiment of the present disclosure.
As shown in fig. 11, the apparatus 1100 for estimating file popularity may include a second obtaining module 1101, a first estimating module 1102 and a first determining module 1103.
The second obtaining module 1101 is configured to obtain input data, where the input data includes features of a target file group and features of a target user group.
The first estimation module 1102 is configured to estimate the association degree between the target file group and the target user group according to the characteristics of the target file group and the characteristics of the target user group by using an association relationship estimation model.
The first determining module 1103 is configured to determine the heat of the target file in the target file group according to the estimated association between the target file group and the target user group.
According to an embodiment of the present disclosure, the apparatus 1100 for predicting the popularity of a file may further include a storage module.
The storage module is used for storing the target file according to the heat degree of the target file.
According to the embodiment of the disclosure, the storage module is specifically configured to, under the condition that the heat degree of the target file is determined to be greater than a third preset threshold, pre-load the target file and store the pre-loaded target file; and compressing and storing the target file under the condition that the heat degree of the target file is determined to be less than or equal to a third preset threshold value.
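The threshold rule applied by the storage module might look like the following sketch. It assumes that "preloading" means keeping an uncompressed copy in a fast tier and that cold files are compressed with zlib; the two dictionaries stand in for real storage back ends, and `store_by_heat` and `third_threshold` are hypothetical names.

```python
import zlib
from typing import Dict


def store_by_heat(
    name: str,
    data: bytes,
    heat: float,
    third_threshold: float,
    cache: Dict[str, bytes],
    cold_store: Dict[str, bytes],
) -> str:
    """Store a file according to its heat degree.

    Hot files (heat strictly above the threshold) are kept uncompressed
    in the cache; all other files are compressed into cold storage.
    """
    if heat > third_threshold:
        cache[name] = data  # assumed meaning of "preload, then store"
        return "preloaded"
    cold_store[name] = zlib.compress(data)
    return "compressed"
```

A file whose heat degree equals the threshold is treated as cold, matching the "less than or equal to" condition in the text.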
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 12, the device 1200 includes a computing unit 1201, which can perform various appropriate actions and processes in accordance with a computer program stored in a read-only memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a random access memory (RAM) 1203. The RAM 1203 may also store various programs and data required for the operation of the device 1200. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Various components in the device 1200 are connected to the I/O interface 1205, including: an input unit 1206, such as a keyboard or a mouse; an output unit 1207, such as various types of displays and speakers; a storage unit 1208, such as a magnetic disk or an optical disk; and a communication unit 1209, such as a network card, a modem, or a wireless communication transceiver. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunication networks.
The computing unit 1201 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1201 performs the methods and processes described above, such as the training method of the incidence relation estimation model. For example, in some embodiments, the training method of the incidence relation estimation model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the training method of the incidence relation estimation model described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the training method of the incidence relation estimation model in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (17)
1. A training method of an incidence relation pre-estimation model comprises the following steps:
acquiring sample data, wherein the sample data comprises the characteristics of a plurality of first user groups, the characteristics of a plurality of first file groups and the association degree between each first user group and each first file group;
and training an incidence relation pre-estimation model based on a neural network by using the sample data to obtain the trained incidence relation pre-estimation model.
2. The method of claim 1, the training comprising:
estimating the association degree between any first user group and any first file group according to the characteristics of any first user group and the characteristics of any first file group by using the incidence relation estimation model;
adjusting parameters of the incidence relation pre-estimation model based on the difference between the estimated association degree and the association degree between said any first user group and said any first file group in the sample data.
3. The method of claim 2, wherein said estimating the association degree between any first user group and any first file group according to the characteristics of any first user group and the characteristics of any first file group by using the association estimation model comprises:
determining a feature vector of said any first user group;
determining a feature vector of said any first file group;
calculating the vector distance between the feature vector of said any first user group and the feature vector of said any first file group;
and estimating, according to the vector distance, the association degree between said any first user group and said any first file group.
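The distance-based estimate in claim 3 can be sketched as follows. The claim names neither the distance metric nor the mapping from distance to association degree, so this example assumes Euclidean distance and the mapping 1 / (1 + d); in the claimed model the vectors would presumably be produced by the neural network, whereas here they are passed in directly. Both function names are hypothetical.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def association_from_distance(
    user_vec: Sequence[float], file_vec: Sequence[float]
) -> float:
    """Map a vector distance to an association degree in (0, 1].

    Assumption: closer vectors mean a stronger association, expressed
    here as 1 / (1 + distance).
    """
    return 1.0 / (1.0 + euclidean_distance(user_vec, file_vec))
```

Identical vectors give an association degree of 1.0, and the value decays toward 0 as the vectors move apart.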
4. The method of claim 2, wherein said adjusting parameters of the incidence relation pre-estimation model based on the difference between the estimated association degree and the association degree between said any first user group and said any first file group in the sample data comprises:
calculating a loss of the incidence relation estimation model by using a preset loss function, based on the estimated association degree and the association degree between said any first user group and said any first file group in the sample data;
and adjusting parameters of the incidence relation estimation model according to the loss, and returning to the step of estimating the association degree between any first user group and any first file group for the characteristics of a next first user group and the characteristics of a next first file group, until the number of returns reaches a first preset threshold.
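The loop in claim 4 (estimate, compute a loss, adjust parameters, repeat until an iteration count is reached) can be sketched with a deliberately small stand-in model. The sketch assumes a single sigmoid unit over the concatenated group features, a squared-error loss, and plain stochastic gradient descent; none of these choices is stated in the claims, and `train`, `predict`, `first_threshold`, and `lr` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

# (user group features, file group features, association degree in the sample data)
Sample = Tuple[Sequence[float], Sequence[float], float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(w: List[float], b: float,
            user: Sequence[float], file_feats: Sequence[float]) -> float:
    """Estimate the association degree for one user group / file group pair."""
    x = list(user) + list(file_feats)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train(samples: Sequence[Sample], first_threshold: int,
          lr: float = 0.5) -> Tuple[List[float], float]:
    """Run exactly `first_threshold` update steps, cycling through the samples."""
    dim = len(samples[0][0]) + len(samples[0][1])
    w, b = [0.0] * dim, 0.0
    for step in range(first_threshold):
        user, file_feats, target = samples[step % len(samples)]
        x = list(user) + list(file_feats)
        pred = predict(w, b, user, file_feats)
        # Squared-error loss (pred - target)^2; gradient w.r.t. the pre-activation.
        grad = 2.0 * (pred - target) * pred * (1.0 - pred)
        w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
        b -= lr * grad
    return w, b
```

Stopping on a fixed number of iterations, as in the claim, keeps the training time predictable; a real implementation might also monitor the loss itself.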
5. The method of claim 1, wherein there are a plurality of incidence relation estimation models, and the training of the neural-network-based incidence relation estimation model using the sample data comprises:
dividing the sample data into a plurality of batches, wherein each batch comprises the characteristics of at least one first user group and the characteristics of at least one first file group;
and training a plurality of incidence relation pre-estimation models by using a plurality of batches of sample data respectively to obtain a plurality of trained incidence relation pre-estimation models.
6. The method of claim 5, further comprising:
calculating the precision of each trained incidence relation estimation model;
and determining the incidence relation estimation model with the highest precision as a target incidence relation estimation model.
7. The method of claim 6, wherein the calculating the precision of each trained incidence relation estimation model comprises:
acquiring verification data, wherein the verification data comprises the characteristics of a plurality of second user groups, the characteristics of a plurality of second file groups and the association degree between each second user group and each second file group;
and for each trained incidence relation estimation model, estimating the association degree between each second user group and each second file group by using the trained incidence relation estimation model according to the characteristics of each second user group and the characteristics of each second file group, and calculating the precision of the trained incidence relation estimation model based on the estimated association degree between each second user group and each second file group and the association degree between each second user group and each second file group in the verification data.
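Claims 6 and 7 together describe scoring each trained model on verification data and keeping the best one. A minimal sketch follows, assuming precision is the fraction of verification pairs whose estimate lies within a tolerance of the recorded association degree; the claims do not define the precision measure, and `precision`, `select_target_model`, and `tolerance` are hypothetical names. Models are represented as plain callables.

```python
from typing import Callable, Sequence, Tuple

# A model maps (user group features, file group features) to an association degree.
Model = Callable[[Sequence[float], Sequence[float]], float]
Validation = Tuple[Sequence[float], Sequence[float], float]


def precision(model: Model, validation: Sequence[Validation],
              tolerance: float = 0.1) -> float:
    """Fraction of verification pairs estimated within `tolerance`."""
    if not validation:
        raise ValueError("verification data must not be empty")
    hits = sum(
        1 for user, file_feats, actual in validation
        if abs(model(user, file_feats) - actual) <= tolerance
    )
    return hits / len(validation)


def select_target_model(models: Sequence[Model],
                        validation: Sequence[Validation],
                        tolerance: float = 0.1) -> Model:
    """Return the model with the highest precision on the verification data."""
    return max(models, key=lambda m: precision(m, validation, tolerance))
```

When several models tie, `max` returns the first of them, so the order of `models` acts as the tie-breaker in this sketch.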
8. The method of claim 6, further comprising:
acquiring test data, wherein the test data comprises characteristics of a third user group and characteristics of a third file group;
estimating the association degree between the third user group and the third file group according to the characteristics of the third user group and the characteristics of the third file group by using the target association relation estimation model;
testing the actual association degree between the third user group and the third file group;
updating the sample data based on the characteristics of the third user group, the characteristics of the third file group and the actual association degree between the third user group and the third file group when the difference between the estimated association degree between the third user group and the third file group and the actual association degree between the third user group and the third file group exceeds a second preset threshold.
9. The method of claim 1, wherein said obtaining sample data comprises:
acquiring a plurality of file operation records from a file log, wherein each file operation record comprises user data, file data and behavior data of a user to a file;
respectively extracting the characteristics of the user data and the file data in the file operation records to obtain the characteristics of a plurality of first users and the characteristics of a plurality of first files;
clustering the characteristics of the first users to obtain the characteristics of the first user groups;
clustering the characteristics of the first files to obtain the characteristics of the first file groups;
and determining the association degree between each first user group and each first file group according to the behavior data of the user to the file.
10. A method for predicting the popularity of a file comprises the following steps:
acquiring input data, wherein the input data comprises characteristics of a target file group and characteristics of a target user group;
estimating the association degree between the target file group and the target user group according to the characteristics of the target file group and the characteristics of the target user group by using an association relation estimation model;
determining the heat degree of a target file in the target file group according to the estimated association degree between the target file group and the target user group;
wherein the incidence relation pre-estimation model is trained by the method according to any one of claims 1-9.
11. The method of claim 10, further comprising:
and storing the target file according to the heat degree of the target file.
12. The method of claim 11, wherein the storing the target file according to the heat degree of the target file comprises:
under the condition that the heat degree of the target file is determined to be larger than a third preset threshold value, preloading the target file and then storing the target file;
and compressing and storing the target file under the condition that the heat degree of the target file is determined to be less than or equal to the third preset threshold value.
13. A training device for an incidence relation pre-estimation model comprises:
a first acquisition module, configured to acquire sample data, wherein the sample data comprises the characteristics of a plurality of first user groups, the characteristics of a plurality of first file groups and the association degree between each first user group and each first file group;
and the training module is used for training the incidence relation estimation model based on the neural network by using the sample data to obtain the trained incidence relation estimation model.
14. An apparatus for predicting the popularity of a file, comprising:
the second acquisition module is used for acquiring input data, and the input data comprises the characteristics of a target file group and the characteristics of a target user group;
the first estimation module is used for estimating the association degree between the target file group and the target user group according to the characteristics of the target file group and the characteristics of the target user group by using an association relation estimation model;
the first determining module is used for determining the heat degree of the target file in the target file group according to the estimated association degree between the target file group and the target user group;
wherein the incidence relation pre-estimation model is trained by the apparatus according to claim 13.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 12.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 12.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110132791.3A CN112784165B (en) | 2021-01-29 | 2021-01-29 | Training method of association relation prediction model and method for predicting file heat |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110132791.3A CN112784165B (en) | 2021-01-29 | 2021-01-29 | Training method of association relation prediction model and method for predicting file heat |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112784165A | 2021-05-11 |
CN112784165B | 2024-07-19 |
Family
ID=75760154
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110132791.3A Active CN112784165B (en) | 2021-01-29 | 2021-01-29 | Training method of association relation prediction model and method for predicting file heat |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112784165B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20110062896A (en) * | 2009-12-04 | 2011-06-10 | 한국과학기술원 | Apparatus and method for searching local information |
CN105426038A (en) * | 2015-11-02 | 2016-03-23 | 北京科东电力控制系统有限责任公司 | Picture popularity algorithm based picture pre-loading method for power grid scheduling control system |
CN106528608A (en) * | 2016-09-27 | 2017-03-22 | 中国电力科学研究院 | Cold and hot storage method and system for power grid GIS (Geographic Information System) data in cloud architecture |
CN108804351A (en) * | 2018-05-30 | 2018-11-13 | 郑州云海信息技术有限公司 | A kind of caching replacement method and device |
CN109815309A (en) * | 2018-12-21 | 2019-05-28 | 航天信息股份有限公司 | A kind of user information recommended method and system based on personalization |
CN110321944A (en) * | 2019-06-26 | 2019-10-11 | 华中科技大学 | A kind of construction method of the deep neural network model based on contact net image quality evaluation |
US20200272676A1 (en) * | 2019-02-21 | 2020-08-27 | Microsoft Technology Licensing, Llc | Characterizing a place by features of a user visit |
CN111898031A (en) * | 2020-08-14 | 2020-11-06 | 腾讯科技(深圳)有限公司 | Method and device for obtaining user portrait |
CN111898904A (en) * | 2020-07-28 | 2020-11-06 | 拉扎斯网络科技(上海)有限公司 | Data processing method and device |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113259141A (en) * | 2021-06-11 | 2021-08-13 | 腾讯科技(深圳)有限公司 | Test method and device of group prediction model, storage medium and electronic equipment |
CN113259141B (en) * | 2021-06-11 | 2021-09-24 | 腾讯科技(深圳)有限公司 | Test method and device of group prediction model, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN112784165B (en) | 2024-07-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112232495B (en) | Prediction model training method, device, medium and computing equipment | |
CN108733508B (en) | Method and system for controlling data backup | |
US20190311114A1 (en) | Man-machine identification method and device for captcha | |
CN110569427A (en) | Multi-target sequencing model training and user behavior prediction method and device | |
WO2013121181A1 (en) | Method of machine learning classes of search queries | |
CN111368887B (en) | Training method of thunderstorm weather prediction model and thunderstorm weather prediction method | |
KR20210017342A (en) | Time series prediction method and apparatus based on past prediction data | |
CN111125658B (en) | Method, apparatus, server and storage medium for identifying fraudulent user | |
CN112907128A (en) | Data analysis method, device, equipment and medium based on AB test result | |
CN116684330A (en) | Traffic prediction method, device, equipment and storage medium based on artificial intelligence | |
JP2019105871A (en) | Abnormality candidate extraction program, abnormality candidate extraction method and abnormality candidate extraction apparatus | |
CN114037059A (en) | Pre-training model, model generation method, data processing method and data processing device | |
CN112784165A (en) | Training method of incidence relation estimation model and method for estimating file popularity | |
CN110929849B (en) | Video detection method and device based on neural network model compression | |
CN111783883A (en) | Abnormal data detection method and device | |
CN111309706A (en) | Model training method and device, readable storage medium and electronic equipment | |
CN113590447B (en) | Buried point processing method and device | |
CN111368864A (en) | Identification method, availability evaluation method and device, electronic equipment and storage medium | |
CN116385059A (en) | Method, device, equipment and storage medium for updating behavior data prediction model | |
CN104572820A (en) | Method and device for generating model and method and device for acquiring importance degree | |
CN103810157A (en) | Method and device for achieving input method | |
CN115935208A (en) | Online segmentation method, equipment and medium for multi-element time sequence running data of data center equipment | |
CN115393100A (en) | Resource recommendation method and device | |
CN114090535A (en) | Model training method, data storage method and device and electronic equipment | |
CN115169692A (en) | Time series prediction method and related device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||