CN114564594A - Knowledge graph user preference entity recall method based on double-tower model - Google Patents
- Publication number
- CN114564594A (application CN202210169936.1A)
- Authority
- CN
- China
- Prior art keywords
- user
- embedding
- item
- representing
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Computing Systems (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Evolutionary Biology (AREA)
- Animal Behavior & Ethology (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a knowledge-graph user-preference entity recall method based on a double-tower model. An optimization method is added to the traditional double-tower model to better learn the interactions between users and items, and the trained model is used to recall entities related to user preferences on the knowledge graph. First, the entities corresponding to the items in the user's history are taken as starting points, and all neighboring entities are retrieved along the edges. The recalled entities are then screened by the trained, optimized double-tower model. Finally, the recalled entities serve as new starting points and the above operations are repeated, ultimately forming a knowledge graph that represents the user's preferences and potential preferences.
Description
Technical Field
The invention relates to the technical field of knowledge graphs and deep learning, and in particular to a knowledge-graph user-preference entity recall method based on a double-tower model.
Background
The knowledge graph is a concept proposed by Google in 2012: a knowledge base Google uses to enhance its search engine. Essentially, a knowledge graph describes the various entities or concepts of the real world and the relationships among them, forming a huge semantic network in which nodes represent entities or concepts and edges are formed by attributes or relationships. Each piece of knowledge is represented as a triple (h, r, t), where h is the head entity, t is the tail entity, and r is the relationship between them. With powerful semantic processing and open organization capabilities, knowledge graphs play an important role in recommender systems, intelligent question answering, and information retrieval, laying the foundation for knowledge organization and intelligent applications in the Internet era.
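To make the triple representation concrete, the following is a minimal Python sketch of a knowledge graph stored as (h, r, t) triples with one-hop neighbor retrieval, the basic operation used by the recall procedure described later; all entity and relation names here are illustrative and not taken from the patent.

```python
from collections import defaultdict

# Minimal sketch: a knowledge graph stored as (h, r, t) triples with
# one-hop neighbor retrieval along outgoing edges.
class KnowledgeGraph:
    def __init__(self):
        self.out_edges = defaultdict(set)  # h -> {(r, t)}

    def add(self, h, r, t):
        self.out_edges[h].add((r, t))

    def neighbors(self, entity):
        """All triples whose head entity is `entity` (one hop)."""
        return {(entity, r, t) for r, t in self.out_edges[entity]}

kg = KnowledgeGraph()
kg.add("Inception", "directed_by", "Christopher Nolan")   # illustrative
kg.add("Christopher Nolan", "directed", "Interstellar")
print(kg.neighbors("Inception"))
```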
Traditional recommender systems use explicit or implicit information as input for prediction and suffer from two main problems. The first is sparsity: in real scenarios, the interaction information between users and items is often very sparse, and using such scarce observations to predict a large amount of unknown information greatly increases the risk of overfitting. The second is cold start: newly added users or items have no corresponding historical information, so they are difficult to model and recommend accurately.
A knowledge graph contains rich semantic associations between entities and provides a potential source of auxiliary information for recommender systems. It introduces additional semantic relationships for items, allowing user interests to be discovered at a deeper level; the diverse relation links in the graph help diversify the recommendation results; and the graph can connect a user's history with the recommendation results, improving the user's satisfaction with and acceptance of the results and strengthening trust in the system.
Existing methods for applying knowledge graphs to recommendation fall into two main categories. Embedding-based methods use a knowledge-graph embedding algorithm to learn the entities and relations in the graph, obtain their vector representations, and introduce these vectors into the recommendation framework; examples include the DKN framework (Deep Knowledge-aware Network) based on convolutional neural networks and the CKE framework (Collaborative Knowledge base Embedding) based on collaborative knowledge-base embedding. Although embedding-based methods are highly flexible, they are usually suited to in-graph link prediction, whereas recommendation scenarios need to mine users' latent interests. Path-based methods explore the various connections between entities in the knowledge graph to provide additional guidance for the recommender system; examples include Personalized Entity Recommendation and Meta-Graph Based Recommendation. Although path-based methods use the knowledge graph in a more natural and intuitive way, they rely heavily on manually designed meta-paths, which are difficult to optimize in practice.
Summary of the Invention
In view of the problems in the prior art, the present invention provides a knowledge-graph user-preference entity recall method based on a double-tower model. An optimization method is added to the traditional double-tower model to better learn the interactions between users and items; the trained double-tower model is used to recall entities related to user preferences on the knowledge graph. First, the entities corresponding to the items in the user's history are taken as starting points, and all neighboring entities are retrieved along the edges; the recalled entities are then screened by the trained, optimized double-tower model; finally, the recalled entities serve as new starting points and the above operations are repeated, ultimately forming a knowledge graph that represents the user's preferences and potential preferences.
The technical scheme adopted by the present invention is as follows:
A knowledge-graph user-preference entity recall method based on a double-tower model comprises the following steps:
1. Define the user feature vector and the item feature vector as the inputs of the double-tower model;
2. Train the double-tower model, optimizing it by combining the in-batch softmax loss function with a hash-sequence-based frequency estimation method;
3. Define the entity mapping between the user's historical interaction matrix and the knowledge graph;
4. Through preference-entity propagation, feed the entities recalled in each round together with the user features into the optimized double-tower model to obtain predicted probabilities, screen the entities with high probability, and finally obtain a knowledge graph representing the user's preferences and potential preferences.
The process of step 1 is as follows:
1.1. User features refer to the user's interactions with items, including click records, search records, social data, personal data, and sample age. The user feature vector is obtained by converting these interaction data into vectors and concatenating them. Converting raw data into vectors is called vector embedding, a method commonly used in machine learning to represent data features; the goal is to extract features from the raw data, i.e., a low-dimensional vector obtained through a neural-network mapping.
Further, the process of 1.1 is as follows:
1.1.1. The embedding of the user's click record is the weighted average of the id embeddings of all clicked items, where an id embedding maps the item's unique identifier to a vector of a fixed dimension and the weight is proportional to the time the item was browsed. The click-record embedding is computed as

$$v_{click} = \sum_{i=1}^{n} w_i \, v_{click,i}$$

where $v_{click}$ is the embedding of the user's click record, $w_i$ is the $i$-th weight, $v_{click,i}$ is the id embedding of the $i$-th item in the click record, and $n$ is the number of embeddings. The weight $w_i$ is computed from $t_i$, the time the user spent browsing item $i$, where $N$ denotes the total number of samples and $k$ the total number of positive examples;
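The following is a hedged sketch of this weighted click-record embedding; since the exact normalization of the browsing-time weights is not specified above, the sketch simply normalizes the weights to sum to one.

```python
import numpy as np

# Sketch of the click-record embedding: a weighted average of the id
# embeddings of clicked items, weights proportional to browsing time
# (normalizing them to sum to 1 is an assumption).
def click_embedding(id_embeddings, browse_times):
    """id_embeddings: (n, d) array; browse_times: (n,) browsing seconds."""
    w = np.asarray(browse_times, dtype=float)
    w = w / w.sum()                        # weights proportional to time
    return w @ np.asarray(id_embeddings)   # v_click, shape (d,)

v_click = click_embedding([[0.1, 0.3], [0.5, 0.2], [0.4, 0.9]], [30, 120, 60])
print(v_click)
```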
1.1.2. The embedding of the user's search record is obtained by segmenting the keywords of historical searches into terms, obtaining the embedding of each term through the Word2vec model, and taking a weighted average of the term embeddings.
Word segmentation is the technique by which a search engine splits the keyword string submitted by the user into individual term tokens.
The Word2vec model, proposed by Mikolov et al. in 2013, converts the content words of a text into space vectors; the value of a word vector is influenced by its context and encodes the associations between words.
其用户搜索记录的embedding的计算公式如下:The calculation formula of the embedding of its user search records is as follows:
其中vsearch表示用户搜索记录的embedding,表示第i个权重,vsearch,i表示搜索记录中第i个词条的embedding,n表示embedding的个数;where v search represents the embedding of the user search record, Represents the ith weight, v search, i represents the embedding of the ith entry in the search record, and n represents the number of embeddings;
搜索记录的embedding的权重计算:The weight calculation of the embedding of the search record:
其中搜索的有效性判断为用户是否在搜索后点击物品;The validity of the search is judged as whether the user clicks the item after the search;
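The following sketch assumes the gensim library for the Word2vec step and plain whitespace tokenization; in the patent the segmentation is performed by the search engine's own tokenizer, and the weights come from whether a click followed each search.

```python
import numpy as np
from gensim.models import Word2Vec

# Sketch of the search-record embedding: segment queries into terms,
# embed terms with Word2vec, then take a weighted average where the
# weight reflects whether a click followed the search.
queries = ["science fiction movie", "space movie soundtrack"]  # illustrative
tokens = [q.split() for q in queries]
w2v = Word2Vec(sentences=tokens, vector_size=16, min_count=1, seed=1)

term_vecs = np.array([w2v.wv[t] for q in tokens for t in q])
clicked = np.array([1, 1, 0, 1, 0, 1], dtype=float)  # click after search?
weights = clicked / clicked.sum()
v_search = weights @ term_vecs     # weighted average of term embeddings
print(v_search.shape)              # (16,)
```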
1.1.3. The user's social data comprise the weighted average of the embeddings corresponding to favorites, likes, and subscriptions. The embeddings for favorite and like data are the id embeddings of the items the user has favorited or liked; the embedding for subscription data is the id embedding of the person in charge of the subscribed item.
The social-data embedding is computed as

$$v_{social} = \sum_{i=1}^{n} w_i \, v_{social,i}$$

where $v_{social}$ is the embedding of the user's social data, $w_i$ is the $i$-th weight, and $v_{social,i}$ is the embedding of the $i$-th social-data item. For favorites and likes, the weight is computed from $t_i$, the time the user spent browsing item $i$; for subscriptions, it is computed from the browsing time of the subscribed party's $i$-th item. In both cases $N$ denotes the total number of samples and $k$ the total number of positive examples;
1.1.4. The user's personal data include gender, age, and region. Gender is a simple binary feature; age and region are continuous features that can be normalized to real values on the [0,1] interval. The embedding of the user's personal data is the vector obtained by concatenating the processed gender, age, and region values;
Further, the process of 1.1.4 is as follows:
1.1.4.1. Compute the binary representation of the user's gender, i.e., encode gender as a binary value $gender \in \{0, 1\}$;
1.1.4.2. Compute the normalized real values of the user's age and region using z-score normalization:

$$z = \frac{X - \mu}{\sigma}$$

where $X$ is the sample value, $\mu$ is the mean of all sample data, and $\sigma$ is the standard deviation of all sample data;
1.1.4.3. Concatenate the gender binary value and the normalized age and region values from 1.1.4 into one vector:

$$v_{personal} = [gender, z_{age}, z_{region}]$$

where $v_{personal}$ is the personal-data vector, $gender$ is the user's gender, and $z_{age}$ and $z_{region}$ are the normalized values of the user's age and region, respectively;
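The following is a minimal sketch of the personal-data vector, assuming the region has already been encoded as a number (the patent treats region as a continuous feature but does not give the encoding) and that gender is encoded as 1 or 0.

```python
import numpy as np

# Sketch of the personal-data vector: binary gender plus z-score
# normalized age and region code.
def z_score(x, population):
    return (x - np.mean(population)) / np.std(population)

ages = np.array([18, 25, 32, 41, 60])        # illustrative population
region_codes = np.array([3, 1, 4, 2, 5])     # region already label-encoded

v_personal = np.array([
    1.0,                       # gender, assumed encoding: 1 male, 0 female
    z_score(25, ages),         # z_age
    z_score(1, region_codes),  # z_region
])
print(v_personal)
```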
1.1.5. Concatenate the click-record embedding, search-record embedding, social-data embedding, and personal-data embedding described in 1.1 to obtain the user feature vector:

$$v_{user} = \text{concatenate}(v_{click}, v_{search}, v_{social}, v_{personal}) = [v_{click}[1], v_{click}[2], \ldots, v_{search}[1], v_{search}[2], \ldots, v_{social}[1], v_{social}[2], \ldots, v_{personal}[1], v_{personal}[2], \ldots]$$

where $v_{user}$ is the user feature vector and $v_{click}[i]$, $v_{search}[i]$, $v_{social}[i]$, and $v_{personal}[i]$ are the $i$-th components of the click, search, social, and personal embeddings, respectively;
1.2. Item features include the item's id and its context information; the item feature vector is formed by concatenating the item's id embedding with the embedding of its context information.
Further, the process of 1.2 is as follows:
1.2.1. The item's id embedding maps the item's unique identifier to a vector of the same dimension;
1.2.2. The item's context embedding is the vector obtained by passing the context information through Word2vec;
1.2.3. Concatenate the id embedding and the context embedding described in 1.2 to obtain the item feature vector:

$$v_{item} = \text{concatenate}(v_{id}, v_{context}) = [v_{id}[1], v_{id}[2], \ldots, v_{context}[1], v_{context}[2], \ldots]$$

where $v_{item}$ is the item feature vector, $v_{id}$ is the item's id embedding, $v_{context}$ is the embedding of the item's context information, and $v_{id}[i]$ and $v_{context}[i]$ are their $i$-th components;
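A short sketch of assembling the two tower inputs by concatenation follows; the component vectors are random stand-ins for the embeddings defined above, and the dimensions are illustrative.

```python
import numpy as np

# Sketch of assembling both tower inputs by concatenation.
rng = np.random.default_rng(0)
v_click, v_search = rng.random(16), rng.random(16)
v_social, v_personal = rng.random(16), rng.random(3)
v_user = np.concatenate([v_click, v_search, v_social, v_personal])

v_id, v_context = rng.random(16), rng.random(16)
v_item = np.concatenate([v_id, v_context])
print(v_user.shape, v_item.shape)  # (51,) (32,)
```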
In step 2, the double-tower model is derived from DSSM (Deep Structured Semantic Models). DSSM is a deep structured semantic model commonly used in natural language processing to solve semantic-similarity problems. From top to bottom, the double-tower model consists of an input layer, a representation layer, and a matching layer; viewed horizontally, it has two input layers, two representation layers, and one matching layer. The outputs of the two input layers feed the two representation layers, whose outputs converge at the matching layer, giving the whole the form of "twin towers". In the present invention, the two inputs of the input layer are the user feature vector and the item feature vector; the two representation layers share the same neural-network structure, and each feature vector is mapped by its network to a vector of the same dimension. Finally, the two vectors are L2-normalized and their inner product is taken. Furthermore, the optimized double-tower model of the present invention applies a hash-sequence-based frequency estimation method, which reduces the sampling bias that negative sampling may introduce in each batch, thereby optimizing the loss function.
The process of step 2 is as follows:
2.1. Define two parameterized embedding functions $u(x; \theta) \in \mathbb{R}^d$ and $v(y; \theta) \in \mathbb{R}^d$, where $\mathbb{R}^d$ denotes a $d$-dimensional real vector. $u(x; \theta)$ and $v(y; \theta)$ are the user feature vector and the item feature vector after extraction by the deep neural networks, where $x$ and $y$ are the input values required by the double-tower model.
Further, the process of 2.1 is as follows:
2.1.1. The double-tower model contains two deep neural network models. The basic unit of a neural network is the perceptron, which has several inputs and one output; the output is a learned linear combination of the inputs, passed through a nonlinear activation function to produce the final result;
2.1.2. A deep neural network extends the basic neural-network model with one input layer, one output layer, and several hidden layers. The present invention uses three hidden layers, with full connections between layers, meaning every node is connected to all nodes of the previous layer;
2.1.3. Input the user feature vector and the item feature vector into their corresponding networks. Since the model must be trained, the outputs are expressed as functions of the parameters $\theta$ to be trained: $u(x; \theta)$ and $v(y; \theta)$, respectively;
2.1.4. Apply L2 normalization to the outputs:

$$\hat{\gamma}_i = \frac{\gamma_i}{\sqrt{\sum_{j} \gamma_j^2}}$$

where $\gamma_i$ is the $i$-th component of the output vector;
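The following is a minimal PyTorch sketch of the representation towers described in 2.1: two MLPs with three hidden layers whose L2-normalized outputs are matched by an inner product. All layer sizes are illustrative assumptions; the patent does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of the double-tower representation layers: an MLP with three
# hidden layers per tower, L2-normalized outputs, inner-product matching.
class Tower(nn.Module):
    def __init__(self, in_dim, out_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),    # third hidden layer
            nn.Linear(64, out_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), p=2, dim=-1)   # L2 normalization

user_tower, item_tower = Tower(in_dim=51), Tower(in_dim=32)
x, y = torch.randn(4, 51), torch.randn(4, 32)          # feature batches
score = (user_tower(x) * item_tower(y)).sum(dim=-1)    # s(x, y) = <u, v>
print(score.shape)   # torch.Size([4])
```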
2.2. Apply the hash-sequence-based frequency estimation method, i.e., use hash-indexed arrays to record the steps at which items are sampled; this reduces the sampling bias that negative sampling may introduce in each batch and thereby optimizes the loss function. Loop over the following steps;
Further, the process of 2.2 is as follows:
2.2.1. Randomly draw $T$ samples from the sample set:

$$\mathcal{T} = \{(x_i, y_i, r_i)\}_{i=1}^{T}$$

where $x_i$ is the $i$-th user feature vector in the sample, $y_i$ is the $i$-th item feature vector, and $r_i \in [0, 1]$ is the feedback degree of the $i$-th user.
2.2.2. Use the hash-sequence-based frequency estimation method to compute the sampling probability $p_i$ of $y_i$ in each sample;
The process of 2.2.2 is as follows:
2.2.2.1. Set a learning rate $\alpha$, arrays $A$ and $D$ of size $H$, and a hash function $h$ that can map each $y_i$ to an integer in the range $[0, H]$.
2.2.2.2. For each step $t = 1, 2, \ldots$ and for each item $y_i$ in the set of all items in one batch of training samples:
2.2.2.2.1. $D[h(y_i)] \leftarrow (1 - \alpha) \cdot D[h(y_i)] + \alpha \cdot (t - A[h(y_i)])$
where $\leftarrow$ denotes assignment, $A[h(y_i)]$ is the step at which $y_i$ was last sampled, and $D[h(y_i)]$ is the estimated number of steps between consecutive samplings of $y_i$;
2.2.2.2.2. $A[h(y_i)] \leftarrow t$
2.2.2.3. For each item, the sampling probability is $p_i = 1 / D[h(y_i)]$.
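The following sketch implements the streaming frequency estimation of 2.2.2 as reconstructed above, using Python's built-in hash as the hash function h; the array size, learning rate, and initialization are assumptions.

```python
import numpy as np

# Sketch of the hash-based streaming frequency estimation: per hash
# bucket, A records the last step an item was sampled and D the running
# estimate of steps between samplings; p = 1/D.
H, alpha = 1024, 0.05          # assumed array size and learning rate
A = np.zeros(H)
D = np.full(H, 1e6)            # large init: unseen items look rare

def update(batch_item_ids, t):
    for item_id in batch_item_ids:
        b = hash(item_id) % H
        D[b] = (1 - alpha) * D[b] + alpha * (t - A[b])
        A[b] = t

def sampling_prob(item_id):
    return 1.0 / D[hash(item_id) % H]

for t in range(1, 501):                       # simulated training steps
    update(np.random.zipf(2.0, size=8), t)    # skewed item popularity
print(sampling_prob(1), sampling_prob(987654))
```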
2.2.3. Compute the optimized loss function and give its derivation;
The process of 2.2.3 is as follows:
2.2.3.1. The vector inner product gives the matching score: $s(x, y) = \langle u(x; \theta), v(y; \theta) \rangle$;
2.2.3.2. Given a vector $x$, the probability of obtaining one item $y$ out of $M$ items can be computed with the softmax function:

$$\mathcal{P}(y \mid x; \theta) = \frac{e^{s(x, y)}}{\sum_{j=1}^{M} e^{s(x, y_j)}}$$

where the $\theta$ after the semicolon denotes the parameters and $e$ is the natural constant;
2.2.3.3. The weighted log-likelihood loss function is

$$L_{\mathcal{T}}(\theta) = -\frac{1}{T} \sum_{i \in [T]} r_i \cdot \log \mathcal{P}(y_i \mid x_i; \theta)$$

where $T$ denotes the $T$ samples drawn randomly in 2.2.1;
2.2.3.4. The present invention uses a negative-sampling algorithm, which applies a smoothing strategy that raises the sampling probability of low-frequency samples; negative samples are drawn within the same batch. Given a mini-batch $\{(x_i, y_i, r_i)\}_{i=1}^{B}$, the batch-softmax function is

$$\mathcal{P}_B(y_i \mid x_i; \theta) = \frac{e^{s(x_i, y_i)}}{\sum_{j \in [B]} e^{s(x_i, y_j)}}$$
2.2.3.5. In each batch, item popularity follows a power-law distribution, so randomly sampling negatives within a batch makes popular items easy to sample, and the loss function then over-penalizes these popular items. The score is therefore corrected by frequency:

$$s_c(x_i, y_i) = s(x_i, y_i) - \log(p_i)$$

where $s_c(x_i, y_i)$ is the corrected value of $s(x_i, y_i)$ and $p_i$ comes from the hash-sequence-based frequency estimation algorithm of 2.2.2;
2.2.3.6. The corrected conditional probability function is therefore

$$\mathcal{P}_B^c(y_i \mid x_i; \theta) = \frac{e^{s_c(x_i, y_i)}}{e^{s_c(x_i, y_i)} + \sum_{j \in [B], j \neq i} e^{s_c(x_i, y_j)}}$$
2.2.3.7. The final loss function is

$$L_B(\theta) = -\frac{1}{B} \sum_{i \in [B]} r_i \cdot \log \mathcal{P}_B^c(y_i \mid x_i; \theta)$$
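The following is a sketch of the corrected in-batch softmax loss of 2.2.3.4 through 2.2.3.7, assuming PyTorch tensors for the tower outputs; each in-batch candidate's score is shifted by the log of its estimated sampling probability before the softmax.

```python
import torch
import torch.nn.functional as F

# Sketch of the bias-corrected in-batch softmax loss: score every
# user-item pair in the batch, subtract log(p_j) for each candidate
# item, then take softmax over the batch.
def corrected_in_batch_softmax_loss(u, v, r, p):
    """u, v: (B, d) normalized tower outputs; r: (B,) feedback degrees;
    p: (B,) estimated sampling probabilities of the in-batch items."""
    scores = u @ v.t()                 # scores[i, j] = s(x_i, y_j)
    scores = scores - torch.log(p)     # s_c(x_i, y_j) = s - log(p_j)
    log_probs = F.log_softmax(scores, dim=1)
    idx = torch.arange(u.size(0))
    return -(r * log_probs[idx, idx]).mean()

B, d = 4, 32
u = F.normalize(torch.randn(B, d), dim=-1)
v = F.normalize(torch.randn(B, d), dim=-1)
loss = corrected_in_batch_softmax_loss(u, v, torch.ones(B),
                                       torch.full((B,), 0.01))
print(loss.item())
```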
2.2.4. Use the gradient descent algorithm to update the parameters $\theta$ toward their optimal values. Gradient descent is an algorithm commonly used in machine learning to solve for model parameters;
The process of step 3 is as follows:
3.1. Given a user interaction matrix $Y \in \{0, 1\}^{|U| \times |V|}$, whose entry $y_{uv}$ indicates the interaction between the $u$-th user and the $v$-th item, where $U$ and $V$ denote the user set and the item set, respectively. The user interaction matrix $Y$ is expressed as

$$y_{uv} = \begin{cases} 1, & \text{if an interaction between user } u \text{ and item } v \text{ is observed} \\ 0, & \text{otherwise} \end{cases}$$
3.2. Define $O_{i,j}$ as the interaction of user $i$ with item $j$, and map the items user $i$ has interacted with to entities of the knowledge graph;
Further, the process of 3.2 is as follows:
3.2.1. $O_{i,j}$ is an interaction matrix composed of the users' row vectors; $O_{i,j}$ denotes the interaction of user $i$ with item $j$, where $j$ is the item index;
3.2.2. Define a HashMap to store all items, stored as key-value pairs in which the key holds the item's index and the value holds the entity corresponding to the item;
3.2.3. A row vector of the user interaction matrix $O$ represents one user's interactions; store this row vector in an array, defined as $E$;
3.2.4. Define a temporary set temp_set for storing entities. Traverse the array $E$; whenever an element's value is 1, use its index to access the item HashMap, obtain the corresponding entity, and store it in temp_set;
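A minimal sketch of this mapping follows, using a Python dict in place of the HashMap; the item names are illustrative.

```python
# Sketch of step 3: map one row of the interaction matrix O to
# knowledge-graph entities through the item HashMap (a dict here).
item_map = {0: "Inception", 1: "Interstellar", 2: "Dunkirk"}  # illustrative
E = [1, 0, 1]         # one user's interaction row

temp_set = {item_map[j] for j, flag in enumerate(E) if flag == 1}
print(temp_set)       # {'Inception', 'Dunkirk'}
```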
The process of step 4 is as follows:
4.1. For a given user, provide the user feature vector $v_{user}$ described in step 1;
4.2. Initialize the user's temporary set temp_set as in step 3;
4.3. Define a HashMap of the user's preferences, named user_map, whose key stores the loop count and whose value stores a set of triples;
4.4. Define a set used_set for storing entities that have already been recalled;
4.5. Let the loop count be $k = 1, 2, 3, \ldots$ and fix a maximum value $K$; when the number of triples in user_map exceeds $K$, exit the loop. Each iteration executes the following steps (a code sketch of the full loop is given after 4.6):
4.5.1. Traverse the entities in temp_set and remove those already in used_set:

temp_set ← temp_set − used_set

where ← denotes assignment and − denotes the set-difference operation;
4.5.2. Find the triple set $\Lambda_k$ in the knowledge graph and store it in user_map, defined as

$$\Lambda_k = \{(h', r', t') \mid h' \in \text{temp\_set},\ (h', r', t') \in \mathcal{G}\}$$

where $(h', r', t')$ is a triple with head entity $h'$, relation $r'$, and tail entity $t'$, and $\mathcal{G}$ denotes the knowledge graph;
4.5.3. Since the knowledge graph contains cycles, to avoid collecting entities that have already been recalled, the entities in temp_set must be added to used_set;
4.5.4. Take the tail entities of the triples in $\Lambda_k$ and store them in a set, denoted here as $E_k$;
4.5.5. Traverse the entities in $E_k$ and take out the corresponding item feature vectors $v_{item}$;
4.5.6. Input the parameters $v_{user}$ and $v_{item}$ described in step 2 into the double-tower model to obtain the probability of the user for each corresponding item in $E_k$, and sort the entities by that probability;
4.5.7. Given a value $\tau$, $0 < \tau \leq 1$, used to determine the number of entities retained by screening:

$$E_k' = \text{newSet}\big(\text{sortByProb}(E_k).\text{getSubVec}(1, \lceil \tau \cdot |E_k| \rceil)\big)$$

where sortByProb denotes the operation that sorts the elements of the set by probability and returns an array, getSubVec($i$, $j$) extracts the sub-array from the $i$-th to the $j$-th element of the original array, newSet() converts an array into a set, and $\lceil \cdot \rceil$ denotes rounding up;
4.5.8. From $\Lambda_k$, filter out the triples whose tail entity belongs to $E_k'$:

$$\Lambda = \{(h', r', t') \in \Lambda_k \mid t' \in E_k'\}$$
4.5.9. Store the triple set $\Lambda$ into user_map via the call user_map($k$, $\Lambda$);
4.5.10. Overwrite temp_set with the set $E_k'$;
4.6. After 4.5 completes, the knowledge graph of the user's preferences is finally obtained.
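As referenced in 4.5, the following sketch ties the propagation loop together; kg.neighbors, the score function, and the stopping parameters are the assumed interfaces of the components sketched earlier, not APIs fixed by the patent.

```python
import math

# Sketch of the preference-entity propagation loop of step 4.
# kg.neighbors(e) -> {(h, r, t)} and score(v_user, entity) -> probability
# are the assumed interfaces of the components sketched earlier.
def propagate(kg, v_user, seed_entities, score, K=100, tau=0.5):
    temp_set, used_set, user_map, total = set(seed_entities), set(), {}, 0
    k = 1
    while total <= K and temp_set:
        temp_set -= used_set                                      # 4.5.1
        lam_k = {tr for h in temp_set for tr in kg.neighbors(h)}  # 4.5.2
        used_set |= temp_set                                      # 4.5.3
        tails = sorted({t for (_, _, t) in lam_k},                # 4.5.4-4.5.6
                       key=lambda e: score(v_user, e), reverse=True)
        keep = set(tails[:math.ceil(tau * len(tails))])           # 4.5.7
        lam = {tr for tr in lam_k if tr[2] in keep}               # 4.5.8
        user_map[k] = lam                                         # 4.5.9
        total += len(lam)
        temp_set = keep                                           # 4.5.10
        k += 1
    return user_map

# e.g. propagate(kg, v_user, {"Inception"}, score=lambda u, e: 0.5)
```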
The beneficial effects of the present invention are as follows: the entities preferred by the user are screened from the knowledge graph by the optimized double-tower model. The optimization uses the hash-sequence-based frequency estimation method, allowing items to adapt better to a variety of data distributions. Screening the knowledge-graph entities not only yields higher-quality data, with the recalled entities genuinely close to the user's preferences, but also benefits deep recall on the knowledge graph: the number of entities recalled in each round can grow explosively with the cardinality of the previous round, and without screening this would harm computational efficiency and the exploration of the user's potential preferences.
Detailed Description
The present invention is further described below with reference to an embodiment.
实施例:Example:
第一步:定义用户特征向量和物品特征向量,作为双塔模型的输入;The first step: define the user feature vector and item feature vector as the input of the twin tower model;
所述第一步的过程如下:The process of the first step is as follows:
1.1、用户特征指的是用户对物品的交互行为,包括点击记录,搜索记录,社交数据,个人数据和样本年龄,用户特征向量是将上述交互数据转化为向量并做拼接(concatenate)。其中将原始数据转化为向量的方式称为向量嵌入(embedding),向量嵌入(embedding)是机器学习中常用的表示数据特征的方法,目的是将原始数据提取出特征,也就是通过神经网络映射之后的低维向量;1.1. User feature refers to the user's interaction behavior with items, including click records, search records, social data, personal data and sample age. User feature vector is to convert the above interaction data into vectors and concatenate them. The method of converting raw data into vectors is called vector embedding. Vector embedding is a method commonly used in machine learning to represent data features. The purpose is to extract features from raw data, that is, after mapping through neural networks. The low-dimensional vector of ;
更进一步,所述1.1的流程如下Further, the process of 1.1 is as follows
1.1.1、用户点击记录的embedding,是所有点击物品的id类embedding的加权平均,其中id类embedding是将物品唯一标识符映射到同一维度的向量,其权重与浏览物品时间成正比。其用户点击记录的embedding计算公式如下:1.1.1. The embedding of the user click record is the weighted average of the id-type embeddings of all clicked items. The id-type embedding is a vector that maps the unique identifier of the item to the same dimension, and its weight is proportional to the browsing time of the item. The embedding calculation formula of its user click record is as follows:
其中vclick表示用户点击记录的embedding,表示第i个权重,vclick,i表示点击记录中第i个物品的id类embedding,n表示embedding的个数;其中,可通过如下公式计算:Where v click represents the embedding of the user click record, represents the ith weight, v click, i represents the id class embedding of the ith item in the click record, and n represents the number of embeddings; among them, It can be calculated by the following formula:
其中表示用户对物品i浏览的时间,N表示样本总数,k表示正例总数;in Represents the time that the user browses the item i, N represents the total number of samples, and k represents the total number of positive examples;
1.1.2、用户搜索记录的embedding是历史搜索的关键词进行分词得到词条。分词的过程是通过Word2vec模型得到对应词条的embedding,然后将用户搜索记录的embedding进行加权平均。1.1.2. The embedding of the user's search record is the keyword of the historical search for word segmentation to obtain the entry. The process of word segmentation is to obtain the embedding of the corresponding entry through the Word2vec model, and then perform a weighted average of the embeddings of the user search records.
其中分词是搜索引擎针对用户提交搜索的关键词串进行切分成不同词条token的技术。The word segmentation is a technology in which the search engine divides the keyword string submitted by the user into different entry tokens.
Word2vec模型由Mikolov等人于2013年提出,该模型将文本中的内容词汇通过转换处理,化简为空间向量,词向量的数值受上下文的影响,蕴含了词与词之间相互的关联性。The Word2vec model was proposed by Mikolov et al. in 2013. This model converts the content words in the text and simplifies them into space vectors. The value of the word vector is affected by the context, which implies the correlation between words.
其用户搜索记录的embedding的计算公式如下:The calculation formula of the embedding of its user search records is as follows:
其中vsearch表示用户搜索记录的embedding,表示第i个权重,vsearch,i表示搜索记录中第i个词条的embedding,n表示embedding的个数;where v search represents the embedding of the user search record, Represents the ith weight, v search, i represents the embedding of the ith entry in the search record, and n represents the number of embeddings;
搜索记录的embedding的权重计算:The weight calculation of the embedding of the search record:
其中搜索的有效性判断为用户是否在搜索后点击物品;The validity of the search is judged as whether the user clicks the item after the search;
1.1.3、用户的社交数据包括收藏、点赞和订阅数据对应的embedding加权平均。其中收藏和点赞数据对应的embedding指的是用户收藏和点赞的物品id类的embedding;订阅数据对应的embedding指的是用户订阅物品对应的负责人的id类的embedding。1.1.3. The user's social data includes the embedding weighted average of favorites, likes, and subscription data. The embedding corresponding to the collection and like data refers to the embedding of the id category of the items that the user has favorited and liked; the embedding corresponding to the subscription data refers to the embedding of the id category of the person in charge corresponding to the user subscribed items.
其用户社交数据的embedding的计算公式如下:The calculation formula of the embedding of its user social data is as follows:
其中vsocial表示用户搜索记录的embedding,表示第i个权重,vsocial,i表示搜索记录中第i个社交数据的embedding;where v social represents the embedding of user search records, represents the ith weight, v social,i represents the embedding of the ith social data in the search record;
对于收藏和点赞的embedding的权重计算:For the weight calculation of the favorite and liked embeddings:
其中表示用户对物品i浏览的时间,N表示样本总数,k表示正例总数;in Represents the time that the user browses the item i, N represents the total number of samples, and k represents the total number of positive examples;
对于订阅的embedding权重计算:For subscription embedding weight calculation:
其中示被订阅者第i个物品的浏览时间,N表示样本总数,k表示正例总数;in Indicates the browsing time of the ith item of the subscriber, N represents the total number of samples, and k represents the total number of positive examples;
1.1.4、用户的个人数据包括用户的性别、年龄和地域。其中性别是简单的二元特征,年龄和地域属于连续型特征,可以将其归一化为[0,1]区间上的实数值。用户个人数据的embedding,是将处理过的性别、年龄和地域的值做拼接操作后得到的向量;1.1.4. The user's personal data includes the user's gender, age and region. Among them, gender is a simple binary feature, and age and region are continuous features, which can be normalized to real values in the [0,1] interval. The embedding of the user's personal data is a vector obtained by splicing the processed values of gender, age and region;
更进一步,1.1.4的流程如下:Going a step further, the process for 1.1.4 is as follows:
1.1.4.1、计算用户性别的二元表示,其公式如下:1.1.4.1. Calculate the binary representation of user gender, the formula is as follows:
1.1.4.2、计算用户的年龄和地域的归一化实数值,其归一化公式如下:1.1.4.2. Calculate the normalized real value of the user's age and region. The normalization formula is as follows:
其中X表示样本数值,μ为所有样本数据的均值,σ为所有样本数据的标准差;where X represents the sample value, μ is the mean of all sample data, and σ is the standard deviation of all sample data;
1.1.4.3、将1.1.4所述的性别二元值,年龄和地域归一化实数值做拼接操作得到一个向量,这个向量拼接操作公式如下:1.1.4.3. Perform a splicing operation on the gender binary value, age and region normalized real value described in 1.1.4 to obtain a vector. The vector splicing operation formula is as follows:
vpersonal=[gender,zage,zregion]v personal = [gender, z age , z region ]
其中vpersonal表示用户特征向量,gender表示用户性别,zage和zregion分别表示用户的年龄和地域的归一化值;where v personal represents the user feature vector, gender represents the user’s gender, and z age and z region represent the normalized value of the user’s age and region, respectively;
1.1.5、将1.1流程所述的用户点击记录的embedding,用户搜索记录的embedding,用户交互数据的embedding,用户个人数据的embedding做concatenate连接操作得到用户特征向量,其公式如下:1.1.5. The user feature vector is obtained by performing the concatenate connection operation on the embedding of the user click record, the embedding of the user search record, the embedding of the user interaction data, and the embedding of the user's personal data described in the process of 1.1. The formula is as follows:
vuser=concatenate(vclick,vsearch,vsocial,vpersonal)v user =concatenate(v click ,v search ,v social ,v personal )
=[vclick[1],vclick[2],…,vsearch[1],vsearch[2],…,vsocial[1],vsocial[2],…,vpersonal[1],vpersonal[2],…]=[v click [1],v click [2],…,v search [1],v search [2],…,v social [1],v social [2],…,v personal [1], v personal [2],…]
其中vuser表示用户特征向量,vclick[i]表示用户点击embedding的第i个分量,vsearch[i]表示用户搜索记录embedding的第i个分量,vsocial[i]表示用户社交数据embedding的第i个分量,vpersonal[i]表示用户个人数据embedding的第i个分量;where v user represents the user feature vector, v click [i] represents the ith component of the user click embedding, v search [i] represents the ith component of the user search record embedding, and v social [i] represents the user's social data embedding The i-th component, v personal [i] represents the i-th component of the user's personal data embedding;
1.2、物品特征包括物品的id及其上下文信息,物品特征向量由物品的id类embedding于其上下文信息的embedding拼接而成;1.2. Item features include the id of the item and its context information, and the item feature vector is formed by splicing the id class embedding of the item and the embedding of its context information;
更进一步,1.2的流程如下:Going a step further, the process for 1.2 is as follows:
1.2.1、给出物品的id类embedding,是将物品唯一标识符映射到同一维度的向量;1.2.1. The id class embedding of the item is given, which is a vector that maps the unique identifier of the item to the same dimension;
1.2.2、给出物品的上下文信息embedding,是将上下文信息通过Word2vec得到的向量;1.2.2. The context information embedding of the item is given, which is a vector obtained by passing the context information through Word2vec;
1.2.3、将1.2所述的id类embedding和上下文信息embedding做concatenate连接操作得到物品特征向量,其公式如下:1.2.3. Perform the concatenate connection operation of the id class embedding and the context information embedding described in 1.2 to obtain the item feature vector. The formula is as follows:
vitem=concatenate(vid,vcontext)=[vid[1],vid[2],…,vcontext[1],vcontext[2],…]v item =concatenate(v id ,v context )=[v id [1],v id [2],…,v context [1],v context [2],…]
其中vitem表示物品特征向量,vid表示物品的id类embedding,vcontext表示物品上下文信息的embedding,vid[i]表示物品的id类embedding的第i个分量,vcontext[i]表示物品上下文信息的embedding的第i个分量;Where v item represents the feature vector of the item, v id represents the id class embedding of the item, v context represents the embedding of the item context information, v id [i] represents the i-th component of the id class embedding of the item, and v context [i] represents the item The ith component of the embedding of context information;
第二步:训练双塔模型,结合in-batch softmax损失函数与基于哈希序列的频率估计方法对双塔模型进行优化;The second step: train the double-tower model, and optimize the double-tower model by combining the in-batch softmax loss function and the frequency estimation method based on the hash sequence;
所述第二步中,双塔模型衍生于DSSM(Deep Structured Semantic Models)。DSSM是深度结构化语义模型,在自然语言处理领域中,DSSM常用于解决语义相似度的问题。双塔模型从上往下可以分为输入层、表示层、匹配层,从水平方向看分为两个输入层,两个表示层和一个匹配层。其中两个输入层的输出分别是两个表示层的输入,两个表示层输出汇总到匹配层,整体呈现“双塔”的形式。本发明中,双塔模型的输入层的两个输入分别是用户特征向量和物品特征向量,两个表示层是相同的神经网络结构,特征向量通过神经网络计算后会得到相同维度的向量。最后将这两个向量经过L2归一化后进行内积。进一步的,本发明优化的双塔模型,是采用了基于哈希序列的频率估计方法对双塔模型进行优化。该方法减少了负采样在每个batch中可能会出现的采样偏差问题,从而优化了损失函数。In the second step, the twin tower model is derived from DSSM (Deep Structured Semantic Models). DSSM is a deep structured semantic model. In the field of natural language processing, DSSM is often used to solve the problem of semantic similarity. The dual-tower model can be divided into input layer, presentation layer and matching layer from top to bottom, and from the horizontal direction, it is divided into two input layers, two presentation layers and one matching layer. The outputs of the two input layers are the inputs of the two presentation layers, respectively, and the outputs of the two presentation layers are aggregated to the matching layer, and the whole is in the form of a "twin tower". In the present invention, the two inputs of the input layer of the twin-tower model are the user feature vector and the item feature vector respectively, the two representation layers have the same neural network structure, and the feature vector will obtain a vector of the same dimension after being calculated by the neural network. Finally, the two vectors are L2 normalized and then inner product is performed. Further, the optimized twin-tower model of the present invention adopts a frequency estimation method based on a hash sequence to optimize the twin-tower model. This method reduces the possible sampling bias problem of negative sampling in each batch, thereby optimizing the loss function.
所述第二步的流程如下:The process of the second step is as follows:
2.1、给出两个带参数的embedding函数:其中表示d维实数向量。2.1. Give two embedding functions with parameters: in Represents a d-dimensional real vector.
是通过深度神经网络提取后的用户特征向量和物品特征向量,其中和分别是双塔模型需要的输入值。 are the user feature vector and item feature vector extracted by the deep neural network, where and are the input values required by the twin-tower model, respectively.
更进一步,2.1的流程如下Further, the process of 2.1 is as follows
2.1.1、双塔模型含有两个深度神经网络模型。神经网络模型的组成单元是感知机,感知机有一个若干个输入和一个输出,其中输出和输入是训练到的一个线性关系,输出的值会通过一个非线性激活函数得到最终结果;2.1.1. The twin tower model contains two deep neural network models. The component unit of the neural network model is the perceptron. The perceptron has several inputs and one output. The output and the input are a linear relationship trained, and the output value will get the final result through a nonlinear activation function;
2.1.2、深度神经网络在神经网络模型上做了扩展,深度神经网络有一个输入层,一个输出层和若干个隐藏层,本发明采用了3层隐藏层,并且层与层之间采用全连接,全连接是指的每一个结点都与上一层的所有结点相连;2.1.2. The deep neural network is extended on the neural network model. The deep neural network has one input layer, one output layer and several hidden layers. Connection, full connection means that each node is connected to all nodes in the previous layer;
2.1.3、将用户特征向量和物品特征向量输入到对应的神经网络,由于模型需要训练,所以输出结果用含有待训练参数θ的函数来表示,输出结果分别为:其中 2.1.3. Input the user feature vector and item feature vector into the corresponding neural network. Since the model needs to be trained, the output result is represented by a function containing the parameter θ to be trained. The output results are: in
2.1.4、将输出结果做L2归一化处理,L2归一化公式如下:2.1.4, will output the result To do L2 normalization, the L2 normalization formula is as follows:
其中γi表示[Y]向量上第i个分量; in γ i represents the i-th component on the [Y] vector;
2.2、通过基于哈希序列的频率估计方法,即利用哈希序列来记录采样的序列号,减少了负采样在每个batch中可能会出现的采样偏差问题,从而优化了损失函数,循环以下步骤;2.2. Through the frequency estimation method based on the hash sequence, that is, using the hash sequence to record the serial number of the sampling, the sampling deviation problem that may occur in each batch of negative sampling is reduced, thereby optimizing the loss function, and looping the following steps ;
更进一步,2.2的流程如下:Going a step further, the process for 2.2 is as follows:
2.2.1、从样本集中随机采取T个样本,表示如下:2.2.1. Take T samples randomly from the sample set, which are expressed as follows:
其中xi表示样本T中的第i个用户特征向量, where x i represents the ith user feature vector in sample T,
yi表示样本T中的第i个物品特征向量,y i represents the i-th item feature vector in the sample T,
ri表示样本T中第i个用户反馈程度,且ri ∈[0,1]。ri represents the feedback degree of the ith user in the sample T, and ri ∈ [0,1].
2.2.2、利用基于哈希序列的频率估计方法计算每个样本中yi的概率pi;2.2.2. Calculate the probability p i of y i in each sample by using the frequency estimation method based on the hash sequence;
所述2.2.2流程如下:The process described in 2.2.2 is as follows:
2.2.2.1、设学习率α,大小为H的数组A和D,哈希函数h;2.2.2.1. Let the learning rate α, the arrays A and D of size H, and the hash function h;
其中哈希函数h可以把每个yi映射到范围在[0,H]的整数。where the hash function h can map each yi to an integer in the range [0, H].
2.2.2.2、对于每一步t=1,2,…,在训练样本数量的一个批次(batch)中的所有物品集合对于每一物品有:2.2.2.2. For each step t = 1, 2, ..., the set of all items in a batch of the number of training samples for each item Have:
2.2.2.2.1、 2.2.2.2.1,
其中←表示赋值,表示yi上次被采样到的时刻,表示yi被采样一次经过的步数;where ← means assignment, represents the last time yi was sampled, Indicates the number of steps that yi is sampled once;
2.2.2.2.2、 2.2.2.2.2,
2.2.2.3、对于每一个采样概率 2.2.2.3, for each sampling probability
2.2.3、计算优化后的损失函数并给出推导过程;2.2.3. Calculate the optimized loss function And give the derivation process;
所述2.2.3流程如下:The process described in 2.2.3 is as follows:
2.2.3.1、给出向量内积公式:2.2.3.1. Give the vector inner product formula:
2.2.3.2、给定一个向量x,从M个物品中得到其中一个的概率可以利用softmax函数计算,其公式如下:2.2.3.2. Given a vector x, get one of the M items The probability of can be calculated using the softmax function, the formula is as follows:
其中;符号后面的θ表示所带的参数,e为自然常数;Among them; θ after the symbol means With the parameters, e is a natural constant;
2.2.3.3、加权对数似然损失函数为:2.2.3.3. The weighted log-likelihood loss function is:
其中T表示2.2.1所述随机采取的T个样本;where T represents T samples randomly taken as described in 2.2.1;
2.2.3.4、本发明利用负采样算法进行计算,负采样算法是采用了一种平滑策略,能够提高低频样本的采样几率。我们在同一batch中采样负样本。给出一个最小batch,为其中batch-softmax函数为:2.2.3.4. The present invention uses a negative sampling algorithm for calculation, and the negative sampling algorithm adopts a smoothing strategy, which can improve the sampling probability of low-frequency samples. We sample negative samples in the same batch. gives a minimum batch, which is The batch-softmax function is:
2.2.3.5、在每一batch中,由于存在幂分布现象,即在每个batch中随机采样负样本,会使热门物品容易被采样到,在损失函数中就过度惩罚了这些热门商品,因此考虑用频率对采样进行修正,其公式如下:2.2.3.5. In each batch, due to the power distribution phenomenon, that is, random sampling of negative samples in each batch will make popular items easy to be sampled, and these popular items will be excessively punished in the loss function, so consider The sampling is corrected by frequency, and the formula is as follows:
sc(xi,yi)=s(xi,yi)-log(pi)s c (x i , y i )=s(x i , y i )-log( pi )
其中sc(xi,yi)表示s(xi,yi)的修正值,pi来自2.2.2所述的基于哈希序列的频率估计算法;where s c ( xi , y i ) represents the correction value of s ( xi , y i ), and pi comes from the frequency estimation algorithm based on hash sequence described in 2.2.2;
2.2.3.6、因此修正后的条件概率函数为:2.2.3.6. Therefore, the modified conditional probability function is:
2.2.3.7、最后损失函数为:2.2.3.7. The final loss function is:
2.2.4、采用梯度下降算法更新参数θ,使其接近最优值。其中梯度下降算法2.2.4. Use the gradient descent algorithm to update the parameter θ to make it close to the optimal value. The gradient descent algorithm
机器学习常用的算法,用于求解模型参数;Algorithms commonly used in machine learning to solve model parameters;
第三步:定义用户历史交互矩阵与知识图谱的实体映射关系;Step 3: Define the entity mapping relationship between the user history interaction matrix and the knowledge graph;
所述第三步流程如下:The third step process is as follows:
3.1、给定一个用户交互矩阵 表示第个用户与第个物品的交互情况,U和V分别表示用户集合和物品集合;3.1. Given a user interaction matrix means the first users and The interaction of each item, U and V represent the user set and the item set, respectively;
用户交互矩阵Y的表达公式如下:The expression formula of the user interaction matrix Y is as follows:
3.2、定义Oi,j表示用户i对物品j的交互情况,将用户i存在交互的物品映射到知识图谱的实体;3.2. Define O i,j to represent the interaction of user i with item j, and map the items that user i interacts with to the knowledge graph entity;
更进一步,所述3.2流程如下:Further, the 3.2 process is as follows:
3.2.1、Oi,j是一个交互矩阵,由用户的行向量组成。Oi,j表示用户i对物品j的交互情况,其中j为物品的索引;3.2.1. O i,j is an interaction matrix consisting of row vectors of users. O i,j represents the interaction of user i with item j, where j is the index of the item;
3.2.2、定义一个HashMap用于存放全部物品,HashMap以key-value键值对的方式存放,这里的key存放物品的索引,value存放物体对应的实体;3.2.2. Define a HashMap to store all items. HashMap is stored in the form of key-value key-value pairs. The key here stores the index of the item, and the value stores the entity corresponding to the object;
3.2.3,从用户交互矩阵O的行向量表示一个用户的交互情况,把这个行向量存放于一个数组,这个数组定义为Ε;3.2.3, from the row vector of the user interaction matrix O to represent the interaction situation of a user, this row vector is stored in an array, and this array is defined as E;
3.2.3、定义一个临时集合temp_set用户存放实体,遍历数组Ε,如果数组里元素的值为1,则根据这个元素的索引访问物品HashMap,获取对应的实体,存放于临时集合temp_set中;3.2.3. Define a temporary set temp_set user to store entities, traverse the array E, if the value of the element in the array is 1, access the item HashMap according to the index of this element, obtain the corresponding entity, and store it in the temporary set temp_set;
第四步:通过偏好实体传播的方式,将每次传播召回到的实体与用户特征输入到优化的双塔模型户偏得出预测概率,筛选概率高的实体,最终得到表示用户偏好和潜在偏好的知识图谱;Step 4: Through the method of preference entity propagation, input the entities and user characteristics recalled by each propagation into the optimized twin-tower model to obtain the predicted probability, screen the entities with high probability, and finally obtain the representation of user preference and potential preference. knowledge graph;
The flow of the fourth step is as follows:
4.1. For a given user, provide the user feature vector v_user described in the first step;
4.2. Initialize the user's temporary set temp_set as described in the third step;
4.3. Define a HashMap of the user's preferences, named user_map, whose key stores the loop count and whose value stores a set of triples;
4.4. Define a set used_set for storing the entities that have already been recalled;
4.5. Let the loop count k = 1, 2, 3, …. Given a maximum value K, exit the loop once the number of triples in user_map exceeds K; each iteration executes the following steps (a consolidated sketch of this loop is given after 4.6):
4.5.1. Traverse the entities in temp_set and remove those already in used_set:

temp_set ← temp_set - used_set

where ← denotes assignment and - denotes the set difference operation;
4.5.2. Find from the knowledge graph G the set S_k of triples whose head entity lies in temp_set, and store it in user_map; S_k is defined as:

S_k = {(h', r', t') | (h', r', t') ∈ G, h' ∈ temp_set}
where (h', r', t') denotes a triple, h' the head entity, r' the relation, and t' the tail entity;
4.5.3. Since the knowledge graph contains cycles, the entities in temp_set must be added to used_set to prevent entities that have already been recalled from being collected again;
4.5.4. Take out the tail entities of the triples in S_k and store them in a set T_k;
4.5.5. Traverse the entities in T_k and take out the corresponding item feature vectors v_item;
4.5.6. Input the parameters v_user and v_item described in the second step into the double-tower model to obtain the probability of the user for each corresponding item in T_k, and sort the entities by this probability;
4.5.7. Given a value τ, 0 < τ ≤ 1, which determines the number of entities kept by the screening:

T'_k = newSet(sortByProb(T_k).getSubVec(1, ⌈τ · |T_k|⌉))

where sortByProb(·) sorts the elements of the set by probability and returns an array, getSubVec(i, j) takes out the sub-array from the i-th to the j-th element of the original array, newSet(·) converts an array into a set, and ⌈·⌉ denotes rounding up;
4.5.8. From S_k, filter out the triples whose tail entity belongs to T'_k:

Λ = {(h', r', t') ∈ S_k | t' ∈ T'_k}
4.5.9. Store the triple set Λ into user_map by calling user_map(k, Λ);
4.5.10. Overwrite temp_set with the set T'_k;
4.6. After 4.5 finishes, the knowledge graph of the user's preferences is finally obtained.
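The loop of 4.5 can be condensed into a minimal Python sketch. It assumes the knowledge graph is given as a list of (head, relation, tail) triples, that every tail entity has an item feature vector, and that two_tower_prob wraps the trained double-tower model of the second step; all names are hypothetical:

```python
import math

def recall_preference_entities(kg_triples, v_user, temp_set,
                               item_vec, two_tower_prob, K, tau):
    """Preference-entity propagation (sketch of steps 4.3-4.6).

    kg_triples: list of (head, relation, tail) knowledge-graph triples.
    v_user: user feature vector from the first step.
    item_vec: dict mapping an entity to its item feature vector v_item.
    two_tower_prob: callable (v_user, v_item) -> predicted probability.
    """
    user_map = {}              # 4.3: loop count k -> set of recalled triples
    used_set = set()           # 4.4: entities already recalled (KG has cycles)
    k, n_triples = 0, 0
    while n_triples <= K and temp_set:
        k += 1
        temp_set -= used_set                        # 4.5.1: drop seen entities
        # 4.5.2: S_k, the triples whose head entity lies in temp_set
        S_k = {t for t in kg_triples if t[0] in temp_set}
        used_set |= temp_set                        # 4.5.3: mark as recalled
        T_k = {t[2] for t in S_k}                   # 4.5.4: tail entities
        # 4.5.5-4.5.6: score tails with the double-tower model, sort descending
        ranked = sorted(T_k, reverse=True,
                        key=lambda e: two_tower_prob(v_user, item_vec[e]))
        keep = math.ceil(tau * len(T_k))            # 4.5.7: keep top tau share
        T_k_top = set(ranked[:keep])
        # 4.5.8: Lambda, the triples whose tail survived the screening
        Lam = {t for t in S_k if t[2] in T_k_top}
        user_map[k] = Lam                           # 4.5.9: store under key k
        n_triples += len(Lam)
        temp_set = T_k_top                          # 4.5.10: seed the next hop
    return user_map                                 # 4.6: the preference KG
```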
Priority Applications (1)

| Application Number | Priority Date | Title |
|---|---|---|
| CN202210169936.1A | 2022-02-23 | A knowledge graph user preference entity recall method based on double tower model |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN114564594A | 2022-05-31 |
| CN114564594B | 2025-02-21 |
Similar Documents

| Publication | Title |
|---|---|
| CN112214685B | Knowledge graph-based personalized recommendation method |
| CN108920641B | Information fusion personalized recommendation method |
| CN110910218B | Multi-behavior migration recommendation method based on deep learning |
| CN111310063B | Neural network-based article recommendation method for memory perception gated factorization machine |
| CN110532471B | Active learning collaborative filtering method based on gated cyclic unit neural network |
| CN114693397B | Attention neural network-based multi-view multi-mode commodity recommendation method |
| CN115062237A | Cultural resource recommendation method based on the combination of graph neural network and knowledge graph |
| CN112488791A | Individualized recommendation method based on knowledge graph convolution algorithm |
| CN113918834B | Graph convolution collaborative filtering recommendation method fusing social relations |
| CN113918833A | Product recommendation method realized through graph convolution collaborative filtering of social network relationship |
| CN113704438B | Conversation recommendation method of abnormal picture based on layered attention mechanism |
| CN111985623A | Attribute graph group discovery method based on maximized mutual information and graph neural network |
| CN117112920A | A collaborative filtering method based on interactive graph convolutional network |
| CN117634600A | A graph self-supervised learning recommendation method based on knowledge awareness |
| CN112818256A | Recommendation method based on neural collaborative filtering |
| CN104063555B | The user model modeling method intelligently distributed towards remote sensing information |
| CN116662564A | A service recommendation method based on deep matrix factorization and knowledge graph |
| CN116069921A | News recommendation method combining activation diffusion theory and Ebbinghaus forgetting theory |
| CN111814048A | Method and device for recommending information |
| CN114817581A | Cross-modal hash retrieval method based on fusion attention mechanism and DenseNet network |
| CN114564594B | A knowledge graph user preference entity recall method based on double tower model |
| CN114564594A | Knowledge graph user preference entity recall method based on double-tower model |
| CN115114528A | A knowledge graph recommendation method integrating GNN and ResNet |
| CN116821519A | Intelligent recommendation method for system filtering and noise reduction based on graph structure |
| CN116308618A | A method for training a recommendation model, a recommendation model, and a product recommendation method |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |