
CN114564594A - Knowledge graph user preference entity recall method based on double-tower model - Google Patents


Info

Publication number
CN114564594A
Authority
CN
China
Prior art keywords
user
embedding
item
representing
vector
Prior art date
Legal status
Granted
Application number
CN202210169936.1A
Other languages
Chinese (zh)
Other versions
CN114564594B (en)
Inventor
陆佳炜
吴俚达
程振波
韦航俊
朱昊天
方静雯
徐俊
肖刚
Current Assignee
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202210169936.1A
Publication of CN114564594A
Application granted
Publication of CN114564594B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for recalling user-preference entities from a knowledge graph based on a two-tower model. An optimization method is added to the traditional two-tower model to better learn the interactions between users and items, and the trained model is used to recall entities related to a user's preferences from the knowledge graph. First, the entities corresponding to the items in the user's history are used as starting points, and all neighboring entities are retrieved along the graph's edges. The recalled entities are then screened by the trained, optimized two-tower model. Finally, the recalled entities serve as new starting points and the above operations are repeated. The result is a knowledge graph that represents the user's explicit and potential preferences.

Description

A Method for Recalling User-Preference Entities from a Knowledge Graph Based on a Two-Tower Model

Technical Field

The invention relates to the technical fields of knowledge graphs and deep learning, and in particular to a method for recalling user-preference entities from a knowledge graph based on a two-tower model.

Background

The knowledge graph is a concept proposed by Google in 2012: a knowledge base that Google uses to enhance its search engine. In essence, a knowledge graph describes the entities or concepts that exist in the real world and the relationships between them, forming a huge semantic network in which nodes represent entities or concepts and edges represent attributes or relations. Each piece of knowledge is expressed as a triple (h, r, t), where h is the head entity, t is the tail entity, and r is the relation between them. With their powerful semantic processing and open organization capabilities, knowledge graphs play an important role in recommender systems, intelligent question answering, and information retrieval, laying a foundation for knowledge organization and intelligent applications in the Internet era.

Traditional recommender systems use explicit or implicit information as input to make predictions and suffer from two main problems. The first is sparsity: in real scenarios, user-item interaction data is often very sparse, and predicting a large amount of unknown information from so few observations greatly increases the risk of overfitting. The second is the cold-start problem: newly added users or items have no historical information, making accurate modeling and recommendation difficult.

Knowledge graphs contain rich semantic associations between entities and thus provide a potential source of auxiliary information for recommender systems. They introduce additional semantic relations between items, allowing user interests to be discovered at a deeper level, and the variety of relation types in the graph helps diversify recommendation results. A knowledge graph can also connect a user's history to the recommended results, improving the user's satisfaction with and acceptance of the recommendations and strengthening the user's trust in the system.

Existing methods for applying knowledge graphs to recommendation fall into two main categories. Embedding-based methods use a knowledge graph embedding algorithm to learn vector representations of the entities and relations in the graph and then introduce these vectors into the recommendation framework; examples include DKN (Deep Knowledge-aware Network), based on convolutional neural networks, and CKE (Collaborative Knowledge base Embedding), based on collaborative knowledge base embedding. Although embedding-based methods are very flexible, they are usually suited to in-graph link prediction, whereas recommendation scenarios call for mining users' latent interests. Path-based methods instead explore the various connections between entities in the knowledge graph to provide additional guidance for the recommender system; examples include Personalized Entity Recommendation and Meta-Graph Based Recommendation. Although path-based methods use the knowledge graph in a more natural and intuitive way, they rely heavily on manually designed meta-paths, which are difficult to optimize in practice.

Summary of the Invention

To address the problems in the prior art, the present invention provides a method for recalling user-preference entities from a knowledge graph based on a two-tower model. An optimization method is added to the traditional two-tower model to better learn the interactions between users and items, and the trained model is used to recall entities related to a user's preferences from the knowledge graph. First, the entities corresponding to the items in the user's history are used as starting points, and all neighboring entities are retrieved along the graph's edges. The recalled entities are then screened by the trained, optimized two-tower model. Finally, the recalled entities serve as new starting points and the above operations are repeated, ultimately yielding a knowledge graph that represents the user's explicit and potential preferences.

The technical scheme adopted by the present invention is as follows:

A method for recalling user-preference entities from a knowledge graph based on a two-tower model, comprising the following steps:

1. Define the user feature vector and the item feature vector as the inputs of the two-tower model.

2. Train the two-tower model, optimizing it with an in-batch softmax loss combined with a hash-based frequency estimation method.

3. Define the mapping between the user's historical interaction matrix and the entities of the knowledge graph.

4. Propagate preference entities: at each propagation step, feed the recalled entities together with the user features into the optimized two-tower model to obtain predicted probabilities, keep the entities with high probability, and finally obtain a knowledge graph representing the user's explicit and potential preferences.

The process of step 1 is as follows:

1.1. User features describe the user's interactions with items, including click records, search records, social data, personal data, and sample age. The user feature vector is obtained by converting these interaction data into vectors and concatenating them. Converting raw data into a vector is called embedding, a method commonly used in machine learning to represent data features: the raw data is mapped, for example through a neural network, to a low-dimensional vector that captures its features;

In more detail, the process of 1.1 is as follows:

1.1.1. The embedding of the user's click record is the weighted average of the id embeddings of all clicked items, where an id embedding maps an item's unique identifier to a vector of fixed dimension and each weight is proportional to the time spent browsing the item. The click-record embedding is computed as follows:

v_click = Σ_{i=1}^{n} w_i · v_click,i

where v_click denotes the embedding of the user's click record, w_i the i-th weight, v_click,i the id embedding of the i-th item in the click record, and n the number of embeddings. The weight w_i is proportional to browsing time and is computed by a formula [given in the original only as an image] in which t_i denotes the time the user spent browsing item i, N the total number of samples, and k the total number of positive examples;

1.1.2. The embedding of the user's search record is built from the terms obtained by segmenting the keywords of historical searches: each term's embedding is obtained with a Word2vec model, and the term embeddings of the search record are then combined by a weighted average.

Word segmentation is the technique by which a search engine splits the keyword string of a user-submitted search into separate term tokens.

The Word2vec model, proposed by Mikolov et al. in 2013, converts the content words of a text into space vectors; the value of each word vector is influenced by its context and thus encodes the associations between words.

The embedding of the user's search record is computed as follows:

v_search = Σ_{i=1}^{n} w_i · v_search,i

where v_search denotes the embedding of the user's search record, w_i the i-th weight, v_search,i the embedding of the i-th term in the search record, and n the number of embeddings.

The weight of each term embedding is given by a formula [given in the original only as an image] that depends on the validity of the search, where a search is judged valid if the user clicked an item after searching;
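As an illustration only (not part of the patent), the term embeddings described in 1.1.2 could be obtained with the gensim implementation of Word2vec; the corpus and parameters below are placeholders:

# A minimal sketch (assumed corpus and parameters): train Word2vec on
# segmented search queries and look up one term embedding.
from gensim.models import Word2Vec

# Each query has already been segmented into term tokens (placeholder corpus).
queries = [["wireless", "earbuds"], ["noise", "cancelling", "earbuds"]]

model = Word2Vec(sentences=queries, vector_size=64, window=5, min_count=1)
v_term = model.wv["earbuds"]  # 64-dimensional term embedding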

1.1.3. The embedding of the user's social data is the weighted average of the embeddings of favorite, like, and subscription data. The embeddings for favorite and like data are the id embeddings of the items the user has favorited or liked; the embedding for subscription data is the id embedding of the creator responsible for the items the user subscribes to.

The embedding of the user's social data is computed as follows:

v_social = Σ_{i=1}^{n} w_i · v_social,i

where v_social denotes the embedding of the user's social data, w_i the i-th weight, and v_social,i the embedding of the i-th social-data item.

For favorites and likes, the weight is given by a formula [given in the original only as an image] in which t_i denotes the time the user spent browsing item i, N the total number of samples, and k the total number of positive examples. For subscriptions, the weight is given by a similar formula [also given only as an image] in which t_i denotes the browsing time of the i-th item of the subscribed creator;

1.1.4. The user's personal data comprise the user's gender, age, and region. Gender is a simple binary feature, while age and region are continuous features that can be normalized to real values. The embedding of the user's personal data is the vector obtained by concatenating the processed gender, age, and region values;

In more detail, the process of 1.1.4 is as follows:

1.1.4.1. Compute the binary representation of the user's gender:

gender ∈ {0, 1}  [the exact case definition is given in the original only as an image]

1.1.4.2. Compute normalized real values for the user's age and region using the following normalization formula:

z = (X − μ) / σ

where X denotes the sample value, μ the mean of all sample data, and σ the standard deviation of all sample data;

1.1.4.3. Concatenate the binary gender value and the normalized age and region values of 1.1.4 into a single vector:

v_personal = [gender, z_age, z_region]

where v_personal denotes the personal-data vector, gender the user's gender, and z_age and z_region the normalized values of the user's age and region, respectively;

1.1.5. Concatenate the click-record embedding, search-record embedding, social-data embedding, and personal-data embedding described in 1.1 to obtain the user feature vector:

v_user = concatenate(v_click, v_search, v_social, v_personal)
       = [v_click[1], v_click[2], …, v_search[1], v_search[2], …, v_social[1], v_social[2], …, v_personal[1], v_personal[2], …]

where v_user denotes the user feature vector, v_click[i] the i-th component of the click-record embedding, v_search[i] the i-th component of the search-record embedding, v_social[i] the i-th component of the social-data embedding, and v_personal[i] the i-th component of the personal-data embedding;

1.2. Item features comprise the item's id and its context information; the item feature vector is formed by concatenating the item's id embedding with the embedding of its context information;

In more detail, the process of 1.2 is as follows:

1.2.1. The item's id embedding is a vector that maps the item's unique identifier to a fixed dimension;

1.2.2. The item's context-information embedding is the vector obtained by passing the context information through Word2vec;

1.2.3. Concatenate the id embedding and the context-information embedding of 1.2 to obtain the item feature vector:

v_item = concatenate(v_id, v_context) = [v_id[1], v_id[2], …, v_context[1], v_context[2], …]

where v_item denotes the item feature vector, v_id the item's id embedding, v_context the embedding of the item's context information, v_id[i] the i-th component of the id embedding, and v_context[i] the i-th component of the context-information embedding.
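To make the feature construction of step 1 concrete, here is a minimal Python sketch; the dimensions, weights, and random embeddings are placeholders, and the image-only weight formulas of the original are replaced by simple normalized weights:

# A minimal sketch of sections 1.1 and 1.2 (assumed toy data).
import numpy as np

def weighted_average(embeddings, weights):
    # Weighted average of equal-length embedding vectors (weights normalized).
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return (w[:, None] * np.asarray(embeddings)).sum(axis=0)

# Placeholder 4-dimensional id/term embeddings and browsing-time weights.
v_click  = weighted_average([np.random.rand(4) for _ in range(3)], [12.0, 5.0, 30.0])
v_search = weighted_average([np.random.rand(4) for _ in range(2)], [1.0, 1.0])
v_social = weighted_average([np.random.rand(4) for _ in range(2)], [8.0, 2.0])
v_personal = np.array([1.0, 0.3, -0.7])  # [gender, z_age, z_region]

# Section 1.1.5: concatenate the four parts into the user feature vector.
v_user = np.concatenate([v_click, v_search, v_social, v_personal])  # 15-dim

# Section 1.2.3: item feature vector = id embedding ++ context embedding.
v_id, v_context = np.random.rand(4), np.random.rand(4)
v_item = np.concatenate([v_id, v_context])  # 8-dim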

In step 2, the two-tower model is derived from DSSM (Deep Structured Semantic Models). DSSM is a deep structured semantic model that, in natural language processing, is commonly used to measure semantic similarity. From top to bottom the two-tower model consists of an input layer, a representation layer, and a matching layer; horizontally it comprises two input layers, two representation layers, and one matching layer. The outputs of the two input layers feed the two representation layers, whose outputs converge at the matching layer, giving the model its "two tower" shape. In the present invention, the two inputs of the input layer are the user feature vector and the item feature vector; the two representation layers have the same neural network structure, and each feature vector is mapped by its network to a vector of the same dimension. Finally, the two vectors are L2-normalized and their inner product is taken. Furthermore, the optimized two-tower model of the present invention applies a hash-based frequency estimation method, which reduces the sampling bias that negative sampling can introduce within each batch and thereby improves the loss function.
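A minimal PyTorch sketch of this architecture follows; the hidden-layer sizes and output dimension are assumptions, as the text specifies only three fully connected hidden layers, L2-normalized outputs, and inner-product matching:

# A minimal sketch of the two-tower model (assumed layer sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

def tower(in_dim, out_dim):
    # One fully connected tower with three hidden layers.
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, out_dim),
    )

class TwoTowerModel(nn.Module):
    def __init__(self, user_dim, item_dim, embed_dim=32):
        super().__init__()
        self.user_tower = tower(user_dim, embed_dim)
        self.item_tower = tower(item_dim, embed_dim)

    def forward(self, v_user, v_item):
        u = F.normalize(self.user_tower(v_user), p=2, dim=-1)  # L2 normalization
        v = F.normalize(self.item_tower(v_item), p=2, dim=-1)
        return (u * v).sum(dim=-1)  # inner product s(x, y)

# Example: score a batch of 8 user-item pairs.
model = TwoTowerModel(user_dim=15, item_dim=8)
scores = model(torch.randn(8, 15), torch.randn(8, 8))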

The process of step 2 is as follows:

2.1. Two parameterized embedding functions u(x; θ) and v(y; θ) are given, each mapping into R^d, where R^d denotes the space of d-dimensional real vectors. They are the user representation and the item representation extracted by the deep neural networks, and their inputs x and y are, respectively, the user feature vector and the item feature vector required by the two towers.

In more detail, the process of 2.1 is as follows:

2.1.1. The two-tower model contains two deep neural network models. The basic unit of a neural network is the perceptron, which has several inputs and one output; the output is a learned linear function of the inputs, and its value is passed through a nonlinear activation function to produce the final result;

2.1.2. A deep neural network extends the basic neural network model: it has one input layer, one output layer, and several hidden layers. The present invention uses three hidden layers, with full connections between layers; full connection means every node is connected to all nodes of the previous layer;

2.1.3. Feed the user feature vector and the item feature vector into their respective networks. Since the model is to be trained, each output is expressed as a function of the trainable parameters θ:

u = u(x; θ),  v = v(y; θ);

2.1.4. L2-normalize the outputs u and v. For a vector Y = (γ_1, …, γ_d), the L2 normalization is:

γ_i ← γ_i / sqrt(Σ_{j=1}^{d} γ_j^2)

where γ_i denotes the i-th component of the vector Y;

2.2. The hash-based frequency estimation method uses hash arrays to record the steps at which items are sampled, reducing the sampling bias that negative sampling can introduce within each batch and thereby improving the loss function; the following steps are executed in a loop;

In more detail, the process of 2.2 is as follows:

2.2.1. Randomly draw T samples from the sample set, written as:

{(x_i, y_i, r_i)}, i = 1, …, T

where x_i denotes the i-th user feature vector in the sample, y_i the i-th item feature vector, and r_i ∈ [0, 1] the feedback strength of the i-th user.

2.2.2. Use the hash-based frequency estimation method to compute the sampling probability p_i of each y_i;

The process of 2.2.2 is as follows:

2.2.2.1. Set a learning rate α, two arrays A and D of size H, and a hash function h;

where the hash function h maps each y_i to an integer in the range [0, H].

2.2.2.2. At each step t = 1, 2, …, for every item y_i in the set of all items appearing in one batch of training samples:

2.2.2.2.1. D[h(y_i)] ← (1 − α) · D[h(y_i)] + α · (t − A[h(y_i)])

where ← denotes assignment, A[h(y_i)] records the step at which y_i was last sampled, and D[h(y_i)] estimates the number of steps between two successive samplings of y_i;

2.2.2.2.2. A[h(y_i)] ← t;

2.2.2.3. For each y_i, the sampling probability is p_i = 1 / D[h(y_i)];
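A short Python sketch of this estimator follows; the array size, learning rate, initial values, and use of Python's built-in hash are assumptions for illustration:

# A minimal sketch of the hash-based frequency estimation of 2.2.2.
# D[h(y)] tracks the average gap (in steps) between samplings of y,
# so 1 / D[h(y)] estimates y's per-step sampling probability.
H, alpha = 1024, 0.05
A = [0.0] * H  # step at which each slot was last sampled
D = [1.0] * H  # estimated gap between samplings (assumed initial value)

def update(batch_item_ids, t):
    for item_id in batch_item_ids:
        slot = hash(item_id) % H
        D[slot] = (1 - alpha) * D[slot] + alpha * (t - A[slot])  # 2.2.2.2.1
        A[slot] = t                                              # 2.2.2.2.2

def sampling_prob(item_id):
    return 1.0 / D[hash(item_id) % H]                            # 2.2.2.3

# Example: item 42 appears every step, item 7 every 20 steps.
for t in range(1, 101):
    update([42] + ([7] if t % 20 == 0 else []), t)
print(sampling_prob(42), sampling_prob(7))  # ~1.0 vs a value approaching 1/20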

2.2.3. Compute the optimized loss function L_B(θ); the derivation is given below;

The process of 2.2.3 is as follows:

2.2.3.1. The score is given by the vector inner product:

s(x, y) = ⟨u(x; θ), v(y; θ)⟩

2.2.3.2. Given a vector x, the probability of obtaining one item y among M items can be computed with the softmax function:

P(y | x; θ) = e^{s(x, y)} / Σ_{j=1}^{M} e^{s(x, y_j)}

where the θ after the semicolon denotes the parameters of u and v, and e is the natural constant;

2.2.3.3. The weighted log-likelihood loss function is:

L_T(θ) = −(1/T) · Σ_{i=1}^{T} r_i · log P(y_i | x_i; θ)

where T denotes the T samples randomly drawn in 2.2.1;

2.2.3.4. The invention computes the loss with a negative sampling algorithm, which adopts a smoothing strategy that raises the sampling probability of low-frequency samples. Negative samples are drawn within the same batch. Given a mini-batch {(x_i, y_i, r_i)}, i ∈ [B], the batch-softmax function is:

P_B(y_i | x_i; θ) = e^{s(x_i, y_i)} / Σ_{j∈[B]} e^{s(x_i, y_j)}

2.2.3.5. Within each batch, item frequencies follow a power-law distribution, so negatives sampled at random within a batch tend to be popular items, which the loss function would then over-penalize. The score is therefore corrected with the estimated sampling frequency:

s^c(x_i, y_i) = s(x_i, y_i) − log(p_i)

where s^c(x_i, y_i) denotes the corrected value of s(x_i, y_i), and p_i comes from the hash-based frequency estimation algorithm of 2.2.2;

2.2.3.6. The corrected conditional probability function is therefore:

P_B^c(y_i | x_i; θ) = e^{s^c(x_i, y_i)} / (e^{s^c(x_i, y_i)} + Σ_{j∈[B], j≠i} e^{s^c(x_i, y_j)})

2.2.3.7. The final loss function is:

L_B(θ) = −(1/B) · Σ_{i∈[B]} r_i · log P_B^c(y_i | x_i; θ)
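The corrected in-batch softmax loss can be sketched as follows; u and v are the L2-normalized tower outputs for one batch, and the probabilities p and weights r below are placeholders:

# A minimal sketch of the bias-corrected in-batch softmax loss of 2.2.3.
import torch

def corrected_in_batch_softmax_loss(u, v, p, r):
    s = u @ v.T                              # s(x_i, y_j) for all pairs: [B, B]
    s_c = s - torch.log(p).unsqueeze(0)      # 2.2.3.5: subtract log p_j per column
    log_probs = torch.log_softmax(s_c, dim=1).diagonal()  # log P_B^c(y_i | x_i)
    return -(r * log_probs).mean()           # 2.2.3.7

B, d = 8, 32
u = torch.nn.functional.normalize(torch.randn(B, d), dim=1)
v = torch.nn.functional.normalize(torch.randn(B, d), dim=1)
p = torch.full((B,), 1.0 / 50)  # placeholder sampling probabilities
r = torch.ones(B)               # placeholder feedback weights
loss = corrected_in_batch_softmax_loss(u, v, p, r)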

2.2.4. Update the parameter θ with the gradient descent algorithm so that it approaches the optimum. Gradient descent is an algorithm commonly used in machine learning to solve for model parameters;
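A gradient-descent training step, reusing the TwoTowerModel and the corrected loss sketched above (hyperparameters are assumptions), might look like this:

# A minimal sketch of the update of 2.2.4: θ ← θ − lr · ∇θ L_B(θ).
import torch
import torch.nn.functional as F

model = TwoTowerModel(user_dim=15, item_dim=8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    x, y = torch.randn(8, 15), torch.randn(8, 8)  # placeholder batch
    u = F.normalize(model.user_tower(x), dim=1)
    v = F.normalize(model.item_tower(y), dim=1)
    p = torch.full((8,), 1.0 / 50)  # from the frequency estimator above
    r = torch.ones(8)
    loss = corrected_in_batch_softmax_loss(u, v, p, r)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()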

The process of step 3 is as follows:

3.1. A user interaction matrix Y = {y_uv | u ∈ U, v ∈ V} is given, where y_uv describes the interaction between the u-th user and the v-th item, and U and V denote the user set and the item set, respectively;

The user interaction matrix Y is defined by:

y_uv = 1 if an interaction between user u and item v is observed, and y_uv = 0 otherwise;

3.2. Define O_{i,j} as the interaction of user i with item j, and map the items user i has interacted with to entities of the knowledge graph G;

In more detail, the process of 3.2 is as follows:

3.2.1. O is an interaction matrix composed of the users' row vectors; O_{i,j} describes the interaction of user i with item j, where j is the item index;

3.2.2. Define a HashMap to store all items as key-value pairs: the key stores the item's index and the value stores the entity corresponding to the item;

3.2.3. A row vector of the user interaction matrix O describes one user's interactions; store this row vector in an array, denoted E;

3.2.4. Define a temporary set temp_set for storing entities. Traverse the array E; whenever an element's value is 1, use the element's index to access the item HashMap, fetch the corresponding entity, and store it in temp_set, as in the sketch below;
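A minimal Python sketch of this mapping (placeholder data):

# Section 3.2: map one user's interaction row to knowledge-graph entities.
item_to_entity = {0: "entity/FilmA", 1: "entity/FilmB", 2: "entity/FilmC"}

E = [1, 0, 1]  # row of O: the user interacted with items 0 and 2

temp_set = {item_to_entity[j] for j, flag in enumerate(E) if flag == 1}
# temp_set == {"entity/FilmA", "entity/FilmC"}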

The process of step 4 is as follows:

4.1. For a given user, take the user feature vector v_user defined in step 1;

4.2. Initialize the user's temporary set temp_set as in step 3;

4.3. Define a HashMap of the user's preferences, named user_map; its key stores the iteration number and its value stores a set of triples;

4.4. Define a set used_set for storing entities that have already been recalled;

4.5. For iteration k = 1, 2, 3, …, given a maximum value K, exit the loop when the number of triples in user_map exceeds K; each iteration performs the following steps:

4.5.1. Traverse the entities in temp_set and remove those already in used_set:

temp_set ← temp_set − used_set

where ← denotes assignment and − the set-difference operation;

4.5.2. Find the triple set ε_k in the knowledge graph and store it in user_map; it is defined as:

ε_k = {(h′, r′, t′) | (h′, r′, t′) ∈ G, h′ ∈ temp_set}

where (h′, r′, t′) denotes a triple, h′ the head entity, r′ the relation, and t′ the tail entity;

4.5.3. Because the knowledge graph contains cycles, the entities in temp_set are added to used_set to prevent already-recalled entities from being collected again;

4.5.4. Take the tail entities of the triples in ε_k and store them in a set ε′_k;

4.5.5. Traverse the entities in ε′_k and fetch the corresponding item feature vectors v_item;

4.5.6. Feed v_user and each v_item, the inputs described in step 2, into the two-tower model to obtain the probability that the user prefers the item corresponding to each entity in ε′_k, and sort the entities by this probability;

4.5.7. Given a value τ, 0 < τ ≤ 1, which determines the number of entities to keep, compute the kept set Γ:

Γ = newSet(sortByProb(ε′_k).getSubVec(1, ⌈τ · |ε′_k|⌉))

where sortByProb(·) sorts the elements of the set by probability and returns an array, getSubVec(i, j) extracts the sub-array from the i-th to the j-th element of the original array, newSet(·) converts an array into a set, and ⌈·⌉ denotes rounding up;

4.5.8. From ε_k, select the triples whose tail entity belongs to Γ:

Λ = {(h′, r′, t′) | (h′, r′, t′) ∈ ε_k, t′ ∈ Γ}

4.5.9. Store the triple set Λ in user_map via user_map(k, Λ);

4.5.10. Overwrite temp_set with Γ, so that the kept entities become the starting points of the next iteration;

4.6. After 4.5 completes, the knowledge graph of the user's preferences is finally obtained.
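The whole propagation loop of step 4 can be sketched as follows; kg is assumed to be a list of (h, r, t) triples and score(entity) a wrapper around the trained two-tower model for the fixed user:

# A minimal sketch of the preference-entity propagation of step 4.
import math

def propagate(kg, seed_entities, score, tau=0.5, K=100):
    temp_set, used_set, user_map = set(seed_entities), set(), {}
    k, total = 1, 0
    while total <= K and temp_set:
        temp_set -= used_set                                    # 4.5.1
        triples = [tr for tr in kg if tr[0] in temp_set]        # 4.5.2
        used_set |= temp_set                                    # 4.5.3
        tails = {tr[2] for tr in triples}                       # 4.5.4
        ranked = sorted(tails, key=score, reverse=True)         # 4.5.5-4.5.6
        kept = set(ranked[: math.ceil(tau * len(ranked))])      # 4.5.7
        user_map[k] = [tr for tr in triples if tr[2] in kept]   # 4.5.8-4.5.9
        total += len(user_map[k])
        temp_set, k = kept, k + 1                               # 4.5.10
    return user_map

# Example: user_map = propagate(kg, temp_set, lambda e: model_prob(v_user, e))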

The beneficial effects of the present invention are as follows: the optimized two-tower model screens the user-preference entities in the knowledge graph. The model is optimized with a hash-based frequency estimation method, allowing items to adapt better to a variety of data distributions. Screening the knowledge-graph entities not only yields higher-quality data, with recalled entities genuinely close to the user's preferences, but also enables deep recall over the knowledge graph: the number of entities recalled in each round can explode with the cardinality of the previous round, and without screening this would hurt computational efficiency and the exploration of the user's potential preferences.

Detailed Description

The present invention is further described below in conjunction with an embodiment.

实施例:Example:

第一步:定义用户特征向量和物品特征向量,作为双塔模型的输入;The first step: define the user feature vector and item feature vector as the input of the twin tower model;

所述第一步的过程如下:The process of the first step is as follows:

1.1、用户特征指的是用户对物品的交互行为,包括点击记录,搜索记录,社交数据,个人数据和样本年龄,用户特征向量是将上述交互数据转化为向量并做拼接(concatenate)。其中将原始数据转化为向量的方式称为向量嵌入(embedding),向量嵌入(embedding)是机器学习中常用的表示数据特征的方法,目的是将原始数据提取出特征,也就是通过神经网络映射之后的低维向量;1.1. User feature refers to the user's interaction behavior with items, including click records, search records, social data, personal data and sample age. User feature vector is to convert the above interaction data into vectors and concatenate them. The method of converting raw data into vectors is called vector embedding. Vector embedding is a method commonly used in machine learning to represent data features. The purpose is to extract features from raw data, that is, after mapping through neural networks. The low-dimensional vector of ;

更进一步,所述1.1的流程如下Further, the process of 1.1 is as follows

1.1.1、用户点击记录的embedding,是所有点击物品的id类embedding的加权平均,其中id类embedding是将物品唯一标识符映射到同一维度的向量,其权重与浏览物品时间成正比。其用户点击记录的embedding计算公式如下:1.1.1. The embedding of the user click record is the weighted average of the id-type embeddings of all clicked items. The id-type embedding is a vector that maps the unique identifier of the item to the same dimension, and its weight is proportional to the browsing time of the item. The embedding calculation formula of its user click record is as follows:

Figure BDA0003517220730000111
Figure BDA0003517220730000111

其中vclick表示用户点击记录的embedding,

Figure BDA0003517220730000112
表示第i个权重,vclick,i表示点击记录中第i个物品的id类embedding,n表示embedding的个数;其中,
Figure BDA0003517220730000113
可通过如下公式计算:Where v click represents the embedding of the user click record,
Figure BDA0003517220730000112
represents the ith weight, v click, i represents the id class embedding of the ith item in the click record, and n represents the number of embeddings; among them,
Figure BDA0003517220730000113
It can be calculated by the following formula:

Figure BDA0003517220730000114
Figure BDA0003517220730000114

其中

Figure BDA0003517220730000115
表示用户对物品i浏览的时间,N表示样本总数,k表示正例总数;in
Figure BDA0003517220730000115
Represents the time that the user browses the item i, N represents the total number of samples, and k represents the total number of positive examples;

1.1.2、用户搜索记录的embedding是历史搜索的关键词进行分词得到词条。分词的过程是通过Word2vec模型得到对应词条的embedding,然后将用户搜索记录的embedding进行加权平均。1.1.2. The embedding of the user's search record is the keyword of the historical search for word segmentation to obtain the entry. The process of word segmentation is to obtain the embedding of the corresponding entry through the Word2vec model, and then perform a weighted average of the embeddings of the user search records.

其中分词是搜索引擎针对用户提交搜索的关键词串进行切分成不同词条token的技术。The word segmentation is a technology in which the search engine divides the keyword string submitted by the user into different entry tokens.

Word2vec模型由Mikolov等人于2013年提出,该模型将文本中的内容词汇通过转换处理,化简为空间向量,词向量的数值受上下文的影响,蕴含了词与词之间相互的关联性。The Word2vec model was proposed by Mikolov et al. in 2013. This model converts the content words in the text and simplifies them into space vectors. The value of the word vector is affected by the context, which implies the correlation between words.

其用户搜索记录的embedding的计算公式如下:The calculation formula of the embedding of its user search records is as follows:

Figure BDA0003517220730000116
Figure BDA0003517220730000116

其中vsearch表示用户搜索记录的embedding,

Figure BDA0003517220730000117
表示第i个权重,vsearch,i表示搜索记录中第i个词条的embedding,n表示embedding的个数;where v search represents the embedding of the user search record,
Figure BDA0003517220730000117
Represents the ith weight, v search, i represents the embedding of the ith entry in the search record, and n represents the number of embeddings;

搜索记录的embedding的权重计算:The weight calculation of the embedding of the search record:

Figure BDA0003517220730000121
Figure BDA0003517220730000121

其中搜索的有效性判断为用户是否在搜索后点击物品;The validity of the search is judged as whether the user clicks the item after the search;

1.1.3、用户的社交数据包括收藏、点赞和订阅数据对应的embedding加权平均。其中收藏和点赞数据对应的embedding指的是用户收藏和点赞的物品id类的embedding;订阅数据对应的embedding指的是用户订阅物品对应的负责人的id类的embedding。1.1.3. The user's social data includes the embedding weighted average of favorites, likes, and subscription data. The embedding corresponding to the collection and like data refers to the embedding of the id category of the items that the user has favorited and liked; the embedding corresponding to the subscription data refers to the embedding of the id category of the person in charge corresponding to the user subscribed items.

其用户社交数据的embedding的计算公式如下:The calculation formula of the embedding of its user social data is as follows:

Figure BDA0003517220730000122
Figure BDA0003517220730000122

其中vsocial表示用户搜索记录的embedding,

Figure BDA0003517220730000123
表示第i个权重,vsocial,i表示搜索记录中第i个社交数据的embedding;where v social represents the embedding of user search records,
Figure BDA0003517220730000123
represents the ith weight, v social,i represents the embedding of the ith social data in the search record;

对于收藏和点赞的embedding的权重计算:For the weight calculation of the favorite and liked embeddings:

Figure BDA0003517220730000124
Figure BDA0003517220730000124

其中

Figure BDA0003517220730000125
表示用户对物品i浏览的时间,N表示样本总数,k表示正例总数;in
Figure BDA0003517220730000125
Represents the time that the user browses the item i, N represents the total number of samples, and k represents the total number of positive examples;

对于订阅的embedding权重计算:For subscription embedding weight calculation:

Figure BDA0003517220730000126
Figure BDA0003517220730000126

其中

Figure BDA0003517220730000127
示被订阅者第i个物品的浏览时间,N表示样本总数,k表示正例总数;in
Figure BDA0003517220730000127
Indicates the browsing time of the ith item of the subscriber, N represents the total number of samples, and k represents the total number of positive examples;

1.1.4、用户的个人数据包括用户的性别、年龄和地域。其中性别是简单的二元特征,年龄和地域属于连续型特征,可以将其归一化为[0,1]区间上的实数值。用户个人数据的embedding,是将处理过的性别、年龄和地域的值做拼接操作后得到的向量;1.1.4. The user's personal data includes the user's gender, age and region. Among them, gender is a simple binary feature, and age and region are continuous features, which can be normalized to real values in the [0,1] interval. The embedding of the user's personal data is a vector obtained by splicing the processed values of gender, age and region;

更进一步,1.1.4的流程如下:Going a step further, the process for 1.1.4 is as follows:

1.1.4.1、计算用户性别的二元表示,其公式如下:1.1.4.1. Calculate the binary representation of user gender, the formula is as follows:

Figure BDA0003517220730000131
Figure BDA0003517220730000131

1.1.4.2、计算用户的年龄和地域的归一化实数值,其归一化公式如下:1.1.4.2. Calculate the normalized real value of the user's age and region. The normalization formula is as follows:

Figure BDA0003517220730000132
Figure BDA0003517220730000132

其中X表示样本数值,μ为所有样本数据的均值,σ为所有样本数据的标准差;where X represents the sample value, μ is the mean of all sample data, and σ is the standard deviation of all sample data;

1.1.4.3、将1.1.4所述的性别二元值,年龄和地域归一化实数值做拼接操作得到一个向量,这个向量拼接操作公式如下:1.1.4.3. Perform a splicing operation on the gender binary value, age and region normalized real value described in 1.1.4 to obtain a vector. The vector splicing operation formula is as follows:

vpersonal=[gender,zage,zregion]v personal = [gender, z age , z region ]

其中vpersonal表示用户特征向量,gender表示用户性别,zage和zregion分别表示用户的年龄和地域的归一化值;where v personal represents the user feature vector, gender represents the user’s gender, and z age and z region represent the normalized value of the user’s age and region, respectively;

1.1.5、将1.1流程所述的用户点击记录的embedding,用户搜索记录的embedding,用户交互数据的embedding,用户个人数据的embedding做concatenate连接操作得到用户特征向量,其公式如下:1.1.5. The user feature vector is obtained by performing the concatenate connection operation on the embedding of the user click record, the embedding of the user search record, the embedding of the user interaction data, and the embedding of the user's personal data described in the process of 1.1. The formula is as follows:

vuser=concatenate(vclick,vsearch,vsocial,vpersonal)v user =concatenate(v click ,v search ,v social ,v personal )

=[vclick[1],vclick[2],…,vsearch[1],vsearch[2],…,vsocial[1],vsocial[2],…,vpersonal[1],vpersonal[2],…]=[v click [1],v click [2],…,v search [1],v search [2],…,v social [1],v social [2],…,v personal [1], v personal [2],…]

其中vuser表示用户特征向量,vclick[i]表示用户点击embedding的第i个分量,vsearch[i]表示用户搜索记录embedding的第i个分量,vsocial[i]表示用户社交数据embedding的第i个分量,vpersonal[i]表示用户个人数据embedding的第i个分量;where v user represents the user feature vector, v click [i] represents the ith component of the user click embedding, v search [i] represents the ith component of the user search record embedding, and v social [i] represents the user's social data embedding The i-th component, v personal [i] represents the i-th component of the user's personal data embedding;

1.2、物品特征包括物品的id及其上下文信息,物品特征向量由物品的id类embedding于其上下文信息的embedding拼接而成;1.2. Item features include the id of the item and its context information, and the item feature vector is formed by splicing the id class embedding of the item and the embedding of its context information;

更进一步,1.2的流程如下:Going a step further, the process for 1.2 is as follows:

1.2.1、给出物品的id类embedding,是将物品唯一标识符映射到同一维度的向量;1.2.1. The id class embedding of the item is given, which is a vector that maps the unique identifier of the item to the same dimension;

1.2.2、给出物品的上下文信息embedding,是将上下文信息通过Word2vec得到的向量;1.2.2. The context information embedding of the item is given, which is a vector obtained by passing the context information through Word2vec;

1.2.3、将1.2所述的id类embedding和上下文信息embedding做concatenate连接操作得到物品特征向量,其公式如下:1.2.3. Perform the concatenate connection operation of the id class embedding and the context information embedding described in 1.2 to obtain the item feature vector. The formula is as follows:

vitem=concatenate(vid,vcontext)=[vid[1],vid[2],…,vcontext[1],vcontext[2],…]v item =concatenate(v id ,v context )=[v id [1],v id [2],…,v context [1],v context [2],…]

其中vitem表示物品特征向量,vid表示物品的id类embedding,vcontext表示物品上下文信息的embedding,vid[i]表示物品的id类embedding的第i个分量,vcontext[i]表示物品上下文信息的embedding的第i个分量;Where v item represents the feature vector of the item, v id represents the id class embedding of the item, v context represents the embedding of the item context information, v id [i] represents the i-th component of the id class embedding of the item, and v context [i] represents the item The ith component of the embedding of context information;

第二步:训练双塔模型,结合in-batch softmax损失函数与基于哈希序列的频率估计方法对双塔模型进行优化;The second step: train the double-tower model, and optimize the double-tower model by combining the in-batch softmax loss function and the frequency estimation method based on the hash sequence;

所述第二步中,双塔模型衍生于DSSM(Deep Structured Semantic Models)。DSSM是深度结构化语义模型,在自然语言处理领域中,DSSM常用于解决语义相似度的问题。双塔模型从上往下可以分为输入层、表示层、匹配层,从水平方向看分为两个输入层,两个表示层和一个匹配层。其中两个输入层的输出分别是两个表示层的输入,两个表示层输出汇总到匹配层,整体呈现“双塔”的形式。本发明中,双塔模型的输入层的两个输入分别是用户特征向量和物品特征向量,两个表示层是相同的神经网络结构,特征向量通过神经网络计算后会得到相同维度的向量。最后将这两个向量经过L2归一化后进行内积。进一步的,本发明优化的双塔模型,是采用了基于哈希序列的频率估计方法对双塔模型进行优化。该方法减少了负采样在每个batch中可能会出现的采样偏差问题,从而优化了损失函数。In the second step, the twin tower model is derived from DSSM (Deep Structured Semantic Models). DSSM is a deep structured semantic model. In the field of natural language processing, DSSM is often used to solve the problem of semantic similarity. The dual-tower model can be divided into input layer, presentation layer and matching layer from top to bottom, and from the horizontal direction, it is divided into two input layers, two presentation layers and one matching layer. The outputs of the two input layers are the inputs of the two presentation layers, respectively, and the outputs of the two presentation layers are aggregated to the matching layer, and the whole is in the form of a "twin tower". In the present invention, the two inputs of the input layer of the twin-tower model are the user feature vector and the item feature vector respectively, the two representation layers have the same neural network structure, and the feature vector will obtain a vector of the same dimension after being calculated by the neural network. Finally, the two vectors are L2 normalized and then inner product is performed. Further, the optimized twin-tower model of the present invention adopts a frequency estimation method based on a hash sequence to optimize the twin-tower model. This method reduces the possible sampling bias problem of negative sampling in each batch, thereby optimizing the loss function.

所述第二步的流程如下:The process of the second step is as follows:

2.1、给出两个带参数的embedding函数:

Figure BDA0003517220730000141
其中
Figure BDA0003517220730000142
表示d维实数向量。2.1. Give two embedding functions with parameters:
Figure BDA0003517220730000141
in
Figure BDA0003517220730000142
Represents a d-dimensional real vector.

Figure BDA0003517220730000143
是通过深度神经网络提取后的用户特征向量和物品特征向量,其中
Figure BDA0003517220730000144
Figure BDA0003517220730000145
分别是双塔模型需要的输入值。
Figure BDA0003517220730000143
are the user feature vector and item feature vector extracted by the deep neural network, where
Figure BDA0003517220730000144
and
Figure BDA0003517220730000145
are the input values required by the twin-tower model, respectively.

更进一步,2.1的流程如下Further, the process of 2.1 is as follows

2.1.1、双塔模型含有两个深度神经网络模型。神经网络模型的组成单元是感知机,感知机有一个若干个输入和一个输出,其中输出和输入是训练到的一个线性关系,输出的值会通过一个非线性激活函数得到最终结果;2.1.1. The twin tower model contains two deep neural network models. The component unit of the neural network model is the perceptron. The perceptron has several inputs and one output. The output and the input are a linear relationship trained, and the output value will get the final result through a nonlinear activation function;

2.1.2、深度神经网络在神经网络模型上做了扩展,深度神经网络有一个输入层,一个输出层和若干个隐藏层,本发明采用了3层隐藏层,并且层与层之间采用全连接,全连接是指的每一个结点都与上一层的所有结点相连;2.1.2. The deep neural network is extended on the neural network model. The deep neural network has one input layer, one output layer and several hidden layers. Connection, full connection means that each node is connected to all nodes in the previous layer;

2.1.3、将用户特征向量和物品特征向量输入到对应的神经网络,由于模型需要训练,所以输出结果用含有待训练参数θ的函数来表示,输出结果分别为:

Figure BDA0003517220730000151
其中
Figure BDA0003517220730000152
2.1.3. Input the user feature vector and item feature vector into the corresponding neural network. Since the model needs to be trained, the output result is represented by a function containing the parameter θ to be trained. The output results are:
Figure BDA0003517220730000151
in
Figure BDA0003517220730000152

2.1.4、将输出结果

Figure BDA0003517220730000153
做L2归一化处理,L2归一化公式如下:2.1.4, will output the result
Figure BDA0003517220730000153
To do L2 normalization, the L2 normalization formula is as follows:

Figure BDA0003517220730000154
其中
Figure BDA0003517220730000155
γi表示[Y]向量上第i个分量;
Figure BDA0003517220730000154
in
Figure BDA0003517220730000155
γ i represents the i-th component on the [Y] vector;

2.2、通过基于哈希序列的频率估计方法,即利用哈希序列来记录采样的序列号,减少了负采样在每个batch中可能会出现的采样偏差问题,从而优化了损失函数,循环以下步骤;2.2. Through the frequency estimation method based on the hash sequence, that is, using the hash sequence to record the serial number of the sampling, the sampling deviation problem that may occur in each batch of negative sampling is reduced, thereby optimizing the loss function, and looping the following steps ;

更进一步,2.2的流程如下:Going a step further, the process for 2.2 is as follows:

2.2.1、从样本集中随机采取T个样本,表示如下:2.2.1. Take T samples randomly from the sample set, which are expressed as follows:

Figure BDA0003517220730000156
其中xi表示样本T中的第i个用户特征向量,
Figure BDA0003517220730000156
where x i represents the ith user feature vector in sample T,

yi表示样本T中的第i个物品特征向量,y i represents the i-th item feature vector in the sample T,

ri表示样本T中第i个用户反馈程度,且ri ∈[0,1]。ri represents the feedback degree of the ith user in the sample T, and ri ∈ [0,1].

2.2.2、利用基于哈希序列的频率估计方法计算每个样本中yi的概率pi2.2.2. Calculate the probability p i of y i in each sample by using the frequency estimation method based on the hash sequence;

所述2.2.2流程如下:The process described in 2.2.2 is as follows:

2.2.2.1、设学习率α,大小为H的数组A和D,哈希函数h;2.2.2.1. Let the learning rate α, the arrays A and D of size H, and the hash function h;

其中哈希函数h可以把每个yi映射到范围在[0,H]的整数。where the hash function h can map each yi to an integer in the range [0, H].

2.2.2.2. For each step t = 1, 2, …, and for each item $y_i$ in the set $\mathcal{B}$ of all items in one batch of the training samples, do:

2.2.2.2.1. Update the estimated sampling interval:

$$D[h(y_i)] \leftarrow (1-\alpha) \cdot D[h(y_i)] + \alpha \cdot \big(t - A[h(y_i)]\big)$$

where ← denotes assignment, $A[h(y_i)]$ records the step at which $y_i$ was last sampled, and $D[h(y_i)]$ denotes the estimated number of steps between two successive samplings of $y_i$;

2.2.2.2.2. Record the current step:

$$A[h(y_i)] \leftarrow t$$

2.2.2.3. For each $y_i \in \mathcal{B}$, the sampling probability is

$$p_i = \frac{1}{D[h(y_i)]}$$
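A minimal Python sketch of this estimator follows; it is an illustration under assumed names, array sizes, and a simplified hash function, not the patented code:

```python
import numpy as np

class FrequencyEstimator:
    """Hash-based streaming estimate of item sampling probability (per 2.2.2)."""

    def __init__(self, H=2**20, alpha=0.01):
        self.alpha = alpha
        self.H = H
        self.A = np.zeros(H, dtype=np.int64)  # step at which each slot was last sampled
        self.D = np.full(H, 1.0)              # estimated steps between two samplings

    def _h(self, item_id):
        # Any hash mapping item ids into [0, H); Python's hash is used for illustration.
        return hash(item_id) % self.H

    def update(self, t, batch_items):
        """t: global training step; batch_items: item ids in the current batch."""
        for y in batch_items:
            j = self._h(y)
            # Moving average of the sampling interval (2.2.2.2.1)
            self.D[j] = (1 - self.alpha) * self.D[j] + self.alpha * (t - self.A[j])
            # Record the current sampling step (2.2.2.2.2)
            self.A[j] = t

    def prob(self, item_id):
        # Estimated sampling probability p_i = 1 / D[h(y_i)]  (2.2.2.3)
        return 1.0 / self.D[self._h(item_id)]
```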

2.2.3. Compute the optimized loss function $L_B(\theta)$ and give the derivation process;

The process of 2.2.3 is as follows:

2.2.3.1. The vector inner product formula is given as:

$$s(x, y) = \langle u(x;\theta),\ v(y;\theta) \rangle$$

2.2.3.2. Given a vector $x$, the probability of obtaining one item $y_i$ out of M items can be calculated with the softmax function:

$$\mathcal{P}(y_i \mid x; \theta) = \frac{e^{s(x, y_i)}}{\sum_{j=1}^{M} e^{s(x, y_j)}}$$

where the θ after the semicolon denotes the parameters carried by $u$ and $v$, and e is the natural constant;

2.2.3.3. The weighted log-likelihood loss function is:

$$L_T(\theta) = -\frac{1}{T} \sum_{i \in [T]} r_i \cdot \log \mathcal{P}(y_i \mid x_i; \theta)$$

where T denotes the T samples randomly taken in 2.2.1;

2.2.3.4. The present invention uses a negative sampling algorithm for the calculation; it adopts a smoothing strategy that raises the sampling probability of low-frequency samples. Negative samples are drawn within the same batch. Given a mini-batch

$$\mathcal{B} = \{(x_i, y_i, r_i)\}_{i=1}^{B}$$

the batch-softmax function is:

$$\mathcal{P}_B(y_i \mid x_i; \theta) = \frac{e^{s(x_i, y_i)}}{\sum_{j \in [B]} e^{s(x_i, y_j)}}$$

2.2.3.5. Within each batch there is a power-law distribution phenomenon: randomly sampling negatives inside a batch makes popular items easy to sample, so these popular items are over-penalized in the loss function. The sampling is therefore corrected by frequency:

$$s^c(x_i, y_i) = s(x_i, y_i) - \log(p_i)$$

where $s^c(x_i, y_i)$ denotes the corrected value of $s(x_i, y_i)$, and $p_i$ comes from the hash-sequence-based frequency estimation algorithm described in 2.2.2;

2.2.3.6. The corrected conditional probability function is therefore:

$$\mathcal{P}_B^c(y_i \mid x_i; \theta) = \frac{e^{s^c(x_i, y_i)}}{e^{s^c(x_i, y_i)} + \sum_{j \in [B],\, j \neq i} e^{s^c(x_i, y_j)}}$$

2.2.3.7. The final loss function is:

$$L_B(\theta) = -\frac{1}{B} \sum_{i \in [B]} r_i \cdot \log \mathcal{P}_B^c(y_i \mid x_i; \theta)$$
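A hedged PyTorch sketch of 2.2.3.4 to 2.2.3.7 follows; the function name, tensor shapes, and use of PyTorch are assumptions, not the patented code:

```python
import torch
import torch.nn.functional as F

def corrected_batch_softmax_loss(u, v, r, log_p):
    """
    u: [B, d] L2-normalized user embeddings
    v: [B, d] L2-normalized item embeddings
    r: [B]    feedback degrees r_i in [0, 1]
    log_p: [B] log of estimated sampling probabilities from 2.2.2
    """
    # Inner-product scores s(x_i, y_j) for every user/item pair in the batch (2.2.3.1)
    scores = u @ v.t()                     # [B, B]
    # Frequency correction: subtract log(p_j) from each candidate column (2.2.3.5)
    scores = scores - log_p.unsqueeze(0)   # broadcast over rows
    # In-batch softmax; the diagonal holds each positive pair (2.2.3.6)
    log_probs = F.log_softmax(scores, dim=1)
    positives = log_probs.diagonal()
    # Weighted negative log-likelihood averaged over the batch (2.2.3.7)
    return -(r * positives).mean()
```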

2.2.4. Update the parameter θ with the gradient descent algorithm to bring it close to the optimal value; gradient descent is a machine-learning algorithm commonly used for solving model parameters;
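For reference, a single plain gradient descent step on the loss from 2.2.3.7 takes the standard form, with η an assumed learning-rate hyperparameter:

$$\theta \leftarrow \theta - \eta \, \nabla_{\theta} L_B(\theta)$$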

Step 3: Define the entity mapping relationship between the user historical interaction matrix and the knowledge graph;

The process of the third step is as follows:

3.1. Given a user interaction matrix

$$Y = \{ y_{uv} \mid u \in U,\ v \in V \}$$

where $y_{uv}$ denotes the interaction between the u-th user and the v-th item, and U and V denote the user set and the item set, respectively;

the user interaction matrix Y is expressed as:

$$y_{uv} = \begin{cases} 1, & \text{if an interaction between user } u \text{ and item } v \text{ is observed} \\ 0, & \text{otherwise} \end{cases}$$
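A minimal illustration of such a matrix, with made-up users and items:

```python
import numpy as np

# Rows: users u1..u3; columns: items v1..v4 (illustrative data only)
Y = np.array([
    [1, 0, 0, 1],   # user 1 interacted with items 1 and 4
    [0, 1, 0, 0],   # user 2 interacted with item 2
    [1, 1, 1, 0],   # user 3 interacted with items 1, 2, 3
])
```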

3.2. Define $O_{i,j}$ to denote the interaction of user i with item j, and map the items with which user i has interacted to the entities of the knowledge graph $\mathcal{G}$;

Further, the process of 3.2 is as follows:

3.2.1. $O_{i,j}$ is an interaction matrix composed of the row vectors of users; $O_{i,j}$ denotes the interaction of user i with item j, where j is the index of the item;

3.2.2. Define a HashMap for storing all items; the HashMap stores key-value pairs, where each key holds the index of an item and the value holds the entity corresponding to that item;

3.2.3. A row vector of the user interaction matrix O represents one user's interactions; store this row vector in an array, defined as Ε;

3.2.4. Define a temporary set temp_set used to store entities; traverse the array Ε, and if the value of an element is 1, access the item HashMap by that element's index, obtain the corresponding entity, and store it in temp_set (a minimal sketch follows);
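A minimal Python sketch of 3.2.2 to 3.2.4 under assumed toy data; the item-to-entity table and the interaction row are illustrative:

```python
# Item index -> knowledge-graph entity (the "HashMap" of 3.2.2); toy values
item_to_entity = {0: "entity_A", 1: "entity_B", 2: "entity_C", 3: "entity_D"}

# One user's row vector from the interaction matrix O (3.2.3)
E = [1, 0, 1, 0]

# Collect entities for interacted items (3.2.4)
temp_set = {item_to_entity[idx] for idx, flag in enumerate(E) if flag == 1}
# temp_set == {"entity_A", "entity_C"}
```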

Step 4: Through preference entity propagation, input the entities recalled by each propagation together with the user features into the optimized double-tower model to obtain prediction probabilities, screen the entities with high probability, and finally obtain a knowledge graph representing the user's preferences and potential preferences;

The process of the fourth step is as follows:

4.1. For a user, give the user feature vector $v_{user}$ described in the first step;

4.2. Initialize the user's temporary set temp_set according to the third step;

4.3. Define a HashMap of the user's preferences, named user_map; the key stores the loop count and the value stores a set of triples;

4.4. Define a set used_set for storing the triples that have already been recalled;

4.5. Let the loop count k = 1, 2, 3, …; given a maximum value K, exit the loop when the number of triples in user_map is greater than K; otherwise execute the following steps in a loop (a condensed sketch of the whole loop follows the list):

4.5.1. Traverse the entities in temp_set and remove the entities already in used_set:

temp_set ← temp_set − used_set

where ← denotes assignment and − denotes the set-difference operation;

4.5.2. Find the triple set $S^k$ from the knowledge graph and store it into user_map, defined as:

$$S^k = \{ (h', r', t') \mid (h', r', t') \in \mathcal{G},\ h' \in \text{temp\_set} \}$$

where (h', r', t') denotes a triple: h' the head entity, r' the relation, and t' the tail entity;

4.5.3. Since the knowledge graph contains cycles, to prevent collecting entities that have already been recalled, the entities in temp_set need to be added to used_set;

4.5.4. Take out the tail entities of the triples in $S^k$ and store them in the set

$$E^k = \{ t' \mid (h', r', t') \in S^k \}$$

4.5.5. Traverse the entities in $E^k$ and take out the corresponding item feature vectors $v_{item}$;

4.5.6. Input the parameters $v_{user}$ and $v_{item}$ described in the second step into the double-tower model to obtain the probability of the user for each item corresponding to $E^k$, and sort the entities by probability;

4.5.7. Given a value τ, 0 < τ ≤ 1, used to decide the number of screened entities:

$$E^k \leftarrow \text{newSet}\big(\text{sort}(E^k).\text{getSubVec}(1, \lceil \tau \cdot |E^k| \rceil)\big)$$

where sort(·) sorts the elements of the set by probability and returns an array, getSubVec(i, j) takes the sub-array consisting of the i-th to the j-th elements of the original array, newSet() converts an array into a set, and ⌈·⌉ denotes rounding up;

4.5.8. From the set $S^k$, screen out the triples whose tail entity belongs to $E^k$:

$$\Lambda = \{ (h', r', t') \mid (h', r', t') \in S^k,\ t' \in E^k \}$$

4.5.9. Store the triple set Λ into user_map via user_map(k, Λ);

4.5.10. Overwrite temp_set with the set $E^k$;

4.6. After 4.5 has finished, the knowledge graph of the user's preferences is finally obtained.
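As referenced in 4.5, a condensed Python sketch of the propagation loop is given below; the triple store, the scoring callback, and all names are illustrative assumptions rather than the patented implementation:

```python
import math

def recall_preference_entities(kg_triples, temp_set, score_fn, K=100, tau=0.5):
    """kg_triples: re-iterable collection of (h, r, t) triples; score_fn(entity)
    returns a probability from the double-tower model; returns user_map
    {k: set of triples}, following steps 4.3 to 4.6."""
    user_map, used_set, total, k = {}, set(), 0, 1
    while total <= K and temp_set:
        temp_set = temp_set - used_set                      # 4.5.1
        S_k = {(h, r, t) for (h, r, t) in kg_triples
               if h in temp_set}                            # 4.5.2
        used_set |= temp_set                                # 4.5.3
        E_k = {t for (_, _, t) in S_k}                      # 4.5.4
        ranked = sorted(E_k, key=score_fn, reverse=True)    # 4.5.5-4.5.6
        E_k = set(ranked[:math.ceil(tau * len(ranked))])    # 4.5.7
        Lam = {(h, r, t) for (h, r, t) in S_k if t in E_k}  # 4.5.8
        user_map[k] = Lam                                   # 4.5.9
        total += len(Lam)
        temp_set = E_k                                      # 4.5.10
        k += 1
    return user_map
```

In practice, score_fn would wrap the double-tower model of the second step, scoring each candidate entity's item vector against $v_{user}$.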

Claims (5)

1. A knowledge graph user preference entity recalling method based on a double-tower model is characterized by comprising the following steps:
1) defining a user characteristic vector and an article characteristic vector as the input of a double-tower model;
2) training a double-tower model, and optimizing the double-tower model by combining an in-batch softmax loss function and a frequency estimation method based on a hash sequence;
3) defining an entity mapping relation between a user historical interaction matrix and a knowledge graph;
4) in a preference entity propagation manner, inputting the entities recalled by each propagation together with the user characteristics into the optimized double-tower model to obtain prediction probabilities, screening the entities according to the prediction probabilities, and finally obtaining a knowledge graph representing the user preferences and potential preferences.
2. The method for recalling knowledge-graph user preference entity based on double-tower model according to claim 1, wherein the specific process of step 1) is as follows:
1.1) user characteristics refer to interactive behaviors of a user on an article, including click records, search records, social data, personal data and sample age, and user characteristic vectors are obtained by converting the interactive data into vectors and splicing the vectors; the mode of converting the original data into the vector is called vector embedding;
1.1.1) the embedding of the user click record is the weighted average of the id-class embeddings of the clicked items, wherein an id-class embedding is a vector which maps the unique identifier of an item to a common dimension, and the weight of each embedding is in direct proportion to the item browsing time; the embedding of the user click record is calculated as follows:

$$v_{click} = \sum_{i=1}^{n} w_i \, v_{click,i}$$

wherein $v_{click}$ represents the embedding of the user click record, $w_i$ denotes the i-th weight, $v_{click,i}$ represents the id-class embedding of the i-th item in the click record, and n represents the number of embeddings; $w_i$ is calculated from $t_i$, the time the user browses item i, N, the total number of samples, and k, the total number of positive examples;
1.1.2) the embedding of the user search record is obtained by performing word segmentation on the keywords of historical searches to obtain entries; the embedding of each entry is obtained through a Word2vec model, and the embeddings of the user search record are then weighted-averaged;

the formula for the embedding of the user search record is:

$$v_{search} = \sum_{i=1}^{n} w_i \, v_{search,i}$$

wherein $v_{search}$ represents the embedding of the user's search record, $w_i$ denotes the i-th weight, $v_{search,i}$ indicates the embedding of the i-th entry in the search record, and n indicates the number of embeddings;

the weight of a search-record embedding is calculated according to the search validity, which judges whether the user clicks the item after searching for it;
1.1.3) the social data of the user comprises the weighted average of the embeddings corresponding to the favorite, like and subscription data; the embeddings corresponding to favorite and like data refer to the id-class embeddings of the items the user has favorited and liked; the embedding corresponding to subscription data refers to the id-class embedding of the person in charge of the item the user has subscribed to;

the formula for the embedding of the user's social data is:

$$v_{social} = \sum_{i=1}^{n} w_i \, v_{social,i}$$

wherein $v_{social}$ represents the embedding of the user's social data, $w_i$ denotes the i-th weight, and $v_{social,i}$ represents the embedding of the i-th piece of social data;

the weights of the favorite and like embeddings are calculated from $t_i$, the time the user browses item i, N, the total number of samples, and k, the total number of positive examples; the weight of a subscription embedding is calculated analogously from the browsing time of the i-th item of the subscribed person, N, and k;
1.1.4) the personal data of the user comprises the user's gender, age and region; the gender characteristic is a simple binary characteristic, while age and region belong to continuous characteristics, which are normalized into real values in the [0,1] interval; the embedding of the user personal data is a vector obtained by splicing the processed values of gender, age and region;

1.1.4.1) calculate a binary representation of the user's gender, encoded as a 0/1 value;

1.1.4.2) calculate the normalized real values of the user's age and region, the normalization formula being:

$$z = \frac{X - \mu}{\sigma}$$

wherein X represents the sample value, μ is the mean of all sample data, and σ is the standard deviation of all sample data;

1.1.4.3) splice the binary gender value and the normalized real values of age and region from step 1.1.4) to obtain a vector, the vector splicing operation being:

$$v_{personal} = [\,gender,\ z_{age},\ z_{region}\,]$$

wherein $v_{personal}$ represents the personal-data vector, and gender, $z_{age}$ and $z_{region}$ respectively represent the binary gender value and the normalized values of the user's age and region;
1.1.5) according to the process in step 1.1), perform a concatenate operation on the embedding of the user click record, the embedding of the user search record, the embedding of the user social data and the embedding of the user personal data to obtain the user characteristic vector:

$$v_{user} = concatenate(v_{click}, v_{search}, v_{social}, v_{personal}) = [\,v_{click}[1], v_{click}[2], \ldots, v_{search}[1], v_{search}[2], \ldots, v_{social}[1], v_{social}[2], \ldots, v_{personal}[1], v_{personal}[2], \ldots\,]$$

wherein $v_{user}$ represents the user characteristic vector, $v_{click}[i]$ represents the i-th component of the click-record embedding, $v_{search}[i]$ the i-th component of the search-record embedding, $v_{social}[i]$ the i-th component of the social-data embedding, and $v_{personal}[i]$ the i-th component of the personal-data embedding;
1.2) the article characteristics comprise the id of the article and its context information; the article characteristic vector is formed by splicing the id-class embedding of the article and the embedding of its context information;

1.2.1) give the id-class embedding of the article, which is a vector mapping the unique identifier of the article to a common dimension;

1.2.2) give the context-information embedding of the article, which is a vector obtained by Word2vec;

1.2.3) perform a concatenate operation on the id-class embedding and the context-information embedding from step 1.2) to obtain the article characteristic vector:

$$v_{item} = concatenate(v_{id}, v_{context}) = [\,v_{id}[1], v_{id}[2], \ldots, v_{context}[1], v_{context}[2], \ldots\,]$$

wherein $v_{item}$ denotes the article characteristic vector, $v_{id}$ the id-class embedding of the article, $v_{context}$ the embedding of the article context information, $v_{id}[i]$ the i-th component of the id-class embedding, and $v_{context}[i]$ the i-th component of the context-information embedding.
3. The method for recalling knowledge-graph user preference entity based on double-tower model according to claim 1, wherein the specific process of the step 2) is as follows:
2.1) giving two parameterized embedding functions:

$$u(x;\theta) \in \mathbb{R}^d, \qquad v(y;\theta) \in \mathbb{R}^d$$

wherein $\mathbb{R}^d$ represents a d-dimensional real vector space; $u(x;\theta)$ and $v(y;\theta)$ are the user feature vector and the article feature vector extracted by the deep neural networks, wherein $x$ and $y$ are respectively the input values required by the double-tower model;
2.1.1) the double-tower model contains two deep neural network models; the component unit of a neural network model is the perceptron, which has several inputs and one output, wherein the output is a trained linear relation of the inputs, and the output value is passed through a nonlinear activation function to obtain the final result;
2.1.2) the deep neural network is expanded on the neural network model, and the deep neural network is provided with an input layer, an output layer and a plurality of hidden layers;
2.1.3) inputting the user characteristic vector and the article characteristic vector into the corresponding neural networks; since the model needs to be trained, the output results are represented by functions containing the parameter θ to be trained:

$$u = u(x;\theta), \qquad v = v(y;\theta)$$

wherein θ denotes the set of trainable network parameters;
2.1.4) performing L2 normalization on the output results $u$ and $v$; for a vector $Y = (\gamma_1, \ldots, \gamma_d)$, the L2 normalization formula is:

$$\hat{\gamma}_i = \frac{\gamma_i}{\sqrt{\sum_{j=1}^{d} \gamma_j^2}}$$

wherein $\gamma_i$ represents the i-th component of the vector Y;
2.2) a frequency estimation method based on a hash sequence, which uses the hash sequence to record the sampling step of each sample, reduces the sampling deviation that negative sampling may produce in each batch and thereby optimizes the loss function; the following steps are performed in a loop:
2.2.1) randomly taking T samples from the sample set, expressed as follows:

$$\mathcal{T} = \{(x_i, y_i, r_i)\}_{i=1}^{T}$$

wherein $x_i$ represents the i-th user characteristic vector among the samples, $y_i$ represents the i-th article characteristic vector, and $r_i$ represents the feedback degree of the i-th user, with $r_i \in [0,1]$;
2.2.2) computing the probability $p_i$ of each $y_i$ by the frequency estimation method based on the hash sequence;

2.2.2.1) setting a learning rate α, two arrays A and D of size H, and a hash function h, wherein h maps each $y_i$ to an integer in the range [0, H];

2.2.2.2) for each step t = 1, 2, …, and for each article $y_i$ in the set $\mathcal{B}$ of all articles in one batch of the training samples:

2.2.2.2.1) $D[h(y_i)] \leftarrow (1-\alpha)\cdot D[h(y_i)] + \alpha\cdot\big(t - A[h(y_i)]\big)$

wherein ← represents assignment, $A[h(y_i)]$ denotes the step at which $y_i$ was last sampled, and $D[h(y_i)]$ denotes the estimated number of steps between two successive samplings of $y_i$;

2.2.2.2.2) $A[h(y_i)] \leftarrow t$

2.2.2.3) for each $y_i \in \mathcal{B}$, the sampling probability is $p_i = \dfrac{1}{D[h(y_i)]}$;
2.2.3) calculating the optimized loss function $L_B(\theta)$ and giving the derivation process:

2.2.3.1) the vector inner product formula is:

$$s(x, y) = \langle u(x;\theta),\ v(y;\theta) \rangle$$

2.2.3.2) given a vector $x$, the probability of obtaining one article $y_i$ from the M articles can be calculated with the softmax function:

$$\mathcal{P}(y_i \mid x; \theta) = \frac{e^{s(x, y_i)}}{\sum_{j=1}^{M} e^{s(x, y_j)}}$$

wherein e is a natural constant;

2.2.3.3) the weighted log-likelihood loss function is:

$$L_T(\theta) = -\frac{1}{T} \sum_{i \in [T]} r_i \cdot \log \mathcal{P}(y_i \mid x_i; \theta)$$

wherein T represents the T samples randomly taken in 2.2.1);

2.2.3.4) calculating by using a negative sampling algorithm, wherein the negative sampling algorithm adopts a smoothing strategy that can improve the sampling probability of low-frequency samples; negative samples are sampled in the same batch; given a minimum batch

$$\mathcal{B} = \{(x_i, y_i, r_i)\}_{i=1}^{B}$$

the batch-softmax function is:

$$\mathcal{P}_B(y_i \mid x_i; \theta) = \frac{e^{s(x_i, y_i)}}{\sum_{j \in [B]} e^{s(x_i, y_j)}}$$

2.2.3.5) in each batch, owing to the power-law distribution phenomenon, randomly sampling negative samples makes hot articles easy to sample, and these hot articles are excessively punished in the loss function; the sampling is therefore corrected by frequency:

$$s^c(x_i, y_i) = s(x_i, y_i) - \log(p_i)$$

wherein $s^c(x_i, y_i)$ denotes the corrected value of $s(x_i, y_i)$, and $p_i$ is the probability of $y_i$;

2.2.3.6) the conditional probability function thus modified is:

$$\mathcal{P}_B^c(y_i \mid x_i; \theta) = \frac{e^{s^c(x_i, y_i)}}{e^{s^c(x_i, y_i)} + \sum_{j \in [B],\, j \neq i} e^{s^c(x_i, y_j)}}$$

2.2.3.7) the final loss function is:

$$L_B(\theta) = -\frac{1}{B} \sum_{i \in [B]} r_i \cdot \log \mathcal{P}_B^c(y_i \mid x_i; \theta)$$
2.2.4) updating the parameter θ with a gradient descent algorithm to bring it close to the optimal value; gradient descent is a machine-learning algorithm commonly used for solving model parameters.
4. The method for recalling knowledge-graph user preference entity based on double-tower model according to claim 1, wherein the specific process of the step 3) is as follows:
3.1) given a user interaction matrix

$$Y = \{ y_{uv} \mid u \in U,\ v \in V \}$$

wherein $y_{uv}$ denotes the interaction between the u-th user and the v-th item, and U and V respectively represent the user set and the item set;

the expression of the user interaction matrix Y is:

$$y_{uv} = \begin{cases} 1, & \text{if an interaction between user } u \text{ and item } v \text{ is observed} \\ 0, & \text{otherwise} \end{cases}$$
3.2) defining $O_{i,j}$ to represent the interaction condition of user i with item j, and mapping the items with which user i has interacted to the entities of the knowledge graph $\mathcal{G}$;
3.2.1) $O_{i,j}$ is an interaction matrix composed of the row vectors of the users; $O_{i,j}$ represents the interaction condition of user i with item j, wherein j is the index of the item;

3.2.2) defining a HashMap for storing all items, wherein the HashMap is stored in key-value pairs, the key storing the index of an item and the value storing the entity corresponding to that item;

3.2.3) a row vector of the user interaction matrix O expresses the interaction condition of one user; this row vector is stored in an array defined as Ε;

3.2.4) defining a temporary set temp_set for storing entities; traversing the array Ε, and if the value of an element is 1, accessing the item HashMap according to the index of that element, acquiring the corresponding entity, and storing it in the temporary set temp_set.
5. The method for recalling knowledge-graph user preference entity based on double-tower model according to claim 1, wherein the specific process of the step 4) is as follows:
4.1) for a user, giving the user feature vector $v_{user}$ described in the first step;

4.2) initializing the temporary set temp_set of the user according to the third step;

4.3) defining a HashMap of the user's preferences, namely user_map, wherein the key stores the loop count and the value stores a triple set;

4.4) defining a set used_set for storing the recalled triples;

4.5) letting the loop count k = 1, 2, 3, …; given a maximum value K, when the number of triples in user_map is greater than K, exiting the loop, otherwise executing the following steps:
4.5.1) traversing the entities in temp_set and removing the entities in used_set:

temp_set ← temp_set − used_set

wherein ← represents assignment, and − represents the set-difference operation;
4.5.2) finding the triple set $S^k$ from the knowledge graph and storing it into user_map, defined as follows:

$$S^k = \{ (h', r', t') \mid (h', r', t') \in \mathcal{G},\ h' \in \text{temp\_set} \}$$

wherein (h', r', t') represents a triple, h' represents a head entity, r' represents a relationship, and t' represents a tail entity;
4.5.3) since the knowledge graph contains loops, in order to prevent collecting entities that have already been recalled, the entities in temp_set need to be added to used_set;
4.5.4) taking out the tail entities of the triples in $S^k$ and depositing them into the set

$$E^k = \{ t' \mid (h', r', t') \in S^k \}$$
4.5.5) traversing the entities in $E^k$ and taking out the corresponding article feature vectors $v_{item}$;
4.5.6) inputting the parameters $v_{user}$ and $v_{item}$ described in the second step into the double-tower model to obtain the probability of the user for each article corresponding to $E^k$, and ordering the entities according to the probability;
4.5.7) given a value τ, 0 < τ ≤ 1, for determining the number of screened entities:

$$E^k \leftarrow \text{newSet}\big(\text{sort}(E^k).\text{getSubVec}(1, \lceil \tau \cdot |E^k| \rceil)\big)$$

wherein sort(·) represents sorting the elements in the set by probability and returning an array, getSubVec(i, j) represents fetching the sub-array, i.e. the i-th to the j-th elements of the original array, newSet() represents converting the array into a set, and ⌈·⌉ represents rounding up;
4.5.8) screening out from the set $S^k$ the triples whose tail entities belong to $E^k$:

$$\Lambda = \{ (h', r', t') \mid (h', r', t') \in S^k,\ t' \in E^k \}$$
4.5.9) the method of depositing the triple set Λ into user_map: user_map(k, Λ);

4.5.10) overwriting temp_set with the set $E^k$;
4.6) after executing 4.5), the knowledge graph of the user's preferences is finally obtained.
CN202210169936.1A 2022-02-23 A knowledge graph user preference entity recall method based on double tower model Active CN114564594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210169936.1A CN114564594B (en) 2022-02-23 A knowledge graph user preference entity recall method based on double tower model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210169936.1A CN114564594B (en) 2022-02-23 A knowledge graph user preference entity recall method based on double tower model

Publications (2)

Publication Number Publication Date
CN114564594A true CN114564594A (en) 2022-05-31
CN114564594B CN114564594B (en) 2025-02-21


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428055A (en) * 2020-04-20 2020-07-17 神思电子技术股份有限公司 Industry-oriented context omission question-answering method
CN112232925A (en) * 2020-11-02 2021-01-15 哈尔滨工程大学 Method for carrying out personalized recommendation on commodities by fusing knowledge maps
CN112732936A (en) * 2021-01-11 2021-04-30 电子科技大学 Radio and television program recommendation method based on knowledge graph and user microscopic behaviors

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YUANYI CHEN等: "Learning User Preference from Heterogeneous Information for Store-Type Recommendation", 《IEEE TRANSACTIONS ON SERVICES COMPUTING》, vol. 13, no. 6, 21 September 2017 (2017-09-21), XP011823707, DOI: 10.1109/TSC.2017.2755009 *
张有杰: "基于社交网络的推荐算法研究与实现", 《中国优秀硕士学位论文全文数据库》, 15 January 2022 (2022-01-15) *
陆佳炜等: "一种融合实体图上下文的三维旋转知识图谱表示学习", 《小型微型计算机系统》, vol. 44, no. 1, 15 October 2021 (2021-10-15) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116150504A (en) * 2023-04-17 2023-05-23 特斯联科技集团有限公司 Recommendation method and device for processing long tail distribution, computer storage medium and terminal
CN118312657A (en) * 2024-06-07 2024-07-09 江苏瑞问科技有限公司 Knowledge base-based intelligent large model analysis recommendation system and method
CN118312657B (en) * 2024-06-07 2024-08-09 江苏瑞问科技有限公司 Knowledge base-based intelligent large model analysis recommendation system and method
CN119025718A (en) * 2024-10-28 2024-11-26 上海大智慧财汇数据科技有限公司 Method and system for determining and analyzing group households based on graphic database
CN119025718B (en) * 2024-10-28 2025-01-17 上海大智慧财汇数据科技有限公司 Group user judging and analyzing method and system based on graphic database

Similar Documents

Publication Publication Date Title
CN112214685B (en) Knowledge graph-based personalized recommendation method
CN108920641B (en) Information fusion personalized recommendation method
CN110910218B (en) Multi-behavior migration recommendation method based on deep learning
CN111310063B (en) Neural network-based article recommendation method for memory perception gated factorization machine
CN110532471B (en) Active learning collaborative filtering method based on gated cyclic unit neural network
CN114693397B (en) Attention neural network-based multi-view multi-mode commodity recommendation method
CN115062237A (en) Cultural resource recommendation method based on the combination of graph neural network and knowledge graph
CN112488791A (en) Individualized recommendation method based on knowledge graph convolution algorithm
CN113918834B (en) Graph convolution collaborative filtering recommendation method fusing social relations
CN113918833A (en) Product recommendation method realized through graph convolution collaborative filtering of social network relationship
CN113704438B (en) Conversation recommendation method of abnormal picture based on layered attention mechanism
CN111985623A (en) Attribute graph group discovery method based on maximized mutual information and graph neural network
CN117112920A (en) A collaborative filtering method based on interactive graph convolutional network
CN117634600A (en) A graph self-supervised learning recommendation method based on knowledge awareness
CN112818256A (en) Recommendation method based on neural collaborative filtering
CN104063555B (en) The user model modeling method intelligently distributed towards remote sensing information
CN116662564A (en) A service recommendation method based on deep matrix factorization and knowledge graph
CN116069921A (en) News Recommendation Method Combining Activation Diffusion Theory and Ebbinghaus Forgetting Theory
CN111814048A (en) Method and device for recommending information
CN114817581A (en) Cross-modal Hash retrieval method based on fusion attention mechanism and DenseNet network
CN114564594B (en) A knowledge graph user preference entity recall method based on double tower model
CN114564594A (en) Knowledge graph user preference entity recall method based on double-tower model
CN115114528A (en) A Knowledge Graph Recommendation Method Integrating GNN and ResNet
CN116821519A (en) Intelligent recommendation method for system filtering and noise reduction based on graph structure
CN116308618A (en) A method for training a recommendation model, a recommendation model, and a product recommendation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant