CN106649818A - Recognition method and device for application search intentions and application search method and server

Info

Publication number: CN106649818A
Application number: CN201611246921.1A
Authority: CN
Inventors: 庞伟
Original assignee: Beijing Qihoo Technology Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2016-12-29
Filing date: 2016-12-29
Publication date: 2017-05-10
Anticipated expiration: 2036-12-29
Also published as: CN106649818B

Abstract

The invention discloses a method and device for recognizing application search intent, an application search method, and a server. The method includes: obtaining the search terms in each query session from the query session log of an application search engine; mining a tag system for each search term according to the search terms in each query session and a preset strategy; and identifying the application search intent corresponding to each search term according to its tag system. The solution proposes a tag-based user intent recognition method that matches the tag system of apps and flexibly expresses a user's fine-grained query intent. The user intent tag system is built with unsupervised machine learning rather than traditional user intent classification, and the automated mining process generates user intent tag lists with high precision and recall. Because user intents and apps are mapped to the same tag system, users searching for apps can quickly and accurately obtain apps that match their intent.

Description

Application search intent recognition method, device, application search method, and server

Technical field

The present invention relates to the field of data mining, and in particular to a method and device for recognizing application search intent, an application search method, and a server.

Background

An application search engine is a mobile search service for software applications that provides app search and download on mobile phones; examples include 360 Mobile Assistant, Tencent App Store, Google Play, and the App Store. An application search engine is a mobile search service installed on the phone, such as the 360 Mobile Assistant app. Because of objective constraints such as the small display area available for search results, only precise search results deliver the best user experience; this is one of the important differences between mobile search and PC web search. The number of mobile apps is huge, running into the millions, so an application search engine must understand the user's query intent before it can present exactly the app the user has in mind.

The premise for an application search engine to provide precise search is an accurate understanding of the user's query intent. Every query request carries an underlying search intent. If the application search engine can perceive the user's need, map the search term text to the corresponding app function or app category, and rank the apps that better match the user's intent first, the search experience clearly improves. User intent recognition is therefore a core technology of application search engines and the key to implementing function-based search.

In conventional web search engines, user search intents are organized and classified manually into three types: navigational, informational, and resource. This web-oriented classification does not suit the app scenario. Each app has a fixed application domain and provides a specific function, so tags are an appropriate way to mine users' fine-grained functional needs, whereas classification-based methods are too coarse and broad. To date, therefore, no sufficiently flexible and effective method exists to meet users' growing demand for fast, precise app search.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a method and device for recognizing application search intent, an application search method, and a server that overcome the above problems or at least partially solve them.

According to one aspect of the present invention, a method for recognizing application search intent is provided, the method comprising:

obtaining the search terms in each query session from the query session log of an application search engine;

mining a tag system for each search term according to the search terms in each query session and a preset strategy;

identifying, according to the tag system of each search term, the application search intent corresponding to that search term.

Optionally, mining the tag system of each search term according to the search terms in each query session and the preset strategy includes:

obtaining a training corpus set according to the search terms in each query session;

inputting the training corpus set into an LDA model for training, and obtaining the search term-topic probability distribution and the topic-keyword probability distribution output by the LDA model;

calculating the tag system of each search term according to the search term-topic probability distribution and the topic-keyword probability distribution.

Optionally, obtaining the training corpus set according to the search terms in each query session includes:

obtaining the original corpus of each search term according to the search terms in each query session;

forming an original corpus set from the original corpora of the search terms, and preprocessing the original corpus set to obtain the training corpus set.

Optionally, obtaining the original corpus of each search term according to the search terms in each query session includes:

obtaining, according to the search terms in each query session, a search term sequence set corresponding to multiple query sessions, and a search term set corresponding to the multiple query sessions;

training on the search term sequence set to obtain an N-dimensional search term vector file;

for each search term in the search term set, calculating the degree of association between that search term and each of the other search terms according to the N-dimensional search term vector file, and taking the other search terms whose degree of association with that search term meets a preset condition as the original corpus of that search term.

Optionally, obtaining the search term sequence set corresponding to multiple query sessions includes:

for each query session, arranging the search terms in the query session into a sequence in order; if a search term in the sequence corresponds to an application download operation, inserting the name of the downloaded application into the sequence immediately after that search term; and thereby obtaining the search term sequence corresponding to the query session;

obtaining the search term set corresponding to multiple query sessions includes: taking the set of search terms in the multiple query sessions as the search term set corresponding to the multiple query sessions.
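The construction of the search term sequences and the search term set can be illustrated with a minimal Python sketch. The log record shape (a search term paired with an optional downloaded app name) and the sample app names are assumptions made for illustration only; the actual log format is not specified here.

```python
from typing import Iterable, Optional

# Hypothetical log record: (search_term, downloaded_app_name or None).
Event = tuple[str, Optional[str]]


def build_sequence(session: Iterable[Event]) -> list[str]:
    """Return the search term sequence for one query session."""
    sequence: list[str] = []
    for term, downloaded_app in session:
        sequence.append(term)
        if downloaded_app is not None:
            # The downloaded app's name goes immediately after its search term.
            sequence.append(downloaded_app)
    return sequence


def build_term_set(sessions: Iterable[Iterable[Event]]) -> set[str]:
    """Collect the search terms (not the app names) across all sessions."""
    return {term for session in sessions for term, _ in session}


# Illustrative sessions; the terms and app names are made up.
sessions = [
    [("taxi", None), ("ride hailing", "Didi")],
    [("taxi", "Uber")],
]
sequences = [build_sequence(s) for s in sessions]
term_set = build_term_set(sessions)
```

Placing the app name next to the search term that led to the download lets a later embedding step treat the two as co-occurring tokens.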

Optionally, training on the search term sequence set to obtain the N-dimensional search term vector file includes:

treating each search term in the search term sequence set as a single word, and training on the search term sequence set with the deep learning toolkit word2vec to generate the N-dimensional search term vector file.

Optionally, for each search term in the search term set, calculating the degree of association between that search term and each of the other search terms according to the N-dimensional search term vector file, and taking the other search terms whose degree of association meets the preset condition as the original corpus of that search term, includes:

applying the KNN algorithm to the search term set and the N-dimensional search term vector file, and calculating the distance between every two search terms in the search term set according to the N-dimensional search term vector file;

for each search term in the search term set, sorting the other search terms by their distance to that search term in descending order, and selecting the top first-preset-threshold number of search terms as the original corpus of that search term.
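A minimal sketch of the neighbour selection follows. It assumes that the "distance" being sorted in descending order is a similarity score in which larger means closer, and uses cosine similarity for that purpose; this is an interpretation, not something stated in the text. It also uses a brute-force scan rather than any particular KNN library, and the vectors are toy values.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def raw_corpus(term: str, vectors: dict[str, list[float]], top_n: int) -> list[str]:
    """Return the top_n other search terms most similar to `term`."""
    scores = [
        (other, cosine(vectors[term], vec))
        for other, vec in vectors.items()
        if other != term
    ]
    # Descending order of the score, as in the text; top_n plays the role
    # of the first preset threshold.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scores[:top_n]]


# Toy 2-dimensional vectors standing in for the N-dimensional vector file.
vectors = {
    "taxi": [1.0, 0.0],
    "ride hailing": [0.9, 0.1],
    "recipes": [0.0, 1.0],
}
neighbours = raw_corpus("taxi", vectors, top_n=1)
```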

Optionally, preprocessing the original corpus set includes:

in the original corpus set, for each original corpus, performing word segmentation on the original corpus to obtain a segmentation result containing multiple terms; finding phrases formed by adjacent terms in the segmentation result; and retaining the phrases, together with the terms in the segmentation result that are nouns or verbs, as the keywords retained for that original corpus.

Optionally, finding phrases formed by adjacent terms in the segmentation result includes:

calculating the cPMId value of every two adjacent terms in the segmentation result, and determining that two adjacent terms form a phrase when their cPMId value is greater than a second preset threshold.
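The phrase test can be sketched as follows. This section only names cPMId and does not give its formula, so the sketch assumes one published document-count form of the measure, in which the expected co-occurrence count is padded with a significance term controlled by a parameter δ. The formula, the value δ = 0.9, the threshold, and the counts are all assumptions for illustration.

```python
import math


def cpmid(d_xy: int, d_x: int, d_y: int, total_docs: int, delta: float = 0.9) -> float:
    """Assumed cPMId form: log(d(x,y) / (d(x)*d(y)/D + sqrt(d(x)) * sqrt(ln(delta) / -2))).

    d_xy: documents containing both terms adjacently; d_x, d_y: documents
    containing each term; total_docs: corpus size D.
    """
    if d_xy == 0:
        return float("-inf")
    expected = d_x * d_y / total_docs
    slack = math.sqrt(d_x) * math.sqrt(math.log(delta) / -2.0)
    return math.log(d_xy / (expected + slack))


def is_phrase(d_xy: int, d_x: int, d_y: int, total_docs: int, threshold: float) -> bool:
    """Adjacent terms form a phrase when their score exceeds the threshold."""
    return cpmid(d_xy, d_x, d_y, total_docs) > threshold


# Made-up counts: a pair that almost always occurs together versus a rare pairing.
strong = cpmid(d_xy=50, d_x=60, d_y=80, total_docs=10_000)
weak = cpmid(d_xy=1, d_x=60, d_y=80, total_docs=10_000)
```

Under these assumptions, the significance term keeps low-frequency pairs from scoring highly on a handful of chance co-occurrences.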

Optionally, preprocessing the original corpus set further includes:

taking the keywords retained for the original corpus of each search term as the first-stage training corpus of that search term;

forming a first-stage training corpus set from the first-stage training corpora of the search terms, and performing data cleaning on the keywords in the first-stage training corpus set.

Optionally, performing data cleaning on the keywords in the first-stage training corpus set includes:

in the first-stage training corpus set, for the first-stage training corpus of each search term, calculating the TF-IDF value of each keyword in that first-stage training corpus, and deleting the keywords whose TF-IDF value is higher than a third preset threshold and/or lower than a fourth preset threshold, to obtain the training corpus of that search term;

the training corpora of the search terms form the training corpus set.
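The data cleaning step can be sketched as a two-sided TF-IDF filter. The sketch treats each search term's first-stage corpus as one document and uses the plain tf × log(N/df) weighting; the weighting variant, the thresholds, and the sample keywords are assumptions.

```python
import math
from collections import Counter


def tfidf_filter(
    corpora: dict[str, list[str]], low: float, high: float
) -> dict[str, list[str]]:
    """Drop keywords whose TF-IDF is above `high` or below `low`.

    `high` plays the role of the third preset threshold and `low` the fourth.
    """
    n_docs = len(corpora)
    # Document frequency: in how many search terms' corpora a keyword appears.
    doc_freq = Counter(kw for kws in corpora.values() for kw in set(kws))
    cleaned: dict[str, list[str]] = {}
    for term, kws in corpora.items():
        counts = Counter(kws)
        kept = []
        for kw in kws:
            tf = counts[kw] / len(kws)
            idf = math.log(n_docs / doc_freq[kw])
            if low <= tf * idf <= high:
                kept.append(kw)
        cleaned[term] = kept
    return cleaned


# Made-up first-stage corpora keyed by search term.
corpora = {
    "taxi": ["ride", "ride", "app", "car"],
    "recipes": ["cook", "app", "food"],
    "music": ["song", "app", "play"],
}
cleaned = tfidf_filter(corpora, low=0.1, high=0.5)
```

In this toy data "app" appears in every corpus, so its IDF is zero and it is removed as uninformative, while "ride" dominates its own corpus and is removed by the upper bound.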

Optionally, calculating the tag system of each search term according to the search term-topic probability distribution and the topic-keyword probability distribution includes:

calculating a search term-keyword probability distribution according to the search term-topic probability distribution and the topic-keyword probability distribution;

according to the search term-keyword probability distribution, for each search term, sorting the keywords by their probability with respect to that search term in descending order, and selecting the top fifth-preset-threshold number of keywords.

Optionally, calculating the search term-keyword probability distribution according to the search term-topic probability distribution and the topic-keyword probability distribution includes:

for each search term, obtaining the probability of each topic with respect to that search term from the search term-topic probability distribution;

for each topic, obtaining the probability of each keyword with respect to that topic from the topic-keyword probability distribution;

then, for each keyword, taking the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to a search term as the topic-based probability of the keyword with respect to the search term, and taking the sum of the keyword's topic-based probabilities over all topics as the probability of the keyword with respect to the search term.
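The calculation above amounts to summing over topics, p(keyword | search term) = Σ_topic p(keyword | topic) · p(topic | search term), followed by a top-n selection. A minimal sketch, with made-up distributions standing in for the LDA output and hypothetical function names:

```python
def term_keyword_probs(
    topic_given_term: dict[str, dict[str, float]],
    keyword_given_topic: dict[str, dict[str, float]],
) -> dict[str, dict[str, float]]:
    """Combine the two LDA distributions into p(keyword | search term)."""
    result: dict[str, dict[str, float]] = {}
    for term, topics in topic_given_term.items():
        scores: dict[str, float] = {}
        for topic, p_topic in topics.items():
            for keyword, p_keyword in keyword_given_topic[topic].items():
                # Topic-based probability, accumulated over all topics.
                scores[keyword] = scores.get(keyword, 0.0) + p_keyword * p_topic
        result[term] = scores
    return result


def top_keywords(scores: dict[str, float], n: int) -> list[str]:
    """Keywords sorted by probability, descending; n is the fifth preset threshold."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Toy distributions: one search term, two topics, two keywords.
topic_given_term = {"taxi": {"t0": 0.8, "t1": 0.2}}
keyword_given_topic = {
    "t0": {"ride": 0.7, "map": 0.3},
    "t1": {"ride": 0.1, "map": 0.9},
}
probs = term_keyword_probs(topic_given_term, keyword_given_topic)
```

Here "ride" scores 0.7 × 0.8 + 0.1 × 0.2 = 0.58 and "map" scores 0.3 × 0.8 + 0.9 × 0.2 = 0.42.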

Optionally, calculating the tag system of each search term according to the search term-topic probability distribution and the topic-keyword probability distribution further includes:

taking the top fifth-preset-threshold number of keywords selected for each search term as the first-stage tag system of that search term;

for the first-stage tag system of each search term, calculating the semantic relationship value between each keyword in the first-stage tag system and the search term; for each keyword, taking the product of the keyword's semantic relationship value and the keyword's probability with respect to the search term as the corrected probability of the keyword with respect to the search term; sorting the keywords in the first-stage tag system by their corrected probability with respect to the search term in descending order; and selecting the top sixth-preset-threshold number of keywords to form the tag system of the search term.
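The correction and re-ranking step can be sketched as below. The probabilities and semantic relationship values are made-up inputs, and the function name is hypothetical; how the semantic relationship value itself is obtained is outside this sketch.

```python
def rerank_by_relation(
    probabilities: dict[str, float],
    relation: dict[str, float],
    top_n: int,
) -> list[tuple[str, float]]:
    """Multiply each keyword's probability by its semantic relationship value
    and keep the top_n keywords by corrected probability (descending)."""
    corrected = {
        keyword: prob * relation.get(keyword, 0.0)
        for keyword, prob in probabilities.items()
    }
    ranked = sorted(corrected, key=corrected.get, reverse=True)
    # top_n plays the role of the sixth preset threshold.
    return [(keyword, corrected[keyword]) for keyword in ranked[:top_n]]


# First-stage tag system for one search term (illustrative values).
first_stage = {"ride": 0.58, "map": 0.42, "game": 0.30}
relation = {"ride": 0.9, "map": 0.5, "game": 0.1}
tags = rerank_by_relation(first_stage, relation, top_n=2)
```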

Optionally, calculating the semantic relationship value between each keyword in the first-stage tag system of the search term and the search term includes:

obtaining, according to the search terms in each query session, the search term sequence set corresponding to multiple query sessions, and training on the search term sequence set to obtain an N-dimensional keyword vector file;

according to the N-dimensional keyword vector file, calculating the word vector of the keyword and the word vector of each term in the search term;

calculating the cosine similarity between the word vector of the keyword and the word vector of each term, as the semantic relationship value between the keyword and that term;

taking the sum of the semantic relationship values between the keyword and the individual terms as the semantic relationship value between the keyword and the search term.
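A minimal sketch of the semantic relationship value: the sum of cosine similarities between the keyword's vector and the vector of each term in the segmented search term. The toy vectors stand in for the N-dimensional keyword vector file, and skipping terms that have no vector is an assumption of this sketch.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_relation(
    keyword: str, search_term_tokens: list[str], vectors: dict[str, list[float]]
) -> float:
    """Sum of cosine similarities between the keyword and each term of the search term."""
    keyword_vec = vectors[keyword]
    return sum(
        cosine(keyword_vec, vectors[token])
        for token in search_term_tokens
        if token in vectors  # assumption: out-of-vocabulary terms are skipped
    )


# Toy 2-dimensional keyword vectors.
vectors = {"ride": [1.0, 0.0], "taxi": [1.0, 0.0], "app": [0.0, 1.0]}
value = semantic_relation("ride", ["taxi", "app"], vectors)
```

Because the value is a sum rather than a mean, search terms with more segmented terms can produce larger values.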

Optionally, training on the search term sequence set to obtain the N-dimensional keyword vector file includes:

performing word segmentation on the search term sequence set, and training on the segmented search term sequence set with the deep learning toolkit word2vec to generate the N-dimensional keyword vector file.

Optionally, calculating the tag system of each search term further includes:

taking the top sixth-preset-threshold number of keywords selected for each search term as the second-stage tag system of that search term;

for the second-stage tag system of each search term, computing the TF-IDF value of each keyword in the second-stage tag system within the training corpus of that search term; for each keyword, taking the product of the keyword's probability with respect to the search term and that TF-IDF value as the second corrected probability of the keyword with respect to the search term; sorting the keywords in the second-stage tag system by their second corrected probability with respect to the search term in descending order; and selecting the top K keywords to form the tag system of the search term.

Optionally, selecting the top K keywords to form the tag system of the search term includes:

obtaining, from the query session log of the application search engine, the number of queries for the search term within a preset time period;

selecting the top K keywords according to the number of queries to form the tag system of the search term, where K is a piecewise-linear function of the number of queries for the search term.
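Choosing K as a piecewise-linear function of the query count can be sketched as below. The breakpoints are hypothetical, since the section does not give the actual function; the sketch only assumes that K is interpolated linearly between breakpoints and clamped outside them.

```python
def k_for_query_count(count: int, breakpoints: list[tuple[int, int]]) -> int:
    """Piecewise-linear interpolation of K over sorted (query_count, k) breakpoints."""
    if count <= breakpoints[0][0]:
        return breakpoints[0][1]
    for (x0, y0), (x1, y1) in zip(breakpoints, breakpoints[1:]):
        if count <= x1:
            return round(y0 + (y1 - y0) * (count - x0) / (x1 - x0))
    # Beyond the last breakpoint, K stays at its final value.
    return breakpoints[-1][1]


# Hypothetical breakpoints: rarely queried terms keep 3 tags, popular ones up to 10.
BREAKPOINTS = [(0, 3), (100, 5), (1000, 10)]

ranked_keywords = ["ride", "taxi", "map", "car", "driver", "fare"]
k = k_for_query_count(50, BREAKPOINTS)
tag_system = ranked_keywords[:k]
```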

According to another aspect of the present invention, an application search method is provided, the method comprising:

constructing a search term tag database that contains the tag systems of multiple search terms;

receiving the current search term uploaded by a client, and obtaining the tag system of the current search term according to the search term tag database;

calculating the degree of association between the tag system of the current search term and the tag system of each application;

when the degree of association between the tag system of the current search term and the tag system of an application meets a preset condition, returning information about that application to the client for display;

wherein the search term tag database is constructed by any one of the methods described in the first aspect of the present invention.
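The matching step of the search method can be sketched as follows. The section does not define how the degree of association between two tag systems is computed, so the sketch assumes Jaccard overlap of the tag sets and a fixed threshold as the preset condition; the app names and tags are made up.

```python
def jaccard(a: list[str], b: list[str]) -> float:
    """Assumed association measure: overlap of two tag sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def search_apps(
    query_tags: list[str], app_tags: dict[str, list[str]], threshold: float
) -> list[tuple[str, float]]:
    """Return (app, score) pairs whose association meets the threshold, best first."""
    hits = [(app, jaccard(query_tags, tags)) for app, tags in app_tags.items()]
    hits = [(app, score) for app, score in hits if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Tag system of the current search term and of two illustrative apps.
query_tags = ["ride", "taxi", "map"]
app_tags = {
    "Didi": ["ride", "taxi", "payment"],
    "Cookbook": ["recipe", "food"],
}
results = search_apps(query_tags, app_tags, threshold=0.3)
```

Because search terms and apps share one tag vocabulary, matching reduces to comparing two tag lists, whatever association measure is actually used.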

Optionally, obtaining the tag system of the current search term according to the search term tag database includes:

calculating the semantic similarity between the current search term and each search term in the search term tag database, sorting by semantic similarity in descending order, and selecting the top first-preset-threshold number of search terms;

obtaining the tag system of the current search term according to the tag systems of the selected search terms.

Optionally, calculating the semantic similarity between the current search term and each search term in the search term tag database includes: calculating the Euclidean distance between the current search term and each search term in the search term tag database, and taking the Euclidean distance between each search term and the current search term as the semantic similarity corresponding to that search term;

obtaining the tag system of the current search term according to the tag systems of the selected search terms includes: using the semantic similarity corresponding to each search term as the weight of every tag in that search term's tag system; for the tags across the tag systems of the selected search terms, adding the weights of identical tags to obtain the final weight of each tag; and sorting by final weight in descending order and selecting the top second-preset-threshold number of tags to form the tag system of the current search term.
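The weighted merge of neighbouring tag systems can be sketched as below. The similarity values are supplied directly as inputs rather than computed, and the tags are made up; the function name is hypothetical.

```python
from collections import defaultdict


def merge_tag_systems(
    neighbours: list[tuple[float, list[str]]], top_n: int
) -> list[str]:
    """Merge the tag systems of the selected search terms.

    Each neighbour is (semantic_similarity, tags). Every tag inherits its
    search term's similarity as a weight; weights of identical tags are
    summed, and the top_n tags by final weight are kept.
    """
    weights: dict[str, float] = defaultdict(float)
    for similarity, tags in neighbours:
        for tag in tags:
            weights[tag] += similarity
    ranked = sorted(weights.items(), key=lambda pair: pair[1], reverse=True)
    # top_n plays the role of the second preset threshold.
    return [tag for tag, _ in ranked[:top_n]]


# Illustrative neighbours of the current search term.
neighbours = [
    (0.9, ["ride", "taxi"]),
    (0.6, ["ride", "map"]),
    (0.2, ["game"]),
]
current_tags = merge_tag_systems(neighbours, top_n=3)
```

A tag shared by several similar search terms accumulates weight, so it ranks above a tag contributed by a single neighbour.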

According to another aspect of the present invention, a device for recognizing application search intent is provided, the device comprising:

an acquisition unit, adapted to obtain the search terms in each query session from the query session log of an application search engine;

a mining unit, adapted to mine a tag system for each search term according to the search terms in each query session and a preset strategy;

a recognition unit, adapted to identify, according to the tag system of each search term, the application search intent corresponding to that search term.

Optionally, the mining unit is adapted to obtain a training corpus set according to the search terms in each query session; input the training corpus set into an LDA model for training to obtain the search term-topic probability distribution and the topic-keyword probability distribution output by the LDA model; and calculate the tag system of each search term according to the search term-topic probability distribution and the topic-keyword probability distribution.

Optionally, the mining unit is adapted to obtain the original corpus of each search term according to the search terms in each query session; form an original corpus set from the original corpora of the search terms; and preprocess the original corpus set to obtain the training corpus set.

Optionally, the mining unit is adapted to obtain, according to the search terms in each query session, a search term sequence set corresponding to multiple query sessions and a search term set corresponding to the multiple query sessions; train on the search term sequence set to obtain an N-dimensional search term vector file; and, for each search term in the search term set, calculate the degree of association between that search term and each of the other search terms according to the N-dimensional search term vector file, and take the other search terms whose degree of association with that search term meets a preset condition as the original corpus of that search term.

Optionally, the mining unit is adapted to, for each query session, arrange the search terms in the query session into a sequence in order; if a search term in the sequence corresponds to an application download operation, insert the name of the downloaded application into the sequence immediately after that search term; obtain the search term sequence corresponding to the query session; and take the set of search terms in the multiple query sessions as the search term set corresponding to the multiple query sessions.

Optionally, the mining unit is adapted to treat each search term in the search term sequence set as a single word, and train on the search term sequence set with the deep learning toolkit word2vec to generate the N-dimensional search term vector file.

Optionally, the mining unit is adapted to apply the KNN algorithm to the search term set and the N-dimensional search term vector file, and calculate the distance between every two search terms in the search term set according to the N-dimensional search term vector file; and, for each search term in the search term set, sort the other search terms by their distance to that search term in descending order and select the top first-preset-threshold number of search terms as the original corpus of that search term.

Optionally, the mining unit is adapted to, in the original corpus set, for each original corpus, perform word segmentation on the original corpus to obtain a segmentation result containing multiple terms; find phrases formed by adjacent terms in the segmentation result; and retain the phrases, together with the terms in the segmentation result that are nouns or verbs, as the keywords retained for that original corpus.

Optionally, the mining unit is adapted to calculate the cPMId value of every two adjacent terms in the segmentation result, and determine that two adjacent terms form a phrase when their cPMId value is greater than a second preset threshold.

Optionally, the mining unit is further adapted to take the keywords retained for the original corpus of each search term as the first-stage training corpus of that search term; form a first-stage training corpus set from the first-stage training corpora of the search terms; and perform data cleaning on the keywords in the first-stage training corpus set.

Optionally, the mining unit is adapted to, in the first-stage training corpus set, for the first-stage training corpus of each search term, calculate the TF-IDF value of each keyword in that first-stage training corpus; delete the keywords whose TF-IDF value is higher than a third preset threshold and/or lower than a fourth preset threshold to obtain the training corpus of that search term; the training corpora of the search terms form the training corpus set.

Optionally, the mining unit is adapted to calculate a search term-keyword probability distribution according to the search term-topic probability distribution and the topic-keyword probability distribution; and, according to the search term-keyword probability distribution, for each search term, sort the keywords by their probability with respect to that search term in descending order and select the top fifth-preset-threshold number of keywords.

Optionally, the mining unit is adapted to, for each search term, obtain the probability of each topic with respect to that search term from the search term-topic probability distribution; for each topic, obtain the probability of each keyword with respect to that topic from the topic-keyword probability distribution; then, for each keyword, take the product of the probability of the keyword with respect to a topic and the probability of that topic with respect to a search term as the topic-based probability of the keyword with respect to the search term, and take the sum of the keyword's topic-based probabilities over all topics as the probability of the keyword with respect to the search term.

Optionally, the mining unit is further adapted to take the top fifth-preset-threshold number of keywords selected for each search term as the first-stage tag system of that search term; for the first-stage tag system of each search term, calculate the semantic relationship value between each keyword in the first-stage tag system and the search term; for each keyword, take the product of the keyword's semantic relationship value and the keyword's probability with respect to the search term as the corrected probability of the keyword with respect to the search term; sort the keywords in the first-stage tag system by their corrected probability with respect to the search term in descending order; and select the top sixth-preset-threshold number of keywords to form the tag system of the search term.

Optionally, the mining unit is adapted to obtain, according to the search terms in each query session, the search term sequence set corresponding to multiple query sessions; train on the search term sequence set to obtain an N-dimensional keyword vector file; according to the N-dimensional keyword vector file, calculate the word vector of the keyword and the word vector of each term in the search term; calculate the cosine similarity between the word vector of the keyword and the word vector of each term, as the semantic relationship value between the keyword and that term; and take the sum of the semantic relationship values between the keyword and the individual terms as the semantic relationship value between the keyword and the search term.

Optionally, the mining unit is adapted to perform word segmentation on the search term sequence set, and train on the segmented search term sequence set with the deep learning toolkit word2vec to generate the N-dimensional keyword vector file.

Optionally, the mining unit is further adapted to take the top sixth-preset-threshold number of keywords selected for each search term as the second-stage tag system of that search term; for the second-stage tag system of each search term, compute the TF-IDF value of each keyword in the second-stage tag system within the training corpus of that search term; for each keyword, take the product of the keyword's probability with respect to the search term and that TF-IDF value as the second corrected probability of the keyword with respect to the search term; sort the keywords in the second-stage tag system by their second corrected probability with respect to the search term in descending order; and select the top K keywords to form the tag system of the search term.

Optionally, the mining unit is adapted to obtain, from the query session log of the application search engine, the number of queries for the search term within a preset time period; and select the top K keywords according to the number of queries to form the tag system of the search term, where K is a piecewise-linear function of the number of queries for the search term.

According to yet another aspect of the present invention, an application search server is provided, the server comprising:

a database construction unit, adapted to build a search term tag database that contains the tag systems of multiple search terms;

an interaction unit, adapted to receive the current search term uploaded by a client;

a search processing unit, adapted to obtain the tag system of the current search term from the search term tag database, and to compute the degree of association between the tag system of the current search term and the tag system of each application;

the interaction unit being further adapted to return information about an application to the client for display when the degree of association between the tag system of the current search term and the tag system of that application meets a preset condition;

wherein the database construction unit builds the search term tag database by the same process as the application search intent recognition device of any one of claims 22-39.

Optionally, the search processing unit is adapted to compute the semantic similarity between the current search term and each search term in the search term tag database, sort by semantic similarity in descending order, select the top first-preset-threshold search terms, and obtain the tag system of the current search term from the tag systems of the selected search terms.

Optionally, the search processing unit is adapted to: compute the Euclidean distance between the current search term and each search term in the search term tag database, and take the Euclidean distance between each search term and the current search term as the semantic similarity of that search term; use the semantic similarity of each search term as the weight of every tag in that search term's tag system; across the tags of the tag systems of the selected search terms, add up the weights of identical tags to obtain the final weight of each tag; and sort by final weight in descending order and select the top second-preset-threshold tags to form the tag system of the current search term.
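The weight aggregation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `aggregate_tags`, the input layout (a list of `(similarity, tags)` pairs for the already selected search terms) and the alphabetical tie-break are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_tags(neighbors, top_n):
    """Merge the tag systems of the selected search terms.

    neighbors: list of (similarity, tags) pairs, one per selected search term
               (hypothetical layout); `similarity` is the score used as the
               weight of every tag in that search term's tag system.
    top_n:     number of tags to keep (the "second preset threshold").
    """
    final_weight = defaultdict(float)
    for similarity, tags in neighbors:
        for tag in tags:
            # Identical tags from different search terms have their weights summed.
            final_weight[tag] += similarity
    # Sort by final weight, largest first; ties are broken alphabetically.
    ranked = sorted(final_weight.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top_n]]


if __name__ == "__main__":
    selected = [(0.9, ["game", "puzzle"]), (0.5, ["game", "chess"])]
    print(aggregate_tags(selected, 2))
```

A tag that appears in the tag systems of several similar search terms accumulates weight from each of them, so it tends to rank ahead of a tag contributed by a single search term.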

According to the solution of the present invention, a user intent recognition method matched to the app tag system, the tag method, is proposed. It is flexible and effective: it accurately mines the tag system corresponding to each search term and builds a search term tag database, so that a search term entered by a user can be described accurately by a tag system, which solves the problem of user intent recognition. Further, user intent and apps can be mapped into the same tag system, and search matching within that system yields more accurate application search results. The solution thus addresses both user intent recognition and relevance computation in the application search engine, and lays the foundation for function search, a core technology of application search engines.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.

Description of the Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:

Fig. 1 shows a flowchart of a method for identifying application search intent according to an embodiment of the present invention;

Fig. 2 shows a flowchart of an application search method according to an embodiment of the present invention;

Fig. 3 shows a schematic diagram of a device for identifying application search intent according to an embodiment of the present invention; and

Fig. 4 shows a schematic diagram of an application search server according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.

In the following, app denotes an application, query denotes a search term, tag denotes a tag, and Session denotes a query session.

The present invention proposes a new user intent recognition method for application search engines, the tag method, which flexibly and effectively expresses a user's fine-grained query intent. It builds a tag system for user intent using unsupervised machine learning, abandons the traditional classification approach to user intent, and implements an automated user intent mining pipeline that generates user intent tag lists with high precision and recall. User queries and apps are mapped into a shared tag system, which addresses both user intent recognition and relevance computation in the application search engine, with very good results.

Fig. 1 shows a flowchart of a method for identifying application search intent according to an embodiment of the present invention. As shown in Fig. 1, the method includes:

Step S110: obtaining the search terms in each query session from the query session log of an application search engine;

Step S120: mining the tag system of each search term according to the search terms in each query session and a preset strategy;

Step S130: identifying, from the tag system of each search term, the application search intent corresponding to that search term.

Traditional user intent recognition is a classification method designed for web pages and does not suit the app scenario. Each application has a fixed application domain and offers people some specific function, so using tags to mine a user's fine-grained functional needs is appropriate, whereas classification-based methods are too coarse and broad to be applicable. This solution proposes a user intent recognition method matched to the application tag system, the tag method. It is flexible and effective and maps user intent and applications into the same tag system, which addresses both user intent recognition and relevance computation in the application search engine, and forms the basis of function search, a core technology of application search engines.

A user's search term is usually a short text. The search term a user constructs from the need they have in mind has sparse features and cannot describe the need completely. However, when a user is looking for apps for a single functional scenario within a short period, they tend to rewrite the search term repeatedly around that single need, and the queries issued usually have strong semantic associations with one another. This is an important characteristic of application search engines.

In a search engine service, the system automatically records information related to the user's searches and saves it to the query log. For example, a user opens a Baidu search page, enters search terms such as "game", "game software", "fun games" and "game app download" one after another and reaches a results page, or, having reached a results page, continues to enter further search terms, until the user finishes the search and closes the Baidu search page. The whole process is called one query session.

In one embodiment of the present invention, mining the tag system of each search term according to the search terms in each query session and a preset strategy in step S120 includes: obtaining a training corpus set from the search terms in each query session; inputting the training corpus set into an LDA model for training to obtain the search term-topic probability distribution and the topic-keyword probability distribution output by the LDA model; and computing the tag system of each search term from the search term-topic probability distribution and the topic-keyword probability distribution.

In obtaining the training corpus, the technical difficulty is expanding the short text of a query. Once expanded into a long text, a query can be treated as a document; this is the key to using the LDA topic model effectively and so to producing intent tags with high precision and recall. Intent tags are divided into category tags and function tags: category tags reflect the application domain of the user's need, and function tags reflect the user's specific need.

Obtaining the training corpus set from the search terms in each query session includes:

obtaining the original corpus of each search term from the search terms in each query session, the original corpora of the search terms forming an original corpus set; and preprocessing the original corpus set to obtain the training corpus set. Specifically, obtaining the original corpus of each search term from the search terms in each query session includes: obtaining, from the search terms in each query session, a set of search term sequences corresponding to multiple query sessions; and obtaining a search term set corresponding to the multiple query sessions.

The sequence of search terms inside a query session is preserved, and each search term is treated as a whole. If the user downloads apps under a query, the app names are appended immediately after that query in the sequence. For example, suppose a user's session sequence is query1, query2, query3 and the user downloads app1 after entering query2; app1 is then written after query2 and before query3, giving query1, query2, app1, query3. Each session sequence occupies one line and is written to the file session_query-app_list.txt, which is the set of search term sequences. All queries are also written to another file, query_all.txt, which is the search term set.
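As an illustration of the session handling just described, the sketch below turns one session into its search term sequence and collects the search term set. The function names and the in-memory session layout (a list of `(query, downloaded_apps)` pairs) are hypothetical, and de-duplicating the queries is an assumption; the text only says that all queries are written to query_all.txt.

```python
def session_to_sequence(session):
    """session: ordered list of (query, downloaded_apps) pairs (hypothetical layout).

    Each downloaded app name is placed immediately after the query under which
    it was downloaded, as in the query1, query2, app1, query3 example.
    """
    sequence = []
    for query, downloaded_apps in session:
        sequence.append(query)
        sequence.extend(downloaded_apps)
    return sequence


def build_corpora(sessions):
    """Return (one sequence per session, list of distinct queries in first-seen order)."""
    sequences = [session_to_sequence(session) for session in sessions]
    seen = set()
    all_queries = []
    for session in sessions:
        for query, _ in session:
            if query not in seen:  # de-duplication is an assumption of this sketch
                seen.add(query)
                all_queries.append(query)
    return sequences, all_queries


if __name__ == "__main__":
    sessions = [[("query1", []), ("query2", ["app1"]), ("query3", [])]]
    sequences, queries = build_corpora(sessions)
    # One line per session, as in session_query-app_list.txt.
    for sequence in sequences:
        print(" ".join(sequence))
    print(queries)
```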

The set of search term sequences is trained on to obtain an N-dimensional search term vector file; for each search term in the search term set, the degree of association between it and every other search term is computed from the N-dimensional search term vector file; and the other search terms whose degree of association with the search term meets a preset condition are taken as the original corpus of that search term.

In one embodiment of the present invention, obtaining the set of search term sequences corresponding to multiple query sessions includes: for each query session, arranging the search terms of the session in order into a sequence; if a search term in the sequence corresponds to an application download, inserting the name of the downloaded application immediately after that search term in the sequence; and thereby obtaining the search term sequence of the query session. Obtaining the search term set corresponding to the multiple query sessions includes: taking the collection of search terms in the multiple query sessions as the search term set corresponding to those sessions.

For example, a user enters "search term 1", "search term 2" and "search term 3" in turn during one query session and downloads app1 after entering "search term 2". The search term sequence of the session is therefore: search term 1, search term 2, app1, search term 3. The search term sequence of each query session occupies one line, and the set of search term sequences for multiple query sessions occupies multiple lines.

Training on the set of search term sequences to obtain the N-dimensional search term vector file includes: treating each search term in the set of search term sequences as a single word, and training on the set with the deep learning toolkit word2vec to generate the N-dimensional search term vector file. For example, word2vec is used to train 300-dimensional query vectors, producing a query vector file, query_w2v_300.dict, which is the search term vector file.

In practice, the search terms users enter when looking for an application take many forms: a noun (such as "game"), a phrase (such as "fun games"), or a sentence (such as "I want to download a fun game.").

In one embodiment of the present invention, the search term vector file obtained above serves as the basis for computing a word vector for each search term in the search term set. Computing, for each search term in the search term set, its degree of association with every other search term from the N-dimensional search term vector file, and taking the other search terms whose degree of association meets the preset condition as the original corpus of that search term, specifically includes:

applying the KNN algorithm to the search term set and the N-dimensional search term vector file, and computing the distance between every two search terms in the search term set from the N-dimensional search term vector file; and, for each search term in the search term set, sorting the other search terms by their distance from it in descending order and selecting the top first-preset-threshold search terms as the original corpus of that search term.
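A brute-force version of this neighbor selection might look like the sketch below. It assumes the search term vectors are already loaded into a dictionary (the name `vectors` and the toy 2-dimensional values are hypothetical). The paragraph's wording sorts by distance from largest to smallest, whereas the values in Table 1 increase down the list; the sketch assumes the usual nearest-neighbor reading, in which the smallest Euclidean distance comes first.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbors(target, vectors, k):
    """Return the k entries closest to `target` as (name, distance) pairs.

    vectors: dict mapping a search term or app name to its vector
             (hypothetical in-memory form of the search term vector file).
    k:       the "first preset threshold", 10 in the Table 1 example.
    """
    scored = [
        (euclidean(vectors[target], vec), name)
        for name, vec in vectors.items()
        if name != target
    ]
    scored.sort()  # assumption: ascending distance, nearest first
    return [(name, dist) for dist, name in scored[:k]]


if __name__ == "__main__":
    toy = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [1.0, 0.0], "d": [0.0, 2.0]}
    print(nearest_neighbors("a", toy, 2))
```

The neighbors returned for a search term then serve as its original corpus, which is how a short query is expanded into a longer document.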

Table 1 shows the 10 nearest neighbors of the search term "搜狗" (Sogou) in one embodiment of the present invention. The nearest neighbors include both search terms and app names, such as "搜狗手机输入法" (Sogou Mobile Input Method) and "搜狗输入法" (Sogou Input Method) in the first column of Table 1. Here 10 is the first preset threshold in this example, and the second column of Table 1 gives the distance between each nearest neighbor and the search term "搜狗".

Table 1

Nearest neighbor | Statistical indicators (based on Euclidean distance)
搜狗手机输入法 (Sogou Mobile Input Method) | 38 303.827 0.838104
搜狗输入法 (Sogou Input Method) | 26 323.494 0.845153
Sogou | 20332 372.525 0.778589
收狗 (likely a mistyped "搜狗") | 6986 385.809 0.76965
搜狗拼音 (Sogou Pinyin) | 14577 410.986 0.753037
搜狗输入法小米版 (Sogou Input Method, Xiaomi Edition) | 4042 423.929 0.746941
搜狗拼音输入法 (Sogou Pinyin Input Method) | 4927 435.273 0.736172
搜狐输入法 (Sohu Input Method) | 18233 452.955 0.724872
搜狗输入 (Sogou Input) | 10274 455.505 0.720034
手机搜狗输入法 (Mobile Sogou Input Method) | 3075 476.93 0.721099

Table 2 shows the 10 nearest neighbors of the search term "彩票开奖查询" (lottery draw results lookup) in one embodiment of the present invention; the columns have the same meaning as in Table 1 and are not described again.

Table 2

In one embodiment of the present invention, after the original corpus set corresponding to the search terms is obtained, preprocessing the original corpus set includes:

for each original corpus in the original corpus set, performing word segmentation on the original corpus to obtain a segmentation result containing multiple terms; finding phrases formed by adjacent terms in the segmentation result; and retaining the phrases, together with the terms in the segmentation result that are nouns and those that are verbs, as the keywords retained for that original corpus.

For example, if a user enters the search term "download game", the noun term of the search term is "game" and the verb term is "download".

Finding phrases formed by adjacent terms in the segmentation result includes:

Formula 1 gives the cPMId calculation, where d(x, y) is the co-occurrence count of two terms x and y, d(x) is the occurrence count of term x, d(y) is the occurrence count of term y, D is the total number of apps, and δ = 0.7.

Formula 1

For example, the term combinations are sorted by cPMId value in descending order, and those whose cPMId exceeds the threshold of 5 are selected as phrases and merged with the verbs and nouns retained above to generate a new file, query_corpus_seg_nouns_verb_phrase.txt, which is the training corpus after the first preprocessing pass.
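The body of Formula 1 did not survive in this text. As a hedged illustration only, the sketch below uses the cPMId definition commonly given in the literature on corpus-level significant co-occurrence, log of d(x, y) over d(x)·d(y)/D plus √d(x)·√(ln δ / (−2)), which takes the same quantities listed above. Whether the patent's Formula 1 is exactly this expression, and which logarithm base it uses, are assumptions; the threshold of 5 depends on that base.

```python
import math


def cpmid(d_xy, d_x, d_y, total_docs, delta=0.7):
    """cPMId score for a pair of adjacent terms (assumed form, natural log).

    d_xy:       co-occurrence count of terms x and y
    d_x, d_y:   occurrence counts of x and y
    total_docs: D, the total number of apps
    delta:      significance parameter, 0.7 in the text
    """
    if d_xy == 0:
        return float("-inf")  # the pair never co-occurs, so it cannot form a phrase
    expected = d_x * d_y / total_docs
    # ln(delta) is negative for 0 < delta < 1, so the ratio under the root is positive.
    correction = math.sqrt(d_x) * math.sqrt(math.log(delta) / -2.0)
    return math.log(d_xy / (expected + correction))


def select_phrases(pair_counts, term_counts, total_docs, threshold):
    """Keep adjacent-term pairs whose score exceeds the threshold, best first."""
    scored = [
        (cpmid(count, term_counts[x], term_counts[y], total_docs), (x, y))
        for (x, y), count in pair_counts.items()
    ]
    return [pair for score, pair in sorted(scored, reverse=True) if score > threshold]
```

The correction term penalizes pairs whose co-occurrence count is small relative to how often the first term appears, so rare chance co-occurrences are less likely to be promoted to phrases.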

Further, in one embodiment of the present invention, preprocessing the original corpus set also includes: taking the keywords retained for the original corpus of each search term as the first-stage training corpus of that search term, the first-stage training corpora of the search terms forming a first-stage training corpus set; and performing data cleaning on the keywords in the first-stage training corpus set.

Specifically, performing data cleaning on the keywords in the first-stage training corpus set includes: for the first-stage training corpus of each search term in the first-stage training corpus set, computing the TF-IDF value of each keyword in that first-stage training corpus; and deleting the keywords whose TF-IDF value is above a third preset threshold and/or below a fourth preset threshold to obtain the training corpus of the search term, the training corpora of the search terms forming the training corpus set.

This step mines the non-tag words in the first-stage training corpus set for data cleaning. A term that occurs with very high or very low frequency is unlikely to be a tag. The tf-idf statistic is applied to the first-stage training corpus set to compute the tf-idf weight of every term and phrase, and terms or phrases above one threshold or below another are treated as non-tag words. These thresholds depend on the specific corpus, so no concrete values are listed here. The non-tag words form a blacklist, black_tag.list; they are filtered out of the first-stage training corpus set to produce a new training corpus set in the format: search_term_id\tterm1 term2 … termn.
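The cleaning step can be sketched as below. The text does not state which tf-idf variant is used, so the formula here (relative term frequency times the natural log of the inverse document frequency), the function names and the threshold values in the usage example are all assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """docs: dict mapping a search term id to its list of keywords.

    Returns {doc_id: {keyword: tf-idf}} using tf = count / doc length and
    idf = ln(N / document frequency); this variant is an assumption.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    scores = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        scores[doc_id] = {
            term: (count / len(terms)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


def clean_corpus(docs, low, high):
    """Drop keywords whose tf-idf is below `low` or above `high`."""
    scores = tfidf_scores(docs)
    return {
        doc_id: [t for t in terms if low <= scores[doc_id][t] <= high]
        for doc_id, terms in docs.items()
    }


if __name__ == "__main__":
    corpus = {"q1": ["download", "game", "game"], "q2": ["download", "music"]}
    # "download" occurs in every document, so its idf (and tf-idf) is 0 and it is removed.
    print(clean_corpus(corpus, low=0.1, high=1.0))
```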

Table 3 shows some of the non-tag words that can be discarded during data cleaning. These words occur either too frequently or too rarely and carry no meaning for user search.

Table 3

After the training corpus set is obtained, the GibbsLDA++ implementation is used for the LDA model. The GibbsLDA++ source code needs to be modified so that every occurrence of the same term in the query corpus is initialized to the same topic. In the original code each term occurrence is randomly initialized to a topic, so a repeated term may be initialized to several different topics. Because term ambiguity is unlikely in a vertical application domain, initializing the same term to the same topic fits the domain and also improves the results of the LDA model. For example, LDA training uses 120 topics and 300 iterations and outputs two sets of data: the topic-term probability distribution and the document-topic probability distribution.
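The initialization change can be illustrated in a few lines. The sketch below is not GibbsLDA++ code (which is written in C++); it is a Python illustration of the idea only, with hypothetical names, showing a topic being drawn once per distinct term and reused for every occurrence of that term.

```python
import random


def init_topic_assignments(docs, num_topics, seed=0):
    """Assign an initial topic to every term occurrence.

    docs: list of documents, each a list of terms.
    Every occurrence of the same term receives the same initial topic,
    instead of an independent random topic per occurrence.
    """
    rng = random.Random(seed)
    term_topic = {}   # distinct term -> its initial topic
    assignments = []  # per document, the topic of each term occurrence
    for terms in docs:
        row = []
        for term in terms:
            if term not in term_topic:
                term_topic[term] = rng.randrange(num_topics)
            row.append(term_topic[term])
        assignments.append(row)
    return assignments, term_topic


if __name__ == "__main__":
    corpus = [["game", "download", "game"], ["game", "music"]]
    z, term_topic = init_topic_assignments(corpus, num_topics=120)
    print(z, term_topic)
```

Gibbs sampling then proceeds as usual from this starting point; only the initial state differs from the stock behavior described above.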

The solution then computes the tag system of each search term from the search term-topic probability distribution and the topic-keyword probability distribution, which includes:

computing a search term-keyword probability distribution from the search term-topic probability distribution and the topic-keyword probability distribution; and, for each search term, sorting the keywords by their probability with respect to the search term in descending order according to the search term-keyword probability distribution and selecting the top fifth-preset-threshold keywords.

Computing the search term-keyword probability distribution from the search term-topic probability distribution and the topic-keyword probability distribution includes: for each search term, obtaining the probability of each topic with respect to the search term from the search term-topic probability distribution; for each topic, obtaining the probability of each keyword with respect to the topic from the topic-keyword probability distribution; then, for each keyword, taking the product of the keyword's probability with respect to a topic and that topic's probability with respect to a search term as the keyword's topic-based probability with respect to the search term; and taking the sum of the keyword's topic-based probabilities over all topics as the keyword's probability with respect to the search term.
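In other words, the probability of a keyword for a search term is the sum over topics of P(topic | search term) × P(keyword | topic). A minimal sketch, assuming the two LDA outputs have been loaded into nested dictionaries (the names and layout are hypothetical):

```python
from collections import defaultdict


def search_term_keyword_probs(topic_given_query, keyword_given_topic):
    """Combine the two LDA outputs for a single search term.

    topic_given_query:   {topic: P(topic | search term)}
    keyword_given_topic: {topic: {keyword: P(keyword | topic)}}
    Returns {keyword: sum over topics of P(topic | search term) * P(keyword | topic)}.
    """
    probs = defaultdict(float)
    for topic, p_topic in topic_given_query.items():
        for keyword, p_keyword in keyword_given_topic.get(topic, {}).items():
            probs[keyword] += p_topic * p_keyword
    return dict(probs)


def top_keywords(probs, n):
    """Top-n keywords by probability, largest first (n is the "fifth preset threshold")."""
    return sorted(probs.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


if __name__ == "__main__":
    topics = {"t1": 0.75, "t2": 0.25}
    keywords = {
        "t1": {"game": 0.5, "chess": 0.5},
        "t2": {"game": 0.25, "music": 0.75},
    }
    print(top_keywords(search_term_keyword_probs(topics, keywords), 2))
```

A keyword that is likely under several of the search term's dominant topics ("game" in the usage example) accumulates probability from each of them.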

这一步即初始LDA tag生成的过程，这一步得到LDA产生的tag。LDA输出的是每个query下的topic概率分布，以及每个topic下的词项概率分布。为了得到每个query的tag，我们分别对topic概率分布、词项概率分布按照概率从大到小逆序排序，我们选择每个query下前50个topic，每个topic下选择前120个词项，词项的概率使用topic的概率进行加权排序，每个tag词项都有一个lda权重，表示在该query下的重要性，按照这个tag权重逆序排序，就得到了LDA产生的tag列表，含有不少噪音，tag的顺序也不准确。This step is the process of initial LDA tag generation, and this step gets the tag generated by LDA. The output of LDA is the topic probability distribution under each query, and the term probability distribution under each topic. In order to get the tag of each query, we sort the topic probability distribution and term probability distribution in reverse order according to the probability from large to small. We select the top 50 topics under each query, and select the top 120 terms under each topic. The probability of the term is weighted and sorted by the probability of the topic. Each tag term has an lda weight, which indicates the importance of the query. According to the reverse order of the tag weight, the tag list generated by LDA is obtained. Less noise, the order of tags is not accurate.

进一步地，要对LDA模型的预测结果进行微调，使得每个query的重要tag的次序更靠前，在本发明的一个实施例中，所述根据所述搜索词-主题概率分布结果和所述主题-关键词概率分布结果，计算得到各搜索词的标签体系还包括：将每个搜索词对应选取的前第五预设阈值数目的关键词作为该搜索词的第一阶段标签体系；对于每个搜索词的第一阶段标签体系，计算该搜索词的第一阶段标签体系中的每个关键词与该搜索词之间的语义关系值；对于每个关键词，将该关键词对应的语义关系值与该关键词关于该搜索词的概率的乘积作为该关键词关于该搜索词的修正概率；将该搜索词的第一阶段标签体系中的各关键词按照关于该搜索词的修正概率从大到小排序，选取前第六预设阈值个关键词构成该搜索词的标签体系。Further, it is necessary to fine-tune the prediction results of the LDA model so that the order of the important tags of each query is higher. In one embodiment of the present invention, the search term-topic probability distribution results and the The topic-keyword probability distribution results, the calculation of the label system of each search term also includes: using the keywords corresponding to the first fifth preset threshold number selected for each search term as the first stage label system of the search term; The first-stage tag system of a search word, calculate the semantic relationship value between each keyword in the first-stage tag system of the search word and the search word; for each keyword, the corresponding semantic value of the keyword The product of the relationship value and the probability of the keyword with respect to the search term is used as the modified probability of the keyword with respect to the search term; each keyword in the first-stage label system of the search term is changed from Sort from large to small, and select the first sixth preset threshold keywords to form the label system of the search term.

其中，计算该搜索词的第一阶段标签体系中的每个关键词与该搜索词之间的语义关系值包括：根据各查询会话中的搜索词，获得多个查询会话对应的搜索词序列集合；对所述搜索词序列集合进行训练得到N维的关键词向量文件；根据所述N维的关键词向量文件，计算该关键词的词向量，计算该搜索词中的每个词项的词向量；计算该关键词的词向量与每个词项的词向量之间的余弦相似度，作为该关键词与相应词项的语义关系值；将该关键词与各词项的语义关系值之和作为该关键词与该搜索词之间的语义关系值。Wherein, calculating the semantic relationship value between each keyword in the tag system of the first stage of the search term and the search term includes: according to the search terms in each query session, obtaining a set of search term sequences corresponding to multiple query sessions ; The search word sequence set is trained to obtain an N-dimensional keyword vector file; according to the N-dimensional keyword vector file, calculate the word vector of the keyword, and calculate the word of each term in the search word Vector; calculate the cosine similarity between the word vector of the keyword and the word vector of each term, as the semantic relationship value of the keyword and the corresponding term; the relationship between the keyword and the semantic relationship value of each term and as the semantic relationship value between the keyword and the search term.

For example, the semantic relationship between each tag word and the query is calculated using the trained word vector file term_w2v_300.dict. The cosine similarity between the tag's word vector and the word vector of each word in the query is computed, and these similarities are summed. The larger the sum, the more important the tag. The sum is combined with the LDA weight, and the tags are then re-sorted in descending order.
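The re-ranking described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the in-memory `vectors` dictionary (standing in for a loaded term_w2v_300.dict), and the toy two-dimensional vectors are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rerank_by_semantics(query_terms, tag_probs, vectors, top_n):
    """Multiply each tag's LDA probability by its summed cosine similarity
    to the query terms, then keep the top_n tags by revised probability."""
    revised = {}
    for tag, prob in tag_probs.items():
        # Terms or tags missing from the vector file contribute nothing.
        relation = sum(
            cosine(vectors[tag], vectors[term])
            for term in query_terms
            if tag in vectors and term in vectors
        )
        revised[tag] = prob * relation
    ranked = sorted(revised.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

# Hypothetical toy data: 2-dimensional vectors instead of 300-dimensional ones.
vectors = {"wechat": [1.0, 0.0], "chat": [1.0, 0.0],
           "social": [0.8, 0.6], "game": [0.0, 1.0]}
tag_probs = {"game": 0.5, "chat": 0.4, "social": 0.3}
result = rerank_by_semantics(["wechat"], tag_probs, vectors, top_n=2)
```

In this toy case the tag with the highest raw LDA probability falls out of the top two because its vector is orthogonal to the query term, which is the effect the correction is meant to have.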

Specifically, training on the set of search term sequences to obtain the N-dimensional keyword vector file includes: performing word segmentation on the set of search term sequences, and training on the segmented set with the deep learning toolkit word2vec to generate the N-dimensional keyword vector file.

For example, Chinese word segmentation is performed on the set of search term sequences described above, and the result is trained with the deep learning toolkit word2vec to generate 300-dimensional vectors. This produces a second word vector file, term_w2v_300.dict, which is the keyword vector file.

Still further, in one embodiment of the present invention, calculating the tag system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result further includes: taking the top sixth-preset-threshold number of keywords selected for each search term as the second-stage tag system of that search term; for the second-stage tag system of each search term, computing the TF-IDF value of each keyword in that second-stage tag system within the training corpus of the search term; for each keyword, taking the product of the keyword's probability with respect to the search term and the TF-IDF value as the keyword's secondary revised probability with respect to the search term; and sorting the keywords in the second-stage tag system in descending order of secondary revised probability and selecting the top K keywords to form the tag system of the search term.

For example, each tag is weighted by its tf-idf weight in the query's expanded corpus, the weights are normalized, and the tags are reordered accordingly.

After these two corrections, the accuracy of the tag ordering that expresses the query intent improves substantially.
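The TF-IDF correction can be illustrated with a short sketch. It assumes the TF-IDF values have already been computed for the query's corpus; the function name, the dictionaries, and the numbers are hypothetical.

```python
def rerank_by_tfidf(tag_probs, tfidf, top_k):
    """Multiply each tag's probability by its TF-IDF value, normalize the
    products so they sum to 1, and keep the top_k tags."""
    weighted = {tag: prob * tfidf.get(tag, 0.0) for tag, prob in tag_probs.items()}
    total = sum(weighted.values())
    if total > 0:
        weighted = {tag: w / total for tag, w in weighted.items()}
    ranked = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]

# Hypothetical inputs: probabilities after the first correction, and
# TF-IDF values of the same tags in the query's training corpus.
tag_probs = {"chat": 0.4, "social": 0.24, "free": 0.3}
tfidf = {"chat": 2.0, "social": 3.0, "free": 0.1}
ranked = rerank_by_tfidf(tag_probs, tfidf, top_k=3)
```

A generic word such as "free" keeps a sizeable raw probability but a low TF-IDF value, so it drops to the end of the list after weighting.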

In one embodiment of the present invention, selecting the top K keywords to form the tag system of the search term includes: obtaining, from the query session log of the application search engine, the number of times the search term was queried within a preset time period; and selecting the top K keywords according to that number of queries to form the tag system of the search term, where K is a piecewise linear function of the number of queries for the search term.

This step determines the number of tags for each query and retains the top k tag words, where k is a piecewise linear function of the number of times the query was searched. Each query retains between 2 and 5 tags, with an accuracy of 88% and a recall of 75%. This step generates a query intent dictionary, query_intent_tag.txt.
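A piecewise linear choice of k might look like the sketch below. The text only states that k is a piecewise linear function of the search count and that 2 to 5 tags are kept; the knot positions, the rounding rule, and the assumption that k grows with the search count are all illustrative guesses.

```python
def tag_count(search_count, knots=((100, 2), (10_000, 5))):
    """Piecewise linear interpolation between (search_count, k) knots,
    clamped at both ends and truncated to an integer.
    The default knots are hypothetical, not values from the patent."""
    x0, y0 = knots[0]
    if search_count <= x0:
        return y0
    for x1, y1 in knots[1:]:
        if search_count <= x1:
            frac = (search_count - x0) / (x1 - x0)
            return int(y0 + frac * (y1 - y0))
        x0, y0 = x1, y1
    return y0  # beyond the last knot: keep the last k

counts = [tag_count(n) for n in (50, 5_050, 10_000, 1_000_000)]
```

Adding more knots to `knots` gives a function with more segments without changing the code.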

Further, in a specific example, this solution labels about 2.6 million queries with tag words that express user intent, treating each query as a whole. When a user rewrites a query into a synonymous form, the new query may not be in the query intent dictionary. In that case, the semantic similarity between the new query and the queries in the dictionary is calculated, and the intent tags of semantically similar queries are assigned to the new query. The calculation is as follows: the word vectors of the words in the new query are summed to form the new query vector; the Euclidean distance between this vector and the query vectors in the query intent dictionary is calculated; the 3 nearest-neighbor queries are selected, and a KdTree may be used to reduce the computational complexity; the Euclidean distance, smoothed with a Gaussian kernel, is used as the weight of the tag words; and the intent tag words of the 3 neighboring queries are combined to generate the intent tag words of the new query. Retaining the top 3 tags is sufficient to satisfy the user's search intent, with an accuracy of 80%.
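The nearest-neighbor tag transfer can be sketched as follows. This is an illustrative version under several assumptions: a brute-force scan replaces the KdTree, the kernel width `sigma` is a made-up parameter, and the dictionary layout and function names are hypothetical.

```python
import math

def query_vector(terms, term_vectors, dim):
    """Sum the word vectors of the query's terms; unknown terms are skipped."""
    vec = [0.0] * dim
    for term in terms:
        for i, x in enumerate(term_vectors.get(term, [0.0] * dim)):
            vec[i] += x
    return vec

def infer_tags(new_terms, term_vectors, intent_dict, dim,
               neighbors=3, top_tags=3, sigma=1.0):
    """intent_dict maps a known query to (query_vector, {tag: weight}).
    Tags of the nearest known queries are combined, each weighted by a
    Gaussian kernel of the Euclidean distance to the new query."""
    target = query_vector(new_terms, term_vectors, dim)
    nearest = sorted(
        ((math.dist(target, vec), tags) for vec, tags in intent_dict.values()),
        key=lambda pair: pair[0],
    )[:neighbors]
    scores = {}
    for distance, tags in nearest:
        kernel = math.exp(-(distance ** 2) / (2 * sigma ** 2))
        for tag, weight in tags.items():
            scores[tag] = scores.get(tag, 0.0) + kernel * weight
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [tag for tag, _ in ranked[:top_tags]]

# Hypothetical toy data with 2-dimensional vectors.
term_vectors = {"video": [1.0, 0.0], "chat": [0.0, 1.0], "call": [0.0, 0.9]}
intent_dict = {
    "video chat": ([1.0, 1.0], {"chat": 0.6, "video": 0.4}),
    "chat": ([0.0, 1.0], {"chat": 0.9, "social": 0.1}),
    "racing game": ([5.0, -5.0], {"game": 1.0}),
}
tags = infer_tags(["video", "call"], term_vectors, intent_dict, dim=2)
```

The Gaussian kernel turns a distance into a weight that shrinks quickly as the neighbor gets farther away, so a distant third neighbor contributes almost nothing to the combined tag list.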

Fig. 2 shows a flowchart of an application search method according to an embodiment of the present invention. The method includes:

Step 210: constructing a search term tag database, which includes the tag systems of a plurality of search terms.

Step 220: receiving the current search term uploaded by a client, and obtaining the tag system of the current search term according to the search term tag database.

Step 230: calculating the degree of association between the tag system of the current search term and the tag system of each application.

Step 240: when the degree of association between the tag system of the current search term and the tag system of an application meets a preset condition, returning the relevant information of that application to the client for display.

In the process of constructing the search term tag database in step 210, the tag systems of the search terms are mined in the same way as in any embodiment of the method shown in Fig. 1.
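Steps 220 to 240 can be sketched as below. The text does not define the degree of association or the preset condition, so this sketch assumes a weighted tag overlap and a minimum-score threshold; the database layouts, names, and numbers are likewise hypothetical.

```python
def association(query_tags, app_tags):
    """Assumed association measure: sum of weight products over shared tags."""
    return sum(w * app_tags[tag] for tag, w in query_tags.items() if tag in app_tags)

def search_apps(current_query, query_tag_db, app_tag_db, min_score=0.2):
    """Look up the query's tag system, score every application against it,
    and return the applications whose score meets the threshold, best first."""
    query_tags = query_tag_db.get(current_query, {})
    hits = []
    for app, app_tags in app_tag_db.items():
        score = association(query_tags, app_tags)
        if score >= min_score:
            hits.append((app, score))
    return sorted(hits, key=lambda kv: kv[1], reverse=True)

# Hypothetical tag databases.
query_tag_db = {"video chat": {"chat": 0.6, "video": 0.4}}
app_tag_db = {
    "AppA": {"chat": 0.7, "social": 0.3},
    "AppB": {"game": 1.0},
    "AppC": {"video": 0.5, "chat": 0.5},
}
hits = search_apps("video chat", query_tag_db, app_tag_db)
```

Because queries and applications share one tag vocabulary, relevance reduces to comparing two small weighted tag sets, which is the point of mapping both into the same tag system.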

In one embodiment of the present invention, obtaining the tag system of the current search term according to the search term tag database includes: calculating the semantic similarity between the current search term and each search term in the search term tag database, sorting in descending order of semantic similarity, and selecting the top first-preset-threshold number of search terms; and obtaining the tag system of the current search term according to the tag systems of the selected search terms.

In one embodiment of the present invention, calculating the semantic similarity between the current search term and each search term in the search term tag database includes: calculating the Euclidean distance between the current search term and each search term in the search term tag database, and taking the Euclidean distance between each search term and the current search term as the semantic similarity corresponding to that search term. Obtaining the tag system of the current search term according to the tag systems of the selected search terms includes: taking the semantic similarity corresponding to each search term as the weight of each tag in that search term's tag system; for the tags in the tag systems of the selected search terms, adding the weights of identical tags to obtain the final weight of each tag; and sorting in descending order of final weight and selecting the top second-preset-threshold number of tags to form the tag system of the current search term.

Table 4 shows the intent tag words of some search terms in 360 Mobile Assistant search.

Table 4

Fig. 3 shows a device for identifying application search intent according to an embodiment of the present invention. The device 300 for identifying application search intent includes:

an obtaining unit 310, adapted to obtain the search terms in each query session from the query session log of an application search engine;

a mining unit 320, adapted to mine the tag system of each search term according to the search terms in each query session and a preset strategy; and

an identification unit 330, adapted to identify, according to the tag system of each search term, the application search intent corresponding to that search term.

In one embodiment of the present invention, the mining unit 320 is adapted to obtain a training corpus set according to the search terms in each query session; input the training corpus set into an LDA model for training, and obtain the search term-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model; and calculate the tag system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result.

In one embodiment of the present invention, the mining unit 320 is adapted to obtain the original corpus of each search term according to the search terms in each query session, where the original corpora of the search terms constitute an original corpus set; and preprocess the original corpus set to obtain the training corpus set.

Specifically, in one embodiment of the present invention, the mining unit 320 is adapted to obtain, according to the search terms in each query session, a set of search term sequences corresponding to multiple query sessions, and a set of search terms corresponding to the multiple query sessions; train on the set of search term sequences to obtain an N-dimensional search term vector file; for each search term in the search term set, calculate the degree of association between that search term and each of the other search terms according to the N-dimensional search term vector file; and take the other search terms whose degree of association with the search term meets a preset condition as the original corpus of the search term.

That is, the mining unit 320 is adapted to, for each query session, arrange the search terms in the query session into a sequence in order; if a search term in the sequence corresponds to an application download operation, insert the name of the downloaded application immediately after that search term in the sequence, thereby obtaining the search term sequence corresponding to the query session; and take the set of search terms in the multiple query sessions as the search term set corresponding to the multiple query sessions.
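The construction of search term sequences can be sketched as follows. The session representation, a list of (search term, downloaded application name or None) pairs in query order, is an assumption made for illustration, as are the function names.

```python
def build_sequence(session):
    """Turn one query session into a search term sequence, inserting the name
    of a downloaded application right after the search term that led to it."""
    sequence = []
    for term, downloaded_app in session:
        sequence.append(term)
        if downloaded_app:
            sequence.append(downloaded_app)
    return sequence

def build_corpus(sessions):
    """Return the set of search term sequences (one per session) and the
    set of distinct search terms across all sessions."""
    sequences = [build_sequence(session) for session in sessions]
    terms = {term for session in sessions for term, _ in session}
    return sequences, terms

# Hypothetical sessions.
sessions = [
    [("chat", None), ("video chat", "AppA")],
    [("racing", "AppB")],
]
sequences, terms = build_corpus(sessions)
```

Each resulting sequence can then be treated as one sentence whose words are whole search terms and application names when training the search term vectors.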

For example, the mining unit 320 is adapted to treat each search term in the set of search term sequences as one word, and train on the set of search term sequences with the deep learning toolkit word2vec to generate the N-dimensional search term vector file.

On this basis, in one embodiment of the present invention, the mining unit 320 is adapted to apply the KNN algorithm to the search term set and the N-dimensional search term vector file, calculating the distance between every two search terms in the search term set according to the N-dimensional search term vector file; and, for each search term in the search term set, sort the other search terms in descending order of their distance to that search term and select the top first-preset-threshold number of search terms as the original corpus of that search term.

For preprocessing, in one embodiment of the present invention, the mining unit 320 is adapted to, for each original corpus in the original corpus set, perform word segmentation on the original corpus to obtain a segmentation result containing multiple terms; find phrases formed by adjacent terms in the segmentation result; and retain the phrases, together with the terms in the segmentation result that are nouns or verbs, as the keywords retained for that original corpus.

Specifically, the mining unit 320 is adapted to calculate the cPMId value of every two adjacent terms in the segmentation result, and to determine that two adjacent terms form a phrase when their cPMId value is greater than a second preset threshold.

Further, in one embodiment of the present invention, the mining unit 320 is also adapted to take the keywords retained for the original corpus of each search term as the first-stage training corpus of that search term, where the first-stage training corpora of the search terms constitute a first-stage training corpus set; and perform data cleaning on the keywords in the first-stage training corpus set.

Specifically, in one embodiment of the present invention, the mining unit 320 is adapted to, for the first-stage training corpus of each search term in the first-stage training corpus set, calculate the TF-IDF value of each keyword in that first-stage training corpus; and delete the keywords whose TF-IDF value is higher than a third preset threshold and/or lower than a fourth preset threshold, to obtain the training corpus of the search term. The training corpora of the search terms constitute the training corpus set.
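The data cleaning step can be sketched as below. The sketch assumes a common TF-IDF definition (relative term frequency times the natural log of inverse document frequency, with each search term's corpus treated as one document) and hypothetical thresholds; the text does not fix either.

```python
import math
from collections import Counter

def clean_corpus(first_stage, high, low):
    """first_stage maps a search term to its list of retained keywords.
    Keywords whose TF-IDF is above `high` or below `low` are removed."""
    n_docs = len(first_stage)
    doc_freq = Counter()
    for keywords in first_stage.values():
        doc_freq.update(set(keywords))
    cleaned = {}
    for term, keywords in first_stage.items():
        counts = Counter(keywords)
        kept = []
        for kw in keywords:
            tf = counts[kw] / len(keywords)
            idf = math.log(n_docs / doc_freq[kw])
            if low <= tf * idf <= high:
                kept.append(kw)
        cleaned[term] = kept
    return cleaned

# Hypothetical first-stage corpora for three search terms.
first_stage = {
    "q1": ["app", "chat", "chat", "video"],
    "q2": ["app", "game", "race"],
    "q3": ["app", "music"],
}
cleaned = clean_corpus(first_stage, high=0.5, low=0.1)
```

The lower bound removes keywords such as "app" that occur in every corpus and carry no intent, while the upper bound removes keywords that dominate a single corpus.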

In one embodiment of the present invention, the mining unit 320 is adapted to calculate a search term-keyword probability distribution result according to the search term-topic probability distribution result and the topic-keyword probability distribution result; and, according to the search term-keyword probability distribution result, for each search term, sort the keywords in descending order of their probability with respect to that search term and select the top fifth-preset-threshold number of keywords.

In one embodiment of the present invention, the mining unit 320 is adapted to, for each search term, obtain the probability of each topic with respect to that search term according to the search term-topic probability distribution result; for each topic, obtain the probability of each keyword with respect to that topic according to the topic-keyword probability distribution result; then, for each keyword, take the product of the keyword's probability with respect to a topic and that topic's probability with respect to a search term as the keyword's topic-based probability with respect to the search term; and take the sum of the keyword's topic-based probabilities over all topics as the keyword's probability with respect to the search term.
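The combination described above amounts to p(keyword | search term) = Σ over topics of p(keyword | topic) · p(topic | search term). A minimal sketch, with hypothetical names and toy distributions standing in for the LDA output:

```python
def keyword_probs(topic_given_query, keyword_given_topic):
    """Sum p(keyword | topic) * p(topic | query) over all topics."""
    probs = {}
    for topic, p_topic in topic_given_query.items():
        for keyword, p_kw in keyword_given_topic.get(topic, {}).items():
            probs[keyword] = probs.get(keyword, 0.0) + p_topic * p_kw
    return probs

def top_keywords(probs, n):
    """Keywords sorted by descending probability, truncated to n."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return [kw for kw, _ in ranked[:n]]

# Toy distributions for one search term and two topics.
topic_given_query = {"t1": 0.7, "t2": 0.3}
keyword_given_topic = {
    "t1": {"chat": 0.5, "video": 0.5},
    "t2": {"chat": 0.2, "game": 0.8},
}
probs = keyword_probs(topic_given_query, keyword_given_topic)
first_stage = top_keywords(probs, n=2)
```

Here "chat" receives contributions from both topics (0.7 × 0.5 + 0.3 × 0.2), which is why a keyword shared across topics can outrank one that is strong in a single topic.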

Further, in one embodiment of the present invention, the mining unit 320 is also adapted to take the top fifth-preset-threshold number of keywords selected for each search term as the first-stage tag system of that search term; for the first-stage tag system of each search term, calculate the semantic relationship value between each keyword in that first-stage tag system and the search term; for each keyword, take the product of the keyword's semantic relationship value and the keyword's probability with respect to the search term as the keyword's revised probability with respect to the search term; and sort the keywords in the first-stage tag system in descending order of revised probability and select the top sixth-preset-threshold number of keywords to form the tag system of the search term.

In one embodiment of the present invention, the mining unit 320 is adapted to obtain, according to the search terms in each query session, a set of search term sequences corresponding to multiple query sessions; train on the set of search term sequences to obtain an N-dimensional keyword vector file; according to the N-dimensional keyword vector file, calculate the word vector of the keyword and the word vector of each term in the search term; calculate the cosine similarity between the word vector of the keyword and the word vector of each term as the semantic relationship value between the keyword and that term; and take the sum of the semantic relationship values between the keyword and all of the terms as the semantic relationship value between the keyword and the search term.

In one embodiment of the present invention, the mining unit 320 is adapted to perform word segmentation on the set of search term sequences, and train on the segmented set with the deep learning toolkit word2vec to generate the N-dimensional keyword vector file.

Further, in one embodiment of the present invention, the mining unit 320 is also adapted to take the top sixth-preset-threshold number of keywords selected for each search term as the second-stage tag system of that search term; for the second-stage tag system of each search term, compute the TF-IDF value of each keyword in that second-stage tag system within the training corpus of the search term; for each keyword, take the product of the keyword's probability with respect to the search term and the TF-IDF value as the keyword's secondary revised probability with respect to the search term; and sort the keywords in the second-stage tag system in descending order of secondary revised probability and select the top K keywords to form the tag system of the search term.

In one embodiment of the present invention, the mining unit 320 is adapted to obtain, from the query session log of the application search engine, the number of times the search term was queried within a preset time period; and select the top K keywords according to that number of queries to form the tag system of the search term, where K is a piecewise linear function of the number of queries for the search term.

Fig. 4 shows an application search server according to an embodiment of the present invention. The application search server 400 includes:

a database construction unit 410, adapted to construct a search term tag database, which includes the tag systems of a plurality of search terms;

an interaction unit 420, adapted to receive the current search term uploaded by a client; and

a search processing unit 430, adapted to obtain the tag system of the current search term according to the search term tag database, and calculate the degree of association between the tag system of the current search term and the tag system of each application.

The interaction unit 420 is further adapted to return the relevant information of an application to the client for display when the degree of association between the tag system of the current search term and the tag system of that application meets a preset condition.

The scheme by which the database construction unit 410 mines the tag systems of search terms while constructing the search term tag database is the same as the scheme by which the device 300 for identifying application search intent, according to any of the above embodiments of the present invention, mines the tag systems of search terms.

In one embodiment of the present invention, the search processing unit 430 is adapted to calculate the semantic similarity between the current search term and each search term in the search term tag database, sort in descending order of semantic similarity, and select the top first-preset-threshold number of search terms; and obtain the tag system of the current search term according to the tag systems of the selected search terms.

In one embodiment of the present invention, the search processing unit 430 is adapted to calculate the Euclidean distance between the current search term and each search term in the search term tag database, and take the Euclidean distance between each search term and the current search term as the semantic similarity corresponding to that search term; take the semantic similarity corresponding to each search term as the weight of each tag in that search term's tag system; for the tags in the tag systems of the selected search terms, add the weights of identical tags to obtain the final weight of each tag; and sort in descending order of final weight and select the top second-preset-threshold number of tags to form the tag system of the current search term.

It should be noted that the embodiments of the devices shown in Figs. 3-4 correspond to the embodiments of the methods shown in Figs. 1-2, which have been described in detail above and are not repeated here.

In summary, the method and device for identifying application search intent, the application search method, and the server of the present invention propose a user intent identification method that matches the app tag system, namely the tag method, which flexibly expresses the user's fine-grained query intent. A tag system of user intent is built with unsupervised machine learning, which abandons the traditional user intent classification approach and implements an automated user intent mining process that can generate user intent tag lists with high accuracy and recall. User intent and apps can be mapped into the same tag system, which solves both the user intent identification problem and the relevance calculation problem of the application search engine, and lays the foundation for function search, a core technology of application search engines.

It should be noted that:

The algorithms and displays provided herein are not inherently related to any particular computer, virtual apparatus, or other device. Various general-purpose devices may also be used with the teachings herein. The structure required to construct such devices is apparent from the above description. Furthermore, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described herein can be implemented in various programming languages, and the above description of a specific language is given to disclose the best mode of the present invention.

Numerous specific details are set forth in the description provided herein. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline this disclosure and to aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. Modules, units, or components in the embodiments may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.

Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for identifying application search intent and of the application search server according to embodiments of the present invention. The present invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order. These words may be interpreted as names.

Claims

1. A method for identifying application search intent, comprising:

obtaining the search terms in each query session from the query session log of an application search engine;

mining a tag system for each search term according to the search terms in each query session and a preset strategy;

identifying, according to the tag system of each search term, the application search intent corresponding to that search term.

2. The method according to claim 1, wherein mining the tag system for each search term according to the search terms in each query session and the preset strategy comprises:

obtaining a training corpus set according to the search terms in each query session;

inputting the training corpus set into an LDA model for training, and obtaining the search term-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model;

calculating the tag system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result.
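The final calculation step of claim 2 can be illustrated with a minimal sketch. The claim does not fix a formula, so the scoring rule below (summing P(topic | search term) × P(keyword | topic) over topics and keeping the top-ranked keywords) is an assumption, as are the function name `label_system`, the topic names, and all probability values; the two dictionaries merely stand in for the LDA model's outputs.

```python
from collections import defaultdict


def label_system(term_topics, topic_keywords, top_k=3):
    """Rank candidate tags for one search term.

    term_topics:    {topic: P(topic | search term)}, assumed LDA output
    topic_keywords: {topic: {keyword: P(keyword | topic)}}, assumed LDA output
    Scoring rule (an assumption, not taken from the claims):
        score(keyword) = sum over topics of P(topic | term) * P(keyword | topic)
    """
    scores = defaultdict(float)
    for topic, p_topic in term_topics.items():
        for keyword, p_keyword in topic_keywords.get(topic, {}).items():
            scores[keyword] += p_topic * p_keyword
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical distributions for a single search term.
term_topics = {"t_video": 0.8, "t_music": 0.2}
topic_keywords = {
    "t_video": {"video": 0.5, "player": 0.3, "movie": 0.2},
    "t_music": {"music": 0.6, "player": 0.4},
}
tags = label_system(term_topics, topic_keywords, top_k=3)
```

Here `tags` is a list of `(keyword, score)` pairs that could serve as the tag system of the search term; a keyword shared by several topics, such as "player", accumulates weight from each of them.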

3. The method according to claim 1 or 2, wherein obtaining the training corpus set according to the search terms in each query session comprises:

obtaining the original corpus of each search term according to the search terms in each query session;

forming an original corpus set from the original corpora of the search terms, and preprocessing the original corpus set to obtain the training corpus set.

4. The method according to any one of claims 1-3, wherein obtaining the original corpus of each search term according to the search terms in each query session comprises:

obtaining, according to the search terms in each query session, a search term sequence set corresponding to multiple query sessions, and obtaining a search term set corresponding to the multiple query sessions;

training on the search term sequence set to obtain an N-dimensional search term vector file;

for each search term in the search term set, calculating the degree of association between that search term and the other search terms according to the N-dimensional search term vector file, and using the other search terms whose degree of association with that search term meets a preset condition as the original corpus of that search term.
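The last step of claim 4 can be sketched as follows. This is an illustration under stated assumptions only: the claim does not name the association measure or the preset condition, so cosine similarity and a fixed threshold are used here as stand-ins, and the function names, the threshold value, and the three-dimensional vectors are hypothetical (a real vector file would hold N-dimensional vectors trained on the search term sequence set).

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def original_corpus(term, vectors, threshold=0.8):
    """Return the other search terms whose association with `term` meets the threshold.

    vectors: {search term: vector}, standing in for the N-dimensional vector file.
    The threshold plays the role of the 'preset condition' (an assumed form).
    """
    target = vectors[term]
    scored = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    kept = [(other, score) for other, score in scored if score >= threshold]
    # Most strongly associated terms first; ties broken alphabetically.
    return [other for other, _ in sorted(kept, key=lambda x: (-x[1], x[0]))]


# Hypothetical 3-dimensional vectors for three search terms.
vectors = {
    "video player": [0.9, 0.1, 0.0],
    "movie app": [0.8, 0.2, 0.1],
    "music player": [0.1, 0.9, 0.1],
}
related = original_corpus("video player", vectors)
```

With these made-up vectors, only "movie app" is close enough to "video player" to enter its original corpus; lowering the threshold would admit more loosely related terms.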

5. An application search method, comprising:

constructing a search term tag database, the search term tag database including the tag systems of multiple search terms;

receiving the current search term uploaded by a client, and obtaining the tag system of the current search term according to the search term tag database;

calculating the degree of association between the tag system of the current search term and the tag system of each application;

when the degree of association between the tag system of the current search term and the tag system of an application meets a preset condition, returning relevant information of that application to the client for display;

wherein the search term tag database is constructed by the method according to any one of claims 1-4.
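The lookup-and-match flow of claim 5 can be sketched as follows. The claim leaves the association measure and the preset condition open, so this sketch assumes each tag system is a mapping from tag to weight, uses cosine similarity between the two mappings, and applies a fixed threshold; the names `association` and `search`, the application names, and all weights are hypothetical.

```python
import math


def association(tags_a, tags_b):
    """Cosine similarity between two {tag: weight} mappings (an assumed measure)."""
    dot = sum(w * tags_b.get(tag, 0.0) for tag, w in tags_a.items())
    norm_a = math.sqrt(sum(w * w for w in tags_a.values()))
    norm_b = math.sqrt(sum(w * w for w in tags_b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)


def search(current_term, term_tag_db, app_tags, threshold=0.5):
    """Return the applications whose tag system matches the current search term.

    term_tag_db: {search term: {tag: weight}}, the search term tag database
    app_tags:    {application: {tag: weight}}, the tag system of each application
    """
    query_tags = term_tag_db.get(current_term)
    if query_tags is None:
        return []  # search term not covered by the tag database
    hits = [(app, association(query_tags, tags)) for app, tags in app_tags.items()]
    hits = [(app, score) for app, score in hits if score >= threshold]
    # Best match first; ties broken alphabetically.
    return [app for app, _ in sorted(hits, key=lambda x: (-x[1], x[0]))]


# Hypothetical database contents.
term_tag_db = {"watch movies": {"video": 0.6, "movie": 0.4}}
app_tags = {
    "VideoApp": {"video": 0.7, "player": 0.3},
    "MusicApp": {"music": 0.9, "player": 0.1},
}
results = search("watch movies", term_tag_db, app_tags)
```

Because search terms and applications are mapped into the same tag vocabulary, matching reduces to comparing two weighted tag sets; in this made-up example only "VideoApp" shares a tag with the query and is returned.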

6. A device for identifying application search intent, comprising:

an acquisition unit adapted to obtain the search terms in each query session from the query session log of an application search engine;

a mining unit adapted to mine a tag system for each search term according to the search terms in each query session and a preset strategy;

an identification unit adapted to identify, according to the tag system of each search term, the application search intent corresponding to that search term.

7. The device according to claim 6, wherein

the mining unit is adapted to obtain a training corpus set according to the search terms in each query session; input the training corpus set into an LDA model for training, and obtain the search term-topic probability distribution result and the topic-keyword probability distribution result output by the LDA model; and calculate the tag system of each search term according to the search term-topic probability distribution result and the topic-keyword probability distribution result.

8. The device according to claim 6 or 7, wherein

the mining unit is adapted to obtain the original corpus of each search term according to the search terms in each query session; form an original corpus set from the original corpora of the search terms; and preprocess the original corpus set to obtain the training corpus set.

9. The device according to any one of claims 6-8, wherein the mining unit is adapted to obtain, according to the search terms in each query session, a search term sequence set corresponding to multiple query sessions, and to obtain a search term set corresponding to the multiple query sessions; train on the search term sequence set to obtain an N-dimensional search term vector file; for each search term in the search term set, calculate the degree of association between that search term and the other search terms according to the N-dimensional search term vector file; and use the other search terms whose degree of association with that search term meets a preset condition as the original corpus of that search term.

10. An application search server, comprising:

a database construction unit adapted to construct a search term tag database, the search term tag database including the tag systems of multiple search terms;

an interaction unit adapted to receive the current search term uploaded by a client;

a search processing unit adapted to obtain the tag system of the current search term according to the search term tag database, and to calculate the degree of association between the tag system of the current search term and the tag system of each application;

the interaction unit being further adapted to return relevant information of an application to the client for display when the degree of association between the tag system of the current search term and the tag system of that application meets a preset condition;

wherein the database construction unit constructs the search term tag database by the same process as the device for identifying application search intent according to any one of claims 6-9.